Russian speech recognition programs. Overview of voice recognition technologies and how to use them

Phonograms recorded using digital voice recorders "Gnome R" and "Gnome 2M" meet the requirements for phonograms submitted for phonoscopic examinations and are suitable for identifying individuals by voice and speech...

First Deputy Chief

The Gnome 2M voice recorder has been used repeatedly to record conferences and seminars in difficult acoustic environments; the resulting recordings are of high quality. The built-in noise reduction function improves the playback quality of the recordings...

Leading engineer of IPK BNTU

Institute for Advanced Studies and Retraining of Personnel BNTU

During its service life, “Gnome R” has proven itself well: high-quality recording in a device of minimal dimensions, long recording duration, and prompt transfer of accumulated information from the recorder's built-in memory to a PC...

Senior officer of the 3rd department of the seventh directorate

General Staff of the Armed Forces of the Republic of Belarus

Phonograms recorded using the Forget-Me-Not II system meet the requirements for multi-channel digital systems for recording voice messages over telephone communication channels, and are suitable for identifying a person by voice and speech...

Head of the center

State Forensic Expertise Center

An unlimited number of notified subscribers and a large number of simultaneously processed tasks make "Rupor" an indispensable assistant in the work of the employees of the credit department of branch No. 524 of OJSC "JSSB Belarusbank"...

Deputy Director – Head of Retail Business Center

Branch No. 524 of JSC "ASB Belarusbank"

The Rupor automatic notification system operated over analog telephone lines and was tested for notifying personnel. The system served 100 subscribers, worked stably, and did not require constant maintenance...

Acting Military Commissioner

Military Commissariat of Minsk

The Forget-Me-Not II recording system receives voice messages from residents, records them on a computer in high quality, and makes it possible to listen to recorded messages and enter the information into a text database. The "Rupor" notification system automatically notifies debtors...

Head of ACS Department

Unitary Enterprise "ZhREO Sovetsky district of Minsk"

The Rupor system notifies a large number of subscribers in a short time according to the configured parameters, provides a report on the notification, works reliably, and fully meets the requirements placed on it...

Director of Retail Business Department

The mobile speech recording and documentation system “Protocol” includes a digital voice recorder “Gnome 2M” and a computer transcriber “Caesar”. The Gnome 2M voice recorder allows you to obtain high-quality recordings of meetings and sessions, and the Caesar transcriber significantly increases the speed of translating audio information into a text document...

Leading Specialist

Institute of State and Law of the Academy of Sciences of the Republic of Belarus

Identification by voice

In the modern world, there is increasing interest in biometric technologies and biometric personal identification systems, and this interest is quite understandable.

Biometric identification is based on the principle of recognizing and comparing unique characteristics of the human body. The main sources of a person’s biometric characteristics are fingerprints, the iris and retina, voice, face, signature, gait, and so on. These biometric identifiers belong to the person and are an inseparable part of them: they cannot be forgotten, left behind, or lost.

Various characteristics and traits of a person can be used for biometric identification. This article provides a brief overview of how biometric technologies work using the example of a voice recognition system.

The value of voice technology for biometrics has been proven time and time again. However, only high-quality implementations of automatic speaker recognition systems can actually bring such technologies into practice. Such systems already exist: they are used in security systems, banking, e-commerce, and law enforcement.

The use of speaker recognition systems is the most natural and economical way to address unauthorized access to computers and data transmission systems, as well as multi-level access control to network and information resources.

Speaker recognition systems can solve two problems: identifying an individual from a given, limited list of people (speaker identification) or confirming the identity the speaker claims (speaker verification). Voice identification and verification are both active areas of speech processing technology.

Fig. 1 – Speaker recognition

Speech is a signal that arises as a result of transformations that occur at several different levels: semantic, linguistic, articulatory and acoustic. As is known, the source of a speech signal is the vocal tract, which excites sound waves in an elastic air medium. The vocal tract usually refers to the speech-producing organ located above the vocal cords. As can be seen from Figure 2, the vocal tract consists of the hypopharynx, oropharynx, oral cavity, nasopharynx and nasal cavity.


Fig. 2 – Structure of the human vocal tract

The human voice arises when air passes from the lungs through the trachea into the larynx, past the vocal cords, and then into the pharynx, mouth, and nasal cavity. As a sound wave passes through the vocal tract, its frequency spectrum is shaped by the tract's resonances; the resonant frequencies of the vocal tract are called formants. Speaker verification systems typically rely on distinctive features of the speech signal that reflect the individual characteristics of the muscular activity of the speaker's vocal tract.

Let's take a closer look at the speaker verification system. Voice verification is the process of determining whether speakers are who they say they are. A user previously registered in the system pronounces an identifier: a registration number, password word, or phrase. In text-dependent recognition, the password word is known to the system, which “asks” the user to pronounce it: the word is displayed on the screen and the person speaks it into the microphone. In text-independent recognition, the spoken password does not have to coincide with a reference word; the user can say an arbitrary word or phrase. The verification system receives the speech signal, processes it, and decides whether to accept or reject the identifier presented by the user. The system can also inform users that their voice does not match the stored reference and ask for additional information before making a final decision.


Fig. 3 – Human interaction with the system

The diagram of a person’s interaction with the voice-based identity verification system is shown in Figure 3. The user speaks into the microphone the number offered by the system, so that the system can check whether the voice matches the reference stored in its database. Typically there is a trade-off between recognition accuracy and speech sample size: the longer the speech sample, the higher the accuracy. In addition to the voice, echoes and extraneous noise may enter the microphone.

There are a number of factors that can contribute to verification and identification errors, for example:

  • incorrect pronunciation or reading of a password word or phrase;
  • the emotional state of the speaker (stress, pronouncing a passphrase under duress, etc.);
  • difficult acoustic environment (noise, interference, radio waves, etc.);
  • different communication channels (use of different microphones during speaker registration and verification);
  • colds;
  • natural voice changes.

Some of these can be eliminated, for example by using better microphones.

The process of identity verification by voice consists of five stages: receiving the speech signal; parameterization, i.e. extracting the distinctive features of the voice; comparing the resulting voice sample with a previously stored reference; making an accept/reject decision; and training, i.e. updating the reference model. The verification scheme is presented in Figure 4.


Fig. 4 – Verification scheme
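
To make these stages concrete, here is a minimal Python sketch of the verification loop. It is an illustration under stated assumptions only: the toy log-spectrum features, cosine score, 0.7 threshold, and all function names are invented for this example and do not come from any system described in this article.

# A minimal sketch of the five verification stages, assuming toy
# log-spectrum features and a cosine similarity score. All names and
# the 0.7 threshold are invented for this illustration.
import numpy as np

def extract_features(signal: np.ndarray) -> np.ndarray:
    """Stage 2: parameterization - reduce the raw signal to a feature vector."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    bands = np.array_split(np.log(spectrum + 1e-10), 20)
    return np.array([band.mean() for band in bands])

def similarity(sample: np.ndarray, reference: np.ndarray) -> float:
    """Stage 3: compare the voice sample with the stored reference."""
    return float(np.dot(sample, reference) /
                 (np.linalg.norm(sample) * np.linalg.norm(reference)))

def verify(signal: np.ndarray, reference: np.ndarray, threshold: float = 0.7):
    """Stages 1-5: receive the signal, parameterize, compare, decide, update."""
    sample = extract_features(signal)            # stage 2
    score = similarity(sample, reference)        # stage 3
    accepted = score >= threshold                # stage 4: accept/reject decision
    if accepted:                                 # stage 5: update the reference model
        reference = 0.9 * reference + 0.1 * sample
    return accepted, score, reference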

During registration, a new user enters an ID and then says a key word or phrase several times, thereby creating the reference templates. The number of repetitions of the key phrase can vary per user or be fixed for everyone.

For a computer to process a speech signal, the sound wave is first converted into an analog electrical signal and then digitized.

At the feature extraction stage, the speech signal is divided into separate audio frames, which are then converted into a digital model. These models are called “voiceprints.” The newly obtained voiceprint is compared with the previously stored reference. For recognizing the speaker's identity, the most important features are the most distinctive characteristics of the voice, those that allow the system to reliably recognize each specific user's voice.

Finally, the system decides to admit or deny the user access depending on whether the voice matches the stored reference. If the system incorrectly matches a presented voice to the reference, a “false acceptance” (FA) error occurs. If the system fails to recognize a biometric feature that does correspond to a stored reference, a “false rejection” (FR) error occurs. A false acceptance opens a gap in the security system, while a false rejection reduces usability: the system sometimes fails to recognize a person on the first attempt. Attempting to reduce the probability of one error makes the other more frequent, so, depending on the requirements for the system, a compromise is chosen, i.e. a decision threshold is set.
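
The trade-off between the two error types is easy to see in a few lines of Python. The scores below are synthetic numbers invented for the illustration; sweeping the decision threshold shows FA falling while FR rises.

# Illustrative only: given similarity scores for genuine and impostor
# attempts (synthetic numbers), sweep the decision threshold and
# observe the FA/FR trade-off described above.
import numpy as np

genuine  = np.array([0.82, 0.91, 0.77, 0.88, 0.95, 0.70])   # same speaker
impostor = np.array([0.35, 0.62, 0.48, 0.73, 0.20, 0.55])   # other speakers

for threshold in np.arange(0.4, 0.9, 0.1):
    fa = np.mean(impostor >= threshold)   # false acceptances (FA)
    fr = np.mean(genuine  <  threshold)   # false rejections (FR)
    print(f"threshold={threshold:.1f}  FA={fa:.2f}  FR={fr:.2f}")
# Raising the threshold lowers FA but raises FR, and vice versa;
# the operating point is chosen according to the system's requirements.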

Conclusion

Voice identification methods are also used in practice. The company's voice identification technology makes it possible to organize regulated user access to enterprise resources and telephone and web services using a given passphrase. Using the technology can significantly increase system security while simplifying the user identification process. The Voice Key technology ensures high reliability and stability of the system and helps improve the quality of customer service.



Did you know that voice recognition technology has been around for 50 years? Scientists have been working on the problem for half a century, and only in recent decades have IT companies joined in. The result of the past year of work has been a new level of recognition accuracy and the widespread use of the technology in everyday and professional life.

Technology in life

Every day we use search engines. We are looking for where to have lunch, how to get to a certain place, or trying to find the meaning of an unknown term. Voice recognition technology, which is used, for example, by Google or Yandex.Navigator, helps us spend a minimum of time searching. It's simple and convenient.

In a professional environment, the technology makes work several times simpler. In medicine, for example, a doctor’s speech is converted into the text of a medical record and a prescription right at the appointment, saving the time spent entering patient information into documents. The system built into a car’s on-board computer responds to the driver’s requests, for example by finding the nearest gas station. For people with disabilities, building voice control into the software of household appliances is especially important.

Development of voice recognition systems

The idea of speech recognition has always looked promising, but researchers ran into a problem as early as the stage of recognizing digits and the simplest words. Recognition came down to building an acoustic model: speech was represented as a statistical model and compared with ready-made templates. If the model matched a template, the system decided that the command or digit was recognized. Growth in the dictionaries the system could recognize required ever more computing power.

Charts of the growth in computer performance and the reduction of recognition errors in English speech recognition systems
Sources:
Herb Sutter, “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software”
https://minghsiehee.usc.edu/2017/04/the-machines-are-coming/
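
A classic concrete form of the template comparison described above is dynamic time warping (DTW), the dynamic programming method also mentioned in the history section below. Here is a minimal sketch; the feature sequences and function names are illustrative, not taken from any real recognizer.

# Early command recognizers compared an utterance against stored templates.
# Dynamic time warping (DTW) aligns two feature sequences and returns a
# distance; the command whose template is closest wins. Purely illustrative.
import numpy as np

def dtw_distance(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    """DTW distance between two feature sequences (frames x features)."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])

def recognize(utterance: np.ndarray, templates: dict) -> str:
    """Return the command whose stored template is nearest to the utterance."""
    return min(templates, key=lambda name: dtw_distance(utterance, templates[name]))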



Today, recognition algorithms have been supplemented by language models that describe the structure of a language, for example, a typical sequence of words. The system is trained on real speech material.

A new stage in the development of the technology was the use of neural networks. The recognition system is designed so that each new recognition affects the accuracy of future recognitions: the system learns as it works.


Quality of voice recognition systems

The state of the technology today is expressed by the goal: from speech recognition to speech understanding. The key indicator chosen for this is the recognition error rate. Notably, the same indicator applies when one person recognizes another's speech: we miss some words but take other factors, such as context, into account, which lets us understand speech even without making out every individual word. For humans, the recognition error rate is 5.1%.

Other difficulties in teaching a speech recognition system to understand a language include emotions, unexpected changes in the topic of conversation, the use of slang, and the individual characteristics of the speaker: speech rate, timbre, and pronunciation of sounds.


Global market players

Several global players in the voice recognition platform market are well known: Apple, Google, Microsoft, and IBM. These companies have sufficient resources for research and an extensive base for training their own systems. Google, for example, trains on the millions of search queries that users are happy to ask on their own. On the one hand, this increases recognition accuracy; on the other, it imposes limitations: the system recognizes speech in 15-second segments and relies on a general query profile. Google's recognition error rate is 4.9%; for IBM the figure is 5.5%, and for Microsoft 6.3% as of the end of 2016.

The platform for use in professional fields is being developed by the American company Nuance. Among the areas of application: medicine, law, finance, journalism, construction, security, automotive.

In Russia, the Center for Speech Technologies is the largest maker of professional voice recognition and speech synthesis tools. The company's solutions have been deployed in 67 countries. Its main areas of work are voice biometrics (identification by voice), self-service speech systems (IVR, used in call centers), and speech synthesizers. In the USA the company operates under the SpeechPro brand and conducts research on English speech recognition; its recognition results rank in the top five by error rate.


The Value of Voice Recognition in Marketing

The purpose of marketing is to study market needs and organize business in accordance with them to increase profitability and efficiency. Voice is of interest to marketers in two cases: if the client speaks and if the employee speaks. Therefore, the object of study for marketers and the scope of application of the technology is telephone calls.

Today, telephone conversation analytics is poorly developed. Calls not only need to be recorded, but also listened to, evaluated and only then analyzed. While organizing a recording is easy - any virtual PBX or call tracking service can do this - organizing call listening is more difficult. This problem is solved either by an individual in the company or by the head of the call center. Call listening is also outsourced. In any case, the error in call assessment is a problem that calls into question the results of analytics and the decisions made based on them.

In our modern, eventful world, the speed of working with information is one of the cornerstones of achieving success. Our work performance and productivity, and therefore our immediate material wealth, depend on how quickly we receive, create, and process information. Among the tools that can improve our working capabilities, programs for translating speech into text occupy an important place, allowing us to significantly increase the speed of typing the texts we need. In this material I will tell you what popular programs exist for translating audio voice into text, and what their features are.

Application for translating audio voice into text - system requirements

Most current voice-to-text programs are paid and place a number of requirements on the microphone (when the program is intended for a computer). Working with a microphone built into a webcam or into the body of a standard laptop is strongly discouraged: the quality of speech recognized from such devices is quite low. A quiet environment is also quite important, since extraneous noise directly affects the recognition level.

Moreover, most of these programs are capable of not only transforming speech into text on the computer screen, but also using voice commands to control your computer (launching and closing programs, receiving and sending email, opening and closing websites, and so on).

Speech to text program

Let's move on to a direct description of programs that can help translate speech into text.

Laitis program

The free Russian-language voice recognition program “Laitis” offers good speech recognition quality and, according to its creators, can almost completely replace the user’s usual keyboard. The program also handles voice commands well, allowing you to perform many computer-control actions.

For its operation, the program requires a high-speed Internet connection on the PC (it uses the network voice recognition services of Google and Yandex). The program can also control your browser with voice commands, which requires installing a special Laitis extension (for Chrome, Mozilla Firefox, or Opera) in the browser.
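
The same network-recognition idea can be tried in a few lines of Python using the third-party SpeechRecognition package, which wraps Google's public web speech service. This is a sketch of the approach, not Laitis's own code.

# Sketch of network-based recognition like the services Laitis relies on.
# Requires the third-party package: pip install SpeechRecognition
# (plus PyAudio for microphone access). Not Laitis code - the same idea
# in miniature.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)  # calibrate for background noise
    print("Speak...")
    audio = recognizer.listen(source)

try:
    # The recording is sent to Google's web speech service for recognition.
    print(recognizer.recognize_google(audio, language="ru-RU"))
except sr.UnknownValueError:
    print("Speech was not recognized")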

"Dragon Professional" - transcribing audio recordings into text

At the time of writing, the English-language product “Dragon Professional Individual” is one of the world leaders in the quality of recognized text. The program understands seven languages (so far only the Dragon Anywhere mobile application works with Russian), has high-quality voice recognition, and can execute a number of voice commands. The product is exclusively paid: the main program costs 300 US dollars, and the “home” version, Dragon Home, costs 75 US dollars.

To operate, this product from Nuance Communications requires creating your own profile, which adapts the program to the specifics of your voice. In addition to dictating text directly, you can train the program to perform a number of commands, making interaction with the computer even more natural and convenient.

"RealSpeaker" - ultra-accurate speech recognizer

In addition to the functions standard for programs of this kind, the voice-to-text program “RealSpeaker” can use your PC’s webcam: the program not only analyzes the audio signal but also tracks the movement of the corners of the speaker’s lips, thereby recognizing the spoken words more accurately.


"RealSpeaker" reads not only the audio, but also the visual component of the speech process

The application supports more than ten languages (including Russian), recognizes speech with allowance for accents and dialects, can transcribe audio and video, and provides cloud access, among other things. The program is shareware: the full version is paid.

“Voco” - the program will quickly translate your voice into a text document

Another voice-to-text converter is the paid product “Voco”, whose “home” version currently costs about 1,700 rubles. The more advanced and expensive versions, “Voco.Professional” and “Voco.Enterprise”, offer a number of additional features, one of which is speech recognition from the user’s audio recordings.

Among the features of Voco are its expandable vocabulary (currently more than 85 thousand words) and its ability to work offline, which means it does not depend on an Internet connection.


Among the advantages of Voco is how readily the program can be trained.

The application is activated quite simply: just press the Ctrl key twice.

“Gboard” - voice input in Google's keyboard

To activate voice input in Gboard, just press and hold the spacebar. The application is absolutely free and supports several dozen languages, including Russian.

Conclusion

Above, I listed programs for converting voice audio into text and described their general functionality and characteristic features. Most such products are paid, and the range and quality of Russian-language programs are noticeably inferior to their English-language counterparts. When working with such applications, I recommend paying special attention to your microphone and its settings: this matters in speech recognition, because a bad microphone can negate even the highest-quality software of the kind reviewed here.


    Work on speech recognition dates back to the middle of the last century. The first system was created in the early 1950s: its developers set themselves the task of recognizing digits. The system could identify digits, but only spoken in one voice; an example is the Bell Laboratories “Audrey” system, which worked by locating the formants in the power spectrum of each speech passage. In general terms, the system consisted of three main parts: an analyzer and quantizer, a pattern-matching network, and sensors. It was built, accordingly, from various frequency filters and switches, and the sensors included gas-filled tubes.

    By the end of the decade, systems had emerged that recognized vowels independently of the speaker. In the 1970s, new methods made it possible to achieve better results: dynamic programming and linear prediction (Linear Predictive Coding, LPC). The aforementioned Bell Laboratories created systems using exactly these methods. In the 1980s, the next step in the development of voice recognition systems was the use of Hidden Markov Models (HMM). At this time, the first large voice recognition programs began to appear, such as Kurzweil text-to-speech. In the late 1980s, methods based on artificial neural networks (ANN) also came into use. In 1987, Worlds of Wonder's Julie doll, which was capable of understanding voice, appeared on the market. And 10 years later, Dragon Systems released “NaturallySpeaking 1.0”.

    Reliability

    The main sources of voice recognition errors are:

    Gender recognition can be distinguished as a separate type of problem, which is solved quite successfully - with large amounts of initial data, the gender is determined almost without error, and in short passages such as a stressed vowel sound, the probability of error is 5.3% for men and 3.1% for women.

    The problem of voice imitation has also been studied. Research by France Telecom has shown that professional voice imitation barely increases the probability of a false identification: imitators fake the voice only superficially, exaggerating features of speech, but cannot fake the basic contour of the voice. Even the voices of close relatives and twins differ, at least in the dynamics of articulation control. But with the development of computer technology a new problem has arisen that requires new methods of analysis: voice transformation, which raises the probability of error to 50%.

    Two criteria are used to describe the reliability of such systems: FRR (False Rejection Rate), the probability of falsely denying access (a Type I error), and FAR (False Acceptance Rate), the probability of a false admission, when the system mistakenly identifies a stranger as a legitimate user (a Type II error). Recognition systems are also sometimes characterized by the EER (Equal Error Rate), the point at which the FRR and FAR probabilities coincide. The more reliable the system, the lower its EER.

    Identification error values for various biometric modalities

    Application

    Recognition can be divided into two main areas: identification and verification. In the first case, the system must identify the user by voice on its own; in the second, it must confirm or reject the identifier presented by the user. Identifying the speaker under study consists of pairwise comparisons of voice models that capture the individual speech characteristics of each speaker, so a fairly large database must be collected first. Based on the results of these comparisons, a list of phonograms can be generated that, with some probability, contain the speech of the user of interest.

    Although voice recognition cannot guarantee a 100% correct result, it can be used quite effectively in areas such as criminalistics and forensic examination, intelligence work, anti-terrorism monitoring, security, banking, and so on.

    Analysis

    The entire process of processing a speech signal can be divided into several main stages:

    • signal preprocessing;
    • feature extraction;
    • speaker recognition.

    Each stage represents an algorithm or some set of algorithms, which ultimately produces the required result.

    The main features of the voice are formed by three main properties: the mechanics of vocal fold vibration, the anatomy of the vocal tract, and the articulation control system. Sometimes the speaker’s vocabulary and turns of phrase can also be used. The main features on which a decision about the speaker's identity is based take into account all factors of the speech production process: the voice source, the resonant frequencies of the vocal tract and their attenuation, and the dynamics of articulation control. The properties of the voice source include the average fundamental frequency, the contour and fluctuations of the fundamental frequency, and the shape of the excitation pulse. The spectral characteristics of the vocal tract are described by the spectrum envelope and its average slope, the formant frequencies, and the long-term spectrum or cepstrum. In addition, the duration of words, rhythm (stress distribution), signal level, and the frequency and duration of pauses are considered. Determining these characteristics directly requires rather complex algorithms, and since, for example, the error in estimating formant frequencies is quite large, in practice cepstral coefficients are used instead, computed from the spectrum envelope or from the vocal tract transfer function found by linear prediction. Besides the cepstral coefficients themselves, their first and second time differences are also used. This approach was first proposed in the works of Davis and Mermelstein.

    Cepstral analysis

    In works on voice recognition, the most popular method is the cepstral transformation of the spectrum of speech signals. The scheme of the method is as follows: over a time window of 10-20 ms, the current power spectrum is calculated, then the inverse Fourier transform of the logarithm of this spectrum (the cepstrum) is applied, and the coefficients are found:

    $$c_n = \frac{1}{\Theta}\int_0^{\Theta} |S(j,\omega,t)|^2 \, e^{-jn\omega\Omega}\, d\omega, \qquad \Omega = \frac{2\pi}{\Theta},$$

    where $\Theta$ is the highest frequency in the spectrum of the speech signal and $|S(j,\omega,t)|^2$ is the power spectrum. The number of cepstral coefficients $n$ depends on the required spectrum smoothing and ranges from 20 to 40. If a comb of bandpass filters is used, the discrete cepstral transform coefficients are calculated as

    $$c_n = \sum_{m=1}^{N} \log\big(Y(m)^2\big)\,\cos\!\left(\frac{\pi n}{M}\Big(m - \frac{1}{2}\Big)\right),$$

    where $Y(m)$ is the output signal of the m-th filter and $c_n$ is the n-th cepstral coefficient.
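
    As a numerical illustration of this scheme, here is a short Python sketch following the verbal definition (inverse transform of the log power spectrum of a short frame). The frame length and coefficient count are typical values; the code is illustrative, not from any particular system.

    # Real cepstrum of one 10-20 ms speech frame:
    # power spectrum -> logarithm -> inverse Fourier transform.
    import numpy as np

    def cepstrum(frame: np.ndarray, n_coeffs: int = 30) -> np.ndarray:
        power = np.abs(np.fft.rfft(frame)) ** 2     # current power spectrum
        log_power = np.log(power + 1e-12)           # avoid log(0)
        ceps = np.fft.irfft(log_power)              # inverse transform
        return ceps[:n_coeffs]                      # 20-40 coefficients typical

    rate = 16000
    frame = np.random.randn(int(0.02 * rate))       # a 20 ms stand-in frame
    print(cepstrum(frame)[:5])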

    Hearing properties are taken into account through a nonlinear transformation of the frequency scale, usually to the mel scale. This scale is based on the existence of so-called critical bands of hearing, such that signals of any frequency within a critical band are indistinguishable. The mel scale is calculated as

    $$M(f) = 1125\,\ln\!\left(1 + \frac{f}{700}\right),$$

    where $f$ is the frequency in Hz and $M$ is the frequency in mels. Alternatively, the bark scale is used, on which the difference between two frequencies equal to a critical band is 1 bark; the frequency $B$ is calculated as

    $$B = 13\,\operatorname{arctg}(0.00076 f) + 3.5\,\operatorname{arctg}\!\left(\left(\frac{f}{7500}\right)^{2}\right).$$

    The coefficients obtained this way are referred to in the literature as MFCC (Mel Frequency Cepstral Coefficients). Their number ranges from 10 to 30. Using the first and second time differences of the cepstral coefficients triples the dimension of the decision space but improves speaker recognition accuracy.
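
    The two scale conversions above translate directly into code, and ready-made MFCC implementations exist. The sketch below uses the third-party librosa library (a choice of this example; the text names no library) and also computes the first and second time differences.

    # The mel and bark conversions given above, plus MFCCs and their
    # time differences via librosa (pip install librosa).
    import numpy as np
    import librosa

    def hz_to_mel(f):
        return 1125.0 * np.log(1.0 + f / 700.0)                              # M(f)

    def hz_to_bark(f):
        return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)  # B(f)

    print(hz_to_mel(1000.0), hz_to_bark(1000.0))

    y, sr = librosa.load(librosa.example("trumpet"))      # any mono signal will do
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)    # 10-30 coefficients typical
    delta1 = librosa.feature.delta(mfcc)                  # first time difference
    delta2 = librosa.feature.delta(mfcc, order=2)         # second time difference
    features = np.vstack([mfcc, delta1, delta2])          # triples the dimension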

    The cepstrum describes the shape of the signal spectrum envelope, which is influenced by both the properties of the excitation source and the features of the vocal tract. Experiments have shown that the spectrum envelope has a strong influence on voice recognition. Therefore, the use of various methods of analyzing the spectrum envelope for voice recognition purposes is quite justified.

    Methods

    The GMM method follows from the theorem that any probability density function can be represented as a weighted sum of normal distributions:

    $$p(x \mid \lambda) = \sum_{j=1}^{k} \omega_j\, \varphi(x, \Theta_j),$$

    where $\lambda$ is the speaker model, $k$ is the number of model components, and $\omega_j$ are the component weights, such that $\sum_{j=1}^{k}\omega_j = 1$. Each $\varphi(x,\Theta_j)$ is a multivariate normal density:

    $$\varphi(x, \Theta_j) = p(x \mid \mu_j, R_j) = \frac{1}{(2\pi)^{n/2}\,|R_j|^{1/2}} \exp\!\left(-\frac{(x-\mu_j)^T R_j^{-1} (x-\mu_j)}{2}\right),$$

    where $n$ is the dimension of the feature space, $\mu_j \in \mathbb{R}^n$ is the mean vector of the j-th mixture component, and $R_j \in \mathbb{R}^{n \times n}$ is its covariance matrix.

    Very often, systems based on this model use a diagonal covariance matrix, which can be shared by all components of a model or even by all models. The EM algorithm is typically used to find the covariance matrices, weights, and mean vectors. The input is a training sequence of vectors $X = (x_1, \ldots, x_T)$. The model parameters are initialized and then re-estimated at each iteration of the algorithm. To choose the initial parameters, a clustering algorithm such as K-means is usually used: after the set of training vectors has been divided into M clusters, the initial values of $\mu_j$ are taken to be the cluster centers, the covariance matrices are computed from the vectors in each cluster, and the component weights are set to the proportion of the training vectors that fall in each cluster.
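
    As a sketch of this training procedure, here is how a diagonal-covariance GMM speaker model with K-means initialization and EM re-estimation might be fitted with scikit-learn. The library choice, the component count, and the random feature matrix are assumptions of the example.

    # Training a diagonal-covariance GMM speaker model as described above:
    # K-means initialization followed by EM, via scikit-learn.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.random.randn(1000, 20)        # placeholder: T=1000 training vectors, n=20

    gmm = GaussianMixture(
        n_components=16,                 # k, the number of mixture components
        covariance_type="diag",          # the diagonal covariance case from the text
        init_params="kmeans",            # K-means provides the initial parameters
        max_iter=100,                    # EM re-estimation iterations
    )
    gmm.fit(X)

    # For verification, score a test utterance's vectors under the model:
    print(gmm.score(X[:100]))            # average log-likelihood per vector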

    Re-estimation of the parameters proceeds according to the following formulas:
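
    In the standard EM formulation, the posterior probability of component $j$ for a training vector $x_t$ is

    $$\gamma_t(j) = \frac{\omega_j\,\varphi(x_t,\Theta_j)}{\sum_{l=1}^{k}\omega_l\,\varphi(x_t,\Theta_l)},$$

    and the parameters are re-estimated as

    $$\omega_j = \frac{1}{T}\sum_{t=1}^{T}\gamma_t(j), \qquad \mu_j = \frac{\sum_{t=1}^{T}\gamma_t(j)\,x_t}{\sum_{t=1}^{T}\gamma_t(j)}, \qquad R_j = \frac{\sum_{t=1}^{T}\gamma_t(j)\,(x_t-\mu_j)(x_t-\mu_j)^{T}}{\sum_{t=1}^{T}\gamma_t(j)}.$$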

    GMM can also be viewed as an extension of the vector quantization (centroid) method, which builds a codebook of disjoint regions in feature space (often via K-means clustering). Vector quantization is the simplest model used in context-independent recognition systems.

    The support vector machine (SVM) builds a hyperplane in a multidimensional space that separates two classes: the parameters of the target speaker and the parameters of speakers from the reference base. The hyperplane is calculated from specially chosen support vectors. Since the separating surface may not be a hyperplane in the original space, a nonlinear transformation of the measured parameter space into a higher-dimensional feature space is performed, and the separating hyperplane is constructed there, provided the condition of linear separability holds in the new feature space. Thus the success of SVM depends on the nonlinear transformation selected in each specific case. The support vector machine is often used together with the GMM or HMM methods. Typically, for short phrases lasting a few seconds, phoneme-dependent HMMs are better suited for the context-dependent approach.
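
    A minimal sketch of this scheme, with synthetic features standing in for real speaker parameters and scikit-learn's RBF-kernel SVM playing the role of the nonlinear transformation:

    # Separate the target speaker's feature vectors from a background
    # ("reference base") of other speakers. Data is synthetic; the RBF
    # kernel stands in for the nonlinear mapping described above.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    target_feats     = rng.normal(loc=0.5, size=(200, 20))   # target speaker
    background_feats = rng.normal(loc=0.0, size=(600, 20))   # reference base

    X = np.vstack([target_feats, background_feats])
    y = np.array([1] * len(target_feats) + [0] * len(background_feats))

    clf = SVC(kernel="rbf", probability=True)   # kernel = the nonlinear transform
    clf.fit(X, y)

    # Average target probability over a test utterance's frames:
    test = rng.normal(loc=0.5, size=(50, 20))
    print(clf.predict_proba(test)[:, 1].mean())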

    Popularity

    According to New York-based consulting company International Biometric Group, the most common technology is fingerprint scanning. It is noted that of the $127 million in revenue from the sale of biometric devices, 44% comes from fingerprint scanners. Facial recognition systems rank second in terms of demand at 14%, followed by palm shape recognition devices (13%), voice recognition (10%) and iris recognition (8%). Signature verification devices make up 2% of this list. Some of the most famous manufacturers in the voice biometrics market are Nuance Communications, SpeechWorks, VeriVoice.

    In February 2016, The Telegraph published an article reporting that customers of the British bank HSBC would be able to access accounts and conduct transactions using voice identification. The transition was supposed to take place in early summer.

    Man has always been attracted by the idea of controlling a machine using natural language. Perhaps this is partly due to man's desire to be ABOVE the machine, to feel superior, so to speak. But the main point is to simplify human interaction with artificial intelligence. Voice control in Linux has been implemented, with varying degrees of success, for almost a quarter of a century. Let's look into the issue and try to get as close to our OS as possible.

    The crux of the matter

    Systems for working with human voice for Linux have been around for a long time, and there are a great many of them. But not all of them process Russian speech correctly. Some were completely abandoned by the developers. In the first part of our review, we will talk directly about speech recognition systems and voice assistants, and in the second, we will look at specific examples of their use on a Linux desktop.

    It is necessary to distinguish between speech recognition systems proper (which translate speech into text or commands), such as CMU Sphinx and Julius, plus the applications based on those two engines, and voice assistants, which became popular with the spread of smartphones and tablets. Assistants are more of a by-product of speech recognition systems, a further development that applies the successful ideas of voice recognition in practice. For Linux desktops there are still few of them.

    You need to understand that the speech recognition engine and the interface to it are two different things. This is the basic principle of Linux architecture - dividing a complex mechanism into simpler components. The most difficult work falls on the shoulders of the engines. This is usually a boring console program that runs unnoticed by the user. The user interacts mainly with the interface program. Creating an interface is not difficult, so developers focus their main efforts on developing open-source speech recognition engines.

    What happened before

    Historically, speech processing systems in Linux have developed slowly and in fits and starts. The reason is not the incompetence of the developers but the high barrier to entry: writing system code for working with voice requires a highly qualified programmer. So before digging into speech systems on Linux, a short excursion into history is in order. IBM once had a wonderful operating system, OS/2 Warp (Merlin), released back in September 1996. Besides its obvious advantages over other operating systems of the day, OS/2 was equipped with a very advanced speech recognition system, IBM ViaVoice. This was very impressive for the time, considering that the OS ran on systems with a 486 processor and 8 MB of RAM (!).

    As you know, OS/2 lost the battle to Windows, but many of its components continued to exist independently. One of these components was the same IBM ViaVoice, which turned into an independent product. Since IBM always loved Linux, ViaVoice was ported to this OS, which gave the brainchild of Linus Torvalds the most advanced speech recognition system of its time.

    Unfortunately, the fate of ViaVoice did not turn out the way Linux users would have liked. The engine itself was distributed free of charge, but its sources remained closed. In 2003, IBM sold the rights to the technology to the Canadian-American company Nuance, which developed perhaps the most successful commercial speech recognition product, Dragon NaturallySpeaking, and is still alive today. That is roughly the end of the inglorious history of ViaVoice on Linux. During the short time that ViaVoice was free and available to Linux users, several interfaces were developed for it, such as Xvoice. However, that project has long been abandoned and is now practically inoperable.

    INFO

    The most difficult part of machine speech recognition is natural human language.

    What about today?

    Today everything is much better. In recent years, after the Google Voice API sources were opened, the situation with speech recognition systems on Linux has improved significantly and recognition quality has risen. For example, the Linux Speech Recognition project based on the Google Voice API shows very good results for Russian. All engines work in roughly the same way: sound from the microphone of the user's device enters the recognition system, and the voice is then either processed on the local device or sent as a recording to a remote server for processing. The second option is better suited to smartphones and tablets; in fact, this is exactly how the commercial engines Siri, Google Now, and Cortana work.

    Of the many engines for working with the human voice, several are actively maintained today.

    WARNING

    Installing many of the described speech recognition systems is a non-trivial task!

    CMU Sphinx

    Much of the development of CMU Sphinx takes place at Carnegie Mellon University; at different times, both the Massachusetts Institute of Technology and the now-defunct Sun Microsystems have worked on the project. The engine's sources are distributed under the BSD license and are available for both commercial and non-commercial use. Sphinx is not an end-user application but a set of tools from which end-user applications can be developed. Sphinx is now the largest speech recognition project. It consists of several parts:

    • Pocketsphinx - a small, fast program that processes sound using acoustic models, grammars, and dictionaries;
    • Sphinxbase - a library required for Pocketsphinx to work;
    • Sphinx4 - the recognition library itself;
    • Sphinxtrain - a program for training acoustic models on recordings of the human voice.

    The project is developing slowly but surely, and most importantly, it can be used in practice, not only on PCs but also on mobile devices. The engine also works very well with Russian speech. With skilled hands and a clear head, you can set up Russian speech recognition with Sphinx to control home appliances or a smart home; in fact, you can turn an ordinary apartment into a smart home, which is what we will do in the second part of this review. Sphinx implementations are available for Android, iOS, and even Windows Phone. Unlike the cloud approach, where the recognition work falls on the shoulders of Google ASR or Yandex SpeechKit servers, Sphinx works more accurately, faster, and cheaper, and entirely locally. If you wish, you can teach Sphinx a Russian language model and a grammar of user queries. Yes, you will have to tinker during installation; setting up Sphinx voice models and libraries is not an activity for beginners either. And because the core of CMU Sphinx, the Sphinx4 library, is written in Java, you can include its code in your own speech recognition applications. Specific usage examples will be described in the second part of our review.
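
    As a taste of what using Sphinx looks like, here is a minimal sketch with the pocketsphinx Python package (releases that expose the LiveSpeech helper). The Russian model paths are placeholders, not real file names; substitute whatever acoustic model, language model, and dictionary you have trained or downloaded.

    # Continuous recognition from the microphone with Pocketsphinx
    # (pip install pocketsphinx, versions providing LiveSpeech).
    from pocketsphinx import LiveSpeech

    speech = LiveSpeech(
        hmm="model/zero_ru.cd_cont_4000",   # acoustic model directory (placeholder)
        lm="model/ru.lm",                   # language model (placeholder)
        dic="model/ru.dic",                 # phonetic dictionary (placeholder)
    )
    for phrase in speech:                   # yields recognized phrases as they arrive
        print(phrase)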

    VoxForge

    Let us especially highlight the concept of a speech corpus. A speech corpus is a structured set of speech fragments, which is provided with software for accessing individual elements of the corpus. In other words, it is a set of human voices in different languages. Without a speech corpus, no speech recognition system can operate. It is difficult to create a high-quality open speech corpus alone or even with a small team, so a special project is collecting recordings of human voices - VoxForge.

    Anyone with access to the Internet can contribute to the creation of a speech corpus by simply recording and submitting a speech fragment. This can be done even by phone, but it is more convenient to use the website. Of course, in addition to the audio recording itself, the speech corpus must include additional information, such as phonetic transcription. Without this, speech recording is meaningless for the recognition system.


    HTK, Julius and Simon

    HTK (Hidden Markov Model Toolkit) is a toolkit for research and development of speech recognition tools based on hidden Markov models, developed at the University of Cambridge under the patronage of Microsoft (Microsoft once bought the code from the commercial firm Entropic Cambridge Research Laboratory Ltd and later returned it to Cambridge along with a restrictive license). The project's sources are available to everyone, but using HTK code in products intended for end users is prohibited by the license.

    However, this does not mean that HTK is useless to Linux developers: it can be used as an auxiliary tool when developing open-source (and commercial) speech recognition tools, which is exactly what the developers of the open-source Julius engine, developed in Japan, do. Julius works best with Japanese, but the great and mighty Russian language is not left out either, since the same VoxForge is used as the voice database.
