Speech Recognition

Accuracy

As mentioned earlier in this article, the accuracy of speech recognition varies with the following factors:

Error rates increase as the vocabulary size grows:

e.g. The 10 digits "zero" to "nine" can be recognized essentially perfectly, but vocabulary sizes of 200, 5000 or 100000 may have error rates of 3%, 7% or 45%, respectively.

Vocabulary is hard to recognize if it contains confusable words:

e.g. The 26 letters of the English alphabet are difficult to discriminate because many of them sound alike (most notoriously the E-set: "B, C, D, E, G, P, T, V, Z"); an 8% error rate is considered good for this vocabulary.

Speaker dependence vs. independence:

A speaker-dependent system is intended for use by a single speaker.
A speaker-independent system is intended for use by any speaker, which is more difficult to achieve.

Isolated, discontinuous or continuous speech

With isolated speech, single words are used, so the speech is easier to recognize.
With discontinuous speech, full sentences separated by silence are used, so the speech is about as easy to recognize as isolated speech.
With continuous speech, naturally spoken sentences are used, so the speech is harder to recognize than either isolated or discontinuous speech.

Task and language constraints

e.g. A querying application may dismiss the hypothesis "The apple is red."
e.g. Constraints may be semantic, rejecting "The apple is angry."
e.g. Constraints may be syntactic, rejecting "Red is apple the."
Constraints are often represented by a grammar.
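
As a rough illustration of how such constraints can filter recognition hypotheses, the sketch below applies a toy grammar to the three example sentences. The part-of-speech table, the single legal sentence pattern and the noun-adjective compatibility table are all hypothetical and far simpler than any real grammar.

```python
# Toy lexicon mapping each word to a part-of-speech tag (hypothetical).
POS = {"the": "DET", "apple": "NOUN", "is": "VERB", "red": "ADJ", "angry": "ADJ"}

# Syntactic constraint: the only legal sentence shape in this toy grammar.
LEGAL_PATTERN = ("DET", "NOUN", "VERB", "ADJ")

# Semantic constraint: adjectives that make sense for each noun (hypothetical).
COMPATIBLE = {"apple": {"red"}}


def is_syntactic(words):
    """True if the word sequence matches the legal part-of-speech pattern."""
    tags = tuple(POS.get(w) for w in words)
    return tags == LEGAL_PATTERN


def is_semantic(words):
    """True if the adjective fits the noun; assumes the syntax check passed."""
    noun, adjective = words[1], words[3]
    return adjective in COMPATIBLE.get(noun, set())


def accept(hypothesis):
    """Apply the syntactic check first, then the semantic one."""
    words = hypothesis.lower().rstrip(".").split()
    return is_syntactic(words) and is_semantic(words)


for sentence in ["The apple is red.", "The apple is angry.", "Red is apple the."]:
    print(sentence, accept(sentence))
```

Under these assumptions only "The apple is red." is accepted. A task constraint, such as a querying application that never expects statements, could be added as one more check of the same kind.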

Read vs. Spontaneous Speech

When a person reads, it is usually in a context that has been previously prepared. Spontaneous speech is harder to recognize because of disfluencies (such as "uh" and "um", false starts, incomplete sentences, stuttering, coughing, and laughter) and limited vocabulary.

Adverse conditions

Environmental noise (e.g. noise in a car or a factory)
Acoustical distortions (e.g. echoes, room acoustics)

Speech recognition is a multi-leveled pattern recognition task.

Acoustical signals are structured into a hierarchy of units:

e.g. phonemes, words, phrases, and sentences.

Each level provides additional constraints:

e.g. known word pronunciations or legal word sequences, which can compensate for errors or uncertainties at lower levels.

This hierarchy of constraints is exploited:

by combining decisions probabilistically at all lower levels, and making more deterministic decisions only at the highest level.

Speech recognition by a machine is a process broken into several phases. Computationally, it is a problem in which a sound pattern has to be recognized or classified into a category that represents a meaning to a human. Every acoustic signal can be broken into smaller, more basic sub-signals. Breaking a complex sound into sub-sounds creates different levels: complex sounds at the top level are made of simpler sounds at the level below, and each lower level holds sounds that are more basic, shorter and simpler. At the lowest level, where the sounds are the most fundamental, a machine checks simple, more probabilistic rules for what a sound should represent. Once these sounds are put together into a more complex sound at the level above, a new set of more deterministic rules predicts what the new complex sound should represent. The uppermost level of deterministic rules works out the meaning of complex expressions.

Neural networks are one way to put this into practice. There are four steps of neural network approaches:

Digitize the speech that we want to recognize.

For telephone speech the sampling rate is 8000 samples per second.

Compute spectral-domain features of the speech (with a Fourier transform).

These are computed every 10 msec, with one 10 msec section called a frame.

The four steps can be explained in more detail. Sound is produced by the vibration of air (or some other medium), which humans register with their ears and machines with receivers. A basic sound creates a wave that has two descriptions: amplitude (how strong it is) and frequency (how often it vibrates per second).

Sound waves can be digitized: sample the strength of the wave at short intervals to get a series of numbers that approximate the strength of the wave at each time step. This collection of numbers represents the analog wave, and the new representation is digital. Sound waves are complicated because they superimpose on top of each other, as any waves would, and so create odd-looking waves. For example, if two waves interact with each other, we can add them, which creates a new odd-looking wave.
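
The first two steps can be sketched in a few lines of Python. This is a minimal illustration, not a production front end: the two sine components (500 Hz and 1200 Hz) are made-up test tones, the frames are assumed not to overlap, no windowing is applied, and the Fourier transform is a naive one written out for clarity.

```python
import cmath
import math

SAMPLE_RATE = 8000                            # samples per second (telephone speech)
FRAME_MS = 10                                 # frame length in milliseconds
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000    # 80 samples per frame


def sample_wave(components, duration_s):
    """Sample a sum of sine waves; components is a list of (amplitude, frequency_hz)."""
    n = round(SAMPLE_RATE * duration_s)
    return [
        sum(a * math.sin(2 * math.pi * f * t / SAMPLE_RATE) for a, f in components)
        for t in range(n)
    ]


def split_frames(samples):
    """Cut the signal into consecutive, non-overlapping 10 ms frames."""
    last_start = len(samples) - FRAME_LEN
    return [samples[i:i + FRAME_LEN] for i in range(0, last_start + 1, FRAME_LEN)]


def magnitude_spectrum(frame):
    """Naive discrete Fourier transform; returns magnitudes for bins 0..N/2."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


# Two superimposed waves: a strong 500 Hz tone and a weaker 1200 Hz tone.
signal = sample_wave([(1.0, 500), (0.5, 1200)], duration_s=0.03)
frames = split_frames(signal)
spectrum = magnitude_spectrum(frames[0])

bin_hz = SAMPLE_RATE / FRAME_LEN              # each spectral bin is 100 Hz wide
strongest = sorted(range(len(spectrum)), key=lambda k: spectrum[k], reverse=True)[:2]
print(sorted(k * bin_hz for k in strongest))  # prints [500.0, 1200.0]
```

The summed wave looks irregular in the time domain, but the spectrum of a single frame separates it back into its two components, which is why spectral features are a convenient input for the later steps.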

A neural network classifies the features into phonetic-based categories.

Given the basic sound blocks that a machine has digitized, one has a series of numbers that describe a wave, and waves describe words. Each frame holds a unit block of sound, which is broken into basic sound waves and represented by numbers after the Fourier transform; these numbers can be statistically evaluated to decide which class of sounds the frame belongs to. In a network diagram, each node in the first layer represents a feature of the sound, and features are passed from the first layer of nodes to a second layer based on some statistical analysis. This analysis depends on the programmer's instructions. The second layer of nodes then represents higher-level features of the sound input, which are again statistically evaluated to see what class they belong to. The last layer consists of output nodes that tell us, with high probability, what the original sound really was.
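
The layered evaluation described above can be sketched as a tiny feed-forward network. Everything specific here is an assumption made for illustration: the two input features (low-band and high-band energy of a frame), the three category names, and the hand-picked weights, which a real system would learn from training data.

```python
import math

CATEGORIES = ["ah", "s", "silence"]   # hypothetical phonetic categories

# Hand-picked illustrative weights (hypothetical, not trained).
W_HIDDEN = [[2.0, -1.0],
            [-1.0, 2.0]]              # 2 hidden nodes x 2 input features
B_HIDDEN = [0.0, 0.0]
W_OUT = [[3.0, -3.0],
         [-3.0, 3.0],
         [-2.0, -2.0]]                # 3 output nodes x 2 hidden nodes
B_OUT = [0.0, 0.0, 1.0]


def layer(weights, biases, inputs):
    """Weighted sum of the inputs plus a bias, one value per node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features):
    """Map [low_band_energy, high_band_energy] to category probabilities."""
    hidden = [math.tanh(v) for v in layer(W_HIDDEN, B_HIDDEN, features)]
    return softmax(layer(W_OUT, B_OUT, hidden))


def best_category(features):
    """Name of the category with the highest output probability."""
    probs = classify(features)
    return CATEGORIES[probs.index(max(probs))]


print(best_category([1.0, 0.1]))   # mostly low-frequency energy
print(best_category([0.1, 1.0]))   # mostly high-frequency energy
print(best_category([0.0, 0.0]))   # no energy at all
```

With these weights the three calls pick "ah", "s" and "silence" in turn. The output layer returns a probability for every category rather than a single hard label, which leaves the final decision to the search step.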

Search to match the neural-network output scores against candidate words, to determine the word that was most likely uttered.
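
One way to picture this search step is a small dynamic-programming alignment (a Viterbi-style search) between the per-frame scores and the pronunciation of each word. The lexicon, the phoneme labels and the frame scores below are invented for the example, and the sketch assumes every phoneme of a word covers at least one frame, in order.

```python
import math

LEXICON = {                 # hypothetical pronunciations
    "so": ["s", "ow"],
    "sue": ["s", "uw"],
    "oh": ["ow"],
}

# Hypothetical network outputs: one probability per category for each frame.
FRAMES = [
    {"s": 0.8, "ow": 0.1, "uw": 0.1},
    {"s": 0.6, "ow": 0.2, "uw": 0.2},
    {"s": 0.1, "ow": 0.5, "uw": 0.4},
    {"s": 0.1, "ow": 0.3, "uw": 0.6},
]


def align_score(frame_scores, phonemes):
    """Best total log-probability of assigning frames to phonemes in order,
    with each phoneme covering at least one frame."""
    neg_inf = float("-inf")
    # best[j]: best score so far with the latest frame assigned to phonemes[j]
    best = [neg_inf] * len(phonemes)
    for t, scores in enumerate(frame_scores):
        new = [neg_inf] * len(phonemes)
        for j, phoneme in enumerate(phonemes):
            if t == 0:
                prev = 0.0 if j == 0 else neg_inf      # must start on the first phoneme
            else:
                stay = best[j]                          # same phoneme continues
                advance = best[j - 1] if j > 0 else neg_inf   # move to next phoneme
                prev = max(stay, advance)
            if prev > neg_inf:
                new[j] = prev + math.log(scores[phoneme])
        best = new
    return best[-1]                                     # must end on the last phoneme


def recognize(frame_scores):
    """Return the lexicon word whose pronunciation best explains the frames."""
    return max(LEXICON, key=lambda word: align_score(frame_scores, LEXICON[word]))


greedy = [max(frame, key=frame.get) for frame in FRAMES]
print(greedy)               # prints ['s', 's', 'ow', 'uw']
print(recognize(FRAMES))    # prints sue
```

Picking the top category frame by frame gives "s s ow uw", which matches no word in this lexicon. Keeping the probabilities and deciding only at the word level yields "sue", in line with the idea of combining lower-level decisions probabilistically and committing only at the highest level.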

In 1982, Kurzweil Applied Intelligence and Dragon Systems released speech recognition products. By 1985, Kurzweil’s software had a vocabulary of 1,000 words, if uttered one word at a time. Two years later, in 1987, its lexicon reached 20,000 words, entering the realm of human vocabularies, which range from 10,000 to 150,000 words. But recognition accuracy was only 10% in 1993. Two years later, the error rate crossed below 50%. Dragon Systems released "NaturallySpeaking" in 1997, which recognized normal human speech. Progress mainly came from improved computer performance and larger source text databases. The Brown Corpus was the first major database available, containing about one million words. In 2006, Google published a trillion-word corpus, while Carnegie Mellon University researchers found no significant increase in recognition accuracy.
