GB2240203A - Automated speech recognition system - Google Patents
- Publication number
- GB2240203A (application GB9026766A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- utterances
- unknown
- recited
- list
- verbal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
Abstract
A speech recognition system reduces ambiguity by taking segments of electrical signals derived from unknown verbal utterances and time-aligning them with reference signal segments derived from known verbal utterances using Hidden Markov Model techniques, 66, so as to produce a scored list of tentatively identified words or phonemes corresponding to the time-aligned segments. The list of tentatively identified words or phonemes is then rescored using artificial neural net techniques, 70, to produce an output signal corresponding to a finalized list of words determined to be the most likely correct interpretation of the unknown verbal utterances. Alternatively, only ambiguous portions of the list of words tentatively identified by HMM techniques are input to the artificial neural net for rescoring.
Description
AUTOMATED SPEECH RECOGNITION SYSTEM
Field of the Invention
The present invention relates generally to speech recognition systems, and more particularly to an automated speech recognition system which merges Hidden Markov Models (HMM) with Artificial Neural Nets (ANN), and which, more precisely, first applies HMM's to time-align speech to standard templates and then applies ANN's to identify speech sounds.
Brief Description of the Prior Art
A Hidden Markov Model (HMM) is a finite state machine in which state transitions are controlled by doubly stochastic transition links. HMM's are widely used in speech recognition systems of the prior art. In such systems, one transition link probability is a state transition probability (or static hidden probability) which influences the time warping ability of the state machine, while the other transition link probability is a computed probability (or dynamic observation probability) which determines speech sound similarity. HMM's are used to align unknown speech utterances in time with known speech utterances. HMM's make these time-alignments by following the time evolution of finite state machines representing the unknown utterances, which involves computing the probability of being at many states inside the finite state machine at each time interval of the unknown utterance, thereby requiring significant computational time and memory.
Information about the use of HMM's in speech recognition systems and various solutions for determining the above mentioned probabilities are presented in Furui, S., 1989, Digital Speech Processing, Synthesis, and Recognition, Appendix D, HMM Procedures, Marcel Dekker, Inc., pp. 341-347; Levinson et al., U.S. Pat. No. 4,587,670, issued May 6, 1986; Bahl et al., European Pat. No. 238,693, published September 30, 1987; Bahl et al., European Pat. No. 238,697, published September 30, 1987; Bahl et al., U.S. Pat. No. 4,741,036, issued April 26, 1988; Bahl et al., Canadian Pat. No. 1,236,578, issued May 10, 1988; Bahl et al., U.S. Pat. No. 4,759,068, issued July 19, 1988; Juang et al., U.S. Pat. No. 4,783,804, issued November 8, 1988; Bahl et al., U.S. Pat. No. 4,819,271, issued April 4, 1989; Nishimura, European Pat. No. 312,209, issued April 19, 1989; Bahl et al., U.S. Pat. No. 4,827,521, issued May 2, 1989; Kuroda et al., U.S. Pat. No. 4,829,577, issued May 9, 1989; and Levinson, U.S. Pat. No. 4,852,180, issued July 25, 1989.
Artificial neural nets, also called "connectionist models", have been used for speech recognition because of the ability of such systems to recognize short time-registered speech patterns. In such applications, spectrum analyzed speech signals are typically applied to the input nodes of the ANN. The ANN itself is composed of many simple linear computational nodes operating in parallel and arranged in patterns which were originally inspired by the biological neural nets of the human brain. Each node sums its inputs and runs its output through some type of nonlinearity, such as a limiter (a logistic function). These nodes interact with other nodes using weighted connections. Each node has a state or activity level which is determined by the weighted inputs received from other nodes in the network. Long-term knowledge within the network is derived from the respective input weights of connections between nodes. Learning consists of changing the weights of the connections. Short-term knowledge is typically determined by the states of the nodes.
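By way of a minimal sketch of the node behavior just described (the particular weight values, bias term, and logistic nonlinearity are illustrative assumptions, not part of any particular embodiment), a single node's activity level may be computed as follows:

```python
import math

def logistic(x):
    # Squashing nonlinearity (a "limiter") applied at each node.
    return 1.0 / (1.0 + math.exp(-x))

def node_activation(inputs, weights, bias=0.0):
    # Each node sums its weighted inputs and passes the total through
    # the nonlinearity to produce its state, or activity level.
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return logistic(total)
```

Learning would then consist of adjusting the `weights` values on the connections between such nodes.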
One type of ANN which has proven useful for speech recognition purposes is a canonical ANN or Multi-Layered Perceptron (MLP). These are feedforward nets with one or more layers of nodes between the input and output nodes. The intermediate layers are composed of hidden nodes having unspecified desired states, which typically make learning harder for the ANN because the states of the hidden nodes must be determined as part of the learning procedure. Although these types of ANN's typically require even more computational capacity than HMM's, the hidden nodes allow the ANN to extract progressively more complex features from its inputs, thereby enhancing its ability for matching (and thereby detecting) ambiguous speech patterns. Multilayered Perceptrons are explained in both Furui, S., 1989,
Digital Speech Processing, Synthesis, and Recognition, Appendix E, Neural Nets, Marcel Dekker, Inc., pp. 349-53; and H. Bourlard & N. Morgan (July 1989), Merging Multilayer Perceptrons and Hidden Markov Models: Some Experiments in Continuous Speech Recognition, International Computer Science Institute, TR-89-033. The use of similar types of connectionist systems is also disclosed in A. Waibel, H. Sawai & K. Shikano (1989), Consonant Recognition by Modular Construction of Large Phonemic Time-Delay Neural Networks, IEEE, Acoustic, Speech and Signal Processing Society, pp. 112-15; and H. Sawai et al. (1989), Spotting Japanese CV Syllables and Phonemes Using Time-Delay Neural Networks, IEEE, Acoustic, Speech and Signal Processing Society, pp. 25-28.
Although HMM and ANN techniques have previously been used in speech recognition systems, the two techniques have only recently been combined for such a purpose. Bourlard & Morgan (July 1989), describe integrating a multilayer Perceptron (MLP) into an HMM. The MLP is used to estimate probabilities for pattern classifications, which are then incorporated by the HMM to segment continuous speech into a succession of words. M. Franzini, M. Witbrock & K. Lee (June 1989), Speaker
Independent Recognition of Connected Utterances Using Recurrent and
Non-recurrent Neural Networks, IJCNN, pp. II-1 to II-7, also describe a combination of ANN and HMM techniques for speech recognition. As in
Bourlard & Morgan (July 1989), Franzini, Witbrock & Lee (June 1989) only describe a method for applying ANN techniques before HMM techniques.
No prior art speech recognition systems disclose time-aligning segments of unknown speech using HMM techniques before inputting such segments to an ANN. Instead, prior art systems apply the outputs of the
ANN directly to transition arcs between nodes of the HMM, thereby failing to take advantage of the large benefits to be gained from performing HMM analysis before ANN techniques in a speech recognition application, as will be described below.
Summary of the Invention
A preferred embodiment of the present invention comprises a method and apparatus for reducing speech sound ambiguity in speech recognition systems by taking segments of electrical signals from unknown verbal utterances and time-aligning them with reference signal segments from known verbal utterances, using Hidden Markov Modeling techniques, so as to produce a scored list of tentatively identified words or phonemes corresponding to the time-aligned segments. The list of tentatively identified words or phonemes is then rescored using artificial neural net techniques to produce an output signal corresponding to a finalized list of words determined to be the most likely correct interpretation of the unknown verbal utterances. Alternatively, only ambiguous portions of the list of words tentatively identified by the HMM techniques are input to the artificial neural net for rescoring.
Brief Description of the Drawing
Fig. 1 is an illustration of a Hidden Markov Model phoneme template in accordance with the preferred embodiment of the present invention;
Fig. 2 is an illustration of processed unknown speech signals being divided into HMM nodes in accordance with the preferred embodiment of the present invention;
Fig. 3a is an illustration of an unknown utterance signal in comparison to a known utterance signal;
Fig. 3b is an illustration of an undesirable method for time-aligning the unknown and known utterance signals of Fig. 3a;
Fig. 3c is an illustration of a method of the preferred embodiment of the present invention for time-aligning the unknown and known utterance signals of Fig. 3a;
Fig. 4 is an illustration of a matrix of distances for a dynamic time warping based system in accordance with the preferred embodiment of the present invention;
Fig. 5 is an illustration of a matrix of accumulated distances for the dynamic time warping based system of Fig. 4;
Fig. 6 is an illustration of a best path boundary for the matrix of accumulated distances of Fig. 5 and in accordance with the preferred embodiment of the present invention;
Fig. 7 is a partial block diagram illustrating the two-step process for combining HMM and ANN techniques for automated speech recognition systems in accordance with the preferred embodiment of the present invention;
Fig. 8 is a further illustration of the two-step process of Fig. 7; and
Fig. 9 is a block diagram illustrating a first alternative embodiment of the speech recognition system of the present invention.
Detailed Description of the Preferred Embodiment
A Hidden Markov Model (HMM) is a finite state machine in which state transitions are controlled by doubly stochastic transition links. These links are stated to be doubly stochastic because each link has two probabilities assigned to it: a state transition probability (or static hidden probability), which can be used to influence the time warping ability of a speech recognition system; and a computed probability (or dynamic observation probability), which can be used to determine speech sound similarity. Fig. 1 depicts a topology of an HMM representing a phoneme.
Arbitrary topologies are possible in generic HMM's; however, the art associated with using HMM's in speech recognition systems lies in determining what the network topologies should be for particular phonemes. In Fig. 1, the state machine network, shown generally as 8, is comprised of a number of nodes 10, connected by a number of state transition links 12. Each of these transition links 12 includes a dynamic observation probability, shown generally as 14, and a static hidden probability 16. The dynamic observation probability is in turn comprised of an unknown sound element 18, derived from the speech signals being analyzed, and a reference sound element 20, to which the unknown sound element 18 is to be compared.
Fig. 2 illustrates a representation of an HMM network 8 as it is intended to be used in relation to the present invention. Unknown utterances detected by the microphone 22 are converted into unknown speech signals 24 by speech processor 26. Speech signals 24 output by the speech processor 26 are then separated into 10 to 20 millisecond segments 28 that are assigned to individual nodes 10 of the HMM network 8 to facilitate the isomorphic performance of the time alignment process. An HMM network time-aligns segments of unknown speech by following the time evolution of finite state machines representing the utterances. This involves computing the probability of being at each and every state inside the utterances at each of the 10 millisecond time intervals, as well as requiring self-looping links 30 and skipping links 32 so as to allow for speaking rate variations.
The dynamic observation probabilities of the network are determined either directly, by measuring the speech sound similarity between two speech time slices, or indirectly, by encoding the speech signals with cepstral, LPC, Fourier, or band pass filter coefficients. Regardless of the particular encoding scheme utilized, each time slice 28 will typically be represented by a vector of 12 to 26 numbers, or by scalar numbers specifying vector quantization (VQ) code labels. In mathematical terms, the probability of a node Nj at time t is equal to the largest of the probabilities associated with each of the transition arcs terminating at node Nj, where the probability of each such arc is equal to H(i,j) * D(j) * P(i, t-1), where H(i,j) is the static hidden probability, D(j) is the dynamic observation probability, and P(i, t-1) is the probability of node Ni, from which the transition arc originates, at time t-1. Although the encoding of reference speech models is described above, language models and other types of reference information about speech could also be encoded in the form of the finite state machines of the HMM.
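The recurrence just described can be sketched as follows (a hypothetical arrangement assuming the dynamic observation probability is associated with the destination node; real systems differ in how these probabilities are indexed):

```python
def viterbi_step(prev_probs, hidden, observation):
    # One time interval of the recurrence: the probability of node j at
    # time t is the observation probability at j times the largest
    # hidden-probability-weighted probability carried in on any arc.
    #   prev_probs[i]  : probability of node i at time t-1
    #   hidden[i][j]   : static hidden probability H(i,j) of arc i -> j
    #   observation[j] : dynamic observation probability D(j) at time t
    n = len(prev_probs)
    return [
        observation[j] * max(hidden[i][j] * prev_probs[i] for i in range(n))
        for j in range(n)
    ]
```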
During the recognition process, the time evolution through the state probabilities is recorded as each unknown utterance is evaluated. For each time interval, the probability of being in a state is the product of the probability of being in a legal precursor state times the probability that the transition to the current node is compatible with acoustic measurements.
The process of determining the time evolution through the state probabilities can be better understood with reference now to Figs. 3 to 6.
Fig. 3a illustrates a comparison between an unknown utterance signal 34 and a reference (known) utterance signal 36 as detected by the microphone 22.
In this particular example, since the reference utterance 36 was enunciated in less time than the unknown utterance 34, the two signals are not aligned in time. Stretching the shorter utterance, reference utterance 36, to fit the longer utterance, unknown utterance 34, produces an undesirable mismatched middle region 38 between the two utterances, as illustrated in Fig. 3b. In accordance with well known Dynamic Time Warping (DTW) and HMM techniques, as will be further explained below, these two signals can be time-aligned, in a manner similar to that illustrated in Fig. 3c, so that a subsequent proper comparison of the two signals can be achieved. In Fig. 3c, the alignment pointers 40 indicate the portions of the signals which are comparable in time.
Although HMM techniques are more general than DTW techniques, HMM handles time alignment in substantially the same manner as DTW. With regard to the present invention, the ANN will work equally well when the inputs to the ANN are time-aligned by either HMM or DTW based systems. A DTW based system solves the time alignment problem by creating a matrix of distances, shown generally as 41, between the two utterances illustrated in Fig. 4 and computing the smallest accumulated distance, the best path 42, between the beginning point 44 and the ending point 46. To determine the best path 42 through the accumulated distances, it is first necessary to compute the distance value D(i,j) corresponding to the various entries 48 of the distance matrix 41. The entries 48 relate to the distance between each time interval of the unknown utterance (i) 34 and each time interval of the reference utterance (j) 36.
The distance value D(i,j) of an entry 48 is inversely proportional to the similarity of the utterance segments being compared in the matrix 41.
The distance matrix 41 is then converted into a matrix of accumulated distance values AD(i,j), as illustrated by the partially completed accumulated distance matrix 50 of Fig. 5. The accumulated distance values AD(i,j) are determined for each entry 52 by the following equation:

AD(i,j) = D(i,j) + MIN[AD(i-1, j-1), AD(i-1, j), AD(i, j-1)]

The matrix 50 is filled in column by column, from the bottom to the top within each column. Using entry AD(3,3) as an example, the accumulated distance value for that entry would be determined by taking the original distance value 54 calculated for D(3,3), as indicated to the bottom right of each accumulated distance entry 52, and adding the minimum accumulated distance value from the three preceding entries (AD(2,3), AD(2,2), AD(3,2)), as illustrated by the three direction arrows, to derive an accumulated distance value of eleven for that entry. After determining the accumulated distance values for the entire matrix 50, the best path 42 through the original distance matrix 41 is determined by back-tracking through each of the lowest accumulated distance values from the end point to the beginning point of the matrix 50.
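A sketch of the accumulated-distance computation (filled row by row here, which yields the same values as the column-by-column order described above):

```python
def dtw_accumulated(D):
    # D[i][j] is the distance between time interval i of the unknown
    # utterance and time interval j of the reference utterance.
    n, m = len(D), len(D[0])
    AD = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                prev = 0.0          # beginning point
            elif i == 0:
                prev = AD[0][j - 1]
            elif j == 0:
                prev = AD[i - 1][0]
            else:
                # Minimum over the three legal predecessor entries.
                prev = min(AD[i - 1][j - 1], AD[i - 1][j], AD[i][j - 1])
            AD[i][j] = D[i][j] + prev
    return AD
```

The best path is then recovered by back-tracking from the ending-point entry through the minimal predecessors.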
As an alternative to explicitly back-tracking through the accumulated distance matrix 50 after the accumulated distance value of the end point has been determined, a list of the various distance entries 48 visited on each of a number of best paths could simultaneously be carried along as the computations proceed, a process known as "Viterbi decoding". The Viterbi algorithm produces a list of distance entries, or states, on paths that lead to non-pruned active states (states which are still candidates for inclusion in one or more best paths), such that for any given active state the optimal path between it and the initial state is given. When processing of the matrix 41 has been completed, the list with the best score contains the best path back to the beginning point 44. The various best paths are not determined by tracking every possible path that could be followed between the beginning point and the end point. Bad paths are determined and their nodes or states are pruned away at every opportunity so as to reduce the amount of computational time and memory required to process the utterance.
One method for imposing such control is to impose a path boundary, starting at the beginning point and ending at the ending point of the accumulated distance matrix, separating the possible best path states from states which could not possibly be included in the best path. One such boundary is illustrated in Fig. 6, which depicts the matrix 50 of Fig. 5 having a central area bounded by a legal path region 56. The parallelogram of the region 56 is bounded by lines having slopes of 1/2 and 2. This "forced fit" technique for pruning away bad paths can also be applied to HMM systems, which is one reason why DTW and HMM outputs will work equally well as inputs to the ANN. A path boundary is imposed on the state machine template of the HMM by preventing self-looping transitions and skipping transitions from occurring two times in succession, thereby creating boundaries having effective slopes of 1/2 and 2, respectively.
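A sketch of the parallelogram test (a hypothetical helper; indices are zero-based, with `n` and `m` the lengths of the unknown and reference utterances):

```python
def in_legal_region(i, j, n, m):
    # Entry (i, j) lies inside the legal path region when it sits
    # between lines of slope 1/2 and 2 drawn through both the
    # beginning point (0, 0) and the ending point (n-1, m-1).
    return (j <= 2 * i and i <= 2 * j
            and (m - 1 - j) <= 2 * (n - 1 - i)
            and (n - 1 - i) <= 2 * (m - 1 - j))
```

Entries outside this region can be pruned without ever computing their accumulated distances.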
When an HMM is utilized for speech recognition, it is typically structured as a series of nodes in a linear finite state network, wherein each one of the nodes is comprised of a number of phones, words or utterances.
The HMM proceeds by creating a list of probable word sequences based on acoustic, phonetic analysis and on the structure of the HMM language model.
For time alignment purposes, the HMM maintains a list of time markers or pointers to the most probable state sequences, which may be used, for example, to indicate word sequence candidates for output from the recognizer or to indicate the time at which certain phone sequences occurred. However, many prior art HMM systems do not keep pointers to the original speech, thereby limiting such systems' ability to determine locations of elemental phone states within words of the input utterance, as will be further explained below. It is important to note that in the present invention, such pointers are provided and saved in the form of Viterbi lists with each tentatively identified list of words or phonemes for application to the ANN. These Viterbi lists of word sequences are developed in parallel, and when the end of the utterance string has been reached, the leading list of words is accepted as the correct sequence and its time markers are accepted as correct locators of word boundaries in time.
Details of HMM procedures for achieving dynamic time alignment can be obtained through review of the prior art references described above.
What is important with regard to the use of HMM techniques in the present invention is that HMM techniques provide a variety of ways in which to determine the time evolution of a series of states so as to determine the time at which the states occupied positions on a time-aligned path. The Viterbi search is but one computationally efficient way to determine the optimal time alignment of speech to the states in a finite state machine, such as a Hidden Markov Model; many other well known techniques could also be used. Once again, however, it is important to note that HMM formalism assimilates the best known technique for time registration between the internal states of a word model or subword model and an associated speech input.
With regard to the present invention, Fig. 7 illustrates the combined use of HMM techniques, applied during a first step (preprocessing), and
ANN techniques, applied during a second step (post-processing), for the purposes of automated speech recognition. A speech time series signal 60, derived from unknown verbal utterances, is first divided into 10 millisecond segments as depicted by the time bars 62. A number of these signal segments, in accordance with well known HMM techniques, are then assigned to the nodes 64 of the HMM 66 and time-aligned with reference signal segments (not shown), derived from a vocabulary or library of known verbal utterances, to produce a scored (ranked) list of tentatively identified words or phonemes corresponding to the time-aligned segments. This list is said to be scored because each tentatively identified word or phoneme is ranked by order of its probability of being correct. This scored list of tentatively identified words or phonemes (the linear finite state network of the HMM 66) is then verified and/or rescored (reranked), as necessary, by applying each node 64 to a corresponding input node 68 of the ANN 70.
The output of each ANN input node 68 is then interconnected to each ANN hidden node 72, which is in turn interconnected to each ANN output node 74 in accordance with well known ANN techniques. Although these types of ANN's typically require even more computational capacity than HMM's, the hidden nodes allow the ANN to extract progressively more complex features from its inputs, thereby enhancing its ability for matching (and thereby detecting) ambiguous speech patterns, so as to produce a series of output signals corresponding to a finalized list of words determined to be the most likely correct interpretation of the unknown verbal utterances.
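The two-step arrangement can be sketched as follows (the `(word, score)` pair representation and the `ann_output` mapping are assumptions made for illustration, not the disclosed network itself):

```python
def rescore_candidates(scored_list, ann_output):
    # scored_list : the HMM's ranked list of (word, hmm_score) pairs
    # ann_output  : maps each tentatively identified word to the
    #               activity of its corresponding ANN output node
    # Returns the finalized list, reranked by the network's scores.
    rescored = [(word, ann_output[word]) for word, _ in scored_list]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored
```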
In accordance with the present invention, application of the ANN 70 can occur either (1) before the entire utterance has been completely preprocessed by the HMM 66, or (2) after the entire utterance has been completely preprocessed by the HMM 66. It may also be applied as a preprocessing step to the HMM 66 itself. In the preferred embodiment of the present invention, illustrated in Fig. 8, ANN processing is applied to each new word as it is completed and before a new word is started by the HMM. As each new word is completed by the microprocessing system (not shown) operating the HMM, the completed word leaves the top of the processing stack of the microprocessor and assumes a second place position in the stack. The ANN is applied to completed words at this second place position in the stack because the beginning and ending points of the phoneme or word can only first be accurately determined at this position.
Taking words or phonemes directly off the top of the stack may not always be preferable. For instance, if the entire utterance being analyzed only contains one word (or phoneme), the top of the stack will be occupied by a "silence token" during the final moments of processing, while the second position on the stack will be occupied by the single word or phoneme required for ANN application. This technique also improves the general response time of the automated speech recognition system as a whole because application of the ANN can be limited to only those speech segments tentatively identified by the HMM as ambiguous.
A first alternative embodiment of the present invention is depicted in
Fig. 9. Microphone 22 detects the unknown utterance for processing by the signal or sound processor 80, which delays outputting a speech time series signal until the end of the utterance is detected, which is perhaps specified by a predetermined period of silence. The entire utterance 82 is then output by the signal processor 80 to the HMM 84. The HMM 84 time-aligns the speech data of the entire utterance to a fixed format. The inputs of the ANN 86 are taken from the fixed format of the HMM state machine models with pointers supplied by the HMM models to the original speech. Although the ANN would normally be trained (programmed) to recognize phonetic similarities between unknown and known utterances, in this particular application, the ANN's must be trained to recognize similarities between entire utterance units.
The "forced fit" technique for time alignment between word templates and speech input is used to train the ANN for each word in the vocabulary. This technique assigns an HMM word model to each segment of the input speech to achieve the best fit between the reference input and the speech input. Each reference word in the utterance is decoded into its finite state machine representation and linked to its neighbors. Reference words may be transcribed into their phonetic spellings and applied to the HMM, or word level HMM models can be used.
Although the present invention has been described with reference to
Figs. 1-9 and with emphasis on a particular ordered combination of HMM and
ANN techniques, it should be understood that the figures are for illustration only and should not be taken as limitations upon the invention. It is contemplated that many changes and modifications may be made by one of ordinary skill in the art to the elements, process and arrangement of steps of the process of the invention without departing from the spirit and scope of the invention as disclosed above. For example, the present invention need not be limited to applications for recognizing speech alone, and could also be used to detect similarities between a variety of different inputs and reference templates.
Claims (35)
1. An automated speech recognizer, comprising:
means for time-aligning segments of a signal derived from unknown utterances with one or more reference signal segments from a vocabulary of known utterances so as to produce a scored list of tentatively identified utterances believed to correspond to said unknown utterances; and
means for verifying and rescoring at least a portion of said scored list to produce a final list of positively identified utterances which correctly interpret said unknown utterances.
2. An automated speech recognizer as recited in claim 1, wherein said time-aligning means includes a dynamic time alignment system for correcting speaking rate variations between said unknown utterances and said vocabulary of known utterances.
3. An automated speech recognizer as recited in claim 2, wherein said final list is produced by applying at least a portion of said scored list to individual input nodes of a connectionist model trained to recognize phonetic similarities between said tentatively identified utterances and said vocabulary of known utterances.
4. An automated speech recognizer as recited in claim 3, wherein said scored list includes one or more ambiguously identified utterances and wherein said portion of said scored list is only comprised of said ambiguously identified utterances.
5. An automated speech recognizer as recited in claim 3, wherein said connectionist model is an artificial neural net.
6. An automated speech recognizer as recited in claim 2, wherein said dynamic time alignment system is a hidden Markov model state machine.
7. An automated speech recognizer as recited in claim 6, wherein said scored list is comprised of a series of states within said state machine and wherein said final list is produced by applying at least a portion of said states to individual input nodes of a connectionist model trained to recognize phonetic similarities between said tentatively identified utterances and said vocabulary of known utterances.
8. An automated speech recognizer as recited in claim 7, wherein said series of states includes one or more states corresponding to ambiguously identified utterances and wherein only said states corresponding to said ambiguously identified utterances are applied to said connectionist model.
9. An automated speech recognizer as recited in claim 7, wherein said connectionist model is an artificial neural net.
10. An automated speech recognizer as recited in claim 1, wherein said final list is produced by applying at least a portion of said scored list to individual input nodes of a connectionist model trained to recognize phonetic similarities between said tentatively identified utterances and said vocabulary of known utterances.
11. An automated speech recognizer as recited in claim 10, wherein said connectionist model is an artificial neural net.
12. An automated speech recognizer as recited in claim 11, wherein said time-aligning means includes a hidden Markov model state machine for correcting speaking rate variations between said unknown utterances and said vocabulary of known utterances.
13. A method for recognizing unknown verbal utterances, comprising the steps of:
time-aligning unknown segments of a signal derived from the unknown verbal utterances with one or more reference signal segments from a vocabulary of known verbal utterances to correct for speaking rate variations between said unknown verbal utterances and said vocabulary of known verbal utterances so as to produce a scored list of tentatively identified verbal utterances believed to correspond to said unknown verbal utterances; and
verifying and/or rescoring said scored list to produce a final list of verbal utterances which correctly interpret said unknown verbal utterances.
14. A method for recognizing unknown verbal utterances as recited in claim 13, wherein said time-aligning step includes the steps of:
comparing said unknown segments to one or more of said reference signal segments from said vocabulary so as to produce a comparative list of said reference signal segments corresponding to each of said unknown segments; and
pruning said reference segments from each of said comparative lists which do not closely resemble said unknown segments to develop a single list comprised of said reference segments which most closely comparatively resemble said unknown segments.
15. A method for recognizing unknown verbal utterances as recited in claim 14, wherein said verifying and rescoring step includes the step of applying at least a portion of said scored list to individual input nodes of a connectionist model trained to recognize phonetic similarities between said tentatively identified verbal utterances and said vocabulary of known verbal utterances.
16. A method for recognizing unknown verbal utterances as recited in claim 13, wherein said time-aligning step includes the steps of:
applying said unknown segments to individual states of a hidden Markov model state machine;
following the time evolution of said states representing said unknown segments through said state machine; and
producing said scored list of tentatively identified verbal utterances from time-evolved states of said state machine.
17. A method for recognizing unknown verbal utterances as recited in claim 16, wherein said verifying and rescoring step includes the step of applying at least a portion of said scored list to individual input nodes of a connectionist model trained to recognize phonetic similarities between said scored list and said vocabulary.
18. An automated speech recognizer, comprising:
means for time-aligning a signal derived from an unknown verbal utterance with one or more reference formats from a vocabulary of known verbal utterances so as to produce a scored list of tentatively identified verbal utterances believed to correspond to said unknown verbal utterance; and
means for verifying and rescoring at least a portion of said scored list to produce a positively identified verbal utterance which correctly interprets said unknown verbal utterance.
19. An automated speech recognizer as recited in claim 18, wherein said time-aligning means includes a dynamic time alignment system for correcting speaking rate variations between said unknown verbal utterance and said vocabulary of known verbal utterances.
20. An automated speech recognizer as recited in claim 19, wherein said positively identified verbal utterance is produced by applying at least a portion of said scored list to individual input nodes of a connectionist model trained to recognize phonetic similarities between said tentatively identified verbal utterances and said vocabulary.
21. An automated speech recognizer as recited in claim 20, wherein said scored list includes one or more ambiguously identified verbal utterances and wherein said portion of said scored list is only comprised of said ambiguously identified verbal utterances.
22. An automated speech recognizer as recited in claim 20, wherein said connectionist model is an artificial neural net.
23. An automated speech recognizer as recited in claim 19, wherein said dynamic time alignment system is a hidden Markov model state machine.
24. An automated speech recognizer as recited in claim 23, wherein said scored list is comprised of a series of states within said state machine and wherein said positively identified verbal utterance is produced by applying at least a portion of said states to individual input nodes of a connectionist model trained to recognize phonetic similarities between said tentatively identified verbal utterance and said vocabulary.
25. An automated speech recognizer as recited in claim 24, wherein said series of states includes one or more states corresponding to ambiguously identified verbal utterances and wherein only said states corresponding to said ambiguously identified verbal utterances are applied to said connectionist model.
26. An automated speech recognizer as recited in claim 24, wherein said connectionist model is an artificial neural net.
27. An automated speech recognizer as recited in claim 18, wherein said positively identified verbal utterance is produced by applying at least a portion of said scored list to individual input nodes of an artificial neural net system trained to recognize phonetic similarities between said tentatively identified verbal utterances and said vocabulary.
28. An automated speech recognizer as recited in claim 27, wherein said time-aligning means includes a hidden Markov model state machine for correcting speaking rate variations between said unknown verbal utterance and said vocabulary of known verbal utterances.
29. A pattern recognition system, comprising:
means for time-aligning an unknown pattern with one or more reference patterns from a library of known patterns so as to produce a tentative list of identified patterns believed to correspond to said unknown pattern; and
means for verifying and/or rescoring said tentative list of patterns to produce a positively identified pattern which correctly corresponds to said unknown pattern.
30. A pattern recognition system as recited in claim 29, wherein said positively identified pattern is produced by applying each pattern from said tentative list to individual input nodes of a connectionist model trained to recognize similarities between said tentative list of patterns and said library.
31. A pattern recognition system as recited in claim 30, wherein said connectionist model is an artificial neural net.
32. A pattern recognition system as recited in claim 29, wherein said time-aligning means includes a hidden Markov model state machine for following the time evolution of said unknown pattern and producing said tentative list.
33. A pattern recognition system as recited in claim 32, wherein said positively identified pattern is produced by applying each pattern from said tentative list to individual input nodes of a connectionist model trained to recognize similarities between said tentative list and said library.
34. A pattern recognition system as recited in claim 33, wherein the connectionist model is an artificial neural net.
35. An automated speech recognizer, substantially as hereinbefore described with reference to the accompanying drawings.
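The second-pass verification stage recited in claims 4, 8, 21, and 25 — applying only the ambiguously identified candidates from the scored list to the input nodes of a trained connectionist model — might be sketched as below. The single sigmoid layer stands in for a trained artificial neural net; the weights, bias, ambiguity margin, and all names are hypothetical placeholders for the example, not the claimed network.

```python
import numpy as np

def rescore(scored_list, weights, bias, margin=1.0):
    """Verify/rescore first-pass candidates. Only ambiguous candidates
    (scores within `margin` of the best score, lower = better) are fed
    to the connectionist model; `weights`/`bias` stand in for a trained
    single sigmoid layer sized to the number of ambiguous candidates."""
    best = min(score for _, score in scored_list)
    ambiguous = [(w, s) for w, s in scored_list if s - best <= margin]
    if len(ambiguous) == 1:
        return ambiguous[0][0]  # unambiguous: no rescoring needed
    # apply each ambiguous candidate's first-pass score to an input node
    x = np.array([score for _, score in ambiguous])
    activations = 1.0 / (1.0 + np.exp(-(weights @ x + bias)))  # sigmoid
    return ambiguous[int(np.argmax(activations))][0]
```

With close first-pass scores the toy network arbitrates; with a clear winner the network is bypassed entirely, mirroring the claims' restriction of rescoring to ambiguous utterances.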
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US46770690A | 1990-01-18 | 1990-01-18 |
Publications (2)
Publication Number | Publication Date |
---|---|
GB9026766D0 GB9026766D0 (en) | 1991-01-30 |
GB2240203A true GB2240203A (en) | 1991-07-24 |
Family
ID=23856794
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB9026766A Withdrawn GB2240203A (en) | 1990-01-18 | 1990-12-10 | Automated speech recognition system |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2240203A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0242743A1 (en) * | 1986-04-25 | 1987-10-28 | Texas Instruments Incorporated | Speech recognition system |
GB2230370A (en) * | 1989-04-12 | 1990-10-17 | Smiths Industries Plc | Speech recognition |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5502791A (en) * | 1992-09-29 | 1996-03-26 | International Business Machines Corporation | Speech recognition by concatenating fenonic allophone hidden Markov models in parallel among subwords |
EP0590925A1 (en) * | 1992-09-29 | 1994-04-06 | International Business Machines Corporation | Method of speech modelling and a speech recognizer |
EP0592150A1 (en) * | 1992-10-09 | 1994-04-13 | AT&T Corp. | Speaker verification |
EP0601778A1 (en) * | 1992-12-11 | 1994-06-15 | AT&T Corp. | Keyword/non-keyword classification in isolated word speech recognition |
EP0623914A1 (en) * | 1993-05-05 | 1994-11-09 | CSELT Centro Studi e Laboratori Telecomunicazioni S.p.A. | Speaker independent isolated word recognition system using neural networks |
US5566270A (en) * | 1993-05-05 | 1996-10-15 | Cselt-Centro Studi E Laboratori Telecomunicazioni S.P.A. | Speaker independent isolated word recognition system using neural networks |
US6125284A (en) * | 1994-03-10 | 2000-09-26 | Cable & Wireless Plc | Communication system with handset for distributed processing |
WO1996041333A1 (en) * | 1995-06-07 | 1996-12-19 | Dragon Systems, Inc. | Systems and methods for word recognition |
US5680511A (en) * | 1995-06-07 | 1997-10-21 | Dragon Systems, Inc. | Systems and methods for word recognition |
GB2331826B (en) * | 1997-12-01 | 2001-12-19 | Motorola Inc | Context dependent phoneme networks for encoding speech information |
GB2331826A (en) * | 1997-12-01 | 1999-06-02 | Motorola Inc | Context dependent phoneme networks for encoding speech information |
EP0955628A3 (en) * | 1998-05-07 | 2000-07-26 | CSELT Centro Studi e Laboratori Telecomunicazioni S.p.A. | A method of and a device for speech recognition employing neural network and Markov model recognition techniques |
EP0955628A2 (en) * | 1998-05-07 | 1999-11-10 | CSELT Centro Studi e Laboratori Telecomunicazioni S.p.A. | A method of and a device for speech recognition employing neural network and Markov model recognition techniques |
US6185528B1 (en) * | 1998-05-07 | 2001-02-06 | Cselt - Centro Studi E Laboratori Telecomunicazioni S.P.A. | Method of and a device for speech recognition employing neural network and markov model recognition techniques |
GB2347253B (en) * | 1999-02-23 | 2001-03-07 | Motorola Inc | Method of selectively assigning a penalty to a probability associated with a voice recognition system |
US6233557B1 (en) | 1999-02-23 | 2001-05-15 | Motorola, Inc. | Method of selectively assigning a penalty to a probability associated with a voice recognition system |
EP2685452A1 (en) * | 2012-07-13 | 2014-01-15 | Samsung Electronics Co., Ltd | Method of recognizing speech and electronic device thereof |
US20140019131A1 (en) * | 2012-07-13 | 2014-01-16 | Korea University Research And Business Foundation | Method of recognizing speech and electronic device thereof |
CN103544955A (en) * | 2012-07-13 | 2014-01-29 | 三星电子株式会社 | Method of recognizing speech and electronic device thereof |
CN103544955B (en) * | 2012-07-13 | 2018-09-25 | 三星电子株式会社 | Identify the method and its electronic device of voice |
CN106023995A (en) * | 2015-08-20 | 2016-10-12 | 漳州凯邦电子有限公司 | Voice recognition method and wearable voice control device using the method |
Also Published As
Publication number | Publication date |
---|---|
GB9026766D0 (en) | 1991-01-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |