GB2240203A - Automated speech recognition system

Info

Publication number: GB2240203A
Authority: GB (United Kingdom)
Legal status: Withdrawn
Application number: GB9026766A
Other versions: GB9026766D0 (en)
Inventors: George M White, Richard Alen Parfitt
Current assignee: Apple Inc
Original assignee: Apple Computer Inc
Application filed by Apple Computer Inc

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/14 - Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 - Hidden Markov Models [HMMs]


Abstract

A speech recognition system reduces ambiguity by taking segments of electrical signals derived from unknown verbal utterances and time-aligning them with reference signal segments derived from known verbal utterances using Hidden Markov Model techniques, 66, so as to produce a scored list of tentatively identified words or phonemes corresponding to the time-aligned segments. The list of tentatively identified words or phonemes is then rescored using artificial neural net techniques, 70, to produce an output signal corresponding to a finalized list of words determined to be the most likely correct interpretation of the unknown verbal utterances. Alternatively, only ambiguous portions of the list of words tentatively identified by HMM techniques are input to the artificial neural net for rescoring.

Description

AUTOMATED SPEECH RECOGNITION SYSTEM

Field of the Invention

The present invention relates generally to speech recognition systems, and more particularly to an automated speech recognition system which merges Hidden Markov Models (HMMs) with Artificial Neural Nets (ANNs), and which, more precisely, first applies HMMs to time-align speech to standard templates and then applies ANNs to identify speech sounds.
Brief Description of the Prior Art

A Hidden Markov Model (HMM) is a finite state machine in which state transitions are controlled by doubly stochastic transition links. HMMs are widely used in prior art speech recognition systems. In such systems, one transition link probability is a state transition probability (or static hidden probability), which influences the time warping ability of the state machine, while the other transition link probability is a computed probability (or dynamic observation probability), which determines speech sound similarity. HMMs are used to align unknown speech utterances in time with known speech utterances. HMMs make these time alignments by following the time evolution of finite state machines representing the unknown utterances, which involves computing the probability of being at many states inside the finite state machine at each time interval of the unknown utterance, thereby requiring significant computational time and memory.
Information about the use of HMMs in speech recognition systems and various solutions for determining the above-mentioned probabilities are presented in Furui, S., 1989, Digital Speech Processing, Synthesis, and Recognition, Appendix D, HMM Procedures, Marcel Dekker, Inc., pp. 341-347; Levinson et al., U.S. Pat. No. 4,587,670, issued May 6, 1986; Bahl et al., European Pat. No. 238,693, published September 30, 1987; Bahl et al., European Pat. No. 238,697, published September 30, 1987; Bahl et al., U.S. Pat. No. 4,741,036, issued April 26, 1988; Bahl et al., Canadian Pat. No. 1,236,578, issued May 10, 1988; Bahl et al., U.S. Pat. No. 4,759,068, issued July 19, 1988; Juang et al., U.S. Pat. No. 4,783,804, issued November 8, 1988; Bahl et al., U.S. Pat. No. 4,819,271, issued April 4, 1989; Nishimura, European Pat. No. 312,209, published April 19, 1989; Bahl et al., U.S. Pat. No. 4,827,521, issued May 2, 1989; Kuroda et al., U.S. Pat. No. 4,829,577, issued May 9, 1989; and Levinson, U.S. Pat. No. 4,852,180, issued July 25, 1989.
Artificial neural nets (ANNs), also called "connectionist models", have been used for speech recognition because of the ability of such systems to recognize short time-registered speech patterns. In such applications, spectrum-analyzed speech signals are typically applied to the input nodes of the ANN. The ANN itself is composed of many simple linear computational nodes operating in parallel and arranged in patterns which were originally inspired by the biological neural nets of the human brain. Each node sums its inputs and runs its output through some type of nonlinearity, such as a limiter (a logistic function). These nodes interact with other nodes using weighted connections. Each node has a state or activity level which is determined by the weighted inputs received from other nodes in the network. Long-term knowledge within the network is derived from the respective input weights of connections between nodes. Learning consists of changing the weights of the connections. Short-term knowledge is typically determined by the states of the nodes.
One type of ANN which has proven useful for speech recognition purposes is the canonical ANN or Multi-Layered Perceptron (MLP). These are feedforward nets with one or more layers of nodes between the input and output nodes. The intermediate layers are composed of hidden nodes having unspecified desired states, which typically make learning harder for the ANN because the states of the hidden nodes must be determined as part of the learning procedure. Although these types of ANNs typically require even more computational capacity than HMMs, the hidden nodes allow the ANN to extract progressively more complex features from its inputs, thereby enhancing its ability for matching (and thereby detecting) ambiguous speech patterns. Multi-Layered Perceptrons are explained in both Furui, S., 1989, Digital Speech Processing, Synthesis, and Recognition, Appendix E, Neural Nets, Marcel Dekker, Inc., pp. 349-53; and H. Bourlard & N. Morgan (July 1989), Merging Multilayer Perceptrons and Hidden Markov Models: Some Experiments in Continuous Speech Recognition, International Computer Science Institute, TR-89-033. The use of similar types of connectionist systems is also disclosed in A. Waibel, H. Sawai & K. Shikano (1989), Consonant Recognition by Modular Construction of Large Phonemic Time-Delay Neural Networks, IEEE, Acoustic, Speech and Signal Processing Society, pp. 112-15; and H. Sawai et al. (1989), Spotting Japanese CV Syllables and Phonemes Using Time-Delay Neural Networks, IEEE, Acoustic, Speech and Signal Processing Society, pp. 25-28.
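By way of illustration only, the node computation described above can be sketched in Python. This is a hypothetical fragment, not taken from the patent or the cited references; the function names (logistic, node_output, mlp_forward) and the layer representation are placeholders:

    import math

    def logistic(x):
        # The "limiter" nonlinearity through which each node runs its output.
        return 1.0 / (1.0 + math.exp(-x))

    def node_output(inputs, weights, bias):
        # Each node sums its weighted inputs and applies the nonlinearity;
        # long-term knowledge lives in the connection weights, short-term
        # knowledge in the resulting node states.
        return logistic(bias + sum(w * x for w, x in zip(weights, inputs)))

    def mlp_forward(x, layers):
        # A feedforward Multi-Layered Perceptron: 'layers' is a list of
        # (weight_matrix, bias_vector) pairs, and the intermediate entries
        # play the role of the hidden nodes between input and output.
        for W, b in layers:
            x = [node_output(x, row, bias) for row, bias in zip(W, b)]
        return x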
Although HMM and ANN techniques have previously been used in speech recognition systems, the two techniques have only recently been combined for such a purpose. Bourlard & Morgan (July 1989) describe integrating a Multilayer Perceptron (MLP) into an HMM. The MLP is used to estimate probabilities for pattern classifications, which are then incorporated by the HMM to segment continuous speech into a succession of words. M. Franzini, M. Witbrock & K. Lee (June 1989), Speaker Independent Recognition of Connected Utterances Using Recurrent and Non-recurrent Neural Networks, IJCNN, pp. II-1 to II-7, also describe a combination of ANN and HMM techniques for speech recognition. As in Bourlard & Morgan (July 1989), Franzini, Witbrock & Lee (June 1989) only describe a method for applying ANN techniques before HMM techniques.
No prior art speech recognition systems disclose time-aligning segments of unknown speech using HMM techniques before inputting such segments to an ANN. Instead, prior art systems apply the outputs of the ANN directly to transition arcs between nodes of the HMM, thereby failing to take advantage of the large benefits to be gained from performing HMM analysis before ANN techniques in a speech recognition application, as will be described below.
Summary of the Invention

A preferred embodiment of the present invention comprises a method and apparatus for reducing speech sound ambiguity in speech recognition systems by taking segments of electrical signals from unknown verbal utterances and time-aligning them with reference signal segments from known verbal utterances using Hidden Markov Modeling techniques, so as to produce a scored list of tentatively identified words or phonemes corresponding to the time-aligned segments. The list of tentatively identified words or phonemes is then rescored using artificial neural net techniques to produce an output signal corresponding to a finalized list of words determined to be the most likely correct interpretation of the unknown verbal utterances. Alternatively, only ambiguous portions of the list of words tentatively identified by the HMM techniques would be input to the artificial neural net for rescoring.
Brief Description of the Drawing

Fig. 1 is an illustration of a Hidden Markov Model phoneme template in accordance with the preferred embodiment of the present invention;
Fig. 2 is an illustration of processed unknown speech signals being divided into HMM nodes in accordance with the preferred embodiment of the present invention;
Fig. 3a is an illustration of an unknown utterance signal in comparison to a known utterance signal;
Fig. 3b is an illustration of an undesirable method for time-aligning the unknown and known utterance signals of Fig. 3a;
Fig. 3c is an illustration of a method of the preferred embodiment of the present invention for time-aligning the unknown and known utterance signals of Fig. 3a;
Fig. 4 is an illustration of a matrix of distances for a dynamic time warping based system in accordance with the preferred embodiment of the present invention;
Fig. 5 is an illustration of a matrix of accumulated distances for the dynamic time warping based system of Fig. 4;
Fig. 6 is an illustration of a best path boundary for the matrix of accumulated distances of Fig. 5, in accordance with the preferred embodiment of the present invention;
Fig. 7 is a partial block diagram illustrating the two-step process for combining HMM and ANN techniques for automated speech recognition systems in accordance with the preferred embodiment of the present invention;
Fig. 8 is a further illustration of the two-step process of Fig. 7; and
Fig. 9 is a block diagram illustrating a first alternative embodiment of the speech recognition system of the present invention.
Detailed Description of the Preferred Embodiment

A Hidden Markov Model (HMM) is a finite state machine in which state transitions are controlled by doubly stochastic transition links. These links are said to be doubly stochastic because each link has two probabilities assigned to it: a state transition probability (or static hidden probability), which can be used to influence the time warping ability of a speech recognition system; and a computed probability (or dynamic observation probability), which can be used to determine speech sound similarity. Fig. 1 depicts a topology of an HMM representing a phoneme.
Arbitrary topologies are possible in generic HMMs; however, the art associated with using HMMs in speech recognition systems lies in determining what the network topologies should be for particular phonemes. In Fig. 1, the state machine network, shown generally as 8, is comprised of a number of nodes 10 connected by a number of state transition links 12. Each of these transition links 12 includes a dynamic observation probability, shown generally as 14, and a static hidden probability 16. The dynamic observation probability is in turn comprised of an unknown sound element 18, derived from the speech signals being analyzed, and a reference sound element 20, to which the unknown sound element 18 is to be compared.
Fig. 2 illustrates a representation of an HMM network 8 as it is intended to be used in relation to the present invention. Unknown utterances detected by the microphone 22 are converted into unknown speech signals 24 by speech processor 26. Speech signals 24 output by the speech processor 26 are then separated into 10 to 20 millisecond segments 28 that are assigned to individual nodes 10 of the HMM network 8 to facilitate the isomorphic performance of the time alignment process. An HMM network time-aligns segments of unknown speech by following the time evolution of finite state machines representing the utterances. This involves computing the probability of being at each and every state inside the utterances at each of the 10 millisecond time intervals, as well as requiring self-looping links 30 and skipping links 32 so as to allow for speaking rate variations.
The dynamic observation probabilities of the network are determined either directly, from measuring the speech sound similarity between two speech time slices, or indirectly, by encoding the speech signals with cepstral, LPC, Fourier, or band pass filter coefficients. Regardless of the particular encoding scheme utilized, each time slice 28 will typically be represented by a vector of 12 to 26 numbers, or by scalar numbers specifying vector quantization (VQ) code labels. In mathematical terms, the probability of a node Nj at time t is equal to the largest of the probabilities associated with each of the transition arcs terminating at node Nj, where the probability of each transition arc is equal to Hij * Dj * Pi(t-1), where Hij is the static hidden probability of the arc, Dj is the dynamic observation probability, and Pi(t-1) is the probability, at time t-1, of the node Ni from which the transition arc originates. Although the encoding of reference speech models is described above, language models and other types of reference information about speech could also be encoded in the form of the finite state machines of the HMM.
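This recursion can be sketched in Python, by way of illustration only (the function name hmm_step and the arc-list representation of the topology are hypothetical placeholders, not the patent's implementation):

    def hmm_step(prev, arcs, D, t):
        # One time step of the recursion: the probability of node Nj at
        # time t is the largest product Hij * Dj * Pi(t-1) taken over the
        # transition arcs (i, j, Hij) terminating at node j. Self-looping
        # links have i == j, and skipping links jump over a node.
        cur = [0.0] * len(prev)
        for i, j, h in arcs:
            p = h * D[j][t] * prev[i]
            if p > cur[j]:
                cur[j] = p
        return cur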
During the recognition process, the time evolution through the state probabilities is recorded as each unknown utterance is evaluated. For each time interval, the probability of being in a state is the product of the probability of being in a legal precursor state times the probability that the transition to the current node is compatible with acoustic measurements.
The process of determining the time evolution through the state probabilities can be better understood with reference now to Figs. 3 to 6.
Fig. 3a illustrates a comparison between an unknown utterance signal 34 and a reference (known) utterance signal 36 as detected by the microphone 22.
In this particular example, since the reference utterance 36 was enunciated in less time than the unknown utterance 34, the two signals are not aligned in time. Stretching the shorter utterance, reference utterance 36, to fit the longer utterance, unknown utterance 34, produces an undesirable mismatched middle region 38 between the two utterances, as illustrated in Fig. 3b. In accordance with well known Dynamic Time Warping (DTW) and HMM techniques, as will be further explained below, these two signals can be time-aligned, in a manner similar to that illustrated in Fig. 3c, so that a subsequent proper comparison of the two signals can be achieved. In Fig. 3c, the alignment pointers 40 indicate the portions of the signals which are comparable in time.
Although HMM techniques are more general than DTW techniques, HMM handles time alignment in substantially the same manner as DTW.
With regard to the present invention, the ANN will work equally well when the inputs to the ANN are time-aligned either by HMM or by DTW based systems. A DTW based system solves the time alignment problem by creating a matrix of distances, shown generally as 41, between the two utterances, as illustrated in Fig. 4, and computing the smallest accumulated distance, the best path 42, between the beginning point 44 and the ending point 46. To determine the best path 42 through the accumulated distances, it is first necessary to compute the distance value D(i,j) corresponding to the various entries 48 of the distance matrix 41. The entries 48 relate to the distance between each time interval of the unknown utterance (i) 34 and each time interval of the reference utterance (j) 36.
The distance value D(i,j) of an entry 48 is inversely proportional to the similarity of the utterance segments being compared in the matrix 41.
The distance matrix 41 is then converted into a matrix of accumulated distance values AD(i,j), as illustrated by the partially completed accumulated distance matrix 50 of Fig. 5. The accumulated distance values AD(i,j) are determined for each entry 52 by the following equation: AD(i,j) = D(i,j) + MIN[AD(i-1,j-1), AD(i-1,j), AD(i,j-1)]. The matrix 50 is filled in column by column, from the bottom to the top within each column. Using entry AD(3,3) as an example, the accumulated distance value for that entry would be determined by taking the original distance value 54 calculated for D(3,3), as indicated to the bottom right of each accumulated distance entry 52, and adding the minimum accumulated distance value from the three preceding entries (AD(2,3), AD(2,2), AD(3,2)), as illustrated by the three direction arrows, to derive an accumulated distance value of eleven for that entry. After determining the accumulated distance values for the entire matrix 50, the best path 42 through the original distance matrix 41 is determined by back-tracking through each of the lowest accumulated distance values from the end point to the beginning point of the matrix 50.
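Purely as an illustration (a hypothetical Python sketch; the function name dtw and the list-of-lists matrix representation are placeholders, not the patent's implementation), the accumulation and back-tracking just described might read:

    def dtw(dist):
        # dist[i][j] is D(i,j), the distance between time interval i of
        # the unknown utterance and interval j of the reference utterance.
        n, m = len(dist), len(dist[0])
        INF = float("inf")
        ad = [[INF] * m for _ in range(n)]
        ad[0][0] = dist[0][0]
        for i in range(n):
            for j in range(m):
                if i == 0 and j == 0:
                    continue
                # AD(i,j) = D(i,j) + MIN[AD(i-1,j-1), AD(i-1,j), AD(i,j-1)]
                best = min(ad[i - 1][j - 1] if i > 0 and j > 0 else INF,
                           ad[i - 1][j] if i > 0 else INF,
                           ad[i][j - 1] if j > 0 else INF)
                ad[i][j] = dist[i][j] + best
        # Back-track from the ending point to the beginning point through
        # the lowest accumulated distance values to recover the best path.
        path, (i, j) = [(n - 1, m - 1)], (n - 1, m - 1)
        while (i, j) != (0, 0):
            steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
            i, j = min((s for s in steps if s[0] >= 0 and s[1] >= 0),
                       key=lambda s: ad[s[0]][s[1]])
            path.append((i, j))
        return ad[n - 1][m - 1], path[::-1]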
As an alternative to explicitly back-tracking through the accumulated distance matrix 50 after the accumulated distance value of the end point has been determined, a list of the various distance entries 48 visited on each of a number of best paths could simultaneously be carried along as the computations proceed, a process known as "Viterbi decoding". The Viterbi algorithm produces a list of distance entries, or states, on paths that lead to non-pruned active states (states which are still candidates for inclusion in one or more best paths), such that for any given active state the optimal path between it and the initial state is given. When processing of the matrix 41 has been completed, the list with the best score contains the best path back to the beginning point 44. The various best paths are not determined by tracking every possible path that could be followed between the beginning point and the end point. Bad paths are determined and their nodes or states are pruned away at every opportunity so as to reduce the amount of computational time and memory required to process the utterance.
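One common pruning rule, a beam threshold over the active states, can be sketched hypothetically as follows (the function name prune, the (score, path) tuple layout, and the beam value are illustrative assumptions, not prescribed by the patent):

    def prune(active, beam=1e-3):
        # Each active state carries its Viterbi list: the path of distance
        # entries that led to it. States whose score falls too far below
        # the current best are pruned so their paths are never extended.
        if not active:
            return active
        best = max(score for score, path in active)
        return [(score, path) for score, path in active
                if score >= best * beam]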
One method for imposing such control is to impose a path boundary, starting at the beginning point and ending at the ending point of the accumulated distance matrix, defining the possible best path states and the states which could not possibly be included in the best path. One such boundary is illustrated in Fig. 6, which depicts the matrix 50 of Fig. 5 having a central area bounded by a legal path region 56. The parallelogram of the region 56 is bounded by lines having slopes of 1/2 and 2. This forced fit technique for pruning away bad paths can also be applied to HMM systems, which is one reason why DTW and HMM outputs will work equally well as inputs to the ANN. A path boundary is imposed on the state machine template of the HMM by limiting self-looping transitions and skipping transitions from occurring two times in succession, thereby creating boundaries having effective slopes of 1/2 and 2, respectively.
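A hypothetical sketch of such a legal path region test (the function name in_legal_region is a placeholder; n and m are the dimensions of the accumulated distance matrix):

    def in_legal_region(i, j, n, m):
        # Entry (i, j) lies inside the parallelogram bounded by lines of
        # slope 1/2 and 2 drawn through the beginning point (0, 0) and
        # the ending point (n-1, m-1) of the matrix.
        from_start = (j <= 2 * i) and (i <= 2 * j)
        to_end = ((m - 1 - j) <= 2 * (n - 1 - i)) and \
                 ((n - 1 - i) <= 2 * (m - 1 - j))
        return from_start and to_end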
When an HMM is utilized for speech recognition, it is typically structured as a series of nodes in a linear finite state network, wherein each one of the nodes is comprised of a number of phones, words or utterances.
The HMM proceeds by creating a list of probable word sequences based on acoustic-phonetic analysis and on the structure of the HMM language model.
For time alignment purposes, the HMM maintains a list of time markers or pointers to the most probable state sequences, which may be used, for example, to indicate word sequence candidates for output from the recognizer or to indicate the time at which certain phone sequences occurred. However, many prior art HMM systems do not keep pointers to the original speech, thereby limiting such systems' ability to determine the locations of elemental phone states within words of the input utterance, as will be further explained below. It is important to note that in the present invention, such pointers are provided and saved in the form of Viterbi lists with each tentatively identified list of words or phonemes for application to the ANN. These Viterbi lists of word sequences are developed in parallel, and when the end of the utterance string has been reached, the leading list of words is accepted as the correct sequence and its time markers are accepted as correct locators of word boundaries in time.
Details of HMM procedures for achieving dynamic time alignment can be obtained through review of the prior art references described above.
What is important with regard to the use of HMM techniques in the present invention is that HMM techniques provide a variety of ways in which to determine the time evolution of a series of states, so as to determine the time at which the states occupied positions on a time-aligned path. The Viterbi search is but one computationally efficient way to determine the optimal time alignment of speech to the states in a finite state machine, such as a Hidden Markov Model; many other well known techniques could also be used. Once again, however, it is important to note that HMM formalism assimilates the best known technique for time registration between the internal states of a word model or subword model and an associated speech input.
With regard to the present invention, Fig. 7 illustrates the combined use of HMM techniques, applied during a first step (preprocessing), and ANN techniques, applied during a second step (post-processing), for the purposes of automated speech recognition. A speech time series signal 60, derived from unknown verbal utterances, is first divided into 10 millisecond segments as depicted by the time bars 62. A number of these signal segments, in accordance with well known HMM techniques, are then assigned to the nodes 64 of the HMM 66 and time-aligned with reference signal segments (not shown), derived from a vocabulary or library of known verbal utterances, to produce a scored (ranked) list of tentatively identified words or phonemes corresponding to the time-aligned segments. This list is said to be scored because each tentatively identified word or phoneme is ranked by order of its probability of being correct. This scored list of tentatively identified words or phonemes (the linear finite state network of the HMM 66) is then verified and/or rescored (reranked), as necessary, by applying each node 64 to a corresponding input node 68 of the ANN 70.
The output of each ANN input node 68 is then interconnected to each ANN hidden node 72, which is in turn interconnected to each ANN output node 74 in accordance with well known ANN techniques. Although these types of ANNs typically require even more computational capacity than HMMs, the hidden nodes allow the ANN to extract progressively more complex features from its inputs, thereby enhancing its ability for matching (and thereby detecting) ambiguous speech patterns, to produce a series of output signals corresponding to a finalized list of words determined to be the most likely correct interpretation of the unknown verbal utterances.
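The overall two-step flow of Figs. 7 and 8 can be summarized in a hypothetical sketch (the names recognize, hmm_decode and ann_rescore, and the candidate tuple layout, are illustrative assumptions rather than the patent's actual interfaces):

    def recognize(frames, hmm_decode, ann_rescore):
        # Step 1 (pre-processing): the HMM time-aligns the 10 millisecond
        # frames against the reference vocabulary and returns a scored
        # (ranked) list of (word, score, aligned_segment) candidates.
        scored = hmm_decode(frames)
        # Step 2 (post-processing): the ANN verifies and rescores each
        # candidate from its time-aligned segment, and the list is
        # re-ranked to yield the finalized list of words.
        rescored = [(word, ann_rescore(word, segment), segment)
                    for word, score, segment in scored]
        rescored.sort(key=lambda c: c[1], reverse=True)
        return rescored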
In accordance with the present invention, application of the ANN 70 can occur either (1) before the entire utterance has been completely preprocessed by the HMM 66, or (2) after the entire utterance has been completely preprocessed by the HMM 66. It may also be applied as a preprocessing step to the HMM 66 itself. In the preferred embodiment of the present invention, illustrated in Fig. 8, ANN processing is applied to each new word as it is completed and before a new word is started by the HMM. As each new word is completed by the microprocessing system (not shown) operating the HMM, the completed word leaves the top of the processing stack of the microprocessor and assumes a second place position in the stack. The ANN is applied to completed words at this second place position in the stack because the beginning and ending points of the phoneme or word can only first be accurately determined at this position.
Taking words or phonemes directly off the top of the stack may not always be preferable. For instance, if the entire utterance being analyzed contains only one word (or phoneme), the top of the stack will be occupied by a "silence" token during the final moments of processing, while the second position on the stack will be occupied by the single word or phoneme required for ANN application. This technique also improves the general response time of the automated speech recognition system as a whole because application of the ANN can be limited to only those speech segments tentatively identified by the HMM as ambiguous.
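One hypothetical way to decide which segments count as ambiguous is a margin test over the HMM's two best candidates; the rule and the threshold below are assumptions for illustration, as the patent does not prescribe a particular test:

    AMBIGUITY_MARGIN = 0.1  # placeholder value

    def is_ambiguous(scored):
        # Pass a segment to the ANN for rescoring only when the HMM's two
        # best candidates score too close together to be reliable.
        if len(scored) < 2:
            return False
        top, runner_up = scored[0][1], scored[1][1]
        return (top - runner_up) < AMBIGUITY_MARGIN * top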
A first alternative embodiment of the present invention is depicted in Fig. 9. Microphone 22 detects the unknown utterance for processing by the signal or sound processor 80, which delays outputting a speech time series signal until the end of the utterance is detected, perhaps as specified by a predetermined period of silence. The entire utterance 82 is then output by the signal processor 80 to the HMM 84. The HMM 84 time-aligns the speech data of the entire utterance to a fixed format. The inputs of the ANN 86 are taken from the fixed format of the HMM state machine models, with pointers supplied by the HMM models to the original speech. Although the ANN should be trained (programmed) to recognize phonetic similarities between unknown and known utterances, in this particular application the ANNs must be trained to recognize similarities between entire utterance units.
The "forced fit" technique for time alignment between word templates and speech input is used to train the ANN for each word in the vocabulary. This technique assigns an HMM word model to each segment of the input speech to achieve the best fit between the reference input and the speech input. Each reference word in the utterance is decoded into its finite state machine representation and linked to its neighbors. Reference words may be transcribed into their phonetic spellings and applied to the HMM, or word level HMM models can be used.
Although the present invention has been described with reference to Figs. 1-9, and with emphasis on a particular ordered combination of HMM and ANN techniques, it should be understood that the figures are for illustration only and should not be taken as limitations upon the invention. It is contemplated that many changes and modifications may be made by one of ordinary skill in the art to the elements, process and arrangement of steps of the process of the invention without departing from the spirit and scope of the invention as disclosed above. For example, the present invention need not be limited to applications for recognizing speech alone, and could also be used to detect similarities between a variety of different inputs and reference templates.

Claims (35)

1. An automated speech recognizer, comprising: means for time-aligning segments of a signal derived from unknown utterances with one or more reference signal segments from a vocabulary of known utterances so as to produce a scored list of tentatively identified utterances believed to correspond to said unknown utterances; and means for verifying and rescoring at least a portion of said scored list to produce a final list of positively identified utterances which correctly interpret said unknown utterances.
2. An automated speech recognizer as recited in claim 1, wherein said time-aligning means includes a dynamic time alignment system for correcting speaking rate variations between said unknown utterances and said vocabulary of known utterances.
3. An automated speech recognizer as recited in claim 2, wherein said final list is produced by applying at least a portion of said scored list to individual input nodes of a connectionist model trained to recognize phonetic similarities between said tentatively identified utterances and said vocabulary of known utterances.
4. An automated speech recognizer as recited in claim 3, wherein said scored list includes one or more ambiguously identified utterances and wherein said portion of said scored list is only comprised of said ambiguously identified utterances.
5. An automated speech recognizer as recited in claim 3, wherein said connectionist model is an artificial neural net.
6. An automated speech recognizer as recited in claim 2, wherein said dynamic time alignment system is a hidden Markov model state machine.
7. An automated speech recognizer as recited in claim 6, wherein said scored list is comprised of a series of states within said state machine and wherein said final list is produced by applying at least a portion of said states to individual input nodes of a connectionist model trained to recognize phonetic similarities between said tentatively identified utterances and said vocabulary of known utterances.
8. An automated speech recognizer as recited in claim 7, wherein said series of states includes one or more states corresponding to ambiguously identified utterances and wherein only said states corresponding to said ambiguously identified utterances are applied to said connectionist model.
9. An automated speech recognizer as recited in claim 7, wherein said connectionist model is an artificial neural net.
10. An automated speech recognizer as recited in claim 1, wherein said final list is produced by applying at least a portion of said scored list to individual input nodes of a connectionist model trained to recognize phonetic similarities between said tentatively identified utterances and said vocabulary of known utterances.
11. An automated speech recognizer as recited in claim 10, wherein said connectionist model is an artificial neural net.
12. An automated speech recognizer as recited in claim 11, wherein said time-aligning means includes a hidden Markov model state machine for correcting speaking rate variations between said unknown utterances and said vocabulary of known utterances.
13. A method for recognizing unknown verbal utterances, comprising the steps of: time-aligning unknown segments of a signal derived from the unknown verbal utterances with one or more reference signal segments from a vocabulary of known verbal utterances to correct for speaking rate variations between said unknown verbal utterances and said vocabulary of known verbal utterances so as to produce a scored list of tentatively identified verbal utterances believed to correspond to said unknown verbal utterances; and verifying and/or rescoring said scored list to produce a final list of verbal utterances which correctly interpret said unknown verbal utterances.
14. A method for recognizing unknown verbal utterances as recited in claim 13, wherein said time-aligning step includes the steps of: comparing said unknown segments to one or more of said reference signal segments from said vocabulary so as to produce a comparative list of said reference signal segments corresponding to each of said unknown segments; and pruning said reference segments from each of said comparative lists which do not closely resemble said unknown segments to develop a single list comprised of said reference segments which most closely comparatively resemble said unknown segments.
15. A method for recognizing unknown verbal utterances as recited in claim 14, wherein said verifying and rescoring step includes the step of applying at least a portion of said scored list to individual input nodes of a connectionist model trained to recognize phonetic similarities between said tentatively identified verbal utterances and said vocabulary of known verbal utterances.
16. A method for recognizing unknown verbal utterances as recited in claim 13, wherein said time-aligning step includes the steps of: applying said unknown segments to individual states of a hidden Markov model state machine; following the time evolution of said states representing said unknown segments through said state machine; and producing said scored list of tentatively identified verbal utterances from time-evolved states of said state machine.
17. A method for recognizing unknown verbal utterances as recited in claim 16, wherein said verifying and rescoring step includes the step of applying at least a portion of said scored list to individual input nodes of a connectionist model trained to recognize phonetic similarities between said scored list and said vocabulary.
18. An automated speech recognizer, comprising: means for time-aligning a signal derived from an unknown verbal utterance with one or more reference formats from a vocabulary of known verbal utterances so as to produce a scored list of tentatively identified verbal utterances believed to correspond to said unknown verbal utterance; and means for verifying and rescoring at least a portion of said scored list to produce a positively identified verbal utterance which correctly interprets said unknown verbal utterance.
19. An automated speech recognizer as recited in claim 18, wherein said time-aligning means includes a dynamic time alignment system for correcting speaking rate variations between said unknown verbal utterance and said vocabulary of known verbal utterances.
20. An automated speech recognizer as recited in claim 19, wherein said positively identified verbal utterance is produced by applying at least a portion of said scored list to individual input nodes of a connectionist model trained to recognize phonetic similarities between said tentatively identified verbal utterances and said vocabulary.
21. An automated speech recognizer as recited in claim 20, wherein said scored list includes one or more ambiguously identified verbal utterances and wherein said portion of said scored list is only comprised of said ambiguously identified verbal utterances.
22. An automated speech recognizer as recited in claim 20, wherein said connectionist model is an artificial neural net.
23. An automated speech recognizer as recited in claim 19, wherein said dynamic time alignment system is a hidden Markov model state machine.
24. An automated speech recognizer as recited in claim 23, wherein said scored list is comprised of a series of states within said state machine and wherein said positively identified verbal utterance is produced by applying at least a portion of said states to individual input nodes of a connectionist model trained to recognize phonetic similarities between said tentatively identified verbal utterance and said vocabulary.
25. An automated speech recognizer as recited in claim 24, wherein said series of states includes one or more states corresponding to ambiguously identified verbal utterances and wherein only said states corresponding to said ambiguously identified verbal utterances are applied to said connectionist model.
26. An automated speech recognizer as recited in claim 24, wherein said connectionist model is an artificial neural net.
27. An automated speech recognizer as recited in claim 18, wherein said positively identified verbal utterance is produced by applying at least a portion of said scored list to individual input nodes of an artificial neural net system trained to recognize phonetic similarities between said tentatively identified verbal utterances and said vocabulary.
28. An automated speech recognizer as recited in claim 27, wherein said time-aligning means includes a hidden Markov model state machine for correcting speaking rate variations between said unknown verbal utterance and said vocabulary of known verbal utterances.
29. A pattern recognition system, comprising: means for time-aligning an unknown pattern with one or more reference patterns from a library of known patterns so as to produce a tentative list of identified patterns believed to correspond to said unknown pattern; and means for verifying and/or rescoring said tentative list of patterns to produce a positively identified pattern which correctly corresponds to said unknown pattern.
30. A pattern recognition system as recited in claim 29, wherein said positively identified pattern is produced by applying each pattern from said tentative list to individual input nodes of a connectionist model trained to recognize similarities between said tentative list of patterns and said library.
31. A pattern recognition system as recited in claim 30, wherein said connectionist model is an artificial neural net.
32. A pattern recognition system as recited in claim 29, wherein said time-aligning means includes a hidden Markov model state machine for following the time evolution of said unknown pattern and producing said tentative list.
33. A pattern recognition system as recited in claim 32, wherein said positively identified pattern is produced by applying each pattern from said tentative list to individual input nodes of a connectionist model trained to recognize similarities between said tentative list and said library.
34. A pattern recognition system as recited in claim 33, wherein the connectionist model is an artificial neural net.
35. An automated speech recognizer, substantially as hereinbefore described with reference to the accompanying drawings.
Application GB9026766A, priority date 1990-01-18, filed 1990-12-10; published as GB2240203A (en) - Automated speech recognition system; status: Withdrawn.

Applications Claiming Priority (1)

Application Number: US46770690A; Priority Date: 1990-01-18; Filing Date: 1990-01-18

Publications (2)

GB9026766D0 (en), published 1991-01-30
GB2240203A (en), published 1991-07-24

Family

ID=23856794

Family Applications (1)

Application Number: GB9026766A; Filing Date: 1990-12-10; Title: Automated speech recognition system; Publication: GB2240203A (en)

Legal Events

Code WAP: Application withdrawn, taken to be withdrawn or refused, after publication under section 16(1).