WO1995034064A1 - Speech-recognition system utilizing neural networks and method of using same - Google Patents
- Publication number
- WO1995034064A1 (application PCT/US1995/005006)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- output
- neural networks
- frames
- recognition system
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Definitions
- the present invention is related to the following inventions which are assigned to the same assignee as the present invention:
- This invention relates generally to speech-recognition devices, and, in particular, to a speech-recognition system which is capable of speaker-independent, isolated word recognition.
- Potential applications for an automated speech-recognition device include a database query technique using voice commands, voice input for quality control in a manufacturing process, a voice-dial cellular phone which would allow a driver to focus on the road while dialing, and a voice-operated prosthetic device for the physically disabled.
- Dynamic time-warping (DTW) is a technique which uses an optimization principle to minimize the errors between an unknown spoken word and a stored template of a known word. Reported data shows that the DTW technique is very robust and produces good recognition. However, the DTW technique is computationally intensive, which makes it impractical to implement for real-world applications.
- The hidden-Markov modeling (HMM) technique uses stochastic models for known words and compares the probability that the unknown word was generated by each model.
- The HMM technique will check the sequence (or state) of the word, and find the model that provides the best match.
- The HMM technique has been successfully used in many commercial applications; however, the technique has many drawbacks. These drawbacks include an inability to differentiate acoustically similar words, a susceptibility to noise, and computational intensiveness.
- A time-delay neural network (TDNN) is a type of neural network which addresses the temporal effects of speech by adopting limited neuron connections.
- A TDNN shows slightly better results than the HMM method.
- a TDNN suffers from some serious drawbacks.
- the training time for a TDNN is very lengthy, on the order of several weeks.
- the training algorithm for a TDNN often converges to a local minimum, which is not the optimum solution.
- the optimum solution would be a global minimum.
- Another advantage of the present invention is to provide a speech-recognition system which does not require repetitive training.
- Yet another advantage of the present invention is to provide a speech-recognition system which operates with a vast reduction in computational complexity.
- a speech-recognition system responsive to audio input from which the system identifies utterances of human speech, comprising: a pre-processing circuit for analyzing the audio input, the circuit generating output representing the results of the analysis; a computer responsive to the output of the pre-processing circuit, the computer executing an algorithm partitioning the output of the pre-processing circuit into data blocks, the computer producing as output a plurality of the data blocks; a plurality of neural networks for computing polynomial expansions, each of the neural networks responsive to the plurality of data blocks and generating at least one output; and a selector responsive to the at least one output of each of the neural networks and generating as output a label representing the utterance of speech.
- a method of operating a speech-recognition system comprising the following steps: (a) receiving a spoken word; (b) performing analog-to-digital conversion of the spoken word, the conversion producing a digitized word; (c) performing cepstral analysis of the digitized word, the analysis resulting in a sequence of data frames; (d) generating a plurality of data blocks from the sequence of data frames; (e) broadcasting one of the plurality of data blocks to a plurality of neural networks, wherein each of the plurality of neural networks has been previously trained to recognize a specific word; (f) each one of the neural networks generating an output as a result of receiving the data block; (g) accumulating the output of each of the neural networks to produce a respective neural network sum; (h) determining if there is another one of the plurality of data blocks to be broadcast to the plurality of neural networks, and, if so, returning to step (e), but, if not, proceeding to step (j); and (j) generating a system output, corresponding to the largest of the neural network sums, the system output indicating the spoken word.
- FIG. 1 shows a contextual block diagram of a speech-recognition system in accordance with the present invention.
- FIG. 2 shows a conceptual diagram of a speech-recognition system in accordance with a preferred embodiment of the present invention.
- FIG. 3 shows a flow diagram of a method of operating the speech-recognition system illustrated in FIG. 2.
- FIG. 4 illustrates data inputs and outputs of a divide-and-conquer algorithm of a preferred embodiment of the present invention.
- FIG. 5 shows a flow diagram of a method of executing a divide-and-conquer algorithm of a preferred embodiment of the present invention.
- FIG. 6 shows a flow diagram of a method of training a neural network to recognize speech in accordance with a preferred embodiment of the present invention.
- FIG. 1 shows a contextual block diagram of a speech-recognition system in accordance with the present invention.
- the system comprises a microphone 1 or
- pre-processing circuitry 3 which receives electrical signals from microphone 1 and performs various tasks such as wave-form sampling, analog-to-digital (A/D) conversion, cepstral analysis, etc.
- computer 5 which executes a program for recognizing speech and accordingly generates an output identifying the recognized speech.
- the operation of the system commences when a user speaks into microphone 1.
- the system depicted by FIG. 1 is used for isolated word
- Isolated word recognition takes place when a person speaking into the microphone makes a distinct pause between each word.
- When a speaker utters a word, microphone 1 generates a signal which represents the acoustic wave-form of the word. This signal is then fed to pre-processing circuitry 3 for digitization by means of an A/D converter (not shown). The digitized signal is then subjected to cepstral analysis, a method of feature extraction, which is also performed by pre-processing circuitry 3. Computer 5 receives the results of the cepstral analysis and uses these results to determine the identity of the spoken word.
- Pre-processing circuitry 3 may include a combination of hardware and software components in order to perform its tasks. For example, the A/D conversion may be performed by a
- cepstral analysis may be performed by software which is executed on a
- Pre-processing circuitry 3 includes appropriate means for A/D conversion.
- the signal from microphone 1 is an analog signal.
- An A/D converter (not shown) samples the signal from microphone 1 several thousand times per second (e.g. between 8000 and 14,000 times per second in a preferred embodiment). Each of the samples is then converted to a digital word, wherein the length of the word is between 12 and 32 bits. The digitized signal comprises one or more of these digital words.
- the sampling rate and word length of A/D converters may vary and that the numbers given above do not place any limitations on the sampling rate or word length of the A/D converter which is included in the present invention.
- the cepstral analysis is performed as follows. First, the digitized samples, which make up the digitized signal, are divided into a sequence of sets. Each set includes samples taken during an interval of time which is of fixed duration. To illustrate, in a preferred embodiment of the present invention the interval of time is 15 milliseconds. If the duration of a spoken word is, for example, 150 milliseconds, then circuitry 3 will produce a sequence of ten sets of digitized samples.
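The division into fixed-duration sets can be illustrated with a minimal sketch. It assumes a 15 millisecond interval (the value consistent with ten sets for a 150 millisecond word) and a sampling rate of 8000 samples per second. The function name `frame_samples` and the choice to discard a trailing partial set are assumptions made for illustration, not details taken from the description.

```python
def frame_samples(samples, sample_rate_hz, frame_ms=15):
    """Split digitized samples into consecutive sets of fixed duration."""
    per_frame = sample_rate_hz * frame_ms // 1000  # samples in one interval
    if per_frame <= 0:
        raise ValueError("interval too short for this sampling rate")
    # Assumption: only whole intervals are kept; a trailing partial set is dropped.
    count = len(samples) // per_frame
    return [samples[i * per_frame:(i + 1) * per_frame] for i in range(count)]


# 150 ms of audio sampled 8000 times per second is 1200 samples.
word = list(range(1200))
frames = frame_samples(word, 8000)
print(len(frames), len(frames[0]))  # 10 sets of 120 samples each
```

Each resulting set would then be passed to the cepstral analysis, which this sketch does not attempt to reproduce.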
- n an integer index
- k an integer index
- a(k) represents the kth prediction coefficient
- c(n - k) represents the (n - k)th cepstrum coefficient
- This weighting is commonly referred to as cepstrum liftering.
- the effect of this liftering process is to smooth the spectral peaks in the spectrum of the speech sample. It has also been found that cepstrum liftering suppresses the existing variations in the high and low cepstrum coefficients, and thus considerably improves the performance of the speech-recognition system.
- the result of the cepstral analysis is a sequence of smoothed log spectra wherein each spectrum corresponds to a discrete time interval from the period during which the word was spoken.
- For each spectrum, pre-processing circuitry 3 generates a respective data frame which comprises data points from the spectrum.
- each data frame contains twelve data points, wherein each of the data points represents the value of a cepstrally-smoothed spectral envelope at a specific frequency.
- the data points are 32-bit digital words.
- the present invention places no limits on the number of data points per frame or the bit length of the data points; the number of data points contained in a data frame may be twelve or any other appropriate value, while the data point bit length may be 32 bits, 16 bits, or any other value.
- computer 5 may include a partitioning program for manipulating the sequence of data frames, a plurality of neural networks for computing polynomial expansions, and a selector which uses the outputs of the neural networks to classify the spoken word as a known word. Further details of the operation of computer 5 are given below.
- FIG. 2 shows a conceptual diagram of a speech-recognition system in accordance with a preferred embodiment of the present invention.
- the speech-recognition system recognizes isolated spoken words.
- a microphone 1 receives speech input from a person who is speaking, and converts the input into electrical signals. The electrical signals are fed to pre-processing circuitry 3.
- Pre-processing circuitry 3 performs the functions described above regarding FIG. 1. Circuitry 3 performs A/D conversion and cepstral analysis, and circuitry 3 may include a combination of hardware and software components in order to perform its tasks.
- the output of preprocessing circuitry 3 takes the form of a sequence of data frames which represent the spoken word. Each data frame comprises a set of data points (32-bit words) which correspond to a discrete time interval from the period during which the word was spoken.
- the output of circuitry 3 is transmitted to computer 5.
- Computer 5 may be a general-purpose digital computer or a specific-purpose computer.
- Computer 5 comprises suitable hardware and/or software for performing a divide-and-conquer algorithm 11.
- Computer 5 further comprises a plurality of neural networks represented by 1st Neural Network 12, 2nd Neural Network 13, and Nth Neural Network 14.
- the output of each neural network 12, 13, and 14 is fed into a respective accumulator 15, 16, and 17.
- the outputs of accumulators 15-17 are fed into a selector 18, whose output represents the recognized speech word.
- Divide-and-conquer algorithm 11 receives the sequence of data frames from pre-processing circuitry 3, and from the sequence of data frames it generates a plurality of data blocks. In essence, algorithm 11 partitions the sequence of data frames into a set of data blocks, each of which comprises a subset of data frames from the input sequence. The details of the operation of divide-and-conquer algorithm 11 are given below.
- the first data block comprises the first data frame and every fourth data frame thereafter appearing in the sequence of data frames.
- the second data block comprises the second data frame and every fourth data frame thereafter in the sequence, and so on for the remaining data blocks.
- each data block contains the same number of data frames. If the number of data frames turns out to be insufficient to provide each block with an identical number of data frames, then the last data frame in the sequence is copied into the remaining data blocks, so that each contains the same number of data frames.
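The interleaved assignment and the padding rule described above can be sketched as follows. This is an illustrative reading of the text rather than the patented implementation: the function name `partition_frames` is hypothetical, and the number of blocks is simply passed in as a parameter (four in the "every fourth data frame" example).

```python
def partition_frames(frames, num_blocks):
    """Deal frames into num_blocks blocks in turn, then pad with the last frame."""
    if not frames or num_blocks <= 0:
        raise ValueError("need at least one frame and one block")
    blocks = [[] for _ in range(num_blocks)]
    # Frame 0 goes to block 0, frame 1 to block 1, and so on, wrapping around.
    for index, frame in enumerate(frames):
        blocks[index % num_blocks].append(frame)
    # The first block is always the longest; copy the last frame of the
    # sequence into any block that came up short.
    target = len(blocks[0])
    for block in blocks:
        while len(block) < target:
            block.append(frames[-1])
    return blocks


# Eighteen frames labeled 51..68, dealt into four blocks.
blocks = partition_frames(list(range(51, 69)), 4)
for block in blocks:
    print(block)
```

With eighteen frames, the third and fourth blocks each receive only four frames from the round-robin pass, so frame 68 is copied into both to give every block five frames.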
- a means for distributing the data blocks is used to transfer the data blocks from algorithm 11 to the inputs of neural networks 12, 13, and 14. In turn, each data block is transferred simultaneously to neural networks 12, 13, and 14. While FIG. 2 shows only three neural networks in the speech-recognition system, it will be understood by one of ordinary skill that any number of neural networks may be used if a particular application requires more or fewer than three neural networks.
- each neural network comprises a plurality of neurons.
- each of the neural networks may have been previously trained to recognize a specific set of speech phonemes.
- a spoken word comprises one or more speech phonemes.
- Neural networks 12, 13, and 14 act as classifiers that determine which word was spoken, based on the data blocks.
- a classifier makes a decision as to which class an input pattern belongs.
- each class is labeled with a known word, and data blocks are obtained from a predefined set of spoken words (the training set) and used to determine boundaries between the classes, boundaries which maximize the recognition performance for each class.
- a parametric decision method is used to determine whether a spoken word belongs to a certain class.
- Upon receiving a data block, the neural networks compute their respective discriminant functions. If the discriminant function computed by a particular neural network is greater than the discriminant function of each of the other networks, then the data block belongs to the particular class
- each neural network defines a
- neural network 12 may be trained to recognize the word "one"
- neural network 13 may be trained to recognize the word "two", and so forth.
- the method of training the neural networks is described below in the section entitled "Neural Network Training”.
- the discriminant functions computed by the neural networks of the present invention are based upon the use of a polynomial expansion and, in a loose sense, the use of an orthogonal function, such as a sine, cosine,
- a preferred embodiment employs a polynomial expansion of which the general case is represented by Equation 4 as follows: $$y = \sum_{i=1}^{\infty} w_{i-1}\, x_1^{g_{1i}} x_2^{g_{2i}} \cdots x_n^{g_{ni}}$$
- n is the number of co-processor inputs.
- Equation 4 expresses a neuron output and the weight and gating functions associated with such neuron.
- the number of terms of the polynomial expansion to be used in a neural network is based upon a number of factors, including the number of available neurons, the number of training examples, etc. It should be understood that the higher order terms of the polynomial expansion usually have less significance than the lower order terms. Therefore, in a preferred embodiment, the lower order terms are chosen whenever possible, based upon the various factors mentioned above. Also, because the unit of measurement associated with the various inputs may vary, the inputs may need to be normalized before they are used.
- Equation 5 is an alternative representation of Equation 4, showing terms up to the third order.
- Equation (5) wherein the variables have the same meaning as in Equation 4 and wherein f1(i) is an index function in the range of n+1 to 2n; f2(i,j) is an index function in the range of 2n+1 to 2n+(n)(n-1)/2;
- f3(i,j) is in the range of 2n+1+(n)(n-1)/2 to 3n+(n)(n-1)/2.
- f4 through f6 are represented in a similar fashion.
- Equation 5 can be represented as follows:
- $$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_i x_i + \dots + w_n x_n$$
- Equation (6) wherein the variables have the same meaning as in Equation 4.
- N is any positive integer and represents the Nth neuron in the network.
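To make the polynomial expansion concrete, the sketch below evaluates a network output as a sum of terms, each being a weight multiplied by the inputs raised to integer exponents. Representing the gating functions as a tuple of exponents per term is an assumption made for illustration, as is the function name `polynomial_output`; the example uses only the first-order terms shown in Equation 6.

```python
def polynomial_output(inputs, weights, exponents):
    """Return the sum over terms of weight * product(x_j ** g_j).

    exponents[t] holds one non-negative integer exponent per input for term t.
    """
    total = 0.0
    for weight, gates in zip(weights, exponents):
        term = weight
        for x, g in zip(inputs, gates):
            term *= x ** g  # an exponent of 0 leaves that input out of the term
        total += term
    return total


# First-order case with two inputs: y = w0 + w1*x1 + w2*x2
weights = [0.5, 2.0, -1.0]
exponents = [(0, 0), (1, 0), (0, 1)]
y = polynomial_output([3.0, 4.0], weights, exponents)
print(y)  # 0.5 + 2.0*3.0 - 1.0*4.0 = 2.5
```

Higher-order terms would be added by appending exponent tuples such as `(1, 1)` or `(2, 0)` together with their weights.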
- a neural network will generate an output for every data block it receives. Since a spoken word may be represented by a sequence of data blocks, each neural network may generate a sequence of outputs. To enhance the classification performance of the speech-recognition system, each sequence of outputs is summed by an accumulator.
- an accumulator is attached to the output of each neural network.
- accumulator 15 is responsive to output from neural network 12
- accumulator 16 is responsive to output from neural network 13
- accumulator 17 is responsive to output from neural network 14.
- the function of an accumulator is to sum the sequence of outputs from a neural network. This creates a sum which corresponds to the neural network, and thus the sum corresponds to a class which is labeled by a known word.
- Accumulator 15 adds each successive output from neural network 12 to an accumulated sum, and
- accumulators 16 and 17 perform the same function for neural networks 13 and 14, respectively. Each accumulator presents its sum as an output.
- Selector 18 receives the sums from the accumulators either sequentially or concurrently. In the former case, selector 18 receives the sums in turn from each of the accumulators, for example, receiving the sum from
- In the latter case, selector 18 receives the sums from accumulators 15, 16, and 17 concurrently. After receiving the sums, selector 18 then determines which sum is largest and assigns the corresponding known word label, i.e. the recognized speech word, to the output of the speech-recognition system.
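The accumulate-and-select stage might be sketched as below. The networks here are stand-in callables rather than trained polynomial networks, and the dictionary keyed by word label, along with the name `recognize`, is a hypothetical structure chosen for the example.

```python
def recognize(blocks, networks):
    """Broadcast each block to every network, sum the outputs, pick the largest.

    networks maps a known word label to a callable that scores one data block.
    """
    sums = {label: 0.0 for label in networks}  # one accumulator per network
    for block in blocks:
        for label, network in networks.items():
            sums[label] += network(block)
    best = max(sums, key=sums.get)  # selector: label with the largest sum
    return best, sums


def score_one(block):
    # Toy discriminant that favors large values (illustration only).
    return sum(block)


def score_two(block):
    # Toy discriminant that favors small values (illustration only).
    return -sum(block)


label, sums = recognize([[1.0, 2.0], [3.0, 4.0]],
                        {"one": score_one, "two": score_two})
print(label, sums)
```

Summing over all blocks before selecting means a single noisy block is less likely to decide the outcome on its own.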
- FIG. 3 shows a flow diagram of a method of operating the speech-recognition system illustrated in FIG. 2.
- In box 20, a spoken word is received from the user by
- A/D conversion is performed on the speech signal.
- A/D conversion is performed by pre-processing circuitry 3 of FIG. 2.
- cepstral analysis is performed on the digitized signal resulting from the A/D conversion.
- the cepstral analysis is, in a preferred embodiment, also performed by pre-processing circuitry 3 of FIG. 2.
- the cepstral analysis produces a sequence of data frames which contain the relevant features of the spoken word.
- a divide-and-conquer algorithm is used to generate a plurality of data blocks from the sequence of data frames.
- the divide-and-conquer algorithm is a method of partitioning the sequence of data frames into a set of smaller data blocks.
- one of the data blocks is broadcast to the neural networks. Upon exiting box 28, the procedure continues to box 30.
- each of the neural networks uses the data block in computing a discriminant function which is based on a polynomial expansion.
- a different discriminant function is computed by each neural network and generated as an output.
- the discriminant function computed by a neural network is determined prior to operating the speech-recognition system by using the method of training the neural network as shown in FIG. 6.
- the output of each neural network is added to a sum, wherein there is one sum generated for each neural network.
- This step generates a plurality of neural network sums, wherein each sum corresponds to a neural network.
- In decision box 34, a check is made to determine whether there is another data block to be broadcast to the neural networks. If so, the procedure returns to box 28. If not, the procedure proceeds to box 36.
- the selector determines which neural network sum is the largest, and assigns the known word label which corresponds to the sum as the output of the speech-recognition system.
- FIG. 4 illustrates data inputs and outputs of a divide-and-conquer algorithm of a preferred embodiment of the present invention.
- the divide-and-conquer algorithm is a method of partitioning the sequence of data frames into a set of smaller data blocks.
- the input to the algorithm is the sequence of data frames 38, which, in the example illustrated, comprises data frames 51-70.
- the sequence of data frames 38 contains data which represents the relevant features of a speech sample.
- each data frame contains twelve data points, wherein each of the data points represents the value of a cepstrally-smoothed spectral envelope at a specific frequency.
- the data points are 32-bit digital words. Each data frame corresponds to a discrete time interval from the period during which the speech sample was spoken.
- the present invention places no limits on the number of data points per frame or the bit length of the data points; the number of data points contained in a data frame may be twelve or any other value, while the data point bit length may be 32 bits, 16 bits, or any other value.
- each data point may be used to represent data other than values from a cepstrally-smoothed spectral envelope.
- each data point may represent a spectral amplitude at a specific frequency.
- the divide-and-conquer algorithm 11 receives each frame of the speech sample sequentially and assigns the frame to one of several data blocks. Each data block comprises a subset of data frames from the input sequence of frames. Data blocks 42, 44, 46, and 48 are output by the divide-and-conquer algorithm 11. Although FIG. 4 shows the algorithm generating only four data blocks, the divide-and-conquer algorithm 11 is not limited to generating only four data blocks and may be used to generate either more or fewer than four blocks.
- FIG. 5 shows a flow diagram of a method of executing a divide-and-conquer algorithm of a preferred embodiment of the present invention.
- the divide-and-conquer algorithm partitions a sequence of data frames into a set of data blocks according to the following steps.
- the number of data blocks to be generated by the algorithm is first calculated.
- the number of data blocks to be generated is calculated in the following manner. First, the number of frames per data block and the number of frames in the sequence are
- Both the number of blocks and the number of frames are integers. Second, the number of frames is divided by the number of frames per block. Next, the result of the division operation is rounded up to the nearest integer, resulting in the number of data blocks to be generated by the divide-and-conquer algorithm. Upon exiting box 75, the procedure continues in box 77.
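The block-count calculation is a ceiling division, which can be written without floating-point arithmetic. The helper name `count_blocks` is hypothetical; the sample values follow the FIG. 4 example of twenty frames (51 through 70) and four blocks, which implies five frames per block.

```python
def count_blocks(num_frames, frames_per_block):
    """Divide the frame count by frames per block and round up to an integer."""
    if num_frames < 0 or frames_per_block <= 0:
        raise ValueError("frame count must be >= 0 and frames per block > 0")
    # Negating before and after floor division rounds the quotient up.
    return -(-num_frames // frames_per_block)


print(count_blocks(20, 5))  # 4 blocks, as in the FIG. 4 example
print(count_blocks(18, 5))  # 4 blocks; the last frame is copied to fill the gaps
print(count_blocks(21, 5))  # 5 blocks
```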
- the first frame of the sequence of frames is equated to a variable called the current frame.
- the current frame could be represented by either a software variable or, in hardware, as a register or memory device.
- a current block variable is equated to the first block.
- the current block may be a software variable which represents a data block.
- the current block may be one or more registers or memory devices.
- In decision box 81, a check is made to determine whether or not there are more frames from the sequence of frames to be processed. If so, the
- In box 83, the next frame from the sequence of frames is received and equated to the current frame variable.
- In box 85, the current block variable is equated to the next block, and then the current frame variable is assigned to the current block variable. Upon exiting box 85, the procedure proceeds to decision box 87.
- next block is set equal to the first block, and upon exiting box 89 the procedure returns to decision box 81.
- Box 91 is entered from decision box 81.
- a check is made to determine if the current block variable is equal to the last block. If so, the procedure terminates. If not, the current frame is assigned to each of the remaining data blocks which follow the current block, up to and including the last block, as previously explained above in the description of FIG. 2.
- the speech-recognition system of the present invention has principally two modes of operation: (1) a training mode in which examples of spoken words are used to train the neural networks, and (2) a recognition mode in which unknown spoken words are identified.
- the user must train neural networks 12, 13, and 14 by speaking into microphone 1 all of the words that the system is to recognize.
- the training may be limited to several users speaking each word once. However, those skilled in the art will realize that the training may require any number of different speakers uttering each word more than once.
- the weights of each neuron circuit must be determined. This can be
- In implementing a neural network of the present invention, one generally selects the number of neurons or neuron circuits to be equal to or less than the number of training examples presented to the network.
- a training example is defined as one set of given inputs and resulting output(s).
- each word spoken into microphone 1 of FIG. 2 generates at least one training example.
- the training algorithm used for the neural networks is shown in FIG. 6.
- FIG. 6 shows a flow diagram of a method of training a neural network to recognize speech in accordance with a preferred embodiment of the present invention.
- In box 93, an example of a known word is spoken into a microphone of the speech-recognition system.
- A/D conversion is performed on the speech signal.
- Cepstral analysis is performed on the digitized signal which is output from the A/D conversion.
- the cepstral analysis produces a sequence of data frames which contain the relevant features of the spoken word.
- Each data frame comprises twelve 32-bit words which represent the results of the cepstral analysis of a time slice of the spoken word. In a preferred embodiment, the duration of the time slice is 15 milliseconds.
- The bit length of the words in the data frames may be 32 bits, 16 bits, or any other value.
- The number of words per data frame and the duration of the time slice may vary, depending on the intended application of the present invention.
- a divide-and-conquer algorithm (the steps of which are shown in FIG. 5) is used to generate a plurality of blocks from the sequence of data frames.
- one of the blocks generated by the divide-and-conquer algorithm is selected.
- the input portion of a training example is set equal to the selected block.
- In box 101, if the neural network is being trained to recognize the selected block, then the output portion of the training example is set to one; otherwise it is set to zero. Upon exiting box 101 the procedure continues with box 103.
- the training example is saved in memory of computer 5 (FIGS. 1 and 2). This allows a plurality of training examples to be generated and stored.
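The construction of training examples can be sketched as pairing each data block with a desired output of one or zero. This sketch assumes that a network is "being trained to recognize" a block when the block came from the word assigned to that network; the function and parameter names are hypothetical.

```python
def make_training_examples(blocks, spoken_word, target_word):
    """Pair every block of one utterance with the desired output for one network.

    The desired output is 1 when the utterance is the word this network is
    being trained to recognize, and 0 otherwise.
    """
    desired = 1 if spoken_word == target_word else 0
    return [(block, desired) for block in blocks]


# Two utterances, each already partitioned into two blocks (toy values).
saved = []
saved += make_training_examples([[0.1, 0.2], [0.3, 0.4]], "one", target_word="one")
saved += make_training_examples([[0.9, 0.8], [0.7, 0.6]], "two", target_word="one")
for example in saved:
    print(example)
```

The list `saved` plays the role of the stored training examples for the network assigned to the word "one"; a separate list would be built for each of the other networks.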
- In decision box 105, a check is made to determine if there is another data block, generated from the current sequence of data frames, to be used in training the neural network. If so, the procedure returns to box 99. If not, the procedure proceeds to decision box 107.
- In decision box 107, a determination is made to see if there is another spoken word to be used in the training session. If so, the procedure returns to box 93. If not, the procedure continues to box 109.
- In box 109, a comparison is made between the number of training examples provided and the number of neurons in the neural network. If the number of neurons is equal to the number of training examples, a matrix-inversion technique may be employed to solve for the value of each weight. If the number of neurons is not equal to the number of training examples, a least-squares estimation technique is employed to solve for the value of each weight. Suitable least-squares estimation techniques include, for example, least-squares, extended least-squares, pseudo-inverse, Kalman filter, maximum-likelihood algorithm, Bayesian estimation, and the like.
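The weight-solving step can be illustrated with a small pure-Python sketch. It assumes each training example has already been expanded into one value per polynomial term (one column per neuron). For the square case it solves the linear system by Gaussian elimination, which stands in for the matrix-inversion technique; otherwise it uses ordinary least squares through the normal equations, only one of the estimation techniques listed. All names are hypothetical, and the least-squares branch expects at least as many examples as terms.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * w = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    weights = [0.0] * n
    for row in range(n - 1, -1, -1):  # back substitution
        tail = sum(aug[row][k] * weights[k] for k in range(row + 1, n))
        weights[row] = (aug[row][n] - tail) / aug[row][row]
    return weights


def fit_weights(design, targets):
    """design[e][t] is term t evaluated on example e; targets[e] is 0 or 1."""
    examples, terms = len(design), len(design[0])
    if examples == terms:
        return solve_linear(design, targets)  # as many neurons as examples
    # Ordinary least squares: solve (A^T A) w = A^T y.
    gram = [[sum(design[e][i] * design[e][j] for e in range(examples))
             for j in range(terms)] for i in range(terms)]
    moment = [sum(design[e][i] * targets[e] for e in range(examples))
              for i in range(terms)]
    return solve_linear(gram, moment)


square = fit_weights([[1.0, 1.0], [1.0, 2.0]], [0.0, 1.0])
overdetermined = fit_weights([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]], [0.0, 1.0, 2.0])
print(square, overdetermined)
```

Because both branches solve a linear system in a single pass, no iterative, repeated presentation of the training set is needed, which is in keeping with the non-repetitive training the description emphasizes.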
- Because the various embodiments of the speech-recognition system as herein described utilize a divide-and-conquer algorithm to partition speech samples, they are insensitive to differences in speakers and not adversely affected by background noise.
- The embodiments of the speech-recognition system as described herein include a neural network which does not require repetitive training and which yields a global minimum to each given set of input vectors; thus, the embodiments of the present invention require substantially less training time and are significantly more accurate than known speech-recognition systems.
- accumulators, and the selector are implemented in hardware or software. Such design choices greatly depend upon the integrated circuit technology, type of implementation (e.g. analog, digital, software, etc.), die sizes, pin-outs, and so on.
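The target assignment of box 101 and the weight computation of box 109 can be sketched as follows. This is a minimal illustration, not the implementation described in this document: it assumes the network output is a weighted sum of neuron outputs, so that each training example contributes one linear equation in the weights. The names `make_targets`, `solve_linear`, and `fit_weights` are hypothetical. The least-squares branch uses the normal equations, which is only one of the techniques listed for box 109 and assumes more training examples than neurons.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def make_targets(block_labels: Sequence[str], selected_label: str) -> Vector:
    """Box 101: the output is one for the block being trained on, zero otherwise."""
    return [1.0 if label == selected_label else 0.0 for label in block_labels]


def solve_linear(a: Matrix, b: Vector) -> Vector:
    """Solve the square system a * w = b by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    # Work on an augmented copy so the caller's data is left untouched.
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: examples are not independent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_weights(neuron_outputs: Matrix, targets: Vector) -> Vector:
    """Box 109: pick the solver from the example count and the neuron count.

    neuron_outputs[k][i] is the (assumed) output of neuron i for example k.
    """
    n_examples = len(neuron_outputs)
    n_neurons = len(neuron_outputs[0])
    if n_examples == n_neurons:
        # Equal counts: the system is square and can be solved directly.
        return solve_linear(neuron_outputs, targets)
    # Unequal counts: least squares via the normal equations
    # (Phi^T Phi) w = Phi^T y, valid when examples outnumber neurons.
    gram = [
        [sum(row[i] * row[j] for row in neuron_outputs) for j in range(n_neurons)]
        for i in range(n_neurons)
    ]
    rhs = [
        sum(row[i] * t for row, t in zip(neuron_outputs, targets))
        for i in range(n_neurons)
    ]
    return solve_linear(gram, rhs)


if __name__ == "__main__":
    # Two examples, two neurons: the square case.
    targets = make_targets(["one", "two"], "one")
    print(fit_weights([[1.0, 0.0], [1.0, 1.0]], targets))  # → [1.0, -1.0]
```

Because the weights are obtained in a single solve rather than by iterative updates, a sketch of this kind needs only one pass over the stored training examples.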
Landscapes
- Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
- Character Discrimination (AREA)
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE19581667T DE19581667C2 (en) | 1994-06-06 | 1995-04-25 | Speech recognition system and method for speech recognition |
AU23624/95A AU685626B2 (en) | 1994-06-06 | 1995-04-25 | Speech-recognition system utilizing neural networks and method of using same |
GB9625251A GB2304507B (en) | 1994-06-06 | 1995-04-25 | Speech-recognition system utilizing neural networks and method of using same |
JP8500848A JPH09507921A (en) | 1994-06-06 | 1995-04-25 | Speech recognition system using neural network and method of using the same |
CA002190619A CA2190619A1 (en) | 1994-06-06 | 1995-04-25 | Speech-recognition system utilizing neural networks and method of using same |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US25484494A | 1994-06-06 | 1994-06-06 | |
US08/254,844 | 1994-06-06 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1995034064A1 (en) | 1995-12-14 |
Family
ID=22965804
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1995/005006 WO1995034064A1 (en) | 1994-06-06 | 1995-04-25 | Speech-recognition system utilizing neural networks and method of using same |
Country Status (8)
Country | Link |
---|---|
US (1) | US5832181A (en) |
JP (1) | JPH09507921A (en) |
CN (1) | CN1150852A (en) |
AU (1) | AU685626B2 (en) |
CA (1) | CA2190619A1 (en) |
DE (1) | DE19581667C2 (en) |
GB (1) | GB2304507B (en) |
WO (1) | WO1995034064A1 (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6192353B1 (en) * | 1998-02-09 | 2001-02-20 | Motorola, Inc. | Multiresolutional classifier with training system and method |
DE19837102A1 (en) * | 1998-08-17 | 2000-02-24 | Philips Corp Intellectual Pty | Method and arrangement for carrying out a database query |
US6320623B1 (en) * | 1998-11-13 | 2001-11-20 | Philips Electronics North America Corporation | Method and device for detecting an event in a program of a video and/ or audio signal and for providing the program to a display upon detection of the event |
JP3908965B2 (en) | 2002-02-28 | 2007-04-25 | 株式会社エヌ・ティ・ティ・ドコモ | Speech recognition apparatus and speech recognition method |
EP1363271A1 (en) | 2002-05-08 | 2003-11-19 | Sap Ag | Method and system for processing and storing of dialogue speech data |
DE10220524B4 (en) | 2002-05-08 | 2006-08-10 | Sap Ag | Method and system for processing voice data and recognizing a language |
EP1361740A1 (en) * | 2002-05-08 | 2003-11-12 | Sap Ag | Method and system for dialogue speech signal processing |
CN104021373B (en) * | 2014-05-27 | 2017-02-15 | 江苏大学 | Semi-supervised speech feature variable factor decomposition method |
US9761221B2 (en) * | 2015-08-20 | 2017-09-12 | Nuance Communications, Inc. | Order statistic techniques for neural networks |
CN109421053A (en) * | 2017-08-24 | 2019-03-05 | 河海大学 | A kind of writing arm-and-hand system based on speech recognition |
EP3732626A4 (en) * | 2017-12-28 | 2021-09-15 | Syntiant | Always-on keyword detector |
US10970441B1 (en) | 2018-02-26 | 2021-04-06 | Washington University | System and method using neural networks for analog-to-information processors |
US10991370B2 (en) | 2019-04-16 | 2021-04-27 | International Business Machines Corporation | Speech to text conversion engine for non-standard speech |
US20210303662A1 (en) * | 2020-03-31 | 2021-09-30 | Irdeto B.V. | Systems, methods, and storage media for creating secured transformed code from input code using a neural network to obscure a transformation function |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4624008A (en) * | 1983-03-09 | 1986-11-18 | International Telephone And Telegraph Corporation | Apparatus for automatic speech recognition |
US5259030A (en) * | 1991-07-17 | 1993-11-02 | Harris Corporation | Antijam improvement method and apparatus |
US5404422A (en) * | 1989-12-28 | 1995-04-04 | Sharp Kabushiki Kaisha | Speech recognition system with neural network |
US5408588A (en) * | 1991-06-06 | 1995-04-18 | Ulug; Mehmet E. | Artificial neural network method and architecture |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5047952A (en) * | 1988-10-14 | 1991-09-10 | The Board Of Trustee Of The Leland Stanford Junior University | Communication system for deaf, deaf-blind, or non-vocal individuals using instrumented glove |
DE3916478A1 (en) * | 1989-05-20 | 1990-11-22 | Standard Elektrik Lorenz Ag | NEURONAL NETWORK ARCHITECTURE |
JP3323894B2 (en) * | 1991-06-27 | 2002-09-09 | 株式会社日立製作所 | Neural network learning method and apparatus |
FR2689292A1 (en) * | 1992-03-27 | 1993-10-01 | Lorraine Laminage | Voice recognition method using neuronal network - involves recognising pronounce words by comparison with words in reference vocabulary using sub-vocabulary for acoustic word reference |
EP0574951B1 (en) * | 1992-06-18 | 2000-04-05 | Seiko Epson Corporation | Speech recognition system |
1995
- 1995-04-25 AU AU23624/95A patent/AU685626B2/en not_active Ceased
- 1995-04-25 JP JP8500848A patent/JPH09507921A/en active Pending
- 1995-04-25 CA CA002190619A patent/CA2190619A1/en not_active Abandoned
- 1995-04-25 CN CN95193473A patent/CN1150852A/en active Pending
- 1995-04-25 GB GB9625251A patent/GB2304507B/en not_active Expired - Fee Related
- 1995-04-25 WO PCT/US1995/005006 patent/WO1995034064A1/en active Application Filing
- 1995-04-25 DE DE19581667T patent/DE19581667C2/en not_active Expired - Fee Related
1996
- 1996-06-17 US US08/664,893 patent/US5832181A/en not_active Expired - Lifetime
Non-Patent Citations (2)
Title |
---|
ADDISON-WESLEY, (R. SEDGEWICK), NEW YORK, (1988), "Algorithms", page 53. * |
LIBRARY OF CONGRESS CATALOGUE CARD NUMBER 68-26850, "Information Theory and Reliable Communication", (GALLAGER) WILEY, NEW YORK (1968), pages 286-290. * |
Also Published As
Publication number | Publication date |
---|---|
GB2304507B (en) | 1999-03-10 |
CN1150852A (en) | 1997-05-28 |
AU2362495A (en) | 1996-01-04 |
AU685626B2 (en) | 1998-01-22 |
CA2190619A1 (en) | 1995-12-14 |
US5832181A (en) | 1998-11-03 |
GB2304507A (en) | 1997-03-19 |
DE19581667T1 (en) | 1997-05-07 |
JPH09507921A (en) | 1997-08-12 |
GB9625251D0 (en) | 1997-01-22 |
DE19581667C2 (en) | 1999-03-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US5509103A (en) | Method of training neural networks used for speech recognition | |
US5621848A (en) | Method of partitioning a sequence of data frames | |
US5638486A (en) | Method and system for continuous speech recognition using voting techniques | |
US5596679A (en) | Method and system for identifying spoken sounds in continuous speech by comparing classifier outputs | |
US5594834A (en) | Method and system for recognizing a boundary between sounds in continuous speech | |
AU684214B2 (en) | System for recognizing spoken sounds from continuous speech and method of using same | |
US6021387A (en) | Speech recognition apparatus for consumer electronic applications | |
US6219642B1 (en) | Quantization using frequency and mean compensated frequency input data for robust speech recognition | |
US6347297B1 (en) | Matrix quantization with vector quantization error compensation and neural network postprocessing for robust speech recognition | |
EP0617827B1 (en) | Composite expert | |
US5832181A (en) | Speech-recognition system utilizing neural networks and method of using same | |
CN109192200A (en) | A kind of audio recognition method | |
Devi et al. | A novel approach for speech feature extraction by cubic-log compression in MFCC | |
Chauhan et al. | Speech recognition and separation system using deep learning | |
Jagadeeshwar et al. | ASERNet: Automatic speech emotion recognition system using MFCC-based LPC approach with deep learning CNN | |
CN101246686A (en) | Method and device for identifying analog national language single tone by continuous quadratic Bayes classification method | |
US6275799B1 (en) | Reference pattern learning system | |
Nijhawan et al. | Real time speaker recognition system for hindi words | |
Mirhassani et al. | Fuzzy decision fusion of complementary experts based on evolutionary cepstral coefficients for phoneme recognition | |
Kulkarni et al. | Comparison between SVM and other classifiers for SER | |
Nijhawan et al. | A comparative study of two different neural models for speaker recognition systems | |
Mahkonen et al. | Cascade processing for speeding up sliding window sparse classification | |
Sunny et al. | A comparative study of parametric coding and wavelet coding based feature extraction techniques in recognizing spoken words | |
Salimovna et al. | A Study on the Methods and Algorithms Used for the Separation of Speech Signals | |
Nssr et al. | INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY TAJWEED UTOMATION SYSTEM USING HIDDEN MARKOUV MODEL AND NURAL NETWORK |
Legal Events
Code | Title | Description |
---|---|---|
WWE | Wipo information: entry into national phase | Ref document number: 95193473.2; Country of ref document: CN |
AK | Designated states | Kind code of ref document: A1; Designated state(s): AM AT AU BB BG BR BY CA CH CN CZ DE DK EE ES FI GB GE HU IS JP KE KG KP KR KZ LK LR LT LU LV MD MG MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TT UA UG UZ VN |
AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): KE MW SD SZ UG AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG |
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
WWE | Wipo information: entry into national phase | Ref document number: 2190619; Country of ref document: CA |
RET | De translation (de og part 6b) | Ref document number: 19581667; Country of ref document: DE; Date of ref document: 19970507 |
WWE | Wipo information: entry into national phase | Ref document number: 19581667; Country of ref document: DE |
122 | Ep: pct application non-entry in european phase | |