WO1994002936A1 - Voice recognition apparatus and method - Google Patents
- Publication number
- WO1994002936A1 (application PCT/US1993/006647)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- word
- voice
- templates
- template
- spoken
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
Definitions
- the present invention is directed to a voice recognition system, including various voice training, noise compensation and error-correction methods used therein.
- voice recognition is not a new science but at the same time has not been widely used by the public. Because people are not used to voice recognition devices, when first introduced to the technology they tend to make certain predictable mistakes which can have significant impact on the recognition accuracy of such devices.
- a "speaker dependent" voice recognition system is one where the user must first train the voice recognition system with a vocabulary of particular words or phrases which are then converted into voice templates. These voice templates are stored and compared later against new voice commands uttered by the user to determine whether there is a match.
- Systems such as that of Patent No. 4,989,248 attempt to solve this "first pass" problem by requiring the user to make multiple passes through the system's vocabulary. Such systems create a single averaged template based on multiple utterances by the user of the same word. Unlike such prior art systems, the present invention requires only two (2) passes, and saves the voice templates from both passes. This method significantly reduces voice training time and results in higher recognition accuracy.
- Noise rejection is typically handled in two ways by voice recognition devices — hardware and software.
- the hardware approach to noise rejection is to provide a bandpass filter on the audio section analog front end.
- the analog band pass filter passes all information content in the area of interest of the voice spectrum and filters out the spectral content below the lower and above the upper cutoff points of the filter, reducing the amount of ambient noise passed on to the voice recognizer.
- the voice recognition device will typically also perform some sort of digital filtering to reduce as much of the noise as possible before performing pattern matching on the information.
- Digital filtering is a relatively effective way of reducing noise depending on the filter applied but it is also typically a very processor intensive operation, thus limiting the speed of the device to perform voice recognition.
- Performing digital noise filtering in real time while also performing voice recognition typically requires either a digital signal processor (DSP) or a relatively high-speed 16-bit microprocessor. Because of the high cost and large size associated with implementing a design with one of these processors, if not with both, such a design would typically be limited to larger computer-type user applications and would not be suitable for lightweight, portable applications, for example.
- the present invention provides an alternate scheme for performing noise rejection, requiring significantly less processing power than such prior art digital filtering techniques. Instead of performing conventional digital noise filtering, the present invention uses noise templates. Noise templates are similar to voice templates except the information contained within the noise template represents the sound of the ambient noise. When the voice recognizer performs voice recognition, the pattern matching algorithms include these noise templates in with the set of active voice templates.
- voice recognition is not an absolute science.
- a voice recognition system will on occasion misrecognize a spoken word even in the absence of noise. This is typical of voice recognition systems designed for the high end of the market as well as the low end.
- the level of user frustration caused by the misrecognized word is somewhat dependent on the nature of how that word was to be used by the device and also the means the device provides for correcting the recognition error.
- the user is given a second choice, and, if continuous misrecognition occurs, he or she can repeat the word and/or replace the original voice template.
- a first object of the present invention therefore is to provide a training method for a voice recognition system.
- the present invention uses a two pass method to teach a system a set of words from the user's voice.
- In a first pass, reference voice templates are captured, one for each word in the vocabulary.
- In a second pass, a second set of reference voice templates related to the same vocabulary words is captured.
- As each template is captured during the second pass, it is compared with the corresponding reference voice template from the first pass to determine if they match within certain predetermined limits.
- If they match, the system stores both templates in a memory.
- If they do not match, the system performs an additional routine to generate an acceptable set of voice templates for the spoken word.
- the system prompts the user to capture yet a third reference voice template related to the same word. It then compares the third voice template separately with the first and second voice templates to determine if it matches with either of these templates within certain predetermined limits. When there is a match at this stage, the system stores in a memory the third voice template with whichever of the first or second voice templates is a better match. But assuming there is no adequate match, the system prompts the user to capture yet a fourth separate voice template corresponding to the spoken word. At this point, the system then stores in memory whichever two of the first, second, third and fourth voice templates are the closest match to each other.
- a second object of the present invention is to provide a system and method of performing voice recognition that includes compensation for environmental noise.
- reference voice templates are first captured as described above.
- Noise reference templates consisting of preselected multiple noises are hardcoded in ROM.
- an analog input circuit conditions the analog signal representing the spoken voice command.
- An analog to digital converter circuit converts the analog voice information to digital information to be processed by a microprocessor to generate a digital voice command template based on the voice command signal.
- a template memory contains the reference voice templates and the noise templates.
- a program memory contains a control program which is executed by a processor to determine whether the voice command template matches one of the reference voice templates or the noise templates. When a match is made with a reference voice template, the appropriate action is taken by the system. When a match is made with a noise template, no action is taken.
- the third object of the invention is to provide a dynamic, real time means by way of voice commands of correcting recognition errors made by the voice recognition system. This is accomplished as follows: during voice recognition of a user's voice commands, an input voice template of a word spoken by the user is compared with the previously described reference voice templates which correspond to a set of reference words. The system selects a first reference word from the first reference voice templates that best matches the input voice template. It then presents that first choice to the user as the spoken word. If the user accepts the first choice, the system performs whatever action is necessary in response to that voice command.
- If the user rejects the first choice by voice, the system presents a second reference word from the first set of reference templates that is the next best match to the input voice template. Should the user now accept the second choice, the system again performs whatever action is necessary in response to the input voice template. Should the user reject the second choice by voice as well, the user is prompted to repeat the spoken word in order to generate a new input voice template. The new input voice template is compared for a match against the reference voice templates, and the system again presents the next best choice for the spoken word to the user. If the user accepts this new choice, the system again performs whatever action is necessary in response to the input voice template.
- If the user rejects this new choice as well, the system then prompts the user to retrain the reference template that corresponds to the spoken word.
- the spoken word is selected from an active word group, which is a limited subset of the system's vocabulary.
- the new voice reference template corresponding to the spoken word is created, and it takes the place of the prior problem reference voice template.
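The correction sequence summarized above can be sketched roughly as follows. This is a minimal illustration, not the patent's control program: it assumes each template is a fixed-length feature vector compared by Euclidean distance, keeps only one reference template per word for brevity, and reads "next best choice" after the repeated word as "best match not yet rejected". The names `ranked`, `correct_by_no`, `capture` and `user_accepts` are hypothetical.

```python
from math import dist


def ranked(input_template, references):
    """Reference words ordered from best to worst match (smallest distance first)."""
    return sorted(references, key=lambda w: dist(references[w], input_template))


def correct_by_no(capture, user_accepts, references):
    """Return the accepted word, or None when the word should be retrained.

    capture()        -> input template for the word the user just spoke
    user_accepts(w)  -> False models the user answering "NO" to choice w
    references       -> dict mapping word -> reference template (at least two words)
    """
    rejected = []
    first, second = ranked(capture(), references)[:2]
    for choice in (first, second):          # first choice, then second-best choice
        if user_accepts(choice):
            return choice
        rejected.append(choice)
    # Two "NO"s: the user repeats the word and is offered the best remaining choice.
    for word in ranked(capture(), references):
        if word not in rejected:
            return word if user_accepts(word) else None
    return None


references = {"one": [1.0, 0.0], "two": [0.0, 1.0], "three": [1.0, 1.0]}
inputs = iter([[0.9, 0.6]])
print(correct_by_no(lambda: next(inputs), lambda w: w == "one", references))
```

A `None` result corresponds to the point at which the system prompts the user to retrain the problem word.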
- FIG. 1 is a block diagram of a voice recognition system in accordance with the present invention.
- FIG. 2 is a flow chart detailing the two pass voice training method implemented in the present invention.
- FIG. 3 is a flow chart detailing the noise template recognition method implemented in the present invention.
- FIG. 4 is a flow chart detailing the "NO" error correction method implemented in the present invention.
- DETAILED DESCRIPTION OF THE INVENTION
- A block diagram of the hardware of the present invention is shown in Figure 1.
- the invention consists generally of an Analog Audio Input 1, an analog to digital converter (ADC) 2, a Microprocessor 3 (including various output control lines), a RAM 4, a ROM 5, and an LCD 6.
- Analog Audio Input 1 receives the audio information originating from the user's voice, which a microphone converts into an electrical signal, and then conditions this electrical signal for digital conversion by ADC 2.
- Analog Audio Input 1 includes a microphone on the front end that is mounted against the front housing of the present invention.
- the microphone is preferably physically mounted to a printed circuit board on which the present invention is located by means of a rubber grommet (not shown).
- This grommet not only provides a means to physically mount the microphone to the printed circuit board but also provides mechanical isolation required between the hardware and the microphone. This mechanical isolation isolates the microphone from any mechanical noise induced within the unit such as when the user depresses a voice activation switch as well as mechanical noise when holding the plastic.
- The output of the microphone is fed into an analog input section of Analog Audio Input 1.
- the signal is then conditioned by well-known electronic circuits that amplify and filter the voice input signal from the microphone prior to going to ADC 2.
- An example of a typical prior art speech recognition amplifying and conditioning circuit can be seen in U.S. Patent No. 4,054,749 (Suzuki et al.) which is incorporated by reference herein.
- Analog Audio Input 1 consists of three stages of gain and filtering. The first stage of the audio section provides a signal gain of 40 with frequency emphasis characteristics of 6 dB per octave at the upper end of the band pass.
- the frequency emphasis is used to amplify the voice information at the upper end of the frequency spectrum which increases voice recognition accuracy.
- the second stage of Analog Audio Input 1 consists of an amplifier circuit that provides for analog band pass filtering. This filtering band passes maximum useful voice information while filtering out unwanted noise outside the band pass.
- the band pass section of this analog circuit has minimal gain with a frequency response roll-off characteristic of 18 dB per octave.
- the overall frequency response of the analog section is 300 to 4800 Hz.
- the third and final stage of Analog Audio Input 1 provides for analog gain control (AGC) of the voice input signal.
- Microprocessor 3 adjusts the level of the ADC input signal for maximum signal to noise ratio, thus enhancing recognition performance.
- the AGC compensates for variations in audio levels as the user speaks into the present invention, and also compensates volume variations which can result from the user speaking from various distances into the microphone. While one embodiment of Analog Audio Input 1 is shown herein, it would be apparent to one skilled in the art that any equivalent circuit for conditioning audio voice information could be used in lieu thereof.
- Analog Audio Input 1 feeds into an 8-bit analog-to-digital converter (ADC) 2 which samples the data at 9.6 kHz.
- ADC 2 then outputs digital information related to the input analog voice signal to Microprocessor 3.
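As a quick check on the figures quoted above (8-bit samples at 9.6 kHz, and an analog pass band of 300 to 4800 Hz), the short calculation below works out the Nyquist frequency, the raw data rate and the number of quantization levels. It is only arithmetic on the stated values; the variable names are illustrative.

```python
# Figures taken from the text: 8-bit samples at 9.6 kHz, analog pass band 300-4800 Hz.
SAMPLE_RATE_HZ = 9_600
BITS_PER_SAMPLE = 8
PASSBAND_HZ = (300, 4_800)

# Highest frequency that can be represented without aliasing at this sample rate.
nyquist_hz = SAMPLE_RATE_HZ / 2

# Raw data rate delivered to the microprocessor, in bytes per second.
bytes_per_second = SAMPLE_RATE_HZ * BITS_PER_SAMPLE // 8

# Number of distinct amplitude levels an 8-bit converter can report.
quantization_levels = 2 ** BITS_PER_SAMPLE

print(nyquist_hz, bytes_per_second, quantization_levels)
```

The upper edge of the analog pass band (4800 Hz) coincides with the Nyquist frequency of the 9.6 kHz sample rate, so the band pass filter also serves to limit the signal to what the converter can represent.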
- Voice recognition is well-known in the art, and skilled artisans will appreciate that any acceptable prior art method can be used in the present invention.
- the voice recognition software employed in the present invention is part of a control program described in detail in U.S. application serial no. 07/915,112 which has been filed on behalf of the present assignee concurrently herewith and which is hereby incorporated by reference as if fully set forth herein.
- This software is also embodied in a computer program registered in its entirety with the U.S. Copyright Office as registration no. TXu488458 on September 13, 1991 and which is also hereby incorporated by reference as if fully set forth herein.
- ROM 5 contains a 24 kbyte control program consisting of microcode instructions executed by Microprocessor 3 to effectuate voice recognition and other functions described below.
- An additional 1.5 kbytes of RAM 4 for temporary storage is used by Microprocessor 3 for computing and for storage of information needed frequently.
- Microprocessor 3 processes the control program within ROM 5 and performs a comparison between newly created voice templates and previously stored voice templates to see if there is a match. If a substantially equivalent voice template (the tolerance limit is a design choice) is found within the stored set, then Microprocessor 3 generates a control output to some other device to perform some additional task in response to the voice command.
- This control output could be, for example, to turn a light on after hearing the words "LIGHT" and "ON."
- control of the other devices can be accomplished entirely by a spoken voice command.
- Liquid crystal display (LCD) 6 can be any form of conventional display that can visually display information to the user. Alternatively an audio speaker could be used in lieu of LCD 6 to convey information to the user.
- speaker dependent recognition devices require the user to provide voice patterns or templates for each word that is to be recognized.
- the two pass approach to voice training employed here means the user goes through the entire set of vocabulary words two separate times.
- the flow diagram for the two pass method used in the software is shown in Figure 2.
- In a first step 200, voice training for the two pass approach is initiated by capturing words on the first pass to be used as a first word template.
- the user's voice information is captured and converted into a voice signal by Analog Audio Input 1 as explained above.
- This voice signal is converted into a digital voice template by ADC 2 and Microprocessor 3 for each word articulated by the user.
- This step is repeated, and the user is prompted (such as on LCD 6, for example) to continue going through the entire set of words in the vocabulary until a first pass 201 is complete.
- After the first pass, a complete set of digital voice templates exists, one for each word in the vocabulary.
- each of the templates generated from the first pass is stored in a temporary storage buffer within Microprocessor 3 to be later used for comparison.
- the purpose of the first pass is twofold. First, the user becomes familiar with the process of training the device and with the prompted words. Second, the word templates generated from the first pass are used for comparison against the word templates from the second pass to test the relative quality of the spoken word of each individual.
- the user on the second pass will generally pronounce words in a manner more consistent with their normal speech.
- the user sequences through the identical set of words in the same order as the first pass.
- the software first checks whether the word came from the first or second pass 201 and, if from the second pass, applies a pattern matching test 203 to the word to determine how close a match exists between the word templates from the two separate passes.
- the voice recognition software scores the results from a pattern matching algorithm.
- At step 204, if the score for the two word templates to be compared meets certain defined criteria for a good word score, then the control program at 208 stores the second word template into RAM 4 along with the first word template. Both are later used for voice recognition since the use of two templates, rather than a single averaged template, results in better recognition.
- If the score does not meet the criteria, a mismatch routine is then executed beginning at 205 to find two acceptable voice templates for the spoken word by further prompting of the user.
- the first and second pass word template are placed into a separate temporary storage buffer in Microprocessor 3 for later use. Either of the two word templates may be of poor quality at this time although the chances are that the first word template is the more likely to be the one of concern based on the earlier discussions.
- In the next step of the mismatch routine at 206, after the second pass fails to meet the matching criteria, the user is taken back to step 200 and prompted to repeat the same word a third time.
- the word template from the third pass is compared against the word templates generated from both the first pass and the second pass to see if it meets the set scoring criteria with either of the two prior templates. Then, in steps 204 and 208, if the word template from the third pass compares acceptably with the word template from either the first or second pass, the third pass word template and the other matching word template are stored in RAM 4 to be later used for voice recognition. The word template that did not score favorably is deleted from the temporary storage buffer.
- If the third pass template matches neither prior template, the three templates are stored in a temporary buffer at 205, and the user is prompted again as seen in steps 206 and 200.
- the control program goes through the pattern matching process 203 comparing the fourth word template against the word templates generated from the first, second and third pass.
- If the fourth word template matches one of the earlier templates, the control program stores the two matching word templates in RAM 4 at step 208. The two word templates that did not meet the test criteria are deleted from the temporary storage buffer.
- Otherwise, the control program picks the two word templates that are the closest match and stores these two word templates in RAM 4.
- the two remaining word templates with the poorest match are deleted from the temporary storage buffer. This process continues until all of the words in the vocabulary that are to be used for voice recognition are trained by the user, and there are now two separate voice templates for each word in the vocabulary. This process is repeated for each user as they go through word training. This process can also be used when retraining problem words to again assure high quality word templates.
- The control program illustrated in the flow diagram goes through the process of re-prompting the word and performing a comparison on the word templates a total of 4 times, including 3 times on the second pass. It would be apparent to a skilled artisan, however, that the algorithm is not limited to the number of times the comparison is performed and any number of passes could be easily implemented. The major limitation on the number of passes performed is the user's level of patience in going through an excessive number of passes.
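The training flow for a single word (steps 200 through 208 of Figure 2) might be sketched as below. This is a hedged illustration only: the text does not say how templates are encoded or scored, so the sketch assumes fixed-length feature vectors, Euclidean distance as the score, and an arbitrary `MATCH_LIMIT`; `train_word` and `capture` are hypothetical names.

```python
from itertools import combinations
from math import dist
from typing import Callable, Sequence, Tuple

Template = Sequence[float]   # assumption: a template is a fixed-length feature vector
MATCH_LIMIT = 1.0            # hypothetical stand-in for the "predetermined limits"


def train_word(capture: Callable[[], Template],
               max_utterances: int = 4) -> Tuple[Template, Template]:
    """Return the two templates kept for one vocabulary word."""
    taken = [capture(), capture()]          # first-pass and second-pass utterances
    while True:
        newest = taken[-1]
        # Compare the newest template with every earlier one (test 203 / step 204).
        best_prev = min(taken[:-1], key=lambda t: dist(t, newest))
        if dist(best_prev, newest) <= MATCH_LIMIT:
            return best_prev, newest        # step 208: store the matching pair
        if len(taken) == max_utterances:
            # No pair met the limit: keep the two closest of everything captured.
            return min(combinations(taken, 2), key=lambda p: dist(*p))
        taken.append(capture())             # steps 205/206: re-prompt the user


utterances = iter([[0.0, 0.0], [5.0, 5.0], [0.1, 0.0]])
print(train_word(lambda: next(utterances)))
```

In the usage line, the second utterance is far from the first, so a third is requested; it matches the first, and those two are kept while the second is discarded. Raising `max_utterances` corresponds to the remark that any number of passes could be implemented.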
- voice recognition may then be performed.
- a new voice template is generated, and this new template is compared against both the first and second voice template sets to determine a match.
- the use of two voice template sets improves the quality of voice recognition.
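One way to picture recognition against two stored templates per word, rather than a single averaged one, is to score each word by its better-matching template. The sketch below assumes feature-vector templates and Euclidean distance, neither of which is specified in the text; `word_score` and `best_word` are illustrative names.

```python
from math import dist


def word_score(input_template, stored_pair):
    """Score a word by the closer of its two stored templates (lower is better)."""
    return min(dist(input_template, t) for t in stored_pair)


def best_word(input_template, vocabulary):
    """vocabulary maps each word to the pair of templates kept from training."""
    return min(vocabulary, key=lambda w: word_score(input_template, vocabulary[w]))


vocabulary = {
    "light": ([0.0, 0.0], [1.0, 0.0]),
    "on": ([5.0, 5.0], [6.0, 5.0]),
}
print(best_word([0.9, 0.1], vocabulary))
```

Here the input lies close to the second "light" template but farther from the first, so keeping both templates lets the word match whichever way the user happened to say it.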
- An approach employed in the present invention for digital noise filtering is illustrated in Figure 3.
- the approach described here uses noise templates in conjunction with word templates as part of an active word group.
- the noise templates are modeled and used in an identical manner as the word templates by the pattern matching routine in the control program.
- the noise templates, instead of containing word template information from the user's speech, contain information representing the ambient noise found in the user's environment. Either of two approaches may be used to create the reference noise templates.
- the noise templates can be predetermined and hardcoded in memory if the voice recognition is typically going to take place in similar environments. Alternatively, if the ambient noise in the environment that the voice recognition is to be performed is not predictable or uniform, the user can dynamically record the noise templates and store them into RAM 4.
- the user, as he or she is generating word templates, can be prompted on LCD 6 to record any background noise. This could be accomplished by simply activating the system while not speaking when the noise is present.
- the recorded noise can consist of a single element background noise or can consist of a complex set of background noises combined into a single noise template.
- the present invention as shown in Figure 3 improves the performance of a voice recognition system.
- the present invention captures and processes audio input data to generate voice templates in the manner described above.
- the control program within ROM 5 includes a pattern recognition routine which is run at 301 by Microprocessor 3 and the best choice is determined from the two separate sets of voice templates learned as described above. It should be emphasized that the captured voice data is only compared with voice templates from the appropriate "active word group," which is described further below. In general, however, the active word group consists of all of the stored noise templates as well as the active vocabulary of words.
- If the best choice is a noise template, the software branches back to capturing the next word with no action taken by Microprocessor 3 since no command word was received. If instead the best choice is a valid word, Microprocessor 3 at 303 takes action on the valid word and then continues to its next operation. In this manner, the present invention reduces the number of misrecognized words by recognizing input noise as noise only, and not as a command word that needs to be acted upon. This could arise, for example, where the user pauses between words and the system then captures a voice-absent but noise-filled template.
- Because the process used to detect noise is the same process used for voice recognition and is not a secondary operation, this noise detection approach requires substantially less processing power than prior art digital filter approaches.
- the number of noise templates to be used is not fixed and can be determined by several factors. The first item to consider when determining the number of noise templates is the surrounding ambient noise present where the present invention is to be used. The noise may vary from a background low frequency noise to high frequency noise to people talking in the background. The maximum number of templates required is determined by how many noise templates can be used to model the background noise. The second item to be considered when choosing the number of noise templates to be used is the processing time. Although the present invention's approach requires substantially less processing power than the typical digital noise filter, the number of noise templates will affect the recognition time in the same manner as adding words to the active word group.
- the noise template method shown herein may be used in conjunction with digital filtering. Depending on the filter used and the predictability of the background noise, the noise templates may further improve the noise rejection capability of the voice recognition process.
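The noise template idea can be illustrated with a small sketch in which noise templates sit in the active word group alongside word templates and the same matcher handles both. The feature-vector encoding, the Euclidean distance and the `NOISE` label are assumptions made for illustration; `recognize` is a hypothetical name.

```python
from math import dist

NOISE = "<noise>"   # hypothetical label marking an entry as a noise template


def recognize(input_template, active_group):
    """Return the best-matching word, or None when the best match is noise.

    active_group is a list of (label, template) pairs holding both the active
    vocabulary and the stored noise templates.
    """
    label, _ = min(active_group, key=lambda entry: dist(entry[1], input_template))
    return None if label == NOISE else label   # noise match: take no action


active_group = [
    ("one", [1.0, 0.0]),
    ("two", [0.0, 1.0]),
    (NOISE, [0.1, 0.1]),
]
print(recognize([0.1, 0.9], active_group))   # close to "two"
print(recognize([0.0, 0.2], active_group))   # close to the noise template
```

Because every noise template is just one more entry to compare, each added noise template costs about the same recognition time as one added word, which matches the trade-off described above.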
- Voice recognition inherently has recognition errors regardless of the sophistication or cost of the platform used. Depending on the type of operation being performed when a recognition error occurs, the user may be interrupted from the voice recognition process and be required to perform some additional corrective procedure that disrupts the operation being performed.
- the process employed by the present invention shown in Figure 4 does not require the user to stop the voice operation flow to correct the misrecognized word.
- an audibilized "NO" command by the user permits simple and fast corrective action.
- the present invention recognizes this command as a corrective command at 402, and checks the number of "NO"s articulated by the user at that time. After recognizing the first "NO" corrective command, the control program branches to 403 and selects the second best choice from the word templates from the active word group.
- By active word group, what is meant is that the software takes into account the context of the user's voice command. For example, assume that the voice recognition method described here is used in a television remote control device, and the user first spoke the word "channel," which is correctly recognized. The control program would presume, based on the context of the command, that the next voice command should be a channel number, ranging from one to nine. Thus, the numbers one, two, three, etc.
- the "NO" command aborts the action that was to have been executed by the system in response to the misrecognized word and instead replaces the misrecognized word with the next best choice. This choice is then shown to the user and will typically be the correct choice. If the correct choice is made by the recognizer after the first "NO" command, the microprocessor acts on that command and the user then proceeds with the voice operation. It is apparent that the advantage of using the voice command "NO" for correcting the recognition error is significant.
- the second best choice is typically the correct choice and the procedure to correct the misrecognized word does not require the user to change the mode of operation they were using. That is, the user does not have to go outside the normal voice recognition routine, or go to some means other than their voice to take corrective action. Both of these actions can break the user's sequence and train of thought.
- the user now selects the problem word from the abbreviated set of words in the active word group and retrains only the problem word.
- After the problem word has been retrained at 408, it replaces the misrecognized word with the word that was selected by the user as the problem word to be retrained at 409.
- Although the algorithm takes the user through several steps, the user typically gets the word correct on the first "NO" command and never has to interrupt his or her sequence of voice commands. Even if they went through the entire sequence, the approach described still provides a simpler way of correcting problem words and still allows the user to continue their operation without having to abort.
- the present invention improves the quality of voice recognition by providing the use of two word templates for each word in a voice recognition system.
- the two word templates are created independently and compared against each other to ensure they are a relatively good match.
- the present invention also improves the quality of voice recognition by providing the use of noise templates. By recognizing background noise, the present invention reduces the misrecognition rate of prior art systems.
- the noise template approach described herein is significantly simpler and more cost-effective, and can also be used in a device presently employing hardware or software noise filters to further enhance the noise rejection properties of such devices.
- the present invention provides a simple and efficient means of correcting voice recognition errors by the use of a voice command.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Machine Translation (AREA)
- Analogue/Digital Conversion (AREA)
Abstract
Analog voice is accepted as input (1), digitized (2), and further processing is done with a computer (3, 4, 5 and 6). Templates stored in memory (4, 5) are compared with the digitized input in order to allow a user to retrain templates which do not yield a good match.
Description
VOICE RECOGNITION APPARATUS AND METHOD
FIELD OF THE INVENTION
The present invention is directed to a voice recognition system, including various voice training, noise compensation and error-correction methods used therein.
BACKGROUND OF THE INVENTION
The nature of voice recognition systems is such that they have unique problems that can become frustrating to the user if not handled in the proper manner. Voice recognition is not a new science, but it has not been widely used by the public. Because people are not used to voice recognition devices, when first introduced to the technology they tend to make certain predictable mistakes which can have a significant impact on the recognition accuracy of such devices. A "speaker dependent" voice recognition system is one where the user must first train the voice recognition system with a vocabulary of particular words or phrases which are then converted into voice templates. These voice templates are stored and compared later against new voice commands uttered by the user to determine whether there is a match. With a speaker dependent voice recognition system, one of the most common difficulties the first-time user has is the proper training of the word templates for each of the words in the vocabulary. When voice training a system for the generation of voice templates, the first-time user is typically somewhat tentative and/or apprehensive and consequently pronounces the word in a manner different than how they would normally pronounce it during typical use. Also, the first-time user, when voice training the system, typically puts more emphasis on certain parts of the word being trained than they normally would during normal speech, or they increase or decrease the pause between syllables, all of which will have a significant negative effect on the overall quality of the voice recognition. Typical prior art systems, such as U.S. Patent No. 4,989,248, attempt to solve this "first pass" problem by requiring the user to make multiple passes through the system's vocabulary. Such systems create a single averaged template based on multiple utterances by the user of the same word. Unlike such prior art systems, the present invention only requires two (2) passes, and saves both voice templates from both passes. This method significantly reduces voice training time and results in higher recognition accuracy.
Another inherent problem of voice recognition, based on the fact that it operates on the principle of having to select the best choice from what it hears in the specified audio spectrum, is the occasional decoding of ambient noise into an unwanted recognized word. The extent of this problem is dependent on the amount and type of ambient noise present in the environment. Noise rejection is typically handled in two ways by voice recognition devices: hardware and software. The hardware approach to noise rejection is to provide a band pass filter on the audio section analog front end. The analog band pass filter passes all information content in the area of interest of the voice spectrum and filters out the spectral content below the lower and above the upper cutoff points of the filter, reducing the amount of ambient noise passed on to the voice recognizer. Because ambient noise with spectral content within the frequency range of interest is passed through to the recognizer, the voice recognition device will typically also perform some sort of digital filtering to reduce as much of the noise as possible before performing pattern matching on the information. Digital filtering is a relatively effective way of reducing noise, depending on the filter applied, but it is also typically a very processor intensive operation, thus limiting the speed at which the device can perform voice recognition. To perform digital noise filtering in real time as well as perform voice recognition typically requires either a digital signal processor (DSP) or a relatively high speed 16 bit microprocessor. Because of the high cost and large size associated with implementing a design with one of these processors, if not with both, the design would typically be limited to larger computer type user applications and would not be suitable for lightweight, portable applications, for example. The present invention provides an alternate scheme for performing noise rejection, requiring significantly less processing power than such prior art digital filtering techniques. Instead of performing conventional digital noise filtering, the present invention uses noise templates. Noise templates are similar to voice templates except that the information contained within the noise template represents the sound of the ambient noise. When the voice recognizer performs voice recognition, the pattern matching algorithms include these noise templates in the set of active voice templates.
Finally, because voice recognition is not an absolute science, a voice recognition system will on occasion misrecognize a spoken word even in the absence of noise. This is typical of voice recognition systems designed for the high end of the market as well as the low end. The level of user frustration caused by the misrecognized word is somewhat dependent on how that word was to be used by the device and also on the means the device provides for correcting the recognition error. In the present invention, if a word is misrecognized, the user is given a second choice, and, if continuous misrecognition occurs, he or she can repeat the word and/or replace the original voice template.
SUMMARY OF THE INVENTION
A first object of the present invention therefore is to provide a training method for a voice recognition system. The present invention uses a two pass method to teach a system a set of words from the user's voice. On a first pass, reference voice templates are captured, one for each word in the vocabulary. In a second pass, a second set of reference voice templates related to the same vocabulary words is captured. As each template is captured during the second pass, it is compared with the corresponding reference voice template from the first pass to determine if they match within certain predetermined limits. When there is a match between the two templates, the system stores both templates in a memory. When there is no match (a mismatch) the system performs an additional routine to generate an acceptable set of voice templates for the spoken word.
In the mismatch routine, the system prompts the user to capture yet a third reference voice template related to the same word. It then compares the third voice template separately with the first and second voice templates to determine if it matches with either of these templates within certain predetermined limits. When there is a match at this stage, the system stores in a memory the third voice template with whichever of the first or second voice templates is a better match. But assuming there is no adequate match, the system prompts the user to capture yet a fourth separate voice template corresponding to the spoken word. At this point, the system then stores in memory whichever two of the first, second, third and fourth voice templates are the closest match to each other.
A second object of the present invention is to provide a system and method of performing voice recognition that includes compensation for environmental noise. This is accomplished in the following manner: reference voice templates are first captured as described above. Noise reference templates consisting of preselected multiple noises are hardcoded in ROM. During voice recognition, an analog input circuit conditions the analog signal representing the spoken voice command. An analog to digital converter circuit converts the analog voice information to digital information to be processed by a microprocessor to generate a digital voice command template based on the voice command signal. A template memory contains the reference voice templates and the noise templates. A program memory contains a control program which is executed by a processor to determine whether the voice command template matches one of the reference voice templates or the noise templates. When a match is made with a reference voice template, the appropriate action is taken by the system. When a match is made with a noise template, no action is taken. In this manner, the present system also recognizes background noise as an acceptable input template.
The third object of the invention is to provide a dynamic, real time means by way of voice commands of correcting recognition errors made by the voice recognition system. This is accomplished as follows: during voice recognition of a user's voice commands, an input voice template of a word spoken by the user is compared with the previously described reference voice templates which correspond to a set of reference words. The system selects a first reference word from the first reference voice templates that best matches the input voice template. It then presents that first choice to the user as the spoken word. If the user accepts the first choice, the system performs whatever action is necessary in response to that voice command. However, should the user reject the first choice by means of a voice correction command, the system presents a second reference word from the first set of reference templates that is the next best match to the input voice template. Should the user now accept the second choice, the system again performs whatever action is necessary in response to the input voice template. Should the user reject the second choice by voice as well, the user is prompted to repeat the spoken word in order to generate a new input voice template. The new input voice template is compared for a match against the reference voice templates, and the system again presents the next best choice for the spoken word to the user. If the user accepts this new choice, the system again performs whatever action is necessary in response to the input voice template. However, should the user reject the new choice also (this being the third rejection) the system then prompts the user to retrain the reference template that corresponds to the spoken word. The spoken word is selected from an active word group, which is a limited subset of the system's vocabulary. The new voice reference template corresponding to the spoken word is created, and it takes the place of the prior problem reference voice template.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of a voice recognition system in accordance with the present invention;
FIG. 2 is a flow chart detailing the two pass voice training method implemented in the present invention;
FIG. 3 is a flow chart detailing the noise template recognition method implemented in the present invention; and
FIG. 4 is a flow chart detailing the "NO" error correction method implemented in the present invention.
DETAILED DESCRIPTION OF THE INVENTION
A. Configuration Of The Present System
A block diagram of the hardware of the present invention is shown in Figure 1. As can be seen there, the invention consists generally of an Analog Audio Input 1, an analog to digital converter (ADC) 2, a Microprocessor 3 (including various output control lines), a RAM 4, a ROM 5, and an LCD 6. In a preferred embodiment, ADC 2, Microprocessor 3, RAM 4 and ROM 5 are all included on a single microcontroller chip, model no. NN1872410 made by Panasonic. These circuits are described in detail below.
Analog Audio Input 1 converts audio information originating from the user's voice and converted by a microphone into an electrical signal and then conditions this electrical signal for digital conversion by ADC 2. In a preferred embodiment, Analog Audio Input 1 includes a microphone on the front end that is mounted against the front housing of the present invention. The microphone is preferably physically mounted to a printed circuit board on which the present invention is located by means of a rubber grommet (not shown). This grommet not only provides a means to physically mount the microphone to the printed circuit board but also provides mechanical isolation required between the hardware and the microphone. This mechanical isolation isolates the microphone from any mechanical noise induced within the unit such as when the user depresses a voice activation switch as well as mechanical noise when holding the plastic.
The output of the microphone is fed into an analog input section of Analog Audio Input 1. The signal is then conditioned by well-known electronic circuits that amplify and filter the voice input signal from the microphone prior to going to ADC 2. An example of a typical prior art speech recognition amplifying and conditioning circuit can be seen in U.S. Patent No. 4,054,749 (Suzuki et al.) which is incorporated by reference herein. In the present preferred embodiment, Analog Audio Input 1 consists of three stages of gain and filtering. The first stage of the audio section provides a signal gain of 40 with frequency emphasis characteristics of 6 dB per octave at the upper end of the band pass. The frequency emphasis is used to amplify the voice information at the upper end of the frequency spectrum, which increases voice recognition accuracy. The second stage of Analog Audio Input 1 consists of an amplifier circuit that provides for analog band pass filtering. This filtering passes the maximum useful voice information while filtering out unwanted noise outside the band pass. The band pass section of this analog circuit has minimal gain with a frequency response roll off characteristic of 18 dB per octave. The overall frequency response of the analog section is 300 to 4800 Hz. The third and final stage of Analog Audio Input 1 provides for analog gain control (AGC) of the voice input signal.
Microprocessor 3 adjusts the level of the ADC input signal for maximum signal to noise ratio, thus enhancing recognition performance. The AGC compensates for variations in audio levels as the user speaks into the present invention, and also compensates for volume variations which can result from the user speaking into the microphone from various distances. While one embodiment of Analog Audio Input 1 is shown herein, it would be apparent to one skilled in the art that any equivalent circuit for conditioning audio voice information could be used in lieu thereof.
The output of Analog Audio Input 1 feeds into an 8-bit analog-to-digital converter (ADC) 2 which samples the data at 9.6 kHz. ADC 2 then outputs digital information related to the input analog voice signal to Microprocessor 3.
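The figures quoted for the analog section and ADC 2 can be checked with a little arithmetic. The following is only an illustrative calculation from the numbers stated above (8-bit samples, a 9.6 kHz sampling rate, and a 300 to 4800 Hz pass band); the variable names are hypothetical and are not taken from the control program.

```python
# Values stated in the text for ADC 2 and the analog section.
SAMPLE_RATE_HZ = 9600      # sampling rate of ADC 2
BITS_PER_SAMPLE = 8        # resolution of ADC 2
BAND_HIGH_HZ = 4800        # upper edge of the analog pass band

# Highest frequency representable at this sampling rate (Nyquist frequency).
nyquist_hz = SAMPLE_RATE_HZ / 2

# Number of distinct amplitude levels an 8-bit sample can take.
quantization_levels = 2 ** BITS_PER_SAMPLE

# Raw, unprocessed data rate coming out of the converter.
raw_bytes_per_second = SAMPLE_RATE_HZ * BITS_PER_SAMPLE // 8

print(nyquist_hz, quantization_levels, raw_bytes_per_second)
```

Under these figures the upper edge of the pass band coincides with the Nyquist frequency of the converter, and the raw stream amounts to 9600 bytes per second before any template is generated.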
Voice recognition is well-known in the art, and skilled artisans will appreciate that any acceptable prior art method can be used in the present invention. The voice recognition software employed in the present invention is part of a control program described in detail in U.S. application serial no. 07/915,112 which has been filed on behalf of the present assignee concurrently herewith and which is hereby incorporated by reference as if fully set forth herein. This software is also embodied in a computer program registered in its entirety with the U.S. Copyright Office as registration no. TXu488458 on September 13, 1991 and which is also hereby incorporated by reference as if fully set forth herein. Microprocessor 3, executing a control program (which includes a voice recognition routine) within ROM 5, generates voice templates from the digital signals from ADC 2. For this purpose, ROM 5 contains a 24 kbyte control program consisting of microcode instructions executed by Microprocessor 3 to effectuate voice recognition and other functions described below. An additional 1.5 kbytes of RAM 4 is used by Microprocessor 3 for computing and for temporary storage of information needed frequently. Microprocessor 3 processes the control program within ROM 5 and performs a comparison between newly created voice templates and previously stored voice templates to see if there is a match. If a substantially equivalent voice template (the tolerance limit is a design choice) is found within the stored set, then Microprocessor 3 generates a control output to some other device to perform some additional task in response to the voice command. This control output could be, for example, to turn a light on after hearing the words "LIGHT" and "ON." In this manner, control of the other devices can be accomplished entirely by a spoken voice command.
Liquid crystal display (LCD) 6 can be any form of conventional display that can visually display information to the user. Alternatively an audio speaker could be used in lieu of LCD 6 to convey information to the user.
B. Voice Training
As explained earlier, speaker dependent recognition devices require the user to provide voice patterns or templates for each word that is to be recognized. The two pass approach to voice training employed here means the user goes through the entire set of vocabulary words two separate times.
The flow diagram for the two pass method used in the software is shown in Figure 2. As a first step 200, voice training for the two pass approach is initiated by capturing words on the first pass to be used as a first word template. The user's voice information is captured and converted into a voice signal by Analog Audio Input 1 as explained above. This voice signal is converted into a digital voice template by ADC 2 and Microprocessor 3 for each word articulated by the user. This step is repeated, and the user is prompted (such as on LCD 6, for example) to continue going through the entire set of words in the vocabulary until a first pass 201 is complete. At the conclusion of this part of the control program, a complete set of digital voice templates exists, one for each word in the vocabulary. As seen in step 202, each of the templates generated from the first pass is stored in a temporary storage buffer within Microprocessor 3 to be later used for comparison. The purpose of the first pass is twofold. First, the user becomes familiar with the process of training the device and with the prompted words. Second, the word templates generated from the first pass are used for a comparison against the word templates from the second pass to test the relative quality of the spoken word of each individual.
As explained above, the user on the second pass will generally pronounce words in a manner more consistent with their normal speech. Thus, during the second pass of the two pass voice training method, the user sequences through the identical set of words in the same order as the first pass. After each word from the second pass is prompted and the user speaks the prompted word, the software first checks to see if the word was the first or second pass 201, and if from the second pass, performs a pattern matching test 203 to the word from the second pass to determine how close of a match exists between each word template in the two separate passes. To perform a comparison test between the word templates from the two passes, the voice recognition software scores the results from a pattern matching algorithm. In step 204, if the score for the two word templates to be compared meets certain defined criteria for a good word score then the control program at 208 stores the second word template into RAM 4 along with the first word template. Both are later used for voice recognition since the use of two templates, rather than a single averaged template, results in better recognition.
In the event that the first two templates do not match, a mismatch routine is then executed beginning at 205 to find two acceptable voice templates for the spoken word by further prompting of the user. The first and second pass word templates are placed into a separate temporary storage buffer in Microprocessor 3 for later use. Either of the two word templates may be of poor quality at this time, although the first word template is the more likely to be the one of concern based on the earlier discussions. In the next step of the mismatch routine at 206, after failing to meet the matching criteria for the second pass, the user is taken back to step 200 and prompted to repeat the same word a third time. This time at 203, the word template from the third pass is compared against the word templates generated from both the first pass and the second pass to see if it meets the set scoring criteria with either of the two prior templates. Then, in steps 204 and 208, if the word template from the third pass compares acceptably with the word template from either the first or second pass, the third pass word template and the other matching word template are stored in RAM 4 to be later used for voice recognition. The word template that did not score favorably is deleted from the temporary storage buffer.
Assuming, however, that the third word template also did not compare acceptably with either of the first or second word templates, the three templates are stored in a temporary buffer at 205, and the user is re-prompted again as seen in steps 206 and 200. On this fourth and final pass, the control program goes through the pattern matching process 203 comparing the fourth word template against the word templates generated from the first, second and third pass. At step 204, if the word template from the fourth pass gets an acceptable score by matching up with the word template from any one of the other passes, the control program stores the two matching word templates in RAM 4 at step 208. The two word templates that did not meet the test criteria are deleted from the temporary storage buffer. At 207, if the test score from the word template comparison on the word generated in the fourth pass does not meet the set scoring criteria with any of the three prior templates, the control program picks the two word templates that are the closest match and stores these two word templates in RAM 4. The two remaining word templates with the poorest match are deleted from the temporary storage buffer. This process continues until all of the words in the vocabulary that are to be used for voice recognition are trained by the user, and there are now two separate voice templates for each word in the vocabulary. This process is repeated for each user as they go through word training. This process can also be used when retraining problem words to again assure high quality word templates. In the preferred embodiment shown in Figure 2, the control program illustrated in the flow diagram goes through the process of re-prompting the word and performing a comparison on the word templates a total of 4 times, including 3 times on the second pass. It would be apparent to a skilled artisan, however, that the algorithm is not limited to the number of times the comparison is performed and any number of passes could be easily implemented. The major limitation on the number of passes performed is the user's level of patience in going through an excessive number of passes.
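The per-word training flow of Figure 2, including the mismatch routine, might be sketched as follows. This is a minimal sketch under stated assumptions, not the control program itself: the text does not disclose the pattern matching score, so `distance` (mean absolute difference, lower is closer) is a stand-in, and `capture`, `threshold` and `max_passes` are hypothetical names introduced for illustration.

```python
from itertools import combinations


def distance(a, b):
    """Hypothetical score: mean absolute difference (lower is closer)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def train_word(capture, threshold, max_passes=4):
    """Return the two templates to keep for one vocabulary word.

    capture is a hypothetical callback that prompts the user and returns
    one template (a list of numbers) each time it is called.
    """
    templates = [capture()]                      # first pass
    while len(templates) < max_passes:
        newest = capture()                       # second, third, fourth pass
        # Compare the newest template against every earlier one.
        best_prior = min(templates, key=lambda t: distance(t, newest))
        if distance(best_prior, newest) <= threshold:
            return best_prior, newest            # acceptable match: keep both
        templates.append(newest)
    # No pair met the criteria: keep the two templates closest to each other.
    return min(combinations(templates, 2), key=lambda pair: distance(*pair))
```

Calling `train_word` once per vocabulary word would yield the two stored templates for that word; raising `max_passes` corresponds to the remark above that the number of passes is not fixed.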
After the present invention is voice-trained twice with a complete vocabulary, voice recognition may then be performed. At this point, when the user speaks a command, a new voice template is generated, and this new template is compared against both the first and second voice template sets to determine a match. The use of two voice template sets improves the quality of voice recognition.
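One way to picture matching a new template against both stored sets is shown below. The text does not say how the two scores for a word are combined, so taking the better of the two is an assumption, as is the `distance` score itself; `recognize` and `references` are hypothetical names.

```python
def distance(a, b):
    """Hypothetical score: mean absolute difference (lower is closer)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def recognize(input_template, references):
    """Return the word whose stored templates best match the input.

    references maps each word to its (first, second) stored templates.
    A word is scored by the closer of its two templates (an assumption).
    """
    def word_score(word):
        return min(distance(input_template, t) for t in references[word])

    return min(references, key=word_score)
```

For example, with `references = {"LIGHT": ([0, 0], [1, 1]), "ON": ([9, 9], [5, 5])}`, the input `[4, 4]` is closest to the second "ON" template, so "ON" is selected even though the first "ON" template is a poorer match than either "LIGHT" template.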
C. Noise Rejection
In almost every environment that voice recognition is employed there is surrounding ambient noise of some type. This ambient noise can cause a voice recognition system to misrecognize words.
An approach employed in the present invention for noise rejection is illustrated in Figure 3.
The approach described here uses noise templates in conjunction with word templates as part of an active word group. The noise templates are modeled and used in an identical manner as the word templates by the pattern matching routine in the control program. The noise templates, instead of containing word template information from the user's speech, contain information representing the ambient noise found in the user's environment. Either of two approaches may be used to create the reference noise templates. First, the noise templates can be predetermined and hardcoded in memory if the voice recognition is typically going to take place in similar environments. Alternatively, if the ambient noise in the environment that the voice recognition is to be performed is not predictable or uniform, the user can dynamically record the noise templates and store them into RAM 4. To illustrate how this can be implemented, the user as he or she is generating word templates, can be prompted on LCD 6 to record any background noise. This could be accomplished by simply activating the system, while at the same time not speaking when the noise is present. The recorded noise can consist of a single element background noise or can consist of a complex set of background noises combined into a single noise template.
The present invention as shown in Figure 3 improves the performance of a voice recognition system. At 300 and 301, the present invention captures and processes audio input data to generate voice templates in the manner described above. The control program within ROM 5 includes a pattern recognition routine which is run at 301 by Microprocessor 3, and the best choice is determined from the two separate sets of voice templates learned as described above. It should be emphasized that the captured voice data is only compared with voice templates from the appropriate "active word group," which is described further below. In general, however, the active word group consists of all of the stored noise templates as well as the active vocabulary of words. As shown in 302, if the best match is a noise template, the software branches back to capturing the next word with no action taken by Microprocessor 3, since no command word was received. If instead the best choice is a valid word, Microprocessor 3 at 303 takes action on the valid word and then continues to its next operation. In this manner, the present invention reduces the number of misrecognized words by recognizing input noise as noise only, and not as a command word that needs to be acted upon. This could arise, for example, where the user pauses between words and the system then captures a voice-absent but noise-filled template.
Because the process used to detect noise is the same process as used for voice recognition and is not a secondary operation, this noise detection approach requires substantially less processing power than prior art digital filter approaches. The number of noise templates to be used is not fixed and can be determined by several factors. The first item to consider when determining the number of noise templates is the surrounding ambient noise present where the present invention is to be used. The noise may vary from a background low frequency noise to high frequency noise to people talking in the background. The maximum number of templates required is determined by how many noise templates are needed to model the background noise. The second item to be considered when choosing the number of noise templates to be used is the processing time. Although the present invention's approach requires substantially less processing power than the typical digital noise filter, the number of noise templates will affect the recognition time in the same manner as adding words to the active word group.
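The branch at 302 and 303 of Figure 3 can be sketched as a single best-match search in which noise templates compete alongside word templates. This is an illustrative sketch only: `distance` is an assumed score, and `handle_utterance`, `word_templates`, `noise_templates` and `act` are hypothetical names.

```python
def distance(a, b):
    """Hypothetical score: mean absolute difference (lower is closer)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def handle_utterance(input_template, word_templates, noise_templates, act):
    """Act on the best-matching word, or do nothing if noise matches best.

    word_templates maps each active word to its stored templates;
    noise_templates is a list of stored noise templates;
    act is a hypothetical callback that carries out a command.
    """
    # Noise templates are scored in the same pool as word templates.
    candidates = [(distance(input_template, t), word)
                  for word, templates in word_templates.items()
                  for t in templates]
    candidates += [(distance(input_template, t), None)
                   for t in noise_templates]
    _, best = min(candidates, key=lambda c: c[0])
    if best is None:
        return None      # best match is a noise template: take no action
    act(best)            # best match is a valid word: act on it
    return best
```

Because every noise template adds one more comparison to the same loop, the sketch also reflects the point above that recognition time grows with the number of noise templates just as it does with added words.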
The noise template method shown herein may be used in conjunction with digital filtering. Depending on the filter used and the predictability of the background noise, the noise templates may further improve the noise rejection capability of the voice recognition process.
D. Dynamic Voice Controlled Word Error Correction/Replacement
Voice recognition inherently has recognition errors regardless of the sophistication or cost of the platform used. Depending on the type of operation being performed when a recognition error occurs, the user may be interrupted from the voice recognition process and be required to perform some additional corrective procedure that disrupts the operation being performed. The process employed by the present invention shown in Figure 4 does not require the user to stop the voice operation flow to correct the misrecognized word. In this embodiment, a spoken "NO" command by the user permits simple and fast corrective action. When an error is made by the voice recognition software at step 400, the user continues in his normal manner and says the word "NO," which is recognized at 401 before advancing to the next word. The present invention recognizes this command as a corrective command at 402, and checks the number of "NO"s articulated by the user at that time. After recognizing the first "NO" corrective command, the control program branches to 403 and selects the second best choice from the word templates from the active word group. By "active word group" what is meant is that the software takes into account the context of the user's voice command. For example, assume that the voice recognition method described here is used in a television remote control device, and the user first spoke the word "channel," which is correctly recognized. The control program would presume, based on the context of the command, that the next voice command should be a channel number, ranging from one to nine. Thus, the numbers one, two, three, etc. and "NO" would constitute the "active word group" at that point. Accordingly, when the voice recognition software misrecognizes a word, it then only presents a second best choice from the active word group. The reduction of the possible second best choices from the entire vocabulary to merely those in the active word group significantly enhances the recognition accuracy as well as the speed of the present invention.
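The effect of the active word group on the second best choice can be illustrated by ranking only the words that are active in the current context. This is a sketch under assumptions: `distance` is a stand-in score, and `rank_active_words`, `references` and `active_words` are hypothetical names not found in the control program.

```python
def distance(a, b):
    """Hypothetical score: mean absolute difference (lower is closer)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def rank_active_words(input_template, references, active_words):
    """Return the active words ordered from best match to worst.

    references maps every vocabulary word to its stored templates;
    only the words listed in active_words are scored.
    """
    def score(word):
        return min(distance(input_template, t) for t in references[word])

    return sorted(active_words, key=score)
```

The first element of the returned list plays the role of the first choice and the second element the choice offered after the first "NO"; words outside the active group, such as "LIGHT" after "channel" has been spoken, are never scored at all.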
At step 403, the "NO" command aborts the action that was to have been executed by the system in response to the misrecognized word and instead replaces the misrecognized word with the next best choice. This choice is then shown to the user and will typically be the correct choice. If the correct choice is made by the recognizer after the first "NO" command, the microprocessor acts on that command and the user then proceeds with the voice operation. The advantage of using the voice command "NO" for correcting the recognition error is significant. The second best choice is typically the correct choice, and the procedure to correct the misrecognized word does not require the user to change the mode of operation they were using. That is, the user does not have to go outside the normal voice recognition routine, or go to some means other than their voice, to take corrective action. Either of these actions can break the user's sequence and train of thought.
Looking again at Figure 4, assume that the user has taken corrective action with the "NO" command, but the second choice word is still incorrect. This problem may have been caused by a mispronounced word, background noise or a bad template. Consequently, at 404, after saying "NO" a second time, the user has another opportunity to take corrective action, again without having to stop the normal sequence taken with the voice recognition process. In particular, after speaking "NO" into the device a second time at 401, the user is prompted at 405 to repeat the misrecognized word. If the word is now properly recognized at 401, the user can proceed with his normal operation.
Assuming that the word is still misrecognized, however, there is more than likely a problem with the stored word template, possibly caused by a change in the user's voice. Thus, at 404, if the word is misrecognized for a third time, the user is prompted at 406 to retrain the problem word.
At 407 the user now selects the problem word from the abbreviated set of words in the active word group and retrains only the problem word. After the problem word has been retrained at 408, it replaces the misrecognized word with the word that was selected by the user as the problem word to be retrained at 409. Although the algorithm takes the user through several steps, the user typically gets the word correct on the first "NO" command and never has to interrupt his or her sequence of voice commands. Even if they went through the entire sequence, the approach described still provides a simpler way of correcting problem words and still allows the user to continue their operation without having to abort.
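The escalation described for Figure 4 can be sketched as a short routine. This is only an illustrative outline under stated assumptions: the callbacks `user_accepts`, `repeat_and_recognize` and `retrain` are hypothetical stand-ins for the display, microphone and training hardware, and "accepting" a word here simply means that the user did not say "NO".

```python
from typing import Callable, List


def correct_word(
    ranked: List[str],
    user_accepts: Callable[[str], bool],
    repeat_and_recognize: Callable[[], str],
    retrain: Callable[[], str],
) -> str:
    """Return the word finally acted on, escalating one level per "NO"."""
    # Initial recognition: present the best match from the active word group.
    if user_accepts(ranked[0]):
        return ranked[0]
    # First "NO": present the next best choice from the active word group.
    if len(ranked) > 1 and user_accepts(ranked[1]):
        return ranked[1]
    # Second "NO": prompt the user to repeat the word and recognize it again.
    repeated = repeat_and_recognize()
    if user_accepts(repeated):
        return repeated
    # Third misrecognition: the user selects the problem word and retrains it.
    return retrain()


# Scenario: the user meant "nine"; the first two choices are wrong,
# and the repeated utterance is then recognized correctly.
intended = "nine"
result = correct_word(
    ranked=["one", "five"],
    user_accepts=lambda word: word == intended,
    repeat_and_recognize=lambda: "nine",
    retrain=lambda: "nine (retrained)",
)
```

In the common case the routine returns at the second check, which mirrors the observation above that the user typically gets the correct word on the first "NO" command.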
SUMMARY OF INVENTION
Thus, the present invention improves the quality of voice recognition by providing two word templates for each word in a voice recognition system. The two word templates are created independently and compared against each other to ensure they are a relatively good match. The present invention also improves the quality of voice recognition by providing the use of noise templates. By recognizing background noise, the present invention reduces the misrecognition rate of prior art systems. Moreover, the noise template approach described herein is significantly simpler and more cost-effective, and can also be used in a device presently employing hardware or software noise filters to further enhance the noise rejection properties of such devices. Finally, the present invention provides a simple and efficient means of correcting voice recognition errors by the use of a voice command. By prompting the user to repeat misrecognized words, or, in the worst case, to dynamically replace problem reference words, the user is permitted to correct errors while using a voice recognition system with minimal disruption of the voice recognition process. While one embodiment of the invention has been presented above, the scope of the invention is to be determined by the following claims.
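The two-template training check summarized above, and set out step by step in claims 7 and 8, can be sketched as follows. This is a minimal illustration, not the patented implementation: templates are modeled as equal-length lists of numbers, `distance` is a simple mean absolute difference standing in for whatever comparison the device actually uses, and `limit` is a hypothetical value for the "predetermined limits".

```python
from itertools import combinations
from typing import Callable, Sequence, Tuple

Template = Sequence[float]


def distance(a: Template, b: Template) -> float:
    """Mean absolute difference; a stand-in for the real comparison metric."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def train_word(capture: Callable[[], Template], limit: float) -> Tuple[Template, Template]:
    """Capture utterances of one word until two templates can be stored."""
    first, second = capture(), capture()
    # The first two templates match within the limit: store both.
    if distance(first, second) <= limit:
        return first, second
    # Mismatch routine: capture a third template and compare it with each.
    third = capture()
    d1, d2 = distance(third, first), distance(third, second)
    if min(d1, d2) <= limit:
        # Store the third template with whichever earlier one matches better.
        return (first, third) if d1 <= d2 else (second, third)
    # Still no match: capture a fourth and keep the two closest of the four.
    fourth = capture()
    candidates = [first, second, third, fourth]
    return min(combinations(candidates, 2), key=lambda p: distance(p[0], p[1]))


# Hypothetical utterances: the second capture is an outlier (e.g. noise).
samples = iter([[1.0, 2.0], [5.0, 6.0], [1.1, 2.1]])
pair = train_word(lambda: next(samples), limit=0.2)
```

Here the outlying second capture is discarded and the first and third captures are stored as the word's two reference templates.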
Claims
1. A system for performing voice recognition comprising: an input circuit for generating a voice command signal from a spoken voice command; and a converter circuit for generating a voice command template from the voice command signal; a template memory containing first and second separate reference voice templates, each template being related to a reference word; and a program memory containing a control program; a processor coupled to the template memory and program memory for executing the control program to determine whether the voice command template matches either one of the first or second separate reference templates.
2. The system of claim 1, wherein the template memory also includes noise templates, and wherein the processing circuit determines whether the voice command template matches one of the first or second separate reference templates or the noise templates.
3. The system of claim 1, including a display for displaying the reference word.
4. A method of performing voice recognition, comprising the steps of:
(a) capturing one or more first separate reference voice templates related to a set of spoken words; and
(b) capturing a second reference voice template for each spoken word; and
(c) capturing an input voice template; and
(d) comparing the input voice template separately with each of the first and second reference voice templates to determine if a match exists within predetermined limits.
5. The method of claim 4, further including the steps of: capturing or hardcoding in ROM one or more noise reference templates; and wherein at step (c) the input voice template is compared separately with each of the first and second reference voice templates and the noise reference templates to determine if a match exists within predetermined limits.
6. The method of claim 4, wherein during steps (a) and (b) the user is prompted on a display to speak the words.
7. A method of training a system to perform voice recognition, comprising the steps of:
(a) capturing a first voice template corresponding to a spoken word; and
(b) capturing a second voice template corresponding to the same spoken word; and
(c) comparing the first and second voice templates to determine the occurrence of either a match or a mismatch within predetermined limits; and
(d) upon determining that there is a match, storing said templates in a memory; and
(e) upon determining that there is a mismatch, performing a mismatch routine to generate two separate voice templates for the spoken word.
8. The method of claim 7, wherein the mismatch routine includes the steps of:
(f) capturing a third separate voice template corresponding to the spoken word; and
(g) comparing the third voice template separately with the first and second voice templates to determine the occurrence of either a match or mismatch with either of said templates within predetermined limits; and
(h) upon determining a match, storing in a memory the third voice template with whichever of the first or second voice templates is a better match; and
(i) upon determining a mismatch, capturing a fourth separate voice template corresponding to the spoken word and storing in a memory whichever two of the first, second, third and fourth voice templates are the closest match to each other.
9. The method of claim 7, wherein during steps (a) and (b) the words are prompted on a display.
10. A method of performing recognition of a user's voice, comprising the steps of:
(a) generating an input voice template from a spoken word;
(b) comparing the input voice template with one or more reference voice templates each corresponding to a reference word; and
(c) selecting a reference voice template and corresponding reference word that best matches the input voice template; and
(d) presenting the reference word to the user; and
(e) repeating steps (a), (b) and (c) to determine whether a corrective word was spoken or not spoken; and
(f) upon determining that a corrective word was not spoken, performing an action corresponding to the reference word; and
(g) upon determining that a corrective word was spoken, performing a first corrective routine to present an acceptable reference word to the user.
11. The method of claim 10 wherein the first corrective routine includes the steps of:
(h) presenting a second reference word from the reference templates that is the next best match to the input voice template; and
(i) repeating steps (a), (b) and (c) to determine whether a corrective word was spoken or not spoken; and
(j) upon determining that a corrective word was not spoken, performing an action corresponding to the second choice reference word; and
(k) upon determining that a corrective word was spoken, performing a second corrective routine to present an acceptable choice reference word to the user.
12. The method of claim 11, wherein the second corrective routine includes the steps of:
(l) prompting the user to repeat the spoken word in order to generate a new input voice template; and
(m) presenting a reference word from the reference templates that is the best match to the new input voice template; and
(n) repeating steps (a), (b) and (c) to determine whether a corrective word was spoken or not spoken; and
(p) upon determining that a corrective word was not spoken, performing an action corresponding to the second choice reference word; and
(q) upon determining that a corrective word was spoken, performing a third corrective routine to present an acceptable reference word to the user.
13. The method of claim 12, wherein the third corrective routine includes the steps of:
(r) selecting a reference voice template to be replaced; and
(s) prompting the user to repeat the reference word corresponding to the reference voice template; and
(t) capturing and storing a new reference voice template corresponding to the reference word; and
(u) repeating steps (a) through (t) until an acceptable reference word is presented to the user.
14. The method of claim 12, wherein the second reference word is within a subset of words in an active word group.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU46785/93A AU4678593A (en) | 1992-07-17 | 1993-07-15 | Voice recognition apparatus and method |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US91593892A | 1992-07-17 | 1992-07-17 | |
US915,938 | 1992-07-17 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1994002936A1 (en) | 1994-02-03 |
Family
ID=25436455
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1993/006647 WO1994002936A1 (en) | 1992-07-17 | 1993-07-15 | Voice recognition apparatus and method |
Country Status (3)
Country | Link |
---|---|
AU (1) | AU4678593A (en) |
MX (1) | MX9304334A (en) |
WO (1) | WO1994002936A1 (en) |
1993
- 1993-07-15 AU AU46785/93A patent/AU4678593A/en not_active Abandoned
- 1993-07-15 WO PCT/US1993/006647 patent/WO1994002936A1/en active Application Filing
- 1993-07-16 MX MX9304334A patent/MX9304334A/en unknown
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4403114A (en) * | 1980-07-15 | 1983-09-06 | Nippon Electric Co., Ltd. | Speaker recognizer in which a significant part of a preselected one of input and reference patterns is pattern matched to a time normalized part of the other |
US4773093A (en) * | 1984-12-31 | 1988-09-20 | Itt Defense Communications | Text-independent speaker recognition system and method based on acoustic segment matching |
US4852181A (en) * | 1985-09-26 | 1989-07-25 | Oki Electric Industry Co., Ltd. | Speech recognition for recognizing the catagory of an input speech pattern |
US4751737A (en) * | 1985-11-06 | 1988-06-14 | Motorola Inc. | Template generation method in a speech recognition system |
US4769844A (en) * | 1986-04-03 | 1988-09-06 | Ricoh Company, Ltd. | Voice recognition system having a check scheme for registration of reference data |
US4837831A (en) * | 1986-10-15 | 1989-06-06 | Dragon Systems, Inc. | Method for creating and using multiple-word sound models in speech recognition |
US4984275A (en) * | 1987-03-13 | 1991-01-08 | Matsushita Electric Industrial Co., Ltd. | Method and apparatus for speech recognition |
US5027406A (en) * | 1988-12-06 | 1991-06-25 | Dragon Systems, Inc. | Method for interactive speech recognition and training |
US5105465A (en) * | 1989-03-06 | 1992-04-14 | Kabushiki Kaisha Toshiba | Speech recognition apparatus |
US5008941A (en) * | 1989-03-31 | 1991-04-16 | Kurzweil Applied Intelligence, Inc. | Method and apparatus for automatically updating estimates of undesirable components of the speech signal in a speech recognition system |
US5127043A (en) * | 1990-05-15 | 1992-06-30 | Vcs Industries, Inc. | Simultaneous speaker-independent voice recognition and verification over a telephone network |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1994018668A1 (en) * | 1993-02-04 | 1994-08-18 | Nokia Telecommunications Oy | A method of transmitting and receiving coded speech |
AU670361B2 (en) * | 1993-02-04 | 1996-07-11 | Nokia Telecommunications Oy | A method of transmitting and receiving coded speech |
US5715362A (en) * | 1993-02-04 | 1998-02-03 | Nokia Telecommunications Oy | Method of transmitting and receiving coded speech |
EP0702351A3 (en) * | 1994-08-16 | 1997-10-22 | Ibm | Method and apparatus for analysing audio input events in a speech recognition system |
US5651056A (en) * | 1995-07-13 | 1997-07-22 | Eting; Leon | Apparatus and methods for conveying telephone numbers and other information via communication devices |
US5794205A (en) * | 1995-10-19 | 1998-08-11 | Voice It Worldwide, Inc. | Voice recognition interface apparatus and method for interacting with a programmable timekeeping device |
US5907825A (en) * | 1996-02-09 | 1999-05-25 | Canon Kabushiki Kaisha | Location of pattern in signal |
WO1998028733A1 (en) * | 1996-12-24 | 1998-07-02 | Koninklijke Philips Electronics N.V. | A method for training a speech recognition system and an apparatus for practising the method, in particular, a portable telephone apparatus |
Also Published As
Publication number | Publication date |
---|---|
MX9304334A (en) | 1994-04-29 |
AU4678593A (en) | 1994-02-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states | Kind code of ref document: A1; Designated state(s): AU CA JP KR |
AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE |
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
NENP | Non-entry into the national phase | Ref country code: CA |
122 | Ep: pct application non-entry in european phase |