WO2011016129A1 - Voice recognition device, voice recognition method, and voice recognition program - Google Patents
- Publication number
- WO2011016129A1 (PCT/JP2009/063996)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- recognition
- speech
- speech recognition
- recognition result
- voice
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/32—Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
Definitions
- the present invention relates to a speech recognition technology for recognizing hierarchical utterance commands.
- Patent Document 1 discloses a technique for recognizing a voice input as a rephrasing when a further voice input is received within a predetermined time width after a voice input has been received.
- Patent Document 2 discloses a technique related to the present invention.
- An object of the present invention is to provide a voice recognition device capable of recognizing a recurrent (rephrased) utterance with a minimum number of steps even when a misrecognition occurs.
- the invention according to claim 1 is a voice recognition device for voice recognition of hierarchical commands, comprising: dictionary storage means for storing a dictionary used in voice recognition of each hierarchy for recognizing the hierarchical commands; first voice recognition means for, when voice data is input, performing voice recognition based on the dictionary to be used on the assumption that the previously input voice data was correctly recognized; second voice recognition means for, when voice data is input, performing voice recognition based on the dictionary used when the previously input voice data was recognized; and recognition result processing means for determining a final recognition result by weighting the recognition result of the first voice recognition means and the recognition result of the second voice recognition means based on an elapsed time width measured from the time of processing of the previously input voice data.
- the invention according to claim 8 is a speech recognition method used by a speech recognition apparatus that stores a dictionary used in speech recognition of each layer for recognizing a layer command, the method comprising: a first speech recognition step of, when speech data is input, performing speech recognition based on the dictionary to be used on the assumption that the previously input speech data was correctly recognized; a second speech recognition step of, when speech data is input, performing speech recognition based on the dictionary used when the previously input speech data was recognized; and a recognition result processing step of determining a final recognition result by weighting the recognition result of the first speech recognition step and the recognition result of the second speech recognition step based on an elapsed time width measured from the time of processing of the previously input speech data.
- the invention according to claim 9 is a speech recognition program executed by a speech recognition apparatus that stores a dictionary used in speech recognition of each layer for recognizing a layer command, the program causing the speech recognition apparatus to function as: first voice recognition means for, when voice data is input, performing voice recognition based on the dictionary to be used on the assumption that the previously input voice data was correctly recognized; second voice recognition means for, when voice data is input, performing voice recognition based on the dictionary used when the previously input voice data was recognized; and recognition result processing means for determining a final recognition result by weighting the recognition result of the first voice recognition means and the recognition result of the second voice recognition means based on an elapsed time width measured from the time of processing of the previously input voice data.
- in one aspect, a speech recognition apparatus for speech recognition of hierarchical commands comprises: dictionary storage means for storing a dictionary used in speech recognition of each layer for recognizing the hierarchical commands; first speech recognition means for, when speech data is input, performing speech recognition based on the dictionary to be used on the assumption that the previously input speech data was correctly recognized; second speech recognition means for, when speech data is input, performing speech recognition based on the dictionary used when the previously input speech data was recognized; and recognition result processing means for determining a final recognition result by weighting the recognition result of the first speech recognition means and the recognition result of the second speech recognition means based on an elapsed time width measured from the time of processing of the previously input speech data.
- the above speech recognition apparatus recognizes a hierarchical command and includes a dictionary storage unit, a first speech recognition unit, a second speech recognition unit, and a recognition result processing unit.
- a hierarchical command refers to a command that realizes one operation command or output command to a device to be controlled through two or more utterances. That is, in the case of a hierarchical command, one operation command or output command is specified step by step, hierarchically, based on the voice-recognized commands.
- the dictionary storage means stores a dictionary used for speech recognition in each hierarchy for recognizing hierarchy commands.
- the “dictionary used in speech recognition in each layer” refers to a dictionary used in each speech recognition process necessary to realize any one operation command or output command.
- the “voice recognition process” refers to the entire process necessary for recognizing one piece of voice data.
- the dictionary stores a plurality of recognized words that perform pattern matching with the input voice data.
- the first speech recognition means performs speech recognition based on a dictionary to be used when it is assumed that speech data input last time is correctly recognized when speech data is input.
- the first voice recognition means determines a dictionary to be used based on the voice data input immediately before, and performs voice recognition of the newly input voice data.
- the second voice recognition means performs voice recognition based on the dictionary used when voice recognition of the voice data input last time is executed.
- the second voice recognition means performs voice recognition of the newly input voice data on the assumption that the voice recognition of the voice data input immediately before is erroneous recognition.
- the recognition result processing means weights the recognition result of the first voice recognition means and the recognition result of the second voice recognition means based on the elapsed time width measured from the time of processing of the previously input voice data, and determines the final recognition result.
- the “recognition result” indicates the degree of similarity between each recognized word stored in the used dictionary and newly input speech data.
- “Elapsed time width” refers to the time width from the start or end of processing of the previously input audio data to the start of processing of the newly input audio data, or a time width corresponding thereto.
- the “final recognition result” refers to a recognition result that is finally output by the speech recognition apparatus.
- the final recognition result is, for example, a predetermined number of recognized words together with their degrees of similarity, arranged in descending order of similarity.
- the speech recognition apparatus weights the recognition results of the two recognition processes based on the elapsed time width. As a result, even if there is a misrecognition, the speech recognition apparatus can recognize the recurrent utterance with a minimum number of steps without performing the utterance or operation corresponding to the correction operation or starting over from the beginning.
- the recognition result processing unit determines the final recognition result based only on the recognition result of the first speech recognition unit when the elapsed time width is a predetermined width or more.
- the predetermined width is set to an appropriate value based on experiments or the like. Specifically, it is set to an elapsed time width at which the possibility that the newly input voice data is a rephrasing of the previously input voice data is zero or extremely low. In this aspect, the speech recognition apparatus can thus avoid unnecessary processing when a recurrent utterance is impossible or unlikely.
- the recognition result processing unit determines the final recognition result based only on the recognition result of the second speech recognition unit when the elapsed time width is less than a predetermined width.
- the predetermined width is set to an appropriate value based on experiments or the like. Specifically, the predetermined width is set to, for example, an elapsed time width in which it is highly likely that the newly input audio data is a rephrase of the previously input audio data.
- the speech recognition apparatus can reduce unnecessary processing when there is a high possibility that the speech is recurrent.
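The threshold rule of the two aspects above can be sketched as follows. This is a non-authoritative illustration: the function names and the default threshold value are assumptions, since the patent only says the predetermined width is tuned by experiment.

```python
def select_final_result(elapsed_t, result_first, result_second, width=10.0):
    """Select which recognizer's output becomes the final recognition result.

    elapsed_t     -- elapsed time width T in seconds
    result_first  -- result assuming the previous utterance was correctly
                     recognized (next layer's dictionary)
    result_second -- result assuming the previous utterance was
                     misrecognized (same dictionary as last time)
    width         -- the "predetermined width"; illustrative default,
                     tuned by experiment in practice
    """
    if elapsed_t >= width:
        # A rephrasing is unlikely: use only the first recognizer's result.
        return result_first
    # A rephrasing is likely: use only the second recognizer's result.
    return result_second
```

Skipping the other recognizer entirely in each branch is what lets the apparatus "reduce unnecessary processing" in both aspects.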
- in one aspect, the first speech recognition means and the second speech recognition means each calculate a likelihood as the recognition result, and the recognition result processing means performs the weighting by multiplying the likelihood calculated by the first speech recognition means by a first parameter that is a non-increasing function of the elapsed time width, and by multiplying the likelihood calculated by the second speech recognition means by a second parameter that is a non-decreasing function of the elapsed time width.
- “likelihood” includes log likelihood.
- the speech recognition apparatus gradually raises the weight, that is, the importance in the final recognition result, of the recognition result of the first speech recognition means by multiplying its likelihood by the first parameter, a non-increasing function of the elapsed time width. Likewise, it gradually lowers the weight of the recognition result of the second speech recognition means by multiplying its likelihood by the second parameter, a non-decreasing function of the elapsed time width. In this way, the speech recognition apparatus can recognize a recurrent utterance with a minimum number of steps even when a misrecognition occurs, without requiring an utterance or operation corresponding to a correction operation and without starting over from the beginning.
- the recognition result processing means sets a sum of the first parameter and the second parameter to a constant value.
- the above-mentioned constant value is set to 1, for example. In this way, the speech recognition apparatus can easily determine the other parameter based on one of the first parameter and the second parameter, and set these parameters to appropriate values.
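A minimal sketch of this parameterization, under assumptions: the patent fixes only that the first parameter is non-increasing in T, the second non-decreasing, and their sum constant (here 1); the clipped linear ramp and the 10-second scale are illustrative. Note that because the likelihoods are log-likelihoods (negative values), a smaller coefficient yields a greater weight.

```python
def weighting_coefficients(elapsed_t, width=10.0):
    """Return (w1, w2) with w1 + w2 = 1 and 0 < w1, w2 < 1.

    Since a smaller coefficient means a greater weight for negative
    log-likelihoods, w1 is made non-increasing in the elapsed time width
    T (the first recognizer gains importance over time) and w2 = 1 - w1
    non-decreasing.  The clipped linear ramp is an assumed form.
    """
    eps = 1e-3
    w2 = min(max(elapsed_t / width, eps), 1.0 - eps)  # non-decreasing in T
    w1 = 1.0 - w2                                     # non-increasing in T
    return w1, w2

def weighted_likelihoods(l1, l2, elapsed_t):
    """Apply equations (1) and (2): Lw1 = L1 * w1, Lw2 = L2 * w2."""
    w1, w2 = weighting_coefficients(elapsed_t)
    return l1 * w1, l2 * w2
```

Because the sum is held at 1, computing either coefficient immediately determines the other, as the aspect above notes.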
- in one aspect, the speech recognition apparatus further includes environment information acquisition means for acquiring information related to the recognition environment, and when the recognition result processing means determines, based on that information, that the recognition environment is inferior, it increases the weight of the recognition result of the second speech recognition means compared with the case where the recognition environment is not inferior.
- “Recognition environment” refers to an environment in which a speech recognition apparatus performs speech recognition, which affects the accuracy of the speech recognition. In general, when the recognition environment is poor, there is a high possibility of erroneous recognition. That is, in this case, the speaker is likely to restate. Accordingly, when the speech recognition apparatus determines that the recognition environment is inferior, the speech recognition apparatus can perform speech recognition with higher accuracy by increasing the weight of the recognition result of the second speech recognition means than usual.
- in one aspect, the environment information acquisition means is installed in a vehicle and acquires the vehicle speed and/or the degree of noise included in the speech data, and the recognition result processing means increases the weight of the recognition result of the second speech recognition means as the vehicle speed and/or the degree of noise increases.
- the “degree of noise” corresponds to, for example, the S / N ratio.
- when the vehicle speed is high, the driver concentrates on driving, so the response to the voice recognition device tends to be delayed and the elapsed time width tends to increase. In addition, when the vehicle speed is high, the noise associated with traveling is expected to increase.
- the speech recognition apparatus therefore determines that the higher the vehicle speed and/or the degree of noise, the higher the probability of misrecognition and hence the higher the possibility of a recurrent utterance. In this aspect, the speech recognition apparatus can execute speech recognition with higher accuracy.
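The environment-dependent adjustment could be sketched as below. The thresholds and the shift amount are illustrative assumptions; the patent states only that the second recognizer's weight is increased in a poor recognition environment. Recall that with negative log-likelihoods a smaller coefficient means a greater weight, so "increasing the second recognizer's weight" corresponds to lowering w2.

```python
def adjust_for_environment(w1, w2, vehicle_speed_kmh, snr_db,
                           speed_thresh=80.0, snr_thresh=10.0, shift=0.1):
    """Shift weight toward the second recognizer when the recognition
    environment is judged inferior (high vehicle speed and/or low S/N).

    speed_thresh, snr_thresh, and shift are illustrative assumptions.
    Lowering w2 INCREASES the second recognizer's effective weight,
    because the likelihoods being scaled are negative log-likelihoods;
    w1 is raised so the sum stays 1.
    """
    if vehicle_speed_kmh > speed_thresh or snr_db < snr_thresh:
        w2 = max(w2 - shift, 1e-3)
        w1 = 1.0 - w2
    return w1, w2
```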
- in another aspect, a speech recognition method used by a speech recognition apparatus that stores a dictionary used in speech recognition of each layer for recognizing a layer command comprises: a first speech recognition step of, when speech data is input, performing speech recognition based on the dictionary to be used on the assumption that the previously input speech data was correctly recognized; a second speech recognition step of performing speech recognition based on the dictionary used when the previously input speech data was recognized; and a recognition result processing step of determining a final recognition result by weighting the recognition result of the first speech recognition step and the recognition result of the second speech recognition step based on an elapsed time width measured from the time of processing of the previously input speech data.
- in another aspect, a speech recognition program executed by a speech recognition apparatus that stores a dictionary used in speech recognition of each layer for recognizing a layer command causes the apparatus to function as: first voice recognition means for, when voice data is input, performing voice recognition based on the dictionary to be used on the assumption that the previously input voice data was correctly recognized; second voice recognition means for performing voice recognition based on the dictionary used when the previously input voice data was recognized; and recognition result processing means for determining a final recognition result by weighting the recognition results of the first and second voice recognition means based on an elapsed time width measured from the time of processing of the previously input voice data.
- by executing this program, the voice recognition device can recognize a recurrent utterance with a minimum number of steps even when a misrecognition occurs, without requiring an utterance or operation corresponding to a corrective action and without starting over from the beginning.
- the program is recorded on a storage medium.
- FIG. 1 is a conceptual diagram showing two types of dictionary configurations used for speech recognition. Specifically, FIG. 1(a) shows an example of the dictionary configuration for recognizing a non-hierarchical command as a comparative example, and FIG. 1(b) shows an example of the dictionary configuration for recognizing a hierarchical command, which is the object of the present invention.
- the speech recognition apparatus of the present invention recognizes hierarchical commands as will be described later.
- a non-hierarchical command refers to an utterance command that realizes one operation command or output command to the device to be controlled with a single utterance, whereas a hierarchical command refers to an utterance command that realizes one operation command or output command to the device to be controlled with two or more utterances.
- dictionary refers to a database that stores words to be recognized (hereinafter referred to as “recognized words”).
- the voice recognition apparatus performs a predetermined analysis on the utterance data input through a microphone or the like (hereinafter referred to as “utterance data Sa”), and then outputs a recognition result based on the command dictionary 11.
- the “recognition result” corresponds, for example, to the recognized word having the highest similarity to the utterance data Sa, or to a list of recognition words arranged in descending order of similarity to the utterance data Sa, together with their degrees of similarity.
- “utterance data Sa” refers to an input signal including voice. For example, when a voice recognition device is installed in a car navigation device, the utterance data Sa indicates an input signal recorded from a microphone during a predetermined time after the user presses the utterance button.
- the hierarchy command is composed of two hierarchies.
- the speech recognition apparatus recognizes speech data Sa input first in a series of hierarchical command recognition processing using the first command dictionary 12.
- This speech recognition is hereinafter referred to as “speech recognition in the first layer”. That is, the speech recognition in the first hierarchy refers to a process of recognizing speech data Sa by using a dictionary used in an initial state, that is, a state in which no speech command is accepted, among a series of processes for recognizing hierarchical commands.
- when recognizing a hierarchical command, the speech recognition apparatus changes the dictionary to be used (hereinafter also referred to as “standby dictionary”) for each processing state (hierarchy). Suppose the speech recognition apparatus recognizes “name search” by the speech recognition in the first hierarchy based on the first command dictionary 12; that is, suppose it determines that “name search” has the highest similarity among the recognition words stored in the first command dictionary 12.
- the speech recognition apparatus performs speech recognition using the second command dictionary 13 and the place name dictionary 14 as the next standby dictionary based on the recognized “name search” command.
- the second command dictionary 13 contains utterance commands such as “return,” “stop,” and “correction,” which the speaker inputs when explicitly wanting to return the process to the previous state (i.e., speech recognition in the first layer) or to redo the whole series of hierarchical command recognition processing from the beginning.
- the place name dictionary 14 includes place names to be subjected to “name search” recognized by the speech recognition in the first layer.
- the speech recognition apparatus calculates the similarity of the recognition words stored in the second command dictionary 13 and the place name dictionary 14 against the next input utterance data Sa (here, the utterance “Tokyo Tower”).
- the voice recognition here is referred to as “voice recognition in the second layer”. That is, the speech recognition in the second layer refers to a process for recognizing the utterance data Sa based on the speech recognition result in the first layer.
- voice recognition in the third hierarchy is defined in the same manner.
- the speech recognition apparatus outputs the result of speech recognition at the second layer as the final recognition result. Thereafter, the speech recognition apparatus performs speech recognition in the first hierarchy using the first command dictionary 12 again for the new utterance data Sa.
- the case of a two-layer utterance command has been described above; for utterance commands of three or more layers, the voice recognition device similarly selects, in the voice recognition of each layer, a standby dictionary based on the recognition word recognized in the previous layer, and performs voice recognition.
- the speech recognition apparatus executes speech recognition of hierarchical commands as described above. Furthermore, as will be described later, the speech recognition apparatus according to the present invention changes the output of the recognition result in accordance with the elapsed time since the speech recognition in the previous layer. As a result, even when a recognition error occurs in the middle of a series of hierarchical command recognition processes, the voice recognition device according to the present invention can cope with it without requiring the user to explicitly issue a correction command (in the above example, a command stored in the second command dictionary 13).
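The standby-dictionary switching of FIG. 1(b) can be sketched as follows. The dictionary contents beyond “name search,” “return,” “stop,” “correction,” and “Tokyo Tower” are illustrative assumptions, as is the function name.

```python
# Illustrative dictionary contents; the patent names only a few of
# these recognition words ("name search", "return", "Tokyo Tower", ...).
FIRST_COMMAND_DICT = ["name search", "address search", "genre search"]
SECOND_COMMAND_DICT = ["return", "stop", "correction"]
PLACE_NAME_DICT = ["Tokyo Tower", "Tokyo Station", "Kyoto Tower"]

def standby_dictionary(layer, previous_word=None):
    """Return the standby dictionary (recognition-word list) for a layer."""
    if layer == 1:
        # Initial state: no utterance command accepted yet.
        return FIRST_COMMAND_DICT
    # Second layer: the dictionary depends on the first-layer result.
    if previous_word == "name search":
        # Wait on explicit return/correction commands plus place names.
        return SECOND_COMMAND_DICT + PLACE_NAME_DICT
    return SECOND_COMMAND_DICT
```

After the second-layer result is output as the final recognition result, the apparatus returns to `standby_dictionary(1)` for the next utterance.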
- FIG. 2 is a schematic configuration diagram of the speech recognition apparatus 100 according to the present invention.
- the voice recognition device 100 includes a voice analysis unit 1, a voice section detection unit 2, a first recognition unit 3, a second recognition unit 4, a recognition result processing unit 5, a presentation unit 6, and a main control unit 7.
- the voice analysis unit 1 calculates an acoustic feature amount (that is, a voice feature parameter) of the utterance data Sa based on the control of the main control unit 7. Specifically, the voice analysis unit 1 performs A / D conversion on the utterance data Sa input from a microphone or the like (not shown), and calculates an acoustic feature amount by using, for example, a well-known acoustic analysis method or a combination thereof.
- the speech segment detection unit 2 uses the speech feature amounts supplied from the speech analysis unit 1 to cut out only the speech segment (hereinafter referred to as “voice data”) from the utterance data Sa. The speech segment detection unit 2 then supplies the speech feature amounts corresponding to the voice data to the first recognition unit 3 and the second recognition unit 4.
- the first recognition unit 3 calculates the likelihood of the recognition words stored in a predetermined standby dictionary using a discriminator such as an HMM (Hidden Markov Model), based on the voice feature amounts. The first recognition unit 3 then outputs a list of recognition words arranged in descending order of similarity, together with their likelihoods.
- the first recognition unit 3 performs speech recognition based on a dictionary to be used when it is assumed that the previously input speech data has been correctly recognized. That is, the first recognizing unit 3 performs speech recognition at the next layer after processing the previously input speech data (or the first layer when there is no next layer). Specifically, in the example of FIG. 1B, when the first recognition unit 3 recognizes the previously input voice data in the first hierarchy, the second command dictionary 13 and the place name dictionary are used. 14 is used to calculate the recognition result, and when the previously input speech data is speech-recognized in the second hierarchy, the recognition result is calculated using the first command dictionary 12.
- the first recognition unit 3 supplies a recognition word (hereinafter referred to as “recognition word WR1”) and its corresponding logarithmic likelihood (hereinafter referred to as “likelihood L1”) to the recognition result processing unit 5 as the recognition result.
- the first recognition unit 3 supplies a predetermined number (hereinafter referred to as “predetermined number N”) of the highest-ranked recognition words WR1 and their likelihoods L1.
- predetermined number N is set to an appropriate value through experiments or the like.
- the second recognition unit 4 calculates the likelihood of the recognition words stored in a predetermined standby dictionary different from that of the first recognition unit 3, using a classifier such as an HMM, based on the voice feature amounts. The second recognition unit 4 then outputs a list of recognition words arranged in descending order of similarity, together with their likelihoods.
- the second recognition unit 4 performs voice recognition based on the standby dictionary used when the voice recognition of the previously input voice data was executed. That is, the second recognition unit 4 performs speech recognition in the same hierarchy as the one in which the previously input utterance data Sa was processed; in other words, it performs speech recognition in the hierarchy preceding that of the first recognition unit 3. More specifically, in the example of FIG. 1(b), when the previously input utterance data Sa was recognized in the first hierarchy, the second recognition unit 4 performs speech recognition in the first hierarchy again using the first command dictionary 12; when it was recognized in the second hierarchy, the second recognition unit 4 performs speech recognition in the second hierarchy again using the second command dictionary 13 and the place name dictionary 14.
- the second recognition unit 4 supplies a recognition word (hereinafter referred to as “recognition word WR2”) and its corresponding logarithmic likelihood (hereinafter referred to as “likelihood L2”) to the recognition result processing unit 5 as the recognition result.
- the second recognition unit 4 supplies the predetermined number N of the highest-ranked recognition words WR2 and their likelihoods L2.
- the recognition result processing unit 5 applies a predetermined weighting to the likelihoods L1 and L2 output from the first recognition unit 3 and the second recognition unit 4, based on the elapsed time width (hereinafter referred to as “elapsed time width T”). The recognition result processing unit 5 thereby calculates the weighted likelihood Lw1 from L1 and the weighted likelihood Lw2 from L2. The recognition result processing unit 5 then sorts the weighted likelihoods Lw1 and Lw2 together, and supplies the recognized words WR1 and WR2 (hereinafter collectively referred to as “recognized words WR”) and the corresponding weighted likelihoods Lw1 and Lw2 (hereinafter collectively referred to as “weighted likelihood Lw”) to the presentation unit 6.
- the specific processing in the recognition result processing unit 5 will be described in detail in the next section (Recognition processing method).
- the presentation unit 6 presents the recognition result output by the recognition result processing unit 5 as the final recognition result.
- the presentation unit 6 is a display, a speaker, or the like, and outputs the recognition result output by the recognition result processing unit 5 as an image or sound.
- the main control unit 7 includes a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), and the like (not shown), and performs various controls on each element of the speech recognition apparatus 100.
- the time measuring unit 8 is a timer for measuring the elapsed time width T. Specifically, the time measuring unit 8 measures the elapsed time width T calculated from the time when the presentation unit 6 outputs the recognition result of the previous utterance data Sa.
- the storage unit 9 is a memory that stores the acoustic model used by the first recognition unit 3 and the second recognition unit 4, the standby dictionaries used for speech recognition in each layer, the recognition results output by the recognition result processing unit 5, and other data. The data stored in the storage unit 9 is supplied to each element of the speech recognition apparatus 100 as necessary, under the control of the main control unit 7.
- FIG. 3 is a functional block diagram specifically showing the processing contents of the recognition result processing unit 5.
- the recognition result processing unit 5 includes multiplication units 5x and 5y, a sorting unit 5z, and a switch unit 5v.
- the multiplication unit 5x multiplies the likelihood L1 supplied from the first recognition unit 3 by a predetermined weighting coefficient (hereinafter referred to as “first weighting coefficient w1”), and supplies the resulting weighted likelihood Lw1 to the sorting unit 5z. That is, the multiplication unit 5x obtains the weighted likelihood Lw1 by the following equation (1).
- Lw1 = L1 × w1 … (1)
- the first weighting coefficient w1 is a coefficient that varies according to the elapsed time width T. Specifically, the first weighting coefficient w1 is set so that the importance (weight) of the recognition result of the first recognition unit 3 increases as the elapsed time width T increases.
- that is, the first weighting coefficient w1 decreases (is non-increasing) as the elapsed time width T increases, so as to decrease the absolute value of the weighted likelihood Lw1.
- the speech recognition apparatus 100 regards the likelihoods L1 and L2 and the weighting likelihoods Lw1 and Lw2 as indicating higher similarity as their negative values are closer to zero, that is, as their absolute values are smaller. Therefore, the multiplication unit 5x sets the weighting likelihood Lw1 so that the weight of the recognition result of the first recognition unit 3 increases as the first weighting coefficient w1 becomes smaller. A method for setting the first weighting coefficient w1 will be described later.
- the multiplication unit 5y multiplies the likelihood L2 supplied from the second recognition unit 4 by a predetermined weighting coefficient (hereinafter, "second weighting coefficient w2") and supplies the weighting likelihood Lw2, which is the multiplication result, to the sorting unit 5z. That is, the multiplication unit 5y obtains the weighting likelihood Lw2 by the following formula (2).
- Lw2 = L2 × w2 … Formula (2)
- the second weighting coefficient w2 is a coefficient that varies according to the elapsed time width T. Specifically, the second weighting coefficient w2 is set so that the importance (weight) of the recognition result of the second recognition unit 4 decreases as the elapsed time width T increases.
- the multiplication unit 5y sets the weighting likelihood Lw2 so that the weight of the recognition result of the second recognition unit 4 is increased as the second weighting coefficient w2 is smaller.
- a method for calculating the second weighting coefficient w2 will be described later.
- the first weighting coefficient w1 and the second weighting coefficient w2 are set to values larger than 0 and smaller than 1, respectively, and the sum is set to 1. That is, the first weighting coefficient w1 and the second weighting coefficient w2 satisfy the following expressions (3) to (5).
- w1 + w2 = 1 … Formula (3)
- 0 < w1 < 1 … Formula (4)
- 0 < w2 < 1 … Formula (5)
- the recognition result processing unit 5 can set one of the first weighting coefficient w1 and the second weighting coefficient w2 simply by obtaining the other. That is, as the elapsed time width T increases, the recognition result processing unit 5 can make the recognition result of the second recognition unit 4 relatively less important than the recognition result of the first recognition unit 3.
- the first and second weighting coefficients w1 and w2 are expressed by the following formulas (6) and (7) using a predetermined weighting coefficient "w".
- w1 = w … Formula (6)
- w2 = 1 - w … Formula (7)
- the weighting coefficient w is set to a value larger than 0 and smaller than 1, as follows from formulas (4) and (6). That is, the following formula (8) is satisfied.
- 0 < w < 1 … Formula (8)
- Next, a method for setting the weighting coefficient w for determining the first weighting coefficient w1 and the second weighting coefficient w2 will be described in detail with reference to FIG. 4.
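As a concrete illustration, formulas (1) to (8) can be sketched in Python. This is a minimal sketch; the function name and the example likelihood values are assumptions for illustration, and the likelihoods are negative values whose smaller absolute value means higher similarity, as described above.

```python
def weighted_likelihoods(L1, L2, w):
    """Apply formulas (1)-(8): w1 = w, w2 = 1 - w, Lw1 = L1 * w1, Lw2 = L2 * w2.

    L1 and L2 are negative likelihood values; the smaller the absolute
    value, the higher the similarity (see the sorting unit 5z)."""
    if not (0.0 < w < 1.0):                 # formula (8)
        raise ValueError("w must satisfy 0 < w < 1")
    w1 = w                                  # formula (6)
    w2 = 1.0 - w                            # formula (7); w1 + w2 = 1 (formula (3))
    return L1 * w1, L2 * w2                 # formulas (1) and (2)

# A smaller coefficient shrinks the absolute value of the weighted
# likelihood, which the sorting unit reads as higher similarity.
Lw1, Lw2 = weighted_likelihoods(-120.0, -100.0, 0.7)
```

Because only w needs to be chosen, the relative importance of the two recognition units is controlled by a single value, exactly as formulas (6) and (7) state.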
- FIG. 4 is an example of a graph showing the relationship between the weighting coefficient w and the elapsed time width T.
- the graph "G1" and the graph "G2" are examples of functions that determine the weighting coefficient w from the elapsed time width T (hereinafter, "weighting function Fw"). "T1" in FIG. 4 is a threshold for the elapsed time width T by which the switch unit 5v determines whether to use the recognition result of the first recognition unit 3 alone as the final recognition result, or to derive the final recognition result using the recognition result of the second recognition unit 4 in addition to that of the first recognition unit 3. This will be described in detail in the description of the switch unit 5v.
- whether represented by the graph G1 or the graph G2, the weighting function Fw is a decreasing function with the elapsed time width T as its variable.
- the graph G1 is a decreasing function that decreases with a constant slope until the elapsed time width T reaches the threshold "T1".
- the graph G2 is a decreasing function that is convex downward up to the threshold T1.
- the weighting function Fw for determining the weighting coefficient w may be a constant value regardless of the elapsed time width T.
- the map or expression corresponding to the weighting function Fw is generated in advance based on experiments and the like and stored in the storage unit 9 in consideration of the user's usage environment, the number of recognized words, the performance of the speech recognition apparatus 100, and the like.
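One plausible shape for such a weighting function can be sketched as follows. This is an illustration only: the constants, the function names, and the exponential form of the G2-style curve are assumptions, not values taken from the experiments mentioned above.

```python
import math

def fw_linear(T, T1, w_min=0.1, w_max=0.9):
    """G1-style sketch: decreases with a constant slope until the
    threshold T1, then stays at its floor value."""
    if T >= T1:
        return w_min
    return w_max - (w_max - w_min) * (T / T1)

def fw_convex(T, T1, w_min=0.1, w_max=0.9):
    """G2-style sketch: a decreasing function that is convex downward
    (falls quickly at first, then levels off) up to the threshold T1."""
    if T >= T1:
        return w_min
    return w_min + (w_max - w_min) * math.exp(-3.0 * T / T1)
```

Because a smaller coefficient means a heavier weight, a large w at small T makes w2 = 1 - w small, that is, it weights the second recognition unit's result heavily right after the previous utterance, and the weight shifts to the first recognition unit as T grows.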
- the sorting unit 5z sorts the recognition words corresponding to the weighting likelihoods Lw1 and Lw2 in descending order of similarity, based on the weighting likelihood Lw1 supplied from the multiplication unit 5x and the weighting likelihood Lw2 supplied from the multiplication unit 5y. Then, the sorting unit 5z supplies a predetermined number N of recognition words WR and their weighting likelihoods Lw to the presentation unit 6.
- FIG. 5 shows a specific example, for the case of FIG. 1B, in which a word is erroneously recognized in the second-level speech recognition and the speaker utters the same word again (hereinafter simply "recurrent utterance"). Specifically, it shows lists of the recognition result of the first recognition unit 3, the recognition result of the second recognition unit 4, and the output of the sorting unit 5z for the recurrent utterance.
- the speaker uttered "Tokyo Tower" after uttering "name search", whereas the speech recognition apparatus 100 correctly recognized "name search" in the first-level speech recognition but erroneously recognized "Tokyo Port" in the second-level speech recognition.
- the speaker therefore subsequently uttered "Tokyo Tower" again as a recurrent utterance.
- the first recognition unit 3 performs first-level speech recognition on the recurrent utterance data using the first command dictionary 12, while the second recognition unit 4 performs second-level speech recognition on the recurrent utterance data using the second command dictionary 13 and the place name dictionary 14.
- the weighting coefficient w is set to "0.7" based on the elapsed time width T, that is, a value that weights the recognition result of the second recognition unit 4 more heavily than the recognition result of the first recognition unit 3.
- FIG. 5A is an example of a list showing the recognition result of the first recognition unit 3 for the recurrent utterance.
- the first recognition unit 3 calculates the likelihood L1 of the word stored in the first command dictionary 12 based on the acoustic feature amount of the speech data of the recurrent utterance. Then, the first recognition unit 3 outputs the recognition word WR1 and the likelihood L1 by a predetermined number N in descending order of similarity based on the likelihood L1.
- FIG. 5B is an example of a list showing the recognition result of the second recognition unit 4 for the recurrent utterance.
- FIG. 5 (c) is an example of a list showing the output of the sorting unit 5z for recurrent utterances.
- "Tokyo Tower", recognized by the second recognition unit 4, is ranked first, and "Tokyo Port", likewise recognized by the second recognition unit 4, is ranked second. Meanwhile, "route search" and "route erasure", ranked first and second by the first recognition unit 3, are ranked below the words recognized by the second recognition unit 4.
- the sorting unit 5z calculates the weighting likelihoods Lw1 and Lw2 using the weighting coefficient w appropriately set based on the elapsed time width T, and sorts them together.
- the speech recognition apparatus 100 can thus correctly recognize a recurrent utterance caused by misrecognition with a minimum number of steps, without requiring the speaker to perform an explicit correction operation or to re-enter the voice input from the first layer.
- when the first-ranked recognition word matches the recognition result for the previously input utterance data Sa, the sorting unit 5z moves the second-ranked recognition word WR up to first place. In this way, the sorting unit 5z can avoid outputting the same recognition word as the previous time in first place in response to a recurrent utterance.
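The behavior of the sorting unit 5z described above can be sketched as follows. This is a minimal illustration; the list format, function name, and likelihood values are assumptions made for the sketch, not values from the embodiment.

```python
def sort_results(results1, results2, w, prev_top=None, N=5):
    """Sketch of the sorting unit 5z: weight each unit's likelihoods
    (formulas (1) and (2)), merge, sort by similarity (smaller absolute
    weighted likelihood first), and promote second place to first when
    first place repeats the previous utterance's top result.

    results1 / results2: lists of (recognition word, likelihood) from
    the first and second recognition units."""
    weighted = [(word, L * w) for word, L in results1]            # Lw1 = L1 * w1
    weighted += [(word, L * (1.0 - w)) for word, L in results2]   # Lw2 = L2 * w2
    weighted.sort(key=lambda item: abs(item[1]))  # smaller |Lw| = more similar
    if prev_top is not None and len(weighted) >= 2 and weighted[0][0] == prev_top:
        weighted[0], weighted[1] = weighted[1], weighted[0]       # promote 2nd place
    return weighted[:N]

# Re-creating the FIG. 5 scenario with made-up likelihood values:
first = [("route search", -150.0), ("route erasure", -160.0)]
second = [("Tokyo Tower", -100.0), ("Tokyo Port", -110.0)]
ranked = sort_results(first, second, w=0.7, prev_top="Tokyo Port")
# Both second-unit words outrank the first-unit words, as in FIG. 5(c).
```

With w = 0.7, the second recognition unit's likelihoods are multiplied by the smaller coefficient w2 = 0.3, so their absolute weighted likelihoods are smaller and they sort ahead of the first unit's words.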
- the switch unit 5v switches between supplying the output of the first recognition unit 3 to the presentation unit 6 as the final recognition result and supplying the output of the sorting unit 5z to the presentation unit 6 as the final recognition result. Specifically, when the elapsed time width T is less than the threshold T1, the switch unit 5v determines that a recurrent utterance is possible and supplies the output of the sorting unit 5z, in which the recognition result of the second recognition unit 4 is taken into account, to the presentation unit 6.
- on the other hand, when the elapsed time width T is equal to or greater than the threshold T1, the switch unit 5v determines that a recurrent utterance is unlikely or impossible because a considerable time has elapsed since the previous utterance, and supplies the recognition result of the first recognition unit 3 to the presentation unit 6.
- in this way, the switch unit 5v can reduce unnecessary processing by outputting a final recognition result based only on the recognition result of the first recognition unit 3 when a recurrent utterance is unlikely.
- FIG. 6 is an example of a flowchart showing a processing procedure executed by the speech recognition apparatus 100 in the first embodiment.
- the speech recognition apparatus 100 repeatedly executes the processing of the flowchart shown in FIG.
- the speech recognition apparatus 100 determines whether the utterance data Sa has been input (step S101). When the utterance data Sa has been input (step S101; Yes), the speech recognition apparatus 100 advances the process to step S102. On the other hand, when no utterance data Sa has been input (step S101; No), the speech recognition apparatus 100 continues to monitor whether the utterance data Sa is input.
- the speech recognition apparatus 100 calculates an elapsed time width T (step S102). That is, for example, the speech recognition apparatus 100 determines the elapsed time width T until the next speech data Sa is input after the recognition result is presented by the presentation unit 6.
- the speech recognition apparatus 100 determines whether or not there is a previous utterance (step S103). In other words, the speech recognition apparatus 100 determines whether or not a speech input of a hierarchical command has been accepted for the first time. If there is a previous utterance (step S103; Yes), the speech recognition apparatus 100 performs the process of step S104. On the other hand, when there is no previous utterance (step S103; No), the speech recognition apparatus 100 performs speech recognition by the first recognition unit 3 (step S111).
- the speech recognition apparatus 100 determines whether or not the elapsed time width T is smaller than the threshold value T1 (step S104). If the elapsed time width T is smaller than the threshold T1 (step S104; Yes), the speech recognition apparatus 100 performs the processing from step S105 to step S110. That is, in this case, the speech recognition apparatus 100 determines that there is a possibility of re-speech caused by misrecognition, and outputs the final recognition result in consideration of the recognition result of the second recognition unit 4.
- the voice recognition device 100 performs voice recognition by the first recognition unit 3 (step S111).
- the speech recognition apparatus 100 determines that there is no possibility or a very low possibility of recurrent speech due to misrecognition.
- the first recognizing unit 3 calculates the likelihood L1, and outputs a predetermined number N of recognized words WR1 and a list of the likelihood L1 in descending order of similarity.
- the speech recognition apparatus 100 performs speech recognition by the first recognition unit 3 and the second recognition unit 4 (step S105). Specifically, the first recognition unit 3 calculates the likelihood L1 and outputs a list of a predetermined number N of recognition words WR1 and their likelihoods L1 in descending order of similarity. The second recognition unit 4 calculates the likelihood L2 and outputs a predetermined number N of recognition words WR2 and their likelihoods L2 in descending order of similarity. At this time, the second recognition unit 4 performs speech recognition of the hierarchy preceding that of the first recognition unit 3.
- the speech recognition apparatus 100 determines the first and second weighting coefficients w1 and w2 based on the elapsed time width T (step S106). For example, the speech recognition apparatus 100 refers to the graph G1 or the graph G2 stored in the storage unit 9, and determines the first and second weighting coefficients w1 and w2 based on the elapsed time width T.
- the speech recognition apparatus 100 calculates weighted likelihoods Lw1 and Lw2 (step S107). That is, the speech recognition apparatus 100 calculates the weighting likelihood Lw1 by multiplying the likelihood L1 by the first weighting coefficient w1, and calculates the weighting likelihood Lw2 by multiplying the likelihood L2 by the second weighting coefficient w2. .
- the speech recognition apparatus 100 performs sorting based on the weighting likelihoods Lw1 and Lw2 (step S108).
- the weighting likelihoods Lw1 and Lw2 are weighted according to the elapsed time width T by the first or second weighting coefficients w1 and w2.
- the speech recognition apparatus 100 can appropriately recognize the speech based on the elapsed time width T even in the case of recurrent speech caused by misrecognition.
- the speech recognition apparatus 100 determines whether or not the recognition word that is ranked first after sorting is the same as the previous time (step S109).
- the speech recognition apparatus 100 holds information such as the previously first-ranked recognition word in the storage unit 9. If the first place after sorting is the same as the previous time (step S109; Yes), the speech recognition apparatus 100 moves the second place up to first place (step S110). Thereby, the speech recognition apparatus 100 can avoid repeating the same misrecognition as the previous time.
- the speech recognition apparatus 100 advances the process to step S112.
- the speech recognition apparatus 100 presents the recognition result (step S112). For example, the speech recognition apparatus 100 displays the recognition results in a list as shown in FIG. 5C, or outputs the recognition word WR determined to have the highest similarity as a speech.
- the speech recognition apparatus 100 stores the recognition result and its attribute in the storage unit 9 (step S113).
- the "attribute" refers to information accompanying the recognition process, such as calculated values like the weighting likelihood Lw and/or which of the first recognition unit 3 and the second recognition unit 4 produced the first-ranked recognition result.
- the speech recognition apparatus 100 copies the setting of the first recognition unit 3 to the setting of the second recognition unit 4 (step S114).
- the setting corresponds to information such as which dictionary is used as the standby dictionary.
- the voice recognition device 100 starts measuring the elapsed time width T (step S115).
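Putting steps S101 to S115 together, the flow of FIG. 6 can be sketched roughly as follows. The interfaces of `recognize1`, `recognize2`, and `Fw`, and the `state` dictionary, are assumptions made for illustration; presentation (step S112) is modeled simply as returning the ranked list.

```python
def process_utterance(Sa, state, T1, recognize1, recognize2, Fw):
    """Rough sketch of the FIG. 6 flowchart. recognize1 / recognize2
    return (word, likelihood) lists for the current and the preceding
    hierarchy; Fw maps the elapsed time width T to the weighting
    coefficient w; state holds the previous top result and the timer."""
    T = state.get("elapsed", float("inf"))            # step S102
    if state.get("prev_top") is None or T >= T1:      # steps S103 / S104
        ranked = sorted(recognize1(Sa), key=lambda r: abs(r[1]))  # step S111
    else:
        r1, r2 = recognize1(Sa), recognize2(Sa)       # step S105
        w = Fw(T)                                     # step S106
        ranked = [(wd, L * w) for wd, L in r1]        # step S107 (Lw1)
        ranked += [(wd, L * (1.0 - w)) for wd, L in r2]           # (Lw2)
        ranked.sort(key=lambda r: abs(r[1]))          # step S108
        if len(ranked) >= 2 and ranked[0][0] == state["prev_top"]:
            ranked[0], ranked[1] = ranked[1], ranked[0]           # steps S109 / S110
    state["prev_top"] = ranked[0][0]                  # step S113 (store result)
    state["elapsed"] = 0.0                            # step S115 (restart timer)
    return ranked                                     # step S112 (present)
```

The first call (no previous utterance) takes the step S111 branch; subsequent calls within the threshold T1 merge both recognizers' results, matching the flowchart's two paths.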
- the speech recognition apparatus recognizes hierarchical commands and includes a dictionary storage unit, a first speech recognition unit, a second speech recognition unit, and a recognition result processing unit.
- the dictionary storage means stores a dictionary used for speech recognition in each hierarchy for recognizing hierarchy commands.
- when voice data is input, the first speech recognition means performs speech recognition based on the dictionary to be used on the assumption that the previously input voice data was correctly recognized. That is, the first speech recognition means determines the dictionary to be used based on the previously input voice data and performs speech recognition on the next voice data input.
- the second voice recognition means performs voice recognition based on the dictionary used when voice recognition of the voice data input last time is executed.
- the second voice recognition means performs voice recognition of the next input voice data on the assumption that the voice recognition of the voice data input last time is erroneous recognition.
- the recognition result processing means weights the recognition result of the first voice recognition means and the recognition result of the second voice recognition means based on the elapsed time width calculated from the time of processing of the previously input voice data, and determines the final recognition result.
- the speech recognition apparatus weights the recognition results of the two recognition processes based on the elapsed time width. As a result, the speech recognition apparatus can recognize a hierarchical command without performing an utterance or operation corresponding to a correction operation or starting over from the beginning even if there is a misrecognition.
- the speech recognition device 100 includes two speech recognition units, the first recognition unit 3 and the second recognition unit 4, and the second recognition unit 4 performs speech recognition of the hierarchy preceding that of the first recognition unit 3.
- the configuration of the speech recognition apparatus 100 to which the present invention is applicable is not limited to this. Instead of this, the speech recognition apparatus 100 may execute the above-described processing with only one speech recognition unit.
- FIG. 7 is an example of a diagram illustrating a schematic configuration of the speech recognition apparatus 100a according to the first modification.
- the speech recognition apparatus 100a differs from the speech recognition apparatus 100 shown in FIG. 2 in that it includes a recognition unit 20 in place of the first recognition unit 3 and the second recognition unit 4 of FIG. 2.
- the recognition unit 20 sequentially executes the processes of both the first recognition unit 3 and the second recognition unit 4. Specifically, the recognition unit 20 first calculates the likelihood L1 by executing the process performed by the first recognition unit 3, and supplies the likelihood L1 and the corresponding recognition word WR1 to the recognition result holding unit 21. Next, the recognition unit 20 calculates the likelihood L2 by executing the process performed by the second recognition unit 4, and supplies the likelihood L2 and the corresponding recognition word WR2 to the recognition result processing unit 5.
- after receiving the likelihood L1 and the corresponding recognition word WR from the recognition unit 20, the recognition result holding unit 21 holds these recognition results until the recognition unit 20 calculates the likelihood L2, and then supplies the held recognition results to the recognition result processing unit 5.
- the voice recognition device 100a can execute the same processing as that of the first embodiment even when only one recognition unit 20 that is a voice recognition unit is provided. This also makes it possible to reduce the size of the speech recognition apparatus 100a.
- the time measurement unit 8 sets the elapsed time width T from the time when the recognition result is output by the presentation unit 6 to the time when the next utterance data Sa is input.
- the setting to which the present invention is applicable is not limited to this.
- the time measuring unit 8 may instead set, as the elapsed time width T, the time width from when the previous utterance data Sa was input to when the next utterance data Sa is input, or the time width from when the utterance button was pressed for the previous utterance data Sa to when the utterance button is next pressed, or a time width corresponding thereto.
- time measuring unit 8 may reset the elapsed time width T as appropriate, for example, when the engine of the vehicle is turned off when the speech recognition apparatus 100 is mounted on the vehicle.
- the speech recognition apparatus 100 sets the threshold T1 to the time width in which the weighting coefficient w is near “0”.
- the method to which the present invention is applicable is not limited to this. Instead, for example, the speech recognition apparatus 100 may set the threshold T1 to a time width at which the weighting coefficient w is near "0.5".
- the threshold value T1 may be set for each weighting function Fw used or based on an input from the user. Also by this, the speech recognition apparatus 100 can recognize a recurrent utterance caused by misrecognition with a minimum number of steps while reducing unnecessary processing.
- the speech recognition apparatus 100 includes the first recognition unit 3 and the second recognition unit 4, and the second recognition unit 4 performs speech recognition in the hierarchy before the first recognition unit 3.
- the speech recognition apparatus 100 may further include a third recognition unit that performs speech recognition in a layer before the second recognition unit 4.
- the recognition result processing unit 5 refers to a predetermined formula or map, determines the weighting of the likelihood calculated by each recognition unit from the elapsed time width T, and performs sorting based on the weighted likelihood.
- the above formula or map is created in advance based on an experiment or the like and stored in the storage unit 9. The same applies to the case where four or more recognition units are used.
- the speech recognition apparatus 100 can also perform recognition processing over three or more layers.
- the speech recognition apparatus 100 uses the likelihood as the similarity.
- the similarity index to which the present invention is applicable is not limited to this. Instead, the speech recognition apparatus 100 may use another index indicating similarity. Even in this case, the speech recognition apparatus 100 determines that the possibility of erroneous recognition is higher as the elapsed time width T is shorter, and increases the weight of the recognition result of the second recognition unit 4. Also by this, the speech recognition apparatus 100 can recognize a recurrent utterance caused by misrecognition with a minimum number of steps.
- in the first embodiment, the recognition result processing unit 5 determines the final recognition result based on the recognition result of the first recognition unit 3 and the recognition result of the second recognition unit 4. Instead, the recognition result processing unit 5 may determine the final recognition result based only on the recognition result of the second recognition unit 4 when the elapsed time width T is less than the threshold T1. That is, in this case, the speech recognition apparatus 100 determines that a recurrent utterance is highly likely, performs only the speech recognition of the previous hierarchy, and determines the final ranking of the recognition words WR based only on the likelihood L2. By this as well, the speech recognition apparatus 100 can more reliably recognize a recurrent utterance caused by misrecognition with a minimum number of steps.
- the speech recognition apparatus determines the weighting coefficient w based on information indicating the environment for speech recognition (hereinafter referred to as “environment information Ic”) in addition to the elapsed time width T. Thereby, the speech recognition apparatus can recognize the recurrent speech more accurately according to the environment in which speech recognition is executed.
- FIG. 8 is an example of a diagram illustrating a schematic configuration of the speech recognition apparatus 100b according to the second embodiment.
- the voice recognition device 100b is mounted on a vehicle.
- the speech recognition apparatus 100b differs from the speech recognition apparatus 100 of the first embodiment in that it includes an ECU 23, a GPS receiver 24, and a route information acquisition unit 25.
- the vehicle on which the voice recognition device 100b is mounted is referred to as “mounted vehicle”.
- the ECU 23 includes a CPU, a ROM, a RAM, and the like (not shown), and performs various controls on each component of the mounted vehicle.
- the ECU 23 is electrically connected to the main control unit 7 and transmits information indicating the state of the mounted vehicle (hereinafter referred to as “vehicle information”) to the main control unit 7.
- vehicle information includes, for example, information such as an ACC (accessory power supply: engine key) state, a vehicle speed pulse state, a window opening / closing state of the mounted vehicle, an engine state, and a transmission state.
- the GPS receiver 24 receives radio waves carrying downlink data including positioning data from a plurality of GPS satellites.
- the positioning data is used to detect the absolute position of the mounted vehicle from latitude and longitude information.
- the GPS receiver 24 is electrically connected to the main control unit 7 and transmits latitude and longitude information of the mounted vehicle to the main control unit 7.
- the route information unit 25 acquires, from radio waves, information (hereinafter "VICS information") distributed from a VICS (Vehicle Information and Communication System) center or the like.
- the route information unit 25 holds map information, and also holds information about route guidance (route guidance information) when the driver of the mounted vehicle has set a destination.
- the route information unit 25 is electrically connected to the main control unit 7 and exchanges such information with the main control unit 7.
- the main control unit 7 acquires acoustic information, such as the S/N ratio of the utterance data Sa input from a microphone (not shown), via a sensor (not shown) and supplies it to the recognition result processing unit 5 as environment information Ic. Further, the main control unit 7 acquires the vehicle information, latitude/longitude information, and route guidance information supplied from the ECU 23, the GPS receiver 24, and the route information unit 25, and supplies them to the recognition result processing unit 5 as environment information Ic. This will be explained in detail in the (Voice Recognition Method) section.
- the recognition result processing unit 5 changes the slope of the weighting function Fw according to the supplied environment information Ic based on the control of the main control unit 7. Specifically, the recognition result processing unit 5 determines that the worse the speech recognition environment, the higher the possibility of misrecognition and hence of a recurrent utterance. Therefore, in this case, the recognition result processing unit 5 increases the importance of the recognition result of the second recognition unit 4 by making the slope of the weighting function Fw gentler. Conversely, the recognition result processing unit 5 determines that the better the speech recognition environment, the lower the possibility of misrecognition, and therefore increases the importance of the recognition result of the first recognition unit 3 by making the slope of the weighting function Fw steeper. Thereby, the recognition result processing unit 5 can set the weighting coefficient w appropriately according to the recognition environment.
- the recognition result processing unit 5 makes the slope of the weighting function Fw gentler when the traveling speed of the mounted vehicle is high than when it is low. For example, the recognition result processing unit 5 changes the slope continuously with the value of the traveling speed; in another example, it divides the traveling speed into several predetermined ranges and determines the slope for each range. Thereby, the recognition result processing unit 5 can set the weighting coefficient w appropriately according to the traveling speed.
- the recognition result processing unit 5 makes the slope of the weighting function Fw gentler, based on the latitude/longitude information and the route guidance information, when the traveling route passes a point that requires attention in driving. Specifically, the recognition result processing unit 5 determines the slope according to, for example, whether the route being traveled is in an urban area, whether accidents occur frequently there, and whether it includes an intersection where the mounted vehicle turns left or right. When the traveling route corresponds to an urban area, an accident-prone point, and/or a left/right turn point, the recognition result processing unit 5 determines that misrecognition is highly likely and makes the slope of the weighting function Fw gentler.
- the recognition result processing unit 5 can appropriately set the weighting coefficient w by using the latitude / longitude information and the route guidance information.
- the voice recognition device 100b acquires the environment information Ic from the ECU 23 or the like after executing step S105. Then, the voice recognition device 100b determines the weighting function Fw based on the environment information Ic. For example, the speech recognition apparatus 100b determines a weighting function Fw to be used from the environment information Ic by preparing a plurality of weighting functions Fw corresponding to the environment information Ic in advance and referring to a predetermined map or the like. The above-described map and the like are created in advance by experiments or the like. In another example, the speech recognition apparatus 100b changes a parameter for determining the slope of the weighting function Fw with reference to a map or the like according to the environment information Ic.
- in step S106, the speech recognition apparatus 100b determines the first and second weighting coefficients w1 and w2 from the weighting function Fw based on the elapsed time width T. Thereby, the speech recognition apparatus 100b can recognize a recurrent utterance more accurately in consideration of the speech recognition environment.
- in the second embodiment, when the recognition result processing unit 5 determines that the recognition environment is poor, it makes the slope of the weighting function Fw gentler. Instead of this, or in addition to this, the recognition result processing unit 5 may, for example, temporarily stop the measurement of the elapsed time width T when it determines that the recognition environment is poor. In another example, when the recognition result processing unit 5 determines that the recognition environment is poor, it may reduce the elapsed time width T by dividing it by, or subtracting from it, a predetermined value. By these as well, the speech recognition apparatus 100b can appropriately execute the speech recognition processing according to the recognition environment.
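The environment-dependent adjustment described in this embodiment might be sketched as follows. Every factor, threshold, and field name here is an illustrative assumption rather than something specified by the embodiment.

```python
def adjust_slope(base_slope, env):
    """Make the slope of the weighting function Fw gentler when the
    environment information Ic suggests a higher chance of
    misrecognition: high traveling speed, poor S/N ratio of the
    utterance, or a route point requiring attention in driving."""
    slope = base_slope
    if env.get("speed_kmh", 0.0) > 60.0:        # vehicle information (ECU 23)
        slope *= 0.5
    if env.get("snr_db", 30.0) < 10.0:          # acoustic information (microphone)
        slope *= 0.5
    if env.get("attention_point", False):       # latitude/longitude + route guidance
        slope *= 0.7
    return slope
```

A gentler (smaller) slope keeps w high for longer as T grows, which keeps w2 = 1 - w small, that is, keeps the second recognition unit's result heavily weighted for a longer time, matching the behavior described above for poor recognition environments.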
- as a modification, the speech recognition apparatus 100b may perform one or more processes selected from (Modification 1) to (Modification 6) of the first embodiment in addition to the processes of the second embodiment.
- the present invention can be applied to various devices that perform voice recognition processing.
- the present invention can be applied to various devices having a voice input function such as a car navigation device, a mobile phone, a personal computer, an AV device, and a home appliance.
Abstract
Description
[Basic explanation]
Prior to the description of specific embodiments, a basic explanation will be given of the dictionary structures related to the speech recognition method of the present invention. FIG. 1 is a conceptual diagram showing two types of dictionary structures used for speech recognition. Specifically, FIG. 1(a) shows an example of a dictionary structure for recognizing non-hierarchical commands, given as a comparative example, and FIG. 1(b) shows an example of a dictionary structure for recognizing hierarchical commands, which are the subject of the present invention. As described later, the speech recognition apparatus of the present invention recognizes hierarchical commands. Here, a "non-hierarchical command" refers to an utterance command that realizes one operation command or output command to the device to be controlled with a single utterance, and a "hierarchical command" refers to an utterance command that realizes one operation command or output command to the device to be controlled with two or more utterances. A "dictionary" refers to a database that stores words to be recognized (hereinafter referred to as "recognition words").
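The layered dictionary structure described above can be sketched as follows. This is an illustrative sketch only: the command words, layer keys, and helper function are hypothetical and do not appear in the patent, which specifies only that each layer of a hierarchical command has its own dictionary of recognition words.

```python
# Hypothetical per-layer dictionaries for a hierarchical command such as
# "Audio" -> "CD" -> "Play": one device instruction spread over several
# utterances, each recognized against a small layer-specific word list.
HIERARCHICAL_DICTIONARIES = {
    0: ["Audio", "Navigation", "Air Conditioner"],        # top layer
    1: {"Audio": ["CD", "Radio", "TV"],                   # second layer,
        "Navigation": ["Destination", "Map Scale"]},      # keyed by parent word
}

def dictionary_for(layer, parent=None):
    """Return the recognition-word list to load for the given layer
    (and, below the top layer, the previously recognized parent word)."""
    entry = HIERARCHICAL_DICTIONARIES[layer]
    return entry if parent is None else entry[parent]

print(dictionary_for(0))           # words accepted at the top layer
print(dictionary_for(1, "Audio"))  # words accepted after "Audio" was recognized
```

A non-hierarchical design would instead hold every full command in one large flat dictionary, which is the FIG. 1(a) case the patent contrasts against.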
[First embodiment]
First, a first embodiment of the present invention will be described. In the following, after the schematic configuration of the speech recognition apparatus is described, the speech recognition method and its processing flow will be described. Thereafter, modifications of the first embodiment will be described.
(Outline configuration)
First, the schematic configuration of the speech recognition apparatus according to the first embodiment will be described with reference to FIG. 2.
(Recognition processing method)
Next, the processing executed by the recognition result processing unit 5 will be described specifically with reference to FIGS. 3 to 5. FIG. 3 is a functional block diagram specifically showing the processing contents of the recognition result processing unit 5. As shown in FIG. 3, the recognition result processing unit 5 includes multiplication units 5x and 5y, a sorting unit 5z, and a switch unit 5v.
The multiplication unit 5x calculates the weighted likelihood Lw1 according to the following formula (1):
Lw1 = L1 × w1   Formula (1)
Here, the first weighting coefficient w1 is a coefficient that varies according to the elapsed time width T. Specifically, the first weighting coefficient w1 is set so that the importance (weight) of the recognition result of the first recognition unit 3 increases as the elapsed time width T increases. That is, with the elapsed time width T as a variable and the likelihood L1 as a constant, the first weighting coefficient w1 is a coefficient that decreases (i.e., is non-increasing) so that the absolute value of the weighted likelihood Lw1 becomes smaller as the elapsed time width T increases. Note that the speech recognition apparatus 100 regards the likelihoods L1 and L2 and the weighted likelihoods Lw1 and Lw2 as indicating higher similarity the smaller their negative values are in magnitude, i.e., the smaller their absolute values are. Accordingly, the multiplication unit 5x sets the weighted likelihood Lw1 so that the weight of the recognition result of the first recognition unit 3 increases as the first weighting coefficient w1 decreases. The method of setting the first weighting coefficient w1 will be described later.
Similarly, the multiplication unit 5y calculates the weighted likelihood Lw2 according to the following formula (2):
Lw2 = L2 × w2   Formula (2)
Here, the second weighting coefficient w2 is a coefficient that varies according to the elapsed time width T. Specifically, the second weighting coefficient w2 is set so that the importance (weight) of the recognition result of the second recognition unit 4 decreases as the elapsed time width T increases. That is, with the elapsed time width T as a variable and the likelihood L2 as a constant, the second weighting coefficient w2 is a coefficient that increases (i.e., is non-decreasing) so that the negative value of the weighted likelihood Lw2 becomes larger in magnitude as the elapsed time width T increases. Accordingly, the multiplication unit 5y sets the weighted likelihood Lw2 so that the weight of the recognition result of the second recognition unit 4 increases as the second weighting coefficient w2 decreases. The method of calculating the second weighting coefficient w2 will be described later.
Further, in the present embodiment, the first weighting coefficient w1 and the second weighting coefficient w2 are each set to values larger than 0 and smaller than 1, and their sum is set to 1. That is, the first weighting coefficient w1 and the second weighting coefficient w2 satisfy the following formulas (3) to (5):
w1 + w2 = 1   Formula (3)
0 < w1 < 1   Formula (4)
0 < w2 < 1   Formula (5)
Thereby, the recognition result processing unit 5 can appropriately set one of the first weighting coefficient w1 and the second weighting coefficient w2 by obtaining the other. That is, the recognition result processing unit 5 can set the importance of the recognition result of the second recognition unit 4 relatively lower than that of the recognition result of the first recognition unit 3 as the elapsed time width T increases.
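The weighting of formulas (1) to (5) and the subsequent ranking can be sketched as follows. This is an illustrative sketch: the candidate words and likelihood values are invented, and the assumption that likelihoods are negative scores whose smaller magnitude means higher similarity is taken from the text above.

```python
def combine(first_results, second_results, w):
    """Apply formulas (1)-(3): weight the candidates of both recognizers and
    sort them so the best (smallest-magnitude) weighted likelihood comes first,
    mirroring the roles of multiplication units 5x/5y and sorting unit 5z.
    first_results, second_results: lists of (word, likelihood), likelihood < 0.
    w: weighting coefficient with 0 < w < 1, so w1 = w and w2 = 1 - w."""
    assert 0.0 < w < 1.0
    weighted = [(word, l1 * w) for word, l1 in first_results]            # (1)
    weighted += [(word, l2 * (1.0 - w)) for word, l2 in second_results]  # (2)
    # Smaller absolute value = higher similarity, so sort by |Lw| ascending.
    return sorted(weighted, key=lambda t: abs(t[1]))

# Illustrative call: for this w, the first recognizer's candidate ranks first.
ranked = combine([("Play", -120.0)], [("Audio", -100.0)], w=0.3)
print(ranked)
```

Because w decreases with the elapsed time width T, a long gap shrinks the magnitude of the first recognizer's scores (promoting them) and a short gap promotes the previous layer's candidates instead.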
Hereinafter, for convenience, the first and second weighting coefficients w1 and w2 are expressed by the following formulas (6) and (7) using a predetermined weighting coefficient "w":
w1 = w   Formula (6)
w2 = 1 - w   Formula (7)
Here, from formulas (4) and (6), the weighting coefficient w is set to a value larger than 0 and smaller than 1. That is, it satisfies the following formula (8):
0 < w < 1   Formula (8)
Next, the method of setting the weighting coefficient w, which determines the first weighting coefficient w1 and the second weighting coefficient w2, will be described in detail with reference to FIG. 4. FIG. 4 is an example of a graph showing the relationship between the weighting coefficient w and the elapsed time width T. The graphs "G1" and "G2" are examples of functions that determine the weighting coefficient w from the elapsed time width T (hereinafter referred to as the "weighting function Fw"). Further, "T1" in FIG. 4 is a threshold for the elapsed time width T used by the switch unit 5v to determine whether to adopt the recognition result of the first recognition unit 3 as the final recognition result, or to derive the final recognition result using the recognition result of the second recognition unit 4 in addition to the recognition result of the first recognition unit 3. This will be described in detail in the description of the switch unit 5v.
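The shapes of G1 and G2 are not given numerically in this excerpt; as a minimal sketch, a weighting function Fw consistent with the constraints above (non-increasing in T, confined to the open interval of formula (8), approaching 0 for large T near the threshold T1) can be written as an exponential decay. The decay constant and threshold values are assumptions.

```python
import math

def weighting_function(elapsed_t, tau=5.0):
    """Hypothetical weighting function Fw: maps the elapsed time width T to w
    with 0 < w < 1, non-increasing in T. The exponential shape and tau are
    illustrative; the patent's G1/G2 curves are defined only graphically."""
    w = math.exp(-elapsed_t / tau)
    eps = 1e-6
    return min(max(w, eps), 1.0 - eps)  # clamp into (0, 1) per formula (8)

def use_second_recognizer(elapsed_t, t1=20.0):
    """Switch-unit-5v style decision: below the threshold T1 a re-utterance
    is likely, so the previous layer's recognition result is also used."""
    return elapsed_t < t1

print(weighting_function(2.0), use_second_recognizer(2.0))
```

With this shape, w1 = w stays large for a quickly repeated utterance (de-emphasizing the current layer) and w2 = 1 - w grows with T.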
(Processing flow)
Next, the processing procedure in the first embodiment will be described. FIG. 6 is an example of a flowchart showing the processing procedure executed by the speech recognition apparatus 100 in the first embodiment. The speech recognition apparatus 100 repeatedly executes the processing of the flowchart shown in FIG. 6.
(Modification 1)
In the configuration of the speech recognition apparatus 100 in FIG. 2, the speech recognition apparatus 100 includes two speech recognition units, the first recognition unit 3 and the second recognition unit 4, and the second recognition unit 4 executes speech recognition for the layer preceding that of the first recognition unit 3. However, the configuration of the speech recognition apparatus 100 to which the present invention is applicable is not limited to this. Instead, the speech recognition apparatus 100 may execute the above-described processing with only a single speech recognition unit.
(Modification 2)
In the description of FIG. 2, the time measurement unit 8 sets as the elapsed time width T the time from when the recognition result is output by the presentation unit 6 until the next utterance data Sa is input. However, the setting to which the present invention is applicable is not limited to this. For example, the time measurement unit 8 may instead set as the elapsed time width T the time from when the previous utterance data Sa was input until the next utterance data Sa is input, the time from when the utterance button for inputting the previous utterance data Sa was pressed until the next time the utterance button is pressed, or a time width equivalent thereto.
(Modification 3)
In the graphs G1 and G2 in FIG. 4, the speech recognition apparatus 100 sets the threshold T1 at the time width where the weighting coefficient w approaches "0". However, the method to which the present invention is applicable is not limited to this. Instead, the speech recognition apparatus 100 may, for example, set the threshold T1 at the time width where the weighting coefficient w approaches "0.5". Alternatively, the threshold T1 may be set for each weighting function Fw to be used, or based on an input from the user or the like. By this as well, the speech recognition apparatus 100 can recognize a re-utterance caused by misrecognition in a minimum number of steps while reducing unnecessary processing.
(Modification 4)
In the description of FIG. 2 and the like, the speech recognition apparatus 100 includes the first recognition unit 3 and the second recognition unit 4, and the second recognition unit 4 executes speech recognition for the layer preceding that of the first recognition unit 3. In addition to these, the speech recognition apparatus 100 may further include a third recognition unit that executes speech recognition for the layer preceding that of the second recognition unit 4. In this case, the recognition result processing unit 5 refers to a predetermined formula or map, determines from the elapsed time width T the weighting of the likelihood calculated by each recognition unit, and sorts the candidates based on the weighted likelihoods. The above formula or map is created in advance based on experiments or the like and stored in the storage unit 9. The same extension applies when four or more recognition units are used.
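The extension to three or more recognition units described in this modification can be sketched as follows. The per-unit weights would come from the stored formula or map; here they are passed in directly, and all words and values are illustrative.

```python
def combine_n(results_per_layer, weights):
    """Generalize the two-recognizer weighting to N recognition units.
    results_per_layer: one list of (word, likelihood) per recognition unit,
    with likelihoods negative (smaller magnitude = more similar).
    weights: per-unit coefficients of the same length, e.g. looked up from a
    map keyed by the elapsed time width T. Returns candidates sorted best-first
    by the magnitude of the weighted likelihood."""
    assert len(results_per_layer) == len(weights)
    weighted = [(word, lk * w)
                for results, w in zip(results_per_layer, weights)
                for word, lk in results]
    return sorted(weighted, key=lambda t: abs(t[1]))

# Three recognizers with equal raw scores: the smallest weight wins the sort.
print(combine_n([[("a", -100.0)], [("b", -100.0)], [("c", -100.0)]],
                [0.2, 0.3, 0.5]))
```

The two-recognizer case of the main embodiment is recovered with weights [w, 1 - w].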
(Modification 5)
In the description of the first embodiment, the speech recognition apparatus 100 uses likelihood as the similarity measure. However, the similarity measure to which the present invention is applicable is not limited to this. Instead, the speech recognition apparatus 100 may use another index indicating similarity. Even in this case, the speech recognition apparatus 100 judges that the shorter the elapsed time width T, the higher the possibility that a misrecognition occurred, and increases the weight of the recognition result of the second recognition unit 4. By this as well, the speech recognition apparatus 100 can recognize a re-utterance caused by misrecognition in a minimum number of steps.
(Modification 6)
In the description of FIG. 3, when the elapsed time width T is less than the threshold T1, the recognition result processing unit 5 determines the final recognition result based on the recognition result of the first recognition unit 3 and the recognition result of the second recognition unit 4. Instead, when the elapsed time width T is less than the threshold T1, the recognition result processing unit 5 may determine the final recognition result based only on the recognition result of the second recognition unit 4. That is, in this case, the speech recognition apparatus 100 judges that the possibility of a re-utterance is high, performs speech recognition only for the preceding layer, and determines the final ranking of the recognition words WR based only on the likelihood L2. By this as well, the speech recognition apparatus 100 can more reliably recognize a re-utterance caused by misrecognition in a minimum number of steps.
[Second embodiment]
Next, the processing executed by the speech recognition apparatus in the second embodiment will be described. In the second embodiment, the speech recognition apparatus determines the weighting coefficient w based on information indicating the speech recognition environment (hereinafter referred to as "environment information Ic") in addition to the elapsed time width T. Thereby, the speech recognition apparatus can recognize a re-utterance more accurately according to the environment in which speech recognition is executed.
(Outline configuration)
FIG. 8 is an example of a diagram showing the schematic configuration of the speech recognition apparatus 100b according to the second embodiment. The speech recognition apparatus 100b is mounted on a vehicle. The speech recognition apparatus 100b differs from the speech recognition apparatus 100 of the first embodiment in that it includes an ECU 23, a GPS receiver 24, and a route information acquisition unit 25. Hereinafter, the vehicle on which the speech recognition apparatus 100b is mounted is referred to as the "host vehicle".
(Voice recognition method)
Next, the processing executed by the speech recognition apparatus 100b in the second embodiment will be described. The recognition result processing unit 5, under the control of the main control unit 7, changes the slope of the weighting function Fw according to the supplied environment information Ic. Specifically, the recognition result processing unit 5 judges that the poorer the speech recognition environment, the higher the possibility of misrecognition and hence the higher the possibility that a re-utterance will be made. Accordingly, in this case, the recognition result processing unit 5 increases the importance of the recognition result of the second recognition unit 4 by making the slope gentler. Conversely, the recognition result processing unit 5 judges that the better the speech recognition environment, the lower the possibility of misrecognition. Accordingly, in that case, the recognition result processing unit 5 increases the importance of the recognition result of the first recognition unit 3 by making the slope of the weighting function Fw steeper. Thereby, the recognition result processing unit 5 can set the weighting coefficient w appropriately according to the recognition environment.
1. When vehicle information is used
When the vehicle information indicates that the traveling speed of the host vehicle is high, the recognition result processing unit 5 makes the slope of the weighting function Fw gentler than at low speed. Specifically, the recognition result processing unit 5 may, for example, vary the slope continuously with each value of the traveling speed. In another example, the recognition result processing unit 5 may divide the traveling speed into several ranges in advance and set the slope for each range.
2. When latitude/longitude information and route guidance information are used
Based on the latitude/longitude information and the route guidance information, the recognition result processing unit 5 makes the slope of the weighting function Fw gentler when the route being traveled passes a point requiring caution in driving. Specifically, the recognition result processing unit 5 judges whether the route being traveled is in an urban area, whether it is an accident-prone point, whether it is an intersection at which the host vehicle turns left or right, and so on. When the traveling route corresponds to an urban area, an accident-prone point, and/or a left/right-turn point or the like, the recognition result processing unit 5 judges that the possibility of misrecognition is high and makes the slope of the weighting function Fw gentler.
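The range-based variant for vehicle speed can be sketched as follows. The speed ranges, the decay-constant interpretation of the slope, and all numeric values are illustrative assumptions; the text says only that ranges may be predefined with a slope assigned per range, gentler at higher speed.

```python
# Hypothetical map from speed range to a decay constant for Fw: a larger
# value means a gentler decay of w, i.e. more weight on the previous
# layer's recognizer in noisier high-speed driving.
SPEED_RANGE_TAU = [
    (0.0, 30.0, 5.0),             # low speed: steeper decay
    (30.0, 80.0, 8.0),            # medium speed
    (80.0, float("inf"), 12.0),   # high speed: gentlest decay
]

def tau_for_speed(speed_kmh):
    """Look up the Fw slope parameter for the current traveling speed."""
    for low, high, tau in SPEED_RANGE_TAU:
        if low <= speed_kmh < high:
            return tau
    raise ValueError("speed must be non-negative")

print(tau_for_speed(50.0))
```

A continuous variant would instead compute the parameter directly from the speed value, matching the first option the text mentions.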
(Processing flow)
Next, the processing procedure executed by the speech recognition apparatus 100b in the second embodiment will be described, referring again to FIG. 6, which was used in the description of the processing flow of the first embodiment. In the following, only the parts where the processing differs from the first embodiment are described.
(Modification 1)
In the description of the second embodiment above, when the recognition result processing unit 5 determines that the recognition environment is poor, it sets the slope of the weighting function Fw gently. Instead of this, or in addition to this, the recognition result processing unit 5 may, for example, temporarily stop the measurement of the elapsed time width T when it determines that the recognition environment is poor. In another example, when the recognition result processing unit 5 determines that the recognition environment is poor, it may reduce the elapsed time width T by dividing it by, or subtracting from it, a predetermined value. By these measures as well, the speech recognition apparatus 100b can execute the speech recognition processing appropriately according to the recognition environment.
(Modification 2)
(Modification 1) to (Modification 6) of the first embodiment can also be applied to the second embodiment. That is, in this case, the speech recognition apparatus 100b performs one or more processes selected from (Modification 1) to (Modification 6) of the first embodiment in addition to the processes of the second embodiment described above.
DESCRIPTION OF REFERENCE NUMERALS
2 Voice section detection unit
3 First recognition unit
4 Second recognition unit
5 Recognition result processing unit
6 Presentation unit
7 Main control unit
8 Time measurement unit
9 Storage unit
20 Recognition unit
21 Recognition result holding unit
23 ECU
24 GPS receiver
25 Route information acquisition unit
Claims (10)
- A speech recognition device for recognizing hierarchical commands by speech, comprising:
dictionary storage means for storing dictionaries used in the speech recognition of each layer for recognizing the hierarchical commands;
first speech recognition means for performing, when speech data is input, speech recognition based on the dictionary to be used on the assumption that the previously input speech data was correctly recognized;
second speech recognition means for performing, when the speech data is input, speech recognition based on the dictionary used when the speech recognition of the previously input speech data was executed; and
recognition result processing means for determining a final recognition result by weighting the recognition result of the first speech recognition means and the recognition result of the second speech recognition means based on an elapsed time width counted from the time of processing the previously input speech data.
- The speech recognition device according to claim 1, wherein the recognition result processing means determines the final recognition result based only on the recognition result of the first speech recognition means when the elapsed time width is equal to or greater than a predetermined width.
- The speech recognition device according to claim 1 or 2, wherein the recognition result processing means determines the final recognition result based only on the recognition result of the second speech recognition means when the elapsed time width is less than a predetermined width.
- The speech recognition device according to any one of claims 1 to 3, wherein the first speech recognition means and the second speech recognition means each calculate a likelihood as a recognition result, and the recognition result processing means performs the weighting by multiplying the likelihood calculated by the first speech recognition means by a first parameter that is a non-increasing function of the elapsed time width, and multiplying the likelihood calculated by the second speech recognition means by a second parameter that is a non-decreasing function of the elapsed time width.
- The speech recognition device according to claim 4, wherein the recognition result processing means sets the sum of the first parameter and the second parameter to a constant value.
- The speech recognition device according to any one of claims 1 to 5, further comprising environment information acquisition means for acquiring information on a recognition environment, wherein, when the recognition result processing means determines based on the information that the recognition environment is poor, it increases the weight of the recognition result of the second speech recognition means compared with the case where the recognition environment is not poor.
- The speech recognition device according to claim 6, mounted on a vehicle, wherein the environment information acquisition means acquires the vehicle speed of the vehicle and/or the degree of noise included in the speech data, and the recognition result processing means increases the weighting of the recognition result of the second speech recognition means as the vehicle speed and/or the degree of noise increases.
- A speech recognition method used by a speech recognition device that stores dictionaries used in the speech recognition of each layer for recognizing hierarchical commands, the method comprising:
a first speech recognition step of performing, when speech data is input, speech recognition based on the dictionary to be used on the assumption that the previously input speech data was correctly recognized;
a second speech recognition step of performing, when the speech data is input, speech recognition based on the dictionary used when the speech recognition of the previously input speech data was executed; and
a recognition result processing step of determining a final recognition result by weighting the recognition result of the first speech recognition step and the recognition result of the second speech recognition step based on an elapsed time width counted from the time of processing the previously input speech data.
- A speech recognition program executed by a speech recognition device that stores dictionaries used in the speech recognition of each layer for recognizing hierarchical commands, the program causing the speech recognition device to function as:
first speech recognition means for performing, when speech data is input, speech recognition based on the dictionary to be used on the assumption that the previously input speech data was correctly recognized;
second speech recognition means for performing, when the speech data is input, speech recognition based on the dictionary used when the speech recognition of the previously input speech data was executed; and
recognition result processing means for determining a final recognition result by weighting the recognition result of the first speech recognition means and the recognition result of the second speech recognition means based on an elapsed time width counted from the time of processing the previously input speech data.
- A storage medium storing the program according to claim 9.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2011503691A JPWO2011016129A1 (en) | 2009-08-07 | 2009-08-07 | Speech recognition apparatus, speech recognition method, and speech recognition program |
PCT/JP2009/063996 WO2011016129A1 (en) | 2009-08-07 | 2009-08-07 | Voice recognition device, voice recognition method, and voice recognition program |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2009/063996 WO2011016129A1 (en) | 2009-08-07 | 2009-08-07 | Voice recognition device, voice recognition method, and voice recognition program |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2011016129A1 true WO2011016129A1 (en) | 2011-02-10 |
Family
ID=43544048
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2009/063996 WO2011016129A1 (en) | 2009-08-07 | 2009-08-07 | Voice recognition device, voice recognition method, and voice recognition program |
Country Status (2)
Country | Link |
---|---|
JP (1) | JPWO2011016129A1 (en) |
WO (1) | WO2011016129A1 (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2016180917A (en) * | 2015-03-25 | 2016-10-13 | 日本電信電話株式会社 | Correction speech detection device, voice recognition system, correction speech detection method, and program |
CN107577151A (en) * | 2017-08-25 | 2018-01-12 | 谢锋 | A kind of method, apparatus of speech recognition, equipment and storage medium |
JP2018180943A (en) * | 2017-04-13 | 2018-11-15 | 新日鐵住金株式会社 | Plan creation apparatus, plan creation method, and program |
JP2019053143A (en) * | 2017-09-13 | 2019-04-04 | アルパイン株式会社 | Voice recognition system and computer program |
US20220020362A1 (en) * | 2020-07-17 | 2022-01-20 | Samsung Electronics Co., Ltd. | Speech signal processing method and apparatus |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH04199198A (en) * | 1990-11-29 | 1992-07-20 | Matsushita Electric Ind Co Ltd | Speech recognition device |
JP2002041078A (en) * | 2000-07-21 | 2002-02-08 | Sharp Corp | Voice recognition equipment, voice recognition method and program recording medium |
JP2003177788A (en) * | 2001-12-12 | 2003-06-27 | Fujitsu Ltd | Audio interactive system and its method |
JP2008089625A (en) * | 2006-09-29 | 2008-04-17 | Honda Motor Co Ltd | Voice recognition apparatus, voice recognition method and voice recognition program |
- 2009-08-07 JP JP2011503691A patent/JPWO2011016129A1/en active Pending
- 2009-08-07 WO PCT/JP2009/063996 patent/WO2011016129A1/en active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH04199198A (en) * | 1990-11-29 | 1992-07-20 | Matsushita Electric Ind Co Ltd | Speech recognition device |
JP2002041078A (en) * | 2000-07-21 | 2002-02-08 | Sharp Corp | Voice recognition equipment, voice recognition method and program recording medium |
JP2003177788A (en) * | 2001-12-12 | 2003-06-27 | Fujitsu Ltd | Audio interactive system and its method |
JP2008089625A (en) * | 2006-09-29 | 2008-04-17 | Honda Motor Co Ltd | Voice recognition apparatus, voice recognition method and voice recognition program |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2016180917A (en) * | 2015-03-25 | 2016-10-13 | 日本電信電話株式会社 | Correction speech detection device, voice recognition system, correction speech detection method, and program |
JP2018180943A (en) * | 2017-04-13 | 2018-11-15 | 新日鐵住金株式会社 | Plan creation apparatus, plan creation method, and program |
CN107577151A (en) * | 2017-08-25 | 2018-01-12 | 谢锋 | A kind of method, apparatus of speech recognition, equipment and storage medium |
JP2019053143A (en) * | 2017-09-13 | 2019-04-04 | アルパイン株式会社 | Voice recognition system and computer program |
US20220020362A1 (en) * | 2020-07-17 | 2022-01-20 | Samsung Electronics Co., Ltd. | Speech signal processing method and apparatus |
US11670290B2 (en) * | 2020-07-17 | 2023-06-06 | Samsung Electronics Co., Ltd. | Speech signal processing method and apparatus |
Also Published As
Publication number | Publication date |
---|---|
JPWO2011016129A1 (en) | 2013-01-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10121467B1 (en) | Automatic speech recognition incorporating word usage information | |
US9196248B2 (en) | Voice-interfaced in-vehicle assistance | |
US7747437B2 (en) | N-best list rescoring in speech recognition | |
JP4433704B2 (en) | Speech recognition apparatus and speech recognition program | |
US10381000B1 (en) | Compressed finite state transducers for automatic speech recognition | |
JP4846735B2 (en) | Voice recognition device | |
US9715877B2 (en) | Systems and methods for a navigation system utilizing dictation and partial match search | |
EP2082335A2 (en) | System and method for a cooperative conversational voice user interface | |
US20220358908A1 (en) | Language model adaptation | |
JP2010191400A (en) | Speech recognition system and data updating method | |
US10199037B1 (en) | Adaptive beam pruning for automatic speech recognition | |
WO2011016129A1 (en) | Voice recognition device, voice recognition method, and voice recognition program | |
US20230102157A1 (en) | Contextual utterance resolution in multimodal systems | |
US11145296B1 (en) | Language and grammar model adaptation | |
JP2006308848A (en) | Vehicle instrument controller | |
JP2009230068A (en) | Voice recognition device and navigation system | |
JP4770374B2 (en) | Voice recognition device | |
CN110556104B (en) | Speech recognition device, speech recognition method, and storage medium storing program | |
WO2012076895A1 (en) | Pattern recognition | |
KR101063159B1 (en) | Address Search using Speech Recognition to Reduce the Number of Commands | |
CN111798842B (en) | Dialogue system and dialogue processing method | |
KR20100073178A (en) | Speaker adaptation apparatus and its method for a speech recognition | |
KR20060098673A (en) | Method and apparatus for speech recognition | |
KR102527346B1 (en) | Voice recognition device for vehicle, method for providing response in consideration of driving status of vehicle using the same, and computer program | |
US20230315997A9 (en) | Dialogue system, a vehicle having the same, and a method of controlling a dialogue system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
ENP | Entry into the national phase |
Ref document number: 2011503691 Country of ref document: JP Kind code of ref document: A |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 09848061 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 13061652 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 09848061 Country of ref document: EP Kind code of ref document: A1 |