WO2011016129A1 - Voice recognition device, voice recognition method, and voice recognition program - Google Patents

Voice recognition device, voice recognition method, and voice recognition program

Info

Publication number
WO2011016129A1
WO2011016129A1
Authority
WO
WIPO (PCT)
Prior art keywords
recognition
speech
speech recognition
recognition result
voice
Prior art date
Application number
PCT/JP2009/063996
Other languages
French (fr)
Japanese (ja)
Inventor
実 吉田
Original Assignee
パイオニア株式会社 (Pioneer Corporation)
Priority date
Filing date
Publication date
Application filed by パイオニア株式会社 (Pioneer Corporation)
Priority to JP2011503691A
Priority to PCT/JP2009/063996
Publication of WO2011016129A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/28 - Constructional details of speech recognition systems
    • G10L 15/32 - Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems

Definitions

  • The present invention relates to a speech recognition technology for recognizing hierarchical utterance commands.
  • Patent Document 1 discloses a technique for recognizing a voice input as a rephrasing when a further voice input is received within a predetermined time width after a voice input has been received.
  • Patent Document 2 discloses a technique related to the present invention.
  • An object of the present invention is to provide a voice recognition device capable of recognizing a rephrased utterance with a minimum number of steps even when a misrecognition occurs.
  • The invention according to claim 1 is a voice recognition device for voice recognition of hierarchical commands, comprising: dictionary storage means for storing the dictionaries used in the speech recognition of each hierarchy for recognizing the hierarchical commands; first speech recognition means for, when speech data is input, performing speech recognition based on the dictionary to be used when it is assumed that the previously input speech data was correctly recognized; second speech recognition means for, when speech data is input, performing speech recognition based on the dictionary used when the speech recognition of the previously input speech data was executed; and recognition result processing means for determining a final recognition result by weighting the recognition result of the first speech recognition means and the recognition result of the second speech recognition means based on an elapsed time width calculated from the time of processing of the previously input speech data.
  • The invention according to claim 8 is a speech recognition method used by a speech recognition apparatus that stores the dictionaries used in the speech recognition of each layer for recognizing layer commands, the method comprising: a first speech recognition step of, when speech data is input, performing speech recognition based on the dictionary to be used when it is assumed that the previously input speech data was correctly recognized; a second speech recognition step of, when speech data is input, performing speech recognition based on the dictionary used when the speech recognition of the previously input speech data was executed; and a recognition result processing step of determining a final recognition result by weighting the recognition result of the first speech recognition step and the recognition result of the second speech recognition step based on an elapsed time width calculated from the time of processing of the previously input speech data.
  • The invention according to claim 9 is a speech recognition program executed by a speech recognition apparatus that stores the dictionaries used in the speech recognition of each layer for recognizing layer commands, the program causing the speech recognition apparatus to function as: first voice recognition means for, when speech data is input, performing voice recognition based on the dictionary to be used when it is assumed that the previously input voice data was correctly recognized; second speech recognition means for, when speech data is input, performing speech recognition based on the dictionary used when the speech recognition of the previously input speech data was executed; and recognition result processing means for determining a final recognition result by weighting the recognition result of the first speech recognition means and the recognition result of the second speech recognition means based on an elapsed time width calculated from the time of processing of the previously input speech data.
  • In one aspect, a speech recognition apparatus for speech recognition of hierarchical commands comprises: dictionary storage means for storing the dictionaries used in the speech recognition of each layer for recognizing the hierarchical commands; first speech recognition means for, when speech data is input, performing speech recognition based on the dictionary to be used when it is assumed that the previously input speech data was correctly recognized; second speech recognition means for, when speech data is input, performing speech recognition based on the dictionary used when the speech recognition of the previously input speech data was executed; and recognition result processing means for determining a final recognition result by weighting the recognition result of the first speech recognition means and the recognition result of the second speech recognition means based on an elapsed time width calculated from the time of processing of the previously input speech data.
  • The above speech recognition apparatus recognizes hierarchical commands and includes the dictionary storage means, the first speech recognition means, the second speech recognition means, and the recognition result processing means.
  • A hierarchical command refers to a command that realizes one operation command or output command to a device to be controlled through two or more utterances. That is, in the case of a hierarchical command, one operation command or output command is specified step by step, hierarchy by hierarchy, based on the voice-recognized commands.
  • The dictionary storage means stores the dictionaries used for speech recognition in each hierarchy for recognizing hierarchical commands.
  • Here, the “dictionary used in speech recognition in each layer” refers to the dictionary used in each speech recognition process necessary to realize one operation command or output command.
  • The “voice recognition process” refers to the entire process necessary for recognizing one piece of voice data.
  • The dictionary stores a plurality of recognized words against which the input voice data is pattern-matched.
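  • (Illustrative sketch, not part of the publication.) As a rough Python illustration, the per-layer dictionaries could be held as simple word lists keyed by hierarchy; the word entries below are taken from the embodiment of FIG. 1(b) and FIG. 5, and all variable names are hypothetical:

        # Recognized words per dictionary (examples from the embodiment).
        FIRST_COMMAND_DICTIONARY = ["name search", "route search", "route erasure"]
        SECOND_COMMAND_DICTIONARY = ["return", "stop", "correction"]
        PLACE_NAME_DICTIONARY = ["Tokyo Tower", "Tokyo Port"]

        # Standby dictionaries used in each layer of the two-layer example:
        # layer 1 uses the first command dictionary; layer 2 uses the second
        # command dictionary together with the place name dictionary.
        STANDBY_DICTIONARIES = {
            1: [FIRST_COMMAND_DICTIONARY],
            2: [SECOND_COMMAND_DICTIONARY, PLACE_NAME_DICTIONARY],
        }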
  • When speech data is input, the first speech recognition means performs speech recognition based on the dictionary to be used when it is assumed that the previously input speech data was correctly recognized.
  • That is, the first voice recognition means determines the dictionary to be used based on the voice data input immediately before, and performs voice recognition of the newly input voice data.
  • When speech data is input, the second voice recognition means performs voice recognition based on the dictionary used when the voice recognition of the previously input voice data was executed.
  • That is, the second voice recognition means performs voice recognition of the newly input voice data on the assumption that the voice recognition of the voice data input immediately before was a misrecognition.
  • The recognition result processing means weights the recognition result of the first voice recognition means and the recognition result of the second voice recognition means based on an elapsed time width calculated from the time of processing of the previously input voice data, and determines the final recognition result.
  • Here, the “recognition result” indicates the degree of similarity between each recognized word stored in the dictionary used and the newly input speech data.
  • The “elapsed time width” refers to the time width from the start or end of processing of the previously input audio data to the start of processing of the newly input audio data, or a time width corresponding thereto.
  • The “final recognition result” refers to the recognition result that is finally output by the speech recognition apparatus.
  • For example, the final recognition result is a predetermined number of recognized words arranged in descending order of similarity, together with their degrees of similarity.
  • In this way, the speech recognition apparatus weights the recognition results of the two recognition processes based on the elapsed time width. As a result, even if a misrecognition occurs, the speech recognition apparatus can recognize the recurrent utterance with a minimum number of steps, without an utterance or operation corresponding to a correction operation and without starting over from the beginning.
  • In one mode, the recognition result processing means determines the final recognition result based only on the recognition result of the first speech recognition means when the elapsed time width is equal to or greater than a predetermined width.
  • The predetermined width is set to an appropriate value based on experiments or the like. Specifically, the predetermined width is set to an elapsed time width at which the newly input audio data is unlikely, or extremely unlikely, to be a rephrasing of the previously input audio data. In this mode, the speech recognition apparatus can reduce unnecessary processing when a recurrent utterance is unlikely or impossible.
  • In another mode, the recognition result processing means determines the final recognition result based only on the recognition result of the second speech recognition means when the elapsed time width is less than a predetermined width.
  • Here too, the predetermined width is set to an appropriate value based on experiments or the like. Specifically, the predetermined width is set, for example, to an elapsed time width at which the newly input audio data is highly likely to be a rephrasing of the previously input audio data.
  • In this mode, the speech recognition apparatus can reduce unnecessary processing when a recurrent utterance is highly likely.
  • In one mode, the first speech recognition means and the second speech recognition means calculate likelihoods as recognition results, and the recognition result processing means performs the weighting by multiplying the likelihood calculated by the first speech recognition means by a first parameter that is a non-increasing function of the elapsed time width, and by multiplying the likelihood calculated by the second speech recognition means by a second parameter that is a non-decreasing function of the elapsed time width.
  • Here, the “likelihood” includes log likelihood.
  • By multiplying the likelihood calculated by the first speech recognition means by the first parameter, a non-increasing function of the elapsed time width, the speech recognition apparatus gradually raises the weight, that is, the importance in the final recognition result, of the recognition result of the first speech recognition means. Likewise, by multiplying the likelihood calculated by the second speech recognition means by the second parameter, a non-decreasing function of the elapsed time width, the speech recognition apparatus gradually lowers the weight of the recognition result of the second speech recognition means. In this way, even if a misrecognition occurs, the speech recognition device can recognize a recurrent utterance with a minimum number of steps, without an utterance or operation corresponding to a correction operation and without starting over from the beginning.
  • In one mode, the recognition result processing means sets the sum of the first parameter and the second parameter to a constant value.
  • The constant value is set to 1, for example. In this way, the speech recognition apparatus can easily determine one of the first and second parameters from the other, and set both parameters to appropriate values.
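  • (Illustrative sketch, not part of the publication.) Assuming log likelihoods that are negative values, where a smaller absolute value means higher similarity, the weighting with two parameters whose sum is held at the constant value 1 might look as follows in Python; the function name is hypothetical:

        def weight_likelihoods(l1, l2, first_parameter):
            """l1, l2: (negative) log likelihoods from the first and second
            speech recognition means; first_parameter: non-increasing in the
            elapsed time width, with 0 < first_parameter < 1."""
            second_parameter = 1.0 - first_parameter  # sum kept at the constant 1
            # A smaller multiplier shrinks the negative value toward zero,
            # i.e. raises that recognizer's effective weight in the ranking.
            return l1 * first_parameter, l2 * second_parameter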
  • In one mode, the speech recognition apparatus further includes environment information acquisition means for acquiring information related to the recognition environment, and when the recognition result processing means determines based on this information that the recognition environment is poor, it increases the weight of the recognition result of the second speech recognition means compared with the case where the recognition environment is not poor.
  • The “recognition environment” refers to the environment in which the speech recognition apparatus performs speech recognition, which affects the accuracy of the speech recognition. In general, when the recognition environment is poor, misrecognition is more likely, and in that case the speaker is likely to rephrase. Accordingly, when the speech recognition apparatus determines that the recognition environment is poor, it can perform speech recognition with higher accuracy by weighting the recognition result of the second speech recognition means more heavily than usual.
  • In one mode, the environment information acquisition means is installed in a vehicle and acquires the vehicle speed of the vehicle and/or the degree of noise included in the speech data, and the recognition result processing means increases the weight of the recognition result of the second speech recognition means as the vehicle speed and/or the degree of noise increases.
  • Here, the “degree of noise” corresponds to, for example, the S/N ratio.
  • When the vehicle speed is high, the driver concentrates on driving, so the response to the voice recognition device tends to be slow and the elapsed time tends to increase.
  • In addition, when the vehicle speed is high, the noise associated with traveling is expected to increase.
  • The speech recognition apparatus therefore determines that the higher the vehicle speed and/or the degree of noise, the higher the probability of misrecognition and the higher the possibility of a recurrent utterance. In this mode, the speech recognition apparatus can thus execute speech recognition with higher accuracy.
  • Another aspect of the invention is a speech recognition method used by a speech recognition apparatus that stores the dictionaries used in the speech recognition of each layer for recognizing layer commands, the method including: a first speech recognition step of, when speech data is input, performing speech recognition based on the dictionary to be used when it is assumed that the previously input speech data was correctly recognized; a second speech recognition step of, when speech data is input, performing speech recognition based on the dictionary used when the speech recognition of the previously input speech data was executed; and a recognition result processing step of determining a final recognition result by weighting the recognition result of the first speech recognition step and the recognition result of the second speech recognition step based on an elapsed time width calculated from the time of processing of the previously input speech data.
  • Another aspect of the invention is a speech recognition program executed by a speech recognition apparatus that stores the dictionaries used in the speech recognition of each layer for recognizing layer commands, the program causing the speech recognition apparatus to function as: first voice recognition means for, when speech data is input, performing voice recognition based on the dictionary to be used when it is assumed that the previously input voice data was correctly recognized; second speech recognition means for, when speech data is input, performing speech recognition based on the dictionary used when the speech recognition of the previously input speech data was executed; and recognition result processing means for determining a final recognition result by weighting the recognition result of the first speech recognition means and the recognition result of the second speech recognition means based on an elapsed time width calculated from the time of processing of the previously input speech data.
  • By being equipped with this program, the voice recognition device can, even if a misrecognition occurs, recognize a recurrent utterance with minimal steps, without utterances or operations corresponding to corrective actions and without starting over from the beginning.
  • In one mode, the program is recorded on a storage medium.
  • FIG. 1 is a conceptual diagram showing two types of dictionary configurations used for speech recognition. Specifically, FIG. 1(a) shows an example of a dictionary configuration for recognizing non-hierarchical commands, given as a comparative example, and FIG. 1(b) shows an example of a dictionary configuration for recognizing hierarchical commands, which are the object of the present invention.
  • The speech recognition apparatus of the present invention recognizes hierarchical commands, as will be described later.
  • A non-hierarchical command refers to an utterance command that realizes one operation command or output command to the device to be controlled with a single utterance, whereas a hierarchical command refers to an utterance command that realizes one operation command or output command to the device to be controlled with two or more utterances.
  • A dictionary refers to a database that stores the words to be recognized (hereinafter referred to as “recognized words”).
  • In the configuration of FIG. 1(a), the voice recognition apparatus performs a predetermined analysis on the utterance data input through a microphone or the like (hereinafter referred to as “utterance data Sa”), and then outputs a recognition result based on the command dictionary 11.
  • The “recognition result” corresponds to, for example, the recognized word having the highest similarity to the utterance data Sa, or a list of recognized words arranged in descending order of similarity to the utterance data Sa together with their similarities.
  • The “utterance data Sa” refers to an input signal including voice. For example, when the voice recognition device is installed in a car navigation device, the utterance data Sa is the input signal recorded from the microphone during a predetermined time after the user presses the utterance button.
  • In the example of FIG. 1(b), the hierarchical command is composed of two hierarchies.
  • The speech recognition apparatus recognizes the speech data Sa input first in a series of hierarchical command recognition processing using the first command dictionary 12.
  • This speech recognition is hereinafter referred to as “speech recognition in the first layer”. That is, speech recognition in the first layer refers to the process, among the series of processes for recognizing a hierarchical command, of recognizing the speech data Sa using the dictionary used in the initial state, i.e., the state in which no speech command has yet been accepted.
  • In this way, when recognizing a hierarchical command, the speech recognition apparatus changes the dictionary to be used (hereinafter also referred to as the “standby dictionary”) for each processing state (hierarchy). Suppose now that the speech recognition apparatus recognizes “name search” by speech recognition in the first hierarchy based on the first command dictionary 12; that is, suppose that the speech recognition apparatus determines that “name search” has the highest similarity among the recognized words stored in the first command dictionary 12.
  • Next, based on the recognized “name search” command, the speech recognition apparatus performs speech recognition using the second command dictionary 13 and the place name dictionary 14 as the next standby dictionaries.
  • The second command dictionary 13 contains utterance commands, such as “return”, “stop”, and “correction”, that are input when the speaker explicitly wants to return the process to the previous state (i.e., speech recognition in the first layer) or wants to redo the series of hierarchical command recognition processing from the beginning.
  • The place name dictionary 14 contains the place names that are the targets of the “name search” recognized by the speech recognition in the first layer.
  • The speech recognition apparatus then calculates the similarity of the recognized words stored in the second command dictionary 13 and the place name dictionary 14 against the next input utterance data Sa (here, “Tokyo Tower” is uttered).
  • The voice recognition here is referred to as “voice recognition in the second layer”. That is, speech recognition in the second layer refers to the process of recognizing the utterance data Sa based on the speech recognition result in the first layer.
  • Voice recognition in the third and subsequent hierarchies is defined in the same manner.
  • The speech recognition apparatus outputs the result of the speech recognition in the second layer as the final recognition result. Thereafter, for new utterance data Sa, the speech recognition apparatus again performs speech recognition in the first hierarchy using the first command dictionary 12.
  • The case of a two-layer utterance command has been described above; for utterance commands of three or more layers as well, in the voice recognition of each layer, the voice recognition device likewise selects a standby dictionary based on the recognized word voice-recognized in the previous hierarchy, and performs voice recognition, as sketched below.
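  • (Illustrative sketch, not part of the publication.) The selection of the next standby dictionary from the word recognized in the current hierarchy could be a simple lookup; the table entries and names below are hypothetical:

        FIRST_COMMAND_DICTIONARY = ["name search", "route search", "route erasure"]
        SECOND_COMMAND_DICTIONARY = ["return", "stop", "correction"]
        PLACE_NAME_DICTIONARY = ["Tokyo Tower", "Tokyo Port"]

        # Which dictionaries to stand by after each recognized first-layer command.
        NEXT_STANDBY = {
            "name search": [SECOND_COMMAND_DICTIONARY, PLACE_NAME_DICTIONARY],
        }

        def next_standby_dictionaries(recognized_word):
            # After the final layer (or for an unknown word) the process returns
            # to the initial state, i.e. the first command dictionary.
            return NEXT_STANDBY.get(recognized_word, [FIRST_COMMAND_DICTIONARY])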
  • The speech recognition apparatus executes speech recognition of hierarchical commands as described above. Furthermore, as will be described later, the speech recognition apparatus according to the present invention changes the output of the recognition result in accordance with the elapsed time since the speech recognition in the previous layer. As a result, even when a misrecognition occurs in the middle of a series of hierarchical command recognition processes, the voice recognition device according to the present invention can cope without requiring the user to explicitly issue a command (in the above example, a command stored in the second command dictionary 13).
  • FIG. 2 is a schematic configuration diagram of the speech recognition apparatus 100 according to the present invention.
  • The voice recognition device 100 includes a voice analysis unit 1, a voice section detection unit 2, a first recognition unit 3, a second recognition unit 4, a recognition result processing unit 5, a presentation unit 6, a main control unit 7, a time measuring unit 8, and a storage unit 9.
  • The voice analysis unit 1 calculates acoustic feature amounts (that is, voice feature parameters) of the utterance data Sa under the control of the main control unit 7. Specifically, the voice analysis unit 1 performs A/D conversion on the utterance data Sa input from a microphone or the like (not shown), and calculates the acoustic feature amounts using, for example, well-known acoustic analysis methods or a combination thereof.
  • The voice section detection unit 2 uses the voice feature amounts supplied from the voice analysis unit 1 to cut out only the speech section (hereinafter referred to as “speech data”) from the utterance data Sa. Then, the voice section detection unit 2 supplies the voice feature amounts corresponding to the speech data to the first recognition unit 3 and the second recognition unit 4.
  • The first recognition unit 3 calculates the likelihoods of the recognized words stored in a predetermined standby dictionary using a discriminator such as an HMM (Hidden Markov Model), based on the voice feature amounts. Then, the first recognition unit 3 outputs a list of the recognized words arranged in descending order of similarity, together with their likelihoods.
  • Here, the first recognition unit 3 performs speech recognition based on the dictionary to be used when it is assumed that the previously input speech data was correctly recognized. That is, the first recognition unit 3 performs speech recognition of the layer following the one in which the previously input speech data was processed (or of the first layer when there is no next layer). Specifically, in the example of FIG. 1(b), when the previously input voice data was recognized in the first hierarchy, the first recognition unit 3 calculates the recognition result using the second command dictionary 13 and the place name dictionary 14; when the previously input speech data was recognized in the second hierarchy, it calculates the recognition result using the first command dictionary 12.
  • The first recognition unit 3 supplies the recognized words obtained as the recognition result (hereinafter referred to as “recognized words WR1”) and the corresponding logarithmized likelihoods (hereinafter referred to as “likelihoods L1”) to the recognition result processing unit 5.
  • Specifically, the first recognition unit 3 supplies a predetermined number (hereinafter referred to as “predetermined number N”) of the highest-ranked recognized words WR1 and their likelihoods L1.
  • The predetermined number N is set to an appropriate value through experiments or the like.
  • The second recognition unit 4 calculates the likelihoods of the recognized words stored in a predetermined standby dictionary different from that of the first recognition unit 3, using a classifier such as an HMM, based on the voice feature amounts. Then, the second recognition unit 4 outputs a list of the recognized words arranged in descending order of similarity, together with their likelihoods.
  • Here, the second recognition unit 4 performs voice recognition based on the standby dictionary used when the voice recognition of the previously input voice data was executed. That is, the second recognition unit 4 performs speech recognition in the same hierarchy as the one in which the previously input speech data Sa was processed; in other words, the second recognition unit 4 performs speech recognition in the hierarchy preceding that of the first recognition unit 3. More specifically, in the example of FIG. 1(b), when the previously input speech data Sa was recognized in the first hierarchy, the second recognition unit 4 again performs speech recognition in the first hierarchy using the first command dictionary 12, and when the previously input speech data Sa was recognized in the second hierarchy, it again performs speech recognition in the second hierarchy using the second command dictionary 13 and the place name dictionary 14.
  • The second recognition unit 4 supplies the recognized words obtained as the recognition result (hereinafter referred to as “recognized words WR2”) and the corresponding logarithmized likelihoods (hereinafter referred to as “likelihoods L2”) to the recognition result processing unit 5.
  • Specifically, the second recognition unit 4 supplies the predetermined number N of the highest-ranked recognized words WR2 and their likelihoods L2.
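  • (Illustrative sketch, not part of the publication.) For the two-layer example of FIG. 1(b), the differing dictionary choices of the two recognition units could be expressed as follows, where prev_layer is the hierarchy in which the previous utterance was recognized and standby_dictionaries maps each layer to its dictionaries; the names are hypothetical:

        def dictionaries_for_recognizers(prev_layer, standby_dictionaries):
            """Returns (dictionaries for the first recognition unit 3,
            dictionaries for the second recognition unit 4)."""
            last_layer = max(standby_dictionaries)
            # Unit 3 assumes the previous result was correct and moves on to the
            # next layer (wrapping to layer 1 after the last layer); unit 4
            # assumes a misrecognition and redoes the previous layer.
            next_layer = 1 if prev_layer == last_layer else prev_layer + 1
            return standby_dictionaries[next_layer], standby_dictionaries[prev_layer]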
  • The recognition result processing unit 5 performs predetermined weighting on the likelihoods L1 and L2 output from the first recognition unit 3 and the second recognition unit 4, based on the elapsed time width (hereinafter referred to as “elapsed time width T”) measured by the time measuring unit 8. Thereby, the recognition result processing unit 5 calculates a weighted likelihood from the likelihood L1 (hereinafter referred to as the “weighted likelihood Lw1”) and a weighted likelihood from the likelihood L2 (hereinafter referred to as the “weighted likelihood Lw2”). Then, the recognition result processing unit 5 sorts the weighted likelihoods Lw1 and Lw2 together, and supplies the recognized words WR1 and WR2 (hereinafter collectively referred to as “recognized words WR”) and the corresponding weighted likelihoods Lw1 and Lw2 (hereinafter collectively referred to as “weighted likelihoods Lw”) to the presentation unit 6.
  • The specific processing in the recognition result processing unit 5 will be described in detail in the next section (recognition processing method).
  • The presentation unit 6 presents the recognition result output by the recognition result processing unit 5 as the final recognition result.
  • For example, the presentation unit 6 is a display, a speaker, or the like, and outputs the recognition result output by the recognition result processing unit 5 as an image or sound.
  • The main control unit 7 includes a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), and the like (not shown), and performs various controls on each element of the speech recognition apparatus 100.
  • The time measuring unit 8 is a timer for measuring the elapsed time width T. Specifically, the time measuring unit 8 measures the elapsed time width T from the time when the presentation unit 6 output the recognition result for the previous utterance data Sa.
  • The storage unit 9 is a memory that stores the acoustic models used by the first recognition unit 3 and the second recognition unit 4, the standby dictionaries used for speech recognition in each layer, and other data such as the recognition results output by the recognition result processing unit 5. The data stored in the storage unit 9 is supplied to each element of the speech recognition apparatus 100 as necessary, under the control of the main control unit 7.
  • FIG. 3 is a functional block diagram specifically showing the processing contents of the recognition result processing unit 5.
  • The recognition result processing unit 5 includes multiplication units 5x and 5y, a sorting unit 5z, and a switch unit 5v.
  • The multiplication unit 5x multiplies the likelihood L1 supplied from the first recognition unit 3 by a predetermined weighting coefficient (hereinafter referred to as the “first weighting coefficient w1”), and supplies the weighted likelihood Lw1, the result of the multiplication, to the sorting unit 5z. That is, the multiplication unit 5x obtains the weighted likelihood Lw1 by the following formula (1).
  • Lw1 = L1 × w1 … Formula (1)
  • The first weighting coefficient w1 is a coefficient that varies according to the elapsed time width T. Specifically, the first weighting coefficient w1 is set so that the importance (weight) of the recognition result of the first recognition unit 3 increases as the elapsed time width T increases.
  • Concretely, the first weighting coefficient w1 decreases (that is, is non-increasing) as the elapsed time width T increases, so as to decrease the absolute value of the weighted likelihood Lw1.
  • This is because the speech recognition apparatus 100 regards the likelihoods L1 and L2 and the weighted likelihoods Lw1 and Lw2, which are negative values, as indicating higher similarity the smaller their absolute values are. Therefore, the multiplication unit 5x sets the weighted likelihood Lw1 so that the weight of the recognition result of the first recognition unit 3 increases as the first weighting coefficient w1 decreases. A method for setting the first weighting coefficient w1 will be described later.
  • The multiplication unit 5y multiplies the likelihood L2 supplied from the second recognition unit 4 by a predetermined weighting coefficient (hereinafter referred to as the “second weighting coefficient w2”), and supplies the weighted likelihood Lw2, the result of the multiplication, to the sorting unit 5z. That is, the multiplication unit 5y obtains the weighted likelihood Lw2 by the following formula (2).
  • Lw2 = L2 × w2 … Formula (2)
  • The second weighting coefficient w2 is a coefficient that varies according to the elapsed time width T. Specifically, the second weighting coefficient w2 is set so that the importance (weight) of the recognition result of the second recognition unit 4 decreases as the elapsed time width T increases.
  • As with the multiplication unit 5x, the multiplication unit 5y sets the weighted likelihood Lw2 so that the weight of the recognition result of the second recognition unit 4 increases as the second weighting coefficient w2 decreases.
  • A method for calculating the second weighting coefficient w2 will be described later.
  • The first weighting coefficient w1 and the second weighting coefficient w2 are each set to a value larger than 0 and smaller than 1, and their sum is set to 1. That is, the first weighting coefficient w1 and the second weighting coefficient w2 satisfy the following formulas (3) to (5).
  • w1 + w2 = 1 … Formula (3)
  • 0 < w1 < 1 … Formula (4)
  • 0 < w2 < 1 … Formula (5)
  • Thereby, the recognition result processing unit 5 can appropriately set one of the first weighting coefficient w1 and the second weighting coefficient w2 by obtaining the other. That is, as the elapsed time width T increases, the recognition result processing unit 5 can make the recognition result of the second recognition unit 4 relatively less important than the recognition result of the first recognition unit 3.
  • The first and second weighting coefficients w1 and w2 are expressed by the following formulas (6) and (7) using a predetermined weighting coefficient “w”.
  • w1 = w … Formula (6)
  • w2 = 1 − w … Formula (7)
  • From formulas (4) and (6), the weighting coefficient w is set to a value larger than 0 and smaller than 1. That is, the following formula (8) is satisfied.
  • 0 < w < 1 … Formula (8)
  • Next, a method for setting the weighting coefficient w, which determines the first weighting coefficient w1 and the second weighting coefficient w2, will be described in detail with reference to FIG. 4.
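  • (Illustrative sketch, not part of the publication.) One possible shape for the weighting function Fw, decreasing in the elapsed time width T and confined to 0 < w < 1 as required by formula (8); the constants and the linear shape are assumptions:

        def weighting_coefficient(elapsed_t, t1=10.0):
            """Fw: near 1 for small T (then w2 = 1 - w is small, so the second
            recognition unit dominates) and near 0 toward the threshold t1
            (then w1 = w is small, so the first recognition unit dominates)."""
            if elapsed_t <= 0.0:
                return 0.999
            if elapsed_t >= t1:
                return 0.001
            return max(0.001, min(0.999, 1.0 - elapsed_t / t1))  # linear decrease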
  • FIG. 4 is an example of a graph showing the relationship between the weighting coefficient w and the elapsed time width T.
  • The graphs “G1” and “G2” are examples of functions that determine the weighting coefficient w from the elapsed time width T (hereinafter referred to as the “weighting function Fw”). “T1” in FIG. 4 is a threshold for the elapsed time width T by which the switch unit 5v determines whether to use the recognition result of the first recognition unit 3 alone as the final recognition result, or to derive the final recognition result using the recognition result of the second recognition unit 4 in addition to that of the first recognition unit 3. This will be described in detail in the description of the switch unit 5v.
  • Whether represented by graph G1 or graph G2, the weighting function Fw is a decreasing function of the elapsed time width T.
  • The graph G1 is a decreasing function that falls with a constant slope up to a time width somewhat before the threshold T1 and is convex downward thereafter.
  • The graph G2 is a decreasing function that is convex downward up to the threshold T1.
  • Alternatively, the weighting function Fw for determining the weighting coefficient w may be a constant value regardless of the elapsed time width T.
  • The map or expression corresponding to the weighting function Fw is generated in advance based on experiments or the like, taking into account the user's usage environment, the number of recognized words, the performance of the speech recognition apparatus 100, and so on, and is stored in the storage unit 9.
  • The sorting unit 5z sorts the recognized words corresponding to the weighted likelihoods Lw1 and Lw2 in descending order of similarity, based on the weighted likelihood Lw1 supplied from the multiplication unit 5x and the weighted likelihood Lw2 supplied from the multiplication unit 5y. Then, the sorting unit 5z supplies the predetermined number N of recognized words WR and their weighted likelihoods Lw to the presentation unit 6.
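  • (Illustrative sketch, not part of the publication.) The merge performed by the sorting unit 5z could be expressed as below; the log likelihoods are negative, so sorting in descending order places the most similar word first:

        def merge_results(results1, results2, w, n):
            """results1, results2: lists of (word, log_likelihood) from the first
            and second recognition units; w: weighting coefficient (0 < w < 1);
            n: length of the output list (the predetermined number N)."""
            weighted = [(word, l * w) for word, l in results1]            # Lw1 = L1 x w1
            weighted += [(word, l * (1.0 - w)) for word, l in results2]   # Lw2 = L2 x w2
            weighted.sort(key=lambda item: item[1], reverse=True)         # less negative first
            return weighted[:n]

        # With w = 0.7 (heavier weight on the second recognition unit), a word
        # from the second unit can outrank a comparable first-unit word:
        print(merge_results([("route search", -120.0)], [("Tokyo Tower", -110.0)], 0.7, 2))
        # -> "Tokyo Tower" first (weighted about -33 vs. about -84)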
  • FIG. 5 shows a specific example of the recognition result list of the first recognition unit 3, the recognition result list of the second recognition unit 4, and the output of the sorting unit 5z, for the case where a word is erroneously recognized in the second-layer speech recognition in the example of FIG. 1(b) and the speaker utters the same word again (hereinafter simply referred to as a “recurrent utterance”).
  • Here, the utterer uttered “Tokyo Tower” after uttering “name search”, whereas the speech recognition apparatus 100 correctly recognized “name search” in the first-layer speech recognition but erroneously recognized “Tokyo Port” in the second-layer speech recognition. The speaker subsequently uttered “Tokyo Tower” again as a recurrent utterance.
  • For the recurrent speech data, the first recognition unit 3 performs voice recognition in the first hierarchy using the first command dictionary 12, while the second recognition unit 4 performs speech recognition in the second hierarchy using the second command dictionary 13 and the place name dictionary 14.
  • Here, the weighting coefficient w is set to “0.7” based on the elapsed time width T, that is, to a value that weights the recognition result of the second recognition unit 4 more heavily than the recognition result of the first recognition unit 3.
  • FIG. 5(a) is an example of a list showing the recognition result of the first recognition unit 3 for the recurrent utterance.
  • In this case, the first recognition unit 3 calculates the likelihoods L1 of the words stored in the first command dictionary 12 based on the acoustic feature amounts of the speech data of the recurrent utterance. Then, the first recognition unit 3 outputs the predetermined number N of recognized words WR1 and their likelihoods L1 in descending order of similarity based on the likelihoods L1.
  • FIG. 5(b) is an example of a list showing the recognition result of the second recognition unit 4 for the recurrent utterance.
  • FIG. 5(c) is an example of a list showing the output of the sorting unit 5z for the recurrent utterance.
  • As shown in FIG. 5(c), “Tokyo Tower”, recognized by the second recognition unit 4, is ranked first, and “Tokyo Port”, likewise recognized by the second recognition unit 4, is ranked second, while “route search” and “route erasure”, ranked first and second by the first recognition unit 3, are ranked lower than the words recognized by the second recognition unit 4.
  • This is because the sorting unit 5z calculates the weighted likelihoods Lw1 and Lw2 using the weighting coefficient w appropriately set based on the elapsed time width T, and sorts them together.
  • In this way, when a misrecognition occurs, the speech recognition apparatus 100 can correctly recognize the recurrent utterance with a minimum number of steps, without the speaker performing an explicit re-utterance operation and without the speaker having to re-enter the voice input from the first layer.
  • Preferably, the sorting unit 5z also moves the recognized word WR ranked second up to first place when the recognition result for the previously input utterance data Sa matches the recognized word in first place. By doing so, the sorting unit 5z can avoid recognizing the same recognized word again for a recurrent utterance and outputting the same recognized word as last time in first place.
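  • (Illustrative sketch, not part of the publication.) This promotion of the second-ranked word could be expressed as:

        def promote_if_repeated(ranked_words, previous_top):
            """ranked_words: recognized words in descending order of similarity;
            previous_top: the word output in first place for the previous utterance."""
            if len(ranked_words) >= 2 and ranked_words[0] == previous_top:
                # Re-presenting the same word would repeat the misrecognition,
                # so swap in the second-ranked candidate.
                ranked_words[0], ranked_words[1] = ranked_words[1], ranked_words[0]
            return ranked_words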
  • The switch unit 5v switches between supplying the output of the first recognition unit 3 to the presentation unit 6 as the final recognition result and supplying the output of the sorting unit 5z to the presentation unit 6 as the final recognition result. Specifically, when the elapsed time width T is less than the threshold T1, the switch unit 5v determines that a recurrent utterance is possible and supplies to the presentation unit 6 the output of the sorting unit 5z, in which the recognition result of the second recognition unit 4 is taken into account.
  • On the other hand, when the elapsed time width T is equal to or greater than the threshold T1, the switch unit 5v determines that a recurrent utterance is unlikely or impossible because considerable time has elapsed since the previous utterance, and supplies the recognition result of the first recognition unit 3 to the presentation unit 6.
  • In this case, the switch unit 5v can reduce unnecessary processing by outputting a final recognition result based only on the recognition result of the first recognition unit 3.
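  • (Illustrative sketch, not part of the publication.) The switch unit 5v then reduces to a threshold test on the elapsed time width T:

        def final_recognition_result(elapsed_t, t1, first_unit_result, merged_result):
            # Below the threshold T1 a recurrent utterance is plausible, so the
            # merged output of the sorting unit 5z is presented; otherwise only
            # the first recognition unit's result is used, skipping the extra work.
            return merged_result if elapsed_t < t1 else first_unit_result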
  • FIG. 6 is an example of a flowchart showing a processing procedure executed by the speech recognition apparatus 100 in the first embodiment.
  • The speech recognition apparatus 100 repeatedly executes the processing of the flowchart shown in FIG. 6.
  • First, the speech recognition apparatus 100 determines whether there is an input of utterance data Sa (step S101). When there is an input of utterance data Sa (step S101; Yes), the speech recognition apparatus 100 advances the process to step S102. On the other hand, when there is no input of utterance data Sa (step S101; No), the speech recognition apparatus 100 continues to monitor whether there is an input of utterance data Sa.
  • Next, the speech recognition apparatus 100 calculates the elapsed time width T (step S102). That is, the speech recognition apparatus 100 determines, for example, the elapsed time width T from when the recognition result was presented by the presentation unit 6 until the next utterance data Sa was input.
  • Next, the speech recognition apparatus 100 determines whether or not there was a previous utterance (step S103). In other words, the speech recognition apparatus 100 determines whether or not this is the first accepted speech input of a hierarchical command. If there was a previous utterance (step S103; Yes), the speech recognition apparatus 100 performs the process of step S104. On the other hand, if there was no previous utterance (step S103; No), the speech recognition apparatus 100 performs speech recognition with the first recognition unit 3 (step S111).
  • Next, the speech recognition apparatus 100 determines whether or not the elapsed time width T is smaller than the threshold T1 (step S104). If the elapsed time width T is smaller than the threshold T1 (step S104; Yes), the speech recognition apparatus 100 performs the processing from step S105 to step S110. That is, in this case, the speech recognition apparatus 100 determines that there may be a recurrent utterance caused by misrecognition, and outputs the final recognition result in consideration of the recognition result of the second recognition unit 4.
  • On the other hand, if the elapsed time width T is equal to or greater than the threshold T1 (step S104; No), the voice recognition device 100 performs voice recognition with the first recognition unit 3 (step S111).
  • That is, in this case, the speech recognition apparatus 100 determines that a recurrent utterance due to misrecognition is impossible or very unlikely.
  • In step S111, the first recognition unit 3 calculates the likelihoods L1 and outputs a list of the predetermined number N of recognized words WR1 and their likelihoods L1 in descending order of similarity.
  • In step S105, the speech recognition apparatus 100 performs speech recognition with both the first recognition unit 3 and the second recognition unit 4. Specifically, the first recognition unit 3 calculates the likelihoods L1 and outputs a list of the predetermined number N of recognized words WR1 and their likelihoods L1 in descending order of similarity, and the second recognition unit 4 calculates the likelihoods L2 and outputs a list of the predetermined number N of recognized words WR2 and their likelihoods L2 in descending order of similarity. At this time, the second recognition unit 4 performs speech recognition of the layer preceding that of the first recognition unit 3.
  • Next, the speech recognition apparatus 100 determines the first and second weighting coefficients w1 and w2 based on the elapsed time width T (step S106). For example, the speech recognition apparatus 100 refers to the graph G1 or the graph G2 stored in the storage unit 9 and determines the first and second weighting coefficients w1 and w2 from the elapsed time width T.
  • Next, the speech recognition apparatus 100 calculates the weighted likelihoods Lw1 and Lw2 (step S107). That is, the speech recognition apparatus 100 calculates the weighted likelihood Lw1 by multiplying the likelihood L1 by the first weighting coefficient w1, and calculates the weighted likelihood Lw2 by multiplying the likelihood L2 by the second weighting coefficient w2.
  • Next, the speech recognition apparatus 100 sorts the results based on the weighted likelihoods Lw1 and Lw2 (step S108).
  • Here, the weighted likelihoods Lw1 and Lw2 have been weighted according to the elapsed time width T through the first and second weighting coefficients w1 and w2.
  • Therefore, the speech recognition apparatus 100 can appropriately recognize the speech based on the elapsed time width T, even in the case of a recurrent utterance caused by misrecognition.
  • Next, the speech recognition apparatus 100 determines whether or not the recognized word ranked first after sorting is the same as the previous time (step S109).
  • For this purpose, the speech recognition apparatus 100 holds information such as the previously first-ranked recognized word in the storage unit 9. If the first place after sorting is the same as the previous time (step S109; Yes), the speech recognition apparatus 100 moves the second-ranked word up to first place (step S110). Thereby, the speech recognition apparatus 100 can avoid repeating the same misrecognition as the previous time.
  • The speech recognition apparatus 100 then advances the process to step S112.
  • Next, the speech recognition apparatus 100 presents the recognition result (step S112). For example, the speech recognition apparatus 100 displays the recognition results in a list as shown in FIG. 5(c), or outputs the recognized word WR determined to have the highest similarity as speech.
  • Next, the speech recognition apparatus 100 stores the recognition result and its attributes in the storage unit 9 (step S113).
  • Here, the “attributes” refer to information accompanying the recognition process, such as calculated values such as the weighted likelihoods Lw and/or which of the first recognition unit 3 and the second recognition unit 4 produced the recognition result output in first place.
  • Next, the speech recognition apparatus 100 copies the settings of the first recognition unit 3 to the second recognition unit 4 (step S114).
  • Here, the settings correspond to information such as which dictionary is to be used as the standby dictionary.
  • Then, the voice recognition device 100 starts measuring the elapsed time width T (step S115).
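  • (Illustrative sketch, not part of the publication.) The control flow of FIG. 6 (steps S101 to S115) can be condensed into the following Python loop; the recognizer objects and helper functions are hypothetical stand-ins for the units of the apparatus, injected as parameters:

        import time

        def recognition_loop(wait_for_utterance, first_rec, second_rec, t1, n,
                             weighting_fn, merge_results, promote_if_repeated, present):
            last_output_time = None
            previous_top = None
            while True:
                speech = wait_for_utterance()                          # step S101
                if last_output_time is None:                           # step S103: no previous utterance
                    results = first_rec.recognize(speech)              # step S111
                else:
                    elapsed = time.monotonic() - last_output_time      # step S102
                    if elapsed < t1:                                   # step S104
                        r1 = first_rec.recognize(speech)               # step S105
                        r2 = second_rec.recognize(speech)
                        w = weighting_fn(elapsed)                      # step S106
                        results = merge_results(r1, r2, w, n)          # steps S107-S108
                        results = promote_if_repeated(results, previous_top)  # steps S109-S110
                    else:
                        results = first_rec.recognize(speech)          # step S111
                present(results)                                       # step S112
                previous_top = results[0][0]                           # step S113: store the result
                second_rec.copy_settings_from(first_rec)               # step S114
                last_output_time = time.monotonic()                    # step S115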
  • As described above, the speech recognition apparatus recognizes hierarchical commands and includes the dictionary storage means, the first speech recognition means, the second speech recognition means, and the recognition result processing means.
  • The dictionary storage means stores the dictionaries used for speech recognition in each hierarchy for recognizing hierarchical commands.
  • When speech data is input, the first speech recognition means performs speech recognition based on the dictionary to be used when it is assumed that the previously input speech data was correctly recognized. That is, the first speech recognition means determines the dictionary to be used based on the previously input speech data, and performs speech recognition of the next input speech data.
  • When speech data is input, the second voice recognition means performs voice recognition based on the dictionary used when the voice recognition of the previously input voice data was executed.
  • That is, the second voice recognition means performs voice recognition of the next input voice data on the assumption that the voice recognition of the previously input voice data was a misrecognition.
  • The recognition result processing means weights the recognition result of the first voice recognition means and the recognition result of the second voice recognition means based on an elapsed time width calculated from the time of processing of the previously input voice data, and determines the final recognition result.
  • In this way, the speech recognition apparatus weights the recognition results of the two recognition processes based on the elapsed time width. As a result, even if a misrecognition occurs, the speech recognition apparatus can recognize a hierarchical command without an utterance or operation corresponding to a correction operation and without starting over from the beginning.
  • In the first embodiment described above, the speech recognition device 100 includes two speech recognition units, the first recognition unit 3 and the second recognition unit 4, with the second recognition unit 4 performing speech recognition of the layer preceding that of the first recognition unit 3.
  • However, the configuration of the speech recognition apparatus 100 to which the present invention is applicable is not limited to this. Instead, the speech recognition apparatus 100 may execute the above-described processing with only one speech recognition unit.
  • FIG. 7 is an example of a diagram illustrating a schematic configuration of the speech recognition apparatus 100a according to the first modification.
  • The speech recognition apparatus 100a differs from the configuration of the speech recognition apparatus 100 shown in FIG. 2 in that it includes a recognition unit 20 and a recognition result holding unit 21 in place of the first recognition unit 3 and the second recognition unit 4 of FIG. 2.
  • The recognition unit 20 sequentially executes both the processes of the first recognition unit 3 and the second recognition unit 4. Specifically, the recognition unit 20 first calculates the likelihoods L1 by executing the processing performed by the first recognition unit 3, and supplies the likelihoods L1 and the corresponding recognized words WR1 to the recognition result holding unit 21. Next, the recognition unit 20 calculates the likelihoods L2 by executing the processing performed by the second recognition unit 4, and supplies the likelihoods L2 and the corresponding recognized words WR2 to the recognition result processing unit 5.
  • After receiving the likelihoods L1 and the corresponding recognized words WR1 from the recognition unit 20, the recognition result holding unit 21 holds these recognition results until the recognition unit 20 has calculated the likelihoods L2, and then supplies them to the recognition result processing unit 5.
  • In this way, the voice recognition device 100a can execute the same processing as in the first embodiment even though it has only one voice recognition unit, the recognition unit 20. This also makes it possible to reduce the size of the speech recognition apparatus 100a.
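  • (Illustrative sketch, not part of the publication.) The sequential two-pass operation of the recognition unit 20 together with the recognition result holding unit 21 could be expressed as:

        def recognize_both_passes(recognize, speech, next_layer_dicts, same_layer_dicts):
            """recognize(speech, dicts) plays the role of the recognition unit 20."""
            # Pass 1 (first recognition unit's role): the result is simply held,
            # like the recognition result holding unit 21, until pass 2 finishes.
            held_result = recognize(speech, next_layer_dicts)
            # Pass 2 (second recognition unit's role).
            second_result = recognize(speech, same_layer_dicts)
            return held_result, second_result  # both go to the recognition result processing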
  • In the first embodiment, the time measuring unit 8 set the elapsed time width T from the time when the recognition result was output by the presentation unit 6 to the time when the next utterance data Sa was input.
  • However, the setting to which the present invention is applicable is not limited to this.
  • Instead, the time measuring unit 8 may set as the elapsed time width T the time width from the input of the previous utterance data Sa to the input of the next utterance data Sa, the time width from when the utterance button was pressed for the previous utterance data Sa to when the utterance button is pressed next, or a time width corresponding thereto.
  • Further, the time measuring unit 8 may reset the elapsed time width T as appropriate, for example when the engine of the vehicle is turned off in the case where the speech recognition apparatus 100 is mounted on a vehicle.
  • In the first embodiment, the speech recognition apparatus 100 set the threshold T1 to a time width at which the weighting coefficient w is near “0”.
  • However, the method to which the present invention is applicable is not limited to this. Instead, for example, the speech recognition apparatus 100 may set the threshold T1 to a time width at which the weighting coefficient w is near “0.5”.
  • Further, the threshold T1 may be set for each weighting function Fw used, or based on an input from the user. In this way too, the speech recognition apparatus 100 can recognize a recurrent utterance caused by misrecognition with a minimum number of steps while reducing unnecessary processing.
  • In the first embodiment, the speech recognition apparatus 100 includes the first recognition unit 3 and the second recognition unit 4, and the second recognition unit 4 performs speech recognition of the layer preceding that of the first recognition unit 3.
  • In addition to this, the speech recognition apparatus 100 may further include a third recognition unit that performs speech recognition of the layer preceding that of the second recognition unit 4.
  • In this case, the recognition result processing unit 5 refers to a predetermined formula or map, determines the weighting of the likelihood calculated by each recognition unit from the elapsed time width T, and sorts the results based on the weighted likelihoods.
  • The formula or map is created in advance based on experiments or the like and stored in the storage unit 9. The same applies when four or more recognition units are used.
  • In this way, the speech recognition apparatus 100 can also perform recognition processing over three or more layers.
  • In the first embodiment, the speech recognition apparatus 100 used the likelihood as the similarity.
  • However, the similarity index to which the present invention is applicable is not limited to this. Instead, the speech recognition apparatus 100 may use another index indicating similarity. Even in this case, the speech recognition apparatus 100 determines that the shorter the elapsed time width T, the higher the possibility of misrecognition, and increases the weight of the recognition result of the second recognition unit 4. In this way too, the speech recognition apparatus 100 can recognize a recurrent utterance caused by misrecognition with a minimum number of steps.
  • In the first embodiment, when the elapsed time width T was less than the threshold T1, the recognition result processing unit 5 determined the final recognition result based on both the recognition result of the first recognition unit 3 and the recognition result of the second recognition unit 4. Instead, the recognition result processing unit 5 may determine the final recognition result based only on the recognition result of the second recognition unit 4 when the elapsed time width T is less than the threshold T1. That is, in this case, the speech recognition apparatus 100 determines that a recurrent utterance is highly likely, performs only the speech recognition of the previous hierarchy, and determines the final ranking of the recognized words WR based only on the likelihoods L2. In this way too, the speech recognition apparatus 100 can recognize a recurrent utterance caused by misrecognition more reliably with minimal steps.
  • In the second embodiment, the speech recognition apparatus determines the weighting coefficient w based not only on the elapsed time width T but also on information indicating the environment in which speech recognition is performed (hereinafter referred to as “environment information Ic”). Thereby, the speech recognition apparatus can recognize a recurrent utterance more accurately in accordance with the environment in which speech recognition is executed.
  • FIG. 8 is an example of a diagram illustrating a schematic configuration of the speech recognition apparatus 100b according to the second embodiment.
  • The voice recognition device 100b is mounted on a vehicle.
  • The speech recognition apparatus 100b differs from the speech recognition apparatus 100 of the first embodiment in that it includes an ECU 23, a GPS receiver 24, and a route information acquisition unit 25.
  • Hereinafter, the vehicle on which the voice recognition device 100b is mounted is referred to as the “mounted vehicle”.
  • The ECU 23 includes a CPU, a ROM, a RAM, and the like (not shown), and performs various controls on each component of the mounted vehicle.
  • The ECU 23 is electrically connected to the main control unit 7 and transmits information indicating the state of the mounted vehicle (hereinafter referred to as “vehicle information”) to the main control unit 7.
  • The vehicle information includes, for example, information such as the ACC (accessory power supply: engine key) state, the vehicle speed pulse state, the window opening/closing state, the engine state, and the transmission state of the mounted vehicle.
  • The GPS receiver 24 receives radio waves carrying downlink data, including positioning data, from a plurality of GPS satellites.
  • The positioning data is used to detect the absolute position of the mounted vehicle from latitude and longitude information.
  • The GPS receiver 24 is electrically connected to the main control unit 7 and transmits the latitude and longitude information of the mounted vehicle to the main control unit 7.
  • The route information acquisition unit 25 acquires, from radio waves, information distributed from a VICS (Vehicle Information Communication System) center or the like (hereinafter referred to as "VICS information").
  • The route information acquisition unit 25 holds map information, and also holds route guidance information when the driver of the mounted vehicle has set a destination.
  • The route information acquisition unit 25 is electrically connected to the main control unit 7 and exchanges this information with the main control unit 7.
  • The main control unit 7 acquires, via a sensor (not shown), acoustic information such as the S/N ratio of the utterance data Sa input from a microphone (not shown), and supplies it to the recognition result processing unit 5 as environment information Ic. Further, the main control unit 7 acquires the vehicle information, the latitude/longitude information, and the route guidance information supplied from the ECU 23, the GPS receiver 24, and the route information acquisition unit 25, and supplies them to the recognition result processing unit 5 as environment information Ic. This will be explained in detail in the (Voice Recognition Method) section.
  • The recognition result processing unit 5 changes the slope of the weighting function Fw from the supplied environment information Ic under the control of the main control unit 7. Specifically, the recognition result processing unit 5 determines that the worse the speech recognition environment is, the higher the possibility of misrecognition and hence the higher the possibility of a restatement. Therefore, in this case, the recognition result processing unit 5 raises the importance of the recognition result of the second recognition unit 4 by making the slope of the weighting function Fw gentler. Conversely, the recognition result processing unit 5 determines that the better the speech recognition environment is, the lower the possibility of misrecognition. Therefore, in this case, the recognition result processing unit 5 raises the importance of the recognition result of the first recognition unit 3 by making the slope of the weighting function Fw steeper. Thereby, the recognition result processing unit 5 can set the weighting coefficient w appropriately according to the recognition environment.
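A minimal sketch of this slope adjustment, assuming for illustration a simple linear form for the weighting function Fw (the patent specifies Fw only through the example graphs G1 and G2 of FIG. 4; all names and constants here are hypothetical):

```python
def make_weighting_function(base_slope, env_quality):
    """Build Fw: elapsed time width T -> weighting coefficient w, 0 < w < 1.

    env_quality in [0.0, 1.0]: 1.0 = good recognition environment,
    0.0 = poor. A poorer environment yields a gentler slope, so w stays
    large longer; since w2 = 1 - w then stays small longer, the second
    recognition unit's negative log-likelihoods remain boosted longer.
    """
    slope = base_slope * (0.2 + 0.8 * env_quality)  # gentler when poor

    def fw(elapsed_t):
        w = 1.0 - slope * elapsed_t
        return min(max(w, 0.01), 0.99)  # keep w strictly inside (0, 1)

    return fw
```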
  • When the traveling speed of the mounted vehicle is high, the recognition result processing unit 5 makes the slope of the weighting function Fw gentler than when the vehicle speed is low. Specifically, the recognition result processing unit 5 may, for example, change the slope continuously with the traveling speed, or it may divide the traveling speed into several ranges in advance and determine the slope for each range. Thereby, the recognition result processing unit 5 can appropriately set the weighting coefficient w.
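For the range-based variant, a hypothetical lookup table might be used; the speed ranges and slope values below are purely illustrative, not from the patent:

```python
# Hypothetical table: (lower km/h, upper km/h, slope of Fw per second).
SPEED_RANGE_SLOPES = [
    (0, 30, 0.10),
    (30, 80, 0.06),
    (80, float("inf"), 0.03),  # higher speed -> gentler slope
]

def slope_for_speed(speed_kmh):
    """Look up the Fw slope for the current traveling speed range."""
    for lower, upper, slope in SPEED_RANGE_SLOPES:
        if lower <= speed_kmh < upper:
            return slope
    return SPEED_RANGE_SLOPES[-1][2]
```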
  • Based on the latitude/longitude information and the route guidance information, the recognition result processing unit 5 makes the slope of the weighting function Fw gentler when the traveling route passes through a point that requires attention in driving. Specifically, the recognition result processing unit 5 decides the slope according to, for example, whether the route being traveled is in an urban area, whether it is an accident-prone point, and whether it includes an intersection where the mounted vehicle turns left or right.
  • That is, the recognition result processing unit 5 determines that misrecognition is highly likely when the travel route corresponds to an urban area, an accident-prone point, and/or a left/right turn point, and makes the slope of the weighting function Fw gentler.
  • In this way, the recognition result processing unit 5 can appropriately set the weighting coefficient w by using the latitude/longitude information and the route guidance information.
  • The voice recognition device 100b acquires the environment information Ic from the ECU 23 and the like after executing step S105. The voice recognition device 100b then determines the weighting function Fw based on the environment information Ic. For example, the speech recognition apparatus 100b prepares in advance a plurality of weighting functions Fw corresponding to the environment information Ic, and determines the weighting function Fw to be used by referring to a predetermined map or the like. Such a map is created in advance by experiments or the like. In another example, the speech recognition apparatus 100b changes a parameter that determines the slope of the weighting function Fw with reference to a map or the like according to the environment information Ic.
  • In step S106, the speech recognition apparatus 100b determines the first and second weighting coefficients w1 and w2 from the weighting function Fw based on the elapsed time width T. Thereby, the speech recognition apparatus 100b can recognize a restated utterance more accurately in consideration of the environment of speech recognition.
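The map-based selection of a prepared weighting function could look like the following sketch, reusing make_weighting_function from the earlier sketch; the environment classes and S/N thresholds are hypothetical stand-ins for the experimentally created map:

```python
# Hypothetical map from an environment class to a pre-built Fw; in the
# embodiment each entry would be tuned in advance by experiment.
FW_CANDIDATES = {
    "quiet":  make_weighting_function(base_slope=0.10, env_quality=1.0),
    "normal": make_weighting_function(base_slope=0.10, env_quality=0.6),
    "noisy":  make_weighting_function(base_slope=0.10, env_quality=0.2),
}

def choose_fw(environment_info_ic):
    """Select the weighting function Fw from the environment info Ic."""
    snr_db = environment_info_ic["snr_db"]
    key = "quiet" if snr_db >= 20 else "normal" if snr_db >= 10 else "noisy"
    return FW_CANDIDATES[key]
```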
  • In the above description, when the recognition result processing unit 5 determines that the recognition environment is poor, it makes the slope of the weighting function Fw gentler. Instead of this, or in addition to this, the recognition result processing unit 5 may, for example, temporarily stop the measurement of the elapsed time width T when it determines that the recognition environment is poor. In another example, when the recognition result processing unit 5 determines that the recognition environment is poor, it may reduce the elapsed time width T by dividing it by, or subtracting from it, a predetermined value. By these as well, the speech recognition apparatus 100b can appropriately execute speech recognition processing according to the recognition environment.
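The alternative of shrinking the elapsed time width itself admits an equally small sketch; the divisor is an arbitrary illustrative value:

```python
def effective_elapsed_time(raw_t, env_is_poor, divisor=2.0):
    """When the recognition environment is judged poor, reduce the
    elapsed time width T so that a late restatement is still treated
    like an early one (instead of, or in addition to, changing the
    slope of the weighting function Fw)."""
    return raw_t / divisor if env_is_poor else raw_t
```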
  • As a modification of the second embodiment, the speech recognition apparatus 100b may also apply (Modification 1) to (Modification 6) of the first embodiment. That is, the speech recognition apparatus 100b performs one or more processes selected from (Modification 1) to (Modification 6) of the first embodiment in addition to the processes of the second embodiment.
  • The present invention can be applied to various devices that perform voice recognition processing, for example, devices having a voice input function such as a car navigation device, a mobile phone, a personal computer, an AV device, and a home appliance.

Abstract

A voice recognition device voice-recognizes a hierarchical command and comprises a dictionary storage means, a first voice recognition means, a second voice recognition means, and a recognition result processing means. The dictionary storage means stores dictionaries used for voice recognition in each hierarchy for recognizing the hierarchical command. When voice data is input, the first voice recognition means performs voice recognition based on the dictionary to be used on the assumption that the previously input voice data was correctly recognized. When voice data is input, the second voice recognition means performs voice recognition based on the dictionary used when the previously input voice data was voice-recognized. The recognition result processing means weights the recognition result of the first voice recognition means and the recognition result of the second voice recognition means based on the elapsed time width measured from the time when the previously input voice data was processed, and then determines the final recognition result.

Description

Speech recognition apparatus, speech recognition method, and speech recognition program
The present invention relates to a speech recognition technology for recognizing hierarchical utterance commands.
Conventionally, in speech recognition apparatuses that recognize hierarchical utterance commands, techniques are well known that deal with misrecognition and erroneous operations by accepting an explicit return operation or initialization operation from the speaker. For example, Patent Document 1 discloses a technique in which, when a further voice input is received within a predetermined time width after a voice input has been accepted, the new input is recognized as a restatement of the previously input voice. In addition, Patent Document 2 discloses a technique related to the present invention.
Patent Document 1: JP-A-8-190398
Patent Document 2: JP 2001-063489 A
When misrecognition or an erroneous operation is dealt with by accepting an explicit return operation or initialization operation from the speaker, a large amount of time is wasted on that handling. Furthermore, when such an explicit operation is required, the speaker finds it troublesome.
The present invention has been made to solve the above problems, and an object of the present invention is to provide a voice recognition device capable of recognizing hierarchical commands, even when misrecognition or the like has occurred, without requiring an utterance or operation corresponding to a correction action and without starting over from the beginning.
The invention according to claim 1 is a voice recognition device that voice-recognizes hierarchical commands, comprising: dictionary storage means for storing dictionaries used for voice recognition in each hierarchy for recognizing the hierarchical commands; first voice recognition means for, when voice data is input, performing voice recognition based on the dictionary to be used on the assumption that the previously input voice data was correctly recognized; second voice recognition means for, when the voice data is input, performing voice recognition based on the dictionary used when voice recognition of the previously input voice data was executed; and recognition result processing means for determining a final recognition result by weighting the recognition result of the first voice recognition means and the recognition result of the second voice recognition means based on an elapsed time width measured from the time of processing of the previously input voice data.
The invention according to claim 8 is a voice recognition method used by a voice recognition device that stores dictionaries used for voice recognition in each hierarchy for recognizing hierarchical commands, comprising: a first voice recognition step of, when voice data is input, performing voice recognition based on the dictionary to be used on the assumption that the previously input voice data was correctly recognized; a second voice recognition step of, when the voice data is input, performing voice recognition based on the dictionary used when voice recognition of the previously input voice data was executed; and a recognition result processing step of determining a final recognition result by weighting the recognition result of the first voice recognition step and the recognition result of the second voice recognition step based on an elapsed time width measured from the time of processing of the previously input voice data.
The invention according to claim 9 is a voice recognition program executed by a voice recognition device that stores dictionaries used for voice recognition in each hierarchy for recognizing hierarchical commands, the program causing the voice recognition device to function as: first voice recognition means for, when voice data is input, performing voice recognition based on the dictionary to be used on the assumption that the previously input voice data was correctly recognized; second voice recognition means for, when the voice data is input, performing voice recognition based on the dictionary used when voice recognition of the previously input voice data was executed; and recognition result processing means for determining a final recognition result by weighting the recognition result of the first voice recognition means and the recognition result of the second voice recognition means based on an elapsed time width measured from the time of processing of the previously input voice data.
FIG. 1 is a diagram showing an example of the dictionary configuration used when voice-recognizing non-hierarchical commands and hierarchical commands.
FIG. 2 is a diagram showing an example of the schematic configuration of the speech recognition apparatus according to the first embodiment.
FIG. 3 is a diagram showing an example of the functional blocks of the recognition result processing unit according to the first embodiment.
FIG. 4 is a diagram showing examples of the weighting function Fw.
FIG. 5 is a diagram showing an example of the recognition result of the first recognition unit, the recognition result of the second recognition unit, and the final recognition result.
FIG. 6 is an example of a flowchart showing the processing procedure of the first embodiment.
FIG. 7 is an example of a flowchart showing the schematic configuration of Modification 1 of the first embodiment.
FIG. 8 is a diagram showing an example of the schematic configuration of the speech recognition apparatus according to the second embodiment.
In one aspect of the present invention, there is provided a voice recognition device that voice-recognizes hierarchical commands, comprising: dictionary storage means for storing dictionaries used for voice recognition in each hierarchy for recognizing the hierarchical commands; first voice recognition means for, when voice data is input, performing voice recognition based on the dictionary to be used on the assumption that the previously input voice data was correctly recognized; second voice recognition means for, when the voice data is input, performing voice recognition based on the dictionary used when voice recognition of the previously input voice data was executed; and recognition result processing means for determining a final recognition result by weighting the recognition result of the first voice recognition means and the recognition result of the second voice recognition means based on an elapsed time width measured from the time of processing of the previously input voice data.
The above voice recognition device voice-recognizes hierarchical commands and comprises dictionary storage means, first voice recognition means, second voice recognition means, and recognition result processing means. Here, a "hierarchical command" refers to a command that realizes one operation command or output command to a device to be controlled through two or more utterances. That is, in the case of a hierarchical command, one operation command or output command is gradually specified hierarchically based on the voice-recognized commands. The dictionary storage means stores the dictionaries used for voice recognition in each hierarchy for recognizing the hierarchical commands. Here, the "dictionaries used for voice recognition in each hierarchy" refer to the dictionaries used in each voice recognition process necessary to realize any one operation command or output command, where "voice recognition process" refers to the entire processing necessary to recognize one piece of voice data. Each dictionary stores a plurality of recognition words to be pattern-matched against the input voice data. When voice data is input, the first voice recognition means performs voice recognition based on the dictionary to be used on the assumption that the previously input voice data was correctly recognized. That is, the first voice recognition means determines the dictionary to be used based on the immediately preceding voice data and performs voice recognition of the newly input voice data. When voice data is input, the second voice recognition means performs voice recognition based on the dictionary used when voice recognition of the previously input voice data was executed. That is, the second voice recognition means performs voice recognition of the newly input voice data on the assumption that the voice recognition of the immediately preceding voice data was a misrecognition. The recognition result processing means determines the final recognition result by weighting the recognition result of the first voice recognition means and the recognition result of the second voice recognition means based on the elapsed time width measured from the time of processing of the previously input voice data. Here, the "recognition result" refers to the similarity between each recognition word stored in the used dictionary and the newly input voice data. The "elapsed time width" refers to the time width from the start or end of processing of the previously input voice data to the start of processing of the newly input voice data, or a time width corresponding thereto. The "final recognition result" refers to the recognition result finally output by the voice recognition device, for example, a predetermined number of recognition words arranged in descending order of similarity, together with their similarities.
In general, when the elapsed time width until a re-utterance is short, it is highly likely that the new utterance is a restatement of the previously input voice. Therefore, the voice recognition device weights the recognition results of the two recognition processes based on the elapsed time width. As a result, even when misrecognition has occurred, the voice recognition device can recognize the restated utterance with a minimum number of steps, without an utterance or operation corresponding to a correction action and without starting over from the beginning.
In one aspect of the above voice recognition device, the recognition result processing means determines the final recognition result based only on the recognition result of the first voice recognition means when the elapsed time width is equal to or greater than a predetermined width. This predetermined width is set to an appropriate value based on experiments or the like. Specifically, it is set to an elapsed time width at which the newly input voice data is judged to have no, or an extremely low, possibility of being a restatement of the previously input voice data. In this aspect, the voice recognition device can thus eliminate unnecessary processing when a restatement is unlikely or impossible.
In another aspect of the above voice recognition device, the recognition result processing means determines the final recognition result based only on the recognition result of the second voice recognition means when the elapsed time width is less than a predetermined width. This predetermined width is set to an appropriate value based on experiments or the like. Specifically, it is set to an elapsed time width at which the newly input voice data is judged to be extremely likely to be a restatement of the previously input voice data. In this aspect, the voice recognition device can thus eliminate unnecessary processing when a restatement is highly likely.
In another aspect of the above voice recognition device, the first voice recognition means and the second voice recognition means calculate likelihoods as their recognition results, and the recognition result processing means performs the weighting by multiplying the likelihood calculated by the first voice recognition means by a first parameter that is a non-increasing function of the elapsed time width, and multiplying the likelihood calculated by the second voice recognition means by a second parameter that is a non-decreasing function of the elapsed time width. Here, "likelihood" includes log likelihood. In this aspect, by multiplying the likelihood calculated by the first voice recognition means by the first parameter, which is non-increasing with respect to the elapsed time width, the voice recognition device gradually raises the weight of the recognition result of the first voice recognition means, that is, its importance to the final recognition result. Likewise, by multiplying the likelihood calculated by the second voice recognition means by the second parameter, which is non-decreasing with respect to the elapsed time width, the voice recognition device gradually lowers the weight of the recognition result of the second voice recognition means. In this way, even when misrecognition has occurred, the voice recognition device can recognize the restated utterance with a minimum number of steps, without an utterance or operation corresponding to a correction action and without starting over from the beginning.
In another aspect of the above voice recognition device, the recognition result processing means keeps the sum of the first parameter and the second parameter at a constant value, for example 1. In this way, the voice recognition device can easily determine one parameter from the other and set both parameters to appropriate values.
In another aspect, the above voice recognition device further comprises environment information acquisition means for acquiring information on the recognition environment, and when the recognition result processing means determines from this information that the recognition environment is poor, it increases the weight of the recognition result of the second voice recognition means compared with the case where the recognition environment is not poor. The "recognition environment" refers to the environment in which the voice recognition device performs voice recognition, insofar as it affects the accuracy of that voice recognition. In general, when the recognition environment is poor, misrecognition is likely to occur, and in that case the speaker is likely to restate. Therefore, when the voice recognition device determines that the recognition environment is poor, it can perform voice recognition more accurately by making the weight of the recognition result of the second voice recognition means larger than usual.
In another aspect, the above voice recognition device is mounted on a vehicle, the environment information acquisition means acquires the vehicle speed of the vehicle and/or the degree of noise included in the voice data, and the recognition result processing means increases the weighting of the recognition result of the second voice recognition means as the vehicle speed and/or the degree of noise increases. Here, the "degree of noise" corresponds to, for example, the S/N ratio. In general, when the vehicle speed is high, the driver concentrates on driving, so the response to the voice recognition device tends to be slower and the elapsed time tends to be longer. In addition, when the vehicle speed is high, the noise accompanying traveling is expected to be larger. Therefore, in this aspect, the voice recognition device determines that the higher the vehicle speed and/or the degree of noise, the higher the probability of misrecognition and the higher the possibility of a re-utterance. Accordingly, in this aspect, the voice recognition device can perform voice recognition more accurately.
In another aspect of the present invention, there is provided a voice recognition method used by a voice recognition device that stores dictionaries used for voice recognition in each hierarchy for recognizing hierarchical commands, the method comprising: a first voice recognition step of, when voice data is input, performing voice recognition based on the dictionary to be used on the assumption that the previously input voice data was correctly recognized; a second voice recognition step of, when the voice data is input, performing voice recognition based on the dictionary used when voice recognition of the previously input voice data was executed; and a recognition result processing step of determining a final recognition result by weighting the recognition result of the first voice recognition step and the recognition result of the second voice recognition step based on an elapsed time width measured from the time of processing of the previously input voice data. By using this method, even when misrecognition has occurred, the voice recognition device can recognize the restated utterance with a minimum number of steps, without an utterance or operation corresponding to a correction action and without starting over from the beginning.
In still another aspect of the present invention, there is provided a voice recognition program executed by a voice recognition device that stores dictionaries used for voice recognition in each hierarchy for recognizing hierarchical commands, the program causing the voice recognition device to function as: first voice recognition means for, when voice data is input, performing voice recognition based on the dictionary to be used on the assumption that the previously input voice data was correctly recognized; second voice recognition means for, when the voice data is input, performing voice recognition based on the dictionary used when voice recognition of the previously input voice data was executed; and recognition result processing means for determining a final recognition result by weighting the recognition result of the first voice recognition means and the recognition result of the second voice recognition means based on an elapsed time width measured from the time of processing of the previously input voice data. By implementing this program, even when misrecognition has occurred, the voice recognition device can recognize the restated utterance with a minimum number of steps, without an utterance or operation corresponding to a correction action and without starting over from the beginning. In a preferred example, the program is recorded on a storage medium.
Hereinafter, preferred embodiments of the present invention will be described with reference to the drawings. In the following, a basic description of the dictionary configuration related to the speech recognition method of the present invention is first given in the [Basic Description] section, and then embodiments to which the present invention is applied are described in the [First Embodiment] and [Second Embodiment] sections.
[Basic Description]
Prior to the description of specific embodiments, a basic description is given of the dictionary configuration related to the speech recognition method of the present invention. FIG. 1 is a conceptual diagram showing two types of dictionary configurations used for speech recognition. Specifically, FIG. 1(a) shows an example of the dictionary configuration for recognizing non-hierarchical commands, given as a comparative example, and FIG. 1(b) shows an example of the dictionary configuration for recognizing hierarchical commands, which are the subject of the present invention. The speech recognition apparatus of the present invention recognizes hierarchical commands, as will be described later. A "non-hierarchical command" refers to an utterance command that realizes one operation command or output command to the device to be controlled with a single utterance, and a "hierarchical command" refers to an utterance command that realizes one operation command or output command to the device with two or more utterances. A "dictionary" refers to a database storing the words to be recognized (hereinafter referred to as "recognition words").
First, the case of recognizing non-hierarchical commands will be described with reference to FIG. 1(a). The speech recognition apparatus performs a predetermined analysis on utterance data input through a microphone or the like (hereinafter referred to as "utterance data Sa") and then outputs a recognition result based on the command dictionary 11. Here, the "recognition result" is, for example, the recognition word with the highest similarity to the utterance data Sa, or a list of recognition words arranged in descending order of similarity to the utterance data Sa together with their similarities. The "utterance data Sa" refers to an input signal containing speech. For example, when the speech recognition apparatus is implemented in a car navigation device, the utterance data Sa is the input signal recorded from the microphone during a fixed period after the user presses the speech button.
Next, the case of recognizing hierarchical commands will be described with reference to FIG. 1(b). In the example of FIG. 1(b), the hierarchical command consists of two hierarchies. First, the speech recognition apparatus recognizes the utterance data Sa input first in a series of hierarchical command recognition processes, using the first command dictionary 12. This speech recognition is hereinafter referred to as "speech recognition in the first hierarchy". That is, speech recognition in the first hierarchy refers to the process, within a series of processes for recognizing a hierarchical command, of recognizing the utterance data Sa using the dictionary for the initial state, i.e., the state in which no utterance command has been accepted. Thus, when recognizing hierarchical commands, the speech recognition apparatus changes the dictionary used (hereinafter also referred to as the "standby dictionary") for each processing state (hierarchy). Assume now that, based on the first command dictionary 12, the speech recognition apparatus recognizes "name search" in the speech recognition in the first hierarchy, that is, it judges "name search" to have the highest similarity among the recognition words stored in the first command dictionary 12.
Next, based on the recognized "name search" command, the speech recognition apparatus performs speech recognition using the second command dictionary 13 and the place name dictionary 14 as the next standby dictionaries. Here, the second command dictionary 13 contains the utterance commands, such as "return", "cancel", and "correct", that the speaker should input when explicitly wishing to return the processing to the previous state (i.e., the speech recognition in the first hierarchy) or to redo the series of hierarchical command recognition processes from the beginning. The place name dictionary 14 contains the place names subject to the "name search" recognized in the first hierarchy. Then, based on the next input utterance data Sa (here, assume "Tokyo Tower" is uttered), the speech recognition apparatus calculates the similarity of the recognition words stored in the second command dictionary 13 and the place name dictionary 14. This speech recognition is hereinafter referred to as "speech recognition in the second hierarchy". That is, speech recognition in the second hierarchy refers to the process of recognizing the utterance data Sa based on the result of the speech recognition in the first hierarchy. Speech recognition in the third and subsequent hierarchies is defined in the same manner.
The speech recognition apparatus then outputs the result of the speech recognition in the second hierarchy as the final recognition result. Thereafter, for new utterance data Sa, the speech recognition apparatus again performs speech recognition in the first hierarchy using the first command dictionary 12.
In the above example the hierarchical command has two hierarchies, but the same applies to hierarchical commands with three or more hierarchies: in the speech recognition in each hierarchy, the speech recognition apparatus selects the standby dictionary based on the recognition word recognized in the previous hierarchy and performs speech recognition.
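The standby-dictionary switching described above can be pictured with a minimal Python sketch modeled on FIG. 1(b); the dictionary contents beyond "name search" and "Tokyo Tower" are hypothetical:

```python
# Standby dictionaries per hierarchy, modeled on FIG. 1(b).
FIRST_COMMAND_DICT = ["name search", "address search", "play music"]
SECOND_COMMAND_DICT = ["return", "cancel", "correct"]
PLACE_NAME_DICT = ["Tokyo Tower", "Kyoto Station"]

def standby_dictionaries(previous_recognized):
    """Select the standby dictionaries for the next utterance from the
    word recognized in the previous hierarchy (None = initial state)."""
    if previous_recognized is None:
        return [FIRST_COMMAND_DICT]                    # first hierarchy
    if previous_recognized == "name search":
        return [SECOND_COMMAND_DICT, PLACE_NAME_DICT]  # second hierarchy
    return [FIRST_COMMAND_DICT]         # command completed: back to start
```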
In the present invention, the speech recognition apparatus executes speech recognition of hierarchical commands as described above. Furthermore, as explained below, the speech recognition apparatus according to the present invention changes the output of the recognition result according to the elapsed time since the speech recognition in the previous hierarchy. As a result, even when misrecognition occurs in the middle of a series of hierarchical command recognition processes, the speech recognition apparatus according to the present invention can recognize the restated utterance with a minimum number of steps, without the user explicitly inputting a command (corresponding, in the above example, to a command stored in the second command dictionary 13).
[First Embodiment]
First, a first embodiment according to the present invention will be described. In the following, after the schematic configuration of the speech recognition apparatus is described, the speech recognition method and its processing flow will be explained. Thereafter, modifications of the first embodiment will be described.
(Schematic Configuration)
First, the schematic configuration of the speech recognition apparatus according to the first embodiment will be described with reference to FIG. 2.
FIG. 2 is a schematic configuration diagram of the speech recognition apparatus 100 according to the present invention. The speech recognition apparatus 100 includes a speech analysis unit 1, a speech section detection unit 2, a first recognition unit 3, a second recognition unit 4, a recognition result processing unit 5, a presentation unit 6, a main control unit 7, a time measurement unit 8, and a storage unit 9. The broken lines indicate the flow of control signals, and the solid lines indicate the flow of data.
Under the control of the main control unit 7, the speech analysis unit 1 calculates the acoustic feature quantity (i.e., the speech feature parameters) of the utterance data Sa. Specifically, the speech analysis unit 1 A/D-converts the utterance data Sa input from a microphone or the like (not shown) and calculates the acoustic feature quantity using, for example, a well-known acoustic analysis method or a combination of such methods.
Under the control of the main control unit 7, the speech section detection unit 2 uses the acoustic feature quantity supplied from the speech analysis unit 1 to cut out only the utterance section (hereinafter referred to as "speech data") from the utterance data Sa. The speech section detection unit 2 then supplies the acoustic feature quantity corresponding to the speech data to the first recognition unit 3 and the second recognition unit 4.
Based on the acoustic feature quantity, the first recognition unit 3 calculates the likelihood of the recognition words stored in a predetermined standby dictionary using a classifier such as an HMM (Hidden Markov Model). The first recognition unit 3 then outputs a list of recognition words and their likelihoods arranged in descending order of similarity.
Specifically, the first recognition unit 3 performs speech recognition based on the dictionary to be used on the assumption that the previously input speech data was correctly recognized. That is, the first recognition unit 3 performs speech recognition in the hierarchy following the one in which the previously input speech data was processed (or in the first hierarchy when there is no next hierarchy). In the example of FIG. 1(b), when the previously input speech data was recognized in the first hierarchy, the first recognition unit 3 calculates the recognition result using the second command dictionary 13 and the place name dictionary 14; when the previously input speech data was recognized in the second hierarchy, it calculates the recognition result using the first command dictionary 12. The first recognition unit 3 then supplies the recognition words constituting its recognition result (hereinafter referred to as "recognition words WR1") and the corresponding logarithmized likelihoods (hereinafter referred to as "likelihoods L1") to the recognition result processing unit 5. Here, the first recognition unit 3 supplies a predetermined number (hereinafter referred to as the "predetermined number N") of top-ranked recognition words WR1 and their likelihoods L1. The predetermined number N is set to an appropriate value through experiments or the like.
Based on the acoustic feature quantity, the second recognition unit 4 calculates the likelihood of the recognition words stored in a predetermined standby dictionary different from that of the first recognition unit 3, using a classifier such as an HMM. The second recognition unit 4 then outputs a list of recognition words and their likelihoods arranged in descending order of similarity.
Specifically, the second recognition unit 4 performs speech recognition based on the standby dictionary used when speech recognition of the previously input speech data was executed. That is, the second recognition unit 4 performs speech recognition in the same hierarchy as the one in which the previously input utterance data Sa was processed; in other words, it performs the speech recognition of the hierarchy preceding that of the first recognition unit 3. In the example of FIG. 1(b), when the previously input utterance data Sa was recognized in the first hierarchy, the second recognition unit 4 again performs speech recognition in the first hierarchy using the first command dictionary 12; when the previously input utterance data Sa was recognized in the second hierarchy, it again performs speech recognition in the second hierarchy using the second command dictionary 13 and the place name dictionary 14. The second recognition unit 4 then supplies the recognition words constituting its recognition result (hereinafter referred to as "recognition words WR2") and the corresponding log likelihoods (hereinafter referred to as "likelihoods L2") to the recognition result processing unit 5. Here, the second recognition unit 4 supplies the predetermined number N of top-ranked recognition words WR2 and their likelihoods L2.
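The contrast between the two recognition units' dictionary choices can be summarized in a short sketch, reusing standby_dictionaries from the sketch in the [Basic Description] section (again with hypothetical names):

```python
def dictionaries_for_recognizers(previous_word, previous_dicts):
    """Return (dictionaries for the first unit, for the second unit).

    The first recognition unit assumes previous_word was correctly
    recognized and moves on to the next hierarchy; the second unit
    assumes a misrecognition and reuses previous_dicts, the standby
    dictionaries of the hierarchy in which previous_word was processed.
    """
    first_unit_dicts = standby_dictionaries(previous_word)  # next hierarchy
    second_unit_dicts = previous_dicts                      # same hierarchy
    return first_unit_dicts, second_unit_dicts
```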
The recognition result processing unit 5 applies a predetermined weighting to the likelihoods L1 and L2 output by the first recognition unit 3 and the second recognition unit 4, based on the elapsed time width measured from the output of the previous recognition result (hereinafter referred to as the "elapsed time width T"). The recognition result processing unit 5 thereby calculates the weighted likelihood L1 (hereinafter referred to as the "weighted likelihood Lw1") and the weighted likelihood L2 (hereinafter referred to as the "weighted likelihood Lw2"). The recognition result processing unit 5 then sorts the weighted likelihoods Lw1 and Lw2 together and supplies, in descending order of similarity, the recognition words WR1 and WR2 (hereinafter collectively referred to as the "recognition words WR") and the corresponding weighted likelihoods Lw1 and Lw2 (hereinafter collectively referred to as the "weighted likelihoods Lw") to the presentation unit 6. The specific processing of the recognition result processing unit 5 will be described in detail in the next section, (Recognition Processing Method).
The presentation unit 6 presents the recognition result output by the recognition result processing unit 5 as the final recognition result. Specifically, the presentation unit 6 is a display, a speaker, or the like, and outputs the recognition result output by the recognition result processing unit 5 as an image or sound.
The main control unit 7 includes a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), and the like (not shown), and performs various kinds of control on each element of the speech recognition apparatus 100.
The time measurement unit 8 is a timer for measuring the elapsed time width T. Specifically, the time measurement unit 8 measures the elapsed time width T starting from the time when the presentation unit 6 output the recognition result of the immediately preceding utterance data Sa.
The storage unit 9 is a memory that stores the acoustic models used by the first recognition unit 3, the second recognition unit 4, and the like, the standby dictionaries used for speech recognition in each hierarchy, and other data such as the recognition results output by the recognition result processing unit 5. The data stored in the storage unit 9 is supplied to each element of the speech recognition apparatus 100 as necessary, under the control of the main control unit 7.
(Recognition Processing Method)
Next, the processing executed by the recognition result processing unit 5 will be described in detail with reference to FIGS. 3 to 5. FIG. 3 is a functional block diagram specifically showing the processing contents of the recognition result processing unit 5. As shown in FIG. 3, the recognition result processing unit 5 includes multiplication units 5x and 5y, a sorting unit 5z, and a switch unit 5v.
The multiplication unit 5x multiplies the likelihood L1 supplied from the first recognition unit 3 by a predetermined weighting coefficient (hereinafter referred to as the "first weighting coefficient w1") and supplies the resulting weighted likelihood Lw1 to the sorting unit 5z. That is, the multiplication unit 5x obtains the weighted likelihood Lw1 by the following equation (1):

       Lw1 = L1 × w1   (1)

Here, the first weighting coefficient w1 is a coefficient that varies with the elapsed time width T. Specifically, the first weighting coefficient w1 is set so as to raise the importance (weight) of the recognition result of the first recognition unit 3 as the elapsed time width T becomes larger. That is, with the elapsed time width T as a variable and the likelihood L1 as a constant, the first weighting coefficient w1 is a coefficient that decreases (i.e., is non-increasing) with increasing elapsed time width T so as to reduce the absolute value of the weighted likelihood Lw1. Note that the speech recognition apparatus 100 regards the likelihoods L1 and L2 and the weighted likelihoods Lw1 and Lw2 as indicating higher similarity the smaller their negative values, that is, the smaller their absolute values. Therefore, the smaller the first weighting coefficient w1, the more the multiplication unit 5x sets the weighted likelihood Lw1 so as to raise the weight of the recognition result of the first recognition unit 3. The method of setting the first weighting coefficient w1 will be described later.
Similarly, the multiplication unit 5y multiplies the likelihood L2 supplied from the second recognition unit 4 by a predetermined weighting coefficient (hereinafter referred to as the "second weighting coefficient w2") and supplies the resulting weighted likelihood Lw2 to the sorting unit 5z. That is, the multiplication unit 5y obtains the weighted likelihood Lw2 by the following equation (2):

       Lw2 = L2 × w2   (2)

Here, the second weighting coefficient w2 is a coefficient that varies with the elapsed time width T. Specifically, the second weighting coefficient w2 is set so as to lower the importance (weight) of the recognition result of the second recognition unit 4 as the elapsed time width T becomes larger. That is, with the elapsed time width T as a variable and the likelihood L2 as a constant, the second weighting coefficient w2 is a coefficient that increases (i.e., is non-decreasing) with increasing elapsed time width T so as to enlarge the negative value of the weighted likelihood Lw2. Therefore, the smaller the second weighting coefficient w2, the more the multiplication unit 5y sets the weighted likelihood Lw2 so as to raise the weight of the recognition result of the second recognition unit 4. The method of calculating the second weighting coefficient w2 will be described later.
In this embodiment, the first weighting coefficient w1 and the second weighting coefficient w2 are each set to a value greater than 0 and less than 1, and their sum is set to 1. That is, the first weighting coefficient w1 and the second weighting coefficient w2 satisfy the following expressions (3) to (5):

       w1 + w2 = 1   (3)
       0 < w1 < 1   (4)
       0 < w2 < 1   (5)

Thereby, the recognition result processing unit 5 can appropriately set one of the first weighting coefficient w1 and the second weighting coefficient w2 by obtaining the other. That is, the recognition result processing unit 5 can set the weights so that the importance of the recognition result of the second recognition unit 4 becomes relatively lower than that of the recognition result of the first recognition unit 3 as the elapsed time width T becomes larger.
Hereinafter, for convenience, the first and second weighting coefficients w1 and w2 are expressed by the following equations (6) and (7) using a single weighting coefficient "w":
w1 = w   (6)
w2 = 1 - w   (7)
Here, from equations (4) and (6), the weighting coefficient w is set to a value larger than 0 and smaller than 1, that is, it satisfies the following expression (8):
0 < w < 1   (8)
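As a concrete illustration of equations (1) through (8), the following is a minimal Python sketch of how the multiplication units 5x and 5y could apply a single coefficient w to the two likelihood lists. The function name and data layout are illustrative assumptions, not taken from the patent; the sketch assumes likelihoods are negative scores where values closer to zero mean higher similarity.

```python
def weight_likelihoods(results_l1, results_l2, w):
    """Apply w1 = w (equation (6)) to the first recognition unit's results and
    w2 = 1 - w (equation (7)) to the second unit's results.

    Each input is a list of (word, likelihood) pairs; likelihoods are negative,
    and a value closer to zero indicates higher similarity.
    """
    assert 0.0 < w < 1.0  # equation (8); together with (6), (7) this implies (3)-(5)
    w1, w2 = w, 1.0 - w
    weighted_l1 = [(word, l1 * w1) for word, l1 in results_l1]  # Lw1 = L1 * w1, eq. (1)
    weighted_l2 = [(word, l2 * w2) for word, l2 in results_l2]  # Lw2 = L2 * w2, eq. (2)
    return weighted_l1, weighted_l2
```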
Next, the method of setting the weighting coefficient w, which determines the first weighting coefficient w1 and the second weighting coefficient w2, will be described in detail with reference to FIG. 4. FIG. 4 is an example of a graph showing the relationship between the weighting coefficient w and the elapsed time width T. Graphs "G1" and "G2" are examples of functions that determine the weighting coefficient w from the elapsed time width T (hereinafter, the "weighting function Fw"). "T1" in FIG. 4 is a threshold on the elapsed time width T by which the switch unit 5v decides whether to adopt the recognition result of the first recognition unit 3 as the final recognition result, or to derive the final recognition result using the recognition result of the second recognition unit 4 in addition to that of the first recognition unit 3. This will be explained in detail in the description of the switch unit 5v.
As shown in FIG. 4, the weighting function Fw is a decreasing function of the elapsed time width T, whether it is represented by graph G1 or by graph G2. Specifically, graph G1 decreases with a constant slope until the time width "T1α" and is convex downward thereafter, while graph G2 is convex downward up to the threshold T1. In another example, the weighting function Fw that determines the weighting coefficient w may be a constant value regardless of the elapsed time width T. In practice, a map or expression corresponding to the weighting function Fw is generated in advance based on experiments and the like, taking into account the user's environment, the number of recognized words, the performance of the speech recognition apparatus 100, and so on, and is stored in the storage unit 9.
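The shape of such a weighting function can be captured in a few lines. The sketch below is an illustrative stand-in for graph G1, with a linear segment followed by a convex-downward tail; every constant (T1α = 2 s, T1 = 10 s, the anchor value 0.7) is an assumption for demonstration, since the patent stores an experimentally derived map in the storage unit 9 instead.

```python
import math

def weighting_function(t, t1_alpha=2.0, t1=10.0, w_max=0.95, w_min=0.05):
    """Illustrative decreasing map from elapsed time width T to w (graph G1
    shape): constant slope down to t1_alpha, then a convex-downward decay
    toward w_min as t approaches and passes the threshold t1."""
    if t <= t1_alpha:
        return w_max - (w_max - 0.7) * (t / t1_alpha)  # linear segment down to 0.7
    # exponential tail; continuous with the linear segment at t = t1_alpha
    return w_min + (0.7 - w_min) * math.exp(-(t - t1_alpha) / (t1 - t1_alpha))
```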
Next, returning to FIG. 3, the processing executed by the sorting unit 5z will be described. Based on the weighted likelihood Lw1 supplied from the multiplication unit 5x and the weighted likelihood Lw2 supplied from the multiplication unit 5y, the sorting unit 5z sorts the recognition words corresponding to Lw1 and Lw2 in descending order of similarity. The sorting unit 5z then supplies the top N recognition words WR and their weighted likelihoods Lw to the presentation unit 6.
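For illustration, a sorting step of this kind could look like the following sketch (names are hypothetical). Because likelihoods here are negative scores where values nearer zero mean higher similarity, sorting in descending numeric order lists the most similar words first.

```python
def sort_merged(weighted_l1, weighted_l2, n):
    """Merge the two weighted lists and return the top n (word, Lw) pairs."""
    merged = weighted_l1 + weighted_l2
    merged.sort(key=lambda pair: pair[1], reverse=True)  # closest to zero first
    return merged[:n]
```

With w = 0.7 as in the FIG. 5 example below, an illustrative second-level entry ("Tokyo Tower", -30) becomes ("Tokyo Tower", -9) after weighting by w2 = 0.3, outranking a first-level entry ("route search", -40), which becomes ("route search", -28) after weighting by w1 = 0.7.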
This will be described using the specific example of FIG. 5. FIG. 5 shows an example of the lists produced by the first recognition unit 3, the second recognition unit 4, and the sorting unit 5z when misrecognition occurs in the second-level speech recognition of the example of FIG. 1(b) and the speaker restates the misrecognized word (hereinafter, simply a "re-utterance"). As a premise, in FIG. 5 the speaker uttered "name search" followed by "Tokyo Tower", and the speech recognition apparatus 100 correctly recognized "name search" at the first level but then misrecognized the second utterance as "Tokyo Port" at the second level. The speaker subsequently re-uttered "Tokyo Tower".
In this case, the first recognition unit 3 performs first-level speech recognition on the re-uttered speech data using the first command dictionary 12, while the second recognition unit 4 performs second-level speech recognition on the same data using the second command dictionary 13 and the place name dictionary 14. Here, the weighting coefficient w is assumed to have been set to "0.7" based on the elapsed time width T, that is, to a value that places more weight on the recognition result of the second recognition unit 4 than on that of the first recognition unit 3.
FIG. 5(a) is an example of a list showing the recognition result of the first recognition unit 3 for the re-utterance. Specifically, the first recognition unit 3 calculates the likelihood L1 of each word stored in the first command dictionary 12 based on the acoustic features of the re-uttered speech data, and outputs the predetermined number N of recognition words WR1 and their likelihoods L1 in descending order of similarity. As shown in FIG. 5(a), the multiplication unit 5x calculates the weighted likelihood Lw1 for each likelihood L1 based on the first weighting coefficient w1 (= 0.7).
FIG. 5(b) is an example of a list showing the recognition result of the second recognition unit 4 for the re-utterance. Specifically, the second recognition unit 4 calculates the likelihood L2 of each recognition word stored in the second command dictionary 13 and the place name dictionary 14 based on the acoustic features of the re-uttered speech data, and outputs the predetermined number N of recognition words WR2 and their likelihoods L2 in descending order of similarity, as listed in FIG. 5(b). As shown in FIG. 5(b), the multiplication unit 5y calculates the weighted likelihood Lw2 for each likelihood L2 based on the second weighting coefficient w2 (= 0.3).
FIG. 5(c) is an example of a list showing the output of the sorting unit 5z for the re-utterance. As shown in FIG. 5(c), the sorting unit 5z rearranges the recognition words WR1 and WR2 based on the weighted likelihoods Lw1 and Lw2 shown in FIGS. 5(a) and 5(b), and outputs the top N (= 4) entries. Here, because the second weighting coefficient w2 is smaller than the first weighting coefficient w1, "Tokyo Tower", recognized by the second recognition unit 4, ranks first, and "Tokyo Port", likewise recognized by the second recognition unit 4, ranks second. In contrast, "route search" and "route erasure", which the first recognition unit 3 had recognized as its first and second candidates, rank below the words recognized by the second recognition unit 4.
In this way, the sorting unit 5z calculates the weighted likelihoods Lw1 and Lw2 using the weighting coefficient w appropriately set from the elapsed time width T, and sorts them together. As a result, the speech recognition apparatus 100 can correctly recognize a re-utterance after a misrecognition with a minimum of steps, without the speaker having to perform an explicit correction operation or start the voice input over again from the first level.
In addition to the above processing, when the top-ranked recognition word WR matches the top-ranked recognition result for the previously input utterance data Sa, the sorting unit 5z promotes the second-ranked recognition word WR to first place. In this way, the sorting unit 5z prevents the apparatus from misrecognizing the re-utterance as the same word again and outputting the same recognition word in first place as last time.
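A minimal sketch of this promotion rule, assuming the previous top-ranked word is retained from the last utterance (as in step S113 of the flow below):

```python
def demote_repeated_top(ranked, previous_top):
    """If the top word matches the previous utterance's top result, swap
    ranks 1 and 2 so the same misrecognition is not presented again."""
    if len(ranked) >= 2 and ranked[0][0] == previous_top:
        ranked[0], ranked[1] = ranked[1], ranked[0]
    return ranked
```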
Next, returning to FIG. 3, the processing executed by the switch unit 5v will be described. Based on the elapsed time width T, the switch unit 5v switches between supplying the output of the first recognition unit 3 to the presentation unit 6 as the final recognition result and supplying the output of the sorting unit 5z as the final recognition result. Specifically, when the elapsed time width T is less than the threshold T1, the switch unit 5v judges that a re-utterance is possible and supplies the presentation unit 6 with the output of the sorting unit 5z, which takes the recognition result of the second recognition unit 4 into account. When the elapsed time width T is equal to or greater than the threshold T1, the switch unit 5v judges that a re-utterance is unlikely because considerable time has passed since the previous utterance, and supplies the recognition result of the first recognition unit 3 to the presentation unit 6. Thus, when the elapsed time width T is equal to or greater than the threshold T1, the switch unit 5v outputs the final recognition result based only on the recognition result of the first recognition unit 3, thereby reducing unnecessary processing.
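The switching decision itself reduces to a single comparison. A sketch, with hypothetical argument names:

```python
def select_final_result(first_only, merged_sorted, elapsed_t, threshold_t1):
    """Switch unit 5v behavior: below the threshold a re-utterance is
    plausible, so the merged, re-weighted list is used; otherwise only the
    first recognition unit's output is presented."""
    if elapsed_t < threshold_t1:
        return merged_sorted  # re-utterance possible: consider both levels
    return first_only         # enough time has passed: current level only
```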
(Processing flow)
Next, the processing procedure in the first embodiment will be described. FIG. 6 is an example of a flowchart showing the processing procedure executed by the speech recognition apparatus 100 in the first embodiment. The speech recognition apparatus 100 repeatedly executes the processing of this flowchart.
First, the speech recognition apparatus 100 determines whether utterance data Sa has been input (step S101). If utterance data Sa has been input (step S101; Yes), the speech recognition apparatus 100 proceeds to step S102. Otherwise (step S101; No), the speech recognition apparatus 100 continues to monitor for input of utterance data Sa.
Next, the speech recognition apparatus 100 calculates the elapsed time width T (step S102). For example, the speech recognition apparatus 100 sets the elapsed time width T to the interval from when the presentation unit 6 presented the previous recognition result until the input of the next utterance data Sa.
The speech recognition apparatus 100 then determines whether there is a previous utterance (step S103); that is, it determines whether this is the first time a hierarchical-command voice input has been received. If there is a previous utterance (step S103; Yes), the speech recognition apparatus 100 proceeds to step S104. If there is no previous utterance (step S103; No), the speech recognition apparatus 100 performs speech recognition with the first recognition unit 3 (step S111).
Next, the speech recognition apparatus 100 determines whether the elapsed time width T is smaller than the threshold T1 (step S104). If so (step S104; Yes), the speech recognition apparatus 100 performs the processing of steps S105 to S110. That is, in this case, the speech recognition apparatus 100 judges that a re-utterance caused by misrecognition is possible, and outputs the final recognition result with the recognition result of the second recognition unit 4 also taken into account.
If the elapsed time width T is equal to or greater than the threshold T1 (step S104; No), the speech recognition apparatus 100 performs speech recognition with the first recognition unit 3 alone (step S111). That is, in this case, the speech recognition apparatus 100 judges that a re-utterance caused by misrecognition is impossible or extremely unlikely. The first recognition unit 3 then calculates the likelihoods L1 and outputs a list of the predetermined number N of recognition words WR1 and their likelihoods L1 in descending order of similarity.
In step S105, the speech recognition apparatus 100 performs speech recognition with both the first recognition unit 3 and the second recognition unit 4. Specifically, the first recognition unit 3 calculates the likelihoods L1 and outputs a list of the predetermined number N of recognition words WR1 and their likelihoods L1 in descending order of similarity, while the second recognition unit 4 calculates the likelihoods L2 and outputs a list of the predetermined number N of recognition words WR2 and their likelihoods L2 in descending order of similarity. Here, the second recognition unit 4 performs speech recognition for the level preceding that of the first recognition unit 3.
Next, the speech recognition apparatus 100 determines the first and second weighting coefficients w1 and w2 based on the elapsed time width T (step S106). For example, the speech recognition apparatus 100 refers to graph G1 or graph G2 stored in the storage unit 9 and determines the first and second weighting coefficients w1 and w2 from the elapsed time width T.
The speech recognition apparatus 100 then calculates the weighted likelihoods Lw1 and Lw2 (step S107). That is, the speech recognition apparatus 100 calculates the weighted likelihood Lw1 by multiplying the likelihood L1 by the first weighting coefficient w1, and the weighted likelihood Lw2 by multiplying the likelihood L2 by the second weighting coefficient w2.
Next, the speech recognition apparatus 100 sorts the results based on the weighted likelihoods Lw1 and Lw2 (step S108). Since the weighted likelihoods Lw1 and Lw2 are weighted according to the elapsed time width T through the first and second weighting coefficients w1 and w2, the speech recognition apparatus 100 can recognize speech appropriately based on the elapsed time width T even in the case of a re-utterance caused by misrecognition.
Next, the speech recognition apparatus 100 determines whether the top-ranked recognition word after sorting is the same as the previous one (step S109). Here, the speech recognition apparatus 100 is assumed to hold information such as the previously top-ranked recognition word in the storage unit 9. If the top-ranked word after sorting is the same as the previous one (step S109; Yes), the speech recognition apparatus 100 promotes the second-ranked word to first place (step S110), which prevents it from repeating the previous misrecognition. If the top-ranked word after sorting differs from the previous one (step S109; No), the speech recognition apparatus 100 proceeds to step S112.
Next, the speech recognition apparatus 100 presents the recognition result (step S112). For example, the speech recognition apparatus 100 displays the recognition results as a list, as shown in FIG. 5(c), or outputs by voice the recognition word WR judged to have the highest similarity.
The speech recognition apparatus 100 then stores the recognition result and its attributes (step S113). Here, the "attributes" are information accompanying the recognition processing, such as calculated values like the weighted likelihood Lw and/or which of the first recognition unit 3 and the second recognition unit 4 produced the result output in first place.
Next, the speech recognition apparatus 100 copies the settings of the first recognition unit 3 to the second recognition unit 4 (step S114). The settings correspond, for example, to information such as which dictionary was used as the active dictionary. The speech recognition apparatus 100 then starts measuring the elapsed time width T (step S115).
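Tying steps S101 to S115 together, the flowchart could be condensed into a sketch like the one below. `recognize_level`, `present`, and the `state` dictionary are hypothetical placeholders standing in for the recognition units, the presentation unit 6, and the storage unit 9; the sketch reuses the helper functions sketched earlier.

```python
import time

T1 = 10.0  # threshold on the elapsed time width, in seconds (illustrative)

def process_utterance(sa, state):
    """Condensed sketch of steps S101-S115 for one utterance input sa."""
    t = time.time() - state["t_start"] if "t_start" in state else float("inf")  # S102
    if state.get("previous_dict") is not None and t < T1:                       # S103, S104
        r1 = recognize_level(sa, state["current_dict"])                         # S105 (current level)
        r2 = recognize_level(sa, state["previous_dict"])                        # S105 (previous level)
        w = weighting_function(t)                                               # S106
        lw1, lw2 = weight_likelihoods(r1, r2, w)                                # S107
        ranked = sort_merged(lw1, lw2, n=4)                                     # S108
        ranked = demote_repeated_top(ranked, state.get("previous_top"))         # S109, S110
    else:
        ranked = recognize_level(sa, state["current_dict"])                     # S111
    present(ranked)                                                             # S112
    state["previous_top"] = ranked[0][0]                                        # S113
    state["previous_dict"] = state["current_dict"]                              # S114
    state["t_start"] = time.time()                                              # S115
    return ranked
```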
As described above, the speech recognition apparatus according to the present embodiment recognizes hierarchical commands by speech and includes dictionary storage means, first speech recognition means, second speech recognition means, and recognition result processing means. The dictionary storage means stores the dictionaries used in speech recognition at each level for recognizing hierarchical commands. When speech data is input, the first speech recognition means performs speech recognition based on the dictionary to be used on the assumption that the previously input speech data was correctly recognized; that is, it determines the dictionary to use from the previously input speech data and recognizes the next speech input with it. When speech data is input, the second speech recognition means performs speech recognition based on the dictionary used when the previously input speech data was recognized; that is, it recognizes the next speech input on the premise that the previous recognition was a misrecognition. The recognition result processing means determines the final recognition result by weighting the recognition result of the first speech recognition means and that of the second speech recognition means based on the elapsed time width counted from the time the previously input speech data was processed. In general, when the elapsed time until a re-utterance is short, the new input is likely to be a restatement of the previous one. The speech recognition apparatus therefore weights the results of the two recognition processes based on the elapsed time width. As a result, even when a misrecognition has occurred, the speech recognition apparatus can recognize the hierarchical command without an utterance or operation corresponding to a correction, and without starting over from the beginning.
(Modification 1)
In the configuration of the speech recognition apparatus 100 in FIG. 2, the apparatus includes two speech recognition units, the first recognition unit 3 and the second recognition unit 4, and the second recognition unit 4 performs speech recognition for the level preceding that of the first recognition unit 3. However, the configuration of a speech recognition apparatus to which the present invention is applicable is not limited to this; the speech recognition apparatus 100 may instead execute the above processing with a single speech recognition unit.
This specific example will be described with reference to FIG. 7. FIG. 7 is an example of a diagram illustrating a schematic configuration of the speech recognition apparatus 100a according to Modification 1. The speech recognition apparatus 100a differs from the speech recognition apparatus 100 shown in FIG. 2 in that it includes a recognition unit 20 in place of the first recognition unit 3 and the second recognition unit 4, and in that it includes a recognition result holding unit 21.
The recognition unit 20 executes the processing of the first recognition unit 3 and that of the second recognition unit 4 sequentially. Specifically, the recognition unit 20 first executes the processing of the first recognition unit 3 to calculate the likelihoods L1, and supplies the likelihoods L1 and the corresponding recognition words WR1 to the recognition result holding unit 21. The recognition unit 20 then executes the processing of the second recognition unit 4 to calculate the likelihoods L2, and supplies the likelihoods L2 and the corresponding recognition words WR2 to the recognition result processing unit 5.
After receiving the likelihoods L1 and the corresponding recognition words WR1 from the recognition unit 20, the recognition result holding unit 21 holds these recognition results until the recognition unit 20 calculates the likelihoods L2. Then, at the same time as, or around when, the recognition unit 20 supplies the likelihoods L2 and the corresponding recognition words WR2 to the recognition result processing unit 5, the recognition result holding unit 21 supplies the held likelihoods L1 and corresponding recognition words WR1 to the recognition result processing unit 5. Thereafter, the recognition result processing unit 5 executes the processing described with reference to FIG. 3.
As described above, the speech recognition apparatus 100a can execute the same processing as in the first embodiment even with only a single recognition unit 20 serving as the speech recognition unit, which also allows the speech recognition apparatus 100a to be made smaller.
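A sketch of this single-unit variant, where a held result plays the role of the recognition result holding unit 21 (the `engine` object is a hypothetical single recognition engine, not an API from the patent):

```python
class SequentialRecognizer:
    """Modification 1: one recognition unit runs both passes in sequence."""

    def __init__(self, engine):
        self.engine = engine      # hypothetical engine with a recognize() method
        self.held_result = None   # stands in for the recognition result holding unit 21

    def recognize(self, sa, current_dict, previous_dict):
        self.held_result = self.engine.recognize(sa, current_dict)  # first pass: L1, WR1
        second_result = self.engine.recognize(sa, previous_dict)    # second pass: L2, WR2
        return self.held_result, second_result  # both go to the result processing unit
```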
(Modification 2)
In the description of FIG. 2, the time measurement unit 8 sets the elapsed time width T to the interval from when the presentation unit 6 outputs the recognition result until the next utterance data Sa is input. However, the setting to which the present invention is applicable is not limited to this. Instead, the time measurement unit 8 may set the elapsed time width T to, for example, the interval from the input of the previous utterance data Sa to the input of the next utterance data Sa, or the interval from when the utterance button for the previous utterance data Sa was pressed to when the utterance button is pressed next, or an interval equivalent to these.
Note that the time measurement unit 8 may reset the elapsed time width T as appropriate, for example when the engine of the vehicle is turned off while the speech recognition apparatus 100 is mounted on that vehicle.
(Modification 3)
In graphs G1 and G2 of FIG. 4, the speech recognition apparatus 100 sets the threshold T1 at the time width where the weighting coefficient w approaches "0". However, the method to which the present invention is applicable is not limited to this. Instead, the speech recognition apparatus 100 may, for example, set the threshold T1 at the time width where the weighting coefficient w approaches "0.5". Alternatively, the threshold T1 may be set for each weighting function Fw used, or based on input from the user. In these cases as well, the speech recognition apparatus 100 can recognize a re-utterance caused by misrecognition with a minimum of steps while reducing unnecessary processing.
(Modification 4)
In the description of FIG. 2 and elsewhere, the speech recognition apparatus 100 includes the first recognition unit 3 and the second recognition unit 4, and the second recognition unit 4 performs speech recognition for the level preceding that of the first recognition unit 3. In addition, the speech recognition apparatus 100 may further include a third recognition unit that performs speech recognition for the level preceding that of the second recognition unit 4. In this case, the recognition result processing unit 5 refers to a predetermined expression or map, determines from the elapsed time width T the weighting of the likelihood calculated by each recognition unit, and sorts based on the weighted likelihoods. The expression or map is created in advance based on experiments and the like and stored in the storage unit 9. The same extension applies when four or more recognition units are used.
As described above, the speech recognition apparatus 100 can also execute recognition processing spanning three or more levels.
(Modification 5)
In the description of the first embodiment, the speech recognition apparatus 100 uses the likelihood as the similarity measure. However, the similarity measure to which the present invention is applicable is not limited to this; the speech recognition apparatus 100 may use another index of similarity instead. Even in that case, the speech recognition apparatus 100 judges that the shorter the elapsed time width T, the more likely a misrecognition has occurred, and increases the weight of the recognition result of the second recognition unit 4. In this way, too, the speech recognition apparatus 100 can recognize a re-utterance caused by misrecognition with a minimum of steps.
(Modification 6)
In the description of FIG. 3, the recognition result processing unit 5 determines the final recognition result based on the recognition results of both the first recognition unit 3 and the second recognition unit 4 when the elapsed time width T is less than the threshold T1. Instead, when the elapsed time width T is less than the threshold T1, the recognition result processing unit 5 may determine the final recognition result based only on the recognition result of the second recognition unit 4. That is, in this case, the speech recognition apparatus 100 judges that a re-utterance is highly likely, performs speech recognition only for the preceding level, and determines the final ranking of the recognition words WR based only on the likelihoods L2. In this way, too, the speech recognition apparatus 100 can recognize a re-utterance caused by misrecognition even more reliably with a minimum of steps.
[Second Embodiment]
Next, the processing executed by the speech recognition apparatus in the second embodiment will be described. In the second embodiment, the speech recognition apparatus determines the weighting coefficient w based not only on the elapsed time width T but also on information indicating the speech recognition environment (hereinafter, the "environment information Ic"). This allows the speech recognition apparatus to recognize re-utterances more accurately according to the environment in which speech recognition is executed.
In the following, the schematic configuration of the speech recognition apparatus of the second embodiment is described first, followed by the speech recognition method and its processing flow.
(Outline configuration)
FIG. 8 is an example of a diagram illustrating a schematic configuration of the speech recognition apparatus 100b according to the second embodiment. The speech recognition apparatus 100b is mounted on a vehicle and differs from the speech recognition apparatus 100 of the first embodiment in that it includes an ECU 23, a GPS receiver 24, and a route information acquisition unit 25. Hereinafter, the vehicle on which the speech recognition apparatus 100b is mounted is referred to as the "mounted vehicle".
The ECU 23 includes a CPU, a ROM, a RAM, and the like (not shown) and performs various controls on the components in the mounted vehicle. The ECU 23 is also electrically connected to the main control unit 7 and transmits information indicating the state of the mounted vehicle (hereinafter, "vehicle information") to the main control unit 7. The vehicle information includes, for example, the state of the ACC (accessory power: ignition key), the vehicle speed pulses, the open/closed state of the windows of the mounted vehicle, the engine state, and the transmission state.
The GPS receiver 24 receives radio waves carrying downlink data, including positioning data, from a plurality of GPS satellites. The positioning data is used to detect the absolute position of the mounted vehicle from latitude and longitude information. The GPS receiver 24 is electrically connected to the main control unit 7 and transmits the latitude and longitude information of the mounted vehicle to the main control unit 7.
The route information acquisition unit 25 acquires, by radio, information distributed from a VICS (Vehicle Information Communication System) center or the like (hereinafter, "VICS information"). The route information acquisition unit 25 also holds map information and, when the driver of the mounted vehicle has set a destination, information on the route guidance (route guidance information). The route information acquisition unit 25 is electrically connected to the main control unit 7 and exchanges this information with it.
The main control unit 7 acquires acoustic information, such as the S/N ratio of the utterance data Sa input from a microphone or the like (not shown), via a sensor or the like (not shown), and supplies it to the recognition result processing unit 5 as environment information Ic. The main control unit 7 also acquires the vehicle information, latitude/longitude information, and route guidance information supplied from the ECU 23, the GPS receiver 24, and the route information acquisition unit 25, and supplies them to the recognition result processing unit 5 as environment information Ic. This is described in detail in the (Voice recognition method) section.
(Voice recognition method)
Next, the processing executed by the speech recognition apparatus 100b in the second embodiment will be described. Under the control of the main control unit 7, the recognition result processing unit 5 changes the slope of the weighting function Fw according to the supplied environment information Ic. Specifically, the recognition result processing unit 5 judges that the worse the speech recognition environment, the higher the possibility of misrecognition and hence of a re-utterance; in this case it makes the slope gentler, increasing the importance of the recognition result of the second recognition unit 4. Conversely, the recognition result processing unit 5 judges that the better the speech recognition environment, the lower the possibility of misrecognition; in this case it makes the slope of the weighting function Fw steeper, increasing the importance of the recognition result of the first recognition unit 3. The recognition result processing unit 5 can thereby set the weighting coefficient w appropriately for the recognition environment.
Specific examples of changing the slope of the weighting function Fw according to various kinds of environment information Ic are described below. These examples may also be applied in combination.
1. When vehicle information is used
When the vehicle information indicates that the traveling speed of the mounted vehicle is high, the recognition result processing unit 5 makes the slope of the weighting function Fw gentler than at low speed. Specifically, the recognition result processing unit 5 may, for example, vary the slope continuously with the traveling speed. In another example, the recognition result processing unit 5 may divide the traveling speed into several ranges in advance and define a slope for each range.
A supplementary explanation follows. When the traveling speed of the mounted vehicle is high, the driver's reaction to the speech recognition apparatus 100b is expected to be slower because more concentration is devoted to driving, and the noise level caused by traveling is also expected to rise. Taking these points into account, the recognition result processing unit 5 makes the slope of the weighting function Fw gentler when the vehicle information indicates a high traveling speed than at low speed, and can thereby set the weighting coefficient w appropriately.
2. When latitude/longitude information and route guidance information are used
Based on the latitude/longitude information and the route guidance information, the recognition result processing unit 5 makes the slope of the weighting function Fw gentler when the route being traveled passes a point that demands attention while driving. Specifically, the recognition result processing unit 5 judges, for example, whether the route being traveled is in an urban area, whether it is an accident-prone spot, and whether it is an intersection at which the mounted vehicle turns left or right. When the travel route corresponds to an urban area, an accident-prone spot, and/or a turning point, the recognition result processing unit 5 judges that misrecognition is more likely and makes the slope of the weighting function Fw gentler.
As described above, by using the latitude/longitude information and the route guidance information, the recognition result processing unit 5 can set the weighting coefficient w appropriately.
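These heuristics could be folded into a single environment factor that flattens the weighting function, as in the sketch below. All thresholds and scale values are illustrative assumptions; the patent derives the actual adjustment experimentally and stores it as a map.

```python
def environment_factor(speed_kmh, snr_db=None, risky_location=False):
    """Return a scale in (0, 1]; values below 1 indicate a poorer environment."""
    factor = 1.0
    if speed_kmh > 60:
        factor *= 0.7   # fast driving: slower reactions, more road noise
    if snr_db is not None and snr_db < 10:
        factor *= 0.8   # low S/N ratio: misrecognition more likely
    if risky_location:
        factor *= 0.8   # urban area, accident-prone spot, or a turn
    return factor

def adjusted_weighting_function(t, factor):
    """Flatten Fw by stretching the time axis: factor < 1 keeps w high longer,
    favoring the second recognition unit's result for a longer time."""
    return weighting_function(t * factor)
```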
(Processing flow)
Next, the processing procedure executed by the speech recognition apparatus 100b in the second embodiment will be described, again referring to FIG. 6, which was used in the description of the processing flow of the first embodiment. Only the parts that differ from the first embodiment are described below.
In the second embodiment, after executing step S105, the speech recognition apparatus 100b acquires the environment information Ic from the ECU 23 and other sources, and determines the weighting function Fw based on the environment information Ic. For example, the speech recognition apparatus 100b prepares in advance a plurality of weighting functions Fw corresponding to different environment information Ic and selects the weighting function Fw to use from the environment information Ic by referring to a predetermined map or the like. In another example, the speech recognition apparatus 100b changes a parameter that determines the slope of the weighting function Fw according to the environment information Ic by referring to a map or the like. In either case, the map or the like is created in advance through experiments. Then, in step S106, the speech recognition apparatus 100b determines the first and second weighting coefficients w1 and w2 from the weighting function Fw based on the elapsed time width T. The speech recognition apparatus 100b can thereby recognize re-utterances more accurately by taking the speech recognition environment into account.
(Modification 1)
In the description of the second embodiment above, the recognition result processing unit 5 makes the slope of the weighting function Fw gentler when it judges that the recognition environment is poor. Instead of, or in addition to, this, the recognition result processing unit 5 may, for example, temporarily stop the measurement of the elapsed time width T when it judges that the recognition environment is poor. In another example, when the recognition result processing unit 5 judges that the recognition environment is poor, it may reduce the elapsed time width T by dividing it by, or subtracting from it, a predetermined value. By these means as well, the speech recognition apparatus 100b can execute speech recognition processing appropriately for the recognition environment.
(Modification 2)
(Modification 1) to (Modification 6) of the first embodiment can also be applied to the second embodiment. In that case, the speech recognition apparatus 100b performs one or more processes selected from (Modification 1) to (Modification 6) of the first embodiment in addition to the processing of the second embodiment described above.
The present invention can be applied to various kinds of equipment that perform speech recognition processing, for example, car navigation devices, mobile phones, personal computers, AV equipment, home appliances, and other equipment having a voice input function.
DESCRIPTION OF SYMBOLS
1 Speech analysis unit
2 Speech section detection unit
3 First recognition unit
4 Second recognition unit
5 Recognition result processing unit
6 Presentation unit
7 Main control unit
8 Time measurement unit
9 Storage unit
20 Recognition unit
21 Recognition result holding unit
23 ECU
24 GPS receiver
25 Route information acquisition unit

Claims (10)

1. A speech recognition apparatus for recognizing hierarchical commands by speech, comprising:
dictionary storage means for storing dictionaries used in speech recognition at each level for recognizing the hierarchical commands;
first speech recognition means for, when speech data is input, performing speech recognition based on the dictionary to be used on the assumption that the previously input speech data was correctly recognized;
second speech recognition means for, when the speech data is input, performing speech recognition based on the dictionary used when speech recognition of the previously input speech data was executed; and
recognition result processing means for determining a final recognition result by weighting the recognition result of the first speech recognition means and the recognition result of the second speech recognition means based on an elapsed time width counted from the time of processing of the previously input speech data.
2. The speech recognition apparatus according to claim 1, wherein the recognition result processing means determines the final recognition result based only on the recognition result of the first speech recognition means when the elapsed time width is equal to or greater than a predetermined width.
3. The speech recognition apparatus according to claim 1 or 2, wherein the recognition result processing means determines the final recognition result based only on the recognition result of the second speech recognition means when the elapsed time width is less than a predetermined width.
4. The speech recognition apparatus according to any one of claims 1 to 3, wherein the first speech recognition means and the second speech recognition means calculate likelihoods as recognition results, and the recognition result processing means performs the weighting by multiplying the likelihood calculated by the first speech recognition means by a first parameter that is a non-increasing function of the elapsed time width, and multiplying the likelihood calculated by the second speech recognition means by a second parameter that is a non-decreasing function of the elapsed time width.
5. The speech recognition apparatus according to claim 4, wherein the recognition result processing means sets the sum of the first parameter and the second parameter to a constant value.
6. The speech recognition apparatus according to any one of claims 1 to 5, further comprising environment information acquisition means for acquiring information on the recognition environment, wherein the recognition result processing means, when determining from the information that the recognition environment is poor, increases the weight of the recognition result of the second speech recognition means compared with when the recognition environment is not poor.
7. The speech recognition apparatus according to claim 6, mounted on a vehicle, wherein the environment information acquisition means acquires the vehicle speed of the vehicle and/or the degree of noise included in the speech data, and the recognition result processing means increases the weighting of the recognition result of the second speech recognition means as the vehicle speed and/or the degree of noise increases.
8. A speech recognition method used by a speech recognition apparatus that stores dictionaries used in speech recognition at each level for recognizing hierarchical commands, the method comprising:
a first speech recognition step of, when speech data is input, performing speech recognition based on the dictionary to be used on the assumption that the previously input speech data was correctly recognized;
a second speech recognition step of, when the speech data is input, performing speech recognition based on the dictionary used when speech recognition of the previously input speech data was executed; and
a recognition result processing step of determining a final recognition result by weighting the recognition result of the first speech recognition step and the recognition result of the second speech recognition step based on an elapsed time width counted from the time of processing of the previously input speech data.
9. A speech recognition program executed by a speech recognition apparatus that stores dictionaries used in speech recognition at each level for recognizing hierarchical commands, the program causing the speech recognition apparatus to function as:
first speech recognition means for, when speech data is input, performing speech recognition based on the dictionary to be used on the assumption that the previously input speech data was correctly recognized;
second speech recognition means for, when the speech data is input, performing speech recognition based on the dictionary used when speech recognition of the previously input speech data was executed; and
recognition result processing means for determining a final recognition result by weighting the recognition result of the first speech recognition means and the recognition result of the second speech recognition means based on an elapsed time width counted from the time of processing of the previously input speech data.
10. A storage medium storing the program according to claim 9.
PCT/JP2009/063996 2009-08-07 2009-08-07 Voice recognition device, voice recognition method, and voice recognition program WO2011016129A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2011503691A JPWO2011016129A1 (en) 2009-08-07 2009-08-07 Speech recognition apparatus, speech recognition method, and speech recognition program
PCT/JP2009/063996 WO2011016129A1 (en) 2009-08-07 2009-08-07 Voice recognition device, voice recognition method, and voice recognition program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2009/063996 WO2011016129A1 (en) 2009-08-07 2009-08-07 Voice recognition device, voice recognition method, and voice recognition program

Publications (1)

Publication Number Publication Date
WO2011016129A1 true WO2011016129A1 (en) 2011-02-10

Family

ID=43544048

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2009/063996 WO2011016129A1 (en) 2009-08-07 2009-08-07 Voice recognition device, voice recognition method, and voice recognition program

Country Status (2)

Country Link
JP (1) JPWO2011016129A1 (en)
WO (1) WO2011016129A1 (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH04199198A (en) * 1990-11-29 1992-07-20 Matsushita Electric Ind Co Ltd Speech recognition device
JP2002041078A (en) * 2000-07-21 2002-02-08 Sharp Corp Voice recognition equipment, voice recognition method and program recording medium
JP2003177788A (en) * 2001-12-12 2003-06-27 Fujitsu Ltd Audio interactive system and its method
JP2008089625A (en) * 2006-09-29 2008-04-17 Honda Motor Co Ltd Voice recognition apparatus, voice recognition method and voice recognition program

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH04199198A (en) * 1990-11-29 1992-07-20 Matsushita Electric Ind Co Ltd Speech recognition device
JP2002041078A (en) * 2000-07-21 2002-02-08 Sharp Corp Voice recognition equipment, voice recognition method and program recording medium
JP2003177788A (en) * 2001-12-12 2003-06-27 Fujitsu Ltd Audio interactive system and its method
JP2008089625A (en) * 2006-09-29 2008-04-17 Honda Motor Co Ltd Voice recognition apparatus, voice recognition method and voice recognition program

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016180917A (en) * 2015-03-25 2016-10-13 日本電信電話株式会社 Correction speech detection device, voice recognition system, correction speech detection method, and program
JP2018180943A (en) * 2017-04-13 2018-11-15 新日鐵住金株式会社 Plan creation apparatus, plan creation method, and program
CN107577151A (en) * 2017-08-25 2018-01-12 谢锋 A kind of method, apparatus of speech recognition, equipment and storage medium
JP2019053143A (en) * 2017-09-13 2019-04-04 アルパイン株式会社 Voice recognition system and computer program
US20220020362A1 (en) * 2020-07-17 2022-01-20 Samsung Electronics Co., Ltd. Speech signal processing method and apparatus
US11670290B2 (en) * 2020-07-17 2023-06-06 Samsung Electronics Co., Ltd. Speech signal processing method and apparatus

Also Published As

Publication number Publication date
JPWO2011016129A1 (en) 2013-01-10

Similar Documents

Publication Publication Date Title
US10121467B1 (en) Automatic speech recognition incorporating word usage information
US9196248B2 (en) Voice-interfaced in-vehicle assistance
US7747437B2 (en) N-best list rescoring in speech recognition
JP4433704B2 (en) Speech recognition apparatus and speech recognition program
US10381000B1 (en) Compressed finite state transducers for automatic speech recognition
JP4846735B2 (en) Voice recognition device
US9715877B2 (en) Systems and methods for a navigation system utilizing dictation and partial match search
EP2082335A2 (en) System and method for a cooperative conversational voice user interface
US20220358908A1 (en) Language model adaptation
JP2010191400A (en) Speech recognition system and data updating method
US10199037B1 (en) Adaptive beam pruning for automatic speech recognition
WO2011016129A1 (en) Voice recognition device, voice recognition method, and voice recognition program
US20230102157A1 (en) Contextual utterance resolution in multimodal systems
US11145296B1 (en) Language and grammar model adaptation
JP2006308848A (en) Vehicle instrument controller
JP2009230068A (en) Voice recognition device and navigation system
JP4770374B2 (en) Voice recognition device
CN110556104B (en) Speech recognition device, speech recognition method, and storage medium storing program
WO2012076895A1 (en) Pattern recognition
KR101063159B1 (en) Address Search using Speech Recognition to Reduce the Number of Commands
CN111798842B (en) Dialogue system and dialogue processing method
KR20100073178A (en) Speaker adaptation apparatus and its method for a speech recognition
KR20060098673A (en) Method and apparatus for speech recognition
KR102527346B1 (en) Voice recognition device for vehicle, method for providing response in consideration of driving status of vehicle using the same, and computer program
US20230315997A9 (en) Dialogue system, a vehicle having the same, and a method of controlling a dialogue system

Legal Events

Date Code Title Description
ENP Entry into the national phase: Ref document number: 2011503691; Country of ref document: JP; Kind code of ref document: A
121 Ep: the epo has been informed by wipo that ep was designated in this application: Ref document number: 09848061; Country of ref document: EP; Kind code of ref document: A1
WWE Wipo information: entry into national phase: Ref document number: 13061652; Country of ref document: US
NENP Non-entry into the national phase: Ref country code: DE
122 Ep: pct application non-entry in european phase: Ref document number: 09848061; Country of ref document: EP; Kind code of ref document: A1