WO2011016129A1 - Voice recognition device, voice recognition method, and voice recognition program - Google Patents

Voice recognition device, voice recognition method, and voice recognition program

Info

Publication number
WO2011016129A1
WO2011016129A1
Authority
WO
WIPO (PCT)
Prior art keywords
recognition
speech
speech recognition
recognition result
voice
Prior art date
Application number
PCT/JP2009/063996
Other languages
French (fr)
Japanese (ja)
Inventor
実 吉田
Original Assignee
パイオニア株式会社 (Pioneer Corporation)
Priority date
Filing date
Publication date
Application filed by パイオニア株式会社 (Pioneer Corporation)
Priority to JP2011503691A
Priority to PCT/JP2009/063996
Publication of WO2011016129A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/28 - Constructional details of speech recognition systems
    • G10L 15/32 - Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems

Definitions

  • The present invention relates to a speech recognition technology for recognizing hierarchical utterance commands.
  • Patent Document 1 discloses a technique for recognizing a voice input as a rephrasing when a further voice input is received within a predetermined time width after a voice input has been received.
  • Patent Document 2 discloses a technique related to the present invention.
  • An object of the present invention is to provide a voice recognition device capable of recognizing a rephrased utterance with a minimum number of steps even when a misrecognition occurs.
  • The invention according to claim 1 is a voice recognition device for voice recognition of hierarchical commands, comprising: dictionary storage means for storing the dictionaries used in the speech recognition of each hierarchy for recognizing the hierarchical commands; first speech recognition means for, when speech data is input, performing speech recognition based on the dictionary to be used when it is assumed that the previously input speech data was correctly recognized; second speech recognition means for, when speech data is input, performing speech recognition based on the dictionary used when the speech recognition of the previously input speech data was executed; and recognition result processing means for determining a final recognition result by weighting the recognition result of the first speech recognition means and the recognition result of the second speech recognition means based on an elapsed time width calculated from the time of processing of the previously input speech data.
  • The invention according to claim 8 is a speech recognition method used by a speech recognition apparatus that stores the dictionaries used in the speech recognition of each layer for recognizing layer commands, the method comprising: a first speech recognition step of, when speech data is input, performing speech recognition based on the dictionary to be used when it is assumed that the previously input speech data was correctly recognized; a second speech recognition step of, when speech data is input, performing speech recognition based on the dictionary used when the speech recognition of the previously input speech data was executed; and a recognition result processing step of determining a final recognition result by weighting the recognition result of the first speech recognition step and the recognition result of the second speech recognition step based on an elapsed time width calculated from the time of processing of the previously input speech data.
  • The invention according to claim 9 is a speech recognition program executed by a speech recognition apparatus that stores the dictionaries used in the speech recognition of each layer for recognizing layer commands, the program causing the speech recognition apparatus to function as: first voice recognition means for, when speech data is input, performing voice recognition based on the dictionary to be used when it is assumed that the previously input voice data was correctly recognized; second speech recognition means for, when speech data is input, performing speech recognition based on the dictionary used when the speech recognition of the previously input speech data was executed; and recognition result processing means for determining a final recognition result by weighting the recognition result of the first speech recognition means and the recognition result of the second speech recognition means based on an elapsed time width calculated from the time of processing of the previously input speech data.
  • In one aspect, a speech recognition apparatus for speech recognition of hierarchical commands comprises: dictionary storage means for storing the dictionaries used in the speech recognition of each layer for recognizing the hierarchical commands; first speech recognition means for, when speech data is input, performing speech recognition based on the dictionary to be used when it is assumed that the previously input speech data was correctly recognized; second speech recognition means for, when speech data is input, performing speech recognition based on the dictionary used when the speech recognition of the previously input speech data was executed; and recognition result processing means for determining a final recognition result by weighting the recognition result of the first speech recognition means and the recognition result of the second speech recognition means based on an elapsed time width calculated from the time of processing of the previously input speech data.
  • The above speech recognition apparatus recognizes hierarchical commands and includes the dictionary storage means, the first speech recognition means, the second speech recognition means, and the recognition result processing means.
  • A hierarchical command refers to a command that realizes one operation command or output command to a device to be controlled through two or more utterances. That is, in the case of a hierarchical command, one operation command or output command is specified step by step, hierarchy by hierarchy, based on the voice-recognized commands.
  • The dictionary storage means stores the dictionaries used for speech recognition in each hierarchy for recognizing hierarchical commands.
  • Here, the “dictionary used in speech recognition in each layer” refers to the dictionary used in each speech recognition process necessary to realize one operation command or output command.
  • The “voice recognition process” refers to the entire process necessary for recognizing one piece of voice data.
  • The dictionary stores a plurality of recognized words against which the input voice data is pattern-matched.
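  • (Illustrative sketch, not part of the publication.) As a rough Python illustration, the per-layer dictionaries could be held as simple word lists keyed by hierarchy; the word entries below are taken from the embodiment of FIG. 1(b) and FIG. 5, and all variable names are hypothetical:

        # Recognized words per dictionary (examples from the embodiment).
        FIRST_COMMAND_DICTIONARY = ["name search", "route search", "route erasure"]
        SECOND_COMMAND_DICTIONARY = ["return", "stop", "correction"]
        PLACE_NAME_DICTIONARY = ["Tokyo Tower", "Tokyo Port"]

        # Standby dictionaries used in each layer of the two-layer example:
        # layer 1 uses the first command dictionary; layer 2 uses the second
        # command dictionary together with the place name dictionary.
        STANDBY_DICTIONARIES = {
            1: [FIRST_COMMAND_DICTIONARY],
            2: [SECOND_COMMAND_DICTIONARY, PLACE_NAME_DICTIONARY],
        }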
  • When speech data is input, the first speech recognition means performs speech recognition based on the dictionary to be used when it is assumed that the previously input speech data was correctly recognized.
  • That is, the first voice recognition means determines the dictionary to be used based on the voice data input immediately before, and performs voice recognition of the newly input voice data.
  • When speech data is input, the second voice recognition means performs voice recognition based on the dictionary used when the voice recognition of the previously input voice data was executed.
  • That is, the second voice recognition means performs voice recognition of the newly input voice data on the assumption that the voice recognition of the voice data input immediately before was a misrecognition.
  • The recognition result processing means weights the recognition result of the first voice recognition means and the recognition result of the second voice recognition means based on an elapsed time width calculated from the time of processing of the previously input voice data, and determines the final recognition result.
  • Here, the “recognition result” indicates the degree of similarity between each recognized word stored in the dictionary used and the newly input speech data.
  • The “elapsed time width” refers to the time width from the start or end of processing of the previously input audio data to the start of processing of the newly input audio data, or a time width corresponding thereto.
  • The “final recognition result” refers to the recognition result that is finally output by the speech recognition apparatus.
  • For example, the final recognition result is a predetermined number of recognized words arranged in descending order of similarity, together with their degrees of similarity.
  • In this way, the speech recognition apparatus weights the recognition results of the two recognition processes based on the elapsed time width. As a result, even if a misrecognition occurs, the speech recognition apparatus can recognize the recurrent utterance with a minimum number of steps, without an utterance or operation corresponding to a correction operation and without starting over from the beginning.
  • In one mode, the recognition result processing means determines the final recognition result based only on the recognition result of the first speech recognition means when the elapsed time width is equal to or greater than a predetermined width.
  • The predetermined width is set to an appropriate value based on experiments or the like. Specifically, the predetermined width is set to an elapsed time width at which the newly input audio data is unlikely, or extremely unlikely, to be a rephrasing of the previously input audio data. In this mode, the speech recognition apparatus can reduce unnecessary processing when a recurrent utterance is unlikely or impossible.
  • In another mode, the recognition result processing means determines the final recognition result based only on the recognition result of the second speech recognition means when the elapsed time width is less than a predetermined width.
  • Here too, the predetermined width is set to an appropriate value based on experiments or the like. Specifically, the predetermined width is set, for example, to an elapsed time width at which the newly input audio data is highly likely to be a rephrasing of the previously input audio data.
  • In this mode, the speech recognition apparatus can reduce unnecessary processing when a recurrent utterance is highly likely.
  • In one mode, the first speech recognition means and the second speech recognition means calculate likelihoods as recognition results, and the recognition result processing means performs the weighting by multiplying the likelihood calculated by the first speech recognition means by a first parameter that is a non-increasing function of the elapsed time width, and by multiplying the likelihood calculated by the second speech recognition means by a second parameter that is a non-decreasing function of the elapsed time width.
  • Here, the “likelihood” includes log likelihood.
  • By multiplying the likelihood calculated by the first speech recognition means by the first parameter, a non-increasing function of the elapsed time width, the speech recognition apparatus gradually raises the weight, that is, the importance in the final recognition result, of the recognition result of the first speech recognition means. Likewise, by multiplying the likelihood calculated by the second speech recognition means by the second parameter, a non-decreasing function of the elapsed time width, the speech recognition apparatus gradually lowers the weight of the recognition result of the second speech recognition means. In this way, even if a misrecognition occurs, the speech recognition device can recognize a recurrent utterance with a minimum number of steps, without an utterance or operation corresponding to a correction operation and without starting over from the beginning.
  • In one mode, the recognition result processing means sets the sum of the first parameter and the second parameter to a constant value.
  • The constant value is set to 1, for example. In this way, the speech recognition apparatus can easily determine one of the first and second parameters from the other, and set both parameters to appropriate values.
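  • (Illustrative sketch, not part of the publication.) Assuming log likelihoods that are negative values, where a smaller absolute value means higher similarity, the weighting with two parameters whose sum is held at the constant value 1 might look as follows in Python; the function name is hypothetical:

        def weight_likelihoods(l1, l2, first_parameter):
            """l1, l2: (negative) log likelihoods from the first and second
            speech recognition means; first_parameter: non-increasing in the
            elapsed time width, with 0 < first_parameter < 1."""
            second_parameter = 1.0 - first_parameter  # sum kept at the constant 1
            # A smaller multiplier shrinks the negative value toward zero,
            # i.e. raises that recognizer's effective weight in the ranking.
            return l1 * first_parameter, l2 * second_parameter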
  • In one mode, the speech recognition apparatus further includes environment information acquisition means for acquiring information related to the recognition environment, and when the recognition result processing means determines based on this information that the recognition environment is poor, it increases the weight of the recognition result of the second speech recognition means compared with the case where the recognition environment is not poor.
  • The “recognition environment” refers to the environment in which the speech recognition apparatus performs speech recognition, which affects the accuracy of the speech recognition. In general, when the recognition environment is poor, misrecognition is more likely, and in that case the speaker is likely to rephrase. Accordingly, when the speech recognition apparatus determines that the recognition environment is poor, it can perform speech recognition with higher accuracy by weighting the recognition result of the second speech recognition means more heavily than usual.
  • In one mode, the environment information acquisition means is installed in a vehicle and acquires the vehicle speed of the vehicle and/or the degree of noise included in the speech data, and the recognition result processing means increases the weight of the recognition result of the second speech recognition means as the vehicle speed and/or the degree of noise increases.
  • Here, the “degree of noise” corresponds to, for example, the S/N ratio.
  • When the vehicle speed is high, the driver concentrates on driving, so the response to the voice recognition device tends to be slow and the elapsed time tends to increase.
  • In addition, when the vehicle speed is high, the noise associated with traveling is expected to increase.
  • The speech recognition apparatus therefore determines that the higher the vehicle speed and/or the degree of noise, the higher the probability of misrecognition and the higher the possibility of a recurrent utterance. In this mode, the speech recognition apparatus can thus execute speech recognition with higher accuracy.
  • Another aspect of the invention is a speech recognition method used by a speech recognition apparatus that stores the dictionaries used in the speech recognition of each layer for recognizing layer commands, the method including: a first speech recognition step of, when speech data is input, performing speech recognition based on the dictionary to be used when it is assumed that the previously input speech data was correctly recognized; a second speech recognition step of, when speech data is input, performing speech recognition based on the dictionary used when the speech recognition of the previously input speech data was executed; and a recognition result processing step of determining a final recognition result by weighting the recognition result of the first speech recognition step and the recognition result of the second speech recognition step based on an elapsed time width calculated from the time of processing of the previously input speech data.
  • Another aspect of the invention is a speech recognition program executed by a speech recognition apparatus that stores the dictionaries used in the speech recognition of each layer for recognizing layer commands, the program causing the speech recognition apparatus to function as: first voice recognition means for, when speech data is input, performing voice recognition based on the dictionary to be used when it is assumed that the previously input voice data was correctly recognized; second speech recognition means for, when speech data is input, performing speech recognition based on the dictionary used when the speech recognition of the previously input speech data was executed; and recognition result processing means for determining a final recognition result by weighting the recognition result of the first speech recognition means and the recognition result of the second speech recognition means based on an elapsed time width calculated from the time of processing of the previously input speech data.
  • By being equipped with this program, the voice recognition device can, even if a misrecognition occurs, recognize a recurrent utterance with minimal steps, without utterances or operations corresponding to corrective actions and without starting over from the beginning.
  • In one mode, the program is recorded on a storage medium.
  • FIG. 1 is a conceptual diagram showing two types of dictionary configurations used for speech recognition. Specifically, FIG. 1(a) shows an example of a dictionary configuration for recognizing non-hierarchical commands, given as a comparative example, and FIG. 1(b) shows an example of a dictionary configuration for recognizing hierarchical commands, which are the object of the present invention.
  • The speech recognition apparatus of the present invention recognizes hierarchical commands, as will be described later.
  • A non-hierarchical command refers to an utterance command that realizes one operation command or output command to the device to be controlled with a single utterance, whereas a hierarchical command refers to an utterance command that realizes one operation command or output command to the device to be controlled with two or more utterances.
  • A dictionary refers to a database that stores the words to be recognized (hereinafter referred to as “recognized words”).
  • In the configuration of FIG. 1(a), the voice recognition apparatus performs a predetermined analysis on the utterance data input through a microphone or the like (hereinafter referred to as “utterance data Sa”), and then outputs a recognition result based on the command dictionary 11.
  • The “recognition result” corresponds to, for example, the recognized word having the highest similarity to the utterance data Sa, or a list of recognized words arranged in descending order of similarity to the utterance data Sa together with their similarities.
  • The “utterance data Sa” refers to an input signal including voice. For example, when the voice recognition device is installed in a car navigation device, the utterance data Sa is the input signal recorded from the microphone during a predetermined time after the user presses the utterance button.
  • In the example of FIG. 1(b), the hierarchical command is composed of two hierarchies.
  • The speech recognition apparatus recognizes the speech data Sa input first in a series of hierarchical command recognition processing using the first command dictionary 12.
  • This speech recognition is hereinafter referred to as “speech recognition in the first layer”. That is, speech recognition in the first layer refers to the process, among the series of processes for recognizing a hierarchical command, of recognizing the speech data Sa using the dictionary used in the initial state, i.e., the state in which no speech command has yet been accepted.
  • In this way, when recognizing a hierarchical command, the speech recognition apparatus changes the dictionary to be used (hereinafter also referred to as the “standby dictionary”) for each processing state (hierarchy). Suppose now that the speech recognition apparatus recognizes “name search” by speech recognition in the first hierarchy based on the first command dictionary 12; that is, suppose that the speech recognition apparatus determines that “name search” has the highest similarity among the recognized words stored in the first command dictionary 12.
  • Next, based on the recognized “name search” command, the speech recognition apparatus performs speech recognition using the second command dictionary 13 and the place name dictionary 14 as the next standby dictionaries.
  • The second command dictionary 13 contains utterance commands, such as “return”, “stop”, and “correction”, that are input when the speaker explicitly wants to return the process to the previous state (i.e., speech recognition in the first layer) or wants to redo the series of hierarchical command recognition processing from the beginning.
  • The place name dictionary 14 contains the place names that are the targets of the “name search” recognized by the speech recognition in the first layer.
  • The speech recognition apparatus then calculates the similarity of the recognized words stored in the second command dictionary 13 and the place name dictionary 14 against the next input utterance data Sa (here, “Tokyo Tower” is uttered).
  • The voice recognition here is referred to as “voice recognition in the second layer”. That is, speech recognition in the second layer refers to the process of recognizing the utterance data Sa based on the speech recognition result in the first layer.
  • Voice recognition in the third and subsequent hierarchies is defined in the same manner.
  • The speech recognition apparatus outputs the result of the speech recognition in the second layer as the final recognition result. Thereafter, for new utterance data Sa, the speech recognition apparatus again performs speech recognition in the first hierarchy using the first command dictionary 12.
  • The case of a two-layer utterance command has been described above; for utterance commands of three or more layers as well, in the voice recognition of each layer, the voice recognition device likewise selects a standby dictionary based on the recognized word voice-recognized in the previous hierarchy, and performs voice recognition, as sketched below.
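  • (Illustrative sketch, not part of the publication.) The selection of the next standby dictionary from the word recognized in the current hierarchy could be a simple lookup; the table entries and names below are hypothetical:

        FIRST_COMMAND_DICTIONARY = ["name search", "route search", "route erasure"]
        SECOND_COMMAND_DICTIONARY = ["return", "stop", "correction"]
        PLACE_NAME_DICTIONARY = ["Tokyo Tower", "Tokyo Port"]

        # Which dictionaries to stand by after each recognized first-layer command.
        NEXT_STANDBY = {
            "name search": [SECOND_COMMAND_DICTIONARY, PLACE_NAME_DICTIONARY],
        }

        def next_standby_dictionaries(recognized_word):
            # After the final layer (or for an unknown word) the process returns
            # to the initial state, i.e. the first command dictionary.
            return NEXT_STANDBY.get(recognized_word, [FIRST_COMMAND_DICTIONARY])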
  • The speech recognition apparatus executes speech recognition of hierarchical commands as described above. Furthermore, as will be described later, the speech recognition apparatus according to the present invention changes the output of the recognition result in accordance with the elapsed time since the speech recognition in the previous layer. As a result, even when a misrecognition occurs in the middle of a series of hierarchical command recognition processes, the voice recognition device according to the present invention can cope without requiring the user to explicitly issue a command (in the above example, a command stored in the second command dictionary 13).
  • FIG. 2 is a schematic configuration diagram of the speech recognition apparatus 100 according to the present invention.
  • The voice recognition device 100 includes a voice analysis unit 1, a voice section detection unit 2, a first recognition unit 3, a second recognition unit 4, a recognition result processing unit 5, a presentation unit 6, a main control unit 7, a time measuring unit 8, and a storage unit 9.
  • The voice analysis unit 1 calculates acoustic feature amounts (that is, voice feature parameters) of the utterance data Sa under the control of the main control unit 7. Specifically, the voice analysis unit 1 performs A/D conversion on the utterance data Sa input from a microphone or the like (not shown), and calculates the acoustic feature amounts using, for example, well-known acoustic analysis methods or a combination thereof.
  • The voice section detection unit 2 uses the voice feature amounts supplied from the voice analysis unit 1 to cut out only the speech section (hereinafter referred to as “speech data”) from the utterance data Sa. Then, the voice section detection unit 2 supplies the voice feature amounts corresponding to the speech data to the first recognition unit 3 and the second recognition unit 4.
  • The first recognition unit 3 calculates the likelihoods of the recognized words stored in a predetermined standby dictionary using a discriminator such as an HMM (Hidden Markov Model), based on the voice feature amounts. Then, the first recognition unit 3 outputs a list of the recognized words arranged in descending order of similarity, together with their likelihoods.
  • Here, the first recognition unit 3 performs speech recognition based on the dictionary to be used when it is assumed that the previously input speech data was correctly recognized. That is, the first recognition unit 3 performs speech recognition of the layer following the one in which the previously input speech data was processed (or of the first layer when there is no next layer). Specifically, in the example of FIG. 1(b), when the previously input voice data was recognized in the first hierarchy, the first recognition unit 3 calculates the recognition result using the second command dictionary 13 and the place name dictionary 14; when the previously input speech data was recognized in the second hierarchy, it calculates the recognition result using the first command dictionary 12.
  • The first recognition unit 3 supplies the recognized words obtained as the recognition result (hereinafter referred to as “recognized words WR1”) and the corresponding logarithmized likelihoods (hereinafter referred to as “likelihoods L1”) to the recognition result processing unit 5.
  • Specifically, the first recognition unit 3 supplies a predetermined number (hereinafter referred to as “predetermined number N”) of the highest-ranked recognized words WR1 and their likelihoods L1.
  • The predetermined number N is set to an appropriate value through experiments or the like.
  • The second recognition unit 4 calculates the likelihoods of the recognized words stored in a predetermined standby dictionary different from that of the first recognition unit 3, using a classifier such as an HMM, based on the voice feature amounts. Then, the second recognition unit 4 outputs a list of the recognized words arranged in descending order of similarity, together with their likelihoods.
  • Here, the second recognition unit 4 performs voice recognition based on the standby dictionary used when the voice recognition of the previously input voice data was executed. That is, the second recognition unit 4 performs speech recognition in the same hierarchy as the one in which the previously input speech data Sa was processed; in other words, the second recognition unit 4 performs speech recognition in the hierarchy preceding that of the first recognition unit 3. More specifically, in the example of FIG. 1(b), when the previously input speech data Sa was recognized in the first hierarchy, the second recognition unit 4 again performs speech recognition in the first hierarchy using the first command dictionary 12, and when the previously input speech data Sa was recognized in the second hierarchy, it again performs speech recognition in the second hierarchy using the second command dictionary 13 and the place name dictionary 14.
  • The second recognition unit 4 supplies the recognized words obtained as the recognition result (hereinafter referred to as “recognized words WR2”) and the corresponding logarithmized likelihoods (hereinafter referred to as “likelihoods L2”) to the recognition result processing unit 5.
  • Specifically, the second recognition unit 4 supplies the predetermined number N of the highest-ranked recognized words WR2 and their likelihoods L2.
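  • (Illustrative sketch, not part of the publication.) For the two-layer example of FIG. 1(b), the differing dictionary choices of the two recognition units could be expressed as follows, where prev_layer is the hierarchy in which the previous utterance was recognized and standby_dictionaries maps each layer to its dictionaries; the names are hypothetical:

        def dictionaries_for_recognizers(prev_layer, standby_dictionaries):
            """Returns (dictionaries for the first recognition unit 3,
            dictionaries for the second recognition unit 4)."""
            last_layer = max(standby_dictionaries)
            # Unit 3 assumes the previous result was correct and moves on to the
            # next layer (wrapping to layer 1 after the last layer); unit 4
            # assumes a misrecognition and redoes the previous layer.
            next_layer = 1 if prev_layer == last_layer else prev_layer + 1
            return standby_dictionaries[next_layer], standby_dictionaries[prev_layer]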
  • The recognition result processing unit 5 performs predetermined weighting on the likelihoods L1 and L2 output from the first recognition unit 3 and the second recognition unit 4, based on the elapsed time width (hereinafter referred to as “elapsed time width T”) measured by the time measuring unit 8. Thereby, the recognition result processing unit 5 calculates a weighted likelihood from the likelihood L1 (hereinafter referred to as the “weighted likelihood Lw1”) and a weighted likelihood from the likelihood L2 (hereinafter referred to as the “weighted likelihood Lw2”). Then, the recognition result processing unit 5 sorts the weighted likelihoods Lw1 and Lw2 together, and supplies the recognized words WR1 and WR2 (hereinafter collectively referred to as “recognized words WR”) and the corresponding weighted likelihoods Lw1 and Lw2 (hereinafter collectively referred to as “weighted likelihoods Lw”) to the presentation unit 6.
  • The specific processing in the recognition result processing unit 5 will be described in detail in the next section (recognition processing method).
  • The presentation unit 6 presents the recognition result output by the recognition result processing unit 5 as the final recognition result.
  • For example, the presentation unit 6 is a display, a speaker, or the like, and outputs the recognition result output by the recognition result processing unit 5 as an image or sound.
  • The main control unit 7 includes a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), and the like (not shown), and performs various controls on each element of the speech recognition apparatus 100.
  • The time measuring unit 8 is a timer for measuring the elapsed time width T. Specifically, the time measuring unit 8 measures the elapsed time width T from the time when the presentation unit 6 output the recognition result for the previous utterance data Sa.
  • The storage unit 9 is a memory that stores the acoustic models used by the first recognition unit 3 and the second recognition unit 4, the standby dictionaries used for speech recognition in each layer, and other data such as the recognition results output by the recognition result processing unit 5. The data stored in the storage unit 9 is supplied to each element of the speech recognition apparatus 100 as necessary, under the control of the main control unit 7.
  • FIG. 3 is a functional block diagram specifically showing the processing contents of the recognition result processing unit 5.
  • The recognition result processing unit 5 includes multiplication units 5x and 5y, a sorting unit 5z, and a switch unit 5v.
  • The multiplication unit 5x multiplies the likelihood L1 supplied from the first recognition unit 3 by a predetermined weighting coefficient (hereinafter referred to as the “first weighting coefficient w1”), and supplies the weighted likelihood Lw1, the result of the multiplication, to the sorting unit 5z. That is, the multiplication unit 5x obtains the weighted likelihood Lw1 by the following formula (1).
  • Lw1 = L1 × w1 … Formula (1)
  • The first weighting coefficient w1 is a coefficient that varies according to the elapsed time width T. Specifically, the first weighting coefficient w1 is set so that the importance (weight) of the recognition result of the first recognition unit 3 increases as the elapsed time width T increases.
  • Concretely, the first weighting coefficient w1 decreases (that is, is non-increasing) as the elapsed time width T increases, so as to decrease the absolute value of the weighted likelihood Lw1.
  • This is because the speech recognition apparatus 100 regards the likelihoods L1 and L2 and the weighted likelihoods Lw1 and Lw2, which are negative values, as indicating higher similarity the smaller their absolute values are. Therefore, the multiplication unit 5x sets the weighted likelihood Lw1 so that the weight of the recognition result of the first recognition unit 3 increases as the first weighting coefficient w1 decreases. A method for setting the first weighting coefficient w1 will be described later.
  • The multiplication unit 5y multiplies the likelihood L2 supplied from the second recognition unit 4 by a predetermined weighting coefficient (hereinafter referred to as the “second weighting coefficient w2”), and supplies the weighted likelihood Lw2, the result of the multiplication, to the sorting unit 5z. That is, the multiplication unit 5y obtains the weighted likelihood Lw2 by the following formula (2).
  • Lw2 = L2 × w2 … Formula (2)
  • The second weighting coefficient w2 is a coefficient that varies according to the elapsed time width T. Specifically, the second weighting coefficient w2 is set so that the importance (weight) of the recognition result of the second recognition unit 4 decreases as the elapsed time width T increases.
  • As with the multiplication unit 5x, the multiplication unit 5y sets the weighted likelihood Lw2 so that the weight of the recognition result of the second recognition unit 4 increases as the second weighting coefficient w2 decreases.
  • A method for calculating the second weighting coefficient w2 will be described later.
  • The first weighting coefficient w1 and the second weighting coefficient w2 are each set to a value larger than 0 and smaller than 1, and their sum is set to 1. That is, the first weighting coefficient w1 and the second weighting coefficient w2 satisfy the following formulas (3) to (5).
  • w1 + w2 = 1 … Formula (3)
  • 0 < w1 < 1 … Formula (4)
  • 0 < w2 < 1 … Formula (5)
  • Thereby, the recognition result processing unit 5 can appropriately set one of the first weighting coefficient w1 and the second weighting coefficient w2 by obtaining the other. That is, as the elapsed time width T increases, the recognition result processing unit 5 can make the recognition result of the second recognition unit 4 relatively less important than the recognition result of the first recognition unit 3.
  • The first and second weighting coefficients w1 and w2 are expressed by the following formulas (6) and (7) using a predetermined weighting coefficient “w”.
  • w1 = w … Formula (6)
  • w2 = 1 − w … Formula (7)
  • From formulas (4) and (6), the weighting coefficient w is set to a value larger than 0 and smaller than 1. That is, the following formula (8) is satisfied.
  • 0 < w < 1 … Formula (8)
  • Next, a method for setting the weighting coefficient w, which determines the first weighting coefficient w1 and the second weighting coefficient w2, will be described in detail with reference to FIG. 4.
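  • (Illustrative sketch, not part of the publication.) One possible shape for the weighting function Fw, decreasing in the elapsed time width T and confined to 0 < w < 1 as required by formula (8); the constants and the linear shape are assumptions:

        def weighting_coefficient(elapsed_t, t1=10.0):
            """Fw: near 1 for small T (then w2 = 1 - w is small, so the second
            recognition unit dominates) and near 0 toward the threshold t1
            (then w1 = w is small, so the first recognition unit dominates)."""
            if elapsed_t <= 0.0:
                return 0.999
            if elapsed_t >= t1:
                return 0.001
            return max(0.001, min(0.999, 1.0 - elapsed_t / t1))  # linear decrease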
  • FIG. 4 is an example of a graph showing the relationship between the weighting coefficient w and the elapsed time width T.
  • The graphs “G1” and “G2” are examples of functions that determine the weighting coefficient w from the elapsed time width T (hereinafter referred to as the “weighting function Fw”). “T1” in FIG. 4 is a threshold for the elapsed time width T by which the switch unit 5v determines whether to use the recognition result of the first recognition unit 3 alone as the final recognition result, or to derive the final recognition result using the recognition result of the second recognition unit 4 in addition to that of the first recognition unit 3. This will be described in detail in the description of the switch unit 5v.
  • Whether represented by graph G1 or graph G2, the weighting function Fw is a decreasing function of the elapsed time width T.
  • The graph G1 is a decreasing function that falls with a constant slope up to a time width somewhat before the threshold T1 and is convex downward thereafter.
  • The graph G2 is a decreasing function that is convex downward up to the threshold T1.
  • Alternatively, the weighting function Fw for determining the weighting coefficient w may be a constant value regardless of the elapsed time width T.
  • The map or expression corresponding to the weighting function Fw is generated in advance based on experiments or the like, taking into account the user's usage environment, the number of recognized words, the performance of the speech recognition apparatus 100, and so on, and is stored in the storage unit 9.
  • The sorting unit 5z sorts the recognized words corresponding to the weighted likelihoods Lw1 and Lw2 in descending order of similarity, based on the weighted likelihood Lw1 supplied from the multiplication unit 5x and the weighted likelihood Lw2 supplied from the multiplication unit 5y. Then, the sorting unit 5z supplies the predetermined number N of recognized words WR and their weighted likelihoods Lw to the presentation unit 6.
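  • (Illustrative sketch, not part of the publication.) The merge performed by the sorting unit 5z could be expressed as below; the log likelihoods are negative, so sorting in descending order places the most similar word first:

        def merge_results(results1, results2, w, n):
            """results1, results2: lists of (word, log_likelihood) from the first
            and second recognition units; w: weighting coefficient (0 < w < 1);
            n: length of the output list (the predetermined number N)."""
            weighted = [(word, l * w) for word, l in results1]            # Lw1 = L1 x w1
            weighted += [(word, l * (1.0 - w)) for word, l in results2]   # Lw2 = L2 x w2
            weighted.sort(key=lambda item: item[1], reverse=True)         # less negative first
            return weighted[:n]

        # With w = 0.7 (heavier weight on the second recognition unit), a word
        # from the second unit can outrank a comparable first-unit word:
        print(merge_results([("route search", -120.0)], [("Tokyo Tower", -110.0)], 0.7, 2))
        # -> "Tokyo Tower" first (weighted about -33 vs. about -84)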
  • FIG. 5 shows a specific example of the recognition result list of the first recognition unit 3, the recognition result list of the second recognition unit 4, and the output of the sorting unit 5z, for the case where a word is erroneously recognized in the second-layer speech recognition in the example of FIG. 1(b) and the speaker utters the same word again (hereinafter simply referred to as a “recurrent utterance”).
  • Here, the utterer uttered “Tokyo Tower” after uttering “name search”, whereas the speech recognition apparatus 100 correctly recognized “name search” in the first-layer speech recognition but erroneously recognized “Tokyo Port” in the second-layer speech recognition. The speaker subsequently uttered “Tokyo Tower” again as a recurrent utterance.
  • For the recurrent speech data, the first recognition unit 3 performs voice recognition in the first hierarchy using the first command dictionary 12, while the second recognition unit 4 performs speech recognition in the second hierarchy using the second command dictionary 13 and the place name dictionary 14.
  • Here, the weighting coefficient w is set to “0.7” based on the elapsed time width T, that is, to a value that weights the recognition result of the second recognition unit 4 more heavily than the recognition result of the first recognition unit 3.
  • FIG. 5(a) is an example of a list showing the recognition result of the first recognition unit 3 for the recurrent utterance.
  • In this case, the first recognition unit 3 calculates the likelihoods L1 of the words stored in the first command dictionary 12 based on the acoustic feature amounts of the speech data of the recurrent utterance. Then, the first recognition unit 3 outputs the predetermined number N of recognized words WR1 and their likelihoods L1 in descending order of similarity based on the likelihoods L1.
  • FIG. 5(b) is an example of a list showing the recognition result of the second recognition unit 4 for the recurrent utterance.
  • FIG. 5(c) is an example of a list showing the output of the sorting unit 5z for the recurrent utterance.
  • As shown in FIG. 5(c), “Tokyo Tower”, recognized by the second recognition unit 4, is ranked first, and “Tokyo Port”, likewise recognized by the second recognition unit 4, is ranked second, while “route search” and “route erasure”, ranked first and second by the first recognition unit 3, are ranked lower than the words recognized by the second recognition unit 4.
  • This is because the sorting unit 5z calculates the weighted likelihoods Lw1 and Lw2 using the weighting coefficient w appropriately set based on the elapsed time width T, and sorts them together.
  • In this way, when a misrecognition occurs, the speech recognition apparatus 100 can correctly recognize the recurrent utterance with a minimum number of steps, without the speaker performing an explicit re-utterance operation and without the speaker having to re-enter the voice input from the first layer.
  • Preferably, the sorting unit 5z also moves the recognized word WR ranked second up to first place when the recognition result for the previously input utterance data Sa matches the recognized word in first place. By doing so, the sorting unit 5z can avoid recognizing the same recognized word again for a recurrent utterance and outputting the same recognized word as last time in first place.
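  • (Illustrative sketch, not part of the publication.) This promotion of the second-ranked word could be expressed as:

        def promote_if_repeated(ranked_words, previous_top):
            """ranked_words: recognized words in descending order of similarity;
            previous_top: the word output in first place for the previous utterance."""
            if len(ranked_words) >= 2 and ranked_words[0] == previous_top:
                # Re-presenting the same word would repeat the misrecognition,
                # so swap in the second-ranked candidate.
                ranked_words[0], ranked_words[1] = ranked_words[1], ranked_words[0]
            return ranked_words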
  • The switch unit 5v switches between supplying the output of the first recognition unit 3 to the presentation unit 6 as the final recognition result and supplying the output of the sorting unit 5z to the presentation unit 6 as the final recognition result. Specifically, when the elapsed time width T is less than the threshold T1, the switch unit 5v determines that a recurrent utterance is possible and supplies to the presentation unit 6 the output of the sorting unit 5z, in which the recognition result of the second recognition unit 4 is taken into account.
  • On the other hand, when the elapsed time width T is equal to or greater than the threshold T1, the switch unit 5v determines that a recurrent utterance is unlikely or impossible because considerable time has elapsed since the previous utterance, and supplies the recognition result of the first recognition unit 3 to the presentation unit 6.
  • In this case, the switch unit 5v can reduce unnecessary processing by outputting a final recognition result based only on the recognition result of the first recognition unit 3.
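  • (Illustrative sketch, not part of the publication.) The switch unit 5v then reduces to a threshold test on the elapsed time width T:

        def final_recognition_result(elapsed_t, t1, first_unit_result, merged_result):
            # Below the threshold T1 a recurrent utterance is plausible, so the
            # merged output of the sorting unit 5z is presented; otherwise only
            # the first recognition unit's result is used, skipping the extra work.
            return merged_result if elapsed_t < t1 else first_unit_result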
  • FIG. 6 is an example of a flowchart showing a processing procedure executed by the speech recognition apparatus 100 in the first embodiment.
  • The speech recognition apparatus 100 repeatedly executes the processing of the flowchart shown in FIG. 6.
  • First, the speech recognition apparatus 100 determines whether there is an input of utterance data Sa (step S101). When there is an input of utterance data Sa (step S101; Yes), the speech recognition apparatus 100 advances the process to step S102. On the other hand, when there is no input of utterance data Sa (step S101; No), the speech recognition apparatus 100 continues to monitor whether there is an input of utterance data Sa.
  • Next, the speech recognition apparatus 100 calculates the elapsed time width T (step S102). That is, the speech recognition apparatus 100 determines, for example, the elapsed time width T from when the recognition result was presented by the presentation unit 6 until the next utterance data Sa was input.
  • Next, the speech recognition apparatus 100 determines whether or not there was a previous utterance (step S103). In other words, the speech recognition apparatus 100 determines whether or not this is the first accepted speech input of a hierarchical command. If there was a previous utterance (step S103; Yes), the speech recognition apparatus 100 performs the process of step S104. On the other hand, if there was no previous utterance (step S103; No), the speech recognition apparatus 100 performs speech recognition with the first recognition unit 3 (step S111).
  • Next, the speech recognition apparatus 100 determines whether or not the elapsed time width T is smaller than the threshold T1 (step S104). If the elapsed time width T is smaller than the threshold T1 (step S104; Yes), the speech recognition apparatus 100 performs the processing from step S105 to step S110. That is, in this case, the speech recognition apparatus 100 determines that there may be a recurrent utterance caused by misrecognition, and outputs the final recognition result in consideration of the recognition result of the second recognition unit 4.
  • On the other hand, if the elapsed time width T is equal to or greater than the threshold T1 (step S104; No), the voice recognition device 100 performs voice recognition with the first recognition unit 3 (step S111).
  • That is, in this case, the speech recognition apparatus 100 determines that a recurrent utterance due to misrecognition is impossible or very unlikely.
  • In step S111, the first recognition unit 3 calculates the likelihoods L1 and outputs a list of the predetermined number N of recognized words WR1 and their likelihoods L1 in descending order of similarity.
  • In step S105, the speech recognition apparatus 100 performs speech recognition with both the first recognition unit 3 and the second recognition unit 4. Specifically, the first recognition unit 3 calculates the likelihoods L1 and outputs a list of the predetermined number N of recognized words WR1 and their likelihoods L1 in descending order of similarity, and the second recognition unit 4 calculates the likelihoods L2 and outputs a list of the predetermined number N of recognized words WR2 and their likelihoods L2 in descending order of similarity. At this time, the second recognition unit 4 performs speech recognition of the layer preceding that of the first recognition unit 3.
  • Next, the speech recognition apparatus 100 determines the first and second weighting coefficients w1 and w2 based on the elapsed time width T (step S106). For example, the speech recognition apparatus 100 refers to the graph G1 or the graph G2 stored in the storage unit 9 and determines the first and second weighting coefficients w1 and w2 from the elapsed time width T.
  • Next, the speech recognition apparatus 100 calculates the weighted likelihoods Lw1 and Lw2 (step S107). That is, the speech recognition apparatus 100 calculates the weighted likelihood Lw1 by multiplying the likelihood L1 by the first weighting coefficient w1, and calculates the weighted likelihood Lw2 by multiplying the likelihood L2 by the second weighting coefficient w2.
  • Next, the speech recognition apparatus 100 sorts the results based on the weighted likelihoods Lw1 and Lw2 (step S108).
  • Here, the weighted likelihoods Lw1 and Lw2 have been weighted according to the elapsed time width T through the first and second weighting coefficients w1 and w2.
  • Therefore, the speech recognition apparatus 100 can appropriately recognize the speech based on the elapsed time width T, even in the case of a recurrent utterance caused by misrecognition.
  • Next, the speech recognition apparatus 100 determines whether or not the recognized word ranked first after sorting is the same as the previous time (step S109).
  • For this purpose, the speech recognition apparatus 100 holds information such as the previously first-ranked recognized word in the storage unit 9. If the first place after sorting is the same as the previous time (step S109; Yes), the speech recognition apparatus 100 moves the second-ranked word up to first place (step S110). Thereby, the speech recognition apparatus 100 can avoid repeating the same misrecognition as the previous time.
  • The speech recognition apparatus 100 then advances the process to step S112.
  • Next, the speech recognition apparatus 100 presents the recognition result (step S112). For example, the speech recognition apparatus 100 displays the recognition results in a list as shown in FIG. 5(c), or outputs the recognized word WR determined to have the highest similarity as speech.
  • Next, the speech recognition apparatus 100 stores the recognition result and its attributes in the storage unit 9 (step S113).
  • Here, the “attributes” refer to information accompanying the recognition process, such as calculated values such as the weighted likelihoods Lw and/or which of the first recognition unit 3 and the second recognition unit 4 produced the recognition result output in first place.
  • Next, the speech recognition apparatus 100 copies the settings of the first recognition unit 3 to the second recognition unit 4 (step S114).
  • Here, the settings correspond to information such as which dictionary is to be used as the standby dictionary.
  • Then, the voice recognition device 100 starts measuring the elapsed time width T (step S115).
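  • (Illustrative sketch, not part of the publication.) The control flow of FIG. 6 (steps S101 to S115) can be condensed into the following Python loop; the recognizer objects and helper functions are hypothetical stand-ins for the units of the apparatus, injected as parameters:

        import time

        def recognition_loop(wait_for_utterance, first_rec, second_rec, t1, n,
                             weighting_fn, merge_results, promote_if_repeated, present):
            last_output_time = None
            previous_top = None
            while True:
                speech = wait_for_utterance()                          # step S101
                if last_output_time is None:                           # step S103: no previous utterance
                    results = first_rec.recognize(speech)              # step S111
                else:
                    elapsed = time.monotonic() - last_output_time      # step S102
                    if elapsed < t1:                                   # step S104
                        r1 = first_rec.recognize(speech)               # step S105
                        r2 = second_rec.recognize(speech)
                        w = weighting_fn(elapsed)                      # step S106
                        results = merge_results(r1, r2, w, n)          # steps S107-S108
                        results = promote_if_repeated(results, previous_top)  # steps S109-S110
                    else:
                        results = first_rec.recognize(speech)          # step S111
                present(results)                                       # step S112
                previous_top = results[0][0]                           # step S113: store the result
                second_rec.copy_settings_from(first_rec)               # step S114
                last_output_time = time.monotonic()                    # step S115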
  • As described above, the speech recognition apparatus recognizes hierarchical commands and includes the dictionary storage means, the first speech recognition means, the second speech recognition means, and the recognition result processing means.
  • The dictionary storage means stores the dictionaries used for speech recognition in each hierarchy for recognizing hierarchical commands.
  • When speech data is input, the first speech recognition means performs speech recognition based on the dictionary to be used when it is assumed that the previously input speech data was correctly recognized. That is, the first speech recognition means determines the dictionary to be used based on the previously input speech data, and performs speech recognition of the next input speech data.
  • When speech data is input, the second voice recognition means performs voice recognition based on the dictionary used when the voice recognition of the previously input voice data was executed.
  • That is, the second voice recognition means performs voice recognition of the next input voice data on the assumption that the voice recognition of the previously input voice data was a misrecognition.
  • The recognition result processing means weights the recognition result of the first voice recognition means and the recognition result of the second voice recognition means based on an elapsed time width calculated from the time of processing of the previously input voice data, and determines the final recognition result.
  • In this way, the speech recognition apparatus weights the recognition results of the two recognition processes based on the elapsed time width. As a result, even if a misrecognition occurs, the speech recognition apparatus can recognize a hierarchical command without an utterance or operation corresponding to a correction operation and without starting over from the beginning.
  • In the first embodiment described above, the speech recognition device 100 includes two speech recognition units, the first recognition unit 3 and the second recognition unit 4, with the second recognition unit 4 performing speech recognition of the layer preceding that of the first recognition unit 3.
  • However, the configuration of the speech recognition apparatus 100 to which the present invention is applicable is not limited to this. Instead, the speech recognition apparatus 100 may execute the above-described processing with only one speech recognition unit.
  • FIG. 7 is an example of a diagram illustrating a schematic configuration of the speech recognition apparatus 100a according to the first modification.
  • The speech recognition apparatus 100a differs from the configuration of the speech recognition apparatus 100 shown in FIG. 2 in that it includes a recognition unit 20 and a recognition result holding unit 21 in place of the first recognition unit 3 and the second recognition unit 4 of FIG. 2.
  • The recognition unit 20 sequentially executes both the processes of the first recognition unit 3 and the second recognition unit 4. Specifically, the recognition unit 20 first calculates the likelihoods L1 by executing the processing performed by the first recognition unit 3, and supplies the likelihoods L1 and the corresponding recognized words WR1 to the recognition result holding unit 21. Next, the recognition unit 20 calculates the likelihoods L2 by executing the processing performed by the second recognition unit 4, and supplies the likelihoods L2 and the corresponding recognized words WR2 to the recognition result processing unit 5.
  • After receiving the likelihoods L1 and the corresponding recognized words WR1 from the recognition unit 20, the recognition result holding unit 21 holds these recognition results until the recognition unit 20 has calculated the likelihoods L2, and then supplies them to the recognition result processing unit 5.
  • In this way, the voice recognition device 100a can execute the same processing as in the first embodiment even though it has only one voice recognition unit, the recognition unit 20. This also makes it possible to reduce the size of the speech recognition apparatus 100a.
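  • (Illustrative sketch, not part of the publication.) The sequential two-pass operation of the recognition unit 20 together with the recognition result holding unit 21 could be expressed as:

        def recognize_both_passes(recognize, speech, next_layer_dicts, same_layer_dicts):
            """recognize(speech, dicts) plays the role of the recognition unit 20."""
            # Pass 1 (first recognition unit's role): the result is simply held,
            # like the recognition result holding unit 21, until pass 2 finishes.
            held_result = recognize(speech, next_layer_dicts)
            # Pass 2 (second recognition unit's role).
            second_result = recognize(speech, same_layer_dicts)
            return held_result, second_result  # both go to the recognition result processing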
  • In the first embodiment, the time measuring unit 8 set the elapsed time width T from the time when the recognition result was output by the presentation unit 6 to the time when the next utterance data Sa was input.
  • However, the setting to which the present invention is applicable is not limited to this.
  • Instead, the time measuring unit 8 may set as the elapsed time width T the time width from the input of the previous utterance data Sa to the input of the next utterance data Sa, the time width from when the utterance button was pressed for the previous utterance data Sa to when the utterance button is pressed next, or a time width corresponding thereto.
  • Further, the time measuring unit 8 may reset the elapsed time width T as appropriate, for example when the engine of the vehicle is turned off in the case where the speech recognition apparatus 100 is mounted on a vehicle.
  • In the first embodiment, the speech recognition apparatus 100 set the threshold T1 to a time width at which the weighting coefficient w is near “0”.
  • However, the method to which the present invention is applicable is not limited to this. Instead, for example, the speech recognition apparatus 100 may set the threshold T1 to a time width at which the weighting coefficient w is near “0.5”.
  • Further, the threshold T1 may be set for each weighting function Fw used, or based on an input from the user. In this way too, the speech recognition apparatus 100 can recognize a recurrent utterance caused by misrecognition with a minimum number of steps while reducing unnecessary processing.
  • In the first embodiment, the speech recognition apparatus 100 includes the first recognition unit 3 and the second recognition unit 4, and the second recognition unit 4 performs speech recognition of the layer preceding that of the first recognition unit 3.
  • In addition to this, the speech recognition apparatus 100 may further include a third recognition unit that performs speech recognition of the layer preceding that of the second recognition unit 4.
  • In this case, the recognition result processing unit 5 refers to a predetermined formula or map, determines the weighting of the likelihood calculated by each recognition unit from the elapsed time width T, and sorts the results based on the weighted likelihoods.
  • The formula or map is created in advance based on experiments or the like and stored in the storage unit 9. The same applies when four or more recognition units are used.
  • In this way, the speech recognition apparatus 100 can also perform recognition processing over three or more layers.
  • In the first embodiment, the speech recognition apparatus 100 used the likelihood as the similarity.
  • However, the similarity index to which the present invention is applicable is not limited to this. Instead, the speech recognition apparatus 100 may use another index indicating similarity. Even in this case, the speech recognition apparatus 100 determines that the shorter the elapsed time width T, the higher the possibility of misrecognition, and increases the weight of the recognition result of the second recognition unit 4. In this way too, the speech recognition apparatus 100 can recognize a recurrent utterance caused by misrecognition with a minimum number of steps.
  • In the first embodiment, when the elapsed time width T was less than the threshold T1, the recognition result processing unit 5 determined the final recognition result based on both the recognition result of the first recognition unit 3 and the recognition result of the second recognition unit 4. Instead, the recognition result processing unit 5 may determine the final recognition result based only on the recognition result of the second recognition unit 4 when the elapsed time width T is less than the threshold T1. That is, in this case, the speech recognition apparatus 100 determines that a recurrent utterance is highly likely, performs only the speech recognition of the previous hierarchy, and determines the final ranking of the recognized words WR based only on the likelihoods L2. In this way too, the speech recognition apparatus 100 can recognize a recurrent utterance caused by misrecognition more reliably with minimal steps.
  • In the second embodiment, the speech recognition apparatus determines the weighting coefficient w based not only on the elapsed time width T but also on information indicating the environment in which speech recognition is performed (hereinafter referred to as “environment information Ic”). Thereby, the speech recognition apparatus can recognize a recurrent utterance more accurately in accordance with the environment in which speech recognition is executed.
  • FIG. 8 is an example of a diagram illustrating a schematic configuration of the speech recognition apparatus 100b according to the second embodiment.
  • The voice recognition device 100b is mounted on a vehicle.
  • The speech recognition apparatus 100b differs from the speech recognition apparatus 100 of the first embodiment in that it includes an ECU 23, a GPS receiver 24, and a route information acquisition unit 25.
  • Hereinafter, the vehicle on which the voice recognition device 100b is mounted is referred to as the “mounted vehicle”.
  • The ECU 23 includes a CPU, a ROM, a RAM, and the like (not shown), and performs various controls on each component of the mounted vehicle.
  • The ECU 23 is electrically connected to the main control unit 7 and transmits information indicating the state of the mounted vehicle (hereinafter referred to as “vehicle information”) to the main control unit 7.
  • The vehicle information includes, for example, information such as the ACC (accessory power supply: engine key) state, the vehicle speed pulse state, the window opening/closing state, the engine state, and the transmission state of the mounted vehicle.
  • The GPS receiver 24 receives radio waves carrying downlink data, including positioning data, from a plurality of GPS satellites.
  • The positioning data is used to detect the absolute position of the mounted vehicle from latitude and longitude information.
  • The GPS receiver 24 is electrically connected to the main control unit 7 and transmits the latitude and longitude information of the mounted vehicle to the main control unit 7.
  • The route information acquisition unit 25 acquires, from radio waves, information distributed from a VICS (Vehicle Information Communication System) center or the like (hereinafter referred to as "VICS information").
  • The route information acquisition unit 25 holds map information, and also holds route guidance information when the driver of the mounted vehicle has set a destination.
  • The route information acquisition unit 25 is electrically connected to the main control unit 7 and exchanges this information with the main control unit 7.
  • The main control unit 7 acquires, via a sensor (not shown), acoustic information such as the S/N ratio of the utterance data Sa input from a microphone (not shown), and supplies it to the recognition result processing unit 5 as environment information Ic. Further, the main control unit 7 acquires the vehicle information, the latitude/longitude information, and the route guidance information supplied from the ECU 23, the GPS receiver 24, and the route information acquisition unit 25, and supplies them to the recognition result processing unit 5 as environment information Ic. This will be explained in detail in the (Voice Recognition Method) section.
  • The recognition result processing unit 5 changes the slope of the weighting function Fw from the supplied environment information Ic under the control of the main control unit 7. Specifically, the recognition result processing unit 5 determines that the worse the speech recognition environment is, the higher the possibility of misrecognition and hence the higher the possibility of a restatement. Therefore, in this case, the recognition result processing unit 5 raises the importance of the recognition result of the second recognition unit 4 by making the slope of the weighting function Fw gentler. Conversely, the recognition result processing unit 5 determines that the better the speech recognition environment is, the lower the possibility of misrecognition. Therefore, in this case, the recognition result processing unit 5 raises the importance of the recognition result of the first recognition unit 3 by making the slope of the weighting function Fw steeper. Thereby, the recognition result processing unit 5 can set the weighting coefficient w appropriately according to the recognition environment.
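A minimal sketch of this slope adjustment, assuming for illustration a simple linear form for the weighting function Fw (the patent specifies Fw only through the example graphs G1 and G2 of FIG. 4; all names and constants here are hypothetical):

```python
def make_weighting_function(base_slope, env_quality):
    """Build Fw: elapsed time width T -> weighting coefficient w, 0 < w < 1.

    env_quality in [0.0, 1.0]: 1.0 = good recognition environment,
    0.0 = poor. A poorer environment yields a gentler slope, so w stays
    large longer; since w2 = 1 - w then stays small longer, the second
    recognition unit's negative log-likelihoods remain boosted longer.
    """
    slope = base_slope * (0.2 + 0.8 * env_quality)  # gentler when poor

    def fw(elapsed_t):
        w = 1.0 - slope * elapsed_t
        return min(max(w, 0.01), 0.99)  # keep w strictly inside (0, 1)

    return fw
```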
  • When the traveling speed of the mounted vehicle is high, the recognition result processing unit 5 makes the slope of the weighting function Fw gentler than when the vehicle speed is low. Specifically, the recognition result processing unit 5 may, for example, change the slope continuously with the traveling speed, or it may divide the traveling speed into several ranges in advance and determine the slope for each range. Thereby, the recognition result processing unit 5 can appropriately set the weighting coefficient w.
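For the range-based variant, a hypothetical lookup table might be used; the speed ranges and slope values below are purely illustrative, not from the patent:

```python
# Hypothetical table: (lower km/h, upper km/h, slope of Fw per second).
SPEED_RANGE_SLOPES = [
    (0, 30, 0.10),
    (30, 80, 0.06),
    (80, float("inf"), 0.03),  # higher speed -> gentler slope
]

def slope_for_speed(speed_kmh):
    """Look up the Fw slope for the current traveling speed range."""
    for lower, upper, slope in SPEED_RANGE_SLOPES:
        if lower <= speed_kmh < upper:
            return slope
    return SPEED_RANGE_SLOPES[-1][2]
```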
  • Based on the latitude/longitude information and the route guidance information, the recognition result processing unit 5 makes the slope of the weighting function Fw gentler when the traveling route passes through a point that requires attention in driving. Specifically, the recognition result processing unit 5 decides the slope according to, for example, whether the route being traveled is in an urban area, whether it is an accident-prone point, and whether it includes an intersection where the mounted vehicle turns left or right.
  • That is, the recognition result processing unit 5 determines that misrecognition is highly likely when the travel route corresponds to an urban area, an accident-prone point, and/or a left/right turn point, and makes the slope of the weighting function Fw gentler.
  • In this way, the recognition result processing unit 5 can appropriately set the weighting coefficient w by using the latitude/longitude information and the route guidance information.
  • The voice recognition device 100b acquires the environment information Ic from the ECU 23 and the like after executing step S105. The voice recognition device 100b then determines the weighting function Fw based on the environment information Ic. For example, the speech recognition apparatus 100b prepares in advance a plurality of weighting functions Fw corresponding to the environment information Ic, and determines the weighting function Fw to be used by referring to a predetermined map or the like. Such a map is created in advance by experiments or the like. In another example, the speech recognition apparatus 100b changes a parameter that determines the slope of the weighting function Fw with reference to a map or the like according to the environment information Ic.
  • In step S106, the speech recognition apparatus 100b determines the first and second weighting coefficients w1 and w2 from the weighting function Fw based on the elapsed time width T. Thereby, the speech recognition apparatus 100b can recognize a restated utterance more accurately in consideration of the environment of speech recognition.
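The map-based selection of a prepared weighting function could look like the following sketch, reusing make_weighting_function from the earlier sketch; the environment classes and S/N thresholds are hypothetical stand-ins for the experimentally created map:

```python
# Hypothetical map from an environment class to a pre-built Fw; in the
# embodiment each entry would be tuned in advance by experiment.
FW_CANDIDATES = {
    "quiet":  make_weighting_function(base_slope=0.10, env_quality=1.0),
    "normal": make_weighting_function(base_slope=0.10, env_quality=0.6),
    "noisy":  make_weighting_function(base_slope=0.10, env_quality=0.2),
}

def choose_fw(environment_info_ic):
    """Select the weighting function Fw from the environment info Ic."""
    snr_db = environment_info_ic["snr_db"]
    key = "quiet" if snr_db >= 20 else "normal" if snr_db >= 10 else "noisy"
    return FW_CANDIDATES[key]
```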
  • In the above description, when the recognition result processing unit 5 determines that the recognition environment is poor, it makes the slope of the weighting function Fw gentler. Instead of this, or in addition to this, the recognition result processing unit 5 may, for example, temporarily stop the measurement of the elapsed time width T when it determines that the recognition environment is poor. In another example, when the recognition result processing unit 5 determines that the recognition environment is poor, it may reduce the elapsed time width T by dividing it by, or subtracting from it, a predetermined value. By these as well, the speech recognition apparatus 100b can appropriately execute speech recognition processing according to the recognition environment.
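The alternative of shrinking the elapsed time width itself admits an equally small sketch; the divisor is an arbitrary illustrative value:

```python
def effective_elapsed_time(raw_t, env_is_poor, divisor=2.0):
    """When the recognition environment is judged poor, reduce the
    elapsed time width T so that a late restatement is still treated
    like an early one (instead of, or in addition to, changing the
    slope of the weighting function Fw)."""
    return raw_t / divisor if env_is_poor else raw_t
```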
  • As a modification of the second embodiment, the speech recognition apparatus 100b may also apply (Modification 1) to (Modification 6) of the first embodiment. That is, the speech recognition apparatus 100b performs one or more processes selected from (Modification 1) to (Modification 6) of the first embodiment in addition to the processes of the second embodiment.
  • The present invention can be applied to various devices that perform voice recognition processing, for example, devices having a voice input function such as a car navigation device, a mobile phone, a personal computer, an AV device, and a home appliance.

Abstract

A voice recognition device voice-recognizes a hierarchical command and comprises a dictionary storage means, a first voice recognition means, a second voice recognition means, and a recognition result processing means. The dictionary storage means stores dictionaries used for voice recognition in each hierarchy for recognizing the hierarchical command. When voice data is input, the first voice recognition means performs voice recognition based on the dictionary to be used on the assumption that the previously input voice data was correctly recognized. When voice data is input, the second voice recognition means performs voice recognition based on the dictionary used when the previously input voice data was voice-recognized. The recognition result processing means weights the recognition result of the first voice recognition means and the recognition result of the second voice recognition means based on the elapsed time width measured from the time when the previously input voice data was processed, and then determines the final recognition result.

Description

Speech recognition apparatus, speech recognition method, and speech recognition program
The present invention relates to a speech recognition technology for recognizing hierarchical utterance commands.
Conventionally, in speech recognition apparatuses that recognize hierarchical utterance commands, techniques are well known that deal with misrecognition and erroneous operations by accepting an explicit return operation or initialization operation from the speaker. For example, Patent Document 1 discloses a technique in which, when a further voice input is received within a predetermined time width after a voice input has been accepted, the new input is recognized as a restatement of the previously input voice. In addition, Patent Document 2 discloses a technique related to the present invention.
Patent Document 1: JP-A-8-190398
Patent Document 2: JP 2001-063489 A
When misrecognition or an erroneous operation is dealt with by accepting an explicit return operation or initialization operation from the speaker, a large amount of time is wasted on that handling. Furthermore, when such an explicit operation is required, the speaker finds it troublesome.
The present invention has been made to solve the above problems, and an object of the present invention is to provide a voice recognition device capable of recognizing hierarchical commands, even when misrecognition or the like has occurred, without requiring an utterance or operation corresponding to a correction action and without starting over from the beginning.
The invention according to claim 1 is a voice recognition device that voice-recognizes hierarchical commands, comprising: dictionary storage means for storing dictionaries used for voice recognition in each hierarchy for recognizing the hierarchical commands; first voice recognition means for, when voice data is input, performing voice recognition based on the dictionary to be used on the assumption that the previously input voice data was correctly recognized; second voice recognition means for, when the voice data is input, performing voice recognition based on the dictionary used when voice recognition of the previously input voice data was executed; and recognition result processing means for determining a final recognition result by weighting the recognition result of the first voice recognition means and the recognition result of the second voice recognition means based on an elapsed time width measured from the time of processing of the previously input voice data.
The invention according to claim 8 is a voice recognition method used by a voice recognition device that stores dictionaries used for voice recognition in each hierarchy for recognizing hierarchical commands, comprising: a first voice recognition step of, when voice data is input, performing voice recognition based on the dictionary to be used on the assumption that the previously input voice data was correctly recognized; a second voice recognition step of, when the voice data is input, performing voice recognition based on the dictionary used when voice recognition of the previously input voice data was executed; and a recognition result processing step of determining a final recognition result by weighting the recognition result of the first voice recognition step and the recognition result of the second voice recognition step based on an elapsed time width measured from the time of processing of the previously input voice data.
The invention according to claim 9 is a voice recognition program executed by a voice recognition device that stores dictionaries used for voice recognition in each hierarchy for recognizing hierarchical commands, the program causing the voice recognition device to function as: first voice recognition means for, when voice data is input, performing voice recognition based on the dictionary to be used on the assumption that the previously input voice data was correctly recognized; second voice recognition means for, when the voice data is input, performing voice recognition based on the dictionary used when voice recognition of the previously input voice data was executed; and recognition result processing means for determining a final recognition result by weighting the recognition result of the first voice recognition means and the recognition result of the second voice recognition means based on an elapsed time width measured from the time of processing of the previously input voice data.
FIG. 1 is a diagram showing an example of the dictionary configuration used when voice-recognizing non-hierarchical commands and hierarchical commands.
FIG. 2 is a diagram showing an example of the schematic configuration of the speech recognition apparatus according to the first embodiment.
FIG. 3 is a diagram showing an example of the functional blocks of the recognition result processing unit according to the first embodiment.
FIG. 4 is a diagram showing examples of the weighting function Fw.
FIG. 5 is a diagram showing an example of the recognition result of the first recognition unit, the recognition result of the second recognition unit, and the final recognition result.
FIG. 6 is an example of a flowchart showing the processing procedure of the first embodiment.
FIG. 7 is an example of a flowchart showing the schematic configuration of Modification 1 of the first embodiment.
FIG. 8 is a diagram showing an example of the schematic configuration of the speech recognition apparatus according to the second embodiment.
In one aspect of the present invention, there is provided a voice recognition device that voice-recognizes hierarchical commands, comprising: dictionary storage means for storing dictionaries used for voice recognition in each hierarchy for recognizing the hierarchical commands; first voice recognition means for, when voice data is input, performing voice recognition based on the dictionary to be used on the assumption that the previously input voice data was correctly recognized; second voice recognition means for, when the voice data is input, performing voice recognition based on the dictionary used when voice recognition of the previously input voice data was executed; and recognition result processing means for determining a final recognition result by weighting the recognition result of the first voice recognition means and the recognition result of the second voice recognition means based on an elapsed time width measured from the time of processing of the previously input voice data.
The above voice recognition device voice-recognizes hierarchical commands and comprises dictionary storage means, first voice recognition means, second voice recognition means, and recognition result processing means. Here, a "hierarchical command" refers to a command that realizes one operation command or output command to a device to be controlled through two or more utterances. That is, in the case of a hierarchical command, one operation command or output command is gradually specified hierarchically based on the voice-recognized commands. The dictionary storage means stores the dictionaries used for voice recognition in each hierarchy for recognizing the hierarchical commands. Here, the "dictionaries used for voice recognition in each hierarchy" refer to the dictionaries used in each voice recognition process necessary to realize any one operation command or output command, where "voice recognition process" refers to the entire processing necessary to recognize one piece of voice data. Each dictionary stores a plurality of recognition words to be pattern-matched against the input voice data. When voice data is input, the first voice recognition means performs voice recognition based on the dictionary to be used on the assumption that the previously input voice data was correctly recognized. That is, the first voice recognition means determines the dictionary to be used based on the immediately preceding voice data and performs voice recognition of the newly input voice data. When voice data is input, the second voice recognition means performs voice recognition based on the dictionary used when voice recognition of the previously input voice data was executed. That is, the second voice recognition means performs voice recognition of the newly input voice data on the assumption that the voice recognition of the immediately preceding voice data was a misrecognition. The recognition result processing means determines the final recognition result by weighting the recognition result of the first voice recognition means and the recognition result of the second voice recognition means based on the elapsed time width measured from the time of processing of the previously input voice data. Here, the "recognition result" refers to the similarity between each recognition word stored in the used dictionary and the newly input voice data. The "elapsed time width" refers to the time width from the start or end of processing of the previously input voice data to the start of processing of the newly input voice data, or a time width corresponding thereto. The "final recognition result" refers to the recognition result finally output by the voice recognition device, for example, a predetermined number of recognition words arranged in descending order of similarity, together with their similarities.
In general, when the elapsed time width until a re-utterance is short, it is highly likely that the new utterance is a restatement of the previously input voice. Therefore, the voice recognition device weights the recognition results of the two recognition processes based on the elapsed time width. As a result, even when misrecognition has occurred, the voice recognition device can recognize the restated utterance with a minimum number of steps, without an utterance or operation corresponding to a correction action and without starting over from the beginning.
In one aspect of the above voice recognition device, the recognition result processing means determines the final recognition result based only on the recognition result of the first voice recognition means when the elapsed time width is equal to or greater than a predetermined width. This predetermined width is set to an appropriate value based on experiments or the like. Specifically, it is set to an elapsed time width at which the newly input voice data is judged to have no, or an extremely low, possibility of being a restatement of the previously input voice data. In this aspect, the voice recognition device can thus eliminate unnecessary processing when a restatement is unlikely or impossible.
In another aspect of the above voice recognition device, the recognition result processing means determines the final recognition result based only on the recognition result of the second voice recognition means when the elapsed time width is less than a predetermined width. This predetermined width is set to an appropriate value based on experiments or the like. Specifically, it is set to an elapsed time width at which the newly input voice data is judged to be extremely likely to be a restatement of the previously input voice data. In this aspect, the voice recognition device can thus eliminate unnecessary processing when a restatement is highly likely.
In another aspect of the above voice recognition device, the first voice recognition means and the second voice recognition means calculate likelihoods as their recognition results, and the recognition result processing means performs the weighting by multiplying the likelihood calculated by the first voice recognition means by a first parameter that is a non-increasing function of the elapsed time width, and multiplying the likelihood calculated by the second voice recognition means by a second parameter that is a non-decreasing function of the elapsed time width. Here, "likelihood" includes log likelihood. In this aspect, by multiplying the likelihood calculated by the first voice recognition means by the first parameter, which is non-increasing with respect to the elapsed time width, the voice recognition device gradually raises the weight of the recognition result of the first voice recognition means, that is, its importance to the final recognition result. Likewise, by multiplying the likelihood calculated by the second voice recognition means by the second parameter, which is non-decreasing with respect to the elapsed time width, the voice recognition device gradually lowers the weight of the recognition result of the second voice recognition means. In this way, even when misrecognition has occurred, the voice recognition device can recognize the restated utterance with a minimum number of steps, without an utterance or operation corresponding to a correction action and without starting over from the beginning.
In another aspect of the above voice recognition device, the recognition result processing means keeps the sum of the first parameter and the second parameter at a constant value, for example 1. In this way, the voice recognition device can easily determine one parameter from the other and set both parameters to appropriate values.
In another aspect, the above voice recognition device further comprises environment information acquisition means for acquiring information on the recognition environment, and when the recognition result processing means determines from this information that the recognition environment is poor, it increases the weight of the recognition result of the second voice recognition means compared with the case where the recognition environment is not poor. The "recognition environment" refers to the environment in which the voice recognition device performs voice recognition, insofar as it affects the accuracy of that voice recognition. In general, when the recognition environment is poor, misrecognition is likely to occur, and in that case the speaker is likely to restate. Therefore, when the voice recognition device determines that the recognition environment is poor, it can perform voice recognition more accurately by making the weight of the recognition result of the second voice recognition means larger than usual.
In another aspect, the above voice recognition device is mounted on a vehicle, the environment information acquisition means acquires the vehicle speed of the vehicle and/or the degree of noise included in the voice data, and the recognition result processing means increases the weighting of the recognition result of the second voice recognition means as the vehicle speed and/or the degree of noise increases. Here, the "degree of noise" corresponds to, for example, the S/N ratio. In general, when the vehicle speed is high, the driver concentrates on driving, so the response to the voice recognition device tends to be slower and the elapsed time tends to be longer. In addition, when the vehicle speed is high, the noise accompanying traveling is expected to be larger. Therefore, in this aspect, the voice recognition device determines that the higher the vehicle speed and/or the degree of noise, the higher the probability of misrecognition and the higher the possibility of a re-utterance. Accordingly, in this aspect, the voice recognition device can perform voice recognition more accurately.
In another aspect of the present invention, there is provided a voice recognition method used by a voice recognition device that stores dictionaries used for voice recognition in each hierarchy for recognizing hierarchical commands, the method comprising: a first voice recognition step of, when voice data is input, performing voice recognition based on the dictionary to be used on the assumption that the previously input voice data was correctly recognized; a second voice recognition step of, when the voice data is input, performing voice recognition based on the dictionary used when voice recognition of the previously input voice data was executed; and a recognition result processing step of determining a final recognition result by weighting the recognition result of the first voice recognition step and the recognition result of the second voice recognition step based on an elapsed time width measured from the time of processing of the previously input voice data. By using this method, even when misrecognition has occurred, the voice recognition device can recognize the restated utterance with a minimum number of steps, without an utterance or operation corresponding to a correction action and without starting over from the beginning.
In still another aspect of the present invention, there is provided a voice recognition program executed by a voice recognition device that stores dictionaries used for voice recognition in each hierarchy for recognizing hierarchical commands, the program causing the voice recognition device to function as: first voice recognition means for, when voice data is input, performing voice recognition based on the dictionary to be used on the assumption that the previously input voice data was correctly recognized; second voice recognition means for, when the voice data is input, performing voice recognition based on the dictionary used when voice recognition of the previously input voice data was executed; and recognition result processing means for determining a final recognition result by weighting the recognition result of the first voice recognition means and the recognition result of the second voice recognition means based on an elapsed time width measured from the time of processing of the previously input voice data. By implementing this program, even when misrecognition has occurred, the voice recognition device can recognize the restated utterance with a minimum number of steps, without an utterance or operation corresponding to a correction action and without starting over from the beginning. In a preferred example, the program is recorded on a storage medium.
Hereinafter, preferred embodiments of the present invention will be described with reference to the drawings. In the following, a basic description of the dictionary configuration related to the speech recognition method of the present invention is first given in the [Basic Description] section, and then embodiments to which the present invention is applied are described in the [First Embodiment] and [Second Embodiment] sections.
[Basic Description]
Prior to the description of specific embodiments, a basic description is given of the dictionary configuration related to the speech recognition method of the present invention. FIG. 1 is a conceptual diagram showing two types of dictionary configurations used for speech recognition. Specifically, FIG. 1(a) shows an example of the dictionary configuration for recognizing non-hierarchical commands, given as a comparative example, and FIG. 1(b) shows an example of the dictionary configuration for recognizing hierarchical commands, which are the subject of the present invention. The speech recognition apparatus of the present invention recognizes hierarchical commands, as will be described later. A "non-hierarchical command" refers to an utterance command that realizes one operation command or output command to the device to be controlled with a single utterance, and a "hierarchical command" refers to an utterance command that realizes one operation command or output command to the device with two or more utterances. A "dictionary" refers to a database storing the words to be recognized (hereinafter referred to as "recognition words").
First, the case of recognizing non-hierarchical commands will be described with reference to FIG. 1(a). The speech recognition apparatus performs a predetermined analysis on utterance data input through a microphone or the like (hereinafter referred to as "utterance data Sa") and then outputs a recognition result based on the command dictionary 11. Here, the "recognition result" is, for example, the recognition word with the highest similarity to the utterance data Sa, or a list of recognition words arranged in descending order of similarity to the utterance data Sa together with their similarities. The "utterance data Sa" refers to an input signal containing speech. For example, when the speech recognition apparatus is implemented in a car navigation device, the utterance data Sa is the input signal recorded from the microphone during a fixed period after the user presses the speech button.
Next, the case of recognizing hierarchical commands will be described with reference to FIG. 1(b). In the example of FIG. 1(b), the hierarchical command consists of two hierarchies. First, the speech recognition apparatus recognizes the utterance data Sa input first in a series of hierarchical command recognition processes, using the first command dictionary 12. This speech recognition is hereinafter referred to as "speech recognition in the first hierarchy". That is, speech recognition in the first hierarchy refers to the process, within a series of processes for recognizing a hierarchical command, of recognizing the utterance data Sa using the dictionary for the initial state, i.e., the state in which no utterance command has been accepted. Thus, when recognizing hierarchical commands, the speech recognition apparatus changes the dictionary used (hereinafter also referred to as the "standby dictionary") for each processing state (hierarchy). Assume now that, based on the first command dictionary 12, the speech recognition apparatus recognizes "name search" in the speech recognition in the first hierarchy, that is, it judges "name search" to have the highest similarity among the recognition words stored in the first command dictionary 12.
Next, based on the recognized "name search" command, the speech recognition apparatus performs speech recognition using the second command dictionary 13 and the place name dictionary 14 as the next standby dictionaries. Here, the second command dictionary 13 contains the utterance commands, such as "return", "cancel", and "correct", that the speaker should input when explicitly wishing to return the processing to the previous state (i.e., the speech recognition in the first hierarchy) or to redo the series of hierarchical command recognition processes from the beginning. The place name dictionary 14 contains the place names subject to the "name search" recognized in the first hierarchy. Then, based on the next input utterance data Sa (here, assume "Tokyo Tower" is uttered), the speech recognition apparatus calculates the similarity of the recognition words stored in the second command dictionary 13 and the place name dictionary 14. This speech recognition is hereinafter referred to as "speech recognition in the second hierarchy". That is, speech recognition in the second hierarchy refers to the process of recognizing the utterance data Sa based on the result of the speech recognition in the first hierarchy. Speech recognition in the third and subsequent hierarchies is defined in the same manner.
The speech recognition apparatus then outputs the result of the speech recognition in the second hierarchy as the final recognition result. Thereafter, for new utterance data Sa, the speech recognition apparatus again performs speech recognition in the first hierarchy using the first command dictionary 12.
In the above example the hierarchical command has two hierarchies, but the same applies to hierarchical commands with three or more hierarchies: in the speech recognition in each hierarchy, the speech recognition apparatus selects the standby dictionary based on the recognition word recognized in the previous hierarchy and performs speech recognition.
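The standby-dictionary switching described above can be pictured with a minimal Python sketch modeled on FIG. 1(b); the dictionary contents beyond "name search" and "Tokyo Tower" are hypothetical:

```python
# Standby dictionaries per hierarchy, modeled on FIG. 1(b).
FIRST_COMMAND_DICT = ["name search", "address search", "play music"]
SECOND_COMMAND_DICT = ["return", "cancel", "correct"]
PLACE_NAME_DICT = ["Tokyo Tower", "Kyoto Station"]

def standby_dictionaries(previous_recognized):
    """Select the standby dictionaries for the next utterance from the
    word recognized in the previous hierarchy (None = initial state)."""
    if previous_recognized is None:
        return [FIRST_COMMAND_DICT]                    # first hierarchy
    if previous_recognized == "name search":
        return [SECOND_COMMAND_DICT, PLACE_NAME_DICT]  # second hierarchy
    return [FIRST_COMMAND_DICT]         # command completed: back to start
```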
In the present invention, the speech recognition apparatus executes speech recognition of hierarchical commands as described above. Furthermore, as explained below, the speech recognition apparatus according to the present invention changes the output of the recognition result according to the elapsed time since the speech recognition in the previous hierarchy. As a result, even when misrecognition occurs in the middle of a series of hierarchical command recognition processes, the speech recognition apparatus according to the present invention can recognize the restated utterance with a minimum number of steps, without the user explicitly inputting a command (corresponding, in the above example, to a command stored in the second command dictionary 13).
[First Embodiment]
First, a first embodiment according to the present invention will be described. In the following, after the schematic configuration of the speech recognition apparatus is described, the speech recognition method and its processing flow will be explained. Thereafter, modifications of the first embodiment will be described.
(Schematic Configuration)
First, the schematic configuration of the speech recognition apparatus according to the first embodiment will be described with reference to FIG. 2.
FIG. 2 is a schematic configuration diagram of the speech recognition apparatus 100 according to the present invention. The speech recognition apparatus 100 includes a speech analysis unit 1, a speech section detection unit 2, a first recognition unit 3, a second recognition unit 4, a recognition result processing unit 5, a presentation unit 6, a main control unit 7, a time measurement unit 8, and a storage unit 9. The broken lines indicate the flow of control signals, and the solid lines indicate the flow of data.
Under the control of the main control unit 7, the speech analysis unit 1 calculates the acoustic feature quantity (i.e., the speech feature parameters) of the utterance data Sa. Specifically, the speech analysis unit 1 A/D-converts the utterance data Sa input from a microphone or the like (not shown) and calculates the acoustic feature quantity using, for example, a well-known acoustic analysis method or a combination of such methods.
Under the control of the main control unit 7, the speech section detection unit 2 uses the acoustic feature quantity supplied from the speech analysis unit 1 to cut out only the utterance section (hereinafter referred to as "speech data") from the utterance data Sa. The speech section detection unit 2 then supplies the acoustic feature quantity corresponding to the speech data to the first recognition unit 3 and the second recognition unit 4.
Based on the acoustic feature quantity, the first recognition unit 3 calculates the likelihood of the recognition words stored in a predetermined standby dictionary using a classifier such as an HMM (Hidden Markov Model). The first recognition unit 3 then outputs a list of recognition words and their likelihoods arranged in descending order of similarity.
Specifically, the first recognition unit 3 performs speech recognition based on the dictionary to be used on the assumption that the previously input speech data was correctly recognized. That is, the first recognition unit 3 performs speech recognition in the hierarchy following the one in which the previously input speech data was processed (or in the first hierarchy when there is no next hierarchy). In the example of FIG. 1(b), when the previously input speech data was recognized in the first hierarchy, the first recognition unit 3 calculates the recognition result using the second command dictionary 13 and the place name dictionary 14; when the previously input speech data was recognized in the second hierarchy, it calculates the recognition result using the first command dictionary 12. The first recognition unit 3 then supplies the recognition words constituting its recognition result (hereinafter referred to as "recognition words WR1") and the corresponding logarithmized likelihoods (hereinafter referred to as "likelihoods L1") to the recognition result processing unit 5. Here, the first recognition unit 3 supplies a predetermined number (hereinafter referred to as the "predetermined number N") of top-ranked recognition words WR1 and their likelihoods L1. The predetermined number N is set to an appropriate value through experiments or the like.
Based on the acoustic feature quantity, the second recognition unit 4 calculates the likelihood of the recognition words stored in a predetermined standby dictionary different from that of the first recognition unit 3, using a classifier such as an HMM. The second recognition unit 4 then outputs a list of recognition words and their likelihoods arranged in descending order of similarity.
Specifically, the second recognition unit 4 performs speech recognition based on the standby dictionary used when speech recognition of the previously input speech data was executed. That is, the second recognition unit 4 performs speech recognition in the same hierarchy as the one in which the previously input utterance data Sa was processed; in other words, it performs the speech recognition of the hierarchy preceding that of the first recognition unit 3. In the example of FIG. 1(b), when the previously input utterance data Sa was recognized in the first hierarchy, the second recognition unit 4 again performs speech recognition in the first hierarchy using the first command dictionary 12; when the previously input utterance data Sa was recognized in the second hierarchy, it again performs speech recognition in the second hierarchy using the second command dictionary 13 and the place name dictionary 14. The second recognition unit 4 then supplies the recognition words constituting its recognition result (hereinafter referred to as "recognition words WR2") and the corresponding log likelihoods (hereinafter referred to as "likelihoods L2") to the recognition result processing unit 5. Here, the second recognition unit 4 supplies the predetermined number N of top-ranked recognition words WR2 and their likelihoods L2.
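The contrast between the two recognition units' dictionary choices can be summarized in a short sketch, reusing standby_dictionaries from the sketch in the [Basic Description] section (again with hypothetical names):

```python
def dictionaries_for_recognizers(previous_word, previous_dicts):
    """Return (dictionaries for the first unit, for the second unit).

    The first recognition unit assumes previous_word was correctly
    recognized and moves on to the next hierarchy; the second unit
    assumes a misrecognition and reuses previous_dicts, the standby
    dictionaries of the hierarchy in which previous_word was processed.
    """
    first_unit_dicts = standby_dictionaries(previous_word)  # next hierarchy
    second_unit_dicts = previous_dicts                      # same hierarchy
    return first_unit_dicts, second_unit_dicts
```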
The recognition result processing unit 5 applies a predetermined weighting to the likelihoods L1 and L2 output by the first recognition unit 3 and the second recognition unit 4, based on the elapsed time width measured from the output of the previous recognition result (hereinafter referred to as the "elapsed time width T"). The recognition result processing unit 5 thereby calculates the weighted likelihood L1 (hereinafter referred to as the "weighted likelihood Lw1") and the weighted likelihood L2 (hereinafter referred to as the "weighted likelihood Lw2"). The recognition result processing unit 5 then sorts the weighted likelihoods Lw1 and Lw2 together and supplies, in descending order of similarity, the recognition words WR1 and WR2 (hereinafter collectively referred to as the "recognition words WR") and the corresponding weighted likelihoods Lw1 and Lw2 (hereinafter collectively referred to as the "weighted likelihoods Lw") to the presentation unit 6. The specific processing of the recognition result processing unit 5 will be described in detail in the next section, (Recognition Processing Method).
The presentation unit 6 presents the recognition result output by the recognition result processing unit 5 as the final recognition result. Specifically, the presentation unit 6 is a display, a speaker, or the like, and outputs the recognition result output by the recognition result processing unit 5 as an image or sound.
The main control unit 7 includes a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), and the like (not shown), and performs various kinds of control on each element of the speech recognition apparatus 100.
The time measurement unit 8 is a timer for measuring the elapsed time width T. Specifically, the time measurement unit 8 measures the elapsed time width T starting from the time when the presentation unit 6 output the recognition result of the immediately preceding utterance data Sa.
The storage unit 9 is a memory that stores the acoustic models used by the first recognition unit 3, the second recognition unit 4, and the like, the standby dictionaries used for speech recognition in each hierarchy, and other data such as the recognition results output by the recognition result processing unit 5. The data stored in the storage unit 9 is supplied to each element of the speech recognition apparatus 100 as necessary, under the control of the main control unit 7.
(Recognition Processing Method)
Next, the processing executed by the recognition result processing unit 5 will be described in detail with reference to FIGS. 3 to 5. FIG. 3 is a functional block diagram specifically showing the processing contents of the recognition result processing unit 5. As shown in FIG. 3, the recognition result processing unit 5 includes multiplication units 5x and 5y, a sorting unit 5z, and a switch unit 5v.
The multiplication unit 5x multiplies the likelihood L1 supplied from the first recognition unit 3 by a predetermined weighting coefficient (hereinafter referred to as the "first weighting coefficient w1") and supplies the resulting weighted likelihood Lw1 to the sorting unit 5z. That is, the multiplication unit 5x obtains the weighted likelihood Lw1 by the following equation (1):

       Lw1 = L1 × w1   (1)

Here, the first weighting coefficient w1 is a coefficient that varies with the elapsed time width T. Specifically, the first weighting coefficient w1 is set so as to raise the importance (weight) of the recognition result of the first recognition unit 3 as the elapsed time width T becomes larger. That is, with the elapsed time width T as a variable and the likelihood L1 as a constant, the first weighting coefficient w1 is a coefficient that decreases (i.e., is non-increasing) with increasing elapsed time width T so as to reduce the absolute value of the weighted likelihood Lw1. Note that the speech recognition apparatus 100 regards the likelihoods L1 and L2 and the weighted likelihoods Lw1 and Lw2 as indicating higher similarity the smaller their negative values, that is, the smaller their absolute values. Therefore, the smaller the first weighting coefficient w1, the more the multiplication unit 5x sets the weighted likelihood Lw1 so as to raise the weight of the recognition result of the first recognition unit 3. The method of setting the first weighting coefficient w1 will be described later.
Similarly, the multiplication unit 5y multiplies the likelihood L2 supplied from the second recognition unit 4 by a predetermined weighting coefficient (hereinafter referred to as the "second weighting coefficient w2") and supplies the resulting weighted likelihood Lw2 to the sorting unit 5z. That is, the multiplication unit 5y obtains the weighted likelihood Lw2 by the following equation (2):

       Lw2 = L2 × w2   (2)

Here, the second weighting coefficient w2 is a coefficient that varies with the elapsed time width T. Specifically, the second weighting coefficient w2 is set so as to lower the importance (weight) of the recognition result of the second recognition unit 4 as the elapsed time width T becomes larger. That is, with the elapsed time width T as a variable and the likelihood L2 as a constant, the second weighting coefficient w2 is a coefficient that increases (i.e., is non-decreasing) with increasing elapsed time width T so as to enlarge the negative value of the weighted likelihood Lw2. Therefore, the smaller the second weighting coefficient w2, the more the multiplication unit 5y sets the weighted likelihood Lw2 so as to raise the weight of the recognition result of the second recognition unit 4. The method of calculating the second weighting coefficient w2 will be described later.
In this embodiment, the first weighting coefficient w1 and the second weighting coefficient w2 are each set to a value greater than 0 and less than 1, and their sum is set to 1. That is, the first weighting coefficient w1 and the second weighting coefficient w2 satisfy the following expressions (3) to (5):

       w1 + w2 = 1   (3)
       0 < w1 < 1   (4)
       0 < w2 < 1   (5)

Thereby, the recognition result processing unit 5 can appropriately set one of the first weighting coefficient w1 and the second weighting coefficient w2 by obtaining the other. That is, the recognition result processing unit 5 can set the weights so that the importance of the recognition result of the second recognition unit 4 becomes relatively lower than that of the recognition result of the first recognition unit 3 as the elapsed time width T becomes larger.
Hereinafter, for convenience, the first and second weighting coefficients w1 and w2 are expressed by the following equations (6) and (7) using a single weighting coefficient "w":
w1 = w   (6)
w2 = 1 - w   (7)
Here, from equations (4) and (6), the weighting coefficient w is set to a value larger than 0 and smaller than 1, that is, it satisfies the following expression (8):
0 < w < 1   (8)
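As a concrete illustration of equations (1) through (8), the following is a minimal Python sketch of how the multiplication units 5x and 5y could apply a single coefficient w to the two likelihood lists. The function name and data layout are illustrative assumptions, not taken from the patent; the sketch assumes likelihoods are negative scores where values closer to zero mean higher similarity.

```python
def weight_likelihoods(results_l1, results_l2, w):
    """Apply w1 = w (equation (6)) to the first recognition unit's results and
    w2 = 1 - w (equation (7)) to the second unit's results.

    Each input is a list of (word, likelihood) pairs; likelihoods are negative,
    and a value closer to zero indicates higher similarity.
    """
    assert 0.0 < w < 1.0  # equation (8); together with (6), (7) this implies (3)-(5)
    w1, w2 = w, 1.0 - w
    weighted_l1 = [(word, l1 * w1) for word, l1 in results_l1]  # Lw1 = L1 * w1, eq. (1)
    weighted_l2 = [(word, l2 * w2) for word, l2 in results_l2]  # Lw2 = L2 * w2, eq. (2)
    return weighted_l1, weighted_l2
```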
Next, the method of setting the weighting coefficient w, which determines the first weighting coefficient w1 and the second weighting coefficient w2, will be described in detail with reference to FIG. 4. FIG. 4 is an example of a graph showing the relationship between the weighting coefficient w and the elapsed time width T. Graphs "G1" and "G2" are examples of functions that determine the weighting coefficient w from the elapsed time width T (hereinafter, the "weighting function Fw"). "T1" in FIG. 4 is a threshold on the elapsed time width T by which the switch unit 5v decides whether to adopt the recognition result of the first recognition unit 3 as the final recognition result, or to derive the final recognition result using the recognition result of the second recognition unit 4 in addition to that of the first recognition unit 3. This will be explained in detail in the description of the switch unit 5v.
As shown in FIG. 4, the weighting function Fw is a decreasing function of the elapsed time width T, whether it is represented by graph G1 or by graph G2. Specifically, graph G1 decreases with a constant slope until the time width "T1α" and is convex downward thereafter, while graph G2 is convex downward up to the threshold T1. In another example, the weighting function Fw that determines the weighting coefficient w may be a constant value regardless of the elapsed time width T. In practice, a map or expression corresponding to the weighting function Fw is generated in advance based on experiments and the like, taking into account the user's environment, the number of recognized words, the performance of the speech recognition apparatus 100, and so on, and is stored in the storage unit 9.
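The shape of such a weighting function can be captured in a few lines. The sketch below is an illustrative stand-in for graph G1, with a linear segment followed by a convex-downward tail; every constant (T1α = 2 s, T1 = 10 s, the anchor value 0.7) is an assumption for demonstration, since the patent stores an experimentally derived map in the storage unit 9 instead.

```python
import math

def weighting_function(t, t1_alpha=2.0, t1=10.0, w_max=0.95, w_min=0.05):
    """Illustrative decreasing map from elapsed time width T to w (graph G1
    shape): constant slope down to t1_alpha, then a convex-downward decay
    toward w_min as t approaches and passes the threshold t1."""
    if t <= t1_alpha:
        return w_max - (w_max - 0.7) * (t / t1_alpha)  # linear segment down to 0.7
    # exponential tail; continuous with the linear segment at t = t1_alpha
    return w_min + (0.7 - w_min) * math.exp(-(t - t1_alpha) / (t1 - t1_alpha))
```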
Next, returning to FIG. 3, the processing executed by the sorting unit 5z will be described. Based on the weighted likelihood Lw1 supplied from the multiplication unit 5x and the weighted likelihood Lw2 supplied from the multiplication unit 5y, the sorting unit 5z sorts the recognition words corresponding to Lw1 and Lw2 in descending order of similarity. The sorting unit 5z then supplies the top N recognition words WR and their weighted likelihoods Lw to the presentation unit 6.
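For illustration, a sorting step of this kind could look like the following sketch (names are hypothetical). Because likelihoods here are negative scores where values nearer zero mean higher similarity, sorting in descending numeric order lists the most similar words first.

```python
def sort_merged(weighted_l1, weighted_l2, n):
    """Merge the two weighted lists and return the top n (word, Lw) pairs."""
    merged = weighted_l1 + weighted_l2
    merged.sort(key=lambda pair: pair[1], reverse=True)  # closest to zero first
    return merged[:n]
```

With w = 0.7 as in the FIG. 5 example below, an illustrative second-level entry ("Tokyo Tower", -30) becomes ("Tokyo Tower", -9) after weighting by w2 = 0.3, outranking a first-level entry ("route search", -40), which becomes ("route search", -28) after weighting by w1 = 0.7.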
This will be described using the specific example of FIG. 5. FIG. 5 shows an example of the lists produced by the first recognition unit 3, the second recognition unit 4, and the sorting unit 5z when misrecognition occurs in the second-level speech recognition of the example of FIG. 1(b) and the speaker restates the misrecognized word (hereinafter, simply a "re-utterance"). As a premise, in FIG. 5 the speaker uttered "name search" followed by "Tokyo Tower", and the speech recognition apparatus 100 correctly recognized "name search" at the first level but then misrecognized the second utterance as "Tokyo Port" at the second level. The speaker subsequently re-uttered "Tokyo Tower".
In this case, the first recognition unit 3 performs first-level speech recognition on the re-uttered speech data using the first command dictionary 12, while the second recognition unit 4 performs second-level speech recognition on the same data using the second command dictionary 13 and the place name dictionary 14. Here, the weighting coefficient w is assumed to have been set to "0.7" based on the elapsed time width T, that is, to a value that places more weight on the recognition result of the second recognition unit 4 than on that of the first recognition unit 3.
FIG. 5(a) is an example of a list showing the recognition result of the first recognition unit 3 for the re-utterance. Specifically, the first recognition unit 3 calculates the likelihood L1 of each word stored in the first command dictionary 12 based on the acoustic features of the re-uttered speech data, and outputs the predetermined number N of recognition words WR1 and their likelihoods L1 in descending order of similarity. As shown in FIG. 5(a), the multiplication unit 5x calculates the weighted likelihood Lw1 for each likelihood L1 based on the first weighting coefficient w1 (= 0.7).
FIG. 5(b) is an example of a list showing the recognition result of the second recognition unit 4 for the re-utterance. Specifically, the second recognition unit 4 calculates the likelihood L2 of each recognition word stored in the second command dictionary 13 and the place name dictionary 14 based on the acoustic features of the re-uttered speech data, and outputs the predetermined number N of recognition words WR2 and their likelihoods L2 in descending order of similarity, as listed in FIG. 5(b). As shown in FIG. 5(b), the multiplication unit 5y calculates the weighted likelihood Lw2 for each likelihood L2 based on the second weighting coefficient w2 (= 0.3).
FIG. 5(c) is an example of a list showing the output of the sorting unit 5z for the re-utterance. As shown in FIG. 5(c), the sorting unit 5z rearranges the recognition words WR1 and WR2 based on the weighted likelihoods Lw1 and Lw2 shown in FIGS. 5(a) and 5(b), and outputs the top N (= 4) entries. Here, because the second weighting coefficient w2 is smaller than the first weighting coefficient w1, "Tokyo Tower", recognized by the second recognition unit 4, ranks first, and "Tokyo Port", likewise recognized by the second recognition unit 4, ranks second. In contrast, "route search" and "route erasure", which the first recognition unit 3 had recognized as its first and second candidates, rank below the words recognized by the second recognition unit 4.
In this way, the sorting unit 5z calculates the weighted likelihoods Lw1 and Lw2 using the weighting coefficient w appropriately set from the elapsed time width T, and sorts them together. As a result, the speech recognition apparatus 100 can correctly recognize a re-utterance after a misrecognition with a minimum of steps, without the speaker having to perform an explicit correction operation or start the voice input over again from the first level.
In addition to the above processing, when the top-ranked recognition word WR matches the top-ranked recognition result for the previously input utterance data Sa, the sorting unit 5z promotes the second-ranked recognition word WR to first place. In this way, the sorting unit 5z prevents the apparatus from misrecognizing the re-utterance as the same word again and outputting the same recognition word in first place as last time.
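A minimal sketch of this promotion rule, assuming the previous top-ranked word is retained from the last utterance (as in step S113 of the flow below):

```python
def demote_repeated_top(ranked, previous_top):
    """If the top word matches the previous utterance's top result, swap
    ranks 1 and 2 so the same misrecognition is not presented again."""
    if len(ranked) >= 2 and ranked[0][0] == previous_top:
        ranked[0], ranked[1] = ranked[1], ranked[0]
    return ranked
```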
Next, returning to FIG. 3, the processing executed by the switch unit 5v will be described. Based on the elapsed time width T, the switch unit 5v switches between supplying the output of the first recognition unit 3 to the presentation unit 6 as the final recognition result and supplying the output of the sorting unit 5z as the final recognition result. Specifically, when the elapsed time width T is less than the threshold T1, the switch unit 5v judges that a re-utterance is possible and supplies the presentation unit 6 with the output of the sorting unit 5z, which takes the recognition result of the second recognition unit 4 into account. When the elapsed time width T is equal to or greater than the threshold T1, the switch unit 5v judges that a re-utterance is unlikely because considerable time has passed since the previous utterance, and supplies the recognition result of the first recognition unit 3 to the presentation unit 6. Thus, when the elapsed time width T is equal to or greater than the threshold T1, the switch unit 5v outputs the final recognition result based only on the recognition result of the first recognition unit 3, thereby reducing unnecessary processing.
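The switching decision itself reduces to a single comparison. A sketch, with hypothetical argument names:

```python
def select_final_result(first_only, merged_sorted, elapsed_t, threshold_t1):
    """Switch unit 5v behavior: below the threshold a re-utterance is
    plausible, so the merged, re-weighted list is used; otherwise only the
    first recognition unit's output is presented."""
    if elapsed_t < threshold_t1:
        return merged_sorted  # re-utterance possible: consider both levels
    return first_only         # enough time has passed: current level only
```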
(Processing flow)
Next, the processing procedure in the first embodiment will be described. FIG. 6 is an example of a flowchart showing the processing procedure executed by the speech recognition apparatus 100 in the first embodiment. The speech recognition apparatus 100 repeatedly executes the processing of this flowchart.
First, the speech recognition apparatus 100 determines whether utterance data Sa has been input (step S101). If utterance data Sa has been input (step S101; Yes), the speech recognition apparatus 100 proceeds to step S102. Otherwise (step S101; No), the speech recognition apparatus 100 continues to monitor for input of utterance data Sa.
Next, the speech recognition apparatus 100 calculates the elapsed time width T (step S102). For example, the speech recognition apparatus 100 sets the elapsed time width T to the interval from when the presentation unit 6 presented the previous recognition result until the input of the next utterance data Sa.
The speech recognition apparatus 100 then determines whether there is a previous utterance (step S103); that is, it determines whether this is the first time a hierarchical-command voice input has been received. If there is a previous utterance (step S103; Yes), the speech recognition apparatus 100 proceeds to step S104. If there is no previous utterance (step S103; No), the speech recognition apparatus 100 performs speech recognition with the first recognition unit 3 (step S111).
Next, the speech recognition apparatus 100 determines whether the elapsed time width T is smaller than the threshold T1 (step S104). If so (step S104; Yes), the speech recognition apparatus 100 performs the processing of steps S105 to S110. That is, in this case, the speech recognition apparatus 100 judges that a re-utterance caused by misrecognition is possible, and outputs the final recognition result with the recognition result of the second recognition unit 4 also taken into account.
If the elapsed time width T is equal to or greater than the threshold T1 (step S104; No), the speech recognition apparatus 100 performs speech recognition with the first recognition unit 3 alone (step S111). That is, in this case, the speech recognition apparatus 100 judges that a re-utterance caused by misrecognition is impossible or extremely unlikely. The first recognition unit 3 then calculates the likelihoods L1 and outputs a list of the predetermined number N of recognition words WR1 and their likelihoods L1 in descending order of similarity.
In step S105, the speech recognition apparatus 100 performs speech recognition with both the first recognition unit 3 and the second recognition unit 4. Specifically, the first recognition unit 3 calculates the likelihoods L1 and outputs a list of the predetermined number N of recognition words WR1 and their likelihoods L1 in descending order of similarity, while the second recognition unit 4 calculates the likelihoods L2 and outputs a list of the predetermined number N of recognition words WR2 and their likelihoods L2 in descending order of similarity. Here, the second recognition unit 4 performs speech recognition for the level preceding that of the first recognition unit 3.
Next, the speech recognition apparatus 100 determines the first and second weighting coefficients w1 and w2 based on the elapsed time width T (step S106). For example, the speech recognition apparatus 100 refers to graph G1 or graph G2 stored in the storage unit 9 and determines the first and second weighting coefficients w1 and w2 from the elapsed time width T.
The speech recognition apparatus 100 then calculates the weighted likelihoods Lw1 and Lw2 (step S107). That is, the speech recognition apparatus 100 calculates the weighted likelihood Lw1 by multiplying the likelihood L1 by the first weighting coefficient w1, and the weighted likelihood Lw2 by multiplying the likelihood L2 by the second weighting coefficient w2.
Next, the speech recognition apparatus 100 sorts the results based on the weighted likelihoods Lw1 and Lw2 (step S108). Since the weighted likelihoods Lw1 and Lw2 are weighted according to the elapsed time width T through the first and second weighting coefficients w1 and w2, the speech recognition apparatus 100 can recognize speech appropriately based on the elapsed time width T even in the case of a re-utterance caused by misrecognition.
Next, the speech recognition apparatus 100 determines whether the top-ranked recognition word after sorting is the same as the previous one (step S109). Here, the speech recognition apparatus 100 is assumed to hold information such as the previously top-ranked recognition word in the storage unit 9. If the top-ranked word after sorting is the same as the previous one (step S109; Yes), the speech recognition apparatus 100 promotes the second-ranked word to first place (step S110), which prevents it from repeating the previous misrecognition. If the top-ranked word after sorting differs from the previous one (step S109; No), the speech recognition apparatus 100 proceeds to step S112.
Next, the speech recognition apparatus 100 presents the recognition result (step S112). For example, the speech recognition apparatus 100 displays the recognition results as a list, as shown in FIG. 5(c), or outputs by voice the recognition word WR judged to have the highest similarity.
The speech recognition apparatus 100 then stores the recognition result and its attributes (step S113). Here, the "attributes" are information accompanying the recognition processing, such as calculated values like the weighted likelihood Lw and/or which of the first recognition unit 3 and the second recognition unit 4 produced the result output in first place.
Next, the speech recognition apparatus 100 copies the settings of the first recognition unit 3 to the second recognition unit 4 (step S114). The settings correspond, for example, to information such as which dictionary was used as the active dictionary. The speech recognition apparatus 100 then starts measuring the elapsed time width T (step S115).
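Tying steps S101 to S115 together, the flowchart could be condensed into a sketch like the one below. `recognize_level`, `present`, and the `state` dictionary are hypothetical placeholders standing in for the recognition units, the presentation unit 6, and the storage unit 9; the sketch reuses the helper functions sketched earlier.

```python
import time

T1 = 10.0  # threshold on the elapsed time width, in seconds (illustrative)

def process_utterance(sa, state):
    """Condensed sketch of steps S101-S115 for one utterance input sa."""
    t = time.time() - state["t_start"] if "t_start" in state else float("inf")  # S102
    if state.get("previous_dict") is not None and t < T1:                       # S103, S104
        r1 = recognize_level(sa, state["current_dict"])                         # S105 (current level)
        r2 = recognize_level(sa, state["previous_dict"])                        # S105 (previous level)
        w = weighting_function(t)                                               # S106
        lw1, lw2 = weight_likelihoods(r1, r2, w)                                # S107
        ranked = sort_merged(lw1, lw2, n=4)                                     # S108
        ranked = demote_repeated_top(ranked, state.get("previous_top"))         # S109, S110
    else:
        ranked = recognize_level(sa, state["current_dict"])                     # S111
    present(ranked)                                                             # S112
    state["previous_top"] = ranked[0][0]                                        # S113
    state["previous_dict"] = state["current_dict"]                              # S114
    state["t_start"] = time.time()                                              # S115
    return ranked
```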
As described above, the speech recognition apparatus according to the present embodiment recognizes hierarchical commands by speech and includes dictionary storage means, first speech recognition means, second speech recognition means, and recognition result processing means. The dictionary storage means stores the dictionaries used in speech recognition at each level for recognizing hierarchical commands. When speech data is input, the first speech recognition means performs speech recognition based on the dictionary to be used on the assumption that the previously input speech data was correctly recognized; that is, it determines the dictionary to use from the previously input speech data and recognizes the next speech input with it. When speech data is input, the second speech recognition means performs speech recognition based on the dictionary used when the previously input speech data was recognized; that is, it recognizes the next speech input on the premise that the previous recognition was a misrecognition. The recognition result processing means determines the final recognition result by weighting the recognition result of the first speech recognition means and that of the second speech recognition means based on the elapsed time width counted from the time the previously input speech data was processed. In general, when the elapsed time until a re-utterance is short, the new input is likely to be a restatement of the previous one. The speech recognition apparatus therefore weights the results of the two recognition processes based on the elapsed time width. As a result, even when a misrecognition has occurred, the speech recognition apparatus can recognize the hierarchical command without an utterance or operation corresponding to a correction, and without starting over from the beginning.
(Modification 1)
In the configuration of the speech recognition apparatus 100 in FIG. 2, the apparatus includes two speech recognition units, the first recognition unit 3 and the second recognition unit 4, and the second recognition unit 4 performs speech recognition for the level preceding that of the first recognition unit 3. However, the configuration of a speech recognition apparatus to which the present invention is applicable is not limited to this; the speech recognition apparatus 100 may instead execute the above processing with a single speech recognition unit.
This specific example will be described with reference to FIG. 7. FIG. 7 is an example of a diagram illustrating a schematic configuration of the speech recognition apparatus 100a according to Modification 1. The speech recognition apparatus 100a differs from the speech recognition apparatus 100 shown in FIG. 2 in that it includes a recognition unit 20 in place of the first recognition unit 3 and the second recognition unit 4, and in that it includes a recognition result holding unit 21.
The recognition unit 20 executes the processing of the first recognition unit 3 and that of the second recognition unit 4 sequentially. Specifically, the recognition unit 20 first executes the processing of the first recognition unit 3 to calculate the likelihoods L1, and supplies the likelihoods L1 and the corresponding recognition words WR1 to the recognition result holding unit 21. The recognition unit 20 then executes the processing of the second recognition unit 4 to calculate the likelihoods L2, and supplies the likelihoods L2 and the corresponding recognition words WR2 to the recognition result processing unit 5.
After receiving the likelihoods L1 and the corresponding recognition words WR1 from the recognition unit 20, the recognition result holding unit 21 holds these recognition results until the recognition unit 20 calculates the likelihoods L2. Then, at the same time as, or around when, the recognition unit 20 supplies the likelihoods L2 and the corresponding recognition words WR2 to the recognition result processing unit 5, the recognition result holding unit 21 supplies the held likelihoods L1 and corresponding recognition words WR1 to the recognition result processing unit 5. Thereafter, the recognition result processing unit 5 executes the processing described with reference to FIG. 3.
As described above, the speech recognition apparatus 100a can execute the same processing as in the first embodiment even with only a single recognition unit 20 serving as the speech recognition unit, which also allows the speech recognition apparatus 100a to be made smaller.
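A sketch of this single-unit variant, where a held result plays the role of the recognition result holding unit 21 (the `engine` object is a hypothetical single recognition engine, not an API from the patent):

```python
class SequentialRecognizer:
    """Modification 1: one recognition unit runs both passes in sequence."""

    def __init__(self, engine):
        self.engine = engine      # hypothetical engine with a recognize() method
        self.held_result = None   # stands in for the recognition result holding unit 21

    def recognize(self, sa, current_dict, previous_dict):
        self.held_result = self.engine.recognize(sa, current_dict)  # first pass: L1, WR1
        second_result = self.engine.recognize(sa, previous_dict)    # second pass: L2, WR2
        return self.held_result, second_result  # both go to the result processing unit
```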
(Modification 2)
In the description of FIG. 2, the time measurement unit 8 sets the elapsed time width T to the interval from when the presentation unit 6 outputs the recognition result until the next utterance data Sa is input. However, the setting to which the present invention is applicable is not limited to this. Instead, the time measurement unit 8 may set the elapsed time width T to, for example, the interval from the input of the previous utterance data Sa to the input of the next utterance data Sa, or the interval from when the utterance button for the previous utterance data Sa was pressed to when the utterance button is pressed next, or an interval equivalent to these.
Note that the time measurement unit 8 may reset the elapsed time width T as appropriate, for example when the engine of the vehicle is turned off while the speech recognition apparatus 100 is mounted on that vehicle.
(Modification 3)
In graphs G1 and G2 of FIG. 4, the speech recognition apparatus 100 sets the threshold T1 at the time width where the weighting coefficient w approaches "0". However, the method to which the present invention is applicable is not limited to this. Instead, the speech recognition apparatus 100 may, for example, set the threshold T1 at the time width where the weighting coefficient w approaches "0.5". Alternatively, the threshold T1 may be set for each weighting function Fw used, or based on input from the user. In these cases as well, the speech recognition apparatus 100 can recognize a re-utterance caused by misrecognition with a minimum of steps while reducing unnecessary processing.
(Modification 4)
In the description of FIG. 2 and elsewhere, the speech recognition apparatus 100 includes the first recognition unit 3 and the second recognition unit 4, and the second recognition unit 4 performs speech recognition for the level preceding that of the first recognition unit 3. In addition, the speech recognition apparatus 100 may further include a third recognition unit that performs speech recognition for the level preceding that of the second recognition unit 4. In this case, the recognition result processing unit 5 refers to a predetermined expression or map, determines from the elapsed time width T the weighting of the likelihood calculated by each recognition unit, and sorts based on the weighted likelihoods. The expression or map is created in advance based on experiments and the like and stored in the storage unit 9. The same extension applies when four or more recognition units are used.
As described above, the speech recognition apparatus 100 can also execute recognition processing spanning three or more levels.
(Modification 5)
In the description of the first embodiment, the speech recognition apparatus 100 uses the likelihood as the similarity measure. However, the similarity measure to which the present invention is applicable is not limited to this; the speech recognition apparatus 100 may use another index of similarity instead. Even in that case, the speech recognition apparatus 100 judges that the shorter the elapsed time width T, the more likely a misrecognition has occurred, and increases the weight of the recognition result of the second recognition unit 4. In this way, too, the speech recognition apparatus 100 can recognize a re-utterance caused by misrecognition with a minimum of steps.
(Modification 6)
In the description of FIG. 3, the recognition result processing unit 5 determines the final recognition result based on the recognition results of both the first recognition unit 3 and the second recognition unit 4 when the elapsed time width T is less than the threshold T1. Instead, when the elapsed time width T is less than the threshold T1, the recognition result processing unit 5 may determine the final recognition result based only on the recognition result of the second recognition unit 4. That is, in this case, the speech recognition apparatus 100 judges that a re-utterance is highly likely, performs speech recognition only for the preceding level, and determines the final ranking of the recognition words WR based only on the likelihoods L2. In this way, too, the speech recognition apparatus 100 can recognize a re-utterance caused by misrecognition even more reliably with a minimum of steps.
[Second Embodiment]
Next, the processing executed by the speech recognition apparatus in the second embodiment will be described. In the second embodiment, the speech recognition apparatus determines the weighting coefficient w based not only on the elapsed time width T but also on information indicating the speech recognition environment (hereinafter, the "environment information Ic"). This allows the speech recognition apparatus to recognize re-utterances more accurately according to the environment in which speech recognition is executed.
In the following, the schematic configuration of the speech recognition apparatus of the second embodiment is described first, followed by the speech recognition method and its processing flow.
(Outline configuration)
FIG. 8 is an example of a diagram illustrating a schematic configuration of the speech recognition apparatus 100b according to the second embodiment. The speech recognition apparatus 100b is mounted on a vehicle and differs from the speech recognition apparatus 100 of the first embodiment in that it includes an ECU 23, a GPS receiver 24, and a route information acquisition unit 25. Hereinafter, the vehicle on which the speech recognition apparatus 100b is mounted is referred to as the "mounted vehicle".
The ECU 23 includes a CPU, a ROM, a RAM, and the like (not shown) and performs various controls on the components in the mounted vehicle. The ECU 23 is also electrically connected to the main control unit 7 and transmits information indicating the state of the mounted vehicle (hereinafter, "vehicle information") to the main control unit 7. The vehicle information includes, for example, the state of the ACC (accessory power: ignition key), the vehicle speed pulses, the open/closed state of the windows of the mounted vehicle, the engine state, and the transmission state.
The GPS receiver 24 receives radio waves carrying downlink data, including positioning data, from a plurality of GPS satellites. The positioning data is used to detect the absolute position of the mounted vehicle from latitude and longitude information. The GPS receiver 24 is electrically connected to the main control unit 7 and transmits the latitude and longitude information of the mounted vehicle to the main control unit 7.
The route information acquisition unit 25 acquires, by radio, information distributed from a VICS (Vehicle Information Communication System) center or the like (hereinafter, "VICS information"). The route information acquisition unit 25 also holds map information and, when the driver of the mounted vehicle has set a destination, information on the route guidance (route guidance information). The route information acquisition unit 25 is electrically connected to the main control unit 7 and exchanges this information with it.
The main control unit 7 acquires acoustic information, such as the S/N ratio of the utterance data Sa input from a microphone or the like (not shown), via a sensor or the like (not shown), and supplies it to the recognition result processing unit 5 as environment information Ic. The main control unit 7 also acquires the vehicle information, latitude/longitude information, and route guidance information supplied from the ECU 23, the GPS receiver 24, and the route information acquisition unit 25, and supplies them to the recognition result processing unit 5 as environment information Ic. This is described in detail in the (Voice recognition method) section.
(Voice recognition method)
Next, the processing executed by the speech recognition apparatus 100b in the second embodiment will be described. Under the control of the main control unit 7, the recognition result processing unit 5 changes the slope of the weighting function Fw according to the supplied environment information Ic. Specifically, the recognition result processing unit 5 judges that the worse the speech recognition environment, the higher the possibility of misrecognition and hence of a re-utterance; in this case it makes the slope gentler, increasing the importance of the recognition result of the second recognition unit 4. Conversely, the recognition result processing unit 5 judges that the better the speech recognition environment, the lower the possibility of misrecognition; in this case it makes the slope of the weighting function Fw steeper, increasing the importance of the recognition result of the first recognition unit 3. The recognition result processing unit 5 can thereby set the weighting coefficient w appropriately for the recognition environment.
Specific examples of changing the slope of the weighting function Fw according to various kinds of environment information Ic are described below. These examples may also be applied in combination.
1. When vehicle information is used
When the vehicle information indicates that the traveling speed of the mounted vehicle is high, the recognition result processing unit 5 makes the slope of the weighting function Fw gentler than at low speed. Specifically, the recognition result processing unit 5 may, for example, vary the slope continuously with the traveling speed. In another example, the recognition result processing unit 5 may divide the traveling speed into several ranges in advance and define a slope for each range.
A supplementary explanation follows. When the traveling speed of the mounted vehicle is high, the driver's reaction to the speech recognition apparatus 100b is expected to be slower because more concentration is devoted to driving, and the noise level caused by traveling is also expected to rise. Taking these points into account, the recognition result processing unit 5 makes the slope of the weighting function Fw gentler when the vehicle information indicates a high traveling speed than at low speed, and can thereby set the weighting coefficient w appropriately.
2. When latitude/longitude information and route guidance information are used
Based on the latitude/longitude information and the route guidance information, the recognition result processing unit 5 makes the slope of the weighting function Fw gentler when the route being traveled passes a point that demands attention while driving. Specifically, the recognition result processing unit 5 judges, for example, whether the route being traveled is in an urban area, whether it is an accident-prone spot, and whether it is an intersection at which the mounted vehicle turns left or right. When the travel route corresponds to an urban area, an accident-prone spot, and/or a turning point, the recognition result processing unit 5 judges that misrecognition is more likely and makes the slope of the weighting function Fw gentler.
As described above, by using the latitude/longitude information and the route guidance information, the recognition result processing unit 5 can set the weighting coefficient w appropriately.
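These heuristics could be folded into a single environment factor that flattens the weighting function, as in the sketch below. All thresholds and scale values are illustrative assumptions; the patent derives the actual adjustment experimentally and stores it as a map.

```python
def environment_factor(speed_kmh, snr_db=None, risky_location=False):
    """Return a scale in (0, 1]; values below 1 indicate a poorer environment."""
    factor = 1.0
    if speed_kmh > 60:
        factor *= 0.7   # fast driving: slower reactions, more road noise
    if snr_db is not None and snr_db < 10:
        factor *= 0.8   # low S/N ratio: misrecognition more likely
    if risky_location:
        factor *= 0.8   # urban area, accident-prone spot, or a turn
    return factor

def adjusted_weighting_function(t, factor):
    """Flatten Fw by stretching the time axis: factor < 1 keeps w high longer,
    favoring the second recognition unit's result for a longer time."""
    return weighting_function(t * factor)
```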
(Processing flow)
Next, the processing procedure executed by the speech recognition apparatus 100b in the second embodiment will be described, again referring to FIG. 6, which was used in the description of the processing flow of the first embodiment. Only the parts that differ from the first embodiment are described below.
In the second embodiment, after executing step S105, the speech recognition apparatus 100b acquires the environment information Ic from the ECU 23 and other sources, and determines the weighting function Fw based on the environment information Ic. For example, the speech recognition apparatus 100b prepares in advance a plurality of weighting functions Fw corresponding to different environment information Ic and selects the weighting function Fw to use from the environment information Ic by referring to a predetermined map or the like. In another example, the speech recognition apparatus 100b changes a parameter that determines the slope of the weighting function Fw according to the environment information Ic by referring to a map or the like. In either case, the map or the like is created in advance through experiments. Then, in step S106, the speech recognition apparatus 100b determines the first and second weighting coefficients w1 and w2 from the weighting function Fw based on the elapsed time width T. The speech recognition apparatus 100b can thereby recognize re-utterances more accurately by taking the speech recognition environment into account.
(Modification 1)
In the description of the second embodiment above, the recognition result processing unit 5 makes the slope of the weighting function Fw gentler when it judges that the recognition environment is poor. Instead of, or in addition to, this, the recognition result processing unit 5 may, for example, temporarily stop the measurement of the elapsed time width T when it judges that the recognition environment is poor. In another example, when the recognition result processing unit 5 judges that the recognition environment is poor, it may reduce the elapsed time width T by dividing it by, or subtracting from it, a predetermined value. By these means as well, the speech recognition apparatus 100b can execute speech recognition processing appropriately for the recognition environment.
(Modification 2)
(Modification 1) to (Modification 6) of the first embodiment can also be applied to the second embodiment. In that case, the speech recognition apparatus 100b performs one or more processes selected from (Modification 1) to (Modification 6) of the first embodiment in addition to the processing of the second embodiment described above.
The present invention can be applied to various kinds of equipment that perform speech recognition processing, for example, car navigation devices, mobile phones, personal computers, AV equipment, home appliances, and other equipment having a voice input function.
DESCRIPTION OF SYMBOLS
1 Speech analysis unit
2 Speech section detection unit
3 First recognition unit
4 Second recognition unit
5 Recognition result processing unit
6 Presentation unit
7 Main control unit
8 Time measurement unit
9 Storage unit
20 Recognition unit
21 Recognition result holding unit
23 ECU
24 GPS receiver
25 Route information acquisition unit

Claims (10)

1. A speech recognition apparatus for recognizing hierarchical commands by speech, comprising:
dictionary storage means for storing dictionaries used in speech recognition at each level for recognizing the hierarchical commands;
first speech recognition means for, when speech data is input, performing speech recognition based on the dictionary to be used on the assumption that the previously input speech data was correctly recognized;
second speech recognition means for, when the speech data is input, performing speech recognition based on the dictionary used when speech recognition of the previously input speech data was executed; and
recognition result processing means for determining a final recognition result by weighting the recognition result of the first speech recognition means and the recognition result of the second speech recognition means based on an elapsed time width counted from the time of processing of the previously input speech data.
2. The speech recognition apparatus according to claim 1, wherein the recognition result processing means determines the final recognition result based only on the recognition result of the first speech recognition means when the elapsed time width is equal to or greater than a predetermined width.
3. The speech recognition apparatus according to claim 1 or 2, wherein the recognition result processing means determines the final recognition result based only on the recognition result of the second speech recognition means when the elapsed time width is less than a predetermined width.
4. The speech recognition apparatus according to any one of claims 1 to 3, wherein the first speech recognition means and the second speech recognition means calculate likelihoods as recognition results, and the recognition result processing means performs the weighting by multiplying the likelihood calculated by the first speech recognition means by a first parameter that is a non-increasing function of the elapsed time width, and multiplying the likelihood calculated by the second speech recognition means by a second parameter that is a non-decreasing function of the elapsed time width.
5. The speech recognition apparatus according to claim 4, wherein the recognition result processing means sets the sum of the first parameter and the second parameter to a constant value.
6. The speech recognition apparatus according to any one of claims 1 to 5, further comprising environment information acquisition means for acquiring information on the recognition environment, wherein the recognition result processing means, when determining from the information that the recognition environment is poor, increases the weight of the recognition result of the second speech recognition means compared with when the recognition environment is not poor.
7. The speech recognition apparatus according to claim 6, mounted on a vehicle, wherein the environment information acquisition means acquires the vehicle speed of the vehicle and/or the degree of noise included in the speech data, and the recognition result processing means increases the weighting of the recognition result of the second speech recognition means as the vehicle speed and/or the degree of noise increases.
8. A speech recognition method used by a speech recognition apparatus that stores dictionaries used in speech recognition at each level for recognizing hierarchical commands, the method comprising:
a first speech recognition step of, when speech data is input, performing speech recognition based on the dictionary to be used on the assumption that the previously input speech data was correctly recognized;
a second speech recognition step of, when the speech data is input, performing speech recognition based on the dictionary used when speech recognition of the previously input speech data was executed; and
a recognition result processing step of determining a final recognition result by weighting the recognition result of the first speech recognition step and the recognition result of the second speech recognition step based on an elapsed time width counted from the time of processing of the previously input speech data.
9. A speech recognition program executed by a speech recognition apparatus that stores dictionaries used in speech recognition at each level for recognizing hierarchical commands, the program causing the speech recognition apparatus to function as:
first speech recognition means for, when speech data is input, performing speech recognition based on the dictionary to be used on the assumption that the previously input speech data was correctly recognized;
second speech recognition means for, when the speech data is input, performing speech recognition based on the dictionary used when speech recognition of the previously input speech data was executed; and
recognition result processing means for determining a final recognition result by weighting the recognition result of the first speech recognition means and the recognition result of the second speech recognition means based on an elapsed time width counted from the time of processing of the previously input speech data.
10. A storage medium storing the program according to claim 9.
PCT/JP2009/063996 2009-08-07 2009-08-07 Voice recognition device, voice recognition method, and voice recognition program WO2011016129A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2011503691A JPWO2011016129A1 (en) 2009-08-07 2009-08-07 Speech recognition apparatus, speech recognition method, and speech recognition program
PCT/JP2009/063996 WO2011016129A1 (en) 2009-08-07 2009-08-07 Voice recognition device, voice recognition method, and voice recognition program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2009/063996 WO2011016129A1 (en) 2009-08-07 2009-08-07 Voice recognition device, voice recognition method, and voice recognition program

Publications (1)

Publication Number Publication Date
WO2011016129A1 true WO2011016129A1 (en) 2011-02-10

Family

ID=43544048

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2009/063996 WO2011016129A1 (en) 2009-08-07 2009-08-07 Voice recognition device, voice recognition method, and voice recognition program

Country Status (2)

Country Link
JP (1) JPWO2011016129A1 (en)
WO (1) WO2011016129A1 (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH04199198A (en) * 1990-11-29 1992-07-20 Matsushita Electric Ind Co Ltd Speech recognition device
JP2002041078A (en) * 2000-07-21 2002-02-08 Sharp Corp Voice recognition equipment, voice recognition method and program recording medium
JP2003177788A (en) * 2001-12-12 2003-06-27 Fujitsu Ltd Audio interactive system and its method
JP2008089625A (en) * 2006-09-29 2008-04-17 Honda Motor Co Ltd Voice recognition apparatus, voice recognition method and voice recognition program

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH04199198A (en) * 1990-11-29 1992-07-20 Matsushita Electric Ind Co Ltd Speech recognition device
JP2002041078A (en) * 2000-07-21 2002-02-08 Sharp Corp Voice recognition equipment, voice recognition method and program recording medium
JP2003177788A (en) * 2001-12-12 2003-06-27 Fujitsu Ltd Audio interactive system and its method
JP2008089625A (en) * 2006-09-29 2008-04-17 Honda Motor Co Ltd Voice recognition apparatus, voice recognition method and voice recognition program

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016180917A (en) * 2015-03-25 2016-10-13 日本電信電話株式会社 Correction speech detection device, voice recognition system, correction speech detection method, and program
JP2018180943A (en) * 2017-04-13 2018-11-15 新日鐵住金株式会社 Plan creation apparatus, plan creation method, and program
CN107577151A (en) * 2017-08-25 2018-01-12 谢锋 A kind of method, apparatus of speech recognition, equipment and storage medium
JP2019053143A (en) * 2017-09-13 2019-04-04 アルパイン株式会社 Voice recognition system and computer program
US20220020362A1 (en) * 2020-07-17 2022-01-20 Samsung Electronics Co., Ltd. Speech signal processing method and apparatus
US11670290B2 (en) * 2020-07-17 2023-06-06 Samsung Electronics Co., Ltd. Speech signal processing method and apparatus

Also Published As

Publication number Publication date
JPWO2011016129A1 (en) 2013-01-10

Similar Documents

Publication Publication Date Title
US10121467B1 (en) Automatic speech recognition incorporating word usage information
US9196248B2 (en) Voice-interfaced in-vehicle assistance
US7747437B2 (en) N-best list rescoring in speech recognition
JP4433704B2 (en) Speech recognition apparatus and speech recognition program
US10381000B1 (en) Compressed finite state transducers for automatic speech recognition
JP4846735B2 (en) Voice recognition device
US9715877B2 (en) Systems and methods for a navigation system utilizing dictation and partial match search
EP2082335A2 (en) System and method for a cooperative conversational voice user interface
US20220358908A1 (en) Language model adaptation
JP2010191400A (en) Speech recognition system and data updating method
US10199037B1 (en) Adaptive beam pruning for automatic speech recognition
WO2011016129A1 (en) Voice recognition device, voice recognition method, and voice recognition program
US20230102157A1 (en) Contextual utterance resolution in multimodal systems
US11145296B1 (en) Language and grammar model adaptation
JP2006308848A (en) Vehicle instrument controller
JP2009230068A (en) Voice recognition device and navigation system
JP4770374B2 (en) Voice recognition device
CN110556104B (en) Speech recognition device, speech recognition method, and storage medium storing program
WO2012076895A1 (en) Pattern recognition
KR101063159B1 (en) Address Search using Speech Recognition to Reduce the Number of Commands
CN111798842B (en) Dialogue system and dialogue processing method
KR20100073178A (en) Speaker adaptation apparatus and its method for a speech recognition
KR20060098673A (en) Method and apparatus for speech recognition
KR102527346B1 (en) Voice recognition device for vehicle, method for providing response in consideration of driving status of vehicle using the same, and computer program
US20230315997A9 (en) Dialogue system, a vehicle having the same, and a method of controlling a dialogue system

Legal Events

Date Code Title Description
ENP Entry into the national phase: Ref document number: 2011503691; Country of ref document: JP; Kind code of ref document: A
121 Ep: the epo has been informed by wipo that ep was designated in this application: Ref document number: 09848061; Country of ref document: EP; Kind code of ref document: A1
WWE Wipo information: entry into national phase: Ref document number: 13061652; Country of ref document: US
NENP Non-entry into the national phase: Ref country code: DE
122 Ep: pct application non-entry in european phase: Ref document number: 09848061; Country of ref document: EP; Kind code of ref document: A1