JP6817915B2 - Speech recognition devices, in-vehicle systems and computer programs

Info

Publication number: JP6817915B2
Application number: JP2017164874A
Authority: JP
Inventors: 信範工藤
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2017-08-29
Filing date: 2017-08-29
Publication date: 2021-01-20
Anticipated expiration: 2037-08-29
Also published as: JP2019045532A

Description

The present invention relates to speech recognition technology for recognizing a user's spoken utterances.

One known speech recognition technique for recognizing a user's utterances applies to a system equipped with an audio source device that outputs the sound of audio content, such as music, from a speaker. In addition to a first speech recognition unit that recognizes the user's utterance picked up by a microphone, the system is provided with a second speech recognition unit that performs speech recognition on the audio that the audio source device outputs to the speaker. When the result recognized by the first speech recognition unit matches the result recognized by the second speech recognition unit, the result of the first speech recognition unit is invalidated (see, for example, Patent Document 1).

This technique prevents a recognition result that the first speech recognition unit produced from audio source output leaking into the microphone from being treated as a recognition result of the user's utterance.

Patent Document 1: Japanese Registered Utility Model No. 2602342

Now, consider speech recognition of the audio picked up by a microphone that uses, for each of a plurality of recognition candidates (each a word or phrase), a score representing the magnitude of the difference between the picked-up audio and the audio of the candidate. In such a scheme, the score of each candidate against the input audio could be computed sequentially, in real time and in parallel with the audio input from the microphone, and when the score computed for any candidate falls to or below a predetermined threshold, that candidate could be determined as the recognition result.

However, applying the above technique of providing a first and a second speech recognition unit to this kind of speech recognition raises the following problem.
In this case, when the audio source device outputs audio of the same phrase as one of the recognition candidates, that audio is picked up by the microphone together with other sound components such as noise. The score that the first speech recognition unit computes for that candidate is therefore generally larger than the score that the second speech recognition unit computes for the same candidate from the same audio output by the audio source device.

As a result, for output audio of the audio source device whose phrase matches a recognition candidate, the first speech recognition unit determines its recognition result later than the second speech recognition unit does.

Therefore, to invalidate a recognition result that the first speech recognition unit produced from audio source output leaking into the microphone, an adjustment period must be defined that runs from the time the second speech recognition unit determines its recognition result until the above delay has elapsed. Any result that the first speech recognition unit recognizes during this adjustment period must be checked against the result that the second speech recognition unit recognized beforehand, and invalidated if the two match.

The delay, however, varies with the content of the audio source output and with the environment and cannot be known in advance, so the length of the adjustment period cannot be fixed uniquely. If the adjustment period is too short, recognition results that the first speech recognition unit produced from the audio source output can no longer be invalidated; if it is too long, even recognition results for speech the user actually uttered are invalidated.

An object of the present invention is therefore to perform speech recognition that more correctly recognizes only the speech uttered by the user in an environment where sound from an audio source device is emitted from a speaker.

To achieve this object, the present invention provides a speech recognition device that recognizes speech uttered in a space into which a speaker emits the sound output to it from an audio source device. The device comprises: a microphone arranged in the space; first speech recognition means that receives the audio picked up by the microphone and, in parallel with that input, performs speech recognition to recognize the phrase predicted to match the audio; second speech recognition means that receives the audio that the audio source device outputs to the speaker and, in parallel with that input, performs speech recognition to recognize the phrase predicted to match the audio output to the speaker; and recognition adjustment means that outputs the phrase recognized by the first speech recognition means as a recognition result. Here, once the second speech recognition means has recognized a phrase, it detects completion of the output, from the audio source, of the audio of the recognized phrase. When the second speech recognition means recognizes a phrase, the recognition adjustment means sets the recognized phrase as an adjustment phrase and thereafter, until the second speech recognition means detects completion of the output, or until a predetermined period has elapsed after it detects completion of the output, suppresses output, as a recognition result, of any phrase recognized by the first speech recognition means that is the same as the adjustment phrase.
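The suppression rule stated above can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the class name `RecognitionAdjuster`, its method names, the `hold_after_completion` parameter, and the use of caller-supplied timestamps are all assumptions introduced here for illustration.

```python
from typing import Optional


class RecognitionAdjuster:
    """Hypothetical sketch of the recognition adjustment means."""

    def __init__(self, hold_after_completion: float = 0.0) -> None:
        # Assumed "predetermined period" (seconds) to keep suppressing after
        # the second recognizer reports that the phrase has finished playing.
        self._hold = hold_after_completion
        self._adjust_phrase: Optional[str] = None
        # None means the audio source is still outputting the phrase.
        self._release_at: Optional[float] = None

    def on_second_recognized(self, phrase: str) -> None:
        """Second recognizer (audio source path) recognized a phrase."""
        self._adjust_phrase = phrase
        self._release_at = None

    def on_second_completed(self, now: float) -> None:
        """Second recognizer detected completion of the phrase's output."""
        if self._adjust_phrase is not None:
            self._release_at = now + self._hold

    def on_first_recognized(self, phrase: str, now: float) -> Optional[str]:
        """First recognizer (microphone path) recognized a phrase.

        Returns the phrase to output as the recognition result, or None if
        the output is suppressed.
        """
        if self._release_at is not None and now >= self._release_at:
            # The suppression window is over: forget the adjustment phrase.
            self._adjust_phrase = None
            self._release_at = None
        if phrase == self._adjust_phrase:
            return None
        return phrase
```

With `hold_after_completion=0.5`, a phrase recognized on the microphone path that equals the adjustment phrase is dropped until 0.5 s after completion is reported, while other phrases pass through unchanged at all times.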

Further, to achieve the above object, the present invention provides a speech recognition device that recognizes speech uttered in a space into which a speaker emits the sound output to it from an audio source device, the device comprising a microphone arranged in the space, first speech recognition means, second speech recognition means, and recognition adjustment means. The first speech recognition means receives the audio picked up by the microphone and, for each of a plurality of recognition candidates (each a word or phrase), each time the sound of a speech segment is input from the microphone, decreases the evaluation value of the candidate if the sound of that segment matches the sound of the corresponding segment of the candidate's pronunciation and increases it if it does not; it recognizes the phrase of a candidate whose evaluation value has fallen to or below a predetermined first threshold. The second speech recognition means receives the audio that the audio source device outputs to the speaker and, for each of the plurality of recognition candidates, each time the sound of a speech segment is input from the audio source device, decreases the evaluation value of the candidate if the sound of that segment matches the sound of the corresponding segment of the candidate's pronunciation and increases it if it does not; it recognizes the phrase of a candidate whose evaluation value has fallen to or below a predetermined second threshold and, after recognizing the phrase, detects the occurrence of a peak at which the evaluation value turns from decreasing to increasing. The recognition adjustment means outputs the phrase recognized by the first speech recognition means as a recognition result. Here, when the second speech recognition means recognizes a phrase, the recognition adjustment means sets the recognized phrase as an adjustment phrase and thereafter, until the second speech recognition means detects the occurrence of the peak, or until a predetermined period has elapsed after it detects the occurrence of the peak, suppresses output, as a recognition result, of any phrase recognized by the first speech recognition means that is the same as the adjustment phrase.

In such a speech recognition device, it is also preferable to set the second threshold to a value larger than the first threshold.
Further, to achieve the above object, the present invention provides a speech recognition device that recognizes speech uttered in a space into which a speaker emits the sound output to it from an audio source device, the device comprising a microphone arranged in the space, first speech recognition means, second speech recognition means, and recognition adjustment means. The first speech recognition means receives the audio picked up by the microphone and, for each of a plurality of recognition candidates (each a word or phrase), each time the sound of a speech segment is input from the microphone, increases the evaluation value of the candidate if the sound of that segment matches the sound of the corresponding segment of the candidate's pronunciation and decreases it if it does not; it recognizes the phrase of a candidate whose evaluation value has risen to or above a predetermined first threshold. The second speech recognition means receives the audio that the audio source device outputs to the speaker and, for each of the plurality of recognition candidates, each time the sound of a speech segment is input from the audio source device, increases the evaluation value of the candidate if the sound of that segment matches the sound of the corresponding segment of the candidate's pronunciation and decreases it if it does not; it recognizes the phrase of a candidate whose evaluation value has risen to or above a predetermined second threshold and, after recognizing the phrase, detects the occurrence of a peak at which the evaluation value turns from increasing to decreasing. The recognition adjustment means outputs the phrase recognized by the first speech recognition means as a recognition result. Here, when the second speech recognition means recognizes a phrase, the recognition adjustment means sets the recognized phrase as an adjustment phrase and thereafter, until the second speech recognition means detects the occurrence of the peak, or until a predetermined period has elapsed after it detects the occurrence of the peak, suppresses output, as a recognition result, of any phrase recognized by the first speech recognition means that is the same as the adjustment phrase.
In such a speech recognition device, it is also preferable to set the second threshold to a value smaller than the first threshold.

The present invention also provides an in-vehicle system comprising the above speech recognition device together with the speaker and the audio source device, both mounted in an automobile. In this in-vehicle system, the space is the cabin of the automobile.

In the speech recognition system and in-vehicle system described above, once the second speech recognition means recognizes a phrase, output as a recognition result of any phrase that the first speech recognition means recognizes and that is the same as the phrase recognized by the second speech recognition means is suppressed until completion of the output of that phrase's audio from the audio source is detected, or until a predetermined period has elapsed after that completion is detected.

Here, the first speech recognition means recognizes a phrase in audio that the audio source device output through the speaker while the audio source device is still outputting the audio of that phrase. In addition, because the first speech recognition means recognizes the phrase from microphone output in which the audio source's sound emitted from the speaker is mixed with other sound such as noise, it recognizes the phrase after the second speech recognition means has recognized it.

Therefore, with the speech recognition system and in-vehicle system described above, output of a recognition result from the first speech recognition means that is the same phrase as the one recognized by the second speech recognition means is suppressed only during the period in which the first speech recognition means could recognize the phrase from the audio output by the audio source device. This prevents a phrase recognized from audio that the audio source device output through the speaker from being output as a recognition result of the user's utterance, while phrases recognized from speech the user actually uttered are correctly output as recognition results of the user's utterance.

As described above, according to the present invention, speech recognition that more correctly recognizes only the speech uttered by the user can be performed in an environment where sound from an audio source device is emitted from a speaker.

A block diagram showing the configuration of the information processing system according to an embodiment of the present invention.
A diagram showing the speech recognition technique according to the embodiment of the present invention.
A flowchart showing the recognition adjustment processing according to the embodiment of the present invention.
A diagram showing a processing example of the recognition adjustment processing according to the embodiment of the present invention.

Hereinafter, an embodiment of the speech recognition device according to the present invention will be described, taking as an example its application to an information processing system mounted in an automobile.
FIG. 1 shows the configuration of the information processing system according to the present embodiment.
As shown in the figure, the information processing system includes a microphone 1, a first speech recognition engine 2, a second speech recognition engine 3, a speech recognition dictionary 4, a recognition adjustment unit 5, a voice input control unit 6, one or more applications 7 such as a navigation application, an audio source 8, and a speaker 9.

In this configuration, the audio source 8 is a device that serves as a sound source, such as a radio receiver or a music player, and outputs the sound of audio content to the speaker 9 and to the second speech recognition engine 3.

The speaker 9 emits the audio input from the audio source 8 into the vehicle cabin.
The speech recognition dictionary 4 is a dictionary for speech recognition in which a plurality of recognition candidates, each a different phrase, are registered together with pronunciation data representing their pronunciation. The pronunciation data may represent the phoneme sequence of the phrase's pronunciation, or may be audio data of the phrase being pronounced, or the like.

Next, the first speech recognition engine 2 uses the speech recognition dictionary 4 to perform speech recognition processing on the audio input from the microphone 1, and outputs each recognized phrase to the recognition adjustment unit 5 as a recognized phrase.

Likewise, the second speech recognition engine 3 uses the speech recognition dictionary 4 to perform speech recognition processing on the audio input from the audio source 8, and outputs each recognized phrase to the recognition adjustment unit 5 as a recognized phrase.

The recognition adjustment unit 5 uses the recognized phrases input from the first speech recognition engine 2 and from the second speech recognition engine 3 to determine the phrase to be used as the recognition result, and outputs the determined recognition result to the voice input control unit 6. The operation of the recognition adjustment unit 5 is described in detail later.

The voice input control unit 6 outputs a voice input corresponding to the recognition result output by the recognition adjustment unit 5 to the application 7, and the application 7 accepts the voice input and performs processing according to its content.

The speech recognition operation performed by the first speech recognition engine 2 and the second speech recognition engine 3 is described below.
In parallel with the input of the recognition target speech (the audio output by the microphone 1 for the first speech recognition engine 2, and the audio output by the audio source 8 for the second speech recognition engine 3), the first speech recognition engine 2 and the second speech recognition engine 3 each compute, for the recognition target speech, the score of every recognition candidate stored in the speech recognition dictionary 4.

Here, the score of a recognition candidate for the recognition target speech represents a predicted value of the magnitude of the difference between the recognition target speech and the pronunciation data of the candidate; the larger the predicted difference, the larger the score.

More specifically, the score is computed as follows. A predetermined initial value is first set as the score. Then, each time the sound of a speech segment of the recognition target speech (for example, the segment for one phoneme) is input, the engine determines whether the sound of that segment matches the corresponding part of the pronunciation data of each recognition candidate. If it matches, the score is decreased by a predetermined value; if it does not match, the score is increased by a predetermined value.
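The update rule just described can be sketched as follows. This is a minimal sketch under stated assumptions: each kana character stands in for one speech segment, segments are compared position by position (a real engine would need proper acoustic alignment), and the function name `score_trace`, the initial value 5.0, and the step 1.0 are hypothetical choices, not values from the specification.

```python
from typing import List, Optional, Sequence


def score_trace(
    target: Sequence[str],
    candidate: Sequence[str],
    initial: float = 5.0,
    step: float = 1.0,
) -> List[float]:
    """Return the score after each input segment of `target`.

    The score starts at `initial`, drops by `step` when the incoming segment
    matches the candidate's segment at the same position, and rises by `step`
    otherwise (including when the candidate has no segment left).
    """
    score = initial
    trace: List[float] = []
    for i, segment in enumerate(target):
        expected: Optional[str] = candidate[i] if i < len(candidate) else None
        if segment == expected:
            score -= step  # match: predicted difference shrinks
        else:
            score += step  # mismatch: predicted difference grows
        trace.append(score)
    return trace


if __name__ == "__main__":
    # Candidate "あいうえお" against input "あいうえおか": falls, then rises at "か".
    print(score_trace("あいうえおか", "あいうえお"))
    # Candidate "あいうあい": falls for "あいう", then rises for "えおか".
    print(score_trace("あいうえおか", "あいうあい"))
```

Under these assumed values the first call yields a score that falls for five segments and rises on the sixth, and the second falls for three segments and then rises for three.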

With this kind of speech recognition, FIG. 2a shows, for the recognition target speech あいうえおか (a-i-u-e-o-ka), how the score computed for the recognition candidate あいうえお (a-i-u-e-o) and the score computed for the recognition candidate あいうあい (a-i-u-a-i) change over time. While sounds of the recognition target speech that match a candidate are being input, the score for that candidate decreases step by step; while sounds that do not match the candidate are being input, the score increases step by step.

For example, as shown in FIG. 2a1, the score between the recognition target speech あいうえおか (a-i-u-e-o-ka) and the recognition candidate あいうえお (a-i-u-e-o) decreases step by step while the sounds あいうえお of the recognition target speech are being input, and then increases when the か of the recognition target speech is input.

Similarly, as shown in FIG. 2a2, the score between the recognition target speech あいうえおか (a-i-u-e-o-ka) and the recognition candidate あいうあい (a-i-u-a-i) decreases step by step while the sounds あいう of the recognition target speech are being input, and then increases step by step while the subsequent sounds えおか are being input.

When the score computed in this way between the recognition target speech and any recognition candidate falls to or below a threshold Th, the first speech recognition engine 2 and the second speech recognition engine 3 recognize the phrase of that candidate and output it to the recognition adjustment unit 5 as a recognized phrase.

In the case shown in FIG. 2a1, for example, the score for the recognition candidate あいうえお (a-i-u-e-o) falls to or below the threshold Th immediately before the え of the recognition target speech あいうえおか (a-i-u-e-o-ka) is input, so at that point the candidate あいうえお is output to the recognition adjustment unit 5 as a recognized phrase.

In the case shown in FIG. 2a2, on the other hand, the score for the recognition candidate あいうあい (a-i-u-a-i) never falls to or below the threshold Th, so the phrase of this candidate is not recognized.
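The threshold test can be folded into the streaming score update, as in the sketch below. It is an illustration only: the function name `recognize_stream`, the position-by-position comparison of kana characters, and the numeric values (initial score 5.0, step 1.0, threshold 1.5) are assumptions chosen for this example and do not reproduce the exact crossing point drawn in FIG. 2a.

```python
from typing import Dict, Optional, Sequence, Tuple


def recognize_stream(
    segments: Sequence[str],
    candidates: Sequence[str],
    threshold: float,
    initial: float = 5.0,
    step: float = 1.0,
) -> Optional[Tuple[str, int]]:
    """Score all candidates in parallel with the input.

    Returns (candidate, segment_index) for the first candidate whose score
    falls to or below `threshold`, or None if no candidate ever does.
    """
    scores: Dict[str, float] = {c: initial for c in candidates}
    for i, segment in enumerate(segments):
        for cand in candidates:
            expected = cand[i] if i < len(cand) else None
            scores[cand] += -step if segment == expected else step
        best = min(candidates, key=lambda c: scores[c])
        if scores[best] <= threshold:
            return best, i
    return None


if __name__ == "__main__":
    result = recognize_stream("あいうえおか", ["あいうえお", "あいうあい"], threshold=1.5)
    print(result)  # the candidate "あいうえお" is recognized before the input ends
```

Because recognition fires as soon as a score crosses the threshold, the phrase is reported while the input is still arriving, which is what makes the timing difference between the two engines matter later in the description.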

In the speech recognition described above, even when the score between the recognition target speech and one of the recognition candidates has fallen to or below the threshold Th, recognition may be deferred if there is another candidate whose score differs from that candidate's score by less than a predetermined level. In that case, the candidate with the lowest score is recognized and output to the recognition adjustment unit 5 as a recognized phrase once the difference between its score and the scores of the other candidates has grown to the predetermined level or more. Alternatively, the engine may wait for the voice input to the microphone 1 to end, and then recognize the candidate with the lowest score at that point and output it to the recognition adjustment unit 5 as a recognized phrase.
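The optional deferral rule can be sketched as a decision function over the current scores. This is one possible reading, offered as an assumption: the names `pick_with_margin` and `pick_at_end_of_input` and the `margin` parameter (standing in for the "predetermined level") are hypothetical.

```python
from typing import Dict, Optional


def pick_with_margin(
    scores: Dict[str, float], threshold: float, margin: float
) -> Optional[str]:
    """Return the best candidate only if it is at or below `threshold` and
    leads every other candidate by at least `margin`; otherwise defer (None)."""
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    best_phrase, best_score = ranked[0]
    if best_score > threshold:
        return None  # no candidate has reached the threshold yet
    if len(ranked) > 1 and ranked[1][1] - best_score < margin:
        return None  # a rival is still too close: defer recognition
    return best_phrase


def pick_at_end_of_input(scores: Dict[str, float]) -> Optional[str]:
    """Alternative rule: once input has ended, take the lowest-scoring candidate."""
    if not scores:
        return None
    return min(scores, key=lambda phrase: scores[phrase])
```

Calling `pick_with_margin` after every segment defers the decision while two candidates remain close, and `pick_at_end_of_input` covers the alternative of deciding only once the input has ended.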

Next, FIG. 2b shows an example of the speech recognition that the first speech recognition engine 2 and the second speech recognition engine 3 perform on the output audio of the audio source 8.
As shown in the figure, when the audio source 8 outputs the audio なにぬねのは (na-ni-nu-ne-no-ha), this audio is emitted from the speaker 9 and picked up by the microphone 1, and the first speech recognition engine 2 performs speech recognition on it.

Meanwhile, the output audio なにぬねのは (na-ni-nu-ne-no-ha) of the audio source 8 is also sent directly to the second speech recognition engine 3, which likewise performs speech recognition on it.

FIG. 2b1 shows how the score computed for the recognition candidate なにぬねの (na-ni-nu-ne-no) changes over time when the first speech recognition engine 2 performs speech recognition on the output audio なにぬねのは (na-ni-nu-ne-no-ha) of the audio source 8. FIG. 2b2 shows how the score computed for the same candidate なにぬねの changes over time when the second speech recognition engine 3 performs speech recognition on the same output audio.

In this case, as shown in the figures, during the period in which the なにぬねの (na-ni-nu-ne-no) portion of the audio source 8 output なにぬねのは (na-ni-nu-ne-no-ha) is input, both the score that the first speech recognition engine 2 computes for the candidate なにぬねの and the score that the second speech recognition engine 3 computes for it decrease step by step, but the score computed by the first speech recognition engine 2 decreases more slowly than the score computed by the second speech recognition engine 3. This is because the audio input to the first speech recognition engine 2 is the output of the audio source 8 as picked up by the microphone 1 together with other sound components such as noise. As a result, each speech segment of the audio output from the microphone 1 matches the corresponding part of the pronunciation data of the candidate なにぬねの less well than does each speech segment of the audio source 8 output that is input directly to the second speech recognition engine 3.

In the present embodiment, the threshold value Th set in the first voice recognition engine 2 is a value Th1 that is smaller than the value Th2 of the threshold value Th set in the second voice recognition engine 3.
Therefore, as shown in FIGS. 2b1 and 2b2, the score calculated for the recognition candidate "naninuneno" in the second voice recognition engine 3 falls to or below its threshold value Th earlier than the score calculated for the same candidate in the first voice recognition engine 2. Consequently, the recognition candidate "naninuneno" is first recognized by the second voice recognition engine 3 and output to the recognition adjustment unit 5 as a recognized phrase, and only after a delay is it recognized by the first voice recognition engine 2 and output to the recognition adjustment unit 5 as a recognized phrase.

Note that, in the present embodiment, the threshold value Th1 set in the first voice recognition engine 2 is made smaller than the threshold value Th2 set in the second voice recognition engine 3 so that recognition of the output voice of the audio source 8 reliably takes place in the second voice recognition engine 3 before it takes place in the first voice recognition engine 2.

After the second voice recognition engine 3 recognizes a recognition candidate whose score has fallen to or below the threshold value Th2 and outputs it to the recognition adjustment unit 5 as a recognized phrase, it continues to monitor the score calculated for that candidate. As shown in FIG. 2b2, when a downward peak (the point at which the score turns from decreasing to increasing) appears in the score waveform, the second voice recognition engine 3 detects it and notifies the recognition adjustment unit 5 that a peak has been detected.
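
The behavior of the second voice recognition engine 3 for a single recognition candidate can be sketched as follows. This is only an illustration under stated assumptions: the function name `second_engine_events`, the event labels, and the idea of receiving the per-section scores as a plain sequence are hypothetical and do not come from the specification. A plateau (equal consecutive scores) is not treated as a peak here, which is also an assumption.

```python
from typing import Iterable, Iterator, Tuple


def second_engine_events(scores: Iterable[float], th2: float) -> Iterator[Tuple[int, str]]:
    """Yield (section_index, event) pairs for one recognition candidate.

    "recognized": the score has first fallen to or below th2.
    "peak": after recognition, the score has turned from decreasing to increasing.
    """
    recognized = False
    prev = None  # score of the previous voice section
    for i, score in enumerate(scores):
        if not recognized:
            if score <= th2:
                recognized = True
                yield (i, "recognized")
        elif prev is not None and score > prev:
            # The previous section was the local minimum (downward peak).
            yield (i, "peak")
            return
        prev = score


# Hypothetical score sequence: falls while the phrase is spoken, then rises.
events = list(second_engine_events([10, 8, 6, 4, 2, 1, 3, 5], th2=5))
print(events)
```

In this sketch the peak is reported one section after the minimum, because the turn from decrease to increase can only be observed once the next score arrives.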

The voice recognition operations performed by the first voice recognition engine 2 and the second voice recognition engine 3 have been described above.
The following describes how the recognition adjustment unit 5, as mentioned above, uses the recognized phrases input from the first voice recognition engine 2 and from the second voice recognition engine 3 to determine the phrase to be used as the recognition result, and outputs that recognition result to the voice input control unit 6.

FIG. 3 shows the procedure of the recognition adjustment process performed by the recognition adjustment unit 5.
As illustrated, in this process the recognition adjustment unit 5 monitors for the input of a recognized phrase from the first voice recognition engine 2 (step 302), the input of a recognized phrase from the second voice recognition engine 3 (step 304), and the input of a peak detection notification from the second voice recognition engine 3 (step 306).

When a recognized phrase is input from the second voice recognition engine 3 (step 304), the recognition adjustment unit 5 sets the mask-period flag (step 312) and sets the recognized phrase input from the second voice recognition engine 3 as the adjustment phrase (step 314). It then returns to the monitoring of steps 302, 304, and 306.

When a peak detection notification is input from the second voice recognition engine 3 (step 306), the recognition adjustment unit 5 clears the mask-period flag (step 308) and clears the adjustment phrase (step 310). It then returns to the monitoring of steps 302, 304, and 306.

When a recognized phrase is input from the first voice recognition engine 2 (step 302), the recognition adjustment unit 5 checks whether the mask-period flag is set (step 316). If it is not set, the unit adopts the recognized phrase input from the first voice recognition engine 2 as the recognition result and outputs that result to the voice input control unit 6 (step 320). It then returns to the monitoring of steps 302, 304, and 306.

If step 316 determines that the mask-period flag is set, the recognition adjustment unit 5 checks whether the recognized phrase input from the first voice recognition engine 2 matches the adjustment phrase (step 318). If they match, the unit discards the recognized phrase input from the first voice recognition engine 2 and returns directly to the monitoring of steps 302, 304, and 306.

If the recognized phrase input from the first voice recognition engine 2 does not match the adjustment phrase (step 318), the recognition adjustment unit 5 adopts the recognized phrase input from the first voice recognition engine 2 as the recognition result and outputs that result to the voice input control unit 6 (step 320). It then returns to the monitoring of steps 302, 304, and 306.
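
The flow of FIG. 3 described above can be sketched as a small event-driven state holder. This is a minimal illustration rather than the implementation in the specification: the class name `RecognitionAdjuster`, its method names, and the use of `None` to represent a discarded phrase are all hypothetical. The step numbers in the comments refer to the steps named in the text.

```python
from typing import Optional


class RecognitionAdjuster:
    """Sketch of the recognition adjustment process of FIG. 3 (hypothetical API)."""

    def __init__(self) -> None:
        self.mask_period = False                      # mask-period flag
        self.adjustment_phrase: Optional[str] = None  # phrase from the second engine

    def on_second_engine_phrase(self, phrase: str) -> None:
        # Step 304 -> steps 312 and 314: start masking and remember the phrase.
        self.mask_period = True
        self.adjustment_phrase = phrase

    def on_second_engine_peak(self) -> None:
        # Step 306 -> steps 308 and 310: stop masking and forget the phrase.
        self.mask_period = False
        self.adjustment_phrase = None

    def on_first_engine_phrase(self, phrase: str) -> Optional[str]:
        # Step 302 -> steps 316 and 318: discard only a matching phrase during masking.
        if self.mask_period and phrase == self.adjustment_phrase:
            return None
        # Step 320: adopt the phrase as the recognition result.
        return phrase


adjuster = RecognitionAdjuster()
adjuster.on_second_engine_phrase("jitaku ni kaeru")
print(adjuster.on_first_engine_phrase("jitaku ni kaeru"))  # discarded during the mask period
adjuster.on_second_engine_peak()
print(adjuster.on_first_engine_phrase("jitaku ni kaeru"))  # adopted after the mask is cleared
```

Note that a phrase from the first engine that differs from the adjustment phrase passes through even while the mask-period flag is set, so a user's own utterance of a different phrase is not suppressed.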

The recognition adjustment process performed by the recognition adjustment unit 5 has been described above.
FIG. 4 shows an example of this recognition adjustment process.
FIG. 4 illustrates how the first voice recognition engine 2 and the second voice recognition engine 3 recognize the recognition candidate "jitaku ni kaeru" (じたくにかえる) for the output voice "jitaku ni kaeru hito ..." (じたくにかえるひと...) of the audio source 8.

When the audio source 8 outputs the voice "jitaku ni kaeru hito ...", the voice is emitted from the speaker 9 and picked up by the microphone 1, and the first voice recognition engine 2 calculates the score of the recognition candidate "jitaku ni kaeru" for this voice, as shown in FIG. 4a.

The same output voice "jitaku ni kaeru hito ..." of the audio source 8 is also sent directly to the second voice recognition engine 3, which likewise calculates the score of the recognition candidate "jitaku ni kaeru", as shown in FIG. 4b.

In this case, during the period in which the "jitaku ni kaeru" portion of the output voice "jitaku ni kaeru hito ..." of the audio source 8 is input, both the score that the first voice recognition engine 2 calculates for the recognition candidate "jitaku ni kaeru" and the score that the second voice recognition engine 3 calculates for it decrease steadily, but the score calculated by the first voice recognition engine 2 decreases more slowly than the score calculated by the second voice recognition engine 3.

In addition, the threshold value Th set in the first voice recognition engine 2 is the value Th1, which is smaller than the value Th2 of the threshold value Th set in the second voice recognition engine 3.
Therefore, the score calculated for the recognition candidate "jitaku ni kaeru" in the second voice recognition engine 3 falls to or below its threshold value Th earlier than the score calculated for the same candidate in the first voice recognition engine 2, and at time t21 the second voice recognition engine 3 recognizes the recognition candidate "jitaku ni kaeru" and outputs it to the recognition adjustment unit 5 as a recognized phrase.

When the second voice recognition engine 3 outputs the recognized phrase "jitaku ni kaeru" at time t21, the recognition adjustment unit 5 sets this phrase as the adjustment phrase and sets the mask-period flag.

Thereafter, the second voice recognition engine 3 monitors the transition of the score calculated for the recognition candidate "jitaku ni kaeru". When a downward peak (the point at which the score turns from decreasing to increasing) appears in the score waveform at time t22, the second voice recognition engine 3 detects the peak and notifies the recognition adjustment unit 5 of the detection.

On being notified of the peak detection, the recognition adjustment unit 5 clears the mask-period flag.
Meanwhile, after the second voice recognition engine 3 recognizes the recognition candidate "jitaku ni kaeru" at time t21 and outputs it to the recognition adjustment unit 5 as a recognized phrase, the score calculated for the same candidate in the first voice recognition engine 2 also falls to or below its threshold value Th, and at time t11 the first voice recognition engine 2 recognizes the recognition candidate "jitaku ni kaeru" and outputs it to the recognition adjustment unit 5 as a recognized phrase.

Here, the time t11 at which the first voice recognition engine 2 recognizes the recognition candidate "jitaku ni kaeru" and outputs it to the recognition adjustment unit 5 as a recognized phrase falls within the period during which the audio source 8 is outputting the portion of the voice "jitaku ni kaeru hito ..." that matches the recognition candidate "jitaku ni kaeru".

The time t22 at which the second voice recognition engine 3 detects the peak, on the other hand, is the end of the period during which the audio source 8 outputs the portion of the voice "jitaku ni kaeru hito ..." that matches the recognition candidate "jitaku ni kaeru".

Therefore, the time t11 at which the first voice recognition engine 2 recognizes the recognition candidate "jitaku ni kaeru" and outputs it to the recognition adjustment unit 5 as a recognized phrase falls within the period during which the mask-period flag is set.

When the first voice recognition engine 2 outputs the recognized phrase "jitaku ni kaeru" at time t11, the mask-period flag is set, so the recognition adjustment unit 5 compares the recognized phrase "jitaku ni kaeru" output from the first voice recognition engine 2 with the adjustment phrase "jitaku ni kaeru" currently set. Because the two match, the unit discards the recognized phrase "jitaku ni kaeru" output from the first voice recognition engine 2 rather than adopting it as the recognition result.

As a result, the phrase "jitaku ni kaeru" that the first voice recognition engine 2 recognized from the voice "jitaku ni kaeru hito ..." output by the audio source 8 is prevented from being output to the voice input control unit 6 as a recognition result of the user's speech.

An embodiment of the present invention has been described above.
In the above embodiment, the recognition adjustment process of the recognition adjustment unit 5 clears the mask-period flag when the second voice recognition engine 3 notifies it of peak detection. Alternatively, as shown in FIG. 4b, the flag may be cleared at time t23, when a predetermined margin time mgn has elapsed from time t22 at which the second voice recognition engine 3 notified the peak detection.

This more reliably prevents a phrase recognized by the first voice recognition engine 2 from the voice output by the audio source 8 from being output to the voice input control unit 6 as a recognition result.

Also, in the above embodiment, the recognition adjustment process of the recognition adjustment unit 5 clears the mask-period flag when the second voice recognition engine 3 notifies it of peak detection. The flag may instead be cleared by any other method, provided that it is cleared at the point when the audio source 8 finishes outputting the voice that matches the recognized phrase output by the second voice recognition engine 3. For example, that point may be detected from the number of phonemes output from the audio source 8, and the mask-period flag cleared at that point.

In this case too, the mask-period flag may be cleared when a predetermined margin time mgn has elapsed from the point at which the audio source 8 finishes outputting the voice that matches the recognized phrase output by the second voice recognition engine 3.
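
The effect of the margin time mgn on the mask period can be sketched as a simple time comparison. The function name `mask_active` and its parameters are hypothetical, and times are assumed to be plain numbers on a common clock; the specification does not prescribe how time is measured.

```python
from typing import Optional


def mask_active(now: float, t_recognized: float, t_end: Optional[float], mgn: float = 0.0) -> bool:
    """Return True while the mask period is in effect.

    t_recognized: time the second engine output its recognized phrase (t21 in FIG. 4).
    t_end: time the end of the matching output was detected (t22), or None if not yet.
    mgn: margin time added after t_end before the mask is cleared (t23 = t22 + mgn).
    """
    if now < t_recognized:
        return False  # nothing recognized yet, so no mask
    if t_end is None:
        return True   # end of the phrase not detected yet
    return now < t_end + mgn


# Hypothetical times: t21 = 1.0, t22 = 2.0, first-engine recognition arriving at 2.1.
print(mask_active(2.1, 1.0, 2.0))           # without margin, the late result slips through
print(mask_active(2.1, 1.0, 2.0, mgn=0.5))  # with margin, it is still masked
```

The margin trades a slightly longer suppression window for robustness against a first-engine result that arrives just after the end of the phrase.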

The above embodiment may also be implemented with the sign of the score reversed.
That is, the score of a recognition candidate for the recognition target voice may be calculated so that it becomes smaller the larger the predicted difference.
More specifically, the score may be calculated by first setting a predetermined initial value as the score and then, each time the sound of a voice section of the recognition target voice (for example, the voice section of each phoneme) is input, determining whether the sound of that voice section matches the corresponding portion of the pronunciation data of each recognition candidate, increasing the score by a predetermined value if it matches and decreasing the score by a predetermined value if it does not.

In this case, however, the first voice recognition engine 2 and the second voice recognition engine 3 recognize the phrase of a recognition candidate when the score calculated as above between the recognition target voice and that candidate becomes equal to or greater than the threshold value Th, and output it to the recognition adjustment unit 5 as a recognized phrase. The threshold value Th2 set as the threshold value Th of the second voice recognition engine 3 is made smaller than the threshold value Th1 set as the threshold value Th of the first voice recognition engine 2. Furthermore, after the second voice recognition engine 3 recognizes a recognition candidate whose score has become equal to or greater than the threshold value Th2 and outputs it to the recognition adjustment unit 5 as a recognized phrase, it monitors the subsequent transition of the score calculated for that candidate, and when an upward peak (the point at which the score turns from increasing to decreasing) appears in the score waveform, it detects the peak and notifies the recognition adjustment unit 5 of the detection.
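
The sign-reversed scoring can be illustrated with a short sketch. The function name `first_recognition_index`, the initial value 0, and the step of 1 are hypothetical choices made only for the example; the text says only that a predetermined initial value and predetermined increments are used. Per-section match results are assumed to be given as booleans.

```python
from typing import Optional, Sequence


def first_recognition_index(matches: Sequence[bool], threshold: int,
                            initial: int = 0, step: int = 1) -> Optional[int]:
    """Return the index of the voice section at which the candidate is recognized.

    The score starts at `initial`, rises by `step` for a matching section and
    falls by `step` for a non-matching one; recognition occurs when the score
    becomes equal to or greater than `threshold`.
    """
    score = initial
    for i, matched in enumerate(matches):
        score += step if matched else -step
        if score >= threshold:
            return i
    return None


sections = [True, True, False, True, True]  # hypothetical per-section match results
th1, th2 = 3, 2                             # Th2 < Th1 in the sign-reversed variant
print(first_recognition_index(sections, th2))  # engine with the lower threshold
print(first_recognition_index(sections, th1))  # engine with the higher threshold
```

With identical inputs, the lower threshold is reached at an earlier section, which is why Th2 is made smaller than Th1 in this variant: it keeps the second engine's recognition ahead of the first engine's.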

The voice recognition technique of the above embodiment is applicable not only to information processing systems mounted on automobiles but also to any information processing system that accepts voice input.

1 ... Microphone, 2 ... First voice recognition engine, 3 ... Second voice recognition engine, 4 ... Voice recognition dictionary, 5 ... Recognition adjustment unit, 6 ... Voice input control unit, 7 ... Application, 8 ... Audio source, 9 ... Speaker.

Claims

A voice recognition device that recognizes a voice uttered in a space into which sound output from an audio source device to a speaker is emitted from the speaker, the device comprising:
a microphone placed in the space;
a first voice recognition means that receives the voice picked up by the microphone and, in parallel with the input of the voice, performs voice recognition that recognizes a phrase predicted to match the voice;
a second voice recognition means that receives the voice output to the speaker by the audio source device and, in parallel with the input of the voice, performs voice recognition that recognizes a phrase predicted to match the voice output to the speaker; and
a recognition adjusting means that outputs a phrase recognized by the first voice recognition means as a recognition result,
wherein the second voice recognition means, upon recognizing a phrase, detects completion of the output, from the audio source device, of the voice of the recognized phrase, and
wherein the recognition adjusting means, when the second voice recognition means recognizes a phrase, sets the recognized phrase as an adjustment phrase and thereafter, until the second voice recognition means detects the completion of the output, or until a predetermined period elapses after the second voice recognition means detects the completion of the output, suppresses output, as a recognition result, of a phrase recognized by the first voice recognition means that is the same as the adjustment phrase.

A voice recognition device that recognizes a voice uttered in a space into which sound output from an audio source device to a speaker is emitted from the speaker, the device comprising:
a microphone placed in the space;
a first voice recognition means that receives the voice picked up by the microphone and, for each of a plurality of recognition candidates that are phrases, each time the sound of a voice section is input from the microphone, decreases an evaluation value of the recognition candidate when the sound of that voice section matches the sound of the corresponding section of the voice pronouncing the recognition candidate and increases the evaluation value when it does not match, and recognizes the phrase of a recognition candidate whose evaluation value has become equal to or less than a predetermined first threshold value;
a second voice recognition means that receives the voice output from the audio source device to the speaker and, for each of the plurality of recognition candidates, each time the sound of a voice section is input from the audio source device, decreases an evaluation value of the recognition candidate when the sound of that voice section matches the sound of the corresponding section of the voice pronouncing the recognition candidate and increases the evaluation value when it does not match, recognizes the phrase of a recognition candidate whose evaluation value has become equal to or less than a predetermined second threshold value, and, after recognizing the phrase of the recognition candidate, detects the occurrence of a peak at which the evaluation value turns from decreasing to increasing; and
a recognition adjusting means that outputs a phrase recognized by the first voice recognition means as a recognition result,
wherein the recognition adjusting means, when the second voice recognition means recognizes a phrase, sets the recognized phrase as an adjustment phrase and thereafter, until the second voice recognition means detects the occurrence of the peak, or until a predetermined period elapses after the second voice recognition means detects the occurrence of the peak, suppresses output, as a recognition result, of a phrase recognized by the first voice recognition means that is the same as the adjustment phrase.

The voice recognition device according to claim 2,
wherein a value larger than the first threshold value is set as the second threshold value.

A voice recognition device that recognizes a voice uttered in a space into which sound output from an audio source device to a speaker is emitted from the speaker, the device comprising:
a microphone placed in the space;
a first voice recognition means that receives the voice picked up by the microphone and, for each of a plurality of recognition candidates that are phrases, each time the sound of a voice section is input from the microphone, increases an evaluation value of the recognition candidate when the sound of that voice section matches the sound of the corresponding section of the voice pronouncing the recognition candidate and decreases the evaluation value when it does not match, and recognizes the phrase of a recognition candidate whose evaluation value has become equal to or greater than a predetermined first threshold value;
a second voice recognition means that receives the voice output from the audio source device to the speaker and, for each of the plurality of recognition candidates, each time the sound of a voice section is input from the audio source device, increases an evaluation value of the recognition candidate when the sound of that voice section matches the sound of the corresponding section of the voice pronouncing the recognition candidate and decreases the evaluation value when it does not match, recognizes the phrase of a recognition candidate whose evaluation value has become equal to or greater than a predetermined second threshold value, and, after recognizing the phrase of the recognition candidate, detects the occurrence of a peak at which the evaluation value turns from increasing to decreasing; and
a recognition adjusting means that outputs a phrase recognized by the first voice recognition means as a recognition result,
wherein the recognition adjusting means, when the second voice recognition means recognizes a phrase, sets the recognized phrase as an adjustment phrase and thereafter, until the second voice recognition means detects the occurrence of the peak, or until a predetermined period elapses after the second voice recognition means detects the occurrence of the peak, suppresses output, as a recognition result, of a phrase recognized by the first voice recognition means that is the same as the adjustment phrase.

The voice recognition device according to claim 4,
wherein a value smaller than the first threshold value is set as the second threshold value.

An in-vehicle system comprising the voice recognition device according to claim 1, 2, 3, 4 or 5 mounted on an automobile, and
the speaker and the audio source device mounted on the automobile,
wherein the space is an interior space of the automobile.

A computer program that is read and executed by a computer equipped with a microphone placed in a space into which sound output from an audio source device to a speaker is emitted from the speaker,
the computer program causing the computer to function as:
a first voice recognition means that receives the voice picked up by the microphone and, in parallel with the input of the voice, performs voice recognition that recognizes a phrase predicted to match the voice;
a second voice recognition means that receives the voice output to the speaker by the audio source device and, in parallel with the input of the voice, performs voice recognition that recognizes a phrase predicted to match the voice output to the speaker; and
a recognition adjusting means that outputs a phrase recognized by the first voice recognition means as a recognition result,
wherein the second voice recognition means, upon recognizing a phrase, detects completion of the output, from the audio source device, of the voice of the recognized phrase, and
wherein the recognition adjusting means, when the second voice recognition means recognizes a phrase, sets the recognized phrase as an adjustment phrase and thereafter, until the second voice recognition means detects the completion of the output, or until a predetermined period elapses after the second voice recognition means detects the completion of the output, suppresses output, as a recognition result, of a phrase recognized by the first voice recognition means that is the same as the adjustment phrase.