JP5173895B2

JP5173895B2 - Voice recognition device

Info

Publication number: JP5173895B2
Application number: JP2009054740A
Authority: JP
Inventors: 博昭関山; 利行難波; 義博大栄; 直樹三浦; 邦雄横井; 收岩田; 昌宏神谷; 位好寺澤; 錦一和田; 達之岡
Original assignee: Aisin AW Co Ltd; Denso Corp; Toyota Motor Corp; Toyota Central R&D Labs Inc
Current assignee: Aisin AW Co Ltd; Denso Corp; Toyota Motor Corp; Toyota Central R&D Labs Inc
Priority date: 2009-03-09
Filing date: 2009-03-09
Publication date: 2013-04-03
Anticipated expiration: 2029-03-09
Also published as: JP2010210756A

Description

The present invention relates to a speech recognition device.

Speech recognition devices are used in many fields, such as destination search in car navigation. A speech recognition device generally matches the pattern of each word stored in a speech recognition dictionary against the pattern of the speech (word) uttered by the user, and recognizes the word with the highest degree of match as the word the user uttered. Such speech recognition can produce misrecognitions. If the guidance given to the user is the same after every misrecognition (for example, "Please speak again."), the same exchange with the user simply repeats, and unless the cause of the misrecognition is identified, the misrecognition recurs indefinitely. The device described in Patent Document 1 therefore identifies the cause of a misrecognition (speech power, speaking rate, acoustic characteristics, or ambient noise) and presents that cause to the user. Furthermore, if misrecognition occurs again after a cause has been presented and the cause of the current misrecognition is the same as that of the previous one, the device presents the second most likely cause to the user.

Patent Document 1: JP 2004-325635 A
Patent Document 2: JP H10-133849 A
Patent Document 3: JP 2007-264126 A

Besides the causes listed above, misrecognition can also arise when the user uses a word that is not stored in the speech recognition dictionary. In this case, even if a cause is presented to the user as in the device above and the user, paying attention to that cause, utters the same word again, the device still cannot recognize the uttered speech (word) correctly, and the misrecognition is repeated.

An object of the present invention is therefore to provide a speech recognition device that suppresses repeated misrecognition.

A speech recognition device according to the present invention recognizes speech uttered by a user on the basis of words stored in a speech recognition dictionary. When a misrecognition occurs, the device prompts the user to speak again while paying attention to a cause of the misrecognition; if the same misrecognition then occurs again, the device determines that the user has uttered a word not stored in the speech recognition dictionary and prompts the user to rephrase. The cause of misrecognition to which the user is asked to pay attention when prompted to speak again is at least one of voice volume, speaking timing, and speaking rate.
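The decision rule in the preceding paragraph can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a recognition result is modeled as a tuple of recognized words, and the names `ATTENTION_CAUSES`, `Action`, and `decide_next_action` are hypothetical.

```python
from enum import Enum
from typing import Optional, Tuple

# Hypothetical labels for the causes the user may be asked to attend to when
# re-speaking (the source names voice volume, speaking timing and speaking rate).
ATTENTION_CAUSES = {"volume", "timing", "rate"}


class Action(Enum):
    PROMPT_ATTENTION = "prompt re-speaking with attention to a cause"
    PROMPT_REPHRASE = "prompt the user to rephrase"


def decide_next_action(previous_result: Optional[Tuple[str, ...]],
                       current_result: Tuple[str, ...],
                       last_prompted_cause: Optional[str]) -> Action:
    """Choose the next prompt once the user has rejected current_result."""
    same_misrecognition = previous_result is not None and previous_result == current_result
    if same_misrecognition and last_prompted_cause in ATTENTION_CAUSES:
        # Attending to volume/timing/rate did not change the result, so the
        # uttered word is assumed to be missing from the dictionary.
        return Action.PROMPT_REPHRASE
    # Otherwise keep asking the user to adjust how they speak.
    return Action.PROMPT_ATTENTION
```

For example, re-speaking more loudly after a prompt about volume and getting ("Roppongi", "parfait") a second time returns `Action.PROMPT_REPHRASE`.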

In this speech recognition device, words are stored in the speech recognition dictionary, and speech (words) uttered by the user is recognized on the basis of the stored word data. When a misrecognition occurs, the device determines that the user has uttered a word not stored in the speech recognition dictionary. As long as the user keeps uttering a word that is not in the dictionary, the device cannot recognize it correctly, so the user must be made to use a different word. The device therefore prompts the user to rephrase with different words, so that the user ends up using words that are stored in the dictionary. If the rephrased word is stored in the dictionary, the device can recognize it correctly. By prompting the user to rephrase when misrecognition occurs, the device thus suppresses repeated misrecognition, which in turn increases the user's confidence in the speech recognition device.

The speech recognition device of the present invention may also be configured to present the second-candidate word to the user when a misrecognition occurs.

In this speech recognition device, the degree of match between each word stored in the speech recognition dictionary and the speech (word) uttered by the user is computed in turn, and the first-candidate word (the word with the highest degree of match) is presented to the user first. If the first-candidate word turns out to be a misrecognition, the device presents the second-candidate word (the word with the next-highest degree of match). If the second-candidate word is the word the user actually uttered, the device has then recognized it correctly. Presenting the second-candidate word as well when misrecognition occurs therefore suppresses repeated misrecognition even further.
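The candidate presentation just described amounts to walking a ranked list until the user accepts a word. The sketch below assumes the degrees of match are already available as a dictionary of scores; `confirm_candidates` and the `user_accepts` callback (a stand-in for the spoken yes/no exchange) are hypothetical names, not part of the source.

```python
from typing import Callable, Dict, Optional


def confirm_candidates(scores: Dict[str, float],
                       user_accepts: Callable[[str], bool]) -> Optional[str]:
    """Present words in descending order of match score until one is accepted.

    scores: mapping from dictionary word to its (assumed) degree of match.
    user_accepts: callback standing in for the yes/no dialogue with the user.
    Returns the accepted word, or None if every candidate is rejected.
    """
    # First candidate = highest match, second = next highest, and so on.
    ranked = sorted(scores, key=scores.get, reverse=True)
    for word in ranked:
        if user_accepts(word):
            return word
    return None
```

With illustrative scores `{"parfait": 0.82, "cafe": 0.79}` and a user who accepts only "cafe", the function presents "parfait" first and then returns "cafe".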

By prompting the user to rephrase when misrecognition occurs, the present invention can suppress repeated misrecognition.

FIG. 1 is a block diagram of the speech recognition device according to the present embodiment.
FIG. 2 is a flowchart showing an example of the interaction between the speech recognition device according to the present embodiment and a user.
FIG. 3 is a flowchart showing another example of the interaction between the speech recognition device according to the present embodiment and a user.

Embodiments of the speech recognition device according to the present invention are described below with reference to the drawings.

In the present embodiment, the speech recognition device according to the present invention is applied to a speech recognition device mounted in a vehicle. It is used for various setting operations on devices mounted in the vehicle, such as the navigation device, air conditioner, and audio system. During these setting operations it recognizes speech (words) uttered by the user (the driver, etc.) and outputs the correctly recognized speech (words) to the relevant device.

The speech recognition device 1 according to the present embodiment is described with reference to FIG. 1, which is a block diagram of the speech recognition device according to the present embodiment.

The speech recognition device 1 recognizes speech (words) uttered by the user on the basis of the word data stored in the speech recognition dictionary 31, and determines from its dialogue with the user whether the recognition result is a misrecognition. In particular, when the device 1 determines that a misrecognition has occurred, it changes the guidance given to the user step by step so that the misrecognition is not repeated. To this end, the speech recognition device 1 includes a microphone 10, a speaker 20, and an ECU [Electronic Control Unit] 30.

The microphone 10 is mounted in the vehicle cabin (in particular, near the front seats) and picks up sound, that is, air vibrations (in particular, speech uttered by the user). When sound is input, the microphone 10 converts it into an electrical signal and sends that signal to the ECU 30 as an input speech signal.

The speaker 20 is an in-vehicle speaker shared with other systems. When the speaker 20 receives a guidance signal from the ECU 30, it outputs speech according to that signal.

The ECU 30 is an electronic control unit comprising a CPU [Central Processing Unit], ROM [Read Only Memory], RAM [Random Access Memory], and the like, and performs overall control of the speech recognition device 1. The ECU 30 receives the input speech signal from the microphone 10. When it receives an input speech signal during a setting operation on one of the vehicle's devices, the ECU 30 uses the speech recognition engine 32, referring to the speech recognition dictionary 31, to recognize the speech (words) uttered by the user. During the dialogue with the user, the ECU 30 uses the dialogue engine 33 to set the guidance content on the basis of the recognition result from the speech recognition engine 32, and sends a guidance signal to the speaker 20. In particular, when it determines that a misrecognition has occurred, the ECU 30 uses the dialogue engine 33 to set guidance content that tells the user what to pay attention to when re-entering speech. When it determines that the recognition is correct, the ECU 30 sends a recognition information signal containing information on the correctly recognized word to the device on which the setting operation is being performed.

The speech recognition dictionary 31 is provided in a predetermined area of the storage device of the ECU 30. It stores a large number of words that may be used in the various settings of the vehicle's devices, together with pattern data for each word (for example, a frequency characteristic pattern).

Each time the speech recognition engine 32 receives an input speech signal during a setting operation on one of the vehicle's devices, it performs frequency analysis on the speech (electrical signal) input to the microphone 10 and converts the input speech into a frequency characteristic pattern. For each word stored in the speech recognition dictionary 31, the engine then matches the frequency characteristic pattern of the input speech against the stored word's pattern and calculates a degree of match. Finally, the engine checks whether each word's degree of match is at or above a threshold: if no word reaches the threshold, it determines a recognition error; if one or more words do, it takes them as the recognized candidate words in descending order of degree of match.
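A minimal sketch of the matching and thresholding step follows. The source does not specify the matching measure, so cosine similarity over fixed-length feature vectors is used here purely as a stand-in; practical recognizers use more elaborate acoustic matching. The function names, the vector representation of the frequency characteristic pattern, and the example values are assumptions.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Stand-in degree of match between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recognize(input_pattern: Sequence[float],
              dictionary: Dict[str, Sequence[float]],
              threshold: float) -> List[str]:
    """Return candidate words in descending order of degree of match.

    An empty list corresponds to a recognition error: no stored word
    reaches the threshold.
    """
    scores = {word: cosine_similarity(input_pattern, pattern)
              for word, pattern in dictionary.items()}
    passing = [word for word, score in scores.items() if score >= threshold]
    return sorted(passing, key=lambda word: scores[word], reverse=True)
```

With the illustrative dictionary `{"parfait": [1.0, 0.0, 0.2], "cafe": [0.9, 0.3, 0.2], "ramen": [0.0, 1.0, 0.0]}` and input `[1.0, 0.1, 0.2]`, a threshold of 0.9 yields `["parfait", "cafe"]`, while a threshold of 0.999 yields an empty list, i.e. a recognition error.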

When the speech recognition engine 32 determines a recognition error, the dialogue engine 33 sets guidance content prompting the user to input speech again (for example, "Please speak clearly once more.") and sends the speaker 20 a guidance signal for outputting that guidance as speech.

When there is a word whose degree of match is at or above the threshold, the dialogue engine 33 first sets guidance content presenting the recognition result for the first-candidate word, and sends the speaker 20 a guidance signal for outputting that guidance as speech. If, after the first-candidate result has been presented, the speech recognition engine 32 recognizes an affirmative word (for example, "Yes") as the user's response, the dialogue engine 33 determines that the recognition of the first-candidate word is correct. The ECU 30 then sends information on the first-candidate word as a recognition information signal to the device on which the setting operation is being performed.

If, after the recognition result for a candidate word has been presented, the speech recognition engine 32 recognizes a negative word (for example, "No") as the user's response, the dialogue engine 33 determines that the recognition result is a misrecognition. Because the cause of the misrecognition is unknown at this point, likely causes are presented to the user one at a time, starting with the most plausible. Possible causes of misrecognition include the voice being too loud, the voice being too quiet, speaking too early, speaking too fast, input of a word not stored in the speech recognition dictionary 31, user-specific characteristics (speech that is inherently hard to recognize, trailing off at the end of words, etc.), and the microphone 10 being unavailable (a timeout, etc.). The dialogue engine 33 therefore sets, in turn, guidance content asking the user to re-enter speech while paying attention to each cause, and sends the speaker 20 a guidance signal for outputting that guidance as speech. For example, it presents "Please speak again in a louder voice.", "Please speak again slowly.", and "Please say it a different way (or use different words)." in that order. The previous guidance content is stored so that the same guidance is not presented again. The order in which guidance addressing the causes of misrecognition is presented may be predetermined, or may be decided according to the user, the situation at the time, and so on.
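The guidance sequencing can be sketched as picking the first entry of a predetermined order that has not yet been given. The order below mirrors the example in the text; the names are hypothetical, and the sketch remembers every guidance already given, which is a slightly stronger reading of the text's "previous guidance content".

```python
from typing import List, Optional, Set

# Example order taken from the text; the source notes the order may instead be
# fixed differently or chosen according to the user and the situation.
DEFAULT_ORDER = [
    "Please speak again in a louder voice.",
    "Please speak again slowly.",
    "Please say it a different way (or use different words).",
]


def next_guidance(order: List[str], already_given: Set[str]) -> Optional[str]:
    """Return the first guidance in `order` not yet presented, or None if exhausted."""
    for text in order:
        if text not in already_given:
            return text
    return None
```

The caller adds each returned guidance to `already_given` after presenting it, so the same guidance is never spoken twice in one dialogue.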

If, after guidance asking the user to re-enter speech while attending to a cause of misrecognition has been presented, the speech recognition engine 32 again recognizes the same word as the previous recognition result in the user's response, the dialogue engine 33 determines that the cause of the misrecognition is not voice volume, speaking timing, or speaking rate. In that case, if the recognition result includes a second-candidate word, the dialogue engine 33 sets guidance content presenting the recognition result for the second-candidate word and sends the speaker 20 a guidance signal for outputting that guidance as speech. If, after the second-candidate result has been presented, the speech recognition engine 32 recognizes an affirmative word as the user's response, the dialogue engine 33 determines that the recognition of the second-candidate word is correct, and the ECU 30 sends information on the second-candidate word as a recognition information signal to the device on which the setting operation is being performed. If the second-candidate result is also rejected and third or lower candidates exist, the same guidance is carried out using those candidates.

On the other hand, if the dialogue engine 33 has determined that the cause of the misrecognition is not voice volume, speaking timing, or speaking rate, and no candidate words remain in the recognition result, it determines that the cause of the misrecognition is that the user is using a word not stored in the speech recognition dictionary 31. The dialogue engine 33 then sets guidance content asking the user to re-enter speech with a different wording (or different words), and sends the speaker 20 a guidance signal for outputting that guidance as speech.
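The branching in the two preceding paragraphs (present the next candidate if one remains, otherwise ask the user to rephrase) can be sketched as below. It is a simplified illustration: it tracks only the single word slot in question (for example "parfait" versus "cafe", leaving "Roppongi" fixed), and the function name and return values are hypothetical.

```python
from typing import List, Tuple


def handle_repeated_result(candidates: List[str], rejected: List[str]) -> Tuple[str, str]:
    """Decide what to do when re-speaking yields the same rejected result.

    candidates: words in descending order of degree of match for the latest utterance.
    rejected: words the user has already answered "No" to.
    Returns ("present", word) for the next untried candidate, or
    ("rephrase", prompt) when no candidate remains.
    """
    for word in candidates:
        if word not in rejected:
            return ("present", word)
    # No candidates left: assume the uttered word is not in the dictionary.
    return ("rephrase", "Please say it a different way.")
```

The second case of the embodiment corresponds to `handle_repeated_result(["parfait", "cafe"], ["parfait"])`, which returns `("present", "cafe")`; the first case, where only "parfait" was recognized, falls through to the rephrase prompt.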

If, after guidance asking the user to rephrase has been presented, the speech recognition engine 32 recognizes a candidate word different from the previous recognition result in the user's response, the dialogue engine 33 sets guidance content presenting the recognition result for that new candidate word and sends the speaker 20 a guidance signal for outputting that guidance as speech. If, after that result has been presented, the speech recognition engine 32 recognizes an affirmative word as the user's response, the dialogue engine 33 determines that the recognition of that candidate word is correct, and the ECU 30 sends information on the candidate word as a recognition information signal to the device on which the setting operation is being performed. If instead the speech recognition engine 32 recognizes a negative word in response to the presented result, the guidance asking the user to rephrase and re-enter speech is given again.

The basic guidance sentences that the dialogue engine 33 uses for each situation are prepared in advance and stored in a predetermined area of the storage device of the ECU 30. Such situations include, for example, determining a recognition error; presenting the recognition result for each candidate word; prompting the user, after a misrecognition, to change voice volume, speaking timing, speaking rate, and so on; and prompting the user, after a misrecognition, to change the wording or words.
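Storing base sentences per situation and filling in details at run time could look like the following sketch. The `Situation` names, the template wording, and the `{words}` and `{hint}` placeholders are assumptions for illustration; the source states only that base sentences are prepared in advance and stored in the ECU 30.

```python
from enum import Enum, auto


class Situation(Enum):
    RECOGNITION_ERROR = auto()   # no word reached the threshold
    PRESENT_CANDIDATE = auto()   # ask the user to confirm a candidate
    ADJUST_DELIVERY = auto()     # change volume, timing, rate, etc.
    REPHRASE = auto()            # change the wording or words


# Hypothetical base sentences; placeholders are filled in at run time.
TEMPLATES = {
    Situation.RECOGNITION_ERROR: "Please speak clearly once more.",
    Situation.PRESENT_CANDIDATE: "Shall I search for {words}?",
    Situation.ADJUST_DELIVERY: "Please speak again, {hint}.",
    Situation.REPHRASE: "Please say it a different way.",
}


def build_guidance(situation: Situation, **fields: str) -> str:
    """Fill the stored base sentence for `situation` with run-time details."""
    return TEMPLATES[situation].format(**fields)
```

For example, `build_guidance(Situation.PRESENT_CANDIDATE, words="Roppongi cafe")` produces "Shall I search for Roppongi cafe?".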

The operation of the speech recognition device 1 as it interacts, through speech recognition, with a user performing a setting operation is described next with reference to FIG. 1. The example is a user setting a destination (specifically, searching for a destination) on the navigation device. Two cases are described. In the first case, the recognition result has only a first candidate (the misrecognition being caused by the word "cafe" not being stored in the speech recognition dictionary 31). In the second case, the recognition result also has a second candidate ("cafe" having been recognized, but only as the second candidate). The first case is described along the flowchart of FIG. 2, and the second case along the flowchart of FIG. 3.

The first case is as follows. To search for a destination, the user says "Roppongi cafe" (S10). When this speech is picked up, the microphone 10 converts it into an electrical signal and sends it to the ECU 30 as an input speech signal. On receiving the input speech signal, the speech recognition engine 32 of the ECU 30 matches the pattern of the input speech against the pattern data of each word stored in the speech recognition dictionary 31, and recognizes "Roppongi" and "parfait" as the first-candidate words (R10). (In Japanese, "cafe" (kafe) and "parfait" (pafe) differ by a single consonant.) The dialogue engine 33 uses these first-candidate words, "Roppongi" and "parfait", to set the guidance "Shall I search for Roppongi parfait?" and sends the speaker 20 a guidance signal for outputting that guidance as speech (G10). On receiving the guidance signal, the speaker 20 outputs the speech "Shall I search for Roppongi parfait?" (G10).

On hearing the guidance "Shall I search for Roppongi parfait?", the user says "No" to reject it (S11). When this speech is picked up, the microphone 10 sends an input speech signal to the ECU 30 as before. On receiving it, the speech recognition engine 32 performs pattern matching as before and recognizes "No" (R11). Based on the negative word "No", the dialogue engine 33 determines that "Roppongi" and "parfait" recognized in R10 are a misrecognition (J11). Because the cause of the misrecognition is unknown at this point, the dialogue engine 33, following the predetermined order, sets the guidance "Please change the volume of your voice and speak again." so that the user re-enters speech while paying attention to a cause of misrecognition, and sends the speaker 20 a guidance signal for outputting that guidance as speech (G11). On receiving the guidance signal, the speaker 20 outputs the speech "Please change the volume of your voice and speak again." (G11).

On hearing the guidance "Please change the volume of your voice and speak again.", the user says "Roppongi cafe" again, in a louder voice (S12). When this speech is picked up, the microphone 10 sends an input speech signal to the ECU 30 as before. On receiving it, the speech recognition engine 32 performs pattern matching as before and again recognizes "Roppongi" and "parfait" as the first-candidate words (R12). Because the current recognition result ("Roppongi" and "parfait") is the same as the previous recognition result in R10 ("Roppongi" and "parfait"), the dialogue engine 33 determines that this is a misrecognition and infers that its cause is the use of a word not stored in the speech recognition dictionary 31, rather than voice volume, speaking timing, or speaking rate (J12). To have the user rephrase and re-enter speech, the dialogue engine 33 sets the guidance "Please say it a different way. You can also search by what you want to do, for example 'I want to eat ramen in Shibuya.'" and sends the speaker 20 a guidance signal for outputting that guidance as speech (G12). On receiving the guidance signal, the speaker 20 outputs this guidance as speech (G12).

On hearing this guidance, the user, who wanted to drink coffee, says "I want to drink coffee in Roppongi" (S13). When this speech is picked up, the microphone 10 sends an input speech signal to the ECU 30 as before. On receiving it, the speech recognition engine 32 performs pattern matching as before and recognizes "Roppongi" and "coffee" as the first-candidate words (R13). The dialogue engine 33 uses these first-candidate words, "Roppongi" and "coffee", to set the guidance "Shall I search for Roppongi coffee?" and sends the speaker 20 a guidance signal for outputting that guidance as speech (G13). On receiving the guidance signal, the speaker 20 outputs the speech "Shall I search for Roppongi coffee?" (G13).

On hearing the guidance "Shall I search for Roppongi coffee?", the user says "Yes" to confirm it (S14). When this speech is picked up, the microphone 10 sends an input speech signal to the ECU 30 as before. On receiving it, the speech recognition engine 32 performs pattern matching as before and recognizes "Yes" (R14). Based on the affirmative word "Yes", the dialogue engine 33 determines that "Roppongi" and "coffee" recognized in R13 are correct (J14). The ECU 30 then sends a recognition information signal containing "Roppongi" and "coffee" as the speech recognition result to the navigation device.

The second case is as follows. In this case, the operation up to R22, in which the speech recognition engine 32 recognizes "Roppongi" and "parfait" for the second time, is the same as in the first case. However, the speech recognition engine 32 has also recognized "cafe" as the second candidate, after the first candidate "parfait".

Because the current recognition result ("Roppongi" and "parfait") is the same as the previous recognition result in R20 ("Roppongi" and "parfait"), the dialogue engine 33 determines that this is a misrecognition (J22). Since the second-candidate word "cafe" is available, the dialogue engine 33 drops the first candidate "parfait", uses "Roppongi" together with the second-candidate word "cafe" to set the guidance "Shall I search for Roppongi cafe?", and sends the speaker 20 a guidance signal for outputting that guidance as speech (G22). On receiving the guidance signal, the speaker 20 outputs the speech "Shall I search for Roppongi cafe?" (G22).

On hearing the guidance "Shall I search for Roppongi cafe?", the user says "Yes" to confirm it (S23). When this speech is picked up, the microphone 10 sends an input speech signal to the ECU 30 as before. On receiving it, the speech recognition engine 32 performs pattern matching as before and recognizes "Yes" (R23). Based on the affirmative word "Yes", the dialogue engine 33 determines that "Roppongi" recognized in R22 and the second candidate "cafe" are correct (J23). The ECU 30 then sends a recognition information signal containing "Roppongi" and "cafe" as the speech recognition result to the navigation device.

この音声認識装置１によれば、誤認識と判断し、誤認識の要因が特定できていない場合にはガイダンスの内容を順次変えることにより、誤認識の要因を考慮してユーザに再音声入力させることができ、誤認識を繰り返すことを抑制することができる。その結果、ユーザの音声認識装置への信頼性を向上させることができる。特に、音声認識装置１によれば、ユーザに言い換えを促すガイダンスを行うことにより、音声認識辞書３１に収録されていない単語を使用してユーザが音声入力したときでも、ユーザに前回とは異なる単語を使用して再音声入力させることができる。また、音声認識装置１によれば、２番目以降の候補の認識結果もある場合には２番目以降の候補の単語もユーザに順次提示することにより、誤認識を繰り返すことを更に抑制することができる。 When speech recognition apparatus 1 determines that misrecognition has occurred and the cause of the misrecognition has not been identified, it changes the guidance content step by step, so that the user speaks again with the possible causes in mind; repeated misrecognition is thereby suppressed. As a result, the user's confidence in the speech recognition apparatus can be improved. In particular, because speech recognition apparatus 1 gives guidance prompting the user to paraphrase, a user who spoke a word not recorded in the speech recognition dictionary 31 can be led to speak again using a different word from the previous attempt. Furthermore, when the recognition result also contains second and subsequent candidates, speech recognition apparatus 1 presents those candidate words to the user in turn, which further suppresses repeated misrecognition.
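The step-by-step change of guidance, ending in a paraphrase prompt when the same misrecognition persists (as in claims 1 and 3), can be sketched as a small selector. This is a minimal sketch under assumptions: the prompt wording, their order, and the class name are hypothetical; the source names only the factors (loudness, speaking timing, speaking speed) and the paraphrase prompt. Passing an empty `factor_prompts` covers the variation in which paraphrasing is prompted from the start.

```python
from typing import Iterable, Optional

# Hypothetical guidance texts; the source names the factors (loudness, speaking
# timing, speaking speed) and the paraphrase prompt, not their wording or order.
FACTOR_PROMPTS = (
    "Please speak a little louder.",
    "Please start speaking after the beep.",
    "Please speak a little more slowly.",
)
PARAPHRASE_PROMPT = "Please try saying it in different words."


class GuidanceSelector:
    """Varies the guidance on each repeated misrecognition instead of
    repeating one fixed prompt such as "Please speak again"."""

    def __init__(
        self,
        factor_prompts: Iterable[str] = FACTOR_PROMPTS,
        paraphrase_prompt: str = PARAPHRASE_PROMPT,
    ) -> None:
        self.factor_prompts = tuple(factor_prompts)
        self.paraphrase_prompt = paraphrase_prompt
        self.last_misrecognized: Optional[tuple[str, ...]] = None
        self.step = 0

    def on_misrecognition(self, result: tuple[str, ...]) -> str:
        if result == self.last_misrecognized:
            # Same misrecognition again: advance to the next guidance.
            self.step += 1
        else:
            # A different misrecognition: restart from the first factor prompt.
            self.last_misrecognized = result
            self.step = 0
        if self.step < len(self.factor_prompts):
            return self.factor_prompts[self.step]
        # The same misrecognition survived every factor prompt: assume the word is
        # not in the recognition dictionary and ask the user to rephrase.
        return self.paraphrase_prompt
```

With a single factor prompt, the second identical misrecognition already yields the paraphrase prompt, which matches the wording of claim 1 most directly.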

以上、本発明に係る実施の形態について説明したが、本発明は上記実施の形態に限定されることなく様々な形態で実施される。 Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment and can be implemented in various forms.

例えば、本実施の形態では車両に搭載される音声認識装置に適用したが、他の様々な分野に適用可能である。 For example, although the present embodiment applies the invention to a speech recognition apparatus mounted in a vehicle, the invention is also applicable to various other fields.

また、本実施の形態では誤認識と判断した場合のガイダンスの内容や順序の一例を示したが、ガイダンスの内容や順序については特に限定するものではなく、誤認識の繰り返しを防止するための内容や順序であればよい。例えば、誤認識と判断した場合、声の大きさ、話すタイミングや速さなどを変えることを促すのではなく、最初から、言い方や単語を変えて話すことを促すようにしてもよい。 The present embodiment also shows one example of the content and order of the guidance given when misrecognition is determined, but neither is particularly limited; any content and order that prevent repeated misrecognition may be used. For example, when misrecognition is determined, the user may be prompted from the outset to change the phrasing or the word used, rather than to change the voice volume, speaking timing, speaking speed, or the like.

また、本実施の形態では認識結果として単語の候補が複数ある場合には１番目の候補の単語を誤認識と判断したときには２番目以降の候補の単語をユーザに提示する構成としたが、単語の候補が複数ある場合でも２番目以降の候補の単語をユーザに提示しない構成としてもよい。 Likewise, in the present embodiment, when the recognition result contains multiple word candidates and the first-candidate word is determined to be misrecognized, the second and subsequent candidate words are presented to the user; however, the apparatus may instead be configured not to present the second and subsequent candidates even when multiple candidates exist.

１…音声認識装置、１０…マイクロフォン、２０…スピーカ、３０…ＥＣＵ、３１…音声認識辞書、３２…音声認識エンジン、３３…対話エンジン DESCRIPTION OF SYMBOLS 1 ... Voice recognition apparatus, 10 ... Microphone, 20 ... Speaker, 30 ... ECU, 31 ... Voice recognition dictionary, 32 ... Voice recognition engine, 33 ... Dialog engine

Claims

A speech recognition apparatus for recognizing speech uttered by a user based on words recorded in a speech recognition dictionary, wherein,
when misrecognition occurs, the apparatus prompts the user to speak again while paying attention to a cause of the misrecognition, and, if the same misrecognition then occurs again, determines that the user has uttered a word not recorded in the speech recognition dictionary and prompts the user to paraphrase.

The speech recognition apparatus according to claim 1, wherein, when misrecognition occurs, a second-candidate word is presented to the user.

The speech recognition apparatus according to claim 1 or 2, wherein the cause of misrecognition to which the user is asked to pay attention when prompted to speak again is at least one of voice loudness, speaking timing, and speaking speed.