JP2008076812A - Voice recognition device, voice recognition method and voice recognition program

Info

Publication number: JP2008076812A
Application number: JP2006256908A
Authority: JP
Inventors: Hisayuki Nagashima; 久幸長島
Original assignee: Honda Motor Co Ltd
Current assignee: Honda Motor Co Ltd
Priority date: 2006-09-22
Filing date: 2006-09-22
Publication date: 2008-04-03

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition device, a voice recognition method, and a voice recognition program capable of improving the recognition accuracy of an utterance by suitably grasping a keyword, even when a user's free utterance is recognized.

SOLUTION: The voice recognition device 1 comprises: a text conversion means 11 for recognizing input voice and converting it to text; a first syntax analysis means 31 for analyzing the text by using a word string score and outputting the result, together with the scores, as a first voice candidate group; a second syntax analysis means 32 for analyzing the text by using a word score and outputting the result, together with the scores, as a second voice candidate group; and a voice candidate group determination means 33 for determining a final voice candidate group from the first and second voice candidate groups, based on the score of each voice candidate. When a comparison judgement means 34 judges that the first voice candidate with the highest score does not coincide with the second voice candidate with the highest score, the highest score of the second voice candidate is increased by a prescribed value.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a speech recognition device, a speech recognition method, and a speech recognition program for recognizing input speech.

In recent years, speech recognition devices have been used, for example in systems in which a user operates equipment, to recognize speech input by the user and acquire the information needed to operate the equipment. Such a device recognizes the speech (utterance) input by the user and, based on the recognition result, responds to the user and prompts the user's next utterance, thereby carrying on a dialogue with the user. The information needed to operate the equipment is then acquired from the result of recognizing this dialogue.

Among such devices, speech recognition devices have been proposed that accept as input not only pre-registered commands but also free utterances by the user whose wording is not restricted. These devices use a technique (dictation) in which the input speech is converted into words based on its acoustic features, analyzed based on the occurrence probabilities of word strings (N-grams), and recognized as text expressed by a word string. Here, "text" means a meaningful syntactic unit, expressed as a string of words, that has a given meaning. However, a device that accepts such free utterances must recognize ambiguous input speech, which makes analysis difficult and raises the likelihood of misrecognition. For this reason, techniques have been proposed that improve recognition accuracy by re-recognizing, with another method, words or word strings in the dictation result that may have been misrecognized (see, for example, Patent Document 1).
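For orientation, the occurrence probability of a word string under an N-gram model is conventionally written as below. This is the standard textbook formulation, given here as an assumption about what "N-gram" refers to; the publication itself does not state a formula. The symbols $w_i$ (the $i$-th word), $m$ (the number of words), and $N$ (the N-gram order) are introduced only for this illustration.

```latex
P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P\left(w_i \mid w_{i-N+1}, \dots, w_{i-1}\right)
```

With $N = 2$ (a bigram model), each word is conditioned only on the word immediately preceding it.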

In the speech recognition device of Patent Document 1, words or word strings in the dictation result whose confidence during dictation-based recognition (calculated from the occurrence probability of the word string) is at or below a predetermined threshold are treated as a "re-recognition target word string". Speech recognition using rules is then performed on the speech corresponding to this re-recognition target word string. A rule specifies, for every word string, not an occurrence probability but whether or not the word string can occur; in rule-based speech recognition, only word strings specified as occurring can become the recognition result. The device of Patent Document 1 determines, from the part of the dictation result other than the re-recognition target word string, a predetermined condition that the re-recognition target word string must satisfy, and performs rule-based speech recognition so as to satisfy this condition. The dictation result is then replaced with the result of the rule-based speech recognition.

Patent Document 1: JP-A-2005-202165

The above speech recognition device assumes that, in dictation-based speech recognition, a word with low confidence is a misrecognized word. However, low confidence for a word does not mean that the word itself has been misrecognized. Rather, low confidence for a word in a dictation result means that the word is out of context within the text (sentence) recognized by dictation. The fact that the word was output as a recognition result despite being out of context suggests that it is a word (keyword) the user deliberately pronounced with acoustic clarity. Consequently, a method that re-recognizes the speech of a low-confidence word and replaces it with another word, as the above device does, has the problem that it cannot improve recognition accuracy.

In view of the above circumstances, an object of the present invention is to provide a speech recognition device, a speech recognition method, and a speech recognition program that can appropriately grasp keywords and improve the recognition accuracy of an utterance even when recognizing a user's free utterance.

The speech recognition device of the present invention comprises: text conversion means for recognizing input speech and thereby converting the speech into text expressed by a word string; first syntax analysis means for executing processing that analyzes the text using a word string score calculated based on features of the words and word strings contained in the text converted by the text conversion means, and outputting the result of the processing as a first speech candidate group together with the score of each speech candidate; second syntax analysis means for executing processing that analyzes the text using a word score calculated based on features of the words contained in the text converted by the text conversion means, and outputting the result of the processing as a second speech candidate group together with the score of each speech candidate; speech candidate group determination means for determining a final speech candidate group from the first and second speech candidate groups based on the score of each speech candidate; and comparison determination means for comparing the first speech candidate group with the second speech candidate group and determining whether there is a matching speech candidate. The device is characterized in that, when the comparison determination means determines that the first speech candidate, which has the highest score among the speech candidates in the first speech candidate group, does not match the second speech candidate, which has the highest score among the speech candidates in the second speech candidate group, the highest score of the second speech candidate is increased by a predetermined value.

According to the speech recognition device of the present invention, the first syntax analysis means executes processing that analyzes the text obtained by recognizing the input speech, using a word string score calculated based on features of the words and word strings contained in the text. Here, a "score" means an index representing the plausibility (likelihood, confidence) that a speech candidate (a word or word string taken as a candidate for the recognition result of the input speech) corresponds to the input speech from various viewpoints, such as acoustic and linguistic viewpoints. The "word string score" is calculated based on, for example, the occurrence probabilities of words and word strings; using this word string score makes it possible to recognize not only pre-registered commands but also free utterances by the user whose wording is not restricted.

The second syntax analysis means executes processing that analyzes the text obtained by recognizing the input speech, using a word score calculated based on features of the words contained in the text. The "word score" is calculated based on, for example, the occurrence probability of each word.

When it is determined that the first speech candidate, which has the highest score in the first speech candidate group, does not match the second speech candidate, which has the highest score in the second speech candidate group, the present invention increases the score of that second speech candidate by a predetermined value, so that the final speech candidate group reflects the meaning of the keyword and the function the keyword has.

The speech candidate group determination means then determines the final speech candidate group from the first and second speech candidate groups based on the score of each speech candidate. Because the score of the second speech candidate, which has the highest score in the second speech candidate group, has been increased, a speech candidate containing the meaning of the keyword is more likely to rank high in the final speech candidate group. Therefore, according to the speech recognition device of the present invention, even when, for example, a natural utterance by the user is input, the keyword is appropriately grasped, and the recognition accuracy of the utterance can be improved.
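The comparison and score adjustment described above can be sketched as follows. This is a minimal illustration, not the implementation of the publication: the `Candidate` type, the function names, and the rule that two candidates "match" when their text is identical are all assumptions made for the example, and both candidate groups are assumed to be non-empty.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Candidate:
    """Hypothetical representation of one speech candidate."""
    text: str     # recognized word or word string
    score: float  # plausibility score assigned by a syntax analysis means


def top(group):
    """Return the candidate with the highest score (group must be non-empty)."""
    return max(group, key=lambda c: c.score)


def boost_second_if_mismatch(first_group, second_group, boost):
    """Return the second group, raising its top score by `boost` when the
    top candidates of the two groups do not match.

    Matching is assumed here to mean identical text; the source does not
    define the matching criterion in this passage.
    """
    top1 = top(first_group)
    top2 = top(second_group)
    if top1.text == top2.text:
        return list(second_group)  # tops match: scores are left unchanged
    return [replace(c, score=c.score + boost) if c is top2 else c
            for c in second_group]
```

For instance, if the first group is topped by "audio volume up" and the second by "aircon temperature up", only the latter's score is raised, which makes the keyword-driven candidate more likely to rank high once the two groups are considered together.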

In the speech recognition device of the present invention, the speech candidate group determination means preferably determines, as the final speech candidate group, the speech candidates whose scores rank within a predetermined rank among the speech candidates contained in the first and second speech candidate groups.

In this case, because the score of the second speech candidate, which is considered to contain the meaning of the keyword, has been increased, that second speech candidate is likely to be among the speech candidates whose scores rank within the predetermined rank (for example, to be the speech candidate with the highest score). Therefore, even when, for example, a natural utterance by the user is input, the final speech candidate group can be determined with the keyword appropriately grasped, and the recognition accuracy of the utterance can be improved.
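Selecting the candidates within a predetermined rank can be sketched as below. The function name `select_final`, the `(text, score)` tuple representation, and the handling of a candidate that appears in both groups (keeping its higher score) are assumptions for illustration; the source does not specify how duplicates are treated.

```python
def select_final(first_group, second_group, max_rank):
    """Merge two groups of (text, score) pairs and keep the `max_rank`
    highest-scoring candidates.

    Assumption: a candidate present in both groups is kept once, with the
    higher of its two scores.
    """
    best = {}
    for text, score in list(first_group) + list(second_group):
        if text not in best or score > best[text]:
            best[text] = score
    # Sort by score, highest first, and cut at the predetermined rank.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:max_rank]
```

With `max_rank=1` this reduces to picking the single highest-scoring candidate, which is the example the text gives for "within a predetermined rank".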

The speech recognition device of the present invention preferably further comprises notification means for notifying a plurality of speech candidates contained in the first and second speech candidate groups based on the score of each speech candidate, and selection means for selecting at least one of the plurality of speech candidates notified by the notification means, with the speech candidate group determination means determining the speech candidates selected by the selection means as the final speech candidate group.

In this case, because the score of the second speech candidate, which is considered to contain the meaning of the keyword, has been increased, the notified speech candidates are highly likely to include the one corresponding to the user's utterance, and the user can select the speech candidate that matches his or her intention. Therefore, even when, for example, a natural utterance by the user is input, the final speech candidate group can be determined with the keyword appropriately grasped, and the recognition accuracy of the utterance can be improved.

The speech recognition device of the present invention preferably further comprises control means for executing predetermined control processing based at least on the final speech candidate group determined by the speech candidate group determination means.

In this case, the control means determines and executes predetermined control processing, for example from among a plurality of predefined control processes (scenarios), based on the final speech candidate group. The predetermined control processing is, for example, processing that controls a device or function to be controlled based on information acquired from the utterance, or processing that controls a response to the user by voice or screen display. According to the present invention, even when, for example, a natural utterance by the user is input, the final speech candidate group is determined with the keyword appropriately grasped, so the predetermined control processing can be determined and executed appropriately in accordance with the user's intention.

The control means can also determine and execute the predetermined control processing by taking into account, together with the final speech candidate group, the state of the system in which the speech recognition device is installed (for example, a vehicle), the state of the user, or the state of the device or function to be controlled. The device may also comprise storage means for storing the user's dialogue history, changes in device state, and the like, in which case the control means can determine the predetermined control processing by taking this dialogue history and these state changes into account together with the final speech candidate group.

Next, the speech recognition method of the present invention comprises: a text conversion step of recognizing input speech and thereby converting the speech into text expressed by a word string; a first syntax analysis step of executing processing that analyzes the text using a word string score calculated based on features of the words and word strings contained in the text converted in the text conversion step, and outputting the result of the processing as a first speech candidate group together with the score of each speech candidate; a second syntax analysis step of executing processing that analyzes the text using a word score calculated based on features of the words contained in the text converted in the text conversion step, and outputting the result of the processing as a second speech candidate group together with the score of each speech candidate; a comparison determination step of comparing the first speech candidate, which has the highest score among the speech candidates in the first speech candidate group, with the second speech candidate, which has the highest score among the speech candidates in the second speech candidate group, and determining whether they match; a score determination step of increasing the highest score of the second speech candidate by a predetermined value when the comparison determination step determines that they do not match; and a speech candidate group determination step of determining a final speech candidate group from the first and second speech candidate groups based on the score of each speech candidate.

According to the speech recognition method of the present invention, as described for the speech recognition device of the present invention, the score of the second speech candidate, which has the highest score in the second speech candidate group, is increased, so a speech candidate containing the meaning of the keyword is more likely to rank high in the final speech candidate group. Therefore, according to this speech recognition method, even when a natural utterance by the user is input, the keyword is appropriately grasped, and the recognition accuracy of the utterance can be improved.

Next, the speech recognition program of the present invention is characterized by having the function of causing a computer to execute: text conversion processing that recognizes input speech and thereby converts the speech into text expressed by a word string; first syntax analysis processing that analyzes the text using a word string score calculated based on features of the words and word strings contained in the text converted by the text conversion processing, and outputs the result of the analysis as a first speech candidate group together with the score of each speech candidate; second syntax analysis processing that analyzes the text using a word score calculated based on features of the words contained in the text converted by the text conversion processing, and outputs the result of the analysis as a second speech candidate group together with the score of each speech candidate; comparison determination processing that compares the first speech candidate group with the second speech candidate group and determines whether there is a matching speech candidate; score determination processing that, when the comparison determination processing determines that the first speech candidate, which has the highest score among the speech candidates in the first speech candidate group, does not match the second speech candidate, which has the highest score among the speech candidates in the second speech candidate group, increases the highest score of the second speech candidate by a predetermined value; and speech candidate group determination processing that determines a final speech candidate group from the first and second speech candidate groups based on the score of each speech candidate.

In this case, a computer can be made to execute processing that achieves the effects described for the speech recognition device of the present invention.

[First Embodiment]
As shown in FIG. 1, the speech recognition device of the first embodiment of the present invention consists of a voice interaction unit 1 and is mounted on a vehicle 10. Connected to the voice interaction unit 1 are a microphone 2, to which utterances from the driver of the vehicle 10 are input, and a vehicle state detection unit 3, which detects the state of the vehicle 10. Also connected to the voice interaction unit 1 are a speaker 4, which outputs responses to the driver, and a display 5, which presents displays to the driver. In addition, a plurality of devices 6a to 6c that the driver can operate by voice or the like are connected to the voice interaction unit 1.

The microphone 2 receives the voice of the driver of the vehicle 10 and is installed at a predetermined position inside the vehicle. When, for example, the start of voice input is commanded by a talk switch, the microphone 2 acquires the input voice as the driver's utterance. The talk switch is an ON/OFF switch operated by the driver of the vehicle 10; pressing it to turn it ON commands the start of voice input.

The vehicle state detection unit 3 consists of sensors and the like that detect the state of the vehicle 10. The state of the vehicle 10 refers to, for example, the traveling state of the vehicle 10, such as its speed and acceleration or deceleration; traveling environment information, such as the position of the vehicle 10 and the road being traveled; the operating state of equipment installed in the vehicle 10 (wipers, turn signals, the audio 6a, the navigation system 6b, and so on); or the state of the vehicle interior, such as the interior temperature. Specifically, sensors that detect the traveling state of the vehicle 10 include, for example, a vehicle speed sensor that detects the traveling speed (vehicle speed) of the vehicle 10, a yaw rate sensor that detects the yaw rate of the vehicle 10, and a brake sensor that detects brake operation of the vehicle 10 (whether or not the brake pedal is being operated). The state of the driver of the vehicle 10 (perspiration of the driver's palms, driving load, and so on) may also be detected as part of the state of the vehicle 10.

The speaker 4 outputs responses (voice guidance) to the driver of the vehicle 10. A speaker of the audio 6a, described later, can be used as the speaker 4.

The display 5 is, for example, a HUD (head-up display) that displays information such as images on the front window of the vehicle 10, a display provided integrally with a meter that shows the traveling state of the vehicle 10 such as the vehicle speed, or a display provided in the navigation system 6b described later. The display of the navigation system 6b is a touch panel 24 incorporating touch switches.

The devices 6a to 6c are, specifically, an audio 6a, a navigation system 6b, and an air conditioner 6c installed in the vehicle 10. For each of the devices 6a to 6c, the controllable components (devices, contents, and so on), functions, operations, and the like are defined in advance.

For example, the audio 6a has devices such as "CD", "MP3", "radio", and "speaker". Its functions include "volume", and its operations include "change", "on", and "off". Operations of "CD" and "MP3" include "play" and "stop"; functions of "radio" include "channel selection"; and operations of "volume" include "raise" and "lower".

Similarly, the navigation system 6b has contents such as "screen display", "route guidance", and "POI search". Operations of "screen display" include "change", "enlarge", and "reduce". "Route guidance" is a function that guides the driver to a destination by voice guidance or the like, and "POI search" is a function that searches for a destination such as a restaurant or hotel.

Likewise, the air conditioner 6c has functions such as "air volume" and "set temperature". Operations of the air conditioner 6c include "on" and "off", and operations of "air volume" and "set temperature" include "change", "raise", and "lower".

The devices 6a to 6c are controlled by specifying information for controlling the target (the type of device or function, the content of the operation, and so on). The devices, contents, and functions of the devices 6a to 6c to be controlled are classified into a plurality of domains. A "domain" means a classification according to the category of the recognition target; specifically, it represents a device or function to be controlled. Domains can be specified hierarchically; for example, the "audio" domain is subdivided into the "CD" and "radio" domains beneath it.
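One simple way to picture the hierarchical domain classification is a parent map, sketched below. The "audio", "cd", and "radio" entries follow the example in the text; the other top-level entries, the table name `DOMAIN_PARENT`, and the helper functions are hypothetical and serve only to illustrate the idea.

```python
# Each domain maps to its parent domain; None marks a top-level domain.
# Only audio -> cd / radio comes from the text; the rest is illustrative.
DOMAIN_PARENT = {
    "audio": None,
    "cd": "audio",
    "radio": "audio",
    "navigation": None,
    "air_conditioner": None,
}


def domain_path(domain):
    """Return the chain of domains from the top level down to `domain`."""
    path = []
    while domain is not None:
        path.append(domain)
        domain = DOMAIN_PARENT[domain]
    return list(reversed(path))


def is_within(domain, ancestor):
    """True if `domain` is `ancestor` itself or lies beneath it."""
    return ancestor in domain_path(domain)
```

A lookup such as `domain_path("cd")` then yields the chain from "audio" down to "cd", which mirrors how a control target can be narrowed from a device to one of its components.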

Although not illustrated in detail, the voice interaction unit 1 is composed of electronic circuitry including an A/D conversion circuit and a microcomputer (CPU, RAM, ROM); the output of the microphone 2 (an analog signal) is converted into a digital signal by the A/D conversion circuit and input to the unit. Based on the input data, the voice interaction unit 1 executes processing that recognizes the utterance input by the driver, processing that, based on the recognition result, conducts a dialogue with the driver or presents information to the driver via the speaker 4 and the display 5, processing that controls the devices 6a to 6c, and so on. These processes are realized by the voice interaction unit 1 executing a program installed in advance in its memory. This program includes the speech recognition program of the present invention. The program may be stored in the memory via a recording medium such as a CD-ROM, or it may be distributed or broadcast from an external server via a network or an artificial satellite, received by communication equipment mounted on the vehicle 10, and then stored in the memory.

More specifically, as functions realized by the above program, the voice interaction unit 1 comprises a text conversion unit 11, which recognizes the input speech using an acoustic model 15 and a language model 16 and outputs it as text, and a syntax analysis unit 12, which uses a syntax model 17 to understand the meaning of the utterance from the recognized text. The voice interaction unit 1 also comprises a scenario control unit 13, which determines a scenario using a scenario database 18 based on the recognition result of the utterance and performs responses to the driver, control of the devices, and so on, and a speech synthesis unit 14, which uses a phoneme model 21 to synthesize the spoken response to be output to the driver.

The acoustic model 15, the language model 16, the syntax model 17, the scenario database 18, and the phoneme model 19 are each a recording medium (database), such as a CD-ROM, DVD, or HDD, on which data is recorded.

The text conversion unit 11 constitutes the text conversion means of the present invention. The syntax analysis unit 12 constitutes the first syntax analysis means 31, the second syntax analysis means 32, the speech candidate group determination means 33, and the comparison judgment means 34 of the present invention. The scenario control unit 13 constitutes the control means of the present invention.

The text conversion unit 11 performs frequency analysis on the waveform data representing the speech of the utterance input to the microphone 2 and extracts feature vectors. Based on the extracted feature vectors, the text conversion unit 11 then executes "text conversion processing", which recognizes the input speech and outputs it as text expressed as a word string. This text conversion processing is executed by jointly evaluating the acoustic and linguistic features of the input speech using the probabilistic and statistical method described next.

That is, the text conversion unit 11 first uses the acoustic model 15 to evaluate the likelihood of pronunciation data corresponding to the extracted feature vectors (hereinafter, this likelihood is referred to as the "acoustic score" where appropriate) and determines the pronunciation data based on the acoustic score. The text conversion unit 11 also uses the language model 16 to evaluate the likelihood of text, expressed as a word string, corresponding to the determined pronunciation data (hereinafter, the "language score") and determines the text based on the language score. Furthermore, for every text so determined, the text conversion unit 11 calculates a confidence of text conversion (hereinafter, the "text conversion score") based on the acoustic score and the language score of that text. The text conversion unit 11 then outputs any text whose text conversion score satisfies a predetermined condition as the recognized text (Recognized Text).

The syntax analysis unit 12 executes "syntax analysis processing", which uses the syntax model 17 to understand the meaning of the input utterance from the text recognized by the text conversion unit 11. This syntax analysis processing is executed by analyzing the relationships between words (the syntax) in the recognized text using the probabilistic and statistical method described next.

That is, the syntax analysis unit 12 evaluates the likelihood of the recognized text (hereinafter, this likelihood is referred to as the "parsing score" where appropriate) and, based on the parsing score, determines text classified into the class corresponding to the meaning of the recognized text. The syntax analysis unit 12 then outputs the classified text (Categorized Text) whose parsing score satisfies a predetermined condition, together with the parsing score, as the recognition result of the input utterance. A class corresponds to a classification by category representing a control target or control content, like the domains described above. For example, when the recognized text is 「設定変更」 ("setting change"), 「設定変更する」 ("change the settings"), 「設定を変える」 ("alter the settings"), or 「セッティング変更」 ("change of setting"), the classified text is {Setup} in every case.

The scenario control unit 13 determines a scenario for response output to the driver and for device control, using the data recorded in the scenario database 18, based on at least the recognition result output from the syntax analysis unit 12 and the state of the vehicle 10 acquired from the vehicle state detection unit 3. In the scenario database 18, a plurality of scenarios for response output and device control are recorded in advance together with conditions on the utterance recognition result and the vehicle state. The scenario control unit 13 then executes processing that controls responses by voice or image display, and processing that controls the devices, according to the determined scenario. Specifically, for a voice response, for example, the scenario control unit 13 determines the content of the response to be output (a response sentence that prompts the driver's next utterance, a response sentence that notifies the user that an operation has been completed, and so on) as well as the speed and volume at which the response is output.

The speech synthesis unit 14 synthesizes speech using the phoneme model 19 according to the response sentence determined by the scenario control unit 13 and outputs it as waveform data representing the speech. The speech is synthesized using processing such as TTS (Text to Speech). Specifically, the speech synthesis unit 14 normalizes the text of the response sentence determined by the scenario control unit 13 into an expression suitable for speech output and converts each word of the normalized text into pronunciation data. The speech synthesis unit 14 then determines feature vectors from the pronunciation data using the phoneme model 19 and applies filter processing to these feature vectors to convert them into waveform data. This waveform data is output from the speaker 4 as speech.

The acoustic model (Acoustic Model) 15 records data indicating the probabilistic correspondence between feature vectors and pronunciation data. In detail, the acoustic model 15 records, as data, a plurality of HMMs (Hidden Markov Models) prepared for each recognition unit (phoneme, morpheme, word, etc.). An HMM is a statistical signal source model that represents speech as a concatenation of stationary signal sources (states) and expresses the time series by the transition probabilities from one state to the next. An HMM makes it possible to represent the acoustic features of speech, which vary over time, with a simple probabilistic model. Parameters of the HMMs, such as the transition probabilities, are determined in advance by training on corresponding speech data. The phoneme model 19 likewise records HMMs similar to those of the acoustic model 15, used to determine feature vectors from pronunciation data.

The language model (Language Model) 16 records data indicating the appearance probabilities and connection probabilities of the words to be recognized, together with the pronunciation data and text of those words. The words to be recognized are determined in advance as words that may be used in utterances for controlling a target. Data such as word appearance probabilities and connection probabilities are created statistically by analyzing a large training text corpus. The appearance probability of a word is calculated based on, for example, the frequency with which that word appears in the training text corpus.

For the language model 16, an N-gram language model is used, for example, which is expressed by the probability that a specific sequence of N words appears consecutively. In the present embodiment, the language model 16 uses N-grams according to the number of words included in the input utterance. Specifically, the language model 16 uses N-grams whose value of N is less than or equal to the number of words included in the pronunciation data. For example, when the pronunciation data contains two words, the model uses a unigram (Uni-gram, N = 1), expressed by the appearance probability of one word, and a bigram (Bi-gram, N = 2), expressed by the occurrence probability of a two-word sequence (the conditional appearance probability given the one preceding word).

Furthermore, in the language model 16, N-grams can also be used with the value of N restricted to a predetermined upper limit. The upper limit may be, for example, a fixed predetermined value (for example, N = 2) or a value that is set successively so that the processing time of the text conversion processing for the input utterance stays within a predetermined time. For example, when N-grams are used with an upper limit of N = 2, only unigrams and bigrams are used even when the pronunciation data contains more than two words. This prevents the computational cost of the text conversion processing from becoming excessive, so that a response to the driver's utterance can be output within an appropriate response time.

The syntax model (Parser Model) 17 records data indicating the appearance probabilities and connection probabilities of the words to be recognized, together with the text and class of those words. For the syntax model 17, an N-gram language model is used, for example, as in the language model 16. Specifically, in the present embodiment, the syntax model 17 uses N-grams whose value of N is less than or equal to the number of words included in the recognized text, with N = 3 as the upper limit. That is, the syntax model 17 uses unigrams, bigrams, and trigrams (Tri-gram, N = 3), the last being expressed by the occurrence probability of a three-word sequence (the conditional appearance probability given the two preceding words). The upper limit may be a value other than 3 and can be set arbitrarily. Alternatively, N-grams whose value of N is less than or equal to the number of words included in the recognized text may be used without any upper limit.
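The rule for choosing which N-gram orders to use, as described for the language model 16 and the syntax model 17, can be sketched as follows. This is only an illustrative sketch: the function name `ngram_orders` and its parameters are assumptions and do not appear in the source.

```python
from typing import List, Optional


def ngram_orders(num_words: int, upper_limit: Optional[int] = None) -> List[int]:
    """Return the values of N used for a text containing num_words words.

    N runs from 1 up to the number of words, optionally capped by upper_limit.
    (Hypothetical helper, not part of the described device.)
    """
    if num_words < 1:
        return []
    max_n = num_words if upper_limit is None else min(num_words, upper_limit)
    return list(range(1, max_n + 1))


# Language model 16 without a cap: two words -> unigram and bigram.
print(ngram_orders(2))
# Language model 16 capped at N = 2: eight words still use only N = 1 and 2.
print(ngram_orders(8, upper_limit=2))
# Syntax model 17 capped at N = 3, applied to a five-word recognized text.
print(ngram_orders(5, upper_limit=3))
```

Capping N bounds the number of probability lookups per word, which is the cost argument given for the language model 16.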

As shown in FIG. 2, the language model 16 and the syntax model 17 are each created classified by domain type. In the example of FIG. 2, there are eight domain types: {Audio, Climate, Passenger Climate, POI, Ambiguous, Navigation, Clock, Help}. {Audio} indicates that the control target is the audio system 6a. {Climate} indicates that the control target is the air conditioner 6c. {Passenger Climate} indicates that the control target is the air conditioner 6c for the passenger seat. {POI} indicates that the control target is the POI search function of the navigation system 6b. {Navigation} indicates that the control target is a function of the navigation system 6b such as route guidance or map operation. {Clock} indicates that the control target is the clock function. {Help} indicates that the control target is the help function for learning how to operate the devices 6a to 6c and the speech recognition device. {Ambiguous} indicates that the control target is unknown.

Next, the operation (voice interaction processing) of the speech recognition device of the present embodiment will be described. As shown in FIG. 3, first, in STEP 1, an utterance for controlling a target is input to the microphone 2 by the driver of the vehicle 10. Specifically, the driver turns on the talk switch to command the start of utterance input and speaks into the microphone 2.

Next, in STEP 2, the voice interaction unit 1 executes the text conversion processing, which recognizes the input speech and outputs it as text.

First, the voice interaction unit 1 A/D-converts the speech input to the microphone 2 to obtain waveform data representing the speech. Next, the voice interaction unit 1 performs frequency analysis on the waveform data and extracts feature vectors. The waveform data is thereby filtered, for example by a short-time spectrum analysis technique, and converted into a time series of feature vectors. Each feature vector is an extraction of the features of the speech spectrum at one point in time; it generally has 10 to 100 dimensions (for example, 39 dimensions), and LPC mel cepstrum (Linear Predictive Coding Mel Cepstrum) coefficients or the like are used.

Next, for the extracted feature vectors, the voice interaction unit 1 evaluates the likelihood of the feature vectors (the acoustic score) under each of the plurality of HMMs recorded in the acoustic model 15. The voice interaction unit 1 then determines the pronunciation data corresponding to those HMMs, among the plurality of HMMs, that have a high acoustic score. As a result, when the utterance "Chitose" is input, for example, the pronunciation data "ti-to-se" is obtained from the waveform data of the speech together with its acoustic score. Likewise, when the utterance "mark set" is input, for example, acoustically similar pronunciation data such as "ma-a-ku-ri-su-to" ("mark list") is obtained in addition to the pronunciation data "ma-a-ku-se-t-to", each together with its acoustic score.

Next, from the determined pronunciation data, the voice interaction unit 1 determines text expressed as a word string based on the language score of that text. When a plurality of pieces of pronunciation data have been determined, text is determined for each piece. When the domain type has been determined, for example based on input from the driver to the touch panel 24, the voice interaction unit 1 may perform the text determination described below using only the portion of the language model 16 that is classified into the determined domain type.

First, the voice interaction unit 1 determines text from the pronunciation data using the language model 16. Specifically, the voice interaction unit 1 first compares the determined pronunciation data with the pronunciation data recorded in the language model 16 and extracts words with a high degree of similarity.

Next, the voice interaction unit 1 calculates the language score of each extracted word using N-grams according to the number of words included in the pronunciation data. For each word in the pronunciation data, the voice interaction unit 1 then determines text whose calculated language score satisfies a predetermined condition (for example, being at or above a predetermined value). For example, as shown in FIG. 4, when the input utterance is "Set the station ninety nine point three FM.", the text "set the station ninety nine point three FM" is determined as the text corresponding to the pronunciation data determined from this utterance.

In this case, the unigram gives the appearance probabilities a1 to a8 of "set", "the", ..., "FM", respectively. The bigram gives the occurrence probabilities b1 to b7 of the two-word sequences "set the", "the station", ..., "three FM", respectively. Similarly, for N = 3 to 8, the occurrence probabilities of the N-word sequences c1 to c6, d1 to d5, e1 to e4, f1 to f3, g1 to g2, and h1 are given. Then, for example, the language score of the text "ninety" is calculated from a4, b3, c2, and d1, which are obtained from the N-grams with N = 1 to 4, in accordance with the word count of 4 obtained by combining the word "ninety" in the pronunciation data with the words that precede it.
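To make the indexing in this example concrete, the following sketch lists, for a given word, the N-grams that end at that word, labelled a, b, c, ... by order as in the text. The function name `ngrams_ending_at` is hypothetical, and the sketch deliberately stops at enumeration: the source does not state how a4, b3, c2, and d1 are combined into the language score, so no combination formula is assumed here.

```python
from typing import List, Tuple


def ngrams_ending_at(words: List[str], index: int) -> List[Tuple[str, Tuple[str, ...]]]:
    """List the N-grams ending at words[index], with labels as used in the text.

    The letter encodes N (a = 1, b = 2, ...); the number is the 1-based
    position of that N-gram among all N-grams of the same order.
    """
    result = []
    for n in range(1, index + 2):  # N = 1 .. (the word plus all preceding words)
        start = index - n + 1
        label = f"{chr(ord('a') + n - 1)}{start + 1}"
        result.append((label, tuple(words[start:index + 1])))
    return result


words = "set the station ninety nine point three FM".split()
for label, ngram in ngrams_ending_at(words, words.index("ninety")):
    print(label, " ".join(ngram))
```

For "ninety" (the fourth word) this yields the labels a4, b3, c2, and d1, matching the probabilities named in the paragraph above.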

By thus using a technique (dictation) that transcribes the input utterance as text using a word-level probabilistic and statistical language model, the driver's natural utterances can be recognized without being limited to utterances with predetermined phrasing.

Next, for every text determined using the language model 16, the voice interaction unit 1 calculates the confidence of text conversion (the text conversion score), which is a weighted sum of the acoustic score and the language score. Values determined in advance by experiment, for example, are used as the weighting coefficients.

Next, the voice interaction unit 1 determines, as the recognized text, the text expressed as a word string whose calculated text conversion score satisfies a predetermined condition, and outputs it. The predetermined condition is set in advance as, for example, the text with the highest text conversion score, the texts whose text conversion scores rank from the top down to a predetermined rank, or the texts whose text conversion score is at or above a predetermined value.
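The weighted sum that forms the text conversion score, followed by ranking of the candidate texts, might look like the sketch below. The weights, the score values, and the function names are all assumptions made for illustration; the source states only that the weights are fixed experimentally and gives no numeric acoustic or language scores.

```python
from typing import Dict, List, Tuple

# Hypothetical weights; the text only says they are determined experimentally.
ACOUSTIC_WEIGHT = 0.6
LANGUAGE_WEIGHT = 0.4


def text_conversion_score(acoustic: float, language: float) -> float:
    """Weighted sum of the acoustic score and the language score."""
    return ACOUSTIC_WEIGHT * acoustic + LANGUAGE_WEIGHT * language


def rank_texts(candidates: Dict[str, Tuple[float, float]]) -> List[Tuple[str, float]]:
    """Score every candidate text and sort by descending text conversion score."""
    scored = [(text, text_conversion_score(a, l)) for text, (a, l) in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical (acoustic score, language score) pairs for two similar-sounding texts.
candidates = {
    "mark set": (0.70, 0.90),
    "mark list": (0.75, 0.40),
}
for text, score in rank_texts(candidates):
    print(f"{text}: {score:.2f}")
```

With these made-up numbers, "mark list" is slightly better acoustically, but the language score moves "mark set" to the top, which illustrates why the two scores are combined.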

Next, in STEP 3, the voice interaction unit 1 executes the syntax analysis processing, which understands the meaning of the utterance from the recognized text.

First, the voice interaction unit 1 uses the syntax model 17 to determine classified text from the recognized text. Specifically, the voice interaction unit 1 first uses the entire syntax model 17 to calculate, for each word, the likelihood of each domain. Next, the voice interaction unit 1 determines the domain based on the likelihoods calculated for each word. Next, using the portion of the syntax model 17 that is classified into the determined domain type, the voice interaction unit 1 calculates, for each word, the likelihood of each class set. A class set (classified text) is then determined based on the likelihood (word score) calculated for each word. This processing of determining a class set for each word corresponds to the processing of the second syntax analysis means 32, and the one or more class sets so determined correspond to the second speech candidate group. Here, a "speech candidate" is a candidate for a command designating a control target or control content, obtained by analyzing the text recognized from the input speech.

Similarly, for each two-word sequence included in the recognized text, the voice interaction unit 1 calculates the likelihood of each domain for the two words and determines the domain for the two words based on that likelihood. The voice interaction unit 1 further calculates the likelihood of each class set for the two words (the two-word score) and determines the class set (classified text) for the two words based on the two-word score. In the same way, for each three-word sequence included in the recognized text, the voice interaction unit 1 calculates the likelihood of each domain for the three words and determines the domain for the three words based on that likelihood. The voice interaction unit 1 further calculates the likelihood of each class set for the three words (the three-word score) and determines the class set (classified text) for the three words based on the three-word score.

Next, based on the class sets determined for one word, two words, and three words and on the scores of those class sets (the one-word scores, two-word scores, and three-word scores), the voice interaction unit 1 calculates the likelihood of each class set for the recognized text as a whole (the parsing score). The voice interaction unit 1 then determines the class sets (classified text) for the recognized text as a whole based on the parsing score. This processing of determining class sets for the text as a whole corresponds to the processing of the first syntax analysis means 31, and the one or more class sets so determined correspond to the first speech candidate group.

The processing that determines classified text using the syntax model 17 will now be described using the example shown in FIG. 5. In the example of FIG. 5, the recognized text is "AC on floor to defrost".

In this case, using the entire syntax model 17, the likelihood of each domain for one word is calculated with the unigram for each of "AC", "on", ..., "defrost". The domain for each word is then determined based on that likelihood. For example, the first-ranked (highest-likelihood) domain is determined to be {Climate} for "AC", {Ambiguous} for "on", and {Climate} for "defrost".

Furthermore, using the portion of the syntax model 17 that is classified into the determined domain type, the likelihood of each class set for one word is calculated with the unigram for each of "AC", "on", ..., "defrost". The class set for each word is then determined based on that likelihood. For example, for "AC", the first-ranked (highest-likelihood) class set is determined to be {Climate_ACOnOff_On}, and the likelihood (word score) i1 for this class set is obtained. Similarly, class sets are determined for "on", ..., "defrost", and the likelihoods (word scores) i2 to i5 for these class sets are obtained. As a result, a per-word speech candidate list such as that shown in FIG. 6(a) is obtained as the second speech candidate group.
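The two-stage, per-word classification described above (first choose a domain using the whole model, then choose a class set within that domain) can be sketched as below. All likelihood values are invented for illustration, the tables stand in for the syntax model 17, and the class sets `Climate_ACOnOff_Off` and `Climate_Defrost_Rear` are hypothetical names added only so that each word has more than one alternative. Returning no class set for a word whose domain has no entry is likewise an assumption of this sketch.

```python
from typing import Dict, Optional, Tuple

# Stage 1 table: hypothetical per-word domain likelihoods (whole model).
DOMAIN_LIKELIHOOD: Dict[str, Dict[str, float]] = {
    "AC": {"Climate": 0.7, "Audio": 0.1, "Ambiguous": 0.2},
    "on": {"Climate": 0.2, "Audio": 0.2, "Ambiguous": 0.6},
    "defrost": {"Climate": 0.9, "Ambiguous": 0.1},
}

# Stage 2 table: hypothetical class-set likelihoods within each domain.
CLASS_SET_LIKELIHOOD: Dict[str, Dict[str, Dict[str, float]]] = {
    "Climate": {
        "AC": {"Climate_ACOnOff_On": 0.25, "Climate_ACOnOff_Off": 0.05},
        "defrost": {"Climate_Defrost_Front": 0.30, "Climate_Defrost_Rear": 0.10},
    },
}


def classify_word(word: str) -> Tuple[str, Optional[str], float]:
    """Pick the most likely domain, then the most likely class set inside it."""
    domain_scores = DOMAIN_LIKELIHOOD[word]
    domain = max(domain_scores, key=domain_scores.get)
    class_scores = CLASS_SET_LIKELIHOOD.get(domain, {}).get(word, {})
    if not class_scores:
        return domain, None, 0.0  # no class set known for this word in this domain
    class_set = max(class_scores, key=class_scores.get)
    return domain, class_set, class_scores[class_set]


for w in ("AC", "on", "defrost"):
    print(w, classify_word(w))
```

Restricting stage 2 to the data of the chosen domain keeps the class-set search small, which mirrors the use of only the matching portion of the syntax model 17.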

Similarly, with the bigram, the likelihood of each domain for two words is calculated for each of "AC on", "on floor", ..., "to defrost", and the domain for the two words is determined based on that likelihood. The class sets for two words and their likelihoods (two-word scores) j1 to j4 are then determined. In the same way, with the trigram, the likelihood of each domain for three words is calculated for each of "AC on floor", "on floor to", and "floor to defrost", and the domain for the three words is determined based on that likelihood. The class sets for three words and their likelihoods (three-word scores) k1 to k3 are then determined.

Next, for each class set determined for one word, two words, and three words, the sum of the word scores i1 to i5, the two-word scores j1 to j4, and the three-word scores k1 to k3 belonging to that class set is calculated, for example, as the likelihood of that class set for the text as a whole (the parsing score). For example, the parsing score for {Climate_Fan-Vent_Floor} is i3 + j2 + j3 + k1 + k2 = 0.8. The parsing score for {Climate_ACOnOff_On} is i1 + j1 = 0.6, and the parsing score for {Climate_Defrost_Front} is i5 + j4 = 0.5. The class sets (classified text) for the text as a whole are then determined based on the calculated parsing scores. As a result, a word-string speech candidate list such as that shown in FIG. 6(a) is obtained as the first speech candidate group.
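The summation above can be reproduced with a short sketch. The totals 0.8, 0.6, and 0.5 come from the text, but the individual values assigned to i1, i3, i5, j1 to j4, k1, and k2 below are hypothetical, chosen only so that they add up to those totals. The scores i2, i4, and k3 are omitted because the example does not say which class sets they belong to.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (word sequence, class set, score); every individual score is hypothetical.
ScoredSequence = Tuple[Tuple[str, ...], str, float]

SCORED: List[ScoredSequence] = [
    (("AC",), "Climate_ACOnOff_On", 0.25),                      # i1
    (("floor",), "Climate_Fan-Vent_Floor", 0.10),               # i3
    (("defrost",), "Climate_Defrost_Front", 0.30),              # i5
    (("AC", "on"), "Climate_ACOnOff_On", 0.35),                 # j1
    (("on", "floor"), "Climate_Fan-Vent_Floor", 0.20),          # j2
    (("floor", "to"), "Climate_Fan-Vent_Floor", 0.20),          # j3
    (("to", "defrost"), "Climate_Defrost_Front", 0.20),         # j4
    (("AC", "on", "floor"), "Climate_Fan-Vent_Floor", 0.15),    # k1
    (("on", "floor", "to"), "Climate_Fan-Vent_Floor", 0.15),    # k2
]


def parsing_scores(scored: List[ScoredSequence]) -> Dict[str, float]:
    """Sum the 1-, 2-, and 3-word scores per class set over the whole text."""
    totals: Dict[str, float] = defaultdict(float)
    for _words, class_set, score in scored:
        totals[class_set] += score
    return dict(totals)


for class_set, total in sorted(parsing_scores(SCORED).items(), key=lambda kv: -kv[1]):
    print(f"{class_set}: {total:.1f}")
```

Because longer word sequences contribute additional terms, a class set supported by several overlapping sequences (here the "floor" one) can outrank a class set supported by a single strong keyword, which is the situation the later boosting step addresses.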

Next, in STEP 4, the voice interaction unit 1 compares the first speech candidate, which has the highest score among the speech candidates included in the first speech candidate group, with the second speech candidate, which has the highest score among the speech candidates included in the second speech candidate group, and judges whether they match. In the example shown in FIG. 6, the first-ranked speech candidate {Climate_Fan-Vent_Floor} in the word-string speech candidate list is compared with the first-ranked speech candidate {Climate_Defrost_Front} in the per-word speech candidate list, and it is judged whether they match.

If the judgment result in STEP 4 is YES (the candidates match), the processing proceeds directly to STEP 6. If the judgment result in STEP 4 is NO (the candidates do not match), the processing proceeds to STEP 5, where the voice interaction unit 1 increases the highest score among the second speech candidates by a predetermined value.

In the example shown in FIG. 6, the top-ranked speech candidate {Climate_Fan-Vent_Floor} in the word-string speech candidate list does not match the top-ranked speech candidate {Climate_Defrost_Front} in the per-word speech candidate list, so the determination result of STEP 4 is NO. Here, the top-ranked candidate {Climate_Defrost_Front} in the per-word list is a speech candidate that carries the meaning of the keyword "defrost" contained in the input speech. The process therefore proceeds to STEP 5, where, as shown in FIG. 6(b), the score (word score) of the top-ranked candidate {Climate_Defrost_Front} in the per-word list is increased by 0.2. Reflecting this score, the likelihood (parsing score) of each class set is calculated for the entire text. For example, like the word string score, the parsing score is calculated as the sum of the word score determined in STEP 5, the two-word score, and the three-word score. Based on the calculated parsing scores, the all-speech-candidate list shown in FIG. 6(b) is obtained. As a result, the candidate {Climate_Defrost_Front}, ranked third in the word-string speech candidate list of FIG. 6(a), has its score increased by 0.2 and becomes the second-ranked candidate in the all-speech-candidate list of FIG. 6(b). That is, the score of the speech candidate {Climate_Defrost_Front}, which carries the meaning of the keyword, is increased and its rank rises. The process then proceeds to STEP 6.
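The comparison of STEP 4 and the score adjustment of STEP 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in this publication: the function name `boost_on_mismatch`, the dictionary representation, the candidate `Hypothetical_Other`, and all numeric scores are invented for the example (the values of FIG. 6 are not reproduced here). Only the two `Climate_...` candidate names and the increment of 0.2 come from the text. For simplicity the increment is added directly to the candidate's entry in the merged list instead of recomputing the sum of the one-, two-, and three-word scores.

```python
def boost_on_mismatch(word_string_list, per_word_list, boost=0.2):
    """Return the ranked all-candidate list after STEP 4 and STEP 5.

    Both inputs map a candidate name to its score (assumed representation).
    """
    # STEP 4: compare the top-ranked candidate of each list.
    top_string = max(word_string_list, key=word_string_list.get)
    top_word = max(per_word_list, key=per_word_list.get)

    merged = dict(word_string_list)
    if top_string != top_word:
        # STEP 5: raise the score of the per-word top candidate.
        # Assumption: a candidate missing from the word-string list starts at 0.
        merged[top_word] = merged.get(top_word, 0.0) + boost

    # Rank by descending score.
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


# Illustrative scores only (assumed, not taken from FIG. 6).
word_string_list = {
    "Climate_Fan-Vent_Floor": 0.8,
    "Hypothetical_Other": 0.6,       # hypothetical filler candidate
    "Climate_Defrost_Front": 0.5,    # third before the adjustment
}
per_word_list = {
    "Climate_Defrost_Front": 0.7,
    "Climate_Fan-Vent_Floor": 0.4,
}

ranked = boost_on_mismatch(word_string_list, per_word_list)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

With these assumed numbers, {Climate_Defrost_Front} moves from third to second place, mirroring the change between FIG. 6(a) and FIG. 6(b).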

Next, in STEP 6, the voice interaction unit 1 determines, from the speech candidates in the all-speech-candidate list, one or more speech candidates whose calculated parsing scores satisfy a predetermined condition as the recognition result of the input utterance (the final speech candidate group), and outputs them together with the confidence of the recognition result (the parsing score of each speech candidate). The predetermined condition is set in advance, for example: the speech candidate with the highest parsing score, the speech candidates ranked from the top down to a predetermined rank by parsing score, or the speech candidates whose parsing score is at least a predetermined value.

At this time, because the score of the speech candidate carrying the meaning of the keyword was increased in STEP 5 and its rank raised, that candidate is more likely to be included in the final speech candidate group. For example, when the utterance "AC on floor to defrost" is input as described above and the predetermined condition is "speech candidates ranked up to second by parsing score", then, as shown in the final speech candidate list of FIG. 6(c), {Climate_Fan-Vent_Floor} and {Climate_Defrost_Front} are output together with their scores as the recognition result. That is, the speech candidate {Climate_Defrost_Front}, which carries the meaning of the keyword, is included in the final speech candidate group. This processing corresponds to the processing of the speech candidate group determination means 33.
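The predetermined conditions named for STEP 6 (highest score, down to a predetermined rank, at least a predetermined value) can be expressed with one small filter. The sketch below is illustrative only: `select_final`, its parameters, the candidate `Hypothetical_Other`, and the scores are assumptions made for this example.

```python
def select_final(ranked, top_n=None, min_score=None):
    """Pick the final candidate group from (name, score) pairs.

    `ranked` must already be sorted by descending score.
    """
    result = list(ranked)
    if top_n is not None:
        # Condition: candidates from the top down to a predetermined rank.
        result = result[:top_n]
    if min_score is not None:
        # Condition: candidates whose score is at least a predetermined value.
        result = [(name, score) for name, score in result if score >= min_score]
    return result


# Assumed scores for illustration; only the Climate_... names come from the text.
all_candidates = [
    ("Climate_Fan-Vent_Floor", 0.8),
    ("Climate_Defrost_Front", 0.7),
    ("Hypothetical_Other", 0.6),
]

final_group = select_final(all_candidates, top_n=2)   # "up to second place"
best_only = select_final(all_candidates, top_n=1)     # "highest score"
above_threshold = select_final(all_candidates, min_score=0.65)
print(final_group)
```

With `top_n=2`, the result contains {Climate_Fan-Vent_Floor} and {Climate_Defrost_Front}, as in the FIG. 6(c) example.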

Next, in STEP 7, the voice interaction unit 1 acquires the detected values of the state of the vehicle 10 (the traveling state of the vehicle 10, the state of the devices mounted on the vehicle 10, the state of the driver of the vehicle 10, and so on) detected by the vehicle state detection unit 3.

Next, in STEP 8, the voice interaction unit 1 uses the scenario database 18 to determine a scenario for responding to the driver and controlling the devices, based on the final speech candidate group (the utterance recognition result) determined in STEP 6 and the state of the vehicle 10 detected in STEP 7. Because the keyword has been properly grasped in determining the final speech candidate group in STEP 6, which improves the recognition accuracy of the driver's utterance, the scenario is determined appropriately in accordance with the driver's intention.

First, the voice interaction unit 1 acquires information for controlling the target from the utterance recognition result and the state of the vehicle 10. As shown in FIG. 7, the voice interaction unit 1 has a plurality of forms that store information for controlling targets. Each form has a predetermined number of slots corresponding to the classes of information required. For example, the forms "Plot a route" and "Traffic info." are provided for storing information for controlling the navigation system 6b, and the form "Climate control" is provided for storing information for controlling the air conditioner 6c. The form "Plot a route" has four slots: "From", "To", "Request", and "via".

The voice interaction unit 1 enters values into the slots of the relevant form based on the recognition result of each utterance in the dialogue with the driver and the state of the vehicle 10. Specifically, the voice interaction unit 1 determines a speech candidate from the final speech candidate group, taking the state of the vehicle 10 into account, and enters values into the slots of the form corresponding to that candidate. It also calculates a confidence for each form (the degree of trust in the values entered in the form) and records it in the form. The form confidence is calculated, for example, based on the confidence of the recognition result of each utterance and on how completely the slots of the form are filled. For example, as shown in FIG. 8, when the driver inputs the utterance "Guide me to Chitose Airport by the shortest route", the values "here", "Chitose Airport", and "shortest" are entered in the three slots "From", "To", and "Request" of the form "Plot a route". The calculated form confidence of 80 is recorded in "Score" of the form "Plot a route".
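The form and slot structure can be pictured with a small data class. This is a hedged sketch: the class `Form`, its methods, and the use of `None` for an empty slot are assumptions for illustration. The text does not give the formula for the form confidence, so the value 80 from the FIG. 8 example is simply recorded rather than computed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Form:
    """A form holding the information needed to control one target."""
    name: str
    slots: Dict[str, Optional[str]]   # None marks an empty slot (assumption)
    score: Optional[int] = None       # form confidence, recorded in "Score"

    def fill(self, values: Dict[str, str]) -> None:
        """Enter recognized values into the matching slots."""
        for slot, value in values.items():
            if slot not in self.slots:
                raise KeyError(f"unknown slot: {slot}")
            self.slots[slot] = value

    def empty_slots(self) -> List[str]:
        """Return the names of slots that have no value yet."""
        return [slot for slot, value in self.slots.items() if value is None]


# The form and slot names follow FIG. 7 as described in the text.
plot_a_route = Form(
    name="Plot a route",
    slots={"From": None, "To": None, "Request": None, "via": None},
)

# Values from the FIG. 8 example utterance.
plot_a_route.fill({"From": "here", "To": "Chitose Airport", "Request": "shortest"})
plot_a_route.score = 80  # taken from the example, not computed here

print(plot_a_route.empty_slots())
```

After the example utterance, only the slot "via" remains empty, which is the situation the scenario selection then has to handle.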

Next, the voice interaction unit 1 selects the form to be used for the actual control processing based on the form confidence and the state of the vehicle 10 detected in STEP 7. Then, based on the selected form, it determines the scenario using the data stored in the scenario database 18. As shown in FIG. 8, the scenario database 18 stores, for example, response sentences (prompts) to be output to the driver, classified by how the slots are filled and by level. The level is a value set based on, for example, the form confidence and the state of the vehicle 10 (the traveling state of the vehicle 10, the state of the driver, and so on).

For example, when the selected form has an empty slot (a slot with no value entered), a scenario is determined that outputs a response sentence prompting the driver to fill the empty slot. At this time, an appropriate response sentence prompting the driver's next utterance is determined according to the level, that is, taking into account the form confidence and the state of the vehicle 10. For example, depending on the driver's driving load, when the driving load is considered high, a response sentence that asks for fewer slots is chosen. Prompting the user's next utterance with the response sentence determined in this way makes the dialogue efficient.

In the example shown in FIG. 8, values have been entered in the first to third slots "From", "To", and "Request" of the form "Plot a route", and no value has been entered in the fourth slot "via". The level is set to 2. In this case, the response sentence "Set <To> with <Request>" is selected from the scenario database 18, and the content of the response sentence is determined as "Set Chitose Airport with highway priority".

Also, for example, when all the slots in the selected form are filled (values have been entered), a scenario is determined that outputs a response sentence confirming the content (for example, a response sentence informing the driver of the input value of each slot).
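One way to picture the lookup in the scenario database 18 is a table keyed by which slots are filled and by level, holding response templates with `<Slot>` placeholders. The sketch below is hypothetical throughout: the table layout, the function names `render` and `decide_response`, and the English template wording are assumptions, not the stored data of FIG. 8.

```python
import re


def render(template, slots):
    """Replace each <Slot> placeholder with that slot's value."""
    return re.sub(r"<(\w+)>", lambda match: slots[match.group(1)], template)


def decide_response(slots, level, scenario_db):
    """Look up a template by (set of filled slots, level) and render it."""
    filled = frozenset(name for name, value in slots.items() if value is not None)
    template = scenario_db[(filled, level)]
    return render(template, slots)


# Hypothetical scenario table; the real database contents are not given in the text.
scenario_db = {
    (frozenset({"From", "To", "Request"}), 2): "Setting route to <To> (<Request>).",
    (frozenset({"From", "To", "Request", "via"}), 2):
        "Route from <From> to <To> via <via> (<Request>). Is this correct?",
}

slots = {"From": "here", "To": "Chitose Airport", "Request": "shortest", "via": None}
response = decide_response(slots, level=2, scenario_db=scenario_db)
print(response)
```

Keying the table by the fill pattern keeps the "prompt for missing slots" and "confirm all values" cases in the same lookup, and the level lets a shorter prompt be stored for high driving load.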

Next, in STEP 9, the voice interaction unit 1 determines, based on the determined scenario, whether the dialogue with the driver has ended. If the determination result of STEP 9 is NO, the process proceeds to STEP 10, where the voice interaction unit 1 synthesizes speech according to the content of the determined response sentence and the conditions for outputting it. Then, in STEP 11, the generated response sentence (for example, one prompting the driver's next utterance) is output from the speaker 4.

次に、ＳＴＥＰ１に戻り、２回目の発話が運転者から入力される。次に、ＳＴＥＰ２〜８で、音声対話ユニット１は、１回目の発話と同様に、２回目の発話に対して処理を実行する。このとき、ＳＴＥＰ２で、音声対話ユニット１は、例えば、運転者からの前回の発話の認識結果に基づいてドメインの種類が決定される場合には、言語モデル１６のうちの、この決定された種類のドメインに分類された部分のデータのみを用いてテキストを決定する処理を行うようにしてもよい。次に、ＳＴＥＰ９で、音声対話ユニット１は、運転者との対話が終了したか否かを判断する。ＳＴＥＰ９の判断結果がＮＯの場合には、１回目の発話と同様に、音声対話ユニット１は、ＳＴＥＰ１０〜１１の処理を実行する。 Next, returning to STEP 1, the second utterance is input from the driver. Next, in STEPs 2 to 8, the voice interaction unit 1 executes processing for the second utterance as in the first utterance. At this time, in STEP 2, for example, when the type of the domain is determined based on the recognition result of the previous utterance from the driver, the voice interaction unit 1 determines the determined type of the language model 16. The text may be determined using only the data of the portion classified into the domain. Next, in STEP 9, the voice interaction unit 1 determines whether or not the dialogue with the driver has ended. When the determination result in STEP 9 is NO, the voice interaction unit 1 executes the processing in STEPs 10 to 11 as in the first utterance.

以下、ＳＴＥＰ９の判断結果がＹＥＳとなるまで、上述の２回目の発話に対するＳＴＥＰ１〜１１と同様の処理が繰り返される。 Thereafter, the processing similar to STEP 1 to STEP 11 for the second utterance described above is repeated until the determination result of STEP 9 is YES.

ＳＴＥＰ９の判断結果がＹＥＳの場合には、ＳＴＥＰ１２に進み、音声対話ユニット１は、決定された応答文（機器制御の内容を報知する応答文等）の音声を合成する。次に、ＳＴＥＰ１３で、応答文がスピーカ４から出力される。次に、ＳＴＥＰ１４で、音声対話ユニット１は、決定されたシナリオに基づいて機器を制御して、音声対話処理を終了する。 If the determination result in STEP 9 is YES, the process proceeds to STEP 12 and the voice interaction unit 1 synthesizes the voice of the determined response sentence (such as a response sentence that informs the contents of device control). Next, a response sentence is output from the speaker 4 in STEP13. Next, in STEP 14, the voice interaction unit 1 controls the device based on the determined scenario, and ends the voice interaction process.

Through the above processing, keywords are properly grasped and the recognition accuracy of utterances is improved, so that devices are controlled through efficient dialogue.

[Second Embodiment]

Next, a speech recognition device according to a second embodiment of the present invention will be described. This embodiment differs from the first embodiment only in the processing of STEP 6 of the voice interaction processing, which determines the final speech candidate group (the utterance recognition result). Since the configuration of this embodiment is the same as that of the first embodiment, the same components are given the same reference numerals and their description is omitted below.

The operation (voice interaction processing) of the speech recognition device of this embodiment will be described using the example shown in FIGS. 9 to 11. In this example, the recognized text is "Tune the next disc". In STEP 3, as in the first embodiment, class sets for one word, two words, and three words, and the scores of those class sets, are calculated. FIG. 9 shows the first-ranked (highest-likelihood) class set, and FIG. 10 shows the second-ranked class set.

At this time, for the first-ranked class set shown in FIG. 9, as in STEP 4 of the first embodiment, the top-ranked speech candidate {Audio_CD_Nextpre_Next} in the word-string speech candidate list is compared with the top-ranked speech candidate {Audio_CD_DiscNo} in the per-word speech candidate list to determine whether they match. Since {Audio_CD_Nextpre_Next} and {Audio_CD_DiscNo} do not match, the determination result of STEP 4 is NO. Here, the top-ranked candidate {Audio_CD_DiscNo} in the per-word list is a speech candidate that carries the meaning of the keyword "disc" contained in the input speech. The process proceeds to STEP 5, where, as shown in FIG. 9, the score of {Audio_CD_DiscNo} in the per-word list is increased, the parsing score is calculated reflecting this score, and the first-ranked all-speech-candidate list is obtained. As a result, the candidate {Audio_CD_DiscNo}, ranked third in the word-string speech candidate list, has its score increased and becomes the second-ranked candidate in the first-ranked all-speech-candidate list. That is, the score of the speech candidate {Audio_CD_DiscNo}, which carries the meaning of the keyword, is increased and its rank rises.

For the second-ranked class set shown in FIG. 10, the word string score is used as the parsing score as it is, and the word-string speech candidate list is used as the second-ranked all-speech-candidate list as it is. The process then proceeds to STEP 6.

Next, in STEP 6, as shown in FIG. 11, the voice interaction unit 1 multiplies the score of each speech candidate in the second-ranked all-speech-candidate list by a predetermined value smaller than 1 (for example, 0.9). Then, from the speech candidate group obtained by combining the first-ranked and second-ranked all-speech-candidate lists, the final speech candidate list is determined based on the score of each speech candidate. The other operations are the same as in the first embodiment.
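The weighting and merging of the two all-speech-candidate lists in STEP 6 of this embodiment can be sketched as below. The function name `merge_candidate_lists`, the candidate `Hypothetical_Tuner_Next`, and all scores are assumptions for illustration; only the factor 0.9 and the two `Audio_CD_...` names come from the text. The text does not say how a candidate that appears in both lists is treated, so keeping the higher score is also an assumption.

```python
def merge_candidate_lists(first_list, second_list, weight=0.9, top_n=None):
    """Combine the first- and second-ranked all-candidate lists.

    Scores of the second list are multiplied by `weight` (< 1) before merging.
    """
    merged = dict(first_list)
    for name, score in second_list.items():
        weighted = score * weight
        # Assumption: if a candidate occurs in both lists, keep the higher score.
        if name not in merged or weighted > merged[name]:
            merged[name] = weighted

    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return ranked if top_n is None else ranked[:top_n]


# Assumed scores for illustration only.
first_list = {"Audio_CD_Nextpre_Next": 0.9, "Audio_CD_DiscNo": 0.7}
second_list = {"Hypothetical_Tuner_Next": 0.85}  # hypothetical candidate

final_list = merge_candidate_lists(first_list, second_list, weight=0.9, top_n=3)
for name, score in final_list:
    print(f"{name}: {score:.3f}")
```

Scaling the second list by a factor below 1 keeps candidates from the less likely class set available while preferring candidates from the most likely one when raw scores are close.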

According to the speech recognition device of this embodiment, as in the first embodiment, keywords are properly grasped and the recognition accuracy of utterances is improved, so that devices are controlled through efficient dialogue.

[Third Embodiment]

Next, a speech recognition device according to a third embodiment of the present invention will be described. This embodiment differs from the first embodiment only in the processing of STEP 6 of the voice interaction processing, which determines the final speech candidate group (the utterance recognition result), that is, the processing of the speech candidate group determination means 33. Since the configuration of this embodiment is the same as that of the first embodiment, the same components are given the same reference numerals and their description is omitted below.

In the speech recognition device of this embodiment, the speech candidate group determination means 33 displays, via the touch panel 24, a plurality of speech candidates included in the first and second speech candidate groups, chosen based on the score of each speech candidate, so as to prompt the driver to select among them, and also outputs a voice guide via the speaker 4. The speech candidate group determination means 33 then determines the final speech candidates based on a touch input by the driver that selects at least one of the speech candidates displayed on the touch panel 24. The scenario control unit 13 and the speech synthesis unit 14 constitute the notification means and the selection means of the present invention.

In the operation (voice interaction processing) of the speech recognition device of this embodiment, in STEP 6 the voice interaction unit 1 first determines, from the speech candidates included in the first and second speech candidate groups, a plurality of speech candidates to be displayed on the touch panel 24 based on the score of each speech candidate. Specifically, from an all-speech-candidate list such as that illustrated in FIG. 6(b), in which the score of the speech candidate carrying the meaning of the keyword has been increased and its rank raised, it determines a plurality of speech candidates whose parsing scores satisfy a predetermined condition. The predetermined condition is set in advance, for example: the speech candidate with the highest parsing score, the speech candidates ranked from the top down to a predetermined rank by parsing score, or the speech candidates whose parsing score is at least a predetermined value. The voice interaction unit 1 then presents these speech candidates on the screen of the touch panel 24 so as to prompt the driver to select among them. At the same time, the voice interaction unit 1 has the speech synthesis unit 14 synthesize a response sentence (voice guide) prompting the driver to make a selection and outputs it from the speaker 4.

これにより、運転者により、タッチパネル２４に表示された複数の音声候補のうち少なくとも１つを選択するタッチ入力がなされる。このとき、ＳＴＥＰ５で、キーワードの意味を含む音声候補のスコアが増やされ順位が高くなっているので、タッチパネル２４に画面表示される音声候補に運転者の発話に該当する音声候補を高い確率で含ませることができ、運転者は発話の意図に沿った音声候補を選択可能となる。そして、音声対話ユニット１は、このタッチ入力により選択された音声候補群を最終的な音声候補群として決定する。これにより、運転者の発話に合致した音声候補群が最終的な音声候補群として決定される。他の動作は第１実施形態と同じである。 Thereby, the driver performs touch input for selecting at least one of the plurality of voice candidates displayed on the touch panel 24. At this time, since the score of the voice candidate including the meaning of the keyword is increased and the rank is higher in STEP 5, the voice candidate corresponding to the driver's utterance is included in the voice candidate displayed on the touch panel 24 with a high probability. Thus, the driver can select a voice candidate according to the intention of the utterance. Then, the voice interaction unit 1 determines the voice candidate group selected by the touch input as the final voice candidate group. Thereby, the speech candidate group that matches the utterance of the driver is determined as the final speech candidate group. Other operations are the same as those in the first embodiment.
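The display-and-select flow of the third embodiment can be sketched in a few lines. Everything here is an illustrative assumption: the function names, the limit of three displayed items, the candidate `Hypothetical_Other`, the scores, and the representation of a touch input as a list of indices into the displayed candidates.

```python
def candidates_to_display(ranked, max_items=3):
    """Choose the candidate names to show on the touch panel by rank."""
    return [name for name, _score in ranked[:max_items]]


def resolve_touch_selection(displayed, touched_indices):
    """Map the driver's touch input to the final speech candidate group."""
    if not touched_indices:
        raise ValueError("at least one candidate must be selected")
    return [displayed[index] for index in touched_indices]


# Assumed all-candidate list after the STEP 5 adjustment (scores are illustrative).
ranked = [
    ("Climate_Fan-Vent_Floor", 0.8),
    ("Climate_Defrost_Front", 0.7),
    ("Hypothetical_Other", 0.6),
    ("Hypothetical_LowScore", 0.1),
]

displayed = candidates_to_display(ranked)
final_group = resolve_touch_selection(displayed, touched_indices=[1])
print(displayed, final_group)
```

Because the keyword-bearing candidate was raised in rank beforehand, it falls within the displayed items here and can be chosen by the driver.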

本実施形態の音声認識装置によれば、第１実施形態と同様に、キーワードが適切に把握されて発話の認識精度が向上するので、効率の良い対話を介して機器の制御が行われる。 According to the speech recognition apparatus of the present embodiment, as in the first embodiment, keywords are appropriately grasped and speech recognition accuracy is improved, so that the device is controlled through efficient dialogue.

In the first to third embodiments, the vehicle state detection unit 3 is provided and the scenario control unit 13 determines the scenario according to the recognition result and the detected vehicle state; however, the vehicle state detection unit 3 may be omitted, and the scenario control unit 13 may determine the control process from the recognition result alone.

また、第１〜第３実施形態においては、音声入力する使用者は、車両１０の運転者としたが、運転者以外の乗員としてもよい。 In the first to third embodiments, the user who inputs the voice is the driver of the vehicle 10, but may be an occupant other than the driver.

また、第１〜第３実施形態においては、音声認識装置は、車両１０に搭載されるものとしたが、車両以外の移動体に搭載されるものとしてもよい。さらに、移動体に限らず、使用者が発話により対象を制御するシステムに適用可能である。 In the first to third embodiments, the voice recognition device is mounted on the vehicle 10, but may be mounted on a moving body other than the vehicle. Furthermore, the present invention is not limited to a mobile object, and can be applied to a system in which a user controls an object by speaking.

FIG. 1 is a functional block diagram of the speech recognition device according to the first embodiment of the present invention.
FIG. 2 is an explanatory diagram showing the configuration of the language model and the syntax model of the speech recognition device of FIG. 1.
FIG. 3 is a flowchart showing the overall operation (voice interaction processing) of the speech recognition device of FIG. 1.
FIG. 4 is an explanatory diagram showing the text conversion processing using the language model in the voice interaction processing of FIG. 3.
FIG. 5 is an explanatory diagram showing the syntax analysis processing using the syntax model in the voice interaction processing of FIG. 3.
FIG. 6 is an explanatory diagram showing the processing that determines the final speech candidate group in the voice interaction processing of FIG. 3.
FIG. 7 is an explanatory diagram showing the forms used in the processing that determines the scenario in the voice interaction processing of FIG. 3.
FIG. 8 is an explanatory diagram showing the processing that determines the scenario in the voice interaction processing of FIG. 3.
FIG. 9 is an explanatory diagram showing the syntax analysis processing in the voice interaction processing of the speech recognition device according to the second embodiment of the present invention.
FIG. 10 is an explanatory diagram showing the syntax analysis processing in the voice interaction processing of FIG. 9.
FIG. 11 is an explanatory diagram showing the processing that determines the final speech candidate group in the voice interaction processing of FIG. 9.

Explanation of symbols

1: voice interaction unit, 2: microphone, 3: vehicle state detection unit, 4: speaker, 5: display, 6a to 6c: devices, 10: vehicle, 11: text conversion unit, 12: syntax analysis unit, 13: scenario control unit, 14: speech synthesis unit, 15: acoustic model, 16: language model, 17: syntax model, 18: scenario database, 19: phoneme model, 24: touch panel, 31: first syntax analysis means, 32: second syntax analysis means, 33: speech candidate group determination means, 34: comparison determination means.

Claims

A speech recognition device comprising: text conversion means for recognizing input speech and converting the speech into text expressed as a word string;
first syntax analysis means for executing a process of analyzing the text using a word string score calculated based on features of the words and word strings included in the text converted by the text conversion means, and for outputting the result of the process as a first speech candidate group together with the score of each speech candidate;
second syntax analysis means for executing a process of analyzing the text using a word score calculated based on features of the words included in the text converted by the text conversion means, and for outputting the result of the process as a second speech candidate group together with the score of each speech candidate;
speech candidate group determination means for determining a final speech candidate group from the first speech candidate group and the second speech candidate group based on the score of each speech candidate; and
comparison determination means for comparing the first speech candidate group with the second speech candidate group to determine whether there is a matching speech candidate,
wherein, when the comparison determination means determines that the first speech candidate, which has the highest score among the speech candidates included in the first speech candidate group, does not match the second speech candidate, which has the highest score among the speech candidates included in the second speech candidate group, the highest score of the second speech candidate is increased by a predetermined value.

The speech recognition device according to claim 1,
wherein the speech candidate group determination means determines, as the final speech candidate group, the speech candidates whose scores are within a predetermined rank among the speech candidates included in the first and second speech candidate groups.

The speech recognition device according to claim 1, further comprising:
notification means for notifying of a plurality of speech candidates included in the first and second speech candidate groups based on the score of each speech candidate; and selection means for selecting at least one of the plurality of speech candidates notified by the notification means, wherein the speech candidate group determination means determines the speech candidate group selected by the selection means as the final speech candidate group.

4. The speech recognition device according to claim 1, further comprising control means for executing a predetermined control process based at least on the final speech candidate group determined by the speech candidate group determination means.

A speech recognition method comprising: a text conversion step of recognizing input speech and converting the speech into text expressed as a word string;
a first syntax analysis step of executing a process of analyzing the text using a word string score calculated based on features of the words and word strings included in the text converted in the text conversion step, and outputting the result of the process as a first speech candidate group together with the score of each speech candidate;
a second syntax analysis step of executing a process of analyzing the text using a word score calculated based on features of the words included in the text converted in the text conversion step, and outputting the result of the process as a second speech candidate group together with the score of each speech candidate;
a comparison determination step of determining whether the first speech candidate, which has the highest score among the speech candidates included in the first speech candidate group, matches the second speech candidate, which has the highest score among the speech candidates included in the second speech candidate group;
a score determination step of increasing the highest score of the second speech candidate by a predetermined value when it is determined in the comparison determination step that they do not match; and
a speech candidate group determination step of determining a final speech candidate group from the first speech candidate group and the second speech candidate group based on the score of each speech candidate.

A speech recognition program causing a computer to execute: text conversion processing of recognizing input speech and converting the speech into text expressed as a word string;
first syntax analysis processing of analyzing the text using a word string score calculated based on features of the words and word strings included in the text converted by the text conversion processing, and outputting the result of the analysis as a first speech candidate group together with the score of each speech candidate;
second syntax analysis processing of analyzing the text using a word score calculated based on features of the words included in the text converted by the text conversion processing, and outputting the result of the analysis as a second speech candidate group together with the score of each speech candidate;
comparison determination processing of comparing the first speech candidate group with the second speech candidate group to determine whether there is a matching speech candidate;
score determination processing of increasing the highest score of the second speech candidate by a predetermined value when, as a result of the comparison determination processing, it is determined that the first speech candidate, which has the highest score among the speech candidates included in the first speech candidate group, does not match the second speech candidate, which has the highest score among the speech candidates included in the second speech candidate group; and
speech candidate group determination processing of determining a final speech candidate group from the first speech candidate group and the second speech candidate group based on the score of each speech candidate.