JP2007183516A

JP2007183516A - Voice interactive apparatus and speech recognition method

Info

Publication number: JP2007183516A
Application number: JP2006003048A
Authority: JP
Inventors: Takeshi Ono; 健大野; Keiko Katsuragawa; 景子桂川
Original assignee: Nissan Motor Co Ltd
Current assignee: Nissan Motor Co Ltd
Priority date: 2006-01-10
Filing date: 2006-01-10
Publication date: 2007-07-19

Abstract

PROBLEM TO BE SOLVED: To provide a voice interactive apparatus capable of reducing user's load at correction utterance. SOLUTION: When a first collation result, obtained by collating a first utterance voice by using a first language model including a word to be subjected to speech recognition, is not adopted, a signal processor 14 in the voice interactive device collates a second utterance speech by using a second language model including other words for expressing the attribute of the word, and the first utterance voice is subjected to the voice recognition again, by using the language model restricted by the obtained second collation result, and a system response is generated according to the voice recognition result. COPYRIGHT: (C)2007,JPO&INPIT

Description

本発明は、発話された音声に応じて対話をする音声対話装置、及び発話された音声を認識する音声認識方法に関する。 The present invention relates to a voice dialogue apparatus that performs a dialogue according to spoken voice, and a voice recognition method that recognizes spoken voice.

従来から、発話された音声に応じて対話をする音声対話装置を用いて、ユーザが発話した目的地の施設名称などを認識し、認識した目的地に対応する経路情報を提供するナビゲーション装置が提案されている（例えば、特許文献１など参照。）。 Conventionally, a navigation device has been proposed that recognizes the facility name of the destination spoken by the user using a voice dialogue device that performs dialogue according to the spoken voice and provides route information corresponding to the recognized destination. (See, for example, Patent Document 1).

In this type of navigation device, the user can input a destination by speaking only the facility name when the facility is in the prefecture where the vehicle is located. For a facility in another prefecture, the user can input the destination by speaking, in succession, the name of an attribute of the facility, such as the prefecture name, followed by the facility name. Furthermore, even if the input destination is misrecognized, the user can re-enter it by making a so-called correction utterance (rephrased utterance), for example by inputting a correction instruction and speaking again.
Japanese Patent Laid-Open No. 2002-350163

However, in the conventional voice dialogue apparatus, when the user makes a correction utterance to correct a misrecognized facility name, the previously input facility name is canceled in its entirety, so the user is forced to input the canceled facility name again from the beginning. For example, suppose the vehicle is in Tokyo and the user says “Oppama Station”, a station in Kanagawa Prefecture, as the destination. If the apparatus collates this against the names of stations in Tokyo and misrecognizes it as “Okutama Station”, the user has no choice but to say “Oppama Station in Kanagawa Prefecture” as the correction utterance. Because the conventional apparatus treats the earlier utterance as if it had never been input, the user must repeat content that has already been spoken.

そこで、本発明は、上述した実情に鑑みて提案されたものであり、訂正発話時におけるユーザの負担を軽減することができる音声対話装置及び音声認識方法を提供することを目的とする。 Therefore, the present invention has been proposed in view of the above-described circumstances, and an object of the present invention is to provide a voice interaction apparatus and a voice recognition method that can reduce the burden on the user at the time of corrected utterance.

The voice dialogue apparatus according to the present invention includes input means for inputting uttered speech, speech recognition means for recognizing the uttered speech input by the input means and generating a system response according to the speech recognition result, and output means for outputting the system response generated by the speech recognition means. When a first collation result, obtained by collating a first uttered speech using a first language model that includes the vocabulary to be recognized, is not adopted, the speech recognition means collates a second uttered speech using a second language model that includes other vocabulary representing an attribute of that vocabulary, recognizes the first uttered speech again using a language model limited by the resulting second collation result, and generates a system response according to the speech recognition result. The above-described problem is thereby solved.

The speech recognition method according to the present invention includes a speech recognition step of recognizing input uttered speech and generating a system response according to the speech recognition result, and an output step of outputting the system response generated in the speech recognition step. In the speech recognition step, when a first collation result, obtained by collating a first uttered speech using a first language model that includes the vocabulary to be recognized, is not adopted, a second uttered speech is collated using a second language model that includes other vocabulary representing an attribute of that vocabulary, the first uttered speech is recognized again using a language model limited by the resulting second collation result, and a system response is generated according to the speech recognition result. The above-described problem is thereby solved.

In the voice dialogue apparatus and the speech recognition method according to the present invention, even if the first uttered speech is misrecognized when collated using the first language model, the user only needs to speak, as the correction utterance, a second uttered speech containing vocabulary that represents an attribute of the vocabulary in the first uttered speech. In response, the first uttered speech is collated again using a language model limited by the second collation result, which is obtained by collating the second uttered speech using the second language model, so that the vocabulary of the first uttered speech can be recognized correctly.

Therefore, in the voice dialogue apparatus and the speech recognition method according to the present invention, the user does not need to repeat the first uttered speech within the second uttered speech when making a correction utterance, which reduces the burden on the user.
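The correction flow summarized above can be illustrated with a minimal Python sketch. It is not the patented implementation: romanized strings stand in for audio, `difflib.SequenceMatcher` stands in for acoustic collation, and the station lists, `collate`, and `recognize_with_correction` are hypothetical names introduced only for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical vocabulary: romanized names stand in for speech (assumption).
STATIONS_BY_PREFECTURE = {
    "tokyo": ["okutamaeki", "shinjukueki", "uenoeki"],
    "kanagawa": ["oppamaeki", "tsurumieki", "yokosukaeki"],
}


def collate(utterance, vocabulary):
    """Return (best_word, score). String similarity is only a stand-in
    for the acoustic matching described in the text."""
    scored = [(SequenceMatcher(None, utterance, w).ratio(), w) for w in vocabulary]
    score, word = max(scored)
    return word, score


def recognize_with_correction(first_utterance, second_utterance, current_prefecture):
    # First collation: first language model (stations near the vehicle).
    first_result, _ = collate(first_utterance, STATIONS_BY_PREFECTURE[current_prefecture])
    # The user rejects first_result and speaks only the attribute (prefecture),
    # which is collated against the second language model.
    attribute, _ = collate(second_utterance, list(STATIONS_BY_PREFECTURE))
    # The stored first utterance is recognized again with the language model
    # limited by the second collation result.
    final_result, _ = collate(first_utterance, STATIONS_BY_PREFECTURE[attribute])
    return first_result, attribute, final_result
```

With the vehicle in Tokyo, a first utterance of `"oppamaeki"` is initially matched to the closest Tokyo entry; after the correction utterance `"kanagawa"`, the same stored utterance is matched against Kanagawa stations only, so the user never repeats the station name.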

以下、本発明の実施の形態について図面を参照して説明する。 Hereinafter, embodiments of the present invention will be described with reference to the drawings.

この実施の形態として示す音声対話装置は、車両や、携帯端末装置などに搭載されるナビゲーション装置に適用した場合の構成である。この音声対話装置をナビゲーション装置に適用すると、ナビゲーション装置で要求される各種機能を、ユーザとシステムとのインタラクティブな対話によって動作させることができる。 The voice interaction device shown as this embodiment has a configuration when applied to a navigation device mounted on a vehicle, a portable terminal device or the like. When this voice interactive apparatus is applied to a navigation apparatus, various functions required by the navigation apparatus can be operated by interactive interaction between the user and the system.

なお、本発明は、これに限定されるものではなく、各種情報処理装置に搭載されて、各種機能をインタラクティブに段階的に動作させることができる、どのようなアプリケーションにも適用することができる。 The present invention is not limited to this, and can be applied to any application that is mounted on various information processing apparatuses and can operate various functions interactively in stages.

[First Embodiment]
[Configuration of voice interactive device]
First, the configuration of the voice interactive apparatus shown as the first embodiment of the present invention will be described with reference to FIG. 1.

音声対話装置は、信号処理ユニット１と、マイク２と、スピーカ３と、入力装置４と、ディスプレイ５とを備えている。なお、本発明にかかわる主要構成要素ではないことから、図示しないが、携帯端末装置のナビゲーション装置であれば、他に送受信手段を備えた構成であってもよく、また、車両に搭載されたナビゲーション装置であれば、送受信手段または通信接続手段などを備えた構成であってもよい。 The voice interactive apparatus includes a signal processing unit 1, a microphone 2, a speaker 3, an input device 4, and a display 5. Since it is not a main component according to the present invention, it is not shown in the figure, but if it is a navigation device for a portable terminal device, it may have a configuration provided with other transmission / reception means, and navigation mounted on a vehicle. If it is an apparatus, the structure provided with the transmission / reception means or the communication connection means etc. may be sufficient.

The signal processing unit 1 includes an A/D converter 11 that converts the speech uttered by the user and input from the microphone 2 into a digital audio signal and outputs it, a D/A converter 12 that converts the digital audio signal output from the signal processing device 14 as a system response into an analog audio signal and outputs it, an output amplifier 13 that amplifies the analog audio signal output from the D/A converter 12, the signal processing device 14, and an external storage device 15.

信号処理装置１４は、ＣＰＵ（Central Processing Unit）２１と、メモリ２２とを備えており、マイク２から、Ａ／Ｄコンバータ１１を介して入力されるユーザによって発話された音声の音声認識処理を実行し、音声認識処理結果を出力する。また、信号処理装置１４は、バージイン機能を備えており、当該信号処理装置１４によるシステム応答に割り込むように入力されたユーザの発話音声に対しても音声認識処理を実行することができる。 The signal processing device 14 includes a CPU (Central Processing Unit) 21 and a memory 22, and performs speech recognition processing of speech uttered by a user input from the microphone 2 via the A / D converter 11. The voice recognition processing result is output. In addition, the signal processing device 14 has a barge-in function, and can perform voice recognition processing on a user's uttered voice input so as to interrupt a system response by the signal processing device 14.

ＣＰＵ２１は、信号処理装置１４を統括的に制御する制御手段である。ＣＰＵ２１は、メモリ２２に記憶されている処理プログラムを読み出して実行し、音声認識処理を制御したり、バージイン機能の制御を行ったりする。 The CPU 21 is a control unit that comprehensively controls the signal processing device 14. The CPU 21 reads and executes a processing program stored in the memory 22 to control voice recognition processing and control barge-in functions.

通常、バージイン機能は、有効となっておらず機能していない。バージイン機能は、システム応答に対して、ユーザによる割り込み発話がなされると予測された場合にのみ有効となり、割り込み発話に対する音声認識処理が実行される。 Normally, the barge-in function is not enabled and is not functioning. The barge-in function is effective only when it is predicted that an interrupt utterance will be made by the user in response to the system response, and a speech recognition process for the interrupt utterance is executed.

The memory 22 stores in advance the processing programs executed by the CPU 21 and various frequently used data. In addition, when the CPU 21 executes the speech recognition processing, the recognition target words read from the external storage device 15 and their acoustic models are stored in the memory 22, whereby a dictionary of recognition target words is constructed.

信号処理装置１４のＣＰＵ２１は、このメモリ２２に構築された辞書を参照して、ユーザによって発話された発話音声の音声特徴パターンと、認識対象語の音響モデルの音声パターンとの一致度を演算することで音声認識処理を行う。 The CPU 21 of the signal processing device 14 refers to the dictionary constructed in the memory 22 and calculates the degree of coincidence between the voice feature pattern of the uttered voice spoken by the user and the voice pattern of the acoustic model of the recognition target word. The voice recognition process is performed.

The external storage device 15 stores various data used by the navigation device, recognition target data used in the speech recognition processing executed by the signal processing device 14, audio data for system responses, and the like. As recognition target data, the external storage device 15 stores the recognition target words to be recognized in the speech recognition processing, acoustic models of the recognition target words, and language models such as a network grammar that defines the recognition target words and their connection relationships. The acoustic model of a recognition target word defines acoustically meaningful subword models.

These acoustic models, which define acoustically meaningful subword models, are prepared separately for normal utterances, which are spoken at a typical speaking rate, and for correction utterances (rephrased utterances), which the user makes after judging that a normal utterance was misrecognized.

The network grammar is a set of rules that defines the connection relationships among recognition target words and can be represented, for example, by a hierarchical structure such as the one shown in FIG. 2. In the example of FIG. 2, the recognition target word “station name” forms the lower layer B, and “prefecture name” is defined as the upper layer A connected to the lower layer B.

By using a network grammar with the hierarchical structure defined for the recognition target words as shown in FIG. 2, the signal processing device 14 can recognize the speech even when the user makes an utterance such as “Tsurumi Station in Kanagawa Prefecture”.

Also, by switching between network grammars that separately contain “prefecture name” and “station name”, the signal processing device 14 can recognize the speech even when the utterance is first completed with “Kanagawa Prefecture” and “Tsurumi Station” is spoken afterward.
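As a rough illustration of the two-layer network grammar, the following Python sketch models layer A (prefecture names) and layer B (station names) as a dictionary. The entries and the function names `accepts_continuous` and `active_vocabulary` are assumptions made for this example; the patent does not specify a data structure.

```python
# Hypothetical two-layer network grammar:
# upper layer A (prefecture name) -> lower layer B (station names).
GRAMMAR = {
    "Kanagawa Prefecture": {"Tsurumi Station", "Oppama Station"},
    "Tokyo": {"Okutama Station"},
}


def accepts_continuous(prefecture, station):
    """True if 'prefecture + station' is a valid path from layer A to layer B,
    as in a continuous utterance such as 'Tsurumi Station in Kanagawa Prefecture'."""
    return station in GRAMMAR.get(prefecture, set())


def active_vocabulary(recognized_prefecture=None):
    """Grammar switching for split utterances: wait for a prefecture first,
    then switch to the station names connected to the recognized prefecture."""
    if recognized_prefecture is None:
        return set(GRAMMAR)
    return GRAMMAR[recognized_prefecture]
```

Calling `active_vocabulary()` yields the prefecture names to wait for; once “Kanagawa Prefecture” is recognized, `active_vocabulary("Kanagawa Prefecture")` narrows the waiting vocabulary to that prefecture's stations.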

マイク２は、ユーザの発話音声を、当該音声対話装置に入力する。マイク２から入力されたユーザの発話音声は、電気信号である音声信号に変換され、Ａ／Ｄコンバータ１１でデジタル音声信号に変換されて信号処理装置１４に供給される。 The microphone 2 inputs the user's uttered voice to the voice interaction device. The user's utterance voice input from the microphone 2 is converted into a voice signal that is an electrical signal, converted into a digital voice signal by the A / D converter 11, and supplied to the signal processing device 14.

スピーカ３は、システムの発話として、信号処理装置１４から出力され、Ｄ／Ａコンバータ１２でアナログ音声信号に変換され、出力アンプ１３で信号増幅されたアナログ音声信号を音声として出力する。 The speaker 3 outputs the analog audio signal output from the signal processing device 14 as an utterance of the system, converted into an analog audio signal by the D / A converter 12 and amplified by the output amplifier 13 as audio.

入力装置４は、ユーザによって押下される発話スイッチ４ａ及び訂正スイッチ４ｂを備えている。発話スイッチ４ａは、音声認識の開始指示を行うためのスイッチである。一方、訂正スイッチ４ｂは、ユーザによって発話された音声が、システムにおいて誤認識された場合に、訂正を行うためのスイッチである。なお、この訂正スイッチ４ｂを一定期間押し続けると、音声認識処理を途中で終了させることができる。 The input device 4 includes a speech switch 4a and a correction switch 4b that are pressed by the user. The utterance switch 4a is a switch for issuing a voice recognition start instruction. On the other hand, the correction switch 4b is a switch for performing correction when the voice uttered by the user is erroneously recognized in the system. If the correction switch 4b is kept pressed for a certain period, the voice recognition process can be terminated halfway.

ディスプレイ５は、例えばＬＣＤ（液晶表示装置）などで実現され、ナビゲーションの行き先や、探索条件設定時のガイダンス表示を行ったり、経路誘導などの画面を表示したり、信号処理装置１４による音声認識処理結果である応答画像を表示させたりする。 The display 5 is realized by, for example, an LCD (Liquid Crystal Display) or the like, and displays a navigation destination, guidance display at the time of setting search conditions, a route guidance screen, and the voice recognition processing by the signal processing device 14. A response image as a result is displayed.

[Operation of voice interactive device]
The voice interaction apparatus configured as described above performs processing according to the series of steps shown in FIG. 3. The figure shows the series of processing steps, for the case where a predetermined function of the navigation device is to be operated, from the user inputting the required settings via the voice interaction apparatus until the navigation device is operated.

まず、音声対話装置における信号処理装置１４は、ステップＳ１において、ユーザによる発話スイッチ４ａの操作によって発話開始が指示されたことに応じて、発話された音声に対する音声認識処理を開始する。 First, in step S1, the signal processing device 14 in the voice interaction device starts voice recognition processing for the spoken voice in response to an instruction to start utterance by an operation of the utterance switch 4a by the user.

続いて、信号処理装置１４は、ステップＳ２において、初期状態で認識対象語として待ち受ける文法（言語モデル）を読み出し、メモリ２２に設定することにより、語彙の初期化を行う。すなわち、信号処理装置１４は、認識対象データとして外部記憶装置１５に記憶されている音声認識処理で音声認識の対象となる認識対象語や、認識対象語の音響モデル、さらには、認識対象語とその接続関係を規定したネットワーク文法などの言語モデルを読み出し、メモリ２２に設定する。 Subsequently, in step S 2, the signal processing device 14 reads a grammar (language model) that is awaited as a recognition target word in an initial state and sets it in the memory 22 to initialize the vocabulary. That is, the signal processing device 14 recognizes a recognition target word that is a target of speech recognition in the speech recognition processing stored in the external storage device 15 as recognition target data, an acoustic model of the recognition target word, and a recognition target word. A language model such as a network grammar that defines the connection relationship is read out and set in the memory 22.

ここで、信号処理装置１４は、例えば図４乃至図７に示すような住所や施設名称などを認識対象とする。 Here, for example, the signal processing device 14 recognizes addresses and facility names as shown in FIGS. 4 to 7.

Specifically, as shown in FIG. 4, based on an address grammar composed of “prefecture name”, “city name”, “ward/town/village name”, and so on, the signal processing device 14 can recognize a continuous utterance of an address such as “Natsushima-cho, Yokosuka City, Kanagawa Prefecture” as well as utterances divided word by word, such as “Kanagawa Prefecture”, “Yokosuka City”, and “Natsushima-cho”. Likewise, as shown in FIG. 5, based on a facility grammar composed of “prefecture name”, “station name”, and so on, the signal processing device 14 can recognize a continuous utterance of a facility such as “Oppama Station, Kanagawa Prefecture” as well as utterances divided word by word, such as “Kanagawa Prefecture” and “Oppama Station”.

さらに、信号処理装置１４は、図６及び図７に示すように、自車両位置Ｏの近傍に存在する施設名称を集めて動的に構築される文法（言語モデル）を利用することもできる。図６は、領域Ａに存在する施設を文法に登録し、領域Ａ以外の領域に存在する施設は登録されない場合の例を示している。ここで、領域Ａは、通常、半径数十キロメートルの円領域などとされる。一方、図７は、領域Ａを内包する領域Ｃに存在する施設を文法に登録し、領域Ｂに存在する施設は登録されない場合の例を示している。なお、領域Ａは、領域Ｃと比較して詳細度の高い施設名称までを抽出して文法が構築される領域である。通常、領域Ａは、半径数十キロメートルの円領域などとされ、領域Ｃは、半径数百キロメートルの円領域などとされる。なお、動的に構築される文法の領域としては、自車両が存在する都道府県領域とすることもできる。この実施の形態においては、自車両位置が東京都内であり、領域Ａには数千箇所程度の近傍施設名称が含まれているものとする。 Furthermore, as shown in FIGS. 6 and 7, the signal processing device 14 can also use a grammar (language model) that is dynamically constructed by collecting facility names existing in the vicinity of the host vehicle position O. FIG. 6 shows an example in which facilities existing in the area A are registered in the grammar, and facilities existing in the areas other than the area A are not registered. Here, the region A is usually a circular region having a radius of several tens of kilometers. On the other hand, FIG. 7 shows an example in which a facility existing in the region C containing the region A is registered in the grammar and a facility existing in the region B is not registered. The area A is an area where a grammar is constructed by extracting up to a facility name having a higher degree of detail than the area C. Usually, the area A is a circular area having a radius of several tens of kilometers, and the area C is a circular area having a radius of several hundred kilometers. The dynamically constructed grammar area may be a prefectural area where the vehicle is present. In this embodiment, it is assumed that the position of the host vehicle is in Tokyo, and the area A includes several thousand nearby facility names.
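One way to picture a grammar built dynamically around the vehicle position O is to filter facilities by distance, as in the Python sketch below. The great-circle (haversine) distance, the coordinates, and the facility names are all assumptions for illustration; the text only says that the region is typically a circle with a radius of tens or hundreds of kilometers.

```python
import math

EARTH_RADIUS_KM = 6371.0

# Hypothetical vehicle position and facilities as (name, (latitude, longitude)).
VEHICLE = (35.68, 139.77)
FACILITIES = [
    ("near facility", (35.70, 139.80)),
    ("mid facility", (35.45, 139.64)),
    ("far facility", (34.70, 135.50)),
]


def distance_km(a, b):
    """Great-circle distance between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def build_grammar(vehicle, facilities, radius_km):
    """Register only the facilities inside the circular region around the vehicle."""
    return [name for name, pos in facilities if distance_km(vehicle, pos) <= radius_km]
```

A small radius plays the role of region A and a larger one the role of region C; facilities outside the chosen radius are simply absent from the grammar and therefore cannot be recognized.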

また、信号処理装置１４は、図示しないナビゲーション操作コマンドなどを認識対象としてもよい。 The signal processing device 14 may recognize a navigation operation command (not shown) as a recognition target.

信号処理装置１４は、このようにして語彙の初期化を行うと、図３中ステップＳ３において、外部記憶装置１５に記憶された告知音声データを読み出して、Ｄ／Ａコンバータ１２、出力アンプ１３、スピーカ３を介して出力させることで、処理を開始した旨をユーザに告知し、発話要求を行う。 When the signal processing device 14 initializes the vocabulary in this manner, in step S3 in FIG. 3, the signal processing device 14 reads out the notification voice data stored in the external storage device 15, and the D / A converter 12, the output amplifier 13, By outputting through the speaker 3, the user is notified that the process has been started, and an utterance request is made.

すなわち、ユーザは、スピーカ３を介して出力される、信号処理装置１４による処理が開始された旨を知らせる告知音声を聞いたことに応じて、認識対象データに含まれる認識対象語の発話を開始する。ユーザによって発話され、マイク２を介して入力された音声は、Ａ／Ｄコンバータ１１でデジタル音声信号に変化されて、信号処理装置１４に出力される。 That is, the user starts uttering the recognition target word included in the recognition target data in response to listening to the notification voice that is output via the speaker 3 and informing that the processing by the signal processing device 14 has started. To do. The voice uttered by the user and input via the microphone 2 is converted into a digital voice signal by the A / D converter 11 and output to the signal processing device 14.

続いて、信号処理装置１４は、ステップＳ４において、ユーザによって発話された音声の取り込みを開始する。 Subsequently, in step S4, the signal processing device 14 starts capturing the voice uttered by the user.

Normally, until the utterance switch 4a is operated, the signal processing device 14 computes the average power of the output of the A/D converter 11 (the digital audio signal). When the utterance switch 4a is operated in step S1, the signal processing device 14 compares the computed average power with the instantaneous power of the input digital audio signal. When the instantaneous power exceeds the computed average power by a predetermined value or more, the signal processing device 14 judges that a speech section in which the user is speaking has begun and starts capturing the speech. The signal processing device 14 continues to compute the average power thereafter and judges that the user's utterance has ended when the average power falls below a predetermined value.

続いて、信号処理装置１４は、ステップＳ５において、取り込まれた発話音声と、外部記憶装置１５からメモリ２２に読み込まれた認識対象語との一致度演算を開始する。一致度は、音声区間部分のデジタル音声信号と、個々の認識対象語がどの程度似ているのかをスコアとして示したものである。例えば、信号処理装置１４は、スコアの値が大きい認識対象語ほど一致度が高いと評価する。なお、信号処理装置１４は、この一致度演算を実行している間も、並列処理により音声取り込みを継続して実行する。 Subsequently, in step S 5, the signal processing device 14 starts a degree-of-match calculation between the captured utterance and the recognition target word read from the external storage device 15 into the memory 22. The degree of coincidence indicates how similar the digital speech signal in the speech section part is to each recognition target word as a score. For example, the signal processing device 14 evaluates that the recognition target word having a larger score value has a higher matching degree. Note that the signal processing device 14 continues to execute voice capturing by parallel processing while executing the coincidence calculation.

Then, in step S6, when the instantaneous power of the digital audio signal has remained at or below a predetermined value for a predetermined time or longer, the signal processing device 14 judges that the user's utterance has ended and ends the speech capture.
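The start and end decisions of steps S4 and S6 amount to a simple power-based endpoint detector. The sketch below is one possible reading, assuming the signal has already been reduced to per-frame power values; the parameter names (`start_margin`, `end_level`, `min_silence_frames`) and the frame-based timing are assumptions, since the text only speaks of "predetermined" values and times.

```python
def detect_utterance(frame_powers, background_power, start_margin,
                     end_level, min_silence_frames):
    """Return (start, end) frame indices of the utterance (end exclusive),
    or None if no frame rises far enough above the background power."""
    start = None
    silent = 0
    for i, power in enumerate(frame_powers):
        if start is None:
            # Start: power exceeds the average background power by a margin.
            if power >= background_power + start_margin:
                start = i
        elif power <= end_level:
            # End: power stays at or below end_level long enough.
            silent += 1
            if silent >= min_silence_frames:
                return start, i - min_silence_frames + 1
        else:
            silent = 0  # a short dip does not end the utterance
    if start is None:
        return None
    return start, len(frame_powers)
```

For example, with `frame_powers = [1, 1, 9, 8, 2, 9, 1, 1, 1, 7]`, a background power of 1, a margin of 5, an end level of 2, and three silent frames required, the single low frame at index 4 is ignored and the utterance spans frames 2 through 5.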

信号処理装置１４は、音声取り込みを終了すると、ステップＳ７において、ステップＳ５における一致度演算が終了するまで待機し、認識対象語を音声認識結果として確定したか否かを判定する。具体的には、信号処理装置１４は、音声認識結果の信頼度を演算し、その信頼度が所定の閾値以上であった場合に音声認識結果として確定する。なお、音声認識結果の信頼度の演算については、“駒谷、河原著、「音声対話システムにおける音声認識結果の信頼度の利用法」、日本音響学会全国大会論文集３−５−２、ｐｐ．７３−７４、２０００年”などに詳細に記載されている。 After completing the voice capturing, the signal processing device 14 waits in step S7 until the coincidence calculation in step S5 is completed, and determines whether or not the recognition target word is confirmed as the speech recognition result. Specifically, the signal processing device 14 calculates the reliability of the speech recognition result, and determines the speech recognition result when the reliability is equal to or higher than a predetermined threshold. The calculation of the reliability of the speech recognition result is described in “Komatani, Kawahara,“ Usage of the reliability of the speech recognition result in the speech dialogue system ”, Japanese Acoustical Society Proceedings 3-5-2, pp. 73-74, 2000 "and the like.

ここで、信号処理装置１４は、認識対象語を音声認識結果として確定した場合には、ステップＳ８へと処理を移行する一方で、認識対象語を音声認識結果として確定しない場合には、ステップＳ１２へと処理を移行し、再度発話してもらいたい旨をユーザに告知するために、再発話要求を行い、ステップＳ４からの処理を繰り返す。 Here, when the recognition target word is determined as the speech recognition result, the signal processing device 14 proceeds to step S8, whereas when the recognition target word is not determined as the speech recognition result, the signal processing device 14 proceeds to step S12. In order to notify the user that he / she wants to speak again, a re-utterance request is made and the processing from step S4 is repeated.

Here, the vehicle is in Tokyo, and the dictionary that is dynamically constructed according to the vehicle position contains only the names of facilities located in and around Tokyo. It is therefore assumed that, although the user said “Oppama Station”, which is in Kanagawa Prefecture, as the destination, the utterance was collated against the names of stations in Tokyo and misrecognized as “Okutama Station”. In this case, “Okutama Station” sounds very similar to “Oppama Station” and its confidence is sufficiently high, so the signal processing device 14 confirms “Okutama Station” as the speech recognition result and proceeds to step S8.

In step S8, the signal processing device 14 determines whether the recognized vocabulary includes vocabulary from the lowest layer of a grammar such as the one shown in FIG. 5. If the recognized vocabulary does not include lowest-layer vocabulary, the processing proceeds to step S13; if it does, the processing proceeds to step S9. In this case, the recognized word “Okutama Station” is itself lowest-layer vocabulary carrying location information, so the signal processing device 14 proceeds to step S9.

続いて、信号処理装置１４は、ステップＳ９において、システム応答を生成して出力する。具体的には、信号処理装置１４は、図示しない音声合成処理機能を用いて音声認識結果である「奥多摩駅」を音声信号に変換する。この音声信号は、Ｄ／Ａコンバータ１２でアナログ音声信号に変換され、出力アンプ１３で信号増幅された上で、スピーカ３を介して音声として出力される。 Subsequently, the signal processing device 14 generates and outputs a system response in step S9. Specifically, the signal processing device 14 converts “Okutama Station”, which is a speech recognition result, into a speech signal using a speech synthesis processing function (not shown). This audio signal is converted into an analog audio signal by the D / A converter 12, amplified by the output amplifier 13, and then output as audio through the speaker 3.

そして、信号処理装置１４は、ステップＳ１０において、ユーザによる訂正スイッチ４ｂの押下があるか否かを所定時間待ち受ける。ここで、信号処理装置１４は、訂正スイッチ４ｂの押下がなかった場合には、音声認識結果をユーザが認容したと判断し、ステップＳ１１において、音声認識結果を決定し、その音声認識結果に応じた処理を行う。例えば、ナビゲーション装置に適用された音声対話装置においては、音声認識結果の住所を目的地として設定し、一連の処理を終了する。一方、信号処理装置１４は、訂正スイッチ４ｂの押下があった場合には、音声認識結果をユーザが否定したと判断し、ステップＳ１２へと処理を移行する。 In step S10, the signal processing device 14 waits for a predetermined time whether or not the correction switch 4b is pressed by the user. Here, if the correction switch 4b is not pressed, the signal processing device 14 determines that the user has accepted the speech recognition result, determines the speech recognition result in step S11, and responds to the speech recognition result. Process. For example, in a voice interaction device applied to a navigation device, an address of a voice recognition result is set as a destination, and a series of processing ends. On the other hand, if the correction switch 4b is pressed, the signal processing device 14 determines that the user has denied the voice recognition result, and proceeds to step S12.

なお、ここでは、ユーザによる発話音声「追浜駅」に対して、システム応答が「奥多摩駅」となったことから、当該ユーザが訂正スイッチ４ｂを押下し、ステップＳ１２へと処理を移行し、再度発話してもらいたい旨をユーザに告知するために、再発話要求を行い、ステップＳ４からの処理を繰り返すことになる。 Here, since the system response is “Okutama Station” for the speech voice “Oppama Station” by the user, the user presses the correction switch 4b, shifts the processing to Step S12, and again. In order to notify the user that he / she wants to speak, a re-utterance request is made and the processing from step S4 is repeated.

In steps S4 to S7, the signal processing device 14 performs speech recognition on the user's second utterance. Suppose that, because "Oppama Station" was misrecognized as "Okutama Station", which is located in Tokyo, the user says "Kanagawa Prefecture" as the corrective second utterance. The speech uttered by the user and input through the microphone 2 is converted into a digital audio signal by the A/D converter 11 and output to the signal processing device 14. Suppose further that the signal processing device 14 then performs speech recognition and correctly obtains "Kanagawa Prefecture" as the result.

In this case, in step S8, because the recognized word "Kanagawa Prefecture" belongs to an upper layer of the vocabulary hierarchy, the signal processing device 14 proceeds to step S13. In step S13, as in step S7, it evaluates the confidence of the recognition result "Kanagawa Prefecture" and confirms the result if the confidence is at or above a predetermined threshold. This evaluation uses a threshold higher than the one used in step S7. Only when the result is sufficiently reliable under this stricter threshold does the device confirm the recognized word as the speech recognition result and proceed to step S14. If the result is not confirmed, the device gives up on reusing the first utterance, "Oppama Station", and proceeds to step S21. Here, it is assumed that the result is confirmed and processing proceeds to step S14.
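The stage-dependent confidence check can be sketched as follows. The numeric thresholds and function names are assumptions made for illustration; the text states only that the threshold in step S13 is higher than the one in step S7.

```python
# Hypothetical values: the source gives no numbers, only the ordering
# (the corrective-utterance threshold is stricter than the first-pass one).
FIRST_PASS_THRESHOLD = 0.60   # used in step S7
CORRECTION_THRESHOLD = 0.80   # used in step S13


def passes_first_pass(confidence: float) -> bool:
    """Step S7: confirm the result if confidence reaches the base threshold."""
    return confidence >= FIRST_PASS_THRESHOLD


def step_after_s13(confidence: float) -> str:
    """Step S13: confirm the upper-layer word only under the stricter
    threshold; otherwise give up on reusing the first utterance."""
    return "S14" if confidence >= CORRECTION_THRESHOLD else "S21"
```

A result with confidence 0.70 would pass step S7 in this sketch but would not be trusted enough in step S13 to restrict the grammar.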

Next, in step S14, the signal processing device 14 determines whether the immediately preceding system response was rejected by the user. If a preceding utterance exists and its result was rejected, the device judges that the user intends to correct a misrecognition and proceeds to step S15; otherwise, it proceeds to step S21. In this example, the user said "Kanagawa Prefecture" as a corrective second utterance after "Oppama Station" was misrecognized as "Okutama Station" in Tokyo, so processing proceeds to step S15.

Next, in step S15, the signal processing device 14 trusts the recognition result "Kanagawa Prefecture" and sets the facilities in Kanagawa Prefecture as the standby words. That is, it narrows the standby range to only the facility-name vocabulary located in the layer below Kanagawa Prefecture in the grammar shown in FIG. 5, and places the grammar shown in FIG. 4, the other parts of FIG. 5, and the grammars shown in FIG. 6 and FIG. 7 outside the standby range.
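The narrowing in step S15 can be sketched with a toy two-layer grammar. The dictionary layout, its entries, and the function name are illustrative assumptions and do not reproduce the actual grammar of FIG. 5.

```python
from typing import Dict, List

# Toy facility grammar: upper-layer word (prefecture) -> facility names.
# The entries are placeholders for illustration only.
FACILITY_GRAMMAR: Dict[str, List[str]] = {
    "Tokyo": ["Okutama Station", "Kasai Rinkai Park"],
    "Kanagawa Prefecture": ["Oppama Station", "Yokohama Station"],
}


def narrow_standby_words(grammar: Dict[str, List[str]],
                         upper_word: str) -> List[str]:
    """Keep only the facility names below the recognized upper-layer word;
    every other part of the grammar falls outside the standby range."""
    return list(grammar.get(upper_word, []))
```

Returning a copy of the list leaves the full grammar untouched, so it can be restored after the re-recognition attempt.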

Next, in step S16, the signal processing device 14 performs speech recognition again by collating the user's first utterance, which was saved during the previous recognition, against the grammar narrowed down in step S15.

Then, in step S17, as in step S7, the signal processing device 14 evaluates the confidence of the recognition result "Oppama Station" and confirms the result if the confidence is at or above a predetermined threshold. This evaluation also uses a threshold higher than the one used in step S7. Only when the result is sufficiently reliable under this stricter threshold does the device confirm the recognized word as the speech recognition result and proceed to step S18. If the result is not confirmed, the device gives up on reusing the first utterance, discards it in step S20, and proceeds to step S21.
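Steps S16 and S17 can be sketched as re-scoring a stored utterance against the narrowed vocabulary. This is a toy stand-in rather than the recognizer described in the source: the stored audio is replaced by a noisy romanized string, acoustic collation is replaced by `difflib.SequenceMatcher` string similarity, and the threshold value and function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional, Tuple

RERECOGNITION_THRESHOLD = 0.90  # hypothetical; stricter than the first pass


def recognize(stored_utterance: str,
              standby_words: Iterable[str]) -> Tuple[str, float]:
    """Return the standby word most similar to the stored utterance,
    together with its similarity score (a stand-in for confidence).
    Raises ValueError if standby_words is empty."""
    scored = [(SequenceMatcher(None, stored_utterance, word).ratio(), word)
              for word in standby_words]
    score, word = max(scored)
    return word, score


def rerecognize(stored_utterance: str,
                narrowed_words: Iterable[str]) -> Optional[str]:
    """Steps S16-S17: re-score the saved first utterance against the
    narrowed vocabulary and accept it only above the stricter threshold."""
    word, score = recognize(stored_utterance, narrowed_words)
    return word if score >= RERECOGNITION_THRESHOLD else None


# Romanized toy vocabularies and a noisy rendering of "Oppama Station".
tokyo_words = ["okutama eki", "tokyo eki"]
kanagawa_words = ["oppama eki", "yokohama eki"]
stored = "opama eki"
```

With these toy strings, the Tokyo vocabulary yields "okutama eki" as the best (wrong) match, while the Kanagawa vocabulary yields "oppama eki" with a score above the stricter threshold.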

When the recognized word is confirmed as the speech recognition result, in step S18 the signal processing device 14 uses the speech synthesis function (not shown) to convert the recognition result, "Oppama Station in Kanagawa Prefecture", into a speech signal. The D/A converter 12 converts this signal into an analog audio signal, the output amplifier 13 amplifies it, and the speaker 3 outputs it as sound.

Then, in step S19, the signal processing device 14 waits for a predetermined time to see whether the user presses the correction switch 4b. If the switch is not pressed, the device judges that the user has accepted the speech recognition result; in step S11 it finalizes the result, performs the corresponding processing as described above, and ends the series of processes. If the switch is pressed, the device judges that the user has rejected the recognition result, discards the first utterance in step S20, and proceeds to step S21.

When processing reaches step S21, only the utterance "Kanagawa Prefecture" has been accepted. In step S22, the signal processing device 14 therefore waits for a predetermined time to see whether the user presses the correction switch 4b. If the switch is not pressed, the device judges that the user has accepted the recognition result; in step S23 it restricts the grammar to addresses and facilities in Kanagawa Prefecture and repeats the processing from step S4. If the switch is pressed, the device judges that the user has rejected the recognition result; in step S12 it issues a re-utterance request asking the user to speak again, and then repeats the processing from step S4.

By operating according to this series of procedures, the voice interactive apparatus can set the correct destination by recognizing the second utterance, even when the first utterance was misrecognized, and can then operate the relevant function of the navigation device.

[Effects of the First Embodiment]
As described in detail above, in the voice interactive apparatus of the first embodiment, when a first collation result, obtained by collating the first utterance using a first language model that includes the vocabulary to be recognized, is not adopted, the signal processing device 14 collates the second utterance using a second language model that includes other vocabulary representing an attribute of that vocabulary. It then recognizes the first utterance again using a language model limited by the resulting second collation result, and generates a system response according to the speech recognition result.

As a result, when making a corrective utterance, the user does not need to repeat the first utterance in the second utterance, which reduces the burden on the user.

Specifically, as the language model limited by the second collation result, the signal processing device 14 narrows down to a language model containing vocabulary that lies within the location indicated by the second utterance and that represents the same attribute as the first utterance. For example, suppose the vehicle is in Tokyo and the user says "Oppama Station" as the destination, but the apparatus collates the utterance against the names of facilities in Tokyo and misrecognizes it as "Okutama Station". Even then, the user only needs to say "Kanagawa Prefecture", the prefecture that is an attribute of "Oppama Station", as the corrective utterance. In response, the apparatus collates the first utterance, "Oppama Station", against the names of facilities in Kanagawa Prefecture and can output the correct recognition result "Oppama Station". The second utterance may also include attributes other than the prefecture, as in "Yokosuka City in Kanagawa Prefecture" or "Keihin Kyuko Main Line".

In addition, the signal processing device 14 recognizes the first utterance again only when the confidence of the recognition result for the second utterance is at or above a predetermined threshold. When the second collation result is unreliable, the apparatus therefore does not switch dictionaries on that basis and run recognition again, which avoids a re-recognition result that is itself a misrecognition. It also avoids the increase in computation that needless re-recognition would cause.

Furthermore, the signal processing device 14 generates a system response only when the confidence of the re-recognition result for the first utterance is at or above a predetermined threshold. This avoids presenting a re-recognition result that is a misrecognition.

Moreover, as the threshold applied to the confidence of the recognition result for the second utterance, or to the confidence of the re-recognition result for the first utterance, the signal processing device 14 uses a threshold higher than the one applied to the confidence of the initial recognition result for the first utterance. This raises the bar for adopting a result obtained by switching dictionaries and recognizing again, and so more reliably avoids a re-recognition result that is a misrecognition.

[Second Embodiment]
Next, a voice interactive apparatus according to a second embodiment of the present invention will be described.

The voice interactive apparatus of the second embodiment has the same configuration as that of the first embodiment described with reference to FIG. 1. Its components are therefore denoted by the same reference numerals, and their description is omitted.

Like the apparatus of the first embodiment, the voice interactive apparatus of the second embodiment limits the language model used for a corrective utterance after a misrecognition, so that the user does not have to repeat the same utterance. It differs in that it limits the language model based on the distance from the location indicated by the user's utterance.

Accordingly, the processing operation of the second embodiment differs from that of the first embodiment only in part of the flowchart described with reference to FIG. 3. Identical processing is given the same step numbers, and its description is omitted.

[Operation of the Voice Interactive Apparatus]
The voice interactive apparatus of the second embodiment operates according to the series of procedures shown in FIG. 8. The figure shows the series of processing steps from the user entering the required settings through the voice interactive apparatus to the navigation device being operated, for the case where a given function of the navigation device is to be used.

The signal processing device 14 of the voice interactive apparatus performs steps S1 to S6 and, when speech capture is complete, computes the confidence of the speech recognition result in step S7 and confirms the result if the confidence is at or above a predetermined threshold.

In this example, the vehicle is in Tokyo, and the dictionary that is dynamically constructed according to the vehicle position contains only the names of facilities in and around Tokyo. Suppose the user says "Japan Agency for Marine-Earth Science and Technology", which is located in Kanagawa Prefecture, as the destination, but the apparatus collates the utterance against the names of facilities in Tokyo and misrecognizes it as "Kasai Rinkai Park". Because the Japanese pronunciation of "Kasai Rinkai Park" is very similar to that of "Japan Agency for Marine-Earth Science and Technology", the confidence is sufficiently high, so the signal processing device 14 confirms "Kasai Rinkai Park" as the speech recognition result and proceeds to step S8.

In step S8, the signal processing device 14 determines whether the recognized vocabulary includes a word from the lowest layer of a grammar such as the one shown in FIG. 5. If it does not, the device proceeds to step S21; if it does, the device proceeds to step S31. In this case, the recognized word "Kasai Rinkai Park" is itself a lowest-layer word that carries location information, so the device proceeds to step S31.

Next, in step S31, the signal processing device 14 determines whether the immediately preceding system response was rejected by the user. If a preceding utterance exists and its result was rejected, the device judges that the user intends to correct a misrecognition and proceeds to step S32; otherwise, it proceeds to step S9. Here, only the initial utterance (the first utterance) has been input, so processing proceeds to step S9.
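The branching of steps S8 and S31 in the second embodiment can be summarized in a small routing function. This is only a sketch of the decisions described above; the function and parameter names are hypothetical.

```python
def route_after_s8(includes_lowest_layer_word: bool,
                   previous_response_rejected: bool) -> str:
    """Return the next flowchart step after the checks in S8 and S31."""
    if not includes_lowest_layer_word:
        # S8: only upper-layer words (e.g. a prefecture) were recognized.
        return "S21"
    if previous_response_rejected:
        # S31: the user is correcting a misrecognition, so build the
        # distance-limited grammar.
        return "S32"
    # S31: first utterance, nothing to correct; output the response.
    return "S9"
```

For the first utterance "Kasai Rinkai Park" this returns "S9"; for the corrective utterance that follows a rejected response it returns "S32".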

Next, in step S9, the signal processing device 14 generates and outputs a system response. Specifically, it uses the speech synthesis function (not shown) to convert the speech recognition result, "Kasai Rinkai Park", into a speech signal. The D/A converter 12 converts this signal into an analog audio signal, the output amplifier 13 amplifies it, and the speaker 3 outputs it as sound.

In this example, the system responded "Kasai Rinkai Park" to the user's utterance "Japan Agency for Marine-Earth Science and Technology", so the user presses the correction switch 4b. Processing proceeds to step S12, where the device issues a re-utterance request asking the user to speak again, and the processing from step S4 is repeated.

In steps S4 to S8, the signal processing device 14 performs speech recognition on the user's second utterance. Suppose that, because "Japan Agency for Marine-Earth Science and Technology" was misrecognized as "Kasai Rinkai Park" in Tokyo, the user says "Oppama Station in Kanagawa Prefecture", a substitute facility for "Japan Agency for Marine-Earth Science and Technology", as the corrective second utterance. The speech uttered by the user and input through the microphone 2 is converted into a digital audio signal by the A/D converter 11 and output to the signal processing device 14. Suppose further that the signal processing device 14 then performs speech recognition and correctly obtains "Oppama Station in Kanagawa Prefecture" as the result.

In this case, in step S8, because the user's utterance "Oppama Station in Kanagawa Prefecture" includes a lowest-layer word, the signal processing device 14 proceeds to step S31 and determines whether the immediately preceding system response was rejected by the user. Here, the user said "Oppama Station in Kanagawa Prefecture" as a corrective second utterance after "Japan Agency for Marine-Earth Science and Technology" was misrecognized as "Kasai Rinkai Park" in Tokyo, so processing proceeds to step S32.

In step S32, the signal processing device 14 trusts the recognition result "Oppama Station in Kanagawa Prefecture" and sets the facilities located within a predetermined distance of Oppama Station as the standby words. That is, as shown for example in FIG. 9, it collects the names of nearby facilities within a predetermined distance of Oppama Station A, dynamically constructs a grammar (language model) from them, and narrows the standby range to that facility-name vocabulary only. In doing so, the signal processing device 14 extracts facility names with a high level of detail as the facilities near Oppama Station A and incorporates them into the grammar.
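The distance-based grammar construction of step S32 can be sketched with a great-circle distance filter. Everything specific here is an assumption for illustration: the source does not say how distance is computed or what the "predetermined distance" is, and the coordinates below are rough placeholders rather than authoritative positions.

```python
import math
from typing import Dict, List, Tuple

LatLon = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: LatLon, b: LatLon) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def build_nearby_grammar(anchor: LatLon,
                         facilities: Dict[str, LatLon],
                         radius_km: float) -> List[str]:
    """Collect the names of facilities within radius_km of the anchor;
    these names form the dynamically built standby vocabulary."""
    return sorted(name for name, pos in facilities.items()
                  if haversine_km(anchor, pos) <= radius_km)


# Rough placeholder coordinates, for illustration only.
OPPAMA_STATION: LatLon = (35.31, 139.63)
FACILITIES: Dict[str, LatLon] = {
    "Japan Agency for Marine-Earth Science and Technology": (35.32, 139.65),
    "Kasai Rinkai Park": (35.64, 139.86),
}
nearby = build_nearby_grammar(OPPAMA_STATION, FACILITIES, radius_km=5.0)
```

With a hypothetical 5 km radius, the resulting vocabulary keeps the facility near Oppama Station and drops the one in Tokyo, which is what allows the stored first utterance to be matched correctly in step S33.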

Next, in step S33, the signal processing device 14 performs speech recognition again by collating the user's first utterance, which was saved during the previous recognition, against the grammar narrowed down in step S32. Here, it is assumed that "Japan Agency for Marine-Earth Science and Technology" is obtained as the recognition result.

Then, in step S34, as in step S7, the signal processing device 14 evaluates the confidence of the recognition result "Japan Agency for Marine-Earth Science and Technology" and confirms the result if the confidence is at or above a predetermined threshold. This evaluation uses a threshold higher than the one used in step S7. Only when the result is sufficiently reliable under this stricter threshold does the device confirm the recognized word as the speech recognition result and proceed to step S35. If the result is not confirmed, the device gives up on reusing the first utterance, discards it in step S37, and proceeds to step S38.

When the recognized word is confirmed as the speech recognition result, in step S35 the signal processing device 14 uses the speech synthesis function (not shown) to convert the recognition result, "Japan Agency for Marine-Earth Science and Technology near Oppama Station in Kanagawa Prefecture", into a speech signal. The D/A converter 12 converts this signal into an analog audio signal, the output amplifier 13 amplifies it, and the speaker 3 outputs it as sound.

Then, in step S36, the signal processing device 14 waits for a predetermined time to see whether the user presses the correction switch 4b. If the switch is not pressed, the device judges that the user has accepted the speech recognition result; in step S11 it finalizes the result, performs the corresponding processing as described above, and ends the series of processes. If the switch is pressed, the device judges that the user has rejected the recognition result, discards the first utterance in step S37, and proceeds to step S38.

When processing reaches step S38, the first utterance has been discarded, so the signal processing device 14 uses the speech synthesis function (not shown) to convert the current recognition result, "Oppama Station in Kanagawa Prefecture", into a speech signal. The D/A converter 12 converts this signal into an analog audio signal, the output amplifier 13 amplifies it, and the speaker 3 outputs it as sound.

Then, in step S39, the signal processing device 14 waits for a predetermined time to see whether the user presses the correction switch 4b. If the switch is not pressed, the device judges that the user has accepted the speech recognition result; in step S11 it finalizes the result, performs the corresponding processing as described above, and ends the series of processes. If the switch is pressed, the device judges that the user has rejected the recognition result and proceeds to step S12, where it issues a re-utterance request asking the user to speak again and then repeats the processing from step S4.

[Effects of the Second Embodiment]
As described in detail above, in the voice interactive apparatus of the second embodiment, as the language model limited by the second collation result, the signal processing device 14 narrows down to a language model containing vocabulary that lies within a predetermined distance of the location indicated by the second utterance and that represents the same attribute as the first utterance.

As a result, when making a corrective utterance, the user does not need to repeat the first utterance in the second utterance, which reduces the burden on the user. For example, suppose the vehicle is in Tokyo and the user says "Japan Agency for Marine-Earth Science and Technology" as the destination, but the apparatus collates the utterance against the names of facilities in Tokyo and misrecognizes it as "Kasai Rinkai Park". Even then, the user only needs to say a facility that can stand in for the intended one, such as "Oppama Station in Kanagawa Prefecture", as the corrective utterance. In response, the apparatus collates the first utterance, "Japan Agency for Marine-Earth Science and Technology", against the names of facilities within a predetermined distance of Oppama Station in Kanagawa Prefecture and can output the correct recognition result "Japan Agency for Marine-Earth Science and Technology".

The embodiments described above are examples of the present invention. The invention is therefore not limited to them; needless to say, forms other than these embodiments may be modified in various ways according to the design and other factors, as long as they do not depart from the technical idea of the invention.

FIG. 1 is a block diagram illustrating the configuration of the voice interactive apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining a network grammar.
FIG. 3 is a flowchart illustrating the processing operation of the voice interactive apparatus according to the first embodiment of the present invention.
FIG. 4 is a diagram for explaining the address grammar.
FIG. 5 is a diagram for explaining the facility grammar.
FIG. 6 is a diagram for explaining a grammar that is dynamically constructed by collecting the names of facilities near the vehicle position.
FIG. 7 is a diagram for explaining a grammar that is dynamically constructed by collecting the names of facilities near the vehicle position, and that differs from the grammar of FIG. 6.
FIG. 8 is a flowchart illustrating the processing operation of the voice interactive apparatus according to the second embodiment of the present invention.
FIG. 9 is a diagram for explaining a grammar that is dynamically constructed by collecting the names of nearby facilities located within a predetermined distance of a substitute facility.

Explanation of symbols

1 Signal processing unit
2 Microphone
3 Speaker
4 Input device
4a Speech switch
4b Correction switch
5 Display
11 A/D converter
12 D/A converter
13 Output amplifier
14 Signal processing device
15 External storage device
21 CPU
22 Memory

Claims

A voice interactive apparatus comprising:
input means for inputting uttered speech;
speech recognition means for recognizing the uttered speech input by the input means and generating a system response according to the speech recognition result; and
output means for outputting the system response generated by the speech recognition means,
wherein, when a first collation result obtained by collating a first uttered speech using a first language model including the vocabulary to be recognized is not adopted, the speech recognition means collates a second uttered speech using a second language model including other vocabulary that represents an attribute of that vocabulary, recognizes the first uttered speech again using a language model restricted by the obtained second collation result, and generates a system response according to that speech recognition result.

The voice interactive apparatus according to claim 1, wherein the speech recognition means narrows the language model restricted by the second collation result down to a language model including vocabulary that is contained in the point indicated by the second uttered speech and represents the same attribute as the first uttered speech.

The voice interactive apparatus according to claim 1, wherein the speech recognition means narrows the language model restricted by the second collation result down to a language model including vocabulary that exists within a predetermined distance of the point indicated by the second uttered speech and represents the same attribute as the first uttered speech.
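
The distance-based narrowing described in the two claims above can be illustrated with a minimal sketch. It assumes each vocabulary entry carries a position and a kind label; the `Entry` type, the `kind` field, the facility names, and the coordinates are all hypothetical and do not come from the specification.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str    # vocabulary item, e.g. a facility name (hypothetical)
    kind: str    # attribute class of the item, e.g. "facility" (assumption)
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def narrow_vocabulary(entries, point, kind, max_km):
    """Keep names of the requested kind that lie within max_km of point."""
    lat, lon = point
    return [
        e.name
        for e in entries
        if e.kind == kind and haversine_km(lat, lon, e.lat, e.lon) <= max_km
    ]


# Made-up data: one nearby facility, one distant facility, one nearby non-facility.
ENTRIES = [
    Entry("Alpha Hall", "facility", 35.01, 139.01),
    Entry("Beta Mall", "facility", 35.50, 139.50),
    Entry("Gamma Street", "address", 35.02, 139.00),
]

restricted = narrow_vocabulary(ENTRIES, (35.0, 139.0), "facility", max_km=5.0)
```

The returned list would then serve as the vocabulary of the restricted language model used for re-recognition. The great-circle distance is only one possible choice; a real system might use road distance or map tiles instead.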

The voice interactive apparatus according to claim 1, further comprising reliability calculation means for calculating the reliability of the speech recognition result produced by the speech recognition means,
wherein the speech recognition means recognizes the first uttered speech again only when the reliability of the speech recognition result of the second uttered speech, obtained by the reliability calculation means, is equal to or greater than a predetermined threshold.

The voice interactive apparatus according to claim 4, wherein the speech recognition means generates a system response only when the reliability of the re-recognition result of the first uttered speech, obtained by the reliability calculation means, is equal to or greater than a predetermined threshold.

The voice interactive apparatus according to claim 4 or 5, wherein the reliability calculation means uses, as the threshold for the reliability of the speech recognition result of the second uttered speech or as the threshold for the reliability of the re-recognition result of the first uttered speech, a threshold higher than the threshold used for the reliability of the speech recognition result of the first uttered speech.
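
The reliability gating in the three claims above can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the `Result` type, the numeric threshold values, and the `rerecognize` callback are hypothetical, and a single stricter threshold is applied to both correction-stage checks for simplicity, although the claim only requires one of them to be higher.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    text: str
    confidence: float  # reliability score in [0, 1] (assumed scale)


# Hypothetical values; the claims only require the correction-stage
# threshold to be higher than the first-utterance threshold.
FIRST_THRESHOLD = 0.5
CORRECTION_THRESHOLD = 0.7


def accept_first(first: Result) -> bool:
    """Gate for the ordinary first-utterance result."""
    return first.confidence >= FIRST_THRESHOLD


def gate_correction(
    second: Result, rerecognize: Callable[[str], Result]
) -> Optional[Result]:
    """Re-recognize only on a reliable second utterance; respond only on a reliable retry."""
    if second.confidence < CORRECTION_THRESHOLD:
        return None  # second utterance too unreliable: do not re-recognize
    redo = rerecognize(second.text)
    if redo.confidence < CORRECTION_THRESHOLD:
        return None  # re-recognition too unreliable: no system response
    return redo
```

Returning `None` stands in for whatever fallback the dialogue takes when a check fails, for example prompting the user again.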

A speech recognition method comprising:
a speech recognition step of recognizing input uttered speech and generating a system response according to the speech recognition result; and
an output step of outputting the system response generated in the speech recognition step,
wherein, in the speech recognition step, when a first collation result obtained by collating a first uttered speech using a first language model including the vocabulary to be recognized is not adopted, a second uttered speech is collated using a second language model including other vocabulary that represents an attribute of that vocabulary, the first uttered speech is recognized again using a language model restricted by the obtained second collation result, and a system response is generated according to that speech recognition result.
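
The overall flow of the method claim can be sketched as a toy example. Real collation operates on acoustic features. Here, purely as an assumption for illustration, utterances are represented as noisy text and scored with `difflib.SequenceMatcher` from the Python standard library. The vocabulary, the attribute names, and the `adopted` callback (standing in for the user not pressing the correction switch) are hypothetical.

```python
from difflib import SequenceMatcher


def collate(utterance, vocabulary):
    """Toy stand-in for collation: return the closest vocabulary entry and its score."""
    def score(word):
        return SequenceMatcher(None, utterance, word).ratio()
    best = max(vocabulary, key=score)
    return best, score(best)


def recognize_with_correction(first_utt, second_utt, vocab_by_attr, adopted):
    """Two-stage recognition: full vocabulary first, attribute-restricted retry second."""
    # First language model: every recognizable word.
    full_vocab = [w for words in vocab_by_attr.values() for w in words]
    first, _ = collate(first_utt, full_vocab)
    if adopted(first):
        return first
    # Second language model: only the attribute words (e.g. prefecture names).
    attr, _ = collate(second_utt, list(vocab_by_attr))
    # Re-recognize the stored first utterance with the restricted vocabulary.
    redo, _ = collate(first_utt, vocab_by_attr[attr])
    return redo


VOCAB = {
    "tokyo": ["midori mall"],
    "kanagawa": ["midori hall"],
}

# The user rejects the first result and then says only the attribute.
final = recognize_with_correction("midori all", "kanagawa", VOCAB, adopted=lambda r: False)
```

The point of the design is that the user does not have to repeat the facility name: the first utterance is kept and matched again against a smaller vocabulary selected by the second utterance.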