JP4353212B2 - Word string recognition device

Info

Publication number: JP4353212B2
Application number: JP2006197990A
Authority: JP
Inventors: 美樹男笹木; 克志浅見
Original assignee: Denso Corp
Current assignee: Denso Corp
Priority date: 2006-07-20
Filing date: 2006-07-20
Publication date: 2009-10-28
Anticipated expiration: 2019-07-26
Also published as: JP2007017990A

Description

The present invention relates to a word string recognition device that can recognize an appropriate word string when the recognition result candidates for a signal, speech recognition being a typical example, are obtained as a set of discrete word strings.

Recognition devices are known that take, for example, speech uttered by a person as input, compare it with dictionary data, and output a plurality of word string candidates having a high degree of matching. Such a device is applied, for example, to control systems that recognize what the user has said and operate equipment according to the recognition result. For example, automobile navigation devices have been put into practical use that perform an information search or similar operation when the user utters a predetermined word as a voice command.

Current speech recognition uses two methods: continuous word recognition and word spotting. The former can recognize individual words, as in "Okazaki, ○○○○○ (store name), ramen", but in practice misrecognition cannot be avoided. Current navigation devices therefore hold a plurality of candidates as the recognition result and first talk back one of them to ask the user for confirmation. If it differs from what the user said, the user tells the device that the recognition result is wrong. The device then presents another recognition result and asks for confirmation again. If many candidates are prepared and presented one after another, the result the user intended can eventually be reached; however, the candidates may include ones that make no sense at all, and it may take a long time to reach the appropriate candidate.

The latter method, word spotting, is a speech recognition technology that has rapidly attracted attention in recent years because it can extract the keywords "Okazaki, ○○○○○ (store name), ramen, want to eat" from colloquial speech input such as "Well, I'd like to eat ramen at ○○○○○ (store name) in Okazaki". With this method, however, the number of word string candidates generated from its output, called a lattice (a set of words carrying time interval information and probability information), is very large, and the candidates are rarely narrowed down to a small number of meaningful word strings. Moreover, the recognition vocabulary of this method is currently about 100 words but is expected to grow to 1000 words or more, so the number of word strings generated from the resulting lattice is expected to become enormous. The same problem as described for continuous word recognition therefore exists, and in a more pronounced form.

With current speech recognition technology, it has also been difficult to return to a normal dialogue after a misrecognition caused by noise or changes in the acoustic environment. For example, if a user who wants to enter a place name unintentionally enters the mode for selecting from station names, other place names are no longer recognized properly. A dictionary structure that allows narrowing down is desirable in a sense for reducing misrecognition and is used in many word recognition methods, but once the device falls into an unintended recognition mode as described above, the desired input may no longer be possible. A user who does not know the operation for escaping from that mode will find it difficult to return to a normal dialogue and may be left at a loss.

The present invention has been made in view of these problems. Its object is to provide a word string recognition device that can prevent inconveniences caused by misrecognition, such as the device falling into a recognition mode the user did not intend, from which the user cannot recover and is left at a loss.

To achieve the above object, handling of unexpected responses is proposed.

In the word string recognition device according to claim 1, the unexpected-response handling means determines whether the word string output from the word string output means has the expected content in line with the context, that is, whether it is an utterance anticipated within the current topic, on the basis of a request keyword that indicates what is being requested in the syntax. When a request keyword that differs from the current topic is recognized and the word string is therefore judged to deviate from expectation, the means performs at least one of the following: topic change confirmation processing, which asks a question to confirm whether the topic has changed; topic change declaration processing, which declares that the topic has changed; and context-priority processing, which assumes that the previous topic continues and responds in line with the context. If the word string output from the word string output means after the topic change confirmation processing is in line with the changed topic, the unexpected-response handling means executes the topic change declaration processing.
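The decision described for claim 1 can be sketched as a small state machine. This is only an illustration under stated assumptions: the keyword-to-topic table, the `DialogueState` and `Action` names, and the rule of picking the first off-topic candidate alphabetically are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Action(Enum):
    NORMAL = auto()                 # utterance fits the current topic
    CONFIRM_TOPIC_CHANGE = auto()   # ask whether the topic has changed
    DECLARE_TOPIC_CHANGE = auto()   # announce that the topic has changed


# Hypothetical table mapping request keywords to topics.
KEYWORD_TOPIC = {
    "eat": "restaurant search",
    "ramen": "restaurant search",
    "play": "audio",
    "cooler": "air conditioning",
}


@dataclass
class DialogueState:
    topic: str
    pending_topic: Optional[str] = None  # topic awaiting confirmation


def handle(state: DialogueState, words: list[str]) -> Action:
    """Classify one recognized word string against the current topic."""
    topics = {KEYWORD_TOPIC[w] for w in words if w in KEYWORD_TOPIC}
    if state.pending_topic is not None:
        if state.pending_topic in topics:
            # The utterance after the confirmation follows the new topic,
            # so the change is declared and the topic is switched.
            state.topic, state.pending_topic = state.pending_topic, None
            return Action.DECLARE_TOPIC_CHANGE
        state.pending_topic = None
    off_topic = topics - {state.topic}
    if not off_topic:
        return Action.NORMAL
    # Assumption: if several off-topic candidates exist, take the first by name.
    state.pending_topic = sorted(off_topic)[0]
    return Action.CONFIRM_TOPIC_CHANGE
```

In this sketch a word string without any request keyword is treated as in line with the current topic; a real device would have to decide that case from its own dialogue data.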

The word string output means receives information reflecting the action of the person to be recognized, compares it with recognition dictionary data, and outputs word string candidates having a high degree of matching.

As described above, context-priority processing assumes that the previous topic continues and responds in line with the context. Whether the topic has changed may, however, be judged from the number of times the topic has continued, and the unexpected-response handling means may choose between topic change confirmation processing and context-priority processing accordingly. As set out in claim 2, the unexpected-response handling means is configured so that, even when the topic has changed, it executes context-priority processing immediately after the change and executes topic change confirmation processing only if the changed topic continues thereafter.

In the word string recognition device according to claim 3, the unexpected-response handling means determines whether the word string output from the word string output means has the expected content in line with the context, that is, whether it is an utterance anticipated within the current topic, on the basis of a request keyword that indicates what is being requested in the syntax. When a request keyword that differs from the current topic is recognized and the word string is therefore judged to deviate from expectation, the means performs at least one of the following: topic change confirmation processing, which asks a question to confirm whether the topic has changed; topic change declaration processing, which declares that the topic has changed; and context-priority processing, which assumes that the previous topic continues and responds in line with the context. Even when the topic has changed, the unexpected-response handling means executes context-priority processing immediately after the change and executes topic change confirmation processing only if the changed topic continues thereafter.
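The rule of responding with context priority first and asking for confirmation only when the changed topic persists can be sketched with a simple streak counter. The class name, the string labels, and the threshold `confirm_after=2` are assumptions made for illustration; the patent does not specify how many continuations are required.

```python
class UnexpectedResponsePolicy:
    """Chooses between context-priority handling and topic change confirmation."""

    def __init__(self, topic: str, confirm_after: int = 2):
        self.topic = topic
        self.confirm_after = confirm_after  # assumed threshold, not from the source
        self._candidate = None              # off-topic topic currently being observed
        self._streak = 0                    # consecutive utterances on that topic

    def step(self, observed_topic: str) -> str:
        if observed_topic == self.topic:
            # Back on the expected topic: forget any candidate change.
            self._candidate, self._streak = None, 0
            return "normal"
        if observed_topic == self._candidate:
            self._streak += 1
        else:
            self._candidate, self._streak = observed_topic, 1
        if self._streak < self.confirm_after:
            # Immediately after the change: assume the old topic continues.
            return "context_priority"
        # The changed topic has continued: ask the user to confirm the change.
        return "confirm_topic_change"
```

A single off-topic utterance, which may simply be a misrecognition, is thus absorbed without interrupting the user, and the confirmation question is asked only when the new topic is observed again.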

This prevents the inconvenience of the user being left at a loss because of misrecognition.

The word string output means of the word string recognition device has been described as receiving information reflecting the action of the person to be recognized, comparing it with recognition dictionary data, and outputting word strings having a high degree of matching. Specifically, the following forms are conceivable.

First, as set out in claim 4, the device may be realized as a speech recognition device that compares speech input by the person to be recognized with dictionary data and outputs a plurality of word string candidates having a high degree of matching. Such devices are in practical use, for example for entering a destination or other instruction by voice in a car navigation system, and are a typical application. Speech recognition relies on the acoustic features of the input speech, but these features vary greatly between individuals, and people often do not pronounce words accurately in everyday speech, so misrecognition occurs easily.

As set out in claim 5, for example, the device may also be realized as a character recognition device that compares a handwritten character string input by the person to be recognized with recognition dictionary data and outputs word strings having a high degree of matching. Like the acoustic features above, handwriting varies greatly between individuals, and people often do not write in a precise hand in everyday use, so misrecognition again occurs easily.

Accordingly, by preventing such misrecognition, or by responding appropriately after it has occurred, the inconvenience of the user being left at a loss because of misrecognition can be prevented.

Character recognition may take various forms. It naturally covers recognizing characters written with a writing instrument and read by a scanner, and may also cover input methods in which the user traces on a screen with an input pen, as is common on PDAs (personal digital assistants). Furthermore, the device is not limited to speech or character recognition, where the input already carries the content of a word string directly when it enters the recognition device; it may be an image recognition device. That is, it may be an image recognition device that recognizes an image of the person to be recognized as a scene, compares the recognized scene with dictionary data for expressing scenes in natural language, and outputs a plurality of word string candidates having a high degree of matching. One concrete example is recognizing a sign language pattern from an image of the person signing and outputting a word string that gives the natural-language meaning of that pattern. In this case the correspondence between sign language patterns and words is established, so a word string giving the natural-language meaning can readily be output by pattern matching. Even so, the word expressed by a sign language pattern depends on subtle finger movements, so misrecognition still occurs because of individual differences between signers.

Accordingly, applying the present invention to a sign language pattern recognition device, whose output is likely to include misrecognitions, likewise makes it possible to prevent misrecognition or to respond appropriately after it has occurred, and so prevents the inconvenience of the user being left at a loss because of misrecognition.

The function of realizing, on a computer system, the processing executed by the unexpected-response handling means of the word string recognition device described above can be provided, for example, as a program started on the computer system. Such a program can be recorded on a computer-readable recording medium such as a flexible disk, a magneto-optical disk, a CD-ROM, or a hard disk, and used by loading it into the computer system and starting it as required. Alternatively, the program may be recorded in a ROM or a backup RAM serving as the computer-readable recording medium, and that ROM or backup RAM may be incorporated into the computer system.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
First, FIG. 1 is a block diagram conceptually showing the word string recognition device of the embodiment, focusing on its functions.

A signal input via the user interface, speech being a typical example, undergoes predetermined recognition processing in the speech recognition unit or another signal recognition unit, which outputs word string candidates. The speech recognition unit has a speech recognition function and a word string generation function. Using a recognition dictionary (recognition vocabulary dictionary), it recognizes the speech input corresponding to an utterance and obtains candidates as a collection of words registered in the recognition dictionary. This is the speech recognition function referred to here; at this stage, several candidate words may be indicated for the same time. The word string generation function then generates, from the words obtained by the speech recognition function, a plurality of word strings whose words do not overlap in time, and outputs them to the dialogue management means. A likelihood is output together with each word string. Thus the speech recognition result for a single utterance usually comprises a plurality of candidates with likelihood information, which is called "welling up" (湧き出し) in speech recognition.
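The word string generation step, which turns time-stamped word candidates into word strings without temporal overlap, can be sketched as follows. This is a minimal illustration, not the method of the embodiment: the `WordHypothesis` type, the restriction to maximal chains (strings to which no further candidate could be added), and the use of the product of word probabilities as the likelihood are all assumptions, and the sample lattice is invented.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class WordHypothesis:
    word: str
    start: float  # start of the time interval, in seconds
    end: float    # end of the time interval; assumed to be greater than start
    prob: float   # probability attached to the word by the recognizer


def word_strings(lattice):
    """Return (words, score) pairs for every maximal non-overlapping chain."""
    results = []

    def extend(chain, last_end):
        later = [h for h in lattice if h.start >= last_end]
        # Drop a candidate if another hypothesis fits entirely before it;
        # otherwise strings that merely skip a word would also be produced.
        nxt = [c for c in later
               if not any(h is not c and h.end <= c.start for h in later)]
        if not nxt:
            if chain:
                results.append(([h.word for h in chain],
                                prod(h.prob for h in chain)))
            return
        for c in nxt:
            extend(chain + [c], c.end)

    extend([], float("-inf"))
    # Highest assumed likelihood first.
    return sorted(results, key=lambda r: r[1], reverse=True)


# Invented example: two competing words in each of two time intervals.
LATTICE = [
    WordHypothesis("okazaki", 0.0, 0.5, 0.9),
    WordHypothesis("okayama", 0.0, 0.5, 0.4),
    WordHypothesis("ramen", 0.6, 1.0, 0.8),
    WordHypothesis("rental", 0.6, 1.1, 0.3),
]
```

With two competing words in each of two intervals, four word strings are produced, which shows how quickly the number of candidates grows as the vocabulary and the utterance length increase.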

The dialogue management means recognizes these word string candidates with likelihood information by applying the context and semantic constraint information, and dynamically updates the recognition dictionary using the dictionary construction means. The dialogue management means also controls the display system via the screen control means and controls the audio output system and the device control system.

The above described the block diagram that conceptually shows the word string recognition device in terms of its functions. Next, a configuration in which the word string recognition device is applied to an in-vehicle control system is described with reference to the block diagram of FIG. 2. This control system is mounted on an automobile (vehicle) and controls various devices mounted on the vehicle while conversing by voice with a vehicle occupant (mainly the driver) as the user.

As shown in FIG. 2, the control system of this embodiment comprises a control device 1 and, connected to it: an input device 3 with which the user enters various commands and data by external operation; a microphone 5 for voice input; a speaker 7 for voice output; a display (display device) 8 for displaying images; a known navigation device 9 that detects the current position of the vehicle and provides route guidance; an air conditioner 13 that controls the air conditioning in the vehicle; an audio device 15 comprising a cassette tape recorder, a CD (compact disc) player, an MD (MiniDisc) player, a radio, a television, and the like; a communication device 17 that performs wireless data communication with broadcast terminals of the known VICS (Vehicle Information and Communication System) and with Internet broadcast terminals, which serve as connection points to the Internet; various sensors 19 for detecting vehicle operating conditions such as vehicle speed and acceleration or deceleration, the temperature inside and outside the vehicle, the presence of raindrops, and the like; and other control devices (not shown) that control the door locks, window glass (power windows), engine, brake devices, and so on of the vehicle.

The navigation device 9 includes a known GPS device for detecting the current position of the vehicle, a CD-ROM storing route guidance data such as map data, place name data, and facility name data, a CD-ROM drive for reading data from the CD-ROM, and operation keys with which the user enters commands. When, for example, the user enters a destination and a command requesting route guidance to it via the operation keys, the navigation device 9 displays on the display 8 a road map containing the current position of the vehicle and the optimal route to the destination, and provides route guidance. The display 8 shows not only the road map for route guidance from the navigation device 9 but also various other images such as information search menus; in addition, when the audio device 15 is set to television mode, it shows the television picture received by the television tuner of the audio device 15.

The control device 1 includes: a system control unit 21 built around a microcomputer comprising a CPU, ROM, RAM, and the like; an interface (I/F) 23 that passes commands and data from the input device 3 to the system control unit 21; a voice input unit 25 that converts the voice signal from the microphone 5 into digital data and passes it to the system control unit 21; a voice synthesis unit 27 that converts text data output from the system control unit 21 into an analog voice signal and outputs it to the speaker 7 to sound the speaker 7; a screen control unit 28 that controls the screen shown on the display 8; and a device control interface (device control I/F) 29 that connects the navigation device 9, air conditioner 13, audio device 15, communication device 17, various sensors 19, and other control devices to the system control unit 21 so that they can exchange data.

The control device 1 also includes an Internet address database 31, which stores Internet addresses, and a search control unit 33, so that desired information can be searched for and obtained from the Internet through the communication device 17. When the system control unit 21 outputs a search keyword representing the search content to the search control unit 33, the search control unit 33 operates the communication device 17 via the device control I/F 29, searches for information corresponding to the keyword from an Internet broadcast terminal, and passes the search result to the system control unit 21. Internet addresses that the search control unit 33 has used in the past are stored in the Internet address database 31 on command from the system control unit 21; when the search control unit 33 receives from the system control unit 21 a search keyword identical to one received in the past, it reuses the Internet address in the Internet address database 31.
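The reuse of previously used Internet addresses amounts to a keyword-indexed cache. The sketch below illustrates that idea only; the `SearchController` class, the `resolve` callback standing in for a search through the communication device, and the in-memory dictionary standing in for the Internet address database 31 are assumptions, not details given in the patent.

```python
from typing import Callable


class SearchController:
    """Looks up an address for a search keyword, reusing earlier results."""

    def __init__(self, resolve: Callable[[str], str]):
        # Hypothetical callback that finds an address for a keyword,
        # for example by querying over the communication device.
        self._resolve = resolve
        # Stand-in for the Internet address database (nonvolatile in the patent).
        self._addresses: dict[str, str] = {}

    def address_for(self, keyword: str) -> str:
        if keyword not in self._addresses:
            # First use of this keyword: resolve it and remember the address.
            self._addresses[keyword] = self._resolve(keyword)
        # Same keyword as before: the stored address is reused.
        return self._addresses[keyword]
```

A repeated keyword therefore triggers no second lookup, which matters when each lookup costs a wireless round trip.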

The control device 1 further includes a recognition vocabulary storage unit 34 that stores in advance a plurality of recognition vocabulary items that the user is expected to utter and that the control device 1 should recognize, in order to recognize and obtain keywords spoken by the user (hereinafter also called utterance keywords) from the voice signal input through the microphone 5 and the voice input unit 25. The set of recognition vocabulary stored in the recognition vocabulary storage unit 34 thus forms the recognition vocabulary database of the control device 1.

In addition, the control device 1 includes, as means for storing the data used to set the content of the utterances output from the speaker 7 (hereinafter also called agent utterances; that is, the operation of the speaker 7) and the operation of devices M other than the speaker 7, and to estimate the user's request and the user's mental or physical state: a dialogue data storage unit 35 that stores a dialogue database; a request/state estimation data storage unit 36 that stores request/state estimation data; and a user profile storage unit 37 that stores personal information for a plurality of users (hereinafter also called user profiles). The dialogue data storage unit 35, the request/state estimation data storage unit 36, the user profile storage unit 37, and the Internet address database 31 described above are implemented in nonvolatile memory that can be both read and written.

For the dialogue database stored in the dialogue data storage unit 35, the request/state estimation data stored in the request/state estimation data storage unit 36, the user profiles stored in the user profile storage unit 37, and so on, see, for example, Japanese Patent Application Nos. H10-162457 and H10-184840.

Next, an outline of the processing executed by the system control unit 21 in the control system of this embodiment configured as above is described with reference to FIG. 3. The description here covers the processing related to "word string recognition", in which, in terms of FIG. 1, the speech recognition unit performs recognition processing and the dialogue management means performs predetermined processing on the word string candidates that result.

When processing starts, recognition processing is performed on the input speech (S10). Depending on the determination in S20, the flow returns to S10 if the state is "waiting for input", performs multi-stage processing (S30), or proceeds to S40 if recognition is complete or a timeout has occurred. Details of the multi-stage processing in S30 are described later.

After moving to dialogue management in S40, the system determines whether the response is unexpected (S50). If it is, the flow proceeds to S60, where the type of unexpected response is determined. Depending on the result, one of the following is executed: confirmation of a topic change (S70), an utterance after a topic change (S80), or a context-priority utterance (S90). The flow then returns to S10 via the utterance processing (S180).

If S50 determines that the response is not unexpected (that is, it is in line with expectation), the flow proceeds to S100. If S100 determines that the number of questions asked back by the agent has exceeded N, help mode processing is executed (S120). If S100 determines that the number of questions asked back by the agent has exceeded K (K > N), or that the user has requested a reset, the flow proceeds to S170 and returns to the initial state. Otherwise, that is, when the number of questions asked back by the agent is N or fewer, one of the following is selectively executed according to the determination in S110: presenting options on the display 8 (S130), the normal utterance strategy (S140), voice menu mode (S150), or asking back (S160). These are described in detail later.
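The branching around S50 and S100 can be summarized in a few lines. The sketch below is an illustration under assumptions: the function name, the returned step labels as strings, and the default values N = 2 and K = 4 are hypothetical, since the patent gives only the relation K > N. Because K is larger than N, the reset check has to come before the help mode check.

```python
def next_step(expected: bool, ask_back_count: int, reset_requested: bool,
              n: int = 2, k: int = 4) -> str:
    """Return the label of the flowchart step that handles the response."""
    if k <= n:
        raise ValueError("K must be greater than N")
    if not expected:
        return "S60"   # determine the type of unexpected response
    if reset_requested or ask_back_count > k:
        return "S170"  # return to the initial state
    if ask_back_count > n:
        return "S120"  # help mode
    return "S110"      # choose among S130, S140, S150, and S160
```

For example, with the assumed defaults, three questions asked back leads to help mode, while five leads to a reset.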

After any of the processes S120 to S170 has been executed, the flow returns to S10 via the utterance processing (S180).
The above was an outline of the processing flow; the details are described next. The description follows the order of the conceptual units of the present invention: progressive hierarchical search, multi-stage processing, unexpected-response handling processing, and misrecognition handling processing. For convenience of explanation, however, unexpected-response handling processing and misrecognition handling processing are grouped under [3. Misrecognition handling processing] and are distinguished within that section.
[1. Progressive hierarchical search]
[1.1 Overview]
In the flowchart of FIG. 3, progressive hierarchical search corresponds to the option presentation processing of S130. Because it processes one word at a time, however, its time cycle differs from that of the other processes, so the mode that executes progressive hierarchical search must be set in advance. S130 of FIG. 3 is therefore executed only when the progressive hierarchical search mode is set.

［１．２具体例］
図４には漸進的階層探索の具体的な画面遷移例を示す。
（１）図４の最初の画面Ｇ１では、デフォルトモードにおいて東海４県の県名が表示されている状態において「愛知県」と発話した結果、愛知県にフォーカスされたことを示している。 [1.2 Specific example]
FIG. 4 shows a specific screen transition example of the progressive hierarchical search.
(1) The first screen G1 shown in FIG. 4 indicates that, as a result of uttering “Aichi Prefecture” in a state where the prefecture names of the four prefectures of Tokai are displayed in the default mode, the focus is on Aichi Prefecture.

そして、制御システム側は、愛知県という単語を認識した時点で「漸進的階層探索」機能を発揮して、次にユーザに期待する発話語彙を即座に画面に提示する。この場合は、画面Ｇ２のように愛知県内の市町村名を表示する。なお、画面Ｇ２では４つの市町村名しか挙げていないが、これは説明を簡単にするためのものである。なお、画面Ｇ２は、利用者が「岡崎」と発話した結果、岡崎にフォーカスされたことを示している。 Then, as soon as the control system recognizes the word “Aichi Prefecture”, it invokes the “progressive hierarchical search” function and immediately presents on the screen the vocabulary it expects the user to utter next. In this case, the names of municipalities in Aichi Prefecture are displayed, as on screen G2. Only four municipality names are listed on screen G2, but this is to simplify the explanation. Screen G2 also shows that the focus has moved to Okazaki as a result of the user uttering “Okazaki”.

そして、制御システム側は、岡崎が入力された時点で「漸進的階層探索」機能を発揮して、次にユーザに期待する発話語彙を画面に提示できる状態にしておく。ここで「提示できる状態にしておく」としたのは、実際には、「岡崎で食事したいなあ」のように次の単語が連続して発話されることが多いので、実用上は提示しないからである。提示するのは、岡崎の後に所定時間（例えば１〜２秒）の無音区間があった場合には、ユーザが迷っていると推定し、発話語彙を画面提示する。つまり「岡崎の要求メニュー」である。その内容は、誤認識時に用いる画面Ｇ７の岡崎の要求メニューの内の「もう一度お話下さい」を除いた部分となる。画面Ｇ３が表示された状態で「インド料理がいいね。」と発話されると、画面Ｇ４に示すようにインド料理にフォーカスされ、画面Ｇ５に示すように、「愛知県岡崎市インド料理検索しています」という表示し、検索が終了すると、画面Ｇ６に示すように、その検索結果を表示する。 Then, when “Okazaki” is input, the control system invokes the “progressive hierarchical search” function and gets ready to present on the screen the vocabulary it expects the user to utter next. It only “gets ready to present” because, in practice, the next word is often uttered without a pause, as in “I want to eat in Okazaki”, so in most cases nothing is actually presented. If a silent period of a predetermined length (for example, 1 to 2 seconds) follows “Okazaki”, the system presumes that the user is unsure what to say and presents the vocabulary on the screen. What is presented is the “Okazaki request menu”: the Okazaki request menu of screen G7, which is used at the time of misrecognition, without the “Please speak again” item. When “Indian food would be nice.” is uttered while screen G3 is displayed, the focus moves to Indian food as shown on screen G4, the message “Searching for Indian food in Okazaki City, Aichi Prefecture” is displayed as shown on screen G5, and when the search ends, the search result is displayed as shown on screen G6.

一方、画面Ｇ２に示す岡崎が発話された時点で次に発話を期待する語彙以外の語彙が発話された場合には、画面Ｇ７へ移行して「岡崎の要求メニュー」と共に「もう一度お話下さい」という案内を加えた内容を表示する。ここで、「食事」と発話されれば画面Ｇ３へ移行し、「デパートは？」と発話されれば、画面Ｇ８に示すように、該当するデパートを一覧表示する。そして、その内のいずれかが指定されて「△△△の地図」と発話されると、画面Ｇ９に示すように、そのデパートの位置が明確になるように、周辺の地図と共に表示する。 On the other hand, if, after “Okazaki” has been uttered as shown on screen G2, a word outside the vocabulary expected next is uttered, the display moves to screen G7, which shows the “Okazaki request menu” together with the guidance “Please speak again”. If “meal” is then uttered, the display moves to screen G3; if “What about department stores?” is uttered, the matching department stores are listed as shown on screen G8. When one of them is designated by uttering “map of △△△”, the department store is displayed together with a map of the surrounding area so that its location is clear, as shown on screen G9.
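The screen transitions of FIG. 4 can be sketched as a walk down a vocabulary tree. The following Python sketch is illustrative only: `VOCABULARY_TREE`, `ProgressiveSearch`, and `SILENCE_THRESHOLD_S` are hypothetical names, the tree contains only words mentioned in the example, and the 1-second threshold is an assumed value within the "1 to 2 seconds" given in the text. The sketch does not model the difference between levels that are displayed immediately and levels that are displayed only after silence.

```python
# Hypothetical vocabulary tree holding only the words named in the FIG. 4 example.
VOCABULARY_TREE = {
    "Aichi Prefecture": {
        "Okazaki": {
            "meal": {"Indian food": {}},
            "department store": {},
        },
    },
}

SILENCE_THRESHOLD_S = 1.0  # assumed; the text says "for example, 1 to 2 seconds"


class ProgressiveSearch:
    """Tracks the focused path and the vocabulary expected next."""

    def __init__(self, tree):
        self.tree = tree
        self.path = []

    def _node(self):
        node = self.tree
        for word in self.path:
            node = node[word]
        return node

    def expected(self):
        """Vocabulary the user is expected to utter next."""
        return sorted(self._node())

    def hear(self, word):
        """Return (accepted, vocabulary to show) for one recognized word."""
        if word in self._node():
            self.path.append(word)  # move the focus one level down
            return True, self.expected()
        # Unexpected word: same menu plus the retry guidance (screen G7).
        return False, self.expected() + ["Please speak again"]

    @staticmethod
    def should_display(silence_s):
        """Show the menu only after a silent period of the threshold length."""
        return silence_s >= SILENCE_THRESHOLD_S
```

An unexpected word leaves the focus where it was, which corresponds to returning to the request menu of screen G7 rather than restarting from the top level.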

（２）図５も漸進的階層探索の一例である。図５の画面Ｇ１〜Ｇ３は図４にて示した画面内容と同じであるが、デフォルトモードにおいて東海４県の県名が表示されている状態において「東京」と発話すると、画面Ｇ１１へ移行する。この場合、東京といっても東京都のみを指すのではない場合もあるので、東京都周辺を対象としてもよい。 (2) FIG. 5 is also an example of a progressive hierarchical search. Screens G1 to G3 in FIG. 5 are the same as the screen contents shown in FIG. 4, but when “Tokyo” is spoken in the default mode in which the prefecture names of the four prefectures of Tokai are displayed, the screen shifts to screen G11. . In this case, since the term “Tokyo” does not necessarily refer to only Tokyo, the area around Tokyo may be used.

そして、制御システム側は、東京という単語を認識した時点で「漸進的階層探索」機能を発揮して、次にユーザに期待する発話語彙を即座に画面に提示する。この場合は、画面Ｇ１２に示すように東京都内の市区町村名を画面表示する。そして、銀座と発話した時点で銀座にフォーカスする。 Then, as soon as the control system recognizes the word “Tokyo”, it invokes the “progressive hierarchical search” function and immediately presents on the screen the vocabulary it expects the user to utter next. In this case, the names of the wards, cities, towns, and villages in Tokyo are displayed, as shown on screen G12. When the user utters “Ginza”, the focus moves to Ginza.

そして、制御システム側は、銀座という単語を認識した時点で「漸進的階層探索」機能を発揮して、次にユーザに期待する発話語彙を画面に提示できる状態にしておく。ここで「提示できる状態にしておく」としたのは、上述の画面Ｇ２→Ｇ３へ移行する部分と同様に、実際には、「銀座、○○○ビル」のように次の単語が連続して発話されることが多いので、実用上は提示しないからである。提示するのは、銀座の後に所定時間（例えば１〜２秒）の無音区間があった場合である。画面に提示する発話語彙は、銀座内の地名関連情報である。例えば地名そのものでもよいし、使い勝手の面から言えば、○○○ビルや□□デパートのような施設名でもよい。 Then, when the control system recognizes the word “Ginza”, it invokes the “progressive hierarchical search” function and gets ready to present on the screen the vocabulary it expects the user to utter next. As with the transition from screen G2 to G3 described above, it only “gets ready to present” because, in practice, the next word is often uttered without a pause, as in “Ginza, ○○○ Building”, so in most cases nothing is actually presented. The vocabulary is presented when a silent period of a predetermined length (for example, 1 to 2 seconds) follows “Ginza”. The vocabulary presented on the screen is place-name-related information within Ginza. It may be place names themselves or, from the standpoint of usability, facility names such as ○○○ Building or □□ Department Store.

そして、利用者から例えば○○ビルと発話されると、画面Ｇ１３に示すように、その○○○ビルの位置が明確になるように、周辺の地図と共に表示する。
［１．３効果］
例えば「岡崎で食事したいなあ、インド料理がいいね」という発話が利用者からなされた場合、ワードスポッティングによる音声認識手法の出力結果は、「岡崎、食事、インド料理」という単語列になる。従来の音声認識手法では、これら３つが揃った段階で認識に対応するシステム側の処理が開示されていたが、本手法によれば、「岡崎」が入力された時点で、次に利用者に入力を期待する発話語彙を即座に提示できるため、利用者はとまどうことなく発話できるようになる。これによって、誤認識の原因となる認識辞書外の語彙を利用者が発話してしまうことを未然に防止できる。
［２．多段階処理について］
［２．１概要］
（１）現状のワードスポッティング手法では１回の処理に対する認識語彙数は１００語程度であり、連続単語認識の辞書のような大規模化は困難である。一方、実用的に見た場合、車室内において発生すると想定される特定の話題（例えば「食事に行く」など）にフォーカスする際には、例えば１００語程度でも対応可能である。したがって、話題を的確に認識し、誤認識の際は話題の不連続性を検出し、これらに応じてワードスポッティングの語彙を切り替えていけばよい。そこで、多段階処理を行う。 Then, when the user utters, for example, XX building, as shown on the screen G13, it is displayed together with the surrounding map so that the position of the XX building becomes clear.
[1.3 Effect]
For example, when the user utters “I want to eat in Okazaki; Indian food would be nice”, the output of speech recognition by word spotting is the word string “Okazaki, meal, Indian food”. With the conventional speech recognition method, the system-side processing corresponding to the recognition was started only once all three words had been obtained. With the present method, the vocabulary that the user is expected to utter next can be presented as soon as “Okazaki” is input, so the user can speak without hesitation. This prevents the user from uttering vocabulary outside the recognition dictionary, which would cause misrecognition.
[2. About multi-step processing]
[2.1 Overview]
(1) With the current word spotting technique, the number of recognized vocabulary per process is about 100 words, and it is difficult to increase the scale like a dictionary for continuous word recognition. On the other hand, from a practical viewpoint, when focusing on a specific topic (for example, “going to a meal”) that is expected to occur in the passenger compartment, for example, about 100 words can be handled. Therefore, it is only necessary to accurately recognize the topic, detect the discontinuity of the topic in the case of misrecognition, and switch the word spotting vocabulary accordingly. Therefore, multistage processing is performed.

図６には、「よこはまのちゅうかがいでしゅうまいでもくいたいなあ」という発話がなされた場合に行う多段階処理の一例を示した。なお、本発話例では、以下のような単語属性に分類できるものとする。 FIG. 6 shows an example of the multi-stage processing performed when the utterance “Yokohama no chukagai de shumai demo kuitai naa” (roughly, “I'd like to eat shumai or something in Yokohama's Chinatown”) is made. In this example, the utterance is assumed to be classifiable into the following word attributes.

よこはまのちゅうかがいでしゅうまいでもくいたいなあ
（場所）（施設名）（要求対象）（要求キーワード）
したがって、まず、何が要求であるかを把握するため、第１段階では「くいたい」という要求キーワードをスポッティングし、話題を確定する。
Yokohama no / chukagai de / shumai demo / kuitai naa
(place) (facility name) (request target) (request keyword)
Therefore, in order to first grasp what is being requested, the first stage spots the request keyword “kuitai” (want to eat) and determines the topic.

そして、第２段階では、第１段階で確定させた話題から語彙を限定し、辞書を切り替える。すなわち、この場合には、目的地をベースとしたレストラン名と関連する料理名で１００語の大半を構成する。これは、「くいたい」という要求キーワードから食事の要求であることが判るため、単語列を構成する他の単語はレストラン名や料理名となっていると予想できるからである。これにより、「ちゅうかがい」や「しゅうまい」などが認識語彙としてヒットし易い辞書を構成することができる。 In the second stage, the vocabulary is limited according to the topic determined in the first stage, and the dictionary is switched. In this case, restaurant names based on the destination and the associated dish names make up most of the 100 words. This is because the request keyword “kuitai” shows that the request is for a meal, so the other words in the word string can be expected to be restaurant names or dish names. This makes it possible to construct a dictionary in which words such as “chukagai” and “shumai” are easily hit as recognition vocabulary.
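The two stages described above can be sketched in Python as follows. This is a toy illustration under stated assumptions: plain substring matching on a romanized string stands in for acoustic word spotting, and the names `REQUEST_KEYWORDS`, `TOPIC_DICTIONARIES`, `spot`, and `recognize` are hypothetical. The dictionaries hold only the words named in the example.

```python
# Stage-1 dictionary: request keyword -> topic it determines (assumed content).
REQUEST_KEYWORDS = {"kuitai": "meal"}

# Stage-2 dictionaries, one per topic: word -> attribute (assumed content).
TOPIC_DICTIONARIES = {
    "meal": {"chuukagai": "facility name", "shuumai": "request target"},
}


def spot(utterance, vocabulary):
    """Toy word spotting: vocabulary words found in the utterance, in spoken order."""
    hits = [word for word in vocabulary if word in utterance]
    return sorted(hits, key=utterance.find)


def recognize(utterance):
    """Stage 1 fixes the topic; stage 2 spots with the topic's dictionary."""
    stage1 = spot(utterance, REQUEST_KEYWORDS)
    if not stage1:
        return None  # no request keyword found, so the topic stays undetermined
    topic = REQUEST_KEYWORDS[stage1[0]]
    dictionary = TOPIC_DICTIONARIES[topic]  # dictionary switch
    stage2 = spot(utterance, dictionary)
    return topic, [(word, dictionary[word]) for word in stage2]
```

Applied to the romanized example utterance, stage 1 finds "kuitai" and fixes the topic as a meal, after which stage 2 finds "chuukagai" and "shuumai" in the switched dictionary.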

（２）なお、図６では第１段階の処理として要求キーワードをスポッティングして話題を確定しているが、それ以外の施設名や場所、あるいは要求対象をスポッティングして話題を確定してもよい。但し、現実的には、ワードスポッティングの語彙数は現状では１００語程度であるので、その程度の語彙でまかなうことを鑑みると、要求キーワードでの話題確定が好ましい。 (2) In FIG. 6, the topic is determined by spotting the request keyword as the first stage process, but the topic may be determined by spotting other facility names or places or the request target. . However, in reality, the number of vocabularies for word spotting is currently about 100 words, so that it is preferable to determine the topic with the requested keyword in view of the availability of such vocabularies.

（３）また、図６で示した具体例は、１の単語列を構成する単語の属性という観点からｎ次元の軸を設定したが、さらに時間軸に沿った関連性を考慮しても良い。つまり、文脈という観点も加味して話題を確定するのである。 (3) In the specific example shown in FIG. 6, the n-dimensional axis is set from the viewpoint of the attributes of the words constituting one word string. However, the relevance along the time axis may be considered. . In other words, the topic is determined in consideration of the context.

［２．２ユーザ発話の基本構成］
ユーザの発話はたいていのコンテンツ検索の場合、『場所』『施設名』『要求対象』『要求キーワード』からなるか、その並び替え、あるいは省略形で基本形が構成されると考えられる。語順が変わる場合には要求キーワードが音声信号中のどこに存在するかは不明であるが、例えば本願出願人が特願平１１−２０３４９号にて提案したような適正単語列の推定手法を用いることにより、構文的な制約に基づいて複数の候補に対して優先順序を定めることはできる。 [2.2 Basic structure of user utterance]
In the case of most content searches, the user's utterance is considered to consist of “location”, “facility name”, “request target”, “request keyword”, or a rearrangement or abbreviated form of the basic form. When the word order changes, it is unclear where the requested keyword exists in the audio signal. For example, use a method for estimating an appropriate word string as proposed in Japanese Patent Application No. 11-20349 by the applicant of the present application. Thus, a priority order can be set for a plurality of candidates based on syntactic constraints.

［２．３認識辞書の構成］
認識辞書は図２に示す認識語彙記憶部３４に記憶されている認識語彙データベースから動的に構成し得るものとする。認識語彙データベースは、システムで扱う現実の話題に対応して予め各カテゴリ毎の語彙クラスタに分割しておく（図７参照）。なお、この各カテゴリ毎の語彙クラスタはクラスタ辞書と呼ばれる。また、カテゴリには例えば下記のような種類がある。 [2.3 Configuration of recognition dictionary]
It is assumed that the recognition dictionary can be dynamically constructed from a recognition vocabulary database stored in the recognition vocabulary storage unit 34 shown in FIG. The recognized vocabulary database is divided into vocabulary clusters for each category in advance corresponding to actual topics handled by the system (see FIG. 7). The vocabulary cluster for each category is called a cluster dictionary. The categories include the following types, for example.

（１）各種コマンド
・ナビコマンド
・スケジュール帳
・アドレス帳
・電話
（２）要求キーワード（要求ＫＷ）
（３）施設名
１）レストラン名
・料理名
・雰囲気
・値段
２）スキー場名
３）ゴルフ場名
４）デパート名
５）遊園地名
６）公園名
７）映画館名
８）温泉
（４）イベント名
（５）検索結果
（６）地名
（７）鉄道駅名
（８）基本的な対話語彙
・肯定、否定
・問い合わせ
・説明、状況通知、確認、……
これらの構成語彙にはデータベースの要素となる固有名詞のみならず、対話上の同義語（はらへった、ごはんたべたい、ｅｔｃ）も含まれる。この各々からここでは１回のワードスポッティングの語彙即ち、目的地をべースとしたレストラン名と関連する料理名（ここではユーザプロファイルも参照する）で１００語の大半を構成する。これをもとに『中華街』や『しゅうまい』などが認識語彙としてヒットする。
(1) Various commands
・Navigation commands
・Schedule book
・Address book
・Telephone
(2) Request keywords (request KW)
(3) Facility names
1) Restaurant names
・Dish names
・Atmosphere
・Price
2) Ski resort names
3) Golf course names
4) Department store names
5) Amusement park names
6) Park names
7) Movie theater names
8) Hot springs
(4) Event names
(5) Search results
(6) Place names
(7) Railway station names
(8) Basic dialogue vocabulary
・Affirmation, negation
・Inquiry
・Explanation, status notification, confirmation, ...
These constituent vocabularies include not only proper nouns that are elements of the database but also conversational synonyms (for example, “harahetta” (I'm hungry) and “gohan tabetai” (I want to eat a meal)). From these, the vocabulary for one word-spotting pass is assembled; in this example, restaurant names based on the destination and the associated dish names (the user profile is also consulted here) make up most of the 100 words. With this dictionary, “chukagai” (Chinatown) and “shumai” are hit as recognition vocabulary.
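Assembling a recognition dictionary from cluster dictionaries can be sketched as below. The function `build_dictionary`, the priority ordering of categories, and the sample words are assumptions made for illustration; the specification states only that the dictionary is built dynamically from per-category clusters and holds about 100 words per word-spotting pass.

```python
MAX_WORDS = 100  # approximate vocabulary size of one word-spotting pass


def build_dictionary(clusters, priorities, max_words=MAX_WORDS):
    """Fill a recognition dictionary from cluster dictionaries.

    clusters:   category name -> list of words (one cluster dictionary each)
    priorities: category names, most relevant to the current topic first
    """
    dictionary = []
    seen = set()
    for category in priorities:
        for word in clusters.get(category, []):
            if word in seen:
                continue  # a word shared by two clusters is added once
            if len(dictionary) >= max_words:
                return dictionary
            dictionary.append(word)
            seen.add(word)
    return dictionary
```

For a meal topic, listing the restaurant-name and dish-name clusters first lets them take up most of the word budget, as the text describes.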

［２．４単語間のネットワーク］
辞書の基本構造は上記の階層表現に準じて定義するが、その他の意味的関係などのネットワーク関係は随時、ユーザやデータベース供給者から提供される。例えば、ユーザ発話は下記のような属性の組（対話べクトルと呼ぶ）の集まりである対話データベース（図８）の中で位置づけられる。 [2.4 Network between words]
The basic structure of the dictionary is defined according to the hierarchical expression described above, but other network relationships such as semantic relationships are provided from users and database suppliers as needed. For example, user utterances are positioned in a dialogue database (FIG. 8), which is a collection of attribute sets (called dialogue vectors) as described below.

（話題、時間・位置、環境・状況、状態・要求、ユーザ発話、エージェント発話、制御出力）
対話ベクトルは無数に存在しうるが、あらゆるベクトル値を取るわけではなく、人間と機械との間の実際的なコミュニケーションの単位として、意味のある有限個のまとまりにクラスタリングできる。そこには単語の意味的な分類、文法的制約、話題の連続性、物理的・常識的制約、事象の連続性などが用いられる。したがって、
（Ａ）あるユーザ発話を構成する単語列に用いられる語彙の範囲
（Ｂ）現在の発話から次の発話に至る際の語彙の制約
は対話ベクトルが張る空間を構成する主要因となる（話題、時間・位置、環境・状況、状態・要求）に大きく影響される。 (Topic, time / position, environment / situation, state / request, user utterance, agent utterance, control output)
Dialogue vectors can exist in unlimited numbers, but they do not take every possible vector value; they can be clustered into a finite number of meaningful groups that serve as units of practical communication between humans and machines. The clustering uses the semantic classification of words, grammatical constraints, topic continuity, physical and common-sense constraints, continuity of events, and so on. Therefore,
(A) the range of vocabulary used in the word string that constitutes a given user utterance, and
(B) the constraints on vocabulary in going from the current utterance to the next utterance
are strongly influenced by (topic, time/position, environment/situation, state/request), which are the main factors constituting the space spanned by the dialogue vectors.

そこで、あるユーザ発話における単語が他の単語に対してどういうネットワーク構造になるかは下記の要因で決定する。
（１）クラスタ辞書間の関係
（２）単語間の関係
（３）話題間の関係
（４）文脈の連続性
（５）ユーザの特性や状況
（６）アプリケーション間
以上の関係に基づいてある単語Ｗ１から別の単語Ｗ２が活性化され、これを次のユーザ発話に対する認識辞書の語彙に加える。さらに、認識結果に付随する尤度値ＬＦＫを高めるように音声認識モジュールのパラメータを調整する。ここで、（１）クラスタ辞書間の関係、（２）単語間の関係、（３）話題間の関係、（４）文脈の連続性、（５）ユーザの特性や状況に関して補足説明する。 Therefore, what kind of network structure a word in a certain user utterance has with respect to other words is determined by the following factors.
(1) Relationship between cluster dictionaries
(2) Relationship between words
(3) Relationship between topics
(4) Continuity of context
(5) User characteristics and situation
(6) Relationship between applications
Based on the above relationships, a word W1 activates another word W2, which is added to the vocabulary of the recognition dictionary for the next user utterance. Further, the parameters of the speech recognition module are adjusted so as to increase the likelihood value LFK accompanying the recognition result. The following gives supplementary explanations of (1) the relationship between cluster dictionaries, (2) the relationship between words, (3) the relationship between topics, (4) the continuity of context, and (5) user characteristics and situation.
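The activation of a word W2 from a word W1 can be sketched as a lookup in a relation table. In the Python sketch below, `RELATIONS` and `activate` are hypothetical names, and the table holds a few of the inclusion and association examples listed in [2.4.2]. The adjustment of the speech recognition parameters that raises the likelihood value is not modeled.

```python
# W1 -> words it activates; entries taken from the examples in [2.4.2].
RELATIONS = {
    "Chinese food": ["shumai", "ramen", "gyoza"],  # inclusion relationship
    "udon": ["noodles", "ramen"],                  # association, same classification
    "harahetta": ["restaurant"],                   # association between request keywords
}


def activate(recognized_words, relations, dictionary):
    """Return the dictionary for the next utterance with activated words added."""
    updated = list(dictionary)
    for w1 in recognized_words:
        for w2 in relations.get(w1, []):
            if w2 not in updated:
                updated.append(w2)  # W2 is activated by W1
    return updated
```

The input dictionary is copied rather than modified, so the dictionary used for the current utterance stays intact while the next one is prepared.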

［２．４．１クラスタ辞書間の関係］
基本的には、上述した［２．３］辞書の構成で述べた関係に準ずる。
（例）施設→スキー場→おんたけスキー場
［２．４．２単語間の意味的関係］
［２．４．２．１包含関係］
（例）中華料理→シュウマイ、ラーメン、ギョーザ、……
（例）スポーツ→テニス、スキー、スイミング、ジョギング、……
［２．４．２．２連想関係］
（１）同一分類のオブジェクトを連想する場合
（例）うどん→麺類＋ラーメン
（２）シーンの構成要素を連想する場合
（例）ゲレンデ→スキー→リフト、スノーボード、ゴーグル、……
（例）ゴルフ→ゴルフ場→ホール、キャデイ、フェアウェイ、クラブ……
（例）海辺→海水浴→水着、ビーチパラソル、青い空、白い雲、……
（３）シーンに関連する興味の対象を連想する場合
（例）スキー→ゲレンデ、雪質、リフト……
（例）ゴルフ→天気、経路、費用、スコア、……
（４）季節から代表的なシーンを連想する場合
（例）夏→プール、海水浴、かき氷、セミ、クーラー、……
（５）要求キーワード間に基づく連想
（例）はらへった→レストラン
［２．４．３話題間の関係］
現在の話題に連関した話題のキーワードを活性化することにより、認識語彙を設定することができる。その連関のカテゴリは手段、付随する行動、よくある付帯事象、などがある。 [2.4.1 Relationship between cluster dictionaries]
Basically, this follows the relationships described in [2.3 Configuration of recognition dictionary].
(Example) Facility → ski resort → Ontake ski resort
[2.4.2 Semantic relationship between words]
[2.4.2.1 Inclusion relationship]
(Example) Chinese food → shumai, ramen, gyoza, ...
(Example) Sports → tennis, skiing, swimming, jogging, ...
[2.4.2.2 Association relationship]
(1) Associating objects of the same classification
(Example) Udon → noodles + ramen
(2) Associating the components of a scene
(Example) Ski slope → skiing → lift, snowboard, goggles, ...
(Example) Golf → golf course → hole, caddie, fairway, club, ...
(Example) Seaside → sea bathing → swimsuit, beach umbrella, blue sky, white clouds, ...
(3) Associating objects of interest related to a scene
(Example) Skiing → ski slope, snow quality, lift, ...
(Example) Golf → weather, route, cost, score, ...
(4) Associating typical scenes with a season
(Example) Summer → pool, sea bathing, shaved ice, cicadas, air conditioner, ...
(5) Association between request keywords
(Example) “Harahetta” (I'm hungry) → restaurant
[2.4.3 Relationship between topics]
The recognition vocabulary can be set by activating keywords of topics linked to the current topic. The categories of linkage include means, accompanying actions, and commonly associated events.

（例）ショッピング
→駐車場（手段）、レストラン（付随する行動）、バーゲン（付帯事象）……
［２．４．４文脈の連続性］
［２．４．４．１話題の連続性］
通常の自然な対話に見られるように、ある話題（たとえばショッピングなど）で閉じた認識語彙の範囲で対話が継続することが考えられる。このような話題の連続性という制約のもとで認識語彙を設定することができる。 (Example) Shopping → Parking lot (means), restaurant (accompanying action), bargain (incidental event) ……
[2.4.4 Continuity of context]
[2.4.4.1 Topic continuity]
As can be seen in normal natural dialogue, it is conceivable that the dialogue continues in a range of recognized vocabulary closed on a certain topic (for example, shopping). The recognition vocabulary can be set under such a restriction of topic continuity.

［２．４．４．２発話−応答の妥当性］
車室内に代表される対話環境では、ある発話内容（ユーザもしくはエージェントによる）は、｛呼びかけ、申告、通知、教示、解説、指示、依頼、警告、督促、問い合わせ｝のいずれかに分類できると考えることができる。一方、この発話に対する応答は、｛応答、確認、保留、判断、回答、その他応答｝に分類できる。この発話と応答の組み合わせを発話対、あるいは対話ユニットと呼ぶ。この対話ユニットに基づいて話題の内容によらず、文脈の論理的な連続性を定義することができる。図９中に「○」で示した部分は対話ユニットとして成立する発話−応答の組み合わせを示す。この対話ユニットをもとにしてエージェントは次のユーザ発話に含まれる認識語彙を予想して設定することができる。 [2.4.4.2 Validity of utterance-response]
In a dialogue environment such as a vehicle cabin, the content of an utterance (by the user or the agent) can be considered to fall into one of the classes {call, declaration, notification, teaching, commentary, instruction, request, warning, reminder, inquiry}. The response to the utterance, in turn, can be classified into {response, confirmation, hold, judgment, answer, other response}. The combination of an utterance and its response is called an utterance pair or a dialogue unit. Based on dialogue units, the logical continuity of context can be defined independently of the content of the topic. The cells marked with “○” in FIG. 9 indicate the utterance-response combinations that are valid as dialogue units. Based on these dialogue units, the agent can anticipate and set the recognition vocabulary contained in the next user utterance.

以下、発話内容と、この発話に対する応答の具体例について説明する。なお、ユーザの発話については「…」で示し、エージェントの発話については『…』で示す。
（１）呼びかけ
一般的な意味での呼びかけや挨拶などがこれに含まれる。 Hereinafter, utterance contents and specific examples of responses to them are described. User utterances are shown in “...” and agent utterances in 『...』.
(1) Call
This includes calls and greetings in the general sense.

（例）
呼びかけ：「おい、ＸＹＺ。」
返事：『はい、何ですか。』
（例）
呼びかけ：『おはようございます、今日はいい天気ですね。』
返事：「ああ、おはよう。」
（２）申告
（例）
申告：「今日は家族とドライブ。」
確認：『御家族とドライブですね。』
（３）通知
（例）
通知：『およそ１ｋｍ先、渋滞です。』
無応答：「」、又は
確認：「わかった。」
（例）
通知：『私の名前はＸＹＺです。』
確認：「ＯＫ。」、「よろしく。」
（４）教示
（例）
教示：「今、雨が降ってきた。」
確認：『“現在、雨が降っている”というメッセージを確認しました。』
（５）解説
（例）
解説：『操作方法がわからないときはへルプといってください。』
無応答：「」
（６）指示
（例）
指示：『ユーザパスワードをしゃべってください。』
確認：「わかった。ｘｘｘｘｘ」
（７）依頼
（例）
依頼：『そろそろガソリンが少なくなってきました。次の交差点のガソリン
スタンドで給油していただけませんか？』
保留：「いや、あとにしよう。」
（８）警告
（例）
警告：『１０ｋｍ先、○○トンネルで事故発生。次のインターで降りてくだ
さい。』
判断：「わかった、そうしよう。」
（９）督促
（例）
督促：『ユーザ名がまだ登録されていません。すぐに登録してください。』
確認：「わかった。」
（１０）問い合わせ
問い合わせには次の４種類がある。 (Example)
Call: “Hey, XYZ.”
Reply: 『Yes, what is it?』
(Example)
Call: 『Good morning. Nice weather today, isn't it?』
Reply: “Oh, good morning.”
(2) Declaration
(Example)
Declaration: “Going for a drive with my family today.”
Confirmation: 『A drive with your family, is it?』
(3) Notification
(Example)
Notification: 『There is a traffic jam about 1 km ahead.』
No response: “”, or
Confirmation: “Got it.”
(Example)
Notification: 『My name is XYZ.』
Confirmation: “OK.”, “Nice to meet you.”
(4) Teaching
(Example)
Teaching: “It has just started raining.”
Confirmation: 『I have confirmed the message “It is currently raining”.』
(5) Commentary
(Example)
Commentary: 『If you do not know how to operate the system, please say “help”.』
No response: “”
(6) Instruction
(Example)
Instruction: 『Please say your user password.』
Confirmation: “Got it. xxxxx”
(7) Request
(Example)
Request: 『Fuel is getting low. Could you refuel at the gas station at the next intersection?』
Hold: “No, let's do it later.”
(8) Warning
(Example)
Warning: 『An accident has occurred in the XX tunnel 10 km ahead. Please get off at the next interchange.』
Judgment: “Got it, let's do that.”
(9) Reminder
(Example)
Reminder: 『Your user name has not been registered yet. Please register it immediately.』
Confirmation: “Got it.”
(10) Inquiry
There are the following four types of inquiries.

１）合意要請
（例）：『御出にならないので電話接続を中止します。よろしいですか？』
２）選択要請
（例）問い合わせ：『Ａですか？Ｂですか？』
回答：「Ａです。」
３）問い合わせ
場所、時間、情報など特定データの問い合わせをするユニットである。
1) Request for agreement
(Example): 『Since there is no answer, I will cancel the telephone connection. Is that all right?』
2) Request for selection
(Example) Inquiry: 『Is it A, or is it B?』
Answer: “It is A.”
3) Inquiry
A unit for inquiring about specific data such as place, time, or information.

（例）問い合わせ：「○○○○スキー場の積雪情報はどうなっている？」
（例）問い合わせ：『これからどちらへいかれますか？』
４）話題の確認
文脈から外れた突然の話題遷移が発生したことをユーザに確認する。 (Example) Inquiry: “What is the snow cover information on the XX ski resort?”
(Example) Inquiry: 『Where are you heading now?』
4) Confirmation of topic
Confirms with the user that a sudden topic transition out of context has occurred.

（例）
：「１２時に岡崎にいく。」「ねむい。」
問い合わせ：『”ねむい”と聞こえましたけど、岡崎の話はどうなりま
したか？』
［２．４．４．３対話ユニット間の接続性］
上記の対話ユニット内の呼応関係のみならず、対話ユニット間の接続（話題の遷移や呼び出し、終了を含む）の妥当性に関する制約も認識語彙の設定において考慮することができる。 (Example)
: “I'm going to Okazaki at 12 o'clock.” “Nemui.” (I'm sleepy.)
Inquiry: 『I heard “nemui” (sleepy), but what has become of the Okazaki matter?』
[2.4.4.3 Connectivity between dialogue units]
In setting the recognition vocabulary, constraints on the validity of connections between dialogue units (including topic transitions, invocation, and termination) can be taken into account, in addition to the utterance-response correspondence within a dialogue unit described above.
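The dialogue units of [2.4.4.2] can be sketched as a lookup table from utterance type to admissible response types. The actual "○" marks of FIG. 9 are not reproduced in the text, so the table in this Python sketch contains only the pairs that appear in the worked examples and is an assumption, not the figure itself. The names `DIALOGUE_UNITS`, `is_valid_unit`, and `expected_responses` are hypothetical.

```python
# Utterance type -> response types seen in the examples of [2.4.4.2].
# Assumption: FIG. 9 may mark further combinations that are not listed here.
DIALOGUE_UNITS = {
    "call": {"reply"},
    "declaration": {"confirmation"},
    "notification": {"no response", "confirmation"},
    "teaching": {"confirmation"},
    "commentary": {"no response"},
    "instruction": {"confirmation"},
    "request": {"hold"},
    "warning": {"judgment"},
    "reminder": {"confirmation"},
    "inquiry": {"answer"},
}


def is_valid_unit(utterance_type, response_type):
    """True if the pair forms a dialogue unit according to the table."""
    return response_type in DIALOGUE_UNITS.get(utterance_type, set())


def expected_responses(utterance_type):
    """Response types to anticipate when choosing the next recognition vocabulary."""
    return sorted(DIALOGUE_UNITS.get(utterance_type, set()))
```

After the agent issues a request, for example, the table suggests preparing vocabulary for a "hold" response, which is how the agent could narrow the dictionary for the next user utterance.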

［２．４．５ユーザの特性や状況］
ユーザ発話に付随するユーザの環境・状況・要求・状態、ユーザプロファイルに基づいて次のユーザ発話に対応できる認識辞書を設定する。この場合、必ずしも上述の文脈の連続性が保たれるとは限らない。 [2.4.5 User characteristics and status]
Based on the user environment / situation / request / state associated with the user utterance and the user profile, a recognition dictionary that can respond to the next user utterance is set. In this case, the continuity of the above-mentioned context is not always maintained.

（１）自然な要求推定
例えば本願出願人が特願平１０−１８４８４０号にて提案したような要求推定装置に基づくと共に、図２に示す要求・状態推定用データ記憶部３６に記憶された要求・状態推定用データを参照し、ユーザの環境・状況・要求・状態、ユーザプロファイルから次の認識語彙を限定する。 (1) Natural request estimation: Based on a request estimation device such as the one proposed by the present applicant in Japanese Patent Application No. 10-184840, and with reference to the request/state estimation data stored in the request/state estimation data storage unit 36 shown in FIG. 2, the next recognition vocabulary is limited according to the user's environment, situation, request, state, and user profile.

（２）突然の運転状況の変化
不連続的に発生する予測不可能な緊急事態や警告の対象となる事態に際して、文脈の連続性を一時停止し、事態に必要な話題を割り込ませるべく認識語彙辞書を設定する。 (2) Sudden changes in driving conditions: In the event of an unpredictable emergency that occurs discontinuously, or a situation that warrants a warning, the continuity of context is suspended and the recognition vocabulary dictionary is set so that the topic required by the situation can interrupt.

（例）
エージェント：『これからどうなさいますか？』
ユーザ：「○○駅前で買い物」
エージェント：『到着時刻は１１時ごろです。駐車場はどこにしますか？』
（先行車が急停止したので急ブレーキをかけた）
ユーザ：「あー、危なかった。」
エージェント：『危なかったですね、安全運転にこころがけてください。お疲
れならば休みますか？』
（３）システムの機能移行
機能が切り替わったときにシステムが発話し、必要な対話を開始するべく認識語彙を設定する。 (Example)
Agent: 『What would you like to do now?』
User: “Shopping in front of XX station.”
Agent: 『Your arrival time will be around 11 o'clock. Where would you like to park?』
(The preceding car stopped suddenly, so the user braked hard.)
User: “Phew, that was close.”
Agent: 『That was close. Please try to drive safely. If you are tired, would you like to take a rest?』
(3) System function transition
When the function is switched, the system speaks, and the recognition vocabulary is set so as to start the necessary dialogue.

［２．５多段階処理の具体例］
多段階処理の具体例を、図１０，１１のフローチャートを参照して説明する。
ここでは、まず使用頻度の高いローカル情報を優先するかどうかを判断して（Ｓ２１０）、認識処理を２つにわける。なお、分岐条件はこれ以外にも考えられ、対話戦略に依存する。 [2.5 Specific example of multi-stage processing]
A specific example of multi-stage processing will be described with reference to the flowcharts of FIGS. 10 and 11.
Here, it is first determined whether frequently used local information is to be given priority (S210), and the recognition processing is divided into two accordingly. Other branch conditions are also conceivable; the choice depends on the dialogue strategy.

ローカル優先の場合は（Ｓ２１０：ＹＥＳ）、代表的な場所、施設名、要求キーワード、要求関連属性でＮ語の辞書を構成し、ワードスポッティングをかけ（Ｓ２２０）、要求キーワードの尤度を構文評価で補正する（Ｓ２３０）。そして、キーワード属性の重み付けによって尤度を補正し（Ｓ２４０）、単語列の順序付け（Ｓ２５０）を行う。その後、認識完了した音声区間を次回の認識対象から外し（Ｓ３６０）、全音声区間を認識完了していなければ（Ｓ３７０：ＮＯ）、不足する属性の語彙を追加して辞書更新の準備をしてから（Ｓ３８０）、Ｓ２１０へ戻る。 In the case of local priority (S210: YES), an N-word dictionary is constructed from representative places, facility names, request keywords, and request-related attributes, word spotting is performed (S220), and the likelihood of the request keyword is corrected by syntax evaluation (S230). The likelihood is then corrected by weighting the keyword attributes (S240), and the word strings are ordered (S250). After that, the speech sections whose recognition is complete are excluded from the next recognition target (S360). If not all speech sections have been recognized (S370: NO), vocabulary for the missing attributes is added in preparation for updating the dictionary (S380), and the process returns to S210.
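The weighting and ordering of S240 and S250 can be illustrated with the following Python sketch. The specification does not give the correction formula, so the multiplicative attribute weight used here is purely an assumption, as are the names `order_candidates` and `attribute_weights`. The syntax-based correction of S230 is not modeled.

```python
def order_candidates(candidates, attribute_weights):
    """Weight each likelihood by its attribute (S240) and sort descending (S250).

    candidates:        list of (word, attribute, likelihood) tuples
    attribute_weights: attribute -> multiplicative weight (assumed form);
                       attributes without an entry keep their likelihood
    """
    corrected = [
        (word, attribute, likelihood * attribute_weights.get(attribute, 1.0))
        for word, attribute, likelihood in candidates
    ]
    return sorted(corrected, key=lambda candidate: candidate[2], reverse=True)
```

Raising the weight of the request-keyword attribute moves request keywords up the ordering even when their raw likelihood is lower than that of a place name.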

なお、この場合はＳ２２０〜Ｓ２５０がローカル優先の場合の処理であり、この処理は「多段階処理」ではない。そして、このローカル優先の場合の処理によれば、１回の認識ですべての音声区間を処理可能であり、認識時間も少ないが、検索対象が例えば１００語に収まるように限られるため、日常的な要求から外れた発話の場合は誤認識になる確率が高くなる。 In this case, S220 to S250 are processes when local priority is given, and this process is not a “multi-stage process”. According to this local priority processing, all speech segments can be processed with one recognition and the recognition time is short, but the search target is limited to be within 100 words, for example. In the case of an utterance that deviates from the demand, the probability of misrecognition increases.

一方、ローカル優先でない場合、すなわち要求を優先する場合は（Ｓ２１０：ＮＯ）、まず、１回目の認識か否かを判断し（Ｓ２６０）、１回目の認識であれば（Ｓ２６０：ＹＥＳ）、認識属性を要求キーワードに設定して（Ｓ２７０）、Ｓ２８０へ移行する。２回目以降の認識であれば（Ｓ２６０：ＮＯ）、Ｓ２７０の処理は実行せずＳ２８０へ移行する。Ｓ２８０では、認識属性のキーワードを多く含む単語セットを構成し、その構成された単語セットを用いてワードスポッティングを実行する（Ｓ２９０）。 On the other hand, when local priority is not given, that is, when the request is given priority (S210: NO), it is first determined whether this is the first recognition pass (S260). If it is (S260: YES), the recognition attribute is set to the request keyword (S270), and the process proceeds to S280. If it is the second or a later pass (S260: NO), the process proceeds to S280 without executing S270. In S280, a word set containing many keywords of the recognition attribute is constructed, and word spotting is executed using that word set (S290).

その後、要求キーワードの尤度が所定値（ここでは一例として０．６とする。）以上かどうかを判断し（Ｓ３００）、要求キーワードが０．６以上の尤度を持っていれば（Ｓ３００：ＹＥＳ）、Ｓ３１０へ移行する。Ｓ３１０では、同一属性の単語が複数ある場合は構文位置を優先する。つまり、これで要求キーワードに基づく要求が確定し、続くＳ３２０においては、ヒットしたキーワードに対応する属性を次回の認識語彙に設定する。その後は、認識完了した音声区間を次回の認識対象から外し（Ｓ３６０）、全音声区間を認識完了していなければ（Ｓ３７０：ＮＯ）、不足する属性の語彙を追加して辞書更新の準備をしてから（Ｓ３８０）、Ｓ２１０へ戻る。 Thereafter, it is determined whether the likelihood of the request keyword is a predetermined value (here, 0.6 as an example) or more (S300), and if the request keyword has a likelihood of 0.6 or more (S300: YES), the process proceeds to S310. In S310, when there are a plurality of words having the same attribute, the syntax position is given priority. That is, the request based on the requested keyword is confirmed, and in the subsequent S320, the attribute corresponding to the hit keyword is set in the next recognized vocabulary. Thereafter, the recognized speech segment is removed from the next recognition target (S360). If all speech segments have not been recognized (S370: NO), the vocabulary of the missing attribute is added to prepare for dictionary update. (S380), the process returns to S210.

一方、要求キーワードの尤度が０．６未満の場合は（Ｓ３００：ＮＯ）、Ｓ３３０へ移行し、他の属性で尤度が０．６以上の単語があるか否かを判断する。そして、尤度が０．６以上の単語があれば（Ｓ３３０：ＹＥＳ）、その中で最も尤度が高い単語を認識属性とし、その認識属性のキーワードを多く含む単語セットを構成してワードスポッティングを実行する（Ｓ３４０）。しかし、尤度が０．６以上の単語がなければ（Ｓ３３０：ＮＯ）、場所、施設名、要求関連属性でＮ語の辞書を構成し、ワードスポッティングを実行する（Ｓ３５０）。Ｓ３４０，Ｓ３５０の処理後は、認識完了した音声区間を次回の認識対象から外し（Ｓ３６０）、上述したとおり、全音声区間を認識完了していなければ（Ｓ３７０：ＮＯ）、不足する属性の語彙を追加して辞書更新の準備をしてから（Ｓ３８０）、Ｓ２１０へ戻る。 On the other hand, when the likelihood of the request keyword is less than 0.6 (S300: NO), the process proceeds to S330, and it is determined whether or not there is a word having a likelihood of 0.6 or more in other attributes. If there is a word having a likelihood of 0.6 or more (S330: YES), the word with the highest likelihood is set as a recognition attribute, and a word set including a large number of keywords of the recognition attribute is formed to perform word spotting. Is executed (S340). However, if there is no word with a likelihood of 0.6 or more (S330: NO), an N-word dictionary is constructed with the location, facility name, and request-related attributes, and word spotting is executed (S350). After the processing of S340 and S350, the speech segment that has been recognized is removed from the next recognition target (S360). As described above, if the recognition of all speech segments has not been completed (S370: NO), the vocabulary of the missing attribute is After adding and preparing to update the dictionary (S380), the process returns to S210.
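The likelihood test of S300 and S330 can be sketched in Python as follows. The threshold 0.6 is the example value from the text; the function name `next_step`, the tuple layout of a hit, and the attribute label strings are assumptions. The preference for syntactic position among words of the same attribute (S310) is not modeled, so all qualifying request keywords are returned.

```python
LIKELIHOOD_THRESHOLD = 0.6  # example value given in the text
REQUEST_KEYWORD = "request keyword"


def next_step(hits, threshold=LIKELIHOOD_THRESHOLD):
    """Choose the step after one word-spotting pass (S290).

    hits: list of (word, attribute, likelihood) tuples.
    Returns (step label, payload).
    """
    confident = [hit for hit in hits if hit[2] >= threshold]
    request_hits = [hit for hit in confident if hit[1] == REQUEST_KEYWORD]
    if request_hits:
        # S300: YES. The request is determined from the request keyword(s).
        return "S310", [word for word, _, _ in request_hits]
    if confident:
        # S330: YES. The most likely word of another attribute guides the next pass.
        word, attribute, _ = max(confident, key=lambda hit: hit[2])
        return "S340", (word, attribute)
    # S330: NO. Fall back to the general N-word dictionary.
    return "S350", None
```

A request keyword that reaches the threshold always takes precedence, even when a word of another attribute has a higher likelihood, which matches the order of the tests in the flowchart.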

一方、全音声区間を認識完了していれば（Ｓ３７０：ＹＥＳ）、単語列の推定をし（図１１のＳ３９０）、エコーバックして（Ｓ４００）、不足条件を問い合わせる発話を行う（Ｓ４１０）。その後、ユーザの応答が否定的発話（例えば、違う、そうじゃない、など）かどうかを判断し、否定的発話でなければ（Ｓ４２０：ＮＯ）、不足する属性の語彙を追加して辞書更新の準備をしてから（Ｓ４３０）、図１０のＳ２１０へ戻る。一方、否定的発話であれば（Ｓ４２０：ＹＥＳ）、Ｓ４４０〜Ｓ４７０の誤認識対応処理を実行する。 On the other hand, if the entire speech section has been recognized (S370: YES), the word string is estimated (S390 in FIG. 11), echoed back (S400), and the utterance for inquiring about the shortage condition is performed (S410). Thereafter, it is determined whether the user's response is a negative utterance (for example, different or not), and if it is not a negative utterance (S420: NO), the vocabulary of the missing attribute is added to update the dictionary. After preparation (S430), the process returns to S210 in FIG. On the other hand, if it is a negative utterance (S420: YES), the misrecognition response process of S440-S470 will be performed.

具体的には、Ｓ４４０にてモード設定に基づいて分岐し、連続単語認識に切り替えてユーザに発話方法を指示するか（Ｓ４５０）、ヘルプモードとして要求キーワードを入れるようユーザに指示する（Ｓ４６０）、漸進的階層探索として認識可能な語彙を表示するか（Ｓ４７０）、のいずれかを実行する。その後は、不足する属性の語彙を追加して辞書更新の準備をしてから（Ｓ４８０）、図１０のＳ２１０へ戻る。 Specifically, the process branches in S440 according to the mode setting and executes one of the following: switching to continuous word recognition and instructing the user how to speak (S450), instructing the user, as a help mode, to include a request keyword (S460), or displaying the recognizable vocabulary as a progressive hierarchical search (S470). After that, vocabulary for the missing attributes is added in preparation for updating the dictionary (S480), and the process returns to S210 in FIG. 10.
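The negative-response test of S420 and the mode branch of S440 can be sketched in Python as follows. The two negative phrases are the examples given in the text; the substring test, the mode names, and the function `handle_response` are assumptions made for illustration.

```python
# Negative phrases named in the text ("different", "that's not it").
NEGATIVE_PHRASES = ("違う", "そうじゃない")

# Hypothetical mode names -> recovery step selected in S440.
RECOVERY_STEPS = {
    "continuous": "S450",   # switch to continuous word recognition
    "help": "S460",         # ask the user to include a request keyword
    "progressive": "S470",  # display the recognizable vocabulary
}


def handle_response(response, mode):
    """Return the next step after the echo-back and inquiry (S400, S410)."""
    is_negative = any(phrase in response for phrase in NEGATIVE_PHRASES)
    if not is_negative:
        return "S430"  # S420: NO, prepare the dictionary update and continue
    return RECOVERY_STEPS[mode]  # S420: YES, misrecognition handling
```

A response such as 「ぜんぜん違う」 contains one of the negative phrases and therefore triggers the recovery step for the configured mode.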

The following describes this processing applied to a user utterance such as "Um, I want to eat ramen in Anjo". First, a request keyword ("want to eat" and the like) is detected preferentially from the speech segment. To this end, the recognition vocabulary is also configured to contain many request keywords (enriching the types of request and the synonyms, near-synonyms, and associated words for the same request). With this request-priority approach, however, recognition is not completed by a single word-spotting pass. Therefore, as shown in FIG. 6, after the request keyword has been determined (meal-related), recognition is repeated with one word-spotting pass each for the facility name and for the request target. Once all speech segments have been recognized, word string estimation is run to narrow down the word string recognition results. For details of this narrowing, see, for example, Japanese Patent Application No. H11-20349.

Based on this, the system makes an echo-back utterance such as "You want to eat ramen in Anjo, right?" (S400) and then, driven by its logical utterance strategy, asks about the missing conditions (S410): "There are three ramen shops in Anjo. Which shop would you like?" If the user had in fact said something entirely different, such as "I want to buy a refrigerator at ○○", the user responds to the echo-back negatively (for example, "That's completely wrong") (S420: YES). In that case the misrecognition handling process (S440 to S470) is invoked.

If no negative response is given (S420: NO), the agent interprets the dialogue with the user as proceeding correctly, sets the keywords the user is expected to utter next (such as "the first one" or "change it to a sushi restaurant") in the recognition vocabulary dictionary (S430), and waits for the user's utterance.

[3. Misrecognition handling]
Help, presentation of options, initialization, and questions to the user are driven on the basis of the following: likelihood judgment and vocabulary restriction for the recognized vocabulary output from the speech recognition unit; help requests from the user; simplification of input supported by the screen display (combined use of switch operation); determination of missing conditions; and organic association of service contents.

[3.1 Detection of misrecognition]
A situation in which the agent has failed to recognize the user's utterance is a misrecognition. It arises in the following cases.

1) The user utters a word that is in none of the recognition dictionaries.
2) The user utters a word that is in another dictionary but not in the dictionary currently in use.
3) The agent responds to the utterance of a speaker other than the user and changes mode against the user's intention.
Such situations are detected when the user responds to the agent with utterances such as "That's wrong", "You don't get it", or "That's no good at all". In this case, one of the following dialogue modes is selected according to the user's situation.

(1) Voice menu mode
(2) Presentation of options
If an utterance such as "That's wrong" is repeated by the user K or more times, the system returns to the initial state. K is, for example, 5.
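The return-to-initial-state rule can be illustrated with a small counter. This is only a sketch: the class name `NegationTracker`, the return labels, and the English stand-in phrases are assumptions introduced here, and the text does not say whether the K negative utterances must be consecutive (the sketch assumes they must).

```python
K = 5  # example value given in the description

# English stand-ins for the negative utterances quoted in the text (assumption:
# a real system would match the recognizer's own output instead).
NEGATIVE_UTTERANCES = {"that's wrong", "you don't get it", "that's no good at all"}


class NegationTracker:
    """Counts negative user responses and signals a return to the initial state."""

    def __init__(self, k=K):
        self.k = k
        self.count = 0

    def observe(self, utterance):
        if utterance.lower() in NEGATIVE_UTTERANCES:
            self.count += 1
            if self.count >= self.k:      # "K or more times"
                self.count = 0
                return "initialize"
            return "recover"              # voice menu mode or presentation of options
        self.count = 0                    # assumption: only consecutive negatives count
        return "continue"
```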

[3.1.1 Detection of an unexpected response]
The likelihood attached to each recognized word is compared with a threshold. A low likelihood means that the reliability of speech recognition is low, so the system assumes that something outside the recognition vocabulary was uttered and judges that a misrecognition is possible.

[3.2 Detection of topic change]
Whether a user utterance departs from the agent's expectation is judged by whether it follows the context so far, that is, whether it is an utterance expected within a given topic. Expected utterances are derived from the relationships between words described in "2.4 Network between words", and the corresponding vocabulary is set in the recognition dictionary. For convenience this is called case A. Everything else is an unexpected utterance, and the words that make it up can be classified as follows.

(B1) Words not registered in the recognition vocabulary dictionary
(B2) Words registered in the recognition vocabulary dictionary but belonging to a different topic
(B2a) Words included in the current recognition vocabulary range
(B2b) Words not included in the current recognition vocabulary range
Of these, (B1) and (B2b) cannot be recognized by ordinary word spotting, so they are either treated as unnecessary words or replaced in the output by another recognizable word that is close in signal-processing terms. These cases are dealt with by the misrecognition handling described later.
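The classification above can be expressed as a few set-membership checks on the word the user actually said. This is an illustrative sketch only: the function names and the three sets passed in are hypothetical, and in a running system the true word is of course not directly observable for cases B1 and B2b.

```python
def classify_word(word, expected, active_vocabulary, full_dictionary):
    """Return the case label (A, B1, B2a, B2b) for a word the user actually said.

    expected          -- words predicted from the current topic (case A)
    active_vocabulary -- words currently loaded for word spotting
    full_dictionary   -- every word registered in the recognition vocabulary dictionary
    """
    if word in expected:
        return "A"
    if word not in full_dictionary:
        return "B1"
    if word in active_vocabulary:
        return "B2a"
    return "B2b"


def spotter_can_recognize(label):
    # Per the text, B1 and B2b words cannot be recognized by ordinary word spotting.
    return label in ("A", "B2a")
```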

Cases (A) and (B2a), on the other hand, are handled by the following three processing forms.
(1) Context-priority processing [case (A)]
When there is little difference in likelihood among the multiple recognition candidates (lattice) that were output, that is, when the variance is small, the candidate that fits the context is selected preferentially.

(2) Sudden topic transition (topic change) [case (B2a)]
When exactly one recognition candidate is output and its likelihood is at or above a certain threshold, the system accepts that the conversation has suddenly moved to that topic.

(3) Confirmation of a sudden topic transition [case (B2a)]
When exactly one recognition candidate is output but its likelihood falls below a certain threshold, the system asks the user to confirm whether the conversation has suddenly moved to that topic.
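One way to picture how the three processing forms divide the work is the sketch below. The description gives no numeric values for the variance limit or the likelihood threshold, so `variance_limit` and `accept_threshold` are placeholder assumptions, as are the function name and the return labels.

```python
from statistics import pvariance


def select_handling(candidates, on_topic_words,
                    variance_limit=0.01, accept_threshold=0.6):
    """candidates: list of (word, likelihood) pairs from the recognizer.

    Both numeric limits are illustrative placeholders, not values from the text.
    """
    if len(candidates) == 1:
        word, likelihood = candidates[0]
        if word in on_topic_words:
            return ("expected", word)              # case A, nothing unusual
        if likelihood >= accept_threshold:
            return ("topic_changed", word)         # form (2)
        return ("confirm_topic_change", word)      # form (3)
    if len(candidates) > 1:
        spread = pvariance([lik for _, lik in candidates])
        in_context = [c for c in candidates if c[0] in on_topic_words]
        if spread < variance_limit and in_context:
            best = max(in_context, key=lambda c: c[1])
            return ("context_priority", best[0])   # form (1)
    return ("undecided", None)                     # outside the three forms
```

The "expected" and "undecided" outcomes are additions made so the function is total; the text itself describes only the three forms.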

[3.2.1 Detection of topic change]
When a request keyword that differs from the current topic is recognized, the topic is regarded as possibly having changed.

[3.2.2 Confirmation of topic change]
Based on the above, the system generates a question to the user to confirm that the topic has changed.

(Example)
Agent: "There are three ramen shops in front of □□ Station."
User: "Call ○○ in Tokyo."
Agent: "Do you want to make a phone call?"
User: "Yes. To ○○ in Tokyo."
Agent: "Calling ○○ in Tokyo."
[3.2.3 Topic-change utterance]
An utterance declaring that the topic has changed is generated (see the example above).

[3.3 Context-priority utterance]
The topic so far, T(n) (n: serial number of the utterance pair), is assumed to continue, and the next agent utterance is also generated on the basis of that topic. Accordingly, even if the interpretation result U(n) of the current user utterance is vocabulary unrelated to topic T(n), the agent does not respond to it immediately but restricts the content of its utterance to follow the context of topic T(n).

(Example)
Agent: "There are three ramen shops in front of ○○ Station."
    {T(n) = meal}
User: "Oh, I have to make a phone call."
    {If "phone" is recognized, Tnew = phone, but the topic is not updated}
Agent: "Which ramen shop would you like?"
    {T(n+1) = T(n) = meal}
User: "Um, □□."
Agent: "Displaying the route to □□."
The topic change described earlier and the context priority described here are conflicting responses. The choice between them can be made using, for example, the likelihood information LFK of U(n) and the number of consecutive occurrences Ntnew of the topic Tnew indicated by U(n). That is, a conditional branch such as the following is used: if Ntnew > 2 and LFK > 0.4, the topic moves from T(n-1) to T(n) = Tnew; otherwise the Tnew obtained from U(n) is rejected and T(n) = T(n-1).
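The conditional branch just described can be written down directly. In the sketch below, `next_topic` encodes the stated condition (Ntnew > 2 and LFK > 0.4); the `TopicTracker` class and the way it counts and resets consecutive occurrences are assumptions added for illustration, since the text does not specify how Ntnew is maintained.

```python
def next_topic(previous_topic, new_topic, consecutive_count, likelihood,
               count_limit=2, likelihood_limit=0.4):
    """Return T(n): switch to Tnew only if Ntnew > 2 and LFK > 0.4."""
    if consecutive_count > count_limit and likelihood > likelihood_limit:
        return new_topic
    return previous_topic          # Tnew rejected, T(n) = T(n-1)


class TopicTracker:
    """Hypothetical helper that maintains Ntnew across utterance pairs."""

    def __init__(self, topic):
        self.topic = topic
        self._candidate = None
        self._count = 0

    def observe(self, utterance_topic, likelihood):
        if utterance_topic == self.topic:
            # Utterance fits the current topic: forget any pending candidate.
            self._candidate, self._count = None, 0
            return self.topic
        if utterance_topic == self._candidate:
            self._count += 1
        else:
            self._candidate, self._count = utterance_topic, 1
        self.topic = next_topic(self.topic, utterance_topic,
                                self._count, likelihood)
        if self.topic == utterance_topic:
            self._candidate, self._count = None, 0
        return self.topic
```

With these limits, a single mention of "phone" during a meal dialogue leaves the topic unchanged, which matches the context-priority example above.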

[3.4 Help mode]
Options such as examples of common misrecognitions and typical request keywords are displayed or spoken.

(Example)
・To make a phone call, say "Make a phone call" to display the phone number screen, then enter the other party's phone number. For a registered party (for example, ○○), "Make a phone call, ○○" also works.

・Which of the following do you mean? Please say it again. {Meal (restaurant, food, hungry), schedule book, address book (address list, phone book), map (route guidance, drive assistant)}
・Please say it again using one of the synonyms in parentheses.

・For a map display, it is more reliable to select the road map from route guidance.
・Destinations are recognized more accurately when spoken with the municipality type attached (example: Kariya → Kariya-shi).
[3.5 Presentation of options]
Presentation of options corresponds to the progressive hierarchical search already described.

[3.6 Normal utterance strategy]
When no misrecognition is detected from the recognition result (recognized word, likelihood), the normal utterance strategy is applied. Because the content of this strategy is not the focus of the present invention, it is not described in detail here.

[3.7 Adaptation of dialogue management]
[3.7.1 Adaptation to environment and situation]
Topic selection, dialogue management, and delivery of appropriate messages can be performed according to the following: time (season, date, time of day [morning, noon, night]); space (own-vehicle position, region [prefecture, municipality]); environment (road environment [expressway, ordinary road, tunnel, etc.], road condition [icy, slippery, etc.], traffic environment [expressway, speed limits, etc.], geographic environment [near the sea, in the mountains, in town, in front of a station, etc.]); conditions outside the vehicle (weather, traffic [congestion, etc.], surroundings [overtaking vehicle present, etc.]); and conditions inside the vehicle (driving state, occupancy, purpose of travel, topic). These are also reflected in the display system.

[3.7.2 Adaptation to schedule]
(1) Drive schedule
A drive schedule is created from the destination and waypoints set by the user. Based on the meaning of each event (meal, shopping, sightseeing, etc.) and its place and time, the system can determine topics, manage the dialogue, and make suggestions for blank parts of the drive schedule.

(2) Personal schedule
Personal schedule data on a PDA or PC is downloaded via an interface for the PDA or PC. Based on that data, the system can determine topics, manage the dialogue, and make suggestions for blank parts of the personal schedule.

[3.7.3 When the user cannot understand the meaning of a response]
When the user says something like "What do you mean?" or "I don't understand", the system judges that the user has not understood the agent's response and performs one of the following.

(1) Help function
(2) Menu selection
(3) Voice menu
[3.7.4 When the agent cannot fulfill the user's request]
Even when the user's utterance is recognized correctly, the agent may be unable to fulfill the request in cases such as the following. In these cases the situation is conveyed by a voice message.

1) There are no search results (no corresponding database).
2) The relevant device is absent, faulty, or not ready (for example, the phone is not connected).
3) A control command exceeding the operating range of the target device is issued.
[3.8 Voice menu mode]
The system shifts to a device operation mode such as that proposed by the present applicant in Japanese Patent Application No. H10-177667. The commands needed in each operation mode are always added preferentially to the recognition vocabulary dictionary to raise the recognition likelihood.

[3.9 Asking back]
Several cases of asking back are possible, as listed below. When the agent asks back, a recognition dictionary restricted to the vocabulary relevant to the question is set.

(1) When the agent has correctly recognized the topic but the user does not speak in line with it, the agent asks back to confirm the topic, thereby avoiding subsequent misrecognition.

(Example)
Agent: "There are three ramen shops in front of □□ Station."
User: "Call ○○ in Tokyo."
Agent: "Do you want to make a phone call?"
(2) When the agent has misrecognized the topic and asked about missing conditions under the wrong topic, a query or negative remark from the user (such as "That's wrong" or "That's not it") prompts the agent to ask back what the correct topic is. This puts a stop to subsequent misrecognition.

(Example)
User: "Call ○○ in Tokyo."
Agent: "Displaying restaurants near □□ Station. Which one would you like?"
User: "That's not it. Make a phone call."
Agent: "A phone call, or a meal?"
User: "Phone call."
This is a case in which a restaurant search request had already been issued and the user interrupted, during the search, with a request to make a phone call. It is an example of asking back about the priority of two tasks (phone call or meal).

(3) The agent correctly recognizes the topic, confirms that a condition required by the context is missing, and then asks back. This lets a correct, context-consistent dialogue unfold and speeds up problem solving.

(Example 1)
User: "Call ○○ in Tokyo."
Agent: "You want to call ○○ in Tokyo. Is that ○○◇◇ or ○○△△?"
User: "○○◇◇."
Agent: "Calling ○○◇◇."
(Example 2)
User: "Call ○○ in Tokyo."
Agent: "Displaying restaurants near □□ Station. Which one would you like?"
User: "Do that later. Make the call first."
Agent: "Calling ○○◇◇ in Tokyo. Is that all right?"
User: "Yes, go ahead."
Example 1 is a case in which the agent recognized the request correctly and asked about the condition that was missing for solving the task of making a phone call.

In Example 2, if the user's intention to make the phone call is known from the course of the dialogue so far, the schedule, or the like, the information needed for the purpose of that call (having a meal with ○○) can be retrieved in advance. In other words, it is an example in which the agent anticipated the need, searched for restaurants on its own initiative, and asked back.

[3.10 Initialization]
When the number of times the user asks back exceeds K (K > N), the system returns to the initial state or to the top menu screen.

As described above, the control system of this embodiment performs progressive hierarchical search, multi-stage processing, unexpected-response handling, and misrecognition handling, each of which provides the following effects.

First, in progressive hierarchical search, take the utterance mentioned earlier, "I'd like to eat in Okazaki; Indian food would be nice." Suppose word-spotting speech recognition outputs the word string "Okazaki, meal, Indian food". With conventional speech recognition, system-side processing corresponding to the recognition starts only once all three words are available. With this method, by contrast, as soon as "Okazaki" is input, the vocabulary the user is expected to say next can be presented immediately. The user can therefore speak without hesitation, which prevents the user from uttering vocabulary outside the recognition dictionary, a cause of misrecognition.

In multi-stage processing, the recognition dictionary is constructed dynamically and on a small scale, centered on elements such as the request keyword in the word string and using the semantic constraints between the words that make up the word string, so that recognition is performed appropriately. This prevents misrecognitions the user did not intend.

In unexpected-response handling, when an utterance from the user during the dialogue is not the expected, context-consistent content, one of the following is performed: topic-change confirmation processing, which asks whether the topic has changed; topic-change declaration processing, which declares that the topic has changed; or context-priority response processing, which assumes the previous topic continues and responds in line with the context. In misrecognition handling, when an utterance from the user during the dialogue has predetermined negative content, the system executes processing such as ask-back processing to confirm what the correct topic is, help-mode processing that presents, visually or audibly, an instruction to include a keyword corresponding to the requested content, and initialization processing that returns to the initial state. This prevents the user from being left at a loss because of a misrecognition.

Furthermore, this embodiment is all the more effective because progressive hierarchical search, multi-stage processing, unexpected-response handling, and misrecognition handling are combined.
This embodiment has been described as a system that is mounted in, for example, an automobile and controls various on-board devices while conversing by voice with a vehicle occupant (mainly the driver) as the user. As the conceptual diagram of FIG. 1 shows, however, the same processing can be applied to word string candidates produced by other kinds of signal recognition units. For example, the recognizer may be a character recognition device that compares a handwritten character string input by the person being recognized with dictionary data and outputs multiple word string candidates with a high degree of matching. Handwritten characters are also prone to misrecognition, so estimating the proper word string is highly effective. Nor is the approach limited to cases such as speech or character recognition, where the input is already directly the content of a word string; the recognizer may be an image recognition device. That is, it can be realized with any recognition device that recognizes an image of the recognition target as a scene and then renders the scene in natural language. A concrete example is a device that recognizes sign-language patterns from images of a person signing and outputs word string candidates expressing the natural-language meaning of those patterns. Because the word expressed by a sign depends on subtle finger movements, individual differences among signers cause misrecognition here as well. In a sign-language pattern recognition device, where misrecognition is likely, executing the various processes described above likewise prevents misrecognition or enables appropriate handling after it occurs, and so prevents the user from being left at a loss.

One embodiment of the present invention has been described above, but the invention is of course not limited to that embodiment and can take various forms.
For example, in the flowchart of FIG. 3 outlining the processing, multi-stage processing (S30) is performed first and progressive hierarchical search (corresponding to S130) is executed within the subsequent misrecognition handling, but the order is not limited to this.

In the above embodiment, progressive hierarchical search, multi-stage processing, unexpected-response handling, and misrecognition handling were combined, but each is effective when implemented alone. As noted above, implementing them in combination is more effective still. Not all four need be combined; any combination of two or more suffices.

FIG. 1 is a block diagram conceptually showing the word string recognition device of the embodiment in terms of its functions.
FIG. 2 is a block diagram showing the configuration when the word string recognition device is applied to an in-vehicle control system.
FIG. 3 is a flowchart outlining the processing executed by the system control unit.
FIG. 4 is an explanatory diagram showing a concrete example of screen transitions in progressive hierarchical search.
FIG. 5 is an explanatory diagram showing a concrete example of screen transitions in progressive hierarchical search.
FIG. 6 is an explanatory diagram outlining multi-stage processing.
FIG. 7 is an explanatory diagram outlining the dynamic construction of the dictionary.
FIG. 8 is an explanatory diagram showing the structure of the dialogue database table.
FIG. 9 is an explanatory diagram showing the utterance-response combinations that form a dialogue unit.
FIG. 10 is a flowchart showing the first half of multi-stage processing.
FIG. 11 is a flowchart showing the second half of multi-stage processing.

Explanation of symbols

1: control device; 3: input device; 5: microphone; 7: speaker; 8: display; 9: navigation device; 11: display device; 13: air-conditioner device; 15: audio device; 17: communication device; 21: system control unit; 25: voice input unit; 27: speech synthesis unit; 28: screen control unit; 29: device control I/F; 31: Internet address database; 33: search control unit; 34: recognition vocabulary storage unit; 35: dialogue data storage unit; 36: request/state estimation data storage unit; 37: user profile storage unit.

Claims

A word string recognition device comprising:
word string output means for receiving information reflecting the action of a person who is the recognition target and outputting word string candidates having a high degree of matching when compared with recognition dictionary data; and
unexpected-response handling means for judging, on the basis of a request keyword indicating what the request in the sentence is, whether the word string output from the word string output means has the expected content in accordance with the context, that is, whether it is an utterance expected within a given topic, and, when the word string is judged to be unexpected because a request keyword of a topic different from the current topic has been recognized, performing at least one of: topic-change confirmation processing that asks whether the topic has changed; topic-change declaration processing that declares that the topic has changed; and context-priority response processing that responds in line with the context on the assumption that the previous topic continues,
wherein the unexpected-response handling means executes the topic-change declaration processing if the word string output from the word string output means after execution of the topic-change confirmation processing has content in line with the changed topic.

The word string recognition device according to claim 1,
wherein, even when the topic has changed, the unexpected-response handling means executes the context-priority response processing immediately after the change and executes the topic-change confirmation processing only when the changed topic continues thereafter.

A word string recognition device comprising:
word string output means for receiving information reflecting the action of a person who is the recognition target and outputting word string candidates having a high degree of matching when compared with recognition dictionary data; and
unexpected-response handling means for judging, on the basis of a request keyword indicating what the request in the sentence is, whether the word string output from the word string output means has the expected content in accordance with the context, that is, whether it is an utterance expected within a given topic, and, when the word string is judged to be unexpected because a request keyword of a topic different from the current topic has been recognized, performing at least one of: topic-change confirmation processing that asks whether the topic has changed; topic-change declaration processing that declares that the topic has changed; and context-priority response processing that responds in line with the context on the assumption that the previous topic continues,
wherein, even when the topic has changed, the unexpected-response handling means executes the context-priority response processing immediately after the change and executes the topic-change confirmation processing only when the changed topic continues thereafter.

The word string recognition device according to any one of claims 1 to 3,
wherein the word string output means compares speech input by the person who is the recognition target with recognition dictionary data and outputs word string candidates having a high degree of matching.

The word string recognition device according to any one of claims 1 to 4,
wherein the word string output means compares a handwritten character string input by the person who is the recognition target with recognition dictionary data and outputs a plurality of word string candidates having a high degree of matching.