JP2007004054A - Voice interactive device and voice understanding result generation method

Info

Publication number: JP2007004054A
Application number: JP2005186903A
Authority: JP
Inventors: Keiko Katsuragawa; 景子桂川; Minoru Togashi; 実冨樫; Takeshi Ono; 健大野
Original assignee: Nissan Motor Co Ltd
Current assignee: Nissan Motor Co Ltd
Priority date: 2005-06-27
Filing date: 2005-06-27
Publication date: 2007-01-11
Anticipated expiration: 2025-06-27
Also published as: JP4635743B2

Abstract

PROBLEM TO BE SOLVED: To improve the recognition rate by reducing repetition of the same misrecognition while leaving open the possibility that a previously cancelled recognition result is adopted as an understanding result. SOLUTION: An understanding result generation unit 53 generates an understanding result, which serves as the response to an uttered voice, using recognition result candidates selected from among a plurality of recognition result candidates output by a voice recognition unit 52, on the basis of an understanding result score given to each candidate. A cancel button 12 instructs correction of the generated understanding result. The understanding result score given to the candidate corresponding to the understanding result for which correction was instructed is modified so that the candidate becomes less likely to be selected when the understanding result generation unit 53 generates an understanding result. COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a voice interactive device that conducts a dialogue in response to an uttered voice, and more particularly to a voice interactive device and a voice understanding result generation method that improve the recognition rate of a corrective utterance made after a misrecognition has occurred.

Voice interactive devices have been devised that receive a voice uttered by a user and converse with the user by returning a system response according to the voice recognition result of the input voice. In such a device, once a misrecognition occurs, voice recognition processing can be executed anew on a corrective utterance made after the user presses a cancel button or the like.

For such voice interactive devices, a technique has been disclosed in which, when the voice recognition result for an input voice is cancelled by the user because it was misrecognized, that result is removed from the recognition targets, which prevents the same misrecognition from being repeated (Patent Document 1).
Patent Document 1: Japanese Unexamined Patent Application Publication No. H04-177956

However, the technique disclosed in Patent Document 1 immediately excludes a previously cancelled phrase from the recognition targets, so a phrase that has been cancelled once cannot be re-entered even if the user utters it clearly. For example, the technique cannot cope at all with a case where the user mistakenly cancels the phrase that was actually intended, or where the cancelled phrase needs to be entered in a different context. It therefore lacks flexibility.

The present invention has been proposed in view of the above circumstances. Its object is to provide a voice interactive device and a voice understanding result generation method that can reduce repetition of the same misrecognition while leaving open the possibility that an utterance once cancelled as misrecognized is adopted as an understanding result.

The voice interactive device of the present invention solves the above problem by comprising: input means for inputting an uttered voice; voice recognition means for recognizing the voice input by the input means on the basis of recognition target words; understanding result generation means for generating an understanding result, which serves as a response to the uttered voice, using a recognition result candidate selected from among a plurality of recognition result candidates produced by the voice recognition means, on the basis of a predetermined selection reference value given to each recognition result candidate; correction instruction means for instructing correction of the understanding result generated by the understanding result generation means; and modification means for modifying the predetermined selection reference value given to the recognition result candidate corresponding to the understanding result for which correction was instructed, in a direction that makes that candidate less likely to be selected when the understanding result generation means generates an understanding result.

The voice understanding result generation method of the present invention solves the above problem by comprising: an input step of inputting an uttered voice; a voice recognition step of recognizing the voice input in the input step on the basis of recognition target words; an understanding result generation step of generating an understanding result, which serves as a response to the uttered voice, using a recognition result candidate selected from among a plurality of recognition result candidates produced by the voice recognition step, on the basis of a predetermined selection reference value given to each recognition result candidate; a correction instruction step of instructing correction of the understanding result generated in the understanding result generation step; and a modification step of modifying the predetermined selection reference value given to the recognition result candidate corresponding to the understanding result for which correction was instructed, in a direction that makes that candidate less likely to be selected when an understanding result is generated in the understanding result generation step.

The voice interactive device of the present invention generates an understanding result, which serves as a response to the uttered voice, using a recognition result candidate selected from among a plurality of recognition result candidates produced by the voice recognition means, on the basis of a predetermined selection reference value given to each candidate. It then modifies the predetermined selection reference value given to the recognition result candidate corresponding to the understanding result for which the correction instruction means instructed correction, in a direction that makes that candidate less likely to be selected when the understanding result generation means generates an understanding result.

This reduces repeated misrecognition of the voice uttered by the user while leaving open the possibility that a cancelled recognition result is adopted as an understanding result. The recognition rate for the voice uttered by the user can therefore be improved significantly.
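The difference between lowering a candidate's selection reference value and excluding the candidate outright can be illustrated with a minimal Python sketch. The names `Candidate`, `select`, `penalize_cancelled` and the multiplicative `PENALTY` of 0.5 are hypothetical; the text above does not specify how the value is modified, only that the cancelled candidate becomes harder to select.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    score: float  # selection reference value; higher means more likely to be chosen


PENALTY = 0.5  # hypothetical factor; the source gives no concrete value


def select(candidates):
    """Return the candidate with the highest selection reference value."""
    return max(candidates, key=lambda c: c.score)


def penalize_cancelled(candidates, cancelled_text, penalty=PENALTY):
    """Lower the score of the cancelled candidate instead of removing it."""
    return [
        Candidate(c.text, c.score * penalty if c.text == cancelled_text else c.score)
        for c in candidates
    ]


# Illustrative scores only (not taken from the patent).
first = [Candidate("Kitagawa Station", 0.55), Candidate("Shinagawa Station", 0.45)]
cancelled = select(first).text                  # the result the user cancels
second = penalize_cancelled(first, cancelled)   # same candidates, adjusted score
best_after_cancel = select(second).text
```

Because the cancelled candidate stays in the list, it can still win on a later utterance if its score is high enough, which is the flexibility that outright exclusion lacks.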

Likewise, the voice understanding result generation method of the present invention generates an understanding result, which serves as a response to the uttered voice, using a recognition result candidate selected from among a plurality of recognition result candidates produced by voice recognition, on the basis of a predetermined selection reference value given to each candidate. It then modifies the predetermined selection reference value given to the recognition result candidate corresponding to the understanding result for which correction was instructed, in a direction that makes that candidate less likely to be selected when an understanding result is generated.

Embodiments of the present invention are described below with reference to the drawings.

First, the configuration of the voice interactive device shown as a first embodiment of the present invention is described with reference to FIG. 1. The device shown in FIG. 1 is configured for application to a navigation device mounted on a moving body such as a vehicle. When mounted on a vehicle, for example, the navigation device detects the current position of the vehicle and provides route guidance to a desired destination while displaying a map, drawn from map data, that corresponds to the vehicle's current position.

When this voice interactive device is applied to a navigation device, the various functions required of the navigation device can be operated interactively through dialogue between the user and the system.

As shown in FIG. 1, the voice interactive device comprises: an input device 10; a microphone 20; a memory 30; a disk 40 that stores map data used for route guidance, voice data for guidance voice, and the like; a disk reading device 41 that reads the various data stored on the disk 40; a control device 50 that recognizes the voice input through the microphone 20, understands the content of the voice recognition result, and generates a system response; a monitor 60, for example a liquid crystal display, that displays maps showing route search results, menu screens, voice recognition results from the control device 50, and the like; and a speaker 70 that outputs guidance voice, system response voice in the dialogue with the user, and the like.

The input device 10 includes a voice recognition start button 11 and a cancel button 12. When pressed by the user, the voice recognition start button 11 instructs the start of voice recognition processing for the voice uttered by the user and input through the microphone 20. The cancel button 12 is used, when the understanding result generated from the recognition result of the voice recognition processing is wrong, to return the system state to the state before the immediately preceding voice input and redo the voice input.

The microphone 20 inputs the voice uttered by the user to a voice recognition unit 52 of the control device 50, described later. For example, the user utters words and sentences used to operate the navigation device, namely operation commands, proper nouns such as place names, facility names, and road names, and sentences containing these, and the voice is input through the microphone 20.

The memory 30 is a random access memory (RAM) or the like and includes three storage areas. Storage area 31 stores and expands the voice recognition dictionary and grammar that the disk reading device 41 reads from the disk 40 when voice recognition processing is executed. Storage area 32 stores the words contained in the recognition result candidates obtained by the voice recognition processing together with their word reliabilities. Storage area 33 stores, when the cancel button 12 is pressed, the understanding result, system response, and the like at the time of cancellation as cancel information.

The disk 40 is a storage medium that stores the voice recognition dictionary and grammar used for voice recognition, a map database, voice data for guidance voice, and the like.

In general, a system that performs voice recognition using a voice recognition dictionary and grammar can accept as voice recognition results only input sentences that use the recognition target words and grammar described in that dictionary and grammar.

For example, if the main task of the navigation device is setting a destination for route search, the sentences that the user inputs through the microphone 20 can be expected to include both single-word inputs naming a place or facility, such as "Kanagawa Prefecture" or "Yokohama Station", and sentence inputs combining several keywords, such as "Yokohama Station in Kanagawa Prefecture" or "Yokohama Station on the Tokaido Line".

The voice recognition dictionary and grammar stored on the disk 40 are therefore configured to handle both single-word inputs and sentences containing multiple keywords.

Next, the control device 50 is described. The control device 50 comprises an input control unit 51, a voice recognition unit 52, an understanding result generation unit 53, a dialogue control unit 54, a GUI display control unit 55, and a voice synthesis unit 56. It performs voice recognition processing on the voice input through the microphone 20 and returns a system response according to the voice recognition result.

When the user presses the voice recognition start button 11, the input control unit 51 instructs the voice recognition unit 52 to start voice recognition processing. When the user presses the cancel button 12 to instruct correction of the immediately preceding understanding result, the input control unit 51 notifies the understanding result generation unit 53 accordingly.

Once voice recognition processing has started, each voice input from the microphone 20 triggers voice recognition processing by the voice recognition unit 52, understanding result generation processing by the understanding result generation unit 53, and system response output by the dialogue control unit 54. This sequence repeats until one task of the navigation device, such as destination setting or facility search, is completed. During this time, the user can cancel the immediately preceding understanding result by pressing the cancel button 12.

In response to the instruction from the input control unit 51, the voice recognition unit 52 captures the voice signal that was uttered by the user, input through the microphone 20, and digitized by an A/D converter (not shown), and executes voice recognition processing.

The voice recognition unit 52 performs voice recognition by matching the captured digitized voice signal against standby sentences composed of the recognition target words held in the voice recognition dictionary and grammar built in storage area 31 of the memory 30, and outputs the voice recognition result to the understanding result generation unit 53.

During the matching process, the voice recognition unit 52 calculates a likelihood, that is, the acoustic closeness between the voice feature data and each standby sentence, and treats the sentences whose likelihood is at or above a certain value as recognition result candidates.

The voice recognition unit 52 outputs to the understanding result generation unit 53, as recognition result candidates, the top N candidates with the highest likelihood (hereinafter also called the N-best candidates) together with their likelihoods.
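The thresholding and top-N selection described above can be sketched as follows. This is a minimal illustration only: the function name `n_best`, the dictionary of per-sentence likelihoods, and the numeric values are assumptions, and the actual acoustic matching that produces the likelihoods is outside the scope of the sketch.

```python
import heapq


def n_best(likelihoods, threshold, n):
    """Keep sentences whose likelihood is at or above the threshold,
    then return the n most likely as (sentence, likelihood) pairs."""
    kept = [(sentence, lik) for sentence, lik in likelihoods.items() if lik >= threshold]
    return heapq.nlargest(n, kept, key=lambda pair: pair[1])


# Illustrative log-likelihood values (hypothetical, not from the patent).
scores = {
    "Kitagawa Station": -10.0,
    "Shinagawa Station": -11.5,
    "Kanagawa Prefecture": -30.0,
    "Yokohama Station": -14.0,
}
candidates = n_best(scores, threshold=-20.0, n=2)
```

The returned pairs correspond to what the voice recognition unit 52 would hand to the understanding result generation unit 53: the candidates and their likelihoods, ordered from most to least likely.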

For every word contained in the recognition result candidates output by the voice recognition unit 52, the understanding result generation unit 53 calculates a word reliability for each reading of the word. Based on the calculated word reliabilities, it selects from the recognition result candidates the correct understanding result for the voice uttered by the user and outputs it to the dialogue control unit 54. The processing performed by the understanding result generation unit 53 is described in detail later.

The word reliability calculated by the understanding result generation unit 53 is now described. The understanding result generation unit 53 treats words that have the same meaning but different readings as different words and calculates a word reliability for each.

The word reliability is a value indicating the likelihood that a word was uttered with a given reading in a single utterance. Letting Conf(W) be the word reliability of a word W and Li be the log likelihood of each N-best candidate, it can be obtained by equation (1) below.

[Equation (1)]

The word reliability calculated by the understanding result generation unit 53 is stored in the memory 30. The calculation of word reliability by the understanding result generation unit 53 is disclosed in Japanese Unexamined Patent Application Publication No. 2004-251998.

The dialogue control unit 54 generates a response sentence based on the understanding result output from the understanding result generation unit 53 and outputs it to the GUI display control unit 55 and the voice synthesis unit 56.

As needed, the GUI display control unit 55 controls the disk reading device 41 to read the map data stored on the disk 40 and displays a map on the monitor 60, or displays on the monitor 60 the response content corresponding to the response sentence generated by the dialogue control unit 54.

The voice synthesis unit 56 synthesizes a digital voice signal corresponding to the response sentence generated by the dialogue control unit 54 and outputs it to the speaker 70 through a D/A converter and an output amplifier (neither shown) included in the voice synthesis unit 56.

Next, the processing performed by the control device 50 from the start of voice recognition processing to the output of a response sentence is described with reference to the flowchart shown in FIG. 2.

First, in step S1, when the navigation device is started, the control device 50 of the voice interactive device controls the disk reading device 41 to read the voice recognition dictionary and grammar from the disk 40 and store them in storage area 31 of the memory 30.

When the user then presses the voice recognition start button 11 of the input device 10, the input control unit 51 instructs the start of voice recognition and the voice recognition unit 52 becomes ready to recognize voice. The voice recognition unit 52 then starts capturing the voice signal that is uttered by the user, input through the microphone 20, and digitized by the A/D converter (not shown).

Until the voice recognition start button 11 is pressed, the voice recognition unit 52 continuously computes the average power of the digitized voice signal (hereinafter also simply called the digital signal). After the button is pressed, when the instantaneous power of the digital signal exceeds this average power by a predetermined value or more, the unit judges that the user has spoken and starts capturing the digitized voice signal.
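The power-based utterance detection described above can be sketched in Python as follows. Several details are assumptions made for illustration: the signal is handled as lists of float samples, "instantaneous power" is taken as the mean power of a short frame, and "exceeds by a predetermined value" is interpreted as a difference (a ratio test would be an equally plausible reading). The names `mean_power`, `detect_onset`, and `margin` are hypothetical.

```python
def mean_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def detect_onset(background, frames, margin):
    """Return the index of the first frame whose power exceeds the
    background average power by at least `margin`, or None if none does.

    background: samples observed before the start button was pressed
    frames:     short frames of samples observed after the button press
    """
    average = mean_power(background)
    for index, frame in enumerate(frames):
        if mean_power(frame) - average >= margin:
            return index
    return None


# Illustrative data: quiet background, then one quiet and one loud frame.
quiet = [0.01] * 100
onset = detect_onset(quiet, [[0.01] * 10, [0.5] * 10], margin=0.1)
no_onset = detect_onset(quiet, [[0.01] * 10, [0.02] * 10], margin=0.1)
```

Measuring the background level before the button press gives the detector a reference that adapts to cabin noise, which matters for a device intended for in-vehicle use.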

In step S2, in response to the instruction from the input control unit 51 to start voice recognition, the voice recognition unit 52 waits for the user to speak. If the user speaks, the process proceeds to step S3; if the user does not speak, the process proceeds to step S6.

In step S3, in response to the user's utterance, the voice recognition unit 52 executes voice recognition processing by comparing the captured digitized voice signal with the standby sentences held in the voice recognition dictionary and grammar built in storage area 31 of the memory 30 and calculating their acoustic likelihoods.

The voice recognition unit 52 outputs the top N recognition result candidates with the highest acoustic likelihoods, together with their likelihoods, to the understanding result generation unit 53 as the voice recognition result.

In step S4, the understanding result generation unit 53 executes understanding result generation processing. This processing is described in detail later.

In step S5, once the understanding result generation processing by the understanding result generation unit 53 has finished, the dialogue control unit 54 generates, based on the understanding result, the response sentence to be output to the voice synthesis unit 56 and the response display content to be output to the GUI display control unit 55.

In step S6, the input control unit 51 detects whether the cancel button 12 of the input device 10 has been pressed while the device is waiting for the user to speak. If the cancel button 12 has been pressed, the process proceeds to step S7. If it has not been pressed, the process returns to step S2 and the device continues to wait for a user utterance or a press of the cancel button 12 for a predetermined time.

In step S7, in response to the user pressing the cancel button 12, the input control unit 51 notifies the understanding result generation unit 53 that the cancel button 12 has been pressed. The understanding result generation unit 53 then saves, as cancel information in storage area 33 of the memory 30, the understanding result at the time the cancel button 12 was pressed together with the response sentence and response display content generated by the dialogue control unit 54 in the immediately preceding routine.

In step S8, the voice interactive device returns its system state to the immediately preceding state. As a result, the immediately preceding utterance is, in effect, cancelled by the user.
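Returning the system state to the immediately preceding state can be pictured as popping a history stack. The sketch below is one possible way to model this under that assumption; the class `DialogueState` and its methods are hypothetical, and the text does not say how the device actually represents or restores its state.

```python
class DialogueState:
    """Keeps a history of system states so the latest one can be undone."""

    def __init__(self, initial):
        self._history = [initial]

    @property
    def current(self):
        return self._history[-1]

    def advance(self, new_state):
        """Record the state reached after a new understanding result."""
        self._history.append(new_state)

    def cancel(self):
        """Undo the latest state and return it, or return None if the
        system is already in its initial state."""
        if len(self._history) > 1:
            return self._history.pop()
        return None


# Illustrative state labels (hypothetical).
state = DialogueState("awaiting destination")
state.advance("confirming Kitagawa Station")
undone = state.cancel()
```

The value returned by `cancel` plays the role of the cancel information: it identifies which understanding result was rejected, so that its selection reference value can be adjusted later.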

In step S9, the voice interactive device outputs a system response following the processing of step S5 or step S8.

When step S9 is reached via step S5, the GUI display control unit 55 displays on the monitor 60 the response display content generated by the dialogue control unit 54. The voice synthesis unit 56 synthesizes a digital voice signal corresponding to the response sentence generated by the dialogue control unit 54 and outputs it to the speaker 70 through the D/A converter and output amplifier (neither shown) included in the voice synthesis unit 56.

When step S9 is reached via step S8, the dialogue control unit 54 reads the response sentence and response display content saved as cancel information in storage area 33 of the memory 30 and outputs them to the voice synthesis unit 56 and the GUI display control unit 55, respectively.

The GUI display control unit 55 displays on the monitor 60 the response display content that the dialogue control unit 54 read from the memory 30. The voice synthesis unit 56 synthesizes a digital voice signal corresponding to the response sentence that the dialogue control unit 54 read from the memory 30 and outputs it to the speaker 70 through the D/A converter and output amplifier (neither shown) included in the voice synthesis unit 56.

In step S10, the dialogue control unit 54 judges whether the task started when the voice recognition start button 11 was pressed, such as facility search or destination setting, has been completed. If the whole task has been completed, the dialogue control unit 54 ends the voice recognition processing; if the task is still in progress, the process returns to step S1 and voice capture resumes.

When steps S6 to S8 have been passed through, the task is not complete, so the process returns to step S1 and the device waits for the next utterance from the user.

In this way, the voice interactive device executes recognition processing on the voice uttered by the user and outputs a system response according to the understanding result generated from the recognition result. By pressing the cancel button 12 at this point, the user can cancel the output system response, that is, the understanding result.

Next, the understanding result generation processing performed by the understanding result generation unit 53 is described, using the following dialogue between a user and the voice interactive device, referred to as "Dialogue Example 1".

Dialogue Example 1
First system utterance: "Please state your destination."
First user utterance: "Shinagawa Station on the ○× Railway"
Second system utterance: "Is Kitagawa Station on the ○× Railway correct?"
First user operation: Presses the cancel button 12
Third system utterance: "Please state your destination."
Second user utterance: "Shinagawa Station on the ○× Railway"
Fourth system utterance: "Is Shinagawa Station on the ○× Railway correct?"

First, Dialogue Example 1 is explained. In response to the prompt "Please state your destination", output as the first system utterance, the user said "Shinagawa Station on the ○× Railway" as the first user utterance in order to set the destination.

The voice interactive device misrecognized the first user utterance as "Kitagawa Station on the ○× Railway" and responded with the second system utterance, "Is Kitagawa Station on the ○× Railway correct?"

The user therefore pressed the cancel button 12 as the first user operation, cancelling the device's recognition result "Kitagawa Station on the ○× Railway". In response, the voice interactive device issues the same prompt as the first system utterance, "Please state your destination", as the third system utterance, so the apparent system state of the device returns to the state before the first user utterance was input.

In this example, the user then repeats the first user utterance, saying "Shinagawa Station on the ○× Railway" again as the second user utterance.

FIG. 3 shows the N-best list of recognition result candidates obtained by the speech recognition unit 52 for the first user utterance in Dialogue Example 1, "Shinagawa Station on the ○× Railway". FIG. 4 shows the N-best list obtained by the speech recognition unit 52 for the second user utterance, which is the same phrase. As FIGS. 3 and 4 show, the first-ranked recognition result candidate in both lists is "Kitagawa Station on the ○× Railway".

Thus, for both the first and the second user utterance, the first candidate in the recognition result of the speech recognition unit 52 is "Kitagawa Station on the ○× Railway", and the same misrecognition is repeated. The voice interactive device shown as the embodiment of the present invention can avoid this kind of repeated misrecognition and generate an accurate understanding result.

With Dialogue Example 1 in mind, the understanding result generation process performed by the understanding result generation unit 53, which is the processing in step S4 of the flowchart shown in FIG. 2, is described below using the flowchart shown in FIG. 5.

First, in step S21, upon receiving the N-best list of recognition result candidates from the speech recognition unit 52, the understanding result generation unit 53 checks whether the cancel button 12 was pressed immediately before the current utterance. If the cancel button 12 was not pressed immediately before the current utterance, the process proceeds to step S22; if it was pressed, the process proceeds to step S24.

In step S22, because the cancel button 12 was not pressed immediately before the current utterance, the understanding result generation unit 53 takes the first-ranked candidate in the N-best list output by the speech recognition unit 52 as the understanding result.

In step S23, the understanding result generation unit 53 deletes any past recognition results stored in the storage area 32 of the memory 30. If the cancel button 12 was pressed before the previous utterance and the understanding result at the time of cancellation is stored as cancel information in the storage area 33 of the memory 30, that cancel information is deleted as well.

In Dialogue Example 1, when the understanding result for the first user utterance is generated, the cancel button 12 has not been pressed beforehand. The understanding result generation unit 53 therefore selects the candidate with the highest likelihood in FIG. 3, "Kitagawa Station on the ○× Railway", as the understanding result.

In Dialogue Example 1, the system accordingly outputs the second system utterance, "Is Kitagawa Station on the ○× Railway correct?", and the user presses the cancel button 12 in response. The system therefore stores the immediately preceding understanding result, "○× Railway + Kitagawa Station", as cancel information in the storage area 33 of the memory 30 (FIG. 2: step S7) and returns the system state to the immediately preceding state (FIG. 2: step S8). The cancel information stored in the storage area 33 of the memory 30 is used the next time a recognition result is interpreted.

In step S24, because the cancel button 12 was pressed immediately before the current utterance, the understanding result generation unit 53 checks whether the cancel button 12 had also been pressed immediately before that press, that is, whether the cancel button 12 was pressed twice in succession.

If step S24 finds that the cancel button 12 was pressed twice in succession immediately before the current utterance, the process proceeds to steps S22 and S23 even though the cancel button 12 has just been pressed.

When the cancel button 12 is pressed twice in succession with no utterance in between, the understanding result generation unit 53 judges that the user has performed a reset and disables the processing that would otherwise be executed in response to a press of the cancel button 12.

When the cancel button 12 has been pressed only once rather than twice in succession, the understanding result generation unit 53 derives the optimum understanding result from the current recognition result in the processing from step S25 onward, taking into account the cancel information that was stored in the storage area 33 of the memory 30 when the cancel button 12 was pressed.

When the second user utterance in Dialogue Example 1 is accepted, the cancel button 12 has been pressed only once immediately beforehand, so the processing of steps S25 to S35 is applied.
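The branching in steps S21 and S24 can be sketched as follows. This is a minimal illustration, not the patented implementation: the counter `cancels_before_utterance`, the `Path` enum, and the treatment of more than two presses as a reset are assumptions introduced here.

```python
from enum import Enum


class Path(Enum):
    TOP_CANDIDATE = "S22-S23"  # adopt the first-ranked candidate and clear stored history
    RESCORE = "S25-S35"        # rescore the candidates using the cancel information


def choose_path(cancels_before_utterance: int) -> Path:
    """Pick the branch taken after steps S21 and S24.

    cancels_before_utterance is a hypothetical counter: the number of
    consecutive cancel-button presses since the previous utterance.
    """
    if cancels_before_utterance == 0:
        # S21: no cancel immediately before this utterance.
        return Path.TOP_CANDIDATE
    if cancels_before_utterance >= 2:
        # S24: two presses in a row are treated as a reset by the user.
        # Treating three or more presses the same way is an assumption.
        return Path.TOP_CANDIDATE
    # Exactly one press: use the cancel information.
    return Path.RESCORE
```

In Dialogue Example 1, the first user utterance corresponds to `choose_path(0)` and the second to `choose_path(1)`.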

In step S25, the understanding result generation unit 53 extracts, from the words contained in all recognition result candidates received from the speech recognition unit 52, every word needed to understand the meaning, excluding particles and the like. Using the likelihoods of these words, it calculates the word reliability for the current recognition result according to equation (1) described above.

As an example, FIG. 6 shows the word reliabilities calculated for the words contained in the recognition result candidates for the second user utterance in Dialogue Example 1, "Shinagawa Station on the ○× Railway".

In step S26, after calculating the word reliabilities, the understanding result generation unit 53 checks whether the words contained in past recognition result candidates and their word reliabilities are stored as past recognition results in the storage area 32 of the memory 30.

Here, the understanding result generation unit 53 checks whether the cancel button 12 had also been pressed immediately before the previous speech recognition, so that word reliabilities were calculated at that time.

If the words contained in past recognition result candidates and their word reliabilities, which had been stored in the storage area 32 of the memory 30, have been cleared, for example because the cancel button 12 was not pressed immediately before the previous speech recognition or because it was pressed twice in succession, the understanding result generation unit 53 proceeds to step S27.

If the past recognition results stored in the storage area 32 of the memory 30 remain without having been cleared, the understanding result generation unit 53 takes the stored words and word reliabilities as the words needed for understanding result generation in the recognition result candidates up to the previous time, together with their reliabilities, and proceeds to step S28.

In step S27, because the words contained in past recognition result candidates and their word reliabilities stored in the storage area 32 of the memory 30 have been cleared, the understanding result generation unit 53 calculates the word reliabilities of the words contained in the previous recognition result candidates. These candidates are the ones saved in the storage area 32 after it was cleared, in step S35 described later.

The word reliabilities of the words contained in the previous recognition result candidates are calculated using equation (1) described above, in the same way as the word reliabilities of the words in the current recognition result candidates are calculated in step S25.

The understanding result generation unit 53 then takes these previous recognition result candidates and their word reliabilities as the words needed for understanding result generation in the recognition result candidates up to the previous time, together with their word reliabilities.

During the understanding result generation process for the first user utterance in Dialogue Example 1, "Shinagawa Station on the ○× Railway", the cancel button 12 had not been pressed, so no word reliabilities were calculated. Consequently, now that the cancel button 12 has been pressed by the first user operation, the word reliabilities of the words contained in the recognition result candidates of the first user utterance are newly calculated in this step. FIG. 7 shows the words contained in the recognition result candidates of the first user utterance and their calculated word reliabilities.

In step S28, the understanding result generation unit 53 adjusts the word reliabilities of all words contained in the recognition result candidates up to the previous time obtained above.

In making this adjustment, the understanding result generation unit 53 regards the word reliabilities of words from the recognition result candidates up to the previous time as less trustworthy than those of words from the current recognition result candidates, and lowers all of the former by a fixed ratio.

FIG. 7 also shows the result of adjusting the word reliabilities of the words contained in the recognition result candidates up to the previous time in Dialogue Example 1. Here, each word reliability is reduced to 60% of its original value.
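A minimal sketch of the step S28 adjustment, assuming word reliabilities are held in a dictionary keyed by word. The function name and the example values are hypothetical; only the ratio of 0.6 comes from the description.

```python
def attenuate_past_reliabilities(
    past: dict[str, float], ratio: float = 0.6
) -> dict[str, float]:
    """Scale every earlier word reliability by a fixed ratio (step S28).

    The default ratio of 0.6 matches the 60% used in the description;
    the dictionary representation is an assumption.
    """
    return {word: reliability * ratio for word, reliability in past.items()}


# Hypothetical reliabilities, not the values of FIG. 7.
adjusted = attenuate_past_reliabilities({"word A": 0.5, "word B": 1.0})
```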

In step S29, once the word reliability adjustment is complete, the understanding result generation unit 53 merges the words from the recognition result candidates up to the previous time, as adjusted in step S28, with the words from the current recognition result candidates obtained in step S25, and generates a recognition result word list.

When generating the recognition result word list, the understanding result generation unit 53 sets the word reliability of any word that appears both in the recognition result candidates up to the previous time and in the current recognition result candidates to the sum of the word reliability obtained from the earlier recognition results and the word reliability obtained from the current recognition result. For words that do not overlap, the word reliability of each word is used as is.
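The merge rule of step S29 can be sketched as below, under the assumption that both word lists are dictionaries mapping a word to its reliability. The function name and the example values are hypothetical.

```python
def merge_word_lists(
    past: dict[str, float], current: dict[str, float]
) -> dict[str, float]:
    """Build the recognition result word list (step S29).

    Words present in both inputs get the sum of their reliabilities;
    all other words keep their reliability unchanged.
    """
    merged = dict(past)  # copy so the caller's dictionary is not modified
    for word, reliability in current.items():
        merged[word] = merged.get(word, 0.0) + reliability
    return merged


# Hypothetical values: "word A" overlaps, the other two do not.
merged = merge_word_lists(
    {"word A": 0.3, "word B": 0.2},
    {"word A": 0.5, "word C": 0.1},
)
```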

In step S30, the understanding result generation unit 53 saves the generated recognition result word list in the memory 30 so that the result obtained in step S29 can be used in the next understanding result generation process as "the words contained in the recognition result candidates up to the previous time and their word reliabilities".

FIG. 8 shows the recognition result word list obtained in Dialogue Example 1 by merging the words contained in the recognition result candidates of the first and second user utterances, together with their word reliabilities.

In step S31, the understanding result generation unit 53 assigns new word reliabilities to the words contained in the current recognition result candidates obtained in step S3 of the flowchart shown in FIG. 2, and sums the assigned word reliabilities to obtain a score for selecting the optimum understanding result from the recognition result candidates.

Specifically, the understanding result generation unit 53 first searches the recognition result word list generated in step S29 for each word contained in the current recognition result candidates. It then obtains the word reliability associated with the matching word in the recognition result word list and assigns that reliability to the word in the current recognition result candidate.

After assigning the word reliabilities obtained from the recognition result word list to the words contained in the current recognition result candidates, the understanding result generation unit 53 calculates the score mentioned above by summing the assigned word reliabilities for each recognition result candidate. Hereinafter this score is called the understanding result score, and a recognition result candidate for which an understanding result score has been obtained is called an understanding result candidate.

FIG. 9 shows an example of the understanding result candidates and understanding result scores for the second user utterance in Dialogue Example 1. For example, as shown in FIG. 9, for the understanding result candidate "○× Railway + Kitagawa Station", the recognition result word list in FIG. 8 gives a word reliability of 1.32 for "○× Railway" and 0.74 for "Kitagawa Station", and adding the two yields an understanding result score of 2.06.
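The scoring of step S31 amounts to a lookup followed by a sum. The sketch below reproduces the "○× Railway + Kitagawa Station" example; the function name and the list-of-words representation of a candidate are assumptions, while the reliabilities 1.32 and 0.74 are the values quoted from FIG. 8.

```python
def understanding_score(
    candidate_words: list[str], word_list: dict[str, float]
) -> float:
    """Sum the word-list reliabilities of a candidate's words (step S31).

    Every word of a current candidate is expected to be in the merged
    word list, because that list includes the current candidates' words.
    """
    return sum(word_list[word] for word in candidate_words)


word_list = {"○× Railway": 1.32, "Kitagawa Station": 0.74}
score = understanding_score(["○× Railway", "Kitagawa Station"], word_list)
print(f"{score:.2f}")
```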

In step S32, the understanding result generation unit 53 corrects each understanding result score according to the number of words contained in the understanding result candidate, so that the scores of candidates with different numbers of words can be compared. Various correction methods are conceivable; one is to divide the understanding result score by a correction value that depends on the number of words.

For example, when an understanding result candidate contains two words, its understanding result score is the sum of two word reliabilities. Dividing that score by a correction value of 1.6 makes it comparable with the understanding result score of a candidate containing only one word.

FIG. 9 also shows the corrected score obtained by correcting the understanding result score of each understanding result candidate. Each corrected score in FIG. 9 is the understanding result score divided by the correction value 1.6.
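A minimal sketch of the word-count correction in step S32. The description gives only the correction value 1.6 for two-word candidates; the value 1.0 for one-word candidates and the lookup-table form are assumptions made for illustration.

```python
# Correction value per word count. Only the entry for two words (1.6)
# appears in the description; the entry for one word is an assumption.
CORRECTION_VALUES = {1: 1.0, 2: 1.6}


def corrected_score(score: float, word_count: int) -> float:
    """Divide the understanding result score by a word-count correction value."""
    return score / CORRECTION_VALUES[word_count]


# The two-word candidate with score 2.06 from the example above.
print(corrected_score(2.06, 2))
```

For the score of 2.06 this gives 1.2875, which is consistent with the corrected score of 1.29 quoted for "○× Railway + Kitagawa Station" if FIG. 9 rounds to two decimal places.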

The understanding result candidates, together with their understanding result scores and corrected scores, are collectively called the understanding result candidate list.

In step S33, the understanding result generation unit 53 adjusts the corrected scores of the understanding result candidates according to the understanding results that were cancelled in the past, and obtains adjusted scores.

For example, the understanding result generation unit 53 searches the understanding result candidate list for candidates that match information cancelled in the past, and lowers the corrected score of each matching candidate to obtain its adjusted score. The process of adjusting the corrected score to obtain the adjusted score is described in detail later.

In step S34, the understanding result generation unit 53 selects the understanding result candidate with the highest adjusted score as the optimum understanding result.
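Step S34 is a maximum search over the adjusted scores, sketched below. The function name is hypothetical, and so is the score of 0.80 for the Shinagawa candidate; only the 0.65 for the cancelled Kitagawa candidate is taken from the description. The description does not say how ties are broken; this sketch keeps the first candidate encountered.

```python
def select_understanding(adjusted_scores: dict[str, float]) -> str:
    """Return the candidate with the highest adjusted score (step S34)."""
    return max(adjusted_scores, key=lambda candidate: adjusted_scores[candidate])


best = select_understanding(
    {
        "○× Railway + Kitagawa Station": 0.65,   # value from the description
        "○× Railway + Shinagawa Station": 0.80,  # hypothetical value
    }
)
print(best)
```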

In step S35, the understanding result generation unit 53 saves the current recognition result candidates, obtained by the speech recognition unit 52 in step S3 of the flowchart shown in FIG. 2, in the storage area 32 of the memory 30 and ends the understanding result generation process.

When the process has passed through steps S22 and S23, only the current recognition result candidates are stored in the storage area 32 of the memory 30 as past recognition results. When the process has passed through steps S25 to S34, the current recognition result candidates and their word reliabilities are added to the past recognition results stored in the storage area 32 of the memory 30.

In this way, in the voice interactive device shown as the embodiment of the present invention, when a word contained in a recognition result that was cancelled by pressing the cancel button 12 is included in an understanding result candidate, the candidate is regarded as unlikely to match the voice uttered by the user. Its corrected score is adjusted downward, which reduces the likelihood that it is selected as the final understanding result.

This reduces repeated misrecognition of the voice uttered by the user while leaving open the possibility that a cancelled recognition result is adopted as the understanding result. The recognition rate for the voice uttered by the user can therefore be greatly improved.

Next, several methods of adjusting the corrected score of a cancelled understanding result candidate in step S33 of the flowchart shown in FIG. 5 are described.

(Adjustment of the corrected score: adjustment according to the degree of match)
First, a method is described for adjusting the corrected score according to the degree of match between a past understanding result stored as cancel information in the storage area 33 of the memory 30 and an understanding result candidate in the understanding result candidate list.

The understanding result generation unit 53 reads the past understanding result stored as cancel information in the storage area 33 of the memory 30 and, according to its degree of match with each understanding result candidate in the understanding result candidate list, selects an adjustment coefficient greater than 0 and less than 1 (0 < adjustment coefficient < 1). The adjustment is performed by multiplying the corrected score by the selected coefficient.

The understanding result generation unit 53 compares each current understanding result candidate with the understanding result at the time of the past cancellation and distinguishes, for example, three cases according to the degree of match: the two match completely ("exact match"); the understanding result candidate is contained in the cancelled understanding result ("correction ⊃ understanding result"); and the cancelled understanding result is contained in the understanding result candidate ("correction ⊂ understanding result"). A different adjustment coefficient is selected for each case.

This is based on the following reasoning. Content that matches the previously cancelled understanding exactly ("exact match") is the least likely to be input again. Content that matches part of the previously cancelled understanding ("correction ⊃ understanding result") is the next least likely. Content that contains all of the previously cancelled understanding ("correction ⊂ understanding result") is the most likely of the three.

For example, as shown in FIG. 10, different adjustment coefficients are used according to the degree of match between the current understanding result candidate and the understanding result at the time of the past cancellation. FIG. 11 shows examples from Dialogue Example 1 of combinations of the understanding result at the time of cancellation and the current understanding result that yield "exact match", "correction ⊃ understanding result", and "correction ⊂ understanding result".

Next, the processing for adjusting the corrected score of a cancelled understanding result candidate in step S33 of the flowchart shown in FIG. 5 is described using the flowchart shown in FIG. 12.

First, in step S41, the understanding result generation unit 53 takes one understanding result candidate from the understanding result candidate list.

In step S42, the understanding result generation unit 53 reads the past understanding result stored as cancel information in the storage area 33 of the memory 30 and checks whether it matches the understanding result candidate taken from the list exactly. If they match, the process proceeds to step S43; if not, it proceeds to step S44.

In step S43, because the understanding result candidate taken from the list and the past understanding result match exactly, the understanding result generation unit 53 multiplies the corrected score of this candidate by the adjustment coefficient a for the "exact match" case to calculate the adjusted score.

As shown in FIG. 11, if in Dialogue Example 1 the understanding result at the time of cancellation is "○× Railway + Kitagawa Station" and the current understanding result candidate is also "○× Railway + Kitagawa Station", the candidate is regarded as an "exact match" with the understanding result at the time of cancellation.

In the "exact match" case, 0.5 is selected as the adjustment coefficient, as shown in FIG. 10. The adjusted score is therefore obtained by multiplying the corrected score of the current understanding result candidate "○× Railway + Kitagawa Station" by the adjustment coefficient 0.5.

In Dialogue Example 1, if "Kitagawa Station" exists only as a subcategory of "○× Railway", then "○× Railway + Kitagawa Station" and "Kitagawa Station" have the same meaning and are also treated as an "exact match".

In Dialogue Example 1, the understanding result that was cancelled by the first user operation is "○× Railway + Kitagawa Station". One of the current understanding result candidates shown in FIG. 9, "○× Railway + Kitagawa Station", matches that cancelled understanding result exactly. The adjusted score of "○× Railway + Kitagawa Station" is therefore 0.65, obtained by multiplying the corrected score of 1.29 by the adjustment coefficient 0.5 for cancellation. As shown in FIG. 9, the adjusted score is recorded in the understanding result candidate list.
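Collecting the figures quoted in the description for the candidate "○× Railway + Kitagawa Station", the chain from word reliabilities to the final score can be written out as follows. The intermediate value 1.2875 and the rounding to two decimal places are inferred from the quoted numbers rather than stated in the description.

```latex
\begin{align*}
\text{understanding result score} &= 1.32 + 0.74 = 2.06 \\
\text{corrected score} &= 2.06 / 1.6 = 1.2875 \approx 1.29 \\
\text{adjusted score} &= 1.29 \times 0.5 = 0.645 \approx 0.65
\end{align*}
```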

In step S44, the understanding result generation unit 53 checks whether the understanding result at the time of cancellation contains all of the understanding result candidate taken from the understanding result candidate list ("correction ⊃ understanding result").

In step S45, because the understanding result at the time of cancellation contains all of the understanding result candidate taken from the list, the understanding result generation unit 53 multiplies the corrected score of this candidate by the adjustment coefficient b for the "correction ⊃ understanding result" case to calculate the adjusted score.

As shown in FIG. 11, if the understanding result at the time of cancellation is "○× Railway + Kitagawa Station" and the current understanding result candidate is "○× Railway", the cancelled understanding result contains all of the candidate, so the case is regarded as "correction ⊃ understanding result".

In the "correction ⊃ understanding result" case, 0.7 is selected as the adjustment coefficient, as shown in FIG. 10. The adjusted score is therefore obtained by multiplying the corrected score of the current understanding result candidate "○× Railway" by the adjustment coefficient 0.7.

In step S46, the understanding result generation unit 53 checks whether the understanding result candidate taken from the understanding result candidate list contains the whole of the understanding result at the time of cancellation ("correction ⊂ understanding result").

In step S47, when the candidate contains the whole of the canceled understanding result, the understanding result generation unit 53 multiplies the candidate's corrected score by the correction coefficient c defined for the "correction ⊂ understanding result" case to calculate the adjusted score.

As shown in FIG. 11, when the understanding result at the time of cancellation is "○× Railway" and the current understanding result candidate is "○× Railway + Kitakawa Station", the candidate contains the whole of the canceled understanding result, so the case is treated as "correction ⊂ understanding result".

In the "correction ⊂ understanding result" case, 0.9 is selected as the correction coefficient, as shown in FIG. 10. The corrected score of the current candidate "○× Railway + Kitakawa Station" is multiplied by 0.9 to obtain the adjusted score.

In step S48, the understanding result generation unit 53 ends the score adjustment process once the degree of coincidence with the canceled understanding result has been checked for every understanding result candidate.

Among the candidates in the understanding result candidate list shown in FIG. 9, only "○× Railway + Kitakawa Station" matches the canceled understanding result shown in FIG. 11. The corrected scores of the other candidates are therefore left unchanged and become their adjusted scores as they are.
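The coincidence checks of steps S44 to S47 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each understanding result can be represented as a set of words, and the names `match_coefficient` and `adjusted_score` are hypothetical. The coefficients 0.5, 0.7 and 0.9 are the values quoted above for FIG. 10.

```python
from typing import FrozenSet

# Coefficients quoted in the text for FIG. 10.
COEFF_EXACT = 0.5               # candidate identical to the canceled result
COEFF_CANCELED_SUPERSET = 0.7   # "correction ⊃ understanding result"
COEFF_CANCELED_SUBSET = 0.9     # "correction ⊂ understanding result"


def match_coefficient(canceled: FrozenSet[str], candidate: FrozenSet[str]) -> float:
    """Return the correction coefficient for one candidate.

    Representing an understanding result as a set of words is an
    assumption made for this sketch.
    """
    if canceled == candidate:
        return COEFF_EXACT
    if canceled > candidate:   # canceled result contains the whole candidate
        return COEFF_CANCELED_SUPERSET
    if canceled < candidate:   # candidate contains the whole canceled result
        return COEFF_CANCELED_SUBSET
    return 1.0                 # no match: the corrected score is kept as is


def adjusted_score(corrected: float, canceled: FrozenSet[str],
                   candidate: FrozenSet[str]) -> float:
    """Multiply the corrected score by the coefficient for this candidate."""
    return corrected * match_coefficient(canceled, candidate)
```

With the canceled result {"OX Railway", "Kitakawa Station"}, an identical candidate with corrected score 1.29 is scaled by 0.5, while a candidate such as {"OX Railway", "Shinagawa Station"} only partially overlaps and keeps its corrected score.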

As a result, among the adjusted scores shown in FIG. 9, the order of the candidates "○× Railway + Kitakawa Station" and "○× Railway + Shinagawa Station" is reversed, and "○× Railway + Shinagawa Station" now has the highest adjusted score. Accordingly, in step S34 of the flowchart shown in FIG. 5, the understanding result generation unit 53 selects "○× Railway + Shinagawa Station", the candidate with the highest adjusted score, as the understanding result.

Thus, even though the speech recognition unit 52 made the same recognition error for both the first and the second user utterance in Dialogue Example 1, the fourth system utterance can give the correct response, "Is Shinagawa Station on the ○× Railway correct?"

By varying the correction coefficient according to the degree of coincidence between the current understanding result candidate and the canceled understanding result in this way, the canceled understanding result is properly reflected in the correction coefficient, and the corrected score can be adjusted accurately.

In the example above, three levels of correction coefficient were prepared according to the degree of coincidence between the current candidate and the canceled understanding result. The canceled understanding result can also be reflected in the correction coefficient from viewpoints other than the degree of coincidence.

(Adjustment of the corrected score: adjustment according to the number of times an understanding result has been canceled)
First, a method of adjusting the corrected score according to the number of cancellations of understanding results is described.

For example, when understanding results have been canceled several times, the correction coefficient can be varied according to how many cancellations ago the understanding result in question was canceled. This method can also be combined with the method described above, in which the correction coefficient varies with the degree of coincidence.

FIG. 13 shows an example of correction coefficients when, in addition to the degree of coincidence between the understanding result candidate and the canceled understanding result, a second parameter is added: how many cancellations ago the understanding result was canceled.

For example, with the correction coefficients shown in FIG. 13, if the current understanding result candidate exactly matches the understanding result that was canceled by the most recent press of the cancel button 12, the corrected score is multiplied by 0.5. If it exactly matches the understanding result canceled two presses ago, the corrected score is multiplied by 0.6. If it exactly matches the understanding result canceled three presses ago, the corrected score is multiplied by 0.7.

In this way, the older the information that an understanding result "was canceled", that is, the weaker the influence of that cancellation on the current understanding result, the larger the correction coefficient d is made within the range 0 < d < 1, so that the corrected score is lowered by a smaller proportion.

Specifically, the storage area 33 of the memory 30 stores, together with each past understanding result saved as cancellation information, a count that is incremented each time another understanding result is canceled.

When an understanding result candidate in the understanding result candidate list is compared with a canceled understanding result, this cancellation count attached to the canceled understanding result is consulted. A high count means the canceled result has little influence on the candidate, and a low count means it has a large influence.

By varying the correction coefficient accordingly, the canceled understanding result is reflected in the correction coefficient, and the corrected score can be adjusted accurately.
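The bookkeeping described above can be sketched as follows. This is only an illustration under stated assumptions: the class `CancelHistory` and its methods are hypothetical names, only the exact-match coefficients quoted for FIG. 13 (0.5, 0.6, 0.7) are used, and treating cancellations older than three as having no effect is an assumption, since the text does not give those values.

```python
# Exact-match coefficients quoted in the text for FIG. 13, keyed by how many
# cancellations ago the result was canceled (1 = most recent).
EXACT_MATCH_COEFF = {1: 0.5, 2: 0.6, 3: 0.7}


class CancelHistory:
    """Canceled understanding results, each with a 'cancellations ago' count."""

    def __init__(self):
        self._entries = []  # each entry is [canceled_result, cancellations_ago]

    def record_cancel(self, result):
        # Every stored result becomes one cancellation older.
        for entry in self._entries:
            entry[1] += 1
        self._entries.append([result, 1])

    def coefficient(self, candidate):
        """Coefficient for a candidate that exactly matches a canceled result."""
        coeff = 1.0
        for result, ago in self._entries:
            if result == candidate:
                # Assumption: cancellations older than three have no effect,
                # and the strongest applicable reduction wins.
                coeff = min(coeff, EXACT_MATCH_COEFF.get(ago, 1.0))
        return coeff
```

After canceling "Kitakawa Station" and then "Tachikawa Station", a candidate "Kitakawa Station" is scaled by 0.6 and a candidate "Tachikawa Station" by 0.5.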

(Adjustment of the corrected score: adjustment according to the time elapsed since the understanding result was canceled)
Next, a method is described for adjusting the corrected score according to the time from the input of the canceled information to the input of the utterance currently undergoing understanding result generation.

This method is realized by varying the correction coefficient according to the time from the input of the canceled information to the input of the utterance currently undergoing understanding result generation.

This is explained using the timing chart of FIG. 14, which shows the timing of the user's utterances and the system's responses. In this timing chart, the user says "Shinagawa Station" at time T1. The system misrecognizes this as "Kitakawa Station" and at time T2 responds, "Is Kitakawa Station correct?"

The user therefore presses the cancel button 12 at time T3 and re-enters "Shinagawa Station on the ○× Railway" at time T5. The system recognizes the input speech using the recognition results so far and the cancellation information, but misrecognizes it again and at time T6 responds, "Is Tachikawa Station on the ○× Railway correct?"

The user presses the cancel button 12 again at time T7 and enters "Shinagawa Station" once more at time T9. This time the system derives the optimal understanding result from the recognition results so far and the cancellation information, and at time T10 responds, "Is Shinagawa Station correct?"

In the dialogue of FIG. 14, for the user utterance input at time T9, the understanding result generation unit 53 must determine whether each current understanding result candidate matches the understanding result "Kitakawa Station" canceled at time T3 or the understanding result "Tachikawa Station on the ○× Railway" canceled at time T7, and adjust the corrected score accordingly.

In doing so, the understanding result generation unit 53 varies the correction coefficient according to the time from the input of the canceled utterance to the input of the utterance for which the understanding result is currently being generated.

In the example of FIG. 14, there are two such intervals. Time Tx1 runs from time T1, when input began of the utterance that produced the understanding result "Kitakawa Station" canceled at time T3, to time T9, when input began of the utterance currently being processed. Time Tx2 runs from time T5, when input began of the utterance that produced the understanding result "Tachikawa Station on the ○× Railway" canceled at time T7, to time T9.

The understanding result generation unit 53 increases the correction coefficient d, within the range 0 < d < 1, in proportion to the time from the input of the canceled utterance to the input of the utterance currently being processed. In other words, the longer this time, the larger the correction coefficient and the smaller the influence of the cancellation.

Here, letting d be the correction coefficient and T the time (in seconds) from the input of the canceled utterance to the start of the current input, the correction coefficient can be expressed by equation (3) below.

d = 0.02 × T   (3)

Taking Tx1 and Tx2 above to be 40 seconds and 20 seconds respectively and applying equation (3) to the dialogue example of FIG. 14, the correction coefficients d1 and d2 corresponding to Tx1 and Tx2 are obtained as follows.

d1 = 0.02 × 40 = 0.8
d2 = 0.02 × 20 = 0.4

When the adjusted score is calculated, the corrected score is multiplied by this correction coefficient, so the smaller the correction coefficient, the more strongly the corrected score is reduced.
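The worked values d1 and d2 can be reproduced with a short sketch. The slope 0.02 per second is taken from the calculation above; the function name `time_coefficient` and the clamping bounds 0.01 and 0.99 are assumptions added only to keep d strictly inside the range 0 < d < 1 stated in the text.

```python
RATE_PER_SECOND = 0.02  # slope used in the worked values d1 and d2


def time_coefficient(elapsed_seconds: float) -> float:
    """Correction coefficient that grows with time since the canceled utterance."""
    d = RATE_PER_SECOND * elapsed_seconds
    # Assumption: clamp so that 0 < d < 1 always holds; the bounds are
    # illustrative and do not come from the source.
    return min(max(d, 0.01), 0.99)


d1 = time_coefficient(40)  # Tx1: "Kitakawa Station", canceled earlier
d2 = time_coefficient(20)  # Tx2: "Tachikawa Station", canceled more recently
```

Because d2 is smaller than d1, a candidate matching the more recently canceled result has its corrected score reduced more strongly.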

Therefore, the corrected score is reduced more strongly when the current understanding result candidate matches "Tachikawa Station on the ○× Railway", the understanding result obtained from the utterance input at time T5, than when it matches "Kitakawa Station", the understanding result obtained from the utterance input at time T1.

FIG. 15 shows the understanding result candidates for the third user utterance in the dialogue example of FIG. 14, together with their understanding result scores, corrected scores, correction coefficients, and adjusted scores.

Of the candidates shown in FIG. 15, the first candidate "○× Railway + Tachikawa Station" and the second candidate "Kitakawa Station" are misrecognitions, as the dialogue example of FIG. 14 shows. The third candidate, "Shinagawa Station", is the correct result that matches the utterance input by the user.

The understanding result scores and corrected scores in FIG. 15 are the values obtained in steps S31 and S32, respectively, of the flowchart of FIG. 5. For the candidates in FIG. 15 that were canceled in the past, "○× Railway + Tachikawa Station" and "Kitakawa Station", the adjusted scores are obtained by multiplying the corrected scores by the correction coefficients d2 = 0.4 and d1 = 0.8, respectively.

As FIG. 15 shows, the corrected score (the score before adjustment) of "○× Railway + Tachikawa Station" is larger than that of "Kitakawa Station". After adjustment, however, the most recently canceled "○× Railway + Tachikawa Station" has a smaller adjusted score than "Kitakawa Station". Moreover, the adjusted score of "Shinagawa Station", which was not adjusted because it was never canceled, is larger than the adjusted scores of the other candidates, so "Shinagawa Station" is selected as the final, optimal understanding result.

By varying the correction coefficient according to the time from the input of the utterance that led to the canceled understanding result to the input of the utterance currently undergoing understanding result generation, the canceled understanding result is properly reflected in the correction coefficient, and the corrected score can be adjusted accurately.

(Adjustment of the corrected score: adjustment according to the task sequence)
The correction coefficient can also be varied according to where the current system state lies in the task sequence.

For example, when a destination is set by searching for a facility through a hierarchical tree structure, the user may select an upper-level item and check its contents, repeat this for several upper-level items, and only then select the final item.

Suppose, for instance, that the user is choosing a restaurant near the current location and the selectable genres are "Japanese" and "Chinese". The user may first select "Japanese" and check its contents, then cancel "Japanese" and select "Chinese", check the contents of "Chinese", and finally cancel "Chinese" as well and select "Japanese" again.

In this case "Japanese" was canceled once, but it was not a misrecognition. When an intermediate item of a hierarchical tree structure was canceled in the past, the system must therefore allow for the possibility that the item was canceled even though it was recognized correctly.

At the end of the tree structure, however, for example when a shop name such as "○○ Sushi" is selected, it is very unlikely that a correctly recognized item is canceled after being selected. A cancellation at this level should therefore be regarded as very likely indicating a misrecognition.

That is, when the voice interactive device executes a task sequence that requests utterances step by step to accomplish a desired task, the correction coefficient d used when an end item of the tree structure is canceled is made smaller, within the range 0 < d < 1, than the coefficient used when an intermediate item is canceled. The adjustment is thus stronger, making the canceled end item less likely to be selected again.

In this way, the canceled understanding result is properly reflected in the correction coefficient, and the corrected score can be adjusted accurately.

For example, in a task sequence that accomplishes a task by navigating a tree-structured menu, the influence of a cancellation can be made small when the user is selecting an upper-level item that has lower-level items beneath it, and large when the user is selecting an end item. Consequently, when the user checks which lower-level items lie under each upper-level item before choosing one, an upper-level item that was canceled once can still be recognized easily the next time.
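The distinction between end items and intermediate items of the menu tree can be sketched as follows. The text gives no numerical coefficients for this case, so the values 0.5 and 0.9, the `MenuItem` class, and the function `task_coefficient` are all hypothetical; the sketch only preserves the stated ordering (end item coefficient smaller than intermediate item coefficient, both in 0 < d < 1).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MenuItem:
    """One node of a tree-structured menu (hypothetical representation)."""
    name: str
    children: List["MenuItem"] = field(default_factory=list)

    @property
    def is_leaf(self) -> bool:
        # An end item has no lower-level items beneath it.
        return not self.children


# Hypothetical values; the source only requires LEAF_COEFF < BRANCH_COEFF.
LEAF_COEFF = 0.5     # canceled end item: probably a misrecognition
BRANCH_COEFF = 0.9   # canceled intermediate item: may have been correct


def task_coefficient(canceled_item: MenuItem) -> float:
    """Pick the correction coefficient from the item's position in the tree."""
    return LEAF_COEFF if canceled_item.is_leaf else BRANCH_COEFF


sushi = MenuItem("Sushi shop")
japanese = MenuItem("Japanese", [sushi])
```

Here a canceled "Japanese" genre keeps most of its score, while a canceled shop name is penalized more heavily.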

(Adjustment of the corrected score: adjustment according to the time taken until the cancel operation)
The correction coefficient can also be modified according to the time from the input of the uttered speech, through the generation of the understanding result for that speech, until the cancel operation is performed.

Specifically, when the time from the input of the canceled utterance to the cancel operation is short, the correction coefficient is modified so that the canceled understanding result has a larger influence. When that time is long, the correction coefficient is modified so that the canceled understanding result has a smaller influence.

As a result, in a tree-structured menu such as the one described above, where the user checks which lower-level items lie under each upper-level item before selecting one, an upper-level item that the user cancels immediately after the understanding result is output, without showing interest in its lower-level items, becomes less likely to be recognized the next time. Conversely, if the user spends time on it, for example by displaying the lower-level items for a long time to check their contents, the upper-level item becomes more likely to be recognized the next time.
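One possible way to turn the time until the cancel operation into a coefficient is a linear ramp, sketched below. The source states only the direction of the effect (a quick cancel gives a stronger reduction); the function name `cancel_delay_coefficient`, the linear shape, the 10-second window, and the bounds 0.5 and 0.9 are all assumptions made for illustration.

```python
def cancel_delay_coefficient(delay_seconds: float,
                             strong: float = 0.5,
                             weak: float = 0.9,
                             window: float = 10.0) -> float:
    """Map the delay before the cancel operation to a correction coefficient.

    A short delay yields a coefficient near `strong` (large reduction);
    a delay of `window` seconds or more yields `weak` (small reduction).
    All default values are hypothetical.
    """
    ratio = min(max(delay_seconds / window, 0.0), 1.0)
    return strong + (weak - strong) * ratio
```

With these defaults, an item canceled after 2 seconds is penalized more heavily than one canceled after 8 seconds of browsing its lower-level items.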

(Adjustment of the corrected score: adjustment according to the method used to input the cancel instruction)
In the voice interactive device described as the embodiment of the present invention, a speech recognition result is canceled by pressing the cancel button 12. The cancel instruction may instead be given by voice input, for example by a negative word. The device may also accept the cancel instruction by pressing the cancel button 12, by voice input, and by yet another method.

When the device provides several means of signaling that a speech recognition result is to be canceled, the correction coefficient described above can be varied according to which means was used.

For example, misrecognition of voice input occurs more often than erroneous input caused by pressing the wrong switch, so a cancel instruction given by voice input is considered less reliable.

Therefore, when the cancel instruction is given through speech recognition of voice input, the correction coefficient d is made larger, within the range 0 < d < 1, than when the cancel instruction is given by pressing the cancel button 12, so that the corrected score is reduced by a smaller proportion. In this way even uncertain cancellation information obtained by voice input can be used in generating the understanding result.
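The dependence on the cancel input method can be sketched as a simple lookup. The numerical values are not given in the text; 0.5 for the button and 0.7 for voice are hypothetical and chosen only to respect the stated ordering (voice coefficient larger than button coefficient, both in 0 < d < 1). The enum `CancelSource` and the function `source_coefficient` are likewise illustrative names.

```python
from enum import Enum


class CancelSource(Enum):
    """How the cancel instruction was given."""
    BUTTON = "button"  # cancel button 12 pressed
    VOICE = "voice"    # negative word recognized from speech


# Hypothetical values: a voice cancel is less reliable, so it reduces
# the corrected score less than a button press does.
SOURCE_COEFF = {
    CancelSource.BUTTON: 0.5,
    CancelSource.VOICE: 0.7,
}


def source_coefficient(source: CancelSource) -> float:
    """Correction coefficient for a cancellation given by `source`."""
    return SOURCE_COEFF[source]
```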

Whether the adjustment methods for the corrected score described above are used independently or in any combination, the canceled understanding result can be accurately reflected in the correction coefficient and the corrected score can be adjusted accurately. The speech uttered by the user is therefore recognized at a high recognition rate, and a more accurate understanding result can be generated.

In the description above, the word reliability of each word contained in the recognition result candidates is obtained, and the understanding result score, which serves as the criterion for selecting the optimal understanding result from the understanding result candidates, is calculated from this word reliability. When an understanding result is canceled, the corrected score, obtained by correcting the understanding result score on the basis of the number of words in the understanding result candidate, is adjusted according to the canceled understanding result, so that the influence of the cancellation is reflected in the understanding result finally selected.

In the present invention, it is not essential to calculate an understanding result score as the reference value for selecting the final understanding result from the recognition result candidates produced by the speech recognition unit 52. The understanding result may instead be obtained using the word reliability of the words contained in the recognition result candidates, or the acoustic likelihood of the recognition result candidates, as the reference value for selection.

In such cases, the influence of the canceled understanding result is reflected in the finally selected understanding result by adjusting the word reliability or the acoustic likelihood rather than the understanding result score.

Specifically, the word reliability obtained for each word contained in the recognition result candidates when the understanding result score was calculated is adjusted, according to the canceled understanding result, in the direction that makes the word less likely to be selected as the final understanding result. The word with the highest word reliability after this adjustment is then selected as the final understanding result.

In the case of acoustic likelihood, the likelihood of each recognition result candidate obtained by the speech recognition processing of the speech recognition unit 52 is adjusted, according to the canceled understanding result, in the direction that makes the candidate less likely to be selected as the final understanding result. The recognition result candidate with the highest acoustic likelihood after this adjustment is then selected as the final understanding result.

Whether the understanding result is generated from the word reliability or from the acoustic likelihood, all of the methods used in step S33 of the flowchart of FIG. 5 to adjust the understanding result score of a canceled understanding result candidate can be applied. Adjusting the word reliability or the acoustic likelihood with these methods allows the canceled understanding result to be reflected more accurately in the final understanding result.

However, obtaining the word reliability, and further the understanding result score, has the advantage that the likelihood that each word was actually uttered can be judged by a single common criterion, so that a more accurate understanding result can be generated.

The embodiment described above is one example of the present invention. The present invention is therefore not limited to this embodiment, and various modifications can of course be made according to the design and other factors, provided they do not depart from the technical idea of the present invention.

FIG. 1 is a diagram for explaining the configuration of the voice interactive device shown as an embodiment of the present invention.
FIG. 2 is a flowchart for explaining the processing operation of the voice interactive device from the start of speech recognition processing to the output of a response sentence.
FIG. 3 is a diagram showing an example of recognition result candidates and their likelihoods.
FIG. 4 is a diagram showing an example of recognition result candidates and their likelihoods.
FIG. 5 is a flowchart for explaining the understanding result generation process.
FIG. 6 is a diagram showing an example of words contained in the recognition result candidates and their word reliabilities.
FIG. 7 is a diagram showing an example of words contained in the recognition result candidates, their word reliabilities, and their corrected word reliabilities.
FIG. 8 is a diagram showing an example of a recognition result word list.
FIG. 9 is a diagram showing an example of understanding result candidates and their understanding result scores, corrected scores, and adjusted scores.
FIG. 10 is a diagram showing the relationship between the degree of coincidence and the correction coefficient.
FIG. 11 is a diagram showing an example of the degree of coincidence between the understanding result at the time of cancellation and the current understanding result.
FIG. 12 is a flowchart for explaining the processing operation that adjusts the corrected score of a canceled understanding result candidate according to the degree of coincidence.
FIG. 13 is a diagram showing correction coefficients that vary with the degree of coincidence and with how many cancellations ago the result was canceled.
FIG. 14 is a diagram for explaining adjustment of the corrected score according to time.
FIG. 15 is a diagram showing an example of understanding result candidates based on the dialogue example of FIG. 14, together with their understanding result scores, corrected scores, correction coefficients, and adjusted scores.

Explanation of symbols

10 Input device
12 Cancel button
20 Microphone
30 Memory
50 Control device
51 Input control unit
52 Speech recognition unit
53 Understanding result generation unit
54 Dialogue control unit

Claims

1. A spoken dialogue apparatus comprising:
input means for inputting uttered speech;
speech recognition means for recognizing the speech input by the input means on the basis of recognition target words;
understanding result generation means for generating an understanding result, which serves as a response to the uttered speech, using a recognition result candidate selected from a plurality of recognition result candidates, which are the recognition results of the speech recognition means, on the basis of a predetermined selection reference value given to each recognition result candidate;
correction instruction means for instructing correction of the understanding result generated by the understanding result generation means; and
correction means for correcting the predetermined selection reference value given to the recognition result candidate corresponding to the understanding result for which correction was instructed by the correction instruction means, in a direction that makes that recognition result candidate less likely to be selected when the understanding result generation means generates the understanding result.

2. The spoken dialogue apparatus according to claim 1, wherein the correction means corrects the predetermined selection reference value of a recognition result candidate according to the acoustic degree of coincidence between the understanding result for which correction was instructed by the correction instruction means and the plurality of recognition result candidates that are the recognition results of the speech recognition means.

3. The spoken dialogue apparatus according to claim 1, wherein the correction means corrects the predetermined selection reference value of the recognition result candidate corresponding to the understanding result for which correction was instructed by the correction instruction means, according to how many times ago that correction was instructed.

4. The spoken dialogue apparatus according to claim 1, wherein the correction means corrects the predetermined selection reference value of the recognition result candidate corresponding to the understanding result for which correction was instructed by the correction instruction means, according to the time that elapses from when the speech that led to that understanding result was uttered and input to the input means until the speech currently being processed to generate an understanding result is uttered and input to the input means.

5. The spoken dialogue apparatus according to claim 1, wherein, when the spoken dialogue apparatus executes a task sequence that requests utterances step by step in order to accomplish a desired task,
the correction means corrects the predetermined selection reference value of the recognition result candidate corresponding to the understanding result for which correction was instructed by the correction instruction means, according to the stage of the task sequence at which the correction was instructed.

6. The spoken dialogue apparatus according to claim 1, wherein the correction means corrects the predetermined selection reference value of the recognition result candidate corresponding to the understanding result for which correction was instructed by the correction instruction means, according to the time required from when speech is uttered and input to the input means until correction of the understanding result generated by the speech recognition means and the understanding result generation means is instructed by the correction instruction means.

7. The spoken dialogue apparatus according to claim 1, comprising a plurality of the correction instruction means that differ in how the user inputs the correction instruction,
wherein the correction means corrects the predetermined selection reference value of the recognition result candidate corresponding to the understanding result for which correction was instructed by the correction instruction means, according to the method by which the user input the correction instruction to the correction instruction means.

8. The spoken dialogue apparatus according to any one of claims 1 to 7, further comprising word reliability calculation means for calculating a word reliability that indicates the likelihood that a word included in a recognition result candidate was uttered and that serves as the predetermined selection reference value,
wherein the understanding result generation means generates the understanding result, which serves as a response to the uttered speech, using words that are selected, on the basis of the word reliability calculated by the word reliability calculation means, from the words included in the plurality of recognition result candidates that are the recognition results of the speech recognition means, and
the correction means corrects the word reliability of a word included in the recognition result candidate corresponding to the understanding result for which correction was instructed by the correction instruction means, in a direction that makes that word less likely to be selected when the understanding result generation means generates the understanding result.

9. The spoken dialogue apparatus according to claim 8, further comprising score calculation means for giving a new word reliability to each word included in a recognition result candidate on the basis of the word reliability calculated by the word reliability calculation means and the word reliability given to words included in past recognition result candidates, and for calculating, from the newly given word reliability, a score of the recognition result candidate that serves as the predetermined selection reference value,
wherein the understanding result generation means generates the understanding result, which serves as a response to the uttered speech, using a recognition result candidate selected, on the basis of the score calculated by the score calculation means, from the plurality of recognition result candidates that are the recognition results of the speech recognition means, and
the correction means corrects the score of the recognition result candidate corresponding to the understanding result for which correction was instructed by the correction instruction means, in a direction that makes that recognition result candidate less likely to be selected when the understanding result generation means generates the understanding result.

10. The spoken dialogue apparatus according to any one of claims 1 to 7, further comprising likelihood calculation means for calculating a likelihood that indicates the acoustic closeness of a recognition result candidate to the recognition target words,
wherein the understanding result generation means generates the understanding result, which serves as a response to the uttered speech, using a recognition result candidate selected, on the basis of the likelihood calculated by the likelihood calculation means, from the plurality of recognition result candidates that are the recognition results of the speech recognition means, and
the correction means corrects the likelihood of the recognition result candidate corresponding to the understanding result for which correction was instructed by the correction instruction means, in a direction that makes that recognition result candidate less likely to be selected when the understanding result generation means generates the understanding result.

11. A speech understanding result generation method comprising:
an input step of inputting uttered speech;
a speech recognition step of recognizing the speech input in the input step on the basis of recognition target words;
an understanding result generation step of generating an understanding result, which serves as a response to the uttered speech, using a recognition result candidate selected from a plurality of recognition result candidates, which are the recognition results of the speech recognition step, on the basis of a predetermined selection reference value given to each recognition result candidate;
a correction instruction step of instructing correction of the understanding result generated in the understanding result generation step; and
a correction step of correcting the predetermined selection reference value given to the recognition result candidate corresponding to the understanding result for which correction was instructed in the correction instruction step, in a direction that makes that recognition result candidate less likely to be selected when the understanding result is generated in the understanding result generation step.
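
As a non-limiting illustration of the correction recited in the claims above, the following Python sketch lowers the selection reference value of a candidate whose understanding result was canceled, so that it becomes less likely, but not impossible, to be selected again. The multiplicative form of the correction coefficient, its linear dependence on the degree of coincidence, the constant 0.5, the recency weighting, and all names and sample words are assumptions made for illustration only and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    score: float  # selection reference value; higher means more likely to be chosen


def correction_coefficient(coincidence: float, cancels_ago: int) -> float:
    """Hypothetical mapping, assumed for this sketch only.

    coincidence: acoustic degree of coincidence in [0, 1] between the
        canceled understanding result and the current utterance.
    cancels_ago: 1 if the cancel was issued for the previous utterance,
        2 for the one before that, and so on.
    A higher coincidence and a more recent cancel give a smaller
    coefficient, i.e. a stronger penalty.
    """
    recency = 1.0 / cancels_ago
    return 1.0 - 0.5 * coincidence * recency


def select(candidates, cancelled_text=None, coincidence=0.0, cancels_ago=1):
    """Return the best candidate after penalizing the canceled one."""
    corrected = []
    for c in candidates:
        score = c.score
        if c.text == cancelled_text:
            # Correct the score in the direction that makes re-selection harder.
            score *= correction_coefficient(coincidence, cancels_ago)
        corrected.append(Candidate(c.text, score))
    return max(corrected, key=lambda c: c.score)


# Hypothetical recognition result candidates for one utterance.
candidates = [Candidate("word A", 0.80), Candidate("word B", 0.70)]
first = select(candidates)
retry = select(candidates, cancelled_text="word A", coincidence=1.0)
weak = select(candidates, cancelled_text="word A", coincidence=0.1)
```

In this sketch, `first` is the top-scoring candidate, `retry` switches to the other candidate because the canceled one is penalized strongly, and `weak` keeps the canceled candidate because a low degree of coincidence yields only a small penalty, which leaves the canceled candidate a chance of being adopted.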