JP2021189348A

JP2021189348A - Voice dialogue device, voice dialogue method, and voice dialogue program

Info

Publication number: JP2021189348A
Application number: JP2020096174A
Authority: JP
Inventors: 尚和内田; Hisakazu Uchida
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2020-06-02
Filing date: 2020-06-02
Publication date: 2021-12-13
Anticipated expiration: 2040-06-02
Also published as: JP7471921B2

Abstract

To provide a voice dialogue device, a voice dialogue method, and a voice dialogue program that make dialogue with a user smoother. SOLUTION: A voice dialogue device 300 includes a processor that executes a program and a storage device that stores the program. The functional configuration realized by the processor includes a reliability acquisition unit that acquires the reliability of each word constituting a word string related to an uttered voice, and an ask-back sentence generation unit that generates, based on the acquired reliability, an ask-back sentence for asking back to the utterance source of the uttered voice. SELECTED DRAWING: Figure 4

Description

The present invention relates to a voice dialogue device, a voice dialogue method, and a voice dialogue program for performing voice dialogue.

Communication robots capable of voice dialogue with humans receive erroneous voice input more often than text-input dialogue systems do. Such erroneous input takes various forms, for example voice activity detection errors, speech recognition errors, recognition of irrelevant speech, and utterances broken up by hesitations or mid-utterance self-corrections. Therefore, when such voice input is received, dialogue with a human cannot be established unless the communication robot determines whether the input voice is normal and performs appropriate exception handling.

Patent Document 1 and Patent Document 2 describe voice dialogue techniques of this kind. Patent Document 1 discloses a robot that notifies a subject, by voice, of the voice data corresponding to a phrase whose speech recognition accuracy fell below a predetermined value. This robot acquires an utterance from the subject and, from the voice data of the acquired utterance, outputs to the subject as audio the voice data corresponding to the phrase whose speech recognition accuracy fell below the predetermined value.

Patent Document 2 discloses a dialogue control device that controls a dialogue device so that it performs the kind of confirmation a human would perform, thereby reducing erroneous responses by the dialogue device. This dialogue control device includes a scenario storage unit and a scenario selection unit. The scenario storage unit stores a talk-initiation scenario, in which the dialogue device outputs a voice that triggers a dialogue and starts the dialogue; a response scenario, which responds to an utterance from the user; and a confirmation scenario, which confirms with the user whether to start a dialogue. The scenario selection unit takes as input a talk start index S, which indicates whether the dialogue device should output a triggering voice and start a dialogue, and a response start index R, which indicates whether a response should be made to a given voice. With J and K each being an integer of 1 or more, it selects the talk-initiation scenario, the response scenario, or the confirmation scenario based on the magnitude relationship between the talk start index S and J thresholds and the magnitude relationship between the response start index R and K thresholds.

Patent Document 1: Japanese Unexamined Patent Publication No. 2018-81147
Patent Document 2: Japanese Unexamined Patent Publication No. 2018-87847

However, Patent Documents 1 and 2 do not consider which parts of the input utterance the communication robot was able to recognize and which parts it was not. Consequently, the user who spoke cannot tell which parts reached the communication robot and which did not. The user may end up asking the same question repeatedly, which increases the user's annoyance and can cause the dialogue with the user to break down.

An object of the present invention is to make dialogue with a user smoother.

A voice dialogue device according to one aspect of the invention disclosed in the present application is a voice dialogue device having a processor that executes a program and a storage device that stores the program, wherein the processor executes a generation process of generating, based on the reliability of each word constituting a word string related to an uttered voice, an ask-back sentence for asking back to the utterance source of the uttered voice, and an output process of outputting the ask-back sentence generated by the generation process.

According to a representative embodiment of the present invention, dialogue with a user can be made smoother. Problems, configurations, and effects other than those described above will become clear from the description of the following embodiments.

FIG. 1 is an explanatory diagram showing an example of voice dialogue between a user and a communication robot according to the first embodiment.
FIG. 2 is an explanatory diagram showing a system configuration example of the voice dialogue system according to the first embodiment.
FIG. 3 is a block diagram showing a hardware configuration example of the voice dialogue device.
FIG. 4 is a block diagram showing a functional configuration example of the voice dialogue device according to the first embodiment.
FIG. 5 is an explanatory diagram showing an example of a speech recognition result.
FIG. 6 is an explanatory diagram showing an example of conversion from a word string to a part-of-speech string.
FIG. 7 is a first explanatory diagram showing an example of the part-of-speech classification definition for part-of-speech strings.
FIG. 8 is a second explanatory diagram showing an example of the part-of-speech classification definition for part-of-speech strings.
FIG. 9 is an explanatory diagram showing an example of reliability calculation by the reliability acquisition unit.
FIG. 10 is a flowchart showing an example of a voice dialogue processing procedure performed by the voice dialogue device according to the first embodiment.
FIG. 11 is a flowchart showing a detailed processing procedure example of the ask-back sentence generation process (step S1005) shown in FIG. 10.
FIG. 12 is a flowchart showing a detailed processing procedure example of the mask word estimation process (step S1106) shown in FIG. 11.
FIG. 13 is an explanatory diagram showing an example of the mask word estimation process (step S1106) shown in FIG. 11.
FIG. 14 is a flowchart showing a detailed processing procedure example of the rephrased utterance interpretation process (step S1007) shown in FIG. 10.
FIG. 15 is a flowchart showing a detailed processing procedure example of the dialogue control process (steps S1008 and S1403) shown in FIGS. 10 and 14.
FIG. 16 is a flowchart showing an example of the flow of dialogue between the user and the voice dialogue device.
FIG. 17 is a block diagram showing a functional configuration example of the voice dialogue device according to the second embodiment.
FIG. 18 is a flowchart showing an example of a voice dialogue processing procedure performed by the voice dialogue device according to the second embodiment.

<Example of voice dialogue>
FIG. 1 is an explanatory diagram showing an example of voice dialogue between a user and a communication robot (hereinafter simply "robot") according to the first embodiment. FIG. 1 shows ask-back patterns P1 to P5 from the robot 102 when the user 101 says the uttered voice 110 "What is the sunrise time of Mt. Fuji tomorrow?" to the robot 102. At least one of the volume of the uttered voice 110, the speaking rate, the accuracy of pronunciation, and the surrounding environment differs among cases (A) to (E) below.

(A) shows an example response using ask-back pattern P1. Specifically, for example, the robot 102 recognizes the uttered voice 110 as "the 〇×△□ of tomorrow's sunrise at Mt. Fuji" (where "〇×△□" is a part that could not be recognized because its reliability was low). It generates, as ask-back pattern P1, a response sentence (hereinafter, ask-back sentence) that asks back using the part whose reliability is at or above the threshold ("of tomorrow's sunrise at Mt. Fuji"): "I couldn't hear you well. What about 'tomorrow's sunrise at Mt. Fuji'?" It then utters this sentence to the user 101, who is the utterance source.

(B) shows an example response using ask-back pattern P2. Specifically, for example, the robot 102 recognizes the uttered voice 110 as "〇×△ Mt. Fuji 〇×△□ time?" (where "〇×△" and "〇×△□" are parts that could not be recognized because their reliability was below the threshold). It generates, as ask-back pattern P2, an ask-back sentence that echoes the words "Mt. Fuji" and "time", whose reliability is at or above the threshold: "'Mt. Fuji'? 'Time'? Please ask your question again." It then utters this sentence to the user 101.

(C) shows an example response using ask-back pattern P3. Specifically, for example, the robot 102 recognizes the uttered voice 110 as "What is the sunrise time of 〇×△□ tomorrow?" (where "〇×△□" is a part that could not be recognized because its reliability was below the threshold). It estimates the low-reliability part (〇×△□) as "Mt. Fuji" and generates, as ask-back pattern P3, an ask-back sentence that confirms the uttered voice 110 with the estimate filled in: "Did you say 'What is the sunrise time of Mt. Fuji tomorrow'?" It then utters this sentence to the user 101.

(D) shows an example response using ask-back pattern P4. Specifically, for example, the robot 102 recognizes the uttered voice 110 with high accuracy and generates, as ask-back pattern P4, an ask-back sentence that confirms the entire recognized question: "Did you say 'What is the sunrise time of Mt. Fuji tomorrow'?" It then utters this sentence to the user 101.

(E) shows an example response using ask-back pattern P5. Specifically, for example, the robot 102 cannot recognize the uttered voice 110 and generates, as ask-back pattern P5, an ask-back sentence that requests the question again: "I couldn't hear you well, so please say that again." It then utters this sentence to the user 101.

For the speech recognition result of the uttered voice 110, the robot 102 calculates a language likelihood (the perplexity of a language model) in addition to the speech recognition reliability. Based on the speech recognition reliability and the language likelihood, it selects either a response under normal dialogue control or generation of an ask-back sentence using ask-back patterns P1 to P5.

In this way, the robot 102 tells the user 101 that part or all of the uttered voice 110 could not be recognized, prompts the user 101 to repeat what is needed, and can thereby keep the dialogue with the user 101 moving smoothly. The user 101 can also tell which parts of the uttered voice 110 were conveyed and which were not.
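As a rough illustration of how per-word reliability could drive the ask-back patterns above, the following Python sketch picks among P1, P2, P4, and P5 from a list of (word, reliability) pairs. The threshold value, the function name `choose_ask_back`, the selection rules, and the English sentence templates are all assumptions made for illustration; this part of the description does not specify them. Pattern P3 is left out because it needs an estimate for the unreliable part.

```python
from typing import List, Tuple

THRESHOLD = 0.5  # hypothetical reliability threshold; not taken from the description


def choose_ask_back(words: List[Tuple[str, float]],
                    threshold: float = THRESHOLD) -> Tuple[str, str]:
    """Return (pattern, sentence) for a list of (word, reliability) pairs."""
    reliable = [score >= threshold for _, score in words]
    if words and all(reliable):
        # P4: everything was recognized; confirm the whole utterance.
        text = " ".join(w for w, _ in words)
        return "P4", f"Did you say '{text}'?"
    if not any(reliable):
        # P5: nothing was recognized; ask for the utterance again.
        return "P5", "I couldn't hear you well, so please say that again."
    first_bad = reliable.index(False)
    if first_bad > 0 and not any(reliable[first_bad:]):
        # P1: a reliable leading part followed only by unreliable words.
        head = " ".join(w for w, _ in words[:first_bad])
        return "P1", f"I couldn't hear you well. What about '{head}'?"
    # P2: reliable words are scattered; echo each of them back.
    kept = "? ".join(f"'{w}'" for (w, _), ok in zip(words, reliable) if ok)
    return "P2", f"{kept}? Please ask your question again."
```

A real implementation would also have to weigh the language likelihood mentioned above before deciding whether to ask back at all; this sketch only looks at word reliability.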

<Voice dialogue system>
FIG. 2 is an explanatory diagram showing a system configuration example of the voice dialogue system according to the first embodiment. The voice dialogue system 200 is, for example, a client-server system and includes a server 201 and an information processing device 104 such as a robot 102 or a smartphone 103. The server 201 and the information processing device 104 can communicate with each other via a network 202 such as the Internet, a LAN (Local Area Network), or a WAN (Wide Area Network).

In the client-server configuration, the voice dialogue program is installed on the server 201. The server 201 therefore acts as the voice dialogue device and performs speech recognition, speech recognition reliability calculation, language likelihood calculation, and response sentence generation. In this case, the information processing device 104 serves as the dialogue interface: it takes in the uttered voice 110, converts it into data, transmits the resulting voice data to the server 201, receives the response sentence from the server 201, and utters the response sentence.

In the stand-alone configuration, by contrast, the voice dialogue program is installed on the information processing device 104, and the server 201 is unnecessary. The information processing device 104 therefore acts as the voice dialogue device and performs input of the uttered voice 110, data conversion of the input uttered voice 110, speech recognition, speech recognition reliability calculation, language likelihood calculation, response sentence generation, and utterance of the response sentence.

<Hardware configuration example of voice dialogue device>
FIG. 3 is a block diagram showing a hardware configuration example of the voice dialogue device. The voice dialogue device 300 includes a processor 301, a storage device 302, an input device 303, an output device 304, and a communication interface (communication IF) 305. The processor 301, the storage device 302, the input device 303, the output device 304, and the communication IF 305 are connected by a bus 306. The processor 301 controls the voice dialogue device 300. The storage device 302 serves as a work area for the processor 301. The storage device 302 is also a non-transitory or transitory recording medium that stores various programs and data. Examples of the storage device 302 include a ROM (Read Only Memory), a RAM (Random Access Memory), an HDD (Hard Disk Drive), and a flash memory. The input device 303 inputs data. Examples of the input device 303 include a keyboard, a mouse, a touch panel, a numeric keypad, a scanner, a microphone, and a biometric sensor. The output device 304 outputs data. Examples of the output device 304 include a display, a printer, and a speaker. The communication IF 305 connects to the network 202 and transmits and receives data.

<Example of functional configuration of voice dialogue device 300>
FIG. 4 is a block diagram showing a functional configuration example of the voice dialogue device 300 according to the first embodiment. The voice dialogue device 300 is a computer on which the voice dialogue program is installed. The voice dialogue device 300 can access a speech recognition model 411, a standard language model 412, a field-specific language model 413, a question sentence language model 414, a dialogue context language model 415, and a dialogue knowledge DB (database) 416. These are stored in the storage device 302 of the voice dialogue device 300 shown in FIG. 3, or in the storage device 302 of another computer that can communicate with the voice dialogue device 300.

The speech recognition model 411 is a statistical model of the correspondence between acoustic features and phonemes, the correspondence between phoneme sequences and words, and word sequences. An example of the speech recognition model 411 is a weighted finite-state transducer (WFST) that combines an acoustic model and a language model into a single object. An example of the acoustic model is a hybrid acoustic model (DNN-HMM) of a deep neural network (DNN) and a hidden Markov model (HMM). An example of the language model is an N-gram model.

The standard language model 412 is a statistical model of word sequences obtained from texts in various fields, including dialogue sentences. The standard language model 412 may be, for example, a word N-gram model or a recurrent neural network (RNN). The standard language model 412 is used to evaluate how plausible a word sequence is as a sentence in a particular language (Japanese in this example).

The field-specific language model 413 is a statistical model of word sequences obtained from texts about the operating environment of the robot 102 (for example, if the robot is used for tourist guidance, texts about guiding visitors around that tourist site), collections of anticipated questions, and the like. The field-specific language model 413 may be, for example, a word N-gram model or an RNN. The field-specific language model 413 is used to evaluate how plausible an utterance is as something said in that operating environment. The standard language model 412 may be a language model that aggregates field-specific language models 413 for multiple fields, or may be a language model of general spoken language that excludes the field-specific language model 413.

The question sentence language model 414 is a statistical model of word sequences at the part-of-speech level, obtained from collections of anticipated questions and from examples of questions actually received from the user 101 so far. The question sentence language model 414 may be, for example, a part-of-speech N-gram model or an RNN. The question sentence language model 414 is used to determine, from the sequence of words, whether an utterance by the user 101 is a question.

The dialogue context language model 415 is a statistical model trained, on texts from all fields including dialogue sentences, on the statistics of an arbitrary word and the words that appear around it and on the statistics of sentence sequences. An example of the dialogue context language model 415 is Bidirectional Encoder Representations from Transformers (BERT).

The dialogue knowledge DB 416 is a database that stores anticipated question sentences and examples of past question sentences in association with their answers.

The voice dialogue device 300 also includes a speech recognition unit 401, a language likelihood acquisition unit 402, a reliability acquisition unit 403, an ask-back determination unit 404, a dialogue control unit 405, an ask-back sentence generation unit 406, a dialogue history management unit 407, and a speech synthesis unit 408. Specifically, these are realized, for example, by causing the processor 301 to execute the voice dialogue program stored in the storage device 302 shown in FIG. 3.

The speech recognition unit 401 receives the waveform data of the uttered voice 110 and, using the speech recognition model 411, converts the waveform data into a speech feature sequence X. Specifically, for example, the speech recognition unit 401 cuts out a segment of the waveform data about 25 [msec] long and converts it into a speech feature vector x. For the speech feature vector x, for example, 13-dimensional mel-frequency cepstral coefficients (MFCC) are used.
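The framing step can be sketched in Python as follows. This is a minimal illustration that only slices the waveform into overlapping frames, using the roughly 25 msec frame length above and the roughly 10 msec shift that the description applies when sliding over the whole waveform. The function name and the 16 kHz sampling rate in the usage note are assumptions, and the MFCC computation for each frame is not shown.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], sample_rate: int,
                      frame_ms: float = 25.0,
                      shift_ms: float = 10.0) -> List[Sequence[float]]:
    """Cut a waveform into overlapping frames (frame_ms long, advanced by shift_ms)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    if frame_len <= 0 or shift <= 0:
        raise ValueError("frame and shift must each cover at least one sample")
    frames = []
    start = 0
    # Stop once a full frame no longer fits in the remaining samples.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames
```

For one second of audio at an assumed 16 kHz, each frame holds 400 samples and the function returns T = 98 frames, each of which would then be turned into one feature vector.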

The speech recognition unit 401 converts the entire waveform data while shifting the analysis window by about 10 [msec] at a time, producing the speech feature sequence X = [x_1, ..., x_t, ..., x_T]. Here x_t is the speech feature vector obtained from the t-th segment of the waveform data, and T is the number of segments in the waveform data. The speech recognition unit 401 then uses the following equation (1) to obtain the word string Wt that is most probable given the speech feature sequence X. W denotes a word string.
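Equation (1) is not reproduced in this text. A form consistent with the description, namely the word string that is most probable given X, would be the usual maximum a posteriori decoding rule shown below. The second equality is the standard textbook factorization into an acoustic term and a language term; it is an assumption here, not a quotation of the patent's equation.

```latex
W_t = \operatorname*{arg\,max}_{W} P(W \mid X)
    = \operatorname*{arg\,max}_{W} P(X \mid W)\, P(W)
```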

The speech recognition unit 401 outputs the word string Wt to the language likelihood acquisition unit 402. The speech recognition unit 401 also generates a word lattice and outputs it to the reliability acquisition unit 403. The word string Wt and the word lattice together constitute the speech recognition result.

FIG. 5 is an explanatory diagram showing an example of a speech recognition result. The speech recognition result 500 includes the word string Wt and the word lattice WL. The word string Wt (and likewise the word string W) consists of the notation, reading, and part of speech of each of one or more words. When the word string Wt is "今日" (today), "は" (topic particle), and "晴れ" (sunny), it consists of the notation "今日", reading "キョー", and part of speech "noun" for the word "今日"; the notation "は", reading "ワ", and part of speech "particle" for the word "は"; and the notation "晴れ", reading "ハレ", and part of speech "noun" for the word "晴れ".

The speech recognition model 411 and each language model (except the dialogue context language model 415) use the same specification for word segmentation and the part-of-speech system. The BERT model used as the dialogue context language model 415 determines word boundaries automatically when the dialogue context language model 415 is trained.

The word lattice WL is network-structured data of the hypothesis candidates produced by speech recognition. Each word is given a value that is the sum of its acoustic likelihood and its language likelihood. For example, the summed value for word A (今日) is "0.7", the summed value for word B (は) is "0.6", and the summed value for word C (晴れ) is "0.7". The sequence of words along the path that maximizes the total of these values is the word string Wt.
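Picking the path with the largest total from a lattice can be sketched as a dynamic program over a directed acyclic graph. In the Python sketch below, the edge representation, the function name, and the competing hypotheses "京", "わ", and "腫れ" with their scores are made up for illustration; only the scores 0.7, 0.6, and 0.7 come from the example above. The sketch assumes node ids increase along every edge.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int, str, float]  # (start node, end node, word, combined score)


def best_word_string(edges: List[Edge], start: int, end: int) -> Tuple[List[str], float]:
    """Return the word sequence with the largest total score from start to end.

    Assumes node ids increase along every edge, so sorting by start node
    visits the lattice in topological order.
    """
    best: Dict[int, Tuple[float, List[str]]] = {start: (0.0, [])}
    for src, dst, word, score in sorted(edges, key=lambda e: e[0]):
        if src not in best:
            continue  # node not reachable from the start node
        total = best[src][0] + score
        if dst not in best or total > best[dst][0]:
            best[dst] = (total, best[src][1] + [word])
    if end not in best:
        raise ValueError("end node is not reachable")
    total, words = best[end]
    return words, total


lattice = [
    (0, 1, "今日", 0.7), (0, 1, "京", 0.3),    # "京" is a made-up competing hypothesis
    (1, 2, "は", 0.6), (1, 2, "わ", 0.2),      # "わ" is made up as well
    (2, 3, "晴れ", 0.7), (2, 3, "腫れ", 0.4),  # "腫れ" is made up as well
]
words, total = best_word_string(lattice, 0, 3)
```

With these numbers the selected path is 今日, は, 晴れ, matching the word string Wt in the example.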

The speech recognition unit 401 may instead transfer the waveform data of the uttered voice of the user 101 to another computer that can communicate with the voice dialogue device 300, and acquire the speech recognition result 500 generated by that computer.

Returning to FIG. 4, the language likelihood acquisition unit 402 uses multiple language models to calculate, for the word string Wt input from the speech recognition unit 401, the language likelihood (test-set perplexity) under each language model. The language models are the standard language model 412, the field-specific language model 413, the question sentence language model 414, and the dialogue context language model 415. Of these, the standard language model 412 and the question sentence language model 414 are required by the language likelihood acquisition unit 402.
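Since the description equates the language likelihood with test-set perplexity, a minimal Python sketch of that quantity may help. It takes the conditional probability that a language model assigns to each word of the word string and returns the perplexity; a lower value means the word string is more plausible under that model. The function name is an assumption, and how the per-word probabilities are obtained depends on the model in use.

```python
import math
from typing import Sequence


def perplexity(word_probs: Sequence[float]) -> float:
    """Perplexity of a word string given each word's conditional probability."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    if any(p <= 0.0 or p > 1.0 for p in word_probs):
        raise ValueError("probabilities must be in (0, 1]")
    # Average negative log-probability per word, then exponentiate.
    avg_log = sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(-avg_log)
```

For instance, if a model assigns probability 0.25 to every word, the perplexity is 4, which can be read as the model being as uncertain as a uniform choice among four words at each position.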

The occurrence probability of the word string W1n when an N-gram model (word N-gram or part-of-speech N-gram) is used is given, for example, by the following equation (2); when an RNN is used, it is given by the following equation (3); and when BERT is used, it is given by the following equation (4). The word string W1n denotes the sequence of words from word W1 to word Wn (where n is the total number of words in the word string W1n); here, the word string W1n equals the word string Wt. The language likelihood acquisition unit 402 outputs the language likelihood calculated for each language model to the ask-back determination unit 404.
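Equations (2) to (4) are not reproduced in this text. The forms below are the standard textbook expressions for each model family and are offered only as a plausible reading, not as the patent's own equations. In particular, BERT does not define a proper sequence probability, so the masked pseudo-likelihood shown for equation (4) is an assumption. Lowercase w_i denotes the i-th word of W1n.

```latex
\begin{aligned}
\text{(2) N-gram:}\quad & P(W_1^n) \approx \prod_{i=1}^{n} P\!\left(w_i \mid w_{i-N+1}, \ldots, w_{i-1}\right) \\
\text{(3) RNN:}\quad    & P(W_1^n) = \prod_{i=1}^{n} P\!\left(w_i \mid w_1, \ldots, w_{i-1}\right) \\
\text{(4) BERT:}\quad   & P(W_1^n) \approx \prod_{i=1}^{n} P\!\left(w_i \mid w_1, \ldots, w_{i-1}, w_{i+1}, \ldots, w_n\right)
\end{aligned}
```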

When the language likelihood acquisition unit 402 calculates the language likelihood by equation (2) using the part-of-speech N-gram question sentence language model 414, it first converts the word string Wt into a part-of-speech string. Because each word in the word string Wt consists of a notation, a reading, and a part of speech, the language likelihood acquisition unit 402 deletes the notation and the reading from each word, converting the word string into a string consisting only of parts of speech, that is, a part-of-speech string.

FIG. 6 is an explanatory diagram showing an example of conversion from a word string to a part-of-speech string. The word string Wt in this example is converted into the part-of-speech string noun → particle → noun. The language likelihood acquisition unit 402 treats the part-of-speech string as the word string W1n and applies equation (2) to calculate the language likelihood. To avoid losing more information than necessary in the conversion, the language likelihood acquisition unit 402 may attach the conjugation form to parts of speech that conjugate, and the base form (the word itself) to pronouns and particles.

FIGS. 7 and 8 are explanatory diagrams showing an example of the part-of-speech classification definition 700 for part-of-speech strings, which follows the short-unit definition (the part-of-speech classification of a morphological analysis dictionary) specified by the National Institute for Japanese Language and Linguistics. The part-of-speech classification definition 700 is, for example, information stored in the storage device 302. It has a major classification 701, a middle classification 702, a minor classification 703, and supplementary information 704. The major classification 701 defines the parts of speech of the words W1 to Wn. The middle classification 702 defines items that subdivide the major classification 701. The minor classification 703 defines items that subdivide the middle classification 702. The supplementary information 704 defines information that supplements the major classification 701 through the minor classification 703. The part of speech of each word W1 to Wn in the word string Wt is assumed to carry the major classification 701 through the minor classification 703, together with information such as the conjugated form for parts of speech that conjugate and the English spelling for katakana words derived from English.

When converting the word string Wt into a part-of-speech string, the language likelihood acquisition unit 402 refers to the part-of-speech classification definition 700, keeps for each part of speech only the information listed in the definition, and attaches the base form (the word itself) to any part of speech whose supplementary information 704 specifies the base form. This limits the loss of information in the part-of-speech string.
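The conversion from a word string to a part-of-speech string can be sketched as follows. This is only an illustration under stated assumptions: the `Word` record, the tag names, the `KEEP_BASE_FORM` set, and the `tag:base` output format are hypothetical and do not come from the part-of-speech classification definition 700 itself.

```python
from typing import List, NamedTuple, Sequence


class Word(NamedTuple):
    surface: str  # notation as recognized
    reading: str  # reading (pronunciation)
    pos: str      # part-of-speech tag (hypothetical tag names)
    base: str     # base form of the word


# Assumption: classes whose base form is kept so that less information is lost.
KEEP_BASE_FORM = {"particle", "pronoun"}


def to_pos_sequence(words: Sequence[Word]) -> List[str]:
    """Drop notation and reading, keeping only the part-of-speech tag.

    For classes listed in KEEP_BASE_FORM the base form is attached to the tag.
    """
    tags = []
    for w in words:
        if w.pos in KEEP_BASE_FORM:
            tags.append(f"{w.pos}:{w.base}")
        else:
            tags.append(w.pos)
    return tags


# Illustrative input with the noun -> particle -> noun shape.
words = [
    Word("富士山", "フジサン", "noun", "富士山"),
    Word("の", "ノ", "particle", "の"),
    Word("高さ", "タカサ", "noun", "高さ"),
]
pos_sequence = to_pos_sequence(words)
```

The resulting list would then be treated as the word string W1n when the part-of-speech N-gram likelihood is computed.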

The language likelihood acquisition unit 402 may instead transfer the word string Wt input from the voice recognition unit 401 to another computer that can communicate with the voice dialogue device 300, and acquire the language likelihood for each language model that the other computer calculates from the word string Wt.

Returning to FIG. 4, the reliability acquisition unit 403 calculates, based on the word lattice WL, the voice recognition reliability (hereinafter simply "reliability") of the word string Wt and of each of the words W1 to Wn that constitute it. Specifically, for example, the reliability is an index value indicating how far the word string Wt obtained by the voice recognition unit 401 can be trusted. Here, a larger value means a higher reliability.

FIG. 9 is an explanatory diagram showing an example of reliability calculation by the reliability acquisition unit 403. The reliability acquisition unit 403 first converts the word lattice WL into a network 600. For a word that has only one state transition (for example, word E), the reliability acquisition unit 403 inserts a state transition carrying the empty symbol ε, whose word reliability is 1.0, and aligns it with the state transition of word E.

Next, the reliability acquisition unit 403 normalizes the reliabilities of the words A to F in the network 600 to 1.0 within each section, converting the network into a word confusion network 601. The reliability acquisition unit 403 then calculates the harmonic mean of the normalized reliabilities of the words A to F in the word confusion network 601 and takes it as the reliability of the word string Wt. It outputs the normalized reliabilities of the words A to F and the reliability of the word string Wt to the repeat-question determination unit 404.
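The normalization and harmonic-mean steps can be illustrated with a small sketch. It assumes that each section of the network is a mapping from candidate word to raw score, that normalization makes the scores in a section sum to 1.0, and that the string reliability is taken over the words on the recognized path; the scores and function names are hypothetical.

```python
from statistics import harmonic_mean
from typing import Dict, List, Sequence


def normalize_sections(sections: Sequence[Dict[str, float]]) -> List[Dict[str, float]]:
    """Rescale the candidate scores of every section so that they sum to 1.0."""
    normalized = []
    for candidates in sections:
        total = sum(candidates.values())
        normalized.append({word: score / total for word, score in candidates.items()})
    return normalized


def string_reliability(normalized: Sequence[Dict[str, float]], path: Sequence[str]) -> float:
    """Harmonic mean of the normalized reliabilities of the words on the given path."""
    return harmonic_mean([section[word] for section, word in zip(normalized, path)])


# Hypothetical raw scores for three sections holding the words A to F.
sections = [{"A": 0.6, "B": 0.2}, {"C": 0.5, "D": 0.5}, {"E": 0.9, "F": 0.1}]
normalized = normalize_sections(sections)
reliability = string_reliability(normalized, ["A", "C", "E"])
```

A harmonic mean is pulled down strongly by a single low value, so one poorly recognized word lowers the reliability of the whole string more than an arithmetic mean would.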

The reliability acquisition unit 403 may instead transfer the word lattice WL input from the voice recognition unit 401 to another computer that can communicate with the voice dialogue device 300, and acquire the reliability that the other computer calculates from the word lattice WL.

Returning to FIG. 4, the repeat-question determination unit 404 determines whether the user 101 should be asked again. Specifically, for example, it uses the language likelihood for each language model from the language likelihood acquisition unit 402 and the reliability of the word string Wt from the reliability acquisition unit 403 to determine which of two processes to execute: dialogue control by the dialogue control unit 405, or the repeat-question process by the repeat-question sentence generation unit 406.

A threshold is set for each of the language likelihood of each language model, the reliability of the word string Wt, and the normalized reliability of each word W1 to Wn. If any one of the per-model language likelihoods or the reliability of the word string Wt is below its threshold, the repeat-question determination unit 404 causes the repeat-question sentence generation process to be executed. The normalized reliabilities of the words W1 to Wn are used by the repeat-question sentence generation unit 406.
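The threshold test amounts to an "any value below its threshold" check. The following sketch assumes hypothetical model names, scores, and threshold values purely for illustration.

```python
from typing import Dict


def needs_repeat_question(
    likelihoods: Dict[str, float],
    likelihood_thresholds: Dict[str, float],
    string_reliability: float,
    reliability_threshold: float,
) -> bool:
    """Return True if any language likelihood or the string reliability is below its threshold."""
    if string_reliability < reliability_threshold:
        return True
    return any(
        likelihoods[name] < likelihood_thresholds[name] for name in likelihoods
    )


# Hypothetical values: every model has its own threshold.
thresholds = {"standard": 0.4, "question": 0.3}
ask_again = needs_repeat_question(
    {"standard": 0.7, "question": 0.2}, thresholds, 0.8, 0.5
)
```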

When the repeat-question determination unit 404 determines that the user 101 does not need to be asked again, the dialogue control unit 405 refers to the dialogue knowledge DB 416 and generates a response sentence corresponding to the voice recognition result 500 of the uttered voice 110.

When the repeat-question determination unit 404 determines that the user 101 should be asked again, the repeat-question sentence generation unit 406 generates a repeat-question sentence for the uttered voice 110 of the user 101. Specifically, for example, it generates the repeat-question sentence according to the determination result of the repeat-question determination unit 404 (which language model's language likelihood is below the threshold). For example, when only some of the words in the word string Wt have a reliability below the threshold, it uses the dialogue context language model 415 to estimate a plausible word for that part, and generates, as the response sentence, a repeat-question sentence asking whether the user 101 uttered the estimated word.

The dialogue history management unit 407 accumulates the word string Wt of the uttered voice 110 of the user 101 and the response sentences generated by the voice dialogue device 300. The voice synthesis unit 408 converts the response sentence generated by the dialogue control unit 405 or the repeat-question sentence generation unit 406 into voice and outputs it. When the voice dialogue device 300 is implemented by the server 201 of a client-server system, the voice dialogue device 300 does not have the voice synthesis unit 408 and instead transmits the response sentence generated by the dialogue control unit 405 or the repeat-question sentence generation unit 406 to the information processing device 104 acting as the client, such as the robot 102 or the smartphone 103. In this case, the client has the voice synthesis unit 408 and converts the response sentence received from the server 201 into voice and outputs it.

<Example of voice dialogue processing procedure>
FIG. 10 is a flowchart showing an example of the voice dialogue processing procedure performed by the voice dialogue device 300 according to the first embodiment. As shown in FIG. 5, the voice dialogue device 300 receives the dialogue voice of the user 101, executes voice recognition with the voice recognition unit 401, and outputs the voice recognition result 500 (step S1001).

Next, using the reliability acquisition unit 403, the voice dialogue device 300 acquires the reliability of the word string Wt and the reliabilities of the words W1 to Wn that constitute it, as shown in FIG. 9 (step S1002). Using the language likelihood acquisition unit 402, it also acquires the language likelihood for each language model (step S1003).

When the dialogue context language model 415 is used in step S1003, the language likelihood acquisition unit 402 acquires the language likelihood by taking as input a sentence (word string) that joins the utterances of the voice dialogue device 300 and the user 101 preceding the uttered voice 110 with the uttered voice 110 itself.

Take the following dialogue as an example.
Voice dialogue device 300: "Hello"
User 101: "Hello, what is your name"
Voice dialogue device 300: "My name is Robot"
User 101: "Oh, I see, you're cute"
Voice dialogue device 300: "Thank you"
User 101: "What can you do?" (uttered voice 110)

In this case, the word strings obtained from the voice recognition unit 401 for the utterances that precede "What can you do?", the uttered voice 110 of the user 101, are joined with it. Those utterances are "Hello" from the voice dialogue device 300, "Hello, what is your name" from the user 101, "My name is Robot" from the voice dialogue device 300, "Oh, I see, you're cute" from the user 101, and "Thank you" from the voice dialogue device 300. The result is the single word string "Hello. Hello, what is your name. My name is Robot. Oh, I see, you're cute. Thank you. What can you do."

The voice dialogue device 300 inputs this joined word string, "Hello. Hello, what is your name. My name is Robot. Oh, I see, you're cute. Thank you. What can you do.", into the dialogue context language model 415 and calculates the language likelihood. Inputting not only the uttered voice 110 of the user 101 but also the preceding dialogue in this way makes the language likelihood obtained from the dialogue context language model 415 more accurate.
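Building the joined input for the dialogue context language model can be sketched as below. The function name and the choice to strip trailing punctuation and join with periods are assumptions made for illustration; the language model call itself is not shown.

```python
from typing import List, Sequence


def join_dialogue(history: Sequence[str], current: str) -> str:
    """Join earlier utterances and the current one into one period-separated string."""
    utterances: List[str] = list(history) + [current]
    # Strip trailing punctuation so that every utterance ends in exactly one period.
    return " ".join(u.rstrip("?!. ") + "." for u in utterances)


history = [
    "Hello",
    "Hello, what is your name",
    "My name is Robot",
    "Oh, I see, you're cute",
    "Thank you",
]
context_input = join_dialogue(history, "What can you do?")
```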

When the dialogue context language model 415 is used in step S1003, the language likelihood acquisition unit 402 may also join the words of the word string Wt in the voice recognition result 500 into a sentence and re-segment it into a word string Wt' by morphological analysis with the morphological analyzer that the dialogue context language model 415 has. As a result, the word boundaries in the word string Wt' may differ from those in the word string Wt. The language likelihood acquisition unit 402 then calculates the likelihood with the word string Wt'.

The voice dialogue device 300 then determines, with the repeat-question determination unit 404, whether the user 101 should be asked again (step S1004). Specifically, for example, the voice dialogue device 300 determines whether any one of the per-model language likelihoods or the reliability of the word string Wt is below its threshold. If any one of them is below its threshold (step S1004: Yes), that is, if the user 101 should be asked again, the voice dialogue device 300 executes the repeat-question process with the repeat-question sentence generation unit 406 (step S1005).

Through the repeat-question sentence generation process (step S1005), the voice dialogue device 300 generates a response sentence of one of the repeat-question patterns P1 to P5, or gives notice that no response sentence is generated, and proceeds to step S1009. Details of the repeat-question sentence generation process (step S1005) are described later with reference to FIG. 11.

On the other hand, if neither the per-model language likelihoods nor the reliability of the word string Wt is below its threshold (step S1004: No), that is, if the user 101 does not need to be asked again, the voice dialogue device 300 determines whether it is currently in the middle of asking the user 101 again with repeat-question pattern P1, P3, or P4 (step S1006).

If a repeat question with pattern P1, P3, or P4 is in progress (step S1006: Yes), the voice dialogue device 300 executes the rephrased-utterance interpretation process with the repeat-question sentence generation unit 406 (step S1007) and proceeds to step S1009. The rephrased-utterance interpretation process (step S1007) interprets the utterance that the user 101 has rephrased and responds according to that interpretation. Details of the rephrased-utterance interpretation process (step S1007) are described later with reference to FIG. 14.

On the other hand, if no repeat question with pattern P1, P3, or P4 is in progress (step S1006: No), the voice dialogue device 300 executes the dialogue control process with the dialogue control unit 405 (step S1008) and proceeds to step S1009. The dialogue control process (step S1008) generates a response sentence that answers the uttered voice of the user 101. Details of the dialogue control process (step S1008) are described later with reference to FIG. 15.
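The branching of steps S1004 to S1008 can be summarized in a short sketch. The function name, the string labels it returns, and the way the pending repeat-question pattern is passed in are hypothetical; only the branch order follows the flow described above.

```python
from typing import Optional

# Patterns after which the next utterance is treated as a rephrasing (step S1006).
AWAITING_REPHRASE = {"P1", "P3", "P4"}


def select_process(below_threshold: bool, pending_pattern: Optional[str]) -> str:
    """Pick the process that produces the response for the current utterance."""
    if below_threshold:                       # step S1004: Yes
        return "repeat_question_generation"   # step S1005
    if pending_pattern in AWAITING_REPHRASE:  # step S1006: Yes
        return "rephrase_interpretation"      # step S1007
    return "dialogue_control"                 # step S1008


selected = select_process(below_threshold=False, pending_pattern="P3")
```

Whichever branch is taken, the flow then continues to the history accumulation of step S1009.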

In step S1009, the voice dialogue device 300 uses the dialogue history management unit 407 to accumulate, as the dialogue history, the word string Wt of the uttered voice of the user 101 and the response sentence generated by the repeat-question process (step S1005), the rephrased-utterance interpretation process (step S1007), or the dialogue control process (step S1008).

The voice dialogue device 300 then uses the voice synthesis unit 408 to convert the response sentence generated by the dialogue control unit 405 or the repeat-question sentence generation unit 406 into voice and output it (step S1010). When the voice dialogue device 300 is implemented as a stand-alone device, it executes steps S1001 to S1010 shown in FIG. 10.

On the other hand, when the voice dialogue device 300 is implemented by the server 201 of a client-server system, the voice dialogue device 300 executes steps S1001 to S1009 and transmits the response sentence to the client, a communication device such as the communication robot 102 or a smartphone. The client then uses the voice synthesis unit 408 to convert the response sentence generated by the dialogue control unit 405 or the repeat-question sentence generation unit 406 into voice and output it (step S1010).

<Repeat-question sentence generation process (step S1005)>
FIG. 11 is a flowchart showing a detailed example of the repeat-question sentence generation process (step S1005) shown in FIG. 10. The repeat-question sentence generation unit 406 determines whether the reliability of the word string Wt is below the threshold (step S1101).

If the reliability of the word string Wt is below the threshold (step S1101: Yes), the repeat-question sentence generation unit 406 determines whether the reliabilities of all the words W1 to Wn are below the threshold (step S1102). If not all of the words W1 to Wn have a reliability below the threshold (step S1102: No), the repeat-question sentence generation unit 406 determines which part of the words W1 to Wn has a reliability below the threshold (step S1103).

In the case of part A (step S1103: part A), the process proceeds to step S1104; in the case of part B (step S1103: part B), to step S1105; and in the case of part C (step S1103: part C), to step S1106. As an example, when the word string Wt qualifies as both part A and part B, part A takes priority, and when it qualifies as both part B and part C, part B takes priority.

Part A is a plurality of words that lie in the first half or the second half of the word string Wt and whose reliability is below the threshold. If the number of words n is even, the first half is words W1 to W(n/2) and the second half is words W(n/2+1) to Wn. If n is odd, the first half is words W1 to W((n+1)/2) and the second half is words W((n+1)/2) to Wn.
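The definition of the two halves translates directly into index arithmetic. In the sketch below the function name is hypothetical; note that for an odd number of words the middle word W((n+1)/2) belongs to both halves, as the definition states.

```python
from typing import List, Sequence, Tuple


def split_halves(words: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Split a word string into its first and second half."""
    n = len(words)
    if n % 2 == 0:
        # W1..W(n/2) and W(n/2+1)..Wn
        return list(words[: n // 2]), list(words[n // 2:])
    mid = (n + 1) // 2  # 1-based position of the middle word
    # W1..W((n+1)/2) and W((n+1)/2)..Wn: the middle word appears in both halves.
    return list(words[:mid]), list(words[mid - 1:])


even_halves = split_halves(["W1", "W2", "W3", "W4"])
odd_halves = split_halves(["W1", "W2", "W3", "W4", "W5"])
```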

However, the plurality of words includes at least one independent word. An independent word is a word whose part of speech is not that of a dependent word, that is, a verb, adjective, adjectival noun, noun, adverb, adnominal, conjunction, or interjection. A dependent word is a word whose part of speech is an auxiliary verb or a particle. In the case of part A (step S1103: part A), the first half or the second half of the uttered voice of the user 101 contains a group of words that the voice dialogue device 300 had difficulty recognizing. In other words, part A corresponds to the case where the voice dialogue device 300 failed to catch half of the uttered voice of the user 101.

In the case of part A (step S1103: part A), the repeat-question sentence generation unit 406 generates, as repeat-question pattern P1 (see FIG. 1), a repeat-question sentence that asks back the remainder of the word string Wt excluding part A, that is, the words whose reliability is at or above the threshold (step S1104), and proceeds to step S1009.

Part B is a plurality of words (including at least one independent word) that are scattered through the word string Wt and whose reliability is below the threshold. To avoid overlap with part A, however, at least one word with a reliability below the threshold must exist in each of the first half and the second half of the word string Wt. In other words, part B corresponds to the case where the voice dialogue device 300 failed to catch fragments of the uttered voice 110 of the user 101.

In the case of part B (step S1103: part B), the repeat-question sentence generation unit 406 generates, as repeat-question pattern P2 (see FIG. 1), a repeat-question sentence that asks back the words of the word string Wt that do not belong to part B and whose reliability is at or above the threshold (step S1105), and proceeds to step S1009.

Part C is one independent word in the word string Wt whose reliability is below the threshold, or two consecutive words in the word string Wt whose reliability is below the threshold (including at least one independent word). If it overlaps with part A, part A takes priority (although part C may instead be given priority over part A). In other words, part C corresponds to the case where the voice dialogue device 300 failed to catch one portion of the uttered voice 110 of the user 101.

In the case of part C (step S1103: part C), the repeat-question sentence generation unit 406 executes the mask word estimation process (step S1106), which estimates the mask word. The mask word is the word that falls under part C. Details of the mask word estimation process (step S1106) are described later with reference to FIG. 12.

After executing the mask word estimation process (step S1106), the repeat-question sentence generation unit 406 determines whether estimation of the mask word succeeded (step S1107). If it succeeded (step S1107: Yes), the repeat-question sentence generation unit 406 generates, as repeat-question pattern P3 (see FIG. 1), the estimation result for part C, whose reliability is below the threshold, together with a repeat-question sentence that confirms the uttered voice 110 of the user 101 with that estimation result filled in.

In repeat-question pattern P3 of FIG. 1, the estimation result for part C is "Mt. Fuji", and the repeat-question sentence confirming the uttered voice 110 of the user 101 with that result filled in is "The time of tomorrow's sunrise at Mt. Fuji?".

On the other hand, if estimation of the mask word did not succeed (step S1107: No), the repeat-question sentence generation unit 406 requests the question again, that is, it generates, as repeat-question pattern P5 (see FIG. 1), a repeat-question sentence that asks for the uttered voice 110 once more (step S1109), and proceeds to step S1009.

Also, if the reliabilities of all the words W1 to Wn are below the threshold in step S1102 (step S1102: Yes), the repeat-question sentence generation unit 406 executes step S1109 and proceeds to step S1009.

Also, if the reliability of the word string Wt is at or above the threshold in step S1101 (step S1101: No), the repeat-question sentence generation unit 406 determines whether the language likelihood of the standard language model 412 is below the threshold (step S1110). If it is below the threshold (step S1110: Yes), the repeat-question sentence generation unit 406 executes step S1109 and proceeds to step S1009.

On the other hand, if the language likelihood of the standard language model 412 is at or above the threshold (step S1110: No), the repeat-question sentence generation unit 406 determines whether the language likelihood of the question sentence language model 414 is below the threshold (step S1111).

If the language likelihood of the question sentence language model 414 is at or above the threshold (step S1111: No), it is the language likelihood of the field-specific language model 413 or the dialogue context language model 415 that is below the threshold. The repeat-question sentence generation unit 406 therefore generates, as repeat-question pattern P4 (see FIG. 1), a repeat-question sentence that repeats the whole question (the uttered voice 110 of the user 101) back for confirmation (step S1112), and proceeds to step S1009. Even when neither the field-specific language model 413 nor the dialogue context language model 415 is in use, the repeat-question sentence generation unit 406 executes step S1112 if the language likelihood of the question sentence language model 414 is at or above the threshold (step S1111: No).

On the other hand, if the language likelihood of the question sentence language model 414 is below the threshold (step S1111: Yes), the repeat-question sentence generation unit 406 does not generate a repeat-question sentence, gives notice that none is generated (step S1113), and proceeds to step S1009. In this way, the repeat-question sentence generation unit 406 can generate repeat-question sentences of patterns P1 to P5 according to the reliability of each word and the language likelihood of each language model.
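The decision flow of FIG. 11 can be condensed into a single function, sketched below. The function name, the parameter names, the use of one shared reliability threshold and one shared likelihood threshold, and the numeric defaults are simplifying assumptions; the classification into part A, B, or C (step S1103) and the outcome of mask word estimation (step S1107) are taken as inputs rather than computed here.

```python
from typing import Optional, Sequence


def choose_pattern(
    string_reliability: float,
    word_reliabilities: Sequence[float],
    part: Optional[str],        # "A", "B" or "C" as decided in step S1103
    mask_word_estimated: bool,  # outcome of step S1107
    standard_lm: float,
    question_lm: float,
    reliability_threshold: float = 0.5,
    likelihood_threshold: float = 0.5,
) -> Optional[str]:
    """Return the repeat-question pattern, or None when no sentence is generated."""
    if string_reliability < reliability_threshold:       # S1101: Yes
        if all(r < reliability_threshold for r in word_reliabilities):
            return "P5"                                   # S1102: Yes -> S1109
        if part == "A":
            return "P1"                                   # S1104
        if part == "B":
            return "P2"                                   # S1105
        return "P3" if mask_word_estimated else "P5"      # S1106, S1107, S1109
    if standard_lm < likelihood_threshold:                # S1110: Yes
        return "P5"                                       # S1109
    if question_lm < likelihood_threshold:                # S1111: Yes
        return None                                       # S1113
    return "P4"                                           # S1112


pattern = choose_pattern(0.3, [0.2, 0.9, 0.8], "C", True, 0.9, 0.9)
```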

<Mask word estimation process (step S1106)>
FIG. 12 is a flowchart showing a detailed example of the mask word estimation process (step S1106) shown in FIG. 11. FIG. 13 is an explanatory diagram showing an example of that process. The mask word estimation process (step S1106) is executed in the case of step S1103: part C. In FIG. 13, the word string Wt containing part C, "○×▽", is the word string 1300, "Tell me the height of ○×▽".

The repeat-question sentence generation unit 406 extracts the reading of the word "○×▽", which is part C (step S1201). In FIG. 13, "フサン" (Fusan) is assumed to have been extracted as the reading 1302 of "○×▽". Next, the repeat-question sentence generation unit 406 masks the part C word "○×▽", turning it into "***" (step S1202). The word string 1300 after masking the part C word "○×▽" is the word string 1301.

Next, the repeat-question sentence generation unit 406 inputs the masked word string 1301 into BERT 1310, an example of the dialogue context language model 415, and predicts the mask word (step S1203). Here, "Landmark Tower" (predicted mask word 1303A), "Tokyo Tower" (predicted mask word 1303B), and "Mt. Fuji" (predicted mask word 1303C) are predicted as the prediction result 1303.

Next, the listening back sentence generation unit 406 extracts the readings of the predicted mask words 1303A to 1303C (step S1204). Here, "Landmark Tower" (predicted mask word reading 1304A), "Tokyo Tower" (predicted mask word reading 1304B), and "Fujisan" (predicted mask word reading 1304C) are obtained as the extraction result 1304.

Then, the listening back sentence generation unit 406 compares "Fusan", the reading 1302 of the part C word "○×▽", with each of the predicted mask word readings 1304A to 1304C using, for example, the edit distance (Levenshtein distance) (step S1206). The listening back sentence generation unit 406 then determines whether any of the predicted mask word readings 1304A to 1304C lies within a predetermined distance of "Fusan", the reading 1302 of the part C word "○×▽" (step S1206).

When none of the predicted mask word readings 1304A to 1304C lies within the predetermined distance (step S1206: No), the listening back sentence generation unit 406 ends the mask word estimation process (step S1106) and proceeds to step S1107. In this case, step S1107 results in mask word estimation failure (step S1107: No).

On the other hand, when any of the predicted mask word readings 1304A to 1304C lies within the predetermined distance (step S1206: Yes), the listening back sentence generation unit 406 selects the predicted mask word with the shortest edit distance as the predicted mask word with the closest reading (step S1207). Here, as an example, "Mt. Fuji" (predicted mask word 1303C), whose reading is "Fujisan" (predicted mask word reading 1304C), is selected. The listening back sentence generation unit 406 then ends the mask word estimation process (step S1106) and proceeds to step S1107. In this case, step S1107 results in mask word estimation success (step S1107: Yes).
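
The reading comparison and selection steps above can be sketched as follows. This is only an illustration under stated assumptions: the BERT prediction of step S1203 is replaced by a hard-coded candidate table, the readings are written in katakana, and the function names and the distance limit of 2 are hypothetical.

```python
from typing import Dict, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def select_mask_word(
    masked_reading: str, candidates: Dict[str, str], max_distance: int
) -> Optional[str]:
    """Return the candidate whose reading is closest to masked_reading,
    or None when no reading lies within max_distance (estimation failure)."""
    best_word, best_dist = None, None
    for word, reading in candidates.items():
        dist = levenshtein(masked_reading, reading)
        if dist <= max_distance and (best_dist is None or dist < best_dist):
            best_word, best_dist = word, dist
    return best_word


# Predicted mask words and their readings, taken from the FIG. 13 example.
CANDIDATES = {
    "ランドマークタワー": "ランドマークタワー",
    "東京タワー": "トーキョータワー",
    "富士山": "フジサン",
}

# "フサン" is one insertion away from "フジサン", so "富士山" is selected.
estimated = select_mask_word("フサン", CANDIDATES, max_distance=2)
```

A None result corresponds to step S1107: No, where the device falls back to asking the user to repeat the question.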

<Rephrased utterance interpretation process (step S1007)>
FIG. 14 is a flowchart showing a detailed processing procedure example of the rephrased utterance interpretation process (step S1007) shown in FIG. 10. The rephrased utterance interpretation process (step S1007) is executed in the case of step S1006: Yes, that is, while listening back with the listening back pattern P1, P3, or P4.

In the case of step S1006: Yes, if the listening back pattern in progress is P1 (step S1401: P1), the listening back sentence generation unit 406 combines the rephrased utterance of the user 101 (the word string Wt of the spoken voice input this time) with the recognition result of the previous utterance (step S1402), and proceeds to the dialogue control process (step S1403).

Here, suppose that the recognition result of the previous utterance, handled with the listening back pattern P1, is 『明日の富士山の日の出の』 ("of tomorrow's sunrise on Mt. Fuji"). 『明日の富士山の日の出の』 is a group of words each of whose reliability is equal to or higher than the threshold value.

When 『明日の富士山の日の出の』 is the first half of the word string of the previous spoken voice of the user 101, the listening back sentence generation unit 406 appends the rephrased utterance of the user 101 (the word string Wt of the spoken voice input this time, for example 「時刻教えて」, "tell me the time") to the end of 『明日の富士山の日の出の』, generating 「明日の富士山の日の出の時刻教えて」 ("tell me the time of tomorrow's sunrise on Mt. Fuji").

As another case, suppose that the recognition result of the previous utterance, handled with the listening back pattern P1, is 『日の出の時刻教えて』 ("tell me the time of sunrise"). 『日の出の時刻教えて』 is a group of words each of whose reliability is equal to or higher than the threshold value.

When 『日の出の時刻教えて』 is the latter half of the word string of the previous spoken voice of the user 101, the listening back sentence generation unit 406 prepends the rephrased utterance of the user 101 (the word string Wt of the spoken voice input this time, for example 「明日の富士山」, "tomorrow's Mt. Fuji") to the beginning of 『日の出の時刻教えて』, generating 「明日の富士山日の出の時刻教えて」 ("tell me the time of tomorrow's Mt. Fuji sunrise").

However, where the rephrased utterance of the user 101 (the word string Wt of the spoken voice input this time) and the recognition result of the previous utterance contain a matching part, the listening back sentence generation unit 406 deletes one of the two copies to avoid redundancy.

For example, if 『日の出の時刻教えて』 ("tell me the time of sunrise") is the recognition result of the previous utterance and the word string of the rephrased utterance of the user 101 is 「明日の富士山の日の出の時刻」 ("the time of tomorrow's sunrise on Mt. Fuji"), the combined result is 「明日の富士山の日の出の時刻日の出の時刻教えて」. In this case, 「日の出の時刻」 ("the time of sunrise") appears twice, so the listening back sentence generation unit 406 deletes one occurrence to obtain 「明日の富士山の日の出の時刻教えて」 ("tell me the time of tomorrow's sunrise on Mt. Fuji").
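
The combination with duplicate removal can be sketched as a suffix/prefix overlap merge. This is a simplified illustration, not the patented procedure: it works on characters rather than on recognized words, and the function names and the boolean flag are hypothetical.

```python
def merge_without_overlap(left: str, right: str) -> str:
    """Concatenate left and right, dropping the longest stretch where the
    end of left repeats the start of right."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    return left + right


def combine(previous: str, rephrased: str, rephrased_goes_last: bool) -> str:
    """Join the previously recognized part with the rephrased utterance.

    rephrased_goes_last is True when the previously recognized part was the
    first half of the utterance, so the rephrased words are appended.
    """
    if rephrased_goes_last:
        return merge_without_overlap(previous, rephrased)
    return merge_without_overlap(rephrased, previous)


# Appending, no overlap.
a = combine("明日の富士山の日の出の", "時刻教えて", True)
# Prepending, with the repeated 「日の出の時刻」 removed once.
b = combine("日の出の時刻教えて", "明日の富士山の日の出の時刻", False)
# Both a and b are 「明日の富士山の日の出の時刻教えて」.
```

A character-level match can merge a coincidental one-character overlap, so a word-level comparison on the recognized word string Wt would likely be the safer choice in practice.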

Further, if the listening back pattern in progress is P3 or P4 (step S1401: P3, P4), the listening back sentence generation unit 406 determines whether the answer of the user 101 (the word string Wt of the spoken voice input this time) to the listening back sentence of pattern P3 or P4 (example: "Did you say 'tell me the height of Mt. Fuji'?") is affirmative or negative (step S1404). If it is affirmative (step S1404: affirmative), the process proceeds to the dialogue control process (step S1403). The dialogue control process (step S1403) will be described later with reference to FIG. 15.

On the other hand, if it is negative (step S1404: negative), the listening back sentence generation unit 406 generates a response sentence asking for the question again (example: "Please ask the question again") (step S1406), and proceeds to step S1007.

<Dialogue control process (steps S1008, S1403)>
FIG. 15 is a flowchart showing a detailed processing procedure example of the dialogue control process (steps S1008 and S1403) shown in FIGS. 10 and 14. First, the dialogue control unit 405 refers to the dialogue knowledge DB 416 and searches for an assumed question sentence close to the word string Wt of the voice recognition result 500 (step S1501). Specifically, for example, the dialogue control unit 405 calculates, for each assumed question sentence in the dialogue knowledge DB 416, its similarity to the word string Wt from the edit distance between the two. Here, as an example, the reciprocal of the edit distance is used as the similarity. Therefore, the larger the similarity value, the more closely the assumed question sentence resembles the word string Wt.

Next, the dialogue control unit 405 determines whether there is an assumed question sentence whose similarity is equal to or higher than the threshold value (step S1502). When there is no such assumed question sentence (step S1502: No), the dialogue control unit 405 generates a response sentence indicating that the meaning of the question is not understood (step S1503), and proceeds to step S1009 or S1007. On the other hand, when there is an assumed question sentence whose similarity is equal to or higher than the threshold value (step S1502: Yes), the dialogue control unit 405 outputs, as the response sentence, the answer sentence that corresponds to that assumed question sentence in the dialogue knowledge DB 416 (step S1504), and proceeds to step S1009 or S1007.
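
The search and threshold test of the dialogue control process can be sketched as below. This is a minimal illustration under several assumptions: the knowledge base is a plain dictionary, the fallback wording and the threshold of 0.2 are hypothetical, an edit distance of zero is treated as infinite similarity, and the most similar question is the one answered when several exceed the threshold. The Tokyo Tower entry is added only to show selection among several questions.

```python
import math
from typing import Dict


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Reciprocal of the edit distance; identical strings give infinity."""
    dist = levenshtein(a, b)
    return math.inf if dist == 0 else 1.0 / dist


FALLBACK = "質問の意味が分かりません。"  # hypothetical "not understood" wording

KNOWLEDGE: Dict[str, str] = {
    "富士山の高さ": "富士山の高さは3776メートルです。",
    "東京タワーの高さ": "東京タワーの高さは333メートルです。",  # illustrative entry
}


def respond(word_string: str, knowledge: Dict[str, str], threshold: float) -> str:
    """Answer with the closest assumed question if it is similar enough."""
    if not knowledge:
        return FALLBACK
    best = max(knowledge, key=lambda q: similarity(word_string, q))
    if similarity(word_string, best) >= threshold:   # step S1502: Yes
        return knowledge[best]                       # step S1504
    return FALLBACK                                  # step S1503


answer = respond("富士山の高さを教えて", KNOWLEDGE, threshold=0.2)
```

Because the reciprocal shrinks quickly as the distance grows, the threshold in effect sets a maximum tolerated edit distance (here, at most 5 edits).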

<Dialogue example>
FIG. 16 is a flowchart showing an example of the flow of dialogue between the user 101 and the voice dialogue device 300. Suppose the user 101 utters "Tell me the height of Mt. Fuji" (step S1601). When the voice dialogue device 300 recognizes "Tell me the height of Mt. Fuji" as "Tell me the height of ***" (step S1602), it estimates the masked part "***" and responds "Is it Mt. Fuji?" (step S1603) (listening back pattern P3).

When the user 101 answers "Is it Mt. Fuji?" with "No", meaning denial (step S1604), the estimate "Mt. Fuji" for the masked part has been rejected, so the voice dialogue device 300 utters "Please ask the question again" (step S1605).

Further, when the user 101 answers the question "Is it Mt. Fuji?" of step S1603 with "Yes", meaning affirmation (step S1606), the voice dialogue device 300 recognizes the utterance of the user 101 in step S1601 as "Tell me the height of Mt. Fuji". The voice dialogue device 300 therefore uses "the height of Mt. Fuji" as the assumed question sentence, retrieves the corresponding answer "3776 meters" from the dialogue knowledge DB 416 (step S1610), and utters "The height of Mt. Fuji is 3776 meters." (step S1611).

Further, when the voice dialogue device 300 recognizes "Tell me the height of Mt. Fuji", the spoken voice 110 of the user 101 in step S1601, as "Mt. Fuji's ***" (step S1607), it responds "What about Mt. Fuji?" (step S1608) (listening back pattern P1). When the user 101 answers "the height" (step S1609), the voice dialogue device 300 uses "the height of Mt. Fuji" as the assumed question sentence, retrieves the corresponding answer "3776 meters" from the dialogue knowledge DB 416 (step S1610), and utters "The height of Mt. Fuji is 3776 meters." (step S1611).

As described above, according to the first embodiment, the listening back patterns P1 to P5 for the spoken voice 110 of the user 101 can be selected according to the reliability of the words and the language likelihood. In particular, evaluating the spoken voice 110 of the user 101 with the standard language model 412, the field-specific language model 413, the question sentence language model 414, and the dialogue context language model 415 makes it possible to determine, respectively, whether the utterance is plausible as Japanese, plausible as an utterance in that setting, plausible as a question, and plausible given the dialogue context. Listening back based on these determinations prevents the dialogue from breaking down.

In this way, listening back only as much as necessary, based on the reliability of voice recognition, avoids making the user 101 ask the same question repeatedly and reduces the annoyance to the user 101.

The second embodiment is an example that, relative to the first embodiment, prevents an erroneous determination, that is, a transition to step S1004: No, in cases where, owing to the distinctive phrasing of the spoken voice 110 of the user 101, the language likelihoods given by the field-specific language model 413 or the dialogue context language model are equal to or higher than the threshold value. This prevents the voice dialogue device 300 from asking the user 101 to repeat the question (step S1405), from responding through the dialogue control process (steps S1008, S1403) that the meaning of the assumed question sentence is not understood (step S1503), and from giving an implausible answer corresponding to an implausible assumed question sentence (step S1504). Since the description here focuses on the content of the second embodiment, parts that overlap with the first embodiment are omitted.

<Functional configuration example of the voice dialogue device 300>
FIG. 17 is a block diagram showing a functional configuration example of the voice dialogue device 300 according to the second embodiment. In addition to the configuration shown in FIG. 4 of the first embodiment, the voice dialogue device 300 has a personal language model 1701 and a personal identification unit 1702. The personal language model 1701 is stored in the storage device 302 of the voice dialogue device 300 shown in FIG. 3 or in the storage device 302 of another computer capable of communicating with the voice dialogue device 300. Specifically, the personal identification unit 1702 is realized, for example, by causing the processor 301 to execute the voice dialogue program stored in the storage device 302 shown in FIG. 3.

The personal language model 1701 is a language model built from the characteristics specific to the user 101 (individuality). Specifically, for example, it is created for each user 101 from that user's dialogue history. The dialogue history management unit 407 therefore manages the dialogue history for each user 101. The personal language model 1701 is realized by, for example, a word N-gram or an RNN.
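
As a rough picture of a word N-gram built from one user's dialogue history, the sketch below trains a bigram model with add-one smoothing and scores a word string by its average log-likelihood per transition. The class name, the smoothing scheme, the pre-tokenized input, and the sample history are all assumptions for illustration; the source does not specify them.

```python
import math
from collections import Counter
from typing import Sequence


class BigramModel:
    """Word bigram model with add-one smoothing (illustrative only)."""

    def __init__(self, history: Sequence[Sequence[str]]) -> None:
        self.unigrams: Counter = Counter()
        self.bigrams: Counter = Counter()
        self.vocab: set = set()
        for words in history:
            padded = ["<s>", *words, "</s>"]
            self.vocab.update(padded)
            self.unigrams.update(padded[:-1])          # contexts only
            self.bigrams.update(zip(padded, padded[1:]))

    def log_likelihood(self, words: Sequence[str]) -> float:
        """Average log-probability per bigram transition."""
        padded = ["<s>", *words, "</s>"]
        v = len(self.vocab) + 1                        # +1 for unseen words
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            prob = (self.bigrams[(prev, cur)] + 1) / (self.unigrams[prev] + v)
            total += math.log(prob)
        return total / (len(padded) - 1)


# Hypothetical dialogue history of one user, already split into words.
HISTORY = [
    ["富士山", "の", "高さ", "を", "教えて"],
    ["富士山", "の", "日の出", "の", "時刻", "教えて"],
]
model = BigramModel(HISTORY)
familiar = model.log_likelihood(["富士山", "の", "高さ", "を", "教えて"])
scrambled = model.log_likelihood(["高さ", "富士山", "教えて", "の"])
```

Phrasing the user has produced before scores higher than a scrambled word order, which is the property that allows a per-user threshold test on language likelihood.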

The personal identification unit 1702 identifies the user 101. Specifically, for example, the personal identification unit 1702 manages biometric information of the user 101 such as fingerprints, palm veins, iris, face images, and voice, and when input biometric information matches the managed information, it identifies the person who made the input as the user 101 having that biometric information. The personal identification unit 1702 may also manage user IDs and passwords, and identify the person who made the input as the user 101 when the input user ID and password match.

<Voice dialogue processing procedure example>
FIG. 18 is a flowchart showing an example of the voice dialogue processing procedure performed by the voice dialogue device 300 according to the second embodiment. The voice dialogue device 300 executes steps S1801 to S1804 prior to steps S1001 to S1010 shown in FIG. 10.

Specifically, for example, the voice dialogue device 300 uses the personal identification unit 1702 to identify the user 101 who has input biometric information (step S1801). The voice dialogue device 300 determines whether the user 101 who has input the biometric information is a registered user 101 (step S1802). Specifically, for example, the voice dialogue device 300 determines whether the registered biometric information and the input biometric information match, and if they match, determines that the user 101 who has input the biometric information is a registered user 101.

When the user is a registered user 101 (step S1802: Yes), the voice dialogue device 300 loads the personal language model 1701 and the dialogue history of that user 101 (step S1803), and proceeds to step S1001. On the other hand, when the user is not a registered user 101 (step S1802: No), the voice dialogue device 300 newly creates a personal language model 1701 and a dialogue history for that user 101 (hereinafter, new user 101) (step S1804). Specifically, for example, the voice dialogue device 300 interacts with the new user 101 to acquire a dialogue history, and creates the personal language model 1701 of the new user 101 based on the acquired dialogue history. The process then proceeds to step S1001.
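
The load-or-create branch of steps S1802 to S1804 can be sketched with an in-memory store. The class names, the string user identifier standing in for the result of biometric matching, and the placeholder model field are hypothetical; how profiles are actually persisted is not described in the source.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class UserProfile:
    """Per-user dialogue history and personal language model (placeholder)."""
    history: List[str] = field(default_factory=list)
    model: Optional[object] = None  # stands in for the personal language model


class ProfileStore:
    """In-memory registry keyed by an identified user (illustrative only)."""

    def __init__(self) -> None:
        self._profiles: Dict[str, UserProfile] = {}

    def start_session(self, user_id: str) -> Tuple[UserProfile, bool]:
        """Return (profile, was_registered) for the identified user."""
        if user_id in self._profiles:          # step S1802: Yes
            return self._profiles[user_id], True   # step S1803: load
        profile = UserProfile()                # step S1804: create new
        self._profiles[user_id] = profile
        return profile, False


store = ProfileStore()
first_profile, first_known = store.start_session("user-101")
first_profile.history.append("富士山の高さを教えて")
second_profile, second_known = store.start_session("user-101")
```

On the second call the same profile object comes back with its history intact, mirroring the loading of an existing user's model and history before step S1001.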

The personal language model 1701 is used in the language likelihood calculation (step S1003). When the language likelihood of the question sentence language model 414 is equal to or higher than the threshold value in step S1111 shown in FIG. 11 (step S1111: No), the language likelihood of one of the field-specific language model 413, the dialogue context language model 415, or the personal language model 1701 is less than the threshold value. The listening back sentence generation unit 406 therefore generates, as the listening back pattern P4 (see FIG. 1), a listening back sentence that recites the entire question (the spoken voice of the user 101) for confirmation (step S1112), and proceeds to step S1009.

Further, even when none of the field-specific language model 413, the dialogue context language model 415, and the personal language model 1701 is used, the listening back sentence generation unit 406 executes step S1112 if the language likelihood of the question sentence language model 414 is equal to or higher than the threshold value (step S1111: No).

As described above, according to the second embodiment, erroneous determination of the language likelihood caused by phrasing peculiar to the user 101 can be suppressed, making the dialogue smoother.

Further, the voice dialogue device 300 according to the first and second embodiments described above can also be configured as in (1) to (13) below.

(1) In the voice dialogue device 300 having the processor 301 that executes the voice dialogue program and the storage device 302 that stores the voice dialogue program, the processor 301 executes a reliability acquisition process (step S1102) of acquiring the reliability of each of the words W1 to Wn constituting the word string Wt relating to the spoken voice 110, and a listening back sentence generation process (step S1005) of generating, based on the reliability acquired by the reliability acquisition process (step S1102), a listening back sentence for asking back the user 101 who is the source of the spoken voice 110.

As a result, the voice dialogue device 300 can generate an appropriate listening back sentence from among, for example, the listening back patterns P1 to P3 and P5 as shown in FIG. 11.

(2) In the voice dialogue device 300 of (1) above, in the listening back sentence generation process (step S1005), when the reliability of every one of the words W1 to Wn constituting the word string Wt is less than the first threshold value (step S1102: Yes), the processor 301 generates a listening back sentence that requests the spoken voice 110 again (step S1109).

As a result, the voice dialogue device 300 can, for example, generate a listening back sentence of the listening back pattern P5 from among the listening back patterns P1 to P3 and P5 as shown in FIG. 11, and ask the user 101 to repeat the question.

(3) In the voice dialogue device 300 of (1) above, in the listening back sentence generation process (step S1005), when the word string Wt contains a word whose reliability is equal to or higher than the first threshold value (step S1102: No), the processor 301 generates a listening back sentence based on the position, within the word string Wt, of the words whose reliability is less than the first threshold value.

As a result, the voice dialogue device 300 can generate, from among, for example, the listening back patterns P1 to P3 and P5 as shown in FIG. 11, a listening back sentence suited to the position of the words whose reliability is less than the first threshold value.

(4) In the voice dialogue device 300 of (3) above, in the listening back sentence generation process (step S1005), when a word whose reliability is less than the first threshold value exists in the first half or the latter half of the word string Wt (step S1103: part A), the processor 301 generates a listening back sentence that repeats back whichever of the first half and the latter half contains no word whose reliability is less than the first threshold value (step S1104).

As a result, the voice dialogue device 300 can, for example, generate a listening back sentence of the listening back pattern P1 from among the listening back patterns P1 to P3 and P5 as shown in FIG. 11, speak to the user 101 the part of the spoken voice 110 that was heard, and prompt the user 101 to utter again the part that was not heard.

(5) In the voice dialogue device 300 of (3) above, in the listening back sentence generation process (step S1005), when a plurality of words whose reliability is less than the first threshold value exist across both the first half and the latter half of the word string Wt (step S1103: part B), the processor 301 generates a listening back sentence that repeats back the words of the word string Wt whose reliability is equal to or higher than the first threshold value (step S1105).

As a result, the voice dialogue device 300 can, for example, generate a listening back sentence of the listening back pattern P2 from among the listening back patterns P1 to P3 and P5 as shown in FIG. 11, speak to the user 101 the part of the spoken voice 110 that was heard, and ask the user 101 to repeat the question.

(6) In the voice dialogue device 300 of (3) above, when the word string contains one word whose reliability is less than the first threshold value, or two consecutive words whose reliability is less than the first threshold value (step S1103: part C), the processor 301 executes a mask word estimation process (step S1106) of estimating what the one word or the two consecutive words are, based on the dialogue context language model 415, which has been trained on statistics of arbitrary words and the word groups appearing around them and on statistics of phrase sequences. In the listening back sentence generation process (step S1005), the processor 301 generates a listening back sentence according to the estimation result of the mask word estimation process (step S1106).

As a result, the voice dialogue device 300 can reduce how often it has to listen back by, for example, estimating from the context the one word, or the two consecutive words, whose reliability is less than the first threshold value.

(7) In the voice dialogue device 300 of (6) above, in the listening back sentence generation process (step S1005), when the estimation by the mask word estimation process (step S1106) succeeds (step S1107: Yes), the processor 301 generates a listening back sentence that contains the estimated word and confirms the spoken voice 110 (step S1108).

As a result, the voice dialogue device 300 can generate, for example as the listening back pattern P3, a listening back sentence that confirms both the estimated word and the spoken voice 110 as a whole. This tells the user 101 which part was hard to recognize and how the utterance as a whole was recognized, reducing how often listening back is needed.

(8) In the voice dialogue device 300 of (6) above, in the listening back sentence generation process (step S1005), when the estimation by the mask word estimation process (step S1106) fails (step S1107: No), the processor 301 generates a listening back sentence that requests the spoken voice 110 again.

As a result, the voice dialogue device 300 can, for example, generate a listening back sentence of the listening back pattern P5 and ask the user 101 to repeat the question.

(9) In the voice dialogue device 300 of (1) above, in the listening back sentence generation process (step S1005), when the reliability of the word string Wt is less than the first threshold value (step S1101: Yes), the processor 301 generates a listening back sentence based on the reliability of each of the words W1 to Wn constituting the word string Wt.

As a result, when the reliability of the word string Wt is less than the first threshold value, the spoken voice 110 is treated as not correctly recognized, and how to listen back can be selected from the listening back patterns P1 to P3 and P5 according to the reliability of the individual words W1 to Wn.

(10) In the voice dialogue device 300 of (1) above, the processor 301 executes a language likelihood acquisition process (step S1103) of acquiring a plurality of language likelihoods obtained by inputting the word string into each of the plurality of language models 412 to 415. In the listening back sentence generation process (step S1005), when the reliability of the word string Wt is equal to or higher than the first threshold value (step S1101: No), the processor 301 generates a listening back sentence based on the plurality of language likelihoods acquired by the language likelihood acquisition process (step S1103).

As a result, when the recognition of the uttered voice 110 is highly reliable, the device can identify under which language model the word string Wt is plausible as an utterance, and can select repeat questioning pattern P4, repeat questioning pattern P5, or no repeat questioning.
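The branch at step S1101 described in items (9) and (10) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the names `Branch` and `select_branch` and the numeric values are hypothetical, and reliabilities are assumed to be plain floating-point numbers.

```python
from enum import Enum


class Branch(Enum):
    # Item (9): choose among patterns P1 to P3 and P5 from per-word reliability.
    WORD_RELIABILITY = "word_reliability"
    # Item (10): choose P4, P5, or no repeat questioning from language likelihoods.
    LANGUAGE_LIKELIHOOD = "language_likelihood"


def select_branch(word_string_reliability: float, first_threshold: float) -> Branch:
    """Pick the decision path from the reliability of the whole word string Wt."""
    if word_string_reliability < first_threshold:  # step S1101: Yes
        return Branch.WORD_RELIABILITY
    return Branch.LANGUAGE_LIKELIHOOD  # step S1101: No


# Hypothetical values: a word string reliability of 0.3 against a threshold of 0.5.
print(select_branch(0.3, 0.5))
```

A reliability exactly equal to the threshold takes the language likelihood path, matching the "equal to or higher than" wording of item (10).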

(11) In the voice dialogue device 300 of (10) above, in the repeat questioning sentence generation process (step S1005), when a first language likelihood is less than the second threshold value (step S1110: Yes), the processor 301 generates a repeat questioning sentence that re-requests the uttered voice 110 (step S1109). The first language likelihood is obtained by inputting the word string Wt into the standard language model 412, which, among the plurality of language models 412 to 415, is a statistical model of word sequences obtained from texts in a plurality of fields.

As a result, when the word string is not plausible as an utterance in the standard language, a repeat questioning sentence of repeat questioning pattern P5 can be generated to prompt the user 101 to ask again.

(12) In the voice dialogue device 300 of (11) above, in the repeat questioning sentence generation process (step S1005), when the first language likelihood is equal to or higher than the second threshold value (step S1110: No) and a second language likelihood is less than the second threshold value (step S1111: Yes), the processor 301 does not generate a repeat questioning sentence (step S1113). The second language likelihood is obtained by inputting the word string Wt into the question sentence language model 414, which determines, from the sequence of words constituting a question sentence, whether the uttered voice 110 is a question.

As a result, when the question sentence language model 414 finds the word string plausible as a question sentence, needless repetition of repeat questioning is suppressed and the dialogue proceeds more smoothly.

(13) In the voice dialogue device 300 of (11) above, in the repeat questioning sentence generation process (step S1005), when the first language likelihood is equal to or higher than the second threshold value and the second language likelihood is equal to or higher than the second threshold value (step S1111: No), the processor 301 generates a repeat questioning sentence that asks back the uttered voice 110 (step S1112). The second language likelihood is obtained by inputting the word string Wt into the question sentence language model 414, which determines, from the sequence of words constituting a question sentence, whether the uttered voice 110 is a question.

As a result, even when the uttered voice 110 has been recognized, if the question sentence language model 414 finds the word string implausible as a question sentence, a repeat questioning sentence of repeat questioning pattern P4 can be generated to ask for the entire question again.
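The decisions of items (11) to (13) can be summarized in a short Python sketch. The comparisons follow the conditions exactly as stated in those items, with both likelihoods compared against the second threshold value. The function name, the use of `None` for "no repeat questioning sentence", the string labels, and the numeric values are assumptions made for illustration only.

```python
from typing import Optional


def select_pattern_by_likelihood(
    first_likelihood: float,
    second_likelihood: float,
    second_threshold: float,
) -> Optional[str]:
    """Return "P5", "P4", or None (no repeat questioning sentence)."""
    # Item (11): likelihood under the standard language model 412 (step S1110).
    if first_likelihood < second_threshold:
        return "P5"  # step S1109: re-request the uttered voice
    # Items (12) and (13): likelihood under the question sentence
    # language model 414 (step S1111).
    if second_likelihood < second_threshold:
        return None  # step S1113: no repeat questioning sentence
    return "P4"  # step S1112: ask back the uttered voice


# Hypothetical likelihoods and threshold value.
print(select_pattern_by_likelihood(0.2, 0.9, 0.5))
```

The standard language model check comes first, so the question sentence language model is consulted only for word strings that already pass it.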

The present invention is not limited to the embodiments described above and includes various modifications and equivalent configurations within the spirit of the appended claims. For example, the embodiments above are described in detail to explain the present invention clearly, and the present invention is not necessarily limited to configurations that include every element described. Part of the configuration of one embodiment may be replaced with the configuration of another embodiment, and the configuration of another embodiment may be added to that of one embodiment. For part of the configuration of each embodiment, other configurations may be added, deleted, or substituted.

Each of the configurations, functions, processing units, processing means, and the like described above may be realized partly or wholly in hardware, for example by designing them as an integrated circuit, or in software, by having the processor 301 interpret and execute programs that realize the respective functions.

Information such as programs, tables, and files that realize each function can be stored in a storage device such as a memory, a hard disk, or an SSD (Solid State Drive), or on a recording medium such as an IC (Integrated Circuit) card, an SD card, or a DVD (Digital Versatile Disc).

The control lines and information lines shown are those considered necessary for explanation, and not all control lines and information lines required for implementation are necessarily shown. In practice, almost all components may be considered to be interconnected.

W1 to Wn Words
WL Word lattice
Wt Word string
101 User
102 Communication robot
103 Smartphone
104 Information processing device
110 Uttered voice
200 Voice dialogue system
201 Server
202 Network
300 Voice dialogue device
301 Processor
302 Storage device
401 Speech recognition unit
402 Language likelihood acquisition unit
403 Reliability acquisition unit
404 Repeat questioning determination unit
405 Dialogue control unit
406 Repeat questioning sentence generation unit
407 Dialogue history management unit
408 Speech synthesis unit
411 Speech recognition model
412 Standard language model
413 Field-specific language model
414 Question sentence language model
415 Dialogue context language model
416 Dialogue knowledge DB

Claims

A voice dialogue device comprising a processor that executes a program and a storage device that stores the program, wherein
the processor executes:
an acquisition process of acquiring the reliability of each word constituting a word string related to an uttered voice; and
a generation process of generating, based on the reliability acquired by the acquisition process, a repeat questioning sentence for asking back to the utterance source of the uttered voice.

The voice dialogue device according to claim 1, wherein
in the generation process, the processor generates a repeat questioning sentence that re-requests the uttered voice when the reliability of every word constituting the word string is less than the first threshold value.

The voice dialogue device according to claim 1, wherein
in the generation process, when a word whose reliability is equal to or higher than the first threshold value is present in the word string, the processor generates the repeat questioning sentence based on the position, in the word string, of a word whose reliability is less than the first threshold value.

The voice dialogue device according to claim 3, wherein
in the generation process, when a word whose reliability is less than the first threshold value is present in the first half or the second half of the word string, the processor generates a repeat questioning sentence that asks back the one of the first half and the second half in which no word whose reliability is less than the first threshold value is present.

The voice dialogue device according to claim 3, wherein
in the generation process, when a plurality of words whose reliability is less than the first threshold value are present across the first half and the second half of the word string, the processor generates a repeat questioning sentence that asks back the words of the word string whose reliability is equal to or higher than the first threshold value.

The voice dialogue device according to claim 3, wherein
the processor executes an estimation process of, when one word whose reliability is less than the first threshold value, or two consecutive words whose reliability is less than the first threshold value, are present in the word string, estimating what kind of word the one word or the two consecutive words are, based on a dialogue context language model trained on statistics of an arbitrary word and the group of words appearing around it and on statistics of sequences of phrases, and
in the generation process, the processor generates a repeat questioning sentence according to the estimation result of the estimation process.

The voice dialogue device according to claim 6, wherein
in the generation process, when the estimation by the estimation process succeeds, the processor generates a repeat questioning sentence that includes the estimated word and confirms the uttered voice.

The voice dialogue device according to claim 6, wherein
in the generation process, the processor generates a repeat questioning sentence that re-requests the uttered voice when the estimation by the estimation process fails.

The voice dialogue device according to claim 1, wherein
in the generation process, when the reliability of the word string is less than the first threshold value, the processor generates the repeat questioning sentence based on the reliability of each word constituting the word string.

The voice dialogue device according to claim 1, wherein
in the acquisition process, the processor acquires a plurality of language likelihoods obtained by inputting the word string into each of a plurality of language models, and
in the generation process, when the reliability of the word string is equal to or higher than the first threshold value, the processor generates the repeat questioning sentence based on the plurality of language likelihoods acquired by the acquisition process.

The voice dialogue device according to claim 10, wherein
in the generation process, when a first language likelihood, obtained by inputting the word string into a first language model that, among the plurality of language models, is a statistical model of word sequences obtained from texts in a plurality of fields, is less than the second threshold value, the processor generates a repeat questioning sentence that re-requests the uttered voice.

The voice dialogue device according to claim 11, wherein
in the generation process, when the first language likelihood is equal to or higher than the second threshold value, the processor does not generate the repeat questioning sentence if a second language likelihood, obtained by inputting the word string into a language model that determines, from the sequence of words constituting a question sentence, whether the uttered voice is a question, is less than the second threshold value.

The voice dialogue device according to claim 11, wherein
in the generation process, when the first language likelihood is equal to or higher than the second threshold value, the processor generates a repeat questioning sentence that asks back the uttered voice if a second language likelihood, obtained by inputting the word string into a language model that determines, from the sequence of words constituting a question sentence, whether the uttered voice is a question, is equal to or higher than the second threshold value.

A voice dialogue method executed by a voice dialogue device having a processor that executes a program and a storage device that stores the program, wherein
in the voice dialogue method, the processor executes:
an acquisition process of acquiring the reliability of each word constituting a word string related to an uttered voice; and
a generation process of generating, based on the reliability acquired by the acquisition process, a repeat questioning sentence for asking back to the utterance source of the uttered voice.

A voice dialogue program that causes a processor to execute:
an acquisition process of acquiring the reliability of each word constituting a word string related to an uttered voice; and
a generation process of generating, based on the reliability acquired by the acquisition process, a repeat questioning sentence for asking back to the utterance source of the uttered voice.