JP7471921B2 - Speech dialogue device, speech dialogue method, and speech dialogue program

Info

Publication number: JP7471921B2
Application number: JP2020096174A
Authority: JP
Inventors: 尚和内田
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2020-06-02
Filing date: 2020-06-02
Publication date: 2024-04-22
Anticipated expiration: 2040-06-02
Also published as: JP2021189348A

Description

The present invention relates to a voice dialogue device, a voice dialogue method, and a voice dialogue program that carry out voice dialogue.

Compared with dialogue systems that take text input, communication robots capable of voice dialogue with humans more often receive voice input that contains errors. Such erroneous input takes various forms, for example voice activity detection errors, speech recognition errors, recognition of irrelevant speech, and utterances broken up by hesitations or mid-utterance self-corrections. Therefore, when such voice input is received, a dialogue with a human cannot be established unless the communication robot determines whether the input voice is normal and performs appropriate exception handling.

Patent Document 1 and Patent Document 2 describe voice dialogue technology of this kind. Patent Document 1 discloses a robot that notifies the target person, by voice, of the voice data corresponding to words whose speech recognition accuracy fell short of a predetermined value. The robot acquires the speech uttered by the target person and outputs to that person, as audio, the portion of the acquired voice data that corresponds to words whose speech recognition accuracy fell short of the predetermined value.

Patent Document 2 discloses a dialogue control device that controls a dialogue device so that it performs the kind of confirmation action a human would perform, thereby reducing erroneous responses by the dialogue device. This dialogue control device includes a scenario storage unit and a scenario selection unit. The scenario storage unit stores a talk-initiation scenario, in which the dialogue device starts a dialogue by outputting a voice that triggers it; a response scenario, which responds to an utterance from the user; and a confirmation scenario, which confirms with the user whether to start a dialogue. The scenario selection unit takes as input a talk-initiation start index S, which indicates whether the dialogue device should start a dialogue by outputting a triggering voice, and a response start index R, which indicates whether a response should be made to a given voice. With J and K each being an integer of 1 or more, it selects the talk-initiation scenario, the response scenario, or the confirmation scenario based on the magnitude relationship between the index S and J thresholds and the magnitude relationship between the index R and K thresholds.

JP 2018-81147 A
JP 2018-87847 A

However, Patent Documents 1 and 2 above do not consider which parts of the input dialogue voice the communication robot was able to recognize and which parts it was not. Consequently, the user who spoke cannot know which parts reached the communication robot and which did not. The user is therefore made to ask the same question repeatedly, which increases the user's annoyance and may cause the dialogue to break down.

An object of the present invention is to make dialogue with the user smoother.

A voice dialogue device according to one aspect of the invention disclosed in the present application has a processor that executes a program and a storage device that stores the program. When the word string contains a word whose reliability is equal to or greater than a first threshold, and multiple words whose reliability is less than the first threshold are present across both the first half and the second half of the word string, the processor executes a generation process that generates an ask-back sentence asking the source of the utterance about the words in the word string whose reliability is equal to or greater than the first threshold.
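The condition stated above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function names, the rule that the word string is split into halves at `len // 2`, the reading of "across the first and second halves" as "at least one low-reliability word in each half", and the English wording of the generated sentence are all assumptions made for the example.

```python
def needs_keyword_ask_back(confidences, threshold):
    """Return True when the condition for asking back about reliable words holds.

    Assumption: the halves are split at len // 2, and "across both halves"
    means at least one word below the threshold in each half.
    """
    mid = len(confidences) // 2
    has_high = any(c >= threshold for c in confidences)
    low_in_first = any(c < threshold for c in confidences[:mid])
    low_in_second = any(c < threshold for c in confidences[mid:])
    return has_high and low_in_first and low_in_second


def keyword_ask_back(words, confidences, threshold):
    """Build an ask-back sentence from the words at or above the threshold.

    Returns None when the condition does not hold. The sentence template
    is hypothetical.
    """
    if not needs_keyword_ask_back(confidences, threshold):
        return None
    kept = [w for w, c in zip(words, confidences) if c >= threshold]
    return "".join(f"'{w}'? " for w in kept) + "Please ask your question again."


# Hypothetical recognition result: two unreliable segments around two reliable words.
words = ["(unclear)", "Mt. Fuji", "(unclear)", "time"]
confidences = [0.2, 0.9, 0.3, 0.8]
print(keyword_ask_back(words, confidences, threshold=0.5))
```

Only the reliable words are echoed back, so the speaker can tell which parts of the utterance were understood.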

According to a representative embodiment of the present invention, dialogue with the user can be made smoother. Problems, configurations, and effects other than those described above will become clear from the following description of the embodiments.

FIG. 1 is an explanatory diagram showing an example of a voice dialogue between a user and a communication robot according to the first embodiment.
FIG. 2 is an explanatory diagram showing an example of the system configuration of the voice dialogue system according to the first embodiment.
FIG. 3 is a block diagram showing an example of the hardware configuration of the voice dialogue device.
FIG. 4 is a block diagram showing an example of the functional configuration of the voice dialogue device according to the first embodiment.
FIG. 5 is an explanatory diagram showing an example of a speech recognition result.
FIG. 6 is an explanatory diagram showing an example of conversion from a word string to a part-of-speech string.
FIG. 7 is a first explanatory diagram showing an example of the part-of-speech classification definition for a part-of-speech string.
FIG. 8 is a second explanatory diagram showing an example of the part-of-speech classification definition for a part-of-speech string.
FIG. 9 is an explanatory diagram showing an example of reliability calculation by the reliability acquisition unit.
FIG. 10 is a flowchart showing an example of the voice dialogue processing procedure performed by the voice dialogue device according to the first embodiment.
FIG. 11 is a flowchart showing a detailed example of the ask-back sentence generation process (step S1005) shown in FIG. 10.
FIG. 12 is a flowchart showing a detailed example of the mask word estimation process (step S1106) shown in FIG. 11.
FIG. 13 is an explanatory diagram showing an example of the mask word estimation process (step S1106) shown in FIG. 11.
FIG. 14 is a flowchart showing a detailed example of the restatement utterance interpretation process (step S1007) shown in FIG. 10.
FIG. 15 is a flowchart showing a detailed example of the dialogue control process (steps S1008 and S1403) shown in FIG. 10 and FIG. 14.
FIG. 16 is a flowchart showing an example of the flow of dialogue between the user and the voice dialogue device.
FIG. 17 is a block diagram showing an example of the functional configuration of the voice dialogue device according to the second embodiment.
FIG. 18 is a flowchart showing an example of the voice dialogue processing procedure performed by the voice dialogue device according to the second embodiment.

<Example of voice dialogue>
FIG. 1 is an explanatory diagram showing an example of a voice dialogue between a user and a communication robot (hereinafter simply "robot") according to the first embodiment. FIG. 1 shows ask-back patterns P1 to P5 used by the robot 102 when a user 101 says to the robot 102 the speech 110, "What time is tomorrow's Mt. Fuji sunrise?" (「明日の富士山の日の出の時刻は？」). At least one of the loudness, speaking rate, and pronunciation accuracy of the speech 110 and the surrounding environment is assumed to differ among cases (A) to (E) below.

(A) shows an example response using ask-back pattern P1. Specifically, for example, the robot 102 recognizes the speech 110 as "tomorrow's Mt. Fuji sunrise's 〇×△□" (where "〇×△□" is a portion that could not be recognized because its reliability was low). As ask-back pattern P1, it generates a response sentence (hereinafter "ask-back sentence") that asks back about the portion whose reliability is at or above the threshold ("tomorrow's Mt. Fuji sunrise's"): "I could not hear you well. 'Tomorrow's Mt. Fuji sunrise's' what?" It then speaks this sentence to the user 101, the source of the utterance.

(B) shows an example response using ask-back pattern P2. Specifically, for example, the robot 102 recognizes the speech 110 as "〇×△ Mt. Fuji 〇×△□ time is?" (where "〇×△" and "〇×△□" are portions that could not be recognized because their reliability was below the threshold). As ask-back pattern P2, it generates an ask-back sentence that asks back about the words "Mt. Fuji" and "time", whose reliability is at or above the threshold: "'Mt. Fuji'? 'Time'? Please ask your question again." It then speaks this sentence to the user 101.

(C) shows an example response using ask-back pattern P3. Specifically, for example, the robot 102 recognizes the speech 110 as "What time is tomorrow's 〇×△□ sunrise?" (where "〇×△□" is a portion that could not be recognized because its reliability was below the threshold). As ask-back pattern P3, it generates an inferred result ("Mt. Fuji") for the low-reliability portion (〇×△□) and an ask-back sentence that confirms the speech 110 uttered by the user 101 with the inferred result filled in: "Did you say 'What time is tomorrow's Mt. Fuji sunrise'?" It then speaks this sentence to the user 101.

(D) shows an example response using ask-back pattern P4. Specifically, for example, the robot 102 recognizes the speech 110 with high accuracy and, as ask-back pattern P4, generates an ask-back sentence that confirms the whole recognized question: "Did you say 'What time is tomorrow's Mt. Fuji sunrise'?" It then speaks this sentence to the user 101.

(E) shows an example response using ask-back pattern P5. Specifically, for example, the robot 102 cannot recognize the speech 110 and, as ask-back pattern P5, generates an ask-back sentence that requests the question again: "I could not hear you well, so please say that again." It then speaks this sentence to the user 101.

For the speech recognition result of the speech 110, the robot 102 calculates a language likelihood (the perplexity of a language model) in addition to the speech recognition reliability. Based on the reliability and the language likelihood, it selects either a response under normal dialogue control or the generation of an ask-back sentence using one of the ask-back patterns P1 to P5.

In this way, by telling the user 101 that part or all of the speech 110 could not be recognized, the robot 102 prompts the user 101 to repeat what is needed and can move the dialogue with the user 101 forward smoothly. The user 101, in turn, can tell which parts of the speech 110 got through and which did not.

<Voice dialogue system>
FIG. 2 is an explanatory diagram showing an example of the system configuration of the voice dialogue system according to the first embodiment. The voice dialogue system 200 is, for example, a client-server system, and includes a server 201 and information processing devices 104 such as the robot 102 and a smartphone 103. The server 201 and the information processing devices 104 can communicate with each other via a network 202 such as the Internet, a LAN (Local Area Network), or a WAN (Wide Area Network).

In the client-server configuration, the voice dialogue program is installed on the server 201. The server 201 therefore acts as the voice dialogue device and performs speech recognition, calculation of the speech recognition reliability, calculation of the language likelihood, and response sentence generation. In this case, the information processing device 104 serves as the dialogue interface: it takes in the speech 110, converts it into data, sends the resulting voice data to the server 201, receives the response sentence from the server 201, and speaks the response sentence.

In the stand-alone configuration, by contrast, the voice dialogue program is installed on the information processing device 104 and the server 201 is not needed. The information processing device 104 therefore acts as the voice dialogue device and performs input of the speech 110, data conversion of the input speech 110, speech recognition, calculation of the speech recognition reliability, calculation of the language likelihood, response sentence generation, and speaking of the response sentence.

<Example hardware configuration of the voice dialogue device>
FIG. 3 is a block diagram showing an example of the hardware configuration of the voice dialogue device. The voice dialogue device 300 has a processor 301, a storage device 302, an input device 303, an output device 304, and a communication interface (communication IF) 305. The processor 301, the storage device 302, the input device 303, the output device 304, and the communication IF 305 are connected by a bus 306. The processor 301 controls the voice dialogue device 300. The storage device 302 serves as the work area of the processor 301. The storage device 302 is also a non-transitory or transitory recording medium that stores various programs and data. Examples of the storage device 302 include a ROM (Read Only Memory), a RAM (Random Access Memory), an HDD (Hard Disk Drive), and a flash memory. The input device 303 inputs data. Examples of the input device 303 include a keyboard, a mouse, a touch panel, a numeric keypad, a scanner, a microphone, and a biometric sensor. The output device 304 outputs data. Examples of the output device 304 include a display, a printer, and a speaker. The communication IF 305 connects to the network 202 and sends and receives data.

<Example functional configuration of the voice dialogue device 300>
FIG. 4 is a block diagram showing an example of the functional configuration of the voice dialogue device 300 according to the first embodiment. The voice dialogue device 300 is a computer on which the voice dialogue program is installed. The voice dialogue device 300 can access a speech recognition model 411, a standard language model 412, a domain-specific language model 413, a question sentence language model 414, a dialogue context language model 415, and a dialogue knowledge DB (database) 416. These are stored in the storage device 302 of the voice dialogue device 300 shown in FIG. 3 or in the storage device 302 of another computer that can communicate with the voice dialogue device 300.

The speech recognition model 411 is a statistical model of the correspondence between acoustic features and phonemes, the correspondence between phoneme strings and words, and word strings. One example of the speech recognition model 411 is a weighted finite-state transducer (WFST) that combines an acoustic model and a language model into a single object. One example of the acoustic model is a hybrid acoustic model (DNN-HMM) that combines a deep neural network (DNN) with a hidden Markov model (HMM). One example of the language model is an N-gram model.

The standard language model 412 is a statistical model of word sequences obtained from texts in a wide range of fields, including dialogue. The standard language model 412 may be, for example, a word N-gram model or a recurrent neural network (RNN). The standard language model 412 is used to evaluate how plausible a word string is as text in a particular language (Japanese in this example).

The domain-specific language model 413 is a statistical model of word sequences obtained from text about the operating environment of the robot 102 (for example, if the robot is used for tourist information, text about guiding visitors around that tourist spot), from collections of anticipated questions, and so on. The domain-specific language model 413 may be, for example, a word N-gram model or an RNN. The domain-specific language model 413 is used to evaluate how plausible a word string is as something spoken in that operating environment. The standard language model 412 may be a language model that aggregates domain-specific language models 413 for multiple fields, or a language model for general spoken language that excludes the domain-specific language model 413.

The question sentence language model 414 is a statistical model of word sequences at the part-of-speech level, obtained from collections of anticipated questions and from questions actually received from users 101 so far. The question sentence language model 414 may be, for example, a part-of-speech N-gram model or an RNN. The question sentence language model 414 is used to determine, from the sequence of words, whether an utterance by the user 101 is a question.

The dialogue context language model 415 is a statistical model trained, on texts from all fields including dialogue, on the statistics of any given word and the words that appear around it and on the statistics of phrase sequences. One example of the dialogue context language model 415 is Bidirectional Encoder Representations from Transformers (BERT).

The dialogue knowledge DB 416 is a database that stores anticipated question sentences and examples of past question sentences in association with their answers.

The voice dialogue device 300 also has a speech recognition unit 401, a language likelihood acquisition unit 402, a reliability acquisition unit 403, an ask-back determination unit 404, a dialogue control unit 405, an ask-back sentence generation unit 406, a dialogue history management unit 407, and a speech synthesis unit 408. Specifically, these are realized by, for example, having the processor 301 execute the voice dialogue program stored in the storage device 302 shown in FIG. 3.

The speech recognition unit 401 receives the waveform data of the speech 110 and, using the speech recognition model 411, converts the waveform data into a speech feature X. Specifically, for example, the speech recognition unit 401 cuts out a section of the waveform data about 25 msec long and converts it into a speech feature vector x. The speech feature vector x uses, for example, 13-dimensional mel-frequency cepstral coefficients (MFCC).

The speech recognition unit 401 converts the entire waveform data while shifting the section by about 10 msec at a time, and generates the speech feature X = [x_1, ..., x_t, ..., x_T]. Here, x_t is the speech feature vector obtained from the t-th section of the waveform data, and T is the number of sections. The speech recognition unit 401 then uses formula (1) to find the word string Wt that is most probable given the speech feature X. W denotes a word string.
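The sectioning of the waveform can be sketched as follows. This is a minimal illustration, assuming a 16 kHz sampling rate (the text does not state one) and hypothetical function and parameter names. It only computes where each 25 msec section starts when the section is shifted by 10 msec, and leaves the MFCC conversion itself out.

```python
def frame_starts(num_samples, sample_rate=16000, win_ms=25, hop_ms=10):
    """Return the start sample index of every full analysis section.

    sample_rate is an assumed value; win_ms and hop_ms follow the
    approximate 25 msec length and 10 msec shift given in the text.
    """
    win = sample_rate * win_ms // 1000   # samples per section (400 at 16 kHz)
    hop = sample_rate * hop_ms // 1000   # samples per shift (160 at 16 kHz)
    if num_samples < win:
        return []
    return list(range(0, num_samples - win + 1, hop))


# One second of audio at the assumed 16 kHz rate.
starts = frame_starts(16000)
T = len(starts)  # number of sections, i.e. the length of X = [x_1, ..., x_T]
print(T, starts[:3])
```

Because the shift is shorter than the section length, consecutive sections overlap, and each x_t would be computed from the samples `starts[t-1]` to `starts[t-1] + win - 1`.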

The speech recognition unit 401 outputs the word string Wt to the language likelihood acquisition unit 402. The speech recognition unit 401 also generates a word lattice and outputs it to the reliability acquisition unit 403. The word string Wt and the word lattice together form the speech recognition result.

FIG. 5 is an explanatory diagram showing an example of a speech recognition result. The speech recognition result 500 includes a word string Wt and a word lattice WL. The word string Wt (and likewise the word string W) consists of the spelling, reading, and part of speech of each of one or more words. If the word string Wt is "今日" (today), "は" (topic particle), and "晴れ" (sunny), it consists of the spelling "今日", reading "キョー", and part of speech "noun" for the word "今日"; the spelling "は", reading "ワ", and part of speech "particle" for the word "は"; and the spelling "晴れ", reading "ハレ", and part of speech "noun" for the word "晴れ".

The speech recognition model 411 and each language model (except the dialogue context language model 415) use the same specifications for word segmentation and for the part-of-speech system. The BERT used for the dialogue context language model 415 determines word segmentation automatically when the dialogue context language model 415 is trained.

The word lattice WL is network-structured data of the hypothesis candidates for speech recognition. Each word is given a value that is the sum of its acoustic likelihood and its language likelihood. For example, the summed value for word A ("今日") is 0.7, the summed value for word B ("は") is 0.6, and the summed value for word C ("晴れ") is 0.7. The word string Wt is the sequence of words along the path that maximizes the total of these values.
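Picking the highest-scoring path can be sketched as follows. This is a minimal illustration, assuming the lattice is given as a list of edges `(source, destination, word, score)` whose node numbers increase along every edge. The data layout, the function name, and the competing hypotheses ("京", "が", "腫れ") with their scores are hypothetical; only the scores 0.7, 0.6, and 0.7 come from the example above.

```python
def best_path(edges, start, goal):
    """Return (total score, words) of the highest-scoring path from start to goal.

    Assumes source < destination for every edge, so processing edges in
    sorted order visits each node only after all paths into it are known.
    """
    best = {start: (0.0, [])}
    for src, dst, word, score in sorted(edges):
        if src not in best:
            continue  # node not reachable from start
        total = best[src][0] + score
        if dst not in best or total > best[dst][0]:
            best[dst] = (total, best[src][1] + [word])
    return best[goal]


# Lattice with two competing words per position (alternatives are hypothetical).
lattice = [
    (0, 1, "今日", 0.7), (0, 1, "京", 0.4),
    (1, 2, "は", 0.6), (1, 2, "が", 0.3),
    (2, 3, "晴れ", 0.7), (2, 3, "腫れ", 0.5),
]
total, words = best_path(lattice, start=0, goal=3)
print(words, round(total, 2))
```

The unselected edges are not discarded: they remain in the lattice and are what allows a per-word reliability to be derived later.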

The speech recognition unit 401 may instead transfer the waveform data of the user 101's speech to another computer that can communicate with the voice dialogue device 300 and obtain the speech recognition result 500 generated by that computer.

Returning to FIG. 4, the language likelihood acquisition unit 402 applies multiple language models to the word string Wt input from the speech recognition unit 401 and calculates the language likelihood (test-set perplexity) under each model. The language models are the standard language model 412, the domain-specific language model 413, the question sentence language model 414, and the dialogue context language model 415. Of these, the standard language model 412 and the question sentence language model 414 are required by the language likelihood acquisition unit 402.

The occurrence probability of a word string W1n is given, for example, by formula (2) when an N-gram model (word N-gram or part-of-speech N-gram) is used, by formula (3) when an RNN is used, and by formula (4) when BERT is used. The word string W1n is the sequence of words from word W1 to word Wn (n is the total number of words in W1n); here, W1n is the word string Wt. The language likelihood acquisition unit 402 outputs the language likelihood calculated for each language model to the ask-back determination unit 404.
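The relationship between an N-gram occurrence probability and perplexity can be sketched as follows. This is a minimal illustration, not a reproduction of formulas (2) to (4): it assumes a bigram model (N = 2) without smoothing, a hypothetical sentence-start symbol `<s>`, and made-up probabilities.

```python
import math


def sequence_log_prob(words, bigram_prob, bos="<s>"):
    """Log probability of a word sequence under a bigram model (chain rule).

    bigram_prob maps (previous word, word) to P(word | previous word).
    No smoothing: an unseen bigram raises KeyError.
    """
    logp = 0.0
    prev = bos
    for w in words:
        logp += math.log(bigram_prob[(prev, w)])
        prev = w
    return logp


def perplexity(words, bigram_prob):
    """Perplexity: exp of the negative average log probability per word."""
    return math.exp(-sequence_log_prob(words, bigram_prob) / len(words))


# Hypothetical bigram probabilities for the example word string.
probs = {("<s>", "今日"): 0.5, ("今日", "は"): 0.5, ("は", "晴れ"): 0.5}
print(perplexity(["今日", "は", "晴れ"], probs))
```

A lower perplexity means the model finds the word string more plausible. The same functions apply unchanged to a part-of-speech N-gram if the input is a part-of-speech string instead of a word string.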

When the language likelihood acquisition unit 402 calculates the language likelihood with formula (2) using a part-of-speech N-gram question sentence language model 414, it converts the word string Wt into a part-of-speech string. Because each word in the word string Wt consists of a spelling, a reading, and a part of speech, the language likelihood acquisition unit 402 removes the spelling and reading from each word in Wt, leaving a word string made up of parts of speech only, that is, a part-of-speech string.

FIG. 6 is an explanatory diagram showing an example of conversion from a word string to a part-of-speech string. The word string Wt in this example is converted into the part-of-speech string noun → particle → noun. The language likelihood acquisition unit 402 treats the part-of-speech string as the word string W1n and calculates the language likelihood by applying formula (2). To avoid losing more information than necessary in the conversion, the language likelihood acquisition unit 402 may attach the conjugated form to parts of speech that conjugate, and the base form (the word itself) to pronouns and particles.

FIGS. 7 and 8 are explanatory diagrams showing an example of the part-of-speech classification definition 700 for a part-of-speech string. The definition follows the short unit word definition (the part-of-speech classification of the dictionary for morphological analysis) specified by the National Institute for Japanese Language and Linguistics. The part-of-speech classification definition 700 is, for example, information stored in the storage device 302. The part-of-speech classification definition 700 has a major category 701, a middle category 702, a minor category 703, and supplementary information 704. The major category 701 specifies the parts of speech of the words W1 to Wn. The middle category 702 specifies items that subdivide the major category 701. The minor category 703 specifies items that subdivide the middle category 702. The supplementary information 704 specifies information to be added to the major category 701 through the minor category 703. The part of speech of each word W1 to Wn in the word string Wt is assumed to include the major category 701 through the minor category 703, together with information such as the conjugated form for parts of speech that conjugate and the English spelling for katakana words derived from English.

言語尤度取得部４０２は、単語列Ｗｔから品詞列への変換に際し、品詞分類定義７００を参照して、各品詞に対し品詞分類定義７００にある情報のみを残し、補足情報７０４に基本形の記載のある品詞については基本形（単語）を付与する。これにより、品詞列における情報の欠落が抑制される。 When converting the word string Wt into a parts-of-speech string, the language likelihood acquisition unit 402 refers to the parts-of-speech classification definition 700, and for each part of speech, only the information in the parts-of-speech classification definition 700 is retained, and the base form (word) is assigned to parts of speech whose base form is described in the supplemental information 704. This prevents information loss in the parts-of-speech string.
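
The conversion from a word string to a part-of-speech string described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the language likelihood acquisition unit 402: the `Token` fields, the English part-of-speech labels, the `KEEP_BASE_FORM` set, and the `label[detail]` output format are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical token record; the field names are assumptions for illustration.
@dataclass
class Token:
    surface: str                       # word as recognized
    pos: str                           # part of speech (major category)
    base_form: str                     # dictionary form of the word
    conjugation: Optional[str] = None  # conjugated form, if the part of speech conjugates


# Assumed set of parts of speech whose base form is kept
# (the text names pronouns and particles).
KEEP_BASE_FORM = {"pronoun", "particle"}


def to_pos_string(tokens):
    """Convert a word string into a part-of-speech string, keeping selected details."""
    labels = []
    for token in tokens:
        label = token.pos
        if token.conjugation is not None:
            label += f"[{token.conjugation}]"   # keep the conjugated form
        if token.pos in KEEP_BASE_FORM:
            label += f"[{token.base_form}]"     # keep the base form (the word itself)
        labels.append(label)
    return labels


words = [
    Token("富士山", "noun", "富士山"),
    Token("の", "particle", "の"),
    Token("高さ", "noun", "高さ"),
]
print(to_pos_string(words))  # ['noun', 'particle[の]', 'noun']
```

Keeping the base form only for closed-class words such as particles limits the vocabulary of the part-of-speech string while preserving the words that carry most of the grammatical information.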

なお、言語尤度取得部４０２は、音声対話装置３００と通信可能な他のコンピュータに、音声認識部４０１から入力された単語列Ｗｔを転送し、当該他のコンピュータが単語列Ｗｔに基づいて算出した言語モデルごとの言語尤度を取得してもよい。 The language likelihood acquisition unit 402 may transfer the word string Wt input from the speech recognition unit 401 to another computer that can communicate with the speech dialogue device 300, and acquire the language likelihood for each language model calculated by the other computer based on the word string Wt.

図４に戻り、信頼度取得部４０３は、単語ラティスＷＬに基づいて、単語列Ｗｔおよび単語列Ｗｔを構成する各単語Ｗ１～Ｗｎの音声認識信頼度（以下、単に、「信頼度」）を算出する。具体的には、たとえば、信頼度とは、音声認識部４０１によって得られた単語列Ｗｔがどの程度信頼できるかを示す指標値である。ここでは、値が大きいほど信頼度が高いものとする。 Returning to FIG. 4, the reliability acquisition unit 403 calculates the speech recognition reliability (hereinafter simply "reliability") of the word string Wt and each of the words W1 to Wn that make up the word string Wt based on the word lattice WL. Specifically, for example, the reliability is an index value that indicates how reliable the word string Wt obtained by the speech recognition unit 401 is. Here, the larger the value, the higher the reliability.

図９は、信頼度取得部４０３による信頼度算出例を示す説明図である。信頼度取得部４０３は、まず、単語ラティスＷＬをネットワーク６００に変換する。なお、状態遷移が１つしかない単語（たとえば、単語Ｅ）については、信頼度取得部４０３は、単語の信頼度が１．０である空シンボルεを有する状態遷移を挿入して、単語Ｅの状態遷移とアライメントをとる。 Figure 9 is an explanatory diagram showing an example of reliability calculation by the reliability acquisition unit 403. The reliability acquisition unit 403 first converts the word lattice WL into a network 600. For a word with only one state transition (e.g., word E), the reliability acquisition unit 403 inserts a state transition with an empty symbol ε, whose word reliability is 1.0, to align it with the state transition of word E.

つぎに、信頼度取得部４０３は、ネットワーク６００における各単語Ａ～Ｆの信頼度を区間ごとに１．０で正規化し、単語コンフュージョンネットワーク６０１に変換する。また、信頼度取得部４０３は、単語コンフュージョンネットワーク６０１において、正規化された単語Ａ～Ｆの信頼度の調和平均を算出し、単語列Ｗｔの信頼度とする。信頼度取得部４０３は、正規化された単語Ａ～Ｆの信頼度と、単語列Ｗｔの信頼度と、を聞き返し判定部４０４に出力する。 Next, the reliability acquisition unit 403 normalizes the reliability of each of the words A to F in the network 600 to 1.0 within each section, and converts the network into a word confusion network 601. The reliability acquisition unit 403 also calculates the harmonic mean of the normalized reliabilities of words A to F in the word confusion network 601 and uses it as the reliability of the word string Wt. The reliability acquisition unit 403 outputs the normalized reliabilities of words A to F and the reliability of the word string Wt to the ask-back determination unit 404.
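
The normalization and harmonic mean described above can be sketched as follows. This is only an illustration under assumptions: "normalizing to 1.0 within each section" is read here as scaling the scores of competing words so that they sum to 1.0, the scores are made-up values, and taking the highest-scoring word of each section as the recognized word is a simplification introduced for the example.

```python
from statistics import harmonic_mean


def normalize_sections(sections):
    """Scale the word scores in each section so that they sum to 1.0."""
    normalized = []
    for section in sections:
        total = sum(section.values())
        normalized.append({word: score / total for word, score in section.items()})
    return normalized


def word_string_reliability(word_reliabilities):
    """Reliability of the whole word string: harmonic mean of its word reliabilities."""
    return harmonic_mean(word_reliabilities)


# Hypothetical raw scores for two sections of the network.
sections = [{"A": 0.6, "B": 0.2}, {"C": 0.3, "D": 0.3}]
confusion_network = normalize_sections(sections)

# Assumption: the recognized word in each section is the one with the highest score.
best_path = [max(section.values()) for section in confusion_network]
print(word_string_reliability(best_path))
```

The harmonic mean is dominated by small values, so a single poorly recognized word lowers the reliability of the whole word string more than an arithmetic mean would.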

なお、信頼度取得部４０３は、音声対話装置３００と通信可能な他のコンピュータに、音声認識部４０１から入力された単語ラティスＷＬを転送し、当該他のコンピュータが単語ラティスＷＬに基づいて算出した信頼度を取得してもよい。 The reliability acquisition unit 403 may transfer the word lattice WL input from the speech recognition unit 401 to another computer that can communicate with the speech dialogue device 300, and acquire the reliability calculated by the other computer based on the word lattice WL.

図４に戻り、聞き返し判定部４０４は、ユーザ１０１への聞き返しをすべきか否かを判定する。具体的には、たとえば、聞き返し判定部４０４は、言語尤度取得部４０２からの言語モデルごとの言語尤度と信頼度取得部４０３からの単語列Ｗｔの信頼度とを用いて、対話制御部４０５による対話制御と、聞き返し文生成部４０６による聞き返し処理と、のうち、いずれの処理を実行するかを判定する。 Returning to FIG. 4, the ask-back determination unit 404 determines whether or not to ask the user 101 back. Specifically, for example, the ask-back determination unit 404 uses the language likelihood for each language model from the language likelihood acquisition unit 402 and the reliability of the word string Wt from the reliability acquisition unit 403 to determine which process to execute: dialogue control by the dialogue control unit 405 or ask-back processing by the ask-back sentence generation unit 406.

なお、言語モデルごとの言語尤度、単語列Ｗｔの信頼度、および正規化された各単語Ｗ１～Ｗｎの信頼度の各々には、それぞれしきい値が設定される。聞き返し判定部４０４は、言語モデルごとの言語尤度および単語列Ｗｔの信頼度のいずれか１つでもしきい値未満であれば、聞き返し文生成処理を実行する。なお、正規化された各単語Ｗ１～Ｗｎの信頼度は、聞き返し文生成部４０６で用いられる。 Note that a threshold is set for each of the language likelihood of each language model, the reliability of the word string Wt, and the normalized reliability of each of the words W1 to Wn. If any one of the language likelihoods of the language models and the reliability of the word string Wt is less than its threshold, the ask-back determination unit 404 executes the ask-back sentence generation process. The normalized reliability of each of the words W1 to Wn is used by the ask-back sentence generation unit 406.
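
The threshold test performed by the ask-back determination unit 404 can be sketched as follows. The function name, the dictionary keys, and the numeric values are hypothetical; the text only states that each quantity has its own threshold and that falling below any one of them triggers an ask-back.

```python
def should_ask_back(language_likelihoods, likelihood_thresholds,
                    string_reliability, reliability_threshold):
    """Return True if any language likelihood or the word-string reliability is below its threshold."""
    if string_reliability < reliability_threshold:
        return True
    return any(
        likelihood < likelihood_thresholds[model]
        for model, likelihood in language_likelihoods.items()
    )


# Hypothetical thresholds and scores (log-likelihoods are assumed to be negative numbers).
thresholds = {"standard": -50.0, "question": -60.0}
print(should_ask_back({"standard": -40.0, "question": -55.0}, thresholds, 0.8, 0.5))  # False
print(should_ask_back({"standard": -40.0, "question": -70.0}, thresholds, 0.8, 0.5))  # True
```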

対話制御部４０５は、聞き返し判定部４０４によってユーザ１０１に聞き返す必要がないと判定された場合に、対話知識ＤＢ４１６を参照して、発話音声１１０の音声認識結果５００に対応する応答文を生成する。 When the ask-back determination unit 404 determines that there is no need to ask the user 101 to ask again, the dialogue control unit 405 refers to the dialogue knowledge DB 416 and generates a response sentence corresponding to the speech recognition result 500 of the spoken voice 110.

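聞き返し文生成部４０６は、聞き返し判定部４０４によってユーザ１０１に聞き返すべきと判定された場合に、ユーザ１０１の発話音声１１０に対する聞き返し文を生成する。具体的には、たとえば、聞き返し文生成部４０６は、聞き返し判定部４０４の判定結果（どの言語モデルの言語尤度がしきい値未満か）に応じて聞き返し文を生成する。たとえば、単語列Ｗｔについて一部の単語のみ信頼度がしきい値未満である場合、対話文脈言語モデル４１５を使用して、当該部分に当てはまる尤もらしい言葉を推定し、推定した言葉をユーザ１０１が発話したかどうかを聞き返す聞き返し文を応答文として生成する。 When the ask-back determination unit 404 determines that the user 101 should be asked back, the ask-back sentence generation unit 406 generates an ask-back sentence for the speech 110 of the user 101. Specifically, for example, the ask-back sentence generation unit 406 generates the ask-back sentence according to the determination result of the ask-back determination unit 404 (which language model's language likelihood is less than its threshold). For example, when the reliability of only some of the words in the word string Wt is less than the threshold, the ask-back sentence generation unit 406 uses the dialogue context language model 415 to estimate a plausible word that fits that part, and generates, as a response sentence, an ask-back sentence asking whether the user 101 uttered the estimated word.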

対話履歴管理部４０７は、ユーザ１０１の発話音声１１０の単語列Ｗｔと音声対話装置３００が生成した応答文とを蓄積する。音声合成部４０８は、対話制御部４０５または聞き返し文生成部４０６が生成した応答文を音声に変換して出力する。なお、音声対話装置３００がクライアントサーバシステムのサーバ２０１によって実現される場合、音声対話装置３００は、音声合成部４０８を有さず、対話制御部４０５または聞き返し文生成部４０６が生成した応答文を、クライアントとなるロボット１０２やスマートフォン１０３などの情報処理装置１０４に送信する。この場合、クライアントが音声合成部４０８を有し、サーバ２０１から受信した応答文を音声に変換して出力する。 The dialogue history management unit 407 accumulates the word string Wt of the speech 110 of the user 101 and the response sentence generated by the voice dialogue device 300. The voice synthesis unit 408 converts the response sentence generated by the dialogue control unit 405 or the reflective sentence generation unit 406 into voice and outputs it. When the voice dialogue device 300 is realized by the server 201 of a client-server system, the voice dialogue device 300 does not have the voice synthesis unit 408, and transmits the response sentence generated by the dialogue control unit 405 or the reflective sentence generation unit 406 to the information processing device 104 such as the robot 102 or smartphone 103 that serves as the client. In this case, the client has the voice synthesis unit 408, and converts the response sentence received from the server 201 into voice and outputs it.

＜音声対話処理手順例＞ <Example of voice dialogue processing procedure>
図１０は、実施例１にかかる音声対話装置３００による音声対話処理手順例を示すフローチャートである。音声対話装置３００は、図５に示したように、ユーザ１０１対話音声を入力して音声認識部４０１により音声認識処理を実行し、音声認識結果５００を出力する（ステップＳ１００１）。 Fig. 10 is a flowchart showing an example of a voice dialogue processing procedure by the voice dialogue device 300 according to the embodiment 1. As shown in Fig. 5, the voice dialogue device 300 inputs the dialogue voice of the user 101, executes the voice recognition process by the voice recognition unit 401, and outputs the voice recognition result 500 (step S1001).

つぎに、音声対話装置３００は、信頼度取得部４０３により、図９に示したように、単語列Ｗｔの信頼度と、単語列Ｗｔを構成する単語Ｗ１～Ｗｎの信頼度と、を取得する（ステップＳ１００２）。また、音声対話装置３００は、言語尤度取得部４０２により、言語モデルごとに言語尤度を取得する（ステップＳ１００３）。 Next, the speech dialogue device 300 acquires the reliability of the word string Wt and the reliability of the words W1 to Wn that make up the word string Wt, as shown in FIG. 9, by the reliability acquisition unit 403 (step S1002). In addition, the speech dialogue device 300 acquires the language likelihood for each language model by the language likelihood acquisition unit 402 (step S1003).

なお、ステップＳ１００３において、対話文脈言語モデル４１５を用いる場合、言語尤度取得部４０２は、ユーザ１０１の発話音声１１０よりも前の音声対話装置３００およびユーザ１０１の発話音声１１０を繋げた文（単語列）を入力として、言語尤度を取得する。 Note that when the dialogue context language model 415 is used in step S1003, the language likelihood acquisition unit 402 acquires the language likelihood by taking as input a sentence (word string) that concatenates the utterances of the voice dialogue device 300 and the user 101 preceding the speech 110 of the user 101 with the speech 110 itself.

たとえば、
音声対話装置３００：「こんにちは」
ユーザ１０１：「こんにちは、あなたの名前は」
音声対話装置３００：「僕の名前はロボットです」
ユーザ１０１：「へーそうなんだ、かわいいね」
音声対話装置３００：「ありがとうございます」
ユーザ１０１：「あなたは何ができるの？」（発話音声１１０）
という対話を例に挙げる。
Take the following dialogue as an example.
Voice dialogue device 300: "Hello."
User 101: "Hello, what's your name?"
Voice dialogue device 300: "My name is Robot."
User 101: "Oh really, that's cute."
Voice dialogue device 300: "Thank you very much."
User 101: "What can you do?" (speech 110)

この場合、ユーザ１０１の発話音声１１０である「あなたは何ができるの？」よりも前の音声対話装置３００の発話音声「こんにちは」、ユーザ１０１の発話音声「こんにちは、あなたの名前は」、音声対話装置３００の発話音声「僕の名前はロボットです」、ユーザ１０１の発話音声「へーそうなんだ、かわいいね」、音声対話装置３００の発話音声「ありがとうございます」について音声認識部４０１から得られた単語列を繋げて、「こんにちは。こんにちは、あなたの名前は。僕の名前はロボットです。へーそうなんだ、かわいいね。ありがとうございます。あなたは何ができるの。」という単語列とする。 In this case, the word strings obtained from the voice recognition unit 401 for the utterances preceding the speech 110 of the user 101, "What can you do?", namely the voice dialogue device 300's utterance "Hello", the user 101's utterance "Hello, what's your name", the voice dialogue device 300's utterance "My name is Robot", the user 101's utterance "Oh really, that's cute", and the voice dialogue device 300's utterance "Thank you very much", are concatenated together with the speech 110 to form the word string "Hello. Hello, what's your name. My name is Robot. Oh really, that's cute. Thank you very much. What can you do."

音声対話装置３００は、この繋げた単語列「こんにちは。こんにちは、あなたの名前は。僕の名前はロボットです。へーそうなんだ、かわいいね。ありがとうございます。あなたは何ができるの。」を対話文脈言語モデル４１５に入力して、言語尤度を算出する。このように、ユーザ１０１の発話音声１１０だけではなく、それ以前の対話も対話文脈言語モデル４１５に入力することにより、対話文脈言語モデル４１５から得られる言語尤度の高精度化を図ることができる。 The voice dialogue device 300 inputs this connected word sequence "Hello. Hello, what's your name? My name is Robot. Oh really, that's cute. Thank you. What can you do?" into the dialogue context language model 415 and calculates the language likelihood. In this way, by inputting not only the spoken voice 110 of the user 101 but also the previous dialogue into the dialogue context language model 415, it is possible to improve the accuracy of the language likelihood obtained from the dialogue context language model 415.
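
The construction of the input to the dialogue context language model 415 can be sketched as follows. This is a simplified illustration: it joins plain strings rather than word strings from the voice recognition unit 401, and the function name and the handling of sentence-final punctuation are assumptions.

```python
def build_context_input(history, current_utterance):
    """Join earlier utterances and the current one into a single sentence sequence."""
    utterances = history + [current_utterance]
    # Assumption: each utterance is terminated with a single full stop.
    return "".join(text.rstrip("。？?") + "。" for text in utterances)


history = [
    "こんにちは",
    "こんにちは、あなたの名前は",
    "僕の名前はロボットです",
    "へーそうなんだ、かわいいね",
    "ありがとうございます",
]
print(build_context_input(history, "あなたは何ができるの？"))
```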

なお、ステップＳ１００３において、対話文脈言語モデル４１５を用いる場合、言語尤度取得部４０２は、音声認識結果５００の単語列Ｗｔの各単語をつなげて文を生成し、対話文脈言語モデル４１５が有する形態素解析器の形態素解析で単語列Ｗｔ´にしてもよい。これにより、単語列Ｗｔと単語列Ｗｔ´とでは、単語の区切りが異なる場合がある。そして、言語尤度取得部４０２は、単語列Ｗｔ´で尤度算出を行う。 When using the dialogue context language model 415 in step S1003, the language likelihood acquisition unit 402 may generate a sentence by concatenating each word of the word string Wt in the speech recognition result 500, and may generate a word string Wt' by morphological analysis of a morphological analyzer included in the dialogue context language model 415. As a result, the word strings Wt and Wt' may have different word divisions. Then, the language likelihood acquisition unit 402 calculates the likelihood of the word string Wt'.

そして、音声対話装置３００は、聞き返し判定部４０４により、ユーザ１０１への聞き返しをすべきか否かを判定する（ステップＳ１００４）。具体的には、たとえば、音声対話装置３００は、言語モデルごとの言語尤度、および単語列Ｗｔの信頼度のいずれか１つでもしきい値未満であるか否かを判定する。言語モデルごとの言語尤度、および単語列Ｗｔの信頼度のいずれか１つでもしきい値未満である場合（ステップＳ１００４：Ｙｅｓ）、すなわち、ユーザ１０１への聞き返しをすべき場合、音声対話装置３００は、聞き返し文生成部４０６により聞き返し処理を実行する（ステップＳ１００５）。 Then, the voice dialogue device 300 uses the ask-back determination unit 404 to determine whether or not to ask the user 101 back (step S1004). Specifically, for example, the voice dialogue device 300 determines whether any one of the language likelihoods of the language models and the reliability of the word string Wt is less than its threshold. If any one of them is less than its threshold (step S1004: Yes), that is, if the user 101 should be asked back, the voice dialogue device 300 executes the ask-back process with the ask-back sentence generation unit 406 (step S1005).

音声対話装置３００は、聞き返し文生成処理（ステップＳ１００５）により、聞き返しパターンＰ１～Ｐ５のいずれかの応答文を生成、または、応答文の非生成を通知して、ステップＳ１００９に移行する。聞き返し文生成処理（ステップＳ１００５）の詳細は、図１１で後述する。 Through the ask-back sentence generation process (step S1005), the voice dialogue device 300 generates a response sentence of one of the ask-back patterns P1 to P5, or gives notification that no response sentence is generated, and then proceeds to step S1009. Details of the ask-back sentence generation process (step S1005) will be described later with reference to FIG. 11.

一方、言語モデルごとの言語尤度、および単語列Ｗｔの信頼度のいずれもしきい値未満でない場合（ステップＳ１００４：Ｎｏ）、すなわち、ユーザ１０１への聞き返しの必要がない場合、音声対話装置３００は、ユーザ１０１に対し聞き返しパターンＰ１、Ｐ３、またはＰ４で聞き返し中であるか否かを判定する（ステップＳ１００６）。 On the other hand, if neither the language likelihood of any language model nor the reliability of the word string Wt is less than its threshold (step S1004: No), that is, if there is no need to ask the user 101 back, the voice dialogue device 300 determines whether it is currently asking the user 101 back with ask-back pattern P1, P3, or P4 (step S1006).

聞き返しパターンＰ１、Ｐ３、またはＰ４で聞き返し中である場合（ステップＳ１００６：Ｙｅｓ）、音声対話装置３００は、聞き返し文生成部４０６により、言い直し発話解釈処理を実行して（ステップＳ１００７）、ステップＳ１００９に移行する。言い直し発話解釈処理（ステップＳ１００７）とは、ユーザ１０１が言い直した発話音声を解釈して、当該解釈に応じて応答する処理である。言い直し解釈処理（ステップＳ１００７）の詳細は、図１４で後述する。 If the voice dialogue device 300 is currently asking back with ask-back pattern P1, P3, or P4 (step S1006: Yes), it executes a restatement utterance interpretation process with the ask-back sentence generation unit 406 (step S1007) and proceeds to step S1009. The restatement utterance interpretation process (step S1007) is a process of interpreting the speech restated by the user 101 and responding according to that interpretation. Details of the restatement utterance interpretation process (step S1007) will be described later with reference to FIG. 14.

一方、聞き返しパターンＰ１、Ｐ３、またはＰ４で聞き返し中でない場合（ステップＳ１００６：Ｎｏ）、音声対話装置３００は、対話制御部４０５により、対話制御処理を実行し（ステップＳ１００８）、ステップＳ１００９に移行する。対話制御処理（ステップＳ１００８）は、ユーザ１０１発話音声に応答する応答文を生成する処理である。対話制御処理（ステップＳ１００８）の詳細については、図１５で後述する。 On the other hand, if the voice dialogue device 300 is not currently asking back with ask-back pattern P1, P3, or P4 (step S1006: No), it executes a dialogue control process with the dialogue control unit 405 (step S1008) and proceeds to step S1009. The dialogue control process (step S1008) is a process for generating a response sentence that responds to the speech of the user 101. Details of the dialogue control process (step S1008) will be described later with reference to FIG. 15.

ステップＳ１００９では、音声対話装置３００は、対話履歴管理部４０７により、ユーザ１０１発話音声の単語列Ｗｔと、聞き返し処理（ステップＳ１００５）、言い直し解釈処理（ステップＳ１００７）、および対話制御処理（ステップＳ１００８）によって生成された応答文とを、対話履歴として蓄積する（ステップＳ１００９）。 In step S1009, the dialogue history management unit 407 of the voice dialogue device 300 accumulates, as a dialogue history (step S1009), the word string Wt of the user's 101 speech and the response sentences generated by the ask-back process (step S1005), the rephrasing interpretation process (step S1007), and the dialogue control process (step S1008).

そして、音声対話装置３００は、音声合成部４０８により、対話制御部４０５または聞き返し文生成部４０６によって生成された応答文を、音声に変換して出力する（ステップＳ１０１０）。なお、音声対話装置３００がスタンドアロン型で実現される場合、音声対話装置３００は、図１０に示したステップＳ１００１～Ｓ１０１０の処理を実行する。 Then, the voice dialogue device 300 converts the response sentence generated by the dialogue control unit 405 or the reply sentence generation unit 406 into voice and outputs it by the voice synthesis unit 408 (step S1010). Note that if the voice dialogue device 300 is realized as a stand-alone type, the voice dialogue device 300 executes the processes of steps S1001 to S1010 shown in FIG. 10.

一方、音声対話装置３００がクライアントサーバシステムのサーバ２０１によって実現される場合、音声対話装置３００は、ステップＳ１００１～１００９まで実行し、クライアントとなるコミュニケーションロボット１０２やスマートフォンなどの通信装置に、応答文を送信する。そして、クライアントが、音声合成部４０８により、対話制御部４０５または聞き返し文生成部４０６によって生成された応答文を、音声に変換して出力する（ステップＳ１０１０）。 On the other hand, when the voice dialogue device 300 is realized by the server 201 of a client-server system, the voice dialogue device 300 executes steps S1001 to S1009 and transmits the response sentence to a communication device, such as the communication robot 102 or a smartphone, that serves as the client. The client then uses the voice synthesis unit 408 to convert the response sentence generated by the dialogue control unit 405 or the ask-back sentence generation unit 406 into voice and outputs it (step S1010).
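
The branching among steps S1005, S1007, and S1008 in FIG. 10 can be summarized in a short sketch. The function name and the use of step labels as return values are assumptions made for illustration; the branch conditions follow the description of steps S1004 and S1006.

```python
def select_step(needs_ask_back, pending_pattern):
    """Choose the processing step that follows the ask-back determination (step S1004).

    needs_ask_back: result of step S1004 (True means the user should be asked back).
    pending_pattern: ask-back pattern currently in progress ("P1" to "P5"), or None.
    """
    if needs_ask_back:
        return "S1005"  # ask-back sentence generation process
    if pending_pattern in {"P1", "P3", "P4"}:
        return "S1007"  # restatement utterance interpretation process
    return "S1008"      # dialogue control process


print(select_step(True, None))    # S1005
print(select_step(False, "P3"))   # S1007
print(select_step(False, None))   # S1008
```

Whichever step is selected, the flow then continues to step S1009 (accumulating the dialogue history) and step S1010 (voice synthesis).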

＜聞き返し文生成処理（ステップＳ１００５）＞ <Ask-back sentence generation process (step S1005)>
図１１は、図１０に示した聞き返し文生成処理（ステップＳ１００５）の詳細な処理手順例を示すフローチャートである。聞き返し文生成部４０６は、単語列Ｗｔの信頼度がしきい値未満か否かを判定する（ステップＳ１１０１）。 Fig. 11 is a flowchart showing a detailed example of the procedure of the ask-back sentence generation process (step S1005) shown in Fig. 10. The ask-back sentence generation unit 406 determines whether the reliability of the word string Wt is less than the threshold (step S1101).

単語列Ｗｔの信頼度がしきい値未満である場合（ステップＳ１１０１：Ｙｅｓ）、聞き返し文生成部４０６は、単語Ｗ１～Ｗｎの全信頼度がしきい値未満であるか否かを判定する（ステップＳ１１０２）。単語Ｗ１～Ｗｎの全信頼度がしきい値未満でない場合（ステップＳ１１０２：Ｎｏ）、聞き返し文生成部４０６は、単語Ｗ１～Ｗｎのうちどの部分の単語の信頼度がしきい値未満であるかを判定する（ステップＳ１１０３）。 If the reliability of the word string Wt is less than the threshold (step S1101: Yes), the ask-back sentence generation unit 406 determines whether the reliabilities of all the words W1 to Wn are less than the threshold (step S1102). If not all of the reliabilities of the words W1 to Wn are less than the threshold (step S1102: No), the ask-back sentence generation unit 406 determines which part of the words W1 to Wn has a reliability less than the threshold (step S1103).

部分Ａの場合（ステップＳ１１０３：部分Ａ）、ステップＳ１１０４に移行し、部分Ｂの場合（ステップＳ１１０３：部分Ｂ）、ステップＳ１１０５に移行し、部分Ｃの場合（ステップＳ１１０３：部分Ｃ）、ステップＳ１１０６に移行する。なお、一例として、単語列Ｗｔが部分Ａおよび部分Ｂの両方に該当する場合は、部分Ａを優先適用し、部分Ｂおよび部分Ｃの両方に該当する場合は、部分Ｂを優先適用する。
If it is part A (step S1103: part A), the process proceeds to step S1104, if it is part B (step S1103: part B), the process proceeds to step S1105, and if it is part C (step S1103: part C), the process proceeds to step S1106. Note that, as an example, if the word string Wt corresponds to both part A and part B, part A is applied preferentially, and if it corresponds to both part B and part C, part B is applied preferentially.

部分Ａとは、単語列Ｗｔのうち前半または後半に存在し、かつ、信頼度がしきい値未満の複数の単語である。単語Ｗ１～Ｗｎの単語数ｎが偶数であれば、前半とは単語Ｗ１～Ｗｎのうち単語Ｗ１～Ｗ（ｎ／２）であり、後半とは単語Ｗ１～Ｗｎのうち単語Ｗ（ｎ／２＋１）～Ｗｎである。単語Ｗ１～Ｗｎの単語数ｎが奇数であれば、前半とは単語Ｗ１～Ｗｎのうち単語Ｗ１～Ｗ（（ｎ＋１）／２）であり、後半とは単語Ｗ１～Ｗｎのうち単語Ｗ（（ｎ＋１）／２）～Ｗｎである。 Part A refers to multiple words that exist in the first or second half of the word string Wt and have a reliability below a threshold. If the number of words n in the words W1 to Wn is an even number, the first half refers to words W1 to W(n/2) of the words W1 to Wn, and the second half refers to words W(n/2+1) to Wn of the words W1 to Wn. If the number of words n in the words W1 to Wn is an odd number, the first half refers to words W1 to W((n+1)/2) of the words W1 to Wn, and the second half refers to words W((n+1)/2) to Wn of the words W1 to Wn.

ただし、当該複数の単語は、自立語を少なくとも一つ含む。自立語とは、付属語以外の品詞（動詞、形容詞、形容動詞、名詞、副詞、連体詞、接続詞、感動詞）の単語である。付属語とは、品詞が助動詞または助詞である単語である。部分Ａの場合（ステップＳ１１０３：部分Ａ）、ユーザ１０１の発話音声の前半または後半に、音声対話装置３００が音声認識しにくかった単語群が存在する。すなわち、部分Ａは、音声対話装置３００が、ユーザ１０１の発話音声のうち半分を聞き取れなかった場合に相当する。 However, the multiple words include at least one independent word. An independent word is a word whose part of speech is not that of a dependent word (that is, a verb, adjective, adjectival verb, noun, adverb, adnominal, conjunction, or interjection). A dependent word is a word whose part of speech is an auxiliary verb or a particle. In the case of part A (step S1103: part A), a group of words that the voice dialogue device 300 had difficulty recognizing exists in the first half or the second half of the speech of the user 101. In other words, part A corresponds to a case where the voice dialogue device 300 could not catch half of the speech of the user 101.

部分Ａの場合（ステップＳ１１０３：部分Ａ）、聞き返し文生成部４０６は、単語列Ｗｔのうち部分Ａを除いた残余の部分、すなわち、信頼度がしきい値以上の単語を聞き返す聞き返し文を聞き返しパターンＰ１（図１を参照）として生成して（ステップＳ１１０４）、ステップＳ１００９に移行する。 In the case of part A (step S1103: part A), the ask-back sentence generation unit 406 generates, as ask-back pattern P1 (see FIG. 1), an ask-back sentence that repeats back the remainder of the word string Wt excluding part A, that is, the words whose reliability is equal to or greater than the threshold (step S1104), and proceeds to step S1009.

また、部分Ｂとは、単語列Ｗｔのうち離散的に存在し、かつ、信頼度がしきい値未満の複数の単語（自立語を少なくとも一つ含む）である。ただし、部分Ａとの重複を回避するため、単語列Ｗｔの前半と後半のそれぞれに、信頼度がしきい値未満の単語が少なくとも１つ存在する必要がある。すなわち、部分Ｂは、音声対話装置３００が、ユーザ１０１の発話音声１１０のうち断片的に聞き取れない部分があった場合に相当する。 Part B refers to multiple words (including at least one independent word) that exist discretely in the word string Wt and have a reliability below the threshold. However, in order to avoid overlap with part A, there must be at least one word whose reliability is below the threshold in each of the first and second halves of the word string Wt. In other words, part B corresponds to a case where the voice dialogue device 300 cannot hear fragments of the spoken voice 110 of the user 101.

部分Ｂの場合（ステップＳ１１０３：部分Ｂ）、単語列Ｗｔのうち部分Ｂに該当しない信頼度が閾値以上の単語を聞き返す聞き返し文を聞き返しパターンＰ２（図１を参照）として生成して（ステップＳ１１０５）、ステップＳ１００９に移行する。 In the case of part B (step S1103: part B), the ask-back sentence generation unit 406 generates, as ask-back pattern P2 (see FIG. 1), an ask-back sentence that repeats back the words in the word string Wt that do not fall under part B and whose reliability is equal to or greater than the threshold (step S1105), and proceeds to step S1009.

また、部分Ｃとは、単語列Ｗｔのうち信頼度がしきい値未満の１個の自立語、または、単語列Ｗｔのうち信頼度がしきい値未満の連続する２個の単語（自立語を少なくとも一つ含む）である。ただし、部分Ａと重複した場合は部分Ａが優先される（部分Ａよりも部分Ｃを優先適用してもよい。）。すなわち、部分Ｃは、音声対話装置３００が、ユーザ１０１の発話音声１１０のうち一部分が聞き取れなかった場合に相当する。 Furthermore, part C is one independent word in the word string Wt whose reliability is less than the threshold value, or two consecutive words (including at least one independent word) in the word string Wt whose reliability is less than the threshold value. However, if there is an overlap with part A, part A takes precedence (part C may be applied with priority over part A). In other words, part C corresponds to a case where the voice dialogue device 300 cannot hear a part of the spoken voice 110 of the user 101.
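
The distinction among parts A, B, and C can be sketched as follows. This is one possible reading of the definitions above, not the patented logic: "discretely" is interpreted as "not forming a single consecutive run", cases that the text does not cover return `None`, and the `Word` record and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    surface: str
    reliability: float
    independent: bool  # True for independent words, False for particles and auxiliary verbs


def split_halves(n):
    """Return 0-based index ranges for the first and second half of n words."""
    if n % 2 == 0:
        return range(0, n // 2), range(n // 2, n)
    mid = (n + 1) // 2  # for odd n the middle word belongs to both halves
    return range(0, mid), range(mid - 1, n)


def classify_part(words, threshold):
    """Return 'A', 'B', 'C', or None when the low-reliability words fit no definition."""
    low = [i for i, w in enumerate(words) if w.reliability < threshold]
    if not low or not any(words[i].independent for i in low):
        return None
    first, second = split_halves(len(words))
    contiguous = low[-1] - low[0] == len(low) - 1
    # Part A (highest priority): several low-reliability words confined to one half.
    if len(low) >= 2 and (all(i in first for i in low) or all(i in second for i in low)):
        return "A"
    # Part B: scattered low-reliability words with at least one in each half.
    if (len(low) >= 2 and not contiguous
            and any(i in first for i in low) and any(i in second for i in low)):
        return "B"
    # Part C: a single low-reliability independent word, or two consecutive words.
    if len(low) == 1 or (len(low) == 2 and contiguous):
        return "C"
    return None


scores = [0.2, 0.9, 0.9, 0.9, 0.9, 0.3]  # hypothetical reliabilities
words = [Word(f"w{i}", r, True) for i, r in enumerate(scores)]
print(classify_part(words, threshold=0.5))  # B
```

Checking part A before part B, and part B before part C, mirrors the priority order given in the text.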

部分Ｃの場合（ステップＳ１１０３：部分Ｃ）、聞き返し文生成部４０６は、マスク単語推定処理を実行する（ステップＳ１１０６）。マスク単語推定処理（ステップＳ１１０６）は、マスク単語を推定する処理である。マスク単語とは、部分Ｃに該当する単語である。マスク単語推定処理（ステップＳ１１０６）の詳細については、図１２で後述する。 In the case of part C (step S1103: part C), the reflection sentence generation unit 406 executes a mask word estimation process (step S1106). The mask word estimation process (step S1106) is a process of estimating a mask word. A mask word is a word that corresponds to part C. Details of the mask word estimation process (step S1106) will be described later with reference to FIG. 12.

マスク単語推定処理（ステップＳ１１０６）の実行後、聞き返し文生成部４０６は、マスク単語の推定が成功したか否かを判定する（ステップＳ１１０７）。マスク単語の推定が成功した場合（ステップＳ１１０７：Ｙｅｓ）、聞き返し文生成部４０６は、信頼度がしきい値未満である部分Ｃの推定結果と当該推定結果を含めたユーザ１０１の発話音声１１０を確認する聞き返し文とを聞き返しパターンＰ３（図１を参照）として生成する。 After executing the mask word estimation process (step S1106), the ask-back sentence generation unit 406 determines whether the estimation of the mask word was successful (step S1107). If the estimation was successful (step S1107: Yes), the ask-back sentence generation unit 406 generates, as ask-back pattern P3 (see FIG. 1), the estimation result for part C, whose reliability is less than the threshold, together with an ask-back sentence that confirms the speech 110 of the user 101 with the estimation result filled in.

図１の聞き返しパターンＰ３の場合、部分Ｃの推定結果が「富士山」であり、当該推定結果を含めたユーザ１０１の発話音声１１０を確認する聞き返し文が、『明日の富士山の日の出の時刻は』である。 In the case of the reflection pattern P3 in FIG. 1, the estimated result of part C is "Mt. Fuji," and the reflection sentence for confirming the speech 110 of the user 101 including the estimated result is "What time will sunrise be tomorrow at Mt. Fuji?"

一方、マスク単語の推定が成功しなかった場合（ステップＳ１１０７：Ｎｏ）、聞き返し文生成部４０６は、再質問を依頼、すなわち、発話音声１１０を再要求する聞き返し文を聞き返しパターンＰ５（図１を参照）として生成し（ステップＳ１１０９）、ステップＳ１００９に移行する。 On the other hand, if the estimation of the mask word was not successful (step S1107: No), the ask-back sentence generation unit 406 generates, as ask-back pattern P5 (see FIG. 1), an ask-back sentence that asks the user to repeat the question, that is, one that requests the speech 110 again (step S1109), and proceeds to step S1009.

また、ステップＳ１１０２において、全単語Ｗ１～Ｗｎの信頼度がしきい値未満である場合（ステップＳ１１０２：Ｙｅｓ）も、聞き返し文生成部４０６は、ステップＳ１１０９を実行し、ステップＳ１００９に移行する。 Also, in step S1102, if the reliability of all words W1 to Wn is less than the threshold value (step S1102: Yes), the reflection sentence generation unit 406 also executes step S1109 and proceeds to step S1009.

また、ステップＳ１１０１において、単語列Ｗｔの信頼度がしきい値以上である場合（ステップＳ１１０１：Ｎｏ）、聞き返し文生成部４０６は、標準言語モデル４１２の言語尤度がしきい値未満であるか否かを判定する（ステップＳ１１１０）。標準言語モデル４１２の言語尤度がしきい値未満である場合（ステップＳ１１１０：Ｙｅｓ）、ステップＳ１１０９を実行し、ステップＳ１００９に移行する。 In addition, if the reliability of the word string Wt is equal to or greater than the threshold in step S1101 (step S1101: No), the ask-back sentence generation unit 406 determines whether the language likelihood of the standard language model 412 is less than its threshold (step S1110). If the language likelihood of the standard language model 412 is less than the threshold (step S1110: Yes), the ask-back sentence generation unit 406 executes step S1109 and proceeds to step S1009.

一方、標準言語モデル４１２の言語尤度がしきい値以上である場合（ステップＳ１１１０：Ｎｏ）、聞き返し文生成部４０６は、質問文言語モデル４１４の言語尤度がしきい値未満であるか否かを判定する（ステップＳ１１１１）。 On the other hand, if the language likelihood of the standard language model 412 is equal to or greater than the threshold (step S1110: No), the reflection sentence generation unit 406 determines whether the language likelihood of the question sentence language model 414 is less than the threshold (step S1111).

質問文言語モデル４１４の言語尤度がしきい値以上である場合（ステップＳ１１１１：Ｎｏ）、分野別言語モデル４１３または対話文脈言語モデル４１５の言語尤度がしきい値未満となる。したがって、聞き返し文生成部４０６は、質問（ユーザ１０１の発話音声１１０）全体を復唱して確認する聞き返し文を聞き返しパターンＰ４（図１を参照）として生成し（ステップＳ１１１２）、ステップＳ１００９に移行する。また、分野別言語モデル４１３および対話文脈言語モデル４１５のいずれも用いられていない場合も、質問文言語モデル４１４の言語尤度がしきい値以上である場合（ステップＳ１１１１：Ｎｏ）、聞き返し文生成部４０６は、ステップＳ１１１２を実行する。 If the language likelihood of the question language model 414 is equal to or greater than the threshold (step S1111: No), the language likelihood of the domain-specific language model 413 or the dialogue context language model 415 is less than the threshold. Therefore, the reflection sentence generation unit 406 generates a reflection sentence that repeats and confirms the entire question (the speech 110 of the user 101) as a reflection pattern P4 (see FIG. 1) (step S1112), and proceeds to step S1009. Also, even if neither the domain-specific language model 413 nor the dialogue context language model 415 is used, if the language likelihood of the question language model 414 is equal to or greater than the threshold (step S1111: No), the reflection sentence generation unit 406 executes step S1112.

一方、質問文言語モデル４１４の言語尤度がしきい値未満である場合（ステップＳ１１１１：Ｙｅｓ）、聞き返し文生成部４０６は、聞き返し文を生成せず、聞き返し文の非生成を通知して（ステップＳ１１１３）、ステップＳ１００９に移行する。このように、聞き返し文生成部４０６は、各単語の信頼度や各言語モデルの言語尤度に応じた聞き返しパターンＰ１～Ｐ５の聞き返し文を生成することができる。 On the other hand, if the language likelihood of the question sentence language model 414 is less than its threshold (step S1111: Yes), the ask-back sentence generation unit 406 does not generate an ask-back sentence, gives notification that no ask-back sentence is generated (step S1113), and proceeds to step S1009. In this way, the ask-back sentence generation unit 406 can generate ask-back sentences of ask-back patterns P1 to P5 according to the reliability of each word and the language likelihood of each language model.
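
The decision flow of FIG. 11 (steps S1101 to S1113) can be condensed into the following sketch. The function signature is hypothetical: each argument is a boolean or label standing in for a test that the text describes, and `None` stands for "no ask-back sentence is generated".

```python
def select_ask_back_pattern(string_low, all_words_low, part,
                            mask_estimated, standard_low, question_low):
    """Return the ask-back pattern ('P1' to 'P5') or None if no ask-back sentence is generated.

    string_low:     reliability of the word string is below its threshold (S1101)
    all_words_low:  every word reliability is below the threshold (S1102)
    part:           'A', 'B', or 'C' as determined in S1103
    mask_estimated: mask word estimation succeeded (S1107)
    standard_low:   standard language model likelihood is below its threshold (S1110)
    question_low:   question sentence language model likelihood is below its threshold (S1111)
    """
    if string_low:
        if all_words_low:
            return "P5"                           # S1109: request the utterance again
        if part == "A":
            return "P1"                           # S1104
        if part == "B":
            return "P2"                           # S1105
        if part == "C":
            return "P3" if mask_estimated else "P5"
        return None                               # case not described in the text
    if standard_low:
        return "P5"                               # S1110: Yes -> S1109
    if question_low:
        return None                               # S1113: no ask-back sentence
    return "P4"                                   # S1112: repeat the whole question


print(select_ask_back_pattern(True, False, "C", True, False, False))    # P3
print(select_ask_back_pattern(False, False, None, False, False, False)) # P4
```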

＜マスク単語推定処理（ステップＳ１１０６）＞ <Mask word estimation process (step S1106)>
図１２は、図１１に示したマスク単語推定処理（ステップＳ１１０６）の詳細な処理手順例を示すフローチャートである。図１３は、図１１に示したマスク単語推定処理（ステップＳ１１０６）の一例を示す説明図である。マスク単語推定処理（ステップＳ１１０６）は、ステップＳ１１０３：部分Ｃの場合に実行される。図１３では、部分Ｃである「○×▽」を含む単語列Ｗｔを「○×▽の高さを教えて」という単語列１３００とする。 Fig. 12 is a flowchart showing a detailed example of the procedure of the mask word estimation process (step S1106) shown in Fig. 11. Fig. 13 is an explanatory diagram showing an example of the mask word estimation process (step S1106) shown in Fig. 11. The mask word estimation process (step S1106) is executed in the case of step S1103: part C. In Fig. 13, the word string Wt containing "○×▽", which is part C, is the word string 1300, "Tell me the height of ○×▽".

聞き返し文生成部４０６は、部分Ｃである「○×▽」の単語の読みを抽出する（ステップＳ１２０１）。図１３では、「○×▽」の読み１３０２として“フサン”が抽出されたとする。つぎに、聞き返し文生成部４０６は、部分Ｃの単語「○×▽」をマスク加工して、「＊＊＊」にする（ステップＳ１２０２）。部分Ｃの単語「○×▽」のマスク後の単語列１３００を単語列１３０１とする。 The reflection sentence generation unit 406 extracts the reading of the word "○×▽" in part C (step S1201). In FIG. 13, it is assumed that "Fusan" has been extracted as the reading 1302 of "○×▽". Next, the reflection sentence generation unit 406 masks the word "○×▽" in part C to make it "***" (step S1202). The word string 1300 after masking the word "○×▽" in part C is set as word string 1301.

つぎに、聞き返し文生成部４０６は、マスク後の単語列１３０１を対話文脈言語モデル４１５の一例であるＢＥＲＴ１３１０に入力し、マスク単語を予測する（ステップＳ１２０３）。ここでは、予測結果１３０３として、「ランドマークタワー」（予測マスク単語１３０３Ａ）、「東京タワー」（予測マスク単語１３０３Ｂ）、および「富士山」（予測マスク単語１３０３Ｃ）が予測される。 Next, the reflection sentence generator 406 inputs the masked word sequence 1301 into a BERT 1310, which is an example of the dialogue context language model 415, to predict mask words (step S1203). Here, the predicted results 1303 are "Landmark Tower" (predicted mask word 1303A), "Tokyo Tower" (predicted mask word 1303B), and "Mt. Fuji" (predicted mask word 1303C).

つぎに、聞き返し文生成部４０６は、予測マスク単語１３０３Ａ～１３０３Ｃの読みを抽出する（ステップＳ１２０４）。ここでは、抽出結果１３０４として、“ランドマークタワー”（予測マスク単語の読み１３０４Ａ）、“トーキョータワー”（予測マスク単語の読み１３０４Ｂ）、および“フジサン”（予測マスク単語の読み１３０４Ｃ）が予測される。 Next, the ask-back sentence generation unit 406 extracts the readings of the predicted mask words 1303A to 1303C (step S1204). Here, the extraction results 1304 obtained are "Landmark Tower" (predicted mask word reading 1304A), "Tokyo Tower" (predicted mask word reading 1304B), and "Fujisan" (predicted mask word reading 1304C).

そして、聞き返し文生成部４０６は、部分Ｃの単語「○×▽」の読み１３０２である“フサン”と、予測マスク単語の読み１３０４Ａ～１３０４Ｃの各々とを、たとえば、編集距離（レーベンシュタイン距離）で比較する（ステップＳ１２０５）。聞き返し文生成部４０６は、部分Ｃの単語「○×▽」の読み１３０２である“フサン”と所定距離以内の予測マスク単語の読み１３０４Ａ～１３０４Ｃがあるか否かを判定する（ステップＳ１２０６）。 Then, the ask-back sentence generation unit 406 compares "Fusan", the reading 1302 of the word "○×▽" in part C, with each of the predicted mask word readings 1304A to 1304C using, for example, the edit distance (Levenshtein distance) (step S1205). The ask-back sentence generation unit 406 determines whether any of the predicted mask word readings 1304A to 1304C is within a predetermined distance of "Fusan", the reading 1302 of the word "○×▽" in part C (step S1206).

所定距離以内の予測マスク単語の読み１３０４Ａ～１３０４Ｃがない場合（ステップＳ１２０６：Ｎｏ）、聞き返し文生成部４０６は、マスク単語推定処理（ステップＳ１１０６）を終了し、ステップＳ１１０７に移行する。この場合、ステップＳ１１０７では、マスク単語推定失敗（ステップＳ１１０７：Ｎｏ）となる。 If there are no predicted mask word readings 1304A-1304C within the specified distance (step S1206: No), the reflection sentence generator 406 ends the mask word estimation process (step S1106) and proceeds to step S1107. In this case, in step S1107, the mask word estimation fails (step S1107: No).

一方、所定距離以内の予測マスク単語の読み１３０４Ａ～１３０４Ｃがある場合（ステップＳ１２０６：Ｙｅｓ）、聞き返し文生成部４０６は、編集距離が最も短い予測マスク単語を最も読みが近い予測マスク単語として選択する（ステップＳ１２０７）。ここでは、例として“フジサン”（予測マスク単語の読み１３０４Ｃ）を読みとする「富士山」（予測マスク単語１３０３Ｃ）を選択する。そして、聞き返し文生成部４０６は、マスク単語推定処理（ステップＳ１１０６）を終了し、ステップＳ１１０７に移行する。この場合、ステップＳ１１０７では、マスク単語推定成功（ステップＳ１１０７：Ｙｅｓ）となる。 On the other hand, if any of the predicted mask word readings 1304A to 1304C is within the predetermined distance (step S1206: Yes), the ask-back sentence generation unit 406 selects the predicted mask word with the shortest edit distance as the predicted mask word with the closest reading (step S1207). Here, as an example, "Mt. Fuji" (predicted mask word 1303C), whose reading is "Fujisan" (predicted mask word reading 1304C), is selected. The ask-back sentence generation unit 406 then ends the mask word estimation process (step S1106) and proceeds to step S1107. In this case, the mask word estimation is judged successful in step S1107 (step S1107: Yes).
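
The reading comparison in steps S1205 to S1207 can be sketched as follows. The sketch assumes that the predicted mask words and their readings have already been obtained (the BERT prediction of step S1203 and the reading extraction of step S1204 are not reproduced), and the function names and the `max_distance` value are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def choose_mask_word(target_reading, candidates, max_distance):
    """Pick the candidate whose reading is closest to the target, or None if none is close enough.

    candidates maps each predicted mask word to its reading.
    """
    best_word, best_distance = None, None
    for word, reading in candidates.items():
        distance = levenshtein(target_reading, reading)
        if distance <= max_distance and (best_distance is None or distance < best_distance):
            best_word, best_distance = word, distance
    return best_word


candidates = {
    "ランドマークタワー": "ランドマークタワー",
    "東京タワー": "トーキョータワー",
    "富士山": "フジサン",
}
print(choose_mask_word("フサン", candidates, max_distance=2))  # 富士山
```

Returning `None` corresponds to the failure branch (step S1206: No), after which the flow requests the utterance again with ask-back pattern P5.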

＜言い直し発話解釈処理（ステップＳ１００７）＞ <Restatement utterance interpretation process (step S1007)>
図１４は、図１０に示した言い直し発話解釈処理（ステップＳ１００７）の詳細な処理手順例を示すフローチャートである。言い直し発話解釈処理（ステップＳ１００７）は、ステップＳ１００６：Ｙｅｓの場合、すなわち、聞き返しパターンＰ１、Ｐ３またはＰ４で聞き返し中の場合に実行される。 Fig. 14 is a flowchart showing a detailed example of the procedure of the restatement utterance interpretation process (step S1007) shown in Fig. 10. The restatement utterance interpretation process (step S1007) is executed in the case of step S1006: Yes, that is, when the voice dialogue device 300 is asking back with ask-back pattern P1, P3, or P4.

ステップＳ１００６：Ｙｅｓの場合、聞き返し中の聞き返しパターンがＰ１であれば（ステップＳ１４０１：Ｐ１）、聞き返し文生成部４０６は、ユーザ１０１の言い直し発話音声（今回入力された発話音声の単語列Ｗｔ）と前回の発話の認識結果とを結合して（ステップＳ１４０２）、対話制御処理（ステップＳ１４０３）に移行する。 Step S1006: If the answer is Yes, and the pattern of the current repetition is P1 (Step S1401: P1), the repetition sentence generation unit 406 combines the user 101's restated speech (the word string Wt of the currently input speech) with the recognition result of the previous utterance (Step S1402), and proceeds to the dialogue control process (Step S1403).

ここで、聞き返しパターンＰ１である前回の発話の認識結果を『明日の富士山の日の出の』であるとする。『明日の富士山の日の出の』は、各単語の信頼度がしきい値以上の単語群である。 Here, assume that the recognition result of the previous utterance, for which ask-back pattern P1 was used, is "Tomorrow's sunrise over Mt. Fuji." "Tomorrow's sunrise over Mt. Fuji" is a group of words in which the reliability of each word is equal to or greater than a threshold value.

『明日の富士山の日の出の』が前回のユーザ１０１の発話音声の単語列の前半部分である場合、聞き返し文生成部４０６は、ユーザ１０１の言い直し発話音声（今回入力された発話音声の単語列Ｗｔ（たとえば、「時刻教えて」））を、『明日の富士山の日の出の』の末尾に連結して、「明日の富士山の日の出の時刻教えて」を生成する。 If "tomorrow's sunrise on Mt. Fuji" is the first half of the word string of the previous utterance of the user 101, the reflection sentence generation unit 406 concatenates the restated speech of the user 101 (the word string Wt of the currently input utterance, for example, "tell me the time") to the end of "tomorrow's sunrise on Mt. Fuji" to generate "tell me the time of sunrise on Mt. Fuji tomorrow."

また、聞き返しパターンＰ１である前回の発話の認識結果を『日の出の時刻教えて』であるとする。『日の出の時刻教えて』は、各単語の信頼度がしきい値以上の単語群である。 Let us also assume that the recognition result of the previous utterance, which is the reflection pattern P1, is "Tell me the time of sunrise." "Tell me the time of sunrise" is a word group in which the reliability of each word is equal to or greater than a threshold value.

『日の出の時刻教えて』が前回のユーザ１０１の発話音声の単語列の前半部分である場合、聞き返し文生成部４０６は、ユーザ１０１の言い直し発話音声（今回入力された発話音声の単語列Ｗｔ（たとえば、「明日の富士山」））を、『日の出の時刻教えて』の先頭に連結して、「明日の富士山日の出の時刻教えて」を生成する。 If "Tell me the time of sunrise" is the first half of the word sequence of the previous speech of the user 101, the repeat sentence generation unit 406 concatenates the user 101's restatement speech (the word sequence Wt of the currently input speech (for example, "Mt. Fuji tomorrow")) to the beginning of "Tell me the time of sunrise" to generate "Mt. Fuji tomorrow, tell me the time of sunrise."

ただし、聞き返し文生成部４０６は、ユーザ１０１の言い直し発話音声（今回入力された発話音声の単語列Ｗｔ）と前回の発話の認識結果との一致する部分についてはいずれか一方を削除して、冗長化を防止する。 However, the review sentence generation unit 406 deletes one of the parts that match the user's 101 restated speech (the word sequence Wt of the currently input speech) and the recognition result of the previous utterance to prevent redundancy.

たとえば、『日の出の時刻教えて』が前回の発話の認識結果であり、前回のユーザ１０１の発話音声の単語列が「明日の富士山の日の出の時刻」であるとすると、結合結果は、「明日の富士山の日の出の時刻日の出の時刻教えて」となる。この場合、「日の出の時刻」が２回出現しているため、聞き返し文生成部４０６は、「日の出の時刻」を１つ削除して、「明日の富士山の日の出の時刻教えて」にする。 For example, if "Tell me the time of sunrise" is the recognition result of the previous utterance, and the word string of the utterance of the user 101 is "The time of sunrise on Mt. Fuji tomorrow," the combined result is 「明日の富士山の日の出の時刻日の出の時刻教えて」 (literally, "tomorrow's Mt. Fuji sunrise time, sunrise time, tell me"). In this case, since "the time of sunrise" (「日の出の時刻」) appears twice, the reflection sentence generation unit 406 deletes one occurrence to obtain "Tell me the time of sunrise on Mt. Fuji tomorrow" (「明日の富士山の日の出の時刻教えて」).
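The combination in step S1402, including removal of the duplicated span, can be sketched as follows. This is a minimal illustration under stated assumptions: the utterances are already split into word lists (the tokenization shown is hypothetical), the flag `retained_is_first_half` is an invented name, and only an overlap at the joint between the two parts is removed.

```python
from typing import List


def merge_words(first: List[str], second: List[str]) -> List[str]:
    """Concatenate two word lists, keeping only one copy of the longest
    span where the end of `first` equals the start of `second`."""
    for k in range(min(len(first), len(second)), 0, -1):
        if first[-k:] == second[:k]:
            return first + second[k:]
    return first + second


def combine_with_previous(previous: List[str], restated: List[str],
                          retained_is_first_half: bool) -> List[str]:
    """Append the restated words when the retained recognition result is
    the first half of the utterance; otherwise prepend them."""
    if retained_is_first_half:
        return merge_words(previous, restated)
    return merge_words(restated, previous)


# Retained first half plus restated ending.
a = combine_with_previous(["明日", "の", "富士山", "の", "日の出", "の"],
                          ["時刻", "教えて"], retained_is_first_half=True)
print("".join(a))  # 明日の富士山の日の出の時刻教えて

# Retained ending plus a restatement that overlaps it.
b = combine_with_previous(["日の出", "の", "時刻", "教えて"],
                          ["明日", "の", "富士山", "の", "日の出", "の", "時刻"],
                          retained_is_first_half=False)
print("".join(b))  # 明日の富士山の日の出の時刻教えて
```

Searching from the longest possible overlap downward ensures that the whole repeated phrase 「日の出の時刻」 is dropped once rather than only its last word.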

また、聞き返し中の聞き返しパターンがＰ３またはＰ４であれば（ステップＳ１４０１：Ｐ３，Ｐ４）、聞き返し文生成部４０６は、聞き返しパターンがＰ３またはＰ４（例：「富士山の高さを教えて、とおっしゃいましたか？」）に対するユーザ１０１の回答（今回入力された発話音声の単語列Ｗｔ）が肯定であるか否定であるかを判定する（ステップＳ１４０４）。肯定である場合（ステップＳ１４０４：肯定）、対話制御処理（ステップＳ１４０３）に移行する。対話制御処理（ステップＳ１４０３）については、図１５で後述する。 If the ask-back pattern currently in progress is P3 or P4 (step S1401: P3, P4), the reflection sentence generation unit 406 determines whether the answer of the user 101 (the word string Wt of the currently input speech) to the ask-back sentence of pattern P3 or P4 (e.g., "Did you say, 'Tell me the height of Mt. Fuji'?") is affirmative or negative (step S1404). If the answer is affirmative (step S1404: affirmative), the process proceeds to the dialogue control process (step S1403). The dialogue control process (step S1403) will be described later with reference to Fig. 15.

一方、否定である場合（ステップＳ１４０４：否定）、聞き返し文生成部４０６は、再質問を依頼する応答文（例：「質問をもう一度お願いします」）を生成して（ステップＳ１４０６）、ステップＳ１００７に移行する。 On the other hand, if the answer is no (step S1404: no), the rephrase sentence generation unit 406 generates a response sentence requesting the question to be asked again (e.g., "Please ask the question again") (step S1406), and proceeds to step S1007.

＜対話制御処理（ステップＳ１００８、Ｓ１４０３）＞ <Dialogue control process (steps S1008, S1403)>
図１５は、図１０および図１４に示した対話制御処理（ステップＳ１００８、Ｓ１４０３）の詳細な処理手順例を示すフローチャートである。まず、対話制御部４０５は、対話知識ＤＢ４１６を参照して音声認識結果５００の単語列Ｗｔに近い想定質問文を検索する（ステップＳ１５０１）。具体的には、たとえば、対話制御部４０５は、単語列Ｗｔと対話知識ＤＢ４１６の想定質問文との編集距離により、単語列Ｗｔとの類似度を想定質問文ごとに算出する。ここでは、例として、編集距離の逆数を類似度とする。したがって、類似度の値が大きい想定質問文ほど単語列Ｗｔに類似する。 Fig. 15 is a flowchart showing a detailed example of the processing procedure of the dialogue control process (steps S1008, S1403) shown in Fig. 10 and Fig. 14. First, the dialogue control unit 405 refers to the dialogue knowledge DB 416 to search for an expected question sentence that is close to the word string Wt of the speech recognition result 500 (step S1501). Specifically, for example, the dialogue control unit 405 calculates, for each expected question sentence in the dialogue knowledge DB 416, its similarity to the word string Wt based on the edit distance between the two. Here, as an example, the reciprocal of the edit distance is taken as the similarity. Therefore, an expected question sentence with a larger similarity value is more similar to the word string Wt.

つぎに、対話制御部４０５は、類似度がしきい値以上の想定質問文があるか否かを判定する（ステップＳ１５０２）。類似度がしきい値以上の想定質問文がない場合（ステップＳ１５０３：Ｎｏ）、対話制御部４０５は、質問の意味が分からない旨の応答文を生成して、ステップＳ１００９、Ｓ１００７に移行する。一方、類似度がしきい値以上の想定質問文がある場合（ステップＳ１５０２：Ｙｅｓ）、対話制御部４０５は、対話知識ＤＢ４１６において類似度がしきい値以上の想定質問文に対応する回答文を応答文として出力して（ステップＳ１５０４）、ステップＳ１００９、Ｓ１００７に移行する。 Next, the dialogue control unit 405 determines whether there is an expected question sentence whose similarity is equal to or greater than the threshold value (step S1502). If there is no expected question sentence whose similarity is equal to or greater than the threshold value (step S1503: No), the dialogue control unit 405 generates a response sentence to the effect that the meaning of the question is not understood, and proceeds to steps S1009 and S1007. On the other hand, if there is an expected question sentence whose similarity is equal to or greater than the threshold value (step S1502: Yes), the dialogue control unit 405 outputs an answer sentence that corresponds to the expected question sentence whose similarity is equal to or greater than the threshold value in the dialogue knowledge DB 416 as a response sentence (step S1504), and proceeds to steps S1009 and S1007.
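The retrieval in steps S1501 to S1504 can be sketched as below. This is a hedged illustration rather than the actual implementation: the dictionary standing in for the dialogue knowledge DB 416, the fallback wording, the threshold value, and the choice to return the single most similar entry are all assumptions. An exact match (edit distance 0) is treated as infinite similarity, since the text does not say how the reciprocal is handled in that case.

```python
from typing import Dict


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (0 if ca == cb else 1)))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Reciprocal of the edit distance; identical strings count as infinite."""
    dist = levenshtein(a, b)
    return float("inf") if dist == 0 else 1.0 / dist


# Hypothetical fallback for "the meaning of the question is not understood".
FALLBACK = "Sorry, I did not understand the question."


def respond(wt: str, knowledge: Dict[str, str], threshold: float) -> str:
    """Return the answer paired with the most similar expected question,
    or the fallback when no similarity reaches the threshold."""
    best_question = max(knowledge, key=lambda q: similarity(wt, q), default=None)
    if best_question is None or similarity(wt, best_question) < threshold:
        return FALLBACK
    return knowledge[best_question]


# Stand-in for the dialogue knowledge DB; the second entry is invented.
knowledge_db = {"富士山の高さ": "3776メートル", "日の出の時刻": "午前5時"}
print(respond("富士山の高さを教えて", knowledge_db, threshold=0.2))  # 3776メートル
```

Because the reciprocal compresses large distances toward zero, the threshold in such a sketch effectively sets a maximum tolerated edit distance (here, 1 / 0.2 = 5 edits).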

＜対話例＞ <Example of dialogue>
図１６は、ユーザ１０１と音声対話装置３００との対話の流れの一例を示すフローチャートである。ユーザ１０１が「富士山の高さを教えて」と発話したとする（ステップＳ１６０１）。音声対話装置３００は、「富士山の高さを教えて」を「＊＊＊の高さを教えて」と認識した場合（ステップＳ１６０２）、マスク部分の「＊＊＊」を推定して、「富士山ですか？」と応答する（ステップＳ１６０３）（聞き返しパターンＰ３）。 Fig. 16 is a flowchart showing an example of the flow of a dialogue between the user 101 and the voice dialogue device 300. Assume that the user 101 utters "Tell me the height of Mt. Fuji" (step S1601). When the voice dialogue device 300 recognizes "Tell me the height of Mt. Fuji" as "Tell me the height of ***" (step S1602), it estimates the masked portion "***" and responds "Is it Mt. Fuji?" (step S1603) (ask-back pattern P3).

ユーザ１０１が、「富士山ですか？」に対して否定を意味する「いいえ」を応答した場合（ステップＳ１６０４）、音声対話装置３００は、マスク部分の推定結果である「富士山」を否定されたため、「質問をもう一度お願いします」と発話する（ステップＳ１６０５）。 If the user 101 responds with "No" to the question "Is it Mt. Fuji?" (step S1604), the voice dialogue device 300 utters "Please ask the question again" since the estimation result of the masked portion, "Mt. Fuji", has been denied (step S1605).

また、ステップＳ１６０３での「富士山ですか？」の質問に、ユーザ１０１が肯定を意味する「はい」を応答した場合（ステップＳ１６０６）、音声対話装置３００は、Ｓ１６０１のユーザ１０１の発話を「富士山の高さを教えて」と認識する。したがって、音声対話装置３００は、「富士山の高さ」を想定質問文として、対応する回答「３７７６メートル」を対話知識ＤＢ４１６から検索し（ステップＳ１６１０）、「富士山の高さは３７７６メートルです。」と発話する（ステップＳ１６１１）。 Furthermore, if the user 101 responds with "Yes" to the question "Is it Mt. Fuji?" in step S1603 (step S1606), the voice dialogue device 300 recognizes the utterance of the user 101 in S1601 as "Tell me the height of Mt. Fuji." Therefore, the voice dialogue device 300 regards "The height of Mt. Fuji" as an expected question sentence, searches for the corresponding answer "3,776 meters" from the dialogue knowledge DB 416 (step S1610), and speaks "The height of Mt. Fuji is 3,776 meters" (step S1611).

また、ステップＳ１６０１のユーザ１０１の発話音声１１０である「富士山の高さを教えて」に対し、音声対話装置３００が、「富士山の＊＊＊」と認識した場合（ステップＳ１６０７）、「富士山の何ですか？」と応答する（ステップＳ１６０８）（聞き返しパターンＰ１）。これに対し、ユーザ１０１が「高さ」と応答すると（ステップＳ１６０９）、音声対話装置３００は、「富士山の高さ」を想定質問文として、対応する回答「３７７６メートル」を対話知識ＤＢ４１６から検索し（ステップＳ１６１０）、「富士山の高さは３７７６メートルです。」と発話する（ステップＳ１６１１）。 In addition, if the voice dialogue device 300 recognizes "*** of Mt. Fuji" in response to the speech 110 of the user 101 in step S1601, "Tell me the height of Mt. Fuji" (step S1607), it responds with "What on Mt. Fuji?" (step S1608) (request pattern P1). In response to this, if the user 101 responds with "height" (step S1609), the voice dialogue device 300 searches for the corresponding answer "3776 meters" from the dialogue knowledge DB 416, regarding "The height of Mt. Fuji" as the expected question (step S1610), and speaks "The height of Mt. Fuji is 3776 meters" (step S1611).

このように、実施例１によれば、単語の信頼度や言語尤度に応じて、ユーザ１０１の発話音声１１０に対する聞き返しパターンＰ１～Ｐ５を選択することができる。特に、標準言語モデル４１２、分野別言語モデル４１３、質問文言語モデル４１４、および対話文脈言語モデル４１５を使ってユーザ１０１の発話音声１１０を評価することで、それぞれ、日本語として尤もらしいか、その場での発話として尤もらしいか、質問として尤もらしいか、対話文脈を考慮した上で尤もらしいか、を判定することができ、判定結果に基づいて聞き返しを行うことで、対話の破綻を防ぐことができる。 In this way, according to the first embodiment, one of the ask-back patterns P1 to P5 can be selected for the utterance 110 of the user 101 according to the word reliability and the language likelihood. In particular, by evaluating the utterance 110 of the user 101 with the standard language model 412, the domain-specific language model 413, the question language model 414, and the dialogue context language model 415, it is possible to determine, respectively, whether the utterance is plausible as Japanese, plausible as an utterance in that setting, plausible as a question, and plausible given the dialogue context. Asking back based on these determinations prevents the dialogue from breaking down.

このように、音声認識の信頼度をもとに必要最低限の聞き返しを行うことで、ユーザ１０１に何度も同じ質問をさせないようにすることができ、ユーザ１０１のわずらわしさを低減することができる。 In this way, by asking the user 101 to repeat the question only a minimum number of times based on the reliability of the voice recognition, the user 101 can be prevented from having to ask the same question multiple times, thereby reducing the inconvenience to the user 101.

実施例２は、実施例１において、ユーザ１０１の発話音声１１０の独特の言い回しにより、分野別言語モデル４１３や対話文脈別言語モデルによってそれらの言語尤度がしきい値以上であるような場合に対し、誤判定、すなわち、ステップＳ１００４：Ｎｏに遷移するのを防止する例である。これにより、音声対話装置３００がユーザ１０１に再質問を依頼したり（ステップＳ１４０５）、対話制御処理（ステップＳ１００８、Ｓ１４０３）により想定質問文の意味が分からない旨の応答をしたり（ステップＳ１５０３）、尤もらしくない想定質問文に対応する尤もらしくない回答をしたり（ステップＳ１５０４）するのを防止する。なお、ここでは、実施例２の内容を中心に説明するため、実施例１と重複する部分については説明を省略する。 Example 2 is an example of preventing a false positive, i.e., transition to step S1004: No, in the case where the language likelihood of the domain-specific language model 413 or the dialogue context-specific language model is equal to or greater than a threshold value due to a unique expression in the speech 110 of the user 101 in Example 1. This prevents the voice dialogue device 300 from requesting the user 101 to ask a question again (step S1405), from responding that the meaning of the expected question sentence is not understood by the dialogue control process (steps S1008, S1403) (step S1503), or from giving an unlikely answer corresponding to an unlikely expected question sentence (step S1504). Note that, since the contents of Example 2 will be mainly described here, the description of the parts that overlap with Example 1 will be omitted.

＜音声対話装置３００の機能的構成例＞ <Example of functional configuration of the voice dialogue device 300>
図１７は、実施例２にかかる音声対話装置３００の機能的構成例を示すブロック図である。音声対話装置３００は、実施例１の図４に示した構成のほか、個人性言語モデル１７０１と、個人識別部１７０２と、を有する。個人性言語モデル１７０１は、図３に示した音声対話装置３００の記憶デバイス３０２または音声対話装置３００と通信可能な他のコンピュータの記憶デバイス３０２に記憶される。個人識別部１７０２は、具体的には、たとえば、図３に示した記憶デバイス３０２に記憶された音声対話プログラムをプロセッサ３０１に実行させることにより実現される。 Fig. 17 is a block diagram showing an example of a functional configuration of the voice dialogue device 300 according to the second embodiment. The voice dialogue device 300 includes an individual language model 1701 and a personal identification unit 1702 in addition to the configuration shown in Fig. 4 of the first embodiment. The individual language model 1701 is stored in the storage device 302 of the voice dialogue device 300 shown in Fig. 3 or in the storage device 302 of another computer capable of communicating with the voice dialogue device 300. Specifically, the personal identification unit 1702 is realized by, for example, causing the processor 301 to execute a voice dialogue program stored in the storage device 302 shown in Fig. 3.

個人性言語モデル１７０１は、ユーザ１０１固有の性質（個人性）により作成された言語モデルであり、具体的には、たとえば、ユーザ１０１ごとの対話履歴からユーザ１０１別に作成される。したがって、対話履歴管理部４０７は、ユーザ１０１ごとに対話履歴を管理する。個人性言語モデル１７０１は、たとえば、単語Ｎ－ｇｒａｍやＲＮＮにより実現される。 The individual language model 1701 is a language model created based on the characteristics (individuality) unique to the user 101, and specifically, for example, is created for each user 101 from the dialogue history of each user 101. Therefore, the dialogue history management unit 407 manages the dialogue history for each user 101. The individual language model 1701 is realized, for example, by a word N-gram or an RNN.

個人識別部１７０２は、ユーザ１０１を識別する。個人識別部１７０２は、具体的には、たとえば、指紋や掌の静脈、虹彩、顔画像、音声といったユーザ１０１の生体情報を管理し、入力された生体情報と一致した場合に、入力者をその生体情報を持つユーザ１０１として識別する。また、個人識別部１７０２は、ユーザＩＤおよびパスワードを管理し、入力されたユーザＩＤおよびパスワードと一致した場合に、入力者をユーザ１０１として識別してもよい。 The personal identification unit 1702 identifies the user 101. Specifically, the personal identification unit 1702 manages biometric information of the user 101, such as fingerprints, palm veins, irises, facial images, and voice, and if the biometric information matches the inputted biometric information, identifies the inputter as the user 101 who has that biometric information. The personal identification unit 1702 may also manage a user ID and password, and if the inputted user ID and password match, identify the inputter as the user 101.

＜音声対話処理手順例＞ <Example of voice dialogue processing procedure>
図１８は、実施例２にかかる音声対話装置３００による音声対話処理手順例を示すフローチャートである。音声対話装置３００は、図１０に示したステップＳ１００１～Ｓ１０１０に先立って、ステップＳ１８０１～Ｓ１８０４を実行する。 Fig. 18 is a flowchart showing an example of a voice dialogue processing procedure performed by the voice dialogue device 300 according to the second embodiment. The voice dialogue device 300 executes steps S1801 to S1804 prior to steps S1001 to S1010 shown in Fig. 10.

具体的には、たとえば、音声対話装置３００は、個人識別部１７０２により、生体情報を入力したユーザ１０１を識別する（ステップＳ１８０１）。音声対話装置３００は、生体情報を入力したユーザ１０１が登録済みのユーザ１０１であるか否かを判定する（ステップＳ１８０２）。具体的には、たとえば、音声対話装置３００は、登録済みの生体情報と入力された生体情報とが一致するか否かを判定し、一致すれば、生体情報を入力したユーザ１０１が登録済みのユーザ１０１であると判定する。 Specifically, for example, the voice dialogue device 300 identifies the user 101 who inputs the biometric information by the personal identification unit 1702 (step S1801). The voice dialogue device 300 determines whether the user 101 who inputs the biometric information is a registered user 101 (step S1802). Specifically, for example, the voice dialogue device 300 determines whether the registered biometric information matches the input biometric information, and if they match, determines that the user 101 who inputs the biometric information is a registered user 101.

登録済みのユーザ１０１である場合（ステップＳ１８０２：Ｙｅｓ）、音声対話装置３００は、当該ユーザ１０１の個人性言語モデル１７０１と対話履歴とをロードし（ステップＳ１８０３）、ステップＳ１００１に移行する。一方、登録済みのユーザ１０１でない場合（ステップＳ１８０２：Ｎｏ）、音声対話装置３００は、当該ユーザ１０１（以下、新規ユーザ１０１）の個人性言語モデル１７０１と対話履歴とを新規作成する（ステップＳ１８０４）。具体的には、たとえば、音声対話装置３００は、新規ユーザ１０１と対話して対話履歴を取得し、取得した対話履歴をもとに新規ユーザ１０１の個人性言語モデル１７０１を作成する。そして、ステップＳ１００１に移行する。 If the user 101 is a registered user (step S1802: Yes), the voice dialogue device 300 loads the individual language model 1701 and the dialogue history of that user 101 (step S1803) and proceeds to step S1001. On the other hand, if the user 101 is not a registered user (step S1802: No), the voice dialogue device 300 newly creates an individual language model 1701 and a dialogue history for that user 101 (hereinafter, new user 101) (step S1804). Specifically, for example, the voice dialogue device 300 interacts with the new user 101 to obtain a dialogue history, and creates the individual language model 1701 of the new user 101 based on the obtained dialogue history. The process then proceeds to step S1001.
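Steps S1802 to S1804 amount to a load-or-create lookup keyed by the identified user. The sketch below is only an illustration under loud assumptions: the registry is an in-memory dictionary, the identification result is reduced to an opaque string key (real biometric matching is not an exact string comparison), and the individual language model is represented by raw word bigram counts, one of the N-gram options the text mentions. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class UserProfile:
    """Per-user dialogue history plus bigram counts standing in for the
    individual language model (hypothetical representation)."""
    history: List[List[str]] = field(default_factory=list)
    bigram_counts: Dict[Tuple[str, str], int] = field(default_factory=dict)


def load_or_create_profile(registry: Dict[str, UserProfile],
                           user_key: str) -> Tuple[UserProfile, bool]:
    """Return (profile, created). A registered key loads the stored
    profile; an unknown key gets a new, empty one."""
    if user_key in registry:
        return registry[user_key], False
    profile = UserProfile()
    registry[user_key] = profile
    return profile, True


def add_utterance(profile: UserProfile, words: List[str]) -> None:
    """Record an utterance and update the bigram counts from it."""
    profile.history.append(words)
    for pair in zip(words, words[1:]):
        profile.bigram_counts[pair] = profile.bigram_counts.get(pair, 0) + 1


registry: Dict[str, UserProfile] = {}
p1, created1 = load_or_create_profile(registry, "voiceprint-001")  # new user
add_utterance(p1, ["富士山", "の", "高さ"])
p2, created2 = load_or_create_profile(registry, "voiceprint-001")  # registered
print(created1, created2, p2.bigram_counts[("富士山", "の")])  # True False 1
```

Keeping the history alongside the counts mirrors the text, where both the model and the dialogue history are loaded or created together.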

個人性言語モデル１７０１は、言語尤度算出（ステップＳ１００３）において用いられる。そして、図１１に示したステップＳ１１１１で質問文言語モデル４１４の言語尤度がしきい値以上の場合（ステップＳ１１１１：Ｎｏ）、分野別言語モデル４１３、対話文脈言語モデル４１５または個人性言語モデル１７０１のいずれかの言語尤度がしきい値未満となる。したがって、聞き返し文生成部４０６は、質問（ユーザ１０１の発話音声）全体を復唱して確認する聞き返し文を聞き返しパターンＰ４（図１を参照）として生成し（ステップＳ１１１２）、ステップＳ１００９に移行する。 The individual language model 1701 is used in language likelihood calculation (step S1003). Then, if the language likelihood of the question language model 414 is equal to or greater than the threshold value in step S1111 shown in FIG. 11 (step S1111: No), the language likelihood of any of the domain-specific language model 413, the dialogue context language model 415, or the individual language model 1701 is less than the threshold value. Therefore, the reflection sentence generation unit 406 generates a reflection sentence that repeats and confirms the entire question (the speech of the user 101) as reflection pattern P4 (see FIG. 1) (step S1112), and proceeds to step S1009.

また、分野別言語モデル４１３、対話文脈言語モデル４１５および個人性言語モデル１７０１のいずれも用いられていない場合も、質問文言語モデル４１４の言語尤度がしきい値以上である場合（ステップＳ１１１１：Ｎｏ）、聞き返し文生成部４０６は、ステップＳ１１１２を実行する。 Even if none of the domain-specific language model 413, the dialogue context language model 415, and the individual language model 1701 are used, if the language likelihood of the question language model 414 is equal to or greater than the threshold value (step S1111: No), the reflection sentence generation unit 406 executes step S1112.

このように、実施例２によれば、ユーザ１０１独特の言い回しによって生じる言語尤度の誤判定を抑制し、対話の円滑化を図ることができる。 In this way, according to the second embodiment, it is possible to suppress erroneous determination of language likelihood caused by the unique phrasing of the user 101, and facilitate smooth dialogue.

また、上述した実施例１および実施例２にかかる音声対話装置３００は、下記（１）～（１３）のように構成することもできる。 The voice dialogue device 300 according to the above-mentioned first and second embodiments can also be configured as follows (1) to (13).

（１）音声対話プログラムを実行するプロセッサ３０１と、音声対話プログラムを記憶する記憶デバイス３０２と、を有する音声対話装置３００では、プロセッサ３０１が、発話音声１１０に関する単語列Ｗｔを構成する各単語Ｗ１～Ｗｎの信頼度を取得する信頼度取得処理（ステップＳ１１０２）と、信頼度取得処理（ステップＳ１１０２）によって取得された信頼度に基づいて、発話音声１１０の発話元であるユーザ１０１に聞き返す聞き返し文を生成する聞き返し文生成処理（ステップＳ１００５）と、を実行する。 (1) In a voice dialogue device 300 having a processor 301 that executes a voice dialogue program and a storage device 302 that stores the voice dialogue program, the processor 301 executes a reliability acquisition process (step S1102) that acquires the reliability of each word W1 to Wn that constitutes a word string Wt related to the spoken voice 110, and a reflection sentence generation process (step S1005) that generates a reflection sentence to ask the user 101, who is the source of the spoken voice 110, based on the reliability acquired by the reliability acquisition process (step S1102).

これにより、音声対話装置３００は、たとえば、図１１に示したような聞き返しパターンＰ１～Ｐ３、Ｐ５の中から適切な聞き返し文を生成することができる。 This allows the voice dialogue device 300 to generate an appropriate reflection sentence from, for example, the reflection patterns P1 to P3 and P5 shown in FIG. 11.

（２）上記（１）の音声対話装置３００において、聞き返し文生成処理（ステップＳ１００５）では、プロセッサ３０１は、単語列Ｗｔを構成する全単語Ｗ１～Ｗｎの信頼度がいずれも第１しきい値未満である場合（ステップＳ１１０２：Ｙｅｓ）、発話音声１１０を再要求する聞き返し文を生成する（ステップＳ１１０９）。 (2) In the speech dialogue device 300 described above in (1), in the process of generating a reflection sentence (step S1005), if the reliability of all words W1 to Wn constituting the word string Wt is less than the first threshold value (step S1102: Yes), the processor 301 generates a reflection sentence that requests the spoken voice 110 again (step S1109).

これにより、音声対話装置３００は、たとえば、図１１に示したような聞き返しパターンＰ１～Ｐ３、Ｐ５の中から聞き返しパターンＰ５の聞き返し文を生成して、ユーザ１０１に再質問を依頼することができる。 As a result, the voice dialogue device 300 can generate a review sentence for the review pattern P5 from among the review patterns P1 to P3 and P5 shown in FIG. 11, for example, and request the user 101 to ask the question again.

（３）上記（１）の音声対話装置３００において、聞き返し文生成処理（ステップＳ１００５）では、プロセッサ３０１は、単語列Ｗｔに信頼度が第１しきい値以上の単語が存在する場合（ステップＳ１１０２：Ｎｏ）、単語列Ｗｔにおける信頼度が第１しきい値未満の単語の位置に基づいて、聞き返し文を生成する。 (3) In the speech dialogue device 300 described above in (1), in the process of generating a reflection sentence (step S1005), if the word string Wt contains a word whose reliability is equal to or greater than the first threshold (step S1102: No), the processor 301 generates a reflection sentence based on the position of the word in the word string Wt whose reliability is less than the first threshold.

これにより、音声対話装置３００は、たとえば、図１１に示したような聞き返しパターンＰ１～Ｐ３、Ｐ５の中から、信頼度が第１しきい値未満の単語の位置に応じた聞き返し文を生成することができる。 As a result, the voice dialogue device 300 can generate a review sentence according to the position of a word whose reliability is less than the first threshold value, for example, from among the review patterns P1 to P3 and P5 as shown in FIG. 11.

（４）上記（３）の音声対話装置３００において、聞き返し文生成処理（ステップＳ１００５）では、プロセッサ３０１は、信頼度が第１しきい値未満の単語が単語列Ｗｔの前半または後半に存在する場合（ステップＳ１１０３：部分Ａ）、前半または後半のうち信頼度が第１しきい値未満の単語が存在しない方を聞き返す聞き返し文を生成する（ステップＳ１１０４）。 (4) In the speech dialogue device 300 of (3) above, in the process of generating a reflection sentence (step S1005), if a word whose reliability is less than the first threshold value is present in the first or second half of the word string Wt (step S1103: part A), the processor 301 generates a reflection sentence to ask back about either the first or second half that does not contain a word whose reliability is less than the first threshold value (step S1104).

これにより、音声対話装置３００は、たとえば、図１１に示したような聞き返しパターンＰ１～Ｐ３、Ｐ５の中から聞き返しパターンＰ１の聞き返し文を生成して、ユーザ１０１に、発話音声１１０のうち聞き取れている部分を発話して、聞き取れていない部分の再発話を促すことができる。 As a result, the voice dialogue device 300 can generate, for example, an ask-back sentence of ask-back pattern P1 from among the ask-back patterns P1 to P3 and P5 shown in FIG. 11, utter to the user 101 the part of the spoken voice 110 that it was able to catch, and prompt the user 101 to repeat the part that it was unable to catch.

（５）上記（３）の音声対話装置３００において、聞き返し文生成処理（ステップＳ１００５）では、プロセッサ３０１は、信頼度が第１しきい値未満の単語が単語列Ｗｔの前半および後半にわたって複数存在する場合（ステップＳ１１０３：部分Ｂ）、単語列Ｗｔのうち第１しきい値以上の単語を聞き返す聞き返し文を生成する（ステップＳ１１０５）。 (5) In the speech dialogue device 300 of (3) above, in the process of generating a reflection sentence (step S1005), if there are multiple words with a reliability less than the first threshold value in both the first and second halves of the word string Wt (step S1103: part B), the processor 301 generates a reflection sentence that asks for words in the word string Wt that are equal to or greater than the first threshold value (step S1105).

これにより、音声対話装置３００は、たとえば、図１１に示したような聞き返しパターンＰ１～Ｐ３、Ｐ５の中から聞き返しパターンＰ２の聞き返し文を生成して、ユーザ１０１に、発話音声１１０のうち聞き取れている部分を発話して、ユーザ１０１に再質問を依頼することができる。 As a result, the voice dialogue device 300 can generate, for example, an ask-back sentence of ask-back pattern P2 from among the ask-back patterns P1 to P3 and P5 shown in FIG. 11, utter to the user 101 the part of the spoken voice 110 that it was able to catch, and request the user 101 to ask the question again.

（６）上記（３）の音声対話装置３００において、プロセッサ３０１は、信頼度が第１しきい値未満の１個の単語、または信頼度が第１しきい値未満の連続する２個の単語が単語列に存在する場合（ステップＳ１１０３：部分Ｃ）、任意の単語とその周辺に現れる単語群の統計と句の並びの統計を学習させた対話文脈言語モデル４１５に基づいて、１個の単語または連続する２個の単語がどのような単語であるかを推定するマスク単語推定処理（ステップＳ１１０６）を実行し、聞き返し文生成処理（ステップＳ１００５）では、プロセッサ３０１は、マスク単語推定処理（ステップＳ１１０６）による推定結果に応じた聞き返し文を生成する。 (6) In the voice dialogue device 300 of (3) above, when a word string contains one word whose reliability is less than the first threshold value, or two consecutive words whose reliability is less than the first threshold value (step S1103: part C), the processor 301 executes a mask word estimation process (step S1106) to estimate what kind of word the one word or two consecutive words are based on the dialogue context language model 415 that has been trained on statistics of a word and the word group that appears around it, and statistics of phrase arrangements, and in the reflection sentence generation process (step S1005), the processor 301 generates a reflection sentence according to the estimation result obtained by the mask word estimation process (step S1106).

これにより、音声対話装置３００は、たとえば、信頼度が第１しきい値未満の１個の単語、または信頼度が第１しきい値未満の連続する２個の単語を文脈から推定することにより、聞き返しの頻度の低減化を図ることができる。 As a result, the voice dialogue device 300 can reduce the frequency of asking to repeat, for example, by estimating from the context a single word whose reliability is less than the first threshold value, or two consecutive words whose reliability is less than the first threshold value.

（７）上記（６）の音声対話装置３００において、聞き返し文生成処理（ステップＳ１００５）では、プロセッサ３０１は、マスク単語推定処理（ステップＳ１１０６）による推定が成功した場合（ステップＳ１１０７：Ｙｅｓ）、推定した単語を含み、かつ、発話音声１１０を確認する聞き返し文を生成する（ステップＳ１１０８）。 (7) In the speech dialogue device 300 of (6) above, in the reflection sentence generation process (step S1005), if the estimation by the mask word estimation process (step S1106) is successful (step S1107: Yes), the processor 301 generates a reflection sentence that includes the estimated word and confirms the spoken voice 110 (step S1108).

これにより、音声対話装置３００は、たとえば、聞き返しパターンＰ３として、推定した単語と、発話音声１１０全体とを確認する聞き返し文を生成することができ、どの部分が認識しにくかったか、および全体として、どのように認識したかを、ユーザ１０１に伝え、聞き返しの頻度の低減化を図ることができる。 As a result, the voice dialogue device 300 can generate, for example, a ask-back sentence as a ask-back pattern P3 to confirm the estimated words and the entire spoken voice 110, and can inform the user 101 which parts were difficult to recognize and how the entire speech was recognized, thereby reducing the frequency of asking back.

（８）上記（６）の音声対話装置３００において、聞き返し文生成処理（ステップＳ１００５）では、プロセッサ３０１は、マスク単語推定処理（ステップＳ１１０６）による推定が失敗した場合（ステップＳ１１０７：Ｎｏ）、発話音声１１０を再要求する聞き返し文を生成する。 (8) In the voice dialogue device 300 of (6) above, in the reflective sentence generation process (step S1005), if the estimation by the mask word estimation process (step S1106) fails (step S1107: No), the processor 301 generates a reflective sentence that re-requests the spoken voice 110.

これにより、音声対話装置３００は、たとえば、聞き返しパターンＰ５の聞き返し文を生成して、ユーザ１０１に再質問を依頼することができる。 This allows the voice dialogue device 300 to generate a repeat sentence of, for example, the repeat pattern P5 and ask the user 101 to ask the question again.

（９）上記（１）の音声対話装置３００において、聞き返し文生成処理（ステップＳ１００５）では、プロセッサ３０１は、単語列Ｗｔの信頼度が第１しきい値未満である場合（ステップＳ１１０１：Ｙｅｓ）、単語列Ｗｔを構成する各単語Ｗ１～Ｗｎの信頼度に基づいて、聞き返し文を生成する。 (9) In the speech dialogue device 300 of (1) above, in the process of generating a reflection sentence (step S1005), if the reliability of the word string Wt is less than the first threshold value (step S1101: Yes), the processor 301 generates a reflection sentence based on the reliability of each word W1 to Wn that constitutes the word string Wt.

これにより、単語列Ｗｔの信頼度が第１しきい値未満であれば、発話音声１１０が正しく認識されていないとして、個々の単語Ｗ１～Ｗｎの信頼度で、どのように聞き返すかを聞き返しパターンＰ１～Ｐ３、Ｐ５から選択することができる。 As a result, if the reliability of the word string Wt is less than the first threshold value, it is determined that the speech 110 has not been correctly recognized, and the system can select from the re-listening patterns P1 to P3 and P5 how to ask for re-listening based on the reliability of each word W1 to Wn.

（１０）上記（１）の音声対話装置３００において、プロセッサ３０１は、複数の言語モデル４１２～４１５の各々に単語列を入力した結果得られる複数の言語尤度を取得する言語尤度取得処理（ステップＳ１１０３）を実行し、聞き返し文生成処理（ステップＳ１００５）では、プロセッサ３０１は、単語列Ｗｔの信頼度が第１しきい値以上である場合（ステップＳ１１０１：Ｎｏ）、言語尤度取得処理（ステップＳ１１０３）によって取得された複数の言語尤度に基づいて、聞き返し文を生成する。 (10) In the voice dialogue device 300 of (1) above, the processor 301 executes a language likelihood acquisition process (step S1103) to acquire multiple language likelihoods obtained as a result of inputting a word string to each of the multiple language models 412 to 415, and in the reflection sentence generation process (step S1005), if the reliability of the word string Wt is equal to or greater than a first threshold (step S1101: No), the processor 301 generates a reflection sentence based on the multiple language likelihoods acquired by the language likelihood acquisition process (step S1103).

これにより、発話音声１１０の認識の信頼性が高い場合に、その単語列Ｗｔがどの言語モデルによる発話として尤もらしいかを特定して、聞き返しパターンＰ４、Ｐ５、または聞き返しなしを選択することができる。 As a result, when the reliability of the recognition of the speech sound 110 is high, it is possible to determine which language model the word sequence Wt is most likely to be the result of, and select the replay pattern P4, P5, or no replay.

（１１）上記（１０）の音声対話装置３００において、聞き返し文生成処理（ステップＳ１００５）では、プロセッサ３０１は、複数の言語モデル４１２～４１５のうち複数の分野のテキストから得られた単語の並びの統計モデルである標準言語モデル４１２に、単語列Ｗｔを入力した結果得られる第１言語尤度が、第２しきい値未満である場合（ステップＳ１１１０：Ｙｅｓ）、発話音声１１０を再要求する聞き返し文を生成する（ステップＳ１１０９）。 (11) In the speech dialogue device 300 of (10) above, in the process of generating a reflection sentence (step S1005), if the first language likelihood obtained as a result of inputting the word string Wt to the standard language model 412, which is a statistical model of word sequences obtained from texts in multiple fields among the multiple language models 412 to 415, is less than the second threshold value (step S1110: Yes), the processor 301 generates a reflection sentence that re-requests the spoken voice 110 (step S1109).

これにより、標準的な言語の発話として尤もらしくない場合に、聞き返しパターンＰ５の聞き返し文を生成して、ユーザ１０１に再質問を促すことができる。 In this way, if the utterance is unlikely to be in standard language, a repeat sentence of the repeat pattern P5 can be generated to prompt the user 101 to ask the question again.

（１２）上記（１１）の音声対話装置３００において、聞き返し文生成処理（ステップＳ１００５）では、プロセッサ３０１は、第１言語尤度が第２しきい値以上である場合（ステップＳ１１１０：Ｙｅｓ）、質問文を構成する単語の並びから発話音声１１０が質問であるかどうかを判定する質問文言語モデル４１４に、単語列Ｗｔを入力した結果得られる第２言語尤度が、第２しきい値未満であれば（ステップＳ１１１１：Ｙｅｓ）、聞き返し文を生成しない（ステップＳ１１１３）。 (12) In the voice dialogue device 300 of (11) above, in the process of generating a reflection sentence (step S1005), if the first language likelihood is equal to or greater than the second threshold (step S1110: Yes), the processor 301 does not generate a reflection sentence (step S1113) if the second language likelihood obtained as a result of inputting the word string Wt into the question sentence language model 414, which determines whether the spoken voice 110 is a question based on the sequence of words that constitute the question sentence, is less than the second threshold (step S1111: Yes).

これにより、質問文言語モデル４１４で質問文として尤もらしいとされた場合に、聞き返しの無駄な繰り返しを抑制し、対話の円滑化を促進することができる。 This makes it possible to suppress unnecessary repetition of asking for clarification when a question is deemed plausible by the question language model 414, and to facilitate smooth dialogue.

（１３）上記（１１）の音声対話装置３００において、聞き返し文生成処理（ステップＳ１００５）では、プロセッサ３０１は、第１言語尤度が第２しきい値以上である場合、質問文を構成する単語の並びから発話音声１１０が質問であるかどうかを判定する質問文言語モデル４１４に、単語列Ｗｔを入力した結果得られる第２言語尤度が、第２しきい値以上であれば（ステップＳ１１１１：Ｎｏ）、発話音声１１０を聞き返す聞き返し文を生成する（ステップＳ１１１２）。 (13) In the speech dialogue device 300 of (11) above, in the process of generating a reflection sentence (step S1005), if the first language likelihood is equal to or greater than the second threshold, the processor 301 inputs the word string Wt into the question sentence language model 414, which determines whether the spoken speech 110 is a question based on the sequence of words that make up the question sentence, and if the second language likelihood obtained as a result of inputting the word string Wt is equal to or greater than the second threshold (step S1111: No), generates a reflection sentence that asks for the spoken speech 110 again (step S1112).

これにより、発話音声１１０が音声認識されても、質問文言語モデル４１４で質問文として尤もらしくないとされた場合に、聞き返しパターンＰ４の聞き返し文を生成して、質問全体を再度依頼することができる。 As a result, even if the spoken voice 110 has been speech-recognized, when the question sentence language model 414 judges it implausible as a question, a reflection sentence of pattern P4 can be generated to request the entire question again.
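
The branching described in (11) to (13) can be sketched as follows. This is only a minimal illustration, not the implementation of the embodiment: the function name `decide_reflection`, the `Reflection` enum, and the numeric values in the usage line are hypothetical, and the comparison directions simply follow the text of (11) to (13) as written.

```python
from enum import Enum


class Reflection(Enum):
    """Hypothetical labels for the three outcomes described in (11) to (13)."""
    PATTERN_P5 = "P5"   # (11): prompt the user to ask the question again
    PATTERN_P4 = "P4"   # (13): request the entire question again (step S1112)
    NONE = "none"       # (12): no reflection sentence is generated (step S1113)


def decide_reflection(first_likelihood: float,
                      second_likelihood: float,
                      second_threshold: float) -> Reflection:
    # (11): first language likelihood below the second threshold -> pattern P5.
    if first_likelihood < second_threshold:
        return Reflection.PATTERN_P5
    # Step S1110: Yes. Check the score from the question sentence language model.
    # (12): second language likelihood below the threshold (step S1111: Yes).
    if second_likelihood < second_threshold:
        return Reflection.NONE
    # (13): second language likelihood at or above the threshold (step S1111: No).
    return Reflection.PATTERN_P4


# Made-up scores, purely for illustration.
print(decide_reflection(0.2, 0.9, 0.5))
```

Both comparisons use the same second threshold because (11) to (13) name only that one threshold for the language likelihoods; a real system might tune them separately.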

なお、本発明は前述した実施例に限定されるものではなく、添付した特許請求の範囲の趣旨内における様々な変形例及び同等の構成が含まれる。たとえば、前述した実施例は本発明を分かりやすく説明するために詳細に説明したものであり、必ずしも説明した全ての構成を備えるものに本発明は限定されない。また、ある実施例の構成の一部を他の実施例の構成に置き換えてもよい。また、ある実施例の構成に他の実施例の構成を加えてもよい。また、各実施例の構成の一部について、他の構成の追加、削除、または置換をしてもよい。 The present invention is not limited to the above-described embodiments, but includes various modified examples and equivalent configurations within the spirit of the appended claims. For example, the above-described embodiments have been described in detail to clearly explain the present invention, and the present invention is not necessarily limited to having all of the configurations described. Furthermore, a portion of the configuration of one embodiment may be replaced with the configuration of another embodiment. Furthermore, the configuration of another embodiment may be added to the configuration of one embodiment. Furthermore, other configurations may be added, deleted, or replaced with part of the configuration of each embodiment.

また、前述した各構成、機能、処理部、処理手段等は、それらの一部又は全部を、たとえば集積回路で設計する等により、ハードウェアで実現してもよく、プロセッサ３０１がそれぞれの機能を実現するプログラムを解釈し実行することにより、ソフトウェアで実現してもよい。 Furthermore, each of the configurations, functions, processing units, processing means, etc. described above may be realized in part or in whole in hardware, for example by designing them as integrated circuits, or may be realized in software by having the processor 301 interpret and execute a program that realizes each function.

各機能を実現するプログラム、テーブル、ファイル等の情報は、メモリ、ハードディスク、ＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｒｉｖｅ）等の記憶装置、又は、ＩＣ（ＩｎｔｅｇｒａｔｅｄＣｉｒｃｕｉｔ）カード、ＳＤカード、ＤＶＤ（ＤｉｇｉｔａｌＶｅｒｓａｔｉｌｅＤｉｓｃ）の記録媒体に格納することができる。 Information such as programs, tables, and files that realize each function can be stored in a storage device such as a memory, a hard disk, or an SSD (Solid State Drive), or in a recording medium such as an IC (Integrated Circuit) card, an SD card, or a DVD (Digital Versatile Disc).

また、制御線や情報線は説明上必要と考えられるものを示しており、実装上必要な全ての制御線や情報線を示しているとは限らない。実際には、ほとんど全ての構成が相互に接続されていると考えてよい。 In addition, the control lines and information lines shown are those considered necessary for explanation, and do not necessarily represent all control lines and information lines necessary for implementation. In reality, it is safe to assume that almost all components are interconnected.

Ｗ１～Ｗｎ単語 W1 to Wn Word
ＷＬ単語ラティス WL Word lattice
Ｗｔ単語列 Wt Word string
１０１ユーザ 101 User
１０２コミュニケーションロボット 102 Communication robot
１０３スマートフォン 103 Smartphone
１０４情報処理装置 104 Information processing device
１１０発話音声 110 Speech
２００音声対話システム 200 Speech dialogue system
２０１サーバ 201 Server
２０２ネットワーク 202 Network
３００音声対話装置 300 Speech dialogue device
３０１プロセッサ 301 Processor
３０２記憶デバイス 302 Storage device
４０１音声認識部 401 Speech recognition unit
４０２言語尤度取得部 402 Language likelihood acquisition unit
４０３信頼度取得部 403 Reliability acquisition unit
４０４聞き返し判定部 404 Reflection determination unit
４０５対話制御部 405 Dialogue control unit
４０６聞き返し文生成部 406 Reflection sentence generation unit
４０７対話履歴管理部 407 Dialogue history management unit
４０８音声合成部 408 Speech synthesis unit
４１１音声認識モデル 411 Speech recognition model
４１２標準言語モデル 412 Standard language model
４１３分野別言語モデル 413 Domain-specific language model
４１４質問文言語モデル 414 Question sentence language model
４１５対話文脈言語モデル 415 Dialogue context language model
４１６対話知識ＤＢ 416 Dialogue knowledge DB

Claims

1. A voice dialogue device having a processor that executes a program and a storage device that stores the program,
wherein the processor executes:
an acquisition process for acquiring the reliability of each word constituting a word string related to a spoken voice; and
a generation process for generating, when the word string contains a word whose reliability is equal to or greater than a first threshold value and also contains, in the first half or the second half of the word string, a word whose reliability is less than the first threshold value, a reflection sentence for asking back to the source of the spoken voice, the reflection sentence being the first half or the second half of the word string that does not contain a word whose reliability is less than the first threshold value.

2. A voice dialogue device having a processor that executes a program and a storage device that stores the program,
wherein the processor executes:
an acquisition process for acquiring the reliability of each word constituting a word string related to a spoken voice; and
a generation process for generating, when the word string contains a word whose reliability is equal to or greater than a first threshold value and a plurality of words whose reliability is less than the first threshold value in both the first half and the second half of the word string, a reflection sentence for asking the speaker of the spoken voice about the words in the word string whose reliability is equal to or greater than the first threshold value.
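
The two cases distinguished in the two claims above can be illustrated with a short sketch. It is not taken from the specification: the function name `reliable_part`, the string labels, the example reliabilities, and in particular the choice of the midpoint `len(words) // 2` as the boundary between the first half and the second half are all assumptions made for illustration.

```python
from typing import List, Tuple


def reliable_part(words: List[Tuple[str, float]],
                  first_threshold: float) -> Tuple[str, List[str]]:
    """Classify where low-reliability words occur and return the words
    a reflection sentence could be built from (hypothetical helper)."""
    mid = len(words) // 2  # assumed split point between the two halves
    has_high = any(r >= first_threshold for _, r in words)
    low_first = sum(1 for _, r in words[:mid] if r < first_threshold)
    low_second = sum(1 for _, r in words[mid:] if r < first_threshold)

    if not has_high:
        return "no_reliable_words", []
    if low_first and low_second:
        # Low-reliability words in both halves: keep only the reliable words.
        return "both_halves", [w for w, r in words if r >= first_threshold]
    if low_first:
        # Only the first half is affected: the second half is clean.
        return "first_half_low", [w for w, _ in words[mid:]]
    if low_second:
        # Only the second half is affected: the first half is clean.
        return "second_half_low", [w for w, _ in words[:mid]]
    return "no_low_words", [w for w, _ in words]


utterance = [("where", 0.9), ("is", 0.8), ("the", 0.3), ("station", 0.2)]
print(reliable_part(utterance, 0.5))
```

How the returned words are turned into an actual reflection sentence depends on sentence templates that this sketch does not model.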

3. A voice dialogue device having a processor that executes a program and a storage device that stores the program,
wherein the processor executes:
an acquisition process for acquiring the reliability of each word constituting a word string related to a spoken voice;
an estimation process for estimating, when the word string contains a word whose reliability is equal to or greater than a first threshold value and either one word or two consecutive words whose reliability is less than the first threshold value, what kind of word the one word or the two consecutive words are, based on a dialogue context language model trained on statistics of a word and the group of words appearing around it and on statistics of phrase sequences; and
a generation process for generating, according to the estimation result of the estimation process, a reflection sentence for asking back to the source of the spoken voice.

4. The voice dialogue device according to claim 3,
wherein, in the generation process, when the estimation by the estimation process succeeds, the processor generates a reflection sentence that includes the estimated word and confirms the spoken voice.

5. The voice dialogue device according to claim 3,
wherein, in the generation process, when the estimation by the estimation process fails, the processor generates a reflection sentence requesting the spoken voice again.
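
The estimate-then-confirm flow in the three claims above can be sketched as follows. The sketch replaces the dialogue context language model with a hypothetical lookup table of neighbour-word counts, and the names `estimate_gap` and `reflection_sentence`, the boundary markers `<s>` and `</s>`, and the wording of the two output sentences are all assumptions rather than anything stated in the specification.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for the dialogue context language model:
# maps (left neighbour, right neighbour) to counts of words seen between them.
ContextCounts = Dict[Tuple[str, str], Counter]


def estimate_gap(model: ContextCounts, left: str, right: str) -> Optional[str]:
    """Return the most frequent filler for the gap, or None if estimation fails."""
    candidates = model.get((left, right))
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


def reflection_sentence(words: List[str], reliabilities: List[float],
                        first_threshold: float,
                        model: ContextCounts) -> Optional[str]:
    low = [i for i, r in enumerate(reliabilities) if r < first_threshold]
    # Only the claimed case is handled: one low-reliability word or two
    # consecutive ones, with at least one reliable word left in the string.
    if (len(low) not in (1, 2) or low[-1] - low[0] != len(low) - 1
            or len(low) == len(words)):
        return None
    left = words[low[0] - 1] if low[0] > 0 else "<s>"
    right = words[low[-1] + 1] if low[-1] + 1 < len(words) else "</s>"
    guess = estimate_gap(model, left, right)
    if guess is None:
        # Estimation failed: request the utterance again.
        return "Could you say that again?"
    # Estimation succeeded: confirm with the estimated word filled in.
    restored = words[:low[0]] + [guess] + words[low[-1] + 1:]
    return "Did you say: " + " ".join(restored) + "?"


toy_model: ContextCounts = {("the", "</s>"): Counter({"station": 3, "nation": 1})}
print(reflection_sentence(["where", "is", "the", "stashun"],
                          [0.9, 0.9, 0.9, 0.2], 0.5, toy_model))
```

A trained model would score candidates from much richer context than a single pair of neighbours; the table lookup is used here only to keep the success and failure branches visible.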

6. A voice dialogue device having a processor that executes a program and a storage device that stores the program,
wherein the processor executes:
an acquisition process for acquiring the reliability of each word constituting a word string related to a spoken voice and the reliability of the word string; and
a generation process for generating, when the reliability of the word string is less than a first threshold value, a reflection sentence for asking back to the speaker of the spoken voice, based on the reliability of each word constituting the word string.

7. A voice dialogue device having a processor that executes a program and a storage device that stores the program,
wherein the processor executes:
an acquisition process for acquiring the reliability of a word string related to a spoken voice and a plurality of language likelihoods obtained by inputting the word string into each of a plurality of language models; and
a generation process for generating, when the reliability of the word string acquired by the acquisition process is equal to or greater than a first threshold value, a reflection sentence for asking back to the speaker of the spoken voice, based on the plurality of language likelihoods acquired by the acquisition process.

8. The voice dialogue device according to claim 7,
wherein, in the generation process, when a first language likelihood, obtained by inputting the word string into a first language model that is one of the plurality of language models and is a statistical model of word sequences obtained from texts in a plurality of fields, is less than a second threshold value, the processor generates a reflection sentence requesting the spoken voice again.

9. The voice dialogue device according to claim 8,
wherein, in the generation process, when the first language likelihood is equal to or greater than the second threshold value, the processor does not generate the reflection sentence if a second language likelihood, obtained by inputting the word string into a second language model that determines from the sequence of words constituting a question sentence whether the spoken voice is a question, is less than the second threshold value.

10. The voice dialogue device according to claim 8,
wherein, in the generation process, when the first language likelihood is equal to or greater than the second threshold value, the processor generates a reflection sentence that asks back about the spoken voice if a second language likelihood, obtained by inputting the word string into a second language model that determines from the sequence of words constituting a question sentence whether the spoken voice is a question, is equal to or greater than the second threshold value.

11. A voice dialogue method executed by a voice dialogue device having a processor that executes a program and a storage device that stores the program,
wherein the processor executes:
an acquisition process for acquiring the reliability of each word constituting a word string related to a spoken voice; and
a generation process for generating, when the word string contains a word whose reliability is equal to or greater than a first threshold value and also contains, in the first half or the second half of the word string, a word whose reliability is less than the first threshold value, a reflection sentence for asking back to the source of the spoken voice, the reflection sentence being the first half or the second half of the word string that does not contain a word whose reliability is less than the first threshold value.

12. A voice dialogue method executed by a voice dialogue device having a processor that executes a program and a storage device that stores the program,
wherein the processor executes:
an acquisition process for acquiring the reliability of each word constituting a word string related to a spoken voice; and
a generation process for generating, when the word string contains a word whose reliability is equal to or greater than a first threshold value and a plurality of words whose reliability is less than the first threshold value in both the first half and the second half of the word string, a reflection sentence for asking the speaker of the spoken voice about the words in the word string whose reliability is equal to or greater than the first threshold value.

13. A voice dialogue method executed by a voice dialogue device having a processor that executes a program and a storage device that stores the program,
wherein the processor executes:
an acquisition process for acquiring the reliability of each word constituting a word string related to a spoken voice; and
an estimation process for estimating, when the word string contains a word whose reliability is equal to or greater than a first threshold value and either one word or two consecutive words whose reliability is less than the first threshold value, what kind of word the one word or the two consecutive words are, based on a dialogue context language model trained on statistics of a word and the group of words appearing around it and on statistics of phrase sequences.

14. A voice dialogue program that causes a processor to execute:
an acquisition process for acquiring the reliability of each word constituting a word string related to a spoken voice; and
a generation process for generating, when the word string contains a word whose reliability is equal to or greater than a first threshold value and also contains, in the first half or the second half of the word string, a word whose reliability is less than the first threshold value, a reflection sentence for asking back to the source of the spoken voice, the reflection sentence being the first half or the second half of the word string that does not contain a word whose reliability is less than the first threshold value.

15. A voice dialogue program that causes a processor to execute:
an acquisition process for acquiring the reliability of each word constituting a word string related to a spoken voice; and
a generation process for generating, when the word string contains a word whose reliability is equal to or greater than a first threshold value and a plurality of words whose reliability is less than the first threshold value in both the first half and the second half of the word string, a reflection sentence for asking the speaker of the spoken voice about the words in the word string whose reliability is equal to or greater than the first threshold value.