JP6306447B2

JP6306447B2 - Terminal, program, and system for reproducing response sentence using a plurality of different dialogue control units simultaneously

Info

Publication number: JP6306447B2
Application number: JP2014129678A
Authority: JP
Inventors: ▲シン▼ 徐; 加藤　恒夫
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2014-06-24
Filing date: 2014-06-24
Publication date: 2018-04-04
Anticipated expiration: 2034-06-24
Also published as: JP2016009091A

Description

The present invention relates to dialogue-system technology for outputting a response sentence in reply to an utterance from a user.

Dialogue systems that hold natural conversations with people are becoming widespread, particularly on terminals such as smartphones and tablets. In one known dialogue system, only the vocabulary appropriate to the content of the dialogue with the user is passed to dialogue control processing as the recognition target vocabulary (see, for example, Patent Document 1). This technique stores only the minimum necessary vocabulary as recognition targets, extracts unregistered words from the user's utterances, and registers them incrementally. It thereby reduces memory usage and speeds up vocabulary lookup while still achieving natural dialogue processing.

Another technique implements, inside a mobile terminal, a dialogue control unit restricted to a limited dialogue domain (see, for example, Non-Patent Document 1). With this technique, both intention estimation, which extracts the intention from the user's utterance, and dialogue control, which determines the response to the extracted intention, can be executed within the mobile terminal.

A further technique uses a plurality of spoken dialogue devices (see, for example, Patent Document 2). When the first spoken dialogue device fails to understand the user's utterance, it sends a language-understanding failure to the second spoken dialogue device, which processed the user's immediately preceding utterance. On receiving the failure, the second device sends the first device the utterance content obtained by reverse-analyzing its language-understanding rules under the understanding state it has already generated. The two devices thus cooperate to analyze the user's utterance and advance the dialogue.

Yet another technique uses four types of dialogue control systems: information-providing, information-acquiring, question-answering, and information-accepting (see, for example, Patent Document 3). The dialogue type is classified by the length of time the user gives no input and by whether the input utterance is a request or a question. By using the dialogue control systems suited to that classification, the technique can output appropriate response sentences even for relatively complex, wide-ranging dialogue domains, improving user satisfaction with the dialogue.

Patent Document 1: JP 2002-149645 A
Patent Document 2: JP 2004-240225 A
Patent Document 3: JP 2009-198614 A

Non-Patent Document 1: KDDI R&D Laboratories, "Development of a Dialogue Platform for Smartphones Capable of Multi-Device Cooperation", [online], [retrieved May 20, 2014], Internet <http://www.kddilabs.jp/press/2013/1010.html>

However, although the techniques described in Patent Document 1 and Non-Patent Document 1 can shorten the response time of the dialogue, the number of vocabulary items and tasks they can recognize is limited, so the reliability of the response content is inevitably low.
The techniques described in Patent Document 2 and Patent Document 3, on the other hand, use a plurality of dialogue control devices, so the response time of the dialogue is inevitably long.
In a dialogue system there is thus a trade-off between the response time of the dialogue and the reliability of the response content. Either a long response time or a low reliability of the response content makes the dialogue stressful for the user.

Accordingly, an object of the present invention is to provide a terminal, a program, and a system that play back response sentences in a dialogue with a user while taking into account the trade-off between response time and the reliability of the response content.

According to the present invention, a terminal having a user interface capable of voice dialogue comprises:
first dialogue control means for outputting a first response sentence after the user's utterance is input;
connective word storage means storing "connective words";
second dialogue control means for outputting, after the user's utterance is input, a second response sentence that has a longer response time and a higher reliability of response content than that of the first dialogue control means;
utterance input means for inputting the user's utterance to both the first dialogue control means and the second dialogue control means;
response sentence similarity calculation means for calculating the similarity between the first response sentence and the second response sentence when the second response sentence is output from the second dialogue control means during the playback time in which the first response sentence output from the first dialogue control means is being played back to the user by voice; and
response sentence playback means for, when the similarity is equal to or less than a first threshold (that is, the sentences are not similar), playing back a connective word and starting playback of the second response sentence immediately after playback of the first response sentence ends.

According to another embodiment of the terminal of the present invention, it is also preferable that
the connective word storage means divide the similarity values at or below the first threshold into a plurality of predetermined ranges, from high to low, and store a "connective word" in association with each range, and
the response sentence playback means use the connective word storage means to select the connective word corresponding to the similarity.

According to another embodiment of the terminal of the present invention, it is also preferable that the connective word storage means divide the similarity values at or below the first threshold Th1 into three predetermined ranges, from high to low, and store connective words in association with them as follows:
similarity > first threshold Th1: the second response sentence is not played back
first threshold Th1 ≥ similarity > second threshold Th2: additive connective
second threshold Th2 ≥ similarity > third threshold Th3: adversative connective
third threshold Th3 ≥ similarity: transitional (topic-changing) connective
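The three-band mapping from similarity to connective type can be sketched as a small lookup function. This is only an illustrative sketch: the threshold values and the name `select_connective_type` are assumptions made for the example, since the source requires only that Th1 > Th2 > Th3.

```python
from typing import Optional

# Hypothetical threshold values; the source only implies Th1 > Th2 > Th3.
TH1, TH2, TH3 = 0.8, 0.5, 0.2


def select_connective_type(similarity: float) -> Optional[str]:
    """Return the connective type for a similarity value.

    None means the second response sentence is not played back at all.
    """
    if similarity > TH1:
        return None                # responses are similar: skip the second one
    if similarity > TH2:
        return "additive"          # Th1 >= similarity > Th2
    if similarity > TH3:
        return "adversative"       # Th2 >= similarity > Th3
    return "transitional"          # Th3 >= similarity
```

A caller would then look up an actual connective word of the returned type in the connective word storage before playing back the second response sentence.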

According to another embodiment of the terminal of the present invention, it is also preferable that the response sentence similarity calculation means treat the playback time as either
(1) the period until voice playback of the first response sentence is completed, or
(2) the period after voice playback of the first response sentence until an utterance from the user is detected.

According to another embodiment of the terminal of the present invention, it is also preferable that the first dialogue control means and the second dialogue control means be scenario-type or statistical-type dialogue control functions.

According to another embodiment of the terminal of the present invention, it is also preferable that
the first dialogue control means and the second dialogue control means each output a reliability together with the response sentence, and
the response sentence similarity calculation means, when second response sentences are output from a plurality of second dialogue control means during the playback time in which the first response sentence output from the first dialogue control means is being played back to the user by voice, calculate the similarity between the first response sentence and the second response sentence output from the second dialogue control means whose response content has the highest reliability.

According to another embodiment of the terminal of the present invention, it is also preferable that the reliability of the first dialogue control means and of the second dialogue control means be calculated based on an average dialogue correct answer rate P and a real-time control reliability score C.

According to another embodiment of the terminal of the present invention, it is also preferable that the response sentence similarity calculation means extract a plurality of words from the first response sentence and the second response sentence by morphological analysis, calculate vectors from an analysis of the parts of speech or meanings of the words of the first response sentence and the words of the second response sentence, and calculate the cosine similarity of these vectors.

According to the present invention, a program causes a computer mounted on a terminal having a user interface capable of voice dialogue to function as:
first dialogue control means for outputting a first response sentence after the user's utterance is input;
connective word storage means storing "connective words";
second dialogue control means for outputting, after the user's utterance is input, a second response sentence that has a longer response time and a higher reliability of response content than that of the first dialogue control means;
utterance input means for inputting the user's utterance to both the first dialogue control means and the second dialogue control means;
response sentence similarity calculation means for calculating the similarity between the first response sentence and the second response sentence when the second response sentence is output from the second dialogue control means during the playback time in which the first response sentence output from the first dialogue control means is being played back to the user by voice; and
response sentence playback means for, when the similarity is equal to or less than a first threshold (that is, the sentences are not similar), playing back a connective word and starting playback of the second response sentence immediately after playback of the first response sentence ends.

According to the present invention, in a system in which a terminal having a user interface capable of voice dialogue and a dialogue control server are connected via a network,
the terminal comprises:
first dialogue control means for outputting a first response sentence after the user's utterance is input; and
connective word storage means storing "connective words";
the server comprises second dialogue control means for outputting, after the user's utterance is input, a second response sentence that has a longer response time and a higher reliability of response content than that of the first dialogue control means; and
the terminal further comprises:
utterance input means for inputting the user's utterance to the first dialogue control means and transmitting it to the second dialogue control means of the server;
response sentence similarity calculation means for calculating the similarity between the first response sentence and the second response sentence when the second response sentence is received from the second dialogue control means during the playback time in which the first response sentence output from the first dialogue control means is being played back to the user by voice; and
response sentence playback means for, when the similarity is equal to or less than a first threshold (that is, the sentences are not similar), playing back a connective word and starting playback of the second response sentence immediately after playback of the first response sentence ends.

According to the present invention, a system in which a terminal having a user interface capable of voice dialogue and a plurality of dialogue control servers are connected via a network comprises:
a first dialogue control server that returns a first response sentence after receiving the user's utterance; and
a second dialogue control server that outputs, after receiving the user's utterance, a second response sentence that has a longer response time and a higher reliability of response content than that of the first dialogue control server;
and the terminal comprises:
connective word storage means storing "connective words";
utterance input means for transmitting the user's utterance to both the first dialogue control server and the second dialogue control server;
response sentence similarity calculation means for calculating the similarity between the first response sentence and the second response sentence when the second response sentence is received from the second dialogue control server during the playback time in which the first response sentence received from the first dialogue control server is being played back to the user by voice; and
response sentence playback means for, when the similarity is equal to or less than a first threshold (that is, the sentences are not similar), playing back a connective word and starting playback of the second response sentence immediately after playback of the first response sentence ends.

According to the terminal, program, and system of the present invention, response sentences can be played back in a dialogue with a user while taking into account the trade-off between response time and the reliability of the response content.

FIG. 1 is a functional configuration diagram of a terminal according to the present invention.
FIG. 2 is an explanatory diagram showing a first specific playback timing of response sentences.
FIG. 3 is an explanatory diagram showing a second specific playback timing of response sentences.
FIG. 4 is an explanatory diagram showing a third specific playback timing of response sentences.
FIG. 5 is a first system configuration diagram according to the present invention.
FIG. 6 is a second system configuration diagram according to the present invention.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is a functional configuration diagram of a terminal according to the present invention.

As shown in FIG. 1, the terminal 1 is a device such as a smartphone or tablet and has a user interface capable of voice dialogue. When the input device is a microphone, the input voice processing unit 101 converts the voice signal captured by the microphone into an utterance sentence and inputs that sentence to the dialogue control units. When the output device is a speaker, the output voice processing unit 102 converts the response sentence output from a dialogue control unit into a voice signal and outputs that signal to the speaker.

As shown in FIG. 1, the terminal 1 includes an utterance sentence input unit 111, a response sentence similarity calculation unit 112, a response sentence playback unit 113, a connective word storage unit 114, a first dialogue control unit 121, and a second dialogue control unit 122. These functional components are realized by executing a program that causes a computer mounted on the terminal to function.

[First Dialogue Control Unit 121 / Second Dialogue Control Unit 122]
A dialogue control unit establishes a dialogue between a person and the system in the same way as a dialogue between people. It has dialogue logic that recognizes a natural-language utterance and outputs an appropriate response sentence. There are, for example, the following two types of dialogue control function.
Scenario-type dialogue control function
Statistical-type dialogue control function

The scenario-type dialogue control function advances the dialogue along a fixed scenario written in advance by hand. It is task-oriented and suits applications in which the task (goal) to be achieved is clearly defined. The fixed scenario is designed in advance from prior dialogue experience and system know-how, and the transitions between dialogue nodes (or groups of nodes) are fixed.

The statistical-type dialogue control function stores a large number of dialogue nodes and advances a natural dialogue by moving from the current dialogue node to the next node with the highest transition probability. It has no specific task, and transitions to other dialogue nodes are determined by the transition probabilities between preceding and following nodes. These transition probabilities are updated automatically and incrementally by machine learning.
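The "move to the node with the highest transition probability" rule can be illustrated with a small table lookup. The sketch below is an assumption-laden illustration: the node names, the probabilities, and the function name `next_node` are all hypothetical, and a real system would learn the table from dialogue data rather than hard-code it.

```python
from typing import Dict, Tuple

# Hypothetical transition table: current node -> {next node: probability}.
TRANSITIONS: Dict[str, Dict[str, float]] = {
    "greeting": {"ask_weather": 0.6, "ask_schedule": 0.3, "farewell": 0.1},
    "ask_weather": {"farewell": 0.7, "ask_schedule": 0.3},
}


def next_node(current: str) -> Tuple[str, float]:
    """Pick the successor of `current` with the highest transition probability."""
    candidates = TRANSITIONS[current]
    best = max(candidates, key=candidates.get)
    return best, candidates[best]
```

The probability returned alongside the chosen node is the kind of value the description later uses as the real-time control reliability score C(N) for a statistical-type controller.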

Depending on its characteristics, each dialogue control function outputs a different response sentence even when given the same utterance. The reliability Ps of the response content of a response sentence from a dialogue control unit is calculated from the average dialogue correct answer rate P and the real-time control reliability score C as follows.
Ps(N) = P(N) · C(N)
P(N): average dialogue correct answer rate of dialogue control N
C(N): real-time control reliability score of the response sentence from dialogue control N

The average dialogue correct answer rate P(N) is the ratio obtained by inputting dialogue test data prepared in advance into the dialogue control unit and comparing its response sentences with correct answers created beforehand.
Average dialogue correct answer rate P(N) = number of correct responses / total number of responses
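The two formulas above combine into a short calculation. The following is a minimal sketch; the function names and the test counts (65 correct out of 100) are assumptions chosen for illustration, not figures stated for any particular test set in the source.

```python
def average_correct_rate(num_correct: int, num_responses: int) -> float:
    """P(N): fraction of test responses that matched the prepared correct answers."""
    if num_responses <= 0:
        raise ValueError("num_responses must be positive")
    return num_correct / num_responses


def reliability(p: float, c: float) -> float:
    """Ps(N) = P(N) * C(N)."""
    return p * c


# Hypothetical test run: 65 of 100 responses were correct, and the controller
# reports a real-time score C of 0.70 for the current response.
p_n = average_correct_rate(65, 100)
ps_n = reliability(p_n, 0.70)
```

Because P(N) is measured offline and C(N) is produced per response, Ps(N) can differ from one response to the next even for the same controller.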

The real-time control reliability score C(N) is a confidence measure computed in real time for the response sentence output by a dialogue control function. For scenario-type dialogue control, C(N) is the confidence of intent understanding for the user's dialogue input. For example, when the statistical model for intent understanding is trained with a support vector machine (SVM), the distance of the intent-understanding result from the SVM's separating hyperplane is used as the confidence. For the statistical-type dialogue control function, which is based on a statistical model built from a large number of example dialogues by machine learning, C(N) is the maximum transition probability of the selected response sentence.

[Utterance Sentence Input Unit 111]
The utterance sentence input unit 111 inputs the user's utterance sentence output from the input voice processing unit 101 to both the first dialogue control unit 121 and the second dialogue control unit 122.
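Feeding one utterance to both dialogue control units at the same time can be sketched with `asyncio`. This is an illustrative sketch only: the two controller coroutines are hypothetical stand-ins whose sleep times merely imitate a short and a long response time, and the source does not prescribe any particular concurrency mechanism.

```python
import asyncio
from typing import List


async def fast_controller(utterance: str) -> str:
    """Hypothetical stand-in for the first dialogue control unit (short response time)."""
    await asyncio.sleep(0.01)
    return "fast: " + utterance


async def slow_controller(utterance: str) -> str:
    """Hypothetical stand-in for the second dialogue control unit (long response time)."""
    await asyncio.sleep(0.05)
    return "slow: " + utterance


async def dispatch(utterance: str) -> List[str]:
    """Send the same utterance to both controllers and collect replies in arrival order."""
    tasks = [
        asyncio.create_task(fast_controller(utterance)),
        asyncio.create_task(slow_controller(utterance)),
    ]
    arrivals = []
    for finished in asyncio.as_completed(tasks):
        arrivals.append(await finished)
    return arrivals
```

Collecting replies in arrival order mirrors the situation described here, in which the first response sentence is available early and the second arrives while it may still be playing.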

[Response Sentence Similarity Calculation Unit 112]
When the second response sentence is output from the second dialogue control unit 122 during the "playback time" in which the first response sentence output from the first dialogue control unit 121 is being played back to the user by voice, the response sentence similarity calculation unit 112 calculates the "similarity" between the first response sentence and the second response sentence. The response sentence similarity calculation unit 112 determines, in cooperation with the response sentence playback unit 113, whether the voice of the first response sentence is within the "playback time".

The "playback time" can follow either of two patterns.
(1) Until voice playback of the first response sentence is completed
(2) After voice playback of the first response sentence, until an utterance from the user is detected
In other words, as long as the user has not reacted to the first response sentence by speaking, the terminal tries to output a response sentence with the highest possible reliability.
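The two window patterns can be expressed as a small predicate. This sketch rests on assumptions: the enum and function names are hypothetical, and for pattern (2) it assumes the window closes as soon as any user utterance is detected, a detail the source does not spell out for utterances made during playback.

```python
from enum import Enum


class WindowPattern(Enum):
    UNTIL_PLAYBACK_ENDS = 1    # pattern (1)
    UNTIL_NEXT_UTTERANCE = 2   # pattern (2)


def within_playback_window(pattern: WindowPattern,
                           playback_finished: bool,
                           user_has_spoken: bool) -> bool:
    """Return True if a newly arrived second response still counts as 'during playback'."""
    if pattern is WindowPattern.UNTIL_PLAYBACK_ENDS:
        return not playback_finished
    # Pattern (2): the window stays open after playback until the user speaks.
    return not user_has_spoken
```

Pattern (2) gives the slower, more reliable controller extra time, at the cost of possibly speaking again after a pause.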

If the reliability Ps of the second dialogue control unit 122 is lower than the reliability Ps of the first dialogue control unit 121, the response sentence similarity calculation unit 112 ignores the second response sentence without calculating the similarity, even when that sentence is output from the second dialogue control unit 122 during the playback time of the first response sentence. There is no need to go out of the way to play back a second response sentence that is less reliable than the first.

To obtain the "similarity", the response sentence similarity calculation unit 112 first extracts a plurality of words from the first response sentence and the second response sentence by morphological analysis. It then calculates vectors from an analysis of the parts of speech or meanings of the words of the first response sentence and the words of the second response sentence, and calculates the cosine similarity of these vectors (bag-of-words based). Important words may of course be detected in each response sentence using an important-word dictionary that stores important words (nouns) appearing frequently in a dialogue corpus. The cosine similarity is calculated from the words extracted from each response sentence and their noun categories, for example by the following conceptual formula.
Feature vector of the words of the first response sentence: D
Feature vector of the words of the second response sentence: E
Similarity between the two sentences: sim(D, E)
sim(D, E) = cos θ = (D · E) / (|D| |E|)
In this calculation, identical words are given a similarity weight of 1, and words belonging to the same category are also given a similarity weight of 1. The weight for words in the same category can of course be set to any value between 0 and 1.
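The conceptual formula sim(D, E) = (D · E) / (|D| |E|) can be tried out on bag-of-words vectors. The sketch below makes two simplifying assumptions: the sentences are already split into words (the morphological analysis step is left to an external tokenizer), and only exact word matches count, so the same-category weighting described above is not modeled. The function name `cosine_similarity` is chosen for illustration.

```python
import math
from collections import Counter
from typing import Sequence


def cosine_similarity(words_d: Sequence[str], words_e: Sequence[str]) -> float:
    """Cosine similarity of two bag-of-words count vectors; 0.0 if either is empty."""
    d, e = Counter(words_d), Counter(words_e)
    dot = sum(d[w] * e[w] for w in d.keys() & e.keys())
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    norm_e = math.sqrt(sum(v * v for v in e.values()))
    if norm_d == 0.0 or norm_e == 0.0:
        return 0.0
    return dot / (norm_d * norm_e)


# Pre-tokenized stand-ins for the two example responses (hypothetical tokens).
first = ["tomorrow", "schedule", "none"]
second = ["tomorrow", "weather", "sunny"]
score = cosine_similarity(first, second)
```

With only "tomorrow" in common among three words each, the two example responses score about one third, which would fall at or below a first threshold set anywhere above that value.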

[Response Sentence Playback Unit 113]
When the similarity is equal to or less than the first threshold (the similarity is low), the response sentence playback unit 113 starts playback of the second response sentence immediately after playback of the first response sentence ends (see FIG. 2, described later).
When the second response sentence is output from the second dialogue control unit 122 after voice playback of the first response sentence has ended, the response sentence playback unit 113 deliberately does not play back the second response sentence (see FIG. 3, described later).
Furthermore, when the similarity is higher than the first threshold (the similarity is high), the response sentence playback unit 113 deliberately does not play back the second response sentence (see FIG. 4, described later).
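The three playback cases above reduce to a short decision function. This is a sketch under assumptions: the names `Action` and `decide` are hypothetical, the threshold is passed in by the caller, and the similarity is assumed to have been computed already.

```python
from enum import Enum


class Action(Enum):
    PLAY_SECOND = "play the second response right after the first"
    SKIP_SECOND = "do not play the second response"


def decide(arrived_during_playback: bool, similarity: float, th1: float) -> Action:
    """Choose what to do with a second response sentence."""
    if not arrived_during_playback:
        return Action.SKIP_SECOND   # arrived too late (the FIG. 3 case)
    if similarity > th1:
        return Action.SKIP_SECOND   # too similar to be worth repeating (the FIG. 4 case)
    return Action.PLAY_SECOND       # dissimilar and in time (the FIG. 2 case)
```

For example, `decide(True, 0.33, 0.8)` yields `Action.PLAY_SECOND`, while the same similarity with `arrived_during_playback=False` yields `Action.SKIP_SECOND`.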

[Connective Word Storage Unit 114]
The connective word storage unit 114 stores "connective words". Immediately after playback of the first response sentence ends, the response sentence playback unit 113 may play back a connective word and then start playback of the second response sentence. The connective word storage unit 114 stores the "connective word" associated with each predetermined range of the similarity.

FIG. 2 is an explanatory diagram showing a first specific playback timing of response sentences.

Assume that the first dialogue control unit 121 and the second dialogue control unit 122 stand in the following trade-off relationship.
First dialogue control unit 121: short response time, but low reliability of response content
Second dialogue control unit 122: long response time, but high reliability of response content
The reliabilities are determined, for example, as follows.
First dialogue control unit 121: average response time = 0.7 seconds
Average dialogue correct answer rate P(1) = 0.65
System reliability C(1) = 0.70
Reliability Ps(1) = C(1) × P(1) = 0.455
Second dialogue control unit 122: average response time = 1.0 second
Average dialogue correct answer rate P(2) = 0.85
System reliability C(2) = 0.70
Reliability Ps(2) = C(2) × P(2) = 0.595
That is, compared with the second dialogue control unit, the first dialogue control unit has a shorter response time but a lower reliability of response content.

(S21) Suppose the user says "How is the weather tomorrow?" into the microphone. The input voice processing unit 101 converts the speech into an utterance sentence by voice recognition and outputs it to the utterance sentence input unit 111. The utterance sentence input unit 111 then inputs the following utterance sentence to both the first dialogue control unit 121 and the second dialogue control unit 122.
"How is the weather tomorrow?"

(S221) Assume that the first dialogue control unit 121, which has the shorter response time, is the first to output the following response sentence to the response sentence similarity calculation unit 112.
“There are no plans tomorrow”
Assume that, although its response time is short, the reliability of its response content is relatively low, so the first dialogue control unit 121 has mistakenly judged that the user asked about the schedule when the user actually asked about the weather.
(S222) The response sentence similarity calculation unit 112 outputs the response sentence as it is to the response sentence reproduction unit 113.
“There are no plans tomorrow”
(S223) The response sentence reproduction unit 113 has the output voice processing unit 102 utter the sentence sequentially as follows, and responds to the user from the speaker.
“a-su yo-te-i wa a-ri-ma-se-n” (“There are no plans tomorrow”)
(Uttering this speech takes, for example, 1.6 seconds.)
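Steps S21 to S223 amount to sending one utterance to both dialogue control units at once and handling whichever response arrives first. The sketch below illustrates that with `asyncio`. The function names `dialogue_controller` and `dispatch`, the canned replies, and the `time_scale` factor (which shortens the 0.7 s and 1.0 s delays so the example runs quickly) are all assumptions made for illustration, not part of the patent.

```python
import asyncio


async def dialogue_controller(unit, delay_s, utterance, reply):
    """Hypothetical stand-in for a dialogue control unit: waits, then answers."""
    await asyncio.sleep(delay_s)
    return unit, reply


async def dispatch(utterance, time_scale=0.1):
    # The same utterance is given to both controllers at the same time.
    tasks = [
        asyncio.create_task(dialogue_controller(
            121, 0.7 * time_scale, utterance, "There are no plans tomorrow")),
        asyncio.create_task(dialogue_controller(
            122, 1.0 * time_scale, utterance, "Tomorrow's weather is sunny")),
    ]
    arrivals = []
    # Responses are collected in the order in which they complete.
    for next_done in asyncio.as_completed(tasks):
        arrivals.append(await next_done)
    return arrivals


arrivals = asyncio.run(dispatch("How is the weather tomorrow?"))
for unit, reply in arrivals:
    print(unit, reply)
```

In this sketch the faster unit 121 always appears first in `arrivals`, so its response would start playing while unit 122 is still working.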

(S231) Next, assume that the second dialogue control unit 122, which has the longer response time, outputs the following response sentence to the response sentence similarity calculation unit 112. Here, assume that this response sentence is output within 1.6 seconds after the response sentence reproduction unit 113 starts uttering the first response sentence in S223.
“Tomorrow's weather is sunny”

(S232) The response sentence similarity calculation unit 112 recognizes that the first response sentence “There are no plans tomorrow”, output from the first dialogue control unit 121, is within its reproduction time, that is, it is currently being reproduced by voice to the user. When the second response sentence “Tomorrow's weather is sunny” is output from the second dialogue control unit 122 during this reproduction time, the unit calculates the similarity between the first response sentence and the second response sentence.
First response sentence: “There are no plans tomorrow”
Words detected as the vector: “tomorrow”, “plan”, “not”
Second response sentence: “Tomorrow's weather is sunny”
Words detected as the vector: “tomorrow”, “weather”, “sunny”
In this case, the cosine distance between the first response sentence and the second response sentence is 0.33.
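The value 0.33 can be checked with a small calculation. The sketch below assumes binary word-presence vectors over the union of the two word lists; the description does not state how the vectors are weighted, so this weighting is an assumption, chosen because it reproduces the quoted figure (one shared word out of three on each side gives 1/3). The function name `cosine_similarity` is illustrative.

```python
import math


def cosine_similarity(words_a, words_b):
    """Cosine of the angle between binary word-presence vectors (assumed weighting)."""
    vocabulary = sorted(set(words_a) | set(words_b))
    vec_a = [1 if word in words_a else 0 for word in vocabulary]
    vec_b = [1 if word in words_b else 0 for word in vocabulary]
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
    return dot / norm if norm else 0.0


first_words = ["tomorrow", "plan", "not"]
second_words = ["tomorrow", "weather", "sunny"]
similarity = cosine_similarity(first_words, second_words)
print(round(similarity, 2))
```

Note that the number is used as a similarity in the steps that follow: a higher value means the two response sentences are closer in content.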

According to FIG. 2, assume that this cosine distance is equal to or less than the first threshold Th1, so that it is determined that the second response sentence should be reproduced. In this case, the response sentence reproduction unit 113 starts reproduction of the second response sentence immediately following the end of reproduction of the first response sentence.

(S233) In the connection word storage unit 114, the similarity at or below the first threshold is divided into a plurality of predetermined ranges from higher to lower. The response sentence reproduction unit 113 selects and reproduces the connection word that corresponds to the applicable similarity range in the connection word storage unit 114.

In the connection word storage unit 114, the similarity at or below the first threshold Th1 is divided into three predetermined ranges from higher to lower, and connection words are stored in association with those ranges as follows.
Similarity > first threshold Th1: the second response sentence is not reproduced
First threshold Th1 ≧ similarity > second threshold Th2: “additive” connection words
“Additive” -> “furthermore” (sarani), “moreover” (sonoue)
Second threshold Th2 ≧ similarity > third threshold Th3: “adversative” connection words
“Adversative” -> “however” (shikashi), “but” (keredomo), “I would like to say so, but” (to iitai desu ga), “even so” (dakara to itte)
Third threshold Th3 ≧ similarity: “topic-changing” connection words
“Topic-changing” -> “by the way” (tokorode), “now then” (sate), “this is a different matter from before, but” (sakihodo to betsu no koto desu ga)
For example, Th1 = 0.8, Th2 = 0.5, and Th3 = 0.3 may be set. The lower the similarity, the more strongly the selected connection word changes the topic.
According to FIG. 2, when the cosine distance between the first response sentence and the second response sentence is, for example, 0.33, the response sentence reproduction unit 113 selects the adversative connection word “but” (keredomo).
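The range lookup performed against the connection word storage unit 114 can be sketched as a simple threshold cascade. The threshold values are the example values given in the text (0.8, 0.5, 0.3); the function name `connection_category`, the category labels, and the English renderings of the connection words are illustrative assumptions.

```python
TH1, TH2, TH3 = 0.8, 0.5, 0.3  # example thresholds from the description

# English renderings of the stored connection words, grouped by category.
CONNECTION_WORDS = {
    "additive": ["furthermore", "moreover"],
    "adversative": ["however", "but", "I would like to say so, but", "even so"],
    "topic-changing": ["by the way", "now then",
                       "this is a different matter from before, but"],
}


def connection_category(similarity):
    """Return the connection word category, or None if the second
    response sentence should not be reproduced at all."""
    if similarity > TH1:
        return None              # too similar: do not repeat the same content
    if similarity > TH2:
        return "additive"        # Th1 >= similarity > Th2
    if similarity > TH3:
        return "adversative"     # Th2 >= similarity > Th3
    return "topic-changing"      # Th3 >= similarity


category = connection_category(0.33)
print(category, CONNECTION_WORDS[category])
```

A similarity of 0.33 falls in the range Th2 ≧ similarity > Th3, so an adversative word such as “but” is chosen, as in the example of FIG. 2.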

(S234) The response sentence reproduction unit 113 reproduces as follows.
“a-su yo-te-i wa a-ri-ma-se-n” (1.6 seconds)
“ke-re-do-mo” (0.5 seconds)

(S235) The response sentence similarity calculation unit 112 outputs the second response sentence “Tomorrow's weather is sunny” to the response sentence reproduction unit 113.
(S236) The response sentence reproduction unit 113 reproduces as follows.
“a-su yo-te-i wa a-ri-ma-se-n” (1.6 seconds)
“ke-re-do-mo” (0.5 seconds)
“a-su no te-n-ki wa ha-re de-su” (1.9 seconds)

As is clear from FIG. 2, the first response sentence output from the first dialogue control unit, which has a short response time and low reliability, is reproduced, and when the second response sentence is output from the second dialogue control unit, which has a long response time and high reliability, during that reproduction time, the second response sentence is reproduced in succession to the first response sentence. In addition, when the similarity between the first response sentence and the second response sentence is lower than the first threshold, a connection word corresponding to that similarity is inserted between the two response sentences to join them, so that the response feels as natural as possible to the user.

FIG. 3 is an explanatory diagram showing a second specific reproduction timing of the response sentence.

According to FIG. 3, when the second response sentence “Tomorrow's weather is sunny” is output from the second dialogue control unit 122 after voice reproduction of the first response sentence “There are no plans tomorrow” has ended, the response sentence reproduction unit 113 deliberately does not reproduce the second response sentence. Normally, once reproduction of a response sentence has paused, the user tries to react immediately. If the first response sentence “There are no plans tomorrow” were reproduced, followed by a pause and then by the second response sentence “Tomorrow's weather is sunny”, the second response sentence would be likely to collide with the user's utterance.

FIG. 4 is an explanatory diagram showing a third specific reproduction timing of the response sentence.

According to FIG. 4, when the similarity is higher than the first threshold (that is, the two response sentences are similar), the response sentence reproduction unit 113 deliberately does not reproduce the second response sentence. If the similarity is high, reproducing the second response sentence would mean reproducing response sentences with the same meaning twice in succession.
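The three cases of FIGS. 2 to 4 can be summarized as one decision. The sketch below is an illustration under stated assumptions: times are measured in seconds from the start of reproduction of the first response sentence, the reproduction time is taken to end when the first response sentence finishes, and the function name `plan_playback` and its return labels are hypothetical. The default threshold is the example value Th1 = 0.8.

```python
def plan_playback(first_duration_s, second_arrival_s, similarity, th1=0.8):
    """Decide what to do with the second response sentence.

    Returns "append" (FIG. 2), "skip_late" (FIG. 3) or "skip_similar" (FIG. 4).
    """
    if second_arrival_s > first_duration_s:
        # FIG. 3: the first sentence has already finished, so speaking again
        # could collide with the user's next utterance.
        return "skip_late"
    if similarity > th1:
        # FIG. 4: the second sentence would only repeat the first one.
        return "skip_similar"
    # FIG. 2: play a connection word, then the second sentence, with no gap.
    return "append"


print(plan_playback(1.6, 0.3, 0.33))   # second arrives during playback
print(plan_playback(1.6, 2.0, 0.33))   # second arrives after playback ends
print(plan_playback(1.6, 0.3, 0.90))   # second is nearly the same sentence
```

Checking the arrival time first mirrors the order of the description: the similarity is only calculated when the second response sentence arrives within the reproduction time.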

FIG. 5 is a first system configuration diagram according to the present invention.

According to FIG. 5, the first dialogue control unit 121 is provided in the terminal 1, and the second dialogue control unit 122 is provided in an external server reached via a network. Because the terminal 1 generally has a small amount of memory and low processing capability, it can carry only a lightweight dialogue control unit whose response time is short but whose response content has low reliability. The external server, on the other hand, has a large amount of memory and high processing capability, so it can carry a dialogue control unit whose response content has high reliability, although its response time is long.

FIG. 6 is a second system configuration diagram according to the present invention.

According to FIG. 6, both the first dialogue control unit 121 and the second dialogue control unit 122 are provided in external servers reached via a network. Because the terminal 1 has no dialogue control function of its own, the response time is relatively long, but the varied dialogue control functions of the external servers can be used.

<Three or more dialogue control units>
The embodiment described above uses two dialogue control units, but three or more may of course be provided. The dialogue control units differ from one another in the response time of the dialogue and in the reliability of the response content. Specifically, with the first dialogue control unit 121 as the reference, a plurality of second dialogue control units 122 of different types may be provided.

For example, assume that the reliability is set for each dialogue control unit as follows.
First dialogue control unit 121: reliability Ps(1) = 0.455
First response sentence: “There are no plans tomorrow”
Second dialogue control unit (2-1) 122: reliability Ps(2) = 0.595
Response sentence (2-1): “Tomorrow's weather is sunny”
Second dialogue control unit (2-2) 123: reliability Ps(3) = 0.720
Response sentence (2-2): “The chance of precipitation tomorrow is 20%”

When second response sentences are output from both the second dialogue control unit (2-1) 122 and the second dialogue control unit (2-2) 123 during the reproduction time in which the first response sentence output from the first dialogue control unit 121 is being reproduced by voice to the user, the response sentence similarity calculation unit 112 calculates the similarity between the first response sentence and the second response sentence output from the second dialogue control unit (2-2) 123, whose response content has the highest reliability Ps. Here, assume that the cosine distance between the first response sentence and response sentence (2-2) is, for example, 0.29.

In response, the response sentence reproduction unit 113 selects the topic-changing connection word “by the way” (tokorode), since 0.29 is at or below the third threshold Th3 = 0.3. Finally, the response sentence reproduction unit 113 reproduces as follows.
“a-su yo-te-i wa a-ri-ma-se-n” (1.6 seconds)
“to-ko-ro-de” (0.5 seconds)
“a-su no ko-u-su-i-ka-ku-ri-tsu wa ni-ju-p-pa-a-se-n-to de-su” (2.7 seconds)
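When several second response sentences arrive during the reproduction time, only the one with the highest reliability Ps is compared with the first response sentence. A minimal sketch of that selection follows; the tuple layout, the labels "2-1" and "2-2", and the function name `pick_most_reliable` are assumptions for illustration, while the reliabilities and sentences are the example values from the text.

```python
# (label, reliability Ps, response sentence) for each second response received
# during reproduction of the first response sentence.
candidates = [
    ("2-1", 0.595, "Tomorrow's weather is sunny"),
    ("2-2", 0.720, "The chance of precipitation tomorrow is 20%"),
]


def pick_most_reliable(responses):
    """Return the response whose dialogue control unit has the highest Ps."""
    return max(responses, key=lambda response: response[1])


label, reliability, sentence = pick_most_reliable(candidates)
print(label, reliability, sentence)
```

The selected sentence is then handled exactly as in the two-controller case: its similarity to the first response sentence (0.29 in the example) determines the connection word.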

As described in detail above, the terminal, program, and system of the present invention can reproduce response sentences in a dialogue with the user while taking into account the trade-off between the response time and the reliability of the response content. In particular, the response time to the user's utterance sentence is shortened, and within the series of response sentences, a response sentence with the highest possible reliability is ultimately reproduced.

The present invention is suited to a voice dialogue system in which a “character agent” is displayed on the display of a terminal in response to a user operation, and the user and the agent carry on a dialogue by voice.

For the various embodiments of the present invention described above, those skilled in the art can readily make various changes, modifications, and omissions within the scope of the technical idea and viewpoint of the present invention. The foregoing description is merely an example and is not intended to be restrictive. The present invention is limited only by the claims and their equivalents.

DESCRIPTION OF SYMBOLS
1 Terminal
111 Utterance sentence input unit
112 Response sentence similarity calculation unit
113 Response sentence reproduction unit
114 Connection word storage unit
121 First dialogue control unit
122 Second dialogue control unit
21, 22 Server

Claims

A terminal having a user interface capable of voice interaction, the terminal comprising:
first dialogue control means for outputting a first response sentence after a user's utterance sentence is input;
connection word storage means storing “connection words”;
second dialogue control means for outputting, after the user's utterance sentence is input, a second response sentence with a longer response time and higher reliability of response content than the first dialogue control means;
utterance sentence input means for inputting the user's utterance sentence to both the first dialogue control means and the second dialogue control means;
response sentence similarity calculation means for calculating a similarity between the first response sentence and the second response sentence when the second response sentence is output from the second dialogue control means during a reproduction time in which the first response sentence output from the first dialogue control means is being reproduced by voice to the user; and
response sentence reproduction means for, when the similarity is equal to or less than a first threshold and the sentences are therefore not similar, reproducing the connection word and starting reproduction of the second response sentence immediately after reproduction of the first response sentence ends.

The terminal according to claim 1, wherein the connection word storage means stores a “connection word” in association with each of a plurality of predetermined ranges into which the similarity, at or below the first threshold, is divided from higher to lower, and
the response sentence reproduction means uses the connection word storage means to select the connection word corresponding to the similarity.

The terminal according to claim 2, wherein, in the connection word storage means, the similarity at or below the first threshold Th1 is divided into three predetermined ranges from higher to lower, and connection words are associated with them as follows:
Similarity > first threshold Th1: the second response sentence is not reproduced
First threshold Th1 ≧ similarity > second threshold Th2: additive connection word
Second threshold Th2 ≧ similarity > third threshold Th3: adversative connection word
Third threshold Th3 ≧ similarity: topic-changing connection word

The terminal according to any one of claims 1 to 3, wherein the response sentence similarity calculation means treats as the reproduction time either (1) the period until voice reproduction of the first response sentence is completed, or (2) the period after voice reproduction of the first response sentence ends until an utterance sentence from the user is detected.

The terminal according to any one of claims 1 to 4 , wherein the first dialog control means and the second dialog control means have a scenario type or statistical type dialog control function.

The terminal according to any one of claims 1 to 5, wherein each of the first dialogue control means and the second dialogue control means outputs a reliability together with its response sentence,
a plurality of the second dialogue control means are provided, and
when second response sentences are output from the plurality of second dialogue control means during the reproduction time in which the first response sentence output from the first dialogue control means is being reproduced by voice to the user, the response sentence similarity calculation means calculates the similarity between the first response sentence and the second response sentence output from the second dialogue control means whose response content has the highest reliability.

The terminal according to claim 6, wherein the reliability of the first dialogue control means and of the second dialogue control means is calculated based on an average dialogue correct answer rate P and a real-time control confidence score C.

The terminal according to any one of claims 1 to 7, wherein the response sentence similarity calculation means extracts a plurality of words from each of the first response sentence and the second response sentence by morphological analysis, calculates vectors obtained by analyzing the part of speech or meaning of the words of the first response sentence and the words of the second response sentence, and calculates the cosine similarity of these vectors.

A program for causing a computer mounted on a terminal having a user interface capable of voice interaction to function as:
first dialogue control means for outputting a first response sentence after a user's utterance sentence is input;
connection word storage means storing “connection words”;
second dialogue control means for outputting, after the user's utterance sentence is input, a second response sentence with a longer response time and higher reliability of response content than the first dialogue control means;
utterance sentence input means for inputting the user's utterance sentence to both the first dialogue control means and the second dialogue control means;
response sentence similarity calculation means for calculating a similarity between the first response sentence and the second response sentence when the second response sentence is output from the second dialogue control means during a reproduction time in which the first response sentence output from the first dialogue control means is being reproduced by voice to the user; and
response sentence reproduction means for, when the similarity is equal to or less than a first threshold and the sentences are therefore not similar, reproducing the connection word and starting reproduction of the second response sentence immediately after reproduction of the first response sentence ends.

A system in which a terminal having a user interface capable of voice interaction and a dialogue control server are connected via a network, wherein
the terminal has
first dialogue control means for outputting a first response sentence after a user's utterance sentence is input, and
connection word storage means storing “connection words”,
the server has second dialogue control means for outputting, after the user's utterance sentence is input, a second response sentence with a longer response time and higher reliability of response content than the first dialogue control means, and
the terminal further has
utterance sentence input means for inputting the user's utterance sentence to the first dialogue control means and transmitting it to the second dialogue control means of the server,
response sentence similarity calculation means for calculating a similarity between the first response sentence and the second response sentence when the second response sentence is received from the second dialogue control means during a reproduction time in which the first response sentence output from the first dialogue control means is being reproduced by voice to the user, and
response sentence reproduction means for, when the similarity is equal to or less than a first threshold and the sentences are therefore not similar, reproducing the connection word and starting reproduction of the second response sentence immediately after reproduction of the first response sentence ends.

A system in which a terminal having a user interface capable of voice interaction and a plurality of dialogue control servers are connected via a network, the system comprising:
a first dialogue control server that returns a first response sentence after receiving a user's utterance sentence; and
a second dialogue control server that, after receiving the user's utterance sentence, outputs a second response sentence with a longer response time and higher reliability of response content than the first dialogue control server, wherein
the terminal has
connection word storage means storing “connection words”,
utterance sentence input means for transmitting the user's utterance sentence to both the first dialogue control server and the second dialogue control server,
response sentence similarity calculation means for calculating a similarity between the first response sentence and the second response sentence when the second response sentence is received from the second dialogue control server during a reproduction time in which the first response sentence received from the first dialogue control server is being reproduced by voice to the user, and
response sentence reproduction means for, when the similarity is equal to or less than a first threshold and the sentences are therefore not similar, reproducing the connection word and starting reproduction of the second response sentence immediately after reproduction of the first response sentence ends.