JP5818753B2

JP5818753B2 - Spoken dialogue system and spoken dialogue method

Info

Publication number: JP5818753B2
Application number: JP2012179294A
Authority: JP
Inventors: 尚義永江; 裕美若木; 憲治岩田; 康顕有賀
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2012-08-13
Filing date: 2012-08-13
Publication date: 2015-11-18
Anticipated expiration: 2032-08-13
Also published as: JP2014038150A

Description

Embodiments relate to spoken dialogue.

Systems that provide a service matching a user's request through spoken dialogue with that user (hereinafter referred to as spoken dialogue systems) are known. Conventionally, for example, certain car navigation systems provide information desired by the user (for example, route guidance information) through spoken dialogue. In recent years, the use of such spoken dialogue systems has been expanding to, for example, applications for mobile phones.

In a spoken dialogue conducted by a spoken dialogue system, voice output of a response sentence from the system and voice input from the user (that is, the user's utterance) alternate repeatedly. In general, the longer the response sentence, the more time it takes to complete its voice output. Therefore, a spoken dialogue system that does not accept the user's utterance until voice output of the response sentence is complete restricts when the user can start speaking, which makes it difficult for the dialogue to proceed smoothly.

Spoken dialogue systems that can accept a user's utterance in the middle of the voice output of a response sentence (that is, systems with a so-called barge-in function) have been proposed. One conventional barge-in function treats the user's utterance as having been made in response to the entire response sentence, including the part not yet output. However, because the user has heard only part of the response sentence, the user may start speaking while misunderstanding the meaning of the response sentence as a whole. In that case, if the utterance is treated as a response to the entire response sentence, the user's utterance and the utterance expected by the system do not match, and the progress of the spoken dialogue is hindered.

Another conventional barge-in function cancels the voice output of the response sentence and then processes the user's utterance using, for example, a past dialogue state. However, the user may barge in not to rephrase an earlier utterance but to reply to the response sentence being output. In that case, if the voice output of the response sentence is canceled and the user's utterance is processed using a past dialogue state, the user's utterance and the utterance expected by the system do not match, and the progress of the spoken dialogue is hindered.

Japanese translation of PCT publication No. 2002-527829

An object of the embodiments is to allow a spoken dialogue to proceed smoothly even when a barge-in from the user occurs.

According to an embodiment, a spoken dialogue system includes an output unit, an input unit, a detection unit, an acquisition unit, a first understanding unit, an update unit, a speech recognition unit, a second understanding unit, and a control unit. The output unit outputs, as voice, a response sentence corresponding to the current dialogue state. The input unit inputs a user's utterance. The detection unit detects the start timing of the user's utterance and generates timing information. The acquisition unit acquires, based on information about the response sentence and the timing information, a partial response sentence, which is the part of the response sentence that had already been output as voice at the start timing of the utterance. The first understanding unit obtains one or more first understanding results by understanding the meaning of the partial response sentence. The update unit generates, based on the one or more first understanding results, an alternative dialogue state for updating the current dialogue state. The speech recognition unit obtains utterance text by performing speech recognition processing on the utterance. The second understanding unit obtains a second understanding result by understanding the intention of the user's utterance based on at least the utterance text. The control unit updates the current dialogue state with the alternative dialogue state, then processes the second understanding result, and controls transitions between dialogue states.

FIG. 1 is a block diagram illustrating a spoken dialogue system according to a first embodiment.
FIG. 2 is a diagram showing an example of how a spoken dialogue proceeds when no barge-in from the user occurs.
FIG. 3 is a diagram showing an example of how a spoken dialogue proceeds when a barge-in from the user occurs.
FIG. 4 is an explanatory diagram of barge-in from the user.
FIG. 5 is an explanatory diagram of how the timing of a barge-in affects the user's utterance intention.
FIG. 6 is a diagram illustrating the operation of the spoken dialogue system according to the first embodiment and of a spoken dialogue system according to a comparative example when no barge-in from the user occurs.
FIG. 7 is a diagram illustrating the operation of the spoken dialogue system according to the comparative example when a barge-in from the user occurs.
FIG. 8 is a diagram illustrating the operation of the spoken dialogue system according to the first embodiment when a barge-in from the user occurs.
FIG. 9 is a block diagram illustrating a spoken dialogue system according to a second embodiment.
FIG. 10 is a block diagram illustrating a spoken dialogue system according to a third embodiment.
FIG. 11 is a diagram illustrating the operation of the spoken dialogue system according to the second embodiment when a barge-in from the user occurs.
FIG. 12 is an explanatory diagram of a confirmation sentence.
FIG. 13 is a diagram illustrating the operation of the spoken dialogue system according to the comparative example when a barge-in from the user occurs.
FIG. 14 is an explanatory diagram of the operation of the meaning determination unit of FIG. 10.
FIG. 15 is a diagram illustrating a technique for generating an alternative dialogue state.

Embodiments are described below with reference to the drawings. Elements that are identical or similar to elements already described are given identical or similar reference numerals, and redundant description is generally omitted.
(First embodiment)
When no barge-in from the user occurs (that is, when the user speaks after listening to the response sentence from the system to the end), the spoken dialogue generally proceeds as shown, for example, in FIG. 2. In the example of FIG. 2, the user properly understands the response sentence from the system and then speaks in reply to it, so the utterance falls within what the system expects and the spoken dialogue proceeds smoothly.

In general, however, it is not easy to properly understand a user's utterance intention when a barge-in from that user occurs. Specifically, a barge-in from the user can occur at various timings, as illustrated in FIG. 4. The portion of the response sentence that the user is able to hear depends on the timing of the barge-in, so, as illustrated in FIG. 5, the user's understanding of the response sentence also depends on that timing. Because the user speaks based on this understanding, the user's utterance intention may differ depending on the timing of the barge-in even when the content of the utterance is the same, as illustrated in FIG. 5.

Even when a barge-in from the user occurs, the spoken dialogue system according to the first embodiment understands the meaning of the portion of the response sentence that the user heard (hereinafter referred to as a partial response sentence) and uses this understanding result to process the user's utterance. This spoken dialogue system therefore allows the spoken dialogue to proceed smoothly.

Specifically, when a barge-in from the user occurs (that is, when the user speaks without listening to the response sentence from the system to the end), the spoken dialogue proceeds as shown, for example, in FIG. 3. In the example of FIG. 3, the system outputs the response sentence (A2) "You don't want Greek government bonds with a 100% annual interest rate, do you? Then which would you like, Italian government bonds at 7% or Japanese government bonds at 1%?", and the user says "Yes!" (B3) after hearing only "Greek government bonds with a 100% annual interest rate". In this case, because the user has not listened to the response sentence (A2) to the end, the user may misunderstand the meaning of what was heard (C2) and make an utterance that the system does not expect. As described later, the spoken dialogue system according to this embodiment extracts a partial response sentence from the response sentence based on the start timing of the user's utterance, understands its meaning, and uses the meaning understanding result to update the dialogue state. This spoken dialogue system can therefore appropriately handle utterances that the system would otherwise not expect.

As shown in FIG. 1, the spoken dialogue system according to this embodiment includes a voice input unit 101, a speech recognition unit 102, an utterance intention understanding unit 103, a dialogue state control unit 104, a dialogue history storage unit 105, a voice output unit 106, an utterance detection unit 107, a partial response sentence acquisition unit 108, a partial response sentence understanding unit 109, and a dialogue state update unit 110.

The spoken dialogue system of FIG. 1 may be implemented by a single device or across two or more devices. For example, some elements of the spoken dialogue system of FIG. 1 may be provided in a server device, and the remaining elements may be provided in an electronic device (for example, a mobile phone, a PC, or a portable media player) that can connect to the server device via a network.

The voice input unit 101 may be implemented using, for example, a microphone. In this case, the voice input unit 101 can obtain an utterance signal by converting input voice from the user into an electrical signal. Alternatively, the voice input unit 101 may receive an utterance signal via a network from a device not shown in FIG. 1. The voice input unit 101 outputs the utterance signal to the speech recognition unit 102 and the utterance detection unit 107. When the voice input unit 101 is provided in a device separate from the speech recognition unit 102 and the utterance detection unit 107, the voice input unit 101 may output (transmit) the utterance signal via a network.

The speech recognition unit 102 receives the utterance signal from the voice input unit 101. The speech recognition unit 102 performs speech recognition processing on the utterance signal to obtain utterance text representing the content of the user's utterance. The speech recognition unit 102 may implement the speech recognition processing using any technique.

The utterance intention understanding unit 103 receives the utterance text from the speech recognition unit 102. The utterance intention understanding unit 103 obtains an utterance intention understanding result by understanding the intention of the user's utterance based on the utterance text. Specifically, the utterance intention understanding unit 103 may understand the intention of the user's utterance from the utterance text by using a technique that is the same as or similar to that of the partial response sentence understanding unit 109 described later. Here, the utterance intention understanding result means data that expresses the intention of the utterance text in abstract form. The utterance intention understanding unit 103 outputs the utterance intention understanding result to the dialogue state control unit 104.

When the utterance text is ambiguous (for example, "Yes", "No", or "That's right"), it is difficult to identify the intention of the user's utterance with high accuracy from the utterance text alone. The utterance intention understanding unit 103 may therefore understand the intention of the user's utterance based on, in addition to the utterance text, the dialogue history from the dialogue history storage unit 105 and the meaning understanding result of the partial response sentence from the partial response sentence understanding unit 109. By taking into account at least one of the dialogue history and the meaning understanding result, the utterance intention understanding unit 103 can understand the intention of the user's utterance more accurately.

The dialogue state control unit 104 receives the utterance intention understanding result from the utterance intention understanding unit 103. The dialogue state control unit 104 processes the utterance intention understanding result using the current dialogue state and, as necessary, controls transitions between dialogue states (including self-transitions). Specifically, the spoken dialogue system of FIG. 1 prepares a dialogue state for each scenario and realizes a spoken dialogue with the user by transitioning between dialogue states according to the utterance intention understanding result.

Here, a dialogue state is a parameter that determines the behavior of the spoken dialogue system. A dialogue state indicates, for example, the response sentence that the spoken dialogue system outputs to the user, the words that the system can accept in reply to that response sentence, and the operation of the system when such an acceptable word is input.
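
The three items that a dialogue state is said to indicate can be pictured as a small record. The following is only an illustrative sketch, not the data structure of the embodiment: the class name `DialogState`, its fields, and the state names are hypothetical, and treating an unaccepted word as a self-transition is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DialogState:
    """Hypothetical record of what a dialogue state indicates."""
    # Response sentence the system outputs in this state.
    response_sentence: str
    # Acceptable word -> name of the state to move to when it is input.
    transitions: Dict[str, str] = field(default_factory=dict)

    def accepts(self, word: str) -> bool:
        """True if the word is acceptable in reply to the response sentence."""
        return word in self.transitions

    def next_state(self, word: str, current_name: str) -> str:
        """Assumption: an unaccepted word causes a self-transition."""
        return self.transitions.get(word, current_name)


choose_bond = DialogState(
    response_sentence="Which would you like, Italian or Japanese government bonds?",
    transitions={"Italy": "ask_amount_italy", "Japan": "ask_amount_japan"},
)
print(choose_bond.accepts("Yes"))  # False
```

In this picture, the barge-in problem discussed above is that "Yes" is not among the acceptable words of the state the system believes it is in.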

By processing the utterance intention understanding result, the dialogue state control unit 104 determines the response sentence to be output to the user next. The dialogue state control unit 104 outputs the response sentence to the voice output unit 106. In addition, the dialogue state control unit 104 outputs the current dialogue state and the utterance intention understanding result to the dialogue history storage unit 105 as dialogue history.

Furthermore, the dialogue state control unit 104 can receive an alternative dialogue state from the dialogue state update unit 110 described later. The dialogue state control unit 104 updates the current dialogue state with the alternative dialogue state and then processes the utterance intention understanding result.

The dialogue history storage unit 105 stores the dialogue history from the dialogue state control unit 104. The dialogue history stored in the dialogue history storage unit 105 is read by the utterance intention understanding unit 103 as necessary.

The voice output unit 106 may be implemented using, for example, a speaker. In this case, the voice output unit 106 can output the response sentence from the dialogue state control unit 104 as voice. When the response sentence is in text format, the voice output unit 106 may output it as voice by performing, for example, any text-to-speech processing. Alternatively, when the response sentence is in audio format, the voice output unit 106 may output it as voice by performing the necessary signal processing. Alternatively, the voice output unit 106 may transmit the response sentence via a network to a device not shown in FIG. 1. When the voice output unit 106 is provided in a device separate from the dialogue state control unit 104, the voice output unit 106 may input (receive) the response sentence via a network.

Furthermore, the voice output unit 106 outputs information about the response sentence to the partial response sentence acquisition unit 108. The information about the response sentence includes, for example, character string information indicating the character string that constitutes the response sentence, and time information indicating the voice output time of each character in that character string. When the voice output unit 106 is provided in a device separate from the partial response sentence acquisition unit 108, the voice output unit 106 may output (transmit) the information about the response sentence via a network.

The utterance detection unit 107 receives the utterance signal from the voice input unit 101. The utterance detection unit 107 detects the start timing of the user's utterance based on the utterance signal. The start timing means, for example, the time at which the utterance started, or the number of characters of the response sentence that had already been output as voice when the utterance started. The utterance detection unit 107 generates timing information using the detected start timing and outputs the timing information to the partial response sentence acquisition unit 108. When the utterance detection unit 107 is provided in a device separate from the voice input unit 101, the utterance detection unit 107 may input (receive) the utterance signal via a network. When the utterance detection unit 107 is provided in a device separate from the partial response sentence acquisition unit 108, the utterance detection unit 107 may output (transmit) the timing information via a network.

The timing information may specify the start timing itself, or may specify a timing obtained by correcting the start timing. In general, some delay occurs between the moment a person hears speech, understands its content, and speaks based on that understanding. The utterance detection unit 107 may therefore generate timing information that specifies a timing obtained by moving the start timing of the user's utterance back by a correction amount (a time or a number of characters) corresponding to the delay caused by the user's comprehension. The utterance detection unit 107 may also take other delays into account, such as network transmission delay. The correction of the start timing in consideration of delay may instead be performed in the partial response sentence acquisition unit 108.
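
The roll-back of the start timing described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `TimingInfo` record, its `unit` field, and the `roll_back` helper are hypothetical names, and clamping at zero (so that the corrected timing never precedes the start of the response sentence) is an assumption the text does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimingInfo:
    """Hypothetical timing information for the start of the user's utterance."""
    value: float  # seconds since output began, or characters already output
    unit: str     # "seconds" or "chars"


def roll_back(timing: TimingInfo, correction: float) -> TimingInfo:
    """Move the start timing back by a correction amount in the same unit.

    Assumption: the result is clamped at 0 so that it never precedes
    the beginning of the response sentence.
    """
    return TimingInfo(max(0.0, timing.value - correction), timing.unit)


# Utterance detected 3.0 s into the output, with an assumed 0.5 s delay.
print(roll_back(TimingInfo(3.0, "seconds"), 0.5).value)  # 2.5
```

The appropriate correction amount would have to be chosen per application; the 0.5 s used here is only a placeholder.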

The partial response sentence acquisition unit 108 receives the information about the response sentence from the voice output unit 106 and the timing information from the utterance detection unit 107. Based on the information about the response sentence and the timing information, the partial response sentence acquisition unit 108 extracts a partial response sentence from the entire response sentence. The partial response sentence acquisition unit 108 outputs the partial response sentence to the partial response sentence understanding unit 109.

The last character of the partial response sentence is determined by the timing information. For example, if the timing information represents a time, the last character output as voice by that time may be taken as the last character of the partial response sentence. If the timing information represents a number of characters, the character located that number of characters from the first character of the response sentence may be taken as the last character of the partial response sentence.
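
The two ways of fixing the last character can be sketched as follows. This is an illustrative sketch rather than the embodiment's implementation: the function names are hypothetical, the per-character output times are assumed to be in ascending order, and a character count is assumed to mean the number of characters already output.

```python
from bisect import bisect_right
from typing import List


def partial_by_time(text: str, output_times: List[float], start_time: float) -> str:
    """Return the characters whose voice output time is <= start_time.

    Assumption: output_times holds one ascending timestamp per character.
    """
    if len(text) != len(output_times):
        raise ValueError("one output time is required per character")
    # bisect_right counts how many timestamps are <= start_time.
    return text[:bisect_right(output_times, start_time)]


def partial_by_count(text: str, char_count: int) -> str:
    """Return the first char_count characters (assumed already output)."""
    return text[:max(0, char_count)]


text = "abcdef"
times = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
print(partial_by_time(text, times, 0.25))  # abc
print(partial_by_count(text, 2))           # ab
```

A binary search is used only because the timestamps are assumed to be sorted; a linear scan would give the same result.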

The first character of the partial response sentence may typically be the first character of the response sentence. Alternatively, when the response sentence contains multiple sentences, the first character of any one of those sentences may be taken as the first character of the partial response sentence. When the response sentence contains multiple topics, the first character of the partial response sentence may be determined based on a point at which the topic changes.
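
As one possible reading of the multi-sentence case, the partial response sentence could start at the beginning of the sentence that was being output when the user started speaking. The sketch below shows that single choice only; the text allows the first character of any of the sentences. The helper name and the set of sentence terminators are assumptions made for illustration.

```python
def last_sentence_start(text: str, end: int, terminators: str = "。.?!？！") -> int:
    """Index of the first character of the sentence containing text[end - 1].

    `end` is the exclusive end index of the partial response sentence.
    The character at end - 1 itself is skipped so that a terminator at
    the very end still belongs to the sentence it closes.
    """
    for i in range(min(end, len(text)) - 2, -1, -1):
        if text[i] in terminators:
            return i + 1
    return 0


response = "ab.cd?ef"
start = last_sentence_start(response, 5)
print(response[start:5])  # cd
```

Detecting a change of topic, which the text also mentions, would need more than punctuation and is not attempted here.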

The partial response sentence understanding unit 109 receives the partial response sentence from the partial response sentence acquisition unit 108. The partial response sentence understanding unit 109 obtains a meaning understanding result for the partial response sentence by understanding its meaning. Here, the meaning understanding result of the partial response sentence means data that expresses the meaning of the partial response sentence in abstract form. Even when the partial response sentence could have more than one meaning, the partial response sentence understanding unit 109 automatically settles on a single meaning understanding result. The partial response sentence understanding unit 109 outputs the meaning understanding result to the utterance intention understanding unit 103 and the dialogue state update unit 110.

For example, the partial response sentence understanding unit 109 may convert the partial response sentence into tags that express its meaning abstractly. In the following description, the meaning of a partial response sentence is expressed by a combination of tags called an act tag (act) and a command tag (command). However, the meaning of the partial response sentence may instead be expressed by the act tag alone or by one or more other tags. Alternatively, the partial response sentence understanding unit 109 may convert the partial response sentence into one of the example sentences listed in a collection of example sentences in which a large number of meanings have been organized and normalized in advance.

The partial response sentence understanding unit 109 may determine the meaning of the partial response sentence according to rules that define the correspondence between partial response sentences and meanings. Alternatively, the partial response sentence understanding unit 109 may calculate, based on the result of statistical learning using a large amount of training data, the probability that the partial response sentence corresponds to each candidate meaning, and select the meaning with the highest probability.
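
The rule-based alternative can be sketched as an ordered list of patterns, each mapped to an act tag and a command tag. Everything concrete below is hypothetical: the patterns, the tag values such as `propose` and `buy_greek_bond`, and the English wording of the response sentence are invented for illustration and do not come from the figures. Returning the first matching rule is one simple way to settle on a single meaning.

```python
import re
from typing import List, NamedTuple, Optional, Pattern, Tuple


class Meaning(NamedTuple):
    act: str       # act tag
    command: str   # command tag


# Hypothetical rules, checked in order; the first match wins.
RULES: List[Tuple[Pattern[str], Meaning]] = [
    (re.compile(r"which .*\?$", re.I), Meaning("ask_choice", "select_bond")),
    (re.compile(r"Greek government bond", re.I), Meaning("propose", "buy_greek_bond")),
]


def understand(partial: str) -> Optional[Meaning]:
    """Return the meaning of the first rule that matches, or None."""
    for pattern, meaning in RULES:
        if pattern.search(partial):
            return meaning
    return None


print(understand("Greek government bonds with a 100% annual interest rate"))
```

The statistical alternative would replace the rule list with a scoring model and take the highest-scoring meaning; that variant is not shown.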

The dialogue state update unit 110 receives the meaning understanding result from the partial response sentence understanding unit 109. The dialogue state update unit 110 obtains an alternative dialogue state for updating the current dialogue state by generating a dialogue state that corresponds to the meaning understanding result. The dialogue state update unit 110 outputs the alternative dialogue state to the dialogue state control unit 104.

The operation of the spoken dialogue system of FIG. 1 is compared below with the operation of a spoken dialogue system according to a comparative example, with reference to FIGS. 6, 7, 8, and 15. Here, the spoken dialogue system according to the comparative example is assumed to treat the user's utterance, when a barge-in from the user occurs, as having been made in response to the entire response sentence, including the part not yet output.

When no barge-in from the user occurs, the spoken dialogue system of FIG. 1 and the spoken dialogue system according to the comparative example operate as illustrated in FIG. 6. In FIGS. 6, 7, 8, 11, and 13, the user's utterance intention as understood by the system is expressed by a subject ("User"; first line), an act tag (second line), and a command tag (third line). Similarly, in FIGS. 6, 7, 8, 11, and 13, the meaning of a response sentence is expressed by a subject ("System"; first line), an act tag (second line), and a command tag (third line).

In the example of FIG. 6, the user listens to the response sentence to the end and can properly understand the meaning of the response sentence as a whole. The user's utterance therefore falls within what the system expects, and the spoken dialogue proceeds smoothly.

When a barge-in from the user occurs, the spoken dialogue system according to the comparative example operates as shown, for example, in FIG. 7. In the example of FIG. 7, the system outputs the response sentence (A2) "You don't want Greek government bonds with a 100% annual interest rate, do you? Then which would you like, Italian government bonds at 7% or Japanese government bonds at 1%?", and the user says "Yes!" (B3) after hearing only "Greek government bonds with a 100% annual interest rate". In this case, because the user has not listened to the response sentence (A2) to the end, the user misunderstands the meaning of what was heard as, for example, "Would you like Greek government bonds?". That is, the intention of the user's utterance (B3) is, for example, "I want to buy Greek government bonds with a 100% interest rate!" (D3).

The system, on the other hand, assumes that the user has understood the meaning of the entire response sentence (A2). The system's dialogue state (D2) therefore expects an utterance corresponding to the country name "Italy" or the country name "Japan". As a result, the user's utterance (B3) is treated as unexpected, and a response sentence (A4) such as "There is no country called 'Yes'. Please enter a country name." is output as speech. This response sentence (A4) is not appropriate in view of the utterance intention (D3) of the user, who wishes to buy Greek government bonds. The progress of the voice dialogue is therefore hindered.

When barge-in from the user occurs, the voice dialogue system of FIG. 1 operates, for example, as shown in FIG. 8. In the example of FIG. 8, the system's response sentence (A2) is again "You would not want Greek government bonds with an annual interest rate of 100%, would you? Then which will you choose: Italian government bonds at 7% or Japanese government bonds at 1%?" The user hears only "Greek government bonds with an annual interest rate of 100%" and then utters "Yes!" (B3-1). Because the user has not heard the response sentence (A2) to the end, the user misunderstands the meaning of what was heard as, for example, "Would you like Greek government bonds?"

The system detects the user's utterance (B3-1) and acquires the partial response sentence "Greek government bonds with an annual interest rate of 100%" (A3-2). The system understands the meaning (C3-2) of the partial response sentence (A3-2), generates an alternative dialogue state (D3-2), and updates the current dialogue state. The system's dialogue state therefore expects an utterance corresponding to affirmation or negation. As a result, the user's utterance (B3-1, B3-3) is treated as expected, and a response sentence (A4) such as "You would like to invest in Greek government bonds. Understood. How much will you invest?" is output as speech. Because this response sentence (A4) is appropriate in view of the utterance intention (D3-1, D3-3) of the user, who wishes to buy Greek government bonds, the user utters, for example, "10,000 for now." (B5), and the voice dialogue proceeds smoothly.

To generate an alternative dialogue state according to the meaning understanding result of a partial response sentence, information that pairs the meaning of the partial response sentence with the dialogue state used when outputting a response sentence corresponding to that meaning (that is, the alternative dialogue state) can be created in advance and used, as illustrated in FIG. 15. The table illustrated in FIG. 15 represents sets each consisting of a partial response sentence (column B), its meaning (column C), and the dialogue state used when outputting a response sentence corresponding to that meaning (column D). As also illustrated in FIG. 15, information on the immediately preceding dialogue state (column A) may be added as an element of each set. This information (column A) is not essential for generating the alternative dialogue state, but when a plurality of meaning understanding results are obtained from a partial response sentence, it may be used to identify the more appropriate one.

For example, two meaning understanding results (C1 and C2 in FIG. 15) can be obtained from the partial response sentence (A3-2 in FIG. 8). According to the example of FIG. 15, however, if the immediately preceding dialogue state is "Request, Recommend(country)" (C1 in FIG. 8), it falls under "a dialogue state other than S1 (= Request, BoundInfo(Greece))" (that is, A1 in FIG. 15). The meaning understanding result of the partial response sentence (A3-2 in FIG. 8) is therefore identified as "Question-YN, government bond purchase (Greece)" (C1 in FIG. 15).

Specifically, the combination of the partial response sentence (A3-2 in FIG. 8) and the immediately preceding dialogue state (C1 in FIG. 8), "Request, Recommend(country)", identifies the meaning understanding result of the partial response sentence (C1 in FIG. 15), so an alternative dialogue state (D1 in FIG. 15) according to that result can be generated. By replacing the dialogue state as shown at D3-2 in FIG. 8 before processing the user's utterance, the voice dialogue system can advance the voice dialogue smoothly, as described above.
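The table lookup described for FIG. 15 can be sketched in Python as follows. This is only an illustrative sketch, not the patent's implementation: the `Row` type, the function name `find_alternative_state`, the placeholder state labels "D1" and "D2", and the condition and meaning assumed for the second row are all hypothetical, since the text does not spell out the second row or the contents of column D. The English strings are translations of the Japanese tags.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

State = Tuple[str, str]  # (act tag, command tag)

# Dialogue state S1 named in the FIG. 15 example.
S1: State = ("Request", "BoundInfo(Greece)")

PARTIAL = "Greek government bonds with an annual interest rate of 100%"


@dataclass(frozen=True)
class Row:
    prev_state_matches: Callable[[State], bool]  # column A: condition on the preceding state
    partial_response: str                        # column B
    meaning: State                               # column C
    alternative_state: str                       # column D (placeholder label only)


TABLE = [
    # Row 1 follows the text: any preceding state other than S1 yields C1.
    Row(lambda s: s != S1, PARTIAL,
        ("Question-YN", "government bond purchase (Greece)"), "D1"),
    # Row 2 is an assumption made for illustration: preceding state S1 yields C2.
    Row(lambda s: s == S1, PARTIAL,
        ("Inform", "government bond information (Greece)"), "D2"),
]


def find_alternative_state(partial_response: str,
                           prev_state: State) -> Optional[Tuple[State, str]]:
    """Return (meaning, alternative state) for the first matching row, or None."""
    for row in TABLE:
        if row.partial_response == partial_response and row.prev_state_matches(prev_state):
            return row.meaning, row.alternative_state
    return None


result = find_alternative_state(PARTIAL, ("Request", "Recommend(country)"))
print(result)
```

With the preceding dialogue state "Request, Recommend(country)", the lookup selects the first row, which corresponds to the identification of C1 and D1 described above.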

As described above, the voice dialogue system according to the first embodiment acquires a partial response sentence based on the start timing of the user's utterance, understands the meaning of the partial response sentence, and updates the dialogue state. Because the dialogue state is updated to suit the meaning understanding result of the partial response sentence, the user's utterance is more likely to be treated as one the system expects. That is, this voice dialogue system can advance the voice dialogue smoothly even when barge-in from the user occurs.

(Second Embodiment)
As shown in FIG. 9, the voice dialogue system according to the second embodiment includes a voice input unit 101, a speech recognition unit 102, an utterance intention understanding unit 103, a dialogue state control unit 104, a dialogue history storage unit 105, a voice output unit 106, an utterance detection unit 107, a partial response sentence acquisition unit 108, a partial response sentence understanding unit 209, a dialogue state update unit 110, and a sentence meaning confirmation unit 211.

The partial response sentence understanding unit 209 receives the partial response sentence from the partial response sentence acquisition unit 108. By understanding the meaning of the partial response sentence, the partial response sentence understanding unit 209 obtains one or more meaning understanding results. In other words, when the partial response sentence may have a plurality of meanings, the partial response sentence understanding unit 209 may obtain a plurality of meaning understanding results. To obtain them, the partial response sentence understanding unit 209 can use a technique identical or similar to that of the partial response sentence understanding unit 109 described above. The partial response sentence understanding unit 209 outputs the one or more meaning understanding results to the sentence meaning confirmation unit 211.

The sentence meaning confirmation unit 211 receives the one or more meaning understanding results from the partial response sentence understanding unit 209 and generates one or more confirmation sentences by converting them. When only one meaning understanding result is received from the partial response sentence understanding unit 209, the sentence meaning confirmation unit 211 may output that result as is to the dialogue state update unit 110. Here, a confirmation sentence means text that helps the user understand the content of the corresponding meaning understanding result. Specifically, as illustrated in FIG. 12, when the meaning understanding result of the partial response sentence is "Question-YN, government bond purchase (Greece)" (B1), presenting a confirmation sentence such as "How about Greek government bonds?" (C1), rather than the meaning understanding result itself, helps the user understand the content of that result.

The sentence meaning confirmation unit 211 presents the one or more confirmation sentences to the user, and may do so in any manner. In general, text can be presented in less time as video than as speech. From the viewpoint of keeping the voice dialogue smooth, the sentence meaning confirmation unit 211 may therefore, for example, cause a display unit (not shown) to display video listing the one or more confirmation sentences. Of course, the sentence meaning confirmation unit 211 may instead present the confirmation sentences by speech, or by video and speech together.

In the example of FIG. 12, two meaning understanding results (B1, B2) are obtained when the partial response sentence is "Greek government bonds with an annual interest rate of 100%" (A1, A2). The sentence meaning confirmation unit 211 therefore presents the corresponding confirmation sentences, "How about Greek government bonds?" and "Greek government bonds have an annual interest rate of 100%.", to the user.

From the presented confirmation sentences, the user selects the one that best matches the user's own understanding of the response sentence as heard. The sentence meaning confirmation unit 211 then accepts the user's selection of one of the confirmation sentences. The selection may be accepted in any manner, for example as voice input, touch panel input, key input, or gesture input. The sentence meaning confirmation unit 211 outputs the one meaning understanding result corresponding to the confirmation sentence selected by the user to the dialogue state update unit 110.

The sentence meaning confirmation unit 211 may also accept an input indicating that none of the presented confirmation sentences matches the user's understanding of the response sentence as heard. When such an input is accepted, the response sentence corresponding to the current dialogue state may be output as speech again, or the processing relating to the partial response sentence (for example, acquiring the partial response sentence or understanding its meaning) may be redone.

In addition to the confirmation sentences corresponding to the meaning understanding results of the partial response sentence, the sentence meaning confirmation unit 211 may present to the user a confirmation sentence corresponding to the meaning of the entire response sentence (or the response sentence itself). For example, a user accustomed to the voice dialogue system of FIG. 9 may correctly understand the meaning of the entire response sentence after hearing only part of it. Depending on when the user starts speaking, however, none of the confirmation sentences corresponding to the meaning understanding results of the partial response sentence may correspond to the meaning of the entire response sentence. That is, depending on the start timing of the user's utterance, an appropriate confirmation sentence might not be presented. In this use case it is therefore useful to also present a confirmation sentence corresponding to the meaning of the entire response sentence, regardless of the start timing of the user's utterance, so that the voice dialogue proceeds smoothly.

With the sentence meaning confirmation unit 211, even when a plurality of meaning understanding results exist, the one appropriate result can be extracted based on the user's selection. The sentence meaning confirmation unit 211 thus prevents a situation in which the dialogue state is updated based on a meaning understanding result that differs from the user's understanding and the progress of the voice dialogue is hindered.
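The confirmation flow of the sentence meaning confirmation unit 211 can be sketched as follows. This is a minimal Python sketch under stated assumptions: the function name `confirm_meaning`, the `choose` callback (standing in for voice, touch panel, key, or gesture input), and the use of `None` for "none of these matches" are hypothetical. Only the two result-to-sentence pairs of FIG. 12, given here in English translation, come from the text.

```python
from typing import Callable, Dict, List, Optional, Tuple

Meaning = Tuple[str, str]  # (act tag, command tag)

B1: Meaning = ("Question-YN", "government bond purchase (Greece)")
B2: Meaning = ("Inform", "government bond information (Greece)")

# Mapping from meaning understanding results to confirmation sentences (FIG. 12).
CONFIRMATION_SENTENCES: Dict[Meaning, str] = {
    B1: "How about Greek government bonds?",
    B2: "Greek government bonds have an annual interest rate of 100%.",
}


def confirm_meaning(results: List[Meaning],
                    choose: Callable[[List[str]], Optional[int]]) -> Optional[Meaning]:
    """Return the meaning understanding result confirmed by the user.

    `choose` is a hypothetical callback that presents the sentences and returns
    the index of the selected one, or None if the user indicates that none of
    them matches what was heard.
    """
    if len(results) == 1:
        # A single result may be passed on without asking the user.
        return results[0]
    sentences = [CONFIRMATION_SENTENCES[r] for r in results]
    choice = choose(sentences)
    if choice is None:
        # The caller may then repeat the response sentence or redo the analysis.
        return None
    return results[choice]


# Example: the user picks the second sentence.
selected = confirm_meaning([B1, B2], lambda sentences: 1)
print(selected)
```

Returning `None` leaves the fallback to the caller, which matches the two options in the text: outputting the response sentence again or redoing the partial response sentence processing.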

The dialogue state update unit 110 receives the meaning understanding result from the sentence meaning confirmation unit 211. By generating a dialogue state according to that result, the dialogue state update unit 110 obtains an alternative dialogue state for updating the current dialogue state. The dialogue state update unit 110 outputs the alternative dialogue state to the dialogue state control unit 104.

The operation of the voice dialogue system of FIG. 9 is compared below with that of the voice dialogue system according to the comparative example, with reference to FIGS. 11 and 13. Here, when barge-in from the user occurs, the voice dialogue system according to the comparative example regards the user's utterance as having been made in response to the entire response sentence, including the part not yet output.

When barge-in from the user occurs, the voice dialogue system according to the comparative example operates, for example, as shown in FIG. 11. In the example of FIG. 11, the system's response sentence (A2) is "You would not want Greek government bonds with an annual interest rate of 100%, would you? Then which will you choose: Italian government bonds at 7% or Japanese government bonds at 1%?" The user hears only "Greek government bonds with an annual interest rate of 100%" and then utters "Why?" (B3). Because the user has not heard the response sentence (A2) to the end, the user misunderstands the meaning of what was heard as "Greek government bonds have an annual interest rate of 100%." That is, the intention of the user's utterance (B3) is, for example, "a request for an explanation of the situation of Greek government bonds" (D3).

The system, on the other hand, assumes that the user has understood the meaning of the entire response sentence (A2). The system's dialogue state (D2) therefore expects an utterance corresponding to the country name "Italy" or the country name "Japan". As a result, the user's utterance (B3) is treated as unexpected, and a response sentence (A4) such as "There is no country called 'Why'. Please enter a country name." is output as speech. This response sentence (A4) is not appropriate in view of the utterance intention (D3) of the user, who wants an explanation of the situation of Greek government bonds. The progress of the voice dialogue is therefore hindered.

When barge-in from the user occurs, the voice dialogue system of FIG. 9 operates, for example, as shown in FIG. 13. In the example of FIG. 13, the system's response sentence (A2) is again "You would not want Greek government bonds with an annual interest rate of 100%, would you? Then which will you choose: Italian government bonds at 7% or Japanese government bonds at 1%?" The user hears only "Greek government bonds with an annual interest rate of 100%" and then utters "Why?" (B3-1). Because the user has not heard the response sentence (A2) to the end, the user misunderstands the meaning of what was heard as, for example, "How about Greek government bonds?" or "Greek government bonds have an annual interest rate of 100%."

The system detects the user's utterance (B3-1) and acquires the partial response sentence "Greek government bonds with an annual interest rate of 100%" (A3-2). The system understands the meaning of the partial response sentence (A3-2) and obtains a plurality of meaning understanding results (for example, B1 and B2 in FIG. 12). The system converts these results into a plurality of confirmation sentences (for example, C1 and C2 in FIG. 12), presents the confirmation sentences, and accepts the user's selection among them. The system then extracts the one meaning understanding result (C3-2) corresponding to the user's selection, generates an alternative dialogue state (D3-2) based on it, and updates the current dialogue state.

The user's utterance (B3-1, B3-3) is therefore treated as expected, and a response sentence (A4) such as "Because Greece may default owing to its sovereign debt crisis." is output as speech. Because this response sentence (A4) is appropriate in view of the utterance intention (D3-1, D3-3) of the user, who wants an explanation of the situation of Greek government bonds, the voice dialogue proceeds smoothly.

As described above, the voice dialogue system according to the second embodiment presents to the user confirmation sentences corresponding to the one or more meaning understanding results of the partial response sentence, accepts the user's selection, and updates the dialogue state based on the one meaning understanding result corresponding to that selection. This voice dialogue system thus prevents a situation in which the dialogue state is updated based on a meaning understanding result that differs from the user's understanding and the progress of the voice dialogue is hindered.

(Third Embodiment)
As shown in FIG. 10, the voice dialogue system according to the third embodiment includes a voice input unit 101, a speech recognition unit 102, an utterance intention understanding unit 303, a dialogue state control unit 104, a dialogue history storage unit 105, a voice output unit 106, an utterance detection unit 107, a partial response sentence acquisition unit 108, a partial response sentence understanding unit 209, a dialogue state update unit 110, and a sentence meaning determination unit 312.

The utterance intention understanding unit 303 receives the utterance text from the speech recognition unit 102. By understanding the intention of the user's utterance based on the utterance text, the utterance intention understanding unit 303 obtains a first utterance intention understanding result. Specifically, the utterance intention understanding unit 303 may understand the intention of the user's utterance from the utterance text by using a technique identical or similar to that of the utterance intention understanding unit 103. The utterance intention understanding unit 303 outputs the first utterance intention understanding result to the sentence meaning determination unit 312. The utterance intention understanding unit 303 may also output the first utterance intention understanding result to the dialogue state control unit 104.

When the utterance text is ambiguous (for example, "Yes", "No", or "I suppose so"), it is difficult to identify the intention of the user's utterance accurately from the utterance text alone. The utterance intention understanding unit 103 may therefore understand the intention of the user's utterance based on the dialogue history from the dialogue history storage unit 105 in addition to the utterance text. By taking the dialogue history into account, the utterance intention understanding unit 103 can understand the intention of the user's utterance more accurately.

The utterance intention understanding unit 303 may receive a meaning understanding result from the sentence meaning determination unit 312 described later. The utterance intention understanding unit 303 may then generate a second utterance intention understanding result based on that meaning understanding result in addition to the utterance text (and the dialogue history). By taking the meaning understanding result into account, the utterance intention understanding unit 303 can understand the intention of the user's utterance more accurately. The utterance intention understanding unit 303 may output the second utterance intention understanding result, instead of the first, to the dialogue state control unit 104.

The dialogue state control unit 104 receives the first utterance intention understanding result or the second utterance intention understanding result from the utterance intention understanding unit 103. The dialogue state control unit 104 processes the received result using the current dialogue state and, as necessary, controls transitions between dialogue states (including self-transitions).

The sentence meaning determination unit 312 receives the first utterance intention understanding result from the utterance intention understanding unit 303 and one or more meaning understanding results from the partial response sentence understanding unit 209. As described later, the sentence meaning determination unit 312 determines the one with the highest reliability among the one or more meaning understanding results. When only one meaning understanding result is received from the partial response sentence understanding unit 209, the sentence meaning determination unit 312 may output that result as is to the dialogue state update unit 110. The sentence meaning determination unit 312 outputs the meaning understanding result with the highest reliability to the dialogue state update unit 110.

Specifically, as illustrated in FIG. 14, the sentence meaning determination unit 312 can refer to a first probability that each of the one or more meaning understanding results corresponding to the partial response sentence appears. The first probability corresponds to the probability that, given the partial response sentence, the meaning understanding result in question matches the user's understanding of that partial response sentence. In the example of FIG. 14, when the partial response sentence is "Greek government bonds with an annual interest rate of 100%" (A1, A2), the two meaning understanding results (B1, B2) appear with probabilities of 70% (C1) and 30% (C2), respectively.

Furthermore, as illustrated in FIG. 14, the sentence meaning determination unit 312 can refer to a second probability that each of a plurality of utterance intentions of the user appears, assuming the appearance of each of the one or more meaning understanding results corresponding to the partial response sentence. The second probability corresponds to the probability that the user makes an utterance corresponding to the utterance intention in question when the assumed meaning understanding result of the partial response sentence matches the user's understanding of that partial response sentence. Using this second probability, the user's understanding of the partial response sentence, which underlies the utterance, can be estimated from the first utterance intention understanding result. In the example of FIG. 14, assuming the appearance of the meaning understanding result "Question-YN, government bond purchase (Greece)" (B1), the probability that the user makes an utterance corresponding to "Response" (D1) is "80%" (E1).

The sentence meaning determination unit 312 evaluates the reliability of a given meaning understanding result by multiplying the first probability corresponding to that result by the one second probability, among the plurality of second probabilities corresponding to that result, that matches the first utterance intention understanding result.

In the example of FIG. 14, suppose that the partial response sentence is "Greek government bonds with an annual interest rate of 100%" (A1, A2) and that the first utterance intention understanding result is "Request". In this case, the first probability corresponding to the meaning understanding result "Question-YN, government bond purchase (Greece)" (B1) is "70%" (C1), and among the plurality of second probabilities (E1) corresponding to that result (B1), the one that matches the first utterance intention understanding result is "5%". The reliability of the meaning understanding result (B1) can therefore be evaluated as, for example, "3.5%". Likewise, the first probability corresponding to the meaning understanding result "Inform, government bond information (Greece)" (B2) is "30%" (C2), and among the plurality of second probabilities (E2) corresponding to that result (B2), the one that matches the first utterance intention understanding result is "65%". The reliability of the meaning understanding result (B2) can therefore be evaluated as, for example, "19.5%". The sentence meaning determination unit 312 determines the one with the highest reliability (B2) among the plurality of meaning understanding results (B1, B2).
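The reliability evaluation above can be reproduced with a short Python sketch. The names `FIRST_PROB`, `SECOND_PROB`, `reliability`, and `decide_meaning` are hypothetical. Only the probabilities quoted in the text from FIG. 14 are included. Treating an utterance intention that has no listed second probability as probability 0 is an assumption made for illustration.

```python
from typing import Dict, List, Tuple

Meaning = Tuple[str, str]  # (act tag, command tag)

B1: Meaning = ("Question-YN", "government bond purchase (Greece)")
B2: Meaning = ("Inform", "government bond information (Greece)")

# First probability: chance that each meaning matches the user's understanding
# of the partial response sentence (FIG. 14, column C).
FIRST_PROB: Dict[Meaning, float] = {B1: 0.70, B2: 0.30}

# Second probability: chance of each utterance intention given the meaning
# (FIG. 14, column E). Only the values quoted in the text are listed.
SECOND_PROB: Dict[Meaning, Dict[str, float]] = {
    B1: {"Response": 0.80, "Request": 0.05},
    B2: {"Request": 0.65},
}


def reliability(meaning: Meaning, intention: str) -> float:
    """First probability times the second probability matching the intention."""
    # Assumption: an intention without a listed second probability counts as 0.
    return FIRST_PROB[meaning] * SECOND_PROB[meaning].get(intention, 0.0)


def decide_meaning(candidates: List[Meaning], intention: str) -> Meaning:
    """Pick the candidate meaning with the highest reliability."""
    return max(candidates, key=lambda m: reliability(m, intention))


for m in (B1, B2):
    print(m, round(reliability(m, "Request"), 3))
print(decide_meaning([B1, B2], "Request"))
```

For the first utterance intention understanding result "Request", the products are 0.70 × 0.05 = 0.035 and 0.30 × 0.65 = 0.195, so B2 is selected, in agreement with the 3.5% and 19.5% figures in the text.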

With the sentence-meaning determination unit 312, even when a plurality of sentence-meaning understanding results exist, the one with the highest reliability can be determined automatically by using the first utterance intention understanding result. The sentence-meaning understanding result that matches the user's understanding of the partial response sentence can therefore be determined with high accuracy and in a short time, so the voice dialogue can proceed smoothly.

The dialogue state update unit 110 receives the sentence-meaning understanding result from the sentence-meaning determination unit 312. The dialogue state update unit 110 generates a dialogue state corresponding to the sentence-meaning understanding result, thereby obtaining an alternative dialogue state for updating the current dialogue state. The dialogue state update unit 110 outputs the alternative dialogue state to the dialogue state control unit 104.

As described above, the voice dialogue system according to the third embodiment uses the user's utterance intention understanding result to automatically determine the one with the highest reliability among the one or more sentence-meaning understanding results of the partial response sentence, and updates the dialogue state based on that result. With this voice dialogue system, the sentence-meaning understanding result that matches the user's understanding of the partial response sentence can therefore be determined with high accuracy and in a short time, so the voice dialogue can proceed smoothly.

The processing of each of the above embodiments can be realized by using a general-purpose computer as the basic hardware. A program that realizes the processing of each of the above embodiments may be provided stored in a computer-readable storage medium. The program is stored in the storage medium as a file in an installable format or a file in an executable format. Examples of the storage medium include a magnetic disk, an optical disk (CD-ROM, CD-R, DVD, etc.), a magneto-optical disk (MO, etc.), and a semiconductor memory. Any storage medium may be used as long as it can store the program and can be read by a computer. The program that realizes the processing of each of the above embodiments may also be stored on a computer (server) connected to a network such as the Internet and downloaded to a computer (client) via the network.

Although several embodiments of the present invention have been described, these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and are included in the invention described in the claims and its equivalents.

DESCRIPTION OF SYMBOLS
101 ... voice input unit
102 ... speech recognition unit
103, 303 ... utterance intention understanding unit
104 ... dialogue state control unit
105 ... dialogue history storage unit
106 ... voice output unit
107 ... utterance detection unit
108 ... partial response sentence acquisition unit
109, 209 ... partial response sentence understanding unit
110 ... dialogue state update unit
211 ... sentence-meaning confirmation unit
312 ... sentence-meaning determination unit

Claims

A spoken dialogue system comprising:
an output unit that outputs, by voice, a response sentence corresponding to the current dialogue state;
an input unit that inputs a user's utterance;
a detection unit that detects a start timing of the user's utterance and generates timing information;
an acquisition unit that acquires, based on information on the response sentence and the timing information, a partial response sentence, which is the portion of the response sentence that has already been output by voice at the start timing of the utterance;
a first understanding unit that obtains one or more first understanding results by understanding the meaning of the partial response sentence;
an update unit that generates, based on the one or more first understanding results, an alternative dialogue state for updating the current dialogue state;
a speech recognition unit that obtains utterance text by performing speech recognition processing on the utterance;
a second understanding unit that obtains a second understanding result by understanding the intention of the user's utterance based on at least the utterance text; and
a controller that updates the current dialogue state with the alternative dialogue state, processes the second understanding result, and controls transitions between dialogue states.

The spoken dialogue system, wherein the second understanding unit obtains the second understanding result by understanding the intention of the user's utterance based on the one or more first understanding results in addition to the utterance text.

The spoken dialogue system according to claim 1, wherein the timing information is time information based on the start timing.

The spoken dialogue system according to claim 1, wherein the timing information is character number information based on the start timing.

The spoken dialogue system according to claim 1, wherein the timing information specifies a timing earlier than the start timing by a correction amount corresponding to a delay attributable to the user's comprehension ability.

The spoken dialogue system according to claim 1, wherein the one or more first understanding results are expressed in a tag format.

The spoken dialogue system according to claim 1, further comprising a confirmation unit that converts the one or more first understanding results into one or more confirmation sentences, presents the one or more confirmation sentences to the user, and accepts the user's selection of any one of the one or more confirmation sentences,
wherein the update unit generates the alternative dialogue state based on the one of the one or more first understanding results that corresponds to the user's selection.

The spoken dialogue system according to claim 1, further comprising a determination unit that evaluates the reliability of each of the one or more first understanding results based on a first probability that the first understanding result appears and a second probability that an utterance corresponding to the second understanding result appears given that the first understanding result has appeared, and that determines the one of the one or more first understanding results having the highest reliability,
wherein the update unit generates the alternative dialogue state based on the one of the one or more first understanding results having the highest reliability.

A spoken dialogue method comprising:
outputting a response sentence corresponding to the current dialogue state;
inputting a user's utterance;
obtaining timing information by detecting a start timing of the user's utterance;
acquiring, based on information on the response sentence and the timing information, a partial response sentence, which is the portion of the response sentence that has already been output by voice at the start timing of the utterance;
obtaining one or more first understanding results by understanding the meaning of the partial response sentence;
generating, based on the one or more first understanding results, an alternative dialogue state for updating the current dialogue state;
obtaining utterance text by performing speech recognition processing on the utterance;
obtaining a second understanding result by understanding the intention of the user's utterance based on at least the utterance text; and
updating the current dialogue state with the alternative dialogue state, processing the second understanding result, and controlling transitions between dialogue states.

A spoken dialogue program causing a computer to function as:
means for outputting a response sentence corresponding to the current dialogue state;
means for inputting a user's utterance;
means for obtaining timing information by detecting a start timing of the user's utterance;
means for acquiring, based on information on the response sentence and the timing information, a partial response sentence, which is the portion of the response sentence that has already been output by voice at the start timing of the utterance;
means for obtaining one or more first understanding results by understanding the meaning of the partial response sentence;
means for generating, based on the one or more first understanding results, an alternative dialogue state for updating the current dialogue state;
means for obtaining utterance text by performing speech recognition processing on the utterance;
means for obtaining a second understanding result by understanding the intention of the user's utterance based on at least the utterance text; and
means for updating the current dialogue state with the alternative dialogue state, then processing the second understanding result, and controlling transitions between dialogue states.