JP2016071050A

JP2016071050A - Voice interactive device, voice interactive system, terminal, voice interactive method, program for letting computer function as voice interactive device

Info

Publication number: JP2016071050A
Application number: JP2014198740A
Authority: JP
Inventors: 泰貴畠山; Yasutaka Hatakeyama
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2014-09-29
Filing date: 2014-09-29
Publication date: 2016-05-09
Anticipated expiration: 2034-09-29
Also published as: JP6129134B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice interactive device that authenticates a user without using a wireless tag or a camera, and that responds to an utterance with a topic suited to the user. SOLUTION: A voice interactive system 10 comprises: a voice recognition module 110 for recognizing voice data based on an utterance; a voice authentication module 120 for identifying the speaker based on the utterance; a topic estimation module 130 for estimating a topic corresponding to the identified speaker; and a dialogue generation module 140 for generating the content of an utterance to the user based on the result of the topic estimation module 130 (for example, sports topics that a child user is interested in, or fashion topics that women are interested in). SELECTED DRAWING: Figure 1

Description

The present disclosure relates to voice recognition and voice authentication, and more specifically to a technique for performing voice recognition and voice authentication at the same time.

Devices that use voice recognition technology are known. For example, Japanese Patent Laying-Open No. 2011-000681 (Patent Document 1) discloses a "communication robot that can execute a variety of communication behaviors according to its degree of intimacy with a communication target" (see [Abstract]).

Patent Document 1: Japanese Patent Laying-Open No. 2011-000681 (JP 2011-000681 A)

According to the technology disclosed in Patent Document 1, the communication robot authenticates a user by means of a wireless tag carried by the user. Consequently, if the same wireless tag is used by a different person, user authentication is not performed correctly and the communication robot may operate inappropriately. There is therefore a need for a technology that enables communication to be carried out appropriately.

The present disclosure has been made to solve the problem described above, and an object in one aspect is to provide a voice interaction device that authenticates the user accurately and thereby achieves communication.

An object in another aspect is to provide a voice interaction system that authenticates the user accurately and achieves communication suited to the user.

An object in another aspect is to provide a terminal that authenticates the user accurately and achieves communication suited to the user.

An object in another aspect is to provide a voice interaction method that authenticates the user accurately and achieves communication suited to the user.

An object in still another aspect is to provide a program for causing a computer to function as a voice interaction device that authenticates the user accurately and thereby achieves communication.

A voice interaction device according to one embodiment includes: a voice recognition unit configured to recognize an utterance; a voice authentication unit configured to identify the speaker based on the recognized utterance and pre-registered user information; a topic estimation unit configured to generate, based on the recognized utterance and the identified speaker, a topic in which the speaker is interested; and a voice output unit configured to output the topic by voice.

A voice interaction device according to another embodiment includes: a voice signal input unit configured to accept input of a voice signal based on an utterance; a voice recognition unit configured to recognize the utterance based on the input voice signal; a voice authentication unit configured to identify the speaker based on the input voice signal and pre-registered user information; a topic estimation unit configured to generate, based on the recognized utterance and the identified speaker, a topic in which the speaker is interested; and an output unit configured to output a topic signal for outputting the topic by voice.

According to another embodiment, a voice interaction system is provided. The voice interaction system includes a terminal and a server capable of communicating with the terminal. The terminal includes: a voice recognition unit configured to accept an utterance and recognize it; a voice authentication unit configured to identify the speaker based on the recognized utterance and pre-registered user information; and a transmission unit configured to transmit, to the server, a voice signal based on the utterance and an identification signal of the identified speaker. The server includes: a topic estimation unit configured to generate, based on the voice signal and the identification signal, a topic in which the speaker is interested; and a topic transmission unit configured to transmit, to the terminal, a topic signal for outputting the topic by voice. The terminal further includes an output unit configured to output the topic by voice based on the topic signal received from the server.

According to another embodiment, a terminal for use in the above voice interaction system is provided. The terminal includes: a voice recognition unit configured to recognize an utterance; a voice authentication unit configured to identify the speaker based on the recognized utterance and pre-registered user information; a transmission unit configured to transmit, to a server, a voice signal based on the utterance and an identification signal of the identified speaker; and an output unit configured to receive, from the server, a topic signal for outputting by voice a topic in which the speaker is interested, and to output the topic by voice.

Preferably, the voice interaction device further includes a storage unit configured to store a history of dialogues with each user of the voice interaction device. The topic estimation unit is configured to generate the topic based on the history of dialogues with the user concerned.

Preferably, the voice interaction device further includes an intimacy calculation unit configured to calculate the degree of intimacy between a user and the voice interaction device based on the history of dialogues with that user. The topic estimation unit is configured to adjust the tone of the topic according to the degree of intimacy.

According to another embodiment, a voice interaction method is provided. The method includes the steps of: recognizing an utterance; identifying the speaker based on the recognized utterance and pre-registered user information; generating, based on the recognized utterance and the identified speaker, a topic in which the speaker is interested; and outputting the topic by voice.

A voice interaction method according to another embodiment includes the steps of: accepting input of a voice signal based on an utterance; recognizing the utterance based on the input voice signal; identifying the speaker based on the input voice signal and pre-registered user information; generating, based on the recognized utterance and the identified speaker, a topic in which the speaker is interested; and outputting a topic signal for outputting the topic by voice.

A voice interaction method according to another embodiment includes the steps of: recognizing an utterance; identifying the speaker based on the recognized utterance and pre-registered user information; transmitting, to a server, a voice signal based on the utterance and an identification signal of the identified speaker; receiving, from the server, a topic signal for outputting by voice a topic in which the speaker is interested, the topic having been estimated based on the voice signal and the identification signal; and outputting the topic by voice based on the topic signal.

According to another embodiment, a program for causing a computer to function as a voice interaction device is provided. The program causes one or more processors to execute the steps of: recognizing an utterance; identifying the speaker based on the recognized utterance and pre-registered user information; generating, based on the recognized utterance and the identified speaker, a topic signal for outputting by voice a topic in which the speaker is interested; and outputting the topic by voice based on the topic signal.

A program according to another embodiment for causing a computer to function as a voice interaction device causes one or more processors to execute the steps of: accepting input of a voice signal based on an utterance; recognizing the utterance based on the input voice signal; identifying the speaker based on the input voice signal and pre-registered user information; generating, based on the recognized utterance and the identified speaker, a topic in which the speaker is interested; and outputting a topic signal for outputting the topic by voice.

A program according to another embodiment for causing a computer to function as a voice interaction device causes one or more processors to execute the steps of: recognizing an utterance; identifying the speaker based on the recognized utterance and pre-registered user information; transmitting, to a server, a voice signal based on the utterance and an identification signal of the identified speaker; receiving, from the server, a topic signal for outputting by voice a topic in which the speaker is interested, the topic having been estimated based on the voice signal and the identification signal; and outputting the topic by voice based on the topic signal.

A voice interaction system according to another aspect includes a terminal and a server. The terminal includes: a voice recognition unit configured to recognize an utterance; a voice signal conversion unit configured to convert the recognized utterance into an utterance signal; and a transmission unit configured to transmit the utterance signal to the server. The server includes: a voice recognition unit configured to recognize the utterance based on the utterance signal received from the terminal; a voice authentication unit configured to identify the speaker based on the utterance signal and pre-registered user information; a topic estimation unit configured to generate, based on the recognized utterance and the identified speaker, a topic in which the speaker is interested; and a transmission unit configured to transmit, to the terminal, a topic signal for outputting the topic by voice. The terminal further includes a reception unit configured to receive the topic signal from the server, and an output unit configured to output the topic by voice based on the topic signal.

According to another embodiment, a terminal for use in the above system is provided. The terminal includes: a voice recognition unit configured to recognize an utterance; a voice signal conversion unit configured to convert the recognized utterance into an utterance signal; a transmission unit configured to transmit the utterance signal to a server; a reception unit configured to receive, from the server, a topic signal generated based on the utterance signal; and an output unit configured to output by voice, based on the topic signal, a topic corresponding to the utterance.

The above and other objects, features, aspects, and advantages of the present invention will become apparent from the following detailed description of the invention, read in conjunction with the accompanying drawings.

A diagram conceptually showing the configuration of the voice interaction system 10.
A block diagram showing an example of the configuration of the voice interaction system 10.
A block diagram showing the hardware configuration of the communication terminal 200.
A block diagram showing the hardware configuration of a computer 400 that implements the server according to the embodiment.
A diagram outlining the configuration of a voice interaction system that responds to utterances using a wireless tag.
A sequence chart showing the processing executed when a user is registered in the voice interaction system 10.
A sequence chart showing the processing in which a reply suited to the user who spoke is generated.
A diagram showing an example of interest estimation in one aspect.
A block diagram showing an example of the configuration of a voice interaction system 90.
A sequence chart showing the processing performed in the voice interaction system 10.
A diagram conceptually showing a method of estimating the interests of each of a plurality of users.
A diagram showing a table 1200 held in the dialogue history storage unit 260.
A diagram showing a table 1300 extracted for a specific user.
A diagram showing the data structure of the dialogue DB unit 270.
A diagram outlining topic estimation by a voice interaction system 80 in one aspect.
A diagram conceptually showing the configuration of a voice interaction system 1600.
A diagram showing one mode of estimating a topic based on the utterances of each of a plurality of users.
A diagram conceptually showing the data structure of a dialogue DB unit 1670.
A diagram conceptually showing one mode of data storage in a dialogue history storage unit 1660.
A diagram conceptually showing one mode of data storage in a table 2000 provided in a server 1620.
A diagram showing the data structure of the dialogue DB unit 1670.
A diagram showing an example of the configuration of a voice interaction system 2200.
A diagram conceptually showing the calculation of intimacy by an intimacy calculation module 2251.
A diagram explaining how the reply changes according to the degree of intimacy.
A block diagram outlining the configuration of a voice interaction device 2500.
A block diagram outlining the configuration of a voice interaction system 2600.

Embodiments of the present invention will now be described with reference to the drawings. In the following description, identical components are given identical reference numerals. Their names and functions are also the same, so detailed descriptions of them are not repeated.

<Technical Concept>
First, an outline of the present disclosure will be described. The disclosed technical concept consists of the following three elements.

(1) Voice recognition and voice authentication are performed in parallel. The content of the user's utterance is therefore recognized at the same time as the user is authenticated.

(2) For each user, topics of interest are estimated from the log of that user's dialogue content, and a dialogue based on the estimated topic is generated.

(3) The content of the utterances of the robot (the voice interaction device or voice interaction system) changes based on the number of dialogues and their frequency.

As a result of these elements, the user can develop a sense of familiarity with the robot (the voice interaction system).

For example, element (1) allows a voice interaction system to which this technical concept is applied to identify the user (voice authentication) and to obtain the content of the user's utterance (voice recognition) without using information from devices such as cameras or wireless tags.

Next, with element (2), the user's daily conversations are stored in the voice interaction system and analyzed as needed. Based on the analysis results, the voice interaction system can obtain topics of interest to each user (sports, entertainment news, and so on) from another information providing device and offer the user currently in conversation a topic suited to that user.

Furthermore, with element (3), as the voice interaction system and the user converse regularly over a long period, the expression of the system's utterances (wording, tone, and so on) can change according to the content of the dialogue. As a result, the user may come to feel a sense of closeness toward the voice interaction system (or toward a voice input/output terminal, such as a robot, included in the system).

<Configuration of Voice Interaction System 10>
With reference to FIG. 1, the technical concept of the voice interaction system 10 according to an embodiment of the present disclosure will be described in detail. FIG. 1 is a diagram conceptually showing the configuration of the voice interaction system 10. The voice interaction system 10 includes a voice recognition module 110, a voice authentication module 120, a topic estimation module 130, and a dialogue generation module 140.

An utterance made by a user to the voice interaction system 10 is converted into voice data. The voice recognition module 110 executes processing for recognizing the voice data. This processing is not limited to any particular method, and various well-known voice recognition techniques can be applied. The result of recognition by the voice recognition module 110 (for example, the recognized topic) is input to the topic estimation module 130.

The user's utterance to the voice interaction system 10 is input to the voice authentication module 120 at the same time as it is input to the voice recognition module 110. The voice authentication module 120 authenticates the voice of the user's utterance and identifies the speaker (the user of the voice interaction system 10). Voice recognition and speaker identification are therefore executed at substantially the same time. The identified user is input to the topic estimation module 130.
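
The parallel handling of recognition and authentication described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `VoiceData`, `recognize`, `authenticate`, and `process_utterance` are hypothetical names, the captured audio is modeled as a small record rather than a waveform, and a thread pool is just one possible way to run the two steps concurrently. The source does not specify how the modules are scheduled.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class VoiceData:
    """Toy stand-in for captured audio (assumption: real input is a waveform)."""
    text: str        # what a recognizer would extract from the audio
    voiceprint: str  # what an authenticator would extract from the audio


def recognize(voice: VoiceData) -> str:
    """Hypothetical stand-in for voice recognition module 110."""
    return voice.text


def authenticate(voice: VoiceData, registered: Dict[str, str]) -> Optional[str]:
    """Hypothetical stand-in for voice authentication module 120.

    Returns the registered user ID for the voiceprint, or None if unknown.
    """
    return registered.get(voice.voiceprint)


def process_utterance(
    voice: VoiceData, registered: Dict[str, str]
) -> Tuple[str, Optional[str]]:
    """Feed the same utterance to both modules at once and collect both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(recognize, voice)
        user_future = pool.submit(authenticate, voice, registered)
        return text_future.result(), user_future.result()
```

Both results would then be passed on together to the topic estimation step.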

The topic estimation module 130 estimates a topic appropriate for the user who spoke, based on the topic recognized by the voice recognition module 110 and the user identified by the voice authentication module 120. For example, if the user is a child, a topic in which that child is interested may be extracted from a database holding news, or a topic recently discussed with that child may be extracted from the dialogue history. In another aspect, if the user is an adult woman, a topic in which women are interested may be extracted from the database, or a topic discussed with that woman in the past may be extracted from the dialogue history. The result of the estimation by the topic estimation module 130 is input to the dialogue generation module 140.
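
One way to picture this estimation step is a lookup that ranks the categories a user has talked about and returns the first one for which the news database has an item. The sketch below is an assumption-laden illustration, not the method of the disclosure: `estimate_topic`, the shape of `history` (user ID to a list of past topic categories), and the shape of `news` (category to headline) are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def estimate_topic(
    user_id: str,
    history: Dict[str, List[str]],
    news: Dict[str, str],
) -> Optional[str]:
    """Pick a news item in the category the user has mentioned most often.

    history: user ID -> topic categories seen in that user's past dialogues
    news:    topic category -> a headline available from a news database
    Returns None when the user has no history or no category has news.
    """
    counts = Counter(history.get(user_id, []))
    # Walk the categories from most to least frequently mentioned.
    for category, _count in counts.most_common():
        if category in news:
            return news[category]
    return None
```

A fuller design would presumably also weigh the topic just recognized in the current utterance, which this sketch leaves out for brevity.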

The dialogue generation module 140 generates the content of an utterance to the user based on the result from the topic estimation module 130 (for example, a sports topic in which a child user is interested, or a fashion topic in which women are interested). In yet another aspect, the dialogue generation module 140 generates the content of the utterance by taking into account, in addition to the result from the topic estimation module 130, the degree of intimacy between the user and the voice interaction system 10. Once the dialogue generation module 140 has generated the content of the utterance, the voice interaction system 10 generates a signal for outputting that content by voice and, based on the signal, replies to the user as an utterance of the device.
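
How intimacy might influence the generated utterance can be illustrated with a toy scoring rule. Everything specific here is an assumption made for the sketch: the score (average dialogues per day), the threshold of 3.0, the function names `intimacy` and `phrase_for`, and the two example phrasings do not come from the disclosure, which at this point says only that intimacy is taken into account.

```python
def intimacy(dialogue_count: int, days_observed: int) -> float:
    """Hypothetical intimacy score: average number of dialogues per day."""
    if days_observed <= 0:
        return 0.0
    return dialogue_count / days_observed


def phrase_for(topic: str, score: float, casual_threshold: float = 3.0) -> str:
    """Choose a casual or a polite phrasing depending on the intimacy score.

    The threshold and both phrasings are illustrative assumptions.
    """
    if score >= casual_threshold:
        return f"Hey, did you hear about {topic}?"
    return f"Would you like to hear about {topic}?"
```

For example, a user who has held 30 dialogues over 10 days scores 3.0 under this rule and would be addressed casually.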

[First Embodiment]
<Configuration of Voice Interaction System 10>
With reference to FIG. 2, the configuration of the voice interaction system 10 according to the first embodiment will be described. FIG. 2 is a block diagram showing an example of the configuration of the voice interaction system 10. The voice interaction system 10 includes a communication terminal 200 and a server 220. The communication terminal 200 includes a voice input unit 210 and a voice output unit 211.

The server 220 includes a control unit 230, a voice recognition unit 240, a dialogue analysis unit 250, a dialogue history storage unit 260, a dialogue DB (database) unit 270, a dialogue generation unit 280, and a voice synthesis unit 290. The voice recognition unit 240 includes a voice recognition module 241 and a speaker identification module 242.

In one aspect, the communication terminal 200 is implemented, for example, as an electronic device with the appearance of a stuffed toy. In another aspect, the communication terminal 200 may also be implemented as a liquid crystal television or other display device capable of displaying a previously prepared image of a person. In this case, the image of the person may be displayed stereoscopically as a three-dimensional image.

In the communication terminal 200, the voice input unit 210 accepts input of an utterance made to the communication terminal 200 and transmits a signal corresponding to the utterance to the server 220.

The voice output unit 211 outputs voice as an utterance of the communication terminal 200 based on a signal sent from the server 220.

In the server 220, the control unit 230 controls the operation of the server 220.
In one aspect, the control unit 230 processes the signal sent from the communication terminal 200 and sends the processed signal to the voice recognition unit 240 for voice recognition in the server 220.

Using the voice signal sent from the control unit 230, the voice recognition unit 240 executes voice recognition processing based on well-known techniques and processing for identifying the user (speaker) who produced the voice. More specifically, in the voice recognition unit 240, the voice recognition module 241 executes recognition processing on the voice accepted by the communication terminal 200. The speaker identification module 242 identifies the speaker (the user of the communication terminal 200) who uttered the voice accepted by the communication terminal 200. For example, the speaker identification module 242 identifies the speaker by comparing the user voice information registered in advance in the server 220 (for example, stored user identification information and voiceprint data) with the voice signal sent from the communication terminal 200 (the extracted voiceprint data).
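
The comparison between registered and extracted voiceprint data can be sketched as a nearest-match search. The disclosure does not say how voiceprints are represented or compared, so the following rests on assumptions: voiceprints are modeled as plain feature vectors, cosine similarity is used as the matching measure, the threshold of 0.8 is arbitrary, and `cosine_similarity` and `identify_speaker` are hypothetical names.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(
    extracted: Sequence[float],
    registered: Dict[str, Sequence[float]],
    threshold: float = 0.8,
) -> Optional[str]:
    """Return the registered user whose voiceprint best matches the extracted one.

    Returns None when no registered voiceprint reaches the threshold, which
    stands for an unregistered speaker in this sketch.
    """
    best_user, best_score = None, threshold
    for user_id, voiceprint in registered.items():
        score = cosine_similarity(extracted, voiceprint)
        if score >= best_score:
            best_user, best_score = user_id, score
    return best_user
```

Returning `None` for weak matches is a design choice of the sketch; it avoids assigning an utterance to the closest registered user when nobody matches well.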

The dialogue analysis unit 250 analyzes the user's utterance to the communication terminal 200 based on the result of recognition by the voice recognition unit 240. More specifically, the dialogue analysis unit 250 segments the voice recognition result (the utterance content) into morphemes and executes named entity extraction.

The dialogue history storage unit 260 holds the results of the analysis by the dialogue analysis unit 250. More specifically, the dialogue history storage unit 260 holds past conversations between the user and the voice interaction system 10, together with the morphemes derived from those conversations, the history of named entity occurrences, and the like.
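
A per-user store of past utterances and term occurrences might look like the sketch below. It is a simplified illustration: the class name `DialogueHistory` and its methods are hypothetical, and whitespace splitting of lowercase English text stands in for the morphological analysis and named entity extraction described above, which would need a proper analyzer for Japanese.

```python
from collections import Counter, defaultdict
from typing import Dict, List


class DialogueHistory:
    """Toy per-user dialogue history (hypothetical stand-in for unit 260)."""

    def __init__(self) -> None:
        self.utterances: Dict[str, List[str]] = defaultdict(list)
        self.term_counts: Dict[str, Counter] = defaultdict(Counter)

    def record(self, user_id: str, utterance: str) -> None:
        """Store the utterance and count the terms it contains."""
        # Assumption: whitespace tokenization replaces morphological analysis.
        tokens = utterance.lower().split()
        self.utterances[user_id].append(utterance)
        self.term_counts[user_id].update(tokens)

    def top_terms(self, user_id: str, n: int = 3) -> List[str]:
        """Return the user's n most frequent terms, most frequent first."""
        return [term for term, _count in self.term_counts[user_id].most_common(n)]
```

Frequent terms gathered this way could serve as the input for estimating a user's interests.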

The dialogue DB unit 270 holds prepared pairs of input phrases and response phrases used to generate dialogue sentences. When a condition for generating a dialogue is given to the dialogue DB unit 270, the phrase that fits the situation specified by the condition is retrieved from among the plurality of response phrases.

The dialogue generation unit 280 generates a dialogue using the data held in the dialogue history storage unit 260 and the database held by the dialogue DB unit 270. More specifically, the dialogue generation unit 280 generates a dialogue sentence for the user who is speaking to the voice interaction system 10, using the response phrase retrieved from the dialogue DB unit 270 based on the content of the user's utterance and the dialogue history held in the dialogue history storage unit 260. The dialogue sentence is generated, for example, as character string information.
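
The retrieval of a response phrase by input phrase and condition can be pictured as a keyed lookup. In the sketch below, the table name `DIALOGUE_DB`, the function `find_response`, and the use of a speaker attribute ("child" or "adult") as the condition are assumptions. The phrases are approximate English renderings of the greeting example given later in this section, and the actual data structure of the dialogue DB unit 270 is not reproduced here.

```python
from typing import Dict, Optional, Tuple

# Hypothetical table: (input phrase, condition) -> response phrase.
DIALOGUE_DB: Dict[Tuple[str, str], str] = {
    ("I'm home", "child"): "Welcome back. How was school?",
    ("I'm home", "adult"): "Welcome back. Thanks for your hard work today.",
}


def find_response(
    input_phrase: str,
    condition: str,
    db: Dict[Tuple[str, str], str] = DIALOGUE_DB,
) -> Optional[str]:
    """Return the response phrase matching the input phrase and condition.

    Returns None when no pair is registered for that combination.
    """
    return db.get((input_phrase, condition))
```

The same input phrase thus yields different replies depending on who was identified as the speaker.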

音声合成部２９０は、対話生成部２８０によって生成された対話文を用いて音声合成を行ない、音声対話システム１０の発話のためのデータを生成する。 The voice synthesis unit 290 performs voice synthesis using the dialogue sentence generated by the dialogue generation unit 280 and generates data for the speech of the voice dialogue system 10.

制御部２３０は、音声合成部２９０によって音声合成されたデータを受け取ると、そのデータをコミュニケーション端末２００に送信する。 When control unit 230 receives the data synthesized by speech synthesis unit 290, control unit 230 transmits the data to communication terminal 200.

コミュニケーション端末２００において、音声出力部２１１はその対話文を音声として出力する。 In the communication terminal 200, the voice output unit 211 outputs the dialogue sentence as a voice.

As an example, in one situation, when the child user 201 says “I'm home”, the communication terminal 200 accepts the utterance as input and transmits an audio signal to the server 220. The server 220 performs voice recognition on the utterance of the user 201, recognizes that the utterance is “I'm home”, and identifies the speaker as the user 201 (a child). Based on this recognition result, the server 220 performs voice synthesis for the reply “Welcome back. How was school?” and transmits the result to the communication terminal 200. The communication terminal 200 then says “Welcome back. How was school?” to the user 201.

In another situation, when the adult female user 202 utters the same words, “I'm home”, the communication terminal 200 accepts the voice input and transmits an audio signal to the server 220. The server 220 recognizes the content of the utterance and identifies the speaker. More specifically, the server 220 recognizes the utterance “I'm home” and, at the same time, identifies that it was spoken by the user 202 (an adult woman). Based on this recognition result, the server 220 generates a dialogue sentence in reply to “I'm home”: specifically, it generates “Welcome back. Good work today.” as the response to the utterance of the user 202. The server 220 synthesizes speech from the dialogue sentence and transmits the synthesized signal to the communication terminal 200. The communication terminal 200 outputs “Welcome back. Good work today.” to the user 202 by voice.
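
The two situations above amount to choosing a reply from both the recognized text and the identified speaker. A minimal sketch of that selection follows, assuming a simple lookup table. The names `RecognitionResult`, `RESPONSES`, and `generate_reply`, the speaker labels `child` and `adult_female`, and the English renderings of the greeting and replies are all hypothetical illustrations, not part of the disclosed system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecognitionResult:
    text: str     # utterance content from voice recognition
    speaker: str  # speaker label from speaker identification


# Hypothetical response table keyed by (utterance, speaker label).
RESPONSES = {
    ("I'm home", "child"): "Welcome back. How was school?",
    ("I'm home", "adult_female"): "Welcome back. Good work today.",
}

# Hypothetical fallback used when the speaker has no specific entry.
DEFAULT_RESPONSES = {"I'm home": "Welcome back."}


def generate_reply(result: RecognitionResult) -> str:
    """Prefer a speaker-specific reply; otherwise fall back to a generic one."""
    key = (result.text, result.speaker)
    if key in RESPONSES:
        return RESPONSES[key]
    return DEFAULT_RESPONSES.get(result.text, "")
```

The same utterance therefore maps to different replies only because the speaker label differs, which is the behaviour the two examples describe.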

<Configuration of communication terminal 200>
With reference to FIG. 3, the configuration of the communication terminal 200 according to the present embodiment will be described. FIG. 3 is a block diagram showing the hardware configuration of the communication terminal 200.

コミュニケーション端末２００は、ＣＰＵ（Central Processing Unit）２０と、アンテナ２３と、通信装置２４と、操作ボタン２５と、カメラ２６と、フラッシュメモリ２７と、ＲＡＭ（Random Access Memory）２８と、ＲＯＭ（Read Only Memory）２９と、メモリカード駆動装置３０と、マイク３２と、スピーカ３３と、音声信号処理回路３４と、モニタ３５と、ＬＥＤ（Light Emitting Diode）３６と、データ通信インターフェイス３７と、バイブレータ３８と、加速度センサ３９と、アクチュエータ４０とを備える。メモリカード駆動装置３０には、メモリカード３１が装着され得る。 The communication terminal 200 includes a CPU (Central Processing Unit) 20, an antenna 23, a communication device 24, operation buttons 25, a camera 26, a flash memory 27, a RAM (Random Access Memory) 28, and a ROM (Read Only). Memory) 29, memory card drive device 30, microphone 32, speaker 33, audio signal processing circuit 34, monitor 35, LED (Light Emitting Diode) 36, data communication interface 37, vibrator 38, An acceleration sensor 39 and an actuator 40 are provided. A memory card 31 can be attached to the memory card drive device 30.

アンテナ２３は、サーバ２２０によって発信される信号を受信し、または、サーバ２２０を介して他の通信装置と通信するための信号を送信する。アンテナ２３によって受信された信号は、通信装置２４によってフロントエンド処理が行なわれた後、処理後の信号は、ＣＰＵ２０に送られる。 The antenna 23 receives a signal transmitted by the server 220 or transmits a signal for communicating with another communication device via the server 220. The signal received by the antenna 23 is subjected to front-end processing by the communication device 24, and the processed signal is sent to the CPU 20.

操作ボタン２５は、コミュニケーション端末２００に対する操作を受け付ける。操作ボタン２５は、たとえば、ハードキーまたはソフトキーとして実現される。操作ボタン２５は、ユーザによる操作を受け付けると、その時のコミュニケーション端末２００の動作モードに応じた信号をＣＰＵ２０に送出する。 The operation button 25 receives an operation on the communication terminal 200. The operation button 25 is realized as a hard key or a soft key, for example. When the operation button 25 receives an operation by the user, the operation button 25 sends a signal to the CPU 20 according to the operation mode of the communication terminal 200 at that time.

ＣＰＵ２０は、コミュニケーション端末２００に対して与えられる命令に基づいてコミュニケーション端末２００の動作を制御するための処理を実行する。コミュニケーション端末２００が信号を受信すると、ＣＰＵ２０は、通信装置２４から送られた信号に基づいて予め規定された処理を実行し、処理後の信号を音声信号処理回路３４に送出する。音声信号処理回路３４は、その信号に対して予め規定された信号処理を実行し、処理後の信号をスピーカ３３に送出する。スピーカ３３は、その信号に基づいて音声を出力する。 The CPU 20 executes a process for controlling the operation of the communication terminal 200 based on a command given to the communication terminal 200. When the communication terminal 200 receives the signal, the CPU 20 executes a predetermined process based on the signal sent from the communication device 24 and sends the processed signal to the audio signal processing circuit 34. The audio signal processing circuit 34 performs predetermined signal processing on the signal, and sends the processed signal to the speaker 33. The speaker 33 outputs sound based on the signal.

マイク３２は、コミュニケーション端末２００に対する発話を受け付けて、発話された音声に対応する信号を音声信号処理回路３４に対して送出する。音声信号処理回路３４は、予め規定された処理を当該信号に対して実行し、処理後の信号をＣＰＵ２０に対して送出する。ＣＰＵ２０は、その信号を送信用のデータに変換し、変換後のデータを通信装置２４に対して送出する。通信装置２４は、そのデータを用いて送信用の信号を生成し、アンテナ２３に向けてその信号を送出する。アンテナ２３から発信される信号は、サーバ２２０に受信される。なお、他の局面において、アンテナ２３の代わりに、有線によってサーバ２２０とコミュニケーション端末２００とが接続されていてもよい。 The microphone 32 receives an utterance to the communication terminal 200 and sends a signal corresponding to the uttered voice to the voice signal processing circuit 34. The audio signal processing circuit 34 performs a predetermined process on the signal, and sends the processed signal to the CPU 20. The CPU 20 converts the signal into data for transmission, and sends the converted data to the communication device 24. The communication device 24 generates a signal for transmission using the data, and transmits the signal to the antenna 23. A signal transmitted from the antenna 23 is received by the server 220. In another aspect, instead of antenna 23, server 220 and communication terminal 200 may be connected by wire.

フラッシュメモリ２７は、ＣＰＵ２０から送られるデータを格納する。また、ＣＰＵ２０は、フラッシュメモリ２７に格納されているデータを読み出し、そのデータを用いて予め規定された処理を実行する。 The flash memory 27 stores data sent from the CPU 20. In addition, the CPU 20 reads data stored in the flash memory 27 and executes a predetermined process using the data.

ＲＡＭ２８は、操作ボタン２５に対して行なわれた操作に基づいてＣＰＵ２０によって生成されるデータを一時的に保持する。ＲＯＭ２９は、コミュニケーション端末２００に予め定められた動作を実行させるためのプログラムあるいはデータを格納している。ＣＰＵ２０は、ＲＯＭ２９から当該プログラムまたはデータを読み出し、コミュニケーション端末２００の動作を制御する。 The RAM 28 temporarily holds data generated by the CPU 20 based on the operation performed on the operation button 25. The ROM 29 stores a program or data for causing the communication terminal 200 to execute a predetermined operation. The CPU 20 reads the program or data from the ROM 29 and controls the operation of the communication terminal 200.

メモリカード駆動装置３０は、メモリカード３１に格納されているデータを読み出し、ＣＰＵ２０に送出する。メモリカード駆動装置３０は、ＣＰＵ２０によって出力されるデータを、メモリカード３１の空き領域に書き込む。 The memory card driving device 30 reads data stored in the memory card 31 and sends it to the CPU 20. The memory card drive device 30 writes the data output by the CPU 20 in the empty area of the memory card 31.

The audio signal processing circuit 34 performs the signal processing for a call as described above. In the example shown in FIG. 3, the CPU 20 and the audio signal processing circuit 34 are shown as separate components; in other aspects, however, they may be configured as a single unit.

The monitor 35 displays an image based on data acquired from the CPU 20. The monitor 35 displays, for example, still images stored in the flash memory 27 (for example, conference materials, contracts, and other electronic documents), moving images, and attributes of music files (such as the file name, performer, and playing time). The still images may include drawn images and images prepared in advance by the manufacturer of the communication terminal 200 as defaults.

ＬＥＤ３６は、ＣＰＵ２０からの信号に基づいて、予め定められた発光動作を実現する。データ通信インターフェイス３７は、データ通信用のケーブルの装着を受け付ける。 The LED 36 realizes a predetermined light emission operation based on a signal from the CPU 20. The data communication interface 37 accepts attachment of a data communication cable.

データ通信インターフェイス３７は、ＣＰＵ２０から出力される信号を当該ケーブルに対して送出する。あるいは、データ通信インターフェイス３７は、当該ケーブルを介して受信されるデータを、ＣＰＵ２０に対して送出する。 The data communication interface 37 sends a signal output from the CPU 20 to the cable. Alternatively, the data communication interface 37 sends data received via the cable to the CPU 20.

バイブレータ３８は、ＣＰＵ２０から出力される信号に基づいて、予め定められた周波数で発振動作を実行する。 Vibrator 38 performs an oscillating operation at a predetermined frequency based on a signal output from CPU 20.

加速度センサ３９は、コミュニケーション端末２００に作用する加速度の方向を検出する。検出結果は、ＣＰＵ２０に入力される。 The acceleration sensor 39 detects the direction of acceleration acting on the communication terminal 200. The detection result is input to the CPU 20.

The actuator 40 drives some members (not shown) of the communication terminal 200 based on a signal from the CPU 20. For example, when the communication terminal 200 is realized as an electronic device having the appearance of a stuffed toy, the actuator 40 can drive the hands, legs, neck, and other parts of the stuffed toy. The communication terminal 200 can thereby perform a motion (nodding, shaking its head, and the like) in response to the user's utterance.

なお、本実施の形態に係るコミュニケーション端末２００は上述の構成要素を全て備える必要はなく、少なくとも、音声入出力機能と通信機能とを有する情報処理端末であればよい。 Note that the communication terminal 200 according to the present embodiment need not include all the above-described components, and may be any information processing terminal having at least a voice input / output function and a communication function.

<Server configuration>
With reference to FIG. 4, the configuration of the server 220 according to the present embodiment will be described. FIG. 4 is a block diagram showing the hardware configuration of a computer 400 that implements the server 220 according to the embodiment.

The computer 400 includes, as main components, a CPU 1 that executes programs; a mouse 2 and a keyboard 3 that receive instructions input by a user of the computer 400; a RAM 4 that stores, in a volatile manner, data generated by the CPU 1 executing a program or data input via the mouse 2 or the keyboard 3; a hard disk 5 that stores data in a nonvolatile manner; an optical disk drive device 6; a communication IF (Interface) 7; and a monitor 8. The components are connected to one another by a bus. A CD-ROM 9 or another optical disk can be mounted on the optical disk drive device 6. The communication IF 7 includes, but is not limited to, a USB (Universal Serial Bus) interface, a wired LAN (Local Area Network), a wireless LAN, and a Bluetooth (registered trademark) interface.

コンピュータ４００における処理は、各ハードウェアおよびＣＰＵ１により実行されるソフトウェアによって実現される。このようなソフトウェアは、ハードディスク５に予め格納されている場合がある。また、ソフトウェアは、ＣＤ−ＲＯＭ９その他のコンピュータ読み取り可能な不揮発性のデータ記録媒体に格納されて、プログラム製品として流通している場合もある。あるいは、当該ソフトウェアは、インターネットその他のネットワークに接続されている情報提供事業者によってダウンロード可能なプログラム製品として提供される場合もある。このようなソフトウェアは、光ディスク駆動装置６その他のデータ読取装置によってデータ記録媒体から読み取られて、あるいは、通信ＩＦ７を介してダウンロードされた後、ハードディスク５に一旦格納される。そのソフトウェアは、ＣＰＵ１によってハードディスク５から読み出され、ＲＡＭ４に実行可能なプログラムの形式で格納される。ＣＰＵ１は、そのプログラムを実行する。 Processing in the computer 400 is realized by each hardware and software executed by the CPU 1. Such software may be stored in the hard disk 5 in advance. The software may be stored in a CD-ROM 9 or other non-volatile computer-readable data recording medium and distributed as a program product. Alternatively, the software may be provided as a program product that can be downloaded by an information provider connected to the Internet or other networks. Such software is read from the data recording medium by the optical disk drive 6 or other data reading device, or downloaded via the communication IF 7 and then temporarily stored in the hard disk 5. The software is read from the hard disk 5 by the CPU 1 and stored in the RAM 4 in the form of an executable program. The CPU 1 executes the program.

図４に示されるコンピュータ４００を構成する各構成要素は、一般的なものである。したがって、本実施の形態に係る本質的な部分は、コンピュータ４００に格納されたプログラムであるともいえる。コンピュータ４００のハードウェアの動作は周知であるので、詳細な説明は繰り返さない。 Each component constituting the computer 400 shown in FIG. 4 is general. Therefore, it can be said that an essential part according to the present embodiment is a program stored in the computer 400. Since the operation of the hardware of computer 400 is well known, detailed description will not be repeated.

The data recording medium is not limited to a CD-ROM, an FD (Flexible Disk), or a hard disk; it may be any nonvolatile data recording medium that carries the program in a fixed manner, such as a magnetic tape, a cassette tape, an optical disc (MO (Magneto-Optical disc) / MD (Mini Disc) / DVD (Digital Versatile Disc)), an IC (Integrated Circuit) card (including a memory card), an optical card, or a semiconductor memory such as a mask ROM, an EPROM (Erasable Programmable Read-Only Memory), an EEPROM (Electrically Erasable Programmable Read-Only Memory), or a flash ROM.

ここでいうプログラムとは、ＣＰＵにより直接実行可能なプログラムだけでなく、ソースプログラム形式のプログラム、圧縮処理されたプログラム、暗号化されたプログラム等を含み得る。 The program here may include not only a program directly executable by the CPU but also a program in a source program format, a compressed program, an encrypted program, and the like.

<Voice interaction system using wireless tags>
With reference to FIG. 5, a voice interaction system according to another aspect will be described. FIG. 5 is a diagram showing an outline of the configuration of a voice interaction system that responds to utterances using a wireless tag.

The voice interaction system 50 includes a communication terminal 500 and a server 520. The communication terminal 500 includes a voice input unit 210, a voice output unit 211, and a wireless tag information transmission unit 510. The server 520 includes a control unit 530, a voice recognition unit 540, a user identifier unit 541, a dialogue analysis unit 550, a dialogue history storage unit 560, a dialogue DB unit 570, a dialogue generation unit 580, and a voice synthesis unit 590.

In one situation, the user 201 has a mobile phone 501. The mobile phone 501 has, for example, a wireless tag A as its identification information. When the user 201 says “I'm home”, the utterance is input to the communication terminal 500 together with the identification information (wireless tag A) of the mobile phone 501. The communication terminal 500 recognizes the content of the user's utterance. In the communication terminal 500, the wireless tag information transmission unit 510 extracts the wireless tag A accompanying the utterance from the user 201 and transmits it to the server 520. The server 520 recognizes the utterance “I'm home” of the user 201 and the wireless tag A, and the dialogue generation unit 580 generates a dialogue suited to the user 201.

When the server 520 generates a dialogue sentence for the user 201, it transmits the corresponding signal to the communication terminal 500. The communication terminal 500 says “Welcome back. How was school?” to the user 201.

In another situation, when the adult user 202 is using the same mobile phone 501 and says “I'm home”, the communication terminal 500 accepts the utterance “I'm home” and, by communicating with the mobile phone 501, acquires the wireless tag A associated with it. The communication terminal 500 transmits the utterance “I'm home” and the wireless tag A to the server 520. That is, the communication terminal 500 transmits the wireless tag A associated with the mobile phone 501 regardless of whether the speaker is the user 201 or the user 202. On receiving the utterance “I'm home” and the wireless tag A, the server 520 determines from the wireless tag A that the child user 201 spoke, even though the speaker is actually the adult user 202. The server 520 therefore synthesizes the same dialogue sentence as for the user 201 and transmits “Welcome back. How was school?” to the communication terminal 500, which then says “Welcome back. How was school?” to the user 202. Thus, when users are recognized by a wireless tag, such as that of a wireless communication terminal a user may own (for example, the mobile phone 501), a speaker can easily impersonate another person. Because the server 520 can no longer identify which user spoke, it generates a dialogue sentence specific to the mobile phone 501 rather than one suited to the user.
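
The weakness of tag-based identification can be shown in a few lines. This is a minimal sketch under stated assumptions: the registry `TAG_REGISTRY`, the identifiers `tag_A`, `user_201`, and `user_202`, and both function names are hypothetical and chosen only to mirror the example in the text.

```python
# Hypothetical registry mapping a wireless tag to the user it was registered for.
TAG_REGISTRY = {"tag_A": "user_201"}


def identify_by_tag(tag_id: str) -> str:
    """Return the registered owner of the tag, or "unknown" if unregistered."""
    return TAG_REGISTRY.get(tag_id, "unknown")


def handle_utterance(actual_speaker: str, tag_id: str) -> str:
    """Identify the speaker the way a tag-based server would.

    actual_speaker is deliberately unused: the server only ever sees the tag,
    so whoever carries the device is treated as its registered owner.
    """
    return identify_by_tag(tag_id)
```

Calling `handle_utterance` with either `user_201` or `user_202` and the same tag yields the same identity, which is why the reply cannot be tailored to the person actually speaking.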

<Control structure>
With reference to FIG. 6, the control structure of the voice interaction system 10 according to the present embodiment will be described. FIG. 6 is a sequence chart showing the processing executed when a user is registered in the voice interaction system 10.

In step S610, a user who wishes to register with the voice interaction system 10 sends a voice authentication learning request to the communication terminal 200. On receiving the request, the communication terminal 200 establishes communication with the server 220 and transmits the request to the server 220. The control unit 230 of the server 220 receives the request.

ステップＳ６２０にて、サーバ２２０の制御部２３０は、音声認証学習のリクエストの受信に応答して、音声認証学習用のメッセージをユーザに通知する。より具体的には、制御部２３０は、コミュニケーション端末２００に対して当該メッセージを送信する。コミュニケーション端末２００は、サーバ２２０から当該メッセージを受信すると、音声出力部２１１がメッセージを音声で出力する。ユーザは、音声認証のために発話しないといけないメッセージを知ることができる。その後、制御はステップＳ６２５に移される。 In step S620, the control unit 230 of the server 220 notifies the user of a voice authentication learning message in response to receiving the voice authentication learning request. More specifically, the control unit 230 transmits the message to the communication terminal 200. When the communication terminal 200 receives the message from the server 220, the voice output unit 211 outputs the message by voice. The user can know the message that must be spoken for voice authentication. Thereafter, control is transferred to step S625.

ステップＳ６２５にて、制御部２３０は、コミュニケーション端末２００に対して音声取得を指示する命令を送信する。 In step S625, control unit 230 transmits a command to instruct voice acquisition to communication terminal 200.

ステップＳ６３０にて、コミュニケーション端末２００は、サーバ２２０から当該命令を受信すると、発話を促すメッセージをユーザに対して出力する。より具体的には、たとえばコミュニケーション端末２００は、発話を促すメッセージ「このメッセージが終わった後に発話をして下さい」を音声で出力する。他の局面において、コミュニケーション端末２００は、メッセージをモニタに表示してもよい。さらに他の局面において、コミュニケーション端末２００が音声入出力機能と通信機能と駆動機能とを備えるぬいぐるみとして実現される場合、コミュニケーション端末２００は、手を耳に当てる仕草のように、発話を促す動作を行なってもよい。 In step S630, upon receiving the command from server 220, communication terminal 200 outputs a message prompting the user to speak. More specifically, for example, the communication terminal 200 outputs a message for prompting utterance “Please speak after this message is finished” by voice. In another aspect, communication terminal 200 may display a message on a monitor. In still another aspect, when communication terminal 200 is realized as a stuffed toy having a voice input / output function, a communication function, and a drive function, communication terminal 200 performs an operation for prompting utterance like a gesture of placing a hand on an ear. You may do it.

ステップＳ６４０にて、ユーザは、当該発話を促すメッセージを認識すると、音声認証学習用のメッセージをコミュニケーション端末２００に向けて発話する。 In step S640, when the user recognizes the message prompting the utterance, the user utters the voice authentication learning message toward the communication terminal 200.

ステップＳ６５０にて、コミュニケーション端末２００の音声入力部２１０は、ユーザによるメッセージの発話の入力を受け付けて、その発話に応じた音声データをサーバ２２０に送信する。 In step S650, voice input unit 210 of communication terminal 200 receives an input of a message utterance by a user, and transmits voice data corresponding to the utterance to server 220.

ステップＳ６６０にて、サーバ２２０の制御部２３０は、その音声データの受信を検知すると、音声認識部２４０に対して当該メッセージの学習リクエストを送信する。音声認識部２４０は、当該学習リクエストの受信に応答して、音声認識処理と話者特定処理とを実行する。より具体的には、音声認識部２４０は、音声認識モジュール２４１としてユーザによって行なわれた発話の内容を音声認識処理する。また、音声認識部２４０は、話者特定モジュール２４２として、発話の内容から形態素を抽出し、当該発話を行なった話者を特定するための情報を取得する。 In step S660, when the control unit 230 of the server 220 detects reception of the voice data, the control unit 230 transmits a learning request for the message to the voice recognition unit 240. In response to receiving the learning request, the voice recognition unit 240 performs voice recognition processing and speaker identification processing. More specifically, the voice recognition unit 240 performs voice recognition processing on the content of an utterance made by the user as the voice recognition module 241. In addition, the speech recognition unit 240 extracts, as the speaker identification module 242, morphemes from the content of the utterance and acquires information for identifying the speaker who performed the utterance.

ステップＳ６７０にて、音声認識部２４０は、学習が完了したことを示す学習完了レスポンスを制御部２３０に送信する。 In step S670, the speech recognition unit 240 transmits a learning completion response indicating that learning is completed to the control unit 230.

ステップＳ６８０にて、制御部２３０は、学習完了レスポンスを受信すると、当該ユーザの学習が完了したことをコミュニケーション端末２００に通知する。 In step S680, when receiving the learning completion response, control unit 230 notifies communication terminal 200 that learning of the user has been completed.

ステップＳ６９０にて、コミュニケーション端末２００は、学習完了を通知するメッセージをユーザに対して発話する。このようにしてユーザ識別のための登録処理が実行される。なお、ステップＳ６１０におけるリクエストは、別の局面においては、ユーザが直接サーバに対して行なうものであってもよい。また音声対話システム１０にユーザを登録するために用いられる端末はコミュニケーション端末２００に限られない。少なくとも音声認識機能とサーバ２２０との通信機能とを備える情報処理通信端末であればよい。 In step S690, communication terminal 200 utters a message notifying completion of learning to the user. In this way, registration processing for user identification is executed. In another aspect, the request in step S610 may be made directly by the user to the server. A terminal used for registering a user in the voice interaction system 10 is not limited to the communication terminal 200. Any information processing communication terminal having at least a voice recognition function and a communication function with the server 220 may be used.

上記のような処理は、コミュニケーション端末２００およびサーバ２２０にユーザを登録するための処理プログラムが予め実行されている場合に実現される。また、ユーザを登録するための処理を開始するトリガは、ユーザによる特定の発話（ユーザ登録希望など）、あるいは、コミュニケーション端末２００またはサーバ２２０の入力スイッチその他入力操作等であってもよい。 The above processing is realized when a processing program for registering a user in the communication terminal 200 and the server 220 is executed in advance. The trigger for starting the process for registering the user may be a specific utterance (such as user registration request) by the user, an input switch of the communication terminal 200 or the server 220, or other input operation.
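
The registration sequence of steps S610 to S690 can be sketched as a small message flow on the server side. This is an illustrative sketch only: the class and method names are hypothetical, the step labels in the log simply follow the sequence described above, and the stored "voiceprint" is a placeholder (mean absolute amplitude) standing in for whatever speaker model an actual system would learn.

```python
from dataclasses import dataclass, field


@dataclass
class Recognizer:
    """Hypothetical stand-in for the voice recognition unit 240."""
    voiceprints: dict = field(default_factory=dict)

    def learn(self, user_id, samples):
        if not samples:
            raise ValueError("no audio samples received")
        # Placeholder feature, NOT a real speaker model.
        self.voiceprints[user_id] = sum(abs(s) for s in samples) / len(samples)
        return "learning_complete"


@dataclass
class Server:
    """Hypothetical stand-in for the control unit 230 of the server 220."""
    recognizer: Recognizer = field(default_factory=Recognizer)
    log: list = field(default_factory=list)

    def handle_learning_request(self):
        # S610 received; S620 and S625 are sent back to the terminal.
        self.log.append("S620: send enrollment message")
        self.log.append("S625: instruct voice acquisition")
        return "Please say the enrollment message."

    def handle_voice_data(self, user_id, samples):
        # S650 received; S660 to S680 run on the server.
        self.log.append("S660: learning request")
        response = self.recognizer.learn(user_id, samples)
        self.log.append("S670: " + response)
        self.log.append("S680: notify terminal")
        return response
```

A terminal would call `handle_learning_request` first, play the returned message to the user, and then pass the captured samples to `handle_voice_data`.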

図７を参照して、本実施の形態に係るユーザ識別に基づく発話シーケンスについて説明する。図７は、発話したユーザに応じた返答が生成される処理を表わすシーケンスチャートである。 With reference to FIG. 7, an utterance sequence based on user identification according to the present embodiment will be described. FIG. 7 is a sequence chart showing a process of generating a response according to the user who spoke.

In step S710, the user speaks to the communication terminal 200.
In step S720, on accepting the input of the user's utterance, the communication terminal 200 transmits voice data corresponding to the utterance to the server 220.

ステップＳ７３０にて、サーバ２２０の制御部２３０は、コミュニケーション端末２００からの音声データの受信を検知すると、当該音声データを認識するリクエストを音声認識部２４０に送信する。 In step S730, when the control unit 230 of the server 220 detects reception of voice data from the communication terminal 200, the control unit 230 transmits a request for recognizing the voice data to the voice recognition unit 240.

ステップＳ７４０にて、音声認識部２４０は、当該認識のリクエストに応答して、発話の内容を認識するための音声認識処理と、発話者を特定するための話者特定処理と、を実行する。さらに、音声認識部２４０は、音声認識の結果および話者特定の結果を認識レスポンスとして制御部２３０に送信する。 In step S740, in response to the recognition request, the speech recognition unit 240 executes speech recognition processing for recognizing the content of the utterance and speaker identification processing for identifying the speaker. Further, the voice recognition unit 240 transmits the result of voice recognition and the result of speaker identification to the control unit 230 as a recognition response.

In step S750, in response to receiving the recognition response, the control unit 230 transmits an analysis/generation request to each of the dialogue analysis unit 250 and the dialogue generation unit 280. On receiving the request, the dialogue analysis unit 250 refers to the dialogue history storage unit 260 and extracts the history of the user's past dialogues. In response to the generation request, the dialogue generation unit 280 uses the dialogue history held in the dialogue history storage unit 260 and the dialogue database held in the dialogue DB unit 270 to generate a dialogue sentence specific to the user who made the utterance.

ステップＳ７６０にて、対話分析部２５０および対話生成部２８０は、発話の分析の結果と生成した対話とを制御部２３０に送信する。制御部２３０は、分析の結果と生成された対話との受信に基づいて音声合成部２９０に当該対話文の音声合成を実行させる。音声合成部２９０が発話に対する返答を音声合成処理により生成する。ステップＳ７７０にて、制御部２３０は、音声合成部２９０によって生成された返答フレーズをコミュニケーション端末２００に送信する。 In step S760, dialog analysis unit 250 and dialog generation unit 280 transmit the result of the utterance analysis and the generated dialog to control unit 230. The control unit 230 causes the speech synthesis unit 290 to perform speech synthesis of the dialogue sentence based on the reception of the analysis result and the generated dialogue. The speech synthesizer 290 generates a response to the utterance by speech synthesis processing. In step S770, control unit 230 transmits the response phrase generated by speech synthesis unit 290 to communication terminal 200.

ステップＳ７８０にて、コミュニケーション端末２００は、返答フレーズをサーバ２２０から受信すると、当該返答フレーズをユーザに発話する。これにより、音声対話システム１０に対する発話を行なったユーザに固有な対話が実現され得る。 In step S780, when receiving a response phrase from server 220, communication terminal 200 utters the response phrase to the user. Thereby, the dialog peculiar to the user who performed the utterance to the voice dialog system 10 can be realized.
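
The server-side part of the sequence in steps S730 to S770 can be sketched as a single function. This is a minimal sketch under stated assumptions: the function `respond`, the `Recognition` tuple, the dictionary layouts of `history` and `phrase_db`, and the `"default"` fallback key are hypothetical, and the recognizer and synthesizer are passed in as callables because no concrete engine is specified.

```python
from typing import Callable, NamedTuple


class Recognition(NamedTuple):
    text: str     # utterance content
    speaker: str  # identified speaker


def respond(audio: bytes,
            recognize: Callable[[bytes], Recognition],
            history: dict,
            phrase_db: dict,
            synthesize: Callable[[str], bytes]) -> bytes:
    """Turn received voice data into synthesized reply data."""
    rec = recognize(audio)                      # S730 to S740: text and speaker
    past = history.setdefault(rec.speaker, [])  # S750: this speaker's history
    candidates = phrase_db.get(rec.text, {})
    # Prefer a speaker-specific phrase, then a generic one, then nothing.
    sentence = candidates.get(rec.speaker, candidates.get("default", ""))
    # The utterance is only recorded here; the history is not used for selection.
    past.append(rec.text)
    return synthesize(sentence)                 # S760 to S770: reply data
```

In this sketch the history is merely updated; a fuller version would also consult it when choosing the sentence, as the dialogue generation unit 280 is described as doing.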

<Interest estimation>
With reference to FIG. 8, an outline is given of a case in which the voice interaction system 10 according to the present embodiment replies in a manner tailored to the user based on user identification (interest estimation). FIG. 8 is a diagram illustrating an example of interest estimation in one situation. A voice interaction system 80 according to this situation includes a communication terminal 500 and a server 820. The communication terminal 500 includes a voice input unit 210, a voice output unit 211, and a wireless tag information transmission unit 510. The server 820 includes a control unit 530, a voice recognition unit 540, a dialogue analysis unit 850, a dialogue generation unit 880, a dialogue DB unit 570, and a voice synthesis unit 590.

The example shown in FIG. 8 is a case in which the voice interaction system 80 cannot identify the user who spoke and therefore cannot know that user's interests. In this case, when the user 201 says “Tell me the news” to the communication terminal 500, the communication terminal 500 communicates with the server 820, and the latest news item, for example “Today, Japan's national team drew with Greece”, is selected as an appropriate response to “Tell me the news”. When the server 820 transmits the result to the communication terminal 500, the communication terminal 500 says “Today, Japan's national team drew with Greece” to the user 201.

このような音声対話システム８０に対して別のユーザ（たとえば大人の女性のユーザ２０２）が同じ問いかけ「ニュースを教えて」を発すると、コミュニケーション端末５００はサーバ８２０と通信する。このとき、最新のニュースが更新されていない場合には、サーバ８２０は、「ニュースを教えて」に対する対話としてユーザ２０１に対して出力された結果「今日、日本代表がギリシャと引分けたよ」との発話を特定する。その結果、コミュニケーション端末５００は、大人の女性のユーザ２０２に対しても「今日、日本代表がギリシャと引分けたよ」と発話することになる。すなわち、ユーザの種類や興味に係わらず、同様の発話（同じキーワードを有する発話）に対しては音声対話システム８０は同じ返答を行なうことになる。 When another user (for example, an adult female user 202) asks this voice interaction system 80 the same question, “Tell me the news”, the communication terminal 500 communicates with the server 820. If the latest news has not been updated at that point, the server 820 identifies, as the dialogue for “Tell me the news”, the same utterance that was output to the user 201: “Japan drew with Greece today”. As a result, the communication terminal 500 also says “Japan drew with Greece today” to the adult female user 202. In other words, regardless of the type of user or the user's interests, the voice interaction system 80 gives the same reply to similar utterances (utterances containing the same keyword).

＜音声対話システム９０の構成＞
図９を参照して、本実施の形態に従う音声対話システム９０について説明する。図９は、音声対話システム９０の構成の一例を表わすブロック図である。音声対話システム９０は、ユーザ識別に基づきユーザに合わせた返答をすることができる。音声対話システム９０は、コミュニケーション端末２００と、サーバ９２０とを備える。コミュニケーション端末２００は、音声入力部２１０と、音声出力部２１１とを含む。サーバ９２０は、サーバ２２０の構成に対して、対話生成部２８０の代わりに対話生成部９８０を備える。対話生成部９８０は、興味推定モジュール９９０を含む。 <Configuration of Spoken Dialog System 90>
With reference to FIG. 9, a spoken dialogue system 90 according to the present embodiment will be described. FIG. 9 is a block diagram illustrating an example of the configuration of the voice interaction system 90. The voice interaction system 90 can make a response tailored to the user based on the user identification. The voice interaction system 90 includes a communication terminal 200 and a server 920. Communication terminal 200 includes an audio input unit 210 and an audio output unit 211. The server 920 includes a dialog generation unit 980 instead of the dialog generation unit 280 with respect to the configuration of the server 220. The dialog generation unit 980 includes an interest estimation module 990.

ある局面において、ユーザ２０１がコミュニケーション端末２００に対して「ニュースを教えて」と発話すると、コミュニケーション端末２００はサーバ９２０と通信し、ユーザ２０１の興味と発話内容（ニュースを教えて）とに応じた対話を生成する。より具体的には、対話生成部９８０において、興味推定モジュール９９０は、ユーザ２０１に固有の興味と、発話の内容（ニュースを教えて）とに基づいて、ユーザ２０１の興味を推定する。たとえば、興味推定モジュール９９０は、ユーザ２０１の興味としてスポーツが含まれることを対話履歴記憶部２６０から検知する。興味推定モジュール９９０は、そのような検知結果に基づいて、ユーザ２０１に応じた対話を生成する。たとえば、興味推定モジュール９９０は、対話ＤＢ部２７０に保持されているデータ（スポーツに特化したニュース）を参照して、「今日、日本代表がギリシャと引分けたよ」との対話を生成する。サーバ９２０がそのような興味推定の結果に基づいてユーザ２０１の興味に固有な対話を生成し、当該対話の音声合成を行なうと、コミュニケーション端末２００はユーザ２０１に対して「今日、日本代表がギリシャと引分けたよ」と発話する。 In one aspect, when the user 201 says “Tell me the news” to the communication terminal 200, the communication terminal 200 communicates with the server 920, and a dialogue is generated according to the interests of the user 201 and the content of the utterance (“Tell me the news”). More specifically, in the dialog generation unit 980, the interest estimation module 990 estimates the interest of the user 201 based on the interests specific to the user 201 and the content of the utterance. For example, the interest estimation module 990 detects from the dialogue history storage unit 260 that the interests of the user 201 include sports. The interest estimation module 990 generates a dialogue suited to the user 201 based on that detection result. For example, the interest estimation module 990 refers to the data held in the dialogue DB unit 270 (sports-specific news) and generates the dialogue “Japan drew with Greece today”. When the server 920 generates a dialogue specific to the interests of the user 201 based on the result of the interest estimation and synthesizes speech for that dialogue, the communication terminal 200 says “Japan drew with Greece today” to the user 201.

別の局面において、大人の女性のユーザ２０２が「ニュースを教えて」と発話すると、コミュニケーション端末２００はサーバ９２０に発話の内容を送信する。サーバ９２０において、興味推定モジュール９９０は、発話者に応じた対話を生成する。より具体的には、まず、興味推定モジュール９９０は、「ニュースを教えて」との発話を行なったユーザ２０２が大人の女性であることを特定し、当該ユーザ２０２の興味（たとえば芸能関係）を特定する。興味推定モジュール９９０は、対話ＤＢ部２７０にアクセスして、芸能関係の最新のニュースを特定する。対話生成部９８０は、ユーザ２０２に応じた対話として芸能関係のニュースを特定すると、「ニュースを教えて」に対する対話「大島優子卒業わずか９日でサプライズ復帰だって」と生成する。サーバ９２０が、生成した対話をコミュニケーション端末２００に送信すると、コミュニケーション端末２００はユーザ２０２に対して「大島優子卒業。わずか９日でサプライズ復帰だって」と発話する。このように、音声対話システム９０は、発話者に応じた発話を行なうことになる。 In another aspect, when the adult female user 202 says “Tell me the news”, the communication terminal 200 transmits the content of the utterance to the server 920. In the server 920, the interest estimation module 990 generates a dialogue suited to the speaker. More specifically, the interest estimation module 990 first identifies that the user 202 who said “Tell me the news” is an adult woman, and identifies the interest of the user 202 (for example, entertainment). The interest estimation module 990 accesses the dialogue DB unit 270 and identifies the latest entertainment news. Having identified entertainment news as the dialogue suited to the user 202, the dialog generation unit 980 generates, as the dialogue for “Tell me the news”, “I hear Yuko Oshima is making a surprise comeback just nine days after graduating”. When the server 920 transmits the generated dialogue to the communication terminal 200, the communication terminal 200 says to the user 202, “Yuko Oshima graduated. I hear she is making a surprise comeback after just nine days.” In this way, the voice interaction system 90 produces utterances suited to the speaker.

音声対話システム９０は、コミュニケーション端末２００に対するユーザの過去の発話内容（たとえばサッカーの話を数多くしていたり、芸能関係の話を多くしていたりするなど）の情報をこれまでの発話情報から解析し履歴として保持する。これにより、音声対話システム９０は、複数のユーザのそれぞれに応じた興味のある発話が可能となる。 The voice interaction system 90 analyzes the utterances made so far to obtain information on what each user has said to the communication terminal 200 in the past (for example, that the user often talks about soccer, or often talks about entertainment), and keeps that information as a history. This allows the voice interaction system 90 to produce utterances that match the interests of each of a plurality of users.

＜制御構造＞
図１０を参照して、音声対話システム９０の制御構造について説明する。図１０は、音声対話システム１０において行なわれる処理を表わすシーケンスチャートである。なお、前述の処理と同一の処理には同一のステップ番号を付してある。したがって、同じ処理の説明は繰り返さない。 <Control structure>
With reference to FIG. 10, the control structure of the voice interaction system 90 will be described. FIG. 10 is a sequence chart showing processing performed in the voice interaction system 10. The same steps as those described above are denoted by the same step numbers. Therefore, the description of the same process will not be repeated.

ステップＳ１０１０にて、対話分析部２５０および対話生成部２８０は、対話履歴記憶部２６０に対して興味取得リクエストを送信する。対話履歴記憶部２６０は、興味取得リクエストから、認識されたユーザに固有の興味を抽出する。 In step S1010, dialog analysis unit 250 and dialog generation unit 280 transmit an interest acquisition request to dialog history storage unit 260. The dialogue history storage unit 260 extracts an interest unique to the recognized user from the interest acquisition request.

ステップＳ１０２０にて、対話履歴記憶部２６０は、対話分析部２５０および対話生成部９８０に対して興味取得レスポンスを送信する。より具体的には、対話履歴記憶部２６０は、興味取得リクエストに含まれる当該ユーザに固有の興味を参照して、対話生成部９８０を介して、対話ＤＢ部２７０から当該興味を抽出し、その抽出結果を対話分析部２５０および対話生成部２８０に送信する。 In step S1020, dialog history storage unit 260 transmits an interest acquisition response to dialog analysis unit 250 and dialog generation unit 980. More specifically, the dialogue history storage unit 260 extracts the interest from the dialogue DB unit 270 via the dialogue generation unit 980 with reference to the interest specific to the user included in the interest acquisition request, and The extraction result is transmitted to the dialog analysis unit 250 and the dialog generation unit 280.

ステップＳ１０３０にて、制御部２３０は、対話ログ保存リクエストを対話履歴記憶部２６０に送信する。より具体的には、制御部２３０は、対話ログを保存するリクエストと、保存の対象となる対話ログ（または対話ログを識別するためのデータ）とを対話履歴記憶部２６０に送信する。 In step S1030, control unit 230 transmits a dialogue log storage request to dialogue history storage unit 260. More specifically, the control unit 230 transmits a request to save the dialogue log and a dialogue log (or data for identifying the dialogue log) to be saved to the dialogue history storage unit 260.

ステップＳ１０４０にて、対話履歴記憶部２６０は、当該リクエストの受信に基づいて、当該リクエストにより特定される対話ログを保存する。 In step S1040, dialogue history storage section 260 saves a dialogue log specified by the request based on the reception of the request.

＜興味を推定する方法＞
図１１を参照して、音声対話システムのユーザの興味推定法について説明する。図１１は、複数のユーザの各々の興味を推定する方法を概念的に表わす図である。 <Method of estimating interest>
With reference to FIG. 11, a user's interest estimation method of the voice interaction system will be described. FIG. 11 is a diagram conceptually showing a method for estimating the interest of each of a plurality of users.

ある局面においてユーザ２０１は、音声対話システム９０に対して「新しいゲーム知ってる？」と発話する。音声対話システム９０は、発話者（ユーザ２０１）を特定し、発話の内容（新しいゲーム知ってる？）を認識すると、発話に含まれるキーワード（たとえば名詞「ゲーム」）を抽出し、対話履歴記憶部２６０にキーワード「ゲーム」をユーザ２０１に関連付けて格納する。 In one aspect, the user 201 says “Do you know any new games?” to the voice interaction system 90. After identifying the speaker (user 201) and recognizing the content of the utterance (“Do you know any new games?”), the voice interaction system 90 extracts a keyword contained in the utterance (for example, the noun “game”) and stores the keyword “game” in the dialogue history storage unit 260 in association with the user 201.

別の局面において、別のユーザ２０２が音声対話システム９０に対して「新しいカフェが近所にできたんだって」と発話すると、サーバ９２０は、キーワード「カフェ」を抽出し、その抽出したキーワードとユーザ２０２の識別情報とを関連付けて対話履歴記憶部２６０に格納する。 In another aspect, when another user 202 says “I hear a new cafe has opened in the neighborhood” to the voice interaction system 90, the server 920 extracts the keyword “cafe” and stores the extracted keyword in the dialogue history storage unit 260 in association with the identification information of the user 202.

このようにして、対話履歴記憶部２６０は、ユーザ毎に、当該ユーザの発話中に含まれるキーワード（たとえば名詞）を順次蓄積していく。 In this manner, the dialogue history storage unit 260 sequentially accumulates keywords (for example, nouns) included in the user's utterance for each user.

興味推定モジュール９９０は、対話履歴記憶部２６０に格納されている各ユーザの発話内容に含まれる名詞の出現回数と、出現時刻とに基づいてスコア付けを行なう。興味推定モジュール９９０は、スコアが高いものから当該ユーザの興味ある事象として扱う。たとえば、興味推定モジュール９９０は、より直近の一定期間に出現する名詞のスコアが高くなるように係数を設定する。係数の設定方法は、たとえば比例的にあるいはステップ関数的に増加するように設定され得る。 The interest estimation module 990 scores the nouns contained in each user's utterances stored in the dialogue history storage unit 260, based on how many times each noun appears and when it appears. The interest estimation module 990 treats the nouns as matters of interest to the user in descending order of score. For example, the interest estimation module 990 sets a coefficient so that nouns appearing within a more recent fixed period score higher. The coefficient may be set, for example, to increase proportionally or as a step function.
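
The scoring described above can be sketched as follows. This is only an illustrative reading of the paragraph, not the implementation in the specification: the function name `score_interests`, the seven-day window, and the weight of 2.0 are assumptions chosen to show a step-function coefficient.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def score_interests(records, now, recent_window=timedelta(days=7), recent_weight=2.0):
    """Rank nouns by appearance count, weighting recent appearances more.

    records: iterable of (noun, timestamp) pairs for a single user.
    The window and weight are hypothetical values, not taken from the source.
    """
    scores = defaultdict(float)
    for noun, timestamp in records:
        # Step-function coefficient: appearances inside the recent window
        # count for more than older ones.
        weight = recent_weight if now - timestamp <= recent_window else 1.0
        scores[noun] += weight
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


now = datetime(2014, 9, 29)
records = [
    ("soccer", now - timedelta(days=30)),
    ("soccer", now - timedelta(days=20)),
    ("game", now - timedelta(days=1)),
    ("game", now - timedelta(days=2)),
    ("cafe", now - timedelta(days=3)),
]
ranked = score_interests(records, now)
```

With this sample data, the two recent mentions of "game" outrank the two older mentions of "soccer". A proportional coefficient could be substituted by computing the weight from the age of each record instead of comparing it to a window.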

＜データ構造＞
図１２および図１３を参照して、本実施の形態に係る音声対話システム９０のデータ構造について説明する。図１２は、対話履歴記憶部２６０に保持されるテーブル１２００を表わす図である。図１３は、特定のユーザについて抽出されたテーブル１３００を表わす図である。 <Data structure>
With reference to FIG. 12 and FIG. 13, the data structure of the voice interaction system 90 according to the present embodiment will be described. FIG. 12 is a diagram showing a table 1200 held in the dialogue history storage unit 260. FIG. 13 is a diagram showing a table 1300 extracted for a specific user.

図１２に示されるように、テーブル１２００は、対話履歴興味記録テーブルとして作成され更新される。テーブル１２００は、レコードＩＤ１２１０と、ユーザＩＤ１２２０と、話者１２３０と、興味名詞１２４０と、タイムスタンプ１２５０とを含む。レコードＩＤ１２１０は、音声対話システム９０と各ユーザとによって行なわれた対話を識別する。ユーザＩＤ１２２０は、音声対話システム９０に登録されているユーザを識別する。話者１２３０は、当該発話を行なったユーザの名前である。興味名詞１２４０は、当該ユーザが関心を持つ名詞を表わす。タイムスタンプ１２５０は、当該発話が認識された時刻を特定する。タイムスタンプ１２５０を用いて、レコードの抽出対象となる期間を適宜設定することができる。 As shown in FIG. 12, the table 1200 is created and updated as a dialogue history interest record table. The table 1200 includes a record ID 1210, a user ID 1220, a speaker 1230, an interest noun 1240, and a time stamp 1250. The record ID 1210 identifies a dialogue conducted between the voice interaction system 90 and a user. The user ID 1220 identifies a user registered in the voice interaction system 90. The speaker 1230 is the name of the user who made the utterance. The interest noun 1240 represents a noun that the user is interested in. The time stamp 1250 specifies the time at which the utterance was recognized. The time stamp 1250 can be used to set, as appropriate, the period from which records are extracted.

図１３を参照して、テーブル１３００は、レコードＩＤ１２１０と、ユーザＩＤ１２２０と、話者１２３０と、興味名詞１２４０と、タイムスタンプ１２５０とを含む。たとえば、音声対話システム９０のユーザとしてユーザＩＤ１２２０が「１２３４４３１２」と特定されると、テーブル１３００に示されるように、当該ユーザＩＤの値を有する各レコードが抽出される。このユーザは、興味として、たとえば「音楽」、「きゃりーぱみゅぱみゅ」、「サマーソニック」を有していることがわかる。 Referring to FIG. 13, table 1300 includes record ID 1210, user ID 1220, speaker 1230, interest noun 1240, and time stamp 1250. For example, when the user ID 1220 is specified as “12344312” as the user of the voice interaction system 90, each record having the value of the user ID is extracted as shown in the table 1300. It can be seen that this user has, for example, “music”, “kyary pamyu pamyu”, and “summer sonic” as interests.
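
As a rough illustration of how the table 1300 could be derived from the table 1200, the sketch below models one record as a Python dataclass and filters by user ID and, optionally, by time stamp. The class name, the function name, the speaker names, the second user ID, and all time stamps are hypothetical; only the field list, the user ID 12344312, and its three interest nouns come from the description.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class InterestRecord:
    record_id: int       # record ID 1210
    user_id: str         # user ID 1220
    speaker: str         # speaker 1230
    interest_noun: str   # interest noun 1240
    timestamp: datetime  # time stamp 1250


def extract_for_user(table, user_id, since=None):
    """Return the records of one user, optionally limited to a recent period."""
    return [
        record
        for record in table
        if record.user_id == user_id
        and (since is None or record.timestamp >= since)
    ]


# Sample rows: speaker names, the second user ID, and the times are made up.
table_1200 = [
    InterestRecord(1, "12344312", "Speaker A", "music", datetime(2014, 8, 1, 9, 0)),
    InterestRecord(2, "99999999", "Speaker B", "cafe", datetime(2014, 8, 2, 10, 0)),
    InterestRecord(3, "12344312", "Speaker A", "Kyary Pamyu Pamyu", datetime(2014, 8, 3, 11, 0)),
    InterestRecord(4, "12344312", "Speaker A", "Summer Sonic", datetime(2014, 8, 4, 12, 0)),
]
table_1300 = extract_for_user(table_1200, "12344312")
```

Passing `since` restricts the extraction to a period, which corresponds to the use of the time stamp 1250 described for the table 1200.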

図１４を参照して、音声対話システム９０のデータ構造についてさらに説明する。図１４は、対話ＤＢ部２７０のデータ構造を表わす図である。対話ＤＢ部２７０は、入力フレーズ１１１０と、興味名詞１１２０と、出力フレーズ１１３０とを含む。入力フレーズ１１１０は、音声対話システム９０に対して入力された発話内容を表わす。 The data structure of the voice interaction system 90 will be further described with reference to FIG. FIG. 14 is a diagram showing the data structure of dialog DB unit 270. The dialogue DB unit 270 includes an input phrase 1110, an interest noun 1120, and an output phrase 1130. The input phrase 1110 represents the utterance content input to the voice interactive system 90.

興味名詞１１２０は、対話履歴記憶部２６０に格納されている興味名詞１２４０に相当する。出力フレーズ１１３０は、興味名詞１１２０のそれぞれに応じて関連付けられているユーザに対する応答内容を表わす。 The interest noun 1120 corresponds to the interest noun 1240 stored in the dialogue history storage unit 260. The output phrase 1130 represents the response content for the user associated with each of the interest nouns 1120.
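
One way to picture the lookup against this structure is shown below. It is a sketch under stated assumptions: the rows, the keys "sports" and "entertainment", and the function `choose_reply` are illustrative, and the specification does not say how ties or missing entries are handled.

```python
# Each row: (input phrase 1110, interest noun 1120, output phrase 1130).
# The rows are illustrative and reuse the example replies from the description.
DIALOG_DB = [
    ("tell me the news", "sports", "Japan drew with Greece today."),
    ("tell me the news", "entertainment",
     "I hear Yuko Oshima is making a surprise comeback just nine days after graduating."),
]


def choose_reply(input_phrase, ranked_interests, db, fallback=None):
    """Return the output phrase for the user's highest-ranked matching interest.

    ranked_interests: interest nouns ordered from most to least interesting.
    """
    for noun in ranked_interests:
        for phrase, interest, output in db:
            if phrase == input_phrase and interest == noun:
                return output
    # No row matches any of the user's interests.
    return fallback


reply_for_user_201 = choose_reply("tell me the news", ["game", "sports"], DIALOG_DB)
reply_for_user_202 = choose_reply("tell me the news", ["entertainment"], DIALOG_DB)
```

The same input phrase thus yields different output phrases for users whose stored interest nouns differ.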

［実施の形態の効果］
以上のようにして、本実施の形態によれば、ユーザがRFIDを所持することを要求したり、音声対話システムにカメラを導入することなく、ユーザの認証とそのユーザに合わせた話題の提供が可能となる。音声対話システムは、そのユーザが過去に話したこと、あるいは、関連することを提供できるので、ユーザと音声対話システムとの円滑な会話が可能となる。 [Effect of the embodiment]
As described above, according to the present embodiment, a user can be authenticated and offered topics tailored to that user without requiring the user to carry an RFID tag and without adding a camera to the voice interaction system. Because the voice interaction system can bring up things the user has talked about in the past, or matters related to them, a smooth conversation between the user and the voice interaction system becomes possible.

ユーザが定期的かつ長期的に使用することにより、音声対話システムを構成するロボット（コミュニケーション端末２００）の発話内容がより親近感を持つものへと変化する。ロボットがユーザの興味ある内容に基づいて返答することにより、ユーザがロボットに対して親しみや愛着を持つことが可能となる。 When the user uses it regularly and for a long period of time, the utterance content of the robot (communication terminal 200) constituting the voice interaction system changes to something more familiar. When the robot responds based on the content that the user is interested in, it becomes possible for the user to be familiar with and attached to the robot.

［第２の実施の形態］
以下、第２の実施の形態について説明する。第２の実施の形態に係る音声対話システムでは、特定のユーザの話題が推定され得る。 [Second Embodiment]
Hereinafter, a second embodiment will be described. In the voice interactive system according to the second embodiment, the topic of a specific user can be estimated.

＜話題推定＞
図１５を参照して、ユーザ識別に基づきユーザに合わせた返答する他の対応（話題推定）について説明する。図１５は、ある局面における音声対話システム８０による話題推定の概要を表わす図である。 <Topic estimation>
With reference to FIG. 15, another response (topic estimation) that responds to the user based on the user identification will be described. FIG. 15 is a diagram showing an outline of topic estimation by the voice interaction system 80 in a certain situation.

ユーザ１５０１が音声対話システム８０に対して、「週末、京都旅行なんだ」と発話する（発話１５１０）。音声対話システム８０は、発話１５１０を認識すると、コミュニケーション端末５００が「京都といえば金閣だよね」と発話する（発話１５２０）。その後、ユーザ１５０１が「お勧めのお土産あるかな？」と発話すると（発話１５３０）、音声対話システム８０は発話１５３０の音声を認識し、その認識結果に基づいて、コミュニケーション端末５００は、メッセージ「何のお土産？」を音声で出力する（発話１５４０）。 The user 1501 says “I'm going on a trip to Kyoto this weekend” to the voice interaction system 80 (utterance 1510). When the voice interaction system 80 recognizes the utterance 1510, the communication terminal 500 says “Speaking of Kyoto, there's Kinkaku, isn't there?” (utterance 1520). Then, when the user 1501 says “Any souvenirs you'd recommend?” (utterance 1530), the voice interaction system 80 recognizes the speech of the utterance 1530 and, based on the recognition result, the communication terminal 500 outputs the message “What kind of souvenir?” by voice (utterance 1540).

発話１５４０の内容から明らかなように、音声対話システム８０は、ユーザ１５０１が直前まで話していた話題を知らないため、ユーザが正確に表現する必要がある。 As is clear from the content of the utterance 1540, the voice interaction system 80 does not know the topic that the user 1501 was talking about immediately beforehand, so the user has to state precisely what is meant.

＜音声対話システム１６００の構成＞
そこで、図１６を参照して、本実施の形態に従う音声対話システム１６００について説明する。図１６は、音声対話システム１６００の構成を概念的に表わす図である。音声対話システム１６００は、コミュニケーション端末２００と、サーバ１６２０とを備える。サーバ１６２０は、図２に示されるサーバ２２０の構成に対して、対話履歴記憶部２６０に代えて、対話履歴記憶部１６６０を備える。また、サーバ１６２０は、対話生成部１６８０と対話ＤＢ部１６７０とを備える。対話生成部１６８０は、話題推定モジュール１６９０を含む。なお、本実施の形態に係る音声対話システム１６００の他の構成は、音声対話システム９０の構成と同じである。したがって、同じ構成の説明は繰り返さない。 <Configuration of Spoken Dialog System 1600>
Therefore, with reference to FIG. 16, a spoken dialogue system 1600 according to the present embodiment will be described. FIG. 16 is a diagram conceptually showing the configuration of the voice interaction system 1600. The voice interaction system 1600 includes a communication terminal 200 and a server 1620. The server 1620 includes a dialogue history storage unit 1660 instead of the dialogue history storage unit 260 in the configuration of the server 220 shown in FIG. The server 1620 also includes a dialog generation unit 1680 and a dialog DB unit 1670. Dialog generation unit 1680 includes a topic estimation module 1690. The other configuration of the voice interaction system 1600 according to the present embodiment is the same as the configuration of the voice interaction system 90. Therefore, the description of the same configuration will not be repeated.

本実施の形態に係る音声対話システム１６００において、ユーザ１５０１が「週末、京都旅行なんだ」とコミュニケーション端末２００に対して発話すると（発話１５１０）、コミュニケーション端末２００は、「京都といえば金閣だよね」をユーザ１５０１に返す（発話１５２０）。ユーザ１５０１が「お勧めのお土産あるかな？」とコミュニケーション端末２００に返すと（発話１５３０）、音声対話システム１６００は、過去の履歴と話題とに基づいて、「僕は八つ橋がお勧めだよ」とユーザ１５０１に返答する（発話１６４０）。 In the voice interaction system 1600 according to the present embodiment, when the user 1501 says “I'm going on a trip to Kyoto this weekend” to the communication terminal 200 (utterance 1510), the communication terminal 200 replies “Speaking of Kyoto, there's Kinkaku, isn't there?” to the user 1501 (utterance 1520). When the user 1501 then says “Any souvenirs you'd recommend?” to the communication terminal 200 (utterance 1530), the voice interaction system 1600 replies to the user 1501, based on the past history and the topic, “I recommend yatsuhashi” (utterance 1640).

すなわち、音声対話システム１６００によると、サーバ１６２０は、ユーザ１５０１と音声対話システム１６００との間で直前まで話されていた話題（たとえば、京都に関する話）を参照することができるため、ユーザ１５０１に対応した、より自然な対話が可能となる。 That is, in the voice interaction system 1600, the server 1620 can refer to the topic that was being discussed between the user 1501 and the voice interaction system 1600 immediately beforehand (for example, the conversation about Kyoto), which makes a more natural dialogue suited to the user 1501 possible.

なお、音声対話システム１６００による話題推定を用いた発話のシーケンスは、興味推定を用いた発話シーケンス（図１０）と同様である。したがって、音声対話システム１６００の発話シーケンスの説明は繰り返さない。 Note that the utterance sequence using topic estimation by the voice interaction system 1600 is the same as the utterance sequence using interest estimation (FIG. 10). Therefore, the description of the utterance sequence of the voice interaction system 1600 will not be repeated.

＜話題推定法＞
図１７を参照して、音声対話システム１６００における話題推定法について説明する。図１７は、音声対話システム１６００が複数のユーザそれぞれの発話に基づいて話題を推定する一態様を表わす図である。 <Topic estimation method>
With reference to FIG. 17, the topic estimation method in the voice interaction system 1600 will be described. FIG. 17 is a diagram illustrating an aspect in which the voice conversation system 1600 estimates a topic based on the utterances of each of a plurality of users.

ある局面において、ユーザ２０１は、音声対話システム１６００のコミュニケーション端末２００に対して、発話（明日、遠足で上野動物園に行くんだ）を行なう。音声対話システム１６００は、その発話から興味名詞（キーワード）として「上野動物園」を抽出し、その抽出した内容を対話履歴記憶部１６６０に格納する。別の局面において、大人のユーザ２０２が発話（代官山においしいカフェがあるんだって）を行なうと、サーバ１６２０は、興味名詞として「代官山」を抽出し、その抽出した結果をユーザ２０２に関連付けて対話履歴記憶部１６６０に格納する。すなわち、サーバ１６２０は、固有表現の抽出を行ない、得られた単語とその種別とを対話履歴として対話履歴記憶部１６６０に保存する。固有表現は、たとえば、組織名、人名、地名、日付表現、時間表現、金額表現、割合表現、固有物名の８種類を含む。 In one aspect, the user 201 makes an utterance (“I'm going to Ueno Zoo on a school excursion tomorrow”) to the communication terminal 200 of the voice interaction system 1600. The voice interaction system 1600 extracts “Ueno Zoo” from the utterance as an interest noun (keyword) and stores it in the dialogue history storage unit 1660. In another aspect, when the adult user 202 makes an utterance (“I hear there's a great cafe in Daikanyama”), the server 1620 extracts “Daikanyama” as an interest noun and stores the extraction result in the dialogue history storage unit 1660 in association with the user 202. That is, the server 1620 extracts named entities (the specific expressions referred to below) and stores each obtained word and its type in the dialogue history storage unit 1660 as dialogue history. The named entities fall into, for example, eight types: organization names, person names, place names, date expressions, time expressions, monetary expressions, percentage expressions, and artifact names.

対話生成部１６８０は、対話履歴記憶部１６６０に格納されている対話履歴を参照して話題を抽出する。より具体的には、話題推定モジュール１６９０は、対話履歴記憶部１６６０に格納されているデータの中から、予め定められた直近の一定時間内に記録されている固有表現を話題として抽出する。話題推定モジュール１６９０は、その抽出された話題をフィルタとして用いて、対話ＤＢ部１６７０に保存されているデータから候補を絞り込む。図１７に示される例では、ユーザ２０２による最後の発話から予め定められた直近の一定期間内に抽出された固有表現（沖縄、石垣島）をフィルタとして用いる。 The dialogue generation unit 1680 refers to the dialogue history stored in the dialogue history storage unit 1660 and extracts topics. More specifically, the topic estimation module 1690 extracts, from the data stored in the dialogue history storage unit 1660, a specific expression recorded within a predetermined fixed time as a topic. The topic estimation module 1690 uses the extracted topics as a filter to narrow down candidates from data stored in the dialog DB unit 1670. In the example shown in FIG. 17, a specific expression (Okinawa, Ishigakijima) extracted within a certain period of time that is predetermined in advance from the last utterance by the user 202 is used as a filter.
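
The two steps above (extracting recent topics, then using them as a filter) can be sketched as follows. This is a minimal sketch under assumptions: the function names, the 30-minute window, the user IDs, and the sample rows (including the Okinawa reply) are hypothetical and serve only to show the filtering idea.

```python
from datetime import datetime, timedelta


def recent_topics(history, user_id, now, window=timedelta(minutes=30)):
    """Collect keywords recorded for one user within the most recent window.

    history: iterable of (user_id, keyword, timestamp) tuples.
    The window length is an assumed value; the source only says "predetermined".
    """
    return {
        keyword
        for uid, keyword, timestamp in history
        if uid == user_id and timedelta(0) <= now - timestamp <= window
    }


def filter_replies(dialog_db, utterance, topics):
    """Keep only the replies whose place matches one of the current topics.

    dialog_db: iterable of (user utterance, place name, reply phrase) rows.
    """
    return [
        reply
        for row_utterance, place, reply in dialog_db
        if row_utterance == utterance and place in topics
    ]


now = datetime(2014, 9, 29, 12, 0)
history = [
    ("user-1501", "Kyoto", now - timedelta(minutes=2)),
    ("user-1501", "Okinawa", now - timedelta(days=3)),   # too old to count
    ("user-202", "Daikanyama", now - timedelta(minutes=1)),
]
dialog_db = [
    ("any souvenirs you'd recommend?", "Kyoto", "I recommend yatsuhashi."),
    ("any souvenirs you'd recommend?", "Okinawa", "I recommend chinsuko."),
]
topics = recent_topics(history, "user-1501", now)
candidates = filter_replies(dialog_db, "any souvenirs you'd recommend?", topics)
```

Only the Kyoto row survives the filter, because the Okinawa keyword falls outside the recent window and the Daikanyama keyword belongs to another user.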

＜データ構造＞
図１８を参照して、サーバ１６２０のデータ構造について説明する。図１８は、対話ＤＢ部１６７０のデータ構造を概念的に表わす図である。ある局面において、対話ＤＢ部１６７０は、テーブル１８００を含む。テーブル１８００は、話題フィルタ構造を有している。より具体的には、テーブル１８００は、ユーザ発話１８１０と、地名１８２０と、返答フレーズ１８３０とを含む。ユーザ発話１８１０は、音声対話システム１６００のユーザによって行なわれた発話を表わす。地名１８２０は、当該発話の際に固有表現として抽出された地名を表わす。返答フレーズ１８３０は、ユーザとの対話において出力された返答を表わす。 <Data structure>
The data structure of the server 1620 will be described with reference to FIG. FIG. 18 is a diagram conceptually showing the data structure of dialog DB unit 1670. In one aspect, dialog DB unit 1670 includes a table 1800. The table 1800 has a topic filter structure. More specifically, table 1800 includes user utterance 1810, place name 1820, and reply phrase 1830. User utterance 1810 represents an utterance made by a user of voice interaction system 1600. The place name 1820 represents a place name extracted as a specific expression at the time of the utterance. The response phrase 1830 represents a response output in the dialog with the user.

図１９を参照して、本実施の形態に係る対話システム１６００のデータ構造についてさらに説明する。図１９は、対話履歴記憶部１６６０におけるデータの格納の一態様を概念的に表わす図である。対話履歴記憶部１６６０は、テーブル１９００を含む。テーブル１９００は、ユーザと音声対話システム１６００との対話の履歴を記憶している。テーブル１９００は、レコードＩＤ１９１０と、ユーザＩＤ１９２０と、話者１９３０と、話題キーワード１９４０と、話題種別１９５０と、タイムスタンプ１９６０とを含む。レコードＩＤ１９１０は、テーブル１９００に含まれる各レコードを識別する。ユーザＩＤ１９２０は、当該レコードの発話を行なったユーザを識別する。話者１９３０は、当該ユーザＩＤによって特定されるユーザ（発話者）を特定する。話題キーワード１９４０は、当該ユーザによる発話から固有表現として抽出された名詞を表わす。話題種別１９５０は、話題キーワード１９４０によって特定される話題の種類を表わす。話題種別１９５０は、たとえば組織（ORGANIZATION）、場所（LOCATION）などと表わされる。タイムスタンプ１９６０は、当該発話がテーブル１９００に追加された時刻を表わす。 With reference to FIG. 19, the data structure of dialog system 1600 according to the present embodiment will be further described. FIG. 19 is a diagram conceptually showing one mode of data storage in dialog history storage unit 1660. The dialogue history storage unit 1660 includes a table 1900. The table 1900 stores a history of dialogue between the user and the voice dialogue system 1600. The table 1900 includes a record ID 1910, a user ID 1920, a speaker 1930, a topic keyword 1940, a topic type 1950, and a time stamp 1960. The record ID 1910 identifies each record included in the table 1900. The user ID 1920 identifies the user who made the utterance of the record. The speaker 1930 specifies the user (speaker) specified by the user ID. The topic keyword 1940 represents a noun extracted as a specific expression from the utterance by the user. The topic type 1950 represents the type of topic specified by the topic keyword 1940. The topic type 1950 is expressed as, for example, an organization (ORGANIZATION), a location (LOCATION), or the like. Time stamp 1960 represents the time when the utterance was added to table 1900.

図２０を参照して、サーバ１６２０のデータ構造についてさらに説明する。図２０は、サーバ１６２０が備えるテーブル２０００におけるデータの格納の一態様を概念的に表す図である。テーブル２０００は、レコードＩＤ１９１０と、ユーザＩＤ１９２０と、話者１９３０と、話題キーワード２０４０と、話題種別１９５０と、タイムスタンプ１９６０とを含む。テーブル２０００は、ユーザＩＤ１９２０が「１２３４４３１２」で特定されるユーザのみの発話のレコードを含む。 The data structure of the server 1620 will be further described with reference to FIG. FIG. 20 is a diagram conceptually illustrating one aspect of data storage in table 2000 provided in server 1620. The table 2000 includes a record ID 1910, a user ID 1920, a speaker 1930, a topic keyword 2040, a topic type 1950, and a time stamp 1960. The table 2000 includes an utterance record only for the user whose user ID 1920 is specified by “12344312”.

より具体的には、話題キーワード２０４０に示されるように、当該ユーザは、話題として上野動物園、上野、東京都を有している。したがって、あるユーザが音声対話システム１６００に対して発話した場合、対話生成部１６８０は、当該ユーザの直近の話題として上野動物園、上野、東京都の話題キーワード２０４０を用いて当該ユーザからの発話に対する返答を生成し得る。 More specifically, as indicated by the topic keyword 2040, the user has Ueno Zoo, Ueno, and Tokyo as topics. Therefore, when a certain user utters the voice dialogue system 1600, the dialogue generation unit 1680 uses the topic keywords 2040 of Ueno Zoo, Ueno, and Tokyo as the latest topic of the user, and responds to the utterance from the user. Can be generated.

図２１を参照して、本実施の形態に係る音声対話システム１６００のデータ構造についてさらに説明する。図２１は、対話ＤＢ部１６７０のデータ構造を表わす図である。対話ＤＢ部１６７０は、入力フレーズ２１１０と、場所２１２０と、出力フレーズ２１２０とを含む。対話ＤＢ部１６７０は、話題で推定された場所に基づいてフィルタリングを行なうためのデータを保持している。たとえば、音声対話システム１６００において、ユーザが「雑学教えて」との発話をコミュニケーション端末２００に与えると（入力フレーズ２１１０）、サーバ１６２０は、その発話に関連付けられる場所（たとえば「北海道」）を抽出する。この場合、話題推定モジュール１６９０は、場所２１２０が「北海道」である４つの出力フレーズ２１２０を抽出することになる。 With reference to FIG. 21, the data structure of voice interactive system 1600 according to the present embodiment will be further described. FIG. 21 shows a data structure of dialog DB unit 1670. The dialogue DB unit 1670 includes an input phrase 2110, a place 2120, and an output phrase 2120. The dialogue DB unit 1670 holds data for performing filtering based on the location estimated by the topic. For example, in the voice interaction system 1600, when the user gives an utterance “Tell me about trivia” to the communication terminal 200 (input phrase 2110), the server 1620 extracts a location (for example, “Hokkaido”) associated with the utterance. . In this case, the topic estimation module 1690 extracts four output phrases 2120 whose place 2120 is “Hokkaido”.

［実施の形態の効果］
以上のようにして、本実施の形態に係る音声対話システムによれば、ユーザとコミュニケーション端末との対話が継続する場合に、直前の話題を理解するので、コミュニケーション端末２００は、ユーザの発話に応じて詳細な返答を行なうことができる。たとえば、京都を旅行するユーザがお土産を訪ねている場合に、コミュニケーション端末は、京都にちなんだお土産を返答することができる。このような応答ができるので、ユーザとコミュニケーション端末との対話がより自然な対話となる。 [Effect of the embodiment]
As described above, with the voice interaction system according to the present embodiment, the immediately preceding topic is understood when the dialogue between the user and the communication terminal continues, so the communication terminal 200 can give a detailed reply to the user's utterance. For example, when a user who is traveling to Kyoto asks about souvenirs, the communication terminal can reply with a souvenir associated with Kyoto. Because such a response is possible, the dialogue between the user and the communication terminal becomes more natural.

［第３の実施の形態］
以下、第３の実施の形態について説明する。本実施の形態に係る音声対話システムは、ユーザとの親密度に応じて音声出力される発話の語調が異なる点で、前述の実施の形態に係る音声対話システムと異なる。 [Third Embodiment]
The third embodiment will be described below. The spoken dialogue system according to the present embodiment is different from the spoken dialogue system according to the above-described embodiment in that the tone of the utterance output by voice is different according to the familiarity with the user.

＜親密度＞
まず、図２２を参照して、第３の実施の形態に係る音声対話システム２２００について説明する。図２２は、音声対話システム２２００の構成の一例を表わす図である。音声対話システム２２００は、コミュニケーション端末２００と、サーバ２２２０とを備える。サーバ２２２０は、サーバ２２０の構成に対して、対話分析部２５０に代えて対話分析部２２５０を、対話生成部２８０に代えて対話生成部２２８０を備える。対話分析部２２５０は、親密度算出モジュール２２５１を含む。その他の構成は、図２に示される構成と同様である。したがって同じ構成の説明は繰り返さない。 <Intimacy>
First, a voice interaction system 2200 according to the third embodiment will be described with reference to FIG. FIG. 22 is a diagram illustrating an example of the configuration of the voice interaction system 2200. The voice interaction system 2200 includes a communication terminal 200 and a server 2220. The server 2220 includes a dialog analysis unit 2250 instead of the dialog analysis unit 250 and a dialog generation unit 2280 instead of the dialog generation unit 280 with respect to the configuration of the server 220. The dialog analysis unit 2250 includes a closeness calculation module 2251. Other configurations are the same as those shown in FIG. Therefore, the description of the same configuration will not be repeated.

音声対話システム２２００によれば、ユーザとの対話数やその頻度に基づいて親密度が変化し、応答が変わる点で前述の各実施の形態に係る音声対話システムと異なる。たとえば、ユーザ１５０１が３ヶ月前に「おはよう」とコミュニケーション端末２００に対して発話する。このとき、音声対話システム２２００とユーザ１５０１との間はそれほど親密ではないため、サーバ２２２０は、予め保存されているデータに基づいて発話「おはよう」に対する応答を返信する。具体的には、コミュニケーション端末２００は、ユーザ１５０１に対して「おはようございます。今日もいい天気ですね」と丁寧な語調で発話する。 The voice interaction system 2200 is different from the voice interaction systems according to the above-described embodiments in that the intimacy changes based on the number of interactions with the user and the frequency thereof and the response changes. For example, the user 1501 utters “good morning” to the communication terminal 200 three months ago. At this time, since the voice interaction system 2200 and the user 1501 are not so intimate, the server 2220 returns a response to the utterance “good morning” based on the data stored in advance. Specifically, the communication terminal 200 utters in a polite tone to the user 1501 “Good morning. Good weather today”.

これに対し３ヶ月経過後の現時点でユーザ１５０１と音声対話システム２２００との間の親密度が増している場合、ユーザ１５０１が「おはよう」と同じ発話を行なった場合でも、コミュニケーション端末２００は、よりフランクな表現として「おはよう。今日もいい天気だからお出かけしてみたら？」と発話する。このように、同じユーザ１５０１による同じ発話（おはよう）に対するそれぞれの応答は、時間の経過によって変化し得る親密度によって応答内容が変わる。 In contrast, if the intimacy between the user 1501 and the voice interaction system 2200 has increased by the present time, three months later, then even when the user 1501 makes the same utterance, “Good morning”, the communication terminal 200 replies with a more casual expression: “Morning. The weather's nice again today, so why not go out?” In this way, the content of the response to the same utterance (“Good morning”) by the same user 1501 changes according to the intimacy, which can change over time.
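
A simple way to picture this behavior is a table of reply variants keyed by an intimacy threshold. The sketch below is an assumption-laden illustration: the thresholds 0 and 50, the table layout, and the function `reply_for` are hypothetical, and the two reply texts are loose renderings of the examples in the description.

```python
# Reply variants per utterance: (minimum intimacy, reply text).
# The threshold values are assumed for illustration only.
TONE_TABLE = {
    "good morning": [
        (0, "Good morning. The weather is nice again today, isn't it?"),
        (50, "Morning. The weather's nice again today, so why not go out?"),
    ],
}


def reply_for(utterance, intimacy, table=TONE_TABLE):
    """Pick the variant with the highest threshold not exceeding the intimacy."""
    variants = table.get(utterance)
    if not variants:
        return None
    chosen = None
    for threshold, text in sorted(variants):
        if intimacy >= threshold:
            chosen = text
    return chosen


polite = reply_for("good morning", 10)   # low intimacy, three months ago
casual = reply_for("good morning", 80)   # higher intimacy, present time
```

The same utterance maps to the polite variant at low intimacy and to the casual variant once the intimacy has passed the assumed threshold.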

＜親密度の算出方法＞
図２３を参照して、本実施の形態に係る音声対話システム２２００における親密度の算出方法について説明する。図２３は、親密度算出モジュール２２５１による親密度の算出を概念的に表わす図である。 <Calculation method of intimacy>
With reference to FIG. 23, a method of calculating the familiarity in voice interaction system 2200 according to the present embodiment will be described. FIG. 23 is a diagram conceptually showing the calculation of the familiarity by the familiarity calculating module 2251.

ケース（Ａ）は、ユーザが継続的に音声対話システムに話しかけることにより親密度が上昇する場合を表わす図である。すなわち、グラフ２３１０に示されるように、時間の経過とともにユーザ２０２がコミュニケーション端末２００に話しかけることにより、音声対話システム２２００とユーザ２０２との親密度は上昇する。上昇の程度は、ある局面において線形的（比例的）であるが、上昇の程度は必ずしもグラフ２３１０に見られる程度に限られない。たとえば、段階的に（ステップ関数的に）親密度が上昇してもよい。話しかけるフレーズ中に出現する単語が示す感情の程度に基づいて変化し得る。たとえば、出現する単語がネガティブなフレーズの場合には親密度は上昇しない。一方、出現する単語がポジティブなフレーズの場合には、そのポジティブ度に応じて親密度が上昇し得る。 Case (A) shows intimacy rising because the user keeps talking to the voice interaction system. That is, as shown in the graph 2310, the intimacy between the voice interaction system 2200 and the user 202 rises as the user 202 talks to the communication terminal 200 over time. In one aspect the rise is linear (proportional), but it is not necessarily limited to what is seen in the graph 2310. For example, the intimacy may rise in stages (as a step function). The rise may also vary with the degree of emotion indicated by the words that appear in the spoken phrase. For example, the intimacy does not rise when the words that appear form a negative phrase. Conversely, when the words that appear form a positive phrase, the intimacy may rise according to how positive the phrase is.

The system may also be configured so that the familiarity decreases by a fixed value per predetermined period, so that the familiarity declines when the user 202 does not use the voice interaction system 2200 continuously. In one aspect, the familiarity calculation module 2251 calculates the familiarity for each user using the dialogue history stored in the dialogue history storage unit 260. For example, the familiarity calculation module 2251 calculates each user's familiarity by successively adding a preset value as the increment of familiarity, or by subtracting that value in the case of a negative phrase.

Case (B) shows a mode in which the increase in familiarity is suppressed when the same user 202 repeats utterances having the same phrase. That is, as shown in graph 2320, when the user 202 says "Tell me the weather", the familiarity may at first increase by a predetermined fixed amount. However, if the user 202 only ever makes utterances having the same phrase, the voice interaction system 2200 suppresses the increase in familiarity for that user. More specifically, the familiarity calculation module 2251 refers to the data stored in the dialogue history storage unit 260 and determines whether the phrases (nouns) included in the utterances by the user 202 are the same. When the familiarity calculation module 2251 detects that the utterances by the user 202 are repetitions of utterances including the same phrase, it keeps the familiarity for the user 202 at a constant value.

Therefore, for example, if a user asks only for the weather forecast every day, or asks only about general topics such as news, the user's familiarity remains at a constant value as long as the phrases included in the user's utterances are the same.
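The update rules described for cases (A) and (B) can be illustrated with a minimal Python sketch. It is not the implementation of the familiarity calculation module 2251: the class name `FamiliarityTracker`, the increment `step`, the decay amount `decay`, the lower bound of zero, and the externally supplied `sentiment` label are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FamiliarityTracker:
    """Hypothetical per-user familiarity score (all values are assumptions)."""
    step: float = 1.0    # assumed preset increment per new phrase
    decay: float = 0.5   # assumed decrease per idle period
    score: float = 0.0
    seen_phrases: set = field(default_factory=set)

    def on_utterance(self, phrase: str, sentiment: str = "positive") -> float:
        if sentiment == "negative":
            # Negative phrase: subtract the preset value (floor at 0 is assumed).
            self.score = max(0.0, self.score - self.step)
        elif phrase in self.seen_phrases:
            # Case (B): a repeated phrase leaves the familiarity unchanged.
            pass
        else:
            # Case (A): a new, non-negative phrase adds the preset value.
            self.score += self.step
        self.seen_phrases.add(phrase)
        return self.score

    def on_idle_period(self) -> float:
        # Fixed decrease when the user has not used the system for one period.
        self.score = max(0.0, self.score - self.decay)
        return self.score
```

In this sketch a user who only ever asks about the weather stays at the score reached after the first request, which mirrors the constant familiarity described for case (B).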

<Changes in responses by the voice interaction system>
With reference to FIG. 24, changes in the responses of the voice interaction system 2200 will be described. FIG. 24 is a diagram illustrating how the response changes according to the familiarity.

As shown in case (A), the phrases of the dialogue DB unit 270 can change according to the familiarity. The dialogue DB unit 270 includes a table 2400. The table 2400 includes a user utterance 2410, a familiarity 2420, and a reply phrase 2430. The user utterance 2410 represents an utterance by the user to the voice interaction system 2200. The familiarity 2420 is divided into, for example, a plurality of levels. The reply phrase 2430 represents a phrase stored in advance for each familiarity level. For example, when the user utterance 2410 is "Good morning" and the familiarity 2420 is determined to be "high", the reply phrase 2430 is "Good morning. Let's do our best again today."
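A lookup of this kind can be sketched in Python as follows. Only the "high" row for 「おはよう」 comes from the description; the "low" row, the two-level split, the threshold value, and the names `REPLY_TABLE`, `familiarity_level`, and `reply_for` are hypothetical.

```python
from typing import Optional

# (user utterance, familiarity level) -> reply phrase, in the spirit of table 2400.
REPLY_TABLE = {
    ("おはよう", "high"): "おはよう。今日も元気に頑張ろう",
    ("おはよう", "low"): "おはようございます。",  # hypothetical row, not from the source
}


def familiarity_level(score: float, high_threshold: float = 10.0) -> str:
    """Map a numeric familiarity to a level (threshold is an assumption)."""
    return "high" if score >= high_threshold else "low"


def reply_for(utterance: str, score: float) -> Optional[str]:
    """Return the stored reply phrase, or None if the table has no entry."""
    return REPLY_TABLE.get((utterance, familiarity_level(score)))
```

Keying the table on the pair (utterance, level) keeps one stored phrase per familiarity level, as the description of the reply phrase 2430 suggests.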

Case (B) shows how the wording changes according to the familiarity. In one aspect, in response to a request for news from the user, the voice interaction system 2200 utters, according to the familiarity that has been set, 「政府が、南極大陸に日本の新たな基地建設を計画していることが明らかになったそうです。」 ("It has reportedly become clear that the government is planning to build a new Japanese base in Antarctica.") (utterance 2440). At this point the familiarity is set to low, so the tone of the utterance is relatively polite.

In one aspect, the dialogue generation unit 2280 changes the expression that conveys the utterance 2440, without changing its content, according to the familiarity of the user who is about to interact with the voice interaction system 2200. For example, when it is detected that the familiarity of the authenticated user is higher than a preset reference, the dialogue generation unit 2280 changes the phrase 「していることが」 in the utterance to the contracted, colloquial form 「してるってことが」. The dialogue generation unit 2280 also changes the phrase 「明らかになった」 ("has become clear") to 「わかった」 ("turned out"). That is, based on the determination that the familiarity is higher than the standard, the dialogue generation unit 2280 changes the expression that would be output by default into a plainer expression. Furthermore, the dialogue generation unit 2280 converts the polite sentence ending 「そうです」 to the casual ending 「そうだよ」. That is, the dialogue generation unit 2280 converts the phrase to be uttered from a polite tone to a plain tone.

When such conversion rules are defined in the dialogue generation unit 2280, the dialogue generation unit 2280 returns a phrase for high familiarity as the reply phrase 2150. That is, the dialogue generation unit 2280 converts the utterance into the phrase 「政府が、南極大陸に日本の新たな基地建設を計画してるってことがわかったそうだよ」 (roughly, "Apparently it turned out the government's planning to build a new Japanese base in Antarctica"). The communication terminal 200 can output the reply phrase 2450 generated by this conversion.
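The three phrase substitutions named above can be illustrated with a short Python sketch. The rule pairs are taken from the description; the function name `to_casual`, the numeric threshold, and the use of plain ordered string replacement are assumptions, not the method of the dialogue generation unit 2280.

```python
# Polite -> casual phrase pairs, as listed in the description.
CASUAL_RULES = [
    ("していることが", "してるってことが"),
    ("明らかになった", "わかった"),
    ("そうです", "そうだよ"),
]


def to_casual(text: str, familiarity: float, threshold: float = 10.0) -> str:
    """Rewrite the wording only when familiarity exceeds the (assumed) threshold."""
    if familiarity <= threshold:
        return text  # keep the default polite wording
    for polite, casual in CASUAL_RULES:
        text = text.replace(polite, casual)
    return text


polite = "政府が、南極大陸に日本の新たな基地建設を計画していることが明らかになったそうです。"
print(to_casual(polite, familiarity=12.0))
```

Because only the wording is substituted, the content of the utterance is unchanged, which matches the stated aim of changing expression without changing content.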

[Effect of the embodiment]
As described above, the voice interaction system 2200 according to the present embodiment calculates the familiarity from the history of dialogue with each user so far, and changes the tone of its utterances according to that familiarity. As a result, the user feels a greater sense of affinity for the communication terminal 200 that constitutes the voice interaction system 2200.

[Fourth Embodiment]
The first to third embodiments described above illustrate cases where the voice input/output function (communication terminal) and the utterance generation function (server) are realized by separate devices. However, the technical idea according to the present disclosure can also be realized by other device configurations. For example, the voice input/output function and the utterance generation function may be realized by a single device. For example, a device in which the communication terminal 200 and the server 220 shown in FIG. 2 are integrated may be realized as a voice interaction device.

With reference to FIG. 25, a voice interaction device 2500 according to the fourth embodiment will now be described. FIG. 25 is a block diagram showing an outline of the configuration of the voice interaction device 2500. The voice interaction device 2500 includes the components of the communication terminal 200 and the server 220 shown in FIG. 2. With this configuration, the voice interaction device 2500 can perform voice recognition, voice authentication, and utterance generation based on the user's utterance without using a communication line, which enables quick conversation that is unaffected by the communication line.
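As a rough illustration of a single device that runs all three stages locally, consider the following Python sketch. The class `VoiceInteractionDevice` and its three callables are hypothetical stand-ins for the recognition, authentication, and generation components; the lambdas are stubs, not real speech processing.

```python
from typing import Callable


class VoiceInteractionDevice:
    """Hypothetical single device: no network call between the stages."""

    def __init__(
        self,
        recognize: Callable[[bytes], str],      # voice recognition (stub)
        identify: Callable[[bytes], str],       # voice authentication (stub)
        generate: Callable[[str, str], str],    # utterance generation (stub)
    ) -> None:
        self.recognize = recognize
        self.identify = identify
        self.generate = generate

    def respond(self, audio: bytes) -> str:
        # Recognition and speaker identification both use the same input signal.
        text = self.recognize(audio)
        speaker = self.identify(audio)
        return self.generate(text, speaker)


# Stub components for demonstration only.
device = VoiceInteractionDevice(
    recognize=lambda audio: audio.decode("utf-8"),
    identify=lambda audio: "user-1",
    generate=lambda text, speaker: f"{speaker}: {text}",
)
```

Passing the components in as callables keeps the sketch independent of any particular recognition or synthesis library.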

[Fifth Embodiment]
In still another aspect, voice recognition and voice authentication may be performed by the communication terminal that accepts the utterance from the user. In this case, the communication terminal transmits the voice recognition result and the voice authentication result to the server. The server uses these results to generate a response to the utterance.

With reference to FIG. 26, the configuration of a voice interaction system 2600 according to this aspect will now be described. FIG. 26 is a block diagram showing an outline of the configuration of the voice interaction system 2600. The voice interaction system 2600 includes a communication terminal 2610 and a server 2620.

The communication terminal 2610 includes a voice recognition unit 240 in addition to the components of the communication terminal 200. The voice recognition unit 240 includes a voice recognition module 241 and a speaker identification module 242. The voice recognition module 241 performs recognition processing on the voice signal accepted by the voice input unit 210. The speaker identification module 242 identifies the speaker using the voice signal together with the voice data and user identification information registered in a memory (not shown) of the communication terminal 2610.

The voice content recognized by the voice recognition unit 240 and the identified user information are transmitted to the server 2620. Using the voice content and the user information, the server 2620 generates a response corresponding to the voice content while referring to that user's past dialogue history.
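The exchange between the terminal and the server in this embodiment might look like the following Python sketch. The JSON message layout, the field names `user_id` and `text`, and the functions `build_request` and `handle_request` are assumptions; the description does not specify a message format, and the reply text here is only a placeholder for real response generation.

```python
import json
from typing import Dict, List


def build_request(recognized_text: str, user_id: str) -> str:
    """Terminal side: package the recognition result and the identified user."""
    return json.dumps({"user_id": user_id, "text": recognized_text}, ensure_ascii=False)


def handle_request(payload: str, history: Dict[str, List[str]]) -> str:
    """Server side: consult the user's past dialogue, then record this turn."""
    msg = json.loads(payload)
    past = history.setdefault(msg["user_id"], [])
    reply = f"({len(past)} earlier turns) You said: {msg['text']}"  # placeholder reply
    past.append(msg["text"])
    return reply
```

Keeping the history keyed by user identifier is what lets the server tailor its response to the identified speaker rather than to the device as a whole.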

The other processing is the same as in the voice interaction systems according to the embodiments described above. Therefore, the detailed description will not be repeated.

[Sixth Embodiment]
The servers according to the first to third embodiments are configured to realize a voice recognition function, a dialogue generation function, and a voice synthesis function. In another aspect, each function may be realized in a separate computer device.

[Seventh Embodiment]
Each of the embodiments described above is illustrated as being realized by a processor (not shown) of a computer in the communication terminal or the server executing instructions included in a program stored in a memory. However, some or all of the functions of the communication terminal or the server according to the present embodiment may be realized by circuits or other hardware that implement those functions.

<Configuration>
The configurations based on the present disclosure can be summarized as follows.

[Configuration 1]
A voice interaction device (2500) comprising:
a voice recognition unit (241) configured to recognize an utterance;
a voice authentication unit (242) configured to identify a speaker based on the recognized utterance and pre-registered user information;
a topic estimation unit (280) configured to generate a topic in which the speaker is interested, based on the recognized utterance and the identified speaker; and
a voice output unit (211) configured to output the topic by voice.

[Configuration 2]
A voice interaction device (220) comprising:
a voice signal input unit configured to accept input of a voice signal based on an utterance;
a voice recognition unit (241) configured to recognize the utterance based on the input voice signal;
a voice authentication unit (242) configured to identify a speaker based on the input voice signal and pre-registered user information;
a topic estimation unit (280) configured to generate a topic in which the speaker is interested, based on the recognized utterance and the identified speaker; and
an output unit (230) configured to output a topic signal for outputting the topic by voice.

[Configuration 3]
A voice interaction system comprising:
a terminal (2610); and
a server (2620) capable of communicating with the terminal,
wherein the terminal includes:
a voice recognition unit (241) configured to accept an utterance and recognize the utterance;
a voice authentication unit (242) configured to identify a speaker based on the recognized utterance and pre-registered user information; and
a transmission unit configured to transmit, to the server, a voice signal based on the utterance and an identification signal of the identified speaker,
wherein the server includes:
a topic estimation unit configured to generate a topic in which the speaker is interested, based on the voice signal and the identification signal; and
a topic transmission unit configured to transmit, to the terminal, a topic signal for outputting the topic by voice, and
wherein the terminal further includes:
an output unit configured to output the topic by voice based on the topic signal received from the server.

[Configuration 4]
A terminal used in the voice interaction system according to Configuration 3, the terminal comprising:
a voice recognition unit configured to recognize an utterance;
a voice authentication unit configured to identify a speaker based on the recognized utterance and pre-registered user information;
a transmission unit configured to transmit, to a server, a voice signal based on the utterance and an identification signal of the identified speaker; and
an output unit configured to receive, from the server, a topic signal for outputting by voice a topic in which the speaker is interested, and to output the topic by voice.

[Configuration 5]
The voice interaction device according to Configuration 1 or 2, further comprising a storage unit configured to store a history of dialogue with each user of the voice interaction device,
wherein the topic estimation unit is configured to generate the topic based on the history of dialogue with the user.

[Configuration 6]
The voice interaction device according to Configuration 1 or 2, further comprising a familiarity calculation unit configured to calculate a familiarity between a user of the voice interaction device and the voice interaction device based on a history of dialogue with the user,
wherein the topic estimation unit is configured to adjust the tone of the topic according to the familiarity.

[Configuration 7]
A voice interaction method comprising the steps of:
recognizing an utterance;
identifying a speaker based on the recognized utterance and pre-registered user information;
generating a topic in which the speaker is interested, based on the recognized utterance and the identified speaker; and
outputting the topic by voice.

[Configuration 8]
A voice interaction method comprising the steps of:
accepting input of a voice signal based on an utterance;
recognizing the utterance based on the input voice signal;
identifying a speaker based on the input voice signal and pre-registered user information;
generating a topic in which the speaker is interested, based on the recognized utterance and the identified speaker; and
outputting a topic signal for outputting the topic by voice.

[Configuration 9]
A voice interaction method comprising the steps of:
recognizing an utterance;
identifying a speaker based on the recognized utterance and pre-registered user information;
transmitting, to a server, a voice signal based on the utterance and an identification signal of the identified speaker;
receiving, from the server, a topic signal for outputting by voice a topic in which the speaker is interested, the topic having been estimated based on the voice signal and the identification signal; and
outputting the topic by voice based on the topic signal.

[Configuration 10]
A program for causing a computer to function as a voice interaction device, the program causing one or more processors to execute the steps of:
recognizing an utterance;
identifying a speaker based on the recognized utterance and pre-registered user information;
generating, based on the recognized utterance and the identified speaker, a topic signal for outputting by voice a topic in which the speaker is interested; and
outputting the topic by voice based on the topic signal.

[Configuration 11]
A program for causing a computer to function as a voice interaction device, the program causing one or more processors to execute the steps of:
accepting input of a voice signal based on an utterance;
recognizing the utterance based on the input voice signal;
identifying a speaker based on the input voice signal and pre-registered user information;
generating a topic in which the speaker is interested, based on the recognized utterance and the identified speaker; and
outputting a topic signal for outputting the topic by voice.

[Configuration 12]
A program for causing a computer to function as a voice interaction device, the program causing one or more processors to execute the steps of:
recognizing an utterance;
identifying a speaker based on the recognized utterance and pre-registered user information;
transmitting, to a server, a voice signal based on the utterance and an identification signal of the identified speaker;
receiving, from the server, a topic signal for outputting by voice a topic in which the speaker is interested, the topic having been estimated based on the voice signal and the identification signal; and
outputting the topic by voice based on the topic signal.

[Configuration 13]
A voice interaction system comprising:
a terminal; and
a server,
wherein the terminal includes:
a voice recognition unit configured to recognize an utterance;
a voice signal conversion unit configured to convert the recognized utterance into an utterance signal; and
a transmission unit configured to transmit the utterance signal to the server,
wherein the server includes:
a voice recognition unit configured to recognize the utterance based on the utterance signal received from the terminal;
a voice authentication unit configured to identify a speaker based on the utterance signal and pre-registered user information;
a topic estimation unit configured to generate a topic in which the speaker is interested, based on the recognized utterance and the identified speaker; and
a transmission unit configured to transmit, to the terminal, a topic signal for outputting the topic by voice, and
wherein the terminal further includes:
a reception unit configured to receive the topic signal from the server; and
an output unit configured to output the topic by voice based on the topic signal.

[Configuration 14]
A terminal used in the system according to Configuration 13, the terminal comprising:
a voice recognition unit configured to recognize an utterance;
a voice signal conversion unit configured to convert the recognized utterance into an utterance signal;
a transmission unit configured to transmit the utterance signal to a server;
a reception unit configured to receive, from the server, a topic signal generated based on the utterance signal; and
an output unit configured to output, by voice, a topic corresponding to the utterance based on the topic signal.

The embodiments disclosed herein should be considered illustrative in all respects and not restrictive. The scope of the present invention is indicated by the claims rather than by the above description, and is intended to include all modifications within the meaning and scope equivalent to the claims.

110, 241: voice recognition module; 120: voice authentication module; 130, 1690: topic estimation module; 140: dialogue generation module; 200, 500, 1200, 2610: communication terminal; 210: voice input unit; 211: voice output unit; 220, 520, 820, 920, 1620, 2220, 2620: server; 230, 530: control unit; 240, 540: voice recognition unit; 242: speaker identification module; 250, 550, 850, 2250: dialogue analysis unit; 260, 560, 1660: dialogue history storage unit; 280, 580, 880, 980, 1680, 2280: dialogue generation unit; 290, 590: voice synthesis unit; 400: computer; 501: mobile phone; 510: wireless tag information transmission unit; 541: user identifier unit; 990: interest estimation module; 1110, 2110: input phrase; 1120, 1240: interest noun; 1130, 2120: output phrase; 1200, 1300, 1800, 1900, 2000, 2400: table; 1230, 1930: speaker; 1250, 1960: time stamp; 1501: user.

Claims

1. A voice interaction device comprising:
a voice recognition unit configured to recognize an utterance;
a voice authentication unit configured to identify a speaker based on the recognized utterance and pre-registered user information;
a topic estimation unit configured to generate a topic in which the speaker is interested, based on the recognized utterance and the identified speaker; and
a voice output unit configured to output the topic by voice.

2. A voice interaction device comprising:
a voice signal input unit configured to accept input of a voice signal based on an utterance;
a voice recognition unit configured to recognize the utterance based on the input voice signal;
a voice authentication unit configured to identify a speaker based on the input voice signal and pre-registered user information;
a topic estimation unit configured to generate a topic in which the speaker is interested, based on the recognized utterance and the identified speaker; and
an output unit configured to output a topic signal for outputting the topic by voice.

3. A voice interaction system comprising:
a terminal; and
a server capable of communicating with the terminal,
wherein the terminal includes:
a voice recognition unit configured to accept an utterance and recognize the utterance;
a voice authentication unit configured to identify a speaker based on the recognized utterance and pre-registered user information; and
a transmission unit configured to transmit, to the server, a voice signal based on the utterance and an identification signal of the identified speaker,
wherein the server includes:
a topic estimation unit configured to generate a topic in which the speaker is interested, based on the voice signal and the identification signal; and
a topic transmission unit configured to transmit, to the terminal, a topic signal for outputting the topic by voice, and
wherein the terminal further includes:
an output unit configured to output the topic by voice based on the topic signal received from the server.

4. A terminal used in the voice interaction system according to claim 3, the terminal comprising:
a voice recognition unit configured to recognize an utterance;
a voice authentication unit configured to identify a speaker based on the recognized utterance and pre-registered user information;
a transmission unit configured to transmit, to a server, a voice signal based on the utterance and an identification signal of the identified speaker; and
an output unit configured to receive, from the server, a topic signal for outputting by voice a topic in which the speaker is interested, and to output the topic by voice.

5. The voice interaction device according to claim 1, further comprising a storage unit configured to store a history of dialogue with each user of the voice interaction device,
wherein the topic estimation unit is configured to generate the topic based on the history of dialogue with the user.

6. The voice interaction device according to claim 1, further comprising a familiarity calculation unit configured to calculate a familiarity between a user of the voice interaction device and the voice interaction device based on a history of dialogue with the user,
wherein the topic estimation unit is configured to adjust the tone of the topic according to the familiarity.

7. A voice interaction method comprising the steps of:
recognizing an utterance;
identifying a speaker based on the recognized utterance and pre-registered user information;
generating a topic in which the speaker is interested, based on the recognized utterance and the identified speaker; and
outputting the topic by voice.

8. A voice interaction method comprising the steps of:
accepting input of a voice signal based on an utterance;
recognizing the utterance based on the input voice signal;
identifying a speaker based on the input voice signal and pre-registered user information;
generating a topic in which the speaker is interested, based on the recognized utterance and the identified speaker; and
outputting a topic signal for outputting the topic by voice.

9. A voice interaction method comprising the steps of:
recognizing an utterance;
identifying a speaker based on the recognized utterance and pre-registered user information;
transmitting, to a server, a voice signal based on the utterance and an identification signal of the identified speaker;
receiving, from the server, a topic signal for outputting by voice a topic in which the speaker is interested, the topic having been estimated based on the voice signal and the identification signal; and
outputting the topic by voice based on the topic signal.

10. A program for causing a computer to function as a voice interaction device, the program causing one or more processors to execute the steps of:
recognizing an utterance;
identifying a speaker based on the recognized utterance and pre-registered user information;
generating, based on the recognized utterance and the identified speaker, a topic signal for outputting by voice a topic in which the speaker is interested; and
outputting the topic by voice based on the topic signal.

11. A program for causing a computer to function as a voice interaction device, the program causing one or more processors to execute the steps of:
accepting input of a voice signal based on an utterance;
recognizing the utterance based on the input voice signal;
identifying a speaker based on the input voice signal and pre-registered user information;
generating a topic in which the speaker is interested, based on the recognized utterance and the identified speaker; and
outputting a topic signal for outputting the topic by voice.

12. A program for causing a computer to function as a voice interaction device, the program causing one or more processors to execute the steps of:
recognizing an utterance;
identifying a speaker based on the recognized utterance and pre-registered user information;
transmitting, to a server, a voice signal based on the utterance and an identification signal of the identified speaker;
receiving, from the server, a topic signal for outputting by voice a topic in which the speaker is interested, the topic having been estimated based on the voice signal and the identification signal; and
outputting the topic by voice based on the topic signal.

13. A voice interaction system comprising:
a terminal; and
a server,
wherein the terminal includes:
a voice recognition unit configured to recognize an utterance;
a voice signal conversion unit configured to convert the recognized utterance into an utterance signal; and
a transmission unit configured to transmit the utterance signal to the server,
wherein the server includes:
a voice recognition unit configured to recognize the utterance based on the utterance signal received from the terminal;
a voice authentication unit configured to identify a speaker based on the utterance signal and pre-registered user information;
a topic estimation unit configured to generate a topic in which the speaker is interested, based on the recognized utterance and the identified speaker; and
a transmission unit configured to transmit, to the terminal, a topic signal for outputting the topic by voice, and
wherein the terminal further includes:
a reception unit configured to receive the topic signal from the server; and
an output unit configured to output the topic by voice based on the topic signal.

14. A terminal used in the system according to claim 13, the terminal comprising:
a voice recognition unit configured to recognize an utterance;
a voice signal conversion unit configured to convert the recognized utterance into an utterance signal;
a transmission unit configured to transmit the utterance signal to a server;
a reception unit configured to receive, from the server, a topic signal generated based on the utterance signal; and
an output unit configured to output, by voice, a topic corresponding to the utterance based on the topic signal.