JP6166889B2

JP6166889B2 - Dialog support apparatus, dialog system, dialog support method and program

Info

Publication number: JP6166889B2
Application number: JP2012251337A
Authority: JP
Inventors: 俊治栗栖; 結旗柘植; 直樹橋田; 恭子増田
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2012-11-15
Filing date: 2012-11-15
Publication date: 2017-07-19
Anticipated expiration: 2032-11-15
Also published as: JP2014098844A

Description

The present invention relates to a so-called voice agent service.

A technique is known in which a user searches for information by inputting a question or the like by voice, and the search result is output as text or images (see, for example, Patent Document 1). In mobile terminals such as smartphones, a voice dialogue function has also been put into practical use that lets the user search for information or operate the terminal by voice and lets the terminal respond by voice. With such a voice dialogue function, the user can operate the mobile terminal as if speaking to a person.

JP 2006-209022 A

In a dialogue with a real person, a fixed response is expected in some cases, such as a search for information or an operation of a mobile terminal, but not in others. For example, in casual conversation such as small talk, the user does not necessarily expect a fixed response. Moreover, if only a fixed response is ever given to small talk, the response becomes mechanical and lacks a human touch. Therefore, merely retrieving an answer from an existing database, as in the technique described in Patent Document 1, may yield an appropriate answer to an information search but cannot yield an appropriate response to small talk or the like.
Accordingly, an object of the present invention is to provide a technique that can change the manner of response according to the user's intention.

The present invention provides a dialogue support apparatus comprising: an acquisition unit that acquires input information generated based on voice uttered by a user; an interpretation unit that interprets the intention of the user's utterance based on the input information acquired by the acquisition unit; and an output unit that, when the intention interpreted by the interpretation unit is a search for information, outputs response information corresponding to the result of a search executed according to that intention, and, when the intention interpreted by the interpretation unit is not a search for information, selectively outputs one of a plurality of pieces of response information prepared for that intention.

The output unit may be configured to refer to different information sources depending on whether or not the intention interpreted by the interpretation unit is a search for information.
The output unit may be configured to determine whether or not the intention interpreted by the interpretation unit is a search for information according to whether or not a predetermined character string is included.
The output unit may be configured to output, from among the plurality of pieces of response information, response information corresponding to a character selected by the user.
The output unit may be configured to output the response information together with additional information that is added to the response information and corresponds to the character selected by the user.
The output unit may be configured to selectively output one of a plurality of pieces of response information prepared in advance when the speaking user, the position of that user, or the date and time of the utterance satisfies a predetermined condition.

The present invention also provides a dialogue system comprising the dialogue support apparatus and a user terminal that picks up the user's voice and reproduces, at least by voice, a response corresponding to the response information output by the output unit.

The present invention also provides a dialogue support method comprising the steps of: acquiring input information generated based on voice uttered by a user; interpreting the intention of the user's utterance based on the acquired input information; and, when the interpreted intention is a search for information, outputting response information corresponding to the result of a search executed according to that intention, and, when the interpreted intention is not a search for information, selectively outputting one of a plurality of pieces of response information prepared for that intention.

The present invention also provides a program for causing a computer to execute the steps of: acquiring input information generated based on voice uttered by a user; interpreting the intention of the user's utterance based on the acquired input information; and, when the interpreted intention is a search for information, outputting response information corresponding to the result of a search executed according to that intention, and, when the interpreted intention is not a search for information, selectively outputting one of a plurality of pieces of response information prepared for that intention.

According to the present invention, the manner of response can be changed according to the user's intention.

Block diagram showing the overall configuration of the dialogue system
Block diagram showing the hardware configuration of the user terminal
Block diagram showing the hardware configuration of the speech recognition server, intention interpretation server, and chat management server
Block diagram showing the functional configuration of the intention interpretation server
Diagram illustrating the data structure of chat data
Sequence chart showing the operation until a response to a voice input is obtained
Flowchart showing the intention interpretation process
Diagram illustrating an output result on the user terminal
Diagram illustrating an output result on the user terminal
Diagram illustrating response information and additional information

[Embodiment]
FIG. 1 is a block diagram showing the overall configuration of a dialogue system 10 according to an embodiment of the present invention. The dialogue system 10 is a system for providing a so-called voice agent service and includes a user terminal 100, a speech recognition server 200A, an intention interpretation server 200B, a chat management server 200C, and a speech synthesis server 200D, which are interconnected via a network 300. Although not illustrated here, there are in practice as many user terminals 100 as there are users of this service. The network 300 is a composite network that combines a mobile communication network and the Internet.

Here, the voice agent service refers to a service that mediates, by voice, the user's various operations on the user terminal 100. With this service, a user who wants to search for information, for example, can obtain the desired result simply by speaking to the user terminal 100 in natural language, without typing text into the user terminal 100 by hand.

The voice agent service of this embodiment has a first feature in that the service is provided through a character. The character here is modeled on a cartoon character, a celebrity, an animal, or the like, and is displayed as an animation using, for example, three-dimensional CG (computer graphics). A variety of such characters are provided by content providers, and each user can select and use a desired character, either for a fee or free of charge. The characters differ from one another in voice (tone of voice) and manner of speaking. In the service of this embodiment, the user uses a favorite character and speaks to it (as if conversing with a real person); it is expected that the user will thereby become attached to a particular character and that use of the service will be promoted.

Furthermore, the voice agent service of this embodiment has a second feature in that the user can engage in dialogue other than information search. Here, dialogue other than information search refers to so-called small talk or chit-chat. That is, unlike a so-called web search, in which the user seeks a specific result, the dialogue referred to here is the kind of exchange carried out as conversation or to pass the time in communication with a real person.

The user terminal 100 is a communication terminal that the user uses to access this service. Here, the user terminal 100 is assumed to be a smartphone. The speech recognition server 200A is a server device for converting voice data, generated based on voice uttered by the user, into text data. The intention interpretation server 200B is a server device that interprets the user's intention based on the text data and transmits response information corresponding to that intention to the user terminal 100. The chat management server 200C is a server device that manages chat data for each character and transmits the chat data to the intention interpretation server 200B at predetermined timings. The speech synthesis server 200D is a server device that stores voice data for each character (hereinafter referred to as "synthesized voice data") and transmits it to the user terminal 100 as necessary. The synthesized voice data is provided by the content provider, and each item is associated with a character ID and a response text, both described later.

FIG. 2 is a block diagram showing the hardware configuration of the user terminal 100. The user terminal 100 includes at least a control unit 110, a storage unit 120, a communication unit 130, a display unit 140, a voice input unit 150, and a voice output unit 160. The user terminal 100 may also include a keypad, such as a numeric keypad, or a vibrator.

The control unit 110 is means for controlling the operation of each unit of the user terminal 100. The control unit 110 includes an arithmetic processing device such as a CPU (Central Processing Unit) and a memory, and performs control by executing a predetermined program.
The storage unit 120 is means for storing data. The storage unit 120 includes a recording medium such as a flash memory and stores data that the control unit 110 uses for control.

The communication unit 130 is means for transmitting and receiving data via the network 300. The communication unit 130 includes an antenna, a modem compatible with the communication scheme of the network 300, and the like, and executes processing required for data communication, such as modulation and demodulation of data.
The display unit 140 is means for displaying images. The display unit 140 includes a display panel (that is, a display area) composed of liquid crystal elements or organic EL (electroluminescence) elements and a drive circuit that drives the panel, and displays an image corresponding to image data.

The voice input unit 150 is means for inputting voice. The voice input unit 150 includes a microphone, or an input terminal for connecting one, and supplies voice data to the control unit 110. The voice output unit 160, in turn, is means for outputting voice. The voice output unit 160 includes a speaker, or an output terminal for connecting one, and outputs voice corresponding to the voice information supplied from the control unit 110. The voice input unit 150 and the voice output unit 160 may be configured to connect wirelessly to a headset or the like.

FIG. 3 is a block diagram showing the hardware configuration of the speech recognition server 200A, the intention interpretation server 200B, the chat management server 200C, and the speech synthesis server 200D. Although these server devices differ in the data they store and the processing they execute, their main hardware configuration is common. For this reason, the speech recognition server 200A, the intention interpretation server 200B, the chat management server 200C, and the speech synthesis server 200D are collectively referred to here as the "server device 200", and the main hardware configuration common to them is described.

The server device 200 includes a control unit 210, a storage unit 220, and a communication unit 230. The control unit 210 is means for controlling the operation of each unit of the server device 200. The control unit 210 includes an arithmetic processing device and a memory, and performs control by executing a predetermined program. The storage unit 220 is means for storing data and includes a recording medium such as a hard disk. The communication unit 230 is means for transmitting and receiving data via the network 300.

FIG. 4 is a block diagram showing the functional configuration of the intention interpretation server 200B. By executing a predetermined program, the control unit 210 of the intention interpretation server 200B realizes functions corresponding to an acquisition unit 211, an interpretation unit 212, and an output unit 213. These functions realized by the control unit 210 correspond to the dialogue support apparatus according to the present invention.

The acquisition unit 211 is means for acquiring input information. Here, input information is a general term for information that the user inputs by voice. Input information includes the voice data generated in the user terminal 100 and the text data obtained by converting that voice data. In this embodiment, the acquisition unit 211 acquires the input information after the speech recognition server 200A has converted it from voice data into text data.

The interpretation unit 212 is means for interpreting the intention of the user's utterance based on the input information acquired by the acquisition unit 211. The interpretation unit 212 performs morphological analysis or the like on the text data to identify the words it contains, and further analyzes its syntax and the like to interpret the intention of the user's utterance. In this embodiment, the interpretation unit 212 determines whether the user's intention is "information search" or "chat". This determination is made using chat data.

The output unit 213 is means for outputting response information corresponding to the result of interpretation by the interpretation unit 212. The output unit 213 outputs different response information depending on whether the user's intention interpreted by the interpretation unit 212 is a search for information or a chat. When the interpreted intention is a search for information, the output unit 213 executes a search using an external search engine or database, receives the search result, and outputs response information corresponding to that result. When the interpreted intention is a chat, on the other hand, the output unit 213 selectively extracts a suitable item from the plurality of chat data stored in the storage unit 220 and outputs response information corresponding to it. That is, the output unit 213 refers to different information sources according to the user's intention interpreted by the interpretation unit 212.
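The switch between information sources described above can be pictured with the following minimal sketch. The function name `build_response`, the intent labels `"search"` and `"chat"`, and the use of a random choice among prepared responses are illustrative assumptions; the embodiment does not prescribe how one of the prepared responses is selected.

```python
import random
from typing import Callable, Sequence


def build_response(
    intent: str,
    text: str,
    run_search: Callable[[str], str],
    chat_responses: Sequence[str],
    rng: random.Random,
) -> str:
    """Return response text from a source chosen by the interpreted intent."""
    if intent == "search":
        # Source 1: an external search engine or database (passed in as a callable).
        return run_search(text)
    # Source 2: locally stored chat data; one of several prepared responses is picked.
    # Random selection is an assumption made for this sketch.
    return rng.choice(list(chat_responses))
```

Passing the search backend in as a callable keeps the sketch independent of any particular search engine.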

FIG. 5 is a diagram illustrating the data structure of the chat data. The chat data of this embodiment includes items such as "registered character string", "time", "prohibited term flag", "character ID", "response text", and "action ID". Of these items, the content provider can register "prohibited term flag", "character ID", "response text", and "action ID". The other items are determined in advance by the service provider of this service (for example, a telecommunications carrier).

The "registered character string" is a character string used to determine whether or not the intention of the user's utterance is a chat. When the text data contains a registered character string, the intention interpretation server 200B can determine that the intention of the user's utterance is a chat. Registered character strings are, for example, greetings such as "Good morning" and "Hello".

"Time" is data indicating when the user spoke. This item is provided so that the chat response can be varied according to the time of day at which the user spoke, by describing a time slot such as "morning", "daytime", or "night". In this way, for example, the specific content of the response can differ between the case where the user says "Good morning" in the morning and the case where the user says "Good morning" in the daytime.
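As a rough illustration of the "time" item, the sketch below maps an utterance time to a time slot and looks up a response keyed by registered character string and slot. The hour boundaries, the `RESPONSES` table, and its response texts are all hypothetical; the embodiment only names the slots "morning", "daytime", and "night".

```python
from datetime import datetime
from typing import Optional


def time_slot(spoken_at: datetime) -> str:
    """Map an utterance time to a coarse time slot (hour boundaries are assumed)."""
    if 5 <= spoken_at.hour < 11:
        return "morning"
    if 11 <= spoken_at.hour < 18:
        return "daytime"
    return "night"


# Hypothetical responses keyed by (registered character string, time slot).
RESPONSES = {
    ("good morning", "morning"): "Good morning!",
    ("good morning", "daytime"): "Morning? It is already past noon.",
}


def respond(registered_string: str, spoken_at: datetime) -> Optional[str]:
    """Return the response for this string and time slot, or None if none is registered."""
    return RESPONSES.get((registered_string, time_slot(spoken_at)))
```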

The "prohibited term flag" is data for specifying a uniform response to text data that contains a prohibited term. When this item is used, whether or not the text data contains a prohibited term is determined separately from the determination of the intention of the user's utterance. When the text data contains a prohibited term, the intention interpretation server 200B extracts the chat data whose prohibited term flag is "1". Here, prohibited terms are, for example, obscene or violent expressions, but they can be set freely by the content provider or the service provider. If the user's age can be identified, the prohibited terms may differ according to age.
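The prohibited-term check can be sketched as follows. The dictionary keys, the placeholder term list, and the behaviour of the non-prohibited branch (substring match on the registered character string) are assumptions made for illustration only.

```python
from typing import Dict, Iterable, List


def contains_prohibited(text: str, prohibited_terms: Iterable[str]) -> bool:
    """Check for prohibited terms, independently of the intent decision."""
    return any(term in text for term in prohibited_terms)


def candidate_entries(
    text: str,
    entries: List[Dict[str, object]],
    prohibited_terms: Iterable[str],
) -> List[Dict[str, object]]:
    """Return chat entries eligible as a response to the given text."""
    if contains_prohibited(text, prohibited_terms):
        # A prohibited term was found: only entries flagged "1" are used.
        return [e for e in entries if e["prohibited_flag"] == 1]
    # Otherwise (assumed behaviour): match on the registered character string.
    return [
        e for e in entries
        if e["prohibited_flag"] == 0 and str(e["registered_string"]) in text
    ]
```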

The "character ID" is an ID for identifying the character that the user is using on the user terminal 100. The "response text" is text data representing the response given by that character. The "action ID" is an ID for identifying the character's action when responding (body movement, facial expression, and so on). For each character, the content provider prepares in advance animation data for displaying the animation corresponding to each action ID.
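One way to picture a chat data record with the items listed for FIG. 5 is the sketch below. The class name `ChatEntry`, the field names and types, and the sample values are hypothetical; the embodiment names the items but does not define their encoding.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChatEntry:
    registered_string: str      # string that marks an utterance as chat
    time_slot: Optional[str]    # "morning", "daytime", "night", or None (assumed: any time)
    prohibited_flag: int        # 1 = uniform response to prohibited terms
    character_id: str           # character selected by the user
    response_text: str          # what the character says
    action_id: str              # animation to play while responding


# Illustrative records only; real data would be registered by content providers.
CHAT_DATA = [
    ChatEntry("good morning", "morning", 0, "char-001", "Good morning!", "act-wave"),
    ChatEntry("good morning", "daytime", 0, "char-001", "It is already noon.", "act-laugh"),
]
```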

The chat management server 200C accepts registration of such chat data from content providers. The chat management server 200C transmits the chat data to the intention interpretation server 200B at predetermined timings (every hour, once a day, and so on). On receiving new chat data from the chat management server 200C, the intention interpretation server 200B updates the chat data stored in itself.

The configuration of the dialogue system 10 is as described above. With this configuration, the user terminal 100 displays the character on a standby screen or the like and waits for voice input from the user. The user speaks to the user terminal 100 on which the character is displayed, as if talking to the character, and thereby inputs voice. When the user inputs voice, the user terminal 100, the speech recognition server 200A, and the intention interpretation server 200B cooperate to execute the following processing.

FIG. 6 is a sequence chart showing the operation until a response to the user's voice input is obtained. This operation starts when the user inputs voice to the user terminal 100. The user terminal 100 picks up the user's voice to generate voice data (step S1) and transmits it to the speech recognition server 200A (step S2). The speech recognition server 200A receives the voice data and executes speech recognition to convert it into text data (data that can be described with character codes) (step S3). The speech recognition server 200A returns the text data to the user terminal 100 (step S4).

On receiving the text data, the user terminal 100 forwards it to the intention interpretation server 200B (step S5). At this time, the user terminal 100 also transmits the character ID of the character displayed on the display unit 140, that is, the character selected by the user. The intention interpretation server 200B receives the text data as input information and executes processing to obtain the necessary response information by interpreting the intention of the voice uttered by the user (step S6). This processing is hereinafter referred to as the "intention interpretation process".

The intention interpretation server 200B obtains response information by executing the intention interpretation process. The intention interpretation server 200B transmits the response information obtained by the intention interpretation process to the user terminal 100 (step S7). The response information includes at least a response text and an action ID.

On receiving the response information, the user terminal 100 transmits the response text included in it to the speech synthesis server 200D (step S8). At this time, the user terminal 100 also transmits the same character ID as the one transmitted in step S5. The speech synthesis server 200D can uniquely identify the synthesized voice data based on the received character ID and response text. The speech synthesis server 200D transmits the identified synthesized voice data to the user terminal 100 (step S9). The user terminal 100 then responds to the user's voice input by reproducing the voice and image corresponding to the response information (step S10).
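Seen from the user terminal, steps S1 to S10 amount to three round trips in sequence. The sketch below expresses that flow with the three servers replaced by plain callables; the function name `handle_utterance`, the callable signatures, and the dictionary keys are assumptions for illustration and do not reflect any actual network protocol.

```python
from typing import Callable, Dict, Tuple


def handle_utterance(
    audio: bytes,
    character_id: str,
    recognize: Callable[[bytes], str],
    interpret: Callable[[str, str], Dict[str, str]],
    synthesize: Callable[[str, str], bytes],
) -> Tuple[Dict[str, str], bytes]:
    """Run one utterance through recognition, interpretation, and synthesis."""
    # S2 to S4: send voice data to the speech recognition server, get text back.
    text = recognize(audio)
    # S5 to S7: forward text plus character ID, receive response information.
    response = interpret(text, character_id)
    # S8 to S9: look up synthesized voice by character ID and response text.
    speech = synthesize(character_id, response["response_text"])
    # S10: the terminal would now play `speech` and the animation for the action ID.
    return response, speech
```

Note that the terminal relays the text between servers itself, as in the sequence chart, rather than having the servers call one another.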

FIG. 7 is a flowchart showing the intention interpretation process of step S6 in more detail. The intention interpretation server 200B (more precisely, its control unit 210) first analyzes the text data by morphological analysis or the like and interprets the intention of the user's utterance (step S61). Well-known language analysis techniques may be used as appropriate for the processing of step S61. The intention interpretation server 200B may also have a dedicated language analysis device execute part or all of this processing instead of executing it itself. In short, it suffices for the intention interpretation server 200B to obtain the result of interpreting the text data; the specific method is not limited.

Next, the intention interpretation server 200B determines whether or not the user's intention indicated by the text data is a search for information (step S62). Here, the intention interpretation server 200B can determine whether or not the user's intention is a chat by referring to the chat data stored in itself and checking whether the text data contains a registered character string. The intention interpretation server 200B may also combine the registered character strings with other cues (such as syntax) to determine whether or not the utterance is a chat. For example, when the text data contains a predetermined character string that signifies a request, such as "tell me" or "look up", the intention interpretation server 200B may determine that the user's intention is a search for information. In other words, checking whether the text data contains a registered character string is merely one specific example of the processing of step S62.
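A minimal sketch of the string-based decision in step S62 follows. The word lists, the rule that a request marker takes priority over a registered character string, and the fallback to "search" when neither matches are assumptions; the embodiment leaves these details open and notes that substring matching is only one example.

```python
# Hypothetical word lists; real registered strings come from the chat data.
REGISTERED_STRINGS = ("good morning", "hello", "how are you")
REQUEST_MARKERS = ("tell me", "look up")


def interpret_intent(text: str) -> str:
    """Classify an utterance as "search" or "chat" by substring matching."""
    normalized = text.lower()
    # A request expression is treated as a search, even if a greeting is also present.
    if any(marker in normalized for marker in REQUEST_MARKERS):
        return "search"
    if any(registered in normalized for registered in REGISTERED_STRINGS):
        return "chat"
    # Neither cue matched: fall back to search (an assumption of this sketch).
    return "search"
```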

The intention interpretation server 200B varies the subsequent processing according to the result of the determination in step S62. First, when the user's intention indicated by the text data is a search for information, the intention interpretation server 200B executes a search using character strings contained in the text data (step S63) and receives the search result (step S64). At this time, the intention interpretation server 200B may obtain the search result from an unspecified large number of web pages, as a so-called search engine does, or may obtain it using a database specialized in information from a particular field. For example, if the user says "Tell me an Italian restaurant in Tokyo", the intention interpretation server 200B may execute a web search with the search terms "Tokyo" and "Italian", or may set search conditions such as "Tokyo" as the place and "Italian" as the genre and access the database of a service provider that offers information on restaurants throughout the country.
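The two search styles in the restaurant example (free keywords for a web search versus structured conditions for a specialized database) could be derived from one utterance roughly as follows. The vocabularies `KNOWN_PLACES` and `KNOWN_GENRES`, the whitespace tokenization, and the returned dictionary layout are all hypothetical simplifications; actual processing would rely on the language analysis of step S61.

```python
from typing import Dict, List, Optional

# Hypothetical vocabularies standing in for real place and genre dictionaries.
KNOWN_PLACES = {"tokyo"}
KNOWN_GENRES = {"italian"}


def to_search_request(text: str) -> Dict[str, object]:
    """Derive web-search keywords and structured conditions from an utterance."""
    words = [w.strip(".,?!").lower() for w in text.split()]
    place: Optional[str] = next((w for w in words if w in KNOWN_PLACES), None)
    genre: Optional[str] = next((w for w in words if w in KNOWN_GENRES), None)
    keywords: List[str] = [w for w in (place, genre) if w is not None]
    return {
        "keywords": keywords,                             # for a general web search
        "conditions": {"place": place, "genre": genre},   # for a specialized database
    }
```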

On the other hand, when the user's intention indicated by the text data is not a search for information but chat, the intention interpretation server 200B refers to the chat data and extracts the chat data (in particular, the "response text" and the "action ID") corresponding to the text data and the character ID transmitted from the user terminal 100 (step S65).

When these processes are finished, the intention interpretation server 200B generates response information (step S66). In step S66, the intention interpretation server 200B generates, as response information, data for reproducing audio and images on the user terminal 100, based on the chat data or the search result. At this time, the intention interpretation server 200B may process the chat data or the search result.
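Steps S65 and S66 for the chat case can be sketched as a lookup keyed by the registered character string and the character ID. The table contents, the identifiers `char_a` and `char_b`, the action IDs, and the layout of the returned dictionary are all hypothetical and serve only to illustrate the flow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChatEntry:
    response_text: str  # the "response text" of the chat data
    action_id: str      # the "action ID" of the chat data


# Hypothetical chat data: (registered character string, character ID) -> entry.
CHAT_DATA = {
    ("how are you", "char_a"): ChatEntry("I'm fine!", "act_smile"),
    ("how are you", "char_b"): ChatEntry("I am fine. Thank you.", "act_bow"),
}


def extract_chat_entry(text: str, character_id: str) -> Optional[ChatEntry]:
    """Step S65: find the entry matching the utterance and the selected character."""
    normalized = text.lower()
    for (registered, cid), entry in CHAT_DATA.items():
        if cid == character_id and registered in normalized:
            return entry
    return None


def generate_response_info(entry: ChatEntry) -> dict:
    """Step S66: package the data the terminal needs to play audio and show an image."""
    return {"speech_text": entry.response_text, "action_id": entry.action_id}
```

The same utterance therefore produces a different response for each character, because the character ID is part of the lookup key.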

FIGS. 8 and 9 illustrate output results on the user terminal 100. FIG. 8 shows the case where the user utters the phrase "Tell me an Italian restaurant in Tokyo" as a question (request). FIG. 9, on the other hand, shows the case where the user utters the phrase "How are you?" as chat. When the user's intention is a search for information, the intention interpretation server 200B generates response information that announces by voice that a search was executed, together with the search conditions (search terms) used, and displays information corresponding to the search result. When the user's intention is chat, on the other hand, the intention interpretation server 200B does not execute a web search with the search term "genki" ("fine"); instead, it generates response information for reproducing the audio and image associated with the registered character string "genki."

As described above, the dialogue system 10 of this embodiment can change the way it responds according to the intention of the utterance as interpreted by the intention interpretation server 200B. This allows the user to use voice input for purposes other than information retrieval, and keeps the responses natural, so that the user feels as if conversing with a real person.

Furthermore, according to this embodiment, the user obtains a consistent response when searching for information regardless of the selected character, but obtains a response that differs from character to character when chatting. This makes the choice of character more entertaining and can arouse the user's interest, which is expected to motivate the user to use the service. In addition, because responses differ from character to character, the user can selectively use a favorite character, and attachment to a specific character can also be expected to grow.

[Modifications]
The present invention is not limited to the embodiment described above and can be carried out in other forms. The following are examples of other forms of the present invention. These modifications may be combined with one another as needed.

(1) A response in the present invention may consist of response information and other information. Here, the other information is information appended to the response information, for example, information representing wording that changes the tone of speech, such as a sentence ending. Such information is hereinafter referred to as "additional information."

FIG. 10 illustrates response information (response text) and additional information. Here, the response information consists only of the noun "元気" (genki, "fine"). The additional information represents a phrase that follows this noun; here, several pieces of additional information are assumed to exist, such as "(元気)だよ。" (casual, "I'm fine!"), "(元気)ですよ。" (polite, "I am fine."), and "(元気)です。ありがとう。" ("I am fine. Thank you."). The additional information is assumed to differ from character to character. The response information may or may not differ from character to character.

In this way, the additional information can add variation to the response. Moreover, by making both the response information and the additional information differ from character to character, a usage is conceivable in which, for example, the response information varies the gist of the response while the additional information varies subtle nuances such as tone.
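Combining response information with per-character additional information can be sketched as a simple concatenation. The strings below are the Japanese examples given for FIG. 10; the character IDs and the names `RESPONSE_INFO`, `ADDITIONAL_INFO`, and `compose_response` are hypothetical, as is the assignment of each suffix to a particular character.

```python
# Response information shared by all characters in this sketch ("genki", "fine").
RESPONSE_INFO = "元気"

# Additional information per character: the phrase that follows the noun.
# The assignment of suffixes to character IDs is an assumption for illustration.
ADDITIONAL_INFO = {
    "char_a": "だよ。",
    "char_b": "ですよ。",
    "char_c": "です。ありがとう。",
}


def compose_response(character_id: str) -> str:
    """Append the selected character's additional information to the response information."""
    # Assumption: a character without additional information gets the bare noun.
    return RESPONSE_INFO + ADDITIONAL_INFO.get(character_id, "")
```

Keeping the two parts separate means a new character only needs a new suffix entry, while the gist of the response stays in one place.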

(2) When the user utters a specific term or asks a specific question, the speech synthesis server 200D may transmit music data to the user terminal 100 instead of synthesized speech data. For example, when the user says "Sing," the speech synthesis server 200D transmits the music data of a predetermined song as the corresponding response. This music data may differ depending on the date and time.

Furthermore, the characters that can sing in this way may be limited to some of the characters. In this case, for a character that cannot sing (in other words, one that is not set up to sing), the speech synthesis server 200D may transmit suitable synthesized speech data (for example, "I don't quite understand" or "I can't sing") in response to the user's request to "sing." This makes the individuality of each character stand out more.
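The behavior of Modification 2 can be sketched as a choice between music data and a spoken refusal. The set of singing characters, the track names, and the date rule are all hypothetical; the text only states that the music data may depend on the date and time and that singing may be limited to some characters.

```python
from datetime import date

# Hypothetical set of characters that are set up to sing.
SINGING_CHARACTERS = {"char_a"}


def select_track(today: date) -> str:
    """Pick a track by date (the specific rule here is an assumption)."""
    if (today.month, today.day) == (12, 25):
        return "holiday_song"
    return "default_song"


def respond_to_sing_request(character_id: str, today: date) -> tuple:
    """Return ("music", track) for singing characters, otherwise a spoken refusal."""
    if character_id not in SINGING_CHARACTERS:
        return ("speech", "I can't sing.")
    return ("music", select_track(today))
```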

(3) The intention interpretation server 200B may output predetermined response information without searching for information when a predetermined condition is satisfied. For example, the intention interpretation server 200B may output predetermined response information regardless of the user's intention when the speaking user is not a specific user (for example, the owner of the user terminal 100), when the position of the user terminal 100 is not a predetermined position (for example, a position set in advance by the user), or when the utterance date and time falls within a predetermined period (for example, late at night). In this case, the intention interpretation server 200B is configured to be able to acquire the user recognition result, position information indicating the position of the user terminal 100, time information indicating the utterance date and time, and the like. This makes it possible, for example, to prevent unexpected situations such as a search for information being performed by an unintended third party.

The user may be identified by using a well-known speaker recognition technique or by registering the voice of the user (owner) in advance. The position of the user terminal 100 may be determined with a well-known technique capable of acquiring position information, such as GPS (Global Positioning System) positioning. The character's response when the predetermined condition is satisfied is, for example, "I don't quite understand" or "I can't answer that," and it may differ from character to character.

The intention interpretation server 200B may determine whether the predetermined condition is satisfied either before or after the intention interpretation process. When the determination is made before the intention interpretation process, the intention interpretation server 200B may skip the intention interpretation process once it determines that the predetermined condition is satisfied. When the intention interpretation process is executed first, the intention interpretation server 200B need only make the determination when it has determined that the user's intention is a search for information.
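The two orderings of the condition check described for Modification 3 can be sketched as follows. Everything here is illustrative: the `RequestContext` fields, the 23:00 to 05:00 definition of late night, the fixed response text, and the stand-in `looks_like_search` function are assumptions, and the return values "search" and "chat" merely label which branch would run.

```python
from dataclasses import dataclass
from datetime import datetime, time

FIXED_RESPONSE = "I can't answer that."


@dataclass
class RequestContext:
    speaker_is_owner: bool        # result of speaker recognition
    at_registered_position: bool  # derived from position information
    spoken_at: datetime           # utterance date and time


def fixed_response_required(ctx: RequestContext) -> bool:
    """True when any predetermined condition for the fixed response holds."""
    t = ctx.spoken_at.time()
    late_night = t >= time(23, 0) or t < time(5, 0)  # assumed definition
    return (not ctx.speaker_is_owner) or (not ctx.at_registered_position) or late_night


def looks_like_search(text: str) -> bool:
    """Stand-in for the intention interpretation process."""
    return "tell me" in text.lower()


def respond(text: str, ctx: RequestContext, check_first: bool = True) -> str:
    if check_first:
        # Check before interpretation: interpretation is skipped entirely.
        if fixed_response_required(ctx):
            return FIXED_RESPONSE
        return "search" if looks_like_search(text) else "chat"
    # Interpretation first: the check is made only for information searches.
    if looks_like_search(text):
        return FIXED_RESPONSE if fixed_response_required(ctx) else "search"
    return "chat"
```

Note the difference in behavior: with the check first, a non-owner receives the fixed response even for chat, whereas with interpretation first, chat is still answered and only searches are blocked.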

Furthermore, this modification is also applicable to the case where, as in Modification 2 described above, the speech synthesis server 200D transmits music data in response to the user's request. For example, when the position of the user terminal 100 is not the predetermined position, the intention interpretation server 200B can cause synthesized speech data saying "I can't sing here" to be transmitted instead of the music data, and when the utterance date and time is late at night, it can cause synthesized speech data saying "I can't sing now" to be transmitted instead of the music data.
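Applying the position and time conditions to a request to sing can be sketched as below. The function name, the track name, and the 23:00 to 05:00 window for late night are assumptions; the two refusal messages are the examples given in the text.

```python
from datetime import datetime


def sing_or_refuse(at_allowed_position: bool, spoken_at: datetime) -> tuple:
    """Return music data, or a spoken refusal that names the reason."""
    if not at_allowed_position:
        return ("speech", "I can't sing here.")
    # Assumed definition of late night: 23:00 to 05:00.
    if spoken_at.hour >= 23 or spoken_at.hour < 5:
        return ("speech", "I can't sing now.")
    return ("music", "default_song")
```

Checking the position before the time is an arbitrary choice in this sketch; the text does not say which refusal takes priority when both conditions apply.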

(4) The present invention does not require that both the input from the user and the output to the user be audio. That is, the output to the user (the response to the input) may consist only of image reproduction (display) without audio reproduction. In this case, a component corresponding to the speech synthesis server 200D is unnecessary. Alternatively, the output to the user may be accompanied by tactile feedback, such as vibrating a vibrator.

(5) In the present invention, voice input may be used for responses other than information search and chat. For example, in addition to information search and chat, the present invention may use voice input to operate the user terminal 100. The operations referred to here include, for example, switching a specific function of the user terminal 100 on or off and executing a specific program.

(6) The specific configuration for carrying out the present invention is not limited to the form of the embodiment described above. For example, speech recognition may be executed in the user terminal 100 or the intention interpretation server 200B instead of the speech recognition server 200A. The synthesized speech data may be transmitted from the intention interpretation server 200B (in step S7) rather than from the speech synthesis server 200D. Furthermore, some or all of the functions of the dialogue support apparatus according to the present invention may be provided in the user terminal 100, or may be distributed across a plurality of server apparatuses.

(7) Various terminals other than smartphones may also serve as the user terminal of the present invention. For example, the present invention is applicable to so-called tablet PCs (personal computers), game consoles, music players, and the like. The user terminal of the present invention is also not limited to a wireless communication terminal and may be a desktop PC with a wired connection through a modem or the like.

The present invention may be specified not only as a dialogue support apparatus or a server apparatus (or user terminal) including it, but also as a dialogue system including one or more server apparatuses and a user terminal, a dialogue support method performed by a dialogue support apparatus, a program for causing a computer to function as the dialogue support apparatus of the present invention, and so on. The program according to the present invention may be provided in a form recorded on a recording medium such as an optical disc, or in a form in which it is downloaded to a computer over a network such as the Internet and installed for use.

DESCRIPTION OF SYMBOLS: 10 ... dialogue system, 100 ... user terminal, 110 ... control unit, 120 ... storage unit, 130 ... communication unit, 140 ... display unit, 150 ... voice input unit, 160 ... voice output unit, 200A ... speech recognition server, 200B ... intention interpretation server, 200C ... chat management server, 200D ... speech synthesis server, 210 ... control unit, 211 ... acquisition unit, 212 ... interpretation unit, 213 ... output unit, 220 ... storage unit, 230 ... communication unit, 300 ... network

Claims

A dialogue support apparatus comprising:
an acquisition unit that acquires input information generated based on a voice uttered by a user;
an interpretation unit that interprets the intention of the user's utterance based on the input information acquired by the acquisition unit; and
an output unit that, when the intention interpreted by the interpretation unit is a search for information, outputs response information according to the search result of a search executed according to the intention, and, when the intention interpreted by the interpretation unit is not a search for information, selectively outputs one of a plurality of pieces of response information prepared corresponding to the intention,
wherein, when the intention interpreted by the interpretation unit is a request for singing by a character, the output unit outputs music data if the position of the user satisfies a predetermined condition, and outputs response information indicating a message refusing to sing if the position of the user does not satisfy the predetermined condition.

The dialogue support apparatus according to claim 1, wherein the output unit refers to a different information source depending on whether or not the intention interpreted by the interpretation unit is a search for information.

The dialogue support apparatus according to claim 1 or 2, wherein the output unit determines whether or not the intention interpreted by the interpretation unit is a search for information based on whether or not a predetermined character string is included.

The dialogue support apparatus according to any one of claims 1 to 3, wherein the output unit outputs, from among the plurality of pieces of response information, the response information corresponding to a character selected by the user.

The dialogue support apparatus according to any one of claims 1 to 4, wherein the output unit outputs the response information together with additional information that is appended to the response information and corresponds to the character selected by the user.

A dialogue system comprising:
the dialogue support apparatus according to any one of claims 1 to 5; and
a user terminal that picks up the user's voice and reproduces at least audio according to the response information output by the output unit.

A dialogue support method comprising the steps of:
acquiring input information generated based on a voice uttered by a user;
interpreting the intention of the user's utterance based on the acquired input information; and
outputting, when the interpreted intention is a search for information, response information according to the search result of a search executed according to the intention, and selectively outputting, when the interpreted intention is not a search for information, one of a plurality of pieces of response information prepared corresponding to the intention,
wherein, in the step of selectively outputting one of the plurality of pieces of response information, when the interpreted intention is a request for singing by a character, music data is output if the position of the user satisfies a predetermined condition, and response information indicating a message refusing to sing is output if the position of the user does not satisfy the predetermined condition.

A program for causing a computer to execute the steps of:
acquiring input information generated based on a voice uttered by a user;
interpreting the intention of the user's utterance based on the acquired input information; and
outputting, when the interpreted intention is a search for information, response information according to the search result of a search executed according to the intention, and selectively outputting, when the interpreted intention is not a search for information, one of a plurality of pieces of response information prepared corresponding to the intention,
wherein, in the step of selectively outputting one of the plurality of pieces of response information, when the interpreted intention is a request for singing by a character, music data is output if the position of the user satisfies a predetermined condition, and response information indicating a message refusing to sing is output if the position of the user does not satisfy the predetermined condition.