JP2007328283A

JP2007328283A - Interaction system, program and interactive method

Info

Publication number: JP2007328283A
Application number: JP2006161265A
Authority: JP
Inventors: Haruki Kuzuoka; 春樹葛岡; Makoto Nakamura; 誠中村; Takeshi Wakasa; 岳史若狭; Jun Takahashi; 潤高橋; Takaaki Matsuoka; 貴晃松岡; Atsushi Tsurumi; 篤鶴見; Kunihiro Suga; 邦博須賀; Sayuri Yuzukizaki; さゆり柚木▲崎▼
Original assignee: Kenwood KK
Current assignee: Kenwood KK
Priority date: 2006-06-09
Filing date: 2006-06-09
Publication date: 2007-12-20

Abstract

PROBLEM TO BE SOLVED: To provide a dialogue technique capable of achieving more natural dialogue.
SOLUTION: The dialogue system comprises: a back-channel means that outputs a back-channel (nodding) voice response each time one segment of speech uttered by the dialogue partner is input (Steps 21, 24 and 25); an extracting means that extracts a word or phrase whose frequency of appearance in the input speech has reached a prescribed value or more (Steps 21 to 23); a searching means that searches for a word or phrase related to the word or phrase extracted by the extracting means (Step 26); and an outputting means that outputs speech using the word or phrase obtained by the searching means (Step 28).
COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a dialogue apparatus, a program, and a dialogue method that extract a word or phrase contained in a dialogue partner's utterance and output speech as an utterance whose content is related to it.

In recent years, interactive agent devices have been under development that use voice input so that anyone can receive various kinds of information and services easily and with peace of mind. FIG. 4 shows the configuration of a conventional interactive agent device. This device includes an input unit 41 that converts speech uttered by a user into a voice signal, a voice dialogue system 40 that generates an output voice signal based on the voice signal from the input unit 41, and an output unit 46 that converts the voice signal from the voice dialogue system 40 into sound and outputs it.

The voice dialogue system 40 includes a speech recognition engine 42 that performs speech recognition on the voice signal from the input unit 41 and extracts sentence data, a dialogue processing engine 43 that generates text data based on the sentence data from the speech recognition engine 42 and on predetermined scenario data, a storage unit 44 that stores the scenario data used by the dialogue processing engine 43, and a speech synthesis engine 45 that generates a voice signal based on the text data from the dialogue processing engine 43 and supplies it to the output unit 46.

Here, the scenario data is a combination of processing item data and transition definition data, and determines the operation of the dialogue processing engine 43. The processing item data describes the processing content of each predetermined processing item, and the transition definition data defines the transitions that can occur between the processing of the items. In the case of dialogue processing, the scenario data can be said to define, for the words and phrases contained in the input speech uttered by the dialogue partner, the content of the corresponding utterance to be output.

In this configuration, speech uttered by the user is converted into a voice signal by the input unit 41 and supplied to the speech recognition engine 42. The speech recognition engine 42 converts the supplied voice signal into sentence data and supplies it to the dialogue processing engine 43. Based on the supplied sentence data and in accordance with the scenario data, the dialogue processing engine 43 creates text data corresponding to the speech to be output and supplies it to the speech synthesis engine 45. The speech synthesis engine 45 converts the supplied text data into a voice signal and supplies it to the output unit 46. The output unit 46 converts the supplied voice signal into sound and outputs it. In this way, the agent device functions as the user's dialogue partner.

As an example of a technique for dialogue with a computer using voice input and output, a system has been proposed in which the databases used for speech recognition are created in advance for each scene that may become a topic. During dialogue, the system identifies the scene containing the input words obtained by decomposing the user's input sentence and proceeds with the dialogue while restricting its content based on the identified scene. When the state of the dialogue with the user changes, it builds a new database according to the change and provides the user with information from that new database (see, for example, Patent Document 1).

Meanwhile, as an example of information retrieval using voice input, an information provision support system has been proposed that transmits to an information provision server an information provision request containing a keyword for the desired information that was input, and receives and displays the information provision screen transferred from the server in response (see, for example, Patent Document 2). This system is said to allow users to receive various kinds of information and services very simply and easily, without troublesome operations.

A spoken dialogue apparatus has also been proposed that generates paraphrases of a retrieved official name and generates a speech recognition dictionary containing those paraphrases for use in the next speech recognition process (see, for example, Patent Document 3). This is said to make it possible to retrieve and present an official name registered in a database even when the user does not know that name exactly.

Patent Document 1: JP 2005-004501 A
Patent Document 2: JP 2004-38252 A
Patent Document 3: JP 2005-338274 A

However, the interactive agent device of FIG. 4 described above speaks on the basis of utterance text prepared in advance, so that text must be prepared beforehand. Moreover, increasing the range of text that can be spoken requires accumulating an enormous amount of data. The computer dialogue technique described above likewise requires a database of scene-specific words to be prepared in advance.

In addition, with each of the techniques described above, speaking a word that is not recognized often draws a cold reply such as "Please say that again." Also, when information is provided from a Web server, the device merely recognizes the user's speech and displays the Web server's information provision screen in response; the information is not provided in the form of a dialogue. Consequently, current interactive devices meet with reluctance from elderly users and others, also have problems in areas such as operability, and have therefore not yet come into widespread use.

In view of these problems of the prior art, an object of the present invention is to provide a dialogue technique that can achieve more natural dialogue.

To achieve the above object, the dialogue apparatus according to the first invention is characterized by comprising: extraction means for extracting a word or phrase whose frequency of appearance in the dialogue partner's utterances has reached a predetermined value or more; generation means for generating text indicating the content of an utterance related to the word or phrase extracted by the extraction means; and output means for outputting speech based on the text generated by the generation means.

Here, the dialogue apparatus may be, for example, one implemented in a navigation device, a portable terminal, a personal computer, or the like. A word or phrase whose frequency of appearance has reached the predetermined value or more is, for example, a word or phrase that has appeared a predetermined number of times or more since the start of the dialogue. Text indicating the content of an utterance related to the extracted word or phrase is, for example, text that uses "Botchan" or "Sanshiro", which are related to the extracted "Natsume Soseki", such as "Natsume Soseki also has Sanshiro, doesn't he?" The text may also be text associated with the word or phrase by the scenario data described above.

In this configuration, a word or phrase whose frequency of appearance in the input speech has reached the predetermined value or more is considered to represent something in which the user has some interest. Extracting such a word or phrase and making an utterance related to it is therefore a response in line with the user's interests, and natural dialogue can thus be achieved.

The dialogue apparatus according to the second invention is, in the first invention, characterized by having changing means for changing the predetermined value in accordance with a given instruction.

The dialogue apparatus according to the third invention is, in the first or second invention, characterized by comprising search means for searching for a word or phrase related to the word or phrase extracted by the extraction means, wherein the generation means generates, as the text indicating the content of the related utterance, text that uses the word or phrase obtained by the search means.

The dialogue apparatus according to the fourth invention is, in the third invention, characterized in that the search means searches for the related word or phrase based on a database on a network, using the word or phrase extracted by the extraction means as a keyword.

The dialogue apparatus according to the fifth invention is, in the third or fourth invention, characterized by having registration means for registering the word or phrase extracted by the extraction means, and determination means for determining, when a plurality of related words or phrases exist, which of them to use in generating the text by referring to the words or phrases registered by the registration means.

The dialogue apparatus according to the sixth invention is, in any one of the first to fifth inventions, characterized by having back-channel means for outputting a back-channel voice response for each segment of the dialogue partner's utterance. A segment of utterance is, for example, an utterance followed by silence lasting a predetermined time or longer, or an utterance that ends with a predetermined sentence ending such as "... da yo" (「・・・だよ」) or "... i ne" (「・・・いね」).

The program according to the seventh invention is characterized by causing a computer to function as each of the means in the dialogue apparatus of any one of the first to sixth inventions.

The dialogue method according to the eighth invention is characterized by comprising: an extraction step in which a dialogue apparatus extracts a word or phrase whose frequency of appearance in the dialogue partner's utterances has reached a predetermined value or more; a generation step in which the dialogue apparatus generates text indicating the content of an utterance related to the word or phrase extracted in the extraction step; and an output step of outputting speech based on the text generated in the generation step.

According to the present invention, when a word or phrase whose frequency of appearance in the dialogue partner's utterances has reached a predetermined value or more can be extracted, an utterance with content related to the extracted word or phrase can be output, so that the apparatus can, as appropriate, carry on a dialogue with utterances in line with the user's interests.

Also, because the predetermined value can be changed in accordance with a given instruction, the frequency with which words or phrases are extracted can be adjusted, and with it the frequency with which the dialogue apparatus makes utterances corresponding to extracted words or phrases.

Also, because a word or phrase related to the extracted word or phrase is searched for and text using the word or phrase obtained is generated as the text indicating the content of the related utterance, the apparatus can carry on a dialogue with utterances in line with the user's interests.

Also, because related words or phrases are searched for in a database on a network using the extracted word or phrase as a keyword, the apparatus can carry on a dialogue with utterances that use related words or phrases without needing a large database on the dialogue apparatus itself.

Also, extracted words or phrases are registered, and when a plurality of related words or phrases exist, the one used to create the text is decided by referring to the registered words or phrases. The apparatus can therefore steer the dialogue toward broader content by using words or phrases that have rarely been used so far to give the user information the user does not know, or conversely steer it toward content that digs deeply into a field of interest to the user by using frequently used words or phrases.

Also, because a back-channel voice response is output while the search means is searching, the partner does not sense a break in the conversation even during a search, and the apparatus can convey that it is listening attentively to what the partner says.

FIG. 1 is a block diagram showing the configuration of an agent device according to an embodiment of the present invention. This agent device can be implemented, for example, as part of a navigation device. It may also be implemented as a dedicated dialogue device or as software on a personal computer. It may further be implemented with an interface that displays a character representing the agent and lets the user converse with that character.

As shown in the figure, this agent device includes: an input unit 11 that converts speech uttered by the user into a voice signal; a speech recognition engine 12 that performs speech recognition on the voice signal from the input unit 11 and extracts sentence data; a dialogue engine control unit 13 that extracts words from the sentence data from the speech recognition engine 12 and performs various processing and output based on them; a Web data search unit 14 that, based on a word given by the dialogue engine control unit 13, acquires information related to that word from a Web database 15 on a predetermined Web server; an utterance text creation unit 16 that creates the text used for an utterance based on the information acquired by the Web data search unit 14; a speech synthesis engine 17 that generates a voice signal based on text data from the dialogue engine control unit 13 or the utterance text creation unit 16; an output unit 18 that outputs speech based on the voice signal from the speech synthesis engine 17; and a word list storage unit 19 that stores the words extracted by the dialogue engine control unit 13.

The agent device as a whole functions as an interactive agent, but the dialogue engine control unit 13 and the utterance text creation unit 16 play the principal roles. The dialogue engine control unit 13 combines the functions of the dialogue processing engine 43 and the scenario data storage unit 44 of the conventional example in FIG. 4. A dialogue system is made up of three engines: this dialogue processing engine function, the speech recognition engine 12, and the speech synthesis engine 17.

The operation of the dialogue system is determined by the scenario data. That is, the dialogue processing engine realizes the dialogue function by controlling the speech recognition engine 12 and the speech synthesis engine 17 based on the scenario data.

The input unit 11 includes a means for accepting voice input, such as a microphone. The input voice signal is converted into text by the speech recognition engine 12 and supplied to the dialogue engine control unit 13. A dictation method, which extracts sentence data, is used for the conversion to text. The dialogue engine control unit 13 extracts words from the sentence data supplied by the speech recognition engine 12 according to predetermined logic and passes them to the Web data search unit 14.

The Web data search unit 14 accesses the Web database 15, transmits a search request containing the given word, and receives the search result returned by the Web database 15 in response. The Web database 15 performs an information search using the word contained in the search request as a keyword and provides the search result to the client. For example, in response to a search request containing the word "Natsume Soseki", it can return the search result "Botchan, Sanshiro". The Web database 15 may belong to an existing Web site or to a dedicated Web site, although a dedicated Web site is preferable for smooth dialogue.

The utterance text creation unit 16 creates utterance text based on the information acquired by the Web data search unit 14 and supplies it to the speech synthesis engine 17. The speech synthesis engine 17 converts the utterance text supplied from the utterance text creation unit 16 into a voice signal and supplies it to the output unit 18. The output unit 18 converts the supplied voice signal into sound waves with a speaker and outputs them as speech.

From the sentences extracted by the speech recognition engine 12, the dialogue engine control unit 13 picks out character strings that have been used repeatedly a predetermined number of times and extracts them as words the user is using frequently in the current dialogue. The number of extracted words varies with the value of the predetermined number. Each time a sentence is input, the agent device outputs a back-channel response or the like and takes the role of listener; meanwhile, any word extracted even once is registered in the word list in the storage unit 19. The number of times each registered word has been extracted is also recorded; that is, if an extracted word is already registered, its extraction count is incremented. The word list is used by the utterance text creation unit 16 to decide which word to use in creating the utterance text when the Web data search unit 14 has acquired more than one word.

A word may be deleted from the word list once a fixed time has elapsed since its registration. When the number of entries in the word list exceeds a predetermined value, the oldest entry may be deleted. The user may also be allowed to register words in the word list in advance, so that the user's own preferences are reflected in the choice of words used when creating utterance text.
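The word list behavior described above (extraction counts, expiry after a fixed time, and deletion of the oldest entry when the list grows too large) could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `WordList`, its parameters `max_entries` and `ttl_seconds`, and the choice to measure age from first registration are all hypothetical and are not specified by the publication.

```python
import time


class WordList:
    """Hypothetical word list: word -> [extraction_count, registered_at]."""

    def __init__(self, max_entries=100, ttl_seconds=3600.0):
        self.max_entries = max_entries    # assumed cap on the number of entries
        self.ttl_seconds = ttl_seconds    # assumed lifetime of an entry
        self.entries = {}

    def expire(self, now):
        # Delete words whose registration is older than the allowed lifetime.
        stale = [w for w, (_, t) in self.entries.items()
                 if now - t >= self.ttl_seconds]
        for w in stale:
            del self.entries[w]

    def register(self, word, now=None):
        now = time.time() if now is None else now
        self.expire(now)
        if word in self.entries:
            self.entries[word][0] += 1        # already registered: count up
        else:
            self.entries[word] = [1, now]     # first extraction: count is 1
        # If the list is too long, delete the oldest entries.
        while len(self.entries) > self.max_entries:
            oldest = min(self.entries, key=lambda w: self.entries[w][1])
            del self.entries[oldest]

    def count(self, word):
        return self.entries[word][0] if word in self.entries else 0
```

Passing `now` explicitly keeps the sketch deterministic; a real device would presumably rely on its own clock.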

FIG. 2 is a flowchart showing the dialogue processing in the device of FIG. 1, and FIG. 3 shows an example of the dialogue in this processing. When the dialogue processing starts, one segment of voice input is first accepted in step 21. At this point, as shown in FIG. 3 for example, the input "The other day I read 'Botchan' by 'Natsume Soseki'." can be accepted. One segment of voice input is, for example, speech followed by silence lasting a predetermined time or longer, or speech that ends with a predetermined sentence ending such as "... da yo" (「・・・だよ」) or "... i ne" (「・・・いね」).
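As a rough illustration of the two segment-boundary criteria named above (trailing silence of at least a predetermined length, or a predetermined sentence ending), a check might look like the sketch below. The function name `is_segment_end`, the one-second default, and the set of stripped punctuation marks are assumptions made for this example; only the two endings 「だよ」 and 「いね」 come from the text.

```python
def is_segment_end(text, trailing_silence_s, min_silence_s=1.0,
                   endings=("だよ", "いね")):
    """Return True if the recognized text closes one segment of utterance.

    text: recognized text so far (hypothetical input from the recognizer)
    trailing_silence_s: seconds of silence observed after the speech
    """
    # Criterion 1: silence of at least the predetermined length follows.
    if trailing_silence_s >= min_silence_s:
        return True
    # Criterion 2: the text ends with one of the predetermined endings,
    # ignoring trailing punctuation and closing quotation marks.
    stripped = text.rstrip(" 。．.!！?？」』")
    return stripped.endswith(endings)
```

For instance, `is_segment_end("坊ちゃんを読んだよ。", 0.1)` is true because the text ends in 「だよ」, even though the pause is short.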

Next, in step 22, the speech recognition engine 12 extracts sentence data based on the voice signal output from the input unit 11 in response to the voice input. This yields sentence data corresponding to the voice input, for example the sentence data "The other day I read 'Botchan' by 'Natsume Soseki'."

Next, in step 23, the dialogue engine control unit 13 determines whether the sentence data extracted by the speech recognition engine 12 contains a word that has appeared a predetermined number of times or more since the dialogue processing started. This determination can be made by incrementing, each time step 22 is performed after the start of the dialogue processing, the appearance count of every word that appears in the extracted sentence data, and checking whether the appearance count of any word has reached the predetermined number.
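The counting scheme of step 23 might be sketched as follows. The publication does not say how sentence data is split into words, so this example assumes each segment already arrives as a list of words; the class name `FrequentWordDetector` and the default threshold of 2 are likewise hypothetical. The `threshold` attribute can be reassigned at any time, which corresponds loosely to the changing means of the second invention.

```python
from collections import Counter


class FrequentWordDetector:
    """Counts word appearances since the start of the dialogue (assumed design)."""

    def __init__(self, threshold=2):
        self.threshold = threshold   # the "predetermined number"; adjustable
        self.counts = Counter()

    def observe(self, words):
        """Count the words of one segment and return those at or above threshold."""
        self.counts.update(words)
        # dict.fromkeys removes duplicates while keeping first-seen order.
        return [w for w in dict.fromkeys(words)
                if self.counts[w] >= self.threshold]


detector = FrequentWordDetector(threshold=2)
first = detector.observe(["Natsume Soseki", "Botchan"])        # nothing frequent yet
second = detector.observe(["Natsume Soseki", "interesting"])   # "Natsume Soseki" reaches 2
```

An empty result corresponds to the branch to step 24, and a non-empty result to the branch to step 25.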

If it is determined that no word has appeared the predetermined number of times or more, a back-channel response is output as speech in step 24 and the process returns to step 21. The back-channel response is output by supplying text data representing it, for example "Hmm." (『ふ〜ん。』), to the speech synthesis engine 17. The supplied text data passes through the speech synthesis engine 17 and the output unit 18 and is output as back-channel speech. On returning to step 21, following the example of FIG. 3, the further sentence data "'Natsume Soseki' really is interesting, isn't he?" can be obtained.

If it is determined in step 23 that a word has appeared the predetermined number of times or more, a back-channel response is output as speech in step 25 in the same manner as in step 24. In the example of FIG. 3, "Natsume Soseki" is shown as the word that has appeared the predetermined number of times or more; that is, "Natsume Soseki" is extracted as a frequent word. As the back-channel response of step 25, the example of FIG. 3 outputs "Oh!" (『へ〜！』).

Next, in step 26, the Web data search unit 14 sends the Web database 15 a search request with the frequent word as the keyword and receives the search result from the Web database 15. In the example of FIG. 3, the Web data search unit 14 sends the Web database 15 a search request whose keyword is "Natsume Soseki", which has appeared the predetermined number of times or more in the dialogue processing, and obtains "Botchan" and "Sanshiro" as the search result. This acquisition of information from the Web database 15 can be carried out in parallel while the device is outputting the back-channel response of step 25 and acting as listener.
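One way to picture the overlap of the search (step 26) with the back-channel output (step 25) is to start the search on a worker thread and speak while it runs. The sketch below uses Python's standard `concurrent.futures`; the callables `search` and `speak` are hypothetical stand-ins for the Web data search unit 14 and the speech synthesis path, and the publication does not state that threads are used.

```python
from concurrent.futures import ThreadPoolExecutor


def backchannel_while_searching(keyword, search, speak, backchannel="Oh!"):
    """Run search(keyword) in the background while a back-channel is spoken.

    search: hypothetical callable, keyword -> list of related words
    speak:  hypothetical callable that outputs one utterance
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(search, keyword)   # step 26 starts in the background
        speak(backchannel)                      # step 25 proceeds meanwhile
        return future.result()                  # wait for the search result


spoken = []
related = backchannel_while_searching(
    "Natsume Soseki",
    search=lambda kw: ["Botchan", "Sanshiro"],  # stub in place of Web database 15
    speak=spoken.append,
)
```

With the stub above, `related` holds the two titles and `spoken` holds the single back-channel response.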

In step 26, the word is also registered in the word list in the storage unit 19 described above. That is, the frequent word that was determined in step 23 to have appeared the predetermined number of times or more and was extracted is registered in the word list, and its extraction count is set to "1". If the word is already registered, its extraction count is incremented.

Next, in step 27, the utterance text creation unit 16 creates the text to be used for the utterance based on the words acquired by the Web data search unit 14. If more than one word was acquired, the text is created using a word that is not registered in the word list in the storage unit 19. If all of the words are registered, the word with the lowest extraction count is used. The user is thought to know little about words that have appeared infrequently in the dialogue, so creating the utterance text from such a word is more valuable as a way of giving the user new knowledge and makes it easier for the topic to broaden.

For example, if the acquired words are "Botchan" and "Sanshiro" as in FIG. 3, the unit checks whether "Botchan" and "Sanshiro" are registered in the word list. In the example of FIG. 3, "Botchan" is registered in the word list and "Sanshiro" is not, so the utterance text creation unit 16 uses "Sanshiro" and creates the utterance text "'Natsume Soseki' also has 'Sanshiro', doesn't he?"
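The selection rule of step 27 can be illustrated with a small function: prefer a candidate that is not in the word list, and otherwise take the one with the lowest extraction count. The name `choose_word`, the dictionary representation of the word list, and the `prefer_novel` flag (which switches to the opposite policy of favoring the most frequently extracted word) are assumptions for this sketch.

```python
def choose_word(candidates, extraction_counts, prefer_novel=True):
    """Pick the word to use in the utterance text.

    candidates: related words returned by the search
    extraction_counts: hypothetical word list as a dict, word -> extraction count
    """
    if not candidates:
        raise ValueError("no candidate words")
    if prefer_novel:
        # First choice: a word that is not registered at all.
        for word in candidates:
            if word not in extraction_counts:
                return word
        # All registered: use the one extracted the fewest times.
        return min(candidates, key=lambda w: extraction_counts[w])
    # Opposite policy: follow the user's interest with the most-extracted word.
    return max(candidates, key=lambda w: extraction_counts.get(w, 0))


choice = choose_word(["Botchan", "Sanshiro"], {"Botchan": 1})
```

With "Botchan" registered and "Sanshiro" not, `choice` is "Sanshiro", matching the FIG. 3 example.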

The text can be created using a format prepared in advance, such as "Speaking of XXX, there is OOO, isn't there?" (『×××といったら○○○あるね』). The extracted frequent word is fitted into the "XXX" part, and the word acquired by the search is fitted into the "OOO" part. Thus, by fitting "Natsume Soseki" into "XXX" and "Sanshiro" into "OOO", the utterance text "'Natsume Soseki' also has 'Sanshiro', doesn't he?" can be created.
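Filling a prepared format with the two words amounts to simple template substitution, as in the sketch below. The English wording of `TEMPLATE` is an assumed rendering of the Japanese format quoted above, and the names `TEMPLATE` and `make_utterance` are hypothetical.

```python
# Assumed English rendering of the prepared format; {topic} plays the role
# of "XXX" (the frequent word) and {related} the role of "OOO" (the search hit).
TEMPLATE = "Speaking of {topic}, there is {related}, isn't there?"


def make_utterance(topic, related, template=TEMPLATE):
    """Fit the frequent word and the related word into the prepared format."""
    return template.format(topic=topic, related=related)


text = make_utterance("Natsume Soseki", "Sanshiro")
```

Keeping the format as data rather than code would let several phrasings be prepared and swapped without touching the selection logic.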

Conversely, the utterance text may instead be created using a word with a large extraction count. In that case, utterance text that follows the user's interests can be created. The method of selecting the word used to create the utterance text may also be switched as appropriate.

Next, in step 28, voice is output based on the created utterance text. Specifically, the speech synthesis engine 17 converts the utterance text created by the utterance text creation unit 16 into a voice signal and supplies it to the output unit 18. The output unit 18 converts this voice signal into sound and outputs it.

According to this embodiment, the device takes the listener's role by giving back-channel responses (aizuchi) to the user's utterances (step 24). It extracts a word that has appeared a predetermined number of times or more in the user's utterances, and it keeps giving back-channel responses while the search for words related to that word is in progress (step 25). The device can therefore act as a listener even when the user simply wants to talk, and even in a situation where silence caused by the search might interrupt the flow of the conversation, it can convey the impression that it is listening attentively. In short, because back-channel responses come back, the user can keep talking in a natural flow, and because an utterance in line with the topic comes back at a suitable moment, the dialogue feels more human.
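One way to picture steps 24 and 25 is to run the search in the background while each incoming utterance segment is acknowledged. The sketch below is only an assumption about how this could be arranged: `listen_while_searching`, the `speak` callback, and the fixed response "uh-huh" are hypothetical, and the search function is supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def listen_while_searching(utterances, search, keyword, speak):
    """Acknowledge each utterance segment while a search runs in the background.

    utterances: iterable of utterance segments arriving from the user.
    search:     callable taking the keyword and returning related words.
    speak:      callable that outputs a short back-channel response.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start the related-word search without blocking the dialogue.
        future = pool.submit(search, keyword)
        for _ in utterances:
            # Keep responding so the conversation does not fall silent.
            speak("uh-huh")
        # Wait for the related words once the segments have been acknowledged.
        return future.result()


spoken = []
related = listen_while_searching(
    ["segment 1", "segment 2"],
    lambda kw: ["Botchan", "Sanshiro"],
    "Natsume Soseki",
    spoken.append,
)
print(spoken, related)
```

In this simplified form every segment is acknowledged regardless of whether the search has already finished; a real device would presumably interleave the two based on timing.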

In this regard, if the user wants the agent device to listen, the predetermined count that serves as the criterion for extracting words by number of appearances can be made settable to a large value, which lowers the frequency of word extraction and suppresses the frequency of utterances from the agent device. Conversely, if the frequency of utterances other than back-channel responses from the agent device is to be increased, it suffices to allow the frequency of word extraction to be raised. This lets the user conduct a dialogue suited to their own pace.
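The effect of the adjustable predetermined count can be illustrated with a small counter. The class name `FrequentWordExtractor`, its methods, and the default threshold of 3 are assumptions for this sketch; the text does not specify a default value or an interface.

```python
from collections import Counter


class FrequentWordExtractor:
    """Counts words across utterances and reports those at or above a threshold."""

    def __init__(self, threshold=3):  # default value is an assumption
        self.threshold = threshold
        self.counts = Counter()

    def set_threshold(self, value):
        """Raise the value to make the device speak less, lower it to speak more."""
        if value < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = value

    def feed(self, words):
        """Count the words of one utterance and return those that qualify."""
        self.counts.update(words)
        # dict.fromkeys removes duplicates while keeping the original order.
        return [w for w in dict.fromkeys(words) if self.counts[w] >= self.threshold]


extractor = FrequentWordExtractor(threshold=2)
print(extractor.feed(["Natsume Soseki", "book"]))
print(extractor.feed(["Natsume Soseki"]))
```

With a threshold of 2, the first call yields no words and the second yields "Natsume Soseki"; a larger threshold would delay or prevent the extraction, and with it the device's own utterance.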

Furthermore, even when new information from the Web database 15 is provided, the information is not simply displayed as a Web page but is conveyed by voice that contains the word constituting the provided information. The provided information can thus be digested into dialogue and inserted into the conversation in a natural form. It can therefore be conveyed easily even to elderly users and the like.

The present invention is not limited to the embodiment described above and can be implemented with appropriate modifications. For example, in the above description words are extracted from the input voice and passed to the Web data search unit 14, but phrases may additionally be extracted and passed as well.

In the above description, the Web data search unit 14 searches for related words by issuing a request to the Web database 15, but a dedicated database may instead be provided on the device and used to search for related words. In that case, a search request may be made to the Web database 15 when the dedicated database yields no result. The search results obtained from the Web database 15 may then also be reflected in the dedicated database.
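This variation, a dedicated database consulted first with the Web database as a fallback whose results are written back, can be sketched as below. The dictionary `local_db` and the callable `web_search` are hypothetical stand-ins for the dedicated database and the request to the Web database.

```python
def find_related(word, local_db, web_search):
    """Look up related words locally first, then fall back to the Web search.

    local_db:   dict mapping a keyword to a list of related words
                (stand-in for the dedicated on-device database).
    web_search: callable returning a list of related words for a keyword
                (stand-in for the request to the Web database).
    """
    related = local_db.get(word)
    if related:
        return related
    # Nothing found locally: ask the Web database instead.
    related = web_search(word)
    if related:
        # Reflect the Web result in the dedicated database for next time.
        local_db[word] = related
    return related


local_db = {}
print(find_related("Natsume Soseki", local_db, lambda w: ["Botchan", "Sanshiro"]))
print(local_db)
```

After the first lookup the result is held locally, so a second lookup for the same keyword does not need the network.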

Also, in the above description, the Web data search unit 14 acquires words related to the word that the dialogue engine control unit 13 has extracted from the input voice, and the utterance text creation unit 16 creates utterance text containing the acquired word. Instead, the dialogue engine control unit 13 may create utterance text from the extracted word in accordance with scenario data, as in the conventional approach, and the utterance text creation unit 16 may create utterance text according to the above embodiment only for extracted words that the scenario data does not cover and for which utterance text therefore cannot otherwise be created.
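The hybrid arrangement in this variation reduces to trying the scenario data first and falling back to search-based generation. In the sketch below, the dictionary `scenario` and the callable `generate_from_search` are hypothetical; the structure of the scenario data is not described in the text.

```python
def create_utterance(word, scenario, generate_from_search):
    """Use scenario data when it covers the word, otherwise generate from a search.

    scenario:             dict mapping an extracted word to a scripted reply
                          (assumed shape of the scenario data).
    generate_from_search: callable producing utterance text for a word by the
                          search-based method of the embodiment.
    """
    if word in scenario:
        return scenario[word]
    return generate_from_search(word)


scenario = {"weather": "It looks sunny today."}
fallback = lambda w: f"{w} also has 'Sanshiro', doesn't he?"
print(create_utterance("weather", scenario, fallback))
print(create_utterance("Natsume Soseki", scenario, fallback))
```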

FIG. 1 is a block diagram showing the configuration of an agent device according to one embodiment of the present invention.
FIG. 2 is a flowchart showing the dialogue processing in the device of FIG. 1.
FIG. 3 is a diagram showing an example of dialogue content in the processing of FIG. 2.
FIG. 4 is a block diagram showing the configuration of a conventional interactive agent device.

Explanation of symbols

11, 41: input unit; 12, 42: speech recognition engine; 13: dialogue engine control unit; 14: Web data search unit; 15: Web database; 16: utterance text creation unit; 17, 45: speech synthesis engine; 18, 46: output unit; 40: voice dialogue system; 44: scenario data.

Claims

An interactive apparatus comprising:
extraction means for extracting a word or phrase whose appearance frequency in the utterance of a dialogue partner is a predetermined value or more;
generation means for generating text indicating the content of an utterance related to the word or phrase extracted by the extraction means; and
output means for outputting voice based on the text generated by the generation means.

The interactive apparatus according to claim 1, further comprising changing means for changing the predetermined value in accordance with a given instruction.

The interactive apparatus further comprising search means for searching for a word or phrase related to the word or phrase extracted by the extraction means,
wherein the generation means generates, as the text indicating the content of the related utterance, text using the word or phrase obtained by the search means.

4. The interactive apparatus according to claim 3, wherein the search means searches for the related word or phrase in a database on a network, using the word or phrase extracted by the extraction means as a keyword.

The interactive apparatus according to claim 3 or 4, further comprising:
registration means for registering the word or phrase extracted by the extraction means; and
determining means for determining, when there are a plurality of the related words or phrases, which word or phrase is used to generate the text, with reference to the words or phrases registered by the registration means.

The interactive apparatus according to any one of claims 3 to 5, further comprising back-channel means for outputting a voice that conveys a back-channel response (aizuchi) while the search by the search means is in progress.

A program for causing a computer to function as each means in the interactive apparatus according to claim 1.

An interactive method comprising:
an extraction step in which an interactive apparatus extracts a word or phrase whose appearance frequency in the utterance of a dialogue partner is a predetermined value or more;
a generation step in which the interactive apparatus generates text indicating the content of an utterance related to the word or phrase extracted in the extraction step; and
an output step in which the interactive apparatus outputs voice based on the text generated in the generation step.