JP2009198686A

JP2009198686A - Response generator and program

Info

Publication number: JP2009198686A
Application number: JP2008038814A
Authority: JP
Inventors: Ryoko Hotta; 良子堀田; Takakatsu Yoshimura; 貴克吉村; Kazuya Shimooka; 和也下岡; Yusuke Nakano; 雄介中野
Original assignee: Toyota Motor Corp; Toyota Central R&D Labs Inc
Current assignee: Toyota Motor Corp; Toyota Central R&D Labs Inc
Priority date: 2008-02-20
Filing date: 2008-02-20
Publication date: 2009-09-03
Anticipated expiration: 2028-02-20
Also published as: JP4893655B2

Abstract

PROBLEM TO BE SOLVED: To provide a response generator and a program capable of generating an appropriate response to a speech recognition result and reducing erroneous responses.

SOLUTION: The response generator 10 includes: an input part 12 that recognizes a user's utterance and generates recognition candidates; a response generating part 14 that generates a response candidate for each recognition candidate generated by the input part 12; a response selecting part 16 that selects, from the response candidates generated by the response generating part 14, the response candidate with the largest number of corresponding recognition candidates; and an output part 18 that outputs the response candidate selected by the response selecting part 16.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a response generation apparatus and a program that generate a response to a user's utterance.

As a technique by which a system responds to a user's utterance, a device is known, for example as described in Patent Document 1 below, that corrects a speech recognition result based on the speech recognition context recognized so far, or based on the syntactic information of the speech recognition result.

In addition, the paper listed below as Non-Patent Document 1 describes a technique that searches for and generates a response using a plurality of recognition candidates. For example, if the speech recognition candidates are "Where is the toilet?" (トイレはどこですか), "Is there a toilet?" (トイレはありますか), and "Is there a cafeteria?" (食堂はありますか), the response is searched for using all the morphemes contained in these candidates (here, the eight morphemes トイレ / は / どこ / です / か / あり / ます / 食堂).
Patent Document 1: JP 2003-223185 A
Non-Patent Document 1: "Operation of a voice information guidance system as an actual environment research platform", IEICE Transactions

However, with the technique described in Patent Document 1, the accuracy of misrecognition correction is insufficient even when the previously recognized context and syntactic information are used, and misrecognition cannot always be corrected properly. A response may therefore be generated from an uncorrected misrecognition, producing a wrong response. There is also the problem that correcting misrecognition takes processing time.

Furthermore, the technique described in Non-Patent Document 1 integrates a plurality of recognition results to search for a response, so an appropriate response cannot always be generated. Specifically, the above example searches for a response using the eight morphemes トイレ / は / どこ / です / か / あり / ます / 食堂. When the recognition results include words such as "toilet" (トイレ) and "cafeteria" (食堂), of which only one was presumably actually spoken, pooling them into a single recognition result for the search may produce a response such as "The cafeteria is on the 8th floor" to the question "Where is the toilet?".
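The morpheme pooling in this example can be sketched as follows. This is only an illustration of the counting, not the method of Non-Patent Document 1 itself: the segmentation is hard-coded from the example in the text, and the function name `pool_morphemes` is hypothetical.

```python
# Pre-segmented recognition candidates from the example in the text.
# A real system would obtain these from a morphological analyzer.
CANDIDATES = [
    ["トイレ", "は", "どこ", "です", "か"],  # "Where is the toilet?"
    ["トイレ", "は", "あり", "ます", "か"],  # "Is there a toilet?"
    ["食堂", "は", "あり", "ます", "か"],    # "Is there a cafeteria?"
]


def pool_morphemes(candidates):
    """Return the distinct morphemes of all candidates in first-seen order."""
    pooled = []
    for morphemes in candidates:
        for morpheme in morphemes:
            if morpheme not in pooled:
                pooled.append(morpheme)
    return pooled


pooled = pool_morphemes(CANDIDATES)
print(len(pooled), " / ".join(pooled))
```

Because the pooled set contains both トイレ and 食堂, a search keyed on it has no way of telling which of the two words the user actually said, which is the weakness described above.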

The present invention has been made to solve the problems described above, and its object is to provide a response generation apparatus and a program that can generate an appropriate response to a speech recognition result and reduce erroneous responses.

To achieve the above object, the response generation apparatus of the present invention includes: recognition means for recognizing a user's utterance and generating recognition candidates; generation means for generating a response candidate for each recognition candidate generated by the recognition means; selection means for selecting, from the response candidates generated by the generation means, the response candidate with the largest number of corresponding recognition candidates; and response output means for outputting the response candidate selected by the selection means.

With this configuration, an appropriate response to the speech recognition result can be generated and erroneous responses can be reduced. For example, even when the speech recognition result of a user utterance contains a misrecognition, the present invention outputs a response candidate that corresponds to many recognition candidates, so the dialogue can proceed without breaking down. In addition, a response with few contradictions to each speech recognition candidate can be generated without correcting the speech recognition error.

A response generation apparatus according to another aspect of the present invention includes: recognition means for recognizing a user's utterance and generating recognition candidates; generation means for generating a response candidate for each recognition candidate generated by the recognition means; selection means for calculating, for each response candidate generated by the generation means, a numerical value indicating the appropriateness of the response to the user utterance based on the recognition reliability of the recognition candidates, and selecting from the generated response candidates the one whose calculated value indicates the highest appropriateness; and response output means for outputting the response candidate selected by the selection means.

With this configuration, an appropriate response to the speech recognition result can be generated and erroneous responses can be reduced. For example, even when the speech recognition result of a user utterance contains a misrecognition, the present invention selects and outputs the response candidate whose value, calculated from the reliability of the recognition candidates, indicates the highest appropriateness, so the dialogue can proceed without breaking down. In addition, a response with few contradictions to each speech recognition candidate can be generated without correcting the speech recognition error.

The generation means may include: first generation means for generating a response candidate for each recognition candidate generated by the recognition means; and second generation means for regenerating the response candidates by replacing a term contained in a response candidate generated by the first generation means with a term corresponding to its superordinate concept.

Replacing terms with terms corresponding to their superordinate concepts in this way aggregates a plurality of response candidates as a result, yielding response candidates that can correspond to many recognition candidates. Selecting the response candidate to output from these candidates therefore reduces erroneous responses.

The second generation means may use a thesaurus to replace a term contained in a response candidate generated by the first generation means with a term corresponding to its superordinate concept, and thereby regenerate the response candidates.

With this configuration, a term corresponding to the superordinate concept can easily be found and substituted.

Alternatively, the generation means may include: first generation means for generating a response candidate for each recognition candidate generated by the recognition means; and second generation means for, when the response candidates generated by the first generation means include a plurality of response candidates that can be aggregated, regenerating those response candidates as response candidates of identical content in which their contents are aggregated.

With this configuration, a plurality of response candidates can be aggregated, and response candidates that can correspond to many recognition candidates can be generated. Selecting the response candidate to output from these candidates therefore reduces erroneous responses.

The plurality of response candidates that can be aggregated may be response candidates that are partially identical. In that case, when the response candidates generated by the first generation means include a plurality of partially identical response candidates, the second generation means may regenerate them as response candidates of identical content by deleting the parts other than the identical part.

Alternatively, the plurality of response candidates that can be aggregated may be response candidates that are partially identical, and when the response candidates generated by the first generation means include a plurality of partially identical response candidates, the second generation means may regenerate them as response candidates of identical content that express a concept inferred by analogy from the identical part.

Alternatively, the plurality of response candidates that can be aggregated may be response candidates for which the concept inferred from at least one of a term contained in the response candidate and the response candidate as a whole is the same. In that case, when the response candidates generated by the first generation means include a plurality of such response candidates, the second generation means may regenerate them as response candidates of identical content that express the inferred concept. Furthermore, the concept inferred from at least one of a term contained in the response candidate and the response candidate as a whole may be an emotion or a superordinate concept.

The present invention is also applicable to a program for causing a computer to realize the functions of the above response generation apparatus.

As described above, the present invention has the excellent effect of generating an appropriate response to a speech recognition result and reducing erroneous responses.

Embodiments of the present invention are described in detail below with reference to the drawings.

[First Embodiment]

FIG. 1 is a block diagram showing the schematic configuration of a response generation apparatus 10 according to the first embodiment. The response generation apparatus 10 includes an input unit 12, a response generation unit 14, a response selection unit 16, and an output unit 18.

The input unit 12 includes a speech recognition dictionary, uses it to perform speech recognition on the speech waveform representing the input user utterance, and outputs a plurality of plausible speech recognition candidates. During speech recognition, the input unit 12 also calculates the reliability of the recognition by a general method that uses acoustic and linguistic information (see, for example, Akinobu Lee, Tatsuya Kawahara, and Kiyohiro Shikano, "Reliability calculation method based on fast word posterior probabilities in a two-pass search algorithm", IPSJ SIG Technical Report, 2003-SLP-49-48, 2003-12).

The response generation unit 14 generates response candidates to the user's utterance individually for every speech recognition candidate. In the present embodiment, question information that associates speech recognition candidates with question types is stored in predetermined storage means together with response templates, and the question information and templates are read from the storage means to generate the response candidates individually. The question information and templates are described later.

The response selection unit 16 selects, from the response candidates generated by the response generation unit 14, a response candidate that can correspond to more recognition candidates, or selects a response candidate while also taking the reliability of the speech recognition into account.

The output unit 18 includes a speaker, a display, and the like. It displays a character image representing the response candidate selected by the response selection unit 16 on the display, or synthesizes the selected response candidate into speech and outputs it through the speaker.

The components of the response generation apparatus 10 described above (except for the speaker, display, and the like included in the output unit 18) are realized by a computer that includes a CPU, a RAM, and a ROM. That is, the CPU realizes the function of each component by executing a program stored in the ROM or a predetermined storage device, and the processing described below is performed. The components may be implemented on separate computers or on a single computer.

The response generation apparatus 10 configured as described above can carry out, for example, the following dialogue example 1 with a user. In the following, the apparatus's utterance in response to a user utterance is called a "system utterance".

(Dialogue example 1)
User utterance 1: Listen, my mom opened it without asking.
System utterance 2: What?
User utterance 3: The box I had put away in the drawer. I got it from a friend and had been keeping it safe.
System utterance 4: I see.

The flow of response generation in the response generation apparatus 10 is described below.

First, the input unit 12 of the response generation apparatus 10 performs speech recognition on a user utterance, such as user utterance 1 in dialogue example 1 above, and inputs a plurality of speech recognition candidates to the response generation unit 14. As shown in FIG. 2, for example, the input unit 12 inputs to the response generation unit 14, as the speech recognition result, the "surface form" (見出し), which is the recognition result of the actual user utterance, and the "base form" (原形) of that surface form. As described earlier, the reliability of each speech recognition candidate is also input to the response generation unit 14.

FIG. 3 is a flowchart showing the flow of the response candidate generation routine performed by the response generation unit 14.

In step 100, the response generation unit 14 generates response candidates for one of the speech recognition candidates recognized by the input unit 12. As described earlier, the response candidates are generated based on the question information and templates stored in advance.

FIG. 4 shows a specific example of the question information. In FIG. 4, three question types, "who" (誰が), "what" (何を), and "why" (どうして), are associated with the speech recognition candidate (base form) "open" (開ける).

FIG. 5 shows specific examples of the response templates. FIG. 5(A) shows the response template for a predicate recognition candidate, which consists of: question type + speech recognition candidate (predicate) (*) + tense sentence-final particle (*) + interrogative sentence-final particle (*). The parts marked with * are optional, and in the present embodiment the speech recognition candidate (predicate) and the tense sentence-final particle are always omitted. FIG. 5(B) shows the response template for a noun, which consists of: question type + speech recognition candidate (noun) (*) + interrogative sentence-final particle (*). In the present embodiment, the speech recognition candidate (noun) is always omitted.

There are various tense sentence-final particles and interrogative sentence-final particles. Like the templates, these particles are stored in the storage means in advance, for example by registering them in a table such as the one shown in FIG. 6. The response generation unit 14 reads the necessary information from the storage means as appropriate and uses it to generate response candidates.

For example, for the speech recognition candidate "open" (開ける), the response candidates "Who opened it?" (誰が開けたの？), "What did you open?" (何を開けたの？), and "Why did you open it?" (どうして開けたの？) would normally be generated as shown in FIG. 7. In the present embodiment, however, the "opened" (開けたの) part of these response candidates is omitted when they are generated.
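The way the template slots for a predicate combine can be sketched as follows. This is only an illustration under stated assumptions: the function `fill_predicate_template` and its parameter names are hypothetical, the conjugated stem 開け and the particles た and の are taken from the FIG. 7 example strings rather than from the particle table of FIG. 6, and an omitted optional slot is modeled as an empty string.

```python
def fill_predicate_template(question_type, predicate_stem="",
                            tense_particle="", question_particle=""):
    """Concatenate the template slots: question type, then the optional
    predicate, tense particle, and interrogative particle, closed with
    a question mark. Omitted slots default to the empty string."""
    return f"{question_type}{predicate_stem}{tense_particle}{question_particle}？"


QUESTION_TYPES = ["誰が", "何を", "どうして"]  # question types for 開ける

# All slots filled: the "normal" response candidates.
full = [fill_predicate_template(q, "開け", "た", "の") for q in QUESTION_TYPES]

# Optional slots omitted: the shortened response candidates.
short = [fill_predicate_template(q) for q in QUESTION_TYPES]

print(full)
print(short)
```

Leaving the optional slots empty yields the short forms such as 誰が？, which no longer mention the recognized predicate at all.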

In step 102, the response generation unit 14 determines whether response candidates have been generated for all the speech recognition candidates. If not, the process returns to step 100, and response candidates are generated for an unprocessed speech recognition candidate in the same manner as above. If it is determined in step 102 that response candidates have been generated for all the speech recognition candidates, the routine ends.

As a result, response candidates are generated for each speech recognition candidate as shown in FIG. 8. For example, for the speech recognition candidate "I opened it" (開けたよ), "Who?" (誰が？), "What?" (何を？), and "Why?" (どうして？) are generated from the base form of the predicate, "open" (開ける). For the speech recognition candidate "I kicked it" (蹴ったよ), "Who?" (誰が？) and "What?" (何を？) are generated from the base form of the predicate, "kick" (蹴る).
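The loop of steps 100 and 102 can be sketched as follows. This is a minimal sketch, not the routine of FIG. 3 itself: the names `QUESTION_INFO` and `generate_response_candidates` are hypothetical, and the table holds only the two entries quoted in the text, since FIG. 4 and FIG. 8 are not reproduced here.

```python
# Question information: base form of the recognition candidate -> question
# types. Only the two entries quoted in the text are included.
QUESTION_INFO = {
    "開ける": ["誰が", "何を", "どうして"],  # "open"
    "蹴る": ["誰が", "何を"],                # "kick"
}


def generate_response_candidates(base_forms, question_info):
    """Generate response candidates for every recognition candidate.

    Each loop iteration corresponds to step 100; the loop ending when all
    candidates are processed corresponds to the check in step 102.
    """
    responses = {}
    for base_form in base_forms:
        question_types = question_info.get(base_form, [])
        # Short template: question type plus question mark only.
        responses[base_form] = [q + "？" for q in question_types]
    return responses


responses = generate_response_candidates(["開ける", "蹴る"], QUESTION_INFO)
for base_form, candidates in responses.items():
    print(base_form, candidates)
```

Keeping the result keyed by recognition candidate preserves the information needed later to count how many recognition candidates each response candidate corresponds to.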

The response selection unit 16 selects an appropriate response candidate from the response candidates generated by the response generation unit 14.

FIG. 9 is a flowchart showing the flow of the response candidate selection routine performed by the response selection unit 16.

In step 120, the response selection unit 16 counts the number of generations for every response candidate. The number of generations here means the number of speech recognition candidates to which each response candidate corresponds. In the present embodiment, as shown in FIG. 8, six types of response candidates, "Who?" (誰が？), "What?" as subject (何が？), "What?" as object (何を？), "To what?" (何に？), "Why?" (どうして？), and "When?" (いつ？), have been generated for five speech recognition candidates. The number of recognition candidates to which each response candidate corresponds is as follows.

"Who?" (誰が？) = 3 speech recognition candidates
"What?" as subject (何が？) = 2 speech recognition candidates
"What?" as object (何を？) = 4 speech recognition candidates
"To what?" (何に？) = 1 speech recognition candidate
"When?" (いつ？) = 2 speech recognition candidates

In step 122, the response selection unit 16 selects the response candidate with the largest number of generations (here, "What?" as object (何を？)).

In step 124, the response selection unit 16 determines whether there is more than one response candidate with the largest number of generations. If there is more than one, in step 126 one of them is selected at random and output as the final selection result. If there is only one, the response candidate selected in step 122 is output as the final selection result as is (see also system utterance 2 in dialogue example 1 above).
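Steps 120 to 126 can be sketched as follows. This is a sketch under stated assumptions rather than the routine of FIG. 9 itself: the per-candidate table `RESPONSES` is reconstructed from the examples quoted in the text because FIG. 8 is not reproduced here, and the function names are hypothetical.

```python
import random
from collections import Counter

# Response candidates per recognition candidate (base form), reconstructed
# from the examples quoted in the text; treat this table as an assumption.
RESPONSES = {
    "開ける": ["誰が？", "何を？", "どうして？"],
    "蹴る": ["誰が？", "何を？"],
    "掛ける": ["誰が？", "何を？", "何に？", "いつ？"],
    "避ける": ["何が？", "何を？"],
    "炊ける": ["何が？", "いつ？"],
}


def count_generations(responses):
    """Step 120: count the recognition candidates behind each response."""
    counts = Counter()
    for candidates in responses.values():
        counts.update(set(candidates))  # one vote per recognition candidate
    return counts


def select_by_count(counts, rng=random):
    """Steps 122 to 126: take the largest count, breaking ties at random."""
    best = max(counts.values())
    tied = [response for response, n in counts.items() if n == best]
    return rng.choice(tied)


counts = count_generations(RESPONSES)
selected = select_by_count(counts)
print(selected, counts[selected])
```

With this table, 何を？ is backed by four of the five recognition candidates, so it is selected without needing the random tie-break.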

Here, the number of recognition candidates is counted and a response candidate common to many recognition candidates is selected from the result, but the response candidate may instead be selected using the reliability of the speech recognition. FIG. 10 is a flowchart showing the flow of the response candidate selection routine when the response candidate is selected in consideration of the reliability of the speech recognition.

In step 140, the response selection unit 16 calculates, for every response candidate, the appropriateness of the response to the user utterance. The appropriateness of a response is calculated from the reliability of the recognition candidates by Formula 1 below.

[Formula 1]
Appropriateness of response candidate X = Σ {recognition reliability of each recognition candidate corresponding to response candidate X}

Calculating the appropriateness of each response candidate by Formula 1 with the speech recognition candidates and reliabilities shown in FIG. 2 gives the following.

"Who?" (誰が？) = reliability of "open" (開ける) 0.1 + reliability of "kick" (蹴る) 0.2 + reliability of "hang" (掛ける) 0.15 = 0.45
"What?" as subject (何が？) = reliability of "avoid" (避ける) 0.1 + reliability of "be cooked" (炊ける) 0.9 = 1.0
"What?" as object (何を？) = reliability of "open" (開ける) 0.1 + reliability of "kick" (蹴る) 0.2 + reliability of "hang" (掛ける) 0.15 + reliability of "avoid" (避ける) 0.1 = 0.55
"To what?" (何に？) = reliability of "hang" (掛ける) 0.15 = 0.15
"When?" (いつ？) = reliability of "hang" (掛ける) 0.15 + reliability of "be cooked" (炊ける) 0.9 = 1.05

In step 142, the response selection unit 16 selects the response candidate with the highest appropriateness score (here, "When?" (いつ？)).

In step 144, the response selection unit 16 determines whether there is more than one response candidate with the highest appropriateness score. If there is more than one, in step 146 one of them is selected at random and output as the final selection result. If there is only one, the response candidate selected in step 142 is output as the final selection result as is.
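Steps 140 to 146 can be sketched as follows. This is a sketch under stated assumptions rather than the routine of FIG. 10 itself: the reliabilities are the ones quoted in the worked example, the per-candidate table `RESPONSES` is reconstructed from the text because FIG. 8 is not reproduced here, and the function names are hypothetical.

```python
import math
import random

# Reliability of each recognition candidate (base form), as quoted in the
# worked example in the text.
RELIABILITY = {"開ける": 0.1, "蹴る": 0.2, "掛ける": 0.15,
               "避ける": 0.1, "炊ける": 0.9}

# Response candidates per recognition candidate, reconstructed from the
# examples in the text; treat this table as an assumption.
RESPONSES = {
    "開ける": ["誰が？", "何を？", "どうして？"],
    "蹴る": ["誰が？", "何を？"],
    "掛ける": ["誰が？", "何を？", "何に？", "いつ？"],
    "避ける": ["何が？", "何を？"],
    "炊ける": ["何が？", "いつ？"],
}


def appropriateness(responses, reliability):
    """Step 140 (Formula 1): for each response candidate, sum the reliability
    of every recognition candidate it corresponds to."""
    scores = {}
    for base_form, candidates in responses.items():
        for response in set(candidates):
            scores[response] = scores.get(response, 0.0) + reliability[base_form]
    return scores


def select_by_appropriateness(scores, rng=random):
    """Steps 142 to 146: take the top score, breaking ties at random."""
    best = max(scores.values())
    tied = [r for r, s in scores.items() if math.isclose(s, best)]
    return rng.choice(tied)


scores = appropriateness(RESPONSES, RELIABILITY)
selected = select_by_appropriateness(scores)
print(selected, round(scores[selected], 2))
```

In this example a single high-reliability recognition candidate (炊ける) outweighs several low-reliability ones, so いつ？ is selected even though it corresponds to only two recognition candidates.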

The output unit 18 displays a character image representing the response candidate selected in this way on the display, or synthesizes the selected response candidate into speech and outputs it through the speaker.

If the number of generations or the appropriateness score is the same for all the response candidates generated by the response generation unit 14, the apparatus may, instead of selecting from those candidates, output a simple back-channel response (for example, "That's right" (そうだね)).
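This fallback can be sketched as follows. The function name `choose_or_backchannel` is hypothetical, and two points are assumptions of this sketch rather than statements of the text: the fallback is applied only when there are at least two response candidates, and the random tie-break among top candidates is left out for brevity.

```python
def choose_or_backchannel(scores, backchannel="そうだね"):
    """Return a back-channel response when every response candidate has the
    same count or score; otherwise return the highest-scoring candidate."""
    values = list(scores.values())
    if len(values) > 1 and len(set(values)) == 1:
        # No candidate stands out, so fall back to a neutral reply.
        return backchannel
    return max(scores, key=scores.get)


print(choose_or_backchannel({"誰が？": 2, "何を？": 2}))  # all equal
print(choose_or_backchannel({"誰が？": 3, "何を？": 4}))  # clear winner
```

The same function works for generation counts and for appropriateness scores, although floating-point scores would call for a tolerance-based comparison instead of exact equality.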

As described above, a response candidate is generated for each of a plurality of speech recognition candidates, and from the generated response candidates either the one common to many recognition candidates or the one whose value, calculated from the reliability of the recognition candidates, indicates the highest appropriateness is selected and output. Therefore, even if the speech recognition candidates include a misrecognition, the dialogue can proceed without breaking down.

Conventionally, when a speech recognition error occurred, a response was generated after the misrecognition was corrected. However, because the accuracy of the correction was insufficient, the misrecognition could not always be corrected properly, and the correction took processing time. By generating a response as described above, an appropriate response can be output without correcting the speech recognition error, even if the speech recognition candidates include a misrecognition.

Also, conventionally, a response was generated using a plurality of speech recognition candidates together, so a response that contradicted some of the speech recognition candidates was sometimes selected. As described above, however, the response candidate common to many speech recognition results, or the one with a high appropriateness score, is selected and output, so a response with few contradictions to each speech recognition candidate can be generated.

The response generation method of the response generation unit 14 described above is only an example, and the method is not limited to the one described in the present embodiment. For example, the technique described in JP 2007-206888 A or the technique described in JP 2006-201870 A may be used.

Specifically, the technique described in JP 2007-206888 A extracts an analyzed predicate and its corresponding case elements and generates a response for confirming the predicate and case elements. The technique described in JP 2006-201870 A stores utterance information for responses according to rules made up of event-event, event-evaluation, and evaluation-evaluation combinations, and generates a response utterance based on the speech recognition result and the stored rules. The event-event and event-evaluation combinations are formed, for example, by causal or temporal relationships. An evaluation is formed, for example, by the emotion that the user utterance conveys.

As a result, various types of response candidates are generated, for example, response candidates for confirming a word recognized from the user utterance, response candidates that evaluate an event contained in the user utterance with an expression that includes an emotion word, and response candidates that represent an event caused by an event contained in the user utterance.

[Second Embodiment]

FIG. 11 is a block diagram showing a schematic configuration of the response generation device 20 according to the second embodiment. The response generation device 20 includes an input unit 12, a response generation unit 14, a response candidate aggregation unit 15, a response selection unit 16, and an output unit 18. In FIG. 11, parts that are the same as or equivalent to those in FIG. 1 are given the same reference numerals, and their description is omitted.

The response candidate aggregation unit 15 includes a thesaurus. For each of the response candidates generated by the response generation unit 14, it uses the thesaurus to replace a term contained in the response candidate with another term corresponding to its superordinate concept, and regenerates the response candidate. As shown in FIG. 12, a thesaurus is a dictionary in which terms are classified hierarchically by synonymy, semantic similarity, inclusion, and similar relationships. Regenerating response candidates by replacing terms results in multiple response candidates being merged into response candidates with the same content. In the following, regenerating response candidates after their generation, by term replacement or similar means, is therefore referred to as "aggregation".

From the response candidates aggregated by the response candidate aggregation unit 15, the response selection unit 16 selects the response that can cover the larger number of recognition candidates, or alternatively selects a response after also taking the reliability of the speech recognition into account.
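The reliability-aware selection mentioned above can be sketched as follows. This is a minimal illustration and not the routine of the first embodiment: the function name `select_by_confidence`, the reliability values, and the choice of summing reliabilities per response are all assumptions made for this example.

```python
from collections import defaultdict

def select_by_confidence(scored):
    """Return the response whose supporting recognition candidates have the
    largest total reliability.

    `scored` is a list of (response, reliability) pairs, one pair per
    recognition candidate. Summing the reliabilities is an assumption made
    for this sketch; the text does not fix the scoring formula here.
    """
    totals = defaultdict(float)
    for response, reliability in scored:
        totals[response] += reliability
    return max(totals, key=totals.get)

# Hypothetical reliabilities, invented for illustration only.
scored = [
    ("What kind of school is it?", 0.4),
    ("What kind of school is it?", 0.25),
    ("Where is the storage location?", 0.35),
]
chosen = select_by_confidence(scored)
print(chosen)
```

With these made-up values the two candidates that share a response outweigh the single candidate with the higher individual reliability, which is the situation aggregation is meant to exploit.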

The response generation device 20 configured as described above can realize, for example, the following dialogue example 2 with the user.

(Dialogue example 2)
User utterance 1: I took a break at an elementary school during the excursion.
System utterance 2: What kind of school is it?
User utterance 3: It was a big school, and the toilets were clean too.
System utterance 4: I see.

The flow of response generation in the response generation device 20 is described below.

First, the input unit 12 of the response generation device 20 performs speech recognition on a user utterance, such as user utterance 1 of dialogue example 2 above, and inputs multiple speech recognition candidates to the response generation unit 14. The subsequent processing is described more concretely below, taking as an example the case shown in FIG. 13, in which three speech recognition results, "elementary school", "junior high school", and "parking lot", are input to the response generation unit 14 together with their reliabilities.

As described in the first embodiment, the response generation unit 14 generates a response candidate for each of the speech recognition candidates in FIG. 13. As a result, as shown in FIGS. 14(A) and 14(B), the response candidate "What kind of elementary school is it?" is generated from the recognition candidate "elementary school", "What kind of junior high school is it?" from the recognition candidate "junior high school", and "Where is the parking lot?" from the recognition candidate "parking lot".

Using a thesaurus such as the one shown in FIG. 12, the response candidate aggregation unit 15 replaces a word contained in each response candidate generated by the response generation unit 14 with the word corresponding to its superordinate concept in the thesaurus. In FIG. 12, the word corresponding to the superordinate concept of the speech recognition candidates "elementary school" and "junior high school" shown in FIG. 13 is "school", and the word corresponding to the superordinate concept of the speech recognition candidate "parking lot" is "storage location". Accordingly, as shown in FIG. 14(C), "What kind of elementary school is it?" is rephrased as "What kind of school is it?", "What kind of junior high school is it?" is rephrased as "What kind of school is it?", and "Where is the parking lot?" is rephrased as "Where is the storage location?". How far a term is generalized is not particularly limited; for example, it may be decided according to the hierarchical level of the speech recognition candidate or the type of the recognized term.

The response selection unit 16 selects an appropriate response candidate from the response candidates aggregated by the response candidate aggregation unit 15. The selection method is the same as in the first embodiment. For example, when the processing routine of FIG. 9 is executed to select from the response candidates shown in FIG. 14(C), "What kind of school is it?" is selected and the response is fixed (see also system utterance 2 of dialogue example 2 above).
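The thesaurus-based aggregation and the subsequent selection can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary `HYPERNYM` is a hypothetical stand-in for the thesaurus of FIG. 12, the English template strings are invented for the example, and the count-based selection is only a simplified reading of "the response candidate with the largest number of corresponding recognition candidates", not the routine of FIG. 9 itself.

```python
from collections import Counter

# Hypothetical thesaurus fragment: each term maps to its superordinate term.
HYPERNYM = {
    "elementary school": "school",
    "junior high school": "school",
    "parking lot": "storage location",
}

def aggregate_by_thesaurus(candidates):
    """Regenerate each response with its term replaced by the superordinate term.

    `candidates` is a list of (term, template) pairs; terms missing from the
    thesaurus are left unchanged.
    """
    return [template.format(HYPERNYM.get(term, term)) for term, template in candidates]

def select_most_common(responses):
    """Pick the response shared by the largest number of recognition candidates."""
    return Counter(responses).most_common(1)[0][0]

candidates = [
    ("elementary school", "What kind of {} is it?"),
    ("junior high school", "What kind of {} is it?"),
    ("parking lot", "Where is the {}?"),
]
aggregated = aggregate_by_thesaurus(candidates)
chosen = select_most_common(aggregated)
print(aggregated)
print(chosen)
```

Before aggregation all three responses differ, so a count-based selection has nothing to prefer. After aggregation two of them coincide, and that shared response is selected.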

In the example described above, the response candidate aggregation unit 15 uses the thesaurus to regenerate each response candidate generated by the response generation unit 14 by replacing a word it contains with the word corresponding to its superordinate concept, and an appropriate response candidate is then selected from the aggregated candidates. The invention is not limited to this. For example, the response candidates may be aggregated by keeping, among the terms they contain, only the part that is common to multiple response candidates, and an appropriate response candidate may then be selected from the aggregated candidates.

For example, as shown in FIGS. 15(A) and 15(B), the response generation unit 14 generates the response candidate "What kind of elementary school is it?" from the recognition candidate "elementary school", "What kind of junior high school is it?" from the recognition candidate "junior high school", and "Where is the parking lot?" from the recognition candidate "parking lot".

The response candidate aggregation unit 15 compares all the response candidates generated by the response generation unit 14, keeps only the part of the terms in each response candidate that is common to multiple response candidates, and deletes everything other than the common part, thereby aggregating the response candidates as shown in FIG. 15(C). Here, because the word 学校 ("school") is common to two of the response candidates, the differing parts (小 in 小学校, "elementary school", and 中 in 中学校, "junior high school") are deleted, and the response candidate "What kind of school is it?" is generated. The response candidate "Where is the parking lot?" has no part in common with the other response candidates, so it is not rephrased. Note that parts generated in common by the template, such as sentence-final particles, are not regarded as common parts.

The response selection unit 16 selects an appropriate response from the aggregated response candidates. For example, when the processing routine of FIG. 9 is executed to select from the response candidates shown in FIG. 15(C), "What kind of school is it?" is selected.
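The common-part aggregation can be sketched as follows. This is an illustrative simplification, not the procedure of FIG. 15: it works on space-separated English words rather than on Japanese characters, and it compares only the slot terms so that template-generated text is never counted as a common part. The function name and the template strings are assumptions.

```python
from collections import Counter

def aggregate_by_common_part(candidates):
    """Keep only the words of each slot term that are shared with another term.

    `candidates` is a list of (term, template) pairs. Only the terms are
    compared, so text produced by the template is never treated as common.
    """
    # Count in how many terms each word occurs (once per term).
    word_counts = Counter(w for term, _ in candidates for w in set(term.split()))
    shared = {w for w, n in word_counts.items() if n >= 2}

    result = []
    for term, template in candidates:
        kept = [w for w in term.split() if w in shared]
        # A term with no shared word is left as it is (no rephrasing).
        new_term = " ".join(kept) if kept else term
        result.append(template.format(new_term))
    return result

candidates = [
    ("elementary school", "What kind of {} is it?"),
    ("junior high school", "What kind of {} is it?"),
    ("parking lot", "Where is the {}?"),
]
aggregated = aggregate_by_common_part(candidates)
print(aggregated)
```

Unlike the thesaurus variant, this approach needs no external dictionary, but it can only merge candidates whose surface forms already overlap.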

The response candidates may also be regenerated as response candidates that express a concept inferred from a word common to the response candidates.

For example, suppose that "100 m" and "200 m" shown in FIG. 16(A) are the speech recognition candidates, and that the response generation unit 14 generates the response candidates "100 m?" and "200 m?" as shown in FIG. 16(B). Here, the part common to the response candidates is "m (meters)".

If inference rules such as those shown in FIG. 16(D) have been stored in a database in advance, the response candidate aggregation unit 15 refers to that database and infers the concept "length" from "m (meters)", which is common to the two speech recognition results. Accordingly, as shown in FIG. 16(C), the response candidate aggregation unit 15 can generate the response candidate "That long?", which expresses the inferred concept.
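The rule lookup described above can be sketched as follows. This is a minimal illustration: the table `INFERENCE_RULES` is a hypothetical stand-in for the database of FIG. 16(D), and the parsing of a quantity into a number and a unit is an assumption made for this example.

```python
import re

# Hypothetical inference rules: common unit -> (inferred concept, response text).
INFERENCE_RULES = {
    "m": ("length", "That long?"),
}

def infer_response(recognized):
    """Return a response for the concept inferred from the shared unit, or None.

    Every recognition candidate must be a number followed by a unit, and all
    candidates must share the same unit; otherwise no inference is made.
    """
    units = set()
    for text in recognized:
        match = re.fullmatch(r"\d+(?:\.\d+)?\s*([A-Za-z]+)", text)
        if match is None:
            return None
        units.add(match.group(1))
    if len(units) != 1:
        return None
    rule = INFERENCE_RULES.get(units.pop())
    return rule[1] if rule else None

response = infer_response(["100m", "200m"])
print(response)
```

Because the response depends only on the shared unit, it remains valid whichever of the two numeric candidates the user actually said.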

Each response candidate may also be regenerated as a response candidate that expresses an emotion.

For example, suppose that, as shown in FIGS. 17(A) and 17(B), the response generation unit 14 generates the response candidate "Are you going to elementary school?" from the recognition candidate "elementary school", "Are you going to junior high school?" from the recognition candidate "junior high school", and "Are you going to a college-preparatory school?" from the recognition candidate "college-preparatory school".

The response candidate aggregation unit 15 regenerates each response candidate generated by the response generation unit 14 as a response candidate that expresses the emotion the original candidate conveys. For example, according to predetermined rules, it determines from the nouns, predicates, and other elements contained in each response candidate whether the candidate corresponds to a positive (desirable) emotion or a negative (undesirable) emotion. Response candidates judged to correspond to a positive emotion are rephrased as a common response candidate expressing a positive emotion. Response candidates judged to correspond to a negative emotion are converted into a common response candidate expressing a negative emotion.

Positive emotion words (fun, beautiful, laugh, and so on) and negative emotion words (unpleasant, cry, tough, and so on) are registered in advance in a predetermined storage means. The response candidate aggregation unit 15 selects an emotion word expressing a positive emotion from this storage means and uses it to generate a response candidate that expresses the emotion. FIG. 17(C) shows a conversion example in which each response candidate is rephrased as a response candidate expressing an emotion.

The response selection unit 16 selects an appropriate response from the aggregated response candidates. For example, when the processing routine of FIG. 9 is executed to select from the response candidates shown in FIG. 17(C), "That must be tough, right?" is selected.
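The emotion-based aggregation can be sketched as follows. This is only an illustration of the idea: the polarity lexicon, the example sentences, and the two response strings are all hypothetical and do not reproduce the rules or the conversion table of FIG. 17.

```python
# Hypothetical polarity rules for words that may occur in a response candidate.
POLARITY = {
    "picnic": "positive",
    "exam": "negative",
    "homework": "negative",
}

# Hypothetical shared responses, one per polarity.
EMOTION_RESPONSE = {
    "positive": "That sounds fun!",
    "negative": "That must be tough.",
}

def aggregate_by_emotion(candidates):
    """Replace each response with the shared response for its polarity.

    A candidate containing no word from the polarity rules is left unchanged.
    """
    aggregated = []
    for response in candidates:
        words = response.lower().rstrip("?.!").split()
        polarity = next((POLARITY[w] for w in words if w in POLARITY), None)
        aggregated.append(EMOTION_RESPONSE.get(polarity, response))
    return aggregated

candidates = [
    "Do you have an exam?",
    "Do you have homework?",
    "Are you going on a picnic?",
]
aggregated = aggregate_by_emotion(candidates)
print(aggregated)
```

Candidates that share a polarity collapse into one response, so the count-based or reliability-based selection can then prefer the emotion supported by most recognition candidates.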

Because the generated response candidates are aggregated in this way and an appropriate response candidate is selected from the aggregation result, the second embodiment provides the same effects as the first embodiment, can output a response with even less contradiction with respect to the individual recognition candidates, and reduces erroneous responses.

FIG. 1 is a block diagram showing a schematic configuration of the response generation device according to the first embodiment.
FIG. 2 is a diagram showing an example of a speech recognition result of the input unit.
FIG. 3 is a flowchart showing the flow of the response candidate generation processing routine performed by the response generation unit.
FIG. 4 is a diagram showing a specific example of question information.
FIG. 5 is a diagram showing a specific example of response templates.
FIG. 6 is a diagram showing an example of a table in which tense final particles and interrogative final particles are registered.
FIG. 7 is a diagram showing a specific example of response candidates generated by the response generation unit.
FIG. 8 is a table showing the result when the response generation unit of the first embodiment generates response candidates for the speech recognition candidates shown in FIG. 2.
FIG. 9 is a flowchart showing the flow of the response candidate selection processing routine performed by the response selection unit.
FIG. 10 is a flowchart showing the flow of the response candidate selection processing routine when a response candidate is selected in consideration of the reliability of speech recognition.
FIG. 11 is a block diagram showing a schematic configuration of the response generation device according to the second embodiment.
FIG. 12 is a diagram showing a specific example of a thesaurus.
FIG. 13 is a diagram showing an example of a speech recognition result of the input unit.
FIG. 14 is an explanatory diagram of a specific example in which response candidates are aggregated using a thesaurus.
FIG. 15 is an explanatory diagram of a specific example in which response candidates are aggregated by deleting, from the terms they contain, everything other than the part common to multiple response candidates.
FIG. 16 is an explanatory diagram of a specific example in which response candidates are rephrased as a response candidate expressing a concept inferred from a word common to multiple response candidates.
FIG. 17 is an explanatory diagram of a specific example in which response candidates are rephrased as response candidates expressing the emotion that each candidate conveys.

Explanation of symbols

10 Response generation device
12 Input unit
14 Response generation unit
15 Response candidate aggregation unit
16 Response selection unit
18 Output unit
20 Response generation device

Claims

A response generation device comprising:
recognition means for recognizing a user's utterance and generating recognition candidates;
generation means for generating a response candidate corresponding to each recognition candidate generated by the recognition means;
selection means for selecting, from the response candidates generated by the generation means, the response candidate having the largest number of corresponding recognition candidates; and
response output means for outputting the response candidate selected by the selection means.

A response generation device comprising:
recognition means for recognizing a user's utterance and generating recognition candidates;
generation means for generating a response candidate corresponding to each recognition candidate generated by the recognition means;
selection means for calculating, for each response candidate generated by the generation means, a numerical value indicating the appropriateness of the response to the user's utterance based on the recognition reliability of the recognition candidate, and for selecting, from the generated response candidates, the response candidate that the calculated numerical value indicates to be the most appropriate; and
response output means for outputting the response candidate selected by the selection means.

The response generation device according to claim 1, wherein the generation means includes:
first generation means for generating a response candidate corresponding to each recognition candidate generated by the recognition means; and
second generation means for regenerating the response candidate by replacing a term included in the response candidate generated by the first generation means with a term corresponding to its superordinate concept.

The response generation device according to claim 3, wherein the second generation means uses a thesaurus to regenerate the response candidate by replacing a term included in the response candidate generated by the first generation means with a term corresponding to its superordinate concept.

The response generation device according to claim 1, wherein the generation means includes:
first generation means for generating a response candidate corresponding to each recognition candidate generated by the recognition means; and
second generation means for, when the response candidates generated by the first generation means include a plurality of response candidates that can be aggregated, regenerating the plurality of response candidates as response candidates having the same content, in which the contents of the plurality of response candidates are aggregated.

The response generation device according to claim 5, wherein the plurality of response candidates that can be aggregated are a plurality of response candidates that are partly identical, and
when the response candidates generated by the first generation means include a plurality of response candidates that are partly identical, the second generation means regenerates the plurality of response candidates as response candidates having the same content by deleting the parts other than the identical part.

The response generation device according to claim 5, wherein the plurality of response candidates that can be aggregated are a plurality of response candidates that are partly identical, and
when the response candidates generated by the first generation means include a plurality of response candidates that are partly identical, the second generation means regenerates the plurality of response candidates as response candidates having the same content, in which a concept inferred from the identical part is expressed.

The response generation device according to claim 5, wherein the plurality of response candidates that can be aggregated are a plurality of response candidates for which the concept inferred from at least one of a term included in the response candidate and the response candidate as a whole is the same, and
when the response candidates generated by the first generation means include a plurality of response candidates for which the concept inferred from at least one of a term included in the response candidate and the response candidate as a whole is the same, the second generation means regenerates the plurality of response candidates as response candidates having the same content, in which the inferred concept is expressed.

The response generation device according to claim 8, wherein the concept inferred from at least one of a term included in the response candidate and the response candidate as a whole is an emotion or a superordinate concept.

A program for causing a computer to function as:
recognition means for recognizing a user's utterance and generating recognition candidates;
generation means for generating a response candidate corresponding to each recognition candidate generated by the recognition means;
selection means for selecting, from the response candidates generated by the generation means, the response candidate having the largest number of corresponding recognition candidates; and
response output means for outputting the response candidate selected by the selection means.

A program for causing a computer to function as:
recognition means for recognizing a user's utterance and generating recognition candidates;
generation means for generating a response candidate corresponding to each recognition candidate generated by the recognition means;
selection means for calculating, for each response candidate generated by the generation means, a numerical value indicating the appropriateness of the response to the user's utterance based on the recognition reliability of the recognition candidate, and for selecting, from the generated response candidates, the response candidate that the calculated numerical value indicates to be the most appropriate; and
response output means for outputting the response candidate selected by the selection means.