JP2010224153A - Speech interaction device and program

Info

Publication number: JP2010224153A
Application number: JP2009070465A
Authority: JP
Inventors: Takakatsu Yoshimura; 貴克吉村; Kazuya Shimooka; 和也下岡; Ryoko Hotta; 良子堀田; Hiroyuki Hoshino; 博之星野; Yusuke Nakano; 雄介中野; Takatada Yamaguchi; 宇唯山口; Seisho Watabe; 生聖渡部
Original assignee: Toyota Motor Corp; Toyota Central R&D Labs Inc
Current assignee: Toyota Motor Corp; Toyota Central R&D Labs Inc
Priority date: 2009-03-23
Filing date: 2009-03-23
Publication date: 2010-10-07
Anticipated expiration: 2029-03-23
Also published as: JP4992925B2

Abstract

PROBLEM TO BE SOLVED: To generate an appropriate response to a user's utterance even when the speech recognition result or the emotion estimation result is incorrect. SOLUTION: The speech interaction device includes: a speech recognition unit 10 that recognizes speech uttered by the user, extracts the words included in the speech, and calculates the reliability of each word; an emotion estimation unit 20 that estimates the emotion of the user's speech by using the recognition result or prosodic information of the speech, and calculates the reliability of the estimated emotion; a response candidate generation unit 30 that generates response candidates by using the extracted words and a predetermined response template, and generates response candidates by using the estimated emotion and a predetermined response template; and a response candidate selection unit 40 that selects, from the response candidates generated by the response candidate generation unit 30, the response candidate based on the word or emotion whose reliability is the highest. COPYRIGHT: (C)2011,JPO&INPIT

Description

本発明は、音声対話装置及びプログラムに関する。 The present invention relates to a voice interaction apparatus and a program.

Conventionally, a dialogue processing apparatus has been proposed that conducts richly varied conversations depending on the user's emotional state (see, for example, Patent Document 1). The dialogue processing apparatus of Patent Document 1 estimates the user's emotion using the concepts of the words and phrases input by the user, prosodic information, the user's face image, and the user's physiological information, and generates an output sentence to be output to the user based on emotion information representing that emotion.

Patent Document 1: JP 2001-215993 A

特許文献１の対話処理装置は、語句の概念及び韻律情報等の多くの情報を用いてユーザの感情を推定している。しかし、ユーザの感情を確実に推定するのは非常に困難であり、間違った感情が推定される場合がある。しかし、特許文献１の対話処理装置は、間違った感情を推定しても、その感情の信頼度が分からないので、間違った感情推定結果に基づいて出力文を生成してしまう問題がある。 The dialogue processing apparatus of Patent Literature 1 estimates a user's emotion using a lot of information such as a phrase concept and prosodic information. However, it is very difficult to reliably estimate the user's emotion, and a wrong emotion may be estimated. However, since the dialogue processing apparatus of Patent Document 1 does not know the reliability of the emotion even if the wrong emotion is estimated, there is a problem that an output sentence is generated based on the wrong emotion estimation result.

また、ユーザの発話した音声に対する認識結果に基づいて応答を生成する公知技術があるが、誤認識があった場合は、ユーザの発話に対して誤った応答を生成してしまう問題がある。 In addition, there is a known technique for generating a response based on the recognition result for the speech uttered by the user. However, when there is a misrecognition, there is a problem that an incorrect response is generated for the user utterance.

The present invention has been proposed to solve the problems described above, and its object is to provide a voice interaction apparatus and a program that generate an appropriate response to a user's utterance even if the speech recognition result or the emotion estimation result is incorrect.

The voice interaction apparatus according to the present invention includes: speech recognition means for recognizing speech uttered by a user, extracting the words included in the speech, and calculating the reliability of each word included in the speech; emotion estimation means for estimating the emotion of the user's speech using the recognition result of the speech recognition means or prosodic information of the speech, and calculating the reliability of the estimated emotion; first response candidate generation means for generating response candidates using a word extracted by the speech recognition means and a predetermined response template; second response candidate generation means for generating response candidates using the emotion estimated by the emotion estimation means and a predetermined response template; and response candidate selection means for selecting, from among the response candidates generated by the first and second response candidate generation means, the response candidate based on the word or emotion having the highest reliability.

According to the above invention, response candidates are generated using the words extracted by the speech recognition means and a predetermined response template, response candidates are generated using the emotion estimated by the emotion estimation means and a predetermined response template, and the response candidate based on the word or emotion having the highest reliability is selected from among the generated response candidates. Thus, even if the speech recognition result or the emotion estimation result contains an error, the invention selects a response candidate that is not affected by that error, and can therefore generate an appropriate response to the user's utterance.

本発明は、音声認識結果又は感情推定結果に誤りがあったとしても、その誤りの影響のない応答を生成するので、ユーザの発話に対して適切な応答をすることができる。 Even if there is an error in the speech recognition result or the emotion estimation result, the present invention generates a response that is not affected by the error, so that an appropriate response can be made to the user's utterance.

FIG. 1 is a block diagram showing the configuration of a voice interaction apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing the response generation rules.
FIG. 3 is a flowchart showing the voice dialogue routine executed by the voice interaction apparatus.

以下、本発明の好ましい実施の形態について図面を参照しながら詳細に説明する。 Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a voice interaction apparatus according to an embodiment of the present invention. The voice interaction apparatus includes a speech recognition unit 10 that recognizes speech, a speech recognition history storage unit 11 that stores the history of results recognized by the speech recognition unit 10, an emotion estimation unit 20 that estimates the user's emotion based on the recognition result of the speech recognition unit 10, and an emotion estimation history storage unit 21 that stores the history of emotion estimation results of the emotion estimation unit 20.

The voice interaction apparatus further includes a response candidate generation unit 30 that generates response candidates, a response generation rule storage unit 31 that stores response generation rules, a response candidate selection unit 40 that selects a response candidate, and a response history storage unit 41 that stores the response history.

音声認識部１０は、ユーザの発話した音声の認識処理を行い、その音声に含まれる１つ以上の単語を認識し、各単語の信頼度を算出する。音声認識部１０で認識された単語及びその信頼度は、音声認識履歴格納部１１に逐次格納される。 The speech recognition unit 10 performs recognition processing of speech uttered by the user, recognizes one or more words included in the speech, and calculates the reliability of each word. The words recognized by the voice recognition unit 10 and their reliability are sequentially stored in the voice recognition history storage unit 11.

なお、信頼度の算出方法は、特に限定されるものではないが、例えば、文献１「２パス探索アルゴリズムにおける高速な単語事後確率に基づく信頼度算出法」李ら、２００３年１２月１９日、社団法人情報処理学会研究報告、に記載された技術を用いることができる。また、本実施形態では、信頼度は０〜１．０とし、最も高い信頼度は１．０である。 The reliability calculation method is not particularly limited. For example, Document 1 “Reliability calculation method based on fast word posterior probabilities in the two-pass search algorithm” Li et al., December 19, 2003, The technology described in the Information Processing Society of Japan Research Report can be used. In the present embodiment, the reliability is 0 to 1.0, and the highest reliability is 1.0.

When the speech recognition unit 10 recognizes the user's speech, the emotion estimation unit 20 estimates the emotion of the input speech based on the speech recognition result output from the speech recognition unit 10. The method of estimating emotion from the speech recognition result is not particularly limited. For example, the technique described in Document 2, "Emotion estimation based on an emotion-provoking-event corpus acquired from the Web", Tokuhisa et al., Proceedings of the 14th Annual Meeting of the Association for Natural Language Processing, pp. 33-36, March 2008, can be used.

When the speech recognition unit 10 does not recognize the user's speech, the emotion estimation unit 20 estimates the user's emotion using prosodic information (such as the fundamental frequency) contained in the user's speech. The method of estimating emotion from prosodic information is not particularly limited. For example, the technique described in JP 2002-91482 A (emotion detection method, emotion detection apparatus, and storage medium) can be used.

なお、感情推定部２０は、音声認識部１０で音声が認識された場合、音声認識結果を用いて感情を推定するだけでなく、韻律情報を用いて感情を推定してもよい。 When the voice recognition unit 10 recognizes a voice, the emotion estimation unit 20 may not only estimate the emotion using the voice recognition result but also may estimate the emotion using prosodic information.

When the speech recognition unit 10 obtains a speech recognition result with high reliability, the response candidate generation unit 30 generates response candidates based on the recognized words. When the emotion estimation unit 20 estimates an emotion, the response candidate generation unit 30 generates response candidates based on the estimated emotion.

The response candidate generation unit 30 also generates a response candidate when no speech recognition result is obtained and no emotion is estimated, and when no speech is input for a predetermined time (timeout). These response candidates are each generated according to the response generation rules stored in the response generation rule storage unit 31.

図２は、応答生成ルール格納部３１に格納されている応答生成ルールを示す図である。応答生成ルールは、入力と、その入力を用いて応答候補を生成するための応答テンプレートと、を対応付けたものである。 FIG. 2 is a diagram illustrating response generation rules stored in the response generation rule storage unit 31. The response generation rule associates an input with a response template for generating a response candidate using the input.

図２に示す［動詞］、［形容詞］、［名詞］は、音声認識部１０で認識された動詞、形容詞、名詞をそれぞれ示している。［感情：楽しい］、［感情：悲しい］は、感情推定部２０で推定されたユーザの感情をそれぞれ示している。［音声認識候補、感情推定結果なし］は、音声認識部１０で信頼度の高い音声認識結果（例えば信頼度が閾値を超える単語を含む音声認識結果）が得られず、かつ感情推定部２０で感情が推定されないことを示している。［タイムアウト］は、本装置の音声出力後、所定時間ユーザが発話しない場合を示している。 [Verb], [adjective], and [noun] illustrated in FIG. 2 respectively indicate a verb, an adjective, and a noun recognized by the speech recognition unit 10. [Emotion: Fun] and [Emotion: Sad] indicate the user's emotions estimated by the emotion estimation unit 20, respectively. “No voice recognition candidate, no emotion estimation result” means that the voice recognition unit 10 cannot obtain a voice recognition result with high reliability (for example, a voice recognition result including a word whose reliability exceeds a threshold), and the emotion estimation unit 20 It shows that emotion is not estimated. [Timeout] indicates a case where the user does not speak for a predetermined time after the sound is output from the apparatus.

According to FIG. 2, the response generation rules associate a verb with three response templates for generating response candidates using that verb: 「［動詞］したんだ。」 ("You [verb]-ed."), 「誰と［動詞］したの？」 ("Who did you [verb] with?"), and 「どこで［動詞］したの？」 ("Where did you [verb]?").

When the speech recognition unit 10 recognizes the verb 「食べる」 ("eat"), the verb is inflected into the appropriate form and inserted into the [verb] slot of each response template. As a result, three response candidates are generated: 「食べたんだ。」 ("You ate."), 「誰と食べたの？」 ("Who did you eat with?"), and 「どこで食べたの？」 ("Where did you eat?").

The response generation rules also associate an emotion, for example 「楽しい」 ("fun"), with two response templates for generating response candidates using that emotion: 「よかったね。」 ("That's great.") and 「楽しかったんだね。」 ("You had fun, didn't you?"). When the emotion estimation unit 20 estimates the emotion 「楽しい」, the two response candidates 「よかったね。」 and 「楽しかったんだね。」 are generated.

The response generation rules associate the character string 「こんにちは」 ("Hello") recognized by the speech recognition unit 10 with the response 「こんにちは」. Thus, when the speech recognition unit 10 recognizes the character string 「こんにちは」, 「こんにちは」 is generated as a response candidate.

The response generation rules associate the character string 「今日の天気は？」 ("What is today's weather?") recognized by the speech recognition unit 10 with 「今日の天気は［今日の天気情報］だよ」 ("Today's weather is [today's weather information]."). Thus, when the speech recognition unit 10 recognizes the character string 「今日の天気は？」, today's weather information (for example, 「曇り」, "cloudy") is acquired from an external source, and 「今日の天気は曇りだよ」 ("Today's weather is cloudy.") is generated as a response candidate.

また、応答生成ルールは、［音声認識候補、感情推定結果なし］と「もう一度言ってください。」とを対応付け、［タイムアウト］と「今日は天気がいいね。」とを対応付けている。よって、音声認識部１０で信頼度が高い音声認識結果が得られず、かつ感情推定部２０で感情が推定されない場合は、応答候補として「もう一度言ってください。」が生成され、タイムアウトの場合は、応答候補として「今日は天気がいいね。」が生成される。 The response generation rule associates [speech recognition candidate, no emotion estimation result] with “Please say again.”, And associates [timeout] with “the weather is good today”. Therefore, when the speech recognition unit 10 cannot obtain a highly reliable speech recognition result and the emotion estimation unit 20 does not estimate the emotion, “Please say again” is generated as a response candidate. , “Today's weather is good” is generated as a response candidate.
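The rule table described above can be pictured as a mapping from an input key to a list of response templates. The following is a minimal sketch only: the key names, the `{slot}` placeholder, and the function `generate_candidates` are assumptions made for illustration, and the verb templates are simplified so that the caller passes an already inflected form such as 「食べた」 instead of performing the inflection described in the text.

```python
# Hypothetical rule table modelled on the FIG. 2 description.
# Key names and the {slot} placeholder are illustrative, not from the patent.
RULES = {
    "[verb]": ["{slot}んだ。", "誰と{slot}の？", "どこで{slot}の？"],
    "[emotion: fun]": ["よかったね。", "楽しかったんだね。"],
    "こんにちは": ["こんにちは"],
    "今日の天気は？": ["今日の天気は{slot}だよ"],
    "[no recognition candidate, no emotion estimate]": ["もう一度言ってください。"],
    "[timeout]": ["今日は天気がいいね。"],
}


def generate_candidates(rule_key: str, slot: str = "") -> list[str]:
    """Fill every template registered for rule_key with the given slot value."""
    return [template.format(slot=slot) for template in RULES[rule_key]]


if __name__ == "__main__":
    print(generate_candidates("[verb]", "食べた"))
    print(generate_candidates("今日の天気は？", "曇り"))
    print(generate_candidates("[timeout]"))
```

Templates without a placeholder simply ignore the slot value, so fixed responses such as the timeout prompt share the same lookup path as the slot-filling rules.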

When the response candidate generation unit 30 generates response candidates based on the speech recognition result or response candidates based on an emotion, it calculates the reliability of each response candidate. The reliability of a response candidate is the same value as the reliability of the word contained in that response candidate, or as the reliability of the emotion on which that response candidate is based.

応答候補選択部４０は、応答履歴格納部４１の応答履歴を参照して、応答候補生成部３０で生成された１つ以上の応答候補の中から、過去に選択された応答候補を除外し、残りの応答候補の中から最も信頼度が高い応答候補を選択する。 The response candidate selection unit 40 refers to the response history of the response history storage unit 41, excludes response candidates selected in the past from one or more response candidates generated by the response candidate generation unit 30, The response candidate with the highest reliability is selected from the remaining response candidates.
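The selection behaviour described above (drop anything already used, then take the most reliable remaining candidate) might be sketched as follows. The `Candidate` type and the function name `select_response` are hypothetical, and returning `None` when nothing remains is an assumption standing in for the fallback handled elsewhere in the routine.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    text: str
    reliability: float  # 0.0 to 1.0, inherited from the underlying word or emotion


def select_response(candidates: list[Candidate], history: set[str]) -> Optional[Candidate]:
    """Exclude previously selected responses, then pick the most reliable one."""
    remaining = [c for c in candidates if c.text not in history]
    if not remaining:
        return None  # assumption: the caller falls back to a re-prompt
    # max() keeps the first candidate among equally reliable ones.
    return max(remaining, key=lambda c: c.reliability)


if __name__ == "__main__":
    cands = [
        Candidate("よかったね。", 1.0),
        Candidate("楽しかったんだね。", 1.0),
        Candidate("行ったんだ。", 0.8),
    ]
    print(select_response(cands, history={"よかったね。"}))
```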

When the user utters, for example, 「今日は、遊園地に行ったよ。」 ("I went to an amusement park today."), the voice interaction apparatus configured as described above executes the following voice dialogue routine.

図３は、音声対話装置により実行される音声対話ルーチンを示すフローチャートである。 FIG. 3 is a flowchart showing a voice dialogue routine executed by the voice dialogue apparatus.

ステップＳ１では、音声認識部１０は、本ルーチンの実行開始後又は本装置の音声再生後から所定時間が経過するまでユーザからの音声入力が有るか否かを判定する。そして、肯定判定の場合はステップＳ３に進み、否定判定の場合はタイムアウトと判定され、ステップＳ２に進む。 In step S1, the voice recognition unit 10 determines whether or not there is a voice input from the user until a predetermined time elapses after the execution of this routine is started or after the voice reproduction of the apparatus is performed. If the determination is affirmative, the process proceeds to step S3. If the determination is negative, it is determined that a timeout has occurred, and the process proceeds to step S2.

In step S2, the response candidate generation unit 30 generates, for example, a response that prompts the user to provide information. Specifically, in accordance with the response generation rules stored in the response generation rule storage unit 31, the response candidate generation unit 30 generates 「今日は天気がいいね。」 ("The weather is nice today."), which is associated with [timeout], as a response candidate. The process then proceeds to step S15.

In step S3, the speech recognition unit 10 performs speech recognition on the user's utterance 「今日は、遊園地に行ったよ。」 ("I went to an amusement park today."), extracts words from the input speech, and calculates the reliability of each word.

例えば本実施形態では、音声認識部１０は、音声認識処理の結果、次の認識候補１〜３を得る。 For example, in the present embodiment, the speech recognition unit 10 obtains the following recognition candidates 1 to 3 as a result of the speech recognition process.

Recognition candidate 1: 「今日は遊泳しに行ったよ。」 ("I went swimming today.")
Recognition candidate 2: 「今日は遊園地に行ったよ。」 ("I went to an amusement park today.")
Recognition candidate 3: 「今日は遊泳しにいたよ。」 ("I was swimming today.")

The speech recognition unit 10 then outputs recognition candidate 1, which has the maximum likelihood among recognition candidates 1 to 3, as the speech recognition result. The reliability of each word in the speech recognition result is calculated, for example, based on the technique of Document 1 mentioned above. As a result, in the present embodiment, a speech recognition result such as the following is obtained:
「今日（0.7）は（0.6）遊泳（0.2）し（0.4）に（0.8）行った（0.8）よ（1.0）。」
(Word by word: 今日 "today" 0.7, は [topic particle] 0.6, 遊泳 "swimming" 0.2, し "do" 0.4, に [particle] 0.8, 行った "went" 0.8, よ [sentence-final particle] 1.0.)

なお、括弧内の数字は、その直前（左側）にある単語の信頼度を示している。 The number in parentheses indicates the reliability of the word immediately before (left side).

ステップＳ４では、音声認識部１０は、認識された単語の中に信頼度が閾値（例えば０．５）より高い自立語があるかを判定する。そして、肯定判定の場合はステップＳ６に進み、否定判定の場合はステップＳ５へ進む。 In step S4, the speech recognition unit 10 determines whether there is an independent word whose reliability is higher than a threshold value (for example, 0.5) among the recognized words. If the determination is affirmative, the process proceeds to step S6. If the determination is negative, the process proceeds to step S5.

When the recognition result shown in step S3 is obtained, the words with a reliability greater than 0.5 include 「今日」 ("today") and 「行った」 ("went"), and both are independent words. In the present embodiment, therefore, the recognition result includes independent words whose reliability is higher than the threshold, and the process proceeds to step S6.
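The step S4 check can be illustrated with the word reliabilities from the example recognition result. This is a sketch under assumptions: the tuple layout and the function name `reliable_independent_words` are hypothetical, and the independent/function-word flags are filled in by hand here, whereas a real system would presumably take them from a morphological analyzer.

```python
# (surface form, reliability, is_independent_word)
# The flags are hand-assigned for illustration: は, に and よ are particles.
RECOGNIZED = [
    ("今日", 0.7, True),
    ("は", 0.6, False),
    ("遊泳", 0.2, True),
    ("し", 0.4, True),
    ("に", 0.8, False),
    ("行った", 0.8, True),
    ("よ", 1.0, False),
]


def reliable_independent_words(words, threshold=0.5):
    """Return independent words whose reliability is strictly above the threshold."""
    return [surface for surface, rel, independent in words
            if independent and rel > threshold]


if __name__ == "__main__":
    found = reliable_independent_words(RECOGNIZED)
    print(found)        # the words that pass the check
    print(bool(found))  # True corresponds to the affirmative branch of step S4
```

Particles such as 「よ」 are filtered out even though their reliability is high, which is why the check is on independent words and not on all recognized words.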

ステップＳ５では、感情推定部２０は、音声認識部１０の音声認識結果を使用できないので、音声認識部１０に入力された音声の韻律情報を用いてユーザの感情を推定する。ここでは、例えば特開２００２−９１４８２号公報に記載された技術が用いられる。なお、本実施形態では、感情として例えば「楽しい」が推定され、その信頼度は１．０とする。そして、ステップＳ６へ進む。 In step S5, since the emotion estimation unit 20 cannot use the speech recognition result of the speech recognition unit 10, the emotion estimation unit 20 estimates the user's emotion using the prosodic information of the speech input to the speech recognition unit 10. Here, for example, a technique described in JP-A-2002-91482 is used. In the present embodiment, for example, “fun” is estimated as the emotion, and the reliability is assumed to be 1.0. Then, the process proceeds to step S6.

In step S6, the response candidate generation unit 30 generates response candidates using the estimation result of the emotion estimation unit 20 in accordance with the response generation rules stored in the response generation rule storage unit 31, and calculates the reliability of each response candidate.

In the present embodiment, the response candidates associated with [emotion: fun] in the response generation rules, 「よかったね。」 ("That's great.") and 「楽しかったんだね。」 ("You had fun, didn't you?"), are generated, and the reliability of each is 1.0.

In step S7, the response candidate generation unit 30 generates response candidates using the speech recognition history stored in the speech recognition history storage unit 11. Here, the response candidate generation unit 30 searches the speech recognition history backward from the current time for an independent word with high reliability (for example, a reliability of 0.5 or more). The response candidate generation unit 30 then generates response candidates based on the word found and the response template associated with that word.
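The backward search over the recognition history in step S7 could look like the sketch below. The history layout (a list of utterances, oldest first, each a list of word tuples) and the function name `latest_reliable_word` are assumptions; the text does not say how words within a single utterance are ordered during the search, so scanning them from last to first is also an assumption.

```python
from typing import Optional

# Each utterance is a list of (surface form, reliability, is_independent_word).
Utterance = list[tuple[str, float, bool]]


def latest_reliable_word(history: list[Utterance], threshold: float = 0.5) -> Optional[str]:
    """Walk the history from newest to oldest and return the first reliable independent word."""
    for utterance in reversed(history):
        for surface, rel, independent in reversed(utterance):
            if independent and rel >= threshold:
                return surface
    return None  # nothing usable in the history


if __name__ == "__main__":
    history = [
        [("遊園地", 0.9, True), ("に", 0.8, False)],  # older utterance
        [("遊泳", 0.2, True), ("よ", 1.0, False)],    # most recent utterance
    ]
    print(latest_reliable_word(history))
```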

一方、ステップＳ８では、音声認識部１０は、ステップＳ３で示した音声認識結果を音声認識履歴格納部１１へ格納する。そして、ステップＳ９へ進む。 On the other hand, in step S8, the voice recognition unit 10 stores the voice recognition result shown in step S3 in the voice recognition history storage unit 11. Then, the process proceeds to step S9.

In step S9, the emotion estimation unit 20 estimates the emotion of each recognition candidate used in the speech recognition processing of step S3 by applying the technique described in Document 2 to each candidate. The emotion estimation unit 20 then takes the emotion that occurs most often among the recognition candidates as the user's emotion, and calculates the proportion of candidates with that emotion as its reliability.

本実施形態では、ステップＳ３で示した認識候補１〜３の感情は、例えば、すべて「楽しい」と推定される。この場合、「楽しい」は３候補中３つを占めているので、「楽しい」の信頼度は、３／３＝１．０となる。そして、ステップＳ１０へ進む。 In the present embodiment, the emotions of the recognition candidates 1 to 3 shown in step S3 are all estimated to be “fun”, for example. In this case, since “fun” accounts for three of the three candidates, the reliability of “fun” is 3/3 = 1.0. Then, the process proceeds to step S10.

なお、仮に、認識候補１〜３のうち２つの感情が「楽しい」であって残りの１つの感情が「悲しい」と推定された場合、ユーザの感情として「楽しい」が推定され、その信頼度は２／３＝０．６７となる。 If two emotions among the recognition candidates 1 to 3 are “fun” and the remaining emotion is estimated to be “sad”, “fun” is estimated as the user's emotion, and its reliability Is 2/3 = 0.67.
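The majority vote of step S9 and the two worked figures above (3/3 = 1.0 and 2/3 = 0.67) can be reproduced with a short sketch. The function name `estimate_emotion` and the English emotion labels are illustrative assumptions; the per-candidate emotions are taken as given, since the text defers their estimation to Document 2.

```python
from collections import Counter
from typing import Optional


def estimate_emotion(candidate_emotions: list[str]) -> tuple[Optional[str], float]:
    """Return the most frequent emotion and the share of candidates that carry it."""
    if not candidate_emotions:
        return None, 0.0  # assumption: no candidates means no estimate
    emotion, count = Counter(candidate_emotions).most_common(1)[0]
    return emotion, count / len(candidate_emotions)


if __name__ == "__main__":
    print(estimate_emotion(["fun", "fun", "fun"]))
    emotion, reliability = estimate_emotion(["fun", "fun", "sad"])
    print(emotion, round(reliability, 2))
```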

ステップＳ１０では、応答候補生成部３０は、応答生成ルール格納部３１に格納された応答生成ルールに従って、感情推定部２０の感情推定結果を用いて応答候補を生成すると共に、各応答候補の信頼度を算出する。 In step S10, the response candidate generation unit 30 generates a response candidate using the emotion estimation result of the emotion estimation unit 20 according to the response generation rule stored in the response generation rule storage unit 31, and the reliability of each response candidate. Is calculated.

本実施形態の場合では、応答生成ルールの［感情：楽しい］に対応付けられた応答候補、「よかったね。」、「楽しかったんだね。」が生成される。なお、これらの信頼度は共に１．０である。 In the case of the present embodiment, response candidates associated with [emotion: fun] of the response generation rule, “Good”, “It was fun” are generated. Both of these reliability levels are 1.0.

ステップＳ１１では、応答候補生成部３０は、音声認識部１０の音声認識結果を用いて応答候補を生成すると共に、各応答候補の信頼度を算出する。 In step S11, the response candidate generation unit 30 generates a response candidate using the speech recognition result of the speech recognition unit 10, and calculates the reliability of each response candidate.

例えば、本実施形態では、応答候補生成部３０は、ステップＳ３で示した音声認識結果の各単語を用いて応答候補を生成する。 For example, in the present embodiment, the response candidate generation unit 30 generates a response candidate using each word of the speech recognition result shown in step S3.

According to the response generation rules shown in FIG. 2, for the word 「今日」 ("today"), for example, the response candidates associated with [noun] are generated: 「今日？」 ("Today?"), 「どんな今日なの？」 ("What kind of today?"), and 「誰の今日なの？」 ("Whose today?"). Likewise, for 「行った」 ("went"), the response candidates associated with [verb] are generated: 「行ったんだ。」 ("You went."), 「誰と行ったの？」 ("Who did you go with?"), and 「どこで行ったの？」 ("Where did you go?").

Further, according to step S3, the reliability of 「今日」 is 0.7, so the reliability of each of the response candidates 「今日？」, 「どんな今日なの？」, and 「誰の今日なの？」 is 0.7. Similarly, the reliability of 「行った」 is 0.8, so the reliability of each of the response candidates 「行ったんだ。」, 「誰と行ったの？」, and 「どこで行ったの？」 is 0.8. The response candidate generation unit 30 generates response candidates in the same way for the other words contained in the speech recognition result. The process then proceeds to step S12.
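How each candidate inherits the reliability of its source word can be shown with the two words from this example. The template strings below are reconstructed from the candidates listed in the text; the dictionary keys, the `{w}` placeholder, and the function name `candidates_for` are assumptions, and the verb is passed in already inflected.

```python
# Templates inferred from the example candidates; names are illustrative.
TEMPLATES = {
    "noun": ["{w}？", "どんな{w}なの？", "誰の{w}なの？"],
    "verb_past": ["{w}んだ。", "誰と{w}の？", "どこで{w}の？"],
}


def candidates_for(word: str, pos: str, reliability: float) -> list[tuple[str, float]]:
    """Build candidates for one word; each carries the word's own reliability."""
    return [(template.format(w=word), reliability) for template in TEMPLATES[pos]]


if __name__ == "__main__":
    pool = candidates_for("今日", "noun", 0.7) + candidates_for("行った", "verb_past", 0.8)
    for text, rel in pool:
        print(rel, text)
```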

ステップＳ１２では、応答候補選択部４０は、応答候補生成部３０で生成された応答候補に対して、応答履歴格納部４１に格納されている応答履歴を用いて応答候補フィルタフィング処理を行う。具体的には、応答候補選択部４０は、応答候補生成部３０で生成された応答候補の中から、応答履歴として過去に選択されたことのある応答候補を除外する。これにより、過去と同じ応答をするのを回避することができる。そして、ステップＳ１３へ進む。 In step S 12, the response candidate selection unit 40 performs a response candidate filtering process on the response candidates generated by the response candidate generation unit 30 using the response history stored in the response history storage unit 41. Specifically, the response candidate selection unit 40 excludes response candidates that have been selected in the past as a response history from the response candidates generated by the response candidate generation unit 30. Thereby, it is possible to avoid the same response as in the past. Then, the process proceeds to step S13.

ステップＳ１３では、応答候補選択部４０は、応答候補が有るか、すなわち上述のステップＳ１２を経ても応答候補が残っているかを判定する。そして、肯定判定の場合はステップＳ１５へ進み、否定判定の場合はステップＳ１４へ進む。 In step S13, the response candidate selection unit 40 determines whether there is a response candidate, that is, whether a response candidate remains even after the above-described step S12. If the determination is affirmative, the process proceeds to step S15. If the determination is negative, the process proceeds to step S14.

In step S14, the response candidate generation unit 30 generates a response candidate that is either a back-channel response (aizuchi) or a prompt to repeat the input. Specifically, the response candidate generation unit 30 generates, as a response candidate, 「もう一度言ってください。」 ("Please say that again."), which is associated with [no speech recognition candidate, no emotion estimation result] in the response generation rules. Instead of this response candidate, the response candidate generation unit 30 may generate a back-channel response such as 「うんうん」 ("uh-huh") or 「そうだね」 ("that's right"). The process then proceeds to step S17.

一方、ステップＳ１５では、応答候補選択部４０は、既に生成されている応答候補の中から信頼度が最も高い応答候補を選択する。なお、信頼度が最も高い応答候補が複数存在する場合は、応答候補選択部４０は、予め定められた優先度に従って応答候補を選択してもよいし、ランダムに応答候補を選択してもよい。 On the other hand, in step S15, the response candidate selection unit 40 selects a response candidate having the highest reliability from the already generated response candidates. When there are a plurality of response candidates with the highest reliability, the response candidate selection unit 40 may select response candidates according to a predetermined priority, or may select response candidates at random. .
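The two tie-breaking options mentioned for step S15 (a predetermined priority, or a random choice) might be combined as in this sketch. The function name `pick_best`, the representation of priority as an ordered list of response texts, and the optional `rng` parameter are all assumptions; the candidate list is assumed to be non-empty, as step S13 has already checked.

```python
import random
from typing import Optional, Sequence


def pick_best(candidates: Sequence[tuple[str, float]],
              priority: Optional[Sequence[str]] = None,
              rng: Optional[random.Random] = None) -> str:
    """Pick the most reliable candidate; break ties by priority order, else at random."""
    best = max(rel for _, rel in candidates)
    tied = [text for text, rel in candidates if rel == best]
    if priority:
        for text in priority:
            if text in tied:
                return text
    return (rng or random).choice(tied)


if __name__ == "__main__":
    pool = [("よかったね。", 1.0), ("楽しかったんだね。", 1.0), ("行ったんだ。", 0.8)]
    print(pick_best(pool, priority=["楽しかったんだね。", "よかったね。"]))
    print(pick_best(pool))  # one of the two tied candidates, chosen at random
```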

In the present embodiment, the response candidates generated in step S10, 「よかったね。」 ("That's great.") and 「楽しかったんだね。」 ("You had fun, didn't you?"), have the highest reliability (= 1.0), so the response candidate selection unit 40 selects either 「よかったね。」 or 「楽しかったんだね。」. The process then proceeds to step S16.

ステップＳ１６では、応答候補選択部４０は、ステップＳ１５で選択した応答候補を応答履歴として応答履歴格納部４１に格納する。そして、ステップＳ１７へ進む。 In step S16, the response candidate selection unit 40 stores the response candidate selected in step S15 in the response history storage unit 41 as a response history. Then, the process proceeds to step S17.

In step S17, the response candidate selection unit 40 performs speech synthesis on the response candidate obtained in step S2, S14, or S16 and plays back the speech. The process then returns to step S1 and waits for the user's next utterance.

As described above, the voice interaction apparatus according to the embodiment of the present invention generates response candidates for the user's speech based on the speech recognition result, generates response candidates based on the emotion estimation result, and outputs the response candidate with the highest reliability from among them. Thus, even if some response candidates are generated from an erroneous speech recognition result or emotion estimation result, the voice interaction apparatus outputs the most reliable response candidate, which is unaffected by that error, so it can respond without erroneous responses and converse smoothly with the user.

なお、本発明は、上述した実施の形態に限定されるものではなく、特許請求の範囲に記載された範囲内で設計上の変更をされたものにも適用可能であるのは勿論である。 It should be noted that the present invention is not limited to the above-described embodiment, and it is needless to say that the present invention can also be applied to a design modified within the scope described in the claims.

例えば、感情推定部２０は、音声認識部１０において音声認識結果が得られた場合であっても、ユーザの音声の韻律情報を用いてユーザの感情を推定してもよい。この場合、感情推定部２０は、音声認識結果に基づくユーザの感情と、韻律情報に基づく感情と、が一致する場合に、その一致した感情を推定結果として出力すればよい。 For example, even when the speech recognition unit 10 obtains a speech recognition result, the emotion estimation unit 20 may estimate the user's emotion using prosodic information of the user's speech. In this case, when the user's emotion based on the speech recognition result matches the emotion based on the prosodic information, the emotion estimation unit 20 may output the matched emotion as an estimation result.
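The agreement rule in this variation can be written as a small function. The name `combine_emotions` is hypothetical, and returning `None` when the two estimates disagree is an assumption, since the text only specifies the behaviour when they match.

```python
from typing import Optional


def combine_emotions(text_emotion: Optional[str],
                     prosody_emotion: Optional[str]) -> Optional[str]:
    """Output an emotion only when the text-based and prosody-based estimates agree."""
    if text_emotion is not None and text_emotion == prosody_emotion:
        return text_emotion
    return None  # assumption: no estimate is output on disagreement


if __name__ == "__main__":
    print(combine_emotions("fun", "fun"))
    print(combine_emotions("fun", "sad"))
```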

なお、図１に示した音声対話装置は、コンピュータに対して、図３に示す音声対話ルーチンを実行するためのプログラムをインストールすることによって構成されたものでもよい。 The voice interaction apparatus shown in FIG. 1 may be configured by installing a program for executing the voice interaction routine shown in FIG. 3 in a computer.

10 speech recognition unit
11 speech recognition history storage unit
20 emotion estimation unit
30 response candidate generation unit
31 response generation rule storage unit
40 response candidate selection unit
41 response history storage unit

Claims

1. A voice interaction apparatus comprising:
speech recognition means for recognizing speech uttered by a user, extracting the words included in the speech, and calculating a reliability of each word included in the speech;
emotion estimation means for estimating an emotion of the user's speech using the recognition result of the speech recognition means or prosodic information of the speech, and calculating a reliability of the estimated emotion;
first response candidate generation means for generating a response candidate using a word extracted by the speech recognition means and a predetermined response template;
second response candidate generation means for generating a response candidate using the emotion estimated by the emotion estimation means and a predetermined response template; and
response candidate selection means for selecting, from among the response candidates generated by the first and second response candidate generation means, the response candidate based on the word or emotion having the highest reliability.

2. The voice interaction apparatus according to claim 1, wherein the speech recognition means extracts, from the words included in the speech, only words whose reliability is equal to or greater than a predetermined threshold.

3. The voice interaction apparatus according to claim 1 or 2, further comprising response history storage means for storing the response candidate selected by the response candidate selection means as a response history,
wherein the response candidate selection means excludes the response candidates included in the response history from the response candidates generated by the first and second response candidate generation means, and selects, from the remaining response candidates, the response candidate based on the word or emotion having the highest reliability.

4. A voice interaction program for causing a computer to function as each means of the voice interaction apparatus according to any one of claims 1 to 3.