JP2016038501A

JP2016038501A - Voice interactive method and voice interactive system

Info

Publication number: JP2016038501A
Application number: JP2014162579A
Authority: JP
Inventors: Tatsuya Kawahara; Seisho Watabe; Yusuke Nakano
Original assignee: Kyoto University; Toyota Motor Corp
Current assignee: Kyoto University; Toyota Motor Corp
Priority date: 2014-08-08
Filing date: 2014-08-08
Publication date: 2016-03-22
Anticipated expiration: 2034-08-08
Also published as: JP6270661B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice interactive method capable of generating a response that encourages the user to speak. SOLUTION: A voice interactive method includes: an input process (S1) for a user utterance; an extraction process (S2) for the prosodic features of the input user utterance; and generation processes (S4 to S6) for a response to the user utterance based on the extracted prosodic features. When the response is generated, its prosody is adjusted so that the prosodic features of the response match the prosodic features of the user utterance, which yields a response that encourages the user to speak. SELECTED DRAWING: Figure 2

Description

The present invention relates to a voice dialogue method and a voice dialogue system.

For voice dialogue systems and humanoid robots, there is a growing need to care for elderly people and for patients with conditions such as dementia, and an attentive-listening function is required. In attentive listening, it is important to give backchannel responses (aizuchi, short listener interjections) so that the user finds it easy to speak.

Patent Document 1 discloses a technique relating to a speech recognition device that can realize natural and smooth dialogue. The speech recognition device disclosed in Patent Document 1 estimates the backchannel timing, that is, the timing at which a backchannel sound is to be output from a loudspeaker during dialogue with a speaker, based on speech features of the speaker calculated from the speech signal input to a speech input unit. When the estimation indicates a backchannel timing, the device determines whether to output the backchannel sound based on the power immediately before that timing.

Patent Document 1: JP 2009-3040 A

However, the technique disclosed in Patent Document 1 focuses only on the timing of the backchannel, and the backchannel actually produced is always the same sound. In attentive listening it is important to give backchannels so that the user finds it easy to speak, but if the backchannel sound is always the same, it gives the user a mechanical impression and the user cannot feel that he or she is being listened to. As a result, there has been a problem that the user's speech is not encouraged.

In view of the above problem, an object of the present invention is to provide a voice dialogue method and a voice dialogue system capable of generating backchannels that encourage speech.

The voice dialogue method according to the present invention includes a step of inputting a user utterance, a step of extracting prosodic features of the input user utterance, and a step of generating a backchannel that responds to the user utterance based on the extracted prosodic features. When the backchannel is generated, its prosody is adjusted so that the prosodic features of the backchannel match the prosodic features of the user utterance.

The voice dialogue system according to the present invention includes an utterance input unit that inputs a user utterance, a prosodic feature extraction unit that extracts prosodic features of the user utterance input to the utterance input unit, and a backchannel generation unit that generates a backchannel responding to the user utterance based on the prosodic features extracted by the prosodic feature extraction unit. The backchannel generation unit adjusts the prosody of the backchannel so that the prosodic features of the backchannel match the prosodic features of the user utterance.

In the voice dialogue method and the voice dialogue system according to the present invention, prosodic features of the user utterance are extracted, and when a backchannel is generated, the prosody (speech waveform) of the backchannel is adjusted so that its prosodic features match those of the user utterance. Adjusting the prosody of the backchannel in this way suppresses the mechanical impression given to the user, lets the user feel listened to, and encourages the user to speak.

According to the present invention, it is possible to provide a voice dialogue method and a voice dialogue system capable of generating backchannels that encourage speech.

FIG. 1 is a block diagram showing the voice dialogue system according to the embodiment.
FIG. 2 is a flowchart for explaining the voice dialogue method according to the embodiment.
FIG. 3 is a diagram showing a state in which a user and the voice dialogue system are in dialogue.
FIG. 4 is a diagram showing a correlation coefficient table indicating the correlation between the prosodic features of user utterances and the prosodic features of backchannels.
FIG. 5 is a diagram showing an example of the correlation coefficient table indicating the correlation between the prosodic features of user utterances and the prosodic features of backchannels.

Embodiments of the present invention will be described below with reference to the drawings.
FIG. 3 is a diagram showing a state in which a user and the voice dialogue system are in dialogue. As shown in FIG. 3, the invention according to the present embodiment is characterized in that, when a user 31 talks with a robot (voice dialogue system) 32, the robot 32 utters backchannels that encourage the user 31 to speak. That is, prosodic features are extracted from the speech waveform 33 of the utterance of the user 31, and when a backchannel is generated, the prosody of the backchannel (speech waveform 34) is adjusted so that the prosodic features of the backchannel speech waveform 34 match the prosodic features of the speech waveform 33 of the user's utterance. The voice dialogue method and the voice dialogue system according to the present embodiment are described in detail below.

FIG. 1 is a block diagram showing the voice dialogue system according to the present embodiment. As shown in FIG. 1, the voice dialogue system 1 according to the present embodiment includes an utterance input unit 11, a prosodic feature extraction unit 12, a backchannel generation timing determination unit 13, a backchannel database 15, a backchannel selection unit 16, a prosody adjustment parameter generation unit 17, a backchannel waveform generation unit 18, and a backchannel output unit 19. The backchannel database 15, the backchannel selection unit 16, the prosody adjustment parameter generation unit 17, and the backchannel waveform generation unit 18 together constitute a backchannel generation unit 14.

The utterance input unit 11 receives the user's utterance. For example, the utterance input unit 11 can be configured using a microphone or the like.

The prosodic feature extraction unit 12 extracts prosodic features of the user utterance (preceding utterance) input to the utterance input unit 11. Examples of prosodic features include the fundamental frequency component F0 of the user utterance (hereinafter sometimes simply F0) and its power component. The logarithm of F0 may be used as the fundamental frequency component. For example, the logarithm of F0 can be obtained by calculating F0 from the speech every 10 ms and taking the base-10 logarithm of each calculated value. The power component can likewise be obtained by, for example, calculating a dB value every 10 ms. The prosodic feature extraction unit 12 outputs the extracted prosodic features 21 to the backchannel generation timing determination unit 13.
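The per-frame features described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the non-overlapping framing, the `eps` guard, and the convention of marking unvoiced frames with `None` are assumptions, and F0 estimation itself is assumed to be done elsewhere.

```python
import math


def frame_power_db(samples, sample_rate, hop_ms=10, eps=1e-12):
    """Return one power value in dB per non-overlapping hop_ms frame."""
    hop = int(sample_rate * hop_ms / 1000)
    if hop <= 0:
        raise ValueError("sample_rate is too low for the requested hop")
    powers = []
    for start in range(0, len(samples) - hop + 1, hop):
        frame = samples[start:start + hop]
        mean_square = sum(x * x for x in frame) / hop
        # eps avoids log10(0) on completely silent frames
        powers.append(10.0 * math.log10(mean_square + eps))
    return powers


def log_f0(f0_hz):
    """Base-10 logarithm of per-frame F0; unvoiced frames (F0 <= 0) become None."""
    return [math.log10(f) if f > 0 else None for f in f0_hz]
```

With 16 kHz audio, `hop_ms=10` gives 160 samples per frame, which matches the 10 ms update interval mentioned in the text.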

When backchannel generation timing information 22 is supplied from the backchannel generation timing determination unit 13, the prosodic feature extraction unit 12 outputs a backchannel selection signal 23 to the backchannel selection unit 16.

In addition, when the backchannel generation timing information 22 is supplied from the backchannel generation timing determination unit 13, the prosodic feature extraction unit 12 calculates feature values over a period going back a predetermined time from the backchannel generation timing (for example, 500 ms). These feature values include the maximum, the mean, and the range between the maximum and minimum of the fundamental frequency component F0, and the maximum, the mean, and the range between the maximum and minimum of the power component. The calculated feature values 24 are supplied to the prosody adjustment parameter generation unit 17.
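The windowed statistics can be sketched as below, assuming one feature value per 10 ms frame and `None` for frames without a value (for example, unvoiced frames for F0). The function name and the dictionary keys are illustrative choices, not taken from the source.

```python
def window_stats(values, hop_ms=10, window_ms=500):
    """Max, mean and range over the frames in the last window_ms before the timing."""
    n = max(1, window_ms // hop_ms)
    # None marks frames without a value (e.g. unvoiced frames for F0)
    window = [v for v in values[-n:] if v is not None]
    if not window:
        return None
    hi, lo = max(window), min(window)
    return {"max": hi, "mean": sum(window) / len(window), "range": hi - lo}
```

Calling this once on the log-F0 track and once on the power track yields the feature values that are passed on to the prosody adjustment parameter generation unit 17.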

The backchannel generation timing determination unit 13 determines the timing at which a backchannel is to be generated, using the prosodic features 21 extracted by the prosodic feature extraction unit 12. When it has determined such a timing, the backchannel generation timing determination unit 13 outputs backchannel generation timing information 22 to the prosodic feature extraction unit 12.

For example, the backchannel generation timing determination unit 13 can determine that it is time to generate a backchannel when the power component, a prosodic feature of the user utterance, is at or below a predetermined threshold. That is, at the moment the user finishes speaking, the power component of the user utterance becomes almost zero, so that moment can be determined to be a timing for generating a backchannel. Even while the user utterance is still in progress, a small power component indicates that the end of the utterance is approaching, so this case can also be determined to be a timing for generating a backchannel.

The above takes the power component as an example of the prosodic feature of the user utterance, but the timing for generating a backchannel may also be determined using, for example, the fundamental frequency component F0 of the user utterance. For example, the backchannel generation timing determination unit 13 may determine that it is time to generate a backchannel when the fundamental frequency component F0 of the user utterance is at or below a predetermined threshold. When F0 is at or below the threshold, the tone of the user utterance has dropped, so it can be judged that the end of the user utterance is approaching.
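A threshold-based timing decision of this kind might look like the sketch below. The function name and the default threshold of -40 dB are hypothetical, and combining the power test and the F0 test with a logical OR is one possible reading; the source describes them as alternatives without fixing how they are combined.

```python
def is_backchannel_timing(power_db, log_f0=None,
                          power_threshold_db=-40.0, log_f0_threshold=None):
    """Return True when the current frame looks like a backchannel timing.

    power_threshold_db and log_f0_threshold are hypothetical tuning values.
    """
    # Low power: the utterance has ended or is about to end.
    if power_db <= power_threshold_db:
        return True
    # Optional F0 test: a lowered tone also suggests the utterance is ending.
    if log_f0 is not None and log_f0_threshold is not None:
        return log_f0 <= log_f0_threshold
    return False
```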

The backchannel database 15 stores a correlation coefficient table indicating the correlation between the prosodic features of user utterances and the prosodic features of backchannels. This correlation coefficient table is generated in advance. FIG. 4 shows such a table. As shown in FIG. 4, the correlation coefficient table associates each backchannel (backchannel form) with correlation coefficients α. A correlation coefficient α is obtained for each prosodic feature value. That is, α is calculated for each of the maximum and the mean of the fundamental frequency component F0 and the maximum and the mean of the power component.

For example, among the correlation coefficients between the user utterance (preceding utterance) and the backchannel "あー" (aa), α(1, 1) is the coefficient obtained using the maximum of the fundamental frequency component F0, α(1, 2) is the coefficient obtained using the mean of F0, α(1, 3) is the coefficient obtained using the maximum of the power component, and α(1, 4) is the coefficient obtained using the mean of the power component.

The correlation coefficients can be estimated by recording dialogues between speakers (multiple subjects) and a listener (a counselor), analyzing the recorded speech, and examining the correlation between user utterances and backchannels separately for each backchannel form. Here, the speaker mainly produces the user utterances and the listener mainly produces the backchannels. To obtain a correlation coefficient, the prosodic features of the backchannel from its start to its end are used together with the prosodic features of the voiced section (for example, 500 ms) of the user utterance immediately preceding the backchannel. The prosodic features used can be the maximum and the mean of log F0 and the maximum and the mean of the power component in the relevant section.
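The estimation of the table could be sketched as follows, assuming that "correlation coefficient" means the Pearson coefficient (the source does not name the estimator) and that the recorded data have already been reduced to paired feature values. The function names and the nested-dictionary layout are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def build_table(pairs_by_form):
    """pairs_by_form: {form: {feature: [(user_value, backchannel_value), ...]}}"""
    table = {}
    for form, by_feature in pairs_by_form.items():
        table[form] = {}
        for feature, pairs in by_feature.items():
            xs = [p[0] for p in pairs]
            ys = [p[1] for p in pairs]
            table[form][feature] = pearson(xs, ys)
    return table
```

Each entry `table[form][feature]` then plays the role of one α in the correlation coefficient table, for example the α obtained from the maximum of F0 for one backchannel form.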

As shown in FIG. 4, backchannels are of two types: emotional-expression backchannels and response backchannels. Emotional-expression backchannels, such as "あー" (aa) and "はー" (haa), express feelings such as interest, understanding, and empathy. Response backchannels, such as "ふーん" (fuun) and "はい" (hai), indicate a response to the other party's utterance.

When the backchannel selection signal 23 is supplied from the prosodic feature extraction unit 12, the backchannel selection unit 16 shown in FIG. 1 selects a backchannel from among the backchannel forms stored in the backchannel database 15. Which backchannel is selected can be decided freely. As one example, when the timing determined by the backchannel generation timing determination unit 13 falls in the middle of the user utterance, the backchannel may be selected from the response backchannels (those indicating a response to the other party's utterance). When the determined timing is the moment at which the user utterance has ended, the backchannel may be selected from the emotional-expression backchannels (those expressing feelings such as interest, understanding, and empathy).
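The example selection rule above can be sketched as follows. The two categories and the four forms come from the description of FIG. 4; the dictionary name, the function name, and the random choice within a category are assumptions, since the source leaves the choice within a category open.

```python
import random

# Forms named in the description; a real database would hold more entries.
BACKCHANNEL_FORMS = {
    "emotional": ["あー", "はー"],   # interest, understanding, empathy
    "response": ["ふーん", "はい"],  # response to the other party's utterance
}


def select_backchannel(utterance_ended, rng=random):
    """Pick a category from the timing, then a form from that category."""
    category = "emotional" if utterance_ended else "response"
    # Random choice within the category is an illustrative assumption.
    return category, rng.choice(BACKCHANNEL_FORMS[category])
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible for testing.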

The backchannel selection unit 16 outputs backchannel information 25 (for example, text data) on the selected backchannel to the backchannel waveform generation unit 18. It also outputs information 26 on the correlation coefficients of the selected backchannel to the prosody adjustment parameter generation unit 17. The backchannel selection unit 16 can obtain the information on the correlation coefficients from the backchannel database 15. For example, when "あー" (aa) shown in FIG. 4 is selected as the backchannel, the backchannel selection unit 16 outputs the values of α(1, 1), α(1, 2), α(1, 3), and α(1, 4) to the prosody adjustment parameter generation unit 17 as the information 26 on the correlation coefficients.

The prosody adjustment parameter generation unit 17 generates parameters for adjusting the prosody of the backchannel so that the prosodic features of the backchannel selected by the backchannel selection unit 16 match the prosodic features of the user utterance. To do so, it uses the feature values 24 supplied from the prosodic feature extraction unit 12 and the information 26 on the correlation coefficients supplied from the backchannel selection unit 16. The generated prosody adjustment parameters 27 are supplied to the backchannel waveform generation unit 18.

Specifically, the prosody adjustment parameter generation unit 17 obtains the prosody adjustment parameter BC_ip using the following equation. It obtains BC_ip for each of the maximum and the mean of the fundamental frequency component F0 and the maximum and the mean of the power component.
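The equation itself is not reproduced in this text. One form that is consistent with the symbol definitions in this section is the standard linear-regression prediction of one standardized variable from another; it is shown here as an assumption, not as the published formula:

```latex
BC_{ip} = E(BC) + \alpha \cdot \frac{\sigma(BC)}{\sigma(S)} \cdot \bigl( S_i - E(S) \bigr)
```

Under this form, α = 0 leaves the backchannel at its database mean E(BC), while α = 1 moves it away from E(BC) by as many standard deviations as the user utterance lies from E(S).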

In the above equation, BC_ip is the prosody adjustment parameter (the target value of the prosodic feature of the backchannel), α is the correlation coefficient, and S_i is the prosodic feature of the user utterance. i is the sample index, i = 1, 2, ..., N. E(S) is the mean over the N turns (N ≥ 1) immediately preceding the user utterance (the mean of the prosodic feature of the user utterances), and E(BC) is the mean in the backchannel database (the mean of the prosodic feature of the backchannels). σ(S) is the standard deviation over the N turns (N ≥ 1) immediately preceding the user utterance (the standard deviation of the prosodic feature of the user utterances), and σ(BC) is the standard deviation in the backchannel database (the standard deviation of the prosodic feature of the backchannels). In the present embodiment, S_i, E(S), E(BC), σ(S), and σ(BC) are obtained for each of the maximum and the mean of the fundamental frequency component F0 and the maximum and the mean of the power component. E(BC) and σ(BC) are stored in advance in the backchannel database 15. The statistics of the user utterances may be estimated from only the immediately preceding turn when the user is met for the first time, or from the entire past dialogue history when the user is a returning user (and can be identified).

For example, when "あー" (aa) is selected as the backchannel by the backchannel selection unit 16, α(1, 1), α(1, 2), α(1, 3), and α(1, 4) are supplied to the prosody adjustment parameter generation unit 17 as the information 26 on the correlation coefficients.

The prosody adjustment parameter generation unit 17 obtains S_i, E(S), and σ(S) using the maximum of the fundamental frequency component F0 of the user utterance supplied from the prosodic feature extraction unit 12. E(BC) and σ(BC) are obtained from the values in the backchannel database. The prosody adjustment parameter generation unit 17 then substitutes the correlation coefficient α(1, 1) and the values of S_i, E(S), σ(S), E(BC), and σ(BC), all corresponding to the maximum of F0, into the above equation to calculate the prosody adjustment parameter BC_ip(F0_max) corresponding to the maximum of the fundamental frequency component F0.

Similarly, the prosody adjustment parameter generation unit 17 calculates the prosody adjustment parameter BC_ip(F0_ave) corresponding to the mean of the fundamental frequency component F0, the prosody adjustment parameter BC_ip(P_max) corresponding to the maximum of the power, and the prosody adjustment parameter BC_ip(P_ave) corresponding to the mean of the power. These calculated prosody adjustment parameters 27 are supplied to the backchannel waveform generation unit 18.
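The per-feature calculation can be sketched as below. The regression form used in `target_value` is an assumption consistent with the symbols α, S_i, E(S), σ(S), E(BC), and σ(BC); it is not confirmed by this text. The function names, the use of the population standard deviation, and the fallback to E(BC) when σ(S) is zero (for example, with a single preceding turn) are likewise illustrative choices.

```python
from statistics import mean, pstdev


def target_value(s_i, alpha, user_history, bc_mean, bc_std):
    """Assumed form: BC_ip = E(BC) + alpha * sigma(BC) / sigma(S) * (S_i - E(S))."""
    e_s = mean(user_history)
    sigma_s = pstdev(user_history)
    if sigma_s == 0:
        # No spread in the user history: fall back to the database mean.
        return bc_mean
    return bc_mean + alpha * (bc_std / sigma_s) * (s_i - e_s)


def prosody_adjustment_parameters(features, alphas, histories, bc_stats):
    """Compute BC_ip for every feature name present in alphas.

    features:  {name: S_i}, histories: {name: values over the last N turns},
    bc_stats:  {name: (E(BC), sigma(BC))} as stored in the backchannel database.
    """
    return {
        name: target_value(features[name], alphas[name],
                           histories[name], *bc_stats[name])
        for name in alphas
    }
```

With keys such as `"F0_max"`, `"F0_ave"`, `"P_max"`, and `"P_ave"`, one call returns all four parameters; passing a smaller `alphas` dictionary restricts the calculation to a subset of features.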

The above describes the case where four prosody adjustment parameters BC_ip are obtained, but the number of prosody adjustment parameters obtained may be different. For example, the prosody adjustment parameter generation unit 17 may obtain BC_ip only for those components, among the fundamental frequency component F0 and the power component, for which the correlation between the prosodic feature of the user utterance and the prosodic feature of the backchannel is high (that is, components with a high correlation coefficient α; see FIG. 5). In other words, the prosody adjustment parameter generation unit 17 may obtain BC_ip by preferentially using, among the fundamental frequency component F0 and the power component, the components whose correlation coefficient for the backchannel is high.

FIG. 5 shows an example of the correlation coefficient table indicating the correlation between the prosodic features of user utterances and the prosodic features of backchannels. As shown in FIG. 5, the correlation coefficient of each component differs depending on the backchannel form. For example, when the backchannel form is "はー" (haa), the prosody adjustment parameters BC_ip(P_max) and BC_ip(P_ave) may be obtained, corresponding to the components with large correlation coefficients: the maximum of the power component (correlation coefficient 0.47) and the mean of the power component (correlation coefficient 0.29). Likewise, when the backchannel form is "ふん" (fun) or "うん" (un), the prosody adjustment parameters BC_ip(F0_max) and BC_ip(P_max) may be obtained, corresponding to the maximum of the fundamental frequency component F0 (correlation coefficient 0.22) and the maximum of the power component (correlation coefficient 0.23). By preferentially using the components with high correlation coefficients, among the maximum and mean of F0 and the maximum and mean of the power component, to obtain the prosody adjustment parameters BC_ip, the accuracy of the prosody adjustment parameters can be improved. The amount of computation needed to obtain the prosody adjustment parameters can also be reduced.
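Choosing the components with the highest correlation coefficients can be sketched as a simple ranking. The function name, the `top_k` and `min_alpha` parameters, and ranking by the signed value of α rather than its absolute value are assumptions; the source only says that components with a high correlation coefficient are used preferentially.

```python
def select_features(alphas, top_k=2, min_alpha=None):
    """Return the names of the top_k features ranked by correlation coefficient.

    alphas: {feature_name: correlation coefficient} for one backchannel form.
    min_alpha optionally drops features below a hypothetical cut-off.
    """
    ranked = sorted(alphas.items(), key=lambda kv: kv[1], reverse=True)
    if min_alpha is not None:
        ranked = [kv for kv in ranked if kv[1] >= min_alpha]
    return [name for name, _ in ranked[:top_k]]
```

For the two "はー" values quoted above, `select_features({"P_max": 0.47, "P_ave": 0.29}, top_k=1)` keeps only `"P_max"`.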

The backchannel waveform generation unit 18 shown in FIG. 1 generates the speech waveform of the backchannel using the backchannel information 25 (for example, text data) on the backchannel selected by the backchannel selection unit 16 and the prosody adjustment parameters 27 generated by the prosody adjustment parameter generation unit 17. Here, the prosody adjustment parameters 27 are at least one of the prosody adjustment parameter BC_ip(F0_max) corresponding to the maximum of the fundamental frequency component F0, the prosody adjustment parameter BC_ip(F0_ave) corresponding to the mean of F0, the prosody adjustment parameter BC_ip(P_max) corresponding to the maximum of the power, and the prosody adjustment parameter BC_ip(P_ave) corresponding to the mean of the power. For example, the backchannel waveform generation unit 18 can generate the speech waveform of the backchannel using TTS (text-to-speech) technology.

In this way, the backchannel generation unit 14, which consists of the backchannel database 15, the backchannel selection unit 16, the prosody adjustment parameter generation unit 17, and the backchannel waveform generation unit 18, can generate the speech waveform of a backchannel responding to the user utterance based on the prosodic features extracted by the prosodic feature extraction unit 12.

The backchannel speech waveform generated by the backchannel waveform generation unit 18 is supplied to the backchannel output unit 19, which outputs the backchannel corresponding to the supplied waveform. For example, the backchannel output unit 19 can be configured using a loudspeaker or the like. The robot (voice dialogue system) 32 can thus output a backchannel whose prosody has been adjusted so that its prosodic features match those of the user utterance. Adjusting the prosody of the backchannel in this way encourages the user to speak.

The voice dialogue system according to the present embodiment may also be configured so that the robot moves its head (for example, nods) in accordance with the backchannel output from the backchannel output unit 19. Having the robot move its head together with the backchannel can further encourage the user to speak.

Next, the operation of the voice dialogue system according to the present embodiment (the voice dialogue method) will be described. FIG. 2 is a flowchart for explaining the voice dialogue method according to the present embodiment. Here too, it is assumed that the correlation coefficient table indicating the correlation between the prosodic features of user utterances and the prosodic features of backchannels is stored in advance in the backchannel database 15.

As shown in FIGS. 1 and 2, first, the utterance input unit 11 of the voice dialogue system 1 receives the user's utterance (step S1). Next, the prosodic feature extraction unit 12 extracts prosodic features of the user utterance (the preceding utterance) input to the utterance input unit 11 (step S2). Examples of the prosodic features include the fundamental frequency component F0 and the power component of the user utterance. Next, the backchannel generation timing determination unit 13 determines the timing for generating a backchannel using the prosodic features 21 extracted by the prosodic feature extraction unit 12. When the backchannel generation timing determination unit 13 determines that it is not the backchannel generation timing (step S3: No), the operations of steps S1 to S3 are repeated. On the other hand, when the backchannel generation timing determination unit 13 determines that it is the backchannel generation timing (step S3: Yes), it outputs backchannel generation timing information 22 to the prosodic feature extraction unit 12. For example, the backchannel generation timing determination unit 13 can determine that it is time to generate a backchannel when the power component, which is a prosodic feature of the user utterance, is equal to or less than a predetermined threshold.
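The threshold test of step S3 can be illustrated with a minimal Python sketch. The description only states that a backchannel is generated when the power component is at or below a predetermined threshold; the decibel scale, the frame-wise evaluation, and the names `frame_power_db` and `is_backchannel_timing` are assumptions made here for illustration.

```python
import math
from typing import Sequence


def frame_power_db(samples: Sequence[float], eps: float = 1e-12) -> float:
    """Mean-square power of one audio frame in decibels (assumed power measure)."""
    if not samples:
        raise ValueError("frame must contain at least one sample")
    mean_square = sum(x * x for x in samples) / len(samples)
    # eps keeps log10 defined for an all-zero (silent) frame.
    return 10.0 * math.log10(mean_square + eps)


def is_backchannel_timing(frame: Sequence[float], threshold_db: float) -> bool:
    """Step S3 (sketch): Yes when the power is at or below the threshold."""
    return frame_power_db(frame) <= threshold_db


if __name__ == "__main__":
    # threshold_db = -40.0 is a hypothetical value, not taken from the patent.
    quiet_frame = [0.001] * 160
    loud_frame = [0.5] * 160
    print(is_backchannel_timing(quiet_frame, -40.0))
    print(is_backchannel_timing(loud_frame, -40.0))
```

In a running system this check would be applied to each incoming frame, and steps S1 to S3 would repeat until it succeeds.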

When the backchannel generation timing information 22 is supplied from the backchannel generation timing determination unit 13, the prosodic feature extraction unit 12 outputs a backchannel selection signal 23 to the backchannel selection unit 16. In addition, when the backchannel generation timing information 22 is supplied, the prosodic feature extraction unit 12 calculates feature values over a period going back a predetermined time from the backchannel generation timing (for example, 500 ms): the maximum value, the average value, the range between the maximum and minimum values, and the like of the fundamental frequency component F0, and the maximum value, the average value, the range between the maximum and minimum values, and the like of the power component. The calculated feature values 24 are supplied to the prosody adjustment parameter generation unit 17.
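As a rough illustration of this windowed calculation, the following Python sketch computes the maximum, average, and range of F0 and power over the frames preceding the backchannel generation timing. The frame-based representation, the convention that unvoiced frames carry an F0 of 0 and are skipped, and the names `WindowStats`, `summarize`, and `features_before_timing` are assumptions, not details given in the description.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class WindowStats:
    maximum: float
    average: float
    value_range: float  # maximum minus minimum


def summarize(values: Sequence[float]) -> Optional[WindowStats]:
    """Maximum, average, and range of a sequence; None if it is empty."""
    if not values:
        return None
    highest, lowest = max(values), min(values)
    return WindowStats(highest, sum(values) / len(values), highest - lowest)


def features_before_timing(
    f0: Sequence[float],
    power: Sequence[float],
    timing_index: int,
    frame_ms: float,
    window_ms: float = 500.0,
) -> Dict[str, Optional[WindowStats]]:
    """Feature values over the window that ends at the timing frame."""
    n_frames = max(1, round(window_ms / frame_ms))
    start = max(0, timing_index - n_frames)
    # Assumption: the F0 tracker reports 0 for unvoiced frames, which are skipped.
    voiced_f0 = [v for v in f0[start:timing_index] if v > 0.0]
    return {
        "F0": summarize(voiced_f0),
        "P": summarize(list(power[start:timing_index])),
    }
```

The 500 ms default follows the example in the text; any frame length can be passed through `frame_ms`.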

When the backchannel selection signal 23 is supplied from the prosodic feature extraction unit 12, the backchannel selection unit 16 selects a predetermined backchannel (backchannel form) from among the backchannel forms stored in the backchannel database 15 (step S4). The backchannel selection unit 16 also outputs backchannel information 25 on the selected backchannel (for example, text data) to the backchannel waveform generation unit 18, and outputs information 26 on the correlation coefficient of the selected backchannel to the prosody adjustment parameter generation unit 17. The backchannel selection unit 16 can obtain the information on the correlation coefficient from the backchannel database 15.

The prosody adjustment parameter generation unit 17 generates a parameter for adjusting the prosody of the backchannel so that the prosodic features of the backchannel selected by the backchannel selection unit 16 match the prosodic features of the user utterance (step S5). To do so, the prosody adjustment parameter generation unit 17 generates the prosody adjustment parameter using the feature values 24 supplied from the prosodic feature extraction unit 12 and the information 26 on the correlation coefficient supplied from the backchannel selection unit 16. The generated prosody adjustment parameter 27 is supplied to the backchannel waveform generation unit 18.

Specifically, the prosody adjustment parameter generation unit 17 obtains the prosody adjustment parameter BC_ip using the formula given above. The prosody adjustment parameter generation unit 17 obtains BC_ip for each of the maximum value and average value of the fundamental frequency component F0 and the maximum value and average value of the power component.

The backchannel waveform generation unit 18 generates the speech waveform of the backchannel using the backchannel information 25 on the backchannel selected by the backchannel selection unit 16 and the prosody adjustment parameter 27 generated by the prosody adjustment parameter generation unit 17 (step S6). Here, the prosody adjustment parameter 27 is at least one of the prosody adjustment parameter BC_ip(F0_max) corresponding to the maximum value of the fundamental frequency component F0, the prosody adjustment parameter BC_ip(F0_ave) corresponding to the average value of the fundamental frequency component F0, the prosody adjustment parameter BC_ip(P_max) corresponding to the maximum value of the power, and the prosody adjustment parameter BC_ip(P_ave) corresponding to the average value of the power. For example, the backchannel waveform generation unit 18 can generate the speech waveform of the backchannel using TTS (text-to-speech) technology.
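Since the prosody adjustment parameter 27 may be any one or more of the four parameters, and the claims state that the component with a high correlation coefficient is used preferentially, one possible way to choose among them is sketched below in Python. Ranking by the raw coefficient value, the `top_k` cut-off, and the function names are assumptions; the coefficient values in the usage example are hypothetical placeholders, not data from the patent.

```python
from typing import List, Mapping


def rank_parameters(correlations: Mapping[str, float]) -> List[str]:
    """Order parameter names from highest to lowest correlation coefficient."""
    return sorted(correlations, key=lambda name: correlations[name], reverse=True)


def choose_parameters(correlations: Mapping[str, float], top_k: int = 1) -> List[str]:
    """Return the top_k parameter names to use when adjusting the prosody."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    return rank_parameters(correlations)[:top_k]


if __name__ == "__main__":
    # Hypothetical placeholder coefficients for one backchannel form.
    table = {"F0_max": 0.2, "F0_ave": 0.5, "P_max": 0.4, "P_ave": 0.1}
    print(choose_parameters(table, top_k=2))
```

A real correlation coefficient table would hold one such row per backchannel form stored in the backchannel database 15.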

The backchannel speech waveform generated by the backchannel waveform generation unit 18 is supplied to the backchannel output unit 19. The backchannel output unit 19 outputs the backchannel corresponding to the supplied speech waveform (step S7). As a result, the robot (voice dialogue system) 32 can output a backchannel whose prosody has been adjusted so that its prosodic features match those of the user utterance. The robot may also be configured to nod its head in accordance with the backchannel output from the backchannel output unit 19.

As described in the Background Art, the speech recognition device disclosed in Patent Document 1 estimates the backchannel timing, that is, the timing at which a backchannel sound is output from the speaker during a dialogue with the talker, based on the talker's speech feature values calculated from the speech signal input to the speech input unit. When the timing is estimated to be a backchannel timing, the device determines whether or not to output the backchannel sound based on the power immediately before that timing.

Accordingly, in the voice dialogue method and the voice dialogue system according to the present embodiment, prosodic features are extracted from the speech waveform of the user utterance, and, when a backchannel is generated, the prosody (speech waveform) of the backchannel is adjusted so that the prosodic features of the backchannel speech waveform match the prosodic features of the speech waveform of the user utterance. Adjusting the prosody of the backchannel in this way keeps the response from giving the user a mechanical impression, lets the user feel that they are being listened to, and encourages the user to speak. The invention according to the present embodiment can therefore provide a voice dialogue method and a voice dialogue system capable of generating backchannels that promote speech.

That is, in the invention according to the present embodiment, as shown in FIG. 3, the prosodic feature S_i is extracted from the speech waveform 33 of the utterance of the user 31, and the extracted prosodic feature S_i is substituted into the formula given above to predict the prosodic feature of the backchannel (that is, to obtain BC_ip). Therefore, when the backchannel is generated, the prosody of the backchannel (speech waveform 34) can be adjusted so that the prosodic feature BC_ip of the backchannel speech waveform 34 matches the prosodic feature of the speech waveform 33 of the utterance of the user 31.

Here, E(BC) in the above formula is the average value of the prosodic feature (F0 or power) of the backchannel. The formula takes this value E(BC) as a baseline and obtains the prosodic feature of the backchannel (the prosody adjustment parameter) BC_ip by adding to E(BC) a value that depends on the prosodic feature S_i of the user utterance.
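The formula itself is not reproduced in this text. As a hedged sketch only, the Python function below assumes the standard linear-regression form that is consistent with the symbols defined in the claims and with the baseline description above, namely BC_ip = E(BC) + α · (σ(BC) / σ(S)) · (S_i − E(S)). This exact form, the function name, and the argument names are assumptions and should be checked against the published formula.

```python
def predict_backchannel_feature(
    s_i: float,      # prosodic feature S_i of the user utterance
    mean_s: float,   # E(S): average of the user-utterance feature
    std_s: float,    # sigma(S): standard deviation of the user-utterance feature
    mean_bc: float,  # E(BC): average of the backchannel feature (baseline)
    std_bc: float,   # sigma(BC): standard deviation of the backchannel feature
    alpha: float,    # correlation coefficient from the correlation table
) -> float:
    """Assumed regression form: baseline E(BC) plus a term that depends on S_i."""
    if std_s <= 0.0:
        raise ValueError("std_s must be positive")
    return mean_bc + alpha * (std_bc / std_s) * (s_i - mean_s)
```

Under this assumed form, a user feature equal to its average E(S), or a correlation coefficient of zero, leaves the backchannel at the baseline E(BC), while a stronger correlation makes the backchannel follow the user's deviation from E(S) more closely. The same function would be evaluated once for each of F0_max, F0_ave, P_max, and P_ave.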

Although the present invention has been described above with reference to the embodiment, the present invention is not limited to the configuration of that embodiment and naturally includes the various variations, modifications, and combinations that a person skilled in the art could make within the scope of the invention set out in the claims of the present application.

DESCRIPTION OF SYMBOLS
1 Voice dialogue system
11 Utterance input unit
12 Prosodic feature extraction unit
13 Backchannel generation timing determination unit
14 Backchannel generation unit
15 Backchannel database
16 Backchannel selection unit
17 Prosody adjustment parameter generation unit
18 Backchannel waveform generation unit
19 Backchannel output unit
21 Extracted prosodic features
22 Backchannel generation timing information
23 Backchannel selection signal
24 Feature values
25 Backchannel information
26 Information on the correlation coefficient
27 Prosody adjustment parameter
31 User
32 Robot
33 Speech waveform of the user utterance
34 Speech waveform of the backchannel

Claims

A voice dialogue method comprising:
inputting a user utterance;
extracting prosodic features of the input user utterance; and
generating a backchannel in response to the user utterance based on the extracted prosodic features,
wherein, when the backchannel is generated, the prosody of the backchannel is adjusted so that the prosodic features of the backchannel match the prosodic features of the user utterance.

The voice dialogue method according to claim 1, wherein,
when the prosodic features of the user utterance are extracted, a fundamental frequency component and a power component of the user utterance are extracted, and
the prosody of the backchannel is adjusted using whichever of the fundamental frequency component and the power component shows a high correlation between the prosodic features of the user utterance and the prosodic features of the backchannel.

The voice dialogue method according to claim 2, wherein
a correlation coefficient table indicating the correlation between the prosodic features of the user utterance and the prosodic features of the backchannel is generated in advance, and,
of the fundamental frequency component and the power component, the component having a high correlation coefficient for the backchannel is preferentially used to adjust the prosody of the backchannel.

The voice dialogue method according to claim 2 or 3, wherein
the fundamental frequency component includes a maximum value and an average value of the fundamental frequency component, and
the power component includes a maximum value and an average value of the power component.

The voice dialogue method according to claim 4, wherein, when the backchannel is generated, a prosody adjustment parameter BC_ip is obtained for each of the maximum value and average value of the fundamental frequency component and the maximum value and average value of the power component using the following formula, and the prosody of the backchannel is adjusted using the prosody adjustment parameter BC_ip.

In the above formula, α is the correlation coefficient, S_i is the prosodic feature of the user utterance, i is the sample number, E(S) is the average value of the prosodic feature of the user utterance, E(BC) is the average value of the prosodic feature of the backchannel, σ(S) is the standard deviation of the prosodic feature of the user utterance, and σ(BC) is the standard deviation of the prosodic feature of the backchannel.

The voice dialogue method according to any one of claims 1 to 5, further comprising determining the timing for generating the backchannel using the prosodic features of the user utterance,
wherein the backchannel is generated when the power component, which is a prosodic feature of the user utterance, is equal to or less than a predetermined threshold.

The voice dialogue method according to any one of claims 1 to 6, wherein
the backchannel includes an emotional-expression type and a response type,
a response-type backchannel is selected while the user utterance is still in progress, and
an emotional-expression-type backchannel is selected when the user utterance has been completed.

A voice dialogue system comprising:
an utterance input unit that receives a user utterance;
a prosodic feature extraction unit that extracts prosodic features of the user utterance input to the utterance input unit; and
a backchannel generation unit that generates a backchannel in response to the user utterance based on the prosodic features extracted by the prosodic feature extraction unit,
wherein the backchannel generation unit adjusts the prosody of the backchannel so that the prosodic features of the backchannel match the prosodic features of the user utterance.