JP2017106989A - Voice interactive device and program

Info

Publication number: JP2017106989A
Application number: JP2015238912A
Authority: JP
Inventors: Hiroshi Kayama (嘉山 啓)
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2015-12-07
Filing date: 2015-12-07
Publication date: 2017-06-15
Anticipated expiration: 2035-12-07
Also published as: JP6657888B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice interaction device and a program that achieve natural voice interaction.

SOLUTION: A voice interaction device 100C comprises: a voice input device 24 that collects an utterance voice Vx and generates an utterance signal X; a voice acquisition unit 32 that acquires the utterance signal X; and a response generation unit 36C that analyzes the utterance signal X and causes a playback device 26 to selectively play back a first response voice Vy1 representing an ask-back (a request to repeat) to the utterance voice Vx and a second response voice Vy2 other than an ask-back.

SELECTED DRAWING: Figure 10

Description

The present invention relates to voice interaction technology for playing back a response voice to an utterance voice.

Voice interaction technologies that realize a dialogue with a user by playing back the voice of a response (for example, an answer to a question) to the user's utterance have been proposed. For example, Patent Document 1 discloses a technique that analyzes the content of an utterance by applying speech recognition to the user's utterance voice, and synthesizes and plays back a response voice according to the analysis result.

Patent Document 1: JP 2012-128440 A

However, with existing technologies including that of Patent Document 1, it is in practice difficult to realize a natural voice interaction that faithfully reflects the tendencies of real dialogue between humans, and the user may perceive the interaction as mechanical and unnatural. In view of these circumstances, an object of the present invention is to realize natural voice interaction.

To solve the above problem, a voice interaction device according to a preferred aspect of the present invention includes a voice acquisition unit that acquires an utterance signal representing an utterance voice, and a response generation unit that causes a playback device to selectively play back a first response voice representing an ask-back to the utterance voice and a second response voice other than an ask-back. In this aspect, the first response voice representing an ask-back to the utterance voice and the second response voice other than an ask-back are selectively played back from the playback device. It is therefore possible to realize a natural voice interaction that simulates the tendency of real dialogue in which not only backchannel responses to the speaker's utterance but also ask-backs (requests to repeat) occur from time to time.

A voice interaction device according to a preferred aspect of the present invention includes a voice analysis unit that identifies, from the utterance signal, a prosody index value representing the prosody of the utterance voice, and the response generation unit compares the prosody index value of the utterance voice with a threshold and selects either the first response voice or the second response voice according to the result of the comparison. In this aspect, because either the first response voice or the second response voice is selected according to the result of comparing the prosody index value with the threshold, it is possible to realize a natural voice interaction that simulates the tendency of real dialogue in which, when the prosody of the utterance voice fluctuates, the utterance becomes harder to catch and an ask-back becomes more likely.

In a preferred aspect of the present invention, the voice analysis unit sets, as the threshold, a representative value of the prosody index values of a plurality of past utterance voices. In this aspect, because a representative value of the prosody index values of a plurality of past utterance voices is set as the threshold, it is possible to realize a natural voice interaction that simulates the tendency of real dialogue in which an ask-back from the conversation partner is likely to occur when the prosody of the speaker's utterance voice deviates from that speaker's standard prosody (that is, the prosody the conversation partner expects).

In a preferred aspect of the present invention, the response generation unit selects the first response voice when the prosody index value is outside a predetermined range that includes the threshold, and selects the second response voice when the prosody index value is inside the predetermined range. In this aspect, because the first response voice is selected only when the prosody index value is outside the predetermined range, it is possible to reduce the possibility that the first response voice is played back excessively often.
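The selection rule of the preceding aspects can be illustrated by the following non-limiting sketch. The function name `select_response`, the use of the arithmetic mean as the representative value, and the symmetric half-width `margin` of the predetermined range are assumptions made only for illustration; the description fixes none of them.

```python
from statistics import mean

FIRST = "ask-back"      # first response voice (ask-back), label is illustrative
SECOND = "backchannel"  # second response voice (other than ask-back)


def select_response(index, past_indices, margin):
    """Pick a response type from the prosody index value of one utterance.

    index: prosody index value (for example, a pitch in Hz) of the utterance.
    past_indices: prosody index values of earlier utterances.
    margin: hypothetical half-width of the predetermined range around the threshold.
    """
    if not past_indices:
        # Assumption: with no history there is no threshold, so no ask-back.
        return SECOND
    threshold = mean(past_indices)  # representative value used as the threshold
    lower, upper = threshold - margin, threshold + margin
    # Inside the range: second response voice; outside: first response voice.
    return SECOND if lower <= index <= upper else FIRST
```

A wider `margin` makes ask-backs rarer, which corresponds to the stated effect of avoiding an excessively frequent first response voice.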

In a preferred aspect of the present invention, the response generation unit causes the first response voice to be played back for an utterance voice selected at random from a plurality of utterance voices. In this aspect, because the first response voice is played back for a randomly selected utterance voice, it is possible to realize a natural voice interaction that simulates the tendency of real voice dialogue in which ask-backs to utterance voices occur at random. For example, the response generation unit variably sets the frequency with which the first response voice is played back for the plurality of utterance voices. The playback frequency of the first response voice can also be set according to the usage history of the voice interaction.
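Random selection with a variable playback frequency might be sketched as follows. The function name `should_ask_back` and the interpretation of the playback frequency as a per-utterance probability between 0 and 1 are assumptions for illustration; how the frequency would be derived from the usage history is not specified here and is left to the caller.

```python
import random


def should_ask_back(frequency, rng=random):
    """Return True when the current utterance is chosen for the first response voice.

    frequency: assumed probability (0.0 to 1.0) of an ask-back per utterance.
    rng: object with a random() method; defaults to the random module.
    """
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    # random() returns a value in [0.0, 1.0), so 0.0 never fires and 1.0 always does.
    return rng.random() < frequency
```

Passing a seeded `random.Random` instance as `rng` makes the behavior reproducible, which is convenient when the frequency is being tuned.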

FIG. 1 is a block diagram of the voice interaction device of the first embodiment.
FIG. 2 is a flowchart of the operation of the voice interaction device in the first embodiment.
FIG. 3 is an explanatory diagram of the utterance voice and the response voice in the first embodiment.
FIG. 4 is an explanatory diagram of the utterance voice and the response voice in the first embodiment.
FIG. 5 is a flowchart of the response generation process of the first embodiment.
FIG. 6 is a block diagram of the voice interaction device of the second embodiment.
FIG. 7 is an explanatory diagram of the utterance voice and the response voice in the second embodiment.
FIG. 8 is an explanatory diagram of the utterance voice and the response voice in the second embodiment.
FIG. 9 is a flowchart of the response generation process in the second embodiment.
FIG. 10 is a block diagram of the voice interaction device of the third embodiment.
FIG. 11 is a flowchart of the operation of the voice interaction device in the third embodiment.
FIG. 12 is a flowchart of the response generation process in the third embodiment.
FIG. 13 is an explanatory diagram of the utterance voice and the response voice in the third embodiment.
FIG. 14 is an explanatory diagram of the utterance voice and the response voice in the third embodiment.
FIG. 15 is a block diagram of the voice interaction device of the fourth embodiment.
FIG. 16 is a flowchart of the operation of the voice interaction device in the fourth embodiment.
FIG. 17 is a flowchart of the response generation process in the fourth embodiment.
FIG. 18 is an explanatory diagram of the utterance voice and the response voice in the fourth embodiment.
FIG. 19 is an explanatory diagram of the utterance voice and the response voice in the fourth embodiment.

<First Embodiment>
FIG. 1 is a block diagram of a voice interaction device 100A according to the first embodiment of the present invention. The voice interaction device 100A of the first embodiment is a voice interaction system that plays back a response voice Vy in reply to a voice uttered by a user U (hereinafter "utterance voice") Vx. For example, a portable information processing device such as a mobile phone or a smartphone, or an information processing device such as a personal computer, can be used as the voice interaction device 100A. The voice interaction device 100A can also be realized in the form of a toy that imitates the appearance of an animal or the like (for example, a doll such as a stuffed animal) or a robot.

The utterance voice Vx is the voice of an utterance including, for example, a question or a remark addressed to the device, and the response voice Vy is the voice of a response including an answer to a question or a reply to a remark. The response voice Vy also includes, for example, a voice expressing an interjection. An interjection is an independent, non-inflecting word that is used independently of other phrases. Specific examples of interjections include words such as "un" and "ee" that express a backchannel to an utterance, words such as "eeto" and "anoo" that express hesitation (a stall in the response), words such as "hai" (yes) and "iie" (no) that express a response (affirmation or negation of a question), words such as "aa" and "oo" that express the speaker's emotion, and words such as "e?" (eh?) and "nani?" (what?) that express an ask-back (a request to repeat) to an utterance.

The voice interaction device 100A of the first embodiment generates a response voice Vy whose prosody depends on the prosody of the utterance voice Vx. Prosody is a linguistic and phonetic property that a listener can perceive and that cannot be grasped from the ordinary written form of the language alone (that is, from notation without special prosodic marks). Prosody can also be described as a property that lets the listener recall or infer the speaker's intention or emotion. Specifically, the concept of prosody can include various features such as inflection (change in the tone of the voice, intonation), tone (pitch or loudness of the voice), duration (utterance length), speaking rate, rhythm (the structure of temporal change in tone), and accent (pitch accent or stress accent), but typical examples of prosody are pitch (fundamental frequency) and volume.

As illustrated in FIG. 1, the voice interaction device 100A of the first embodiment includes a control device 20, a storage device 22, a voice input device 24, and a playback device 26. The voice input device 24 is an element that generates an audio signal (hereinafter "utterance signal") X representing, for example, the utterance voice Vx of the user U, and includes a sound collection device 242 and an A/D converter 244. The sound collection device (microphone) 242 collects the utterance voice Vx uttered by the user U and generates an analog audio signal representing the sound pressure fluctuation of the utterance voice Vx. The A/D converter 244 converts the audio signal generated by the sound collection device 242 into the digital utterance signal X.

The control device 20 is an arithmetic processing device (for example, a CPU) that centrally controls each element of the voice interaction device 100A. The control device 20 of the first embodiment acquires the utterance signal X supplied from the voice input device 24 and generates a response signal Y representing the response voice Vy to the utterance voice Vx. The playback device 26 is an element that plays back the response voice Vy according to the response signal Y generated by the control device 20, and includes a D/A converter 262 and a sound emitting device 264. The D/A converter 262 converts the digital response signal Y generated by the control device 20 into an analog audio signal, and the sound emitting device 264 (for example, a speaker or headphones) emits the response voice Vy as sound waves according to the converted audio signal. The playback device 26 may also include processing circuits such as an amplifier that amplifies the response signal Y.

The storage device 22 stores programs executed by the control device 20 and various data used by the control device 20. For example, a known recording medium such as a semiconductor recording medium or a magnetic recording medium, or a combination of a plurality of recording media, can be freely employed as the storage device 22. The storage device 22 of the first embodiment stores an audio signal Z representing a response voice with specific utterance content. The following description uses the example in which an audio signal Z of a response voice such as "un", a backchannel that is one example of an interjection, is stored in the storage device 22. The audio signal Z is recorded in advance and stored in the storage device 22 as an audio file in an arbitrary format such as the wav format.

By executing the program stored in the storage device 22, the control device 20 realizes a plurality of functions (a voice acquisition unit 32, a voice analysis unit 34A, and a response generation unit 36A) for establishing a dialogue with the user U. A configuration in which the functions of the control device 20 are realized by a plurality of devices (that is, a system), or a configuration in which a dedicated electronic circuit handles part of the functions of the control device 20, may also be employed.

The voice acquisition unit 32 in FIG. 1 acquires the utterance signal X representing the utterance voice Vx. The voice acquisition unit 32 of the first embodiment acquires, from the voice input device 24, the utterance signal X generated by the voice input device 24. The voice analysis unit 34A identifies the pitch (fundamental frequency) P of the utterance voice Vx from the utterance signal X acquired by the voice acquisition unit 32. The pitch P is identified sequentially at a predetermined period; that is, a pitch P is identified for each of a plurality of different time points on the time axis. Any known technique can be used to identify the pitch P of the utterance voice Vx. It is also possible to identify the pitch P after extracting the acoustic components of a specific frequency band from the utterance signal X. The frequency band analyzed by the voice analysis unit 34A is set variably, for example according to an instruction from the user U (for example, designation of a male or female voice). The frequency band to be analyzed can also be changed dynamically according to the pitch P of the utterance voice Vx.
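Because the description leaves the pitch identification method open, the following is only one possible sketch, using plain autocorrelation over a single frame. The function name `estimate_pitch` and the default search band of 60 Hz to 400 Hz are assumptions for illustration; restricting `fmin` and `fmax` plays the role of the variable analysis band (for example, a lower band for a male voice).

```python
import math


def estimate_pitch(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of one frame by autocorrelation.

    Only lags corresponding to frequencies between fmin and fmax are searched.
    Returns None when the frame is silent or no positive correlation is found.
    """
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(len(frame) - 1, int(sample_rate / fmin))
    if max_lag < min_lag:
        return None
    if sum(s * s for s in frame) == 0.0:
        return None  # silent frame: no meaningful pitch
    best_lag, best_value = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        value = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag if best_lag else None
```

Calling this function on successive frames of the utterance signal X would yield the time series of pitches P. A practical implementation would likely add voicing detection and octave-error handling, which are omitted here.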

The response generation unit 36A causes the playback device 26 to play back the response voice Vy to the utterance voice Vx of the utterance signal X acquired by the voice acquisition unit 32. Specifically, triggered by the user U uttering the utterance voice Vx, the response generation unit 36A generates the response signal Y of the response voice Vy and supplies the response signal Y to the playback device 26, thereby causing the playback device 26 to play back the response voice Vy. The response generation unit 36A of the first embodiment generates the response signal Y of the response voice Vy by adjusting the prosody of the audio signal Z stored in the storage device 22 according to the pitch P of the utterance voice Vx identified by the voice analysis unit 34A. That is, the playback device 26 plays back a response voice Vy obtained by adjusting the initial response voice represented by the audio signal Z according to the prosody of the utterance voice Vx.

In real dialogue between humans, a tendency is observed in which the conversation partner utters the response voice at a pitch corresponding to the pitch near the end point of the speaker's utterance voice (that is, the pitch of the response voice depends on the pitch near the end point of the utterance voice). In view of this tendency, the response generation unit 36A of the first embodiment generates the response signal Y of the response voice Vy by adjusting the pitch of the audio signal Z according to the pitch P of the utterance voice Vx identified by the voice analysis unit 34A.

FIG. 2 is a flowchart of the processing executed by the control device 20 of the first embodiment. The processing of FIG. 2 is started, for example, when the user U gives an instruction to the voice interaction device 100A (for example, an instruction to start the voice interaction program).

When the processing of FIG. 2 starts, the voice acquisition unit 32 waits until the user U starts uttering the utterance voice Vx (S10: NO). Specifically, the voice acquisition unit 32 sequentially identifies the volume of the utterance voice Vx by analyzing the utterance signal X supplied from the voice input device 24, and determines that the utterance voice Vx has started when the volume stays above a predetermined threshold (for example, a fixed value selected in advance or a variable value set according to an instruction from the user U) for a predetermined length of time. The method of detecting the start of the utterance voice Vx (that is, the start point of the utterance section) is arbitrary. For example, it is also possible to determine that the utterance voice Vx has started when the volume of the utterance voice Vx exceeds the threshold and the voice analysis unit 34A detects a significant pitch P.

When the utterance voice Vx starts (S10: YES), the voice acquisition unit 32 acquires the utterance signal X from the voice input device 24 and stores it in the storage device 22 (S11). The voice analysis unit 34A identifies the pitch P of the utterance voice Vx from the utterance signal X acquired by the voice acquisition unit 32 and stores it in the storage device 22 (S12).

The voice acquisition unit 32 determines whether the user U has finished uttering the utterance voice Vx (S13). Specifically, the voice acquisition unit 32 determines that the utterance voice Vx has ended when the volume of the utterance voice Vx identified from the utterance signal X stays below a predetermined threshold (for example, a fixed value selected in advance or a variable value set according to an instruction from the user U) for a predetermined length of time. However, any known technique can be used to detect the end of the utterance voice Vx (that is, the end point of the utterance section). As understood from the above description, while the utterance voice Vx continues (S13: NO), the acquisition of the utterance signal X by the voice acquisition unit 32 (S11) and the identification of the pitch P of the utterance voice Vx by the voice analysis unit 34A (S12) are repeated.
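The volume-based start and end determinations (S10 and S13) can be sketched as a small run-length detector over a sequence of per-frame volume values. The function name `find_utterance`, the use of a single threshold and a single frame count `min_frames` for both determinations, and the choice to report the first frame of each qualifying run are assumptions for illustration only.

```python
def find_utterance(volumes, threshold, min_frames):
    """Return (start, end) frame indices of the first utterance, or None.

    start: first frame of a run of at least min_frames frames above threshold.
    end: first frame of the following run of at least min_frames frames
         below threshold.
    """
    start = None
    run = 0
    for i, v in enumerate(volumes):
        if start is None:
            # Waiting for the utterance to begin (corresponds to S10).
            run = run + 1 if v > threshold else 0
            if run >= min_frames:
                start = i - min_frames + 1
                run = 0
        else:
            # Utterance in progress; waiting for it to end (corresponds to S13).
            run = run + 1 if v < threshold else 0
            if run >= min_frames:
                return start, i - min_frames + 1
    return None
```

Requiring the condition to hold for `min_frames` consecutive frames keeps short noise bursts from being treated as a start and short pauses from being treated as an end.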

As a result of the processing described above, as illustrated in FIGS. 3 and 4, a time series of a plurality of pitches P of the utterance voice Vx is identified for the utterance section from the start point to the end point tB of the utterance voice Vx. FIG. 3 assumes a case in which the user U utters an utterance voice Vx of the interrogative sentence "Tanoshii ne?" ("It's fun, isn't it?"), in which the speaker asks about the conversation partner's emotion, intention, or the like. FIG. 4 assumes a case in which the user U utters an utterance voice Vx of a declarative sentence that expresses the speaker's own emotion, intention, or the like, or that asks the conversation partner to agree with it.

When the utterance voice Vx ends (S13: YES), the response generation unit 36A executes a process SA (hereinafter "response generation process") for causing the playback device 26 to play back the response voice Vy to the utterance voice Vx. As described above, the response generation process SA of the first embodiment generates the response signal Y of the response voice Vy by adjusting the pitch of the audio signal Z according to the pitch P of the utterance voice Vx identified by the voice analysis unit 34A.

FIG. 5 is a flowchart of a specific example of the response generation process SA. As described above, the response generation process SA of FIG. 5 is started when the utterance voice Vx ends (S13: YES). When the response generation process SA starts, the response generation unit 36A identifies, as the prosody of the utterance voice Vx, the lowest value Pmin (hereinafter "lowest pitch") among the plurality of pitches P identified by the voice analysis unit 34A for a section E of the utterance voice Vx that includes its end point tB (hereinafter "tail section"), as illustrated in FIGS. 3 and 4 (SA1). The tail section E is, for example, the part of the utterance voice Vx that extends over a predetermined length (for example, several seconds) immediately before the end point tB. As understood from FIG. 3, in an interrogative utterance voice Vx the pitch P tends to rise near the end point tB. Therefore, the pitch P at the local minimum where the pitch P of the utterance voice Vx turns from falling to rising is identified as the lowest pitch Pmin. On the other hand, as understood from FIG. 4, in a declarative utterance voice Vx the pitch P tends to fall monotonically toward the end point tB. Therefore, the pitch P at the end point tB of the utterance voice Vx is identified as the lowest pitch Pmin.
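Step SA1 can be illustrated by the following sketch, which takes the pitch time series up to the end point tB and returns the lowest pitch in the tail section E. The function name `lowest_tail_pitch`, the parameters `frame_period` and `tail_length`, and the convention that `None` marks frames without a detected pitch are assumptions for illustration.

```python
def lowest_tail_pitch(pitch_track, frame_period, tail_length):
    """Return Pmin, the lowest detected pitch in the tail section E.

    pitch_track: pitches (Hz) at regular intervals, ending at the end point tB;
                 None marks frames where no pitch was detected.
    frame_period: seconds between successive pitch values.
    tail_length: length of the tail section E in seconds.
    """
    frames = max(1, round(tail_length / frame_period))
    tail = [p for p in pitch_track[-frames:] if p is not None]
    return min(tail) if tail else None
```

For a question-like contour that falls and then rises, the result is the pitch at the turning point; for a declarative contour that falls monotonically, it is the last pitch value, which matches the two cases of FIGS. 3 and 4.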

The response generation unit 36A generates the response signal Y representing a response voice Vy whose pitch depends on the lowest pitch Pmin of the utterance voice Vx (SA2). Specifically, as illustrated in FIGS. 3 and 4, the response generation unit 36A generates the response signal Y of the response voice Vy by adjusting the pitch of the audio signal Z so that the pitch at a specific time point τ on the time axis of the response voice Vy (hereinafter "target point") matches the lowest pitch Pmin. A preferred example of the target point τ is the start point of a specific mora (typically the last mora) among the plurality of morae constituting the response voice Vy. For example, assuming an audio signal Z of the response voice "un", as understood from FIGS. 3 and 4, the response signal Y of the response voice Vy is generated by adjusting (pitch-shifting) the pitch over the entire section of the audio signal Z so that the pitch at the start point of the last mora "n" matches the lowest pitch Pmin. Any known technique can be used to adjust the pitch. The target point τ is not limited to the start point of the last mora of the response voice Vy. For example, the pitch can also be adjusted using the start point or the end point of the response voice Vy as the target point τ.
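The amount of pitch shift in step SA2 follows directly from the pitch of the audio signal Z at the target point τ and the lowest pitch Pmin. The sketch below works on a pitch contour rather than on audio samples; the function name `shift_to_target` and the representation of the response as a list of pitches with an index for the target point are assumptions for illustration. Resynthesizing the audio itself would require a separate pitch-shifting technique, which the description leaves open.

```python
import math


def shift_to_target(response_pitches, target_index, pmin):
    """Shift a whole pitch contour so its value at target_index equals pmin.

    response_pitches: pitch contour (Hz) of the stored response voice.
    target_index: index of the target point (e.g. start of the last mora).
    pmin: lowest pitch (Hz) of the tail section of the utterance voice.
    Returns (shifted contour in Hz, shift amount in semitones).
    """
    ratio = pmin / response_pitches[target_index]
    semitones = 12.0 * math.log2(ratio)
    # The same ratio is applied over the entire section, preserving the
    # shape of the contour on a logarithmic frequency scale.
    return [p * ratio for p in response_pitches], semitones
```

For example, a stored "un" whose last mora starts at 160 Hz, answered to an utterance with Pmin of 80 Hz, would be shifted down by one octave over its whole length.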

After generating the response signal Y by the above procedure, the response generation unit 36A waits until the time point ty at which playback of the response voice Vy should start (hereinafter "response playback point") arrives (SA3: NO). The response playback point ty is, for example, the time point at which a predetermined time (for example, 150 ms) has elapsed from the end point tB of the utterance voice Vx.

When the response playback point ty arrives (SA3: YES), the response generation unit 36A causes the response voice Vy to be played back by supplying the response signal Y, adjusted according to the lowest pitch Pmin, to the playback device 26 (SA4). That is, playback of the response voice Vy starts at the response playback point ty, at which the predetermined time has elapsed from the end point tB of the utterance voice Vx. The response generation unit 36A can also supply the response signal Y to the playback device 26 sequentially from the response playback point ty, in real time and in parallel with the generation (pitch shifting) of the response signal Y. As understood from the above description, the response generation unit 36A of the first embodiment functions as an element that causes the playback device 26 to play back a response voice Vy whose pitch depends on the lowest pitch Pmin in the tail section E of the utterance voice Vx.

When the response generation process SA illustrated above is completed, the control device 20 determines whether the user U has instructed the end of the voice interaction, as illustrated in FIG. 2 (S14). If the end of the voice interaction has not been instructed (S14: NO), the processing returns to step S10. That is, triggered by the start of an utterance voice Vx (S10: YES), the acquisition of the utterance signal X by the voice acquisition unit 32 (S11), the identification of the pitch P by the voice analysis unit 34A (S12), and the response generation process SA by the response generation unit 36A are executed. As understood from the above description, a response voice Vy whose pitch depends on the pitch P of the utterance voice Vx is played back each time an utterance voice Vx is uttered. That is, a voice interaction is realized in which the utterance of an arbitrary utterance voice Vx by the user U and the playback of a backchannel response voice Vy to that utterance voice Vx (for example, the response voice "un") are repeated alternately. When the user U instructs the end of the voice interaction (S14: YES), the control device 20 ends the processing of FIG. 2.

以上に説明した通り、第１実施形態では、発話音声Ｖxの終点ｔBを含む末尾区間Ｅ内の最低音高Ｐminに応じた音高の応答音声Ｖyが再生装置２６から再生される。したがって、発話音声の終点付近の音高に対応した音高で対話相手が応答音声を発音するという現実の対話の傾向を模擬した自然な音声対話を実現することが可能である。第１実施形態では特に、応答音声Ｖyのうち最後のモーラの始点（目標点τ）での音高が最低音高Ｐminに一致するように応答音声Ｖyが再生されるから、現実の対話に近い自然な音声対話を実現できるという効果は格別に顕著である。 As described above, in the first embodiment, the response sound Vy having a pitch corresponding to the lowest pitch Pmin in the end section E including the end point tB of the utterance voice Vx is played from the playback device 26. Therefore, it is possible to realize a natural voice conversation simulating the tendency of an actual conversation in which the conversation partner generates a response voice at a pitch corresponding to the pitch near the end point of the uttered voice. Particularly in the first embodiment, since the response voice Vy is reproduced so that the pitch at the start point (target point τ) of the last mora in the response voice Vy matches the minimum pitch Pmin, it is close to an actual dialogue. The effect of realizing a natural voice conversation is particularly remarkable.

＜第１実施形態の変形例＞ <Modification of First Embodiment>
（１）第１実施形態では、応答音声Ｖyのうち目標点τの音高を発話音声Ｖxの末尾区間Ｅ内の最低音高Ｐminに一致させる構成を例示したが、応答音声Ｖyの目標点τでの音高と発話音声Ｖxの最低音高Ｐminとの関係は以上の例示（両者が一致する関係）に限定されない。例えば、応答音声Ｖyの目標点τでの音高を、最低音高Ｐminに所定の調整値（オフセット）δpを加算または減算した音高に一致させることも可能である。調整値δpは、事前に選定された固定値（例えば最低音高Ｐminに対して５度等の音程に相当する数値）または利用者Ｕからの指示に応じた可変値である。また、調整値δpをオクターブの整数倍に相当する数値に設定した構成によれば、最低音高Ｐminをオクターブシフトした音高の応答音声Ｖyが再生される。調整値δpを適用するか否かを利用者Ｕからの指示に応じて切替えることも可能である。 (1) The first embodiment exemplified a configuration in which the pitch of the response voice Vy at the target point τ is matched to the minimum pitch Pmin in the tail section E of the utterance voice Vx, but the relationship between the pitch of the response voice Vy at the target point τ and the minimum pitch Pmin of the utterance voice Vx is not limited to this example (in which the two match). For example, the pitch of the response voice Vy at the target point τ may be matched to a pitch obtained by adding a predetermined adjustment value (offset) δp to, or subtracting it from, the minimum pitch Pmin. The adjustment value δp is either a fixed value selected in advance (for example, a value corresponding to an interval such as a fifth relative to the minimum pitch Pmin) or a variable value set according to an instruction from the user U. With the adjustment value δp set to a value corresponding to an integer multiple of an octave, a response voice Vy is played back at a pitch obtained by octave-shifting the minimum pitch Pmin. Whether to apply the adjustment value δp may also be switched according to an instruction from the user U.
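
The offset described in modification (1) can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes pitches are held on a semitone scale (the source does not fix a unit), and the names `target_pitch`, `apply_offset`, and the constants are hypothetical.

```python
# Assumption: pitches are expressed in semitones (e.g. MIDI-style note numbers),
# so a musical interval becomes a simple additive offset.
SEMITONES_PER_OCTAVE = 12.0
PERFECT_FIFTH = 7.0  # a perfect fifth spans 7 semitones


def target_pitch(p_min: float, delta_p: float = 0.0, apply_offset: bool = True) -> float:
    """Pitch that the response voice Vy should take at the target point tau.

    p_min        -- minimum pitch Pmin found in the tail section E
    delta_p      -- adjustment value (offset); negative values subtract
    apply_offset -- models the user-controlled switch for applying delta_p
    """
    if not apply_offset:
        return p_min
    return p_min + delta_p


# A fifth above Pmin, and one octave below Pmin.
print(target_pitch(50.0, PERFECT_FIFTH))
print(target_pitch(50.0, -SEMITONES_PER_OCTAVE))
```

Keeping the offset separate from the pitch-shifting step means the same function covers the matching case (δp = 0), interval offsets, and octave shifts.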

（２）第１実施形態では、発話音声Ｖxの音高Ｐ（具体的には末尾区間Ｅの最低音高Ｐmin）に応じて応答音声Ｖyの音高を制御したが、応答音声Ｖyの韻律の制御に利用される発話音声Ｖxの韻律の種類や、発話音声Ｖxの韻律に応じて制御される応答音声Ｖyの韻律の種類は、音高に限定されない。例えば、発話音声Ｖxの音量（韻律の一例）に応じて応答音声Ｖyの韻律を制御する構成や、発話音声Ｖxの音高または音量の変動の範囲（韻律の他例）に応じて応答音声Ｖyの韻律を制御する構成も採用される。また、発話音声Ｖxの韻律に応じて応答音声Ｖyの音量（韻律の一例）を制御する構成や、発話音声Ｖxの韻律に応じて応答音声Ｖyの音高または音量の変動の範囲（韻律の他例）を制御する構成も採用され得る。 (2) In the first embodiment, the pitch of the response voice Vy is controlled according to the pitch P of the utterance voice Vx (specifically, the minimum pitch Pmin in the tail section E). However, neither the type of prosody of the utterance voice Vx used for the control nor the type of prosody of the response voice Vy that is controlled is limited to pitch. For example, the prosody of the response voice Vy may be controlled according to the volume of the utterance voice Vx (one example of prosody), or according to the range of variation in the pitch or volume of the utterance voice Vx (another example of prosody). Likewise, the volume of the response voice Vy (one example of prosody), or the range of variation in its pitch or volume (another example of prosody), may be controlled according to the prosody of the utterance voice Vx.

（３）現実の人間同士の対話では、応答音声の韻律が発話音声の韻律に応じて一律に決定されるわけでは必ずしもない。すなわち、応答音声の韻律は、発話音声の韻律に依存するとともに発話音声の発音毎に変動し得るという傾向がある。以上の傾向を考慮すると、再生装置２６から再生される応答音声Ｖyの韻律（例えば音高や音量）を、応答生成部３６Aが発話音声Ｖx毎に変動させることも可能である。具体的には、前述の変形例の通り、最低音高Ｐminに調整値δpを加算または減算した音高となるように応答音声Ｖyの音高を調整する構成では、応答生成部３６Aは、発話音声Ｖxの発音毎に調整値δpを可変に制御する。例えば、応答生成部３６Aは、発話音声Ｖxの発音毎に所定の範囲内の乱数を発生させ、当該乱数を調整値δpとして設定する。以上の構成によれば、応答音声の韻律が発話音声の発音毎に変動し得るという現実の対話の傾向を模擬した自然な音声対話を実現することが可能である。 (3) In real human interaction, the prosody of the response speech is not necessarily determined uniformly according to the prosody of the speech speech. That is, the prosody of the response voice tends to depend on the prosody of the utterance voice and can vary for each pronunciation of the utterance voice. Considering the above tendency, the response generation unit 36A can change the prosody (for example, pitch and volume) of the response voice Vy reproduced from the playback device 26 for each utterance voice Vx. Specifically, as described above, in the configuration in which the pitch of the response voice Vy is adjusted to be the pitch obtained by adding or subtracting the adjustment value δp to the minimum pitch Pmin, the response generation unit 36A The adjustment value δp is variably controlled for each sound Vx. For example, the response generation unit 36A generates a random number within a predetermined range for each pronunciation of the uttered voice Vx, and sets the random number as the adjustment value δp. According to the above configuration, it is possible to realize a natural voice conversation simulating the tendency of an actual conversation in which the prosody of the response voice can be changed every time the uttered voice is pronounced.
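
The per-utterance randomization in modification (3) might look like the sketch below. It assumes a symmetric range around zero and a uniform distribution; the source only says "a random number within a predetermined range", so both choices, and the names `random_adjustment` and `response_target_pitch`, are illustrative assumptions.

```python
import random


def random_adjustment(max_abs_offset: float, rng: random.Random) -> float:
    """Draw an adjustment value delta_p from [-max_abs_offset, +max_abs_offset].

    Assumption: a uniform distribution over a symmetric range.
    """
    return rng.uniform(-max_abs_offset, max_abs_offset)


def response_target_pitch(p_min: float, max_abs_offset: float, rng: random.Random) -> float:
    """Target pitch for one utterance: Pmin plus a freshly drawn delta_p."""
    return p_min + random_adjustment(max_abs_offset, rng)


rng = random.Random(0)  # seeded only so that the demonstration is repeatable
for p_min in (50.0, 47.5, 52.0):  # one Pmin per utterance voice Vx
    print(round(response_target_pitch(p_min, 2.0, rng), 2))
```

Passing the generator in explicitly keeps the function testable; a deployed device would presumably use an unseeded generator.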

（４）第１実施形態では、１種類の音声信号Ｚの音高を調整して応答信号Ｙを生成したが、音高が相違する複数種の音声信号Ｚを応答信号Ｙの生成に利用することも可能である。例えば、複数種の音声信号Ｚのうち発話音声Ｖxの最低音高Ｐminに最も近似する音声信号Ｚの音高を調整して応答信号Ｙを生成する構成が想定され得る。 (4) In the first embodiment, the response signal Y is generated by adjusting the pitch of one type of audio signal Z. However, multiple types of audio signals Z having different pitches are used for generating the response signal Y. It is also possible. For example, a configuration in which the response signal Y is generated by adjusting the pitch of the voice signal Z that is closest to the lowest pitch Pmin of the uttered voice Vx among the plurality of types of voice signals Z can be assumed.
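
As a rough sketch of modification (4), the code below picks the pre-recorded signal Z whose pitch is nearest to Pmin and reports the residual shift still to be applied. The dictionary of candidate pitches and the function name `pick_nearest_signal` are hypothetical; pitches are again assumed to share one additive scale.

```python
from typing import Dict, Tuple


def pick_nearest_signal(candidate_pitches: Dict[str, float], p_min: float) -> Tuple[str, float]:
    """Choose the stored signal Z closest in pitch to Pmin.

    Returns the chosen signal's name and the remaining pitch shift
    (Pmin minus the signal's own pitch) needed to reach Pmin.
    """
    if not candidate_pitches:
        raise ValueError("at least one candidate signal is required")
    name = min(candidate_pitches, key=lambda k: abs(candidate_pitches[k] - p_min))
    return name, p_min - candidate_pitches[name]


# Hypothetical library of three recordings at different pitches.
library = {"low": 45.0, "mid": 52.0, "high": 60.0}
print(pick_nearest_signal(library, 50.0))
```

Starting from the nearest recording keeps the residual shift small, which generally limits the audible artifacts of pitch shifting.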

（５）第１実施形態では、応答音声Ｖyを再生装置２６から再生したが、音声取得部３２が取得した発話信号Ｘを再生装置２６に供給することで発話音声Ｖxも再生装置２６から再生することが可能である。発話音声Ｖxを再生装置２６から再生するか否かを利用者Ｕからの指示に応じて切替える構成も採用され得る。 (5) In the first embodiment, the response voice Vy is reproduced from the playback device 26, but the utterance voice Vx is also played back from the playback device 26 by supplying the utterance signal X acquired by the voice acquisition unit 32 to the playback device 26. It is possible. A configuration for switching whether or not to reproduce the uttered voice Vx from the playback device 26 in accordance with an instruction from the user U may be employed.

＜第２実施形態＞ <Second Embodiment>
本発明の第２実施形態を説明する。なお、以下に例示する各形態において作用や機能が第１実施形態と同様である要素については、第１実施形態の説明で使用した符号を流用して各々の詳細な説明を適宜に省略する。 A second embodiment of the present invention will be described. In each of the embodiments exemplified below, elements whose operation or function is the same as in the first embodiment are denoted by the reference signs used in the description of the first embodiment, and detailed description of each is omitted as appropriate.

図６は、本発明の第２実施形態に係る音声対話装置１００Bの構成図である。第２実施形態の音声対話装置１００Bは、第１実施形態の音声対話装置１００Aと同様に、利用者Ｕが発音した発話音声Ｖxに対する応答音声Ｖyを再生する。図６に例示される通り、第２実施形態の音声対話装置１００Bは、第１実施形態の音声対話装置１００Aの応答生成部３６Aを応答生成部３６Bに置換した構成である。音声対話装置１００Bの他の要素（音声入力装置２４，再生装置２６，音声取得部３２，音声解析部３４A）の構成や動作は第１実施形態と同様である。 FIG. 6 is a configuration diagram of a voice interaction device 100B according to the second embodiment of the present invention. Similarly to the voice interaction apparatus 100A of the first embodiment, the voice interaction apparatus 100B of the second embodiment reproduces a response voice Vy corresponding to the uttered voice Vx produced by the user U. As illustrated in FIG. 6, the voice interaction device 100B of the second embodiment has a configuration in which the response generation unit 36A of the voice interaction device 100A of the first embodiment is replaced with a response generation unit 36B. The configuration and operation of other elements (voice input device 24, playback device 26, voice acquisition unit 32, voice analysis unit 34A) of the voice interactive device 100B are the same as those in the first embodiment.

現実の人間同士の対話では、発話者の発話内容（疑問文であるか平叙文であるか）に応じた韻律で対話相手が応答音声を発音するという傾向が観測される。例えば、疑問文に対する応答音声と平叙文に対する応答音声とでは韻律が相違する。具体的には、疑問文に対する回答の音声は、平叙文に対する相鎚の音声と比較すると、例えば応答者の回答（肯定／否定）を発話者に明確に認識させる必要性から、比較的に大きい音量で抑揚（音量または音高の時間変動）を強調して発音される、という傾向がある。以上の傾向を考慮して、第２実施形態の応答生成部３６Bは、発話音声Ｖxによる発話内容（疑問文／平叙文の区別）に応じた韻律の応答音声Ｖyを再生装置２６に再生させる。 In dialogue between real people, the conversation partner tends to utter a response voice with a prosody that depends on what the speaker said (whether it was a question or a declarative sentence). For example, the prosody of a response to a question differs from that of a response to a declarative sentence. Specifically, compared with a backchannel to a declarative sentence, an answer to a question tends to be uttered at a relatively high volume and with emphasized intonation (temporal variation in volume or pitch), for example because the responder's answer (affirmative or negative) must be clearly conveyed to the speaker. In view of this tendency, the response generation unit 36B of the second embodiment causes the playback device 26 to play back a response voice Vy whose prosody depends on the content of the utterance voice Vx (question or declarative sentence).

図７には、疑問文の発話音声Ｖxの音高Ｐの推移が例示され、図８には、平叙文の発話音声Ｖxの音高Ｐの推移が例示されている。図７および図８から理解される通り、発話音声Ｖxの発話内容が疑問文である場合と平叙文である場合とでは、発話音声Ｖxのうち末尾の近傍における音高Ｐの推移（時間的な変動の傾向）が相違する、という傾向がある。具体的には、疑問文の発話音声Ｖxの音高Ｐは、図７に例示される通り、末尾区間Ｅ内で低下から上昇に転換または単調に上昇するが、平叙文の発話音声Ｖxの音高Ｐは、図８に例示される通り、末尾区間Ｅの始点ｔAから終点ｔBにかけて単調に低下する。したがって、発話音声Ｖxの末尾の近傍（末尾区間Ｅ）における音高Ｐの推移を解析することで、発話音声Ｖxの発話内容が疑問文および平叙文の何れに該当するかを推定することが可能である。 FIG. 7 illustrates the transition of the pitch P of an utterance voice Vx that is a question, and FIG. 8 illustrates that of an utterance voice Vx that is a declarative sentence. As understood from FIGS. 7 and 8, the transition of the pitch P near the end of the utterance voice Vx (the tendency of its temporal variation) tends to differ depending on whether the utterance is a question or a declarative sentence. Specifically, as illustrated in FIG. 7, the pitch P of a question either turns from falling to rising or rises monotonically within the tail section E, whereas, as illustrated in FIG. 8, the pitch P of a declarative sentence falls monotonically from the start point tA to the end point tB of the tail section E. Therefore, by analyzing the transition of the pitch P near the end of the utterance voice Vx (the tail section E), it is possible to estimate whether the utterance is a question or a declarative sentence.

以上の傾向を考慮して、第２実施形態の応答生成部３６Bは、発話音声Ｖxのうち末尾区間Ｅにおける音高Ｐの推移（すなわち疑問文／平叙文の区別）に応じた韻律の応答音声Ｖyを再生装置２６に再生させる。具体的には、図７に例示される通り、発話音声Ｖxの音高Ｐの推移が末尾区間Ｅ内で低下から上昇に転換する場合または発話音声Ｖxの音高Ｐが末尾区間Ｅ内で単調に上昇する場合（すなわち発話内容が疑問文であると推定される場合）には、疑問文に好適な韻律の応答音声Ｖyが再生装置２６から再生される。他方、図８に例示される通り、発話音声Ｖxの音高Ｐが末尾区間Ｅ内で単調に低下する場合（すなわち発話内容が平叙文であると推定される場合）には、平叙文に好適な韻律の応答音声Ｖyが再生装置２６から再生される。 In view of this tendency, the response generation unit 36B of the second embodiment causes the playback device 26 to play back a response voice Vy whose prosody depends on the transition of the pitch P in the tail section E of the utterance voice Vx (that is, on whether the utterance is a question or a declarative sentence). Specifically, as illustrated in FIG. 7, when the pitch P of the utterance voice Vx turns from falling to rising, or rises monotonically, within the tail section E (that is, when the utterance is estimated to be a question), a response voice Vy with a prosody suited to a question is played back from the playback device 26. On the other hand, as illustrated in FIG. 8, when the pitch P of the utterance voice Vx falls monotonically within the tail section E (that is, when the utterance is estimated to be a declarative sentence), a response voice Vy with a prosody suited to a declarative sentence is played back from the playback device 26.

図６に例示される通り、第２実施形態の音声対話装置１００Bの記憶装置２２は、特定の発話内容の応答音声Ｖyを事前に収録した応答信号ＹAおよび応答信号ＹBを記憶する。応答信号ＹAおよび応答信号ＹBは、発話内容（文字表記）は相互に共通するが韻律が相違する。具体的には、応答信号ＹAが表す応答音声Ｖyは、疑問文の発話音声Ｖxに対する肯定的な回答の意図で発音された「うん」の音声であり、応答信号ＹBが表す応答音声Ｖyは、平叙文の発話音声Ｖxに対する相鎚の意図で発音された「うん」の音声である。具体的には、応答信号ＹAの応答音声Ｖyは、応答信号ＹBの応答音声Ｖyと比較して音量が大きく、音量および音高の変動の範囲（すなわち抑揚）が広いという韻律の差異がある。第２実施形態の応答生成部３６Bは、記憶装置２２に記憶された応答信号ＹAおよび応答信号ＹBの何れかを再生装置２６に対して選択的に供給することで、韻律が相違する複数の応答音声Ｖyを選択的に再生させる。なお、応答信号ＹAと応答信号ＹBとで発音内容を相違させることも可能である。 As illustrated in FIG. 6, the storage device 22 of the voice interaction device 100B of the second embodiment stores a response signal YA and a response signal YB, each a prior recording of a response voice Vy with specific utterance content. The response signals YA and YB share the same utterance content (written form) but differ in prosody. Specifically, the response voice Vy represented by the response signal YA is "yeah" uttered as an affirmative answer to an utterance voice Vx that is a question, and the response voice Vy represented by the response signal YB is "yeah" uttered as a backchannel to an utterance voice Vx that is a declarative sentence. In terms of prosody, the response voice Vy of the response signal YA is louder than that of the response signal YB and has a wider range of variation in volume and pitch (that is, stronger intonation). The response generation unit 36B of the second embodiment selectively supplies either the response signal YA or the response signal YB stored in the storage device 22 to the playback device 26, thereby selectively playing back response voices Vy that differ in prosody. The utterance content of the response signals YA and YB may also be made different.

図９は、第２実施形態の応答生成部３６Bが応答音声Ｖyを再生装置２６に再生させるための応答生成処理ＳBのフローチャートである。第２実施形態では、第１実施形態で例示した図２の応答生成処理ＳAが図９の応答生成処理ＳBに置換される。応答生成処理ＳB以外の処理は第１実施形態と同様である。発話音声Ｖxの終了（Ｓ13：YES）を契機として図９の応答生成処理ＳBが開始される。 FIG. 9 is a flowchart of a response generation process SB for the response generation unit 36B of the second embodiment to cause the playback device 26 to play back the response voice Vy. In the second embodiment, the response generation process SA of FIG. 2 illustrated in the first embodiment is replaced with the response generation process SB of FIG. Processing other than the response generation processing SB is the same as in the first embodiment. The response generation process SB in FIG. 9 is started when the utterance voice Vx ends (S13: YES).

応答生成処理ＳBを開始すると、応答生成部３６Bは、発話音声Ｖxの末尾区間Ｅのうち第１区間Ｅ1内の複数の音高Ｐの平均（以下「第１平均音高」という）Ｐave1と、第２区間Ｅ2内の複数の音高Ｐの平均（以下「第２平均音高」という）Ｐave2とを算定する（ＳB1）。図７および図８に例示される通り、第１区間Ｅ1は、末尾区間Ｅのうち前方の区間（例えば末尾区間Ｅの始点ｔAを含む区間）であり、第２区間Ｅ2は、末尾区間Ｅのうち第１区間Ｅ1の後方の区間（例えば末尾区間Ｅの終点ｔBを含む区間）である。具体的には、末尾区間Ｅの前半が第１区間Ｅ1として画定され、末尾区間Ｅの後半が第２区間Ｅ2として画定される。ただし、第１区間Ｅ1および第２区間Ｅ2の条件は以上の例示に限定されない。例えば第１区間Ｅ1と第２区間Ｅ2とが間隔をあけて前後する構成や、第１区間Ｅ1と第２区間Ｅ2とで時間長を相違させた構成も採用され得る。 When the response generation process SB is started, the response generation unit 36B has an average (hereinafter referred to as “first average pitch”) Pave1 of a plurality of pitches P in the first interval E1 in the end interval E of the speech voice Vx, An average of a plurality of pitches P in the second section E2 (hereinafter referred to as “second average pitch”) Pave2 is calculated (SB1). As illustrated in FIGS. 7 and 8, the first section E1 is a front section (for example, a section including the start point tA of the end section E) of the end section E, and the second section E2 is the end section E. Of these, it is a section behind the first section E1 (for example, a section including the end point tB of the end section E). Specifically, the first half of the tail section E is defined as the first section E1, and the second half of the tail section E is defined as the second section E2. However, the conditions of the first section E1 and the second section E2 are not limited to the above examples. For example, a configuration in which the first interval E1 and the second interval E2 are moved back and forth with an interval, or a configuration in which the time lengths are different between the first interval E1 and the second interval E2 can be adopted.

応答生成部３６Bは、第１区間Ｅ1の第１平均音高Ｐave1と第２区間Ｅ2の第２平均音高Ｐave2とを比較し、第１平均音高Ｐave1が第２平均音高Ｐave2を下回るか否かを判定する（ＳB2）。前述の通り、疑問文の発話音声Ｖxの音高Ｐの推移は末尾区間Ｅ内で低下から上昇に転換または単調に上昇するという傾向がある。したがって、図７に例示される通り、第１平均音高Ｐave1は第２平均音高Ｐave2を下回る可能性が高い（Ｐave1＜Ｐave2）。他方、平叙文の発話音声Ｖxの音高Ｐは末尾区間Ｅ内で単調に低下するという傾向がある。したがって、図８に例示される通り、第１平均音高Ｐave1は第２平均音高Ｐave2を上回る可能性が高い（Ｐave1＞Ｐave2）。 The response generation unit 36B compares the first average pitch Pave1 of the first section E1 with the second average pitch Pave2 of the second section E2 and determines whether the first average pitch Pave1 is lower than the second average pitch Pave2 (SB2). As described above, the pitch P of an utterance voice Vx that is a question tends to turn from falling to rising, or to rise monotonically, within the tail section E. Therefore, as illustrated in FIG. 7, the first average pitch Pave1 is likely to be lower than the second average pitch Pave2 (Pave1 < Pave2). The pitch P of an utterance voice Vx that is a declarative sentence, on the other hand, tends to fall monotonically within the tail section E. Therefore, as illustrated in FIG. 8, the first average pitch Pave1 is likely to be higher than the second average pitch Pave2 (Pave1 > Pave2).

以上の傾向を考慮して、第１平均音高Ｐave1が第２平均音高Ｐave2を下回る場合（ＳB2：YES）、すなわち、発話音声Ｖxが疑問文である可能性が高い場合には、第２実施形態の応答生成部３６Bは、疑問文に対する回答の応答音声Ｖyに対応する応答信号ＹAを記憶装置２２から選択する（ＳB3）。他方、第１平均音高Ｐave1が第２平均音高Ｐave2を上回る場合（ＳB2：NO）、すなわち、発話音声Ｖxが平叙文である可能性が高い場合には、応答生成部３６Bは、平叙文に対する同意の応答音声Ｖyに対応する応答信号ＹBを記憶装置２２から選択する（ＳB4）。 In view of this tendency, when the first average pitch Pave1 is lower than the second average pitch Pave2 (SB2: YES), that is, when the utterance voice Vx is likely to be a question, the response generation unit 36B of the second embodiment selects from the storage device 22 the response signal YA, which corresponds to a response voice Vy answering a question (SB3). On the other hand, when the first average pitch Pave1 is higher than the second average pitch Pave2 (SB2: NO), that is, when the utterance voice Vx is likely to be a declarative sentence, the response generation unit 36B selects from the storage device 22 the response signal YB, which corresponds to a response voice Vy expressing agreement with a declarative sentence (SB4).
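
Steps SB1 to SB4 can be sketched as below, under the assumption that the pitch P in the tail section E is available as a list of per-frame values and that E1 and E2 are simply the first and second halves of that list. The function name `select_response_signal` is hypothetical, and the string labels stand in for the stored signals YA and YB.

```python
from typing import Sequence


def select_response_signal(tail_pitches: Sequence[float]) -> str:
    """Pick "YA" (answer to a question) or "YB" (agreement with a declarative).

    tail_pitches -- pitch P sampled over the tail section E, oldest first.
    The first half is treated as section E1 and the second half as E2.
    """
    half = len(tail_pitches) // 2
    if half == 0:
        raise ValueError("need at least two pitch samples in the tail section")
    first, second = tail_pitches[:half], tail_pitches[half:]
    p_ave1 = sum(first) / len(first)    # first average pitch Pave1 (SB1)
    p_ave2 = sum(second) / len(second)  # second average pitch Pave2 (SB1)
    # SB2: YES only when Pave1 is lower than Pave2; every other case is NO.
    return "YA" if p_ave1 < p_ave2 else "YB"


print(select_response_signal([52.0, 50.0, 48.0, 50.0, 54.0, 58.0]))  # falls, then rises
print(select_response_signal([58.0, 55.0, 52.0, 50.0, 48.0, 46.0]))  # falls monotonically
```

In this sketch the tie case (Pave1 equal to Pave2) follows the SB2: NO branch, since SB2 only tests whether Pave1 is lower than Pave2.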

発話音声Ｖxの音高Ｐの推移に応じた応答信号Ｙ（Ｙ1，Ｙ2）を以上の手順で選択すると、応答生成部３６Bは、第１実施形態と同様に、応答再生点ｔyの到来（ＳB5：YES）を契機として当該応答信号Ｙを再生装置２６に供給することで応答音声Ｖyを再生させる（ＳB6）。具体的には、発話音声Ｖxの音高Ｐが末尾区間Ｅ内で低下から上昇に転換する場合または発話音声Ｖxの音高Ｐが末尾区間Ｅ内で単調に上昇する場合（ＳB2：YES）には疑問文に対する回答の応答音声Ｖyが再生され、発話音声Ｖxの音高Ｐが末尾区間Ｅ内で単調に低下する場合（ＳB2：NO）には平叙文に対する同意の応答音声Ｖyが再生される。すなわち、再生装置２６から再生される応答音声Ｖyの韻律は、発話音声Ｖxが疑問文である場合と平叙文である場合とで相違する。 Having selected, by the above procedure, the response signal Y (Y1, Y2) that corresponds to the transition of the pitch P of the utterance voice Vx, the response generation unit 36B, as in the first embodiment, supplies that response signal Y to the playback device 26 upon arrival of the response playback point ty (SB5: YES), so that the response voice Vy is played back (SB6). Specifically, when the pitch P of the utterance voice Vx turns from falling to rising, or rises monotonically, within the tail section E (SB2: YES), a response voice Vy answering a question is played back; when the pitch P falls monotonically within the tail section E (SB2: NO), a response voice Vy expressing agreement with a declarative sentence is played back. That is, the prosody of the response voice Vy played back from the playback device 26 differs depending on whether the utterance voice Vx is a question or a declarative sentence.

音声取得部３２による発話信号Ｘの取得（Ｓ11）と、音声解析部３４Aによる音高Ｐの特定（Ｓ12）と、応答生成部３６Bによる応答生成処理ＳBとは、音声対話の終了が利用者Ｕから指示されるまで反復される（Ｓ14：NO）。したがって、第１実施形態と同様に、利用者Ｕによる任意の発話音声Ｖxの発音と、当該発話音声Ｖxに対する応答音声Ｖyの再生とが交互に反復される音声対話が実現される。 The acquisition of the utterance signal X by the voice acquisition unit 32 (S11), the identification of the pitch P by the voice analysis unit 34A (S12), and the response generation process SB by the response generation unit 36B are repeated until the user U instructs the end of the voice dialogue (S14: NO). Therefore, as in the first embodiment, a voice dialogue is realized in which the utterance of an arbitrary utterance voice Vx by the user U and the playback of a response voice Vy to that utterance alternate repeatedly.

以上に説明した通り、第２実施形態では、発話音声Ｖxの末尾区間Ｅにおける音高Ｐの推移に応じた韻律の応答音声Ｖyが再生装置２６から再生される。したがって、発話者の発話内容に応じた韻律で対話相手が応答音声を発音するという現実の対話の傾向を模擬した自然な音声対話を実現することが可能である。第２実施形態では特に、末尾区間Ｅ内で音高Ｐの推移が低下から上昇に転換する場合または末尾区間Ｅ内で音高Ｐが単調に上昇する場合と、末尾区間Ｅの始点ｔAから終点ｔBにかけて音高Ｐが単調に低下する場合とで応答音声Ｖyの韻律が相違するから、疑問文と平叙文とで応答音声の韻律が相違するという現実の対話の傾向を模擬した自然な音声対話を実現することが可能である。 As described above, in the second embodiment, a response voice Vy whose prosody depends on the transition of the pitch P in the tail section E of the utterance voice Vx is played back from the playback device 26. It is therefore possible to realize a natural voice dialogue that simulates the real-world tendency for a conversation partner to respond with a prosody that depends on what the speaker said. In the second embodiment in particular, the prosody of the response voice Vy differs between the case where the pitch P turns from falling to rising, or rises monotonically, within the tail section E and the case where the pitch P falls monotonically from the start point tA to the end point tB of the tail section E. This makes it possible to realize a natural voice dialogue that simulates the real-world tendency for the prosody of a response to differ between questions and declarative sentences.

また、第２実施形態では、末尾区間Ｅのうち第１区間Ｅ1内の第１平均音高Ｐave1と第２区間Ｅ2の第２平均音高Ｐave2とを比較した結果に応じて応答音声Ｖyの韻律を相違させるから、複数の音高Ｐの平均および比較という簡便な処理で音高Ｐの推移を評価できる（ひいては応答音声Ｖyの韻律を選択できる）という利点がある。 Further, in the second embodiment, the prosody of the response voice Vy according to the result of comparing the first average pitch Pave1 in the first section E1 and the second average pitch Pave2 in the second section E2 in the end section E. Therefore, there is an advantage that transition of the pitch P can be evaluated by simple processing of averaging and comparing a plurality of pitches P (and thus the prosody of the response voice Vy can be selected).

＜第２実施形態の変形例＞ <Modification of Second Embodiment>
（１）第２実施形態では、記憶装置２２に事前に記憶された複数の応答信号Ｙ（ＹA，ＹB）を選択的に再生装置２６に供給したが、事前に収録された単一の応答信号Ｙを調整することで、発話音声Ｖxの末尾区間Ｅ内の音高Ｐの推移に応じた韻律の応答信号Ｙを応答生成部３６Bが生成することも可能である。例えば、平叙文に対する応答音声Ｖyの応答信号ＹAを記憶装置２２に保持した構成を想定すると、応答生成部３６Bは、発話音声Ｖxが疑問文である場合、応答信号ＹAの音量を増加させるとともに音量および音高の変動の範囲を拡大することで回答の応答音声Ｖyの応答信号ＹBを生成する一方、発話音声Ｖxが平叙文である場合には応答信号ＹAを再生装置２６に供給する。なお、初期的な応答信号Ｙの音量を減少させるとともに音量および音高の変動の範囲を縮小することで、平叙文に対する同意の応答音声Ｖyの応答信号ＹAを生成することも可能である。 (1) In the second embodiment, a plurality of response signals Y (YA, YB) stored in advance in the storage device 22 are selectively supplied to the playback device 26. Alternatively, the response generation unit 36B may generate a response signal Y whose prosody depends on the transition of the pitch P in the tail section E of the utterance voice Vx by adjusting a single pre-recorded response signal Y. For example, assuming a configuration in which the storage device 22 holds the response signal YA of a response voice Vy to a declarative sentence, when the utterance voice Vx is a question the response generation unit 36B generates the response signal YB of an answering response voice Vy by increasing the volume of the response signal YA and widening its range of variation in volume and pitch, whereas when the utterance voice Vx is a declarative sentence it supplies the response signal YA to the playback device 26. Conversely, the response signal YA of a response voice Vy expressing agreement with a declarative sentence may be generated by decreasing the volume of an initial response signal Y and narrowing its range of variation in volume and pitch.
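
The adjustment in modification (1) can be illustrated at the level of prosody contours rather than audio samples. The sketch below assumes the stored response is described by a per-frame contour (volume or pitch) and that "widening the range of variation" means scaling deviations from the contour mean; both are simplifying assumptions, and `adjust_contour`, `gain`, and `range_scale` are hypothetical names. Resynthesizing audio from the adjusted contours is outside the sketch.

```python
from typing import List, Sequence


def adjust_contour(values: Sequence[float], gain: float = 1.0, range_scale: float = 1.0) -> List[float]:
    """Scale the variation of a contour about its mean, then apply an overall gain.

    range_scale > 1 widens the variation (stronger intonation), < 1 narrows it.
    gain > 1 raises the overall level; use gain = 1.0 for a pitch contour.
    """
    if not values:
        raise ValueError("contour must not be empty")
    mean = sum(values) / len(values)
    return [gain * (mean + range_scale * (v - mean)) for v in values]


# Hypothetical volume contour of the stored backchannel recording.
volume = [2.0, 4.0, 6.0]
# Louder and more strongly inflected version, as for an answer to a question.
print(adjust_contour(volume, gain=1.5, range_scale=1.5))
```

Using gain and range_scale values below 1 gives the opposite adjustment, that is, deriving the quieter, flatter backchannel from a more emphatic initial recording.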

１個の応答信号Ｙに対する調整で相異なる韻律の応答信号Ｙを生成する構成によれば、韻律が相違する複数の応答信号Ｙ（ＹA，ＹB）を記憶装置２２に保持する必要がないから、記憶装置２２に必要な記憶容量が削減されるという利点がある。他方、韻律が相違する複数の応答信号Ｙを選択的に利用する第２実施形態の構成によれば、初期的な応答信号Ｙの韻律を発話音声Ｖxの発話内容に応じて調整する必要がないから、応答生成部３６Bの処理負荷が軽減されるという利点がある。 According to the configuration in which the response signal Y having different prosody is generated by adjusting the one response signal Y, it is not necessary to store a plurality of response signals Y (YA, YB) having different prosody in the storage device 22. There is an advantage that the storage capacity required for the storage device 22 is reduced. On the other hand, according to the configuration of the second embodiment that selectively uses a plurality of response signals Y having different prosody, there is no need to adjust the initial prosody of the response signal Y according to the utterance content of the utterance voice Vx. Therefore, there is an advantage that the processing load of the response generation unit 36B is reduced.

（２）第２実施形態では、末尾区間Ｅのうち第１区間Ｅ1内の第１平均音高Ｐave1と第２区間Ｅ2内の第２平均音高Ｐave2とを比較したが、発話音声Ｖxの発話内容が疑問文および平叙文の何れに該当するかを推定するための方法は以上の例示に限定されない。例えば、平叙文の発話音声Ｖxでは末尾区間Ｅ内で音高Ｐが単調に低下するから、音高Ｐは末尾区間Ｅの終点ｔBで最低音高Ｐminとなる傾向がある。したがって、末尾区間Ｅのうち音高Ｐが最低音高Ｐminとなる時点の後方の区間の時間長が前方の区間と比較して充分に短い場合（例えば所定の閾値を下回る場合）に、発話音声Ｖxの発話内容が平叙文に該当すると推定することも可能である。また、末尾区間Ｅのうち最低音高Ｐminの時点の前後における音高Ｐの遷移に応じて、発話音声Ｖxの発話内容が疑問文および平叙文の何れに該当するかを推定することも可能である。例えば、末尾区間Ｅのうち最低音高Ｐminの時点の経過後に音高Ｐが上昇する場合、応答生成部３６Bは、発話音声Ｖxの発話内容が疑問文に該当すると推定する。 (2) In the second embodiment, the first average pitch Pave1 in the first section E1 and the second average pitch Pave2 in the second section E2 of the tail section E are compared, but the method for estimating whether the utterance voice Vx is a question or a declarative sentence is not limited to this example. For example, because the pitch P of an utterance voice Vx that is a declarative sentence falls monotonically within the tail section E, the pitch P tends to reach the minimum pitch Pmin at the end point tB of the tail section E. It is therefore also possible to estimate that the utterance is a declarative sentence when, within the tail section E, the section after the time point at which the pitch P reaches the minimum pitch Pmin is sufficiently short compared with the section before it (for example, when its length is below a predetermined threshold). It is likewise possible to estimate whether the utterance is a question or a declarative sentence from the transition of the pitch P before and after the time point of the minimum pitch Pmin within the tail section E. For example, when the pitch P rises after the time point of the minimum pitch Pmin within the tail section E, the response generation unit 36B estimates that the utterance is a question.
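
The two alternative estimates in modification (2) can be combined into one rough sketch. It assumes per-frame pitch values for the tail section E, measures the rear section in frames, and uses an arbitrary default threshold; the function name `classify_by_min_position`, the threshold, and the fallback for a flat rear section are illustrative assumptions, not details from the source.

```python
from typing import Sequence


def classify_by_min_position(tail_pitches: Sequence[float], max_rear_frames: int = 2) -> str:
    """Estimate "question" or "declarative" from where Pmin falls in the tail section.

    tail_pitches    -- pitch P sampled over the tail section E, oldest first
    max_rear_frames -- hypothetical threshold on the length after the Pmin point
    """
    if not tail_pitches:
        raise ValueError("tail section must contain at least one pitch sample")
    # Index of the (first) frame at which the pitch reaches Pmin.
    i_min = min(range(len(tail_pitches)), key=lambda i: tail_pitches[i])
    rear_frames = len(tail_pitches) - 1 - i_min
    # Rule 1: Pmin sits at or very near the end point tB -> declarative.
    if rear_frames < max_rear_frames:
        return "declarative"
    # Rule 2: the pitch rises after the Pmin point -> question.
    if tail_pitches[-1] > tail_pitches[i_min]:
        return "question"
    # Assumption: a flat rear section with no rise is treated as declarative.
    return "declarative"


print(classify_by_min_position([58.0, 55.0, 52.0, 50.0, 48.0, 46.0]))
print(classify_by_min_position([52.0, 50.0, 48.0, 50.0, 54.0, 58.0]))
```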

＜第３実施形態＞ <Third Embodiment>
図１０は、本発明の第３実施形態に係る音声対話装置１００Cの構成図である。第３実施形態の音声対話装置１００Cは、第１実施形態の音声対話装置１００Aと同様に、利用者Ｕが発音した発話音声Ｖxに対する応答音声Ｖyを再生する。第３実施形態では、発話音声Ｖxに対する回答または相鎚の応答音声（以下「第２応答音声」という）Ｖy2のほか、発話音声Ｖxに対する問返しを表す応答音声（以下「第１応答音声」という）Ｖy1が再生装置２６から再生され得る。第１応答音声Ｖy1は、発話音声Ｖxを発話者に対して聞き直すための「え？」「なに？」等の音声である。図１０に例示される通り、第３実施形態の音声対話装置１００Cの記憶装置２２は、問返しの第１応答音声Ｖy1を収録した応答信号Ｙ1と、問返し以外（例えば「うん」等の相鎚）の第２応答音声Ｖy2を収録した応答信号Ｙ2とを記憶する。 FIG. 10 is a configuration diagram of a voice interaction device 100C according to the third embodiment of the present invention. Like the voice interaction device 100A of the first embodiment, the voice interaction device 100C of the third embodiment plays back a response voice Vy to an utterance voice Vx uttered by the user U. In the third embodiment, in addition to a response voice Vy2 that is an answer or backchannel to the utterance voice Vx (hereinafter "second response voice"), a response voice Vy1 that asks the speaker to repeat the utterance voice Vx (hereinafter "first response voice") can be played back from the playback device 26. The first response voice Vy1 is a voice such as "Eh?" or "What?" used to ask the speaker to repeat the utterance voice Vx. As illustrated in FIG. 10, the storage device 22 of the voice interaction device 100C of the third embodiment stores a response signal Y1 that records the ask-back first response voice Vy1 and a response signal Y2 that records a second response voice Vy2 other than an ask-back (for example, a backchannel such as "yeah").

図１０に例示される通り、第３実施形態の音声対話装置１００Cは、第１実施形態の音声対話装置１００Aの音声解析部３４Aおよび応答生成部３６Aを、音声解析部３４Cおよび応答生成部３６Cに置換した構成である。音声対話装置１００Cの他の要素（音声入力装置２４，再生装置２６，音声取得部３２）の構成および動作は第１実施形態と同様である。 As illustrated in FIG. 10, the voice interaction device 100C of the third embodiment replaces the voice analysis unit 34A and the response generation unit 36A of the voice interaction device 100A of the first embodiment with the voice analysis unit 34C and the response generation unit 36C. This is a replacement configuration. The configuration and operation of the other elements (voice input device 24, playback device 26, voice acquisition unit 32) of the voice interactive device 100C are the same as in the first embodiment.

第３実施形態の音声解析部３４Cは、音声取得部３２が取得した発話信号Ｘから韻律指標値Ｑを特定する。韻律指標値Ｑは、発話音声Ｖxの韻律に関する指標値であり、発話音声Ｖx毎（発話音声Ｖxの始点から終点までの一連の発話を単位としたときの単位毎）に算定される。具体的には、発話音声Ｖxの発話区間内の音高の平均値、音高の変動幅、音量の平均値、または音量の変動幅が、韻律指標値Ｑとして発話信号Ｘから算定される。第３実施形態の応答生成部３６Cは、前述の通り、発話音声Ｖxに対する問返しを表す第１応答音声Ｖy1と問返し以外の第２応答音声Ｖy2とを選択的に再生装置２６に再生させる。 The voice analysis unit 34C of the third embodiment identifies a prosodic index value Q from the utterance signal X acquired by the voice acquisition unit 32. The prosodic index value Q is an index of the prosody of the utterance voice Vx and is calculated for each utterance voice Vx (that is, per unit, where one unit is a continuous utterance from the start point to the end point of an utterance voice Vx). Specifically, the average pitch, the range of pitch variation, the average volume, or the range of volume variation within the utterance section of the utterance voice Vx is calculated from the utterance signal X as the prosodic index value Q. As described above, the response generation unit 36C of the third embodiment causes the playback device 26 to selectively play back the first response voice Vy1, which asks the speaker to repeat the utterance voice Vx, and the second response voice Vy2, which is a response other than an ask-back.

現実の人間同士の対話では、発話者の発話音声の韻律が変動した場合に、対話相手が発話音声を聴取し難くなって問返しの必要性が高まる、という傾向がある。具体的には、発話者の発話音声の韻律が当該発話者の過去の韻律の傾向から乖離する場合（例えば過去の傾向から対話相手が想定する音量と比較して実際の発話音声の音量が小さい場合）に、対話相手が発話音声を適切に聴取できず、結果的に発話者に対する問返しが発生する可能性が高い。以上の傾向を考慮して、第３実施形態の応答生成部３６Cは、音声解析部３４Cが特定した韻律指標値Ｑを閾値ＱTHと比較し、比較の結果に応じて第１応答音声Ｖy1および第２応答音声Ｖy2の何れかを再生装置２６に再生させる。閾値ＱTHは、利用者Ｕが過去に発話した発話音声Ｖxの韻律指標値Ｑの代表値（例えば平均値）に設定される。すなわち、閾値ＱTHは、利用者Ｕの過去の発話から推定される標準的な韻律に相当する。そして、発話音声Ｖxの韻律指標値Ｑが閾値ＱTHから乖離する場合には問返しの第１応答音声Ｖy1が再生され、韻律指標値Ｑが閾値ＱTHに近似する場合には相鎚の第２応答音声Ｖy2が再生される。 In dialogue between real people, when the prosody of a speaker's utterance changes, the conversation partner tends to find the utterance harder to hear, and the need to ask the speaker to repeat it increases. Specifically, when the prosody of the speaker's utterance deviates from that speaker's past prosodic tendency (for example, when the actual volume is lower than the volume the partner expects from the past tendency), the partner is likely to be unable to hear the utterance properly and, as a result, to ask the speaker to repeat it. In view of this tendency, the response generation unit 36C of the third embodiment compares the prosodic index value Q identified by the voice analysis unit 34C with a threshold QTH and, according to the result of the comparison, causes the playback device 26 to play back either the first response voice Vy1 or the second response voice Vy2. The threshold QTH is set to a representative value (for example, the average) of the prosodic index values Q of utterance voices Vx uttered by the user U in the past. That is, the threshold QTH corresponds to the standard prosody estimated from the user U's past utterances. When the prosodic index value Q of an utterance voice Vx deviates from the threshold QTH, the ask-back first response voice Vy1 is played back; when the prosodic index value Q is close to the threshold QTH, the backchannel second response voice Vy2 is played back.

図１１は、第３実施形態の制御装置２０が実行する処理のフローチャートである。例えば音声対話装置１００Cに対する利用者Ｕからの指示（例えば音声対話用のプログラムの起動指示）を契機として図１１の処理が開始される。 FIG. 11 is a flowchart of processing executed by the control device 20 of the third embodiment. For example, the process in FIG. 11 is started in response to an instruction from the user U (for example, an instruction to start a program for voice conversation) to the voice conversation apparatus 100C.

第１実施形態と同様に、発話音声Ｖxが開始されると（Ｓ20：YES）、音声取得部３２は、音声入力装置２４から発話信号Ｘを取得して記憶装置２２に格納する（Ｓ21）。音声解析部３４Cは、音声取得部３２が取得した発話信号Ｘから、発話音声Ｖxの韻律に関する特徴量ｑを特定する（Ｓ22）。特徴量ｑは、例えば発話音声Ｖxの音高Ｐまたは音量である。音声取得部３２による発話信号Ｘの取得（Ｓ21）と音声解析部３４Cによる特徴量ｑの特定（Ｓ22）とは、発話音声Ｖxの終了まで反復される（Ｓ23：NO）。すなわち、発話音声Ｖxの始点から終点ｔBまでの発話区間について当該発話音声Ｖxの複数の特徴量ｑの時系列が特定される。 As in the first embodiment, when the utterance voice Vx is started (S20: YES), the voice acquisition unit 32 acquires the utterance signal X from the voice input device 24 and stores it in the storage device 22 (S21). The voice analysis unit 34C specifies the feature quantity q related to the prosody of the uttered voice Vx from the utterance signal X acquired by the voice acquisition unit 32 (S22). The feature quantity q is, for example, the pitch P or volume of the speech voice Vx. The acquisition of the utterance signal X by the voice acquisition unit 32 (S21) and the specification of the feature quantity q by the voice analysis unit 34C (S22) are repeated until the end of the utterance voice Vx (S23: NO). That is, a time series of a plurality of feature quantities q of the utterance voice Vx is specified for the utterance section from the start point to the end point tB of the utterance voice Vx.

When the utterance voice Vx ends (S23: YES), the voice analysis unit 34C calculates the prosodic index value Q from the time series of the plurality of feature quantities q specified for the utterance section from the start point to the end point of the utterance voice Vx (S24). Specifically, the voice analysis unit 34C calculates the average value or the variation width (range) of the plurality of feature quantities q in the utterance section as the prosodic index value Q.
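The reduction in step S24 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `prosody_index` and the `mode` parameter are assumptions introduced here, and the feature values q are taken to be a plain list of per-frame pitch or volume values.

```python
from statistics import fmean


def prosody_index(features, mode="mean"):
    """Reduce the feature values q of one utterance to a single index value Q.

    `features` is the time series of q (e.g. pitch or volume per frame)
    collected between the start and end of the utterance. `mode` is a
    hypothetical switch between the two reductions named in the text.
    """
    if not features:
        raise ValueError("no feature values were collected for this utterance")
    if mode == "mean":
        # Average of q over the utterance section.
        return fmean(features)
    if mode == "range":
        # Variation width of q over the utterance section.
        return max(features) - min(features)
    raise ValueError(f"unknown mode: {mode}")
```

For example, a pitch series of 100, 120, and 140 would give Q = 120 in "mean" mode and Q = 40 in "range" mode.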

When the prosodic index value Q of the current utterance voice Vx has been calculated by the process described above, the response generation unit 36C executes a response generation process SC for causing the playback device 26 to play back a response voice Vy. The response generation process SC of the third embodiment causes the playback device 26 to selectively play back either the first response voice Vy1 or the second response voice Vy2 according to the prosodic index value Q calculated by the voice analysis unit 34C.

When the response generation process SC is completed, the voice analysis unit 34C updates the threshold value QTH according to the prosodic index value Q of the current utterance voice Vx (S25). Specifically, the voice analysis unit 34C calculates, as the updated threshold value QTH, a representative value (for example, the average or the median) of the prosodic index values Q of the past utterance voices Vx including the current one. For example, as expressed by formula (1) below, the weighted average (exponential moving average) of the current prosodic index value Q and the threshold value QTH before the update is calculated as the updated threshold value QTH. The symbol α in formula (1) is a predetermined positive number less than 1 (a forgetting factor).
QTH = α・Q + (1 − α)・QTH ……(1)
As understood from the above description, the voice analysis unit 34C of the third embodiment functions as an element that sets the representative value of the prosodic index values Q of a plurality of past utterance voices Vx as the threshold value QTH. Each time an utterance voice Vx is uttered, the threshold value QTH is updated to a value reflecting the prosodic index value Q of that utterance voice Vx, and thus becomes a value corresponding to the standard prosody estimated from the user U's utterances over multiple occasions. However, the threshold value QTH can also be fixed to a predetermined value. For example, the average of the prosodic index values Q specified from the utterance voices of a large number of unspecified speakers can be set as the threshold value QTH.
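Formula (1) is a one-line exponential moving average. The sketch below shows it under stated assumptions: the function name `update_threshold` and the default `alpha=0.1` are illustrative choices, since the text only requires α to be a positive number less than 1.

```python
def update_threshold(q_th, q, alpha=0.1):
    """Return the updated threshold QTH per formula (1).

    q_th  : threshold before the update
    q     : prosodic index value Q of the current utterance
    alpha : forgetting factor, 0 < alpha < 1 (the default is an assumed value)
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be a positive number less than 1")
    # Weighted average of the current Q and the previous threshold.
    return alpha * q + (1.0 - alpha) * q_th
```

A larger α makes the threshold follow recent utterances more quickly; a smaller α makes it reflect the longer-term tendency of the user.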

The acquisition of the utterance signal X by the voice acquisition unit 32 (S21), the calculation of the prosodic index value Q by the voice analysis unit 34C (S22, S24), the response generation process SC by the response generation unit 36C, and the update of the threshold value QTH by the voice analysis unit 34C (S25) are repeated for each utterance voice Vx until the user U instructs the end of the voice dialogue (S26: NO). Accordingly, a voice dialogue is realized in which the utterance of the utterance voice Vx by the user U and the selective playback of the first response voice Vy1 (ask-back) or the second response voice Vy2 (backchannel) are alternately repeated.

FIG. 12 is a flowchart of the response generation process SC of the third embodiment. When the response generation process SC starts, the response generation unit 36C compares the prosodic index value Q specified by the voice analysis unit 34C with the current threshold value QTH and determines whether the prosodic index value Q falls within a predetermined range R containing the threshold value QTH (hereinafter the "allowable range") (SC1). FIGS. 13 and 14 illustrate the transition of the feature quantity q that the voice analysis unit 34C specifies from the utterance voice Vx. As illustrated in FIGS. 13 and 14, the allowable range R is a range of predetermined width centered on the threshold value QTH. The process of comparing the prosodic index value Q with the threshold value QTH (SC1) can also be realized as a process of determining whether the absolute value of the difference between Q and QTH exceeds a predetermined value (for example, half the width of the allowable range R).

FIG. 13 assumes a case where the prosodic index value Q lies inside the allowable range R. That the prosodic index value Q is contained in the allowable range R means that the prosody of the current utterance voice Vx is close to the standard prosody of the user U (the tendency of past utterances). In terms of dialogue between real people, this can be evaluated as a situation in which the conversation partner can easily hear the utterance (a situation in which an ask-back directed at the speaker is unlikely to be needed). Therefore, when the prosodic index value Q lies inside the allowable range R (SC1: YES), the response generation unit 36C selects, from the storage device 22, the response signal Y2 of the second response voice Vy2, which is a backchannel to the utterance voice Vx (SC2).

FIG. 14, on the other hand, assumes a case where the prosodic index value Q lies outside the allowable range R (specifically, below the lower limit of the allowable range R). That the prosodic index value Q is not contained in the allowable range R means that the prosody of the current utterance voice Vx deviates from the standard prosody of the user U. In terms of dialogue between real people, this can be evaluated as a situation in which the conversation partner has difficulty hearing the utterance (a situation in which an ask-back directed at the speaker is likely to be needed). Therefore, when the prosodic index value Q lies outside the allowable range R (SC1: NO), the response generation unit 36C selects, from the storage device 22, the response signal Y1 of the first response voice Vy1, which is an ask-back directed at the utterance voice Vx (for example, a voice such as "Eh?" or "What?"), as the signal to be supplied to the playback device 26 (SC3).

Having selected the response signal Y (the response voice Vy to be played back) according to the prosodic index value Q by the above procedure, the response generation unit 36C, as in the first embodiment, supplies the response signal Y to the playback device 26 when the response playback point ty arrives (SC4: YES), thereby causing the response voice Vy (the first response voice Vy1 or the second response voice Vy2) to be played back (SC5). That is, when the prosodic index value Q is contained in the allowable range R, the backchannel second response voice Vy2 is played back, and when the prosodic index value Q is not contained in the allowable range R, the ask-back first response voice Vy1 is played back.
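The decision in step SC1 and the selection in steps SC2 and SC3 can be sketched as a comparison against the half-width of the allowable range R. The function name `select_response`, the string labels, and the treatment of the exact boundary as inside R are assumptions made for illustration; the text only states that the difference "exceeds" a predetermined value in the ask-back case.

```python
def select_response(q, q_th, half_width):
    """Choose which response voice to play for one utterance.

    q          : prosodic index value Q of the current utterance
    q_th       : current threshold QTH (center of the allowable range R)
    half_width : half of the width of R (an assumed, pre-set constant)

    Returns "Vy1" (ask-back, e.g. "Eh?") when Q lies outside R and
    "Vy2" (backchannel) when Q lies inside R.
    """
    if half_width < 0:
        raise ValueError("half_width must not be negative")
    if abs(q - q_th) > half_width:
        return "Vy1"  # Q deviates from the user's standard prosody (SC3)
    return "Vy2"      # Q is close to the user's standard prosody (SC2)
```

Because the test is two-sided, an utterance that is unusually loud or high is treated the same way as one that is unusually quiet or low.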

As described above, in the third embodiment, the first response voice Vy1 representing an ask-back directed at the utterance voice Vx and the second response voice Vy2 other than an ask-back are selectively played back from the playback device 26. It is therefore possible to realize a natural voice dialogue that simulates the tendency of real dialogue in which not only backchannels to the speaker's utterance but also ask-backs (requests to repeat) directed at the speaker occur from time to time.

Furthermore, in the third embodiment, either the first response voice Vy1 or the second response voice Vy2 is selected according to the result of comparing the prosodic index value Q, which represents the prosody of the utterance voice Vx, with the threshold value QTH. It is therefore possible to realize a natural voice dialogue that simulates the tendency of real dialogue in which an unexpected change in the prosody of an utterance makes it hard to hear and increases the need to ask back. In particular, because the representative value of the prosodic index values Q over a plurality of past utterance voices Vx is set as the threshold value QTH, there is the further advantage of realizing a natural voice dialogue that simulates the tendency for the conversation partner to ask back when the speaker's prosody deviates from that speaker's standard prosody (that is, the prosody the partner expects). Moreover, the first response voice Vy1 is selected when the prosodic index value Q lies outside the allowable range R containing the threshold value QTH, and the second response voice Vy2 is selected when it lies inside the allowable range R. Compared with a configuration that selects between the first response voice Vy1 and the second response voice Vy2 solely according to the magnitude relationship between Q and QTH, this reduces the possibility that the first response voice Vy1 is played back excessively often (that is, the first response voice Vy1 is played back at a moderate frequency).

<Modification of Third Embodiment>
In the third embodiment, playback of the first response voice Vy1 or of the second response voice Vy2 is selected according to the prosodic index value Q of the utterance voice Vx, but the ask-back first response voice Vy1 can also be played back at a predetermined frequency regardless of the characteristics of the utterance voice Vx. Specifically, the response generation unit 36C causes the playback device 26 to play back the ask-back first response voice Vy1 for utterance voices Vx selected at random from the plurality of utterance voices Vx that the user U utters in sequence, and to play back the backchannel second response voice Vy2 for the remaining utterance voices Vx. For example, the response generation unit 36C generates a random number within a predetermined range each time an utterance voice Vx is uttered, selects the first response voice Vy1 when the random number exceeds a threshold, and selects the second response voice Vy2 when the random number falls below the threshold. In this modification, the ask-back first response voice Vy1 is played back for utterance voices Vx selected at random, so it is possible to realize a natural voice dialogue that simulates the tendency of real voice dialogue in which ask-backs occur at random.

以上の構成において、応答生成部３６Cは、発話音声Ｖxの発話回数に対する第１応答音声Ｖy1の再生回数の比（すなわち第１応答音声Ｖy1の再生頻度）を可変に設定することが可能である。例えば、乱数と比較される閾値を調整することで、応答生成部３６Cは、第１応答音声Ｖy1の再生頻度を制御する。例えば第１応答音声Ｖy1の再生頻度が３０％に設定された場合、発話音声Ｖxの発話の総回数のうちの３０％に対して第１応答音声Ｖy1が再生され、残余の７０％の回数の発話に対して第２応答音声Ｖy2が再生される。第１応答音声Ｖy1の再生頻度（例えば乱数と比較される閾値）は、例えば利用者Ｕからの指示に応じて可変に設定される。 In the above configuration, the response generation unit 36C can variably set the ratio of the number of times the first response voice Vy1 is reproduced to the number of times the utterance voice Vx is uttered (that is, the reproduction frequency of the first response voice Vy1). For example, the response generation unit 36C controls the reproduction frequency of the first response voice Vy1 by adjusting the threshold value to be compared with the random number. For example, when the reproduction frequency of the first response voice Vy1 is set to 30%, the first response voice Vy1 is reproduced for 30% of the total number of utterances of the utterance voice Vx, and the remaining number of times is 70%. The second response voice Vy2 is reproduced for the utterance. The reproduction frequency of the first response voice Vy1 (for example, a threshold value compared with a random number) is variably set according to an instruction from the user U, for example.
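The random selection with an adjustable playback frequency can be sketched as below. This is an illustration under assumptions: the random number is drawn from the range 0 to 1, the threshold is derived as one minus the desired ask-back rate, and the name `choose_response` and its parameters are hypothetical.

```python
import random


def choose_response(ask_back_rate, rng=random):
    """Pick a response voice at random for one utterance.

    ask_back_rate : desired share of utterances answered with the
                    ask-back voice Vy1 (e.g. 0.3 for 30%)
    rng           : any object with a random() method returning a float
                    in [0, 1); defaults to the standard random module
    """
    if not 0.0 <= ask_back_rate <= 1.0:
        raise ValueError("ask_back_rate must be between 0 and 1")
    # Threshold compared with the random number; raising the rate
    # lowers the threshold, so Vy1 is selected more often.
    threshold = 1.0 - ask_back_rate
    return "Vy1" if rng.random() > threshold else "Vy2"
```

With `ask_back_rate=0.3`, roughly 30% of a long run of utterances would receive the ask-back voice Vy1 and the rest the backchannel voice Vy2; the share for any short run will vary.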

<Fourth embodiment>
FIG. 15 is a configuration diagram of a voice interaction device 100D according to the fourth embodiment of the present invention. Like the voice interaction device 100A of the first embodiment, the voice interaction device 100D of the fourth embodiment plays back a response voice Vy to the utterance voice Vx uttered by the user U.

As illustrated in FIG. 15, the voice interaction device 100D of the fourth embodiment has a configuration in which the voice analysis unit 34A and the response generation unit 36A of the voice interaction device 100A of the first embodiment are replaced with a history management unit 38 and a response generation unit 36D. The configuration and operation of the other elements of the voice interaction device 100D (the voice input device 24, the playback device 26, and the voice acquisition unit 32) are the same as in the first embodiment. The storage device 22 of the fourth embodiment stores a response signal Y representing a response voice Vy with specific utterance content. The following description uses as an example the response voice Vy "un" ("uh-huh"), which is a backchannel to the utterance voice Vx.

The history management unit 38 in FIG. 15 generates a history H of voice dialogues by the voice interaction device 100D (hereinafter the "usage history"). The usage history H of the fourth embodiment is the number N of voice dialogues executed in the past using the voice interaction device 100D (hereinafter the "usage count"). Specifically, the history management unit 38 counts the usage count N by treating the span from the start of a voice dialogue (activation of the voice interaction device 100D) to its end as one dialogue (that is, one voice dialogue containing multiple pairs of an utterance voice Vx and a played-back response voice Vy). The usage history H generated by the history management unit 38 is stored in the storage device 22.

第４実施形態の応答生成部３６Dは、履歴管理部３８が生成した利用履歴Ｈに応じた韻律の応答音声Ｖyを再生装置２６に再生させる。すなわち、応答音声Ｖyの韻律が利用履歴Ｈに応じて可変に制御される。第４実施形態では、応答音声Ｖyの再生の待機時間Ｗを当該応答音声Ｖyの韻律として利用履歴Ｈに応じて制御する。待機時間Ｗは、発話音声Ｖxの終点ｔBから応答音声Ｖyの応答再生点ｔyまでの時間長（すなわち発話音声Ｖxと応答音声Ｖyとの間隔）である。 The response generation unit 36D of the fourth embodiment causes the playback device 26 to play back the prosodic response voice Vy corresponding to the usage history H generated by the history management unit 38. That is, the prosody of the response voice Vy is variably controlled according to the usage history H. In the fourth embodiment, the reproduction standby time W of the response voice Vy is controlled according to the usage history H as the prosody of the response voice Vy. The waiting time W is the length of time from the end point tB of the utterance voice Vx to the response playback point ty of the response voice Vy (that is, the interval between the utterance voice Vx and the response voice Vy).

In dialogue between real people, a tendency is observed in which the prosody of speech changes over time as dialogue with a specific partner is repeated. Specifically, immediately after two people meet for the first time and begin to talk (a stage at which neither is used to talking with the other), neither can grasp the timing and the like that suits the other, so the time from an utterance to the response to it is long (that is, the dialogue is awkward), and this time shortens as dialogue with that partner is repeated (that is, the dialogue proceeds at a good tempo). In view of this tendency, the response generation unit 36D of the fourth embodiment controls the waiting time W according to the usage history H so that the waiting time W of the response voice Vy is shorter when the usage count N indicated by the usage history H is large than when the usage count N is small.

図１６は、第４実施形態の制御装置２０が実行する処理のフローチャートである。例えば音声対話装置１００Dに対する利用者Ｕからの指示（音声対話用のプログラムの起動指示）を契機として図１６の処理が開始される。音声対話装置１００Dによる音声対話が最初に開始される段階では、利用履歴Ｈは初期値（例えばＮ＝０）に設定される。 FIG. 16 is a flowchart of processing executed by the control device 20 of the fourth embodiment. For example, the process of FIG. 16 is started in response to an instruction from the user U (instruction to start a voice conversation program) to the voice interaction apparatus 100D. At the stage when the voice dialogue by the voice dialogue device 100D is first started, the usage history H is set to an initial value (for example, N = 0).

第１実施形態と同様に、発話音声Ｖxが開始されると（Ｓ30：YES）、音声取得部３２は、音声入力装置２４から発話信号Ｘを取得して記憶装置２２に格納する（Ｓ31）。音声取得部３２による発話信号Ｘの取得は、発話音声Ｖxの終了まで反復される（Ｓ32：NO）。 As in the first embodiment, when the utterance voice Vx is started (S30: YES), the voice acquisition unit 32 acquires the utterance signal X from the voice input device 24 and stores it in the storage device 22 (S31). The acquisition of the utterance signal X by the voice acquisition unit 32 is repeated until the end of the utterance voice Vx (S32: NO).

When the utterance voice Vx ends (S32: YES), the response generation unit 36D executes a response generation process SD for causing the playback device 26 to play back a response voice Vy whose prosody corresponds to the usage history H stored in the storage device 22. As described above, the response generation process SD of the fourth embodiment controls, according to the usage history H, the waiting time W from the end point tB of the utterance voice Vx to the response playback point ty at which playback of the response voice Vy starts. The acquisition of the utterance signal X by the voice acquisition unit 32 (S31) and the response generation process SD by the response generation unit 36D are repeated until the user U instructs the end of the voice dialogue (S33: NO). Accordingly, as in the first embodiment, a voice dialogue is realized in which the utterance of an arbitrary utterance voice Vx by the user U and the playback of the response voice Vy to that utterance voice Vx are alternately repeated.

When the user U instructs the end of the voice dialogue (S33: YES), the history management unit 38 updates the usage history H stored in the storage device 22 to reflect the current voice dialogue (S34). Specifically, the history management unit 38 increases the usage count N indicated by the usage history H by one. Accordingly, the usage count N increases by one each time a voice dialogue is carried out by the voice interaction device 100D. After the usage history H is updated, the process of FIG. 16 ends.

図１７は、第４実施形態の応答生成処理ＳDのフローチャートであり、図１８および図１９は、応答生成処理ＳDの説明図である。応答生成処理ＳDを開始すると、応答生成部３６Dは、記憶装置２２に記憶された利用履歴Ｈに応じて待機時間Ｗを可変に設定する（ＳD1〜ＳD3）。具体的には、応答生成部３６Dは、まず、利用履歴Ｈが示す利用回数Ｎが所定の閾値ＮTHを上回るか否かを判定する（ＳD1）。利用回数Ｎが閾値ＮTHを上回る場合（ＳD1：YES）、応答生成部３６Dは、図１８に例示される通り、所定の基礎値ｗ0（例えば150ms）を待機時間Ｗとして設定する（ＳD2）。他方、利用回数Ｎが閾値ＮTHを下回る場合（ＳD1：NO）、応答生成部３６Dは、図１９に例示される通り、基礎値ｗ0に所定の調整値（オフセット）δwを加算した数値(ｗ0＋δw)を待機時間Ｗとして設定する（ＳD3）。調整値δwは所定の正数に設定される。なお、以上の説明では、利用回数Ｎが閾値ＮTHを上回るか否かに応じて待機時間Ｗを２値的に制御したが、利用回数Ｎに応じて待機時間Ｗを多値的または連続的に変化させることも可能である。 FIG. 17 is a flowchart of the response generation process SD of the fourth embodiment, and FIGS. 18 and 19 are explanatory diagrams of the response generation process SD. When the response generation process SD is started, the response generation unit 36D variably sets the standby time W according to the usage history H stored in the storage device 22 (SD1 to SD3). Specifically, the response generation unit 36D first determines whether or not the usage count N indicated by the usage history H exceeds a predetermined threshold value NTH (SD1). When the number of uses N exceeds the threshold value NTH (SD1: YES), the response generation unit 36D sets a predetermined basic value w0 (for example, 150 ms) as the standby time W as illustrated in FIG. 18 (SD2). On the other hand, when the number of uses N is less than the threshold value NTH (SD1: NO), the response generator 36D, as exemplified in FIG. 19, adds the predetermined adjustment value (offset) δw to the basic value w0 (w0 + δw). Is set as the waiting time W (SD3). The adjustment value δw is set to a predetermined positive number. In the above description, the standby time W is binary controlled in accordance with whether or not the usage count N exceeds the threshold value NTH. However, the standby time W is multivalued or continuously in accordance with the usage count N. It is also possible to change.
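Steps SD1 to SD3 amount to a two-level choice of waiting time. The sketch below uses the 150 ms base value given in the text; the threshold `n_th=10`, the offset `delta_w_ms=100.0`, and the function name are assumed values for illustration only. The text does not say what happens when N equals NTH exactly, so that case is treated here as not exceeding the threshold.

```python
def waiting_time_ms(n_uses, n_th=10, w0_ms=150.0, delta_w_ms=100.0):
    """Return the waiting time W (in milliseconds) before the response.

    n_uses     : usage count N from the usage history H
    n_th       : threshold NTH (assumed value)
    w0_ms      : base value w0 (150 ms is the example in the text)
    delta_w_ms : positive adjustment value (offset) added for new users
                 (assumed value)
    """
    if n_uses < 0:
        raise ValueError("usage count must not be negative")
    if n_uses > n_th:
        return w0_ms               # SD2: familiar user, shorter wait
    return w0_ms + delta_w_ms      # SD3: unfamiliar user, longer wait
```

A multi-valued or continuous variant, which the text also allows, could replace the two-level branch with any function of N that does not increase as N grows.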

The response generation unit 36D waits until the waiting time W set according to the usage history H by the above process has elapsed from the end point tB of the utterance voice Vx (SD4: NO). When the response playback point ty arrives upon elapse of the waiting time W (SD4: YES), the response generation unit 36D supplies the response signal Y stored in the storage device 22 to the playback device 26, thereby causing the response voice Vy to be played back (SD5). As understood from the above description, the response generation unit 36D of the fourth embodiment causes the playback device 26 to play back a response voice Vy whose prosody (the waiting time W in the fourth embodiment) corresponds to the usage history H of the voice interaction device 100D. Specifically, when the usage count N indicated by the usage history H is large, the response voice Vy is played back after a waiting time W equal to the base value w0 has elapsed, and when the usage count N is small, the response voice Vy is played back after a waiting time W equal to the base value w0 plus the adjustment value δw has elapsed. That is, the waiting time W is shorter when the usage count N is large.

以上に説明した通り、第４実施形態では、音声対話装置１００Dによる音声対話の利用履歴Ｈに応じた韻律（待機時間Ｗ）の応答音声Ｖyが再生されるから、特定相手との対話の反復とともに発話音声の韻律が経時的に変化するという現実の対話の傾向を模擬した自然な音声対話を実現することが可能である。第４実施形態では特に、発話音声Ｖxと応答音声Ｖyとの間隔である待機時間Ｗが利用履歴Ｈに応じて制御される。したがって、初対面で対話を開始した直後の段階では、発話と応答との間隔が長く、当該対話相手との対話が反復されるにつれて当該間隔が短縮されるという現実の対話の傾向を模擬した自然な音声対話が実現される。 As described above, in the fourth embodiment, since the response voice Vy of the prosody (waiting time W) corresponding to the voice conversation use history H by the voice conversation device 100D is reproduced, the conversation with the specific partner is repeated. It is possible to realize a natural voice conversation that simulates the tendency of an actual conversation in which the prosody of the spoken voice changes with time. In the fourth embodiment, in particular, the standby time W that is the interval between the utterance voice Vx and the response voice Vy is controlled according to the usage history H. Therefore, at the stage immediately after the start of the dialogue at the first meeting, the interval between the utterance and the response is long, and the natural conversation that simulates the tendency of the actual dialogue that the interval is shortened as the dialogue with the dialogue partner is repeated. Spoken dialogue is realized.

<Modification>
The voice interaction device 100 (100A, 100B, 100C, 100D) exemplified in each of the above embodiments can be modified in various ways. Specific modifications are exemplified below. Two or more modes arbitrarily selected from the following examples can be combined as appropriate to the extent that they do not contradict each other.

(1) Any two or more configurations selected from the first to fourth embodiments can be combined. Specifically, the configuration of the first embodiment, which controls the prosody of the response voice Vy according to the prosody of the utterance voice Vx (for example, the pitch P), can likewise be applied to the second to fourth embodiments. For example, in the second embodiment, the prosody of the response signal Y selected in step SB3 or step SB4 of FIG. 9 can be controlled according to the prosody of the utterance voice Vx (for example, the pitch P) before the signal is played back from the playback device 26. Similarly, the third embodiment may adopt a configuration in which the prosody of the response signal Y selected in step SC2 or step SC3 of FIG. 12 is controlled according to the prosody of the utterance voice Vx, and the fourth embodiment may adopt a configuration in which the prosody of the response signal Y acquired from the storage device 22 in step SD5 of FIG. 17 is controlled according to the prosody of the utterance voice Vx. In a configuration in which the first embodiment is applied to the second to fourth embodiments, as in the first embodiment, the pitch of the response signal Y is adjusted so that, for example, the pitch at the start point of a specific mora (typically the last mora) of the response voice Vy matches the lowest pitch Pmin in the end section E of the utterance voice Vx.

The configuration of the third embodiment, which selectively reproduces a first response voice Vy1 that is a query-back to the uttered voice Vx and a second response voice Vy2 other than a query-back, may also be applied to the embodiments other than the third embodiment. Likewise, the configuration of the fourth embodiment, which controls the prosody (for example, the waiting time W) of the response voice Vy according to the usage history H of the voice interaction, may be applied to the first to third embodiments.
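The pitch matching described in modification (1) can be illustrated with a minimal sketch. It is not the implementation of the embodiments: it assumes that pitch is given in Hz, that unvoiced frames are coded as 0, and that the response signal Y is shifted uniformly by a number of semitones. The function name and parameters are hypothetical.

```python
import math

def pitch_shift_semitones(end_section_pitches_hz, response_mora_start_hz):
    """Return the shift (in semitones) that would move the pitch at the start
    of the chosen mora of the response voice Vy onto the lowest pitch Pmin
    found in the end section E of the uttered voice Vx.

    Assumptions (not from the source): pitches are in Hz and unvoiced
    frames are represented by 0.
    """
    voiced = [p for p in end_section_pitches_hz if p > 0]  # drop unvoiced frames
    if not voiced:
        return 0.0  # no usable pitch in section E, so leave Y unchanged
    p_min = min(voiced)  # lowest pitch Pmin in section E
    return 12.0 * math.log2(p_min / response_mora_start_hz)
```

A negative result means the response signal Y would be lowered, and a positive result means it would be raised.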

(2) The various variables relating to the voice interaction in each of the above embodiments may be set variably, for example according to instructions from the user U. For example, a configuration may be adopted in which the playback volume of the response voice Vy is controlled according to an instruction from the user U, or in which the type of response voice Vy actually reproduced by the playback device 26 is selected, according to an instruction from the user U, from among multiple types of response voice Vy that differ in speaker gender or voice quality (gentle voice, stern voice). In the first to third embodiments, the length of the waiting time W from the end point tB of the uttered voice Vx to the response playback point ty of the response voice Vy may also be set according to an instruction from the user U.

(3) In the modification of the third embodiment, the reproduction frequency of the first response voice Vy1, which is a query-back to the uttered voice Vx, is set variably according to an instruction from the user U, but the reproduction frequency of the first response voice Vy1 may also be controlled according to factors other than instructions from the user U. Specifically, a configuration may be adopted in which the response generation unit 36D of the third embodiment controls the reproduction frequency of the first response voice Vy1 according to the usage history H exemplified in the fourth embodiment. For example, in dialogue between real people, it is expected that the more often a person converses with a particular partner, the better that person grasps the characteristics of the partner's speech (for example, verbal habits and tone), so that query-backs to the partner's utterances become less frequent. In view of this tendency, a configuration is preferable in which the reproduction frequency of the first response voice Vy1 is reduced as the usage count N indicated by the usage history H increases.
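As a rough illustration of modification (3), the sketch below lowers the probability of reproducing the first response voice Vy1 as the usage count N grows. The source only states that the frequency decreases with N; the particular decay formula, the default constants, and the function names are assumptions made for this example.

```python
import random

def query_back_probability(use_count, base=0.3, decay=0.05, floor=0.02):
    """Probability of reproducing the first response voice Vy1.

    Hypothetical rule: starts at `base` for N = 0, decreases monotonically
    with the usage count N, and never falls below `floor`.
    """
    return max(floor, base / (1.0 + decay * use_count))

def choose_response(use_count, rng=random):
    """Return "Vy1" or "Vy2" for one utterance.

    `rng` is any object with a random() method returning a float in [0, 1).
    """
    if rng.random() < query_back_probability(use_count):
        return "Vy1"
    return "Vy2"
```

Any other monotonically decreasing mapping from N to a probability would serve the same purpose. The floor merely keeps occasional query-backs from disappearing entirely, which is a design choice of this sketch rather than something stated in the source.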

(4) In the fourth embodiment, the usage count N of the voice interaction is exemplified as the usage history H, but the usage history H is not limited to the usage count N. For example, a count in which each reproduction of the response voice Vy within a voice interaction is counted as one, the usage frequency of the voice interaction (the number of uses per unit time), the usage period of the voice interaction (for example, the time elapsed since the voice interaction device 100 was first used), or the time elapsed since the voice interaction device 100 was last used may be applied as the usage history H to control the waiting time W.

(5) In the first embodiment, the response signal Y is generated from the voice signal Z stored in advance in the storage device 22 and then reproduced, and in the second to fourth embodiments, the response signal Y stored in advance in the storage device 22 is reproduced. However, the response signal Y representing a response voice Vy with specific utterance content may instead be synthesized by, for example, a known speech synthesis technique. For the synthesis of the response signal Y, for example, unit-concatenation speech synthesis or speech synthesis using a statistical model such as a hidden Markov model is suitable. The uttered voice Vx and the response voice Vy are not limited to human vocal sounds. For example, animal cries may be used as the uttered voice Vx or the response voice Vy.

(6) In each of the above embodiments, a configuration in which the voice interaction device 100 includes the voice input device 24 and the playback device 26 is exemplified, but the voice input device 24 and the playback device 26 may instead be installed in a device separate from the voice interaction device 100 (a voice input/output device). The voice interaction device 100 is realized by, for example, a terminal device such as a mobile phone or a smartphone, and the voice input/output device is realized by, for example, an electronic device such as an animal-shaped toy or a robot. The voice interaction device 100 and the voice input/output device can communicate wirelessly or by wire. That is, the utterance signal X generated by the voice input device 24 of the voice input/output device is transmitted wirelessly or by wire to the voice interaction device 100, and the response signal Y generated by the voice interaction device 100 is transmitted wirelessly or by wire to the playback device 26 of the voice input/output device.

(7) In each of the above embodiments, the voice interaction device 100 is realized by an information processing device such as a mobile phone or a personal computer, but some or all of the functions of the voice interaction device 100 may instead be realized by a server device (a so-called cloud server). Specifically, the voice interaction device 100 is realized by a server device that communicates with a terminal device via a communication network such as a mobile communication network or the Internet. For example, the voice interaction device 100 receives, from the terminal device, the utterance signal X generated by the voice input device 24 of the terminal device, and generates the response signal Y from the utterance signal X using the configuration of any of the above embodiments. The voice interaction device 100 then transmits the response signal Y generated from the utterance signal X to the terminal device and causes the playback device 26 of the terminal device to reproduce the response voice Vy. The voice interaction device 100 is realized by a single device or by a set of multiple devices (that is, a server system). It is also possible to realize some of the functions of the voice interaction device 100 according to the above embodiments (for example, at least some of the voice acquisition unit 32, the voice analysis units 34A and 34C, the response generation units 36A, 36B, 36C, and 36D, and the history management unit 38) in the server device and the remaining functions in the terminal device. Which of the server device and the terminal device realizes each function of the voice interaction device 100 (the division of functions) is arbitrary.

(8) In each of the above embodiments, a response voice Vy with specific utterance content (for example, a backchannel response such as "un" ("yeah")) is reproduced in response to the uttered voice Vx, but the utterance content of the response voice Vy is not limited to these examples. For example, the utterance content of the uttered voice Vx may be analyzed by speech recognition and morphological analysis of the utterance signal X, and a response voice Vy whose content suits that utterance content may be selected from multiple candidates and reproduced by the playback device 26. Note that in a configuration that does not perform speech recognition or morphological analysis (for example, the examples of the first to fourth embodiments), a response voice Vy with utterance content prepared in advance, independent of the uttered voice Vx, is reproduced. At first glance, one might therefore suppose that a natural dialogue could not be established. In practice, however, because the prosody of the response voice Vy is controlled in various ways as in the above embodiments, the user U can perceive a sensation like that of a natural dialogue between people. Moreover, a configuration that does not perform speech recognition or morphological analysis has the advantage that the processing delay and processing load attributable to these processes are reduced or eliminated.

(9) The voice interaction device 100 (100A, 100B, 100C, 100D) exemplified in the above embodiments can also be used to evaluate actual dialogue between people. For example, the prosody of a response voice observed in an actual dialogue between people (hereinafter, the "observed voice") can be compared with the prosody of the response voice Vy generated in the above embodiments; the observed voice is evaluated as appropriate when the two prosodies are similar and as inappropriate when they diverge. A device that performs the evaluation exemplified above (a dialogue evaluation device) can also be used for training in dialogue between people.
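The comparison in modification (9) might be sketched as follows. The source does not specify how prosodic similarity is measured; this example assumes that prosody is summarized by a pitch (in Hz) and a waiting time (in seconds), and that "similar" means both fall within tolerances. The dictionary keys, the tolerances, and the function name are all hypothetical.

```python
import math

def is_observed_voice_appropriate(observed, reference,
                                  max_semitones=2.0, max_wait_diff_s=0.3):
    """Evaluate an observed voice against the generated response voice Vy.

    `observed` and `reference` are dicts with hypothetical keys
    "pitch_hz" (positive float) and "wait_s" (float). Returns True when the
    prosodies are similar (evaluated as appropriate) and False when they
    diverge (evaluated as inappropriate).
    """
    # Pitch gap measured on a semitone scale so the tolerance is relative.
    pitch_gap = abs(12.0 * math.log2(observed["pitch_hz"] / reference["pitch_hz"]))
    # Gap between the waiting times before each response starts.
    wait_gap = abs(observed["wait_s"] - reference["wait_s"])
    return pitch_gap <= max_semitones and wait_gap <= max_wait_diff_s
```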

(10) As described above, the voice interaction device 100 (100A, 100B, 100C, 100D) exemplified in the above embodiments can be realized by the cooperation of the control device 20 and a program for voice interaction. The program for voice interaction can be provided in a form stored in a computer-readable recording medium and installed in a computer. The recording medium is, for example, a non-transitory recording medium, of which an optical recording medium (optical disc) such as a CD-ROM is a good example, but it may be a recording medium of any known type, such as a semiconductor recording medium or a magnetic recording medium. The program can also be delivered to a computer by distribution via a communication network. The present invention can also be realized as a method of operating the voice interaction device 100 exemplified in the above embodiments (a voice interaction method). The computer that performs the voice interaction method (the voice interaction device 100) is, for example, a single computer or a system composed of multiple computers.

100 (100A, 100B, 100C, 100D) ... voice interaction device, 20 ... control device, 22 ... storage device, 24 ... voice input device, 242 ... sound collection device, 244 ... A/D converter, 26 ... playback device, 262 ... D/A converter, 264 ... sound emission device, 32 ... voice acquisition unit, 34A, 34C ... voice analysis unit, 36A, 36B, 36C, 36D ... response generation unit, 38 ... history management unit.

Claims

A voice interaction device comprising:
a voice acquisition unit that acquires an utterance signal representing an uttered voice; and
a response generation unit that causes a playback device to selectively reproduce a first response voice representing a query-back to the uttered voice and a second response voice other than a query-back.

The voice interaction device according to claim 1, further comprising a voice analysis unit that identifies, from the utterance signal, a prosodic index value representing the prosody of the uttered voice, wherein the response generation unit compares the prosodic index value of the uttered voice with a threshold value and selects either the first response voice or the second response voice according to the result of the comparison.

The voice interaction device according to claim 1, wherein the voice analysis unit sets, as the threshold value, a representative value of the prosodic index values of a plurality of past uttered voices.

The voice interaction device according to any one of claims 1 to 3, wherein the response generation unit selects the first response voice when the prosodic index value is outside a predetermined range that includes the threshold value, and selects the second response voice when the prosodic index value is inside the predetermined range.

The voice interaction device according to claim 1, wherein the response generation unit causes the first response voice to be reproduced for an uttered voice selected at random from among a plurality of uttered voices.

The voice interaction device according to claim 5, wherein the response generation unit variably sets the reproduction frequency of the first response voice with respect to the plurality of uttered voices.

The voice interaction device according to claim 6, wherein the response generation unit sets the reproduction frequency of the first response voice according to a usage history of the voice interaction.

A program that causes a computer to function as:
a voice acquisition unit that acquires an utterance signal representing an uttered voice; and
a response generation unit that causes a playback device to selectively reproduce a first response voice representing a query-back to the uttered voice and a second response voice other than a query-back.