JP6657887B2 - Voice interaction method, voice interaction device, and program

Info

Publication number: JP6657887B2
Application number: JP2015238911A
Authority: JP
Inventors: 嘉山　啓; 啓嘉山
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2015-12-07
Filing date: 2015-12-07
Publication date: 2020-03-04
Anticipated expiration: 2035-12-07
Also published as: JP2017106988A

Description

The present invention relates to voice interaction technology for reproducing a response voice to an utterance voice.

Voice interaction technologies that realize a dialogue with a user by reproducing the voice of a response (for example, an answer to a question) to an utterance by the user have been proposed. For example, Patent Literature 1 discloses a technique of analyzing utterance content by speech recognition of a user's utterance voice, and synthesizing and reproducing a response voice according to the analysis result.

Patent Literature 1: JP 2012-128440 A

However, under existing technologies including that of Patent Literature 1, it is in practice difficult to realize a natural voice interaction that faithfully reflects the tendencies of real dialogue between humans, and the user may perceive the interaction as mechanical and unnatural. In view of the above circumstances, an object of the present invention is to realize a natural voice interaction.

In order to solve the above problem, a voice interaction device according to a preferred aspect of the present invention is a device that executes a voice interaction in which a response voice to an utterance voice is reproduced, and includes: a voice acquisition unit that acquires an utterance signal representing the utterance voice; a history management unit that generates a usage history of the voice interaction; and a response generation unit that causes a playback device to reproduce a response voice having a prosody according to the usage history. In this aspect, because a response voice having a prosody according to the usage history of the voice interaction is reproduced, it is possible to realize a natural voice interaction that simulates the tendency of real dialogue in which the prosody of the utterance voice changes over time as dialogue with a specific partner is repeated.

In a preferred aspect of the present invention, the response generation unit controls a waiting time, which is the interval between the utterance voice and the response voice, according to the usage history. In this aspect, because the waiting time is controlled according to the usage history, a natural voice interaction is realized that simulates the tendency of real dialogue in which the interval between turns is long immediately after a dialogue begins at a first meeting and shortens as the dialogue with the partner is repeated.

A configuration diagram of the voice interaction device of the first embodiment.
A flowchart of the operation of the voice interaction device in the first embodiment.
An explanatory diagram of the utterance voice and the response voice in the first embodiment.
An explanatory diagram of the utterance voice and the response voice in the first embodiment.
A flowchart of the response generation processing of the first embodiment.
A configuration diagram of the voice interaction device of the second embodiment.
An explanatory diagram of the utterance voice and the response voice in the second embodiment.
An explanatory diagram of the utterance voice and the response voice in the second embodiment.
A flowchart of the response generation processing in the second embodiment.
A configuration diagram of the voice interaction device of the third embodiment.
A flowchart of the operation of the voice interaction device in the third embodiment.
A flowchart of the response generation processing in the third embodiment.
An explanatory diagram of the utterance voice and the response voice in the third embodiment.
An explanatory diagram of the utterance voice and the response voice in the third embodiment.
A configuration diagram of the voice interaction device of the fourth embodiment.
A flowchart of the operation of the voice interaction device in the fourth embodiment.
A flowchart of the response generation processing in the fourth embodiment.
An explanatory diagram of the utterance voice and the response voice in the fourth embodiment.
An explanatory diagram of the utterance voice and the response voice in the fourth embodiment.

<First embodiment>
FIG. 1 is a configuration diagram of a voice interaction device 100A according to the first embodiment of the present invention. The voice interaction device 100A of the first embodiment is a voice interaction system that reproduces a voice of a response (hereinafter, "response voice") Vy to a voice (hereinafter, "utterance voice") Vx uttered by a user U. For example, a portable information processing device such as a mobile phone or a smartphone, or an information processing device such as a personal computer, can be used as the voice interaction device 100A. The voice interaction device 100A can also be realized in the form of a robot or a toy that simulates the appearance of an animal or the like (for example, a doll such as a stuffed animal).

The utterance voice Vx is the voice of an utterance including, for example, a question or a remark addressed to the device, and the response voice Vy is the voice of a response including an answer to a question or a reply to a remark. The response voice Vy also includes, for example, a voice representing an interjection. An interjection is an independent, non-inflecting word that is used independently of other phrases. Specifically, examples of interjections include words that express backchannel feedback to an utterance, such as "un" ("uh-huh") and "ee" ("yeah"); words that express hesitation (a stalled response), such as "eto" and "ano" ("um", "er"); words that express a response (affirmation or negation of a question), such as "hai" ("yes") and "iie" ("no"); words that express the speaker's emotion, such as "aa" ("ah") and "oo" ("oh"); and words that ask the speaker to repeat an utterance, such as "e?" ("huh?") and "nani?" ("what?").

The voice interaction device 100A of the first embodiment generates a response voice Vy having a prosody according to the prosody of the utterance voice Vx. Prosody is a linguistic and phonetic characteristic of speech that a listener can perceive, and means a property that cannot be grasped from the ordinary written form of the language alone (for example, the written form without any special notation for prosody). Prosody can also be described as a characteristic that lets the listener infer or imagine the speaker's intention or emotion. Specifically, the concept of prosody can include various features such as intonation (change in the tone of the voice), tone (pitch height or loudness of the voice), duration (utterance length), speech rate, rhythm (the structure of temporal change in tone), and accent (pitch accent or stress accent); typical examples of prosody are pitch (fundamental frequency) and volume.

As illustrated in FIG. 1, the voice interaction device 100A of the first embodiment includes a control device 20, a storage device 22, a voice input device 24, and a playback device 26. The voice input device 24 is an element that generates a voice signal (hereinafter, "utterance signal") X representing, for example, the utterance voice Vx of the user U, and includes a sound collection device 242 and an A/D converter 244. The sound collection device (microphone) 242 collects the utterance voice Vx uttered by the user U and generates an analog voice signal representing the sound pressure fluctuation of the utterance voice Vx. The A/D converter 244 converts the voice signal generated by the sound collection device 242 into the digital utterance signal X.

The control device 20 is an arithmetic processing device (for example, a CPU) that centrally controls each element of the voice interaction device 100A. The control device 20 of the first embodiment acquires the utterance signal X supplied from the voice input device 24 and generates a response signal Y representing the response voice Vy to the utterance voice Vx. The playback device 26 is an element that reproduces the response voice Vy according to the response signal Y generated by the control device 20, and includes a D/A converter 262 and a sound emission device 264. The D/A converter 262 converts the digital response signal Y generated by the control device 20 into an analog voice signal, and the sound emission device 264 (for example, a speaker or headphones) emits the response voice Vy according to the converted voice signal as a sound wave. The playback device 26 may also include a processing circuit such as an amplifier that amplifies the response signal Y.

The storage device 22 stores a program executed by the control device 20 and various data used by the control device 20. For example, a known recording medium such as a semiconductor recording medium or a magnetic recording medium, or a combination of multiple recording media, can be freely adopted as the storage device 22. The storage device 22 of the first embodiment stores a voice signal Z representing a response voice with specific utterance content. The following description exemplifies the case where the storage device 22 stores the voice signal Z of a response voice such as "un" ("uh-huh"), a backchannel response that is one example of an interjection. The voice signal Z is recorded in advance and stored in the storage device 22 as an audio file in any format, for example the wav format.

By executing the program stored in the storage device 22, the control device 20 realizes a plurality of functions (a voice acquisition unit 32, a voice analysis unit 34A, and a response generation unit 36A) for establishing a dialogue with the user U. A configuration in which the functions of the control device 20 are realized by a plurality of devices (that is, a system), or a configuration in which a dedicated electronic circuit takes over some of the functions of the control device 20, may also be adopted.

The voice acquisition unit 32 in FIG. 1 acquires the utterance signal X representing the utterance voice Vx. The voice acquisition unit 32 of the first embodiment acquires, from the voice input device 24, the utterance signal X generated by the voice input device 24. The voice analysis unit 34A identifies the pitch (fundamental frequency) P of the utterance voice Vx from the utterance signal X acquired by the voice acquisition unit 32. The pitch P is identified sequentially at a predetermined period; that is, a pitch P is identified for each of a plurality of different time points on the time axis. Any known technique can be adopted to identify the pitch P of the utterance voice Vx. It is also possible to extract the acoustic component of a specific frequency band from the utterance signal X and identify the pitch P from it. The frequency band analyzed by the voice analysis unit 34A is set variably according to, for example, an instruction from the user U (for example, a designation of male or female voice). The frequency band to be analyzed can also be changed dynamically according to the pitch P of the utterance voice Vx.
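The description leaves the pitch identification method open ("any known technique"). As one illustration only, the following Python sketch estimates the pitch P of a single analysis frame by autocorrelation. The function name `estimate_pitch`, the search band `fmin`/`fmax` (standing in for the frequency band to be analyzed), and the 0.3 voicing threshold are assumptions made for this example and do not appear in the source.

```python
import math


def estimate_pitch(frame, sample_rate, fmin=80.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of one frame by autocorrelation.

    Returns None when the frame is silent or only weakly periodic.
    fmin/fmax bound the search band (hypothetical defaults).
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]          # remove any DC offset
    energy = sum(s * s for s in x)
    if energy == 0.0:
        return None                         # silent frame
    lag_min = int(sample_rate / fmax)       # shortest period considered
    lag_max = min(int(sample_rate / fmin), n - 1)
    best_lag, best_corr = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        corr = sum(x[i] * x[i + lag] for i in range(n - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    # Assumed voicing check: require a reasonably strong periodic peak.
    if best_lag is None or best_corr / energy < 0.3:
        return None
    return sample_rate / best_lag


# Usage: a 50 ms frame of a 200 Hz sine sampled at 8 kHz.
rate = 8000
frame = [math.sin(2 * math.pi * 200 * i / rate) for i in range(400)]
pitch = estimate_pitch(frame, rate)
```

Calling this function once per frame at a fixed hop would yield the time series of pitches P described above. Narrowing `fmin` and `fmax` is one way the analyzed band could be switched between, say, male and female voices.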

The response generation unit 36A causes the playback device 26 to reproduce the response voice Vy to the utterance voice Vx of the utterance signal X acquired by the voice acquisition unit 32. Specifically, triggered by the user U uttering the utterance voice Vx, the response generation unit 36A generates the response signal Y of the response voice Vy and supplies the response signal Y to the playback device 26, thereby causing the playback device 26 to reproduce the response voice Vy. The response generation unit 36A of the first embodiment generates the response signal Y of the response voice Vy by adjusting the prosody of the voice signal Z stored in the storage device 22 according to the pitch P of the utterance voice Vx identified by the voice analysis unit 34A. That is, the playback device 26 reproduces a response voice Vy obtained by adjusting the initial response voice represented by the voice signal Z according to the prosody of the utterance voice Vx.

In real dialogue between humans, a tendency is observed in which the dialogue partner utters a response voice at a pitch corresponding to the pitch near the end point of the speaker's utterance voice (that is, the pitch of the response voice depends on the pitch near the end point of the utterance voice). In consideration of this tendency, the response generation unit 36A of the first embodiment generates the response signal Y of the response voice Vy by adjusting the pitch of the voice signal Z according to the pitch P of the utterance voice Vx identified by the voice analysis unit 34A.

FIG. 2 is a flowchart of the processing executed by the control device 20 of the first embodiment. The processing of FIG. 2 is started, for example, in response to an instruction from the user U to the voice interaction device 100A (for example, an instruction to start a voice interaction program).

When the processing of FIG. 2 starts, the voice acquisition unit 32 waits until the user U starts uttering the utterance voice Vx (S10: NO). Specifically, the voice acquisition unit 32 sequentially identifies the volume of the utterance voice Vx by analyzing the utterance signal X supplied from the voice input device 24, and determines that the utterance voice Vx has started when a state in which the volume of the utterance voice Vx exceeds a predetermined threshold (for example, a fixed value selected in advance or a variable value according to an instruction from the user U) has continued for a predetermined length of time. The method of detecting the start of the utterance voice Vx (that is, the start point of the utterance section) is arbitrary. For example, it is also possible to determine that the utterance voice Vx has started when the volume of the utterance voice Vx exceeds the threshold and the voice analysis unit 34A detects a significant pitch P.
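The start decision of step S10 (volume above a threshold, sustained for a predetermined length of time) can be sketched as follows. This is a minimal illustration, assuming the volume has already been measured once per frame; the function name `detect_onset` and the parameter `min_frames` (the predetermined length of time expressed in frames) are hypothetical.

```python
def detect_onset(volumes, threshold, min_frames):
    """Return the index of the first frame of the earliest run of at least
    min_frames consecutive frames whose volume exceeds threshold.

    Returns None if no such run exists (the device keeps waiting, S10: NO).
    """
    run = 0
    for i, volume in enumerate(volumes):
        if volume > threshold:
            run += 1
            if run >= min_frames:
                return i - min_frames + 1   # start point of the utterance
        else:
            run = 0                          # a quiet frame resets the run
    return None


# Usage: the single loud frame at index 1 is ignored as too short;
# the sustained run starting at index 3 is reported.
onset = detect_onset([0.0, 0.5, 0.0, 0.6, 0.7, 0.8, 0.1],
                     threshold=0.3, min_frames=3)
```

Requiring a sustained run rather than a single loud frame keeps short noises such as a click from being treated as the start of an utterance.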

When the utterance voice Vx starts (S10: YES), the voice acquisition unit 32 acquires the utterance signal X from the voice input device 24 and stores it in the storage device 22 (S11). The voice analysis unit 34A identifies the pitch P of the utterance voice Vx from the utterance signal X acquired by the voice acquisition unit 32 and stores it in the storage device 22 (S12).

The voice acquisition unit 32 determines whether the user U has finished uttering the utterance voice Vx (S13). Specifically, the voice acquisition unit 32 determines that the utterance voice Vx has ended when a state in which the volume of the utterance voice Vx identified from the utterance signal X falls below a predetermined threshold (for example, a fixed value selected in advance or a variable value according to an instruction from the user U) has continued for a predetermined length of time. However, any known technique can be adopted to detect the end of the utterance voice Vx (that is, the end point of the utterance section). As understood from the above description, during the utterance period in which the utterance voice Vx continues (S13: NO), the acquisition of the utterance signal X by the voice acquisition unit 32 (S11) and the identification of the pitch P of the utterance voice Vx by the voice analysis unit 34A (S12) are repeated.

As a result of the processing described above, as illustrated in FIGS. 3 and 4, a time series of a plurality of pitches P of the utterance voice Vx is identified for the utterance section from the start point to the end point tB of the utterance voice Vx. FIG. 3 assumes a case where the user U utters an utterance voice Vx of an interrogative sentence, "Tanoshii ne?" ("It's fun, isn't it?"), in which the speaker asks the partner about the partner's emotion, intention, or the like. FIG. 4 assumes a case where the user U utters an utterance voice Vx of a declarative sentence in which the speaker expresses the speaker's own emotion, intention, or the like, or asks the partner to agree with it.

When the utterance voice Vx ends (S13: YES), the response generation unit 36A executes processing (hereinafter, "response generation processing") SA for causing the playback device 26 to reproduce the response voice Vy to the utterance voice Vx. As described above, the response generation processing SA of the first embodiment generates the response signal Y of the response voice Vy by adjusting the pitch of the voice signal Z according to the pitch P of the utterance voice Vx identified by the voice analysis unit 34A.

FIG. 5 is a flowchart of a specific example of the response generation processing SA. As described above, the response generation processing SA of FIG. 5 is started when the utterance voice Vx ends (S13: YES). When the response generation processing SA starts, the response generation unit 36A identifies, as the prosody of the utterance voice Vx, the lowest value (hereinafter, "lowest pitch") Pmin among the plurality of pitches P that the voice analysis unit 34A identified for a section (hereinafter, "tail section") E of the utterance voice Vx that includes its end point tB, as illustrated in FIGS. 3 and 4 (SA1). The tail section E is, for example, a part of the utterance voice Vx that spans a predetermined length (for example, several seconds) back from the end point tB. As understood from FIG. 3, in the utterance voice Vx of an interrogative sentence, the pitch P tends to rise near the end point tB. Therefore, the pitch P at the local minimum where the pitch P of the utterance voice Vx turns from falling to rising is identified as the lowest pitch Pmin. On the other hand, as understood from FIG. 4, in the utterance voice Vx of a declarative sentence, the pitch P tends to fall monotonically toward the end point tB. Therefore, the pitch P at the end point tB of the utterance voice Vx is identified as the lowest pitch Pmin.
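Step SA1 can be illustrated with a short Python sketch that takes the minimum over the last few pitch values before the end point tB. The function name `lowest_pitch_in_tail`, the parameter `tail_frames` (the length of the tail section E in frames), and the convention that unvoiced frames are stored as None are assumptions for this example.

```python
def lowest_pitch_in_tail(pitches, tail_frames):
    """Return the lowest pitch Pmin within the tail section E.

    pitches: per-frame pitch values (Hz) up to the end point tB,
             with None marking frames where no pitch was detected.
    tail_frames: number of frames, counted back from tB, that form E.
    """
    voiced = [p for p in pitches[-tail_frames:] if p is not None]
    return min(voiced) if voiced else None


# Interrogative contour (as in FIG. 3): falls, then rises near the end,
# so Pmin is the local minimum where the contour turns upward.
question = [220.0, 210.0, 200.0, 190.0, 180.0, 200.0, 230.0]
pmin_question = lowest_pitch_in_tail(question, tail_frames=5)

# Declarative contour (as in FIG. 4): falls monotonically,
# so Pmin is the pitch at the end point tB itself.
statement = [220.0, 210.0, 200.0, 190.0, 180.0, 170.0]
pmin_statement = lowest_pitch_in_tail(statement, tail_frames=5)
```

The pitch values are invented for illustration. The same minimum operation covers both sentence types, which is why the description needs no explicit classification of the utterance as a question or a statement.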

The response generation unit 36A generates a response signal Y representing a response voice Vy with a pitch according to the lowest pitch Pmin of the utterance voice Vx (SA2). Specifically, as illustrated in FIGS. 3 and 4, the response generation unit 36A generates the response signal Y of the response voice Vy by adjusting the pitch of the voice signal Z so that the pitch at a specific time point (hereinafter, "target point") τ on the time axis of the response voice Vy matches the lowest pitch Pmin. A preferred example of the target point τ is the start point of a specific mora (typically the last mora) among the plurality of morae constituting the response voice Vy. For example, assuming the voice signal Z of the response voice "un", as understood from FIGS. 3 and 4, the response signal Y of the response voice Vy is generated by adjusting (pitch-shifting) the pitch over the entire section of the voice signal Z so that the pitch at the start point of "n", the last mora of the voice signal Z, matches the lowest pitch Pmin. Any known technique can be adopted for the pitch adjustment. The target point τ is not limited to the start point of the last mora of the response voice Vy. For example, the pitch can also be adjusted with the start point or the end point of the response voice Vy as the target point τ.
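The arithmetic behind step SA2 can be sketched as follows. The example computes the uniform shift ratio that moves the pitch at the target point τ onto Pmin and applies it to a pitch contour of the voice signal Z. It deliberately operates on pitch values only; resynthesizing the audio itself is left to whichever known pitch-shifting technique is chosen. All names and the contour values are hypothetical.

```python
import math


def shift_ratio(pitch_at_target, lowest_pitch):
    """Frequency ratio that maps the pitch at the target point onto Pmin."""
    return lowest_pitch / pitch_at_target


def shift_contour(contour, ratio):
    """Apply one ratio to the whole contour (shift over the entire section)."""
    return [p * ratio for p in contour]


def ratio_to_semitones(ratio):
    """Express a frequency ratio as a signed number of semitones."""
    return 12.0 * math.log2(ratio)


# Assumed contour of the stored response "un": index 1 is taken to be
# the start of the last mora "n", i.e. the target point.
contour_z = [240.0, 220.0, 200.0]
target_index = 1
ratio = shift_ratio(contour_z[target_index], lowest_pitch=165.0)
contour_y = shift_contour(contour_z, ratio)
semitones = ratio_to_semitones(ratio)
```

Because a single ratio is applied to every frame, the shape of the response's intonation is preserved while its overall height follows the user's utterance.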

After generating the response signal Y by the above procedure, the response generation unit 36A waits until the arrival of the time point (hereinafter, "response reproduction point") ty at which reproduction of the response voice Vy should start (SA3: NO). The response reproduction point ty is, for example, the time point at which a predetermined time (for example, 150 ms) has elapsed from the end point tB of the utterance voice Vx.

When the response reproduction point ty arrives (SA3: YES), the response generation unit 36A causes the response voice Vy to be reproduced by supplying the playback device 26 with the response signal Y adjusted according to the lowest pitch Pmin (SA4). That is, reproduction of the response voice Vy starts at the response reproduction point ty, at which the predetermined time has elapsed from the end point tB of the utterance voice Vx. It is also possible for the response generation unit 36A to supply the response signal Y to the playback device 26 sequentially from the response reproduction point ty in real time, in parallel with the generation (pitch shifting) of the response signal Y, so that the response voice Vy is reproduced. As understood from the above description, the response generation unit 36A of the first embodiment functions as an element that causes the playback device 26 to reproduce a response voice Vy with a pitch according to the lowest pitch Pmin in the tail section E of the utterance voice Vx.

When the response generation processing SA illustrated above is completed, the control device 20 determines whether the user U has instructed the end of the voice interaction, as illustrated in FIG. 2 (S14). If the end of the voice interaction has not been instructed (S14: NO), the processing returns to step S10. That is, triggered by the start of the utterance voice Vx (S10: YES), the acquisition of the utterance signal X by the voice acquisition unit 32 (S11), the identification of the pitch P by the voice analysis unit 34A (S12), and the response generation processing SA by the response generation unit 36A are executed. As understood from the above description, a response voice Vy with a pitch according to the pitch P of the utterance voice Vx is reproduced each time an utterance voice Vx is uttered. That is, a voice interaction is realized in which the utterance of an arbitrary utterance voice Vx by the user U and the reproduction of a backchannel response voice Vy to that utterance voice Vx (for example, the response voice "un") are repeated alternately. When the user U instructs the end of the voice interaction (S14: YES), the control device 20 ends the processing of FIG. 2.

As described above, in the first embodiment, a response voice Vy with a pitch according to the lowest pitch Pmin in the tail section E, which includes the end point tB of the utterance voice Vx, is reproduced from the playback device 26. It is therefore possible to realize a natural voice interaction that simulates the tendency of real dialogue in which the dialogue partner utters a response voice at a pitch corresponding to the pitch near the end point of the utterance voice. In the first embodiment in particular, the response voice Vy is reproduced so that the pitch at the start point of its last mora (the target point τ) matches the lowest pitch Pmin, so the effect of realizing a natural voice interaction close to real dialogue is especially pronounced.

<Modifications of the first embodiment>
(1) The first embodiment exemplified a configuration in which the pitch at the target point τ of the response voice Vy is made to match the lowest pitch Pmin in the tail section E of the utterance voice Vx, but the relationship between the pitch at the target point τ of the response voice Vy and the lowest pitch Pmin of the utterance voice Vx is not limited to this example (a relationship in which the two match). For example, the pitch at the target point τ of the response voice Vy can be made to match a pitch obtained by adding a predetermined adjustment value (offset) δp to, or subtracting it from, the lowest pitch Pmin. The adjustment value δp is a fixed value selected in advance (for example, a numerical value corresponding to an interval such as a fifth relative to the lowest pitch Pmin) or a variable value according to an instruction from the user U. With a configuration in which the adjustment value δp is set to a numerical value corresponding to an integer multiple of an octave, a response voice Vy with a pitch obtained by octave-shifting the lowest pitch Pmin is reproduced. Whether to apply the adjustment value δp can also be switched according to an instruction from the user U.
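The source does not state the unit in which the adjustment value δp is expressed. As a sketch under the assumption that δp is given in semitones on a logarithmic pitch scale, adding or subtracting δp corresponds to multiplying the frequency by a power of two. The function name `apply_offset` is hypothetical.

```python
def apply_offset(lowest_pitch_hz, offset_semitones):
    """Return the pitch obtained by shifting Pmin by an offset in semitones.

    Assumption: the adjustment value is expressed in semitones, so a shift
    of +12 doubles the frequency and a shift of -12 halves it.
    """
    return lowest_pitch_hz * 2.0 ** (offset_semitones / 12.0)


# Usage with Pmin = 200 Hz.
fifth_below = apply_offset(200.0, -7)    # a perfect fifth (7 semitones) down
octave_above = apply_offset(200.0, 12)   # one octave up
octave_below = apply_offset(200.0, -12)  # one octave down
```

Working in semitones keeps the musical interval constant regardless of how high or low the user's voice is, which a fixed offset in hertz would not.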

(2) In the first embodiment, the pitch of the response voice Vy is controlled according to the pitch P of the utterance voice Vx (specifically, the lowest pitch Pmin of the tail section E). However, neither the type of prosody of the utterance voice Vx used for the control nor the type of prosody of the response voice Vy that is controlled is limited to pitch. For example, the prosody of the response voice Vy may be controlled according to the volume of the utterance voice Vx (one example of prosody), or according to the range of variation in the pitch or volume of the utterance voice Vx (another example of prosody). Likewise, the volume of the response voice Vy (one example of prosody), or the range of variation in its pitch or volume (another example of prosody), may be controlled according to the prosody of the utterance voice Vx.

(3) In real dialogue between humans, the prosody of a response voice is not necessarily determined uniformly by the prosody of the utterance voice. That is, the prosody of the response voice depends on the prosody of the utterance voice but also tends to vary from one utterance to the next. In view of this tendency, the response generation unit 36A may vary the prosody (for example, pitch or volume) of the response voice Vy reproduced from the playback device 26 for each utterance voice Vx. Specifically, in the configuration of the above modification, in which the pitch of the response voice Vy is adjusted to a pitch obtained by adding the adjustment value δp to or subtracting it from the lowest pitch Pmin, the response generation unit 36A variably controls the adjustment value δp each time an utterance voice Vx is uttered. For example, the response generation unit 36A generates a random number within a predetermined range each time an utterance voice Vx is uttered and sets that random number as the adjustment value δp. This configuration makes it possible to realize a natural voice dialogue that simulates the tendency of real dialogue in which the prosody of the response voice can vary from one utterance to the next.
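A minimal sketch of the per-utterance random adjustment in modification (3) is given below. The range of the random number, the semitone pitch unit, and the function names are assumptions made for illustration; the text only states that a random number within a predetermined range is used as δp.

```python
import random


def random_adjustment(rng: random.Random, low: float, high: float) -> float:
    """Draw a new adjustment value delta_p from the range [low, high]."""
    if low > high:
        raise ValueError("low must not exceed high")
    return rng.uniform(low, high)


def response_pitch(p_min: float, rng: random.Random,
                   low: float = -2.0, high: float = 2.0) -> float:
    """Pitch at the target point: Pmin plus a freshly drawn delta_p.

    Called once per utterance, so each response gets a different offset.
    The default range of +/-2 (semitones) is a hypothetical choice.
    """
    return p_min + random_adjustment(rng, low, high)


if __name__ == "__main__":
    rng = random.Random()
    for _ in range(3):
        print(response_pitch(60.0, rng))
```

Passing the generator in explicitly keeps the sketch testable with a fixed seed.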

(4) In the first embodiment, the response signal Y is generated by adjusting the pitch of a single type of voice signal Z, but multiple types of voice signals Z with different pitches may be used to generate the response signal Y. For example, a configuration is conceivable in which the response signal Y is generated by adjusting the pitch of the voice signal Z, among the multiple types, whose pitch is closest to the lowest pitch Pmin of the utterance voice Vx.
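The selection in modification (4) could look like the following sketch. It assumes that each stored voice signal Z is summarized by a single representative pitch (for instance its pitch at the target point τ) and that pitches are in semitones; both the representation and the name `select_closest_signal` are hypothetical.

```python
from typing import Dict, Tuple


def select_closest_signal(candidates: Dict[str, float], p_min: float) -> Tuple[str, float]:
    """Pick the stored signal whose representative pitch is closest to Pmin.

    candidates -- maps a signal name to its representative pitch
    Returns the chosen name and the remaining pitch shift needed to reach Pmin.
    """
    if not candidates:
        raise ValueError("no candidate signals")
    name = min(candidates, key=lambda k: abs(candidates[k] - p_min))
    return name, p_min - candidates[name]


if __name__ == "__main__":
    stored = {"low": 48.0, "mid": 55.0, "high": 62.0}
    print(select_closest_signal(stored, 57.0))
```

Choosing the nearest candidate keeps the subsequent pitch adjustment small, which is one plausible motivation for holding several signals.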

(5) In the first embodiment, the response voice Vy is reproduced from the playback device 26, but the utterance voice Vx can also be reproduced from the playback device 26 by supplying the utterance signal X acquired by the voice acquisition unit 32 to the playback device 26. A configuration in which whether the utterance voice Vx is reproduced from the playback device 26 is switched according to an instruction from the user U may also be adopted.

<Second Embodiment>
A second embodiment of the present invention will be described. In each of the embodiments exemplified below, elements whose operations and functions are the same as in the first embodiment are denoted by the reference signs used in the description of the first embodiment, and detailed description of each is omitted as appropriate.

FIG. 6 is a configuration diagram of a voice interaction device 100B according to the second embodiment of the present invention. Like the voice interaction device 100A of the first embodiment, the voice interaction device 100B of the second embodiment reproduces a response voice Vy to an utterance voice Vx uttered by the user U. As illustrated in FIG. 6, the voice interaction device 100B has a configuration in which the response generation unit 36A of the voice interaction device 100A of the first embodiment is replaced with a response generation unit 36B. The configuration and operation of the other elements of the voice interaction device 100B (the voice input device 24, the playback device 26, the voice acquisition unit 32, and the voice analysis unit 34A) are the same as in the first embodiment.

In real dialogue between humans, a conversation partner tends to utter a response voice with a prosody that depends on the content of the speaker's utterance (whether it is a question or a declarative sentence). For example, the prosody differs between a response to a question and a response to a declarative sentence. Specifically, compared with a backchannel response (aizuchi) to a declarative sentence, the voice of an answer to a question tends to be uttered at a relatively high volume and with emphasized intonation (temporal variation in volume or pitch), for example because the respondent's answer (affirmative or negative) needs to be clearly recognized by the speaker. In view of this tendency, the response generation unit 36B of the second embodiment causes the playback device 26 to reproduce a response voice Vy with a prosody that depends on the content of the utterance voice Vx (the distinction between a question and a declarative sentence).

FIG. 7 illustrates the transition of the pitch P of an utterance voice Vx that is a question, and FIG. 8 illustrates the transition of the pitch P of an utterance voice Vx that is a declarative sentence. As can be understood from FIGS. 7 and 8, the transition of the pitch P near the end of the utterance voice Vx (the tendency of its temporal variation) tends to differ depending on whether the utterance content is a question or a declarative sentence. Specifically, as illustrated in FIG. 7, the pitch P of a question either turns from falling to rising or rises monotonically within the tail section E, whereas, as illustrated in FIG. 8, the pitch P of a declarative sentence falls monotonically from the start point tA to the end point tB of the tail section E. Therefore, by analyzing the transition of the pitch P near the end of the utterance voice Vx (the tail section E), it is possible to estimate whether the utterance content corresponds to a question or a declarative sentence.

In view of this tendency, the response generation unit 36B of the second embodiment causes the playback device 26 to reproduce a response voice Vy with a prosody that depends on the transition of the pitch P in the tail section E of the utterance voice Vx (that is, on the distinction between a question and a declarative sentence). Specifically, as illustrated in FIG. 7, when the pitch P of the utterance voice Vx turns from falling to rising within the tail section E, or rises monotonically within the tail section E (that is, when the utterance content is estimated to be a question), a response voice Vy with a prosody suited to a question is reproduced from the playback device 26. On the other hand, as illustrated in FIG. 8, when the pitch P of the utterance voice Vx falls monotonically within the tail section E (that is, when the utterance content is estimated to be a declarative sentence), a response voice Vy with a prosody suited to a declarative sentence is reproduced from the playback device 26.

As illustrated in FIG. 6, the storage device 22 of the voice interaction device 100B of the second embodiment stores a response signal YA and a response signal YB, each a prerecording of a response voice Vy with specific utterance content. The response signal YA and the response signal YB share the same utterance content (written form) but differ in prosody. Specifically, the response voice Vy represented by the response signal YA is the sound "un" ("yeah") uttered with the intent of an affirmative answer to an utterance voice Vx that is a question, and the response voice Vy represented by the response signal YB is the sound "un" uttered with the intent of a backchannel response (aizuchi) to an utterance voice Vx that is a declarative sentence. In terms of prosody, the response voice Vy of the response signal YA is louder than the response voice Vy of the response signal YB and has a wider range of variation in volume and pitch (that is, stronger intonation). The response generation unit 36B of the second embodiment selectively supplies either the response signal YA or the response signal YB stored in the storage device 22 to the playback device 26, thereby selectively reproducing multiple response voices Vy that differ in prosody. The utterance content may also differ between the response signal YA and the response signal YB.

FIG. 9 is a flowchart of the response generation process SB by which the response generation unit 36B of the second embodiment causes the playback device 26 to reproduce the response voice Vy. In the second embodiment, the response generation process SA of FIG. 2 exemplified in the first embodiment is replaced with the response generation process SB of FIG. 9. Processing other than the response generation process SB is the same as in the first embodiment. The response generation process SB of FIG. 9 is started when the utterance voice Vx ends (S13: YES).

Upon starting the response generation process SB, the response generation unit 36B calculates the average of the multiple pitches P within a first section E1 of the tail section E of the utterance voice Vx (hereinafter the "first average pitch") Pave1 and the average of the multiple pitches P within a second section E2 (hereinafter the "second average pitch") Pave2 (SB1). As illustrated in FIGS. 7 and 8, the first section E1 is an earlier part of the tail section E (for example, a section including the start point tA of the tail section E), and the second section E2 is a part of the tail section E that follows the first section E1 (for example, a section including the end point tB of the tail section E). Specifically, the first half of the tail section E is defined as the first section E1, and the second half of the tail section E is defined as the second section E2. However, the conditions for the first section E1 and the second section E2 are not limited to this example. For example, a configuration in which the first section E1 and the second section E2 are separated by a gap, or a configuration in which the first section E1 and the second section E2 differ in duration, may also be adopted.

The response generation unit 36B compares the first average pitch Pave1 of the first section E1 with the second average pitch Pave2 of the second section E2 and determines whether the first average pitch Pave1 is lower than the second average pitch Pave2 (SB2). As described above, the pitch P of an utterance voice Vx that is a question tends to turn from falling to rising, or to rise monotonically, within the tail section E. Therefore, as illustrated in FIG. 7, the first average pitch Pave1 is likely to be lower than the second average pitch Pave2 (Pave1 < Pave2). On the other hand, the pitch P of an utterance voice Vx that is a declarative sentence tends to fall monotonically within the tail section E. Therefore, as illustrated in FIG. 8, the first average pitch Pave1 is likely to be higher than the second average pitch Pave2 (Pave1 > Pave2).

In view of this tendency, when the first average pitch Pave1 is lower than the second average pitch Pave2 (SB2: YES), that is, when the utterance voice Vx is likely to be a question, the response generation unit 36B of the second embodiment selects from the storage device 22 the response signal YA, which corresponds to a response voice Vy that answers a question (SB3). On the other hand, when the first average pitch Pave1 is higher than the second average pitch Pave2 (SB2: NO), that is, when the utterance voice Vx is likely to be a declarative sentence, the response generation unit 36B selects from the storage device 22 the response signal YB, which corresponds to a response voice Vy that expresses agreement with a declarative sentence (SB4).
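Steps SB1 to SB4 can be sketched as follows. The sketch assumes the pitch P has already been sampled at regular intervals over the tail section E and that the section is split into equal halves, as in the example given in the text; the function name and the string labels are hypothetical. Equal averages fall into the NO branch here, which is one possible reading of the flowchart.

```python
from statistics import mean
from typing import Sequence


def select_response_signal(tail_pitches: Sequence[float]) -> str:
    """Return "YA" (answer to a question) or "YB" (agreement with a declarative).

    tail_pitches -- pitch values P sampled over the tail section E, in time order
    """
    if len(tail_pitches) < 2:
        raise ValueError("need at least two pitch values in the tail section")
    half = len(tail_pitches) // 2
    p_ave1 = mean(tail_pitches[:half])  # SB1: first section E1 (first half of E)
    p_ave2 = mean(tail_pitches[half:])  # SB1: second section E2 (second half of E)
    # SB2: YES when Pave1 < Pave2 (SB3 selects YA); otherwise NO (SB4 selects YB).
    return "YA" if p_ave1 < p_ave2 else "YB"


if __name__ == "__main__":
    falling_then_rising = [200.0, 190.0, 185.0, 195.0, 215.0, 230.0]
    falling = [220.0, 210.0, 200.0, 190.0, 180.0, 170.0]
    print(select_response_signal(falling_then_rising))
    print(select_response_signal(falling))
```

Because only two averages are compared, the check tolerates small local fluctuations in the pitch contour.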

Having selected by the above procedure the response signal Y (YA or YB) that corresponds to the transition of the pitch P of the utterance voice Vx, the response generation unit 36B, as in the first embodiment, supplies that response signal Y to the playback device 26 when the response playback point ty arrives (SB5: YES), thereby causing the response voice Vy to be reproduced (SB6). Specifically, when the pitch P of the utterance voice Vx turns from falling to rising within the tail section E, or rises monotonically within the tail section E (SB2: YES), a response voice Vy that answers a question is reproduced, and when the pitch P of the utterance voice Vx falls monotonically within the tail section E (SB2: NO), a response voice Vy that expresses agreement with a declarative sentence is reproduced. That is, the prosody of the response voice Vy reproduced from the playback device 26 differs depending on whether the utterance voice Vx is a question or a declarative sentence.

The acquisition of the utterance signal X by the voice acquisition unit 32 (S11), the identification of the pitch P by the voice analysis unit 34A (S12), and the response generation process SB by the response generation unit 36B are repeated until the user U instructs that the voice dialogue end (S14: NO). Therefore, as in the first embodiment, a voice dialogue is realized in which the utterance of an arbitrary utterance voice Vx by the user U and the reproduction of a response voice Vy to that utterance voice Vx alternate repeatedly.

As described above, in the second embodiment, the playback device 26 reproduces a response voice Vy with a prosody that depends on the transition of the pitch P in the tail section E of the utterance voice Vx. It is therefore possible to realize a natural voice dialogue that simulates the tendency of real dialogue in which a conversation partner utters a response voice with a prosody that depends on the content of the speaker's utterance. In the second embodiment in particular, the prosody of the response voice Vy differs between the case in which the pitch P turns from falling to rising, or rises monotonically, within the tail section E and the case in which the pitch P falls monotonically from the start point tA to the end point tB of the tail section E. This makes it possible to realize a natural voice dialogue that simulates the tendency of real dialogue in which the prosody of a response voice differs between a question and a declarative sentence.

In addition, in the second embodiment, the prosody of the response voice Vy is varied according to the result of comparing the first average pitch Pave1 in the first section E1 of the tail section E with the second average pitch Pave2 in the second section E2. This has the advantage that the transition of the pitch P can be evaluated (and hence the prosody of the response voice Vy selected) by the simple processing of averaging and comparing multiple pitches P.

<Modifications of the Second Embodiment>
(1) In the second embodiment, the multiple response signals Y (YA, YB) stored in advance in the storage device 22 are selectively supplied to the playback device 26. Alternatively, the response generation unit 36B may generate a response signal Y with a prosody that depends on the transition of the pitch P within the tail section E of the utterance voice Vx by adjusting a single prerecorded response signal Y. For example, assuming a configuration in which the storage device 22 holds a response signal YA of a response voice Vy for a declarative sentence, when the utterance voice Vx is a question the response generation unit 36B generates a response signal YB of an answering response voice Vy by increasing the volume of the response signal YA and widening its range of variation in volume and pitch, whereas when the utterance voice Vx is a declarative sentence it supplies the response signal YA to the playback device 26. It is also possible to generate the response signal YA of a response voice Vy expressing agreement with a declarative sentence by decreasing the volume of an initial response signal Y and narrowing its range of variation in volume and pitch.

A configuration that generates response signals Y with different prosodies by adjusting a single response signal Y does not need to hold multiple response signals Y (YA, YB) with different prosodies in the storage device 22, which has the advantage of reducing the storage capacity required of the storage device 22. On the other hand, the configuration of the second embodiment, which selectively uses multiple response signals Y with different prosodies, does not need to adjust the prosody of an initial response signal Y according to the utterance content of the utterance voice Vx, which has the advantage of reducing the processing load on the response generation unit 36B.
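The adjustment of a single response signal can be illustrated on a contour level. The sketch below is an assumption-laden simplification: it operates on a sequence of per-frame values (a volume envelope or a pitch contour) rather than on the waveform itself, and the parameters `level_offset` and `range_gain` as well as the function name are hypothetical. Resynthesizing audio from the adjusted contour is outside its scope.

```python
from statistics import mean
from typing import List, Sequence


def adjust_contour(values: Sequence[float], level_offset: float,
                   range_gain: float) -> List[float]:
    """Shift the overall level of a contour and scale its variation range.

    values       -- per-frame volume or pitch values of the stored response
    level_offset -- added to every frame (positive raises the overall level)
    range_gain   -- > 1 widens the variation around the mean, < 1 narrows it
    """
    if not values:
        raise ValueError("empty contour")
    center = mean(values)
    return [center + level_offset + range_gain * (v - center) for v in values]


if __name__ == "__main__":
    contour = [1.0, 2.0, 3.0]
    print(adjust_contour(contour, level_offset=1.0, range_gain=2.0))  # louder, wider
    print(adjust_contour(contour, level_offset=-0.5, range_gain=0.5))  # quieter, narrower
```

A gain above 1 with a positive offset corresponds to deriving the answer-style response, and a gain below 1 with a negative offset to deriving the agreement-style response.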

(2) In the second embodiment, the first average pitch Pave1 in the first section E1 of the tail section E is compared with the second average pitch Pave2 in the second section E2, but the method of estimating whether the utterance content of the utterance voice Vx corresponds to a question or a declarative sentence is not limited to this example. For example, in an utterance voice Vx that is a declarative sentence the pitch P falls monotonically within the tail section E, so the pitch P tends to reach the lowest pitch Pmin at the end point tB of the tail section E. Therefore, when the portion of the tail section E after the point at which the pitch P reaches the lowest pitch Pmin is sufficiently short compared with the portion before that point (for example, when its duration falls below a predetermined threshold), the utterance content of the utterance voice Vx may be estimated to be a declarative sentence. It is also possible to estimate whether the utterance content corresponds to a question or a declarative sentence according to the transition of the pitch P before and after the point of the lowest pitch Pmin within the tail section E. For example, when the pitch P rises after the point of the lowest pitch Pmin within the tail section E, the response generation unit 36B estimates that the utterance content of the utterance voice Vx is a question.
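The alternative estimate based on the position of the lowest pitch Pmin can be sketched as below. The sketch assumes the tail section E is given as regularly sampled pitch values and measures the portion after Pmin in frames; the default threshold of 2 frames and the function name are hypothetical.

```python
from typing import Sequence


def is_declarative(tail_pitches: Sequence[float], max_frames_after_min: int = 2) -> bool:
    """Estimate a declarative sentence when Pmin lies (almost) at the end of E.

    tail_pitches         -- pitch values P over the tail section E, in time order
    max_frames_after_min -- threshold on the number of frames following Pmin
    """
    if not tail_pitches:
        raise ValueError("empty tail section")
    # Index of the first occurrence of the lowest pitch Pmin.
    i_min = min(range(len(tail_pitches)), key=lambda i: tail_pitches[i])
    frames_after_min = len(tail_pitches) - 1 - i_min
    return frames_after_min < max_frames_after_min


if __name__ == "__main__":
    print(is_declarative([220.0, 210.0, 200.0, 190.0, 180.0, 170.0]))  # Pmin at the end
    print(is_declarative([200.0, 190.0, 185.0, 195.0, 215.0, 230.0]))  # pitch rises after Pmin
```

A contour that keeps rising for several frames after Pmin is left to the question branch, in line with the last example in the text.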

<Third Embodiment>
FIG. 10 is a configuration diagram of a voice interaction device 100C according to the third embodiment of the present invention. Like the voice interaction device 100A of the first embodiment, the voice interaction device 100C of the third embodiment reproduces a response voice Vy to an utterance voice Vx uttered by the user U. In the third embodiment, in addition to a response voice that is an answer or a backchannel response (aizuchi) to the utterance voice Vx (hereinafter the "second response voice") Vy2, a response voice that asks back about the utterance voice Vx (hereinafter the "first response voice") Vy1 can be reproduced from the playback device 26. The first response voice Vy1 is a sound such as "e?" ("eh?") or "nani?" ("what?") that asks the speaker to repeat the utterance voice Vx. As illustrated in FIG. 10, the storage device 22 of the voice interaction device 100C of the third embodiment stores a response signal Y1, which is a recording of the ask-back first response voice Vy1, and a response signal Y2, which is a recording of the second response voice Vy2 other than an ask-back (for example, a backchannel response such as "un").

As illustrated in FIG. 10, the voice interaction device 100C of the third embodiment has a configuration in which the voice analysis unit 34A and the response generation unit 36A of the voice interaction device 100A of the first embodiment are replaced with a voice analysis unit 34C and a response generation unit 36C. The configuration and operation of the other elements of the voice interaction device 100C (the voice input device 24, the playback device 26, and the voice acquisition unit 32) are the same as in the first embodiment.

The voice analysis unit 34C of the third embodiment identifies a prosody index value Q from the utterance signal X acquired by the voice acquisition unit 32. The prosody index value Q is an index value relating to the prosody of the utterance voice Vx and is calculated for each utterance voice Vx (that is, for each unit, where a unit is a continuous utterance from the start point to the end point of the utterance voice Vx). Specifically, the average pitch, the range of pitch variation, the average volume, or the range of volume variation within the utterance section of the utterance voice Vx is calculated from the utterance signal X as the prosody index value Q. As described above, the response generation unit 36C of the third embodiment causes the playback device 26 to selectively reproduce the first response voice Vy1, which asks back about the utterance voice Vx, or the second response voice Vy2, which is a response other than an ask-back.

In real dialogue between humans, when the prosody of a speaker's utterance voice changes, the conversation partner tends to have difficulty hearing it, which increases the need to ask back. Specifically, when the prosody of the speaker's utterance voice deviates from that speaker's past prosodic tendency (for example, when the actual volume of the utterance voice is lower than the volume the conversation partner expects from the past tendency), the conversation partner is likely to be unable to hear the utterance voice properly and, as a result, to ask the speaker to repeat it. In view of this tendency, the response generation unit 36C of the third embodiment compares the prosody index value Q identified by the voice analysis unit 34C with a threshold QTH and causes the playback device 26 to reproduce either the first response voice Vy1 or the second response voice Vy2 according to the result of the comparison. The threshold QTH is set to a representative value (for example, the average) of the prosody index values Q of utterance voices Vx uttered by the user U in the past. That is, the threshold QTH corresponds to the standard prosody estimated from the user U's past utterances. When the prosody index value Q of the utterance voice Vx deviates from the threshold QTH, the ask-back first response voice Vy1 is reproduced, and when the prosody index value Q is close to the threshold QTH, the backchannel second response voice Vy2 is reproduced.
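The comparison between the prosody index value Q and the threshold QTH might be sketched as follows. The passage above does not state how "deviates" and "close to" are decided, so the symmetric `tolerance` band used here is purely an assumption for illustration, as are the function name and the string labels.

```python
def choose_response(q: float, q_th: float, tolerance: float) -> str:
    """Return "Y1" (ask-back) or "Y2" (backchannel or answer).

    q         -- prosody index value Q of the current utterance
    q_th      -- threshold QTH, the representative value of past utterances
    tolerance -- hypothetical half-width of the band treated as "close to QTH"
    """
    if tolerance < 0:
        raise ValueError("tolerance must be non-negative")
    # Q far from QTH: the utterance is atypical for this user, so ask back.
    return "Y1" if abs(q - q_th) > tolerance else "Y2"


if __name__ == "__main__":
    print(choose_response(q=50.0, q_th=60.0, tolerance=5.0))  # atypically quiet
    print(choose_response(q=58.0, q_th=60.0, tolerance=5.0))  # typical
```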

FIG. 11 is a flowchart of the processing executed by the control device 20 of the third embodiment. The processing of FIG. 11 is started, for example, in response to an instruction from the user U to the voice interaction device 100C (for example, an instruction to start a program for voice dialogue).

As in the first embodiment, when an utterance voice Vx starts (S20: YES), the voice acquisition unit 32 acquires the utterance signal X from the voice input device 24 and stores it in the storage device 22 (S21). The voice analysis unit 34C identifies, from the utterance signal X acquired by the voice acquisition unit 32, a feature quantity q relating to the prosody of the utterance voice Vx (S22). The feature quantity q is, for example, the pitch P or the volume of the utterance voice Vx. The acquisition of the utterance signal X by the voice acquisition unit 32 (S21) and the identification of the feature quantity q by the voice analysis unit 34C (S22) are repeated until the utterance voice Vx ends (S23: NO). That is, a time series of multiple feature quantities q of the utterance voice Vx is identified for the utterance section from the start point to the end point tB of the utterance voice Vx.

When the utterance voice Vx ends (S23: YES), the voice analysis unit 34C calculates the prosody index value Q from the time series of feature quantities q specified over the utterance section from the start point to the end point of the utterance voice Vx (S24). Specifically, the voice analysis unit 34C calculates, as the prosody index value Q, the average or the fluctuation width (range) of the feature quantities q within the utterance section.
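The calculation in step S24 can be sketched as follows. This is a minimal illustration, not the implementation described in the patent: the function name `prosody_index`, the `mode` parameter, and the sample values are assumptions introduced here, and the feature quantities q are taken to be already extracted as a list of numbers (for example, per-frame pitch or volume).

```python
from statistics import fmean


def prosody_index(q_series, mode="mean"):
    """Return a prosody index value Q for one utterance section.

    q_series: feature quantities q (e.g. pitch or volume per frame).
    mode: "mean" for the average, "range" for the fluctuation width.
    """
    if not q_series:
        raise ValueError("utterance section contains no feature values")
    if mode == "mean":
        return fmean(q_series)
    if mode == "range":
        return max(q_series) - min(q_series)
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical feature values for one utterance.
q = [60.0, 62.0, 58.0, 64.0]
print(prosody_index(q, "mean"))   # 61.0
print(prosody_index(q, "range"))  # 6.0
```

Which of the two statistics serves as Q is a design choice left open by the description; the average tracks the overall level of the utterance, whereas the range tracks how much it varies.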

Once the prosody index value Q of the current utterance voice Vx has been calculated by the processing described above, the response generation unit 36C executes response generation processing SC for causing the playback device 26 to reproduce a response voice Vy. The response generation processing SC of the third embodiment causes the playback device 26 to selectively reproduce either the first response voice Vy1 or the second response voice Vy2 according to the prosody index value Q calculated by the voice analysis unit 34C.

When the response generation processing SC is completed, the voice analysis unit 34C updates the threshold QTH according to the prosody index value Q of the current utterance voice Vx (S25). Specifically, the voice analysis unit 34C calculates, as the updated threshold QTH, a representative value (for example, the average or the median) of the prosody index values Q of the past utterance voices Vx, including the current one. For example, as expressed by equation (1) below, the weighted average (exponential moving average) of the current prosody index value Q and the threshold QTH before the update is calculated as the updated threshold QTH. The symbol α in equation (1) is a predetermined positive number less than 1 (a forgetting factor).
QTH = α·Q + (1 − α)·QTH …… (1)
As understood from the above description, the voice analysis unit 34C of the third embodiment functions as an element that sets a representative value of the prosody index values Q of a plurality of past utterance voices Vx as the threshold QTH. Each time an utterance voice Vx is uttered, the threshold QTH is updated to a value reflecting the prosody index value Q of that utterance voice Vx, and it thus becomes a value corresponding to the standard prosody estimated from multiple utterances by the user U. The threshold QTH may, however, be fixed to a predetermined value. For example, the average of the prosody index values Q specified from the utterance voices of a large number of unspecified speakers may be set as the threshold QTH.
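The update of equation (1) can be illustrated with a short sketch. The function name `update_threshold`, the value α = 0.5, the initial threshold, and the sample index values are assumptions chosen only to keep the arithmetic easy to follow; the description does not specify them.

```python
def update_threshold(q_th, q, alpha=0.5):
    """Apply equation (1): QTH <- alpha * Q + (1 - alpha) * QTH."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be a positive number less than 1")
    return alpha * q + (1.0 - alpha) * q_th


# Hypothetical starting threshold and three successive index values Q.
q_th = 60.0
for q in [62.0, 58.0, 70.0]:
    q_th = update_threshold(q_th, q)
    print(q_th)
# Prints 61.0, then 59.5, then 64.75.
```

A larger α makes the threshold follow recent utterances more quickly, while a smaller α gives more weight to the accumulated history.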

The acquisition of the utterance signal X by the voice acquisition unit 32 (S21), the calculation of the prosody index value Q by the voice analysis unit 34C (S22, S24), the response generation processing SC by the response generation unit 36C, and the update of the threshold QTH by the voice analysis unit 34C (S25) are repeated for each utterance voice Vx until the user U instructs the end of the voice interaction (S26: NO). A voice interaction is thus realized in which utterance of the utterance voice Vx by the user U and selective reproduction of the first response voice Vy1 (ask-back) or the second response voice Vy2 (backchannel) are alternately repeated.

FIG. 12 is a flowchart of the response generation processing SC of the third embodiment. On starting the response generation processing SC, the response generation unit 36C compares the prosody index value Q specified by the voice analysis unit 34C with the current threshold QTH and determines whether the prosody index value Q falls within a predetermined range R that contains the threshold QTH (hereinafter the "allowable range") (SC1). FIGS. 13 and 14 illustrate transitions of the feature quantity q that the voice analysis unit 34C specifies from the utterance voice Vx. As illustrated in FIGS. 13 and 14, the allowable range R is a range of predetermined width centered on the threshold QTH. The processing of comparing the prosody index value Q with the threshold QTH (SC1) can also be realized as processing of determining whether the absolute value of the difference between the prosody index value Q and the threshold QTH exceeds a predetermined value (for example, half the width of the allowable range R).

FIG. 13 assumes a case in which the prosody index value Q lies inside the allowable range R. That the prosody index value Q falls within the allowable range R means that the prosody of the current utterance voice Vx is close to the standard prosody of the user U (the tendency of past utterances). In terms of real conversation between people, this can be regarded as a situation in which the partner can easily catch the utterance (a situation in which an ask-back to the speaker is unlikely to be needed). Accordingly, when the prosody index value Q lies inside the allowable range R (SC1: YES), the response generation unit 36C selects, from the storage device 22, the response signal Y2 of the second response voice Vy2, which is a backchannel to the utterance voice Vx (SC2).

FIG. 14, on the other hand, assumes a case in which the prosody index value Q lies outside the allowable range R (specifically, below the lower limit of the allowable range R). That the prosody index value Q does not fall within the allowable range R means that the prosody of the current utterance voice Vx deviates from the standard prosody of the user U. In terms of real conversation between people, this can be regarded as a situation in which the partner has difficulty catching the utterance (a situation in which an ask-back to the speaker is likely to be needed). Accordingly, when the prosody index value Q lies outside the allowable range R (SC1: NO), the response generation unit 36C selects, from the storage device 22, the response signal Y1 of the first response voice Vy1, which is an ask-back to the utterance voice Vx (for example, a voice such as "Eh?" or "What?"), as the signal to be supplied to the playback device 26 (SC3).

Having selected the response signal Y (the response voice Vy to be reproduced) according to the prosody index value Q by the above procedure, the response generation unit 36C, as in the first embodiment, supplies the response signal Y to the playback device 26 upon arrival of the response reproduction point ty (SC4: YES), thereby causing the response voice Vy (the first response voice Vy1 or the second response voice Vy2) to be reproduced (SC5). That is, when the prosody index value Q falls within the allowable range R, the second response voice Vy2 (backchannel) is reproduced, and when the prosody index value Q does not fall within the allowable range R, the first response voice Vy1 (ask-back) is reproduced.
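The selection in steps SC1 to SC3 can be sketched as a comparison of the absolute difference |Q − QTH| with half the width of the allowable range R. The function name `select_response`, the string labels, and the sample numbers are assumptions for illustration. Treating a value exactly on the boundary of R as inside the range is also an assumption, since the description does not state how the boundary is handled.

```python
def select_response(q, q_th, half_width):
    """Choose the response voice from the prosody index value Q (SC1 to SC3).

    Returns "Vy2" (backchannel) when Q lies within the allowable range R
    centered on QTH, and "Vy1" (ask-back) otherwise.
    """
    if half_width < 0:
        raise ValueError("half_width must be non-negative")
    if abs(q - q_th) <= half_width:  # boundary counted as inside (assumption)
        return "Vy2"
    return "Vy1"


# Hypothetical values: threshold 60 with an allowable range of 57 to 63.
print(select_response(61.0, 60.0, 3.0))  # Vy2
print(select_response(52.0, 60.0, 3.0))  # Vy1
```

Because the test is two-sided, an utterance that is unusually loud or high triggers an ask-back just as an unusually quiet or low one does.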

As described above, in the third embodiment, the first response voice Vy1, which represents an ask-back to the utterance voice Vx, and the second response voice Vy2, which is a response other than an ask-back, are selectively reproduced from the playback device 26. It is therefore possible to realize a natural voice interaction that simulates the tendency of real conversation in which not only backchannels to the speaker's utterances but also ask-backs (requests to repeat) occur as appropriate.

Also, in the third embodiment, either the first response voice Vy1 or the second response voice Vy2 is selected according to the result of comparing the prosody index value Q, which represents the prosody of the utterance voice Vx, with the threshold QTH. It is therefore possible to realize a natural voice interaction that simulates the tendency of real conversation in which an unexpected change in the prosody of an utterance makes it hard to catch and increases the need to ask back. In the third embodiment in particular, a representative value of the prosody index values Q over a plurality of past utterance voices Vx is set as the threshold QTH. This has the further advantage of realizing a natural voice interaction that simulates the tendency of real conversation in which an ask-back from the partner is likely to occur when the prosody of the speaker's utterance deviates from that speaker's standard prosody (that is, the prosody the partner expects). Moreover, the first response voice Vy1 is selected when the prosody index value Q lies outside the allowable range R containing the threshold QTH, and the second response voice Vy2 is selected when it lies inside the allowable range R. Compared with a configuration that selects between the first response voice Vy1 and the second response voice Vy2 solely according to whether the prosody index value Q is larger or smaller than the threshold QTH, this reduces the likelihood that the first response voice Vy1 is reproduced excessively often (that is, the first response voice Vy1 is reproduced at a moderate frequency).

<Modification of Third Embodiment>
In the third embodiment, reproduction of the first response voice Vy1 or of the second response voice Vy2 is selected according to the prosody index value Q of the utterance voice Vx, but it is also possible to reproduce the first response voice Vy1 (ask-back) at a predetermined frequency regardless of the characteristics of the utterance voice Vx. Specifically, the response generation unit 36C causes the playback device 26 to reproduce the first response voice Vy1 (ask-back) for utterance voices Vx selected at random from the plurality of utterance voices Vx that the user U utters in sequence, and to reproduce the second response voice Vy2 (backchannel) for the remaining utterance voices Vx. For example, the response generation unit 36C generates a random number within a predetermined range for each utterance voice Vx, and selects the first response voice Vy1 when the random number exceeds a threshold and the second response voice Vy2 when the random number falls below the threshold. In this modification, the first response voice Vy1 (ask-back) is reproduced for utterance voices Vx selected at random from the plurality of utterance voices Vx, so it is possible to realize a natural voice interaction that simulates the tendency of real conversation in which ask-backs occur at random.

In the above configuration, the response generation unit 36C can variably set the ratio of the number of reproductions of the first response voice Vy1 to the number of utterances of the utterance voice Vx (that is, the reproduction frequency of the first response voice Vy1). For example, the response generation unit 36C controls the reproduction frequency of the first response voice Vy1 by adjusting the threshold compared with the random number. For example, when the reproduction frequency of the first response voice Vy1 is set to 30%, the first response voice Vy1 is reproduced for 30% of the total number of utterances of the utterance voice Vx, and the second response voice Vy2 is reproduced for the remaining 70% of the utterances. The reproduction frequency of the first response voice Vy1 (for example, the threshold compared with the random number) is variably set according to, for example, an instruction from the user U.
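The random selection described in this modification can be sketched as follows, assuming a random number drawn uniformly from the interval [0, 1) so that a reproduction frequency f corresponds to a threshold of 1 − f. The function name `choose_response`, the `rng` parameter, and the seed are assumptions for illustration; the description only states that a random number within a predetermined range is compared with a threshold.

```python
import random


def choose_response(ask_back_rate, rng=random):
    """Pick "Vy1" (ask-back) with probability ask_back_rate, else "Vy2"."""
    if not 0.0 <= ask_back_rate <= 1.0:
        raise ValueError("ask_back_rate must be between 0 and 1")
    threshold = 1.0 - ask_back_rate
    # A uniform random number above the threshold selects the ask-back.
    return "Vy1" if rng.random() > threshold else "Vy2"


# With a 30% setting, roughly 30% of many utterances receive an ask-back.
rng = random.Random(0)
trials = 10_000
ask_backs = sum(choose_response(0.3, rng) == "Vy1" for _ in range(trials))
print(ask_backs / trials)
```

Over a small number of utterances the observed ratio will vary around the configured frequency, which matches the intent that ask-backs appear to occur at random.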

<Fourth Embodiment>
FIG. 15 is a configuration diagram of the voice interaction device 100D according to the fourth embodiment of the present invention. Like the voice interaction device 100A of the first embodiment, the voice interaction device 100D of the fourth embodiment reproduces a response voice Vy to the utterance voice Vx uttered by the user U.

As illustrated in FIG. 15, the voice interaction device 100D of the fourth embodiment has a configuration in which the voice analysis unit 34A and the response generation unit 36A of the voice interaction device 100A of the first embodiment are replaced with a history management unit 38 and a response generation unit 36D. The configuration and operation of the other elements of the voice interaction device 100D (the voice input device 24, the playback device 26, and the voice acquisition unit 32) are the same as in the first embodiment. The storage device 22 of the fourth embodiment stores a response signal Y representing a response voice Vy with specific utterance content. The following description uses as an example a response voice Vy of "un" (roughly "uh-huh"), which is a backchannel to the utterance voice Vx.

The history management unit 38 in FIG. 15 generates a history H of voice interaction by the voice interaction device 100D (hereinafter the "usage history"). The usage history H of the fourth embodiment is the number N of voice interactions executed in the past using the voice interaction device 100D (hereinafter the "usage count"). Specifically, the history management unit 38 counts the usage count N by treating the span from the start of a voice interaction (activation of the voice interaction device 100D) to its end as one interaction (that is, one session of voice interaction containing multiple pairs of an utterance voice Vx and a reproduced response voice Vy). The usage history H generated by the history management unit 38 is stored in the storage device 22.

The response generation unit 36D of the fourth embodiment causes the playback device 26 to reproduce a response voice Vy whose prosody corresponds to the usage history H generated by the history management unit 38. That is, the prosody of the response voice Vy is variably controlled according to the usage history H. In the fourth embodiment, the standby time W before reproduction of the response voice Vy is controlled according to the usage history H as the prosody of that response voice Vy. The standby time W is the length of time from the end point tB of the utterance voice Vx to the response reproduction point ty of the response voice Vy (that is, the interval between the utterance voice Vx and the response voice Vy).

In real conversation between people, the prosody of utterances tends to change over time as conversation with a particular partner is repeated. Specifically, immediately after two people meet and begin talking for the first time (a stage at which neither is used to talking with the other), neither can gauge the timing that suits the other, so the time from an utterance to the response to it is long (that is, the conversation is awkward). As conversation with that partner is repeated, this time becomes shorter (that is, the conversation proceeds at a good tempo). In view of this tendency, the response generation unit 36D of the fourth embodiment controls the standby time W according to the usage history H so that the standby time W of the response voice Vy is shorter when the usage count N indicated by the usage history H is large than when the usage count N is small.

FIG. 16 is a flowchart of the processing executed by the control device 20 of the fourth embodiment. The processing of FIG. 16 is started, for example, in response to an instruction from the user U to the voice interaction device 100D (an instruction to start a program for voice interaction). When voice interaction by the voice interaction device 100D is started for the first time, the usage history H is set to an initial value (for example, N = 0).

As in the first embodiment, when the utterance voice Vx starts (S30: YES), the voice acquisition unit 32 acquires the utterance signal X from the voice input device 24 and stores it in the storage device 22 (S31). The acquisition of the utterance signal X by the voice acquisition unit 32 is repeated until the utterance voice Vx ends (S32: NO).

When the utterance voice Vx ends (S32: YES), the response generation unit 36D executes response generation processing SD for causing the playback device 26 to reproduce a response voice Vy whose prosody corresponds to the usage history H stored in the storage device 22. As described above, the response generation processing SD of the fourth embodiment controls, according to the usage history H, the standby time W from the end point tB of the utterance voice Vx to the response reproduction point ty at which reproduction of the response voice Vy starts. The acquisition of the utterance signal X by the voice acquisition unit 32 (S31) and the response generation processing SD by the response generation unit 36D are repeated until the user U instructs the end of the voice interaction (S33: NO). As in the first embodiment, a voice interaction is thus realized in which utterance of an arbitrary utterance voice Vx by the user U and reproduction of a response voice Vy to that utterance voice Vx are alternately repeated.

When the user U instructs the end of the voice interaction (S33: YES), the history management unit 38 updates the usage history H stored in the storage device 22 so that it reflects the current voice interaction (S34). Specifically, the history management unit 38 increments the usage count N indicated by the usage history H by 1. The usage count N indicated by the usage history H therefore increases by 1 each time the voice interaction device 100D executes a voice interaction. The processing of FIG. 16 ends after the usage history H is updated.

FIG. 17 is a flowchart of the response generation processing SD of the fourth embodiment, and FIGS. 18 and 19 are explanatory diagrams of the response generation processing SD. On starting the response generation processing SD, the response generation unit 36D variably sets the standby time W according to the usage history H stored in the storage device 22 (SD1 to SD3). Specifically, the response generation unit 36D first determines whether the usage count N indicated by the usage history H exceeds a predetermined threshold NTH (SD1). When the usage count N exceeds the threshold NTH (SD1: YES), the response generation unit 36D sets a predetermined base value w0 (for example, 150 ms) as the standby time W, as illustrated in FIG. 18 (SD2). When the usage count N falls below the threshold NTH (SD1: NO), the response generation unit 36D sets, as the standby time W, the value (w0 + δw) obtained by adding a predetermined adjustment value (offset) δw to the base value w0, as illustrated in FIG. 19 (SD3). The adjustment value δw is set to a predetermined positive number. In the above description the standby time W is controlled in a binary manner according to whether the usage count N exceeds the threshold NTH, but the standby time W may also be varied in multiple steps or continuously according to the usage count N.

The response generation unit 36D waits until the standby time W, set according to the usage history H by the above processing, has elapsed from the end point tB of the utterance voice Vx (SD4: NO). When the response reproduction point ty arrives with the elapse of the standby time W (SD4: YES), the response generation unit 36D supplies the response signal Y stored in the storage device 22 to the playback device 26, thereby causing the response voice Vy to be reproduced (SD5). As understood from the above description, the response generation unit 36D of the fourth embodiment causes the playback device 26 to reproduce a response voice Vy whose prosody (the standby time W in the fourth embodiment) corresponds to the usage history H of the voice interaction device 100D. Specifically, when the usage count N indicated by the usage history H is large, the response voice Vy is reproduced after a standby time W equal to the base value w0 has elapsed, and when the usage count N is small, the response voice Vy is reproduced after a standby time W equal to the base value w0 plus the adjustment value δw has elapsed. That is, the standby time W is shorter when the usage count N is large.
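The binary control of the standby time W in steps SD1 to SD3, together with the update of the usage count N in step S34, can be sketched as follows. The base value of 150 ms comes from the description. The function names, the threshold NTH = 10, and the adjustment value δw = 100 ms are assumptions for illustration. Routing the case N = NTH to the longer standby time follows from the test "N exceeds NTH" failing, which is an interpretation rather than something the description states explicitly.

```python
def standby_time_ms(n, n_th=10, w0_ms=150.0, delta_w_ms=100.0):
    """Return the standby time W for a usage count N (SD1 to SD3)."""
    if delta_w_ms <= 0:
        raise ValueError("the adjustment value must be positive")
    if n > n_th:                   # SD1: YES -> base value w0 (SD2)
        return w0_ms
    return w0_ms + delta_w_ms      # SD1: NO  -> w0 + delta_w (SD3)


def end_session(n):
    """Update the usage history when a voice interaction ends (S34)."""
    return n + 1


n = 0                              # initial usage history (N = 0)
print(standby_time_ms(n))          # 250.0
for _ in range(11):                # eleven completed voice interactions
    n = end_session(n)
print(n, standby_time_ms(n))       # 11 150.0
```

A multi-step or continuous variant, which the description allows, would replace the single comparison with a mapping from N to W that decreases as N grows.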

As described above, in the fourth embodiment, a response voice Vy whose prosody (standby time W) corresponds to the usage history H of voice interaction by the voice interaction device 100D is reproduced. It is therefore possible to realize a natural voice interaction that simulates the tendency of real conversation in which the prosody of utterances changes over time as conversation with a particular partner is repeated. In the fourth embodiment in particular, the standby time W, which is the interval between the utterance voice Vx and the response voice Vy, is controlled according to the usage history H. A natural voice interaction is therefore realized that simulates the tendency of real conversation in which the interval between an utterance and its response is long immediately after two people begin talking for the first time and becomes shorter as conversation with that partner is repeated.

<Modifications>
The voice interaction device 100 (100A, 100B, 100C, 100D) exemplified in each of the above embodiments can be modified in various ways. Specific modifications are exemplified below. Two or more aspects arbitrarily selected from the following examples may be combined as appropriate to the extent that they do not contradict one another.

(1) Any two or more configurations selected from the first to fourth embodiments may be combined. Specifically, the configuration of the first embodiment, which controls the prosody of the response voice Vy according to the prosody (for example, the pitch P) of the utterance voice Vx, can likewise be applied to the second to fourth embodiments. For example, in the second embodiment, the prosody of the response signal Y selected in step SB3 or step SB4 of FIG. 9 may be controlled according to the prosody (for example, the pitch P) of the utterance voice Vx before being reproduced from the playback device 26. Similarly, in the third embodiment, a configuration may be adopted in which the prosody of the response signal Y selected in step SC2 or step SC3 of FIG. 12 is controlled according to the prosody of the utterance voice Vx, and in the fourth embodiment, a configuration may be adopted in which the prosody of the response signal Y acquired from the storage device 22 in step SD5 of FIG. 17 is controlled according to the prosody of the utterance voice Vx. In a configuration in which the first embodiment is applied to the second to fourth embodiments, as in the first embodiment, the pitch of the response signal Y is adjusted so that, for example, the pitch at the start point of a specific mora (typically the last mora) of the response voice Vy matches the lowest pitch Pmin within the tail section E of the utterance voice Vx.

The configuration of the third embodiment, which selectively reproduces the first response voice Vy1 (an ask-back to the utterance voice Vx) and the second response voice Vy2 (a response other than an ask-back), may also be applied to the embodiments other than the third embodiment. Likewise, the configuration of the fourth embodiment, which controls the prosody (for example, the standby time W) of the response voice Vy according to the usage history H of voice interaction, may be applied to the first to third embodiments.

(2) The various variables relating to voice interaction in each of the above embodiments are variably set according to, for example, an instruction from the user U. For example, a configuration may be adopted in which the playback volume of the response voice Vy is controlled according to an instruction from the user U, or in which the type of response voice Vy actually reproduced from the playback device 26 is selected, according to an instruction from the user U, from among multiple types of response voice Vy that differ in the speaker's gender or voice quality (a gentle voice, a stern voice). In the first to third embodiments, the length of the standby time W from the end point tB of the utterance voice Vx to the response reproduction point ty of the response voice Vy may also be set according to an instruction from the user U.

(3) In the modification of the third embodiment, the reproduction frequency of the first response voice Vy1, which asks back about the uttered voice Vx, is set variably in accordance with an instruction from the user U. The reproduction frequency of the first response voice Vy1 may, however, also be controlled according to factors other than an instruction from the user U. Specifically, the response generation unit 36D of the third embodiment may control the reproduction frequency of the first response voice Vy1 according to the usage history H exemplified in the fourth embodiment. In a real dialogue between humans, for example, the more a speaker repeats dialogues with a specific partner, the better the speaker grasps the characteristics of that partner's speech (for example, verbal habits and tone), and the frequency of asking back about the partner's utterances is therefore expected to fall. In view of this tendency, a configuration that lowers the reproduction frequency of the first response voice Vy1 as the number of uses N indicated by the usage history H increases is preferable.
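The relationship described in (3) can be sketched as a selection probability that shrinks as the number of uses N grows. The exponential decay and the parameters `base`, `floor`, and `half_life` are illustrative assumptions only; the description merely states that the frequency should decrease as N increases and does not prescribe a formula.

```python
import random


def ask_back_probability(use_count, base=0.3, floor=0.05, half_life=20):
    """Assumed probability of choosing the ask-back voice Vy1 for N = use_count.

    Starts at `base` for N = 0 and decays toward `floor` (hypothetical values).
    """
    if use_count < 0:
        raise ValueError("use_count must be non-negative")
    return floor + (base - floor) * 0.5 ** (use_count / half_life)


def choose_response(use_count, rng=random):
    """Return "Vy1" (ask back) or "Vy2" (any other response)."""
    return "Vy1" if rng.random() < ask_back_probability(use_count) else "Vy2"


# Example: the ask-back probability for a new user and for a frequent user.
p_new = ask_back_probability(0)
p_frequent = ask_back_probability(200)
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible for testing.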

(4) In the fourth embodiment, the number of uses N of the voice dialogue is given as an example of the usage history H, but the usage history H is not limited to the number of uses N. For example, any of the following may be applied to the control of the standby time W as the usage history H: a count in which each reproduction of the response voice Vy within a voice dialogue is counted as one use; the frequency of use of the voice dialogue (the number of uses per unit time); the period of use of the voice dialogue (for example, the time elapsed since the voice interaction device 100 was first used); or the time elapsed since the voice interaction device 100 was last used.
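The alternative forms of the usage history H listed in (4) can be sketched as values derived from a small record. The field names, the threshold rule, the default times, and the direction of the change (a shorter standby time W once the measure exceeds the threshold) are assumptions for illustration and are not stated in this passage.

```python
from dataclasses import dataclass

DAY_S = 86400.0  # seconds per day


@dataclass
class UsageHistory:
    use_count: int       # number of uses N
    first_used_s: float  # time of first use, in seconds
    last_used_s: float   # time of most recent use, in seconds

    def usage_period_days(self, now_s: float) -> float:
        """Period of use: time elapsed since the first use."""
        return (now_s - self.first_used_s) / DAY_S

    def idle_days(self, now_s: float) -> float:
        """Time elapsed since the last use."""
        return (now_s - self.last_used_s) / DAY_S

    def uses_per_day(self, now_s: float) -> float:
        """Frequency of use: number of uses per unit time (here, per day)."""
        period = self.usage_period_days(now_s)
        return self.use_count / period if period > 0 else 0.0


def standby_time(measure: float, threshold: float,
                 base_s: float = 0.15, extra_s: float = 0.10) -> float:
    """Assumed rule: use base_s above the threshold, otherwise base_s + extra_s."""
    return base_s if measure > threshold else base_s + extra_s


history = UsageHistory(use_count=30, first_used_s=0.0, last_used_s=9 * DAY_S)
now = 10 * DAY_S
w_by_count = standby_time(history.use_count, threshold=20)
w_by_frequency = standby_time(history.uses_per_day(now), threshold=5.0)
```

For a measure such as the time elapsed since the last use, a caller would presumably want the opposite comparison; that choice is left open in this sketch.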

(5) In the first embodiment, the response signal Y is generated from the audio signal Z stored in advance in the storage device 22 and is then reproduced. In the second to fourth embodiments, the response signal Y stored in advance in the storage device 22 is reproduced. The response signal Y representing the response voice Vy of specific utterance content may instead be synthesized, for example, by a known speech synthesis technique. Unit-concatenation speech synthesis and speech synthesis using a statistical model such as a hidden Markov model are suitable for synthesizing the response signal Y. The uttered voice Vx and the response voice Vy are also not limited to human vocal sounds; for example, an animal's cry may serve as the uttered voice Vx or the response voice Vy.

(6) In each of the above embodiments, the voice interaction device 100 includes the voice input device 24 and the playback device 26, but the voice input device 24 and the playback device 26 may instead be installed in a device separate from the voice interaction device 100 (a voice input/output device). The voice interaction device 100 is realized, for example, by a terminal device such as a mobile phone or a smartphone, and the voice input/output device is realized, for example, by an electronic device such as an animal-shaped toy or a robot. The voice interaction device 100 and the voice input/output device can communicate wirelessly or by wire. That is, the utterance signal X generated by the voice input device 24 of the voice input/output device is transmitted wirelessly or by wire to the voice interaction device 100, and the response signal Y generated by the voice interaction device 100 is transmitted wirelessly or by wire to the playback device 26 of the voice input/output device.

(7) In each of the above embodiments, the voice interaction device 100 is realized by an information processing device such as a mobile phone or a personal computer, but some or all of the functions of the voice interaction device 100 may be realized by a server device (a so-called cloud server). Specifically, the voice interaction device 100 is realized by a server device that communicates with a terminal device via a communication network such as a mobile communication network or the Internet. For example, the voice interaction device 100 receives from the terminal device the utterance signal X generated by the voice input device 24 of that terminal device, and generates the response signal Y from the utterance signal X using the configuration of any of the above embodiments. The voice interaction device 100 then transmits the response signal Y generated from the utterance signal X to the terminal device and causes the playback device 26 of the terminal device to reproduce the response voice Vy. The voice interaction device 100 is realized by a single device or by a set of devices (that is, a server system). Some of the functions of the voice interaction device 100 according to each of the above embodiments (for example, at least some of the voice acquisition unit 32, the voice analysis units 34A and 34C, the response generation units 36A, 36B, 36C, and 36D, and the history management unit 38) may also be realized by the server device, with the remaining functions realized by the terminal device. How the functions of the voice interaction device 100 are divided between the server device and the terminal device is arbitrary.

(8) In each of the above embodiments, a response voice Vy of specific utterance content (for example, a backchannel response such as "un" ("uh-huh")) is reproduced in response to the uttered voice Vx, but the utterance content of the response voice Vy is not limited to this example. For example, the utterance content of the uttered voice Vx may be analyzed by speech recognition and morphological analysis of the utterance signal X, and a response voice Vy whose content suits that utterance content may be selected from a plurality of candidates and reproduced by the playback device 26. In a configuration that performs neither speech recognition nor morphological analysis (for example, the first to fourth embodiments), a response voice Vy with utterance content prepared in advance is reproduced regardless of the content of the uttered voice Vx. At first glance one might therefore suppose that no natural dialogue can be established. In practice, however, because the prosody of the response voice Vy is controlled in varied ways as illustrated in the above embodiments, the user U can perceive a sensation like that of a natural dialogue between humans. A configuration that performs neither speech recognition nor morphological analysis also has the advantage of reducing or eliminating the processing delay and processing load that these processes cause.

(9) The voice interaction device 100 (100A, 100B, 100C, 100D) exemplified in each of the above embodiments may also be used to evaluate actual dialogue between humans. For example, the prosody of a response voice observed in an actual dialogue between humans (hereinafter the "observed voice") is compared with the prosody of the response voice Vy generated as described in the above embodiments. The observed voice is evaluated as appropriate when the two prosodies are similar and as inappropriate when they diverge. A device that performs this evaluation (a dialogue evaluation device) may also be used for training in dialogue between humans.
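The evaluation described in (9) can be sketched as a comparison of prosodic features against per-feature tolerances. The feature names (`standby_s`, `pitch_hz`), the tolerance values, and the rule that every feature must fall within its tolerance are assumptions for illustration; the description does not specify how similarity is measured.

```python
def evaluate_observed(observed: dict, reference: dict, tolerances: dict) -> str:
    """Compare the observed voice's prosody with that of the generated voice Vy.

    Returns "appropriate" if every feature listed in `tolerances` differs from
    the reference by no more than its tolerance, otherwise "inappropriate".
    """
    for feature, tol in tolerances.items():
        if abs(observed[feature] - reference[feature]) > tol:
            return "inappropriate"
    return "appropriate"


# Hypothetical prosody of the response voice Vy produced by the device.
reference = {"standby_s": 0.15, "pitch_hz": 165.0}
# Hypothetical tolerances for each feature.
tolerances = {"standby_s": 0.10, "pitch_hz": 20.0}

close = evaluate_observed({"standby_s": 0.20, "pitch_hz": 170.0}, reference, tolerances)
far = evaluate_observed({"standby_s": 0.90, "pitch_hz": 170.0}, reference, tolerances)
```

A training tool built on this idea could report which feature exceeded its tolerance rather than only the overall verdict.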

(10) As described above, the voice interaction device 100 (100A, 100B, 100C, 100D) exemplified in each of the above embodiments can be realized through the cooperation of the control device 20 and a program for voice dialogue. The program for voice dialogue may be provided in a form stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium. An optical recording medium (optical disc) such as a CD-ROM is a good example, but the recording medium may be of any known type, such as a semiconductor recording medium or a magnetic recording medium. The program may also be delivered to a computer by distribution over a communication network. The present invention may also be realized as a method of operating the voice interaction device 100 exemplified in each of the above embodiments (a voice interaction method). The computer that carries out the voice interaction method (the voice interaction device 100) is, for example, a single computer or a system composed of a plurality of computers.

100 (100A, 100B, 100C, 100D): voice interaction device; 20: control device; 22: storage device; 24: voice input device; 242: sound collection device; 244: A/D converter; 26: playback device; 262: D/A converter; 264: sound emitting device; 32: voice acquisition unit; 34A, 34C: voice analysis unit; 36A, 36B, 36C, 36D: response generation unit; 38: history management unit.

Claims

A voice interaction method for executing a voice dialogue in which a response voice to an uttered voice is reproduced, the method being implemented by a computer and comprising causing a playback device to reproduce a response voice having a prosody corresponding to the number of past voice dialogues, the frequency of use of the voice dialogue, the period of use of the voice dialogue, or the elapsed time since the last voice dialogue.

The voice interaction method according to claim 1, wherein the prosody of the response voice is a standby time that is the interval between the uttered voice and the response voice.

A voice interaction device for executing a voice dialogue in which a response voice to an uttered voice is reproduced, the device comprising a response generation unit that causes a playback device to reproduce a response voice having a prosody corresponding to the number of past voice dialogues, the frequency of use of the voice dialogue, the period of use of the voice dialogue, or the elapsed time since the last voice dialogue.

A program for a voice dialogue in which a response voice to an uttered voice is reproduced, the program causing a computer to function as a response generation unit that causes a playback device to reproduce a response voice having a prosody corresponding to the number of past voice dialogues, the frequency of use of the voice dialogue, the period of use of the voice dialogue, or the elapsed time since the last voice dialogue.