JP2018159777A

JP2018159777A - Voice reproduction device, and voice reproduction program

Info

Publication number: JP2018159777A
Application number: JP2017056326A
Authority: JP
Inventors: Hiroshi Kayama (嘉山 啓); Yuji Hisaminato (久湊 裕司)
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2017-03-22
Filing date: 2017-03-22
Publication date: 2018-10-11
Anticipated expiration: 2037-03-22
Also published as: JP6922306B2

Abstract

PROBLEM TO BE SOLVED: To enable rich expression of intention while suppressing an increase in the storage capacity required for storing voice data of answers. SOLUTION: The voice reproduction device includes an answer acquisition unit that acquires voice data of an answer to a question represented by an input voice signal, an intention designation unit that designates an intention to be given to the answer, a prosody control data acquisition unit that acquires prosody control data representing a temporal change in prosody according to the intention designated by the intention designation unit, and an answer reproduction unit that reproduces the voice of the answer in which the temporal change in prosody based on the voice data is controlled according to the prosody control data. SELECTED DRAWING: Figure 1

Description

The present invention relates to a voice reproduction technique.

Application examples of voice reproduction technology include voice interaction between a person and a machine and voice interaction between machines. An example of the former is a voice dialogue system that, in response to a question spoken by a user, synthesizes and reproduces the voice of an answer to that question. An example of the latter is a case in which voice reproduction device B recognizes the voice of a question reproduced by voice reproduction device A according to a predetermined scenario and reproduces the voice of an answer; specific examples are plays and manzai (comic dialogue) in which all the characters are played by machines (voice reproduction devices). When synthesizing the voice of an answer to a spoken question, it is preferable to reproduce an answer voice that conveys an intention, so as to achieve a natural, human-like response to the user's spoken question. For example, in the technique disclosed in Patent Document 1, the pitch at the end of the answer is made different between an affirmative answer and a negative answer in order to express the intention conveyed by the answer.

Patent Document 1: Japanese Patent Application Laid-Open No. 2015-064480

However, a wide variety of intentions cannot be expressed merely by adjusting the pitch at the end of the answer, as in the technique disclosed in Patent Document 1. If voice data of each answer were prepared separately for each of the various intentions, rich expression of intention would be possible, but the storage capacity of the storage device that stores the voice data would increase.

The present invention has been made in view of the problem described above, and its object is to provide a technique that enables rich expression of intention while suppressing an increase in the storage capacity required to store the voice data of answers.

In order to achieve the above object, a voice reproduction device according to one aspect of the present invention comprises: an answer acquisition unit that acquires voice data of an answer to a question represented by an input voice signal; an intention designation unit that designates an intention to be given to the answer; a prosody control data acquisition unit that acquires prosody control data representing a temporal change in prosody according to the intention designated by the intention designation unit; and an answer reproduction unit that reproduces the voice of the answer in which the temporal change in prosody based on the voice data is controlled according to the prosody control data.

The temporal change in prosody means the temporal change of each prosodic component, such as pitch, speech rate, volume, and utterance timing. The speech rate is the number of phonemes uttered per unit time. When a person utters an answer to a question, the person adjusts the temporal change in prosody according to the intention to be conveyed by the answer, thereby expressing a variety of intentions such as "easygoing" or "cautious", or "anger" or "exasperation". According to this aspect, the voice of an answer can be reproduced with the temporal change in prosody finely controlled according to the intention to be conveyed, which enables rich expression of intention. Here, the prosody control data need only be an array of the amounts of change of the prosodic components (pitch, speech rate, volume, utterance timing, and so on) at each time, measured from the beginning of the answer; that is, sequence data. Its data amount is small compared with voice waveform data. Therefore, even if prosody control data for each intention is stored in advance in a storage device for every answer, less storage capacity is needed than in a mode in which voice data of each answer uttered with each different intention is stored. In other words, according to this aspect, rich expression of intention becomes possible while suppressing an increase in the storage capacity required to store the voice data of answers. Note that an answer is not limited to a specific reply to a question and also includes a backchannel response (interjection). In addition to replies to questions and backchannel responses, answers also include lines exchanged in plays and manzai, and, besides human voices, animal cries such as "wan" (bowwow) and "nyaa" (meow). That is, the answers and voices referred to here are concepts that include not only voices uttered by people but also animal cries.

In a more preferred aspect, the device has a database in which the prosody control data is stored for each intention, and the prosody control data acquisition unit acquires, from the database, the prosody control data corresponding to the intention designated by the intention designation unit. In another preferred aspect, when prosody control data corresponding to the designated intention is not stored in the database, the prosody control data acquisition unit obtains that prosody control data by interpolation using a plurality of prosody control data stored in the database. According to such an aspect, even richer expression of intention becomes possible while suppressing an increase in the storage capacity of the database. In yet another preferred aspect, the device has an intention identification unit that analyzes the input voice signal and identifies the intention given to the question, and the intention designation unit designates the intention to be given to the answer according to the intention identified by the intention identification unit. According to such an aspect, it becomes possible to reproduce the voice of an answer conveying an intention that corresponds to the intention conveyed by the question.

The aspects of the present invention can be conceived not only as a voice reproduction device but also as a program that causes a computer to function as that voice reproduction device.

FIG. 1 is a block diagram showing the configuration of a voice reproduction device according to an embodiment.
FIG. 2 is a diagram showing an example of the temporal change in prosody in an answer that conveys an intention.
FIG. 3 is a flowchart showing the operation of the voice reproduction device.

Embodiments of the present invention will be described below with reference to the drawings.
(A: Configuration)
FIG. 1 is a diagram showing the configuration of a voice reproduction device 10 according to an embodiment of the present invention.
The voice reproduction device 10 is, for example, a device incorporated in a stuffed toy. When the user asks the stuffed toy a question, the voice reproduction device 10 synthesizes and reproduces the voice of an answer conveying the intention designated by the user. When a person utters an answer to a question, the person adjusts the temporal change in prosody according to the intention to be conveyed by the answer, thereby expressing a variety of intentions such as "easygoing" or "cautious", or "anger" or "exasperation". For example, FIG. 2 illustrates the time waveform TW of a reference voice "anosa" uttered without any particular intention, the prosody change pattern P1, relative to the reference voice, of the voice "anosa" uttered with an easygoing feeling, and the prosody change pattern P2, relative to the reference voice, of the voice "anosa" uttered with caution. In FIG. 2, the amount of change from the reference voice of each prosodic component ("pitch", "speech rate", and "volume") is represented by a position on the coordinate axis running from the centroid of a triangle to the corresponding vertex; the farther the position is from the centroid, the higher the pitch, the faster the speech rate, or the louder the volume compared with the reference voice.

The voice reproduction device 10 has a CPU (Central Processing Unit), a voice input unit 102, and a speaker 114, and the following functional blocks are constructed when the CPU executes a preinstalled application program. Specifically, a language analysis unit 104, an answer acquisition unit 106, an intention designation unit 108, a prosody control data acquisition unit 110, and an answer reproduction unit 112 are constructed in the voice reproduction device 10. Although not shown, the voice reproduction device 10 also has a display unit, an operation input unit, and the like, so that the user can check the state of the device, input various operations, and make various settings. The voice reproduction device 10 is not limited to a toy such as a stuffed toy and may be a so-called pet robot, a terminal device such as a mobile phone, a tablet personal computer, or the like.

Although details are omitted, the voice input unit 102 consists of a microphone that converts voice into an electrical signal and an A/D converter that converts the resulting voice signal into a digital signal. The language analysis unit 104 analyzes the semantic content of the question defined by the voice signal input from the voice input unit 102 and gives semantic content data indicating the analysis result (that is, the semantic content of the question) to the answer acquisition unit 106.

The answer library 124 is a database that stores in advance a plurality of pairs, each consisting of an identifier that uniquely indicates an answer to a user's question (hereinafter, answer identifier) and the voice data of that answer. The voice data is a recording of the voice of a person serving as a model, for example replies to questions and backchannel responses such as "yes" (hai), "no" (iie), "that's right" (sou), "uh-huh" (un), "hmm" (fuun), and "I see" (naruhodo). The voice data of an answer is in a format such as wav or mp3. Specific examples of answer identifiers include serial numbers and character strings representing the content of the answer, such as "yes" or "no". Note that the answer to a question is not limited to a reply or backchannel response; it may be a sentence that presents the information requested by the question, such as the answer "It is sunny." to the question "What is the weather today?".

The answer acquisition unit 106 selects, from the answer library 124, one answer to the question whose semantic content is represented by the semantic content data given from the language analysis unit 104, and reads out and acquires the answer identifier and voice data of the selected answer from the answer library 124. The answer acquisition unit 106 then outputs the acquired answer identifier to the prosody control data acquisition unit 110 and outputs the acquired voice data to the answer reproduction unit 112. The answer acquisition unit 106 of this embodiment has a table that stores, in association with each item of semantic content data that the language analysis unit 104 can output, the answer identifier of an answer suited to a question with that semantic content. The answer acquisition unit 106 refers to this table and selects one answer suited to the question whose semantic content is represented by the semantic content data.

This embodiment describes a mode in which an answer is selected according to the semantic content of the question, but an answer may instead be selected at random regardless of the semantic content. In that case, the language analysis unit 104 and the table are unnecessary. Specifically, the output signal of the voice input unit 102 is given to the answer acquisition unit 106, and the answer acquisition unit 106 is made to execute a process of reading an answer identifier and voice data at random from the answer library 124, triggered by reception of the voice signal from the voice input unit 102.
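The two selection modes described above (table lookup by semantic content, or random selection when language analysis is omitted) can be sketched as follows. This is only an illustrative sketch: the dictionary names, the semantic labels, and the placeholder voice data are hypothetical and do not come from the publication.

```python
import random

# Hypothetical answer library: answer identifier -> voice data (placeholder bytes).
ANSWER_LIBRARY = {
    "yes": b"<wav bytes for 'yes'>",
    "no": b"<wav bytes for 'no'>",
    "i_see": b"<wav bytes for 'I see'>",
}

# Hypothetical table: semantic content data -> identifier of a suitable answer.
SEMANTIC_TO_ANSWER = {
    "ask_confirmation": "yes",
    "ask_permission": "no",
    "statement": "i_see",
}


def select_answer(semantic_content=None, rng=random):
    """Return (answer_id, voice_data) for one answer.

    With semantic content, the table is consulted. Without it, an answer
    is picked at random, as in the variant that omits language analysis.
    """
    if semantic_content is not None:
        answer_id = SEMANTIC_TO_ANSWER[semantic_content]
    else:
        answer_id = rng.choice(sorted(ANSWER_LIBRARY))
    return answer_id, ANSWER_LIBRARY[answer_id]
```

In either mode the identifier would go to the prosody control data acquisition unit and the voice data to the answer reproduction unit.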

The prosody library 122 is a database that stores, in association with each of the answer identifiers stored in the answer library 124 and for each intention that can be conveyed by the answer indicated by that answer identifier, an identifier indicating the intention (hereinafter, intention identifier) and prosody control data that defines the temporal change in prosody of that answer according to that intention. Here, the prosody control data is an array of the amounts of change of prosodic components such as pitch, speech rate, and volume at each time (measured from the beginning of the answer), that is, sequence data. Specific examples of intention identifiers include character strings representing the content of the intention, such as "anger" or "exasperation".

The prosody control data can be created, for example, as follows. For the answer "anosa", the waveform data of the voice "anosa" uttered flatly without any particular intention is used as reference data. The waveform data of the voice "anosa" uttered with a particular intention such as "anger" or "exasperation" is compared with the reference data, the difference (offset) from the reference data at each time is calculated for each prosodic component such as pitch, speech rate, and volume, and the differences for each component are arranged in time order to form the prosody control data. In this case, the answer library 124 need only store, as the voice data of the answer, the voice data of the voice uttered flatly without any particular intention. Alternatively, the waveform data of a voice uttered with a particular intention may be used as the reference data, and prosody control data may be generated for voices uttered with other intentions; in that case, the voice data of the voice uttered with that particular intention is stored in the answer library 124 as the voice data of the answer.
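The offset calculation described above can be sketched as follows, assuming the prosodic components of both recordings have already been measured and aligned to the same time steps. The `ProsodyFrame` type, its units, and the function name are assumptions made for illustration; the publication does not specify how the components are measured or represented.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ProsodyFrame:
    """Prosodic measurements for one time step (units are assumptions)."""
    pitch_hz: float
    rate_phonemes_per_s: float
    volume_db: float


def make_control_data(reference: List[ProsodyFrame],
                      intended: List[ProsodyFrame]) -> Dict[str, List[float]]:
    """Per-component offsets of `intended` from `reference`, in time order."""
    if len(reference) != len(intended):
        raise ValueError("both tracks must be aligned to the same time steps")
    pairs = list(zip(reference, intended))
    return {
        "pitch": [i.pitch_hz - r.pitch_hz for r, i in pairs],
        "rate": [i.rate_phonemes_per_s - r.rate_phonemes_per_s for r, i in pairs],
        "volume": [i.volume_db - r.volume_db for r, i in pairs],
    }
```

Only these short offset sequences, not a second waveform, would then need to be stored for each intention.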

The intention designation unit 108 is a device that lets the user designate, as the intention to be conveyed by the answer to the question, one of the intentions whose intention identifiers and prosody control data are stored in the prosody library 122. For example, the intention designation unit 108 displays a list of intention identifiers on the display unit (not shown in FIG. 1) and gives the intention identifier designated by an operation on the operation input unit (not shown in FIG. 1) to the prosody control data acquisition unit 110.

The prosody control data acquisition unit 110 reads out and acquires, from the prosody library 122, the prosody control data corresponding to the intention identifier given from the intention designation unit 108 and the answer identifier given from the answer acquisition unit 106. The prosody control data acquisition unit 110 gives the prosody control data read from the prosody library 122 to the answer reproduction unit 112.
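One simple way to picture the prosody library and this lookup is a mapping keyed by the pair of answer identifier and intention identifier. The sketch below is an assumption about layout only; the key format, component names, and numeric values are hypothetical placeholders.

```python
# Hypothetical prosody library keyed by (answer identifier, intention identifier).
# Each value holds one offset sequence per prosodic component, in time order.
PROSODY_LIBRARY = {
    ("anosa", "easygoing"): {
        "pitch": [0.0, 2.0, 1.0],
        "rate": [0.5, 0.5, 0.0],
        "volume": [1.0, 1.0, 0.0],
    },
    ("anosa", "cautious"): {
        "pitch": [0.0, -1.0, -2.0],
        "rate": [-1.0, -1.0, -0.5],
        "volume": [-2.0, -2.0, -1.0],
    },
}


def get_control_data(answer_id, intention_id):
    """Return the prosody control data stored for this answer and intention.

    Raises KeyError when the combination is not in the library.
    """
    return PROSODY_LIBRARY[(answer_id, intention_id)]
```

A caller could catch the `KeyError` to fall back to interpolation between stored entries, as the preferred aspect described earlier suggests.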

The answer reproduction unit 112 reproduces (synthesizes) the voice represented by the voice data given from the answer acquisition unit 106 while controlling the temporal change of each prosodic component (pitch, speech rate, and volume) separately and independently according to the prosody control data given from the prosody control data acquisition unit 110.
The above is the configuration of the voice reproduction device 10.
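As a partial illustration of how the answer reproduction unit might apply control data, the sketch below handles only the volume component, treating each offset as a gain in decibels applied to one fixed-length frame of samples. The dB interpretation, the frame length parameter, and the function name are assumptions; controlling pitch and speech rate would require pitch-shifting and time-stretching techniques that are not shown here.

```python
def apply_volume_offsets(samples, offsets_db, frame_len):
    """Scale each frame of `samples` by the gain given in dB for that frame.

    `offsets_db[k]` applies to samples k*frame_len .. (k+1)*frame_len - 1.
    Samples beyond the last offset reuse the last offset.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    if not offsets_db:
        return list(samples)
    out = []
    for n, sample in enumerate(samples):
        frame = min(n // frame_len, len(offsets_db) - 1)
        gain = 10.0 ** (offsets_db[frame] / 20.0)  # dB to linear amplitude
        out.append(sample * gain)
    return out
```

Because each component has its own sequence, the same framing could in principle drive independent pitch and rate processing alongside this gain stage.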

(B: Operation)
Next, the operation of the voice reproduction device 10 will be described.
FIG. 3 is a flowchart showing the processing operation of the voice reproduction device 10. In this embodiment, the processing shown in this flowchart is started when the user asks a question by voice to the stuffed toy to which the voice reproduction device 10 is applied.

When the user asks a question by voice, the voice of the question is converted into a voice signal by the voice input unit 102, and the voice signal is supplied from the voice input unit 102 to the language analysis unit 104. In step Sa11, the language analysis unit 104 accumulates the voice signal supplied from the voice input unit 102 in a memory or the like and determines whether the spoken question has ended. Whether the question has ended is determined by whether the state in which the volume of the voice signal is below a predetermined threshold has continued for a predetermined time. If the question has not ended (if the determination result in step Sa11 is "No"), the language analysis unit 104 executes the processing of step Sa11 again and waits for the utterance of the question to end.
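The end-of-question test in step Sa11 (volume below a threshold for a predetermined time) can be sketched as a check over the most recent volume measurements. The sketch assumes the volume is measured once per fixed-length frame, so that the "predetermined time" becomes a frame count; the function and parameter names are hypothetical.

```python
def question_ended(frame_volumes, threshold, min_quiet_frames):
    """Return True once the volume has stayed below `threshold` for the
    last `min_quiet_frames` consecutive frames of the accumulated signal."""
    if min_quiet_frames <= 0:
        raise ValueError("min_quiet_frames must be positive")
    if len(frame_volumes) < min_quiet_frames:
        return False
    return all(v < threshold for v in frame_volumes[-min_quiet_frames:])
```

A caller would append each new frame's volume to the buffer and repeat the check until it returns True, which corresponds to looping on step Sa11.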

In step Sa12, which is executed when the determination result in step Sa11 is "Yes", the language analysis unit 104 analyzes the semantic content of the question defined by the voice signal accumulated in the memory or the like and gives semantic content data indicating the analysis result to the answer acquisition unit 106. In this embodiment, the determination of whether the utterance of the question has ended and the semantic analysis of the question are executed sequentially, but the two may instead be executed incrementally in parallel. Doing so reduces the delay from the end of the question utterance to the completion of the semantic analysis. In step Sa13, which follows step Sa12, the answer acquisition unit 106 acquires from the answer library 124 the answer identifier and voice data of an answer to the question whose semantic content is represented by the semantic content data, gives the former to the prosody control data acquisition unit 110, and gives the latter to the answer reproduction unit 112.

In step Sa14, which follows step Sa13, the answer reproduction unit 112 determines whether an answer is currently being reproduced. The answer reproduction unit 112 determines in step Sa14 that an answer is being reproduced when, for example, the user asks the next question while an answer to an earlier question is still being reproduced. If an answer is being reproduced (if the determination result in step Sa14 is "Yes"), the answer reproduction unit 112 repeats the processing of step Sa14 until the determination result becomes "No". When the determination result in step Sa14 is "No", the intention designation unit 108 has the user designate the intention to be conveyed by the answer and outputs an intention identifier indicating the designated intention to the prosody control data acquisition unit 110 (step Sa15). In step Sa16, which follows step Sa15, the prosody control data acquisition unit 110 acquires from the prosody library 122 the prosody control data corresponding to the intention identifier given from the intention designation unit 108 and the answer identifier given from the answer acquisition unit 106, passes the acquired prosody control data to the answer reproduction unit 112, and instructs it to reproduce the voice data of the answer selected by the answer acquisition unit 106. In accordance with this instruction, the answer reproduction unit 112 reproduces the voice represented by the voice data while controlling the temporal change of each prosodic component (pitch, speech rate, and volume) separately and independently according to the prosody control data given from the prosody control data acquisition unit 110 (step Sa17).

For example, when the answer "anosa" is selected in step Sa13 and the intention "easygoing" is designated in step Sa15, a voice whose prosody changes over time according to the change pattern P1 in FIG. 2 is synthesized in step Sa17, and a user who hears this voice senses an easygoing feeling from the temporal change in its prosody. On the other hand, when the answer "anosa" is selected in step Sa13 and the intention "cautious" is designated in step Sa15, a voice whose prosody changes over time according to the change pattern P2 in FIG. 2 is synthesized in step Sa17, and a user who hears this voice senses caution from the temporal change in its prosody.

As described above, according to this embodiment, an answer voice can be reproduced with the temporal change in prosody finely controlled according to the intention to be conveyed by the answer, which enables rich expression of intention. The voice reproduction device 10 of this embodiment needs to store the prosody library 122 in a storage device or the like in addition to the answer library 124. However, because the prosody control data stored in the prosody library 122 is sequence data, its total data amount is smaller than the total amount of voice data that would be needed to hold, for every answer, voices uttered with each of the various intentions. The increase in the storage capacity of the storage device can therefore be suppressed compared with a mode in which all of that voice data is stored. In other words, according to this embodiment, rich expression of intention becomes possible while suppressing an increase in the storage capacity of the storage device that stores the data required to synthesize the answer voice. Although the above embodiment describes controlling the temporal changes of pitch, speech rate, and volume, the utterance timing may also be controlled.
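The storage argument can be made concrete with a rough calculation. All figures below are illustrative assumptions and do not come from the publication: each answer is taken to be one second of 16 kHz, 16-bit mono PCM, and each prosody control data set to hold three components at 100 values per second with 4 bytes per value.

```python
def storage_bytes(num_answers, num_intentions, wave_bytes, control_bytes):
    """Compare two storage layouts.

    Returns (bytes if a waveform is stored per answer and intention,
             bytes if one waveform per answer plus control data per intention).
    """
    per_intention_waveforms = num_answers * num_intentions * wave_bytes
    with_control_data = num_answers * (wave_bytes + num_intentions * control_bytes)
    return per_intention_waveforms, with_control_data


# Assumed sizes for a one-second answer (illustrative only).
WAVE_BYTES = 16000 * 2 * 1      # 16 kHz x 2 bytes per sample x 1 s
CONTROL_BYTES = 3 * 100 * 4     # 3 components x 100 values x 4 bytes

waveforms_only, with_control = storage_bytes(10, 8, WAVE_BYTES, CONTROL_BYTES)
```

Under these assumptions, ten answers with eight intentions each need 2,560,000 bytes as separate waveforms but 416,000 bytes as one waveform per answer plus control data. The actual ratio depends on the audio format and on how finely the control data is sampled.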

(C: Modifications and applications)
The present invention is not limited to the embodiment described above, and various applications and modifications such as those described below are possible. One or more arbitrarily selected modes of the applications and modifications described below can also be combined as appropriate.

<Voice input unit>
In the above embodiment, the voice input unit 102 is configured to take in the user's voice (utterance) with a microphone and convert it into a voice signal, but it is not limited to this configuration. That is, the voice input unit 102 need only be configured to input, or receive, an utterance as a voice signal in some form. Specifically, the voice input unit 102 is a concept that includes a configuration that inputs a voice signal processed by another processing unit or a voice signal supplied (or transferred) from another device, and even an input interface circuit or the like that is built into an LSI and simply receives a voice signal and passes it to a subsequent stage. In addition, a question put to the voice reproduction device 10 is not limited to a spoken question and may be input as text data representing a sentence in which the question is written out. In that case, a text data input unit may be provided in the voice reproduction device 10 in place of the voice input unit 102.

<Unit of prosody control>
In the above embodiment, prosody control data is prepared per item of answer voice data, and the temporal change in prosody of the answer voice is controlled per item of voice data. However, prosody control data may instead be prepared per phoneme constituting the answer voice, per waveform sample used when sampling the answer voice to generate the voice data, or per frame when the answer voice is divided into frames of a predetermined duration, and the temporal change in prosody may be controlled in those units.

<Variations of intention designation>
In the above embodiment, the user designates only one intention to be conveyed in the answer to a question, but the user may designate a plurality of intentions, for example "exasperated" and "lighthearted". In this case, the intention designating unit 108 outputs intention identifiers indicating each of the plurality of intentions designated by the user to the prosody control data acquisition unit 110. A single piece of prosody control data that directly corresponds to such a plurality of user-designated intentions is not stored in the prosody library 122. Therefore, the prosody control data acquisition unit 110 may acquire the prosody control data corresponding to each of these intention identifiers and generate one piece of prosody control data by interpolation using them. Specifically, the prosody control data acquisition unit 110 generates new prosody control data by weighted addition of the pieces of prosody control data acquired in this manner, for example with weights of 1:1, and outputs the new prosody control data to the answer reproduction unit 112 so that the answer voice is synthesized.

According to this aspect, it is considered possible to synthesize an answer voice corresponding to an intention intermediate between the plurality of intentions designated by the user. In general, weighted addition of the waveform data of the utterance "ano ne" (あのね) spoken in an exasperated manner and the waveform data of "ano ne" spoken in a lighthearted manner does not yield waveform data of a voice that sounds "exasperated yet lighthearted". Prosody control data, however, is sequence data, so weighted addition can generate prosody control data representing an intermediate intention. Similarly, flat utterance with no particular intention can itself be regarded as one kind of intention, so designating "flat" and "cautious" can generate prosody control data corresponding to an intention intermediate between the two, such as "somewhat cautious".
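The weighted addition described above can be sketched as follows. The representation of prosody control data as a dictionary of equal-length numeric sequences, the component names `pitch` and `volume`, and all numeric values are assumptions chosen for illustration; the description states only that the data is sequence data.

```python
def blend_prosody(controls, weights):
    """Weighted addition of prosody control sequences.

    controls: list of dicts mapping a prosodic component name
              (e.g. "pitch", "volume") to an equal-length list of floats.
    weights:  one non-negative weight per entry of `controls`.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]  # a 1:1 ratio becomes 0.5 and 0.5
    blended = {}
    for name in controls[0]:
        length = len(controls[0][name])
        blended[name] = [
            sum(w * c[name][t] for w, c in zip(norm, controls))
            for t in range(length)
        ]
    return blended


# Hypothetical control data for two intentions.
exasperated = {"pitch": [0.0, -2.0, -4.0], "volume": [1.0, 0.75, 0.5]}
lighthearted = {"pitch": [0.0, 2.0, 6.0], "volume": [1.0, 1.0, 1.0]}
mixed = blend_prosody([exasperated, lighthearted], [1, 1])
```

Because the blend operates on control sequences rather than on waveforms, each time step of the result is simply the weighted mean of the corresponding control values.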

In an aspect in which the user designates a plurality of intentions, the user may also designate the weight of each intention in the weighted addition, and may further designate a weight for each prosodic component. This is because the importance of each prosodic component in expressing an intention may differ between, for example, "anger" and "exasperation". Alternatively, the user may designate an intermediate intention, such as "exasperated yet lighthearted", and that intermediate intention may be decomposed into a plurality of intentions so that the prosody control data acquisition unit 110 acquires the prosody control data corresponding to each of them. In short, any aspect suffices in which, when prosody control data corresponding to the intention designated by the intention designating unit 108 is not stored in the prosody library 122 (database), the prosody control data acquisition unit 110 obtains prosody control data corresponding to the designated intention by interpolation using a plurality of pieces of prosody control data stored in the database.
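A minimal sketch of this acquisition behavior, with a weight per prosodic component, might look like the following. The library contents, the `recipes` table that decomposes an unstored intention into stored ones, and the function name are all hypothetical; the description does not define how the decomposition is represented.

```python
# Hypothetical library: intention identifier -> prosody control data.
PROSODY_LIBRARY = {
    "exasperated": {"pitch": [0.0, -2.0, -4.0], "volume": [1.0, 0.75, 0.5]},
    "lighthearted": {"pitch": [0.0, 2.0, 4.0], "volume": [1.0, 1.0, 1.0]},
}

# Hypothetical decomposition: unstored intention ->
# {stored intention: {component: weight}}.
RECIPES = {
    "exasperated yet lighthearted": {
        "exasperated": {"pitch": 3, "volume": 1},
        "lighthearted": {"pitch": 1, "volume": 1},
    }
}


def acquire_prosody(intent, library, recipes):
    """Return stored control data, or interpolate it from stored entries."""
    if intent in library:
        return library[intent]
    recipe = recipes[intent]  # KeyError if no decomposition is known
    sources = list(recipe)
    result = {}
    for comp in library[sources[0]]:
        weights = [recipe[s][comp] for s in sources]
        total = sum(weights)
        length = len(library[sources[0]][comp])
        # Weighted mean per time step, using this component's own weights.
        result[comp] = [
            sum(w * library[s][comp][t] for w, s in zip(weights, sources)) / total
            for t in range(length)
        ]
    return result


data = acquire_prosody("exasperated yet lighthearted", PROSODY_LIBRARY, RECIPES)
```

Here pitch leans toward "exasperated" at 3:1 while volume is blended 1:1, which is one way to reflect that components can matter differently for different intentions.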

In the above embodiment, the user designates the intention to be conveyed in the answer each time a question is put to the voice reproduction device 10, but the user may instead designate the intention in advance so that answer voices conveying the same intention are always synthesized. When answers are synthesized according to a predetermined scenario, the intention may be designated as the scenario progresses. Specifically, scenario data in which the intention identifiers of the intentions to be conveyed in the answers are arranged in the order in which the answers are reproduced is input to the intention designating unit 108, and each time a question voice is input from the voice input unit 102, the intention designating unit 108 obtains the intention identifiers contained in the scenario data in their listed order and outputs them to the prosody control data acquisition unit 110.
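The scenario-driven designation described above amounts to stepping through an ordered list of intention identifiers, one per incoming question. The sketch below assumes the scenario data is a plain list of identifier strings and raises an error when the scenario is exhausted; both choices, along with the class and method names, are assumptions for illustration.

```python
class ScenarioIntentDesignator:
    """Hands out intention identifiers in scenario order, one per question."""

    def __init__(self, scenario):
        self._ids = list(scenario)  # identifiers in answer reproduction order
        self._pos = 0

    def on_question(self):
        """Called each time a question voice is input; returns the next identifier."""
        if self._pos >= len(self._ids):
            raise IndexError("scenario has no more intention identifiers")
        intent = self._ids[self._pos]
        self._pos += 1
        return intent


designator = ScenarioIntentDesignator(["lighthearted", "exasperated", "anger"])
first = designator.on_question()
second = designator.on_question()
```

A fixed, pre-designated intention is the special case in which every call returns the same identifier.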

In addition, for an answer character that personifies the voice reproduction device 10 or a stuffed toy in which the voice reproduction device 10 is embedded, a plurality of characteristics of personality, atmosphere, and speaking style (hereinafter, character characteristics), such as "always easygoing" or "never losing caution", may be prepared in advance, and each character characteristic may be associated with a corresponding intention identifier, so that the user designates the intention to be conveyed in the answer by designating the character characteristic of the answer character. Alternatively, a plurality of answer characters each having different character characteristics may be prepared, and the user may designate the intention by designating an answer character. When answers to questions are synthesized according to a scenario, a character characteristic may be defined for each character appearing in the scenario. Furthermore, a plurality of character characteristics may be associated with one answer character, and the scenario may define which character characteristic is used to synthesize the answer voice, thereby portraying that character's change over the years or the distinction between its true feelings and its public stance (honne and tatemae).

An intention specifying unit that analyzes the voice signal of a question and specifies the intention conveyed in that question may also be provided, and the intention designating unit 108 may designate the intention to be given to the answer according to the intention specified by the intention specifying unit. Specifically, for each question whose answer voice data is stored in the answer library 124, waveform data of that question uttered flatly, with no particular intention, is stored in advance in the intention specifying unit as reference data. The intention specifying unit converts the voice signal input from the voice input unit 102 into waveform data and compares it with the corresponding reference data, thereby generating prosody definition data (data in the same format as the prosody control data) representing the prosody of the voice represented by that waveform data. The intention specifying unit then reads from the prosody library 122 the intention identifier corresponding to the prosody control data that is identical to or approximates the prosody definition data, and passes it to the intention designating unit 108. The intention designating unit 108 outputs to the prosody control data acquisition unit 110 an intention identifier that has a predetermined specific relationship with the intention identifier received from the intention specifying unit. The specific relationship may be, for example, a relationship of opposite intentions, such as "cautious" in response to "lighthearted", or a relationship of identical intentions. When the identical-intention relationship is adopted, the intention designating unit 108 simply outputs the intention identifier received from the intention specifying unit to the prosody control data acquisition unit 110 as is.
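The last two steps of this flow, matching the prosody definition data against the library and then mapping the specified intention to the answer's intention, can be sketched as below. The Euclidean distance used to decide which stored data "approximates" the definition data, the `OPPOSITE` table, and the function names are assumptions; the description does not state a similarity measure. Generating the definition data from the reference waveform is outside the scope of this sketch.

```python
import math


def specify_intention(definition, library):
    """Return the identifier whose stored control data is nearest to `definition`."""

    def distance(a, b):
        # Euclidean distance over all components and time steps (an assumption).
        return math.sqrt(
            sum((x - y) ** 2 for name in a for x, y in zip(a[name], b[name]))
        )

    return min(library, key=lambda ident: distance(definition, library[ident]))


# Hypothetical table for the "opposite intention" relationship.
OPPOSITE = {"lighthearted": "cautious", "cautious": "lighthearted"}


def designate_answer_intention(question_intent, relation="same"):
    """Map the intention specified for the question to the answer's intention."""
    if relation == "same":
        return question_intent
    if relation == "opposite":
        return OPPOSITE[question_intent]
    raise ValueError(f"unknown relation: {relation}")


library = {
    "lighthearted": {"pitch": [0.0, 2.0, 4.0]},
    "cautious": {"pitch": [0.0, -1.0, -1.0]},
}
heard = specify_intention({"pitch": [0.0, 1.5, 3.5]}, library)
reply = designate_answer_intention(heard, relation="opposite")
```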

<Evaluating and scoring how intention is conveyed in a question>
In an aspect in which the intention specifying unit is provided, how well an intention is conveyed in a question may be evaluated and scored by comparing the intention specified by the intention specifying unit with the contents of the prosody library 122. Specifically, for a question that the user utters with an intention such as "anger" or "exasperation", the suitability of how the intention was conveyed is evaluated and scored according to whether the intention specifying unit identified the intention the user put into the question, or according to the result of comparing the prosody definition data that the intention specifying unit generated from the voice signal of the question with the prosody control data stored in the prosody library 122 for that intention. This aspect makes it possible to support practice in conveying intention in questions.
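One possible scoring rule for the comparison described above is sketched here: the score falls linearly from 100 to 0 as the mean absolute deviation between the user's prosody definition data and the stored prosody control data grows toward a tolerance. The linear rule, the 0 to 100 scale, and the `tolerance` parameter are assumptions for illustration; the description leaves the scoring method open.

```python
def score_intention(definition, target, tolerance):
    """Score from 0 to 100 for how closely `definition` follows `target`.

    definition: prosody definition data generated from the user's question.
    target:     stored prosody control data for the intended intention.
    tolerance:  mean absolute deviation at which the score reaches 0 (assumed).
    """
    diffs = [
        abs(x - y)
        for name in target
        for x, y in zip(definition[name], target[name])
    ]
    mean_dev = sum(diffs) / len(diffs)
    return round(100 * max(0.0, 1.0 - mean_dev / tolerance))


target = {"pitch": [0.0, 2.0, 4.0]}
attempt = {"pitch": [0.0, 1.0, 2.0]}
score = score_intention(attempt, target, tolerance=2.0)
```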

<Others>
In the above embodiment, the language analysis unit 104, the prosody library, and the answer library, which make up the configuration for acquiring an answer to an utterance, are provided in the voice reproduction device 10. However, considering that the processing load becomes heavy and the storage capacity is limited in terminal devices and the like, they may instead be provided on an external server. In addition, the above embodiment described an application of the present invention to a voice reproduction device that realizes spoken dialogue by synthesizing and reproducing answer voices to the user's spoken questions, but the present invention may also be applied to voice reproduction devices that play the individual roles in a play, manzai (comic dialogue), or the like.

DESCRIPTION OF SYMBOLS: 102 ... voice input unit, 104 ... language analysis unit, 106 ... answer acquisition unit, 108 ... intention designating unit, 110 ... prosody control data acquisition unit, 112 ... answer reproduction unit, 122 ... prosody library, 124 ... answer library.

Claims

1. A voice reproduction device comprising:
an answer acquisition unit that acquires voice data of an answer to a question represented by an input voice signal;
an intention designating unit that designates an intention to be given to the answer;
a prosody control data acquisition unit that acquires prosody control data representing a temporal change of prosody according to the intention designated by the intention designating unit; and
an answer reproduction unit that reproduces a voice of the answer in which the temporal change of the prosody based on the voice data is controlled according to the prosody control data.

2. The voice reproduction device according to claim 1, further comprising a database in which the prosody control data is stored for each intention,
wherein the prosody control data acquisition unit acquires the prosody control data corresponding to the intention designated by the intention designating unit from the database.

3. The voice reproduction device according to claim 2, wherein, when prosody control data corresponding to the intention designated by the intention designating unit is not stored in the database, the prosody control data acquisition unit obtains that prosody control data by interpolation using prosody control data stored in the database.

4. The voice reproduction device according to claim 1, further comprising an intention specifying unit that analyzes the input voice signal and specifies an intention given to the question,
wherein the intention designating unit designates the intention to be given to the answer according to the intention specified by the intention specifying unit.

5. A voice reproduction program that causes a computer to function as:
an answer acquisition unit that acquires voice data of an answer to a question represented by an input voice signal;
an intention designating unit that designates an intention to be given to the answer;
a prosody control data acquisition unit that acquires prosody control data representing a temporal change of prosody according to the intention designated by the intention designating unit; and
an answer reproduction unit that reproduces a voice of the answer in which the temporal change of the prosody based on the voice data is controlled according to the prosody control data.