JP6922306B2

JP6922306B2 - Audio playback device and audio playback program

Info

Publication number: JP6922306B2
Application number: JP2017056326A
Authority: JP
Inventors: 嘉山　啓; 久湊　裕司
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2017-03-22
Filing date: 2017-03-22
Publication date: 2021-08-18
Anticipated expiration: 2037-03-22
Also published as: JP2018159777A

Description

The present invention relates to audio reproduction technology.

One application of audio reproduction technology is voice interaction between a person and a machine, or between machines. An example of the former is a voice dialogue system that, in response to a question spoken by a user, synthesizes and reproduces the voice of an answer. An example of the latter is a case in which audio reproduction device B recognizes the voice of a question reproduced by audio reproduction device A according to a predetermined scenario and reproduces the voice of an answer; concrete examples include plays and manzai (two-person comedy) routines in which machines (audio reproduction devices) perform all of the roles. When the voice of an answer to a spoken question is synthesized, it is preferable to reproduce an answer voice that conveys an intention, so that the response to the user's question sounds natural and human. For example, in the technique disclosed in Patent Document 1, the pitch of the ending of the answer is made different between an affirmative answer and a negative answer in order to express the intention contained in the answer.

Patent Document 1: Japanese Unexamined Patent Publication No. 2015-064480

However, adjusting only the pitch of the ending of the answer, as in the technique disclosed in Patent Document 1, cannot express a wide variety of intentions. Preparing separate voice data of the answer for each of the various intentions would allow rich expression of intention, but it increases the storage capacity required of the storage device that holds the voice data.

The present invention has been made in view of the problem described above, and its object is to provide a technique that enables rich expression of intention while suppressing the increase in storage capacity required for storing the voice data of answers.

To achieve the above object, an audio reproduction device according to one aspect of the present invention comprises: an answer acquisition unit that acquires voice data of an answer to a question represented by an input voice signal; an intention designation unit that designates an intention to be given to the answer; a prosody control data acquisition unit that acquires prosody control data representing the temporal change of prosody corresponding to the intention designated by the intention designation unit; and an answer reproduction unit that reproduces the voice of the answer with the temporal change of prosody based on the voice data controlled according to the prosody control data.

The temporal change of prosody refers to the temporal change of each prosodic component, such as pitch, speaking rate, volume, and utterance timing. Speaking rate here means the number of phonemes uttered per unit time. When uttering an answer to a question, a person expresses a variety of intentions, such as "lightheartedness" or "caution," or "anger" or "exasperation," by adjusting the temporal change of prosody according to the intention put into the answer. According to this aspect, an answer voice whose prosody is finely controlled over time according to the intention put into the answer can be reproduced, which makes rich expression of intention possible. The prosody control data need only be sequence data, that is, an array of the amounts of change in each prosodic component (pitch, speaking rate, volume, utterance timing, and so on) at each time measured from the beginning of the answer, and its data volume is small compared with voice waveform data. Therefore, even if prosody control data for each intention is stored in the storage device in advance for each answer, less storage capacity is needed than in a mode in which voice data of each answer uttered with each different intention is stored. In other words, this aspect enables rich expression of intention while suppressing the increase in storage capacity required for storing the voice data of answers. Note that an answer is not limited to a specific answer to a question and also includes backchannel responses (aizuchi, that is, interjections). Answers also include lines exchanged in plays and manzai routines, and, besides human voices, animal sounds such as "wan" (bowwow) and "nyaa" (meow). That is, "answer" and "voice" here are concepts that cover not only voices uttered by people but also animal cries.

In a more preferred aspect, the device has a database in which the prosody control data is stored for each intention, and the prosody control data acquisition unit acquires from the database the prosody control data corresponding to the intention designated by the intention designation unit. In another preferred aspect, when the prosody control data corresponding to the intention designated by the intention designation unit is not stored in the database, the prosody control data acquisition unit obtains that prosody control data by interpolation using a plurality of prosody control data stored in the database. This aspect enables even richer expression of intention while suppressing the increase in the storage capacity of the database. In yet another preferred aspect, the device has an intention identification unit that analyzes the input voice signal and identifies the intention given to the question, and the intention designation unit designates the intention to be given to the answer according to the intention identified by the intention identification unit. This aspect makes it possible to reproduce an answer voice whose intention corresponds to the intention put into the question.

Aspects of the present invention can be conceived not only as an audio reproduction device but also as a program that causes a computer to function as that audio reproduction device.

FIG. 1 is a block diagram showing the configuration of the audio reproduction device according to the embodiment.
FIG. 2 is a diagram showing an example of the temporal change of prosody in an answer that conveys an intention.
FIG. 3 is a flowchart showing the operation of the audio reproduction device.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
(A: Configuration)
FIG. 1 is a diagram showing the configuration of an audio reproduction device 10 according to an embodiment of the present invention.
The audio reproduction device 10 is, for example, a device built into a stuffed toy. When the user asks the stuffed toy a question, the audio reproduction device 10 synthesizes and reproduces the voice of an answer that conveys the intention designated by the user. When uttering an answer to a question, a person expresses a variety of intentions, such as "lightheartedness" or "caution," or "anger" or "exasperation," by adjusting the temporal change of prosody according to the intention put into the answer. For example, FIG. 2 illustrates the time waveform TW of a reference voice, "anosa" (roughly, "hey, you know"), uttered without any specific intention; a prosody change pattern P1, relative to the reference voice, of "anosa" uttered lightheartedly; and a prosody change pattern P2, relative to the reference voice, of "anosa" uttered cautiously. In FIG. 2, the amount of change from the reference voice in each of the prosodic components "pitch," "speaking rate," and "volume" is represented by a position on a coordinate axis running from the centroid of a triangle to one of its vertices. The farther a point is from the centroid, the higher the pitch, the faster the speaking rate, or the louder the volume compared with the reference voice.

The audio reproduction device 10 has a CPU (Central Processing Unit), a voice input unit 102, and a speaker 114. When the CPU executes a pre-installed application program, the following functional blocks are constructed: a language analysis unit 104, an answer acquisition unit 106, an intention designation unit 108, a prosody control data acquisition unit 110, and an answer reproduction unit 112. Although not specifically illustrated, the audio reproduction device 10 also has a display unit, an operation input unit, and the like, so that the user can check the status of the device, input various operations, and make various settings. The audio reproduction device 10 is not limited to a toy such as a stuffed toy; it may be a so-called pet robot, a terminal device such as a mobile phone, a tablet-type personal computer, or the like.

Although details are omitted, the voice input unit 102 consists of a microphone that converts sound into an electrical signal and an A/D converter that converts the resulting voice signal into a digital signal. The language analysis unit 104 analyzes the meaning of the question defined by the voice signal input from the voice input unit 102 and gives meaning content data indicating the analysis result (that is, the meaning of the question) to the answer acquisition unit 106.

The answer library 124 is a database that stores in advance a plurality of pairs, each consisting of an identifier that uniquely indicates an answer to a user's question (hereinafter, an answer identifier) and the voice data of that answer. The voice data is a recording of the voice of a model speaker and consists of replies and backchannel responses to questions, for example "hai" (yes), "iie" (no), "sou" (that's right), "un" (yeah), "fuun" (hmm), and "naruhodo" (I see). The voice data of an answer is in a format such as wav or mp3. Specific examples of an answer identifier include a character string representing the content of the answer, such as "hai" or "iie," and a serial number. An answer to a question is not limited to a reply or a backchannel response; it may be a sentence that presents the information requested by the question, such as the answer "It is sunny." to the question "What is the weather today?"

The answer acquisition unit 106 selects from the answer library 124 one answer to the question whose meaning is represented by the meaning content data given by the language analysis unit 104, and reads the answer identifier and voice data of the selected answer from the answer library 124. The answer acquisition unit 106 then outputs the acquired answer identifier to the prosody control data acquisition unit 110 and the acquired voice data to the answer reproduction unit 112. The answer acquisition unit 106 of the present embodiment has a table that stores, in association with each item of meaning content data that the language analysis unit 104 can output, the answer identifier of an answer suitable for a question with that meaning. By referring to this table, the answer acquisition unit 106 selects one answer suitable for the question whose meaning is represented by the meaning content data.

The present embodiment describes a mode in which the answer is selected according to the meaning of the question, but the answer may instead be selected at random regardless of the meaning of the question. In that case, the language analysis unit 104 and the above table are unnecessary. Specifically, the output signal of the voice input unit 102 is given to the answer acquisition unit 106, and the answer acquisition unit 106 is made to read an answer identifier and voice data at random from the answer library 124 upon receiving the voice signal from the voice input unit 102.
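The two selection modes described above (table lookup by meaning, and random selection) can be sketched as follows. This is only an illustration: the meaning labels, answer identifiers, file names, and the function name `acquire_answer` are hypothetical, since the description does not specify the format of the table or of the answer library 124.

```python
import random
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for the answer library 124:
# answer identifier -> voice data (represented here by a file name).
ANSWER_LIBRARY: Dict[str, str] = {
    "hai": "hai.wav",
    "iie": "iie.wav",
    "naruhodo": "naruhodo.wav",
}

# Hypothetical table held by the answer acquisition unit 106:
# meaning content label -> identifier of a suitable answer.
MEANING_TO_ANSWER: Dict[str, str] = {
    "confirmation_request": "hai",
    "refusal_request": "iie",
    "statement": "naruhodo",
}


def acquire_answer(
    meaning: Optional[str], rng: Optional[random.Random] = None
) -> Tuple[str, str]:
    """Return (answer identifier, voice data).

    If `meaning` is given, the answer is chosen via the table.
    If `meaning` is None, an answer is chosen at random, which models
    the variant that needs neither language analysis nor the table.
    """
    if meaning is None:
        rng = rng or random.Random()
        answer_id = rng.choice(sorted(ANSWER_LIBRARY))
    else:
        answer_id = MEANING_TO_ANSWER[meaning]
    return answer_id, ANSWER_LIBRARY[answer_id]
```

In this sketch the identifier would go to the prosody control data acquisition unit 110 and the voice data to the answer reproduction unit 112.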

The prosody library 122 is a database that stores, in association with each of the answer identifiers stored in the answer library 124 and for each intention that can be put into the answer indicated by that answer identifier, an identifier indicating the intention (hereinafter, an intention identifier) and prosody control data defining the temporal change of prosody in that answer for that intention. Here, the prosody control data is sequence data, that is, an array of the amounts of change in each prosodic component (pitch, speaking rate, and volume) at each time measured from the beginning of the answer. Specific examples of an intention identifier include a character string representing the content of the intention, such as "anger" or "exasperation."
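One possible in-memory layout for such a library is sketched below. The class name, field names, units, and numeric values are all assumptions made for illustration; the description only states that the data is a per-time array of change amounts keyed by answer and intention.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ProsodyControlData:
    """Change amounts per time step, counted from the start of the answer."""
    pitch: List[float]   # pitch offset per step (unit assumed: semitones)
    rate: List[float]    # speaking-rate offset per step (unit assumed: ratio)
    volume: List[float]  # volume offset per step (unit assumed: dB)


# Hypothetical stand-in for the prosody library 122, keyed by
# (answer identifier, intention identifier). Values are invented.
PROSODY_LIBRARY: Dict[Tuple[str, str], ProsodyControlData] = {
    ("anosa", "lighthearted"): ProsodyControlData(
        pitch=[1.0, 2.0, 1.5], rate=[0.1, 0.2, 0.1], volume=[0.0, 1.0, 0.5]
    ),
    ("anosa", "cautious"): ProsodyControlData(
        pitch=[-1.0, -0.5, -1.0], rate=[-0.2, -0.3, -0.2], volume=[-1.0, -2.0, -1.5]
    ),
}


def get_prosody_control_data(answer_id: str, intention_id: str) -> ProsodyControlData:
    """Look up the control data for one answer uttered with one intention."""
    return PROSODY_LIBRARY[(answer_id, intention_id)]
```

Keying on the pair reflects that the same intention generally needs different control data for different answers.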

Prosody control data can be created, for example, as follows. For the answer "anosa," the waveform data of "anosa" uttered flatly, without any specific intention, is used as reference data. The waveform data of "anosa" uttered with a specific intention such as "anger" or "exasperation" is compared with the reference data, and for each prosodic component, such as pitch, speaking rate, and volume, the difference (offset) from the reference data at each time is calculated. Arranging these differences in time order for each component gives the prosody control data. In this case, the answer library 124 need only store, as the voice data of the answer, the voice data of the flat utterance without any specific intention. Alternatively, the waveform data of a voice uttered with one specific intention may be used as the reference data, and prosody control data may be generated for voices uttered with other intentions; in that case, the voice data uttered with that specific intention is stored in the answer library 124 as the voice data of the answer.
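The difference calculation can be sketched as below, assuming the per-time tracks for each component have already been extracted from both recordings and aligned to the same time steps. That extraction and alignment are not covered here, and the function and key names are hypothetical.

```python
from typing import Dict, List

Tracks = Dict[str, List[float]]  # component name -> value at each time step


def make_prosody_control_data(reference: Tracks, intentional: Tracks) -> Tracks:
    """Subtract the reference track from the intentional track.

    The result holds, for each component, the offset from the reference
    at each time step, arranged in time order.
    """
    control: Tracks = {}
    for component, ref_track in reference.items():
        int_track = intentional[component]
        if len(ref_track) != len(int_track):
            raise ValueError(f"tracks for {component!r} are not time-aligned")
        control[component] = [v - r for r, v in zip(ref_track, int_track)]
    return control
```

For example, a reference pitch track of `[200.0, 210.0]` and an intentional track of `[220.0, 205.0]` give the offsets `[20.0, -5.0]`.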

The intention designation unit 108 lets the user designate, as the intention to be put into the answer to the question, one of the intentions whose intention identifier and prosody control data are stored in the prosody library 122. For example, the intention designation unit 108 displays a list of intention identifiers on the display unit (not shown in FIG. 1) and gives the intention identifier designated by an operation on the operation input unit (not shown in FIG. 1) to the prosody control data acquisition unit 110.

The prosody control data acquisition unit 110 reads from the prosody library 122 the prosody control data corresponding to the intention identifier given by the intention designation unit 108 and the answer identifier given by the answer acquisition unit 106, and gives the prosody control data thus read to the answer reproduction unit 112.

The answer reproduction unit 112 reproduces (synthesizes) the voice represented by the voice data given by the answer acquisition unit 106 while controlling the temporal change of each prosodic component (pitch, speaking rate, and volume) separately and independently according to the prosody control data given by the prosody control data acquisition unit 110.
The above is the configuration of the audio reproduction device 10.

(B: Operation)
Next, the operation of the audio reproduction device 10 will be described.
FIG. 3 is a flowchart showing the processing performed by the audio reproduction device 10. In the present embodiment, the processing shown in this flowchart starts when the user asks a spoken question to the stuffed toy in which the audio reproduction device 10 is installed.

When the user asks a question by voice, the voice input unit 102 converts the voice of the question into a voice signal and supplies the voice signal to the language analysis unit 104. In step Sa11, the language analysis unit 104 stores the voice signal supplied from the voice input unit 102 in a memory or the like and determines whether the spoken question has ended. Whether the question has ended is determined by whether the volume of the voice signal has remained below a predetermined threshold for a predetermined time. If the question has not ended (if the determination result of step Sa11 is "No"), the language analysis unit 104 executes step Sa11 again and waits for the question to end.
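The end-of-question test in step Sa11 can be sketched as follows. The sketch assumes that a volume level has already been computed for each fixed-length frame of the stored signal, so that the "predetermined time" corresponds to a number of consecutive frames; the function and parameter names are hypothetical.

```python
from typing import Sequence


def question_has_ended(
    frame_volumes: Sequence[float], threshold: float, hold_frames: int
) -> bool:
    """Return True if the most recent `hold_frames` frames are all quieter
    than `threshold`, i.e. the low-volume state has continued long enough."""
    if hold_frames <= 0:
        raise ValueError("hold_frames must be positive")
    if len(frame_volumes) < hold_frames:
        return False  # not enough signal yet to make the judgment
    return all(v < threshold for v in frame_volumes[-hold_frames:])
```

Calling this each time a new frame arrives reproduces the loop in which step Sa11 is repeated until the result becomes "Yes."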

In step Sa12, which is executed when the determination result of step Sa11 is "Yes," the language analysis unit 104 analyzes the meaning of the question defined by the voice signal stored in the memory or the like and gives meaning content data indicating the analysis result to the answer acquisition unit 106. In the present embodiment, the determination of whether the question has ended and the semantic analysis of the question are executed sequentially, but the two may instead be executed in parallel, incrementally as the signal arrives. Doing so reduces the delay between the end of the question and the completion of the semantic analysis. In step Sa13, which follows step Sa12, the answer acquisition unit 106 acquires from the answer library 124 the answer identifier and voice data of an answer to the question whose meaning is represented by the meaning content data, gives the former to the prosody control data acquisition unit 110, and gives the latter to the answer reproduction unit 112.

In step Sa14, which follows step Sa13, the answer reproduction unit 112 determines whether an answer is currently being reproduced. The answer reproduction unit 112 determines that an answer is being reproduced when, for example, the user asks the next question while the answer to an earlier question is still being reproduced. If an answer is being reproduced (if the determination result of step Sa14 is "Yes"), the answer reproduction unit 112 repeats step Sa14 until the determination result becomes "No." When the determination result of step Sa14 is "No," the intention designation unit 108 has the user designate the intention to be put into the answer and outputs the intention identifier indicating the designated intention to the prosody control data acquisition unit 110 (step Sa15). In step Sa16, which follows step Sa15, the prosody control data acquisition unit 110 acquires from the prosody library 122 the prosody control data corresponding to the intention identifier given by the intention designation unit 108 and the answer identifier given by the answer acquisition unit 106, notifies the answer reproduction unit 112 of the acquired prosody control data, and instructs it to reproduce the voice data of the answer selected by the answer acquisition unit 106. Following this instruction, the answer reproduction unit 112 reproduces the voice represented by the voice data while controlling the temporal change of each prosodic component (pitch, speaking rate, and volume) separately and independently according to the prosody control data given by the prosody control data acquisition unit 110 (step Sa17).

For example, when the answer "anosa" is selected in step Sa13 and the intention "lighthearted" is designated in step Sa15, a voice whose prosody changes over time according to the change pattern P1 in FIG. 2 is synthesized in step Sa17, and a user who hears this voice perceives lightheartedness from the temporal change of its prosody. On the other hand, when the answer "anosa" is selected in step Sa13 and the intention "cautious" is designated in step Sa15, a voice whose prosody changes over time according to the change pattern P2 in FIG. 2 is synthesized in step Sa17, and a user who hears this voice perceives caution from the temporal change of its prosody.

As described above, the present embodiment can reproduce an answer voice whose prosody is finely controlled over time according to the intention put into the answer, which makes rich expression of intention possible. The audio reproduction device 10 of the present embodiment needs to store the prosody library 122 in a storage device or the like in addition to the answer library 124. However, because the prosody control data stored in the prosody library 122 is sequence data, the total amount of data is smaller than the total of the voice data that would be needed to hold each answer uttered with each of the various intentions. The increase in storage capacity can therefore be suppressed compared with a mode in which all of that voice data is stored. In other words, the present embodiment enables rich expression of intention while suppressing the increase in the storage capacity of the storage device that holds the data required for synthesizing the answer voice. The above embodiment controls the temporal change of pitch, speaking rate, and volume, but utterance timing may also be controlled.
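A rough back-of-the-envelope comparison illustrates the storage argument. All figures here are assumptions chosen for illustration and do not come from the description: uncompressed 16 kHz, 16-bit mono recordings, and control data holding three components at 100 values per second with 4 bytes per value. A compressed format such as mp3 would change the ratio.

```python
def bytes_for_recordings(
    num_intentions: int,
    seconds: int,
    sample_rate_hz: int = 16_000,  # assumed sampling rate
    bytes_per_sample: int = 2,     # assumed 16-bit mono PCM
) -> int:
    """Storage if one full recording is kept per intention."""
    return num_intentions * seconds * sample_rate_hz * bytes_per_sample


def bytes_for_control_data(
    num_intentions: int,
    seconds: int,
    frames_per_second: int = 100,  # assumed control-data resolution
    components: int = 3,           # pitch, speaking rate, volume
    bytes_per_value: int = 4,      # assumed 32-bit values
    sample_rate_hz: int = 16_000,
    bytes_per_sample: int = 2,
) -> int:
    """Storage if one reference recording plus per-intention control data is kept."""
    one_recording = seconds * sample_rate_hz * bytes_per_sample
    one_control = seconds * frames_per_second * components * bytes_per_value
    return one_recording + num_intentions * one_control
```

Under these assumptions, a one-second answer with ten intentions needs 320,000 bytes as separate recordings and 44,000 bytes as one recording plus control data.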

(C: Modifications and application examples)
The present invention is not limited to the embodiment described above, and various applications and modifications such as those described below are possible. One or more of the applications and modifications described below may be selected freely and combined as appropriate.

<Voice input unit>
In the above embodiment, the voice input unit 102 takes in the user's voice (utterance) with a microphone and converts it into a voice signal, but it is not limited to this configuration. The voice input unit 102 need only be configured to input, or to receive, an utterance in the form of a voice signal in some way. Specifically, the voice input unit 102 is a concept that includes a configuration that inputs a voice signal processed by another processing unit or a voice signal supplied (or transferred) from another device, and also an input interface circuit or the like that is built into an LSI and simply receives a voice signal and passes it to the subsequent stage. Furthermore, a question to the audio reproduction device 10 is not limited to a spoken question; it may be input as text data representing the question written out as a sentence. In that case, a text data input unit is provided in the audio reproduction device 10 in place of the voice input unit 102.

<Unit of prosody control>
In the above embodiment, prosody control data is prepared for each item of answer voice data, and the temporal change of the prosody of the answer voice is controlled in units of voice data. However, prosody control data may instead be prepared in units of the phonemes that make up the answer voice, in units of the waveform samples obtained when the answer voice is sampled to generate the voice data, or in units of frames obtained by dividing the answer voice into frames of a predetermined duration, and the temporal change of prosody may be controlled in those units.

<Variations of intention designation>
In the above embodiment, the user designates only one intention to be put into the answer to the question, but the user may be allowed to designate a plurality of intentions, for example both "exasperation" and "lightheartedness." In this case, the intention designation unit 108 outputs to the prosody control data acquisition unit 110 the intention identifiers indicating each of the intentions designated by the user. The prosody library 122 does not store a single item of prosody control data that directly corresponds to such a combination of intentions. The prosody control data acquisition unit 110 therefore acquires the prosody control data corresponding to each of the intention identifiers and generates one item of prosody control data by interpolation using them. Specifically, the prosody control data acquisition unit 110 generates new prosody control data by weighted addition of the acquired prosody control data, for example at a ratio of 1:1, and outputs the new prosody control data to the answer reproduction unit 112 so that the answer voice is synthesized.

According to such an aspect, it is considered possible to synthesize an answer voice corresponding to an intention intermediate between the plurality of intentions designated by the user. In general, weighted addition of the waveform data of the utterance "anone" pronounced in an "exasperated" manner and the waveform data of "anone" pronounced in an "easygoing" manner does not yield waveform data of a voice that sounds "exasperated yet easygoing". Prosody control data, however, is sequence data, so weighted addition can produce prosody control data that expresses the intermediate intention. Similarly, flat pronunciation with no particular intention can itself be regarded as one form of intention, so that designating "flat" and "cautious" yields prosody control data corresponding to an intention intermediate between the two, such as "somewhat cautious".
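The weighted addition described above can be sketched as follows. This is only an illustration under stated assumptions: the representation of prosody control data as one value sequence per prosodic component, the component names `pitch` and `volume`, the function name `blend`, and the sample values are all hypothetical and are not taken from the embodiment.

```python
from typing import Dict, List

# Hypothetical representation: each prosodic component maps to a sequence of
# control values over time (one value per control step).
ProsodyData = Dict[str, List[float]]


def blend(datasets: List[ProsodyData], weights: List[float]) -> ProsodyData:
    """Weighted addition of several prosody control data sets.

    Weights are normalized by their sum, so a ratio of 1:1 yields the average.
    Assumes every data set has the same components and sequence lengths.
    """
    if not datasets or len(datasets) != len(weights):
        raise ValueError("need exactly one weight per data set")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    result: ProsodyData = {}
    for component, sequence in datasets[0].items():
        result[component] = [
            sum(w * d[component][t] for d, w in zip(datasets, weights)) / total
            for t in range(len(sequence))
        ]
    return result


# Illustrative values only (not measured data).
exasperated = {"pitch": [0.0, -2.0, -4.0], "volume": [1.0, 0.8, 0.6]}
easygoing = {"pitch": [0.0, 1.0, 2.0], "volume": [1.0, 1.0, 1.0]}
mixed = blend([exasperated, easygoing], [1.0, 1.0])
```

Because the operation is applied to control sequences rather than to waveforms, the result remains a valid control sequence of the same shape, which is the property the paragraph above relies on.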

In the aspect in which the user designates a plurality of intentions, the user may also designate the weight of each intention in the weighted addition, and may further designate a weight for each prosodic component. This is because the importance of each prosodic component in conveying an intention can differ between, for example, "anger" and "exasperation". Alternatively, the user may designate an intention intermediate between a plurality of intentions, such as "exasperated yet easygoing", and this intermediate intention may be decomposed into a plurality of intentions so that the prosody control data acquisition unit 110 acquires the prosody control data corresponding to each. In short, any aspect suffices in which, when prosody control data corresponding to the intention designated by the intention designation unit 108 is not stored in the prosody library 122 (database), the prosody control data acquisition unit 110 obtains prosody control data corresponding to the designated intention by interpolation using a plurality of pieces of prosody control data stored in the database.
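A minimal sketch of the fallback just described, with a separate weight per prosodic component, might look like the following. The dictionary `PROSODY_LIBRARY` merely stands in for the prosody library 122, and the `DECOMPOSITION` table, the function `acquire`, the intention labels, and all numeric values are assumptions made for illustration.

```python
from typing import Dict, List

ProsodyData = Dict[str, List[float]]

# Hypothetical stand-in for the prosody library 122 (database).
PROSODY_LIBRARY: Dict[str, ProsodyData] = {
    "flat": {"pitch": [0.0, 0.0], "volume": [1.0, 1.0]},
    "cautious": {"pitch": [-1.0, -2.0], "volume": [0.6, 0.4]},
}

# Hypothetical decomposition of an intermediate intention into stored
# intentions, with one weight per prosodic component.
DECOMPOSITION: Dict[str, Dict[str, Dict[str, float]]] = {
    "somewhat cautious": {
        "flat": {"pitch": 1.0, "volume": 3.0},
        "cautious": {"pitch": 1.0, "volume": 1.0},
    },
}


def acquire(intention: str) -> ProsodyData:
    """Return stored data if present, otherwise interpolate from stored data."""
    if intention in PROSODY_LIBRARY:
        return PROSODY_LIBRARY[intention]
    parts = DECOMPOSITION.get(intention)
    if parts is None:
        raise KeyError(f"no data and no decomposition for {intention!r}")
    names = list(parts)
    result: ProsodyData = {}
    for component, sequence in PROSODY_LIBRARY[names[0]].items():
        # Normalize the weights separately for each prosodic component.
        total = sum(parts[n][component] for n in names)
        result[component] = [
            sum(parts[n][component] * PROSODY_LIBRARY[n][component][t]
                for n in names) / total
            for t in range(len(sequence))
        ]
    return result


somewhat_cautious = acquire("somewhat cautious")
```

Here pitch is mixed 1:1 while volume is mixed 3:1 in favor of "flat", which illustrates how different components can contribute differently to the same intermediate intention.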

In the above embodiment, the user designates the intention to be put into the answer each time a question is put to the voice reproduction device 10, but the user may designate the intention in advance so that answer voices carrying the same intention are always synthesized. Further, when answers are synthesized according to a predetermined scenario, the intention may be designated as the scenario progresses. Specifically, scenario data in which the intention identifiers of the intentions to be put into the answers are arranged in the order in which the answers are reproduced is input to the intention designation unit 108, and each time a question voice is input from the voice input unit 102, the intention designation unit 108 acquires the intention identifiers contained in the scenario data in the order in which they are listed and outputs them to the prosody control data acquisition unit 110.
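The scenario-driven designation can be pictured with a small sketch such as the one below. The class name `ScenarioIntentionDesignator`, the method `on_question`, and the string identifiers are hypothetical; the text only specifies that intention identifiers are consumed in their listed order, one per input question.

```python
from typing import Iterator, List


class ScenarioIntentionDesignator:
    """Hands out intention identifiers in scenario order, one per question."""

    def __init__(self, scenario: List[str]) -> None:
        # Scenario data: intention identifiers in answer reproduction order.
        self._intentions: Iterator[str] = iter(scenario)

    def on_question(self) -> str:
        """Called each time a question voice is input; returns the next identifier."""
        try:
            return next(self._intentions)
        except StopIteration:
            # Behavior past the end of the scenario is not specified in the
            # text; raising here is an assumption of this sketch.
            raise RuntimeError("scenario data exhausted") from None


designator = ScenarioIntentionDesignator(["easygoing", "exasperated", "cautious"])
first = designator.on_question()
second = designator.on_question()
```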

Further, for an answer character that personifies the voice reproduction device 10, or a stuffed toy in which the voice reproduction device 10 is embedded, a plurality of personality, mood, and speaking-style traits (hereinafter, character traits) such as "always easygoing" or "never loses caution" may be prepared in advance, each character trait may be associated with an intention identifier suited to it, and the user may designate the intention to be put into the answer by designating the character trait of the answer character. Alternatively, a plurality of answer characters each having different character traits may be prepared, and the user may designate the intention by designating an answer character. When answers to questions are synthesized according to a scenario, character traits may be defined for each character appearing in the scenario. Furthermore, a plurality of character traits may be associated with one answer character and the scenario may define which character trait is used to synthesize each answer voice, so as to portray the change of a single answer character over the years or the distinction between its true feelings and its public stance (honne and tatemae).

Further, an intention identification unit that analyzes the voice signal of a question and identifies the intention given to the question may be provided, and the intention designation unit 108 may designate the intention to be given to the answer according to the intention identified by the intention identification unit. Specifically, for each question whose answer voice data is stored in the answer library 124, waveform data of a voice pronouncing the question flatly, without any particular intention, is stored in advance in the intention identification unit as reference data. The intention identification unit converts the voice signal input from the voice input unit 102 into waveform data and compares it with the corresponding reference data, thereby generating prosody regulation data (data in the same format as the prosody control data) representing the prosody of the voice represented by that waveform data. The intention identification unit then reads from the prosody library 122 the intention identifier corresponding to the prosody control data that is identical to or approximates the prosody regulation data, and gives it to the intention designation unit 108.
The intention designation unit 108 may then output, to the prosody control data acquisition unit 110, an intention identifier that has a predetermined specific relationship with the intention identifier given by the intention identification unit. Here, the specific relationship may be, for example, a relationship of opposite intentions, such as "cautious" for "easygoing", or a relationship of the same intention. When the "same intention" relationship is adopted, the intention designation unit 108 simply outputs the intention identifier given by the intention identification unit to the prosody control data acquisition unit 110 as is.
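The last two steps, matching the generated data against the prosody library and then applying the predetermined relationship, can be sketched as follows. The generation of the data from waveforms is omitted. The sum-of-squared-differences measure is an assumption, since the text only requires data that is "the same or close"; the names `distance`, `identify_intention`, `designate`, and `OPPOSITE`, as well as the sample values, are likewise hypothetical.

```python
from typing import Dict, List

ProsodyData = Dict[str, List[float]]


def distance(a: ProsodyData, b: ProsodyData) -> float:
    """Sum of squared differences over all components (assumed similarity measure)."""
    return sum((x - y) ** 2 for c in a for x, y in zip(a[c], b[c]))


def identify_intention(measured: ProsodyData,
                       library: Dict[str, ProsodyData]) -> str:
    """Return the identifier whose stored data is closest to the measured data."""
    return min(library, key=lambda name: distance(measured, library[name]))


# Hypothetical table for the "opposite intention" relationship.
OPPOSITE = {"easygoing": "cautious", "cautious": "easygoing"}


def designate(identified: str, relation: str = "same") -> str:
    """Map the identified intention to the intention given to the answer."""
    if relation == "same":
        return identified
    if relation == "opposite":
        return OPPOSITE[identified]
    raise ValueError(f"unknown relation: {relation!r}")


library = {
    "easygoing": {"pitch": [0.0, 1.0, 2.0]},
    "cautious": {"pitch": [0.0, -1.0, -2.0]},
}
measured = {"pitch": [0.0, 0.8, 1.9]}  # illustrative values only
identified = identify_intention(measured, library)
answer_intention = designate(identified, "opposite")
```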

<Evaluating and scoring how an intention is put into a question>
In the aspect in which the intention identification unit is provided, how an intention is put into a question may be evaluated and scored by comparing the intention identified by the intention identification unit with the contents of the prosody library 122. Specifically, for a question that the user utters with an intention such as "anger" or "exasperation", the appropriateness of how the intention was conveyed may be evaluated and scored according to whether the intention identification unit identified the intention the user put into the question, or according to the result of comparing the prosody regulation data generated by the intention identification unit from the voice signal of the question with the prosody control data stored in the prosody library 122 for that intention. This aspect makes it possible to support practice in putting intentions into questions.
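One conceivable way to turn the comparison result into a score is sketched below. The text does not specify a scoring formula, so the mapping from mean squared difference to a 0 to 100 scale, the function name `score`, and the sample values are purely assumptions for illustration.

```python
from typing import Dict, List

ProsodyData = Dict[str, List[float]]


def score(measured: ProsodyData, target: ProsodyData) -> float:
    """Score from 0 to 100; 100 means the measured prosody equals the target.

    Uses the mean squared difference over all values (assumed measure) and
    maps it to a score with 100 / (1 + difference).
    """
    squared = [(x - y) ** 2
               for c in target
               for x, y in zip(measured[c], target[c])]
    mean_squared = sum(squared) / len(squared)
    return 100.0 / (1.0 + mean_squared)


# Illustrative values only.
stored_anger = {"pitch": [2.0, 3.0, 1.0], "volume": [1.2, 1.5, 1.3]}
close_attempt = {"pitch": [2.0, 2.5, 1.0], "volume": [1.2, 1.5, 1.3]}
weak_attempt = {"pitch": [0.0, 0.0, 0.0], "volume": [1.0, 1.0, 1.0]}

close_score = score(close_attempt, stored_anger)
weak_score = score(weak_attempt, stored_anger)
```

A score of this kind decreases monotonically as the user's prosody departs from the stored data for the intended intention, which is enough to give relative feedback during practice.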

<Others>
In the above embodiment, the language analysis unit 104, the prosody library, and the answer library, which constitute the configuration for acquiring an answer to an utterance, are provided on the voice reproduction device 10 side. However, considering that the processing load becomes heavy and the storage capacity is limited in a terminal device or the like, they may instead be provided on an external server. Further, the above embodiment describes an application of the present invention to a voice reproduction device that realizes voice dialogue by synthesizing and reproducing an answer voice in response to a user's spoken question, but the present invention may also be applied to voice reproduction devices that play the individual roles in a play, a manzai comedy routine, or the like.

102 ... voice input unit, 104 ... language analysis unit, 106 ... answer acquisition unit, 108 ... intention designation unit, 110 ... prosody control data acquisition unit, 112 ... answer reproduction unit, 122 ... prosody library, 124 ... answer library.

Claims

An intention identification unit that analyzes an input voice signal and identifies, as the intention of the questioner, the psychological state of the questioner contained in the question represented by the voice signal,
An answer acquisition unit that acquires voice data of an answer to the question,
An intention designation unit that designates the intention to be given to the answer according to the intention identified by the intention identification unit,
A prosody control data acquisition unit that acquires prosody control data representing a time change of prosody according to the intention designated by the intention designation unit, and
An answer reproduction unit that reproduces the voice of the answer in which the time change of the prosody based on the voice data is controlled according to the prosody control data, are provided, wherein
The intention identification unit identifies the intention of the questioner based on the time change of the prosody in the voice of the question represented by the voice signal, by using, as reference data, waveform data of a voice pronouncing that question flatly, comparing waveform data obtained by converting the voice signal with the reference data to generate prosody regulation data representing the time change of the prosody in the question, and identifying the intention of the questioner according to the prosody regulation data, and
The intention designation unit designates, as the intention to be given to the answer, an intention having a predetermined specific relationship with the intention identified by the intention identification unit.
An audio playback device characterized by the above.

The device has a database in which the prosody control data is stored for each intention, and
The prosody control data acquisition unit acquires, from the database, the prosody control data corresponding to the intention designated by the intention designation unit.
The audio playback device according to claim 1.

An intention identification unit that analyzes an input voice signal and identifies, as the intention of the questioner, the psychological state of the questioner contained in the question represented by the voice signal,
An answer acquisition unit that acquires voice data of an answer to the question,
An intention designation unit that designates the intention to be given to the answer according to the intention identified by the intention identification unit,
A prosody control data acquisition unit that acquires prosody control data representing a time change of prosody according to the intention designated by the intention designation unit,
An answer reproduction unit that reproduces the voice of the answer in which the time change of the prosody based on the voice data is controlled according to the prosody control data, and
A database in which the prosody control data is stored for each intention, are provided, wherein
The prosody control data acquisition unit acquires, from the database, the prosody control data corresponding to the intention designated by the intention designation unit, and, when prosody control data corresponding to that intention is not stored in the database, acquires the prosody control data by interpolation using a plurality of pieces of prosody control data stored in the database.
An audio playback device characterized by the above.

Causing a computer to function as:
An intention identification unit that analyzes an input voice signal and identifies, as the intention of the questioner, the psychological state of the questioner contained in the question represented by the voice signal, the intention identification unit identifying the intention of the questioner based on the time change of the prosody in the voice of the question by using, as reference data, waveform data of a voice pronouncing the question flatly without any specific intention, comparing waveform data obtained by converting the voice signal with the reference data to generate prosody regulation data representing the time change of the prosody in the question, and identifying the intention of the questioner according to the prosody regulation data,
An answer acquisition unit that acquires voice data of an answer to the question,
An intention designation unit that designates the intention to be given to the answer according to the intention identified by the intention identification unit, the intention designation unit designating, as the intention to be given to the answer, an intention having a predetermined specific relationship with the intention identified by the intention identification unit,
A prosody control data acquisition unit that acquires prosody control data representing a time change of prosody according to the intention designated by the intention designation unit, and
An answer reproduction unit that reproduces the voice of the answer in which the time change of the prosody based on the voice data is controlled according to the prosody control data.
An audio playback program characterized by the above.

Causing a computer to function as:
An intention identification unit that analyzes an input voice signal and identifies, as the intention of the questioner, the psychological state of the questioner contained in the question represented by the voice signal,
An answer acquisition unit that acquires voice data of an answer to the question,
An intention designation unit that designates the intention to be given to the answer according to the intention identified by the intention identification unit,
A prosody control data acquisition unit that acquires prosody control data representing a time change of prosody according to the intention designated by the intention designation unit, the prosody control data acquisition unit acquiring the prosody control data corresponding to the designated intention from a database in which the prosody control data is stored for each intention, and, when prosody control data corresponding to the designated intention is not stored in the database, acquiring the prosody control data by interpolation using a plurality of pieces of prosody control data stored in the database, and
An answer reproduction unit that reproduces the voice of the answer in which the time change of the prosody based on the voice data is controlled according to the prosody control data.
An audio playback program characterized by the above.

An intention identification step of analyzing an input voice signal and identifying, as the intention of the questioner, the psychological state of the questioner contained in the question represented by the voice signal, in which the intention of the questioner is identified based on the time change of the prosody in the voice of the question by using, as reference data, waveform data of a voice pronouncing the question flatly without any specific intention, comparing waveform data obtained by converting the voice signal with the reference data to generate prosody regulation data representing the time change of the prosody in the question, and identifying the intention of the questioner according to the prosody regulation data,
An answer acquisition step of acquiring voice data of an answer to the question,
An intention designation step of designating the intention to be given to the answer according to the intention identified in the intention identification step, in which an intention having a predetermined specific relationship with the intention identified in the intention identification step is designated as the intention to be given to the answer,
A prosody control data acquisition step of acquiring prosody control data representing a time change of prosody according to the intention designated in the intention designation step, and
An answer reproduction step of reproducing the voice of the answer in which the time change of the prosody based on the voice data is controlled according to the prosody control data.
An audio playback method characterized by including the above steps.

An intention identification step of analyzing an input voice signal and identifying, as the intention of the questioner, the psychological state of the questioner contained in the question represented by the voice signal,
An answer acquisition step of acquiring voice data of an answer to the question,
An intention designation step of designating the intention to be given to the answer according to the intention identified in the intention identification step,
A prosody control data acquisition step of acquiring prosody control data representing a time change of prosody according to the intention designated in the intention designation step, in which the prosody control data corresponding to the designated intention is acquired from a database in which the prosody control data is stored for each intention, and, when prosody control data corresponding to the designated intention is not stored in the database, the prosody control data is acquired by interpolation using a plurality of pieces of prosody control data stored in the database, and
An answer reproduction step of reproducing the voice of the answer in which the time change of the prosody based on the voice data is controlled according to the prosody control data.
An audio playback method characterized by including the above steps.