JP2003271194A

JP2003271194A - Voice interaction device and controlling method thereof

Info

Publication number: JP2003271194A
Application number: JP2002070320A
Authority: JP
Inventors: Kazue Kaneko; 和恵金子
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2002-03-14
Filing date: 2002-03-14
Publication date: 2003-09-25

Abstract

PROBLEM TO BE SOLVED: To provide friendlier voice interaction by changing the speaking tone and/or voice quality of the responding synthesized voice in accordance with the speaking tone and/or voice quality of the user's voice.

SOLUTION: A recognition model 103 includes recognition models corresponding to a plurality of types of speaking tones. A voice recognizing part 104 refers to the recognition model 103 to recognize the speaking tone and content of input voice from a voice inputting part 101. A speaking tone determining part 108 determines the speaking tone of the synthesized voice on the basis of the speaking tone recognized by the voice recognizing part 104. A response sentence preparing part 105 prepares a response sentence to the recognized content, and a response sentence analyzing part 106 converts the response sentence into its reading. A waveform generating part 109 generates a voice waveform on the basis of the speaking tone determined by the speaking tone determining part 108 and the reading of the response sentence obtained by the response sentence analyzing part 106, and a voice outputting part 111 outputs the voice waveform as sound.

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a voice interaction device that recognizes a user's voice and responds with a synthesized voice according to the recognition result, and to a control method thereof.

[0002]

[Description of the Related Art] Conventional voice dialogue systems mainly convert the user's voice into text as a recognition result and output the corresponding response sentence as a synthesized voice, so the spoken response always has the same fixed tone. In recent years, advances in speech recognition and speech synthesis have made voice dialogue systems feasible in a variety of settings, and friendlier voice dialogue systems that return expressive responses are expected.

[0003] For example, it has been proposed that the system recognize information such as emotion, sex, and age from the user's voice, compose a response sentence in a style suited to the user based on the recognition result, and respond expressively using a computer-generated (CG) face image.

[0004] It has also been proposed to detect the accent type, pitch, loudness, and speaking rate of the sentence uttered by the user, and to adapt the accent type, pitch, loudness, and speaking rate of the synthesized voice to the user's utterance.

[0005]

[Problems to be Solved by the Invention] The method of recognizing the emotion, sex, age, and so on contained in the user's voice and making the response more expressive through the expression of a face image cannot be used in voice dialogue systems, such as telephone-based ones, that have no image display. Likewise, changing the writing style of the response cannot be used in situations where altering the original content is undesirable, such as reading e-mail aloud or reciting a literary work.

[0006] The method of adapting the accent type, pitch, loudness, and speaking rate of the synthesized voice to those of the user does not extend to emotional expression, and is therefore limited in how friendly it can make the dialogue.

[0007] The present invention has been made in view of the above problems. Its object is to make it possible to change the speaking tone and/or voice quality of the responding synthesized voice in accordance with the speaking tone and/or voice quality of the user's voice, and thereby to realize friendlier dialogue by synthesized voice.

[0008]

[Means for Solving the Problems] A voice interaction device according to the present invention for achieving the above object has the following configuration. Namely, it is a voice interaction device that recognizes input voice and outputs a response sentence corresponding to the recognition result as a synthesized voice, comprising: recognition means for recognizing the speaking tone and content of the input voice; determination means for determining the speaking tone of the synthesized voice based on the speaking tone recognized by the recognition means; and generation means for generating, in the speaking tone determined by the determination means, a synthesized voice of a response sentence to the content of the input voice. A voice interaction device according to another configuration of the present invention for achieving the above object has the following configuration. Namely, it is a voice interaction device that recognizes input voice and outputs a response sentence corresponding to the recognition result as a synthesized voice, comprising: recognition means for recognizing the voice quality and content of the input voice; determination means for determining the voice quality of the synthesized voice based on the voice quality recognized by the recognition means; and generation means for generating, in the voice quality determined by the determination means, a synthesized voice of a response sentence to the content of the input voice.

[0009] A control method of a voice interaction device according to the present invention for achieving the above object is a control method of a voice interaction device that recognizes input voice and outputs a response sentence corresponding to the recognition result as a synthesized voice, comprising: a recognition step of recognizing the speaking tone and content of the input voice; a determination step of determining the speaking tone of the synthesized voice based on the speaking tone recognized in the recognition step; and a generation step of generating, in the speaking tone determined in the determination step, a synthesized voice of a response sentence to the content of the input voice. Furthermore, a control method of a voice interaction device according to another configuration of the present invention for achieving the above object is a control method of a voice interaction device that recognizes input voice and outputs a response sentence corresponding to the recognition result as a synthesized voice, comprising: a recognition step of recognizing the voice quality and content of the input voice; a determination step of determining the voice quality of the synthesized voice based on the voice quality recognized in the recognition step; and a generation step of generating, in the voice quality determined in the determination step, a synthesized voice of a response sentence to the content of the input voice.

[0010]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the accompanying drawings.

[0011] In the first to third embodiments described below, speech recognition uses recognition models for a plurality of types of speaking tones (such as a whisper, a cheerful voice, and a sad voice) to recognize the speaking tone of the user's utterance together with its content. Speech synthesis then uses voice waveform dictionaries for a plurality of types of speaking tones to synthesize a voice whose speaking tone corresponds to that of the user. Besides adapting the synthesized voice of the response sentence to the user's speaking tone, the device may be configured to adapt it to the user's voice quality (fourth to sixth embodiments).

[0012] <First Embodiment> FIG. 1 is a block diagram showing the configuration of a voice interaction device according to the first embodiment. In FIG. 1, reference numeral 101 denotes a voice input unit, which inputs the user's voice. 102 denotes a database, which is used to generate a response sentence corresponding to the recognition result. 103 denotes a recognition model, which is used to recognize the content of the user's utterance and its speaking tone. 104 denotes a voice recognition unit, which uses the recognition model 103 to recognize the content and speaking tone of the user's utterance.

[0013] 105 denotes a response sentence creation unit, which creates a response sentence from the utterance content recognized by the voice recognition unit 104. 106 denotes a response sentence analysis unit, which analyzes the response sentence created by the response sentence creation unit 105 and assigns a reading to it. 107 denotes a language analysis dictionary, which the response sentence analysis unit 106 uses to analyze the response sentence.

[0014] 108 denotes a speaking tone determination unit, which sets the speaking tone used in speech synthesis to the one closest to the speaking tone that the voice recognition unit 104 recognized from the user's voice. 109 denotes a waveform generation unit, which generates a voice waveform from the reading of the response sentence obtained by the response sentence analysis unit 106, in the speaking tone determined by the speaking tone determination unit 108. 110 denotes a voice waveform dictionary, which stores speech units for a plurality of types of speaking tones. 111 denotes a voice output unit, which outputs the voice of the response sentence according to the voice waveform generated by the waveform generation unit 109.

[0015] Each of the above units is realized on a general-purpose personal computer by its CPU executing a program, stored in memory, that is described by the flowcharts below. The database 102, the recognition model 103, the language analysis dictionary 107, and the voice waveform dictionary 110 are each stored in an external storage device accessible from the personal computer.

[0016] FIG. 7 is a flowchart explaining the procedure of voice dialogue processing according to the first embodiment.

[0017] First, in step S701, the user's voice is input through the voice input unit 101. In step S702, the voice recognition unit 104 recognizes the content of the user's utterance and its speaking tone. The recognition model 103 used for recognition consists of a plurality of types of hidden Markov models (HMMs), created for a plurality of types of speaking tones from voice data that a plurality of people uttered in each of those tones. The input is matched against the HMMs of each speaking tone, and the speaking tone and utterance content of the best-matching model are taken as the recognition result.
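
The selection at the end of step S702 can be sketched as follows. This is only an illustration under stated assumptions: the HMM decoding itself is not shown, and the `Hypothesis` type, the `recognize` function, and the scores are hypothetical stand-ins for whatever each tone-specific model set would return.

```python
from typing import NamedTuple


class Hypothesis(NamedTuple):
    """Result of decoding one utterance against one tone-specific HMM set (hypothetical type)."""
    tone: str              # speaking tone the model set was trained on
    content: str           # utterance content decoded with that model set
    log_likelihood: float  # matching score; higher means a better match


def recognize(hypotheses: list[Hypothesis]) -> tuple[str, str]:
    """Return (speaking tone, content) of the best-matching model."""
    if not hypotheses:
        raise ValueError("no hypotheses to choose from")
    best = max(hypotheses, key=lambda h: h.log_likelihood)
    return best.tone, best.content


# Made-up scores for one utterance decoded against three tone-specific model sets.
example = [
    Hypothesis("normal", "what is the weather today", -412.7),
    Hypothesis("whisper", "what is the weather today", -398.2),
    Hypothesis("sad", "what is the feather today", -455.0),
]
```

Taking the tone and the content from the same winning hypothesis mirrors the text, which treats both as a single recognition result of the best-matching model.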

[0018] In step S703, the speaking tone determination unit 108 determines the speaking tone for speech synthesis. Here, from the speaking tones provided in the voice waveform dictionary 110, the one closest to the speaking tone recognized from the user's voice is selected.
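
The text does not say how closeness between speaking tones is measured. The sketch below assumes, purely for illustration, that each tone is described by a small numeric descriptor and that closeness is Euclidean distance; the descriptor values, the table names, and `closest_tone` are all hypothetical.

```python
import math

# Hypothetical descriptors: (relative pitch, relative loudness, relative speaking rate).
DICTIONARY_TONES = {
    "normal": (1.0, 1.0, 1.0),
    "whisper": (0.9, 0.3, 0.9),
    "gentle": (0.95, 0.7, 0.85),
}

# Tones the recognizer can report; "sad" has no entry in the waveform dictionary.
RECOGNIZED_TONES = {
    "normal": (1.0, 1.0, 1.0),
    "whisper": (0.9, 0.3, 0.9),
    "sad": (0.9, 0.6, 0.8),
}


def closest_tone(recognized: str) -> str:
    """Pick the dictionary tone nearest to the recognized tone (assumed distance measure)."""
    if recognized in DICTIONARY_TONES:
        return recognized  # an exact match is trivially the closest
    target = RECOGNIZED_TONES[recognized]
    return min(DICTIONARY_TONES, key=lambda name: math.dist(DICTIONARY_TONES[name], target))
```

With these made-up numbers, a recognized "sad" tone falls back to "gentle", the nearest tone that the dictionary actually provides.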

[0019] In step S704, the response sentence creation unit 105 creates a response sentence from the content of the user's utterance. Response sentences range from conversational pairs, such as "Hello" in reply to "Hello" or "You're welcome" in reply to "Thank you", to sentences composed by referring to external data, such as answering the user's question "What is the weather today?" with "The weather in Yokohama on September 25 is sunny, cloudy for a time in the afternoon; the chance of precipitation is 0% in the morning and 10% in the afternoon; the low is 15 degrees and the high is 28 degrees." The database 102 in FIG. 1 is the information source needed to create these responses.

[0020] In step S705, the response sentence analysis unit 106 performs language analysis of the response sentence. In Japanese speech synthesis, for example, a reading is created for text written in mixed kanji and kana. In English speech synthesis, a phonetic transcription is created from the spelling.

[0021] In step S706, the waveform generation unit 109 retrieves, from the voice waveform dictionary corresponding to the speaking tone determined by the speaking tone determination unit 108, the speech units corresponding to the reading or phonetic symbols, generates prosody and places pauses to suit the speaking tone, and creates a voice waveform. In step S707, the voice output unit 111 outputs speech using the voice waveform generated by the waveform generation unit 109.
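
The flow of steps S702 to S706 can be summarized as a chain of function calls. The sketch below is an assumption-laden outline, not the device's implementation: `DialogueTurn` and its field names are hypothetical, and each stage is a stub supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DialogueTurn:
    """One pass through steps S702 to S706; every stage is a caller-supplied stub."""
    recognize: Callable[[bytes], tuple[str, str]]   # S702: audio -> (content, speaking tone)
    decide_tone: Callable[[str], str]               # S703: user tone -> synthesis tone
    create_response: Callable[[str], str]           # S704: content -> response sentence
    analyze: Callable[[str], str]                   # S705: sentence -> reading
    generate_waveform: Callable[[str, str], bytes]  # S706: (reading, tone) -> waveform

    def run(self, audio: bytes) -> bytes:
        content, user_tone = self.recognize(audio)
        tone = self.decide_tone(user_tone)
        sentence = self.create_response(content)
        reading = self.analyze(sentence)
        return self.generate_waveform(reading, tone)


# Toy stand-ins so the flow can be exercised without any speech components.
turn = DialogueTurn(
    recognize=lambda audio: ("hello", "whisper"),
    decide_tone=lambda tone: tone,
    create_response=lambda content: "hello" if content == "hello" else "sorry?",
    analyze=lambda sentence: sentence.upper(),
    generate_waveform=lambda reading, tone: f"{tone}:{reading}".encode(),
)
```

Note that the response sentence depends only on the recognized content, while the speaking tone affects only waveform generation, which is why the wording of the response can stay unchanged.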

[0022] As described above, according to the first embodiment, a voice is synthesized in the speaking tone closest to that of the input voice, so that more natural voice dialogue with a machine is realized.

[0023] <Second Embodiment> In the first embodiment, the speaking tone of the synthesized voice is set to the one closest to the speaking tone recognized by the voice recognition unit 104. In the second embodiment, a table that associates speaking tones of the input voice with speaking tones of the synthesized voice is prepared, and the speaking tone of the synthesized voice is determined based on this table.

[0024] FIG. 2 is a block diagram showing the configuration of a voice interaction device according to the second embodiment. Components identical to those of the first embodiment bear the same reference numerals. 208 denotes a speaking tone determination unit, which searches the speaking tone correspondence table 211 using the speaking tone of the input voice recognized by the voice recognition unit 104 and determines the speaking tone of the response sentence. The correspondence between the user's speaking tone and the speaking tone of the response voice (synthesized voice) is registered in the speaking tone correspondence table 211 in advance.

[0025] The voice interaction device of the second embodiment, configured as above, operates in the same way as in the first embodiment, except that in step S703 the speaking tone determination unit 208 refers to the speaking tone correspondence table 211 to obtain the speaking tone associated with the speaking tone of the input voice recognized by the voice recognition unit 104, and adopts it as the speaking tone of the response voice. FIG. 9(a) is an example of the speaking tone correspondence table 211. When the user speaks in a normal voice, the system also responds in a normal voice. When the user whispers, the system also responds in a whisper. When the user sounds sad, the system responds in a gentle voice, and when the user sounds angry, the system responds in a tense voice. These responses are attuned to the user and, as a dialogue strategy, are intended to encourage further utterances without antagonizing the user.
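
The correspondence described for FIG. 9(a) amounts to a simple lookup. In the sketch below the four table rows come from the text; the English tone labels, the function name, and the fallback for an unregistered tone are assumptions added for illustration.

```python
# User speaking tone -> system speaking tone, as described for FIG. 9(a).
SYMPATHETIC_TABLE = {
    "normal": "normal",
    "whisper": "whisper",
    "sad": "gentle",
    "angry": "tense",
}


def decide_response_tone(user_tone: str, table: dict[str, str], default: str = "normal") -> str:
    """Look up the response tone; falling back to `default` is an assumed policy."""
    return table.get(user_tone, default)
```

For example, `decide_response_tone("sad", SYMPATHETIC_TABLE)` yields the gentle voice, while a tone missing from the table falls back to the normal voice under the assumed default.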

[0026] In FIGS. 9(a) and 9(b), "User" denotes (the speaking tone of) the input voice and "System" denotes (the speaking tone of) the response voice. In the second embodiment the correspondence is attuned to the user, as in FIG. 9(a), but a correspondence that steers the user's emotion in the opposite direction may be used instead, as in FIG. 9(b), where the system answers a sad voice in a cheerful voice and an angry voice in a whisper.

[0027] <Third Embodiment> The second embodiment determines the speaking tone of the synthesized voice by referring to the speaking tone correspondence table 211. In the third embodiment, a plurality of types of speaking tone correspondence tables are prepared, and the user can select the desired table.

[0028] FIG. 3 is a block diagram showing the configuration of a voice interaction device according to the third embodiment. Components identical to those of the first or second embodiment bear the same reference numerals. 311 denotes a speaking tone correspondence selection unit, which selects a desired table from the plurality of types of speaking tone correspondence tables stored in the speaking tone correspondence table 312. The speaking tone determination unit 308 has substantially the same function as the speaking tone determination unit 208 of the second embodiment, but in step S703 it determines the speaking tone of the synthesized voice by referring to the speaking tone correspondence table selected by the speaking tone correspondence selection unit 311.

[0029] For example, tables such as those shown in FIGS. 9(a) and 9(b) are registered in the speaking tone correspondence table, and the speaking tone correspondence selection unit 311 selects the desired table according to the user's instruction. The speaking tone determination unit 308 determines the speaking tone of the synthesized voice by referring to whichever table the speaking tone correspondence selection unit 311 has selected. Although FIGS. 9(a) and 9(b) show two types of tables, there may be three or more tables.
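
Selecting among several registered tables can be sketched as below. The table contents follow the descriptions of FIGS. 9(a) and 9(b), with only the rows the text mentions; the table names, the `ToneTableSelector` class, and the default fallback are hypothetical.

```python
# Registered tables (names are illustrative). Only rows described in the text are listed.
TABLES = {
    "sympathetic": {"normal": "normal", "whisper": "whisper", "sad": "gentle", "angry": "tense"},
    "counter": {"sad": "cheerful", "angry": "whisper"},
}


class ToneTableSelector:
    """Holds the registered tables and the one currently selected by the user."""

    def __init__(self, tables: dict[str, dict[str, str]], selected: str) -> None:
        self._tables = tables
        self.select(selected)

    def select(self, name: str) -> None:
        if name not in self._tables:
            raise KeyError(f"unknown table: {name}")
        self._selected = name

    def decide(self, user_tone: str, default: str = "normal") -> str:
        """Determine the response tone from the selected table (default is an assumption)."""
        return self._tables[self._selected].get(user_tone, default)
```

Because the tables live in a dictionary keyed by name, registering a third or fourth table needs no change to the selection logic.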

[0030] For table selection, an interface is provided that displays tables such as those in FIGS. 9(a) and 9(b) on a display device of the voice interaction device and lets the user select the desired table.

[0031] As described above, according to the third embodiment, when a response sentence corresponding to the content of the input voice is output as speech, it can be synthesized in a desired speaking tone that depends on the speaking tone of the input voice.

[0032] Although the third embodiment has been described as selecting a table, the speaking tone correspondence table may also be made freely creatable (that is, editable). For example, in FIG. 9(a), any combination of user-side and system-side speaking tones may be made settable. In that case, a single speaking tone of the synthesized voice may be assigned to two speaking tones of the input voice (for example, "normal voice" may be assignable to both the user-side "normal voice" and "whisper").
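
An editable table with many-to-one assignments might look like the following sketch. The class and method names are hypothetical, and checking each assignment against the set of tones available for synthesis is an added assumption, not something the text requires.

```python
class EditableToneTable:
    """User-editable mapping from input speaking tone to synthesis speaking tone."""

    def __init__(self, available_tones: set[str]) -> None:
        self._available = set(available_tones)  # tones the synthesizer can produce (assumed check)
        self._mapping: dict[str, str] = {}

    def assign(self, user_tone: str, system_tone: str) -> None:
        if system_tone not in self._available:
            raise ValueError(f"no synthesis data for tone: {system_tone}")
        self._mapping[user_tone] = system_tone

    def lookup(self, user_tone: str, default: str = "normal") -> str:
        return self._mapping.get(user_tone, default)


table = EditableToneTable({"normal", "whisper", "gentle", "tense"})
# Two input tones may share one synthesis tone, as in the example in the text.
table.assign("normal", "normal")
table.assign("whisper", "normal")
```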

[0033] <Fourth Embodiment> In the first to third embodiments, the speaking tone of the synthesized voice is determined according to the speaking tone of the input voice. In the fourth embodiment, the voice quality of the synthesized voice is determined according to the voice quality of the input voice.

[0034] FIG. 4 is a block diagram showing the configuration of a voice interaction device according to the fourth embodiment. In FIG. 4, components identical to those of the voice interaction devices of the first to third embodiments bear the same reference numerals.

[0035] Models for recognizing the content of the user's utterance and a plurality of types of voice quality are registered in the recognition model 403. The voice recognition unit 404 refers to the recognition model 403 to recognize the content and voice quality of the user's utterance. The voice quality determination unit 408 selects the voice quality closest to the one recognized from the user's voice as the voice quality to be used for the synthesized voice of the response sentence to that input voice. The waveform generation unit 409 generates a voice waveform by referring to the voice waveform dictionary 410, based on the reading of the response sentence obtained by the response sentence analysis unit 106 and the voice quality determined by the voice quality determination unit 408. The voice waveform dictionary 410 stores speech units for a plurality of types of voice quality.

[0036] FIG. 8 is a flowchart explaining voice dialogue processing according to the fourth embodiment.

[0037] First, in step S801, the user's voice is input through the voice input unit 101. In step S802, the voice recognition unit 404 refers to the recognition model 403 to recognize the utterance content of the input voice and its voice quality.

[0038] The recognition model 403 for recognizing utterance content and voice quality consists of a plurality of types of hidden Markov models (HMMs) created from voices uttered by people grouped by age and sex. The input voice is matched against the HMM of each voice quality, and the voice quality and utterance content of the best-matching recognition model are taken as the recognition result.

[0039] In step S803, the voice quality determination unit 408 determines the voice quality for speech synthesis based on the voice quality recognized in step S802. In the fourth embodiment, from the voice qualities registered in the voice waveform dictionary 410, the one closest to the voice quality of the user's voice recognized by the voice recognition unit 404 is selected.

[0040] In step S804, the response sentence creation unit 105 creates a response sentence from the content of the user's utterance recognized by the voice recognition unit 404. Response sentences range from conversational pairs, such as "Hello" in reply to "Hello" or "You're welcome" in reply to "Thank you", to sentences composed by referring to external data, such as answering the user's question "What is the weather today?" with "The weather in Yokohama on September 25 is sunny, cloudy for a time in the afternoon; the chance of precipitation is 0% in the morning and 10% in the afternoon; the low is 15 degrees and the high is 28 degrees." The database 102 represents the information source needed to create these responses.

[0041] In step S805, the response sentence analysis unit 106 performs language analysis of the response sentence. For Japanese, a reading is created for text written in mixed kanji and kana. For English, a phonetic transcription is created from the spelling.

[0042] In step S806, the waveform generation unit 409 retrieves, from the voice waveform dictionary that matches the voice quality of the response voice, the speech units corresponding to the reading or phonetic symbols, generates prosody and places pauses to suit the voice quality, and creates a voice waveform. In step S807, the voice output unit 111 outputs speech based on the voice waveform generated by the waveform generation unit 409.

[0043] As described above, according to the fourth embodiment, the response sentence can be uttered in a voice quality that depends on the input voice.

[0044] <Fifth Embodiment> In the fourth embodiment, the voice quality of the input voice is recognized, and a synthesized voice with the voice quality closest to the recognized one is generated. In the fifth embodiment, the voice quality of the synthesized voice is determined using a voice quality correspondence table similar to the speaking tone correspondence table described in the second embodiment.

[0045] FIG. 5 is a block diagram showing the configuration of a voice interaction device according to the fifth embodiment. Components identical to those of the first to fourth embodiments bear the same reference numerals. 508 denotes a voice quality determination unit, which searches the voice quality correspondence table 511 using the voice quality of the input voice recognized by the voice recognition unit 404 and determines the voice quality of the response sentence. The correspondence between the user's voice quality and the voice quality of the response voice (synthesized voice) is registered in the voice quality correspondence table 511 in advance.

[0046] The voice interaction device of the fifth embodiment, configured as above, operates in the same way as in the fourth embodiment, except that in step S803 the voice quality associated with the voice quality of the input voice recognized by the voice recognition unit 404 is obtained by referring to the voice quality correspondence table 511 and is adopted as the voice quality of the response voice. For example, FIG. 10(a) is an example of the voice quality correspondence table 511. It gives a correspondence in which the system responds in an adult female voice when the user's voice is an adult male voice, and in an adult male voice when the user's voice is an adult female voice.
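
The voice quality correspondence described for FIG. 10(a) can be sketched with voice qualities keyed by age group and sex, the grouping the fourth embodiment uses for its recognition models. The `VoiceQuality` type, the group labels, and the default for an unregistered voice quality are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceQuality:
    """A voice quality group; frozen so that instances can serve as dictionary keys."""
    age_group: str  # illustrative label, e.g. "adult"
    sex: str        # illustrative label, e.g. "male" or "female"


ADULT_MALE = VoiceQuality("adult", "male")
ADULT_FEMALE = VoiceQuality("adult", "female")

# User voice quality -> system voice quality, as described for FIG. 10(a).
VOICE_QUALITY_TABLE = {
    ADULT_MALE: ADULT_FEMALE,
    ADULT_FEMALE: ADULT_MALE,
}


def decide_response_quality(
    user_quality: VoiceQuality,
    table: dict[VoiceQuality, VoiceQuality],
    default: VoiceQuality = ADULT_FEMALE,
) -> VoiceQuality:
    """Look up the response voice quality; the default is an assumed fallback."""
    return table.get(user_quality, default)
```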

[0047] In FIGS. 10(a) and 10(b), "User" denotes (the voice quality of) the input voice and "System" denotes (the voice quality of) the response voice.

【００４８】＜第６実施形態＞第５実施形態において、
声質対応付けテーブル５１１を参照して合成音声の声質
を決定することを説明したが、第６実施形態では、複数
種類の声質対応付けテーブルを用意し、所望のテーブル
をユーザが選択できるようにする。<Sixth Embodiment> The fifth embodiment determines the voice quality of the synthesized voice by referring to the voice quality correspondence table 511. In the sixth embodiment, a plurality of types of voice quality correspondence tables are prepared so that the user can select a desired table.

【００４９】図６は第６実施形態による音声対話装置の
構成を示すブロック図である。第１乃至第５実施形態と
同様の構成には同一の参照番号を付してある。６１１は
声質対応選択部であり、声質対応付けテーブル６１２に
格納されている複数種類の声質対応付けテーブルから所
望のテーブルを選択する。FIG. 6 is a block diagram showing the configuration of the voice interaction device according to the sixth embodiment. Components identical to those in the first to fifth embodiments are given the same reference numerals. Reference numeral 611 denotes a voice quality correspondence selection unit, which selects a desired table from the plurality of types of voice quality correspondence tables stored in the voice quality correspondence table 612.

【００５０】ステップＳ８０３において、声質決定部６
０８は、声質対応選択部６１１で選択された声質対応付
けテーブルを参照して合成音声の声質を決定する。In step S803, the voice quality determination unit 608 refers to the voice quality correspondence table selected by the voice quality correspondence selection unit 611 to determine the voice quality of the synthesized voice.

【００５１】例えば、声質対応付けテーブル６１２に
は、図１０（ａ）と図１０（ｂ）に示すようなテーブル
が登録されており、声質対応選択部３１１はユーザの指
示により所望のテーブルを選択する。声質決定部６０８
は、声質対応選択部６１１によって選択されている方の
テーブルを参照して合成音声の声質を決定する。なお、
図１０では２種類のテーブルを示したがテーブル数は３
つ以上あってもかまわない。For example, tables such as those shown in FIGS. 10(a) and 10(b) are registered in the voice quality correspondence table 612, and the voice quality correspondence selection unit 611 selects a desired table according to the user's instruction. The voice quality determination unit 608 determines the voice quality of the synthesized voice by referring to whichever table is currently selected by the voice quality correspondence selection unit 611. Although FIG. 10 shows two types of tables, there may be three or more.
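
The relationship between the stored tables, the selection unit, and the determination unit can be sketched as follows. This is a minimal Python illustration under stated assumptions: the class `VoiceQualityTableSelector` and the function `decide_voice_quality` are hypothetical names, and the pairings in table "b" are placeholders, because the text does not list the contents of FIG. 10(b).

```python
class VoiceQualityTableSelector:
    """Hypothetical stand-in for the selection unit 611 and the stored tables 612."""

    def __init__(self, tables):
        if not tables:
            raise ValueError("at least one table is required")
        self._tables = dict(tables)
        # Assumption: the first registered table is selected until the user chooses.
        self._selected = next(iter(self._tables))

    def select(self, name):
        """Record the user's choice of table."""
        if name not in self._tables:
            raise KeyError(f"unknown table: {name}")
        self._selected = name

    def current_table(self):
        return self._tables[self._selected]


def decide_voice_quality(selector, user_quality):
    """Step S803: look up the response voice quality in the selected table."""
    return selector.current_table().get(user_quality)


tables = {
    "a": {"adult_male": "adult_female", "adult_female": "adult_male"},
    # Placeholder pairings: the actual contents of FIG. 10(b) are not given in the text.
    "b": {"bass": "soprano", "soprano": "bass"},
}
selector = VoiceQualityTableSelector(tables)
selector.select("b")
print(decide_voice_quality(selector, "bass"))
```

Because the tables are held in a dictionary keyed by name, registering three or more tables requires no change to the selection logic.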

【００５２】なお、テーブルの選択に際しては、音声対
話装置が備える表示装置上に図１０の（ａ）、（ｂ）に
示すような表を表示して、所望のテーブルをユーザに選
択させるインターフェースが提供されるようにする。For table selection, an interface is provided that displays tables such as those shown in FIGS. 10(a) and 10(b) on a display device of the voice interaction device and lets the user select the desired table.

【００５３】以上のように第６実施形態によれば、入力
音声の内容に対応する応答文の音声出力に際して、入力
音声の声質に応じた所望の声質で音声合成を行なうこと
が可能となる。As described above, according to the sixth embodiment, when outputting a response sentence corresponding to the contents of the input voice, it is possible to perform voice synthesis with a desired voice quality according to the voice quality of the input voice.

【００５４】なお、第６実施形態ではテーブルの選択を
行なう構成を示したが、声質対応付けテーブルを任意に
作成できる（声質対応付けテーブルを編集可能とする）
ようにしてもよい。例えば、図１０（ａ）においてユー
ザ側の声質とシステム側の声質の任意な組み合わせを設
定できるようにしてもよい。この場合、入力音声の２種
類の声質に対して合成音声の声質を１種類とすることが
できるようにしてもよい（例えば、ユーザ側の「成人男
性音声」と「成人女性音声」に「子供男性音声」を割り
当て可能としてもよい）。Although the sixth embodiment selects among tables, the voice quality correspondence table may instead be freely created (that is, made editable). For example, any combination of user-side and system-side voice qualities in FIG. 10(a) may be made settable. In this case, a single synthesized voice quality may be assigned to two input voice qualities (for example, "child male voice" may be assigned to both "adult male voice" and "adult female voice" on the user side).
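
An editable table of this kind might look like the following Python sketch. The function `set_pairing` and the label strings are hypothetical, chosen only to illustrate the many-to-one assignment described above; the source does not specify how editing is implemented.

```python
def set_pairing(table, user_quality, system_quality):
    """Return a copy of `table` with one user-to-system pairing added or replaced."""
    edited = dict(table)
    edited[user_quality] = system_quality
    return edited


# Start from the pairings described for FIG. 10(a).
table = {"adult_male": "adult_female", "adult_female": "adult_male"}

# Assign the same synthesized voice quality to both user-side voice qualities.
for user_quality in ("adult_male", "adult_female"):
    table = set_pairing(table, user_quality, "child_male")

print(table)
```

Returning an edited copy rather than mutating in place is a design choice of this sketch; it keeps the originally registered table available if the user wants to revert.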

【００５５】なお、第１〜第６の各実施形態では、音声
認識の認識モデルに隠れマルコフモデルを使用したが、
ニューラルネットなどの別のモデルを用いてもよい。In each of the first to sixth embodiments, a hidden Markov model is used as the recognition model for speech recognition, but another model such as a neural network may be used.

【００５６】また、上記第１〜第３実施形態では、話調
としてささやき声や喜怒哀楽などの感情のこもった声を
採用しているが、ＤＪ調、ナレーター調、朗読調などの
より韻律部分に特徴のあるものも話調の種類としてもよ
い。In the first to third embodiments, whispering and emotionally colored voices (joy, anger, sorrow, pleasure) are used as tones, but styles characterized more by prosody, such as a DJ style, a narrator style, or a recitation style, may also be used as tone types.

【００５７】また、上記第１〜第６実施形態では、一番
よくマッチングした話調や音質のモデルのみを認識結果
に採用しているが、候補が複数ある場合の上位に入った
ものの話調や音質の組みあわせを認識結果とし、その組
み合わせで応答の話調を決定するようにしてもよい。例
えば、ユーザの音声の認識結果として、悲しそうな声と
怒った声がともに上位にあがった場合は、ささやき声で
応答するというような対応づけが考えられる。In the first to sixth embodiments, only the best-matching tone or voice quality model is adopted as the recognition result. However, when there are multiple candidates, the combination of tones or voice qualities of the top-ranked candidates may be taken as the recognition result, and the tone of the response may be determined from that combination. For example, if a sad voice and an angry voice both rank near the top when the user's voice is recognized, the system could respond in a whisper.
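
One way to picture this combination rule is the Python sketch below. Everything in it is an assumption made for illustration: the function `decide_response_tone`, the use of numeric match scores where higher means a better match, the fixed cutoff `top_n=2`, and the contents of `SINGLE_TABLE`. Only the pairing of a sad and an angry voice with a whispered response comes from the text.

```python
def decide_response_tone(scores, single_table, combo_table, top_n=2):
    """Pick a response tone from recognition scores (higher score = better match).

    If the set of top-ranked tones has an entry in `combo_table`, that entry
    wins; otherwise the best single match is looked up in `single_table`.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    if not ranked:
        return None
    top = frozenset(ranked[:top_n])
    if top in combo_table:
        return combo_table[top]
    return single_table.get(ranked[0])


# The sad + angry -> whisper pairing is the example given in the text.
COMBO_TABLE = {frozenset({"sad", "angry"}): "whisper"}
# Placeholder single-tone table (assumed to mirror the user's tone).
SINGLE_TABLE = {"sad": "sad", "angry": "angry", "happy": "happy"}

scores = {"sad": 0.41, "angry": 0.38, "happy": 0.05}
print(decide_response_tone(scores, SINGLE_TABLE, COMBO_TABLE))
```

Keying the combination table on a `frozenset` makes the lookup independent of which of the two tones ranked first.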

【００５８】また、上記実施形態では、声質として年齢
と性別の組み合わせを採用しているが、バスやテノール
アルトやソプラノなどのベースとなる声の高さも声質の
種類として採用してもよい。図１０の（ｂ）がその対応
例である。In the above embodiments, a combination of age and gender is used as the voice quality, but the base pitch of the voice, such as bass, tenor, alto, or soprano, may also be used as a type of voice quality. FIG. 10(b) shows an example of such a correspondence.

【００５９】更に、上記実施形態では、話調を対応させ
るもの（第１乃至第３実施形態）と声質を対応させるも
の（第４乃至第６実施形態）とに分けて説明したが、話
調及び声質を入力音声に対応させるようにしてもよいこ
と、またその場合に第１乃至第３実施形態の何れかと第
４乃至第６実施形態のいずれかを組み合わせてよいこと
は上記実施形態の説明から明らかである。Furthermore, the above embodiments were described separately as those that map the tone (first to third embodiments) and those that map the voice quality (fourth to sixth embodiments). It is clear from the above description, however, that both the tone and the voice quality may be made to correspond to the input voice, and that in that case any of the first to third embodiments may be combined with any of the fourth to sixth embodiments.

【００６０】以上説明したように、上記各実施形態によ
れば、ユーザの声質やユーザのその時々の話調に対応し
て、応答する声の声質や話調が変更される。このため、
表情を現す顔画像を使用せず、発話内容などを変更する
ことなく、反感を与えずより親しみやすい印象を与える
合成音声の生成が可能となるという効果がある。As described above, according to each of the above embodiments, the voice quality and tone of the responding voice are changed in accordance with the user's voice quality and the user's tone at that moment. This makes it possible to generate synthesized speech that gives a friendlier impression without causing antipathy, without using a facial image that shows expressions and without changing the content of the utterance.

【００６１】なお、本発明の目的は、前述した実施形態
の機能を実現するソフトウェアのプログラムコードを記
録した記憶媒体を、システムあるいは装置に供給し、そ
のシステムあるいは装置のコンピュータ（またはＣＰＵ
やＭＰＵ）が記憶媒体に格納されたプログラムコードを
読出し実行することによっても、達成されることは言う
までもない。Needless to say, the object of the present invention is also achieved by supplying a system or device with a storage medium on which is recorded software program code that implements the functions of the above embodiments, and having a computer (or CPU or MPU) of that system or device read and execute the program code stored in the storage medium.

【００６２】この場合、記憶媒体から読出されたプログ
ラムコード自体が前述した実施形態の機能を実現するこ
とになり、そのプログラムコードを記憶した記憶媒体は
本発明を構成することになる。In this case, the program code itself read from the storage medium realizes the function of the above-described embodiment, and the storage medium storing the program code constitutes the present invention.

【００６３】プログラムコードを供給するための記憶媒
体としては、例えば、フロッピディスク，ハードディス
ク，光ディスク，光磁気ディスク，ＣＤ−ＲＯＭ，ＣＤ
−Ｒ，磁気テープ，不揮発性のメモリカード，ＲＯＭな
どを用いることができる。As the storage medium for supplying the program code, for example, a floppy disk, hard disk, optical disk, magneto-optical disk, CD-ROM, CD-R, magnetic tape, non-volatile memory card, or ROM can be used.

【００６４】また、コンピュータが読出したプログラム
コードを実行することにより、前述した実施形態の機能
が実現されるだけでなく、そのプログラムコードの指示
に基づき、コンピュータ上で稼働しているＯＳ（オペレ
ーティングシステム）などが実際の処理の一部または全
部を行い、その処理によって前述した実施形態の機能が
実現される場合も含まれることは言うまでもない。Needless to say, the functions of the above embodiments may be realized not only when the computer executes the program code it has read, but also when an OS (operating system) or the like running on the computer performs part or all of the actual processing based on the instructions of the program code, and that processing realizes the functions of the above embodiments.

【００６５】さらに、記憶媒体から読出されたプログラ
ムコードが、コンピュータに挿入された機能拡張ボード
やコンピュータに接続された機能拡張ユニットに備わる
メモリに書込まれた後、そのプログラムコードの指示に
基づき、その機能拡張ボードや機能拡張ユニットに備わ
るＣＰＵなどが実際の処理の一部または全部を行い、そ
の処理によって前述した実施形態の機能が実現される場
合も含まれることは言うまでもない。Further, after the program code read from the storage medium is written in the memory provided in the function expansion board inserted into the computer or the function expansion unit connected to the computer, based on the instruction of the program code, It goes without saying that a case where the CPU or the like included in the function expansion board or the function expansion unit performs some or all of the actual processing and the processing realizes the functions of the above-described embodiments is also included.

【００６６】[0066]

【発明の効果】以上説明したように、本発明によれば、
ユーザの声の話調及び／又は声質に対して、応答側の合
成音声の話調及び／又は声質を変えることで、より親し
みやすい音声対話を提供することが可能となる。[Effects of the Invention] As described above, according to the present invention, a friendlier voice dialogue can be provided by changing the tone and/or voice quality of the responding synthesized voice in accordance with the tone and/or voice quality of the user's voice.

[Brief description of drawings]

【図１】第１実施形態による音声対話装置の構成を示す
ブロック図である。FIG. 1 is a block diagram showing a configuration of a voice interaction device according to a first embodiment.

【図２】第２実施形態による音声対話装置の構成を示す
ブロック図である。FIG. 2 is a block diagram showing a configuration of a voice interaction device according to a second embodiment.

【図３】第３実施形態による音声対話装置の構成を示す
ブロック図である。FIG. 3 is a block diagram showing a configuration of a voice interaction device according to a third embodiment.

【図４】第４実施形態による音声対話装置の構成を示す
ブロック図である。FIG. 4 is a block diagram showing a configuration of a voice interaction device according to a fourth embodiment.

【図５】第５実施形態による音声対話装置の構成を示す
ブロック図である。FIG. 5 is a block diagram showing a configuration of a voice interaction device according to a fifth embodiment.

【図６】第６実施形態による音声対話装置の構成を示す
ブロック図である。FIG. 6 is a block diagram showing a configuration of a voice interaction device according to a sixth embodiment.

【図７】第１実施形態による音声対話処理を説明するフ
ローチャートである。FIG. 7 is a flowchart illustrating a voice interaction process according to the first embodiment.

【図８】第４実施形態による音声対話処理を説明するフ
ローチャートである。FIG. 8 is a flowchart illustrating a voice interaction process according to a fourth embodiment.

【図９】ユーザとシステムの話調の対応テーブルの例を
示す図である。FIG. 9 is a diagram showing an example of a correspondence table of user and system tone.

【図１０】ユーザとシステムの声質の対応テーブルの例
を示す図である。FIG. 10 is a diagram showing an example of a correspondence table of voice qualities of a user and a system.

フロントページの続き (51)Int.Cl.⁷ 識別記号 ＦＩ テーマコード(参考) Ｇ１０Ｌ 15/14 Ｇ１０Ｌ 3/00 Ｒ
Continuation of front page: (51) Int.Cl.⁷, identification symbol, FI, theme code (reference): G10L 15/14; G10L 3/00 R

Claims

[Claims]

1. A voice interaction device that recognizes an input voice and outputs a response sentence corresponding to the recognition result as a synthesized voice, comprising: recognizing means for recognizing a tone and content of the input voice; determining means for determining a tone of the synthesized voice based on the tone recognized by the recognizing means; and generating means for generating, in the tone determined by the determining means, a synthesized voice of a response sentence to the content of the input voice.

2. The voice interaction device according to claim 1, further comprising a voice waveform dictionary corresponding to a plurality of types of tones, wherein the generating means generates the synthesized voice using the voice waveform dictionary corresponding to the tone determined by the determining means.

3. The voice interaction device according to claim 1, wherein the determining means sets the tone of the synthesized voice to a tone similar to the tone recognized by the recognizing means.

4. The voice interaction device according to claim 1, further comprising a tone table in which a tone for the synthesized voice is registered in association with a tone of the input voice, wherein the determining means acquires, from the tone table, the tone for the synthesized voice corresponding to the tone recognized by the recognizing means.

5. The voice interaction device according to claim 4, wherein a plurality of types of tone tables are provided, the device further comprising selecting means for selecting, from the plurality of types of tone tables, a desired table to be used by the determining means.

6. The voice interaction device according to claim 4, further comprising editing means for editing the tone table.

7. The voice interaction device according to claim 1, wherein the recognizing means recognizes the input voice using recognition models corresponding to a plurality of types of tones, and determines the tone corresponding to the best-matching recognition model as the tone of the input voice.

8. A voice interaction device that recognizes an input voice and outputs a response sentence corresponding to the recognition result as a synthesized voice, comprising: recognizing means for recognizing a voice quality and content of the input voice; determining means for determining a voice quality of the synthesized voice based on the voice quality recognized by the recognizing means; and generating means for generating, in the voice quality determined by the determining means, a synthesized voice of a response sentence to the content of the input voice.

9. The voice interaction device according to claim 8, further comprising a voice waveform dictionary corresponding to a plurality of types of voice qualities, wherein the generating means generates the synthesized voice using the voice waveform dictionary corresponding to the voice quality determined by the determining means.

10. The voice interaction device according to claim 8, wherein the determining means sets the voice quality of the synthesized voice to a voice quality similar to the voice quality recognized by the recognizing means.

11. The voice interaction device according to claim 8, further comprising a voice quality table in which a voice quality for the synthesized voice is registered in association with a voice quality of the input voice, wherein the determining means acquires, from the voice quality table, the voice quality for the synthesized voice corresponding to the voice quality recognized by the recognizing means.

12. The voice interaction device according to claim 11, wherein a plurality of types of voice quality tables are provided, the device further comprising selecting means for selecting, from the plurality of types of voice quality tables, a desired table to be used by the determining means.

13. The voice interaction device according to claim 11, further comprising editing means for editing the voice quality table.

14. The voice interaction device according to claim 8, wherein the recognizing means recognizes the input voice using recognition models corresponding to a plurality of types of voice qualities, and determines the voice quality corresponding to the best-matching recognition model as the voice quality of the input voice.

15. A control method for a voice interaction device that recognizes an input voice and outputs a response sentence corresponding to the recognition result as a synthesized voice, comprising: a recognizing step of recognizing a tone and content of the input voice; a determining step of determining a tone of the synthesized voice based on the tone recognized in the recognizing step; and a generating step of generating, in the tone determined in the determining step, a synthesized voice of a response sentence to the content of the input voice.

16. A control method for a voice interaction device that recognizes an input voice and outputs a response sentence corresponding to the recognition result as a synthesized voice, comprising: a recognizing step of recognizing a voice quality and content of the input voice; a determining step of determining a voice quality of the synthesized voice based on the voice quality recognized in the recognizing step; and a generating step of generating, in the voice quality determined in the determining step, a synthesized voice of a response sentence to the content of the input voice.

17. A control program for causing a computer to execute the control method according to claim 15 or 16.

18. A storage medium storing a control program for causing a computer to execute the control method according to claim 15 or 16.