JPH1185178A

JPH1185178A - Synthesized speech response method and its device and storage medium storing synthesized speech response program

Info

Publication number: JPH1185178A
Application number: JP9249019A
Authority: JP
Inventors: Tasuku Shinozaki; 翼篠崎; Masanobu Abe; 匡伸阿部
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1997-09-12
Filing date: 1997-09-12
Publication date: 1999-03-30

Abstract

PROBLEM TO BE SOLVED: To eliminate monotonous impressions, such as flat and mechanical tones, by returning reaction voices selected from a plurality of reaction voice sets in accordance with a scenario, or arbitrarily designated reaction voices. SOLUTION: A synthesized speech response sentence generation device 5 generates the output speech of the system based on the output of a scenario interpretation device 4. A reaction voice selection device 6 acts on the part of that output that determines the reaction voice set, and selects from reaction voice sets 9 to 12 either the designated reaction voice or a reaction voice at random. A synthesized speech generation device 7 generates the synthesized speech to be output by an output device 8, based on instructions, questions, response sentences, and the like from the scenario interpretation device 4 or on the reaction voice selected by the reaction voice selection device 6. The output device 8 outputs synthesized speech, sounds, images, and the like, and uses them to output and display the state of the system, instructions, questions, and response speech.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a synthesized speech response method and device and to a storage medium storing a synthesized speech response program. In particular, it relates to such a method, device, and storage medium that make a system user-friendly and improve the efficiency of the human interface by returning reaction voices in response to human motions and/or operations.

[0002]

[Prior art] With advances in technology, synthesized speech has come to be used in telephone answering devices, CD devices, ATMs, various vending machines, calculators, talking toys, computers, and the like. The synthesized speech used in these systems is either speech produced by concatenating pre-recorded speech (recording-and-editing speech synthesis) or speech produced by converting Japanese text according to rules (rule-based speech synthesis). In the prior art, this synthesized speech is uniquely determined by the input of the person (the "user") who uses the device (the "system").

[0003]

[Problem to be solved by the invention] However, once a user has used the system several times, the synthesized speech described above becomes predictable ("doing X makes the system say Y"), and the user comes to perceive the system as flat, mechanical, and monotonous. Conversely, if the system simply outputs its synthesized speech responses in random order to address this problem, it is difficult for the system to have a consistent character or to achieve the purpose it was built for.

[0004] For this reason, with conventional synthesized speech response methods and devices the user perceives the system as flat, mechanical, and monotonous, so it is difficult to give the user a sense of user-friendliness, such as finding the system fun to use, feeling inclined to use it, or feeling familiar with it. The present invention has been made in view of the above. Its object is to provide a synthesized speech response method and device, and a storage medium storing a synthesized speech response program, that can respond to human motions and operations with reaction voices having a specific, consistent characteristic, drawn from reaction voice sets having arbitrary characteristics.

[0005] A further object is to provide a synthesized speech response method and device, and a storage medium storing a synthesized speech response program, that eliminate the flat, mechanical, and monotonous impression by returning a variety of reaction voices, that enable responses with a consistent character based on a scenario, and that thereby make the system user-friendly (fun to use, inviting to use, familiar, and so on) and improve the efficiency of the human interface.

[0006]

[Means for solving the problem] FIG. 1 is a diagram for explaining the principle of the present invention. The present invention is a synthesized speech response method that responds with synthesized speech based on the result of recognizing a motion or operation. The method recognizes the motion or operation (step 1); interprets a scenario containing a description of the operation of the device and of the response method (step 2); in accordance with the scenario, either selects a reaction voice at random from among a plurality of reaction voice sets or designates a reaction voice arbitrarily (step 3); and returns the reaction voice (step 4).
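Steps 1 to 4 of paragraph [0006] can be sketched as a short chain of functions. This is only an illustrative sketch: the function names, the `Decision` record, the dictionary-based scenario, and the use of plain strings to stand in for reaction voices are all assumptions made for the example and do not come from the patent.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Decision:
    """What the scenario prescribes for one recognized event (hypothetical shape)."""
    set_id: int                 # which reaction voice set to use
    voice_index: Optional[int]  # designated voice, or None for a random pick


def recognize(raw_input: str) -> str:
    """Step 1: recognize the operation (here, simply normalize a key press)."""
    return raw_input.strip().lower()


def interpret(scenario: Dict[str, Tuple[int, Optional[int]]], event: str) -> Decision:
    """Step 2: look up what the scenario says to do for this event."""
    set_id, voice_index = scenario[event]
    return Decision(set_id, voice_index)


def select(sets: Dict[int, List[str]], decision: Decision, rng: random.Random) -> str:
    """Step 3: use the designated voice if there is one, otherwise pick at random."""
    voices = sets[decision.set_id]
    if decision.voice_index is not None:
        return voices[decision.voice_index]
    return rng.choice(voices)


def respond(raw_input, scenario, sets, rng):
    """Step 4: return the reaction voice (modeled here as returning its text)."""
    return select(sets, interpret(scenario, recognize(raw_input)), rng)
```

With `scenario = {"a": (1, None), "b": (2, 0)}`, the event `"a"` yields a random voice from set 1, while `"b"` always yields the first voice of set 2.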

[0007] The reaction voice set consists of a plurality of reaction voice subsets, each of which is a collection of reaction voices that gives the user varied reaction voices sharing a specific characteristic; the characteristic differs from subset to subset. The reaction voice set is described as Japanese text, as speech information expressed by phoneme symbols and prosodic parameters including fundamental frequency, duration, and power, or as a speech waveform.
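The three description forms named in paragraph [0007] (text, phoneme symbols with prosodic parameters, and waveform) could be modeled as three record types. The sketch below is an assumption-laden illustration: the class and field names, the units, and the one-value-per-phoneme layout of the prosodic parameters are invented for the example and are not specified by the patent.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextVoice:
    """Reaction voice stored as Japanese text, to be synthesized by rule."""
    text: str


@dataclass
class ProsodicVoice:
    """Reaction voice stored as phoneme symbols plus prosodic parameters."""
    phonemes: List[str]
    f0_hz: List[float]        # fundamental frequency per phoneme (assumed layout)
    duration_ms: List[float]  # duration per phoneme (assumed layout)
    power_db: List[float]     # power per phoneme (assumed layout)

    def __post_init__(self):
        n = len(self.phonemes)
        if not (len(self.f0_hz) == len(self.duration_ms) == len(self.power_db) == n):
            raise ValueError("expected one F0, duration, and power value per phoneme")


@dataclass
class WaveformVoice:
    """Reaction voice stored as a ready-to-play waveform."""
    samples: List[int]
    sample_rate_hz: int


ReactionVoice = Union[TextVoice, ProsodicVoice, WaveformVoice]


@dataclass
class ReactionVoiceSet:
    """A collection of reaction voices sharing one impression."""
    impression: str
    voices: List[ReactionVoice]


def needs_synthesis(voice: ReactionVoice) -> bool:
    """Text and prosodic descriptions must be synthesized; waveforms can be played as-is."""
    return not isinstance(voice, WaveformVoice)
```

Keeping the three forms as separate types makes it explicit which entries must pass through a synthesizer before output and which can be played directly.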

[0008] FIG. 2 is a diagram showing the principle configuration of the present invention. The present invention is a synthesized speech response device having recognition means for recognizing a motion or operation and response means for responding with synthesized speech based on the recognition result of the recognition means. The device comprises a scenario 3 containing a description of the operation of the device and of the response method, scenario interpreting means 4 for interpreting the scenario, and synthesized speech response sentence generating means 5 for generating synthesized speech based on the recognition result and on the interpretation result of the scenario interpreting means 4.

[0009] The synthesized speech response sentence generating means 5 uses a recording-and-editing synthesis device or a device that performs rule-based speech synthesis. The synthesized speech response sentence generating means 5 also includes voice selection means 51 for selecting a reaction voice at random from among a plurality of reaction voice sets 61, each of which is a collection of reaction voices that gives the user varied reaction voices having a specific characteristic.

[0010] The reaction voice set 61 includes a plurality of reaction voice subsets with different characteristics, each of which is a collection of reaction voices that gives the user varied reaction voices having a specific characteristic. The reaction voice set 61 holds Japanese text, speech information expressed by phoneme symbols and prosodic parameters including fundamental frequency, duration, and power, or speech waveforms.

[0011] The present invention is also a storage medium storing a synthesized speech response program having a recognition process for recognizing a motion or operation and a response process for responding with synthesized speech based on the recognition result of the recognition process. The program has a scenario interpretation process for interpreting a scenario containing a description of the operation of the device and of the response method, and a synthesized speech response sentence generation process for generating synthesized speech based on the recognition result of the recognition process and the interpretation result of the scenario interpretation process.

[0012] The synthesized speech response sentence generation process includes a voice selection process for selecting a reaction voice at random from among a plurality of reaction voice sets, each of which is a collection of reaction voices that gives the user varied reaction voices having a specific characteristic. With the above, when a motion or operation is recognized and a response is made with synthesized speech based on the recognition result, the scenario is interpreted and, in accordance with that scenario, a reaction voice selected at random from among a plurality of reaction voice sets and an arbitrarily designated reaction voice can be returned.

[0013] The reaction voice set can be described as Japanese text, as speech information expressed by phoneme symbols and prosodic parameters such as fundamental frequency (F0), duration, and power, or as a speech waveform. Furthermore, each reaction voice set can give the user a specific, consistent characteristic, which in turn gives the system a specific, consistent, intended character.

[0014]

[Embodiments of the invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 3 shows the configuration of a voice response device according to one embodiment of the present invention. The configuration shown in the figure consists of an input device 1, a sensor 2, a scenario 3, a scenario interpretation device 4, a synthesized speech response sentence generation device 5, an output device 8, reaction voice sets 9 to 12, and a response image and response sound generation device 13.

[0015] The input device 1 is any of a keyboard, a pointing device such as a mouse, touch panel, or tablet, a joystick, a joypad, or another input device. The sensor 2 is a microphone sensor, magnetic sensor, optical sensor, pressure sensor, capacitance sensor, ultrasonic sensor, image recognition device, speech recognition device, moving-image recognition device, PB signal recognition device, or other device that detects a physical quantity.

[0016] The input device 1 and the sensor 2 provide the data needed to recognize human motions and human operations. The scenario 3 describes the operation of the synthesized speech response device in terms of the data detected by the input device 1 and the sensor 2. The scenario interpretation device 4 recognizes human motions and operations based on the data detected by the input device 1 and the sensor 2 and, based on the scenario 3, determines the instructions, questions, response sentences, and the like to present and determines the reaction voice set to use.

[0017] The synthesized speech response sentence generation device 5 consists of a reaction voice selection device 6 and a synthesized speech generation device 7. It generates the output speech of the system based on the output of the scenario interpretation device 4. Of the outputs of the scenario interpretation device 4, the reaction voice selection device 6 acts on the determination of the reaction voice set: from the reaction voice sets 9 to 12 it selects either the designated reaction voice or a reaction voice at random.

[0018] The synthesized speech generation device 7 generates the synthesized speech to be output by the output device 8, based either on the instructions, questions, response sentences, and the like from the scenario interpretation device 4 or on the reaction voice selected by the reaction voice selection device 6. The output device 8 outputs synthesized speech, sounds, images, and the like, and uses them to output and display the state of the system, instructions, questions, and response speech.

[0019] The reaction voice sets 9 to 12 are each organized around a specific impression that their reaction voices give the user, such as "indifferent", "upbeat", "straightforward", or "contrary". For example, the "indifferent" set contains reaction voices of two kinds. The first kind consists of short utterances such as "hee", "hoo", "oo", and "ee" whose prosody has a low F0 and an F0 contour that is flat or falls slightly. The second kind consists of utterances such as "a, sou" ("oh, is that so"), "ma, dou demo ii kedo" ("well, whatever"), and "fuun" ("hmm") whose wording itself conveys curtness or indifference.

[0020] As another example, the "straightforward" impression is given by reaction voices whose prosody, in order to sound positive and bright, has a medium average F0 and an average speech rate slightly slower than that of non-reaction speech, and by reaction voices with affirmative meaning and prosody such as "hai" ("yes"), "sou desu yo ne" ("that's right, isn't it"), and "naruhodo" ("I see"). The "contrary" impression is given by reaction voices whose prosody includes a portion where F0 rises sharply and steeply and whose average speech rate is somewhat slow, and by reaction voices with a skeptical, half-negative meaning and prosody such as "uso deshoo" ("you must be joking"), "hontoo" ("really?"), "sonnaa" ("no way"), "majii" ("seriously?"), and "soo ka naa" ("I wonder"). Four reaction voice sets are used here, but the voice response device does not need exactly four; one or more is sufficient.
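The prosodic descriptions in paragraphs [0019] and [0020] are qualitative ("low F0", "flat or slightly falling", "rises sharply"). One way to make such a description checkable is to compute the mean and the overall slope of an F0 contour. The sketch below does this for the "indifferent" profile only. The patent gives no numeric values, so every threshold, the 10 ms frame spacing, and the function names are assumptions chosen purely for illustration.

```python
from statistics import mean
from typing import Sequence


def f0_slope(f0_hz: Sequence[float], frame_s: float = 0.01) -> float:
    """Least-squares slope of an F0 contour in Hz per second.

    frame_s is the assumed spacing between F0 samples.
    """
    n = len(f0_hz)
    if n < 2:
        raise ValueError("need at least two F0 samples")
    times = [i * frame_s for i in range(n)]
    t_bar, f_bar = mean(times), mean(f0_hz)
    num = sum((t - t_bar) * (f - f_bar) for t, f in zip(times, f0_hz))
    den = sum((t - t_bar) ** 2 for t in times)
    return num / den


def looks_indifferent(
    f0_hz: Sequence[float],
    low_f0_hz: float = 120.0,   # assumed ceiling for a "low" mean F0
    max_fall: float = 30.0,     # assumed limit for "slightly falling", Hz/s
    flat_tol: float = 5.0,      # assumed tolerance for "flat", Hz/s
) -> bool:
    """True if the contour is low and either flat or only slightly falling."""
    slope = f0_slope(f0_hz)
    return mean(f0_hz) < low_f0_hz and -max_fall <= slope <= flat_tol
```

A check of this kind could be used when registering voices in a set, to confirm that a candidate's prosody matches the impression the set is meant to give.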

[0021] The response image and response sound generation device 13 generates the output images and output sounds of the system based on the output of the scenario interpretation device 4. The response method using synthesized speech is now described concretely, taking fortune-telling with a multiple-choice chart as an example. First, the multiple-choice chart itself: the user answers questions that offer two or more options, such as "Which of the following four colors do you like best? 1, red. 2, white. 3, blue. 4, green." Questions and answers are repeated, and the answers chosen eventually lead to a fortune-telling result. In this example, the input device 1 is a mouse or keyboard used to select an option. The scenario 3 describes the question sentences and options of the chart, the relationship between each chosen option and the next question, the fortune-telling results, and which reaction voice set to use.

[0022] FIG. 4 shows an example of a scenario description according to one embodiment of the present invention. In the figure, part "S1" is a table that maps the file names of the reaction voice sets used in this multiple-choice chart to the set numbers used within the scenario. Part "S2" describes the operation of the system: the flow of questions, options, and fortune-telling results in the chart.

[0023] In this chart, the first question is presented by outputting the seventh voice registered in reaction voice set 1 (part S3). If the user selects option 1 for this question, the system outputs one voice chosen at random from reaction voice set number "3" as its response and then proceeds to question number "2" as the second question (part S4).
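The scenario structure described in paragraphs [0021] to [0023] (a table from set numbers to set files, plus a flow of questions, options, and reactions) can be sketched with plain dictionaries. The actual notation of FIG. 4 is not reproduced in the text, so the layout below is hypothetical, as are the file names `questions.set` and `lively.set`, the set contents, and the second question. Only the first step (seventh voice of set 1, then a random voice from set 3, then question 2) follows the description.

```python
import random

# S1: scenario-local set number -> reaction voice set file (file names are made up)
SET_TABLE = {1: "questions.set", 3: "lively.set"}

# Contents of each set file; list index 0 holds the "1st" registered voice.
SETS = {
    "questions.set": [f"question voice {i}" for i in range(1, 8)],
    "lively.set": ["hee", "naruhodo", "un un"],
}

# S2: question number -> voice to ask with, and per option (reaction set, next question)
FLOW = {
    1: {"ask": (1, 7), "options": {1: (3, 2), 2: (3, 2)}},
    2: {"ask": (1, 1), "options": {1: (3, None), 2: (3, None)}},
}


def voice(set_no, position=None, rng=random):
    """Return the designated voice (1-based position) or a random one from the set."""
    voices = SETS[SET_TABLE[set_no]]
    return voices[position - 1] if position is not None else rng.choice(voices)


def run_chart(answers, rng=random):
    """Walk the chart with a list of chosen options and collect what is spoken."""
    spoken, question = [], 1
    for answer in answers:
        if question is None:  # chart already finished
            break
        step = FLOW[question]
        spoken.append(voice(*step["ask"]))            # designated question voice
        reaction_set, question = step["options"][answer]
        spoken.append(voice(reaction_set, rng=rng))   # random reaction voice
    return spoken
```

Calling `run_chart([1, 2])` speaks the seventh voice of set 1, a random reaction from set 3, the next question, and another random reaction.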

[0024] Repeating this process eventually leads to the fortune-telling result. Part "S5" specifies that if the user selects option "2" after hearing the result, the system outputs the audio file "End.pcm" and terminates. In this way, questions are presented to the user in synthesized speech according to the scenario description. When the user selects an answer from the options, the scenario interpretation device 4 decides, based on the selected option and the scenario description, whether to use a reaction voice set or to generate the speech written in the scenario without using one.

[0025] When a reaction voice set is to be used, the scenario interpretation device 4 also determines which reaction voice set to use and how the reaction voice is chosen (presented at random, or designated explicitly). In the synthesized speech response sentence generation device 5, when the scenario interpretation device 4 has decided to generate speech written in the scenario 3, the data is passed to the synthesized speech generation device 7; when it has decided to use a reaction voice set, the data is passed to the reaction voice selection device 6.
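The routing described in paragraph [0025] can be sketched as three small classes standing in for devices 5, 6, and 7. The class names, the `Interpretation` record and its fields, and the placeholder synthesizer that merely wraps text in a tag are all assumptions for illustration; the patent does not specify any programming interface.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Interpretation:
    """Output of the scenario interpretation device 4 (field names are hypothetical)."""
    scenario_text: Optional[str] = None  # speech written in the scenario itself
    set_id: Optional[int] = None         # reaction voice set to use, if any
    voice_index: Optional[int] = None    # designated voice; None means random


class ReactionVoiceSelector:
    """Stands in for the reaction voice selection device 6."""

    def __init__(self, sets: Dict[int, List[str]], rng: Optional[random.Random] = None):
        self.sets = sets
        self.rng = rng or random.Random()

    def select(self, set_id: int, voice_index: Optional[int] = None) -> str:
        voices = self.sets[set_id]
        if voice_index is not None:
            return voices[voice_index]
        return self.rng.choice(voices)


class SpeechSynthesizer:
    """Stands in for the synthesized speech generation device 7 (placeholder output)."""

    def synthesize(self, text: str) -> str:
        return f"<speech:{text}>"


class ResponseSentenceGenerator:
    """Stands in for device 5: routes data to device 6 or directly to device 7."""

    def __init__(self, selector: ReactionVoiceSelector, synthesizer: SpeechSynthesizer):
        self.selector = selector
        self.synthesizer = synthesizer

    def generate(self, result: Interpretation) -> str:
        if result.set_id is not None:
            # A reaction voice set is used: go through the selector first.
            text = self.selector.select(result.set_id, result.voice_index)
        elif result.scenario_text is not None:
            # Speech written in the scenario goes straight to the synthesizer.
            text = result.scenario_text
        else:
            raise ValueError("interpretation contains nothing to say")
        return self.synthesizer.synthesize(text)
```

In this arrangement every utterance, whether written in the scenario or chosen from a set, passes through the same synthesizer, which mirrors the data flow from device 6 to device 7 in the description.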

[0026] The reaction voice selection device 6 selects a reaction voice from the reaction voice set determined by the scenario interpretation device 4, either at random or as designated by the scenario, and passes the data to the synthesized speech generation device 7. In this example there are four reaction voice sets, each giving the user a different impression, for example "upbeat", "cold", "easygoing", and "contrary".

[0027] Taking the "upbeat" impression as an example of reaction voices, one kind consists of short utterances with no definite lexical meaning, such as "hee", "hoo", "oo", and "ee", whose prosody has F0 changing widely in a rising, falling, peaked (rise then fall), or dipped (fall then rise) contour. Another kind consists of words such as "naruhodo" ("I see"), "sore de sore de" ("and then?"), "deshoo" ("right?"), and "un un" ("uh-huh"), which can express an upbeat attitude as words when the prosody is adjusted appropriately, for example by making the average speech rate slightly faster or varying the F0 height widely.

[0028] When F0 varies widely in the prosodic pattern, the voice gives a "playful" impression. A number of such reaction voices are registered in each reaction voice set in advance. A reaction voice can be registered as a speech waveform, as a phoneme symbol string with prosodic parameters, as kana text with accent marks, or in a similar form. The speech generated by the synthesized speech generation device 7 is output from the output device 8.

[0029] As described above, questions are presented in synthesized speech according to the scenario, and in response to the user's answers a reaction voice is chosen at random from a reaction voice set that gives a specific, consistent impression, and is then generated and output. This gives the user a specific, consistent impression while also solving the problems of the prior art, namely that the next synthesized speech output can be predicted and that flat speech output makes the synthesized speech seem mechanical and monotonous. As a result, the system can be made fun to use, inviting to use, and familiar.

[0030] In the present invention, among the components shown in FIG. 3, the scenario interpretation device 4, the response image and response sound generation device 13, and the synthesized speech response sentence generation device 5 can be built as software and stored on a disk device connected to the computer that performs the speech synthesis, on a floppy disk, on a CD-ROM, or the like, and executed as needed, which allows general-purpose use.

[0031] The present invention is not limited to the embodiment described above and can be modified and applied in various ways within the scope of the claims.

[0032]

[Effects of the invention] As described above, the present invention recognizes a human motion, operation, or input and responds with synthesized speech output based on the recognition result and a scenario. The responses give the user a specific impression and, because varied response voices are returned, eliminate the flat, mechanical, and monotonous impression.

[0033] The invention can thus provide user-friendliness, so that users find the system fun to use, feel inclined to use it, and feel familiar with it. It is particularly effective at keeping users engaged with computer-based education systems such as CAI and with amusement equipment.

[Brief description of the drawings]

FIG. 1 is a diagram for explaining the principle of the present invention.

FIG. 2 is a diagram showing the principle configuration of the present invention.

FIG. 3 is a configuration diagram of a voice response device according to one embodiment of the present invention.

FIG. 4 shows an example of a scenario description according to one embodiment of the present invention.

[Explanation of symbols]

1 Input device
2 Sensor
3 Scenario
4 Scenario interpretation device, scenario interpreting means
5 Synthesized speech response sentence generation device, synthesized speech response sentence generating means
6 Reaction voice selection device
7 Synthesized speech generation device
8 Output device
9, 10, 11, 12 Reaction voice sets
13 Response image and response sound generation device
51 Voice selection means
61 Reaction voice set

Claims

[Claims]

1. A synthesized speech response method for recognizing a motion or operation and responding with synthesized speech based on the result of the recognition, characterized by interpreting a scenario containing a description of the operation of the device and of the response method, and returning a reaction voice selected at random from among a plurality of reaction voice sets or an arbitrarily designated reaction voice.

2. The synthesized speech response method according to claim 1, wherein the reaction voice set consists of a plurality of reaction voice subsets, each of which is a collection of reaction voices that gives the user varied reaction voices having a specific characteristic, and each subset has a different characteristic.

3. The synthesized speech response method according to claim 1 or 2, wherein the reaction voice set is described as any of Japanese text, speech information expressed by phoneme symbols and prosodic parameters including fundamental frequency, duration, and power, or a speech waveform.

4. A synthesized speech response device having recognition means for recognizing a motion or operation and response means for responding with synthesized speech based on the recognition result of the recognition means, characterized by comprising: a scenario containing a description of the operation of the device and of the response method; scenario interpreting means for interpreting the scenario; and synthesized speech response sentence generating means for generating synthesized speech based on the recognition result and the interpretation result of the scenario interpreting means.

5. The synthesized speech response device according to claim 4, wherein said synthesized speech response sentence generation means uses a recording / editing / synthesis device or a device for performing speech synthesis according to rules.

6. The synthesized speech response device according to claim 4, wherein the synthesized speech response sentence generating means includes voice selection means for selecting a reaction voice at random from among a plurality of reaction voice sets, each of which is a collection of reaction voices that gives the user varied reaction voices having a specific characteristic.

7. The reaction voice set includes a subset of a plurality of reaction voices having different characteristics, which is a collection of reaction voices providing a user with various reaction voices having specific characteristics.
Described synthetic voice response method.

8. The synthesized speech response device according to claim 7, wherein the reaction voice set includes any of Japanese text, speech information expressed by phoneme symbols and prosodic parameters including fundamental frequency, duration, and power, or a speech waveform.

9. A storage medium storing a synthesized speech response program having a recognition process for recognizing a motion or operation and a response process for responding with synthesized speech based on the recognition result of the recognition process, the program comprising: a scenario interpretation process for interpreting a scenario containing a description of the operation of the device and of the response method; and a synthesized speech response sentence generation process for generating synthesized speech based on the recognition result of the recognition process and the interpretation result of the scenario interpretation process.

10. The storage medium storing a synthesized speech response program according to claim 9, wherein the synthesized speech response sentence generation process includes a voice selection process for selecting a reaction voice at random from among a plurality of reaction voice sets, each of which is a collection of reaction voices that gives the user varied reaction voices having a specific characteristic.