JP2000259177A

JP2000259177A - Voice outputting device

Info

Publication number: JP2000259177A
Application number: JP11058117A
Authority: JP
Inventors: Yasuhisa Watanabe; 泰久渡辺
Original assignee: Omron Corp; Omron Tateisi Electronics Co
Current assignee: Omron Corp
Priority date: 1999-03-05
Filing date: 1999-03-05
Publication date: 2000-09-22
Anticipated expiration: 2019-03-05
Also published as: JP3797003B2

Abstract

PROBLEM TO BE SOLVED: To obtain a voice outputting device that produces the smooth voice output most desirable for a user by dynamically controlling the output voice according to the user's situation. SOLUTION: An input/recognition section 101 recognizes voice uttered by a user. A dialogue processing section 102 selects, from the output data registered in an output table 103, the output data that corresponds to the dialogue status at that time. A deformation section 104 deforms the word to be incorporated into the output data on the basis of the voice from the user, and a voice output section 109 outputs the result as voice. The deformation section 104 refers to a deformation contents table 108, an adding word table 105 for characters, an adding word table 106 for words, and a deformation counter 107.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a voice output device used when conducting a dialogue with a speaker, or when recognizing and outputting what the speaker has said.

[0002]

2. Description of the Related Art

A voice output device used for dialogue with a speaker recognizes the input voice and outputs the recognized voice for confirmation, or outputs voice generated on the device side as a question or an answer. In such cases, to make the dialogue efficient and to ensure reliable communication between the speaker and the device, emphasis processing such as raising the volume is applied to important parts, and when a difficult word is input, it is replaced with a plainer expression and output as voice for confirmation. For example, the device disclosed in Japanese Patent Application Laid-Open No. 10-171485 parses text information to identify important parts and applies a deformation such as emphasis to them before output. The device disclosed in Japanese Patent Application Laid-Open No. 7-121537 identifies difficult words and phrases in a text to be read aloud and replaces them with plain expressions before reading.

[0003]

However, in the devices described above:

- Because no processing is performed according to past output history, the deformation is always applied whenever the same output is repeated, and the resulting voice output can be annoying to the user.

[0004] - Because no account is taken of errors in the device's recognition of the user's utterance, it is difficult for the user to notice when the device has misrecognized.

[0005] - To replace homonyms and the like with plain expressions, explanatory expressions must be prepared for every word used.

[0006] These and similar problems have existed.

[0007] An object of the present invention is to provide a voice output device that can produce the smooth voice output most desirable for the user by dynamically controlling the output voice according to the user's situation.

[0008]

SUMMARY OF THE INVENTION

The present invention learns the user's situation, such as the user's confusion, the occurrence of an interruption, and the certainty with which the user's input voice was recognized, and dynamically controls the output voice according to the situation at each moment.

[0009] The following terms are used in this invention.

[0010] - Voice recognition: decomposing the user's input voice into phonemes and recognizing each phoneme.
- Dialogue status: each stage of a series of exchanges between the user and the device, such as "greeting", "confirmation", and "answer".

[0011] - Expression of confusion: an expression, in voice or manner, that the user shows when unable to make out the device's output.

[0012] - Deformation of a word incorporated in the output data: changing the manner in which a word incorporated in the output data, which the device outputs to the user as voice, is output.

[0013] - User's interrupting voice: a voice the user utters to interrupt in some way while the device is producing output.

[0014] - Recognition certainty of the input voice: the degree of certainty of recognition when the user's input voice is recognized.

[0015] - Additional word: a term added immediately before or after a word, character, or the like incorporated in the device's output data in order to explain it.

This invention is configured as follows.

[0016] (1) The device comprises: a voice input unit to which voice is input; a voice recognition unit that performs voice recognition on the input voice; an output data table that stores, for each dialogue status with the user, the output data to be output as voice from the device; output data selection means that determines the dialogue status from the voice recognition result and selects from the output data table the output data to be output at that time; a voice output unit that outputs the selected output data as voice; means for detecting, from the voice recognition result, a voice expressing the user's confusion at the voice output from the device; and deformation means that, when the voice expressing confusion is detected, deforms the word incorporated in the output data and causes the voice output unit to output it again (claim 1).

[0017] The voice input unit is composed of, for example, a microphone. The voice recognition unit, which performs voice recognition on the input voice, detects the phonemes of the voice input by the user. For example, when the user says "watashi" ("I"), the voice recognition unit recognizes "wa", "ta", and "shi" as the individual phonemes. Well-known methods are used for this recognition: a pattern matching method, in which feature patterns such as linear prediction coefficients or Fourier spectrum coefficients are extracted from the voice data and matched against patterns stored in a dictionary; a speech recognition method using the hidden Markov method; a syntax recognition method; and so on. For pattern matching, DP (dynamic programming) pattern matching, which absorbs differences in speaking rate between users, is generally adopted. Speech recognition can also recognize words and parts of speech by performing morphological analysis along with the recognition of each phoneme. In this invention, the voice recognition unit may or may not perform recognition of words and parts of speech.
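The DP pattern matching mentioned above can be illustrated with a minimal dynamic time warping sketch. This is only an illustration under simplifying assumptions: each frame is reduced to a single number (real systems compare vectors such as linear prediction coefficients), and the function names `dtw_distance` and `recognize` and the dictionary layout are hypothetical, not taken from the patent.

```python
def dtw_distance(a, b):
    """Cumulative DP (dynamic time warping) distance between two sequences.

    Assumption: each element is a single numeric feature per frame.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # A frame may be stretched, compressed, or matched one to one,
            # which is how differences in speaking rate are absorbed.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def recognize(features, dictionary):
    """Return the label of the stored pattern closest to the input features."""
    return min(dictionary, key=lambda label: dtw_distance(features, dictionary[label]))
```

Because the alignment may repeat frames, an utterance spoken at half speed still matches its stored pattern at zero cost in this toy setting.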

[0018] The output data table stores the data to be output as voice from the device for each dialogue status with the user. "For each dialogue status" refers to the stages within a series of exchanges: for example, first there is a greeting, then the voice uttered by the user is confirmed, and then a question from the user is recognized and answered. Because the output data is stored in the table in advance for each dialogue status, the device can be adapted to any setting by updating the contents of the table. The output data is a combination of fixed data and embedded data (words and the like). The fixed data is voice pattern data fixed in advance in the system. The embedded data is either voice pattern data that returns the voice input by the user as it is, or voice pattern data generated dynamically by the system. Normally, when the dialogue status is of the "confirmation" type, the embedded data is a voice pattern that returns the user's input voice as it is, and when the dialogue status is of the "answer" type, the embedded data is a dynamically generated voice pattern.
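The table described above might be modeled as in the following sketch, which pairs each dialogue status with fixed data and a marker saying where the embedded word comes from. The class name `OutputPattern`, the function `build_output`, the `{word}` slot syntax, and the English phrases are all assumptions made for illustration; the patent stores voice pattern data, not text templates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OutputPattern:
    template: str                   # fixed data, with a {word} slot if a word is embedded
    embedded_source: Optional[str]  # "user_input", "generated", or None


# Hypothetical contents of the output data table, keyed by dialogue status.
OUTPUT_TABLE = {
    "greeting": OutputPattern("Welcome.", None),
    "confirmation": OutputPattern("Is {word} correct?", "user_input"),
    "answer": OutputPattern("{word} is recommended.", "generated"),
}


def build_output(status, user_word=None, generated_word=None):
    """Select the pattern for a dialogue status and fill in the embedded word."""
    pattern = OUTPUT_TABLE[status]
    if pattern.embedded_source is None:
        return pattern.template
    word = user_word if pattern.embedded_source == "user_input" else generated_word
    if word is None:
        raise ValueError(f"status {status!r} needs a {pattern.embedded_source} word")
    return pattern.template.format(word=word)
```

Keeping the patterns in a table rather than in code reflects the point made in the text: adapting the device to a new setting only requires updating the table.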

[0019] The output data selection means determines the dialogue status from the voice recognition result and selects from the above table the output data to be output at that time. The output means outputs the selected output data from a speaker or the like. At this time, the embedded data in the output data is generated according to the dialogue status.

[0020] When a voice expressing the user's confusion, such as "Eh?", is detected in response to the voice output by the output means, the embedded data incorporated in the output data is deformed and output again. Confusion here refers to an expression the user utters as a result of not being able to make out the device's output. Besides "Eh?", expressions such as "Nani?" ("What?") are also forms of confusion. When such confusion is detected, the output is repeated with a deformation such as "higher volume" and "lower speed" than the previous time. If no voice expressing confusion is recognized, the device proceeds to the next dialogue.
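The confusion handling described above can be sketched as follows. The set of confusion utterances comes from the two examples in the text; the `Prosody` class, the function names, and the scaling factors 1.25 and 0.8 are assumptions chosen for illustration, since the text only says "higher volume" and "lower speed" than the previous time.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Prosody:
    volume: float = 1.0  # relative loudness of the embedded word
    rate: float = 1.0    # relative speaking rate of the embedded word


# Romanized forms of the two example utterances given in the text.
CONFUSION_UTTERANCES = {"eh", "nani"}


def is_confusion(recognized: str) -> bool:
    """True if the recognized utterance is one of the known confusion expressions."""
    return recognized.strip().lower().rstrip("?!") in CONFUSION_UTTERANCES


def next_prosody(previous: Prosody, recognized: str) -> Optional[Prosody]:
    """Return louder, slower settings for a re-output, or None to move on."""
    if not is_confusion(recognized):
        return None  # no confusion detected: proceed to the next dialogue
    return replace(previous, volume=previous.volume * 1.25, rate=previous.rate * 0.8)
```

Passing the previous settings in, rather than using fixed values, lets each repeated "Eh?" make the output louder and slower than the time before.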

[0021] (2) The device comprises: a voice input unit to which voice is input; a voice recognition unit that performs voice recognition on the input voice; an output data table that stores, for each dialogue status with the user, the output data to be output as voice from the device; output data selection means that determines the dialogue status from the voice recognition result and selects from the output data table the output data to be output at that time; a voice output unit that outputs the selected output data as voice; means for detecting, from the voice recognition result, the user's interrupting voice in response to the voice output from the device; and deformation means that, when the interrupting voice is detected, deforms the word incorporated in the output data and causes the voice output unit to output it again (claim 2).

[0022] In this invention, when an interrupt request from the user is made by voice, the data (words and the like) incorporated in the output data is deformed and output again. Interrupting voices include, for example, "mou ichido" ("once more") and "motto kuwashiku" ("in more detail").

[0023] (3) The device comprises: a voice input unit to which voice is input; a voice recognition unit that performs voice recognition on the input voice; an output data table that stores, for each dialogue status with the user, the output data to be output as voice from the device; output data selection means that determines the dialogue status from the voice recognition result and selects from the output data table the output data to be output at that time; a voice output unit that outputs the selected output data as voice; imaging means that captures an image of the user; means for detecting, from the image of the user captured by the imaging means, the user's state of confusion at the voice output from the device; and deformation means that, when the state of confusion is detected from the image of the user, deforms the word incorporated in the output data and causes the voice output unit to output it again (claim 3).

[0024] In this invention, the user's state of confusion is detected from an image of the user captured by the imaging means. For example, when the user spreads both hands in a pose indicating that the meaning of the device's voice output has not been understood, the device judges that the user is confused. The image processing in this case focuses on the image regions corresponding to the user's hands and judges that the user has entered a state of confusion when it detects that both hands have been spread apart to the left and right.

[0025] (4) The device comprises: a voice input unit to which voice is input; a voice recognition unit that performs voice recognition on the input voice; an output data table that stores, for each dialogue status with the user, the output data to be output as voice from the device; output data selection means that determines the dialogue status from the voice recognition result and selects from the output data table the output data to be output at that time; a voice output unit that outputs the selected output data as voice; and deformation means that detects the recognition certainty of the input voice and, when that certainty is at or below a threshold, deforms the word incorporated in the output data and causes the voice output unit to output it again (claim 4).

[0026] In this invention, when the user's input voice could not be recognized reliably because of ambient noise or the like, the word incorporated in the output data is deformed. This invention applies when confirming the content of the voice uttered by the user. For example, when the user says "Osaka no meisho wa" ("What are the sights of Osaka?"), the device outputs the confirmation output data "Osaka de yoroshii desu ka" ("Is Osaka correct?"). If the recognition certainty of the user's input voice is at or below the threshold at this time, "Osaka", the word incorporated in the output data, is output slowly and clearly, or alternatively at a higher volume. The user can thus tell without mistake what the device has recognized, and replies in the affirmative, such as "hai" ("yes"), if it is correct, or in the negative, such as "iie" ("no"), if it is wrong. On recognizing the affirmative reply the device proceeds to the next dialogue; on recognizing the negative reply it performs a deformed output operation, such as repeating the voice output at an even lower speed or at an even higher volume.
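The threshold rule and the handling of the user's reply can be sketched as below. The threshold value 0.6, the style labels, the function names, and the "wait" outcome for any other reply are assumptions made for illustration; the text specifies only that deformation applies at or below a threshold and that "hai" advances the dialogue while "iie" triggers a further deformed repeat.

```python
CERTAINTY_THRESHOLD = 0.6  # assumed value; the text gives no number


def confirmation_segments(word, certainty, threshold=CERTAINTY_THRESHOLD):
    """Split the confirmation output into (text, style) segments.

    Only the embedded word is deformed; the fixed part stays normal.
    """
    style = "slow_clear" if certainty <= threshold else "normal"
    return [(word, style), ("de yoroshii desu ka", "normal")]


def after_reply(reply, deformation_level):
    """Decide the next action from the user's reply to the confirmation."""
    if reply == "hai":
        return "next_dialogue", deformation_level
    if reply == "iie":
        # Repeat with an even slower or louder output than before.
        return "repeat", deformation_level + 1
    return "wait", deformation_level  # assumption: other replies change nothing
```

For instance, `confirmation_segments("Osaka", 0.4)` marks only "Osaka" as `slow_clear`, which mirrors the example in the text where the fixed phrase is spoken normally.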

[0027] (5) The device includes a word-specific additional word table that stores additional words for explaining the words incorporated in the output data, and the deformation means performs the deformation by outputting voice in which the additional word is added to the word (claim 5).

[0028] One way to deform the voice incorporated in the output data is to add an additional word so that the content of the output voice becomes easier to understand. An additional word is a term that explains a word incorporated in the output data. For example, when the word in question is "hashi", "burijji" ("bridge") is added to it as an additional word, so that the output becomes "burijji no hashi" ("hashi as in bridge"). This deformation by additional word is performed, for example, when the user has made a voice input expressing confusion, such as "Eh?".

[0029] (6) The device includes an additional word table that stores character-specific additional words for explaining each character of the words incorporated in the output data, and the deformation means performs the deformation by outputting voice in which additional words are added to the characters making up the word (claim 6).

[0030] Besides additional words for a word as in (5) above, there are also additional words for each character that makes up the word. For example, when "Toukyou" (Tokyo) is to be output as voice, "higashi no tou" (the "tou" that is also read "higashi", east) is added as an additional word, so that the whole output becomes "higashi no tou no Toukyou". This deformation is performed, for example, when the user still makes a voice input such as "Eh?" even after the word has been output with an additional word as in (5) above. In other words, the deformation of (6) is performed when adding an additional word for each character of the word is expected to be easier for the user to understand than adding an additional word to the word as a whole.
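The two additional word tables of (5) and (6) and the escalation between them can be sketched as follows, using the "hashi" and "Toukyou" examples from the text. The table layout, the function names, the use of a confusion count to choose the level, and the assumption that the word arrives already split into characters are all illustrative choices, not details given in the patent.

```python
# Word-specific additional word table, as in (5).
WORD_ADDITIONAL = {"hashi": "burijji"}

# Character-specific additional word table, as in (6).
CHAR_ADDITIONAL = {"tou": "higashi no tou"}


def add_word_level(word):
    """Prefix the word with its additional word, if one is registered."""
    extra = WORD_ADDITIONAL.get(word)
    return f"{extra} no {word}" if extra else word


def add_char_level(word, chars):
    """Prefix the word with the additional words of its characters."""
    extras = [CHAR_ADDITIONAL[c] for c in chars if c in CHAR_ADDITIONAL]
    return " no ".join(extras + [word]) if extras else word


def deform(word, chars, confusion_count):
    """Escalate from plain output to word-level, then character-level, explanation."""
    if confusion_count <= 0:
        return word
    if confusion_count == 1:
        return add_word_level(word)
    return add_char_level(word, chars)
```

With these tables, one "Eh?" turns "hashi" into "burijji no hashi", and a second "Eh?" after "Toukyou" yields "higashi no tou no Toukyou".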

[0031] (7) The device comprises: a voice input unit to which voice is input; a voice recognition unit that performs voice recognition on the input voice; means for generating the voice recognition result as output data so that the device repeats the input voice back; a voice output unit that outputs the output data as voice; a word-specific additional word table that stores additional words for explaining the words incorporated in the output data; and deformation means that deforms the incorporated word into the word with its additional word added and causes the output means to output it again (claim 7).

[0032] This invention is applied to a voice input system that recognizes the full text of what the user says. For example, the device is used when a user reads aloud a manuscript written on paper to turn it into electronic data. The user reads out a certain amount of the manuscript, and the device recognizes the voice and outputs the same content as voice. If a homonym is present, the device adds an additional word to the word it outputs. If the written form of the word implied by the additional word, that is, the kanji, is wrong, the user inputs a correction by voice. By repeating this, electronic data can be created by voice input alone.

[0033] (8) The device further comprises: means for converting the word in question into the correction word when, after the output data incorporating the deformed word has been output, the user inputs by voice a word with an additional word as a correction word; and table update means that, when the additional word input by voice by the user is not registered in the word-specific additional word table, outputs homonymous words and their additional words by voice one after another as candidates and, when the user gives an input indicating that a candidate is the correct additional word, adds the word in question to the word-specific additional word table as an additional word of the candidate at that time (claim 8).

[0034] In this invention, in the device of (7) above, when the user inputs by voice a word with an additional word as a correction word, the word in question is converted into the input correction word and output again. If the additional word input by the user is not registered, homonymous words and their additional words are output by voice one after another as candidates. When the correct additional word is output, the user inputs a voice indicating that it is correct (for example, "hai"), and at that point the candidate's additional word is newly stored in the table. Through this series of operations the word-specific additional word table learns automatically, so the voice output device becomes more capable the more it is used.
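One possible reading of this learning step is sketched below: the user's spoken additional word is looked up, and if it is unknown, the homonym candidates are offered in turn until the user confirms one, at which point the new additional word is registered for that candidate. The table layout (reading, then written form, then list of additional words), the function name `resolve_correction`, the `confirm` callback standing in for the user's "hai", and the sample entries other than "burijji" are all hypothetical.

```python
def resolve_correction(table, reading, spoken_additional, confirm):
    """Return the written form the user means, learning new additional words.

    table: {reading: {written_form: [additional words]}} (assumed layout)
    confirm: callable taking the candidate prompt and returning True for "hai"
    """
    candidates = table[reading]
    # Registered additional word: convert to the matching written form directly.
    for written, additions in candidates.items():
        if spoken_additional in additions:
            return written
    # Unregistered: offer each homonym with an additional word it already has.
    for written, additions in candidates.items():
        if not additions:
            continue
        if confirm(f"{additions[0]} no {reading}"):
            # Learn the user's wording so it is recognized next time.
            additions.append(spoken_additional)
            return written
    return None  # no candidate was confirmed
```

After one confirmed exchange the table contains the user's own wording, so the same correction is resolved without any candidate prompts the next time it is spoken.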

[0035] (9) The device includes notification means that, when the word incorporated in the output data is deformed by the deformation means and output by the voice output unit, gives a notification by non-voice means such as a lamp, a buzzer, or a vibrator while that word is being output (claim 9).

[0036] In this invention, when the word incorporated in the output data is deformed and output again, a notification is given by non-voice means such as a lamp, a buzzer, or a vibrator while that word is being output. The user can therefore easily be made aware of points that particularly need emphasis, such as important information or cases where recognition of the user's voice was uncertain.

[0037]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 is an illustration of a scene in which a voice output device according to an embodiment of this invention is used. This voice output device is an information terminal installed on a street, in a government office, in a hotel, or the like. Instead of such an information terminal, the voice output device can also be applied to a voice dialogue system that uses a telephone or the like.

[0038] In the figure, when a user approaches the voice output device, the device says "Welcome. This is the tourist information center." Because this is unimportant information, it is output quickly and lightly. On hearing this guidance, the user asks, for example, "What places do you recommend in Nagaokakyo City?" After recognizing the user's utterance, the voice output device determines the dialogue status, selects from the table the voice output pattern (output data) to be output at that time, and outputs it as voice. If the user's utterance of "Nagaokakyo-shi" could not be recognized well enough, the device outputs "Nagaokakyo-shi" slowly and clearly and then outputs the following "de yoroshii desu ka" ("is that correct?") quickly and lightly. That is, when the recognition certainty is at or below a certain threshold, the device treats its recognition as unreliable and deforms the important information (as described later, this is the word incorporated in the overall output data) so that the user is sure to confirm it.

[0039] In response to the device's output of "Nagaokakyo-shi" (slow, clear output) "de yoroshii desu ka" (quick, light output), the user says "hai" ("yes") as an affirmative voice. When this is recognized, the device outputs "Koumyouji" (slowly and clearly, as important information) followed by "ga osusume desu" ("is recommended"; quickly and lightly, as unimportant information). If the user does not understand "Koumyouji", the user utters "Eh?" as a voice expressing confusion. On recognizing this "Eh?", the device assumes that "Koumyouji" was not understood, treats it as a hard-to-read word, adds additional words, and outputs it again. Additional words include those that explain the word itself and those that explain each character making up the word. In the example of FIG. 1, the characters making up the word "Koumyouji" are "kou", "myou", and "ji", and each of these characters is explained with an additional word. That is, the device outputs "hikaru akari no tera no Koumyouji" (slowly and clearly; "hikaru" means to shine, "akari" light, and "tera" temple) "ga osusume desu" (quickly and lightly).

[0040] In the above example, "Koumyouji" is treated as a hard-to-read word and is output again with an additional word for each character. When a homonym rather than a hard-to-read word is to be output, it is likewise output again with additional words for the word or for each of its characters. In this way the user and the device carry on a dialogue: the device selects and outputs voice data stored in the table in advance for each status, and deforms the word incorporated in that voice data according to the situation at the time. Words incorporated in the overall output data are either returned as input by the user or generated dynamically by the device itself. In the example of FIG. 1, "Nagaokakyo-shi" is a word returned from the user's input, and "Koumyouji" is a word generated dynamically by the device.

[0041] As described later, whether the word incorporated in the overall output data is a word returned from the user's input or a word generated dynamically by the device is stored in advance for each dialogue status. For example, when the dialogue status is the initial "greeting" type, the output data is fixed and no word is incorporated in it. In the above example, "Welcome. This is the tourist information center." is the output data for the "greeting" type. When the dialogue status is the "confirmation" type, the word incorporated in the overall output data is a word returned from the user's input. In the example of FIG. 1, "Nagaokakyo-shi" in "Nagaokakyo-shi de yoroshii desu ka" is the incorporated word. When the dialogue status is the "simple answer" type, the incorporated word is generated dynamically by the device. In the example of FIG. 1, "Koumyouji" in the output data "Koumyouji ga osusume desu." is the incorporated word and is the part generated dynamically by the device.

[0042] FIG. 2 shows the system configuration of the voice output device. An input device 1 consisting of a microphone is used as the voice input unit, and an output device 2 consisting of a speaker is used as the output means for outputting the output data, that is, the voice output patterns. The input device 1 and the output device 2 exchange data with an operating system 3 that controls the dialogue system, and a dialogue processing application program 4, which performs dialogue processing based on a voice output program, is implemented on the operating system 3. An external recording medium 5 supplies the dialogue processing application 4, or data, to the system. If a high-speed recording medium such as a hard disk is adopted, it can also serve as the system's resident area or user area.

[0043] FIG. 3 is a schematic block diagram of the dialogue processing application 4. Here, the dialogue processing application 4 is taken to include the data storage areas as well as the program.

[0044] In FIG. 3, the input/recognition unit 101 processes and analyzes the voice data input from the input device 1 and performs speech recognition. As mentioned earlier, well-known speech recognition methods such as DP pattern matching and hidden Markov models are used. For example, when "watashi" is input, it is recognized phoneme by phoneme as "wa", "ta", "shi". Syntactic analysis is then performed on the recognized phonemes to recognize the meaning. So-called morphological analysis is used for this syntactic analysis. For example, when the phonemes "wa", "ta", "shi" are recognized, they are recognized as the word "watashi" ("I").

[0045] The dialogue processing unit 102 receives the result from the input/recognition unit 101 and determines what to reply to the user. It does so by referring to the output data table 103. The output data table 103 stores the device's output data for each dialogue situation with the user. FIG. 4 shows a configuration example of this output data table. The table stores output data (voice output patterns) and types. The type indicates the dialogue situation with the user; output data is stored for each dialogue situation, that is, for each type. Each output data item contains <xxx> and <yyy>, each followed by a transformation attribute shown in parentheses. <xxx> and <yyy> denote the words to be transformed. <xxx> is the part where the user's input is returned as is, and <yyy> is the part where a result dynamically generated by the system is inserted. In the example shown in FIG. 1, "Nagaokakyo-shi" corresponds to <xxx> and "Komyoji" corresponds to <yyy>. The <xxx> and <yyy> parts are incorporated into the overall output data. The transformation attributes shown in parentheses are described later.
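The output data table described above can be pictured with a minimal Python sketch. The rows are modeled only on the examples quoted in the text; the actual contents of FIG. 4 are not reproduced here, and the names `OutputPattern`, `OUTPUT_DATA_TABLE`, `find_slot` and `fill` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputPattern:
    dialogue_type: str  # dialogue situation (the "type" column)
    template: str       # output data, optionally with a <xxx>/<yyy> slot and (attribute)


# Hypothetical rows based on the examples in the text, not the real FIG. 4.
OUTPUT_DATA_TABLE = [
    OutputPattern("greeting", "Welcome. This is the tourist information center."),
    OutputPattern("confirmation", "Is <xxx>(transformation 1) correct?"),
    OutputPattern("simple answer", "<yyy>(transformation 2) is recommended."),
]

# A slot is <xxx> or <yyy> immediately followed by its attribute in parentheses.
SLOT = re.compile(r"<(xxx|yyy)>\(([^)]*)\)")


def find_slot(template):
    """Return (slot kind, transformation attribute), or None for fixed output."""
    m = SLOT.search(template)
    return (m.group(1), m.group(2)) if m else None


def fill(template, word):
    """Replace the slot and its attribute with the incorporated word."""
    return SLOT.sub(lambda m: word, template, count=1)
```

For instance, `fill(OUTPUT_DATA_TABLE[1].template, "Nagaokakyo-shi")` yields the confirmation sentence, while the greeting row has no slot and is output unchanged.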

[0046] The transformation unit 104, which transforms the output content, transforms the output data selected by the dialogue processing unit 102. The transformation unit 104 refers to the character-based additional word table 105, the word-based additional word table 106, the transformation counter 107, and the transformation content table 108.

[0047] FIG. 5 shows a configuration example of the transformation content table 108. This table defines the transformation types (which correspond to the attributes) and the concrete transformation content for each. For example, if the type is transformation 0, the content is "no transformation"; that is, no transformation is performed. If the type is transformation 1, the content is "high volume" and "low speed". That is, the output volume of the target <xxx> or <yyy> is raised and the speech output rate is lowered, so that the word is spoken loudly and slowly. As shown in the figure, the transformation content includes changes in volume, changes in speed, whether the word is repeated, whether an explanatory additional word is added, and so on.
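As a rough illustration, the transformation content table can be sketched as a mapping from type to a small record. Only the entries for transformation 0 and transformation 1 follow the text; the entry for transformation 2 and all field names are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformationContent:
    volume: str = "normal"         # "normal" or "high"
    speed: str = "normal"          # "normal" or "low"
    repeat: bool = False           # repeat the word
    additional_word: bool = False  # add an explanatory additional word


TRANSFORMATION_CONTENT_TABLE = {
    "transformation 0": TransformationContent(),                          # no transformation
    "transformation 1": TransformationContent(volume="high", speed="low"),
    # Assumed content: the text does not spell out what transformation 2 contains.
    "transformation 2": TransformationContent(speed="low", additional_word=True),
}


def describe(attribute):
    """List the concrete changes that a transformation type applies."""
    c = TRANSFORMATION_CONTENT_TABLE[attribute]
    parts = []
    if c.volume != "normal":
        parts.append(c.volume + " volume")
    if c.speed != "normal":
        parts.append(c.speed + " speed")
    if c.repeat:
        parts.append("repeat word")
    if c.additional_word:
        parts.append("with additional word")
    return parts or ["no transformation"]
```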

[0048] FIG. 6 shows a configuration example of the character-based additional word table 105.

[0049] This table consists of single kanji characters and, for each, an additional word that explains the character. In this embodiment, as described later, the character-based additional word table is used when a word that is not in the word-based additional word table has to be explained, by breaking the word into characters and explaining each one.

[0050] FIG. 7 shows a configuration example of the word-based additional word table 106. This table consists of word strings and, for each, an additional word that explains the word.

[0051] FIG. 8 shows a configuration example of the transformation counter 107. The counter consists of words and, for each, a count field that stores the number of times the word has been transformed in the past. The counter value is used to decide whether to transform a given word: the larger the value, the more likely the transformation is skipped, and the smaller the value, the more likely it is performed.

[0052] Next, the general operation is described with reference to FIG. 9.

[0053] ST101: The transformation counter 107 is cleared to 0. The transformation counter 107 is consulted to decide whether to perform a transformation: if its value is small the transformation is performed, and if it is large the transformation is not performed. Clearing the counter to 0 in ST101 initializes the system so that transformations are performed.

[0054] ST102: Speech recognition is performed on the voice data input by the user, and morphological analysis is performed to obtain a word string. A recognition confidence is also obtained for each word. Techniques for obtaining a confidence measure during speech recognition are common. For example, in DP matching, the recognition confidence can be obtained numerically from the correlation value with the reference data.

[0055] ST103: It is determined whether the voice output unit is currently outputting an additional word. Input speech is recognized in ST102 while an additional word is being output when the user utters something that expresses annoyance, such as "I got it". In that case, output of the additional word is interrupted in ST104 and the transformation counter 107 is set to infinity (ST105). The operation of ST103 to ST105 means that when the user interrupts during the output of an additional word, the device judges that the user finds the explanation bothersome and stops outputting additional words for that word.

[0056] ST106: If the input speech recognized in ST102 expresses confusion, such as "Huh?" or "What?", or is a request for detail, such as "In detail", the transformation counter 107 is cleared to 0 in ST107. Such input indicates that the user could not make out the device's voice output or wants a more detailed explanation. In this case, the transformation counter is therefore cleared to 0 so that the transformation described in the output data table is performed.

[0057] ST108: Based on the current dialogue situation, the data the system should output is selected from the output data table 103. In general, output data is selected in order starting from the first position in the table. In the example of FIG. 4, the "greeting" type output data is selected first, then the "confirmation" type output data, and so on from the top. Selection is not limited to this top-down order; output data can also be selected dynamically according to the dialogue situation at that time.

[0058] ST109 to ST111: The transformation processing described later is applied to each of the transformation target words incorporated in the selected output data.

[0059] ST112: The transformed data is synthesized into speech and output.

[0060] ST113: It is determined whether the current output is a mere repetition of the immediately preceding output. If it is not a repetition, the transformation counter 107 is incremented for every transformation target word incorporated in the output. The idea is that the more often a word has been transformed in the past, the more likely the user has understood it correctly, so the transformation is eventually no longer performed.

[0061] After the above operation, the process returns to ST102 and waits for new input.
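The role of the transformation counter across ST101, ST105, ST107 and ST113 can be summarized in a short Python sketch. It assumes that the counter is kept per word, as FIG. 8 suggests, and that ST107 resets the word in question; the threshold value and all names are hypothetical.

```python
import math
from collections import defaultdict


class TransformationCounter:
    """Per-word count of past transformations (a sketch of counter 107)."""

    def __init__(self, threshold=2):
        self.threshold = threshold      # assumed "fixed value"; not given in the text
        self.counts = defaultdict(int)  # ST101: every word starts at 0

    def should_transform(self, word):
        # Transform while the count does not exceed the fixed value.
        return self.counts[word] <= self.threshold

    def on_interrupt(self, word):
        # ST104-ST105: the user interrupted an additional word, so never explain again.
        self.counts[word] = math.inf

    def on_confusion(self, word):
        # ST107: the user was confused or asked for detail, so transform again.
        self.counts[word] = 0

    def on_output(self, words, is_repeat):
        # ST113: count the output unless it merely repeats the previous one.
        if not is_repeat:
            for w in words:
                self.counts[w] += 1
```

Setting the count to infinity makes the later increments harmless, so a word the user has waved off stays untransformed until an expression of confusion resets it.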

[0062] FIG. 10 shows the control flow of the transformation processing in ST110.

[0063] ST201, ST202: If the transformation target word echoes the user's input (corresponding to <xxx>) and its speech recognition confidence is lower than a predetermined value, the applicable transformation attribute is attached to the word (ST204). In the example of FIG. 1, when the user utters "Nagaokakyo-shi" and the recognition confidence is low, the process proceeds to ST204 and a transformation attribute is attached to the word. Because the dialogue situation is the "confirmation" type of FIG. 4, the attribute is "transformation 1". The actual transformation, from FIG. 5, is therefore "high volume" and "low speed".

[0064] ST203: If the transformation target word is a part dynamically generated by the system (corresponding to <yyy>) and the transformation counter 107 does not exceed a fixed value, the applicable transformation attribute is attached to the word (ST204). In the example of FIG. 1, when the system outputs "Komyoji is recommended." in the "simple answer" dialogue situation and the transformation counter 107 does not exceed the fixed value, "transformation 2" is attached as the attribute. The output therefore becomes "Komyoji" (output slowly and clearly) "is recommended." (output quickly and lightly).
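The decision made in ST201 to ST204 can be sketched as a single function. The threshold values and the function name are assumptions, and returning `None` when no attribute is attached is a choice made for this sketch rather than something the text specifies.

```python
def select_attribute(slot_kind, table_attribute, confidence=None, count=0,
                     min_confidence=0.6, max_count=2):
    """Return the attribute to attach to a target word, or None to leave it as is.

    slot_kind       -- "xxx" (echoes user input) or "yyy" (generated by the system)
    table_attribute -- attribute written in the output data table for this slot
    confidence      -- recognition confidence of the word (used for "xxx")
    count           -- transformation counter value of the word (used for "yyy")
    min_confidence and max_count are assumed thresholds.
    """
    # ST201-ST202: echoed word with low recognition confidence.
    if slot_kind == "xxx" and confidence is not None and confidence < min_confidence:
        return table_attribute  # ST204
    # ST203: generated word whose counter does not exceed the fixed value.
    if slot_kind == "yyy" and count <= max_count:
        return table_attribute  # ST204
    return None
```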

[0065] FIG. 11 shows the control flow of the speech synthesis in ST112 of FIG. 9.

[0066] ST301: If the transformation content includes "with additional word", the process proceeds to ST302. There it is determined whether the target word is contained in the word-based additional word table 106. If it is, the corresponding additional word is generated (ST303) and inserted immediately before the target word (ST304). If it is not, an additional word is obtained for each character of the word from the character-based additional word table 105, and a particle such as "ni" is inserted between the additional words to form a single additional word string (ST305). This additional word string is inserted immediately after the target word (ST306). A timbre-change attribute can also be set for the additional word portion. ST307 does this, so that the additional word and the target word can be told apart by ear.

[0067] ST308: Acoustic processing is applied according to the transformation content, and speech synthesis is performed.
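The additional word handling of ST302 to ST306 can be sketched as follows. The table entries are hypothetical stand-ins for FIG. 6 and FIG. 7 (only the additional word "はしっこの" for 端 appears in the text), and the comma used to separate the word from a trailing additional word string is an assumption of this sketch.

```python
# Hypothetical stand-in for the word-based additional word table 106.
WORD_ADDITIONAL = {"端": "はしっこの"}

# Hypothetical stand-in for the character-based additional word table 105.
CHAR_ADDITIONAL = {"光": "ひかり", "明": "あかるい", "寺": "おてら"}


def attach_additional_word(word, reading, particle="に"):
    """Return the reading of `word` with an explanatory additional word attached."""
    if word in WORD_ADDITIONAL:                      # ST302
        # ST303-ST304: word-level additional word goes immediately before the word.
        return WORD_ADDITIONAL[word] + reading
    # ST305: build one additional word string from the per-character entries.
    parts = [CHAR_ADDITIONAL[ch] for ch in word if ch in CHAR_ADDITIONAL]
    if not parts:
        return reading                               # nothing to add
    # ST306: character-level string goes immediately after the word.
    return reading + "、" + particle.join(parts)
```

With these entries, 端 is read out as "はしっこのはし", while 光明寺, which has no word-level entry, is followed by its per-character explanation.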

[0068] FIG. 12 shows a conceptual diagram of a system using a voice output device according to another embodiment of the present invention.

[0069] This system is a voice input system that recognizes the full text uttered by the user. It is used, for example, when a user reads aloud a manuscript written on paper in order to convert it accurately into electronic data. To create correct electronic data, the voice input system carries out a dialogue with the user to prevent misrecognition when homonyms and the like occur. The example of FIG. 12 shows the scene in which the user first utters "ano hashi ni ki o tsukete kudasai" ("Please watch out for that bridge"). Suppose the system's recognition result is "Please watch out for that edge"; that is, the system has misrecognized "hashi" meaning bridge (橋) as "hashi" meaning edge (端). In this case, the system attaches the additional word "hashikko" ("the very end") to distinguish 端 from 橋. That is, the system outputs the speech "ano hashikko no hashi ni ki o tsukete kudasai". The user thereby learns that the system has misrecognized the word, and enters a correction dialogue by uttering a control word such as "repeat" or "correct". To make a correction, for example, the user specifies the word by means of an additional word, in the form "correct: YY of XXX". In the example of FIG. 12, the user utters "teisei, burijji no hashi" ("correct: hashi as in bridge"). The system then determines whether "burijji no" ("as in bridge") is registered in the system as an additional word that identifies 橋. If it is registered, the system performs the correction and outputs the speech again, namely "ano burijji no hashi ni ki o tsukete kudasai". If "burijji no" is not registered as an additional word identifying 橋, the system outputs the possible recognition candidates one after another, each with its additional word. If an output candidate is wrong, the user utters a word expressing negation, such as "no". When the system recognizes the negation, it outputs the next candidate. Through this repetition, at the point when the word the user wants is output, the user has the opportunity to utter an affirmation such as "yes". The system then corrects the recognition result and also newly adds the additional word "burijji no" used by the user to the system. From then on, "burijji no" can be used as an additional word identifying 橋.

[0070] In the example shown in FIG. 12, the dialogue proceeds as follows. The user utters "Please watch out for that hashi (bridge)", learns that the system has produced a wrong recognition result, and utters the word "repeat"; the system recognizes it and outputs the same content again. The user utters "correct: hashi as in bridge", and the system outputs "Is it the hashi (chopsticks) for eating a meal?" as the first candidate. The user utters "no", and the system outputs "Is it the hashi (bridge) you cross?" as the next candidate. The user utters "yes", and the system outputs "Corrected. Please watch out for this hashi, as in bridge."

[0071] The configuration of this system is the same as that shown in FIG. 3, and the main control flow is the same as the flow shown in FIG. 9. FIG. 13 shows the control flow for the correction processing performed when the user utters "correct: YY of XXX".

[0072] ST601: It is determined whether the result of recognizing the input speech (ST102 in FIG. 9) corresponds to a correction command, that is, whether the user has uttered "correct: YY of XXX". If it is a correction command, ST602 and the subsequent steps are processed.

[0073] ST602: From the timing of the correction command uttered by the user, the system identifies which word in its output data the user is trying to correct. In the example shown in FIG. 12, the system recognizes the user's "repeat" utterance, and the user utters "correct" immediately after the system has output "ano hashikko no"; the system therefore identifies the word immediately following "ano hashikko no" as the word to be corrected.

[0074] ST603: The corrected word and the additional word contained in the correction command are extracted. In the example shown in FIG. 12, the corrected word is "hashi" and the additional word is "burijji no" ("as in bridge").

[0075] ST604: It is determined whether the corrected word and the additional word are already registered in the system's character-based additional word table 105 or word-based additional word table 106. If they are, the process proceeds to ST605 and the word corresponding to the additional word is selected. If not, in ST606 recognition candidates such as homonyms are output one after another together with their registered additional words, and the user is asked to confirm.

[0076] ST607: It is determined whether the word the user wants was among the candidates output one after another. This is decided by whether the user uttered a negative or affirmative expression such as "no" or "yes". If the desired word was found, the additional word is newly registered in the word-based additional word table or the character-based additional word table in ST608; otherwise, a message that the word was not found is output and the user is prompted to re-enter the original sentence.

[0077] ST609: The recognized content is then corrected, the corrected result is output as speech (ST601), and the process returns.

[0078] FIG. 14 is a conceptual diagram of a system using a voice output device according to another embodiment of the present invention.

[0079] This system is the device shown in FIG. 1 with a camera attached. The camera is used to detect the user's confusion. In the example of FIG. 1, the user expresses confusion by uttering "Huh?" or "What?" when the content of the system's voice output is not understood. However, users who feel confused may also have a custom or habit of, for example, spreading both hands. Such a gesture may also be agreed on in advance as a rule. In such cases, detecting the spread hands in the image makes it possible to detect that the user is confused. The user can then signal confusion to the system simply by spreading the hands, without uttering "Huh?" or "What?".

[0080] FIG. 15 is a conceptual diagram of a system using a voice output device according to another embodiment of the present invention.

[0081] In this system, in addition to transforming the output data, a lamp is lit or blinked in time with the output when an incorporated word is output. In the example shown in FIG. 15, when the speech recognition confidence is at or below a fixed level, the lamp blinks while the word in question (the part the recognizer is unsure of) is output as speech. The lamp also blinks for the part that outputs important information dynamically generated by the system. In the diagram of FIG. 15, therefore, "Nagaokakyo-shi" and "Komyoji" are output as speech while the lamp blinks.

[0082] The lamp blinking described above may also be performed on its own, without applying the voice transformation to those words.

[0083] FIG. 16 shows a conceptual diagram of a system using a voice output device according to another embodiment of the present invention.

[0084] This system uses a vibrator in place of the lamp of FIG. 15; that is, the vibrator vibrates instead of the lamp blinking. In this case, the user holds the vibrator.

[0085] A low-output buzzer can also be used in place of the lamp or vibrator.

[0086]

[Effects of the Invention] According to the invention of claim 1, when the user cannot make out the speech output by the device and utters speech expressing confusion, the device detects this, transforms the relevant word so that it is easier for the user to understand, and outputs it again. The user's understanding during the voice dialogue is thus made reliable, and a smooth dialogue is possible.

[0087] According to the invention of claim 2, when similar voice output is repeated and the user finds it bothersome, the user need only utter an interrupting utterance to prevent further repetition, which enables a smooth dialogue.

[0088] According to the invention of claim 3, the user's confusion is detected by imaging means such as a camera, so the smoothness of the dialogue is improved even when confusion cannot be detected by speech recognition, for example when the surrounding noise is severe.

[0089] According to the invention of claim 4, the reliability of the dialogue can be improved even when the noise environment around the user is severe.

[0090] According to the invention of claim 5, the user can easily grasp the meaning even when the output data contains homonyms or difficult terms.

[0091] According to the invention of claim 6, even when an additional word cannot be attached to a word as a whole, attaching an additional word to each of its characters lets the user easily grasp the meaning of homonyms and difficult words.

[0092] According to the invention of claim 7, in a system that recognizes the full text uttered by the user and converts it into electronic data, outputting a confirmation with an additional word attached prevents incorrect electronic data from being created even when homonyms or difficult words are present.

[0093] According to the invention of claim 8, even when an additional word is not registered in the device, candidates are output one after another, and the additional word can be registered as a new additional word on the basis of the user's affirmative utterance when the semantically correct candidate is output. The additional word table is thus updated by learning, so the dialogue becomes more accurate and smoother the more the device is used.

[0094] According to the invention of claim 9, notification by means other than voice is given at the same time as the word transformation, which makes the user's understanding more reliable.

[0095] According to the invention of claim 10, no transformation means is needed, so the device as a whole is simplified.

[Brief description of the drawings]

FIG. 1 is a conceptual diagram of a system to which a voice output device according to an embodiment of the present invention is applied.

FIG. 2 is a diagram showing the system configuration.

FIG. 3 is a schematic block diagram of the system.

FIG. 4 is a configuration example of the output data table.

FIG. 5 is a configuration example of the transformation content table.

FIG. 6 is a configuration example of the character-based additional word table.

FIG. 7 is a configuration example of the word-based additional word table.

FIG. 8 is a configuration example of the transformation counter.

FIG. 9 is a diagram showing the general flow.

FIG. 10 is a diagram showing the control flow of the transformation processing unit.

FIG. 11 is a diagram showing the control flow of the speech synthesis unit.

FIG. 12 is a conceptual diagram of a system using a voice output device of another embodiment.

FIG. 13 is a diagram showing the control flow of the correction processing unit.

FIG. 14 is a conceptual diagram of a system using a voice output device of another embodiment.

FIG. 15 is a conceptual diagram of a system using yet another voice output device.

FIG. 16 is a conceptual diagram of a system using yet another voice output device.

Claims

[Claims]

1. A voice output device comprising: a voice input unit to which voice is input; a voice recognition unit that performs voice recognition on the input voice; an output data table that stores, for each dialogue state with a user, output data to be output as voice from the device; output data selection means for judging the dialogue state from the voice recognition result and selecting, from the output data table, the output data to be output at that time; a voice output unit that outputs the selected output data as voice; means for detecting, from the voice recognition result, a voice expressing the user's confusion at the voice output from the device; and deforming means for, when the voice expressing confusion is detected, deforming a word incorporated in the output data and causing the voice output unit to output it again.
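
The detect-then-re-output behaviour recited in claim 1 can be illustrated with a minimal Python sketch. It is an assumption-laden illustration, not the implementation disclosed in this publication: the marker list `CONFUSION_MARKERS`, the function names `is_confused` and `reoutput_if_confused`, and the `deform` callback are all hypothetical, and real confusion detection would operate on the recognizer's actual output format.

```python
from typing import Callable, Optional

# Hypothetical confusion markers; the claim does not enumerate specific utterances.
CONFUSION_MARKERS = {"eh", "huh", "what", "pardon"}


def is_confused(recognized_text: str) -> bool:
    """Return True if the recognition result contains a confusion marker."""
    cleaned = recognized_text.lower().replace("?", " ").replace(",", " ")
    return any(token in CONFUSION_MARKERS for token in cleaned.split())


def reoutput_if_confused(
    output_data: str,
    word: str,
    recognized_reply: str,
    deform: Callable[[str], str],
) -> Optional[str]:
    """Return the output data with `word` deformed, or None if no confusion was detected."""
    if not is_confused(recognized_reply):
        return None
    # Only the incorporated word is deformed; the rest of the output data is unchanged.
    return output_data.replace(word, deform(word))
```

Returning `None` when no confusion is detected leaves the caller free to continue the dialogue normally, so the deformation only runs when it is needed.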

2. A voice output device comprising: a voice input unit to which voice is input; a voice recognition unit that performs voice recognition on the input voice; an output data table that stores, for each dialogue state with a user, output data to be output as voice from the device; output data selection means for judging the dialogue state from the voice recognition result and selecting, from the output data table, the output data to be output at that time; a voice output unit that outputs the selected output data as voice; means for detecting, from the voice recognition result, an interrupting voice uttered by the user during the voice output from the device; and deforming means for, when the interrupting voice is detected, deforming a word incorporated in the output data and causing the voice output unit to output it again.

3. A voice output device comprising: a voice input unit to which voice is input; a voice recognition unit that performs voice recognition on the input voice; an output data table that stores, for each dialogue state with a user, output data to be output as voice from the device; output data selection means for judging the dialogue state from the voice recognition result and selecting, from the output data table, the output data to be output at that time; a voice output unit that outputs the selected output data as voice; imaging means for imaging the user; means for detecting, from the image of the user captured by the imaging means, a confused state of the user in response to the voice output from the device; and deforming means for, when the confused state is detected from the image of the user, deforming a word incorporated in the output data and causing the voice output unit to output it again.

4. A voice output device comprising: a voice input unit to which voice is input; a voice recognition unit that performs voice recognition on the input voice; an output data table that stores, for each dialogue state with a user, output data to be output as voice from the device; output data selection means for judging the dialogue state from the voice recognition result and selecting, from the output data table, the output data to be output at that time; a voice output unit that outputs the selected output data as voice; means for detecting the recognition certainty of the input voice; and deforming means for, when the recognition certainty is equal to or less than a threshold value, deforming a word incorporated in the output data and causing the voice output unit to output it again.
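
The threshold test in claim 4 amounts to a single comparison, sketched below in Python. The `Recognition` type, the assumption that certainty lies in the range 0.0 to 1.0, and the value of `CERTAINTY_THRESHOLD` are illustrative assumptions; the claim itself fixes neither a scale nor a threshold.

```python
from dataclasses import dataclass


@dataclass
class Recognition:
    """A recognition result with its certainty (hypothetical structure)."""
    text: str
    certainty: float  # assumed to lie in [0.0, 1.0]


# Hypothetical threshold; the claim only requires that some threshold exists.
CERTAINTY_THRESHOLD = 0.6


def needs_deformation(result: Recognition, threshold: float = CERTAINTY_THRESHOLD) -> bool:
    """Return True when the certainty is equal to or less than the threshold."""
    return result.certainty <= threshold
```

The comparison is inclusive (`<=`) to match the claim wording "equal to or less than a threshold value".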

5. The voice output device according to any one of claims 1 to 4, further comprising a per-word additional word table that stores additional words for explaining words incorporated in the output data, wherein the deforming means performs the deformation by outputting voice in which the additional word is added to the word.

6. The voice output device according to claim 1, further comprising a per-character additional word table that stores, for each character, an additional word for explaining each character of a word incorporated in the output data, wherein the deforming means performs the deformation by outputting voice in which an additional word is added to each character constituting the word.
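
The two deformation styles of claims 5 and 6 can be contrasted in a short Python sketch. The table contents, the names `WORD_ADDITIONS` and `CHAR_ADDITIONS`, and the English phrasing "X as in Y" are hypothetical illustrations; the publication's own tables (FIGS. 6 and 7) hold Japanese entries that are not reproduced here.

```python
# Hypothetical per-word table: word -> explanatory additional word.
WORD_ADDITIONS = {"Kyoto": "the old capital"}

# Hypothetical per-character table: character -> explanatory additional word.
CHAR_ADDITIONS = {"K": "king", "Y": "yacht", "O": "orange", "T": "tiger"}


def deform_by_word(word: str) -> str:
    """Append the per-word additional word, if one is registered."""
    extra = WORD_ADDITIONS.get(word)
    return f"{word}, {extra}" if extra else word


def deform_by_character(word: str) -> str:
    """Spell the word out, attaching an additional word to each character."""
    parts = []
    for ch in word.upper():
        extra = CHAR_ADDITIONS.get(ch)
        # Characters without a registered additional word are spoken as-is.
        parts.append(f"{ch} as in {extra}" if extra else ch)
    return ", ".join(parts)
```

Per-word deformation keeps the output short, while per-character deformation is slower but can disambiguate words that have no registered explanation of their own.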

7. A voice output device comprising: a voice input unit to which voice is input; a voice recognition unit that performs voice recognition on the input voice; means for generating the voice recognition result as output data so that the device repeats the input voice; a voice output unit that outputs the output data as voice; a per-word additional word table that stores additional words for explaining words incorporated in the output data; and deforming means for deforming an incorporated word into the word with its additional word added and causing the voice output unit to output it again.

8. The voice output device according to claim 7, further comprising: means for, when the user utters, as a correction, a word accompanied by an additional word after the output data incorporating the deformed word has been output, accepting the corresponding word as the corrected word; and table updating means for, when the additional word input by the user is not registered in the per-word additional word table, sequentially outputting homophonous words and their additional words by voice as candidates and, when the user gives an input indicating that a candidate is the correct additional word, adding the word corresponding to that candidate additional word to the per-word additional word table.

9. The voice output device according to any one of claims 1 to 8, further comprising notifying means for, when the word incorporated in the output data is deformed by the deforming means and output by the voice output unit, issuing a notification by external means such as a lamp, a buzzer, or a vibrator at the time the incorporated word is output.

10. The voice output device according to claim 1, comprising, instead of the deforming means, notifying means for issuing a notification by external means such as a lamp, a buzzer, or a vibrator when the incorporated word is output.

11. A voice output program recording medium on which is recorded a voice output program comprising: a voice input step; a voice recognition step of performing voice recognition on the input voice; an output data selection step of referring to an output data table that stores, for each dialogue state with a user, output data from the device, judging the dialogue state from the voice recognition result, and selecting, from the output data table, the output data to be output at that time; an output step of outputting the selected output data by voice; a step of detecting, from the voice recognition result, a voice expressing the user's confusion; and a deforming step of deforming a word incorporated in the output data according to the confused state and outputting it again in the output step.

12. A voice output program recording medium on which is recorded a voice output program comprising: a voice input step; a voice recognition step of performing voice recognition on the input voice; an output data selection step of referring to an output data table that stores, for each dialogue state with a user, output data from the device, judging the dialogue state from the voice recognition result, and selecting, from the output data table, the output data to be output at that time; an output step of outputting the selected output data by voice; a step of detecting, from the voice recognition result, an interrupting voice uttered by the user during the voice output from the device; and a deforming step of deforming a word incorporated in the output data in response to the interrupting voice and outputting it again in the output step.

13. A voice output program recording medium on which is recorded a voice output program comprising: a voice input step; a voice recognition step of performing voice recognition on the input voice; an output data selection step of referring to an output data table that stores, for each dialogue state with a user, output data from the device, judging the dialogue state from the voice recognition result, and selecting, from the output data table, the output data to be output at that time; an output step of outputting the selected output data by voice; a step of detecting, from an image obtained by imaging the user, a confused state of the user in response to the voice output from the device; and a deforming step of deforming a word incorporated in the output data according to the confused state and outputting it again in the output step.