JPH06175689A

JPH06175689A - Voice recognition reaction device

Info

Publication number: JPH06175689A
Application number: JP4351307A
Authority: JP
Inventors: Keiichi Miyamoto; 恵一宮本
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1992-12-07
Filing date: 1992-12-07
Publication date: 1994-06-24

Abstract

PURPOSE: To provide a voice recognition reaction device that varies its voice and motion responses to an operator according to the voice recognition result and its certainty, and according to volume and pitch information. CONSTITUTION: A recognition number, which is the voice recognition result, and a similarity level, which indicates the certainty of that recognition in steps, are obtained by a feature extraction unit 3, a similarity calculation unit 4, a voice recognition dictionary 5 and a similarity recognition unit 7. A volume level is obtained by a volume determination unit 6 and the similarity recognition unit 7. A motion dictionary selection unit 8 and a voice synthesis dictionary selection unit 13 select a motion dictionary and a voice synthesis dictionary from a motion dictionary storage unit 9 and a voice synthesis dictionary storage unit 14 according to the combination of the recognition number, the similarity level and the volume level. A movable part 12 is driven and a synthesized voice is output on the basis of the selected dictionaries.

Description

Detailed Description of the Invention

【０００１】[0001]

[Field of industrial application] The present invention relates to a voice recognition reaction device that recognizes a voice and shows an auditory or visual reaction, and that is used as, for example, a toy robot.

【０００２】[0002]

[Prior art] Techniques for confirming the recognition result of a voice recognition device have long been known. For example, one technique presents the speaker with a recognition result indicating that the accept/reject decision is not necessarily definitive (see Japanese Patent Laid-Open No. 57-74797). Another changes the frequency and duration of an audible tone according to the state of voice recognition (see Japanese Patent Laid-Open No. 58-216298). A further technique provides a first threshold and a second threshold for the reliability of the recognition result and performs predetermined control based on these thresholds, so that commands are output without lowering command input efficiency and without causing a fatal malfunction of the machine (see Japanese Patent Laid-Open No. 3-248199). However, all of these are methods for asking the operator for confirmation when the certainty of voice recognition is low.

【０００３】[0003]

[Problems to be solved by the invention] Entertainment devices such as toy robots are also known that include a voice recognition device and output a voice prepared in advance according to the recognition result. For example, on recognizing the input voice "What is your name?", such a device generates and outputs a prepared synthesized voice such as "I am the so-and-so robot".

[0004] However, the conventional toy robot described above provides only one reaction for each input word, and cannot show different reactions to the same word depending on the certainty of its recognition. Nor has any such device used the volume or pitch of the input voice as information for selecting a reaction.

[0005] In view of the above circumstances, an object of the present invention is to provide a voice recognition reaction device that can change its voice response and motion response to the operator according to the voice recognition result and its certainty, and further according to volume information and pitch information.

【０００６】[0006]

[Means for solving the problems] To solve the above problems, the voice recognition reaction device of the present invention comprises: a feature extraction unit that extracts features of an input voice; a voice recognition dictionary in which feature amounts of a plurality of voices are registered in advance; a similarity calculation unit that compares the feature amount extracted from the input voice with the feature amount of each voice in the voice recognition dictionary and calculates a similarity for each; a similarity recognition unit that classifies the calculated similarity into one of several levels according to its magnitude; a voice dictionary storage unit that stores data corresponding to voices or sounds; a voice dictionary selection unit that selects a voice dictionary according to the combination of the recognition number of the voice in the voice recognition dictionary that had the highest similarity and its similarity level; a voice generation unit that converts the selected voice or sound data into a voice signal; and a voice output unit that converts the voice signal into a voice or sound and outputs it.

[0007] In another aspect, the device comprises: a feature extraction unit that extracts features of an input voice; a voice recognition dictionary in which feature amounts of a plurality of voices are registered in advance; a similarity calculation unit that compares the feature amount extracted from the input voice with the feature amount of each voice in the voice recognition dictionary and calculates a similarity for each; a similarity recognition unit that classifies the calculated similarity into one of several levels according to its magnitude; a motion dictionary storage unit that stores data corresponding to motion patterns; a motion dictionary selection unit that selects a motion dictionary according to the combination of the recognition number of the voice in the voice recognition dictionary that had the highest similarity and its similarity level; an actuator control unit that controls an actuator unit based on the data of the selected motion pattern; and a movable part driven by the actuator unit.

[0008] The device may further include a volume determination unit that determines volume information of the input voice. In this case, the similarity recognition unit also has a function of classifying the volume into one of several levels according to its magnitude, and the voice dictionary selection unit or the motion dictionary selection unit is configured to select the voice dictionary or the motion dictionary according to the combination of the recognition number of the voice in the voice recognition dictionary that had the highest similarity, its similarity level, and the volume level.

[0009] The device may further include a pitch determination unit that determines pitch information of the input voice. In this case, the similarity recognition unit also has a function of classifying the pitch into one of several levels according to its magnitude, and the voice dictionary selection unit or the motion dictionary selection unit is configured to select the voice dictionary or the motion dictionary according to the combination of the recognition number of the voice in the voice recognition dictionary that had the highest similarity, its similarity level, and the pitch level.

【００１０】[0010]

[Operation] With the above configuration, the calculated similarity is classified into one of several steps according to its magnitude and recognized as a similarity level. A voice dictionary is selected according to the combination of the recognition number of the voice in the voice recognition dictionary that had the highest similarity and its similarity level, and the voice or sound of that voice dictionary is output. The device can therefore show a variety of auditory reactions that differ according to the certainty of voice recognition. In the same way, it can show a variety of visual reactions that differ according to the certainty of voice recognition.

[0011] Further, because the voice dictionary or the motion dictionary is selected according to the combination of the recognition number of the voice in the voice recognition dictionary that had the highest similarity, its similarity level, and the volume level and/or the pitch level, the device can show a variety of auditory and visual reactions that differ according to volume and pitch as well as the certainty of voice recognition.

【００１２】[0012]

[Embodiment] The present invention will now be described with reference to the drawings showing an embodiment. FIG. 1 is a block diagram of the voice recognition reaction device. In the figure, 1 is a microphone, 2 is a voice input unit, 3 is a feature extraction unit, 4 is a similarity calculation unit, 5 is a voice recognition dictionary, 6 is a volume determination unit, 7 is a similarity recognition unit, 8 is a motion dictionary selection unit, 9 is a motion dictionary storage unit, 10 is an actuator control unit, 11 is an actuator unit, 12 is a movable part, 13 is a voice synthesis dictionary selection unit, 14 is a voice synthesis dictionary storage unit, 15 is a voice synthesis unit, 16 is a voice output unit, and 17 is a speaker. FIG. 2 shows the external appearance of the robot when the voice recognition reaction device of the present invention is a toy robot. The part corresponding to the robot's mouth is the speaker 17, the parts corresponding to its ears are the microphone 1, and the parts corresponding to its neck, arms and legs are the movable part 12.

[0013] The microphone 1 converts the input voice into a voice signal, and the voice input unit 2 performs predetermined processing on that signal, such as amplification and shaping.

[0014] The feature extraction unit 3 extracts voice features based on, for example, the well-known frequency spectrum method, and comprises a plurality of band-pass filters with mutually different pass bands, an A/D converter, and the like.
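By way of illustration only, the band-wise feature extraction described above might be sketched as follows. This is a minimal sketch that assumes a software DFT stands in for the bank of band-pass filters; the function name `band_energies`, the band edges and the sample rate are hypothetical and do not come from the specification.

```python
import cmath
import math

def band_energies(samples, sample_rate, band_edges_hz):
    """Return the mean spectral magnitude in each frequency band.

    A naive DFT stands in for the bank of band-pass filters; the band
    edges are illustrative values, not taken from the specification.
    """
    n = len(samples)
    # Magnitude spectrum for bins 0 .. n/2 (naive O(n^2) DFT).
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc) / n)
    # Average the magnitudes of the bins that fall inside each band.
    features = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        bins = [m for k, m in enumerate(mags) if lo <= k * sample_rate / n < hi]
        features.append(sum(bins) / len(bins) if bins else 0.0)
    return features

# A 1 kHz tone sampled at 8 kHz: the energy lands in the 750-1500 Hz band.
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(64)]
feats = band_energies(tone, 8000, [0, 750, 1500, 3000, 4000])
```

The resulting list has one value per band and plays the role of the "feature amount" used in the following paragraphs.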

[0015] In the so-called registration mode, the voice recognition dictionary 5 stores, as digital data, the feature amounts based on the frequency spectrum of the operator's voice that is captured through the microphone 1 and the voice input unit 2 and extracted by the feature extraction unit 3. The voice recognition dictionary 5 can store the feature amounts of a plurality of voices. In this embodiment, for example, the operator speaks the words "Buster" (the name of the robot), "Advance" and "Get them!", and a feature amount is stored for each of these three voices.
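The registration mode can be pictured with a small data structure. The sketch below is only an assumption about how such a dictionary could be organized: the class name `VoiceRecognitionDictionary`, the sequential assignment of recognition numbers and the feature values are all hypothetical.

```python
class VoiceRecognitionDictionary:
    """Stores one feature vector per registered word, keyed by recognition number."""

    def __init__(self):
        self._entries = {}

    def register(self, word, features):
        """Store a copy of the features and return the assigned recognition number."""
        number = len(self._entries) + 1  # assumed: numbers follow registration order
        self._entries[number] = (word, list(features))
        return number

    def items(self):
        """Yield (recognition_number, features) pairs for later matching."""
        for number, (_word, features) in self._entries.items():
            yield number, features

    def word(self, number):
        """Return the word registered under a recognition number."""
        return self._entries[number][0]


dictionary = VoiceRecognitionDictionary()
# The feature values below are made-up placeholders.
buster = dictionary.register("Buster", [0.9, 0.1, 0.3, 0.0])
advance = dictionary.register("Advance", [0.1, 0.8, 0.2, 0.4])
get_them = dictionary.register("Get them!", [0.2, 0.2, 0.9, 0.5])
```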

[0016] In the so-called recognition mode, the similarity calculation unit 4 temporarily stores the feature amount based on the frequency spectrum of the operator's voice that is captured through the microphone 1 and the voice input unit 2 and extracted by the feature extraction unit 3. It compares this feature amount with the feature amount of each voice read out in turn from the voice recognition dictionary 5, calculates a similarity for each voice, and supplies the highest similarity and the recognition number of the corresponding voice to the similarity recognition unit 7.
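The matching step can be sketched as below. The specification does not name a similarity measure, so cosine similarity is used here purely as an assumption; `best_match` and the feature values are likewise hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_match(features, dictionary):
    """Return (recognition_number, similarity) of the most similar entry."""
    scored = [(cosine_similarity(features, ref), number)
              for number, ref in dictionary.items()]
    similarity, number = max(scored)
    return number, similarity

# Made-up feature vectors keyed by recognition number.
recognition_dictionary = {
    1: [0.9, 0.1, 0.3, 0.0],  # "Buster"
    2: [0.1, 0.8, 0.2, 0.4],  # "Advance"
    3: [0.2, 0.2, 0.9, 0.5],  # "Get them!"
}
number, similarity = best_match([0.8, 0.2, 0.3, 0.1], recognition_dictionary)
```

Only the best-scoring pair is passed on, which mirrors the unit supplying the highest similarity and its recognition number to the similarity recognition unit 7.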

[0017] In the so-called recognition mode, the volume determination unit 6 determines the volume of the input voice from the average level of the frequency spectrum of the operator's voice that is captured through the microphone 1 and the voice input unit 2 and extracted by the feature extraction unit 3, and supplies the determined volume information to the similarity recognition unit 7.
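As a rough illustration, the averaging of spectrum levels might look like the following. The frame-by-band layout of the input and the name `estimate_volume` are assumptions made for this sketch.

```python
def estimate_volume(frames):
    """Average the spectrum levels over all bands and all frames."""
    levels = [level for frame in frames for level in frame]
    return sum(levels) / len(levels) if levels else 0.0

# Two frames of four band levels each (made-up numbers).
quiet = estimate_volume([[0.1, 0.2, 0.1, 0.0], [0.2, 0.1, 0.1, 0.0]])
loud = estimate_volume([[0.8, 0.9, 0.7, 0.6], [0.9, 0.8, 0.6, 0.7]])
```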

[0018] The similarity recognition unit 7 classifies the similarity from the similarity calculation unit 4 into one of several steps according to its magnitude and recognizes it as a similarity level (for example, level 1, 2, 3 and so on, with a larger number for a higher similarity). It likewise classifies the volume from the volume determination unit 6 into one of several steps according to its magnitude and recognizes it as a volume level (for example, level 1, 2, 3 and so on, with a larger number for a higher volume). It supplies the similarity level and the volume level to the motion dictionary selection unit 8 and the voice synthesis dictionary selection unit 13.
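The level classification amounts to comparing a value against a sorted list of thresholds. The sketch below assumes particular threshold values, which the specification does not give; `to_level` and both threshold lists are hypothetical.

```python
from bisect import bisect_right

def to_level(value, thresholds):
    """Map a value to level 1..len(thresholds)+1; larger values give higher levels.

    `thresholds` must be sorted in ascending order.
    """
    return bisect_right(thresholds, value) + 1

# Hypothetical thresholds giving two levels each.
SIMILARITY_THRESHOLDS = [0.8]
VOLUME_THRESHOLDS = [0.5]

similarity_level = to_level(0.93, SIMILARITY_THRESHOLDS)
volume_level = to_level(0.30, VOLUME_THRESHOLDS)
```

Adding more thresholds to either list yields more levels without changing the function.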

[0019] The motion dictionary selection unit 8 selects a motion dictionary from the motion dictionary storage unit 9 according to the combination of the recognition number of the voice in the voice recognition dictionary 5 that had the highest similarity, its similarity level, and the volume level, and outputs it to the actuator control unit 10. The recognition numbers are assigned so that, for example, "Buster" is number "1", "Advance" is number "2", and "Get them!" is number "3". The motion dictionary storage unit 9 stores motion dictionaries such as "raise one hand", "raise both arms (banzai)", "shake and rattle", "advance slowly", "advance quickly", "raise and lower both hands", "wave both hands violently up, down, left and right", and "do nothing". Specifically, it stores the actuator drive data (such as the drive target and the drive time) needed to perform each motion.

[0020] The voice synthesis dictionary selection unit 13 selects a voice synthesis dictionary from the voice synthesis dictionary storage unit 14 according to the combination of the recognition number of the voice in the voice recognition dictionary 5 that had the highest similarity, its similarity level, and the volume level, and outputs it to the voice synthesis unit 15. The recognition numbers are as described above. The voice synthesis dictionary storage unit 14 stores voice synthesis dictionaries such as "Yes", "Yes, master", "Who are you?", "No", "Gaa" and "Gaoo". Specifically, it stores the data needed to generate each synthesized voice.

[0021] FIG. 3 is an explanatory diagram of how the motion dictionary selection unit 8 and the voice synthesis dictionary selection unit 13 select a motion and a synthesized voice. FIG. 3(a) shows the case where the recognition number of the voice in the voice recognition dictionary 5 that had the highest similarity is "1", that is, where the input voice is judged to be "Buster". When the similarity level is 1 and the volume level is 1, there is no voice and no motion. When the similarity level is 2 and the volume level is 1, the voice is "Yes" and the motion is "raise one hand". When the similarity level is 1 and the volume level is 2, the voice is "Who are you?" and the motion is "shake and rattle". When the similarity level is 2 and the volume level is 2, the voice is "Yes, master" and the motion is "raise both arms (banzai)".
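The selection in FIG. 3(a) can be read as a table lookup keyed by the recognition number, the similarity level and the volume level. In the sketch below, the four table rows restate the combinations described above; the name `select_reaction` and the fallback to "no voice, no motion" for unlisted combinations are assumptions.

```python
# (recognition number, similarity level, volume level) -> (voice, motion)
REACTIONS = {
    (1, 1, 1): (None, None),
    (1, 2, 1): ("Yes", "raise one hand"),
    (1, 1, 2): ("Who are you?", "shake and rattle"),
    (1, 2, 2): ("Yes, master", "raise both arms (banzai)"),
}

def select_reaction(number, similarity_level, volume_level):
    """Look up the voice and motion dictionaries for one recognized utterance."""
    # Assumed fallback: stay silent and still for combinations not in the table.
    return REACTIONS.get((number, similarity_level, volume_level), (None, None))

voice, motion = select_reaction(1, 2, 2)
```

Rows for recognition numbers 2 and 3 would be added in the same way from FIG. 3(b) and FIG. 3(c).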

[0022] FIG. 3(b) shows the case where the recognition number of the voice in the voice recognition dictionary 5 that had the highest similarity is "2", that is, where the input voice is judged to be "Advance". FIG. 3(c) shows the case where that recognition number is "3", that is, where the input voice is judged to be "Get them!". The voice and motion selected for each combination of similarity level and volume level are as shown in these figures. Preparing voices and motions in this way makes the robot appear to recognize its master with emotion.

[0023] The actuator control unit 10 controls the actuator unit 11 based on the motion dictionary selected by the motion dictionary selection unit 8. For example, if the selected motion dictionary is "raise one hand", it supplies power to the motor that serves as the actuator unit 11 for that hand for just the time needed to raise the hand.
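A timed drive of this kind might be sketched as follows. Everything here is hypothetical: the `RecordingMotor` class is a stand-in for real motor hardware, and the drive target name and drive time in `MOTION_DICTIONARY` are invented for illustration.

```python
import time

class RecordingMotor:
    """Stand-in for a real motor driver; it only records power changes."""

    def __init__(self):
        self.log = []

    def power(self, on):
        self.log.append("on" if on else "off")

# Motion name -> list of (drive target, drive time in seconds); values are made up.
MOTION_DICTIONARY = {
    "raise one hand": [("right_arm", 0.5)],
    "do nothing": [],
}

def run_motion(name, motors, sleep=time.sleep):
    """Power each target motor for its drive time, then switch it off."""
    for target, seconds in MOTION_DICTIONARY[name]:
        motor = motors[target]
        motor.power(True)
        try:
            sleep(seconds)
        finally:
            motor.power(False)  # always cut power, even if the wait is interrupted

motors = {"right_arm": RecordingMotor()}
run_motion("raise one hand", motors, sleep=lambda seconds: None)  # skip the real wait
```

Passing the wait function as a parameter lets the sketch run without delay; a real controller would use the default `time.sleep` or a hardware timer.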

[0024] The voice synthesis unit 15 generates a synthesized voice based on the digital data of the voice dictionary selected by the voice synthesis dictionary selection unit 13. For example, if the selected voice dictionary is "Yes", it D/A-converts the digital data of that dictionary and generates a voice signal containing the various frequencies that make up "Yes".

[0025] The voice output unit 16 processes the voice signal from the voice synthesis unit 15, for example by amplifying it, and supplies it to the speaker 17.

[0026] With the above configuration, the calculated similarity is classified into one of several steps according to its magnitude and recognized as a similarity level, and a voice dictionary is selected according to the combination of the recognition number of the voice in the voice recognition dictionary that had the highest similarity, its similarity level, and the volume level, and is output to the voice synthesis unit 15. The device can therefore show a variety of auditory reactions that differ according to the certainty of voice recognition, that is, spoken reactions such as "Yes" or "Gaoo". In the same way, it can show a variety of visual reactions that differ according to the certainty of voice recognition, that is, reactions by motion such as "raise one hand" or "wave both hands violently up, down, left and right".

[0027] FIG. 4 is an explanatory diagram for the case where the volume level is not taken into account, that is, where the voice dictionary or the motion dictionary is selected only according to the combination of the recognition number of the voice in the voice recognition dictionary that had the highest similarity and its similarity level. It shows the case where the recognition number of the voice in the voice recognition dictionary 5 that had the highest similarity is "1", that is, where the input voice is judged to be "Buster". At similarity level 1 there is no voice and no motion. At level 2 the voice is "Your pronunciation is poor" and the motion is "raise one hand slightly". At level 3 the voice is "Yes" and the motion is "raise one hand". At level 4 the voice is "Yes, your pronunciation is good" and the motion is "raise one hand and wave it from side to side".
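When the volume level is ignored, the lookup for "Buster" reduces to a single key. The four rows below restate the FIG. 4 example above; the function name `describe_buster_reaction` and the wording of the returned description are assumptions.

```python
# Similarity level -> (voice, motion) for recognition number 1 ("Buster").
BUSTER_REACTIONS = {
    1: (None, None),
    2: ("Your pronunciation is poor", "raise one hand slightly"),
    3: ("Yes", "raise one hand"),
    4: ("Yes, your pronunciation is good",
        "raise one hand and wave it from side to side"),
}

def describe_buster_reaction(similarity_level):
    """Return a short description of the reaction for a similarity level."""
    voice, motion = BUSTER_REACTIONS[similarity_level]
    if voice is None and motion is None:
        return "no voice, no motion"
    return f'says "{voice}" and performs "{motion}"'

summary = [describe_buster_reaction(level) for level in (1, 2, 3, 4)]
```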

[0028] Instead of, or together with, the volume level, the pitch level of the voice may be taken into account when selecting the voice dictionary or the motion dictionary. The pitch of the voice can be detected from the digitized voice signal by, for example, a known method using the autocorrelation function. The pitch level can be handled in the same way as the volume level; for example, a high pitch can be treated like a high volume and a low pitch like a low volume.
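One common form of autocorrelation-based pitch detection is sketched below as an illustration. The search range of 80 to 400 Hz, the function name `estimate_pitch_hz` and the test tone are assumptions; the specification only refers to a known autocorrelation method.

```python
import math

def estimate_pitch_hz(samples, sample_rate, min_hz=80.0, max_hz=400.0):
    """Estimate pitch as the lag that maximizes the autocorrelation."""
    min_lag = int(sample_rate / max_hz)
    max_lag = min(int(sample_rate / min_hz), len(samples) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the signal with itself shifted by `lag`.
        score = sum(samples[t] * samples[t + lag] for t in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None

# A 200 Hz tone sampled at 8 kHz has a period of 40 samples.
tone = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(800)]
pitch = estimate_pitch_hz(tone, 8000)
```

The estimated pitch could then be classified into levels with the same kind of threshold comparison used for the volume.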

[0029] In the above embodiment, the algorithms and feature amounts used for voice recognition and voice synthesis, the methods of determining volume and pitch, the method of controlling the movable part, and so on are not limited in any way, and other methods may be used.

【００３０】[0030]

[Effects of the invention] As described above, according to the present invention, a variety of auditory reactions that differ according to the certainty of voice recognition can be shown. In the same way, a variety of visual reactions that differ according to the certainty of voice recognition can be shown. Further, by selecting the voice dictionary or the motion dictionary in combination with the volume level and/or the pitch level, a variety of auditory and visual reactions that differ according to volume and pitch as well as the certainty of voice recognition can be shown.

[Brief description of drawings]

FIG. 1 is a block diagram of the voice recognition reaction device of the present invention.

FIG. 2 is a front view showing the external appearance of the robot when the voice recognition reaction device of the present invention is a toy robot.

FIG. 3 is an explanatory diagram of the motion selection process and the synthesized voice selection process performed by the motion dictionary selection unit and the voice synthesis dictionary selection unit of the present invention.

FIG. 4 is an explanatory diagram of the motion selection process and the synthesized voice selection process performed by the motion dictionary selection unit and the voice synthesis dictionary selection unit of the present invention when the volume level is not taken into account.

[Explanation of symbols]

3: feature extraction unit
4: similarity calculation unit
5: voice recognition dictionary
6: volume determination unit
7: similarity recognition unit
8: motion dictionary selection unit
9: motion dictionary storage unit
10: actuator control unit
11: actuator unit
12: movable part
13: voice synthesis dictionary selection unit
14: voice synthesis dictionary storage unit
15: voice synthesis unit
16: voice output unit

Claims

[Claims]

1. A voice recognition reaction device comprising: a feature extraction unit that extracts features of an input voice; a voice recognition dictionary in which feature amounts of a plurality of voices are registered in advance; a similarity calculation unit that compares the feature amount extracted from the input voice with the feature amount of each voice in the voice recognition dictionary and calculates a similarity for each; a similarity recognition unit that classifies the calculated similarity into one of several levels according to its magnitude; a voice dictionary storage unit that stores data corresponding to voices or sounds; a voice dictionary selection unit that selects a voice dictionary according to the combination of the recognition number of the voice in the voice recognition dictionary that had the highest similarity and its similarity level; a voice generation unit that converts the selected voice or sound data into a voice signal; and a voice output unit that converts the voice signal into a voice or sound and outputs it.

2. A voice recognition reaction device comprising: a feature extraction unit that extracts features of an input voice; a voice recognition dictionary in which feature amounts of a plurality of voices are registered in advance; a similarity calculation unit that compares the feature amount extracted from the input voice with the feature amount of each voice in the voice recognition dictionary and calculates a similarity for each; a similarity recognition unit that classifies the calculated similarity into one of several levels according to its magnitude; a motion dictionary storage unit that stores data corresponding to motion patterns; a motion dictionary selection unit that selects a motion dictionary according to the combination of the recognition number of the voice in the voice recognition dictionary that had the highest similarity and its similarity level; an actuator control unit that controls an actuator unit based on the data of the selected motion pattern; and a movable part driven by the actuator unit.

3. The voice recognition reaction device according to claim 1 or 2, further comprising a volume determination unit that determines volume information of the input voice, wherein the similarity recognition unit also has a function of classifying the volume into one of several levels according to its magnitude, and the voice dictionary selection unit or the motion dictionary selection unit is configured to select the voice dictionary or the motion dictionary according to the combination of the recognition number of the voice in the voice recognition dictionary that had the highest similarity, its similarity level, and the volume level.

4. The voice recognition reaction device according to claim 1 or 2, further comprising a pitch determination unit that determines pitch information of the input voice, wherein the similarity recognition unit also has a function of classifying the pitch into one of several levels according to its magnitude, and the voice dictionary selection unit or the motion dictionary selection unit is configured to select the voice dictionary or the motion dictionary according to the combination of the recognition number of the voice in the voice recognition dictionary that had the highest similarity, its similarity level, and the pitch level.