JP2003271181A

JP2003271181A - Information processing apparatus and information processing method, and recording medium and program

Info

Publication number: JP2003271181A
Application number: JP2002072719A
Authority: JP
Inventors: Lucke Helmut; ルッケヘルムート
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2002-03-15
Filing date: 2002-03-15
Publication date: 2003-09-25

Abstract

(57) [Abstract]
[Problem] To enable unregistered words, that is, words not registered in a dictionary, to be clustered and registered more accurately.
[Solution] The word dictionary stored in a dictionary storage unit 25 stores, in addition to the headword and phoneme sequence of each word, the RGB values of an image corresponding to that word. For example, the entry for the word "red" stores RGB values representing the color red. A matching unit 23 performs matching with reference to the word dictionary on the basis of both the speech feature vector extracted by a feature extraction unit 22 and the image feature vector extracted by a feature extraction unit 42. For example, when the speech "red ball" is input, the "red" portion is recognized using not only the speech but also the features of the image. The present invention can be applied to a speech recognition apparatus.

Description

Detailed Description of the Invention

【０００１】[0001]

[Technical Field of the Invention] The present invention relates to an information processing apparatus, an information processing method, a recording medium, and a program, and in particular to an information processing apparatus, an information processing method, a recording medium, and a program that allow words and phrases, such as words to be recognized by speech recognition, to be registered in a dictionary easily and accurately.

【０００２】[0002]

[Prior Art] For example, for a robot to understand the utterances of a user (a human), the robot must be equipped with a speech recognition apparatus. A speech recognition apparatus normally has a dictionary in which words are registered in advance. When speech is input, the apparatus determines whether the speech contains a word registered in the dictionary (a registered word), and if it does, the registered word is output as the speech recognition result.

[0003] Because the speech recognition apparatus recognizes input speech by referring to the dictionary in this way, it basically cannot recognize words other than registered words (unregistered words).

[0004] One typical way of dealing with such unregistered words is, when the input speech contains an unregistered word, to register that word in the dictionary and treat it as a registered word from then on.

[0005] To register an unregistered word in the dictionary, it is first necessary to detect the speech segment of the unregistered word and to recognize the phoneme sequence of the speech in that segment. One method of recognizing the phoneme sequence of a given utterance is the so-called phoneme typewriter. A phoneme typewriter basically uses a garbage model that permits free transitions among all phonemes and outputs a phoneme sequence for the input speech.

[0006] Furthermore, to register an unregistered word in the dictionary, the phoneme sequence of the unregistered word must be clustered. That is, in the dictionary, the phoneme sequences of each word are registered after being clustered into the cluster for that word, so registering an unregistered word requires clustering its phoneme sequence.

【０００７】[0007]

[Problems to Be Solved by the Invention] One method of clustering the phoneme sequence of an unregistered word is to have the user input a headword representing the unregistered word (for example, its reading) and to cluster the phoneme sequence into the cluster represented by that headword. With this method, however, the user must input the headword, which is troublesome.

[0008] Another method is to generate a new cluster each time an unregistered word is detected and to cluster the phoneme sequence of the unregistered word into that new cluster. With this method, however, an entry corresponding to a new cluster is registered in the dictionary every time an unregistered word is detected, so the dictionary grows large and the amount of processing and the time required for subsequent speech recognition increase.

[0009] Furthermore, when an unregistered word is registered, it is processed on the basis of the speech input alone. Consequently, when the speaker has an accent, for example, the same word may be registered as different words or, conversely, different words may be registered as the same word.

[0010] To enable a robot to recognize colors, the robot must be made to learn colors. Conventionally, this is done by, for example, registering the specific values of the three primary colors (RGB) that represent a color captured by a video camera. The color of a newly captured image is then generally determined by comparing the colors contained in the image captured by the video camera with the colors registered in advance.

[0011] However, because red and orange, for example, are relatively similar colors, red captured by the video camera has sometimes been misrecognized as orange, and orange as red.
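The conventional color matching described in paragraphs [0010] and [0011] can be pictured with the following minimal sketch. It is an illustration only, not the method of the patent: the registered RGB values, the function name `nearest_color`, and the use of Euclidean distance in RGB space are all assumptions made for the example.

```python
import math

# Hypothetical registered colors; the RGB values are illustrative assumptions.
REGISTERED_COLORS = {
    "red": (255, 0, 0),
    "orange": (255, 165, 0),
    "blue": (0, 0, 255),
}


def nearest_color(rgb, registered=REGISTERED_COLORS):
    """Return the registered color name closest to rgb (Euclidean RGB distance)."""
    return min(registered, key=lambda name: math.dist(rgb, registered[name]))


# Two nearly identical reddish-orange pixels receive different labels,
# which is the kind of red/orange confusion the text describes.
print(nearest_color((255, 80, 0)))  # distance 80 to red, 85 to orange
print(nearest_color((255, 90, 0)))  # distance 90 to red, 75 to orange
```

Under these assumed values the decision boundary between "red" and "orange" lies at a green component of 82.5, so small variations in lighting or camera response are enough to flip the result.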

[0012] The present invention has been made in view of these circumstances. Its object is to cluster input information accurately and easily without enlarging the amount of registered data, and thereby to enable accurate recognition.

【００１３】[0013]

[Means for Solving the Problems] An information processing apparatus of the present invention is an information processing apparatus that recognizes input information on the basis of registered information, and comprises: first acquisition means for acquiring first information; second acquisition means for acquiring second information; first feature extraction means for extracting features of the first information acquired by the first acquisition means; second feature extraction means for extracting features of the second information acquired by the second acquisition means; clustering means for clustering the first information acquired by the first acquisition means, using the features of the first information extracted by the first feature extraction means and the features of the second information extracted by the second feature extraction means; and registration means for registering the first information as registered information on the basis of the result of the clustering by the clustering means.

[0014] The clustering means can cluster the first information on the basis of features obtained by weighted addition of the features of the first information extracted by the first feature extraction means and the features of the second information extracted by the second feature extraction means.
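As a rough illustration of clustering on a weighted combination of two modalities, the sketch below assigns an input to the cluster with the smallest weighted sum of a speech-feature distance and an image-feature distance. This is one possible reading of "weighted addition", not the formulation of the patent: the two-dimensional speech vectors, the normalized RGB image vectors, the cluster layout, and the names `combined_distance` and `assign_cluster` are all hypothetical.

```python
import math


def combined_distance(speech_vec, image_vec, cluster, w_speech=0.5, w_image=0.5):
    """Weighted sum of the per-modality distances to one cluster's representatives."""
    d_speech = math.dist(speech_vec, cluster["speech"])
    d_image = math.dist(image_vec, cluster["image"])
    return w_speech * d_speech + w_image * d_image


def assign_cluster(speech_vec, image_vec, clusters, w_speech=0.5, w_image=0.5):
    """Return the name of the cluster with the smallest combined distance."""
    return min(
        clusters,
        key=lambda name: combined_distance(
            speech_vec, image_vec, clusters[name], w_speech, w_image
        ),
    )


# Hypothetical clusters: a toy 2-D speech feature and a normalized RGB image feature.
CLUSTERS = {
    "red": {"speech": (1.0, 0.0), "image": (1.0, 0.0, 0.0)},
    "orange": {"speech": (0.9, 0.1), "image": (1.0, 0.65, 0.0)},
}

speech = (0.94, 0.06)      # acoustically slightly closer to "orange"
image = (0.95, 0.10, 0.05)  # visually much closer to "red"

print(assign_cluster(speech, image, CLUSTERS, w_speech=1.0, w_image=0.0))
print(assign_cluster(speech, image, CLUSTERS, w_speech=0.5, w_image=0.5))
```

With speech alone (image weight 0) the toy input falls into the "orange" cluster, whereas with equal weights the image feature pulls it into "red". The weights themselves are free parameters in this sketch.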

[0015] One of the first information and the second information can be speech, and the other can be an image.

[0016] The first or second feature extraction means can extract the features of the image as RGB values.

[0017] The first information can be the speech of an unregistered word that is not registered as registered information.

[0018] An information processing method of the present invention is an information processing method for an information processing apparatus that recognizes input information on the basis of registered information, and includes: a first acquisition step of acquiring first information; a second acquisition step of acquiring second information; a first feature extraction step of extracting features of the first information acquired in the first acquisition step; a second feature extraction step of extracting features of the second information acquired in the second acquisition step; a clustering step of clustering the first information acquired in the first acquisition step, using the features of the first information extracted in the first feature extraction step and the features of the second information extracted in the second feature extraction step; and a registration step of registering the first information as registered information on the basis of the result of the clustering in the clustering step.

[0019] A program recorded on a recording medium of the present invention is a program for an information processing apparatus that recognizes input information on the basis of registered information, and includes: a first acquisition step of acquiring first information; a second acquisition step of acquiring second information; a first feature extraction step of extracting features of the first information acquired in the first acquisition step; a second feature extraction step of extracting features of the second information acquired in the second acquisition step; a clustering step of clustering the first information acquired in the first acquisition step, using the features of the first information extracted in the first feature extraction step and the features of the second information extracted in the second feature extraction step; and a registration step of registering the first information as registered information on the basis of the result of the clustering in the clustering step.

[0020] A program of the present invention is a program executable by a computer that controls an information processing apparatus that recognizes input information on the basis of registered information, and includes: a first acquisition step of acquiring first information; a second acquisition step of acquiring second information; a first feature extraction step of extracting features of the first information acquired in the first acquisition step; a second feature extraction step of extracting features of the second information acquired in the second acquisition step; a clustering step of clustering the first information acquired in the first acquisition step, using the features of the first information extracted in the first feature extraction step and the features of the second information extracted in the second feature extraction step; and a registration step of registering the first information as registered information on the basis of the result of the clustering in the clustering step.

[0021] In the information processing apparatus, information processing method, recording medium, and program of the present invention, the first information is clustered using the features of the first information and the features of the second information. The first information is then registered as registered information on the basis of the result of the clustering.

【００２２】[0022]

[Embodiments of the Invention] FIG. 1 shows an example of the external configuration of an embodiment of a robot to which the present invention is applied, and FIG. 2 shows an example of its electrical configuration.

[0023] In this embodiment, the robot 1 has the shape of a four-legged animal such as a dog. Leg units 3A, 3B, 3C, and 3D are connected to the front, rear, left, and right of a body unit 2, and a head unit 4 and a tail unit 5 are connected to the front end and the rear end of the body unit 2, respectively.

[0024] The tail unit 5 extends from a base part 5B provided on the upper surface of the body unit 2 so that it can bend or swing freely with two degrees of freedom.

[0025] The body unit 2 houses a controller 10 that controls the robot 1 as a whole, a battery 11 that serves as the power source of the robot 1, an internal sensor unit 14 made up of a battery sensor 12 and a heat sensor 13, and other components.

[0026] The head unit 4 is provided, at predetermined positions, with a microphone 15 corresponding to the "ears", a video camera 16 corresponding to the "eyes" and made up of a CCD (Charge Coupled Device), a CMOS (Complementary Metal Oxide Semiconductor) device, or the like, a touch sensor 17 corresponding to the sense of touch, a speaker 18 corresponding to the "mouth", and so on. In addition, a lower jaw part 4A corresponding to the lower jaw of the mouth is movably attached to the head unit 4 with one degree of freedom, and movement of the lower jaw part 4A opens and closes the mouth of the robot 1.

[0027] As shown in FIG. 2, actuators 3AA_1 to 3AA_K, 3BA_1 to 3BA_K, 3CA_1 to 3CA_K, 3DA_1 to 3DA_K, 4A_1 to 4A_L, 5A_1, and 5A_2 are provided at the joints of the leg units 3A to 3D, at the connections between each of the leg units 3A to 3D and the body unit 2, at the connection between the head unit 4 and the body unit 2, at the connection between the head unit 4 and the lower jaw part 4A, at the connection between the tail unit 5 and the body unit 2, and so on.

[0028] The microphone 15 in the head unit 4 picks up surrounding sound, including utterances from the user, and sends the resulting audio signal to the controller 10. The video camera 16 captures images of the surroundings and sends the resulting image signal to the controller 10.

[0029] The touch sensor 17 is provided, for example, on the top of the head unit 4. It detects the pressure applied by physical actions from the user such as "stroking" or "hitting", and sends the detection result to the controller 10 as a pressure detection signal.

[0030] The battery sensor 12 in the body unit 2 detects the remaining charge of the battery 11 and sends the detection result to the controller 10 as a remaining-battery detection signal. The heat sensor 13 detects heat inside the robot 1 and sends the detection result to the controller 10 as a heat detection signal.

[0031] The controller 10 incorporates a CPU (Central Processing Unit) 10A, a memory 10B, and the like. Various kinds of processing are performed by the CPU 10A executing a control program stored in the memory 10B.

[0032] That is, on the basis of the audio signal, image signal, pressure detection signal, remaining-battery detection signal, or heat detection signal supplied from the microphone 15, the video camera 16, the touch sensor 17, the battery sensor 12, or the heat sensor 13, the controller 10 judges the surrounding situation and whether there has been a command from the user, an action by the user toward the robot, and so on.

[0033] Furthermore, the controller 10 decides the next action on the basis of these judgments and, according to that decision, drives whichever of the actuators 3AA_1 to 3AA_K, 3BA_1 to 3BA_K, 3CA_1 to 3CA_K, 3DA_1 to 3DA_K, 4A_1 to 4A_L, 5A_1, and 5A_2 are needed. This causes the head unit 4 to swing up, down, left, and right, or the lower jaw part 4A to open and close. It also moves the tail unit 5, or drives the leg units 3A to 3D so that the robot 1 performs actions such as walking.

[0034] In addition, the controller 10 generates synthesized speech as needed and supplies it to the speaker 18 for output, or turns on, turns off, or blinks LEDs (Light Emitting Diodes), not shown, provided at the positions of the robot's "eyes".

[0035] In this way, the robot 1 acts autonomously on the basis of the surrounding situation and the like.

[0036] FIG. 3 shows an example of the functional configuration of the controller 10 of FIG. 2. The functional configuration shown in FIG. 3 is realized by the CPU 10A executing the control program stored in the memory 10B.

[0037] The controller 10 is made up of: a sensor input processing unit 50 that recognizes specific external states; a model storage unit 51 that accumulates the recognition results of the sensor input processing unit 50 and expresses states such as emotion, instinct, and growth; an action determination mechanism unit 52 that decides the next action on the basis of the recognition results of the sensor input processing unit 50 and the like; a posture transition mechanism unit 53 that actually causes the robot 1 to act on the basis of the decision of the action determination mechanism unit 52; a control mechanism unit 54 that drives and controls the actuators 3AA_1 to 3DA_K, 5A_1, and 5A_2; and a speech synthesis unit 55 that generates synthesized speech.

[0038] On the basis of the audio signal, image signal, pressure detection signal, and so on supplied from the microphone 15, the video camera 16, the touch sensor 17, and the like, the sensor input processing unit 50 recognizes specific external states, specific actions by the user toward the robot, instructions from the user, and so on, and notifies the model storage unit 51 and the action determination mechanism unit 52 of state recognition information representing the recognition result.

[0039] That is, the sensor input processing unit 50 has a speech recognition unit 50A, which performs speech recognition on the audio signal supplied from the microphone 15. The speech recognition unit 50A then notifies the model storage unit 51 and the action determination mechanism unit 52 of the speech recognition result, for example a command such as "walk", "lie down", or "chase the red ball", as state recognition information.

[0040] The sensor input processing unit 50 also has an image recognition unit 50B, which performs image recognition processing using the image signal supplied from the video camera 16. When, as a result of that processing, the image recognition unit 50B detects, for example, "something red and round" or "a plane perpendicular to the ground and at least a predetermined height", it notifies the model storage unit 51 and the action determination mechanism unit 52 of an image recognition result such as "there is a red ball" or "there is a wall" as state recognition information.

[0041] Furthermore, the sensor input processing unit 50 has a pressure processing unit 50C, which processes the pressure detection signal supplied from the touch sensor 17. When, as a result of that processing, the pressure processing unit 50C detects pressure that is at or above a predetermined threshold and of short duration, it recognizes that the robot has been "hit (scolded)"; when it detects pressure that is below the predetermined threshold and of long duration, it recognizes that the robot has been "stroked (praised)". It notifies the model storage unit 51 and the action determination mechanism unit 52 of the recognition result as state recognition information.
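The threshold rule of paragraph [0041] can be sketched as a small classifier. The text gives no numeric values, so the pressure threshold, the duration boundary, the units, and the name `classify_touch` below are assumptions chosen only to make the example runnable.

```python
def classify_touch(pressure, duration_s, pressure_threshold=0.5, short_duration_s=0.3):
    """Classify a touch event by pressure level and duration.

    pressure_threshold and short_duration_s are hypothetical values;
    the source only speaks of "a predetermined threshold" and of short
    or long durations.
    """
    is_strong = pressure >= pressure_threshold
    is_short = duration_s < short_duration_s
    if is_strong and is_short:
        return "hit (scolded)"
    if not is_strong and not is_short:
        return "stroked (praised)"
    # The text does not say how the remaining combinations are handled.
    return None


print(classify_touch(0.9, 0.1))  # strong, short press
print(classify_touch(0.2, 1.5))  # gentle, long press
```

Returning `None` for the two combinations that the text leaves open (strong and long, weak and short) is a design choice of this sketch.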

[0042] The model storage unit 51 stores and manages an emotion model, an instinct model, and a growth model, which express the robot's states of emotion, instinct, and growth, respectively.

[0043] The emotion model represents the state (degree) of each emotion, such as "joy", "sadness", "anger", and "pleasure", by a value in a predetermined range (for example, -1.0 to 1.0), and changes that value on the basis of the state recognition information from the sensor input processing unit 50, the passage of time, and so on. The instinct model represents the state (degree) of each instinctive desire, such as "appetite", "desire for sleep", and "desire for exercise", by a value in a predetermined range, and changes that value on the basis of the state recognition information from the sensor input processing unit 50, the passage of time, and so on. The growth model represents the state (degree) of each stage of growth, such as "childhood", "adolescence", "middle age", and "old age", by a value in a predetermined range, and changes that value on the basis of the state recognition information from the sensor input processing unit 50, the passage of time, and so on.
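A bounded model value of the kind described in paragraph [0043] might be kept as in the sketch below. Only the example range of -1.0 to 1.0 and the emotion names come from the text. The class name `EmotionModel`, the update amounts, and the linear decay toward zero over time are assumptions, since the source does not say how the values change.

```python
from dataclasses import dataclass, field


def clamp(value, low=-1.0, high=1.0):
    """Keep a value inside the predetermined range."""
    return max(low, min(high, value))


@dataclass
class EmotionModel:
    # One value per emotion, each kept within [-1.0, 1.0].
    values: dict = field(
        default_factory=lambda: {"joy": 0.0, "sadness": 0.0, "anger": 0.0, "pleasure": 0.0}
    )

    def update(self, name, delta):
        """Apply a change triggered by state recognition information."""
        self.values[name] = clamp(self.values[name] + delta)

    def decay(self, elapsed_s, rate=0.01):
        """Hypothetical time-based change: drift each value toward 0."""
        step = rate * elapsed_s
        for name, value in self.values.items():
            if value > 0:
                self.values[name] = max(0.0, value - step)
            else:
                self.values[name] = min(0.0, value + step)


model = EmotionModel()
model.update("joy", 0.7)
model.update("joy", 0.7)   # saturates at the upper bound
model.decay(elapsed_s=10, rate=0.05)
print(model.values["joy"])
```

The instinct and growth models could reuse the same structure with different keys.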

[0044] The model storage unit 51 sends the states of emotion, instinct, and growth, represented as described above by the values of the emotion model, instinct model, and growth model, to the action determination mechanism unit 52 as state information.

[0045] In addition to the state recognition information supplied from the sensor input processing unit 50, the model storage unit 51 is supplied by the action determination mechanism unit 52 with action information indicating the robot's current or past actions, specifically the content of an action such as "walked for a long time". Even when given the same state recognition information, the model storage unit 51 generates different state information depending on the action of the robot 1 indicated by the action information.

[0046] That is, when, for example, the robot 1 greets the user and the user strokes its head, action information indicating that the robot greeted the user and state recognition information indicating that its head was stroked are given to the model storage unit 51. In this case, the model storage unit 51 increases the value of the emotion model representing "joy".

[0047] On the other hand, when the robot 1 has its head stroked while it is carrying out some task, action information indicating that it is carrying out a task and state recognition information indicating that its head was stroked are given to the model storage unit 51. In this case, the model storage unit 51 does not change the value of the emotion model representing "joy".

[0048] In this way, the model storage unit 51 sets the values of the emotion model by referring not only to the state recognition information but also to the action information indicating the current or past actions of the robot 1. This avoids unnatural changes in emotion, such as the value of the emotion model representing "joy" increasing when the user strokes the robot's head as a prank while the robot is carrying out some task.
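The dependence on both state recognition information and action information described in paragraphs [0046] to [0048] can be sketched as a lookup keyed on the pair. The string labels, the increment of 0.2, and the names `JOY_DELTAS` and `update_joy` are assumptions for illustration. The source only says that joy increases in the greeting case and stays unchanged in the task case.

```python
# Hypothetical change in "joy" for each (state recognition, action) pair.
JOY_DELTAS = {
    ("head stroked", "greeted user"): 0.2,   # stroking after a greeting: joy rises
    ("head stroked", "executing task"): 0.0,  # stroking during a task: no change
}


def update_joy(joy, state_recognition, action):
    """Return the new joy value, clamped to the example range [-1.0, 1.0]."""
    delta = JOY_DELTAS.get((state_recognition, action), 0.0)
    return max(-1.0, min(1.0, joy + delta))


print(update_joy(0.0, "head stroked", "greeted user"))
print(update_joy(0.0, "head stroked", "executing task"))
```

The same state recognition information, "head stroked", therefore leads to different results depending on what the robot was doing at the time.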

[0049] For the instinct model and the growth model as well, the model storage unit 51 increases or decreases their values on the basis of both the state recognition information and the action information, as it does for the emotion model. The model storage unit 51 also increases or decreases the values of the emotion model, instinct model, and growth model on the basis of the values of the other models.

[0050] The action determination mechanism unit 52 decides the next action on the basis of the state recognition information from the sensor input processing unit 50, the state information from the model storage unit 51, the passage of time, and so on, and sends the content of the decided action to the posture transition mechanism unit 53 as action command information.

【００５１】即ち、行動決定機構部５２は、ロボットが
とり得る行動をステート（状態）(state)に対応させた
有限オートマトンを、ロボットの行動を規定する行動モ
デルとして管理しており、この行動モデルとしての有限
オートマトンにおけるステートを、センサ入力処理部５
０からの状態認識情報、モデル記憶部５１における感情
モデル、本能モデル、または成長モデルの値、時間経過
等に基づいて遷移させ、遷移後のステートに対応する行
動を、次にとるべき行動として決定する。That is, the action determination mechanism unit 52 manages, as an action model that defines the actions of the robot, a finite automaton in which the actions the robot can take are associated with states. It transitions the state of this finite automaton based on the state recognition information from the sensor input processing unit 50, the values of the emotion model, instinct model, or growth model in the model storage unit 51, the passage of time, and the like, and determines the action corresponding to the state after the transition as the action to be taken next.

【００５２】行動決定機構部５２は、所定のトリガ(tri
gger)があったことを検出すると、ステートを遷移させ
る。即ち、行動決定機構部５２は、例えば、現在のステ
ートに対応する行動を実行している時間が所定時間に達
したときや、特定の状態認識情報を受信したとき、モデ
ル記憶部５１から供給される状態情報が示す感情、本
能、成長の状態の値が所定の閾値以下または以上になっ
たとき等に、ステートを遷移させる。The action determination mechanism unit 52 transitions the state when it detects a predetermined trigger. That is, it transitions the state when, for example, the time spent executing the action corresponding to the current state reaches a predetermined time, when specific state recognition information is received, or when the value of the emotion, instinct, or growth state indicated by the state information supplied from the model storage unit 51 falls to or below, or rises to or above, a predetermined threshold.

【００５３】なお、行動決定機構部５２は、上述したよ
うに、センサ入力処理部５０からの状態認識情報だけで
なく、モデル記憶部５１における感情モデル、本能モデ
ル、成長モデルの値等にも基づいて、行動モデルにおけ
るステートを遷移させることから、同一の状態認識情報
が入力されても、感情モデル、本能モデル、成長モデル
の値（状態情報）によっては、ステートの遷移先は異な
るものとなる。As described above, the action determination mechanism unit 52 transitions the state in the action model based not only on the state recognition information from the sensor input processing unit 50 but also on the values of the emotion model, instinct model, growth model, and the like in the model storage unit 51. Consequently, even when the same state recognition information is input, the transition destination of the state differs depending on the values of the emotion model, instinct model, and growth model (the state information).

【００５４】その結果、行動決定機構部５２は、例え
ば、状態情報が、「怒っていない」こと、および「お腹
がすいていない」ことを表している場合において、状態
認識情報が、「目の前に手のひらが差し出された」こと
を表しているときには、目の前に手のひらが差し出され
たことに応じて、「お手」という行動をとらせる行動指
令情報を生成し、これを、姿勢遷移機構部５３に送出す
る。As a result, for example, when the state information indicates "not angry" and "not hungry" and the state recognition information indicates "a palm has been held out in front of the eyes", the action determination mechanism unit 52 generates action command information for the action "give paw" in response to the palm being held out, and sends it to the posture transition mechanism unit 53.

【００５５】また、状態認識情報が「赤いボールを追い
かけろ」という音声認識結果が得られたことを表してい
るときには、「赤いボールを追いかける」という行動を
取らせる行動指令情報が生成され、これが姿勢遷移機構
部５３に送出される。When the state recognition information indicates that the speech recognition result "chase the red ball" has been obtained, action command information for the action "chase the red ball" is generated and sent to the posture transition mechanism unit 53.

【００５６】さらに、行動決定機構部５２は、例えば、
状態情報が、「怒っていない」こと、および「お腹がす
いている」ことを表している場合において、状態認識情
報が、「目の前に手のひらが差し出された」ことを表し
ているときには、目の前に手のひらが差し出されたこと
に応じて、「手のひらをぺろぺろなめる」ような行動を
行わせるための行動指令情報を生成し、これを、姿勢遷
移機構部５３に送出する。Further, for example, when the state information indicates "not angry" and "hungry" and the state recognition information indicates "a palm has been held out in front of the eyes", the action determination mechanism unit 52 generates action command information for an action such as "licking the palm" in response to the palm being held out, and sends it to the posture transition mechanism unit 53.

【００５７】また、行動決定機構部５２は、例えば、状
態情報が、「怒っている」ことを表している場合におい
て、状態認識情報が、「目の前に手のひらが差し出され
た」ことを表しているときには、状態情報が、「お腹が
すいている」ことを表していても、また、「お腹がすい
ていない」ことを表していても、「ぷいと横を向く」よ
うな行動を行わせるための行動指令情報を生成し、これ
を、姿勢遷移機構部５３に送出する。In addition, for example, when the state information indicates "angry" and the state recognition information indicates "a palm has been held out in front of the eyes", the action determination mechanism unit 52 generates action command information for an action such as "turning away in a huff", regardless of whether the state information indicates "hungry" or "not hungry", and sends it to the posture transition mechanism unit 53.
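The three palm examples above can be summarised as a small decision function. This is only an illustrative sketch: the class StateInfo, the function decide_action, and the action labels are hypothetical names chosen here, and the real action model is a finite automaton with many more states and triggers than shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateInfo:
    """Hypothetical snapshot of the state information (emotion/instinct values)."""
    angry: bool
    hungry: bool


def decide_action(recognized: str, state: StateInfo) -> str:
    """Map one state recognition result to an action label.

    The same recognition result leads to different actions depending on
    the state information, as in the palm examples in the text.
    """
    if recognized == "palm_in_front":
        if state.angry:
            return "turn_away"   # chosen whether or not the robot is hungry
        if state.hungry:
            return "lick_palm"
        return "give_paw"
    return "no_action"           # placeholder for all other inputs
```

For instance, decide_action("palm_in_front", StateInfo(angry=False, hungry=True)) selects the palm-licking action, while any angry state selects turning away.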

【００５８】なお、行動決定機構部５２では、上述した
ように、ロボット１の頭部や手足等を動作させる行動指
令情報の他、ロボット１に発話を行わせる行動指令情報
も生成される。ロボット１に発話を行わせる行動指令情
報は、音声合成部５５に供給されるようになっており、
音声合成部５５に供給される行動指令情報には、音声合
成部５５に生成させる合成音に対応するテキスト等が含
まれる。そして、音声合成部５５は、行動決定機構部５
２から行動指令情報を受信すると、その行動指令情報に
含まれるテキストに基づき、合成音を生成し、スピーカ
１８に供給して出力させる。これにより、スピーカ１８
からは、例えば、ロボット１の鳴き声、さらには、「お
腹がすいた」等のユーザへの各種の要求、「何？」等の
ユーザの呼びかけに対する応答、その他の音声出力が行
われる。また、行動決定機構部５２は、合成音を出力す
る場合には、下顎部４Ａを開閉させる行動指令情報を、
必要に応じて生成し、姿勢遷移機構部５３に出力する。
この場合、合成音の出力に同期して、下顎部４Ａが開閉
し、ユーザに、ロボット１がしゃべっているかのような
印象を与えることができる。As described above, in addition to action command information for moving the head, limbs, and other parts of the robot 1, the action determination mechanism unit 52 also generates action command information for causing the robot 1 to speak. The action command information for causing the robot 1 to speak is supplied to the speech synthesis unit 55 and includes text or the like corresponding to the synthesized sound to be generated by the speech synthesis unit 55. On receiving action command information from the action determination mechanism unit 52, the speech synthesis unit 55 generates synthesized sound based on the text contained in it and supplies the sound to the speaker 18 for output. The speaker 18 thus outputs, for example, the cries of the robot 1, various requests to the user such as "I'm hungry", responses to the user's calls such as "What?", and other speech. When synthesized sound is to be output, the action determination mechanism unit 52 also generates, as necessary, action command information for opening and closing the lower jaw 4A and outputs it to the posture transition mechanism unit 53. In this case, the lower jaw 4A opens and closes in synchronization with the output of the synthesized sound, giving the user the impression that the robot 1 is speaking.

【００５９】姿勢遷移機構部５３は、行動決定機構部５
２から供給される行動指令情報に基づいて、ロボット１
の姿勢を、現在の姿勢から次の姿勢に遷移させるための
姿勢遷移情報を生成し、これを制御機構部５４に送出す
る。Based on the action command information supplied from the action determination mechanism unit 52, the posture transition mechanism unit 53 generates posture transition information for changing the posture of the robot 1 from the current posture to the next posture, and sends it to the control mechanism unit 54.

【００６０】制御機構部５４は、姿勢遷移機構部５３か
らの姿勢遷移情報にしたがって、アクチュエータ３ＡＡ
₁乃至３ＤＡ_K, ５Ａ₁および５Ａ₂を駆動するための制御
信号を生成し、これを、アクチュエータ３ＡＡ₁乃至３
ＤＡ_K, ５Ａ₁および５Ａ₂に送出する。これにより、ア
クチュエータ３ＡＡ₁乃至３ＤＡ_K, ５Ａ₁および５Ａ
₂は、制御信号にしたがって駆動し、ロボット１は、自
律的に行動を起こす。The control mechanism unit 54 generates control signals for driving the actuators 3AA₁ to 3DA_K, 5A₁, and 5A₂ in accordance with the posture transition information from the posture transition mechanism unit 53, and sends them to the actuators 3AA₁ to 3DA_K, 5A₁, and 5A₂. The actuators 3AA₁ to 3DA_K, 5A₁, and 5A₂ are thereby driven in accordance with the control signals, and the robot 1 acts autonomously.

【００６１】図４は、図３の音声認識部５０Ａと画像認
識部５０Ｂの構成例を示している。FIG. 4 shows a configuration example of the speech recognition unit 50A and the image recognition unit 50B of FIG. 3.

【００６２】マイク１５からの音声信号は、ＡＤ(Analo
g Digital)変換部２１に供給される。ＡＤ変換部２１
は、マイク１５からのアナログ信号である音声信号をサ
ンプリング、量子化し、ディジタル信号である音声デー
タにＡＤ変換する。この音声データは、特徴抽出部２２
に供給される。The audio signal from the microphone 15 is supplied to the AD (Analog Digital) conversion unit 21. The AD conversion unit 21 samples and quantizes the audio signal, which is an analog signal from the microphone 15, and AD-converts it into audio data, which is a digital signal. This audio data is supplied to the feature extraction unit 22.

【００６３】特徴抽出部２２は、そこに入力される音声
データについて、適当なフレームごとに、例えば、ＭＦ
ＣＣ(Mel Frequency Cepstrum Coefficient)分析を行
い、その分析の結果得られるＭＦＣＣを、特徴ベクトル
（特徴パラメータ）として、マッチング部２３と未登録
語区間処理部２７に出力する。なお、特徴抽出部２２で
は、その他、例えば、線形予測係数、ケプストラム係
数、線スペクトル対、所定の周波数帯域ごとのパワー
（フィルタバンクの出力）等を、特徴ベクトルとして抽
出することが可能である。The feature extraction unit 22 performs, for example, MFCC (Mel Frequency Cepstrum Coefficient) analysis on the audio data input to it for each appropriate frame, and outputs the resulting MFCCs as feature vectors (feature parameters) to the matching unit 23 and the unregistered word section processing unit 27. The feature extraction unit 22 can also extract other quantities as feature vectors, such as linear prediction coefficients, cepstrum coefficients, line spectrum pairs, or the power in each predetermined frequency band (filter bank output).

【００６４】マッチング部２３は、特徴抽出部２２から
の特徴ベクトルを用いて、音響モデル記憶部２４、辞書
記憶部２５、および文法記憶部２６を必要に応じて参照
しながら、マイク１５に入力された音声（入力音声）
を、例えば、連続分布ＨＭＭ(Hidden Markov Model)法
に基づいて音声認識する。Using the feature vectors from the feature extraction unit 22, the matching unit 23 recognizes the speech input to the microphone 15 (the input speech) based on, for example, the continuous-distribution HMM (Hidden Markov Model) method, referring as necessary to the acoustic model storage unit 24, the dictionary storage unit 25, and the grammar storage unit 26.

【００６５】即ち、音響モデル記憶部２４は、音声認識
する音声の言語における個々の音素や、音節、音韻など
のサブワードについて音響的な特徴を表す音響モデル
（例えば、ＨＭＭの他、ＤＰ(Dynamic Programming)マ
ッチングに用いられる標準パターン等を含む）を記憶し
ている。なお、ここでは、連続分布ＨＭＭ法に基づいて
音声認識を行うこととしているので、音響モデルとして
は、ＨＭＭ(Hidden Markov Model)が用いられる。That is, the acoustic model storage unit 24 stores acoustic models (including, for example, HMMs as well as standard patterns used for DP (Dynamic Programming) matching) that represent the acoustic features of subwords such as individual phonemes, syllables, and phonological units in the language of the speech to be recognized. Since speech recognition is performed here based on the continuous-distribution HMM method, HMMs (Hidden Markov Models) are used as the acoustic models.

【００６６】辞書記憶部２５は、認識対象の各単語ごと
にクラスタリングされた、その単語の発音に関する情報
（音韻情報）と、その単語の見出しとが対応付けられて
いるとともに、必要に応じて、さらに、画像に関する情
報（RGB値）が、対応付けられた単語辞書を記憶してい
る。The dictionary storage unit 25 stores a word dictionary in which, for each word to be recognized, clustered information on the pronunciation of the word (phonological information) is associated with the heading of the word and, where necessary, with information on an image (RGB values).

【００６７】図５は、辞書記憶部２５に記憶された単語
辞書７１の例を示している。FIG. 5 shows an example of the word dictionary 71 stored in the dictionary storage unit 25.

【００６８】図５に示されるように、単語辞書７１にお
いては、単語の見出しと、その音韻系列とが対応付けら
れており、音韻系列は、対応する単語ごとにクラスタリ
ングされている。図５の単語辞書７１では、１つのエン
トリ（図５の１行）が、１つのクラスタに相当する。As shown in FIG. 5, in the word dictionary 71, the heading of each word is associated with its phoneme sequence, and the phoneme sequences are clustered for each corresponding word. In the word dictionary 71 of FIG. 5, one entry (one row in FIG. 5) corresponds to one cluster.

【００６９】また、図５の単語辞書７１においては、ak
a［赤］の見出しに対して、RGB値として、（２１５，０
３５，０９３）の値が対応されており、orenji［オレン
ジ］の見出しに対して、（１６７，０９５，０５６）の
RGB値が対応されている。In the word dictionary 71 of FIG. 5, the RGB value (215, 035, 093) is associated with the heading aka [赤, red], and the RGB value (167, 095, 056) is associated with the heading orenji [オレンジ, orange].

【００７０】なお、図５においては、見出しは、ローマ
字と日本語（仮名漢字）で表してあり、音韻系列は、ロ
ーマ字で表してある。但し、音韻系列における「N」
は、撥音「ん」を表す。また、図５では、１つのエント
リに、１つの音韻系列を記述してあるが、１つのエント
リには、複数の音韻系列を記述することも可能である。In FIG. 5, the headings are written in Roman letters and in Japanese (kana and kanji), and the phoneme sequences are written in Roman letters. Note that "N" in a phoneme sequence represents the moraic nasal "ん". Although FIG. 5 shows one phoneme sequence per entry, a single entry can also contain multiple phoneme sequences.

【００７１】図４に戻り、文法記憶部２６は、辞書記憶
部２５の単語辞書に登録されている各単語が、どのよう
に連鎖する（つながる）かを記述した文法規則を記憶し
ている。Returning to FIG. 4, the grammar storage unit 26 stores grammar rules describing how the words registered in the word dictionary of the dictionary storage unit 25 are linked (connected).

【００７２】図６は、文法記憶部２６に記憶された文法
規則８１の例を示している。なお、図６の文法規則８１
は、ＥＢＮＦ(Extended Backus Naur Form)で記述され
ている。FIG. 6 shows an example of the grammar rules 81 stored in the grammar storage unit 26. The grammar rules 81 in FIG. 6 are written in EBNF (Extended Backus Naur Form).

【００７３】図６においては、行頭から、最初に現れる
「;」までが、１つの文法規則を表している。また、先
頭に「$」が付されたアルファベット（列）は、変数を
表し、「$」が付されていないアルファベット（列）
は、単語の見出し（図５に示したローマ字による見出
し）を表す。さらに、[]で囲まれた部分は、省略可能で
あることを表し、「|」は、その前後に配置された見出
しの単語（あるいは変数）のうちのいずれか一方を選択
することを表す。In FIG. 6, the text from the beginning of a line up to the first ";" represents one grammar rule. An alphabetic string prefixed with "$" represents a variable, and an alphabetic string without "$" represents the heading of a word (the romanized heading shown in FIG. 5). A portion enclosed in [] is optional, and "|" indicates that one of the heading words (or variables) placed before and after it is selected.

【００７４】従って、図６において、例えば、第１行
（上から１行目）の文法規則「$col =[kono | sono] ir
o wa;」は、変数$colが、「このいろ（色）は」または
「そのいろ（色）は」という単語列であることを表す。Therefore, in FIG. 6, for example, the grammar rule "$col = [kono | sono] iro wa;" on the first line (the first line from the top) indicates that the variable $col is the word string "kono iro wa" ("this color is") or "sono iro wa" ("that color is").

【００７５】なお、図６に示した文法規則８１において
は、変数$silと$garbageが定義されていないが、変数$s
ilは、無音の音響モデル（無音モデル）を表し、変数$g
arbageは、基本的には、音韻どうしの間での自由な遷移
を許可したガーベジモデルを表す。In the grammar rules 81 shown in FIG. 6, the variables $sil and $garbage are not defined; the variable $sil represents a silent acoustic model (silence model), and the variable $garbage basically represents a garbage model that permits free transitions between phonemes.
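The EBNF notation described above can be illustrated by enumerating the word strings that the rule "$col = [kono | sono] iro wa;" admits. The sketch below does not parse EBNF; the rule is written by hand as a list of slots, and the names expand and COL_RULE are hypothetical. Read literally, the optional brackets also admit the bare string "iro wa", in addition to the two word strings named in the text.

```python
from itertools import product


def expand(rule):
    """Enumerate the word strings admitted by a rule.

    The rule is a list of slots; each slot is a list of alternatives.
    An empty string marks an optional element that has been omitted.
    """
    strings = []
    for choice in product(*rule):
        strings.append(" ".join(word for word in choice if word))
    return strings


# Hand-written form of: $col = [kono | sono] iro wa;
COL_RULE = [["", "kono", "sono"], ["iro"], ["wa"]]
```

Calling expand(COL_RULE) lists every word string the rule covers, which is what the matching unit would use when connecting word models.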

【００７６】再び図４に戻り、マッチング部２３は、辞
書記憶部２５の単語辞書７１を参照することにより、音
響モデル記憶部２４に記憶されている音響モデルを接続
することで、単語の音響モデル（単語モデル）を構成す
る。さらに、マッチング部２３は、幾つかの単語モデル
を、文法記憶部２６に記憶された文法規則８１を参照す
ることにより接続し、そのようにして接続された単語モ
デルを用いて、特徴ベクトルに基づき、連続分布ＨＭＭ
法によって、マイク１５に入力された音声を認識する。
即ち、マッチング部２３は、特徴抽出部２２が出力する
時系列の特徴ベクトルが観測されるスコア（尤度）が最
も高い単語モデルの系列を検出し、その単語モデルの系
列に対応する単語列の見出しを、音声の認識結果として
出力する。Returning to FIG. 4 again, the matching unit 23 constructs an acoustic model of each word (a word model) by connecting the acoustic models stored in the acoustic model storage unit 24 with reference to the word dictionary 71 in the dictionary storage unit 25. The matching unit 23 further connects several word models with reference to the grammar rules 81 stored in the grammar storage unit 26, and uses the word models connected in this way to recognize the speech input to the microphone 15 by the continuous-distribution HMM method based on the feature vectors. That is, the matching unit 23 detects the sequence of word models with the highest score (likelihood) of observing the time series of feature vectors output by the feature extraction unit 22, and outputs the headings of the word string corresponding to that sequence of word models as the speech recognition result.

【００７７】より具体的には、マッチング部２３は、接
続された単語モデルに対応する単語列について、各特徴
ベクトルの出現確率（出力確率）を累積し、その累積値
をスコアとして、そのスコアを最も高くする単語列の見
出しを、音声認識結果として出力する。More specifically, for the word string corresponding to each set of connected word models, the matching unit 23 accumulates the appearance probabilities (output probabilities) of the feature vectors, takes the accumulated value as the score, and outputs the headings of the word string with the highest score as the speech recognition result.
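The score accumulation described above can be sketched as follows. This is a simplification under stated assumptions: the per-frame output probabilities are taken as given rather than computed from HMMs, accumulation is done in the log domain (a common choice, not stated in the text), and the names sequence_score and best_word_string are hypothetical.

```python
import math


def sequence_score(output_probs):
    """Accumulate per-frame output probabilities as a sum of logarithms."""
    return sum(math.log(p) for p in output_probs)


def best_word_string(candidates):
    """Return the heading of the candidate word string with the highest score.

    candidates maps a heading (str) to the list of per-frame output
    probabilities obtained for the connected word models of that string.
    """
    return max(candidates, key=lambda heading: sequence_score(candidates[heading]))
```

Working in the log domain keeps the accumulated value numerically stable for long utterances while preserving the ranking of the candidates.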

【００７８】また、マッチング部２３は、画像認識部５
０Ｂの特徴抽出部４２により抽出された画像の特徴ベク
トルを、特徴抽出部２２により抽出された音声の特徴ベ
クトルとともに必要に応じて利用して、上述したマッチ
ング処理を実行する。The matching unit 23 also performs the matching process described above using, as necessary, the image feature vectors extracted by the feature extraction unit 42 of the image recognition unit 50B together with the speech feature vectors extracted by the feature extraction unit 22.

【００７９】以上のようにして出力される、マイク１５
に入力された音声の認識結果は、状態認識情報として、
モデル記憶部５１および行動決定機構部５２に出力され
る。The recognition result for the speech input to the microphone 15, output as described above, is supplied as state recognition information to the model storage unit 51 and the action determination mechanism unit 52.

【００８０】図６の実施の形態では、第９行（上から９
行目）に、ガーベジモデルを表す変数$garbageを用いた
文法規則（以下、適宜、未登録語用規則という）「$pat
1 =$color1 $garbage $color2;」があるが、マッチング
部２３は、この未登録語用規則が適用された場合には、
変数$garbageに対応する音声区間を、未登録語の音声区
間として検出する。さらに、マッチング部２３は、未登
録語用規則が適用された場合における変数$garbageが表
すガーベジモデルにおける音韻の遷移としての音韻系列
を、未登録語の音韻系列として検出する。そして、マッ
チング部２３は、未登録語用規則が適用された音声認識
結果が得られた場合に検出される未登録語の音声区間と
音韻系列を、未登録語区間処理部２７に供給する。In the embodiment of FIG. 6, the ninth line (the ninth line from the top) contains a grammar rule that uses the variable $garbage representing the garbage model, "$pat1 = $color1 $garbage $color2;" (hereinafter referred to as the unregistered word rule where appropriate). When this unregistered word rule is applied, the matching unit 23 detects the speech section corresponding to the variable $garbage as the speech section of an unregistered word. The matching unit 23 further detects, as the phoneme sequence of the unregistered word, the phoneme sequence given by the phoneme transitions in the garbage model represented by $garbage when the unregistered word rule is applied. The matching unit 23 then supplies the speech section and phoneme sequence of the unregistered word, detected when a speech recognition result to which the unregistered word rule applies is obtained, to the unregistered word section processing unit 27.

【００８１】なお、上述の未登録語用規則「$pat1 = $c
olor1 $garbage $color2;」によれば、変数$color1で表
される、単語辞書に登録されている単語（列）の音韻系
列と、変数$color2で表される、単語辞書に登録されて
いる単語（列）の音韻系列との間にある１つの未登録語
が検出されるが、本発明は、発話に、複数の未登録語が
含まれている場合や、未登録語が、単語辞書に登録され
ている単語（列）の間に挟まれていない場合であって
も、適用可能である。According to the unregistered word rule "$pat1 = $color1 $garbage $color2;" described above, one unregistered word is detected between the phoneme sequence of a word (string) registered in the word dictionary, represented by the variable $color1, and the phoneme sequence of a word (string) registered in the word dictionary, represented by the variable $color2. However, the present invention is also applicable when an utterance contains multiple unregistered words, or when an unregistered word is not sandwiched between words (strings) registered in the word dictionary.

【００８２】未登録語区間処理部２７は、特徴抽出部２
２から供給される音声の特徴ベクトルの系列（特徴ベク
トル系列）を一時記憶する。さらに、未登録語区間処理
部２７は、マッチング部２３から未登録語の音声区間と
音韻系列を受信すると、その音声区間における音声の特
徴ベクトル系列を、一時記憶している特徴ベクトル系列
から検出する。そして、未登録語区間処理部２７は、マ
ッチング部２３からの音韻系列（未登録語）に、ユニー
クなID(Identification)を付し、未登録語の音韻系列
と、その音声区間における特徴ベクトル系列（音声特徴
ベクトル系列）とともに、特徴ベクトルバッファ２８に
供給する。The unregistered word section processing unit 27 temporarily stores the sequence of speech feature vectors (feature vector sequence) supplied from the feature extraction unit 22. When it receives the speech section and phoneme sequence of an unregistered word from the matching unit 23, it extracts the speech feature vector sequence for that speech section from the temporarily stored feature vector sequence. It then assigns a unique ID (Identification) to the phoneme sequence (unregistered word) from the matching unit 23, and supplies the ID to the feature vector buffer 28 together with the phoneme sequence of the unregistered word and the feature vector sequence for its speech section (speech feature vector sequence).

【００８３】特徴ベクトルバッファ２８は、例えば、図
７に示されるように、未登録語区間処理部２７から供給
される未登録語のID、音韻系列、および音声特徴ベクト
ル系列を対応付けて一時記憶する。The feature vector buffer 28 temporarily stores the ID, phoneme sequence, and speech feature vector sequence of each unregistered word supplied from the unregistered word section processing unit 27 in association with one another, for example as shown in FIG. 7.

【００８４】図７においては、未登録語に対して、１か
らのシーケンシャルな数字が、IDとして付されている。
従って、例えば、いま、特徴ベクトルバッファ２８にお
いて、Ｎ個の未登録語のID、音韻系列、および音声特徴
ベクトル系列が記憶されている場合において、マッチン
グ部２３が未登録語の音声区間と音韻系列を検出する
と、未登録語区間処理部２７では、その未登録語に対し
て、Ｎ＋１が、IDとして付され、特徴ベクトルバッファ
２８では、図７に点線で示されるように、その未登録語
のID、音韻系列、および音声特徴ベクトル系列が記憶さ
れる。In FIG. 7, sequential numbers starting from 1 are assigned to the unregistered words as IDs. Therefore, for example, when the feature vector buffer 28 currently stores the IDs, phoneme sequences, and speech feature vector sequences of N unregistered words and the matching unit 23 detects the speech section and phoneme sequence of an unregistered word, the unregistered word section processing unit 27 assigns N+1 to that unregistered word as its ID, and the feature vector buffer 28 stores the ID, phoneme sequence, and speech feature vector sequence of that unregistered word, as indicated by the dotted line in FIG. 7.
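The ID assignment and buffering described above might be sketched as follows. For brevity the sketch merges the ID assignment (done by the unregistered word section processing unit 27 in the text) into the buffer itself; the class and field names are hypothetical, and feature vectors are represented as plain lists.

```python
from dataclasses import dataclass


@dataclass
class UnregisteredWord:
    """One buffer entry: ID, phoneme sequence, and speech feature vectors."""
    word_id: int
    phonemes: str
    speech_features: list


class FeatureVectorBuffer:
    """Illustrative stand-in for the feature vector buffer 28."""

    def __init__(self):
        self.entries = []

    def add(self, phonemes, speech_features):
        # With N entries already stored, the new word receives ID N + 1.
        word_id = len(self.entries) + 1
        entry = UnregisteredWord(word_id, phonemes, speech_features)
        self.entries.append(entry)
        return entry
```

Because IDs are sequential from 1, the ID of an entry is always its position in the buffer plus one, which keeps lookups by ID trivial.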

【００８５】再び図４に戻り、画像認識部５０Ｂは、Ａ
Ｄ変換部４１を有している。ＡＤ変換部４１は、ビデオ
カメラ１６より入力されたアナログ信号である画像信号
をサンプリング、量子化し、ディジタル信号である画像
データにＡ／Ｄ変換する。この画像データは、特徴抽出
部４２に供給される。Returning to FIG. 4 again, the image recognition unit 50B has an AD conversion unit 41. The AD conversion unit 41 samples and quantizes the image signal, which is an analog signal input from the video camera 16, and A/D-converts it into image data, which is a digital signal. This image data is supplied to the feature extraction unit 42.

【００８６】特徴抽出部４２は、そこに入力される画像
データの所定の範囲について、例えば、RGB値の平均値
を演算し、得られた結果を、画像の特徴ベクトル（特徴
パラメータ）として、マッチング部２３と特徴ベクトル
バッファ４３に供給する。サンプリングされる所定の範
囲は、画面のほぼ中央の予め定められている範囲、ユー
ザにより指定された範囲、などとされる。あるいはま
た、特徴抽出部４２は、マッチング部２３から、未登録
語の音声区間を表す情報の入力を受け、その音声区間に
対応する画像の特徴ベクトルを検出する。The feature extraction unit 42 computes, for example, the average of the RGB values over a predetermined range of the image data input to it, and supplies the result as an image feature vector (feature parameter) to the matching unit 23 and the feature vector buffer 43. The predetermined range to be sampled is, for example, a predetermined range near the center of the screen or a range designated by the user. Alternatively, the feature extraction unit 42 receives information representing the speech section of an unregistered word from the matching unit 23 and detects the image feature vector corresponding to that speech section.
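The average RGB computation described above can be sketched in a few lines. The image representation (a list of rows of (r, g, b) tuples), the rectangular region parameters, and the function name mean_rgb are assumptions made for this illustration only.

```python
def mean_rgb(image, top, left, height, width):
    """Average the RGB values over a rectangular region of an image.

    image is a list of rows, each row a list of (r, g, b) tuples.
    Returns the mean as a tuple of three floats.
    """
    pixels = [
        image[y][x]
        for y in range(top, top + height)
        for x in range(left, left + width)
    ]
    count = len(pixels)
    return tuple(sum(pixel[c] for pixel in pixels) / count for c in range(3))
```

The resulting three-component tuple plays the role of the image feature vector that is passed on for matching and buffering.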

【００８７】特徴抽出部４２により抽出された特徴ベク
トルは、特徴ベクトルバッファ４３に供給され、記憶さ
れる。特徴ベクトルバッファ４３に記憶された画像の特
徴ベクトルは、さらに特徴ベクトルバッファ２８に供給
され、画像特徴ベクトル系列として記憶される。図７に
示されるように、この画像特徴ベクトル系列は、音韻系
列と音声特徴ベクトル系列に対応して記憶される。The feature vector extracted by the feature extraction unit 42 is supplied to the feature vector buffer 43 and stored therein. The feature vector of the image stored in the feature vector buffer 43 is further supplied to the feature vector buffer 28 and stored as an image feature vector series. As shown in FIG. 7, this image feature vector sequence is stored in correspondence with the phoneme sequence and the voice feature vector sequence.

【００８８】クラスタリング部２９は、特徴ベクトルバ
ッファ２８に新たに記憶された未登録語（以下、適宜、
新未登録語という）について、特徴ベクトルバッファ２
８に既に記憶されている他の未登録語（以下、適宜、既
記憶未登録語という）それぞれに対するスコアを計算す
る。For an unregistered word newly stored in the feature vector buffer 28 (hereinafter referred to as a new unregistered word where appropriate), the clustering unit 29 calculates a score against each of the other unregistered words already stored in the feature vector buffer 28 (hereinafter referred to as stored unregistered words where appropriate).

【００８９】即ち、クラスタリング部２９は、新未登録
語を入力音声とし、かつ、既記憶未登録語を、単語辞書
に登録されている単語とみなして、マッチング部２３に
おける場合と同様にして、新未登録語について、各既記
憶未登録語に対するスコアを計算する。具体的には、ク
ラスタリング部２９は、特徴ベクトルバッファ２８を参
照することで、新未登録語の特徴ベクトル系列を認識す
るとともに、既記憶未登録語の音韻系列にしたがって音
響モデルを接続し、その接続された音響モデルから、新
未登録語の特徴ベクトル系列が観測される尤度としての
スコアを計算する。That is, the clustering unit 29 regards the new unregistered word as the input voice and the already stored unregistered word as the word registered in the word dictionary, and in the same manner as in the matching unit 23, For the new unregistered word, a score for each stored unregistered word is calculated. Specifically, the clustering unit 29 recognizes the feature vector series of the new unregistered word by referring to the feature vector buffer 28, and connects the acoustic models in accordance with the phoneme series of the stored unregistered word. From the connected acoustic model, a score is calculated as the likelihood that the feature vector sequence of the new unregistered word is observed.

【００９０】なお、音響モデルは、音響モデル記憶部２
４に記憶されているものが用いられる。The acoustic models used are those stored in the acoustic model storage unit 24.

【００９１】クラスタリング部２９は、同様にして、各
既記憶未登録語について、新未登録語に対するスコアも
計算し、そのスコアによって、スコアシート記憶部３０
に記憶されたスコアシートを更新する。In the same way, the clustering unit 29 also calculates, for each stored unregistered word, a score against the new unregistered word, and uses these scores to update the score sheet stored in the score sheet storage unit 30.

【００９２】また、クラスタリング部２９は、スコアを
計算するに際し、音声特徴ベクトル系列と画像特徴ベク
トル系列の両方を利用する。Further, the clustering unit 29 uses both the voice feature vector sequence and the image feature vector sequence when calculating the score.

【００９３】具体的には、クラスタリング部２９は、例
えば、音声特徴ベクトル系列に基づいて計算したスコア
をSaとし、画像特徴ベクトル系列に基づいて計算したス
コアをSvとし、α，βを、それぞれ所定の係数とすると
き、次式に従って合成されたスコアScを計算し、この合
成スコアScを用いてクラスタリングを行う。Specifically, letting Sa be the score calculated from the speech feature vector sequence, Sv be the score calculated from the image feature vector sequence, and α and β be predetermined coefficients, the clustering unit 29 calculates, for example, a combined score Sc according to the following equation and performs clustering using this combined score Sc.

【００９４】 Sc＝α×Sa＋β×Sv （１）[0094] Sc = α × Sa + β × Sv (1)

【００９５】この係数α，βは、音声によるスコアと画
像によるスコアを重み付けするための係数であり、必要
に応じて適宜変更される。The coefficients α and β are coefficients for weighting the score by the voice and the score by the image, and are appropriately changed as necessary.

【００９６】なお、スコアSvの計算には、例えば、RGB
値のユークリット距離を用いることができる。The score Sv can be calculated using, for example, the Euclidean distance between RGB values.
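Equation (1) and the Euclidean distance mentioned above can be combined in a short sketch. The text does not say how a distance is turned into a score; the sketch assumes the negated distance, so that closer colours score higher, and this is purely an assumption. The function names and default coefficient values are likewise hypothetical.

```python
import math


def image_score(rgb_a, rgb_b):
    """Image score Sv from two RGB feature vectors.

    Assumption: the negated Euclidean distance is used, so that
    identical colours give the maximum score of zero.
    """
    return -math.dist(rgb_a, rgb_b)


def combined_score(sa, sv, alpha=1.0, beta=1.0):
    """Combined score Sc = alpha * Sa + beta * Sv, as in equation (1)."""
    return alpha * sa + beta * sv
```

Raising beta relative to alpha makes the clustering rely more on colour similarity than on acoustic similarity, which is the role the text gives to the two coefficients.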

【００９７】クラスタリング部２９はまた、更新したス
コアシートを参照することにより、既に求められてい
る、未登録語（既記憶未登録語）をクラスタリングした
クラスタの中から、新未登録語を新たなメンバとして加
えるクラスタを検出する。さらに、クラスタリング部２
９は、新未登録語を、検出したクラスタの新たなメンバ
とし、そのクラスタを、そのクラスタのメンバに基づい
て分割し、その分割結果に基づいて、スコアシート記憶
部３０に記憶されているスコアシートを更新する。By referring to the updated score sheet, the clustering unit 29 also detects, from among the clusters already obtained by clustering the unregistered words (stored unregistered words), the cluster to which the new unregistered word is to be added as a new member. The clustering unit 29 then makes the new unregistered word a new member of the detected cluster, divides that cluster based on its members, and updates the score sheet stored in the score sheet storage unit 30 based on the result of the division.

【００９８】スコアシート記憶部３０は、新未登録語に
ついての、既記憶未登録語に対するスコアや、既記憶未
登録語についての、新未登録語に対するスコア等が登録
されたスコアシートを記憶する。The score sheet storage unit 30 stores a score sheet in which are registered, among other things, the scores of the new unregistered word against the stored unregistered words and the scores of the stored unregistered words against the new unregistered word.

【００９９】図８は、スコアシートの例を示している。FIG. 8 shows an example of the score sheet.

【０１００】スコアシート９１は、未登録語の「ID」、
「音韻系列」、「クラスタナンバ」、「代表メンバI
D」、および「スコア」が記述されたエントリで構成さ
れる。The score sheet 91 consists of entries that each describe the "ID", "phoneme sequence", "cluster number", "representative member ID", and "score" of an unregistered word.

【０１０１】未登録語の「ID」と「音韻系列」として
は、特徴ベクトルバッファ２８に記憶されたものと同一
のものが、クラスタリング部２９によって登録される。
「クラスタナンバ」は、そのエントリの未登録語がメン
バとなっているクラスタを特定するための数字で、クラ
スタリング部２９によって付され、スコアシート９１に
登録される。「代表メンバID」は、そのエントリの未登
録語がメンバとなっているクラスタを代表する代表メン
バとしての未登録語のIDであり、この代表メンバIDによ
って、未登録語がメンバとなっているクラスタの代表メ
ンバを認識することができる。なお、クラスタの代表メ
ンバは、クラスタリング部２９によって求められ、その
代表メンバのIDが、スコアシート９１の代表メンバIDに
登録される。「スコア」は、そのエントリの未登録語に
ついての、他の未登録語それぞれに対するスコアであ
り、上述したように、クラスタリング部２９によって計
算される。The "ID" and "phoneme sequence" of an unregistered word are registered by the clustering unit 29 and are identical to those stored in the feature vector buffer 28. The "cluster number" is a number identifying the cluster of which the unregistered word of the entry is a member; it is assigned by the clustering unit 29 and registered in the score sheet 91. The "representative member ID" is the ID of the unregistered word serving as the representative member of the cluster of which the unregistered word of the entry is a member; this ID identifies the representative member of that cluster. The representative member of a cluster is determined by the clustering unit 29, and its ID is registered as the representative member ID in the score sheet 91. The "score" is the score of the unregistered word of the entry against each of the other unregistered words, and is calculated by the clustering unit 29 as described above.

【０１０２】例えば、いま、図７に示されるように、特
徴ベクトルバッファ２８に、Ｎ個の未登録語のID、音韻
系列、および特徴ベクトル系列が記憶されているとする
と、スコアシート９１には、図８に示されるように、そ
のＮ個の未登録語のID、音韻系列、クラスタナンバ、代
表メンバID、およびスコアが登録されている。For example, if the feature vector buffer 28 currently stores the IDs, phoneme sequences, and feature vector sequences of N unregistered words as shown in FIG. 7, the score sheet 91 has registered in it the IDs, phoneme sequences, cluster numbers, representative member IDs, and scores of those N unregistered words, as shown in FIG. 8.

【０１０３】そして、図７に、点線で示されるように、
特徴ベクトルバッファ２８に、新未登録語のID、音韻系
列、および特徴ベクトル系列が新たに記憶されると、ク
ラスタリング部２９は、スコアシート９１を、図８にお
いて点線で示されるように更新させる。Then, when the ID, phoneme sequence, and feature vector sequence of a new unregistered word are newly stored in the feature vector buffer 28, as indicated by the dotted line in FIG. 7, the clustering unit 29 updates the score sheet 91 as indicated by the dotted line in FIG. 8.

【０１０４】即ち、スコアシート９１には、新未登録語
のID、音韻系列、クラスタナンバ、代表メンバID、新未
登録語についての、既記憶未登録語それぞれに対するス
コア（図８におけるスコアs(N+1,1),s(N+1,2),・・・,s
(N+1,N)）が追加される。さらに、スコアシート９１に
は、既記憶未登録語それぞれについての、新未登録語に
対するスコア（図８におけるs(1,N+1),s(2,N+1),・・
・，s(N,N+1)）が追加される。さらに、後述するよう
に、スコアシート９１における未登録語のクラスタナン
バと代表メンバIDが、必要に応じて変更される。That is, the ID, phoneme sequence, cluster number, and representative member ID of the new unregistered word, together with its scores against each of the stored unregistered words (the scores s(N+1,1), s(N+1,2), ..., s(N+1,N) in FIG. 8), are added to the score sheet 91. The scores of each of the stored unregistered words against the new unregistered word (s(1,N+1), s(2,N+1), ..., s(N,N+1) in FIG. 8) are also added to the score sheet 91. Furthermore, as described later, the cluster numbers and representative member IDs of the unregistered words in the score sheet 91 are changed as necessary.

【０１０５】なお、図８の実施の形態においては、IDが
iの未登録語（の発話）についての、IDがjの未登録語
（の音韻系列）に対するスコアを、s(i,j)として表して
ある。In the embodiment of FIG. 8, the score of (the utterance of) the unregistered word with ID i against (the phoneme sequence of) the unregistered word with ID j is written as s(i,j).

【０１０６】また、スコアシート９１（図８）には、ID
がiの未登録語（の発話）についての、IDがiの未登録語
（の音韻系列）に対するスコアs(i,i)も登録される。但
し、このスコアs(i,i)は、マッチング部２３において、
未登録語の音韻系列が検出されるときに計算されるた
め、クラスタリング部２９で計算する必要はない。The score sheet 91 (FIG. 8) also registers the score s(i,i) of (the utterance of) the unregistered word with ID i against (the phoneme sequence of) the same unregistered word with ID i. However, since this score s(i,i) is calculated in the matching unit 23 when the phoneme sequence of the unregistered word is detected, it does not need to be calculated by the clustering unit 29.
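The score sheet update described above amounts to adding one row and one column to a table of scores s(i,j). The sketch below keeps only the scores (cluster numbers and representative member IDs are omitted) in a nested dictionary; the function name add_new_word, the score_fn callback, and the self_score argument are hypothetical, with self_score standing in for the value s(i,i) that the text says is already available from the matching unit.

```python
def add_new_word(scores, new_id, score_fn, self_score):
    """Add the row and column for a new unregistered word, in place.

    scores[i][j] holds s(i, j): the score of the utterance of word i
    against the phoneme sequence of word j.
    score_fn(i, j) computes s(i, j) for i != j.
    """
    existing = list(scores)
    # New row: the new word scored against every stored word.
    scores[new_id] = {j: score_fn(new_id, j) for j in existing}
    # The diagonal entry is supplied rather than recomputed.
    scores[new_id][new_id] = self_score
    # New column: every stored word scored against the new word.
    for j in existing:
        scores[j][new_id] = score_fn(j, new_id)
    return scores
```

Note that s(i,j) and s(j,i) are computed separately, since scoring an utterance against a phoneme sequence is not symmetric in general.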

【０１０７】再び図４に戻り、メンテナンス部３１は、
スコアシート記憶部３０における、更新後のスコアシー
ト９１に基づいて、辞書記憶部２５に記憶された単語辞
書７１を更新する。Returning to FIG. 4 again, the maintenance unit 31 updates the word dictionary 71 stored in the dictionary storage unit 25 based on the updated score sheet 91 in the score sheet storage unit 30.

【０１０８】クラスタの代表メンバは、次のように決定
される。即ち、例えば、クラスタのメンバとなっている
未登録語のうち、他の未登録語それぞれについてのスコ
アの総和（その他、例えば、総和を、他の未登録語の数
で除算した平均値でも良い）を最大にするものが、その
クラスタの代表メンバとされる。従って、この場合、ク
ラスタに属するメンバのメンバIDをkで表すこととする
と、次式で示される値K（∈k）をIDとするメンバが、代
表メンバとされることになる。The representative member of a cluster is determined as follows. For example, among the unregistered words that are members of the cluster, the one that maximizes the sum of the scores of the other unregistered words against it (or, alternatively, the average obtained by dividing that sum by the number of other unregistered words) is taken as the representative member of the cluster. Accordingly, if k denotes the ID of a member belonging to the cluster, the member whose ID is the value K (∈ k) given by the following equation becomes the representative member.

【０１０９】 K=max_k{Σs(k',k)} ・・・（２）$K = \max_k \{ \sum_{k'} s(k', k) \}$ (2)

【０１１０】但し、式（２）において、max_k{}は、{}内
の値を最大にするｋを意味する。また、k'は、kと同様
に、クラスタに属するメンバのIDを意味する。さらに、
Σは、k'を、クラスタに属するメンバすべてのIDに亘っ
て変化させての総和を意味する。In equation (2), max_k{} denotes the k that maximizes the value inside {}. Like k, k' denotes the ID of a member belonging to the cluster. Further, Σ denotes the sum taken as k' ranges over the IDs of all members belonging to the cluster.
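The selection rule of equation (2) can be illustrated with a short Python sketch. The function name `representative` and the dictionary layout of the scores are assumptions made for illustration. Ties are broken by member order here, which the text does not specify.

```python
def representative(members, s):
    """Return the member ID K that maximizes the sum over k' of s[(k', k)].

    members: IDs of the members of one cluster
    s: mapping (i, j) -> score s(i, j), as held in the score sheet
    """
    members = list(members)
    # For each candidate k, add up the scores of every member k' against k,
    # then keep the candidate with the largest total.
    return max(members, key=lambda k: sum(s[(kp, k)] for kp in members))
```

For a cluster with a single member the function simply returns that member, which matches the remark that no score calculation is needed in that case.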

【０１１１】なお、上述のように代表メンバを決定する
場合、クラスタのメンバが、１または２つの未登録語で
あるときには、代表メンバを決めるにあたって、スコア
を計算する必要はない。即ち、クラスタのメンバが、１
つの未登録語である場合には、その１つの未登録語が代
表メンバとなり、クラスタのメンバが、２つの未登録語
である場合には、その２つの未登録語のうちのいずれ
を、代表メンバとしても良い。When the representative member is determined as described above and the cluster has only one or two unregistered words as members, no score needs to be calculated to determine the representative member. That is, when the cluster has a single unregistered word as its member, that word becomes the representative member, and when the cluster has two unregistered words as members, either of the two may be taken as the representative member.

【０１１２】また、代表メンバの決定方法は、上述した
ものに限定されるものではなく、その他、例えば、クラ
スタのメンバとなっている未登録語のうち、他の未登録
語それぞれとの特徴ベクトル空間における距離の総和を
最小にするもの等を、そのクラスタの代表メンバとする
ことも可能である。The method of determining the representative member is not limited to the one described above. Alternatively, for example, among the unregistered words that are members of a cluster, the one that minimizes the sum of its distances in the feature vector space to each of the other unregistered words may be taken as the representative member of that cluster.
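The distance-based alternative can be sketched as below. This is a simplified illustration that assumes each unregistered word is summarized by one fixed-length feature vector and that Euclidean distance is used. The text specifies neither, and real feature vector sequences would need an alignment step first. The name `medoid` is introduced here for illustration.

```python
import math


def medoid(vectors):
    """Return the ID whose summed distance to all other members is smallest.

    vectors: mapping member ID -> fixed-length feature vector (an assumption)
    """
    return min(
        vectors,
        key=lambda k: sum(math.dist(vectors[k], v) for v in vectors.values()),
    )
```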

【０１１３】以上のように構成される音声認識部５０Ａ
では、マイク１５に入力された音声を認識する音声認識
処理と、未登録語に関する未登録語処理が、ビデオカメ
ラ１６により撮像された画像を利用して行われるように
なっている。In the voice recognition section 50A configured as described above, the voice recognition process that recognizes the voice input to the microphone 15 and the unregistered word process for unregistered words are performed using the image captured by the video camera 16.

【０１１４】そこで、まず最初に、図９のフローチャー
トを参照して、音声認識処理について説明する。First, the voice recognition process will be described with reference to the flowchart of FIG. 9.

【０１１５】ユーザが発話を行うと、その発話された音
声は、ステップＳ１で入力される。すなわち、マイク１
５およびＡＤ変換部２１を介することにより入力された
ディジタルの音声データが、特徴抽出部２２に供給され
る。また、ステップＳ２において、ビデオカメラ１６に
より撮像された結果得られた画像信号が入力され、ＡＤ
変換部４１においてＡＤ変換され、特徴抽出部４２に供
給される。When the user speaks, the spoken voice is input in step S1. That is, digital voice data input via the microphone 15 and the AD conversion unit 21 is supplied to the feature extraction unit 22. In step S2, the image signal obtained by imaging with the video camera 16 is input, AD-converted in the AD conversion unit 41, and supplied to the feature extraction unit 42.

【０１１６】特徴抽出部２２は、ステップＳ３におい
て、音声データを、所定のフレーム単位で音響分析する
ことにより、特徴ベクトルを抽出し、その特徴ベクトル
の系列を、マッチング部２３および未登録語区間処理部
２７に供給する。特徴抽出部４２もまた、ステップＳ４
において、画像データを分析することにより、特徴ベク
トルを抽出し、その特徴ベクトルの系列をマッチング部
２３に供給する。In step S3, the feature extraction unit 22 extracts feature vectors by acoustically analyzing the voice data in units of predetermined frames, and supplies the sequence of feature vectors to the matching unit 23 and the unregistered word section processing unit 27. Likewise, in step S4, the feature extraction unit 42 extracts feature vectors by analyzing the image data, and supplies the sequence of feature vectors to the matching unit 23.

【０１１７】マッチング部２３は、ステップＳ５におい
て、特徴抽出部２２からの特徴ベクトル系列（音声特徴
ベクトル系列）と、特徴抽出部４２からの特徴ベクトル
系列（画像特徴ベクトル系列）に基づいて、上述したよ
うに、式（１）を用いて、スコア計算を行い、ステップ
Ｓ６に進む。ステップＳ６では、マッチング部２３は、
スコア計算の結果得られるスコアに基づいて、音声認識
結果となる単語列の見出しを求めて出力する。In step S5, the matching unit 23 calculates scores using equation (1), as described above, based on the feature vector sequence from the feature extraction unit 22 (the voice feature vector sequence) and the feature vector sequence from the feature extraction unit 42 (the image feature vector sequence), and proceeds to step S6. In step S6, the matching unit 23 obtains and outputs the headings of the word string that constitutes the voice recognition result, based on the scores obtained by the score calculation.

【０１１８】さらに、マッチング部２３は、ステップＳ
７に進み、ユーザの音声に、未登録語が含まれていたか
どうかを判定する。Further, the matching unit 23 proceeds to step S7 and determines whether the user's voice contained an unregistered word.

【０１１９】ステップＳ７において、ユーザの音声に、
未登録語が含まれていないと判定された場合、即ち、上
述の未登録語用規則「$pat1 = $color1 $garbage $colo
r2;」が適用されずに、音声認識結果が得られた場合、
ステップＳ８をスキップして、処理を終了する。If it is determined in step S7 that the user's voice contains no unregistered word, that is, if the speech recognition result was obtained without applying the above-described unregistered word rule "$pat1 = $color1 $garbage $color2;", step S8 is skipped and the process ends.

【０１２０】また、ステップＳ７において、ユーザの音
声に、未登録語が含まれていると判定された場合、即
ち、未登録語用規則「$pat1 = $color1 $garbage $colo
r2;」が適用されて、音声認識結果が得られた場合、ス
テップＳ８に進み、マッチング部２３は、未登録語用規
則の変数$garbageに対応する音声区間を、未登録語の音
声区間として検出するとともに、その変数$garbageが表
すガーベジモデルにおける音韻の遷移としての音韻系列
を、未登録語の音韻系列として検出し、その未登録語の
音声区間と音韻系列を、未登録語区間処理部２７に供給
して、処理を終了する。If it is determined in step S7 that the user's voice contains an unregistered word, that is, if the speech recognition result was obtained by applying the unregistered word rule "$pat1 = $color1 $garbage $color2;", the process proceeds to step S8. There, the matching unit 23 detects the speech section corresponding to the variable $garbage of the unregistered word rule as the speech section of the unregistered word, and detects the phoneme sequence given by the phoneme transitions in the garbage model represented by $garbage as the phoneme sequence of the unregistered word. It supplies the speech section and phoneme sequence of the unregistered word to the unregistered word section processing unit 27, and the process ends.

【０１２１】一方、未登録語区間処理部２７は、特徴抽
出部２２から供給される特徴ベクトル系列を一時記憶し
ており、マッチング部２３から未登録語の音声区間と音
韻系列が供給されると、その音声区間における音声の特
徴ベクトル系列を検出する。さらに、未登録語区間処理
部２７は、マッチング部２３からの未登録語（の音韻系
列）にIDを付し、未登録語の音韻系列と、その音声区間
における特徴ベクトル系列とともに、特徴ベクトルバッ
ファ２８に供給する。Meanwhile, the unregistered word section processing unit 27 temporarily stores the feature vector sequence supplied from the feature extraction unit 22 and, when supplied with the speech section and phoneme sequence of an unregistered word from the matching unit 23, detects the feature vector sequence of the voice in that speech section. Further, the unregistered word section processing unit 27 assigns an ID to (the phoneme sequence of) the unregistered word from the matching unit 23 and supplies it, together with the phoneme sequence of the unregistered word and the feature vector sequence in its speech section, to the feature vector buffer 28.

【０１２２】また、特徴抽出部４２は、マッチング部２
３により検出された未登録語の音声区間に対応する画像
データの特徴を抽出し、その特徴ベクトルをマッチング
部２３に供給する他、特徴ベクトルバッファ４３に供給
し、記憶させる。この特徴ベクトルは、また、特徴ベク
トルバッファ４３から特徴ベクトルバッファ２８に供給
され、記憶される。The feature extraction unit 42 also extracts the features of the image data corresponding to the speech section of the unregistered word detected by the matching unit 23, supplies the resulting feature vector to the matching unit 23, and additionally supplies it to the feature vector buffer 43 for storage. This feature vector is further supplied from the feature vector buffer 43 to the feature vector buffer 28 and stored there.

【０１２３】以上のようにして、特徴ベクトルバッファ
２８に、新たな未登録語（新未登録語）のID、音韻系
列、および音声と画像の特徴ベクトル系列が記憶される
と、未登録語処理が行われる。As described above, when the ID, the phoneme sequence, and the voice and image feature vector sequences of a new unregistered word are stored in the feature vector buffer 28, the unregistered word process is performed.

【０１２４】図１０は、この場合における未登録語処理
を説明するフローチャートを示している。FIG. 10 is a flow chart for explaining the unregistered word process in this case.

【０１２５】未登録語処理では、まず最初に、ステップ
Ｓ１１において、クラスタリング部２９が、特徴ベクト
ルバッファ２８から、新未登録語のIDと音韻系列を読み
出し、ステップＳ１２に進む。In the unregistered word process, first, in step S11, the clustering unit 29 reads the ID and phoneme sequence of the new unregistered word from the feature vector buffer 28, and proceeds to step S12.

【０１２６】ステップＳ１２では、クラスタリング部２
９が、スコアシート記憶部３０のスコアシート９１を参
照することにより、既に求められている（生成されてい
る）クラスタが存在するかどうかを判定する。In step S12, the clustering unit 29 refers to the score sheet 91 in the score sheet storage unit 30 to determine whether any cluster has already been obtained (generated).

【０１２７】ステップＳ１２において、既に求められて
いるクラスタが存在しないと判定された場合、即ち、新
未登録語が、初めての未登録語であり、スコアシート９
１に、既記憶未登録語のエントリが存在しない場合、ス
テップＳ１３に進み、クラスタリング部２９は、その新
未登録語を代表メンバとするクラスタを新たに生成し、
その新たなクラスタに関する情報と、新未登録語に関す
る情報とを、スコアシート記憶部３０のスコアシート９
１に登録することにより、スコアシート９１を更新す
る。If it is determined in step S12 that no cluster has been obtained yet, that is, if the new unregistered word is the first unregistered word and the score sheet 91 contains no entries for already-stored unregistered words, the process proceeds to step S13. There, the clustering unit 29 generates a new cluster whose representative member is the new unregistered word, and updates the score sheet 91 in the score sheet storage unit 30 by registering information on the new cluster and on the new unregistered word.

【０１２８】即ち、クラスタリング部２９は、特徴ベク
トルバッファ２８から読み出した新未登録語のIDおよび
音韻系列を、スコアシート９１（図８）に登録する。さ
らに、クラスタリング部２９は、ユニークなクラスタナ
ンバを生成し、新未登録語のクラスタナンバとして、ス
コアシート９１に登録する。また、クラスタリング部２
９は、新未登録語のIDを、その新未登録語の代表メンバ
IDとして、スコアシート９１に登録する。従って、この
場合は、新未登録語は、新たなクラスタの代表メンバと
なる。That is, the clustering unit 29 registers the ID and phoneme sequence of the new unregistered word read from the feature vector buffer 28 in the score sheet 91 (FIG. 8). The clustering unit 29 also generates a unique cluster number and registers it in the score sheet 91 as the cluster number of the new unregistered word. In addition, the clustering unit 29 registers the ID of the new unregistered word in the score sheet 91 as the representative member ID of that word. In this case, therefore, the new unregistered word becomes the representative member of the new cluster.

【０１２９】なお、いまの場合、新未登録語とのスコア
を計算する既記憶未登録語が存在しないため、スコアの
計算は行われない。In this case, since there is no stored unregistered word for calculating the score with the new unregistered word, the score is not calculated.

【０１３０】ステップＳ１３の処理の後、ステップＳ２
２に進み、メンテナンス部３１は、ステップＳ１３で更
新されたスコアシート９１に基づいて、辞書記憶部２５
の単語辞書７１を更新し、処理を終了する。After the processing of step S13, the process proceeds to step S22, where the maintenance unit 31 updates the word dictionary 71 in the dictionary storage unit 25 based on the score sheet 91 updated in step S13, and the process ends.

【０１３１】即ち、いまの場合、新たなクラスタが生成
されているので、メンテナンス部３１は、スコアシート
９１におけるクラスタナンバを参照し、その新たに生成
されたクラスタを認識する。そして、メンテナンス部３
１は、そのクラスタに対応するエントリを、辞書記憶部
２５の単語辞書７１に追加し、そのエントリの音韻系列
として、新たなクラスタの代表メンバの音韻系列、つま
り、いまの場合は、新未登録語の音韻系列を登録する。That is, in this case a new cluster has been generated, so the maintenance unit 31 refers to the cluster numbers in the score sheet 91 and recognizes the newly generated cluster. The maintenance unit 31 then adds an entry corresponding to that cluster to the word dictionary 71 in the dictionary storage unit 25 and registers, as the phoneme sequence of the entry, the phoneme sequence of the representative member of the new cluster, which in this case is the phoneme sequence of the new unregistered word.

【０１３２】一方、ステップＳ１２において、既に求め
られているクラスタが存在すると判定された場合、即
ち、新未登録語が、初めての未登録語ではなく、従っ
て、スコアシート９１（図８）に、既記憶未登録語のエ
ントリ（行）が存在する場合、ステップＳ１４に進み、
クラスタリング部２９は、新未登録語について、各既記
憶未登録語それぞれに対するスコアを、上述した式
（１）を用いて計算するとともに、各既記憶未登録語そ
れぞれについて、新未登録語に対するスコアを計算す
る。On the other hand, if it is determined in step S12 that a cluster has already been obtained, that is, if the new unregistered word is not the first unregistered word and the score sheet 91 (FIG. 8) therefore contains entries (rows) for already-stored unregistered words, the process proceeds to step S14. There, the clustering unit 29 uses equation (1) described above to calculate the score of the new unregistered word against each already-stored unregistered word, and also the score of each already-stored unregistered word against the new unregistered word.

【０１３３】即ち、例えば、いま、IDが１乃至ＮのＮ個
の既記憶未登録語が存在し、新未登録語のIDをN+1とす
ると、クラスタリング部２９では、図８において点線で
示した部分の新未登録語についてのＮ個の既記憶未登録
語それぞれに対するスコアs(N+1,1),s(N+1,2),・・・,s
(N+1,N)と、Ｎ個の既記憶未登録語それぞれについての
新未登録語に対するスコアs(1,N+1),s(2,N+1),・・・，
s(N,N+1)が計算される。なお、クラスタリング部２９に
おいて、これらのスコアを計算するにあたっては、新未
登録語とＮ個の既記憶未登録語それぞれの特徴ベクトル
系列が必要となるが、これらの特徴ベクトル系列は、特
徴ベクトルバッファ２８を参照することで取得される。For example, suppose there are N already-stored unregistered words with IDs 1 to N, and the ID of the new unregistered word is N+1. The clustering unit 29 then calculates the scores shown by the dotted lines in FIG. 8: the scores s(N+1,1), s(N+1,2), ..., s(N+1,N) of the new unregistered word against each of the N already-stored unregistered words, and the scores s(1,N+1), s(2,N+1), ..., s(N,N+1) of each of the N already-stored unregistered words against the new unregistered word. Calculating these scores requires the feature vector sequences of the new unregistered word and of each of the N already-stored unregistered words, and the clustering unit 29 obtains them by referring to the feature vector buffer 28.

【０１３４】そして、クラスタリング部２９は、計算し
たスコアを、新未登録語のIDおよび音韻系列とともに、
スコアシート９１（図８）に追加し、ステップＳ１５に
進む。The clustering unit 29 then adds the calculated scores, together with the ID and phoneme sequence of the new unregistered word, to the score sheet 91 (FIG. 8), and proceeds to step S15.

【０１３５】ステップＳ１５では、クラスタリング部２
９は、スコアシート９１（図８）を参照することによ
り、新未登録語についてのスコアs(N+1,i)（i=1,2,・・
・,N）を最も高く（大きく）する代表メンバを有するク
ラスタを検出する。即ち、クラスタリング部２９は、ス
コアシート９１の代表メンバIDを参照することにより、
代表メンバとなっている既記憶未登録語を認識し、さら
に、スコアシート９１のスコアを参照することで、新未
登録語についてのスコアを最も高くする代表メンバとし
ての既記憶未登録語を検出する。そして、クラスタリン
グ部２９は、その検出した代表メンバとしての既記憶未
登録語のクラスタナンバのクラスタを検出する。In step S15, the clustering unit 29 refers to the score sheet 91 (FIG. 8) to detect the cluster whose representative member gives the highest (largest) score s(N+1,i) (i = 1, 2, ..., N) for the new unregistered word. That is, the clustering unit 29 identifies the already-stored unregistered words that are representative members by referring to the representative member IDs in the score sheet 91, and then, by referring to the scores in the score sheet 91, detects the representative member that gives the highest score for the new unregistered word. The clustering unit 29 then detects the cluster identified by the cluster number of that representative member.

【０１３６】その後、ステップＳ１６に進み、クラスタ
リング部２９は、新未登録語を、ステップＳ１５で検出
したクラスタ（以下、適宜、検出クラスタという）のメ
ンバに加える。即ち、クラスタリング部２９は、スコア
シート９１における新未登録語のクラスタナンバとし
て、検出クラスタの代表メンバのクラスタナンバを書き
込む。Then, in step S16, the clustering unit 29 adds the new unregistered word to the members of the cluster detected in step S15 (hereinafter, appropriately referred to as a detected cluster). That is, the clustering unit 29 writes the cluster number of the representative member of the detected cluster as the cluster number of the new unregistered word on the score sheet 91.
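Steps S15 and S16 can be sketched in Python as follows. The function name `join_best_cluster` and the dictionary layout are assumptions for illustration. The sketch also fills in a provisional representative member ID for the new word. The text writes only the cluster number at this step and settles the representative later, in step S21 or S23.

```python
def join_best_cluster(new_id, sheet, s):
    """Add the new word to the cluster whose representative scores it highest.

    sheet: mapping word ID -> {"cluster": int, "rep": int} (hypothetical layout)
    s: mapping (i, j) -> score s(i, j)
    """
    reps = {row["rep"] for row in sheet.values()}        # representative member IDs
    best_rep = max(reps, key=lambda r: s[(new_id, r)])   # step S15
    cluster = sheet[best_rep]["cluster"]                 # the detected cluster
    # Step S16: write the cluster number for the new word. The "rep" value
    # here is provisional and may be rewritten by later steps.
    sheet[new_id] = {"cluster": cluster, "rep": best_rep}
    return cluster
```

Only representative members are compared, so a non-representative word that happens to score higher against the new word does not influence which cluster is detected.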

【０１３７】そして、クラスタリング部２９は、ステッ
プＳ１７において、検出クラスタを、例えば、２つのク
ラスタに分割するクラスタ分割処理を行い（その処理の
詳細は、図１１を参照して後述する）、ステップＳ１８
に進む。ステップＳ１８では、クラスタリング部２９
は、ステップＳ１７のクラスタ分割処理によって、検出
クラスタを２つのクラスタに分割することができたかど
うかを判定し、分割することができた判定した場合、ス
テップＳ１９に進む。ステップＳ１９では、クラスタリ
ング部２９は、検出クラスタの分割により得られる２つ
のクラスタ（この２つのクラスタを、以下、適宜、第１
の子クラスタと第２の子クラスタという）どうしの間の
クラスタ間距離を求める。In step S17, the clustering unit 29 performs a cluster division process that divides the detected cluster into, for example, two clusters (the details are described later with reference to FIG. 11), and proceeds to step S18. In step S18, the clustering unit 29 determines whether the cluster division process of step S17 was able to divide the detected cluster into two clusters, and if so, proceeds to step S19. In step S19, the clustering unit 29 obtains the inter-cluster distance between the two clusters obtained by dividing the detected cluster (hereinafter referred to, as appropriate, as the first child cluster and the second child cluster).

【０１３８】第１と第２の子クラスタどうしの間のクラ
スタ間距離とは、例えば、次のように定義される。The inter-cluster distance between the first and second child clusters is defined as follows, for example.

【０１３９】即ち、第１の子クラスタと第２の子クラス
タの両方の任意のメンバ（未登録語）のIDを、kで表す
とともに、第１と第２の子クラスタの代表メンバ（未登
録語）のIDを、それぞれk1またはk2で表すこととする
と、次式で表される値D(k1,k2)が、第１と第２の子クラ
スタどうしの間のクラスタ間距離とされる。That is, let k denote the ID of an arbitrary member (unregistered word) of either the first or the second child cluster, and let k1 and k2 denote the IDs of the representative members (unregistered words) of the first and second child clusters, respectively. The value D(k1,k2) given by the following equation is then taken as the inter-cluster distance between the first and second child clusters.

【０１４０】 D(k1,k2)＝maxval_k{abs(log(s(k,k1))-log(s(k,k2)))} ・・・（３）$D(k1, k2) = \mathrm{maxval}_k \{ \mathrm{abs}( \log s(k, k1) - \log s(k, k2) ) \}$ (3)

【０１４１】但し、式（３）において、abs()は、()内
の値の絶対値を表す。また、maxval_k{}は、kを変えて求
められる{}内の値の最大値を表す。また、logは、自然
対数または常用対数を表す。In equation (3), abs() denotes the absolute value of the value inside (). maxval_k{} denotes the maximum of the values inside {} obtained as k is varied. log denotes the natural logarithm or the common logarithm.
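Equation (3) can be illustrated with a few lines of Python. The function name and the dictionary layout of the scores are assumptions made here. The sketch uses the natural logarithm and assumes all scores are strictly positive, since the logarithm is otherwise undefined.

```python
import math


def inter_cluster_distance(members, k1, k2, s):
    """Equation (3): largest |log s(k, k1) - log s(k, k2)| over the members k.

    members: IDs of the members of both child clusters
    k1, k2: IDs of the two representative members
    s: mapping (i, j) -> strictly positive score s(i, j)
    """
    return max(
        abs(math.log(s[(k, k1)]) - math.log(s[(k, k2)])) for k in members
    )
```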

【０１４２】いま、IDがiのメンバを、メンバ#iと表す
こととすると、式（３）におけるスコアの逆数1/s(k,k
1)は、メンバ#kと代表メンバk1との距離に相当し、スコ
アの逆数1/s(k,k2)は、メンバ#kと代表メンバk2との距
離に相当する。従って、式（３）によれば、第１と第２
の子クラスタのメンバのうち、第１の子クラスタの代表
メンバ#k1との距離と、第２の子クラスタの代表メンバ#
k2との距離との差の最大値が、第１と第２の子クラスタ
どうしの間の子クラスタ間距離とされることになる。Writing the member with ID i as member #i, the reciprocal 1/s(k,k1) of the score in equation (3) corresponds to the distance between member #k and the representative member #k1, and the reciprocal 1/s(k,k2) corresponds to the distance between member #k and the representative member #k2. According to equation (3), therefore, the inter-child-cluster distance between the first and second child clusters is the maximum, over the members of the first and second child clusters, of the difference between a member's distance to the representative member #k1 of the first child cluster and its distance to the representative member #k2 of the second child cluster.
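As a reading aid, the relation between equation (3) and these distances can be written out. The symbol d is introduced here for illustration only, under the assumption stated in the text that the distance is the reciprocal of the score, and k_1, k_2 stand for k1, k2:

```latex
\begin{aligned}
d(k,k_i) &= \frac{1}{s(k,k_i)} \quad (i = 1, 2), \\
\left|\log s(k,k_1) - \log s(k,k_2)\right|
  &= \left|-\log d(k,k_1) + \log d(k,k_2)\right|
   = \left|\log \frac{d(k,k_2)}{d(k,k_1)}\right| .
\end{aligned}
```

Under this assumption, equation (3) compares the two distances on a logarithmic scale: the quantity maximized over k is the absolute log-ratio of the distances to the two representative members.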

【０１４３】このように、スコアと距離は逆数の関係に
あるが、一対一に対応しているので、スコアとして距離
を用いることも可能である。但し、この場合、大小の関
係が逆になる。As described above, the score and the distance have an inverse relationship, but since they correspond one-to-one, it is possible to use the distance as the score. However, in this case, the magnitude relationship is reversed.

【０１４４】なお、クラスタ間距離は、上述したものに
限定されるものではなく、その他、例えば、第１の子ク
ラスタの代表メンバと、第２の子クラスタの代表メンバ
とのＤＰマッチングを行うことにより、特徴ベクトル空
間における距離の積算値を求め、その距離の積算値を、
クラスタ間距離とすることも可能である。The inter-cluster distance is not limited to the one described above. Alternatively, for example, DP matching may be performed between the representative member of the first child cluster and the representative member of the second child cluster to obtain an accumulated distance in the feature vector space, and that accumulated distance may be used as the inter-cluster distance.

【０１４５】ステップＳ１９の処理の後、ステップＳ２
０に進み、クラスタリング部２９は、第１と第２の子ク
ラスタどうしのクラスタ間距離が、所定の閾値θより大
である（あるいは、閾値θ以上である）かどうかを判定
する。After the processing of step S19, the process proceeds to step S20, where the clustering unit 29 determines whether the inter-cluster distance between the first and second child clusters is greater than a predetermined threshold θ (or, alternatively, greater than or equal to θ).

【０１４６】ステップＳ２０において、クラスタ間距離
が、所定の閾値θより大であると判定された場合、即
ち、検出クラスタのメンバとしての複数の未登録語が、
その音響的特徴からいって、２つのクラスタにクラスタ
リングすべきものであると考えられる場合、ステップＳ
２１に進み、クラスタリング部２９は、第１と第２の子
クラスタを、スコアシート記憶部３０のスコアシート９
１に登録する。If it is determined in step S20 that the inter-cluster distance is greater than the predetermined threshold θ, that is, if the unregistered words that are members of the detected cluster should, judging from their acoustic features, be clustered into two clusters, the process proceeds to step S21, where the clustering unit 29 registers the first and second child clusters in the score sheet 91 in the score sheet storage unit 30.

【０１４７】即ち、クラスタリング部２９は、第１と第
２の子クラスタに、ユニークなクラスタナンバを割り当
て、検出クラスタのメンバのうち、第１の子クラスタに
クラスタリングされたもののクラスタナンバを、第１の
子クラスタのクラスタナンバにするとともに、第２の子
クラスタにクラスタリングされたもののクラスタナンバ
を、第２の子クラスタのクラスタナンバにするように、
スコアシート９１を更新する。That is, the clustering unit 29 assigns unique cluster numbers to the first and second child clusters and updates the score sheet 91 so that, among the members of the detected cluster, those clustered into the first child cluster carry the cluster number of the first child cluster and those clustered into the second child cluster carry the cluster number of the second child cluster.

【０１４８】さらに、クラスタリング部２９は、第１の
子クラスタにクラスタリングされたメンバの代表メンバ
IDを、第１の子クラスタの代表メンバのIDにするととも
に、第２の子クラスタにクラスタリングされたメンバの
代表メンバIDを、第２の子クラスタの代表メンバのIDに
するように、スコアシート９１を更新する。Further, the clustering unit 29 updates the score sheet 91 so that the representative member ID of each member clustered into the first child cluster is the ID of the representative member of the first child cluster, and the representative member ID of each member clustered into the second child cluster is the ID of the representative member of the second child cluster.

【０１４９】なお、第１と第２の子クラスタのうちのい
ずれか一方には、検出クラスタのクラスタナンバを割り
当てるようにすることが可能である。The cluster number of the detected cluster can be assigned to either one of the first and second child clusters.

【０１５０】クラスタリング部２９が、以上のようにし
て、第１と第２の子クラスタを、スコアシート９１に登
録すると、ステップＳ２１からＳ２２に進み、メンテナ
ンス部３１が、スコアシート９１に基づいて、辞書記憶
部２５の単語辞書７１を更新し、処理を終了する。When the clustering unit 29 has registered the first and second child clusters in the score sheet 91 in this way, the process proceeds from step S21 to step S22, where the maintenance unit 31 updates the word dictionary 71 in the dictionary storage unit 25 based on the score sheet 91, and the process ends.

【０１５１】即ち、いまの場合、検出クラスタが、第１
と第２の子クラスタに分割されたため、メンテナンス部
３１は、まず、単語辞書７１における、検出クラスタに
対応するエントリを削除する。さらに、メンテナンス部
３１は、第１と第２の子クラスタそれぞれに対応する２
つのエントリを、単語辞書７１に追加し、第１の子クラ
スタに対応するエントリの音韻系列として、その第１の
子クラスタの代表メンバの音韻系列を登録するととも
に、第２の子クラスタに対応するエントリの音韻系列と
して、その第２の子クラスタの代表メンバの音韻系列を
登録する。That is, in this case the detected cluster has been divided into the first and second child clusters, so the maintenance unit 31 first deletes the entry corresponding to the detected cluster from the word dictionary 71. The maintenance unit 31 then adds two entries, one for each of the first and second child clusters, to the word dictionary 71. It registers the phoneme sequence of the representative member of the first child cluster as the phoneme sequence of the entry for the first child cluster, and the phoneme sequence of the representative member of the second child cluster as the phoneme sequence of the entry for the second child cluster.
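The dictionary update after a split can be sketched as follows. This is a simplified illustration in which the word dictionary is reduced to a mapping from cluster number to phoneme sequence. The actual word dictionary 71 holds further fields, and the name `apply_split` is an assumption.

```python
def apply_split(dictionary, detected, child1, child2):
    """Replace the entry of the detected cluster with entries for its children.

    dictionary: mapping cluster number -> phoneme sequence of the representative
    detected: cluster number of the detected (now divided) cluster
    child1, child2: (cluster number, phoneme sequence of the representative)
    """
    dictionary.pop(detected, None)          # delete the detected cluster's entry
    for number, phonemes in (child1, child2):
        dictionary[number] = phonemes       # add one entry per child cluster
    return dictionary
```

Because the old entry is removed before the new ones are written, the sketch also covers the case where one child cluster reuses the cluster number of the detected cluster.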

【０１５２】一方、ステップＳ１８において、ステップ
Ｓ１７のクラスタ分割処理によって、検出クラスタを２
つのクラスタに分割することができなかったと判定され
た場合、あるいは、ステップＳ２０において、第１と第
２の子クラスタのクラスタ間距離が、所定の閾値θより
大でないと判定された場合（従って、検出クラスタのメ
ンバとしての複数の未登録語の音響的特徴が、第１と第
２の２つの子クラスタにクラスタリングするほど似てい
ないものではない場合）、ステップＳ２３に進み、クラ
スタリング部２９は、検出クラスタの新たな代表メンバ
を求め、スコアシート９１を更新する。On the other hand, if it is determined in step S18 that the cluster division process of step S17 could not divide the detected cluster into two clusters, or if it is determined in step S20 that the inter-cluster distance between the first and second child clusters is not greater than the predetermined threshold θ (that is, if the acoustic features of the unregistered words that are members of the detected cluster are not dissimilar enough to warrant clustering them into the first and second child clusters), the process proceeds to step S23, where the clustering unit 29 obtains a new representative member of the detected cluster and updates the score sheet 91.

【０１５３】即ち、クラスタリング部２９は、新未登録
語をメンバとして加えた検出クラスタの各メンバについ
て、スコアシート記憶部３０のスコアシート９１を参照
することにより、式（２）の計算に必要なスコアs(k',
k)を認識する。さらに、クラスタリング部２９は、その
認識したスコアs(k',k)を用い、式（２）に基づき、検
出クラスタの新たな代表メンバとなるメンバのIDを求め
る。そして、クラスタリング部２９は、スコアシート９
１（図８）における、検出クラスタの各メンバの代表メ
ンバIDを、検出クラスタの新たな代表メンバのIDに書き
換える。That is, for each member of the detected cluster, which now includes the new unregistered word, the clustering unit 29 refers to the score sheet 91 in the score sheet storage unit 30 to obtain the scores s(k',k) needed to evaluate equation (2). Using those scores, the clustering unit 29 determines from equation (2) the ID of the member that becomes the new representative member of the detected cluster. It then rewrites the representative member ID of each member of the detected cluster in the score sheet 91 (FIG. 8) to the ID of the new representative member.

【０１５４】その後、ステップＳ２２に進み、メンテナ
ンス部３１が、スコアシート９１に基づいて、辞書記憶
部２５の単語辞書７１を更新し、処理を終了する。After that, the procedure goes to step S22, in which the maintenance section 31 updates the word dictionary 71 in the dictionary storage section 25 based on the score sheet 91, and the processing ends.

【０１５５】即ち、いまの場合、メンテナンス部３１
は、スコアシート９１を参照することにより、検出クラ
スタの新たな代表メンバを認識し、さらに、その代表メ
ンバの音韻系列を認識する。そして、メンテナンス部３
１は、単語辞書７１における、検出クラスタに対応する
エントリの音韻系列を、検出クラスタの新たな代表メン
バの音韻系列に変更する。That is, in this case the maintenance unit 31 refers to the score sheet 91 to identify the new representative member of the detected cluster and the phoneme sequence of that representative member. The maintenance unit 31 then changes the phoneme sequence of the entry corresponding to the detected cluster in the word dictionary 71 to the phoneme sequence of the new representative member.

【０１５６】次に、図１１のフローチャートを参照し
て、図１０のステップＳ１７のクラスタ分割処理の詳細
について説明する。Details of the cluster division processing in step S17 of FIG. 10 will be described below with reference to the flowchart of FIG.

【０１５７】クラスタ分割処理では、まず最初に、ステ
ップＳ３１において、クラスタリング部２９が、新未登
録語がメンバとして加えられた検出クラスタから、まだ
選択していない任意の２つのメンバの組み合わせを選択
し、それぞれを、仮の代表メンバとする。ここで、この
２つの仮の代表メンバを、以下、適宜、第１の仮代表メ
ンバと第２の仮代表メンバという。In the cluster division process, first, in step S31, the clustering unit 29 selects an arbitrary combination of two members that have not been selected from the detected clusters to which the new unregistered word has been added as a member. , And each is a temporary representative member. Here, these two temporary representative members will be appropriately referred to as a first temporary representative member and a second temporary representative member hereinafter.

【０１５８】そして、ステップＳ３２に進み、クラスタ
リング部２９は、第１の仮代表メンバと、第２の仮代表
メンバを、それぞれ代表メンバとすることができるよう
に、検出クラスタのメンバを、２つのクラスタに分割す
ることができるかどうかを判定する。Then, in step S32, the clustering unit 29 determines whether the members of the detected cluster can be divided into two clusters such that the first temporary representative member and the second temporary representative member each become the representative member of one of them.

【０１５９】ここで、第１または第２の仮代表メンバを
代表メンバとすることができるかどうかは、式（２）の
計算を行う必要があるが、この計算に用いられるスコア
s(k',k)は、スコアシート９１を参照することで取得さ
れる。Here, determining whether the first or second temporary representative member can be a representative member requires evaluating equation (2), and the scores s(k',k) used in that evaluation are obtained by referring to the score sheet 91.

【０１６０】ステップＳ３２において、第１の仮代表メ
ンバと、第２の仮代表メンバを、それぞれ代表メンバと
することができるように、検出クラスタのメンバを、２
つのクラスタに分割することができないと判定された場
合、ステップＳ３３をスキップして、ステップＳ３４に
進む。If it is determined in step S32 that the members of the detected cluster cannot be divided into two clusters such that the first and second temporary representative members each become a representative member, step S33 is skipped and the process proceeds to step S34.

【０１６１】また、ステップＳ３２において、第１の仮
代表メンバと、第２の仮代表メンバを、それぞれ代表メ
ンバとすることができるように、検出クラスタのメンバ
を、２つのクラスタに分割することができると判定され
た場合、ステップＳ３３に進み、クラスタリング部２９
は、第１の仮代表メンバと、第２の仮代表メンバが、そ
れぞれ代表メンバとなるように、検出クラスタのメンバ
を、２つのクラスタに分割し、その分割後の２つのクラ
スタの組を、検出クラスタの分割結果となる第１および
第２の子クラスタの候補（以下、適宜、候補クラスタの
組という）として、ステップＳ３４に進む。If it is determined in step S32 that the members of the detected cluster can be divided into two clusters such that the first and second temporary representative members each become a representative member, the process proceeds to step S33. There, the clustering unit 29 divides the members of the detected cluster into two clusters accordingly, treats the resulting pair of clusters as a candidate for the first and second child clusters that result from dividing the detected cluster (hereinafter referred to, as appropriate, as a set of candidate clusters), and proceeds to step S34.

【０１６２】ステップＳ３４では、クラスタリング部２
９は、検出クラスタのメンバの中で、まだ、第１と第２
の仮代表メンバの組として選択していない２つのメンバ
の組があるかどうかを判定し、あると判定した場合、ス
テップＳ３１に戻り、まだ、第１と第２の仮代表メンバ
の組として選択していない、検出クラスタの２つのメン
バの組が選択され、以下、同様の処理が繰り返される。In step S34, the clustering unit 29 determines whether the detected cluster still contains a pair of members that has not yet been selected as the pair of first and second temporary representative members. If so, the process returns to step S31, a pair of members of the detected cluster not yet selected as the first and second temporary representative members is selected, and the same processing is repeated.

【０１６３】また、ステップＳ３４において、第１と第
２の仮代表メンバの組として選択していない、検出クラ
スタの２つのメンバの組がないと判定された場合、ステ
ップＳ３５に進み、クラスタリング部２９は、候補クラ
スタの組が存在するかどうかを判定する。If it is determined in step S34 that the detected cluster contains no pair of members that has not yet been selected as the pair of first and second temporary representative members, the process proceeds to step S35, where the clustering unit 29 determines whether a set of candidate clusters exists.

【０１６４】ステップＳ３５において、候補クラスタの
組が存在しないと判定された場合、ステップＳ３６をス
キップして、リターンする。この場合は、図１０のステ
ップＳ１８において、検出クラスタを分割することがで
きなかったと判定される。If it is determined in step S35 that there is no candidate cluster set, step S36 is skipped and the process returns. In this case, in step S18 of FIG. 10, it is determined that the detected cluster could not be divided.

【０１６５】On the other hand, if it is determined in step S35 that a set of candidate clusters exists, the process proceeds to step S36. When there are multiple sets of candidate clusters, the clustering unit 29 obtains, for each set, the inter-cluster distance between its two clusters. The clustering unit 29 then finds the set of candidate clusters with the smallest inter-cluster distance and returns that set as the result of dividing the detected cluster, that is, as the first and second child clusters. If there is only one set of candidate clusters, that set is used as the first and second child clusters as is.
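The selection in steps S35 and S36 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the names `select_child_clusters` and `inter_cluster_distance` are hypothetical, clusters are represented simply as lists of member IDs, and the inter-cluster distance is passed in as a function because its definition is given elsewhere in the specification.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Cluster = List[int]                       # member IDs (hypothetical representation)
CandidatePair = Tuple[Cluster, Cluster]   # one set of candidate clusters


def select_child_clusters(
    candidates: Sequence[CandidatePair],
    inter_cluster_distance: Callable[[Cluster, Cluster], float],
) -> Optional[CandidatePair]:
    """Return the split to adopt, or None when the cluster could not be divided."""
    if not candidates:
        # Step S35: no candidate set exists, so step S36 is skipped.
        return None
    if len(candidates) == 1:
        # A single candidate set is adopted as is.
        return candidates[0]
    # Step S36: adopt the set whose two clusters have the smallest
    # inter-cluster distance.
    return min(candidates, key=lambda pair: inter_cluster_distance(pair[0], pair[1]))
```

Returning `None` corresponds to the case in which step S18 of FIG. 10 judges that the detected cluster could not be divided.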

【０１６６】In this case, it is determined in step S18 of FIG. 10 that the detected cluster could be divided.

【０１６７】As described above, the clustering unit 29 detects, from among the clusters already obtained by clustering unregistered words, the cluster to which a new unregistered word is to be added as a new member (the detected cluster), adds the new unregistered word as a new member of that detected cluster, and divides the detected cluster based on its members. Unregistered words can therefore easily be clustered into groups whose acoustic features are similar.

【０１６８】Furthermore, because the maintenance unit 31 updates the word dictionary based on such clustering results, unregistered words can easily be registered in the word dictionary while keeping the word dictionary from growing large.

【０１６９】Moreover, even if, for example, the matching unit 23 detects the voice section of an unregistered word incorrectly, such an unregistered word is, through division of the detected cluster, clustered into a cluster separate from unregistered words whose voice sections were detected correctly. An entry corresponding to such a cluster is then registered in the word dictionary, but because the phoneme sequence of that entry corresponds to an incorrectly detected voice section, it does not yield a large score in subsequent voice recognition. Therefore, even if the voice section of an unregistered word is detected incorrectly, the error has almost no effect on subsequent voice recognition.

【０１７０】Furthermore, when clustering is performed using a combined feature vector obtained by combining the image feature vector with the voice feature vector, clustering can be performed more accurately than when only the voice feature vector is used. As a result, voice recognition can be performed more accurately.
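One way to picture the use of both modalities is a distance that weights and adds a voice term and an image term. The sketch below is an assumption for illustration only: the function names, the Euclidean metric, and the `audio_weight` parameter are hypothetical, and this passage does not state how the two feature vectors are actually combined.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def combined_distance(
    audio_a: Sequence[float],
    image_a: Sequence[float],
    audio_b: Sequence[float],
    image_b: Sequence[float],
    audio_weight: float = 0.5,   # hypothetical weighting, not from the source
) -> float:
    """Weighted sum of the voice distance and the image distance."""
    image_weight = 1.0 - audio_weight
    return (audio_weight * euclidean(audio_a, audio_b)
            + image_weight * euclidean(image_a, image_b))
```

With `audio_weight=1.0` the measure falls back to voice alone, so two inputs that sound identical but look different become indistinguishable again.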

【０１７１】Next, the action processing performed by the robot 1, which has undergone learning based on feature vectors derived from voice and images as described above, will be described with reference to the flowchart of FIG. 12.

【０１７２】Suppose, for example, that the user says "Chase the red ball" to the robot 1.

【０１７３】In step S51, the microphone 15 receives this voice signal from the user. In step S52, the voice signal is recognized by the voice recognition unit 50A.

【０１７４】In step S53, the action determination mechanism unit 52 determines whether image data (RGB values) is registered in correspondence with the result of the voice recognition in step S52. In this case, as shown in the word dictionary 71 of FIG. 5, RGB values are stored for the word "red", so it is determined that image data is registered.

【０１７５】In this case, the process proceeds to step S55, where the image signal captured by the video camera 16 is input. The image recognition unit 50B performs image recognition processing on the image data input in step S55 and outputs the recognition result to the action determination mechanism unit 52. In step S56, the action determination mechanism unit 52 executes processing to identify, from the input image data, the target of the voice input based on the registered image data. Specifically, since RGB values are registered for the word "red" in this case, the action determination mechanism unit 52 recognizes, from the recognition result output by the image recognition unit 50B, the image region whose RGB values are equal or close to the registered RGB values as a "red" image. Because the target object here is a "red ball", the action determination mechanism unit 52 extracts a circular image from the "red" image and recognizes it as the "red ball".
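The test for RGB values that are "equal or close to" the registered values could look like the sketch below. The Euclidean distance in RGB space, the `tolerance` threshold, and the function names are assumptions made for illustration; the specification does not state how closeness is measured.

```python
import math
from typing import List, Tuple

RGB = Tuple[int, int, int]


def rgb_distance(a: RGB, b: RGB) -> float:
    """Euclidean distance between two RGB triples (assumed closeness measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def color_mask(image: List[List[RGB]], registered: RGB, tolerance: float) -> List[List[bool]]:
    """Mark each pixel whose color lies within `tolerance` of the registered RGB value."""
    return [[rgb_distance(px, registered) <= tolerance for px in row] for row in image]
```

The resulting mask would then be handed to a shape test (for example, a check for a circular region) to pick out the ball; that step is omitted here.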

【０１７６】In step S57, the action determination mechanism unit 52 executes the processing corresponding to the voice input on the target identified in step S56. In this case, the voice recognition unit 50A has recognized that the "red ball" is to be "chased", so the action determination mechanism unit 52 controls the posture transition mechanism unit 53 so as to chase the "red ball". Based on this control, the posture transition mechanism unit 53 in turn controls the control mechanism unit 54 and drives predetermined actuators among the actuators 3AA₁ to 3DA_K, 5A₁, and 5A₂. As a result, the robot 1 takes the action of chasing the "red ball".

【０１７７】In this case, "red" is recognized using image recognition by the image recognition unit 50B as well, so the robot can chase the "red ball" more accurately than when recognition relies on voice alone.

【０１７８】If it is determined in step S53 that no image data is registered in correspondence with the recognition result, the process proceeds to step S54, where the processing corresponding to the voice input is executed. That is, in this case the image-based recognition result is not used, so an action based on the same kind of recognition result as in the conventional case is executed.

【０１７９】With voice recognition alone, the robot 1 cannot be made to recognize the "red ball".

【０１８０】Also, for example, if the words "red" and "orange" and their image data are registered in advance, a "red ball" can be distinguished from an "orange ball" by processing the image data input from the video camera 16. Because "red" and "orange" are relatively similar colors, recognition from image data alone may cause the robot 1 to misrecognize an "orange ball" as a "red ball".

【０１８１】However, when recognition uses both voice and images as described above, the robot 1 can accurately distinguish a "red ball" from an "orange ball": even if the colors (RGB values) of "red" and "orange" are similar, the pronunciations of "red" and "orange" are not.

【０１８２】When the words "red" and "orange" are clustered, using voice in addition to images likewise makes it possible to cluster the two accurately into different clusters. The robot 1's ability to distinguish a "red ball" from an "orange ball" accurately is also a result of this clustering.

【０１８３】The above description takes as an example the case where colors are learned from voice and images, clustered, and registered. The present invention can also be applied, for example, to clustering and registering the words "raise" (ageru) and "lower" (sageru).

【０１８４】That is, the phoneme sequence of "raise" is "ageru" and that of "lower" is "sageru". The latter differs only in having the fricative "s" at its beginning; its other phonemes are identical to those of the former.

【０１８５】The fricative "s" is a phoneme that becomes relatively difficult to identify in the presence of background noise. Consequently, if the words "raise" and "lower" are clustered and registered in the dictionary using voice alone, they may be clustered into the same cluster even though they are entirely different words.

【０１８６】However, if, in addition to the voice, the action of "raising" or "lowering" a given object is captured by the video camera 16 at the same time as the voice and registered in correspondence with the voice, the two can be reliably distinguished and clustered into different clusters, because "raising" and "lowering" are exactly opposite actions. In this case, a motion vector can be used as the feature vector of the "raise" and "lower" images. The motion vector of the "raise" image points upward, whereas the motion vector of the "lower" image points downward. The two can therefore be clustered easily and accurately into different clusters.
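A motion-vector feature of this kind can be sketched as below. The sketch assumes, hypothetically, that the object's position has already been tracked as (x, y) pixel coordinates per frame and that the image y axis grows downward, as is common for image coordinates; neither detail comes from the specification.

```python
from typing import List, Tuple

Point = Tuple[float, float]   # (x, y) position of the tracked object in one frame


def mean_motion_vector(track: List[Point]) -> Tuple[float, float]:
    """Average per-frame displacement of the object over the whole track."""
    if len(track) < 2:
        raise ValueError("at least two positions are needed")
    steps = len(track) - 1
    dx = (track[-1][0] - track[0][0]) / steps
    dy = (track[-1][1] - track[0][1]) / steps
    return (dx, dy)


def vertical_direction(track: List[Point]) -> str:
    """Classify the motion as 'up', 'down', or 'none' (image y grows downward)."""
    _, dy = mean_motion_vector(track)
    if dy < 0:
        return "up"
    if dy > 0:
        return "down"
    return "none"
```

Appending such a vector to the voice features gives "raise" and "lower" opposite signs in the vertical component even when the leading "s" is masked by noise.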

【０１８７】Furthermore, the present invention can also be applied to clustering and registering people's names.

【０１８８】For example, when the voice "This is <OOV>" is input, the phonemes of the <OOV> portion, which corresponds to an unregistered word, can be clustered and registered. For instance, when the voice "This is Yamada-san" is input and "Yamada-san" is not yet registered in the dictionary, "Yamada-san" is registered in the dictionary. At that time, an image of Yamada-san, for example from the front, can be captured simultaneously and registered in the dictionary in correspondence with the phonemes of "Yamada-san".

【０１８９】Alternatively, when a voice matching the pattern "Hello. My name is <OOV>.", such as "Hello. My name is Yamada.", is input, the word in the <OOV> portion may be clustered and registered together with the image input at the same time.

【０１９０】When, for example, the phonemes and image of "Yamada-san" have been clustered and registered in this way and Yamada-san is later introduced to another person, the robot 1 can capture the scene with the video camera 16, sample an image of Yamada-san seen, for example, from the side, and register it. Even for the same Yamada-san, the image from the front and the image from the side are entirely different. Consequently, if the robot 1 performed recognition, clustering, and registration from images alone, it would have difficulty recognizing the front image and the side image as images of the same person, and they would be clustered and registered as different persons. In contrast, when images are registered in association with voice, as in the present invention, the robot 1 can recognize from the phonemes of "Yamada-san" that the two images (the front image and the side image) are images of a related (more precisely, the same) person, provided that no other person with the surname Yamada is registered.
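The association between a phoneme sequence and several images of the same person can be pictured with a small mapping. The class `NameDictionary`, its methods, and the string image identifiers are hypothetical and serve only to illustrate the idea that the phonemes act as the shared key.

```python
from collections import defaultdict
from typing import Dict, List


class NameDictionary:
    """Toy store that groups image identifiers under a phoneme sequence."""

    def __init__(self) -> None:
        self._images: Dict[str, List[str]] = defaultdict(list)

    def register(self, phonemes: str, image_id: str) -> None:
        # Images captured together with the same phonemes end up in one entry.
        self._images[phonemes].append(image_id)

    def images_for(self, phonemes: str) -> List[str]:
        # Return a copy so that callers cannot alter the stored list.
        return list(self._images.get(phonemes, []))
```

Registering a front view and, later, a side view under the same phonemes yields one entry holding both images, which is what lets dissimilar views be treated as the same person.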

【０１９１】In the above description the image data is registered in the dictionary, but a dedicated storage unit may instead be provided and the image data registered there.

【０１９２】The present invention has been described above as applied to an entertainment robot (a robot serving as a pseudo pet). However, the present invention is not limited to this and can be widely applied to, for example, voice dialogue systems equipped with a voice recognition device. The present invention is also applicable not only to robots in the real world but also to virtual robots displayed on a display device such as a liquid crystal display.

【０１９３】In the present embodiment, the series of processes described above is performed by having the CPU 10A execute a program, but the series of processes can also be performed by dedicated hardware.

【０１９４】Here, in addition to being stored in advance in the memory 10B (FIG. 2), the program can be stored (recorded) temporarily or permanently on a removable recording medium such as a floppy (registered trademark) disk, a CD-ROM (Compact Disc Read Only Memory), an MO (Magneto Optical) disk, a DVD (Digital Versatile Disc), a magnetic disk, or a semiconductor memory. Such a removable recording medium can be provided as so-called packaged software and installed in the robot (memory 10B).

【０１９５】The program can also be transferred wirelessly from a download site via an artificial satellite for digital satellite broadcasting, or transferred by wire via a network such as a LAN (Local Area Network) or the Internet, and installed in the memory 10B.

【０１９６】In this case, when the program is upgraded, for example, the upgraded program can easily be installed in the memory 10B.

【０１９７】In this specification, the processing steps describing the program that causes the CPU 10A to perform various kinds of processing need not necessarily be processed in time series in the order described in the flowcharts; they also include processing executed in parallel or individually (for example, parallel processing or object-based processing).

【０１９８】The program may be processed by a single CPU or may be processed in a distributed manner by a plurality of CPUs.

【０１９９】Furthermore, the voice recognition unit 50A of FIG. 4 can be realized by dedicated hardware or by software. When the voice recognition unit 50A is realized by software, the program constituting that software is installed on a general-purpose computer or the like.

【０２００】FIG. 13 shows a configuration example of an embodiment of a computer on which the program for realizing the voice recognition unit 50A is installed.

【０２０１】The program can be recorded in advance on the hard disk 105 or in the ROM 103, which serve as recording media built into the computer.

【０２０２】Alternatively, the program can be stored (recorded) temporarily or permanently on a removable recording medium 111 such as a floppy (registered trademark) disk, a CD-ROM, an MO disk, a DVD, a magnetic disk, or a semiconductor memory. Such a removable recording medium 111 can be provided as so-called packaged software.

【０２０３】In addition to being installed on the computer from the removable recording medium 111 described above, the program can be transferred to the computer wirelessly from a download site via an artificial satellite for digital satellite broadcasting, or by wire via a network such as a LAN or the Internet. The computer can receive the program transferred in this way at the communication unit 108 and install it on the built-in hard disk 105.

【０２０４】The computer incorporates a CPU (Central Processing Unit) 102. An input/output interface 110 is connected to the CPU 102 via a bus 101. When the user inputs a command via the input/output interface 110 by operating the input unit 107, which comprises a keyboard, a mouse, a microphone, an AD converter, and the like, the CPU 102 executes the program stored in the ROM (Read Only Memory) 103 accordingly. Alternatively, the CPU 102 loads into the RAM (Random Access Memory) 104 and executes a program stored on the hard disk 105, a program transferred from a satellite or a network, received by the communication unit 108, and installed on the hard disk 105, or a program read from the removable recording medium 111 mounted in the drive 109 and installed on the hard disk 105.

【０２０５】The CPU 102 thereby performs the processing according to the flowcharts described above or the processing performed by the configurations of the block diagrams described above. Then, as necessary, the CPU 102, for example, outputs the processing result via the input/output interface 110 from the output unit 106, which comprises a display such as an LCD (Liquid Crystal Display), a speaker, a DA (Digital Analog) converter, and the like, transmits it from the communication unit 108, or records it on the hard disk 105.

【０２０６】In the present embodiment, voice recognition is performed by the HMM method, but the present invention is also applicable when voice recognition is performed by other methods, such as DP matching. When voice recognition is performed by DP matching, for example, the score described above corresponds to the reciprocal of the distance between the input voice and a standard pattern.
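The relation between a DP matching distance and a score can be sketched as follows. This is a textbook dynamic time warping recurrence over one-dimensional sequences, offered as an assumption about what DP matching might look like here; the names `dtw_distance` and `dp_score`, the absolute-difference local cost, and the small `eps` that avoids division by zero are not taken from the specification.

```python
import math
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Accumulated distance of the best time alignment between a and b."""
    n, m = len(a), len(b)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])   # local frame-to-frame distance
            # Extend the cheapest of the three allowed predecessor paths.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def dp_score(a: Sequence[float], b: Sequence[float], eps: float = 1e-9) -> float:
    """Score as the reciprocal of the distance: a smaller distance gives a larger score."""
    return 1.0 / (dtw_distance(a, b) + eps)
```

An input that aligns well with the standard pattern thus receives a high score, in the same direction as the HMM-based score used in the embodiment.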

【０２０７】In the present embodiment, unregistered words are clustered and registered in the word dictionary based on the clustering result, but the present invention is also applicable to registered words already registered in the word dictionary.

【０２０８】That is, different phoneme sequences may be obtained even for utterances of the same word. If only one phoneme sequence is registered for a word in the word dictionary, an utterance of that word may fail to be recognized as the registered word when it yields a phoneme sequence different from the one registered. In contrast, according to the present invention, different utterances of the same word are clustered into groups that are acoustically similar. Updating the word dictionary based on the clustering result therefore registers a variety of phoneme sequences for the same word, which makes it possible to perform voice recognition that copes with various phoneme sequences for the same word.

【０２０９】In the entry registered in the word dictionary for a cluster of unregistered words, a heading can be described in addition to the phoneme sequence, for example as follows.

【０２１０】That is, for example, the action determination mechanism unit 52 supplies the state recognition information output by the image recognition unit 50B and the pressure processing unit 50C to the voice recognition unit 50A, as indicated by the dotted line in FIG. 3, and the maintenance unit 31 (FIG. 4) of the voice recognition unit 50A receives that state recognition information.

【０２１１】Meanwhile, the feature vector buffer 28, and in turn the score sheet storage unit 30, also stores the absolute time at which each unregistered word was input. By referring to the absolute time in the score sheet of the score sheet storage unit 30, the maintenance unit 31 recognizes the state recognition information supplied from the action determination mechanism unit 52 at the time the unregistered word was input as the heading of that unregistered word.

【０２１２】The maintenance unit 31 then registers, in the word dictionary entry corresponding to the cluster of unregistered words, the state recognition information as the heading, together with the phoneme sequence of the representative member of that cluster.
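The time-based lookup of a heading can be sketched as below. The sketch assumes, hypothetically, that state recognition information is kept as a time-sorted log of `(time, info)` pairs and that the heading is the most recent item at or before the word's absolute time; the specification says only that the absolute time in the score sheet is referred to.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

StateLog = List[Tuple[float, str]]   # (absolute time, state recognition info), sorted by time


def heading_for(word_time: float, state_log: StateLog) -> Optional[str]:
    """Return the state recognition info in effect when the word was input."""
    times = [t for t, _ in state_log]
    i = bisect_right(times, word_time)    # number of log items at or before word_time
    return state_log[i - 1][1] if i > 0 else None
```

The returned string would then be stored as the heading of the cluster's entry alongside the phoneme sequence of its representative member.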

【０２１３】In this case, the matching unit 23 can output, as the voice recognition result for an unregistered word registered in the word dictionary, the state recognition information serving as the heading of that word, and the robot can then be made to take a predetermined action based on that state recognition information.

【０２１４】In the present embodiment, scores are stored in the score sheet, but the scores can also be recalculated as needed.

【０２１５】In the present embodiment, the detected cluster is divided into two clusters, but it can be divided into three or more clusters. The detected cluster can also be divided into any number of clusters whose inter-cluster distances are at or above a certain value.

【０２１６】Furthermore, in the present embodiment, the phoneme sequence of the unregistered word, the cluster number, the representative member ID, and the like are registered in the score sheet (FIG. 8) in addition to the scores. However, this information other than the scores can be managed separately from the scores instead of being registered in the score sheet.

【０２１７】[0217]

[Effects of the Invention] According to the present invention, input information can be clustered. Also, according to the present invention, input information can be clustered more easily. Furthermore, according to the present invention, input information can be clustered more accurately. Consequently, according to the present invention, input information can be registered and recognized more accurately.

【０２１８】Furthermore, according to the present invention, the registration information can be kept from growing large as a result of registration.

[Brief description of drawings]

FIG. 1 is a perspective view showing an external configuration example of an embodiment of a robot to which the present invention is applied.

FIG. 2 is a block diagram showing an internal configuration example of the robot of FIG. 1.

FIG. 3 is a block diagram showing a functional configuration example of the controller of FIG. 2.

【図４】図２の音声認識部の構成例を示すブロック図で
ある。4 is a block diagram showing a configuration example of a voice recognition unit in FIG.

FIG. 5 is a diagram showing an example of a word dictionary.

FIG. 6 is a diagram showing an example of grammar rules.

【図７】図２の特徴ベクトルバッファの記憶内容の例を
示す図である。FIG. 7 is a diagram showing an example of stored contents of a feature vector buffer of FIG.

FIG. 8 is a diagram showing an example of a score sheet.

FIG. 9 is a flowchart illustrating voice recognition processing.

FIG. 10 is a flowchart illustrating unregistered word processing.

FIG. 11 is a flowchart illustrating cluster division processing.

FIG. 12 is a flowchart illustrating action processing.

FIG. 13 is a block diagram showing a configuration example of an embodiment of a computer to which the present invention is applied.

[Explanation of symbols]

1 head unit, 4A lower jaw, 10 controller, 10A CPU, 10B memory, 15 microphone, 16 video camera, 17 touch sensor, 18 speaker, 21 AD converter, 22 feature extraction unit, 23 matching unit, 24 acoustic model storage unit, 25 dictionary storage unit, 26 grammar storage unit, 27 unregistered word section processing unit, 28 feature vector buffer, 29 clustering unit, 30 score sheet storage unit, 31 maintenance unit, 42 feature extraction unit, 43 feature vector buffer, 50 sensor input processing unit, 50A voice recognition unit, 50B image recognition unit, 50C pressure processing unit, 51 model storage unit, 52 action determination mechanism unit, 53 posture transition mechanism unit, 54 control mechanism unit, 55 voice synthesis unit, 101 bus, 102 CPU, 103 ROM, 104 RAM, 105 hard disk, 106 output unit, 107 input unit, 108 communication unit, 109 drive, 110 input/output interface, 111 removable recording medium

Continuation of front page (51) Int.Cl.⁷ Identification code FI Theme code (reference) G06T 7/00 300 G10L 3/00 521C G10L 13/00 551H 15/00 571Q 15/12 533Z 15/14 535Z 15/20 Q 15/24 531Q F-term (reference) 3C007 AS36 CS08 KS31 KS39 KT01 KX02 WA04 WA14 WB16 WB19 5D015 CC01 CC07 CC11 HH07 HH23 KK01 5D045 AB11 5L096 BA16 FA00 FA32 JA11 KA03 MA07

Claims

[Claims]

1. An information processing apparatus for recognizing input information based on registration information, comprising: first acquisition means for acquiring first information; second acquisition means for acquiring second information; first feature extraction means for extracting a feature of the first information acquired by the first acquisition means; second feature extraction means for extracting a feature of the second information acquired by the second acquisition means; clustering means for clustering the first information acquired by the first acquisition means, using the feature of the first information extracted by the first feature extraction means and the feature of the second information extracted by the second feature extraction means; and registration means for registering the first information as the registration information based on the result of the clustering by the clustering means.

2. The information processing apparatus according to claim 1, wherein the clustering means clusters the first information based on a feature obtained by weighting and adding the feature of the first information extracted by the first feature extraction means and the feature of the second information extracted by the second feature extraction means.
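As one possible reading of the weighted addition recited above, offered only as a sketch, the per-modality distances to each existing cluster may be combined in a weighted sum and the input assigned to the nearest cluster, with a new cluster created when none is close enough. The function names, the Euclidean distance, the weight `w_audio`, and the `threshold` are assumptions and are not taken from the specification.

```python
import math


def weighted_distance(audio_a, audio_b, image_a, image_b, w_audio=0.5):
    """Weighted sum of the audio-feature and image-feature distances."""
    d_audio = math.dist(audio_a, audio_b)
    d_image = math.dist(image_a, image_b)
    return w_audio * d_audio + (1.0 - w_audio) * d_image


def assign_cluster(clusters, audio, image, w_audio=0.5, threshold=1.0):
    """Return the index of the nearest cluster, creating one if none is close.

    `clusters` is a list of (audio_feature, image_feature) representatives.
    """
    best, best_d = None, float("inf")
    for i, (c_audio, c_image) in enumerate(clusters):
        d = weighted_distance(audio, c_audio, image, c_image, w_audio)
        if d < best_d:
            best, best_d = i, d
    if best is None or best_d > threshold:
        # No sufficiently close cluster: start a new one from this input.
        clusters.append((tuple(audio), tuple(image)))
        return len(clusters) - 1
    return best
```

Raising `w_audio` toward 1.0 makes the assignment depend mostly on the audio feature; lowering it toward 0.0 lets the image feature dominate.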

3. The information processing apparatus according to claim 1, wherein one of the first information and the second information is a voice and the other is an image.

4. The information processing apparatus according to claim 3, wherein the first or second feature extraction means extracts the feature of the image as RGB values.
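The claim does not state how the RGB values are derived from the image. A minimal sketch, assuming the feature is the mean color over the pixels of an image region, is given below; the function name `mean_rgb` and the averaging itself are illustrative assumptions.

```python
def mean_rgb(pixels):
    """Average (R, G, B) over an iterable of pixel triples."""
    total = [0.0, 0.0, 0.0]
    count = 0
    for r, g, b in pixels:
        total[0] += r
        total[1] += g
        total[2] += b
        count += 1
    if count == 0:
        raise ValueError("no pixels supplied")
    return tuple(t / count for t in total)
```

A region consisting mostly of red pixels would thus yield a feature close to the RGB value stored for the word "red" in the word dictionary described in the abstract.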

5. The information processing apparatus according to claim 3, wherein the first information is a voice of an unregistered word that is not registered as the registration information.

6. An information processing method for an information processing apparatus that recognizes input information based on registration information, the method comprising: a first acquisition step of acquiring first information; a second acquisition step of acquiring second information; a first feature extraction step of extracting a feature of the first information acquired by the processing of the first acquisition step; a second feature extraction step of extracting a feature of the second information acquired by the processing of the second acquisition step; a clustering step of clustering the first information acquired by the processing of the first acquisition step, using the feature of the first information extracted by the processing of the first feature extraction step and the feature of the second information extracted by the processing of the second feature extraction step; and a registration step of registering the first information as the registration information based on a result of the clustering by the processing of the clustering step.

7. A recording medium having recorded thereon a computer-readable program for an information processing apparatus that recognizes input information based on registration information, the program comprising: a first acquisition step of acquiring first information; a second acquisition step of acquiring second information; a first feature extraction step of extracting a feature of the first information acquired by the processing of the first acquisition step; a second feature extraction step of extracting a feature of the second information acquired by the processing of the second acquisition step; a clustering step of clustering the first information acquired by the processing of the first acquisition step, using the feature of the first information extracted by the processing of the first feature extraction step and the feature of the second information extracted by the processing of the second feature extraction step; and a registration step of registering the first information as the registration information based on a result of the clustering by the processing of the clustering step.

8. A program executable by a computer that controls an information processing apparatus that recognizes input information based on registration information, the program comprising: a first acquisition step of acquiring first information; a second acquisition step of acquiring second information; a first feature extraction step of extracting a feature of the first information acquired by the processing of the first acquisition step; a second feature extraction step of extracting a feature of the second information acquired by the processing of the second acquisition step; a clustering step of clustering the first information acquired by the processing of the first acquisition step, using the feature of the first information extracted by the processing of the first feature extraction step and the feature of the second information extracted by the processing of the second feature extraction step; and a registration step of registering the first information as the registration information based on a result of the clustering by the processing of the clustering step.