JP2002182685A

JP2002182685A - Recognizer and recognition system, learning system and learning method as well as recording medium

Info

Publication number: JP2002182685A
Application number: JP2000376911A
Authority: JP
Inventors: Kouchiyo Nakatsuka; 洪長中塚
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2000-12-12
Filing date: 2000-12-12
Publication date: 2002-06-26

Abstract

PROBLEM TO BE SOLVED: To improve recognition performance. SOLUTION: A synchronization processing unit 2 synchronizes an input image and speech, and a feature extraction unit 3 extracts feature quantities from the synchronized image and speech respectively and obtains a combined feature quantity by combining the image and speech feature quantities. A learning unit 7 performs learning based on the combined feature quantity, generates a model corresponding to an image and speech representing the same concept, and generates a dictionary in which the model is associated with concept information representing the concept of the image and speech. A recognition processing unit 5, in turn, performs matching using the combined feature quantity and the models in the dictionary, thereby recognizing the concept represented by the input image and speech.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a recognition device and a recognition method, a learning device and a learning method, and a recording medium. In particular, it relates to a recognition device and recognition method, a learning device and learning method, and a recording medium that make it possible to obtain high recognition performance when, for example, recognizing a concept from an image and speech.

[0002]

[Description of the Related Art] In recent years, CPUs have become faster and memories larger, and as a result entertainment robots and the like equipped with a speech recognition device, for example, have been realized at low cost.

[0003] Such a robot performs speech recognition on the user's speech and takes various actions based on the recognition result. For example, when the user says "ote" (the Japanese command for "shake hands"), the robot behaves as a real dog does when offering its paw.

[0004]

[Problems to be Solved by the Invention] If a robot, for example, were to process not only speech but also images as externally applied stimuli, recognize the concept represented by the speech and the image, and act on the recognized concept, a robot with greater entertainment value could be expected.

[0005] However, when speech and an image representing a certain concept are given to a robot, for example, the speech and the image are not synchronized with each other. Even if such unsynchronized image and speech are processed to recognize the concept they represent, sufficient recognition performance is unlikely to be obtained.

[0006] Further, as described above, recognizing the concept represented by an image and speech requires learning in advance. In learning as well, sufficient recognition performance is unlikely to be obtained if the speech and the image representing a concept are not synchronized.

[0007] The present invention has been made in view of this situation, and makes it possible to obtain high recognition performance when recognizing a concept from an image and speech.

[0008]

[Means for Solving the Problems] The recognition device of the present invention comprises: synchronization means for synchronizing an input image and speech; extraction means for extracting a feature quantity from each of the synchronized image and speech and outputting a combined feature quantity obtained by combining the feature quantities of the image and the speech; and recognition means for recognizing the concept represented by the input image and speech by performing matching using the combined feature quantity output by the extraction means and the models in a dictionary.

[0009] The recognition method of the present invention comprises: a synchronization step of synchronizing an input image and speech; an extraction step of extracting a feature quantity from each of the synchronized image and speech and outputting a combined feature quantity obtained by combining the feature quantities of the image and the speech; and a recognition step of recognizing the concept represented by the input image and speech by performing matching using the combined feature quantity output in the extraction step and the models in a dictionary.

[0010] The first recording medium of the present invention has recorded thereon a program comprising: a synchronization step of synchronizing an input image and speech; an extraction step of extracting a feature quantity from each of the synchronized image and speech and outputting a combined feature quantity obtained by combining the feature quantities of the image and the speech; and a recognition step of recognizing the concept represented by the input image and speech by performing matching using the combined feature quantity output in the extraction step and the models in a dictionary.

[0011] The learning device of the present invention comprises: synchronization means for synchronizing an input image and speech; extraction means for extracting a feature quantity from each of the synchronized image and speech and outputting a combined feature quantity obtained by combining the feature quantities of the image and the speech; and learning means for performing learning based on the combined feature quantity output by the extraction means and generating a dictionary in which the combined feature quantity obtained from an image and speech representing the same concept is associated with concept information representing the concept of that image and speech.

[0012] The learning method of the present invention comprises: a synchronization step of synchronizing an input image and speech; an extraction step of extracting a feature quantity from each of the synchronized image and speech and outputting a combined feature quantity obtained by combining the feature quantities of the image and the speech; and a learning step of generating a model by performing learning based on the combined feature quantity output in the extraction step, and generating a dictionary in which the model corresponding to an image and speech representing the same concept is associated with concept information representing the concept of that image and speech.

[0013] The second recording medium of the present invention has recorded thereon a program comprising: a synchronization step of synchronizing an input image and speech; an extraction step of extracting a feature quantity from each of the synchronized image and speech and outputting a combined feature quantity obtained by combining the feature quantities of the image and the speech; and a learning step of generating a model by performing learning based on the combined feature quantity output in the extraction step, and generating a dictionary in which the model corresponding to an image and speech representing the same concept is associated with concept information representing the concept of that image and speech.

[0014] In the recognition device, the recognition method, and the first recording medium of the present invention, an input image and speech are synchronized, a feature quantity is extracted from each of the synchronized image and speech, and a combined feature quantity obtained by combining the feature quantities of the image and the speech is output. The concept represented by the input image and speech is then recognized by performing matching using the combined feature quantity and the models in a dictionary.

[0015] In the learning device, the learning method, and the second recording medium of the present invention, an input image and speech are synchronized, a feature quantity is extracted from each of the synchronized image and speech, and a combined feature quantity obtained by combining the feature quantities of the image and the speech is output. A model is then generated by learning based on the combined feature quantity, and a dictionary is generated in which the model corresponding to an image and speech representing the same concept is associated with concept information representing the concept of that image and speech.

[0016]

[Embodiments of the Invention] FIG. 1 shows a configuration example of an embodiment of a recognition device to which the present invention is applied.

[0017] The sensor unit 1 comprises, for example, at least a microphone that picks up and outputs speech (sound) and a video camera that captures and outputs images. It senses speech and images as externally applied stimuli, performs A/D (analog-to-digital) conversion, and outputs the corresponding speech data and image data to the synchronization processing unit 2.

[0018] The synchronization processing unit 2 applies the synchronization processing described later to the speech data and image data output by the sensor unit 1, and supplies the mutually synchronized speech data and image data to the feature extraction unit 3 in units of predetermined frames.

[0019] The feature extraction unit 3 processes the speech data supplied from the synchronization processing unit 2 frame by frame (these frames are hereinafter called speech frames where appropriate) and extracts speech feature parameters. Likewise, it processes the image data supplied from the synchronization processing unit 2 frame by frame (hereinafter called image frames where appropriate) and extracts image feature parameters. The feature extraction unit 3 then combines the speech feature parameters obtained from each speech frame with the image feature parameters obtained from the image frame corresponding to that speech frame, and supplies the resulting combined feature parameters to the memory 4.

[0020] One way to combine the speech and image feature parameters, when both are given as vectors, is to concatenate the elements (components) of the speech feature vector and the image feature vector into a single vector.
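The concatenation described in [0020] can be sketched in a few lines of Python. The function name `combine_features` and the example values are illustrative assumptions, not part of the specification.

```python
from typing import List, Sequence


def combine_features(speech_vec: Sequence[float],
                     image_vec: Sequence[float]) -> List[float]:
    """Concatenate the components of the speech and image feature vectors."""
    return list(speech_vec) + list(image_vec)


# Hypothetical per-frame features: 3 speech components, 2 image components.
speech_frame = [0.12, -0.40, 0.05]
image_frame = [1.5, -0.3]
combined = combine_features(speech_frame, image_frame)
print(combined)
```

The combined vector has as many components as the two inputs together, so every model in the dictionary would have to be defined over that same combined dimension.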

[0021] As the speech feature parameters, it is possible to use, for example, MFCCs (Mel Frequency Cepstrum Coefficients), their inter-frame differences, power, and the like. As the image feature parameters, it is possible to use motion vectors, color information, DCT (Discrete Cosine Transform) coefficients, information representing the shape of an object shown in the image, and the like.
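Two of the simpler speech features named in [0021], frame power and the inter-frame difference, can be sketched as follows. This is only an illustration under stated assumptions: power is taken as the mean squared amplitude of a frame, and the inter-frame difference as a plain first-order difference between consecutive frames. The specification fixes neither definition, and the function names are hypothetical. MFCC extraction itself is omitted.

```python
from typing import List, Sequence


def frame_power(samples: Sequence[float]) -> float:
    """Mean squared amplitude of one speech frame (assumed definition)."""
    return sum(x * x for x in samples) / len(samples)


def inter_frame_difference(features: Sequence[Sequence[float]]) -> List[List[float]]:
    """First-order difference between consecutive frames.

    The first frame has no predecessor, so it is assigned zeros.
    """
    deltas = [[0.0] * len(features[0])]
    for prev, cur in zip(features, features[1:]):
        deltas.append([c - p for p, c in zip(prev, cur)])
    return deltas


print(frame_power([1.0, -1.0, 2.0, 0.0]))
print(inter_frame_difference([[1.0, 2.0], [1.5, 1.0]]))
```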

[0022] The memory 4 temporarily stores the combined feature parameters supplied from the feature extraction unit 3.

[0023] The recognition processing unit 5 performs matching processing: using each model registered in the dictionary of the dictionary database 6 and the combined feature parameters stored in the memory 4, it obtains, for each model, a score (likelihood) that the combined feature parameters are observed from that model, and finds the model that best matches the combined feature parameters, for example the model with the highest score. Based on that highest score, the recognition processing unit 5 then determines whether a model corresponding to the speech and image currently being recognized is already registered in the dictionary of the dictionary database 6, that is, whether it has been learned. If it has been learned, the recognition processing unit 5 obtains the concept information corresponding to the highest-scoring model and outputs it as the recognition result for the concept represented by the speech and image input to the sensor unit 1. If the model corresponding to the speech and image currently being recognized has not been learned (that is, it is not registered in the dictionary of the dictionary database 6), the recognition processing unit 5 requests the learning unit 7 to learn that model.

[0024] Here, a larger score value is assumed to indicate a higher likelihood. As the score, it is possible to use, for example, the probability that the combined feature parameters are observed from the model, or a measure of the similarity of the combined feature parameters to the model (a confidence measure), such as the distance between the model and the combined feature parameters in the feature space. Note, however, that when the distance between the model and the combined feature parameters in the feature space is used as the score, a smaller score value means a higher likelihood.
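Paragraph [0024] notes that a distance-based score runs in the opposite direction to a probability-based one. One common way to keep the "larger is better" convention, shown here purely as an assumption and not as the method of the specification, is to negate the distance. The function name `distance_score` and the use of Euclidean distance are illustrative choices.

```python
import math
from typing import Sequence


def distance_score(model_point: Sequence[float],
                   feature: Sequence[float]) -> float:
    """Negated Euclidean distance, so that a larger value means a closer match."""
    return -math.dist(model_point, feature)


near = distance_score([0.0, 0.0], [0.3, 0.4])
far = distance_score([0.0, 0.0], [3.0, 4.0])
print(near > far)  # the closer feature receives the larger score
```

With this convention the same "highest score" comparison can be applied whether the underlying measure is a probability or a distance.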

[0025] The dictionary database 6 stores a dictionary in which speech-and-image models obtained by learning with the combined feature parameters are registered in association with concept information representing the concepts of that speech and image. As the speech-and-image model, a probabilistic model such as an HMM (Hidden Markov Model), or another kind of model, can be adopted.

[0026] When the learning unit 7 receives a learning request from the recognition processing unit 5, it learns a speech-and-image model using the combined feature parameters stored in the memory 4. The learning unit 7 then registers the model obtained as a result of learning in the dictionary of the dictionary database 6, in association with the concept information of the speech and image that the model represents.

[0027] The input unit 8 comprises, for example, a keyboard, and is operated by the user to input the concept information of the speech and image corresponding to the model that the learning unit 7 is learning. The concept information input by operating the input unit 8 is supplied to the learning unit 7. The input unit 8 can also be implemented by other means, for example a speech recognition device, in which case the concept information can be input by speech.

[0028] Next, the operation of the recognition device of FIG. 1 will be described with reference to the flowchart of FIG. 2.

[0029] The sensor unit 1 senses speech and images and outputs the corresponding speech data and image data to the synchronization processing unit 2. In step S1, the synchronization processing unit 2 applies synchronization processing to the speech data and image data output by the sensor unit 1, and supplies the mutually synchronized speech data, in units of predetermined speech frames, and image data, in units of predetermined image frames, to the feature extraction unit 3.

[0030] In step S2, the feature extraction unit 3 extracts feature parameters from the speech data supplied in units of speech frames by the synchronization processing unit 2, and likewise extracts feature parameters from the image data supplied in units of image frames by the synchronization processing unit 2. Also in step S2, the feature extraction unit 3 combines the speech feature parameters obtained from each speech frame with the image feature parameters obtained from the image frame corresponding to that speech frame, and supplies the result to the memory 4 for storage as combined feature parameters.

[0031] The processing then proceeds to step S3, where the recognition processing unit 5 performs matching processing to obtain scores. That is, using each model registered in the dictionary of the dictionary database 6 and the combined feature parameters stored in the memory 4, the recognition processing unit 5 obtains, for each model, the score that the combined feature parameters are observed from that model, finds the largest of those scores (the highest score), and proceeds to step S4.

[0032] In step S4, the recognition processing unit 5 determines whether a model of the speech and image input to the sensor unit 1 has been learned, according to whether the highest score is greater than (or, alternatively, greater than or equal to) a predetermined threshold ε. This determination can also be made by other means, for example by applying a technique used in speech recognition to determine whether input speech is a word not registered in the dictionary (an unknown, or out-of-vocabulary (OOV), word).

[0033] If it is determined in step S4 that the highest score is greater than the predetermined threshold ε, the recognition processing unit 5 regards the model of the speech and image input to the sensor unit 1 as having been learned and proceeds to step S5. In step S5, the recognition processing unit 5 reads the concept information associated with the highest-scoring model from the dictionary of the dictionary database 6, outputs it as the recognition result, and ends the processing.

[0034] If, on the other hand, it is determined in step S4 that the highest score is not greater than the predetermined threshold ε, the recognition processing unit 5 regards the model of the speech and image currently being recognized as not yet learned, requests the learning unit 7 to learn that model, and proceeds to step S6.

[0035] In step S6, the learning unit 7 learns a model of the speech and image input to the sensor unit 1, using the combined feature parameters stored in the memory 4. When an HMM is adopted as the model, the Baum-Welch re-estimation method, for example, can be used for learning the model.

[0036] When the learning unit 7 has obtained a model through learning, it outputs, from a display or speaker (not shown), a message requesting input of the concept information for the speech and image that the model represents, waits for the user to input the concept information by operating the input unit 8, and proceeds to step S7.

[0037] In step S7, the learning unit 7 associates the model obtained as a result of learning with the concept information input by operating the input unit 8, additionally registers them in the dictionary of the dictionary database 6, and ends the processing.
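The control flow of steps S3 to S7 can be summarized in a short Python sketch. To keep it self-contained, it makes several simplifying assumptions that are not in the specification: each model is a single mean vector rather than an HMM, the score is the negated mean Euclidean distance, the dictionary maps one concept string to one model, and `ask_concept` stands in for the user entering concept information through the input unit 8. All names are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def score(model: Vector, features: List[Vector]) -> float:
    """Stand-in score: negated mean distance to the model (larger is better)."""
    return -sum(math.dist(model, f) for f in features) / len(features)


def learn(features: List[Vector]) -> List[float]:
    """Stand-in learner: component-wise mean of the combined feature vectors."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def recognize_or_learn(features: List[Vector],
                       dictionary: Dict[str, List[float]],
                       threshold: float,
                       ask_concept: Callable[[], str]) -> Tuple[str, bool]:
    """Return (concept, learned_now) following steps S3 to S7."""
    # S3: score the features against every registered model, keep the best.
    best = max(((score(m, features), c) for c, m in dictionary.items()),
               default=None)
    # S4: a best score above the threshold means the model is already learned.
    if best is not None and best[0] > threshold:
        return best[1], False            # S5: output the concept information
    model = learn(features)              # S6: learn a new model
    concept = ask_concept()              # wait for the user's concept information
    dictionary[concept] = model          # S7: register model with its concept
    return concept, True


dictionary: Dict[str, List[float]] = {}
frames = [[0.0, 1.0], [0.2, 0.8]]
print(recognize_or_learn(frames, dictionary, -0.5, lambda: "kick the ball"))
print(recognize_or_learn(frames, dictionary, -0.5, lambda: "unused"))
```

The first call finds no registered model and therefore learns one; the second call presents the same features again and recognizes the concept registered by the first.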

[0038] In the case described above, whether the recognition processing unit 5 outputs a recognition result or the learning unit 7 performs learning is decided by the magnitude relationship between the highest score and the threshold ε. Alternatively, the device may be provided with, for example, a switch for switching between a recognition mode and a learning mode, and the decision may be made based on the operation of that switch.

[0039] Next, the synchronization processing performed by the synchronization processing unit 2 in step S1 of FIG. 2 will be described.

[0040] For example, suppose that the speech "kick the ball" and an image of a ball being kicked are input from the sensor unit 1. As shown in FIG. 3, the section in which the speech "kick the ball" is present (the speech section) and the section in which the image of the ball being kicked (an image in which the action of kicking the ball is taking place) is present (the image section) are not synchronized.

[0041] That is, the start point (time) S_b of the speech section of the speech "kick the ball" generally does not coincide with the start point M_b of the image section of the image of the ball being kicked, and the end point (time) S_e of the speech section generally does not coincide with the end point M_e of the image section either. Consequently, the length T_S (= S_e − S_b) of the speech section and the length T_M (= M_e − M_b) of the image section do not coincide either.

[0042] Also, when the recognition mode and the learning mode are switched by a switch as described above and learning is performed in the learning mode, the speech and image to be learned may be input repeatedly, as shown in FIG. 4, for example. In this case too, as in FIG. 3, the speech sections of the repeatedly input speech and the image sections of the images are not synchronized.

[0043] That is, let the i-th input speech and image be denoted S_i and M_i, respectively; let the start point, end point, and length of the speech section of the speech S_i be denoted B(S_i), E(S_i), and T(S_i); and let the start point, end point, and length of the image section of the image M_i be denoted B(M_i), E(M_i), and T(M_i). Then the start point B(S_i) of the i-th speech section and the start point B(M_i) of the i-th image section generally do not coincide, nor do the end points E(S_i) and E(M_i). Consequently, the length T(S_i) of the speech section and the length T(M_i) of the image section generally do not coincide either.

[0044] When the start points, end points, and lengths of the speech section and the image section do not coincide in this way, the combined feature parameters obtained by combining the speech and image feature parameters contain sections in which either the speech or the image feature parameters are absent (only one of the two is present). In that case, even though both speech and an image have been input for one and the same concept, only one of them is used to recognize the concept, and recognition performance is degraded compared with the case where both are used.

[0045] The synchronization processing unit 2 of FIG. 1 therefore synchronizes the speech S and the image M supplied from the sensor unit 1, as shown in FIG. 5.

[0046] That is, as shown in FIG. 5(A), for example, the synchronization processing unit 2 performs normalization processing that makes the start point S_b^* of the speech section of the speech S and the start point M_b^* of the image section of the image M coincide at one time, and makes the end point S_e^* of the speech section and the end point M_e^* of the image section coincide at another time. As a result of the normalization, the length of the speech section and the length of the image section become equal.

[0047] More specifically, the synchronization processing unit 2 takes, for example, either the start point S_b^* of the speech section or the start point M_b^* of the image section of the image M as a reference point, and makes both S_b^* and M_b^* coincide with that reference point. It also takes a predetermined time, for example 200, 400, or 800 milliseconds, as a reference time length, and changes the end point S_e^* of the speech section and the end point M_e^* of the image section so that the lengths of the speech section and the image section both equal the reference time length. The end point S_e^* of the speech section and the end point M_e^* of the image section therefore coincide at the point at which the reference time length has elapsed from the reference point.
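The normalization of [0047] amounts to replacing both sections with one common section that starts at the reference point and lasts for the reference time length. A minimal sketch, assuming sections are represented as `(start_ms, end_ms)` tuples; the representation, the function name, and the parameter names are illustrative and not taken from the specification.

```python
from typing import Tuple

Section = Tuple[float, float]  # (start_ms, end_ms)


def normalize_sections(speech: Section,
                       image: Section,
                       reference_length_ms: float = 400.0,
                       use_speech_start: bool = True) -> Tuple[Section, Section]:
    """Align both start points to a reference point and both lengths to a
    reference time length."""
    # Either start point may serve as the reference point.
    reference_point = speech[0] if use_speech_start else image[0]
    normalized = (reference_point, reference_point + reference_length_ms)
    # After normalization the speech and image sections are identical in time.
    return normalized, normalized


speech_section = (120.0, 610.0)   # hypothetical detected speech section
image_section = (80.0, 700.0)     # hypothetical detected image section
print(normalize_sections(speech_section, image_section))
```

Only the section boundaries change here. The frames inside each section still have to be stretched or compressed to fit the new length, which is a separate step.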

[0048] The synchronization processing unit 2 further performs association processing that associates each speech frame of the normalized speech section with each image frame of the normalized image section one-to-one, as shown in FIG. 5(B), for example, and thereby synchronizes the speech and the image.

[0049] Because the normalization processing changes the length of the speech section or the image section, the number of speech frames or image frames constituting that section must be increased or decreased. In the association processing as well, the number of speech frames or image frames may need to be increased or decreased when the time lengths of a speech frame and an image frame differ. The number of speech frames or image frames can be increased or decreased by, for example, interpolation or thinning.
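Increasing or decreasing the number of frames, as described in [0049], can be illustrated with linear interpolation over a sequence of per-frame numeric vectors. This is a sketch under assumptions: the specification names only "interpolation or thinning" without fixing a method, and the function name `resample_frames` and the vector representation of a frame are hypothetical.

```python
from typing import List, Sequence


def resample_frames(frames: Sequence[Sequence[float]],
                    target_count: int) -> List[List[float]]:
    """Linearly interpolate (or thin out) frames to exactly target_count frames."""
    if target_count <= 0 or not frames:
        raise ValueError("need at least one input frame and a positive target_count")
    if len(frames) == 1 or target_count == 1:
        return [list(frames[0]) for _ in range(target_count)]
    step = (len(frames) - 1) / (target_count - 1)
    out = []
    for k in range(target_count):
        pos = k * step                       # fractional position in the input
        lo = int(pos)
        hi = min(lo + 1, len(frames) - 1)
        w = pos - lo                         # weight of the later frame
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


print(resample_frames([[0.0], [2.0]], 3))                       # more frames
print(resample_frames([[0.0], [1.0], [2.0], [3.0], [4.0]], 3))  # fewer frames
```

The same routine covers both directions: a target count above the input count interpolates new frames, and a target count below it thins the sequence.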

【００５０】ここで、上述の場合には、音声フレームと
画像フレームとを、一対一に対応付けるようにしたが、
音声フレームと画像フレームとは、一対多または多対一
に対応付けることも可能である。即ち、例えば、画像フ
レームの時間長が、音声フレームの時間長のＬ倍に一致
する場合には、１フレームの画像フレームと、Ｌフレー
ムの音声フレームとを対応付けるようにすることが可能
である。Here, in the above case, the audio frames and the image frames are associated one-to-one, but they can also be associated one-to-many or many-to-one. That is, for example, when the time length of an image frame equals L times the time length of an audio frame, one image frame can be associated with L audio frames.
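
The one-to-many case can be sketched as below, under the assumption that both streams have already been normalized so that the audio stream holds exactly L frames per image frame. The function name `associate_one_to_many` and the string frame labels are illustrative only.

```python
def associate_one_to_many(image_frames, audio_frames, L):
    """Pair each image frame with the L consecutive audio frames that cover it."""
    if L <= 0 or len(audio_frames) != L * len(image_frames):
        raise ValueError("expected exactly L audio frames per image frame")
    return [
        (img, audio_frames[i * L:(i + 1) * L])
        for i, img in enumerate(image_frames)
    ]


pairs = associate_one_to_many(
    ["m0", "m1"],
    ["s0", "s1", "s2", "s3", "s4", "s5"],
    L=3,
)
```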

【００５１】次に、図６は、図１の同期処理部２の構成
例を示している。Next, FIG. 6 shows an example of the configuration of the synchronization processing section 2 of FIG.

【００５２】センサ部１（図１）が出力する画像と音声
は、区間検出部１１とメモリ１２に供給される。The image and sound output from the sensor unit 1 (FIG. 1) are supplied to a section detection unit 11 and a memory 12.

【００５３】区間検出部１１は、そこに供給される画像
の画像区間と、音声の音声区間とを検出し、区間正規化
部１３および同期化部１４に供給する。即ち、区間検出
部１１は、例えば、各画像フレーム全体の動きベクトル
を求め、その動きベクトルに基づき、ある程度の動きの
ある画像フレームが連続している区間を画像区間として
検出する。また、区間検出部１１は、例えば、各音声フ
レームのパワーを求め、そのパワーに基づき、ある程度
のパワーを有する音声フレームが連続している区間を音
声区間として検出する。The section detection unit 11 detects the image section of the image supplied to it and the voice section of the sound supplied to it, and supplies them to the section normalization unit 13 and the synchronization unit 14. Specifically, the section detection unit 11 obtains, for example, a motion vector for each image frame as a whole and, based on those motion vectors, detects as the image section a section in which image frames with a certain degree of motion continue. Likewise, the section detection unit 11 obtains, for example, the power of each audio frame and, based on that power, detects as the voice section a section in which audio frames with a certain level of power continue.
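
A minimal sketch of this kind of section detection is given below. It assumes that "a certain level" means a fixed threshold, that frame power is the mean squared sample value, and that the longest run of frames at or above the threshold is taken as the section; none of these details are stated in the source, and the names `frame_power` and `detect_section` are hypothetical. The same run detection could be applied to per-frame motion magnitudes to find the image section.

```python
def frame_power(frame):
    """Mean squared amplitude of one audio frame (a list of samples)."""
    return sum(x * x for x in frame) / len(frame)


def detect_section(values, threshold, min_frames=1):
    """Return (start, end) frame indices, end exclusive, of the longest run of
    consecutive frames whose value is >= threshold, or None if there is none."""
    best = None
    run_start = None
    # A sentinel below any threshold closes a run that reaches the last frame.
    for i, v in enumerate(list(values) + [float("-inf")]):
        if v >= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            length = i - run_start
            if length >= min_frames and (best is None or length > best[1] - best[0]):
                best = (run_start, i)
            run_start = None
    return best


powers = [0.0, 0.1, 0.9, 1.2, 0.8, 0.05, 0.7]
voice_section = detect_section(powers, threshold=0.5)
```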

【００５４】メモリ１２は、そこに供給される画像デー
タと音声データを一時記憶する。The memory 12 temporarily stores the image data and the audio data supplied thereto.

【００５５】区間正規化部１３は、区間検出部１１から
供給される音声区間を構成する音声フレームの音声デー
タと、同じく区間検出部１１から供給される画像区間を
構成する画像フレームの画像データを、メモリ１２から
読み出す。さらに、区間正規化部１３は、その音声区間
の音声データと、画像区間の画像データについて、上述
した正規化処理を行い、これにより、始点、終点、長さ
を一致させた（正規化された）音声区間と画像区間の音
声データと画像データを得て、同期化部１４に供給す
る。The section normalization unit 13 reads from the memory 12 the audio data of the audio frames constituting the voice section supplied from the section detection unit 11 and the image data of the image frames constituting the image section likewise supplied from the section detection unit 11. The section normalization unit 13 then performs the above-described normalization processing on the audio data of the voice section and the image data of the image section, thereby obtaining the audio data and the image data of a voice section and an image section whose start points, end points, and lengths match (that is, are normalized), and supplies them to the synchronization unit 14.

【００５６】同期化部１４は、区間正規化部１３からの
音声データと画像データについて、上述の対応付け処理
を行うことにより、正規化された音声区間と画像区間の
音声フレームと画像フレームとを一対一に対応付け、特
徴抽出部３（図１）に出力する。The synchronization unit 14 performs the above-described association process on the audio data and the image data from the section normalization unit 13, thereby associating the audio frames and the image frames of the normalized voice section and image section on a one-to-one basis, and outputs the result to the feature extraction unit 3 (FIG. 1).

【００５７】次に、図７のフローチャートを参照して、
図６の同期処理部２が、図２のステップＳ１において行
う同期処理について説明する。Next, the synchronization processing performed by the synchronization processing unit 2 of FIG. 6 in step S1 of FIG. 2 will be described with reference to the flowchart of FIG. 7.

【００５８】センサ部１（図１）が出力する音声データ
と画像データは、区間検出部１１とメモリ１２に供給さ
れ、メモリ１２は、その音声データと画像データを一時
記憶する。The voice data and the image data output from the sensor unit 1 (FIG. 1) are supplied to a section detection unit 11 and a memory 12, and the memory 12 temporarily stores the voice data and the image data.

【００５９】区間検出部１１は、ステップＳ１１におい
て、センサ部１からの画像データの画像区間と、音声デ
ータの音声区間とを検出し、区間正規化部１３および同
期化部１４に供給して、ステップＳ１２に進む。In step S11, the section detection unit 11 detects the image section of the image data and the voice section of the audio data from the sensor unit 1, supplies them to the section normalization unit 13 and the synchronization unit 14, and the process proceeds to step S12.

【００６０】ステップＳ１２では、区間正規化部１３
は、区間検出部１１からの音声区間と画像区間それぞれ
を構成する音声フレームの音声データと、画像フレーム
の画像データを、メモリ１２から読み出し、正規化処理
を施すことで、始点、終点、長さを一致させた音声区間
と画像区間、即ち、正規化された音声区間と画像区間の
音声データと画像データを得て、同期化部１４に供給す
る。In step S12, the section normalization unit 13 reads from the memory 12 the audio data of the audio frames and the image data of the image frames constituting the voice section and the image section, respectively, supplied from the section detection unit 11, and applies the normalization processing to them. It thereby obtains the audio data and the image data of a voice section and an image section whose start points, end points, and lengths match, that is, of the normalized voice section and image section, and supplies them to the synchronization unit 14.

【００６１】同期化部１４は、ステップＳ１３におい
て、区間正規化部１３からの、正規化された音声区間と
画像区間の音声フレームと画像フレームとを一対一に対
応付ける対応付け処理を行うことで、音声フレームと画
像フレームとをフレーム同期化し、特徴抽出部３（図
１）に出力して、同期処理を終了する。In step S13, the synchronization unit 14 performs the association process of associating, on a one-to-one basis, the audio frames and the image frames of the normalized voice section and image section from the section normalization unit 13, thereby frame-synchronizing the audio frames and the image frames. It outputs the result to the feature extraction unit 3 (FIG. 1), and the synchronization processing ends.

【００６２】以上のように、センサ部１から入力され
る、同一概念を表す音声データと画像データとを同期さ
せるようにしたので、その音声データと画像データから
得られる特徴パラメータを合成した合成特徴パラメータ
において、音声または画像の特徴パラメータのうちのい
ずれか一方が存在しない区間が生じなくなり、そのよう
な合成特徴パラメータを用いて認識が行われる結果、認
識性能を向上させることができる。As described above, the audio data and the image data that are input from the sensor unit 1 and represent the same concept are synchronized. As a result, the composite feature parameter obtained by combining the feature parameters derived from the audio data and the image data contains no section in which either the audio or the image feature parameter is missing, and because recognition is performed using such composite feature parameters, recognition performance can be improved.
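
To illustrate why synchronization leaves no section with a missing modality, the sketch below builds one composite feature vector per synchronized frame pair. Combining by simple concatenation is an assumption made for this example, as is the function name `combine_features`; this passage does not state how the feature parameters are actually combined.

```python
def combine_features(audio_feats, image_feats):
    """Concatenate per-frame audio and image feature vectors.

    Requires the two streams to be frame-synchronized (one-to-one), so every
    composite vector contains both an audio part and an image part.
    """
    if len(audio_feats) != len(image_feats):
        raise ValueError("streams must be frame-synchronized first")
    return [list(a) + list(m) for a, m in zip(audio_feats, image_feats)]


composite = combine_features(
    [[1.0, 2.0], [3.0, 4.0]],   # audio feature vectors, one per frame
    [[9.0], [8.0]],             # image feature vectors, one per frame
)
```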

【００６３】次に、上述した一連の処理は、ハードウェ
アにより行うこともできるし、ソフトウェアにより行う
こともできる。一連の処理をソフトウェアによって行う
場合には、そのソフトウェアを構成するプログラムが、
汎用のコンピュータ等にインストールされる。The series of processes described above can be performed by hardware or by software. When the series of processes is performed by software, a program constituting the software is installed on a general-purpose computer or the like.

【００６４】そこで、図８は、上述した一連の処理を実
行するプログラムがインストールされるコンピュータの
一実施の形態の構成例を示している。FIG. 8 shows a configuration example of an embodiment of a computer in which a program for executing the above-described series of processing is installed.

【００６５】プログラムは、コンピュータに内蔵されて
いる記録媒体としてのハードディスク１０５やＲＯＭ１
０３に予め記録しておくことができる。The program can be recorded in advance on the hard disk 105 or the ROM 103 serving as a recording medium built into the computer.

【００６６】あるいはまた、プログラムは、フロッピー
（登録商標）ディスク、CD-ROM(Compact Disc Read Onl
y Memory)，MO(Magneto optical)ディスク，DVD(Digita
l Versatile Disc)、磁気ディスク、半導体メモリなど
のリムーバブル記録媒体１１１に、一時的あるいは永続
的に格納（記録）しておくことができる。このようなリ
ムーバブル記録媒体１１１は、いわゆるパッケージソフ
トウエアとして提供することができる。Alternatively, the program can be stored (recorded) temporarily or permanently on a removable recording medium 111 such as a floppy (registered trademark) disk, a CD-ROM (Compact Disc Read Only Memory), an MO (Magneto Optical) disc, a DVD (Digital Versatile Disc), a magnetic disk, or a semiconductor memory. Such a removable recording medium 111 can be provided as so-called packaged software.

【００６７】なお、プログラムは、上述したようなリム
ーバブル記録媒体１１１からコンピュータにインストー
ルする他、ダウンロードサイトから、ディジタル衛星放
送用の人工衛星を介して、コンピュータに無線で転送し
たり、LAN(Local Area Network)、インターネットとい
ったネットワークを介して、コンピュータに有線で転送
し、コンピュータでは、そのようにして転送されてくる
プログラムを、通信部１０８で受信し、内蔵するハード
ディスク１０５にインストールすることができる。Besides being installed on the computer from the removable recording medium 111 as described above, the program can be transferred to the computer wirelessly from a download site via an artificial satellite for digital satellite broadcasting, or transferred to the computer by wire via a network such as a LAN (Local Area Network) or the Internet. The computer can receive the program transferred in this way at the communication unit 108 and install it on the built-in hard disk 105.

【００６８】コンピュータは、CPU(Central Processing
Unit)１０２を内蔵している。CPU１０２には、バス１
０１を介して、入出力インタフェース１１０が接続され
ており、CPU１０２は、入出力インタフェース１１０を
介して、ユーザによって、キーボードや、マウス、マイ
ク等で構成される入力部１０７が操作等されることによ
り指令が入力されると、それにしたがって、ROM(Read O
nly Memory)１０３に格納されているプログラムを実行
する。あるいは、また、CPU１０２は、ハードディスク
１０５に格納されているプログラム、衛星若しくはネッ
トワークから転送され、通信部１０８で受信されてハー
ドディスク１０５にインストールされたプログラム、ま
たはドライブ１０９に装着されたリムーバブル記録媒体
１１１から読み出されてハードディスク１０５にインス
トールされたプログラムを、RAM(Random Access Memor
y)１０４にロードして実行する。これにより、CPU１０
２は、上述したフローチャートにしたがった処理、ある
いは上述したブロック図の構成により行われる処理を行
う。そして、CPU１０２は、その処理結果を、必要に応
じて、例えば、入出力インタフェース１１０を介して、
LCD(Liquid CryStal Display)やスピーカ等で構成され
る出力部１０６から出力、あるいは、通信部１０８から
送信、さらには、ハードディスク１０５に記録等させ
る。The computer has a built-in CPU (Central Processing Unit) 102. An input/output interface 110 is connected to the CPU 102 via a bus 101. When a command is input via the input/output interface 110 by the user operating an input unit 107 composed of a keyboard, a mouse, a microphone, and the like, the CPU 102 executes the program stored in the ROM (Read Only Memory) 103 accordingly. Alternatively, the CPU 102 loads into the RAM (Random Access Memory) 104 and executes a program stored on the hard disk 105, a program transferred from a satellite or a network, received by the communication unit 108, and installed on the hard disk 105, or a program read from the removable recording medium 111 mounted in the drive 109 and installed on the hard disk 105. The CPU 102 thereby performs the processing according to the above-described flowcharts or the processing performed by the configurations of the above-described block diagrams. Then, as necessary, the CPU 102 outputs the processing result, for example via the input/output interface 110, from an output unit 106 composed of an LCD (Liquid Crystal Display), a speaker, and the like, transmits it from the communication unit 108, or records it on the hard disk 105.

【００６９】ここで、本明細書において、コンピュータ
に各種の処理を行わせるためのプログラムを記述する処
理ステップは、必ずしもフローチャートとして記載され
た順序に沿って時系列に処理する必要はなく、並列的あ
るいは個別に実行される処理（例えば、並列処理あるい
はオブジェクトによる処理）も含むものである。Here, in the present specification, processing steps for writing a program for causing a computer to perform various kinds of processing do not necessarily have to be processed in chronological order in the order described in the flowchart, and may be performed in parallel. Alternatively, it also includes processing executed individually (for example, parallel processing or processing by an object).

【００７０】また、プログラムは、１のコンピュータに
より処理されるものであっても良いし、複数のコンピュ
ータによって分散処理されるものであっても良い。さら
に、プログラムは、遠方のコンピュータに転送されて実
行されるものであっても良い。The program may be processed by one computer, or may be processed in a distributed manner by a plurality of computers. Further, the program may be transferred to a remote computer and executed.

【００７１】なお、図１の認識装置は、例えば、エンタ
テイメント用のロボットや、音声対話システム等に適用
可能である。図１の認識装置をロボットに適用した場合
には、例えば、ユーザの発話（例えば、「ボールが転が
っている」など）と、対応する画像（例えば、ボールが
転がっている様子が撮影された画像）から、周囲で生じ
ている事象を、より正確に表す概念を得て、ロボットに
的確な行動をとらせること等が可能となる。また、図１
の認識装置を対話システムに適用した場合には、例え
ば、ユーザの発話とジェスチャから、ユーザの意図を、
より正確に表す概念を得て、的確な応答を返すようにす
ること等が可能となる。The recognition device of FIG. 1 can be applied to, for example, an entertainment robot or a voice dialogue system. When the recognition device of FIG. 1 is applied to a robot, it becomes possible, for example, to obtain from a user's utterance (for example, "the ball is rolling") and a corresponding image (for example, an image capturing the ball rolling) a concept that more accurately represents an event occurring in the surroundings, and to have the robot take an appropriate action. When the recognition device of FIG. 1 is applied to a dialogue system, it becomes possible, for example, to obtain from the user's utterance and gestures a concept that more accurately represents the user's intention, and to return an appropriate response.

【００７２】ここで、本実施の形態においては、音声と
画像の両方から、それらが表す概念を認識するようにし
たが、音声または画像のいずれか一方だけから、それが
表す概念を認識するようにすることも可能である。In the present embodiment, the concept is recognized from both the voice and the image that represent it, but it is also possible to recognize the concept from only one of the voice and the image.

【００７３】また、センサ部１には、マイクとビデオカ
メラだけでなく、例えば、圧力センサ等の、画像および
音声以外の外部からの刺激を感知するセンサを設け、そ
のセンサで得られる情報も、合成特徴パラメータに含め
て、認識および学習を行うようにすることが可能であ
る。The sensor unit 1 may be provided not only with a microphone and a video camera but also with a sensor that senses external stimuli other than images and sound, such as a pressure sensor, and the information obtained by that sensor may also be included in the composite feature parameter for recognition and learning.

【００７４】[0074]

【発明の効果】本発明の認識装置および認識方法、並び
に第１の記録媒体によれば、入力された画像と音声を同
期させ、その同期された画像と音声それぞれから、特徴
量を抽出して、その画像と音声の特徴量を合成した合成
特徴量を得る。そして、その合成特徴量と、辞書におけ
るモデルとを用いてマッチングを行うことにより、入力
された画像と音声が表す概念を認識する。従って、認識
性能を向上させることが可能となる。According to the recognition device, the recognition method, and the first recording medium of the present invention, an input image and sound are synchronized, a feature amount is extracted from each of the synchronized image and sound, and a combined feature amount is obtained by combining the image and sound feature amounts. The concept represented by the input image and sound is then recognized by performing matching using the combined feature amount and the model in the dictionary. Recognition performance can therefore be improved.

【００７５】本発明の学習装置および学習方法、並びに
第２の記録媒体によれば、入力された画像と音声を同期
させ、その同期された画像と音声それぞれから、特徴量
を抽出して、その画像と音声の特徴量を合成した合成特
徴量を得る。そして、その合成特徴量に基づいて学習を
行うことによりモデルを生成し、同一概念を表す画像お
よび音声に対応するモデルと、その画像および音声の概
念を表す概念情報とを対応付けた辞書を生成する。従っ
て、その辞書を用いた概念の認識を行う場合において、
その認識性能を向上させることが可能となる。According to the learning device, the learning method, and the second recording medium of the present invention, an input image and sound are synchronized, a feature amount is extracted from each of the synchronized image and sound, and a combined feature amount is obtained by combining the image and sound feature amounts. A model is then generated by performing learning based on the combined feature amount, and a dictionary is generated in which the model corresponding to an image and sound representing the same concept is associated with concept information representing the concept of that image and sound. Therefore, when concepts are recognized using the dictionary, the recognition performance can be improved.

[Brief description of the drawings]

【図１】本発明を適用した認識装置の一実施の形態の構
成例を示すブロック図である。FIG. 1 is a block diagram illustrating a configuration example of an embodiment of a recognition device to which the present invention has been applied.

【図２】認識装置の処理を説明するフローチャートであ
る。FIG. 2 is a flowchart illustrating processing of a recognition device.

【図３】音声と画像が同期していない様子を示す図であ
る。FIG. 3 is a diagram illustrating a state in which audio and images are not synchronized.

【図４】音声と画像が同期していない様子を示す図であ
る。FIG. 4 is a diagram showing a situation where audio and images are not synchronized.

【図５】音声と画像が同期している様子を示す図であ
る。FIG. 5 is a diagram illustrating a state in which sound and an image are synchronized.

【図６】同期処理部２の構成例を示すブロック図であ
る。FIG. 6 is a block diagram illustrating a configuration example of a synchronization processing unit 2.

【図７】同期処理部２による同期処理を説明するフロー
チャートである。FIG. 7 is a flowchart illustrating a synchronization process performed by the synchronization processing unit 2.

【図８】本発明を適用したコンピュータの一実施の形態
の構成例を示すブロック図である。FIG. 8 is a block diagram illustrating a configuration example of a computer according to an embodiment of the present invention.

[Explanation of symbols]

１センサ部，２同期処理部，３特徴抽出部，
４メモリ，５認識処理部，６辞書データベー
ス，７学習部，８入力部，１１区間検出部，
１２メモリ，１３区間正規化部，１４同期
化部，１０１バス，１０２ CPU，１０３ RO
M，１０４ RAM，１０５ハードディスク，１０
６出力部，１０７入力部，１０８通信部，
１０９ドライブ，１１０入出力インタフェース，
１１１リムーバブル記録媒体1 sensor unit, 2 synchronization processing unit, 3 feature extraction unit, 4 memory, 5 recognition processing unit, 6 dictionary database, 7 learning unit, 8 input unit, 11 section detection unit, 12 memory, 13 section normalization unit, 14 synchronization unit, 101 bus, 102 CPU, 103 ROM, 104 RAM, 105 hard disk, 106 output unit, 107 input unit, 108 communication unit, 109 drive, 110 input/output interface, 111 removable recording medium

Claims

[Claims]

1. A recognition device comprising: storage means for storing a dictionary in which a model of an image and sound representing the same concept is associated with concept information representing that concept; synchronization means for synchronizing an input image and sound; extraction means for extracting a feature amount from each of the synchronized image and sound and outputting a combined feature amount obtained by combining the image and sound feature amounts; and recognition means for recognizing the concept represented by the input image and sound by performing matching using the combined feature amount output by the extraction means and the model in the dictionary.

2. The recognition device according to claim 1, wherein the synchronization means synchronizes the input image and sound by detecting an image section, which is a section of the input image, and a voice section, which is a section of the input sound, and normalizing each of the image section and the voice section.

3. The recognition device according to claim 2, wherein the synchronization means further synchronizes the input image and sound by matching the start points and the end points of the normalized image section and voice section, respectively.

4. The recognition device according to claim 2, wherein the synchronization means further synchronizes the input image and sound by associating the image frames and the audio frames in the normalized image section and voice section with each other.

5. A recognition method for performing recognition processing by referring to a dictionary in which a model of an image and sound representing the same concept is associated with concept information representing that concept, the method comprising: a synchronization step of synchronizing an input image and sound; an extraction step of extracting a feature amount from each of the synchronized image and sound and outputting a combined feature amount obtained by combining the image and sound feature amounts; and a recognition step of recognizing the concept represented by the input image and sound by performing matching using the combined feature amount output in the extraction step and the model in the dictionary.

6. A recording medium on which is recorded a program for causing a computer to perform recognition processing by referring to a dictionary in which a model of an image and sound representing the same concept is associated with concept information representing that concept, the program comprising: a synchronization step of synchronizing an input image and sound; an extraction step of extracting a feature amount from each of the synchronized image and sound and outputting a combined feature amount obtained by combining the image and sound feature amounts; and a recognition step of recognizing the concept represented by the input image and sound by performing matching using the combined feature amount output in the extraction step and the model in the dictionary.

7. A learning device comprising: synchronization means for synchronizing an input image and sound; extraction means for extracting a feature amount from each of the synchronized image and sound and outputting a combined feature amount obtained by combining the image and sound feature amounts; and learning means for generating a model by performing learning based on the combined feature amount output by the extraction means, and generating a dictionary in which the model corresponding to an image and sound representing the same concept is associated with concept information representing the concept of that image and sound.

8. The learning device according to claim 7, wherein the synchronization means synchronizes the input image and sound by detecting an image section, which is a section of the input image, and a voice section, which is a section of the input sound, and normalizing each of the image section and the voice section.

9. The learning device according to claim 8, wherein the synchronization means further synchronizes the input image and sound by matching the start points and the end points of the normalized image section and voice section, respectively.

10. The learning device according to claim 8, wherein the synchronization means further synchronizes the input image and sound by associating the image frames and the audio frames in the normalized image section and voice section with each other.

11. A learning method comprising: a synchronization step of synchronizing an input image and sound; an extraction step of extracting a feature amount from each of the synchronized image and sound and outputting a combined feature amount obtained by combining the image and sound feature amounts; and a learning step of generating a model by performing learning based on the combined feature amount output in the extraction step, and generating a dictionary in which the model corresponding to an image and sound representing the same concept is associated with concept information representing the concept of that image and sound.

12. A recording medium on which is recorded a program for causing a computer to perform predetermined learning processing, the program comprising: a synchronization step of synchronizing an input image and sound; an extraction step of extracting a feature amount from each of the synchronized image and sound and outputting a combined feature amount obtained by combining the image and sound feature amounts; and a learning step of generating a model by performing learning based on the combined feature amount output in the extraction step, and generating a dictionary in which the model corresponding to an image and sound representing the same concept is associated with concept information representing the concept of that image and sound.