JP2002169586A

JP2002169586A - Composite model generating device for voice and image, environment adapting device for composite model of voice and image, and voice recognizing device

Info

Publication number: JP2002169586A
Application number: JP2000385184A
Authority: JP
Inventors: Kenichi Kumagai; 建一熊谷; Satoru Nakamura; 哲中村
Original assignee: ATR ONSEI GENGO TSUSHIN KENKYU; ATR Spoken Language Translation Research Laboratories
Current assignee: ATR ONSEI GENGO TSUSHIN KENKYU; ATR Spoken Language Translation Research Laboratories
Priority date: 2000-09-19
Filing date: 2000-12-19
Publication date: 2002-06-14

Abstract

PROBLEM TO BE SOLVED: To provide a speech and image composite model generating device for a speech recognition device that can recognize speech at a high recognition rate, and to provide such a speech recognition device. SOLUTION: In the speech and image composite model generating device 100, an HMM composition unit 16 computes the products of the speech and image output probabilities for all combinations of the states of a speech HMM and an image HMM, and generates a composite HMM whose states have composite Gaussian mixture distributions containing those products. An HMM training unit 17 then performs concatenated training on the generated composite HMM, using the labeled AV signals in a training AV data memory 31, so as to maximize the output likelihood, and thereby generates a trained composite HMM of speech and image. A speech recognition unit 26 of the speech recognition device 200 performs speech recognition with the trained composite HMM, based on the feature quantities extracted from the uttered speech signal and from the image signal.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

[Technical Field of the Invention] The present invention relates to a speech and image composite model generation device used when speech is recognized from an uttered speech signal and an image signal of the lips during utterance, to an environment adaptation device for a speech and image composite model, and to a speech recognition device using the composite model generation device and/or the environment adaptation device.

【０００２】[0002]

[Prior Art] Bimodal speech recognition systems that use both speech and moving images of the area around the lips have been studied in recent years as speech recognition systems better suited to real environments. One advantage of using lip-region video for speech recognition is that it gives higher recognition performance than speech alone when the signal-to-noise power ratio of the speech (hereinafter, SNR) is low, for example where it is hard to speak loudly, such as on a train or in a public place, or where the surroundings are noisy. In recent years, hidden Markov models (hereinafter, HMMs) have been used to model bimodal speech recognition, and their effectiveness has been reported (see, for example, Prior Art Document 1: Nakamura et al., "Speech recognition and lip image generation by integrating speech and lip images using HMMs," Information Processing Society of Japan, Spoken Language Information Processing, Vol. 15-17, February 8, 1997).

【０００３】[0003]

Conventional HMM-based bimodal recognition of speech and images uses one of two methods. The initial integration method (early integration) combines the image data and the speech data at the feature-vector stage and applies weighting coefficients to the output probabilities. The result integration method (late integration) processes speech and images in separate passes and applies weighting coefficients to the resulting likelihoods. Specifically, the initial integration method treats the speech parameters and the image parameters as independent parameter streams, computes the product of the output probabilities of the respective HMMs in each state, and uses that product as the output probability of the state; each stream's output probability is raised to the power of its weighting coefficient. The result integration method, in contrast, first computes the likelihoods of all words separately for the speech data and for the image data, and finally adds the weighted log-likelihood of the speech data and the weighted log-likelihood of the image data for the same word to obtain the log-likelihood of that word. These two methods serve as the conventional methods used as comparative examples.
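
The result integration method can be illustrated with a minimal Python sketch. The function name, the dictionary layout of the per-word log-likelihoods, and the example weights are assumptions made for illustration only; the text specifies just the weighted sum of log-likelihoods.

```python
def result_integration(word_ll_speech, word_ll_image, lam_a, lam_v):
    """Combine per-word log-likelihoods from two separately run recognizers.

    word_ll_speech / word_ll_image: dict mapping word -> log-likelihood
    (hypothetical layout). lam_a / lam_v: stream weighting coefficients.
    Returns the best-scoring word and the combined scores.
    """
    scores = {
        word: lam_a * word_ll_speech[word] + lam_v * word_ll_image[word]
        for word in word_ll_speech
    }
    best_word = max(scores, key=scores.get)
    return best_word, scores


# Illustrative values only: the image stream favors "b", the speech stream "a".
speech = {"a": -10.0, "b": -12.0}
image = {"a": -8.0, "b": -3.0}
best_equal, _ = result_integration(speech, image, 0.5, 0.5)
best_speech_heavy, _ = result_integration(speech, image, 0.9, 0.1)
```

With equal weights the image evidence decides the result; with the speech stream weighted heavily the speech evidence does, which is why the choice of weights matters under noise.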

【０００４】[0004]

[Problems to be Solved by the Invention] However, the initial integration method shares the HMM state transition probabilities between the two streams, and the result integration method processes speech and images in separate passes, so neither takes into account the relationship between speaking rate and lip movement. Moreover, even when speech with the same phonemes is uttered, the vocalization and the lip movement do not always coincide. As a result, the speech recognition rate has remained low.

【０００５】[0005]

In addition, when the speech HMM and the image HMM are integrated, deciding which source of information to emphasize according to the surrounding environment is an important problem, but no solution to it has yet been presented.

【０００６】[0006]

An object of the present invention is to solve the above problems and to provide a speech and image composite model generation device for a speech recognition device that can recognize speech at a higher recognition rate than the conventional examples, and a speech recognition device using that composite model generation device.

【０００７】[0007]

Another object of the present invention is to provide an environment adaptation device that, when the speech HMM and the image HMM are integrated, can adapt to the surrounding environment and generate a speech and image composite model enabling speech recognition at a higher recognition rate than the conventional examples, and a speech recognition device using that environment adaptation device.

【０００８】[0008]

[Means for Solving the Problems] A speech and image composite model generation device according to the present invention comprises: first storage means for storing an AV signal that includes an uttered speech signal and an image signal of the speaker's lips during utterance; first generation means for generating a speech HMM from the uttered speech signal of the AV signal so that the output likelihood is maximized; second generation means for generating an image HMM from the image signal of the AV signal so that the output likelihood is maximized; second storage means for storing the speech HMM generated by the first generation means; third storage means for storing the image HMM generated by the second generation means; composition means for combining the speech HMM stored in the second storage means and the image HMM stored in the third storage means by computing the product of the speech and image output probabilities for every combination of the states of the two HMMs and generating a composite HMM that contains the product of the output probabilities in each state; and training means for generating a trained composite HMM of speech and images by performing concatenated training, based on the generated composite HMM and using the labeled AV signal stored in the first storage means, so that the output likelihood is maximized.

【０００９】[0009]

A speech recognition device according to the present invention comprises: extraction means for extracting a feature quantity of the uttered speech signal and a feature quantity of the image signal from an input AV signal that includes an uttered speech signal and an image signal of the speaker's lips during utterance; and speech recognition means for performing speech recognition, based on the extracted feature quantities of the uttered speech signal and the image signal and using the trained composite HMM of speech and images generated by the above speech and image composite model generation device, and outputting a speech recognition result.

【００１０】[0010]

Furthermore, an environment adaptation device for a speech and image composite model according to the present invention comprises: fourth storage means for storing environment adaptation signal data, in which an AV signal that includes an uttered speech signal and an image signal of the speaker's lips during utterance is stored with phoneme labels; second speech recognition means for computing the likelihood obtained when the stored environment adaptation signal data is recognized using a predetermined HMM; and environment adaptation means for adapting the composite HMM to the environment by clustering the weighting coefficients of the phonemes in the trained composite HMM of speech and images, generated by the above speech and image composite model generation device, into a plurality of classes using a predetermined clustering criterion, and re-training the weighting coefficients of the phonemes belonging to each class, based on the computed likelihood, so that misrecognition is reduced.

【００１１】[0011]

Still further, in the above environment adaptation device for a speech and image composite model, the re-training by the environment adaptation means is preferably repeated until the number of environment adaptation signal data items in each class falls below a predetermined threshold.

【００１２】[0012]

Another speech recognition device according to the present invention comprises: extraction means for extracting a feature quantity of the uttered speech signal and a feature quantity of the image signal from an input AV signal that includes an uttered speech signal and an image signal of the speaker's lips during utterance; and third speech recognition means for performing speech recognition, based on the extracted feature quantities of the uttered speech signal and the image signal and using the composite HMM of speech and images that has been adapted to the environment by the above environment adaptation device for a speech and image composite model, and outputting a speech recognition result.

【００１３】[0013]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

【００１４】[0014]

<First Embodiment> Fig. 1 is a block diagram showing the configuration of a speech and image composite model generation device 100 and a speech recognition device 200 according to a first embodiment of the present invention. The composite model generation device 100 of this embodiment is characterized by comprising an HMM composition unit 16, which combines the speech HMM in the speech HMM memory 32a with the image HMM in the image HMM memory 32b, and an HMM training unit 17, which generates a trained composite HMM by performing concatenated training on the combined HMM using the phoneme-labeled training AV (Audio and Visual) data in the data memory 31. The speech recognition device 200 is characterized by comprising a speech recognition unit 26 that performs speech recognition using the composite HMM trained by the HMM training unit 17.

【００１５】[0015]

To solve the problems of the conventional examples described above, the present inventors propose generating a speech and image composite model by combining the speech HMM and the image HMM with an integration method based on HMM composition, which can describe the relationship between speaking rate and lip movement, and then training the result using phoneme-labeled training AV data.

【００１６】[0016]

In the speech and image composite model generation device 100 of Fig. 1, the phoneme-labeled training AV data memory 31 stores in advance the speech waveform data recorded when a specific speaker uttered a predetermined set of words, together with image data in which the area around the speaker's lips was recorded during the utterance with the speaker's head held fixed (hereinafter, image data). The data separation unit 11 separates the mixed speech and image data in the training AV data memory 31 into speech data and image data and outputs them to the synchronization unit 12. Because the frame period of the image data is 33.3 msec and that of the speech data is 8 msec, the synchronization unit 12 applies frame-shift processing so that the two are synchronized with each other, outputs the synchronized speech data to the preprocessing unit 13a, and outputs the synchronized image data to the preprocessing unit 13b.
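
The text does not spell out how the frame-shift processing aligns the two streams. One plausible reading, sketched below in Python, assigns to each 8 msec speech frame the image frame that is current at that instant. The function name and the "most recent image frame" rule are assumptions made for illustration, not the procedure stated in the patent.

```python
def align_image_to_speech(num_speech_frames, num_image_frames,
                          speech_period_ms=8.0, image_period_ms=33.3):
    """Return, for each speech frame, the index of the image frame to pair with it.

    Assumed rule: speech frame n starts at n * speech_period_ms and is paired
    with the image frame whose interval contains that instant. Indices are
    clamped so that trailing speech frames reuse the last image frame.
    """
    indices = []
    for n in range(num_speech_frames):
        start_ms = n * speech_period_ms
        image_index = int(start_ms // image_period_ms)
        indices.append(min(image_index, num_image_frames - 1))
    return indices


# Ten speech frames (0 to 72 msec) against three image frames.
mapping = align_image_to_speech(10, 3)
```

Under this rule each image frame is reused for four or five consecutive speech frames, since 33.3 / 8 is slightly more than 4.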

【００１７】[0017]

The preprocessing unit 13a downsamples the speech data, which was sampled at a sampling frequency of 44.1 kHz and has a maximum frequency of 22.05 kHz, to speech data with a maximum frequency of 12 kHz, and outputs it to the feature extraction unit 14a. The feature extraction unit 14a then performs, for example, LPC analysis on the input speech data to extract a feature vector containing 16th-order mel-cepstral coefficients, 16th-order Δ mel-cepstral coefficients, and Δ power, and outputs it to the speech HMM generation unit 15a.

【００１８】[0018]

The preprocessing unit 13b, for its part, converts the RGB JPEG image signal of each frame of the input image data (for example, 160 × 120 pixels) into a 256-level grayscale image signal, applies histogram equalization, and then normalizes the lip position so as to minimize the luminance difference from a reference frame. The feature extraction unit 14b then applies a two-dimensional FFT to the preprocessed data, for example over a region of 256 × 256 pixels. It computes the power spectrum in the spatial frequency domain and obtains dynamic features by computing the difference between frames. Specifically, it extracts, for example, a feature vector containing the parameters of a 35th-order smoothed log power spectrum and a 35th-order smoothed log Δ power spectrum, and outputs it to the image HMM generation unit 15b.

【００１９】[0019]

The speech HMM generation unit 15a performs labeled concatenated training, based on the input feature vectors and the phoneme labels in the training AV data memory 31 and using the known EM (expectation-maximization) algorithm, so that the output likelihood is maximized. It thereby generates a speech HMM with three states, each having a Gaussian mixture distribution, and outputs it to the speech HMM memory 32a, where it is stored. Similarly, the image HMM generation unit 15b performs labeled concatenated training, based on the input feature vectors and the phoneme labels in the training AV data memory 31 and using the EM algorithm, so that the output likelihood is maximized. It thereby generates an image HMM with two states, each having a Gaussian mixture distribution, and outputs it to the image HMM memory 32b, where it is stored.

【００２０】[0020]

Fig. 2(a) is a state transition diagram showing an example of the speech HMM in the speech HMM memory 32a of Fig. 1, and Fig. 2(b) is a state transition diagram showing an example of the image HMM in the image HMM memory 32b of Fig. 1. The speech HMM of Fig. 2(a) has three states along the state-transition direction of the speech data, and each state has a self-loop transition and a transition to the next state. The image HMM of Fig. 2(b) has two states along the state-transition direction of the image data, and each state likewise has a self-loop transition and a transition to the next state.

【００２１】[0021]

The HMM composition unit 16 combines the speech HMM in the speech HMM memory 32a and the image HMM in the image HMM memory 32b using the composite integration method according to the present invention: it computes the product of the speech and image output probabilities for every combination of the states of the two HMMs and generates a composite HMM in which each state has a combined Gaussian mixture distribution containing that product. It then outputs the composite HMM to the composite HMM memory 33, where it is stored. The output probability of each state of the composite HMM is formed as the product of the speech and image output probabilities, as shown in the following equation.

【００２２】[0022]

[Equation 1]
$$b_{ij}(O_t) = b_i^{(a)}\bigl(O_t^{(a)}\bigr)^{\lambda_a} \times b_j^{(v)}\bigl(O_t^{(v)}\bigr)^{\lambda_v}$$

【００２３】[0023]

Here, $b_i^{(a)}(O_t^{(a)})$ is the probability that state $i$ of the speech HMM outputs the feature vector $O_t^{(a)}$ at time $t$, $b_j^{(v)}(O_t^{(v)})$ is the probability that state $j$ of the image HMM outputs the feature vector $O_t^{(v)}$, and $\lambda_a$ and $\lambda_v$ are the weighting coefficients of the speech data stream and the image data stream, respectively. In the composite HMM, the transition probability $a_{ij,k\lambda}$ from state $S_{ij}$ to state $S_{k\lambda}$ is expressed by the following equation, using the transition probability $a_{ik}^{(a)}$ from state $S_i$ to state $S_k$ of the speech HMM and the transition probability $a_{j\lambda}^{(v)}$ from state $S_j$ to state $S_\lambda$ of the image HMM.

【００２４】[0024]

[Equation 2]
$$a_{ij,k\lambda} = a_{ik}^{(a)} \times a_{j\lambda}^{(v)}$$
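
The composition defined by the two equations can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the transition matrices are plain nested lists with made-up probabilities, the function names are hypothetical, the destination image state is written `l` in the code, and the output probability is handled in the log domain, where the weighted product becomes a weighted sum.

```python
from itertools import product


def compose_transitions(trans_speech, trans_image):
    """Build composite states (i, j) and transition probabilities.

    trans_speech[i][k] and trans_image[j][l] are the transition probabilities
    of the two source HMMs. The composite transition from (i, j) to (k, l)
    is the product trans_speech[i][k] * trans_image[j][l].
    """
    states = list(product(range(len(trans_speech)), range(len(trans_image))))
    transitions = {}
    for (i, j) in states:
        for (k, l) in states:
            p = trans_speech[i][k] * trans_image[j][l]
            if p > 0.0:  # keep only transitions that can actually occur
                transitions[(i, j), (k, l)] = p
    return states, transitions


def composite_log_output(log_b_speech, log_b_image, lam_a=1.0, lam_v=1.0):
    """Log of b_i(O_a)**lam_a * b_j(O_v)**lam_v for one composite state."""
    return lam_a * log_b_speech + lam_v * log_b_image


# Illustrative left-to-right models: 3 speech states and 2 image states.
speech_T = [[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]]
image_T = [[0.5, 0.5], [0.0, 1.0]]
states, transitions = compose_transitions(speech_T, image_T)
```

Because each composite state may advance along the speech axis, the image axis, or both, the composite model can represent speech and lip movement progressing at different rates.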

【００２５】[0025]

Fig. 3 is a state transition diagram showing an example of the composite HMM in the composite HMM memory 33 of Fig. 1. As shown in Fig. 3, the composite HMM has three states along the state-transition direction of the speech data and two states along the state-transition direction of the image data, giving an HMM with six states in total, each having a Gaussian mixture distribution. The composite HMM forms a three-dimensional trellis spanned by the speech HMM, the image HMM, and time. However, because the HMM parameters at composition time were trained independently for speech and for images, the synchrony between speech and images is not taken into account. In this embodiment, therefore, the HMM training unit 17 takes the composite HMM as the initial model and performs concatenated training with the EM algorithm so that the output likelihood is maximized, using audio-visual synchronized combined vectors, obtained by combining the speech and image feature vectors, together with the phoneme labels stored in the training AV data memory 31. It then outputs the trained composite HMM to the trained composite HMM memory 34, where it is stored. This training of the composite HMM by the HMM training unit 17 makes it possible to represent the synchrony between speech and images.

【００２６】[0026]

Fig. 4 shows the three-dimensional search space of composite integration in the speech and image composite model generation device 100 of Fig. 1. As shown in Fig. 4, during speech recognition the search under composite integration runs over a three-dimensional trellis whose axes are the speech HMM states, the image HMM states, and the time frames, so the speech and image states can be searched asynchronously.
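
A search over such a trellis can be sketched with a generic log-domain Viterbi pass, shown here in Python. The patent does not name the search algorithm, so Viterbi decoding is an assumption, as are the function name and data layout. In the composite model the state labels would be the (speech state, image state) pairs; the toy example uses two plain labels to keep the numbers easy to check, and the sketch ends in whichever state scores best rather than forcing a final state.

```python
import math


def viterbi(states, log_trans, log_emit, start):
    """Best state path through an HMM trellis, in the log domain.

    log_trans: dict mapping (src, dst) -> log transition probability.
    log_emit: list over time frames of dicts mapping state -> log output prob.
    start: state in which the path must begin.
    Returns (best log score, best path).
    """
    score = {s: (log_emit[0][s] if s == start else -math.inf) for s in states}
    backpointers = []
    for frame in log_emit[1:]:
        new_score, pointer = {}, {}
        for dst in states:
            best_src, best = None, -math.inf
            for src in states:
                lt = log_trans.get((src, dst))
                if lt is not None and score[src] + lt > best:
                    best_src, best = src, score[src] + lt
            new_score[dst] = best + frame[dst] if best_src is not None else -math.inf
            pointer[dst] = best_src
        backpointers.append(pointer)
        score = new_score
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return score[last], path


# Toy two-state example with illustrative probabilities.
toy_states = ["A", "B"]
toy_trans = {("A", "A"): math.log(0.5), ("A", "B"): math.log(0.5), ("B", "B"): 0.0}
toy_emit = [
    {"A": math.log(0.9), "B": math.log(0.1)},
    {"A": math.log(0.2), "B": math.log(0.8)},
    {"A": math.log(0.1), "B": math.log(0.9)},
]
best_score, best_path = viterbi(toy_states, toy_trans, toy_emit, "A")
```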

【００２７】[0027]

In Fig. 1, the speech recognition device 200 comprises an input AV data memory 41, a data separation unit 21, a synchronization unit 22, preprocessing units 23a and 23b, feature extraction units 24a and 24b, a feature combining unit 25, and a speech recognition unit 26. The processing from the data separation unit 21 through the feature extraction units 24a and 24b is the same as that from the data separation unit 11 through the feature extraction units 14a and 14b in the speech and image composite model generation device 100, so a detailed description is omitted.

【００２８】[0028]

In the speech recognition device 200, the input AV data memory 41 stores in advance AV data containing the waveform data of the uttered speech signal of the sentence to be recognized and image data in which the area around the speaker's lips was recorded during the utterance. The data separation unit 21, like the data separation unit 11, separates the AV data input from the input AV data memory 41 into the speech data of the uttered speech signal and the image data, and outputs them to the synchronization unit 22. The feature combining unit 25 combines the speech feature vector extracted by the feature extraction unit 24a with the image feature vector extracted by the feature extraction unit 24b (merging the data of the same frame into a single feature vector) and outputs the result to the speech recognition unit 26. The speech recognition unit 26 inputs the feature vectors received from the feature combining unit 25 to the trained composite HMM in the trained composite HMM memory 34, computes the likelihood under that HMM (specifically, the log-likelihood), and determines the maximum-likelihood phonemes. Based on the determined maximum-likelihood phonemes, the speech recognition unit 26 then computes word likelihoods using the phoneme-based word HMMs stored in the word HMM 42 and determines the character string of the maximum-likelihood word. It thereby executes the speech recognition processing and outputs the character string of the recognition result.

【００２９】[0029]

In the speech and image composite model generation device 100 configured as described above, the HMM training unit 17 performs concatenated training with phoneme labels on the composite HMM produced by the HMM composition unit 16. This establishes the synchrony between speech and images and yields a trained composite HMM that reflects it. Performing speech recognition with this model in the speech recognition unit 26 gives a higher recognition rate than the conventional initial integration method and result integration method.

【００３０】[0030]

[Examples] <Example of the First Embodiment> The present inventors carried out the following evaluation experiment using the speech and image composite model generation device 100 and the speech recognition device 200 of the first embodiment and obtained the results below. Table 1 shows the experimental conditions.

【００３１】[0031]

[Table 1] Experimental conditions
----------------------------------------------------------------------
Speech
  Sampling frequency: 12 kHz
  Analysis window function: Hamming window
  Frame length: 32 msec
  Frame shift: 8 msec
  Parameters: 16-dimensional MFCC, 16-dimensional ΔMFCC
----------------------------------------------------------------------
Image
  Frame shift: 33 msec
  Preprocessing 1: conversion from RGB signal to 256-level grayscale image signal
  Preprocessing 2: histogram equalization
  Preprocessing 3: lip position normalization
  Parameters: 35-dimensional smoothed log power spectrum and 35-dimensional smoothed log Δ power spectrum
----------------------------------------------------------------------
Number of HMM states
  Result integration method, composite integration: speech 3, image 2
  Initial integration method: 3
----------------------------------------------------------------------
Probability density function
  Gaussian distribution: 2 mixtures
HMM
  Context-independent phoneme models, 55 phonemes
----------------------------------------------------------------------
Training data
  Synchronized speech and image data, one female speaker, 4740 words
----------------------------------------------------------------------
Test data
  200 words (3 sets) (open condition)
----------------------------------------------------------------------

【００３２】本実験では、音響実験室で、特定話者（女
性話者１人）が特許出願人が所有する発声リストの５２
４０単語を発話しているデータベースを用いた。音声と
画像のフレームシフトは１：４であるため、画像は、４
フレーム同じフレームを埋め込み、音声と画像のフレー
ムシフトを調整を行う。また、収録した画像データは発
話単語により、照明条件の違いや顔の傾きなどが見られ
る。そこで前処理として、ヒストグラム平坦化、基準フ
レームとの輝度の差分を最小化するように唇位置の正規
化を行った。音声ＨＭＭの作成には、音響実験室で収録
したクリーンな音声データからメルケプストラム係数を
求め、それを特徴ベクトルとしてモデル作成を行った。
また、画像ＨＭＭは、前処理後の画像に２次元ＦＦＴを
行い、対数パワースペクトルを計算し、そして、その周
波数領域を６×６の領域分割を行い、直流成分を除いた
領域の平滑化対数パワースペクトルを特徴ベクトルとし
てモデル作成を行った。本実験では、音声及び画像の合
成ＨＭＭを、各ストリームの重み係数を１：１と等しい
重み係数で学習を行っている。This experiment used a database, recorded in an acoustic laboratory, in which a specific speaker (one female speaker) utters the 5240 words of an utterance list owned by the patent applicant. Because the ratio of the audio frame shift to the image frame shift is 1:4, each image frame is repeated four times so that the image frame rate matches the audio frame rate. The recorded image data show differences in lighting conditions, face inclination, and the like from one uttered word to another. Therefore, as preprocessing, histogram equalization was applied and the lip position was normalized so as to minimize the luminance difference from a reference frame. To create the audio HMM, mel-cepstrum coefficients were obtained from clean speech data recorded in the acoustic laboratory and used as the feature vectors for model training.
For the image HMM, a two-dimensional FFT was applied to the preprocessed image, the log power spectrum was computed, the frequency domain was divided into 6 × 6 regions, and the smoothed log power spectrum of the regions excluding the DC component was used as the feature vector for model training. In this experiment, the composite HMM of audio and image is trained with equal stream weights of 1:1.
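The frame-rate alignment described above (audio frame shift 8 msec, image frame shift 33 msec, treated as a 1:4 ratio) can be sketched as follows. This is a minimal illustration only: the function names and the representation of frames as list items are assumptions, not part of the patent.

```python
def align_image_to_audio(image_frames, repeat=4):
    """Repeat each image frame `repeat` times so that the image stream
    has (approximately) the same frame rate as the audio stream."""
    aligned = []
    for frame in image_frames:
        aligned.extend([frame] * repeat)
    return aligned


def pair_streams(audio_frames, image_frames, repeat=4):
    """Pair audio and (repeated) image frames frame by frame.
    The shorter stream decides the length; this truncation rule is an
    assumption made for the sketch."""
    aligned = align_image_to_audio(image_frames, repeat)
    n = min(len(audio_frames), len(aligned))
    return [(audio_frames[t], aligned[t]) for t in range(n)]


# Example: 8 audio frames and 2 image frames give 8 synchronized pairs.
pairs = pair_streams(list(range(8)), ["v0", "v1"])
```

Note also that a 6 × 6 division yields 36 regions, so dropping the DC region leaves the 35 dimensions listed in Table 1.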

【００３３】図５乃至図７はそれぞれ、従来例である初
期統合法及び結果統合法、並びに、第１の実施形態に係
る合成統合法を用いた音声認識装置の実験結果であっ
て、ＳＮＲが１０ｄＢ、２０ｄｂのとき、及び雑音のな
いクリーンな音声のときの音声ストリームの重み係数λ
ａに対する単語認識率を示すグラフである。ここで、音
声データのストリームの重み係数λａと画像データのス
トリームの重み係数λｖは次式を満足するように変化さ
せている。FIGS. 5 to 7 are graphs showing experimental results for speech recognition apparatuses using the initial integration method and the result integration method, which are conventional examples, and the synthesis integration method according to the first embodiment. They show the word recognition rate against the audio stream weight coefficient λa when the SNR is 10 dB, when the SNR is 20 dB, and for clean speech without noise, respectively. Here, the weight coefficient λa of the audio data stream and the weight coefficient λv of the image data stream are varied so as to satisfy the following equation.

【００３４】[0034]

【数３】λａ＋λｖ＝１ [Equation 3] λa + λv = 1
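As a reading aid, the sweep over λa under the constraint of Equation 3 might look like the sketch below. It assumes the common multi-stream form in which the combined log output probability is the weighted sum of the per-stream log probabilities; that form, the function names, and the numeric values are assumptions for illustration, not the patent's confirmed expression.

```python
def combined_log_prob(log_b_audio, log_b_image, lambda_a):
    """Weighted sum of stream log probabilities with lambda_a + lambda_v = 1."""
    if not 0.0 <= lambda_a <= 1.0:
        raise ValueError("lambda_a must lie in [0, 1]")
    lambda_v = 1.0 - lambda_a  # Equation 3
    return lambda_a * log_b_audio + lambda_v * log_b_image


def sweep_audio_weight(log_b_audio, log_b_image, steps=10):
    """Evaluate the combined score for lambda_a = 0, 1/steps, ..., 1."""
    return [
        (i / steps, combined_log_prob(log_b_audio, log_b_image, i / steps))
        for i in range(steps + 1)
    ]


# Hypothetical per-frame log probabilities for one state.
scores = sweep_audio_weight(-2.0, -4.0)
```

With λa = 1 the score reduces to the audio-only case and with λa = 0 to the image-only case, which is why the audio-only and image-only results can be read as the end points of the weight axis.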

【００３５】すなわち、図５は、ＳＮＲが１０ｄＢのと
きに、横軸が音声ストリームの重み係数λａのときの単
語認識率を示すグラフであり、図６は、ＳＮＲが２０ｄ
Ｂのときに、横軸が音声ストリームの重み係数λａのと
きの単語認識率を示すグラフであり、図７は、雑音のな
いクリーンな音声のときの、横軸が音声ストリームの重
み係数λａのときの単語認識率を示すグラフである。図
５乃至図７において、以下の場合のデータを示してい
る。（ａ）第１の実施形態に係る合成統合法（学習有り）、（ｂ）比較例の合成統合法（学習無し）、（ｃ）従来例の結果統合法、（ｄ）従来例の初期統合法、（ｅ）音声データのみのとき、（ｆ）画像データのみのとき。図５から明らかなように、ＳＮＲが１０ｄＢのときに、
音声ストリームの重み係数λａを変化しても、本発明の
第１の実施形態に係る合成統合法の音声認識率は、他の
統合法のそれより高い。また、図６から明らかなよう
に、ＳＮＲが２０ｄＢのときに、音声ストリームの重み
係数λａを変化しても、本発明の第１の実施形態に係る
合成統合法の音声認識率は、他の統合法のそれより高
い。さらに、図７から明らかなように、雑音のないクリ
ーンな音声のときに、音声ストリームの重み係数λａを
変化しても、本発明の第１の実施形態に係る合成統合法
の音声認識率は、他の統合法のそれより高い。従って、
ＳＮＲを１０ｄＢから無限大まで変化させても、本発明
の第１の実施形態に係る合成統合法は、他の統合法よ
り、認識率が高いことが分かる。本手法の合成統合法
は、異なったモダリティの統合に効果的であるといえ
る。That is, FIG. 5 is a graph showing the word recognition rate against the audio stream weight coefficient λa on the horizontal axis when the SNR is 10 dB, FIG. 6 is the corresponding graph when the SNR is 20 dB, and FIG. 7 is the corresponding graph for clean speech without noise. FIGS. 5 to 7 show data for the following cases: (a) the synthesis integration method according to the first embodiment (with training), (b) the synthesis integration method of a comparative example (without training), (c) the result integration method of the conventional example, (d) the initial integration method of the conventional example, (e) audio data only, and (f) image data only. As is clear from FIG. 5, when the SNR is 10 dB, the speech recognition rate of the synthesis integration method according to the first embodiment of the present invention is higher than that of the other integration methods even when the audio stream weight coefficient λa is varied. As is clear from FIG. 6, the same holds when the SNR is 20 dB, and as is clear from FIG. 7, it also holds for clean speech without noise. Thus, even when the SNR is varied from 10 dB to infinity, the synthesis integration method according to the first embodiment of the present invention achieves a higher recognition rate than the other integration methods. The synthesis integration method of the present technique can therefore be said to be effective for integrating different modalities.

【００３６】＜第２の実施形態＞図８は、本発明に係る
第２の実施形態である環境適応化装置３００の構成を示
すブロック図である。第２の実施形態に係る環境適応化
装置３００においては、図１の学習用ＡＶデータメモリ
３１と同様の形式で、発話音声信号と、発話時の話者の
唇の画像信号とを含む複数の単語のＡＶ信号データが音
素ラベル付きで環境適応化用ＡＶ単語データメモリ５１
に格納され、図１の入力ＡＶデータメモリ４１内のＡＶ
信号データに代えて、この環境適応化用ＡＶ信号に基づ
いて、図１の音声認識装置２００を用いて音声認識す
る。環境適応化処理部５０は、図９のフローチャートに
示されたストリームの重み係数の環境適応化処理を実行
し、具体的には、図１のＨＭＭ学習部１７により生成さ
れた、学習された音声及び画像の合成ＨＭＭにおける各
音素の重み係数を、例えば図１０に示す二分木の木構造
のクラスタリング木などの所定のクラスタリングの基準
を用いて複数のクラスにクラスタリングし、各クラスに
属する各音素の重み係数を、音声認識装置２００内の音
声認識部２６で演算される対数尤度に基づいて、誤認識
が少なくなるように（具体的には、数６で示す誤分類測
度ｄ_xが小さくなるように）再学習することにより合成
ＨＭＭを環境適応化することを特徴としている。ここ
で、好ましくは、各クラスの環境適応化用信号データの
数が所定のしきい値未満となるように環境適応化処理の
再学習を繰り返す。そして、再学習された合成ＨＭＭを
用いて、図１の音声認識装置２００は音声認識処理を行
う。<Second Embodiment> FIG. 8 is a block diagram showing the configuration of an environment adapting apparatus 300 according to a second embodiment of the present invention. In the environment adapting apparatus 300 according to the second embodiment, AV signal data of a plurality of words, each including an uttered speech signal and an image signal of the speaker's lips during the utterance, are stored with phoneme labels in an environment adaptation AV word data memory 51 in the same format as the learning AV data memory 31 of FIG. 1, and the speech recognition apparatus 200 of FIG. 1 performs speech recognition based on these environment adaptation AV signals instead of the AV signal data in the input AV data memory 41 of FIG. 1. The environment adaptation processing unit 50 executes the environment adaptation processing of the stream weight coefficients shown in the flowchart of FIG. 9. Specifically, it clusters the weight coefficients of the phonemes in the trained composite HMM of audio and image generated by the HMM learning unit 17 of FIG. 1 into a plurality of classes using a predetermined clustering criterion, such as the binary clustering tree shown in FIG. 10, and re-trains the weight coefficients of the phonemes belonging to each class, based on the log likelihood calculated by the speech recognition unit 26 in the speech recognition apparatus 200, so that misrecognition is reduced (specifically, so that the misclassification measure d_x shown in Equation 6 becomes smaller), thereby adapting the composite HMM to the environment. Here, the re-training of the environment adaptation processing is preferably repeated until the number of environment adaptation signal data items in each class falls below a predetermined threshold value. The speech recognition apparatus 200 of FIG. 1 then performs speech recognition processing using the re-trained composite HMM.

【００３７】まず、本実施形態に係る再学習法である環
境適応化法について以下に説明する。First, an environment adaptation method which is a relearning method according to the present embodiment will be described below.

【００３８】従来技術の項で述べたように、音声のＨＭ
Ｍと、画像のＨＭＭとを統合するときに、周辺の環境に
応じてどちらの情報を重視するかを決定することが重要
な問題となる。この問題は、具体的には、第１の実施形
態に係る合成統合法により、統合を行ったＨＭＭの認識
率がピークとなる音声と画像のストリーム重みを、ユー
ザが発話した適応データから、環境に応じて適応化する
方法が考えられる。しかしながら、音声のＳＮＲを推定
するのは難しいので、ストリーム重みを推定するために
は、他の基準が必要となる。通常、音声と画像の尤度の
ダイナミックレンジが大きく違うために、尤度最大化基
準（ＭＬ基準）による学習では、良い性能が得られない
ことが知られている。従って、本実施形態では、最小分
類誤り基準（ＭＣＥ基準）による学習を用いて、具体的
には、公知のＧＰＤ（Generalized Probabilistic Desc
ent method;一般化された確率的降下法）アルゴリズム
（例えば、従来技術文献２「Gerasimos Potamianos et
al.,”Discriminative training of HMM stream expone
nts for Audio-Visual speech recognition”,Proceedi
ng of ICASSP-98,Vol.6,pp.3733-3736,May 1998」、及
び従来技術文献３「Chiyomi Miyajima et al.,”Audio-
Visual speech recognition using MCE-basedHMMs and
model-dependent stream weights”,Proceeding of ICS
LP2000,Vol.2,pp.1023-1026,2000」など参照。）を用い
て、合成ＨＭＭを再学習することにより環境適応化す
る。ＧＰＤアルゴリズムを用いる理由は以下の通りであ
る。音声と画像のストリームの重み係数は、音素毎に違
い、従って、適応データ数に応じて、ストリームの重み
係数のクラスタリングの単位は音素クラスごとに分割し
たほうが良いと考えられる。ＭＣＥ基準の方法の１つで
ある直接探索法に対して、ＧＰＤアルゴリズムは、多変
数にも適用可能で、応用性が高いアルゴリズムである。As described in the related art section, when integrating the audio HMM and the image HMM, an important problem is determining which source of information to emphasize according to the surrounding environment. Specifically, one conceivable approach is to adapt, according to the environment and from adaptation data uttered by the user, the audio and image stream weights at which the recognition rate of the HMM integrated by the synthesis integration method of the first embodiment peaks. However, because it is difficult to estimate the SNR of the speech, another criterion is needed to estimate the stream weights. It is known that training based on the maximum likelihood criterion (ML criterion) usually does not give good performance, because the dynamic ranges of the audio and image likelihoods differ greatly. Therefore, in the present embodiment, the composite HMM is adapted to the environment by re-training it under the minimum classification error criterion (MCE criterion), specifically with the known GPD (Generalized Probabilistic Descent) algorithm (see, for example, Prior Art Document 2: Gerasimos Potamianos et al., "Discriminative training of HMM stream exponents for Audio-Visual speech recognition", Proceedings of ICASSP-98, Vol. 6, pp. 3733-3736, May 1998; and Prior Art Document 3: Chiyomi Miyajima et al., "Audio-Visual speech recognition using MCE-based HMMs and model-dependent stream weights", Proceedings of ICSLP2000, Vol. 2, pp. 1023-1026, 2000). The GPD algorithm is used for the following reasons. The audio and image stream weight coefficients differ from phoneme to phoneme, so it is considered preferable to divide the clustering unit of the stream weight coefficients by phoneme class according to the amount of adaptation data. Compared with the direct search method, which is another MCE-based method, the GPD algorithm can also be applied to multiple variables and is a highly versatile algorithm.

【００３９】次いで、ストリームの重み係数の環境適応
について説明する。Next, the adaptation of the stream weight coefficient to the environment will be described.

【００４０】ＧＰＤアルゴリズムによるストリームの重
み係数推定法では、正しい分類と誤った分類との距離の
情報を表す誤分類測程度を含む、滑らかな損失関数を最
小化するように、ＨＭＭのストリームの重み係数を推定
する。ここでは、ＧＰＤアルゴリズムに基づく音素毎の
ストリームの重み係数を推定する処理について以下に説
明する。In the stream weight coefficient estimation method based on the GPD algorithm, the stream weight coefficients of the HMM are estimated so as to minimize a smooth loss function that includes a misclassification measure, which expresses the distance between the correct classification and incorrect classifications. The process of estimating the stream weight coefficients for each phoneme based on the GPD algorithm is described below.

【００４１】まず、ある単語の発話音声データｘの特徴
ベクトル系列ＯをFirst, let the feature vector sequence O of the uttered speech data x of a certain word be

【数４】Ｏ＝［ｏ_x（１），…，ｏ_x（ｔ），…，ｏ_x（Ｔ_x）］とする。ここで、ｔは時刻フレーム、ｏ_x（ｔ）はＳ個
のストリーム（モダリティ）をもったベクトルである。
次に、ＨＭＭの状態のある集合Ｃに対するストリームの
重み係数セットを[Equation 4] O = [o_x(1), ..., o_x(t), ..., o_x(T_x)], where t is the time frame and o_x(t) is a vector with S streams (modalities).
Next, let the set of stream weight coefficients for a set C of HMM states be

【数５】λ_c＝［λ_c1，…，λ_cs，…，λ_cS］とし、全体のストリームの重み係数セットを[Equation 5] λ_c = [λ_c1, ..., λ_cs, ..., λ_cS], and let the set of all stream weight coefficients be

【数６】Λ＝［λ₁，…，λ_c，…，λ_C］とする。ただし、Ｃは、音素毎のストリームの重み係数
のクラス数である。そのとき、ある単語の発話音声デー
タｘを、それに対応する単語ＨＭＭ（図１の単語ＨＭＭ
メモリ４２に格納されている）を用いて、例えば、ビタ
ビアルゴリズムで音声認識した時の、ＨＭＭの状態系列
を[Equation 6] Λ = [λ_1, ..., λ_c, ..., λ_C], where C is the number of classes of stream weight coefficients over the phonemes. Then, when the uttered speech data x of a certain word is recognized, for example by the Viterbi algorithm, using the corresponding word HMM (stored in the word HMM memory 42 of FIG. 1), let the resulting HMM state sequence be

【数７】Ｑ_x＝｛ｑ_x（ｔ）；ｔ＝１，…，Ｔ_x｝とすると、そのときの対数尤度Ｌ_x ^Rは、次式で表され
る。[Equation 7] Q_x = {q_x(t); t = 1, ..., T_x}. The log likelihood L_x^R at that time is then given by the following equation.

【００４２】[0042]

【数８】 (Equation 8)

【数９】 (Equation 9)

【００４３】このように、ストリームの重み係数のセッ
トΛの関数として表すことができる。ただし、数９にお
いて、ｂ_js［ｏ_x,s（ｔ）］は、状態ｊにおいて、スト
リームｓの特徴ベクトルｏ_x,s（ｔ）を観測する確率ｑ_x
（ｔ）が、もしｑ_x（ｔ）＝ｊであるとき、δ^j _qx(t)＝
１であり、もしｑ_x（ｔ）≠ｊであるとき、δ^j _qx(t)＝
０である。同様に、単語の発話音声データｘに対して、
誤った単語ＨＭＭの中で、ｎ番目の候補により認識した
場合の対数尤度Ｌ_x ^Fnは、次式で表される。In this way, the log likelihood can be expressed as a function of the set Λ of stream weight coefficients. In Equation 9, b_js[o_x,s(t)] is the probability of observing the feature vector o_x,s(t) of stream s in state j, and δ^j_qx(t) = 1 if q_x(t) = j and δ^j_qx(t) = 0 if q_x(t) ≠ j. Similarly, for the uttered speech data x of the word, the log likelihood L_x^Fn obtained when it is recognized with the n-th candidate among the incorrect word HMMs is given by the following equation.

【００４４】[0044]

【数１０】 (Equation 10)
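The images for Equations 8 to 10 are not reproduced in this text. As a reading aid only, one form consistent with the definitions given above is shown below. It assumes the standard multi-stream HMM expression in which each stream log probability is scaled by its weight, omits the state transition terms because they do not depend on Λ, and writes \mathcal{C}_c for the set of HMM states in class c; all three choices are assumptions rather than the patent's confirmed expressions.

```latex
L_x^{R}(\Lambda) \approx \sum_{t=1}^{T_x} \sum_{c=1}^{C} \sum_{j \in \mathcal{C}_c}
    \delta^{j}_{q_x(t)} \sum_{s=1}^{S} \lambda_{cs} \log b_{js}\!\left[o_{x,s}(t)\right],
\qquad
L_x^{F_n}(\Lambda) \approx \sum_{t=1}^{T_x} \sum_{c=1}^{C} \sum_{j \in \mathcal{C}_c}
    \delta^{j}_{q_x^{F_n}(t)} \sum_{s=1}^{S} \lambda_{cs} \log b_{js}\!\left[o_{x,s}(t)\right]
```

Here q_x^{F_n}(t) denotes the state sequence obtained with the n-th incorrect word HMM, a symbol introduced only for this sketch.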

【００４５】次に、誤分類測度ｄ_xを次式のように定義
する。Next, the misclassification measure d_x is defined by the following equation.

【００４６】[0046]

【数１１】 [Equation 11]

【００４７】この誤分類測度ｄ_xは、小さいほど分類誤
り、つまり誤認識が少なくなることを表現する。しか
し、上記数９及び数１０は、最尤の状態系列での尤度を
計算するため、滑らかでない関数になる場合がある。そ
こで、誤分類測度ｄ_xを用いて次式のようにシグモイド
関数の形に変換し、滑らかな損失関数を定義する。The smaller this misclassification measure d_x, the fewer the classification errors, that is, the fewer the misrecognitions. However, because Equations 9 and 10 compute the likelihood along the maximum likelihood state sequence, the resulting function may not be smooth. Therefore, the misclassification measure d_x is transformed by a sigmoid function as in the following equation to define a smooth loss function.

【００４８】[0048]

【数１２】 (Equation 12)
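The images for Equations 11 and 12 are not reproduced in this text. The sketch below assumes the form commonly used in the MCE/GPD literature: d_x = -L_x^R + (1/η) log[(1/N) Σ_n exp(η L_x^Fn)], followed by the sigmoid l(d_x) = 1 / (1 + exp(-α d_x)). The smoothing constant `eta`, the function names, and the example likelihood values are illustrative assumptions; α = 0.1 and N = 1 are the values reported for the experiments in this document.

```python
import math


def misclassification_measure(ll_correct, ll_wrong, eta=1.0):
    """Assumed MCE form: negative when the correct word scores higher
    than the (soft maximum of the) N incorrect candidates."""
    n = len(ll_wrong)
    m = max(ll_wrong)  # shift for a numerically stable log-sum-exp
    soft_max = m + math.log(
        sum(math.exp(eta * (ll - m)) for ll in ll_wrong) / n
    ) / eta
    return -ll_correct + soft_max


def sigmoid_loss(d_x, alpha=0.1):
    """Smooth 0..1 loss; close to 0 for confident correct recognition."""
    return 1.0 / (1.0 + math.exp(-alpha * d_x))


# Hypothetical log likelihoods: correct word -100, one competitor -110.
d = misclassification_measure(-100.0, [-110.0])
loss = sigmoid_loss(d)
```

Because the sigmoid is differentiable everywhere, its gradient with respect to the stream weights can be used in a descent update even though the underlying recognition decision is discrete.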

【００４９】また、勾配の方向を安定させるために、全
体の適応データに対して次式の損失関数をおく。In order to stabilize the direction of the gradient, a loss function of the following equation is set for the entire adaptive data.

【００５０】[0050]

【数１３】 (Equation 13)

【００５１】ただし、Ｘは適応データの総数である。全
体のストリームの重み係数Λは、ＧＰＤアルゴリズムを
用いて次式により更新される。Here, X is the total number of adaptation data items. The set Λ of all stream weight coefficients is updated by the following equation using the GPD algorithm.

【００５２】[0052]

【数１４】Λ_k+1＝Λ_k−ε_kＥ_k∇Ｌ（Λ）｜_Λ＝Λ_k，ｋ＝１，２，…のとき [Equation 14] Λ_{k+1} = Λ_k − ε_k E_k ∇L(Λ)|_{Λ=Λ_k}, for k = 1, 2, ...

【００５３】ここで、Ｅ_kは単位行列である。Here, E_k is the identity matrix.

【００５４】[0054]

【数１５】 (Equation 15)

【数１６】 (Equation 16)

【００５５】上記の式を満たすと、このアルゴリズムは
収束することが証明されている（例えば、従来技術文献
４「W.Chou et al.”A minimum error rate pattern re
cognition approach to speech recognition”,Journal
of Pattern Recognition and artificial intelligenc
e, Column VIII, pp.5-31, 1994」など参照。）。It has been proved that the algorithm converges when the above conditions are satisfied (see, for example, Prior Art Document 4: W. Chou et al., "A minimum error rate pattern recognition approach to speech recognition", Journal of Pattern Recognition and Artificial Intelligence, Vol. VIII, pp. 5-31, 1994).
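The images for Equations 15 and 16 are not reproduced in this text. The step-size conditions usually stated for stochastic descent methods of this kind are given below; that Equations 15 and 16 have exactly this form is an assumption.

```latex
\sum_{k=1}^{\infty} \varepsilon_k = \infty,
\qquad
\sum_{k=1}^{\infty} \varepsilon_k^{2} < \infty
```

A step size of the form ε_k = a/k with a > 0 satisfies both conditions, because the harmonic series diverges while the series of 1/k² converges.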

【００５６】さらに、ストリームの重み係数の更新式に
ついて説明する。ここでは、実際に、上記数１４を計算
するための、式の展開を述べる。ただし、簡潔に記述す
るために（Λ）を省略する。まず、ＧＰＤアルゴリズム
の処理において、各々のストリームの重み係数のクラス
ｃにNext, the update formula for the stream weight coefficients is described. Here, the expansion of the expressions needed to actually compute Equation 14 is given; for brevity, (Λ) is omitted. First, in the GPD algorithm processing, in order to impose on each stream weight coefficient class c the constraint

【数１７】の制限を加えるために、[Equation 17],

the transformation satisfying

【数１８】を満たす変換[Equation 18], namely

【数１９】λｈ_cs＝ｌｏｇλ_cs を行う。そして、上記数１４によりストリームの重み係
数を更新するために、上記数１２及び数１３から、次式
を計算する。[Equation 19] λh_cs = log λ_cs, is applied. Then, in order to update the stream weight coefficients by Equation 14, the following expression is calculated from Equations 12 and 13.

【００５７】[0057]

【数２０】 (Equation 20)

【００５８】ここで、Here,

【数２１】 (Equation 21)

【数２２】である。(Equation 22) It is.

【００５９】ここで、Ｂ＝Ｒ又はＦｎであり、Ｃはスト
リームの重み係数の値をクラスタリングしたときのＨＭ
Ｍの状態の集合である。上記数１２、上記数２０乃至数
２２を計算し、上記数１４によりストリームの重み係数
を更新する。最後に、各ステップの更新後に上記数１８
により変換する。Here, B = R or Fn, and C is the set of HMM states obtained when the values of the stream weight coefficients are clustered. Equation 12 and Equations 20 to 22 are calculated, and the stream weight coefficients are updated by Equation 14. Finally, after the update at each step, the coefficients are transformed by Equation 18.
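The update cycle described above (move to the log domain by Equation 19, take a descent step, then transform back) can be sketched as follows. Since the images for Equations 17 and 18 are not reproduced in this text, the sketch assumes that Equation 17 is a sum-to-one constraint over the streams of a class and that Equation 18 is the corresponding exponential normalization; the function names and the gradient values are hypothetical.

```python
import math


def to_log_domain(weights):
    """Equation 19: lambda_h = log(lambda)."""
    return [math.log(w) for w in weights]


def from_log_domain(log_weights):
    """Assumed form of Equation 18: exponentiate and renormalize so the
    weights of one class are positive and sum to 1."""
    m = max(log_weights)  # shift for numerical stability
    exps = [math.exp(v - m) for v in log_weights]
    total = sum(exps)
    return [e / total for e in exps]


def gpd_step(weights, grad_wrt_log_weights, epsilon_k):
    """One descent step in the log domain (Equation 14 with E_k = identity),
    followed by the transformation back to constrained weights."""
    h = to_log_domain(weights)
    h = [hi - epsilon_k * g for hi, g in zip(h, grad_wrt_log_weights)]
    return from_log_domain(h)


# Hypothetical gradient that favours the second (image) stream.
new_weights = gpd_step([0.5, 0.5], [0.1, -0.1], epsilon_k=1.0)
```

Updating in the log domain keeps every weight positive without explicit clipping, and the renormalization restores the sum-to-one constraint after each step.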

【００６０】さらに、木構造を用いたストリームの重み
係数のクラスタリングの単位の細分化について説明す
る。本実施形態では、音素ＨＭＭのストリームの重み係
数を基本単位とし、適応データ数に応じ、ストリームの
重み係数のクラスタリングの単位をトップダウンに分割
していく方法を検討する。Further, the subdivision of the unit of clustering of the weighting factor of the stream using the tree structure will be described. In the present embodiment, a method is considered in which the weighting factor of the stream of the phoneme HMM is used as a basic unit, and the clustering unit of the weighting factor of the stream is divided top-down according to the number of adaptive data.

【００６１】まず、ＨＭＭのクラスタリングを行う基準
となる木構造を作る。木構造を作成する手順として、複
数の質問を用意し、それらの質問に対してＨＭＭのクラ
スタリングを行う。今回の実験で用いた質問（各ノード
に割り当てられる）の一例は、以下の３項目である。（１）ＨＭＭが母音か子音のどちらであるか？（２）有声音か無声音のどちらであるか？（３）調音位置が唇周辺であるかどうか？以上のようにクラスタリングを行うことで、音声の先見
知識をストリームの重み係数推定に組み込むことができ
る。First, a tree structure that serves as the criterion for clustering the HMMs is created. To create the tree structure, a plurality of questions are prepared, and the HMMs are clustered according to those questions. Examples of the questions used in this experiment (each assigned to a node) are the following three items:
(1) Is the HMM a vowel or a consonant?
(2) Is it voiced or unvoiced?
(3) Is the place of articulation around the lips?
Clustering in this way allows prior knowledge about speech to be incorporated into the estimation of the stream weight coefficients.
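Question-based splitting of phoneme HMMs can be illustrated with the sketch below. The phoneme inventory and its attribute values are a small hypothetical table chosen for the example; the patent uses 55 phoneme models whose attributes are not listed in this text.

```python
# Hypothetical attribute table for four phonemes (illustration only).
PHONEMES = {
    "a": {"vowel": True, "voiced": True, "labial": False},
    "m": {"vowel": False, "voiced": True, "labial": True},
    "p": {"vowel": False, "voiced": False, "labial": True},
    "s": {"vowel": False, "voiced": False, "labial": False},
}


def split_by_question(phonemes, attribute):
    """Split a collection of phonemes into (yes, no) clusters according
    to one yes/no question such as 'voiced'."""
    yes = sorted(p for p in phonemes if PHONEMES[p][attribute])
    no = sorted(p for p in phonemes if not PHONEMES[p][attribute])
    return yes, no


# Root question: voiced or unvoiced?
voiced, unvoiced = split_by_question(PHONEMES, "voiced")
# A second-level question applied to the voiced cluster.
voiced_labial, voiced_other = split_by_question(voiced, "labial")
```

Each resulting cluster shares one set of stream weight coefficients, so fewer questions mean fewer parameters to estimate from the adaptation data.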

【００６２】そして、予め用意された複数の質問から、
１つの質問を選択し、ＨＭＭをクラスタリングする。質
問には、予備実験で最も認識性能の良かった”有声音か
無声音であるか”の質問を選択した。このときの、ＨＭ
Ｍをクラスタリングするときの基準となる二分木構造の
クラスタリング木の一例を図１０に示す。これ例では、
ルートノード１０１において、有声音であるか否かが判
断され、ＹＥＳのときはクラスタノード１０２に進んで
クラスタリングされる一方、ＮＯのときはクラスタノー
ド１０３に進んでクラスタリングされる。そして、より
下の階層に向かってクラスタリングの処理が繰り返され
る。また、環境適応化時に、損失関数（数１２）を最小
化する質問を選択する方法が考えられるが、損失関数は
認識性能に必ずしも一致せず、適応化時の計算量の増加
を招いてしまう。従って、このように、予め作成した木
構造を用いて環境適応化時に、音素毎のストリームの重
み係数のクラスタリング単位を分割していくことにし
た。Then, one question is selected from the plurality of questions prepared in advance, and the HMMs are clustered. The question selected was "voiced or unvoiced?", which gave the best recognition performance in a preliminary experiment. FIG. 10 shows an example of the binary clustering tree used as the criterion for clustering the HMMs in this case. In this example, the root node 101 determines whether the sound is voiced; if YES, the process proceeds to cluster node 102 for clustering, and if NO, it proceeds to cluster node 103 for clustering. The clustering process is then repeated toward the lower layers. Alternatively, one could select, at environment adaptation time, the question that minimizes the loss function (Equation 12); however, the loss function does not necessarily agree with recognition performance, and this would increase the amount of computation during adaptation. Therefore, the clustering unit of the stream weight coefficients for each phoneme is divided at environment adaptation time using the tree structure created in advance, as described above.

【００６３】次いで、図８の環境適応化装置３００の構
成及び動作について以下に説明する。図８において、環
境適応化用ＡＶ単語データメモリ５１は、図１の学習用
ＡＶデータメモリ３１と同様の形式で、発話音声信号
と、発話時の話者の唇の画像信号とを含む複数の単語の
ＡＶ信号データが音素ラベル付きで環境適応化用ＡＶ単
語データメモリ５１に格納される。図１の音声認識装置
２００は、図１の入力ＡＶデータメモリ４１内のＡＶ信
号データに代えて、この環境適応化用ＡＶ信号に基づい
て音声認識して対数尤度を演算して環境適応化処理部５
０に出力する。そして、環境適応化処理部５０は、図９
のストリームの重み係数の環境適応化処理を実行する。
具体的には、図１のＨＭＭ学習部１７により生成され
た、学習された音声及び画像の合成ＨＭＭにおける各音
素の重み係数を、例えば図１０に示す二分木の木構造の
クラスタリング木などの所定のクラスタリングの基準を
用いて複数のクラスにクラスタリングし、各クラスに属
する各音素の重み係数を、音声認識装置２００内の音声
認識部２６で演算される対数尤度に基づいて、誤認識が
少なくなるように（具体的には、数６で示す誤分類測度
ｄ_xが小さくなるように）再学習することにより合成Ｈ
ＭＭを環境適応化する。ここで、各クラスの環境適応化
用信号データの数が所定のしきい値未満となるように環
境適応化処理の再学習を繰り返す。そして、再学習され
た合成ＨＭＭを用いて、図１の音声認識装置２００は音
声認識処理を行う。Next, the configuration and operation of the environment adapting apparatus 300 of FIG. 8 are described. In FIG. 8, the environment adaptation AV word data memory 51 stores, with phoneme labels and in the same format as the learning AV data memory 31 of FIG. 1, AV signal data of a plurality of words, each including an uttered speech signal and an image signal of the speaker's lips during the utterance. The speech recognition apparatus 200 of FIG. 1 performs speech recognition based on these environment adaptation AV signals instead of the AV signal data in the input AV data memory 41 of FIG. 1, calculates the log likelihood, and outputs it to the environment adaptation processing unit 50. The environment adaptation processing unit 50 then executes the environment adaptation processing of the stream weight coefficients shown in FIG. 9.
Specifically, it clusters the weight coefficients of the phonemes in the trained composite HMM of audio and image generated by the HMM learning unit 17 of FIG. 1 into a plurality of classes using a predetermined clustering criterion, such as the binary clustering tree shown in FIG. 10, and re-trains the weight coefficients of the phonemes belonging to each class, based on the log likelihood calculated by the speech recognition unit 26 in the speech recognition apparatus 200, so that misrecognition is reduced (specifically, so that the misclassification measure d_x shown in Equation 6 becomes smaller), thereby adapting the composite HMM to the environment. Here, the re-training of the environment adaptation processing is repeated until the number of environment adaptation signal data items in each class falls below a predetermined threshold value. The speech recognition apparatus 200 of FIG. 1 then performs speech recognition processing using the re-trained composite HMM.

【００６４】図９は、図８の環境適応化処理部５０によ
って実行されるストリームの重み係数の環境適応化処理
を示すフローチャートである。FIG. 9 is a flowchart showing the environment adaptation processing of the stream weight coefficients executed by the environment adaptation processing unit 50 of FIG. 8.

【００６５】図９において、まず、ステップＳ１におい
て、音素毎の初期のストリームの重み係数を、すべての
ＨＭＭについて同一に設定する（初期化処理）。すなわ
ち、すべてのＨＭＭについて、クラスタリング木のルー
トノードにおける各音素のストリームの重み係数を１つ
のクラスとし、例えば、０．５に初期化する。次いで、
ステップＳ２において、クラスタリング木において次の
下の階層にある各々のノードについて、同じクラスタに
属するＨＭＭのストリームの重み係数を１つのクラスに
クラスタリングし、ステップＳ３において、各音素のス
トリームの重み係数の初期値を上の階層で推定された値
とする。そして、ステップＳ４において、各クラスの適
応データ数＜しきい値（例えば、２０）であるか否かを
判断する。ここで、ＹＥＳであるときは、クラスタリン
グが十分に行われたと判断し、ステップＳ５において、
ストリームの重み係数を上記推定された定数の推定値と
してステップＳ７に進む。一方、ステップＳ４でＮＯで
あるときは、クラスタリングが十分に行われていないと
判断し、ステップＳ６において、ストリームの重み係数
を変数更新対象の推定値とし、ステップＳ７に進む。ス
テップＳ７において、すべてのクラスについてステップ
Ｓ４の処理をしたか否かが判断され、ＮＯであるとき
は、ステップＳ４に戻りステップＳ４の処理を実行す
る。一方、ステップＳ７でＹＥＳであるときは、ステッ
プＳ８で更新対象となるストリームの重み係数があるか
否かが判断され、ＹＥＳであるときは、ステップＳ９に
おいて、ＧＰＤアルゴリズムを用いて（上述の更新式を
用いて）所定のｎ回の更新を繰り返し、ストリームの重
み係数を更新することにより、合成ＨＭＭメモリ３４に
格納された合成ＨＭＭを環境適応化した後、ステップＳ
２に戻る。一方、ステップＳ８でＮＯであるときは、当
該環境適応化処理を終了する。In FIG. 9, first, in step S1, the initial stream weight coefficient for each phoneme is set to the same value for all HMMs (initialization processing). That is, for all HMMs, the stream weight coefficients of the phonemes at the root node of the clustering tree are treated as one class and initialized to, for example, 0.5. Next, in step S2, for each node in the next lower layer of the clustering tree, the stream weight coefficients of the HMMs belonging to the same cluster are clustered into one class, and in step S3, the initial value of the stream weight coefficient of each phoneme is set to the value estimated in the layer above. In step S4, it is determined whether the number of adaptation data items of each class is less than a threshold value (for example, 20). If YES, it is determined that clustering has been carried out sufficiently, and in step S5 the stream weight coefficient is fixed as a constant at the estimated value, after which the process proceeds to step S7. If NO in step S4, it is determined that clustering has not been carried out sufficiently, and in step S6 the stream weight coefficient is treated as a variable to be updated, after which the process proceeds to step S7. In step S7, it is determined whether step S4 has been processed for all classes; if NO, the process returns to step S4. If YES in step S7, it is determined in step S8 whether any stream weight coefficient remains to be updated. If YES, in step S9 the update is repeated a predetermined number of times n using the GPD algorithm (with the update formula described above) to update the stream weight coefficients, thereby adapting the composite HMM stored in the composite HMM memory 34 to the environment, and the process returns to step S2. If NO in step S8, the environment adaptation processing ends.
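The control flow of steps S1 to S9 can be sketched as below. This is a loose illustration under stated assumptions: the `Node` class, the `gpd_update` callback (standing in for the n GPD iterations of step S9), the use of a single scalar weight per class, and the example tree with its data counts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    n_adapt: int  # number of adaptation data items falling in this class
    children: list = field(default_factory=list)


def adapt_weights(root, gpd_update, threshold=20, init=0.5):
    """Top-down refinement of per-class stream weights along a clustering tree."""
    weights = {root.name: init}            # S1: one class, common initial value
    layer = [root]
    while True:
        # S2: classes of the next lower layer of the tree
        pairs = [(child, parent) for parent in layer for child in parent.children]
        if not pairs:
            break
        to_update = []
        for child, parent in pairs:
            weights[child.name] = weights[parent.name]  # S3: inherit upper value
            if child.n_adapt < threshold:                # S4 YES -> S5: keep constant
                continue
            to_update.append(child)                      # S4 NO -> S6: update target
        if not to_update:                                # S8 NO: finished
            break
        for node in to_update:                           # S9: GPD re-estimation
            weights[node.name] = gpd_update(node, weights[node.name])
        layer = [child for child, _ in pairs]
    return weights


# Hypothetical tree: root -> voiced / unvoiced, voiced -> labial / other.
tree = Node("root", 100, [
    Node("voiced", 60, [Node("voiced_labial", 30), Node("voiced_other", 10)]),
    Node("unvoiced", 15),
])
# Stand-in for GPD: halve the weight so the effect is easy to see.
result = adapt_weights(tree, lambda node, w: w * 0.5)
```

Classes with fewer adaptation items than the threshold simply keep the value estimated one layer above, which is what makes the procedure degrade gracefully when little adaptation data is available.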

【００６６】この手法を用いる理由として、ルートノー
ドから、順に、ストリームを推定し、それを初期値とし
て用いることで、安定した解に推定されるということ
と、適応データ数に応じて、精度の良いＨＭＭの適応化
が行われるということがあげられる。ただし、計算時間
は木の深さが大きくなるにつれて増加し、膨大な量とな
ってしまう。本実施形態では、分割の有効性を確認する
ことを第１の目的にし、木の深さは最大２とし、ＧＰＤ
アルゴリズムによる処理の繰り返し回数ｎは、最大８回
に設定した。This method is used for two reasons: estimating the stream weights in order from the root node and using each estimate as the initial value for the next layer leads to a stable solution, and the HMM is adapted with an accuracy appropriate to the amount of adaptation data. However, the computation time grows as the tree becomes deeper and can become enormous. In the present embodiment, the primary aim is to confirm the effectiveness of the division, so the maximum tree depth was set to 2 and the number of iterations n of the GPD algorithm was set to at most 8.

【００６７】以上の実施形態においては、クラスタリン
グの基準としてクラスタリング木を用いたが、本発明は
これに限らず、例えば所定の基準式など別の基準を用い
てもよい。In the above embodiment, a clustering tree is used as a reference for clustering. However, the present invention is not limited to this, and another reference such as a predetermined reference formula may be used.

【００６８】[0068]

【実施例】＜第２の実施形態の実施例＞本発明者らは、
第２の実施形態に係る環境適応化装置に対する評価実験
として、２００単語×２セットの認識実験を行った。評
価として、２セットの単語認識率の平均を用いた。表２
に実験条件を示す。この実験では、音響実験室で、特定
話者（女性話者１人）が本願出願人が所有する発声リス
トの５２４０単語を発話している音声データのデータベ
ースを用いた。[Example] <Example of the Second Embodiment> As an evaluation experiment for the environment adapting apparatus according to the second embodiment, the present inventors carried out recognition experiments on 2 sets of 200 words. The average word recognition rate over the two sets was used for evaluation. Table 2 shows the experimental conditions. This experiment used a database of speech data, recorded in an acoustic laboratory, in which a specific speaker (one female speaker) utters the 5240 words of an utterance list owned by the present applicant.

【００６９】[0069]

【表２】[Table 2] Experimental conditions
―――――――――――――――――――――――――――――――――――
Audio
  Sampling frequency: 12 kHz
  Analysis window function: Hamming window
  Frame length: 32 msec
  Frame shift: 8 msec
  Parameters: 16-dimensional MFCC and 16-dimensional ΔMFCC
―――――――――――――――――――――――――――――――――――
Image
  Frame shift: 33 msec
  Preprocessing 1: conversion from RGB signal to 256-level grayscale image signal
  Preprocessing 2: histogram equalization
  Preprocessing 3: lip position normalization
  Parameters: 35-dimensional smoothed log power spectrum and 35-dimensional smoothed Δ log power spectrum
―――――――――――――――――――――――――――――――――――
Number of HMM states
  Audio 3, image 2
―――――――――――――――――――――――――――――――――――
Probability density function
  Gaussian distribution: 2 mixtures
HMM
  55 context-independent phoneme models
―――――――――――――――――――――――――――――――――――
Training data
  Synchronized audio and image data, one female speaker, 4740 words
―――――――――――――――――――――――――――――――――――
Test data
  200 words (2 sets) (open condition)
―――――――――――――――――――――――――――――――――――
Adaptation data
  Word data other than the training data and the test data
―――――――――――――――――――――――――――――――――――
Recognition dictionary during adaptation
  500-word dictionary containing the vocabulary of the test set
―――――――――――――――――――――――――――――――――――

【００７０】音声と比べて画像のフレームシフトは長い
ため、画像は、同じフレームを埋め込み、音声と画像の
フレームシフトを調整を行う。また、収録した画像デー
タは発話単語により、照明条件の違いや顔の傾きなどが
見られる。そこで、前処理として、ヒストグラム平坦
化、基準フレームとの輝度の差分を最小化するように唇
位置の正規化を行った。Because the image frame shift is longer than the audio frame shift, image frames are repeated to align the frame rates of the audio and the image. The recorded image data show differences in lighting conditions, face inclination, and the like from one uttered word to another. Therefore, as preprocessing, histogram equalization was applied and the lip position was normalized so as to minimize the luminance difference from a reference frame.

【００７１】音声ＨＭＭの作成には、音響実験室で収録
したクリーンな音声データからＭＦＣＣを求め、それを
特徴ベクトルとしてモデル作成を行った。また、画像Ｈ
ＭＭは、前処理後の画像に２次元ＦＦＴを行い、対数パ
ワースペクトルを求める。そして、その周波数領域を６
×６の領域分割を行い、直流成分を除いた領域の平滑化
対数パワースペクトルを特徴ベクトルとしてモデル作成
を行った。本実験では、音声及び画像の合成ＨＭＭは、
各ストリームの重み係数を１：１と等しい重み係数で学
習を行っている。To create the audio HMM, MFCCs were obtained from clean speech data recorded in an acoustic laboratory and used as the feature vectors for model training. For the image HMM, a two-dimensional FFT is applied to the preprocessed image to obtain the log power spectrum. The frequency domain is then divided into 6 × 6 regions, and the smoothed log power spectrum of the regions excluding the DC component was used as the feature vector for model training. In this experiment, the composite HMM of audio and image is trained with equal stream weights of 1:1.

【００７２】また、比較として音声のみ、画像のみ及び
音声と画像を初期統合した場合の認識実験も行った。音
声のみの実験は３状態のＨＭＭ、画像のみの実験は２状
態のＨＭＭ、そして初期統合法は３状態のＨＭＭを用い
た。ＨＭＭの形状は、いずれも左から右方向へのｌｅｆ
ｔ−ｔｏ−ｒｉｇｈｔ型である。For comparison, recognition experiments were also carried out with audio only, with image only, and with initial integration of audio and image. The audio-only experiment used 3-state HMMs, the image-only experiment used 2-state HMMs, and the initial integration method used 3-state HMMs. All HMMs have a left-to-right topology.

【００７３】環境適応化時の実験条件として、適応デー
タは、学習データとテストデータ以外の単語発話データ
を用いた。従って、適応データは、テストデータと発話
内容は異なっている。また、適応データ数を、１５、２
５、５０、７５及び１００単語とした場合についてスト
リームの重み係数推定を行った。ただし、適応データ数
が１５単語の場合は、発話内容により推定されるストリ
ームの重み係数が大きく異なる。そのため、適応データ
数が１５単語の場合は、適応データ３セットについての
認識率の平均とする。適応化時の辞書は、適応データの
単語とテストデータの単語を含む５００単語の辞書を用
いた。As for the experimental conditions during environment adaptation, word utterance data other than the training data and the test data were used as the adaptation data. The utterance content of the adaptation data therefore differs from that of the test data. Stream weight coefficients were estimated for adaptation data sizes of 15, 25, 50, 75, and 100 words. When the adaptation data consist of 15 words, however, the estimated stream weight coefficients vary greatly with the utterance content, so in that case the recognition rate is averaged over three sets of adaptation data. The dictionary used during adaptation was a 500-word dictionary containing the words of the adaptation data and the words of the test data.

【００７４】In Equation 11, the misclassification measure, the number of error candidates was set to N = 1, and in Equation 12 of the GPD algorithm, α was set to 0.1. In Equation 14, ε_k = 200/k was used while all stream weight coefficients were clustered together, and ε_k = 100/k was used after the clustering unit of the stream weight coefficients had been split, so that convergence is more gradual than when all stream weight coefficients are clustered together.
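The two step-size schedules can be written down directly. The sketch below is illustrative only: Equations 11, 12, and 14 are not restated in this part of the description, so the sigmoid form of the loss (the usual smoothing in GPD-style training) is an assumption, and the function names `step_size` and `sigmoid_loss` are hypothetical.

```python
import math


def step_size(k, split):
    """Step size epsilon_k at iteration k (k >= 1).

    200/k while all stream weight coefficients share one cluster,
    100/k after the clustering unit has been split.
    """
    if k < 1:
        raise ValueError("iteration index k must be >= 1")
    return (100.0 if split else 200.0) / k


def sigmoid_loss(d, alpha=0.1):
    """Assumed smoothed loss of a misclassification measure d, with slope alpha."""
    return 1.0 / (1.0 + math.exp(-alpha * d))
```

At every iteration the step after splitting is half the step before splitting, which is what makes the split-stage updates converge more gradually.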

【００７５】Next, the experimental results for environment adaptation are described.

【００７６】First, the recognition rate of the synthesis integration method is compared with those of the other integration methods. FIGS. 11, 12, and 13 show the recognition results of the initial integration method and the synthesis integration method when the speech stream weight coefficient is varied while the speech and image stream weight coefficients satisfy Equation 17 above. The speech-only and image-only recognition rates are also shown. FIG. 11 shows the results when white Gaussian noise is added to the speech so that the SNR is 10 dB, FIG. 12 shows the results at an SNR of 20 dB, and FIG. 13 shows the results when no noise is added to the recorded data. Each of FIGS. 11 to 13 also indicates the stream weight coefficient value estimated by the GPD algorithm from 50 words of adaptation data. Note that the estimated stream weight coefficient is for the synthesis integration method with re-learning.
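The noise conditions and the weight sweep used for FIGS. 11 to 13 can be sketched as follows. This is an illustration under stated assumptions: the helper names are hypothetical, and Equation 17 (not restated here) is assumed to constrain the speech and image stream weight coefficients to sum to 1.

```python
import numpy as np


def add_white_noise(signal, snr_db, rng=None):
    """Add white Gaussian noise so that the result has the requested SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def weight_grid(n_points=11):
    """Pairs (speech weight, image weight) that sum to 1 (assumed constraint)."""
    return [(float(a), float(1.0 - a)) for a in np.linspace(0.0, 1.0, n_points)]
```

Running the recognizer once per pair returned by `weight_grid` would trace out a curve like those in the figures, with the speech-only and image-only systems as reference levels.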

【００７７】FIGS. 11 to 13 show that the bimodal speech recognition system tends to have a recognition-rate peak at a particular value of the stream weight coefficient, and that estimating this peak yields higher recognition performance than a single-modality recognition system. They also show that the GPD algorithm can estimate a stream weight coefficient close to the recognition-rate peak. Furthermore, the synthesis integration method that re-learns the composite HMM of speech and image (with re-learning) achieves higher recognition performance than both the initial integration method and the synthesis integration method without re-learning. The likely reason is that the initial integration method assumes that speech and image are synchronized, and the synthesis integration method without re-learning has not learned the synchrony, whereas the synthesis integration method with re-learning has learned the synchronization relationship between speech and image. In a preliminary experiment, when the speech and image HMMs were not composed and an HMM with simply more states and a modified topology was trained on the speech and image data, parameter estimation did not go well, and performance did not exceed that of the model trained from the composite HMM. This suggests that composing the speech and image HMMs provides a good initial model. Moreover, even when training is not possible, the composite model can be used as is, as the initial model.

【００７８】Next, the experimental results obtained when the speech and image stream weight coefficients are estimated by the GPD algorithm without splitting the stream weight coefficients are considered. Table 3 shows the recognition rates after environment adaptation without splitting, for clean speech and for speech with white Gaussian noise added so that the SNR is 20 dB and 10 dB. The adaptation data were selected at random.

【００７９】[0079]

[Table 3] Word recognition rates when the speech and image stream weight coefficients are adapted to the environment
-------------------------------------------------------------------------
Adaptation data (phonemes)   Clean (no noise)   SNR = 20 dB   SNR = 10 dB
-------------------------------------------------------------------------
15 words (108)               96.86%             77.15%        56.35%
25 words (193)               97.28%             89.36%        69.06%
50 words (366)               97.28%             87.38%        68.57%
75 words (521)               97.03%             83.42%        65.60%
100 words (697)              97.03%             87.38%        68.81%
-------------------------------------------------------------------------

【００８０】As is clear from Table 3, the recognition rate is low when the adaptation data consist of 15 words. This is because, when the speech and image stream weight coefficients are estimated from a small amount of adaptation data, the estimates deviate from the values that are optimal for the test set. To examine how much the recognition rate depends on the content of the adaptation data, three recognition experiments were run for 15 words and for 50 words of adaptation data, and the spread of the recognition rate was examined. With 15 words, the standard deviation of the recognition rate was 10.18, showing that the recognition rate varied with the utterance content of the adaptation data. With 50 words, in contrast, the standard deviation was 0.57, and the recognition rate hardly varied with the adaptation data. Here, each standard deviation is the average over the clean, SNR = 20 dB, and SNR = 10 dB conditions. Therefore, when the speech and image stream weight coefficients are estimated from a small amount of adaptation data, the utterance content of the adaptation data must be chosen carefully. Table 3 also indicates that more adaptation data tends to yield more appropriate stream weight coefficients. Finally, the experimental results obtained when the stream weight coefficients are split into two classes are considered.

【００８１】Table 4 shows the recognition rates obtained, for varying amounts of adaptation data, when the clustering unit of the stream weight coefficients is split into two and the speech and image stream weight coefficients are estimated. The adaptation data are the same as those used for Table 3. In this experiment, no threshold was imposed on the number of adaptation data required for an update, so the stream weight coefficients of all classes were updated.

【００８２】[0082]

[Table 4] Word recognition rates when the speech and image stream weight coefficients are split into two classes
-------------------------------------------------------------------------
Adaptation data              Clean (no noise)   SNR = 20 dB   SNR = 10 dB
-------------------------------------------------------------------------
15 words (58, 50)            97.03%             77.39%        61.96%
25 words (104, 89)           97.52%             89.36%        68.57%
50 words (197, 169)          97.28%             87.87%        66.34%
75 words (266, 255)          97.03%             83.91%        65.85%
100 words (365, 332)         97.28%             87.38%        68.57%
-------------------------------------------------------------------------
(Note) The pair in parentheses after the adaptation data is (number of voiced phonemes, number of unvoiced phonemes).

【００８３】Comparing Table 3 and Table 4, the recognition rate is slightly higher in some cases when the adaptation data consist of 50 words or more, but the difference is small. With 15 words of adaptation data, the recognition rate is higher than when the stream weight coefficients are not split. Two factors are thought to contribute: splitting the clustering unit of the stream weight coefficients allowed the weight coefficient of one of the classes to be estimated closer to the optimal value for the test set, and the GPD algorithm was simply iterated more times.
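The comparison between the two tables is easier to follow as per-condition differences. The sketch below only re-enters the recognition rates printed in Tables 3 and 4 and subtracts them; the names `TABLE3`, `TABLE4`, and `gains` are hypothetical.

```python
# Recognition rates (%) copied from Tables 3 and 4, keyed by number of adaptation words.
# Tuple order: (clean, SNR = 20 dB, SNR = 10 dB).
TABLE3 = {
    15: (96.86, 77.15, 56.35),
    25: (97.28, 89.36, 69.06),
    50: (97.28, 87.38, 68.57),
    75: (97.03, 83.42, 65.60),
    100: (97.03, 87.38, 68.81),
}
TABLE4 = {
    15: (97.03, 77.39, 61.96),
    25: (97.52, 89.36, 68.57),
    50: (97.28, 87.87, 66.34),
    75: (97.03, 83.91, 65.85),
    100: (97.28, 87.38, 68.57),
}


def gains():
    """Table 4 minus Table 3, in percentage points, rounded to two decimals."""
    return {
        n: tuple(round(split - whole, 2) for whole, split in zip(TABLE3[n], TABLE4[n]))
        for n in TABLE3
    }
```

For 15 words the differences are +0.17, +0.24, and +5.61 points, so the benefit of splitting is concentrated in the SNR = 10 dB condition; for 50 words or more the differences are small and of mixed sign.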

【００８４】As described above, in this embodiment, speech and image information are integrated by synthesis using HMMs, and the stream weight coefficients of the integrated composite HMM are then adapted to the environment. As a result, synchrony between speech and image can be established, and a re-learned composite HMM that reduces misrecognition during speech recognition can be generated. In addition, better speech recognition performance is obtained than with the conventional example and the first embodiment.

【００８５】In the first and second embodiments described above, each of the operation or processing units 11-17, 21-26, and 50 is implemented by a digital computer such as a CPU, and may be implemented either as a hardware circuit or as a software program. Each of the memories 31-34, 41, 42, and 51 is implemented by a storage device such as a hard disk memory.

【００８６】[0086]

[Effects of the Invention] As described above in detail, the composite model generating apparatus for speech and image according to the present invention composes a speech HMM and an image HMM by computing the product of the speech and image output probabilities for all combinations of the states of the two HMMs and generating a composite HMM that includes the product of the output probabilities in each state. Then, based on the generated composite HMM, it performs connected learning using the labeled AV signal stored in the first storage means so as to maximize the output likelihood, thereby generating a learned composite HMM of speech and image. Therefore, synchrony between speech and image can be established, and a learned composite HMM that reflects the synchrony between speech and image can be generated.
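The composition step can be illustrated with a small product-HMM sketch. The output probability rule (a product over all state combinations, here in the log domain with optional stream weight coefficients) follows the description; the treatment of transitions as a Kronecker product of the two transition matrices is an assumption that this passage does not state, and the function names are hypothetical.

```python
import numpy as np


def compose_transitions(a_speech, a_image):
    """Transition matrix over composite states (i, j), indexed as i * S_image + j.

    Assumption: the two chains move independently, so the composite transition
    probability is the product of the individual transition probabilities.
    """
    return np.kron(a_speech, a_image)


def composite_log_output(log_b_speech, log_b_image, w_speech=1.0, w_image=1.0):
    """Log output probability of every composite state for one frame.

    log_b_speech: shape (S_speech,), log_b_image: shape (S_image,).
    With both weights equal to 1 this is the log of the plain product.
    """
    grid = w_speech * log_b_speech[:, None] + w_image * log_b_image[None, :]
    return grid.ravel()  # same (i, j) -> i * S_image + j ordering as above
```

Composing a 3-state speech HMM with a 2-state image HMM in this way gives 6 composite states, which then serve as the initial model for the connected learning step.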

【００８７】The speech recognition apparatus according to the present invention performs speech recognition using the learned composite HMM of speech and image, based on the extracted feature amount of the uttered speech signal and the extracted feature amount of the image signal, and can therefore recognize speech at a higher speech recognition rate than the conventional initial integration method and result integration method.

【００８８】Furthermore, the environment adaptation apparatus for a composite model of speech and image according to the present invention computes the likelihood obtained when the environment adaptation signal data are speech-recognized using a predetermined HMM, clusters the weight coefficients of the phonemes in the learned composite HMM into a plurality of classes using a predetermined clustering criterion, and adapts the composite HMM to the environment by re-learning the weight coefficients of the phonemes belonging to each class, based on the computed likelihood, so as to reduce misrecognition. Preferably, the re-learning by the environment adaptation means is repeated so that the number of environment adaptation signal data in each class becomes less than a predetermined threshold. Therefore, synchrony between speech and image can be established, and a re-learned composite HMM that reduces misrecognition during speech recognition can be generated.

【００８９】Furthermore, another speech recognition apparatus according to the present invention performs speech recognition using the composite HMM of speech and image that has been environment-adapted by the above environment adaptation apparatus, and can therefore recognize speech at a higher speech recognition rate than the conventional initial integration method and result integration method, and than the synthesis integration method described above.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a composite model generating apparatus 100 for speech and image and a speech recognition apparatus 200 according to a first embodiment of the present invention.

FIG. 2(a) is a state transition diagram showing an example of a speech HMM in the speech HMM memory 32a of FIG. 1, and FIG. 2(b) is a state transition diagram showing an example of an image HMM in the image HMM memory 32b of FIG. 1.

FIG. 3 is a state transition diagram showing an example of a composite HMM in the composite HMM memory 33 of FIG. 1.

FIG. 4 is a diagram showing the three-dimensional search space of synthesis integration in the composite model generating apparatus 100 for speech and image of FIG. 1.

FIG. 5 is a graph showing experimental results for the speech recognition apparatuses according to the conventional example and the first embodiment.

FIG. 6 is a graph showing experimental results for the speech recognition apparatuses according to the conventional example and the first embodiment.

FIG. 7 is a graph showing experimental results for the speech recognition apparatuses according to the conventional example and the first embodiment.

FIG. 8 is a block diagram showing the configuration of an environment adaptation apparatus 300 according to a second embodiment of the present invention.

FIG. 9 is a flowchart showing the environment adaptation processing of the stream weight coefficients executed by the environment adaptation processing unit 50 of FIG. 8.

FIG. 10 is a diagram showing the binary tree structure used as the criterion for clustering the HMMs in the environment adaptation processing of FIG. 9.

FIG. 11 is a graph showing experimental results for the speech recognition apparatuses according to the conventional example and the first and second embodiments.

FIG. 12 is a graph showing experimental results for the speech recognition apparatuses according to the conventional example and the first and second embodiments.

FIG. 13 is a graph showing experimental results for the speech recognition apparatuses according to the conventional example and the first and second embodiments.

[Explanation of symbols]

11: data separation unit
12: synchronization unit
13a, 13b: preprocessing units
14a, 14b: feature extraction units
15a: speech HMM generation unit
15b: image HMM generation unit
16: HMM composition unit
17: HMM learning unit
21: data separation unit
22: synchronization unit
23a, 23b: preprocessing units
24a, 24b: feature extraction units
25: feature combining unit
26: speech recognition unit
31: phoneme-labeled AV data memory for learning
32a: speech HMM memory
32b: image HMM memory
33: composite HMM memory
34: learned composite HMM memory
41: input AV data memory
42: word HMM memory
50: environment adaptation processing unit
51: AV word data memory for environment adaptation
100: composite model generating apparatus for speech and image
200: speech recognition apparatus

Continued from the front page. (51) Int. Cl.7: G10L 15/24. FI: G10L 3/00 521C, 571Q. (72) Inventor: Satoshi Nakamura (中村哲), c/o ATR Spoken Language Translation Research Laboratories, 2-2 Hikaridai 2-chome, Seika-cho, Soraku-gun, Kyoto, Japan. F-terms (reference): 5D015 GG01 GG03 HH23 LL07; 5L096 BA16 BA18 JA11 JA16 KA04

Claims

[Claims]

1. A composite model generating apparatus for speech and image, comprising: first storage means for storing an AV signal including an uttered speech signal and an image signal of the lips of a speaker at the time of utterance; first generating means for generating a speech hidden Markov model, based on the uttered speech signal of the AV signal, so as to maximize the output likelihood; second generating means for generating an image hidden Markov model, based on the image signal of the AV signal, so as to maximize the output likelihood; second storage means for storing the speech hidden Markov model generated by the first generating means; third storage means for storing the image hidden Markov model generated by the second generating means; synthesizing means for composing the speech hidden Markov model stored in the second storage means and the image hidden Markov model stored in the third storage means by computing the product of the output probabilities of speech and image for all combinations of the states of the two hidden Markov models and generating a composite hidden Markov model that includes the product of the output probabilities in each state; and learning means for generating a learned composite hidden Markov model of speech and image by performing connected learning, based on the generated composite hidden Markov model, using the labeled AV signal stored in the first storage means so that the output likelihood is maximized.

2. A speech recognition apparatus comprising: extraction means for extracting a feature amount of an uttered speech signal and a feature amount of an image signal, based on an input AV signal including the uttered speech signal and the image signal of the lips of a speaker at the time of utterance; and first speech recognition means for performing speech recognition, based on the extracted feature amount of the uttered speech signal and the extracted feature amount of the image signal, using the learned composite hidden Markov model of speech and image generated by the composite model generating apparatus for speech and image according to claim 1, and for outputting a speech recognition result.

3. An environment adaptation apparatus for a composite model of speech and image, comprising: fourth storage means for storing environment adaptation signal data, which is a phoneme-labeled AV signal including an uttered speech signal and an image signal of the lips of a speaker at the time of utterance; second speech recognition means for computing the likelihood obtained when the environment adaptation signal data stored in the fourth storage means is speech-recognized using a predetermined hidden Markov model; and environment adaptation means for clustering the weight coefficients of the phonemes in the learned composite hidden Markov model of speech and image generated by the composite model generating apparatus for speech and image according to claim 1 into a plurality of classes using a predetermined clustering criterion, and for adapting the composite hidden Markov model to the environment by re-learning the weight coefficients of the phonemes belonging to each class, based on the computed likelihood, so as to reduce misrecognition.

4. The environment adaptation apparatus for a composite model of speech and image according to claim 3, wherein the re-learning by the environment adaptation means is repeated so that the number of environment adaptation signal data in each class becomes less than a predetermined threshold.

5. A speech recognition apparatus comprising: extraction means for extracting a feature amount of an uttered speech signal and a feature amount of an image signal, based on an input AV signal including the uttered speech signal and the image signal of the lips of a speaker at the time of utterance; and third speech recognition means for performing speech recognition, based on the extracted feature amount of the uttered speech signal and the extracted feature amount of the image signal, using the composite hidden Markov model of speech and image that has been environment-adapted by the environment adaptation apparatus for a composite model of speech and image according to claim 3 or 4, and for outputting a speech recognition result.