JP4864783B2 - Pattern matching device, pattern matching program, and pattern matching method

Info

Publication number: JP4864783B2
Application number: JP2007076928A
Authority: JP
Inventors: 俊樹遠藤; 恒夫加藤
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2007-03-23
Filing date: 2007-03-23
Publication date: 2012-02-01
Anticipated expiration: 2027-03-23
Also published as: JP2008233782A

Description

The present invention relates to a pattern matching device, a pattern matching program, and a pattern matching method.

A speech recognition device obtains a recognition result by matching the time series of acoustic features extracted from the input speech signal against acoustic models. Each acoustic model covers a phoneme unit, such as a vowel or a consonant, and holds a probability density distribution in the acoustic feature space that has been learned in advance. The acoustic model, being a probabilistic model, outputs a score (acoustic likelihood) indicating how phoneme-like an input acoustic feature is. The speech recognition device accumulates these acoustic likelihood scores over the entire utterance under the constraints of the grammar and the word dictionary, and outputs the word sequence with the highest accumulated score as the recognition result.

Acoustic features are time-series data of multidimensional vectors. When the frequency distribution of the data belonging to each phoneme is tallied in each dimension, its shape is close to a normal distribution or to a sum of several normal distributions. To represent such distributions, the probability density distribution of an acoustic model is expressed by a multidimensional normal distribution or by several multidimensional normal distributions. In actual matching, however, variations in microphone characteristics, differences between speakers, background noise, and similar factors create a mismatch between the distribution of the input acoustic features and the probability density distribution of the acoustic model, which lowers the recognition rate. Cepstral mean normalization (CMN) is widely used to remove this mismatch when matching input acoustic features against the acoustic model, and mean and variance normalization (MVN) has been proposed as a further development of CMN.

CMN subtracts the mean over the entire utterance from the acoustic feature at each time of the utterance, making the mean of the acoustic features zero. This aligns the distribution of the input acoustic features with the probability density distribution of the acoustic model and reduces the mismatch. Let x(t) be the acoustic feature of each dimension before CMN and x_c(t) the acoustic feature after CMN; the CMN operation is then expressed by equations (1) and (2), where T is the number of frames in the entire utterance.
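Equations (1) and (2) are not reproduced in this text. The following is a minimal Python sketch of the operation as the prose describes it, for a single feature dimension; the function name `batch_cmn` and the toy values are illustrative assumptions.

```python
def batch_cmn(x):
    """Batch CMN for one feature dimension: x_c(t) = x(t) - E(x)."""
    if not x:
        return []
    mean = sum(x) / len(x)        # E(x): mean over all T frames of the utterance
    return [v - mean for v in x]  # subtract the utterance mean from every frame


frames = [1.0, 2.0, 3.0, 6.0]     # toy values for one feature dimension
print(batch_cmn(frames))          # [-2.0, -1.0, 0.0, 3.0]
```

Because the mean is taken over all T frames, no normalized value is available until the whole utterance has been buffered.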

MVN, in contrast, normalizes the acoustic feature at each time of the utterance by the mean and variance over the entire utterance so that it matches the reference normal distribution N (mean 0, variance 1). This reduces the mismatch, caused by microphone characteristics and the like, between the distribution of the input acoustic features and the probability density distribution of the acoustic model. Let x(t) be the acoustic feature of each dimension before MVN and x_m(t) the acoustic feature after MVN; the MVN operation is then expressed by equations (3) to (5).
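Equations (3) to (5) are not reproduced in this text. A conventional form consistent with the description (mean 0, variance 1 after normalization) is sketched below. The 1/T normalization of the variance, the frame index running from 1 to T, and the square root in the denominator are assumptions, not taken from the source.

```latex
\begin{align}
E(x)   &= \frac{1}{T}\sum_{t=1}^{T} x(t) \\
V(x)   &= \frac{1}{T}\sum_{t=1}^{T} \bigl(x(t) - E(x)\bigr)^{2} \\
x_m(t) &= \frac{x(t) - E(x)}{\sqrt{V(x)}}
\end{align}
```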

CMN and MVN can normalize not only speech but also still images and video. For a still image, let x_{i,j} be the image feature of each dimension and x_{ci,j} the image feature after CMN; the CMN operation is then expressed by equations (6) and (7). I and J are the numbers of blocks along the vertical and horizontal axes of the still image.

For MVN, let x_{i,j} be the image feature of each dimension before MVN and x_{mi,j} the image feature after MVN; the MVN operation is then expressed by equations (8) to (10).
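Equations (6) to (10) are not reproduced in this text. As a hedged illustration of the still-image case, the sketch below applies MVN to one feature dimension over an I x J grid of blocks. The function name `image_mvn`, the population (1/(I*J)) variance, the square-root divisor, and the `eps` guard against a zero variance are all assumptions.

```python
import math


def image_mvn(blocks, eps=1e-12):
    """MVN over an I x J grid of block features (one feature dimension)."""
    values = [v for row in blocks for v in row]
    n = len(values)                                  # I * J blocks
    if n == 0:
        return []
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n   # population variance (assumed)
    std = math.sqrt(var) if var > eps else 1.0       # assumed guard: keep scale if variance ~ 0
    return [[(v - mean) / std for v in row] for row in blocks]


grid = [[1.0, 3.0], [5.0, 7.0]]                      # toy 2 x 2 grid of block features
print(image_mvn(grid))
```

Dropping the division by `std` gives the CMN counterpart for still images.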

For video, let x_{i,j,t} be the video feature of each dimension and x_{ci,j,t} the video feature after CMN; the CMN operation is then expressed by equations (11) and (12). I and J are the numbers of blocks along the vertical and horizontal axes of the video, and T is the number of frames.

For MVN, let x_{i,j,t} be the video feature of each dimension before MVN and x_{mi,j,t} the video feature after MVN; the MVN operation is then expressed by equations (13) to (15).
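Equations (11) to (15) are not reproduced in this text. One form consistent with the definitions of I, J, and T given above is sketched here; the index ranges, the 1/(IJT) normalization, and the square root in the MVN denominator are assumptions.

```latex
\begin{align}
E(x) &= \frac{1}{IJT}\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{t=1}^{T} x_{i,j,t} \\
V(x) &= \frac{1}{IJT}\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{t=1}^{T} \bigl(x_{i,j,t} - E(x)\bigr)^{2} \\
x_{ci,j,t} &= x_{i,j,t} - E(x) \\
x_{mi,j,t} &= \frac{x_{i,j,t} - E(x)}{\sqrt{V(x)}}
\end{align}
```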

However, CMN and MVN that use the mean and variance of the entire utterance cannot produce normalized acoustic features until the utterance ends. The start of the matching process is therefore delayed, which lengthens the wait between the end of the utterance and the output of the recognition result. To reduce this delay, methods have been proposed that compute the mean and variance from a local section of several tens to several hundreds of milliseconds instead of the entire utterance and use them for normalization. Hereinafter, normalizing the acoustic features with the mean computed from the entire utterance is called batch CMN, and normalizing them with the mean computed from a partial section of the utterance is called segmental CMN. Likewise, normalizing the acoustic features with the mean and variance computed from the entire utterance is called batch MVN, and normalizing them with the mean and variance computed from a partial section of the utterance is called segmental MVN.
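The difference between the batch and segmental variants can be sketched as follows. The text does not say how the local section is positioned relative to the current frame, so this sketch assumes a trailing window of the most recent `window` frames (shorter at the start of the utterance); the function name and the toy values are also illustrative.

```python
def segmental_cmn(x, window):
    """Segmental CMN: subtract the mean of the most recent `window` frames."""
    if window < 1:
        raise ValueError("window must be at least 1 frame")
    out = []
    for t in range(len(x)):
        start = max(0, t - window + 1)           # assumed trailing window
        segment = x[start:t + 1]
        local_mean = sum(segment) / len(segment)
        out.append(x[t] - local_mean)            # available as soon as frame t arrives
    return out


frames = [1.0, 2.0, 3.0, 6.0]
print(segmental_cmn(frames, window=2))           # [0.0, 0.5, 0.5, 1.5]
```

Each output depends only on frames already received, which is why the segmental variants avoid waiting for the end of the utterance.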

A method is also known in which, in mean and variance normalization (MVN) that does not assume quantization of the features, variance normalization is not performed when the calculated variance is zero or a small value close to zero (see, for example, Patent Document 1).
Patent Document 1: JP 2002-278586 A
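The idea of skipping variance normalization for a near-zero variance can be illustrated with a short sketch. This is not the method of Patent Document 1 itself; the threshold `floor`, the choice to still subtract the mean when the variance step is skipped, and the square-root divisor are assumptions made for illustration.

```python
import math


def mvn_with_variance_guard(x, floor=1e-6):
    """MVN that skips the variance step when the variance is zero or nearly zero."""
    if not x:
        return []
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    if var <= floor:                      # variance is zero or close to zero:
        return [v - mean for v in x]      # skip variance normalization (assumed: keep mean step)
    std = math.sqrt(var)
    return [(v - mean) / std for v in x]


print(mvn_with_variance_guard([2.0, 2.0, 2.0]))   # [0.0, 0.0, 0.0]
print(mvn_with_variance_guard([1.0, 3.0]))        # [-1.0, 1.0]
```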

In CMN, however, batch CMN calculates the feature mean from a longer speech section than segmental CMN does, so it is more accurate and improves the recognition rate more. Even so, it cannot align the spread of the input acoustic feature distribution with the spread of the probability density distribution of the reference acoustic model.

In MVN, batch MVN normalizes the distribution of the acoustic features of the entire utterance to mean 0 and variance 1, but the variance of the distribution of each phoneme, the unit of speech recognition, is not normalized. If, on the other hand, the mean and variance calculation section of segmental MVN is set to a duration equivalent to one phoneme (several tens to several hundreds of milliseconds), the effect is close to normalizing the variance of each phoneme's distribution. However, the short-time mean is also normalized to 0, so the means of all phoneme distributions approach 0 and the distributions overlap more (see FIG. 2), which degrades the ability to discriminate between phonemes.
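The loss of discrimination described here can be seen in a toy example. The sketch below uses made-up frame values for two phonemes and, as a simplifying assumption, treats each phoneme as exactly one non-overlapping normalization section; real segmental MVN would use a sliding section that only approximates this.

```python
import math


def mvn(x):
    """Plain MVN of one section: zero mean, unit variance (population variance)."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    return [(v - mean) / std for v in x]


phoneme_a = [1.0, 2.0, 3.0]       # toy frames of one phoneme
phoneme_b = [11.0, 12.0, 13.0]    # toy frames of another phoneme, far from the first

seg_a, seg_b = mvn(phoneme_a), mvn(phoneme_b)   # one section per phoneme
batch = mvn(phoneme_a + phoneme_b)              # one section for the whole utterance

# After per-phoneme normalization the two phonemes are indistinguishable.
print(all(abs(a - b) < 1e-12 for a, b in zip(seg_a, seg_b)))   # True
# After whole-utterance normalization their means stay apart.
print(sum(batch[:3]) / 3, sum(batch[3:]) / 3)
```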

Patent Document 1 describes only application to batch MVN, which normalizes with the mean and variance of the entire utterance, and to segmental MVN, which normalizes the acoustic features with a local mean and variance. It therefore cannot solve the problem of degraded phoneme discrimination described above.

In short, CMN cannot normalize the spread (variance) of the distribution, while segmental MVN comes close to normalizing the variance of each phoneme but brings the means of the phoneme distributions closer together, which degrades phoneme discrimination.

These problems are not limited to speech. They also apply to any pattern matching device that calculates features of externally input data, normalizes the calculated features, and performs pattern matching based on the normalized features.

The present invention has been made to solve the above problems, and its object is to provide a pattern matching device, a pattern matching program, and a pattern matching method that can normalize features without degrading their discriminative ability.

The present invention is a pattern matching device comprising: analysis means for calculating features of externally input audio data or image data; normalization means for normalizing the features calculated by the analysis means; and pattern matching means for performing pattern matching based on the normalized features produced by the normalization means, wherein the normalization means comprises: overall average acquisition means for acquiring an overall average value, which is the average of the features over all frames of the audio data or over the entire image data; local average calculation means for calculating a local average value, which is the average of the features over a local number of frames of the audio data or over a local range of the image data; local variance calculation means for calculating, based on the local average value, a local variance value, which is the variance of the features over the local number of frames of the audio data or over the local range of the image data; and normalization processing calculation means for normalizing the features based on the overall average value and a plurality of the local variance values.

The overall average acquisition means of the present invention is characterized in that it calculates the overall average value from the features over all frames of the audio data or over the entire image data.

The overall average acquisition means of the present invention is also characterized in that it uses a predetermined value stored in advance as the overall average value.

The present invention is also the pattern matching device according to claim 2, further comprising range identification means for identifying a range that contains the features targeted for pattern matching, wherein the overall average acquisition means calculates the overall average value from the features over all frames of the audio data, or over the entire image data, based on the range identified by the range identification means.

The local average calculation means of the present invention calculates the local average value based on a value weighted by a previously calculated local average value, and the local variance calculation means calculates the local variance value based on a value weighted by a previously calculated local variance value.

The present invention is also a pattern matching program for causing a computer to function as: analysis means for calculating features of externally input audio data or image data; normalization means for normalizing the features calculated by the analysis means; and pattern matching means for performing pattern matching based on the normalized features produced by the normalization means, wherein the normalization means causes the computer to function as: overall average acquisition means for acquiring an overall average value, which is the average of the features over all frames of the audio data or over the entire image data; local average calculation means for calculating a local average value, which is the average of the features over a local number of frames of the audio data or over a local range of the image data; local variance calculation means for calculating, based on the local average value, a local variance value, which is the variance of the features included in the second range; and normalization processing calculation means for normalizing the features based on the overall average value and a plurality of the local variance values.

The present invention is also a pattern matching method comprising: an analysis step of calculating features of externally input audio data or image data; a normalization step of normalizing the features calculated in the analysis step; and a pattern matching step of performing pattern matching based on the normalized features produced in the normalization step, wherein the normalization step comprises: an overall average acquisition step of acquiring an overall average value, which is the average of the features over all frames of the audio data or over the entire image data; a local average calculation step of calculating a local average value, which is the average of the features over a local number of frames of the audio data or over a local range of the image data; a local variance calculation step of calculating, based on the local average value, a local variance value, which is the variance of the features over the local number of frames of the audio data or over the local range of the image data; and a normalization processing calculation step of normalizing the features based on the overall average value and a plurality of the local variance values.

According to the present invention, features can be normalized without degrading their discriminative ability.

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a speech recognition device according to one embodiment of the present invention. The acoustic analysis unit 101 performs acoustic analysis on speech data input from a microphone or the like and calculates acoustic features. Input can also be controlled by push-to-talk. The acoustic analysis unit 101 temporarily stores the calculated acoustic features in a buffer. The normalization processing unit 102 normalizes the acoustic features that the acoustic analysis unit 101 stored in the buffer, using the mean and variance of the acoustic features. The normalization process is described later. The acoustic model learning unit 103 stores in the acoustic model storage unit 104 the acoustic features of the training speech data, which are obtained by having the acoustic analysis unit 101 apply to the training speech data the same acoustic analysis as is applied to the speech data to be recognized, and having the normalization processing unit 102 normalize the result. The language model storage unit 105 stores a word dictionary and a grammar.

The recognition processing unit 106 performs pattern matching and outputs the recognition result. For this it uses the acoustic features of the speech data to be recognized, obtained through acoustic analysis by the acoustic analysis unit 101 and normalization by the normalization processing unit 102, together with the acoustic features of the training speech data stored in the acoustic model storage unit 104 and the word dictionary and grammar stored in the language model storage unit 105.

[First embodiment]
First, a first embodiment of the present invention will be described. FIG. 3 shows the configuration of the normalization processing unit 102 according to this embodiment. The acoustic features of one entire utterance input to the speech recognition device from a microphone or the like are stored by the acoustic analysis unit 101 in a buffer (not shown). The overall average calculation unit 301 reads from the buffer the acoustic features within the T frames corresponding to the entire utterance and calculates their average. The length of the entire utterance can be taken as the length of a word, the length up to a break in the speech, the length from one punctuation mark to the next, the length of the whole input speech, and so on. The average E(x) of the acoustic features of the entire utterance is obtained by formula (16).

The local average calculation unit 302 reads from the buffer the acoustic features corresponding to the utterance within a preset local number of frames τ and calculates their average. The local number of frames τ corresponds to the length of a phoneme, for example several tens to several hundreds of milliseconds. Phoneme length varies with the word spoken and with the speaker, but this embodiment uses a fixed value. The average E_τ(x) of the acoustic features over the local τ frames is obtained by formula (17).

The local variance calculation unit 303 reads from the buffer the acoustic features corresponding to the utterance within the preset local number of frames τ and calculates their variance based on the average calculated by the local average calculation unit 302. The variance V_τ(x) of the acoustic features over the local τ frames is obtained by formula (18).

The normalization processing calculation unit 304 subtracts, from the acoustic feature before normalization, the average of the acoustic features over the entire utterance calculated by the overall average calculation unit 301, and divides the result by the variance value of the acoustic features over the local τ frames calculated by the local variance calculation unit 303. This gives the normalized acoustic feature x_τ(t) (see formula (19)).
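Formulas (16) to (19) are not reproduced in this text, so the following Python sketch reconstructs the first embodiment from the prose alone, for one feature dimension. Several points are assumptions rather than statements of the source: the local section is taken as a trailing window of up to `tau` frames ending at frame t, the variance is the population variance, the divisor is the square root of the local variance (the text says "variance value"; the square root is the usual choice when unit variance is the target), and `eps` guards against a zero variance. The function name is hypothetical.

```python
import math


def normalize_global_mean_local_var(x, tau, eps=1e-12):
    """Subtract the whole-utterance mean, then scale by the local spread."""
    T = len(x)
    if T == 0:
        return []
    global_mean = sum(x) / T                          # E(x), formula (16)
    out = []
    for t in range(T):
        seg = x[max(0, t - tau + 1):t + 1]            # assumed trailing window of up to tau frames
        local_mean = sum(seg) / len(seg)              # E_tau(x), formula (17)
        local_var = sum((v - local_mean) ** 2 for v in seg) / len(seg)  # V_tau(x), formula (18)
        scale = math.sqrt(local_var) if local_var > eps else 1.0        # assumed sqrt and zero guard
        out.append((x[t] - global_mean) / scale)      # x_tau(t), formula (19)
    return out


frames = [0.0, 4.0, 4.0, 12.0]                        # toy values for one feature dimension
print(normalize_global_mean_local_var(frames, tau=2)) # [-5.0, -0.5, -1.0, 1.75]
```

The local mean enters only through the local variance; the value subtracted from each frame is always the whole-utterance mean, which is what keeps the phoneme distributions from collapsing toward zero.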

As described above, normalizing the input acoustic features by the mean of the entire utterance aligns the positions of all phoneme distributions with the corresponding phoneme distributions of the acoustic model, and the further normalization by the local variance brings the distributions closer to a normal distribution while limiting the overlap among the phoneme distributions (see FIG. 4). As a result, the mismatch between the acoustic model and the input acoustic features caused by background noise, reverberation, and the like can be reduced without lowering the accuracy of discrimination between phonemes, and degradation of speech recognition accuracy can be reduced.

In the above, the overall average calculation unit 301 averages over the number of frames corresponding to the utterance duration, but a preset number of frames τ′ may be used instead.

[Second embodiment]
Next, a second embodiment of the present invention will be described. FIG. 5 shows the configuration of the normalization processing unit 102 according to this embodiment. This embodiment is characterized in that, when the local average value and local variance value for the local number of frames of interest are calculated, it uses the local average value calculated from the acoustic features of the immediately preceding local number of frames (hereinafter, the previous local average value) and the local variance value calculated from the acoustic features of the immediately preceding local number of frames (hereinafter, the previous local variance value). When sudden noise is input to the speech recognition device, the local average value and local variance value change greatly, making it difficult to recognize the input speech data correctly. Using the previous local average value and the previous local variance value keeps the local average value and local variance value from changing greatly even when noise suddenly enters the speech recognition device, and so reduces degradation of speech recognition accuracy.

As in the first embodiment, the overall average calculation unit 501 reads from the buffer the acoustic features corresponding to the entire utterance input to the speech recognition device and calculates their average E(x). The local average calculation unit 502 reads from the buffer the acoustic features corresponding to the utterance within the preset local number of frames τ and calculates their average. In doing so, a forgetting factor α is set in advance and the previous local average value is added with weighting. The local average value E_p(t) of the acoustic features over the local τ frames, with the previous local average value weighted in, is obtained by formula (20).

The local variance calculation unit 503 reads from the buffer the acoustic features corresponding to the utterance within the preset local number of frames τ and calculates their variance based on the average calculated by the local average calculation unit 502. In doing so, a forgetting factor α is set in advance and the previous local variance value is added with weighting. The local variance value V_p(t) of the acoustic features over the local τ frames, with the previous local variance value weighted in, is obtained by formula (21).

The normalization processing calculation unit 504 subtracts, from the acoustic feature before normalization, the average of the acoustic features of the entire utterance calculated by the overall average calculation unit 501, and divides the result by the variance value of the acoustic features over the local τ frames calculated by the local variance calculation unit 503. This gives the normalized acoustic feature x_p(t) (see formula (22)).
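Formulas (20) to (22) are not reproduced in this text, and the exact way the forgetting factor α enters them is not stated. The sketch below is therefore only one plausible reading: it assumes an exponential blend, E_p(t) = α·E_p(t-1) + (1-α)·(current local mean), and the same blend for the variance. The trailing window, the square-root divisor, the `eps` guard, the handling of the first frame, and the function name are likewise assumptions.

```python
import math


def normalize_smoothed(x, tau, alpha, eps=1e-12):
    """Global-mean / local-variance normalization with smoothed local statistics."""
    T = len(x)
    if T == 0:
        return []
    global_mean = sum(x) / T                              # E(x) over the whole utterance
    out = []
    prev_mean = prev_var = None
    for t in range(T):
        seg = x[max(0, t - tau + 1):t + 1]                # assumed trailing window
        mean_now = sum(seg) / len(seg)
        # Assumed blend of the previous and current local mean (stand-in for formula (20)).
        mean_p = mean_now if prev_mean is None else alpha * prev_mean + (1 - alpha) * mean_now
        var_now = sum((v - mean_p) ** 2 for v in seg) / len(seg)
        # Assumed blend of the previous and current local variance (stand-in for formula (21)).
        var_p = var_now if prev_var is None else alpha * prev_var + (1 - alpha) * var_now
        scale = math.sqrt(var_p) if var_p > eps else 1.0  # assumed sqrt and zero guard
        out.append((x[t] - global_mean) / scale)          # stand-in for formula (22)
        prev_mean, prev_var = mean_p, var_p
    return out


frames = [0.0, 4.0, 4.0, 12.0]
print(normalize_smoothed(frames, tau=2, alpha=0.5))
```

Under these assumptions, `alpha=0.0` ignores the previous statistics and reduces to the first embodiment, while values closer to 1 make the local statistics react more slowly to a sudden burst of noise.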

As described above, normalizing the input acoustic features by the mean of the entire utterance aligns the positions of all phoneme distributions with the corresponding phoneme distributions of the acoustic model, and the further normalization by the local variance brings the distributions closer to a normal distribution while limiting the overlap among the phoneme distributions (see FIG. 4). As a result, the mismatch between the acoustic model and the input acoustic features caused by background noise, reverberation, and the like can be reduced without lowering the accuracy of discrimination between phonemes, and degradation of speech recognition accuracy can be reduced. Furthermore, sudden noise input to the speech recognition device would normally make the input speech data difficult to recognize, but using the previous local average value and the previous local variance value keeps the local statistics from changing greatly even in that case, so degradation of speech recognition accuracy can be reduced.

[Third embodiment]
Next, a third embodiment of the present invention will be described. FIG. 6 shows the configuration of the normalization processing unit 102 according to this embodiment. This embodiment is characterized in that it uses a fixed average value calculated in advance instead of calculating the average of the acoustic features of the entire utterance as in the first embodiment. Because the average of the acoustic features no longer has to be calculated from the entire utterance, the wait until normalization of the acoustic features completes is only the time needed to calculate the local variance, which makes real-time processing possible.

The fixed average value storage unit 601 stores a preset average value E_f(x) of the acoustic features. The fixed value can be, for example, the average of the previous utterance or a value obtained from a very large amount of past speech data.

The local average calculation unit 602 and the local variance calculation unit 603 calculate the local average value and the local variance value in the same way as in the first embodiment. The normalization processing calculation unit 604 subtracts the fixed average value stored in the fixed average value storage unit 601 from the acoustic feature amount before normalization, and divides the result by the variance value of the acoustic feature amount over the local number of frames τ calculated by the local variance calculation unit 603, yielding the normalized acoustic feature amount x_f(t) (see formula (23)).
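The computation described for the third embodiment can be sketched as follows. This is a minimal illustration and not the patent's formula (23): the function name, the centred window, and the variance floor `eps` are assumptions made for the example. It follows the text literally in dividing by the local variance rather than by the standard deviation.

```python
def normalize_fixed_mean(x, fixed_mean, tau, eps=1e-8):
    """Subtract a pre-stored mean and divide by the local variance.

    x          : list of floats, one feature dimension over time
    fixed_mean : stored average E_f(x), e.g. taken from a previous utterance
    tau        : local window length in frames (assumed centred on frame t)
    eps        : variance floor (assumption) to avoid division by zero
    """
    half = tau // 2
    out = []
    for t in range(len(x)):
        # Local window around frame t, clipped at the utterance boundaries.
        window = x[max(0, t - half): t + half + 1]
        local_mean = sum(window) / len(window)
        local_var = sum((v - local_mean) ** 2 for v in window) / len(window)
        out.append((x[t] - fixed_mean) / max(local_var, eps))
    return out
```

Because only the local window has to be buffered, each frame can be normalized as soon as its window is available, which is the latency benefit this embodiment describes.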

As described above, using a fixed average value calculated in advance instead of calculating the average value of the acoustic feature amounts of the entire utterance removes the need to calculate that average from the entire utterance in real time. The waiting time until normalization of the acoustic feature amount completes is therefore only the time required to calculate the local variance, and real-time processing becomes possible. In addition, normalizing the input acoustic feature amount by the average value of the entire utterance aligns the positions of all phoneme distributions with the corresponding phoneme distributions of the acoustic model, and normalizing by the local variance value brings each distribution closer to a normal distribution while suppressing overlap among the phoneme distributions (see FIG. 4). As a result, the mismatch between the acoustic model and the input acoustic feature amount caused by background noise, reverberation, and the like can be reduced without lowering the discrimination accuracy between phonemes, and degradation of speech recognition accuracy can be reduced.

[Fourth embodiment]
Next, a fourth embodiment of the present invention will be described. FIG. 7 shows the configuration of the normalization processing unit 102 according to this embodiment. In this embodiment, a speech detection unit placed before the overall average calculation unit 702 identifies the speech section, and normalization uses the average value over the number of frames τ′ corresponding to the speech section plus several tens of milliseconds before and after it. This shortens the waiting time before normalization even when a long silent section follows the end of the utterance.

The speech detection unit 701 detects that the input acoustic feature amount contains features specific to speech, and identifies the speech section. Speech power, cepstrum values, frequency, and the like can be used as the speech-specific features. The overall average calculation unit 702 reads from the buffer the acoustic feature amounts of the τ′ frames corresponding to the speech section identified by the speech detection unit 701 plus several tens of milliseconds before and after it, and calculates their average value. The average value E_{τ′}(x) of the acoustic feature amounts over τ′ is obtained by formula (24).
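As a rough illustration of this step, the sketch below finds a speech section with a simple power threshold, widens it by a margin, and averages the features over the resulting τ′ frames. The thresholding rule, the function name, and all parameter names are assumptions for the example; the text only says that power, cepstrum values, or frequency may be used for detection.

```python
def speech_section_mean(features, power, threshold, margin):
    """Average the features over the detected speech section plus a margin.

    features  : list of floats, one feature dimension per frame
    power     : list of floats, per-frame power used for detection
    threshold : power above which a frame counts as speech (assumed rule)
    margin    : extra frames kept before and after the speech section
    Returns (mean, start, end) with the section being features[start:end].
    """
    voiced = [t for t, p in enumerate(power) if p > threshold]
    if not voiced:
        raise ValueError("no speech detected")
    start = max(0, voiced[0] - margin)
    end = min(len(features), voiced[-1] + margin + 1)
    section = features[start:end]  # tau' = len(section) frames
    return sum(section) / len(section), start, end
```

Trailing silence beyond the margin never enters the average, so the mean is available shortly after the utterance ends rather than after the whole recording.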

The local average calculation unit 703 reads from the buffer the acoustic feature amounts of the utterance within a preset local number of frames τ, and calculates their average value. The local number of frames corresponds to the length of a phoneme, for example several tens to several hundreds of milliseconds. The length of a phoneme varies with the word uttered and with the speaker, but this embodiment uses a fixed value. The average value E_τ(x) of the acoustic feature amounts over the local number of frames τ is obtained by formula (17).

The local variance calculation unit 704 reads from the buffer the acoustic feature amounts of the utterance within the preset local number of frames τ, and calculates their variance value based on the average value calculated by the local average calculation unit 703. The variance value V_τ(x) of the acoustic feature amounts over the local number of frames τ is obtained by formula (18).

The normalization processing calculation unit 705 subtracts the average value of the acoustic feature amounts calculated by the overall average calculation unit 702 from the acoustic feature amount before normalization, and divides the result by the variance value of the acoustic feature amounts over the local number of frames τ calculated by the local variance calculation unit 704, yielding the normalized acoustic feature amount x_{τ′}(t) (see formula (25)).
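Read literally, the two steps of this embodiment can be written as below. This is a reconstruction from the prose only, not a copy of formulas (24) and (25); the start index t_s of the identified section is a symbol introduced here for illustration.

```latex
% Average over the tau' frames of the identified speech section (cf. formula (24));
% t_s denotes the assumed first frame of that section.
E_{\tau'}(x) = \frac{1}{\tau'} \sum_{t = t_s}^{t_s + \tau' - 1} x(t)

% Subtract the section average, divide by the local variance (cf. formula (25)).
x_{\tau'}(t) = \frac{x(t) - E_{\tau'}(x)}{V_{\tau}(x)}
```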

As described above, the speech detection unit 701 placed before the overall average calculation unit 702 identifies the speech section, and normalization uses the average value over the number of frames τ′ corresponding to the speech section plus several tens of milliseconds before and after it. This shortens the waiting time before normalization even when a long silent section follows the end of the utterance.

[Fifth embodiment]
Next, a fifth embodiment of the present invention will be described. FIG. 8 is a block diagram showing the configuration of the image recognition device according to this embodiment. In FIG. 4, replacing the speech data input from a microphone with an image input from a camera, the word dictionary / grammar and the acoustic model with an object model, and the speech recognition result with an image recognition result makes the invention applicable to image recognition as well.

The image feature amount analysis unit 801 analyzes image data input from a camera and calculates image feature amounts. The normalization processing unit 802 normalizes the image feature amounts calculated by the image feature amount analysis unit 801 using their average value and variance value. The normalization process is described later. The object model learning unit 803 has the image feature amount analysis unit 801 apply to the learning image data the same analysis that is applied to the image data to be recognized, has the normalization processing unit 802 normalize the result, and stores the resulting image feature amounts of the learning image data in the object model 804. The recognition processing unit 805 performs recognition using the image feature amounts of the image data to be recognized, obtained through analysis by the image feature amount analysis unit 801 and normalization by the normalization processing unit 802, together with the image feature amounts of the learning image data stored in the object model 804, and outputs the recognition result.

The image normalization process in this embodiment will be described with reference to FIG. 9. FIG. 9 shows the configuration of the normalization processing unit 102 according to this embodiment. The overall average calculation unit 901 reads from the buffer the image feature amounts of the entire image data input to the image recognition device from a camera or the like, and calculates their average value. The average value E(x_{i,j}) of the image feature amounts of the entire image data is obtained by formula (26). I and J represent the numbers of blocks along the vertical and horizontal axes of the still image.

The local average calculation unit 902 calculates the average value of the image feature amounts within a preset local range of the image data. The local range can be, for example, several surrounding blocks that contain the image range to be normalized. The average value E(x_{k,l}) of the image feature amounts in the local range (k, l) is obtained by formula (27). I and J represent the numbers of blocks along the vertical and horizontal axes within the local range of the still image.

The local variance calculation unit 903 calculates the variance value of the image feature amounts within the preset local range of the image data, based on the average value calculated by the local average calculation unit 902. The variance value V(x_{k,l}) of the image feature amounts in the local range (k, l) is obtained by formula (28).

The normalization processing calculation unit 904 subtracts the average value of the image feature amounts of the entire image calculated by the overall average calculation unit 901 from the image feature amount before normalization, and divides the result by the variance value of the image feature amounts within the preset range of the image data calculated by the local variance calculation unit 903, yielding the normalized image feature amount x_{k,l} (see formula (29)).
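The image case can be sketched in the same way as the speech case. The example below is an illustration under assumptions, not formulas (26) to (29) themselves: the local range is taken to be a square neighbourhood of `radius` blocks around block (k, l), and `eps` is a variance floor added to avoid division by zero. Both names are hypothetical.

```python
def normalize_blocks(x, radius=1, eps=1e-8):
    """Normalize an I-by-J grid of block features.

    x      : list of I rows, each a list of J floats (one feature per block)
    radius : half-width of the assumed square local range, in blocks
    eps    : variance floor (assumption) to avoid division by zero
    """
    rows, cols = len(x), len(x[0])
    # Average over the entire image.
    global_mean = sum(sum(row) for row in x) / (rows * cols)
    out = [[0.0] * cols for _ in range(rows)]
    for k in range(rows):
        for l in range(cols):
            # Blocks in the local range around (k, l), clipped at the borders.
            neigh = [x[i][j]
                     for i in range(max(0, k - radius), min(rows, k + radius + 1))
                     for j in range(max(0, l - radius), min(cols, l + radius + 1))]
            local_mean = sum(neigh) / len(neigh)
            local_var = sum((a - local_mean) ** 2 for a in neigh) / len(neigh)
            # Subtract the global mean, divide by the local variance.
            out[k][l] = (x[k][l] - global_mean) / max(local_var, eps)
    return out
```

For a 2-by-2 checkerboard of values 1 and 3 with `radius=1`, every local range covers the whole image, so the global mean is 2 and each local variance is 1.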

As described above, in image recognition as well, normalizing the image feature amounts by the average value of the entire image aligns the positions of all image feature amount distributions with the corresponding image feature amount distributions of the object model, and normalizing by the local variance value brings each distribution closer to a normal distribution while suppressing overlap among the image feature amount distributions. As a result, the mismatch between the object model and the input image feature amounts caused by shadows, brightness, and the like can be reduced without lowering the discrimination accuracy of the image feature amounts, and degradation of image recognition accuracy can be reduced.

Note that image recognition is possible not only for planar images but also for 3D images. When a 3D image is created, the shading of the object changes with the position of the camera, but the normalization of the present invention reduces the mismatch component of the image feature amounts and thus reduces degradation of image recognition accuracy.

[Sixth embodiment]
In addition, by incorporating a time element into image recognition, the mismatch component of moving image feature amounts can be reduced for moving images as well, and degradation of moving image recognition accuracy can be reduced.

A sixth embodiment of the present invention will be described. FIG. 10 is a block diagram showing the configuration of the moving image recognition device according to this embodiment. In FIG. 4, replacing the speech data input from a microphone with a moving image input from a camera, the word dictionary / grammar storage unit and the acoustic model storage unit with an object model storage unit, and the speech recognition result with a moving image recognition result makes the invention applicable to moving image recognition as well.

The moving image feature amount analysis unit 1001 analyzes moving image data input from a camera and calculates moving image feature amounts. The normalization processing unit 1002 normalizes the moving image feature amounts calculated by the moving image feature amount analysis unit 1001 using their average value and variance value. The normalization process is described later. The object model learning unit 1003 has the moving image feature amount analysis unit 1001 apply to the learning moving image data the same analysis that is applied to the moving image data to be recognized, has the normalization processing unit 1002 normalize the result, and stores the resulting moving image feature amounts of the learning moving image data in the object model 1004. The recognition processing unit 1005 performs recognition using the moving image feature amounts of the moving image data to be recognized, obtained through analysis by the moving image feature amount analysis unit 1001 and normalization by the normalization processing unit 1002, together with the moving image feature amounts of the learning moving image data stored in the object model 1004, and outputs the recognition result.

The moving image normalization process in this embodiment will be described with reference to FIG. 11. FIG. 11 shows the configuration of the normalization processing unit 102 according to this embodiment. The overall average calculation unit 1101 reads from the buffer the moving image feature amounts of the entire moving image data input to the moving image recognition device from a camera or the like, and calculates their average value. The average value E(x_{i,j,T}) of the moving image feature amounts of the entire moving image data is obtained by formula (30). I and J represent the numbers of blocks along the vertical and horizontal axes of the moving image, and T represents the number of frames.

The local average calculation unit 1102 calculates the average value of the image feature amounts within a preset local range of the moving image data. The local range can be, for example, several surrounding blocks that contain the moving image range to be normalized, together with a local number of frames. The average value E(x_{k,l,τ}) of the moving image feature amounts over the local range (k, l) and the local number of frames τ is obtained by formula (31). I and J represent the numbers of blocks along the vertical and horizontal axes within the local range of the moving image, and τ represents the local number of frames.

The local variance calculation unit 1103 calculates the variance value of the moving image feature amounts within the preset local range of the moving image data, based on the average value calculated by the local average calculation unit 1102. The variance value V(x_{k,l,τ}) of the moving image feature amounts over the local range (k, l) and the local number of frames τ is obtained by formula (32).

The normalization processing calculation unit 1104 subtracts the average value of the moving image feature amounts of the entire moving image calculated by the overall average calculation unit 1101 from the moving image feature amount before normalization, and divides the result by the variance value of the moving image feature amounts over the preset range of the moving image data and the local number of frames calculated by the local variance calculation unit 1103, yielding the normalized moving image feature amount x_{k,l,τ} (see formula (33)).
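For orientation, the moving image normalization described here can be written out as follows. These expressions are reconstructed from the prose and are not copies of formulas (30) and (33); the frame index t and the hat on the normalized value are notation introduced for this sketch.

```latex
% Average over all I x J blocks and all T frames (cf. formula (30)).
E(x_{i,j,T}) = \frac{1}{I\,J\,T} \sum_{t=1}^{T} \sum_{i=1}^{I} \sum_{j=1}^{J} x_{i,j,t}

% Normalized feature at block (k, l) and frame t (cf. formula (33)):
% subtract the overall average, divide by the local spatio-temporal variance.
\hat{x}_{k,l,t} = \frac{x_{k,l,t} - E(x_{i,j,T})}{V(x_{k,l,\tau})}
```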

The embodiments of the present invention have been described above in detail with reference to the drawings. However, the specific configuration is not limited to these embodiments, and also includes designs and the like that do not depart from the gist of the present invention.

For example, although speech, images, and moving images have been described in detail, the present invention is not limited to these and is also applicable to any recognition device that performs pattern matching based on the feature amounts of input data.

Also, although the second to fourth embodiments were described for speech recognition, they are applicable to image recognition and moving image recognition as well.

The normalization process may also be performed by recording a program that implements the functions of the normalization processing unit 102 shown in FIG. 1 and elsewhere on a computer-readable recording medium, and having a computer system read and execute the program recorded on that medium. The "computer system" referred to here may include an OS and hardware such as peripheral devices.

If a WWW system is used, the "computer system" also includes the website providing environment (or display environment). The "computer-readable recording medium" refers to a writable nonvolatile memory such as a flexible disk, a magneto-optical disk, a ROM, or a flash memory, a portable medium such as a CD-ROM, or a storage device such as a hard disk built into a computer system.

The "computer-readable recording medium" further includes media that hold a program for a certain period of time, such as the volatile memory (for example, DRAM (Dynamic Random Access Memory)) inside a computer system that acts as a server or client when the program is transmitted over a network such as the Internet or a communication line such as a telephone line.

The program may also be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium, or by transmission waves in a transmission medium. Here, the "transmission medium" that transmits the program refers to a medium that has the function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line. The program may implement only part of the functions described above. It may also be a so-called difference file (difference program), which implements the functions described above in combination with a program already recorded in the computer system.

FIG. 1 is a block diagram showing the configuration of a speech recognition device according to an embodiment of the present invention.
FIG. 2 is a diagram showing how the distribution changes under the normalization process of the segmental MVN method.
FIG. 3 is a diagram showing the configuration of the normalization processing unit according to the first embodiment of the present invention.
FIG. 4 is a diagram showing how the distribution changes under the normalization process of the present invention.
FIG. 5 is a diagram showing the configuration of the normalization processing unit according to the second embodiment of the present invention.
FIG. 6 is a diagram showing the configuration of the normalization processing unit according to the third embodiment of the present invention.
FIG. 7 is a diagram showing the configuration of the normalization processing unit according to the fourth embodiment of the present invention.
FIG. 8 is a block diagram showing the configuration of the image recognition device according to the fifth embodiment of the present invention.
FIG. 9 is a diagram showing the configuration of the normalization processing unit according to the fifth embodiment of the present invention.
FIG. 10 is a block diagram showing the configuration of the image recognition device according to the sixth embodiment of the present invention.
FIG. 11 is a diagram showing the configuration of the normalization processing unit according to the sixth embodiment of the present invention.

Explanation of symbols

101: speech analysis unit; 102, 802, 1002: normalization processing unit; 103: acoustic model learning unit; 104: acoustic model; 105: language model; 106, 805: recognition processing unit; 301, 501, 702, 901, 1101: overall average calculation unit; 302, 502, 602, 703, 902, 1102: local average calculation unit; 303, 503, 603, 704, 903, 1103: local variance calculation unit; 304, 504, 604, 705, 904, 1104: normalization processing calculation unit; 601: fixed average value storage unit; 701: speech detection unit; 801: image feature amount analysis unit; 803, 1003: object model learning unit; 804, 1004: object model; 1001: moving image feature amount analysis unit

Claims

A pattern matching device comprising:
analysis means for calculating a feature amount of audio data or image data input from the outside;
normalization means for normalizing the feature amount calculated by the analysis means; and
pattern matching means for performing pattern matching based on the normalized feature amount produced by the normalization means,
wherein the normalization means includes:
overall average acquisition means for acquiring an overall average value, which is the average value of the feature amount over all frames of the audio data or over the entire image data;
local average calculation means for calculating a local average value, which is the average value of the feature amount over a local number of frames of the audio data or over a local range of the image data;
local variance calculation means for calculating, based on the local average value, a local variance value, which is the variance value of the feature amount over the local number of frames of the audio data or over the local range of the image data; and
normalization processing calculation means for normalizing the feature amount based on the overall average value and a plurality of the local variance values.

The pattern matching device according to claim 1, wherein the overall average acquisition means calculates the overall average value from the feature amount over all frames of the audio data or over the entire image data.

The pattern matching device according to claim 1, wherein the overall average acquisition means uses a predetermined value stored in advance as the overall average value.

The pattern matching device according to claim 2, further comprising range identification means for identifying a range that contains the feature amount subject to pattern matching,
wherein the overall average acquisition means calculates the overall average value from the feature amount over all frames of the audio data or over the entire image data, based on the range identified by the range identification means.

The pattern matching device according to claim 1, wherein the local average calculation means calculates the local average value based on a weighted value of a local average value calculated in the past, and
the local variance calculation means calculates the local variance value based on a weighted value of a local variance value calculated in the past.

A pattern matching program for causing a computer to function as:
analysis means for calculating a feature amount of audio data or image data input from the outside;
normalization means for normalizing the feature amount calculated by the analysis means; and
pattern matching means for performing pattern matching based on the normalized feature amount produced by the normalization means,
wherein the normalization means includes:
overall average acquisition means for acquiring an overall average value, which is the average value of the feature amount over all frames of the audio data or over the entire image data;
local average calculation means for calculating a local average value, which is the average value of the feature amount over a local number of frames of the audio data or over a local range of the image data;
local variance calculation means for calculating, based on the local average value, a local variance value, which is the variance value of the feature amount over the local number of frames of the audio data or over the local range of the image data; and
normalization processing calculation means for normalizing the feature amount based on the overall average value and a plurality of the local variance values.

A pattern matching method comprising:
an analysis step of calculating a feature amount of audio data or image data input from the outside;
a normalization step of normalizing the feature amount calculated in the analysis step; and
a pattern matching step of performing pattern matching based on the normalized feature amount produced in the normalization step,
wherein the normalization step includes:
an overall average acquisition step of acquiring an overall average value, which is the average value of the feature amount over all frames of the audio data or over the entire image data;
a local average calculation step of calculating a local average value, which is the average value of the feature amount over a local number of frames of the audio data or over a local range of the image data;
a local variance calculation step of calculating, based on the local average value, a local variance value, which is the variance value of the feature amount over the local number of frames of the audio data or over the local range of the image data; and
a normalization processing calculation step of normalizing the feature amount based on the overall average value and a plurality of the local variance values.