JP2008309945A

JP2008309945A - Pattern matching method and device, and its feature amount normalizing method and device

Info

Publication number: JP2008309945A
Application number: JP2007156455A
Authority: JP
Inventors: Tsuneo Kato; 恒夫加藤; Shoken Nasu; 庄健奈須; Toshiki Endo; 俊樹遠藤
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2007-06-13
Filing date: 2007-06-13
Publication date: 2008-12-25

Abstract

PROBLEM TO BE SOLVED: To provide a pattern matching method and device suitable for matching a feature that contains vector quantization (VQ) distortion, as a result of vector quantization and subsequent decoding, against its probability model; and to provide a normalization method and device suitable for such a feature.

SOLUTION: A mean/variance calculation section 51 calculates a mean y and a variance x1 of the acoustic feature (feature vector) decoded by a VQ decoding section 21. The mean y is subtracted from the acoustic feature in an adder section 52. In an adder section 53, the variance x1 is added to a variance x2 of the VQ distortion stored beforehand in a VQ distortion variance storage section 27, giving a variance x that takes the VQ distortion into account. A normalizing section 54 normalizes the acoustic feature using the variance x, yielding the acoustic feature after mean and variance normalization (MVN).

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a pattern matching method and apparatus and to a feature normalization method and apparatus therefor. More particularly, it relates to a pattern matching method and apparatus suited to matching, against a probability model, features that were extracted from an input signal, vector-quantized once, and then decoded, and to a feature normalization method and apparatus suited to normalizing such features.

In pattern matching as used in information retrieval and in speech and image recognition, the similarity between an unknown input pattern and each of many known reference patterns is computed, and the reference pattern with the highest similarity is taken as the matching result. Both the input pattern and the reference patterns are normalized beforehand so that uncertain factors such as disturbances are removed.

A speech recognition apparatus that applies this pattern matching obtains a recognition result by matching the time series of acoustic features extracted from the input speech signal against an acoustic model, in which the probability density distribution in the acoustic feature space has been trained in advance for each phoneme, such as a vowel or consonant. The acoustic model, which is a probability model, outputs a score of phoneme likeness (acoustic likelihood) for each input acoustic feature. The apparatus accumulates these scores over the whole utterance under the constraints of a grammar and a word dictionary, and outputs the word sequence with the highest accumulated score as the recognition result.

The acoustic features are time-series data of multidimensional vectors. When the frequency distribution of the data belonging to each phoneme is tallied in each dimension, its shape is close to a normal distribution or to a sum of several normal distributions. To represent such distributions, the probability density distribution of the acoustic model is expressed as a multidimensional normal distribution or as a combination of several multidimensional normal distributions.

In actual matching, however, variation in microphone characteristics, differences between speakers, background noise and similar factors cause a mismatch between the distribution of the input acoustic features and the probability density distribution of the acoustic model, and this mismatch lowers the recognition rate.

Cepstral mean normalization (CMN) is widely used to remove this mismatch in pattern matching between input acoustic features and an acoustic model, and mean and variance normalization (MVN) has been proposed as a further development of CMN.

CMN subtracts the mean over the whole utterance from the acoustic feature at each time, so that the mean of the acoustic features becomes zero. This aligns the distribution of the input acoustic features with the probability density distribution of the acoustic model and reduces the mismatch. Writing x(t) for the acoustic feature in each dimension before CMN and x'(t) for the feature after CMN, the operation is given by equation (1):

$$x'(t) = x(t) - \frac{1}{T}\sum_{\tau=1}^{T} x(\tau) \tag{1}$$

where T is the number of frames in the utterance.
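The CMN operation described above can be sketched in a few lines of Python. This is only an illustration: the function name `cmn` and the sample values are hypothetical, and the utterance is assumed to be a list of equal-length feature vectors, one per frame.

```python
from typing import List


def cmn(frames: List[List[float]]) -> List[List[float]]:
    """Subtract the per-dimension utterance mean from every frame."""
    n = len(frames)
    dims = len(frames[0])
    # Mean of each dimension over the whole utterance.
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]


# Hypothetical 3-frame, 2-dimensional utterance.
utterance = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
normalized = cmn(utterance)
```

After the call, each dimension of `normalized` has zero mean over the utterance.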

MVN, in contrast, normalizes the acoustic feature at each time by the mean and variance of the whole utterance so that the features follow the reference normal distribution N (mean 0, variance 1). This reduces the mismatch, caused by microphone characteristics and the like, between the distribution of the input acoustic features and the probability density distribution of the acoustic model. Writing x(t) for the acoustic feature in each dimension before MVN and x'(t) for the feature after MVN, the operation is given by equation (2):

$$x'(t) = \frac{x(t) - \mu}{\sigma} \tag{2}$$

where μ and σ² are the mean and variance of x(t) over the whole utterance.
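A minimal sketch of batch MVN follows, assuming the population variance over all frames and division by the standard deviation so that each dimension ends up with unit variance. The name `batch_mvn` and the sample values are hypothetical.

```python
import math
from typing import List


def batch_mvn(frames: List[List[float]]) -> List[List[float]]:
    """Normalize each dimension to zero mean and unit variance over the utterance."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    # Population variance of each dimension over the whole utterance.
    variances = [sum((f[d] - means[d]) ** 2 for f in frames) / n for d in range(dims)]
    return [
        [(f[d] - means[d]) / math.sqrt(variances[d]) for d in range(dims)]
        for f in frames
    ]


utterance = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
normalized = batch_mvn(utterance)
```

Note that this sketch raises `ZeroDivisionError` if any dimension is constant over the utterance.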

However, because CMN and MVN use the mean and variance of the whole utterance, the normalized acoustic features are not available until the utterance ends. The start of the matching process is therefore delayed, which lengthens the wait between the end of the utterance and the output of the recognition result.

To reduce this delay, a technique has been proposed that computes the mean and variance from only the first part of the utterance (a few hundred milliseconds) and uses them for normalization. It reduces the recognition delay at the cost of less accurate mean and variance estimates. Below, MVN that computes the mean and variance from the whole utterance is called "batch MVN", and MVN that computes them from part of the utterance is called "segmental MVN".

With segmental MVN, however, if the section at the start of the utterance over which the mean and variance are computed is silence or a stationary sound, so that only one quantization vector occurs, the variance becomes zero and normalization divides by zero.
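The failure mode can be reproduced with a small sketch. The function `segmental_mvn`, its `head` parameter (number of leading frames used for the statistics) and the sample frames are hypothetical; the sketch simply estimates the mean and variance from the head of the utterance and applies them to every frame.

```python
import math
from typing import List


def segmental_mvn(frames: List[List[float]], head: int) -> List[List[float]]:
    """MVN using statistics estimated from the first `head` frames only."""
    seg = frames[:head]
    dims = len(frames[0])
    means = [sum(f[d] for f in seg) / len(seg) for d in range(dims)]
    variances = [sum((f[d] - means[d]) ** 2 for f in seg) / len(seg) for d in range(dims)]
    return [
        [(f[d] - means[d]) / math.sqrt(variances[d]) for d in range(dims)]
        for f in frames
    ]


# The first three decoded frames are the same code vector (e.g. silence),
# so the variance estimated from the head is exactly zero.
decoded = [[0.5, 1.0], [0.5, 1.0], [0.5, 1.0], [2.0, 3.0], [4.0, 0.0]]
try:
    segmental_mvn(decoded, head=3)
    failed = False
except ZeroDivisionError:
    failed = True
```

Here `failed` ends up `True`, which is the division by zero the text describes.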

To address this problem, Patent Document 1 proposes a technique for mean and variance normalization (MVN) that does not assume quantized features: variance normalization is skipped when the computed variance is zero or close to zero.
Patent Document 1: JP 2002-278586 A

Conventional mean and variance normalization (MVN) reduces the mismatch between the distribution of the input acoustic features and the probability density distribution of the acoustic model caused by variation in microphone characteristics, speaker differences, background noise and the like, on the premise that the acoustic features are not affected by quantization.

In a distributed (client-server) speech recognition system, however, the acoustic features to be recognized are sent over a network to a remote speech recognition unit. The features of the speech picked up by the client's microphone are vector-quantized to reduce the amount of data and then sent to the server. The server decodes the data received from the client to reconstruct the features and obtains a recognition result by pattern matching between these features and the probability model.

In vector quantization, the client and the server each hold the same finite vector quantization (VQ) codebook representing the multidimensional feature space. The client represents each input feature vector by its nearest code vector and sends the index of that code vector to the server. The server identifies the code vector from the received index using its copy of the codebook and takes that code vector as the acoustic feature, which decodes the feature. In other words, after vector quantization and decoding, the feature vector of the input speech signal is approximated by its nearest code vector.
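The encode and decode steps can be sketched as follows. The function names, the three-entry codebook and the sample feature are hypothetical, and squared Euclidean distance is assumed as the nearest-neighbor criterion.

```python
from typing import List

Codebook = List[List[float]]


def vq_encode(x: List[float], codebook: Codebook) -> int:
    """Client side: index of the code vector nearest to x (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(x, codebook[i])),
    )


def vq_decode(index: int, codebook: Codebook) -> List[float]:
    """Server side: look up the code vector for a received index."""
    return list(codebook[index])


# Hypothetical shared codebook with three 2-dimensional code vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
feature = [0.9, 1.2]

index = vq_encode(feature, codebook)   # only this integer is transmitted
decoded = vq_decode(index, codebook)   # approximation of `feature`
```

The difference between `feature` and `decoded` is the VQ distortion for this frame.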

FIG. 7 shows a vector quantization codebook for distributed speech recognition standardized by the European Telecommunications Standards Institute (ETSI). In the ETSI standard, a 14-dimensional feature vector consisting of cepstral-domain mel-frequency cepstral coefficients (MFCCs) and log power is quantized two dimensions at a time using seven VQ codebooks. As FIG. 7 shows, the space of the first- and second-order MFCCs is represented by 64 code vectors.
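Quantizing a 14-dimensional vector two dimensions at a time with seven codebooks can be sketched as below. The codebooks here are tiny hypothetical ones with two code vectors each, purely for illustration; they are not the ETSI codebooks, and the function name `split_vq_encode` is likewise an assumption.

```python
from typing import List

Codebook = List[List[float]]


def split_vq_encode(x: List[float], codebooks: List[Codebook]) -> List[int]:
    """Quantize consecutive pairs of dimensions, one codebook per pair."""
    indices = []
    for j, cb in enumerate(codebooks):
        pair = x[2 * j: 2 * j + 2]
        # Nearest code vector for this pair under squared Euclidean distance.
        best = min(
            range(len(cb)),
            key=lambda i: (pair[0] - cb[i][0]) ** 2 + (pair[1] - cb[i][1]) ** 2,
        )
        indices.append(best)
    return indices


# Seven hypothetical 2-entry codebooks, one per pair of dimensions.
codebooks = [[[0.0, 0.0], [1.0, 1.0]] for _ in range(7)]
x = [0.1, 0.2, 0.9, 0.8, 0.0, 0.1, 1.0, 0.7, 0.2, 0.2, 0.8, 0.9, 0.4, 0.3]
indices = split_vq_encode(x, codebooks)
```

One index per codebook is produced, so a frame is transmitted as seven small integers rather than 14 floating-point values.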

In this configuration, the features to be recognized contain vector quantization distortion (VQ distortion) because they pass through vector quantization and decoding, whereas the features of the training speech data for the acoustic model prepared on the server contain no VQ distortion if they have not passed through vector quantization and decoding. The variance of the acoustic features to be recognized then differs from the variance in the acoustic model, and MVN becomes less effective at improving the recognition rate.

If, as in Patent Document 1, variance normalization is skipped when the variance is zero or close to zero, the probability model is built on normalized features while the input features are not normalized. This mismatch in pattern matching lowers the speech recognition accuracy.

An object of the present invention is to solve the above problems of the prior art and to provide a pattern matching method and apparatus suited to matching features that contain VQ distortion, as a result of vector quantization and decoding, against their probability model, and a feature normalization method and apparatus suited to normalizing such features.

Consider MVN applied to vector-quantized acoustic features. The variance of the input acoustic features before vector quantization is the sum of the variance after vector quantization and the variance of the vector quantization distortion (VQ distortion). This is shown below.

Let x_m(k) be the m-th dimension of the acoustic feature in the k-th frame of the input speech, q_m(k) the m-th dimension of the selected code vector in the VQ codebook, and Δx_m(k) the difference between x_m(k) and q_m(k). Then equation (3) holds:

$$x_m(k) = q_m(k) + \Delta x_m(k) \tag{3}$$

Writing E(x) and V(x) for the mean and variance of a variable x(k) over the whole utterance, equation (4) holds:

$$V(x_m) = E\left[\left(x_m(k) - E(x_m)\right)^2\right] = E\left[\left(q_m(k) + \Delta x_m(k) - E(q_m) - E(\Delta x_m)\right)^2\right] \tag{4}$$

Assuming E(Δx_m) = 0, equation (5) holds:

$$V(x_m) = E\left[\left(q_m(k) - E(q_m)\right)^2\right] + 2E\left[\left(q_m(k) - E(q_m)\right)\Delta x_m(k)\right] + E\left[\Delta x_m(k)^2\right] \tag{5}$$

If q_m(k) − E(q_m) and Δx_m(k) are uncorrelated, equation (6) holds, and equation (7) follows from it:

$$E\left[\left(q_m(k) - E(q_m)\right)\Delta x_m(k)\right] = 0 \tag{6}$$

$$V(x_m) = V(q_m) + V(\Delta x_m) \tag{7}$$

That is, the variance of the input acoustic features before VQ is the sum of the variance of the acoustic features after VQ and the variance of the VQ distortion. To estimate the variance of the input features before VQ from the post-VQ features arriving at the speech recognition server, it therefore suffices to add the precomputed VQ distortion variance for each dimension.
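The decomposition can be checked numerically on one dimension. The sketch below uses a hypothetical scalar codebook `levels` and hypothetical sample values. It verifies the exact identity V(x) = V(q) + V(Δx) + 2·Cov(q, Δx); the relation stated above is what remains when the cross term is dropped under the assumption that q and Δx are uncorrelated. With a handful of samples the cross term is generally not exactly zero, so the simplified relation holds only approximately.

```python
from typing import List


def mean(v: List[float]) -> float:
    return sum(v) / len(v)


def var(v: List[float]) -> float:
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / len(v)


def cov(u: List[float], v: List[float]) -> float:
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


x = [0.2, 0.9, 1.1, 1.8, 2.1, 2.9]      # features before VQ (one dimension)
levels = [0.0, 1.0, 2.0, 3.0]           # hypothetical scalar codebook
q = [min(levels, key=lambda c: abs(a - c)) for a in x]   # quantized values
dx = [a - b for a, b in zip(x, q)]                        # VQ distortion

lhs = var(x)
rhs = var(q) + var(dx) + 2.0 * cov(q, dx)
cross_term = 2.0 * cov(q, dx)   # neglected under the uncorrelated assumption
```

Comparing `var(x)` with `var(q) + var(dx)` shows how much the neglected `cross_term` matters for a given data set.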

The feature normalization apparatus and the pattern matching apparatus of the present invention are based on the above finding and are characterized by the following means.

(1) The feature normalization apparatus of the present invention comprises: means for calculating the mean and variance of the decoded features; corrected-variance calculation means for calculating a corrected variance by adding the variance of the vector quantization distortion to the variance of the decoded features; and normalization means for normalizing the decoded features using the mean and the corrected variance.

(2) The pattern matching apparatus of the present invention comprises: decoding means for decoding features that were extracted from an input signal and vector-quantized; feature normalization means for normalizing the decoded features; and recognition processing means for matching the normalized features against a probability model and outputting a pattern matching result. The feature normalization means comprises: means for calculating the mean and variance of the decoded features; corrected-variance calculation means for calculating a corrected variance by adding the variance of the vector quantization distortion to the variance of the decoded features; and normalization means for normalizing the decoded features using the mean and the corrected variance.

The present invention achieves the following effects.
(1) With the feature normalization method and apparatus of the present invention, when the features of an input signal contain VQ distortion as a result of vector quantization and decoding, the features are normalized with the variance of the VQ distortion taken into account. This reduces the mismatch between the distribution of these features and the probability density distribution of a probability model built from features that have not passed through vector quantization and decoding.
(2) With the feature normalization method and apparatus of the present invention, when the features of an input signal contain VQ distortion as a result of vector quantization and decoding, the features are normalized using a corrected variance obtained as the sum of the variance of the features and the variance of the VQ distortion. Division by zero during normalization is therefore avoided even when a single code vector is selected and the variance of the features becomes zero.
(3) With the pattern matching method and apparatus of the present invention, the normalization unit normalizes features that contain VQ distortion, as a result of vector quantization and decoding, with the variance of the VQ distortion taken into account. This reduces the mismatch between the distribution of these features and the probability density distribution of a probability model built from features that have not passed through vector quantization and decoding, and so improves the accuracy of pattern matching.
(4) With the pattern matching method and apparatus of the present invention, the normalization unit normalizes features that contain VQ distortion, as a result of vector quantization and decoding, using a corrected variance obtained as the sum of the variance of the features and the variance of the VQ distortion. Division by zero during normalization is therefore avoided even when a single code vector is selected and the variance of the features becomes zero.

The best mode for carrying out the present invention is described in detail below with reference to the drawings. FIG. 1 is a block diagram showing the main part of a distributed speech recognition apparatus to which the pattern matching method of the present invention is applied. Its main components are: a terminal 1 that vector-quantizes the acoustic features of an input speech signal and transmits them; a speech recognition server 2 that receives and decodes the vector-quantized data from the terminal 1 to reconstruct the acoustic features, and returns to the terminal 1 the recognition result obtained by pattern matching between the acoustic features and an acoustic model (probability model); and a training device 3 that builds, in advance, the acoustic model the speech recognition server 2 uses for recognition.

In the terminal 1, an acoustic feature extraction unit 12 extracts acoustic features from the speech signal input through a microphone 11 and stores them temporarily. The acoustic features are a time series of feature vectors obtained by analyzing the input speech at fixed time intervals (for example, 10 ms; each interval is called a frame below). In this embodiment, a multidimensional feature vector is generated for each frame.

A vector quantization unit 13 vector-quantizes the feature vectors using a vector quantization (VQ) codebook. More specifically, it selects the code vector in the VQ codebook that is nearest to the feature vector and extracts its index. The extracted index is sent to the speech recognition server 2.

In the speech recognition server 2, a vector quantization (VQ) decoding unit 21 decodes the received index using a VQ codebook identical to that of the terminal 1, approximating the feature vector by the decoded code vector. A VQ distortion variance storage unit 27 stores the variance of the quantization error (VQ distortion) computed in advance for the acoustic features. As detailed later, a feature normalization unit 22 normalizes the acoustic features based on the VQ distortion variance stored in the VQ distortion variance storage unit 27 and the variance of the decoded acoustic features.

The variance stored in the VQ distortion variance storage unit 27 can be obtained in advance, for example, by generating a large number of acoustic features at random, matching them against the VQ codebook in the vector quantization unit 13 to find the nearest quantization vector for every feature vector, and taking the mean squared distance between the feature vectors and their quantization vectors. This mean squared distance can be computed separately for each dimension of the feature vector.
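The offline estimation described above can be sketched as follows. The function names, the four-entry codebook, the sample count and the choice of a uniform distribution on [-1, 1] for the random features are all assumptions made for illustration; the text only says that many features are generated at random.

```python
import random
from typing import List

Codebook = List[List[float]]


def nearest_code_vector(x: List[float], codebook: Codebook) -> List[float]:
    """Code vector with the smallest squared Euclidean distance to x."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, c)))


def estimate_vq_distortion_variance(
    codebook: Codebook, n_samples: int, seed: int = 0
) -> List[float]:
    """Per-dimension mean squared quantization error over random features."""
    rng = random.Random(seed)
    dims = len(codebook[0])
    sq_err = [0.0] * dims
    for _ in range(n_samples):
        # Assumption: features drawn uniformly from [-1, 1] in every dimension.
        x = [rng.uniform(-1.0, 1.0) for _ in range(dims)]
        q = nearest_code_vector(x, codebook)
        for d in range(dims):
            sq_err[d] += (x[d] - q[d]) ** 2
    return [s / n_samples for s in sq_err]


# Hypothetical 2-dimensional codebook with four code vectors.
codebook = [[-0.5, -0.5], [-0.5, 0.5], [0.5, -0.5], [0.5, 0.5]]
x2 = estimate_vq_distortion_variance(codebook, n_samples=20000)
```

For this particular codebook and input distribution the quantization error in each dimension is uniform on [-0.5, 0.5], so each entry of `x2` should come out close to 1/12. In practice the estimate would be made on features that resemble real speech.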

A recognition processing unit 23 takes in the normalized acoustic features in sequence, matches them against the acoustic model 25 under the grammatical constraints stored in a word dictionary and grammar storage unit 24, and outputs a recognition result based on the acoustic log-likelihood. A recognition result transmission unit 26 sends the recognition result to the terminal. In the terminal 1, a recognition result reception unit 14 receives the result, which is then processed as appropriate.

In the training device 3, a training speech database 32 stores a large amount of training speech data, and a transcription text storage unit 31 stores the transcription of each item of speech data. An acoustic feature extraction unit 33 extracts acoustic features from the speech data stored in the training speech database 32. These features are normalized by a normalization unit 34, associated with the transcriptions in an acoustic model training unit 35, and stored as the acoustic model 25.

FIG. 2 is a block diagram showing the main part of the feature normalization unit 22. A mean/variance calculation unit 51 calculates the mean y and the variance x1 of the acoustic features (feature vectors) decoded by the VQ decoding unit 21. The mean y is subtracted from the acoustic features in an adder 52. In an adder 53, the variance x1 is added to the VQ distortion variance x2 stored in advance in the VQ distortion variance storage unit 27, giving a variance x that takes the VQ distortion into account. A normalization unit 54 normalizes the acoustic features using this variance x and outputs the acoustic features after MVN.
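The data flow of FIG. 2 can be sketched in Python. This is an illustrative sketch rather than the patented implementation: the function name `vq_aware_mvn` and the sample values are hypothetical, the variance is the population variance over the frames passed in, and the sketch assumes that normalizing "using the variance x" means dividing by its square root, as in ordinary MVN.

```python
import math
from typing import List


def vq_aware_mvn(
    decoded: List[List[float]], vq_distortion_var: List[float]
) -> List[List[float]]:
    """MVN of decoded features with the VQ distortion variance added (x = x1 + x2)."""
    n = len(decoded)
    dims = len(decoded[0])
    y = [sum(f[d] for f in decoded) / n for d in range(dims)]                   # mean y
    x1 = [sum((f[d] - y[d]) ** 2 for f in decoded) / n for d in range(dims)]    # variance x1
    x = [x1[d] + vq_distortion_var[d] for d in range(dims)]                     # corrected variance x
    return [[(f[d] - y[d]) / math.sqrt(x[d]) for d in range(dims)] for f in decoded]


# Degenerate case: every frame decodes to the same code vector, so x1 is zero.
decoded = [[1.0, 2.0]] * 4
x2 = [0.05, 0.08]   # hypothetical precomputed VQ distortion variances
normalized = vq_aware_mvn(decoded, x2)
```

Because every entry of `x2` is positive, the divisor is never zero, so the degenerate case that breaks plain MVN yields all-zero normalized features here instead of an exception.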

FIG. 3 is a block diagram showing another configuration of the feature normalization unit 22. The same reference numerals as above denote the same or equivalent parts.

In this configuration, the VQ distortion variance storage unit 27 stores a VQ distortion variance for each code vector or for each class (set) of code vectors. A VQ distortion variance selection unit 55 retrieves from the storage unit 27 the VQ distortion variance corresponding to the code vector, or class of code vectors, of the acoustic feature being decoded, as reported by the VQ decoding unit 21. It then averages the retrieved variances over all frames of an utterance, or over every predetermined number of frames, and outputs this frame average to the adder 53.
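The selection and averaging step can be sketched as below, assuming one stored variance vector per code vector index. The table `per_code_var`, the index sequence and the function name are hypothetical.

```python
from typing import List


def averaged_vq_distortion_variance(
    indices: List[int], per_code_var: List[List[float]]
) -> List[float]:
    """Average, over the given frames, the stored variance of each frame's code vector."""
    dims = len(per_code_var[0])
    n = len(indices)
    return [sum(per_code_var[i][d] for i in indices) / n for d in range(dims)]


# Hypothetical table: one per-dimension distortion variance per code vector.
per_code_var = [[0.02, 0.04], [0.06, 0.08]]
# Code vector indices decoded for the frames of one utterance (or one block of frames).
indices = [0, 0, 1, 1]
x2 = averaged_vq_distortion_variance(indices, per_code_var)
```

The result plays the role of the VQ distortion variance added to the variance of the decoded features in the adder 53. Averaging over a block of frames instead of the whole utterance only changes which slice of `indices` is passed in.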

FIG. 4 is a block diagram showing yet another embodiment of the distributed speech recognition apparatus to which the pattern matching of the present invention is applied. The same reference numerals as above denote the same or equivalent parts.

In the embodiments above, the VQ distortion variance added in the speech recognition server 2 is obtained statistically in advance, stored in the storage unit 27, and read out from it. This embodiment differs in that the terminal 1 is provided with a VQ distortion variance calculation unit 15.

When the vector quantization unit 13 has found, for each feature vector, the nearest quantization vector in the codebook, the VQ distortion variance calculation unit 15 computes the distance between each feature vector and its quantization vector and from these distances obtains the variance of the VQ distortion. This variance is sent to the speech recognition server 2, where the feature normalization unit 22 uses it to normalize the corresponding frames. The VQ distortion variance calculation unit 15 may compute the variance over all frames for each dimension of the feature vector, or over every predetermined number of frames.
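A terminal-side calculation of this kind might look like the sketch below. The function names and sample values are hypothetical, and the sketch takes the per-dimension mean squared quantization error as the variance, which presumes the distortion has zero mean.

```python
from typing import List


def vq_distortion_variance(
    features: List[List[float]], quantized: List[List[float]]
) -> List[float]:
    """Per-dimension mean squared error between features and their code vectors."""
    n = len(features)
    dims = len(features[0])
    return [
        sum((x[d] - q[d]) ** 2 for x, q in zip(features, quantized)) / n
        for d in range(dims)
    ]


def blockwise_vq_distortion_variance(
    features: List[List[float]], quantized: List[List[float]], block: int
) -> List[List[float]]:
    """The same statistic computed separately for every `block` frames."""
    return [
        vq_distortion_variance(features[s:s + block], quantized[s:s + block])
        for s in range(0, len(features), block)
    ]


features = [[1.0, 2.0], [3.0, 4.0]]    # original feature vectors on the terminal
quantized = [[0.0, 2.0], [4.0, 2.0]]   # their nearest code vectors
whole = vq_distortion_variance(features, quantized)
per_block = blockwise_vq_distortion_variance(features, quantized, block=1)
```

Unlike the server-side table lookup, this uses the actual pre-quantization features, which only the terminal has, at the cost of sending the variance values alongside the indices.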

The embodiments above apply the pattern matching method and normalization method of the present invention to a speech recognition apparatus, but the present invention is not limited to this. As shown in FIG. 5, it can equally be applied to an image recognition apparatus that recognizes images captured by a camera 16.

In this case, the terminal 1a has an image feature extraction unit 12a in place of the acoustic feature extraction unit 12, and the image recognition server 2a has an object model storage unit 25a in place of the acoustic model storage unit 25. The object model storage unit 25a holds an object model (probability model) trained from data obtained by normalizing features extracted from the many images registered in a training image database 32a and from the teacher data stored in an object model teacher data storage unit 31a. The recognition processing unit 23 matches the data normalized by the feature normalization unit 22 against the object model to obtain a recognition result.

As shown in FIG. 6, the pattern matching method and normalization method of the present invention can also be applied to a recommendation apparatus that, based on a user's preferences entered through a preference data input unit 17, predicts categories of services, products and the like that suit that user.

In this case, the terminal 1b is provided with a user profile feature quantity extraction unit 12b in place of the acoustic feature quantity extraction unit 12, and numerically encoded preference data is input from the preference data input unit 17. The recommendation server 2b is provided with a user profile category model storage unit 25b in place of the acoustic model storage unit 25. The user profile category model storage unit 25b holds a user profile category model (probability model) trained on two inputs: data obtained by normalizing the feature quantities extracted from the many items of preference information registered in the learning user preference database 32b, and the teacher data stored in the category teacher data storage unit 31b. The recognition processing unit 23 collates the data normalized by the feature quantity normalization unit 22 with the user profile category model to obtain a recommendation result.

As described above, the pattern matching method and its normalization method of the present invention can be applied to any pattern matching method, and its normalization method, in which the similarity between an unknown input pattern and many known standard patterns is calculated and the standard pattern with the highest similarity is taken as the pattern matching result.
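The normalization shared by the speech, image, and recommendation variants can be illustrated with a minimal Python sketch. It is not taken from the specification: the function name `normalize_decoded_features`, the use of NumPy, and the choice to compute the mean and variance per feature dimension over all frames are assumptions made for illustration. The sketch also assumes that the corrected variance is strictly positive in every dimension.

```python
import numpy as np

def normalize_decoded_features(decoded, vq_distortion_var):
    """Mean and variance normalization with a VQ distortion correction.

    decoded: (frames, dims) array of feature vectors after VQ decoding.
    vq_distortion_var: (dims,) variance of the VQ distortion, estimated
        separately beforehand (hypothetical input for this sketch).
    """
    decoded = np.asarray(decoded, dtype=float)
    mean = decoded.mean(axis=0)   # average value of the decoded features
    var = decoded.var(axis=0)     # variance of the decoded features
    # Corrected variance: add the variance of the VQ distortion.
    corrected_var = var + np.asarray(vq_distortion_var, dtype=float)
    # Subtract the mean, then scale by the corrected standard deviation.
    return (decoded - mean) / np.sqrt(corrected_var)
```

Because the VQ distortion variance is added before scaling, the normalized features come out with a variance below one, which is the intended compensation for the variance lost in quantization.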

FIG. 1 is a block diagram of a first embodiment of a distributed speech recognition apparatus to which the pattern matching method of the present invention is applied.
FIG. 2 is a block diagram showing the configuration of the acoustic feature quantity normalization unit.
FIG. 3 is a block diagram showing another configuration of the acoustic feature quantity normalization unit.
FIG. 4 is a block diagram of another embodiment of a distributed speech recognition apparatus to which the pattern matching of the present invention is applied.
FIG. 5 is a block diagram of an image recognition apparatus to which the pattern matching of the present invention is applied.
FIG. 6 is a block diagram of a recommendation apparatus to which the pattern matching of the present invention is applied.
FIG. 7 is a diagram showing an example of a vector quantization codebook for distributed speech recognition.

Explanation of symbols

1 ... terminal, 2 ... speech recognition server, 3 ... learning apparatus, 11 ... microphone, 12 ... acoustic feature quantity extraction unit, 13 ... vector quantization unit, 21 ... vector quantization (VQ) decoding unit, 22 ... acoustic feature quantity normalization unit, 23 ... recognition processing unit, 24 ... word dictionary and grammar storage unit, 25 ... acoustic model storage unit, 26 ... recognition result transmission unit, 27 ... VQ distortion variance storage unit

Claims

A feature quantity normalization device that normalizes a decoded feature quantity, the feature quantity having been extracted from an input signal, once vector quantized, and then decoded, the device comprising:
means for calculating an average value and a variance of the decoded feature quantity;
post-correction variance calculation means for calculating a post-correction variance by adding a variance of vector quantization distortion to the variance of the decoded feature quantity; and
normalizing means for normalizing the decoded feature quantity using the average value and the post-correction variance.

The feature quantity normalization device according to claim 1, further comprising storage means for storing the separately calculated variance of the vector quantization distortion,
wherein the variance of the vector quantization distortion is read from the storage means.

The feature quantity normalization device according to claim 2, wherein
the feature quantity extracted from the input signal is vector quantized into a code vector based on a result of matching against a VQ codebook,
the storage means stores a variance for each code vector or for each class of code vectors,
the device further comprises selecting means for selecting, from the storage means, the variance corresponding to the code vector or its class, and
the post-correction variance calculation means adds a frame average value of the selected variances to the variance of the decoded feature quantity.

A feature quantity normalization method for normalizing a decoded feature quantity, the feature quantity having been extracted from an input signal, once vector quantized, and then decoded, the method comprising:
a step of calculating an average value and a variance of the decoded feature quantity;
a step of calculating a post-correction variance by adding a variance of vector quantization distortion to the variance of the decoded feature quantity; and
a step of normalizing the decoded feature quantity using the average value and the post-correction variance.

A pattern matching device in which a feature quantity extracted from an input signal is once vector quantized and then collated with a probability model to output a pattern matching result, the device comprising:
decoding means for decoding vector quantized data of the feature quantity extracted from the input signal;
feature quantity normalizing means for normalizing the decoded feature quantity; and
recognition processing means for collating the normalized feature quantity with the probability model and outputting the pattern matching result,
wherein the feature quantity normalizing means includes:
means for calculating an average value and a variance of the decoded feature quantity;
post-correction variance calculation means for calculating a post-correction variance by adding a variance of vector quantization distortion to the variance of the decoded feature quantity; and
normalizing means for normalizing the decoded feature quantity using the average value and the post-correction variance.

The pattern matching device according to claim 5, further comprising storage means for storing the separately calculated variance of the vector quantization distortion,
wherein the variance of the vector quantization distortion is read from the storage means.

The pattern matching device according to claim 6, wherein
the feature quantity extracted from the input signal is vector quantized into a code vector based on a result of matching against a VQ codebook,
the storage means stores a variance for each code vector or for each class of code vectors,
the device further comprises selecting means for selecting, from the storage means, the variance corresponding to the code vector or its class, and
the post-correction variance calculation means adds a frame average value of the selected variances to the variance of the decoded feature quantity.

The pattern matching device according to claim 5, further comprising means for receiving a variance of the vector quantization distortion,
wherein the post-correction variance calculation means adds the received variance to the variance of the decoded feature quantity.

The pattern matching apparatus according to claim 5, wherein the input signal is a speech signal, and the pattern matching result is a speech recognition result.

The pattern matching apparatus according to claim 5, wherein the input signal is an image signal, and the pattern matching result is an image recognition result.

The pattern matching apparatus according to claim 5, wherein the input signal is user preference information, and the pattern matching result is a recommendation result.

A pattern matching method in which a feature quantity extracted from an input signal is once vector quantized and then collated with a probability model to output a pattern matching result, the method comprising:
a step of decoding vector quantized data of the feature quantity extracted from the input signal to obtain a decoded feature quantity;
a step of normalizing the decoded feature quantity; and
a step of collating the normalized feature quantity with the probability model and outputting the pattern matching result,
wherein the step of normalizing the feature quantity includes:
a step of calculating an average value and a variance of the decoded feature quantity;
a step of calculating a post-correction variance by adding a variance of vector quantization distortion to the variance of the decoded feature quantity; and
a step of normalizing the decoded feature quantity using the average value and the post-correction variance.
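
As an informal illustration of the variant recited in the dependent claims above, in which a variance is stored for each code vector and the frame average of the selected variances is added, the corrected variance might be computed as in the following Python sketch. The function name `corrected_variance`, the single-codebook layout of `var_per_code`, and the per-dimension computation are assumptions for illustration only and are not part of the claims.

```python
import numpy as np

def corrected_variance(decoded, code_indices, var_per_code):
    """Corrected variance using per-code-vector VQ distortion variances.

    decoded: (frames, dims) decoded feature vectors.
    code_indices: (frames,) index of the code vector chosen for each frame.
    var_per_code: (codebook_size, dims) stored distortion variance for
        each code vector (hypothetical single-codebook layout).
    """
    decoded = np.asarray(decoded, dtype=float)
    var_table = np.asarray(var_per_code, dtype=float)
    # Select the stored variance that corresponds to each frame's code vector.
    selected = var_table[np.asarray(code_indices)]
    # Add the frame average of the selected variances to the feature variance.
    return decoded.var(axis=0) + selected.mean(axis=0)
```

Storing the variances per class of code vectors instead of per code vector would only change the lookup: `code_indices` would first be mapped to class indices.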