JP2010078650A - Speech recognizer and method thereof

Info

Publication number: JP2010078650A
Application number: JP2008243885A
Authority: JP
Inventors: Yusuke Shinohara; 雄介篠原; Masami Akamine; 政巳赤嶺
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2008-09-24
Filing date: 2008-09-24
Publication date: 2010-04-08
Also published as: US20100076759A1

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognizer that recognizes speech stably even under noise.

SOLUTION: The speech recognizer extracts a noisy speech feature vector from the input noisy speech for each frame; estimates noise feature distribution parameters of the noise feature vector for the noise superimposed on the noisy speech; calculates joint Gaussian distribution parameters of the clean speech feature vector and the noisy speech feature vector by applying the unscented transform to the noise feature distribution parameters and the previously stored prior distribution parameters of the clean speech feature vector; calculates posterior distribution parameters of the clean speech feature vector from the noisy speech feature vector by using the joint Gaussian distribution parameters; compares the posterior distribution parameters with previously stored standard patterns of words for each frame; and outputs a word string for the noisy speech based on the comparison result.

COPYRIGHT: (C)2010, JPO&INPIT

Description

本発明は、雑音下において発声された音声を認識する音声認識装置及びその方法に関する。 The present invention relates to a speech recognition apparatus and method for recognizing speech uttered under noise.

Degradation of speech recognition performance under noise is one of the major problems of speech recognition systems. One way to improve the noise robustness of a speech recognition system is the "speech enhancement method", which estimates clean speech from noisy speech, that is, from clean speech on which noise is superimposed. In particular, a method that estimates clean speech in the speech feature domain is called a "speech feature enhancement method" or "feature enhancement method".

この特徴強調法を実現できる音声認識装置は、以下のように動作する。 A speech recognition apparatus capable of realizing this feature enhancement method operates as follows.

まず、音声認識装置は、ノイズが重畳したノイジー音声からノイジー音声特徴ベクトルを抽出する。 First, the speech recognition apparatus extracts a noisy speech feature vector from noisy speech with noise superimposed.

次に、音声認識装置は、ノイジー音声の特徴ベクトルから、クリーン音声特徴ベクトルの推定を行う。 Next, the speech recognition apparatus estimates a clean speech feature vector from a noisy speech feature vector.

最後に、音声認識装置は、推定されたクリーン音声特徴ベクトルと、単語の標準パターンとの照合を行い、認識結果の単語列を出力する。 Finally, the speech recognition apparatus collates the estimated clean speech feature vector with a standard pattern of words, and outputs a recognition result word string.

Non-Patent Document 1 discloses a feature enhancement method that exploits the properties of a joint Gaussian distribution. This method assumes that the clean speech feature vector and the noisy speech feature vector are jointly Gaussian and that the parameters of this joint Gaussian distribution are known. It then calculates the posterior mean and posterior covariance of the clean speech feature vector given the observed noisy speech feature vector.

The problem is how to calculate the parameters of this joint Gaussian distribution. Because the degradation of speech feature vectors by noise is nonlinear, estimating the joint Gaussian distribution parameters is a nonlinear estimation problem that cannot be solved analytically.

In the prior art, as disclosed in Non-Patent Document 1, a first-order Taylor approximation first replaces this nonlinear estimation problem with a linear one, and the joint Gaussian distribution parameters are then calculated by solving the linear problem analytically.
V. Stouten, H. Van hamme, and P. Wambacq, "Model-based feature enhancement with uncertainty decoding for noise robust ASR," Speech Communication, vol. 48, pp. 1502-1514, 2006.

However, because the above prior art linearizes the nonlinear function by a first-order Taylor expansion, a large approximation error occurs. The joint Gaussian distribution parameters are therefore calculated with low accuracy, and as a result sufficient speech recognition performance cannot be obtained under noise.

そこで本発明は、上記従来技術の問題点を解決し、雑音下でも安定して音声認識が行える音声認識装置及びその方法を提供する。 Accordingly, the present invention provides a speech recognition apparatus and method for solving the above-described problems of the prior art and capable of stably performing speech recognition even under noise.

The present invention is a speech recognition apparatus comprising: a feature extraction unit that extracts a noisy speech feature vector for each frame from input noisy speech; a noise estimation unit that estimates noise feature distribution parameters of a noise feature vector for the noise superimposed on the noisy speech; a prior distribution parameter storage unit that stores prior distribution parameters of a clean speech feature vector for clean speech; a Gaussian distribution calculation unit that calculates, for each frame, joint Gaussian distribution parameters of the clean speech feature vector and the noisy speech feature vector from the noise feature distribution parameters and the prior distribution parameters by using the unscented transform; a calculation execution unit that calculates, for each frame, posterior distribution parameters of the clean speech feature vector from the noisy speech feature vector by using the joint Gaussian distribution parameters; and a collation unit that collates the posterior distribution parameters with pre-stored standard patterns of words for each frame and outputs a word string for the noisy speech based on the collation result.

本発明によれば、雑音下でも安定して音声認識を行うことができる。 According to the present invention, voice recognition can be performed stably even under noise.

以下、図面を参照して本発明の実施形態の音声認識装置１０について説明する。 Hereinafter, a speech recognition apparatus 10 according to an embodiment of the present invention will be described with reference to the drawings.

(First Embodiment)
A speech recognition apparatus 10 according to a first embodiment will be described with reference to FIGS. 1 to 3.

図１は、本実施形態に係る音声認識装置１０のブロック図である。 FIG. 1 is a block diagram of a speech recognition apparatus 10 according to the present embodiment.

図１に示すように、音声認識装置１０は、特徴抽出部１１、ノイズ推定部１２、特徴強調部１３、照合部１４を備える。 As shown in FIG. 1, the speech recognition apparatus 10 includes a feature extraction unit 11, a noise estimation unit 12, a feature enhancement unit 13, and a collation unit 14.

The speech recognition apparatus 10 can also be realized by using, for example, a general-purpose computer as basic hardware. That is, the feature extraction unit 11, the noise estimation unit 12, the feature enhancement unit 13, and the collation unit 14 can be realized by causing a processor mounted on the computer to execute a program. The speech recognition apparatus 10 may be realized by installing the program on the computer in advance, or by storing the program on a storage medium such as a CD-ROM, or distributing it over a network, and installing it on the computer as appropriate.

特徴抽出部１１について説明する。 The feature extraction unit 11 will be described.

特徴抽出部１１は、入力したノイジー音声の信号から音声の特徴を表わすベクトルを抽出する。このノイジー音声は、クリーン音声にノイズが重畳されている。 The feature extraction unit 11 extracts a vector representing the voice feature from the input noisy voice signal. In this noisy voice, noise is superimposed on the clean voice.

具体的には、特徴抽出部１１は、ノイジー音声である音声信号が入力されてくる。次に、特徴抽出部１１は、時系列に沿って切り出し窓を少しずつずらしながら音声信号から短時間フレーム（以下、単にフレームという）を切り出す。次に、特徴抽出部１１は、特徴ベクトルにフレーム毎に変換し、時系列のノイジー音声の特徴ベクトルを出力する。特徴ベクトルとして、例えばＭＦＣＣ（Mel-Frequency Cepstral Coefficients）ベクトルを用いる。以降では、ノイジー音声特徴ベクトル（以下、ノイジーベクトルという）をｙとおく。 Specifically, the feature extraction unit 11 receives an audio signal that is noisy speech. Next, the feature extraction unit 11 cuts out a short-time frame (hereinafter simply referred to as a frame) from the audio signal while shifting the cut-out window little by little along the time series. Next, the feature extraction unit 11 converts each frame into a feature vector and outputs a time-series noisy speech feature vector. As the feature vector, for example, an MFCC (Mel-Frequency Cepstral Coefficients) vector is used. Hereinafter, a noisy speech feature vector (hereinafter referred to as a noisy vector) is set as y.
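As an illustration of the frame-cutting step described above, the following minimal Python sketch slices a sampled signal into overlapping short-time frames. The function name `split_into_frames` and the parameters `frame_len` and `hop` (both in samples) are assumptions of this sketch and do not appear in the original; the conversion of each frame to an MFCC vector is not shown.

```python
import numpy as np

def split_into_frames(signal, frame_len, hop):
    """Cut a 1-D signal into overlapping short-time frames.

    frame_len and hop are given in samples. Trailing samples that do not
    fill a whole frame are dropped (a simplification of this sketch).
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Row t holds the samples [t * hop, t * hop + frame_len).
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]
```

Each row of the returned array corresponds to one frame, so a feature extractor would map the rows one by one to noisy vectors y.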

ノイズ推定部１２について説明する。 The noise estimation unit 12 will be described.

ノイズ推定部１２は、各フレームについて、ノイジーベクトルｙから、ノイズ特徴ベクトルのノイズ特徴分布パラメータ（以下、単にノイズパラメータという）を推定する。 The noise estimation unit 12 estimates a noise feature distribution parameter (hereinafter simply referred to as a noise parameter) of a noise feature vector from the noisy vector y for each frame.

Specifically, the noise parameters are the mean and covariance of the noise feature vector. For example, the mean and covariance are calculated from the set of feature vectors extracted from a noise-only interval that precedes the utterance and contains no speech; the noise is then assumed not to change during the utterance, and this mean and covariance are output for every frame of the utterance.

Alternatively, if the noise is assumed to vary during the utterance, a voice activity detector may be used, and the noise parameters may be updated from the feature vectors of each interval in which no speech is detected.

以降の説明では、ノイズ特徴ベクトルをｎとおく。また、ノイズパラメータ、すなわち、ノイズ特徴ベクトルｎの平均と共分散を、それぞれμｎとΣｎとおく。 In the following description, the noise feature vector is set to n. Also, the noise parameters, that is, the mean and covariance of the noise feature vector n are set to μn and Σn, respectively.
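The simplest estimator described above, which takes the mean and covariance of the feature vectors from a noise-only interval, can be sketched as follows. The function name `estimate_noise_params` is hypothetical, and the use of the unbiased sample covariance is an assumption, since the text does not specify the normalization.

```python
import numpy as np

def estimate_noise_params(noise_frames):
    """Return (mu_n, Sigma_n) from feature vectors of noise-only frames.

    noise_frames: array of shape (T, D), one feature vector per row.
    """
    noise_frames = np.asarray(noise_frames, dtype=float)
    mu_n = noise_frames.mean(axis=0)
    # rowvar=False: each column is one feature dimension.
    # np.cov divides by T - 1 by default (unbiased estimate).
    sigma_n = np.cov(noise_frames, rowvar=False)
    return mu_n, np.atleast_2d(sigma_n)
```

If the noise is assumed to vary, the same function could simply be called again on the feature vectors of each newly detected non-speech interval.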

特徴強調部１３について説明する。 The feature enhancement unit 13 will be described.

特徴強調部１３は、ノイジーベクトルｙと、ノイズパラメータから、クリーン音声特徴ベクトル（以下、クリーンベクトルという）の事後分布パラメータであるクリーン音声特徴事後分布パラメータ（以下、事後分布パラメータという）を算出する。 The feature enhancement unit 13 calculates a clean speech feature posterior distribution parameter (hereinafter referred to as a posterior distribution parameter) that is a posterior distribution parameter of the clean speech feature vector (hereinafter referred to as a clean vector) from the noisy vector y and the noise parameter.

事後分布パラメータとは、具体的には、ノイジーベクトルｙを観測したときの、クリーンベクトルの事後平均と事後共分散である。 The posterior distribution parameters are specifically the posterior mean and posterior covariance of the clean vector when the noisy vector y is observed.

以降の説明では、クリーンベクトルをｘとおく。また、事後分布パラメータ、すなわち、ノイジーベクトルｙを観測したときのクリーンベクトルｘの事後平均と事後共分散を、それぞれμｘ｜ｙとΣｘ｜ｙとおく。特徴強調部１３の詳細については後述する。 In the following description, the clean vector is set to x. Further, the posterior distribution parameters, that is, the posterior mean and posterior covariance of the clean vector x when the noisy vector y is observed are set as μx | y and Σx | y, respectively. Details of the feature enhancement unit 13 will be described later.

照合部１４について説明する。 The verification unit 14 will be described.

照合部１４は、クリーンベクトルｘの事後分布パラメータと、予め記憶した単語の標準パターンを前記フレーム毎に照合し、前記照合結果に基づいて前記ノイジー音声の単語列を出力する。 The collation unit 14 collates the posterior distribution parameter of the clean vector x with a pre-stored word standard pattern for each frame, and outputs the noisy speech word string based on the collation result.

特徴強調部１３で算出された事後平均μｘ｜ｙを、クリーンベクトルｘの推定値として用いて、標準的なビタビデコーディングを行う。 Standard Viterbi decoding is performed using the posterior average μx | y calculated by the feature enhancement unit 13 as an estimated value of the clean vector x.

As disclosed in Non-Patent Document 2 (L. Deng, J. Droppo, and A. Acero, "Dynamic compensation of HMM variances using the feature enhancement uncertainty computed from a parametric model of speech distortion," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 412-421, May 2005), uncertainty decoding may instead be performed using both the posterior mean μx|y and the posterior covariance Σx|y.

When the magnitude of the posterior covariance (the uncertainty) is taken into account during collation, frames with large uncertainty are treated as unreliable and have less influence on the collation, whereas frames with small uncertainty are treated as reliable and have more influence. This improves speech recognition performance.
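One common way to take the uncertainty into account, assumed here for illustration and not spelled out in the text, is to add the posterior variance of the enhanced feature to the variance of each Gaussian in the standard pattern when evaluating the likelihood of a frame. The sketch below does this for diagonal covariances; the function and argument names are hypothetical.

```python
import numpy as np

def log_likelihood_with_uncertainty(mu_post, var_post, mu_state, var_state):
    """Diagonal-Gaussian log-likelihood with the state variance inflated
    by the enhancement uncertainty:
        log N(mu_post; mu_state, var_state + var_post).
    """
    mu_post, var_post, mu_state, var_state = (
        np.asarray(a, dtype=float)
        for a in (mu_post, var_post, mu_state, var_state)
    )
    var = var_state + var_post          # larger uncertainty -> flatter Gaussian
    diff = mu_post - mu_state
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + diff ** 2 / var)
```

With `var_post` set to zero this reduces to ordinary decoding on the posterior mean; with a large `var_post`, a frame that mismatches the state mean is penalized less, which is the behavior the paragraph above describes.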

次に、特徴強調部１３の詳細について、図２のブロック図を参照しながら説明する。 Next, details of the feature enhancement unit 13 will be described with reference to the block diagram of FIG.

図２に示すように、特徴強調部１３は、事前分布パラメータ記憶部１３１、ガウス分布記憶部１３２、ガウス分布算出部１３３、算出実行部１３４とを備える。 As shown in FIG. 2, the feature enhancement unit 13 includes a prior distribution parameter storage unit 131, a Gaussian distribution storage unit 132, a Gaussian distribution calculation unit 133, and a calculation execution unit 134.

事前分布パラメータ記憶部１３１について説明する。 The prior distribution parameter storage unit 131 will be described.

事前分布パラメータ記憶部１３１は、クリーンベクトルｘのクリーン音声特徴事前分布パラメータ（以下、単に事前分布パラメータという）を記憶する。 The prior distribution parameter storage unit 131 stores clean speech feature prior distribution parameters of the clean vector x (hereinafter simply referred to as prior distribution parameters).

具体的には、クリーンベクトルｘの事前平均μｘと事前共分散Σｘを記憶する。事前分布パラメータは、静粛な環境で収録された音声コーパスを用いて、事前に算出しておく。 Specifically, the prior average μx and the prior covariance Σx of the clean vector x are stored. The prior distribution parameter is calculated in advance using an audio corpus recorded in a quiet environment.

より具体的には、クリーン音声のコーパスから抽出された特徴ベクトルの集合から、平均と共分散を算出しておく。話者、又は、発話内容が事前に分かっている場合には、該話者、又は、該発話内容に特化したコーパスを用いることができる。 More specifically, the mean and covariance are calculated from a set of feature vectors extracted from the clean speech corpus. When the speaker or the utterance content is known in advance, the speaker or a corpus specialized for the utterance content can be used.

また、話者、又は、発話内容が事前に特定されない場合には、さまざまな話者、さまざまな発話内容を含んだコーパスを用いることが好ましい。 Moreover, when a speaker or utterance content is not specified in advance, it is preferable to use various speakers and a corpus including various utterance contents.

ガウス分布記憶部１３２について説明する。 The Gaussian distribution storage unit 132 will be described.

The Gaussian distribution storage unit 132 stores the joint Gaussian distribution parameters (hereinafter simply referred to as Gaussian parameters) of the clean vector x and the noisy vector y. That is, it stores the Gaussian parameters output from the Gaussian distribution calculation unit 133.

ガウスパラメータとは、クリーンベクトルｘの事前平均μｘと事前共分散Σｘ、ノイジーベクトルｙの平均μｙと共分散Σｙ、及び、クリーンベクトルｘとノイジーベクトルｙのクロス共分散Σｘｙである。 The Gaussian parameters are the prior mean μx and prior covariance Σx of the clean vector x, the mean μy and covariance Σy of the noisy vector y, and the cross covariance Σxy of the clean vector x and the noisy vector y.

Using these parameters, the joint Gaussian distribution of the clean vector x and the noisy vector y is expressed as in Equation (1), where N(μ, Σ) denotes a Gaussian distribution with mean μ and covariance Σ.
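The equation itself is not reproduced in this text. From the parameter definitions above, Equation (1) presumably has the following standard form (a reconstruction, not the original typesetting):

```latex
\begin{bmatrix} x \\ y \end{bmatrix}
\sim
N\!\left(
\begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix},
\begin{bmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{xy}^{\top} & \Sigma_y \end{bmatrix}
\right)
\tag{1}
```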

ガウス分布算出部１３３について説明する。 The Gaussian distribution calculation unit 133 will be described.

ガウス分布算出部１３３は、ノイズパラメータと事前分布パラメータとから、アンセンテッド変換を用いて、ガウスパラメータを算出して、ガウス分布記憶部１３２に出力する。 The Gaussian distribution calculation unit 133 calculates a Gaussian parameter from the noise parameter and the prior distribution parameter by using unscented transformation, and outputs it to the Gaussian distribution storage unit 132.

ここで、ガウスパラメータの算出にあたって、クリーンベクトルｘ、ノイズ特徴ベクトルｎ、ノイジーベクトルｙを関連付けるための非線形関数ｙ＝ｆ（ｘ，ｎ）が既知である必要がある。例えば、特徴ベクトルとしてＭＦＣＣベクトルを用いる場合には、この非線形関数は式（２）のように表される。但し、行列Ｃは離散コサイン変換を表わし、その逆行列は逆離散コサイン変換を表わす。また、ｌｏｇ及びｅｘｐは、ベクトルの各要素に作用するものとする。
Here, in calculating the Gaussian parameter, the nonlinear function y = f (x, n) for associating the clean vector x, the noise feature vector n, and the noisy vector y needs to be known. For example, when an MFCC vector is used as the feature vector, this nonlinear function is expressed as in Equation (2). However, the matrix C represents a discrete cosine transform, and its inverse matrix represents an inverse discrete cosine transform. Further, log and exp are assumed to act on each element of the vector.
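Equation (2) is not reproduced in this text. A form consistent with the description above (C as the discrete cosine transform, with log and exp applied element-wise) is the widely used MFCC mismatch function below; it is given as a reconstruction under that assumption:

```latex
y = f(x, n) = C \,\log\!\bigl( \exp(C^{-1} x) + \exp(C^{-1} n) \bigr)
\tag{2}
```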

非特許文献１に開示される従来技術では、１次テイラー近似を用いてガウスパラメータを算出した。それに対して、本実施形態では、アンセンテッド変換を用いてガウスパラメータを算出する。 In the prior art disclosed in Non-Patent Document 1, Gaussian parameters are calculated using first-order Taylor approximation. On the other hand, in this embodiment, a Gaussian parameter is calculated using unscented transformation.

以下、まず従来技術の詳細について説明し、その問題点を指摘する。その後、本実施形態の方法についてその詳細を説明する。 In the following, the details of the prior art will be described first, and the problems will be pointed out. Then, the detail is demonstrated about the method of this embodiment.

従来技術である１次テイラー近似を用いたガウスパラメータ算出方法について説明する。 A conventional Gaussian parameter calculation method using first-order Taylor approximation will be described.

まず、式（２）の非線形関数を、下記の式（３）に示すように１次テイラー展開で近似する。
First, the nonlinear function of Expression (2) is approximated by first-order Taylor expansion as shown in Expression (3) below.

但し、行列ＦとＧは、下記の式（４）に示すように、非線形関数ｆをそれぞれクリーンベクトルｘとノイズ特徴ベクトルｎで偏微分したものである。
However, the matrices F and G are obtained by partial differentiation of the nonlinear function f by the clean vector x and the noise feature vector n, respectively, as shown in the following equation (4).

また、テイラー展開の展開点（ｘ０，ｎ０）は、下記の式（５）に示すように、クリーンベクトルｘの事前平均μｘ、及び、ノイズ特徴ベクトルｎの平均μｎに、それぞれ設定される。
Further, the expansion point (x0, n0) of Taylor expansion is set to the prior average μx of the clean vector x and the average μn of the noise feature vector n, respectively, as shown in the following equation (5).

Once the nonlinear function has been linearized by the first-order Taylor approximation, the Gaussian parameters can be calculated by linear operations. That is, the mean μy and covariance Σy of the noisy vector y and the cross covariance Σxy of the clean vector x and the noisy vector y are calculated by Equations (6), (7), and (8), respectively.
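Equations (3) to (8) are not reproduced in this text. Under the additional assumption that x and n are statistically independent, the first-order Taylor approach described above leads to the following standard expressions, given here as a reconstruction:

```latex
\begin{aligned}
y &\approx f(x_0, n_0) + F\,(x - x_0) + G\,(n - n_0) && (3)\\
F &= \left.\frac{\partial f}{\partial x}\right|_{(x_0,\, n_0)}, \qquad
G = \left.\frac{\partial f}{\partial n}\right|_{(x_0,\, n_0)} && (4)\\
x_0 &= \mu_x, \qquad n_0 = \mu_n && (5)\\
\mu_y &= f(\mu_x, \mu_n) && (6)\\
\Sigma_y &= F\,\Sigma_x F^{\top} + G\,\Sigma_n G^{\top} && (7)\\
\Sigma_{xy} &= \Sigma_x F^{\top} && (8)
\end{aligned}
```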

However, this prior-art method has the problem that the approximation error introduced by the first-order Taylor approximation of the nonlinear function makes the calculation error of the Gaussian parameters large.

次に、本実施形態に係るアンセンテッド変換を用いたガウスパラメータ算出方法について説明する。 Next, a Gaussian parameter calculation method using unscented transformation according to the present embodiment will be described.

The "unscented transform" is a method for calculating desired statistics with high accuracy in a nonlinear system. Details are disclosed, for example, in Non-Patent Document 3 (S. Julier and J. Uhlmann, "Unscented filtering and nonlinear estimation," Proceedings of the IEEE, vol. 92, no. 3, pp. 401-422, March 2004).

このアンセンテッド変換について説明する。 This unscented conversion will be described.

第１の確率変数ｘがあり、その平均μｘ及び共分散Σｘは既知とする。 There is a first random variable x, and its mean μx and covariance Σx are known.

第２の確率変数ｎがあり、その平均μｎ及び共分散Σｎは既知とする。 There is a second random variable n, and its mean μn and covariance Σn are known.

第３の確率変数ｙがあり、第３の確率変数ｙは、第１の確率変数ｘと第２の確率変数ｎとから、既知の非線形関数ｙ＝ｆ（ｘ，ｎ）によって算出されるものとする。 There is a third random variable y, and the third random variable y is calculated from the first random variable x and the second random variable n by a known nonlinear function y = f (x, n). And

このとき、第３の確率変数ｙの平均μｙと共分散Σｙ、及び、第１の確率変数ｘと第３の確率変数ｙの間のクロス共分散Σｘｙを算出する問題を考える。この問題を高精度に解決する方法として、上記のアンセンテッド変換が知られている。 At this time, consider the problem of calculating the mean μy and covariance Σy of the third random variable y and the cross covariance Σxy between the first random variable x and the third random variable y. As a method for solving this problem with high accuracy, the above-mentioned unscented transformation is known.

ガウス分布算出部１３３では、このアンセンテッド変換を用いて、ガウスパラメータの算出を行う。 The Gaussian distribution calculation unit 133 calculates Gaussian parameters using this unscented transformation.

まず、下記の式（９）に示すように、クリーンベクトルｘとノイズ特徴ベクトルｎを連結したベクトルａを考える。
First, as shown in the following equation (9), a vector a obtained by connecting a clean vector x and a noise feature vector n is considered.

クリーンベクトルｘの次元がＮｘ、ノイズ特徴ベクトルｎの次元がＮｎであるとき、ベクトルａの次元はＮａ＝Ｎｘ＋Ｎｎとなる。このベクトルａの平均μａと共分散Σａは、それぞれ下記の式（１０）及び式（１１）のように表わされる。
When the dimension of the clean vector x is Nx and the dimension of the noise feature vector n is Nn, the dimension of the vector a is Na = Nx + Nn. The average μa and the covariance Σa of the vector a are expressed by the following equations (10) and (11), respectively.
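Equations (9) to (11) are not reproduced in this text. From the description above, and assuming that x and n are statistically independent so that the covariance is block diagonal, they presumably read as follows (a reconstruction):

```latex
\begin{aligned}
a &= \begin{bmatrix} x \\ n \end{bmatrix} && (9)\\
\mu_a &= \begin{bmatrix} \mu_x \\ \mu_n \end{bmatrix} && (10)\\
\Sigma_a &= \begin{bmatrix} \Sigma_x & 0 \\ 0 & \Sigma_n \end{bmatrix} && (11)
\end{aligned}
```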

次に、「シグマポイント」と呼ばれるサンプルの集合を生成する。すなわち、ｐ個のＮａ次元ベクトルａｉと、その各々に関連付けられた重みｗｉを生成する。シグマポイントの生成法として、さまざまな方法が知られており、例えば非特許文献３にそれらは開示されている。ここでは、「シンメトリックシグマポイント生成法」について説明する。なお、他の任意のシグマポイント生成法を用いてよい。 Next, a set of samples called “sigma points” is generated. That is, p Na-dimensional vectors ai and weights wi associated with the Na-dimensional vectors ai are generated. Various methods are known as methods for generating sigma points. For example, Non-Patent Document 3 discloses them. Here, the “symmetric sigma point generation method” will be described. Any other sigma point generation method may be used.

シンメトリックシグマポイント生成法では、ｐ＝２Ｎａ個のベクトルａｉと、それに関連付けられた重みｗｉを、下記の式（１２）のように生成する。
In the symmetric sigma point generation method, p = 2Na vectors ai and weights wi associated therewith are generated as in the following Expression (12).
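Equation (12) is not reproduced in this text. The symmetric sigma point set as given in the unscented transform literature, with 2Na equally weighted points, is shown below as a reconstruction:

```latex
\begin{aligned}
a_i &= \mu_a + \bigl(\sqrt{N_a \Sigma_a}\bigr)_i, & w_i &= \frac{1}{2 N_a}, \\
a_{i+N_a} &= \mu_a - \bigl(\sqrt{N_a \Sigma_a}\bigr)_i, & w_{i+N_a} &= \frac{1}{2 N_a},
\qquad i = 1, \dots, N_a
\end{aligned}
\tag{12}
```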

Here, the term (√(NaΣa))i in Equation (12) denotes the i-th column (or row) of the matrix square root of NaΣa.

次に、ガウス分布算出部１３３は、ｐ個のシグマポイントａｉのそれぞれについて、非線形関数ｙ＝ｆ（ｘ，ｎ）を用いて、ｙｉを算出する。例えば、特徴ベクトルがＭＦＣＣである場合には、非線形関数ｙ＝ｆ（ｘ，ｎ）は式（２）で表わされる。また、ｉ番目のサンプルａｉのｘに対応する部分を取り出したベクトルをｘｉとする。 Next, the Gaussian distribution calculation unit 133 calculates yi for each of the p sigma points ai using the nonlinear function y = f (x, n). For example, when the feature vector is MFCC, the nonlinear function y = f (x, n) is expressed by Expression (2). Also, let xi be a vector obtained by extracting a portion corresponding to x of the i-th sample ai.

The desired Gaussian parameters are calculated from the xi and yi (i = 1, ..., p) generated as described above. That is, the Gaussian distribution calculation unit 133 calculates the mean μy and covariance Σy of the noisy vector y and the cross covariance Σxy of the clean vector x and the noisy vector y according to Equations (14) to (16).
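Equations (14) to (16) are not reproduced in this text. The weighted sample statistics used by the unscented transform, written with the symbols defined above, are given below as a reconstruction:

```latex
\begin{aligned}
\mu_y &= \sum_{i=1}^{p} w_i \, y_i && (14)\\
\Sigma_y &= \sum_{i=1}^{p} w_i \,(y_i - \mu_y)(y_i - \mu_y)^{\top} && (15)\\
\Sigma_{xy} &= \sum_{i=1}^{p} w_i \,(x_i - \mu_x)(y_i - \mu_y)^{\top} && (16)
\end{aligned}
```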

ガウス分布算出部１３３は、以上に説明したように、アンセンテッド変換を用いて、事前分布パラメータとノイズパラメータとから、ガウスパラメータを算出する。従来技術では非線形関数ｙ＝ｆ（ｘ，ｎ）を１次テイラー展開で近似したため算出誤差が大きかったが、アンセンテッド変換を用いることで算出誤差を小さく抑えることができる。 As described above, the Gaussian distribution calculation unit 133 calculates a Gaussian parameter from the prior distribution parameter and the noise parameter using the unscented transformation. In the conventional technique, the nonlinear function y = f (x, n) is approximated by the first-order Taylor expansion, so that the calculation error is large. However, the calculation error can be suppressed small by using unscented transformation.
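The procedure described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, x and n are assumed independent, the Cholesky factor is used as the matrix square root, and `log_add` is a simplified stand-in for the nonlinear function in which C is taken to be the identity (log-spectral features).

```python
import numpy as np

def unscented_joint_params(f, mu_x, sigma_x, mu_n, sigma_n):
    """Estimate (mu_y, Sigma_y, Sigma_xy) for y = f(x, n) with the
    symmetric sigma-point set (2 * Na points, equal weights)."""
    mu_x = np.asarray(mu_x, dtype=float)
    mu_n = np.asarray(mu_n, dtype=float)
    nx, nn = mu_x.size, mu_n.size
    na = nx + nn

    # Augmented vector a = [x; n]; x and n are assumed independent.
    mu_a = np.concatenate([mu_x, mu_n])
    sigma_a = np.zeros((na, na))
    sigma_a[:nx, :nx] = sigma_x
    sigma_a[nx:, nx:] = sigma_n

    # Columns of the Cholesky factor of Na * Sigma_a act as the square root.
    root = np.linalg.cholesky(na * sigma_a)
    points = np.concatenate([mu_a + root.T, mu_a - root.T])  # shape (2*Na, Na)
    w = 1.0 / (2 * na)

    # Propagate every sigma point through the nonlinear function.
    xs = points[:, :nx]
    ys = np.array([f(p[:nx], p[nx:]) for p in points])

    mu_y = w * ys.sum(axis=0)
    dy = ys - mu_y
    dx = xs - mu_x
    sigma_y = w * dy.T @ dy
    sigma_xy = w * dx.T @ dy
    return mu_y, sigma_y, sigma_xy

def log_add(x, n):
    """Simplified mismatch function: log(exp(x) + exp(n)), element-wise."""
    return np.logaddexp(x, n)
```

Calling `unscented_joint_params(log_add, mu_x, sigma_x, mu_n, sigma_n)` returns the three quantities that, together with the prior mean and covariance, make up the Gaussian parameters. For a linear function such as `lambda x, n: x + n` the result is exact, which is a convenient sanity check.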

算出実行部１３４について説明する。 The calculation execution unit 134 will be described.

算出実行部１３４は、ガウス分布記憶部１３２に記憶されたガウスパラメータに基づき、ノイジーベクトルｙから、事後分布パラメータを算出する。事後分布パラメータとは、上記したように、事後平均μｘ｜ｙと事後共分散Σｘ｜ｙである。 The calculation execution unit 134 calculates a posterior distribution parameter from the noisy vector y based on the Gaussian parameter stored in the Gaussian distribution storage unit 132. As described above, the posterior distribution parameters are the posterior mean μx | y and the posterior covariance Σx | y.

When two random variables x and y follow a joint Gaussian distribution as in Equation (1), Equation (17) is known to give the posterior mean and posterior covariance of the first random variable, the clean vector x, given an observation of the third random variable, the noisy vector y. The calculation execution unit 134 calculates the posterior distribution parameters using Equation (17).
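Equation (17) is not reproduced in this text. The standard conditioning identity for a joint Gaussian distribution, written with the symbols defined above, is given below as a reconstruction:

```latex
\begin{aligned}
\mu_{x|y} &= \mu_x + \Sigma_{xy}\,\Sigma_y^{-1}\,(y - \mu_y), \\
\Sigma_{x|y} &= \Sigma_x - \Sigma_{xy}\,\Sigma_y^{-1}\,\Sigma_{xy}^{\top}
\end{aligned}
\tag{17}
```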

次に、本実施形態に係る音声認識装置１０の動作について図３を参照しながら説明する。 Next, the operation of the speech recognition apparatus 10 according to the present embodiment will be described with reference to FIG.

まず、ステップＳ３１において、特徴抽出部１１は、ノイジー音声の一つのフレームからノイジーベクトルｙを算出する。 First, in step S31, the feature extraction unit 11 calculates a noisy vector y from one frame of noisy speech.

次に、ステップＳ３２において、ノイズ推定部１２は、ノイジーベクトルｙから、ノイズ特徴ベクトルｎのノイズパラメータを推定する。 Next, in step S32, the noise estimation unit 12 estimates the noise parameter of the noise feature vector n from the noisy vector y.

次に、ステップＳ３３において、ガウス分布算出部１３３は、アンセンテッド変換を用いてガウスパラメータを算出して、ガウス分布記憶部１３２は、そのガウスパラメータを記憶する。 Next, in step S33, the Gaussian distribution calculation unit 133 calculates a Gaussian parameter using unscented transformation, and the Gaussian distribution storage unit 132 stores the Gaussian parameter.

Next, in step S34, the calculation execution unit 134 calculates the posterior distribution parameters based on the Gaussian parameters stored in the Gaussian distribution storage unit 132.

次に、ステップＳ３５において、照合部１４は、クリーンベクトルｘの事後分布パラメータと、予め記憶した単語の標準パターンを照合する。 Next, in step S35, the collation unit 14 collates the posterior distribution parameter of the clean vector x with a standard pattern of words stored in advance.

次に、ステップＳ３６において、音声認識装置１０は、全てのフレームの処理が完了したかどうかを判定する。まだ処理していないフレームが残っている場合には、ステップＳ３１に戻って、次のフレームの処理を行う。全てのフレームの処理が完了した場合には、ステップＳ３７へと進む。 Next, in step S36, the speech recognition apparatus 10 determines whether or not processing of all frames has been completed. If there are still frames that have not been processed, the process returns to step S31 to process the next frame. If all the frames have been processed, the process proceeds to step S37.

最後に、ステップＳ３７において、照合部１４は、前記照合結果に基づいて前記ノイジー音声の単語列を出力する。 Finally, in step S37, the collation unit 14 outputs the noisy speech word string based on the collation result.

このように本実施形態によれば、アンセンテッド変換を用いることで、ガウスパラメータを精度良く算出できるため、特徴強調効果を高め、雑音下においても高い音声認識性能を保つことができる。 As described above, according to the present embodiment, since the Gaussian parameter can be calculated with high accuracy by using the unscented conversion, the feature enhancement effect can be enhanced and high speech recognition performance can be maintained even under noise.
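As a hedged illustration of the enhancement step S34, the sketch below computes the posterior mean and covariance of x from an observed y using the standard conditioning identity for a joint Gaussian distribution, which is what the text refers to as Equation (17). The function name `posterior_from_joint` and its argument names are hypothetical.

```python
import numpy as np

def posterior_from_joint(y, mu_x, sigma_x, mu_y, sigma_y, sigma_xy):
    """Posterior mean and covariance of x given y under a joint Gaussian."""
    y, mu_x, mu_y = (np.asarray(v, dtype=float) for v in (y, mu_x, mu_y))
    sigma_x, sigma_y, sigma_xy = (
        np.atleast_2d(np.asarray(m, dtype=float))
        for m in (sigma_x, sigma_y, sigma_xy)
    )
    # gain = Sigma_xy Sigma_y^{-1}, obtained with a linear solve rather
    # than an explicit inverse (valid because Sigma_y is symmetric).
    gain = np.linalg.solve(sigma_y, sigma_xy.T).T
    mu_post = mu_x + gain @ (y - mu_y)
    sigma_post = sigma_x - gain @ sigma_xy.T
    return mu_post, sigma_post
```

In a per-frame loop, `mu_post` would serve as the estimate of the clean vector passed to the collation step, and `sigma_post` as its uncertainty.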

(Second Embodiment)
Next, the speech recognition apparatus 10 according to a second embodiment will be described with reference to FIGS. 4 and 5.

第１の実施形態の音声認識装置１０では、クリーンベクトルｘの事前分布を単一のガウス分布で表現するため、事前分布を十分精緻に表現できない場合がある。 In the speech recognition apparatus 10 according to the first embodiment, since the prior distribution of the clean vector x is expressed by a single Gaussian distribution, the prior distribution may not be expressed sufficiently precisely.

そこで本実施形態の音声認識装置１０では、クリーンベクトルｘの事前分布をガウス混合モデルで表現することにより、事前分布がより精緻に表現されるため、特徴強調がより有効に働き、雑音下での音声認識性能が向上する。 Thus, in the speech recognition apparatus 10 of the present embodiment, the prior distribution of the clean vector x is expressed by a Gaussian mixture model, so that the prior distribution is expressed more precisely, so that feature enhancement works more effectively, and under noise Speech recognition performance is improved.

まず最初に、クリーンベクトルｘの事前分布を表現するガウス混合モデル、及び、その学習法について説明する。まず、本実施形態では、Ｍ個（Ｍ＞１）の特徴強調部１３を有している。そして、クリーンベクトルｘの事前分布ｐ（ｘ）は、ガウス混合モデルを用いて下記の式（１８）のように表わされる。
First, the Gaussian mixture model expressing the prior distribution of the clean vector x and its learning method will be described. First, in the present embodiment, M (M> 1) feature enhancement units 13 are provided. The prior distribution p (x) of the clean vector x is expressed by the following equation (18) using a Gaussian mixture model.
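Equation (18) is not reproduced in this text. A Gaussian mixture prior with the mixture weights, means, and covariances named in the surrounding description has the following standard form, shown here as a reconstruction:

```latex
p(x) = \sum_{k=1}^{M} \pi_k \, N\!\bigl(x;\ \mu_x^{(k)},\ \Sigma_x^{(k)}\bigr)
\tag{18}
```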

ここで、Ｍ（但し、Ｍ＞１である）は混合数、ｋ（但し、１＜＝ｋ＜＝Ｍである）は特徴強調部１３の番号、πｋ、μｘ（ｋ）、Σｘ（ｋ）はそれぞれ第ｋ番目の特徴強調部１３−ｋのガウス分布の混合重み、平均、共分散を表わす。 Here, M (where M> 1) is the number of mixtures, k (where 1 <= k <= M) is the number of the feature enhancement unit 13, πk, μx (k), Σx (k) Represents the mixing weight, average, and covariance of the Gaussian distribution of the kth feature enhancement unit 13-k.

第１の実施形態では、単一のガウス分布で事前分布を表現したが、本実施形態では複数のガウス分布の混合を用いるため、事前分布をより精緻に表現することができる。 In the first embodiment, the prior distribution is expressed by a single Gaussian distribution. However, in this embodiment, since the mixture of a plurality of Gaussian distributions is used, the prior distribution can be expressed more precisely.

クリーンベクトルｘの事前分布を表わすためのガウス混合モデルパラメータは、クリーン音声のコーパスから予め学習して記憶しておく。具体的には、クリーン音声のコーパスから抽出された特徴ベクトルの集合を学習データとして、ＥＭアルゴリズムを用いることによって、上記の式（１８）のガウス混合モデルパラメータを算出する。そして、各特徴強調部１３は、例えば、各音素に対応するように生成され、特徴強調部１３毎に、音素に対応するガウスパラメータを算出する。 Gaussian mixture model parameters for representing the prior distribution of the clean vector x are previously learned from a corpus of clean speech and stored. Specifically, the Gaussian mixture model parameter of the above equation (18) is calculated by using an EM algorithm using a set of feature vectors extracted from a corpus of clean speech as learning data. Each feature enhancement unit 13 is generated so as to correspond to each phoneme, and calculates a Gaussian parameter corresponding to the phoneme for each feature enhancement unit 13.

次に、本実施形態の音声認識装置１０の構成について、図４を参照しながら説明する。図４は、音声認識装置１０を示すブロック図である。 Next, the configuration of the speech recognition apparatus 10 of the present embodiment will be described with reference to FIG. FIG. 4 is a block diagram showing the voice recognition device 10.

As shown in FIG. 4, the speech recognition apparatus 10 includes a feature extraction unit 11, a noise estimation unit 12, M feature enhancement units 13-1, ..., 13-M, a weight calculation unit 41, an integration unit 42, and a collation unit 14. The feature extraction unit 11, the noise estimation unit 12, and the collation unit 14 are the same as in the first embodiment, so they are given the same reference numerals and their description is omitted here.

Each of the M feature enhancement units 13-1, ..., 13-M is identical to the feature enhancement unit 13 of the first embodiment; the difference from the first embodiment is that there are several of them. Each of the M feature enhancement units 13-1, ..., 13-M has its own distinct parameters.

すなわち、第ｋ番目の特徴強調部１３−ｋが備える事前分布パラメータ記憶部１３１−ｋは、上記したガウス混合モデルの第ｋ番目のガウス混合モデルパラメータμｘ（ｋ）とΣｘ（ｋ）を記憶する。 That is, the prior distribution parameter storage unit 131-k included in the kth feature enhancement unit 13-k stores the kth Gaussian mixture model parameters μx (k) and Σx (k) of the Gaussian mixture model described above. .

また、ガウス分布算出部１３３−ｋは、ノイズパラメータ（μｎとΣｎ）と、事前分布パラメータ（μｘ（ｋ）とΣｘ（ｋ））から、ガウスパラメータ（μｙ（ｋ）、Σｙ（ｋ）、Σｘｙ（ｋ））を算出し、ガウス分布記憶部１３２−ｋに記憶させる。 In addition, the Gaussian distribution calculating unit 133-k calculates the Gaussian parameters (μy (k), Σy (k), Σxy) from the noise parameters (μn and Σn) and the prior distribution parameters (μx (k) and Σx (k)). (K)) is calculated and stored in the Gaussian distribution storage unit 132-k.
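As an illustration of what the Gaussian distribution calculation unit computes, the following sketch applies an unscented transform to scalar features. Several points are assumptions rather than statements of this embodiment: speech and noise are treated as independent, the mismatch function is taken to be y = log(exp(x) + exp(n)) (additive noise in a log-spectral domain), and the names `log_add`, `unscented_joint`, and the parameter `kappa` are hypothetical.

```python
import math

def log_add(x, n):
    """Assumed mismatch function y = log(exp(x) + exp(n)), computed stably."""
    return max(x, n) + math.log1p(math.exp(-abs(x - n)))

def unscented_joint(mu_x, var_x, mu_n, var_n, f=log_add, kappa=1.0):
    """Return (mu_y, var_y, cov_xy) for scalar x and n assumed independent."""
    dim = 2  # augmented state (x, n)
    scale = dim + kappa
    dx = math.sqrt(scale * var_x)
    dn = math.sqrt(scale * var_n)
    # Sigma points: the mean plus symmetric offsets along each axis.
    points = [(mu_x, mu_n),
              (mu_x + dx, mu_n), (mu_x - dx, mu_n),
              (mu_x, mu_n + dn), (mu_x, mu_n - dn)]
    weights = [kappa / scale] + [1.0 / (2.0 * scale)] * 4
    # Push every sigma point through the nonlinear function.
    ys = [f(x, n) for x, n in points]
    mu_y = sum(w * y for w, y in zip(weights, ys))
    var_y = sum(w * (y - mu_y) ** 2 for w, y in zip(weights, ys))
    cov_xy = sum(w * (x - mu_x) * (y - mu_y)
                 for w, (x, _), y in zip(weights, points, ys))
    return mu_y, var_y, cov_xy
```

For a linear function such as f(x, n) = x + n the transform is exact, which gives a convenient sanity check: the result is μy = μx + μn, Σy = Σx + Σn, and Σxy = Σx.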

Based on the Gaussian parameters stored in the Gaussian distribution storage unit 132-k, the calculation execution unit 134-k calculates the k-th posterior distribution parameters, that is, the posterior mean μx|y(k) and the posterior covariance Σx|y(k).
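The equations used by the calculation execution unit are not reproduced in this part of the text. If x and y are treated as jointly Gaussian with the parameters above, the textbook conditioning formulas would apply; they are shown here as an assumption about the form, not as a quotation.

```latex
\mu_{x|y}^{(k)} = \mu_x^{(k)} + \Sigma_{xy}^{(k)} \left(\Sigma_y^{(k)}\right)^{-1} \left(y - \mu_y^{(k)}\right),
\qquad
\Sigma_{x|y}^{(k)} = \Sigma_x^{(k)} - \Sigma_{xy}^{(k)} \left(\Sigma_y^{(k)}\right)^{-1} \left(\Sigma_{xy}^{(k)}\right)^{\top}
```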

重み算出部４１について説明する。 The weight calculation unit 41 will be described.

The weight calculation unit 41 calculates the integration weights used when integrating the outputs of the M feature enhancement units 13-1, ..., 13-M. That is, based on the Gaussian parameters calculated by each Gaussian distribution calculation unit 133-k, it calculates an integration weight for each posterior distribution parameter for every frame.

具体的には、ノイジーベクトルｙを観測したときに、該フレームが特徴強調部１３−ｋに属する事後確率ｐ（ｋ｜ｙ）を統合重みとして用いる。事後確率ｐ（ｋ｜ｙ）は、下記の式（１９）によって算出する。
Specifically, when the noisy vector y is observed, the posterior probability p (k | y) that the frame belongs to the feature enhancement unit 13-k is used as the integrated weight. The posterior probability p (k | y) is calculated by the following equation (19).

πｋは上記したガウス混合モデルの混合重みである。μｙ（ｋ）及びΣｙ（ｋ）は第ｋ番目の特徴強調部１３−ｋのガウス分布記憶部１３２−ｋの値を参照する。 πk is a mixing weight of the Gaussian mixture model described above. μy (k) and Σy (k) refer to the value of the Gaussian distribution storage unit 132-k of the k-th feature enhancement unit 13-k.
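Equation (19) is not reproduced in this text. A posterior of this kind is normally obtained by Bayes' rule, p(k|y) proportional to πk N(y; μy(k), Σy(k)), normalized over k. The sketch below follows that reading under the assumption of diagonal covariances; the function names are hypothetical.

```python
import math

def log_gauss_diag(y, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (yi - m) ** 2 / v)
               for yi, m, v in zip(y, mean, var))

def integration_weights(y, mix_weights, means_y, vars_y):
    """Return p(k|y) for every component k, normalized to sum to one."""
    log_joint = [math.log(w) + log_gauss_diag(y, m, v)
                 for w, m, v in zip(mix_weights, means_y, vars_y)]
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(log_joint)
    unnorm = [math.exp(value - peak) for value in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

Working in the log domain matters here because the densities of high-dimensional feature vectors easily underflow when multiplied directly.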

統合部４２について説明する。 The integration unit 42 will be described.

統合部４２は、Ｍ個の特徴強調部１３−１，・・・１３−Ｍからの出力を統合する。 The integrating unit 42 integrates outputs from the M feature emphasizing units 13-1 to 13 -M.

具体的には、Ｍ個の特徴強調部１３−１，・・・１３−Ｍからの出力μｘ｜ｙ（ｋ）とΣｘ｜ｙ（ｋ）を、下記の式（２０）によって統合し、μｘ｜ｙとΣｘ｜ｙを出力する。
Specifically, the outputs μx | y (k) and Σx | y (k) from the M feature emphasizing units 13-1,... 13-M are integrated by the following equation (20), and μx | Y and Σx | y are output.
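Equation (20) is not reproduced in this text, so its exact form cannot be confirmed here. A minimal sketch, assuming that both the posterior mean and the posterior covariance are combined as sums weighted by p(k|y) and that covariances are diagonal, could look as follows; the function name is hypothetical.

```python
def integrate_posteriors(weights, post_means, post_vars):
    """Combine per-component posteriors using the integration weights p(k|y)."""
    dim = len(post_means[0])
    # Weighted sum of the per-component posterior means.
    mean = [sum(w * m[d] for w, m in zip(weights, post_means)) for d in range(dim)]
    # Weighted sum of the per-component posterior variances (assumed form).
    var = [sum(w * v[d] for w, v in zip(weights, post_vars)) for d in range(dim)]
    return mean, var
```

A moment-matched combination would add a term for the spread of the component means around the integrated mean; whether equation (20) includes such a term is not stated in this text.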

Next, the operation of the speech recognition apparatus 10 according to this embodiment will be described with reference to FIG. 5. Steps identical to those of the first embodiment are given the same reference numerals and described only briefly.

まず、ステップＳ３１の特徴抽出処理、ステップＳ３２のノイズ推定処理が行われる。 First, the feature extraction process in step S31 and the noise estimation process in step S32 are performed.

Next, in step S33, the Gaussian distribution calculation unit 133-k of the feature enhancement unit 13-k calculates the Gaussian parameters using the unscented transform, and the Gaussian distribution storage unit 132-k stores them.

次に、ステップＳ３４において、算出実行部１３４−ｋは、ガウス分布記憶部１３２−ｋに記憶されたガウスパラメータに基づいて、事後分布パラメータを算出する。 Next, in step S34, the calculation execution unit 134-k calculates a posterior distribution parameter based on the Gaussian parameter stored in the Gaussian distribution storage unit 132-k.

Next, in step S51, the speech recognition apparatus 10 returns to step S33 if processing has not been completed for all the feature enhancement units 13-1, ..., 13-M, and proceeds to step S52 if it has.

次に、ステップＳ５２において、重み算出部４１は、統合重みを算出する。 Next, in step S52, the weight calculation unit 41 calculates an integrated weight.

Next, in step S53, the integration unit 42 integrates the outputs from the M feature enhancement units 13-1, ..., 13-M.

次に、ステップＳ３５において、照合部１４は、単語の標準パターンとの照合を行う。 Next, in step S35, the collation unit 14 collates with a standard pattern of words.

次に、ステップＳ３６において、音声認識装置１０は、全てのフレームの処理が完了していなければステップＳ３１へ戻り、完了していればステップＳ３７へと進む。 Next, in step S36, the speech recognition apparatus 10 returns to step S31 if processing of all frames is not completed, and proceeds to step S37 if completed.

このように本実施形態は、ガウス混合モデルを用いることで、単一のガウス分布を用いる場合よりも、より精緻に事前分布を表現することができ、特徴強調効果を高め、雑音下でも高い音声認識性能を得ることができる。 As described above, this embodiment can express the prior distribution more precisely by using the Gaussian mixture model than the case of using a single Gaussian distribution, enhances the feature enhancement effect, and has high speech even under noise. Recognition performance can be obtained.

(Third embodiment)
Next, the speech recognition apparatus 10 according to the third embodiment will be described with reference to FIGS. 6 to 8.

第１及び第２の実施形態では、全てのフレームにおいてガウスパラメータの算出を行うため、演算量が大きくなる。 In the first and second embodiments, since the Gaussian parameter is calculated in all frames, the amount of calculation increases.

Therefore, in this embodiment, each frame is checked to determine whether the Gaussian parameters need to be recalculated. When recalculation is judged unnecessary, it is skipped, which reduces the amount of computation.

本実施形態と、第１及び第２の実施形態との違いは、特徴強調部１３の構成のみであるため、その他の構成についての説明は省略する。 Since the difference between the present embodiment and the first and second embodiments is only the configuration of the feature emphasizing unit 13, the description of other configurations is omitted.

本実施形態の特徴強調部１３について図６に基づいて説明する。図６は、本実施形態における特徴強調部１３のブロック図である。 The feature emphasizing unit 13 of this embodiment will be described with reference to FIG. FIG. 6 is a block diagram of the feature emphasizing unit 13 in the present embodiment.

The feature enhancement unit 13 includes a prior distribution parameter storage unit 131, a Gaussian distribution storage unit 132, a Gaussian distribution calculation unit 133, a calculation execution unit 134, a determination unit 61, and a first switch unit 62. Components other than the determination unit 61 and the first switch unit 62 are the same as in the first and second embodiments, so they are given the same reference numerals and their description is omitted.

判定部６１について説明する。 The determination unit 61 will be described.

判定部６１は、ある一つのフレームにおいて、ガウスパラメータの再算出が必要か否かを判定する。 The determination unit 61 determines whether it is necessary to recalculate Gaussian parameters in a certain frame.

判定部６１には、フレーム毎に、ノイズ推定部１２からノイズパラメータが入力される。判定部６１は、ノイズパラメータが大きく変化した場合には、ガウスパラメータの値もそれに伴って大きく変動するため、ガウスパラメータの再算出が必要と判定する。逆に、ノイズパラメータがあまり変化していない場合には、ガウスパラメータの値もあまり変化しないので、ガウスパラメータの再算出は不必要と判定する。 The noise parameter is input from the noise estimation unit 12 to the determination unit 61 for each frame. When the noise parameter changes greatly, the determination unit 61 determines that the Gaussian parameter needs to be recalculated because the value of the Gaussian parameter greatly varies accordingly. Conversely, when the noise parameter has not changed much, the value of the Gaussian parameter does not change much, so it is determined that recalculation of the Gaussian parameter is unnecessary.

図７は、判定部６１のブロック図である。図７に示すように、判定部６１は、ノイズパラメータ記憶部６１１、変動量算出部６１２、比較部６１３を備える。 FIG. 7 is a block diagram of the determination unit 61. As illustrated in FIG. 7, the determination unit 61 includes a noise parameter storage unit 611, a fluctuation amount calculation unit 612, and a comparison unit 613.

First, the noise parameter storage unit 611 stores the noise parameters of the most recent frame in which the Gaussian distribution calculation unit 133 calculated the Gaussian parameters.

変動量算出部６１２は、現在のフレームにおいてノイズ推定部１２から入力された現在のノイズパラメータと、ノイズパラメータ記憶部６１１に記憶された過去のノイズパラメータから、ノイズパラメータの変動量を算出する。例えば、下記の式（２１）に示されるようなユークリッド距離によって、ノイズパラメータの変動量を算出する。
The fluctuation amount calculation unit 612 calculates the fluctuation amount of the noise parameter from the current noise parameter input from the noise estimation unit 12 in the current frame and the past noise parameter stored in the noise parameter storage unit 611. For example, the fluctuation amount of the noise parameter is calculated by the Euclidean distance as shown in the following formula (21).

ここで、Δはノイズパラメータの変動量、μｎは現在のフレームにおける現在のノイズパラメータ、μｎに上付き棒を付したものはノイズパラメータ記憶部６１１に記憶された過去のノイズパラメータである。 Here, Δ is the fluctuation amount of the noise parameter, μn is the current noise parameter in the current frame, and μn with a superscript bar is a past noise parameter stored in the noise parameter storage unit 611.
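Equation (21) is not reproduced in this text. From the description above, a Euclidean-distance fluctuation amount would presumably read as follows, where the component index i is introduced here for illustration.

```latex
\Delta = \left\lVert \mu_n - \bar{\mu}_n \right\rVert_2
       = \sqrt{\sum_{i} \left(\mu_{n,i} - \bar{\mu}_{n,i}\right)^{2}} \tag{21}
```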

The comparison unit 613 compares the fluctuation amount with an arbitrary threshold. If the fluctuation amount exceeds the threshold, the noise parameters are regarded as having changed significantly since the Gaussian parameters were last calculated, and the comparison unit outputs a determination that the Gaussian parameters must be recalculated.

また、同時に、比較部６１３からノイズパラメータ記憶部６１１に記憶指令を送信し、現在のフレームにおける現在のノイズパラメータをノイズパラメータ記憶部６１１に記憶して過去のノイズパラメータを更新する。 At the same time, a storage command is transmitted from the comparison unit 613 to the noise parameter storage unit 611, the current noise parameter in the current frame is stored in the noise parameter storage unit 611, and the past noise parameter is updated.

閾値より小さければ、ノイズパラメータはあまり変動していないとみなし、ガウスパラメータの再算出は不要との判定を出力する。この時、ノイズパラメータ記憶部６１１の値は更新しない。 If it is smaller than the threshold value, it is considered that the noise parameter has not fluctuated so much and a determination is made that recalculation of the Gaussian parameter is unnecessary. At this time, the value of the noise parameter storage unit 611 is not updated.
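The behavior of the determination unit 61 described above can be sketched as a small class. The class and method names are hypothetical, and two details are assumptions not stated in the text: the first frame always triggers a calculation, and a fluctuation exactly equal to the threshold is treated as "no recalculation".

```python
import math

class RecalcDecider:
    """Decides per frame whether the Gaussian parameters must be recalculated."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.stored_mean = None  # noise mean at the last recalculation

    def needs_recalc(self, noise_mean):
        if self.stored_mean is None:
            # Nothing stored yet: calculate on the first frame (assumption).
            self.stored_mean = list(noise_mean)
            return True
        # Euclidean distance between the current and the stored noise mean.
        delta = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(noise_mean, self.stored_mean)))
        if delta > self.threshold:
            # Large change: update the stored value and request recalculation.
            self.stored_mean = list(noise_mean)
            return True
        # Small change: keep the stored value unchanged.
        return False
```

Because the stored value is updated only when recalculation is requested, slow drift accumulates against the last recalculation point rather than against the previous frame.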

第１スイッチ部６２は、判定部６１の判定に従って、ガウス分布算出部１３３の動作を制御する。すなわち、ガウスパラメータの再算出が必要と判断された場合には、ガウス分布算出部１３３を実行し、結果をガウス分布記憶部１３２に新たに保存し、この新たなガウスパラメータを用いて算出実行部１３４は事後分布パラメータの算出を実行する。 The first switch unit 62 controls the operation of the Gaussian distribution calculation unit 133 according to the determination of the determination unit 61. That is, when it is determined that the Gaussian parameter needs to be recalculated, the Gaussian distribution calculating unit 133 is executed, the result is newly stored in the Gaussian distribution storage unit 132, and the calculation executing unit is used using the new Gaussian parameter. 134 calculates the posterior distribution parameters.

一方、再算出が不必要と判定された場合には、第１スイッチ部６２は、ガウス分布算出部１３３の実行を省略する。そして、ガウス分布記憶部１３２の内容は変更しない。算出実行部１３４は、ガウス分布記憶部１３２に記憶された過去のガウスパラメータを用いて事後分布パラメータの算出を実行する。 On the other hand, when it is determined that recalculation is unnecessary, the first switch unit 62 omits the execution of the Gaussian distribution calculation unit 133. The contents of the Gaussian distribution storage unit 132 are not changed. The calculation execution unit 134 calculates the posterior distribution parameters using the past Gaussian parameters stored in the Gaussian distribution storage unit 132.
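The effect of the first switch unit 62 amounts to caching the Gaussian parameters between recalculations. The following sketch shows that control flow only; `process_frames` and `compute_params` are hypothetical names, and `compute_params` stands in for the Gaussian distribution calculation unit 133.

```python
import math

def process_frames(noise_means, threshold, compute_params):
    """Return the Gaussian parameters used for each frame, recomputing only when needed."""
    stored_noise = None    # noise mean at the last recalculation
    cached_params = None   # contents of the Gaussian distribution storage
    used = []
    for mu_n in noise_means:
        if stored_noise is None or math.dist(mu_n, stored_noise) > threshold:
            # Recalculation required: run the expensive calculation and store it.
            cached_params = compute_params(mu_n)
            stored_noise = list(mu_n)
        # Otherwise the stored parameters are reused as they are.
        used.append(cached_params)
    return used
```

With slowly varying noise, most frames take the reuse branch, so the cost of the unscented transform is paid only at the frames where the noise estimate has moved by more than the threshold.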

なお、第２の実施形態のような複数の特徴強調部１３−１，・・・１３−Ｍを備えている場合には、各々の特徴強調部１３−１，・・・１３−Ｍが判定部６１を備えているが、処理の内容は同一であるので、単一の判定部６１を全ての特徴強調部１３−１，・・・１３−Ｍで共有することができる。 In the case where a plurality of feature enhancement units 13-1,... 13-M as in the second embodiment are provided, each feature enhancement unit 13-1,. However, since the processing contents are the same, the single determination unit 61 can be shared by all the feature emphasizing units 13-1 to 13 -M.

次に、本実施形態の音声認識装置１０の動作について図８を参照しながら説明する。図８は、音声認識装置１０の動作を示すフローチャートである。ここでは、複数の特徴強調部１３−１，・・・１３−Ｍを用いた構成についての動作を説明する。 Next, the operation of the speech recognition apparatus 10 of this embodiment will be described with reference to FIG. FIG. 8 is a flowchart showing the operation of the speech recognition apparatus 10. Here, the operation of the configuration using the plurality of feature emphasizing units 13-1,... 13-M will be described.

第１の実施形態のような単一の特徴強調部１３を用いた構成についての動作は、複数の場合と同様であるので、説明を省略する。また、第１又は第２の実施形態と同一のステップについては、同一符号を付与して説明を簡潔におこなう。 Since the operation of the configuration using the single feature emphasizing unit 13 as in the first embodiment is the same as that in a plurality of cases, the description thereof is omitted. Further, the same steps as those in the first or second embodiment will be given the same reference numerals and will be briefly described.

次に、ステップＳ８１において、判定部６１は、特徴強調部１３−ｋについて、ノイズパラメータの変動量に基づき、ガウスパラメータの再算出が必要か不必要かを判定する。必要と判定された場合には、ステップＳ３３において、ガウスパラメータの算出を実行する。不必要と判定された場合には、ガウスパラメータの算出を省略する。 Next, in step S81, the determination unit 61 determines whether the Gaussian parameter needs to be recalculated based on the noise parameter fluctuation amount for the feature enhancement unit 13-k. If it is determined that it is necessary, a Gaussian parameter is calculated in step S33. If it is determined that it is unnecessary, the calculation of the Gaussian parameter is omitted.

次に、ステップＳ５１において、音声認識装置１０は、全ての特徴強調部１３−１，・・・１３−Ｍについて処理が完了していればステップＳ５２へ進む。そうでなければステップＳ８１へ戻る。 Next, in step S51, the speech recognition apparatus 10 proceeds to step S52 if the processing has been completed for all the feature emphasizing units 13-1, ... 13-M. Otherwise, the process returns to step S81.

次に、ステップＳ５３において、統合部４２は、Ｍ個の特徴強調部１３−１，・・・１３−Ｍからの出力を統合する。 Next, in step S53, the integration unit 42 integrates the outputs from the M feature enhancement units 13-1, ... 13-M.

次に、ステップＳ３６において、音声認識装置１０は、全てのフレームの処理が完了していなければステップＳ３１へと戻り、全てのフレームの処理が完了していればステップＳ３７へと進む。 Next, in step S36, the speech recognition apparatus 10 returns to step S31 if processing of all frames is not completed, and proceeds to step S37 if processing of all frames is completed.

このように本実施形態では、ノイズパラメータの変動量に基づきガウスパラメータの再算出が必要か不必要かを判定し、再算出が不必要と判定されたフレームではガウス分布算出部１３３の実行を省略することにより、演算量を削減することができる。 As described above, in the present embodiment, it is determined whether or not the Gaussian parameter needs to be recalculated based on the fluctuation amount of the noise parameter, and the execution of the Gaussian distribution calculating unit 133 is omitted in the frame in which the recalculation is determined to be unnecessary. By doing so, the amount of calculation can be reduced.

(Fourth embodiment)
Next, the speech recognition apparatus 10 according to the fourth embodiment will be described with reference to FIGS. 9 and 10.

本実施形態は、第３の実施形態と同様に特徴強調部１３における演算量を削減することを目的としたものである。すなわち、本実施形態では、判定部６１がガウスパラメータの再算出が不必要と判定した場合には、ガウス分布算出部１３３よりも演算量が少ない簡易算出部９１でガウスパラメータの演算を実行し、少なくとも一部のパラメータを更新する。 The present embodiment is intended to reduce the amount of calculation in the feature emphasizing unit 13 as in the third embodiment. That is, in the present embodiment, when the determination unit 61 determines that recalculation of the Gauss parameter is unnecessary, the simple calculation unit 91 that has a smaller calculation amount than the Gaussian distribution calculation unit 133 performs the calculation of the Gauss parameter. Update at least some parameters.

本実施形態と、第３の実施形態の違いは、特徴強調部１３の構成のみであるため、その他の構成についての説明は省略する。 Since the difference between the present embodiment and the third embodiment is only the configuration of the feature emphasizing unit 13, the description of other configurations is omitted.

本実施形態の特徴強調部１３について図９に基づいて説明する。図９は、特徴強調部１３のブロック図である。 The feature emphasizing unit 13 of this embodiment will be described with reference to FIG. FIG. 9 is a block diagram of the feature enhancement unit 13.

The feature enhancement unit 13 includes a prior distribution parameter storage unit 131, a Gaussian distribution storage unit 132, a Gaussian distribution calculation unit 133, a simple calculation unit 91, a determination unit 61, a second switch unit 92, and a calculation execution unit 134. Components other than the simple calculation unit 91 and the second switch unit 92 are the same as in the first to third embodiments, so they are given the same reference numerals and their description is omitted.

簡易算出部９１は、ガウス分布算出部１３３よりも少ない演算量で、ガウスパラメータの少なくとも一部を更新する。 The simple calculation unit 91 updates at least a part of the Gaussian parameter with a smaller calculation amount than the Gaussian distribution calculation unit 133.

Specifically, using the mean μn, which is one of the noise parameters (μn, Σn) of the current frame, the mean μy of the noisy vector, which is one of the Gaussian parameters, is calculated as μy = f(μx, μn). The other Gaussian parameters (Σy, Σxy) are not calculated.

Because the Gaussian distribution calculation unit 133 uses the unscented transform to calculate the Gaussian parameters (μy, Σy, Σxy), it computes them accurately but at a high computational cost. The simple calculation unit 91, by contrast, is less accurate but requires little computation. Therefore, for frames in which recalculation of the Gaussian parameters is judged unnecessary based on the fluctuation amount of the noise parameters, switching to the simple calculation unit 91 keeps the amount of computation in the feature enhancement unit 13 low.
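The simple calculation can be sketched as a partial update of the stored parameters. This is an illustration under assumptions: features are scalar, the function f is taken to be y = log(exp(x) + exp(n)), and the dictionary keys and the names `log_add` and `simple_update` are hypothetical.

```python
import math

def log_add(x, n):
    """Assumed form of f: y = log(exp(x) + exp(n)), computed stably."""
    return max(x, n) + math.log1p(math.exp(-abs(x - n)))

def simple_update(cached, mu_x, mu_n):
    """Refresh only mu_y from the current noise mean; keep var_y and cov_xy."""
    updated = dict(cached)  # copy so the stored parameters are not mutated
    # One function evaluation at the means replaces the full unscented transform.
    updated["mu_y"] = log_add(mu_x, mu_n)
    return updated
```

Only one evaluation of f is needed per frame, whereas the unscented transform evaluates f once per sigma point and then accumulates the second-order statistics.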

次に、本実施形態の音声認識装置１０の動作について図１０を参照しながら説明する。図１０は、音声認識装置１０の動作を示すフローチャートである。 Next, the operation of the speech recognition apparatus 10 of this embodiment will be described with reference to FIG. FIG. 10 is a flowchart showing the operation of the speech recognition apparatus 10.

ここでは、Ｍ個の特徴強調部１３−１，・・・１３−Ｍを用いた構成についての動作を説明する。第１の実施形態のような単一の特徴強調部１３を用いた構成についての動作は、複数の場合と同様であるので、説明を省略する。また、第１乃至第３の実施形態と同一のステップについては、同一符号を付与して説明を簡潔におこなう。 Here, the operation of the configuration using M feature emphasizing units 13-1,... 13-M will be described. Since the operation of the configuration using the single feature emphasizing unit 13 as in the first embodiment is the same as that in a plurality of cases, the description thereof is omitted. Further, the same steps as those in the first to third embodiments will be given the same reference numerals and will be briefly described.

まず、ステップＳ３１の特徴抽出処理、ステップＳ３２のノイズ推定処理が実行される。 First, the feature extraction process in step S31 and the noise estimation process in step S32 are executed.

次に、ステップＳ８１において、判定部６１は、特徴強調部１３−ｋについて、ノイズパラメータの変動量に基づき、ガウスパラメータの再算出が必要か不必要かを判定する。この判定は、第３の実施形態と同様である。再算出が必要と判定された場合には、ステップＳ３３においてガウス分布算出部１３３−ｋの動作を実行する。再算出が不必要と判定された場合には、ステップＳ１０１において簡易算出部９１−ｋの上記動作を実行する。 Next, in step S81, the determination unit 61 determines whether the Gaussian parameter needs to be recalculated based on the noise parameter fluctuation amount for the feature enhancement unit 13-k. This determination is the same as in the third embodiment. If it is determined that recalculation is necessary, the operation of the Gaussian distribution calculation unit 133-k is executed in step S33. If it is determined that recalculation is unnecessary, the above-described operation of the simple calculation unit 91-k is executed in step S101.

次に、ステップＳ５１において、音声認識装置１０は、全ての特徴強調部１３−１，・・・１３−Ｍについて処理が完了していればステップＳ５２へ進む。終了していなければ、ステップＳ８１へと戻る。 Next, in step S51, the speech recognition apparatus 10 proceeds to step S52 if the processing has been completed for all the feature emphasizing units 13-1, ... 13-M. If not completed, the process returns to step S81.

次に、ステップＳ５３において、Ｍ個の特徴強調部１３−１，・・・１３−Ｍからの出力を統合する。 Next, in step S53, the outputs from the M feature emphasizing units 13-1, ... 13-M are integrated.

このように本実施形態では、ノイズパラメータの変動量に基づきガウスパラメータの再算出が必要か不必要かを判定し、再算出が不必要と判定されたフレームでは、演算量の少ない簡易算出部９１に切り替えることによって、演算量を削減することができる。 As described above, in the present embodiment, it is determined whether the recalculation of the Gaussian parameter is necessary or unnecessary based on the fluctuation amount of the noise parameter, and the simple calculation unit 91 having a small calculation amount is determined in the frame in which the recalculation is determined to be unnecessary. The amount of calculation can be reduced by switching to.

(Modifications)
The present invention is not limited to the above embodiments and can be modified in various ways without departing from its gist.

FIG. 1 is a block diagram showing the configuration of the speech recognition apparatus of the first embodiment.
FIG. 2 is a block diagram showing the configuration of the feature enhancement unit of the first embodiment.
FIG. 3 is a flowchart showing the operation of the speech recognition apparatus of the first embodiment.
FIG. 4 is a block diagram showing the configuration of the speech recognition apparatus of the second embodiment.
FIG. 5 is a flowchart showing the operation of the speech recognition apparatus of the second embodiment.
FIG. 6 is a block diagram showing the configuration of the feature enhancement unit of the third embodiment.
FIG. 7 is a block diagram showing the configuration of the determination unit.
FIG. 8 is a flowchart showing the operation of the speech recognition apparatus of the third embodiment.
FIG. 9 is a block diagram showing the configuration of the feature enhancement unit of the fourth embodiment.
FIG. 10 is a flowchart showing the operation of the speech recognition apparatus of the fourth embodiment.

Explanation of symbols

10 Speech recognition apparatus
11 Feature extraction unit
12 Noise estimation unit
13 Feature enhancement unit
14 Collation unit
41 Weight calculation unit
42 Integration unit
61 Determination unit
62 First switch unit
91 Simple calculation unit
92 Second switch unit
131 Prior distribution parameter storage unit
132 Gaussian distribution storage unit
133 Gaussian distribution calculation unit
134 Calculation execution unit
611 Noise parameter storage unit
612 Fluctuation amount calculation unit
613 Comparison unit

Claims

A feature extraction unit that extracts a noisy speech feature vector for each frame from the input noisy speech;
A noise estimator for estimating a noise feature distribution parameter of a noise feature vector related to noise superimposed on the noisy speech;
A prior distribution parameter storage unit for storing a prior distribution parameter of a clean speech feature vector related to clean speech;
A Gaussian distribution calculation unit that calculates a combined Gaussian distribution parameter of the clean speech feature vector and the noisy speech feature vector for each frame using an unscented transform from the noise feature distribution parameter and the prior distribution parameter;
A calculation execution unit that calculates a posterior distribution parameter of the clean speech feature vector for each frame from the noisy speech feature vector using the combined Gaussian distribution parameter;
A collation unit that collates the posterior distribution parameter with a pre-stored word standard pattern for each frame and outputs the word string of the noisy speech based on the collation result;
A speech recognition apparatus comprising:

A plurality of the prior distribution parameter storage unit, the Gaussian distribution calculation unit, and the calculation execution unit, respectively,
Based on the combined Gaussian distribution parameters respectively calculated by the respective Gaussian distribution calculating units, a weight calculating unit that calculates an integrated weight for each posterior distribution parameter for each frame;
An integration unit that integrates the posterior distribution parameters based on the respective integration weights and outputs the integrated posterior distribution parameter to the collation unit for each frame;
The speech recognition apparatus according to claim 1, further comprising:

A Gaussian distribution storage unit that stores the combined Gaussian distribution parameter calculated by the Gaussian distribution calculation unit for each frame;
A determination unit that obtains the fluctuation amount of the noise feature distribution parameter for each frame, determines that recalculation of the combined Gaussian distribution parameter is unnecessary when the fluctuation amount is smaller than an arbitrary threshold, and determines that recalculation of the combined Gaussian distribution parameter is necessary when the fluctuation amount is larger than the threshold;
A first switch unit that (1) for a frame determined to require recalculation, sends the combined Gaussian distribution parameter recalculated by the Gaussian distribution calculation unit to the calculation execution unit, and (2) for a frame determined not to require recalculation, sends to the calculation execution unit the combined Gaussian distribution parameter that the Gaussian distribution calculation unit calculated for an earlier frame;
The speech recognition apparatus according to claim 1, further comprising:

A Gaussian distribution storage unit that stores the combined Gaussian distribution parameter calculated by the Gaussian distribution calculation unit for each frame;
A determination unit that obtains the fluctuation amount of the noise feature distribution parameter for each frame, determines that recalculation of the combined Gaussian distribution parameter is unnecessary when the fluctuation amount is smaller than an arbitrary threshold, and determines that recalculation of the combined Gaussian distribution parameter is necessary when the fluctuation amount is larger than the threshold;
A simple calculation unit that calculates, for each frame, one parameter among the combined Gaussian distribution parameters from the noise feature distribution parameter and the prior distribution parameter;
A second switch unit that (1) for a frame determined to require recalculation, sends the combined Gaussian distribution parameter recalculated by the Gaussian distribution calculation unit to the calculation execution unit, and (2) for a frame determined not to require recalculation, sends to the calculation execution unit the one parameter calculated by the simple calculation unit together with the combined Gaussian distribution parameters, other than that one parameter, stored in the Gaussian distribution storage unit;
The speech recognition apparatus according to claim 1, further comprising:

A feature extraction step for extracting a noisy speech feature vector for each frame from the input noisy speech;
A noise estimation step of estimating a noise feature distribution parameter of a noise feature vector related to noise superimposed on the noisy speech;
A Gaussian distribution calculation step of calculating, for each frame, a combined Gaussian distribution parameter of the clean speech feature vector and the noisy speech feature vector using an unscented transform, from the noise feature distribution parameter and a pre-stored prior distribution parameter of the clean speech feature vector relating to clean speech;
A calculation execution step of calculating a posterior distribution parameter of the clean speech feature vector for each frame from the noisy speech feature vector using the combined Gaussian distribution parameter;
A collation step of collating the posterior distribution parameter with a pre-stored word standard pattern for each frame and outputting the word string of the noisy speech based on the collation result;
A speech recognition method comprising:

A feature extraction function that extracts a noisy speech feature vector from the input noisy speech for each frame;
A noise estimation function for estimating a noise feature distribution parameter of a noise feature vector related to noise superimposed on the noisy speech;
A Gaussian distribution calculation function of calculating, for each frame, a combined Gaussian distribution parameter of the clean speech feature vector and the noisy speech feature vector using an unscented transform, from the noise feature distribution parameter and a pre-stored prior distribution parameter of the clean speech feature vector relating to clean speech;
A calculation execution function for calculating the posterior distribution parameter of the clean speech feature vector for each frame from the noisy speech feature vector using the combined Gaussian distribution parameter;
A collation function for collating the posterior distribution parameter with a pre-stored word standard pattern for each frame and outputting the word sequence of the noisy speech based on the collation result;
A speech recognition program that causes a computer to realize the functions above.