JP4881357B2

JP4881357B2 - Acoustic model creation apparatus, speech recognition apparatus using the apparatus, these methods, these programs, and these recording media

Info

Publication number: JP4881357B2
Application number: JP2008216640A
Authority: JP
Inventors: 晋治渡部; 篤中村
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-08-26
Filing date: 2008-08-26
Publication date: 2012-02-22
Anticipated expiration: 2028-08-26
Also published as: JP2010054588A

Description

The present invention relates to an acoustic model creation apparatus that creates, by a sequential adaptation method, an acoustic model used for speech recognition; to a speech recognition apparatus that performs speech recognition using an acoustic model created by that apparatus; and to the corresponding methods, programs, and recording media.

[Speech recognition]
FIG. 1 shows an example of the functional configuration of a conventional speech recognition apparatus, and FIG. 2 shows the main flow of its processing. The speech recognition apparatus 2 mainly comprises a feature extraction unit 4, a word string search unit 6, an acoustic model storage unit 8, and a language model storage unit 10.

First, the acoustic model in the acoustic model storage unit 8 is read (step S2). In some cases, a word model, a context-dependent phoneme model, or the like is read in addition to the acoustic model. The language model in the language model storage unit 10 is also read (step S4). The input speech data for recognition is read into the speech recognition apparatus 2 (step S6) and passed to the feature extraction unit 4, which converts it, frame by frame (a frame being a fixed time interval), into a sequence of acoustic features such as MFCC (mel-filterbank cepstral coefficient) vectors (hereinafter, the "feature sequence") (step S8). The feature sequence is temporarily stored in a feature storage unit (not shown), then read out and input to the word string search unit 6.

The word string search unit 6 computes scores for the feature sequence of the recognition speech data using the acoustic model in the acoustic model storage unit 8, and searches for a word string by referring also to the scores given by the language model and other models in the language model storage unit 10 (step S10). In some cases, a phoneme string search or an isolated word search is also performed. Finally, the recognition result is output as a word string (step S12); in some cases, only a phoneme string or an isolated word is output.
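
The flow of steps S8 and S10 can be sketched in a few lines. This is only a toy illustration, not the apparatus described above: the function names, the `lm_weight` parameter, the example hypotheses, and all scores are hypothetical, and a real search explores a very large hypothesis space rather than a short candidate list.

```python
import math

def split_frames(samples, frame_len, hop):
    """Cut a sample list into fixed-length, possibly overlapping frames (cf. step S8)."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]

def search_best(acoustic_logp, lm_logp, lm_weight=1.0):
    """Pick the hypothesis with the highest acoustic + weighted language score (cf. step S10)."""
    best, best_score = None, -math.inf
    for words, acoustic in acoustic_logp.items():
        # Hypotheses unknown to the language model get a score of -inf.
        score = acoustic + lm_weight * lm_logp.get(words, -math.inf)
        if score > best_score:
            best, best_score = words, score
    return best, best_score

# Hypothetical data: 10 samples, and log scores for two candidate word strings.
frames = split_frames(list(range(10)), frame_len=4, hop=2)
acoustic_scores = {("recognize", "speech"): -12.0, ("wreck", "a", "nice", "beach"): -11.5}
language_scores = {("recognize", "speech"): -3.0, ("wreck", "a", "nice", "beach"): -9.0}
best_words, best_score = search_best(acoustic_scores, language_scores)
print(len(frames), best_words, best_score)
```

In this example the acoustic score alone slightly favors the second hypothesis, and the language model score reverses the decision.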

[Acoustic model creation]
Next, a method for creating an acoustic model is described. An acoustic model models the acoustic characteristics of speech; speech data is converted into symbols such as phonemes and words by referring to the recognition speech data and the acoustic model. The creation of the acoustic model therefore largely determines the performance of a speech recognition apparatus. In an acoustic model for speech recognition, each phoneme is usually represented by a left-to-right hidden Markov model (HMM), and the output probability distribution of each HMM state by a Gaussian mixture model (GMM). What is actually stored in the storage unit as the acoustic model is therefore, for each symbol such as a phoneme, the HMM state transition probabilities a, the GMM mixture weights w, the mean vector parameters μ of the Gaussian distributions, and the covariance matrix parameters Σ of the Gaussian distributions. These are called acoustic model parameters, and their set is denoted θ; that is, θ = {a, w, μ, Σ}. Creating an acoustic model amounts to accurately determining the values of the acoustic model parameters θ, and this process is called the acoustic model creation method.
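
To make the roles of w, μ, and Σ concrete, the following sketch evaluates the GMM output log-likelihood of one HMM state for a single feature vector. It assumes diagonal covariance matrices for brevity (the text above only speaks of covariance matrices), and the class and function names are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class GmmState:
    weights: list    # mixture weights w_k, summing to 1
    means: list      # mean vectors mu_k, each a list of D floats
    variances: list  # diagonals of Sigma_k (assumption: diagonal covariance)

def log_gauss_diag(o, mean, var):
    """Log density of a Gaussian with diagonal covariance at feature vector o."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (x - m) ** 2 / v
        for x, m, v in zip(o, mean, var)
    )

def state_log_likelihood(state, o):
    """log sum_k w_k N(o; mu_k, Sigma_k), computed with the log-sum-exp trick."""
    terms = [
        math.log(w) + log_gauss_diag(o, m, v)
        for w, m, v in zip(state.weights, state.means, state.variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Hypothetical two-component state in D = 2 dimensions.
state = GmmState(
    weights=[0.6, 0.4],
    means=[[0.0, 0.0], [1.0, 1.0]],
    variances=[[1.0, 1.0], [0.5, 0.5]],
)
print(state_log_likelihood(state, [0.2, -0.1]))
```

The transition probabilities a are omitted here because they belong to the HMM structure rather than to the per-state output distribution.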

In recent years, acoustic models have been created by learning the acoustic model parameters θ from a large amount of speech data and its teacher labels using probabilistic and statistical methods. Training data normally comes with teacher label information indicating which portion corresponds to which phoneme. When no teacher label information is available, it is assigned by a human listener or by a speech recognition apparatus. The following description assumes that teacher label information has been assigned to the training speech data in one of these ways.

FIG. 3 shows an example of the functional configuration of a conventional acoustic model creation apparatus, and FIG. 4 shows the main flow of its processing. The assignment of teacher label information is omitted from FIGS. 3 and 4.

The acoustic model creation apparatus 11 comprises a feature extraction unit 4, a feature storage unit 5, and an acoustic model parameter learning unit 12. Training speech data is read into the acoustic model creation apparatus 11 (step S22) and converted into a feature sequence by the feature extraction unit 4 (step S24). The feature sequence is stored in the feature storage unit 5, then read out and input to the acoustic model parameter learning unit 12. If no teacher labels exist (step S26), teacher label information is assigned by a speech recognition apparatus or by hand (step S28).

Next, the learning of acoustic model parameters by the acoustic model parameter learning unit 12 is described. Estimating the acoustic model parameters θ (the HMM state transition probabilities a, the GMM mixture weights w, and the GMM mean vector parameters μ and covariance matrix parameters Σ) from the data corresponding to each phoneme in the training data, as identified by the teacher label information, is called learning the acoustic model parameters. One method for learning the parameters is maximum likelihood learning; others include Bayesian learning, discriminative learning, and neural networks.

The acoustic model parameter learning unit 12 learns the acoustic model parameters using the teacher label information, prepared in advance in the teacher label storage unit 14, that corresponds to the speech data (step S30). The acoustic model created by the acoustic model creation apparatus 11 is then output (step S32). If teacher labels exist at step S26, processing proceeds directly to step S30.
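
As a minimal sketch of maximum likelihood learning from labeled data, the code below estimates one Gaussian per phoneme label. It is a strong simplification of step S30: it assumes a frame-level label for every feature vector, a single Gaussian with diagonal covariance per label, and no HMM or mixture structure, so no EM iterations are needed. The function name and data are hypothetical.

```python
from collections import defaultdict

def ml_estimate(frames, labels):
    """Maximum likelihood mean and (diagonal) variance per teacher label."""
    grouped = defaultdict(list)
    for o, label in zip(frames, labels):
        grouped[label].append(o)
    params = {}
    for label, obs in grouped.items():
        n = len(obs)
        dim = len(obs[0])
        mean = [sum(o[d] for o in obs) / n for d in range(dim)]
        # The ML variance divides by n, not n - 1.
        var = [sum((o[d] - mean[d]) ** 2 for o in obs) / n for d in range(dim)]
        params[label] = (mean, var)
    return params

# Hypothetical 1-dimensional features with frame-level phoneme labels.
frames = [[1.0], [3.0], [10.0]]
labels = ["a", "a", "i"]
model = ml_estimate(frames, labels)
print(model)
```

The label "i" has a single frame here, so its variance estimate is zero; this is a small-scale instance of the data sparsity problem that motivates adaptive learning.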

Because the acoustic model parameters have millions of degrees of freedom, learning them requires a large amount of training speech data, on the order of hundreds of hours. However, it is impossible to collect in advance enough speech data, covering every source of acoustic variation such as speaker, noise, and speaking style, to train millions of parameters adequately. Adaptive learning, a technique for estimating acoustic model parameters from a small amount of training speech data, is therefore very important.

[Adaptive learning based on transformation of acoustic model parameters]
Adaptive learning of acoustic model parameters is a technique that, when the amount of training data per parameter is small, uses an initial model as prior knowledge and learns from the small amount of data. It differs from ordinary learning in that the acoustic model is built from an initial model as well as from the training data. A learning method that builds a new acoustic model from an initial model and training data in this way is called adaptive learning.

Adaptive learning generally focuses on the transformation between the initial acoustic model parameters θ_0 and the newly created acoustic model parameters θ. For example, given a feature sequence O = {o_1, o_2, ..., o_N | o_n ∈ R^D} of N D-dimensional feature vectors, the transformation parameter estimation method considers not the estimation of the acoustic model parameters θ themselves but the estimation of the transformation parameters. That is, when the adapted acoustic model parameters θ are obtained from the initial model parameters θ_0 and the feature sequence O as θ = f(θ_0, O), the method determines f(·) and uses it to obtain the new acoustic model parameters θ.

If f(·) is expressed parametrically (that is, the relationship is given by a formula), adaptive learning amounts to estimating the transformation parameters W of f(·) (described concretely later) from the initial model parameters θ_0 and the adaptation speech data O. This is called adaptive learning based on transformation of acoustic model parameters.

[Linear regression method]
Among adaptive learning methods, estimating a linear regression matrix for the mean vector parameters μ of the Gaussian distributions in the acoustic model is widely used (see Non-Patent Documents 1 and 2). FIG. 5 shows an example of the functional configuration of an acoustic model creation apparatus that uses a linear regression matrix, and FIG. 6 shows the main flow of its processing. The acoustic model creation apparatus 21 using this method comprises a feature extraction unit 4, a feature storage unit 5, and a parameter adaptation unit 22; the parameter adaptation unit 22 comprises a transformation parameter estimation unit 24, a transformation parameter storage unit 26, and a model parameter transformation unit 28.

First, the initial acoustic model parameters θ_0 are read into the initial acoustic model parameter storage unit 30 (step S40). Next, the adaptation speech data 20 is read (step S42), input to the feature extraction unit 4, and converted into a feature sequence O (step S44). The feature sequence O is temporarily stored in the feature storage unit 5 and then input to the transformation parameter estimation unit 24. The processing of the transformation parameter estimation unit 24 and the model parameter transformation unit 28 is described below.

The mean vector parameter μ_0 of a Gaussian distribution in the initial acoustic model parameters θ_0 is linearly transformed by the following equation (1).
μ = A μ_0 + ν    (1)
Here, A is a D × D matrix that rotates and scales the mean vector parameter μ_0, and ν is a D-dimensional vector that translates it. The transformation parameters are then W = (ν, A). W is obtained efficiently from the feature sequence O by iterative computation using the expectation-maximization algorithm (hereinafter, the EM algorithm) or the MLLR (Maximum Likelihood Linear Regression) algorithm, which is a form of it (step S46). The number of transformation parameters W to be estimated is D^2 + D = D(D + 1), because the matrix A has D^2 elements and the vector ν has D elements. Although this exceeds the D parameters of a mean vector, the number of parameters to be estimated can be reduced by sharing the same transformation parameters among multiple Gaussian distributions. The estimated transformation parameters W are temporarily stored in the transformation parameter storage unit 26.

The stored transformation parameters W are input to the model parameter transformation unit 28, which obtains a new mean vector parameter μ from equation (1) using W and the initial mean vector parameter μ_0 (in the initial acoustic model parameters θ_0) (step S48). The mean vector parameter μ is output as the acoustic model parameter μ (step S50).
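
Step S48 can be sketched as follows: one shared transform W = (ν, A) is applied to several initial mean vectors by equation (1). The estimation of W itself (step S46) is not shown; the values of A and ν, the function names, and the dimension 39 used in the parameter count are illustrative assumptions.

```python
def transform_mean(A, nu, mu0):
    """Equation (1): mu = A mu0 + nu, with A given as a list of rows."""
    return [sum(a * m for a, m in zip(row, mu0)) + b for row, b in zip(A, nu)]

def num_transform_params(dim):
    """D^2 elements of A plus D elements of nu."""
    return dim * dim + dim

# Hypothetical transform for D = 2, shared by two Gaussian means.
A = [[1.0, 0.5],
     [0.0, 2.0]]
nu = [0.1, -0.2]
initial_means = [[1.0, 2.0], [0.0, 1.0]]
adapted_means = [transform_mean(A, nu, mu0) for mu0 in initial_means]

print(adapted_means)
print(num_transform_params(39))  # e.g. for 39-dimensional features
```

Because both means share the same W, only D(D + 1) values have to be estimated from the adaptation data regardless of how many Gaussians are tied to the transform.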

[Sequential adaptation based on acoustic model parameter transformation]
So far, adaptive learning has been considered for a single chunk of features O = {o_1, o_2, ..., o_n, ..., o_N} (where N is the number of frames). However, the acoustic characteristics of speech change greatly from moment to moment due to external factors such as noise and internal factors such as lax articulation. To follow such changes, sequential adaptive learning, in which the model is adapted successively to chunks of speech data supplied in time series, is effective. Here the feature sequence is not treated as a single chunk; instead, adaptation is considered for the case where multiple chunks are given in time series, as expressed by equations (2) and (3).

Here, t denotes the previous time and t+1 the current time, T in equation (3) denotes the total number of values of t, and θ_{t+1} and θ_t are the current and previous acoustic model parameters. The acoustic model parameters θ_{t+1} for chunk t+1 are obtained from the acoustic model parameters θ_t obtained for the preceding chunk t and the feature chunk O_{t+1}. That is, by expressing the update as the recurrence of equation (4) below, the acoustic model can be obtained moment by moment. This is called the sequential adaptation method based on parameter transformation.
θ_{t+1} = f(θ_t, O_{t+1})    (4)

FIG. 7 shows the procedure by which the acoustic model parameters are transformed under the sequential adaptation method. First, the model parameter transformation unit 28 obtains the acoustic model parameters θ_1 from the feature sequence O_1 and the initial acoustic model parameters θ_0. Next, the acoustic model parameters θ_2 are obtained from θ_1 and the next feature sequence O_2. In this way, the current acoustic model parameters θ_{t+1} are obtained from the previous acoustic model parameters θ_t and the current feature sequence O_{t+1}.
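
The chaining of updates in FIG. 7 is a simple fold over the chunks. The sketch below shows only that control flow; the update function `shift_mean_toward_chunk` and its `rate` parameter are hypothetical stand-ins for f, chosen so that the example runs on scalar data, and are not the transformation described in this document.

```python
def adapt_sequentially(theta0, chunks, f):
    """Apply theta_{t+1} = f(theta_t, O_{t+1}) chunk by chunk; return theta_0 ... theta_T."""
    history = [theta0]
    for chunk in chunks:
        history.append(f(history[-1], chunk))
    return history

def shift_mean_toward_chunk(mean, chunk, rate=0.5):
    """Hypothetical f: move a scalar mean part of the way toward the chunk average."""
    chunk_mean = sum(chunk) / len(chunk)
    return mean + rate * (chunk_mean - mean)

# theta_0 = 0.0, followed by two chunks O_1 and O_2 of scalar features.
history = adapt_sequentially(0.0, [[2.0, 2.0], [4.0]], shift_mean_toward_chunk)
print(history)
```

Each θ_{t+1} depends on earlier chunks only through θ_t, so earlier feature chunks do not need to be kept in memory.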

[Linear regression method]
Consider now applying the transformation parameter estimation method to sequential adaptation (see Non-Patent Document 2). Above, the transformation parameters W were estimated from the entire feature sequence; in sequential adaptation, W is estimated for each chunk (for each t). Writing these as W_t = {ν_t, A_t}, the update of the mean parameter in the sequential adaptation method based on parameter transformation (equation (4)) can be expressed, on the basis of equation (1), as the recurrence of equation (5).
μ_{t+1} = A_{t+1} μ_t + ν_{t+1}    (5)
This realizes sequential adaptation based on parameter transformation. In the following description, A_{t+1} is called the "coefficient matrix of the linear representation of the stochastic dynamics of the mean in the current acoustic model parameters", and ν_{t+1} the "coefficient vector of the linear representation of the stochastic dynamics of the mean in the current acoustic model parameters".
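
Unrolling recurrence (5) for the first two chunks shows how each chunk's transform enters every later mean. The expansion below is only a worked illustration and uses no symbols beyond those of equations (1) and (5).

```latex
\begin{aligned}
\mu_1 &= A_1 \mu_0 + \nu_1, \\
\mu_2 &= A_2 \mu_1 + \nu_2 = A_2 A_1 \mu_0 + A_2 \nu_1 + \nu_2 .
\end{aligned}
```

Since A_1 and ν_1 appear in μ_2 and in every later mean, an error made when estimating the transform for one chunk is carried into all subsequent chunks.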

The sequential adaptation method above does not take into account how much estimation error is contained in the acoustic model parameters θ_0, ..., θ_{t+1} obtained. Consequently, when some speech data adversely affects learning, or when learning fails, the effect appears directly in recognition performance, and robustness is low.

[Sequential adaptation method based on distribution transformation]
Next, the "sequential adaptation method based on distribution transformation", which is the basic concept of the present invention, is described. This method considers not the estimation of the acoustic model parameters θ_t themselves but the distribution p(θ_t) of the acoustic model parameters (see Patent Document 1 and Non-Patent Documents 3 and 4).

This makes it possible to take estimation error into account, for example through the variance of the distribution. Furthermore, as the distribution of the acoustic model parameters, consider the posterior probability distribution given the accumulated feature sequence O^t = {O_1, O_2, ..., O_t}. That is, the target of estimation is p(θ_t | O^t) rather than p(θ_t). Here, O^{t+1} and O^t denote the feature sequences accumulated up to the current time and up to the previous time, respectively.

Here, p(α | β) is the posterior probability (conditional probability), that is, the probability that an event α occurs given that an event β occurs. Thus p(θ_t | O^t) is the posterior probability that the acoustic model parameters are θ_t given the feature sequence O^t. Because the information in the accumulated feature sequence O^t can thereby be reflected in the acoustic model parameters, robustness can be ensured. Accordingly, by describing the time evolution, that is, the change in the acoustic characteristics of speech, with the recurrence of equation (8),
p(θ_{t+1} | O^{t+1}) = F[p(θ_t | O^t)]    (8)
sequential adaptation can be realized on the basis of the posterior probability distribution p(θ | O) of the acoustic model parameters rather than the acoustic model parameters θ themselves, on which equation (4) focused. Here, F[·] is a functional that takes p(θ | O) as its argument. F[·] is expressed on the basis of part of the feature sequence O^{t+1} accumulated up to the current time; in the following description, it is assumed to be expressed on the basis of O^{t+1}. If F[·] is expressed parametrically and its transformation parameters W are estimated appropriately, for example from the features O_t, the sequential adaptation of equation (8) can be realized. The transformation parameters need not be estimated from O_t alone; part of the feature sequences O_1, O_2, ..., O_t, or the accumulated feature sequence O^t, may be used.

Comparing equations (4) and (8) shows that equation (8) successively transforms not the parameters but their posterior probability distribution. Moreover, if ω_t denotes the parameters of the posterior probability distribution p(θ_t | O^t) at time t, the successive update of p(θ_t | O^t) can be expressed as a successive update of ω_t. Sequential adaptation can therefore be realized by obtaining the posterior distribution parameters ω_t moment by moment. Accordingly, the sequential adaptation method based on distribution transformation updates the posterior distribution parameters ω_t rather than the posterior probability distribution p(θ_t | O^t) itself.
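
The idea of tracking a distribution through its parameters ω_t can be illustrated with the simplest conjugate case. The sketch assumes, purely for illustration, that θ is a scalar Gaussian mean that does not change over time, that the posterior is Gaussian with ω = (mean, variance), and that observations have a known variance `obs_var`; none of these choices is taken from this document.

```python
def update_omega(omega, chunk, obs_var):
    """Conjugate Gaussian update of omega = (posterior mean, posterior variance)."""
    mean, var = omega
    n = len(chunk)
    # Precisions (inverse variances) of prior and data add up.
    post_precision = 1.0 / var + n / obs_var
    post_var = 1.0 / post_precision
    post_mean = post_var * (mean / var + sum(chunk) / obs_var)
    return post_mean, post_var

omega0 = (0.0, 1.0)                        # initial distribution p(theta_0)
omega1 = update_omega(omega0, [2.0], 1.0)  # after chunk O_1
omega2 = update_omega(omega1, [4.0], 1.0)  # after chunk O_2
print(omega1, omega2)
```

The variance component of ω shrinks as data accumulate, which is the kind of information about estimation error that a point estimate θ_t cannot carry.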

FIG. 8 shows an example of the functional configuration when this sequential adaptation method is applied, FIG. 9 shows the order in which the parameters ω of the posterior probability distribution are sequentially adapted, and FIG. 10 shows the main flow of processing. The acoustic model creation apparatus 48 shown in FIG. 8 comprises a feature extraction unit 4, a feature storage unit 5, and a model adaptation unit 50; the model adaptation unit 50 comprises a sequential learning unit 52, a posterior probability distribution storage unit 54, and a model update unit 56.

First, the parameters ω_t of the previous posterior probability distribution are read by the model adaptation unit 50 (step S60). Next, adaptation speech data is read (step S62), input to the feature extraction unit 4, and converted into the feature sequence O_{t+1} (step S64). The feature sequence O_{t+1} is temporarily stored in the feature storage unit 5 and then input to the sequential learning unit 52.

As in equation (8), the sequential learning unit 52 obtains the posterior probability distribution p(θ_{t+1} | O^{t+1}) of the current acoustic model parameters, adapted to the current feature sequence, from the previously obtained posterior probability distribution p(θ_t | O^t), which reflects the feature sequences accumulated up to the previous time, and the currently extracted feature sequence O_{t+1} (step S68). A more concrete way for the sequential learning unit 52 to obtain p(θ_{t+1} | O^{t+1}) is described below.

The function F[·] in equation (8), which describes the time evolution from p(θ_t | O^t) to p(θ_{t+1} | O^{t+1}), can be given any form, and various transformations are conceivable. This embodiment introduces, as one concrete functional form, a recurrence derived theoretically and without approximation from the product rule of probability and Bayes' theorem. First, by Bayes' theorem, p(θ_{t+1} | O^{t+1}) is expressed as follows.
p(θ_{t+1} | O^{t+1}) = p(O_{t+1} | θ_{t+1}, O^t) p(θ_{t+1} | O^t) / p(O_{t+1} | O^t)    (9)
Here, p(θ_{t+1} | O^t) on the right-hand side of equation (9) is expressed in terms of p(θ_t | O^t) as follows.
p(θ_{t+1} | O^t) = ∫ p(θ_{t+1} | θ_t, O^t) p(θ_t | O^t) dθ_t    (10)
Substituting equation (10) into equation (9) therefore yields the recurrence of equation (11).
p(θ_{t+1} | O^{t+1}) = [p(O_{t+1} | θ_{t+1}, O^t) / p(O_{t+1} | O^t)] ∫ p(θ_{t+1} | θ_t, O^t) p(θ_t | O^t) dθ_t    (11)

The right-hand side of equation (11) contains the posterior probability distribution p(θ_t | O^t) at the previous time t, so the equation gives the posterior probability p(θ_{t+1} | O^{t+1}) at the current time (the next time t+1) from p(θ_t | O^t). Equation (11) is therefore called the recurrence for the posterior probability distribution of the acoustic model parameters. Using this recurrence, the sequential learning unit 52 can sequentially estimate the posterior probability distribution p(θ_{t+1} | O^{t+1}) of the acoustic model parameters, which reflects the information in the feature sequence O^{t+1} accumulated so far. The integral in equation (11) can be evaluated numerically, for example by the Monte Carlo method. Now restrict the time evolution to the first step alone. Setting t → 0 and t+1 → 1, and noting that no features have been accumulated before the first chunk, gives equation (12).
p(θ_1 | O_1) = [p(O_1 | θ_1) / p(O_1)] ∫ p(θ_1 | θ_0) p(θ_0) dθ_0    (12)
Equation (12) represents ordinary, non-sequential adaptation, in which p(θ_1 | O_1) is estimated from the given adaptation data O_1. In other words, the present invention is effective not only for sequential adaptation but also for ordinary adaptation.

To realize sequential adaptation by equation (11), concrete forms must be given to the four probability distributions on its right-hand side: p(O_{t+1} | O^t), p(θ_t | O^t), p(O_{t+1} | θ_{t+1}, O^t), and p(θ_{t+1} | θ_t, O^t). Of these, p(O_{t+1} | O^t) does not depend on the argument θ_{t+1} of the target distribution p(θ_{t+1} | O^{t+1}), so it can be treated as a normalization constant and need not be given a concrete form. The remaining three, p(θ_t | O^t), p(O_{t+1} | θ_{t+1}, O^t), and p(θ_{t+1} | θ_t, O^t), are considered below.

p(θ_t | O^t) is the posterior probability distribution of the acoustic model parameters described above, and can be obtained sequentially once an appropriate initial distribution is set. p(O_{t+1} | θ_{t+1}, O^t) is the output distribution of O_{t+1}, and is determined by the choice of acoustic model, such as an HMM or GMM. Finally, p(θ_{t+1} | θ_t, O^t) is the stochastic dynamics of the acoustic model parameters θ. The recurrence of equation (11) is thus composed of an initial distribution, an output distribution, and stochastic dynamics.
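
The three ingredients (initial distribution, output distribution, stochastic dynamics) can be combined in a small numerical sketch of one step of the posterior recurrence. It makes several assumptions that are not taken from this document: θ is a scalar discretized on a grid so that the integral becomes a sum, the dynamics is a Gaussian random walk with variance `dyn_var`, the output distribution is Gaussian with variance `obs_var`, and neither depends on the previously accumulated features.

```python
import math

def gauss(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def update_posterior(grid, prior, chunk, dyn_var, obs_var):
    """One step from p(theta_t | O^t) to p(theta_{t+1} | O^{t+1}) on a grid."""
    # Prediction: sum over theta_t of dynamics times previous posterior.
    predicted = [
        sum(gauss(new, old, dyn_var) * p for old, p in zip(grid, prior))
        for new in grid
    ]
    # Correction: multiply by the output distribution of the new chunk.
    unnormalized = [
        pred * math.prod(gauss(o, theta, obs_var) for o in chunk)
        for theta, pred in zip(grid, predicted)
    ]
    # The denominator p(O_{t+1} | O^t) is recovered as the normalization constant.
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

grid = [i * 0.1 for i in range(-50, 51)]   # candidate values of theta
initial = [1.0 / len(grid)] * len(grid)    # flat initial distribution
posterior1 = update_posterior(grid, initial, [1.0, 1.2, 0.8], dyn_var=0.04, obs_var=0.25)
posterior2 = update_posterior(grid, posterior1, [2.0, 2.1, 1.9], dyn_var=0.04, obs_var=0.25)
print(grid[posterior1.index(max(posterior1))], grid[posterior2.index(max(posterior2))])
```

After the second chunk the posterior peak lies between the two chunk averages, because the previous posterior acts as prior knowledge that the dynamics only partly relaxes.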

説明を図８に戻すと、逐次学習部５２により求められた今回の事後確率分布ｐ(θ_ｔ＋１｜Ｏ^ｔ＋１)は一旦、事後確率分布記憶部５４に記憶される。そして今回の事後確率分布ｐ(θ_ｔ＋１｜Ｏ^ｔ＋１)はモデル更新部５６に入力される。
モデル更新部５６で、音響モデル記憶部５８内の音響モデルとしての前回の事後確率分布ｐ(θ_ｔ｜Ｏ^ｔ)が、今回の音響モデルパラメータの事後確率分布ｐ(θ_ｔ＋１｜Ｏ^ｔ＋１)に新たな音響モデルとして更新する（ステップＳ７０）。 Returning to FIG. 8, the current posterior probability distribution p (θ _{t + 1} | O ^{t + 1} ) obtained by the sequential learning unit 52 is temporarily stored in the posterior probability distribution storage unit 54. The current posterior probability distribution p (θ _{t + 1} | O ^{t + 1} ) is input to the model update unit 56.
In the model update unit 56, the previous posterior probability distribution p (θ _t | O ^t ) held as the acoustic model in the acoustic model storage unit 58 is replaced with the current posterior probability distribution p (θ _{t + 1} | O ^{t + 1} ) of the acoustic model parameters, which becomes the new acoustic model (step S70).

また、図９について説明すると、求められた前回の事後確率分布ｐ(θ_ｔ｜Ｏ^ｔ)は一旦、音響モデル（分布モデル）記憶部５８に一旦、記憶される。逐次学習部５２で、前回の事後確率分布ｐ(θ_ｔ｜Ｏ^ｔ)と、今回の特徴量系列Ｏ^ｔ＋１とを用いて、前記式（１１）から、今回の事後確率分布ｐ(θ_ｔ＋１｜Ｏ^ｔ＋１)を求める。このようにして、音響モデルパラメータの事後確率分布を逐次的に更新する。 Further, referring to FIG. 9, the obtained previous posterior probability distribution p (θ _t | O ^t ) is once stored in the acoustic model (distribution model) storage unit 58. The sequential learning unit 52 uses the previous posterior probability distribution p (θ _t | O ^t ) and the current feature amount sequence O ^{t + 1} to calculate the current posterior probability distribution p (θ _{t + 1} | O ^{t + 1} ). In this way, the posterior probability distribution of the acoustic model parameters is updated sequentially.

［マルコフ過程の導入］
次に前記式（１１）の演算処理をマルコフ過程を仮定することで簡単にする手法を説明する。ｐ(Ｏ_ｔ＋１｜θ_ｔ＋１，Ｏ^ｔ)及びｐ(θ_ｔ＋１｜θ_ｔ，Ｏ^ｔ)は累積された特徴量系列に直接依存する。これらを全ての累積特徴量系列から推定しようとした場合、時が経つにつれ累積データは多くなるため、その推定は大変計算量が多くなり現実的でない。そこで、マルコフ過程を仮定すると、ｐ(Ｏ_ｔ＋１｜θ_ｔ＋１，Ｏ^ｔ)とｐ(θ_ｔ＋１｜θ_ｔ，Ｏ^ｔ)はそれぞれ式（１３）のように近似される。
ｐ(Ｏ_ｔ＋１｜θ_ｔ＋１，Ｏ^ｔ)≒ｐ(Ｏ_ｔ＋１｜θ_ｔ＋１)，
ｐ(θ_ｔ＋１｜θ_ｔ，Ｏ^ｔ) ≒ｐ(θ_ｔ＋１｜θ_ｔ) （１３） [Introduction of Markov process]
Next, a method for simplifying the arithmetic processing of the equation (11) by assuming a Markov process will be described. p (O _{t + 1} | θ _{t + 1} , O ^t ) and p (θ _{t + 1} | θ _t , O ^t ) directly depend on the accumulated feature amount series. If these are to be estimated from all the accumulated feature quantity sequences, the accumulated data increases as time passes. Therefore, the estimation is very complicated and unrealistic. Therefore, assuming a Markov process, p (O _{t + 1} | θ _{t + 1} , O ^t ) and p (θ _{t + 1} | θ _t , O ^t ) are approximated as shown in Equation (13).
p (O _{t + 1} | θ _{t + 1} , O ^t ) ≈p (O _{t + 1} | θ _{t + 1} ),
p (θ _{t + 1} | θ _t , O ^t ) ≈p (θ _{t + 1} | θ _t ) (13)

この近似により、逐次学習部５２は前回の音響モデルパラメータの事後確率分布ｐ（θ_ｔ│Ｏ^ｔ）と、今回の出力分布ｐ（Ｏ_ｔ＋１│θ_ｔ＋１）と、今回の確率的ダイナミクスｐ（θ_ｔ＋１│θ_ｔ）と、を用いて今回の音響モデルパラメータの事後確率分布ｐ（θ_ｔ＋１│Ｏ^ｔ＋１）を求める。具体的には以下の式（１４）のように近似される。
ｐ(θ_ｔ＋１｜Ｏ^ｔ＋１)∝ｐ(Ｏ_ｔ＋１｜θ_ｔ＋１)∫ｐ(θ_ｔ＋１｜θ_ｔ)ｐ(θ_ｔ｜Ｏ^ｔ)ｄθ_ｔ（１４）
ここで、Ａ∝ＢはＡとＢは比例しているということを表す。前記式（１４）によって、シンプルな出力分布及び確率的ダイナミクスを設定することができる。図８中の逐次学習部５２は、この式（１４）を計算することになる。 With this approximation, the sequential learning unit 52 obtains the posterior probability distribution p (θ _{t + 1} | O ^{t + 1} ) of the current acoustic model parameters from the posterior probability distribution p (θ _t | O ^t ) of the previous acoustic model parameters, the current output distribution p (O _{t + 1} | θ _{t + 1} ), and the current stochastic dynamics p (θ _{t + 1} | θ _t ). Specifically, it is approximated as the following equation (14).
p (θ _{t + 1} | O ^{t + 1} ) ∝p (O _{t + 1} | θ _{t + 1} ) ∫p (θ _{t + 1} | θ _t ) p (θ _t | O ^t ) dθ _t (14)
Here, A∝B indicates that A and B are proportional. A simple output distribution and stochastic dynamics can be set according to the equation (14). The sequential learning unit 52 in FIG. 8 calculates this equation (14).
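As an illustration only, the recursion of equation (14) can be sketched on a discretized parameter space, where the integral over θ _t becomes a sum over grid points. The grid, the function name sequential_update, and the numerical values are assumptions made for this sketch and do not appear in the specification.

```python
def sequential_update(prior, transition, likelihood):
    """One step of eq. (14) on a discrete grid of parameter values g_0..g_{n-1}.

    prior[i]         stands for p(theta_t = g_i | O^t)
    transition[i][j] stands for p(theta_{t+1} = g_j | theta_t = g_i)
    likelihood[j]    stands for p(O_{t+1} | theta_{t+1} = g_j)
    """
    n = len(prior)
    # Prediction: the integral over theta_t becomes a sum over grid points.
    predicted = [sum(transition[i][j] * prior[i] for i in range(n))
                 for j in range(n)]
    # Correction: multiply by the output distribution of the new block.
    unnormalized = [likelihood[j] * predicted[j] for j in range(n)]
    # Eq. (14) holds up to proportionality; normalizing restores a distribution.
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


# Toy example with two grid points (all values hypothetical).
prior = [0.5, 0.5]
transition = [[0.9, 0.1], [0.1, 0.9]]
likelihood = [0.8, 0.2]
posterior = sequential_update(prior, transition, likelihood)
print(posterior)
```

The returned list can be passed back in as the prior for the next block, which is the sense in which the update is sequential.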

［ガウス分布の平均ベクトルの考察］
以上の議論では、ＨＭＭの状態遷移確率ａ，ＧＭＭの混合重み因子ｗ、及びガウス分布の平均ベクトルパラメータμ及び共分散行列パラメータΣといった全ての音響モデルパラメータθの事後確率分布ｐ(θ｜Ｏ)についての処理を行った。一般に、音響モデルにおいて最も性能を左右するパラメータはガウス分布の平均ベクトルパラメータμであり、またそれ以外のパラメータの事後確率分布を推定対象とした場合、分布変換関数Ｆの推定すべきパラメータ数が多くなるため、少量データ適応において効果が十分でなくなる。そのため、以降ではガウス分布の平均ベクトルパラメータμのみに焦点を当て、つまり、音響モデルパラメータθに代えて、ガウス分布の平均ベクトルパラメータμを用いて、図８の逐次学習部５２では演算する。演算された事後確率分布ｐ(μ｜Ｏ)の時間発展について考察する。つまり、前記式（１４）においてガウス分布の平均ベクトルパラメータμのみを考えるため時間発展は次式（１５）を逐次学習部５２で演算する。
ｐ(μ_ｔ＋１｜Ｏ^ｔ＋１)∝ｐ(Ｏ_ｔ＋１｜μ_ｔ＋１)∫ｐ(μ_ｔ＋１｜μ_ｔ)ｐ(μ_ｔ｜Ｏ^ｔ)ｄμ_ｔ（１５）
なお、式（１５）は音響モデル中の各ガウス分布の平均ベクトルパラメータに独立に与えられる。その際の各ガウス分布のインデックスは文中では省略する。 [Consideration of mean vector of Gaussian distribution]
The discussion so far has dealt with the posterior probability distribution p (θ | O) of all acoustic model parameters θ: the state transition probability a of the HMM, the mixture weight factor w of the GMM, and the mean vector parameter μ and covariance matrix parameter Σ of the Gaussian distribution. In general, the parameter that most affects the performance of an acoustic model is the mean vector parameter μ of the Gaussian distribution. Moreover, if the posterior probability distributions of the other parameters are also estimated, the number of parameters of the distribution conversion function F to be estimated grows, and adaptation with a small amount of data becomes less effective. For these reasons, the following focuses only on the mean vector parameter μ of the Gaussian distribution; that is, the sequential learning unit 52 in FIG. 8 operates on the mean vector parameter μ instead of the acoustic model parameter θ. Consider the time evolution of the resulting posterior probability distribution p (μ | O). Since only the mean vector parameter μ is considered in equation (14), the sequential learning unit 52 computes the time evolution by the following equation (15).
p (μ _{t + 1} | O ^{t + 1} ) ∝ p (O _{t + 1} | μ _{t + 1} ) ∫ p (μ _{t + 1} | μ _t ) p (μ _t | O ^t ) dμ _t (15)
Equation (15) is applied independently to the mean vector parameter of each Gaussian distribution in the acoustic model. The index of each Gaussian distribution is omitted in the text.

［線形ダイナミクス］
次に、前記式（１５）の解析解を導出することを考える。これを用いて、逐次学習を行う。式（１５）にはさまざまな解析解が存在するが、最も単純な解析解として確率的ダイナミクスが線形で表現される場合を考える。つまり、確率的ダイナミクスとして、以下の式（１６）を仮定することが出来る。
μ_ｔ＋１＝Ａ_ｔ＋１μ_ｔ＋ν_ｔ＋１＋ε_ｔ＋１（１６） [Linear dynamics]
Next, let us consider deriving an analytical solution of the equation (15). Sequential learning is performed using this. There are various analytical solutions in equation (15), but consider the case where the stochastic dynamics is expressed linearly as the simplest analytical solution. That is, the following equation (16) can be assumed as the stochastic dynamics.
μ _{t + 1} = A _{t + 1} μ _t + ν _{t + 1} + ε _{t + 1} (16)

ここでε_ｔ＋１は平均０、共分散行列Ｕのガウシアンノイズである。式（１６）は、前記式（５）における線形変換が確率的に揺らいでいるといえる。このとき、確率ダイナミクスの分布具体系は、以下の式（１７）として与えられる。
ｐ(μ_ｔ＋１｜μ_ｔ)＝Ｎ(μ_ｔ＋１｜Ａ_ｔ＋１μ_ｔ＋ν_ｔ＋１，Ｕ) （１７） Here, ε _{t + 1} is a Gaussian noise having an average of 0 and a covariance matrix U. Equation (16) can be said to be that the linear transformation in equation (5) fluctuates stochastically. At this time, the specific distribution system of probability dynamics is given by the following equation (17).
p (μ _{t + 1} | μ _t ) = N (μ _{t + 1} | A _{t + 1} μ _t + ν _{t + 1} , U) (17)

ここで式（１７）のＮ(ｘ｜ｍ、Ｓ)は、ｘを引き数とする平均パラメータｍ、共分散行列パラメータＳのガウス分布である。さらに通常のＨＭＭ，ＧＭＭで表現される音響モデルに対して一まとまりの特徴量系列Ｏ_ｔ＝｛ｏ_Ｎｔ＋１，…，ｏ_{Ｎｔ＋Ｎｔ＋１}｝が出力される出力分布ｐ（Ｏ_ｔ│μ_ｔ）は以下の式（１８）で表すことができる。

Here, N (x | m, S) in equation (17) denotes a Gaussian distribution over x with mean parameter m and covariance matrix parameter S. Further, for an acoustic model expressed by an ordinary HMM or GMM, the output distribution p (O _t | μ _t ) with which a block of feature amounts O _t = {o _{Nt + 1} ,..., o _{Nt + Nt + 1} } is output can be expressed by the following equation (18).

ここで、ζ_ｎは、対象のガウス分布に割り当てられたＯ_ｎの事後占有確率値である。また、状態遷移確率ａおよび混合重み因子ｗはｐ（μ｜Ｏ）の推定に関係ないため無視した。またＨＭＭやＧＭＭの潜在変数は無視したが、これらはＥＭアルゴリズム（期待値最大化アルゴリズム）を用いることによって対処可能である。実際、式（１８）はＥＭアルゴリズムにおける補助関数の形式で表現されている。 Here, ζ _n is the posterior occupancy probability value of O _n assigned to the Gaussian distribution in question. The state transition probability a and the mixture weight factor w are ignored because they are irrelevant to the estimation of p (μ | O). The latent variables of the HMM and GMM are also ignored, but they can be handled with the EM (expectation-maximization) algorithm. In fact, equation (18) is expressed in the form of an auxiliary function of the EM algorithm.

最後に音響モデル中のガウス分布の平均ベクトルパラメータの事後確率分布ｐ（μ_ｔ│Ｏ^ｔ）がガウス分布で表現されると仮定し、事後確率分布ｐ（μ_ｔ│Ｏ^ｔ）の平均ベクトルパラメータをμ＾_ｔとし、事後確率分布ｐ（μ_ｔ│Ｏ^ｔ）の共分散行列パラメータがＱ＾_ｔで表現されるとすると関数形は以下の式（１９）で表すことができる。
ｐ（μ_ｔ│Ｏ^ｔ）＝Ｎ（μ_ｔ│μ＾_ｔ、Ｑ＾_ｔ）（１９） Finally, assume that the posterior probability distribution p (μ _t | O ^t ) of the mean vector parameter of a Gaussian distribution in the acoustic model is itself expressed by a Gaussian distribution. Letting μ ^ _t be the mean vector parameter and Q ^ _t the covariance matrix parameter of the posterior probability distribution p (μ _t | O ^t ), its functional form can be expressed by the following equation (19).
p (μ _t | O ^t ) = N (μ _t | μ ^ _t , Q ^ _t ) (19)

従って、式（１７）、（１８）、及び（１９）を式（１５）に代入することにより以下の式（２０）で示される解析解を導出することができる。
ｐ（μ_ｔ＋１│Ｏ^ｔ＋１）＝Ｎ（μ_ｔ＋１│μ＾_ｔ＋１、Ｑ＾_ｔ＋１）（２０）
ここで、
Ｑ＾_ｔ＋１＝（（Ｕ＋Ａ_ｔ＋１Ｑ＾_ｔＡ_ｔ＋１’）^−１＋ζ_ｔ＋１Σ^―１）^−１
（２１）
Ｋ＾_ｔ＋１＝Ｑ＾_ｔ+１ζ_ｔ＋１Σ^―１（２２）
μ＾_ｔ＋１＝Ａ_ｔ＋１μ＾_ｔ＋ν_ｔ＋１
＋Ｋ＾_ｔ＋１（Ｍ_ｔ＋１／ζ_ｔ＋１−Ａ_ｔ＋１μ＾_ｔ−ν_ｔ＋１）（２３） Therefore, an analytical solution represented by the following equation (20) can be derived by substituting equations (17), (18), and (19) into equation (15).
p (μ _{t + 1} | O ^{t + 1} ) = N (μ _{t + 1} | μ ^ _{t + 1} , Q ^ _{t + 1} ) (20)
where
Q ^ _{t + 1} = ((U + A _{t + 1} Q ^ _t A _{t + 1} ′) ^{−1} + ζ _{t + 1} Σ ^{−1} ) ^{−1} (21)
K ^ _{t + 1} = Q ^ _{t + 1} ζ _{t + 1} Σ ^{−1} (22)
μ ^ _{t + 1} = A _{t + 1} μ ^ _t + ν _{t + 1} + K ^ _{t + 1} (M _{t + 1} / ζ _{t + 1} − A _{t + 1} μ ^ _t − ν _{t + 1} ) (23)
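The updates (21) to (23) can be illustrated in the one-dimensional case, where every matrix reduces to a scalar. The following is a minimal sketch under that assumption; the function name kalman_like_update and the example values are hypothetical, and ζ _{t + 1} is assumed to be positive.

```python
def kalman_like_update(mu_hat, q_hat, zeta, m, a, nu, u, sigma):
    """Scalar form of eqs. (21)-(23) for one Gaussian and one feature dimension.

    mu_hat, q_hat : mean and variance of the previous posterior p(mu_t | O^t)
    zeta, m       : sufficient statistics of the new block (zeta must be > 0)
    a, nu, u      : coefficient, offset, and noise variance of the dynamics
    sigma         : variance of the Gaussian in the initial acoustic model
    """
    predicted_mean = a * mu_hat + nu                       # A mu_hat_t + nu
    predicted_var = u + a * q_hat * a                      # U + A Q_hat_t A'
    q_new = 1.0 / (1.0 / predicted_var + zeta / sigma)     # eq. (21)
    gain = q_new * zeta / sigma                            # eq. (22)
    mu_new = predicted_mean + gain * (m / zeta - predicted_mean)  # eq. (23)
    return mu_new, q_new, gain


# Toy values (hypothetical): the block's weighted mean m / zeta is 2.0.
mu_new, q_new, gain = kalman_like_update(
    mu_hat=0.0, q_hat=1.0, zeta=2.0, m=4.0, a=1.0, nu=0.0, u=1.0, sigma=1.0)
print(mu_new, q_new, gain)
```

With these values the gain is below 1, so the new mean lies between the predicted mean and the block's weighted mean, and the posterior variance shrinks relative to the predicted variance.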

ただし、μ＾_ｔ＋１は今回の事後確率分布ｐ（μ_ｔ＋１│Ｏ^ｔ＋１）をガウス分布で表現した際の平均ベクトルパラメータであり、Ｑ＾_ｔ＋１は今回の事後確率分布ｐ（μ_ｔ＋１│Ｏ^ｔ＋１）をガウス分布で表現した際の共分散行列パラメータであり、Ｋ＾_ｔ＋１はカルマンゲインであり、Ａ_ｔ＋１、ν_ｔ＋１、及びＵはそれぞれ音響モデルパラメータ中の平均の前記確率的ダイナミクスを線形表現した時の係数行列、係数ベクトル、及びガウシアンノイズの共分散行列であり、Σは初期音響モデルパラメータ中の共分散行列であり、Ａ_ｔ＋1’は行列Ａ_ｔ＋1の転置を表す。ζ_ｔ＋１は、今回の事後占有確率値の和であり、Ｍ_ｔ＋１は今回の各時点におけるζと特徴量との積和であり、ζ_ｔ，Ｍ_ｔはガウス分布の平均ベクトルパラメータの十分統計量であり以下の式（２４）のように定義される。

Here, μ ^ _{t + 1} is the mean vector parameter and Q ^ _{t + 1} the covariance matrix parameter of the current posterior probability distribution p (μ _{t + 1} | O ^{t + 1} ) when it is expressed as a Gaussian distribution, and K ^ _{t + 1} is the Kalman gain. A _{t + 1} , ν _{t + 1} , and U are, respectively, the coefficient matrix, the coefficient vector, and the covariance matrix of the Gaussian noise in the linear expression of the stochastic dynamics of the means in the acoustic model parameters. Σ is the covariance matrix in the initial acoustic model parameters, and A _{t + 1} ′ denotes the transpose of the matrix A _{t + 1} . ζ _{t + 1} is the sum of the current posterior occupancy probability values, and M _{t + 1} is the sum of products of ζ and the feature amount at each time point in the current block. ζ _t and M _t are sufficient statistics of the mean vector parameter of the Gaussian distribution and are defined by the following equation (24).

ζ_ｔ，Ｍ_ｔはForward-backwardアルゴリズムやViterbiアルゴリズム、ｋｍｅａｎｓ法などのアライメント手法によって効率よく求めることができる。式（２１）〜（２３）の更新を音響モデル中の全てのガウス分布に対して行うことにより、全ての事後確率分布を更新することができる。 ζ _t and M _t can be efficiently obtained by an alignment method such as the Forward-backward algorithm, Viterbi algorithm, or kmeans method. All posterior probability distributions can be updated by updating the equations (21) to (23) for all Gaussian distributions in the acoustic model.
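Following the verbal definition of the statistics (ζ as a sum of posterior occupancy values, M as a sum of occupancy-weighted feature vectors), the sufficient statistics of one block for one Gaussian distribution might be accumulated as in the sketch below. The occupancy values are assumed to have been produced already by an alignment method; the function name sufficient_statistics and the toy data are hypothetical.

```python
def sufficient_statistics(occupancies, frames):
    """Accumulate (zeta, M) for one Gaussian over one block of frames.

    occupancies[n] : posterior occupancy of frame n for this Gaussian
    frames[n]      : feature vector of frame n (list of floats)
    """
    zeta = sum(occupancies)                        # sum of occupancy values
    dim = len(frames[0])
    # Occupancy-weighted sum of feature vectors, one entry per dimension.
    m = [sum(g * o[d] for g, o in zip(occupancies, frames))
         for d in range(dim)]
    return zeta, m


# Three 2-dimensional frames with assumed occupancies.
occupancies = [0.5, 1.0, 0.5]
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
zeta, m = sufficient_statistics(occupancies, frames)
weighted_mean = [x / zeta for x in m]   # the term M / zeta used in eq. (23)
print(zeta, m, weighted_mean)
```

Only zeta and m need to be kept for the update of a block; the individual frames are not referenced by equations (21) to (23).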

このようにして、今回の事後確率分布ｐ（μ_ｔ＋１｜Ｏ^ｔ＋１）の分布パラメータω_ｔ＋１はＱ＾_ｔ＋１，μ＾_ｔ＋１となり（図９の音響モデルパラメータ記憶部５８の括弧書き参照）、式（２１）（２２）（２３）から求めることができる。なお、式（２２）におけるＫ＾_ｔ＋１は数式の表現のしやすさのために導入したものである。実際の計算では、式（２２）と（２３）を同時に行っても良い。その場合、Ｋ＾_ｔ＋１は求める必要がない。 In this way, the distribution parameter ω _{t + 1} of the current posterior probability distribution p (μ _{t + 1} | O ^{t + 1} ) becomes Q ^ _{t + 1} , μ ^ _{t + 1} (see the parentheses in the acoustic model parameter storage unit 58 in FIG. 9), and the equation ( 21) (22) (23). Note that K ^ _{t + 1} in equation (22) is introduced for ease of expression of the equation. In actual calculation, equations (22) and (23) may be performed simultaneously. In that case, it is not necessary to obtain K ^ _{t + 1} .

つまり、音響モデルパラメータの事後確率分布ｐ（μ｜Ｏ）の漸化式はそのパラメータ（Ｑ＾，μ＾）の漸化式（２１）（２２）（２３）によって求めることができる。これは線形動的システムにおけるカルマンフィルタの解と類似している。しかし、カルマンフィルタの解はｏ_ｎ→ｏ_ｎ＋１のように各音声分析フレームごとの更新となっている。一方、本発明ではＯ_ｔ→Ｏ_ｔ＋１のように１まとまりのフレームごとの更新となっているのが違いとなっている。そのため、パラメータＱ＾，μ＾は、１フレームの特徴量ｏ_ｎではなく、その統計量であらわされている。従って、これを巨視的な線形動的システムと呼ぶ。 That is, the recurrence formula of the posterior probability distribution p (μ | O) of the acoustic model parameter can be obtained through the recurrence formulas (21), (22), and (23) of its parameters (Q ^, μ ^). This is similar to the Kalman filter solution of a linear dynamic system. However, the Kalman filter solution is updated for every speech analysis frame, as in o _n → o _{n + 1} , whereas the present invention updates for every block of frames, as in O _t → O _{t + 1} . The parameters Q ^ and μ ^ are therefore expressed not by the feature amount o _n of a single frame but by statistics of the frames. This is accordingly called a macroscopic linear dynamic system.

Ｑ＾，μ＾を用いた場合の逐次学習部５２の具体的構成例を図１１に示す。逐次学習部５２は、Ｑ＾更新部５２０、Ｋ＾更新部５２２、μ＾更新部５２４、事後確率計算部５２６とで構成されている。
Ｑ＾更新部５２０では前記式（２１）が計算され、Ｋ＾更新部５２２では前記式（２２）が計算され、μ＾更新部５２４では前記式（２３）が計算され、事後確率計算部５２６では前記式（２０）が計算される。 FIG. 11 shows a specific configuration example of the sequential learning unit 52 when Q ^ and μ ^ are used. The sequential learning unit 52 includes a Q ^ update unit 520, a K ^ update unit 522, a μ ^ update unit 524, and a posterior probability calculation unit 526.
The Q ^ update unit 520 calculates equation (21), the K ^ update unit 522 calculates equation (22), the μ ^ update unit 524 calculates equation (23), and the posterior probability calculation unit 526 calculates equation (20).

従って、Ｑ＾，μ＾を求めるためには、線形変換パラメータＷ_ｔ＋１＝｛ν_ｔ＋１，Ａ_ｔ＋１｝、システムノイズＵ_ｔ＋１、初期パラメータＱ＾_０、及びμ＾_０の４つを設定する必要がある。ここで、Ｑ＾_０は初期音響モデルの共分散行列パラメータから与えられるものであり、μ＾_０は初期音響モデルの平均ベクトルパラメータから与えられるものである。 Therefore, to obtain Q ^ and μ ^, four quantities must be set: the linear transformation parameters W _{t + 1} = {ν _{t + 1} , A _{t + 1} }, the system noise U _{t + 1} , and the initial parameters Q ^ ₀ and μ ^ ₀ . Here, Q ^ ₀ is given by the covariance matrix parameter of the initial acoustic model, and μ ^ ₀ by the mean vector parameter of the initial acoustic model.

このうち線形変換パラメータＷ＝｛ν_ｔ＋１，Ａ_ｔ＋１｝は、今回まで累積された特徴量系列Ｏ^ｔのうち少なくとも１つの特徴量系列を用いて、推定される。よく知られた手法の一例としては上述したＥＭアルゴリズムやＭＬＬＲアルゴリズムを用いて繰り返し計算により効率よく求められる。また、複数のガウス分布で同一の変換パラメータを共有することにより、推定すべきパラメータ数を減らすことが可能である。 Among these, the linear transformation parameter W = {ν _{t + 1} , A _{t + 1} } is estimated using at least one feature amount sequence among the feature amount sequences O ^t accumulated up to this time. As an example of a well-known method, it is efficiently obtained by repeated calculation using the above-described EM algorithm or MLLR algorithm. Further, by sharing the same conversion parameter among a plurality of Gaussian distributions, it is possible to reduce the number of parameters to be estimated.

システムノイズＵも線形変換パラメータＷと同様に学習によって求めることができる。または、行列成分すべてを特徴量系列やその他のデータから先験的に与えることもできる。最も単純な方法は、システムノイズＵを（ｕ^０）^−１Σとしておき、システムノイズの共分散行列が出力分布の共分散行列と比例関係にあるとするとして、ｕ^０を予め与えられるパラメータとする。つまり、１つだけパラメータが導入される。システムノイズＵと線形変換パラメータＷが、前記式（８）の分布変換関数Ｆにおける変換パラメータとなる。 Like the linear transformation parameter W, the system noise U can be obtained by learning. Alternatively, all of its matrix components can be given a priori from the feature amount series or other data. The simplest method is to set the system noise U to (u ⁰ ) ^{−1} Σ, that is, to assume that the covariance matrix of the system noise is proportional to the covariance matrix of the output distribution, and to treat u ⁰ as a parameter given in advance. In this case only one parameter is introduced. The system noise U and the linear transformation parameter W are the conversion parameters of the distribution conversion function F in equation (8).

このとき更新式は、以下の式（２５）（２６）（２７）で表され、Ｑ＾更新部５２０では前記式（２５）が計算され、Ｋ＾更新部５２２では前記式（２６）が計算され、μ＾更新部５２４では前記式（２７）が計算される。
Ｑ＾_ｔ＋１＝（（（ｕ^０）^−１Σ＋Ａ_ｔ＋１Ｑ＾_ｔＡ_ｔ＋１’）^−１＋ζ_ｔ＋１Σ^―１）^−１（２５）
Ｋ＾_ｔ＋１＝Ｑ＾_ｔ+１ζ_ｔ＋１Σ^―１（２６）
μ＾_ｔ＋１＝Ａ_ｔ＋１μ＾_ｔ＋ν_ｔ＋１＋Ｋ＾_ｔ＋１（Ｍ_ｔ＋１／ζ_ｔ＋１−Ａ_ｔ＋１μ＾_ｔ−ν_ｔ＋１）（２７）
以上によってパラメータｕ^０によって制御される分布変換にもとづく逐次適応法を実現できる。 At this time, the update formula is expressed by the following formulas (25), (26), and (27), the Q ^ update unit 520 calculates the formula (25), and the K ^ update unit 522 calculates the formula (26). Then, the μ ^ update unit 524 calculates the equation (27).
Q ^ _{t + 1} = (((u ⁰ ) ^{−1} Σ + A _{t + 1} Q ^ _t A _{t + 1} ′) ^{−1} + ζ _{t + 1} Σ ^{−1} ) ^{−1} (25)
K ^ _{t + 1} = Q ^ _{t + 1} ζ _{t + 1} Σ ^{−1} (26)
μ ^ _{t + 1} = A _{t + 1} μ ^ _t + ν _{t + 1} + K ^ _{t + 1} (M _{t + 1} / ζ _{t + 1} − A _{t + 1} μ ^ _t − ν _{t + 1} ) (27)
As described above, the sequential adaptation method based on the distribution transformation controlled by the parameter u ⁰ can be realized.

［平行移動適応］
前記線形ダイナミクスの式（１６）の平均ベクトルμ_ｔの平行移動ν_ｔ＋１にだけ注目することにより、推定すべきパラメータを少なくしてより少量データでの適応を実現できる。このとき、前記式（２５）（２６）（２７）における行列Ａ_ｔ＋１を単位行列Ｉとする、つまり、Ａ_ｔ＋１＝Ｉとすると、Ｑ＾、Ｋ＾、μ＾は以下の式（２８）（２９）（３０）で計算される。 [Translation adaptation]
By focusing only on the translation ν _{t + 1} of the mean vector μ _t in the linear dynamics of equation (16), the number of parameters to be estimated is reduced and adaptation with an even smaller amount of data becomes possible. In this case, setting the matrix A _{t + 1} in equations (25), (26), and (27) to the identity matrix I, that is, A _{t + 1} = I, Q ^, K ^, and μ ^ are calculated by the following equations (28), (29), and (30).

Ｑ＾_ｔ＋１＝（（（ｕ^０）^−１Σ＋Ｑ＾_ｔ’）^−１＋ζ_ｔ＋１Σ^―１）^−１（２８）
Ｋ＾_ｔ＋１＝Ｑ＾_ｔ+１ζ_ｔ＋１Σ^―１（２９）
μ＾_ｔ＋１＝μ＾_ｔ＋ν_ｔ＋１＋Ｑ＾_ｔ+１ζ_ｔ＋１Σ^―１（Ｍ_ｔ＋１／ζ_ｔ＋１−μ＾_ｔ−ν_ｔ＋１）（３０）
この場合、Ｑ＾更新部５２０では前記式（２８）が計算され、Ｋ＾更新部５２２では前記式（２９）が計算され、μ＾更新部５２４では前記式（３０）が計算される。
C.J.Leggetter and P.C.Woodland,Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language,Vol.9,pp.171-185,1995. C.J.Leggetter and P.C.Woodland,Maximum.Flexible speaker using maximum likehood linear regression In Proc ARPA Spoken Language Technology Work-shop,pp.104-109,1995. 渡部晋治、中村篤、確率分布の巨視的な時間発展システムに基づく逐次モデル適応．秋季音響学会講演論文集、２−２−１０，ｐｐ．７１−７２，２００６．渡部晋治、中村篤、確率分布の巨視的な時間発展系に基づくモデル適応との従来型適応との関係の考察．秋季音響学会講演論文集２−３−１２、２００７．特開２００８−６４８４９号
Q ^ _{t + 1} = (((u ⁰ ) ^{−1} Σ + Q ^ _t ′) ^{−1} + ζ _{t + 1} Σ ^{−1} ) ^{−1} (28)
K ^ _{t + 1} = Q ^ _{t + 1} ζ _{t + 1} Σ ^{−1} (29)
μ ^ _{t + 1} = μ ^ _t + ν _{t + 1} + Q ^ _{t + 1} ζ _{t + 1} Σ ^{−1} (M _{t + 1} / ζ _{t + 1} − μ ^ _t − ν _{t + 1} ) (30)
In this case, the Q ^ update unit 520 calculates the equation (28), the K ^ update unit 522 calculates the equation (29), and the μ ^ update unit 524 calculates the equation (30).
C. J. Leggetter and P. C. Woodland, Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, Vol. 9, pp. 171-185, 1995.
C. J. Leggetter and P. C. Woodland, Flexible speaker adaptation using maximum likelihood linear regression. In Proc. ARPA Spoken Language Technology Workshop, pp. 104-109, 1995.
Shinji Watanabe, Atsushi Nakamura, Sequential model adaptation based on a macroscopic time evolution system of probability distributions. Proceedings of the Autumn Meeting of the Acoustical Society of Japan, 2-2-10, pp. 71-72, 2006.
Shinji Watanabe, Atsushi Nakamura, Consideration of the relationship between model adaptation based on a macroscopic time evolution system of probability distributions and conventional adaptation. Proceedings of the Autumn Meeting of the Acoustical Society of Japan, 2-3-12, 2007.
JP 2008-64849 A

オンライン逐次適応タスクでは、１秒程度の非常に短い時間でのモデル更新が、実時間処理のためには必要である。しかし、分布変換に基づく適応法の更新は式式（２８）〜（３０）から分かるとおり、行列Ｑ＾_ｔやΣの演算で表現されており、この行列の積、和および逆行列演算を全てのガウス分布について行う必要があり、計算コストが非常に高く、実時間処理が困難である。例えば、通常の音響モデルは３９次元のガウス分布を数万個含むが、これに対して式（２８）〜（３０）を実行することは３９×３９の行列Ａ、Ｑ＾_ｔやΣ（ただし、Σは通常、対角行列を用いる）の積、和および逆行列をガウス分布数分（数万回）、行う必要があるため、非常に計算に時間がかかる。 In an online sequential adaptation task, the model must be updated in a very short time, on the order of one second, to allow real-time processing. However, as equations (28) to (30) show, the update of the adaptation method based on distribution transformation is expressed through operations on the matrices Q ^ _t and Σ, and the matrix products, sums, and inversions must be carried out for every Gaussian distribution. The calculation cost is therefore very high and real-time processing is difficult. For example, an ordinary acoustic model contains tens of thousands of 39-dimensional Gaussian distributions; executing equations (28) to (30) on it requires products, sums, and inversions of the 39 × 39 matrices A, Q ^ _t , and Σ (Σ is usually a diagonal matrix) once per Gaussian distribution (tens of thousands of times), which takes a very long time.

また、式（２８）〜（３０）は逆行列の計算を含むため音声データによっては、計算が不安定になり逆行列が求まらなくなる。
また分布パラメータＱ＾_ｔはモデルの更新に必要なため、それらを音響モデルパラメータ記憶部５８に記憶する必要がある。しかし、Ｑ＾_ｔは非対角成分が０でない全共分散行列（ただし、対称行列）であり、それが音響モデル中のガウス分布数分存在するため、大量のメモリを消費する。例えば、音響モデルは３９次元のガウス分布数万個で表現される音響モデルが数メガバイト程度なのに対し、Ｑ＾_ｔだけで、音響モデルの１０倍以上のメモリ（数１０メガバイト）を消費する。 Also, since equations (28) to (30) include calculation of the inverse matrix, the calculation becomes unstable depending on the voice data, and the inverse matrix cannot be obtained.
Further, since the distribution parameter Q ^ _t is needed to update the model, it must be stored in the acoustic model parameter storage unit 58. However, Q ^ _t is a full covariance matrix whose off-diagonal components are nonzero (although it is symmetric), and one such matrix exists for every Gaussian distribution in the acoustic model, so a large amount of memory is consumed. For example, whereas an acoustic model expressed by tens of thousands of 39-dimensional Gaussian distributions occupies about several megabytes, Q ^ _t alone consumes at least 10 times as much memory (several tens of megabytes).
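The order of magnitude of these figures can be checked with simple arithmetic. The sketch below assumes 20,000 Gaussian distributions, 4-byte values, and an acoustic model that stores only a mean vector and a diagonal covariance per Gaussian distribution (mixture weights and transition probabilities are ignored); none of these specific numbers is stated in the specification.

```python
DIM = 39                 # feature dimension given in the text
NUM_GAUSSIANS = 20_000   # assumed; the text only says "tens of thousands"
BYTES_PER_VALUE = 4      # assumed single-precision storage

# Values per Gaussian in the model: mean vector + diagonal covariance.
model_values = 2 * DIM
# Values per Gaussian for Q_hat_t: unique entries of a symmetric DIM x DIM matrix.
q_values = DIM * (DIM + 1) // 2

model_mb = NUM_GAUSSIANS * model_values * BYTES_PER_VALUE / 1e6
q_mb = NUM_GAUSSIANS * q_values * BYTES_PER_VALUE / 1e6
ratio = q_values / model_values

print(f"model: {model_mb:.2f} MB, Q_hat: {q_mb:.2f} MB, ratio: {ratio:.1f}")
```

Under these assumptions the symmetric matrix needs 780 values per Gaussian distribution against 78 for the model itself, which is consistent with the factor of about 10 quoted above.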

このように、分布変換に基づく適応法は計算量が多く、計算が不安定であり、メモリを多く消費する。従って、それらを用いて、１秒程度の非常に短いで逐次更新を行う、オンライン逐次適応タスクの実現をするのは困難であった。 As described above, the adaptive method based on the distribution transformation has a large amount of calculation, is unstable in calculation, and consumes a lot of memory. Therefore, it has been difficult to realize an online sequential adaptation task that uses them for sequential update in a very short time of about 1 second.

この発明では、従来と比べて、計算量、メモリ量を削減させ、計算の安定性を向上させた音響モデル作成装置、その装置を用いた音声認識装置、これらの方法、これらのプログラム、およびこれらの記録媒体を提供する。 In the present invention, an acoustic model creation device that reduces the amount of calculation and the amount of memory and improves the stability of calculation compared to the prior art, a speech recognition device using the device, these methods, these programs, and these A recording medium is provided.

この発明の音響モデル作成装置は、特徴抽出部と、逐次学習部と、モデル更新部と、を具備する。特徴抽出部は、今回の適応用音声データの特徴量系列を抽出する。逐次学習部は、音響モデル中のガウス分布の平均ベクトルパラメータの事後確率分布をガウス分布で表現した際の、当該事後確率分布の平均ベクトルパラメータ、当該事後確率分布の共分散行列パラメータに対するスケーリング因子、初期音響モデルパラメータ中の共分散行列で表されることに基づき、前回までの累積された特徴量系列が加味された、前回求めた音響モデル中のガウス分布の平均ベクトルパラメータの事後確率分布及び今回まで累積された特徴量系列の一部を用いて、今回の音響モデル中のガウス分布の平均ベクトルパラメータの事後確率分布をガウス分布で表現した際の、当該事後確率分布の平均ベクトルパラメータ及び当該事後確率分布の共分散行列パラメータのスケーリング因子を計算することで、今回の音響モデル中のガウス分布の平均ベクトルパラメータの事後確率分布を求める。モデル更新部は、今回の音響モデルパラメータの事後確率分布を新たな音響モデルパラメータに変換して更新する。 The acoustic model creation device of the present invention includes a feature extraction unit, a sequential learning unit, and a model update unit. The feature extraction unit extracts a feature amount series from the current adaptation speech data. The sequential learning unit relies on the fact that, when the posterior probability distribution of the mean vector parameter of a Gaussian distribution in the acoustic model is expressed as a Gaussian distribution, it is represented by the mean vector parameter of that posterior probability distribution, a scaling factor for the covariance matrix parameter of that posterior probability distribution, and the covariance matrix in the initial acoustic model parameters. Using the previously obtained posterior probability distribution of the mean vector parameter of the Gaussian distribution in the acoustic model, which reflects the feature amount series accumulated up to the previous time, together with a part of the feature amount series accumulated up to the current time, the sequential learning unit calculates the mean vector parameter of the current posterior probability distribution and the scaling factor of its covariance matrix parameter, and thereby obtains the current posterior probability distribution of the mean vector parameter of the Gaussian distribution in the acoustic model.
The model update unit converts the posterior probability distribution of the current acoustic model parameter into a new acoustic model parameter and updates it.

この発明では、音響モデル中のあるガウス分布の平均ベクトルパラメータμ_ｔの事後確率分布ｐ（μ_ｔ│Ｏ^ｔ）に対して、その共分散行列パラメータＱ＾_ｔを、対象とする音響モデルのガウス分布の共分散行列Σと、本発明で新たに導入するスカラー変数であるスケーリング因子ｒ＾_ｔの逆数（ｒ＾_ｔ）^−１を用いて、以下の式のように掛け合わせたもので表現する。
Ｑ＾_ｔ＝（ｒ＾_ｔ）^−１Σ （３１）
これにより、式（２８）〜（３０）はスカラー演算に直すことができるため、計算量の削減および安定性の確保を実現することができる。また、記憶すべき更新パラメータが対称行列Ｑ＾_ｔからｒ＾_ｔとなるため音響モデル記憶部中のメモリ容量を削減できる。 In the present invention, with respect to the posterior probability distribution p (μ _t | O ^t ) of an average vector parameter μ _t of a certain Gaussian distribution in the acoustic model, the covariance matrix parameter Q ^ _t is used as the Gauss of the target acoustic model. Using the covariance matrix Σ of the distribution and the reciprocal (r ^ _t ) ⁻¹ of the scaling factor r ^ _t , which is a scalar variable newly introduced in the present invention, it is expressed by the following equation. .
Q ^ _t = (r ^ _t ) ^{−1} Σ (31)
Thereby, since Formulas (28) to (30) can be rewritten into a scalar calculation, the calculation amount can be reduced and the stability can be ensured. Further, since the update parameter to be stored is changed from the symmetric matrix Q ^ _t to r ^ _t , the memory capacity in the acoustic model storage unit can be reduced.
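A small numerical check can make this claim concrete. Assuming a diagonal Σ, U = (u ⁰ ) ^{−1} Σ, and Q ^ _t = (r ^ _t ) ^{−1} Σ, the covariance update of equation (28) yields a matrix that is again a scalar multiple of Σ, so only one scalar per Gaussian distribution needs to be tracked. The scalar recursion in scaling_update is what this substitution gives when worked out by hand for this sketch; the function names and the values are hypothetical.

```python
def full_update(q_diag, sigma_diag, u0, zeta):
    """Eq. (28) applied per dimension, for diagonal Sigma and diagonal Q_hat_t."""
    return [1.0 / (1.0 / (s / u0 + q) + zeta / s)
            for q, s in zip(q_diag, sigma_diag)]


def scaling_update(r, u0, zeta):
    """Scalar recursion obtained by substituting Q_hat_t = Sigma / r into eq. (28)."""
    return 1.0 / (1.0 / u0 + 1.0 / r) + zeta


sigma = [0.5, 2.0, 4.0]          # assumed diagonal of Sigma
r, u0, zeta = 3.0, 1.5, 2.0      # assumed scaling factor, u^0, and occupancy sum

q = [s / r for s in sigma]                    # Q_hat_t = Sigma / r_hat_t, eq. (31)
q_next = full_update(q, sigma, u0, zeta)      # per-dimension (matrix-style) route
r_next = scaling_update(r, u0, zeta)          # single-scalar route
q_next_from_r = [s / r_next for s in sigma]   # Q_hat_{t+1} rebuilt from the scalar

print(r_next, q_next, q_next_from_r)
```

Both routes give the same covariance, but the second one stores and updates a single number instead of a matrix and involves no matrix inversion.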

以下に、発明を実施するための最良の形態を示す。なお、同じ機能を持つ構成部や同じ処理を行う過程には同じ番号を付し、重複説明を省略する。 The best mode for carrying out the invention is described below. Components having the same function and steps performing the same processing are given the same reference numbers, and duplicate description is omitted.

まず、改めて、記号について定義する。
μ_ｔ前回の音響モデル中のガウス分布の平均ベクトルパラメータ
Σ_ｔ前回の音響モデル中のガウス分布の共分散行列パラメータ
ｐ（μ_ｔ│Ｏ^ｔ）前回の音響モデル中のガウス分布の平均ベクトルパラメータμ_ｔの事後分布確率
μ＾_ｔ音響モデル中のガウス分布の平均ベクトルパラメータμ_ｔの事後確率分布ｐ（μ_ｔ│Ｏ^ｔ）をガウス分布で表現した際の平均ベクトルパラメータ、もしくは、ｐ（μ_ｔ│Ｏ^ｔ）の平均ベクトルパラメータ
Ｑ＾_ｔ音響モデル中のガウス分布の平均ベクトルパラメータμ_ｔの事後確率分布ｐ（μ_ｔ│Ｏ^ｔ）をガウス分布で表現した際の共分散行列パラメータ、もしくは、ｐ（μ_ｔ│Ｏ^ｔ）の共分散行列パラメータ
ｒ＾_ｔ音響モデル中のガウス分布の平均ベクトルパラメータμ_ｔの事後確率分布ｐ（μ_ｔ│Ｏ^ｔ）をガウス分布で表現した際の共分散行列パラメータＱ＾_ｔに対するスケーリング因子、もしくは、ｐ（μ_ｔ│Ｏ^ｔ）の共分散行列パラメータＱ＾_ｔに対するスケーリング因子 First, the symbol is defined again.
μ _t : mean vector parameter of the Gaussian distribution in the previous acoustic model
Σ _t : covariance matrix parameter of the Gaussian distribution in the previous acoustic model
p (μ _t | O ^t ): posterior probability distribution of the mean vector parameter μ _t of the Gaussian distribution in the previous acoustic model
μ ^ _t : mean vector parameter obtained when the posterior probability distribution p (μ _t | O ^t ) of the mean vector parameter μ _t of the Gaussian distribution in the acoustic model is expressed as a Gaussian distribution; in short, the mean vector parameter of p (μ _t | O ^t )
Q ^ _t : covariance matrix parameter obtained when the posterior probability distribution p (μ _t | O ^t ) of the mean vector parameter μ _t of the Gaussian distribution in the acoustic model is expressed as a Gaussian distribution; in short, the covariance matrix parameter of p (μ _t | O ^t )
r ^ _t : scaling factor for the covariance matrix parameter Q ^ _t obtained when the posterior probability distribution p (μ _t | O ^t ) of the mean vector parameter μ _t of the Gaussian distribution in the acoustic model is expressed as a Gaussian distribution; in short, the scaling factor for the covariance matrix parameter Q ^ _t of p (μ _t | O ^t )

実施例１の音響モデル作成装置の機能構成例を図１２に示し、処理の流れを図１０を用いて説明し、図１３に逐次適応法を用いた場合の音響モデルパラメータが変換される手順を示し、図１４に逐次学習部の機能構成例を示す。図１２に示すように、音響モデル作成装置１４８は、特徴抽出部４、特徴量記憶部５、モデル適応化部１５０と、で構成され、モデル適応化部１５０は逐次学習部１５２、事後確率記憶部１５４、モデル更新部１５６とで構成される。また、逐次学習部１５２は、ｒ＾更新部１５２２、μ＾更新部１５２４、事後確率計算部１５２６、とを有する。 FIG. 12 shows a functional configuration example of the acoustic model creation apparatus according to the first embodiment. The flow of processing will be described with reference to FIG. 10. FIG. 13 shows a procedure for converting acoustic model parameters when the sequential adaptation method is used. FIG. 14 shows a functional configuration example of the sequential learning unit. As illustrated in FIG. 12, the acoustic model creation device 148 includes a feature extraction unit 4, a feature amount storage unit 5, and a model adaptation unit 150. The model adaptation unit 150 includes a sequential learning unit 152 and a posterior probability storage. Unit 154 and model update unit 156. The sequential learning unit 152 includes an r ^ update unit 1522, a μ ^ update unit 1524, and a posterior probability calculation unit 1526.

まず、前回の事後確率分布の平均ベクトルパラメータμ＾_ｔ、スケーリング因子ｒ＾_ｔ（ｒ＾_ｔについては後述する）がモデル適応化部１５０で読み込まれる（ステップＳ６０）。そして、適応用音声データ２０が読み込まれ（ステップＳ６２）、適応用音声データが特徴抽出部４に入力され、特徴量系列Ｏ_ｔ＋１に変換される（ステップＳ６４）。変換された特徴量系列Ｏ_ｔ＋１は一旦、特徴量記憶部５に記憶され、逐次学習部１５２に入力される。 First, the mean vector parameter μ ^ _t and the scaling factor r ^ _t (r ^ _t will be described later) of the previous posterior probability distribution are read by the model adaptation unit 150 (step S60). Then, the adaptation speech data 20 is read (step S62), input to the feature extraction unit 4, and converted into the feature amount series O _{t + 1} (step S64). The converted feature amount series O _{t + 1} is temporarily stored in the feature amount storage unit 5 and then input to the sequential learning unit 152.

そして、逐次学習部１５２の処理としてまず、（ｉ）音響モデル中のガウス分布の平均ベクトルパラメータμ_ｔの事後確率分布ｐ（μ_ｔ│Ｏ^ｔ）が、当該事後確率分布ｐ（μ_ｔ│Ｏ^ｔ）の平均ベクトルパラメータμ＾_ｔ、事後確率分布ｐ（μ_ｔ│Ｏ^ｔ）の共分散行列パラメータＱ＾_ｔに対するスケーリング因子ｒ＾_ｔ、初期音響モデルパラメータ中の共分散行列Σ、で表現されるガウス分布で表されることに基づく。そして、（ｉｉ）前回までの累積された特徴量系列Ｏ^ｔが加味された、前回求めた音響モデル中のガウス分布の平均ベクトルパラメータμ_ｔの事後確率分布ｐ（μ_ｔ│Ｏ^ｔ）及び（ｉｉｉ）今回まで累積された特徴量系列Ｏ^ｔの一部を用いる。（ｉｖ）今回の音響モデル中のガウス分布の平均ベクトルパラメータの事後確率分布をガウス分布で表現した際の、平均ベクトルパラメータμ＾_ｔ＋１及び共分散行列のスケーリング因子ｒ＾_ｔ＋１を計算することで、今回の音響モデル中のガウス分布の平均ベクトルパラメータの事後確率分布ｐ（μ_ｔ＋１│Ｏ^ｔ＋１）を求める。 The processing of the sequential learning unit 152 is as follows. (i) It is based on the fact that the posterior probability distribution p (μ _t | O ^t ) of the mean vector parameter μ _t of a Gaussian distribution in the acoustic model is represented by a Gaussian distribution expressed with the mean vector parameter μ ^ _t of that posterior probability distribution, the scaling factor r ^ _t for its covariance matrix parameter Q ^ _t , and the covariance matrix Σ in the initial acoustic model parameters. It uses (ii) the previously obtained posterior probability distribution p (μ _t | O ^t ) of the mean vector parameter μ _t , which reflects the feature amount series O ^t accumulated up to the previous time, and (iii) a part of the feature amount series O ^t accumulated up to the current time. (iv) By calculating the mean vector parameter μ ^ _{t + 1} and the covariance matrix scaling factor r ^ _{t + 1} of the current posterior probability distribution expressed as a Gaussian distribution, it obtains the current posterior probability distribution p (μ _{t + 1} | O ^{t + 1} ) of the mean vector parameter of the Gaussian distribution in the acoustic model.

Details are described below. To focus on improving the amount of computation, stability, and the amount of memory, this embodiment discusses translation adaptation, which has few parameters (explained in the paragraph [Parallel adaptation] above), rather than linear regression adaptation. Points (i) to (iv) are explained in turn.
Substituting $\hat{K}_{t+1}$ of equation (29) into equation (30), equations (28) to (30) can be written as follows.
$$\hat{Q}_{t+1} = \left( \left( (\mu^0)^{-1}\Sigma + \hat{Q}_t \right)^{-1} + \zeta_{t+1}\Sigma^{-1} \right)^{-1} \qquad (32)$$
$$\hat{\mu}_{t+1} = \hat{\mu}_t + \nu_{t+1} + \hat{Q}_{t+1}\,\zeta_{t+1}\Sigma^{-1}\left( M_{t+1}/\zeta_{t+1} - \hat{\mu}_t - \nu_{t+1} \right) \qquad (33)$$

To reduce the number of distribution parameters, as shown in equation (31), the covariance matrix $\hat{Q}_t$ of the posterior distribution over the mean vector $\mu_t$ of a Gaussian distribution in the acoustic model is expressed as the covariance matrix $\Sigma$ of that Gaussian distribution multiplied by the scaling factor $\hat{r}_t$ or its reciprocal $(\hat{r}_t)^{-1}$. For reference, equation (31) is repeated below. The scaling factor $\hat{r}_t$ is a real-valued (scalar) parameter.
$$\hat{Q}_t = (\hat{r}_t)^{-1}\Sigma \qquad (31)$$
Substituting equation (31) into equation (32), the update equation for $\hat{Q}_{t+1}$ can be expressed as follows.
$$\hat{Q}_{t+1} = \left( \left( (\mu^0)^{-1}\Sigma + (\hat{r}_t)^{-1}\Sigma \right)^{-1} + \zeta_{t+1}\Sigma^{-1} \right)^{-1} = \left( \left( (\mu^0)^{-1} + (\hat{r}_t)^{-1} \right)^{-1} + \zeta_{t+1} \right)^{-1}\Sigma \qquad (34)$$

Writing equation (31) for the current time gives $\hat{Q}_{t+1} = (\hat{r}_{t+1})^{-1}\Sigma$. Since its right-hand side equals the right-hand side of equation (34), the following equation (35) holds.
$$\hat{r}_{t+1} = \left( (\mu^0)^{-1} + (\hat{r}_t)^{-1} \right)^{-1} + \zeta_{t+1} \qquad (35)$$
That is, the update equation for $\hat{Q}_{t+1}$ in equation (32) can be rewritten as the update equation for $\hat{r}_{t+1}$ in equation (35).
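The equivalence between the matrix recursion and the scalar recursion can be checked numerically. The following Python sketch assumes a diagonal covariance matrix so that plain lists suffice; the function names and the sample values are hypothetical and chosen only for illustration.

```python
def q_update_diag(q_prev, sigma, zeta, mu0):
    """Equation (32), evaluated per dimension for diagonal Q and Sigma (assumption)."""
    return [1.0 / (1.0 / (s / mu0 + q) + zeta / s) for q, s in zip(q_prev, sigma)]


def r_update(r_prev, zeta, mu0):
    """Equation (35): scalar update of the scaling factor."""
    return 1.0 / (1.0 / mu0 + 1.0 / r_prev) + zeta


# Illustrative values only (not taken from the document).
sigma = [0.5, 2.0, 4.0]                 # diagonal of Sigma
mu0, r_prev, zeta = 10.0, 3.0, 1.7

q_prev = [s / r_prev for s in sigma]    # equation (31): Q_t = Sigma / r_t
q_new = q_update_diag(q_prev, sigma, zeta, mu0)
r_new = r_update(r_prev, zeta, mu0)

# If the derivation holds, Q_{t+1} equals Sigma / r_{t+1} in every dimension.
max_gap = max(abs(q - s / r_new) for q, s in zip(q_new, sigma))
```

Under these assumptions `max_gap` stays at floating-point rounding level, which is consistent with the claim that a single scalar per Gaussian carries the same information as the full matrix $\hat{Q}_{t+1}$.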

一方、μ＾_ｔ＋１についての更新式について検討すると、前記式（３１）を式（３３）に代入すると以下の式のようになる。

On the other hand, considering the update equation for μ ^ _{t + 1} , substituting equation (31) into equation (33) yields the following equation.
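The substitution can be carried out explicitly. The following is a sketch derived only from equations (31) and (33) as stated in this text, using $\hat{Q}_{t+1} = (\hat{r}_{t+1})^{-1}\Sigma$; it is offered as the form that equation (36) presumably takes, not as a verbatim reproduction of it.

```latex
\begin{aligned}
\hat{\mu}_{t+1}
  &= \hat{\mu}_t + \nu_{t+1}
     + (\hat{r}_{t+1})^{-1}\Sigma\,\zeta_{t+1}\Sigma^{-1}
       \left( \frac{M_{t+1}}{\zeta_{t+1}} - \hat{\mu}_t - \nu_{t+1} \right) \\
  &= \hat{\mu}_t + \nu_{t+1}
     + \frac{\zeta_{t+1}}{\hat{r}_{t+1}}
       \left( \frac{M_{t+1}}{\zeta_{t+1}} - \hat{\mu}_t - \nu_{t+1} \right)
\end{aligned}
```

Because $\Sigma\Sigma^{-1}$ cancels, only scalar and vector operations remain in the second line.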

Here, as described above, $\mu^0$ is a predetermined constant, $\hat{r}_t$ is the previous scaling factor, $\zeta_{t+1}$ is the sum of the current posterior occupation probability values, $M_{t+1}$ is the sum, over the current time points, of the products of $\zeta$ and the feature quantity, and $\nu_{t+1}$ is the coefficient vector obtained when the stochastic dynamics of the mean in the current acoustic model parameters are expressed linearly. The reciprocal $(\hat{r}_t)^{-1}$ of the scaling factor is used purely for notational reasons: if $\hat{r}_t$ were used directly, the left-hand side of equation (35) would become $(\hat{r}_{t+1})^{-1}$. Either form may be used in an implementation. Arbitrary real values may be given as the initial values $\zeta_0$, $M_0$, and $r_0$. Substituting equation (31) for $\hat{Q}_t$ in equation (20) gives the following equation (37).
$$p(\mu_{t+1} \mid O^{t+1}) = N\!\left( \mu_{t+1} \mid \hat{\mu}_{t+1},\ (\hat{r}_{t+1})^{-1}\Sigma \right) \qquad (37)$$

That is, as stated in (i) above, equation (37) shows that the posterior probability distribution $p(\mu_t \mid O^t)$ of the mean vector parameter $\mu_t$ of a Gaussian distribution in the acoustic model is represented by a Gaussian distribution expressed by the mean vector parameter $\hat{\mu}_t$ of that posterior probability distribution, the scaling factor $\hat{r}_t$ for its covariance matrix parameter $\hat{Q}_t$, and the covariance matrix $\Sigma$ in the initial acoustic model parameters.

Next, point (ii) is explained: the use of the previously obtained posterior probability distribution $p(\mu_t \mid O^t)$ of the mean vector parameter $\mu_t$, which reflects the feature quantity series $O^t$ accumulated up to the previous time. Equations (35) and (36) are used to evaluate equation (37), and as is clear from equations (35) and (36), the previous mean vector parameter $\hat{\mu}_t$ and the previous scaling factor $\hat{r}_t$ are required. From equation (37), the previously obtained posterior probability distribution of the acoustic model parameter is
$$p(\mu_t \mid O^t) = N\!\left( \mu_t \mid \hat{\mu}_t,\ (\hat{r}_t)^{-1}\Sigma \right) \qquad (37')$$
Therefore, using the previous mean vector parameter $\hat{\mu}_t$ and the previous scaling factor $\hat{r}_t$ amounts to using the posterior probability distribution $p(\mu_t \mid O^t)$.

Next, point (iii) is explained: the use of a part of the feature quantity series $O^t$ accumulated so far. As described above, the current coefficient vector $\nu_{t+1}$ is estimated using the EM algorithm or the MLLR algorithm. This estimation uses a part of $O^t = O_1, O_2, \ldots, O_t$. The current posterior probability distribution $p(\mu_{t+1} \mid O^{t+1})$ of the acoustic model parameter is then obtained from equation (37).

Finally, point (iv) is explained: expressing the posterior probability distribution of the mean vector parameter of the Gaussian distribution in the current acoustic model as a Gaussian distribution. This is done because equation (19) is obtained by assuming that the posterior probability distribution $p(\mu_t \mid O^t)$ of the mean vector parameter of the Gaussian distribution in the acoustic model is expressed by a Gaussian distribution.

Referring to FIG. 14, the $\hat{r}$ update unit 1522 updates $\hat{r}$ by evaluating equation (35), and the $\hat{\mu}$ update unit 1524 updates $\hat{\mu}$ by evaluating equation (36). The posterior probability calculation unit 1526 evaluates equation (37) to obtain the current posterior probability distribution $p(\mu_{t+1} \mid O^{t+1})$ of the acoustic model parameter. The model update unit 156 is described in the second embodiment. As a modification, equation (35) is only a preferred example: with any equation close to equation (35), the current scaling factor $\hat{r}_{t+1}$ can be obtained from the previous scaling factor $\hat{r}_t$ and the sum $\zeta_{t+1}$ of the current posterior occupation probability values. Similarly, equation (36) is only a preferred example: with any equation close to equation (36), the current mean vector parameter $\hat{\mu}_{t+1}$ can be obtained from the previous mean vector parameter $\hat{\mu}_t$, the coefficient vector $\nu_{t+1}$ obtained when the stochastic dynamics of the mean in the current acoustic model parameters are expressed linearly, the sum $\zeta_{t+1}$ of the current posterior occupation probability values, the product sum $M_{t+1}$ of $\zeta$ and the feature quantity at each current time point, and the current scaling factor $\hat{r}_{t+1}$.
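As a rough illustration of how the $\hat{r}$ update and the $\hat{\mu}$ update could be chained for one Gaussian, the following Python sketch performs a single adaptation step. It is a sketch under stated assumptions, not the patented implementation: all function and argument names are hypothetical, vectors are plain lists, and the mean update is equation (33) with $\hat{Q}_{t+1}$ replaced by $(\hat{r}_{t+1})^{-1}\Sigma$ from equation (31), which is presumed to correspond to equation (36).

```python
from typing import List, Tuple


def update_r(r_prev: float, zeta: float, mu0: float) -> float:
    """Scaling-factor update in the form of equation (35); all quantities are scalars."""
    return 1.0 / (1.0 / mu0 + 1.0 / r_prev) + zeta


def update_mu(mu_prev: List[float], nu: List[float], m_sum: List[float],
              zeta: float, r_new: float) -> List[float]:
    """Mean update: equation (33) with Q_hat = Sigma / r_new, so Sigma cancels.

    (zeta / r_new) * (M / zeta - mu - nu) is written as (M - zeta * (mu + nu)) / r_new
    so that a block with zeta == 0 does not divide by zero.
    """
    return [m + v + (s - zeta * (m + v)) / r_new
            for m, v, s in zip(mu_prev, nu, m_sum)]


def sequential_step(mu_prev: List[float], r_prev: float, nu: List[float],
                    m_sum: List[float], zeta: float,
                    mu0: float) -> Tuple[List[float], float]:
    """One adaptation step: update r first, then use the new r for the mean."""
    r_new = update_r(r_prev, zeta, mu0)
    mu_new = update_mu(mu_prev, nu, m_sum, zeta, r_new)
    return mu_new, r_new


# Hypothetical example: two frames fully assigned to this Gaussian, each observing 1.0.
mu_new, r_new = sequential_step(mu_prev=[0.0], r_prev=1.0, nu=[0.0],
                                m_sum=[2.0], zeta=2.0, mu0=1.0)
```

Only one scalar and one vector per Gaussian are carried from step to step, and no matrix product or inverse appears in the loop.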

そして、モデル更新部５６は、今回の音響モデルパラメータ中の事後確率分布ｐ（μ_ｔ＋１│O^ｔ＋１）を新たな音響モデルパラメータに変換して更新する（ステップＳ７０）。 Then, the model update unit 56 converts the posterior probability distribution p (μ _{t + 1} | O ^{t + 1} ) in the current acoustic model parameter into a new acoustic model parameter and updates it (step S70).

Next, the effects of this embodiment are described. As equations (35) and (36) show, using equation (31) replaces the covariance matrix $\hat{Q}_t$ of the distribution parameters with $\hat{r}_t$ and $\Sigma$, and $\Sigma$ then cancels, so that no matrix expression remains in the equations. Comparing the update equations (35) and (36) of this first embodiment with the conventional update equations (32) and (33), equations (35) and (36) require no matrix operations (products, sums, or inverses), so the computation is faster and stability can be ensured.
In addition, with equations (35) and (36), sequential adaptation no longer records the covariance matrix $\hat{Q}_t$ and the mean vector parameter $\hat{\mu}_t$ of the distribution parameters, as in the acoustic model storage unit 58 in FIG. 9. Instead it records the scaling factor $\hat{r}_t$ and the mean vector $\hat{\mu}_t$, as in the acoustic model storage unit 158 in FIGS. 12 and 13, which greatly reduces the amount of memory.

以上の方法により音響モデルパラメータの事後確率分布ｐ（μ｜Ｏ）つまり、音響モデルが求まった。この実施例２では、求められた音響モデルを用いて音声認識をする、つまり音響スコアの算出の処理を説明する。図１５に、この実施例の音声認識装置の機能構成例を示し、図１６に、音声認識の主な処理の流れを示す。 With the above method, the posterior probability distribution p (μ | O) of the acoustic model parameters, that is, the acoustic model is obtained. In the second embodiment, a speech recognition process using the obtained acoustic model, that is, a process of calculating an acoustic score will be described. FIG. 15 shows an example of the functional configuration of the speech recognition apparatus of this embodiment, and FIG. 16 shows the main processing flow of speech recognition.

認識用音声データの音響的特徴量と同様な音響的特徴を持つ適応用音声データが実施例１で説明した音響モデル作成装置１４８に入力される。そして、音響モデル記憶部１５８内の音響モデルが上述したように、更新される（ステップＳ８０）。認識用音声データがフレームに分割されて認識用音声データｘとして、特徴抽出部４に入力され、特徴量系列Ｏに変換される。この特徴量系列Ｏは、単語列探索部６に入力される（ステップＳ８２）。 Adaptation speech data having an acoustic feature similar to the acoustic feature amount of the recognition speech data is input to the acoustic model creation device 148 described in the first embodiment. Then, the acoustic model in the acoustic model storage unit 158 is updated as described above (step S80). The recognition speech data is divided into frames and input to the feature extraction unit 4 as recognition speech data x and converted into a feature amount series O. The feature amount series O is input to the word string search unit 6 (step S82).

The word string search unit 6 calculates, as necessary, the acoustic score of each Gaussian distribution for the feature quantity series O using the acoustic model in the acoustic model storage unit 8. For this acoustic score, for example, the following equation (40) is evaluated.
$$\int p(x_\tau \mid \mu_t)\, p(\mu_t \mid O^t)\, d\mu_t \qquad (40)$$
Here, $p(x_\tau \mid \mu_t)$ is the output distribution of the acoustic model; parameters other than $\mu_t$ are omitted. It therefore suffices to consider $p(\mu_t \mid O^t)$. To calculate the acoustic score over multiple frames, the word string search unit 6 may perform dynamic programming (DP matching) based on equation (40). The word string that maximizes the acoustic score is output as the recognized word string (step S84). In this case, the model update in step S80 updates the posterior probability distribution $p(\mu_t \mid O^t)$ as the acoustic model (step S80a). The integral in equation (40) can be solved numerically, but the following two analytical solutions also exist.

[Plug-in method]
The Plug-in method does not evaluate the integral directly. Instead, it uses the fact that the maximum a posteriori (MAP) value of $p(\mu_t \mid O^t)$ (the right-hand side of equation (41)) is $\hat{\mu}_t$ of equation (36). That is, equation (41) is obtained.

Therefore, as the model update in step S80 by the acoustic model creation device 60, $\hat{\mu}_t$ obtained by equation (36) is substituted directly (plugged in) for the mean vector parameter $\mu_t$ of the output distribution $p(x_\tau \mid \mu_t)$, and the acoustic model parameters are updated (step S80b). The score can then be calculated by equation (42).

That is, the score is expressed by a Gaussian distribution with mean $\hat{\mu}_t$ and covariance matrix $\Sigma$. This is called the Plug-in method. The other parameters, namely the state transition probability a, the mixture weight factor w, and the covariance matrix $\Sigma$, are used as they are. After step S80b, the process moves to step S82, as indicated by the broken-line arrow.
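Equations (41) and (42) are not reproduced in the text above. Based only on the surrounding description (the MAP value of the posterior equals $\hat{\mu}_t$, and the score is a Gaussian distribution with mean $\hat{\mu}_t$ and covariance matrix $\Sigma$), they can plausibly be sketched as follows; the exact notation of the original equations is an assumption.

```latex
\hat{\mu}_t = \operatorname*{argmax}_{\mu_t}\; p(\mu_t \mid O^t) \tag{41}

\int p(x_\tau \mid \mu_t)\, p(\mu_t \mid O^t)\, d\mu_t
  \;\approx\; p(x_\tau \mid \hat{\mu}_t)
  = N(x_\tau \mid \hat{\mu}_t,\ \Sigma) \tag{42}
```

The approximation in the second line amounts to treating the posterior as a point mass at $\hat{\mu}_t$, which is why the covariance matrix $\Sigma$ is left unchanged.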

[Marginalization method]
Unlike the Plug-in method, the marginalization method solves the integral analytically. Solving this integral corresponds to marginalizing over the mean vector parameter $\mu_t$. Compared with the Plug-in method, the marginalization method takes into account the variance of the posterior probability distribution $p(\mu_t \mid O^t)$ of the mean vector parameter. The score obtained from the integral can then be expressed by equation (43).

That is, when the marginalization method is used, the model update in step S80 replaces the mean vector parameter $\mu$ with $\hat{\mu}_t$ (step S80b) and also replaces the covariance matrix parameter as $\Sigma \to \Sigma + (\hat{r}_t)^{-1}\Sigma$ (step S80c), and the acoustic model parameters are updated. The other parameters, namely the state transition probability a and the mixture weight factor w, are used as they are. After step S80c, the process moves to step S82.
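Equation (43) is not reproduced in the text above. Assuming a Gaussian output distribution $N(x_\tau \mid \mu_t, \Sigma)$ and the posterior of equation (37'), the standard result for the convolution of two Gaussian distributions gives the following sketch, which agrees with the replacement $\Sigma \to \Sigma + (\hat{r}_t)^{-1}\Sigma$ stated in the text; the exact notation of the original equation is an assumption.

```latex
\int N(x_\tau \mid \mu_t,\ \Sigma)\,
     N\!\left(\mu_t \mid \hat{\mu}_t,\ (\hat{r}_t)^{-1}\Sigma\right) d\mu_t
  = N\!\left(x_\tau \mid \hat{\mu}_t,\ \Sigma + (\hat{r}_t)^{-1}\Sigma\right)
  = N\!\left(x_\tau \mid \hat{\mu}_t,\ \left(1 + (\hat{r}_t)^{-1}\right)\Sigma\right)
  \tag{43}
```

As $\hat{r}_t$ grows with accumulated data, the factor $1 + (\hat{r}_t)^{-1}$ approaches 1 and the marginalized score approaches the Plug-in score.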

As shown in FIG. 14, the model update unit 156 acts as the parameter conversion unit indicated by the broken line. When the Plug-in method is used, the mean vector parameter is updated by the replacement $\mu \to \hat{\mu}_t$. When the marginalization method is used, the mean vector parameter and the covariance matrix parameter are updated by the replacements $\mu \to \hat{\mu}_t$ and $\Sigma \to \Sigma + (\hat{r}_t)^{-1}\Sigma$. In this way, the acoustic score of the sequential adaptation method based on distribution conversion can be calculated.
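The two parameter conversions can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes a diagonal covariance matrix stored as a list of variances, and the names `convert_parameters` and `log_gauss_diag` are hypothetical rather than taken from the document.

```python
import math


def convert_parameters(mu_hat, r_hat, sigma_diag, method="plug-in"):
    """Return (mean, variances) of the Gaussian used for acoustic scoring.

    plug-in:          mu -> mu_hat, Sigma unchanged
    marginalization:  mu -> mu_hat, Sigma -> Sigma + Sigma / r_hat
    """
    if method == "plug-in":
        return list(mu_hat), list(sigma_diag)
    if method == "marginalization":
        scale = 1.0 + 1.0 / r_hat
        return list(mu_hat), [scale * s for s in sigma_diag]
    raise ValueError("method must be 'plug-in' or 'marginalization'")


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumption)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


# Hypothetical one-dimensional example.
mean, var = convert_parameters([0.3], 4.0, [2.0], method="marginalization")
score = log_gauss_diag([1.1], mean, var)
```

With a small $\hat{r}_t$ (little adaptation data) the marginalized variance is noticeably wider than $\Sigma$, so the score penalizes mismatched frames less sharply than the Plug-in score does.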

[Experimental results]
Using 100 hours of the ASJ (Acoustical Society of Japan) read speech database, a speaker-independent acoustic model of triphone HMMs with 2000 states in total and 16 mixture components per HMM state was built, and a sequential adaptation experiment was performed on Japanese simulated news speech. The feature quantities were 12-dimensional MFCCs (mel-frequency cepstral coefficients), the frame energy, the inter-frame difference Δ of the MFCCs, and the inter-frame difference ΔΔ of those difference MFCCs. A large-vocabulary continuous speech recognition experiment was performed using a trigram language model with a vocabulary of 700,000 words. The recognition rate of ordinary speech recognition without sequential adaptation was 81.3%.

Conventional sequential adaptation, which uses the covariance matrix $\hat{Q}_t$ as the distribution parameter, greatly improved the recognition rate to 88.5%. However, in an online sequential adaptation task in which the model is updated after every utterance (about 1 second), the conventional method could not run in real time (it took about twice real time), and the memory consumed by $\hat{Q}_t$ was 27 megabytes.

In contrast, the present invention, which uses the scaling factor $\hat{r}_t$ as the distribution parameter, achieved a recognition rate of 88.5%, maintaining the same level of performance as the conventional method, while realizing real-time processing (about 1 times real time). The memory consumed by $\hat{r}_t$ was 1.3 megabytes, a reduction to about one twentieth of the memory required when $\hat{Q}_t$ is used.

The acoustic model creation device of the present invention is not limited to the embodiments described above and may be modified as appropriate without departing from the spirit of the invention. The processes described for the acoustic model creation device need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed.
When the processing of the acoustic model creation device of the present invention is implemented by a computer, the processing content of the functions that the device should have is described by a program. Executing this program on a computer realizes the processing functions of the acoustic model creation device on the computer.

この処理内容を記述したプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体としては、例えば、磁気記録装置、光ディスク、光磁気記録媒体、半導体メモリ等どのようなものでもよい。具体的には、例えば、磁気記録装置として、ハードディスク装置、フレキシブルディスク、磁気テープ等を、光ディスクとして、ＤＶＤ（ＤｉｇｉｔａｌＶｅｒｓａｔｉｌｅＤｉｓｃ）、ＤＶＤ−ＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）、ＣＤ−ＲＯＭ（ＣｏｍｐａｃｔＤｉｓｃＲｅａｄＯｎｌｙＭｅｍｏｒｙ）、ＣＤ−Ｒ（Ｒｅｃｏｒｄａｂｌｅ）／ＲＷ（ＲｅＷｒｉｔａｂｌｅ）等を、光磁気記録媒体として、ＭＯ（Ｍａｇｎｅｔｏ−Ｏｐｔｉｃａｌｄｉｓｃ）等を、半導体メモリとしてＥＥＰ−ＲＯＭ（ＥｌｅｃｔｒｏｎｉｃａｌｌｙＥｒａｓａｂｌｅａｎｄＰｒｏｇｒａｍｍａｂｌｅ−ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）等を用いることができる。 The program describing the processing contents can be recorded on a computer-readable recording medium. As the computer-readable recording medium, for example, any recording medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory may be used. Specifically, for example, as a magnetic recording device, a hard disk device, a flexible disk, a magnetic tape or the like is used as an optical disc, and a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable). ) / RW (ReWritable), etc., magneto-optical recording medium, MO (Magneto-Optical disc), etc., semiconductor memory, EEP-ROM (Electronically Erasable Programmable-Read Only Memory), etc. can be used.

また、このプログラムの流通は、例えば、そのプログラムを記録したＤＶＤ、ＣＤ−ＲＯＭ等の可搬型記録媒体を販売、譲渡、貸与等することによって行う。さらに、このプログラムをサーバコンピュータの記憶装置に格納しておき、ネットワークを介して、サーバコンピュータから他のコンピュータにそのプログラムを転送することにより、このプログラムを流通させる構成としてもよい。 The program is distributed by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM in which the program is recorded. Furthermore, the program may be distributed by storing the program in a storage device of the server computer and transferring the program from the server computer to another computer via a network.

このようなプログラムを実行するコンピュータは、例えば、まず、可搬型記録媒体に記録されたプログラムもしくはサーバコンピュータから転送されたプログラムを、一旦、自己の記憶装置に格納する。そして、処理の実行時、このコンピュータは、自己の記録媒体に格納されたプログラムを読み取り、読み取ったプログラムに従った処理を実行する。また、このプログラムの別の実行形態として、コンピュータが可搬型記録媒体から直接プログラムを読み取り、そのプログラムに従った処理を実行することとしてもよく、さらに、このコンピュータにサーバコンピュータからプログラムが転送されるたびに、逐次、受け取ったプログラムに従った処理を実行することとしてもよい。また、サーバコンピュータから、このコンピュータへのプログラムの転送は行わず、その実行指示と結果取得のみによって処理機能を実現する、いわゆるＡＳＰ（ＡｐｐｌｉｃａｔｉｏｎＳｅｒｖｉｃｅＰｒｏｖｉｄｅｒ）型のサービスによって、上述の処理を実行する構成としてもよい。なお、本形態におけるプログラムには、電子計算機による処理の用に供する情報であってプログラムに準ずるもの（コンピュータに対する直接の指令ではないがコンピュータの処理を規定する性質を有するデータ等）を含むものとする。 A computer that executes such a program first stores, for example, a program recorded on a portable recording medium or a program transferred from a server computer in its own storage device. When executing the process, the computer reads a program stored in its own recording medium and executes a process according to the read program. As another execution form of the program, the computer may directly read the program from a portable recording medium and execute processing according to the program, and the program is transferred from the server computer to the computer. Each time, the processing according to the received program may be executed sequentially. Further, the above-described processing may be executed by a so-called ASP (Application Service Provider) type service that realizes a processing function only by an execution instruction and result acquisition without transferring a program from the server computer to the computer. Good. Note that the program in this embodiment includes information that is used for processing by an electronic computer and that conforms to the program (data that is not a direct command to the computer but has a property that defines the processing of the computer).

また、この形態では、コンピュータ上で所定のプログラムを実行させることにより、音響モデル作成装置を構成することとしたが、これらの処理内容の少なくとも一部をハードウェア的に実現することとしてもよい。 In this embodiment, the acoustic model creation apparatus is configured by executing a predetermined program on a computer. However, at least a part of these processing contents may be realized by hardware.

FIG. 1 is a diagram showing a functional configuration example of a conventional speech recognition apparatus.
FIG. 2 is a diagram showing the processing flow of the conventional speech recognition apparatus.
FIG. 3 is a diagram showing a functional configuration example of a conventional acoustic model creation device.
FIG. 4 is a diagram showing the processing flow of the conventional acoustic model creation device.
FIG. 5 is a diagram showing a functional configuration example of an acoustic model creation device that uses a linear regression matrix.
FIG. 6 is a block diagram showing the main processing flow of the acoustic model creation device shown in FIG. 5.
FIG. 7 is a diagram showing the procedure by which acoustic model parameters are converted when the sequential adaptation method is used.
FIG. 8 is a diagram showing a functional configuration example of an acoustic model creation device to which the sequential adaptation method is applied.
FIG. 9 is a diagram showing the order in which the parameters of the posterior probability distribution are sequentially adapted.
FIG. 10 is a block diagram showing the main processing flow of the acoustic model creation device shown in FIG. 8.
FIG. 11 is a diagram showing a functional configuration example of the conventional sequential learning unit 52.
FIG. 12 is a diagram showing a functional configuration example of the acoustic model creation device of this embodiment.
FIG. 13 is a diagram showing the order in which the parameters of the posterior probability distribution are sequentially adapted in this embodiment.
FIG. 14 is a diagram showing a functional configuration example of the sequential learning unit and related units of this embodiment.
FIG. 15 is a diagram showing a functional configuration example of the speech recognition apparatus of this embodiment.
FIG. 16 is a diagram showing the processing flow of the speech recognition apparatus of this embodiment.

Claims

A feature extraction unit that extracts a feature amount sequence of the adaptive audio data;
When the posterior probability distribution of the mean vector parameter of the Gaussian distribution in the acoustic model is expressed by the Gaussian distribution, the mean vector parameter of the posterior probability distribution, the scaling factor for the covariance matrix parameter of the posterior probability distribution, and the initial acoustic model parameter The posterior probability distribution of the mean vector parameter of the Gaussian distribution in the acoustic model obtained last time and the feature accumulated up to this time, taking into account the feature quantity series accumulated up to the previous time based on the covariance matrix Covariance of the posterior probability distribution's average vector parameter and the posterior probability distribution when the posterior probability distribution of the average vector parameter of the Gaussian distribution in this acoustic model is expressed as a Gaussian distribution using a part of the quantity series Gaussian distribution in this acoustic model by calculating the scaling factor of the matrix parameter A sequential learning section for determining the posterior probability distribution of the mean vector parameter,
An acoustic model creation device, comprising: a model updating unit that converts the posterior probability distribution of the current acoustic model parameter into a new acoustic model parameter and updates it.

The acoustic model creation device according to claim 1,
wherein the sequential learning unit obtains the current scaling factor $\hat{r}_{t+1}$ for the covariance matrix parameter of the posterior probability distribution of the mean vector parameter of the Gaussian distribution in the acoustic model from the previous scaling factor $\hat{r}_t$ and the sum $\zeta_{t+1}$ of the current posterior occupation probability values, and
obtains the current mean vector parameter $\hat{\mu}_{t+1}$ of the posterior probability distribution of the mean vector parameter of the Gaussian distribution in the acoustic model from the previous mean vector parameter $\hat{\mu}_t$, the coefficient vector $\nu_{t+1}$ obtained when the stochastic dynamics of the mean in the current acoustic model parameters are expressed linearly, the sum $\zeta_{t+1}$ of the current posterior occupation probability values, the product sum $M_{t+1}$ of $\zeta$ and the feature quantity at each current time point, and the current scaling factor $\hat{r}_{t+1}$.

The acoustic model creation device according to claim 2,
The sequential learning unit obtains a scaling factor r ^ _{t + 1} of the current acoustic model parameter and an average vector parameter μ ^ _{t + 1} of the current acoustic model parameter by the following equation:

However, μ ⁰ is a predetermined constant, r ^ _t is the previous scaling factor, ζ _{t + 1} is the sum of the posterior occupation probability values of this time, and M _{t + 1} is ζ and the feature value at each time point of the current time. And ν _{t + 1} is a coefficient vector obtained by linearly expressing the average stochastic dynamics in the current acoustic model parameter.

A model for recognition in which an acoustic model adapted to voice data for adaptation having acoustic characteristics of voice data for recognition is created by the acoustic model creation device according to any one of claims 1 to 3, and the acoustic model is updated. Update section,
A speech recognition apparatus comprising: a recognition unit configured to perform speech recognition on input speech data having the acoustic feature using the updated acoustic model.

An acoustic model creation method comprising:
a feature extraction step of extracting a feature quantity sequence from adaptation speech data;
a sequential learning step of obtaining the posterior probability distribution of the mean vector parameter of a Gaussian distribution in the current acoustic model, by using (i) the previously obtained posterior probability distribution of the mean vector parameter of the Gaussian distribution in the acoustic model, which reflects the feature quantity sequence accumulated up to the previous time and which, when expressed as a Gaussian distribution, is based on the mean vector parameter of that posterior probability distribution, the scaling factor for the covariance matrix parameter of that posterior probability distribution, and the covariance matrix of the initial acoustic model parameters, and (ii) a part of the feature quantity sequence accumulated up to the current time, to calculate the mean vector parameter of the posterior probability distribution and the scaling factor for the covariance matrix parameter of the posterior probability distribution that apply when the posterior probability distribution of the mean vector parameter of the Gaussian distribution in the current acoustic model is expressed as a Gaussian distribution; and
a model update step of converting the posterior probability distribution of the current acoustic model parameters into new acoustic model parameters and updating the acoustic model with them.

The acoustic model creation method according to claim 5, wherein, in the sequential learning step,
the current scaling factor r ^ _{t + 1} for the covariance matrix parameter of the posterior probability distribution of the mean vector parameter of the Gaussian distribution in the acoustic model is obtained from the previous scaling factor r ^ _t and the sum ζ _{t + 1} of the current posterior occupancy probability values, and
the current mean vector parameter μ ^ _{t + 1} of the posterior probability distribution of the mean vector parameter of the Gaussian distribution in the acoustic model is obtained from the previous mean vector parameter μ ^ _t , the coefficient vector ν _{t + 1} obtained when the stochastic dynamics of the mean in the current acoustic model parameters are expressed linearly, the sum ζ _{t + 1} of the current posterior occupancy probability values, the sum M _{t + 1} of the products of ζ and the feature value at each time point, and the current scaling factor r ^ _{t + 1} .

The acoustic model creation method according to claim 6,
wherein, in the sequential learning step, the scaling factor r ^ _{t + 1} and the mean vector parameter μ ^ _{t + 1} of the current acoustic model parameters are obtained by the following equations:

Here, μ ⁰ is a predetermined constant, r ^ _t is the previous scaling factor, ζ _{t + 1} is the sum of the current posterior occupancy probability values, M _{t + 1} is the sum of the products of ζ and the feature value at each current time point, and ν _{t + 1} is the coefficient vector obtained when the stochastic dynamics of the mean in the current acoustic model parameters are expressed linearly.

A speech recognition method comprising:
a recognition model update step of creating, by the acoustic model creation method according to any one of claims 5 to 7, an acoustic model adapted to adaptation speech data having the acoustic characteristics of the recognition speech data, and updating the acoustic model accordingly; and
a speech recognition step of performing speech recognition on input speech data having those acoustic characteristics, using the updated acoustic model.

A program that causes a computer to operate as the acoustic model creation apparatus according to any one of claims 1 to 3 or as the speech recognition apparatus according to claim 4.

A computer-readable recording medium on which the program according to claim 9 is recorded.
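
The sequential learning step recited in the method claims can be illustrated, in simplified form, by the following sketch. It is not the claimed equations, which are not reproduced in this text. It shows only the textbook recursive conjugate update of a Gaussian mean whose posterior is N(μ ^, Σ⁰ / r ^), and it leaves out the dynamics terms (the constant μ ⁰ and the coefficient vector ν _{t + 1}). The function name `sequential_mean_update` and its argument names are hypothetical and chosen for illustration only.

```python
import numpy as np


def sequential_mean_update(mu_prev, r_prev, gammas, feats):
    """One recursive conjugate update of a Gaussian mean posterior.

    ASSUMPTION: plain conjugate update, NOT the patent's equations; the
    dynamics terms (mu^0, nu_{t+1}) are deliberately omitted.

    mu_prev : previous posterior mean vector (mu^_t), shape (D,)
    r_prev  : previous scaling factor (r^_t), a positive scalar
    gammas  : posterior occupancy probabilities of the new frames, shape (N,)
    feats   : feature vectors of the new frames, shape (N, D)
    """
    mu_prev = np.asarray(mu_prev, dtype=float)
    gammas = np.asarray(gammas, dtype=float)
    feats = np.asarray(feats, dtype=float)

    # zeta_{t+1}: sum of the current posterior occupancy probability values
    zeta = gammas.sum()
    # M_{t+1}: sum of the products of occupancy and feature value per frame
    m = (gammas[:, None] * feats).sum(axis=0)

    # The scaling factor grows with the amount of softly counted data.
    r_new = r_prev + zeta
    # The new mean is a weighted blend of the previous mean and the new data.
    mu_new = (r_prev * mu_prev + m) / r_new
    return mu_new, r_new


if __name__ == "__main__":
    mu, r = sequential_mean_update(
        np.zeros(2), 1.0, [1.0, 1.0], [[3.0, 3.0], [3.0, 3.0]]
    )
    print(mu, r)
```

In this simplified form the scaling factor only accumulates, so older data is never discounted. The claimed update additionally involves μ ⁰ and ν _{t + 1}, whose exact role is defined by the equations of the claims rather than by this sketch.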