JP2005338358A

JP2005338358A - Acoustic model noise adapting method and device implementing same method

Info

Publication number: JP2005338358A
Application number: JP2004156037A
Authority: JP
Inventors: Atsunori Ogawa; 厚徳小川; Satoru Kobashigawa; 哲小橋川; Satoshi Takahashi; 敏高橋
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-05-26
Filing date: 2004-05-26
Publication date: 2005-12-08
Anticipated expiration: 2024-05-26
Also published as: JP4510517B2

Abstract

PROBLEM TO BE SOLVED: To provide a method and device for acoustic model noise adaptation that perform noise adaptation of a different acoustic model based on an existing acoustic model noise adaptation result.
SOLUTION: A clean acoustic model A and a noise-adapted acoustic model B obtained by noise adaptation of the clean acoustic model A are prepared; the amount of change of each parameter caused by the noise adaptation from the clean acoustic model A to the noise-adapted acoustic model B is calculated; for each state and each distribution of another clean acoustic model C, the state and distribution of the clean acoustic model A to be referred to are determined; and each parameter of the clean acoustic model C is adjusted based on this reference relationship and on the parameter change amounts from A to B, to generate a new noise-adapted acoustic model D.
COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to an acoustic model noise adaptation method and an apparatus for implementing the method, and in particular to an acoustic model noise adaptation method, and an apparatus implementing it, that perform noise adaptation of another acoustic model at low cost and high speed based on an existing acoustic model noise adaptation result.

First, a speech recognition apparatus will be described with reference to the drawings.
In FIG. 6, input speech 601 is converted into a time series of feature vectors 603 by the speech analysis unit 602 and input to the search processing unit 605. The search processing unit 605 uses the acoustic model 604 to match the words or word strings expressed by the grammar 606 against the time series of feature vectors 603, that is, it performs search processing, and outputs the word or word string with the highest likelihood as the recognition result 607.
Cepstrum analysis is often used as the speech analysis method in the speech analysis unit 602. Feature quantities include MFCC (Mel Frequency Cepstral Coefficient), ΔMFCC, ΔΔMFCC, log power, Δ log power, and others, which together constitute a feature vector of about 10 to 100 dimensions. The analysis is performed with an analysis frame width of about 30 ms and an analysis frame shift of about 10 ms. The acoustic model 604 holds the MFCC and other speech features mentioned above as standard patterns for appropriate categories; for the features of a given section of the input speech, it computes the acoustic closeness to each standard pattern as a likelihood and estimates which category the section belongs to. At present, the hidden Markov model (HMM), which is modeled on probability and statistical theory, is widely used as the acoustic model 604. Usually, an HMM is created for each phoneme category, and one acoustic model is constructed as a set of phoneme HMMs.

The types of phoneme HMM most often used are the following: the monophone-HMM, which considers neither the preceding nor the following phoneme as phoneme context (for example, *-a-* denotes the monophone-HMM of phoneme a, where * represents any phoneme); the preceding-context-dependent biphone-HMM, which considers only the preceding phoneme as phoneme context (for example, p-a-* denotes the preceding-context-dependent biphone-HMM of phoneme a whose preceding phoneme is p); the following-context-dependent biphone-HMM, which considers only the following phoneme as phoneme context (for example, *-a-t denotes the following-context-dependent biphone-HMM of phoneme a whose following phoneme is t); and the triphone-HMM, which considers both the preceding and the following phoneme as phoneme context (for example, p-a-t denotes the triphone-HMM of phoneme a whose preceding phoneme is p and whose following phoneme is t).

The number of phoneme categories expressed by the phoneme HMMs depends on the training data of the acoustic model, but because sequences that cannot occur as Japanese phoneme chains, such as t-t-t, are not included, it is generally on the order of several thousand to several tens of thousands.
The structure of the acoustic model 604 will be described with reference to FIG. 7.
First, as shown in FIG. 7, a state S is expressed as a mixture probability distribution M. The component distributions of a mixture probability distribution may be discrete or continuous probability distributions. The most commonly used today is the multidimensional normal (Gaussian) distribution, which is a continuous probability distribution, and among these the multidimensional uncorrelated normal distribution, which has no correlation between dimensions (the off-diagonal components of the covariance matrix are 0), is used most often. Each dimension of the multidimensional normal distribution corresponds to a dimension of the feature vector described above. In FIG. 7, the state S is represented as a multidimensional mixture normal distribution M whose component distributions are four multidimensional normal distributions. FIG. 7 shows one dimension i of the feature vector, but every dimension of the feature vector is represented in the same way. A phoneme HMM is built as a probabilistic chain of several to a dozen or so states of the kind shown in FIG. 7. There are many variations in how many states, and what kind of probabilistic chain, make up a phoneme HMM, and the structure may differ from one phoneme HMM to another. The structure most commonly used today is the three-state left-to-right HMM, such as the phoneme HMM shown in FIG. 8. It arranges three states S_1 (first state), S_2 (second state), and S_3 (third state) from left to right, and its probabilistic chain (state transitions) consists of the transitions to the same state (self-transitions) S_1 → S_1, S_2 → S_2, S_3 → S_3 and the transitions to the next state S_1 → S_2, S_2 → S_3. In many cases, all phoneme HMMs in an acoustic model take this three-state left-to-right structure.
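The left-to-right structure described above can be sketched as a transition table. This is only an illustration: the function name, the example self-transition probabilities, and the extra last column (which stands for leaving the final state) are assumptions and do not appear in the source.

```python
def left_to_right_transitions(self_loop_probs):
    """Build the transition table of a left-to-right HMM.

    self_loop_probs[j] is the self-transition probability of state j
    (illustrative values). The remaining probability mass goes to the
    next state. The extra last column models leaving the final state,
    which is an assumption of this sketch.
    """
    n = len(self_loop_probs)
    table = [[0.0] * (n + 1) for _ in range(n)]
    for j, stay in enumerate(self_loop_probs):
        if not 0.0 <= stay <= 1.0:
            raise ValueError("self-loop probability must lie in [0, 1]")
        table[j][j] = stay            # S_j -> S_j (self-transition)
        table[j][j + 1] = 1.0 - stay  # S_j -> S_{j+1} (or exit)
    return table


# Three states, as in the three-state left-to-right HMM.
transitions = left_to_right_transitions([0.6, 0.5, 0.7])
```

Every entry other than the self-transition and the next-state transition stays zero, which is what makes the model left-to-right.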

The likelihood calculation using a phoneme HMM will be described with reference to FIG. 8.
Consider the likelihood calculation when a time series of feature vectors is input to the phoneme HMM of FIG. 8. For example, the probability (likelihood) P(X | S, HMM) that a time series of six frames of feature vectors X = X_1, X_2, X_3, X_4, X_5, X_6 is output from one particular state transition sequence S = S_1 → S_1 → S_2 → S_2 → S_3 → S_3 of the phoneme HMM is calculated as follows.
$$P(X \mid S, \mathrm{HMM}) = b_1(X_1)\,a_{11}\,b_1(X_2)\,a_{12}\,b_2(X_3)\,a_{22}\,b_2(X_4)\,a_{23}\,b_3(X_5)\,a_{33}\,b_3(X_6) \qquad \text{Formula (1)}$$
Here, a_jk is the transition probability from state S_j to state S_k. b_j(X_t) is the probability that the feature vector X_t at time t (the t-th frame) is output from the mixture normal distribution M_j representing state S_j, and it is calculated from the output probability P_jm(X_t) of the m-th normal distribution constituting M_j as follows.
$$b_j(X_t) = \sum_{m=1}^{M_j} W_{jm}\,P_{jm}(X_t) \qquad \text{Formula (2)}$$
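As a minimal sketch, Formula (1) can be evaluated for one fixed state path as shown here. The function name and the table layout (`emission_probs[t][j]` for b_j(X_t), `trans[j][k]` for a_jk, 0-based state indices) are illustrative assumptions. The sum is taken over logarithms so that long products do not underflow.

```python
import math


def path_log_likelihood(emission_probs, trans, path):
    """Return log P(X | S, HMM) for one fixed state path.

    emission_probs[t][j] holds b_j(X_t), trans[j][k] holds a_jk, and
    path[t] is the 0-based state occupied at frame t.
    """
    if len(path) != len(emission_probs):
        raise ValueError("path length must equal the number of frames")
    total = math.log(emission_probs[0][path[0]])
    for t in range(1, len(path)):
        prev, cur = path[t - 1], path[t]
        total += math.log(trans[prev][cur])          # a_{prev,cur}
        total += math.log(emission_probs[t][cur])    # b_cur(X_t)
    return total


# Six frames on the path S1 S1 S2 S2 S3 S3, with made-up probabilities.
trans = [[0.6, 0.4, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 0.7]]
emissions = [[0.5, 0.5, 0.5] for _ in range(6)]
log_p = path_log_likelihood(emissions, trans, [0, 0, 1, 1, 2, 2])
```

With these made-up numbers the product contains exactly the five transition factors a_11, a_12, a_22, a_23, a_33 and six output probabilities, matching the term order of Formula (1).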

Here, M_j is the number of normal distributions constituting the mixture normal distribution M_j (the number of mixture components), and W_jm is the distribution weight of the m-th normal distribution constituting M_j. W_jm satisfies the following equation.
$$\sum_{m=1}^{M_j} W_{jm} = 1 \qquad \text{Formula (3)}$$

When the normal distributions constituting the mixture normal distribution M_j are multidimensional uncorrelated normal distributions, P_jm(X_t) is calculated as follows.
$$P_{jm}(X_t) = \prod_{i=1}^{I} \frac{1}{\sqrt{2\pi\sigma_{jmi}^{2}}} \exp\left(-\frac{(X_{ti}-\mu_{jmi})^{2}}{2\sigma_{jmi}^{2}}\right) \qquad \text{Formula (4)}$$
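The output probability of a state whose mixture components are uncorrelated (diagonal-covariance) normal distributions can be sketched as follows. The function names and the list-based layout are illustrative assumptions. The weights are assumed to be strictly positive and to sum to 1, and the result is returned as a logarithm.

```python
import math


def log_diag_gaussian(x, mean, var):
    """Log density of an uncorrelated normal distribution: the sum over
    dimensions of the one-dimensional Gaussian log densities."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_mixture_output(x, weights, means, variances):
    """Log of the weighted sum of component densities for one state."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("distribution weights must sum to 1")
    logs = [
        math.log(w) + log_diag_gaussian(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(logs)  # log-sum-exp keeps the sum numerically stable
    return peak + math.log(sum(math.exp(value - peak) for value in logs))
```

Factoring out the largest term before exponentiating is a common way to keep the weighted sum stable when the individual densities are very small.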

Here, μ_jmi and σ²_jmi are the mean and variance in dimension i of the m-th multidimensional uncorrelated normal distribution constituting the mixture normal distribution M_j. X_ti is the value of dimension i of the feature vector X_t. I is the number of dimensions of the feature vector (and of the multidimensional uncorrelated normal distribution).
The likelihood calculation above is for one particular state transition sequence S, but other state transition sequences are also possible. The method that computes, for all such state transition sequences, the probability of outputting the feature vector time series X, and takes their sum as the likelihood when X is input to the phoneme HMM, is called the trellis algorithm. In contrast, the method that finds, frame by frame along the feature vector time series, the state transition sequence giving the highest likelihood among all state transition sequences, and takes the likelihood on reaching the final frame as the likelihood when X is input to the phoneme HMM, is called the Viterbi algorithm. In general, the Viterbi algorithm is often used because it can greatly reduce the amount of computation compared with the trellis algorithm.
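The difference between the two algorithms can be sketched with a single recursion in which only the way incoming path scores are combined changes: a sum over paths for the trellis algorithm, a maximum for the Viterbi algorithm. The function names are illustrative, and the sketch assumes that every path starts in the first state and ends in the last state, which the source does not state explicitly.

```python
import math

NEG_INF = float("-inf")


def _safe_log(p):
    return math.log(p) if p > 0.0 else NEG_INF


def log_sum_exp(values):
    peak = max(values)
    if peak == NEG_INF:
        return NEG_INF
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def hmm_log_likelihood(emission_probs, trans, combine):
    """Score a frame sequence against an HMM in the log domain.

    emission_probs[t][j] is b_j(X_t) and trans[j][k] is a_jk.
    combine=log_sum_exp sums over all paths (trellis);
    combine=max keeps only the best path (Viterbi).
    """
    n = len(trans)
    score = [NEG_INF] * n
    score[0] = _safe_log(emission_probs[0][0])  # assumed start state
    for t in range(1, len(emission_probs)):
        new_score = []
        for k in range(n):
            incoming = [score[j] + _safe_log(trans[j][k]) for j in range(n)]
            new_score.append(combine(incoming) + _safe_log(emission_probs[t][k]))
        score = new_score
    return score[-1]  # assumed final state
```

In this sketch both variants visit the same number of table cells; the Viterbi variant replaces the log-domain summation with a plain maximum.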

The likelihood calculation above is for a single phoneme HMM. In practice, before performing the search processing, the search processing unit 605 concatenates phoneme HMMs to create an HMM network (search network) of the words or word strings expressed by the grammar 606. The time series of feature vectors 603 of the input speech is then matched against the words or word strings expressed by the search network (search processing), and the word or word string with the highest likelihood is output as the recognition result 607.
The likelihood calculation above also handles probability values directly, but in practice the calculation is performed on the logarithms of the probability values to prevent underflow.

The Baum-Welch algorithm is most often used to estimate the parameters of the acoustic model 604 (the state transition probabilities a_jk, the distribution weights W_jm, and the mean μ_jmi and variance σ²_jmi of each dimension of the normal distributions). Training (parameter estimation) of one acoustic model uses a large amount of speech data, from several tens to several thousands of hours, together with label data giving the utterance content (see Non-Patent Document 1).
Various kinds of noise exist in the environments where the conventional speech recognition apparatus described above is actually used, and this noise lowers speech recognition accuracy. Noise can be broadly classified into two types. One is multiplicative distortion, which affects speech through convolution; examples are the transfer characteristic of the space from the speaker's mouth to the microphone and the transfer characteristic of a telephone line. The other is additive noise, which is added to the speech after it has been affected by the transfer characteristic; examples are, in an office, the noise of computers and the sound of pages being turned, and, in an automobile, the sound of the engine. Let s(t) be the clean, noise-free speech at time t, h(t) the multiplicative distortion (transfer characteristic), and n(t) the additive noise. The noisy speech y(t) is then given by the following equation, in which * denotes convolution.

$$y(t) = s(t) * h(t) + n(t) \qquad \text{Formula (5)}$$
In recent years, multi-condition (multi-style) training of acoustic models has been studied as a noise countermeasure for speech recognition apparatuses. It is attracting attention as a training method that is very simple yet yields a noise-adapted acoustic model with high robustness to a variety of noises.
Multi-condition training of an acoustic model will be explained with reference to the drawings.
FIG. 9 illustrates the flow of multi-condition training of an acoustic model. First, a clean acoustic model 904 is created by the acoustic model training unit 910 using noise-free clean speech data 908 and its utterance content label data 909. As information on the noise present in the environment where the speech recognition apparatus is used, transfer characteristic data 911 representing multiplicative distortion and noise data 912 representing additive noise are prepared in as many combinations as are expected. Here, it is assumed that there are N combinations of transfer characteristic data 911-1 to 911-N and noise data 912-1 to 912-N. The transfer characteristic convolution unit 913 convolves the transfer characteristic data 911 with the clean speech data 908, and the noise addition unit 914 then adds the noise data 912. This operation is performed for all N combinations of transfer characteristic data 911 and noise data 912, and the multi-condition speech data 915 is created as the set of the results. The original clean speech data 908 may also be included in the multi-condition speech data 915, and it is included here. Next, the acoustic model additional training unit 916 uses the multi-condition speech data 915 and the utterance content label data 909 to train the acoustic model by additional training starting from the clean acoustic model 904, and the noise-adapted acoustic model 918 is created as a result. Instead of additional training of the clean acoustic model 904, the noise-adapted acoustic model 918 could also be trained from scratch on the multi-condition speech data 915, but additional training is assumed here. Because additional training is used, the clean acoustic model 904 and the noise-adapted acoustic model 918 have the same structure, including the state chain structure and the number of probability distributions in each state.
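The data preparation step of FIG. 9 (convolve a transfer characteristic with the clean speech, then add noise, for every condition) can be sketched as follows. The function names, the truncation of the convolved signal to the clean length, and the use of plain Python lists as waveforms are assumptions made for illustration.

```python
def convolve(signal, impulse_response):
    """Full discrete convolution of two sample sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[i + k] += s * h
    return out


def apply_condition(clean, impulse_response, noise):
    """y(t) = s(t) * h(t) + n(t), truncated to the clean length."""
    filtered = convolve(clean, impulse_response)[: len(clean)]
    if len(noise) < len(filtered):
        raise ValueError("noise must be at least as long as the speech")
    return [f + n for f, n in zip(filtered, noise)]


def build_multicondition_set(clean_utterances, conditions, include_clean=True):
    """Apply every (transfer characteristic, noise) pair to every utterance."""
    data = [list(u) for u in clean_utterances] if include_clean else []
    for impulse_response, noise in conditions:
        for utterance in clean_utterances:
            data.append(apply_condition(utterance, impulse_response, noise))
    return data
```

With N conditions and the clean data included, the resulting set holds N + 1 copies of the corpus, so the data volume grows in proportion to the number of conditions.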

By using the noise-adapted acoustic model 918 created as described above, high recognition accuracy can be obtained in all of the assumed noise environments. Previous experimental reports on multi-condition training include the following: one that assumes only additive noise, such as subway, in-car, and restaurant noise, with no multiplicative distortion (see Non-Patent Document 2); one for the in-car noise environment in which, because the microphone is installed at the sun visor or at the front room lamp, two kinds of multiplicative distortion are assumed (the spatial transfer characteristics from the speaker's mouth to the sun visor and to the front room lamp), together with several kinds of additive noise such as idling, driving on ordinary roads, and driving on expressways (see Non-Patent Document 3); and one for telephone speech recognition that assumes multiplicative distortion caused by the frequency characteristics of various telephones and exhibition hall noise at several SN ratios (see Non-Patent Document 4).
Non-Patent Document 1: Seiichi Nakagawa, "Speech Recognition by Stochastic Models", edited by The Institute of Electronics, Information and Communication Engineers
Non-Patent Document 2: J. C. Segura, A. de la Torre, M. C. Benitez, A. M. Peinado, "Model-based compensation of the additive noise for continuous speech recognition. Experiments using the AURORA II database and tasks", Proc. EUROSPEECH 2001, vol. 1, pp. 221-24, Scandinavia, 2001
Non-Patent Document 3: Tetsuya Takiguchi, Masafumi Nishimura, "Effects of multi-style learning method on in-car speech recognition", Proceedings of the Acoustical Society of Japan 2001 Autumn Meeting, 1-Q-8, pp. 155-156
Non-Patent Document 4: Nobuyuki Kunieda, Tatsuya Kimura, Akira Ishida, "Evaluation of an acoustic model for telephone speech recognition created by Multi-Style learning: effect on SN ratio and telephone characteristics"

As described above, multi-condition training of an acoustic model is a very simple method that nevertheless yields a noise-adapted acoustic model with high robustness to a variety of noises. However, as shown in FIG. 9, speech data must be prepared and used for each assumed type of noise during training, so the data storage capacity and the computation time increase greatly. For example, if N types of noise are assumed, N times the data storage capacity and computation time are required compared with training a clean acoustic model on clean speech alone. Training an acoustic model is already costly in terms of data storage capacity and computation time, and multi-condition training raises that cost much further, which is undesirable.

The present invention has been made in view of the high cost of multi-condition training. Its object is to provide an acoustic model noise adaptation method, and an apparatus implementing the method, that can perform noise adaptation of an acoustic model by using the result of an existing noise adaptation, without newly preparing noise-added speech data such as multi-condition speech data.

Claim 1: An acoustic model noise adaptation method is configured in which a clean acoustic model A (104-A) trained on noise-free speech data and a noise-adapted acoustic model B (118-B) obtained by noise adaptation of the clean acoustic model A (104-A) are prepared; the amount of change of each parameter caused by the noise adaptation from the clean acoustic model A (104-A) to the noise-adapted acoustic model B (118-B) is calculated; for each state and each distribution of another clean acoustic model C (104-C) trained on noise-free speech data, the state and distribution of the clean acoustic model A to be referred to are determined; and each parameter of the other clean acoustic model C is adjusted based on the reference relationship between the states and distributions of the other clean acoustic model C and the clean acoustic model A and on the amount of change of each parameter caused by the noise adaptation from the clean acoustic model A to the noise-adapted acoustic model B, to create a new noise-adapted acoustic model D (118-D).

Claim 2: An acoustic model noise adaptation apparatus is configured that comprises: a clean acoustic model A (104-A) trained on noise-free speech data; a noise adaptation unit 117 that receives the clean acoustic model A (104-A) and performs noise adaptation on it; a noise-adapted acoustic model B (118-B) obtained by noise adaptation of the clean acoustic model A (104-A); an acoustic model parameter change amount calculation unit 119 that receives the clean acoustic model A (104-A) and the noise-adapted acoustic model B (118-B) and calculates the amount of change of each parameter caused by the noise adaptation of the clean acoustic model A (104-A); an acoustic model structure reference relationship determination unit 120 that receives another clean acoustic model C (104-C) trained on noise-free speech data and the clean acoustic model A (104-A) and determines, for each parameter of the other clean acoustic model C (104-C), the parameter of the clean acoustic model A (104-A) to be referred to; and an acoustic model parameter adjustment unit 121 that receives the amount of change of each parameter caused by the noise adaptation of the clean acoustic model A (104-A), as calculated by the acoustic model parameter change amount calculation unit 119, and the reference relationship between the clean acoustic model A (104-A) and the other clean acoustic model C (104-C), as determined by the acoustic model structure reference relationship determination unit 120, and adjusts each parameter of the other clean acoustic model C (104-C) to create a new noise-adapted acoustic model D (118-D).

Claim 3: In the acoustic model noise adaptation apparatus according to claim 2, an acoustic model noise adaptation apparatus is configured in which the acoustic model parameter change amount calculation unit, when calculating the amount of change of each parameter caused by the noise adaptation from the clean acoustic model A to the noise-adapted acoustic model B, takes, in each state of the clean acoustic model A, the parameter change amount of the distribution with the largest distribution weight as the parameter change amount of all distributions in that state.
Claim 4: In the acoustic model noise adaptation apparatus according to claim 2, an acoustic model noise adaptation apparatus is configured in which the acoustic model parameter change amount calculation unit, when calculating the amount of change of each parameter caused by the noise adaptation from the clean acoustic model A to the noise-adapted acoustic model B, creates, in each state of the clean acoustic model A and in the corresponding state of the noise-adapted acoustic model B, a distribution that merges all distributions in the state, and takes the parameter change amount of the merged distribution as the parameter change amount of all distributions in that state.
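The two ways of deriving a single per-state change amount described above can be sketched as follows. Several points are assumptions of this sketch and are not fixed by the text: the change amount is represented as a mean shift (B minus A) plus a variance scale factor (B divided by A), the merged distribution is obtained by moment matching, and a state is a dictionary with hypothetical keys `weights`, `means`, and `vars`. Distribution m of model A is paired with distribution m of model B because additional training keeps the structure unchanged.

```python
def merge_mixture(weights, means, variances):
    """Moment-match a diagonal Gaussian mixture with one Gaussian."""
    dims = len(means[0])
    total = sum(weights)
    mean = [
        sum(w * m[i] for w, m in zip(weights, means)) / total
        for i in range(dims)
    ]
    var = [
        sum(w * (v[i] + m[i] ** 2) for w, m, v in zip(weights, means, variances))
        / total - mean[i] ** 2
        for i in range(dims)
    ]
    return mean, var


def _delta(mean_a, var_a, mean_b, var_b):
    # Assumed representation of the change amount: shift and scale.
    shift = [b - a for a, b in zip(mean_a, mean_b)]
    scale = [b / a for a, b in zip(var_a, var_b)]
    return shift, scale


def state_delta_max_weight(state_a, state_b):
    """Change amount of the distribution with the largest weight in A."""
    weights = state_a["weights"]
    m = max(range(len(weights)), key=weights.__getitem__)
    return _delta(state_a["means"][m], state_a["vars"][m],
                  state_b["means"][m], state_b["vars"][m])


def state_delta_merged(state_a, state_b):
    """Change amount between the merged distributions of A and B."""
    mean_a, var_a = merge_mixture(state_a["weights"], state_a["means"], state_a["vars"])
    mean_b, var_b = merge_mixture(state_b["weights"], state_b["means"], state_b["vars"])
    return _delta(mean_a, var_a, mean_b, var_b)
```

Either function returns one change amount per state, which would then be shared by every distribution in that state.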

Claim 5: In the acoustic model noise adaptation apparatus according to any one of claims 2 to 4, an acoustic model noise adaptation apparatus is configured in which the acoustic model structure reference relationship determination unit, when determining the state and distribution of the clean acoustic model A to be referred to by each state and each distribution of the other clean acoustic model C, determines the correspondence between the distributions in each state of the other clean acoustic model C and the distributions in the corresponding state of the clean acoustic model A on the basis of the closeness of the inter-distribution distance.
Claim 6: In the acoustic model noise adaptation apparatus according to any one of claims 2 to 5, an acoustic model noise adaptation apparatus is configured that uses the Kullback-Leibler divergence as the inter-distribution distance measure.

Claim 7: In the acoustic model noise adaptation apparatus according to any one of claims 2 to 5, an acoustic model noise adaptation apparatus is configured that uses the Bhattacharyya distance as the inter-distribution distance measure.
Claim 8: In the acoustic model noise adaptation apparatus according to any one of claims 2 to 5, an acoustic model noise adaptation apparatus is configured that uses the likelihood difference before and after distribution merging as the inter-distribution distance measure.
Claim 9: In the acoustic model noise adaptation apparatus according to any one of claims 2 to 5, an acoustic model noise adaptation apparatus is configured that uses the difference between evaluation function values based on the variational Bayes method before and after distribution merging as the inter-distribution distance measure.
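Two of the distance measures named above have closed forms for uncorrelated (diagonal-covariance) normal distributions, and a sketch of them follows. The function names are illustrative. The direction of the Kullback-Leibler divergence (from the distribution of model C to the candidate in model A) is an assumption, since the divergence is asymmetric and the text does not fix the direction. The likelihood-difference and variational Bayes measures are not sketched here.

```python
import math


def kl_divergence_diag(mean_p, var_p, mean_q, var_q):
    """KL(p || q) between two diagonal-covariance Gaussians."""
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q)
    )


def bhattacharyya_diag(mean_p, var_p, mean_q, var_q):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    return sum(
        0.25 * (mp - mq) ** 2 / (vp + vq)
        + 0.5 * math.log((vp + vq) / (2.0 * math.sqrt(vp * vq)))
        for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q)
    )


def nearest_distribution(target, candidates, distance=kl_divergence_diag):
    """Index of the candidate (mean, var) pair closest to the target pair."""
    mean_t, var_t = target
    return min(
        range(len(candidates)),
        key=lambda m: distance(mean_t, var_t, candidates[m][0], candidates[m][1]),
    )
```

Passing `bhattacharyya_diag` as the `distance` argument switches the measure without changing the search for the nearest distribution.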

According to the present invention, noise adaptation of an acoustic model can be performed instantly by using the result of an existing noise adaptation, without newly preparing noise-added speech data such as multi-condition speech data. As an example, consider performing noise adaptation of a clean female-voice acoustic model, that is, another clean acoustic model given as input, trained on noise-free speech data from many women. If a clean male-voice acoustic model trained on noise-free speech data from many men and a noise-adapted male-voice acoustic model obtained from it by multi-condition training are available, there is no need to prepare noise-added female speech data and perform multi-condition training again; noise adaptation can be performed instantly by using the noise adaptation result of the male-voice acoustic model as it is.

The acoustic model noise adaptation according to the present invention provides a clean acoustic model A trained on noise-free speech data and a noise-adapted acoustic model B obtained by noise adaptation of the clean acoustic model A. The acoustic model parameter change amount calculation unit calculates in advance the amount of change in each parameter caused by the noise adaptation from the clean acoustic model A to the noise-adapted acoustic model B. When another clean acoustic model C trained on noise-free speech data is input, the acoustic model structure reference relationship determination unit determines the state and distribution of the clean acoustic model A to which each state and each distribution of the clean acoustic model C refers. The acoustic model parameter adjustment unit then adjusts each parameter of the clean acoustic model C, based on the reference relationship between the states and distributions of the clean acoustic models C and A and on the amount of change in each parameter from the clean acoustic model A to the noise-adapted acoustic model B, and thereby creates a new noise-adapted acoustic model D.
In one form of the apparatus, when the acoustic model parameter change amount calculation unit calculates the amount of change in each parameter from the clean acoustic model A to the noise-adapted acoustic model B, it takes, in each state of the clean acoustic model A, the parameter change amount of the distribution with the largest distribution weight as the parameter change amount of all distributions in that state. In another form, the unit creates, in each state of the clean acoustic model A and in the corresponding state of the noise-adapted acoustic model B, an integrated distribution that combines all distributions in the state, and takes the parameter change amount of the integrated distribution as the parameter change amount of all distributions in the state. In a further form, when the acoustic model structure reference relationship determination unit determines the state and distribution of the clean acoustic model A to which each state and each distribution of the input acoustic model C refers, it determines the correspondence between the distributions in each state of the clean acoustic model C and those in the corresponding state of the clean acoustic model A on the basis of the closeness of the inter-distribution distance. As the inter-distribution distance measure, the apparatus uses the Kullback-Leibler divergence, the Bhattacharyya distance, the likelihood difference before and after distribution integration, or the difference between evaluation function values based on the variational Bayes method before and after distribution integration.

The best mode for carrying out the invention is described below in detail with reference to the drawings.
FIG. 1 outlines an embodiment of the acoustic model noise adaptation apparatus according to the present invention. The objective here is to perform noise adaptation of another clean acoustic model C: 104-C supplied as input and to obtain a new noise-adapted acoustic model D: 118-D.
First, it is assumed that noise adaptation of the clean acoustic model A: 104-A has been performed in the noise adaptation unit 117 and that the noise-adapted acoustic model B: 118-B has been obtained in advance. The noise adaptation unit 117 corresponds to the noise adaptation unit 917 inside the dotted line described earlier with reference to FIG. 9. Here the clean acoustic model A: 104-A and the noise-adapted acoustic model B: 118-B are assumed to have the same structure, including the state chain structure and the number of probability distributions in each state, but the approach can easily be extended to the case where the structures differ. Taking the clean acoustic model A: 104-A and the noise-adapted acoustic model B: 118-B as input, the acoustic model parameter change amount calculation unit 119 calculates the amount of change, caused by noise adaptation, in each parameter of the clean acoustic model A: 104-A, namely the state transition probabilities and the means, variances, and distribution weights of the component probability distributions. Meanwhile, the acoustic model structure reference relationship determination unit 120 determines the parameter of the clean acoustic model A: 104-A to which each parameter of the other clean acoustic model C: 104-C refers. Next, the acoustic model parameter adjustment unit 121 adjusts each parameter of the clean acoustic model C: 104-C, based on the amount of change in each parameter of the clean acoustic model A: 104-A caused by noise adaptation and on the reference relationship between the clean acoustic model C: 104-C and the clean acoustic model A: 104-A, to create the noise-adapted acoustic model D: 118-D.
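The flow of FIG. 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Gaussian` class, the function names, and the dictionary keys are hypothetical, and the form of the "amount of change" (an additive shift for means, a ratio for variances) is an assumption, since this passage does not fix it.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    mean: list[float]  # per-dimension means
    var: list[float]   # per-dimension variances (diagonal covariance)


def parameter_change(clean_a: Gaussian, adapted_b: Gaussian):
    """Change from A to B: additive mean shift and variance ratio (assumed forms)."""
    shift = [mb - ma for ma, mb in zip(clean_a.mean, adapted_b.mean)]
    ratio = [vb / va for va, vb in zip(clean_a.var, adapted_b.var)]
    return shift, ratio


def adapt(model_c, model_a, model_b, reference):
    """Build model D by applying the A->B change to each entry of C.

    `reference` maps each key of C to the key of A (and B) it refers to.
    """
    model_d = {}
    for key_c, gauss_c in model_c.items():
        key_a = reference[key_c]
        shift, ratio = parameter_change(model_a[key_a], model_b[key_a])
        model_d[key_c] = Gaussian(
            mean=[m + s for m, s in zip(gauss_c.mean, shift)],
            var=[v * r for v, r in zip(gauss_c.var, ratio)],
        )
    return model_d


# Toy one-state models (all values are made up for illustration).
model_a = {"a/s1": Gaussian([0.0, 1.0], [1.0, 2.0])}
model_b = {"a/s1": Gaussian([0.5, 0.8], [2.0, 3.0])}
model_c = {"c/s1": Gaussian([1.0, 1.5], [0.5, 1.0])}
model_d = adapt(model_c, model_a, model_b, {"c/s1": "a/s1"})
print(model_d["c/s1"])
```

The three functions mirror units 119 (change calculation), 120 (the `reference` map), and 121 (adjustment); no noisy speech data for model C is needed at any point.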

FIG. 2 shows the four acoustic models of FIG. 1 at the level of phoneme HMMs. With reference to FIG. 2, the mechanism by which the state transition probabilities of the other clean acoustic model C: 204-C are adjusted in the acoustic model noise adaptation apparatus according to the present invention is described. The description assumes that each phoneme HMM has the three-state left-to-right structure described above, but it can easily be extended to other structures.
First, the acoustic model parameter change amount calculation unit 219 calculates the rate of change, caused by noise adaptation, of each state transition probability of the clean acoustic model A: 204-A from the transition probabilities of the corresponding states of the clean acoustic model A: 204-A and the noise-adapted acoustic model B: 218-B. Because the two models have the same structure, their state transition probabilities correspond one to one.
Next, the processing in the acoustic model structure reference relationship determination unit 220 is described, taking the phoneme p-a-t (triphone) as an example of a phoneme HMM contained in the clean acoustic model C: 204-C. Several variations are conceivable for searching for the phoneme HMM of the clean acoustic model A: 204-A to which the phoneme p-a-t (triphone) of the clean acoustic model C: 204-C refers. The most common is to ignore the dependence on the preceding and following phoneme context step by step. In this method, the clean acoustic model A: 204-A is first searched for the phoneme p-a-t (triphone); if it exists, it becomes the phoneme HMM of the clean acoustic model A: 204-A to which the phoneme p-a-t of the clean acoustic model C: 204-C refers. If it does not exist, the phoneme p-a-* (preceding-context-dependent biphone) is searched for as the phoneme HMM closest to p-a-t, and is used as the reference if it exists. If p-a-* does not exist either, the phoneme *-a-t (following-context-dependent biphone) is searched for as the next closest phoneme HMM, and is used as the reference if it exists. A variation that changes which of the two biphones, preceding-context-dependent or following-context-dependent, is given priority is also conceivable. If *-a-t does not exist either, the phoneme *-a-* (monophone) is searched for as the next closest phoneme HMM. Since an acoustic model normally contains a monophone HMM for every phoneme, the phoneme *-a-* (monophone) can be assumed always to be found, so in the last resort the phoneme HMM of the clean acoustic model A: 204-A to which the phoneme p-a-t of the clean acoustic model C: 204-C refers is the phoneme *-a-* (monophone). Besides this stepwise method, it is also conceivable, for example, to select the phoneme *-a-* (monophone) from the outset as the reference.
In the case of FIG. 2, the stepwise method is used. The phoneme p-a-t (triphone) did not exist in the clean acoustic model A: 204-A, but the phoneme p-a-* (preceding-context-dependent biphone) did, so the phoneme HMM of the clean acoustic model A: 204-A to which the phoneme p-a-t (triphone) of the clean acoustic model C: 204-C refers is the phoneme p-a-*. Once the phoneme p-a-t of the clean acoustic model C: 204-C and the phoneme p-a-* of the clean acoustic model A: 204-A have been matched in this way, their states can also be matched, because both phoneme HMMs have the three-state left-to-right structure.
Finally, the acoustic model parameter adjustment unit 221 adjusts each state transition probability of the clean acoustic model C: 204-C, based on the rate of change of the state transition probabilities of the clean acoustic model A: 204-A calculated by the acoustic model parameter change amount calculation unit 219 and on the correspondence between the states of the clean acoustic model C: 204-C and the states of the clean acoustic model A: 204-A determined by the acoustic model structure reference relationship determination unit 220, and uses the result as the corresponding state transition probability of the noise-adapted acoustic model D: 218-D.
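The stepwise back-off search and the transition probability adjustment described for FIG. 2 might be sketched as follows. The label format `left-center-right` with `*` as a wildcard follows the text; the function names are hypothetical, and renormalizing the adjusted probabilities so that they sum to 1 is an assumption of this sketch, not something the text states.

```python
def find_reference(triphone, inventory_a, prefer_preceding=True):
    """Return the label in model A that a model-C triphone 'l-c-r' refers to."""
    left, center, right = triphone.split("-")
    biphones = [f"{left}-{center}-*", f"*-{center}-{right}"]
    if not prefer_preceding:
        biphones.reverse()  # variation: try the following-context biphone first
    for candidate in [triphone, *biphones, f"*-{center}-*"]:
        if candidate in inventory_a:
            return candidate
    raise KeyError(f"no reference found for {triphone}")


def adjust_transitions(trans_c, trans_a, trans_b):
    """Scale C's outgoing transition probabilities by the A->B change rate.

    Renormalizing so the row sums to 1 is an assumption of this sketch.
    """
    scaled = [pc * (pb / pa) for pc, pa, pb in zip(trans_c, trans_a, trans_b)]
    total = sum(scaled)
    return [p / total for p in scaled]


inventory_a = {"p-a-*", "*-a-*"}            # model A lacks the triphone p-a-t
ref = find_reference("p-a-t", inventory_a)  # backs off to the biphone p-a-*
row_d = adjust_transitions([0.5, 0.5], [0.6, 0.4], [0.7, 0.3])
print(ref, row_d)
```

With the toy inventory above the search reproduces the FIG. 2 case: the triphone is missing from model A, so the preceding-context biphone is used as the reference.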

FIG. 3 shows the four acoustic models of FIG. 1 at the level of states. With reference to FIG. 3, one example of the mechanism by which the parameters of the component probability distributions contained in state S_C1 of the phoneme p-a-t of the clean acoustic model C: 204-C, used as the example in FIG. 2, are adjusted is described. The description assumes that each state contains four component probability distributions, that each component is a multidimensional uncorrelated normal distribution, and that the figure shows dimension i; it can easily be extended to other conditions.
First, the acoustic model parameter change amount calculation unit 319 obtains the amount of change in the parameters (mean, variance, distribution weight) of the component normal distributions N_A1, N_A2, N_A3, and N_A4 of state S_A1 from state S_A1 of the clean acoustic model A: 304-A and the corresponding state S_B1 of the noise-adapted acoustic model B: 318-B. In this example, the parameter change amount of N_A3, the component with the largest distribution weight in state S_A1, is used as the parameter change amount (mean, variance) of all components N_A1, N_A2, N_A3, and N_A4 in state S_A1. It can be regarded as a parameter change amount attached to state S_A1. Meanwhile, in the acoustic model structure reference relationship determination unit 320, state S_C1 of the clean acoustic model C: 304-C has been matched with state S_A1 of the clean acoustic model A: 304-A, and in this example no correspondence between individual component normal distributions is needed. The acoustic model parameter adjustment unit 321 then adjusts the parameters (mean, variance) of each component normal distribution of state S_C1, based on the parameter change amount of N_A3 calculated by the acoustic model parameter change amount calculation unit 319 and on the correspondence between state S_C1 and state S_A1 determined by the acoustic model structure reference relationship determination unit 320, and uses the results as the parameters (mean, variance) of the component normal distributions of state S_D1 of the noise-adapted acoustic model D: 318-D. In this example the distribution weights are not adjusted.
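A minimal sketch of the FIG. 3 variant, in which the change of the largest-weight component stands in for the whole state. The function names are hypothetical, and the additive mean shift and variance ratio are assumed forms of the "amount of change"; components of A and B are assumed to be stored in the same order because B was derived from A.

```python
def dominant_component_change(weights_a, means_a, vars_a, means_b, vars_b):
    """Change of the largest-weight component of a state, used for the whole state."""
    k = max(range(len(weights_a)), key=lambda j: weights_a[j])
    shift = [mb - ma for ma, mb in zip(means_a[k], means_b[k])]
    ratio = [vb / va for va, vb in zip(vars_a[k], vars_b[k])]
    return k, shift, ratio


def apply_to_state(means_c, vars_c, shift, ratio):
    """Apply one state-level change to every component of the C state."""
    means_d = [[m + s for m, s in zip(mean, shift)] for mean in means_c]
    vars_d = [[v * r for v, r in zip(var, ratio)] for var in vars_c]
    return means_d, vars_d


# States S_A1 / S_B1 with four one-dimensional components (made-up values).
weights_a = [0.2, 0.1, 0.5, 0.2]
means_a, vars_a = [[0.0], [1.0], [2.0], [3.0]], [[1.0], [1.0], [1.0], [1.0]]
means_b, vars_b = [[0.3], [1.2], [2.5], [3.1]], [[1.5], [1.4], [2.0], [1.2]]
k, shift, ratio = dominant_component_change(weights_a, means_a, vars_a, means_b, vars_b)

# State S_C1: every component receives the same shift and ratio; weights stay as is.
means_c, vars_c = [[0.5], [1.5], [2.5], [3.5]], [[0.8], [0.8], [0.8], [0.8]]
means_d, vars_d = apply_to_state(means_c, vars_c, shift, ratio)
```

Here index `k == 2` corresponds to N_A3, matching the example in the text.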

FIG. 4 shows the four acoustic models of FIG. 1 at the level of states. With reference to FIG. 4, an example different from that of FIG. 3 is described for the mechanism by which the parameters of the component probability distributions contained in state S_C1 of the phoneme p-a-t of the clean acoustic model C: 204-C, used as the example in FIG. 2, are adjusted. The description assumes that each state contains four component probability distributions, that each component is a multidimensional uncorrelated normal distribution, and that the figure shows dimension i; it can easily be extended to other conditions.
First, the acoustic model parameter change amount calculation unit 419 obtains the amount of change in the parameters (mean, variance, distribution weight) of the component normal distributions of state S_A1 from state S_A1 of the clean acoustic model A: 404-A and the corresponding state S_B1 of the noise-adapted acoustic model B: 418-B. In this example, the components N_A1, N_A2, N_A3, and N_A4 of state S_A1 are integrated into a single integrated distribution N_A, the components N_B1, N_B2, N_B3, and N_B4 of state S_B1 are integrated into a single integrated distribution N_B, and the parameter change amount from N_A to N_B is used as the parameter change amount (mean, variance) of all components N_A1, N_A2, N_A3, and N_A4 in state S_A1. It can be regarded as a parameter change amount attached to state S_A1. The mean μ_Ai and variance σ²_Ai in dimension i and the distribution weight W_A of the integrated distribution N_A are obtained by the following equations from the means μ_A1i, μ_A2i, μ_A3i, μ_A4i and variances σ²_A1i, σ²_A2i, σ²_A3i, σ²_A4i in dimension i and the distribution weights W_A1, W_A2, W_A3, W_A4 of the components N_A1, N_A2, N_A3, and N_A4.
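The equations themselves are not reproduced in this text. As a hedged illustration only, the sketch below uses standard moment matching for a Gaussian mixture (summed weight, weight-averaged mean, and second-moment-matched variance), which is consistent with the definitions above but is an assumption rather than the patent's confirmed formula; the function name is hypothetical.

```python
def integrate_components(weights, means, variances):
    """Collapse mixture components into one diagonal Gaussian by moment matching.

    Assumed form: W = sum_k W_k, mu_i = sum_k W_k mu_ki / W,
    var_i = sum_k W_k (var_ki + mu_ki^2) / W - mu_i^2.
    """
    total_weight = sum(weights)
    dims = range(len(means[0]))
    mean = [
        sum(w * m[i] for w, m in zip(weights, means)) / total_weight
        for i in dims
    ]
    var = [
        sum(w * (v[i] + m[i] ** 2) for w, m, v in zip(weights, means, variances))
        / total_weight - mean[i] ** 2
        for i in dims
    ]
    return total_weight, mean, var


# Two equally weighted one-dimensional components (made-up values).
w_a, mu_a, var_a = integrate_components([0.5, 0.5], [[0.0], [2.0]], [[1.0], [1.0]])
print(w_a, mu_a, var_a)
```

Applying the same function to the components of state S_B1 would give N_B, and the difference between the two results would serve as the state-level change.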

The integrated distribution N_B is obtained by the same equations. Meanwhile, in the acoustic model structure reference relationship determination unit 420, state S_C1 of the clean acoustic model C: 404-C has been matched with state S_A1 of the clean acoustic model A: 404-A, and in this example no correspondence between individual component normal distributions is needed. The acoustic model parameter adjustment unit 421 then adjusts the parameters (mean, variance) of each component normal distribution of state S_C1, based on the parameter change amount of the integrated distribution N_A of state S_A1 calculated by the acoustic model parameter change amount calculation unit 419 and on the correspondence between state S_C1 and state S_A1 determined by the acoustic model structure reference relationship determination unit 420, and uses the results as the parameters (mean, variance) of the component normal distributions of state S_D1 of the noise-adapted acoustic model D: 418-D. In this example the distribution weights are not adjusted.

FIG. 5 shows the four acoustic models of FIG. 1 at the level of states. With reference to FIG. 5, an example different from those of FIGS. 3 and 4 is described for the mechanism by which the parameters of the component probability distributions contained in state S_C1 of the phoneme p-a-t of the clean acoustic model C: 204-C, used as the example in FIG. 2, are adjusted. The description assumes that each state contains four component probability distributions, that each component is a multidimensional uncorrelated normal distribution, and that the figure shows dimension i; it can easily be extended to other conditions.
First, the acoustic model parameter change amount calculation unit 519 obtains the amount of change in the parameters (mean, variance, distribution weight) of the component normal distributions of state S_A1 from state S_A1 of the clean acoustic model A: 504-A and the corresponding state S_B1 of the noise-adapted acoustic model B: 518-B. In this example, the parameter change amounts (mean, variance, distribution weight) of the components N_A1, N_A2, N_A3, and N_A4 of state S_A1 are calculated individually from the correspondence between the components N_A1, N_A2, N_A3, N_A4 of state S_A1 and the components N_B1, N_B2, N_B3, N_B4 of state S_B1. Meanwhile, in the acoustic model structure reference relationship determination unit 520, state S_C1 of the clean acoustic model C: 504-C has been matched with state S_A1 of the clean acoustic model A: 504-A, and in addition the component of state S_A1 to which each of the components N_C1, N_C2, N_C3, and N_C4 of state S_C1 refers is determined on the basis of the inter-distribution distance. Here N_A1 is selected as the component of state S_A1 closest to N_C1, and likewise N_A2 for N_C2, N_A4 for N_C3, and N_A4 for N_C4. Thus, even if states S_C1 and S_A1 have the same number of component normal distributions, the correspondence between components is not necessarily one to one.
The acoustic model parameter adjustment unit 521 then adjusts the parameters (mean, variance, distribution weight) of the components N_C1, N_C2, N_C3, and N_C4 of state S_C1, based on the parameter change amounts of the components N_A1, N_A2, N_A3, and N_A4 of state S_A1 calculated by the acoustic model parameter change amount calculation unit 519 and on the components N_A1, N_A2, and N_A4 of state S_A1 to which the components N_C1, N_C2, N_C3, and N_C4 refer, as determined by the acoustic model structure reference relationship determination unit 520, and uses the results as the parameters (mean, variance, distribution weight) of the components N_D1, N_D2, N_D3, and N_D4 of state S_D1 of the noise-adapted acoustic model D: 518-D.
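The distance-based reference assignment of FIG. 5 can be sketched as a nearest-neighbor search over components. The function names are hypothetical, and `squared_mean_distance` is only a stand-in for whichever inter-distribution distance measure is chosen; the values are made up so that the result mirrors the mapping in the text.

```python
def nearest_components(state_c, state_a, distance):
    """For each component of C, return the index of the closest component of A."""
    return [
        min(range(len(state_a)), key=lambda j: distance(comp_c, state_a[j]))
        for comp_c in state_c
    ]


def squared_mean_distance(g1, g2):
    """Stand-in distance: variance-normalized squared difference of the means."""
    (mean1, var1), (mean2, var2) = g1, g2
    return sum((a - b) ** 2 / (u + v) for a, b, u, v in zip(mean1, mean2, var1, var2))


# Components are (mean, variance) pairs; one dimension, made-up values.
state_a = [([0.0], [1.0]), ([2.0], [1.0]), ([4.0], [1.0]), ([6.0], [1.0])]
state_c = [([0.1], [1.0]), ([2.2], [1.0]), ([5.4], [1.0]), ([6.3], [1.0])]
reference = nearest_components(state_c, state_a, squared_mean_distance)
print(reference)  # [0, 1, 3, 3]
```

Indices 0, 1, 3, 3 correspond to N_A1, N_A2, N_A4, N_A4: two components of C share one reference and N_A3 is unused, so the mapping is not one to one.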

As the inter-distribution distance measure in the example of FIG. 5, the Kullback-Leibler divergence can be used. Let the mean and variance in dimension i of the multidimensional uncorrelated normal distributions N_1 and N_2 be (μ_1i, σ²_1i) and (μ_2i, σ²_2i), respectively. The Kullback-Leibler divergence K(N_1, N_2) between N_1 and N_2 can then be calculated as follows, where I is the number of dimensions.
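The equation is not reproduced in this text. As an illustration, the sketch below implements the well-known closed form of the Kullback-Leibler divergence between two diagonal-covariance Gaussians. Whether the patent uses this one-directional form or a symmetrized one is not recoverable here, so both are shown; the function names are hypothetical.

```python
import math


def kl_divergence(mean1, var1, mean2, var2):
    """KL(N1 || N2) for diagonal-covariance Gaussians, summed over dimensions."""
    return 0.5 * sum(
        math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0
        for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2)
    )


def symmetric_kl(mean1, var1, mean2, var2):
    """Symmetrized variant: KL(N1 || N2) + KL(N2 || N1)."""
    return (kl_divergence(mean1, var1, mean2, var2)
            + kl_divergence(mean2, var2, mean1, var1))


d = kl_divergence([0.0], [1.0], [1.0], [1.0])
```

A symmetric measure is often preferred when the value is used as a distance, because the plain divergence depends on the order of its arguments.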

As another inter-distribution distance measure in the example of FIG. 5, the Bhattacharyya distance can be used. The Bhattacharyya distance B(N_1, N_2) between the multidimensional uncorrelated normal distributions N_1 and N_2 can be calculated as follows.
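The equation is not reproduced in this text. The sketch below uses the standard Bhattacharyya distance between two Gaussians, specialized to diagonal covariance so that it reduces to a sum over dimensions; the function name is hypothetical and the patent's exact notation may differ.

```python
import math


def bhattacharyya_distance(mean1, var1, mean2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    total = 0.0
    for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2):
        v_avg = 0.5 * (v1 + v2)
        total += 0.125 * (m1 - m2) ** 2 / v_avg               # mean term
        total += 0.5 * math.log(v_avg / math.sqrt(v1 * v2))   # variance term
    return total


b = bhattacharyya_distance([0.0], [1.0], [2.0], [1.0])
```

Unlike the Kullback-Leibler divergence, this measure is symmetric in its two arguments.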

As another inter-distribution distance measure in the example of FIG. 5, the likelihood difference before and after the distributions are integrated can be used. Let the mean, variance, and number of training data frames in dimension i of a multidimensional uncorrelated normal distribution N_k be (μ_ki, σ²_ki, Γ_k). The likelihood (expected value) of N_k can then be calculated as follows.

Let P_1 and P_2 be the likelihoods of the multidimensional uncorrelated normal distributions N_1 and N_2 before integration, and let P be the likelihood of the multidimensional uncorrelated normal distribution N after integration. The likelihood difference ΔP(N_1, N_2 → N) before and after integration can then be calculated as follows.
ΔP(N_1, N_2 → N) = P_1 + P_2 − P    (12)
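Equation (12) can be illustrated with a short sketch. The likelihood equation for N_k is not reproduced in this text, so the code assumes the standard expected log-likelihood of Γ_k frames under a maximum-likelihood diagonal Gaussian, and assumes that integration is done by frame-count-weighted moment matching. Both are assumptions, and all function names are hypothetical.

```python
import math


def expected_log_likelihood(var, frames):
    """Expected log-likelihood of `frames` frames under an ML-fitted diagonal Gaussian."""
    return -0.5 * frames * sum(math.log(2.0 * math.pi * v) + 1.0 for v in var)


def integrate(mean1, var1, frames1, mean2, var2, frames2):
    """Integrate two Gaussians by moment matching, weighted by frame counts."""
    frames = frames1 + frames2
    mean = [(frames1 * a + frames2 * b) / frames for a, b in zip(mean1, mean2)]
    var = [
        (frames1 * (u + a * a) + frames2 * (v + b * b)) / frames - m * m
        for a, u, b, v, m in zip(mean1, var1, mean2, var2, mean)
    ]
    return mean, var, frames


def likelihood_difference(mean1, var1, frames1, mean2, var2, frames2):
    """Delta P = P_1 + P_2 - P, following equation (12)."""
    _, var, frames = integrate(mean1, var1, frames1, mean2, var2, frames2)
    p1 = expected_log_likelihood(var1, frames1)
    p2 = expected_log_likelihood(var2, frames2)
    return p1 + p2 - expected_log_likelihood(var, frames)


delta = likelihood_difference([0.0], [1.0], 10.0, [2.0], [1.0], 10.0)
```

Under these assumptions ΔP is zero for identical distributions and grows as the two distributions move apart, which is what makes it usable as a distance-like measure.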
As yet another inter-distribution distance measure in the example of FIG. 5, the difference between evaluation function values based on the variational Bayes method before and after distribution integration can be used. The evaluation function value based on the variational Bayes method is disclosed in the reference (Shinji Watanabe, Yasuhiro Minami, Atsushi Nakamura, Naonori Ueda, "Selection of state-sharing HMM structure using Bayesian criteria", IEICE Transactions D-II, Vol. J86-D-II, No. 6, pp. 776-786, June 2003).

As described above, with the acoustic model noise adaptation apparatus described with reference to FIGS. 1 to 5, noise adaptation of an acoustic model can be performed instantly by reusing an existing noise adaptation result, without newly preparing noise-added speech data such as multi-condition speech data. For example, when noise adaptation is to be applied to a clean female-voice acoustic model trained on noise-free speech data from many female speakers, if a clean male-voice acoustic model trained on noise-free speech data from many male speakers and a noise-adapted male-voice acoustic model obtained from it by multi-condition training are available, there is no need to prepare noise-added female speech data and run multi-condition training again; the noise adaptation result of the male-voice acoustic model can be used as is to perform the adaptation instantly.

Furthermore, if it is assumed that the amount of parameter change caused by noise adaptation does not depend on speaker-specific voice characteristics, the clean acoustic model A of FIGS. 1 to 5 need not be an acoustic model trained on a large amount of speech data from many speakers; it may be, for example, an acoustic model trained on a small amount of speech data from a single speaker. With an acoustic model trained on such a small amount of speech data, both the data storage capacity and the computation time required for multi-condition training can be kept small, and noise adaptation is easy. For further simplification, a clean acoustic model A whose phoneme HMMs consist only of monophone HMMs can also be used.

A diagram illustrating the embodiment.
A diagram illustrating how state transition probabilities are adjusted in the embodiment.
A diagram illustrating how distribution parameters are adjusted in the embodiment based on the parameter change amount of the component distribution with the largest distribution weight.
A diagram illustrating how distribution parameters are adjusted in the embodiment based on the parameter change amount of the integrated distribution.
A diagram illustrating how distribution parameters are adjusted in the embodiment based on the parameter change amount of each component distribution.
A diagram illustrating a conventional example of a speech recognition apparatus.
A diagram illustrating an example of the structure of a state in an acoustic model.
A diagram illustrating an example of the structure of a phoneme HMM in an acoustic model.
A diagram illustrating multi-condition training.

Explanation of symbols

104-A Clean acoustic model A
104-C Another clean acoustic model C
117 Noise adaptation unit
118-B Noise-adapted acoustic model B
118-D New noise-adapted acoustic model D
119 Acoustic model parameter change amount calculation unit
120 Acoustic model structure reference relationship determination unit
121 Acoustic model parameter adjustment unit

Claims

An acoustic model noise adaptation method comprising:
preparing a clean acoustic model A trained on noise-free speech data and a noise-adapted acoustic model B obtained by noise adaptation based on the clean acoustic model A;
calculating the amount of change of each parameter due to noise adaptation from the clean acoustic model A to the noise-adapted acoustic model B;
determining the states and distributions of the clean acoustic model A that are referenced by each state and each distribution of another clean acoustic model C trained on noise-free speech data; and
adjusting each parameter of the other clean acoustic model C, based on the reference relationship between the states and distributions of the other clean acoustic model C and those of the clean acoustic model A and on the amount of change of each parameter due to noise adaptation from the clean acoustic model A to the noise-adapted acoustic model B, to create a new noise-adapted acoustic model D.

An acoustic model noise adaptation apparatus comprising:
a clean acoustic model A trained on noise-free speech data;
a noise adaptation unit that receives the clean acoustic model A and performs noise adaptation on it;
a noise-adapted acoustic model B obtained by noise adaptation based on the clean acoustic model A;
an acoustic model parameter change amount calculation unit that receives the clean acoustic model A and the noise-adapted acoustic model B and calculates the amount of change of each parameter due to noise adaptation of the clean acoustic model A;
an acoustic model structure reference relationship determination unit that receives another clean acoustic model C trained on noise-free speech data and the clean acoustic model A, and determines each parameter of the clean acoustic model A referenced by each parameter of the other clean acoustic model C; and
an acoustic model parameter adjustment unit that receives the amount of change of each parameter due to noise adaptation of the clean acoustic model A calculated by the acoustic model parameter change amount calculation unit and the reference relationship between the parameters of the clean acoustic model A and the other clean acoustic model C determined by the acoustic model structure reference relationship determination unit, and adjusts each parameter of the other clean acoustic model C to create a new noise-adapted acoustic model D.

In the acoustic model noise adaptation apparatus according to claim 2,
The acoustic model parameter change amount calculation unit, when calculating the amount of change of each parameter due to noise adaptation from the clean acoustic model A to the noise-adapted acoustic model B, takes the parameter change amount of the distribution having the largest distribution weight in each state of the clean acoustic model A as the parameter change amount of all distributions in that state.

In the acoustic model noise adaptation apparatus according to claim 2,
The acoustic model parameter change amount calculation unit, when calculating the amount of change of each parameter due to noise adaptation from the clean acoustic model A to the noise-adapted acoustic model B, creates, in each state of the clean acoustic model A and in the corresponding state of the noise-adapted acoustic model B, a distribution that integrates all distributions in the state, and takes the parameter change amount of the integrated distribution as the parameter change amount of all distributions in that state.

In the acoustic model noise adaptation apparatus according to any one of claims 2 to 4,
The acoustic model structure reference relationship determination unit, when determining the states and distributions of the clean acoustic model A referenced by each state and each distribution of the other clean acoustic model C, determines the correspondence between the distributions in each state of the other clean acoustic model C and the distributions in the corresponding state of the clean acoustic model A based on the closeness of the distance between the distributions.

In the acoustic model noise adaptation apparatus according to any one of claims 2 to 5,
An acoustic model noise adaptation apparatus characterized by using the Kullback-Leibler divergence as the distance measure between distributions.

In the acoustic model noise adaptation apparatus according to any one of claims 2 to 5,
An acoustic model noise adaptation apparatus characterized by using the Bhattacharyya distance as the distance measure between distributions.

In the acoustic model noise adaptation apparatus according to any one of claims 2 to 5,
An acoustic model noise adaptation apparatus characterized by using the difference in likelihood before and after distribution integration as the distance measure between distributions.

In the acoustic model noise adaptation apparatus according to any one of claims 2 to 5,
An acoustic model noise adaptation apparatus characterized by using the difference in the evaluation function value of the variational Bayesian method before and after distribution integration as the distance measure between distributions.
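
As an illustration of the distance-based correspondence named in the claims, the following minimal sketch computes the Kullback-Leibler divergence and the Bhattacharyya distance between Gaussians and picks, for each distribution of model C, the nearest distribution of model A within one state. It assumes diagonal-covariance Gaussians represented as `(means, variances)` pairs. The names `kl_divergence`, `bhattacharyya`, and `nearest_reference` are hypothetical and do not come from the patent. Note that the Kullback-Leibler divergence is asymmetric, and the direction used here (C to A) is an assumption.

```python
import math
from typing import Callable, List, Sequence, Tuple

# (means, variances) of a diagonal-covariance Gaussian (assumed representation).
Gauss = Tuple[Sequence[float], Sequence[float]]


def kl_divergence(p: Gauss, q: Gauss) -> float:
    """KL(p || q) for diagonal Gaussians, summed over dimensions."""
    (mp, vp), (mq, vq) = p, q
    return 0.5 * sum(
        math.log(b / a) + (a + (x - y) ** 2) / b - 1.0
        for x, a, y, b in zip(mp, vp, mq, vq)
    )


def bhattacharyya(p: Gauss, q: Gauss) -> float:
    """Bhattacharyya distance for diagonal Gaussians."""
    (mp, vp), (mq, vq) = p, q
    total = 0.0
    for x, a, y, b in zip(mp, vp, mq, vq):
        v = 0.5 * (a + b)  # averaged variance for this dimension
        total += 0.125 * (x - y) ** 2 / v + 0.5 * math.log(v / math.sqrt(a * b))
    return total


def nearest_reference(
    c_dists: List[Gauss],
    a_dists: List[Gauss],
    distance: Callable[[Gauss, Gauss], float] = kl_divergence,
) -> List[int]:
    """For each distribution of C, the index of the closest distribution of A."""
    return [
        min(range(len(a_dists)), key=lambda j: distance(c, a_dists[j]))
        for c in c_dists
    ]


# Toy one-dimensional example (assumed values).
c_state = [([0.0], [1.0]), ([5.0], [1.0])]
a_state = [([4.0], [1.0]), ([1.0], [1.0])]
print(nearest_reference(c_state, a_state))
print(nearest_reference(c_state, a_state, distance=bhattacharyya))
```

The distance function is passed as a parameter so that either measure can be swapped in without changing the matching step.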