JP5089651B2

JP5089651B2 - Speech recognition device, acoustic model creation device, method thereof, program, and recording medium

Info

Publication number: JP5089651B2
Application number: JP2009138987A
Authority: JP
Inventors: 哲小橋川; 太一浅見
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-06-10
Filing date: 2009-06-10
Publication date: 2012-12-05
Anticipated expiration: 2029-06-10
Also published as: JP2010286586A

Description

The present invention relates to a speech recognition device that produces few recognition errors even when unknown non-stationary noise is input, to an acoustic model creation device that creates a highly accurate acoustic model even when unknown non-stationary noise is input, to the methods of these devices, and to a program and a recording medium.

In recent years, advances in speech recognition technology based on statistical methods have made highly accurate speech recognition possible in quiet environments. In real environments, however, recognition performance degrades in the presence of noise, particularly unknown non-stationary noise.

FIG. 9 shows the functional configuration of a conventional speech recognition device 900. The speech recognition device 900 includes an A/D conversion unit 10, a feature analysis unit 20, a speech recognition processing unit 30, an acoustic model parameter memory 40, and a language model parameter memory 50.

The A/D conversion unit 10 converts the input analog speech signal into a discrete digital signal at a sampling frequency of, for example, 16 kHz. The feature analysis unit 20 takes the discretized digital speech signal as input and calculates a speech feature quantity O_t for each frame, where one frame (20 ms) consists of, for example, 320 samples of the digital speech signal. The speech feature quantity O_t is calculated by, for example, mel-frequency cepstral coefficient (MFCC) analysis.
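
The frame arithmetic in this paragraph can be checked with a short sketch. The sampling frequency and frame length come from the text; the function names are hypothetical, and the use of non-overlapping frames with a dropped trailing remainder is an assumption, since the text does not specify a frame shift.

```python
SAMPLE_RATE_HZ = 16_000  # sampling frequency given in the text
FRAME_LEN = 320          # samples per frame given in the text


def frame_duration_ms(frame_len: int, sample_rate_hz: int) -> float:
    """Duration of one frame in milliseconds."""
    return 1000.0 * frame_len / sample_rate_hz


def split_into_frames(samples, frame_len=FRAME_LEN):
    """Cut a sample sequence into consecutive frames.

    Assumption: frames do not overlap and a trailing partial frame is dropped.
    """
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]
```

With these values, 320 samples at 16 kHz correspond to the 20 ms frame stated above.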

The speech recognition processing unit 30 takes the speech feature quantity O_t as input, refers to the acoustic model recorded in the acoustic model parameter memory 40 and the language model recorded in the language model parameter memory 50, and outputs, as the speech recognition result, the recognition candidate for which the sum of the acoustic model likelihood and the language model likelihood is highest.

To cope with sudden noise, which is unknown non-stationary noise, the conventional speech recognition device 900 has used a method of correcting the likelihood of the acoustic model. The acoustic model is represented by an HMM (Hidden Markov Model), and the normal distribution is widely used for its appearance probability distribution. The likelihood, which is the logarithm of that appearance probability, is a quadratic function that decreases with the square of the deviation from the distribution mean. This difference in behavior between the appearance probability and the likelihood of the acoustic model is considered one cause of recognition errors due to sudden noise. An idea for correcting this difference is disclosed in, for example, Non-Patent Document 1.

FIG. 10 illustrates the likelihood correction idea of Non-Patent Document 1. The upper left of FIG. 10 shows the appearance probability distributions of the phonemes |a| and |o|, and the lower left shows the likelihood curves of the same phonemes. The horizontal axis of FIG. 10 is the speech feature quantity. When the speech feature quantity y_s is observed, the appearance probability of phoneme |o| is higher than that of phoneme |a|. The likelihoods shown at the lower left exhibit the same tendency.

However, when the speech feature quantity y_s changes to y_o under the influence of superimposed noise or the like, both appearance probabilities become small, yet in the likelihoods shown at the lower left the ordering of phonemes |a| and |o| is not only reversed: the quadratic curves also produce a large gap between them. That such a large difference in likelihood arises even though the difference in appearance probability is small is considered one cause of recognition errors.
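
The effect described here can be reproduced numerically. The following is a minimal sketch with two one-dimensional normal distributions; the means, standard deviations, and the positions used for y_s and y_o are invented for illustration and are not taken from FIG. 10.

```python
import math


def normal_log_pdf(y, mean, std):
    """Log of a one-dimensional normal density (quadratic in y)."""
    z = (y - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


# Hypothetical (mean, std) pairs standing in for phonemes |a| and |o|.
PHONEME_A = (0.0, 2.0)
PHONEME_O = (1.0, 0.5)


def gaps(y):
    """Return (probability gap, likelihood gap), each as |a| minus |o|."""
    log_a = normal_log_pdf(y, *PHONEME_A)
    log_o = normal_log_pdf(y, *PHONEME_O)
    return math.exp(log_a) - math.exp(log_o), log_a - log_o


prob_gap_s, like_gap_s = gaps(1.0)  # y_s: near the center of |o|
prob_gap_o, like_gap_o = gaps(6.0)  # y_o: pushed into the tails by noise
```

At the assumed y_s both gaps are negative (|o| wins). At the assumed y_o the probability gap is positive but tiny, while the likelihood gap is large and has the opposite sign, which is the reversal and enlargement the paragraph describes.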

Non-Patent Document 1 therefore improves robustness against sudden noise by adding a small positive correction constant ε to the distribution N(y) of the observed data and using the likelihood of the resulting value (Equation (1)). This avoids the problem that a small difference in appearance probability in the linear domain becomes a large difference in likelihood.
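
Equation (1) itself is not reproduced in this text. Based only on the description (ε is added to N(y) and the likelihood, that is, the logarithm, of the result is taken), it presumably has the following form; the symbol L(y) for the corrected likelihood is an assumption.

$$
L(y) = \log\bigl(N(y) + \varepsilon\bigr) \tag{1}
$$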

Here, N(y) is the distribution of the speech feature quantity of the observed data, that is, the distribution of phoneme |a| or phoneme |o| in FIG. 10, and ε is the correction constant.
The speech recognition processing unit 30 performs the process of adding the correction constant ε. As shown at the lower right of FIG. 10, this process reduces the difference in likelihood. Because the change in likelihood when sudden noise occurs becomes smaller, the occurrence of recognition errors can be expected to be suppressed.
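
A minimal sketch of the correction, assuming it takes the form log(N(y) + ε) as the description suggests. The distribution parameters, the value of ε, and the observation position are invented for illustration.

```python
import math


def normal_pdf(y, mean, std):
    """One-dimensional normal density."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def corrected_likelihood(y, mean, std, eps):
    """log(N(y) + eps); the eps term bounds the likelihood below by log(eps)."""
    return math.log(normal_pdf(y, mean, std) + eps)


# Hypothetical (mean, std) pairs standing in for phonemes |a| and |o|.
PHONEME_A = (0.0, 2.0)
PHONEME_O = (1.0, 0.5)
EPS = 1e-3  # assumed correction constant
Y_O = 6.0   # assumed noise-shifted observation

plain_gap = (math.log(normal_pdf(Y_O, *PHONEME_A))
             - math.log(normal_pdf(Y_O, *PHONEME_O)))
corrected_gap = (corrected_likelihood(Y_O, *PHONEME_A, EPS)
                 - corrected_likelihood(Y_O, *PHONEME_O, EPS))
```

In this toy setting the corrected gap is far smaller than the uncorrected one, because both likelihoods are held near log(ε) in the tails. How much the gap shrinks depends on the chosen ε.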

Non-Patent Document 1: Hitoshi Yamamoto, Koichi Shinoda, Shigeki Sagayama, "Speech recognition robust to sudden noise by likelihood correction of normal distributions", Proceedings of the Acoustical Society of Japan Autumn Meeting, 1-9-10, pp. 19-20, 2002.

The conventional idea of introducing the correction constant ε carries the risk that recognition accuracy degrades depending on how the constant is set. The optimum constant differs with the data to be recognized and therefore cannot be fixed uniformly. If the constant is fixed uniformly, the strength of the correction varies from one phoneme model to another, and the correction may interfere even in cases where the correct likelihood would otherwise have been obtained.

The present invention has been made in view of these points. Its object is to provide a speech recognition device and method that reduce recognition errors by excluding data whose likelihood falls outside a certain range, an acoustic model creation device and method based on the same idea, and a program and a recording medium.

The speech recognition device of the present invention includes a feature analysis unit, a GMM likelihood calculation unit, a GMM likelihood determination unit, and a speech recognition processing unit. The feature analysis unit analyzes the speech feature quantity of the input digital speech signal frame by frame. The GMM likelihood calculation unit matches a GMM (Gaussian Mixture Model) against the speech feature quantity and calculates a GMM likelihood for each frame. The GMM likelihood determination unit determines whether the GMM likelihood is within a predetermined range and outputs the determination result and the GMM likelihood. The speech recognition processing unit takes the speech feature quantity, the GMM likelihood, and the determination result as input. For frames within the predetermined range it performs speech recognition based on the acoustic likelihood corresponding to the speech feature quantity, and for frames outside the predetermined range it performs speech recognition using an acoustic likelihood derived from the GMM likelihood.

The acoustic model creation device of the present invention includes a learning processing unit together with the same GMM likelihood calculation unit and GMM likelihood determination unit as described above. The learning processing unit generates a post-learning acoustic model while excluding frames determined to be out of range from the calculation of acoustic model statistics.

The speech recognition device and the acoustic model creation device of the present invention use a likelihood obtained from a GMM, a Gaussian mixture model that covers almost all phonemes and therefore has a wide variance. This reduces the conventional problems of likelihood reversal at the tails of the distributions and of enlarged likelihood differences.
In other words, for a frame that the GMM likelihood determination unit judges to be out of range, the acoustic likelihood is replaced by the GMM likelihood that the GMM likelihood calculation unit calculates from the GMM, so the acoustic likelihood is stabilized when sudden noise is input. As a result, the accuracy of both the speech recognition process and the acoustic model learning process improves.

FIG. 1: A diagram schematically showing one state of a phoneme model.
FIG. 2: A diagram showing an example of a phoneme model.
FIG. 3: A diagram showing an example functional configuration of the speech recognition device 100 of the present invention.
FIG. 4: A diagram showing the operation flow of the speech recognition device 100.
FIG. 5: A diagram showing the relationship between the speech feature quantity and the GMM likelihood when a GMM is used.
FIG. 6: A diagram showing the operation flow of the speech recognition device 100′.
FIG. 7: A diagram showing an example functional configuration of the acoustic model creation device 200 of the present invention.
FIG. 8: A diagram showing the operation flow of the acoustic model creation device 200.
FIG. 9: A diagram showing the functional configuration of the conventional speech recognition device 900.
FIG. 10: A diagram showing the likelihood correction idea disclosed in Non-Patent Document 1.

Before the embodiments of the present invention are described, the idea behind the invention is explained.
[Idea of the invention]
The acoustic model is explained first. Each phoneme model making up the acoustic model is built as a probabilistic chain of about three states, and each state is represented by a mixture of normal distributions. FIG. 1 shows a state s for the example of three mixture components, consisting of three normal distributions N(μ_1, U_1), N(μ_2, U_2), and N(μ_3, U_3) and weight coefficients c_1, c_2, and c_3. Here μ is a mean vector and U is a covariance matrix.

FIG. 2 shows a conceptual diagram of a phoneme model composed of three states as an example. This example is a so-called left-to-right HMM in which three states s_1 (first state), s_2 (second state), and s_3 (third state) are arranged in sequence. The probabilistic chain of states (state transition probabilities) consists of the self-transitions a_11, a_22, and a_33 and the transitions to the next state a_12, a_23, and a_34. The combination of phoneme models with the highest likelihood over these state transition sequences is output as the speech recognition result. The set of these phoneme models is the acoustic model.
The appearance probability P(s, O_t) obtained from a state s is given by Equation (2).
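
Equation (2) itself is not reproduced in this text. From the symbol definitions that accompany it (mixture weight c_ms, normal distribution N(O_t; μ_ms, U_ms), and number of distributions M_s in state s), it is presumably the standard mixture sum:

$$
P(s, O_t) = \sum_{m=1}^{M_s} c_{ms}\, N\bigl(O_t;\ \mu_{ms},\ U_{ms}\bigr) \tag{2}
$$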

Here, O_t is the speech feature quantity of frame t, N(O_t; μ_ms, U_ms) is the probability calculated from the normal distribution with mean vector μ_ms and covariance matrix U_ms, c_ms is the mixture weight coefficient, and M_s is the number of distributions belonging to state s. The acoustic likelihood is the sum of the logarithms of this appearance probability P(s, O_t) in each state and of the state transition probabilities described above.
With the approach described in the background art, in which the correction constant ε is added to the distribution of the speech feature quantity, the acoustic likelihood can fluctuate greatly when sudden noise is input, as is clear from the above explanation, and this has been a cause of recognition errors.
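
The two quantities defined above can be sketched as follows: the appearance probability of a state as a weighted mixture of normal densities, and the acoustic likelihood of one state path as the sum of log appearance probabilities and log transition probabilities. The function names are hypothetical, and diagonal covariance matrices are an assumption made to keep the sketch short; the text only says U is a covariance matrix.

```python
import math


def diag_gaussian_pdf(x, mean, var):
    """Normal density with diagonal covariance (assumption); var holds the diagonal."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def appearance_probability(o_t, weights, means, variances):
    """P(s, O_t): sum over mixture components of c_ms * N(O_t; mu_ms, U_ms)."""
    return sum(c * diag_gaussian_pdf(o_t, mu, var)
               for c, mu, var in zip(weights, means, variances))


def path_acoustic_likelihood(appearance_probs, transition_probs):
    """Sum of log appearance and log transition probabilities along one state path."""
    return (sum(math.log(p) for p in appearance_probs)
            + sum(math.log(a) for a in transition_probs))
```

Because the density is exponentiated before summing, this direct form can underflow for high-dimensional features; practical systems usually work in the log domain throughout.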

In contrast to the conventional method, the speech recognition method of the present invention matches the GMM against the speech feature quantity and calculates the GMM likelihood before the speech recognition process. It then determines whether the GMM likelihood is within a predetermined range. If it is, the acoustic likelihood based on the speech feature quantity is obtained in the speech recognition process and used for recognition.
Conversely, if the GMM likelihood is outside the predetermined range, for example because sudden noise has been input, the GMM likelihood obtained from the GMM is used in place of the acoustic likelihood for speech recognition, so the acoustic likelihood does not change greatly.

Therefore, unlike the conventional method, a small difference in appearance probability for a speech feature quantity is neither reversed nor turned into a large likelihood difference. The idea of the present invention thus makes it possible to stabilize the acoustic likelihood and, as a result, to reduce recognition errors. The same idea can also be applied to an acoustic model creation device.
Embodiments of the present invention are described below with reference to the drawings. Identical elements in multiple drawings are given the same reference numerals, and their description is not repeated.

FIG. 3 shows an example functional configuration of the speech recognition device 100 of the present invention, and FIG. 4 shows its operation flow. The speech recognition device 100 includes an A/D conversion unit 10, a feature analysis unit 20, an acoustic model parameter memory 40, a GMM likelihood calculation unit 60, a GMM likelihood determination unit 70, a speech recognition processing unit 80, a language model parameter memory 50, and a control unit 90. The speech recognition device 100 is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute the program.

The speech recognition device 100 differs from the conventional speech recognition device 900 in that it includes the GMM likelihood calculation unit 60 and the GMM likelihood determination unit 70. The operation of the speech recognition processing unit 80 also differs from that of the conventional speech recognition processing unit 30. The rest of the functional configuration is the same as in the speech recognition device 900. The following description focuses on the differences.

The A/D conversion unit 10 converts the input analog speech signal into a discrete digital signal at a sampling frequency of, for example, 16 kHz (step S10, FIG. 4). The feature analysis unit 20 takes the discretized digital speech signal as input and calculates the speech feature quantity O_t for each frame consisting of a predetermined number of samples (for example, 20 ms per frame) (step S20).
The GMM likelihood calculation unit 60 matches the GMM against the speech feature quantity O_t and calculates the GMM likelihood for each frame (step S60). The GMM is a Gaussian mixture model trained on all phonemes in the training data of the acoustic model (in some cases excluding silence). In this example, the GMM is recorded in the acoustic model parameter memory 40.

FIG. 5 schematically shows how the GMM likelihood is obtained by matching the GMM against the speech feature quantity O_t. FIG. 5 assumes that the distribution of the GMM likelihood is close to a normal distribution. The horizontal axis is the speech feature quantity O_t, and the vertical axis is the GMM likelihood.
The GMM likelihood is calculated as the logarithm of the appearance probability obtained by Equation (2) above. In this case, the GMM is represented, as in Equation (2), by mixture weight coefficients c_ms, mean vectors μ_ms, and covariance matrices U_ms. As shown in FIG. 5, the GMM likelihoods corresponding to speech feature quantities such as y_s and y_o are calculated.
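
A minimal sketch of the per-frame GMM likelihood as the logarithm of the mixture probability. The function names are hypothetical, diagonal covariances are an assumption, and the log-sum-exp evaluation is an implementation choice for numerical stability that the text does not mention.

```python
import math


def log_diag_gaussian(x, mean, var):
    """Log density of a normal distribution with diagonal covariance (assumption)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(o_t, weights, means, variances):
    """log sum_m c_m * N(o_t; mu_m, U_m), evaluated with log-sum-exp."""
    terms = [math.log(c) + log_diag_gaussian(o_t, mu, var)
             for c, mu, var in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

Working in the log domain keeps the result finite even for frames far from every mixture component, which are exactly the frames the subsequent range check is meant to catch.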

The GMM likelihood determination unit 70 determines whether the GMM likelihood is within a predetermined range and outputs the determination result (step S70). The predetermined range is, for example, the range from the minimum to the maximum of the GMM likelihood distribution shown in FIG. 5, that is, the range between the lower and upper limits of the GMM likelihood for the speech data used in training. The GMM likelihood determination unit 70 can thus filter out frames affected by sudden noise or other sounds not seen in training.
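
A minimal sketch of this range check, assuming the per-frame GMM likelihoods of the training data are available as a list. The function names are hypothetical, and treating the boundary values themselves as in range is an assumption.

```python
def likelihood_range(training_likelihoods):
    """Lower and upper limits: minimum and maximum GMM likelihood seen in training."""
    return min(training_likelihoods), max(training_likelihoods)


def is_in_range(gmm_likelihood, lower, upper):
    """True when the frame's GMM likelihood lies within [lower, upper] (inclusive, assumed)."""
    return lower <= gmm_likelihood <= upper
```

For example, with training likelihoods of -12.5, -8.0, and -10.1, a frame scoring -9.0 is accepted and a frame scoring -30.0 is flagged as out of range.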

When the GMM likelihood is within the predetermined range (Y in step S70), the speech recognition processing unit 80 obtains the acoustic likelihood corresponding to the speech feature quantity O_t (step S801) and performs speech recognition based on that acoustic likelihood (step S802). When the GMM likelihood is outside the predetermined range (N in step S70), it performs speech recognition using the GMM likelihood in place of the acoustic likelihood (step S803).
The above operations are repeated until all frames have been processed (N in step S90). The control unit 90 controls the operation of each unit of the speech recognition device 100 and the repetition.
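
The branch in steps S70 and S801 to S803 can be sketched per frame as follows. This assumes, as one possible reading that the text does not spell out, that for an out-of-range frame every HMM state receives the same GMM likelihood as its score. The function name and the dictionary layout are hypothetical.

```python
def frame_scores(state_likelihoods, gmm_likelihood, lower, upper):
    """Per-state scores passed to the decoder for one frame.

    state_likelihoods: dict mapping a state name to its acoustic likelihood.
    In range: the acoustic likelihoods are used as they are.
    Out of range: every state gets the GMM likelihood instead (assumed reading).
    """
    if lower <= gmm_likelihood <= upper:
        return dict(state_likelihoods)
    return {state: gmm_likelihood for state in state_likelihoods}
```

Under this reading an out-of-range frame contributes the same score to every hypothesis, so it cannot change their ranking.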

The speech recognition device 100 uses the GMM likelihood obtained from the speech feature quantity O_t and the GMM to determine whether the speech feature quantity O_t deviates greatly from the set of features seen in training. When a speech feature quantity O_t that is not covered by the training data, such as sudden noise, is input, the acoustic likelihood is replaced with the GMM likelihood for speech recognition. The acoustic likelihood can therefore be stabilized even when sudden noise or the like is input, and as a result the occurrence of recognition errors can be suppressed.

The out-of-range determination may be made only for GMM likelihoods at or below the lower limit of the GMM likelihood distribution in FIG. 5, or, as described above, for GMM likelihoods at or above the upper limit as well as at or below the lower limit; either is acceptable. When the GMM likelihood determination unit 70 also checks the upper limit, the acoustic likelihood of a frame exceeding the upper limit is likewise replaced by its GMM likelihood. Because that GMM likelihood is obtained from a broad GMM covering almost all phonemes, its value does not vary greatly, so the likelihood does not become unstable.

The predetermined range has been described as the range between the lower and upper limits of the likelihood over all speech feature quantities used in training, but the invention is not limited to this example. For example, based on the mean μ and standard deviation σ of the GMM likelihood, obtained in advance by assuming that the distribution of the GMM likelihood during acoustic model training is normal, a predetermined range setting means 601 provided in the GMM likelihood calculation unit 60 may calculate and set the predetermined range as μ ± 2σ (upper limit = μ + 2σ, lower limit = μ − 2σ). In this way, the predetermined range can be set to an arbitrary range derived from the GMM likelihood values observed in training. Alternatively, an arbitrary predetermined range may be set in advance from the mean μ and standard deviation σ of the GMM likelihood and recorded in the acoustic model parameter memory, in which case the predetermined range setting means 601 is unnecessary.
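
A minimal sketch of the μ ± 2σ setting, assuming the GMM likelihoods of the training frames are available as a list. The function name and the parameter k are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
import statistics


def range_from_mean_std(training_likelihoods, k=2.0):
    """Return (lower, upper) = (mu - k*sigma, mu + k*sigma); k = 2 follows the text."""
    mu = statistics.fmean(training_likelihoods)
    sigma = statistics.pstdev(training_likelihoods)  # population std (assumption)
    return mu - k * sigma, mu + k * sigma
```

Exposing k makes it easy to try a wider or narrower band than the 2σ example.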

The GMM likelihood determination unit 70 may also include an upper/lower limit setting means 701 that sets the GMM likelihood of a frame outside the predetermined range to the predetermined upper or lower limit. In other words, the GMM likelihood may be clamped to the limits. Clamping narrows the range of likelihood values further.
FIG. 6 shows the operation flow of a speech recognition device 100′ that clamps GMM likelihoods outside the predetermined range to the upper or lower limit. Apart from the GMM likelihood determination process (step S70), it is the same as the speech recognition device 100.
The GMM likelihood determination unit 70′ determines whether the GMM likelihood is within the predetermined range (step S701). If it is (Y in step S701), the operations from step S801 onward are performed, in which the acoustic likelihood is obtained from the speech feature quantity and speech recognition is carried out.

If the GMM likelihood is outside the predetermined range, it is determined whether the GMM likelihood is at or below the lower limit (step S702) or at or above the upper limit (step S704). Based on the result, the upper/lower limit setting means 701 sets the GMM likelihood to the lower limit or the upper limit and outputs it to the speech recognition processing unit 80 (steps S703 and S705).
Although FIG. 6 shows an example in which both limits are applied, only one of the upper and lower limits may be applied.
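
Steps S702 to S705 amount to clamping the GMM likelihood to the limits. A minimal sketch, with a hypothetical function name:

```python
def clamp_gmm_likelihood(gmm_likelihood, lower, upper):
    """Clamp a GMM likelihood to [lower, upper], following steps S702 to S705."""
    if gmm_likelihood <= lower:   # S702: at or below the lower limit
        return lower              # S703: set to the lower limit
    if gmm_likelihood >= upper:   # S704: at or above the upper limit
        return upper              # S705: set to the upper limit
    return gmm_likelihood         # in range: unchanged
```

To apply only one of the limits, as the text allows, the corresponding branch can simply be omitted.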

FIG. 7 shows an example functional configuration of the acoustic model creation device 200 of the present invention, and FIG. 8 shows its operation flow. The acoustic model creation device 200 includes a feature analysis unit 20, a GMM likelihood calculation unit 60, an acoustic model parameter memory 40, a GMM likelihood determination unit 70, a learning processing unit 90, a post-learning acoustic model memory 95, and a control unit 96. The acoustic model creation device 200 is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute the program.

The feature analysis unit 20, the GMM likelihood calculation unit 60, and the GMM likelihood determination unit 70 of the acoustic model creation device 200 are the same as those of the speech recognition device 100.
The learning processing unit 90 takes the learning labels, the speech feature quantities, the GMM likelihoods, and the determination results as input. For frames whose GMM likelihood is within the predetermined range, it associates the speech feature quantity with its learning label and performs acoustic model learning (step S90). Frames outside the predetermined range (N in step S70) are excluded from the calculation of acoustic model statistics and discarded as abnormal frames, and processing moves on to the next frame (step S98).
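
The frame selection performed before the statistics calculation can be sketched as follows. The tuple layout (features, label, GMM likelihood) and the function name are hypothetical, and the actual accumulation of acoustic model statistics is outside the scope of this sketch.

```python
def select_training_frames(frames, lower, upper):
    """Keep in-range frames for learning and count the discarded abnormal frames.

    frames: iterable of (features, label, gmm_likelihood) tuples.
    Returns (kept, discarded), where kept is a list of (features, label) pairs.
    """
    kept = []
    discarded = 0
    for features, label, gmm_likelihood in frames:
        if lower <= gmm_likelihood <= upper:
            kept.append((features, label))  # used for acoustic model learning
        else:
            discarded += 1                  # dropped as an abnormal frame
    return kept, discarded
```

Only the kept pairs would then be passed to the learning step, so a noise-corrupted frame never contributes to any state's statistics.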

The above operations are repeated until all frames have been processed (N in step S97). The control unit 96 controls the operation of each unit of the acoustic model creation device 200 and the repetition. When the predetermined range is set from the mean μ and standard deviation σ of the GMM likelihood, the mean μ and standard deviation σ updated by learning are recorded in the post-learning acoustic model memory 95. The learning processing unit 90 may also update the predetermined range in step with the updated mean μ and standard deviation σ and record it in the post-learning acoustic model memory 95. Likewise, when the predetermined range is set from the upper and lower limits of the GMM likelihood, those limits are recorded in the post-learning acoustic model memory 95, and the learning processing unit 90 may update the predetermined range in step with the updated limits and record it in the post-learning acoustic model memory 95.

The acoustic model creation device 200 also includes the GMM likelihood calculation unit 60 and the GMM likelihood determination unit 70 and trains the acoustic model with frames outside the predetermined range excluded, so the acoustic model can be created without being affected by sudden noise or the like. This makes it possible to create a cleaner, more accurate acoustic model.

以上説明した音声認識装置１００によれば、殆どの音素を包含し、最も分散が広くなるＧＭＭから求めたＧＭＭ尤度を、所定範囲外のフレームの音響尤度に代用するので、突発性雑音等が入力されても音響尤度が大きく変化しない。つまり、音響尤度を安定化することが出来る。また、所定の範囲は、学習時のＧＭＭのＧＭＭ尤度に基づいて決められるので、その範囲を決定するための開発用データが不要である。 According to the speech recognition apparatus 100 described above, the GMM likelihood obtained from the GMM, which covers most phonemes and has the widest variance, is substituted for the acoustic likelihood of frames outside the predetermined range, so the acoustic likelihood does not change greatly even if sudden noise or the like is input. That is, the acoustic likelihood can be stabilized. Further, since the predetermined range is determined based on the GMM likelihood of the GMM at the time of learning, no development data is needed to determine the range.
また、音響モデル作成装置２００によれば、学習時に異常フレームを除去するので、異常な分布が生成される可能性を低減することが出来る。よって、より精度の高い音響モデルの作成を可能にする。 Moreover, according to the acoustic model creation device 200, abnormal frames are removed at the time of learning, which reduces the possibility that an abnormal distribution is generated. This makes it possible to create a more accurate acoustic model.
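
The substitution performed on the recognition side can be sketched in a few lines of Python. This is a hedged illustration only: the function name `stabilized_acoustic_scores`, the dictionary of per-state scores, and the choice to give every state the same substituted value are assumptions made for the sketch, since this passage says only that the GMM likelihood is used in place of the acoustic likelihood for out-of-range frames.

```python
def stabilized_acoustic_scores(state_scores, gmm_likelihood, lower, upper):
    """Return the acoustic scores to use for one frame.

    state_scores: mapping from a model state to its acoustic likelihood
                  for this frame (hypothetical representation).
    For an in-range frame the scores are used unchanged. For an
    out-of-range frame every state receives the GMM likelihood instead,
    so an abnormal frame does not favor or penalize any particular state.
    """
    if lower <= gmm_likelihood <= upper:
        return dict(state_scores)
    return {state: gmm_likelihood for state in state_scores}


if __name__ == "__main__":
    scores = {"a": -55.0, "i": -70.0}
    print(stabilized_acoustic_scores(scores, -60.0, -70.0, -50.0))
    print(stabilized_acoustic_scores(scores, -200.0, -70.0, -50.0))
```

With a uniform score on abnormal frames, the decision for those frames is left to the surrounding frames and the language model rather than to the noise itself.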

この発明の方法及び装置は上述の実施形態に限定されるものではなく、この発明の趣旨を逸脱しない範囲で適宜変更が可能である。例えば、ＧＭＭは、無音データを含めて、無音データも学習対象にしても良いし、又は音声の特徴量のみを記録させ、音声区間のみを学習対象にしても良い。音声区間のみを学習対象にする場合には、音響モデルパラメータメモリ４０に、音声認識の前処理の音声区間検出等の用途でも利用される音響モデルをそのまま用いることが可能である。 The method and apparatus of the present invention are not limited to the above-described embodiments and can be modified as appropriate without departing from the spirit of the invention. For example, the GMM may be trained on data that includes silence, so that silence data is also a learning target, or only speech feature amounts may be recorded so that only speech sections are learning targets. When only speech sections are learning targets, an acoustic model that is also used for purposes such as speech section detection in speech recognition preprocessing can be used as it is in the acoustic model parameter memory 40.

なお、上記方法及び装置において説明した処理は、記載の順に従って時系列に実行されるのみならず、処理を実行する装置の処理能力あるいは必要に応じて並列的にあるいは個別に実行されるとしてもよい。 Note that the processes described for the above method and apparatus need not be executed only in time series in the order described; they may also be executed in parallel or individually, according to the processing capability of the apparatus that executes them or as needed.
また、上記装置における処理手段をコンピュータによって実現する場合、各装置が有すべき機能の処理内容はプログラムによって記述される。そして、このプログラムをコンピュータで実行することにより、各装置における処理手段がコンピュータ上で実現される。 Further, when the processing means in the above apparatus are realized by a computer, the processing contents of the functions that each apparatus should have are described by a program. By executing this program on the computer, the processing means in each apparatus are realized on the computer.

この処理内容を記述したプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体としては、例えば、磁気記録装置、光ディスク、光磁気記録媒体、半導体メモリ等どのようなものでもよい。具体的には、例えば、磁気記録装置として、ハードディスク装置、フレキシブルディスク、磁気テープ等を、光ディスクとして、ＤＶＤ（Digital Versatile Disc）、ＤＶＤ−ＲＡＭ（Random Access Memory）、ＣＤ−ＲＯＭ（Compact Disc Read Only Memory）、ＣＤ−Ｒ（Recordable）/ＲＷ（ReWritable）等を、光磁気記録媒体として、ＭＯ（Magneto Optical disc）等を、半導体メモリとしてＥＥＰ−ＲＯＭ（Electronically Erasable and Programmable-Read Only Memory）等を用いることができる。 The program describing these processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disk; an MO (Magneto Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

また、このプログラムの流通は、例えば、そのプログラムを記録したＤＶＤ、ＣＤ−ＲＯＭ等の可搬型記録媒体を販売、譲渡、貸与等することによって行う。さらに、このプログラムをサーバコンピュータの記録装置に格納しておき、ネットワークを介して、サーバコンピュータから他のコンピュータにそのプログラムを転送することにより、このプログラムを流通させる構成としてもよい。 The program is distributed by, for example, selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a recording device of a server computer and distributed by transferring it from the server computer to another computer via a network.
また、各手段は、コンピュータ上で所定のプログラムを実行させることにより構成することにしてもよいし、これらの処理内容の少なくとも一部をハードウェア的に実現することとしてもよい。 Each means may be configured by executing a predetermined program on a computer, or at least a part of these processing contents may be realized in hardware.

Claims

A speech recognition apparatus comprising:
a feature amount analysis unit that analyzes the speech feature amount of an input speech digital signal in units of frames;
a GMM likelihood calculation unit that compares a GMM (Gaussian Mixture Model) with the speech feature amount and calculates a GMM likelihood for each frame;
a GMM likelihood determination unit that determines whether the GMM likelihood is within a predetermined range and outputs the determination result; and
a speech recognition processing unit that receives the speech feature amount, the GMM likelihood, and the determination result as input, performs speech recognition processing based on the acoustic likelihood corresponding to the speech feature amount for frames within the predetermined range, and performs speech recognition processing with an acoustic likelihood that uses the GMM likelihood for frames outside the predetermined range.

The speech recognition apparatus according to claim 1,
wherein the predetermined range is a likelihood distribution range of a GMM (Gaussian Mixture Model) of a learned acoustic model.

The speech recognition apparatus according to claim 1,
wherein the GMM likelihood calculation unit comprises
predetermined range setting means for setting the predetermined range, taking as input the average value μ of the GMM likelihood and the standard deviation σ of the GMM likelihood.

A speech recognition method comprising:
a feature amount analysis process in which a feature amount analysis unit analyzes the speech feature amount of an input speech digital signal in units of frames;
a GMM likelihood calculation process in which a GMM likelihood calculation unit compares a GMM (Gaussian Mixture Model) with the speech feature amount and calculates a GMM likelihood for each frame;
a GMM likelihood determination process in which a GMM likelihood determination unit determines whether the GMM likelihood is within a predetermined range and outputs the determination result; and
a speech recognition process in which a speech recognition processing unit receives the speech feature amount, the GMM likelihood, and the determination result as input, performs speech recognition processing based on the acoustic likelihood corresponding to the speech feature amount for frames within the predetermined range, and performs speech recognition processing with an acoustic likelihood that uses the GMM likelihood for frames outside the predetermined range.

An acoustic model creation device comprising:
a feature amount analysis unit that analyzes the speech feature amount of an input speech digital signal in units of frames;
a GMM likelihood calculation unit that compares a GMM (Gaussian Mixture Model) with the speech feature amount and calculates a GMM likelihood for each frame;
a GMM likelihood determination unit that determines whether the GMM likelihood is within a predetermined range and outputs the determination result and the GMM likelihood; and
a learning processing unit that receives a learning label, the speech feature amount, the GMM likelihood, and the determination result as input, performs acoustic model learning processing based on the speech feature amount for frames within the predetermined range, excludes frames outside the predetermined range from the calculation of acoustic model statistics, and generates a post-learning acoustic model.

The acoustic model creation device according to claim 5,
wherein the predetermined range is a likelihood distribution range of a GMM (Gaussian Mixture Model) of a learned acoustic model.

The acoustic model creation device according to claim 5,
wherein the GMM likelihood calculation unit comprises
predetermined range setting means for setting the predetermined range, taking as input the average value μ of the GMM likelihood and the standard deviation σ of the GMM likelihood.

An acoustic model creation method comprising:
a feature amount analysis process in which a feature amount analysis unit analyzes the speech feature amount of an input speech digital signal in units of frames;
a GMM likelihood calculation process in which a GMM likelihood calculation unit compares a GMM (Gaussian Mixture Model) with the speech feature amount and calculates a GMM likelihood for each frame;
a GMM likelihood determination process in which a GMM likelihood determination unit determines whether the GMM likelihood is within a predetermined range and outputs the determination result and the GMM likelihood; and
a learning process in which a learning processing unit receives a learning label, the speech feature amount, the GMM likelihood, and the determination result as input, performs acoustic model learning processing based on the speech feature amount for frames within the predetermined range, excludes frames outside the predetermined range from the calculation of acoustic model statistics, and generates a post-learning acoustic model.

An apparatus program for causing a computer to function as the apparatus according to any one of claims 1 to 3 or the device according to any one of claims 5 to 7.

A computer-readable recording medium on which the apparatus program according to claim 9 is recorded.