JP4860962B2

JP4860962B2 - Speech recognition apparatus, speech recognition method, and program

Info

Publication number: JP4860962B2
Application number: JP2005244491A
Authority: JP
Inventors: 誠庄境; 豪秀奈木野
Original assignee: Asahi Kasei Corp
Current assignee: Asahi Kasei Corp
Priority date: 2004-08-26
Filing date: 2005-08-25
Publication date: 2012-01-25
Anticipated expiration: 2025-08-25
Also published as: JP2006091864A

Description

The present invention relates to a speech recognition apparatus, a speech recognition method, and a program for performing speech recognition in an environment where noise occurs.

In general, a speech recognizer consists of two parts: an acoustic analysis unit, which converts speech samples uttered by a speaker into a sequence of feature parameters, and a speech matching unit, which compares that sequence with information on the feature parameters of vocabulary words stored in advance in a storage device such as memory or a hard disk, and outputs the speech with the highest similarity as the recognition result. Known acoustic analysis methods for converting speech samples into a sequence of feature parameters include cepstrum analysis and linear prediction analysis; these are described in detail in, for example, Non-Patent Document 1. The MFCC (Mel-Frequency Cepstrum Coefficient), a cepstrum obtained from the logarithmic power spectrum of a filter bank arranged on the mel scale, has been proposed and is widely used as a feature parameter in speech recognition. For creating the information on the feature parameters of vocabulary words, and for matching that information against the feature parameter sequence converted from input speech, methods based on the Hidden Markov Model (HMM) are generally used. In the HMM-based method, speech units such as syllables, demisyllables, phonemes, biphones, and triphones are modeled by HMMs. These models are generally called acoustic models. The creation of acoustic models is described in detail in Sec. 6.4 of Non-Patent Document 1. Using the Viterbi algorithm described in Sec. 6.4 of Non-Patent Document 1, a person skilled in the art can readily build a speech recognition apparatus.

The HMM and the Viterbi algorithm are described concretely below. An HMM is defined by a set of six parameters M = (S, Y, A, B, π, F).
S: finite set of states, S = [s_i]
Y: set of output symbols
A: set of state transition probabilities, A = [a_ij], where a_ij is the transition probability from state s_i to state s_j
B: set of output probabilities, B = [b_j(x)], where b_j(x) is the probability of outputting the output symbol x on a transition into state s_j
π: set of initial state probabilities, π = [π_i], where π_i is the probability that the initial state is s_i

F: set of final states
The conventional Viterbi algorithm is then defined through the forward probability f(i, t) as follows.
Step 0:

Step 1: Repeat Step 1.1 for t = 1, 2, ..., T.
Step 1.1: Repeat Steps 1.1.1 to 1.1.3 for all i.
Step 1.1.1:

Step 1.1.2:

Step 1.1.3:

Here, the symbol on the right-hand side of Equation (1-5) is an operator that concatenates the state transition sequence on its left with the state on its right to produce a new state transition sequence.
Step 2:
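The equations belonging to Steps 0 to 2 do not survive in this text, so the following is only a minimal sketch of a standard Viterbi search that is consistent with the definitions of S, A, B, π, and F given above. It assumes that the initial state emits nothing and that a symbol is emitted on each transition into a state. The function name, the argument layout, and the toy model are illustrative assumptions, not the patent's own formulation.

```python
def viterbi(pi, a, b, final_states, observations):
    """Return (best_score, best_path) for a discrete-output HMM.

    pi[i]   : probability that the initial state is state i
    a[i][j] : transition probability from state i to state j
    b[j][x] : probability of emitting symbol x on a transition into state j
    """
    n = len(pi)
    # Step 0 (assumed initialisation): f(i, 0) = pi[i], path = [i].
    f = list(pi)
    paths = [[i] for i in range(n)]
    # Step 1: consume the observations x_1 .. x_T one frame at a time.
    for x in observations:
        new_f, new_paths = [], []
        for i in range(n):
            # Best predecessor j of state i at the previous time step.
            j = max(range(n), key=lambda k: f[k] * a[k][i])
            # Forward probability of state i after emitting x.
            new_f.append(f[j] * a[j][i] * b[i][x])
            # Concatenate the best partial state sequence with state i.
            new_paths.append(paths[j] + [i])
        f, paths = new_f, new_paths
    # Step 2: choose the best-scoring final state.
    best = max(final_states, key=lambda i: f[i])
    return f[best], paths[best]


# Toy two-state left-to-right model (hypothetical numbers).
pi = [1.0, 0.0]
a = [[0.5, 0.5], [0.0, 1.0]]
b = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
score, path = viterbi(pi, a, b, final_states=[1], observations=["a", "b", "b"])
```

The returned path includes the initial state, so it has one more element than the observation sequence.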

FIG. 11 shows the minimum configuration of the left-to-right HMM acoustic model commonly used in speech recognition. It consists of two states, state i-1 and state i. The probability of a transition from state i-1 to state i is a_i-1,i, and the probability of a self-loop from state i to state i is a_ii. State i has the output probability b_i(x). The forward probabilities f(i-1, t-1) and f(i, t-1) of state i-1 and state i at time t-1 have already been computed sequentially by the Viterbi algorithm.
As shown in FIG. 12, in the conventional Viterbi algorithm the forward probability f(i, t) of state i at time t is then obtained from Equations (1-3) and (1-4).

In speech recognition using HMMs, the recognition result is the word sequence W corresponding to the HMM, with normal-distribution mixture number M, for which the Viterbi score calculated by Equation (1-7) is largest. If log f(i, t) is used in the Viterbi algorithm instead of f(i, t), multiplications are replaced by additions, which simplifies the computation and also avoids the underflow that arises from repeatedly multiplying probability values between 0 and 1.
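For the two-state configuration of FIG. 11, the update can be written out as follows. This is a hedged reconstruction from the surrounding definitions (a_i-1,i, a_ii, b_i(x), and the forward probabilities at time t-1); the exact form of Equations (1-3) and (1-4) is not available in this text, so the expressions below are an assumption based on the standard Viterbi recursion. The second line is the log-domain form mentioned above, in which products become sums.

```latex
\begin{align}
f(i,t) &= \max\bigl\{\, f(i-1,t-1)\,a_{i-1,i},\;\; f(i,t-1)\,a_{ii} \,\bigr\}\; b_i(x_t) \\
\log f(i,t) &= \max\bigl\{\, \log f(i-1,t-1) + \log a_{i-1,i},\;\; \log f(i,t-1) + \log a_{ii} \,\bigr\} + \log b_i(x_t)
\end{align}
```

Because the logarithm is monotonic, both forms select the same predecessor state.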

To reduce the amount of computation in Viterbi matching, a technique called pruning is also widely used, in which words whose Viterbi score is markedly lower than the others are discarded partway through the matching.
Meanwhile, the Gaussian Mixture Model (GMM) is a known means of modeling noise. A GMM can be regarded as a one-state HMM with no state transitions.
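Since a GMM behaves like a one-state HMM, its output probability for a feature vector is just a weighted sum of Gaussian densities. The sketch below evaluates that quantity in the log domain. It assumes diagonal covariances, which is a common choice but is not stated in the text; the function name and arguments are illustrative.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Log output probability of feature vector x under a diagonal-covariance GMM.

    weights[k]   : mixture weight of component k (weights sum to 1)
    means[k]     : mean vector of component k
    variances[k] : per-dimension variances of component k
    """
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for xd, md, vd in zip(x, mu, var):
            # Log density of a one-dimensional Gaussian, summed over dimensions.
            log_p += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
        component_logs.append(log_p)
    # Log-sum-exp over components, shifted by the maximum for numerical stability.
    m = max(component_logs)
    return m + math.log(sum(math.exp(c - m) for c in component_logs))
```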

The following are known means of improving the noise robustness of a speech recognition system:
Conventional method 1) Synthesizing a noise-superimposed speech HMM acoustic model, that is, an HMM acoustic model of speech with noise superimposed on it, from a clean-speech HMM acoustic model and a noise GMM acoustic model
Conventional method 2) Subtracting an estimate of the noise spectrum from the input spectrum of each frame, or multiplying the input spectrum by an estimate of speech spectrum / (speech spectrum + noise spectrum)
Conventional method 3) Training a noise-superimposed speech HMM acoustic model on speech data made by superimposing noise on clean speech
Conventional method 4) Combining conventional methods 2) and 3)
For conventional method 1, the HMM synthesis methods described in Non-Patent Documents 2 and 3 are well known. For conventional method 2, the spectral subtraction method described in Non-Patent Document 4 and the minimum mean square error (MMSE) method described in Non-Patent Document 5 are well known.
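The two variants of conventional method 2 can be sketched per frame as below. This is a simplified illustration operating on power spectra held as plain lists. The spectral floor in the first function is an assumption added to keep the result positive, and the function names are hypothetical; neither is taken from the cited documents.

```python
def spectral_subtraction(frame_power, noise_power, floor=0.01):
    """Subtract the noise power estimate from each frequency bin.

    Bins that would fall below floor * input power are clamped to that
    floor (an assumed safeguard against negative power values).
    """
    return [max(p - n, floor * p) for p, n in zip(frame_power, noise_power)]


def gain_weighting(frame_power, speech_power_est, noise_power_est):
    """Multiply each bin by the estimated ratio speech / (speech + noise)."""
    return [
        p * s / (s + n)
        for p, s, n in zip(frame_power, speech_power_est, noise_power_est)
    ]
```

Both functions rely on a noise estimate that is assumed to be valid for the current frame, which is why this family of methods suits stationary noise better than sudden, short-lived noise.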

However, in conventional method 1 the number of synthesized acoustic models is proportional to the number of noise types, so the required computation and memory become enormous when the many kinds of noise found in real environments must be handled. For example, when building a Japanese speech recognition system in which the 54 phonemes shown in FIG. 9 are modeled by HMMs, the system holds 54 clean phoneme HMMs created from clean speech. If there are 33 types of noise, and therefore 33 noise GMM acoustic models, then 54 × 33 = 1782 noise-superimposed phoneme HMM acoustic models must be created. The acoustic model size is thus 33 times that of a single set of clean phoneme HMM acoustic models, and the resulting computation and memory requirements make the approach difficult to realize.

Conventional method 2, on the other hand, is effective against stationary noise, but it has the critical problem that a marked drop in the recognition rate cannot be avoided when non-stationary noise is present, such as a door closing, a cup clinking, or a curtain being drawn. The need to improve speech recognition performance in real environments, where non-stationary noise occurs quite naturally, has therefore become apparent.

Conventional method 3 achieves better recognition performance than conventional methods 1 and 2. To train the noise-superimposed speech HMM acoustic model, however, speech data with noise superimposed on clean speech must be created for each noise, so the computation and memory needed to create the data are proportional to the number of noise types. For example, when building a Japanese speech recognition system in which the 54 phonemes shown in FIG. 9 are modeled by HMMs, 33 types of noise require 33 times as much noise-superimposed speech data as is needed to create only the clean phoneme HMMs. The required computation and memory are enormous, which makes the approach difficult to realize.

Conventional method 4 is more robust against noise than conventional methods 1, 2, and 3, but it inherits both of their problems: it is not robust against non-stationary noise, and its computation and memory requirements are proportional to the number of noise types.

The marked degradation of recognition performance in the presence of non-stationary noise is thought to occur because the superimposed noise greatly distorts the feature parameter values of the speech samples. Consider, for example, a case in which glass happens to break while the word "rajio" (radio) is being uttered, so that the crashing noise "gachan" is superimposed on the speech. FIG. 13 shows an example of the waveform of "rajio" with the noise "gachan" superimposed. The time boundaries obtained when "rajio" is decomposed into the phonemes shown in FIG. 9 are shown below the waveform. If the noise "gachan" occurs at the moment the vowel "i" of the syllable "ji" in "rajio" is being uttered, the noise is superimposed over the time interval marked by the dashed arrow. In the intermediate result of Viterbi recognition at time T1, the correct word "rajio" is ranked first, as shown in FIG. 14. Suppose, however, that the spectral pattern of the noise "gachan" happens to resemble the spectral pattern of the speech "kasei". Then, in the intermediate result of Viterbi recognition at time T2, the correct word "rajio" falls to fifth place, as shown in FIG. 14, and the incorrect word "rakkasei" (peanut) is ranked first instead.

In frames where superimposed noise greatly distorts the speech spectrum in this way, the similarity of the correct word tends to drop sharply, or the similarity of an incorrect word tends to rise sharply. The similarity ranking of the candidate words then changes: the similarity of an incorrect word may exceed that of the correct word, and in the worst case the correct word may be dropped from the candidates altogether by pruning, the well-known technique often used together with the Viterbi algorithm. The result is a marked drop in the recognition rate.
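The worst case described above, where pruning removes the correct word, can be illustrated with a simple beam rule. This sketch assumes log-domain Viterbi scores and a fixed beam width. The scores and the third candidate word "raion" are invented for illustration; only "rajio" and "rakkasei" come from the example in the text.

```python
def beam_prune(scores, beam_width):
    """Keep only the candidates whose score is within beam_width of the best."""
    best = max(scores.values())
    return {word: s for word, s in scores.items() if s >= best - beam_width}


# Hypothetical partial scores just after a noisy frame: the correct word
# "rajio" has fallen far behind the incorrect word "rakkasei".
partial_scores = {"rajio": -120.0, "rakkasei": -80.0, "raion": -95.0}
survivors = beam_prune(partial_scores, beam_width=30.0)
```

Once "rajio" has been pruned it cannot recover, however well the remaining frames match.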

More specifically, in a frame strongly affected by non-stationary noise, the output probability b_i(x_t) used in Equation (1-4) of the conventional Viterbi algorithm described above becomes far smaller than it would be without the noise. As a result, f(i, t) for the correct word becomes smaller than f(i, t) for an incorrect word. Alternatively, if the spectrum of a frame containing non-stationary noise happens to resemble the spectrum of a frame of an incorrect word, the output probability b_i(x_t) of that incorrect word becomes large, and f(i, t) for the incorrect word becomes larger than f(i, t) for the correct word.
If the rank changes caused by non-stationary noise can be suppressed, speech recognition that is robust against non-stationary noise can be realized. Patent Documents 1 to 3 describe techniques that address this problem by performing speech recognition in a way that copes with non-stationary noise.

Non-Patent Document 1: Lawrence Rabiner and Biing-Hwang Juang, "Fundamentals of Speech Recognition," Prentice Hall, 1993.
Non-Patent Document 2: F. Martin, K. Shikano and Y. Minami, "Recognition of Noisy Speech by Composition of Hidden Markov Models," Proc. Eurospeech, Berlin, Germany, pp. 1031-1034, 1993.
Non-Patent Document 3: M. J. F. Gales and S. Young, "Cepstrum Parameter Compensation for HMM Recognition," Speech Communication, Vol. 12, No. 3, pp. 231-239, 1993.
Non-Patent Document 4: S. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction," IEEE Trans. ASSP, Vol. ASSP-27, No. 2, pp. 113-120, 1979.
Non-Patent Document 5: Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean Square Error Short-Time Spectral Amplitude Estimator," IEEE Trans. ASSP, Vol. ASSP-32, No. 6, pp. 1109-1121, 1984.
Patent Document 1: JP 2002-91480 A
Patent Document 2: JP 2003-177781 A
Patent Document 3: JP 2003-280686 A

Patent Document 1 discloses a technique akin to the HMM synthesis method belonging to conventional method 1. It starts from the observation that the method described in Non-Patent Documents 2 and 3, which generates a noise-superimposed speech HMM acoustic model from a speech HMM acoustic model and a noise GMM acoustic model with a single normal distribution, cannot cope with diverse non-stationary noise. Patent Document 1 therefore extends the noise acoustic model to a GMM with a mixture of normal distributions, improving its ability to represent diverse non-stationary noise. However, because the technique belongs to conventional method 1, it cannot escape two problems as the number of assumed noise types grows: the training data size and the acoustic model size become enormous, and so do the time needed to create the acoustic models and the time needed for Viterbi matching.

The technique disclosed in Patent Document 2 is likewise akin to the HMM synthesis method belonging to conventional method 1. Patent Document 2 notes that in Patent Document 1 the noise-superimposed speech HMM acoustic model must be synthesized under the constraint that the signal-to-noise ratio of the input speech is known. It therefore aims to cope with non-stationary noise by generating, from the speech HMM acoustic model and the noise GMM acoustic model, a noise-superimposed speech HMM acoustic model with a multipath structure covering multiple signal-to-noise ratios. However, because this technique also belongs to conventional method 1, it cannot escape two problems as the number of assumed noise types grows: the training data size and the acoustic model size become enormous, and so do the time needed to create the acoustic models and the time needed for Viterbi matching.

The technique disclosed in Patent Document 3 is also akin to the HMM synthesis method belonging to conventional method 1. Patent Document 3 notes that in Non-Patent Document 2 the noise type is identified once per utterance before the noise-superimposed speech is matched against the noise-superimposed speech HMM acoustic model, so that sudden or irregular noise cannot be handled. It therefore divides the utterance into suitable sections, such as frames, selects one noise-superimposed speech HMM acoustic model from several for each section, and performs Viterbi matching with the selected model, with the aim of coping with sudden or irregular noise. However, because this technique also belongs to conventional method 1, it cannot escape two problems as the number of assumed noise types grows: the training data size and the acoustic model size become enormous, and so do the time needed to create the acoustic models and the time needed for Viterbi matching.

The present invention has been made in view of these problems, and its object is to provide a speech recognition apparatus, a speech recognition method, and a program that perform highly accurate speech recognition in a noisy environment while keeping the processing load and memory consumption low.

To solve the above problems, the invention according to claim 1 provides a speech recognition apparatus that recognizes input speech using the Viterbi algorithm, comprising: first storage means for storing a speech HMM (Hidden Markov Model) acoustic model; second storage means for storing a noise GMM (Gaussian Mixture Model) acoustic model; first calculation means for calculating, for each predetermined frame of the input speech and for each state of the speech HMM acoustic model stored in the first storage means, the output probability of the speech HMM acoustic model for the feature parameter of the speech to be recognized; second calculation means for calculating, for each predetermined frame, the output probability of the noise GMM acoustic model stored in the second storage means for the feature parameter; selection means for selecting, for each predetermined frame, the largest output probability from among the output probability calculated by the first calculation means and the output probability calculated by the second calculation means; and matching means for using the output probability selected by the selection means in the Viterbi algorithm as the output probability of the speech HMM acoustic model in the frame for which it was selected, and for recognizing the input speech using that Viterbi algorithm.

With this configuration, the speech recognition apparatus can store the speech HMM acoustic model and the noise GMM acoustic model separately in the first and second storage means without synthesizing them, and can therefore perform speech recognition with a small number of acoustic models. In addition, because the largest output probability selected from among those calculated by the first and second calculation means is used in the Viterbi algorithm as the output probability of the speech HMM acoustic model, a drop in recognition accuracy caused by noise can be prevented. The apparatus can thus perform highly accurate speech recognition in a noisy environment while keeping the processing load and memory consumption low.
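The selection and matching described above can be sketched for a single frame as follows. This is a minimal illustration, assuming log-domain scores and the left-to-right topology of FIG. 11. The function names, argument layout, and numbers are hypothetical and are not taken from the claims.

```python
def select_log_b(speech_log_b, noise_log_b):
    """For each HMM state, take the largest of that state's own output
    log-probability and the output log-probabilities of the noise GMMs."""
    return [max([lp] + list(noise_log_b)) for lp in speech_log_b]


def viterbi_frame_update(prev_log_f, log_a_self, log_a_next, log_b):
    """One frame of a log-domain Viterbi update for a left-to-right HMM.

    prev_log_f[i] : log f(i, t-1)
    log_a_self[i] : log of the self-loop probability of state i
    log_a_next[i] : log of the transition probability from state i to i+1
    log_b[i]      : (selected) output log-probability of state i at time t
    """
    new_log_f = []
    for i, lb in enumerate(log_b):
        stay = prev_log_f[i] + log_a_self[i]
        enter = prev_log_f[i - 1] + log_a_next[i - 1] if i > 0 else float("-inf")
        new_log_f.append(max(stay, enter) + lb)
    return new_log_f


# A noisy frame: state 0 matches the input very poorly (-50.0), but the
# noise GMM explains the frame better (-10.0), so -10.0 is used instead.
log_b = select_log_b([-50.0, -3.0], [-10.0])
log_f = viterbi_frame_update([-1.0, -2.0], [-0.5, -0.5], [-1.0], log_b)
```

Because every poorly matching state is raised to the same noise score, a noise-corrupted frame penalizes all candidate words about equally, which limits the rank changes described earlier.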

The invention according to claim 2 is the speech recognition apparatus according to claim 1, wherein the selection means comprises: first selection means for selecting the largest of the output probabilities calculated by the second calculation means as the maximum noise output probability; and second selection means for selecting the larger of the output probability calculated by the first calculation means and the maximum noise output probability selected by the first selection means.
With this configuration, the output probability is selected in two stages, so the maximum noise output probability becomes known. From it, one can identify which noise occurs frequently and detect the cases in which the output probability of the noise GMM acoustic model is selected in place of that of the speech HMM acoustic model. This is useful when the maximum noise output probability is to be used as information.
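A two-stage selection of this kind might look like the sketch below, which also returns the bookkeeping that the paragraph above describes as useful: which noise model scored highest, and which states had their output probability replaced. Log-domain scores are assumed, and the function name and return layout are hypothetical.

```python
def two_stage_select(speech_log_b, noise_log_b):
    """Stage 1: pick the best-scoring noise GMM for this frame.
    Stage 2: per state, keep the larger of the speech score and that maximum.
    """
    # Stage 1: index and value of the maximum noise output log-probability.
    noise_idx = max(range(len(noise_log_b)), key=lambda k: noise_log_b[k])
    noise_max = noise_log_b[noise_idx]
    # Stage 2: per-state comparison against the single noise maximum.
    selected = [max(lp, noise_max) for lp in speech_log_b]
    replaced = [lp < noise_max for lp in speech_log_b]
    return selected, noise_idx, replaced


selected, noise_idx, replaced = two_stage_select([-50.0, -3.0, -80.0], [-10.0, -12.0])
```

Counting how often each noise_idx occurs over many frames would give a rough picture of which noise is most frequent.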

The invention according to claim 3 is the speech recognition apparatus according to claim 1 or 2, wherein the second storage means stores a single noise GMM acoustic model created by mixing a plurality of types of noise.
With this configuration, the apparatus stores only one noise GMM acoustic model and uses only that one model in its calculations, which reduces the processing load of creating the acoustic model, the memory needed to store it, and the computation needed for recognition.

The invention according to claim 4 is the speech recognition apparatus according to claim 1 or 2, wherein the second storage means stores a noise GMM acoustic model created by superimposing speech on noise at a predetermined signal-to-noise ratio.
With this configuration, the apparatus can perform speech recognition based on a noise GMM acoustic model that is close to real conditions, because it is created by superimposing speech on noise at a predetermined signal-to-noise ratio, and recognition accuracy can be improved.

The invention according to claim 5 is the speech recognition apparatus according to any one of claims 1, 2, and 4, wherein the second storage means stores two or more noise GMM acoustic models, obtained by classifying noise GMM acoustic models according to the mutual distances between them and re-creating, for each class, a noise GMM acoustic model from the noise data from which the models in that class were originally created.
With this configuration, the apparatus can classify a plurality of noises into sets of similar noises and perform recognition using two or more noise GMM acoustic models, one created for each class, achieving highly accurate recognition while reducing the processing load.

The invention according to claim 6 provides a speech recognition method that performs speech recognition using the Viterbi algorithm, comprising: a first calculation step of calculating, for each predetermined frame of the speech to be recognized and for each state of a speech HMM acoustic model, the output probability of the speech HMM acoustic model for the feature parameter of the speech to be recognized; a second calculation step of calculating, for each predetermined frame, the output probability of a noise GMM acoustic model for the feature parameter; a selection step of selecting, for each predetermined frame, the largest output probability from among the output probability calculated in the first calculation step and the output probability calculated in the second calculation step; and a matching step of using the output probability selected in the selection step in the Viterbi algorithm as the output probability of the speech HMM acoustic model in the frame for which it was selected, and recognizing the speech to be recognized using that Viterbi algorithm.

With this method, the largest output probability selected from among those calculated in the first and second calculation steps is used in the Viterbi algorithm as the output probability of the speech HMM acoustic model, so recognition accuracy is not degraded by noise. A highly accurate speech recognition method can therefore be realized.

The invention according to claim 7 provides a program for causing a computer to execute: a first calculation step of calculating, for each predetermined frame of the speech to be recognized and for each state of a speech HMM acoustic model, the output probability of the speech HMM acoustic model for the feature parameters of the speech to be recognized; a second calculation step of calculating, for each predetermined frame, the output probability of a noise GMM acoustic model for the feature parameters; a selection step of selecting, for each predetermined frame, the maximum output probability from among the output probability calculated in the first calculation step and the output probability calculated in the second calculation step; and a matching step of using the output probability selected in the selection step in the Viterbi algorithm as the output probability of the speech HMM acoustic model in the frame for which that output probability was selected, and performing speech recognition of the speech to be recognized using the Viterbi algorithm.

According to the present invention, the speech recognition apparatus can store the speech HMM acoustic model and the noise GMM acoustic model separately in the first storage means and the second storage means, and can perform speech recognition with a small number of acoustic models. In addition, because the apparatus uses the maximum output probability selected from among the output probabilities calculated by the first calculation means and the second calculation means as the output probability of the speech HMM acoustic model in the Viterbi algorithm, it can prevent recognition accuracy from degrading under the influence of noise. The speech recognition apparatus can therefore achieve highly accurate speech recognition while keeping processing load and memory consumption low.

Embodiments of the present invention are described below with reference to the drawings. The speech recognition apparatus 10 according to an embodiment of the present invention has the hardware configuration of a computer. That is, the speech recognition apparatus 10 includes a CPU (Central Processing Unit), a storage unit comprising a RAM (Random Access Memory), a ROM (Read Only Memory), and a hard disk device, and an input/output interface (none of which are shown). Various kinds of software are stored in the storage unit of the speech recognition apparatus 10. For example, the acoustic models of speech and noise are stored in the hard disk device. The hard disk device also stores a program for performing arithmetic processing according to the Viterbi algorithm of the present invention described below.
The Viterbi algorithm of the present invention is defined as follows, where K is the number of types of non-stationary noise and c_k(x) denotes the output probability distribution of the GMM acoustic model of non-stationary noise k (k = 1, 2, ..., K).
Step 0':

Step 1': Repeat step 1.1' for t = 1, 2, ..., T.
Step 1.1': Repeat steps 1.1.1' to 1.1.5' for all i.
Step 1.1.1':

Step 1.1.2':

Step 1.1.3':

Step 1.1.4':

Step 1.1.5':

Step 2':

As described above, in each frame the Viterbi algorithm of the present invention compares the output probability b_i(x_t) of each state of the HMM acoustic model of a candidate word with the maximum output probability of the non-stationary-noise GMM acoustic models. If b_i(x_t) is larger, it is used as is; if the maximum output probability of the non-stationary-noise GMM acoustic models is larger, that value is used as the output probability b_i(x_t). The algorithm is designed this way to suppress the rank fluctuation caused by the presence of non-stationary noise. That is, in frames where the influence of non-stationary noise is pronounced, it prevents the output probability b_i(x_t) of the HMM acoustic model of the correct word, which is used in the calculation of equation (1-4) in step 1.1.2 of the conventional Viterbi algorithm described above, from becoming extremely small.
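The comparison described above can be illustrated with a minimal Python sketch. It is not the patent's own formulation: it assumes log-domain scores (the maximum is unaffected because the logarithm is monotonic), a left-to-right HMM that starts in its first state and ends in its last, and hypothetical names (`floored_log_outputs`, `viterbi_left_to_right`, `log_b`, `log_c`) that do not appear in the source.

```python
NEG_INF = float("-inf")

def floored_log_outputs(log_b, log_c):
    """Replace each speech output score by max(speech score, best noise score).

    log_b[t][i]: log output probability of speech HMM state i for frame t.
    log_c[t][k]: log output probability of noise GMM k for frame t.
    """
    floored = []
    for b_t, c_t in zip(log_b, log_c):
        c_max = max(c_t)  # best noise model in this frame (cf. equation (2-4))
        floored.append([max(b, c_max) for b in b_t])  # cf. equation (2-5)
    return floored

def viterbi_left_to_right(log_a_self, log_a_next, log_out):
    """Best-path log score of a left-to-right HMM (assumed topology).

    log_a_self[i]: log probability of the self-loop on state i.
    log_a_next[i]: log probability of the transition from state i to i + 1.
    log_out[t][i]: (possibly floored) log output probability.
    """
    n = len(log_a_self)
    f = [NEG_INF] * n
    f[0] = log_out[0][0]  # assumption: the path starts in state 0
    for t in range(1, len(log_out)):
        new = [NEG_INF] * n
        for i in range(n):
            stay = f[i] + log_a_self[i]
            move = f[i - 1] + log_a_next[i - 1] if i > 0 else NEG_INF
            new[i] = max(stay, move) + log_out[t][i]
        f = new
    return f[-1]  # assumption: the path ends in the last state
```

In a frame dominated by noise, every state of every candidate word receives the same floored score, so that frame no longer changes the ranking of the candidates, which is the effect the paragraph above describes.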

Compared with the conventional Viterbi algorithm, the Viterbi algorithm of the present invention adds only the likelihood calculation of the noise GMM acoustic models for the feature parameters, whose cost is proportional to the number of non-stationary-noise GMM acoustic models, and the calculations according to equations (2-4) and (2-5).

FIG. 1 shows the minimum configuration of a left-to-right HMM acoustic model to which the present invention is applied. This minimum configuration consists of two states, state i-1 and state i. The probability of a transition from state i-1 to state i is given by a_{i-1 i}, and the probability of a self-loop from state i to state i is given by a_ii. In addition to the output probability b_i(x), state i has the output probabilities c_i(x) of the noise GMM acoustic models, one for each of the K (K > 0) noise GMM acoustic models. The forward probabilities f(i-1, t-1) and f(i, t-1) of state i-1 and state i at time t-1 are assumed to have been calculated sequentially by the Viterbi algorithm of the present invention described above.
In this case, as shown in FIG. 2, the forward probability f(i, t) of state i at time t is obtained from equations (2-3), (2-4), (2-5), and (2-6) of the Viterbi algorithm of the present invention as follows:

The software and hardware of the speech recognition apparatus 10 described above realize the functional configuration shown in FIG. 3. As shown in the figure, the speech recognition apparatus 10 includes a first storage unit 11, a second storage unit 12, a first calculation unit 13, a second calculation unit 14, a selection unit 15, and a matching unit 16.
The first storage unit 11 comprises the speech HMM acoustic model stored in the ROM, RAM, or hard disk device. A speech HMM acoustic model created in advance is input from outside via the input/output interface and stored in the first storage unit 11.

The second storage unit 12 comprises the GMM acoustic models of non-stationary noise stored in the hard disk device. Noise GMM acoustic models created in advance are input from outside via the input/output interface and stored in the second storage unit 12. This embodiment uses the GMM, which is commonly used as an acoustic model of non-stationary noise. The noise GMM acoustic model used here may be, for example, a single noise GMM acoustic model created by mixing several types of noise, or a noise GMM acoustic model created by superimposing speech on the noise at a predetermined signal-to-noise ratio smaller than 1. Alternatively, the noise GMM acoustic models may be classified according to the mutual distances between them, and for each class a noise GMM acoustic model may be re-created from the noise on which the models in that class were originally based; a plurality of such re-created models may then be used. Here, the mutual distance is a value indicating the relative proximity of noise GMM acoustic models and is described in detail later. Which noise GMM acoustic models are used for speech recognition is decided by trading off the required computation and memory against recognition accuracy.

The first calculation unit 13, the second calculation unit 14, the selection unit 15, and the matching unit 16 are functions realized when the CPU of the speech recognition apparatus 10 executes a program describing the Viterbi algorithm of the present invention.
The first calculation unit 13 calculates output probabilities using the speech HMM acoustic model stored in the first storage unit 11 and the feature parameters of the speech to be recognized. It calculates the output probability for the feature parameters of a frame for each frame of the speech to be recognized and for each state of the speech HMM acoustic model.

The second calculation unit 14 calculates an output probability for each frame using the noise GMM acoustic model stored in the second storage unit 12 and the feature parameters of the speech to be recognized. When multiple noise GMM acoustic models are stored in the second storage unit 12, the second calculation unit 14 calculates an output probability for each of them.
The selection unit 15 selects the maximum output probability from among the output probability of the HMM acoustic model calculated by the first calculation unit 13 and the output probabilities of the noise GMM acoustic models calculated by the second calculation unit 14. This selection is performed for each frame and for each state of the speech HMM acoustic model.
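As a rough illustration of the per-frame work of the second calculation unit 14, the sketch below evaluates each noise model on each frame. It assumes a standard diagonal-covariance GMM parameterized by component weights, means, and standard deviations, and returns log probabilities; the source does not spell out the density, and the names `gmm_log_prob` and `noise_log_probs` are hypothetical.

```python
import math

def gmm_log_prob(x, weights, means, stds):
    """Log density of a diagonal-covariance GMM at feature vector x.

    weights[m], means[m][l], stds[m][l]: weight, mean, and standard
    deviation of mixture component m in dimension l (assumed layout).
    """
    comp = []
    for w, mu, sd in zip(weights, means, stds):
        lp = math.log(w)
        for xl, m, s in zip(x, mu, sd):
            lp += -0.5 * math.log(2.0 * math.pi * s * s) - 0.5 * ((xl - m) / s) ** 2
        comp.append(lp)
    top = max(comp)
    # log-sum-exp over the components for numerical stability
    return top + math.log(sum(math.exp(c - top) for c in comp))

def noise_log_probs(frames, noise_gmms):
    """Return scores[t][k]: log output probability of noise GMM k for frame t.

    noise_gmms is a list of (weights, means, stds) tuples, one per noise model.
    """
    return [[gmm_log_prob(x, *g) for g in noise_gmms] for x in frames]
```

Because these scores depend only on the frame and not on the HMM state, they need to be computed once per frame and per noise model, which is consistent with the modest extra cost described in the text.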

To describe the function of the selection unit 15 in more detail, in this embodiment the selection unit 15 includes a first selection unit 151 and a second selection unit 152. The first selection unit 151 performs the computation of equation (2-4) of the Viterbi algorithm of the present invention described above. Specifically, when multiple noise GMM acoustic models are stored in the second storage unit 12, the first selection unit 151 compares, for each frame, the output probabilities of the noise GMM acoustic models calculated by the second calculation unit 14 and selects the largest as the maximum noise output probability.

The second selection unit 152 performs the computation of equation (2-5) of the Viterbi algorithm of the present invention. Specifically, for each frame and for each state of the speech HMM acoustic model, the second selection unit 152 compares the output probability calculated by the first calculation unit 13 with the maximum noise output probability selected by the first selection unit 151, and selects the larger of the two.

The matching unit 16 performs matching by using the per-frame, per-state output probabilities selected by the selection unit 15 in the conventional Viterbi algorithm, and outputs the speech recognition result. Specifically, b_i(x_t) in equation (1-4) of the conventional Viterbi algorithm is replaced with the output probability selected by the selection unit 15, and the computation then follows the conventional Viterbi algorithm from equation (1-5) onward. When b_i(x_t) equals the output probability selected by the selection unit 15, that is, when the output probability of the speech HMM acoustic model itself was selected, no replacement is needed.

Thus, compared with a conventional speech recognition apparatus, the speech recognition apparatus 10 of this embodiment adds only the second calculation unit 14, which calculates for each frame the output probability of the feature parameters for every noise GMM acoustic model stored in the second storage unit 12, the first selection unit 151, and the second selection unit 152. The rest of the functional configuration is the same as in a conventional speech recognition apparatus.

Because the speech recognition apparatus 10 only needs to train noise GMM acoustic models, the processing and memory required for acoustic model training are far smaller than with the conventional methods. At recognition time, it only needs to hold the noise GMM acoustic models in addition to what a conventional apparatus holds, so the memory required at recognition time is also substantially smaller. Furthermore, when the speech recognition apparatus 10 is brought into a new environment, it only needs to train a GMM acoustic model of the noise in that environment and add it to the second storage unit 12, which is very convenient.
Next, the experimental results of conventional method 2, which has the smallest acoustic-model training data size, acoustic model size, acoustic model creation time, and Viterbi matching time among the conventional methods, and of conventional method 4, which has the highest recognition performance, are compared with those of the examples of the present invention to demonstrate the effect of those examples.

As speech acoustic models for the matching process by the Viterbi algorithm, the following two types were created first:
A) A speech HMM acoustic model created from training speech data obtained by applying the known spectral subtraction method to clean human speech (the acoustic model of conventional method 2)
B) A speech HMM acoustic model created from training speech data obtained by superimposing clean human speech on the sound of a coffee cup being placed on a saucer (hereinafter "cup sound" or "Cup"), the sound of walking in slippers on a wooden floor (hereinafter "slipper sound" or "Surippa"), and the sound of a storm shutter being opened and closed (hereinafter "shutter sound" or "Amado") at SNRs (signal-to-noise ratios) of -20 dB and -10 dB, that is, at signal-to-noise ratios smaller than 1, and then applying spectral subtraction (the acoustic model of conventional method 4)
Here, SNR is defined as follows.
SNR = 10 * log10(speech power / noise power)
Next, the acoustic models of the first and second examples are created. The rationale for their construction is explained below.
The mutual distance D(i, j) between GMM model i and GMM model j is defined by the following equation.

Here, μ(i,l,m), σ(i,l,m), and p(i,l,m) are the mean, standard deviation, and weight, respectively, of the m-th normal distribution in dimension l of GMM model i. M_i and M_j are the numbers of mixture components of GMM models i and j, respectively. L is the number of dimensions of the GMM models.
With the mutual distance between GMM models defined as above, the Sammon method, a type of multidimensional scaling (John W. Sammon, Jr., "A Nonlinear Mapping for Data Structure Analysis," IEEE Trans. Computers, Vol. C-18, No. 5, May 1969), can be used to project the relative proximity relationships among multiple GMM models onto a two-dimensional plane. When each GMM is a single normal distribution, the mutual distance equation simplifies as follows.

Using equation (6) and the known Sammon method, the following models were projected onto a two-dimensional plane as shown in FIG. 4: HMM acoustic models of 25 male phonemes and HMM acoustic models of 25 female phonemes, each with a single normal distribution, created from speech data of 25 Japanese phonemes (5 vowels, 2 semivowels, and 18 consonants) spoken by Japanese speakers; 40 GMM acoustic models of wild bird calls, each with a single normal distribution, created from call data of 40 wild bird species; and 33 noise GMM acoustic models created from 33 types of non-stationary noise data recorded in a house, including the cup, slipper, and shutter sounds. In the figure, the HMM acoustic models of Japanese phonemes spoken by Japanese men are shown as ■, those spoken by Japanese women as □, the GMM acoustic models of wild bird calls as ×, and the GMM acoustic models of non-stationary noise as ▲.
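For readers unfamiliar with the Sammon method, the following is a minimal sketch of a two-dimensional Sammon projection from a precomputed matrix of mutual distances. It uses plain fixed-step gradient descent on the Sammon stress rather than the pseudo-Newton update of Sammon's paper, assumes all off-diagonal distances are positive, and uses hypothetical names and default parameters (`sammon_map`, `sammon_stress`, `lr`, `iters`) that are not taken from the source. The patent's own distance formula is not reproduced here; any symmetric distance matrix can be supplied.

```python
import math
import random

def sammon_stress(dist, pts):
    """Sammon stress: sum over pairs of (d* - d)^2 / d*, divided by the sum of d*."""
    n = len(pts)
    total, err = 0.0, 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(pts[i], pts[j])
            total += dist[i][j]
            err += (dist[i][j] - d) ** 2 / dist[i][j]
    return err / total

def sammon_map(dist, iters=500, lr=0.5, seed=0):
    """Project n items onto a 2-D plane so that pairwise distances approximate dist."""
    rng = random.Random(seed)
    n = len(dist)
    pts = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(n)]
    total = sum(dist[i][j] for i in range(n) for j in range(i + 1, n))
    for _ in range(iters):
        grads = []
        for i in range(n):
            g = [0.0, 0.0]
            for j in range(n):
                if i == j:
                    continue
                d = max(math.dist(pts[i], pts[j]), 1e-12)  # avoid division by zero
                coef = -2.0 / total * (dist[i][j] - d) / (dist[i][j] * d)
                g[0] += coef * (pts[i][0] - pts[j][0])
                g[1] += coef * (pts[i][1] - pts[j][1])
            grads.append(g)
        for p, g in zip(pts, grads):  # simultaneous gradient-descent update
            p[0] -= lr * g[0]
            p[1] -= lr * g[1]
    return pts
```

The step size `lr` depends on the scale of the distances, so it would likely need tuning for real model distances; the projected coordinates are only defined up to rotation, reflection, and translation.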

As the projection in FIG. 4 shows, the non-stationary noise and the speech are clearly separated, with the wild bird calls lying between them, and the 33 types of non-stationary noise form a cluster. Therefore, even if a small number of noise GMM acoustic models is created from all 33 types of non-stationary noise, these models are expected to remain well separated from the speech HMM acoustic models.

Therefore, in addition to A) and B) above, the following acoustic models C) and D) were created.
C) (Acoustic models of the first example of the present invention)
(1) A speech HMM acoustic model created from training speech data obtained by applying spectral subtraction to clean human speech
(2) One noise GMM acoustic model with a single normal distribution, created using all 33 types of non-stationary noise data recorded in a house, including the cup, slipper, and shutter sounds

D) (Acoustic models of the second example of the present invention)
(1) A speech HMM acoustic model created from training speech data obtained by applying spectral subtraction to clean human speech
(2) One noise GMM acoustic model with a single normal distribution, created using all of the data obtained by superimposing clean human speech on the 33 types of non-stationary noise data recorded in a house (including the cup, slipper, and shutter sounds) at an SNR of -20 dB, a signal-to-noise ratio smaller than 1
(3) One noise GMM acoustic model with a single normal distribution, created using all of the data obtained by superimposing clean human speech on the 33 types of non-stationary noise data recorded in a house (including the cup, slipper, and shutter sounds) at an SNR of -10 dB, a signal-to-noise ratio smaller than 1

Next, the cup, slipper, and shutter sounds recorded in a house were superimposed at SNRs of 20 dB and 10 dB on speech data of 100 place-name words uttered by 30 speakers (15 men and 15 women) who were not used to create the speech HMM acoustic models above. The two resulting data sets were used as evaluation data for speech recognition.
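One way to superimpose noise on speech at a prescribed SNR, following the definition SNR = 10 * log10(speech power / noise power), is sketched below. It is an illustration under assumptions: power is taken as the mean squared sample value, the noise (not the speech) is rescaled, and the function names are hypothetical. The source does not say how the mixing was actually performed.

```python
import math

def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)

def snr_db(speech, noise):
    """SNR in dB: 10 * log10(speech power / noise power)."""
    return 10.0 * math.log10(power(speech) / power(noise))

def mix_at_snr(speech, noise, target_db):
    """Scale the noise so that the SNR equals target_db, then add it to the speech.

    Returns (mixed signal, scaled noise). The two inputs are assumed to have
    the same length and sampling rate.
    """
    gain = math.sqrt(power(speech) / (power(noise) * 10.0 ** (target_db / 10.0)))
    scaled = [gain * n for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled)]
    return mixed, scaled
```

With this definition, 20 dB and 10 dB correspond to speech power 100 and 10 times the noise power, while the -20 dB and -10 dB conditions used for model training correspond to noise power 100 and 10 times the speech power.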

FIG. 5 compares, for each SNR (20 dB and 10 dB), the recognition performance on the evaluation data in the 100-word place-name recognition task for four cases: matching with the conventional Viterbi algorithm using the speech HMM acoustic model of A) above (conventional method 2); matching with the conventional Viterbi algorithm using the speech HMM acoustic model of B) above (conventional method 4); matching with the Viterbi algorithm of the present invention using the speech HMM acoustic model and noise GMM acoustic model of C) above (first example); and matching with the Viterbi algorithm of the present invention using the speech HMM acoustic model and noise GMM acoustic models of D) above (second example).

These results show that, for the cup sound, both the first and second examples of the present invention achieved significantly higher recognition performance than conventional method 2. The first and second examples achieved about the same recognition performance as conventional method 4, but, as shown in the comparison table of FIG. 6 described in detail later, they have the advantage that the acoustic-model training data size, acoustic model size, acoustic model creation time, and Viterbi matching time are significantly smaller than for conventional method 4.

For the slipper and shutter sounds, both the first and second examples of the present invention achieved significantly higher recognition performance than conventional methods 2 and 4.
It is therefore clear that the Viterbi algorithm of the present invention is far superior to the conventional Viterbi algorithm in the presence of non-stationary noise such as cup, slipper, and shutter sounds.

Next, the noise GMM acoustic models created in the third example are described. A noise GMM acoustic model with a single normal distribution was created for each of the 33 types of non-stationary noise data recorded in a house, including the cup, slipper, and shutter sounds. Applying equation (6) and the known Sammon method projects these 33 models onto a two-dimensional plane as shown in FIG. 7. The noise GMM acoustic models located in the central part of FIG. 7 are relatively close to the other noise GMM acoustic models, while those located in the peripheral part are relatively far from them. The plane was therefore divided into two regions, inside and outside the circle shown in FIG. 7, and the following acoustic models were created from the noise GMM acoustic models located in each region.

E) (Acoustic models of the third example of the present invention)
(1) A speech HMM acoustic model created from training speech data obtained by applying spectral subtraction to clean human speech
(2) Two noise GMM acoustic models with a single normal distribution, one for each of the two regions shown in FIG. 7, each created from data obtained by superimposing clean human speech at an SNR of -20 dB, a signal-to-noise ratio smaller than 1, on the noise data used to create the noise GMM acoustic models located in that region (taken from the 33 types of non-stationary noise data recorded in a house, including the cup, slipper, and shutter sounds)
(3) Two noise GMM acoustic models with a single normal distribution, one for each of the two regions shown in FIG. 7, each created from data obtained by superimposing clean human speech at an SNR of -10 dB, a signal-to-noise ratio smaller than 1, on the noise data used to create the noise GMM acoustic models located in that region (taken from the 33 types of non-stationary noise data recorded in a house, including the cup, slipper, and shutter sounds)

Next, the cup, slipper, and shutter sounds recorded in a house were superimposed at SNRs of 20 dB and 10 dB on speech data of 100 place-name words uttered by 30 speakers (15 men and 15 women) who were not used to create the speech HMM acoustic models above, and the resulting data were used as evaluation data.

FIG. 8 compares, for each SNR (20 dB and 10 dB), the recognition performance on the evaluation data in the 100-word place-name recognition task for four cases: matching with the conventional Viterbi algorithm using the speech HMM acoustic model of A) above (conventional method 2); matching with the conventional Viterbi algorithm using the speech HMM acoustic model of B) above (conventional method 4); matching with the Viterbi algorithm of the present invention using the speech HMM acoustic model and noise GMM acoustic models of D) above (second example); and matching with the Viterbi algorithm of the present invention using the speech HMM acoustic model and noise GMM acoustic models of E) above (third example).

These results show that, for the cup sound, the second and third examples of the present invention achieved significantly higher recognition performance than conventional method 2. Compared with conventional method 4, the second and third examples achieved about the same recognition performance, but, as shown in FIG. 6, they have the advantage that the acoustic-model training data size, acoustic model size, acoustic model creation time, and Viterbi matching time are significantly smaller than for conventional method 4.

For the slipper and shutter sounds, both the second and third examples of the present invention achieved significantly higher recognition performance than conventional methods 2 and 4.
It is therefore clear that, in the presence of non-stationary noise such as cup, slipper, and shutter sounds, the Viterbi algorithm of the present invention is far superior to the conventional Viterbi algorithm.

Needless to say, the region division of the 33 types of non-stationary noise data projected onto a two-dimensional plane is not limited to the division into an inner region and an outer region shown in FIG. 8. The outer region may be further divided into two regions as shown in FIG. 10, and the number of divisions may be increased further as appropriate. As the number of divisions increases, each noise GMM acoustic model can be created from a set of noises that are closer to one another, which can improve speech recognition accuracy.

As described above, when the influence of non-stationary noise is significant, the Viterbi algorithm of the present invention can "ride out" the noise without causing the rank variation among candidates that occurs frequently with the conventional Viterbi algorithm, and can thereby prevent the accuracy of speech recognition from deteriorating.
A further advantage of the present invention is that it only needs one noise GMM acoustic model for each group of non-stationary noise. For example, suppose a Japanese speech recognition system is constructed by modeling the 54 phonemes shown in FIG. 9 with HMMs. If there are 33 types of noise and therefore 33 corresponding noise GMM acoustic models, then, in addition to the 54 clean phoneme HMM acoustic models created from clean speech, a total of only 87 (54 + 33) noise-superimposed phoneme HMM acoustic models needs to be created. Furthermore, if the 33 types of noise are grouped into two types, a total of only 56 (54 + 2) noise-superimposed phoneme HMM acoustic models needs to be created. The required amount of memory and computation can therefore be reduced.
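The model counts quoted above follow from simple addition, because the noise GMM acoustic models are kept alongside the clean phoneme HMM acoustic models rather than combined with each of them. A small sketch (the function name `total_acoustic_models` is hypothetical) makes the arithmetic explicit:

```python
def total_acoustic_models(num_phoneme_hmms, num_noise_gmms):
    """Phoneme HMMs and noise GMMs are stored side by side, so the counts add."""
    return num_phoneme_hmms + num_noise_gmms

print(total_acoustic_models(54, 33))  # 87: one GMM per noise type
print(total_acoustic_models(54, 2))   # 56: 33 noise types grouped into 2
```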

FIG. 6 shows a table comparing the present invention with the conventional methods. The table assumes that a Japanese speech recognition system based on the 54 phonemes shown in FIG. 9 is constructed, and that there are 33 types of noise. The data size for acoustic model creation, the acoustic model creation time, and the matching time of the Viterbi algorithm are expressed as ratios relative to conventional method 2, which is taken as 1. The acoustic model size is expressed as the number of HMM acoustic models to be created. The values in parentheses are the corresponding ratios and numbers of HMM acoustic models when the 33 types of noise are grouped into two types.

As this comparison table shows, the present invention does not suffer from the drawback of conventional methods 1, 3, and 4, namely that the data size for acoustic model creation and the acoustic model size become enormous as the number of noise types grows. Likewise, the present invention does not suffer from the drawback of conventional methods 1, 3, and 4 that the acoustic model creation time and the matching time of the Viterbi algorithm become enormous as the number of noise types grows.

In addition, the present invention is more robust against non-stationary noise than conventional methods 1, 2, 3, and 4.
Furthermore, in contrast to Patent Documents 1, 2, and 3, the present invention can avoid the drawback that the data size for acoustic model creation and the acoustic model size become enormous, and the drawback that the acoustic model creation time and the matching time of the Viterbi algorithm become enormous, even when the number of assumed noise types increases.
The examples above apply noise GMM acoustic models with a single normal distribution to the present invention. Needless to say, noise GMM acoustic models with a mixture of normal distributions, which are more accurate than those with a single normal distribution, may also be applied to the present invention.

As described above, by using the Viterbi algorithm of the present invention, the speech recognition apparatus 10 can suppress the rank variation among candidates caused by the presence of noise, and can realize speech recognition that is robust against non-stationary noise. Furthermore, because speech recognition can be realized with a small number of GMM acoustic models, the apparatus is advantageous in terms of both computation and memory. The speech recognition apparatus 10 can thus realize highly accurate speech recognition in a noisy environment while keeping processing load and memory consumption low.
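The per-frame selection that distinguishes this Viterbi algorithm from the conventional one can be sketched as follows. This is a simplified illustration under stated assumptions, not the apparatus itself: all function names are hypothetical, probabilities are handled in the log domain, every HMM state and every noise model is represented as a list of `(weight, mean, variance)` diagonal-Gaussian components, and the path is assumed to start in state 0 and end in the last state.

```python
import math

NEG_INF = float("-inf")

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def log_gmm(x, gmm):
    """Log density of a GMM given as a list of (weight, mean, var) components."""
    terms = [math.log(w) + log_gauss(x, m, v) for w, m, v in gmm]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))  # log-sum-exp

def selected_log_outputs(x, state_gmms, noise_gmms):
    """Per-state output log-probabilities for one frame after the selection."""
    # First selection: the best-scoring noise model for this frame.
    noise_best = max(log_gmm(x, g) for g in noise_gmms)
    # Second selection: per state, the larger of the speech and noise scores.
    return [max(log_gmm(x, g), noise_best) for g in state_gmms]

def viterbi_score(frames, state_gmms, log_trans, noise_gmms=None):
    """Best-path log score; log_trans[i][j] is the log transition probability
    from state i to state j. With noise_gmms=None this is a conventional
    Viterbi pass; otherwise the selected output probabilities are used."""
    n = len(state_gmms)
    delta = [NEG_INF] * n
    for t, x in enumerate(frames):
        if noise_gmms:
            b = selected_log_outputs(x, state_gmms, noise_gmms)
        else:
            b = [log_gmm(x, g) for g in state_gmms]
        if t == 0:
            delta = [b[0]] + [NEG_INF] * (n - 1)  # path must start in state 0
        else:
            delta = [max(delta[i] + log_trans[i][j] for i in range(n)) + b[j]
                     for j in range(n)]
    return delta[-1]  # path must end in the last state
```

In a frame dominated by a noise burst, every state receives the same noise score, so the frame contributes nearly equally to all candidate words and does not reorder them; in frames where a speech state scores higher than any noise model, the pass behaves like a conventional one.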

The present invention can be used for efficient, highly accurate speech recognition, particularly in environments where non-stationary noise occurs.

FIG. 1 is a diagram for explaining a Left-to-Right HMM acoustic model according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining the Viterbi algorithm according to the embodiment.
FIG. 3 is a diagram showing the configuration of the speech recognition apparatus according to the embodiment.
FIG. 4 is a diagram showing a two-dimensional projection of HMM acoustic models of human speech, GMM acoustic models of wild bird calls, and GMM acoustic models of noise.
FIG. 5 is a diagram showing a performance comparison between the conventional methods and examples according to the present invention.
FIG. 6 is a table comparing the conventional methods with the present invention.
FIG. 7 is a diagram showing the noise GMM acoustic models mapped onto a two-dimensional plane.
FIG. 8 is a diagram showing a performance comparison between the conventional methods and examples according to the present invention.
FIG. 9 is a diagram for explaining the types of phoneme HMM acoustic models.
FIG. 10 is a diagram showing an example of region division of the noise GMM acoustic models mapped onto a two-dimensional plane.
FIG. 11 is a diagram for explaining a conventional Left-to-Right HMM acoustic model.
FIG. 12 is a diagram for explaining the conventional Viterbi algorithm.
FIG. 13 is a diagram showing an example of the waveform of speech with superimposed non-stationary noise.
FIG. 14 is a diagram showing rank variation during matching of speech with superimposed non-stationary noise.

Explanation of symbols

10 Speech recognition apparatus
11 First storage unit
12 Second storage unit
13 First calculation unit
14 Second calculation unit
15 Selection unit
151 First selection unit
152 Second selection unit
16 Matching unit

Claims

A speech recognition apparatus that performs speech recognition on input speech using a Viterbi algorithm, comprising:
first storage means for storing an HMM (Hidden Markov Model) acoustic model of speech;
second storage means for storing a GMM (Gaussian Mixture Model) acoustic model of noise;
first calculation means for calculating, for each predetermined frame of the input speech and for each state of the speech HMM acoustic model stored in the first storage means, an output probability of the speech HMM acoustic model for a feature parameter of the speech to be recognized;
second calculation means for calculating, for each predetermined frame, an output probability of the noise GMM acoustic model stored in the second storage means for the feature parameter;
selection means for selecting, for each predetermined frame, the largest output probability from among the output probability calculated by the first calculation means and the output probability calculated by the second calculation means; and
matching means for using the output probability selected by the selection means in the Viterbi algorithm as the output probability of the speech HMM acoustic model in the frame for which that output probability was selected, and for performing speech recognition of the input speech using the Viterbi algorithm.

The speech recognition apparatus according to claim 1, wherein the selection means includes:
first selection means for selecting the maximum output probability among the output probabilities calculated by the second calculation means as the maximum output probability of noise; and
second selection means for selecting the larger of the output probability calculated by the first calculation means and the maximum output probability of noise selected by the first selection means.

The speech recognition apparatus according to claim 1, wherein the second storage means stores one GMM acoustic model of noise created by mixing a plurality of types of noise.

The speech recognition apparatus according to claim 1 or 2, wherein the second storage means stores a GMM acoustic model of noise created by superimposing speech on noise at a predetermined signal-to-noise ratio.

The speech recognition apparatus according to any one of claims 1 to 4, wherein the second storage means stores two or more GMM acoustic models of noise, the GMM acoustic models of noise being classified based on the mutual distances between them and re-created, for each class, from the noise data on which the GMM acoustic models of noise were originally created.

A speech recognition method that performs speech recognition using a Viterbi algorithm, comprising:
a first calculation step of calculating, for each predetermined frame of speech to be recognized and for each state of a speech HMM acoustic model, an output probability of the speech HMM acoustic model for a feature parameter of the speech to be recognized;
a second calculation step of calculating, for each predetermined frame, an output probability of a noise GMM acoustic model for the feature parameter;
a selection step of selecting, for each predetermined frame, the largest output probability from among the output probability calculated in the first calculation step and the output probability calculated in the second calculation step; and
a matching step of using the output probability selected in the selection step in the Viterbi algorithm as the output probability of the speech HMM acoustic model in the frame for which that output probability was selected, and performing speech recognition of the speech to be recognized using the Viterbi algorithm.

A program for causing a computer to execute:
a first calculation step of calculating, for each predetermined frame of speech to be recognized and for each state of a speech HMM acoustic model, an output probability of the speech HMM acoustic model for a feature parameter of the speech to be recognized;
a second calculation step of calculating, for each predetermined frame, an output probability of a noise GMM acoustic model for the feature parameter;
a selection step of selecting, for each predetermined frame, the largest output probability from among the output probability calculated in the first calculation step and the output probability calculated in the second calculation step; and
a matching step of using the output probability selected in the selection step in the Viterbi algorithm as the output probability of the speech HMM acoustic model in the frame for which that output probability was selected, and performing speech recognition of the speech to be recognized using the Viterbi algorithm.