JP4362054B2

JP4362054B2 - Speech recognition apparatus and speech recognition program

Info

Publication number: JP4362054B2
Application number: JP2003322135A
Authority: JP
Inventors: 庄衛佐藤; 亨今井
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2003-09-12
Filing date: 2003-09-12
Publication date: 2009-11-11
Anticipated expiration: 2023-09-12
Also published as: JP2005091518A

Description

The present invention relates to speech recognition technology, and more particularly to a speech recognition apparatus and a speech recognition program that recognize input speech in a noisy environment.

Conventionally, speech recognition has been performed as follows. An acoustic likelihood indicating similarity is calculated from feature quantities (a feature vector) obtained by analyzing the acoustic signal of the input speech and from an acoustic model that models the feature quantities of the phonemes in speech. Candidate words are then searched for using that acoustic likelihood and a language model that models word occurrence frequencies and connection probabilities (see, for example, Non-Patent Document 1).

A conventional speech recognition apparatus is described here with reference to FIG. 4, which is a block diagram showing its configuration. As shown in FIG. 4, the conventional speech recognition apparatus 20 determines the content of an utterance given as input speech by a search algorithm, based on an acoustic model (stored in the acoustic model storage means 21) and a language model (stored in the language model storage means 22), and outputs the recognition result.

In the speech recognition apparatus 20, the acoustic analysis means 23 first extracts a plurality of feature quantities (for example, cepstrum coefficients) for each phoneme from the input speech. The acoustic likelihood calculation means 24 then calculates an acoustic likelihood indicating the similarity between the input speech and the acoustic model, based on the feature quantities (feature vector) extracted by the acoustic analysis means 23 and the acoustic model stored in the acoustic model storage means 21.

Next, the search means 25 of the speech recognition apparatus 20 searches for candidate words to be connected, based on the acoustic likelihood calculated by the acoustic likelihood calculation means 24 and the word connection probabilities given by the language model stored in the language model storage means 22, and notifies the acoustic likelihood calculation means 24 of the search result (search candidates). By operating the acoustic likelihood calculation means 24 and the search means 25 in turn, the speech recognition apparatus 20 can calculate the connection probability of each word string hypothesized for the input speech. The search means 25 then outputs the word string with the maximum connection probability as the recognition result for the input speech.
Thus, speech recognition has conventionally been performed on the basis of an acoustic model and a language model trained in advance from training data.
Non-Patent Document 1: Imai, Kobayashi, Onoe, Ando, "Speech recognition system for automatic captioning of news programs", IPSJ Spoken Language Processing Technical Report, 23-11, pp. 59-64, Oct. 1998

However, when noise is superimposed on the input speech and that noise resembles a phoneme modeled in the acoustic model, the speech recognition technique described above recognizes the noise as a phoneme and outputs an incorrect recognition result for the input speech.
This problem arises because, in noise portions that the acoustic model has not been trained on, the same search as for speech portions is performed even though the acoustic likelihood is less reliable there. It also arises because the acoustic model has not sufficiently learned the characteristics of input speech with superimposed noise.

One conceivable countermeasure is to build the acoustic model from training data in which noise is superimposed on speech (noise-superimposed speech). However, because the training data available to the acoustic model is finite, it is difficult to train the acoustic model to cover the wide variety of possible noise-superimposed speech.

The present invention has been made in view of the above problems. Its object is to provide a speech recognition apparatus and a speech recognition program that can reduce recognition errors even when noise is superimposed on the input speech, without using an acoustic model adapted to a wide variety of noise-superimposed speech.

The present invention was devised to achieve this object. The speech recognition apparatus according to claim 1 recognizes input speech using an acoustic model, a language model, a second acoustic model obtained by modeling the acoustic model, and a noise model obtained by modeling noise, and comprises acoustic analysis means, reliability calculation means, acoustic likelihood calculation means, acoustic likelihood correction means, output sequence search means, and recognition result output means.

With this configuration, the acoustic analysis means of the speech recognition apparatus analyzes the acoustic signal of the input speech by spectrum analysis, linear prediction analysis, cepstrum analysis, or the like, and extracts feature quantities of the speech. The feature need not be a single quantity; using a feature vector made up of a plurality of feature quantities allows the characteristics of the input speech to be represented appropriately.

The reliability calculation means calculates a reliability indicating the degree to which the feature quantity (feature vector) extracted by the acoustic analysis means is a feature quantity of speech. Specifically, based on a noise model obtained by modeling noise data in advance and a second acoustic model obtained by modeling the acoustic model, it calculates the degree (reliability) to which the input speech having that feature quantity is speech, from the ratio of the probability density function values under the two models. The noise model here is trained on noise data assumed in advance and is modeled by, for example, a Gaussian mixture model (mixture of normal distributions). The acoustic model is a hidden Markov model of the feature quantities of each phoneme, trained in advance on a large amount of speech data. The second acoustic model is generated by, for example, modeling the acoustic model with a Gaussian mixture model. Using a second acoustic model with a reduced amount of data in this way keeps down the amount of computation needed to calculate the reliability.

The acoustic likelihood calculation means then calculates an acoustic likelihood indicating the similarity between the input speech and the acoustic model, based on the feature quantity and the acoustic model. That is, this means calculates the acoustic likelihood, which indicates whether the input is speech, from the acoustic model alone. Because noise may be superimposed on the input speech, however, this acoustic likelihood is not guaranteed to be accurate. The acoustic likelihood correction means therefore corrects the acoustic likelihood calculated by the acoustic likelihood calculation means using the reliability calculated by the reliability calculation means, yielding an acoustic likelihood that takes noise into account (corrected acoustic likelihood).

The output sequence search means searches for output sequences (words, morphemes, phonemes, and so on) with a high connection probability, based on the corrected acoustic likelihood and the language model. The output sequence candidates found here (search candidates) are reported to the acoustic likelihood calculation means as appropriate, and the acoustic likelihood calculation means calculates the acoustic likelihood of the input speech based on the acoustic models of those search candidates.
The recognition result output means then identifies, among the plurality of output sequences found by the output sequence search means, the one with the maximum connection probability as the output sequence recognized from the input speech, and outputs it as the speech recognition result.

The speech recognition program according to claim 2 causes a computer to function as acoustic analysis means, reliability calculation means, acoustic likelihood calculation means, acoustic likelihood correction means, output sequence search means, and recognition result output means, in order to recognize input speech using an acoustic model, a language model, a second acoustic model obtained by modeling the acoustic model, and a noise model obtained by modeling noise data.

With this configuration, the speech recognition program uses the acoustic analysis means to analyze the acoustic signal of the input speech and extract feature quantities of the speech.
It uses the reliability calculation means to calculate a reliability indicating the degree to which the input speech is speech, based on the feature quantities extracted by the acoustic analysis means, the noise model, and the second acoustic model.

The speech recognition program then uses the acoustic likelihood calculation means to calculate an acoustic likelihood indicating the similarity between the input speech and the acoustic model, based on the feature quantities and the acoustic model. It uses the acoustic likelihood correction means to correct that acoustic likelihood with the reliability calculated by the reliability calculation means, yielding an acoustic likelihood that takes noise into account (corrected acoustic likelihood).
It further uses the output sequence search means to search for output sequences (words, morphemes, phonemes, and so on) with a high connection probability, based on the corrected acoustic likelihood and the language model. The recognition result output means identifies, among the output sequences found by the output sequence search means, the one with the maximum connection probability as the output sequence recognized from the input speech, and outputs it as the speech recognition result.

According to the invention of claim 1 or claim 2, even for speech with superimposed noise, the acoustic likelihood indicating which output sequence the input speech corresponds to can be corrected on the basis of a noise model obtained by modeling noise data. When the reliability of the acoustic likelihood drops, recognition can therefore give more weight to the linguistic constraints of the language model, such as vocabulary, grammar, and meaning, which reduces speech recognition errors in noisy environments.
In addition, the acoustic model does not have to be modeled to cover a wide variety of noise-superimposed speech. Only the noise model needs to be built, so model construction is simple.

Furthermore, according to the invention of claim 1 or claim 2, the reliability is calculated using a second acoustic model (speech model) whose data amount is reduced relative to the acoustic model used to calculate the acoustic likelihood, so the amount of computation required for the reliability calculation can be kept down.

Embodiments of the present invention are described below with reference to the drawings.
[Configuration of the speech recognition apparatus]
First, the configuration of a speech recognition apparatus according to a reference example of the present invention is described with reference to FIG. 1, which is a block diagram showing the configuration of the speech recognition apparatus.

The noise model storage means 11 is a storage medium, such as an ordinary hard disk, that stores the noise model 11a. The noise model 11a is obtained by modeling noise data, based on a database (not shown) in which data of the noise assumed in advance is accumulated. For example, a Gaussian mixture model (mixture of normal distributions) can be used as the noise model 11a. The noise model 11a may be a single noise model λ^N, or a set of models (cluster models), one per noise type, λ^N = {λ_1^N, λ_2^N, ..., λ_M^N} (M is the number of clusters of the noise model).
Noise here means any sound other than the speech to be recognized, for example the sound of an airplane or a siren, the voices of a crowd, or the sound of script pages being turned in a news program.

The acoustic model storage means 12 is a storage medium, such as an ordinary hard disk, that stores the acoustic model 12a. The acoustic model 12a models the feature quantities of each phoneme, learned in advance from a large amount of speech data, with a hidden Markov model. Like the noise model 11a, the acoustic model 12a may be a single acoustic model λ^S, or a set of models (cluster models), one per type of sound (for example, per speaker), λ^S = {λ_1^S, λ_2^S, ..., λ_K^S} (K is the number of clusters of the acoustic model). If, for example, the acoustic model 12a is built with the triphone (a phoneme together with its preceding and following phonemes) as the unit of recognition, there are as many cluster models as there are triphones.

The language model storage means 13 is a storage medium, such as an ordinary hard disk, that stores the language model 13a. The language model 13a models the occurrence frequencies, connection probabilities, and so on of output sequences (words, morphemes, phonemes, and so on) learned from a large amount of speech data. For example, a common N-gram language model can be used.
Here the noise model storage means 11, the acoustic model storage means 12, and the language model storage means 13 are configured as separate storage means, but the models may also be stored in a single storage means. Each storage means may also be connected via a network.

The acoustic analysis means 14 analyzes speech input from outside (input speech) and extracts feature quantities of that speech. The extracted feature quantities are output in time series, as feature vectors, to the reliability calculation means 15 and the acoustic likelihood calculation means 16.
The acoustic analysis means 14 applies a window function (such as a Hamming window) to the waveform of the input speech to extract framed waveforms, and performs frequency analysis on each frame to extract various feature quantities. For example, cepstrum coefficients, which are the values obtained by applying an inverse Fourier transform to the logarithm of the power spectrum of the framed waveform, are used as feature quantities. Besides cepstrum coefficients, other common speech features can be used, such as mel-frequency cepstrum coefficients (MFCC), LPC (Linear Predictive Coding) coefficients, and log power.
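The framing and cepstrum steps described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the frame size and shift, the naive DFT, and the toy sine-wave input are all assumptions chosen to keep the example self-contained.

```python
import cmath
import math


def hamming(n: int) -> list[float]:
    """Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def frames(signal: list[float], size: int, shift: int) -> list[list[float]]:
    """Cut the waveform into overlapping frames and apply the window to each."""
    w = hamming(size)
    return [
        [s * wk for s, wk in zip(signal[start:start + size], w)]
        for start in range(0, len(signal) - size + 1, shift)
    ]


def dft(x: list[complex], inverse: bool = False) -> list[complex]:
    """Naive O(n^2) discrete Fourier transform (adequate for a short demo frame)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = [
        sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def cepstrum(frame: list[float], n_coef: int, floor: float = 1e-12) -> list[float]:
    """Real cepstrum: inverse DFT of the log power spectrum of one frame."""
    spectrum = dft([complex(v) for v in frame])
    # The floor avoids log(0) for empty frequency bins (illustrative choice).
    log_power = [math.log(max(abs(v) ** 2, floor)) for v in spectrum]
    ceps = dft([complex(v) for v in log_power], inverse=True)
    return [c.real for c in ceps[:n_coef]]


# Toy input: 64 samples of a sine wave, 32-sample frames shifted by 16 samples.
wave = [math.sin(2.0 * math.pi * 4.0 * t / 32.0) for t in range(64)]
feature_vectors = [cepstrum(f, n_coef=8) for f in frames(wave, size=32, shift=16)]
```

Each entry of `feature_vectors` plays the role of one feature vector x_t in the time series passed on to the reliability and acoustic likelihood calculations. A practical front end would use an FFT and typically MFCCs rather than this plain cepstrum.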

The reliability calculation means 15 calculates a reliability P(S|x_t) indicating the degree to which the feature vector x_t at time t, output in time series from the acoustic analysis means 14, is speech (not noise). It does so based on the noise model 11a, trained in advance and stored in the noise model storage means 11, and the acoustic model 12a, trained in advance and stored in the acoustic model storage means 12. The calculated reliability is output to the acoustic likelihood correction means 17.

The method of calculating the reliability P(S|x_t) is described concretely below.
First, let the noise model 11a be λ^N = {λ_1^N, λ_2^N, ..., λ_M^N} (M is the number of clusters of the noise model), and let the degree (likelihood) to which the feature vector x_t is a noise feature under the i-th cluster model λ_i^N be the conditional probability P(x_t|λ_i^N) (1 ≤ i ≤ M). The degree to which x_t is a noise feature (noise model likelihood) P(x_t|λ^N) can then be calculated, as in equation (1), as the sum of the likelihoods under the individual cluster models λ_i^N.

Similarly, let the acoustic model 12a be λ^S = {λ_1^S, λ_2^S, ..., λ_K^S} (K is the number of clusters of the acoustic model), and let the degree (likelihood) to which the feature vector x_t is a speech feature under the i-th cluster model λ_i^S be the conditional probability P(x_t|λ_i^S) (1 ≤ i ≤ K). The degree to which x_t is a speech feature (acoustic model likelihood) P(x_t|λ^S) can then be calculated, as in equation (2), as the sum of the likelihoods under the individual cluster models λ_i^S.

From the degree P(x_t|λ^N) to which the feature vector x_t is a noise feature and the degree P(x_t|λ^S) to which it is a speech feature, obtained by equations (1) and (2), the reliability P(S|x_t) indicating the degree to which x_t is speech (not noise) can be calculated by equation (3).
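Equations (1) to (3) are not reproduced in this text, so the following is only a sketch under stated assumptions. The sums follow the description of equations (1) and (2). For equation (3) it is assumed that the reliability is the speech likelihood divided by the sum of the speech and noise likelihoods, which matches the statements that the reliability is a ratio of density values and lies between 0 and 1. The function names and the numeric likelihoods are hypothetical.

```python
def model_likelihood(cluster_likelihoods: list[float]) -> float:
    """Equations (1)/(2) as described in the text: sum over the cluster models."""
    return sum(cluster_likelihoods)


def reliability(speech_likelihood: float, noise_likelihood: float) -> float:
    """Assumed form of equation (3): P(S|x_t) = P(x_t|S) / (P(x_t|S) + P(x_t|N))."""
    total = speech_likelihood + noise_likelihood
    if total <= 0.0:
        return 0.5  # no evidence either way (illustrative fallback, not from the source)
    return speech_likelihood / total


# Hypothetical per-cluster likelihoods for one feature vector x_t.
noise_clusters = [0.02, 0.01, 0.01]   # P(x_t | lambda_i^N), M = 3
speech_clusters = [0.05, 0.10, 0.01]  # P(x_t | lambda_i^S), K = 3

p_noise = model_likelihood(noise_clusters)
p_speech = model_likelihood(speech_clusters)
p_s_given_x = reliability(p_speech, p_noise)  # about 0.8: x_t looks mostly like speech
```

Under this assumed form, a frame that the noise model explains as well as the speech model gets a reliability of 0.5, and a frame explained only by the noise model gets a reliability near 0.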

If the noise model 11a is not accurate enough to yield a sufficiently accurate reliability, a lower limit may be placed on the reliability P(S|x_t). For example, with the lower limit denoted α, a corrected value P'(S|x_t) obtained from P(S|x_t) by equation (4) may be used as the reliability. This avoids the adverse effect of discarding the information in the acoustic likelihood itself when the noise model is not accurate enough.
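Equation (4) is not reproduced in this text, so its exact form is unknown here. The sketch below shows two common ways of imposing a lower limit α, either of which keeps the corrected reliability at or above α. Both are assumptions for illustration, and the function names are hypothetical.

```python
def floor_by_clamp(p: float, alpha: float) -> float:
    """Candidate 1 (assumption): never let the reliability fall below alpha."""
    return max(alpha, p)


def floor_by_rescale(p: float, alpha: float) -> float:
    """Candidate 2 (assumption): map [0, 1] linearly onto [alpha, 1]."""
    return alpha + (1.0 - alpha) * p


alpha = 0.2
raw = [0.0, 0.05, 0.5, 1.0]
clamped = [floor_by_clamp(p, alpha) for p in raw]
rescaled = [floor_by_rescale(p, alpha) for p in raw]
```

With either variant, a frame that the noise model (perhaps wrongly) claims entirely still contributes at least a fraction α of its log acoustic likelihood to the search.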

In equations (1) and (2), the degree to which the feature vector x_t is a noise or speech feature is calculated as the sum of the likelihoods under the cluster models. As a simplification, the maximum likelihood among the cluster models λ_i^N or λ_i^S may be used instead. That is, the noise model likelihood P(x_t|λ^N) and the acoustic model likelihood P(x_t|λ^S) may be calculated by equations (5) and (6), respectively.
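The difference between the sum form and the simplified maximum form can be illustrated as below. The maximum follows the description of equations (5) and (6); the ratio used for the reliability is the same assumed form of equation (3) as elsewhere, and all names and numbers are hypothetical.

```python
def likelihood_sum(cluster_likelihoods: list[float]) -> float:
    """Sum over cluster models, as described for equations (1) and (2)."""
    return sum(cluster_likelihoods)


def likelihood_max(cluster_likelihoods: list[float]) -> float:
    """Best single cluster model, as described for equations (5) and (6)."""
    return max(cluster_likelihoods)


def ratio(speech: float, noise: float) -> float:
    """Assumed reliability form: speech share of the combined likelihood."""
    return speech / (speech + noise)


noise_clusters = [0.02, 0.01, 0.01]
speech_clusters = [0.05, 0.10, 0.01]

rel_sum = ratio(likelihood_sum(speech_clusters), likelihood_sum(noise_clusters))
rel_max = ratio(likelihood_max(speech_clusters), likelihood_max(noise_clusters))
```

The maximum form avoids accumulating every cluster likelihood and gives a similar, though not identical, reliability when one cluster dominates.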

Alternatively, as in equations (7) and (8), the degree P(x_t|λ^N) to which the feature vector x_t is a noise feature and the degree P(x_t|λ^S) to which it is a speech feature may be calculated by setting a specific time width (moving-average window width w) around x_t and adding up the noise or speech likelihoods within that width. Equations (7) and (8) are moving-average-type values: for the frames obtained by framing the input speech with a window function such as a Hamming window, the values over a fixed period spanning several frames are added together. This improves the accuracy of the reliability. The sums are not divided by the number of frames because equation (3) computes the reliability as a ratio, so a common divisor would cancel out.
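A sketch of the windowed variant follows. Equations (7) and (8) are not reproduced in this text, so the window shape is an assumption: here each frame t uses the frames from t - w to t + w, clipped at the ends of the utterance. The reliability ratio is again the assumed form of equation (3), and the numbers are hypothetical.

```python
def windowed_sums(values: list[float], w: int) -> list[float]:
    """For each frame t, add up the values of frames t-w .. t+w (clipped at the ends)."""
    n = len(values)
    return [sum(values[max(0, t - w):min(n, t + w + 1)]) for t in range(n)]


def smoothed_reliability(speech: list[float], noise: list[float], w: int) -> list[float]:
    """Reliability per frame from windowed sums; no division by the window length,
    because the same divisor would appear in numerator and denominator."""
    s_sums = windowed_sums(speech, w)
    n_sums = windowed_sums(noise, w)
    return [
        s / (s + n) if (s + n) > 0.0 else 0.5  # 0.5 fallback is an illustrative choice
        for s, n in zip(s_sums, n_sums)
    ]


# Hypothetical frame-wise likelihoods: frame 1 is an isolated noise-like frame.
speech_lik = [0.9, 0.1, 0.9, 0.9]
noise_lik = [0.1, 0.9, 0.1, 0.1]
rel = smoothed_reliability(speech_lik, noise_lik, w=1)
```

In this toy input, the isolated noise-like frame has a frame-wise ratio of 0.1, but its windowed reliability is pulled up by its speech-like neighbours, which illustrates how the window steadies the estimate.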

Returning to FIG. 1, the description of the configuration of the speech recognition apparatus 10 continues.
The acoustic likelihood calculation means 16 calculates an acoustic likelihood indicating the similarity between the feature vectors, extracted by the acoustic analysis means 14 and input in time series, and the phonemes modeled by the acoustic model 12a stored in the acoustic model storage means 12. The acoustic likelihood calculation means 16 calculates an acoustic likelihood for each search candidate of the output sequence that is successively output from the search means 18. The calculated acoustic likelihoods are output to the acoustic likelihood correction means 17.
For example, if the acoustic models of the search candidates successively output from the search means 18 are λ^AM = {λ_1^AM, λ_2^AM, ..., λ_J^AM} (J is the total number of search candidate models), the acoustic likelihood calculation means 16 calculates the degree to which the feature vector x_t is speech (acoustic likelihood) as the conditional probability P(x_t|λ_j^AM) (1 ≤ j ≤ J).

The acoustic likelihood correction means 17 corrects the acoustic likelihood of each search candidate, calculated by the acoustic likelihood calculation means 16, based on the reliability calculated by the reliability calculation means 15, and thereby generates a corrected acoustic likelihood. The corrected acoustic likelihood is output to the search means 18.
For example, when the acoustic likelihood calculation means 16 outputs the acoustic likelihood P(x_t|λ_j^AM) (1 ≤ j ≤ J) of a search candidate at time t, the acoustic likelihood correction means 17 raises P(x_t|λ_j^AM) to the power of the reliability P(S|x_t) calculated by the reliability calculation means 15, as in equation (9), to generate the corrected acoustic likelihood P'(x_t|λ_j^AM).

In equation (9), the reliability is normalized to 0 ≤ P(S|x_t) ≤ 1, so the corrected acoustic likelihood P'(x_t|λ_j^AM) becomes higher than the acoustic likelihood P(x_t|λ_j^AM) (for likelihood values below 1). However, when the search means 18, described later, compares the corrected acoustic likelihoods of competing candidates, it is the difference of the log likelihoods that matters. The dynamic range of that log-likelihood difference is merely scaled by a factor of P(S|x_t), so the increase in the acoustic likelihood itself has no effect.
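The exponent correction described for equation (9) and its effect on log-likelihood differences can be checked numerically. This is a minimal sketch; the function names and the example likelihoods are hypothetical.

```python
import math


def corrected_likelihood(likelihood: float, rel: float) -> float:
    """Equation (9) as described in the text: raise the likelihood to the power P(S|x_t)."""
    return likelihood ** rel


def corrected_log_likelihood(log_likelihood: float, rel: float) -> float:
    """The same correction in the log domain: a multiplication by P(S|x_t)."""
    return rel * log_likelihood


# Two competing candidates at the same frame, and a low reliability (noisy frame).
log_a = math.log(1e-3)
log_b = math.log(1e-5)
rel = 0.25

gap_before = log_a - log_b
gap_after = corrected_log_likelihood(log_a, rel) - corrected_log_likelihood(log_b, rel)
```

Here `gap_after` equals `rel * gap_before`: the ranking of the two candidates is unchanged, but the acoustic evidence separating them shrinks in proportion to the reliability. Working in the log domain, as decoders usually do, also avoids numerical underflow.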

The search means 18 searches the language model 13a for candidate output sequences to be connected, based on the corrected acoustic likelihood generated by the acoustic likelihood correction means 17. It outputs the resulting search candidates to the acoustic likelihood calculation means 16, and outputs the output sequence with the maximum connection probability to the outside as the recognition result for the input speech. Here, the search means 18 consists of an output sequence search unit 18a and a recognition result output unit 18b.

The output sequence search unit (output sequence search means) 18a searches for output sequences with a high connection probability, based on the corrected acoustic likelihood generated by the acoustic likelihood correction means 17 and the language model 13a stored in the language model storage means 13. For speech with superimposed noise, the reliability, and with it the contribution of the corrected acoustic likelihood, becomes low, so the output sequence search unit 18a relies more heavily on the linguistic constraints of the language model 13a (for example, vocabulary, grammar, and meaning) when searching for the output sequence. In other words, for noise-superimposed speech the acoustic constraints are weakened and the linguistic constraints are given relatively more weight, which improves recognition accuracy for speech distorted by noise.
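The shift of weight from acoustic to linguistic constraints can be illustrated with a toy scoring function. This is a deliberately simplified sketch, not the patent's search algorithm: a real decoder would run a beam or Viterbi search over an HMM network with an N-gram model, whereas here two complete hypothetical candidates are scored directly. The candidate names, scores, and function names are all assumptions.

```python
import math


def path_score(acoustic_log_likelihoods: list[float],
               reliabilities: list[float],
               lm_log_prob: float) -> float:
    """Sum of reliability-scaled frame log likelihoods plus the language model score."""
    corrected = sum(r * a for r, a in zip(reliabilities, acoustic_log_likelihoods))
    return corrected + lm_log_prob


# Hypothetical candidates: (frame-wise acoustic log likelihoods, LM log probability).
CANDIDATES = {
    "acoustic_favourite": ([-1.0, -1.0], math.log(0.01)),  # fits the sound, unlikely wording
    "language_favourite": ([-3.0, -3.0], math.log(0.2)),   # fits the sound less, likely wording
}


def best_candidate(reliabilities: list[float]) -> str:
    """Return the candidate with the highest combined score."""
    return max(
        CANDIDATES,
        key=lambda name: path_score(CANDIDATES[name][0], reliabilities, CANDIDATES[name][1]),
    )


clean_choice = best_candidate([1.0, 1.0])  # reliable frames: acoustics decide
noisy_choice = best_candidate([0.2, 0.2])  # unreliable frames: language model decides
```

With fully reliable frames the acoustically better candidate wins; when the reliability drops to 0.2, the acoustic gap shrinks and the candidate preferred by the language model is selected instead.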

The recognition result output unit (recognition result output means) 18b outputs, from among the plurality of output sequences found by the output sequence search unit 18a, the output sequence with the maximum connection probability. That is, although the output sequence search unit 18a finds a plurality of output sequences, the recognition result output unit 18b selects from among them the path (output sequence) whose connection path probability is maximum and takes it as the speech recognition result.
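The interplay between the acoustic and linguistic constraints can be illustrated with a small sketch. It is not the search algorithm of the patent: it assumes that each candidate already carries per-frame acoustic log likelihoods and a single language model log probability, that the correction multiplies each log likelihood by the frame reliability, and that the two scores are simply added. The names `path_score`, `best_sequence`, and the candidate data are hypothetical.

```python
def path_score(acoustic_logliks, reliabilities, lm_logprob):
    """Score one candidate: reliability-scaled acoustic log likelihoods
    plus the language model log probability (assumed combination)."""
    corrected = sum(r * ll for r, ll in zip(reliabilities, acoustic_logliks))
    return corrected + lm_logprob


def best_sequence(candidates, reliabilities):
    """Return the words of the candidate with the highest total score."""
    best = max(
        candidates,
        key=lambda c: path_score(c["acoustic"], reliabilities, c["lm"]),
    )
    return best["words"]


# Two hypothetical candidates scored over two frames.
# candidate_a fits the audio better; candidate_b fits the language model better.
candidates = [
    {"words": ["candidate_a"], "acoustic": [-2.0, -2.0], "lm": -6.0},
    {"words": ["candidate_b"], "acoustic": [-4.0, -4.0], "lm": -3.0},
]

clean = best_sequence(candidates, [1.0, 1.0])  # high reliability: acoustics decide
noisy = best_sequence(candidates, [0.2, 0.2])  # low reliability: language model decides
```

With full reliability the acoustically better candidate wins; when the reliability drops, the acoustic gap shrinks and the language model score decides the outcome, which is the behavior described for the output sequence search unit 18a.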

The configuration of the speech recognition apparatus 10 according to a reference example of the present invention has been described above. In the present invention, the reliability calculation means 15 calculates the reliability using a speech model obtained by reducing the data amount of the acoustic model 12a.
That is, as shown in the block diagram of the speech recognition apparatus 10B in FIG. 2, the present invention is configured by adding speech model storage means 19 to the speech recognition apparatus 10 (see FIG. 1). Specifically, the speech recognition apparatus 10B adds to the speech recognition apparatus 10 the speech model storage means 19, which stores a speech model 19a, and replaces the reliability calculation means 15 with reliability calculation means 15B, whose function is modified. The other components are the same as those of the speech recognition apparatus 10 shown in FIG. 1, so they are given the same reference numerals and their description is omitted.

The speech model (second acoustic model) 19a stored in the speech model storage means 19 is a model with a smaller data amount than the acoustic model 12a. The speech model 19a is generated, for example, by modeling the acoustic model 12a with a mixture normal distribution model. The speech model 19a may also be trained on a large amount of speech data, but because its purpose is to let the reliability calculation means 15B calculate the reliability, it is desirable to train it on the speech data that was used to generate the acoustic model 12a.
As with the acoustic model 12a, the speech model 19a may be a single speech model λ^S, or a plurality of models λ^S = {λ_1^S, λ_2^S, …, λ_K^S} (K is the number of clusters of the speech model), one for each type of speech.

Based on the noise model 11a stored in the noise model storage means 11 and the speech model 19a stored in the speech model storage means 19, the reliability calculation means 15B calculates the reliability P(S|x_t), which indicates the degree to which the feature vector x_t at time t, output in time series from the acoustic analysis means 14, is speech.
Using the speech model 19a, which has a smaller data amount than the acoustic model 12a, in this way reduces the amount of computation required for speech recognition.
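To make the computational saving concrete, the following sketch evaluates the likelihood of one feature vector under a mixture of diagonal-covariance normal distributions. The diagonal covariance, the function name `gmm_likelihood`, and the toy one-dimensional models are assumptions for illustration; the patent text here only states that the speech model 19a is a mixture model with less data than the acoustic model 12a.

```python
import math


def gmm_likelihood(x, components):
    """Likelihood of feature vector x under a mixture of diagonal normals.

    components is a list of (weight, means, variances) tuples.
    The cost grows linearly with len(components), which is why a model
    with fewer mixture components is cheaper to evaluate per frame.
    """
    total = 0.0
    for weight, means, variances in components:
        log_density = 0.0
        for xi, mu, var in zip(x, means, variances):
            log_density += -0.5 * (math.log(2.0 * math.pi * var) + (xi - mu) ** 2 / var)
        total += weight * math.exp(log_density)
    return total


# Hypothetical one-dimensional models: a two-component speech model
# and a single-component noise model.
speech_model = [(0.5, [1.0], [1.0]), (0.5, [3.0], [1.0])]
noise_model = [(1.0, [-2.0], [1.0])]

frame = [1.0]
speech_lik = gmm_likelihood(frame, speech_model)
noise_lik = gmm_likelihood(frame, noise_model)
```

For a frame near the speech components, `speech_lik` exceeds `noise_lik`, which is the kind of evidence the reliability calculation means 15B would turn into a high reliability.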

The speech recognition apparatuses 10 and 10B described above can be realized by having a general-purpose computer execute a program that operates the arithmetic unit and storage device in the computer. This program (speech recognition program) can be distributed via a communication line, or written to a recording medium such as a CD-ROM and distributed.

[Operation of the speech recognition apparatus]
Next, the operation of the speech recognition apparatus 10 will be described with reference to FIG. 3 (and FIG. 1 as appropriate). FIG. 3 is a flowchart showing the operation of the speech recognition apparatus 10.

(Acoustic analysis step)
The speech recognition apparatus 10 uses the acoustic analysis means 14 to analyze the input speech and extract its feature amounts in time series as feature vectors x_t (step S1).

(Reliability calculation step)
The speech recognition apparatus 10 then uses the reliability calculation means 15 to calculate a reliability indicating the degree to which the input represented by the feature vector x_t is speech. First, as shown in equation (1), the reliability calculation means 15 refers to the noise model 11a and calculates the likelihood sum P(x_t|λ^N), which indicates the degree to which the feature vector x_t is noise (step S2). Next, as shown in equation (2), it refers to the acoustic model 12a and calculates the likelihood sum P(x_t|λ^S), which indicates the degree to which the feature vector x_t is speech (step S3). Finally, from the noise likelihood sum P(x_t|λ^N) calculated in step S2 and the speech likelihood sum P(x_t|λ^S) calculated in step S3, it calculates by equation (3) the reliability P(S|x_t), which indicates the degree to which the feature vector x_t is speech (step S4).
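Steps S2 to S4 can be sketched as follows. Equation (3) is not reproduced in this passage, so the sketch assumes that the reliability is the speech likelihood sum divided by the total of the speech and noise likelihood sums, which yields a value between 0 and 1 as the description requires. The function name `reliability` and the log-domain formulation are illustrative choices, not taken from the patent.

```python
import math


def reliability(log_lik_speech, log_lik_noise):
    """Assumed form of P(S|x_t) = P_S / (P_S + P_N), computed from log likelihoods.

    Working with the log difference avoids underflow when both
    likelihoods are very small, as is common for high-dimensional features.
    """
    diff = log_lik_noise - log_lik_speech
    if diff > 700.0:
        # exp(diff) would overflow; noise dominates completely.
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


# Hypothetical likelihood sums for one frame (steps S2 and S3).
log_p_noise = math.log(0.1)
log_p_speech = math.log(0.9)

# Step S4: combine them into a reliability between 0 and 1.
p_speech_given_x = reliability(log_p_speech, log_p_noise)
```

Equal likelihoods give a reliability of 0.5, and the value moves toward 0 as the noise model explains the frame better than the speech model.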

(Acoustic likelihood calculation step)
The speech recognition apparatus 10 also uses the acoustic likelihood calculation means 16 to calculate, based on the feature vector x_t extracted by the acoustic analysis means 14 and the acoustic model 12a, an acoustic likelihood for each search candidate of the output sequences found by the output sequence search unit 18a of the search means 18 (step S5). That is, when the acoustic models of the search candidates are λ^AM = {λ_1^AM, λ_2^AM, …, λ_J^AM} (J is the total number of search candidate models), the degree to which the feature vector x_t is speech (the acoustic likelihood) is calculated as the conditional probability P(x_t|λ_j^AM) (1 ≤ j ≤ J).

(Acoustic likelihood correction step)
The speech recognition apparatus 10 then uses the acoustic likelihood correction means 17 to correct, as shown in equation (9), the acoustic likelihood P(x_t|λ_j^AM) (1 ≤ j ≤ J) of each search candidate calculated by the acoustic likelihood calculation means 16, based on the reliability P(S|x_t) calculated by the reliability calculation means 15, and thereby generates the corrected acoustic likelihood P′(x_t|λ_j^AM) (step S6).

(Search step)
The speech recognition apparatus 10 then uses the output sequence search unit 18a of the search means 18 to search for output sequences with a high connection probability, based on the corrected acoustic likelihood P′(x_t|λ_j^AM) generated by the acoustic likelihood correction means 17 and the language model 13a, and outputs the resulting search candidates to the acoustic likelihood calculation means 16 (step S7). These search candidates are used by the acoustic likelihood calculation means 16 when it calculates the acoustic likelihood in step S5.

The speech recognition apparatus 10 also uses the recognition result output unit 18b to output, as the speech recognition result, the output sequence with the maximum connection probability among the plurality of output sequences found in step S7 (step S8). The operations from step S1 onward are repeated for as long as input speech is being received.

Here, in step S4, the reliability is calculated using the degree to which the feature vector is a noise feature and the degree to which it is a speech feature. Alternatively, the reliability may be calculated using the maximum likelihood among the cluster models shown in equations (5) and (6), or the moving average values shown in equations (7) and (8).
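The two alternatives can be sketched in a few lines. Equations (5) to (8) are not reproduced in this passage, so the details are assumptions: the sketch takes the largest per-cluster likelihood in place of the likelihood sum, and smooths a per-frame value with a simple moving average over a fixed window of recent frames. The names `max_cluster_likelihood` and `MovingAverage`, and the window length, are hypothetical.

```python
from collections import deque


def max_cluster_likelihood(cluster_likelihoods):
    """Use the best-matching cluster model instead of the sum over clusters."""
    return max(cluster_likelihoods)


class MovingAverage:
    """Simple moving average over the most recent `window` frames."""

    def __init__(self, window):
        # deque with maxlen discards the oldest value automatically.
        self.values = deque(maxlen=window)

    def update(self, value):
        self.values.append(value)
        return sum(self.values) / len(self.values)


# Hypothetical per-cluster likelihoods for one frame.
best = max_cluster_likelihood([0.02, 0.15, 0.07])

# Smooth a hypothetical sequence of per-frame values with a 3-frame window.
smoother = MovingAverage(window=3)
smoothed = [smoother.update(v) for v in [1.0, 2.0, 3.0, 4.0]]
```

Smoothing over several frames trades responsiveness for stability: a single noisy frame changes the reliability less, but the reliability also reacts more slowly when noise starts or stops.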

Through the above operation, when noise is superimposed on the input speech, the reliability P(S|x_t) decreases, so the speech recognition apparatus 10 can increase the degree to which the search depends on the linguistic constraints of the language model 13a. As a result, recognition errors can be reduced even when noise causes a mismatch between the acoustic model 12a and the input speech.

FIG. 1 is a block diagram showing the configuration of a speech recognition apparatus according to a reference example of the present invention.
FIG. 2 is a block diagram showing the configuration of a speech recognition apparatus according to the present invention.
FIG. 3 is a flowchart showing the operation of the speech recognition apparatus according to the reference example of the present invention.
FIG. 4 is a block diagram showing the configuration of a conventional speech recognition apparatus.

Explanation of symbols

10, 10B  Speech recognition apparatus
11  Noise model storage means
11a  Noise model
12  Acoustic model storage means
12a  Acoustic model
13  Language model storage means
13a  Language model
14  Acoustic analysis means
15  Reliability calculation means
16  Acoustic likelihood calculation means
17  Acoustic likelihood correction means
18  Search means
18a  Output sequence search unit (output sequence search means)
18b  Recognition result output unit (recognition result output means)
19  Speech model storage means
19a  Speech model (second acoustic model)

Claims

A speech recognition apparatus that recognizes input speech using an acoustic model, a language model, a second acoustic model obtained by modeling the acoustic model, and a noise model obtained by modeling noise data, the apparatus comprising:
acoustic analysis means for analyzing an acoustic signal of the input speech and extracting a feature amount of the input speech;
reliability calculation means for calculating, based on the feature amount extracted by the acoustic analysis means, the noise model, and the second acoustic model, a reliability indicating the degree to which the input speech is speech;
acoustic likelihood calculation means for calculating, based on the feature amount and the acoustic model, an acoustic likelihood indicating the similarity between the input speech and the acoustic model;
acoustic likelihood correction means for correcting the acoustic likelihood calculated by the acoustic likelihood calculation means with the reliability to generate a corrected acoustic likelihood;
output sequence search means for searching for candidate output sequences constituting the input speech, based on the corrected acoustic likelihood generated by the acoustic likelihood correction means and the language model; and
recognition result output means for outputting, as the speech recognition result of the input speech, the output sequence having the maximum connection probability among the plurality of output sequences found by the output sequence search means.

A speech recognition program for causing a computer, in order to recognize input speech using an acoustic model, a language model, a second acoustic model obtained by modeling the acoustic model, and a noise model obtained by modeling noise data, to function as:
acoustic analysis means for analyzing an acoustic signal of the input speech and extracting a feature amount of the input speech;
reliability calculation means for calculating, based on the feature amount extracted by the acoustic analysis means, the noise model, and the second acoustic model, a reliability indicating the degree to which the input speech is speech;
acoustic likelihood calculation means for calculating, based on the feature amount and the acoustic model, an acoustic likelihood indicating the similarity between the input speech and the acoustic model;
acoustic likelihood correction means for correcting the acoustic likelihood calculated by the acoustic likelihood calculation means with the reliability to generate a corrected acoustic likelihood;
output sequence search means for searching for candidate output sequences constituting the input speech, based on the corrected acoustic likelihood generated by the acoustic likelihood correction means and the language model; and
recognition result output means for outputting, as the speech recognition result of the input speech, the output sequence having the maximum connection probability among the plurality of output sequences found by the output sequence search means.