JP2002358097A

JP2002358097A - Voice recognition device

Info

Publication number: JP2002358097A
Application number: JP2001167041A
Authority: JP
Inventors: Toshiyuki Hanazawa; 利行花沢
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2001-06-01
Filing date: 2001-06-01
Publication date: 2002-12-13

Abstract

PROBLEM TO BE SOLVED: When assigning, to each word of the word string output by continuous speech recognition, a reliability indicating how certain the recognition result is, to improve accuracy by assigning a low reliability to a misrecognized word even when it acoustically resembles the correct word. SOLUTION: The speech recognition device includes a linguistic reliability calculation means 13 that uses a forward statistical language model 14 for reliability calculation and a backward statistical language model 15 for reliability calculation to compute, on the basis of linguistic statistics, a reliability indicating whether each word of the word string produced by continuous speech recognition was correctly recognized.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition device, and more particularly to a speech recognition device having a function of assigning, to each word of the word string obtained as the result of continuous speech recognition, a reliability indicating the certainty of the recognition result.

[0002]

[Prior Art] Current speech recognition technology does not always achieve a 100% recognition rate. In continuous speech recognition in particular, an utterance to be recognized often contains several words, so the chance that some word in an utterance is misrecognized increases.

[0003] As a countermeasure against this kind of misrecognition, one can compute, for each word of the recognition result, a reliability indicating the certainty of recognition, and then reject words with low reliability or ask the user to confirm them. One method of computing the reliability of a recognition result is given in, for example, Japanese Patent Application Laid-Open No. 4-255900. The technique disclosed in that publication is described here as the prior art.

[0004] FIG. 3 is a block diagram showing the configuration of the conventional speech recognition device disclosed in Japanese Patent Application Laid-Open No. 4-255900. In FIG. 3, 1 denotes a speech signal input terminal; 2, the input speech signal entered through the input terminal 1; 3, analysis means that performs acoustic analysis of the input speech signal 2; 4, the time series of feature vectors of the input speech signal 2 obtained by the analysis means 3; 5, continuous speech recognition means that performs continuous speech recognition using the feature vector time series 4; 6, an acoustic model; 7, a language model for speech recognition; 8, the speech recognition result produced by the continuous speech recognition means 5; 9, reference likelihood calculation means that uses the acoustic model 6 to compute the reference likelihood of the feature vector time series 4 of the input speech signal 2; 10, the reference likelihood obtained by the reference likelihood calculation means 9; 11, acoustic reliability calculation means that uses the reference likelihood 10 to compute a reliability for each word of the word string contained in the speech recognition result 8; and 12, the recognition result annotated with that reliability.

[0005] A continuous-density HMM (Hidden Markov Model) is used as the acoustic model 6. The acoustic model 6 is word-based; that is, one model represents one word. Accordingly, as many acoustic models are prepared as there are words in the recognition vocabulary. Each model consists of multiple states, and the model topology is left-to-right.

[0006] A word bigram model, which is a statistical language model, is used as the language model 7 for speech recognition. The recognition target is, for example, user utterances concerning hotel reservations. The language model 7 for speech recognition is assumed to have been trained in advance on text data transcribed from a large number of user utterances concerning hotel reservations.

[0007] Next, the operation of this conventional speech recognition device is described with reference to FIG. 3. When the speech signal 2 is input from the speech signal input terminal 1, the analysis means 3 divides the speech signal 2 along the time axis into a plurality of short-time segments (hereinafter called frames), performs acoustic analysis on each frame using, for example, the LPC (Linear Predictive Coding) method, and converts each frame into a feature vector X. The feature vector X is, for example, an LPC cepstrum. The analysis means 3 performs this acoustic analysis on all frames and outputs the feature vector time series 4, X_1, X_2, X_3, ..., X_T. Here the subscript is the frame number of each feature vector, and T is the total number of frames of the speech signal 2.

[0008] The continuous speech recognition means 5 takes as input the feature vector time series 4 output by the analysis means 3 and performs pattern matching between the time series 4 and the word-based acoustic models 6. The pattern matching uses, for example, the one-pass DP matching method: processing proceeds from frame 1 in the forward direction of the time axis, and as the pattern-matching frame advances, the acoustic models 6 are concatenated according to the word bigram probabilities of the language model 7 for speech recognition. When pattern matching up to frame T is complete, the continuous speech recognition means 5 outputs, as the speech recognition result 8, the word string w_1, w_2, ..., w_i, ..., w_N (w_i is the i-th word from the beginning of the recognized word string, and N is the length of the word string), together with the start frame s_n, the end frame e_n, and the likelihood L_n of each word w_n (n = 1 to N). The start frame s_n and the end frame e_n of each word w_n can be obtained by running one-pass DP pattern matching from frame 1 to the final frame T of the feature vector time series 4, X_1, X_2, X_3, ..., X_T, and then tracing the pattern matching result back along the time axis from frame T to frame 1.

[0009] Meanwhile, the reference likelihood calculation means 9 takes as input the feature vector time series 4 output by the analysis means 3 and outputs the reference likelihood 10 for each frame, LR_i (i = 1 to T), according to equation (1) below.

[0010]

(Equation 1)
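The equation image is not reproduced in this text. Going by the description in paragraph [0011] (the reference likelihood is the largest state likelihood over all acoustic models), equation (1) presumably has the following form. The index set K, standing for all states of all acoustic models, is notation assumed here for illustration.

```latex
LR_i = \max_{k \in K} b_k(X_i), \qquad i = 1, \dots, T
```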

[0011] On the right-hand side of equation (1), b_k(X_i) is the likelihood of state k of the acoustic model 6 for the feature vector X_i of frame i. Equation (1) therefore means that the reference likelihood LR_i is the largest likelihood among the states of all the acoustic models.

[0012] Next, the acoustic reliability calculation means 11 takes as input the speech recognition result 8, namely the word string w_1, w_2, ..., w_i, ..., w_N with the start frame s_n, the end frame e_n, and the likelihood L_n of each word w_n (n = 1 to N), together with the reference likelihood 10, LR_i (i = 1 to T), and computes the reliability S^(1)_n (n = 1 to N) of each word according to equation (2) below. It then outputs, as the reliability-annotated recognition result 12, the word string w_1, w_2, ..., w_i, ..., w_N and the reliabilities S^(1)_1, S^(1)_2, ..., S^(1)_i, ..., S^(1)_N.

[0013]

(Equation 2)
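The equation image is not reproduced in this text. Going by paragraph [0014] (the difference between the word likelihood and the accumulated reference likelihood, normalized by the number of frames in the word, equal to 0 at best and negative otherwise), equation (2) presumably reads as follows. This is a reconstruction from the surrounding definitions, not a copy of the original figure.

```latex
S^{(1)}_n = \frac{L_n - \sum_{i=s_n}^{e_n} LR_i}{e_n - s_n + 1}, \qquad n = 1, \dots, N
```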

[0014] The word likelihood L_n varies with differences between speakers and the like, but the reference likelihood LR_i varies in the same way. Therefore, as equation (2) shows, taking the difference between the two and normalizing by the number of frames in the word (e_n - s_n + 1) reduces the variation caused by speaker differences and the like, so the result can be used as an index of the reliability of the recognition result. Because the reference likelihood LR_i is the maximum likelihood over all acoustic models in each frame, the relation L_n <= Σ_i LR_i always holds, where the sum runs over the frames of the word (i = s_n to e_n). Consequently, S^(1)_n computed by equation (2) is 0 when the reliability is highest and becomes a negative value of larger magnitude as the reliability decreases.
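The prior-art computation described in paragraphs [0009] to [0014] can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: it assumes the likelihoods are log-likelihoods, uses 0-based frame indices, and the function names and toy scores are invented for the example.

```python
from typing import List, Sequence


def reference_likelihoods(state_scores: Sequence[Sequence[float]]) -> List[float]:
    """Equation (1) as described in the text: per-frame maximum over all states.

    state_scores[i][k] is the (assumed log-)likelihood b_k(X_i) of state k
    for the feature vector of frame i.
    """
    return [max(frame) for frame in state_scores]


def acoustic_reliability(word_likelihood: float, ref: Sequence[float],
                         start: int, end: int) -> float:
    """Equation (2) as described: (L_n - sum of LR_i over the word) / frame count.

    start and end are 0-based, inclusive frame indices (the patent counts from 1).
    """
    n_frames = end - start + 1
    return (word_likelihood - sum(ref[start:end + 1])) / n_frames


# Toy data: 4 frames, 3 states. All values are invented for illustration.
scores = [[-2.0, -1.0, -3.0],
          [-1.5, -2.5, -0.5],
          [-4.0, -1.0, -2.0],
          [-0.5, -3.0, -2.0]]
ref = reference_likelihoods(scores)
best = acoustic_reliability(-1.5, ref, 0, 1)   # word likelihood equals the summed maxima
worse = acoustic_reliability(-4.0, ref, 2, 3)  # word likelihood falls below the maxima
```

In this toy run the first word attains the upper bound and gets reliability 0, while the second gets a negative value, matching the behavior stated in paragraph [0014].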

[0015]

[Problems to Be Solved by the Invention] Because the conventional speech recognition device is configured as described above, it has the following problem. Suppose the utterance is "(pause) nihaku shukuhaku shimasu (pause)" ("I will stay two nights") and the continuous speech recognition result is "(pause) nihaku chikaku shimasu (pause)" (w_1 = pause, w_2 = "nihaku" (二泊, "two nights"), w_3 = "chikaku" (近く, "near"), w_4 = "shimasu" (します, "will do"), w_5 = pause). Here w_3 = "chikaku" is a misrecognition, but because it is acoustically similar to "shukuhaku" (宿泊, "stay"), the accumulated reference likelihood Σ_i LR_i over the interval from the start frame s_3 to the end frame e_3 is close to the word likelihood L_3, and the reliability takes a high value (close to 0). In other words, when a misrecognized word is acoustically similar to the correct word, its reliability becomes high.

[0016] The present invention has been made to solve this problem, and its object is to provide a speech recognition device that can assign a low reliability to a recognized word that is a misrecognition, even when that word is acoustically similar to the correct word.

[0017]

[Means for Solving the Problems] The present invention is a speech recognition device comprising: continuous speech recognition means that performs continuous speech recognition of input speech and outputs, as the recognition result, a word string corresponding to the input speech; reliability-calculation statistical language model storage means that stores one or more kinds of statistical language models for reliability calculation, each giving the occurrence probability of each word that can linguistically precede or follow a given word; and linguistic reliability calculation means that uses the statistical language models for reliability calculation to compute, for each word of the recognized word string, a reliability indicating whether that word was correctly recognized.

[0018] The device further comprises: acoustic reliability calculation means that computes, on the basis of acoustic likelihood, a reliability indicating whether each word of the recognized word string was correctly recognized; and reliability integration means that computes an integrated reliability using both the reliability computed by the linguistic reliability calculation means and the reliability computed by the acoustic reliability calculation means.

[0019] Also, a word n-gram model is used as the statistical language model for reliability calculation.

[0020] Also, a word-class n-gram model, in which words are grouped into several classes, is used as the statistical language model for reliability calculation.

[0021] Also, the one or more kinds of statistical language models for reliability calculation comprise a forward statistical language model for reliability calculation, which is a conditional probability model of a word given a predetermined number of preceding words, and a backward statistical language model for reliability calculation, which is a conditional probability model of a word given a predetermined number of following words. For each word of the word string recognized by the continuous speech recognition, the linguistic reliability calculation means computes a first reliability indicating whether the word was correctly recognized using the forward statistical language model for reliability calculation, and a second reliability indicating whether the word was correctly recognized using the backward statistical language model for reliability calculation.

[0022] Also, the linguistic reliability calculation means outputs the larger of the first reliability and the second reliability as the reliability of the word.

[0023] Also, the linguistic reliability calculation means outputs a weighted sum of the first reliability and the second reliability as the reliability of the word.

[0024]

[Embodiments of the Invention] Embodiment 1. FIG. 1 is a block diagram showing the configuration of a speech recognition device according to Embodiment 1 of the present invention. In the figure, 1 denotes a speech signal input terminal; 2, the input speech signal entered through the input terminal 1; 3, analysis means that performs acoustic analysis of the input speech signal 2; 4, the time series of feature vectors of the input speech signal 2 obtained by the analysis means 3; 5, continuous speech recognition means that receives the feature vector time series 4 of the input speech signal 2, performs continuous speech recognition of the input speech, and outputs a word string as the recognition result; 6, an acoustic model; 7, a language model for speech recognition; 8, the speech recognition result output from the continuous speech recognition means 5; 14, one or more kinds of forward statistical language models for reliability calculation, each giving the occurrence probability of a word that can follow a given word (that is, the conditional probability that word w_n follows word w_{n-1}); 15, one or more kinds of backward statistical language models for reliability calculation, each giving the occurrence probability of a word that can precede a given word (that is, the conditional probability that word w_n precedes word w_{n+1}); and 13, linguistic reliability calculation means that uses the forward statistical language model 14 for reliability calculation and the backward statistical language model 15 for reliability calculation to compute, for each word of the word string constituting the continuous speech recognition result 8, a reliability indicating whether that word was correctly recognized.

[0025] As in the prior art, this embodiment uses a continuous-density HMM as the acoustic model 6. The acoustic model 6 is word-based; that is, one model represents one word. Accordingly, as many acoustic models are prepared as there are words in the recognition vocabulary. Each model consists of multiple states, and the model topology is left-to-right.

[0026] As in the prior art, the language model 7 for speech recognition is a word bigram model, which is a statistical language model. The recognition target is, for example, user utterances concerning hotel reservations. The language model 7 for speech recognition is assumed to have been trained in advance on text data transcribed from a large number of user utterances concerning hotel reservations.

[0027] In this embodiment, the forward statistical language model 14 for reliability calculation, newly added by the present invention, is the same model as the language model 7 for speech recognition.

[0028] In an ordinary n-gram model, the occurrence probability of a word is a conditional probability given the several preceding words. In the backward statistical language model 15 for reliability calculation of this embodiment, by contrast, the occurrence probability of a word is a conditional probability given the several following words. For example, when a backward bigram model is used as the backward statistical language model 15 for reliability calculation, the occurrence probability of "nihaku" (二泊, "two nights") in the word string "ashita" (明日, "tomorrow") + "nihaku" + "shukuhaku" (宿泊, "stay") is computed as P(nihaku | shukuhaku). The backward statistical language model 15 for reliability calculation is likewise assumed to have been trained in advance on text data transcribed from a large number of user utterances concerning hotel reservations.
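As a rough illustration of how forward and backward bigram tables of the kind described in paragraphs [0027] and [0028] could be estimated from transcripts, the sketch below uses plain relative-frequency counts. The patent does not specify the estimation method, so the absence of smoothing, the `<pause>` boundary token, the function name `train_bigrams`, and the three-sentence corpus are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple

BOUNDARY = "<pause>"  # hypothetical token standing in for the utterance-edge pause

Bigram = Dict[Tuple[str, str], float]


def train_bigrams(corpus: List[List[str]]) -> Tuple[Bigram, Bigram]:
    """Estimate (forward, backward) bigram tables by relative frequency.

    forward[(prev, w)] = P(w | prev)    : word given the preceding word
    backward[(nxt, w)] = P_bw(w | nxt)  : word given the following word
    """
    pair_counts: Counter = Counter()
    prev_counts: Counter = Counter()
    next_counts: Counter = Counter()
    for sentence in corpus:
        tokens = [BOUNDARY] + sentence + [BOUNDARY]
        for prev, nxt in zip(tokens, tokens[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
            next_counts[nxt] += 1
    forward = {(p, w): c / prev_counts[p] for (p, w), c in pair_counts.items()}
    backward = {(n, w): c / next_counts[n] for (w, n), c in pair_counts.items()}
    return forward, backward


# Invented toy transcripts (romanized): "stay two nights tomorrow", and so on.
corpus = [["ashita", "nihaku", "shukuhaku", "shimasu"],
          ["nihaku", "shukuhaku", "shimasu"],
          ["ippaku", "shukuhaku", "shimasu"]]
forward, backward = train_bigrams(corpus)
p_bw = backward[("shukuhaku", "nihaku")]  # P_bw(nihaku | shukuhaku)
```

Both tables come from the same pair counts; only the normalizer differs (count of the preceding word for the forward model, count of the following word for the backward model).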

[0029] Next, the operation of the speech recognition device of this embodiment is described. When the speech signal 2 is input from the speech signal input terminal 1, the analysis means 3 and the continuous speech recognition means 5 operate in the same way as in the conventional speech recognition device, and the continuous speech recognition means 5 outputs, as the speech recognition result 8, the word string w_1, w_2, ..., w_i, ..., w_N (w_i is the i-th word from the beginning of the recognized word string, and N is the length of the word string).

[0030] Next, the linguistic reliability calculation means 13 takes as input the word string w_1, w_2, ..., w_i, ..., w_N of the speech recognition result 8 and, using the forward statistical language model 14 for reliability calculation and the backward statistical language model 15 for reliability calculation, computes the reliability S^(2)_n (n = 1 to N) of each word according to equation (3) below. P(w_n | w_{n-1}) in equation (3) is the conditional probability, held by the forward statistical language model 14 for reliability calculation, that word w_n follows word w_{n-1}. That is, the probability value of the statistical language model is used as the reliability of word w_n. The linguistic reliability calculation means 13 then outputs, as the reliability-annotated recognition result 16, the word string w_1, w_2, ..., w_i, ..., w_N and the reliabilities S^(2)_1, S^(2)_2, ..., S^(2)_i, ..., S^(2)_N.

[0031]

(Equation 3)
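The equation image is not reproduced in this text. Since paragraph [0030] states that the forward bigram probability itself is used as the reliability of word w_n, equation (3) presumably reads as follows (a reconstruction from the surrounding definitions).

```latex
S^{(2)}_n = P(w_n \mid w_{n-1}), \qquad n = 1, \dots, N
```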

[0032] In ordinary speech recognition, the statistical language model is used to obtain the language likelihood of the whole word string w_1, w_2, ..., w_i, ..., w_N. The speech recognition score of the word string is then computed as a weighted sum of this language likelihood and the acoustic likelihood. Thus, for example, when the utterance is "(pause) nihaku shukuhaku shimasu (pause)" and the two recognition candidates "(pause) nihaku chikaku shimasu (pause)" and "(pause) nihaku shukuhaku shimasu (pause)" are compared, the language likelihood can be expected to be higher for the latter, but the acoustic likelihood may be higher for the former, and the speech recognition score, being the weighted sum of the language and acoustic likelihoods, may then also be higher for the former. In that case, lowering the reliability of the misrecognized word "chikaku" was difficult with the prior art described above.

[0033] In this embodiment, by contrast, the language likelihood is computed afresh for each individual word of the recognized word string, and that language likelihood is used as the reliability of the word. As a result, even for a recognition error caused by an acoustically similar word, the reliability of word w_n can be lowered when its linguistic connection probability with the preceding word w_{n-1} is low. For example, suppose the utterance is "(pause) nihaku shukuhaku shimasu (pause)" and the continuous speech recognition result is "(pause) nihaku chikaku shimasu (pause)" (w_1 = pause, w_2 = "nihaku", w_3 = "chikaku", w_4 = "shimasu", w_5 = pause). Here w_3 = "chikaku" is a misrecognition that is acoustically similar to the correct word "shukuhaku", but the word sequence "nihaku" + "chikaku" is rare in ordinary Japanese, so the language model likelihood P(chikaku | nihaku) is low and the reliability of w_3 = "chikaku" can be made low.

[0034] The reliability can also be computed by equation (4) or equation (5) below. P(w_n | w_{n+1}) in equations (4) and (5) is the conditional probability, held by the backward statistical language model 15 for reliability calculation, that word w_n precedes word w_{n+1}. MAX(,) in equation (4) is an operator that selects the larger of two values. α in equation (5) is a constant set in advance, for example α = 0.5. In equations (4) and (5), a first reliability indicating whether each word was correctly recognized is obtained using the forward statistical language model 14 for reliability calculation, and a second reliability is obtained using the backward statistical language model 15 for reliability calculation. Equation (4) takes the larger of the first and second reliabilities as the reliability of the word, and equation (5) takes their weighted sum.

[0035]

(Equation 4)

[0036]

(Equation 5)
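The images for equations (4) and (5) are not reproduced in this text. From paragraph [0034] (the larger of the forward and backward probabilities, and their weighted sum with a preset constant α), they presumably take the following forms. The backward probability is written P_bw here, as in paragraph [0037]; the use of the symbol S^{(2)}_n on the left-hand side and the complementary weight (1 - α) in equation (5) are assumptions, since the text only says "weighted sum".

```latex
% Equation (4), reconstructed: larger of the two reliabilities
S^{(2)}_n = \mathrm{MAX}\bigl( P(w_n \mid w_{n-1}),\; P_{bw}(w_n \mid w_{n+1}) \bigr)

% Equation (5), reconstructed: weighted sum (the weight 1 - \alpha is assumed)
S^{(2)}_n = \alpha \, P(w_n \mid w_{n-1}) + (1 - \alpha) \, P_{bw}(w_n \mid w_{n+1})
```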

[0037] Taking into account the conditional probabilities of word connections from both the preceding and the following side in this way allows the reliability to be computed with higher accuracy. For example, suppose the utterance is "(pause) nihaku shukuhaku shimasu (pause)" and the speech recognition result is "(pause) nihaku chikaku shimasu (pause)" (w_1 = pause, w_2 = "nihaku", w_3 = "chikaku", w_4 = "shimasu", w_5 = pause). Here w_4 = "shimasu" is correctly recognized, but because w_3 = "chikaku" is a misrecognition, computing the reliability of word w_4 by equation (3) uses the value P(shimasu | chikaku). Since "chikaku" is a misrecognition, the word sequence "chikaku" + "shimasu" is rare in ordinary Japanese, so the reliability of w_4 becomes low even though w_4 is correct. Equations (4) and (5), in contrast, consider the value P_bw(shimasu | pause) in addition to P(shimasu | chikaku). The word sequence "shimasu" + pause occurs frequently in Japanese, so P_bw(shimasu | pause) is high, which keeps the reliability of the correctly recognized w_4 from dropping.
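The effect described in paragraph [0037] can be sketched with a small example. The function below computes the three variants of the word reliability described in the text (forward only, larger of the two, weighted sum). All probability values are invented for illustration, and the function name, the `floor` value for unseen bigrams, and the (1 - alpha) weight are assumptions rather than details given in the patent.

```python
from typing import Dict, List, Tuple

Bigram = Dict[Tuple[str, str], float]


def word_reliability(words: List[str], n: int, forward: Bigram, backward: Bigram,
                     mode: str = "forward", alpha: float = 0.5,
                     floor: float = 1e-6) -> float:
    """Reliability of words[n]; requires 0 < n < len(words) - 1.

    forward[(prev, w)] = P(w | prev), backward[(nxt, w)] = P_bw(w | nxt).
    Unseen bigrams fall back to `floor` (an assumption made for this sketch).
    """
    p_fw = forward.get((words[n - 1], words[n]), floor)
    p_bw = backward.get((words[n + 1], words[n]), floor)
    if mode == "forward":    # forward probability only, as in equation (3)
        return p_fw
    if mode == "max":        # larger of the two, as in equation (4)
        return max(p_fw, p_bw)
    if mode == "weighted":   # weighted sum, as in equation (5)
        return alpha * p_fw + (1.0 - alpha) * p_bw
    raise ValueError(f"unknown mode: {mode}")


# Recognition result with the misrecognized word "chikaku" at index 2.
words = ["<pause>", "nihaku", "chikaku", "shimasu", "<pause>"]
# Invented toy probabilities.
forward = {("nihaku", "chikaku"): 0.001, ("chikaku", "shimasu"): 0.002}
backward = {("shimasu", "chikaku"): 0.001, ("<pause>", "shimasu"): 0.6}

fw_only = word_reliability(words, 3, forward, backward, "forward")
by_max = word_reliability(words, 3, forward, backward, "max")
by_sum = word_reliability(words, 3, forward, backward, "weighted")
wrong_word = word_reliability(words, 2, forward, backward, "max")
```

With these toy numbers the correctly recognized "shimasu" scores low under the forward-only rule but is rescued by the backward probability, while the misrecognized "chikaku" stays low under either rule.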

[0038] The above description used word bigrams as the example for the forward statistical language model 14 for reliability calculation and the backward statistical language model 15 for reliability calculation, but the unit of the language model may instead be a part of speech or a word class in which words are grouped into several classes. The same effect can also be obtained with trigrams or other statistical language models.

[0039] For example, when a word-class bigram (word-class n-gram model) is used, the reliability may be computed by equation (6) or equation (7) below instead of equation (3).

[0040]

(Equation 6)

[0041]

(Equation 7)
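The images for equations (6) and (7) are not reproduced in this text. From paragraph [0042] (equation (7) differs from equation (6) only in not being multiplied by P(w_n | c_n)), they presumably read as follows. The left-hand symbol S^{(2)}_n is assumed to carry over from equation (3).

```latex
% Equation (6), reconstructed: class bigram times in-class word probability
S^{(2)}_n = P(c_n \mid c_{n-1}) \, P(w_n \mid c_n)

% Equation (7), reconstructed: class bigram only
S^{(2)}_n = P(c_n \mid c_{n-1})
```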

[0042] In equations (6) and (7), c_n is the class to which word w_n belongs, and c_{n-1} is the class to which word w_{n-1} belongs. P(c_n | c_{n-1}) is the conditional probability that the next class is c_n when the preceding class is c_{n-1}. P(w_n | c_n) is the occurrence probability of word w_n within class c_n. Equation (7) differs from equation (6) in that it is not multiplied by P(w_n | c_n). When many words belong to the word class c_n, the value of P(w_n | c_n) becomes small, and omitting it prevents the reliability from becoming low even though the class bigram probability P(c_n | c_{n-1}) is large. For example, suppose that, as in this example, the statistical language model for hotel reservations treats hotel names as one class and sets the occurrence probability P(w_n | c_n) of each hotel name within the class to be equal. If there are N hotels (N here denotes the number of hotels, not the length of the word string), then P(w_n | c_n), the occurrence probability of a hotel name within the hotel-name class, is 1/N. The reliability computed by equation (6) therefore decreases as N grows, whereas the reliability from equation (7), which is not multiplied by P(w_n | c_n), does not depend on N.
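The dependence on class size described in paragraph [0042] can be checked numerically with a short sketch. It assumes, as the text does, that every word in the class has equal probability 1/N; the function name and the class bigram value 0.4 are invented for illustration.

```python
def class_bigram_reliability(p_class_bigram: float, class_size: int,
                             include_word_prob: bool) -> float:
    """Word-class bigram reliability as described in the text.

    include_word_prob=True  : P(c_n | c_{n-1}) * P(w_n | c_n), as in equation (6)
    include_word_prob=False : P(c_n | c_{n-1}) only, as in equation (7)
    P(w_n | c_n) is taken to be 1 / class_size (equal probability per word).
    """
    if include_word_prob:
        return p_class_bigram * (1.0 / class_size)
    return p_class_bigram


p_hotel_class = 0.4  # invented value of P(c_n | c_{n-1}) for the hotel-name class

with_10 = class_bigram_reliability(p_hotel_class, 10, include_word_prob=True)
with_1000 = class_bigram_reliability(p_hotel_class, 1000, include_word_prob=True)
without_1000 = class_bigram_reliability(p_hotel_class, 1000, include_word_prob=False)
```

Growing the class from 10 to 1000 hotel names shrinks the first score by a factor of 100, while the score without the in-class term is unchanged.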

【００４３】また、単語クラスバイグラムを用いる場合
の（４）式の代りとしては（８）式あるいは（９）式に
よって信頼度を計算すればよい。（８）式及び（９）式
中でＰ_bw（ｃ_n｜ｃ_n+1）は単語クラスｃ_n+1の前が単語
クラスｃ_nである条件付き確率である。When a word class bigram is used, the reliability may be calculated by equation (8) or (9) instead of equation (4). In equations (8) and (9), P_bw(c_n | c_{n+1}) is the conditional probability that the word class preceding the word class c_{n+1} is c_n.

【００４４】[0044]

【数８】 (Equation 8)

【００４５】[0045]

【数９】 (Equation 9)

【００４６】同様に（５）式の代りとしては（１０）式
あるいは（１１）式によって信頼度を計算すればよい。Similarly, instead of equation (5), the reliability may be calculated by equation (10) or (11).

【００４７】[0047]

【数１０】 (Equation 10)

【００４８】[0048]

【数１１】 (Equation 11)

【００４９】以上のように、本実施の形態における音声
認識装置においては、認識対象の単語の生起確率を当該
単語に先行するｎ個の単語との条件付き確率とした信頼
度計算用前向き統計言語モデル１４と、認識対象の単語
の生起確率を当該単語に後続するｎ個の単語との条件付
き確率とした信頼度計算用後向き統計言語モデル１５と
を備え、連続音声認識の結果である単語列を構成する各
単語のそれぞれに対して、各単語が正認識であるか否か
の信頼度を、単語の前後関係の尤度に関する言語的な統
計量に基づいて算出するようにしたので、認識した単語
が正解単語と音響的に類似している場合でも、当該単語
が誤認識単語であれば、低い信頼度を与えることがで
き、音声認識の精度を高くすることができる。As described above, the speech recognition device according to the present embodiment includes the forward statistical language model 14 for reliability calculation, in which the occurrence probability of a word to be recognized is the conditional probability given the n words preceding it, and the backward statistical language model 15 for reliability calculation, in which the occurrence probability of a word to be recognized is the conditional probability given the n words following it. For each word in the word string resulting from continuous speech recognition, the reliability of whether the word is correctly recognized is calculated from a linguistic statistic on the likelihood of the word's context. Therefore, even when a recognized word is acoustically similar to the correct word, a low reliability can be given if it is a misrecognized word, and the accuracy of speech recognition can be improved.
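The per-word scoring summarized above can be sketched as follows. This is a minimal illustration, not the device's implementation: it assumes bigram models (a one-word context), raw probabilities as scores, hypothetical probability tables, hypothetical sentence boundary tokens "<s>" and "</s>", and a hypothetical floor value for word pairs that do not appear in a table.

```python
def word_reliabilities(words, forward, backward, floor=1e-6):
    """Return (word, forward score, backward score) for each recognized word.

    forward[(previous, current)]  stands for P(w_n | w_{n-1}).
    backward[(current, following)] stands for P_bw(w_n | w_{n+1}).
    Pairs missing from a table receive the floor value (an assumption).
    """
    padded = ["<s>"] + list(words) + ["</s>"]
    scores = []
    for i in range(1, len(padded) - 1):
        fwd = forward.get((padded[i - 1], padded[i]), floor)
        bwd = backward.get((padded[i], padded[i + 1]), floor)
        scores.append((padded[i], fwd, bwd))
    return scores


# Hypothetical bigram tables for a tiny hotel-reservation vocabulary.
forward = {("<s>", "reserve"): 0.3, ("reserve", "a"): 0.5, ("a", "room"): 0.4}
backward = {("reserve", "a"): 0.6, ("a", "room"): 0.9, ("room", "</s>"): 0.2}

# "loom" stands in for a misrecognition that sounds like "room".
result = word_reliabilities(["reserve", "a", "loom"], forward, backward)
for word, fwd, bwd in result:
    print(word, fwd, bwd)
```

In this toy case "loom" receives the floor value from both models because neither table contains it in that context, even though it is acoustically close to "room".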

【００５０】実施の形態２．図２は、本実施の形態によ
る音声認識装置の他の構成例を示すブロック図である。
図２において、上述の従来例または実施の形態１と同等
部分には同一番号を付してここではその説明を省略す
る。本実施の形態において新たに追加した部分は、言語
的信頼度計算手段１３によって算出された信頼度１６と
音響的信頼度計算手段１１によって算出された信頼度１
２との荷重和を最終的な信頼度１８として算出する信頼
度統合手段１７である。Embodiment 2. FIG. 2 is a block diagram showing another configuration example of the speech recognition device according to the present embodiment. In FIG. 2, parts equivalent to those of the conventional example or Embodiment 1 described above are given the same reference numerals, and their description is omitted here. The part newly added in the present embodiment is the reliability integration means 17, which calculates the weighted sum of the reliability 16 calculated by the linguistic reliability calculation means 13 and the reliability 12 calculated by the acoustic reliability calculation means 11 as the final reliability 18.

【００５１】実施の形態１と同様に本実施の形態では音
響モデル６として連続分布型のＨＭＭを用いる。音響モ
デル６は単語単位、すなわち、１個のモデルで１個の単
語をモデル化するものとする。したがって認識対象語彙
数と同数の音響モデルを用意する。１個のモデルは複数
個の状態で構成し、モデルのトポロジーはｌｅｆｔ−ｔ
ｏ−ｒｉｇｈｔ型とする。As in Embodiment 1, a continuous-distribution HMM is used as the acoustic model 6 in the present embodiment. The acoustic model 6 is word-based, that is, one model represents one word, so as many acoustic models are prepared as there are words in the recognition vocabulary. Each model consists of a plurality of states, and the model topology is left-to-right.

【００５２】音声認識用言語モデル７も、実施の形態１
と同様に統計言語モデルである単語バイグラムモデルを
用いることとする。認識対象は例えばホテル予約に関す
るユーザ発話とする。音声認識用言語モデル７は、あら
かじめホテル予約に関する大量のユーザ発話を書き起こ
したテキストデータを用いて学習しておくものとする。As in Embodiment 1, the speech recognition language model 7 is a word bigram model, which is a statistical language model. The recognition target is, for example, user utterances related to hotel reservations. The speech recognition language model 7 is trained in advance on text data transcribed from a large number of user utterances related to hotel reservations.

【００５３】本発明で新たに追加した信頼度計算用前向
き統計言語モデルとしては、本実施例では音声認識用言
語モデル７と同じモデルを用いることとする。In this embodiment, the same model as the speech recognition language model 7 is used as the forward statistical language model for reliability calculation newly added in the present invention.

【００５４】また、信頼度計算用後向き統計言語モデル
１５は、実施の形態１と同様に後向きバイグラムモデル
を用い、あらかじめホテル予約に関する大量のユーザ発
話を書き起こしたテキストデータを用いて学習しておく
ものとする。As in Embodiment 1, the backward statistical language model 15 for reliability calculation is a backward bigram model, trained in advance on text data transcribed from a large number of user utterances related to hotel reservations.
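Training a backward bigram model from transcribed text can be sketched as below. The source does not state the estimation method, so relative-frequency estimation without smoothing is an assumption, as are the function name and the three-sentence corpus; sentence boundary tokens are omitted for brevity.

```python
from collections import Counter


def train_backward_bigram(sentences):
    """Estimate P_bw(w_n | w_{n+1}) by relative frequency (no smoothing).

    Returns a dict keyed by (current word, following word).
    """
    pair_counts = Counter()
    following_counts = Counter()
    for words in sentences:
        for current, following in zip(words, words[1:]):
            pair_counts[(current, following)] += 1
            following_counts[following] += 1
    return {
        (current, following): count / following_counts[following]
        for (current, following), count in pair_counts.items()
    }


# Hypothetical transcribed utterances, already split into words.
corpus = [
    ["reserve", "a", "room"],
    ["reserve", "a", "table"],
    ["book", "a", "room"],
]
p_bw = train_backward_bigram(corpus)
print(p_bw[("reserve", "a")])  # probability that "reserve" comes before "a"
```

A practical model would also need smoothing for word pairs that never occur in the training text; that choice is outside what the source describes.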

【００５５】次に、本実施の形態における音声認識装置
の動作について説明する。音声信号の入力端１から音声
信号２を入力すると、分析手段３と連続音声手段５と言
語的信頼度計算手段１３は実施の形態１の音声認識装置
と同様の動作を行い、言語的信頼度計算手段１３は信頼
度を付与された認識結果１６として、単語列ｗ₁，ｗ₂，
…，ｗ_i，…，ｗ_Nと信頼度Ｓ⁽²⁾ ₁，Ｓ⁽²⁾ ₂，…，
Ｓ⁽²⁾ _i，…，Ｓ⁽²⁾ _Nを出力する。Next, the operation of the speech recognition device according to the present embodiment will be described. When the speech signal 2 is input from the speech signal input terminal 1, the analysis means 3, the continuous speech recognition means 5, and the linguistic reliability calculation means 13 operate in the same way as in the speech recognition device of Embodiment 1. The linguistic reliability calculation means 13 outputs, as the recognition result 16 with reliability attached, the word string w₁, w₂, …, w_i, …, w_N and the reliabilities S⁽²⁾₁, S⁽²⁾₂, …, S⁽²⁾_i, …, S⁽²⁾_N.

【００５６】また、分析手段３の出力である特徴ベクト
ルの時系列４と音響モデル６が参照尤度計算手段９に入
力され、参照尤度計算手段９と音響的信頼度計算手段１
１は従来技術と同様の動作をして、その結果、音響的信
頼度計算手段１１は、信頼度Ｓ⁽¹⁾ ₁，Ｓ⁽¹⁾ ₂，…，Ｓ
⁽¹⁾ _i，…，Ｓ⁽¹⁾ _Nを出力する。The time series 4 of feature vectors output by the analysis means 3 and the acoustic model 6 are input to the reference likelihood calculation means 9. The reference likelihood calculation means 9 and the acoustic reliability calculation means 11 operate in the same way as in the prior art, and as a result the acoustic reliability calculation means 11 outputs the reliabilities S⁽¹⁾₁, S⁽¹⁾₂, …, S⁽¹⁾_i, …, S⁽¹⁾_N.

【００５７】次に、信頼度統合手段１７は、言語的信頼
度計算手段１３から出力された認識結果１６である単語
列ｗ₁，ｗ₂，…，ｗ_i，…，ｗ_Nと信頼度Ｓ⁽²⁾ ₁，
Ｓ⁽²⁾ ₂，…，Ｓ⁽²⁾ _i，…，Ｓ⁽²⁾ _Nと、音響的信頼度計算
手段１１の出力である信頼度Ｓ⁽¹⁾ ₁，Ｓ⁽¹⁾ ₂，…，Ｓ
⁽¹⁾ _i，…，Ｓ⁽¹⁾ _Nを入力として、以下の（１２）式にし
たがって統合信頼度Ｓ⁽¹⁰⁾ ₁，Ｓ⁽¹⁰⁾ ₂，…，Ｓ⁽¹⁰⁾ _i，
…，Ｓ⁽¹⁰⁾ _Nを計算する。（１２）式のβは事前に設定
する定数であり例えばβ＝０．５である。Next, the reliability integration means 17 takes as input the recognition result 16 output from the linguistic reliability calculation means 13, namely the word string w₁, w₂, …, w_i, …, w_N and the reliabilities S⁽²⁾₁, S⁽²⁾₂, …, S⁽²⁾_i, …, S⁽²⁾_N, together with the reliabilities S⁽¹⁾₁, S⁽¹⁾₂, …, S⁽¹⁾_i, …, S⁽¹⁾_N output from the acoustic reliability calculation means 11, and calculates the integrated reliabilities S⁽¹⁰⁾₁, S⁽¹⁰⁾₂, …, S⁽¹⁰⁾_i, …, S⁽¹⁰⁾_N according to the following equation (12). In equation (12), β is a constant set in advance, for example β = 0.5.

【００５８】[0058]

【数１２】 (Equation 12)
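The image for equation (12) is not reproduced in this text. The description states only that the integrated reliability is a weighted sum of the two reliabilities with a preset constant β. A minimal sketch under the assumption that the weights are β and 1 − β, with β applied to the acoustic reliability S⁽¹⁾ (which term β multiplies is a guess, and the function name and sample values are hypothetical):

```python
def integrate_reliability(acoustic, linguistic, beta=0.5):
    """Combine per-word acoustic and linguistic reliabilities.

    Assumed form: beta * S1_i + (1 - beta) * S2_i for each word i.
    """
    if len(acoustic) != len(linguistic):
        raise ValueError("both inputs need one reliability per word")
    return [
        beta * s1 + (1.0 - beta) * s2
        for s1, s2 in zip(acoustic, linguistic)
    ]


# Hypothetical reliabilities for a two-word recognition result.
acoustic_scores = [0.8, 0.2]
linguistic_scores = [0.4, 0.6]
integrated = integrate_reliability(acoustic_scores, linguistic_scores, beta=0.5)
print(integrated)
```

With β = 0.5 the two sources contribute equally; moving β toward 0 or 1 lets one source dominate.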

【００５９】そして信頼度統合手段１７は信頼度を付与
された認識結果１８として単語列ｗ ₁，ｗ₂，…，ｗ_i，
…，ｗ_Nと統合信頼度Ｓ⁽¹⁰⁾ ₁，Ｓ⁽¹⁰⁾ ₂，…，Ｓ⁽¹⁰⁾ _i，
…，Ｓ⁽¹⁰⁾ _Nを出力する。The reliability integration means 17 then outputs, as the recognition result 18 with reliability attached, the word string w₁, w₂, …, w_i, …, w_N and the integrated reliabilities S⁽¹⁰⁾₁, S⁽¹⁰⁾₂, …, S⁽¹⁰⁾_i, …, S⁽¹⁰⁾_N.

【００６０】以上のように、本実施の形態においては、
上述の実施の形態１と同様の効果が得られるとともに、
さらに、信頼度統合手段１７を設けて、音響尤度に基づ
く信頼度１２と言語的な信頼度１６とを統合するように
したので、音響と言語の両面から信頼度を考慮すること
が可能になり、より高精度な信頼度を得ることができ
る。As described above, the present embodiment provides the same effects as Embodiment 1. In addition, because the reliability integration means 17 integrates the reliability 12 based on the acoustic likelihood with the linguistic reliability 16, the reliability reflects both acoustic and linguistic evidence, and a more accurate reliability can be obtained.

【００６１】なお、本例で言語的な信頼度として（３）
式で計算されるＳ⁽²⁾ ₁，Ｓ⁽²⁾ ₂，…，Ｓ⁽²⁾ _i，…，Ｓ
⁽²⁾ _Nとを用いたが、（４）式または（８）式または
（９）式で計算されるＳ⁽³⁾ ₁，Ｓ⁽³⁾ ₂，…，Ｓ⁽³⁾ _i，
…，Ｓ⁽³⁾ _Nあるいは（５）式または（１０）式または
（１１）式で計算されるＳ⁽⁴⁾ ₁，Ｓ⁽⁴⁾ ₂，…，Ｓ⁽⁴⁾ _i，
…，Ｓ⁽⁴ ⁾ _Nを用いてもかまわない。また、音響尤度に基
づく信頼度は従来技術と同様の方法によって求めた値を
用いたが他の方法によって求めた値を用いてかまわな
い。In this example, S⁽²⁾₁, S⁽²⁾₂, …, S⁽²⁾_i, …, S⁽²⁾_N calculated by equation (3) are used as the linguistic reliability, but S⁽³⁾₁, S⁽³⁾₂, …, S⁽³⁾_i, …, S⁽³⁾_N calculated by equation (4), (8), or (9), or S⁽⁴⁾₁, S⁽⁴⁾₂, …, S⁽⁴⁾_i, …, S⁽⁴⁾_N calculated by equation (5), (10), or (11), may be used instead. Likewise, the reliability based on the acoustic likelihood is obtained here by the same method as in the prior art, but a value obtained by another method may be used.

【００６２】[0062]

【発明の効果】この発明は、入力された音声の連続音声
認識を行い、認識結果として当該入力された音声に対応
する単語列を出力する連続音声認識手段と、所定単語の
前後に言語的に続き得る各単語の出現確率を与える１種
類以上の信頼度計算用統計言語モデルを格納している信
頼度計算用統計言語モデル格納手段と、前記認識結果の
単語列を構成している各単語のそれぞれに対して、前記
信頼度計算用統計言語モデルを用いて、前記各単語が正
認識であるか否かの信頼度を算出する言語的信頼度計算
手段とを備えた音声認識装置であるので、正解単語と音
響的に類似している場合でも誤認識単語であれば低い信
頼度を与えることを可能にし、音声認識の精度を向上さ
せることができる。According to the present invention, the speech recognition device includes: continuous speech recognition means for performing continuous speech recognition of input speech and outputting a word string corresponding to the input speech as a recognition result; statistical language model storage means for reliability calculation, which stores one or more statistical language models for reliability calculation that give the occurrence probability of each word that can linguistically precede or follow a given word; and linguistic reliability calculation means for calculating, for each word constituting the word string of the recognition result, the reliability of whether the word is correctly recognized, using the statistical language model for reliability calculation. Therefore, even when a word is acoustically similar to the correct word, a low reliability can be given if it is a misrecognized word, and the accuracy of speech recognition can be improved.

【００６３】また、前記認識結果の単語列を構成してい
る各単語のそれぞれに対して、前記各単語が正認識であ
るか否かの信頼度を音響尤度に基づいて算出する音響的
信頼度計算手段と、前記言語的信頼度計算手段によって
算出された信頼度と前記音響的信頼度計算手段によって
算出された信頼度の両者の値を用いて、統合信頼度を算
出する信頼度統合手段とをさらに備えているので、音響
と言語の両面から信頼度を考慮することが可能になり、
より高精度な信頼度を得ることができる。The device further includes acoustic reliability calculation means for calculating, for each word constituting the word string of the recognition result, the reliability of whether the word is correctly recognized based on the acoustic likelihood, and reliability integration means for calculating an integrated reliability using both the reliability calculated by the linguistic reliability calculation means and the reliability calculated by the acoustic reliability calculation means. The reliability therefore reflects both acoustic and linguistic evidence, and a more accurate reliability can be obtained.

【００６４】また、前記信頼度計算用統計言語モデルと
して単語ｎ−ｇｒａｍモデルを用いるようにしたので、
単語に先行するまたは後続する単語との条件付き確率を
元に、認識結果の単語列を構成する各単語の信頼度を求
めることができる。Since a word n-gram model is used as the statistical language model for reliability calculation, the reliability of each word constituting the word string of the recognition result can be obtained from the conditional probability of the word given the words that precede or follow it.

【００６５】また、前記信頼度計算用統計言語モデルと
して単語を幾つかのクラスに分類してまとめた単語クラ
スｎ−ｇｒａｍモデルを用いるようにしたので、単語に
先行するまたは後続する単語との条件付き確率を元に、
認識結果の単語列を構成する各単語の信頼度を求めるこ
とができる。Since a word class n-gram model, in which words are classified and grouped into several classes, is used as the statistical language model for reliability calculation, the reliability of each word constituting the word string of the recognition result can be obtained from the conditional probability of the word given the words that precede or follow it.

【００６６】また、１種類以上の信頼度計算用統計言語
モデルとして、当該単語と先行の所定個の単語との条件
付き確率モデルである信頼度計算用前向き統計言語モデ
ルと、当該単語と後続の所定個の単語との条件付き確率
モデルである信頼度計算用後向き統計言語モデルとを備
え、前記言語的信頼度計算手段が、前記連続音声認識の
認識結果の単語列を構成する各単語のそれぞれに対して
前記信頼度計算用前向き統計言語モデルを用いて前記各
単語が正認識であるか否かの第一の信頼度を算出し、前
記信頼度計算用後向き統計言語モデルを用いて前記各単
語が正認識であるか否かの第二の信頼度を算出するよう
にしたので、さらに高精度に信頼度を計算することがで
きる。The one or more statistical language models for reliability calculation include a forward statistical language model for reliability calculation, which is a conditional probability model of a word given a predetermined number of preceding words, and a backward statistical language model for reliability calculation, which is a conditional probability model of the word given a predetermined number of following words. For each word constituting the word string of the recognition result of the continuous speech recognition, the linguistic reliability calculation means calculates a first reliability of whether the word is correctly recognized using the forward model, and a second reliability of whether the word is correctly recognized using the backward model. The reliability can therefore be calculated with still higher accuracy.

【００６７】また、前記言語的信頼度計算手段が、前記
第一の信頼度と前記第二の信頼度のうち、大きい方の値
を当該単語の信頼度として出力するようにしたので、さ
らに高精度に信頼度を計算することができる。Since the linguistic reliability calculation means outputs the larger of the first reliability and the second reliability as the reliability of the word, the reliability can be calculated with still higher accuracy.

【００６８】また、前記言語的信頼度計算手段が、前記
第一の信頼度と前記第二の信頼度との荷重和を当該単語
の信頼度として出力するようにしたので、さらに高精度
に信頼度を計算することができる。Since the linguistic reliability calculation means outputs the weighted sum of the first reliability and the second reliability as the reliability of the word, the reliability can be calculated with still higher accuracy.
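The two ways of combining the first (forward) and second (backward) reliabilities described above can be sketched as follows. This is a minimal illustration: the function names, the weight name alpha, the assumption that the weights are alpha and 1 − alpha, and the sample values are all hypothetical, since the corresponding equation images are not reproduced in this text.

```python
def combine_max(first, second):
    """Take the larger of the forward and backward reliability per word."""
    return [max(f, s) for f, s in zip(first, second)]


def combine_weighted(first, second, alpha=0.5):
    """Weighted sum per word; the alpha / (1 - alpha) split is an assumption."""
    return [alpha * f + (1.0 - alpha) * s for f, s in zip(first, second)]


# Hypothetical per-word reliabilities from the forward and backward models.
first_reliability = [0.1, 0.7]
second_reliability = [0.6, 0.2]

by_max = combine_max(first_reliability, second_reliability)
by_sum = combine_weighted(first_reliability, second_reliability, alpha=0.5)
print(by_max, by_sum)
```

Taking the maximum keeps a word's score high when either context supports it, while the weighted sum lowers the score when one context is poor.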

[Brief description of the drawings]

【図１】本発明の実施の形態１における音声認識装置
の構成を示した構成図である。FIG. 1 is a configuration diagram showing a configuration of a speech recognition device according to Embodiment 1 of the present invention.

【図２】本発明の実施の形態２における音声認識装置
の構成を示した構成図である。FIG. 2 is a configuration diagram showing a configuration of a speech recognition device according to a second embodiment of the present invention.

【図３】従来の音声認識装置の構成を示した構成図で
ある。FIG. 3 is a configuration diagram showing a configuration of a conventional voice recognition device.

[Explanation of symbols]

１音声信号の入力端、２入力音声信号、３分析手
段、４入力音声信号の特徴ベクトルの時系列、５連
続音声認識手段、６音響モデル、７音声認識用言語
モデル、８音声認識結果、９参照尤度計算手段、１
０参照尤度、１１音響的信頼度計算手段、１２認
識結果、１３言語的信頼度計算手段、１４信頼度計
算用前向き統計言語モデル、１５信頼度計算用後向き
統計言語モデル、１６，１８信頼度、１７信頼度統
合手段。1 speech signal input terminal, 2 input speech signal, 3 analysis means, 4 time series of feature vectors of the input speech signal, 5 continuous speech recognition means, 6 acoustic model, 7 language model for speech recognition, 8 speech recognition result, 9 reference likelihood calculation means, 10 reference likelihood, 11 acoustic reliability calculation means, 12 recognition result, 13 linguistic reliability calculation means, 14 forward statistical language model for reliability calculation, 15 backward statistical language model for reliability calculation, 16, 18 reliability, 17 reliability integration means.

Claims

[Claims]

1. A speech recognition device comprising: continuous speech recognition means for performing continuous speech recognition of input speech and outputting a word string corresponding to the input speech as a recognition result; statistical language model storage means for reliability calculation, storing one or more statistical language models for reliability calculation that give the occurrence probability of each word that can linguistically precede or follow a given word; and linguistic reliability calculation means for calculating, for each word constituting the word string of the recognition result, a reliability of whether the word is correctly recognized, using the statistical language model for reliability calculation.

2. The speech recognition device according to claim 1, further comprising: acoustic reliability calculation means for calculating, for each word constituting the word string of the recognition result, a reliability of whether the word is correctly recognized based on an acoustic likelihood; and reliability integration means for calculating an integrated reliability using both the reliability calculated by the linguistic reliability calculation means and the reliability calculated by the acoustic reliability calculation means.

3. The speech recognition device according to claim 1, wherein a word n-gram model is used as the reliability calculation statistical language model.

4. The speech recognition device according to claim 1 or 2, wherein a word class n-gram model, in which words are classified and grouped into several classes, is used as the statistical language model for reliability calculation.

5. The speech recognition device according to claim 1, wherein the one or more statistical language models for reliability calculation comprise a forward statistical language model for reliability calculation, which is a conditional probability model of a word given a predetermined number of preceding words, and a backward statistical language model for reliability calculation, which is a conditional probability model of the word given a predetermined number of following words, and wherein the linguistic reliability calculation means calculates, for each word constituting the word string of the recognition result of the continuous speech recognition, a first reliability of whether the word is correctly recognized using the forward statistical language model for reliability calculation, and a second reliability of whether the word is correctly recognized using the backward statistical language model for reliability calculation.

6. The speech recognition device according to claim 5, wherein the linguistic reliability calculation means outputs the larger of the first reliability and the second reliability as the reliability of the word.

7. The speech recognition device according to claim 1, wherein the linguistic reliability calculation means outputs a weighted sum of the first reliability and the second reliability as the reliability of the word.