JP4478925B2 - Speech recognition result reliability verification apparatus, computer program, and computer

Info

Publication number: JP4478925B2
Application number: JP2003401724A
Authority: JP
Inventors: フランク・スーン; ロー・ウェイキット; 哲中村
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2003-12-01
Filing date: 2003-12-01
Publication date: 2010-06-09
Anticipated expiration: 2023-12-01
Also published as: JP2005164837A

Description

The present invention relates to a technique for deciding whether to accept or reject a speech recognition result, and more particularly to a technique that makes this decision using a confidence measure for the words in the recognition result.

State-of-the-art speech recognition technology can potentially be applied in a very wide range of fields, and in some of them it has already produced good results. Speech recognition is, however, not yet fully reliable. There is therefore a need for some measure that makes it possible to monitor recognition results and easily decide whether to accept or reject them. This need is expected to grow as speech recognition is applied to new and more difficult fields.

Such a measure must be easy to compute and statistically meaningful. One measure that has been proposed and well studied is the word posterior probability (WPP). Non-Patent Documents 1, 2, and 3 describe studies in which word posterior probabilities are applied to word lattices/graphs or N-best lists of speech recognition results.

Non-Patent Document 1: Stefan Ortmanns, Hermann Ney, and Xavier Aubert, "A Word Graph Algorithm for Large Vocabulary Continuous Speech Recognition," Computer Speech and Language, vol. 11, no. 1, pp. 43-72, January 1997.
Non-Patent Document 2: Richard Schwartz and Yen-Lu Chow, "The N-best Algorithm: An Efficient and Exact Procedure for Finding the N Most Likely Sentence Hypotheses," in Proceedings of ICASSP 1990, 1990, vol. 1, pp. 81-94.
Non-Patent Document 3: Frank K. Soong and Eng-Fong Huang, "A Tree-Trellis Based Fast Search for Finding the N-Best Sentence Hypotheses in Continuous Speech Recognition," in Proceedings of ICASSP 1991, 1991, vol. 1, pp. 705-708.

However, Non-Patent Documents 1 to 3 do not specifically describe how word posterior probabilities should be used so that acceptance or rejection of a speech recognition result can be decided easily and reliably. In application fields where recognition accuracy is not high, the accept/reject decision is made under severe constraints and is not easy. A technique that can decide acceptance or rejection of speech recognition results easily and reliably is therefore needed, particularly in new application fields of speech recognition.

Even in technical fields where speech recognition has long been used, a variety of processes could be executed without human intervention if acceptance or rejection of recognition results could be decided automatically. For example, the recognition environment could be adjusted automatically based on the recognition confidence, a computer could perform some form of automatic training, or the technique could be used for pronunciation practice in language learning.

Accordingly, an object of the present invention is to provide a speech recognition result verification apparatus that can decide acceptance or rejection of speech recognition results easily and reliably.

Another object of the present invention is to provide a speech recognition result verification apparatus that can decide acceptance or rejection easily and reliably for each word of the word string that constitutes the speech recognition result.

A speech recognition result reliability verification apparatus according to a first aspect of the present invention receives a speech recognition result output from a speech recognition decoder, the result representing a plurality of hypothesis word strings each consisting of words to which word posterior probabilities are attached, and verifies the reliability of the speech recognition result based on the word posterior probabilities. The apparatus includes: generalized word posterior probability calculation means for calculating, for each word included in the speech recognition result, a generalized word posterior probability based on the word posterior probabilities of the words included in the speech recognition result; update means for updating the word posterior probability of each word included in the speech recognition result with the generalized word posterior probability calculated by the generalized word posterior probability calculation means; search means for searching, based on the speech recognition result whose word posterior probabilities have been updated by the update means, the plurality of hypothesis word strings for the one in which the sum of the word posterior probabilities of its words is largest; and determination means for verifying the reliability of the speech recognition result by determining whether the sum of the word posterior probabilities of the hypothesis word string found by the search means satisfies a predetermined condition.

Preferably, the apparatus further includes means for selecting, prior to the calculation of the generalized word posterior probabilities by the generalized word posterior probability calculation means, only those word strings in the speech recognition result whose likelihood is higher than a threshold determined by a predetermined criterion, and for supplying them to the generalized word posterior probability calculation means.

More preferably, each word included in the speech recognition result is further accompanied by information specifying its time period within the utterance input to the speech recognition decoder. The generalized word posterior probability calculation means includes: word search means for searching the speech recognition result, for each word included in it, for words that match the word and whose time periods overlap the time period of the word; and means for calculating the generalized word posterior probability of each word based on the sum of the word posterior probabilities of the words found by the word search means and the sum of the word posterior probabilities of all words included in the speech recognition result.

Still more preferably, the means for calculating the generalized word posterior probability includes means for calculating the generalized word posterior probability of each word as the ratio of the sum of the word posterior probabilities of the words found by the word search means to the sum of the word posterior probabilities of all words included in the speech recognition result.

Preferably, the generalized word posterior probability $p([w; s, t] \mid x_1^T)$ of a word $w$ in a hypothesis word string (where $s$ and $t$ are the start time and end time of the time period of the word $w$) is given by the following formula

$$
p([w; s, t] \mid x_1^T) = \sum_{\substack{\forall M,\ [w; s, t]_1^M \\ \exists n,\ 1 \le n \le M \\ w = w_n,\ (s_n, t_n) \cap (s, t) \ne \emptyset}} \frac{\prod_{m=1}^{M} p^{\alpha}(x_{s_m}^{t_m} \mid w_m)\, p^{\beta}(w_m \mid w_1^M)}{p(x_1^T)}
$$

where $x_1^T = x_1, \ldots, x_T$ is the observed speech sequence, $M$ is the number of words in a hypothesis of the speech recognition result, $s_n$ and $t_n$ are the start time and end time of the $n$-th word $w_n$ that matches the word $w$, $p(x_{s_m}^{t_m} \mid w_m)$ is the acoustic likelihood, $p(w_m \mid w_1^M)$ is the language likelihood, $p(x_1^T)$ is the acoustic observation likelihood, and $\alpha$ and $\beta$ are predetermined constants.

A computer program according to a second aspect of the present invention, when executed by a computer, causes the computer to operate so as to implement each of the means of any of the speech recognition result reliability verification apparatuses described above.

A computer according to a third aspect of the present invention is programmed with the computer program described above.

[Introduction]
In the present embodiment, the problem of deciding whether to accept or reject each word recognized by continuous speech recognition is solved by introducing the idea of locating a focused word. Words other than the focused word (non-focused words) are not distinguished from one another; each is treated simply as occupying its position, and the posterior probability of the focused word is calculated on that basis.

Adopting this dichotomy between the focused word and the non-focused words avoids the need for complex processing such as string alignment based on dynamic programming.

First, the following concepts are introduced and explained: (1) reduction of the search space of hypothesis (candidate) word strings used to locate the focused word in the word lattice or N-best list of the speech recognition result; (2) relaxation of the time constraints when grouping the posterior probabilities of multiple occurrences of a candidate word; and (3) appropriate weighting of the contributions of the acoustic model and the language model.

-Posterior probabilities of word strings and words-
In a speech recognizer based on hidden Markov models (HMMs), the optimal word sequence $w_1^{M*} = w_1^*, \ldots, w_M^*$ for given acoustic observation data $x_1^T = x_1, \ldots, x_T$ is found by searching the space of all possible word sequences for the one that gives the maximum a posteriori (MAP) probability, as shown below.

$$
w_1^{M*} = \arg\max_{M,\, w_1^M} p(w_1^M \mid x_1^T) \tag{1}
$$

$$
\phantom{w_1^{M*}} = \arg\max_{M,\, w_1^M} \frac{p(x_1^T \mid w_1^M)\, p(w_1^M)}{p(x_1^T)} \tag{2}
$$

Here, $p(x_1^T \mid w_1^M)$ is the acoustic model probability, $p(w_1^M)$ is the language model probability, and $p(x_1^T)$ is the probability of the acoustic observation.

Because of mismatches between training and test conditions, speakers, noise, and so on, even the "optimal" word sequence may contain errors. A confidence measure that is mathematically tractable and statistically sound should therefore be adopted.

The word string posterior probability $p(w_1^M \mid x_1^T)$ measures how likely the recognized word string $w_1^M$ is given the observed acoustics $x_1^T$. It is computed by assuming the corresponding time segmentation

$$
[w; s, t]_1^M = [w_1; s_1, t_1], \ldots, [w_M; s_M, t_M]
$$

where $s$ and $t$ denote the start time and end time of a word $w$, with $s_1 = 1$, $t_M = T$, and $t_m + 1 = s_{m+1}$ for $1 \le m \le M - 1$.

Using this, equation (2) can be rewritten as follows.

$$
p([w; s, t]_1^M \mid x_1^T) = \frac{\prod_{m=1}^{M} p(x_{s_m}^{t_m} \mid w_m)\, p(w_m \mid w_1^M)}{p(x_1^T)}
$$

To measure the reliability of the recognized word string as a whole, it is natural to use this word string posterior probability $p(w_1^M \mid x_1^T)$.

An appropriate confidence measure for an individual word is the word posterior probability $p([w_m; s_m, t_m] \mid x_1^T)$. It is calculated by summing the posterior probabilities of all word strings that contain that specific word. Writing the word in question as $[w; s, t]$:

$$
p([w; s, t] \mid x_1^T) = \sum_{\substack{\forall M,\ [w; s, t]_1^M \\ \exists n,\ 1 \le n \le M \\ w = w_n,\ s = s_n,\ t = t_n}} \frac{\prod_{m=1}^{M} p(x_{s_m}^{t_m} \mid w_m)\, p(w_m \mid w_1^M)}{p(x_1^T)} \tag{7}
$$

Several further problems must be solved before this word posterior probability can serve as a practically useful confidence measure.

[Modification of the word posterior probability]
-Number of hypotheses to consider-
In a large-vocabulary continuous speech recognition (LVCSR) system, the search space of possible word strings is enormous. The posterior probabilities of the individual word strings differ greatly, however, so word strings with relatively low likelihood can safely be pruned. A word lattice/graph or an N-best list of word strings can then be obtained from the resulting subset of word string hypotheses alone. The embodiment described below uses a word lattice/graph obtained from such a subset.
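As a minimal sketch of this pruning step, the following Python keeps only those hypotheses whose log-likelihood lies within a fixed margin of the best one. The function name `prune_hypotheses`, the `(log_likelihood, words)` tuple layout, and the beam-style criterion are assumptions made for illustration; the text says only that word strings with relatively low likelihood may be pruned.

```python
def prune_hypotheses(hypotheses, beam):
    """Keep hypotheses whose log-likelihood is within `beam` of the best one.

    hypotheses: non-empty list of (log_likelihood, words) tuples (assumed layout).
    beam: non-negative margin in the log domain (assumed pruning criterion).
    """
    best = max(log_lik for log_lik, _ in hypotheses)
    return [(log_lik, words) for log_lik, words in hypotheses
            if log_lik >= best - beam]


hyps = [(-10.0, ["a"]), (-12.0, ["b"]), (-30.0, ["c"])]
kept = prune_hypotheses(hyps, beam=5.0)
print([words for _, words in kept])
```

A relative margin is used here rather than an absolute threshold so that the same setting behaves similarly for short and long utterances; other criteria would fit the description equally well.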

-Temporal registration of words within hypotheses-
The temporal registration of a word is denoted by $[w; s, t]$. Even when the same word appears in different hypotheses, its position may differ somewhat from hypothesis to hypothesis. Since the ultimate goal of automatic speech recognition (ASR) is to recognize the word content of an utterance, the strict time constraint is relaxed to some extent. Specifically, words are searched for that match the reference word and whose period of occurrence in some word string overlaps the period $[s, t]$ of the reference word, and these words are included in the calculation of the posterior probability of the reference word. As a result, equation (7) is rewritten as follows.

$$
p([w; s, t] \mid x_1^T) = \sum_{\substack{\forall M,\ [w; s, t]_1^M \\ \exists n,\ 1 \le n \le M \\ w = w_n,\ (s_n, t_n) \cap (s, t) \ne \emptyset}} \frac{\prod_{m=1}^{M} p(x_{s_m}^{t_m} \mid w_m)\, p(w_m \mid w_1^M)}{p(x_1^T)} \tag{8}
$$

-Relative weighting of acoustic and language likelihoods-
In the present embodiment, the acoustic likelihood and the language likelihood are weighted exponentially by weights $\alpha$ and $\beta$, respectively. Applying this to equation (8) gives the following equation.

$$
p([w; s, t] \mid x_1^T) = \sum_{\substack{\forall M,\ [w; s, t]_1^M \\ \exists n,\ 1 \le n \le M \\ w = w_n,\ (s_n, t_n) \cap (s, t) \ne \emptyset}} \frac{\prod_{m=1}^{M} p^{\alpha}(x_{s_m}^{t_m} \mid w_m)\, p^{\beta}(w_m \mid w_1^M)}{p(x_1^T)} \tag{9}
$$
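The exponential weighting is easiest to apply in the log domain, where the exponents become multipliers. The sketch below is an illustration under stated assumptions, not the embodiment's implementation: it scores each hypothesis as alpha times its summed acoustic log-likelihood plus beta times its summed language log-probability, then normalizes over the available hypotheses. The function name `hypothesis_posteriors` and the input layout are hypothetical, and replacing the normalizing term by a sum over the retained hypotheses is an assumption.

```python
import math


def hypothesis_posteriors(hypotheses, alpha, beta):
    """Normalized weights of hypotheses under exponential weighting.

    hypotheses: list of (acoustic_logliks, lm_logprobs), each a list with one
    value per word of the hypothesis (assumed layout).
    """
    scores = [alpha * sum(ac) + beta * sum(lm) for ac, lm in hypotheses]
    top = max(scores)                      # subtract the maximum for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


toy = [([-100.0, -50.0], [-2.0, -1.0]),
       ([-120.0, -40.0], [-2.5, -1.5])]
print(hypothesis_posteriors(toy, alpha=0.05, beta=0.3))
print(hypothesis_posteriors(toy, alpha=0, beta=0))   # uniform weights
```

Small values of alpha flatten the very large dynamic range of acoustic log-likelihoods, so that more than one hypothesis keeps a non-negligible weight.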

[Extraction of the focused word]
The acceptance or rejection of a focused word extracted by the word extraction scheme of the present embodiment is now considered. FIG. 1 shows an example of a word lattice/graph obtained as a speech recognition result and used in the present embodiment, and FIG. 2 shows a schematic example of an N-best list of word strings (the list of the N most likely hypothesis word strings) likewise obtained as a speech recognition result.

Referring to FIG. 1, the word lattice/graph 20 used in the present embodiment differs from a conventional one in that words other than the focused word (denoted "w") carry no individual word labels; they are all simply labeled "*".

For each occurrence of the word w, the word posterior probability can be computed efficiently using the forward-backward algorithm. The likelihoods of all paths passing through this particular word w (for example, words 30, 32, and 34) are then summed, and the sum is normalized by dividing it by the sum of the likelihoods of all paths in the word lattice/graph. This yields a generalized form of the word posterior probability (hereinafter the "generalized word posterior probability"). In doing so, the condition on the temporal registration of the word (exact agreement of word start and end times) is relaxed: the periods of the word w on the individual paths need not coincide exactly, and the posterior probabilities of occurrences that overlap in time are summed.
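The forward-backward computation mentioned above can be sketched for a small lattice as follows. This is a generic textbook formulation rather than the embodiment's code: the lattice is assumed to be a directed acyclic graph whose nodes are numbered in topological order, every node is assumed to lie on some path from the start to the end, and the names `logsumexp` and `edge_posteriors` are hypothetical. Each edge carries one word with a combined log score.

```python
import math
from collections import defaultdict


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))


def edge_posteriors(num_nodes, edges):
    """Posterior probability of each lattice edge.

    edges: list of (src, dst, log_score) with src < dst, so that node order
    is topological; node 0 is the start, node num_nodes - 1 is the end.
    """
    incoming, outgoing = defaultdict(list), defaultdict(list)
    for src, dst, score in edges:
        incoming[dst].append((src, score))
        outgoing[src].append((dst, score))

    fwd = [-math.inf] * num_nodes      # log-sum over partial paths start -> node
    fwd[0] = 0.0
    for node in range(1, num_nodes):
        if incoming[node]:
            fwd[node] = logsumexp([fwd[s] + sc for s, sc in incoming[node]])

    bwd = [-math.inf] * num_nodes      # log-sum over partial paths node -> end
    bwd[-1] = 0.0
    for node in range(num_nodes - 2, -1, -1):
        if outgoing[node]:
            bwd[node] = logsumexp([sc + bwd[d] for d, sc in outgoing[node]])

    total = fwd[-1]                    # log-sum over all complete paths
    return [math.exp(fwd[s] + sc + bwd[d] - total) for s, d, sc in edges]


lattice = [(0, 1, math.log(0.6)), (0, 2, math.log(0.4)), (1, 3, 0.0), (2, 3, 0.0)]
print(edge_posteriors(4, lattice))
```

The value returned for an edge is the fraction of total path likelihood that passes through it, which is the quantity the text normalizes by the sum over all paths.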

The generalized word posterior probability of the word w can be calculated in the same way from an N-best list such as the one shown in FIG. 2. Here too, words other than the focused word 70 (word w) are marked "*", and their identity is disregarded. Consider the case where hypotheses 50, ..., 62 exist as shown in FIG. 2. Words 72, 74, 76, and 78 in hypotheses 54, 56, 58, and 60 are occurrences of the word w that overlap word 70 in time. Word 80 in hypothesis 62 does not overlap the period of word 70 and is therefore not used in calculating the word posterior probability in this case.

As described above, the likelihoods of hypotheses 50, 54, 56, 58, and 60, in which word 70 and the identical, temporally overlapping words 72, 74, 76, and 78 appear, are summed, and the sum is normalized by dividing it by the sum of the likelihoods of all hypotheses in the N-best list. This gives the generalized word posterior probability of the word. Here too, the temporal registration constraint is relaxed.
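The N-best computation just described can be sketched in a few lines of Python. The function name `gwpp_nbest`, the `(label, start, end)` word tuples with inclusive frame indices, and the per-hypothesis weights are assumptions for illustration; the weights stand in for the hypothesis likelihoods, however they were obtained.

```python
def gwpp_nbest(target, nbest):
    """Generalized word posterior probability of `target` from an N-best list.

    target: (label, start, end) of the focused word.
    nbest: list of (weight, words); weight is the hypothesis likelihood and
    words is a list of (label, start, end) tuples (assumed layout).
    """
    label, s, t = target
    total = sum(weight for weight, _ in nbest)
    matched = sum(
        weight for weight, words in nbest
        # same label and overlapping (not necessarily identical) time period
        if any(l == label and ws <= t and s <= we for l, ws, we in words)
    )
    return matched / total


nbest = [
    (0.4, [("*", 0, 9), ("w", 10, 20)]),    # contains the focused word itself
    (0.3, [("*", 0, 11), ("w", 12, 22)]),   # same word, overlapping period
    (0.2, [("*", 0, 20), ("w", 21, 30)]),   # same word, no overlap
    (0.1, [("*", 0, 30)]),                  # word absent
]
print(gwpp_nbest(("w", 10, 20), nbest))
```

Each hypothesis is counted at most once even if it contains several matching occurrences, which follows the description of summing the likelihoods of hypotheses rather than of individual word occurrences. No alignment between hypotheses is needed.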

Note that when the focused word is extracted and its generalized word posterior probability is calculated as described above, no word alignment is required. Nor is it necessary to align the hypotheses by dynamic programming.

[Apparatus configuration according to the present embodiment]
FIG. 3 is a block diagram of a speech machine translation apparatus 80 that includes a hypothesis verification device 94 according to the present embodiment. Referring to FIG. 3, the speech machine translation apparatus 80 includes: an ASR decoder 90 that performs speech recognition on input speech s(t) 100 and outputs the recognition result as a word graph 104 such as the word lattice/graph 20 shown in FIG. 1; the hypothesis verification device 94, which verifies the hypotheses using the generalized word posterior probability as described above for each word in the word graph 104 output from the ASR decoder 90, and outputs the resulting most likely word string 106 together with the generalized word posterior probabilities; and a machine translation device 92 that receives the word string 106 output from the hypothesis verification device 94, performs machine translation on it, and outputs a translation result 110.

The speech machine translation apparatus 80 further includes an acceptance/rejection determination device 96. This device receives the word string 106 output from the hypothesis verification device 94, decides on the basis of the word posterior probabilities attached to the words of the word string whether to accept or reject the word string 106 as the recognition result, feeds the decision back to the user through a user interface (hereinafter "user I/F") 82, and controls the machine translation device 92 according to the decision.

FIG. 4 shows the hypothesis verification device 94 in detail. Referring to FIG. 4, the hypothesis verification device 94 includes: a word graph storage unit 120 for storing the word graph 104 supplied from the ASR decoder 90; a target word search unit 122 which, in order to calculate the generalized word posterior probability of each word contained in the word strings of the stored word graph other than the low-likelihood word strings, searches the word graph for occurrences of the same word that overlap the period of that word; a posterior probability calculation unit 124 for calculating the generalized word posterior probability for the group of words found by the target word search unit 122, using the calculation method described above; a word graph update unit 126 for updating the word graph by reassigning the generalized word posterior probability calculated for each word by the posterior probability calculation unit 124 to the corresponding word of the word graph stored in the word graph storage unit 120; and a maximum likelihood path search unit 128 for searching the word graph, whose words have thus been reassigned generalized word posterior probabilities, for the path showing the highest generalized word posterior probability (the maximum likelihood path) and outputting the word string on that path, together with the generalized word posterior probabilities, as the word string 106.

[Operation]
The apparatus operates as follows. Referring to FIG. 3, when the input speech 100 is given, the ASR decoder 90 performs speech recognition and outputs the result as the word graph 104. Each word in the word graph 104 carries the word posterior probability obtained for it during recognition.

Referring to FIG. 4, the word graph storage unit 120 of the hypothesis verification device 94 stores the subset of this word graph that remains after low-likelihood entries are excluded. For each word in the word strings of the word graph stored in the word graph storage unit 120, the target word search unit 122 searches for the group of words over which the generalized word posterior probability is to be calculated (occurrences of the same word on other paths whose periods overlap the period of the word in question) and supplies the group to the posterior probability calculation unit 124.

The posterior probability calculation unit 124 sums the word posterior probabilities over the group of words found by the target word search unit 122 and, as described above, normalizes the sum by dividing it by the sum of the word posterior probabilities over all paths, thereby calculating the generalized word posterior probability of the target word.

For each word, the word graph update unit 126 reassigns the generalized word posterior probability calculated by the posterior probability calculation unit 124 to that word in the word graph.

Once generalized word posterior probabilities have been reassigned to all words, the maximum likelihood path search unit 128 searches the word graph for the word string showing the highest generalized word posterior probability, and supplies the word string on the path thus found, together with the generalized word posterior probabilities, as the word string 106 to the machine translation device 92 and the acceptance/rejection determination device 96 shown in FIG. 3.
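One possible reading of this path search is a dynamic program over the word graph that maximizes the sum of the reassigned word scores, in line with the first aspect described earlier. The sketch below assumes a directed acyclic graph with nodes numbered in topological order and an end node that is reachable from the start; the name `best_path` and the edge layout are hypothetical.

```python
def best_path(num_nodes, edges):
    """Path from node 0 to node num_nodes - 1 with the largest score sum.

    edges: list of (src, dst, label, posterior) with src < dst (assumed layout).
    Returns (total, labels).
    """
    best = [float("-inf")] * num_nodes
    back = [None] * num_nodes
    best[0] = 0.0
    # Processing edges by source node respects the topological order.
    for src, dst, label, post in sorted(edges, key=lambda e: e[0]):
        if best[src] + post > best[dst]:
            best[dst] = best[src] + post
            back[dst] = (src, label)
    labels, node = [], num_nodes - 1
    while node != 0:
        node, label = back[node]
        labels.append(label)
    return best[-1], labels[::-1]


graph = [(0, 1, "a", 0.9), (0, 1, "b", 0.4), (1, 2, "c", 0.5), (0, 2, "d", 0.8)]
print(best_path(3, graph))
```

Because the criterion is a sum of per-word scores, longer paths can outscore shorter ones; whether any length normalization is applied is not stated in the text.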

The acceptance/rejection determination device 96 decides, on the basis of the word posterior probabilities of the word string supplied as the word string 106, whether to accept or reject that word string as the recognition result. Specifically, the likelihood of the word string is compared with a predetermined threshold; the word string is accepted if the likelihood is greater than or equal to the threshold and rejected if it is below the threshold. When accepting the recognition result, the acceptance/rejection determination device 96 controls the machine translation device 92 so that machine translation of the word string 106 is carried out. When rejecting it, the device stops machine translation by the machine translation device 92 and uses the user I/F 82 to inform the user that the recognition result has been rejected.
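The threshold test can be written down directly. In the sketch below, the word string score is taken to be the sum of the word posterior probabilities, following the first aspect described earlier; that choice, the per-word flags, the function name `decide`, and the threshold values are all assumptions for illustration.

```python
def decide(word_posteriors, string_threshold, word_threshold):
    """Accept/reject decision for a word string and for each of its words.

    word_posteriors: generalized word posterior probabilities, one per word.
    Returns (accept_string, per_word_flags).
    """
    accept_string = sum(word_posteriors) >= string_threshold
    per_word = [p >= word_threshold for p in word_posteriors]
    return accept_string, per_word


print(decide([0.9, 0.8, 0.3], string_threshold=1.5, word_threshold=0.5))
```

A caller would start machine translation only when the first element is true, and could use the per-word flags to report which words were doubtful.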

In response to an instruction from the acceptance/rejection determination device 96 to start translation, the machine translation device 92 performs machine translation on the word string 106 and outputs the translation result 110. Because a posterior probability is attached to each word of the word string 106, the translation can take these posterior probabilities into account.

[Experiment]
-Configuration of the experimental system-
An apparatus according to the embodiment described above was set up and an experiment was performed. In this experiment, an N-best list was used instead of a word graph. The experiment used the Japanese Basic Travel Expression Corpus created by the applicant. Two test sets, Set 01 and Set 02, were used, consisting of 510 and 508 utterances, respectively. Each test set consists of recordings of various utterances by 10 speakers. The ASR decoder 90 was one developed by the applicant.

In this experiment, the ASR decoder 90 output 100-best recognition hypotheses for each utterance, and a narrow search beam width was used.

-Performance evaluation-
In the experiment, the performance of the experimental system was evaluated using the confidence error rate (CER), defined as follows in terms of the number of false rejections (FR) and the number of false acceptances (FA).
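The defining formula is not reproduced in this text. The sketch below therefore assumes a commonly used definition, namely the number of false acceptances plus the number of false rejections divided by the number of recognized words; that denominator, the function name `confidence_error_rate`, and the input layout are assumptions, not taken from the source.

```python
def confidence_error_rate(decisions):
    """CER under the ASSUMED definition (FA + FR) / number of recognized words.

    decisions: non-empty list of (accepted, correct) booleans, one pair per
    recognized word.
    """
    false_accepts = sum(1 for accepted, correct in decisions
                        if accepted and not correct)
    false_rejects = sum(1 for accepted, correct in decisions
                        if not accepted and correct)
    return (false_accepts + false_rejects) / len(decisions)


words = [(True, True), (True, False), (False, True), (False, False)]
print(confidence_error_rate(words))
```

Under this definition, accepting every word gives a CER equal to the fraction of recognized words that are wrong.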

‐結果‐
仮に認識結果を全て受入れることにし、拒否しない場合には、判別可能な誤りは挿入と置換だけとなる。実験では、この誤りレベルをベースラインとした。実験で使用したテストセットに対する認識結果の、ベースラインのＣＥＲを次のテーブル１に「ベースライン」として示す。 -result-
If all recognition results are accepted and not rejected, the only errors that can be identified are insertion and replacement. In the experiment, this error level was used as the baseline. The baseline CER of the recognition results for the test set used in the experiment is shown as “Baseline” in Table 1 below.

単語事後確率の簡単な算出方法は、ある特定の単語について、その単語が出現する仮説数を数えることである。この数と、仮説の総数との比によって、その単語の一般化単語事後確率の大まかな値を算出できる。これは、式（９）においてα＝β＝０とした場合に相当する。この大まかな算出方法を使用した場合の結果をテーブル１において「再出現率」として示す。テーブル１から分かるように、この方法を用いるとベースラインに対して２ポイント程度のＣＥＲの改善が得られた。 A simple method for calculating word posterior probabilities is to count the number of hypotheses in which a particular word appears. Based on the ratio of this number to the total number of hypotheses, a rough value of the generalized word posterior probability of the word can be calculated. This corresponds to the case where α = β = 0 in equation (9). The results when this rough calculation method is used are shown as “reappearance rate” in Table 1. As can be seen from Table 1, using this method resulted in a CER improvement of about 2 points relative to the baseline.

The relative contributions of the acoustic likelihood and the language likelihood to the generalized word posterior probability are not clearly known, but choosing appropriate values of α and β is useful for calculating the generalized word posterior probability as accurately as possible. To examine how different values of α and β affect the generalized word posterior probability calculated by equation (9), the performance of a single-threshold classifier was tested over a wide range of α and β. The results are shown as contour plots in FIG. 5 (set 01) and FIG. 6 (set 02). In FIGS. 5 and 6, high performance was obtained with combinations of α and β that fall in the dark regions.
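A single-threshold classifier accepts a word when its generalized word posterior probability reaches a threshold. The sketch below picks that threshold for one (α, β) setting, assuming the threshold is chosen to minimize the fraction of misclassified words on the evaluated set. The function name, the acceptance rule `score >= threshold`, and the selection criterion are assumptions for illustration.

```python
def best_threshold(scores, is_correct):
    """Return (threshold, error_rate) minimizing word-level misclassification.

    A word is accepted when its score is >= threshold. An error is either an
    accepted incorrect word or a rejected correct word.
    """
    if len(scores) != len(is_correct) or not scores:
        raise ValueError("scores and labels must be non-empty and equally long")
    n = len(scores)
    # Candidates: every observed score, plus +inf, which rejects everything.
    candidates = sorted(set(scores)) + [float("inf")]
    best = None
    for threshold in candidates:
        errors = sum(
            1 for s, c in zip(scores, is_correct) if (s >= threshold) != bool(c)
        )
        rate = errors / n
        if best is None or rate < best[1]:
            best = (threshold, rate)
    return best
```

Repeating this search for each (α, β) pair on a grid would give one error value per grid point, which is the kind of surface a contour plot displays.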

These results show that, for both set 01 and set 02, the combinations of α and β that give high performance lie on a nearly straight line in the plot, and that there is a high-performance region where the values of α and β are relatively small. A more detailed investigation was therefore carried out over this apparently optimal region (α ∈ [0.01, 0.2], β ∈ [0.1, 1.5]). The results are shown in FIG. 7 (set 01) and FIG. 8 (set 02).

FIGS. 7 and 8 show that the highest performance is obtained at α = 0.06, β = 0.3 for set 01 and at α = 0.03, β = 0.3 for set 02. The CER at these optimal points is shown in the bottom row of Table 1.

FIG. 9 shows the ROC (Receiver Operating Characteristics) curves of the reappearance rate and of the word posterior probability (wpp) obtained with the optimal α and β. FIG. 9(A) is for set 01 and FIG. 9(B) is for set 02.

FIG. 9 shows that when the optimal α and β obtained with one test set are applied to the other test set, the loss in performance is very small. Table 2 shows the CER obtained in this cross-validation.

Table 2 shows that these parameters are very stable. Changing either α or β slightly does not cause a sharp drop in performance.

FIG. 10 shows in more detail how system performance behaves in the vicinity of the optimal points for the two test sets. As FIGS. 10(C) and 10(D) show, the CER is relatively stable against variations in α. As FIGS. 10(A) and 10(B) show, the CER is less stable against variations in β than against variations in α. Even so, there is no abrupt change in CER, and it can be regarded as relatively stable.

-Two-class Gaussian classifier-
For comparison, the word-spotting-style hypothesis verification was recast as a two-class classification task. The classifier was trained on a given data set. The data was first passed through speech recognition to generate 100-best lists. Based on each 100-best list, the posterior probability of each word in the recognized word string was calculated and tagged with a label indicating whether the word was correct.

The word posterior probabilities of correct words and of incorrect words were separated into two clusters, and a Gaussian model was trained on each. Sets 01 and 02 were used alternately as the training set and the test set.
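A minimal sketch of such a two-class classifier on scalar word posterior probabilities. It assumes one univariate Gaussian per class and a decision that compares log-likelihood plus log class prior. The use of class priors, the variance floor, and all names are assumptions for illustration.

```python
import math
from statistics import mean, pvariance


def fit_gaussian(values, min_var=1e-6):
    """Return (mean, variance) of a 1-D Gaussian; the variance is floored."""
    return mean(values), max(pvariance(values), min_var)


def log_gaussian(x, mu, var):
    """Log density of N(mu, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)


def train(posteriors, is_correct):
    """Fit one Gaussian to correct words and one to incorrect words."""
    correct = [p for p, c in zip(posteriors, is_correct) if c]
    wrong = [p for p, c in zip(posteriors, is_correct) if not c]
    if not correct or not wrong:
        raise ValueError("both classes need at least one training sample")
    n = len(correct) + len(wrong)
    return {
        True: (fit_gaussian(correct), math.log(len(correct) / n)),
        False: (fit_gaussian(wrong), math.log(len(wrong) / n)),
    }


def accept(model, posterior):
    """Accept the word if the 'correct' class scores at least as high."""
    score = {
        label: log_gaussian(posterior, mu, var) + log_prior
        for label, ((mu, var), log_prior) in model.items()
    }
    return score[True] >= score[False]
```

Training on one set and calling `accept` on every word of the other set mirrors the alternating use of sets 01 and 02 described above.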

Table 3 shows the performance obtained.

As Table 3 shows, in both cases the performance is nearly equal to that of the single-threshold hypothesis classification shown in Table 1, although in both cases it is slightly lower than in Table 1.

As described above, the hypothesis verification device 94 according to the embodiment of the present invention can calculate a generalized word posterior probability for each word output by the ASR decoder. In doing so, (1) processing can be performed at high speed, because the number of hypotheses to be searched is reduced and the search space is made smaller; (2) a stable value of the word posterior probability can be calculated, because the time constraint used to select identical words when calculating the posterior probability of a word is relaxed; and (3) a hypothesis verification device with stable performance can be obtained, because the contributions of the acoustic likelihood and the language likelihood to the word posterior probability are reflected through α and β, and values in their optimal range have been identified.
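The calculation summarized above can be sketched on an N-best list, the form used in the experiment. The sketch makes several assumptions: each hypothesis carries total log acoustic and log language scores, each path is weighted by exp(α·log acoustic + β·log language), two words count as the same word when their labels match and their time spans overlap, and normalization is by the sum over all hypotheses in the list. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Word:
    label: str
    start: int  # first frame of the word (inclusive)
    end: int    # last frame of the word (inclusive)


@dataclass
class Hypothesis:
    words: list
    log_acoustic: float  # sum of per-word log acoustic likelihoods
    log_language: float  # sum of per-word log language likelihoods


def overlaps(a, b):
    """True if the time spans of the two words share at least one frame."""
    return a.start <= b.end and b.start <= a.end


def gwpp(target, nbest, alpha, beta):
    """Generalized word posterior probability of `target` over an N-best list."""
    logs = [alpha * h.log_acoustic + beta * h.log_language for h in nbest]
    shift = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(v - shift) for v in logs]
    through = sum(
        w for w, h in zip(weights, nbest)
        if any(x.label == target.label and overlaps(x, target) for x in h.words)
    )
    return through / sum(weights)


def best_hypothesis(nbest, alpha, beta):
    """Return (score, hypothesis) maximizing the sum of its words' posteriors."""
    scored = [(sum(gwpp(w, nbest, alpha, beta) for w in h.words), h) for h in nbest]
    return max(scored, key=lambda pair: pair[0])
```

The returned score can then be compared with a threshold to decide whether to accept or reject the recognition result. A lattice-based version would sum over lattice paths instead of list entries.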

Each of the functional units described above in block diagram form can be realized by computer hardware and software (a computer program) executed on that computer. General-purpose hardware can be used as this computer hardware, provided it has facilities for handling speech. Such software is itself a form of data and can be stored on a storage medium and distributed.

The software need not include all the instructions necessary to realize the functions of the hypothesis verification device 94 described above; for example, it may realize the desired functions by calling instructions provided by the operating system. In other words, it is sufficient that the software realizes each function of the hypothesis verification device 94 by using the hardware and software resources of the computer.

The speech machine translation apparatus 80 shown in FIG. 3 can also be realized by a computer of ordinary configuration and software, apart from the microphone, a board dedicated to speech processing, and the like.

A computer programmed with such software serves as the speech recognition result reliability verification apparatus according to the present invention.

The embodiment disclosed herein is merely an example, and the present invention is not limited to the embodiment described above. The scope of the present invention is indicated by each of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a schematic diagram of a word lattice/graph for explaining the operating principle of the hypothesis verification device 94 according to the first embodiment of the present invention.
FIG. 2 is a schematic diagram of an N-best list for explaining the principle of the hypothesis verification device 94 according to the first embodiment of the present invention.
FIG. 3 is a block diagram of the speech machine translation apparatus 80 using the hypothesis verification device 94 according to the first embodiment of the present invention.
FIG. 4 is a detailed block diagram of the hypothesis verification device 94 shown in FIG. 3.
FIG. 5 is a contour plot showing the results of testing the performance of a single-threshold classifier over a wide range of α and β using set 01.
FIG. 6 is a contour plot showing the results of testing the performance of a single-threshold classifier over a wide range of α and β using set 02.
FIG. 7 is a contour plot showing the results of testing the performance of a single-threshold classifier over a narrow range of α and β using set 01.
FIG. 8 is a contour plot showing the results of testing the performance of a single-threshold classifier over a narrow range of α and β using set 02.
FIG. 9 is a graph comparing the ROC curves of the reappearance rate and the generalized word posterior probability when the optimal α and β are used.
FIG. 10 is a graph showing the behavior of system performance with respect to variations in α and β in the vicinity of the optimal points for the two test sets (01, 02).

Explanation of symbols

80 speech machine translation apparatus, 90 ASR decoder, 92 machine translation device, 94 reliability-measure-based hypothesis verification device, 96 acceptance/rejection determination device, 120 word graph storage unit, 122 target word search unit, 124 posterior probability calculation unit, 126 word graph update unit, 128 maximum likelihood path search unit

Claims

A speech recognition result reliability verification apparatus that receives, from a speech recognition decoder, a speech recognition result representing a plurality of hypothesis word strings each consisting of words to which word posterior probabilities are given, and that verifies the reliability of the speech recognition result based on the word posterior probabilities, the apparatus comprising:
generalized word posterior probability calculating means for calculating, for each word included in the speech recognition result, a generalized word posterior probability based on the word posterior probabilities of the words included in the speech recognition result;
updating means for updating the word posterior probability of each word included in the speech recognition result with the generalized word posterior probability calculated by the generalized word posterior probability calculating means;
search means for searching, based on the speech recognition result whose word posterior probabilities have been updated by the updating means, the plurality of hypothesis word strings for the hypothesis word string that maximizes the sum of the word posterior probabilities of the words it contains; and
determining means for verifying the reliability of the speech recognition result by determining whether the sum of the word posterior probabilities of the hypothesis word string found by the search means satisfies a predetermined condition,
wherein each word included in the speech recognition result is further provided with information identifying its time period within the utterance input to the speech recognition decoder, and
wherein the generalized word posterior probability calculating means includes:
word search means for searching the speech recognition result, for each word included in a word lattice composed of the plurality of hypothesis word strings, for words that lie in a time period overlapping the time period of that word and that match that word; and
means for calculating the generalized word posterior probability of each word included in the word lattice by dividing the sum of the likelihoods of those paths in the word lattice that pass through a word found by the word search means by the sum of the likelihoods of all paths included in the word lattice.

The speech recognition result reliability verification apparatus according to claim 1, further comprising means for selecting, prior to the calculation of the generalized word posterior probabilities by the generalized word posterior probability calculating means, only those word strings of the speech recognition result whose likelihood is higher than a threshold determined by a predetermined criterion, and for supplying them to the generalized word posterior probability calculating means.

The speech recognition result reliability verification apparatus according to claim 1, wherein the means for calculating the generalized word posterior probability includes means for calculating the generalized word posterior probability of each word as the ratio of the sum of the word posterior probabilities of the words found by the word search means to the sum of the word posterior probabilities of all words included in the speech recognition result.

The speech recognition result reliability verification apparatus according to any one of claims 1 to 3, wherein the generalized word posterior probability p([w; s, t] | x_1^T) of a word w in the hypothesis word string (where s and t are the start time and end time, respectively, of the time period of the word w) is given by the formula
where x_1^T = x_1, ..., x_T is the observed speech sequence, M is the number of words included in a hypothesis of the speech recognition result, s_n and t_n are the start time and end time, respectively, of the n-th word w_n, p(x_sm^tm | w_m) is the acoustic likelihood, p(w_m | w_1^M) is the language likelihood, p(x_1^T) is the likelihood of the acoustic observation, and α and β are predetermined constants.
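The formula referred to in this claim can be written out from the symbol definitions given with it. The form below is a reconstruction under that assumption, not a copy of the original. It is chosen so that it sums over hypotheses containing a matching word whose time period overlaps (s, t), and so that it reduces to the hypothesis-counting ratio when α = β = 0 and p(x_1^T) is taken as the same weighted product summed over all paths.

```latex
p\bigl([w;s,t] \mid x_1^T\bigr)
  = \sum_{\substack{M,\ [w;s,t]_1^M \\
      \exists n,\ 1 \le n \le M:\ w = w_n,\ (s_n,t_n) \cap (s,t) \neq \emptyset}}
    \frac{\displaystyle\prod_{m=1}^{M}
          p^{\alpha}\bigl(x_{s_m}^{t_m} \mid w_m\bigr)\,
          p^{\beta}\bigl(w_m \mid w_1^M\bigr)}
         {p\bigl(x_1^T\bigr)}
```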

A computer program that, when executed by a computer, causes the computer to operate so as to realize each means of the speech recognition result reliability verification apparatus according to any one of claims 1 to 4.

A computer programmed by the computer program according to claim 5.