JP4852448B2

JP4852448B2 - Error tendency learning speech recognition apparatus and computer program

Info

Publication number: JP4852448B2
Application number: JP2007050175A
Authority: JP
Inventors: 彰夫小林
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2007-02-28
Filing date: 2007-02-28
Publication date: 2012-01-11
Anticipated expiration: 2027-02-28
Also published as: JP2008216341A

Description

The present invention relates to an error tendency learning speech recognition apparatus and a computer program.

The linguistic plausibility of a recognition result output by a speech recognition apparatus is evaluated by a statistical language model. Most statistical language models used in speech recognition are word n-gram models, which give the probability of the next word from the word history (word string) obtained up to that point. Such statistical language models are described, for example, in Non-Patent Document 1.
Because a statistical language model is trained only on positive examples, that is, on text written in correct Japanese, it cannot eliminate linguistic errors in speech recognition. Consequently, if an utterance has been misrecognized, utterances with similar content are likely to be misrecognized as well.

Conventionally, therefore, recognition errors have been reduced by collecting texts that are close in time to, or topically related to, the speech to be recognized and adapting the statistical language model to them. Non-Patent Document 2 describes a technique that improves the recognition rate by raising the probability of words contained in the recognition results, based on the observation that "words that have appeared in the past are likely to be used again". Non-Patent Document 3 describes a technique that selects recognition results on the basis of posterior probability and uses them to train a statistical language model.
Non-Patent Document 1: Kenji Kita, "Probabilistic Language Models" (確率的言語モデル), University of Tokyo Press, 1999, pp. 57-62
Non-Patent Document 2: R. Kuhn and R. De Mori, "A Cache-Based Natural Language Model for Speech Recognition", IEEE Trans. on Pattern Analysis and Machine Intelligence, 1990, vol. 12, no. 6, pp. 570-583
Non-Patent Document 3: G. Riccardi and D. Hakkani-Tur, "Active Learning: Theory and Applications to Automatic Speech Recognition", IEEE Trans. on Speech and Audio Processing, 2005, vol. 13, no. 4, pp. 504-511

With the technique of Non-Patent Document 2, the recognition results do not necessarily contain the correct words, so the technique may raise the probability of erroneous words contained in the recognition results. The improvement in recognition rate is therefore limited. With the technique of Non-Patent Document 3, a language model can be trained on manually supplied correct answers, but the error tendency of the speech recognition apparatus cannot be learned.

In practical real-time speech recognition apparatuses, on the other hand, the output of the speech recognition apparatus is in many cases corrected manually. Because a correct answer for the input speech is therefore always obtained, the error tendency of the recognition results can be found by comparing the recognition results output by the speech recognition apparatus with the correct answers. That is, errors in words or word strings in the recognition results can be identified. If the error tendency has been learned, the same mistake becomes less likely when speech with similar content is input again.
In other words, by capturing the error tendency of the speech recognition apparatus from the correct answers and the recognition results and feeding this information back to the speech recognition apparatus, speech recognition errors can be reduced, and the load on the correction operator can be expected to decrease.

The present invention has been made in view of these circumstances. Its object is to provide an error tendency learning speech recognition apparatus and a computer program that statistically learn the error tendency of recognition results from the recognition results of a speech recognition apparatus and the correct answers obtained by correcting them, and thereby reduce recognition errors in future speech recognition.

The present invention has been made to solve the above problem and is an error tendency learning speech recognition apparatus comprising: speech recognition means for recognizing input speech, outputting a plurality of correct answer candidates, and selecting a speech recognition result from among the output candidates; correction means for receiving input of a correction to the speech recognition result selected by the speech recognition means, correcting the speech recognition result, and outputting a correct answer for the input speech; and error tendency learning means for statistically analyzing the tendency of recognition errors from the plurality of correct answer candidates output by the speech recognition means and the correct answer output by the correction means, wherein the speech recognition means corrects errors in the selection of the speech recognition result using an error correction model for correcting the tendency of recognition errors analyzed by the error tendency learning means.

Further, in the error tendency learning speech recognition apparatus described above, the error correction model is statistically calculated so as to maximize the probability that the correct answer is selected from the speech recognition results, based on one or more of the following kinds of information: words contained in the correct answer candidates and the correct answer, the part of speech or semantic information of those words, the preceding and following word strings, and dependency relations.

Further, in the error tendency learning speech recognition apparatus described above, the error tendency learning means updates the error correction model by statistically analyzing the tendency of recognition errors from the plurality of correct answer candidates output by the speech recognition means for new input speech and the correct answer for that input speech output by the correction means, and the speech recognition means corrects errors in the selection of the speech recognition result using the error correction model updated by the error tendency learning means.

Further, in the error tendency learning speech recognition apparatus described above, the speech recognition means outputs the speech recognition result for the input speech in real time.

Further, the present invention is a computer program that causes a computer used as an error tendency learning speech recognition apparatus to execute: a speech recognition step of recognizing input speech, outputting a plurality of correct answer candidates, and selecting a speech recognition result from among the output candidates; a correction step of receiving input of a correction to the speech recognition result selected in the speech recognition step, correcting the speech recognition result, and outputting a correct answer for the input speech; and an error tendency learning step of statistically analyzing the tendency of recognition errors from the plurality of correct answer candidates output in the speech recognition step and the correct answer output in the correction step, wherein, in the speech recognition step, the computer is caused to correct errors in the selection of the speech recognition result using an error correction model for correcting the tendency of recognition errors analyzed in the error tendency learning step.

According to the present invention, the tendency of speech recognition errors is statistically learned from the correct answer candidates output by speech recognition and from the correct answers, and recognition errors are eliminated using the statistical model obtained by that learning. This improves the recognition rate and also reduces the load on the correction operator.

An embodiment of the present invention will now be described with reference to the drawings.
FIG. 1 is a diagram for explaining an outline of the error tendency learning speech recognition apparatus according to an embodiment of the present invention.
Suppose that the speech recognition apparatus included in the error tendency learning speech recognition apparatus obtains N correct answer candidates for a given input speech. These N candidates are outputs of the speech recognition apparatus arranged in descending order of likelihood. In FIG. 1 they are correct answer candidate 1 "家族／の／再開／の／日程" (family / of / resumption / of / schedule), correct answer candidate 2 "家族／の／最下位／の／日程" (family / of / last place / of / schedule), and so on. Suppose also that a correct answer has been obtained by manually correcting the insertion, substitution, and deletion errors in the first of the N candidates (correct answer candidate 1). Here the correct answer is "家族／の／再会／の／日程" (family / of / reunion / of / schedule).

Looking at the portion enclosed by the dotted line in FIG. 1 (reference sign A), the correct word is "再会" (reunion), whereas correct answer candidate 1 has "再開" (resumption) and correct answer candidate 2 has "最下位" (last place); both are errors. The error tendency of the speech recognition output here means that, given the context "家族／の" (family / of), the speech recognition system selects "再開" or "最下位" instead of "再会".
Learning the error tendency means keeping the speech recognition system from selecting "再開" or "最下位" when that context is given, that is, statistically making "再開" and "最下位" less likely to appear or making "再会" more likely to appear. The error tendency is evaluated by how "correct" each candidate appears to be and is given by statistical means.
The configuration and operation of the error tendency learning speech recognition apparatus according to an embodiment of the present invention are described below.

FIG. 2 is a diagram showing the overall configuration of the error tendency learning speech recognition apparatus 1 according to an embodiment of the present invention. The speech recognition apparatus 10 takes input speech 40 as input and outputs a correct answer candidate list 60, which is data indicating N correct answer candidates. N is generally about 200 to 300. The speech recognition apparatus 10 uses the acoustic model 20 (acoustic model data stored in the acoustic model storage unit 21), the language model 30 (language model data stored in the language model storage unit 31), and the error correction model 90 (error correction model data stored in the error correction model storage unit 91) to calculate, for each correct answer candidate, a score for judging its likelihood, and determines the likely candidates for the input speech 40 from these scores. That is, the speech recognition apparatus 10 searches for the word strings that maximize the sum of the scores of the acoustic model 20, the language model 30, and the error correction model 90, and outputs the N highest-scoring candidates as the correct answer candidate list 60. Any existing models can be used as the acoustic model 20 and the language model 30.

Let w be one of the correct answer candidates. The speech recognition apparatus 10 calculates its score g(w) as in (Equation 1) below.
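The formula for (Equation 1) does not appear in this text. Judging from the definitions of f_0, f_1, λ_0, λ_1, and g_ec given in the surrounding description, it plausibly has the following form; this is a reconstruction under that assumption, not the published formula:

```latex
g(w) = \lambda_0 f_0(w) + \lambda_1 f_1(w) + g_{ec}(w)
\qquad \text{(Equation 1, reconstructed)}
```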

In the above, f_0(w) is the acoustic model score and f_1(w) is the language model score. Letting x be the input speech, the acoustic score is the logarithm of the probability P(x|w) obtained from the statistical acoustic model, and the language score is the logarithm of the probability P(w) obtained from the statistical language model. This rests on the following. By Bayes' theorem, the posterior probability that candidate w is obtained when input speech x occurs is P(w|x) = P(x|w)·P(w)/P(x). Taking P(x) to be 1 and taking the logarithm of both sides, the right-hand side becomes log P(x|w) + log P(w); λ_0 and λ_1 are the weights applied to these two terms.
λ_0 and λ_1 in (Equation 1) are constants determined in advance. g_ec(w) in (Equation 1) is the score of the error correction model, and a candidate with a higher score g(w) is judged more likely.

The correction device 50 receives, via input means (not shown), a manual correction to the first-ranked recognition result among the N correct answer candidates in the correct answer candidate list 60 output by the speech recognition apparatus 10, and outputs correct answer data 70 indicating the correct speech recognition result generated from that correction.
The error tendency learning device 80 learns the error tendency of the speech recognition apparatus 10 from the N correct answer candidates in the correct answer candidate list 60 and the correct answer in the correct answer data 70. The method by which the error tendency learning device 80 creates the error correction model is described with reference to FIG. 3.

FIG. 3 is a diagram showing the operation flow of the error tendency learning device 80. Creation of the error correction model in the error tendency learning device 80 consists of two procedures: a weight initialization step (S100) and a weight update step (S110). The weight initialization step (S100) estimates the weights of the error correction model from a large number of correct answer candidates and correct answers accumulated in advance. The weight update step (S110) updates the error correction model obtained in the weight initialization step (S100) using correct answer candidates and correct answers newly input from the speech recognition apparatus 10.

The weight initialization step (S100) is described with reference to FIG. 4.
FIG. 4 is a functional block diagram of the part of the error tendency learning device 80 that operates in the weight initialization step.
The score of the error correction model appears in (Equation 1) as g_ec(w); it is defined as in (Equation 2) below.
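The formula for (Equation 2) does not appear in this text. Since the description defines the error correction model as a linear sum of feature functions f_i(w) (i = 2, 3, ..., I) weighted by λ_i, a form consistent with that description would be the following; it is a reconstruction, not the published formula:

```latex
g_{ec}(w) = \sum_{i=2}^{I} \lambda_i f_i(w)
\qquad \text{(Equation 2, reconstructed)}
```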

The error correction model is defined as a linear sum of the functions f_i(w) (i = 2, 3, ..., I) in the above equation. Each f_i(w) is called a feature function; it is a rule for expressing an error tendency in the recognition result (correct answer candidate) w and is held in advance in the error tendency learning device 80. A feature function returns a nonzero real number when an event (here, correct answer candidate w) has the corresponding tendency (feature); for example, it may return the number of times that tendency is observed. The rules are defined on the basis of the words in the candidate, their parts of speech and semantic information, word strings (context), and grammatical information such as dependency relations. λ_i is the weight of the feature function and indicates how important f_i is. The λ_i are obtained by the learning procedure described later.
Rules used for feature functions include, for example, the following.

Rule 1: the number of occurrences in w of the word string "家族／の／再会" (family / of / reunion) (denoted f_2)
Rule 2: the number of phrase (bunsetsu) dependencies in w in which a phrase containing "家族" (family) depends on a phrase containing "再会" (reunion) (denoted f_3)
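As an illustration of Rule 1, a count-based feature function might be sketched as follows. This is a minimal sketch, not the implementation in the patent: it assumes a candidate is already segmented into a list of words, and the names `count_word_sequence`, `RULE1_PATTERN`, and `f2` are hypothetical. Rule 2 would additionally need a dependency parser and is not sketched here.

```python
from typing import List


def count_word_sequence(candidate: List[str], pattern: List[str]) -> int:
    """Count how often `pattern` occurs as a contiguous run of words in `candidate`."""
    n = len(pattern)
    if n == 0:
        return 0
    return sum(
        1
        for start in range(len(candidate) - n + 1)
        if candidate[start:start + n] == pattern
    )


# Rule 1 as a feature function f_2: occurrences of 家族/の/再会 in candidate w.
RULE1_PATTERN = ["家族", "の", "再会"]


def f2(w: List[str]) -> int:
    return count_word_sequence(w, RULE1_PATTERN)


# The correct answer and correct answer candidate 1 from the FIG. 1 example.
correct = ["家族", "の", "再会", "の", "日程"]
candidate1 = ["家族", "の", "再開", "の", "日程"]
print(f2(correct), f2(candidate1))
```

The feature fires once on the correct answer and not at all on candidate 1, which is what lets a positive weight λ_2 favor the correct word string.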

The feature function of Rule 1 gives the frequency with which the context "家族／の／再会" (family / of / reunion) appears in correct answer candidate w. The feature function of Rule 2 returns the number of dependencies in candidate w that satisfy Rule 2, and 0 if there are none.
The feature extraction unit 130 extracts the features that match the above rules from the correct answer candidates and correct answers 120, that is, from the large number of correct answer candidates in the correct answer candidate lists 60 previously output by the speech recognition apparatus 10 and the correct answers output by the correction device 50.
The initial weight learning unit 140 then determines the weights of the extracted feature functions. To determine the weights, consider, for example, an objective function such as (Equation 3) below.
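The formula for (Equation 3) does not appear in this text. The explanation that follows describes it as the negated sum, over M input speeches, of the logarithm of the share that the exponentiated score of the correct sentence takes among the exponentiated scores of the N candidates. A form consistent with that explanation would be the following; the index range n = 1, ..., N is an assumption taken from that explanation:

```latex
L_{log} = - \sum_{m=1}^{M} \log
  \frac{\exp\bigl(g(w_{m,0})\bigr)}{\sum_{n=1}^{N} \exp\bigl(g(w_{m,n})\bigr)}
\qquad \text{(Equation 3, reconstructed)}
```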

Here, w_{m,0} denotes the correct sentence for the m-th input speech 40, and w_{m,n} denotes the n-th correct answer candidate of speech recognition for the m-th input speech 40. The above objective function is based on the sum of the log-likelihoods obtained when the conditional probability q(w|x) of the correct sentence for input speech x_m is defined as in (Equation 4) below.
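The formula for (Equation 4) does not appear in this text. Given that the description exponentiates the score g(w) and divides by a normalization term Z(x_m), a form consistent with it would be the following reconstruction:

```latex
q(w \mid x_m) = \frac{\exp\bigl(g(w)\bigr)}{Z(x_m)}
\qquad \text{(Equation 4, reconstructed)}
```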

Here, Z(x_m) in the above equation is a normalization term and is given by (Equation 5) below.
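The formula for (Equation 5) does not appear in this text. If the normalization runs over the N correct answer candidates for the m-th input speech, as the later explanation of the objective function suggests, it would read as follows; the index range is an assumption:

```latex
Z(x_m) = \sum_{n=1}^{N} \exp\bigl(g(w_{m,n})\bigr)
\qquad \text{(Equation 5, reconstructed)}
```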

In other words, (Equation 3) computes, for each input speech x_m (m = 1 to M), the share that the score g(w) of the correct answer takes in the total of the scores g(w) of correct answer candidates 1, 2, ..., N (the conditional probability of (Equation 4)), takes its logarithm (the log-likelihood), and sums the result over all M input speeches x_m to form the loss function. The exp is there to undo the logarithm in the scores. Because the conditional probability inside the log is a share of the whole, it is at most 1 and its logarithm is at most 0, so a negative sign is applied to the whole to make the loss non-negative. The closer the share is to 1, the closer the log is to 0, so minimizing L_log increases the share taken by the correct answer overall.
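The computation described above can be sketched in a few lines. This is a sketch under stated assumptions rather than the patent's implementation: the function name `log_loss` and the toy scores are hypothetical, and the score of the correct sentence is assumed to be included among the candidate scores so that its share never exceeds 1. The maximum score is subtracted before exponentiating only for numerical stability.

```python
import math
from typing import List


def log_loss(correct_scores: List[float], candidate_scores: List[List[float]]) -> float:
    """Negated sum over utterances of log( exp(g_correct) / sum_n exp(g_n) )."""
    total = 0.0
    for g_correct, g_candidates in zip(correct_scores, candidate_scores):
        peak = max(g_candidates)
        # log of the normalization term, computed stably (log-sum-exp)
        log_z = peak + math.log(sum(math.exp(g - peak) for g in g_candidates))
        total -= g_correct - log_z
    return total


# Hypothetical scores g(w) for two utterances. The correct sentence is the
# second candidate of the first utterance and the first candidate of the second.
candidates = [[-14.0, -15.0, -18.0], [-9.0, -12.0]]
correct = [-15.0, -9.0]
print(log_loss(correct, candidates))
```

Raising the score of the correct sentence relative to its competitors lowers the loss, which is the behavior the weights λ_i are trained to produce.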

As another objective function, a function L_acc based on the expected value of the word accuracy for the m-th sentence may be defined as in (Equation 6) below.
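The formula for (Equation 6) does not appear in this text, and the description constrains it less tightly than the earlier equations. One form that matches the description (an expectation of word accuracy under the conditional probability, a logarithm per input speech, a sum over M input speeches, and a leading negative sign) is given below. It is a tentative reconstruction: it assumes A_cc is expressed as a fraction between 0 and 1 so that the argument of the logarithm is at most 1, and it assumes the expectation runs over the N candidates.

```latex
L_{acc} = - \sum_{m=1}^{M} \log \sum_{n=1}^{N}
  A_{cc}(w_{m,n}) \, q(w_{m,n} \mid x_m)
\qquad \text{(Equation 6, tentative reconstruction)}
```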

In the above equation, A_cc(w_{m,n}) denotes the word accuracy of correct answer candidate w_{m,n}. Word accuracy is calculated as (number of correct words - number of insertion errors) / (total number of words) × 100.
The number of insertion errors here is the number of words that were inserted or replaced. In the example of FIG. 1, the number of words is 5, and the correct answer has a word accuracy of 100%. Recognition results 1 and 2 each require one word ("再開" or "最下位") to be replaced with "再会", so the number of insertion errors is 1 and the word accuracy is (5 - 1) / 5 = 80%. If instead the correct answer were "家族との再会の日程" (schedule of the reunion with the family), recognition result 1 would need "と" inserted and "再開" replaced with "再会", so the number of insertion errors would be 2 and the word accuracy of recognition result 1 would be (5 - 2) / 5 = 60%.
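The arithmetic of the worked examples can be checked with a short helper. This sketch follows the examples as computed in the text, (total words - inserted or replaced words) / total words × 100; the function name `word_accuracy` and its parameters are hypothetical, and counting the errors themselves (for instance by edit-distance alignment) is left out.

```python
def word_accuracy(total_words: int, error_words: int) -> float:
    """Word accuracy in percent, as in the worked examples:
    (total_words - error_words) / total_words * 100."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * (total_words - error_words) / total_words


print(word_accuracy(5, 0))  # the correct answer itself
print(word_accuracy(5, 1))  # one word replaced (recognition results 1 and 2)
print(word_accuracy(5, 2))  # one word inserted and one replaced
```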

In other words, (Equation 6) computes, for each input speech x_m (m = 1 to M), the expected value under the conditional probability of the correct answer, takes its logarithm, and sums the result over all M input speeches x_m used as training data to form the loss function. The argument of the log is at most 1, and a negative sign is applied to the whole to make the loss non-negative; the closer the probability that the correct answer appears is to 1, the closer the log is to 0. Minimizing L_acc therefore increases the expected value with which correct answers appear overall.

As described above, an error correction model that reflects the error tendency is one whose weights minimize the objective function L_log or L_acc above. The weights that minimize the objective function are found using, for example, a quasi-Newton method. A quasi-Newton method starts from a suitable initial value, generates a next value closer to the solution, repeats this from each new value, and finally converges to the optimal solution. For details of the quasi-Newton method, see W. H. Press et al., "Numerical Recipes in C" (Japanese translation by Tankei et al.), pp. 313-314, 1993.
The initial weight learning unit 140 outputs an initial error correction model 150 having the weights λ_i obtained by the above procedure. The initial error correction model 150 is written to the error correction model storage unit 91 as the initial value of the error correction model 90.

Next, the weight update step (S110) of FIG. 3 is described with reference to FIG. 5.
FIG. 5 is a block diagram showing the configuration of the part of the error tendency learning device 80 that operates in the weight update step.
A characteristic of speech recognition for news and similar material is that recognition results are obtained and accumulated continuously. The initial error correction model therefore needs to be updated using the newly obtained correct answer candidates and correct answers.

The weight update unit 170 takes as input the correct answer candidates and correct answers 160 newly obtained as output of the speech recognition apparatus 10 after the previous weight initialization step (S100) was executed, together with the initial error correction model 150 obtained in that step, and updates the weights λ_i of the current initial error correction model 150.
The new weights are found by the steepest descent method, which searches for the minimum of a function using its first derivative (gradient). For details of the steepest descent method, see R. O. Duda, P. E. Hart and D. G. Stork, "Pattern Classification (2nd edition)", pp. 223-227, 2001.
The update formula for the weight λ_i under the steepest descent method is as follows.
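The update formula does not appear in this text. The standard steepest descent update, written with the symbols the description uses (step size η, objective L(Λ), updated weight λ_i'), has the following form; it is given as a reconstruction consistent with those symbols:

```latex
\lambda_i' = \lambda_i - \eta \, \frac{\partial L(\Lambda)}{\partial \lambda_i}
```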

Here, η is a constant set in advance. The objective function L(Λ) is L_log or L_acc, calculated according to (Equation 3) or (Equation 6) from the newly obtained correct answer candidates and correct answers 160, that is, from the correct answer candidates newly output by the speech recognition apparatus 10 and the corresponding correct answers output by the correction device 50.
The weight update unit 170 replaces the original λ_i with the obtained λ_i' and thereby updates the error correction model 90. The speech recognition apparatus 10 then selects the correct sentence from among the correct answer candidates using the acoustic model 20, the language model 30, and the error correction model 90.
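A steepest descent update of this kind can be sketched as follows. The sketch assumes the gradient of the objective is supplied by the caller; the names `steepest_descent_step`, `toy_loss`, and `toy_gradient` are hypothetical, and the quadratic toy objective merely stands in for L_log or L_acc so that the example stays self-contained.

```python
from typing import List


def steepest_descent_step(weights: List[float], gradient: List[float], eta: float) -> List[float]:
    """One update: lambda_i' = lambda_i - eta * dL/dlambda_i for every weight."""
    return [lam - eta * g for lam, g in zip(weights, gradient)]


# Toy objective standing in for L_log or L_acc (an assumption for illustration):
# L(lambda) = (lambda_0 - 1)^2 + (lambda_1 + 2)^2, minimized at (1, -2).
def toy_loss(weights: List[float]) -> float:
    return (weights[0] - 1.0) ** 2 + (weights[1] + 2.0) ** 2


def toy_gradient(weights: List[float]) -> List[float]:
    return [2.0 * (weights[0] - 1.0), 2.0 * (weights[1] + 2.0)]


weights = [0.0, 0.0]
eta = 0.1  # step size, fixed in advance
for _ in range(50):
    weights = steepest_descent_step(weights, toy_gradient(weights), eta)
print(weights, toy_loss(weights))
```

With a fixed η the weights move toward the minimizer at each step; in practice η would need to be chosen small enough for the real objective to decrease.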

The procedure for selecting the correct sentence will be described with reference to FIG. 6. FIG. 6 is a block diagram showing the configuration of the speech recognition apparatus 10 as it operates when selecting a correct sentence.
The speech recognition apparatus 10 generates N correct answer candidates 180 from the input speech 40. The score calculation unit 190 calculates a score for each correct answer candidate 180 according to (Equation 1), using the acoustic model 20, the language model 30, and the error correction model 90.
The correct answer candidate sorting unit 200 then sorts the correct answer candidates 180 in descending order of the scores obtained by the score calculation unit 190. The correct answer candidate 180 ranked first after sorting becomes the speech recognition result 210, which is the output of the speech recognition apparatus 10.
Note that, running on an ordinary general-purpose personal computer or the like, the speech recognition apparatus 10 can perform the processing described above, from the input of the input speech 40 to the output of the speech recognition result 210, in real time.
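The selection procedure above (score each of the N candidates, sort in descending order, output the first) can be sketched as follows. Equation 1 is not reproduced in this section, so the score is assumed here to be a weighted sum of the acoustic score, the language score, and the feature function values; the function names and example values are hypothetical.

```python
from typing import List, Sequence, Tuple

Candidate = Tuple[str, Sequence[float]]  # (word sequence, [f_0, f_1, ..., f_I])


def rank_candidates(candidates: Sequence[Candidate],
                    weights: Sequence[float]) -> List[Tuple[float, str]]:
    # Role of the score calculation unit: assumed linear score per candidate.
    scored = [(sum(w * f for w, f in zip(weights, feats)), text)
              for text, feats in candidates]
    # Role of the sorting unit: descending order of score.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def select_result(candidates: Sequence[Candidate],
                  weights: Sequence[float]) -> str:
    # The first-ranked candidate becomes the speech recognition result.
    return rank_candidates(candidates, weights)[0][1]


nbest = [
    ("a b c", [-10.0, -4.0, 0.0]),
    ("a b d", [-11.0, -4.0, 1.0]),
]
print(select_result(nbest, weights=[1.0, 1.0, 2.0]))
```

In this example the second candidate has a lower acoustic score, but the weighted feature value lifts it to first place, which is the kind of reranking the error correction model is meant to provide.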

Speech recognition was performed, according to the above embodiment, on previously broadcast news (1,298 sentences), and the word accuracy was obtained.
Table 1 shows the results in comparison with a conventional method, namely rescoring with a trigram language model.
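The text does not state how word accuracy is computed. A commonly used definition is (N - S - D - I) / N, where N is the number of reference words and S, D, and I are the substitutions, deletions, and insertions in a minimum-cost alignment. The sketch below assumes that definition and assumes the sentences are already split into words; the function names are hypothetical and the example data is not from the experiment.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    # Word-level Levenshtein distance (substitution, deletion, insertion cost 1).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (r != h))) # substitution or match
        prev = cur
    return prev[-1]


def word_accuracy(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    # (N - S - D - I) / N accumulated over all sentences.
    total = sum(len(r) for r in refs)
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return (total - errors) / total


refs = [["a", "b", "c"], ["d", "e"]]
hyps = [["a", "x", "c"], ["d", "e", "f"]]
print(word_accuracy(refs, hyps))
```

Because insertions count as errors, this measure can fall below the plain percentage of correctly recognized words.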

The embodiment described above has the following effects.
(a) Similar recognition errors are reduced, which reduces the burden on correction operators in applications such as real-time speech recognition systems.
(b) Because the error tendency model improves in accuracy through repeated sequential learning, the recognition rate of the speech recognition system improves over time.

In the above description, both the coefficient λ_0 of the acoustic model score and the coefficient λ_1 of the speech recognition model score are changed. Alternatively, these may be left unchanged and only the coefficients λ_i (i = 2 to I) of the feature functions may be changed.
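The variation in which λ_0 and λ_1 stay fixed can be sketched as a masked update step. This is an illustration only: the gradient values are assumed to be computed elsewhere from the objective function, and the function name and the `n_fixed` parameter are hypothetical.

```python
from typing import List, Sequence


def update_feature_weights_only(weights: Sequence[float],
                                grad: Sequence[float],
                                eta: float,
                                n_fixed: int = 2) -> List[float]:
    # Keep the first n_fixed weights (lambda_0, lambda_1) unchanged and apply
    # the steepest descent step only to lambda_i for i >= n_fixed.
    if len(weights) != len(grad):
        raise ValueError("weights and grad must have the same length")
    return [w if i < n_fixed else w - eta * g
            for i, (w, g) in enumerate(zip(weights, grad))]


weights = [1.0, 0.5, 0.0, 0.2]   # lambda_0, lambda_1, lambda_2, lambda_3
grad = [0.3, -0.1, 0.5, -1.0]    # assumed gradient of the objective
print(update_feature_weights_only(weights, grad, eta=0.1))
```

Setting `n_fixed=0` recovers the case described first, in which all coefficients are updated.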

Note that the error tendency learning speech recognition apparatus 1 described above contains a computer system. The operation of each device of the error tendency learning speech recognition apparatus 1 is stored in the form of a program on a computer-readable recording medium, and the above processing is performed when the computer system reads and executes this program. The computer system here includes a CPU, various memories, an OS, and hardware such as peripheral devices.

The "computer system" also includes a homepage providing environment (or display environment) if a WWW system is used.
The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or to a storage device such as a hard disk built into a computer system. The "computer-readable recording medium" further includes media that hold a program dynamically for a short time, such as a communication line used when a program is transmitted via a network such as the Internet or a communication line such as a telephone line, and media that hold a program for a certain period of time, such as volatile memory inside a computer system serving as the server or client in that case. The program may implement only part of the functions described above, or may implement those functions in combination with a program already recorded in the computer system.

FIG. 1 is a diagram for explaining the outline of an error tendency learning speech recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a functional block diagram of the error tendency learning speech recognition apparatus according to the embodiment.
FIG. 3 is a diagram showing the operation flow of the error tendency learning device according to the embodiment.
FIG. 4 is a block diagram showing the configuration of the error tendency learning device for the weight initialization step according to the embodiment.
FIG. 5 is a block diagram showing the configuration of the error tendency learning device for the weight update step according to the embodiment.
FIG. 6 is a diagram showing the configuration of the speech recognition apparatus according to the embodiment.

Explanation of symbols

1 ... Error tendency learning speech recognition apparatus
10 ... Speech recognition apparatus (speech recognition means)
20 ... Acoustic model
30 ... Language model
50 ... Correction device (correction means)
80 ... Error tendency learning device (error tendency learning means)
90 ... Error correction model
130 ... Feature extraction unit
140 ... Initial weight learning unit
170 ... Weight update unit
190 ... Score calculation unit
200 ... Correct answer candidate sorting unit

Claims

An error tendency learning speech recognition apparatus comprising:
speech recognition means for recognizing input speech, outputting a plurality of correct answer candidates, and selecting a speech recognition result from the output correct answer candidates;
correction means for receiving a correction input for the speech recognition result selected by the speech recognition means, correcting the speech recognition result, and outputting a correct answer for the input speech;
storage means for storing an error correction model defined by a feature function representing an error tendency of the correct answer candidates and by a weight of the feature function; and
error tendency learning means for creating the error correction model by determining the weight based on the plurality of correct answer candidates output by the speech recognition means and the correct answer output by the correction means, and for storing the error correction model in the storage means,
wherein the speech recognition means selects the speech recognition result from the correct answer candidates based on an acoustic model, a language model, and the error correction model, and
the error tendency learning means determines the weight so that the score of the correct answer calculated based on each model is larger than the score of each candidate other than the correct answer among the correct answer candidates.

An error tendency learning speech recognition apparatus comprising:
speech recognition means for recognizing input speech, outputting a plurality of correct answer candidates, and selecting a speech recognition result from the output correct answer candidates;
correction means for receiving a correction input for the speech recognition result selected by the speech recognition means, correcting the speech recognition result, and outputting a correct answer for the input speech;
storage means for storing an error correction model defined by a feature function representing an error tendency of the correct answer candidates and by a weight of the feature function; and
error tendency learning means for creating the error correction model by determining the weight based on the plurality of correct answer candidates output by the speech recognition means and the correct answer output by the correction means, and for storing the error correction model in the storage means,
wherein the speech recognition means selects the speech recognition result from the correct answer candidates based on an acoustic model, a language model, and the error correction model, and
the error tendency learning means determines the weight so that the probability that the speech recognition result coincides with the correct answer is maximized.

The error tendency learning speech recognition apparatus according to claim 1 or claim 2, wherein the speech recognition means outputs the speech recognition result of the input speech in real time.

A computer program causing a computer used as an error tendency learning speech recognition apparatus to execute:
a speech recognition step of recognizing input speech, outputting a plurality of correct answer candidates, and selecting a speech recognition result from the output correct answer candidates;
a correction step of receiving a correction input for the speech recognition result selected in the speech recognition step, correcting the speech recognition result, and outputting a correct answer for the input speech; and
an error tendency learning step of creating an error correction model, defined by a feature function representing an error tendency of the correct answer candidates and by a weight of the feature function, by determining the weight based on the plurality of correct answer candidates output in the speech recognition step and the correct answer output in the correction step,
wherein, in the speech recognition step, the computer executes a process of selecting the speech recognition result from the correct answer candidates based on an acoustic model, a language model, and the error correction model, and
in the error tendency learning step, the computer executes a process of determining the weight so that the score of the correct answer calculated based on each model is larger than the score of each candidate other than the correct answer among the correct answer candidates.