JP5271299B2

JP5271299B2 - Speech recognition apparatus, speech recognition system, and speech recognition program

Info

Publication number: JP5271299B2
Application number: JP2010064175A
Authority: JP
Inventors: 亨今井
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2010-03-19
Filing date: 2010-03-19
Publication date: 2013-08-21
Anticipated expiration: 2030-03-19
Also published as: JP2011197410A

Description

The present invention relates to a speech recognition apparatus, a speech recognition system, and a speech recognition program, and more particularly to a speech recognition apparatus, system, and program for improving the accuracy of speech recognition.

Conventionally, speech recognition has been used, for example, to add captions in real time to live television programs. Current speech recognition technology is not perfect, and errors of a few percent occur in the output recognized word string. It is therefore common to place one or several operators downstream of the speech recognition apparatus to correct errors in real time, and to send out the manually corrected text as caption broadcasting (see Non-Patent Documents 1 and 2).

However, information on the words corrected in real time in this way is not currently fed back to the speech recognition apparatus. Information useful for subsequent recognition, such as which words were misrecognized and how, or which words were recognized correctly, is not used effectively. In recent years, research has sought to improve future recognition accuracy by feeding speech recognition results that contain errors back to the speech recognition apparatus. Proposals include the method known as the cache model (see, for example, Non-Patent Document 3) and a method of feeding error-corrected text back to the speech recognition apparatus (see, for example, Non-Patent Document 4).

Non-Patent Document 1: Ando et al., "Broadcast News Subtitle Production System Using Speech Recognition," IEICE Transactions, vol. J84-D-II, no. 6, pp. 877-887, 2001
Non-Patent Document 2: Honma et al., "Real-time caption production system using both direct and re-speak speech recognition," Transactions of the Video Information Society, vol. 63, no. 3, pp. 331-338, 2009
Non-Patent Document 3: Kenji Kita, "Stochastic Language Model," University of Tokyo Press, p. 77, 1999
Non-Patent Document 4: Honma et al., "Improvement of spontaneous speech recognition for news talk programs," Proceedings of the Spring Meeting of the Acoustical Society of Japan, 3-Q-17, pp. 243-244, 2009

However, each of the conventional methods described above only corrects the language model, which represents the probabilities of word sequences, and the correction is reflected only in future speech recognition. These methods can neither correct the acoustic model, which represents voice characteristics, nor successively revise portions that have already been recognized but are not yet finalized as caption text. As a result, recognition accuracy cannot be improved quickly.

The present invention has been made in view of the problems described above, and its object is to provide a speech recognition apparatus, a speech recognition system, and a speech recognition program for improving the accuracy of speech recognition.

To solve the above problems, the present invention adopts means having the following features.

The invention according to claim 1 is a speech recognition apparatus that performs speech recognition using a speech recognition result and an error correction result for input speech, comprising: acoustic analysis means for extracting acoustic features of the input speech; word lattice generation means for generating, using a preset acoustic model, language model, and pronunciation dictionary, a word lattice consisting of a network of candidate words corresponding to the acoustic features of the input speech obtained by the acoustic analysis means; maximum likelihood word string selection means for selecting a maximum likelihood word string from the word lattice obtained by the word lattice generation means; acoustic model discriminative learning means for training the acoustic model using a word string corrected with respect to the maximum likelihood word string; and word lattice reconstruction means for reconstructing the word lattice for the corrected word string using the acoustic model trained by the acoustic model discriminative learning means, wherein the maximum likelihood word string selection means further uses the word lattice obtained by the word lattice reconstruction means to select a maximum likelihood word string for the word string that follows the corrected word string.

According to the invention of claim 1, the accuracy of speech recognition can be improved.

In the invention according to claim 2, the word lattice reconstruction means reconstructs the word lattice in whole or in part by replacing incorrect words among the candidate words in the initial word lattice for the input speech with correct words, newly inserting correct words that are missing, and deleting words that cannot connect to a correct word.

According to the invention of claim 2, corrected portions can be reflected in the word lattice quickly, so the accuracy of speech recognition can be further improved.

In the invention according to claim 3, when the acoustic model discriminative learning means has acquired a correct word string for the same input speech more than once, it trains the acoustic model using only the statistics of the latest correct word string and deletes the statistics of the older correct word strings.

According to the invention of claim 3, even when the same portion is corrected again because of an operator mistake or for some other reason, model accuracy can be improved by training the acoustic model only on the latest correct word string.

In the invention according to claim 4, the word lattice reconstruction means repeatedly reconstructs the word lattice until the corrected word string becomes the correct word string.

According to the invention of claim 4, corrections can be reflected in real time. In addition, performing the reconstruction multiple times further improves the accuracy of speech recognition.

The invention according to claim 5 is a speech recognition system including the speech recognition apparatus according to any one of claims 1 to 4 and an error correction device that corrects errors in the speech recognition result obtained from the speech recognition apparatus, wherein the error correction device comprises: word string display means for displaying on a screen the latest recognized word string sequentially input from the speech recognition apparatus; error correction means for correcting errors in the recognized word string displayed by the word string display means; and information output means for outputting the correct word string obtained by the error correction means to an external device and/or feeding it back to the speech recognition apparatus.

According to the invention of claim 5, for example, speech recognition results for the same utterance can be acquired repeatedly, and the text on the screen of the error correction device can be updated automatically to the latest state. Speech recognition results can therefore be corrected and checked quickly and accurately.

The invention according to claim 6 is a speech recognition program for performing speech recognition using a speech recognition result and an error correction result for input speech, the program causing a computer to function as: acoustic analysis means for extracting acoustic features of the input speech; word lattice generation means for generating, using a preset acoustic model, language model, and pronunciation dictionary, a word lattice consisting of a network of candidate words corresponding to the acoustic features of the input speech obtained by the acoustic analysis means; maximum likelihood word string selection means for selecting a maximum likelihood word string from the word lattice obtained by the word lattice generation means; acoustic model discriminative learning means for training the acoustic model using a word string corrected with respect to the maximum likelihood word string; and word lattice reconstruction means for reconstructing the word lattice for the corrected word string using the acoustic model trained by the acoustic model discriminative learning means, wherein the maximum likelihood word string selection means further uses the word lattice obtained by the word lattice reconstruction means to select a maximum likelihood word string for the word string that follows the corrected word string.

According to the invention of claim 6, the accuracy of speech recognition can be improved successively. In addition, speech recognition processing can be realized easily by installing the executable program on a computer.

According to the present invention, the accuracy of speech recognition can be improved.

A diagram showing a system configuration example of the speech recognition system in this embodiment.
A diagram showing an example of the functional configuration of the error correction device.
A diagram showing an example of changes to the character string displayed on the recognized word string display means.
A diagram showing an example of an initial word lattice.
A diagram showing an example of a word lattice reconstructed by word replacement.
A diagram showing an example of a word lattice reconstructed by word addition.
A diagram showing an example of a word lattice reconstructed by word deletion.
A diagram for explaining the differences among the processing times in this embodiment.
A diagram showing an example of the functional configuration of a speech recognition apparatus in another embodiment.
A flowchart showing an example of the speech recognition processing procedure.

＜About the present invention＞
In the present invention, for example, the confirmation of speech recognition results and the error correction information are fed back online to the speech recognition apparatus, and adaptive training of the acoustic model is carried out discriminatively from the correspondence between correct and incorrect words. The present invention also automatically corrects the word lattice, which is a network of candidate words for speech recognition, and reconstructs and rescores it, thereby successively outputting recognition results of higher accuracy.

Specifically, when speech recognition is used, for example, to add captions in real time to a live television program, the character strings that are successively corrected online, that is, the caption text, are fed back to the speech recognition apparatus. As a result, even recognized word strings that have already been output are automatically and successively revised as long as they have not been finalized as caption text, and recognition results of higher accuracy can be output.

Preferred embodiments of the speech recognition apparatus, speech recognition system, and speech recognition program of the present invention are described below with reference to the drawings. The speech recognition described below is basically processed sequentially, one utterance at a time. One utterance is a speech segment bounded by two silent segments (for example, segments in which silence lasts about 400 ms).
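The utterance definition above can be illustrated with a minimal sketch that splits per-frame voice-activity flags into utterances wherever silence lasts at least about 400 ms. The 10 ms frame length, the function name `split_utterances`, and the availability of a boolean voice-activity flag per frame are assumptions made for illustration; the description does not say how silence is detected.

```python
from typing import List, Tuple


def split_utterances(voiced: List[bool], frame_ms: int = 10,
                     min_silence_ms: int = 400) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices (end exclusive) of each utterance."""
    min_gap = min_silence_ms // frame_ms  # silent frames that close an utterance
    utterances: List[Tuple[int, int]] = []
    start = None        # first voiced frame of the utterance being built
    last_voiced = -1    # most recent voiced frame seen
    silence = 0         # length of the current run of silent frames
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            if start is None:
                start = i
            last_voiced = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_gap:
                utterances.append((start, last_voiced + 1))
                start = None
                silence = 0
    # Assumption: speech still open at the end of the input is also emitted.
    if start is not None:
        utterances.append((start, last_voiced + 1))
    return utterances
```

Pauses shorter than the threshold stay inside one utterance, so a speaker's brief hesitation does not start a new recognition unit.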

＜Speech recognition system: system configuration example＞
FIG. 1 is a diagram showing a system configuration example of the speech recognition system in this embodiment. The speech recognition system 1 shown in FIG. 1 includes a speech recognition apparatus 10, which has acoustic analysis means 11, word lattice generation means 12, a language model and pronunciation dictionary 13, an acoustic model 14, maximum likelihood word string selection means 15, word lattice reconstruction means 16, and acoustic model discriminative learning means 17, and further includes an error correction device 18.

In this embodiment, as shown in FIG. 1, the error correction device 18, which corrects errors in the recognized word string output by the processing of the speech recognition apparatus, is provided outside the speech recognition apparatus 10, but an equivalent configuration may instead be provided inside the speech recognition apparatus 10.

For the discriminative training of the acoustic model 14 in this embodiment, a general technique such as minimum phone error training can be used (non-patent document: D. Povey and P. C. Woodland, "Minimum phone error and I-smoothing for improved discriminative training," Proc. IEEE ICASSP, pp. I-105-108, 2002).

＜Speech recognition apparatus 10: functional configuration example＞
First, the specific functional configuration of the speech recognition apparatus 10 in the speech recognition system 1 is described.

The acoustic analysis means 11 analyzes a speech signal input from outside (the input speech) and extracts its acoustic features, for example frequency characteristics and sound power. To obtain these features, a window function (such as a Hamming window) is first applied to the speech waveform to extract framed waveforms, and each framed waveform is then frequency-analyzed. In this embodiment, for example, cepstrum coefficients, which are obtained by applying an inverse Fourier transform to the logarithm of the power spectrum of a framed waveform, can be used as features, as can other features used in general speech recognition methods. Various features can be obtained by analyzing, for example, mel-frequency cepstral coefficients (MFCC) of about 12 dimensions that represent voice characteristics (see, for example, Shikano et al., "Speech Recognition System," Ohmsha, 2001), features that quantify the shape of the vocal tract such as LPC (linear predictive coding) coefficients, prosodic features (pitch, intonation, and so on), and statistical information such as the mean and variance of these features. The acoustic analysis means 11 outputs the acoustic features obtained by the analysis to the word lattice generation means 12.
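The framing, windowing, and cepstrum steps described above can be sketched roughly as follows. This is a dependency-free illustration of a plain cepstrum (inverse transform of the log power spectrum), not of MFCC, which would additionally need a mel filter bank. The function names, the frame and hop sizes, and the choice to drop coefficient 0 are assumptions, and the naive DFT is used only to keep the example self-contained.

```python
import cmath
import math
from typing import List


def hamming(n: int) -> List[float]:
    """Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def split_frames(signal: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Cut the signal into overlapping frames of frame_len samples."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def cepstrum(frame: List[float], n_coeffs: int = 12) -> List[float]:
    """Low-order cepstrum coefficients of one frame."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hamming(n))]
    # Naive O(n^2) DFT; a real system would use an FFT library.
    spectrum = [sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)) for k in range(n)]
    # The small floor avoids log(0) on silent frames.
    log_power = [math.log(abs(c) ** 2 + 1e-12) for c in spectrum]
    # Inverse DFT of the log power spectrum gives the cepstrum.
    ceps = [sum(log_power[k] * cmath.exp(2j * math.pi * k * q / n)
                for k in range(n)).real / n for q in range(n)]
    # Assumption: skip coefficient 0 and keep the next n_coeffs values.
    return ceps[1:n_coeffs + 1]
```

Calling `cepstrum` on each frame returned by `split_frames` yields one 12-dimensional feature vector per frame, which is the kind of sequence the word lattice generation step would consume.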

The word lattice generation means 12 uses the language model and pronunciation dictionary 13 and the acoustic model 14, which are stored in advance, to generate from the acoustic features obtained by the acoustic analysis means 11 a word lattice, that is, a network of multiple words that are possible recognition candidates. The word lattice generation means 12 outputs the generated word lattice to the maximum likelihood word string selection means 15.

The language model and pronunciation dictionary 13 stores the preset language models and pronunciation dictionaries needed for speech recognition in this embodiment. As the language model, a general N-gram model, which expresses as a probability how readily one word follows another, can be used. Such a model quantifies how readily words connect, for example "the probability that the word 'warming' (温暖化) follows the word 'earth' (地球) is 0.8."
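As a small illustration of the N-gram idea above, the sketch below estimates bigram probabilities by counting word pairs in a toy corpus. The corpus, the function name `train_bigram`, and the use of unsmoothed maximum likelihood estimates are assumptions for illustration; a production language model would apply smoothing and far more data.

```python
from collections import Counter
from typing import Callable, List


def train_bigram(sentences: List[List[str]]) -> Callable[[str, str], float]:
    """Return prob(prev, nxt) = P(nxt | prev) estimated from raw counts."""
    history = Counter()   # how often each word appears with a successor
    pairs = Counter()     # how often each (prev, nxt) pair appears
    for words in sentences:
        for prev, nxt in zip(words, words[1:]):
            history[prev] += 1
            pairs[(prev, nxt)] += 1

    def prob(prev: str, nxt: str) -> float:
        if history[prev] == 0:
            return 0.0    # unseen history: no estimate without smoothing
        return pairs[(prev, nxt)] / history[prev]

    return prob


# Toy corpus in which "earth" is followed by "warming" four times out of five.
corpus = [["earth", "warming"]] * 4 + [["earth", "science"]]
prob = train_bigram(corpus)
print(prob("earth", "warming"))
```

With this corpus the estimate for "warming" after "earth" is 4/5, matching the 0.8 figure used as an example in the text.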

The pronunciation dictionary is a file that represents the pronunciation of each word as a combination of vowels and consonants. For example, the pronunciation of the word "earth" (地球) is written as "/chikyu:/". As shown in FIG. 1, the language model and the pronunciation dictionary may be stored together in a single database or may be configured as separate databases. The data of the language model and pronunciation dictionary 13 is used by the word lattice generation means 12.

The acoustic model 14 represents properties such as the frequency characteristics of the voice for each vowel and consonant, and can be represented by a general hidden Markov model (HMM). In this embodiment, the acoustic model 14 or the word lattice generation means 12 may also hold speech recognition parameters. Speech recognition parameters are variables that adjust the accuracy and processing speed of speech recognition, such as the maximum number of words to be retained during recognition and the weighting coefficient that balances the scores from the language model and the acoustic model.

Furthermore, the acoustic model 14 described above can be trained to its latest state by the acoustic model discriminative learning means 17, using the correct word string obtained from the word lattice reconstruction means 16. The data of the acoustic model 14 is used by the word lattice generation means 12 for the next utterance and by the word lattice reconstruction means 16 for the utterance currently being processed.

In this embodiment, the language model and pronunciation dictionary 13 and the acoustic model 14 are provided inside the speech recognition apparatus 10, but they may be provided externally, in which case they may be updated as appropriate by another external device or the like.

The maximum likelihood word string selection means 15 searches the word lattice obtained by the word lattice generation means 12 for the path whose word string has the highest score, and outputs it as the initial recognized word string.
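The highest-scoring path search described above can be sketched as dynamic programming over a lattice stored as a list of edges. The edge layout, the requirement that node ids increase along every edge, the function name `best_path`, and the way the language model weight enters the score are all assumptions for illustration, not the data structures of the embodiment.

```python
from typing import Dict, List, Tuple

# (from_node, to_node, word, acoustic log score, language model log score)
Edge = Tuple[int, int, str, float, float]


def best_path(edges: List[Edge], start: int, end: int,
              lm_weight: float = 1.0) -> Tuple[List[str], float]:
    """Return the word string and total score of the best start-to-end path.

    Assumes node ids are topologically ordered (from_node < to_node).
    """
    best: Dict[int, Tuple[float, List[str]]] = {start: (0.0, [])}
    for src, dst, word, acoustic, lm in sorted(edges, key=lambda e: e[0]):
        if src not in best:
            continue  # source node is not reachable from the start node
        score = best[src][0] + acoustic + lm_weight * lm
        if dst not in best or score > best[dst][0]:
            best[dst] = (score, best[src][1] + [word])
    if end not in best:
        raise ValueError("end node is not reachable from start node")
    return best[end][1], best[end][0]


lattice = [
    (0, 1, "global", -1.0, -0.5),
    (0, 1, "local", -1.2, -1.5),
    (1, 2, "warming", -0.8, -0.2),
    (1, 2, "warning", -0.7, -2.0),
]
words, score = best_path(lattice, start=0, end=2)
```

In this toy lattice the acoustic scores alone slightly favor "warning", and the language model score tips the choice to "warming"; setting `lm_weight=0.0` shows the difference.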

The recognized word string obtained by the processing above is then checked for correctness by the error correction device 18. In the error correction device 18, an error correction operator confirms correct words and corrects erroneous words in near real time, working in order from the beginning of the utterance currently being processed that corresponds to the input speech. The resulting partial correct word strings are used by an application (external device) such as caption broadcasting, and are also output to the word lattice reconstruction means 16 so that the corrections are reflected.

That is, the error correction device 18 judges whether the input recognized word string contains errors, and outputs to the speech recognition apparatus 10 the judgment result (a control signal indicating either OK or that a correction was made) together with the corresponding data, namely the corrected word string (either a word string that was correct from the start or a word string whose errors have been corrected).

The word lattice reconstruction means 16 uses the acoustic model 14 trained by the acoustic model discriminative learning means 17 to reconstruct the word lattice for a corrected word string such as a correct word string. Here, a word string is, for example, a partial character string consisting of one or more words that corresponds to an utterance segment of the input speech.

For the portion whose initial speech recognition has already finished but which is not yet finalized as a final result such as caption text, the word lattice reconstruction means 16 removes from the word lattice the links between the correct word string and the word nodes corresponding to incorrect words. The word lattice reconstruction means 16 also removes word nodes in the unfinalized portion that can connect only to the removed word nodes, and reconstructs and updates the word lattice of the utterance currently being processed so that it corresponds only to the correct word string.
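The removal step described above can be sketched as follows: given the words the operator has confirmed so far, keep only the matching edges at the start of the lattice, then keep whatever is still reachable behind them. The edge layout, the function name `prune_lattice`, and the restriction to a confirmed prefix are assumptions for illustration. The sketch covers only deletion; inserting a correct word that is missing from the lattice is not shown.

```python
from typing import List, Set, Tuple

Edge = Tuple[int, int, str]  # (from_node, to_node, word)


def prune_lattice(edges: List[Edge], start: int,
                  confirmed: List[str]) -> List[Edge]:
    """Keep edges consistent with the confirmed prefix and what follows it."""
    kept: List[Edge] = []
    frontier: Set[int] = {start}
    for word in confirmed:
        matching = [e for e in edges if e[0] in frontier and e[2] == word]
        if not matching:
            raise ValueError("confirmed word not in lattice: " + word)
        kept.extend(matching)  # competing (incorrect) edges are dropped here
        frontier = {e[1] for e in matching}
    # Collect every node still reachable after the confirmed prefix.
    reachable = set(frontier)
    grew = True
    while grew:
        grew = False
        for src, dst, _ in edges:
            if src in reachable and dst not in reachable:
                reachable.add(dst)
                grew = True
    # Nodes that hung only off removed edges are no longer reachable.
    kept.extend(e for e in edges if e[0] in reachable)
    return kept


lattice = [
    (0, 1, "global"), (0, 2, "local"),
    (1, 3, "warming"), (1, 3, "warning"),
    (2, 4, "morning"), (3, 5, "is"), (4, 5, "is"),
]
pruned = prune_lattice(lattice, start=0, confirmed=["global"])
```

Confirming "global" removes "local" and, with it, the "morning" branch that could only be reached through "local", while both candidates after "global" remain open for rescoring.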

The word lattice reconstruction means 16 outputs the reconstructed word lattice to the maximum likelihood word string selection means 15. It also outputs the finally confirmed information (the correct word string) to the acoustic model discriminative learning means 17. The confirmed correct word string is, for example, the correct word string received together with an OK control signal from the error correction device 18.

The word string that the word lattice reconstruction means 16 obtains from the error correction device 18 may be the word string of an entire utterance, or a partial character string consisting of one or more words within an utterance.

The acoustic model discriminative learning means 17 receives the correct word string from the word lattice reconstruction means 16, identifies the correct words and the erroneous words in the word lattice from it, and uses their temporal alignment to adaptively and discriminatively train the acoustic model 14 and update it. The acoustic model discriminative learning means 17 also uses the adapted acoustic model 14 to recalculate and update the scores of all words in the word lattice. However, it does not recalculate the scores of words to be removed, which are described later.

The acoustic model discriminative learning means 17 preferably updates the acoustic model 14 when the finally confirmed correct word string is input from the word lattice reconstruction means 16. Alternatively, the update may be performed, for example, whenever the word lattice reconstruction means 16 receives a correct word string from the error correction device 18 and passes it to the acoustic model discriminative learning means 17. In that case, the acoustic model 14 is updated every time the word lattice reconstruction means 16 performs reconstruction. The acoustic model discriminative learning means 17 may also perform the update at fixed time intervals, after a fixed amount of text, or when an execution control signal or the like is received.

Furthermore, when the acoustic model discriminative learning means 17 has acquired a correct word string for the same input speech more than once, it may discriminatively train the acoustic model 14 using only the statistics of the latest correct word string and deleting the statistics of the older correct word strings.
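The "latest correction wins" rule above can be sketched with a store keyed by utterance. The class name `CorrectionStore`, the utterance ids, and the simplification of storing the word strings themselves rather than accumulated training statistics are assumptions for illustration.

```python
from typing import Dict, List, Tuple


class CorrectionStore:
    """Keeps only the most recent correct word string per utterance."""

    def __init__(self) -> None:
        self._latest: Dict[str, List[str]] = {}

    def add(self, utterance_id: str, correct_words: List[str]) -> None:
        # Overwriting discards whatever was stored for this utterance before.
        self._latest[utterance_id] = list(correct_words)

    def training_data(self) -> List[Tuple[str, List[str]]]:
        """Return (utterance_id, correct words) pairs, one per utterance."""
        return sorted(self._latest.items())


store = CorrectionStore()
store.add("u1", ["global", "warning"])   # first correction, later revised
store.add("u2", ["today"])
store.add("u1", ["global", "warming"])   # re-correction of the same utterance
```

After these calls, only the revised string for "u1" is available for training, so an earlier operator mistake does not contribute to the model update.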

After the reconstruction above is complete, the maximum likelihood word string selection means 15 again searches the word lattice for the path whose word string has the highest score and outputs it as the recognized word string with the corrections reflected. From then on, the error correction, the feedback of the correct word string, and the updating of the acoustic model 14 and the word lattice are repeated until the end of the utterance.

The error correction device 18 repeatedly acquires recognized word strings for the input speech of a single utterance, and always presents to the operator performing error correction the maximum likelihood word string based on the latest word lattice. The error correction device 18 also outputs the correct word string corrected by the operator to the speech recognition apparatus 10.

The speech recognition apparatus 10 of this embodiment is assumed to use a method that progressively confirms recognition results early, even in the middle of an utterance. For such incremental speech recognition, a conventional early-decision method such as the one disclosed in Japanese Patent No. 3834169 can be used. As a result, the word lattice, the recognized word string, and the correct word string cover the utterance progressively from its beginning rather than as a whole, so a more accurate recognition result is output sooner, for example with a delay of about 0.5 seconds. The present invention is not limited to the conventional early-decision method applied in this embodiment.

<Error Correction Device 18: Functional Configuration Example>
Next, a specific example of the functional configuration of the error correction device 18 in the speech recognition system 1 is described with reference to the drawings. FIG. 2 shows an example of the functional configuration of the error correction device. The error correction device 18 shown in FIG. 2 comprises recognized word string display means 21, error correction means 22, and information output means 23.

As the error correction device 18 receives the latest recognized word strings from the speech recognition apparatus 10 one after another, it displays each character string on the screen through the recognized word string display means 21. The recognized word string display means 21 consists of, for example, the screen of a touch panel or a monitor. It breaks the displayed recognized word string into multiple lines to suit the screen size and the content of the string. Because the recognized word string is segmented into words, the recognized word string display means 21 keeps each word together as a unit so that no line break falls in the middle of a word.
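Word-unit line breaking of this kind can be sketched as a greedy wrap over the segmented word list. The function name `wrap_words` and the default width of 16 characters are assumptions for illustration (the width echoes the display example given elsewhere in this description), and `len()` counts characters rather than rendered display width.

```python
def wrap_words(words, width=16):
    """Greedily pack whole words into lines of at most `width` characters.

    A word is never split; a single word longer than `width` is placed
    on a line of its own (hypothetical policy for this sketch).
    """
    lines = []
    current = ""
    for word in words:
        # Start a new line when adding this word would overflow the current one.
        if current and len(current) + len(word) > width:
            lines.append(current)
            current = ""
        current += word
    if current:
        lines.append(current)
    return lines


print(wrap_words(["abcd", "efgh", "ij"], width=8))
print(wrap_words(["abcde", "fghi"], width=8))
```

A real display would also need to account for the partition symbols between words and for full-width versus half-width glyphs.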

The error correction means 22 determines the signal entered by a user (the error correction operator) through an input device such as a touch panel, keyboard, or mouse, indicating either OK or that an error is present, together with the corrected or confirmed-correct character string. In other words, through the error correction means 22 the user checks the recognized word string displayed on the screen, decides whether a correction is needed, and enters that decision and, if a correction is made, the corrected content.

When the control signal for the obtained correct word string is OK, the information output means 23 outputs the correct word string to an application (external device) such as subtitle broadcasting, or feeds it back to the speech recognition apparatus 10. Both the output to the application and the feedback to the speech recognition apparatus 10 may be performed, or only one of them.

When the control signal for the correct word string is NG, the information output means 23 feeds the correct word string entered by the error correction operator back to the speech recognition apparatus 10 and also outputs it to the application, such as subtitle broadcasting. The information output means 23 can also send the OK and NG control signals themselves to the speech recognition apparatus 10, so that the speech recognition apparatus 10 can easily obtain the result of the error correction.
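The OK/NG handling of the information output means can be sketched as a small routing function. The function name `route_output`, the dictionary layout, and the choice to always produce both outputs are assumptions for this example; the description allows either output to be omitted in the OK case.

```python
def route_output(signal, recognized, corrected=None):
    """Decide which word string goes to the caption application and to feedback.

    signal     -- "OK" (recognized string confirmed) or "NG" (operator corrected it)
    recognized -- word list shown to the operator
    corrected  -- operator-entered word list, required when signal is "NG"
    """
    if signal == "OK":
        words = list(recognized)
    elif signal == "NG":
        if corrected is None:
            raise ValueError("an NG signal needs the corrected word string")
        words = list(corrected)
    else:
        raise ValueError(f"unknown control signal: {signal!r}")
    return {
        "caption": words,                                # to the external application
        "feedback": {"signal": signal, "words": words},  # to the recognizer
    }


print(route_output("NG", ["tsuzuite"], ["tsugi"]))
```

Sending the control signal along with the words lets the recognizer distinguish a confirmation from a correction without comparing strings.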

<Example of Changes to the Character String on the Recognized Word String Display Means 21>
Next, an example of how the character string on the recognized word string display means 21 changes is described with reference to the drawings. FIG. 3 shows an example of changes to the character string displayed on the recognized word string display means. As shown in FIG. 3, the character string obtained from the speech recognition apparatus 10 is displayed on the display screen 30 of the error correction device 18.

As an example, suppose the input speech "次／の／ニュース／です" (tsugi / no / nyūsu / desu, "Next is the news") is recognized as "続いて／は／ニューヨーク／して" (tsuzuite / wa / nyūyōku / shite). The recognized word string display means 21 of the error correction device 18 first displays the recognition results on the screen one word after another, nearly in real time, lagging the input speech by about 0.5 seconds (FIG. 3(a)).

The display screen 30 of the error correction device 18 can display, for example, up to about 16 characters by 10 lines, and word boundaries are marked explicitly with a partition 31 such as the symbol "｜". The partition symbol is not limited to this in the present invention; it may be "／", "＠", "＃", or the like, or each word may be enclosed in parentheses.

When the error correction device 18 displays the initial, incorrect recognition result "続いて" (tsuzuite) on the display screen 30, the operator uses input means such as the touch panel and keyboard to instruct that "続いて" (FIG. 3(b)) be corrected to the correct word "次" (tsugi), and the error correction means 22 executes that instruction. The display screen 30 is then rewritten accordingly (FIG. 3(c)).

The corrected text is used as the correct word string by an application such as subtitle broadcasting and is also fed back to the speech recognition apparatus 10. The speech recognition apparatus 10 discriminatively trains the acoustic model 14 and reconstructs the word lattice, and the more accurate word string "の／ニュース／です" (no / nyūsu / desu) for the unconfirmed portion is sent to the error correction device 18 again as the recognized word string.

The error correction device 18 always displays the latest recognized word string on the display screen 30 (FIG. 3(d)). The operator therefore does not need to correct any further errors: correcting only the first word causes the remaining words to be corrected automatically, and the correct word string can be confirmed immediately. The operator can confirm the recognition result and instruct its output one line of the display screen 30 at a time. This one-line unit may coincide exactly with the speech segment cut out as one utterance unit in speech recognition, or it may be shorter, as shown in FIG. 3(e). The error correction operator can thus confirm and correct recognition results while the utterance is still in progress, without waiting for it to end and without falling behind the input speech. Corrections in this embodiment may be made word by word, as delimited by the partitions 31, or on a character string consisting of multiple words.

<Specific Example of Word Lattice Reconstruction>
Next, a specific example of word lattice reconstruction is described with reference to the drawings. FIG. 4 shows an example of an initial word lattice. FIG. 5 shows an example of a word lattice reconstructed by word replacement. FIG. 6 shows an example of a word lattice reconstructed by word addition. FIG. 7 shows an example of a word lattice reconstructed by word deletion.

Suppose the content of the input utterance is "次／の／ニュース／です" (tsugi / no / nyūsu / desu). Here the symbol "／" denotes a word boundary (partition 31). In this embodiment, the acoustic analysis means 11 extracts acoustic features, and the word lattice generation means 12 generates a network of multiple words that are possible recognition candidates (a word lattice), for example as shown in FIG. 4.

In FIG. 4, the utterance start represents the silence immediately before the utterance (phonetic symbol sil), the utterance end represents the silence immediately after the utterance (phonetic symbol sil), and the other words are candidate words that may correspond to the input speech. In addition to the kanji-kana spelling of the word, each word node holds its phonetic symbols expressed as vowels and consonants, a score indicating how likely it is to be correct, and the end time of the word in the input speech. Each link, drawn as an arrow between word nodes, indicates that the two words can be connected.

In the examples of FIGS. 4 to 7, to show the content of the invention more clearly, word nodes that should be correct are joined by solid links and word nodes that should be incorrect are joined by dashed links. In FIGS. 4 to 7, the higher a word node is drawn in the figure, the higher its score.

The maximum likelihood word string selection means 15 searches the initial word lattice shown in FIG. 4 for the word string path with the highest score and outputs it as the initial recognized word string. In this example, the recognized word string "続いて／は／ニューヨーク／して" (tsuzuite / wa / nyūyōku / shite) is output.
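The best-path search over a word lattice can be sketched as dynamic programming over a directed acyclic graph. The toy lattice below is only loosely modelled on FIG. 4: the romanized node names, the per-node scores, and the `"<s>"`/`"</s>"` markers for utterance start and end are assumptions for this example, and a real lattice would also carry phonetic symbols and word end times.

```python
from functools import lru_cache

# Hypothetical lattice: node -> (score, successor nodes). Higher score is better.
LATTICE = {
    "<s>": (0.0, ["tsuzuite", "tsugi"]),
    "tsuzuite": (-1.0, ["wa"]),
    "tsugi": (-2.0, ["no", "ga"]),
    "wa": (-1.0, ["nyuuyooku"]),
    "no": (-2.0, ["nyuusu"]),
    "ga": (-3.0, ["nyuuyoku"]),
    "nyuuyooku": (-1.0, ["shite"]),
    "nyuusu": (-2.0, ["shite", "desu"]),
    "nyuuyoku": (-3.0, ["deshita"]),
    "shite": (-1.0, ["</s>"]),
    "desu": (-2.0, ["</s>"]),
    "deshita": (-3.0, ["</s>"]),
    "</s>": (0.0, []),
}


def best_path(lattice, start="<s>", end="</s>"):
    """Return (total score, node list) of the highest-scoring start-to-end path."""

    @lru_cache(maxsize=None)
    def best_from(node):
        score, successors = lattice[node]
        if node == end:
            return score, (node,)
        # Best continuation among successors that can still reach the end.
        tails = [best_from(n) for n in successors if n in lattice]
        tails = [t for t in tails if t is not None]
        if not tails:
            return None
        tail_score, tail = max(tails, key=lambda t: t[0])
        return score + tail_score, (node,) + tail

    result = best_from(start)
    return None if result is None else (result[0], list(result[1]))


print(best_path(LATTICE))
```

With these assumed scores the search returns the erroneous initial hypothesis, mirroring the situation in which the operator must step in.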

In the error correction device 18, when the user (error correction operator) corrects the wrong word "続いて" (tsuzuite) to the correct word "次" (tsugi), this partial correct word string "次" is input to the word lattice reconstruction means 16 and the acoustic model identification learning means 17.

As one example, this embodiment assumes a real-time-oriented type of speech recognition apparatus 10 that progressively confirms recognition results early, even in the middle of an utterance.

From the partial correct word string, the acoustic model identification learning means 17 identifies the correct word "次" (tsugi) and the erroneous word "続いて" (tsuzuite) in the word lattice and, using their temporal alignment, discriminatively adapts the acoustic model 14 and updates it.

The acoustic model 14 is thereby adapted so that the score for the correct pronunciation /ts, u, g, i/ becomes higher than the score for the wrong pronunciation /ts, u, z, u, i, t, e/, and the scores of all words in the word lattice are updated using the adapted model. As a result, as shown in FIG. 5, the scores of the words "して" (shite) and "です" (desu) may, for example, be reversed, so that "です" scores higher than "して". This addresses the correction of acoustic errors, which was difficult with conventional feedback methods.

Based on the partial correct word string, the word lattice reconstruction means 16 removes the erroneous word "続いて" (tsuzuite) from the word lattice, together with the word nodes "は" (wa) and "ニューヨーク" (nyūyōku), which can connect only to "続いて". The word lattice for the portion not yet confirmed as a final result, such as subtitle text, is thus reconstructed to reflect the confirmations and error corrections made so far.
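Removing a wrong word together with the nodes that hang only off it can be sketched as node deletion followed by a reachability sweep from the utterance start. The adjacency layout, the romanized node names, and the helper name `remove_word` are assumptions for this example; scores and timing are omitted.

```python
# Hypothetical successor lists loosely modelled on FIG. 4.
SUCCESSORS = {
    "<s>": ["tsuzuite", "tsugi"],
    "tsuzuite": ["wa"],
    "tsugi": ["no", "ga"],
    "wa": ["nyuuyooku"],
    "no": ["nyuusu"],
    "ga": ["nyuuyoku"],
    "nyuuyooku": ["shite"],
    "nyuusu": ["shite", "desu"],
    "nyuuyoku": ["deshita"],
    "shite": ["</s>"],
    "desu": ["</s>"],
    "deshita": ["</s>"],
    "</s>": [],
}


def remove_word(successors, wrong, start="<s>"):
    """Drop `wrong` and every node that is no longer reachable from `start`."""
    # Remove the wrong node itself and all links pointing at it.
    pruned = {n: [m for m in succ if m != wrong]
              for n, succ in successors.items() if n != wrong}
    # Depth-first sweep to find the nodes still reachable from the start.
    reachable, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in reachable or node not in pruned:
            continue
        reachable.add(node)
        stack.extend(pruned[node])
    return {n: succ for n, succ in pruned.items() if n in reachable}


after_first = remove_word(SUCCESSORS, "tsuzuite")   # "wa", "nyuuyooku" vanish too
after_second = remove_word(after_first, "ga")       # "nyuuyoku", "deshita" vanish too
print(sorted(after_second))
```

The node for "shite" survives the first removal because it is still reachable through "nyuusu", which matches the description that only nodes connectable solely to the wrong word are removed.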

In this example, as shown in FIG. 5, the maximum likelihood word string selection means 15 selects the word sequence "次／の／ニュース／です" (tsugi / no / nyūsu / desu) as the recognized word string and outputs it again. If the user (the operator performing error correction) then confirms the following word "の" (no) as correct, the words "入浴" (nyūyoku, "bathing") and "でした" (deshita), which can be reached only from the wrong word "が" (ga), are also removed automatically.

Thus, in this embodiment, subsequent recognition errors may be corrected automatically merely by the user correcting errors and confirming correct words starting from the first word. By repeating error correction and feedback, and the updating of the acoustic model and the word lattice, until the end of the utterance, the correction results can be reflected in the recognition results progressively while the utterance is still in progress.

In the error correction example above, the correct word was already contained in the word lattice, and the lattice was reconstructed according to the user's instructions. If, on the other hand, the error correction operator enters as the correct word a word that is not contained in the word lattice, a new word node is added to the lattice, as with "話題" (wadai, "topic") in FIG. 6. As shown in FIG. 7, if the user indicates that the word "して" (shite) should be deleted, the word node "して" is deleted and a new link is added from the word node "ニュース" (nyūsu) to the utterance end. If the top row of FIG. 4 (the initial recognition result) were entirely correct, the user would confirm it without changing any word; in that case too, just as when a word is changed, the information on correct and erroneous words can be used to discriminatively adapt and update the acoustic model 14.
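Addition and deletion of word nodes can be sketched on the same kind of adjacency structure. The helper names `add_word` and `delete_word`, the romanized node names, and in particular the position chosen for the new node "wadai" (between "no" and "desu") are assumptions for this example; the description does not state where the added node attaches.

```python
LATTICE = {
    "<s>": ["tsugi"],
    "tsugi": ["no"],
    "no": ["nyuusu"],
    "nyuusu": ["shite", "desu"],
    "shite": ["</s>"],
    "desu": ["</s>"],
    "</s>": [],
}


def add_word(successors, word, after, before):
    """Insert a new node `word` reachable from `after` and leading to `before`."""
    updated = {n: list(succ) for n, succ in successors.items()}
    updated.setdefault(word, [])
    if word not in updated[after]:
        updated[after].append(word)
    for nxt in before:
        if nxt not in updated[word]:
            updated[word].append(nxt)
    return updated


def delete_word(successors, word):
    """Delete `word` and link each of its predecessors to its successors."""
    bypass = successors[word]
    updated = {}
    for node, succ in successors.items():
        if node == word:
            continue
        new_succ = []
        for nxt in succ:
            # A link into the deleted node is redirected to that node's successors.
            for target in (bypass if nxt == word else [nxt]):
                if target not in new_succ:
                    new_succ.append(target)
        updated[node] = new_succ
    return updated


added = add_word(LATTICE, "wadai", after="no", before=["desu"])
deleted = delete_word(LATTICE, "shite")
print(added["no"], added["wadai"])
print(deleted["nyuusu"])
```

Redirecting links on deletion produces the new link from "nyuusu" to the utterance end, as in the deletion case described above.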

Thus, according to this embodiment, an appropriate lattice can be reconstructed quickly and accurately for every pattern of replacement, addition, and deletion. The word replacements, additions, and deletions described above can be combined as appropriate.

The speech recognition apparatus 10 reconstructs the word lattice as described above and can output a more accurate word string for the unconfirmed portion to the error correction device 18 again as the recognized word string. Because the error correction device 18 always displays the latest recognized word string, the error correction operator may not need to correct any further errors and may be able to confirm the correct word string immediately.

<Differences in Processing Time in This Embodiment>
Next, the differences between the processing times in this embodiment are described with reference to the drawings. FIG. 8 is a diagram for explaining these differences. In FIG. 8, it is assumed that the speech from the utterance start up to time T1 has already been input to the speech recognition apparatus.

Because the acoustic analysis means 11 of the speech recognition apparatus 10 introduces a time lag of several tens of milliseconds, the word lattice generation means 12 is considered to be performing speech recognition processing (computing acoustic and language scores and generating a partial word lattice) on the portion at time T2. Outputting a stable and reliable maximum likelihood word string requires a further lag of several hundred milliseconds, so the maximum likelihood recognized word string up to time T3 is output, and the same character string is displayed on the error correction device 18. The operator performing error correction monitors audio that has been delayed so that it is nearly synchronized with the displayed character string, and needs some time to judge whether each word is correct and to correct wrong words, so the correct word string has been confirmed up to time T4.

Since the word lattice has been generated up to time T2, the word lattice between time T2 and time T4 is reconstructed based on the correct word string. If the maximum likelihood word string also changes automatically as a result, the portion between time T3 and time T4, whose final result is not yet confirmed, is updated on the display screen of the error correction device 18.

<Other Embodiments>
In the embodiment described above, the acoustic model identification learning means 17 updates the acoustic model 14, but the present invention is not limited to this. For example, the same learning can be applied to the language model included in the language model and pronunciation dictionary 13, and discriminatively training the language model as well achieves more accurate speech recognition. This is described below as another embodiment with reference to the drawings.

FIG. 9 shows an example of the functional configuration of a speech recognition apparatus according to another embodiment. Components of the speech recognition apparatus 40 shown in FIG. 9 that have substantially the same functions as those of the speech recognition apparatus 10 in FIG. 1 are given the same reference numerals, and their detailed description is omitted here.

Compared with the embodiment described above, the speech recognition apparatus 40 in FIG. 9 is additionally provided with language model identification learning means 41. That is, in this embodiment the language model identification learning means 41 also adapts the language model to the input correct word string, so that the language model and pronunciation dictionary 13 is updated successively in the same way as the acoustic model 14, achieving more accurate speech recognition.

In this case, the word lattice reconstruction means 16 can obtain data not only from the acoustic model 14 but also from the language model and pronunciation dictionary 13 to reconstruct the word lattice.

In addition to the embodiments described above, the present invention may be configured with only the word lattice reconstruction means 16, which reconstructs the word lattice, without performing discriminative learning on the acoustic model, the language model, or the like; a similar effect can still be obtained.

<Execution Program>
The speech recognition apparatus of the present invention described above can be implemented by a computer equipped with a CPU, a volatile storage medium such as RAM (Random Access Memory), a non-volatile storage medium such as ROM, input devices such as a mouse, a keyboard, and a pointing device, a display unit that displays images and data, and an interface for communicating with external devices.

Each function of the speech recognition apparatus 10 can therefore be realized by having the CPU execute a program that describes the function. These programs can also be stored on and distributed via recording media such as magnetic disks (floppy disks, hard disks, etc.), optical discs (CD-ROM, DVD, etc.), and semiconductor memory.

In other words, speech recognition processing can be realized by creating an execution program (speech recognition program) that causes a computer to execute the processing of each component described above and installing that program on, for example, a general-purpose personal computer or server.

Next, the speech recognition procedure carried out by the execution program of the present invention is described with reference to a flowchart.

<Example of the Speech Recognition Procedure>
FIG. 10 is a flowchart showing an example of the speech recognition procedure. FIG. 10 shows the flowchart for the whole of the speech recognition apparatus with sequential error correction.

First, when the speech recognition apparatus as a whole starts operating and the speech to be recognized begins to be input (S01), acoustic analysis is performed to extract preset acoustic features (S02), and the start of an utterance is detected (S03). Then, using the language model, pronunciation dictionary, acoustic model, and so on described above, a network of multiple words that are possible recognition candidates, that is, a word lattice, is generated (S04).

Once the word lattice has been generated, a stable and reliable maximum likelihood word string is selected and output to the error correction procedure and to the feedback procedure (S05). The error correction procedure corrects errors and confirms correct answers in parallel with the speech recognition processing. These procedures correspond to the processing performed by the error correction device 18 described above.

It is then determined whether a correct word string has been input by the error correction procedure (S06). If a correct word string has been input (YES in S06), discriminative learning of the acoustic model (S07), reconstruction of the word lattice (S08), and selection and output of the maximum likelihood word string (S09) are performed. The maximum likelihood word string is again subjected to error correction and confirmation in the error correction procedure, in parallel with the subsequent speech recognition processing (S10).

Next, after S10 ends, or when the result is confirmed in S06 without a correct word string being input (NO in S06), it is determined whether the speech recognition processing has reached the end of the utterance (S11). If it has not (NO in S11), the process returns to S04 and repeats the processing from word lattice generation onward. If the end of the utterance has been reached (YES in S11), it is determined whether the entire speech recognition process should end (S12). If not (NO in S12), the process returns to S03 and continues from detection of the start of the next utterance. If an instruction to end the speech recognition process has been received (YES in S12), the whole process ends.
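The branching of steps S01 to S12 can be traced with a short simulation. This sketch models only the control flow: the function name `trace_steps`, the representation of each utterance as a number of lattice-generation passes, and the set of (utterance, pass) pairs at which a correction arrives are assumptions for this example, and acoustic analysis (S02) is shown once although it runs continuously in practice.

```python
def trace_steps(passes_per_utterance, corrected):
    """Return the sequence of step labels visited by the flowchart.

    passes_per_utterance -- list with the number of S04..S11 passes per utterance
    corrected            -- set of (utterance index, pass index) pairs at which
                            a correct word string is input (YES in S06)
    """
    steps = ["S01", "S02"]
    for u, n_passes in enumerate(passes_per_utterance):
        steps.append("S03")                        # detect start of utterance
        for p in range(n_passes):
            steps += ["S04", "S05", "S06"]         # lattice, best string, check input
            if (u, p) in corrected:
                steps += ["S07", "S08", "S09", "S10"]
            steps.append("S11")                    # end of utterance reached?
        steps.append("S12")                        # end of whole process?
    return steps


print(trace_steps([2], {(0, 0)}))
```

In the trace, a NO at S11 appears as the jump back to S04, and a NO at S12 appears as the jump back to S03 for the next utterance.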

The utterance start and end detection described above can operate with any known utterance segment detection method; for example, the technique described in Japanese Unexamined Patent Application Publication No. 2007-233148, filed by the present applicant, can be used. Likewise, the speech recognition procedure that uses the language model, pronunciation dictionary, and acoustic model to generate a network of multiple words that are possible recognition candidates, that is, a word lattice, can operate with any known speech recognition method; for example, the technique described in Japanese Patent No. 3834169 can be used. In the present invention, these methods are not limited to what is described in the above publications.

As described above, the present invention can improve the accuracy of speech recognition. Specifically, confirmations of speech recognition results and error correction information are fed back to the speech recognition apparatus online; the acoustic model is discriminatively adapted from the correspondence between correct and incorrect words; and the word lattice, which is the network of candidate words for speech recognition, is corrected automatically, reconstructed, and rescored. Recognition results with higher accuracy can thereby be output progressively.

In other words, the present invention can quickly improve accuracy even for portions that have already been recognized but are not yet confirmed as a final result such as subtitle text, reduce the workload of the error correction operator, and deliver final results such as subtitle text more accurately and with less delay.

Besides subtitle production, the present invention can be used widely in speech-to-text systems in which speech recognition results are checked and corrected online, such as the preparation of minutes and transcripts for meetings, assemblies, lectures, and courts, and voice input on mobile phones and the like.

Although preferred embodiments of the present invention have been described in detail above, the present invention is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention as set forth in the claims.

Description of Symbols
1 Speech recognition system
10, 40 Speech recognition apparatus
11 Acoustic analysis means
12 Word lattice generation means
13 Language model and pronunciation dictionary
14 Acoustic model
15 Maximum likelihood word string selection means
16 Word lattice reconstruction means
17 Acoustic model identification learning means
18 Error correction device
21 Recognized word string display means
22 Error correction means
23 Information output means
30 Display screen
31 Partition
41 Language model identification learning means

Claims

A speech recognition apparatus that performs speech recognition using a speech recognition result and an error correction result for input speech, the apparatus comprising:
Acoustic analysis means for extracting acoustic feature quantities of the input speech;
Word lattice generation means for generating, using a preset acoustic model, language model, and pronunciation dictionary, a word lattice consisting of a network of candidate words corresponding to the acoustic feature quantities of the input speech obtained by the acoustic analysis means;
Maximum likelihood word string selection means for selecting a maximum likelihood word string from the word lattice obtained by the word lattice generation means;
Acoustic model identification learning means for learning the acoustic model using a corrected word string obtained by correcting the maximum likelihood word string; and
Word lattice reconstruction means for reconstructing the word lattice for the corrected word string using the acoustic model learned by the acoustic model identification learning means,
Wherein the maximum likelihood word string selection means further selects, using the word lattice obtained by the word lattice reconstruction means, a maximum likelihood word string for the word string that follows the corrected word string.
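
As an illustration only, the selection of a maximum likelihood word string from a word lattice can be sketched as a best-path search over a directed acyclic graph. The edge layout `(from_node, to_node, word, log_score)`, the function name `best_word_string`, and the toy scores are assumptions made for this sketch and do not come from the specification; a real lattice would combine acoustic and language model scores.

```python
from collections import defaultdict

def best_word_string(edges, start, goal):
    """Return (log_score, words) of the best path from start to goal.

    Assumption for this sketch: nodes are integers and every edge
    satisfies from_node < to_node, so visiting nodes in increasing
    order is a valid topological order.
    """
    outgoing = defaultdict(list)
    for src, dst, word, score in edges:
        outgoing[src].append((dst, word, score))

    best = {start: (0.0, [])}  # node -> (best log score, words so far)
    for node in range(start, goal):
        if node not in best:
            continue  # node is not reachable from start
        total, words = best[node]
        for dst, word, score in outgoing[node]:
            candidate = total + score
            if dst not in best or candidate > best[dst][0]:
                best[dst] = (candidate, words + [word])
    return best.get(goal)  # None if goal is unreachable

# Hypothetical toy lattice with two competing words per segment.
lattice = [
    (0, 1, "recognize", -1.0),
    (0, 1, "wreck a nice", -1.5),
    (1, 2, "speech", -0.5),
    (1, 2, "beach", -2.0),
]
score, words = best_word_string(lattice, 0, 2)
print(words, score)
```

Here the search keeps, for each node, only the highest-scoring partial word string, which is sufficient because the scores are additive along a path.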

The speech recognition apparatus according to claim 1, wherein the word lattice reconstruction means reconstructs the word lattice in whole or in part by, among the candidate words included in the initial word lattice for the input speech, replacing incorrect words with correct words, newly inserting missing correct words, and deleting words that cannot be connected to a correct word.
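
The replace, insert, and delete operations recited above can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of the embodiment: the lattice is a list of `(from_node, to_node, word, log_score)` edges with integer nodes, a correction is given as a node span plus a non-empty list of correct words, and the names `reconstruct_lattice` and `correct_score` are hypothetical.

```python
def reconstruct_lattice(edges, span, correct_words, correct_score=0.0):
    """Rebuild a lattice so that only the correct words cover `span`.

    Assumptions for this sketch: the original nodes are integers with
    from_node < to_node, and correct_words is non-empty.
    """
    if not correct_words:
        raise ValueError("this sketch needs at least one correct word")
    a, b = span

    # Delete every candidate that overlaps the corrected span: it is
    # either a wrong word or cannot be connected to the correct words.
    kept = [e for e in edges if e[1] <= a or e[0] >= b]

    # Insert the correct words as a chain a -> ... -> b, creating
    # fresh intermediate nodes (tuples) when there are several words.
    chain = []
    prev = a
    for i, word in enumerate(correct_words):
        last = i == len(correct_words) - 1
        nxt = b if last else ("fix", a, b, i)
        chain.append((prev, nxt, word, correct_score))
        prev = nxt
    return kept + chain

# Hypothetical example: the operator corrects "whether" to "weather".
lattice = [
    (0, 1, "the", -0.2),
    (1, 2, "whether", -0.9),
    (1, 3, "whether is", -2.0),  # straddles the span boundary
    (2, 3, "is", -0.3),
]
rebuilt = reconstruct_lattice(lattice, (1, 2), ["weather"])
print(rebuilt)
```

Candidates lying entirely before or after the corrected span are kept, so the words following the correction can still be rescored on the rebuilt lattice.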

The speech recognition apparatus according to claim 1, wherein, when a correct word string for the same input speech is acquired a plurality of times, the acoustic model identification learning means learns the acoustic model using only the statistical information of the latest correct word string and deletes the statistical information of the older correct word strings.
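
The bookkeeping described above, in which only the latest correction for a given input contributes to learning, can be sketched as below. Word counts stand in for the statistical information used in acoustic model learning; that substitution, the class name `CorrectionStats`, and the `utterance_id` key are assumptions of this sketch.

```python
from collections import Counter

class CorrectionStats:
    """Accumulate statistics, keeping only the latest correction per input."""

    def __init__(self):
        self.latest = {}         # utterance_id -> statistics of latest correction
        self.totals = Counter()  # statistics summed over all utterances

    def update(self, utterance_id, correct_words):
        old = self.latest.get(utterance_id)
        if old is not None:
            # Remove the contribution of the superseded correction.
            self.totals.subtract(old)
        new = Counter(correct_words)  # stand-in for real sufficient statistics
        self.latest[utterance_id] = new
        self.totals.update(new)
        self.totals += Counter()      # drop entries whose count fell to zero

stats = CorrectionStats()
stats.update("utt1", ["the", "whether"])  # first, still imperfect correction
stats.update("utt1", ["the", "weather"])  # later correction of the same input
print(dict(stats.totals))
```

Subtracting the old statistics before adding the new ones keeps the accumulated totals from being biased by corrections that the operator has since revised.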

The speech recognition apparatus according to claim 3, wherein the word lattice reconstruction means repeatedly reconstructs the word lattice until the corrected word string becomes the correct word string.

A speech recognition system comprising: the speech recognition apparatus according to any one of claims 1 to 4; and an error correction device that performs error correction on a speech recognition result obtained from the speech recognition apparatus,
Wherein the error correction device includes:
Word string display means for displaying on a screen the latest recognized word strings sequentially input from the speech recognition apparatus;
Error correction means for correcting errors in the recognized word string displayed by the word string display means; and
Information output means for outputting the correct word string obtained by the error correction means to an external device and/or feeding it back to the speech recognition apparatus.

A speech recognition program for performing speech recognition using a speech recognition result and an error correction result for input speech, the program causing a computer to function as:
Acoustic analysis means for extracting acoustic feature quantities of the input speech;
Word lattice generation means for generating, using a preset acoustic model, language model, and pronunciation dictionary, a word lattice consisting of a network of candidate words corresponding to the acoustic feature quantities of the input speech obtained by the acoustic analysis means;
Maximum likelihood word string selection means for selecting a maximum likelihood word string from the word lattice obtained by the word lattice generation means;
Acoustic model identification learning means for learning the acoustic model using a corrected word string obtained by correcting the maximum likelihood word string; and
Word lattice reconstruction means for reconstructing the word lattice for the corrected word string using the acoustic model learned by the acoustic model identification learning means,
Wherein the maximum likelihood word string selection means further selects, using the word lattice obtained by the word lattice reconstruction means, a maximum likelihood word string for the word string that follows the corrected word string.