JP6277659B2

JP6277659B2 - Speech recognition apparatus and speech recognition method

Info

Publication number: JP6277659B2
Application number: JP2013214411A
Authority: JP
Inventors: 伍井 啓恭
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2013-10-15
Filing date: 2013-10-15
Publication date: 2018-02-14
Anticipated expiration: 2033-10-15
Also published as: JP2015079035A

Description

The present invention relates to a speech recognition technique that improves recognition accuracy by adapting a language model to the input speech and thereby improving the coverage of word chains.

Speech recognition technology, which converts speech into text, is useful. In many fields, such as transcribing utterances in the medical and legal fields and creating subtitles for broadcasting, it is expected to be applied, or is already being applied, to improve text input efficiency and to make database searches by text input easier.
However, speech recognition results may contain recognition errors, and how to reduce them is a major challenge for speech recognition technology. Current speech recognition technology generally uses an acoustic model, which associates speech features with phonemes, and a language model, which expresses the relationships between chained words.

The technical terms used below follow Prior Art Document 1: Kiyohiro Shikano, Katsunobu Ito, Tatsuya Kawahara, Kazuya Takeda, and Mikio Yamamoto, "Speech Recognition System" (音声認識システム), Ohmsha, Ltd., May 15, 2001, pp. 53-175 (hereinafter, Textbook 1); Prior Art Document 2: Kenji Kita and Junichi Tsujii, "Probabilistic Language Models" (確率的言語モデル), University of Tokyo Press, November 25, 1999, pp. 57-99 (hereinafter, Textbook 2); or Prior Art Document 3: Makoto Nagao, "Natural Language Processing" (自然言語処理), Iwanami Shoten, April 26, 1996, pp. 118-137 (hereinafter, Textbook 3).

Methods that use the N-gram model described in Textbooks 1 to 3 have attracted attention as language models for recognizing speech accurately. Because an N-gram language model learns its N-grams from a corpus, it is known to suffer from the sparseness problem: word chains that do not appear in the corpus cause recognition errors.

To address this sparseness problem, techniques for adapting a language model to the input speech have been proposed. For example, Patent Document 1 discloses a technique that, based on a first speech recognition result, selects appropriate language models from among language models organized into a hierarchy in advance, mixes the selected language models to generate a single language model, and thereby adapts the language model to the input speech.

Patent Document 1: WO2008/004666 (Fig. 3)

In the conventional speech recognition apparatus described above, a word chain in the input speech that has not been learned by any of the selected language models remains an unlearned word chain in the mixed language model as well, so such unlearned word sequences are still likely to be misrecognized.

The present invention has been made to solve the above problem. Its object is to obtain a speech recognition apparatus equipped with a language model that learns the word chains of a speech recognition result and can thereby improve the coverage of word chains that were not included in the original language model.

The speech recognition apparatus of the present invention comprises: a second language model storage unit that stores a mixed language model; a second speech recognition unit that recognizes an input speech signal using the mixed language model and outputs a second speech recognition result; a first speech recognition unit that recognizes the input speech signal using a first language model and outputs a first speech recognition result; and an N-gram addition unit that receives the first speech recognition result from the first speech recognition unit, creates N-grams based on the word sequence of the first speech recognition result, and adds the created N-grams to the mixed language model stored in the second language model storage unit.

The speech recognition method of the present invention comprises: a first speech recognition procedure in which a first speech recognition unit performs speech recognition on input speech with reference to a first language model; a second speech recognition procedure in which a second speech recognition unit performs speech recognition on the input speech with reference to a mixed language model; and an N-gram addition procedure in which an N-gram addition unit creates N-grams based on the word sequence of the speech recognition result of the first speech recognition procedure and adds the created N-grams to the mixed language model.

As described above, the speech recognition apparatus according to the present invention updates the second language model, which is a mixed language model, based on the word sequence recognized with the first language model. This adapts the second language model to the input speech, improves the coverage of the word chains that appear in the input speech, and improves speech recognition performance.
The speech recognition method according to the present invention carries out a procedure that updates the second language model, which is a mixed language model, based on the word sequence that the first speech recognition procedure recognizes from the input speech. This likewise adapts the second language model to the input speech, improves the coverage of the word chains that appear in the input speech, and improves speech recognition performance.

Block diagram showing the configuration of the speech recognition apparatus of Embodiment 1 of the present invention.
Flowchart of the processing of the speech recognition apparatus of Embodiment 1.
Diagram showing an example of the word sequence of the recognition result of the first speech recognition unit of the speech recognition apparatus of Embodiment 1.
Diagram showing an example of the learning example sentence information of the second speech recognition unit before the update, in the speech recognition apparatus of Embodiment 1.
Diagram showing an example of the N-grams of the second language model before the update, in the speech recognition apparatus of Embodiment 1.
Diagram showing an example of the learning example sentence information of the second speech recognition unit after the update, in the speech recognition apparatus of Embodiment 1.
Diagram showing an example of the N-grams of the second language model after the update, in the speech recognition apparatus of Embodiment 1.
Diagram showing an example of the probability of each word of the correct word sequence based on the updated second language model, in the operation example of the speech recognition apparatus of Embodiment 1.
Diagram showing an example of the probability of each word of a word sequence containing an error based on the updated second language model, in the operation example of the speech recognition apparatus of Embodiment 1.
Diagram comparing word probabilities between the updated second language model of the character string search apparatus of Embodiment 1 and a language model that mixes the first language model with the second language model before the update.
Block diagram showing the configuration of a modification of Embodiment 1 in which the second language model is updated with the recognition result of the second speech recognition unit.
Diagram showing an example of learning example sentences when the second language model is updated with the recognition result of the second speech recognition unit in Embodiment 1.
Diagram showing an example of the second language model updated with the recognition result of the second speech recognition unit in Embodiment 1.
Diagram comparing word probabilities of the second language model when it is updated with the recognition result of the first speech recognition unit and when it is updated with the recognition result of the second speech recognition unit in Embodiment 1.
Block diagram showing the configuration of the speech recognition apparatus of Embodiment 2 of the present invention.
Flowchart of the processing of the speech recognition apparatus of Embodiment 2.
Diagram showing an example of the recognition result of the third speech recognition unit of the speech recognition apparatus of Embodiment 2.
Diagram showing an example of the learning example sentences updated by the N-gram addition unit of the speech recognition apparatus of Embodiment 2.
Diagram showing an example of the updated second language model of the speech recognition apparatus of Embodiment 2.
Diagram showing an example of the updated second language model of the speech recognition apparatus of Embodiment 2.
Block diagram showing the configuration of the speech recognition apparatus of Embodiment 3 of the present invention.
Diagram showing an example of weighted learning example sentence information in the speech recognition apparatus of Embodiment 3.
Diagram showing an example of weighted learning example sentence information in the speech recognition apparatus of Embodiment 3.

Embodiments of the present invention are described below with reference to the drawings. In the drawings, identical or corresponding parts are given the same reference numerals.

The specific operation examples below are explained using the results of experiments conducted with Julius-4.2.2 (http://julius.sourceforge.jp; hereinafter Julius-4.2.2 is also written simply as Julius), an open-source large-vocabulary continuous speech recognition engine, as one example of a speech recognition engine that performs the speech recognition processing.
As the acoustic model, for example, hmmdefs_ptm_gid.binhmm included in the Julius dictation execution kit can be used. As the tool that obtains N-gram probabilities in the language model update processing (hereinafter, the language model creation tool), for example, the CMU-Cambridge Statistical Language Modeling Toolkit described in Textbook 1 can be used.
The embodiments below describe the case where the N-gram order is 3 (N = 3). However, the present invention does not limit the N-gram order to 3; 2-grams, or higher-order N-grams of order 4 or more, may also be used.

Embodiment 1.
FIG. 1 is a diagram showing the configuration of the speech recognition apparatus according to Embodiment 1 of the present invention. The speech recognition apparatus of Embodiment 1 comprises a speech input unit 101, a first speech recognition unit 102, a first language model storage unit 103, a first acoustic model storage unit 104, an N-gram addition unit 105, a second speech recognition unit 106, a second language model (mixed language model) storage unit 107, and a second acoustic model storage unit 108.

The speech input unit 101 converts speech uttered by the user into a digital speech signal that can be processed digitally. The digital speech signal output by the speech input unit 101 is input to the first speech recognition unit 102 and the second speech recognition unit 106.
The first speech recognition unit 102 performs speech recognition on the input digital speech signal with reference to the first acoustic model stored in the first acoustic model storage unit 104 and the first language model stored in the first language model storage unit 103. The first speech recognition unit 102 outputs the word sequence of the speech recognition result (the recognized sentence) to the N-gram addition unit 105. The N-gram addition unit 105 extracts N-grams from the input word sequence, calculates their N-gram probabilities, and adds the N-grams to the second language model stored in the second language model storage unit 107.

The second speech recognition unit 106, which receives the digital speech signal from the speech input unit 101, performs speech recognition with reference to the second language model, which is the mixed language model stored in the second language model storage unit 107, and the second acoustic model stored in the second acoustic model storage unit 108, and outputs the word sequence of the speech recognition result. The word sequence output from the second speech recognition unit 106 is used in subsequent processing, for example by being displayed on a display unit 109.

The acoustic models above (the first acoustic model and the second acoustic model) are databases that hold patterns of standard speech feature values in units such as phonemes. Examples of speech feature values include MFCC (Mel Frequency Cepstrum Coefficient) and ΔMFCC. In speech recognition processing, an acoustic model is matched against the feature values of the input speech and is used to recognize the input speech phoneme by phoneme. Here, as an example, the first and second acoustic models are both the same acoustic model mentioned above, but they may be different acoustic models.

The language models (the first language model and the second language model) are databases that define constraints on word chaining (the occurrence probabilities of word chains) and are built on the N-gram model. In speech recognition processing, these language models are referred to based on the phoneme-level recognition results obtained with the acoustic model.

A well-known method may be applied to the processing in which the speech recognition units (the first speech recognition unit 102 and the second speech recognition unit 106) obtain a recognized sentence from the input speech with reference to the acoustic model and the language model. Here, as an example, Julius is used, as described above.

The speech input unit 101 in FIG. 1 can be composed of a microphone that picks up speech and circuits such as an AD (Analog Digital) converter that converts analog speech into digital data. The first speech recognition unit 102, the N-gram addition unit 105, and the second speech recognition unit 106 can be composed of a processor, peripheral circuits such as RAM (Random Access Memory), and software executed by the processor. The first language model storage unit 103, the first acoustic model storage unit 104, the second language model storage unit 107, and the second acoustic model storage unit 108 can be composed of storage devices such as hard disks.

Alternatively, a client-server configuration may be adopted in which, for example, the first speech recognition unit 102, the first language model storage unit 103, and the first acoustic model storage unit 104 are implemented as functions of a server accessible over a network, and the second speech recognition unit 106, the second language model storage unit 107, the second acoustic model storage unit 108, and the N-gram addition unit 105 are implemented as client functions.

The likelihood calculation in the speech recognition processing performed by the speech recognition units (the first speech recognition unit 102 and the second speech recognition unit 106) is described next. Suppose that a word sequence W consists of n words w_1 to w_n, as defined by Equation 1 below. The probability (likelihood) of the word sequence W is then given by Equation 2 below. Here, P(w_i|w_i-2,w_i-1) is the probability that the word w_i appears following the words w_i-2 and w_i-1 (the N-gram probability of a third-order N-gram).
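Equations 1 and 2 are not reproduced in this text. Based only on the definitions in the preceding paragraph, they can plausibly be reconstructed as follows. This is a reconstruction, not the published drawing of the equations, and the treatment of the first two words (where w_i-2 or w_i-1 would fall before the start of the sequence) is an assumption left to sentence-start padding.

```latex
% Equation 1 (reconstruction): the word sequence W of n words
W = w_1, w_2, \ldots, w_n

% Equation 2 (reconstruction): probability (likelihood) of W under a third-order N-gram
P(W) = \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})
```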

The probability of a word sequence is given by Equation 2 above, but to determine the most likely word sequence it is sufficient to be able to compare the magnitudes of the candidates' probabilities. In actual processing, therefore, as shown in Equation 3 below, the probability P(w_i|w_i-2,w_i-1) of each word w_i in a word sequence is converted to a logarithm (called the log probability), the sum X(W) of the log probabilities is obtained for each word sequence, and the X(W) values of the candidate word sequences are compared.
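As an illustration of the comparison described above, the following minimal Python sketch sums per-word log probabilities and picks the candidate with the larger sum. The function name, the choice of base-10 logarithms, and the probability values are assumptions made for this example and do not come from the patent; any logarithm base preserves the ordering of the candidates.

```python
import math

def log_prob_sum(word_probs):
    """X(W): sum of the log probabilities of the words in one candidate sequence."""
    # Base 10 is an assumption; the ranking of candidates does not depend on the base.
    return sum(math.log10(p) for p in word_probs)

# Hypothetical per-word probabilities P(w_i | w_i-2, w_i-1) for two candidates.
candidate_a = [0.5, 0.25, 0.5]    # product = 0.0625
candidate_b = [0.5, 0.125, 0.5]   # product = 0.03125

# Comparing the sums of logs gives the same ranking as comparing the products.
best = max([candidate_a, candidate_b], key=log_prob_sum)
```

Working with sums of logarithms instead of products also avoids numerical underflow when sequences are long.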

It is the language model that provides the probability of each word described above. Hereinafter, the likelihood based on the language model is called the language likelihood.

In speech recognition processing, candidates are evaluated by the total likelihood f(h), obtained by adding the language likelihood described above and the likelihood based on the acoustic model (the acoustic likelihood) according to Equation 4 below, and the maximum-likelihood candidate is taken as the speech recognition result. In Equation 4, h is a candidate word sequence for the speech recognition result, AC(h) is the acoustic likelihood of h, LM(h) is the language likelihood of h, and n is the number of words in the candidate word sequence. LM_WEIGHT is the language model weight, and LM_PENALTY is the word insertion penalty. The operation examples below use the results obtained with these set to 8.0 and -2.0, respectively.
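Equation 4 itself is not reproduced in this text. The sketch below assumes the weighted-sum form commonly used by decoders, f(h) = AC(h) + LM_WEIGHT * LM(h) + LM_PENALTY * n, which is consistent with the symbols defined above but is not confirmed by the text. The function names and the candidate scores are hypothetical.

```python
LM_WEIGHT = 8.0    # language model weight used in the operation examples
LM_PENALTY = -2.0  # word insertion penalty used in the operation examples

def total_likelihood(ac, lm, n, lm_weight=LM_WEIGHT, lm_penalty=LM_PENALTY):
    """Assumed form of f(h): AC(h) + LM_WEIGHT * LM(h) + LM_PENALTY * n."""
    return ac + lm_weight * lm + lm_penalty * n

def best_candidate(candidates):
    """Return the (words, ac, lm) tuple with the highest total likelihood."""
    return max(candidates, key=lambda c: total_likelihood(c[1], c[2], len(c[0])))

# Hypothetical candidates: (word sequence, acoustic log-likelihood, language log-likelihood)
candidates = [
    (["a", "b", "c"], -100.0, -5.0),  # f = -100 - 40 - 6 = -146
    (["a", "d"], -102.0, -4.0),       # f = -102 - 32 - 4 = -138
]
winner = best_candidate(candidates)
```

In this made-up case the second candidate wins despite its lower acoustic likelihood, because the language likelihood is weighted by 8.0 and it contains fewer words.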

Next, the operation of the speech recognition apparatus of this embodiment is described. FIG. 2 shows the processing flow of the speech recognition apparatus of Embodiment 1. The explanation below uses as a concrete example the case where the sentence "しゅーずけーすからーこーどおねがいします" (written "シューズケースカラーコードお願いします", that is, "shoe case color code, please") is uttered.
First, the speech input unit 101 converts the speech uttered by the user into an analog electrical signal (ST201), and the AD converter then converts the input speech, which is an analog electrical signal, into digital information (ST202).

Next, the first speech recognition unit 102 performs speech recognition based on the digital information of the input speech (ST203). In ST203, speech feature values are first extracted from the digital information of the input speech at appropriate time intervals (for example, every 60 milliseconds). Using the extracted feature values, the first acoustic model stored in the first acoustic model storage unit 104 is referred to, phoneme-level recognition is performed, and candidate word sequences for the recognition result are obtained together with their acoustic likelihoods based on the acoustic model. Then, based on each candidate word sequence, the first language model stored in the first language model storage unit 103 is referred to and the language likelihood based on the language model is obtained. The acoustic likelihood and the language likelihood are judged together using the result of the calculation in Equation 4, as described above, and the word sequence that best matches the input speech (that is, the maximum-likelihood word sequence) is obtained as the recognized sentence (the first speech recognition result).

When Julius recognizes the speech input "しゅーずけーすからーこーどおねがいします" described above, the word sequence shown in FIG. 3 is output as the recognition result. Here, as an example of the first language model, NP12y.60k.4.arpa from the 2002 edition of the software of the Continuous Speech Recognition Consortium of the Information Processing Society of Japan is used (60K-word N-gram; training data: newspaper article data from the Mainichi Newspapers, "CD-毎日新聞 91〜2002年版"; morphological analysis: chasen 2.2.1 + ipadic 2.4.1; text size: 350 million morphemes; vocabulary size: 60156). "カラーコード" ("color code") has not been learned by this language model and is an unknown word in it.

In FIG. 3, <s> and </s> are symbols indicating the beginning and end of a sentence, respectively. In "し:シ:する:227", for example, "し" is the written form of the recognized word, "シ" is its reading, "する" is its base form, and "227" is a code representing its part of speech (part-of-speech code). The word sequence of the recognition result is thus "シューズケースから行動お願いします" ("action from the shoe case, please"). In this recognition result, "カラーコード" ("color code") has been misrecognized as "から行動" ("from" followed by "action"; から:カラ:から:63 行動:コードー:行動:505). This is because "カラーコード" is an unknown word in the first language model, a competing candidate with a similar reading exists, and that competing candidate was judged to be the most likely.
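The token layout described above (written form, reading, base form, and part-of-speech code separated by colons) can be split mechanically. The following Python sketch is only an illustration: the `Token` type and the function names are hypothetical, and it assumes that no field contains a colon itself.

```python
from typing import NamedTuple

class Token(NamedTuple):
    surface: str    # written form
    reading: str    # reading
    base_form: str  # base (dictionary) form
    pos_code: int   # part-of-speech code

def parse_token(text):
    """Split one 'surface:reading:base:pos' item into its four fields."""
    surface, reading, base_form, pos_code = text.split(":")
    return Token(surface, reading, base_form, int(pos_code))

def parse_sequence(line):
    """Parse a whitespace-separated word sequence, skipping <s> and </s>."""
    return [parse_token(item) for item in line.split()
            if item not in ("<s>", "</s>")]

# The misrecognized part of the result quoted in the text.
tokens = parse_sequence("<s> から:カラ:から:63 行動:コードー:行動:505 </s>")
surface_text = "".join(t.surface for t in tokens)
```

Joining the written forms gives "から行動", while joining the readings gives "カラコードー", which shows how close the competing candidate is to the reading of "カラーコード".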

Next, the N-gram addition unit 105 adds N-grams based on the word sequence recognized by the first speech recognition unit 102 to the second language model stored in the second language model storage unit 107. Unlike the first language model, the second language model is a language model whose N-grams were learned from specialized example sentences in a particular field. The second language model storage unit 107 is assumed to store the example sentences used for learning (learning example sentences) as learning example sentence information (information about the example sentences used to learn the N-grams). This learning example sentence information may instead be stored in a storage medium such as RAM (Random Access Memory) separate from the second language model storage unit 107 (a learning example sentence information storage unit). In this embodiment the example sentences themselves serve as the learning example sentence information, but other information usable for calculating N-gram probabilities, such as the number of occurrences of each word in the example sentences, may be used instead.

Suppose now that the second language model has learned N-grams from two example sentences, "シューズケース" ("shoe case") and "カラーコード" ("color code"). FIG. 4 shows these two example sentences as stored in the learning example sentence information. As in FIG. 3, <s> indicates the beginning of a sentence and </s> the end, and each word is shown with its written form, reading, base form, and part-of-speech code. FIG. 5 shows the N-grams for these two example sentences learned by the second language model. Since the N-gram order is N = 3, 1-grams, 2-grams, and 3-grams have been learned.
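To make the relationship between example sentences and N-grams concrete, the following Python sketch enumerates the 1-grams, 2-grams, and 3-grams of the two example sentences and accumulates their counts. It is a simplified illustration: the function names are hypothetical, each word is represented by its written form only, and the sketch stops at raw counts, whereas the probabilities in FIG. 5 come from a language model creation tool that also applies smoothing.

```python
from collections import Counter

def extract_ngrams(words, max_order=3):
    """Return every N-gram of order 1..max_order, with <s> and </s> added."""
    tokens = ["<s>", *words, "</s>"]
    ngrams = []
    for order in range(1, max_order + 1):
        for start in range(len(tokens) - order + 1):
            ngrams.append(tuple(tokens[start:start + order]))
    return ngrams

def add_sentence(counts, words, max_order=3):
    """Add the N-grams of one sentence to an existing count table."""
    counts.update(extract_ngrams(words, max_order))
    return counts

# The two learning example sentences, one word each.
counts = Counter()
add_sentence(counts, ["シューズケース"])
add_sentence(counts, ["カラーコード"])
```

Adding a recognized word sequence later is the same operation: calling `add_sentence` again with the new words extends the count table from which probabilities would be re-estimated.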

In FIG. 5, for example, in the fourth line of \1-grams, "-0.9031 カラーコード:カラーコード:カラーコード:507 0.0000", the part "カラーコード:カラーコード:カラーコード:507" is the N-gram (the 1-gram "color code", shown with the same written form, reading, base form, and part-of-speech code as in FIG. 3), and "-0.9031" is the log probability of this N-gram. "0.0000" is the logarithm of the back-off coefficient, which is used when a higher-order N-gram does not exist in the language model: back-off smoothing based on Good-Turing estimation then estimates the probability of the missing higher-order N-gram from the probability of a lower-order N-gram.
Similarly, in the second line of \3-grams, "-0.3010 <s> カラーコード:カラーコード:カラーコード:507 </s>", the part "<s> カラーコード:カラーコード:カラーコード:507 </s>" is the N-gram (the 3-gram "sentence start, color code, sentence end") and "-0.3010" is its N-gram probability (likewise a log probability). In an N-gram language model of order 3, 3-grams are never used to estimate the probability of a higher-order N-gram, so they have no back-off coefficient.
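The two lines quoted above can be read mechanically. The following Python sketch assumes that FIG. 5 follows ARPA-format conventions, in which each line holds a base-10 log probability, the words of the N-gram, and an optional base-10 back-off weight; the text does not state the logarithm base, so that is an assumption, and the function names are hypothetical.

```python
def parse_arpa_line(line, order):
    """Split one N-gram line into (log10 probability, words, log10 back-off or None)."""
    fields = line.split()
    log_prob = float(fields[0])
    words = tuple(fields[1:1 + order])
    # The highest-order N-grams carry no back-off weight.
    backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
    return log_prob, words, backoff

def to_probability(log10_value):
    """Convert a base-10 log value back to a linear value."""
    return 10.0 ** log10_value

uni = parse_arpa_line("-0.9031 カラーコード:カラーコード:カラーコード:507 0.0000", 1)
tri = parse_arpa_line("-0.3010 <s> カラーコード:カラーコード:カラーコード:507 </s>", 3)
```

Under the base-10 assumption, -0.9031 corresponds to a probability of about 0.125, -0.3010 to about 0.5, and a back-off value of 0.0000 to a coefficient of 1.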

The processing that estimates a higher-order N-gram probability from a lower-order N-gram probability using the back-off coefficient is described here. The calculation that estimates a third-order N-gram probability using second-order N-gram probabilities is defined by the following pseudo program code (pseudocode).
P(wd3|wd1,wd2) = if(trigram exists) p_3(wd1,wd2,wd3)
else if(bigram wd1,wd2 exists) bo_wt_2(wd1,wd2) * P(wd3|wd2)
else P(wd3|wd2)
In this pseudocode, wd1, wd2, and wd3 denote words. As described above, P(wd3|wd1,wd2) is the probability that wd3 is generated after wd1 and wd2 (the probability of the word wd3). The same applies to P(wd3|wd2). p_3(wd1,wd2,wd3) is the 3-gram probability of the word string wd1,wd2,wd3, and bo_wt_2(wd1,wd2) is the 2-gram back-off coefficient of the word string wd1,wd2.
That is, if the 3-gram of the word string wd1,wd2,wd3 exists in the language model, the probability of that 3-gram is the probability of the word wd3. If the 3-gram of wd1,wd2,wd3 does not exist but the 2-gram of wd1,wd2 does, the product of the 2-gram back-off coefficient of wd1,wd2 and P(wd3|wd2) is the probability of the word wd3. If the 2-gram of wd1,wd2 does not exist either, P(wd3|wd2) is the probability of the word wd3.

Similarly, the estimate of a 2-gram probability from 1-gram probabilities is defined by the following pseudocode.
P(wd2|wd1) = if(bigram exists) p_2(wd1,wd2)
else bo_wt_1(wd1) * p_1(wd2)
Here, p_2(wd1,wd2) is the 2-gram probability of the word sequence wd1, wd2, and p_1(wd2) is the 1-gram probability of the word wd2. bo_wt_1(wd1) is the back-off coefficient of the 1-gram wd1.
The pseudocode above is written in terms of products. When the calculation is carried out with log probabilities, sums are taken instead of products.
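The two pseudocode rules above can be folded into one recursive lookup in the log domain. The following Python sketch is an illustration only, not the implementation used in the embodiment: the dictionary layout (`logp`, `bo`), the function name `backoff_logprob`, and the toy words and values are all assumptions made for this example.

```python
from typing import Dict, Tuple

NGram = Tuple[str, ...]

def backoff_logprob(words: NGram, logp: Dict[NGram, float], bo: Dict[NGram, float]) -> float:
    """Return log10 P(words[-1] | words[:-1]) following the back-off rule."""
    if words in logp:                 # the N-gram itself is in the model
        return logp[words]
    if len(words) == 1:               # unknown 1-gram: nothing left to back off to
        raise KeyError(words)
    context = words[:-1]
    # In the log domain the product bo_wt * P becomes a sum. A context that is
    # not in the model contributes 0.0, which matches the final "else" branch.
    return bo.get(context, 0.0) + backoff_logprob(words[1:], logp, bo)

# Hypothetical toy model (values are made up for illustration).
logp = {("c",): -1.0, ("d",): -2.0, ("b", "c"): -0.5, ("a", "b", "c"): -0.3}
bo = {("a", "b"): -0.2, ("b",): -0.1}

seen = backoff_logprob(("a", "b", "c"), logp, bo)      # 3-gram exists
backed = backoff_logprob(("a", "b", "d"), logp, bo)    # backs off twice
no_ctx = backoff_logprob(("x", "b", "c"), logp, bo)    # context 2-gram missing
```

With these toy values, `backed` is the sum of the 2-gram back-off coefficient, the 1-gram back-off coefficient, and the 1-gram log probability, which is the same shape as the worked calculation for 「カラーコード」 later in the description.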

In the process of adding the word sequence recognized by the first speech recognition unit 102 to the second language model, the N-gram adding unit 105 first updates the learning example sentences by adding that word sequence to them (ST204). As described above, when the first speech recognition unit 102 recognizes the speech input 「しゅーずけーすからーこーどおねがいします」 ("shoe case color code, please") as 「シューズケースから行動お願いします」 ("action from the shoe case, please"), this recognized word sequence is added to the learning example sentences shown in FIG. 4. FIG. 6 shows the learning example sentence information after 「シューズケースから行動お願いします」, the word sequence recognized by the first speech recognition unit 102, has been added. The corresponding entry "<s> シューズ:シューズ:シューズ:507 ケース:ケース:ケース:507 から:カラ:から:63 行動:コードー:行動:505 お願い:オネガイ:お願い:505 し:シ:する:227 ます:マス:ます:146 </s>" has been added.

Next, the N-gram adding unit 105 calculates N-gram probabilities from the updated learning example sentences (ST205) and uses the calculated N-gram probabilities to update the second language model stored in the second language model storage unit 107 (ST206). FIG. 7 shows the updated second language model. The N-gram probabilities were calculated with the language model toolkit mentioned above. In FIG. 7, for example, the first line of \3-grams, "-0.4771 </s> <s> カラーコード:カラーコード:カラーコード:507", shows that the N-gram probability has been updated from the pre-update value of -0.3010 shown in FIG. 5 to -0.4771. In addition, N-grams that did not exist in the pre-update second language model of FIG. 5, such as the sixth line "-0.3010 お願い:オネガイ:お願い:505 し:シ:する:227 ます:マス:ます:146", have been newly added on the basis of the word sequence recognized by the first speech recognition unit 102.
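The update in ST204 to ST206 amounts to appending the recognized word sequence to the learning example sentences and recounting the N-grams. The Python sketch below illustrates only the counting step and is an assumption-laden simplification: the function name `count_ngrams` is hypothetical, words are reduced to their notation, counting is done within each sentence (whereas FIG. 7 also lists N-grams that span a sentence boundary, such as "</s> <s> カラーコード"), and the conversion of counts into smoothed probabilities is left to a language model toolkit.

```python
from collections import Counter
from typing import Iterable, List

def count_ngrams(sentences: Iterable[List[str]], max_order: int = 3) -> Counter:
    """Count all 1..max_order-grams, padding each sentence with <s> and </s>."""
    counts: Counter = Counter()
    for words in sentences:
        tokens = ["<s>", *words, "</s>"]
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

# Learning example sentences before the update (notation only).
examples = [["シューズケース"], ["カラーコード"]]
before = count_ngrams(examples)

# ST204: add the word sequence recognized by the first speech recognition unit.
examples.append(["シューズ", "ケース", "から", "行動", "お願い", "し", "ます"])

# ST205: recount; the set difference lists the newly introduced N-grams.
after = count_ngrams(examples)
new_ngrams = set(after) - set(before)
```

The 3-gram ("お願い", "し", "ます") appears in `new_ngrams`, which corresponds to the newly added sixth line of \3-grams mentioned above.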

Next, the second speech recognition unit 106 performs speech recognition (ST207). In ST207, as in the first speech recognition unit 102, speech features are extracted from the digital information of the input speech. Based on the extracted features, the second acoustic model stored in the second acoustic model storage unit 108 and the second language model (mixed language model) stored in the second language model storage unit 107 are referenced to obtain the maximum-likelihood word sequence as the recognized sentence (second speech recognition result).

A concrete example of the likelihood calculation in the speech recognition processing performed by the second speech recognition unit 106 is now described with reference to FIG. 8. To keep the table readable, FIG. 8 shows only the notation of each word and omits the reading, base form, and so on. The same applies to FIGS. 9, 10, and 14.
The word probability used in the likelihood calculation is taken from the highest-order N-gram available. For example, for P(<s> |), the probability of the beginning of the sentence, there is no preceding word, so the order is 1 and the value -0.6368 of "<s>" in \1-grams of FIG. 7 is used. For P(シューズケース | <s>), the probability of 「シューズケース」 following the beginning of the sentence, the order is 2, so the value -0.5441 of "<s> シューズケース:シューズケース:シューズケース:507" in \2-grams is used as the log probability.

The next term, P(カラーコード | <s>, シューズケース), is a 3-gram, but \3-grams has no corresponding entry, so the N-gram probability is estimated by the back-off smoothing described above. \2-grams contains "<s> シューズケース:シューズケース:シューズケース:507", so the back-off coefficient of this 2-gram is used. However, \2-grams contains no 2-gram "シューズケース:シューズケース:シューズケース:507 カラーコード:カラーコード:カラーコード:507", so the probability of this 2-gram is likewise estimated by back-off smoothing.
Concretely, the estimate is bo_wt_2(<s>,シューズケース:シューズケース:シューズケース:507) + bo_wt_1(シューズケース:シューズケース:シューズケース:507) + p_1(カラーコード:カラーコード:カラーコード:507) = (0.1761) + (-0.4046) + (-1.2109) = -1.4394, and this estimate is used as the log probability. The log probabilities of all the other words are obtained in the same way. Using these log probabilities in Equation 3 above, the language-model likelihood of the correct word sequence 「<s>シューズケースカラーコードお願いします</s>」 is, as shown in the table of FIG. 8, (-0.6368) + (-0.5441) + (-1.4394) + (-1.6155) + (-0.1761) + (-0.301) + (-0.301) = -5.0139.
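The arithmetic above can be checked with a few lines of Python. This is only a worked check of the numbers quoted in the text. The variable names are chosen for this illustration, and the per-word values are taken as stated for FIG. 8.

```python
# Backed-off log10 probability of the third word:
# bo_wt_2(<s>, シューズケース) + bo_wt_1(シューズケース) + p_1(カラーコード)
backed_off = 0.1761 + (-0.4046) + (-1.2109)

# Per-word log10 probabilities of the correct word sequence, in order.
word_logprobs = [-0.6368, -0.5441, backed_off, -1.6155, -0.1761, -0.301, -0.301]

# In the log domain the sentence likelihood is the sum of the word terms.
language_likelihood = sum(word_logprobs)

print(f"{backed_off:.4f} {language_likelihood:.4f}")
```

Both printed values agree, to four decimal places, with the figures -1.4394 and -5.0139 given in the description.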

With the acoustic model used here, the acoustic likelihood of the correct word sequence is -9118.412109. By Equation 4 above, the total likelihood of the correct word sequence 「<s>シューズケースカラーコードお願いします</s>」 is therefore -9118.412109 + (-5.0139 × 8.0) + (-2.0 × 7) = -9172.52.

The word sequence output as the recognition result of the first speech recognition unit 102 is mixed into the second language model with its misrecognitions intact. If a word sequence containing a misrecognized word were judged to be the most likely, recognition performance could not be improved.
However, the acoustic likelihood of the erroneous word sequence 「<s>シューズケースから行動お願いします</s>」 is -9133.199219, and its language likelihood is, as shown in the table of FIG. 9, (-0.6368) + (-0.5441) + (-0.301) + (-0.301) + (-0.301) + (-0.301) + (-0.301) + (-0.301) + (-0.301) = -3.2879. Its total likelihood is therefore -9133.199219 + (-3.2879 × 8.0) + (-2.0 × 9) = -9177.5. Because the total likelihood of the correct word sequence (-9172.52) is higher than that of the erroneous word sequence (-9177.5), the erroneous word sequence is rejected and the correct word sequence is obtained as the speech recognition result.
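The comparison of the two candidates can be sketched as follows. The form of the total likelihood is inferred from the arithmetic shown in the text (acoustic likelihood, plus language likelihood times 8.0, plus -2.0 per word). The function name `total_likelihood` is hypothetical, and the constant names follow the LM_WEIGHT and LM_PENALTY terms used later in the description.

```python
LM_WEIGHT = 8.0     # language model weight, as used in the text's arithmetic
LM_PENALTY = -2.0   # per-word penalty, as used in the text's arithmetic

def total_likelihood(acoustic: float, language: float, n_words: int) -> float:
    """Combine acoustic and language likelihoods into one score."""
    return acoustic + language * LM_WEIGHT + LM_PENALTY * n_words

# Correct word sequence (7 tokens) and erroneous word sequence (9 tokens).
correct = total_likelihood(-9118.412109, -5.0139, 7)
erroneous = total_likelihood(-9133.199219, -3.2879, 9)

# The candidate with the higher total likelihood is kept.
best = "correct" if correct > erroneous else "erroneous"
```

Note that the erroneous sequence has the better language likelihood (-3.2879 against -5.0139) because its N-grams were just added to the model. In this example it loses on the acoustic likelihood and on the per-word penalty.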

Next, the language model obtained by simply mixing the first language model with the pre-update second language model (referred to as the simple mixed language model) is compared with the updated second language model of this embodiment described above. FIG. 10 shows, for the correct word sequence 「<s>シューズケースカラーコードお願いします</s>」, the probability of each word, the N-gram order, and the language likelihood under the simple mixed language model and under the updated second language model of this embodiment. The simple mixed language model gives X(W) = -11.91879, whereas the updated second language model gives X(W) = -5.0139, so the updated second language model assigns a higher language likelihood to the same word sequence. This is the effect of updating the second language model by adding the word sequence recognized by the first speech recognition unit 102 to the learning example sentences. Using the updated second language model, with its higher language likelihood, in the speech recognition processing makes recognition errors less likely and improves recognition performance.

In this embodiment the recognition result of the first speech recognition unit 102 is additionally mixed into the second language model. Alternatively, as shown in FIG. 11, an N-gram adding unit 105c may update the learning example sentences based on the recognition result of the second speech recognition unit 106 and add N-grams to the second language model based on the updated learning example sentences. In that case, if the recognition result of the first speech recognition unit 102 differs from that of the second speech recognition unit 106, the recognition result of the first speech recognition unit 102 may be rejected and the recognition result of the second speech recognition unit 106 adopted. FIG. 12 shows the learning example sentences when the recognition result of the second speech recognition unit 106 is added, and FIG. 13 shows the resulting language model. FIG. 14 compares, for the correct word sequence, the log probabilities and the applied N-gram orders when the recognition result of the first speech recognition unit 102 is learned and when the recognition result of the second speech recognition unit 106 is learned. When the recognition result of the second speech recognition unit 106 is learned, the log probabilities sum to -2.5464, a higher language likelihood than when the recognition result of the first speech recognition unit 102 is learned.

As described above, providing an N-gram adding unit that additionally mixes into the second language model the word sequence recognized using the first language model adapts the second language model to the input speech. This improves the second language model's coverage of the word chains that appear in the input speech and thereby improves the speech recognition performance of the speech recognition apparatus.

In this embodiment the first speech recognizer and the second speech recognizer are separate, but a single speech recognizer may serve as both. Likewise, the first and second language models are described as different language models stored in the first language model storage unit and the second language model storage unit, respectively, but they may be one language model stored in a single language model storage unit. In that case, if for example the utterance is 「音声認識」 ("speech recognition") and the original language model contains only the 1-grams 「音声」 and 「認識」, then 「音声認識」 is learned. As a result, the 2-grams "<s> 音声", "音声 認識", and "認識 </s>" are learned and speech recognition performance can be improved.

Embodiment 2.
Embodiment 1 adds one speech recognition result to the mixed language model. The embodiment described next adds two speech recognition results, obtained from two speech recognizers, to the mixed language model.
FIG. 15 is a block diagram of a speech recognition apparatus according to Embodiment 2 of the present invention. The speech recognition apparatus of Embodiment 2 comprises a speech input unit 101, a first speech recognition unit 102, a first language model storage unit 103, a first acoustic model storage unit 104, an N-gram adding unit 105b, a second speech recognition unit 106, a second language model storage unit 107b, a second acoustic model storage unit 108, a third speech recognition unit 112, a third language model storage unit 113, and a third acoustic model storage unit 114. Parts bearing the same reference numerals as in Embodiment 1 are the same as in Embodiment 1, and their description is omitted.

Like the first speech recognition unit 102 and the second speech recognition unit 106, the third speech recognition unit 112 performs speech recognition processing with reference to the third language model stored in the third language model storage unit 113 and the third acoustic model stored in the third acoustic model storage unit 114. The third language model stored in the third language model storage unit 113 is, like the second language model of Embodiment 1, a language model learned from specialized learning example sentences of a specific field.

The N-gram adding unit 105b adds N-grams to the second language model stored in the second language model storage unit 107b, based on the word sequence recognized by the first speech recognition unit 102 and the word sequence recognized by the third speech recognition unit 112. As in Embodiment 1, the second language model stored in the second language model storage unit 107b is the language model referred to by the second speech recognition unit 106. In this embodiment, however, it exists to hold the N-grams added by the N-gram adding unit 105b: in its initial state no N-grams have been learned and no learning example sentences are stored.

Next, the operation of the speech recognition apparatus of Embodiment 2 is described, focusing on the differences from Embodiment 1. FIG. 16 shows the processing flow of the speech recognition apparatus of Embodiment 2. The characteristic steps of this embodiment are ST208 and ST204b in FIG. 16. The other steps are the same as in Embodiment 1. In ST208, the third speech recognition unit 112, by the same processing as the first speech recognition unit 102, obtains a recognized word sequence (third speech recognition result) with reference to the third acoustic model stored in the third acoustic model storage unit 114 and the third language model stored in the third language model storage unit 113. The word sequence recognized by the first speech recognition unit 102 and the word sequence recognized by the third speech recognition unit 112 are output to the N-gram adding unit 105b.

The N-gram adding unit 105b creates learning example sentences from the received word sequence recognized by the first speech recognition unit 102 and the received word sequence recognized by the third speech recognition unit 112 (ST204b). These learning example sentences are stored, and the stored learning example sentences are updated when subsequent speech inputs are recognized.

As in Embodiment 1, the concrete operation is described for the case where the speech 「しゅーずけーすからーこーどおねがいします」 is input. The recognition result of the first speech recognition unit 102 is the same word sequence as shown in FIG. 3 for Embodiment 1. The recognition result of the third speech recognition unit 112 is as shown in FIG. 17. Based on these, the N-gram adding unit 105b creates learning example sentences. FIG. 18 shows the learning example sentences created.

The N-gram adding unit 105b obtains N-gram probabilities from the learning example sentences shown in FIG. 18 and adds the N-grams shown in FIGS. 19 and 20 to the second language model. The second speech recognition unit 106 then performs speech recognition with reference to the second language model containing these newly learned N-grams, and a correct speech recognition result is obtained as in Embodiment 1.

As described above, the mixed language model is generated from the word sequence output by the first speech recognizer and the word sequence output by the third speech recognizer, so the mixed language model can be made compact without losing the effect of improved coverage. This is particularly effective when the second speech recognizer is implemented on a small device such as a mobile terminal.

In Embodiment 2 described above, the mixed language model is updated with two recognition results: that of the first speech recognition unit, which performs speech recognition using the first language model, and that of the third speech recognition unit, which performs speech recognition using the third language model. Further language models and speech recognition units may be provided, and the recognition results of those speech recognition units may also be used to update the mixed language model. In addition, as shown in Embodiment 1, the second language model may be updated based on the recognition result of the second speech recognition unit.

Embodiment 3.
In Embodiment 2, when N-grams are added to the second language model, the recognized word sequences output by the respective speech recognition units are mixed with equal weight to generate the second language model. In this embodiment, a weight (mixing weight) is assigned to each speech recognition unit, one or more second language models with different mixing weights are generated, and the maximum-likelihood recognition result over all of them is output.

The basic configuration is the same as in Embodiment 2, so the description focuses on the generation of weighted mixed language models and the comparison of recognition results obtained with a plurality of mixed language models. The following description uses three mixed language models as an example, but there may be one or two, or four or more. When there is only one mixed language model, however, no processing for comparing recognition results is needed.

FIG. 21 is a functional block diagram of the speech recognition apparatus of this embodiment, focusing on the first speech recognition unit 102, the third speech recognition unit 112, the N-gram adding unit 105d, the second language model storage unit 107d, and the second speech recognition unit 106d. In FIG. 21, λ (0 ≤ λ ≤ 1) is a coefficient that weights the recognition result of the first speech recognition unit 102 against the recognition result of the third speech recognition unit 112. Here, λ is the weight for the recognition result of the first speech recognition unit 102, and (1 - λ) is the weight for the recognition result of the third speech recognition unit 112.
The value of λ is predetermined for each second language model. Here, three values are used, λ = 2/3, λ = 1/2, and λ = 1/3, corresponding respectively to the second language model A, the second language model B, and the second language model C shown in FIG. 21.

Next, the operation is described. The N-gram adding unit 105d creates learning example sentences weighted according to λ and 1 - λ, and creates the second language model A, the second language model B, and the second language model C from the respective learning example sentences.
The learning example sentences for λ = 1/2 are the same as the learning example sentences shown in FIG. 17. For λ = 2/3, as shown in FIG. 22, the learning example sentences are generated so that the word sequence recognized by the first speech recognition unit appears twice as often as the word sequence recognized by the third speech recognition unit. Conversely, for λ = 1/3, as shown in FIG. 23, the learning example sentences are generated so that the word sequence recognized by the third speech recognition unit appears twice as often as the word sequence recognized by the first speech recognition unit. In this way, learning example sentences containing the word sequences recognized by the first and third speech recognition units in a ratio equivalent to λ : 1 - λ are generated, and the corresponding second language model A, second language model B, and second language model C are created.
Creating the second language models in this way weights the probability of each N-gram added to the second language model according to the speech recognition unit that output the example sentence from which the N-gram was derived.
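The λ : 1 - λ weighting by repetition can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `weighted_examples` is hypothetical, λ is passed as an exact fraction, and the two word lists are placeholders rather than the actual recognition results of FIGS. 3 and 17.

```python
from fractions import Fraction
from typing import List, Sequence

def weighted_examples(first: Sequence[str], third: Sequence[str],
                      lam: Fraction) -> List[List[str]]:
    """Repeat the two recognition results in the ratio lam : (1 - lam)."""
    if not 0 <= lam <= 1:
        raise ValueError("lam must satisfy 0 <= lam <= 1")
    # For lam = p/q in lowest terms, 1 - lam = (q - p)/q, so the smallest
    # integer ratio of copies is p : (q - p).
    copies_first = lam.numerator
    copies_third = lam.denominator - lam.numerator
    return ([list(first) for _ in range(copies_first)]
            + [list(third) for _ in range(copies_third)])

first_result = ["w1", "w2"]   # placeholder for the first unit's word sequence
third_result = ["w3"]         # placeholder for the third unit's word sequence

model_a_examples = weighted_examples(first_result, third_result, Fraction(2, 3))
model_b_examples = weighted_examples(first_result, third_result, Fraction(1, 2))
model_c_examples = weighted_examples(first_result, third_result, Fraction(1, 3))
```

For λ = 2/3 the first unit's word sequence appears twice and the third unit's once. For λ = 1/3 the proportions are reversed, and for λ = 1/2 each appears once.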

The second speech recognition unit 106d, by the same processing as in Embodiments 1 and 2, obtains a recognized word sequence with reference to each of the second language model A, the second language model B, and the second language model C. It compares these word sequences using Equation 5 and outputs the most likely one as the recognized word sequence. In Equation 5, h_A, h_B, and h_C are the word sequences obtained by speech recognition processing that refers to the second language model A, the second language model B, and the second language model C, respectively. AC(h_A), AC(h_B), and AC(h_C) are the acoustic likelihoods of the word sequences h_A, h_B, and h_C. LM(h_A), LM(h_B), and LM(h_C) are their language likelihoods, and n_A, n_B, and n_C are their numbers of words. LM_WEIGHT is the language model weight, and LM_PENALTY is the word insertion penalty.
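Equation 5 itself is not reproduced in this text. A plausible reconstruction, offered only as an assumption, applies the same total-likelihood form as the arithmetic shown for Equation 4 (acoustic likelihood, plus weighted language likelihood, plus a per-word penalty) to each candidate and keeps the maximum. The symbol \hat{h} for the selected word sequence is introduced here for illustration.

```latex
\hat{h} = \operatorname*{arg\,max}_{h_i \in \{h_A,\, h_B,\, h_C\}}
  \bigl[\, \mathrm{AC}(h_i)
        + \mathrm{LM}(h_i) \times \mathrm{LM\_WEIGHT}
        + \mathrm{LM\_PENALTY} \times n_i \,\bigr]
```

With LM_WEIGHT = 8.0 and LM_PENALTY = -2.0, this form reproduces the totals -9172.52 and -9177.5 calculated in Embodiment 1.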

Doing so improves the coverage of the word chains that appear in the input speech, improves the speech recognition performance of the speech recognition apparatus, and yields a speech recognition result obtained with an appropriate mixing ratio of the language models.
In Embodiment 1 as well, weighting may be applied when the word sequence recognized by the first speech recognition unit is added to the second language model.

In the embodiments described above, the maximum-likelihood solution is the object of mixing, but N-best solutions, in which a plurality of candidates are selected, may be used instead. The mixed language model may also be updated based on the result of morphological analysis of the word sequences recognized by the first, second, and third speech recognition units. Furthermore, the morpheme units may differ between the first language model and the second or third language model.

101 speech input unit, 102 first speech recognition unit, 103 first language model storage unit, 104 first acoustic model storage unit, 105, 105b, 105c, 105d N-gram adding unit, 106, 106d second speech recognition unit, 107, 107b, 107d second language model storage unit, 108 second acoustic model storage unit, 109 display unit, 112 third speech recognition unit, 113 third language model storage unit, 114 third acoustic model storage unit

Claims

A speech recognition apparatus comprising:
a second language model storage unit that stores a mixed language model;
a second speech recognition unit that recognizes an input speech signal using the mixed language model and outputs a second speech recognition result;
a first speech recognition unit that recognizes the input speech signal using a first language model and outputs a first speech recognition result; and
an N-gram adding unit that receives the first speech recognition result from the first speech recognition unit, creates an N-gram based on the word sequence of the first speech recognition result, and adds the created N-gram to the mixed language model stored in the second language model storage unit.

The speech recognition apparatus according to claim 1, wherein the N-gram adding unit receives a third speech recognition result of the speech signal from a third speech recognition unit different from the first and second speech recognition units, creates N-grams based on the word sequence of the first speech recognition result and the word sequence of the third speech recognition result, and adds the N-grams thus created to the mixed language model stored in the second language model storage unit.

The speech recognition apparatus according to claim 1, further comprising a first language model storage unit that stores the first language model.

The speech recognition apparatus according to claim 2, further comprising:
a third language model storage unit that stores a third language model; and
the third speech recognition unit, which recognizes the speech signal using the third language model and outputs the third speech recognition result.

The speech recognition apparatus according to any one of claims 1 to 4, wherein the N-gram adding unit creates an N-gram based on the word sequence of the second speech recognition result of the second speech recognition unit and adds the N-gram created based on the word sequence of the second speech recognition result to the mixed language model.

The speech recognition apparatus according to any one of claims 1 to 5, wherein the N-gram adding unit further updates the probabilities of the N-grams stored in the mixed language model.

The speech recognition apparatus according to claim 2, wherein the N-gram adding unit weights the first speech recognition result and the third speech recognition result with weights corresponding to the first speech recognition unit and the third speech recognition unit, respectively, and calculates the probability of the N-gram to be added to the mixed language model.

The speech recognition apparatus according to claim 7, wherein:
the second language model storage unit stores a plurality of the mixed language models;
the N-gram adding unit performs the weighting based on a predetermined combination of the weight for the first speech recognition result and the weight for the third speech recognition result corresponding to each of the plurality of mixed language models, and calculates the probability of the N-gram to be added to each of the plurality of second language models; and
the second speech recognition unit performs speech recognition with reference to each of the plurality of mixed language models, selects one of the obtained speech recognition results based on the likelihood of each result, and uses the selected result as the second speech recognition result.

A speech recognition method comprising:
a first speech recognition procedure in which a first speech recognition unit performs speech recognition of input speech with reference to a first language model;
a second speech recognition procedure in which a second speech recognition unit performs speech recognition of the input speech with reference to a mixed language model; and
an N-gram addition procedure in which an N-gram adding unit creates an N-gram based on the word sequence of the speech recognition result of the first speech recognition procedure and adds the created N-gram to the mixed language model.

The speech recognition method according to claim 9, further comprising a third speech recognition procedure in which a third speech recognition unit performs speech recognition of the input speech with reference to a third language model,
wherein the N-gram addition procedure creates N-grams based on the word sequence of the speech recognition result of the first speech recognition procedure and the word sequence of the speech recognition result of the third speech recognition procedure, and adds the N-grams thus created to the mixed language model.