JP6300394B2

JP6300394B2 - Error correction model learning device and program

Info

Publication number: JP6300394B2
Application number: JP2013103291A
Authority: JP
Inventors: 彰夫小林
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2013-05-15
Filing date: 2013-05-15
Publication date: 2018-03-28
Anticipated expiration: 2033-05-15
Also published as: JP2014224860A

Description

The present invention relates to an error correction model learning device and a program.

For error correction in speech recognition, there is a technique that statistically learns the error tendencies of speech recognition from speech and its transcription (the correct sentence) using linguistic features, and uses the resulting statistical error correction model to improve speech recognition performance (see, for example, Non-Patent Document 1). In this prior art, the model parameters of the error correction model are learned from speech data, its speech recognition results, and the correct word strings.

Non-Patent Document 1: Kobayashi et al., "News speech recognition by discriminative scoring based on word error minimization", IEICE Journal, vol. J93-D, no. 5, 2010, pp. 598-609

In application systems such as caption production systems, input speech must be processed sequentially and in real time. However, learning an error correction model with the prior art is computationally expensive, which makes it problematic for such applications in terms of both sequential processing and real-time operation.
For example, the prior art must align the input speech in order to obtain the utterance time of each word in the correct word string. During alignment, an acoustic score (a score based on the log likelihood under a statistical acoustic model) is computed for each word at its utterance time, because the prior-art error correction model requires acoustic scores. Alignment is important for estimating model parameters precisely when learning an error correction model, but it is undesirable in terms of computational cost. Even if the error correction model could be learned sequentially, its application would be delayed, so its effectiveness would likely be greatly reduced in tasks where topics change rapidly, such as broadcast speech.

The present invention has been made in view of these circumstances, and provides an error correction model learning device and a program that can learn an error correction model sequentially in a simple manner.

[1] One aspect of the present invention is an error correction model learning device comprising: a speech recognition result correction unit that corrects a speech recognition result according to an input instruction; and an error correction model update unit that learns word error tendencies from the difference between the linguistic features contained in the speech recognition result and the linguistic features contained in the correction result produced by the speech recognition result correction unit, and that updates an error correction model, which is used to correct word error tendencies in speech recognition, according to the learned word error tendencies.
According to this invention, the error correction model learning device learns word error tendencies based on the difference between the linguistic features contained in the speech recognition result and those contained in the correction result obtained by manually correcting it, and updates the error correction model according to the learned error tendencies.
As a result, the device can learn word error tendencies and update the error correction model with little computation, using only the speech recognition results of sequentially input speech and their manually corrected results. The error correction model can therefore be updated sequentially and with low delay.
Note that the "input instruction" above may refer to manual correction work.

[2] One aspect of the present invention is the error correction model learning device described above, wherein the error correction model update unit learns word error tendencies using the co-occurrence frequencies of words, or of the parts of speech of words, contained in the speech recognition result and the co-occurrence frequencies of words, or of the parts of speech of words, contained in the correction result.
According to this invention, the error correction model learning device learns word error tendencies using the co-occurrence frequencies of words, or of the parts of speech of words, contained in the speech recognition result and in the correction result obtained by manually correcting it, and updates the error correction model.
As a result, by counting the co-occurrence frequencies of words or parts of speech in the speech recognition result and the correction result, the device can efficiently learn the word error tendencies of speech recognition and update the error correction model from what it has learned.

[3] One aspect of the present invention is the error correction model learning device described above, wherein the linguistic features are the frequencies of consecutive word strings or of consecutive part-of-speech strings of words; the error correction model is a formula that corrects the speech recognition score using feature functions based on the linguistic features and the feature weights of those feature functions; and the error correction model update unit updates the feature weights of the error correction model according to the learned word error tendencies.
According to this invention, the error correction model learning device learns word error tendencies from differences in linguistic features, such as the frequencies of consecutive word strings and of consecutive part-of-speech strings, contained in the speech recognition result and in the correction result. The error correction model is a formula that corrects the speech recognition score using feature functions that represent linguistic features and the feature weights of those functions, and the device updates the feature weights of the error correction model according to the learned word error tendencies.
As a result, the device can efficiently learn the recognition error tendencies of speech recognition and update the feature weights of the error correction model.

[4] One aspect of the present invention is the error correction model learning device described above, further comprising a speech recognition unit that performs speech recognition on input speech and, using the error correction model updated by the error correction model update unit, corrects errors in the selection of the speech recognition result obtained from the input speech and outputs the result.
According to this invention, the error correction model learning device selects a speech recognition result from among the correct answer candidates obtained by recognizing the input speech, using the error correction model that is updated sequentially.
As a result, even when the topic shifts, the device can obtain speech recognition results with a high recognition rate by using an error correction model that follows the topic sequentially and incurs little delay from learning.

[5] One aspect of the present invention is a program for causing a computer to function as an error correction model learning device comprising: speech recognition result correction means that corrects a speech recognition result according to an input instruction; and error correction model update means that learns word error tendencies from the difference between the linguistic features contained in the speech recognition result and the linguistic features contained in the correction result produced by the speech recognition result correction means, and that updates an error correction model, which is used to correct word error tendencies in speech recognition, according to the learned word error tendencies.

According to the present invention, an error correction model can be learned sequentially in a simple manner.

A diagram showing the correction process for speech recognition results applied to one embodiment of the present invention.
A diagram showing the sequential update process of the error correction model according to the same embodiment.
A diagram showing the update process of the language model according to the same embodiment.
A functional block diagram showing the configuration of the error correction model learning device according to the same embodiment.
A flowchart showing the overall processing of the error correction model learning device according to the same embodiment.
A flowchart showing the processing of the morphological analysis unit according to the same embodiment.
A flowchart showing the processing of the error correction model update unit according to the same embodiment.
A flowchart showing the processing of the language model update unit according to the same embodiment.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

[1. Overview of this embodiment]
Error correction models based on discriminative learning that reflect the error tendencies of speech recognition have already been devised. Learning a conventional error correction model requires a procedure that, given speech and the correct word string for that speech, aligns the utterance start and end times of each word in the word string. The procedure is as follows.

(1) Align the speech data with the correct word string. At this point, compute the acoustic score and the language score (a score based on the generation probability of the word string under a statistical language model) of each word in the correct word string.
(2) Based on the speech recognition results and the correct word string aligned in (1), compute the expected number of errors (the average number of word errors in speech recognition) for the input speech data.
(3) Estimate the model parameters of the error correction model using the expected number of errors as the evaluation function.

In the prior art described above, the model parameters of the error correction model are estimated with all of the speech data given in advance. The prior art could be applied to sequential processing by repeating the above procedure each time speech data is obtained, but steps (1) and (2) drive up the computation time needed for parameter estimation. It is therefore unsuitable for applications that demand real-time operation, such as those that apply the estimated model parameters to speech recognition immediately. It is thus desirable to learn the model simply, without the alignment procedure.

The error correction model learning device of this embodiment targets speech recognition applications, such as caption production systems, in which correct word strings obtained by correcting recognition errors in the input speech are given. It learns the error correction model applied to such an application sequentially and simply from the speech recognition results and the correction results, without going through the alignment procedure for the correct word strings.

[2. Learning algorithm of the error correction model]
To solve the problems of the conventional method, the error correction model learning device of this embodiment learns the error correction model by sequential processing suited to real-time operation. The learning algorithm of this embodiment targets error correction models applied to applications, such as caption production systems, in which correct word strings are obtained sequentially.

FIG. 1 is a diagram showing the correction process that obtains a correction result from a speech recognition result. In a typical correction process, speech recognition results are supplied sequentially by a speech recognition device at the preceding stage. With a conventional sequentially-confirming speech recognition device (see, for example, Japanese Unexamined Patent Application Publication No. 2001-092496), the words of an input utterance are not sent all at once; instead, the partial word strings that the device has confirmed (judged most likely to be correct) are sent as they become available. This process is also used in this embodiment.

The speech recognition result is then corrected manually, and the correction result is sent to the subsequent processing as the correct word string. The correction process normally does not send the corrected word string onward piece by piece; it sends all the words of the input utterance together. One unit of correction (a correction block) is therefore a block corresponding to one utterance.
In a caption production system, for example, the subsequent processing converts the correct word string into caption format and superimposes it on the broadcast signal.

Correction work refers to the following three operations on the words of the speech recognition result.
(1) Deleting a word contained in the speech recognition result. This corrects the recognition error called an "insertion error".
(2) Replacing a word contained in the speech recognition result. This corrects the recognition error called a "substitution error".
(3) Inserting a word that is missing from the speech recognition result. This corrects the recognition error called a "deletion error".
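
The mapping between the three correction operations and the three error types can be illustrated with a short Python sketch. It is not part of the patent: the function name `classify_corrections` and the use of the standard-library `difflib.SequenceMatcher` to align the two word lists are assumptions made only for illustration.

```python
from difflib import SequenceMatcher


def classify_corrections(recognized, corrected):
    """Label each edit that turns the recognized words into the corrected words.

    Returns a list of (error_type, recognized_words, corrected_words) tuples.
    """
    ops = []
    matcher = SequenceMatcher(a=recognized, b=corrected, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            # The editor deleted words, so the recognizer had inserted them.
            ops.append(("insertion error", recognized[i1:i2], []))
        elif tag == "replace":
            # The editor replaced words, so the recognizer had substituted them.
            ops.append(("substitution error", recognized[i1:i2], corrected[j1:j2]))
        elif tag == "insert":
            # The editor inserted words, so the recognizer had dropped them.
            ops.append(("deletion error", [], corrected[j1:j2]))
    return ops


if __name__ == "__main__":
    print(classify_corrections(["a", "x", "b"], ["a", "y", "b"]))
```

Note that the alignment here works on word lists only; it needs no timing information or acoustic scores.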

FIG. 2 is a diagram showing the sequential update process of the error correction model according to this embodiment. Because this embodiment learns the error correction model sequentially and applies it to speech recognition, several correction blocks are grouped into one sequential update block, as shown in the figure, to set the timing at which the error correction model is updated. The error correction model is then updated sequentially using only the word strings of the speech recognition results and correction results contained in that sequential update block.

[2.1 Error correction model of the conventional method]
According to Bayes' theorem, given a speech input x, the most likely word string w^ for x (where "^" denotes a hat) can be obtained by the following equation (1).
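
Equation (1) itself is not reproduced in this text. A form consistent with the description (Bayes' theorem applied to the posterior P(w | x), with the terms P(x | w) and P(w) discussed in the paragraphs that follow) would be the following; this is a reconstruction, not the original drawing.

```latex
\hat{w} = \operatorname*{arg\,max}_{w} P(w \mid x)
        = \operatorname*{arg\,max}_{w} P(x \mid w)\, P(w)
\tag{1}
```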

The speech input x and the word string w correspond, for example, to a unit of utterance, and P(w | x) is the posterior probability of obtaining the word string (sentence hypothesis) w given the speech input x.
P(x | w) is the likelihood expressing how acoustically plausible the speech is for the word string w, and its score (the acoustic score) is computed from a statistical acoustic model (hereinafter, "acoustic model") typified by hidden Markov models (HMM) and Gaussian mixture models (GMM). In other words, the acoustic score represents the plausibility of each of the candidate words given the acoustic features.

P(w), on the other hand, is the linguistic generation probability of the word string w, and its score (the language score) is computed by a statistical language model (hereinafter, "language model") such as a word n-gram model. In other words, the language score represents the plausibility of each candidate word string given the word string before the word being recognized, after it, or both. A word n-gram model gives the occurrence probability of the next word from a history of (N-1) words, based on statistics of N-word chains (N is, for example, 1, 2, or 3).
In the following description, HMM-GMM is used for the acoustic model and an n-gram is used for the language model.

When P(x | w) P(w) in equation (1) is at its maximum, so is its logarithm. Speech recognition therefore defines, based on Bayes' theorem in equation (1), the evaluation function S(w | x) of a word string w, which is a sentence hypothesis (correct answer candidate) for the speech input x, as in the following equation (2).

In equation (2), f_am(x | w) is the log acoustic score of the word string w under the acoustic model, f_lm(w) is the log language score of w under the language model, and λ_lm is the weight of the language score relative to the acoustic score.

With equation (2) defined, the word string w^ that maximizes the evaluation function S(w | x) of equation (2) is selected, from the set of candidate word strings w for the speech input x, as the speech recognition result for x, as shown in the following equation (3).
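
Equations (2) and (3) are missing from this text. Reconstructed from the definitions of f_am, f_lm and λ_lm given above, and assuming the usual log-linear combination, they would read:

```latex
S(w \mid x) = f_{am}(x \mid w) + \lambda_{lm}\, f_{lm}(w) \tag{2}

\hat{w} = \operatorname*{arg\,max}_{w} S(w \mid x) \tag{3}
```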

In the error correction model of conventional methods such as Non-Patent Document 1, the evaluation function of a hypothesis (candidate word string w) is given by the following equation (4), and the maximum likelihood hypothesis w^ is obtained by the following equation (5).
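
The images for equations (4) and (5) did not survive extraction. The text states that equation (4) contains the term Σ_i λ_i f_i(w) and later writes its value as S^(w | x; Λ), so a plausible reconstruction (an assumption, extending equation (2) by the correction term) is:

```latex
\hat{S}(w \mid x; \Lambda) = f_{am}(x \mid w) + \lambda_{lm}\, f_{lm}(w) + \sum_{i} \lambda_i f_i(w) \tag{4}

\hat{w} = \operatorname*{arg\,max}_{w} \hat{S}(w \mid x; \Lambda) \tag{5}
```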

The right side of equation (4) is the error correction model. The term Σ_i λ_i f_i(w) in equation (4) is a score that reflects the error tendency of the word string w, and acts as a penalty or reward for w. Here f_i(w) (i = 1, ...) is the i-th feature function, and the element λ_i of the model parameter Λ = {λ_1, ...} is the weight (feature weight) of the feature function f_i(w). A feature function is defined so that, for a given word string (here, w), it returns the number of times a linguistic rule holds, and 0 if the rule does not hold. These rules are linguistic features such as consecutive words within the same utterance, co-occurrence relations between two or more non-consecutive words, and syntactic or semantic information about words. Specific examples of rules for the feature functions f_i in the conventional method are given below.
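
As an illustration of how such a corrected score could be used to choose among candidates, the following Python sketch adds the weighted feature counts to precomputed acoustic and language scores and picks the best hypothesis. The dictionary layout (`"am"`, `"lm"`, `"features"`) and the function names are hypothetical, chosen only for this example.

```python
def corrected_score(am_score, lm_score, features, weights, lm_weight=1.0):
    """Corrected score: f_am + lambda_lm * f_lm + sum_i lambda_i * f_i(w).

    `features` maps a feature name to its count f_i(w); `weights` maps a
    feature name to lambda_i. Features without a weight contribute 0.
    """
    correction = sum(weights.get(name, 0.0) * count for name, count in features.items())
    return am_score + lm_weight * lm_score + correction


def select_hypothesis(hypotheses, weights, lm_weight=1.0):
    """Return the candidate with the highest corrected score."""
    return max(
        hypotheses,
        key=lambda h: corrected_score(h["am"], h["lm"], h["features"], weights, lm_weight),
    )


if __name__ == "__main__":
    nbest = [
        {"words": ["A"], "am": -10.0, "lm": -2.0, "features": {"x": 1}},
        {"words": ["B"], "am": -10.5, "lm": -2.0, "features": {"y": 1}},
    ]
    # A negative weight on feature "x" penalizes the first candidate.
    print(select_hypothesis(nbest, {"x": -1.0})["words"])
```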

For example, feature functions based on word co-occurrence include (1) and (2) below.

(1) A function that returns the number of occurrences of the consecutive word pair (u, v) in the word string w, if any.
(2) A function that returns the number of occurrences of the non-consecutive word pair (u, v) in the word string w, if any.

Feature functions based on syntactic information, obtained after replacing each word of the word string w with its part-of-speech category (syntactic information) such as noun or verb, include (3) and (4) below. Here c(·) is a function that maps a word to its part of speech.

(3) A function that returns the number of occurrences of the consecutive part-of-speech pair (c(u), c(v)) in the word string w, if any.
(4) A function that returns the number of occurrences of the non-consecutive part-of-speech pair (c(u), c(v)) in the word string w, if any.

Alternatively, feature functions based on semantic information, obtained after replacing each word of the word string w with a category representing its meaning (a semantic category), include (5) and (6) below. Semantic categories can be obtained using, for example, a thesaurus stored in a database that the error correction model learning device of this embodiment holds internally or accesses externally. Here s(·) is a function that maps a word to its semantic category.

(5) A function that returns the number of occurrences of the consecutive semantic-category pair (s(u), s(v)) in the word string w, if any.
(6) A function that returns the number of occurrences of the non-consecutive semantic-category pair (s(u), s(v)) in the word string w, if any.
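
The pair-counting rules above can be sketched in Python with `collections.Counter`. This is only an illustration under stated assumptions: the function names and the `pos_of` lookup table (standing in for c(·)) are hypothetical, and a "non-consecutive pair" is taken here to mean an ordered pair with at least one token between its members, which the text does not spell out. Semantic-category features would follow the same pattern with a table standing in for s(·).

```python
from collections import Counter
from itertools import combinations


def contiguous_pairs(tokens):
    """Counts of adjacent pairs (u, v), as in rules (1), (3) and (5)."""
    return Counter(zip(tokens, tokens[1:]))


def noncontiguous_pairs(tokens):
    """Counts of ordered pairs (u, v) separated by at least one token, as in rules (2), (4) and (6)."""
    return Counter(
        (tokens[i], tokens[j])
        for i, j in combinations(range(len(tokens)), 2)
        if j - i > 1
    )


def extract_features(words, pos_of):
    """Return a sparse feature vector as a Counter keyed by (rule_name, pair)."""
    tags = [pos_of[w] for w in words]  # plays the role of c(.): word -> part of speech
    features = Counter()
    for name, counts in (
        ("word_adj", contiguous_pairs(words)),
        ("word_gap", noncontiguous_pairs(words)),
        ("pos_adj", contiguous_pairs(tags)),
        ("pos_gap", noncontiguous_pairs(tags)),
    ):
        for pair, n in counts.items():
            features[(name, pair)] = n
    return features


if __name__ == "__main__":
    print(extract_features(["a", "b", "a", "b"], {"a": "N", "b": "V"}))
```

Each key of the returned counter corresponds to one feature index i, and its value to f_i(w).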

In the method of Non-Patent Document 1, the hypothesis score of equation (4) is computed for the corrected correct word string, but this computation is time-consuming, so applying the error correction model loses its real-time property.
Therefore, the following algorithm is used to update the model parameter Λ of the error correction model in each sequential update block.

[2.2 Learning algorithm of the error correction model applied to this embodiment]
Let w be the speech recognition result in the sequential update block of interest, and let w^ref be the correct word string that results from correcting it. Let S^(w | x; Λ) be the score of the speech recognition result w computed by equation (4) with model parameter Λ, and S^(w^ref | x; Λ) be the score of the correct word string w^ref. Consider the difference L_1(Λ) between these scores, given by the following equation (6).

In equation (6), the larger the score of the correct word string w^ref, the more likely the correct word string is to be selected. Conversely, the smaller the score of a speech recognition result w that contains errors, the less likely such a word string is to be selected. Hence the larger L_1(Λ) is, the more likely the correct word string is to be selected.
In other words, estimating the model parameter Λ so as to increase L_1(Λ) yields an error correction model that reflects the error tendencies of the speech recognition results. To maximize L_1(Λ), its gradient with respect to the weight λ_i is taken, which gives the following equation (7).
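
Equations (6) and (7) are not reproduced in this text. From the description (a difference of the two scores, differentiated with respect to λ_i), a consistent reconstruction is the following. The acoustic and language terms do not depend on λ_i, so only the feature counts remain in the gradient.

```latex
L_1(\Lambda) = \hat{S}(w^{ref} \mid x; \Lambda) - \hat{S}(w \mid x; \Lambda) \tag{6}

\frac{\partial L_1(\Lambda)}{\partial \lambda_i} = f_i(w^{ref}) - f_i(w) \tag{7}
```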

Now suppose that the sequential update block b_m (m = 1, ...) contains n_m correction blocks. The score difference L_1(Λ) in the sequential update block b_m can then be rewritten as the following equation (8). Here, w_n denotes the speech recognition result in correction block n (n = 1, ..., n_m) corresponding to input speech x_n, and w_n^ref denotes the correct word string obtained by correcting it.

Accordingly, the gradient Δλ_i^m of the function in equation (8) for the sequential update block b_m is given by the following equation (9).
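
With the equation images missing, equations (8) and (9) can be reconstructed by summing the per-utterance difference and its gradient over the n_m correction blocks of b_m. This is an inferred form, not the original drawing.

```latex
L_1(\Lambda) = \sum_{n=1}^{n_m} \left\{ \hat{S}(w_n^{ref} \mid x_n; \Lambda) - \hat{S}(w_n \mid x_n; \Lambda) \right\} \tag{8}

\Delta\lambda_i^{m} = \sum_{n=1}^{n_m} \left\{ f_i(w_n^{ref}) - f_i(w_n) \right\} \tag{9}
```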

By definition, the value of the feature function f_i is the number of times the linguistic rule occurs in the speech recognition result or the correct word string. Therefore, the score difference L_1(Λ) can be maximized by the simple process of counting features within the sequential update block, without computing the score of equation (4) for the corrected correct word string, and the error correction model is learned as the resulting model parameter Λ.

If the model parameter Λ is estimated by repeated updates over sequential update blocks, and the weight λ_i^{m-1} was obtained at the (m-1)-th sequential update block b_{m-1}, then the weight λ_i^m at the current block m is given by the following equation (10).
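
Equation (10) is absent from this text. Given that L_1(Λ) is to be maximized and that the next paragraph introduces a predetermined coefficient η, the natural reading is a gradient-ascent step, reconstructed here as an assumption:

```latex
\lambda_i^{m} = \lambda_i^{m-1} + \eta\, \Delta\lambda_i^{m} \tag{10}
```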

In equation (10), η is a coefficient determined in advance.
Alternatively, weighted addition over the K sequential update blocks obtained in the past may be used, giving the following equation (11).

In equation (11), ρ_k is a weight determined in advance, with Σρ_k = 1 (k = 0, ..., K-1).
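
The counting-based gradient for one sequential update block and the single-step update with coefficient η can be sketched in Python as follows. The function names, the sparse dictionary representation, and the additive form of the update are assumptions for illustration; the weighted variant over K past blocks is not shown because its exact form is not recoverable from this text.

```python
from collections import Counter


def block_gradient(block):
    """Gradient for one sequential update block.

    `block` is a list of (recognized_features, corrected_features) pairs,
    one per correction block, each a Counter of feature counts.
    """
    grad = Counter()
    for recognized, corrected in block:
        grad.update(corrected)      # + f_i(w_n^ref)
        grad.subtract(recognized)   # - f_i(w_n)
    # Keep only features whose counts actually differ.
    return {name: g for name, g in grad.items() if g != 0}


def update_weights(weights, gradient, eta=0.1):
    """One additive update step: new weight = old weight + eta * gradient."""
    new_weights = dict(weights)
    for name, g in gradient.items():
        new_weights[name] = new_weights.get(name, 0.0) + eta * g
    return new_weights


if __name__ == "__main__":
    block = [(Counter({"a": 2, "b": 1}), Counter({"a": 1, "c": 1}))]
    print(update_weights({"a": 0.5}, block_gradient(block), eta=0.5))
```

Features that occur more often in the corrected text than in the recognition result gain weight, and features that occur more often in the recognition result lose weight. No alignment or acoustic scoring is involved.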

In speech recognition, on the other hand, rather than obtaining only the single maximum likelihood sequence as the recognition result, it is common to generate L candidate sequences at the same time while deriving the maximum likelihood sequence. Let w_n^l be the l-th (l = 1, ...) correct candidate sentence (candidate sequence) for speech x_n, and let S^(w_n^l | x; Λ) be the score of w_n^l computed by equation (4). The posterior probability p(w_n^l | x_n) that the candidate sentence w_n^l is generated is then given by the following equation (12). Here, w_n^l' denotes a correct candidate sentence for speech x_n other than w_n^l.
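
Equation (12) is not reproduced in this text. A common way to turn log-domain scores of an N-best list into posterior probabilities is to exponentiate and normalize them over the list. The Python sketch below assumes that form (the function name `posteriors` is hypothetical) and subtracts the maximum score before exponentiating for numerical stability.

```python
import math


def posteriors(scores):
    """Normalize log-domain scores of the N-best candidates into probabilities."""
    top = max(scores)
    # Shifting by the maximum leaves the ratios unchanged and avoids overflow.
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    print(posteriors([-3.0, -1.0, -2.0]))
```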

ここで、音声ｘ_ｎの正解単語列ｗ_ｎ ^ｒｅｆのスコアとＬ個の正解候補文ｗ_ｎ ^ｌの平均スコアとの差分Ｌ_２（Λ）を以下の式（１３）のように定める。 Here, the difference L ₂ (Λ) between the score of the correct word string w _n ^ref of the speech x _n and the average score of the L correct candidate sentences w _n ^l is determined as in the following equation (13).

式（１３）に示す関数を最大化するために勾配を求めると、以下の式（１４）となる。 When the gradient is obtained in order to maximize the function shown in Expression (13), the following Expression (14) is obtained.

Accordingly, the gradient Δλ_i^m in the sequential update block b_m is given by the following equation (15).

モデルパラメータΛの最終更新式は、上述した式（１０）または式（１１）となる。 The final update formula of the model parameter Λ is the above-described formula (10) or formula (11).

[3. Configuration of error correction model learning device]
FIG. 4 is a functional block diagram showing the configuration of the error correction model learning device 1 according to an embodiment of the present invention; only the functional blocks relevant to this embodiment are shown. The error correction model learning device 1 is implemented on a computer and, as shown in the figure, comprises a speech recognition unit 2, a pronunciation dictionary storage unit 3, a language model storage unit 4, an acoustic model storage unit 5, an error correction model storage unit 6, a speech recognition result correction unit 7, a morpheme analysis unit 8, a morpheme analysis dictionary database (DB) storage unit 9, an error correction model update unit 10, a language model update unit 12, and a pronunciation dictionary database (DB) storage unit 14.

The pronunciation dictionary storage unit 3 stores a pronunciation dictionary that lists pairs of words and pronunciations. The language model storage unit 4 stores a language model. The acoustic model storage unit 5 stores an acoustic model. The error correction model storage unit 6 stores an error correction model. A conventional speech recognition device can be used as the speech recognition unit 2, which obtains the speech recognition result of the sequentially input speech D1 using the pronunciation dictionary stored in the pronunciation dictionary storage unit 3, the language model stored in the language model storage unit 4, the acoustic model stored in the acoustic model storage unit 5, and the error correction model stored in the error correction model storage unit 6. Because the error correction model stored in the error correction model storage unit 6 is sequentially updated by the error correction model update unit 10, the speech recognition unit 2 uses the updated error correction model for speech recognition. Likewise, because the pronunciation dictionary stored in the pronunciation dictionary storage unit 3 and the language model stored in the language model storage unit 4 are updated by the language model update unit 12, the speech recognition unit 2 uses the updated pronunciation dictionary and language model for speech recognition. The speech recognition unit 2 outputs the speech recognition result to the speech recognition result correction unit 7.

The speech recognition result correction unit 7 is where a human operator corrects the speech recognition result into a correct word string; it outputs the speech recognition result and the resulting correct word string to the morpheme analysis unit 8. A conventional caption production system, for example, can be used as the speech recognition result correction unit 7. The morpheme analysis unit 8 performs morphological analysis on the corrected character strings contained in the correct word string, and outputs the speech recognition result, its correct word string, and the morphological analysis result to the error correction model update unit 10. The error correction model update unit 10 estimates the model parameter Λ of the error correction model from the speech recognition result obtained by the speech recognition unit 2 and the correct word string that was corrected by the speech recognition result correction unit 7 and morphologically analyzed by the morpheme analysis unit 8. Each time a sequential update block of speech recognition results and corrected correct word strings becomes available, the error correction model update unit 10 estimates the model parameter Λ and uses it to update the error correction model stored in the error correction model storage unit 6. The storage unit 11 included in the error correction model update unit 10 stores the speech recognition results and the corrected correct word strings used for updating the error correction model.

The pronunciation dictionary database storage unit 14 stores a pronunciation dictionary database. The pronunciation dictionary stored in the pronunciation dictionary storage unit 3 is the subset of that database extracted for use in speech recognition. The language model update unit 12 updates the language model stored in the language model storage unit 4 based on the correct word string. In addition, if the pronunciation of a word contained in the correct word string is not registered in the pronunciation dictionary stored in the pronunciation dictionary storage unit 3, the language model update unit 12 reads the pronunciation of that word from the pronunciation dictionary database stored in the pronunciation dictionary database storage unit 14 and registers it. The storage unit 13 included in the language model update unit 12 stores the correct word strings used for updating the language model.

[4. Processing procedure of error correction model learning device]
FIG. 5 is a flowchart showing the overall processing of the error correction model learning device 1. When speech is input, the error correction model learning device 1 sequentially performs the processing shown in the figure. The processing of each step shown in the figure is described below.

[4.1 Step S1]
The speech recognition unit 2 recognizes the input speech D1 using the pronunciation dictionary stored in the pronunciation dictionary storage unit 3, the language model stored in the language model storage unit 4, the acoustic model stored in the acoustic model storage unit 5, and the error correction model stored in the error correction model storage unit 6. The resulting speech recognition result is one in which errors in selecting among the recognition hypotheses obtained from the input speech have been corrected by the error correction model. The speech recognition unit 2 outputs a word string as the speech recognition result of the input speech. In the present embodiment, as in the technique disclosed in Japanese Patent Laid-Open No. 2001-092496, the speech recognition unit 2 recognizes the input speech sequentially and outputs speech recognition result data D2, which indicates the confirmed recognition result, word by word to the speech recognition result correction unit 7 in the subsequent stage. When the input speech reaches the end of an utterance (an utterance boundary determined by a silent section), the speech recognition unit 2 also outputs utterance end symbol data indicating an utterance end symbol to the speech recognition result correction unit 7. The speech recognition result correction unit 7 needs the utterance end symbol to determine the boundaries of correction blocks. In the present embodiment, the speech recognition result indicated by the speech recognition result data D2 may consist of only the sequentially confirmed maximum likelihood word string, or of a plurality of correct candidate word strings (the maximum likelihood word string and one or more other correct candidate word strings). When the speech recognition result consists of a plurality of correct candidate word strings, the speech recognition score of each correct candidate word string is also added to the speech recognition result data D2.

[4.2 Step S2]
When speech recognition result data D2 arrives from the speech recognition unit 2 in the preceding stage, the speech recognition result correction unit 7 generates a correction result in which the errors in the words of the speech recognition result indicated by the data D2 are corrected according to instructions entered by a human operator through an input means (not shown). When the speech recognition result consists of a plurality of correct candidate word strings, the maximum likelihood sequence is the one to be corrected. The words of the recognition result arrive sequentially from the speech recognition unit 2 in the speech recognition result data D2, but the unit of manual correction (the correction block) extends from the beginning to the end of an utterance. The speech recognition result correction unit 7 determines the boundaries of correction blocks from the utterance end symbol data output by the speech recognition unit 2.
After the recognition errors of the words in a correction block have been corrected manually and a send operation has been performed through the input means (not shown), the speech recognition result correction unit 7 outputs, for each correction block, the pair consisting of the speech recognition result data D2 and correction result data D3 indicating the correction result to the morpheme analysis unit 8 in the subsequent stage.

[4.3 Step S3]
The morpheme analysis unit 8 performs morphological analysis on the word string of the correction result indicated by the correction result data D3 input from the speech recognition result correction unit 7 in the preceding stage. The speech recognition result is a word string, but during correction the operator types characters without regard to word segmentation, so the word boundaries in the corrected portions are unknown. The morpheme analysis unit 8 therefore checks the words in the correction result one after another and uses morphological analysis to split any unsegmented string into words.
FIG. 6 is a flowchart showing the processing of the morpheme analysis unit 8.

(Step S31: Correction result word selection process)
The morpheme analysis unit 8 compares the correction result indicated by the correction result data D3 with the maximum likelihood sequence indicated by the speech recognition result data D2, and identifies each mismatched portion as a corrected character string. From the identified corrected character strings, indexed m = 1, 2, ..., the morpheme analysis unit 8 selects the first one (m = 1) that has not yet been processed.

(Step S32: Unregistered word determination process)
The morpheme analysis unit 8 determines whether the m-th corrected character string consists of more than one word by checking whether the string is contained in the morpheme analysis dictionary stored in the morpheme analysis dictionary database storage unit 9. If the corrected character string is contained in the morpheme analysis dictionary, the morpheme analysis unit 8 judges that the m-th corrected character string is a single word and proceeds to step S34. If it is not contained in the morpheme analysis dictionary, the morpheme analysis unit 8 treats the corrected character string as a sequence of several words and proceeds to step S33.

(Step S33: Morphological analysis process)
The morpheme analysis unit 8 morphologically analyzes the corrected character string and searches for a combination of known words contained in the morpheme analysis dictionary. For example, if the corrected character string c is the k-th word of the correction result, the word string ŵ obtained from c by morphological analysis is given by the following equation (16).

Here, w_{k−1} and w_{k+1} are the words immediately before and after the corrected character string c in the correction result. This is the same approach as that used in conventional statistical morphological analysis. The morpheme analysis unit 8 replaces the m-th corrected character string c in the correction result with the word string ŵ obtained by the morphological analysis.

(Step S34: Next corrected word selection process)
The morpheme analysis unit 8 adds 1 to the current value of m and repeats the processing from step S32. Once steps S32 and S33 have been performed for every corrected character string in the correction result, the morpheme analysis unit 8 ends the processing flow of FIG. 6. The morpheme analysis unit 8 then outputs to the error correction model update unit 10 in the subsequent stage the speech recognition result data D2 together with correction result data D4, which indicates the correction result in which each corrected character string of the correction result indicated by the correction result data D3 has been replaced with the word string obtained by morphological analysis.
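Steps S31 to S34 can be sketched as follows. This is a simplified illustration under stated assumptions, not the analysis of equation (16): the statistical morphological analysis is replaced here by a fewest-words dictionary segmentation, the context words w_{k−1} and w_{k+1} are ignored, and the mismatch detection uses Python's `difflib`. The function names and the set-based lexicon are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Set


def segment(text: str, lexicon: Set[str]) -> Optional[List[str]]:
    """Split text into the fewest lexicon words; None if no split exists.

    Stand-in for the statistical analysis of step S33 (an assumption).
    """
    best: List[Optional[List[str]]] = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            prefix = best[start]
            if prefix is None or text[start:end] not in lexicon:
                continue
            candidate = prefix + [text[start:end]]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    return best[len(text)]


def analyze_correction(recognized: List[str], corrected: List[str],
                       lexicon: Set[str]) -> List[str]:
    """Return the correction result with corrected strings split into words."""
    result: List[str] = []
    matcher = SequenceMatcher(None, recognized, corrected, autojunk=False)
    # S31: spans of `corrected` that differ from the recognition result.
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        for token in corrected[j1:j2]:
            # S32: unchanged tokens and dictionary words are kept as one word.
            if tag == "equal" or token in lexicon:
                result.append(token)
            else:
                # S33: split the string; keep it whole if no split is found.
                result.extend(segment(token, lexicon) or [token])
    return result
```

For example, with the recognition result `["today", "whether", "is", "fine"]` and the operator's correction `["today", "theweather", "is", "fine"]`, the unsegmented string `"theweather"` is split into `"the"` and `"weather"` when both are in the lexicon.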

[4.4 Step S4]
On receiving the speech recognition result data D2 and the correction result data D4, which indicates the morphologically analyzed correction result, from the morpheme analysis unit 8 in the preceding stage, the error correction model update unit 10 writes both to its internal storage unit 11. The error correction model update unit 10 updates the model parameter Λ based on the speech recognition result indicated by the speech recognition result data D2 and the correction result indicated by the correction result data D4 stored in the storage unit 11, and uses the updated model parameter Λ to update the error correction model stored in the error correction model storage unit 6.
FIG. 7 is a flowchart showing the processing of the error correction model update unit 10.

(Step S41: Update determination process)
The error correction model update unit 10 determines whether to update the model parameter Λ of the error correction model. This determination serves to control how often the model is updated. Let N be the number of words in the speech recognition results that have not yet been used in the subsequent processing, and M the number of words in the corresponding correction results. The error correction model update unit 10 decides to perform the subsequent processing when the smaller of N and M is equal to or greater than a predetermined threshold. It obtains N and M from the speech recognition result data D2 and the correction result data D4 that are stored in the storage unit 11 and have not yet been used in the subsequent processing. When the speech recognition result consists of a plurality of correct candidate word strings, N is the number of words in the maximum likelihood sequence.

If the smaller of N and M is less than the threshold, the error correction model update unit 10 decides not to perform the subsequent processing and ends the processing of FIG. 7. The speech recognition result data D2 and the correction result data D4 written to the storage unit 11 are used at the next update opportunity.
If the smaller of N and M is equal to or greater than the threshold, the error correction model update unit 10 decides to perform the processing from step S42. The speech recognition result data D2 and the correction result data D4 that are stored in the storage unit 11 and have not yet been used in the processing from step S42 onward form the sequential update block b_m.
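The update decision of step S41 amounts to buffering data until min(N, M) reaches the threshold. A minimal sketch follows; the class name, the list-based buffer, and the use of plain word lists for D2 and D4 are hypothetical simplifications.

```python
from typing import List, Optional, Tuple

# (recognized words, corrected words) for one correction block
Pair = Tuple[List[str], List[str]]


class UpdateBuffer:
    """Accumulates unused recognition/correction pairs until an update is due."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.pending: List[Pair] = []

    def add(self, recognized: List[str], corrected: List[str]) -> None:
        self.pending.append((recognized, corrected))

    def take_block(self) -> Optional[List[Pair]]:
        """Return the sequential update block b_m, or None if too little data."""
        n = sum(len(r) for r, _ in self.pending)  # N: recognized word count
        m = sum(len(c) for _, c in self.pending)  # M: corrected word count
        if min(n, m) < self.threshold:
            return None  # keep the data for the next update opportunity
        block, self.pending = self.pending, []
        return block
```

A larger threshold gives fewer, more stable updates; a smaller one lets the model follow topic changes more quickly.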

(Step S42: Feature calculation process)
The error correction model update unit 10 performs the feature calculation process on the sequential update block b_m. Specifically, for both the speech recognition result indicated by the speech recognition result data D2 and the correction result indicated by the correction result data D4 of the sequential update block b_m, it computes the values of the following feature functions (the number of times each feature occurs).

(1) A function that returns the number of occurrences of a pair of consecutive words (u, v), if present
(2) A function that returns the number of occurrences of a pair of consecutive parts of speech (c(u), c(v)), if present
(3) A function that returns the number of occurrences of a pair of consecutive semantic categories (s(u), s(v)), if present

This yields f_i(w_n^ref) from the correct word string w_n^ref indicated by the correction result. If the speech recognition result is a maximum likelihood sequence, f_i(w_n) is obtained from the maximum likelihood recognition result w_n; if it consists of a plurality of correct candidate word strings (correct candidate sentences) w_n^l, f_i(w_n^l) is obtained.
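The three feature types can be counted in one pass over adjacent word pairs. In the sketch below, the part-of-speech and semantic-category lookups are passed in as plain dictionaries, which is an assumption for illustration; the source does not specify how c(u) and s(u) are obtained. The function name and the feature key layout are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def bigram_features(words: List[str],
                    pos_of: Dict[str, str],
                    sem_of: Dict[str, str]) -> Counter:
    """Count word, part-of-speech, and semantic-category bigrams in a word string."""
    feats: Counter = Counter()
    for u, v in zip(words, words[1:]):
        feats[("word", u, v)] += 1
        # Words missing from a lookup table fall back to a placeholder label.
        feats[("pos", pos_of.get(u, "UNK"), pos_of.get(v, "UNK"))] += 1
        feats[("sem", sem_of.get(u, "UNK"), sem_of.get(v, "UNK"))] += 1
    return feats
```

Applying the same function to the correct word string and to each recognition hypothesis gives the counts f_i(w_n^ref), f_i(w_n), and f_i(w_n^l) used in the gradient calculation.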

(Step S43: Gradient calculation process)
Using the feature function values obtained in step S42, the error correction model update unit 10 performs the gradient calculation process, computing the gradient (the difference between feature function values) according to equation (9) or equation (15).
Specifically, when the speech recognition result is a maximum likelihood sequence, the error correction model update unit 10 substitutes f_i(w_n^ref) and f_i(w_n) obtained in step S42 into equation (9) to calculate the gradient Δλ_i^m.
When the speech recognition result consists of a plurality of correct candidate word strings, the error correction model update unit 10 reads the score Ŝ(w_n^l | x_n) of each correct candidate word string w_n^l from the speech recognition result data D2 and calculates the posterior probability p(w_n^l | x_n) of each correct candidate word string w_n^l by equation (12). It then substitutes f_i(w_n^ref) and f_i(w_n^l) obtained in step S42, together with the calculated posterior probabilities p(w_n^l | x_n), into equation (15) to calculate the gradient Δλ_i^m.
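The gradient calculation can be sketched as a difference of feature counts accumulated over the utterances in the block. Equations (9) and (15) are not reproduced in this text, so the form used here is an assumption: the reference counts minus the posterior-weighted counts of the candidates, which reduces to the reference counts minus the 1-best counts when there is a single candidate with posterior 1. The function name and the data layout are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Tuple

# One utterance: (reference feature counts, [(posterior, candidate feature counts), ...])
Utterance = Tuple[Counter, List[Tuple[float, Counter]]]


def block_gradient(utterances: List[Utterance]) -> Dict[Hashable, float]:
    """Assumed gradient: sum over n of f(w_ref) - sum over l of p_l * f(w_l)."""
    grad: Dict[Hashable, float] = defaultdict(float)
    for ref_feats, candidates in utterances:
        for key, count in ref_feats.items():
            grad[key] += count
        for posterior, cand_feats in candidates:
            for key, count in cand_feats.items():
                grad[key] -= posterior * count
    return dict(grad)
```

Features that occur equally often in the reference and in the hypotheses contribute a zero gradient, so only the rules that distinguish correct output from recognition errors move the weights.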

(Step S44: Parameter update process)
The error correction model update unit 10 obtains the weights λ_i^{m−1} calculated for the sequential update block b_{m−1} from the error correction model stored in the error correction model storage unit 6 and, using the gradient Δλ_i^m obtained in step S43 and these weights λ_i^{m−1}, calculates the weights λ_i^m by equation (10). The weights λ_i^{m−1} may instead be read from the storage unit 11.
Alternatively, the error correction model update unit 10 reads from the storage unit 11 the weights λ_i^{m−1} to λ_i^{m−(K−1)} calculated for the sequential update blocks b_{m−1} to b_{m−(K−1)}, and calculates the weights λ_i^m by equation (11) using the gradient Δλ_i^m obtained in step S43 and the weights read out. If the error correction model storage unit 6 also holds the error correction models used before the current one, the weights λ_i^{m−1} to λ_i^{m−(K−1)} may be obtained from those models.
The error correction model update unit 10 writes the model parameter Λ consisting of the calculated weights λ_i^m to the storage unit 11.

(Step S45: Model update process)
At the timing when the speech recognition unit 2 detects the end of an utterance in the input speech (before the next utterance starts), the error correction model update unit 10 replaces the model parameters of the error correction model that is stored in the error correction model storage unit 6 and referenced by the speech recognition unit 2 with the model parameter Λ held in the storage unit 11. The error correction model update unit 10 then ends the processing of FIG. 7.

[4.5 Step S5]
The language model update unit 12 uses the correction result to update the language model stored in the language model storage unit 4. In the present embodiment, the language model is updated by sequential processing, as is the error correction model. The update method is based on linear interpolation of n-gram models, which is a conventional method. The language model update unit 12 updates not only the language model but also the pronunciation dictionary that is stored in the pronunciation dictionary storage unit 3 and referenced by the speech recognition unit 2. This makes it possible to recognize words that are not in the pronunciation dictionary currently in use, and is also needed to improve the effectiveness of the error correction model.

図３は、言語モデル更新プロセスを示す図である。ｎ−ｇｒａｍモデルでは、モデルの統計的な精度を保証するために、可能な限り大量のテキストデータから学習する必要がある。そこで、同図に示すように、言語モデルの更新処理は、誤り修正モデルの逐次更新ブロックを複数組み合わせ、十分な数の単語が得られるブロックを更新の１単位とする。この言語モデルの更新の単位となるブロックを、言語モデル更新ブロックとする。 FIG. 3 is a diagram illustrating a language model update process. In the n-gram model, it is necessary to learn from as much text data as possible in order to guarantee the statistical accuracy of the model. Therefore, as shown in the figure, in the update process of the language model, a plurality of sequential update blocks of the error correction model are combined, and a block from which a sufficient number of words are obtained is regarded as one unit of update. A block that is a unit for updating the language model is a language model update block.

図８は、言語モデル更新部１２の処理を示すフローチャートである。言語モデル更新部１２は、形態素解析部８から送られた修正結果データＤ４の入力を受けると、修正結果データＤ４を内部に備える記憶部１３に書き込む。 FIG. 8 is a flowchart showing the processing of the language model update unit 12. When receiving the input of the correction result data D4 sent from the morphological analysis unit 8, the language model update unit 12 writes the correction result data D4 in the storage unit 13 provided therein.

(Step S51: Update determination process)
The language model update unit 12 determines whether to update the language model. The purpose of this determination is to estimate the language model only after a sufficient number of words has been collected from the correction results. The language model update unit 12 decides to perform the subsequent processing when the number of words M in the correction results is equal to or greater than a predetermined threshold. It obtains M from the correction result data D4 that is stored in the storage unit 13 and has not yet been used in the subsequent processing. If M is less than the threshold, the language model update unit 12 decides not to perform the subsequent processing and ends the processing of FIG. 8. The correction result data D4 written to the storage unit 13 is used at the next update opportunity.
If M is equal to or greater than the threshold, the language model update unit 12 decides to perform the subsequent processing and proceeds to step S52. The correction result data D4 that is stored in the storage unit 13 and has not yet been used in the processing from step S52 onward becomes the language model update block to be processed.

(Step S52: Pronunciation dictionary update process)
The language model update unit 12 reads from the storage unit 13 the correction result data D4 contained in the language model update block to be processed. For each word in the correction result indicated by the correction result data D4, it reads the pronunciation from the pronunciation dictionary database stored in the pronunciation dictionary database storage unit 14 and attaches it to the word. The pronunciation dictionary stored in the pronunciation dictionary storage unit 3 is referenced by the speech recognition unit 2, and the entries to be added are the word and pronunciation pairs not yet contained in the pronunciation dictionary storage unit 3. The language model update unit 12 therefore, for example, queries the pronunciation dictionary storage unit 3 as to whether each word in the correction result and its pronunciation are registered, and selects the unregistered word and pronunciation pairs. If the pronunciation dictionary database contains no pronunciation for a word, the language model update unit 12 replaces that word in the correction result with a symbol representing an unknown word, so that the word is not used in the n-gram estimation performed later.
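The selection logic of step S52 can be sketched as follows. The dictionaries standing in for the pronunciation dictionary (storage unit 3) and the pronunciation dictionary database (storage unit 14), the `<unk>` symbol, and the function name are all hypothetical; the source does not specify the data structures or the unknown-word symbol.

```python
from typing import Dict, List, Tuple


def plan_lexicon_update(words: List[str],
                        active_lexicon: Dict[str, str],
                        master_lexicon: Dict[str, str],
                        unk: str = "<unk>") -> Tuple[Dict[str, str], List[str]]:
    """Select new (word, pronunciation) pairs and mask words with no pronunciation.

    Returns the pairs to add to the active lexicon and the rewritten word string
    to be used for n-gram estimation.
    """
    new_entries: Dict[str, str] = {}
    rewritten: List[str] = []
    for word in words:
        if word in active_lexicon or word in new_entries:
            rewritten.append(word)          # already recognizable
        elif word in master_lexicon:
            new_entries[word] = master_lexicon[word]
            rewritten.append(word)          # to be registered in step S54
        else:
            rewritten.append(unk)           # no pronunciation available
    return new_entries, rewritten
```

Masking words without a pronunciation keeps the language model from assigning probability to words the recognizer could never output.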

(Step S53: n-gram calculation process)
The language model update unit 12 estimates n-grams from the words included in the correction result. Taking a trigram as an example of an n-gram, the estimation formula is the following equation (17).

$$P(w \mid u, v) = \frac{C(u, v, w)}{C(u, v)} \tag{17}$$

Here, C(u, v) is the frequency of the word pair (u, v) in the correction result, and C(u, v, w) is the frequency of the word triple (u, v, w). P(w | u, v) is the trigram, that is, the conditional probability that the word w occurs following the word pair (u, v). The language model update unit 12 slides a three-word window over the correction result from its beginning, one word at a time, and performs the above calculation over all word triples in the correction result.
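The relative-frequency estimate described here, P(w | u, v) = C(u, v, w) / C(u, v), can be sketched in a few lines. The function name `estimate_trigrams` and the `skip` parameter are hypothetical. As a simplifying assumption, the history count C(u, v) is accumulated from the triples themselves, so the final word pair of the sequence, which has no following word, is not counted and the probabilities for each history sum to 1.

```python
from collections import Counter


def estimate_trigrams(words, skip=None):
    """Relative-frequency trigram estimate P(w | u, v) = C(u, v, w) / C(u, v).

    words: correction result as a list of words
    skip:  optional symbol (e.g. an unknown-word marker); triples that
           contain it are ignored
    """
    # Slide a three-word window over the sequence, one word at a time.
    triples = zip(words, words[1:], words[2:])
    tri_counts = Counter(t for t in triples if skip is None or skip not in t)

    # Count each history (u, v) once per triple it starts.
    hist_counts = Counter()
    for (u, v, _w), count in tri_counts.items():
        hist_counts[(u, v)] += count

    return {
        (u, v, w): count / hist_counts[(u, v)]
        for (u, v, w), count in tri_counts.items()
    }
```

For the word sequence `a b c a b d`, the history `(a, b)` is followed once by `c` and once by `d`, so both trigrams receive probability 0.5.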

Next, as in the following equation (18), the language model update unit 12 combines, by linear interpolation, the trigram currently stored in the language model storage unit 4 and referenced by the speech recognition unit 2 with the trigram obtained above.

Here, P^new(w | u, v) is the updated trigram, and P^old(w | u, v) is the trigram of the language model currently stored in the language model storage unit 4. ν is the weight of the linear interpolation and is determined in advance.
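The linear interpolation can be sketched as below, assuming it takes the usual convex-combination form P^new = ν · P^old + (1 − ν) · P. Which of the two models ν multiplies is an assumption of this sketch and is not confirmed by the text here. The function name `interpolate` is hypothetical, and treating a trigram that is missing from one model as probability 0 is a simplification; a practical language model would typically back off to lower-order n-grams instead.

```python
def interpolate(p_old, p_block, nu):
    """Linearly interpolate two trigram tables.

    p_old:   dict (u, v, w) -> probability, the currently stored model
    p_block: dict (u, v, w) -> probability, estimated from the correction result
    nu:      interpolation weight in [0, 1]; ASSUMED here to weight the old model
    """
    if not 0.0 <= nu <= 1.0:
        raise ValueError("nu must lie in [0, 1]")
    keys = set(p_old) | set(p_block)
    # Trigrams absent from one table are treated as probability 0 (simplification).
    return {
        k: nu * p_old.get(k, 0.0) + (1.0 - nu) * p_block.get(k, 0.0)
        for k in keys
    }
```

Because both inputs are probability distributions for a given history and the weights sum to 1, the interpolated values for that history again sum to 1.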

(Step S54: Model update process)
The language model update unit 12 adds the word-pronunciation pairs selected in step S52 to the pronunciation dictionary stored in the pronunciation dictionary storage unit 3. Furthermore, the language model update unit 12 replaces the language model currently stored in the language model storage unit 4 with the trigram obtained in step S53.
Note that the update of the error correction model by the error correction model update unit 10 and the update of the language model by the language model update unit 12 operate independently of each other, so either update can be performed at any utterance end timing.

[5. Effect]
According to the present embodiment, the error correction model can be estimated with a small amount of computation, so the error correction model learning device 1 can sequentially update the error correction model used for speech recognition with low delay. By performing speech recognition using this sequentially updated error correction model, the speech recognition unit 2 can reduce speech recognition errors in a way that reflects real-time conditions better than conventional techniques.

[6. Others]
Note that the error correction model learning device 1 described above contains a computer system. The operation of the error correction model learning device 1 is stored in the form of a program on a computer-readable recording medium, and the above processing is performed when the computer system reads and executes this program. The computer system here includes a CPU, various memories, an OS, and hardware such as peripheral devices.

Further, if a WWW system is used, the "computer system" also includes a homepage providing environment (or display environment).
The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, and to a storage device such as a hard disk built into a computer system. The "computer-readable recording medium" also includes a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted via a network such as the Internet or a communication line such as a telephone line, and a medium that holds the program for a certain period of time, such as the volatile memory inside the computer system that serves as the server or client in that case. The program may implement only part of the functions described above, or may implement those functions in combination with a program already recorded in the computer system.

1 Error correction model learning device
2 Speech recognition unit
3 Pronunciation dictionary storage unit
4 Language model storage unit
5 Acoustic model storage unit
6 Error correction model storage unit
7 Speech recognition result correction unit
8 Morphological analysis unit
9 Morphological analysis dictionary database storage unit
10 Error correction model update unit
11 Storage unit
12 Language model update unit
13 Storage unit
14 Pronunciation dictionary database storage unit

Claims

An error correction model learning device comprising:
a speech recognition result correction unit that corrects a speech recognition result in accordance with an input instruction; and
an error correction model update unit that learns a word error tendency from a difference between a linguistic feature included in the speech recognition result and a linguistic feature included in the correction result produced by the speech recognition result correction unit, and that updates, in accordance with the learned word error tendency, an error correction model for correcting word errors in speech recognition,
wherein the linguistic feature is a frequency of a sequence of consecutive words or of a part-of-speech sequence of consecutive words,
the error correction model is a formula that corrects a speech recognition score using a feature function based on the linguistic feature and a feature weight of the feature function, and
the error correction model update unit updates the feature weight of the error correction model in accordance with the learned word error tendency.

The error correction model learning device according to claim 1, wherein the error correction model update unit learns the error tendency using a co-occurrence frequency of words, or of parts of speech of words, included in the speech recognition result and a co-occurrence frequency of words, or of parts of speech of words, included in the correction result.

The error correction model learning device according to claim 1, wherein
the speech recognition result correction unit corrects the speech recognition results obtained sequentially for each block, and
the error correction model update unit sequentially performs a process of learning the word error tendency from a difference between a linguistic feature included in the speech recognition result in a block and a linguistic feature included in the correction result in that block, calculating the feature weight for the block in accordance with the learned word error tendency, and updating the feature weight of the error correction model by a weighted addition of the feature weight calculated for the block and the feature weights calculated for the blocks preceding it.

The error correction model learning device according to any one of claims 1 to 3, further comprising a speech recognition unit that performs speech recognition on input speech and that, using the error correction model updated by the error correction model update unit, corrects errors in the selection of the speech recognition result obtained from the input speech and outputs the result.

A program for causing a computer to function as an error correction model learning device comprising:
speech recognition result correction means for correcting a speech recognition result in accordance with an input instruction; and
error correction model update means for learning a word error tendency from a difference between a linguistic feature included in the speech recognition result and a linguistic feature included in the correction result produced by the speech recognition result correction means, and for updating, in accordance with the learned word error tendency, an error correction model for correcting word errors in speech recognition,
wherein the linguistic feature is a frequency of a sequence of consecutive words or of a part-of-speech sequence of consecutive words,
the error correction model is a formula that corrects a speech recognition score using a feature function based on the linguistic feature and a feature weight of the feature function, and
the error correction model update means updates the feature weight of the error correction model in accordance with the learned word error tendency.