JP2014119559A

JP2014119559A - Speech recognition device, error correction model learning method, and program

Info

Publication number: JP2014119559A
Application number: JP2012273707A
Authority: JP
Inventors: Akio Kobayashi; 彰夫小林
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2012-12-14
Filing date: 2012-12-14
Publication date: 2014-06-30
Anticipated expiration: 2032-12-14
Also published as: JP6086714B2

Abstract

PROBLEM TO BE SOLVED: To optimize an error correction model while keeping costs down, by using the relationships between utterances derived from the speech recognition results of utterances made at different times. SOLUTION: A speech recognition unit 11 stores the speech recognition results of utterance voice data in a speech language resource storage unit 21 while preserving the order of the utterances. An error tendency learning unit 13 acquires linguistic features that depend on the order of the utterances from the words included in a speech recognition result and the words included in the speech recognition result of a temporally adjacent past utterance. The error tendency learning unit 13 weights the acquired linguistic features according to the posterior probability of the speech recognition result of the past utterance, statistically learns the tendency of word recognition errors based on the weighted linguistic features, and generates an error correction model for correcting the learned recognition error tendency. A speech recognition unit 14 performs speech recognition on voice data and uses the generated error correction model to correct errors in the selection of the speech recognition result.

Description

The present invention relates to a speech recognition device, an error correction model learning method, and a program.

For error correction in speech recognition, there is a technique that statistically learns the error tendencies of speech recognition from speech and its transcriptions (correct sentences) using linguistic features, and improves recognition performance with the statistical error correction model obtained from that learning (see, for example, Non-Patent Document 1). There is also a technique that learns an error correction model from training data without correct word strings and thereby improves speech recognition performance (see, for example, Non-Patent Document 2).

Non-Patent Document 1: Kobayashi et al., "News speech recognition by discriminative scoring based on word error minimization", IEICE Journal, vol. J93-D, no. 5, 2010, pp. 598-609.
Non-Patent Document 2: Kobayashi, A., Oku, T., Homma, S., Imai, T. and Nakagawa, S., "Lattice-based risk minimization training for unsupervised language model adaptation", Proc. Interspeech, pp. 1453-1456, 2011.

In speech recognition for broadcast programs and the like, a series of consecutive utterances is recognized one after another, and the content of the utterance currently being processed is often related to the content of the immediately preceding utterance, for which recognition has already finished. For example, in a cooking program, an utterance introducing ingredients can be expected to be followed by utterances about how to cook them. In other words, words about ingredients and words about cooking methods are likely to co-occur in adjacent utterances. For example, if the utterance "pound the pork fillet" is followed by the utterance "next, season with salt and pepper", then "pork fillet" and "salt and pepper" are related and tend to co-occur.
However, the model parameter learning of the conventional error correction models shown in Non-Patent Documents 1 and 2 does not take into account information that depends on utterance order, such as word co-occurrence between utterances, so the resulting models are not optimal for correctly predicting utterance content.
In addition, Non-Patent Document 1 requires a large amount of speech data, together with correct word strings that are transcriptions of that speech data, to train the conventional error correction model. Estimating a statistically robust model requires a large amount of training data, but creating the transcriptions is costly.

The present invention has been made in view of these circumstances, and provides a speech recognition device, an error correction model learning method, and a program that optimize an error correction model while keeping costs down, by using the relationships between utterances derived from the speech recognition results of utterances made at different times.

[1] One aspect of the present invention is a speech recognition device comprising: a speech language resource storage unit that stores speech recognition results, obtained by performing speech recognition on the voice data of utterances, while preserving the order of the utterances; and an error tendency learning unit that acquires linguistic features depending on the order of the utterances from the words included in a speech recognition result and the words included in the speech recognition result of an utterance earlier than that speech recognition result, weights the acquired linguistic features according to the posterior probability of the speech recognition result of the past utterance, statistically learns the tendency of word recognition errors based on the weighted linguistic features, and generates an error correction model for correcting the learned recognition error tendency.
According to this invention, the speech recognition device performs speech recognition on the voice data of an utterance and extracts linguistic features depending on the order of the utterances from the words included in the obtained speech recognition result and the words included in the speech recognition result of an earlier utterance. As the speech recognition result of the past utterance, for example, that of the most recent, temporally adjacent past utterance is used. The speech recognition device weights the extracted linguistic features according to the posterior probability of the past-utterance speech recognition result from which they were obtained, statistically learns the tendency of word recognition errors based on the weighted linguistic features, and generates an error correction model for correcting the learned recognition error tendency.
This makes it possible to generate an error correction model well suited to correctly predicting utterance content by using the relationships between utterances derived from the speech recognition results of utterances made at different times, and also to reduce the cost of transcribing voice data.
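As an illustration of the weighting described in [1], the following minimal sketch accumulates co-occurrence features between the words of a current hypothesis and the words of each N-best hypothesis of the preceding utterance, weighting each pair by that hypothesis's posterior probability. The function name, the data layout, and the choice to count each word once per hypothesis are assumptions made for this example; they are not taken from the specification.

```python
from collections import defaultdict

def cross_utterance_features(current_words, previous_nbest):
    """Posterior-weighted co-occurrence between a current hypothesis and the
    N-best hypotheses of the previous utterance (hypothetical helper).

    previous_nbest: list of (words, posterior) pairs; posteriors are assumed
    to be normalized over the N-best list.
    Returns {(u, v): weight} with u from the previous utterance and v from
    the current hypothesis.
    """
    features = defaultdict(float)
    for prev_words, posterior in previous_nbest:
        for u in set(prev_words):          # each word counted once per hypothesis
            for v in set(current_words):
                features[(u, v)] += posterior
    return dict(features)

previous_nbest = [
    (["pound", "the", "pork", "fillet"], 0.75),
    (["pound", "the", "fork", "fillet"], 0.25),
]
features = cross_utterance_features(["season", "with", "salt"], previous_nbest)
print(features[("pork", "salt")])    # 0.75
print(features[("fillet", "salt")])  # 1.0
```

A word that appears only in a low-posterior hypothesis ("fork") contributes little, so likely misrecognitions in the preceding utterance have less influence on the learned features.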

[2] One aspect of the present invention is the speech recognition device described above, wherein the error tendency learning unit statistically learns the tendency of word recognition errors based on the weighted linguistic features depending on the order of the utterances and on linguistic features within the same utterance obtained from the speech recognition result, and generates an error correction model for correcting the learned recognition error tendency.
According to this invention, the speech recognition device extracts linguistic features depending on the order of the utterances from the speech recognition results of utterances made at different times, and also extracts linguistic features within the same utterance from each speech recognition result. The speech recognition device statistically learns the tendency of word recognition errors based on these extracted linguistic features, and generates an error correction model for correcting the learned recognition error tendency.
As a result, the speech recognition device can generate an error correction model that corrects recognition errors accurately by using linguistic features within the same utterance in addition to information drawn from the content of utterances earlier than the utterance being recognized.

[3] One aspect of the present invention is the speech recognition device described above, wherein the linguistic feature depending on the order of the utterances is a co-occurrence relationship between a word included in the speech recognition result and a word included in the speech recognition result of the past utterance, and the linguistic feature within the same utterance is one or more of the following, obtained from the speech recognition result: a co-occurrence relationship of a plurality of consecutive words within the same utterance, a co-occurrence relationship of a plurality of non-consecutive words, syntactic information of words, or semantic information of words.
According to this invention, the speech recognition device statistically learns the tendency of word errors based on the co-occurrence relationships between words obtained from the speech recognition results of utterances made at different times, and on the word co-occurrence relationships and the syntactic and semantic information within the same utterance obtained from each speech recognition result, and generates an error correction model for correcting the learned recognition error tendency.
As a result, the speech recognition device can generate an error correction model that corrects recognition errors accurately.

[4] One aspect of the present invention is the speech recognition device described above, wherein the error correction model is a formula that corrects the speech recognition score using a first feature function based on the linguistic features depending on the order of the utterances, weighted by the posterior probability of the speech recognition result of the past utterance, a second feature function based on the linguistic features within the same utterance, and a feature weight for each of the first feature function and the second feature function, and the error tendency learning unit statistically calculates the feature weights based on an evaluation value calculated by an evaluation function, and generates the error correction model using the calculated feature weights, the evaluation function being defined using the value of the first feature function obtained from the speech recognition result and the speech recognition result of the past utterance, the value of the second feature function obtained from the speech recognition result, and a value obtained by weighting, by the posterior probability of the speech recognition result, the number of word errors obtained by comparing a plurality of the speech recognition results obtained from the same voice data.
According to this invention, the error correction model is a formula that corrects the speech recognition score using a feature function representing the linguistic features depending on the order of the utterances, weighted by the posterior probability of the speech recognition result of the past utterance, a feature function representing the linguistic features within the same utterance, and the feature weights of those feature functions. The evaluation function is defined using the feature function values obtained from the speech recognition results of temporally different utterances and of the same utterance, together with the number of word errors obtained by comparing a plurality of speech recognition results of the same utterance, weighted by the posterior probabilities of those speech recognition results. The speech recognition device determines the feature weights so that the evaluation value calculated by this evaluation function is the one indicating the fewest recognition errors, and generates the error correction model.
As a result, the speech recognition device can learn recognition error tendencies efficiently and generate an error correction model.

[5] One aspect of the present invention is the speech recognition device described above, further comprising a speech recognition unit that performs speech recognition on input voice data and, using the error correction model generated by the error tendency learning unit, corrects errors in the selection of the speech recognition result obtained from the input voice data.
According to this invention, the speech recognition device uses the error correction model to select a speech recognition result from among the correct-answer candidates obtained by performing speech recognition on the voice data.
As a result, the speech recognition device can obtain speech recognition results with a high recognition rate.

[6] One aspect of the present invention is an error correction model learning method comprising: a speech language resource storage step of storing speech recognition results, obtained by performing speech recognition on the voice data of utterances, while preserving the order of the utterances; and an error tendency learning step of acquiring linguistic features depending on the order of the utterances from the words included in a speech recognition result and the words included in the speech recognition result of an utterance earlier than that speech recognition result, weighting the acquired linguistic features according to the posterior probability of the speech recognition result of the past utterance, statistically learning the tendency of word recognition errors based on the weighted linguistic features, and generating an error correction model for correcting the learned recognition error tendency.

[7] One aspect of the present invention is a program for causing a computer to function as a speech recognition device comprising: speech language resource storage means for storing speech recognition results, obtained by performing speech recognition on the voice data of utterances, while preserving the order of the utterances; and error tendency learning means for acquiring linguistic features depending on the order of the utterances from the words included in a speech recognition result and the words included in the speech recognition result of an utterance earlier than that speech recognition result, weighting the acquired linguistic features according to the posterior probability of the speech recognition result of the past utterance, statistically learning the tendency of word recognition errors based on the weighted linguistic features, and generating an error correction model for correcting the learned recognition error tendency.

According to the present invention, it is possible to optimize an error correction model while reducing the cost of transcribing voice data, by using the relationships between utterances derived from the speech recognition results of utterances made at different times.

FIG. 1 is a diagram showing the error correction model learning method in a speech recognition device according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining an example of a feature function depending on the order of utterances according to the embodiment.
FIG. 3 is a functional block diagram showing the configuration of the speech recognition device according to the embodiment.
FIG. 4 is a diagram showing the overall processing flow of the speech recognition device according to the embodiment.
FIG. 5 is a diagram showing the model learning processing flow of the speech recognition device according to the embodiment.
FIG. 6 is a diagram showing the error correction model learning method of the conventional method.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

[1. Overview of this embodiment]
Error correction models that reflect the error tendencies of speech recognition have already been devised, but they do not use information based on the relationship between a continuously spoken utterance and the content of temporally adjacent utterances. In consecutive utterances, an utterance often contains words related to the words used in the immediately preceding utterance. Using such word connections between neighboring utterances in the error correction model can therefore be expected to improve speech recognition performance.

On the other hand, when an error correction model is generated, the error tendencies of speech recognition are generally learned using speech and its correct word strings, that is, its transcribed text. Transcriptions are created manually, and because a large amount of training data is needed to make the statistical model robust, creating them is costly. Furthermore, when no correct word string corresponding to the speech is available, information about the content of the utterance immediately preceding the utterance of interest must be obtained from multiple speech recognition results that contain recognition errors.
The speech recognition device of this embodiment therefore uses the linguistic features contained in multiple speech recognition results for the most recent utterance content to learn, without correct word strings, an error correction model that adapts speech recognition performance to the utterance content, and applies the model to speech recognition.

[2. Error correction model learning algorithm]
Next, the learning algorithm of the error correction model applied to the speech recognition device according to an embodiment of the present invention is described.
As described above, to solve the problems of the conventional approach, the speech recognition device of this embodiment introduces the order of utterances into the voice data used for learning, incorporates the relationships between adjacent utterances into the error correction model, and learns the error correction model without correct word strings. This embodiment differs from the conventional method in how the data is handled when the error correction model is learned.

FIG. 6 is a diagram showing the error correction model learning method of the conventional method. As shown in the figure, in the conventional method the training data, which consists of multiple utterances, does not preserve the order of those utterances. In addition, the training data includes the correct word string of each utterance. The error tendencies of speech recognition are thus learned from training data that includes correct word strings and does not preserve utterance order.

FIG. 1 is a diagram showing the error correction model learning method according to this embodiment. As shown in the figure, this embodiment uses only speech recognition results as training data, takes the order of the utterances in the training data into account to extract the relationships between temporally adjacent utterances as linguistic features, and uses those features to learn the error correction model. Because the resulting error correction model reflects the relationships between adjacent utterances, speech recognition performance can be improved over the conventional method. Because no correct word strings are needed, the cost of learning the error correction model is also lower than with the conventional method.
In this way, the speech recognition device of this embodiment learns an error correction model for correctly predicting utterance content from speech recognition results that have no correct word strings, together with the speech recognition results of the immediately preceding utterance.

[2.1 Error correction model of the conventional method]
According to Bayes' theorem, when a speech input x is given, the most likely word string ŵ for that speech input x can be obtained by the following Equation (1).
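The equation is not reproduced in this text. Based on the definitions of P(w|x), P(x|w), and P(w) given for it, Equation (1) presumably has the standard Bayes decision form; the following is a reconstruction under that assumption, with P(x) dropped in the last step because it does not depend on w.

```latex
\hat{w} = \arg\max_{w} P(w \mid x)
        = \arg\max_{w} \frac{P(x \mid w)\,P(w)}{P(x)}
        = \arg\max_{w} P(x \mid w)\,P(w) \tag{1}
```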

The speech input x and the word string w correspond, for example, to one utterance, and P(w|x) is the posterior probability that the word string (sentence hypothesis) w is obtained when the speech input x is given.
P(x|w) is the likelihood representing the acoustic plausibility of the word string w, and its score (the acoustic score) is calculated with a statistical acoustic model (hereinafter, "acoustic model") typified by hidden Markov models (HMM) and Gaussian mixture models (GMM). In other words, the acoustic score represents how plausible each of several correct-answer candidate words is when the acoustic features are given.

Meanwhile, P(w) is the linguistic generation probability of the word string w, and its score (the language score) is calculated with a statistical language model (hereinafter, "language model") such as a word n-gram model. In other words, the language score represents how plausible each of several correct-answer candidate word strings is when the word string before the word being recognized, the word string after it, or both are given. A word n-gram model gives the occurrence probability of the next word from a history of (N-1) words, based on the statistics of N-word chains (N is, for example, 1, 2, or 3).
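To make the n-gram description concrete, here is a minimal sketch for N = 2 that estimates the probability of the next word from a one-word history by relative frequency. The toy corpus, the sentence boundary markers, and the absence of smoothing are assumptions of this example; a practical language model would smooth its estimates.

```python
from collections import Counter

def train_bigram(sentences):
    """Return p(next_word | prev_word) estimated by unsmoothed relative frequency."""
    history_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]   # sentence boundary markers
        for prev, nxt in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, nxt)] += 1

    def prob(nxt, prev):
        if history_counts[prev] == 0:
            return 0.0                        # unseen history
        return bigram_counts[(prev, nxt)] / history_counts[prev]

    return prob

corpus = [
    ["season", "with", "salt"],
    ["season", "with", "pepper"],
    ["season", "lightly"],
]
p = train_bigram(corpus)
print(p("with", "season"))  # 2 of the 3 occurrences of "season" are followed by "with"
print(p("salt", "with"))    # 0.5
```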

In the following description, an HMM-GMM is used as the acoustic model and an n-gram is used as the language model.

When P(x|w)P(w) in Equation (1) is at its maximum, its logarithm is also at its maximum. In speech recognition, therefore, based on the Bayes' theorem of Equation (1), the evaluation function q(w|x) of a word string w that is a sentence hypothesis (correct-answer candidate) for a given speech input x is defined as in the following Equation (2).
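The equation is not reproduced in this text. Taking the logarithm of P(x|w)P(w) and using the symbols f_am, f_lm, and λ_lm that are defined for it, Equation (2) presumably has the following form; this is a reconstruction, and the additive combination with a single language weight is an assumption.

```latex
q(w \mid x) = f_{am}(x \mid w) + \lambda_{lm}\, f_{lm}(w) \tag{2}
```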

In Equation (2), f_am(x|w) is the log acoustic score of the word string w under the acoustic model, f_lm(w) is the log language score of the word string w under the language model, and λ_lm is the weight of the language score relative to the acoustic score.

Once Equation (2) is defined, as shown in the following Equation (3), the word string ŵ for which the evaluation function q(w|x) of Equation (2) is largest is selected, from the set of correct-answer candidate word strings w for the speech input x, as the speech recognition result for x.
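The selection step amounts to an argmax over the candidate list. The sketch below assumes that q is the log acoustic score plus the weighted log language score, and that each candidate is stored as a dictionary with hypothetical keys `f_am` and `f_lm`; the scores are made-up values for illustration.

```python
def q(hyp, lm_weight):
    """Evaluation function: log acoustic score plus weighted log language score."""
    return hyp["f_am"] + lm_weight * hyp["f_lm"]

def select_best(hypotheses, lm_weight):
    """Return the candidate word string with the largest evaluation value."""
    return max(hypotheses, key=lambda hyp: q(hyp, lm_weight))

candidates = [
    {"words": "season with salt", "f_am": -120.0, "f_lm": -9.0},
    {"words": "season with fault", "f_am": -118.0, "f_lm": -14.0},
]
best = select_best(candidates, lm_weight=10.0)
print(best["words"])  # season with salt
```

With `lm_weight=0.0` the acoustically better but linguistically unlikely second candidate would be chosen instead, which shows the role of the language weight.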

In the error correction model of the conventional method, the maximum-likelihood hypothesis is obtained by the following Equation (4).
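The equation is not reproduced in this text. Since the term Σ_i λ_i f_i(w) is described as a penalty or reward added to the recognition score, Equation (4) presumably extends the evaluation function with that term as follows; this is a reconstruction under that assumption.

```latex
\hat{w} = \arg\max_{w} \Bigl\{ f_{am}(x \mid w) + \lambda_{lm}\, f_{lm}(w)
          + \sum_{i} \lambda_{i}\, f_{i}(w) \Bigr\} \tag{4}
```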

In Equation (4), Σ_i λ_i f_i(w) is a score that reflects the error tendency of the word string w, and it acts as a penalty or reward for w. Here, f_i(w) (i = 1, ...) is the i-th feature function, and λ_i is the weight (feature weight) of the feature function f_i(w). A feature function is defined so that, for a given word string (here, the word string w), it returns the number of times a linguistic rule holds, and 0 if the rule does not hold. These rules are linguistic features such as consecutive words within the same utterance, co-occurrence relationships of two or more non-consecutive words, and syntactic or semantic information of words. Specific examples of rules for the feature function f_i in the conventional method are given below.

For example, feature functions based on word co-occurrence relationships include the following (1) and (2).

(1) A function that returns the number of occurrences when the word string w contains the consecutive word pair (u, v)
(2) A function that returns the number of occurrences when the word string w contains the non-consecutive word pair (u, v)
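Rules (1) and (2) can be sketched as two small counting functions. The text does not define exactly how non-consecutive pairs are counted; this example assumes that every ordered position pair with at least one word in between is counted, and the function names are hypothetical.

```python
def count_consecutive_pair(words, u, v):
    """Rule (1): number of positions where u is immediately followed by v."""
    return sum(1 for a, b in zip(words, words[1:]) if a == u and b == v)

def count_nonconsecutive_pair(words, u, v):
    """Rule (2): number of position pairs i < j with j > i + 1,
    words[i] == u and words[j] == v (assumed interpretation)."""
    return sum(
        1
        for i, a in enumerate(words)
        for j in range(i + 2, len(words))
        if a == u and words[j] == v
    )

w = ["salt", "and", "pepper", "then", "salt", "pepper"]
print(count_consecutive_pair(w, "salt", "pepper"))     # 1
print(count_nonconsecutive_pair(w, "salt", "pepper"))  # 2
```

Both functions return 0 when the pair does not occur, matching the definition of a feature function given above.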

Feature functions based on syntactic information, obtained after replacing each word in the word string w with a part-of-speech category (syntactic information) such as noun or verb, include, for example, the following (3) and (4). Here, c(·) is a function that maps a word to its part of speech.

(3) A function that returns the number of occurrences when the word string w contains the consecutive part-of-speech pair (c(u), c(v))
(4) A function that returns the number of occurrences when the word string w contains the non-consecutive part-of-speech pair (c(u), c(v))

Alternatively, feature functions based on semantic information, obtained after replacing each word in the word string w with a category representing semantic information (a semantic category), include, for example, the following (5) and (6). Semantic categories can be obtained using, for example, a thesaurus stored in a database that the speech recognition device of this embodiment has externally or internally. Here, s(·) is a function that maps a word to its semantic category.

(5) A function that returns the number of occurrences of the consecutive semantic-category pair (s(u), s(v)) in the word string w (0 if the pair does not occur).
(6) A function that returns the number of occurrences of the non-consecutive semantic-category pair (s(u), s(v)) in the word string w (0 if the pair does not occur).
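The category-based rules (3) to (6) differ from the word-based rules only in that every word is first mapped through c(·) or s(·). A minimal sketch, assuming a hypothetical toy lexicon in place of a real part-of-speech tagger or thesaurus, and assuming that "non-consecutive" means at least one word lies between the two positions:

```python
from typing import Callable, Sequence


def count_category_pair(
    words: Sequence[str],
    category_of: Callable[[str], str],
    cu: str,
    cv: str,
    consecutive: bool = True,
) -> int:
    """Count the category pair (cu, cv) after mapping every word to a category."""
    cats = [category_of(word) for word in words]
    if consecutive:
        return sum(1 for a, b in zip(cats, cats[1:]) if (a, b) == (cu, cv))
    # Assumption: "non-consecutive" means at least one word lies in between.
    return sum(
        1
        for i, a in enumerate(cats)
        if a == cu
        for b in cats[i + 2:]
        if b == cv
    )


# Hypothetical toy lexicon standing in for c(.); a thesaurus lookup would
# play the same role for s(.).
TOY_POS = {"the": "DET", "cat": "NOUN", "dog": "NOUN", "sat": "VERB"}


def c(word: str) -> str:
    return TOY_POS.get(word, "UNK")
```

Passing the mapping as a callable lets the same counting code serve both the syntactic and the semantic variants.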

As described above, the error tendency of speech recognition is expressed by feature functions and their weights as penalties on linguistic features, and is estimated based on an evaluation function that minimizes word errors on the learning data. In other words, conventional error tendency learning amounts to obtaining the weights λ_i of Equation (4) from the learning data.

[2.2 Learning algorithm of the error correction model applied to this embodiment]
Let the word string (sentence hypothesis) w_{m,k} be one of the speech recognition results for the speech input (utterance) x_m of interest. Let G_{m-1} be the set of word strings obtained as speech recognition results for the temporally adjacent utterance, that is, the speech input x_{m-1} immediately preceding x_m. Then the conditional probability P(w_{m,k} | x_m, G_{m-1}) of the word string w_{m,k}, given the speech input x_m and the set of word strings G_{m-1}, is given by Equation (5) below.

The derivation of Equation (5) uses Bayes' theorem and the fact that the set G_{m-1} and the speech input x_m are conditionally independent. In Equation (5), the word strings w_{m,l} are the multiple word strings obtained as speech recognition results for the speech input x_m.
Given the speech input x_m and the set G_{m-1} of word strings obtained as speech recognition results for the adjacent utterance, the most likely word string ŵ for the input is given by Equation (6) below. Note that this modifies Equation (1).

Here, the conditional probability P(G_{m-1} | w) of the set of word strings G_{m-1}, given that the word string w was obtained from the most recent input speech, is assumed to take the form of Equation (7).

Under this assumption, Equation (6) becomes Equation (8) below.

Here γ_j(w, G_{m-1}) (j = 1, ...) is a feature function representing a linguistic feature determined by the word string w and the set of word strings G_{m-1}; it is expressed using information across multiple utterances made at different times. φ_j is the weight (feature weight) corresponding to γ_j.

FIG. 2 is a diagram for explaining an example of a feature function γ_j for a linguistic feature that holds across multiple utterances made at different times. In the figure, the speech recognition result of the current utterance of interest is the set of sentence hypotheses w_1, w_2, w_3, and the speech recognition result of the most recent utterance is the set of sentence hypotheses u_1, ..., u_4.
One linguistic feature that holds across utterances made at different times is the co-occurrence of words (the words v and z in the figure) between a sentence hypothesis of the most recent utterance and a sentence hypothesis of the current utterance, as in the following example.

(Example) The number of occurrences of the word v when the sentence hypothesis of the preceding utterance contains the word z and the sentence hypothesis of the utterance of interest contains the word v.

A function that obtains a linguistic feature between utterances, as in the above example, is written h_j(·, ·). For example, if the current sentence hypothesis of interest is w_1 in FIG. 2, then w_1 contains one occurrence of the word v, and with respect to the sentence hypothesis u_1, which contains one occurrence of the word z, h_j(w_1, u_1) = 1. If instead the current sentence hypothesis of interest is w_3, then w_3 contains two occurrences of the word v, and with respect to the sentence hypothesis u_4, which contains one occurrence of the word z, h_j(w_3, u_4) = 2.
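The example rule for h_j can be written as a short Python function. The function name and the word strings below are invented for illustration; they only mimic the counts described for FIG. 2, not the actual content of the figure.

```python
from typing import Sequence


def h_cooccurrence(
    current: Sequence[str], previous: Sequence[str], v: str, z: str
) -> int:
    """Number of occurrences of v in the current hypothesis, provided the
    hypothesis of the preceding utterance contains z; otherwise 0."""
    if z not in previous:
        return 0
    return sum(1 for word in current if word == v)


# Invented word strings that only mimic the counts described for FIG. 2.
w1, u1 = ["a", "v", "b"], ["c", "z"]
w3, u4 = ["v", "d", "v"], ["z", "e"]
print(h_cooccurrence(w1, u1, "v", "z"))  # 1
print(h_cooccurrence(w3, u4, "v", "z"))  # 2
```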

Using this function h_j, the feature function γ_j(·, ·) is defined as in Equation (9) below.

In Equation (9), p(w_{m-1,n}) is the posterior probability of the sentence hypothesis w_{m-1,n}, the n-th speech recognition result for the utterance of the immediately preceding input speech. Thus a characteristic of this embodiment is that, even when no correct word string is given for the speech recognition results of past utterances, each sentence hypothesis in the set of recognition results of a past utterance is regarded as corresponding to the correct word string to the degree indicated by its posterior probability, and the feature function values obtained from the individual sentence hypotheses are weighted by those posterior probabilities and summed.
Taking the feature functions of the conventional discriminative language model into account as well, the posterior probability of a sentence hypothesis w obtained from the input speech x_m, when the set of sentence hypotheses G_{m-1} has been obtained as the speech recognition result for the immediately preceding input speech x_{m-1}, is given by Equation (10) below.
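The posterior-weighted sum described for Equation (9) can be sketched as follows. Equation (9) itself is not reproduced in this text, so the form used here, the sum over n of p(w_{m-1,n}) h_j(w, w_{m-1,n}), is an assumption inferred from the description; the names `gamma` and `h_vz` are hypothetical.

```python
from typing import Callable, Sequence

Hypothesis = Sequence[str]


def gamma(
    current: Hypothesis,
    previous_hyps: Sequence[Hypothesis],
    previous_posteriors: Sequence[float],
    h: Callable[[Hypothesis, Hypothesis], float],
) -> float:
    """Posterior-weighted sum of h(current, u) over the hypotheses u of the
    preceding utterance (assumed form of Equation (9))."""
    return sum(p * h(current, u) for u, p in zip(previous_hyps, previous_posteriors))


def h_vz(current: Hypothesis, previous: Hypothesis) -> int:
    """Toy rule: count of 'v' in current if 'z' occurs in previous, else 0."""
    return current.count("v") if "z" in previous else 0
```

Because every hypothesis of the preceding utterance contributes in proportion to its posterior, no reference transcript of that utterance is needed.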

The model parameter Λ in Equation (10) is (λ_1, λ_2, ...), and the model parameter Φ is (φ_1, φ_2, ...). Z(Λ, Φ) in Equation (10) is a normalization constant that makes the result a valid probability, and is given by Equation (11) below. The word strings w' in Equation (11) are the sentence hypotheses of the multiple speech recognition results obtained by speech recognition from the speech input x_m.
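A normalized exponential model of this kind can be sketched as below. Equations (10) and (11) are not reproduced in this text, so the sketch assumes that the exponent is the sum of the acoustic model score, the language model score, Σ_i λ_i f_i(w), and Σ_j φ_j γ_j(w, G_{m-1}), and that Z(Λ, Φ) sums the exponential over the hypotheses of the same utterance. Any scaling factor on the language model score is omitted, and all names are hypothetical.

```python
import math
from typing import List, Sequence


def hypothesis_posteriors(
    am_scores: Sequence[float],
    lm_scores: Sequence[float],
    f_values: Sequence[Sequence[float]],
    gamma_values: Sequence[Sequence[float]],
    lam: Sequence[float],
    phi: Sequence[float],
) -> List[float]:
    """Posterior of each hypothesis of one utterance under the assumed model."""
    scores = [
        am + lm
        + sum(l * f for l, f in zip(lam, fs))
        + sum(p * g for p, g in zip(phi, gs))
        for am, lm, fs, gs in zip(am_scores, lm_scores, f_values, gamma_values)
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)  # plays the role of the normalization constant
    return [e / z for e in exps]
```

Subtracting the maximum score before exponentiating leaves the posteriors unchanged while avoiding overflow or underflow for large log-domain scores.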

Learning of the error correction model by the speech recognition device of this embodiment means estimating, from the learning data, the model parameters Λ and Φ of the error correction model, which appears as the exponent of the exponential function exp on the right-hand side of Equation (10). The error correction model of this embodiment is thus a formula that corrects the speech recognition score using the values of feature functions based on linguistic features that depend on the order of utterances, the values of feature functions based on linguistic features within the same utterance, and the feature weights of these feature functions.

When learning data consisting of M utterances without correct word strings is given, the objective function L(Λ, Φ) for model parameter estimation is defined by Equation (12) below.

Here m indicates the order of the utterance, and N_m is the total number of sentence hypotheses w_{m,1}, w_{m,2}, ... generated by speech recognition for the speech input x_m, which is the m-th learning data item. Χ(w_{m,n}) is given by Equation (13) below.

In Equation (13), R(·, ·) is a function that returns the edit distance between two word strings, and the sentence hypotheses w_{m,k} are all sentence hypotheses obtained from the speech input x_m other than the sentence hypothesis w_{m,n}. The edit distance between two word strings can be computed efficiently by dynamic programming. Because the edit distance equals the number of erroneous words in a speech recognition result relative to the correct word string (counted as substitution, insertion, and deletion errors), the objective function L(Λ, Φ) of Equation (12) represents the expected number of word errors. Estimating the model parameters Λ and Φ so as to minimize L(Λ, Φ) therefore yields an error correction model that minimizes the expected number of word errors, so an improvement in speech recognition performance can be expected. The reason is that, when Λ and Φ are estimated in this way, the recognition errors expected for the correct-answer candidate word strings are minimized, and the same minimization of recognition errors is carried out by Λ and Φ when recognizing unknown input speech that differs from the learning data.
In other words, the objective function of Equation (12) is used as an evaluation function that computes a value indicating whether the model parameters Λ and Φ are appropriate, that is, whether the recognition errors expected for the correct-answer candidate word strings are minimized.
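The dynamic-programming computation of the edit distance R(·, ·) mentioned above can be sketched as follows. This is the standard word-level Levenshtein distance with unit costs; the text does not state the costs used, so unit cost for substitution, insertion, and deletion is an assumption, and the function name is hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between two word strings."""
    # prev[j] holds the distance between the first i-1 words of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(
                min(
                    prev[j] + 1,             # delete r
                    cur[j - 1] + 1,          # insert h
                    prev[j - 1] + (r != h),  # substitute, or match at no cost
                )
            )
        prev = cur
    return prev[-1]
```

Keeping only two rows of the table gives memory proportional to the length of one word string, while the running time stays proportional to the product of the two lengths.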

P(w_{m,k} | x_m) in Equation (13) is calculated as in Equation (14) below.

Based on Equation (10), ĝ in Equation (14) is given by Equation (15) below.
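The expected-error objective can be illustrated for a single utterance as follows. Equations (12) and (13) are not reproduced in this text, so the sketch assumes Χ(w_{m,n}) = Σ_{k≠n} R(w_{m,n}, w_{m,k}) P(w_{m,k} | x_m) and a per-utterance contribution Σ_n Χ(w_{m,n}) P(w_{m,n} | x_m, G_{m-1}) to L(Λ, Φ). The function names, and the use of a precomputed distance matrix, are illustrative choices.

```python
from typing import List, Sequence


def chi(R: Sequence[Sequence[float]], p: Sequence[float]) -> List[float]:
    """Assumed Equation (13): expected edit distance of each hypothesis n to
    the other hypotheses k, weighted by the posteriors p[k] = P(w_k | x)."""
    n_hyp = len(p)
    return [
        sum(R[n][k] * p[k] for k in range(n_hyp) if k != n)
        for n in range(n_hyp)
    ]


def utterance_objective(
    R: Sequence[Sequence[float]], p: Sequence[float], q: Sequence[float]
) -> float:
    """Assumed per-utterance term of Equation (12), where q[n] is the
    posterior P(w_n | x, G) under the error correction model."""
    return sum(c * qn for c, qn in zip(chi(R, p), q))
```

Summing `utterance_objective` over all M utterances would give the full objective under these assumptions.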

To estimate the model parameters, the gradients ΔΛ and ΔΦ of the objective function with respect to the model parameters Λ and Φ are obtained as in Equations (16) and (17) below.

The gradient ΔΛ is (∂L(Λ, Φ)/∂λ_1, ∂L(Λ, Φ)/∂λ_2, ∂L(Λ, Φ)/∂λ_3, ...), and the gradient ΔΦ is (∂L(Λ, Φ)/∂φ_1, ∂L(Λ, Φ)/∂φ_2, ∂L(Λ, Φ)/∂φ_3, ...). Χ is treated as a constant with respect to the model parameters Λ and Φ.

Suppose the model parameters are learned by iterative updating, and the parameters Λ^{t-1} and Φ^{t-1} have been obtained after the (t-1)-th iteration. Then the parameters Λ^t and Φ^t of the t-th iteration are obtained by the update formulas in Equations (18) and (19) below.

Here η_Λ and η_Φ are the coefficients applied to the gradients ΔΛ and ΔΦ obtained by Equations (16) and (17), respectively.
When multiple sets of speech recognition results for adjacent utterances (G_{m-1}, G_{m-2}, ...) are given, the model parameters Λ and Φ can be learned by the same procedure.
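One update step of the kind described by Equations (18) and (19) can be sketched as below. The equations are not reproduced in this text; since the objective is minimized, the sketch assumes an ordinary gradient-descent step, that is, new parameter = old parameter minus coefficient times gradient. The function name is hypothetical.

```python
from typing import List, Sequence


def gradient_step(
    params: Sequence[float], grad: Sequence[float], eta: float
) -> List[float]:
    """Assumed update: move each parameter against its gradient component."""
    return [p - eta * g for p, g in zip(params, grad)]


# Lambda and Phi are updated separately, each with its own coefficient.
lam = gradient_step([0.0, 1.0], [2.0, -4.0], eta=0.5)
phi = gradient_step([0.0], [1.0], eta=0.25)
print(lam, phi)  # [-1.0, 3.0] [-0.25]
```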

[3. Configuration of the speech recognition device]
FIG. 3 is a functional block diagram showing the configuration of the speech recognition device 1 according to an embodiment of the present invention; only the functional blocks relevant to the invention are shown.
The speech recognition device 1 is implemented by a computer and, as shown in the figure, comprises a speech recognition unit 11, a feature extraction unit 12, an error tendency learning unit 13, a speech recognition unit 14, a spoken language resource storage unit 21, an acoustic model storage unit 22, a language model storage unit 23, and an error correction model storage unit 24.

The spoken language resource storage unit 21 stores learning data. The acoustic model storage unit 22 stores an acoustic model. The language model storage unit 23 stores a language model. The error correction model storage unit 24 stores an error correction model.

The speech recognition unit 11 performs speech recognition on the speech data D1 of utterances in order to generate learning data. The speech data D1 represents feature values obtained by short-time spectral analysis of the speech waveform of an utterance. In this embodiment, the speech data is obtained from a broadcast signal. The speech recognition unit 11 associates the speech data D1 with the speech recognition result data D2 obtained by recognizing it, and writes them to the spoken language resource storage unit 21 as learning data. In doing so, the speech recognition unit 11 also retains in the spoken language resource storage unit 21 the order of the utterances at the time of recognition.

The feature extraction unit 12 extracts, from the speech recognition result data D2 arranged in the utterance order indicated by the learning data, linguistic features within the same utterance and linguistic features that depend on the order of utterances. The feature extraction unit 12 outputs feature function data D3 indicating the feature functions f_i and γ_j whose rules are the linguistic features obtained.

The error tendency learning unit 13 takes as input the feature function data D3 output by the feature extraction unit 12 and the learning data stored in the spoken language resource storage unit 21, and learns the model parameters Λ and Φ of the error correction model by statistical means. The error tendency learning unit 13 writes the error correction model using the learned model parameters Λ and Φ to the error correction model storage unit 24.

The speech recognition unit 14 refers to the acoustic model stored in the acoustic model storage unit 22 and the language model stored in the language model storage unit 23, performs speech recognition on the input speech data D4 using the error correction model stored in the error correction model storage unit 24, and outputs speech recognition result data D5 indicating the result. The input speech data D4 represents feature values obtained by short-time spectral analysis of the speech waveform of an utterance.

[4. Processing procedure of the speech recognition device]
FIG. 4 is a diagram showing the overall processing flow of the speech recognition device 1 according to this embodiment. The processing of each step shown in the figure is described below.

[4.1 Step S1]
First, the speech recognition unit 11 obtains the speech data D1 of a program from a broadcast signal and performs speech recognition on it. The speech recognition unit 11 stores, in the spoken language resource storage unit 21, learning data that associates the speech data D1 of each utterance with the speech recognition result data D2 indicating its recognition result. In doing so, the speech recognition unit 11 preserves the order of the utterances at the time of recognition. The m-th (m = 1, 2, ...) learning data item contains the speech input x_m, which is the m-th speech data D1, and the sentence hypotheses w_{m,n} (n = 1, 2, ...) that are indicated by the speech recognition result data D2 and were obtained by recognizing x_m.

[4.2 Step S2]
The error tendency learning unit 13 extracts, from the learning data stored in the spoken language resource storage unit 21, feature functions based on the linguistic features used for error tendency learning.

First, the error tendency learning unit 13 extracts, from the speech recognition result data D2 included in the learning data, all feature functions based on the linguistic features within the same utterance described earlier, such as consecutive words, two or more non-consecutive words, and syntactic or semantic information about words. Examples are the function that returns the number of consecutive word pairs (u, v) and the function that returns the number of non-consecutive word pairs (u, v).
Next, the error tendency learning unit 13 refers to all combinations of a correct-answer candidate sentence hypothesis w_{m,n} indicated by the speech recognition result data D2 and a sentence hypothesis w_{m-1,n} of the preceding utterance, and extracts all feature functions based on linguistic features that depend on the order of utterances, as in Equation (9).
The error tendency learning unit 13 counts how often each extracted feature function occurs. Among the feature functions based on linguistic features within the same utterance and those based on linguistic features that depend on the order of utterances, it selects the ones whose counted frequency is at or above a predetermined threshold as the feature functions f_i and γ_j, respectively, to be used for learning the model parameters in the error tendency learning process of Step S3. The error tendency learning unit 13 then outputs the feature function data D3 in which the selected feature functions f_i and γ_j are set, for use in Step S3.
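The frequency-based selection in Step S2 can be sketched as follows for one feature type, consecutive word pairs within an utterance. The function name and the threshold value in the usage are hypothetical; the text only says that the threshold is predetermined.

```python
from collections import Counter
from typing import Sequence, Set, Tuple


def select_pair_features(
    hypotheses: Sequence[Sequence[str]], min_count: int
) -> Set[Tuple[str, str]]:
    """Keep the consecutive word pairs that occur at least min_count times
    across all sentence hypotheses in the learning data."""
    counts: Counter = Counter()
    for words in hypotheses:
        counts.update(zip(words, words[1:]))
    return {pair for pair, count in counts.items() if count >= min_count}


hyps = [["a", "b", "c"], ["a", "b"], ["b", "c", "a", "b"]]
print(sorted(select_pair_features(hyps, min_count=2)))  # [('a', 'b'), ('b', 'c')]
```

Discarding rare features keeps the number of model parameters manageable; the same counting scheme would apply to the cross-utterance features.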

[4.3 Step S3]
Next, the error tendency learning unit 13 learns the model parameters Λ and Φ of the error correction model.
FIG. 5 is a diagram showing the processing flow of the error correction model update process executed by the error tendency learning unit 13 in Step S3.

(Step S31: Model parameter initialization process)
The error tendency learning unit 13 sets suitable initial values for the model parameters Λ and Φ. In this embodiment, the initial values are Λ = Φ = 0, that is, all parameters are set to zero.

(Step S32: Edit distance calculation process)
Computing the objective function of Equation (12) requires the edit distances between speech recognition results obtained from the same utterance. The error tendency learning unit 13 therefore reads the learning data stored in the spoken language resource storage unit 21, obtains from the speech recognition result data D2 of the learning data a sentence hypothesis w_{m,n} and the other sentence hypotheses w_{m,k} obtained by recognizing the same input speech x_m, and calculates the edit distance R(w_{m,n}, w_{m,k}). Note that these edit distances are treated as constants during learning of the error correction model.

(Step S33: Feature function update process)
The error tendency learning unit 13 updates the values of the feature functions γ_j(·, ·) based on linguistic features between utterances. This is necessary because, as is clear from Equation (9), γ_j depends on the posterior probability p(w_{m-1,n}) of the sentence hypothesis w_{m-1,n} of the immediately preceding utterance. The posterior probability p(w_{m-1,n}) is calculated by Equation (10) using the current values of the model parameters Λ and Φ. That is, the error tendency learning unit 13 evaluates Equation (10) with x_m, w, and G_{m-1} replaced, respectively, by the speech input x_{m-1} of the immediately preceding utterance indicated by the learning data, the sentence hypothesis w_{m-1,n} obtained from x_{m-1}, and the set G_{m-2} of word strings obtained as speech recognition results from the speech input x_{m-2}. In doing so, the error tendency learning unit 13 obtains the acoustic model score f_am(x_{m-1} | w_{m-1,n}) of the sentence hypothesis w_{m-1,n} using the acoustic model stored in the acoustic model storage unit 22 and the input speech x_{m-1}, which is the speech data indicated by the (m-1)-th learning data item. It obtains the language model score f_lm(w_{m-1,n}) of the sentence hypothesis w_{m-1,n} using the language model stored in the language model storage unit 23. The error tendency learning unit 13 obtains the feature functions γ_j from the feature function data D3.

(Step S34: Objective function calculation process)
The error tendency learning unit 13 calculates the value of the objective function L(Λ, Φ) according to Equation (12). It calculates Χ(w_{m,n}) in Equation (12) by Equation (13), using the edit distances R(w_{m,n}, w_{m,k}) obtained in Step S32 and the posterior probabilities P(w_{m,k} | x_m) calculated by Equation (14). The value ĝ(w_{m,k} | x_m) used in Equation (14) is calculated by Equation (15) from the acoustic model score f_am(x_m | w_{m,k}), the language model score f_lm(w_{m,k}), and the current model parameters Λ and Φ. The conditional probability P(w_{m,n} | x_m, G_{m-1}) in Equation (12) is calculated by Equation (10) from the acoustic model score f_am(x_m | w_{m,n}), the language model score f_lm(w_{m,n}), and the current model parameters Λ and Φ.
Writing w for any of the sentence hypotheses w_{m,n} and w_{m,k}, the error tendency learning unit 13 obtains the acoustic model score f_am(x_m | w) using the acoustic model stored in the acoustic model storage unit 22 and the input speech x_m, which is the speech data indicated by the m-th learning data item. It obtains the language model score f_lm(w) of the sentence hypothesis w using the language model stored in the language model storage unit 23. It further calculates the values of the feature functions f_i(w) from the sentence hypothesis w, and uses the values calculated in Step S33 for the feature functions γ_j(w, G_{m-1}). The error tendency learning unit 13 obtains the feature functions f_i from the feature function data D3.

(Step S35: Gradient calculation process)
Using the current values of the model parameters Λ and Φ, the error tendency learning unit 13 obtains the gradients ΔΛ and ΔΦ of Equation (12) with respect to Λ and Φ by Equations (16) and (17). For Χ(w_{m,n}) and the posterior probabilities P(w_{m,n} | x_m, G_{m-1}) and P(w_{m,n'} | x_m, G_{m-1}) in Equations (16) and (17), it uses the values obtained when the objective function L(Λ, Φ) was calculated in Step S34. For the feature functions γ_j(w_{m,n}, G_{m-1}) and γ_j(w_{m,n'}, G_{m-1}) in Equation (16), it uses the values calculated in Step S33. For the feature functions f_i(w_{m,n}) and f_i(w_{m,n'}) in Equation (17), it uses the values calculated in Step S34.

Using the gradients ΔΛ and ΔΦ thus obtained, the error tendency learning unit 13 updates the model parameters Λ and Φ by Equations (18) and (19). Predetermined values are used for the coefficients η_Λ and η_Φ in Equations (18) and (19).

(Step S36: Termination determination process)
The error tendency learning unit 13 compares the value of the objective function obtained after the gradient calculation process of Step S35 with the value of the objective function before the update. If the change is equal to or greater than a predetermined amount, it repeats the processing from Step S33. If the change is smaller than the predetermined amount, it regards the updates as having converged, stops updating the model parameters Λ and Φ, and executes Step S37.
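The iteration of Steps S33 to S36 has the shape of a gradient-descent loop that stops when the objective changes by less than a tolerance. The sketch below shows only that control flow: the objective and gradient are passed in as callables, a toy quadratic stands in for Equation (12), and the `max_iter` safeguard is an addition not mentioned in the text. All names are hypothetical.

```python
from typing import Callable, List, Sequence


def train(
    objective: Callable[[Sequence[float]], float],
    gradient: Callable[[Sequence[float]], Sequence[float]],
    params: Sequence[float],
    eta: float,
    tol: float,
    max_iter: int = 1000,
) -> List[float]:
    """Repeat gradient updates until the objective changes by less than tol."""
    current = list(params)
    previous_value = objective(current)
    for _ in range(max_iter):
        grad = gradient(current)                                   # cf. Step S35
        current = [p - eta * g for p, g in zip(current, grad)]     # update
        value = objective(current)                                 # cf. Step S34
        if abs(previous_value - value) < tol:                      # cf. Step S36
            break
        previous_value = value
    return current


# Toy stand-in for the objective: minimum at 3.0, zero initialization as in S31.
result = train(
    objective=lambda p: sum((x - 3.0) ** 2 for x in p),
    gradient=lambda p: [2.0 * (x - 3.0) for x in p],
    params=[0.0],
    eta=0.1,
    tol=1e-12,
)
```

In the procedure described here, the feature values γ_j would also be refreshed inside the loop, since they depend on the current parameters.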

(Step S37: Error correction model output process)
The error tendency learning unit 13 writes the error correction model using the model parameters Λ = (λ_1, λ_2, ...) and Φ = (φ_1, φ_2, ...) at convergence to the error correction model storage unit 24.

[4.4 Step S4]
When input speech data D4 is input as the speech data to be recognized, the speech recognition unit 14 obtains correct-answer candidate word strings for the input speech data D4 and calculates their scores, using the error correction model stored in the error correction model storage unit 24, the acoustic model stored in the acoustic model storage unit 22, and the language model stored in the language model storage unit 23. The speech recognition unit 14 outputs, in real time, speech recognition result data D5 in which the candidate word string with the best score is set as the correct word string. By using the error correction model, the speech recognition unit 14 corrects errors in the selection of the speech recognition result obtained from the input speech data D4.
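Selecting the best-scoring candidate in Step S4 can be sketched as a rescoring of an N-best list. The sketch assumes that each candidate already has a combined acoustic and language model score and a correction term (the weighted feature sums of the error correction model), and that a higher total is better; the function name and the numbers are invented for illustration.

```python
from typing import List, Sequence


def rescore(
    candidates: Sequence[Sequence[str]],
    base_scores: Sequence[float],
    corrections: Sequence[float],
) -> List[str]:
    """Return the candidate whose base score plus correction term is highest."""
    best = max(
        range(len(candidates)),
        key=lambda n: base_scores[n] + corrections[n],
    )
    return list(candidates[best])


nbest = [["recognize", "speech"], ["wreck", "a", "nice", "beach"]]
print(rescore(nbest, [-10.5, -10.0], [0.2, -1.0]))  # ['recognize', 'speech']
```

In this toy case the second candidate has the better base score, and the correction term changes which candidate is selected.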

[5. Effects]
According to this embodiment, the speech recognition device 1 can construct an error correction model that reflects the content of the immediately preceding utterance without using correct word strings, and recognition errors are reduced compared with conventional speech recognition.

[6. Others]
The speech recognition device 1 described above contains a computer system. The operating procedure of the speech recognition device 1 is stored in the form of a program on a computer-readable recording medium, and the processing described above is carried out when the computer system reads and executes this program. The computer system here includes a CPU, various memories, an OS, and hardware such as peripheral devices.

The "computer system" also includes a web page providing environment (or display environment) when a WWW system is used.
The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. The "computer-readable recording medium" further includes media that hold a program dynamically for a short time, such as a communication line used when a program is transmitted over a network such as the Internet or over a communication line such as a telephone line, as well as media that hold a program for a certain period, such as volatile memory inside a computer system serving as the server or client in that case. The program may implement only part of the functions described above, or may implement those functions in combination with a program already recorded in the computer system.

DESCRIPTION OF SYMBOLS
1 Speech recognition device
11 Speech recognition unit
12 Feature extraction unit
13 Error tendency learning unit
14 Speech recognition unit
21 Spoken language resource storage unit
22 Acoustic model storage unit
23 Language model storage unit
24 Error correction model storage unit

Claims

A speech recognition device comprising:
a spoken language resource storage unit that stores speech recognition results, obtained by performing speech recognition on speech data of utterances, while retaining the order of the utterances; and
an error tendency learning unit that acquires linguistic features according to the order of the utterances from words included in a speech recognition result and words included in the speech recognition result of a past utterance preceding that speech recognition result, weights the acquired linguistic features according to the posterior probability of the speech recognition result of the past utterance, statistically learns a tendency of word recognition errors based on the weighted linguistic features, and generates an error correction model for correcting the learned tendency of recognition errors.

The speech recognition device according to claim 1, wherein the error tendency learning unit statistically learns the tendency of word recognition errors based on the weighted linguistic features according to the order of the utterances and on linguistic features within the same utterance obtained from the speech recognition result, and generates an error correction model for correcting the learned tendency of recognition errors.

The speech recognition device according to claim 2, wherein
the linguistic feature according to the order of the utterances is a co-occurrence relationship between a word included in the speech recognition result and a word included in the speech recognition result of the past utterance, and
the linguistic feature within the same utterance is one or more of: a co-occurrence relationship of a plurality of consecutive words within the same utterance obtained from the speech recognition result, a co-occurrence relationship of a plurality of non-consecutive words, syntactic information of words, and semantic information of words.

The speech recognition device according to claim 2 or claim 3, wherein
the error correction model is a calculation formula that corrects a speech recognition score using a first feature function based on the linguistic features according to the order of the utterances, weighted by the posterior probability of the speech recognition result of the past utterance, a second feature function based on the linguistic features within the same utterance, and feature weights of each of the first feature function and the second feature function, and
the error tendency learning unit statistically calculates the feature weights based on an evaluation value computed by an evaluation function, and generates the error correction model using the calculated feature weights, the evaluation function being defined using the value of the first feature function obtained from the speech recognition result and the speech recognition result of the past utterance, the value of the second feature function obtained from the speech recognition result, and a value obtained by weighting, by the posterior probability of the speech recognition results, the number of word errors obtained by comparing a plurality of the speech recognition results obtained from the same speech data.

The speech recognition device according to claim 1, further comprising a speech recognition unit that performs speech recognition on input speech data and, using the error correction model generated by the error tendency learning unit, corrects errors in the selection of the speech recognition result obtained from the input speech data.

An error correction model learning method comprising:
a spoken language resource storage process of storing speech recognition results, obtained by performing speech recognition on speech data of utterances, while retaining the order of the utterances; and
an error tendency learning process of acquiring linguistic features according to the order of the utterances from words included in a speech recognition result and words included in the speech recognition result of a past utterance preceding that speech recognition result, weighting the acquired linguistic features according to the posterior probability of the speech recognition result of the past utterance, statistically learning a tendency of word recognition errors based on the weighted linguistic features, and generating an error correction model for correcting the learned tendency of recognition errors.

A program for causing a computer to function as a speech recognition device comprising:
spoken language resource storage means for storing speech recognition results, obtained by performing speech recognition on speech data of utterances, while retaining the order of the utterances; and
error tendency learning means for acquiring linguistic features according to the order of the utterances from words included in a speech recognition result and words included in the speech recognition result of a past utterance preceding that speech recognition result, weighting the acquired linguistic features according to the posterior probability of the speech recognition result of the past utterance, statistically learning a tendency of word recognition errors based on the weighted linguistic features, and generating an error correction model for correcting the learned tendency of recognition errors.