JP2014077865A

JP2014077865A - Speech recognition device, error correction model learning method and program

Info

Publication number: JP2014077865A
Application number: JP2012224985A
Authority: JP
Inventors: Akio Kobayashi; 彰夫小林
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2012-10-10
Filing date: 2012-10-10
Publication date: 2014-05-01
Anticipated expiration: 2032-10-10
Also published as: JP6047364B2

Abstract

PROBLEM TO BE SOLVED: To optimize an error correction model using information drawn from the content of recent utterances that precede the utterance being recognized. SOLUTION: A spoken language resource storage unit 21 stores, in utterance order, the speech recognition results that a speech recognition unit 11 obtains from the speech data of each utterance, together with the correct word string of each utterance. A model learning unit 13 statistically learns the tendency of word recognition errors from two kinds of linguistic features: features that depend on utterance order, obtained from the words in a speech recognition result and the words in a past utterance temporally close to that result, and features within a single utterance, obtained from the speech recognition result itself. It then generates an error correction model for correcting the learned error tendency. A speech recognition unit 14 recognizes the speech in input speech data and uses the generated error correction model to correct errors in the selection of the speech recognition result.

Description

The present invention relates to a speech recognition device, an error correction model learning method, and a program.

For error correction in speech recognition, there is a technique that statistically learns the error tendencies of speech recognition from speech and its transcription (the correct sentences) using linguistic features, and improves recognition performance with the resulting statistical error correction model (see, for example, Non-Patent Document 1).

Kobayashi et al., "News speech recognition by discriminative scoring based on word error minimization," IEICE Journal, vol. J93-D, no. 5, 2010, pp. 598-609

In speech recognition for broadcast programs and the like, a series of consecutive utterances is recognized one after another, and the content of the utterance currently being processed is often related to the content of the immediately preceding utterance, whose recognition has already finished. In a cooking program, for example, an utterance introducing the ingredients can be expected to be followed by utterances about how to cook them. Words about ingredients and words about cooking methods are therefore likely to co-occur in adjacent utterances. For example, if the utterance "Pound the pork fillet" is followed by "Next, season it with salt and pepper," then "pork fillet" and "salt and pepper" are related and tend to co-occur.
However, although conventional model parameter learning for an error correction model uses speech data, its speech recognition results, and the correct word strings, it does not take the order of the utterances in the speech data into account. Because the conventional error correction model ignores information such as word co-occurrence between utterances, which depends on utterance order, it is not an optimal model for correctly predicting utterance content.

The present invention has been made in view of these circumstances, and provides a speech recognition device, an error correction model learning method, and a program that optimize an error correction model using information drawn from the content of utterances earlier than the utterance being recognized.

[1] One aspect of the present invention is a speech recognition device comprising: a spoken language resource storage unit that stores, while preserving utterance order, speech recognition results obtained by recognizing the speech data of utterances and the correct word strings of those utterances; and a model learning unit that statistically learns the tendency of word recognition errors based on linguistic features that depend on utterance order, obtained from the words in the speech recognition result of given speech data and the words in the correct word string of an utterance earlier than the utterance of that speech data, and that generates an error correction model for correcting the learned error tendency.
According to this aspect, the speech recognition device recognizes speech data and extracts linguistic features that depend on utterance order from the words in the resulting speech recognition result and the words in the correct word string of an earlier utterance. The correct word string of the earlier utterance may be, for example, that of the most recent utterance temporally adjacent to the speech recognition result. The device statistically learns the tendency of word recognition errors from the extracted features and generates an error correction model for correcting the learned tendency.
This makes it possible to generate an error correction model well suited to correctly predicting utterance content, using information drawn from utterances that precede the utterance being recognized.

[2] One aspect of the present invention is the speech recognition device described above, wherein the model learning unit statistically learns the tendency of word recognition errors based on both linguistic features within the same utterance, obtained from the speech recognition result, and the linguistic features that depend on utterance order, and generates an error correction model for correcting the learned error tendency.
According to this aspect, the speech recognition device extracts the order-dependent linguistic features from the speech recognition result and the correct word string, and also extracts linguistic features within the same utterance from the speech recognition result. It statistically learns the tendency of word recognition errors from these features and generates an error correction model for correcting the learned tendency.
The device can thus generate an error correction model that corrects recognition errors accurately, using linguistic features within the same utterance in addition to information drawn from earlier utterances.

[3] One aspect of the present invention is the speech recognition device described above, wherein the model learning unit statistically learns word error tendencies based on (a) one or more of the following, obtained from the speech recognition result within the same utterance: co-occurrence of multiple consecutive words, co-occurrence of multiple non-consecutive words, syntactic information about words, and semantic information about words; and (b) co-occurrence between words in the speech recognition result and words in the correct word string of the earlier utterance.
According to this aspect, the speech recognition device statistically learns word error tendencies based on the word co-occurrence and the syntactic and semantic information within the same utterance, obtained from the speech recognition result, and on the word co-occurrence obtained from the words in the speech recognition result and the correct word string of the earlier utterance, and generates an error correction model for correcting the learned error tendency.
The device can thus generate an error correction model that corrects recognition errors accurately.

[4] One aspect of the present invention is the speech recognition device described above, wherein the error correction model is a formula that corrects speech recognition scores using feature functions based on the linguistic features and their weights, and the model learning unit statistically calculates the weights based on an evaluation value computed by an evaluation function, which is defined using the feature function values obtained from the speech recognition results and the correct word strings and the recognition errors of the words in the speech recognition results, and generates the error correction model using the calculated weights.
According to this aspect, the error correction model is defined by feature functions representing linguistic features and their weights. The evaluation function is defined using the feature function values obtained from the speech recognition results and the recognition errors found by comparing the speech recognition results with the correct word strings. The speech recognition device determines the weights so that the evaluation value computed by this function indicates the fewest recognition errors, and then generates the error correction model.
The device can thus learn recognition error tendencies efficiently and generate an error correction model.

[5] One aspect of the present invention is the speech recognition device described above, further comprising a speech recognition unit that recognizes input speech data and, using the error correction model generated by the model learning unit, corrects errors in the selection of the speech recognition result obtained from the input speech data.
According to this aspect, the speech recognition device uses the error correction model to select the speech recognition result from among the candidates obtained by recognizing the speech data.
The device can thus obtain speech recognition results with a high recognition rate.

[6] One aspect of the present invention is an error correction model learning method comprising: a spoken language resource storing step of storing, in a spoken language resource storage unit while preserving utterance order, speech recognition results obtained by recognizing the speech data of utterances and the correct word strings of those utterances; and a model learning step of statistically learning the tendency of word recognition errors based on linguistic features that depend on utterance order, obtained from the words in the speech recognition result of given speech data and the words in the correct word string of an utterance earlier than the utterance of that speech data, and generating an error correction model for correcting the learned error tendency.

[7] One aspect of the present invention is a program for causing a computer to function as a speech recognition device comprising: spoken language resource storage means for storing, while preserving utterance order, speech recognition results obtained by recognizing the speech data of utterances and the correct word strings of those utterances; and model learning means for statistically learning the tendency of word recognition errors based on linguistic features that depend on utterance order, obtained from the words in the speech recognition result of given speech data and the words in the correct word string of an utterance earlier than the utterance of that speech data, and generating an error correction model for correcting the learned error tendency.

According to the present invention, an error correction model can be optimized using information drawn from the content of utterances earlier than the utterance being recognized.

FIG. 1 is a diagram showing the method of learning an error correction model in a speech recognition device according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of a feature function that depends on utterance order according to the embodiment.
FIG. 3 is a functional block diagram showing the configuration of the speech recognition device according to the embodiment.
FIG. 4 is a diagram showing the overall processing flow of the speech recognition device according to the embodiment.
FIG. 5 is a diagram showing the model learning processing flow of the speech recognition device according to the embodiment.
FIG. 6 is a diagram showing the method of learning an error correction model according to the conventional method.

Embodiments of the present invention will now be described in detail with reference to the drawings.

[1. Overview of this embodiment]
Error correction models that reflect the error tendencies of speech recognition have already been devised, but they do not use, for utterances spoken in succession, information based on the relationship with the content of adjacent utterances. Consecutive utterances often contain words related to the words used in the immediately preceding utterance. Using such word connections between neighboring utterances in the error correction model can therefore be expected to improve speech recognition.

The speech recognition device of this embodiment therefore uses linguistic features contained in the most recent utterance content to learn an error correction model that adapts recognition performance to the utterance content, and applies that model to speech recognition. The error correction model optimized by the most recent utterance content is thus used to improve speech recognition performance.

[2. Error correction model learning algorithm]
Next, the learning algorithm for the error correction model applied to the speech recognition device according to an embodiment of the present invention is described.
As noted above, to solve the problem of the conventional approach, the speech recognition device of this embodiment introduces utterance order into the speech data used for learning and incorporates the relationship between adjacent utterances into the error correction model. The difference between this embodiment and the conventional method lies in how the data is handled when the error correction model is learned.

FIG. 6 is a diagram showing the method of learning an error correction model according to the conventional method. As the figure shows, in the conventional method the training data, which consists of multiple utterances, does not preserve the order of those utterances, and word error tendencies have been learned from the data as a single batch.

FIG. 1 is a diagram showing the method of learning an error correction model according to this embodiment. As the figure shows, this embodiment takes the order of the utterances in the training data into account: the relationship between temporally adjacent utterances is extracted as linguistic features and used to learn the error correction model. This yields an error correction model that reflects the relationship between adjacent utterances, making it possible to improve recognition performance over the conventional method.

[2.1 Error correction model of the conventional method]
According to Bayes' theorem, given a speech input x, the most likely word string ŵ (w with a hat) for that input can be obtained by the following equation (1).
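The formula for equation (1) does not appear in this text. Read from the description, it presumably takes the standard Bayes decision form, with P(x) dropped because it does not depend on w. The following is a reconstruction under that assumption, not the original figure:

```latex
\hat{w} = \operatorname*{arg\,max}_{w} P(w \mid x)
        = \operatorname*{arg\,max}_{w} \frac{P(x \mid w)\, P(w)}{P(x)}
        = \operatorname*{arg\,max}_{w} P(x \mid w)\, P(w)
\tag{1}
```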

The speech input x and the word string w correspond, for example, to one utterance, and P(w|x) is the posterior probability of obtaining the word string (sentence hypothesis) w when the speech input x occurs.
P(x|w) is the likelihood expressing acoustic plausibility with respect to the word string w; its score (the acoustic score) is computed from a statistical acoustic model (hereinafter "acoustic model") typified by the hidden Markov model (HMM) and the Gaussian mixture model (GMM). In other words, the acoustic score expresses how plausible each of several candidate words is, given the acoustic features.

P(w), on the other hand, is the linguistic generation probability of the word string w; its score (the language score) is computed from a statistical language model (hereinafter "language model") such as a word n-gram model. In other words, the language score expresses how plausible each of several candidate word strings is, given the word string before the word being recognized, after it, or both. A word n-gram model gives the probability of the next word from a history of N-1 words, based on statistics of N-word sequences (N is, for example, 1, 2, or 3).
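As an illustration of how an n-gram model assigns the probability of the next word from its history, here is a minimal sketch for N = 2 using unsmoothed maximum-likelihood counts. The function names, the `<s>` start symbol, and the toy corpus are assumptions made for this example; a practical recognizer would use smoothing and a far larger corpus.

```python
from collections import Counter

def train_bigram(sentences):
    """Count one-word histories and bigrams over tokenized sentences."""
    history_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        padded = ["<s>"] + list(words)  # "<s>" marks the sentence start (assumed symbol)
        for prev, cur in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return history_counts, bigram_counts

def bigram_prob(prev, cur, history_counts, bigram_counts):
    """Maximum-likelihood estimate of P(cur | prev); 0.0 for an unseen history."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / history_counts[prev]

# Toy corpus (hypothetical): "the" occurs three times as a history, twice before "pork".
corpus = [["pound", "the", "pork"], ["season", "the", "pork"], ["season", "the", "fish"]]
hist, bi = train_bigram(corpus)
p = bigram_prob("the", "pork", hist, bi)
print(p)
```

The language score of a whole word string would then be the product (or the sum of logarithms) of these conditional probabilities over its words.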

In the following description, an HMM-GMM is used as the statistical acoustic model and an n-gram as the statistical language model.

When P(x|w)P(w) in equation (1) is at its maximum, so is its logarithm. Speech recognition therefore defines, based on Bayes' theorem in equation (1), the evaluation function q(w|x) of a word string w, which is a sentence hypothesis (candidate) for the speech input x, as in the following equation (2). Here κ is the weight of the language score P(w) relative to the acoustic score P(x|w).
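The formula for equation (2) does not appear in this text. Taking the logarithm of P(x|w)P(w) and applying the weight κ to the language score, as the paragraph describes, suggests the following form. This is a reconstruction under that assumption:

```latex
q(w \mid x) = \log P(x \mid w) + \kappa \log P(w)
\tag{2}
```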

Then, as shown in the following equation (3), the word string ŵ that maximizes the evaluation function q(w|x) of equation (2) is selected from the set of candidate word strings w for the speech input x as the speech recognition result.
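The selection step can be sketched as follows. The sketch assumes that q(w|x) is the log acoustic score plus κ times the log language score, which is how the text describes it; the function names, candidate strings, and probability values are hypothetical.

```python
import math

def q_score(acoustic_prob, language_prob, kappa):
    """Assumed form of the evaluation function: log P(x|w) + kappa * log P(w)."""
    return math.log(acoustic_prob) + kappa * math.log(language_prob)

def select_best(candidates, kappa):
    """candidates: list of (word_string, P(x|w), P(w)); return the arg-max word string."""
    return max(candidates, key=lambda c: q_score(c[1], c[2], kappa))[0]

# Hypothetical candidates: the second is acoustically slightly better
# but linguistically much less likely.
candidates = [
    ("pound the pork fillet", 1e-5, 1e-3),
    ("pound the fork fillet", 2e-5, 1e-5),
]
best = select_best(candidates, kappa=1.0)
print(best)
```

With κ = 1.0 the language score outweighs the small acoustic advantage of the second candidate; with κ = 0.0 the choice would rest on the acoustic score alone.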

In the error correction model of the conventional method, equation (1) is modified as in the following equation (4).
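The formula for equation (4) does not appear in this text. Since the description of equation (4) names an exponential term built from feature functions f_i(w) and weights λ_i as a multiplicative penalty or reward, a consistent reconstruction (an assumption, not the original figure) is:

```latex
\hat{w} = \operatorname*{arg\,max}_{w} P(x \mid w)\, P(w)\, \exp \sum_{i} \lambda_i f_i(w)
\tag{4}
```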

The term exp Σ_i λ_i f_i(w) in equation (4) is a score reflecting the error tendency of the word string w, and acts as a penalty or reward for w. f_i(w) (i = 1, ...) is the i-th feature function and λ_i is the weight (feature weight) of f_i(w). A feature function is defined so that, for a given word string (here w), it returns the number of times a linguistic rule holds, and 0 if the rule does not hold. The rules are linguistic features such as, within a single utterance, consecutive words, co-occurrence of two or more non-consecutive words, and syntactic or semantic information about words. Concrete examples of rules for the feature functions f_i in the conventional method are given below.

For example, feature functions based on word co-occurrence include (1) and (2) below.

(1) A function that returns the number of occurrences of the consecutive word pair (u, v) in the word string w
(2) A function that returns the number of occurrences of the non-consecutive word pair (u, v) in the word string w

Feature functions based on syntactic information, obtained after replacing each word of the word string w with its part-of-speech category (syntactic information) such as noun or verb, include (3) and (4) below. Here c(·) is a function that maps a word to its part of speech.

(3) A function that returns the number of occurrences of the consecutive part-of-speech pair (c(u), c(v)) in the word string w
(4) A function that returns the number of occurrences of the non-consecutive part-of-speech pair (c(u), c(v)) in the word string w

Feature functions based on semantic information, obtained after replacing each word of the word string w with a category representing its meaning (semantic category), include (5) and (6) below. Semantic categories can be obtained using, for example, a thesaurus stored in a database inside or outside the speech recognition device of this embodiment. Here s(·) is a function that maps a word to its semantic category.

(5) A function that returns the number of occurrences of the consecutive semantic category pair (s(u), s(v)) in the word string w
(6) A function that returns the number of occurrences of the non-consecutive semantic category pair (s(u), s(v)) in the word string w
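The six feature functions above share one pattern: count pairs, optionally after mapping each word to a category. The sketch below is one possible reading, not the patent's implementation. In particular, it assumes that a "non-consecutive" pair is an ordered pair separated by at least one other word, and the part-of-speech table standing in for c(·) is a hypothetical toy example.

```python
def count_adjacent_pairs(words, u, v, mapping=None):
    """Count positions where u is immediately followed by v, after an optional mapping."""
    m = mapping or (lambda t: t)
    tags = [m(t) for t in words]
    return sum(1 for a, b in zip(tags, tags[1:]) if a == u and b == v)

def count_nonadjacent_pairs(words, u, v, mapping=None):
    """Count ordered pairs (u ... v) separated by at least one word (assumed definition)."""
    m = mapping or (lambda t: t)
    tags = [m(t) for t in words]
    return sum(
        1
        for i, a in enumerate(tags)
        for b in tags[i + 2:]
        if a == u and b == v
    )

words = ["next", "add", "salt", "and", "pepper"]

# Hypothetical stand-in for c(.), the word-to-part-of-speech mapping.
pos = {"next": "ADV", "add": "VERB", "salt": "NOUN", "and": "CONJ", "pepper": "NOUN"}

f1 = count_adjacent_pairs(words, "salt", "and")                 # type (1)
f2 = count_nonadjacent_pairs(words, "salt", "pepper")           # type (2)
f3 = count_adjacent_pairs(words, "VERB", "NOUN", pos.get)       # type (3)
f4 = count_nonadjacent_pairs(words, "NOUN", "NOUN", pos.get)    # type (4)
print(f1, f2, f3, f4)
```

Types (5) and (6) would follow by passing a word-to-semantic-category mapping, such as a thesaurus lookup, in place of `pos.get`.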

As described above, the error tendency of speech recognition is expressed through the feature functions and their weights as penalties on linguistic features, and is estimated based on an evaluation function that minimizes the word errors in the training data. In other words, conventional learning of error tendencies means finding the weights λ_i of equation (4) using the speech recognition results of the speech data and their correct word strings as training data.

[2.2 Error correction model learning algorithm applied to this embodiment]
Let u be the word string obtained from the most recent input speech preceding the word string w. The conditional probability P(w|x, u) of the word string w given the speech input x and the word string u is then given by the following equation (5).

The derivation of equation (5) uses Bayes' theorem and the independence of the word string u and the speech input x. The word string u may be of any length, and may be a concatenation of the content of several utterances.

Note that, when the word string u of an adjacent utterance is given together with the speech input x, the most likely word string ŵ for the input is given by the following equation (6), which replaces equation (1).
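The formulas for equations (5) and (6) do not appear in this text. One derivation consistent with the description is sketched below; it assumes that the independence of u and x is applied both marginally and conditionally on w, and the intermediate steps are a reconstruction rather than the original figures:

```latex
P(w \mid x, u)
  = \frac{P(x, u \mid w)\, P(w)}{P(x, u)}
  = \frac{P(x \mid w)\, P(u \mid w)\, P(w)}{P(x)\, P(u)}
  = \frac{P(x \mid w)\, P(w \mid u)}{P(x)}
\tag{5}

\hat{w} = \operatorname*{arg\,max}_{w} P(w \mid x, u)
        = \operatorname*{arg\,max}_{w} P(x \mid w)\, P(w \mid u)
\tag{6}
```

Compared with equation (1), the prior P(w) is replaced by P(w|u), which is where the content of the preceding utterance enters.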

Here, the conditional probability P(w|u) of the word string w given the word string u from the most recent input speech is assumed to take the form of equation (7).

g_j(w, u) (j = 1, ...) is a feature function representing linguistic features of the word strings w and u, and φ_j is the weight (feature weight) corresponding to g_j. The following is an example of a feature function g_j for linguistic features that depend on utterance order. Here v and z are each a word.

(Example) A function that returns the number of occurrences of the word v in the word string w of the utterance of interest when the word string u of the preceding utterance contains the word z

FIG. 2 is a diagram showing an example of the feature function g_j. In the figure, the word string u of the preceding utterance is a correct word string (or a plausible recognition result), and the word string w of the current utterance of interest is the set of candidate word strings w_1, w_2, and w_3. The word string u of the preceding utterance contains the word z, and the candidate word string w_1 contains the word v once, so g_j(w_1, u) = 1. The candidate word string w_3 contains the word v twice, so g_j(w_3, u) = 2.
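The example feature function can be written directly. The sketch below mirrors the FIG. 2 case; the function name and the placeholder tokens other than z and v are assumptions for illustration.

```python
def cross_utterance_feature(current_words, previous_words, z, v):
    """Return the count of v in the current utterance if z occurs in the previous one, else 0."""
    if z not in previous_words:
        return 0
    return sum(1 for w in current_words if w == v)

# Placeholder word strings: u contains z, w1 contains v once, w3 contains v twice.
u = ["a", "z", "b"]
w1 = ["c", "v", "d"]
w3 = ["v", "e", "v"]

g_w1 = cross_utterance_feature(w1, u, "z", "v")
g_w3 = cross_utterance_feature(w3, u, "z", "v")
print(g_w1, g_w3)
```

If the preceding utterance does not contain z, the feature is 0 regardless of how often v appears in the current candidate.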

Equations (5) and (7) give the following equation (8).

Taking the feature functions of the conventional discriminative language model into account, equation (8) becomes the following equation (9).

Letting h_0(x, w) be the logarithmic acoustic score from the HMM acoustic model and h_1(w) the logarithmic language score from the n-gram language model, equation (9) can be rewritten as the following equation (10).

Here κ is the weighting coefficient for the language score. Z(Λ, Φ) is a normalization constant that makes the expression a valid probability, and is given by the following equation (11). The word strings w' in equation (11) are the multiple speech recognition results obtained by recognizing the speech input x. The model parameters are Λ = (λ_1, λ_2, ...) and Φ = (φ_1, φ_2, ...).
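The formulas for equations (10) and (11) do not appear in this text. A log-linear form consistent with the symbols defined here (h_0, h_1, κ, λ_i, f_i, φ_j, g_j, and the normalizer Z) would be the following; it is a reconstruction under that assumption, and the original may differ in detail:

```latex
P(w \mid x, u) = \frac{1}{Z(\Lambda, \Phi)}
  \exp\Bigl\{ h_0(x, w) + \kappa\, h_1(w)
            + \sum_i \lambda_i f_i(w)
            + \sum_j \phi_j g_j(w, u) \Bigr\}
\tag{10}

Z(\Lambda, \Phi) = \sum_{w'}
  \exp\Bigl\{ h_0(x, w') + \kappa\, h_1(w')
            + \sum_i \lambda_i f_i(w')
            + \sum_j \phi_j g_j(w', u) \Bigr\}
\tag{11}
```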

Learning the error correction model in the speech recognition device of this embodiment means estimating, from the training data, the model parameters Λ and Φ used in the error correction model of equation (10).

ここで、Ｍ個の発話からなる学習データが与えられたとき、モデルパラメータ推定のための目的関数Ｌ（Λ，Φ）を以下の式（１２）とする。 Here, when learning data composed of M utterances is given, an objective function L (Λ, Φ) for model parameter estimation is expressed by the following equation (12).

式（１２）におけるＰ（ｗ_ｍ,ｎ｜ｘ_ｍ,ｗ_ｍ−１ ^ｒｅｆ）は、以下の式（１３）のように算出される。 P (w _{m, n} | x _m , w _m−1 ^ref ) in the equation (12) is calculated as in the following equation (13).

ｍは発話の順序を示し、Ｎ_ｍはｍ番目の発話の学習データに対して音声認識により生成された文仮説ｗ_ｍ，１、ｗ_ｍ，２、…の総数、文仮説ｗ_ｍ，ｎ（ｎは１以上の整数）はｍ番目の発話の学習データの第ｎ番目の正解候補の単語列である。ｗ_ｍ ^ｒｅｆはｍ番目の発話の学習データの正解単語列、Ｒ（・，・）は２つの単語列の編集距離を返す関数である。２つの単語列の編集距離は、動的計画法により効率的に求めることができる。編集距離は、正解単語列に対する音声認識結果の誤り単語数と等価（置換、挿入、脱落誤りの操作）であるため、式（１２）の目的関数Ｌ（Λ，Φ）は、期待される単語誤りの数を表している。この目的関数Ｌ（Λ，Φ）を最小化するようにモデルパラメータΛとモデルパラメータΦを推定すれば、期待される単語誤りの数が最小となる誤り修正モデルが得られるため、音声認識の性能の向上が期待できる。これは、目的関数Ｌ（Λ，Φ）を最小化するようにモデルパラメータΛ及びΦを推定すれば、正解候補の単語列に期待される認識誤りが最小となり、学習データとは異なる未知の入力音声に対する音声認識においても、モデルパラメータΛ及びΦによって認識誤りの最小化が同様に行われるからである。つまり、式（１２）の目的関数は、正解候補の単語列に期待される認識誤りが最小となり、モデルパラメータΛ及びΦが適切であるかの評価値を算出する評価関数として用いられる。 m indicates the order of utterances, N _m is the total number of sentence hypotheses w _{m, 1} , w _{m, 2} ,... generated by speech recognition on the learning data of the m-th utterance, sentence hypothesis w _{m, n} ( n is an integer of 1 or more) is a word string of the nth correct answer candidate of learning data of the mth utterance. w _m ^ref is a correct word string of learning data of the m-th utterance, and R (·, ·) is a function that returns an edit distance between two word strings. The edit distance between two word strings can be efficiently obtained by dynamic programming. Since the edit distance is equivalent to the number of error words in the speech recognition result for the correct word string (operation of substitution, insertion, omission error), the objective function L (Λ, Φ) in Expression (12) is the expected word. It represents the number of errors. If the model parameter Λ and the model parameter Φ are estimated so as to minimize the objective function L (Λ, Φ), an error correction model that minimizes the number of expected word errors can be obtained. Improvement can be expected. This is because if the model parameters Λ and Φ are estimated so as to minimize the objective function L (Λ, Φ), the recognition error expected for the correct candidate word string is minimized, and the unknown input is different from the learning data. 
This is because also in speech recognition for speech, recognition errors are similarly minimized by the model parameters Λ and Φ. That is, the objective function of Expression (12) is used as an evaluation function for calculating an evaluation value as to whether or not the model parameters Λ and Φ are appropriate because the recognition error expected for the correct candidate word string is minimized.
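Read literally, the description above defines the objective as the edit distance of each sentence hypothesis weighted by its conditional probability and summed over all hypotheses of all M utterances. One way to write this, offered as a sketch consistent with that description rather than as the original Expression (12), is:

```latex
L(\Lambda, \Phi) = \sum_{m=1}^{M} \sum_{n=1}^{N_m}
  R\left(w_m^{\mathrm{ref}}, w_{m,n}\right)\,
  P\left(w_{m,n} \mid x_m, w_{m-1}^{\mathrm{ref}}\right)
```

Under this reading, the inner sum is the expected number of word errors for the m-th utterance, which is why minimizing L(Λ, Φ) minimizes the expected word errors over the learning data.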

To estimate the parameters, the gradients ΔΛ and ΔΦ of the objective function with respect to the model parameters Λ and Φ are obtained from the following Expressions (14) and (15).

The gradient ΔΛ is (∂L(Λ, Φ)/∂λ_1, ∂L(Λ, Φ)/∂λ_2, ∂L(Λ, Φ)/∂λ_3, ...), and the gradient ΔΦ is (∂L(Λ, Φ)/∂φ_1, ∂L(Λ, Φ)/∂φ_2, ∂L(Λ, Φ)/∂φ_3, ...).

When the model parameters Λ^t and Φ^t are learned by iterative updating, and the model parameters Λ^{t-1} and Φ^{t-1} have been obtained after the (t-1)-th iteration, the following Expressions (16) and (17) are the parameter update formulas.

Here, η_Λ and η_Φ are the coefficients applied to the gradients ΔΛ and ΔΦ obtained by Expressions (14) and (15), respectively.
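Because the objective function is minimized, a natural reading of the update is a gradient-descent step. The following is a sketch under that assumption; the minus sign and the interpretation of η_Λ and η_Φ as step sizes are assumed here, not quoted from Expressions (16) and (17):

```latex
\Lambda^{t} = \Lambda^{t-1} - \eta_{\Lambda}\,\Delta\Lambda,
\qquad
\Phi^{t} = \Phi^{t-1} - \eta_{\Phi}\,\Delta\Phi
```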

[3. Configuration of the speech recognition apparatus]
FIG. 3 is a functional block diagram showing the configuration of the speech recognition apparatus 1 according to an embodiment of the present invention; only the functional blocks related to the invention are shown.
The speech recognition apparatus 1 is realized by a computer device and, as shown in the figure, comprises a speech recognition unit 11, a feature extraction unit 12, a model learning unit 13, a speech recognition unit 14, a spoken language resource storage unit 21, an acoustic model storage unit 22, a language model storage unit 23, and an error correction model storage unit 24.

The spoken language resource storage unit 21 stores the learning data. The acoustic model storage unit 22 stores the acoustic model. The language model storage unit 23 stores the language model. The error correction model storage unit 24 stores the error correction model.

The speech recognition unit 11 performs speech recognition on speech data in order to generate the learning data. The speech data represents feature quantities obtained by short-time spectral analysis of the speech waveform of an utterance. In the present embodiment, broadcast audio and caption data D1 is used as the speech data. The speech recognition unit 11 associates the speech data of each utterance, the speech recognition result data D2 obtained by recognizing that speech data, and the correct word string data D3 indicating the correct word string of the utterance content, and writes them to the spoken language resource storage unit 21 as learning data. At this time, the speech recognition unit 11 also stores in the spoken language resource storage unit 21 the order in which the utterances were recognized.

The feature extraction unit 12 extracts linguistic features within the same utterance and linguistic features according to the order of utterances from the speech recognition result data D2 and the correct word string data D3 of the learning data, arranged in utterance order. The feature extraction unit 12 outputs feature function data D4 indicating the feature functions f_i and g_j, which use the obtained linguistic features as rules.

The model learning unit 13 takes as input the feature function data D4 output by the feature extraction unit 12 and the learning data stored in the spoken language resource storage unit 21, and learns the model parameters Λ and Φ of the error correction model by statistical means. The model learning unit 13 writes the error correction model using the learned model parameters Λ and Φ to the error correction model storage unit 24.

The speech recognition unit 14 refers to the acoustic model stored in the acoustic model storage unit 22 and the language model stored in the language model storage unit 23, performs speech recognition on the input speech data D5 using the error correction model stored in the error correction model storage unit 24, and outputs speech recognition result data D6.

[4. Processing procedure of the speech recognition apparatus]
FIG. 4 shows the overall processing flow of the speech recognition apparatus 1 according to the present embodiment. The processing of each step shown in the figure is described below.

[4.1 Step S1]
In the present embodiment, generating the error correction model requires, as learning data, the speech recognition result of each utterance and the correct word string that is a transcription of the utterance content. The speech recognition unit 11 therefore collects broadcast audio and caption data D1 as pairs of speech data and correct word string data, and performs speech recognition on the speech data included in D1. The speech recognition unit 11 uses as the correct word string data D3 either the caption data acquired from D1 or text data obtained by manually correcting the speech recognition result. The speech recognition unit 11 stores in the spoken language resource storage unit 21 learning data in which the speech data of each utterance, the speech recognition result data D2 indicating the recognition result, and the correct word string data D3 are associated with one another. At this time, the speech recognition unit 11 preserves the order in which the utterances were recognized. The speech recognition result data D2 of the m-th (m = 1, 2, ...) learning data contains the sentence hypotheses w_{m,n} (n = 1, 2, ...), which are the correct candidates obtained by recognizing the m-th speech data, and the correct word string data D3 of the m-th learning data contains the correct word string w_m^ref of the m-th speech data.
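The learning data described in step S1 can be pictured as an ordered list of records, each holding the sentence hypotheses (D2) and the correct word string (D3) of one utterance. The following is a minimal sketch; the class name `TrainingUtterance`, the helper `previous_reference`, the sample sentences, and the choice to return an empty word string for the first utterance are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingUtterance:
    hypotheses: List[List[str]]   # D2: sentence hypotheses w_{m,n}
    reference: List[str]          # D3: correct word string w_m^ref


def previous_reference(corpus: List[TrainingUtterance], m: int) -> List[str]:
    """Return w_{m-1}^ref for the utterance at 0-based position m in utterance order."""
    # Assumption: the first utterance has no predecessor, so an empty string is used.
    return corpus[m - 1].reference if m > 0 else []


# The list order is the utterance order preserved by the storage unit.
corpus = [
    TrainingUtterance([["heavy", "rain"], ["heavy", "lane"]], ["heavy", "rain"]),
    TrainingUtterance([["bring", "umbrella"]], ["bring", "an", "umbrella"]),
]
```

Keeping the records in a list makes the "preceding utterance" lookup a simple index operation, which is the property the later steps rely on.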

[4.2 Step S2]
The model learning unit 13 extracts, from the learning data stored in the spoken language resource storage unit 21, feature functions based on the linguistic features used for learning error tendencies.

First, from each of the speech recognition result data D2 and the correct word string data D3 included in the learning data, the model learning unit 13 extracts all feature functions based on linguistic features within the same utterance, such as consecutive words, two or more non-consecutive words, and syntactic or semantic information of words. The model learning unit 13 then refers to the correct-candidate sentence hypotheses w_{m,n} indicated by the speech recognition result data D2 (corresponding to the word strings w_1, w_2, ... in FIG. 2) and to the correct word string w_{m-1}^ref preceding the utterance, indicated by the correct word string data D3 (corresponding to the preceding word string u in FIG. 2), and extracts all feature functions based on linguistic features according to the order of utterances. The correct word string w_{m-1}^ref is the correct word string of the past utterance temporally adjacent to the utterance of the sentence hypothesis w_{m,n}. The model learning unit 13 counts how often each extracted feature function appears. Among the feature functions whose counted frequency is equal to or greater than a predetermined threshold, those based on linguistic features within the same utterance are adopted as the feature functions f_i, and those based on linguistic features according to the order of utterances are adopted as the feature functions g_j, for use in error tendency learning. The model learning unit 13 outputs the feature function data D4, in which the determined feature functions f_i and g_j are set, to the model learning unit 13.
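The count-and-threshold selection in step S2 can be sketched as follows. This is an illustrative simplification, assuming that the within-utterance features are consecutive word pairs and that the utterance-order features are (word in preceding correct string, word in hypothesis) pairs; the function name `select_features` and the sample corpus are hypothetical, and the syntactic and semantic features mentioned in the text are omitted.

```python
from collections import Counter
from itertools import product


def select_features(corpus, threshold):
    """corpus: list of (hypotheses, reference) pairs in utterance order."""
    f_counts, g_counts = Counter(), Counter()
    prev_ref = []
    for hyps, ref in corpus:
        # Within-utterance candidates: consecutive word pairs from D2 and D3.
        for words in hyps + [ref]:
            f_counts.update(zip(words, words[1:]))
        # Utterance-order candidates: (z in previous reference, v in hypothesis).
        for words in hyps:
            g_counts.update(product(set(prev_ref), set(words)))
        prev_ref = ref
    # Keep only candidates whose frequency reaches the threshold.
    f_sel = sorted(k for k, c in f_counts.items() if c >= threshold)
    g_sel = sorted(k for k, c in g_counts.items() if c >= threshold)
    return f_sel, g_sel


sample = [
    ([["a", "b"]], ["a", "b"]),
    ([["c", "b"], ["c", "d"]], ["c", "d"]),
]
f_sel, g_sel = select_features(sample, threshold=2)
```

The threshold keeps the feature set small and discards pairs seen too rarely to give a reliable estimate of their weights.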

[4.3 Step S3]
Next, the model learning unit 13 learns the model parameters of the error correction model.
FIG. 5 shows the processing flow of the error correction model update process executed by the model learning unit 13 in step S3.

(Step S31: Model parameter initialization process)
The model learning unit 13 sets suitable initial values for the model parameters Λ and Φ. In the present embodiment, the initial values are Λ = Φ = 0.

(Step S32: Edit distance calculation process)
To compute the objective function of Expression (12), the edit distance between each speech recognition result and the corresponding correct word string must first be calculated. The model learning unit 13 therefore reads the learning data stored in the spoken language resource storage unit 21 and calculates the edit distance R(w_m^ref, w_{m,n}) from the sentence hypothesis w_{m,n} indicated by the speech recognition result data D2 and the correct word string w_m^ref indicated by the correct word string data D3. Note that these edit distances are treated as constants in the learning of the error correction model.
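The edit distance used in step S32 can be computed with the standard dynamic-programming recurrence over substitutions, insertions, and deletions. The sketch below is one common formulation operating on word lists; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    # prev[j] holds the distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,              # deletion: reference word missing from hyp
                cur[j - 1] + 1,           # insertion: extra word in hyp
                prev[j - 1] + (r != h),   # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]
```

Because the distances depend only on the hypotheses and references, they can be computed once and cached before the iterative parameter updates begin.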

(Step S33: Objective function calculation process)
The model learning unit 13 calculates the value of the objective function L(Λ, Φ) according to Expression (12), using the edit distances R(w_m^ref, w_{m,n}) obtained in step S32. To do so, the model learning unit 13 calculates the conditional probability P(w_{m,n} | x_m, w_{m-1}^ref) in Expression (12) with Expression (13) from the acoustic model score h_0(x_m | w_{m,n}), the language model score h_1(w_{m,n}), and the current model parameters Λ and Φ. The model learning unit 13 obtains the acoustic model score h_0(x_m | w_{m,n}) of each sentence hypothesis w_{m,n} using the acoustic model stored in the acoustic model storage unit 22 and the speech data of the m-th learning data. The model learning unit 13 also obtains the language model score h_1(w_{m,n}) of the sentence hypothesis w_{m,n} using the language model stored in the language model storage unit 23.
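The calculation in step S33 can be sketched for a single utterance as follows. The sketch assumes a log-linear model in which each hypothesis score is the acoustic score plus κ times the language score plus a weighted sum of feature values, normalized over the hypotheses of that utterance; merging Λ and Φ into one `weights` dictionary and all function names are simplifications introduced here, not the original Expressions.

```python
import math


def posteriors(scores):
    """Normalize total log scores over one list of hypotheses (softmax)."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def total_score(h0, h1, kappa, weights, feats):
    """Log-linear score: acoustic + kappa * language + weighted feature values."""
    return h0 + kappa * h1 + sum(weights.get(k, 0.0) * v for k, v in feats.items())


def expected_errors(nbest, kappa, weights):
    """nbest: list of (h0, h1, feats, edit_distance) for one utterance."""
    scores = [total_score(h0, h1, kappa, weights, f) for h0, h1, f, _ in nbest]
    post = posteriors(scores)
    return sum(p * r for p, (_, _, _, r) in zip(post, nbest))


# Two hypotheses with equal base scores; the first is correct (distance 0).
nbest = [(-10.0, -2.0, {"g": 1}, 0), (-10.0, -2.0, {}, 2)]
```

With zero weights both hypotheses are equally likely and the expected error is the mean of the two distances; a positive weight on the feature carried by the correct hypothesis lowers it.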

(Step S34: Gradient calculation process)
Using the current values of the model parameters Λ and Φ, the model learning unit 13 obtains the gradients ΔΛ and ΔΦ of Expression (12) with respect to Λ and Φ from Expressions (14) and (15). For the edit distance R(w_m^ref, w_{m,n}) and the conditional probability P(w_{m,n} | x_m, w_{m-1}^ref) in Expressions (14) and (15), the model learning unit 13 uses the values computed when the objective function L(Λ, Φ) was calculated in step S33. The model learning unit 13 obtains the value of the feature function g_j(w_{m,n}, w_{m-1}^ref) in Expression (14) from the sentence hypothesis w_{m,n} indicated by the speech recognition result data D2 and the correct word string w_{m-1}^ref indicated by the correct word string data D3. The model learning unit 13 obtains the value of the feature function f_i(w_{m,n}) in Expression (15) from the sentence hypothesis w_{m,n} indicated by the speech recognition result data D2. The model learning unit 13 obtains the feature functions f_i and g_j from the feature function data D4.

The model learning unit 13 updates the model parameters Λ and Φ with Expressions (16) and (17), using the obtained gradients ΔΛ and ΔΦ. Predetermined values are used for the coefficients η_Λ and η_Φ in Expressions (16) and (17).
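For an expected-error objective under a log-linear model, the partial derivative with respect to a feature weight is the posterior-weighted sum of (edit distance minus expected edit distance) times the feature value. The sketch below computes this for one utterance; it is the standard result for this kind of objective and is offered as an assumption about the general shape of the gradient, not as a transcription of Expressions (14) and (15). The function name and the single merged weight space are hypothetical.

```python
import math


def expected_error_gradient(scores, dists, feats):
    """scores: total log scores; dists: edit distances; feats: list of {name: value}."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    post = [e / z for e in exps]
    exp_err = sum(p * r for p, r in zip(post, dists))
    grad = {}
    for p, r, f in zip(post, dists, feats):
        for name, value in f.items():
            # d(expected error)/d(weight) = sum_n P_n * (R_n - E[R]) * feature_n
            grad[name] = grad.get(name, 0.0) + p * (r - exp_err) * value
    return exp_err, grad


exp_err, grad = expected_error_gradient(
    scores=[0.0, 0.0], dists=[0, 2], feats=[{"g": 1}, {}]
)
```

In this example the feature fires only on the error-free hypothesis, so its gradient is negative and a descent step raises its weight. Summing these per-utterance gradients over all M utterances gives the gradient of the full objective.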

(Step S35: End determination process)
The model learning unit 13 compares the value of the objective function obtained by the gradient calculation process of step S34 with the value of the objective function before the update. If the change is equal to or greater than a predetermined amount, the processing from step S33 is repeated. If the change is smaller than the predetermined amount, the update is regarded as having converged, updating of the model parameters Λ and Φ is terminated, and the processing of step S36 is executed.
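The loop formed by steps S33 to S35 can be sketched generically as gradient descent with a stopping test on the change in the objective value. The function below is a minimal illustration; the names `train`, `tol`, and `max_iter`, the default values, and the iteration cap are assumptions added for safety and are not stated in the text.

```python
def train(objective, gradient, params, eta=0.1, tol=1e-6, max_iter=1000):
    """Repeat gradient steps until the objective changes by less than tol."""
    prev = objective(params)
    for _ in range(max_iter):
        grad = gradient(params)
        # Descent step: move against the gradient because the objective is minimized.
        params = [p - eta * g for p, g in zip(params, grad)]
        cur = objective(params)
        if abs(prev - cur) < tol:
            break  # change below the threshold: treat the update as converged
        prev = cur
    return params


# Toy objective (p - 3)^2 standing in for L; its minimum is at p = 3.
result = train(lambda p: (p[0] - 3.0) ** 2, lambda p: [2.0 * (p[0] - 3.0)], [0.0])
```

The toy quadratic only demonstrates the control flow; in the setting described above, `objective` and `gradient` would be the expected-error objective and its gradient over the learning data.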

(Step S36: Error correction model output process)
The model learning unit 13 writes to the error correction model storage unit 24 the error correction model that uses the model parameters Λ = (λ_1, λ_2, ...) and Φ = (φ_1, φ_2, ...) obtained when the update converged.

[4.4 Step S4]
When input speech data D5 is supplied as the speech data to be recognized, the speech recognition unit 14 uses the error correction model stored in the error correction model storage unit 24, the acoustic model stored in the acoustic model storage unit 22, and the language model stored in the language model storage unit 23 to obtain correct-candidate word strings for the input speech data D5 and to calculate their scores. During learning, the word string of the utterance preceding the one currently being processed is a correct word string. During speech recognition, however, no correct word string is available, so the speech recognition unit 14 uses in its place the maximum-likelihood word string obtained when the utterance preceding the current one was recognized. The speech recognition unit 14 outputs, in real time, speech recognition result data D6 in which the correct-candidate word string with the best score is set as the correct word string. By using this error correction model, the speech recognition unit 14 corrects errors in the selection of the speech recognition result obtained from the input speech data D5.
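The use of the previous recognition result as context in step S4 can be sketched as a rescoring loop over a stream of hypothesis lists. This is an illustrative sketch only: the function names, the log-linear score, the sample words, and the choice to carry forward the rescored best hypothesis as the "maximum-likelihood word string" are assumptions.

```python
def recognize_stream(nbest_stream, kappa, weights, feature_fn):
    """nbest_stream: per utterance, a list of (words, h0, h1) hypotheses."""
    prev_best = []   # stands in for the unavailable correct word string
    results = []
    for nbest in nbest_stream:
        def score(hyp):
            words, h0, h1 = hyp
            feats = feature_fn(words, prev_best)
            return h0 + kappa * h1 + sum(weights.get(k, 0.0) * v for k, v in feats.items())
        best = max(nbest, key=score)
        prev_best = best[0]          # becomes the context for the next utterance
        results.append(prev_best)
    return results


def rain_umbrella(words, prev):
    # Hypothetical g_j: "umbrella" in the candidate given "rain" in the previous utterance.
    return {"rain->umbrella": words.count("umbrella")} if "rain" in prev else {}


stream = [
    [(["heavy", "rain"], -5.0, -1.0), (["heavy", "lane"], -6.0, -1.0)],
    [(["bring", "umbrella"], -6.0, -1.0), (["bring", "a", "fella"], -5.5, -1.0)],
]
out = recognize_stream(stream, 1.0, {"rain->umbrella": 2.0}, rain_umbrella)
```

In this toy stream the second utterance's baseline-best hypothesis is overturned once the context feature from the first utterance is taken into account, which is the kind of selection correction the error correction model is meant to provide.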

[5. Effects]
According to the present embodiment, the speech recognition apparatus 1 can construct an error correction model that reflects the content of the immediately preceding utterance, and recognition errors are reduced compared with conventional speech recognition.

[6. Others]
The speech recognition apparatus 1 described above contains a computer system. The operation of the speech recognition apparatus 1 is stored in the form of a program on a computer-readable recording medium, and the processing described above is performed when the computer system reads and executes this program. The computer system here includes a CPU, various memories, an OS, and hardware such as peripheral devices.

The "computer system" also includes a website providing environment (or display environment) when a WWW system is used.
The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. The "computer-readable recording medium" further includes media that hold a program dynamically for a short time, such as a communication line used when a program is transmitted over a network such as the Internet or over a communication line such as a telephone line, and media that hold a program for a certain period of time, such as volatile memory inside a computer system acting as a server or client in that case. The program may implement only part of the functions described above, or may implement those functions in combination with a program already recorded in the computer system.

DESCRIPTION OF SYMBOLS
1 Speech recognition apparatus
11 Speech recognition unit
12 Feature extraction unit
13 Model learning unit
14 Speech recognition unit
21 Spoken language resource storage unit
22 Acoustic model storage unit
23 Language model storage unit
24 Error correction model storage unit

Claims

A speech recognition apparatus comprising:
a spoken language resource storage unit that stores, while maintaining the order of utterances, a speech recognition result obtained by recognizing speech data of an utterance and a correct word string of the utterance; and
a model learning unit that statistically learns tendencies of word recognition errors based on linguistic features according to the order of utterances, the features being obtained from words included in the speech recognition result obtained from the speech data and words included in the correct word string of an utterance earlier than the utterance of the speech data, and that generates an error correction model for correcting the learned tendencies of recognition errors.

The speech recognition apparatus according to claim 1, wherein the model learning unit statistically learns the tendencies of word recognition errors based on linguistic features within the same utterance obtained from the speech recognition result and on the linguistic features according to the order of utterances, and generates the error correction model for correcting the learned tendencies of recognition errors.

The speech recognition apparatus according to claim 2, wherein the model learning unit statistically learns the tendencies of word errors based on one or more of a co-occurrence relationship among a plurality of consecutive words, a co-occurrence relationship among a plurality of non-consecutive words, syntactic information of words, and semantic information of words within the same utterance obtained from the speech recognition result, and on a co-occurrence relationship between words included in the speech recognition result and words included in the correct word string of a past utterance.

The speech recognition apparatus according to any one of claims 1 to 3, wherein
the error correction model is a calculation formula that corrects a speech recognition score using feature functions based on the linguistic features and weights of the feature functions, and
the model learning unit statistically calculates the weights based on an evaluation value computed by an evaluation function that is defined using the values of the feature functions obtained from the speech recognition result and the correct word string and using the word recognition errors included in the speech recognition result, and generates the error correction model using the calculated weights.

The speech recognition apparatus according to claim 1, further comprising a speech recognition unit that recognizes input speech data and corrects, using the error correction model generated by the model learning unit, errors in the selection of the speech recognition result obtained from the input speech data.

An error correction model learning method comprising:
a spoken language resource storage step of storing in a spoken language resource storage unit, while maintaining the order of utterances, a speech recognition result obtained by recognizing speech data of an utterance and a correct word string of the utterance; and
a model learning step of statistically learning tendencies of word recognition errors based on linguistic features according to the order of utterances, the features being obtained from words included in the speech recognition result obtained from the speech data and words included in the correct word string of an utterance earlier than the utterance of the speech data, and generating an error correction model for correcting the learned tendencies of recognition errors.

A program for causing a computer to function as a speech recognition apparatus comprising:
spoken language resource storage means for storing, while maintaining the order of utterances, a speech recognition result obtained by recognizing speech data of an utterance and a correct word string of the utterance; and
model learning means for statistically learning tendencies of word recognition errors based on linguistic features according to the order of utterances, the features being obtained from words included in the speech recognition result obtained from the speech data and words included in the correct word string of an utterance earlier than the utterance of the speech data, and for generating an error correction model for correcting the learned tendencies of recognition errors.