JP2014077882A

JP2014077882A - Speech recognition device, error correction model learning method and program

Info

Publication number: JP2014077882A
Application number: JP2012225330A
Authority: JP
Inventors: Akio Kobayashi; 彰夫小林
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2012-10-10
Filing date: 2012-10-10
Publication date: 2014-05-01
Anticipated expiration: 2032-10-10
Also published as: JP6051004B2

Abstract

PROBLEM TO BE SOLVED: To improve speech recognition performance by adapting an error correction model to the utterance content. SOLUTION: A speech recognition section 11 recognizes speech contained in voice data using the error correction model currently stored in an error correction model storage 23. An error correction section 12 corrects a correct sentence candidate obtained as a result of the speech recognition according to user input to generate a correct word string. An alignment section 13 aligns the words in the correct word string in time order based on the voice data. A feature amount extraction section 14 extracts linguistic features from the correct sentence candidate and the aligned correct word string. A model parameter learning section 15 statistically calculates the feature weights of feature functions, which express linguistic features as rules, and the mixture weights of the plural language models constituting a mixture model, based on the extracted linguistic features and on the acoustic scores and language scores of the correct sentence candidate and the aligned correct word string, and updates the error correction model accordingly.

Description

The present invention relates to a speech recognition device, an error correction model learning method, and a program.

Regarding error correction in speech recognition, there is a technique that statistically learns the error tendencies of speech recognition from speech and its transcriptions (correct sentences) using linguistic features, and improves speech recognition performance by using the statistical error correction model obtained from that learning (see, for example, Non-Patent Document 1).

Non-Patent Document 1: Kobayashi et al., "News speech recognition by discriminative scoring based on word error minimization", IEICE Journal, vol. J93-D, no. 5, 2010, pp. 598-609

In speech recognition, words are predicted using a statistical language model (hereinafter, "language model"). To improve word prediction performance, the language model used is often a mixture of several language models. In general, the mixing parameters that adjust the degree of contribution of each language model in the mixture are determined using static text, that is, training data prepared in advance.

On the other hand, an error correction model uses speech recognition results as training data in order to learn the error tendencies of speech recognition. The model parameters that weight linguistic error tendencies in the error correction model are, like the mixing parameters, learned from static data. However, the data to be recognized (the utterance content) does not necessarily match the static data used for training; the topics may differ, for example, and a good match is in fact rare. For example, even when both concern cooking with the same ingredients, if the training data differs in content, such as in the cooking method, the model parameters of the error correction model estimated from the training data are not optimized for the utterance content. Furthermore, although Non-Patent Document 1 mixes language models by linear interpolation, the mixing parameters that indicate the weighting ratio of each language model in the mixture model are fixed, so the model can hardly be said to be adapted to the utterance content.

The present invention has been made in view of these circumstances, and provides a speech recognition apparatus, an error correction model learning method, and a program that can improve speech recognition performance by adapting the error correction model to the utterance content being recognized.

[1] One aspect of the present invention is a speech recognition apparatus comprising: an error correction model storage unit that stores an error correction model, which is a formula that calculates a speech recognition score using a value obtained by correcting, with weighted linguistic features, a language score obtained from a mixture model in which a plurality of language models are mixed according to mixture weights; a speech recognition unit that performs speech recognition on input speech data using the error correction model stored in the error correction model storage unit and outputs correct sentence candidates obtained as a result of the speech recognition; an error correction unit that corrects the correct sentence candidates output from the speech recognition unit according to user input and generates a correct word string; an alignment unit that aligns the words included in the correct word string generated by the error correction unit in time order based on the speech data; a feature amount extraction unit that extracts linguistic features from the correct sentence candidates and the aligned correct word string; and a model parameter learning unit that statistically calculates the weights of the linguistic features and the mixture weights of the language models based on the linguistic features extracted by the feature amount extraction unit and on the acoustic scores and language scores of the correct sentence candidates and the aligned correct word string, and updates the error correction model stored in the error correction model storage unit to an error correction model that uses the calculated weights of the linguistic features and mixture weights of the language models.
According to this invention, when speech data is input, the speech recognition apparatus performs speech recognition using the currently stored error correction model and corrects the correct sentence candidates obtained as a result according to user input. The speech recognition apparatus aligns the words in the correct word string in time order based on the speech data, and extracts linguistic features from the correct sentence candidates and the correct word string. Based on the extracted linguistic features and on the acoustic score and language score of each of the correct sentence candidates and the aligned correct word string, the speech recognition apparatus statistically calculates the weights of the linguistic features used in the error correction model and the mixture weights of the plurality of language models, and updates the currently stored error correction model. The speech recognition apparatus then recognizes newly input speech data using the updated error correction model.
As a result, the speech recognition apparatus can adapt the error correction model to the utterance content currently being recognized and improve speech recognition performance.

[2] One aspect of the present invention is the speech recognition apparatus described above, wherein the model parameter learning unit statistically calculates the weights of the linguistic features and the mixture weights of the language models so that the posterior probability of the correct word string is maximized, or the recognition errors of the correct sentence candidates are minimized, based on an evaluation value calculated by an evaluation function defined using the recognition errors of the correct sentence candidates, obtained by comparison with the correct word string, and the speech recognition scores of the correct sentence candidates obtained by the error correction model.
According to this invention, the speech recognition apparatus determines the weights of the linguistic features and the mixture weights of the language models, and updates the error correction model, so that the evaluation value calculated by an evaluation function defined using the recognition errors contained in the correct sentence candidates and the speech recognition scores of the correct sentence candidates obtained by the error correction model becomes an evaluation value indicating that the posterior probability of the correct word string is at its maximum, or that the expected word errors of the correct sentence candidates are fewest.
As a result, the speech recognition apparatus can efficiently learn the weights of the linguistic features and the mixture weights of the plurality of language models, and update the error correction model.

[3] One aspect of the present invention is the speech recognition apparatus described above, wherein the model parameter learning unit calculates the weights of the linguistic features and the mixture weights of the language models each time the speech recognition unit performs speech recognition on speech data, and sequentially updates the error correction model stored in the error correction model storage unit to an error correction model that uses the calculated weights of the linguistic features and mixture weights of the language models.
According to this invention, the speech recognition apparatus sequentially updates the error correction model each time new speech data is input.
As a result, the speech recognition apparatus can adapt the error correction model to the utterance content in real time and improve speech recognition performance.

[4] One aspect of the present invention is the speech recognition apparatus described above, wherein the feature amount extraction unit extracts the linguistic features based on consecutive words, the phonemes constituting words, a plurality of non-consecutive words, co-occurrence relations between phonemes, syntactic information about words, or semantic information about words.
According to this invention, the speech recognition apparatus updates the weights of the linguistic features and the mixture weights of the language models based on linguistic features obtained from the words, phonemes, and the like contained in the correct sentence candidates and the correct word string.
As a result, the speech recognition apparatus can generate an error correction model that corrects recognition errors accurately according to the current topic.

[5] One aspect of the present invention is an error correction model learning method comprising: an error correction model storage step of storing an error correction model, which is a formula that calculates a speech recognition score using a value obtained by correcting, with weighted linguistic features, a language score obtained from a mixture model in which a plurality of language models are mixed according to mixture weights; a speech recognition step of performing speech recognition on input speech data using the error correction model stored in the error correction model storage step and outputting correct sentence candidates obtained as a result of the speech recognition; an error correction step of correcting the correct sentence candidates output in the speech recognition step according to user input and generating a correct word string; an alignment step of aligning the words included in the correct word string generated in the error correction step in time order based on the speech data; a feature amount extraction step of extracting linguistic features from the correct sentence candidates and the aligned correct word string; and a model parameter learning step of statistically calculating the weights of the linguistic features and the mixture weights of the language models based on the linguistic features extracted in the feature amount extraction step and on the acoustic scores and language scores of the correct sentence candidates and the aligned correct word string, and updating the currently stored error correction model to an error correction model that uses the calculated weights of the linguistic features and mixture weights of the language models.

[6] One aspect of the present invention is a program that causes a computer to function as a speech recognition apparatus comprising: error correction model storage means for storing an error correction model, which is a formula that calculates a speech recognition score using a value obtained by correcting, with weighted linguistic features, a language score obtained from a mixture model in which a plurality of language models are mixed according to mixture weights; speech recognition means for performing speech recognition on input speech data using the error correction model stored in the error correction model storage means and outputting correct sentence candidates obtained as a result of the speech recognition; error correction means for correcting the correct sentence candidates output from the speech recognition means according to user input and generating a correct word string; alignment means for aligning the words included in the correct word string generated by the error correction means in time order based on the speech data; feature amount extraction means for extracting linguistic features from the correct sentence candidates and the aligned correct word string; and model parameter learning means for statistically calculating the weights of the linguistic features and the mixture weights of the language models based on the linguistic features extracted by the feature amount extraction means and on the acoustic scores and language scores of the correct sentence candidates and the aligned correct word string, and updating the error correction model stored in the error correction model storage means to an error correction model that uses the calculated weights of the linguistic features and mixture weights of the language models.

According to the present invention, speech recognition performance can be improved by adapting the error correction model to the utterance content being recognized.

FIG. 1 is a diagram showing the procedure for sequential estimation of the error correction model in a speech recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a functional block diagram showing the configuration of the speech recognition apparatus according to the embodiment.
FIG. 3 is a diagram showing the overall processing flow of the speech recognition apparatus according to the embodiment.
FIG. 4 is a diagram showing the relationship between the correct word string and the speech recognition results according to the embodiment.
FIG. 5 is a diagram showing the parameter learning processing flow of the speech recognition apparatus according to the embodiment.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

[1. Overview of this embodiment]
So-called error correction models, which reflect the error tendencies of speech recognition, have already been proposed. Such an error correction model is often estimated statically from training data consisting of a large number of speech recognition results prepared in advance. In actual speech recognition, however, the training data and the speech being recognized rarely match completely in topic. For this reason, the performance of speech recognition using an error correction model is not necessarily optimal for the content of the utterances being recognized. The model parameters that weight the error tendencies of the language model in the error correction model are estimated statically from the training data, but to achieve high speech recognition performance, these statically estimated model parameters need to be optimized sequentially (online) using the recognition results of the speech being evaluated.

Meanwhile, the statistical language model (hereinafter, "language model") used together with the error correction model at evaluation time is often a mixture model estimated from a plurality of information sources by a technique such as linear interpolation. The mixture weights of the mixture model contribute greatly to the quality of the speech recognition results and also affect the estimation of the model parameters of the error correction model. For this reason, the mixing parameters that represent the mixture weights of the mixture model also need to be optimized sequentially (dynamically).

In a speech recognition system for creating subtitles, the speech recognition results are output in real time and then corrected manually. Therefore, if the manually corrected data is regarded as the correct answer, the parameters of the conventional error correction model and of the mixture model can be optimized sequentially each time a correct answer is obtained. The speech recognition apparatus of this embodiment therefore optimizes, sequentially and simultaneously, the mixing parameters of language models estimated from a plurality of information sources and the model parameters of the error correction model, thereby learning an error correction model whose speech recognition performance is sequentially optimized for the utterance content, and applies the learned error correction model to speech recognition.

FIG. 1 is a diagram showing the procedure for sequential estimation of the statistical error correction model by the speech recognition apparatus of this embodiment.
As shown in the figure, the speech recognition apparatus of this embodiment recognizes the input speech to obtain the speech recognition result of each utterance in sequence, and estimates the model parameters of the error correction model using the obtained speech recognition result and the correct word string obtained by correcting that result. Because the apparatus sequentially estimates the mixing parameters of the mixture model (mixed language model) at the same time as the model parameters of the error correction model, an error correction model suited to the utterance content is obtained. The speech recognition apparatus of this embodiment can therefore improve recognition performance on the input speech by sequentially adapting the error correction model to the utterance content at each point in time.
In this way, the speech recognition apparatus of this embodiment sequentially updates the statistical error correction model that corrects speech recognition errors, and applies it to speech recognition.

[2. Learning algorithm applied to the speech recognition apparatus according to this embodiment]
According to Bayes' theorem, when a speech input x is given, the most likely word string ŵ (w with a hat) for the speech input x can be obtained by the following equation (1).
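Equation (1) itself is not reproduced in this text. A reconstruction consistent with the surrounding definitions, assuming the standard Bayes decision rule (P(x) does not depend on w and so drops out of the maximization), is:

```latex
\hat{w} = \arg\max_{w} P(w \mid x)
        = \arg\max_{w} \frac{P(x \mid w)\,P(w)}{P(x)}
        = \arg\max_{w} P(x \mid w)\,P(w) \tag{1}
```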

The speech input x and the word string w correspond, for example, to a unit of utterance, and P(w|x) is the posterior probability that the word string (sentence hypothesis) w is obtained when the speech input x occurs.
P(x|w) is the likelihood representing the acoustic plausibility of the word string w, and its score (acoustic score) is calculated based on a statistical acoustic model (hereinafter, "acoustic model") typified by the hidden Markov model (HMM) and the Gaussian mixture model (GMM). In other words, when a certain acoustic feature is given, the acoustic score is a score representing the likelihood of each of a plurality of correct candidate words.

On the other hand, P(w) is the linguistic generation probability of the word string w, and its score (language score) is calculated by a language model such as a word n-gram model. In other words, given the word string preceding or following the word being recognized, or both, the language score is a score representing the likelihood of each of a plurality of correct candidate word strings. A word n-gram model gives the occurrence probability of the next word from a history of (N-1) words, based on the statistics of N-word chains (N is, for example, 1, 2, or 3).

In the following description, an HMM-GMM is used as the acoustic model and an n-gram as the language model.

When P(x|w)P(w) in equation (1) is at its maximum, its logarithm is also at its maximum. Therefore, in speech recognition, based on Bayes' theorem in equation (1) above, the evaluation function g(w|x) of a word string w that is a sentence hypothesis (correct candidate) for the speech input x is defined as in the following equation (2). Here κ is the weight of the language score P(w) relative to the acoustic score P(x|w).
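Equation (2) is not reproduced in this text. Taking the logarithm of P(x|w)P(w) and applying the weight κ to the language score, as described above, gives the following reconstruction (the exact typesetting of the original is an assumption):

```latex
g(w \mid x) = \log P(x \mid w) + \kappa \log P(w) \tag{2}
```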

Then, as shown in the following equation (3), the word string ŵ that maximizes the evaluation function g(w|x) of equation (2) is selected from the set of correct candidate word strings w for the speech input x as the speech recognition result for x.
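Equation (3) is not reproduced in this text; from the description it presumably reads $\hat{w} = \arg\max_{w} g(w \mid x)$. As a minimal sketch of that selection step, assuming the candidates are available as a small list with precomputed probabilities (the function names, the candidate tuples, and the numeric values below are all hypothetical and chosen only for illustration):

```python
import math

def evaluation_score(acoustic_prob, language_prob, kappa):
    """g(w|x) = log P(x|w) + kappa * log P(w), the assumed form of equation (2)."""
    return math.log(acoustic_prob) + kappa * math.log(language_prob)

def best_hypothesis(candidates, kappa):
    """Return the word string with the highest g(w|x).

    candidates: list of (word_string, P(x|w), P(w)) tuples (hypothetical layout).
    """
    best = max(candidates, key=lambda c: evaluation_score(c[1], c[2], kappa))
    return best[0]

# Illustrative values only: the second candidate fits the audio slightly
# better, but the first is far more plausible to the language model.
candidates = [
    ("the weather is fine", 1e-5, 1e-3),
    ("the whether is fine", 2e-5, 1e-4),
]
```

With κ = 1 the language score dominates and the first candidate is chosen; with κ = 0 only the acoustic score counts and the second candidate is chosen, which shows how κ trades off the two knowledge sources.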

In the error correction model of the conventional method, equation (1) is modified as in the following equation (4).
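Equation (4) is not reproduced in this text. Based on the term exp Σ_i λ_i f_i(w) that the description attributes to it, a reconstruction that multiplies the Bayes score of equation (1) by that correction term is:

```latex
\hat{w} = \arg\max_{w} P(x \mid w)\,P(w)\,\exp \sum_{i} \lambda_i f_i(w) \tag{4}
```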

The term exp Σ_i λ_i f_i(w) in equation (4) is a score reflecting the error tendency of the word string w, and acts as a penalty or reward for w. Here f_i(w) (i = 1, ...) is the i-th feature function and λ_i is the weight (feature weight) of f_i(w). A feature function is defined so that, for a given word string (here, w), it returns the number of times a linguistic rule holds, and 0 if the rule does not hold. Specific examples of linguistic rules for a feature function f_i include the following.

(a) The number of consecutive word pairs (u, v) contained in the word string w
(b) The number of non-consecutive word pairs (u, v) contained in the word string w
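The two rules above can be sketched as counting functions. This is only an illustration: the text does not specify how far apart a non-consecutive pair may be, so rule (b) is assumed here to count every ordered pair with at least one intervening word, and the function names are hypothetical.

```python
from itertools import combinations

def count_contiguous_pair(words, u, v):
    """Rule (a): number of positions where u is immediately followed by v."""
    return sum(1 for a, b in zip(words, words[1:]) if (a, b) == (u, v))

def count_noncontiguous_pair(words, u, v):
    """Rule (b), under the assumption stated above: number of ordered
    position pairs (i, j) with j - i > 1, words[i] == u and words[j] == v."""
    return sum(
        1
        for i, j in combinations(range(len(words)), 2)
        if j - i > 1 and words[i] == u and words[j] == v
    )

# Illustrative word string only.
example_words = ["a", "b", "a", "c", "b"]
```

Both functions return 0 when the rule never holds, matching the definition of a feature function given above.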

On the other hand, in speech recognition, a plurality of language models are often mixed in order to improve word prediction accuracy. When language models are mixed by linear interpolation, the mixed language model is expressed as in equation (5).
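Equation (5) is not reproduced in this text. Assuming the usual form of linear interpolation, with one non-negative coefficient per component model, it can be reconstructed as:

```latex
P(w) = \sum_{n} \theta_n P_n(w) \tag{5}
```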

Here, P_n(w) is the score of the n-th language model, estimated from the text (correct sentences) of static training data serving as an information source. θ_n is a coefficient called the mixing parameter for the n-th language model (n = 1, ...), and satisfies θ_n ≥ 0 and Σ_n θ_n = 1.
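As a small sketch of linear interpolation under the two constraints just stated (the function name and the numeric values are hypothetical, and per-model probabilities are assumed to be already available):

```python
def interpolate(model_probs, thetas):
    """Linearly interpolate per-model probabilities P_n(w) with weights theta_n.

    Raises ValueError if the weights are negative or do not sum to 1,
    which are the two constraints stated in the text.
    """
    if len(model_probs) != len(thetas):
        raise ValueError("one weight is required per language model")
    if any(t < 0 for t in thetas) or abs(sum(thetas) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(t * p for t, p in zip(thetas, model_probs))

# Illustrative values: two component models, weighted 3:1.
mixed = interpolate([0.2, 0.4], [0.75, 0.25])
```

Because the weights form a convex combination, the mixed value always lies between the smallest and largest component probabilities.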

As a method of mixing language models other than linear interpolation, a log-linear model mixes the language models as shown in the following equation (6).
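Equation (6) is not reproduced in this text. Assuming the standard log-linear combination, in which each component model is raised to the power of its coefficient φ_n and the product is normalized by Z(Φ), it can be reconstructed as:

```latex
P(w) = \frac{1}{Z(\Phi)} \prod_{n} P_n(w)^{\phi_n} \tag{6}
```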

As in linear interpolation, φ_n is a coefficient called a mixing parameter, but the coefficients need not sum to 1. 1/Z(Φ) is a normalization constant. The mixing parameters are Φ = (φ_1, φ_2, ...).

If the log-linear model is chosen as the method of mixing the language models, the posterior probability P(w|x) of the word string w given the speech input x is given by the following equation (7). The model parameters are Λ = (λ_1, λ_2, ...).
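Equation (7) is not reproduced in this text. Combining the acoustic likelihood, the log-linear language model mixture, and the feature-based correction term under a single normalization constant gives the following reconstruction; whether the original also attaches a weight to the acoustic term is not recoverable from the text, so none is assumed here:

```latex
P(w \mid x) = \frac{1}{Z(\Lambda, \Phi)}\,
  P(x \mid w) \prod_{n} P_n(w)^{\phi_n}\,
  \exp \sum_{i} \lambda_i f_i(w) \tag{7}
```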

Letting the likelihood of the acoustic model be the log acoustic score h_0(x, w) given by the HMM, and the generation probability of the hypothesis word string under the n-th language model be the log language score h_n(w), equation (7) can be rewritten as the following equation (8).
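Equation (8) is not reproduced in this text. Writing h_0(x, w) = log P(x|w) and h_n(w) = log P_n(w), and moving every factor of the product form into the exponent, yields the following reconstruction (the unweighted acoustic term is an assumption carried over from that product form):

```latex
P(w \mid x) = \frac{1}{Z(\Lambda, \Phi)}
  \exp\!\left( h_0(x, w) + \sum_{n} \phi_n h_n(w) + \sum_{i} \lambda_i f_i(w) \right) \tag{8}
```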

Z(Λ, Φ) in equation (8) is a normalization constant that ensures the conditions of a probability are satisfied.
Suppose that, for a certain speech input x, a speech recognition result w^hyp and a manually corrected result (correct word string) w^ref are obtained. The speech recognition apparatus may output a plurality of speech recognition results; let W be that set. In speech recognition, the word string with the maximum posterior probability is output as the correct answer in accordance with Bayes' theorem. Therefore, the posterior probability P(w^ref|x) of the corrected result w^ref needs to be larger than the posterior probability P(w^hyp|x) of any speech recognition result w^hyp in the set W.
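The log-linear posterior described above can be sketched over a finite candidate list. This is an illustration under stated assumptions: the normalization constant is computed only over the listed candidates, the acoustic term carries no weight, and all names and numbers are hypothetical.

```python
import math

def log_linear_score(h0, lm_log_scores, features, phi, lam):
    """Unnormalized log score: h_0 + sum_n phi_n * h_n + sum_i lambda_i * f_i."""
    mixture = sum(p * h for p, h in zip(phi, lm_log_scores))
    correction = sum(l * f for l, f in zip(lam, features))
    return h0 + mixture + correction

def posteriors(scores):
    """Softmax over the candidate scores; subtracting the maximum keeps
    the exponentials numerically stable for large negative log scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Illustrative candidate: log acoustic score -10, two language models,
# two feature counts.
score = log_linear_score(-10.0, [-2.0, -4.0], [1, 0], [0.5, 0.5], [0.3, -0.2])
```

Raising the feature weight of a rule that fires only in the correct word string increases that string's score, and therefore its posterior, relative to every competing hypothesis in the list.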

Given the speech input x, the speech recognition results w^hyp, and the corrected result (correct word string) w^ref obtained from one of the speech recognition results w^hyp, the objective function L(Λ, Φ) for parameter estimation is given by the following equation (9).
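Equation (9) is not reproduced in this text, and its exact form cannot be recovered with certainty. One form consistent with the description (an objective whose minimization reduces the expected number of word errors) is the posterior-weighted edit distance over the recognition results; this is an assumption, and the original equation may differ:

```latex
L(\Lambda, \Phi) = \sum_{w^{\mathrm{hyp}} \in W}
  R\!\left(w^{\mathrm{ref}}, w^{\mathrm{hyp}}\right)
  P\!\left(w^{\mathrm{hyp}} \mid x; \Lambda, \Phi\right) \tag{9}
```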

R(w^ref, w^hyp) is a function that returns the edit distance (the total number of substitutions, insertions, and deletions) between the corrected result w^ref and the speech recognition result w^hyp. The edit distance between two word strings can be computed efficiently by dynamic programming. This edit distance represents the number of erroneous words in the speech recognition result w^hyp relative to the corrected result w^ref, which is the correct word string. The smaller the expected number of word errors, the more reliably speech recognition produces results free of recognition errors. In addition, because the constraint P(w^hyp|x) + P(w^ref|x) = 1 is assumed as the probability condition, minimizing the posterior probability P(w^hyp|x) of the speech recognition result and maximizing the posterior probability P(w^ref|x) of the corrected result allows speech recognition to produce results without recognition errors.
Therefore, if the model parameters Λ and the mixture parameters Φ are estimated so as to minimize the objective function L(Λ, Φ), the word error expected in the speech recognition results is minimized and the posterior probability of the correct word string is maximized. Word errors can then be expected to be minimized for unseen input speech as well, so that speech recognition performance improves. In other words, the objective function of Equation (9) is used as an evaluation function that computes a value indicating whether the model parameters Λ and the mixture parameters Φ are appropriate, that is, whether they minimize the recognition error expected for the candidate word strings and maximize the posterior probability of the correct word string.
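The dynamic-programming computation of the edit distance R can be sketched as follows. This is a minimal word-level Levenshtein distance with unit costs; the function name `edit_distance` is a placeholder, and the patent does not specify which variant it uses.

```python
def edit_distance(ref, hyp):
    """Return the minimum number of substitutions, insertions, and
    deletions needed to turn the sequence hyp into the sequence ref."""
    prev = list(range(len(hyp) + 1))  # distances against an empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # hyp prefix is empty: i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word missing from hyp
                            curr[j - 1] + 1,      # extra word in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]
```

For example, `edit_distance(list("ABCDEF"), list("ABFDEF"))` is 1, because a single substitution separates the two strings. The table needs time proportional to the product of the two lengths, and only two rows are kept in memory.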

Taking the gradients ΔΛ and ΔΦ of the objective function of Equation (9) with respect to the model parameters Λ and the mixture parameters Φ yields Equations (10) and (11) below. Here the gradient ΔΛ is (∂L(Λ, Φ)/∂λ_1, ∂L(Λ, Φ)/∂λ_2, ∂L(Λ, Φ)/∂λ_3, ...), and the gradient ΔΦ is (∂L(Λ, Φ)/∂φ_1, ∂L(Λ, Φ)/∂φ_2, ∂L(Λ, Φ)/∂φ_3, ...). In addition, w′ ranges over all speech recognition results w^hyp in the set W and the corrected result w^ref.

Here, the normalization constant Z(Λ, Φ) in Equation (8) is defined by Equation (12) below.

If the speech inputs to speech recognition are ..., x_{t−1}, x_t, ..., the update rules for the model parameters Λ and the mixture parameters Φ after the (t−1)-th speech input x_{t−1} has been recognized are Equations (13) and (14) below.

η_Λ in Equation (13) is a constant applied to the gradient ΔΛ, and η_Φ in Equation (14) is a constant applied to the gradient ΔΦ.

To estimate the model parameters Λ and the mixture parameters Φ robustly, values averaged over the most recent T estimates may be used instead, as in Equations (15) and (16) below.
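The update and averaging steps can be sketched as follows. Equations (13) to (16) are not reproduced in this text, so this assumes a plain gradient-descent step (the parameter minus a constant times the gradient, since the objective is minimized) and an unweighted mean over the most recent T estimates. The class name `AveragedParameter` and its arguments are hypothetical.

```python
from collections import deque

class AveragedParameter:
    """Parameter vector updated by gradient steps, with the most recent
    `window` estimates kept for averaging (assumed form of the update)."""

    def __init__(self, initial, step_size, window):
        self.value = list(initial)
        self.step_size = step_size  # plays the role of eta
        self.history = deque([list(initial)], maxlen=window)

    def update(self, gradient):
        # One descent step: move against the gradient of the objective.
        self.value = [v - self.step_size * g
                      for v, g in zip(self.value, gradient)]
        self.history.append(list(self.value))
        return self.value

    def averaged(self):
        # Component-wise mean of the stored recent estimates.
        n = len(self.history)
        return [sum(column) / n for column in zip(*self.history)]
```

One instance would hold Λ and another Φ, each with its own step size. Averaging damps the effect of a single noisy utterance on the stored model.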

When recognition results and manually corrected word strings are obtained one after another from speech recognition, applying Equations (13) and (14) updates the estimates of the model parameters Λ and the mixture parameters Φ of the error correction model of Equation (8) as each utterance is acquired. An error correction model suited to the content of the utterances being recognized is thereby learned.

[3. Configuration of the speech recognition apparatus]
FIG. 2 is a functional block diagram showing the configuration of the speech recognition apparatus 1 according to an embodiment of the present invention; only the functional blocks relevant to the invention are shown.
The speech recognition apparatus 1 is implemented by a computer device and, as shown in the figure, comprises a speech recognition unit 11, an error correction unit 12, an alignment unit 13, a feature amount extraction unit 14, a model parameter learning unit 15, an acoustic model storage unit 21, a language model storage unit 22, and an error correction model storage unit 23.

The acoustic model storage unit 21 stores an acoustic model. The language model storage unit 22 stores a language model. The error correction model storage unit 23 stores the error correction model expressed by Equations (8) and (12).

The speech recognition unit 11 performs real-time speech recognition on the input speech data D1 using the acoustic model stored in the acoustic model storage unit 21, the language model stored in the language model storage unit 22, and the error correction model stored in the error correction model storage unit 23. The input speech data D1 is speech data representing features obtained by short-time spectral analysis of the speech waveform of an utterance. The speech recognition unit 11 outputs speech recognition result data D2, which indicates the speech recognition result for the input speech data D1, to the error correction unit 12 and the feature amount extraction unit 14. The speech recognition result is either the top-ranked correct sentence candidate or a plurality of correct sentence candidates; this embodiment is described on the assumption that a plurality of candidates is obtained. The speech recognition result data D2 also carries the time at which each word of a correct sentence candidate was uttered, meaning the interval from the start to the end of the word.

The error correction unit 12 obtains a correct word string produced by manually correcting the top-ranked correct sentence candidate indicated by the speech recognition result data D2. The error correction unit 12 outputs correct word string data D3, which indicates the correct word string, to the alignment unit 13.

The alignment unit 13 uses the input speech data D1 to identify the time at which each word of the correct word string indicated by the correct word string data D3 was uttered. The alignment unit 13 adds the identified utterance time of each word to the correct word string data D3 and arranges the words in order of utterance time.

The feature amount extraction unit 14 extracts linguistic features from the correct sentence candidates indicated by the speech recognition result data D2 and the correct word string indicated by the correct word string data D3. The linguistic features extracted here are the linguistic rules described earlier and are defined as feature functions.

The model parameter learning unit 15 takes as input the feature functions extracted by the feature amount extraction unit 14, the correct sentence candidates indicated by the speech recognition result data D2, and the correct word string indicated by the correct word string data D3, and statistically learns the model parameters Λ of the error correction model and the mixture parameters Φ of the language models. The model parameter learning unit 15 updates the error correction model currently stored in the error correction model storage unit 23 to an error correction model that uses the learned model parameters Λ and mixture parameters Φ.

The speech recognition unit 11 then performs speech recognition on new input speech data D1 using the acoustic model, the language model, and the updated error correction model. The speech recognition apparatus 1 carries out this series of error correction model update processes every time the speech recognition unit 11 obtains a speech recognition result for input speech data D1.

[4. Processing procedure of the speech recognition apparatus]
FIG. 3 shows the overall processing flow of the speech recognition apparatus 1 according to this embodiment. The processing of each step shown in the figure is described below. Before this processing is executed, the error correction model storage unit 23 of the speech recognition apparatus 1 stores, as an initial value, an error correction model determined from static training data. Here the error correction model is represented by model parameter data D4, which indicates the values of the model parameters Λ, and mixture parameter data D5, which indicates the values of the mixture parameters Φ.

[4.1 Step S1]
The speech recognition unit 11 performs speech recognition on the input speech data D1, computing the acoustic score of each correct sentence candidate from the acoustic model stored in the acoustic model storage unit 21 and the language score from the language model stored in the language model storage unit 22. After also computing the feature function values of each correct sentence candidate, the speech recognition unit 11 computes the speech recognition score using the error correction model (Equations (8) and (12)) currently stored in the error correction model storage unit 23. The speech recognition unit 11 outputs speech recognition result data D2 indicating a plurality of correct sentence candidates w^hyp ranked in order of likelihood according to the computed scores.

[4.2 Step S2]
The error correction unit 12 corrects the top-ranked correct sentence candidate w^hyp indicated by the speech recognition result data D2 into the correct word string w^ref, following correction instructions entered manually by the user through an input means (not shown). The error correction unit 12 outputs correct word string data D3, which indicates the correct word string w^ref obtained by the correction, to the alignment unit 13.

FIG. 4 shows the relationship between the correct word string w^ref and the correct sentence candidates w^hyp of the speech recognition result. As shown in the figure, the speech recognition result data D2 holds the correct sentence candidates w^hyp in order of likelihood rank n (n = 1, 2, ...). In the figure, the error correction unit 12 corrects the n = 1 correct sentence candidate w^hyp "ABFDEF" to obtain the correct word string w^ref "ABCDEF".

[4.3 Step S3]
In FIG. 3, the alignment unit 13 uses the correct word string data D3 and the input speech data D1 to identify, by an existing technique, the time at which each word of the correct word string w^ref was uttered. The alignment unit 13 adds the identified utterance time of each word to the correct word string data D3 and arranges the words in order of utterance time.

[4.4 Step S4]
The feature amount extraction unit 14 extracts feature functions based on linguistic features, to be used for learning the parameters of the error correction model, from the word string of each correct sentence candidate w^hyp indicated by the speech recognition result data D2 and the word string of the aligned correct word string w^ref indicated by the correct word string data D3. The rules for the feature functions are linguistic features such as consecutive words within the same utterance, the phonemes making up a word, two or more non-consecutive words, co-occurrence relations between phonemes, and syntactic or semantic information about words.

In this embodiment, the feature amount extraction unit 14 extracts, for example, the following (a) and (b) as feature functions based on word co-occurrence relations.

(a) A function that returns the number of occurrences of a consecutive word pair (u, v) in the word string
(b) A function that returns the number of occurrences of a non-consecutive word pair (u, v) in the word string
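Feature functions (a) and (b) can be sketched as counting routines. The text does not say whether a non-consecutive pair is ordered or whether its span is limited, so this sketch assumes ordered pairs with at least one intervening word and no distance limit; the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def contiguous_pairs(words):
    """(a) Count each adjacent word pair (u, v) in the word string."""
    return Counter(zip(words, words[1:]))

def noncontiguous_pairs(words):
    """(b) Count each ordered pair (u, v) where v follows u with at
    least one word in between (assumed reading of 'non-consecutive')."""
    return Counter((words[i], words[j])
                   for i, j in combinations(range(len(words)), 2)
                   if j - i > 1)
```

The value of a particular feature function for a pair (u, v) is then the entry looked up in the returned counter, which is 0 when the pair does not occur.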

The feature amount extraction unit 14 also replaces each word of the word string with its part-of-speech category, such as noun or verb, and extracts, for example, the following (c) and (d) as feature functions based on syntactic information. Here c(·) is a function that maps a word to its part of speech.

(c) A function that returns the number of occurrences of a consecutive part-of-speech pair (c(u), c(v)) in the word string
(d) A function that returns the number of occurrences of a non-consecutive part-of-speech pair (c(u), c(v)) in the word string

Alternatively, the feature amount extraction unit 14 replaces each word of the word string with a category representing its meaning (a semantic category) and extracts, for example, the following (e) and (f) as feature functions based on semantic information. Semantic categories can be obtained from, for example, a thesaurus held in a database external to the speech recognition apparatus 1 or in an internal storage means (not shown). Here s(·) is a function that maps a word to its semantic category.

(e) A function that returns the number of occurrences of a consecutive semantic-category pair (s(u), s(v)) in the word string
(f) A function that returns the number of occurrences of a non-consecutive semantic-category pair (s(u), s(v)) in the word string
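The category-based feature functions (c) to (f) differ from the word-based ones only in that each word is first passed through a mapping such as c(·) or s(·). A minimal sketch for the consecutive case follows; the function name and the toy part-of-speech table in the usage note are hypothetical, and a real mapping would come from a tagger or thesaurus.

```python
from collections import Counter

def category_pair_counts(words, to_category):
    """Map each word to its category, then count adjacent category pairs."""
    categories = [to_category(w) for w in words]
    return Counter(zip(categories, categories[1:]))
```

For instance, with `pos = {"the": "DET", "cat": "NOUN", "dog": "NOUN", "sees": "VERB"}`, calling `category_pair_counts(["the", "cat", "sees", "the", "dog"], pos.get)` counts the pair ("DET", "NOUN") twice. Mapping to categories lets one feature cover many word pairs, which helps when individual word pairs are rare.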

The feature amount extraction unit 14 also extracts, for example, the following (g) as a feature function relating to phoneme strings.

(g) A function that returns the number of occurrences of a phoneme string q in the word string

The feature amount extraction unit 14 extracts all feature functions conforming to the above rules from each correct sentence candidate w^hyp indicated by the speech recognition result data D2 and from the correct word string w^ref indicated by the correct word string data D3, and counts how often each extracted feature function occurs. It then selects the feature functions whose occurrence count is at or above a predetermined threshold as the feature functions f_i used for parameter learning of the error correction model, and notifies the model parameter learning unit 15 of them.
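The frequency-based selection can be sketched as follows. The names `select_features`, `extract`, and `min_count` are hypothetical; `extract` stands for any routine that returns a counter of feature occurrences for one word string, and the threshold value is not given in the text.

```python
from collections import Counter

def select_features(word_strings, extract, min_count):
    """Sum feature occurrences over all word strings and keep those
    that occur at least min_count times."""
    totals = Counter()
    for words in word_strings:
        totals.update(extract(words))  # adds counts, does not overwrite
    return sorted(key for key, count in totals.items() if count >= min_count)
```

Called on the two strings "ABFDEF" and "ABCDEF" with an adjacent-pair extractor and `min_count=2`, only the pairs shared by both strings survive. Dropping rare features keeps the parameter vector Λ small and avoids fitting weights to one-off events.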

[4.5 Step S5]
Next, the model parameter learning unit 15 learns the model parameters Λ and the mixture parameters Φ of the error correction model.

FIG. 5 shows the flow of the parameter learning process executed by the model parameter learning unit 15 in step S5.

(Step S50: Correct sentence candidate selection process)
The model parameter learning unit 15 starts with n = 1 and selects the n-th correct sentence candidate w^hyp from among the correct sentence candidates w^hyp indicated by the speech recognition result data D2.

(Step S51: Score calculation process)
The model parameter learning unit 15 computes the acoustic scores and language scores of the selected correct sentence candidate w^hyp and of the correct word string w^ref indicated by the correct word string data D3.
Specifically, the model parameter learning unit 15 refers to the acoustic model stored in the acoustic model storage unit 21 and computes the log acoustic score h_0(x, w^hyp) of the correct sentence candidate w^hyp and the log acoustic score h_0(x, w^ref) of the correct word string w^ref. These correspond to the log acoustic score h_0(x, w) in Equation (8). When the acoustic scores are computed, the portions of the input speech data D1 identified by the time information attached to each word of the correct sentence candidate w^hyp and the correct word string w^ref are used.

The model parameter learning unit 15 further refers to the language model stored in the language model storage unit 22 and computes, for each language model, the log language score h_n(w^hyp) of the correct sentence candidate w^hyp and the log language score h_n(w^ref) of the correct word string w^ref (n = 1, ...). These correspond to the log language score h_n(w) in Equation (8).
The log acoustic score h_0(x, w^ref) and the log language score h_n(w^ref) of the correct word string w^ref need to be computed only in the first iteration of the loop.

(Step S52: Posterior probability calculation process)
The model parameter learning unit 15 computes the values f_i(w^hyp) and f_i(w^ref) of the feature functions f_i determined by the feature amount extraction unit 14 from each correct sentence candidate w^hyp and from the correct word string w^ref, respectively. It then takes each correct sentence candidate w^hyp as the word string w and computes the posterior probability P(w^hyp|x) by Equation (8) from the acoustic and language scores computed in step S51 and the computed feature function values. The values of the model parameters Λ and the mixture parameters Φ used in Equation (8) are those of the current error correction model. The value f_i(w^ref) of the feature function f_i for the correct word string w^ref needs to be computed only in the first iteration of the loop.

(Step S53: Edit distance calculation process)
The model parameter learning unit 15 compares the selected correct sentence candidate w^hyp with the correct word string w^ref and computes the edit distance R(w^ref, w^hyp) by dynamic programming.

(Step S54: Loop termination determination process)
The model parameter learning unit 15 determines whether all correct sentence candidates w^hyp indicated by the speech recognition result data D2 have been selected. If it determines that unselected correct sentence candidates w^hyp remain, it adds 1 to the current value of n and repeats the processing from step S50; if it determines that all have been selected, it executes step S55.

(Step S55: Gradient calculation process)
The model parameter learning unit 15 uses the current values of the model parameters Λ and the mixture parameters Φ to compute, by Equations (10) and (11), the gradients ΔΛ and ΔΦ of Equation (9) with respect to Λ and Φ. For the edit distance R(w^ref, w^hyp) in Equations (10) and (11) it uses the value computed in step S53, and for the posterior probability P(w^hyp|x) it uses the value computed in step S52. For the log language scores h_n(w^hyp) and h_n(w′) it uses the values of h_n(w^hyp) and h_n(w^ref) computed in step S51, and for the feature functions f_i(w^hyp) and f_i(w′) it uses the values of f_i(w^hyp) and f_i(w^ref) computed in step S52. The model parameter learning unit 15 computes the language score P(w′) by Equation (6) from the values of the log language scores h_n(w^hyp) and h_n(w^ref) computed in step S51.
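How posteriors, edit distances, and feature values combine into a gradient can be sketched as follows. Equations (9) to (11) are not reproduced in this text, so the sketch assumes, purely for illustration, an objective equal to the expected edit distance over the candidate set under a log-linear posterior; the actual objective may contain further terms. The function name and argument layout are hypothetical.

```python
import math

def expected_error_gradient(scores, errors, features):
    """Return the expected error and its gradient with respect to the
    feature weights, for candidates with the given log-linear scores.

    scores[k]   : combined log score of candidate k
    errors[k]   : edit distance of candidate k to the reference
    features[k] : list of feature values f_i for candidate k
    """
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    post = [e / z for e in exps]  # posterior of each candidate
    expected_error = sum(p * r for p, r in zip(post, errors))
    grad = []
    for i in range(len(features[0])):
        mean_f = sum(p * f[i] for p, f in zip(post, features))
        # d/d(lambda_i) of sum_k R_k P_k = sum_k P_k R_k (f_i(k) - E[f_i])
        grad.append(sum(p * r * (f[i] - mean_f)
                        for p, r, f in zip(post, errors, features)))
    return expected_error, grad
```

Under the same assumption, the gradient with respect to a mixture parameter φ_n has the same form with the log language scores h_n in place of the feature values. A feature that is more active in high-error candidates than on average receives a positive gradient, so a descent step lowers its weight.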

(Step S56: Parameter update process)
The model parameter learning unit 15 uses the gradients ΔΛ and ΔΦ obtained in step S55 to update the model parameters Λ and the mixture parameters Φ by Equations (13) and (14), or by Equations (15) and (16). Predetermined values are used for the coefficients η_Λ and η_Φ in Equations (13) and (14).
The model parameter learning unit 15 then replaces the model parameter data D4 and the mixture parameter data D5 currently stored in the error correction model storage unit 23 with model parameter data D4 indicating the updated values of the model parameters Λ and mixture parameter data D5 indicating the updated values of the mixture parameters Φ.

Each time the next input speech data D1 is input, the speech recognition apparatus 1 repeats the processing from step S1 of FIG. 3.

[5. Effects]
With the speech recognition apparatus 1 of this embodiment, an error correction model can be built that successively reflects, from the speech recognition results, information on the topic for which the recognition rate is to be improved. This resolves the mismatch between the training data and the utterance content, optimizes the error correction model used in speech recognition for the utterance content, and reduces recognition errors compared with conventional techniques.
In addition, because the speech recognition apparatus 1 of this embodiment estimates the mixture parameters of the plurality of language models at the same time as the model parameters of the error correction model, recognition errors can be reduced compared with conventional techniques.

[6. Others]
The speech recognition apparatus 1 described above contains a computer system. The operating procedure of the speech recognition apparatus 1 is stored in the form of a program on a computer-readable recording medium, and the above processing is performed when the computer system reads and executes this program. The computer system here includes a CPU, various memories, an OS, and hardware such as peripheral devices.

If a WWW system is used, the "computer system" also includes a homepage providing environment (or display environment).
The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. The "computer-readable recording medium" further includes media that hold a program dynamically for a short time, such as a communication line used when a program is transmitted over a network such as the Internet or a communication line such as a telephone line, and media that hold a program for a certain period, such as the volatile memory inside a computer system acting as the server or client in that case. The program may implement only part of the functions described above, or may implement those functions in combination with a program already recorded in the computer system.

DESCRIPTION OF SYMBOLS
1 Speech recognition apparatus
11 Speech recognition unit
12 Error correction unit
13 Alignment unit
14 Feature amount extraction unit
15 Model parameter learning unit
21 Acoustic model storage unit
22 Language model storage unit
23 Error correction model storage unit

Claims

A speech recognition apparatus comprising:
an error correction model storage unit that stores an error correction model, the error correction model being an expression for calculating a speech recognition score using a value obtained by correcting, according to weighted linguistic features, a language score obtained by mixing a plurality of language models according to mixture weights;
a speech recognition unit that performs speech recognition on input speech data using the error correction model stored in the error correction model storage unit and outputs correct sentence candidates obtained as a result of the speech recognition;
an error correction unit that corrects a correct sentence candidate output from the speech recognition unit according to user input and generates a correct word string;
an alignment unit that aligns the words included in the correct word string generated by the error correction unit in time order based on the speech data;
a feature amount extraction unit that extracts linguistic features from the correct sentence candidates and the aligned correct word string; and
a model parameter learning unit that statistically calculates the weights of the linguistic features and the mixture weights of the language models based on the linguistic features extracted by the feature amount extraction unit and on the acoustic scores and language scores of the correct sentence candidates and the aligned correct word string, and updates the error correction model stored in the error correction model storage unit to an error correction model that uses the calculated weights of the linguistic features and the calculated mixture weights of the language models.

The speech recognition apparatus according to claim 1, wherein the model parameter learning unit statistically calculates the weights of the linguistic features and the mixture weights of the language models, based on an evaluation value calculated by an evaluation function defined using the recognition errors of the correct sentence candidate, obtained by comparison with the correct word string, and the speech recognition score of the correct sentence candidate obtained by the error correction model, such that the posterior probability of the correct word string is maximized or the recognition errors of the correct sentence candidate are minimized.
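
As a rough illustration of the two criteria named in the preceding claim, the sketch below computes posterior probabilities from recognition scores and an expected recognition error. It assumes that the recognizer yields several candidates (an N-best list), that posteriors are a softmax over their scores, and that recognition errors are counted as word-level edit distance; none of these choices is stated in the claim, and all function names are hypothetical.

```python
import math


def word_errors(candidate, reference):
    """Word-level edit distance (substitutions, insertions, deletions)."""
    prev = list(range(len(reference) + 1))
    for i, c in enumerate(candidate, 1):
        cur = [i]
        for j, r in enumerate(reference, 1):
            cur.append(min(prev[j] + 1,              # extra word in the candidate
                           cur[j - 1] + 1,           # reference word missing
                           prev[j - 1] + (c != r)))  # match or substitution
        prev = cur
    return prev[-1]


def posteriors(scores):
    """Softmax over candidate scores (assumed posterior model)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def expected_error(scores, errors):
    """Evaluation value: recognition errors weighted by candidate posteriors."""
    return sum(p * e for p, e in zip(posteriors(scores), errors))
```

Minimizing `expected_error` with respect to the model parameters corresponds to the error-minimizing alternative, while maximizing the posterior entry of the correct word string corresponds to the other.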

The speech recognition apparatus according to claim 1 or 2, wherein, each time the speech recognition unit performs speech recognition on speech data, the model parameter learning unit calculates the weights of the linguistic features and the mixture weights of the language models, and sequentially updates the error correction model stored in the error correction model storage unit to an error correction model using the calculated weights of the linguistic features and the calculated mixture weights of the language models.
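
The sequential update recited in the preceding claim could, for example, take the form of one gradient step per recognized utterance. The sketch below is an assumption-laden illustration: it updates only the feature weights, treats the candidate at `ref_index` as the correct word string, uses a softmax posterior, and takes a fixed learning rate `lr`. The mixture weights would need an analogous step that is omitted here, and all names are hypothetical.

```python
import math


def update_feature_weights(weights, base_scores, feature_sets, ref_index, lr=0.1):
    """One gradient-ascent step on the log posterior of the correct word string.

    base_scores: acoustic plus mixed language score of each candidate.
    feature_sets: one {feature name: value} dict per candidate.
    Returns a new weight dict; the input dict is left unchanged.
    """
    scores = [b + sum(weights.get(k, 0.0) * v for k, v in feats.items())
              for b, feats in zip(base_scores, feature_sets)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    post = [e / z for e in exps]

    updated = dict(weights)
    # Gradient term 1: features observed in the correct word string.
    for k, v in feature_sets[ref_index].items():
        updated[k] = updated.get(k, 0.0) + lr * v
    # Gradient term 2: minus the expected features under the current posterior.
    for p, feats in zip(post, feature_sets):
        for k, v in feats.items():
            updated[k] = updated.get(k, 0.0) - lr * p * v
    return updated
```

Calling this function once after every utterance, and storing the returned weights, mirrors the per-utterance update of the stored model.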

The speech recognition apparatus according to any one of claims 1 to 3, wherein the feature amount extraction unit extracts the linguistic features based on consecutive words, phonemes constituting the words, a plurality of non-consecutive words, co-occurrence relationships between phonemes, syntactic information of the words, or semantic information of the words.
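
Two of the feature types listed in the preceding claim, consecutive words and non-consecutive words, can be illustrated with simple counts. This is a minimal sketch: the feature keys, the `max_gap` window, and the function name `extract_features` are hypothetical, and the phoneme-based, syntactic, and semantic features are not covered.

```python
from collections import Counter
from itertools import combinations


def extract_features(words, max_gap=3):
    """Count unigram, bigram, and skip-pair features in a word string."""
    feats = Counter()
    for w in words:
        feats[("uni", w)] += 1                      # single words
    for a, b in zip(words, words[1:]):
        feats[("bi", a, b)] += 1                    # consecutive word pairs
    for i, j in combinations(range(len(words)), 2):
        if 1 < j - i <= max_gap:                    # non-consecutive pairs within the window
            feats[("skip", words[i], words[j])] += 1
    return feats
```

Applying the same function to a correct sentence candidate and to the correct word string gives comparable feature counts for both.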

An error correction model learning method comprising:
an error correction model storage process of storing an error correction model, the error correction model being an expression for calculating a speech recognition score using a value obtained by correcting, with weighted linguistic features, a language score obtained by mixing a plurality of language models according to mixture weights;
a speech recognition process of performing speech recognition on input speech data using the error correction model stored in the error correction model storage process, and outputting a correct sentence candidate obtained as a result of the speech recognition;
an error correction process of correcting the correct sentence candidate output in the speech recognition process according to a user input, and generating a correct word string;
an alignment process of aligning each word included in the correct word string generated in the error correction process in time order based on the speech data;
a feature amount extraction process of extracting linguistic features from the correct sentence candidate and the aligned correct word string; and
a model parameter learning process of statistically calculating the weights of the linguistic features and the mixture weights of the language models, based on the linguistic features extracted in the feature amount extraction process and on the acoustic scores and language scores of the correct sentence candidate and the aligned correct word string, and updating the currently stored error correction model to an error correction model using the calculated weights of the linguistic features and the calculated mixture weights of the language models.

A program for causing a computer to function as a speech recognition apparatus comprising:
error correction model storage means for storing an error correction model, the error correction model being an expression for calculating a speech recognition score using a value obtained by correcting, with weighted linguistic features, a language score obtained by mixing a plurality of language models according to mixture weights;
speech recognition means for performing speech recognition on input speech data using the error correction model stored in the error correction model storage means, and outputting a correct sentence candidate obtained as a result of the speech recognition;
error correction means for correcting the correct sentence candidate output from the speech recognition means according to a user input, and generating a correct word string;
alignment means for aligning each word included in the correct word string generated by the error correction means in time order based on the speech data;
feature amount extraction means for extracting linguistic features from the correct sentence candidate and the aligned correct word string; and
model parameter learning means for statistically calculating the weights of the linguistic features and the mixture weights of the language models, based on the linguistic features extracted by the feature amount extraction means and on the acoustic scores and language scores of the correct sentence candidate and the aligned correct word string, and for updating the error correction model stored in the error correction model storage means to an error correction model using the calculated weights of the linguistic features and the calculated mixture weights of the language models.