JP2014074732A

JP2014074732A - Voice recognition device, error correction model learning method and program

Info

Publication number: JP2014074732A
Application number: JP2012220426A
Authority: JP
Inventors: Akio Kobayashi; 彰夫小林
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2012-10-02
Filing date: 2012-10-02
Publication date: 2014-04-24
Anticipated expiration: 2032-10-02
Also published as: JP6031316B2

Abstract

PROBLEM TO BE SOLVED: To learn an error correction model used for speech recognition while keeping the learning cost low. SOLUTION: A spoken language resource storage unit 21 stores speech data of utterances by a specific speaker and text data giving the correct sentences for those utterances, and a statistical model storage unit 31 stores an acoustic model of the specific speaker and topic-specific language models. A recognition error generation unit 4 performs speech recognition on the speech data stored in the spoken language resource storage unit 21, using the acoustic model of the specific speaker and the language model of a specific topic stored in the statistical model storage unit 31, and generates speech recognition results that contain recognition errors. An error correction model learning unit 5 statistically analyzes the tendency of the recognition errors from those speech recognition results and the correct sentences given by the text data, and generates an error correction model. A model integration unit 6 generates an integrated model by integrating the error correction model generated by the error correction model learning unit 5 with a plurality of the language models stored in the statistical model storage unit 31.

Description

The present invention relates to a speech recognition device, an error correction model learning method, and a program.

For error correction in speech recognition, there is a technique that statistically learns the error tendency of speech recognition from speech and its transcriptions (correct sentences) using linguistic features, and improves speech recognition performance by using the resulting statistical error correction model (see, for example, Non-Patent Document 1).

Non-Patent Document 1: Kobayashi et al., "News speech recognition by discriminative scoring based on word error minimization," IEICE Journal, vol. J93-D, no. 5, 2010, pp. 598-609.

Speech recognition predicts words using statistical language models, and several statistical language models are often integrated to improve word prediction performance. When an error correction model is learned with the technique of Non-Patent Document 1, speech recognition results containing recognition errors serve as the learning data from which the error tendency is learned, and this learning data is generated by speech recognition that uses statistical language models integrated according to different criteria. Consequently, whenever the integration method changes, the speech recognition results used as learning data must be generated again. With such a learning method, both generating the speech recognition results and learning the error correction model are costly. It is therefore not an efficient learning method when error correction models must be generated, and speech recognized, for a variety of topics and speakers.

The present invention has been made in view of these circumstances, and provides a speech recognition device, an error correction model learning method, and a program that can learn an error correction model used for speech recognition while keeping costs down.

[1] One aspect of the present invention is a speech recognition device comprising: a spoken language resource storage unit that stores speech data of utterances by a specific speaker and text data giving the correct sentences corresponding to the speech data; a statistical model storage unit that stores an acoustic model of the specific speaker and topic-specific language models; a recognition error generation unit that performs speech recognition on the speech data using the acoustic model of the specific speaker and the language model of a specific topic, and generates speech recognition results containing recognition errors; an error correction model learning unit that statistically analyzes the tendency of recognition errors from the speech recognition results generated by the recognition error generation unit and the correct sentences indicated by the text data, and generates an error correction model that corrects the analyzed tendency; and a model integration unit that performs speech recognition, using the acoustic model of the specific speaker and a plurality of the language models, on speech data different from the speech data used to generate the error correction model, statistically calculates mixture weights for integrating the plurality of language models into the error correction model generated by the error correction model learning unit, based on the recognition errors obtained by comparing the correct sentence candidates obtained by that speech recognition with the correct sentences corresponding to the different speech data, and generates an integrated model by integrating the language models of topics different from the specific topic into the error correction model according to the calculated mixture weights.
According to this aspect, the speech recognition device recognizes the speech data of a specific speaker using the acoustic model of that speaker and the language model of a specific topic, and generates speech recognition results containing recognition errors as learning data. The device statistically analyzes the tendency of recognition errors from this learning data and the correct sentences for the speech data to generate an error correction model, and then integrates the generated error correction model with the language models of topics other than the specific topic to generate an integrated model used for speech recognition of the specific speaker and the specific topic.
As a result, the speech recognition results used as learning data need not be regenerated each time the language model integration method changes, and the integrated model can be learned efficiently.

[2] One aspect of the present invention is the speech recognition device described above, wherein the model integration unit statistically calculates the mixture weights based on an evaluation value computed by an evaluation function that is defined using the recognition errors obtained from the correct sentence candidates, the acoustic score of each correct sentence candidate and its language score with the recognition error tendency corrected, both obtained with the error correction model, and the language score of each correct sentence candidate obtained from the language models of topics different from the specific topic.
According to this aspect, the speech recognition device calculates the mixture weights of the language models so that the evaluation value indicates the fewest recognition errors. The evaluation value is computed by an evaluation function defined from the recognition errors of the words in the correct sentence candidates (taking the text data corresponding to the speech data as the correct sentence), the acoustic score and error-corrected language score of the candidates obtained with the error correction model, and the language score of the candidates obtained from the language models of topics different from the specific topic. The device then integrates the error correction model and the plurality of language models using the calculated mixture weights.
As a result, after generating an error correction model suited to recognizing a specific speaker's utterances on a specific topic, the speech recognition device can determine the mixture weights for integrating the language models of other topics so that the recognition rate increases.

[3] One aspect of the present invention is the speech recognition device described above, wherein the recognition error generation unit synthesizes speech data from the utterance content of the text data corresponding to the specific topic using the acoustic model of the specific speaker, performs speech recognition on the synthesized speech data using the acoustic model of the specific speaker and the language model of the specific topic, and generates speech recognition results containing recognition errors.
According to this aspect, the speech recognition device generates speech data of the specific speaker by speech synthesis from text data on the specific topic, and generates an error correction model from the speech recognition results for the generated speech data and the correct sentences indicated by the text data.
As a result, the speech recognition device can generate an error correction model from text data on the specific topic even when the amount of speech data from the specific speaker is not statistically sufficient.

[4] One aspect of the present invention is the speech recognition device described above, wherein the error correction model is defined using feature functions and their feature weights, the feature functions representing linguistic features based on consecutive words, the phonemes constituting words, a plurality of non-consecutive words, co-occurrence relations between phonemes, syntactic information about words, or semantic information about words, and wherein the error correction model learning unit statistically calculates the feature weights based on an evaluation value computed by an evaluation function that is defined using the feature function values obtained from the speech recognition results and the recognition errors contained in those results, and generates the error correction model using the calculated feature weights.
According to this aspect, the error correction model is defined by feature functions, which represent linguistic features based on words, phonemes, and the like, and by their feature weights. The speech recognition device determines the feature weights so that the evaluation value, computed by an evaluation function defined using the feature function values obtained from the speech recognition results and the recognition errors, indicates the fewest recognition errors, and thereby generates the error correction model.
As a result, the speech recognition device can generate an error correction model suited to recognizing a specific speaker's utterances on a specific topic, and then integrate the error tendencies for other topics.

[5] One aspect of the present invention is the speech recognition device described above, further comprising a speech recognition unit that performs speech recognition on speech data of utterances by the specific speaker on the specific topic, using the integrated model generated by the model integration unit.
According to this aspect, the speech recognition device performs speech recognition based on the integrated model learned for the specific speaker and the specific topic.
As a result, the speech recognition device can obtain speech recognition results with a high recognition rate for the specific speaker's utterances on the specific topic.

[6] One aspect of the present invention is an error correction model learning method comprising: a spoken language resource storage step of storing speech data of utterances by a specific speaker and text data giving the correct sentences corresponding to the speech data; a statistical model storage step of storing an acoustic model of the specific speaker and topic-specific language models; a recognition error generation step of performing speech recognition on the speech data using the acoustic model of the specific speaker and the language model of a specific topic, and generating speech recognition results containing recognition errors; an error correction model learning step of statistically analyzing the tendency of recognition errors from the speech recognition results generated in the recognition error generation step and the correct sentences indicated by the text data, and generating an error correction model that corrects the analyzed tendency; and a model integration step of performing speech recognition, using the acoustic model of the specific speaker and a plurality of the language models, on speech data different from the speech data used to generate the error correction model, statistically calculating mixture weights for integrating the plurality of language models into the error correction model generated in the error correction model learning step, based on the recognition errors obtained by comparing the correct sentence candidates obtained by that speech recognition with the correct sentences corresponding to the different speech data, and generating an integrated model by integrating the language models of topics different from the specific topic into the error correction model according to the calculated mixture weights.

[7] One aspect of the present invention is a program for causing a computer to function as a speech recognition device comprising: spoken language resource storage means for storing speech data of utterances by a specific speaker and text data giving the correct sentences corresponding to the speech data; statistical model storage means for storing an acoustic model of the specific speaker and topic-specific language models; recognition error generation means for performing speech recognition on the speech data using the acoustic model of the specific speaker and the language model of a specific topic, and generating speech recognition results containing recognition errors; error correction model learning means for statistically analyzing the tendency of recognition errors from the speech recognition results generated by the recognition error generation means and the correct sentences indicated by the text data, and generating an error correction model that corrects the analyzed tendency; and model integration means for performing speech recognition, using the acoustic model of the specific speaker and a plurality of the language models, on speech data different from the speech data used to generate the error correction model, statistically calculating mixture weights for integrating the plurality of language models into the error correction model generated by the error correction model learning means, based on the recognition errors obtained by comparing the correct sentence candidates obtained by that speech recognition with the correct sentences corresponding to the different speech data, and generating an integrated model by integrating the language models of topics different from the specific topic into the error correction model according to the calculated mixture weights.

ADVANTAGE OF THE INVENTION: According to the present invention, the error correction model used for speech recognition can be learned while keeping the learning cost low. Error correction models for a variety of topics and speakers can therefore be learned efficiently.

FIG. 1 is a diagram showing the procedure of integrated model learning in a speech recognition device according to an embodiment of the present invention.
FIG. 2 is a diagram showing an overview of the processing flow of the integrated model learning process of the speech recognition device according to the embodiment.
FIG. 3 is a functional block diagram showing the configuration of the speech recognition device according to the embodiment.
FIG. 4 is a diagram showing the processing flow of the process for generating pseudo speech recognition results according to the embodiment.
FIG. 5 is a diagram showing the processing flow of the error correction model learning process according to the embodiment.
FIG. 6 is a diagram showing the processing flow of the model integration process according to the embodiment.
FIG. 7 is a diagram showing the procedure of error correction model learning by the conventional method.

Embodiments of the present invention are described in detail below with reference to the drawings.

[1. Overview of this embodiment]
A so-called error correction model, which reflects the error tendency of speech recognition, requires a statistical acoustic model (hereinafter "acoustic model") and a statistical language model (hereinafter "language model") in addition to speech and text data in order to learn the error tendency. A conventional speech recognition device recognizes speech data using these two statistical models and generates hypotheses (speech recognition results) that contain recognition errors. The language model used to generate the hypotheses is often not a single model learned from one text set but a combination of language models learned separately from several different text sets. In the conventional method, the language models are integrated first, and an error correction model matched to the integrated language model is learned afterwards. With this learning order, however, hypotheses containing recognition errors must be regenerated for error correction model learning every time the language models are integrated in a different combination, which is inefficient in practice.

The speech recognition device of this embodiment therefore reverses the conventional order of language model integration and error correction model learning. It first generates hypotheses that reflect the error tendency of speech recognition using only one specific language model and one acoustic model, and learns an error correction model from those hypotheses. It then integrates the learned error correction model with a plurality of language models learned from other text data. By separating the learning of the error correction model from the integration of language models trained under different conditions, the device learns the error correction model efficiently. With an error correction model generated efficiently in this way, the device aims to improve recognition performance for a variety of speakers and topics.

[2. Outline of processing of the speech recognition device]
Next, an outline of the processing performed by the speech recognition device according to an embodiment of the present invention is described.
As described above, the speech recognition device of this embodiment learns an error correction model with a specific acoustic model and language model, and then integrates the learned error correction model with a plurality of language models that perform word prediction. Hereinafter, an error correction model integrated with a plurality of language models is called an integrated model. By adjusting the mixture weights among the language models at integration time, the device obtains an integrated model matched to the topic and speaker to be recognized.

[2.1 Error correction model of the conventional method]
According to Bayes' theorem, given a speech input x, the most plausible word string w^ for that input ("^" denotes a hat) can be obtained by equation (1) below.
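Reconstructed from the definitions given in the text that follows (the posterior P(w|x), the acoustic likelihood P(x|w), and the language probability P(w)), equation (1) presumably takes the standard Bayes decision form. The evidence P(x) is dropped in the last step because it does not depend on w. This is a reconstruction from the surrounding text, not a copy of the original figure.

```latex
\hat{w} = \operatorname*{arg\,max}_{w} P(w \mid x)
        = \operatorname*{arg\,max}_{w} \frac{P(x \mid w)\,P(w)}{P(x)}
        = \operatorname*{arg\,max}_{w} P(x \mid w)\,P(w)
\tag{1}
```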

The speech input x and the word string w correspond, for example, to one utterance, and P(w|x) is the posterior probability that the word string w, a sentence hypothesis, is obtained when the speech input x occurs.
P(x|w) is the likelihood expressing how acoustically plausible the word string w is, and its score (the acoustic score) is computed with an acoustic model, typically a hidden Markov model (HMM) combined with a Gaussian mixture model (GMM). In other words, the acoustic score expresses, given the acoustic features, the plausibility of each of a plurality of candidate word strings.
P(w), on the other hand, is the linguistic plausibility of the word string w, and its score (the language score) is computed with a language model such as a word n-gram model. In other words, the language score expresses the plausibility of each candidate word string given the word string preceding the word to be recognized, the word string following it, or both. A word n-gram model gives the occurrence probability of the next word from a history of (N-1) words, based on statistics of N-word chains (N is, for example, 1, 2, or 3).
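As an illustration of how a word n-gram model yields a next-word probability from its history, the following is a minimal sketch for N = 2 using unsmoothed relative frequencies. The function name `train_bigram`, the sentence boundary markers, and the toy sentences are assumptions made for this example; a practical language model would also apply smoothing, which the text does not describe.

```python
from collections import Counter

def train_bigram(sentences):
    """Return a function prob(prev, word) estimated by relative frequency (no smoothing)."""
    history_counts = Counter()   # how often each word occurs as a history
    pair_counts = Counter()      # how often each 2-word chain occurs
    for words in sentences:
        # Hypothetical boundary markers so the first and last words have context.
        padded = ["<s>"] + list(words) + ["</s>"]
        history_counts.update(padded[:-1])
        pair_counts.update(zip(padded, padded[1:]))

    def prob(prev, word):
        if history_counts[prev] == 0:
            return 0.0  # unseen history: undefined without smoothing, reported as 0 here
        return pair_counts[(prev, word)] / history_counts[prev]

    return prob

prob = train_bigram([["news", "at", "seven"], ["news", "at", "nine"]])
print(prob("news", "at"))   # "at" follows "news" in both toy sentences
print(prob("at", "seven"))  # "seven" follows "at" in one of two cases
```

The language score of a whole word string would then be the product (or the sum of logarithms) of these conditional probabilities over its words.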

In the following description, an HMM-GMM is used as the acoustic model and an n-gram model is used as the language model.

When P(x|w)P(w) in equation (1) is at its maximum, its logarithm is also at its maximum. Speech recognition therefore defines the evaluation function g(w|x) as in equation (2) below, based on Bayes' theorem in equation (1). Here κ is the weight of the language score P(w) relative to the acoustic score P(x|w).
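Taking the logarithm of P(x|w)P(w) and weighting the language term by κ, as the paragraph above describes, equation (2) presumably reads as follows. This is a reconstruction from the description; the exact typography of the original is not known.

```latex
g(w \mid x) = \log P(x \mid w) + \kappa \log P(w)
\tag{2}
```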

Then, as shown in equation (3) below, the word string w^ that maximizes the evaluation function g(w|x) of equation (2) is selected, from the set L of candidate word strings w for the speech input x, as the speech recognition result for x.
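From this description, equation (3) can be reconstructed as a maximization of g over the candidate set L (again a reconstruction from the text rather than the original figure):

```latex
\hat{w} = \operatorname*{arg\,max}_{w \in L} \; g(w \mid x)
\tag{3}
```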

In the error correction model of the conventional method, equation (1) is modified into equation (4) below.
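Based on the penalty term exp Σ_i λ_i f_i(w) that the text attributes to equation (4), the modified decision rule presumably multiplies the Bayes criterion of equation (1) by that term. The form below is a reconstruction under that assumption.

```latex
\hat{w} = \operatorname*{arg\,max}_{w} \; P(x \mid w)\,P(w)\,
          \exp\!\Bigl(\sum_{i} \lambda_{i} f_{i}(w)\Bigr)
\tag{4}
```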

In equation (4), exp Σ_i λ_i f_i(w) is a penalty score that reflects the error tendency of the word string w, f_i(w) is the i-th feature function, and λ_i is the feature weight of the feature function f_i(w). A feature function is defined so that, for a given word string (here, w), it takes the number of times a linguistic rule holds, and 0 if the rule does not hold. Examples of such linguistic rules are:

(a) The number of consecutive word pairs (u, v) contained in the word string w
(b) The number of non-consecutive word pairs (u, v) contained in the word string w
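Rules (a) and (b) can be sketched as counting functions over a word string. The function names and the toy sentence below are hypothetical, and the sketch assumes that a "non-consecutive" pair means an ordered pair separated by at least one intervening word, which the text does not specify.

```python
from collections import Counter
from itertools import combinations

def contiguous_pair_counts(words):
    """Rule (a): count every adjacent word pair (u, v) in the word string."""
    return Counter(zip(words, words[1:]))

def noncontiguous_pair_counts(words):
    """Rule (b): count every ordered pair (u, v) with at least one word in between.

    The gap definition (j - i >= 2) is an assumption made for this sketch.
    """
    return Counter(
        (words[i], words[j])
        for i, j in combinations(range(len(words)), 2)
        if j - i >= 2
    )

w = ["the", "cat", "saw", "the", "cat"]
adjacent = contiguous_pair_counts(w)
distant = noncontiguous_pair_counts(w)

# Each (u, v) pair acts as one feature function f_i(w); a pair that never occurs yields 0.
print(adjacent[("the", "cat")])   # occurrences of "the cat" as adjacent words
print(distant[("the", "cat")])    # occurrences of "the ... cat" with a gap
print(adjacent[("cat", "the")])   # rule does not hold
```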

As equation (4) shows, the error tendency of speech recognition is expressed, through the feature functions and feature weights, as a penalty on linguistic features, and it is estimated based on an evaluation function that minimizes word errors in the learning data. In other words, learning the error tendency in the conventional method means finding the feature weights λ_i of equation (4), using the speech recognition results for the speech data and their correct sentences as learning data.
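To make the role of the penalty concrete, the following sketch rescores a small list of hypotheses in the log domain, where the product in equation (4) becomes a sum. The hypothesis records, feature names, weights, and scores are all invented for illustration, and applying the weight κ from equation (2) to the language score is an assumption of this sketch. Learning the weights λ_i themselves is not shown.

```python
def corrected_score(acoustic_logp, language_logp, features, weights, kappa=1.0):
    """Log-domain score: log P(x|w) + kappa * log P(w) + sum_i lambda_i * f_i(w)."""
    penalty = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return acoustic_logp + kappa * language_logp + penalty

def rescore(nbest, weights, kappa=1.0):
    """Return the hypothesis with the highest corrected score."""
    return max(
        nbest,
        key=lambda h: corrected_score(h["am"], h["lm"], h["features"], weights, kappa),
    )

# Hypothetical 2-best list: "am" and "lm" are log acoustic and language scores.
nbest = [
    {"words": "recognize speech", "am": -10.5, "lm": -4.0, "features": {"pair:wreck,a": 0}},
    {"words": "wreck a nice beach", "am": -10.0, "lm": -4.0, "features": {"pair:wreck,a": 1}},
]

# A negative weight penalizes a word pair that tends to appear in misrecognitions.
weights = {"pair:wreck,a": -1.0}

print(rescore(nbest, {})["words"])       # without the penalty
print(rescore(nbest, weights)["words"])  # with the penalty
```

With an empty weight table the second hypothesis wins on raw scores (-14.0 against -14.5); the penalty of -1.0 lowers it to -15.0, so the first hypothesis is selected instead.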

[2.2 Technique applied to the speech recognition device of this embodiment]
As described in the previous section, the speech recognition device of this embodiment learns the error tendency of speech recognition that depends on a specific speaker and topic, and generates a statistical error correction model. The device applies this error correction model to speech recognition to improve the recognition rate.

Learning an error correction model by statistical means requires learning data. The learning data consists of word strings that contain errors, and speech recognition results are generally used. Which word strings appear in the learning data therefore depends on the acoustic model and language model used for speech recognition.

In speech recognition, on the other hand, several language models are often integrated to improve word prediction accuracy. In general, the language models are integrated by linear interpolation as shown in equation (5) below.
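Given the definitions of the per-model language score P_j and the mixture weight θ_j in the text, linear interpolation in equation (5) presumably has the usual weighted-sum form. This is a reconstruction from those definitions.

```latex
P(w) = \sum_{j} \theta_{j}\, P_{j}(w), \qquad \sum_{j} \theta_{j} = 1
\tag{5}
```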

P_j is the language score when the j-th language model is used. θ_j is a coefficient called the mixture weight for the j-th language model, and satisfies Σ_j θ_j = 1. Below, θ_j is also called a model parameter. The data used to learn models such as language models normally differs from the data used to learn the model parameters; the latter is generally called development data.
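Linear interpolation with mixture weights can be sketched as below. The function name `interpolate` and the example probabilities are assumptions for illustration; the check on the weights mirrors the constraint Σ_j θ_j = 1 stated above.

```python
def interpolate(probabilities, weights, tol=1e-9):
    """Combine per-model language probabilities P_j(w) with mixture weights theta_j."""
    if len(probabilities) != len(weights):
        raise ValueError("one mixture weight is required per language model")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("mixture weights must sum to 1")
    return sum(theta * p for theta, p in zip(weights, probabilities))

# Hypothetical probabilities of the same word string under two topic language models.
p_topic_a, p_topic_b = 0.2, 0.4
mixed = interpolate([p_topic_a, p_topic_b], [0.75, 0.25])
print(mixed)
```

Changing the weights changes the combined score of every hypothesis, which is why, in the conventional procedure, recognition results generated under one weight setting do not carry over to another.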

In the conventional method, speech recognition is performed with a model that integrates multiple language models by a technique such as the linear interpolation of Expression (5), and the resulting speech recognition results serve as learning data. The error correction model is then learned from these speech recognition results.

FIG. 7 is a diagram showing the procedure for error correction model learning according to the conventional method.
As shown in the figure, the conventional method selects the acoustic model of a specific speaker (acoustic model A_2 in the figure) from acoustic models A_1 to A_N as the acoustic model used for error learning, and integrates multiple language models B_1 to B_M as the language model. The learning data for the error correction model is obtained by recognizing speech data with this speaker-specific acoustic model and the integrated language model. Consequently, in the conventional method, when the conditions for combining the language models change (that is, when the values of the model parameters θ_j in Expression (5) change), the speech recognition results that serve as learning data for the error correction model change substantially. To learn an error correction model suited to a given condition, speech recognition results must therefore be regenerated for each combination of language models. Generating the learning data in this way takes too much computation time and is inefficient.

FIG. 1 is a diagram showing the integrated model learning procedure in the speech recognition apparatus according to the present embodiment. As shown in the figure, the apparatus generates an error correction model using, as learning data, speech recognition results obtained with the acoustic model of a specific speaker selected from acoustic models A_1 to A_N (acoustic model A_2 in the figure) and the language model of a specific topic selected from language models B_1 to B_M (language model B_1 in the figure). The integration of the language models (B_2 to B_M in the figure), which is the source of the problem in the conventional method, is deferred until after the error correction model has been generated. With this approach, the error tendencies for the integrated language model can be estimated only approximately. However, when learning an error correction model that depends on a specific topic, and under the assumption that at most one language model reflects the topic dependency, the error tendencies can be expected to be learned with a good approximation.

Accordingly, when the speech recognition apparatus of the present embodiment learns the error tendencies for a specific topic and speaker with respect to the evaluation data to be recognized, one language model for the topic of interest is prepared and used for speech recognition together with the acoustic model of the specific speaker. Evaluation data here means unseen speech data that is separate from the speech data used to train the language models and acoustic models. In this way, the apparatus can generate learning data containing errors that reflect the specific speaker and topic.

For example, suppose an error correction model specialized for the topic of cooking (and a specific speaker) is to be created. In such a case, a language model for the cooking topic is often integrated, by linear interpolation, with language models learned from topics not directly related to cooking. This is because a language model specialized for one topic usually has little learning data, so its accuracy in predicting words during speech recognition (estimation accuracy) degrades and the model loses statistical robustness. Integration with other language models is performed to secure this robustness. To capture the error tendencies related to cooking, however, it is sufficient to use only the language model specialized for the cooking topic.

The speech recognition apparatus of the present embodiment generates learning data with a combination of one specific acoustic model and one specific language model, learns the error correction model from that data, and then mixes in other language models to suit the target task. Because learning data no longer has to be regenerated every time the conditions for combining language models change, costs such as computation time are greatly reduced.

[2.3 Overview of the Integrated Model Learning Process]
FIG. 2 shows an overview of the processing flow of the integrated model learning process performed by the speech recognition apparatus of the present embodiment.
The speech recognition apparatus stores in advance, in a spoken language resource storage unit, spoken language resource data consisting of speech data and the text data that transcribes it. It also stores acoustic models and language models in advance in a statistical model storage unit. The speech data and acoustic models are given label data indicating at least the speaker, and the text data and language models are given label data indicating at least the topic or program.

(Step S1): A topic and a speaker related to the task whose recognition performance is to be improved are specified.
First, the user of the speech recognition apparatus selects the topic and speaker for which the recognition rate is to be improved. For example, based on the program, topic, and speaker information indicated by the label data attached to each item of speech data and text data in the spoken language resource data, the user specifies label data for a speaker name or a topic (cooking, health, travel, and so on). The apparatus selects the speech data and text data carrying the specified speaker-name or topic label data as the spoken language resource data used for learning the error correction model.

(Step S2): The language model and acoustic model used for the task whose recognition performance is to be improved are selected.
Next, the speech recognition apparatus selects, from the statistical model storage unit, a language model and an acoustic model that match the topic and speaker specified in step S1. These may be acoustic and language models estimated by statistical means from the speech data and text data stored in the spoken language resource storage unit, or models estimated from other speech data and text data.

(Step S3): Hypotheses containing speech recognition errors are generated from the selected acoustic model and language model and the spoken language resource data.
The speech recognition apparatus recognizes the speech data selected in step S1 using the acoustic model and language model selected in step S2, and generates hypotheses containing recognition errors (speech recognition results) as learning data. When text data is selected in step S1, the apparatus can also use the acoustic model and language model to generate hypotheses containing pseudo errors.

(Step S4): An error correction model is learned from the generated hypotheses.
The speech recognition apparatus uses the hypotheses containing recognition errors obtained in step S3 as learning data and estimates an error correction model by statistical means. For this estimation, the apparatus determines in advance the linguistic features used by the error correction model from the transcriptions (correct sentences) indicated by the text data selected in step S1 and the error-containing hypotheses obtained in step S3.

(Step S5): The error correction model is integrated with an arbitrary number of language models.
The speech recognition apparatus integrates the error correction model obtained in step S4 with an arbitrary number of language models stored in the statistical model storage unit, using a technique such as linear interpolation, to generate an integrated model. In doing so, the apparatus uses speech data for the topic and speaker whose recognition performance is to be improved, and estimates the mixture weight (model parameter) of each language model so as to maximize that recognition performance.

(Step S6): The speech recognition apparatus uses the integrated model generated in step S5 to recognize utterances by the specific speaker on the specific topic.
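
Step S5 states only that the mixture weights are chosen to maximize recognition performance; this overview does not say how the search is carried out. The sketch below assumes a plain grid search over weight vectors that sum to 1, which is one simple possibility and not necessarily the patent's method. The function names and the toy error function are hypothetical; in practice `dev_error` would recognize the development data with the given weights and return an error count.

```python
import itertools


def weight_grid(num_models, steps):
    """Yield weight vectors whose entries are multiples of 1/steps and sum to 1."""
    for combo in itertools.product(range(steps + 1), repeat=num_models):
        if sum(combo) == steps:
            yield tuple(c / steps for c in combo)


def best_weights(num_models, dev_error, steps=10):
    """Return the grid point with the lowest error on the development data."""
    return min(weight_grid(num_models, steps), key=dev_error)


def toy_dev_error(weights):
    """Toy stand-in for a recognition error measured on development data."""
    target = (0.6, 0.3, 0.1)
    return sum((w - t) ** 2 for w, t in zip(weights, target))


theta = best_weights(3, toy_dev_error, steps=10)
```

Because the learning data for the error correction model is fixed before this step, only the inexpensive weight search has to be repeated when the set of language models changes.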

[3. Configuration of the Speech Recognition Apparatus]
FIG. 3 is a functional block diagram showing the configuration of the speech recognition apparatus 1 according to one embodiment of the present invention; only the functional blocks relevant to the present embodiment are shown.
The speech recognition apparatus 1 is realized by a computer and, as shown in the figure, comprises a spoken language resource management unit 2, a statistical model management unit 3, a recognition error generation unit 4, an error correction model learning unit 5, a model integration unit 6, a speech recognition unit 7, and a storage unit 8.

The spoken language resource management unit 2 includes a spoken language resource storage unit 21 that stores spoken language resource data consisting of speech data and text data. The spoken language resource management unit 2 collects broadcast audio/caption data D1 acquired from outside as spoken language resource data and writes it to the spoken language resource storage unit 21. In doing so, according to the content of the broadcast audio/caption data D1, the unit attaches label data indicating the speaker's name (speaker name) to speech data and label data indicating content such as the topic to text data before storing them. The label data of speech data may also include the topic, and the label data of text data may also include the speaker name. The speech data represents feature values obtained by short-time spectral analysis of the speech waveform of an utterance, and the text data represents a transcription (correct sentence) of the utterance content. One item of text data may contain multiple sentences.

The user selects the speaker and topic for which the desired integrated model is to be generated, according to the label data attached to the speech data and text data stored in the spoken language resource storage unit 21. This selection identifies the speaker and topic, and the spoken language resource management unit 2 identifies the spoken language resource data whose label data corresponds to them. From the identified spoken language resource data, the unit selects speech/text data D4, used to generate the learning data for the error correction model, and development data D5, used to estimate the model parameters of the integrated model, so that the two do not overlap, and writes them to the storage unit 8. The spoken language resource management unit 2 outputs specific speaker data D2 indicating the identified speaker and specific topic data D3 indicating the identified topic to the statistical model management unit 3.

The statistical model management unit 3 includes a statistical model storage unit 31 that stores acoustic models and language models to which label data is attached. The label data of an acoustic model indicates the speaker's name, and the label data of a language model indicates the topic name. Using the label data for the speaker and topic indicated by the specific speaker data D2 and specific topic data D3, or for a topic similar to that topic, the statistical model management unit 3 selects an acoustic model D6 and a language model D7 from the acoustic models and language models stored in the statistical model storage unit 31. It further selects one or more language models D8, different from language model D7, from the language models stored in the statistical model storage unit 31.

The recognition error generation unit 4 recognizes the speech data contained in the speech/text data D4 using the acoustic model D6 and language model D7, generates learning data D9, which is a speech recognition result containing recognition errors, and writes it to the storage unit 8. For text data in the speech/text data D4 that has no corresponding speech data, the recognition error generation unit 4 generates the learning data D9 by a pseudo hypothesis generation method that does not rely on speech recognition.

The error correction model learning unit 5 uses the speech recognition results indicated by the learning data D9 and the correct sentences indicated by the text data in the speech/text data D4 to learn the error tendencies of speech recognition by statistical means, and generates an error correction model D10. It writes the generated error correction model D10 to the storage unit 8.

The model integration unit 6 integrates the error correction model D10 with an arbitrary number of language models D8 for topics other than the specific topic to generate an integrated model D11, and writes it to the storage unit 8.

The speech recognition unit 7 performs speech recognition using the conventional language models and acoustic models stored in the statistical model storage unit 31 together with the integrated model D11, and outputs speech recognition result data D12 indicating the recognition result.
The storage unit 8 stores the various data used by each functional unit.

[4. Detailed Processing Procedure in the Speech Recognition Apparatus]
Next, the detailed processing procedure executed by the speech recognition apparatus 1 in the integrated model learning process shown in FIG. 2 is described.

[4.1 Step S1]
The spoken language resource management unit 2 collects broadcast audio/caption data D1 as the speech data and text data of the spoken language resource data and stores it in the spoken language resource storage unit 21. Alternatively, the unit collects web data from, for example, a server computer connected to the speech recognition apparatus 1 over a network, and stores it in the spoken language resource storage unit 21 as spoken language resource data consisting only of text data. Label data is attached to the collected speech data and text data either by the spoken language resource management unit 2 or manually. For example, program information attached to the broadcast audio/caption data D1 can be used as label data. When the web data is, for example, news text, the category to which the news item belongs can be obtained from the web data and used as label data.

To generate the learning data used for learning the error correction model, the user refers to the set of label data stored in the spoken language resource storage unit 21, specifies a speaker and a topic, and enters the specific speaker data D2 and specific topic data D3 through input means (not shown) such as a keyboard. Based on the label data stored in the spoken language resource storage unit 21, the spoken language resource management unit 2 identifies the pairs of speech data and text data that correspond to the specific speaker data D2 and specific topic data D3, and selects some of these pairs as the speech/text data D4. By the same procedure, the unit identifies the pairs of speech data and text data corresponding to D2 and D3 and selects some of them as development data D5 for estimating the model parameters of the language models. The development data D5 is selected so that it does not overlap with the speech/text data D4. The unit may also identify text data corresponding to the specific topic data D3 and select the speech/text data D4 from that text data. Any method may be used to select the speech/text data D4 and development data D5, and the amount of development data D5 need only be a few percent of the amount of speech/text data D4.
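
The selection of D4 and D5 can be sketched as follows. The patent leaves the selection method open, so the random shuffle, the record layout (dictionaries with `speaker` and `topic` keys), the function name, and the 5% default are all assumptions made for illustration.

```python
import random


def select_d4_d5(items, speaker, topic, dev_fraction=0.05, seed=0):
    """Pick the items labeled with the given speaker and topic, then split them
    into two disjoint sets: D4 (source of learning data) and D5 (development data)."""
    matched = [it for it in items
               if it["speaker"] == speaker and it["topic"] == topic]
    rng = random.Random(seed)
    rng.shuffle(matched)
    # Hold out a small share of the matched items as development data.
    n_dev = max(1, round(len(matched) * dev_fraction)) if matched else 0
    return matched[n_dev:], matched[:n_dev]  # (D4, D5)


# Hypothetical labeled corpus: 100 cooking utterances from two speakers.
corpus = [{"id": i, "speaker": "A" if i % 2 == 0 else "B", "topic": "cooking"}
          for i in range(100)]
d4, d5 = select_d4_d5(corpus, speaker="A", topic="cooking")
```

Slicing one shuffled list guarantees that no item appears in both sets.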

[4.2 Step S2]
The statistical model storage unit 31 stores, in association with label data, acoustic models and language models learned either from the speech data and text data stored as spoken language resource data in the spoken language resource storage unit 21 or from other spoken language resource data. Based on the label data stored in the statistical model storage unit 31, the statistical model management unit 3 selects the acoustic model corresponding to the specific speaker data D2 and the language model corresponding to the specific topic data D3. The unit further selects, from the language models stored in the statistical model storage unit 31, one or more language models different from language model D7 as the language models D8 to be integrated with the error correction model.

[4.3 Step S3]
The recognition error generation unit 4 recognizes the speech data contained in the speech/text data D4 using the acoustic model D6 and language model D7. Here, the speech recognition result is either the n most likely word sequences (an n-best list, where n is an integer of 1 or more) or a word lattice. The recognition error generation unit 4 writes the learning data D9 indicating the speech recognition results for the speech data to the storage unit 8.

When the speech/text data D4 is text data that is not associated with speech data, speech recognition cannot be used, so the recognition error generation unit 4 generates pseudo speech recognition results as shown in FIG. 4 below.

FIG. 4 is a diagram showing the processing flow by which the recognition error generation unit 4 generates pseudo speech recognition results. This generation process is based on the procedure used in speech synthesis described in Tokuda, "Application of Hidden Markov Models to Speech Synthesis," IEICE Technical Report SP-99, pp. 47-54, 1999.

(Step S30: Pronunciation sequence generation)
First, the recognition error generation unit 4 converts each word string indicated by the text data in the speech/text data D4 into a correct phoneme string. Various conversion methods are possible; in the present embodiment, the conversion is performed as follows.

Let w be a word string and q a phoneme string. The correct phoneme string q^ to be found is then given by Expression (6) below.
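
The image for Expression (6) is missing from this text. Since q^ is later described as the most plausible pronunciation sequence given w, computed from the conditional probabilities P(q|w), it presumably takes the following form (a reconstruction, not a quotation of the original):

```latex
\hat{q} = \mathop{\mathrm{argmax}}_{q} \; P(q \mid w) \tag{6}
```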

Here, the conditional probability P(q|w) of the phoneme string q given the word string w is obtained by Expression (7) below.
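
The image for Expression (7) is also missing. Given that the score S(w, q) comes from a log-linear model, a plausible reconstruction is the usual normalized exponential over the candidate phoneme strings q' for w. The exact form in the original may differ.

```latex
P(q \mid w) = \frac{\exp S(w, q)}{\sum_{q'} \exp S(w, q')} \tag{7}
```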

Here, S(w, q) is a score given by a log-linear model and is obtained by Expression (8) below.
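
The image for Expression (8) is missing as well. From the accompanying definitions of the feature functions f_i(w, q) and their weights λ_i, the score is presumably the weighted sum below (a reconstruction under that assumption):

```latex
S(w, q) = \sum_{i} \lambda_i \, f_i(w, q) \tag{8}
```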

In Expression (8), f_i(w, q) is the i-th feature function and λ_i is the feature weight of the feature function f_i(w, q).
Examples of feature functions used in Expression (8) include the following.

(a) 1 if the i-th word of the word string w is w_i = u and its pronunciation is q_i = α; 0 otherwise. For example, the value is 1 if the word w_i is the word u "行って" and the pronunciation q_i of w_i matches the phoneme string α "/i/ /Q/ /t/ /e/".
(b) 1 if the (i-1)-th word of the word string w is w_{i-1} = u, the i-th word is w_i = v, and the pronunciation is q_i = β; 0 otherwise. For example, the value is 1 if the word w_{i-1} is the word u "へ", the immediately following word w_i is the word v "行って", and the pronunciation q_i of w_i matches the phoneme string β "/i/ /Q/ /t/ /e/".

Multiple phoneme strings q can be generated for a word string w. The approach above therefore estimates the correct pronunciation (phoneme string) using the written form of the words: the more likely a pronunciation is to occur, the more feature functions take nonzero values and the larger the score of Expression (8) becomes.

The storage unit 8 stores in advance, as a word/pronunciation conversion model D31, a pronunciation dictionary, which is a table that associates each word's written form with the phoneme strings representing its pronunciation. One or more phoneme strings may correspond to a single written form. The recognition error generation unit 4 takes the word string indicated by each item of text data in the speech/text data D4 as w and obtains the phoneme strings of each of the words w_1, w_2, ... that make up w from the word/pronunciation conversion model D31. It then concatenates the phoneme strings obtained for the words in the order w_1, w_2, ... to generate a phoneme string q for the word string w. Accordingly, when the word string w consists of words w_1, w_2, ... and there are n_i phoneme strings corresponding to word w_i (i = 1, ...), Π_i n_i different phoneme strings q are generated. For the word string w, the recognition error generation unit 4 calculates the conditional probability P(q|w) of each generated phoneme string q using Expressions (7) and (8), and uses these probabilities in Expression (6) to obtain the correct phoneme string q^, the most plausible pronunciation sequence given w. The model parameters Λ = (λ_1, λ_2, ...) used in Expression (8) are values learned in advance from separately prepared learning data consisting of word strings and correct pronunciation sequences.
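
The procedure of step S30 can be sketched in Python as follows: enumerate every combination of dictionary pronunciations, score each with a log-linear model using feature types (a) and (b), normalize the scores, and keep the best candidate. The dictionary entries, the weights, and the function names are hypothetical; the normalization assumes the usual softmax form for a log-linear model.

```python
import itertools
import math

# Hypothetical pronunciation dictionary: written form -> candidate phoneme strings.
LEXICON = {
    "へ": [("e",), ("h", "e")],
    "行って": [("i", "Q", "t", "e"), ("o", "k", "o", "n", "a", "Q", "t", "e")],
}

# Hypothetical feature weights. A 2-tuple key (word, pron) is a type (a) feature;
# a 3-tuple key (previous word, word, pron) is a type (b) feature.
WEIGHTS = {
    ("へ", ("e",)): 1.5,
    ("行って", ("i", "Q", "t", "e")): 1.0,
    ("へ", "行って", ("i", "Q", "t", "e")): 2.0,
}


def score(words, prons):
    """S(w, q): sum of the weights of the binary features that fire."""
    total = 0.0
    for i, (word, pron) in enumerate(zip(words, prons)):
        total += WEIGHTS.get((word, pron), 0.0)
        if i > 0:
            total += WEIGHTS.get((words[i - 1], word, pron), 0.0)
    return total


def best_pronunciation(words):
    """Return the most probable phoneme string for the words and its probability."""
    candidates = list(itertools.product(*(LEXICON[w] for w in words)))
    scores = [score(words, c) for c in candidates]
    norm = sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / norm for s in scores]
    best = max(range(len(candidates)), key=probs.__getitem__)
    phonemes = tuple(p for pron in candidates[best] for p in pron)
    return phonemes, probs[best]


q_hat, p_best = best_pronunciation(["へ", "行って"])
```

With two candidate pronunciations per word, four phoneme strings are scored, matching the product of the per-word counts described above.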

(Step S31: HMM state sequence generation)
The recognition error generation unit 4 obtains the corresponding HMM state sequence from the correct phoneme string q^, the correct pronunciation sequence obtained in step S30. For example, it refers to the HMM for each phoneme indicated by the acoustic model D6 and concatenates the HMMs corresponding to the phonemes that make up q^ to generate the HMM state sequence.

An HMM is generally a finite-state automaton with about three to five states, and each state of the automaton has a self-transition. The number of self-transitions gives the state duration, but at this point that number is still unknown. The recognition error generation unit 4 therefore estimates the duration of each state in the generated HMM state sequence. In the present embodiment, it obtains the state durations by sampling from a Gamma distribution, that is, a distribution with the probability density function f(x) shown in Expression (9) below.
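
The image for Expression (9) is missing from this text. With scale parameter θ and shape parameter k, the Gamma probability density function has the standard form:

```latex
f(x) = \frac{x^{k-1} \, e^{-x/\theta}}{\Gamma(k) \, \theta^{k}}, \qquad x > 0 \tag{9}
```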

In Expression (9), x is the duration of the HMM, θ is a model parameter called the scale parameter, and k is a model parameter called the shape parameter. Γ(k) denotes the gamma function.

The storage unit 8 stores, for each speaker, a state duration model D32 representing the Gamma distribution of the state duration estimated in advance from speech data for each HMM state. Using a random number generator that follows the Gamma distribution of each HMM state given by the state duration model D32 of the specific speaker, the recognition error generation unit 4 obtains an estimated state duration for each HMM in the state sequence generated from the correct phoneme sequence q^. This determines the transitions (path) between the states of each HMM. In other words, the recognition error generation unit 4 concatenates the HMMs, using the estimated values as state durations, to obtain an HMM state sequence with state durations.
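The duration sampling described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary layout of the duration model, and the rounding of each sample to at least one frame are all assumptions made for the example.

```python
import random

def sample_state_durations(state_ids, duration_model, rng):
    """Draw one duration (in frames) per HMM state from its Gamma distribution.

    duration_model maps a state id to (k, theta): shape and scale (hypothetical layout).
    """
    durations = []
    for sid in state_ids:
        k, theta = duration_model[sid]
        # random.gammavariate takes (shape, scale); rounding to >= 1 frame is an assumption.
        frames = max(1, round(rng.gammavariate(k, theta)))
        durations.append((sid, frames))
    return durations

def expand_path(durations):
    """Repeat each state for its sampled duration to get a frame-level state path."""
    path = []
    for sid, frames in durations:
        path.extend([sid] * frames)
    return path
```

Passing a seeded `random.Random` instance as `rng` makes the generated paths reproducible, which is convenient when regenerating learning data.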

(Step S32: HMM feature vector generation process)
The recognition error generation unit 4 obtains acoustic features by sampling from the multivariate Gaussian mixture distribution of each HMM state indicated by the acoustic model D6. The multivariate Gaussian mixture distribution is built on the probability density function N(x; μ, Σ) shown in Equation (10).

In Equation (10), x is an N-dimensional acoustic feature vector, and μ and Σ are the mean and covariance matrix of the multivariate Gaussian distribution, respectively.
Here, the HMM is defined by the Gaussian mixture distribution shown in Equation (11), whose components are the multivariate Gaussian densities N(x; μ, Σ) of Equation (10).

In Equation (11), c_m is the weight of the mixture component with probability density function N(x; μ_m, Σ_m), and it satisfies Equation (12).
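The standard forms that match these definitions are sketched below: the N-dimensional Gaussian density, a mixture of such densities, and the sum-to-one constraint on the weights. The symbol b(x) for the mixture density is a name introduced here for illustration, and the correspondence to Equations (10) to (12) is assumed from the surrounding text.

```latex
\mathcal{N}(x;\mu,\Sigma) = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}}
  \exp\!\left(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)

b(x) = \sum_{m} c_m\, \mathcal{N}(x;\mu_m,\Sigma_m)

\sum_{m} c_m = 1, \qquad c_m \ge 0
```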

To compute output probabilities for acoustic features (features extracted from the short-time spectrum of speech), the multivariate Gaussian mixture distribution of the acoustic features in each HMM state, obtained in advance for each speaker, is stored in the statistical model storage unit 31 as the acoustic model. Using a random number generator that follows the multivariate Gaussian mixture distribution of each HMM state indicated by the acoustic model D6, the recognition error generation unit 4 obtains (samples) the acoustic features corresponding to each HMM state sequence with state durations.
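A rough sketch of this sampling step follows. It assumes diagonal covariance matrices so that each dimension can be drawn independently; the text does not state the covariance structure, and the function names and the `(weights, means, variances)` tuple layout are hypothetical.

```python
import random

def sample_gmm_frame(weights, means, variances, rng):
    """Draw one feature vector from a Gaussian mixture with diagonal covariances."""
    # Pick a mixture component m with probability proportional to its weight c_m.
    m = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Diagonal covariance (an assumption): sample every dimension independently.
    return [rng.gauss(mu, var ** 0.5) for mu, var in zip(means[m], variances[m])]

def sample_features(path, gmms, rng):
    """Generate one feature vector per frame of a frame-level HMM state path.

    gmms maps a state id to (weights, means, variances) (hypothetical layout).
    """
    return [sample_gmm_frame(*gmms[sid], rng) for sid in path]
```

With full covariance matrices, the per-dimension draw would be replaced by a draw using a Cholesky factor of Σ_m.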

(Step S33: Linear transformation process)
The linear transformation process is optional. To make speech recognition more prone to confusion, the recognition error generation unit 4 applies feature-space Maximum Likelihood Linear Regression (fMLLR) to the acoustic features sampled in step S32, using the feature transformation matrix D33 stored in advance in the storage unit 8. This processing uses the technique described in "Y. Li et al., Incremental on-line feature space MLLR adaptation for telephony speech recognition, in ICSLP, 2002." In general, a statistical model such as an HMM forms a discriminant surface (the region in which it outputs a higher probability than the other HMMs) in the acoustic feature space. A linear transformation with the feature transformation matrix can therefore move the features away from the discriminant surface (toward some other, distant point), deliberately degrading discrimination performance. As the target point, a point obtained by replacing a phoneme with another phoneme that is statistically likely to be confused with it can be used.

(Step S34: Speech recognition process)
Finally, the recognition error generation unit 4 performs speech recognition on the acoustic features obtained in step S33 (or those obtained in step S32) using the acoustic model D6 and the language model D7, and obtains a speech recognition result. The speech recognition result is an n-best list or a word lattice. It contains a plurality of correct sentence candidates together with the acoustic score and language score of each candidate. The recognition error generation unit 4 writes the speech recognition result to the storage unit 8 as learning data D9.

[4.4 Step S4]
[4.4.1 Linguistic feature extraction process]
In step S4, the error correction model learning unit 5 first extracts, from the speech/text data D4 and the learning data D9 stored in the storage unit 8, feature functions based on linguistic features to be used for error tendency learning. The rules for the feature functions are linguistic features such as consecutive words, the phonemes that make up a word, two or more non-consecutive words, co-occurrence relationships between phonemes, and syntactic or semantic information about words.

In the present embodiment, the error correction model learning unit 5 defines, for example, the following (a) and (b) as feature functions based on word co-occurrence.

(a) A function that returns the number of occurrences of the consecutive word pair (u, v) in the word string w
(b) A function that returns the number of occurrences of the non-consecutive word pair (u, v) in the word string w

The error correction model learning unit 5 also replaces each word in the word string w with a part-of-speech category such as noun or verb, and defines, for example, the following (c) and (d) as feature functions based on syntactic information. Here, c(·) is a function that maps a word to its part of speech.

(c) A function that returns the number of occurrences of the consecutive part-of-speech pair (c(u), c(v)) in the word string w
(d) A function that returns the number of occurrences of the non-consecutive part-of-speech pair (c(u), c(v)) in the word string w

Alternatively, the error correction model learning unit 5 replaces each word in the word string w with a category representing semantic information (a semantic category), and defines, for example, the following (e) and (f) as feature functions based on semantic information. Semantic categories can be obtained using a database external to the speech recognition apparatus 1 or a thesaurus stored in the storage unit 8. Here, s(·) is a function that maps a word to its semantic category.

(e) A function that returns the number of occurrences of the consecutive semantic category pair (s(u), s(v)) in the word string w
(f) A function that returns the number of occurrences of the non-consecutive semantic category pair (s(u), s(v)) in the word string w

The error correction model learning unit 5 also defines, for example, the following (g) as a feature function for phoneme strings.

(g) A function that returns the number of occurrences of the phoneme string q in the word string w

The error correction model learning unit 5 extracts all feature functions that conform to the above rules from the correct word strings given by the text data of the speech/text data D4 and from the speech recognition results of the learning data D9, and counts how often each extracted feature function occurs. It then selects the feature functions whose occurrence count is at or above a predetermined threshold as the feature functions f_i used in error tendency learning.
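The pair-counting rules (a) and (b) and the frequency threshold can be sketched as below. The feature keys, the function names, and the reading of "non-consecutive" as "any ordered pair with at least one word in between" are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def pair_features(words):
    """Count consecutive ("adj") and non-consecutive ("gap") ordered pairs (u, v)."""
    feats = Counter()
    for i, j in combinations(range(len(words)), 2):
        kind = "adj" if j == i + 1 else "gap"
        feats[(kind, words[i], words[j])] += 1
    return feats

def select_features(sentences, threshold):
    """Keep the features whose total count over all sentences reaches the threshold."""
    total = Counter()
    for words in sentences:
        total.update(pair_features(words))
    return {f for f, n in total.items() if n >= threshold}
```

The same functions cover rules (c) to (f) if each word is first mapped to its part of speech or semantic category, for example `pair_features([c(w) for w in words])` with a user-supplied mapping `c`.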

[4.4.2 Error tendency learning process]
In the present embodiment, the error correction model learning unit 5 uses the risk minimization method described below to obtain an error correction model that reflects the error tendency.
In a statistical error correction model based on risk minimization, given utterances x_m (m is an integer from 1 to M, where M is the number of learning data items) and the correct word string w_{m,0} corresponding to each utterance x_m, the objective function L(Λ) is defined as in the following Equation (13).

L_m is the set of sentence hypotheses w_{m,1}, w_{m,2}, ... generated by speech recognition from the utterance x_m, and the sentence hypothesis w_{m,k} (k is an integer of 1 or more) is the word string of the k-th correct sentence candidate for the utterance x_m. w_{m,0} is the correct sentence for the utterance x_m, and R(w_{m,0}, w_{m,k}) is the Levenshtein edit distance between the correct sentence w_{m,0} and the sentence hypothesis w_{m,k}. The posterior probability P(w_{m,k} | x_m; Λ) is the probability of obtaining the sentence hypothesis w_{m,k} given the utterance x_m. Λ is the set of feature weights λ_1, λ_2, ... for the feature functions, and the Λ that minimizes the objective function of Equation (13) gives the parameters of the error correction model being sought. The reason is that estimating Λ so as to minimize the objective function of Equation (13) minimizes the expected recognition error over the correct sentence candidates, and the same Λ can be expected to reduce recognition errors for unseen input speech that differs from the learning data, improving speech recognition performance. In other words, the objective function of Equation (13) serves as an evaluation function that yields an evaluation value indicating whether the feature weights are appropriate, that is, whether the expected recognition error of the correct sentence candidates is minimized.
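One way to write an objective consistent with these definitions is the posterior-weighted edit distance summed over all utterances. The form below is a sketch assumed from the surrounding text, not a verbatim copy of Equation (13).

```latex
L(\Lambda) = \sum_{m=1}^{M} \sum_{w_{m,k} \in L_m}
  R(w_{m,0}, w_{m,k})\, P(w_{m,k} \mid x_m; \Lambda)
```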

The error correction model outputs, for a sentence hypothesis w obtained for the speech input x, an error score S(w) according to the following Equation (14).

Accordingly, the speech recognition score g^(w | x), which takes the score of the error correction model into account, is calculated as in the following Equation (15), and among the sentence hypotheses obtained by speech recognition, the hypothesis that maximizes the score calculated by Equation (15) is output as the speech recognition result.

In Equation (15), κ is the weight of the language model score relative to the acoustic model score P(x | w).

The posterior probability P(w_{m,k} | x_m; Λ) in Equation (13) is calculated as in the following Equation (16).

g^(w_{m,k} | x_m; Λ) in Equation (16) is calculated from Equation (15) as in the following Equation (17).

S(w_{m,k}) in Equation (17) is calculated by Equation (14) using the values of Λ = λ_1, λ_2, ....
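The chain from feature weights to expected risk can be sketched as follows. Several parts are assumptions rather than the patent's exact equations: the error score is taken as the weighted feature sum S(w) = Σ_i λ_i f_i(w), the recognition score as log P(x|w) + κ log P(w) + S(w), and the posterior as a softmax of that score over the hypotheses of one utterance. The dictionary keys (`log_am`, `log_lm`, `feats`, `words`) and function names are hypothetical.

```python
import math

def edit_distance(ref, hyp):
    """Levenshtein distance between two word lists (one-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def error_score(feats, weights):
    """Assumed form of S(w): weighted sum of feature values."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def posteriors(hyps, weights, kappa):
    """Softmax of the assumed recognition score over the hypotheses of one utterance."""
    scores = [h["log_am"] + kappa * h["log_lm"] + error_score(h["feats"], weights)
              for h in hyps]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def expected_risk(ref, hyps, weights, kappa):
    """Posterior-weighted edit distance for one utterance."""
    post = posteriors(hyps, weights, kappa)
    return sum(p * edit_distance(ref, h["words"]) for p, h in zip(post, hyps))
```

Summing `expected_risk` over all utterances gives the value that the learning procedure tries to drive down by adjusting the weights.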

FIG. 5 is a diagram illustrating the processing flow of the error correction model learning process performed by the error correction model learning unit 5.

(Step S40: Feature extraction process)
The error correction model learning unit 5 extracts the feature functions f_i used in error tendency learning by the linguistic feature extraction process described above.

(Step S41: Model parameter initialization process)
The error correction model learning unit 5 initializes all the feature weights λ_i of the feature functions f_i obtained in step S40 to zero.

(Step S42: Objective function calculation process)
The error correction model learning unit 5 reads a speech recognition result from the learning data D9, and reads the correct word string (text data) corresponding to that speech recognition result from the speech/text data D4. Using the data it has read and the current value of Λ = (λ_1, λ_2, ...), the error correction model learning unit 5 calculates the value of the objective function L(Λ) by Equation (13).

The sentence hypothesis w_{m,k} (k = 1, ...) is the k-th correct sentence candidate contained in the speech recognition result obtained from the speech data of the utterance x_m, and is obtained from the learning data D9. The correct sentence w_{m,0} is the correct word string of the utterance x_m and is obtained from the speech/text data D4. Using the correct sentence w_{m,0} read from the speech/text data D4 and the sentence hypothesis w_{m,k} read from the learning data D9, the error correction model learning unit 5 calculates the Levenshtein edit distance R(w_{m,0}, w_{m,k}) in Equation (13). The error correction model learning unit 5 also calculates the posterior probability P(w_{m,k} | x_m; Λ) by Equations (16) and (17); the acoustic score P(x_m | w_{m,k}) and the language score P(w_{m,k}) in Equation (17) are obtained from the speech recognition result in the learning data D9. Further, the error correction model learning unit 5 calculates S(w_{m,k}) in Equation (17) by Equation (14), using the values that the feature functions f_i extracted in step S40 take on the sentence hypothesis w_{m,k} and the current value of Λ.

(Step S43: Parameter update process)
The error correction model learning unit 5 updates the parameter Λ of the error correction model based on the quasi-Newton method. The quasi-Newton method starts from a suitable initial value, generates a next value closer to the solution, repeats this from each new value, and finally converges to the optimal solution. See Non-Patent Document 1 for details of the quasi-Newton method.

(Step S44: End determination process)
The error correction model learning unit 5 compares the value of the objective function L(Λ) after the parameter update with its value before the update. If the change is at or above a predetermined amount, the error correction model learning unit 5 repeats the processing from step S42; if the change is smaller than the predetermined amount, it regards the update as having converged and executes the processing of step S45.
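The update-and-check loop of steps S42 to S44 can be illustrated with the sketch below. Note the substitution: for brevity it uses plain gradient descent with finite-difference gradients instead of the quasi-Newton method named in the text, and the function name, step size, and tolerances are hypothetical values chosen for the example.

```python
def minimize(objective, params, step=0.1, tol=1e-6, max_iter=1000, eps=1e-5):
    """Iterate until the objective changes by less than tol (gradient descent stand-in)."""
    value = objective(params)
    for _ in range(max_iter):
        # Forward-difference estimate of the gradient at the current parameters.
        grad = []
        for i in range(len(params)):
            shifted = list(params)
            shifted[i] += eps
            grad.append((objective(shifted) - value) / eps)
        params = [p - step * g for p, g in zip(params, grad)]
        new_value = objective(params)
        # End determination: stop once the change in the objective is small enough.
        if abs(value - new_value) < tol:
            return params, new_value
        value = new_value
    return params, value
```

Starting from an all-zero parameter list mirrors the initialization in step S41. A quasi-Newton method would replace the fixed step with an update built from an approximate inverse Hessian.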

(Step S45: Error correction model output process)
The error correction model learning unit 5 writes to the storage unit 8 the error correction model D10 that uses the feature weights Λ = (λ_0, λ_1, ...) obtained when the update converged.

[4.5 Step S5]
The model integration unit 6 integrates the error correction model D10 generated in step S4 with the plurality of language models D8 selected from the statistical model storage unit 31 to generate the integrated model D11.

In speech recognition using the integrated model D11 obtained by the model integration unit 6 and the acoustic model, the speech recognition score g^(w | x) of a sentence hypothesis w for the speech input (input acoustic features) x is recalculated as in the following Equation (18).

P(x | w) in Equation (18) is the posterior probability representing the acoustic likelihood of the sentence hypothesis w, that is, the acoustic score. P_k(w) and P_r(w) are the language scores of the k-th and r-th language models obtained from the statistical model storage unit 31, respectively. P_r(w) is the language score of the language model D7 used to learn the error correction model in step S4, and P_k(w) is the language score of a language model D8.
θ_k ∈ Θ is the mixture weight (model parameter) for each language model and satisfies Σ_k θ_k = 1. The model integration unit 6 estimates the model parameter Θ. Equation (18) with the estimated model parameter Θ is the integrated model being sought.
In other words, Equation (18) linearly interpolates, according to the mixture weights, the language scores of the language models D8 into the language score portion of the error correction model of Equation (15), which reflects the error tendency of the recognition errors obtained from the learning data.
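A possible shape of the interpolated score is sketched below. It assumes the form g^(w|x) = log P(x|w) + κ log( Σ_k θ_k P_k(w) ), where the r-th term is P_r(w) multiplied by exp S(w); this is inferred from the surrounding description and is not a verbatim copy of Equation (18). The function and argument names are hypothetical.

```python
import math

def integrated_score(log_am, lm_probs, thetas, error_score, r, kappa):
    """Assumed integrated score: acoustic log-score plus weighted log of the mixed LM score.

    lm_probs[k] is P_k(w); index r marks the language model used to train the
    error correction model, whose probability is scaled by exp(error_score).
    """
    if abs(sum(thetas) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    mixed = 0.0
    for k, (theta, p) in enumerate(zip(thetas, lm_probs)):
        if k == r:
            p = p * math.exp(error_score)  # apply the error correction term to P_r(w)
        mixed += theta * p
    return log_am + kappa * math.log(mixed)
```

With a single language model and weight 1, the expression reduces to the assumed error-corrected score log P(x|w) + κ (log P_r(w) + S(w)).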

Here, if P_r(w) exp Σ_i λ_i f_i(w) is redefined as P_r(w) and Equation (18) is rearranged, the speech recognition score g^(w) of the sentence hypothesis w becomes the following Equation (19).

If P(w | x) is the conditional probability of the sentence hypothesis w given the speech input x, the following Equation (20) is obtained.

Because the probabilities must sum to 1, defining the normalization term as Z(Θ) ≡ Σ_{w'} exp g^(w' | x) gives the posterior probability P(w | x) of the sentence hypothesis w as in the following Equation (21).

Here, Z(Θ) is a normalization constant and is calculated as in Equation (22).

Here, w_t ranges over all sentence hypotheses (correct answer candidates) obtained by speech recognition of the speech input x.
As with the error correction model, the model parameter Θ is obtained by solving the risk minimization problem of the following Equation (23) using the development data D5, which consists of N utterances.

Note that, so that the solution satisfies the constraint Σ_k θ_k = 1, the problem is converted into an unconstrained optimization problem using a Lagrange multiplier ν > 0.

FIG. 6 is a diagram illustrating the processing flow of the model integration process performed by the model integration unit 6.

(Step S50: Model parameter initialization process)
The model integration unit 6 initializes the model parameter Θ. Here, with K language models, the model integration unit 6 initializes θ_k = 1/K.

(Step S51: Objective function calculation process)
The model integration unit 6 reads the development data D5 from the storage unit 8. The model integration unit 6 performs speech recognition on the speech data of the development data D5 using the acoustic model D6 of the specific speaker, the language model D7, and the language models D8, and calculates the value of the objective function L(Θ) by Equation (23) using the current value of the model parameter Θ.
The sentence hypothesis w_{n,m} (m = 1, ...) is the m-th correct sentence candidate for the utterance x_n given by the speech data of the development data D5. The correct sentence w_{n,0} of the utterance x_n is obtained from the text data of the development data D5. Using the correct sentence w_{n,0} of the utterance x_n and the sentence hypothesis w_{n,m}, the model integration unit 6 calculates the Levenshtein edit distance R(w_{n,0}, w_{n,m}) in Equation (23).
The posterior probability P(w_{n,m} | x_n; Θ) in Equation (23) is the probability of obtaining the correct sentence candidate w_{n,m} given the utterance x_n, and is calculated as in the following Equation (24).

Here, g^(w_{n,m} | x_n; Θ) is calculated from Equation (18) as in the following Equation (25).

In Equation (25), P(x_n | w_{n,m}) (m = 1, ...) is the acoustic score of the sentence hypothesis w_{n,m}. P_k(w_{n,m}) is the language score of the sentence hypothesis w_{n,m} under the k-th language model, which is a language model D8, and P_r(w_{n,m}) is the language score of the sentence hypothesis w_{n,m} under the language model D7.

(Step S52: Model parameter update process)
The model integration unit 6 updates the model parameter Θ of the error correction model based on the quasi-Newton method.

(Step S53: End determination process)
The model integration unit 6 compares the objective function value after the parameter update with the value before the update. If the change is at or above a predetermined amount, it repeats the processing from step S51; if the change is smaller than the predetermined amount, it regards the update as having converged and executes the processing of step S54.

(Step S54: Integrated model output process)
The model integration unit 6 writes Equation (18) with the model parameters Θ = (θ_0, θ_1, ...) obtained when the update converged to the storage unit 8 as the integrated model D11.

[4.6 Step S5]
When speech data is input, the speech recognition unit 7 performs speech recognition in real time. The speech recognition unit 7 obtains correct sentence candidates for the input speech data and their scores using the integrated model D11 stored in the storage unit 8 for the speaker and topic of the input speech data, the acoustic model stored in the statistical model storage unit 31 for that speaker, and the language model stored in the statistical model storage unit 31 for that topic. The speech recognition unit 7 outputs speech recognition result data D12 indicating the correct sentence candidate with the best score.

[5. Effects]
According to the present embodiment, the speech recognition apparatus 1 can generate an integrated model whose error tendency reflects information such as the speaker and topic for which the recognition rate is to be improved, so recognition errors are reduced compared with conventional speech recognition.
In addition, because the speech recognition apparatus 1 integrates the plurality of language models after learning the error correction model, it can perform model learning more efficiently than before.

[6. Others]
The speech recognition apparatus 1 described above contains a computer system. The operation of the speech recognition apparatus 1 is stored in the form of a program on a computer-readable recording medium, and the processing described above is performed when the computer system reads and executes this program. The computer system here includes a CPU, various memories, an OS, and hardware such as peripheral devices.

The "computer system" also includes a homepage providing environment (or display environment) when a WWW system is used.
The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. The "computer-readable recording medium" further includes media that hold the program dynamically for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or over a communication line such as a telephone line, and media that hold the program for a certain period of time, such as volatile memory inside a computer system serving as the server or client in that case. The program may implement only part of the functions described above, or may implement those functions in combination with a program already recorded in the computer system.

DESCRIPTION OF SYMBOLS
1 Speech recognition apparatus
2 Spoken language resource management unit
21 Spoken language resource storage unit
3 Statistical model management unit
31 Statistical model storage unit
4 Recognition error generation unit
5 Error correction model learning unit
6 Model integration unit
7 Speech recognition unit
8 Storage unit

Claims

A speech recognition apparatus comprising:
a spoken language resource storage unit that stores speech data of an utterance of a specific speaker and text data that is a correct sentence corresponding to the speech data;
a statistical model storage unit that stores an acoustic model of the specific speaker and a language model for each topic;
a recognition error generation unit that performs speech recognition on the speech data using the acoustic model of the specific speaker and the language model of a specific topic, and generates a speech recognition result including a recognition error;
an error correction model learning unit that statistically analyzes the tendency of recognition errors from the speech recognition result generated by the recognition error generation unit and the correct sentence indicated by the text data, and generates an error correction model that corrects the analyzed tendency of recognition errors; and
a model integration unit that performs speech recognition, using the acoustic model of the specific speaker and a plurality of the language models, on speech data different from the speech data used for generating the error correction model; statistically calculates, based on recognition errors obtained by comparing the correct sentence candidates obtained by the speech recognition with the correct sentences corresponding to the different speech data, mixture weights for integrating the plurality of language models into the error correction model generated by the error correction model learning unit; and generates an integrated model by integrating the language model of a topic different from the specific topic into the error correction model according to the calculated mixture weights.
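
As an illustration only, the "recognition error obtained by comparing" a correct sentence candidate with the correct sentence can be sketched as a word-level edit distance. The claim does not prescribe how the comparison is made; the function name `word_errors` and the whitespace tokenization are assumptions introduced here (Japanese text would first need word segmentation).

```python
def word_errors(reference, hypothesis):
    """Count word substitutions, insertions, and deletions (Levenshtein distance).

    Assumption: both strings are already segmented into whitespace-separated words.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion of a reference word
                            curr[j - 1] + 1,       # insertion of a hypothesis word
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(word_errors("the cat sat", "the hat sat down"))
```

A count of this kind is what the later sketches treat as the per-candidate error value.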

The speech recognition apparatus according to claim 1, wherein
the model integration unit statistically calculates the mixture weights based on an evaluation value calculated by an evaluation function defined using the recognition error obtained from the correct sentence candidate, an acoustic score of the correct sentence candidate and a language score in which the recognition error tendency is corrected, both obtained by the error correction model, and a language score of the correct sentence candidate obtained from the language model of a topic different from the specific topic.
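
The claim names the inputs of the evaluation function but not its form. The following is a minimal sketch under stated assumptions: the evaluation value is taken to be the expected number of recognition errors under a softmax posterior over the candidates, there is a single different-topic language model with a scalar mixture weight, and the weight is chosen by grid search. `Candidate`, `expected_errors`, and `pick_mixture_weight` are hypothetical names, and none of these choices is confirmed by the source.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    errors: int          # recognition errors versus the correct sentence
    acoustic: float      # acoustic log score
    corrected_lm: float  # language log score with the error tendency corrected
    other_lm: float      # language log score from a different-topic language model


def expected_errors(nbest, weight):
    """Assumed evaluation function: expected errors under a softmax over candidates."""
    scores = [c.acoustic + c.corrected_lm + weight * c.other_lm for c in nbest]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return sum(c.errors * e / total for c, e in zip(nbest, exps))


def pick_mixture_weight(utterances, grid):
    """Return the grid value that minimizes the summed evaluation value."""
    return min(grid, key=lambda w: sum(expected_errors(nb, w) for nb in utterances))


if __name__ == "__main__":
    nbest = [Candidate(0, -10.0, -5.0, -1.0), Candidate(2, -10.0, -4.0, -6.0)]
    print(pick_mixture_weight([nbest], [0.0, 0.5, 1.0]))
```

Grid search is used here only because it is easy to read; a gradient-based optimizer would be the more usual choice when several language models each have their own weight.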

The speech recognition apparatus according to claim 1 or 2, wherein
the recognition error generation unit generates speech data by synthesizing, using the acoustic model of the specific speaker, the utterance content of text data corresponding to the specific topic, and generates a speech recognition result including a recognition error by performing speech recognition on the generated speech data using the acoustic model of the specific speaker and the language model of the specific topic.

The speech recognition apparatus according to any one of claims 1 to 3, wherein
the error correction model is defined using feature functions and their feature weights, the feature functions representing linguistic features based on consecutive words, phonemes constituting words, non-consecutive words, co-occurrence relationships between phonemes, syntactic information of words, or semantic information of words, and
the error correction model learning unit statistically calculates the feature weights based on an evaluation value calculated by an evaluation function defined using the values of the feature functions obtained from the speech recognition result and the recognition error included in the speech recognition result, and generates the error correction model using the calculated feature weights.
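
To make the roles of feature functions and feature weights concrete, here is a minimal sketch. It assumes only two feature types (single words and consecutive word pairs) and substitutes a structured-perceptron update for the evaluation-function-based weight estimation recited in the claim; the perceptron is a stand-in chosen for brevity, not the method of the source. The names `features`, `score`, and `train`, and the tuple layout of each candidate, are hypothetical.

```python
from collections import Counter


def features(words):
    """Feature function values: counts of single words and of consecutive word pairs."""
    feats = Counter(("w", w) for w in words)
    feats.update(("ww", a, b) for a, b in zip(words, words[1:]))
    return feats


def score(words, base_score, weights):
    """Recognizer score (acoustic plus language) corrected by the weighted features."""
    return base_score + sum(weights.get(f, 0.0) * v for f, v in features(words).items())


def train(nbest_lists, epochs=5):
    """Perceptron stand-in; each candidate is a tuple (words, base_score, errors)."""
    weights = Counter()
    for _ in range(epochs):
        for nbest in nbest_lists:
            oracle = min(nbest, key=lambda c: c[2])  # candidate with the fewest errors
            best = max(nbest, key=lambda c: score(c[0], c[1], weights))
            if best[0] != oracle[0]:
                # Move the weights toward the oracle and away from the current best.
                weights.update(features(oracle[0]))
                weights.subtract(features(best[0]))
    return weights


if __name__ == "__main__":
    data = [[(["eye", "see"], -1.0, 1), (["i", "see"], -1.5, 0)]]
    learned = train(data)
    print(score(["i", "see"], -1.5, learned) > score(["eye", "see"], -1.0, learned))
```

After training, features that occur in erroneous candidates carry negative weights, so rescoring with `score` favors the candidate with fewer errors even when the recognizer's own score ranked it lower.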

The speech recognition apparatus according to claim 1, further comprising
a speech recognition unit that performs speech recognition on speech data of an utterance on the specific topic by the specific speaker using the integrated model generated by the model integration unit.

An error correction model learning method comprising:
a spoken language resource storage step of storing speech data of an utterance of a specific speaker and text data that is a correct sentence corresponding to the speech data;
a statistical model storage step of storing an acoustic model of the specific speaker and a language model for each topic;
a recognition error generation step of performing speech recognition on the speech data using the acoustic model of the specific speaker and the language model of a specific topic, and generating a speech recognition result including a recognition error;
an error correction model learning step of statistically analyzing the tendency of recognition errors from the speech recognition result generated in the recognition error generation step and the correct sentence indicated by the text data, and generating an error correction model that corrects the analyzed tendency of recognition errors; and
a model integration step of performing speech recognition, using the acoustic model of the specific speaker and a plurality of the language models, on speech data different from the speech data used for generating the error correction model; statistically calculating, based on recognition errors obtained by comparing the correct sentence candidates obtained by the speech recognition with the correct sentences corresponding to the different speech data, mixture weights for integrating the plurality of language models into the error correction model generated in the error correction model learning step; and generating an integrated model by integrating the language model of a topic different from the specific topic into the error correction model according to the calculated mixture weights.

A program for causing a computer to function as a speech recognition apparatus comprising:
spoken language resource storage means for storing speech data of an utterance of a specific speaker and text data that is a correct sentence corresponding to the speech data;
statistical model storage means for storing an acoustic model of the specific speaker and a language model for each topic;
recognition error generation means for performing speech recognition on the speech data using the acoustic model of the specific speaker and the language model of a specific topic, and generating a speech recognition result including a recognition error;
error correction model learning means for statistically analyzing the tendency of recognition errors from the speech recognition result generated by the recognition error generation means and the correct sentence indicated by the text data, and generating an error correction model that corrects the analyzed tendency of recognition errors; and
model integration means for performing speech recognition, using the acoustic model of the specific speaker and a plurality of the language models, on speech data different from the speech data used for generating the error correction model; statistically calculating, based on recognition errors obtained by comparing the correct sentence candidates obtained by the speech recognition with the correct sentences corresponding to the different speech data, mixture weights for integrating the plurality of language models into the error correction model generated by the error correction model learning means; and generating an integrated model by integrating the language model of a topic different from the specific topic into the error correction model according to the calculated mixture weights.