JP6518142B2 - Language model generation device and program thereof

Info

Publication number: JP6518142B2
Application number: JP2015122789A
Authority: JP
Inventors: 和穂尾上
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2015-06-18
Filing date: 2015-06-18
Publication date: 2019-05-22
Anticipated expiration: 2035-06-18
Also published as: JP2017009691A

Description

The present invention relates to a language model generation device that mixes a plurality of language models to generate a new language model, and to a program therefor.

Conventionally, there have been methods that mix a plurality of language models (statistical language models), each generated from an independent training corpus, in order to improve speech recognition accuracy (see, for example, Patent Document 1).
These methods linearly interpolate the language models. The linear interpolation coefficients (mixture weights) are either chosen to maximize the generation probability of evaluation text similar to the speech recognition target (utterance content, etc.) or obtained by Bayesian learning.

A typical conventional method of mixing language models is described here with reference to FIG. 6.
As shown in FIG. 6, two language models (a global language model 20 and a topic-dependent language model 40) are mixed. The global language model 20 is trained in advance on large-scale training data (large-scale corpus 200). The topic-dependent language model 40 is trained in advance on small-scale training data (topic-dependent small-scale corpus 400) that depends on the topic of the speech recognition target.

For example, in the conventional method described as background art in Patent Document 1 (the first conventional method), evaluation text H similar to the topic of the speech recognition target is used to calculate the mixture weight λ between the global language model 20 and the topic-dependent language model 40 by maximum likelihood estimation, that is, so that the generation probability of the evaluation text H is maximized. Linear interpolation means M then performs a weighted addition (linear interpolation) of the global language model 20 and the topic-dependent language model 40 using the calculated mixture weight λ to generate a mixed language model 80.
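As a rough illustration of the first conventional method, the sketch below interpolates two probabilities with a weight λ and picks λ by a simple grid search over the log-likelihood of the evaluation data. The function names, the grid search (standing in for whatever maximum likelihood procedure is actually used), and the convention that λ multiplies the global model are assumptions made for illustration, not details taken from Patent Document 1.

```python
import math
from typing import List, Sequence, Tuple


def interpolate(p_global: float, p_topic: float, lam: float) -> float:
    """Weighted addition of two conditional probabilities P(w | h)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_global + (1.0 - lam) * p_topic


def log_likelihood(pairs: Sequence[Tuple[float, float]], lam: float) -> float:
    """Log-likelihood of the evaluation words under the interpolated model.

    Each pair holds (P_global, P_topic) for one word of the evaluation text.
    Probabilities are assumed to be smoothed, i.e. strictly positive.
    """
    return sum(math.log(interpolate(pg, pt, lam)) for pg, pt in pairs)


def best_lambda(pairs: Sequence[Tuple[float, float]], steps: int = 100) -> float:
    """Pick the weight on a uniform grid that maximizes the log-likelihood."""
    candidates: List[float] = [k / steps for k in range(steps + 1)]
    return max(candidates, key=lambda lam: log_likelihood(pairs, lam))
```

If the topic-dependent model assigns a higher probability to every evaluation word, `best_lambda` returns 0.0, which puts all the weight on the topic-dependent model. That is the kind of behaviour that makes the choice of evaluation text important.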

In the other method, which is the invention described in Patent Document 1 (the second conventional method), the linear interpolation coefficients are obtained by Bayesian learning using less evaluation text H than the first conventional method requires, and the mixed language model 80 is generated by linear interpolation.
When a speech recognition apparatus 100 performs speech recognition using the mixed language model 80 generated in this way, recognition accuracy for a specific topic is higher than when only the global language model 20 is used.

The first conventional method suppresses overfitting by selecting evaluation text H that differs as much as possible from the corpora (large-scale corpus 200 and topic-dependent small-scale corpus 400). The second conventional method suppresses overfitting further by using Bayesian learning to reduce the amount of evaluation text H below that of the first conventional method.
Here, overfitting refers to the state in which, because evaluation text is present in the corpus, the occurrence probabilities of the language model depend strongly on that text (training data), so that the expected occurrence probabilities are not obtained for other text (unseen data).

Patent Document 1: JP 2005-84179 A

As described above, the first and second conventional methods suppress overfitting by selecting evaluation text that differs as much as possible from the corpus, or by using Bayesian learning so that only a small amount of evaluation text is needed.
However, the larger the corpus, the harder it is in practice to select evaluation text that excludes every sentence already present in the corpus. The conventional methods may therefore use sentences from the corpus as evaluation text, and that evaluation text then causes overfitting.

The present invention has been made in view of this problem. Its object is to provide a language model generation device, and a program therefor, that can mix language models while suppressing overfitting by evaluating in advance whether each text used as evaluation text is appropriate for that purpose and then using appropriate evaluation text.

To solve the above problem, the language model generation device according to the present invention mixes a topic-dependent language model, trained in advance on a training corpus related to the topic of a speech recognition target, with a global language model, trained in advance on a training corpus having a larger amount of data than that training corpus, and generates a mixed language model for the speech recognition target. The device comprises evaluation weight generation means, first mixture weight generation means, first linear interpolation means, second mixture weight generation means, and second linear interpolation means.

In this configuration, the evaluation weight generation means calculates, for the global language model, an evaluation value for evaluating a language model (for example, perplexity) over the whole of a previously selected evaluation text related to the topic of the speech recognition target, as an overall evaluation value.
The evaluation weight generation means further calculates, for the global language model, an evaluation value for each text segment obtained by dividing the evaluation text according to a predetermined classification, as an individual evaluation value.

The evaluation weight generation means then generates, for each text segment, an evaluation weight representing how suitable the segment is as evaluation text. Specifically, if the global language model is rated lower on a text segment than on the evaluation text as a whole, the evaluation weight of that segment is made larger; if it is rated higher, the evaluation weight is made smaller. The reason is that when the global language model is rated higher on a text segment, the segment is likely to be already contained in the training corpus used to train the global language model, and down-weighting it prevents overfitting.
By evaluating the evaluation text segment by segment in this way, the language model generation device can assess the degree of overfitting for each segment of the evaluation text.

Next, the first mixture weight generation means takes a plurality of individual language models, each trained in advance on one of the individual training corpora that make up the training corpus used to train the global language model, and generates the mixture weights that maximize the log-likelihood obtained when these models are linearly interpolated, with the evaluation weight applied to each text segment.
Because the evaluation weights are taken into account when determining the proportions of the linear interpolation, the first mixture weight generation means can generate mixture weights that suppress overfitting while acting to raise the generation probability of the text segments in the evaluation text.

The first linear interpolation means then linearly interpolates the plurality of individual language models in the proportions given by the mixture weights generated by the first mixture weight generation means, and generates a mixed global language model. This mixed global language model has higher word connection probabilities for the topic of the speech recognition target than the global language model.

The second mixture weight generation means generates the mixture weights that maximize the log-likelihood obtained when the mixed global language model and the topic-dependent language model are linearly interpolated, with the evaluation weight applied to each text segment.
Because the evaluation weights are taken into account when determining the proportions of the linear interpolation, the second mixture weight generation means can likewise generate mixture weights that suppress overfitting while acting to raise the generation probability of the text segments in the evaluation text.

The second linear interpolation means then linearly interpolates the mixed global language model and the topic-dependent language model in the proportions given by the mixture weights generated by the second mixture weight generation means, thereby generating the mixed language model for the speech recognition target.

Also to solve the above problem, a language model generation device according to the present invention mixes a topic-dependent language model, trained in advance on a training corpus related to the topic of a speech recognition target, with a global language model, trained in advance on a training corpus having a larger amount of data than that training corpus, and generates a mixed language model for the speech recognition target. The device comprises evaluation weight generation means, mixture weight generation means, and linear interpolation means.

The evaluation weight generation means generates, for each text segment, an evaluation weight representing how suitable the segment is as evaluation text. Specifically, if the global language model is rated lower on a text segment than on the evaluation text as a whole, the evaluation weight of that segment is made larger; if it is rated higher, the evaluation weight is made smaller.
By evaluating the evaluation text segment by segment in this way, the language model generation device can assess the degree of overfitting for each segment of the evaluation text.

The mixture weight generation means then generates the mixture weights that maximize the log-likelihood obtained when the global language model and the topic-dependent language model are linearly interpolated, with the evaluation weight applied to each text segment.
Because the evaluation weights are taken into account when determining the proportions of the linear interpolation, the mixture weight generation means can generate mixture weights that suppress overfitting while acting to raise the generation probability of the text segments in the evaluation text.

The linear interpolation means then linearly interpolates the global language model and the topic-dependent language model in the proportions given by the mixture weights generated by the mixture weight generation means, and generates the mixed language model for the speech recognition target.
As a result, when mixing an existing global language model with a small-scale topic-dependent language model for the speech recognition target, the language model generation device can suppress overfitting and generate a language model suited to the speech recognition target.
The language model generation device can be implemented by running, on a computer, a language model generation program that causes the computer to function as each of the means described above.

The present invention provides the following advantageous effects.
According to the present invention, for the language models to be mixed, each previously divided segment of the evaluation text is evaluated as to whether it would cause overfitting, an evaluation weight is generated from that evaluation, and the evaluation weights are used to calculate the mixture weights for mixing the language models. The present invention can therefore generate a language model while suppressing overfitting. Using the resulting language model for speech recognition makes it possible to achieve higher recognition accuracy than before.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram showing the configuration of the language model generation device according to the first embodiment of the present invention.
FIG. 2 is an explanatory diagram outlining the language models used by the language model generation device according to the first embodiment of the present invention, in which (a) shows the language models trained on the large-scale corpus and (b) shows the language model trained on the topic-dependent small-scale corpus.
FIG. 3 is a configuration diagram of a speech recognition system that performs speech recognition using the language model generated by the language model generation device according to the first embodiment of the present invention.
FIG. 4 is a flowchart showing the operation of the language model generation device according to the first embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of the language model generation device according to the second embodiment of the present invention.
FIG. 6 is an explanatory diagram illustrating a conventional method of mixing language models.

Embodiments of the present invention (first and second embodiments) are described below with reference to the drawings.
In the first embodiment, a plurality of language models, each trained on one of the source materials that make up the large-scale corpus, are mixed so as to suit the language of the speech recognition target, and a language model trained on a small-scale corpus and dependent on the topic of the speech recognition target is then mixed in.
The second embodiment simplifies the first embodiment: a single language model trained on the large-scale corpus is mixed with a language model, trained on a small-scale corpus, that depends on the topic of the speech recognition target.

Here, a language model is a probabilistic model (statistical language model) that assigns to an arbitrary character string the probability that it is a sentence. The language model is, for example, an N-gram language model, which, as shown in equation (1) below, is given by the conditional probability (N-gram probability) that a word w_i appears after the word string w_1 w_2 … w_{i-1}.

In FIG. 1 and elsewhere, equation (1) is abbreviated as P(w|h), where h is the word string that appears immediately before the word w.
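For reference, the N-gram approximation that equation (1) describes is conventionally written as below. This is the standard textbook form, given here as an assumption consistent with the notation P(w_i | w_{i-N+1} … w_{i-1}) used for equation (2); the patent's own rendering of equation (1) may differ.

```latex
P(w_i \mid w_1 w_2 \cdots w_{i-1}) \approx P(w_i \mid w_{i-N+1} \cdots w_{i-1})
```

With h = w_{i-N+1} … w_{i-1}, the right-hand side is the quantity abbreviated as P(w|h).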

≪First Embodiment≫
[Configuration of the Language Model Generation Device]
First, the configuration of a language model generation device 1 according to the first embodiment of the present invention is described with reference to FIG. 1.

The language model generation device 1 mixes a topic-dependent language model 40 with a plurality of individual language models (here, as an example, a manuscript language model 30, a subtitle language model 31, and a transcription language model 32) and generates a language model for the speech recognition target (mixed language model 50). The topic-dependent language model 40 is trained in advance on a training corpus related to the topic of the speech recognition target. Each individual language model is trained in advance on one of the independent training corpora that make up a training corpus (large-scale corpus) having a larger amount of data than that training corpus.

In FIG. 1, the global language model 20 is a language model trained on a single training corpus (the large-scale corpus) that combines the training corpora from which the manuscript language model 30, the subtitle language model 31, and the transcription language model 32 were generated.
Also in FIG. 1, the mixed global language model 21 is an intermediate language model that the language model generation device 1 generates by mixing the manuscript language model 30, the subtitle language model 31, and the transcription language model 32.
Training a language model means obtaining the probabilities of equation (1) from the training corpus by a general technique such as maximum likelihood estimation; a detailed description is omitted here.

The relationship among the language models mixed by the language model generation device 1 is described here with reference to FIG. 2.
As shown in FIG. 2(a), the global language model 20 is a language model trained in advance on the training data (individual training corpora) of "manuscripts", "subtitles", and "transcriptions" contained in the large-scale corpus 200. "Manuscripts" are, for example, the manuscript data of broadcast programs such as news. "Subtitles" are the subtitle data attached to broadcast programs. "Transcriptions" are data produced by manually transcribing the audio of programs as actually broadcast. The large-scale corpus 200 is an accumulation of such data (training data) over, for example, several years.
Also as shown in FIG. 2(a), the manuscript language model 30 is a language model trained in advance on the "manuscripts" contained in the large-scale corpus 200. The subtitle language model 31 is a language model trained in advance on the "subtitles" contained in the large-scale corpus 200. The transcription language model 32 is a language model trained in advance on the "transcriptions" contained in the large-scale corpus 200.

As shown in FIG. 2(b), the topic-dependent language model 40 is a language model trained in advance on the topic-dependent small-scale corpus 400. The topic-dependent small-scale corpus 400 is training data similar to the topic of the speech recognition target. For example, when the speech recognition target is the audio of a sports program, the topic-dependent small-scale corpus 400 is training data such as transcriptions of sports programs broadcast in the past.
Returning to FIG. 1, the configuration of the language model generation device 1 is now described in detail.

As shown in FIG. 1, the language model generation device 1 comprises evaluation weight generation means 10, mixture weight generation means (first mixture weight generation means 11A and second mixture weight generation means 11B), and linear interpolation means (first linear interpolation means 12A and second linear interpolation means 12B).

The evaluation weight generation means 10 evaluates the global language model 20 using each text (word string) that makes up the evaluation text H, and generates, as an evaluation weight, an indication of whether each text of the evaluation text H is appropriate as evaluation text.
The evaluation text H is text selected because its content is related (similar) to the topic of the speech recognition target. For example, when the speech recognition target is a broadcast program that provides certain information (an information program), transcriptions of past episodes of the same information program may be used as the evaluation text H. Here, the evaluation text input to the evaluation weight generation means 10 is H = {h_1, …, h_c, …}, where each h_c consists of one or more sentences. For example, each h_c may be the transcription of one segment (corner) of the information program. That is, the evaluation text H is assumed to be divided into predetermined units, for example into single sentences or into groups of one or more sentences (text segments) divided according to a predetermined classification.

The evaluation weight generation means 10 evaluates the global language model 20 by perplexity (average branching factor) using the evaluation text H. Perplexity indicates the average number of words that can follow a given word; the smaller the value, the more accurate the language model is judged to be.

Specifically, using equation (2) below, the evaluation weight generation means 10 calculates a perplexity PP over the whole evaluation text H, {h_1, …, h_c, …} (the overall evaluation value, overall perplexity PP_all), and a perplexity PP for each individual text (text segment) {h_c} of the evaluation text H (the individual evaluation value, individual perplexity PP_c).

In equation (2), P_global(w_i | w_{i-N+1} … w_{i-1}) denotes the conditional probability (N-gram probability) of the global language model 20. Further, n is the number of words in the whole evaluation text H when calculating the overall perplexity PP_all, and the number of words in the individual text segment {h_c} when calculating the individual perplexity PP_c.
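The perplexity calculation can be sketched as follows, assuming equation (2) takes the standard form PP = exp(-(1/n) Σ log P_global(w_i | w_{i-N+1} … w_{i-1})). The function name and the list-of-probabilities input are hypothetical; an actual implementation would obtain each probability from the global language model 20.

```python
import math
from typing import Sequence


def perplexity(ngram_probs: Sequence[float]) -> float:
    """Perplexity of a word sequence from its per-word N-gram probabilities.

    ngram_probs[i] is assumed to be P_global(w_i | history) for the i-th word.
    Probabilities must be strictly positive (i.e. the model is smoothed).
    """
    n = len(ngram_probs)
    if n == 0:
        raise ValueError("at least one word is required")
    log_sum = sum(math.log(p) for p in ngram_probs)
    return math.exp(-log_sum / n)
```

Passing the probabilities of every word in H gives PP_all, and passing only the words of one segment h_c gives PP_c. A model that assigns probability 0.25 to every word has a perplexity of 4, matching the "average number of following words" reading.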

The evaluation weight generation means 10 then compares the overall perplexity PP_all with each individual perplexity PP_c. If the individual perplexity PP_c is larger, that is, if the global language model 20 is rated lower on the text segment than on the evaluation text H as a whole, the evaluation weight generation means 10 increases the weight of the corresponding text segment {h_c} as evaluation text. Conversely, if the global language model 20 is rated higher on the text segment than on the evaluation text H as a whole, the evaluation weight generation means 10 decreases the weight of the text segment {h_c} as evaluation text.
For example, as shown in equation (3) below, the evaluation weight generation means 10 sets the evaluation weight α_c for the text segment {h_c} to "1" if PP_c > PP_all, and to "0" if PP_c ≤ PP_all.
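The binary weighting rule (α_c = 1 if PP_c > PP_all, otherwise 0) can be sketched as below. The helper names and the nested-list input are illustrative assumptions, and perplexity is computed in the standard exp-of-average-negative-log form, which is assumed rather than taken from equation (2).

```python
import math
from typing import List, Sequence


def perplexity(probs: Sequence[float]) -> float:
    """Standard perplexity from strictly positive per-word probabilities."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


def evaluation_weights(segment_probs: Sequence[Sequence[float]]) -> List[int]:
    """Return alpha_c for each segment h_c of the evaluation text H.

    segment_probs[c] holds the global-model N-gram probabilities of the
    words in segment h_c (hypothetical data layout).
    """
    all_probs = [p for seg in segment_probs for p in seg]
    pp_all = perplexity(all_probs)
    # A segment the global model predicts unusually well (PP_c <= PP_all)
    # is suspected of being in the training corpus and gets weight 0.
    return [1 if perplexity(seg) > pp_all else 0 for seg in segment_probs]
```

For instance, with one segment of per-word probability 0.5 (PP_c = 2) and one of per-word probability 0.1 (PP_c = 10), the overall perplexity lies between the two, so only the second segment is kept as evaluation text.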

The evaluation weight generation means 10 associates an evaluation weight with each text segment of the evaluation text H and outputs the result, as weighted evaluation text, to the first mixture weight generation means 11A and the second mixture weight generation means 11B.

The first mixture weight generation means 11A uses the weighted evaluation text (evaluation text and evaluation weights) generated by the evaluation weight generation means 10 to generate the weighting coefficients (mixture weights) for mixing a plurality of language models (the manuscript language model 30, the subtitle language model 31, and the transcription language model 32). It calculates the mixture weight of each language model so that the log-likelihood of the weighted evaluation text is maximized.

Specifically, the first mixture weight generation means 11A calculates the mixture weight λ_genko of the manuscript language model 30, the mixture weight λ_jimaku of the subtitle language model 31, and the mixture weight λ_kakiokoshi of the transcription language model 32 that maximize the log-likelihood L of equation (4) below.

In equation (4), c is an index identifying the divided sentence h_c of the evaluation sentences H = {h_1, …, h_c, …}, and C is the total number of divided sentences. P_genko(w_i^c | w_{i-N+1}^c … w_{i-1}^c) is the conditional probability (N-gram probability) of the manuscript language model 30 for the divided sentence h_c. P_jimaku(w_i^c | w_{i-N+1}^c … w_{i-1}^c) is the conditional probability (N-gram probability) of the subtitle language model 31 for the divided sentence h_c. P_kakiokoshi(w_i^c | w_{i-N+1}^c … w_{i-1}^c) is the conditional probability (N-gram probability) of the transcription language model 32 for the divided sentence h_c. Further, n^c is the number of words in the divided sentence h_c, and w^c denotes a word of the divided sentence h_c.
The first mixing weight generation means 11A can generate (calculate) the mixing weights λ (λ_genko, λ_jimaku, λ_kakiokoshi) that maximize equation (4) by the EM algorithm or the like.
The first mixing weight generation means 11A outputs the generated mixing weights λ (λ_genko, λ_jimaku, λ_kakiokoshi) to the first linear sum interpolation means 12A.
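The text names the EM algorithm without giving its steps. The following is a minimal sketch of the standard EM update for interpolation weights, extended so that each word's contribution is scaled by the evaluation weight α_c of its divided sentence. The function name, the nested-list input layout, and the iteration count are illustrative assumptions, not part of the source.

```python
def em_mixing_weights(probs, alphas, n_iter=50):
    """Estimate mixing weights by EM on an evaluation-weighted log likelihood.

    probs[c][i][m]: probability that model m assigns to word i of divided
                    sentence c (hypothetical precomputed N-gram probabilities).
    alphas[c]:      evaluation weight of divided sentence c.
    Returns one mixing weight per model; the weights sum to 1.
    """
    n_models = len(probs[0][0])
    lam = [1.0 / n_models] * n_models      # start from uniform weights
    for _ in range(n_iter):
        counts = [0.0] * n_models          # weighted posterior mass per model
        total = 0.0                        # weighted number of words used
        for sentence, alpha in zip(probs, alphas):
            if alpha == 0:
                continue                   # sentence excluded from estimation
            for word_probs in sentence:
                mix = sum(l * p for l, p in zip(lam, word_probs))
                if mix == 0.0:
                    continue               # no model covers this word
                # E-step: posterior share of each model for this word
                for m in range(n_models):
                    counts[m] += alpha * lam[m] * word_probs[m] / mix
                total += alpha
        if total == 0.0:
            return lam                     # nothing to learn from
        # M-step: renormalize the accumulated shares
        lam = [c / total for c in counts]
    return lam
```

With binary evaluation weights, divided sentences whose weight is “0” simply drop out of the estimate, which matches the stated aim of not fitting the weights to material the global model already handles well.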

The first linear sum interpolation means 12A mixes a plurality of language models (the manuscript language model 30, the subtitle language model 31, and the transcription language model 32) using the mixing weights λ generated by the first mixing weight generation means 11A.

Specifically, as shown in the following equation (5), the first linear sum interpolation means 12A generates the mixed global language model 21 by performing weighted addition (linear sum interpolation) of the N-gram probabilities of the language models being mixed, for each identical word w_i, using the mixing weights λ (λ_genko, λ_jimaku, λ_kakiokoshi). Here, P_mix(w_i | w_{i-N+1} … w_{i-1}) is the N-gram probability of the mixed global language model 21 that is generated.
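The weighted addition described for equation (5) can be sketched as follows. This is an illustration under simplifying assumptions: each model is a plain dictionary keyed by a hypothetical `(history, word)` pair, and an N-gram missing from a model is treated as probability 0, whereas a real back-off N-gram model would supply a back-off probability instead.

```python
def interpolate(models, weights):
    """Linear sum interpolation of N-gram probability tables.

    models:  list of dicts mapping (history, word) -> probability.
    weights: one mixing weight per model, assumed to sum to 1.
    Returns a dict with the weighted sum for every N-gram seen in any model.
    """
    mixed = {}
    for model, lam in zip(models, weights):
        for ngram, p in model.items():
            # add this model's weighted share for the same N-gram
            mixed[ngram] = mixed.get(ngram, 0.0) + lam * p
    return mixed


# Usage with toy tables standing in for the three individual models
genko = {(("a",), "b"): 0.4}
jimaku = {(("a",), "b"): 0.8, (("a",), "c"): 0.4}
kakiokoshi = {}
p_mix = interpolate([genko, jimaku, kakiokoshi], [0.5, 0.25, 0.25])
```

The same function covers the second mixing stage, because only the list of models and weights changes.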

The mixed global language model 21 generated in this way assigns higher conditional probabilities to the expressions of the speech recognition target than the global language model 20 does.
The first linear sum interpolation means 12A writes the generated mixed global language model 21 to storage means (not shown). The mixed global language model 21 is referred to by the second mixing weight generation means 11B and the second linear sum interpolation means 12B described later.

The second mixing weight generation means 11B uses the weighted evaluation sentences (evaluation sentences and evaluation weights) generated by the evaluation weight generation means 10 to generate weighting factors (mixing weights) for mixing a plurality of language models (the mixed global language model 21 and the topic-dependent language model 40). The second mixing weight generation means 11B calculates the mixing weight of each language model so that the log likelihood of the weighted evaluation sentences is maximized.
The method of generating mixing weights in the second mixing weight generation means 11B is the same as in the first mixing weight generation means 11A, except that the language models being mixed are different.

Specifically, the second mixing weight generation means 11B calculates the mixing weight λ_mix of the mixed global language model 21 and the mixing weight λ_wadai of the topic-dependent language model 40 that maximize the log likelihood L of the following equation (6).
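Equation (6) is not reproduced in this text either. Since the generation method is stated to be the same as for the first mixing weight generation means 11A with only the models changed, a plausible form is the same evaluation-weighted log likelihood with two terms. This is a reconstruction under that assumption, with the sum-to-1 constraint again assumed.

```latex
L = \sum_{c=1}^{C} \alpha_c \sum_{i=1}^{n^c}
    \log \Bigl(
        \lambda_{mix}\, P_{mix}(w_i^c \mid w_{i-N+1}^c \dots w_{i-1}^c)
      + \lambda_{wadai}\, P_{wadai}(w_i^c \mid w_{i-N+1}^c \dots w_{i-1}^c)
    \Bigr),
\qquad
\lambda_{mix} + \lambda_{wadai} = 1
```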

In equation (6), P_mix(w_i^c | w_{i-N+1}^c … w_{i-1}^c) is the conditional probability (N-gram probability) of the mixed global language model 21 for the divided sentence h_c. P_wadai(w_i^c | w_{i-N+1}^c … w_{i-1}^c) is the conditional probability (N-gram probability) of the topic-dependent language model 40 for the divided sentence h_c. The other variables are the same as in equation (4).
The second mixing weight generation means 11B outputs the generated mixing weights λ (λ_mix, λ_wadai) to the second linear sum interpolation means 12B.

The second linear sum interpolation means 12B mixes a plurality of language models (the mixed global language model 21 and the topic-dependent language model 40) using the mixing weights λ generated by the second mixing weight generation means 11B.
The mixing method in the second linear sum interpolation means 12B is the same as in the first linear sum interpolation means 12A, except that the language models being mixed are different.

Specifically, as shown in the following equation (7), the second linear sum interpolation means 12B generates the mixed language model 50 by performing weighted addition (linear sum interpolation) of the N-gram probabilities of the language models being mixed, for each identical word w_i, using the mixing weights λ (λ_mix, λ_wadai). Here, P_mix2(w_i | w_{i-N+1} … w_{i-1}) is the N-gram probability of the mixed language model 50 that is generated.
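Equation (7) is not reproduced in this text. From the description, it is the two-model weighted sum in the first line below. The second line is an added worked step, not part of the source: it assumes that equation (5) has the three-term weighted-sum form described for the first linear sum interpolation means 12A and substitutes it in, which shows the effective weight that each of the four underlying models receives. The history w_{i-N+1} … w_{i-1} is abbreviated as h to keep the lines short.

```latex
\begin{aligned}
P_{mix2}(w_i \mid h)
  &= \lambda_{mix}\, P_{mix}(w_i \mid h) + \lambda_{wadai}\, P_{wadai}(w_i \mid h) \\
  &= \lambda_{mix}\lambda_{genko}\, P_{genko}(w_i \mid h)
   + \lambda_{mix}\lambda_{jimaku}\, P_{jimaku}(w_i \mid h)
   + \lambda_{mix}\lambda_{kakiokoshi}\, P_{kakiokoshi}(w_i \mid h)
   + \lambda_{wadai}\, P_{wadai}(w_i \mid h)
\end{aligned}
```

If the weights of each stage sum to 1, the four effective weights in the second line also sum to 1.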

As a result, the mixed language model 50 further raises the conditional probabilities for the topic of the speech recognition target, relative to the mixed global language model 21, which already raises the conditional probabilities for the expressions of the speech recognition target.
The mixed language model 50 generated by the language model generation device 1 can be used in a general speech recognition device. In that case, for example, as shown in FIG. 3, the speech recognition device 100 recognizes speech using the mixed language model 50 generated by the language model generation device 1 together with an existing pronunciation dictionary 60 and acoustic model 70, and outputs the recognition result.

With the language model generation device 1 configured as described above, when a plurality of language models are mixed, evaluation weights are assigned to the evaluation sentences H and used in calculating the mixing weights, so overfitting can be suppressed.
In addition, because the mixed global language model 21 already has higher conditional probabilities for the expressions of the speech recognition target, the language model generation device 1 can raise those conditional probabilities further than when the topic-dependent language model 40 is mixed directly into the global language model 20, as in the second embodiment (FIG. 5) described later.

The language model generation device 1 can be realized by running, on a computer (not shown), a program (language model generation program) that causes the computer to function as the evaluation weight generation means 10, the first mixing weight generation means 11A, the first linear sum interpolation means 12A, the second mixing weight generation means 11B, and the second linear sum interpolation means 12B.

[Operation of the language model generation device]
Next, the operation of the language model generation device 1 according to the first embodiment of the present invention is described with reference to FIG. 4 (and to FIG. 1 for the configuration, as appropriate).
First, the language model generation device 1 uses the evaluation weight generation means 10 to generate the evaluation weights of the evaluation sentences H from the global language model 20 learned from the large-scale corpus (step S1).
Specifically, the evaluation weight generation means 10 calculates the perplexity PP over the whole of the evaluation sentences H, {h_1, …, h_c, …} (overall perplexity PP_all), and the perplexity PP for each individual sentence {h_c} of the evaluation sentences H (individual perplexity PP_c) (see equation (2)). The evaluation weight generation means 10 then generates the evaluation weights so that, if the individual perplexity PP_c is larger than the overall perplexity PP_all, the weight of the corresponding individual sentence {h_c} as an evaluation sentence is increased, and otherwise the weight of the individual sentence {h_c} as an evaluation sentence is reduced (see equation (3)).
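Step S1 can be sketched as follows. Equation (2) does not appear in this section, so the sketch assumes the common definition of perplexity as 2 raised to the negative mean log2 word probability. The function names and the input layout (per-word log2 probabilities from the global language model, grouped by divided sentence) are hypothetical.

```python
def perplexity(log2_probs):
    """Perplexity from per-word log2 probabilities (assumed definition)."""
    return 2.0 ** (-sum(log2_probs) / len(log2_probs))


def evaluation_weights(sentence_log2_probs):
    """Binary evaluation weights in the manner described for equation (3).

    sentence_log2_probs[c] holds the global model's log2 probability of each
    word in divided sentence h_c.
    Returns (PP_all, [PP_c, ...], [alpha_c, ...]).
    """
    # overall perplexity: all words of all divided sentences pooled together
    all_words = [p for sent in sentence_log2_probs for p in sent]
    pp_all = perplexity(all_words)
    # individual perplexity of each divided sentence
    pp_each = [perplexity(sent) for sent in sentence_log2_probs]
    # weight 1 where the global model does worse than it does overall
    alphas = [1 if pp_c > pp_all else 0 for pp_c in pp_each]
    return pp_all, pp_each, alphas
```

For example, with two divided sentences whose per-word log2 probabilities are `[-1, -1]` and `[-3, -3]`, the individual perplexities are 2 and 8, the overall perplexity is 4, and only the second sentence receives weight 1.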

Then, the language model generation device 1 uses the first mixing weight generation means 11A and the evaluation weights generated in step S1 to generate the mixing weights λ (λ_genko, λ_jimaku, λ_kakiokoshi) for mixing the manuscript language model 30, the subtitle language model 31, and the transcription language model 32, which were learned from the manuscript, subtitle, and transcription learning data contained in the large-scale corpus, respectively (step S2).
Specifically, the first mixing weight generation means 11A calculates the mixing weights of the manuscript language model 30, the subtitle language model 31, and the transcription language model 32 so that the log likelihood of the evaluation sentences is maximized (see equation (4)).

Then, the language model generation device 1 uses the first linear sum interpolation means 12A to mix the manuscript language model 30, the subtitle language model 31, and the transcription language model 32 with the mixing weights calculated in step S2, generating the mixed global language model 21 (step S3).
Specifically, the first linear sum interpolation means 12A generates the mixed global language model 21 by performing weighted addition (linear sum interpolation) of the N-gram probabilities using the mixing weights λ (λ_genko, λ_jimaku, λ_kakiokoshi) of the manuscript language model 30, the subtitle language model 31, and the transcription language model 32 calculated in step S2 (see equation (5)).

Then, the language model generation device 1 uses the second mixing weight generation means 11B and the evaluation weights generated in step S1 to generate the mixing weights λ (λ_mix, λ_wadai) for mixing the mixed global language model 21 generated in step S3 with the topic-dependent language model 40 learned from the topic-dependent small-scale corpus (step S4).
Specifically, the second mixing weight generation means 11B calculates the mixing weights of the mixed global language model 21 and the topic-dependent language model 40 so that the log likelihood of the evaluation sentences is maximized (see equation (6)).

Then, the language model generation device 1 uses the second linear sum interpolation means 12B to mix the mixed global language model 21 and the topic-dependent language model 40 with the mixing weights calculated in step S4, generating the mixed language model 50 (step S5).
Specifically, the second linear sum interpolation means 12B generates the mixed language model 50 by performing weighted addition (linear sum interpolation) of the N-gram probabilities using the mixing weights λ (λ_mix, λ_wadai) of the mixed global language model 21 and the topic-dependent language model 40 calculated in step S4 (see equation (7)).
Through the above operation, the language model generation device 1 can generate a language model that suppresses overfitting and improves recognition accuracy for the speech recognition target.

[Performance evaluation]
Next, the results of evaluating the language model generation device 1 are described.
The corpora (manuscripts, subtitles, and transcriptions) constituting the large-scale corpus 200 (see FIG. 2) used in this evaluation, and the topic-dependent small-scale corpus 400, are data used in past broadcast programs, with the corpus sizes shown in [Table 1] below.

In the language model generation device 1, the manuscript language model 30, the subtitle language model 31, the transcription language model 32, and the topic-dependent language model 40, each learned from the corpora shown in [Table 1], are mixed.

First, the effect of mixing language models (here, the manuscript language model 30, the subtitle language model 31, and the transcription language model 32) using the evaluation weights generated by the evaluation weight generation means 10 is described.
[Table 2] below shows the perplexity values of three models: the global language model 20 (P_global(w|h)), generated by simply learning from the large-scale corpus; a language model (P_mixtest(w|h)) generated without evaluation weights, that is, with the evaluation weight α_c in equation (4) always set to “1”; and the mixed global language model 21 (P_mix(w|h)), generated using the evaluation weights of the present invention.

As [Table 2] shows, the mixed global language model 21 (P_mix(w|h)), which the language model generation device 1 according to the present invention generates by mixing with evaluation weights for the evaluation sentences H, has a lower perplexity than the other language models (P_global(w|h) and P_mixtest(w|h)), indicating that a more accurate language model was generated.

Next, the accuracy of speech recognition using the language model (mixed language model 50) generated by the language model generation device 1 is described.
[Table 3] below shows the word error rate for speech recognition using the mixed language model 50 (P_mix2(w|h)), generated with the evaluation weights of the present invention, and the word error rate for speech recognition using a language model (P_mix2test(w|h)) generated without evaluation weights, that is, with the evaluation weight α_c in equation (4) always set to “1”.

As [Table 3] shows, the mixed language model 50 (P_mix2(w|h)), which the language model generation device 1 according to the present invention generates by mixing with evaluation weights for the evaluation sentences H, yields a lower word error rate than the language model generated without evaluation weights (P_mix2test(w|h)), so the accuracy of speech recognition can be improved.

«Second Embodiment»
Next, the configuration of the language model generation device 1B according to the second embodiment of the present invention is described with reference to FIG. 5.

Like the language model generation device 1 (see FIG. 1), the language model generation device 1B mixes a large-scale language model (the global language model 20) with a small-scale language model for the speech recognition target (the topic-dependent language model 40) by weighted addition. The language model generation device 1B differs from the language model generation device 1 (see FIG. 1) in that it does not mix a plurality of language models learned independently in advance from the large-scale corpus (the manuscript language model 30, the subtitle language model 31, and the transcription language model 32; see FIG. 1).

As shown in FIG. 5, the language model generation device 1B includes evaluation weight generation means 10, mixing weight generation means 11, and linear sum interpolation means 12. The evaluation weight generation means 10 is the same as that of the language model generation device 1 described with reference to FIG. 1, so its description is omitted.

The mixing weight generation means 11 uses the weighted evaluation sentences (evaluation sentences and evaluation weights) generated by the evaluation weight generation means 10 to generate weighting factors (mixing weights) for mixing a plurality of language models (the global language model 20 and the topic-dependent language model 40). The mixing weight generation means 11 calculates the mixing weight of each language model so that the log likelihood of the weighted evaluation sentences is maximized.
The method of calculating the mixing weights from the log likelihood is the same as that of the first mixing weight generation means 11A and the second mixing weight generation means 11B described with reference to FIG. 1, so its description is omitted here.
The mixing weight generation means 11 outputs the generated mixing weights λ (λ_global, λ_wadai) to the linear sum interpolation means 12.

The linear sum interpolation means 12 mixes a plurality of language models (the global language model 20 and the topic-dependent language model 40) using the mixing weights λ generated by the mixing weight generation means 11. The linear sum interpolation means 12 outputs the generated mixed language model 50B to the outside.
The method of mixing language models using the mixing weights is the same as that of the first linear sum interpolation means 12A and the second linear sum interpolation means 12B described with reference to FIG. 1, so its description is omitted here.

In this way, the language model generation device 1B can improve recognition accuracy for the target speech by mixing the topic-dependent language model 40, generated from a corpus on the topic of the speech recognition target, into the global language model 20, generated from an existing large-scale corpus. In doing so, when calculating the mixing weights over the sentences of the evaluation sentences H, the language model generation device 1B gives greater weight to the sentences that are suitable for evaluation, which suppresses overfitting to sentences that are already contained in the corpus and have already been learned.

The language model generation device 1B can be realized by running, on a computer (not shown), a program (language model generation program) that causes the computer to function as the evaluation weight generation means 10, the mixing weight generation means 11, and the linear sum interpolation means 12.

The embodiments of the present invention (first and second embodiments) have been described above, but the present invention is not limited to these embodiments and can be modified in various ways, as follows.
«Other modifications»
Here, the evaluation weight α_c generated by the evaluation weight generation means 10 is binary (“0” or “1”), as shown in equation (3).
However, the evaluation weight generation means 10 may instead set the evaluation weight α_c to a value in the range from “0” to “1” inclusive, based on, for example, the difference between the overall perplexity PP_all of the whole of the evaluation sentences H and the individual perplexity PP_c of each sentence. For example, the evaluation weight may be set to “1” for the sentence for which the difference obtained by subtracting the individual perplexity PP_c from the overall perplexity PP_all is largest and to “0” for the sentence for which it is smallest, with the other sentences assigned values in proportion to the size of their difference.
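A graded weight of this kind can be sketched with min-max scaling. One point is an interpretation rather than something the text settles: the passage states the difference as PP_all minus PP_c, whereas equation (3) gives the larger weight to sentences with the larger PP_c. The sketch below assumes the orientation of equation (3), scaling PP_c − PP_all so that the sentence the global model handles worst receives weight 1. The function name and the handling of the all-equal case are also assumptions.

```python
def graded_evaluation_weights(pp_all, pp_each):
    """Evaluation weights in [0, 1] from perplexity differences.

    pp_all:  overall perplexity of the evaluation sentences.
    pp_each: individual perplexity of each divided sentence.
    Assumes larger PP_c should mean a larger weight, as in equation (3).
    """
    diffs = [pp_c - pp_all for pp_c in pp_each]
    lo, hi = min(diffs), max(diffs)
    if hi == lo:
        # all sentences score the same; treat them equally (assumption)
        return [1.0] * len(diffs)
    # largest difference -> 1, smallest -> 0, others in proportion
    return [(d - lo) / (hi - lo) for d in diffs]
```

Under min-max scaling the PP_all term cancels, so the result depends only on how the individual perplexities rank against one another. A scheme that keeps the comparison with PP_all meaningful would need a different normalization, which the text leaves open.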

Also, here, the evaluation weight generation means 10 uses perplexity as the index for evaluating a language model.
However, the evaluation weight generation means 10 does not necessarily have to use perplexity, as long as the index allows the language model to be evaluated numerically. For example, entropy (E in equation (2)) or log likelihood (the part of equation (2) from Σ onward) may be used.
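The three indices are closely related. The sketch below assumes the common definitions, since equation (2) is not shown in this section: log likelihood as the sum of per-word log2 probabilities, entropy as its negated per-word average, and perplexity as 2 raised to the entropy. The function names are hypothetical.

```python
def log_likelihood(log2_probs):
    """Sum of per-word log2 probabilities (higher is better)."""
    return sum(log2_probs)


def entropy(log2_probs):
    """Negated mean per-word log2 probability (lower is better)."""
    return -log_likelihood(log2_probs) / len(log2_probs)


def perplexity(log2_probs):
    """2 raised to the entropy (lower is better)."""
    return 2.0 ** entropy(log2_probs)


# Usage on a two-word sentence
scores = [-1.0, -3.0]
ll, e, pp = log_likelihood(scores), entropy(scores), perplexity(scores)
```

Under these definitions, entropy and perplexity rank sentences identically, so either gives the same binary weights. Log likelihood points the other way (higher is better) and is not normalized by length, so a comparison between a single divided sentence and the whole evaluation set would need to be adapted accordingly.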

Also, although a plurality of language models have been given here as examples (for example, the manuscript language model 30, the subtitle language model 31, and the transcription language model 32), the language models to be mixed are not limited to these. For example, the manuscripts behind the manuscript language model 30 need not be broadcast program manuscripts; they may be, for example, several years of newspaper manuscripts.

1, 1B Language model generation device
10 Evaluation weight generation means
11 Mixing weight generation means
11A First mixing weight generation means
11B Second mixing weight generation means
12 Linear sum interpolation means
12A First linear sum interpolation means
12B Second linear sum interpolation means
20 Global language model
21 Mixed global language model
30 Manuscript language model (individual language model)
31 Subtitle language model (individual language model)
32 Transcription language model (individual language model)
40 Topic-dependent language model
50, 50B Mixed language model

Claims

A language model generation device that generates a mixed language model for a speech recognition target by mixing a topic-dependent language model learned in advance from a learning corpus related to a topic of the speech recognition target with a global language model learned in advance from a learning corpus having a larger data volume than that learning corpus, the device comprising:
evaluation weight generation means for calculating an overall evaluation value, obtained by evaluating the global language model using the whole of evaluation sentences associated with the topic in advance, and an individual evaluation value for each divided sentence, obtained by evaluating the global language model using divided sentences into which the evaluation sentences are divided according to a predetermined classification, and for generating, as an evaluation weight for each divided sentence, a degree of appropriateness as an evaluation sentence;
first mixing weight generation means for generating mixing weights that maximize the log likelihood obtained when a plurality of individual language models, learned in advance from a plurality of individual learning corpora constituting the learning corpus used to learn the global language model, are subjected to linear sum interpolation, with each divided sentence taken in the proportion of its evaluation weight;
first linear sum interpolation means for generating a mixed global language model by performing linear sum interpolation on the plurality of individual language models in the proportions of the mixing weights generated by the first mixing weight generation means;
second mixing weight generation means for generating mixing weights that maximize the log likelihood obtained when the mixed global language model and the topic-dependent language model are subjected to linear sum interpolation, with each divided sentence taken in the proportion of its evaluation weight; and
second linear sum interpolation means for generating the mixed language model for the speech recognition target by performing linear sum interpolation on the mixed global language model and the topic-dependent language model in the proportions of the mixing weights generated by the second mixing weight generation means.

A language model generation device that generates a mixed language model for a speech recognition target by mixing a topic-dependent language model learned in advance from a learning corpus related to a topic of the speech recognition target with a global language model learned in advance from a learning corpus having a larger data volume than that learning corpus, the device comprising:
evaluation weight generation means for calculating an overall evaluation value, obtained by evaluating the global language model using the whole of evaluation sentences associated with the topic in advance, and an individual evaluation value for each divided sentence, obtained by evaluating the global language model using divided sentences into which the evaluation sentences are divided according to a predetermined classification, and for generating, as an evaluation weight for each divided sentence, a degree of appropriateness as an evaluation sentence;
mixing weight generation means for generating mixing weights that maximize the log likelihood obtained when the global language model and the topic-dependent language model are subjected to linear sum interpolation, with each divided sentence taken in the proportion of its evaluation weight; and
linear sum interpolation means for generating the mixed language model for the speech recognition target by performing linear sum interpolation on the global language model and the topic-dependent language model in the proportions of the mixing weights.

The evaluation weight generation means calculates a perplexity of the global language model as the overall evaluation value and the individual evaluation value, and for the divided sentence in which the individual evaluation value is larger than the overall evaluation value. The language model generation device according to claim 1, wherein the evaluation weight is set large.

The evaluation weight generation means calculates a perplexity of the global language model as the overall evaluation value and the individual evaluation value, and for the divided sentence in which the individual evaluation value is larger than the overall evaluation value. 3. The language model generation device according to claim 1, wherein the evaluation weight is set to “0” for the divided sentences other than the evaluation weight “1”.

A language model generation program for causing a computer to function as the language model generation device according to any one of claims 1 to 4.