JP4340024B2

JP4340024B2 - Statistical language model generation apparatus and statistical language model generation program

Info

Publication number: JP4340024B2
Application number: JP2001172260A
Authority: JP
Inventors: 彰夫小林; 真一本間; 彰男安藤
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2001-06-07
Filing date: 2001-06-07
Publication date: 2009-10-07
Anticipated expiration: 2021-06-07
Also published as: JP2002366190A

Description

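[0001]
[Technical field of the invention]
The present invention relates to a statistical language model generation apparatus and a statistical language model generation program that generate a statistical language model for use in a speech recognition apparatus.
[0002]
[Prior art]
Methods that use a statistical (probabilistic) language model have been proposed as a way to improve the recognition performance of speech recognition apparatuses; representative examples are given below. A statistical language model is a model of the relationships between words or phonemes in a language, built from statistics.
[0003]
(1) The cache-model method (R. Kuhn, R. De Mori, "A Cache-Based Natural Language Model for Speech Recognition," IEEE Trans. PAMI, vol. 12, no. 6, 1990, pp. 570-583). This method improves recognition performance by combining, through linear interpolation or the like, n-gram probability values learned from a large amount of past manuscripts (text data) with the occurrence probabilities of words in recent speech recognition results. For reference, an n-gram probability value is the occurrence probability in a word n-gram, which models a word sequence as a Markov chain; that is, the occurrence probability of a word depends on the preceding (n-1) words. Linear interpolation means linearly interpolating an n-gram probability value with lower-order m-gram probability values (m < n).
[0004]
(2) The method based on MAP (maximum a posteriori) estimation (Kobayashi, Imai, Ando, "Time Dependent Language Model for Broadcast News Transcription and Its Postcorrection," ICSLP-1998). This method obtains the n-gram probability values for a given task by adding a small amount of manuscripts to a large amount of task-independent manuscripts, using appropriate weights obtained by MAP estimation, thereby raising the statistical accuracy of the language model and improving recognition performance. The vocabulary (corpus) used to generate the language model consists of all words in the small amount of manuscripts plus some of the words in the large amount of manuscripts. For reference, a task generally means a job, that is, the object to be processed; an appropriate weight is a numerical value added so that the occurrence probability of a given word in the statistical (probabilistic) language model becomes higher; and a vocabulary (corpus) is the data from which a language model is generated, generally a text database containing several hundred thousand words or more.
[0005]
[Problems to be solved by the invention]
However, because the conventional cache-model method relies on past speech recognition results, it takes no account of words that are not registered in the vocabulary used to generate the language model (words that have recently come into wide use). Consequently, for tasks such as news programs, where a single topic often consists of only a few sentences and proper nouns such as personal, place, and organization names (new words) appear very frequently, recognition performance cannot be expected to improve unless the language model is always based on a vocabulary in which new words have been registered.
[0006]
Moreover, the method based on MAP (maximum a posteriori) estimation uses manuscripts written in written language rather than the actual utterance content, so it cannot raise the occurrence probability of words that are likely to appear in utterances.
[0007]
An object of the present invention is to solve the above problems of the prior art and to provide a statistical language model generation apparatus and a statistical language model generation program that generate a statistical language model capable of improving recognition performance in speech recognition and of raising the occurrence probability of words likely to appear in utterances.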
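[0008]
The statistical language model generation apparatus according to claim 1 is a statistical language model generation apparatus that generates a statistical language model, comprising: text data acquisition means for acquiring the latest text data containing words whose occurrence frequency is predicted to rise; speech recognition means for recognizing, from speech and with reference to a statistical language model, the latest text data and a large amount of earlier, past text data that is larger in volume than the latest text data; storage means for storing the latest text data, the large amount of past text data, and the recognition results of the speech recognition means; probability weight calculation means for calculating, by the EM algorithm, a first n-gram probability weight based on the frequencies of word n-tuples in the large amount of past text data, a second n-gram probability weight based on the frequencies of word n-tuples in the latest text data, and a third n-gram probability weight based on the frequencies of word n-tuples in the recognition results; and language model generation means for generating a statistical language model based on the first, second, and third probability weights.
[0009]
With this configuration, the text data acquisition means acquires the latest text data containing words whose occurrence frequency is predicted to rise, the speech recognition means recognizes the latest text data and the earlier large amount of text data as speech, and the storage means stores the latest text data, the large amount of past text data, and the recognition results. The probability weight calculation means then calculates the probability weight of each n-gram, and the language model generation means generates a language model based on the calculated results.
[0010]
Examples of the latest text data whose occurrence frequency is predicted to rise include manuscripts prepared for broadcast programs just before or just after broadcast, and articles carried in newspapers or magazines just before or just after publication. Examples of the large amount of past text data include several years to several decades of manuscripts prepared for broadcast programs, the Brown Corpus, and the LOB Corpus.
[0011]
The statistical language model generation apparatus according to claim 2 is the statistical language model generation apparatus according to claim 1, further comprising recognition result correction means for correcting the recognition results produced by the speech recognition means, wherein the probability weight calculation means calculates the third n-gram probability weight based on the corrected recognition results.
[0012]
With this configuration, the recognition result correction means corrects the results of recognizing the text data as speech, and the probability weight calculation means calculates the third n-gram probability weight based on the corrected recognition results.
[0013]
The statistical language model generation program according to claim 3 causes a computer to function as: text data acquisition means for acquiring the latest text data containing words whose occurrence frequency is predicted to rise; speech recognition means for recognizing, from speech and with reference to a statistical language model, the latest text data and a large amount of earlier, past text data that is larger in volume than the latest text data; storage means for storing the latest text data, the large amount of past text data, and the recognition results produced by the speech recognition means; probability weight calculation means for calculating, by the EM algorithm, a first n-gram probability weight based on the frequencies of word n-tuples in the large amount of past text data, a second n-gram probability weight based on the frequencies of word n-tuples in the latest text data, and a third n-gram probability weight based on the frequencies of word n-tuples in the recognition results; and language model generation means for generating a statistical language model based on the first, second, and third probability weights.
[0014]
With this configuration, the text data acquisition means acquires the latest text data containing words whose occurrence frequency is predicted to rise, the speech recognition means recognizes the latest text data and the earlier large amount of text data as speech, and the storage means stores the latest text data, the large amount of past text data, and the recognition results. The probability weight calculation means then calculates the probability weight of each n-gram, and the language model generation means generates a language model based on the calculated results.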
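[0015]
[Embodiments of the invention]
Embodiments of the present invention are described in detail below with reference to the drawings.
(Statistical language model generation apparatus: first embodiment)
FIG. 1 is a functional diagram of the first embodiment of the statistical language model generation apparatus.
As shown in FIG. 1, the statistical language model generation apparatus 1 comprises a main control unit, a storage unit, a display output unit, an input unit, an external connection unit, and the like (none shown), and functionally implements past news manuscript accumulation means 3, latest reporter manuscript accumulation means 5, speech recognition means 7, recognition result accumulation means 9, and language model calculation means 11.
[0016]
The statistical language model generation apparatus 1 generates, from a large amount of text data, the statistical language model used by a speech recognition apparatus (not shown) during speech recognition. In this embodiment, the statistical language model generation apparatus 1 is an ordinary computer, and the main control unit, storage unit, display output unit, input unit, and external connection unit (none shown) consist of a CPU, memory, a hard disk, a keyboard, and so on.
[0017]
The past news manuscript accumulation means 3 is a database stored (accumulated) in the storage unit (not shown) and holds the large amount of past text data referred to in the claims. It stores a large number of past news manuscripts in text file format (text data). In each text file, a space is inserted between every pair of adjacent words of the manuscript.
[0018]
In this embodiment, punctuation marks in a news manuscript are treated as part of the word immediately preceding them. Note also that, in this specification, the terms "store" (記憶), "accumulate" (集積), and "collect" (蓄積) are used with no substantive difference in meaning.
[0019]
The latest reporter manuscript accumulation means 5 consists of a program loaded into the main control unit (not shown) and a database stored in the storage unit; it combines the text data acquisition means, which acquires the latest text data containing words whose occurrence frequency is predicted to rise, with the store of the acquired latest text data. It first acquires reporter manuscripts prepared for the latest broadcast programs (news programs in particular). Acquisition is performed either by an operator entering the news manuscript into the statistical language model generation apparatus 1, or by reading it with OCR or the like and inputting the result through the external connection unit.
[0020]
After a reporter manuscript is acquired, the latest reporter manuscript accumulation means 5 applies certain corrections automatically, or an operator proofreads the manuscript, and the result is stored as text data in the latest-text database in the storage unit (not shown). The sentences of a reporter manuscript are filed as text files, one per topic, and, as in the past news manuscript accumulation means 3, each text file has a space inserted between every pair of adjacent words of the manuscript.
[0021]
The speech recognition means 7 recognizes a text file as speech (reads the text file aloud). It is a general text-to-speech conversion engine or the like, which carries a dictionary storing several hundred thousand words and first recognizes, from the text files of the past news manuscript accumulation means 3 and the latest reporter manuscript accumulation means 5, the words contained in those files.
[0022]
The recognition result accumulation means 9 is a database stored (accumulated) in the storage unit (not shown). The recognition results produced by the speech recognition means 7 are checked against the text files of the past news manuscript accumulation means 3 and the latest reporter manuscript accumulation means 5, and each sentence of the recognition results is stored with its date and time attached as a time stamp.
[0023]
The language model calculation means 11 is a program that generates a statistical language model based on the large amount of past text data from the past news manuscript accumulation means 3, the text data of the latest reporter manuscripts from the latest reporter manuscript accumulation means 5, and the recognition results from the speech recognition means 7. In this embodiment, a bigram model is used as the language model (for n-gram models, including the bigram model, see, for example, "Speech Recognition by Probability Models," Seiichi Nakagawa, IEICE, p. 109).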
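[0024]
The language model calculation means 11 performs the following calculations, in the order given, based on the equations presented below.
First, when the bigrams P₀, P₁, and P₂ of the language models built from the large amount of past text data, the text data of the latest reporter manuscripts, and the recognition results are combined by linear interpolation (for linear interpolation see, for example, "Spoken Language Processing," Kita, Nakamura, and Nagata, Morikita Publishing, p. 29), the weighted language model is expressed by Equation 1.
[0025]
[Equation 1]

[0026]
In Equation 1, y_n and y_{n-1} are words registered in the vocabulary. The probability P(y_n | y_{n-1}) is the probability that word y_n is uttered after word y_{n-1}. In general, an n-gram language model with larger n handles longer word sequences and predicts the next word more accurately, but in exchange it requires text data covering an enormous vocabulary (growing as the n-th power). λ denotes the probability weight of each language model, and V denotes the vocabulary.
[0027]
If the weighted language model assigns a large bigram probability to the word pair y_n, y_{n-1}, then when the language model generated by the statistical language model generation apparatus 1 is supplied to a speech recognition apparatus (not shown), that combination of y_n and y_{n-1} appears more readily during speech recognition. In other words, the probability weights λ should be chosen so that the product of the bigram probabilities is maximized for the text being read aloud (the text to be recognized). Equivalently, the probability weights λ should be chosen so that the entropy of the evaluation data (the text to be recognized) is minimized (see, for example, "Speech Recognition by Probability Models," Seiichi Nakagawa, IEICE, p. 111, and Equation 2).
[0028]
[Equation 2]

[0029]
In Equation 2, N is the total number of words in the evaluation text (text data), and the evaluation text is represented by the word sequence y = y₁ y₂ ... y_N of the evaluation data. The weights λ in this equation are obtained with the expectation-maximization (EM) algorithm (see, for example, "Spoken Language Processing," Kita, Nakamura, and Nagata, Morikita Publishing, p. 31), by iterating Equation 3.
[0030]
[Equation 3]

[0031]
In Equation 3, λ_i is updated repeatedly until the entropy of the evaluation text converges. This calculation yields the probability weight λ of each language model automatically. However, finding the optimal probability weights λ for the text that will be read aloud is usually difficult, because the content of the evaluation text is unknown. For this reason, a transcript (text data) of known utterances related to the evaluation text is prepared in advance and used to determine the values of λ experimentally.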
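[0032]
Next, the text weights w are obtained. The value of a text weight w gives a weighted word frequency. Let m₀ be the total number of words in the large amount of past text data G₀, m₁ the total number of words in the latest text data G₁, and m₂ the total number of words in the recognition results G₂. The text weights w are then calculated by Equation 4 from the converged probability weights λ₀, λ₁, λ₂ (the first, second, and third n-gram probability weights).
[0033]
[Equation 4]

[0034]
In Equation 4, the numbers of times (text weights) w₁ and w₂ that the latest text data G₁ and the recognition results G₂ are added to the large amount of past text data G₀ are calculated from the probability weights λ₀, λ₁, λ₂. Equation 4 normalizes the probability weights λ of the statistical language model into weights over the set of text data collections.
[0035]
Based on the calculated text weights w₁ and w₂, the latest text data is weighted by w₁ and the recognition results by w₂, both are added to the large amount of past text data, and a new vocabulary is obtained. That is, with f₀ the frequency of a word in the large amount of past text data G₀, f₁ its frequency in the latest text data G₁, and f₂ its frequency in the recognition results G₂, the occurrence frequency f of that word is given by Equation 5.
[0036]
[Equation 5]

[0037]
Words are then registered in the vocabulary V in descending order of frequency f. An upper limit (V_max) on the number of registered words is set in advance, and words are registered so that this limit is not exceeded. As a result, while the total number of registered words stays bounded, words in the latest text data G₁ that previously had a low occurrence frequency are weighted up and registered in the vocabulary.
[0038]
In other words, the language model calculation means 11 of the statistical language model generation apparatus 1 raises the occurrence frequency of new words (words absent from the large amount of past text data) found in the latest text data (the latest news manuscripts and the like). Moreover, because the new vocabulary is determined with the recognition results of the speech recognition means 7 also taken into account, recognition performance improves when a speech recognition apparatus (not shown) uses the language model generated by the statistical language model generation apparatus 1. In this embodiment, the generated language model is fed back to the speech recognition means 7 and used again during speech recognition.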
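[0039]
(Statistical language model generation apparatus: second embodiment)
FIG. 2 is a functional diagram of the second embodiment of the statistical language model generation apparatus. In this statistical language model generation apparatus 1A, components identical to those of the statistical language model generation apparatus 1 carry the same reference numerals, and their description is omitted.
[0040]
The recognition result correction means 13 of the statistical language model generation apparatus 1A is a program that corrects the recognition results of the speech recognition means 7. For example, suppose that the text data contains the hiragana string 「あめがふる」 (ame ga furu) and that the speech recognition means 7, reading the text aloud, renders it as 「雨が降る」 ("rain falls"), that is, with the accent on "a". If the text actually meant 「飴が降る」 ("candy falls"), with the accent on "me", the recognition result is corrected by inference from the context before and after 「あめがふる」.
[0041]
The corrected recognition result accumulation means 15 is a database stored (accumulated) in the storage unit (not shown) and accumulates the recognition results corrected by the recognition result correction means 13. It also temporarily holds the uncorrected recognition results from the speech recognition means 7.
[0042]
The language model calculation means 11A performs calculations in the order given below, in the same way as the language model calculation means 11. In this embodiment, the calculations are based on the large amount of past text data, the latest text data, and the corrected recognition results (evaluation data) held by the corrected recognition result accumulation means 15, obtained by recognizing these text data with the speech recognition means 7 and then correcting the output.
[0043]
First, the bigrams P₀, P₁, and P₂′ of the language models built from the large amount of past text data, the text data of the latest reporter manuscripts, and the corrected recognition results are combined by linear interpolation (see Equation 1). This defines the probability weights λ of the weighted language model.
[0044]
The probability weights λ should be chosen so that the entropy of the evaluation data (the large amount of past text data, the text data of the latest reporter manuscripts, and the corrected recognition results) is minimized (see Equation 2); they are obtained by iterative calculation with the expectation-maximization algorithm (see Equation 3).
[0045]
The text weights w₁ and w₂′ are calculated from the converged probability weights λ₀, λ₁, λ₂′ (the first, second, and third n-gram probability weights) (see Equation 4). Based on the calculated text weights, the latest text data is weighted by w₁ and the corrected recognition results by w₂′, both are added to the large amount of past text data, and a new vocabulary is obtained.
[0046]
That is, the occurrence frequency f of a word is expressed using its frequency f₀ in the large amount of past text data G₀, its frequency f₁ in the latest text data G₁, and its frequency f₂′ in the corrected recognition results G₂′, multiplied by the text weights w₁ and w₂′ (see Equation 5). Words are then registered in the vocabulary V in descending order of frequency f. An upper limit (V_max) on the number of registered words is set in advance, and words are registered so that this limit is not exceeded.
[0047]
In other words, the language model calculation means 11A of the statistical language model generation apparatus 1A raises the occurrence frequency of new words (words absent from the large amount of past text data) found in the latest text data (the latest news manuscripts and the like). Moreover, because the recognition results of the speech recognition means 7 are corrected by the recognition result correction means 13 and the new vocabulary is determined with the corrected results also taken into account, recognition performance improves when a speech recognition apparatus (not shown) uses the language model generated by the statistical language model generation apparatus 1A. In this embodiment, the generated language model is fed back to the speech recognition means 7 and used again during speech recognition.
[0048]
In the statistical language model generation apparatus 1A, the latest reporter manuscript accumulation means 5 acquires and accumulates material for the latest news programs and the like that are to be recognized, and the speech recognition means 7 and the recognition result correction means 13 correct the speech recognition output, producing the correct character string corresponding to the recognized speech. Using the language model generated by the statistical language model generation apparatus 1A therefore means using correct character-string information for broadcast programs (the recognition targets) that are very close in time, which improves speech recognition performance.
[0049]
In addition, by referring to the correct character-string information for broadcast programs (the recognition targets) that are very close in time, recognition errors contained in the speech recognition output for the large past database can be detected and corrected by the recognition result correction means 13.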
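[0050]
(Operation of the statistical language model generation apparatus)
Next, the operation of the statistical language model generation apparatus 1 is described with reference to the flowchart in FIG. 3.
First, the large amount of past text data is accumulated (has been accumulated) by the past news manuscript accumulation means 3, and an initial vocabulary is determined according to the occurrence frequency of each word in that data (S1). The initial vocabulary normally consists of several hundred thousand words or more. In general, the number of words registered in a language model's vocabulary is set in advance according to the capacity of the storage unit (not shown) or the processing power of the main control unit (not shown), and words from the accumulated or training data are registered in descending order of occurrence frequency until that number is reached.
[0051]
Meanwhile, the latest reporter manuscript accumulation means 5 has accumulated the text data prepared for the latest broadcast programs and the like (the latest text data), and the large amount of past text data and the latest text data are recognized by the speech recognition means 7. The recognition results are accumulated in the recognition result accumulation means 9.
[0052]
The language model calculation means 11 then first builds the individual language models (bigrams P₀, P₁, P₂) and combines them by linear interpolation (see Equation 1) (S2). The probability weights λ₀, λ₁, λ₂ of these language models are calculated with the EM algorithm (see Equation 3) (S3), and the text weights w₁ and w₂ are calculated from these probability weights (see Equation 4) (S4).
[0053]
Next, the language model calculation means 11 calculates the occurrence frequency f of each word from the text weights w₁ and w₂ (see Equation 5) (S5), and a new vocabulary is determined by taking words in descending order of f until the registered word count is reached (S6). A language model is then generated from the new vocabulary (S7).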
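The vocabulary-update part of the flow in FIG. 3 (steps S1 and S4 to S6) can be sketched in a few lines. This is an illustrative sketch only, under several assumptions: the function names are hypothetical, the probability weights from S2 and S3 are passed in rather than estimated, and because the image of Equation 4 is not reproduced in this text, the text weight is assumed to take the form w_i = (λ_i · m₀) / (λ₀ · m_i). Building the final model (S7) is not shown.

```python
from collections import Counter


def count_words(lines):
    """Count words in space-delimited text lines."""
    return Counter(word for line in lines for word in line.split())


def top_words(freqs, limit):
    """Return up to `limit` words, highest frequency first (ties broken alphabetically)."""
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:limit]]


def update_vocabulary(past, recent, recognized, lams, limit):
    """Return (initial vocabulary, updated vocabulary) for given weights lams = (l0, l1, l2)."""
    f0, f1, f2 = count_words(past), count_words(recent), count_words(recognized)
    initial = top_words(f0, limit)                       # S1: vocabulary from past data only
    m0, m1, m2 = sum(f0.values()), sum(f1.values()), sum(f2.values())
    # S4: assumed form of the text weights (Equation 4 is not reproduced in the source).
    # Both smaller corpora must be non-empty and lams[0] must be positive.
    w1 = lams[1] * m0 / (lams[0] * m1)
    w2 = lams[2] * m0 / (lams[0] * m2)
    merged = Counter(f0)                                 # S5: f = f0 + w1*f1 + w2*f2
    for word, count in f1.items():
        merged[word] += w1 * count
    for word, count in f2.items():
        merged[word] += w2 * count
    return initial, top_words(merged, limit)             # S6: new vocabulary
```

With a toy corpus, a word that appears only in the recent manuscript and the recognition result can displace a low-frequency word from the initial vocabulary, which is the behavior the paragraph describes.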
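[0054]
The present invention has been described above on the basis of embodiments, but it is not limited to them.
[0055]
For example, the components implemented in the statistical language model generation apparatuses 1 and 1A can be handled as a program stored on a particular storage medium. Furthermore, for n-grams of higher order than the bigram (trigram, 4-gram), it is likewise possible to calculate each probability weight λ, convert it to a text weight, create a new vocabulary V, and generate a statistical language model from that vocabulary, in the same way as for the bigram.
[0056]
[Effects of the invention]
According to the invention of claim 1, the text data acquisition means acquires the latest text data containing words whose occurrence frequency is predicted to rise, the speech recognition means recognizes the latest text data and the earlier large amount of text data as speech, and the storage means stores the latest text data, the large amount of past text data, and the recognition results. The probability weight calculation means then calculates the probability weight of each n-gram, and the language model generation means generates a language model based on the calculated results, so that recognition performance during speech recognition can be improved when this language model is used by a speech recognition apparatus.
[0057]
In addition, because the latest text data is given a certain probability weight and included in the vocabulary from which the language model is generated, the occurrence probability of words likely to appear in the latest utterances can be raised.
[0058]
According to the invention of claim 2, the recognition result correction means corrects the results of recognizing the text data as speech, and the probability weight calculation means calculates the third n-gram probability weight based on the corrected recognition results, so that recognition performance during speech recognition can be improved further when the language model obtained with the corrected recognition results taken into account is used by a speech recognition apparatus.
[0059]
According to the invention of claim 3, the text data acquisition means of the statistical language model generation program acquires the latest text data containing words whose occurrence frequency is predicted to rise, the speech recognition means recognizes the latest text data and the earlier large amount of text data as speech, and the storage means stores the latest text data, the large amount of past text data, and the recognition results. The probability weight calculation means then calculates the probability weight of each n-gram, and the language model generation means generates a language model based on the calculated results, so that recognition performance during speech recognition can be improved when this language model is used by a speech recognition apparatus.
[0060]
The statistical language model generation program can also be distributed commercially on a storage medium that stores it.
[Brief description of the drawings]
[FIG. 1] A functional diagram of a statistical language model generation apparatus according to the first embodiment of the present invention.
[FIG. 2] A functional diagram of a statistical language model generation apparatus according to the second embodiment of the present invention.
[FIG. 3] A flowchart explaining the operation of the statistical language model generation apparatus.
[Explanation of reference signs]
1, 1A Statistical language model generation apparatus
3 Past news manuscript accumulation means
5 Latest reporter manuscript accumulation means
7 Speech recognition means
9 Recognition result accumulation means
11, 11A Language model calculation means
13 Recognition result correction means
15 Corrected recognition result accumulation means
[0001]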
[Technical field of the invention]
The present invention relates to a statistical language model generation device and a statistical language model generation program that generate a statistical language model for use in a speech recognition device.
[0002]
[Prior art]
Methods that use a statistical (probabilistic) language model have been proposed as a way to improve the recognition performance of speech recognition apparatuses; representative examples are given below. A statistical language model is a model of the relationships between words or phonemes in a language, built from statistics.
[0003]
(1) A method based on a cache model (R. Kuhn, R. De Mori, "A Cache-Based Natural Language Model for Speech Recognition," IEEE Trans. PAMI, vol. 12, no. 6, 1990, pp. 570-583). This method improves speech recognition performance by combining, through linear interpolation or the like, n-gram probability values learned from a large amount of past manuscripts (text data) with the appearance probabilities of words in recent speech recognition results. For reference, an n-gram probability value is the occurrence probability in a word n-gram, which models a word sequence as a Markov chain; that is, the occurrence probability of a word depends on the preceding (n-1) words. Linear interpolation means linearly interpolating an n-gram probability value with lower-order m-gram probability values (m < n).
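As a rough illustration of the cache-model idea described above, the sketch below mixes a fixed bigram probability with a unigram estimate taken from recently recognized words. It is not the formulation of the cited paper: the class name, the cache size, and the single mixing weight `cache_weight` are assumptions made for this example.

```python
from collections import Counter, deque


class CacheInterpolatedModel:
    """Linear mix of a static bigram table and a unigram cache of recent words."""

    def __init__(self, static_bigram, cache_size=200, cache_weight=0.2):
        self.static_bigram = static_bigram      # dict: (prev, word) -> probability
        self.cache = deque(maxlen=cache_size)   # most recent recognized words
        self.cache_weight = cache_weight        # hypothetical interpolation weight

    def observe(self, word):
        """Add a recognized word to the cache (the oldest word drops out when full)."""
        self.cache.append(word)

    def prob(self, prev, word):
        """Interpolated probability of `word` following `prev`."""
        static_p = self.static_bigram.get((prev, word), 0.0)
        if not self.cache:
            return static_p                     # nothing recognized yet
        cache_p = Counter(self.cache)[word] / len(self.cache)
        return (1.0 - self.cache_weight) * static_p + self.cache_weight * cache_p
```

Note that a word can only enter the cache after it has been recognized, which is the limitation for unregistered words that the later paragraphs point out.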
[0004]
(2) A method based on MAP (maximum a posteriori) estimation (Kobayashi, Imai, Ando, "Time Dependent Language Model for Broadcast News Transcription and Its Postcorrection," ICSLP-1998). This method obtains the n-gram probability values for a given task by adding a small amount of manuscripts to a large amount of task-independent manuscripts, using appropriate weights obtained by MAP estimation, thereby raising the statistical accuracy of the language model and improving recognition performance. The vocabulary (corpus) used to generate the language model consists of all words in the small amount of manuscripts plus some of the words in the large amount of manuscripts. For reference, a task generally means a job, that is, the object to be processed; an appropriate weight is a numerical value added so that the appearance probability of a given word in the statistical (probabilistic) language model becomes higher; and a vocabulary (corpus) is the data from which a language model is generated, generally a text database containing several hundred thousand words or more.
[0005]
[Problems to be solved by the invention]
However, because the conventional cache-model method relies on past speech recognition results, it takes no account of words that are not registered in the vocabulary used to generate the language model (words that have recently come into wide use). Consequently, for tasks such as news programs, where a single topic often consists of only a few sentences and proper nouns such as personal, place, and organization names (new words) appear very frequently, speech recognition performance cannot be expected to improve unless the language model is always based on a vocabulary in which new words have been registered.
[0006]
Further, the method based on MAP (maximum a posteriori) estimation uses manuscripts written in written language rather than the actual utterance content, so it cannot raise the appearance probability of words that are likely to be included in utterances.
[0007]
The object of the present invention is to solve the above problems of the conventional techniques and to provide a statistical language model generation apparatus and a statistical language model generation program that generate a statistical language model capable of improving recognition performance in speech recognition and of raising the appearance probability of words likely to be included in utterances.
[0008]
The statistical language model generation apparatus according to claim 1 is a statistical language model generation apparatus that generates a statistical language model, comprising: text data acquisition means for acquiring the latest text data containing words whose appearance frequency is predicted to rise; speech recognition means for recognizing, from speech and with reference to a statistical language model, the latest text data and a large amount of earlier, past text data that is larger in volume than the latest text data; storage means for storing the latest text data, the large amount of past text data, and the recognition results of the speech recognition means; probability weight calculation means for calculating, by the EM algorithm, a first n-gram probability weight based on the frequencies of word n-tuples in the large amount of past text data, a second n-gram probability weight based on the frequencies of word n-tuples in the latest text data, and a third n-gram probability weight based on the frequencies of word n-tuples in the recognition results; and language model generation means for generating a statistical language model based on the first, second, and third probability weights.
[0009]
According to such a configuration, the text data acquisition unit acquires the latest text data including a word that is predicted to appear frequently, and the voice recognition unit acquires the latest text data and the previous large amount of text data. Recognized as speech, the latest text data and past large text data and the recognized recognition result are stored by the storage means. Then, the probability weight calculation means calculates the probability weight in each n-gram, and the language model generation means generates a language model based on the calculation result.
[0010]
Examples of the latest text data whose appearance frequency is predicted to rise include manuscripts prepared for broadcast programs just before or just after broadcast, and articles carried in newspapers or magazines just before or just after publication. Examples of the large amount of past text data include several years to several decades of manuscripts prepared for broadcast programs, the Brown Corpus, and the LOB Corpus.
[0011]
The statistical language model generation device according to claim 2 is the statistical language model generation device according to claim 1, further comprising recognition result correction means for correcting the recognition results produced by the speech recognition means, wherein the probability weight calculation means calculates the third n-gram probability weight based on the corrected recognition results.
[0012]
According to such a configuration, the recognition result correcting unit corrects the result of the text data being recognized as speech, and based on the corrected recognition result, the probability weight calculating unit calculates the third probability weight of n-gram. Is calculated.
[0013]
The statistical language model generation program according to claim 3 causes a computer to function as: text data acquisition means for acquiring the latest text data containing words whose appearance frequency is predicted to rise; speech recognition means for recognizing, from speech and with reference to a statistical language model, the latest text data and a large amount of earlier, past text data that is larger in volume than the latest text data; storage means for storing the latest text data, the large amount of past text data, and the recognition results produced by the speech recognition means; probability weight calculation means for calculating, by the EM algorithm, a first n-gram probability weight based on the frequencies of word n-tuples in the large amount of past text data, a second n-gram probability weight based on the frequencies of word n-tuples in the latest text data, and a third n-gram probability weight based on the frequencies of word n-tuples in the recognition results; and language model generation means for generating a statistical language model based on the first, second, and third probability weights.
[0014]
According to such a configuration, the text data acquisition unit acquires the latest text data including a word that is predicted to appear frequently, and the voice recognition unit acquires the latest text data and the previous large amount of text data. Recognized as speech, the latest text data and past large text data and the recognized recognition result are stored by the storage means. Then, the probability weight calculation means calculates the probability weight in each n-gram, and the language model generation means generates a language model based on the calculation result.
[0015]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
(Statistical language model generation apparatus: first embodiment)
FIG. 1 shows a functional explanatory diagram of the first embodiment of the statistical language model generation apparatus.
As shown in FIG. 1, the statistical language model generation apparatus 1 comprises a main control unit, a storage unit, a display output unit, an input unit, an external connection unit, and the like (none shown), and functionally implements past news manuscript accumulation means 3, latest reporter manuscript accumulation means 5, speech recognition means 7, recognition result accumulation means 9, and language model calculation means 11.
[0016]
The statistical language model generation device 1 is a device that generates a statistical language model used for speech recognition in a speech recognition device (not shown) based on a large amount of text data. In this embodiment, the statistical language model generation device 1 is a general computer, and each main control unit, storage unit, display output unit, input unit, and external connection unit (not shown) are a CPU, It consists of memory, hard disk, keyboard and so on.
[0017]
The past news manuscript accumulation means 3 is a database stored (accumulated) in a storage unit (not shown), in which a large amount of past text data described in the claims is accumulated. The past news manuscript accumulation means 3 stores a large amount of past news manuscripts in a text file format (text data). In this text file, a space is inserted between each word constituting the manuscript.
[0018]
In this embodiment, punctuation marks in a news manuscript are treated as part of the word immediately preceding them. Note also that, in this specification, the terms "store" (記憶), "accumulate" (集積), and "collect" (蓄積) are used with no substantive difference in meaning.
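Because the manuscripts are stored with a space between words and with punctuation attached to the preceding word, n-gram counts can be collected with a plain whitespace split. The sketch below counts unigrams and bigrams and derives a maximum-likelihood bigram probability; the function names and the sentence-start symbol `<s>` are assumptions for illustration, and no smoothing is applied.

```python
from collections import Counter


def bigram_counts(lines, bos="<s>"):
    """Count unigrams and bigrams in space-delimited lines.

    Punctuation stays attached to the preceding word, so "today." is one token.
    """
    unigrams, bigrams = Counter(), Counter()
    for line in lines:
        prev = bos
        for word in line.split():
            unigrams[word] += 1
            bigrams[(prev, word)] += 1
            prev = word
    return unigrams, bigrams


def bigram_prob(bigrams, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev was never seen."""
    total = sum(count for (p, _), count in bigrams.items() if p == prev)
    return bigrams[(prev, word)] / total if total else 0.0
```

Applied separately to the past manuscripts, the latest manuscripts, and the recognition results, counts of this kind are the raw material for the three bigram models P₀, P₁, and P₂ discussed below.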
[0019]
The latest reporter manuscript accumulation means 5 consists of a program loaded into the main control unit (not shown) and a database stored in the storage unit; it combines the text data acquisition means, which acquires the latest text data containing words whose appearance frequency is predicted to rise, with the store of the acquired latest text data. It first acquires reporter manuscripts prepared for the latest broadcast programs (news programs in particular). Acquisition is performed either by an operator entering the news manuscript into the statistical language model generation apparatus 1, or by reading it with OCR or the like and inputting the result through the external connection unit.
[0020]
After a reporter manuscript is acquired, the latest reporter manuscript accumulation means 5 applies certain corrections automatically, or an operator proofreads the manuscript, and the result is stored as text data in the latest-text database in the storage unit (not shown). The sentences of a reporter manuscript are filed as text files, one per topic, and, as in the past news manuscript accumulation means 3, each text file has a space inserted between every pair of adjacent words of the manuscript.
[0021]
The speech recognition means 7 recognizes a text file as speech (reads the text file aloud). It is a general text-to-speech conversion engine or the like, which carries a dictionary storing several hundred thousand words and first recognizes, from the text files of the past news manuscript accumulation means 3 and the latest reporter manuscript accumulation means 5, the words contained in those files.
[0022]
The recognition result accumulation means 9 is a database stored (accumulated) in the storage unit (not shown). The recognition results produced by the speech recognition means 7 are checked against the text files of the past news manuscript accumulation means 3 and the latest reporter manuscript accumulation means 5, and each sentence of the recognition results is stored with its date and time attached as a time stamp.
[0023]
The language model calculation means 11 is a program that generates a statistical language model based on the large amount of past text data from the past news manuscript accumulation means 3, the text data of the latest reporter manuscripts from the latest reporter manuscript accumulation means 5, and the recognition results from the speech recognition means 7. In this embodiment, a bigram model is used as the language model (for n-gram models, including the bigram model, see, for example, "Speech Recognition by Probability Models," Seiichi Nakagawa, IEICE, p. 109).
[0024]
In the language model calculation means 11, various calculations are performed in the following order based on mathematical expressions to be described later.
First, when the bigrams P₀, P₁, and P₂ of the language models built from the large amount of past text data, the text data of the latest reporter manuscripts, and the recognition results are combined by linear interpolation (for linear interpolation see, for example, "Spoken Language Processing," Kita, Nakamura, and Nagata, Morikita Publishing, p. 29), the weighted language model is expressed by Equation 1.
[0025]
[Equation 1]

[0026]
In Equation 1, y_n and y_{n-1} are words registered in the vocabulary. The probability P(y_n | y_{n-1}) is the probability that word y_n is uttered after word y_{n-1}. In general, an n-gram language model with larger n handles longer word sequences and predicts the next word more accurately, but in exchange it requires text data covering an enormous vocabulary (growing as the n-th power). λ denotes the probability weight of each language model, and V denotes the vocabulary.
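The image of Equation 1 is not reproduced in this text. Assuming it is the usual linear interpolation of three bigram models, P(y_n | y_{n-1}) = λ₀P₀ + λ₁P₁ + λ₂P₂ with the weights summing to 1, the combination can be sketched as follows. The function name and the dictionary representation of each model are illustrative choices, not taken from the patent.

```python
def interpolate(models, weights, prev, word):
    """Linearly interpolated bigram probability P(word | prev).

    models:  list of dicts mapping (prev, word) -> probability (P0, P1, P2, ...)
    weights: list of probability weights (lambda_0, lambda_1, ...) summing to 1
    """
    if len(models) != len(weights):
        raise ValueError("one weight is required per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("probability weights must sum to 1")
    # A bigram missing from a model contributes probability 0 from that model.
    return sum(lam * model.get((prev, word), 0.0)
               for lam, model in zip(weights, models))
```

A bigram that is rare in the past data but frequent in the latest manuscripts receives a noticeably higher combined probability as soon as λ₁ is non-negligible, which is the effect the description relies on.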
[0027]
If the weighted language model assigns a large bigram probability to the word pair y_n, y_{n-1}, then when the language model generated by the statistical language model generation device 1 is supplied to a speech recognition device (not shown), that combination of y_n and y_{n-1} appears more readily during speech recognition. In other words, the probability weights λ should be chosen so that the product of the bigram probabilities is maximized for the text being read aloud (the text to be recognized). Equivalently, the probability weights λ should be chosen so that the entropy of the evaluation data (the text to be recognized) is minimized (see, for example, "Speech Recognition by Probability Models," Seiichi Nakagawa, IEICE, p. 111, and Equation 2).
[0028]
[Equation 2]

[0029]
In Equation 2, N is the total number of words in the evaluation text (text data), and the evaluation text is represented by the word sequence y = y₁ y₂ ... y_N of the evaluation data. The weights λ in this equation are obtained with the expectation-maximization (EM) algorithm (see, for example, "Spoken Language Processing," Kita, Nakamura, and Nagata, Morikita Publishing, p. 31), by iterating Equation 3.
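Equation 2 itself is not reproduced here. A minimal sketch, assuming it is the standard per-word entropy H = -(1/N) Σ log₂ P(y_n | y_{n-1}) over the evaluation text, is given below. The function name, the `prob` callback, and the start symbol `<s>` are hypothetical.

```python
import math


def entropy_per_word(words, prob, bos="<s>"):
    """Per-word entropy in bits of a word sequence under a bigram model.

    words: evaluation text as a list y_1 ... y_N
    prob:  callable prob(prev, word) returning P(word | prev)
    """
    if not words:
        raise ValueError("evaluation text is empty")
    log_sum = 0.0
    prev = bos
    for word in words:
        p = prob(prev, word)
        if p <= 0.0:
            return math.inf   # a zero-probability word makes the entropy unbounded
        log_sum += math.log2(p)
        prev = word
    return -log_sum / len(words)
```

Minimizing this quantity over λ is equivalent to maximizing the product of bigram probabilities mentioned in the preceding paragraph, since the entropy is the negative mean log of that product.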
[0030]
[Equation 3]

[0031]
In Equation 3, λ_i is updated repeatedly until the entropy of the evaluation text converges. This calculation yields the probability weight λ of each language model automatically. However, finding the optimal probability weights λ for the text that will be read aloud is usually difficult, because the content of the evaluation text is unknown. For this reason, a transcript (text data) of known utterances related to the evaluation text is prepared in advance and used to determine the values of λ experimentally.
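Equation 3 is likewise missing from this text. The sketch below uses the textbook EM update for interpolation weights, λ_i ← (1/N) Σ_n λ_i P_i(y_n | y_{n-1}) / Σ_j λ_j P_j(y_n | y_{n-1}), which is an assumption about its form. As a simplification, the loop stops when the weights stop changing rather than when the entropy converges as the description states, and bigrams unseen by every model are skipped; names and defaults are illustrative.

```python
def em_weights(models, heldout_bigrams, iterations=50, tol=1e-8):
    """Estimate interpolation weights for bigram models by EM.

    models:          list of dicts mapping (prev, word) -> probability
    heldout_bigrams: list of (prev, word) pairs from a known transcript
    """
    k = len(models)
    lam = [1.0 / k] * k                       # start from uniform weights
    for _ in range(iterations):
        expected = [0.0] * k
        used = 0
        for prev, word in heldout_bigrams:
            parts = [lam[i] * models[i].get((prev, word), 0.0) for i in range(k)]
            mix = sum(parts)
            if mix == 0.0:
                continue                      # no model covers this bigram
            used += 1
            for i in range(k):
                expected[i] += parts[i] / mix  # E-step: responsibility of model i
        if used == 0:
            return lam
        new_lam = [e / used for e in expected]  # M-step: average responsibility
        if max(abs(a - b) for a, b in zip(new_lam, lam)) < tol:
            return new_lam
        lam = new_lam
    return lam
```

In practice the held-out bigrams would come from the prepared transcript of known utterances described above, so that the weights reflect spoken rather than written usage.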
[0032]
Next, the text weights w are obtained. The value of a text weight w gives a weighted word frequency. Let m₀ be the total number of words in the large amount of past text data G₀, m₁ the total number of words in the latest text data G₁, and m₂ the total number of words in the recognition results G₂. The text weights w are then calculated by Equation 4 from the converged probability weights λ₀, λ₁, λ₂ (the first, second, and third n-gram probability weights).
[0033]
[Equation 4]

[0034]
In Equation 4, the numbers of times (text weights) w₁ and w₂ that the latest text data G₁ and the recognition results G₂ are added to the large amount of past text data G₀ are calculated from the probability weights λ₀, λ₁, λ₂. Equation 4 normalizes the probability weights λ of the statistical language model into weights over the set of text data collections.
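Since the image of Equation 4 is not available, the sketch below assumes one form that is consistent with the description: w_i = (λ_i / λ₀) · (m₀ / m_i). With this choice, adding G_i to G₀ a total of w_i times makes the share of word tokens coming from G_i equal to λ_i. This is an assumption for illustration, not the patent's confirmed formula, and the function name is hypothetical.

```python
def text_weights(lams, sizes):
    """Convert probability weights into text weights (assumed form of Equation 4).

    lams:  [lambda_0, lambda_1, lambda_2, ...] converged probability weights
    sizes: [m_0, m_1, m_2, ...] total word counts of G_0, G_1, G_2, ...
    Returns [w_1, w_2, ...], the number of times each smaller corpus is added to G_0.
    """
    lam0, m0 = lams[0], sizes[0]
    if lam0 <= 0.0:
        raise ValueError("lambda_0 must be positive")
    if any(m <= 0 for m in sizes):
        raise ValueError("every corpus must contain at least one word")
    return [(lam / lam0) * (m0 / m) for lam, m in zip(lams[1:], sizes[1:])]
```

For example, with λ = (0.5, 0.3, 0.2) and corpus sizes of 1,000,000, 10,000, and 5,000 words, the small corpora would be counted about 60 and 80 times respectively, and the latest manuscripts would then make up 30% of the combined token mass.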
[0035]
Based on the calculated text weights w₁ and w₂, the latest text data is weighted by w₁ and the recognition results by w₂, both are added to the large amount of past text data, and a new vocabulary is obtained. That is, with f₀ the frequency of a word in the large amount of past text data G₀, f₁ its frequency in the latest text data G₁, and f₂ its frequency in the recognition results G₂, the appearance frequency f of that word is given by Equation 5.
[0036]
[Equation 5]

[0037]
Words are then registered in the vocabulary V in descending order of frequency f. An upper limit (V_max) on the number of registered words is set in advance, and words are registered so that this limit is not exceeded. As a result, while the total number of registered words stays bounded, words in the latest text data G₁ that previously had a low appearance frequency are weighted up and registered in the vocabulary.
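The frequency merge and the capped vocabulary selection can be sketched as follows, assuming Equation 5 has the form f = f₀ + w₁·f₁ + w₂·f₂ suggested by the surrounding description. The function names are hypothetical, and the alphabetical tie-break is an arbitrary choice added so that the result is deterministic.

```python
from collections import Counter


def merged_frequencies(f0, f1, f2, w1, w2):
    """Weighted word frequency f = f0 + w1*f1 + w2*f2 for every word seen anywhere."""
    merged = Counter()
    for word in set(f0) | set(f1) | set(f2):
        merged[word] = f0.get(word, 0) + w1 * f1.get(word, 0) + w2 * f2.get(word, 0)
    return merged


def select_vocabulary(freqs, v_max):
    """Register words in descending order of frequency, at most v_max of them."""
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:v_max]]
```

In a small example, a name that occurs only a few times in the latest manuscript can outrank an established but infrequent word once the text weights are applied, so it enters the vocabulary without raising the cap.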
[0038]
In other words, the language model calculation means 11 of the statistical language model generation device 1 raises the appearance frequency of new words (words absent from the large amount of past text data) found in the latest text data (the latest news manuscripts and the like). Moreover, because the new vocabulary is determined with the recognition results of the speech recognition means 7 also taken into account, recognition performance improves when a speech recognition device (not shown) uses the language model generated by the statistical language model generation device 1. In this embodiment, the generated language model is fed back to the speech recognition means 7 and used again during speech recognition.
[0039]
(Statistical language model generation apparatus: second embodiment)
FIG. 2 shows a functional explanatory diagram of the second embodiment of the statistical language model generation apparatus. In this statistical language model generation device 1A, components identical to those of the statistical language model generation device 1 are denoted by the same reference numerals, and their description is omitted.
[0040]
The recognition result correction means 13 of the statistical language model generation device 1A is a program that corrects the recognition results of the speech recognition means 7. For example, suppose that the text data contains the hiragana string 「あめがふる」 (ame ga furu) and that the speech recognition means 7, reading the text aloud, renders it as 「雨が降る」 ("rain falls"), that is, with the accent on "a". If the text actually meant 「飴が降る」 ("candy falls"), with the accent on "me", the recognition result is corrected by inference from the context before and after 「あめがふる」.
[0041]
The correction recognition result accumulation means 15 is a database held in a storage unit (not shown) and accumulates the recognition results corrected by the recognition result correction means 13. It also temporarily stores the recognition results of the speech recognition means 7 before correction.
[0042]
As with the language model calculation means 11, the language model calculation means 11A performs various calculations in the order shown below. In this embodiment, the calculations are based on the past large amount of text data, the latest text data, and the corrected recognition results (evaluation data), that is, the results obtained when the speech recognition means 7 recognizes these text data, which are then corrected and held in the correction recognition result accumulation means 15.
[0043]
First, based on the past large amount of text data, the text data of the latest reporter's manuscript, and the corrected recognition result, the language models (bigrams) P₀, P₁, and P₂′ are combined by linear interpolation (see Equation 1). The probability weights λ of the interpolated language model are then defined.
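For orientation, linear interpolation of three bigram models is commonly written in the form below. Equation 1 itself is not reproduced in this passage, so the exact notation here is an assumption built from the symbols P₀, P₁, P₂′ and λ₀, λ₁, λ₂′ used in the text.

```latex
P(w_i \mid w_{i-1})
  = \lambda_0\, P_0(w_i \mid w_{i-1})
  + \lambda_1\, P_1(w_i \mid w_{i-1})
  + \lambda_2'\, P_2'(w_i \mid w_{i-1}),
\qquad
\lambda_0 + \lambda_1 + \lambda_2' = 1,\quad \lambda_0, \lambda_1, \lambda_2' \ge 0
```

The constraint that the weights are non-negative and sum to one keeps the interpolated value a valid probability.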
[0044]
The values of the probability weights λ are determined so that the entropy of the data to be evaluated (the past large amount of text data, the text data of the latest reporter's manuscript, and the corrected recognition result) is minimized (see Equation 2). The probability weights λ are obtained by iterative calculation using the expectation-maximization (EM) algorithm (see Equation 3).
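The iterative estimation can be illustrated with the textbook EM update for mixture weights, which maximizes the likelihood of the evaluation data and therefore minimizes its cross-entropy. This is a sketch under assumptions: it is not claimed to match Equation 3 term by term, and the function name, the stopping rule, and the sample probabilities are hypothetical.

```python
def em_interpolation_weights(component_probs, n_iter=50, tol=1e-9):
    """Estimate interpolation weights by EM.

    component_probs: list of tuples; each tuple holds the probabilities
    that the K component models assign to one evaluation event
    (for example, one bigram of the evaluation data).
    """
    k = len(component_probs[0])
    lam = [1.0 / k] * k  # start from uniform weights
    for _ in range(n_iter):
        totals = [0.0] * k
        for probs in component_probs:
            mix = sum(l * p for l, p in zip(lam, probs))
            if mix == 0.0:
                continue  # no model explains this event; skip it
            # E-step: share of this event attributed to each model
            for j in range(k):
                totals[j] += lam[j] * probs[j] / mix
        norm = sum(totals)
        if norm == 0.0:
            return lam  # nothing to learn from; keep current weights
        new_lam = [t / norm for t in totals]  # M-step: renormalize
        converged = max(abs(a - b) for a, b in zip(lam, new_lam)) < tol
        lam = new_lam
        if converged:
            break
    return lam


# Hypothetical probabilities (P0, P1, P2') for three evaluation bigrams.
heldout = [(0.10, 0.40, 0.20), (0.30, 0.10, 0.10), (0.05, 0.50, 0.30)]
lam0, lam1, lam2 = em_interpolation_weights(heldout)
```

A model that consistently assigns higher probability to the evaluation data receives a larger weight after convergence.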
[0045]
From the converged probability weights λ₀, λ₁, λ₂′ (the first, second, and third n-gram probability weights), the text weights w₁ and w₂′ are calculated (see Equation 4). Based on the calculated text weights, the latest text data is weighted by the text weight w₁ and the corrected recognition result is weighted by the text weight w₂′, and both are added to the past large amount of text data to obtain a new vocabulary.
[0046]
That is, the appearance frequency f of a given word is expressed using its frequency f₀ in the past large amount of text data G₀, the product of its frequency f₁ in the latest text data G₁ and the text weight w₁, and the product of its frequency f₂′ in the corrected recognition result G₂′ and the text weight w₂′ (see Equation 5). Then, words are registered in the vocabulary V in descending order of frequency f. However, an upper limit (V_max) is set on the number of words, and words are registered so as not to exceed it.
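The frequency combination can be sketched as a weighted sum. The additive form f = f₀ + w₁·f₁ + w₂′·f₂′ is an assumption inferred from the description of weighting the newer data and adding it to the past data; Equation 5 is not reproduced in this passage. The function name, the weights, and the counts below are hypothetical.

```python
from collections import Counter


def combined_frequency(f0, f1, f2, w1, w2):
    """Combine per-source word counts into one weighted frequency f.

    f0 : counts in the past large amount of text data (G0)
    f1 : counts in the latest text data (G1), scaled by text weight w1
    f2 : counts in the corrected recognition result (G2'), scaled by w2
    """
    f = Counter()
    for word in set(f0) | set(f1) | set(f2):
        f[word] = (f0.get(word, 0)
                   + w1 * f1.get(word, 0)
                   + w2 * f2.get(word, 0))
    return f


# Hypothetical counts; "summit" is absent from the past data.
past = {"news": 100, "rain": 40}
latest = {"news": 2, "summit": 3}
corrected = {"summit": 1}
f = combined_frequency(past, latest, corrected, w1=10.0, w2=20.0)
```

In this example "summit" never occurs in the past data, yet its weighted frequency ends up above that of "rain", which is the effect the text describes for newly appearing words.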
[0047]
That is, the language model calculation means 11A of the statistical language model generation device 1A raises the appearance frequency of new words in the latest text data (the latest news manuscripts and the like) that are not included in the past large amount of text data. Moreover, the recognition result of the speech recognition means 7 is corrected by the recognition result correction means 13, and the new vocabulary is determined based on the corrected recognition result. A speech recognition device (not shown) that uses the language model generated by the statistical language model generation device 1A therefore achieves improved recognition performance. In this embodiment, the generated language model is fed back to the speech recognition means 7 and used again during speech recognition.
[0048]
In the statistical language model generation device 1A, the latest reporter manuscript accumulation means 5 acquires and accumulates the latest news programs and the like that are the target of speech recognition, the speech recognition output of the speech recognition means 7 is corrected by the recognition result correction means 13, and a correct character string corresponding to the recognized speech is created. Consequently, using the language model generated by the statistical language model generation device 1A means using correct character string information for broadcast programs (the target of speech recognition) that are very close in time, so speech recognition performance can be improved.
[0049]
In addition, by referring to the correct character string information for broadcast programs (the target of speech recognition) that are very close in time, recognition errors contained in the speech recognition output for the past large-scale database can be detected and corrected by the recognition result correction means 13.
[0050]
(Operation of statistical language model generator)
Next, the operation of the statistical language model generation device 1 will be described with reference to the flowchart shown in FIG. 3.
First, the past large amount of text data is accumulated by the past news manuscript accumulation means 3, and the initial vocabulary is determined according to the appearance frequency of each word included in the past large amount of text data (S1). The initial vocabulary usually consists of hundreds of thousands of words. In general, the number of registered words in the language model is set in advance according to the storage capacity of the storage unit (not shown) or the processing capacity of the main control unit (not shown). Words are registered in the vocabulary in descending order of appearance frequency in the accumulated or learned data until this number of registered words is reached.
[0051]
Meanwhile, the text data provided for the latest broadcast programs and the like (the latest text data) is accumulated by the latest reporter manuscript accumulation means 5, and the past large amount of text data and the latest text data are recognized as speech by the speech recognition means 7. The recognition results obtained by speech recognition are accumulated in the recognition result accumulation means 9.
[0052]
Then, the language model calculation means 11 first obtains each language model (bigrams P₀, P₁, P₂) and linearly interpolates these bigrams P₀, P₁, P₂ (see Equation 1) (S2). The probability weights λ₀, λ₁, λ₂ of these language models are calculated by the EM algorithm (see Equation 3) (S3), and the text weights w₁, w₂ are calculated based on these probability weights λ₀, λ₁, λ₂ (see Equation 4) (S4).
[0053]
Furthermore, the language model calculation means 11 calculates the word appearance frequency f based on the text weights w₁, w₂ (see Equation 5) (S5) and, based on the appearance frequency f, determines a new vocabulary by taking words in descending order of appearance frequency f until the number of registered words is reached (S6). Then, a language model is generated based on the new vocabulary (S7).
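As an illustration of step S7 only, the sketch below estimates bigram probabilities over a fixed vocabulary. It rests on several assumptions not stated in the text: out-of-vocabulary words are mapped to a hypothetical `<unk>` symbol, probabilities are plain maximum-likelihood estimates with no smoothing, and the function name and sample sentences are invented for the example.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical symbol for out-of-vocabulary words


def train_bigram(sentences, vocab):
    """Return a function prob(cur, prev) estimated from tokenized sentences."""
    vocab = set(vocab)
    history = Counter()  # how often each word occurs as a bigram history
    bigram = Counter()   # how often each (prev, cur) pair occurs
    for sent in sentences:
        # Replace words outside the new vocabulary with UNK.
        words = [w if w in vocab else UNK for w in sent]
        for prev, cur in zip(words, words[1:]):
            history[prev] += 1
            bigram[(prev, cur)] += 1

    def prob(cur, prev):
        # Maximum-likelihood estimate; 0.0 when the history was never seen.
        return bigram[(prev, cur)] / history[prev] if history[prev] else 0.0

    return prob


# Hypothetical training text and vocabulary ("snow" is out of vocabulary).
sentences = [["heavy", "rain", "in", "tokyo"],
             ["heavy", "snow", "in", "tokyo"]]
prob = train_bigram(sentences, vocab=["heavy", "rain", "in", "tokyo"])
```

A practical system would add smoothing or back-off so that unseen bigrams do not receive zero probability; that choice is outside what this passage specifies.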
[0054]
Although the present invention has been described above based on the embodiments, the present invention is not limited to them.
[0055]
For example, each configuration realized in the statistical language model generation device 1 or 1A can be handled as a program stored in a specific storage medium. Furthermore, for n-grams of higher order than the bigram (trigram, 4-gram), each probability weight λ can be calculated in the same way as for the bigram and converted into a text weight to create a new vocabulary V, from which a statistical language model can be generated.
[0056]
[Effects of the invention]
According to the first aspect of the present invention, the text data acquisition means acquires the latest text data including words that are predicted to appear frequently, the speech recognition means recognizes the latest text data and the past large amount of text data as speech, and the storage means stores the latest text data, the past large amount of text data, and the recognition result. Then, the probability weight calculation means calculates the probability weight of each n-gram, and the language model generation means generates the language model based on the calculation result. Using this language model in a speech recognition apparatus therefore improves recognition performance during speech recognition.
[0057]
In addition, since a certain probability weight is added to the latest text data and included in the vocabulary for generating the language model, it is possible to increase the appearance probability of words that are likely to be included in the latest utterance content.
[0058]
According to the second aspect of the present invention, the result of recognizing the text data as speech is corrected by the recognition result correction means, and the probability weight calculation means calculates the third n-gram probability weight based on the corrected recognition result. If a language model obtained from the corrected recognition result is used in a speech recognition apparatus, recognition performance during speech recognition can therefore be improved further.
[0059]
According to the third aspect of the present invention, the text data acquisition means of the statistical language model generation program acquires the latest text data including words that are predicted to appear frequently, the speech recognition means recognizes the latest text data and the past large amount of text data as speech, and the storage means stores the latest text data, the past large amount of text data, and the recognition result. Then, the probability weight calculation means calculates the probability weight of each n-gram, and the language model generation means generates the language model based on the calculation result. Using this language model in a speech recognition apparatus therefore improves recognition performance during speech recognition.
[0060]
This statistical language model generation program can also be distributed on the market stored on a storage medium.
[Brief description of the drawings]
FIG. 1 is a functional explanatory diagram of a statistical language model generation apparatus according to a first embodiment of the present invention.
FIG. 2 is a functional explanatory diagram of a statistical language model generation apparatus according to a second embodiment of the present invention.
FIG. 3 is a flowchart illustrating the operation of the statistical language model generation device.
[Explanation of symbols]
1, 1A Statistical language model generator
3 Past news manuscript accumulation means
5 Latest reporter manuscript accumulation means
7 Speech recognition means
9 Recognition result accumulation means
11, 11A Language model calculation means
13 Recognition result correction means
15 Correction recognition result accumulation means

Claims

A statistical language model generation device for generating a statistical language model, comprising:
Text data acquisition means for acquiring the latest text data including words that are predicted to appear frequently;
Speech recognition means for recognizing speech, by referring to a statistical language model, from the latest text data and from a past large amount of text data having a data amount larger than that of the latest text data;
Storage means for storing the latest text data, the past large amount of text data, and a recognition result by the speech recognition means;
Probability weight calculation means (paragraphs 0029 to 0032) for calculating, by the EM algorithm, a first n-gram probability weight based on the frequencies of word n-tuples in the past large amount of text data, a second n-gram probability weight based on the frequencies of word n-tuples in the latest text data, and a third n-gram probability weight based on the frequencies of word n-tuples in the recognition result; and
Language model generating means for generating a statistical language model based on the first probability weight, the second probability weight, and the third probability weight.

The statistical language model generation apparatus according to claim 1, further comprising recognition result correcting means for correcting the recognition result recognized by the speech recognition means,
wherein the probability weight calculation means calculates the third n-gram probability weight based on the corrected recognition result.

A statistical language model generation program that causes a computer to function as:
Text data acquisition means for acquiring the latest text data including words that are expected to appear frequently;
Speech recognition means for recognizing speech, by referring to a statistical language model, from the latest text data and from a past large amount of text data having a data amount larger than that of the latest text data;
Storage means for storing the latest text data, the past large amount of text data, and the recognition result recognized by the speech recognition means;
Probability weight calculation means for calculating, by the EM algorithm, a first n-gram probability weight based on the frequencies of word n-tuples in the past large amount of text data, a second n-gram probability weight based on the frequencies of word n-tuples in the latest text data, and a third n-gram probability weight based on the frequencies of word n-tuples in the recognition result; and
Language model generation means for generating a statistical language model based on the first probability weight, the second probability weight, and the third probability weight.