JP3961780B2 - Language model learning apparatus and speech recognition apparatus using the same

Info

Publication number: JP3961780B2
Application number: JP2001144885A
Authority: JP
Inventors: 洋平岡登; 純石井
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2001-05-15
Filing date: 2001-05-15
Publication date: 2007-08-22
Anticipated expiration: 2021-05-15
Also published as: JP2002342323A

Abstract

PROBLEM TO BE SOLVED: To provide a language model learning apparatus with improved recognition accuracy. SOLUTION: The apparatus comprises target task language data 101, general task language data 102, and similar word pair extracting means 103, similar word string synthesizing means 104 and language model generating means 105, which read text data from the respective language data to build a task-adapted language model. The similar word pair extracting means 103 reads the respective text data and extracts similar word pairs from combinations of words contained in the target task data and words contained in the general task data. The similar word string synthesizing means 104 reads the respective text data and the similar word pairs, and synthesizes word strings that contain target task words and are not contained in the language data. The language model generating means 105 obtains word string statistics, weighting each set of text data.

Description

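【０００１】
【Technical Field of the Invention】
This invention relates to a language model learning apparatus that uses a probabilistic language model, and to a speech recognition apparatus using the same.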
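【０００２】
【Prior Art】
In speech recognition, a digitized input speech signal is generally converted, by signal processing, into a time series of vectors that represent the acoustic features of the speech well, and this series is then matched against speech models.
【０００３】
The matching process corresponds to the problem of finding the uttered word string W (= [w_1, w_2, ..., w_M], where M is the number of words) from an acoustic feature vector time series A (= [a_1, a_2, ..., a_K]) consisting of K time frames.
【０００４】
To estimate the word string W that gives the highest recognition accuracy in this matching process, it suffices to find the recognized word string W^* that maximizes the probability P(W|A), as in equation (1) below.
【０００５】
【Equation 1】
$$W^{*} = \arg\max_{W} P(W \mid A) \qquad (1)$$
【０００６】
However, it is usually difficult to obtain the probability P(W|A) in equation (1) directly. Using Bayes' theorem, P(W|A) is therefore rewritten as in equation (2) below.
【０００７】
【Equation 2】
$$P(W \mid A) = \frac{P(A \mid W)\,P(W)}{P(A)} \qquad (2)$$
【０００８】
When searching for the word string W that maximizes the left-hand side of equation (2), the denominator P(A) on the right-hand side does not depend on the candidate word string W, so it suffices to find the W that maximizes the numerator. The recognized word string W^* is therefore given by equation (3) below.
【０００９】
【Equation 3】
$$W^{*} = \arg\max_{W} P(A \mid W)\,P(W) \qquad (3)$$
【００１０】
The probabilistic models that give P(W) and P(A|W) in equation (3) are called the language model and the acoustic model, respectively.
A modeling approach that has been studied actively in recent years represents the acoustic model by a hidden Markov model and the language model by a probabilistic language model.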
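As a minimal sketch of the decision rule in equation (3), the following Python snippet picks the candidate word string with the largest log P(A|W) + log P(W). The function name `decode`, the candidate strings and all scores are hypothetical illustrations, not values from this document.

```python
import math


def decode(hypotheses):
    """Return the word string W that maximizes log P(A|W) + log P(W).

    hypotheses maps a candidate word string (a tuple of words) to a pair
    (acoustic_logprob, language_logprob). Both scores are assumed inputs.
    """
    return max(hypotheses, key=lambda w: sum(hypotheses[w]))


# Hypothetical scores: the acoustically better candidate loses because the
# language model considers it far less likely.
candidates = {
    ("hotel", "please"): (math.log(0.02), math.log(0.010)),
    ("hostel", "please"): (math.log(0.03), math.log(0.001)),
}
best = decode(candidates)
```

Working in the log domain turns the product in equation (3) into a sum, which avoids numerical underflow for long utterances.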
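【００１１】
Details of these modeling methods are given, for example, in "音声認識の基礎（上、下）" (L. R. Rabiner and B. H. Juang, translation supervised by Furui, November 1995, NTT Advanced Technology) (hereinafter "Reference 1") and in "確率的言語モデル" (Kenji Kita, University of Tokyo Press) (hereinafter "Reference 2").
【００１２】
In these methods, the parameters of the probabilistic models are estimated statistically from large amounts of data.
For the acoustic model, speech data such as words and sentences is collected in advance from many speakers, and the parameters are estimated by statistical methods so as to improve the recognition accuracy, or an index that correlates well with it.
【００１３】
For example, the Baum-Welch algorithm is used to estimate the parameters of the hidden Markov models that make up the acoustic model so that the likelihood of the training data increases.
Methods for estimating the acoustic model are described in detail in the second volume of Reference 1.
【００１４】
Similarly, for the language model, the probability of each utterance, and of the words that make up the utterances, is computed from text such as newspapers and conversation transcripts according to the structure of the language model.
【００１５】
Commonly used language model structures include the N-gram language model, which predicts the probability of the next word with an (N-1)-th order Markov model over the immediately preceding words, the probabilistic context-free grammar, and combinations of these.
【００１６】
The N-gram language model in particular is widely used because it is effective and its parameter estimation is easy to implement.
The following description therefore uses the N-gram language model as the example for language model construction.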
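【００１７】
For example, when N = 2 in the N-gram language model (called a bigram language model), P(W) in equation (3) is approximated as in equation (4) below.
【００１８】
【Equation 4】
$$P(W) \approx \prod_{i=1}^{M} P(w_i \mid w_{i-1}) \qquad (4)$$
【００１９】
The conditional probability P(w_N | w_1, ..., w_{N-1}), which is the parameter of the N-gram language model, is estimated from the frequency C(w_1, ..., w_N) of adjacent word strings in the training text data, as in equation (5) below.
【００２０】
【Equation 5】
$$P(w_N \mid w_1, \ldots, w_{N-1}) = \frac{C(w_1, \ldots, w_N)}{C(w_1, \ldots, w_{N-1})} \qquad (5)$$
【００２１】
However, if the conditional probability of a word is estimated simply as in equation (5), the probability of any sentence that contains a word string absent from the training data becomes 0.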
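The relative-frequency estimate and the zero-probability problem described above can be illustrated with a small Python sketch. It assumes a sentence-start symbol `<s>` as the history of the first word, which is an assumption of this example; the function names and the toy corpus are hypothetical.

```python
from collections import Counter


def train_bigram_mle(sentences, bos="<s>"):
    """Estimate P(cur | prev) as C(prev, cur) / C(prev) from tokenized sentences."""
    history, pairs = Counter(), Counter()
    for words in sentences:
        padded = [bos] + list(words)  # assumed sentence-start symbol
        for prev, cur in zip(padded, padded[1:]):
            history[prev] += 1
            pairs[(prev, cur)] += 1
    return {pair: count / history[pair[0]] for pair, count in pairs.items()}


def sentence_prob(words, probs, bos="<s>"):
    """Bigram approximation of P(W); an unseen bigram contributes 0."""
    padded = [bos] + list(words)
    result = 1.0
    for prev, cur in zip(padded, padded[1:]):
        result *= probs.get((prev, cur), 0.0)
    return result


corpus = [["I", "want", "a", "hotel"], ["I", "want", "a", "room"]]
probs = train_bigram_mle(corpus)
seen = sentence_prob(["I", "want", "a", "hotel"], probs)
unseen = sentence_prob(["I", "want", "a", "taxi"], probs)
```

Here `unseen` is exactly 0 because the bigram ("a", "taxi") never occurs in the toy corpus, which is the situation that smoothing is meant to address.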
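【００２２】
To prevent this, non-zero probabilities are assigned to word strings that do not appear in the training text, a process generally called smoothing.
【００２３】
The most common smoothing method is the back-off smoothing proposed by Katz.
In back-off smoothing, a fraction that depends on the frequency is removed from the probabilities estimated by equation (5) (discounting), and the removed probability is assigned to word strings that did not appear in the training data.
【００２４】
The conditional probability assigned to a word string that did not appear in the training data is taken from a coarser language model.
In Katz's method, the (N-1)-gram is used as the model coarser than the N-gram. Details of this method are given on page 67 of Reference 2.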
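The idea of discounting seen word strings and backing off to a coarser model can be sketched as follows. This is a simplified illustration that uses a fixed absolute discount instead of the frequency-dependent discounts of Katz's method, so it should not be read as Katz's algorithm itself; the function name, the `discount` value and the toy corpus are assumptions.

```python
from collections import Counter


def backoff_bigram(sentences, discount=0.5):
    """Return prob(cur, prev) for a bigram model that backs off to unigrams."""
    uni, ctx, big = Counter(), Counter(), Counter()
    for words in sentences:
        uni.update(words)
        for prev, cur in zip(words, words[1:]):
            ctx[prev] += 1
            big[(prev, cur)] += 1
    total = sum(uni.values())
    p_uni = {w: c / total for w, c in uni.items()}

    def prob(cur, prev):
        if ctx[prev] == 0:  # history never seen: use the unigram model
            return p_uni.get(cur, 0.0)
        seen = {w for (p, w) in big if p == prev}
        if cur in seen:  # seen bigram: discounted relative frequency
            return (big[(prev, cur)] - discount) / ctx[prev]
        # unseen bigram: share the discounted mass in proportion to unigrams
        reserved = discount * len(seen) / ctx[prev]
        unseen_mass = sum(p for w, p in p_uni.items() if w not in seen)
        return reserved * p_uni.get(cur, 0.0) / unseen_mass if unseen_mass else 0.0

    return prob, set(uni)


corpus = [["to", "narita"], ["to", "narita"], ["from", "narita"], ["to", "yokohama"]]
prob, vocab = backoff_bigram(corpus)
p_unseen = prob("yokohama", "from")
p_total = sum(prob(w, "from") for w in vocab)
```

The unseen bigram ("from", "yokohama") now receives a non-zero probability, and the probabilities after the history "from" still sum to 1 as long as at least one vocabulary word is unseen after that history.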
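【００２５】
Japanese text is not written with spaces between words, so the definition of a word is ambiguous. In this document, each unit obtained by dividing the text into consistent parts by some means is defined as a word.
【００２６】
That is, a word may be a linguistic unit such as a character, a morpheme or a phrase (bunsetsu), a unit obtained by dividing the text according to an entropy criterion, or a combination of these, and includes the case where linguistic information such as a reading or part of speech is attached to the unit.
【００２７】
Building a language model with these statistical methods requires a large amount of speech data and text data to estimate the model parameters.
The N-gram language model in particular depends strongly on its training data, so a large amount of data must be collected for each task to be handled (hereinafter the "target task").
【００２８】
However, collecting a large amount of text data for every task is difficult, and it is desirable to be able to build the language model from a small amount of text data on the target task. Class language models and task adaptation are used for this purpose.
【００２９】
A class language model groups similar words and treats them as one class (group). It reduces the number of parameters to be estimated and assigns appropriate probabilities to words that do not appear in the training data.
【００３０】
The relation between words and classes is determined manually according to the words and the task, or determined from data, and it can also be applied to N-gram language models.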
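【００３１】
For example, the probability of a sentence under a bigram class language model is defined, as in equation (6) below, as the product of
(1) the transition probability between classes, P(c_i | c_{i-1}), and
(2) the probability that a particular word is selected from within its class, P(w_i | c_i).
【００３２】
【Equation 6】
$$P(W) \approx \prod_{i=1}^{M} P(c_i \mid c_{i-1})\,P(w_i \mid c_i) \qquad (6)$$
【００３３】
Consider, for example, 1000 words divided into 100 classes of 10 words each. The word bigram language model then has 1000^2 (= 1,000,000) parameters to estimate.
【００３４】
In contrast, the number of parameters of the class bigram language model is the sum of
(1) the transitions between classes and
(2) the mapping between classes and words,
which reduces to 100^2 + 100 × 10 (= 11,000).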
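The parameter counts above, and the class bigram sentence probability as a product of class transitions and within-class word probabilities, can be checked with a short Python sketch. The start symbol `<s>`, the class names and the toy probabilities are hypothetical assumptions of this example.

```python
def word_bigram_params(vocab_size):
    """Number of conditional probabilities in a word bigram model."""
    return vocab_size ** 2


def class_bigram_params(num_classes, words_per_class):
    """Class transitions plus the class-to-word mapping."""
    return num_classes ** 2 + num_classes * words_per_class


def class_bigram_sentence_prob(words, word_class, p_class, p_word, bos="<s>"):
    """Product of P(c_i | c_{i-1}) * P(w_i | c_i) over the sentence."""
    prob, prev_class = 1.0, bos  # assumed start symbol for the first transition
    for w in words:
        c = word_class[w]
        prob *= p_class[(prev_class, c)] * p_word[(w, c)]
        prev_class = c
    return prob


# Hypothetical toy model with two classes.
word_class = {"narita": "PLACE", "to": "PARTICLE"}
p_class = {("<s>", "PLACE"): 0.5, ("PLACE", "PARTICLE"): 0.4}
p_word = {("narita", "PLACE"): 0.25, ("to", "PARTICLE"): 1.0}
p_sentence = class_bigram_sentence_prob(["narita", "to"], word_class, p_class, p_word)
```

With 1000 words in 100 classes of 10 words, the two counting functions return 1,000,000 and 11,000 respectively, matching the figures in the text.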
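【００３５】
The correspondence between words and classes may be determined manually or obtained by word clustering on language data.
FIG. 20 is an explanatory diagram showing an example of a class definition. It lists each word w, the class c to which w belongs, and the probability P(w|c) that w is output from that class.
【００３６】
In a class N-gram language model, the inter-class transition model is estimated in the same way as an ordinary word N-gram.
The construction of class N-gram language models is described in detail from page 72 of Reference 2 onward.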
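【００３７】
Task adaptation, on the other hand, compensates for the shortage of training data by also using text data from outside the target task.
Here, text data that includes tasks other than the target task is called general task language data.
【００３８】
For task adaptation, a method has been proposed in "Ｎ−ｇｒａｍのタスク適応における語彙の設定法の検討" (Akinori Ito and Masaki Kohda, IEICE Technical Report, pp. 51-58, SP97-25, 1997) (hereinafter "Reference 3").
【００３９】
This method adapts an N-gram language model to a task by adding the training data of the target task and of the general task with weights.
【００４０】
FIG. 21 is a block diagram schematically showing an apparatus that applies the language model construction method for speech recognition described in Reference 3.
In FIG. 21, 100 is language model estimating means that generates a task-adapted language model.
【００４１】
101 is target task language data: an accumulation of text data of the target task, in which the text of the sentences to be recognized in the target task is divided into words.
102 is general task language data: an accumulation of text data of general tasks, including tasks other than the target task, in which the text of the sentences in the general tasks is divided into words.
【００４２】
The language model estimating means 100 reads the target task language data 101 and the general task language data 102, applies an appropriate weight to each, counts the frequencies of word strings, and estimates the parameters of the language model by a statistical method.
【００４３】
A weight is given for each input.
For example, if the word string "私、は" appears twice in the target task and four times in the general task, and the frequency weight is 3 for the target task and 1 for the general task, the frequency of "私、は" is estimated as 10 (= 3 × 2 + 1 × 4).
【００４４】
The weighting coefficients need not be integers.
In addition, when counting, low-frequency words can be removed if necessary and the removed probability redistributed with equal probability to the words needed for recognition.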
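The weighted counting in the example above (3 × 2 + 1 × 4 = 10) can be sketched in Python as follows. The function name and the data layout, a list of (sentences, weight) pairs, are assumptions made for illustration.

```python
from collections import Counter


def weighted_ngram_counts(corpora, n=2):
    """Accumulate n-gram frequencies, each occurrence scaled by its corpus weight.

    corpora is a list of (sentences, weight) pairs, where sentences is a list
    of tokenized sentences. Weights may be non-integer.
    """
    counts = Counter()
    for sentences, weight in corpora:
        for words in sentences:
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += weight
    return counts


target = [["私", "は"]] * 2   # the word string occurs twice in the target task
general = [["私", "は"]] * 4  # and four times in the general task
counts = weighted_ngram_counts([(target, 3), (general, 1)])
```

The resulting count for ("私", "は") is 10, and the same function accepts fractional weights such as 2.5 without change.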
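【００４５】
From the frequency information obtained in this way (10 in the example above), probabilities for known and unknown word strings are estimated, for example by Katz's back-off smoothing.
The frequency weights can be determined, for example, by deleted estimation so as to increase the probability that the resulting language model assigns to test data.
Deleted estimation is described on page 49 of Reference 2.
【００４６】
Next, the language model learning procedure with task adaptation based on the conventional apparatus and method of FIG. 21 is described with reference to the flowchart of FIG. 22.
First, the language model estimating means 100 reads the weight parameters for its inputs from weight parameter storing means (not shown) (step S2201).
【００４７】
Next, it reads the training text, divided into words, from the target task language data 101 and the general task language data 102, and obtains the frequencies of word strings of n words or fewer, weighted according to the weight parameters (step S2202).
【００４８】
Finally, it performs smoothing, for example Katz's back-off smoothing, estimates the parameters of the language model (step S2203), and ends the processing routine of FIG. 22.
【００４９】
By also using the text data of the general task language data 102, this method can estimate more reasonably the probabilities of word strings representing varied expressions that are hard to obtain from a small amount of training data on the target task.
【００５０】
At the same time, weighting the target task language data 101 gives larger probabilities to word strings that appear in the target task corpus, which improves recognition accuracy.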
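【００５１】
However, although this task adaptation method estimates well the probabilities of words unique to the target task and of word strings that appear in the general task, it does not consider combinations of words specific to the target task with words that appear in the general task. When the target task has little text data, the parameter estimation accuracy of the language model therefore deteriorates around the words specific to the target task.
【００５２】
Consider, for example, a target task of hotel reservation, with text data uttered in similar, non-hotel reservation tasks used as the general task language data 102.
【００５３】
In this case, the probabilities of word strings common to reservation tasks, such as "それ、を、お願い", and of a word specific to the target task, such as "ホテル" (hotel), are estimated according to their frequencies from the general task language data 102 and the target task language data 101, respectively.
【００５４】
However, because the number of possible word combinations is very large, a word string containing a word specific to the target task, such as "ホテル、を、お願い", is often not covered sufficiently when the target task text data is small.
【００５５】
As a result, an inappropriate probability is assigned to the word string, and recognition accuracy may fall.
Words specific to the target task are often important for carrying out the task, so lower recognition accuracy around them is likely to affect the performance of the whole system substantially.
【００５６】
【Problems to be Solved by the Invention】
As described above, the conventional language model learning apparatus and the speech recognition apparatus using it do not consider combinations of words specific to the target task with words appearing in the general task. When the target task has little text data, the parameter estimation accuracy of the language model deteriorates around the words specific to the target task, which adversely affects the performance of the whole system.
【００５７】
This invention was made to solve this problem. Its object is to obtain a language model learning apparatus with improved recognition accuracy, and a speech recognition apparatus using it, by finding similar words between the words unique to the target task and the general task data and using them to estimate the probabilities of word strings that contain task-specific words.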
【００５８】
【課題を解決するための手段】
この発明の請求項１に係る言語モデル学習装置は、対象タスクのテキストデータを集積した対象タスク言語データと、対象タスク以外のタスクを含む一般タスクのテキストデータを集積した一般タスク言語データと、対象タスク言語データおよび一般タスク言語データから、それぞれ言語モデル学習用のテキストデータを読み込み、タスク適応化済み言語モデルを構築するための、類似単語対抽出手段、類似単語列合成手段および言語モデル生成手段とを備え、類似単語対抽出手段は、対象タスク言語データおよび一般タスク言語データから各テキストデータを読み込み、対象タスクのテキストデータに含まれる単語と一般タスクのテキストデータに含まれる単語との組み合わせから類似単語対を抽出し、類似単語列合成手段は、各テキストデータを読み込むとともに、類似単語対抽出手段から類似単語対を読み込み、言語データに含まれない対象タスク内の単語を含む単語列を合成して出力し、言語モデル生成手段は、各テキストデータを読み込むとともに、類似単語列合成手段から単語列を読み込み、各テキストデータ毎に重み付けて単語列の統計量を求めることにより、タスク適応化済み言語モデルを生成するものである。
【００５９】
また、この発明の請求項２に係る言語モデル学習装置は、対象タスクのテキストデータを集積した対象タスク言語データと、対象タスク以外のタスクを含む一般タスクのテキストデータを集積した一般タスク言語データと、対象タスク言語データおよび一般タスク言語データからタスク適応化済み言語モデルを構築するための、対象タスク単語クラス化手段、一般タスク単語クラス化手段および言語モデル生成手段とを備え、対象タスク単語クラス化手段は、対象タスク言語データから対象タスクのテキストデータを読み込み、クラス定義に示されたクラスに単語を置き換えて、言語モデル学習用のクラス化された第１のテキストデータを出力し、一般タスク単語クラス化手段は、一般タスク言語データから一般タスクのテキストデータを読み込み、クラス定義に示されたクラスに単語を置き換えて、言語モデル学習用のクラス化された第２のテキストデータを出力し、言語モデル生成手段は、第１および第２のテキストデータを読み込み、各テキストデータ毎に重み付けて単語列の統計量を求めることにより、言語モデルを生成するものである。
【００６０】
また、この発明の請求項３に係る言語モデル学習装置は、対象タスクのテキストデータを集積した対象タスク言語データと、対象タスク以外のタスクを含む一般タスクのテキストデータを集積した一般タスク言語データと、対象タスク言語データおよび一般タスク言語データからタスク適応化済み言語モデルを構築するための、対象タスク単語クラス化手段、一般タスク単語クラス化手段、類似単語対抽出手段、類似単語列合成手段および言語モデル生成手段とを備え、対象タスク単語クラス化手段は、対象タスク言語データから対象タスクのテキストデータを読み込み、クラス定義に示されたクラスに単語を置き換えて、言語モデル学習用のクラス化された第１のテキストデータを出力し、一般タスク単語クラス化手段は、一般タスク言語データから一般タスクのテキストデータを読み込み、クラス定義に示されたクラスに単語を置き換えて、言語モデル学習用のクラス化された第２のテキストデータを出力し、類似単語対抽出手段は、第１および第２のテキストデータを読み込み、対象タスクのテキストデータに含まれる単語と一般タスクのテキストデータに含まれる単語との組み合わせから類似単語対を抽出し、類似単語列合成手段は、第１および第２のテキストデータを読み込むとともに、類似単語対抽出手段から類似単語対を読み込み、言語データに含まれない対象タスク内の単語を含む単語列を合成して出力し、言語モデル生成手段は、第１および第２のテキストデータを読み込むとともに、類似単語列合成手段から単語列を読み込み、各テキストデータ毎に重み付けて単語列の統計量を求めることにより、タスク適応化済み言語モデルを生成するものである。
【００６１】
また、この発明の請求項４に係る言語モデル学習装置は、対象タスクのテキストデータを集積した対象タスク言語データと、対象タスク以外のタスクを含む一般タスクのテキストデータを集積した一般タスク言語データと、事前に準備したテキストデータを用いて作成された初期言語モデルと、対象タスク言語データ、一般タスク言語データおよび初期言語モデルから、タスク適応化済み統計的言語モデルを構築するための、類似単語対抽出手段および類似単語確率補正手段とを備え、類似単語対抽出手段は、対象タスク言語データおよび一般タスク言語データから、それぞれ言語モデル学習用のテキストデータを読み込み、対象タスクのテキストデータに含まれる単語と一般タスクのテキストデータに含まれる単語との組み合わせから類似単語対を抽出し、類似単語確率補正手段は、類似単語対抽出手段から類似単語対を読み込むとともに、初期言語モデルを読み込み、対象タスクで出現する単語の出現確率のスムージングを行うことにより、タスク適応化済み統計的言語モデルを生成するものである。
【００６２】
また、この発明の請求項５に係る言語モデル学習装置は、対象タスクのテキストデータを集積した対象タスク言語データと、対象タスク以外のタスクを含む一般タスクのテキストデータを集積した一般タスク言語データと、あらかじめ作成された初期クラス言語モデルと、対象タスク言語データ、一般タスク言語データおよび初期クラス言語モデルから、タスク適応化済みクラス言語モデルを構築するための、対象タスク単語クラス化手段、一般タスク単語クラス化手段、類似単語対抽出手段および類似単語確率補正手段とを備え、対象タスク単語クラス化手段は、対象タスク言語データから対象タスクのテキストデータを読み込み、クラス定義に示されたクラスに単語を置き換えて、言語モデル学習用のクラス化された第１のテキストデータを出力し、一般タスク単語クラス化手段は、一般タスク言語データから一般タスクのテキストデータを読み込み、クラス定義に示されたクラスに単語を置き換えて、言語モデル学習用のクラス化された第２のテキストデータを出力し、類似単語対抽出手段は、第１および第２のテキストデータを読み込み、対象タスクのテキストデータに含まれる単語と一般タスクのテキストデータに含まれる単語との組み合わせから類似単語対を抽出し、類似単語確率補正手段は、類似単語対抽出手段から類似単語対を読み込むとともに、初期クラス言語モデルを読み込み、対象タスクで出現する単語の出現確率のスムージングを行うことにより、タスク適応化済みクラス言語モデルを生成するものである。
【００６３】
また、この発明の請求項６に係る言語モデル学習装置は、請求項１または請求項４において、類似単語抽出手段は、距離算出用言語モデル生成手段、統計的単語間距離算出手段およびしきい値判定手段を含み、距離算出用言語モデル生成手段は、対象タスク言語データおよび一般タスク言語データから、それぞれ言語モデル学習用のテキストデータを読み込み、各テキストデータ毎に重み付けて単語列の統計量を求めて、距離算出用の統計的言語モデルを生成し、統計的単語間距離算出手段は、距離算出用言語モデル生成手段から統計的言語モデルを読み込み、各テキストデータから抽出した単語からなる単語対について、統計的言語モデル上の統計的な距離を単語間距離として求め、しきい値判定手段は、統計的単語間距離算出手段から単語対および単語間距離を読み込み、所定のしきい値を越える単語対を出力するものである。
【００６４】
また、この発明の請求項７に係る言語モデル学習装置は、請求項１または請求項４において、類似単語抽出手段は、距離算出用言語モデル、統計的単語間距離算出手段およびしきい値判定手段を含み、距離算出用言語モデルは、事前に準備したテキストデータを用いて作成されており、統計的単語間距離算出手段は、距離算出用言語モデルを読み込み、各テキストデータから抽出した単語からなる単語対について、距離算出用言語モデル上の統計的な距離を単語間距離として求め、しきい値判定手段は、統計的単語間距離算出手段から単語対および単語間距離を読み込み、所定のしきい値を越える単語対を出力するものである。
【００６５】
また、この発明の請求項８に係る言語モデル学習装置は、請求項３または請求項５において、類似単語抽出手段は、距離算出用言語モデル生成手段、統計的単語間距離算出手段およびしきい値判定手段を含み、距離算出用言語モデル生成手段は、対象タスク単語クラス化手段および一般タスク単語クラス化手段から第１および第２のテキストデータを読み込み、各テキストデータ毎に重み付けて単語列の統計量を求めて、距離算出用の統計的言語モデルを生成し、統計的単語間距離算出手段は、距離算出用言語モデル生成手段から統計的言語モデルを読み込み、各テキストデータから抽出した単語からなる単語対について、統計的言語モデル上の統計的な距離を単語間距離として求め、しきい値判定手段は、統計的単語間距離算出手段から単語対および単語間距離を読み込み、所定のしきい値を越える単語対を出力するものである。
【００６６】
また、この発明の請求項９に係る言語モデル学習装置は、請求項３または請求項５において、類似単語抽出手段は、距離算出用クラス言語モデル、統計的単語間距離算出手段およびしきい値判定手段を含み、距離算出用クラス言語モデルは、事前に準備したテキストデータを用いて作成されており、統計的単語間距離算出手段は、距離算出用クラス言語モデルを読み込むとともに、対象タスク単語クラス化手段および一般タスク単語クラス化手段から第１および第２のテキストデータを読み込み、各テキストデータから抽出した単語からなる単語対について、距離算出用クラス言語モデル上の統計的な距離を単語間距離として求め、しきい値判定手段は、統計的単語間距離算出手段から単語対および単語間距離を読み込み、所定のしきい値を越える単語対を出力するものである。
【００６７】
また、この発明の請求項１０に係る言語モデル学習装置は、請求項６から請求項９までのいずれかにおいて、統計的単語間距離算出手段は、Ｎグラム言語モデル上のユークリッド距離を用いて、単語間距離を測定するものである。
【００６８】
また、この発明の請求項１１に係る言語モデル学習装置は、請求項６から請求項９までのいずれかにおいて、統計的単語間距離算出手段は、Ｎグラム言語モデル上のクロスエントロピーを用いて、単語間距離を測定するものである。
【００６９】
また、この発明の請求項１２に係る音声認識装置は、請求項１から請求項１１までのいずれかの言語モデル学習装置を用いた音声認識装置であって、言語モデルまたはクラス言語モデルは、音声認識に用いられるものである。
【００８４】
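【Embodiments of the Invention】
Embodiment 1.
Embodiment 1 of this invention is described in detail below with reference to the drawings. The description uses an N-gram language model as the example, but the invention is applicable to any statistical language model.
【００８５】
FIG. 1 is a block diagram schematically showing a language model learning apparatus according to Embodiment 1, as an example configuration of a language model learning apparatus for speech recognition.
In FIG. 1, 101 is the target task language data, divided into words, and 102 is the general task language data, divided into words; both are the same as described above (see FIG. 21).
【００８６】
103 is similar word pair extracting means, 104 is similar word string synthesizing means, and 105 is language model generating means. These means 103 to 105 generate a task-adapted language model from the target task language data 101 and the general task language data 102.
【００８７】
The language model generating means 105 corresponds to the language model estimating means 100 described above and generates the task-adapted language model.
The similar word pair extracting means 103 and the similar word string synthesizing means 104 differ from the conventional apparatus and form the characteristic part of this invention.
【００８８】
That is, means 103 and 104 find, for each word unique to the target task, similar words in the general task, synthesize word strings in which a general task word in the training text is replaced by the similar target task word, and add these to the training text of the language model. This raises recognition accuracy even when only a small amount of target task text data is available for building the language model.
【００８９】
The functions of means 103 to 105 in FIG. 1 are described concretely below in relation to the models and data.
Functional blocks and models that are the same as described above are given the same reference numerals and are not described again.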
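【００９０】
First, the similar word pair extracting means 103 computes the distance between words, based on a predefined distance measure, for every combination (wT, wG) of a word wT contained in the target task language data 101 and a word wG contained in the general task language data 102.
【００９１】
When the computed distance is smaller than a preset threshold th, the similar word pair extracting means 103 outputs the pair (wT, wG) to the similar word string synthesizing means 104 as a similar word pair.
【００９２】
The distance d(wT, wG) between words is obtained, for example, by arranging the semantic categories of the words in advance in a tree structure ordered by breadth of concept, and using the number of arcs between the semantic nodes of the two words as the distance.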
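A tree-based distance of this kind, combined with the threshold test, might look like the following Python sketch. The taxonomy, the romanized node names and the threshold value are all hypothetical; the document does not specify a particular semantic tree.

```python
def tree_distance(a, b, parent):
    """Number of arcs on the path between nodes a and b in a parent-linked tree."""
    def ancestors(node):
        path = [node]
        while node in parent:
            node = parent[node]
            path.append(node)
        return path

    depth_from_a = {n: i for i, n in enumerate(ancestors(a))}
    for j, n in enumerate(ancestors(b)):
        if n in depth_from_a:  # first common ancestor reached from b
            return depth_from_a[n] + j
    raise ValueError("nodes are not in the same tree")


def similar_pairs(target_words, general_words, parent, th):
    """Keep the pairs (wT, wG) whose tree distance is smaller than th."""
    return [(wt, wg) for wt in target_words for wg in general_words
            if wt != wg and tree_distance(wt, wg, parent) < th]


# Hypothetical semantic tree: child -> parent.
parent = {
    "yokohama_station": "station", "station": "transport_hub",
    "narita_airport": "airport", "airport": "transport_hub",
    "transport_hub": "place", "hotel": "lodging", "lodging": "place",
}
pairs = similar_pairs(["yokohama_station"], ["narita_airport", "hotel"], parent, th=5)
```

In this toy tree the station and the airport are 4 arcs apart and pass the threshold, while the station and the hotel are 5 arcs apart and are rejected.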
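【００９３】
Next, the similar word string synthesizing means 104 extracts word strings of arbitrary length from the target task language data 101 and from the general task language data 102 separately, refers to the similar word pairs (wT, wG) read from the similar word pair extracting means 103, and determines for each of the target task word strings whether it contains the general task word wG.
【００９４】
If a word string "...wG..." containing the general task word wG exists, it then determines whether the word string "...wT...", in which wG is replaced by the target task word wT, exists in the general task or target task data.
【００９５】
If the word string "...wT..." exists in neither the general task data nor the target task data, the similar word string synthesizing means 104 synthesizes the word string "...wT..." by replacing wG with wT and outputs it to the language model generating means 105.
【００９６】
Finally, the language model generating means 105 reads text data from the target task language data 101, the general task language data 102 and the similar word string synthesizing means 104, obtains word string frequencies with an appropriate weight for each input, and estimates the language model parameters by a statistical method, thereby generating the task-adapted language model.
【００９７】
Next, the language model learning procedure with task adaptation according to Embodiment 1 shown in FIG. 1 is described more concretely with reference to the flowchart of FIG. 2.
【００９８】
In FIG. 2, steps S201 to S203 are executed by the similar word pair extracting means 103, steps S204 to S208 by the similar word string synthesizing means 104, and steps S209 to S211 by the language model generating means 105.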
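【００９９】
First, the similar word pair extracting means 103 reads the training text, divided into words, from the target task language data 101 and the general task language data 102 and creates word pairs (wT, wG) (step S201).
【０１００】
It computes the distance d(wT, wG) for each combination of a word wT in the target task language data 101 and a word wG (different from wT) in the general task language data 102 (step S202).
【０１０１】
It then compares the computed distance d(wT, wG) with the predetermined threshold th and determines whether d(wT, wG) is smaller than th (step S203).
【０１０２】
If d(wT, wG) ≥ th (No) in step S203, the similar word pair extracting means 103 returns to step S202 and repeats the distance computation; if d(wT, wG) < th (Yes), it outputs that word pair (wT, wG) to the similar word string synthesizing means 104.
【０１０３】
The similar word string synthesizing means 104 reads the text data, divided into words, from the target task language data 101 and the general task language data 102, and extracts and stores all n-word word strings contained in the data (step S204).
【０１０４】
From the word strings it has read, it extracts the word strings "...wG..." that contain the general task word wG of a word pair (wT, wG) selected by the similar word pair extracting means 103 (step S205).
【０１０５】
It then determines whether the word string "...wT...", obtained by replacing the general task word wG in an extracted word string with the target task word wT, is among the stored word strings (step S206).
【０１０６】
If the word string "...wT..." is among the stored word strings (Yes) in step S206, processing returns to step S205; if it is not (No), the word string "...wT..." is output as text data (step S207).
【０１０７】
Next, it determines whether all similar word pairs (wT, wG) have been processed (step S208). If not (No), processing returns to step S202; if so (Yes), processing proceeds to step S209.
Steps S202 to S207 are thus executed for all similar word pairs (wT, wG).
【０１０８】
As a concrete example, suppose that the distance between the target task word "横浜駅" (Yokohama Station) and the general task word "成田空港" (Narita Airport) is smaller than the threshold th, and that the word strings "成田空港、まで" and "から、成田空港" exist in the general text data.
【０１０９】
If, in addition, the word string "横浜駅、まで" exists in the target text data but "から、横浜駅" does not, the similar word string synthesizing means 104 synthesizes and outputs the word string "から、横浜駅".
In this way, word similarity information is used to add to the training text data word strings that are expected to appear in the target task.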
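Steps S204 to S207 and the 横浜駅 / 成田空港 example can be sketched in Python as follows. The function names and the choice of n = 2 are assumptions for illustration; the step comments map loosely onto the flowchart steps named in the text.

```python
def ngrams(sentences, n):
    """All n-word strings contained in the tokenized sentences."""
    return {tuple(s[i:i + n]) for s in sentences for i in range(len(s) - n + 1)}


def synthesize_word_strings(target, general, pairs, n=2):
    """Replace wG by wT in stored n-grams and keep only the unseen results."""
    stored = ngrams(target, n) | ngrams(general, n)            # S204
    synthesized = []
    for w_t, w_g in pairs:
        for seq in sorted(stored):
            if w_g not in seq:                                 # S205
                continue
            new_seq = tuple(w_t if w == w_g else w for w in seq)
            if new_seq not in stored and new_seq not in synthesized:  # S206
                synthesized.append(new_seq)                    # S207
    return synthesized


target = [["横浜駅", "まで"]]
general = [["成田空港", "まで"], ["から", "成田空港"]]
new_strings = synthesize_word_strings(target, general, [("横浜駅", "成田空港")])
```

Only "から、横浜駅" is produced, because "横浜駅、まで" already exists in the target data.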
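【０１１０】
Next, in FIG. 2, the language model generating means 105 reads the weight parameter for each input from weight parameter storing means (not shown) (step S209).
【０１１１】
It reads the training text, divided into words, from the target task language data 101, the general task language data 102 and the similar word string synthesizing means 104, and obtains the word string frequencies (step S210).
For an N-gram language model, frequencies must be computed for word strings of n words or fewer.
【０１１２】
The language model generating means 105 then performs smoothing, for example Katz's back-off smoothing, and estimates the language model parameters, thereby generating the task-adapted language model (step S211), and ends the processing routine of FIG. 2.
【０１１３】
Because word strings containing words characteristic of the target task have been added to the training data, the prediction accuracy of the language model for the target task improves.
【０１１４】
Consequently, a highly accurate language model for speech recognition can be estimated from a large amount of data that includes non-target tasks (general task language data 102) and a small amount of data on the target task (target task language data 101).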
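【０１１５】
The language model obtained in this way is applicable not only to speech recognition but also to character recognition that requires language processing and to natural language text processing.
【０１１６】
The language model learning apparatus for speech recognition configured as in FIG. 1 can also be recorded on a recording medium as a program.
【０１１７】
That is, a language model learning program for speech recognition can be realized by software consisting of a similar word pair extracting function that performs the same processing as the similar word pair extracting means 103 in FIG. 1, a similar word string synthesizing function that performs the same processing as the similar word string synthesizing means 104, and a language model generating function that performs the same processing as the language model generating means 105.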
【０１１８】
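Embodiment 2.
In Embodiment 1, the text data from the target task language data 101 and the general task language data 102 is used as is, but class-converted text data may be used instead.
【０１１９】
FIG. 3 is a block diagram schematically showing a language model learning apparatus for a speech recognition apparatus according to Embodiment 2. Elements that are the same as described above (see FIG. 1) are given the same reference numerals, or the same numerals followed by "A", and are not described again.
【０１２０】
In FIG. 3, 301 is target task word classifying means, inserted between the target task language data 101 and the language model generating means 105A.
302 is general task word classifying means, inserted between the general task language data 102 and the language model generating means 105A.
【０１２１】
The characteristic function here is that the target task word classifying means 301 and the general task word classifying means 302 convert the words of the target task and general task text corpora into classes, which reduces the number of parameters to estimate and so allows accurate recognition even when only a small amount of target task data is available for training.
【０１２２】
The functions of means 301 and 302 in FIG. 3 are described concretely below in relation to the models and data.
The word class definition data (not shown) lists, for example as in FIG. 20 described above, each word w, the class c to which w belongs, and the probability P(w|c) that w is output from that class. Such word class definition data may be created manually or computed from training data.
【０１２３】
The target task word classifying means 301 follows the word class definition data, converts in turn those words of the input target task language data 101 that have a class definition into their classes, and outputs the result to the language model generating means 105A.
【０１２４】
The general task word classifying means 302 follows the word class definition data, converts in turn those words of the input general task language data 102 that have a class definition into their classes, and outputs the result to the language model generating means 105A.
【０１２５】
Next, the language model learning procedure with task adaptation according to Embodiment 2 shown in FIG. 3 is described more concretely with reference to the flowchart of FIG. 4.
【０１２６】
In FIG. 4, steps S401 to S403 are executed by the target task word classifying means 301 and the general task word classifying means 302.
【０１２７】
Steps S404 to S406 are executed by the language model generating means 105A and correspond to steps S209 to S211 described above (see FIG. 2), respectively.
【０１２８】
First, the target task word classifying means 301 and the general task word classifying means 302 each read the word class definition data (not shown) (step S401).
【０１２９】
The target task word classifying means 301 reads the target task language data 101, generates text in which each word defined in the word class definition is replaced by its class, and outputs it (step S402).
【０１３０】
Similarly, the general task word classifying means 302 reads the general task language data 102, generates text in which each word defined in the word class definition is replaced by its class, and outputs it (step S403).
【０１３１】
Next, the language model generating means 105A first reads the weight parameters from weight parameter storing means (not shown) (step S404). It then reads the training text, which consists of word strings containing classes, from the target task word classifying means 301 and the general task word classifying means 302, and accumulates the frequencies of words and word strings, multiplying each by the weight parameter given for its source (step S405).
【０１３２】
For a class N-gram language model, frequencies are computed, as before, for class strings of n words or fewer.
Finally, the language model generating means 105A smooths the computed frequencies, estimates the language model parameters, generates the task-adapted class language model (step S406), and ends the processing routine of FIG. 4.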
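Steps S402 to S405 can be sketched in Python as follows: words with a class definition are replaced by their class, and weighted frequencies are accumulated for class strings of up to n items. The class symbol `<PLACE>`, the weights and the function names are hypothetical choices for this illustration.

```python
from collections import Counter


def to_classes(sentences, word_class):
    """Replace each word that has a class definition by its class symbol."""
    return [[word_class.get(w, w) for w in words] for words in sentences]


def weighted_counts(corpora, n=2):
    """Weighted frequencies of all strings of 1..n items."""
    counts = Counter()
    for sentences, weight in corpora:
        for words in sentences:
            for size in range(1, n + 1):
                for i in range(len(words) - size + 1):
                    counts[tuple(words[i:i + size])] += weight
    return counts


word_class = {"横浜駅": "<PLACE>", "成田空港": "<PLACE>"}  # hypothetical class definition
target = to_classes([["横浜駅", "まで"]], word_class)                      # S402
general = to_classes([["成田空港", "まで"], ["から", "成田空港"]], word_class)  # S403
counts = weighted_counts([(target, 3.0), (general, 1.0)])                  # S405
```

Because both place names map to the same class, the class string ("<PLACE>", "まで") pools evidence from both corpora, which is how class conversion reduces the number of parameters to estimate.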
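【０１３３】
The above procedure, together with the predefined word class definition data (not shown), yields a class language model.
In this way, a highly accurate language model for speech recognition can be estimated from a large amount of data that includes non-target tasks (general task language data 102) and a small amount of data on the target task (target task language data 101).
【０１３４】
The language model obtained in this way is applicable not only to speech recognition but also to character recognition that requires language processing and to natural language text processing.
【０１３５】
The language model learning apparatus for speech recognition shown in FIG. 3 can also be recorded on a recording medium as a program.
【０１３６】
That is, a language model learning program for speech recognition can be realized by software consisting of a target word classifying function that performs the same processing as the target task word classifying means 301 in FIG. 3, a general word classifying function that performs the same processing as the general task word classifying means 302, and a language model generating function that performs the same processing as the language model generating means 105A.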
【０１３７】
実施の形態３．
なお、上記実施の形態２では、言語モデル生成手段１０５Ａのみを用いたが、図１（実施の形態１）と同様の類似単語対抽出手段および類似単語列合成手段を併用してもよい。
【０１３８】
図５はこの発明の実施の形態３による音声認識装置用の言語モデル学習装置を概略的に示すブロック構成図であり、前述（図１、図３参照）と同様のものについては、同一符号を付して、または、符号の後に「Ｂ」を付して詳述を省略する。
【０１３９】
この場合の特徴的な機能は、単一のクラス定義にしたがい、対象タスク単語クラス化手段３０１および一般タスク単語クラス化手段３０２を設け、単語をクラス化して言語モデルのパラメータ数を減少させるとともに、類似単語対抽出手段１０３Ｂおよび類似単語列合成手段１０４Ｂを設けることにより、言語モデル構築の際に対象タスクのデータが少量であっても高精度の認識を可能にしたことにある。
【０１４０】
次に、図６のフローチャートを参照しながら、図５に示したこの発明の実施の形態３に基づくタスク適応による言語モデルの学習手順について、さらに具体的に説明する。
【０１４１】
図６において、ステップＳ６０１〜Ｓ６０３は、前述（図４参照）のステップＳ４０１〜Ｓ４０３にそれぞれ対応しており、ステップＳ６０４〜Ｓ６１４は、前述（図２参照）のステップＳ２０１〜Ｓ２１１にそれぞれ対応している。
【０１４２】
まず、対象タスク単語クラス化手段３０１および一般タスク単語クラス化手段３０２は、それぞれ単語クラス定義データ（図示せず）を読み込む（ステップＳ６０１）。
【０１４３】
対象タスク単語クラス化手段３０１は、対象タスク言語データ１０１を読み込み、単語クラス定義で定義される単語に関して単語をクラスに置き換えたテキストを生成して出力する（ステップＳ６０２）。
【０１４４】
また、一般タスク単語クラス化手段３０２は、一般タスク言語データ１０２を読み込み、単語クラス定義で定義される単語に関して単語をクラスに置き換えたテキストを生成して出力する（ステップＳ６０３）。
【０１４５】
類似単語対抽出手段１０３Ｂは、対象タスク単語クラス化手段３０１および一般タスク単語クラス化手段３０２から、対象タスク言語データに含まれるクラスｃＴと、一般タスク言語データに含まれるクラスｃＧ（クラスｃＴとは異なる）との組み合わせからなる単語クラス対（ｃＴ，ｃＧ）のリストを作成し、これを記憶する（ステップＳ６０４）。
【０１４６】
また、類似単語対抽出手段１０３Ｂは、対象タスク言語データに含まれるクラスｃＴと、一般タスク言語データに含まれるクラスｃＧ（クラスｃＴとは異なる）とについて、単語クラス対間の距離ｄ（ｃＴ，ｃＧ）を求め（ステップＳ６０５）、あらかじめ与えられたしきい値ｔｈｃよりも小さいか否かを判定する（ステップＳ６０６）。
【０１４７】
ステップＳ６０６において、ｄ（ｃＴ，ｃＧ）≧ｔｈｃ（すなわち、Ｎｏ）と判定されればステップＳ６０５に戻り、ｄ（ｃＴ，ｃＧ）＜ｔｈｃ（すなわち、Ｙｅｓ）と判定されれば、そのときのクラス対（ｃＴ，ｃＧ）を類似単語対として類似単語列合成手段１０４Ｂに出力する（ステップＳ６０６）。
【０１４８】
類似単語列合成手段１０４Ｂは、対象タスク単語クラス化手段３０１および一般タスク単語クラス化手段３０２から、クラスに区切られた学習用テキストデータを読み込み、これを長さｎ以下のクラス列に区切って記憶する（ステップＳ６０７）。
【０１４９】
また、各単語クラス化手段３０１および３０２から読み込んだクラス列に基づき、類似単語対抽出手段１０３Ｂにより選択されたクラス対（ｃＴ，ｃＧ）のうち、一般タスクのクラスｃＧが含まれるクラス列「・・・ｃＧ・・・」を取り出す（ステップＳ６０８）。
【０１５０】
さらに、類似単語列合成手段１０４Ｂは、各単語クラス化手段３０１および３０２から読み込んで記憶したクラス列を参照し、一般タスクのクラスｃＧを対象タスクのクラスｃＴで置き換えたクラス列「・・・ｃＴ・・・」が、対象タスク言語データ１０１または一般タスク言語データ１０２に存在するか否かを判定する（ステップＳ６０９）。
【０１５１】
ステップＳ６０９において、各言語データ１０１または１０２にクラス列「・・・ｃＴ・・・」が存在する（すなわち、Ｙｅｓ）と判定されれば、ステップＳ６０８に戻り、クラス列が存在しない（すなわち、Ｎｏ）と判定されれば、そのクラス列「・・・ｃＴ・・・」を合成して、学習用テキストデータとして出力する（ステップＳ６１０）。
【０１５２】
次に、全ての類似クラス対に対して処理を終了したか否かを判定し（ステップＳ６１１）、終了していない（すなわち、Ｎｏ）と判定されればステップＳ６０５に戻り、終了した（すなわち、Ｙｅｓ）と判定されれば、言語モデル生成手段１０５Ｂによる処理ステップ（Ｓ６１２〜Ｓ６１４）に進む。
これにより、上記処理は全ての類似単語クラス対（ｃＴ，ｃＧ）に対して繰り返し実行される。
【０１５３】
言語モデル生成手段１０５Ｂは、まず、重みパラメータ保存手段（図示せず）から重みパラメータを読み込み（ステップＳ６１２）、続いて、対象タスク言語データ１０１、一般タスク言語データ１０２および類似単語列合成手段１０４Ｂから、重みパラメータにより頻度の重み付けされて単語に区切られた学習用テキストを読み込む（ステップＳ６１３）。
【０１５４】
また、頻度のスムージングを行うことにより、言語モデルのパラメータを推定し（ステップＳ６１４）、図６の処理ルーチンを終了する。
上記処理手順およびあらかじめ定義される単語クラス定義データ（図示せず）により、タスク適応化したクラス言語モデルが得られる。
【０１５５】
このように、対象以外のタスクを含む大量データと、対象タスクに関する少量データとから、音声認識のための高精度の言語モデルを学習することができる。
【０１５６】
なお、こうして得られる言語モデルは、音声認識のみならず、言語処理を必要とする文字認識、自然言語によるテキスト処理などにも適用可能である。
【０１５７】
また、図５に示した音声認識用の言語モデル学習装置は、プログラムとして記録媒体に記録することもできる。
【０１５８】
すなわち、図５内の対象タスク単語クラス化手段３０１と同様の処理を行う対象単語クラス化機能と、一般タスク単語クラス化手段３０２と同様の処理を行う一般単語クラス化機能と、類似単語対抽出手段１０３Ｂと同様の処理を行う類似単語対抽出機能と、類似単語列合成手段１０４Ｂと同様の処理を行う類似単語列合成機能と、言語モデル生成手段１０５Ｂと同様の処理を行う言語モデル生成機能とから構成されるソフトウェアにより、音声認識用の言語モデル学習プログラムを実現することができる。
【０１５９】
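Embodiment 4.
In Embodiments 1 to 3, the language model generating means 105, 105A or 105B is used to generate the task-adapted language model, but an initial language model created in advance and similar word probability correcting means that smooths word probabilities may be used instead.
【０１６０】
FIG. 7 is a block diagram schematically showing a language model learning apparatus for a speech recognition apparatus according to Embodiment 4. Elements that are the same as described above (see FIG. 1) are given the same reference numerals and are not described again.
【０１６１】
In FIG. 7, 701 is an initial language model and 702 is similar word probability correcting means.
The similar word probability correcting means 702 generates a task-adapted statistical language model from the similar word pairs supplied by the similar word pair extracting means 103 and the prior language model supplied by the initial language model 701.
【０１６２】
The characteristic function here is that the similar word pair extracting means 103 and the similar word probability correcting means 702 reflect, for words specific to the target task, the properties of similar words that appear in the general task text data, so that accurate recognition is possible even when only a small amount of target task data is available for building the statistical language model.
【０１６３】
The functions of the means in FIG. 7 are described concretely below in relation to the models and data.
The initial language model 701 is a statistical language model whose parameters have been estimated by a known conventional method or by a method such as that of Embodiment 1.
【０１６４】
The similar word probability correcting means 702 reads the initial language model 701 and, from the similar word pair extracting means 103, the similar word pairs between the target task and the general task, and corrects the conditional probabilities of word strings that contain a target task word.
This correction uses the conditional probabilities of word strings that contain the similar general task word.
【０１６５】
The probability assigned by the similar word probability correcting means 702 is part of the probability that was removed (discounted) from the conditional probabilities of observed word strings to serve as the probability of word strings unseen in the training text data. The conditional probabilities of word strings that exist in the training text data therefore remain equal to those of the initial language model 701.
【０１６６】
Next, the language model learning procedure with task adaptation according to Embodiment 4 shown in FIG. 7 is described more concretely with reference to the flowchart of FIG. 8.
【０１６７】
In FIG. 8, steps S801 to S803 and S805 correspond to steps S201 to S203 and S208 described above (see FIG. 2), respectively.
Steps S806 to S812 are executed by the similar word probability correcting means 702.
【０１６８】
First, the similar word pair extracting means 103 reads the training text, divided into words, from the target task language data 101 and the general task language data 102 (step S801), and obtains the distance d(wT, wG) between a word wT in the target task language data and a word wG (different from wT) in the general task language data (step S802).
【０１６９】
It then determines whether the distance d(wT, wG) is smaller than the threshold th (step S803). If d(wT, wG) ≥ th (No), processing returns to step S802; if d(wT, wG) < th (Yes), that word pair (wT, wG) is added to the similar word pairs (step S804).
【０１７０】
It then determines whether this computation has been completed for all word pairs (step S805). If not (No), processing returns to step S802; if so (Yes), processing proceeds to step S806.
The computation is thus carried out for all word pairs in turn, and the resulting list of similar word pairs (wT, wG) is output to the similar word probability correcting means 702.
【０１７１】
The similar word probability correcting means 702 first reads the initial language model 701 (step S806). Then, for a similar word pair (wT, wG) read from the similar word pair extracting means 103, it extracts, from the conditional probabilities defined in the initial language model 701, the conditional probabilities P_wG(w_n | w_1, ..., w_{n-1}) that contain the general task word wG (step S807).
【０１７２】
Next, for each extracted conditional probability, it determines whether the conditional probability P_wT(w_n | w_1, ..., w_{n-1}), in which the general task word wG is replaced by the target task word wT, is defined in the initial language model 701 (step S808).
【０１７３】
If the conditional probability P_wT(w_n | w_1, ..., w_{n-1}) is not defined in the initial language model 701 (No) in step S808, it corrects the conditional probability by assigning part of the probability that was removed for unknown word strings (step S809) and proceeds to the next determination step S810.
【０１７４】
If, on the other hand, the conditional probability P_wG is defined and the conditional probability P_wT is determined to be defined (Yes) in step S808, processing proceeds directly to the next determination step S810.
【０１７５】
The probability assigned in step S809 is, for example, the minimum of the conditional probabilities that have the same word history (w_1, ..., w_{n-1}).
【０１７６】
Next, it determines whether another conditional probability of a word string containing the general word wG exists (step S810). If one exists (Yes), processing returns to step S808.
【０１７７】
If no other conditional probability containing the general word wG exists (No) in step S810, it determines whether the above processing has been completed for all word pairs (wT, wG) (step S811).
【０１７８】
If not all word pairs have been processed (No) in step S811, processing returns to step S807; if they have (Yes), processing proceeds to step S812.
【０１７９】
The above processing is thus executed for every word string containing a general word wG and for every word pair (wT, wG) containing a general word wG.
Finally, the total probability removed from the language model for unknown word strings is normalized so that the probabilities of the language model sum to 1 (step S812), and the processing routine of FIG. 8 ends.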
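The correction in steps S807 to S812 can be sketched in Python under a simplified, assumed model layout: each history maps to explicitly defined conditional probabilities, and whatever is left up to 1 is the discounted mass reserved for unseen word strings. The layout, the cap on the assigned value and the function names are assumptions of this sketch, not details given in the text.

```python
def correct_similar_word_probs(model, pairs):
    """Assign probabilities to wT entries that are missing but whose wG twin exists.

    model: {history (tuple of words): {word: probability}} with each history's
    explicit probabilities summing to less than 1 (assumed layout).
    pairs: list of (wT, wG) similar word pairs.
    """
    corrected = {hist: dict(dist) for hist, dist in model.items()}
    for w_t, w_g in pairs:
        for hist, dist in model.items():
            for word in dist:
                if word != w_g and w_g not in hist:
                    continue  # S807: only entries that contain wG
                new_hist = tuple(w_t if w == w_g else w for w in hist)
                new_word = w_t if word == w_g else word
                target = corrected.get(new_hist)
                if not target or new_word in target:
                    continue  # S808: already defined, or no same-history entry
                reserved = max(0.0, 1.0 - sum(target.values()))
                # S809: smallest same-history probability, capped by the reserved mass
                target[new_word] = min(min(target.values()), reserved)
    return corrected


def backoff_mass(model, hist):
    """S812: mass left for unseen word strings so that the history sums to 1."""
    return max(0.0, 1.0 - sum(model[hist].values()))


initial = {("から",): {"成田空港": 0.5, "東京": 0.2}}  # hypothetical initial model
adapted = correct_similar_word_probs(initial, [("横浜駅", "成田空港")])
remaining = backoff_mass(adapted, ("から",))
```

The probabilities of word strings already present in the initial model are left unchanged, and only the reserved mass shrinks, from about 0.3 to about 0.1 in this toy case.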
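【０１８０】
When a conditional probability is not defined, the probability given by a simpler language model is normally used.
For example, in an N-gram language model with Katz back-off, the lower-order (N-1)-gram language model is consulted and a small probability is assigned. Because the accuracy of this probability is low, when a word string containing a similar word of the target task exists, a probability larger than the actual one ends up being estimated.
【０１８１】
The other conditional probabilities P_wG containing the general word wG are processed in the same way through step S810, and the processing of steps S806 to S810 is executed for all similar word pairs (wG, wT) through step S811.
【０１８２】
By using the similar word probability correcting means 702 in this way, words with similar properties in the general task and the target task are smoothed using the probabilities of the general task word, so a more accurate model for speech recognition can be estimated.
【０１８３】
The language model obtained in this way is, as before, also applicable to character recognition that requires language processing, to text processing and the like.
【０１８４】
The language model learning apparatus for speech recognition shown in FIG. 7 can also be recorded on a recording medium as a program.
That is, a language model learning program for speech recognition can be realized by software consisting of a similar word pair extracting function that performs the same processing as the similar word pair extracting means 103 in FIG. 7 and a similar word probability correcting function that performs the same processing as the similar word probability correcting means 702.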
【０１８５】
実施の形態５．
なお、上記実施の形態４では、対象タスク言語データ１０１および一般タスク言語データ１０２からの各テキストデータをそのまま用いたが、上記実施の形態３（図５参照）のようにクラス化されたテキストデータを用いてもよい。
【０１８６】
図９はこの発明の実施の形態５による音声認識装置用の言語モデル学習装置を概略的に示すブロック構成図であり、前述（図５、図７参照）と同様のものについては、同一符号を付して詳述を省略する。
【０１８７】
図９において、９０１は初期クラス言語モデルであり、前述（図７参照）の初期言語モデル７０１に代えて、類似単語確率補正手段７０２に接続されている。
【０１８８】
この場合の特徴的な機能は、類似単語対抽出手段１０３Ｂ、対象タスク単語クラス化手段３０１、一般タスク単語クラス化手段３０２および類似単語確率補正手段７０２を設け、対象タスクに特有のクラスに対して一般タスクのテキストデータに出現する類似クラスの性質を反映させることにより、対象タスクのデータが少量であっても、初期クラス言語モデル９０１から、さらに認識精度を高めたクラス言語モデルを生成することにある。
【０１８９】
以下、図９内の各手段の機能について、各種モデルおよび各種データと関連させながら具体的に説明する。
初期クラス言語モデル９０１は、周知の従来方法や上記実施の形態２、３などの方法によりパラメータ推定された統計的クラス言語モデルからなる。
【０１９０】
類似単語確率補正手段７０２により割り当てられる確率は、学習テキストデータで未出現の単語クラス列のために出現した単語クラス列の条件付き確率から除いた（ディスカウントした）確率の一部であり、学習用テキストデータに含まれる単語クラスの条件付き出現確率が保存される。
【０１９１】
たとえば、単語クラスに関する条件付き確率Ｐ（ｃ_n｜ｃ₁，・・・，ｃ_n-1）を変えた場合、単語クラス列の元の条件付き確率よりも大きくなるように確率が割り当てられる。
【０１９２】
次に、図１０のフローチャートを参照しながら、図９に示したこの発明の実施の形態５に基づくタスク適応による言語モデルの学習手順について、さらに具体的に説明する。
【０１９３】
図１０において、ステップＳ１００１〜Ｓ１００３は、前述（図６参照）のステップＳ６０１〜Ｓ６０３にそれぞれ対応しており、ステップＳ１００４〜Ｓ１０１５は、前述（図８参照）のステップＳ８０１〜Ｓ８１２にそれぞれ対応している。
【０１９４】
まず、対象タスク単語クラス化手段３０１および一般タスク単語クラス化手段３０２は、それぞれ単語クラス定義データ（図示せず）を読み込む（ステップＳ１００１）。
【０１９５】
対象タスク単語クラス化手段３０１は、対象タスク言語データ１０１を読み込み、単語クラス定義で定義される単語に関して単語をクラスに置き換えたテキストを生成して出力する（ステップＳ１００２）。
【０１９６】
また、一般タスク単語クラス化手段３０２は、一般タスク言語データ１０２を読み込み、単語クラス定義で定義される単語に関して単語をクラスに置き換えたテキストを生成して出力する（ステップＳ１００３）。
【０１９７】
次に、類似単語対抽出手段１０３Ｂは、対象タスク単語クラス化手段３０１および一般タスク単語クラス化手段３０２を通して、それぞれクラス列を読み込む（ステップＳ１００４）。
【０１９８】
また、対象タスク言語データに含まれるクラスｃＴと一般タスク言語データに含まれるクラスｃＧ（ｃＴとは異なる）とについて、距離ｄ（ｃＴ，ｃＧ）を求め（ステップＳ１００５）、クラス間の距離ｄ（ｃＴ，ｃＧ）がしきい値ｔｈｃよりも小さいか否かを判定する（ステップＳ１００６）。
【０１９９】
ステップＳ１００６において、ｄ（ｃＴ，ｃＧ）≧ｔｈｃ（すなわち、Ｎｏ）と判定されればステップＳ１００５に戻り、ｄ（ｃＴ，ｃＧ）＜ｔｈｃ（すなわち、Ｙｅｓ）と判定されれば、そのときのクラス対（ｃＴ，ｃＧ）を類似クラス対に追加する（ステップＳ１００７）。
【０２００】
以下、判定ステップＳ１００８を介して、上記処理を順次全てのクラス対について実行し、作成された類似クラス対（ｃＴ，ｃＧ）の一覧を類似単語確率補正手段７０２に出力する。
【０２０１】
次に、類似単語確率補正手段７０２は、まず、初期クラス言語モデル９０１を読み込み（ステップＳ１００９）、続いて、類似単語対抽出手段１０３Ｂから類似クラス対（ｃＴ，ｃＧ）を順次読み出す（ステップＳ１０１０）。
【０２０２】
また、初期クラス言語モデル９０１内に定義された条件付き確率のうち、一般タスクのクラスｃＧを含む条件付き確率ＰｃＧ（ｃ_n｜ｃ₁，・・・，ｃ_n-1）のそれぞれについて、一般タスククラスｃＧを対象タスククラスｃＴで置き換えた条件付き確率ＰｃＴ（ｃ_n｜ｃ₁，・・・ｃ_n-1）が学習データ内で定義されているか否かを判定する（ステップＳ１０１１）。
【０２０３】
ステップＳ１０１１において、条件付き確率ＰｃＴ（ｃ_n｜ｃ₁，・・・，ｃ_n-1）が初期クラス言語モデル９０１で定義されていない（すなわち、Ｎｏ）と判定されれば、未知のクラス列のために除いた確率から一部を割り当てて、条件付き確率を補正し（ステップＳ１０１２）、次の判定ステップＳ１０１３に進む。
【０２０４】
一方、条件付き確率ＰｃＧが定義されており、ステップＳ１０１１において、条件付き確率ＰｃＴが定義されている（すなわち、Ｙｅｓ）と判定されれば、直ちに次の判定ステップＳ１０１３に進む。
【０２０５】
このとき、ステップＳ１０１２において補正した確率は、たとえば、同一のクラス履歴（ｃ₁，・・・，ｃ_n-1）である条件付き確率のうちの最小値とする（ステップＳ１０１２）。
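【０２０６】
Thereafter, the other conditional probabilities P_cG containing the class cG are processed in the same way through step S1013. The processing of steps S1009 to S1013 is executed for all similar class pairs (cG, cT) through step S1014.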
【０２０７】
最後に、類似単語確率補正手段７０２は、クラス言語モデルの確率の和が１となるようにバックオフ確率を正規化して、タスク適応化済みクラス言語モデルを生成し（ステップＳ１０１５）、図１０の処理ルーチンを終了する。
【０２０８】
このように、各単語クラス化手段３０１および３０２とともに、類似単語対抽出手段１０３Ｂおよび類似単語確率補正手段７０２を設け、一般タスクと対象タスクとの間で性質が類似する単語クラスについて、一般タスクの単語クラスの出現確率を用いたスムージングを行うことにより、音声認識用のクラス言語モデルを高精度に推定することができる。
【０２０９】
なお、こうして得られるクラス言語モデルは、言語処理を必要とする文字認識や、自然言語のテキスト処理などにも適用可能である。
【０２１０】
また、図９に示した音声認識用言語モデル学習装置は、プログラムとして記録媒体に記録することもできる。
【０２１１】
すなわち、図９内の類似単語対抽出手段１０３Ｂと同様の処理を行う類似単語対抽出機能と、対象タスク単語クラス化手段３０１と同様の処理を行う対象タスク単語クラス化機能と、一般タスク単語クラス化手段３０２と同様の処理を行う一般タスク単語クラス化機能と、類似単語確率補正手段７０２と同様の処理を行う類似単語確率補正機能とから構成されるソフトウェアにより、音声認識用の言語モデル学習プログラムを実現することができる。
【０２１２】
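Embodiment 6.
Embodiment 1 does not describe the functional configuration of the similar word pair extracting means concretely; it may be configured, for example, as in FIG. 11.
【０２１３】
FIG. 11 is a functional block diagram showing a concrete configuration example of similar word pair extracting means 103C used in a language model learning apparatus for speech recognition according to Embodiment 6. Elements that are the same as described above are given the same reference numerals, or the same numerals followed by "C", and are not described again.
【０２１４】
In FIG. 11, 1101 is statistical inter-word distance calculating means, 1102 is threshold determining means, and 1105 is distance-calculation language model generating means within the similar word pair extracting means 103C.
【０２１５】
The characteristic function here is that the distance-calculation language model generating means 1105, the statistical inter-word distance calculating means 1101 and the threshold determining means 1102 are provided in the similar word pair extracting means 103C, and word pairs are selected by computing the inter-word distance d(wT, wG) between a target task word wT and a general task word wG with a statistical distance measure derived from the language data, so that similar word pairs are determined accurately.
【０２１６】
The functions of the means in FIG. 11 are described concretely below in relation to the models and data.
In the similar word pair extracting means 103C, the statistical inter-word distance calculating means 1101 takes the language model estimated by the distance-calculation language model generating means 1105, obtains the inter-word distance based on that language model for each pair of different words extracted from the target task language data 101 and the general task language data 102, and outputs the word pairs and their distances.
【０２１７】
The threshold determining means 1102 reads the word pairs and statistical inter-word distances in turn from the statistical inter-word distance calculating means 1101 and outputs a word pair (wT, wG) when its distance is at or below a fixed threshold.
【０２１８】
As the statistical inter-word distance for a target task word wT and a general task word wG, the statistical inter-word distance calculating means 1101 uses, for example, the Euclidean distance over the conditional probabilities of the N-gram language model, and obtains the statistical inter-word distance D_1(wT, wG) as in equation (7) below.
【０２１９】
【Equation 7】

【０２２０】
In equation (7), V is the population of vocabulary items x of the language data (words), that is, all the vocabulary contained in the language model.
【０２２１】
The statistical inter-word distance calculating means 1101 can also use the Euclidean distance over the conditional probabilities of the preceding word with respect to the following word, and obtain the statistical inter-word distance D_2(wT, wG) as in equation (8) below.
【０２２２】
【Equation 8】

【０２２３】
Besides using equations (7) and (8) individually, the sum of equations (7) and (8) can also be used.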
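Since equations (7) and (8) are not reproduced in this text, the following Python sketch only illustrates one plausible reading: a Euclidean distance between the conditional probability profiles of two words, summed over the vocabulary V. The exact form, the direction of the conditional probabilities, the bigram table and all names are assumptions of this example.

```python
import math


def euclidean_distance(w_t, w_g, vocab, prob):
    """Euclidean distance between the profiles prob(x, w_t) and prob(x, w_g), x in vocab."""
    return math.sqrt(sum((prob(x, w_t) - prob(x, w_g)) ** 2 for x in vocab))


# Hypothetical bigram table keyed (previous word, next word) -> P(next | previous).
bigram = {
    ("成田空港", "まで"): 0.8, ("成田空港", "から"): 0.2,
    ("横浜駅", "まで"): 0.8, ("横浜駅", "から"): 0.2,
    ("ホテル", "を"): 1.0,
}
vocab = ["まで", "から", "を"]


def p_next(x, w):
    """Assumed reading for D_1: probability that x follows w."""
    return bigram.get((w, x), 0.0)


def p_prev(x, w):
    """Assumed reading for D_2: probability that w follows x."""
    return bigram.get((x, w), 0.0)


d_near = euclidean_distance("横浜駅", "成田空港", vocab, p_next)
d_far = euclidean_distance("横浜駅", "ホテル", vocab, p_next)
d_sum = d_near + euclidean_distance("横浜駅", "成田空港", vocab, p_prev)
```

Words that are followed by the same words with the same probabilities get distance 0, so the station and the airport would pass any positive threshold while the station and the hotel would not pass a small one.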
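【０２２４】
The statistical inter-word distance calculating means 1101 can also use, for example, the cross entropy with respect to the word wT and obtain the statistical inter-word distance D_3(wT, wG) as in equation (9) below.
【０２２５】
【Equation 9】

【０２２６】
As with the Euclidean distance, the conditional probabilities of the preceding word with respect to the following word can also be used, as in equation (10) below.
【０２２７】
【Equation 10】

【０２２８】
Besides using equations (9) and (10) individually, the sum of equations (9) and (10) can also be used.
【０２２９】
Furthermore, these statistical measures can be combined with linguistic information.
For example, when words represent morphemes and the parts of speech of two words differ, the distance can be set to infinity so that the pair is excluded from the similar word candidates.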
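Equations (9) and (10) are likewise not reproduced here, so the sketch below assumes the common cross-entropy form, the negative sum over x of P(x|wT) log P(x|wG), and adds the part-of-speech constraint described above. The probability floor, the part-of-speech labels and the bigram table are hypothetical.

```python
import math


def cross_entropy_distance(w_t, w_g, vocab, prob, pos=None, floor=1e-10):
    """Cross entropy of w_g's profile relative to w_t's; inf if the POS tags differ."""
    if pos is not None and pos.get(w_t) != pos.get(w_g):
        return math.inf  # different parts of speech: never a similar pair
    return -sum(prob(x, w_t) * math.log(max(prob(x, w_g), floor))
                for x in vocab if prob(x, w_t) > 0)


# Hypothetical bigram table keyed (previous word, next word) -> P(next | previous).
bigram = {
    ("成田空港", "まで"): 0.8, ("成田空港", "から"): 0.2,
    ("横浜駅", "まで"): 0.8, ("横浜駅", "から"): 0.2,
    ("ホテル", "を"): 1.0,
}
vocab = ["まで", "から", "を"]
pos = {"横浜駅": "noun", "成田空港": "noun", "ホテル": "noun", "お願い": "verb"}


def p_next(x, w):
    """Probability that x follows w (assumed direction)."""
    return bigram.get((w, x), 0.0)


d_near = cross_entropy_distance("横浜駅", "成田空港", vocab, p_next, pos)
d_far = cross_entropy_distance("横浜駅", "ホテル", vocab, p_next, pos)
d_pos = cross_entropy_distance("横浜駅", "お願い", vocab, p_next, pos)
```

Unlike the Euclidean distance, this measure is not zero for identical profiles (it equals their entropy) and is not symmetric, so the threshold has to be chosen with that in mind.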
【０２３０】
次に、図１２のフローチャートを参照しながら、図１１に示したこの発明の実施の形態６に基づくタスク適応における類似単語対抽出手段１０３Ｃの動作について、さらに具体的に説明する。
図１２において、ステップＳ１２０３〜Ｓ１２０７は、前述（図２参照）のステップＳ２０１〜Ｓ２０３、Ｓ２０７およびＳ２０８にそれぞれ対応している。
【０２３１】
まず、距離算出用言語モデル生成手段１１０５は、対象タスク言語データ１０１および一般タスク言語データ１０２を読み込み（ステップＳ１２０１）、入力されたテキストデータから、言語モデルのパラメータ推定を行う（ステップＳ１２０２）。
【０２３２】
また、統計的単語間距離算出手段１１０１は、対象タスクに含まれる単語ｗＴと、一般タスクに含まれる単語ｗＧとの任意の組み合わせからなる単語対（ｗＴ，ｗＧ）を作成し（ステップＳ１２０３）、距離算出用言語モデル生成手段１１０５により推定される言語モデル上で統計的距離ｄ（ｗＴ，ｗＧ）を計算する（ステップＳ１２０４）。
【０２３３】
続いて、しきい値判定手段１１０２は、統計的単語間距離算出手段１１０１から得られた単語対（ｗＴ，ｗＧ）の距離ｄ（ｗＴ，ｗＧ）をしきい値ｔｈと比較し、距離ｄ（ｗＴ，ｗＧ）がしきい値ｔｈ未満であるか否かを判定する（ステップＳ１２０５）。
【０２３４】
ステップＳ１２０５において、ｄ（ｗＴ，ｗＧ）≧ｔｈ（すなわち、Ｎｏ）と判定されればステップＳ１２０４に戻り、ｄ（ｗＴ，ｗＧ）＜ｔｈ（すなわち、Ｙｅｓ）と判定されれば、そのときの単語対（ｗＴ，ｗＧ）を類似単語対として出力する（ステップＳ１２０６）。
【０２３５】
以下、終了判定ステップＳ１２０７を介して、以上の処理を全ての単語対（ｗＴ，ｗＧ）について行う。
【０２３６】
このように、類似単語対抽出手段１０３Ｃにおいて、言語モデルを推定して統計量に基づいた距離尺度を利用することにより、高精度の類似単語対を判定することができる。
【０２３７】
なお、こうして得られる言語モデルは、言語処理を必要とする文字認識や、自然言語のテキスト処理などにも適用可能である。
また、図１１内の類似単語対抽出手段１０３Ｃの機能をプログラムとして記録媒体に記録することもできる。
【０２３８】
すなわち、図１１内の距離算出用言語モデル生成手段１１０５と同様の処理を行う言語モデル生成機能と、統計的単語間距離算出手段１１０１と同様の処理を行う統計的単語間距離算出機能と、しきい値判定手段１１０２と同様の処理を行うしきい値判定機能とから構成されるソフトウェアにより、音声認識用の言語モデル学習装置の類似単語対抽出プログラムを実現することができる。
【０２３９】
また、図１１においては、距離算出用言語モデル生成手段１１０５を用いたが、図１３のように、距離算出用言語モデル１３０１を用いてもよい。
図１３において、類似単語対抽出手段１０３Ｄ内の距離算出用言語モデル１３０１は、前述（図７参照）の初期言語モデル７０１と同様のものであり、事前に作成されている。
【０２４０】
また、ここでは、類似単語対抽出手段１０３Ｃへの入力データを単語としているが、単語の代わりに、図１４のように単語クラスを用いてもよい。
図１４において、類似単語対抽出手段１０３Ｅ内の距離算出用言語モデル生成手段１１０５Ｅおよび統計的単語間距離算出手段１１０１Ｅは、各単語クラス化手段３０１および３０２から単語クラスを取り込んでいる。
この場合も、前述と同様に、クラス対を抽出することができる。
【０２４１】
さらに、図１４においては、距離算出用言語モデル生成手段１１０５Ｅを用いているが、図１５のように、距離算出用クラス言語モデル１５０１を用いてもよい。
図１５において、類似単語対抽出手段１０３Ｆ内の距離算出用クラス言語モデル１５０１は、前述（図９参照）の初期クラス言語モデル９０１と同様のものであり、事前に作成されている。
【０２４２】
Embodiment 7.
なお、上記実施の形態１〜６では、言語モデル学習装置のみに注目し、音声認識装置について具体的に言及しなかったが、たとえば、音声認識装置を図１６のように構成してもよい。
【０２４３】
図１６はこの発明の実施の形態７による言語モデルを用いた音声認識装置を概略的に示すブロック構成図であり、従来方法または上記実施の形態１、４、６などで述べた方法により生成される言語モデルを用いた場合を示している。
【０２４４】
図１６において、１６０１は音響特徴抽出手段、１６０２は音響モデル、１６０３は音響照合手段、１６０４は単語辞書、１６０５は言語モデル、１６０６は言語照合手段である。
【０２４５】
言語モデル１６０５は、上記実施の形態１、４、６で述べた言語モデル学習装置および方法を用いて構築されたものである。
この場合の特徴的な機能は、各手段１６０１〜１６０４とともに、言語モデル１６０５を用いた言語照合手段１６０６を設け、対象タスクのデータが少量の場合であっても高精度の音声認識を可能としたことにある。
【０２４６】
以下、図１６内の各手段の機能について、各種モデルおよび各種データと関連させながら具体的に説明する。
まず、音響特徴抽出手段１６０１は、入力された音声波形をＡ／Ｄ変換するとともに、分析時間フレーム毎に取り出して、メルケプストラムなどの音声特徴を良好に表すパラメータのベクトルに変換する。
【０２４７】
音響モデル１６０２は、たとえばＨＭＭを用いて、音声の認識単位（音素や単語など）内の音響特徴ベクトルの性質を確率分布や状態推移などによって表すものである。
【０２４８】
音響照合手段１６０３は、音響特徴抽出手段１６０１から得られる音素の音響特徴ベクトルと、音響モデル１６０２とを照合し、照合の度合いを表すスコアを出力する。
【０２４９】
単語辞書１６０４は、音響モデル１６０２の並びと、言語的な単位である単語との対応を記述するものである。
言語モデル１６０５は、言語モデル学習装置から得られ、認識対象とする単語の接続情報を記述するものであり、たとえば、単語Ｎグラム言語モデルを用いて単語間の遷移を（ｎ−１）重マルコフ過程で表現する。
【０２５０】
言語照合手段１６０６は、音響照合手段１６０３から音響特徴量と音響モデルとの照合スコアを受け取り、単語辞書１６０４および言語モデル１６０５を参照して、認識対象となる単語列のうち、最もスコアが高いものを認識結果とする処理を行う。
【０２５１】
次に、図１７のフローチャートを参照しながら、図１６に示したこの発明の実施の形態７に基づく音声認識の手順について、さらに具体的に説明する。
まず、図１６に示す音声認識装置は、あらかじめ準備した音響モデル１６０２および単語辞書１６０４とともに、上記実施の形態１、４、６（図１、図２、図７、図８、図１１〜図１３参照）により生成された言語モデル１６０５を読み込む（ステップＳ１７０１）。
【０２５２】
音響特徴抽出手段１６０１は、認識対象である入力音声をＡ／Ｄし、ある時間区間を区切った音声フレームを読み込み（ステップＳ１７０２）、対象とする音声フレームについて信号処理手法を用い、メルケプストラムなどの音声特徴を良好に表す音響特徴ベクトルを抽出する（ステップＳ１７０３）。
【０２５３】
続いて、音響照合手段１６０３は、ステップＳ１７０３で得られた音響特徴ベクトルを音響モデル１６０２と照合して、音響照合スコアを求める（ステップＳ１７０４）。
【０２５４】
次に、言語照合手段１６０６は、単語辞書１６０４および言語モデル１６０５を参照して、認識対象となる単語について、音響照合スコアを累積していく（ステップＳ１７０５）。
【０２５５】
言語照合手段１６０６は、上記照合処理を各フレーム毎に実行しながら、対象音声の最終フレームに到達したか否かを判定し（ステップＳ１７０６）、対象音声の最終フレームに到達していない（すなわち、Ｎｏ）と判定されればステップＳ１７０２に戻る。
【０２５６】
また、ステップＳ１７０６において、対象音声の最終フレームに到達した（すなわち、Ｙｅｓ）と判定されれば、照合が終了したものと見なし、この時点で最も良いスコアとなっているものを認識結果として出力し（ステップＳ１７０７）、図１７の処理ルーチンを終了する。
【０２５７】
このように、言語モデル１６０５を用いることにより、対象以外のタスクを含む大量データと、対象タスクに関する少量データとから、高精度の言語モデルが構築されるので、高精度の音声認識を実現することができる。
【０２５８】
実施の形態８
なお、上記実施の形態７では、上記実施の形態１、４、６により生成された言語モデルを用いたが、上記実施の形態２、３、５、６により生成されたクラス言語モデルを用いてもよい。
【０２５９】
図１８はこの発明の実施の形態８による言語モデルを用いた音声認識装置を概略的に示すブロック構成図であり、上記実施の形態２、３、５、６で述べた装置および方法により生成される言語モデルを用いた場合を示している。
【０２６０】
図１８において、各手段１６０１〜１６０４は前述（図１６参照）と同様のものであり、言語照合手段１６０６Ａは前述の言語照合手段１６０６に対応している。
１８０１は言語モデル内のクラスと単語との対応関係を表すクラス定義、１８０２はクラスの出現確率を与えるクラス言語モデルである。
【０２６１】
クラス言語モデル１８０２は、上記実施の形態２、３、５、６（図３〜図６、図９、図１０、図１４、図１５参照）で述べた装置および方法を用いて構築したものである。
【０２６２】
この場合の特徴的な機能は、クラス言語モデル１８０２を用いた言語照合手段１６０６Ａを設けることにより、学習に用いた対象タスクのデータが少量の場合であっても高精度の音声認識を可能にしたことにある。
【０２６３】
次に、図１９のフローチャートを参照しながら、図１８に示したこの発明の実施の形態８に基づく音声認識の手順について、さらに具体的に説明する。
図１９において、ステップＳ１９０１〜Ｓ１９０７は、前述（図１７参照）のステップＳ１７０１〜Ｓ１７０７にそれぞれ対応している。
【０２６４】
まず、あらかじめ準備した音響モデル１６０２、単語辞書１６０４およびクラス定義１８０１とともに、上記実施の形態２、３、５、６により生成されたクラス言語モデル１８０２を読み込む（ステップＳ１９０１）。
【０２６５】
音響特徴抽出手段１６０１は、認識対象である入力音声をＡ／Ｄし、ある時間区間を区切った音声フレームを読み込み（ステップＳ１９０２）、対象とする音声フレームについて信号処理手法を用い、メルケプストラムなどの音声特徴を良好に表す音響特徴ベクトルを抽出する（ステップＳ１９０３）。
【０２６６】
続いて、音響照合手段１６０３は、得られた音響特徴ベクトルを音響モデル１６０２と照合して、音響照合スコアを求める（ステップＳ１９０４）。
【０２６７】
次に、言語照合手段１６０６Ａは、単語辞書１６０４、クラス定義１８０１およびクラス言語モデル１８０２を参照して、認識対象となる単語について、音響照合スコアを累積していく（ステップＳ１９０５）。
【０２６８】
以下、ステップＳ１９０６を介して上記照合処理を各フレーム毎に実行していき、対象音声の最終フレームに到達して照合が終了した時点で、最も良いスコアとなっているものを認識結果として出力し（ステップＳ１９０７）、図１９の処理ルーチンを終了する。
【０２６９】
このように、クラス言語モデル１８０２を用いることにより、対象以外のタスクを含む大量データと対象タスクに関する少量データとから、高精度の音声認識を実現することができる。
【０２７０】
【発明の効果】
以上のように、この発明の請求項１によれば、対象タスクのテキストデータを集積した対象タスク言語データと、対象タスク以外のタスクを含む一般タスクのテキストデータを集積した一般タスク言語データと、対象タスク言語データおよび一般タスク言語データから、それぞれ言語モデル学習用のテキストデータを読み込み、タスク適応化済み言語モデルを構築するための、類似単語対抽出手段、類似単語列合成手段および言語モデル生成手段とを備え、類似単語対抽出手段は、対象タスク言語データおよび一般タスク言語データから各テキストデータを読み込み、対象タスクのテキストデータに含まれる単語と一般タスクのテキストデータに含まれる単語との組み合わせから類似単語対を抽出し、類似単語列合成手段は、各テキストデータを読み込むとともに、類似単語対抽出手段から類似単語対を読み込み、言語データに含まれない対象タスク内の単語を含む単語列を合成して出力し、言語モデル生成手段は、各テキストデータを読み込むとともに、類似単語列合成手段から単語列を読み込み、各テキストデータ毎に重み付けて単語列の統計量を求めることにより、タスク適応化済み言語モデルを生成するようにしたので、認識精度を高めた言語モデル学習装置が得られる効果がある。
【０２７１】
また、この発明の請求項２によれば、対象タスクのテキストデータを集積した対象タスク言語データと、対象タスク以外のタスクを含む一般タスクのテキストデータを集積した一般タスク言語データと、対象タスク言語データおよび一般タスク言語データからタスク適応化済み言語モデルを構築するための、対象タスク単語クラス化手段、一般タスク単語クラス化手段および言語モデル生成手段とを備え、対象タスク単語クラス化手段は、対象タスク言語データから対象タスクのテキストデータを読み込み、クラス定義に示されたクラスに単語を置き換えて、言語モデル学習用のクラス化された第１のテキストデータを出力し、一般タスク単語クラス化手段は、一般タスク言語データから一般タスクのテキストデータを読み込み、クラス定義に示されたクラスに単語を置き換えて、言語モデル学習用のクラス化された第２のテキストデータを出力し、言語モデル生成手段は、第１および第２のテキストデータを読み込み、各テキストデータ毎に重み付けて単語列の統計量を求めることにより、言語モデルを生成するようにしたので、認識精度を高めた言語モデル学習装置が得られる効果がある。
【０２７２】
また、この発明の請求項３によれば、対象タスクのテキストデータを集積した対象タスク言語データと、対象タスク以外のタスクを含む一般タスクのテキストデータを集積した一般タスク言語データと、対象タスク言語データおよび一般タスク言語データからタスク適応化済み言語モデルを構築するための、対象タスク単語クラス化手段、一般タスク単語クラス化手段、類似単語対抽出手段、類似単語列合成手段および言語モデル生成手段とを備え、対象タスク単語クラス化手段は、対象タスク言語データから対象タスクのテキストデータを読み込み、クラス定義に示されたクラスに単語を置き換えて、言語モデル学習用のクラス化された第１のテキストデータを出力し、一般タスク単語クラス化手段は、一般タスク言語データから一般タスクのテキストデータを読み込み、クラス定義に示されたクラスに単語を置き換えて、言語モデル学習用のクラス化された第２のテキストデータを出力し、類似単語対抽出手段は、第１および第２のテキストデータを読み込み、対象タスクのテキストデータに含まれる単語と一般タスクのテキストデータに含まれる単語との組み合わせから類似単語対を抽出し、類似単語列合成手段は、第１および第２のテキストデータを読み込むとともに、類似単語対抽出手段から類似単語対を読み込み、言語データに含まれない対象タスク内の単語を含む単語列を合成して出力し、言語モデル生成手段は、第１および第２のテキストデータを読み込むとともに、類似単語列合成手段から単語列を読み込み、各テキストデータ毎に重み付けて単語列の統計量を求めることにより、タスク適応化済み言語モデルを生成するようにしたので、認識精度を高めた言語モデル学習装置が得られる効果がある。
【０２７３】
また、この発明の請求項４によれば、対象タスクのテキストデータを集積した対象タスク言語データと、対象タスク以外のタスクを含む一般タスクのテキストデータを集積した一般タスク言語データと、事前に準備したテキストデータを用いて作成された初期言語モデルと、対象タスク言語データ、一般タスク言語データおよび初期言語モデルから、タスク適応化済み統計的言語モデルを構築するための、類似単語対抽出手段および類似単語確率補正手段とを備え、類似単語対抽出手段は、対象タスク言語データおよび一般タスク言語データから、それぞれ言語モデル学習用のテキストデータを読み込み、対象タスクのテキストデータに含まれる単語と一般タスクのテキストデータに含まれる単語との組み合わせから類似単語対を抽出し、類似単語確率補正手段は、類似単語対抽出手段から類似単語対を読み込むとともに、初期言語モデルを読み込み、対象タスクで出現する単語の出現確率のスムージングを行うことにより、タスク適応化済み統計的言語モデルを生成するようにしたので、認識精度を高めた言語モデル学習装置が得られる効果がある。
【０２７４】
また、この発明の請求項５によれば、対象タスクのテキストデータを集積した対象タスク言語データと、対象タスク以外のタスクを含む一般タスクのテキストデータを集積した一般タスク言語データと、あらかじめ作成された初期クラス言語モデルと、対象タスク言語データ、一般タスク言語データおよび初期クラス言語モデルから、タスク適応化済みクラス言語モデルを構築するための、対象タスク単語クラス化手段、一般タスク単語クラス化手段、類似単語対抽出手段および類似単語確率補正手段とを備え、対象タスク単語クラス化手段は、対象タスク言語データから対象タスクのテキストデータを読み込み、クラス定義に示されたクラスに単語を置き換えて、言語モデル学習用のクラス化された第１のテキストデータを出力し、一般タスク単語クラス化手段は、一般タスク言語データから一般タスクのテキストデータを読み込み、クラス定義に示されたクラスに単語を置き換えて、言語モデル学習用のクラス化された第２のテキストデータを出力し、類似単語対抽出手段は、第１および第２のテキストデータを読み込み、対象タスクのテキストデータに含まれる単語と一般タスクのテキストデータに含まれる単語との組み合わせから類似単語対を抽出し、類似単語確率補正手段は、類似単語対抽出手段から類似単語対を読み込むとともに、初期クラス言語モデルを読み込み、対象タスクで出現する単語の出現確率のスムージングを行うことにより、タスク適応化済みクラス言語モデルを生成するようにしたので、認識精度を高めた言語モデル学習装置が得られる効果がある。
【０２７５】
また、この発明の請求項６によれば、請求項１または請求項４において、類似単語抽出手段は、距離算出用言語モデル生成手段、統計的単語間距離算出手段およびしきい値判定手段を含み、距離算出用言語モデル生成手段は、対象タスク言語データおよび一般タスク言語データから、それぞれ言語モデル学習用のテキストデータを読み込み、各テキストデータ毎に重み付けて単語列の統計量を求めて、距離算出用の統計的言語モデルを生成し、統計的単語間距離算出手段は、距離算出用言語モデル生成手段から統計的言語モデルを読み込み、各テキストデータから抽出した単語からなる単語対について、統計的言語モデル上の統計的な距離を単語間距離として求め、しきい値判定手段は、統計的単語間距離算出手段から単語対および単語間距離を読み込み、所定のしきい値を越える単語対を出力するようにしたので、認識精度を高めた言語モデル学習装置が得られる効果がある。
【０２７６】
また、この発明の請求項７によれば、請求項１または請求項４において、類似単語抽出手段は、距離算出用言語モデル、統計的単語間距離算出手段およびしきい値判定手段を含み、距離算出用言語モデルは、事前に準備したテキストデータを用いて作成されており、統計的単語間距離算出手段は、距離算出用言語モデルを読み込み、各テキストデータから抽出した単語からなる単語対について、距離算出用言語モデル上の統計的な距離を単語間距離として求め、しきい値判定手段は、統計的単語間距離算出手段から単語対および単語間距離を読み込み、所定のしきい値を越える単語対を出力するようにしたので、認識精度を高めた言語モデル学習装置が得られる効果がある。
【０２７７】
また、この発明の請求項８によれば、請求項３または請求項５において、類似単語抽出手段は、距離算出用言語モデル生成手段、統計的単語間距離算出手段およびしきい値判定手段を含み、距離算出用言語モデル生成手段は、対象タスク単語クラス化手段および一般タスク単語クラス化手段から第１および第２のテキストデータを読み込み、各テキストデータ毎に重み付けて単語列の統計量を求めて、距離算出用の統計的言語モデルを生成し、統計的単語間距離算出手段は、距離算出用言語モデル生成手段から統計的言語モデルを読み込み、各テキストデータから抽出した単語からなる単語対について、統計的言語モデル上の統計的な距離を単語間距離として求め、しきい値判定手段は、統計的単語間距離算出手段から単語対および単語間距離を読み込み、所定のしきい値を越える単語対を出力するようにしたので、認識精度を高めた言語モデル学習装置が得られる効果がある。
【０２７８】
また、この発明の請求項９によれば、請求項３または請求項５において、類似単語抽出手段は、距離算出用クラス言語モデル、統計的単語間距離算出手段およびしきい値判定手段を含み、距離算出用クラス言語モデルは、事前に準備したテキストデータを用いて作成されており、統計的単語間距離算出手段は、距離算出用クラス言語モデルを読み込むとともに、対象タスク単語クラス化手段および一般タスク単語クラス化手段から第１および第２のテキストデータを読み込み、各テキストデータから抽出した単語からなる単語対について、距離算出用クラス言語モデル上の統計的な距離を単語間距離として求め、しきい値判定手段は、統計的単語間距離算出手段から単語対および単語間距離を読み込み、所定のしきい値を越える単語対を出力するようにしたので、認識精度を高めた言語モデル学習装置が得られる効果がある。
【０２７９】
また、この発明の請求項１０によれば、請求項６から請求項９までのいずれかにおいて、統計的単語間距離算出手段は、Ｎグラム言語モデル上のユークリッド距離を用いて、単語間距離を測定するようにしたので、認識精度を高めた言語モデル学習装置が得られる効果がある。
【０２８０】
また、この発明の請求項１１によれば、請求項６から請求項９までのいずれかにおいて、統計的単語間距離算出手段は、Ｎグラム言語モデル上のクロスエントロピーを用いて、単語間距離を測定するようにしたので、認識精度を高めた言語モデル学習装置が得られる効果がある。
【０２８１】
また、この発明の請求項１２によれば、請求項１から請求項１１までのいずれかの言語モデル学習装置を用いた音声認識装置であって、言語モデルまたはクラス言語モデルは、音声認識に用いられるようにしたので、高精度の音声認識装置が得られる効果がある。
【図面の簡単な説明】
【図１】この発明の実施の形態１による言語モデル学習装置を概略的に示すブロック構成図である。
【図２】この発明の実施の形態１による言語モデル学習装置の処理手順を示すフローチャートである。
【図３】この発明の実施の形態２による言語モデル学習装置を概略的に示すブロック構成図である。
【図４】この発明の実施の形態２による言語モデル学習装置の処理手順を示すフローチャートである。
【図５】この発明の実施の形態３による言語モデル学習装置を概略的に示すブロック構成図である。
【図６】この発明の実施の形態３による言語モデル学習装置の処理手順を示すフローチャートである。
【図７】この発明の実施の形態４による言語モデル学習装置を概略的に示すブロック構成図である。
【図８】この発明の実施の形態４による言語モデル学習装置の処理手順を示すフローチャートである。
【図９】この発明の実施の形態５による言語モデル学習装置を概略的に示すブロック構成図である。
【図１０】この発明の実施の形態５による言語モデル学習装置の処理手順を示すフローチャートである。
【図１１】この発明の実施の形態６による言語モデル学習装置の類似単語対抽出手段を具体例に示す機能ブロック図である。
【図１２】この発明の実施の形態６による言語モデル学習装置の類似単語対抽出手段の処理手順を示すフローチャートである。
【図１３】この発明の実施の形態６による類似単語対抽出手段の第２の具体例を示す機能ブロック図である。
【図１４】この発明の実施の形態６による類似単語対抽出手段の第３の具体例を示す機能ブロック図である。
【図１５】この発明の実施の形態６による類似単語対抽出手段の第４の具体例を示す機能ブロック図である。
【図１６】この発明の実施の形態７による言語モデル学習装置を用いた音声認識装置を概略的に示すブロック構成図である。
【図１７】この発明の実施の形態７による言語モデル学習装置を用いた音声認識装置の処理手順を示すフローチャートである。
【図１８】この発明の実施の形態８による言語モデル学習装置を用いた音声認識装置を概略的に示すブロック構成図である。
【図１９】この発明の実施の形態８による言語モデル学習装置を用いた音声認識装置の処理手順を示すフローチャートである。
【図２０】一般的なクラス定義の一例を示す説明図である。
【図２１】従来の言語モデル学習装置を概略的に示すブロック構成図である。
【図２２】従来の言語モデル学習装置および方法による処理手順を示すフローチャートである。
【符号の説明】
１０１対象タスク言語データ、１０２一般タスク言語データ、１０３、１０３Ｂ、１０３Ｃ、１０３Ｄ、１０３Ｅ、１０３Ｆ類似単語対抽出手段、１０４、１０４Ｂ類似単語列合成手段、１０５、１０５Ａ、１０５Ｂ言語モデル生成手段、３０１対象タスク単語クラス化手段、３０２一般タスク単語クラス化手段、７０１初期言語モデル、７０２類似単語確率補正手段、９０１初期クラス言語モデル、１１０１、１１０１Ｄ、１１０１Ｆ統計的単語間距離算出手段、１１０２、１１０２Ｅしきい値判定手段、１１０５、１１０５Ｅ距離算出用言語モデル生成手段、１３０１距離算出用言語モデル、１５０１距離算出用クラス言語モデル、１６０５言語モデル、１８０２クラス言語モデル。
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a language model learning apparatus using a probabilistic language model, and to a speech recognition apparatus using the same.
[0002]
[Prior art]
In speech recognition, the digitized input speech signal is generally converted, by signal processing techniques, into a time series of vectors that represent the acoustic characteristics of the speech well, and is then matched against speech models.
[0003]
The matching process corresponds to the problem of finding the uttered word string $W = [w_1, w_2, \ldots, w_M]$ (where $M$ is the number of words) from the acoustic feature vector time series $A = [a_1, a_2, \ldots, a_K]$ consisting of $K$ time frames.
[0004]
In the above matching process, to estimate the word string $W$ that gives the highest recognition accuracy, the recognized word string $W^*$ that maximizes the appearance probability $P(W \mid A)$ can be obtained by the following equation (1).
[0005]
[Equation 1]
$$W^* = \arg\max_{W} P(W \mid A) \qquad (1)$$
[0006]
However, in the equation (1), it is usually difficult to directly obtain the appearance probability P (W | A). Therefore, the appearance probability P (W | A) is rewritten as the following equation (2) using Bayes' theorem.
[0007]
[Equation 2]
$$P(W \mid A) = \frac{P(A \mid W)\, P(W)}{P(A)} \qquad (2)$$
[0008]
Here, when obtaining the word string $W$ that maximizes the left side of equation (2), the denominator $P(A)$ on the right side does not depend on the candidate word string $W$, so it suffices to find the word string $W$ that maximizes the numerator on the right side. That is, the recognized word string $W^*$ is expressed by the following equation (3).
[0009]
[Equation 3]
$$W^* = \arg\max_{W} P(A \mid W)\, P(W) \qquad (3)$$
[0010]
Here, the probability model giving P (W) and the probability model giving P (A | W) in the expression (3) are called a language model and an acoustic model, respectively.
A modeling approach that has been actively studied in speech recognition in recent years represents the acoustic model with a “hidden Markov model” and the language model with a “probabilistic language model”.
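As a rough illustration of equation (3), the following minimal sketch picks, from a small set of candidate word strings, the one that maximizes the sum of the acoustic and language model log probabilities. The candidate list and all probability values are made-up assumptions for illustration, not values from this document.

```python
import math

def decode(candidates, acoustic_logprob, lm_logprob):
    """Return the word string W maximizing log P(A|W) + log P(W)."""
    return max(candidates, key=lambda w: acoustic_logprob[w] + lm_logprob[w])

# Hypothetical candidates and scores (assumed values).
candidates = [("hotel", "please"), ("total", "please")]
acoustic_logprob = {
    ("hotel", "please"): math.log(0.20),  # log P(A|W)
    ("total", "please"): math.log(0.25),
}
lm_logprob = {
    ("hotel", "please"): math.log(0.05),  # log P(W)
    ("total", "please"): math.log(0.01),
}

best = decode(candidates, acoustic_logprob, lm_logprob)
print(best)
```

In this toy case the acoustically slightly worse candidate wins because the language model assigns it a higher probability, which is the role P(W) plays in equation (3).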
[0011]
Details of these modeling methods are described in, for example, “Fundamentals of Speech Recognition (Vols. 1 and 2)” (L. R. Rabiner and B. H. Juang, translation supervised by Furui, November 1995, NTT Advanced Technology) (hereinafter referred to as “Document 1”) and “Probabilistic Language Models” (Kenji Kita, University of Tokyo Press) (hereinafter referred to as “Document 2”).
[0012]
In these methods, the parameters constituting the probability model are statistically estimated from a large amount of data.
In other words, in the construction of an acoustic model, speech data such as words and sentences from a large number of speakers is collected in advance, and the parameters are estimated by statistical methods so as to improve recognition accuracy or an index closely related to recognition accuracy.
[0013]
For example, using the Baum-Welch algorithm, the parameters of the “hidden Markov model” constituting the acoustic model are estimated so as to increase the likelihood of the learning data.
The method for estimating the acoustic model is described in detail in the second volume of the above document 1.
[0014]
Similarly, in the construction of a language model, the probability of appearance of each utterance and words constituting the utterance is calculated from text such as a newspaper or a transcript of a conversation according to the structure of the language model.
[0015]
As the structure of the language model, the “N-gram language model”, which predicts the appearance probability of the next word from the immediately preceding words using an (N−1)-th order Markov model, the “probabilistic context-free grammar”, or a combination of these is often applied.
[0016]
In particular, the N-gram language model is widely used because it is effective and parameter estimation means can be easily realized.
Therefore, in the following description, the construction of a language model will be described using an N-gram language model as an example.
[0017]
For example, in the N-gram language model, when N = 2 (referred to as a bigram language model), P (W) in the above equation (3) is approximated as the following equation (4).
[0018]
[Equation 4]
$$P(W) \approx \prod_{i=1}^{M} P(w_i \mid w_{i-1}) \qquad (4)$$
[0019]
The conditional probability $P(w_N \mid w_1, \ldots, w_{N-1})$, which is a parameter of the N-gram language model, is obtained from the frequency $C(w_1, \ldots, w_N)$ of adjacent word strings in the text data for learning by the following equation (5).
[0020]
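[Equation 5]
$$P(w_N \mid w_1, \ldots, w_{N-1}) = \frac{C(w_1, \ldots, w_N)}{C(w_1, \ldots, w_{N-1})} \qquad (5)$$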
[0021]
However, if the conditional appearance probability of a word is simply estimated as in the above equation (5), if the word string that does not exist in the learning data is included, the appearance probability of the sentence becomes “0”.
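The relative-frequency estimate of equation (5), and the zero-probability problem it causes, can be sketched for the bigram case as follows. The tiny corpus and the sentence boundary symbols `<s>` and `</s>` are assumptions made for this illustration.

```python
from collections import Counter

def bigram_mle(sentences):
    """Estimate P(cur | prev) = C(prev, cur) / C(prev) by relative frequency."""
    history = Counter()
    bigram = Counter()
    for words in sentences:
        for prev, cur in zip(words, words[1:]):
            history[prev] += 1          # count of prev used as a history
            bigram[(prev, cur)] += 1
    def prob(cur, prev):
        if history[prev] == 0:
            return 0.0
        return bigram[(prev, cur)] / history[prev]
    return prob

def sentence_prob(words, prob):
    """Bigram approximation of P(W): product of the conditional probabilities."""
    result = 1.0
    for prev, cur in zip(words, words[1:]):
        result *= prob(cur, prev)
    return result

# Hypothetical learning text, already divided into words.
corpus = [["<s>", "that", "please", "</s>"], ["<s>", "hotel", "</s>"]]
prob = bigram_mle(corpus)

seen = sentence_prob(["<s>", "that", "please", "</s>"], prob)
unseen = sentence_prob(["<s>", "hotel", "please", "</s>"], prob)
print(seen, unseen)
```

Because the word string "hotel, please" never occurs in the learning text, the second sentence receives probability 0 even though every word in it is known.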
[0022]
In order to prevent such a state, a process of assigning a non-zero (not “0”) probability to a word string that does not appear in the learning text (generally called “smoothing”) is performed.
[0023]
The most common smoothing method is “back-off smoothing” proposed by Katz.
In the back-off smoothing, a probability is assigned to a word string that does not appear in the learning data by excluding a certain ratio according to the frequency (discounting is performed) from the probability estimated by the above equation (5).
[0024]
For the conditional probability assigned to the word string that did not appear in the learning data, a value estimated by a more rough language model is used.
In the method according to Katz, N-1 gram is used as a coarser model than N gram. Details of this method are shown on page 67 of the above-mentioned document 2.
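The idea of discounting seen bigrams and backing off to a coarser model can be sketched as below. This is a simplified illustration, not Katz's method itself: it assumes a fixed absolute discount (the `discount` parameter) instead of the frequency-dependent discount used by Katz, and it backs off from a bigram to a unigram relative frequency. The corpus is made up for the example.

```python
from collections import Counter, defaultdict

def train_backoff_bigram(sentences, discount=0.5):
    """Bigram model with absolute discounting and back-off to unigrams."""
    unigram = Counter(w for s in sentences for w in s)
    total = sum(unigram.values())
    history = Counter()
    bigram = Counter()
    followers = defaultdict(set)
    for s in sentences:
        for h, w in zip(s, s[1:]):
            history[h] += 1
            bigram[(h, w)] += 1
            followers[h].add(w)

    def p_uni(w):
        return unigram[w] / total

    def prob(w, h):
        if history[h] == 0:
            return p_uni(w)                  # unknown history: use the coarser model
        if bigram[(h, w)] > 0:
            return (bigram[(h, w)] - discount) / history[h]
        # Probability mass removed from the seen bigrams of this history.
        reserved = discount * len(followers[h]) / history[h]
        # Unigram mass of the words not seen after h (normalizer for back-off).
        unseen_mass = 1.0 - sum(p_uni(v) for v in followers[h])
        if unseen_mass <= 0.0:
            return 0.0
        return reserved * p_uni(w) / unseen_mass

    return prob, sorted(unigram)

corpus = [["<s>", "that", "please", "</s>"], ["<s>", "hotel", "</s>"]]
prob, vocab = train_backoff_bigram(corpus)

p_unseen = prob("please", "hotel")                 # never observed, now non-zero
p_total = sum(prob(w, "hotel") for w in vocab)     # should still sum to 1
print(p_unseen, p_total)
```

The unseen word string now receives a small non-zero probability, while the probabilities for a given history still sum to one over the vocabulary.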
[0025]
In the case of Japanese, text is not written with spaces between words, so the definition of a word is ambiguous. Therefore, in this specification, each unit obtained by dividing text consistently by some means is defined as a word.
[0026]
That is, a word may be a linguistic unit such as a character, morpheme, or clause, a segment obtained by dividing text according to an entropy criterion, or a combination of these, and includes cases where linguistic information such as reading or part of speech is attached to these units.
[0027]
In the construction of a language model using the statistical method, a large amount of speech data and text data are required to estimate the language model parameters.
In particular, since the N-gram language model strongly depends on learning data, it is necessary to collect a large amount of data for each target task (hereinafter referred to as “target task”).
[0028]
However, it is difficult to collect a large amount of text data for each task, and it is desirable that a language model can be constructed from a small amount of text data relating to the target task.
[0029]
A class language model groups similar words and treats them as the same class (group), thereby reducing the number of language model parameters to be estimated and assigning appropriate probabilities to words that do not appear in the learning data.
[0030]
The relationship definition between a word and a class is determined manually according to a word or a task, or determined based on data, and can be applied even to an N-gram language model.
[0031]
For example, the appearance probability of a sentence in the bigram class language model is defined by the following equation (6), using (1) the transition probability $P(c_i \mid c_{i-1})$ between classes and (2) the probability $P(w_i \mid c_i)$ of selecting a specific word from the class.
[0032]
[Equation 6]
$$P(W) \approx \prod_{i=1}^{M} P(c_i \mid c_{i-1})\, P(w_i \mid c_i) \qquad (6)$$
[0033]
For example, consider a case where 1000 words are divided into 100 classes each consisting of 10 words. In this case, the number of parameters to be estimated for the word bigram language model is $1000^2$ (= 1,000,000).
[0034]
In contrast, the number of parameters to be estimated for the class bigram language model is the sum of (1) the transitions between classes and (2) the mappings between classes and words, and decreases to $100^2 + 100 \times 10$ (= 11,000).
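The parameter counts in this example can be checked with a few lines of arithmetic. The function names are hypothetical and used only for this illustration.

```python
def word_bigram_params(vocab_size):
    """One parameter per ordered word pair."""
    return vocab_size ** 2

def class_bigram_params(num_classes, words_per_class):
    """Class-to-class transitions plus class-to-word output probabilities."""
    return num_classes ** 2 + num_classes * words_per_class

word_params = word_bigram_params(1000)
class_params = class_bigram_params(100, 10)
print(word_params, class_params)
```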
[0035]
The correspondence between words and classes may be determined manually, or may be obtained by executing word clustering from language data.
FIG. 20 is an explanatory diagram showing an example of class definition. In FIG. 20, a word w, a class c to which the word w belongs, and a probability P (w | c) output from the class c to which the word w belongs are described.
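A class definition of this kind, combined with class transition probabilities, gives the sentence probability of equation (6). The sketch below uses a made-up class definition (the class names, words, and all probabilities are assumptions, not the contents of FIG. 20).

```python
# Hypothetical class definition: word -> (class, P(word | class)).
class_definition = {
    "hotel": ("PLACE", 0.5),
    "inn": ("PLACE", 0.5),
    "please": ("REQUEST", 1.0),
}

# Hypothetical class bigram: (previous class, class) -> transition probability.
class_transition = {
    ("<s>", "PLACE"): 0.5,
    ("PLACE", "REQUEST"): 0.8,
}

def class_bigram_sentence_prob(words):
    """P(W) as the product of P(c_i | c_{i-1}) * P(w_i | c_i)."""
    prob = 1.0
    prev_class = "<s>"
    for w in words:
        cls, p_word = class_definition[w]
        prob *= class_transition.get((prev_class, cls), 0.0) * p_word
        prev_class = cls
    return prob

p_hotel = class_bigram_sentence_prob(["hotel", "please"])
p_inn = class_bigram_sentence_prob(["inn", "please"])
print(p_hotel, p_inn)
```

Since "hotel" and "inn" share a class, both word strings receive the same probability, even if only one of them was ever observed in the learning data.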
[0036]
Of the class N-gram language model, the estimation of the inter-class transition model is the same as in the case of a normal word N-gram.
The construction method of the class N gram language model is described in detail on page 72 et seq.
[0037]
On the other hand, task adaptation uses text data other than the target task together to compensate for the lack of learning data.
Here, text data including tasks other than the target task is referred to as general task language data.
[0038]
Regarding task adaptation, the method described in “Examination of vocabulary setting method in N-gram task adaptation” (Akinori Ito, Masaki Yoshida, IEICE Technical Report, pp. 51-58, SP97-25, 1997) (hereinafter referred to as “Document 3”) has been proposed.
[0039]
In this method, task adaptation is performed by weighting and adding learning data of a target task and a general task to an N-gram language model.
[0040]
FIG. 21 is a block diagram schematically showing an apparatus to which the speech recognition language model construction method described in the above-mentioned document 3 is applied.
In FIG. 21, reference numeral 100 denotes a language model estimation means for generating a task-adapted language model.
[0041]
Reference numeral 101 denotes target task language data, which accumulates text data of the target task and divides text representing a sentence to be recognized by the target task into words.
Reference numeral 102 denotes general task language data, which accumulates text data of general tasks including tasks other than the target task, and divides text representing sentences included in the general tasks into words.
[0042]
The language model estimation means 100 reads the target task language data 101 and the general task language data 102, applies appropriate weights, counts the frequencies of word strings, and estimates the parameters of the language model using a statistical method.
[0043]
A weight is given for each input.
For example, if the word string “I, ha” appears twice in the target task and four times in the general task, and the frequency weight of the target task is “3” and that of the general task is “1”, the frequency of the word string “I, ha” is estimated as “10 (= 3 × 2 + 1 × 4)”.
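The weighted counting in this example can be sketched as follows. The function name and data layout are assumptions for illustration; the counts and weights are the ones given in the text.

```python
from collections import Counter

def weighted_counts(corpora):
    """Sum word-string frequencies over corpora, each scaled by its weight."""
    total = Counter()
    for counts, weight in corpora:
        for word_string, count in counts.items():
            total[word_string] += weight * count
    return total

target_counts = Counter({("I", "ha"): 2})   # target task language data
general_counts = Counter({("I", "ha"): 4})  # general task language data

merged = weighted_counts([(target_counts, 3), (general_counts, 1)])
print(merged[("I", "ha")])
```

The same function also accepts non-integer weights, since the weights only scale the counts before smoothing is applied.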
[0044]
The weighting coefficient may not be an integer.
Also, when counting up, if necessary, words with low frequency can be removed, and the probability of removal can be redistributed to the words necessary for recognition with equal probability.
[0045]
From the frequency information “10” obtained in this way, the probabilities are estimated for known and unknown word strings by, for example, the Katz back-off smoothing method.
The frequency weight can be determined using a deletion estimation method so as to increase the appearance probability of the finally obtained language model with respect to the test data.
The deletion estimation method is described on page 49 of the above document 2.
[0046]
Next, a language model learning procedure by task adaptation based on the conventional apparatus and the conventional method shown in FIG. 21 will be described with reference to the flowchart of FIG.
First, the language model estimation unit 100 reads a weight parameter for an input from a weight parameter storage unit (not shown) (step S2201).
[0047]
Next, the learning text divided into words is read from the target task language data 101 and the general task language data 102, and the frequency of word strings of n words or less weighted according to the weight parameter is obtained (step S2202).
[0048]
Finally, smoothing using, for example, the Katz back-off smoothing method is executed to estimate the parameters of the language model (step S2203), and the processing routine of FIG.
[0049]
In the above method, using the text data of the general task language data 102 together makes it possible to estimate more appropriately the appearance probability of word strings representing various expressions that are difficult to obtain from the small amount of learning data for the target task.
[0050]
At the same time, by weighting the target task language data 101, a greater probability can be given to the word string that appears in the corpus of the target task, and the recognition accuracy can be improved.
[0051]
However, although the above task adaptation method can estimate the appearance probability of words unique to the target task and of word strings that appear in the general task, it does not take into account combinations of words unique to the target task with words that appear in the general task. Therefore, when there is little text data for the target task, the estimation accuracy of the language model parameters deteriorates around the words specific to the target task.
[0052]
For example, consider a case where the target task is a hotel reservation service, and text data of utterances from similar reservation service tasks other than hotel reservation is used as the general task language data 102.
[0053]
In this case, the appearance probabilities of word strings that appear in reservation services in general, such as “That, please”, and of words specific to the target task, such as “hotel”, are estimated from the general task language data 102 and the target task language data 101 according to their frequencies.
[0054]
However, since the number of possible word combinations is very large, when there is only a small amount of text data for the target task, word strings containing words specific to the target task, such as “Hotel, Please”, are often not sufficiently covered by the text data.
[0055]
As a result, an inappropriate appearance probability is assigned to the word string, which may reduce the recognition accuracy.
In particular, the words specific to the target task are often important in performing the task, and a decrease in recognition accuracy around these words is likely to have a large effect on the performance of the entire system.
[0056]
[Problems to be solved by the invention]
As described above, the conventional language model learning apparatus and the speech recognition apparatus using the same do not take into account combinations of words specific to the target task with words that appear in the general task. Consequently, when there is little text data for the target task, the estimation accuracy of the language model parameters around the words specific to the target task deteriorates, which adversely affects the performance of the entire system.
[0057]
The present invention has been made to solve the above problem. Its object is to obtain a language model learning apparatus with improved recognition accuracy, and a speech recognition apparatus using the same, by finding words in the general task data that are similar to the words specific to the target task and using them to estimate the appearance probability of word strings containing the task-specific words.
[0058]
[Means for Solving the Problems]
The language model learning apparatus according to claim 1 of the present invention comprises: target task language data in which text data of a target task is accumulated; general task language data in which text data of general tasks, including tasks other than the target task, is accumulated; and similar word pair extracting means, similar word string synthesizing means, and language model generating means for reading text data for language model learning from the target task language data and the general task language data, respectively, and constructing a task-adapted language model. The similar word pair extracting means reads each text data from the target task language data and the general task language data, and extracts similar word pairs from combinations of words included in the text data of the target task and words included in the text data of the general tasks. The similar word string synthesizing means reads each text data, reads the similar word pairs from the similar word pair extracting means, and synthesizes and outputs word strings that include words of the target task and are not contained in the language data. The language model generating means reads each text data, reads the word strings from the similar word string synthesizing means, and generates the task-adapted language model by obtaining statistics of word strings with a weight applied to each text data.
[0059]
The language model learning apparatus according to claim 2 of the present invention comprises: target task language data in which text data of a target task is accumulated; general task language data in which text data of general tasks, including tasks other than the target task, is accumulated; and target task word classifying means, general task word classifying means, and language model generating means for constructing a task-adapted language model from the target task language data and the general task language data. The target task word classifying means reads the text data of the target task from the target task language data, replaces words with the classes given in a class definition, and outputs classified first text data for language model learning. The general task word classifying means reads the text data of the general tasks from the general task language data, replaces words with the classes given in the class definition, and outputs classified second text data for language model learning. The language model generating means reads the first and second text data and generates a language model by obtaining statistics of word strings with a weight applied to each text data.
[0060]
The language model learning apparatus according to claim 3 of the present invention comprises: target task language data in which text data of a target task is accumulated; general task language data in which text data of general tasks, including tasks other than the target task, is accumulated; and target task word classifying means, general task word classifying means, similar word pair extracting means, similar word string synthesizing means, and language model generating means for constructing a task-adapted language model from the target task language data and the general task language data. The target task word classifying means reads the text data of the target task from the target task language data, replaces words with the classes given in a class definition, and outputs classified first text data for language model learning. The general task word classifying means reads the text data of the general tasks from the general task language data, replaces words with the classes given in the class definition, and outputs classified second text data for language model learning. The similar word pair extracting means reads the first and second text data, and extracts similar word pairs from combinations of words included in the text data of the target task and words included in the text data of the general tasks. The similar word string synthesizing means reads the first and second text data, reads the similar word pairs from the similar word pair extracting means, and synthesizes and outputs word strings that include words of the target task and are not contained in the language data. The language model generating means reads the first and second text data, reads the word strings from the similar word string synthesizing means, and generates the task-adapted language model by obtaining statistics of word strings with a weight applied to each text data.
[0061]
The language model learning apparatus according to claim 4 of the present invention comprises: target task language data in which text data of a target task is accumulated; general task language data in which text data of general tasks, including tasks other than the target task, is accumulated; an initial language model created using text data prepared in advance; and similar word pair extracting means and similar word probability correcting means for constructing a task-adapted statistical language model from the target task language data, the general task language data, and the initial language model. The similar word pair extracting means reads text data for language model learning from the target task language data and the general task language data, respectively, and extracts similar word pairs from combinations of words included in the text data of the target task and words included in the text data of the general tasks. The similar word probability correcting means reads the similar word pairs from the similar word pair extracting means, reads the initial language model, and generates the task-adapted statistical language model by smoothing the appearance probabilities of words that appear in the target task.
[0062]
The language model learning apparatus according to claim 5 of the present invention comprises: target task language data in which text data of a target task is accumulated; general task language data in which text data of general tasks, including tasks other than the target task, is accumulated; an initial class language model created in advance; and target task word classifying means, general task word classifying means, similar word pair extracting means, and similar word probability correcting means for constructing a task-adapted class language model from the target task language data, the general task language data, and the initial class language model. The target task word classifying means reads the text data of the target task from the target task language data, replaces words with the classes given in a class definition, and outputs classified first text data for language model learning. The general task word classifying means reads the text data of the general tasks from the general task language data, replaces words with the classes given in the class definition, and outputs classified second text data for language model learning. The similar word pair extracting means reads the first and second text data, and extracts similar word pairs from combinations of words included in the text data of the target task and words included in the text data of the general tasks. The similar word probability correcting means reads the similar word pairs from the similar word pair extracting means, reads the initial class language model, and generates the task-adapted class language model by smoothing the appearance probabilities of words that appear in the target task.
[0063]
In the language model learning apparatus according to claim 6 of the present invention, in claim 1 or claim 4, the similar word extracting means includes distance calculating language model generating means, statistical inter-word distance calculating means, and threshold determining means. The distance calculating language model generating means reads text data for language model learning from the target task language data and the general task language data, respectively, and generates a statistical language model for distance calculation by obtaining statistics of word strings with a weight applied to each text data. The statistical inter-word distance calculating means reads the statistical language model from the distance calculating language model generating means and, for each word pair consisting of words extracted from the respective text data, obtains the statistical distance on the statistical language model as the inter-word distance. The threshold determining means reads the word pairs and inter-word distances from the statistical inter-word distance calculating means and outputs the word pairs that exceed a predetermined threshold.
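The distance calculation and threshold determination can be sketched as follows. Several points are assumptions made for illustration: the distance is taken as the Euclidean distance between the next-word distributions of two words, the distributions themselves are made-up values, and a pair is treated as similar when its distance is below the threshold `th`, following the comparison d(wT, wG) < th in step S1205 of the sixth embodiment.

```python
import math
from itertools import product

def euclidean_distance(p, q, vocab):
    """Euclidean distance between two next-word probability distributions."""
    return math.sqrt(sum((p.get(v, 0.0) - q.get(v, 0.0)) ** 2 for v in vocab))

def extract_similar_pairs(target_words, general_words, next_word_dist, vocab, th):
    """Return (wT, wG, distance) for every pair whose distance is below th."""
    pairs = []
    for w_t, w_g in product(target_words, general_words):
        d = euclidean_distance(next_word_dist[w_t], next_word_dist[w_g], vocab)
        if d < th:
            pairs.append((w_t, w_g, d))
    return pairs

# Hypothetical bigram statistics P(next word | word).
vocab = ["please", "is", "</s>"]
next_word_dist = {
    "hotel": {"please": 0.6, "</s>": 0.4},
    "ticket": {"please": 0.7, "</s>": 0.3},
    "that": {"is": 0.9, "please": 0.1},
}

pairs = extract_similar_pairs(["hotel"], ["ticket", "that"],
                              next_word_dist, vocab, th=0.5)
print(pairs)
```

Here "hotel" and "ticket" are followed by similar words and are therefore returned as a similar word pair, while "hotel" and "that" are not.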
[0064]
The language model learning device according to claim 7 of the present invention is the language model learning device according to claim 1 or 4, wherein the similar word pair extracting means includes a distance calculation language model, statistical inter-word distance calculating means, and threshold determining means. The distance calculation language model is created using text data prepared in advance. The statistical inter-word distance calculating means reads the distance calculation language model and, for word pairs consisting of words extracted from each text data, obtains the statistical distance on the distance calculation language model as the inter-word distance. The threshold determining means reads the word pairs and the inter-word distances from the statistical inter-word distance calculating means, and outputs the word pairs that exceed a predetermined threshold.
[0065]
The language model learning device according to claim 8 of the present invention is the language model learning device according to claim 3 or 5, wherein the similar word pair extracting means includes distance calculation language model generating means, statistical inter-word distance calculating means, and threshold determining means. The distance calculation language model generating means reads the first and second text data from the target task word classifying means and the general task word classifying means, and generates a statistical language model for distance calculation by obtaining the statistics of word strings with a weight applied to each text data. The statistical inter-word distance calculating means reads the statistical language model from the distance calculation language model generating means and, for word pairs consisting of words extracted from each text data, obtains the statistical distance on the statistical language model as the inter-word distance. The threshold determining means reads the word pairs and the inter-word distances from the statistical inter-word distance calculating means, and outputs the word pairs that exceed a predetermined threshold.
[0066]
The language model learning device according to claim 9 of the present invention is the language model learning device according to claim 3 or 5, wherein the similar word pair extracting means includes a distance calculation class language model, statistical inter-word distance calculating means, and threshold determining means. The distance calculation class language model is created using text data prepared in advance. The statistical inter-word distance calculating means reads the distance calculation class language model, reads the first and second text data from the target task word classifying means and the general task word classifying means and, for word pairs consisting of words extracted from each text data, obtains the statistical distance on the distance calculation class language model as the inter-word distance. The threshold determining means reads the word pairs and the inter-word distances from the statistical inter-word distance calculating means, and outputs the word pairs that exceed a predetermined threshold.
[0067]
The language model learning device according to claim 10 of the present invention is the language model learning device according to any one of claims 6 to 9, wherein the statistical inter-word distance calculating means measures the distance between words using the Euclidean distance on the N-gram language model.
[0068]
The language model learning device according to claim 11 of the present invention is the language model learning device according to any one of claims 6 to 9, wherein the statistical inter-word distance calculating means measures the distance between words using the cross-entropy on the N-gram language model.
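The two statistical measures named in claims 10 and 11 can be illustrated with a minimal sketch. The text does not say how each word is represented on the N-gram model, so the sketch assumes each word is described by its unsmoothed bigram successor distribution P(next | word). The function names and the `floor` parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict

def successor_distributions(sentences):
    """Estimate P(next | word) from bigram counts (maximum likelihood, no smoothing)."""
    counts = defaultdict(Counter)
    for words in sentences:
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return {
        w: {v: c / sum(ctr.values()) for v, c in ctr.items()}
        for w, ctr in counts.items()
    }

def euclidean_distance(p, q):
    """Euclidean distance between two successor distributions."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))

def cross_entropy(p, q, floor=1e-6):
    """H(p, q) = -sum_v p(v) log q(v).

    `floor` is an assumed stand-in for smoothing so that log(0) never occurs.
    """
    return -sum(pv * math.log(max(q.get(v, 0.0), floor)) for v, pv in p.items())

# Toy corpus (hypothetical data).
corpus = [
    ["from", "narita", "to", "tokyo"],
    ["from", "yokohama", "to", "tokyo"],
    ["narita", "is", "far"],
]
dists = successor_distributions(corpus)
d_euclid = euclidean_distance(dists["yokohama"], dists["narita"])
d_xent = cross_entropy(dists["yokohama"], dists["narita"])
```

Cross-entropy is not symmetric, so a practical system would have to fix the argument order or symmetrize the value. The text leaves that choice open.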
[0069]
A speech recognition apparatus according to claim 12 of the present invention is a speech recognition apparatus using the language model learning device according to any one of claims 1 to 11, wherein the language model or the class language model is used for speech recognition.
[0084]
DETAILED DESCRIPTION OF THE INVENTION
Embodiment 1.
Hereinafter, the first embodiment of the present invention will be described in detail with reference to the drawings. Here, an N-gram language model will be described as an example, but it goes without saying that it can be applied to any statistical language model.
[0085]
FIG. 1 is a block diagram schematically showing a language model learning device according to Embodiment 1 of the present invention, and shows a configuration example of a language model learning device for speech recognition.
In FIG. 1, 101 is target task language data divided into words in the target task, 102 is general task language data divided into words in the general task, and these are the same as described above (see FIG. 21).
[0086]
103 is a similar word pair extracting means, 104 is a similar word string synthesizing means, and 105 is a language model generating means. These means 103 to 105 operate on the target task language data 101 and the general task language data 102 to generate a task-adapted language model.
[0087]
The language model generation unit 105 corresponds to the language model estimation unit 100 described above, and generates a task-adapted language model.
Unlike the conventional device described above, the similar word pair extracting unit 103 and the similar word string synthesizing unit 104 constitute a characteristic part of the present invention.
[0088]
That is, the means 103 and 104 find a general-task word similar to a word specific to the target task, synthesize a word string in which the general-task word in the learning text is replaced with the similar target-task word, and add it to the learning text of the language model. This improves recognition accuracy even when only a small amount of target-task text data is available for constructing the language model.
[0089]
Hereinafter, the functions of the respective means 103 to 105 in FIG. 1 will be specifically described in relation to various models and various data.
However, the same functional blocks and models as those described above are denoted by the same reference numerals, and detailed description thereof is omitted.
[0090]
First, for an arbitrary combination (wT, wG) of a word wT included in the target task language data 101 and a word wG included in the general task language data 102, the similar word pair extraction unit 103 calculates the distance between the words based on a predefined distance measure.
[0091]
At this time, when the calculated value of the distance between words is smaller than a preset threshold th, the similar word pair extraction unit 103 outputs the similar word pair (wT, wG) to the similar word string synthesizing means 104.
[0092]
The distance d (wT, wG) between words can be obtained, for example, by arranging the semantic classifications corresponding to the words into a tree structure in advance according to the breadth of each concept, and using the number of arcs between the semantic nodes corresponding to the two words as the distance.
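As a minimal sketch of this tree-based measure, the semantic hierarchy can be stored as a child-to-parent map and the arcs counted through the nearest common ancestor. The hierarchy, the node names, and the function names below are hypothetical and are not taken from the specification.

```python
def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root of the tree."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def arc_distance(a, b, parent):
    """Count the arcs on the path between semantic nodes a and b."""
    depth_from_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for j, n in enumerate(path_to_root(b, parent)):
        if n in depth_from_a:  # first common ancestor reached from b
            return depth_from_a[n] + j
    raise ValueError("nodes are not in the same tree")

# Hypothetical semantic tree: child -> parent.
PARENT = {
    "facility": "place",
    "city": "place",
    "station": "facility",
    "airport": "facility",
}
# Hypothetical mapping from words to semantic nodes.
NODE_OF = {"Yokohama Station": "station", "Narita Airport": "airport", "Tokyo": "city"}

d_station_airport = arc_distance(NODE_OF["Yokohama Station"], NODE_OF["Narita Airport"], PARENT)
d_station_city = arc_distance(NODE_OF["Yokohama Station"], NODE_OF["Tokyo"], PARENT)
```

With this toy tree the station and the airport are two arcs apart, while the station and the city are three arcs apart, so a threshold of th = 3 would keep only the first pair.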
[0093]
Next, the similar word string synthesizing unit 104 extracts word strings of arbitrary length from the target task language data 101 and the general task language data 102 and, referring to the similar word pairs (wT, wG) read from the similar word pair extracting unit 103, determines for each word string whether it contains the general-task word wG.
[0094]
As a result, when there is a word string “... wG ...” containing the general-task word wG, it is determined whether the word string “... wT ...”, in which the general-task word wG is replaced with the target-task word wT, exists in the data of the general task or the target task.
[0095]
When the word string “... wT ...” does not exist in the data of the general task or the target task, the similar word string synthesis unit 104 synthesizes the word string “... wT ...”, in which the general-task word wG is replaced with the target-task word wT, and outputs it to the language model generation unit 105.
[0096]
Finally, the language model generation unit 105 reads the text data from the target task language data 101, the general task language data 102, and the similar word string synthesis unit 104, applies an appropriate weight to the frequency of each word string according to its input source, and generates a task-adapted language model by estimating the parameters of the language model using a statistical method.
[0097]
Next, the language model learning procedure by task adaptation based on the first embodiment of the present invention shown in FIG. 1 will be described more specifically with reference to the flowchart of FIG.
[0098]
In FIG. 2, steps S201 to S203 are processes executed by the similar word pair extracting unit 103, steps S204 to S208 are processes executed by the similar word string synthesizing unit 104, and steps S209 to S211 are processes executed by the language model generating unit 105.
[0099]
First, the similar word pair extraction unit 103 reads the learning text divided into words from the target task language data 101 and the general task language data 102, and creates a word pair (wT, wG) (step S201).
[0100]
Further, the distance d (wT, wG) is calculated for the combination of the word wT included in the target task language data 101 and the word wG (different from the word wT) included in the general task language data 102 (step S202).
[0101]
Subsequently, the calculated distance d (wT, wG) is compared with a predetermined threshold value th to determine whether or not the distance d (wT, wG) is smaller than the threshold value th (step S203).
[0102]
If it is determined in step S203 that d (wT, wG) ≧ th (that is, No), the similar word pair extraction unit 103 returns to step S202 and repeats the calculation of the distance d (wT, wG). If it is determined that d (wT, wG) < th (that is, Yes), the word pair (wT, wG) at that time is output to the similar word string synthesizing unit 104.
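Steps S201 to S203 can be sketched as a loop over word pairs with a threshold test. This is only an illustration under assumptions: the distance function is passed in as a parameter, and the function name `extract_similar_pairs` and the toy distance are hypothetical.

```python
def extract_similar_pairs(target_sentences, general_sentences, distance, th):
    """Return word pairs (wT, wG) whose distance is smaller than th."""
    # S201: collect the words of each corpus and form candidate pairs.
    target_vocab = {w for s in target_sentences for w in s}
    general_vocab = {w for s in general_sentences for w in s}
    pairs = []
    for wt in sorted(target_vocab):
        for wg in sorted(general_vocab):
            if wt == wg:  # wG must differ from wT
                continue
            # S202: compute the distance; S203: compare it with the threshold.
            if distance(wt, wg) < th:
                pairs.append((wt, wg))
    return pairs

# Toy distance (hypothetical): only one pair is considered close.
def toy_distance(a, b):
    return 1.0 if {a, b} == {"Yokohama Station", "Narita Airport"} else 5.0

similar = extract_similar_pairs(
    [["Yokohama Station", "to"]],
    [["from", "Narita Airport"], ["Narita Airport", "to"]],
    toy_distance,
    th=2.0,
)
```

Any of the distance measures described in this specification could be supplied as the `distance` argument.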
[0103]
The similar word string synthesizing unit 104 reads the text data divided into words from the target task language data 101 and the general task language data 102, and extracts and stores all word strings of n words included in the data (step S204).
[0104]
Further, from the read word strings, the word strings “... wG ...” containing the general-task word wG of the word pairs (wT, wG) selected by the similar word pair extraction unit 103 are extracted (step S205).
[0105]
Subsequently, it is determined whether the word string “... wT ...”, obtained by replacing the general-task word wG in the extracted word string with the target-task word wT, exists among the already stored word strings (step S206).
[0106]
If it is determined in step S206 that the word string “... wT ...” exists among the already stored word strings (that is, Yes), the process returns to step S205. If it is determined that the word string “... wT ...” does not exist (that is, No), the word string “... wT ...” is output as text data (step S207).
[0107]
Next, it is determined whether the processing has been completed for all similar word pairs (wT, wG) (step S208). If it is determined that the processing has not been completed (that is, No), the process returns to step S202. If it is determined that the processing has been completed (that is, Yes), the process proceeds to step S209.
Thereby, steps S202 to S207 are performed for all similar word pairs (wT, wG).
[0108]
Here, as a specific example, consider the case where the distance between the target-task word “Yokohama Station” and the general-task word “Narita Airport” is smaller than the threshold th, and the word strings “Narita Airport, to” and “from, Narita Airport” exist in the general text data.
[0109]
At this time, if the word string “Yokohama Station, to” exists in the target text data but the word string “from, Yokohama Station” does not exist, the similar word string synthesizing means 104 synthesizes and outputs the word string “from, Yokohama Station”.
As a result, using the word similarity information, a word string expected to appear in the target task is added to the learning text data.
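The synthesis in steps S204 to S207 can be sketched on this example as follows. This is an illustration under assumptions rather than the claimed implementation: word strings are taken to be word 2-grams, and the function names `ngrams` and `synthesize` are hypothetical.

```python
def ngrams(sentences, n):
    """Collect all word strings of exactly n words as tuples."""
    out = set()
    for words in sentences:
        for i in range(len(words) - n + 1):
            out.add(tuple(words[i:i + n]))
    return out

def synthesize(target_sentences, general_sentences, similar_pairs, n=2):
    """Synthesize word strings in which wG is replaced with wT."""
    # S204: store the word strings of both corpora.
    stored = ngrams(target_sentences, n) | ngrams(general_sentences, n)
    synthesized = []
    for wt, wg in similar_pairs:
        for gram in sorted(stored):
            if wg not in gram:  # S205: keep only strings containing wG
                continue
            replaced = tuple(wt if w == wg else w for w in gram)
            # S206: skip strings that already exist; S207: output new ones.
            if replaced not in stored and replaced not in synthesized:
                synthesized.append(replaced)
    return synthesized

target = [["Yokohama Station", "to"]]
general = [["Narita Airport", "to"], ["from", "Narita Airport"]]
new_strings = synthesize(target, general, [("Yokohama Station", "Narita Airport")])
```

On this toy data the only synthesized string is ("from", "Yokohama Station"), because ("Yokohama Station", "to") is already present in the target data.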
[0110]
Next, in FIG. 2, the language model generation unit 105 reads weight parameters corresponding to the respective inputs from a weight parameter storage unit (not shown) (step S209).
[0111]
Further, the learning text divided into words is read from the target task language data 101, the general task language data 102, and the similar word string synthesizing means 104, and the frequency of the word string is obtained (step S210).
At this time, in the case of the N-gram language model, it is necessary to calculate the frequency for a word string of n words or less.
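Steps S209 and S210 amount to a weighted frequency count over the three inputs. The sketch below assumes each input arrives as a list of tokenized sentences paired with one scalar weight. The function name and the weight values are hypothetical.

```python
from collections import Counter

def weighted_ngram_counts(sources, n):
    """Accumulate weighted frequencies of all word strings of n words or fewer.

    sources: list of (sentences, weight) pairs, one per input
    (target task data, general task data, synthesized word strings).
    """
    counts = Counter()
    for sentences, weight in sources:
        for words in sentences:
            for k in range(1, n + 1):
                for i in range(len(words) - k + 1):
                    counts[tuple(words[i:i + k])] += weight
    return counts

# Hypothetical weights: target data counts more than general or synthesized data.
sources = [
    ([["Yokohama Station", "to"]], 3.0),       # target task language data
    ([["from", "Narita Airport"]], 1.0),       # general task language data
    ([["from", "Yokohama Station"]], 0.5),     # synthesized word strings
]
counts = weighted_ngram_counts(sources, n=2)
```

The resulting fractional counts would then be passed to a smoothing method. The sketch stops before that step.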
[0112]
Further, the language model generation means 105 performs smoothing using, for example, the Katz back-off smoothing method and estimates the parameters of the language model, thereby generating a task-adapted language model (step S211), and this processing routine ends.
[0113]
Since word strings including words characteristic of the target task have been added to the language model learning data obtained in this way, the prediction accuracy of the language model for the target task is improved.
[0114]
Therefore, a highly accurate language model for speech recognition can be estimated from a large amount of data (general task language data 102) including tasks other than the target and a small amount of data related to the target task (target task language data 101).
Then, a task-adapted language model is generated (step S211), and the processing routine of FIG. 2 ends.
[0115]
The language model obtained as described above can be applied not only to speech recognition but also to character recognition that requires language processing and natural language text processing.
[0116]
Further, the speech recognition language model learning apparatus configured as shown in FIG. 1 can be recorded on a recording medium as a program.
[0117]
That is, a language model learning program for speech recognition can be realized by software comprising a similar word pair extraction function that performs the same processing as the similar word pair extraction unit 103 in FIG. 1, a similar word string synthesis function that performs the same processing as the similar word string synthesis unit 104, and a language model generation function that performs the same processing as the language model generation unit 105.
[0118]
Embodiment 2.
In the first embodiment, each text data from the target task language data 101 and the general task language data 102 is used as it is, but classed text data may be used.
[0119]
FIG. 3 is a block diagram schematically showing a language model learning apparatus for a speech recognition apparatus according to Embodiment 2 of the present invention. Components similar to those described above (see FIG. 1) are denoted by the same reference numerals, or by the same reference numerals followed by “A”, and their detailed description is omitted.
[0120]
In FIG. 3, reference numeral 301 denotes target task word classifying means, which is inserted between the target task language data 101 and the language model generating means 105A.
Reference numeral 302 denotes general task word classifying means, which is inserted between the general task language data 102 and the language model generating means 105A.
[0121]
The characteristic feature in this case is that the target task word classifying means 301 and the general task word classifying means 302 are provided to class the words of the text corpora of the target task and the general task. This reduces the number of language model parameters to be estimated, so that high recognition accuracy is possible even when only a small amount of target-task data is available for language model learning.
[0122]
In the following, the functions of the respective means 301 and 302 in FIG. 3 will be specifically described in relation to various models and various data.
The word class definition data (not shown) describes, for example, as described above (see FIG. 20), the word w, the class c to which the word w belongs, and the probability P (w | c). The word class definition data as shown in FIG. 20 may be created manually or may be computed from learning data.
[0123]
In accordance with the word class definition data, the target task word classifying unit 301 sequentially replaces with classes those words of the input target task language data 101 that are defined in the class definition, and outputs the result to the language model generating unit 105A.
[0124]
In accordance with the word class definition data, the general task word classifying means 302 sequentially replaces with classes those words of the input general task language data 102 that are defined in the class definition, and outputs the result to the language model generating means 105A.
[0125]
Next, the language model learning procedure by task adaptation based on the second embodiment of the present invention shown in FIG. 3 will be described more specifically with reference to the flowchart of FIG.
[0126]
In FIG. 4, steps S401 to S403 are processes executed by the target task word classifying unit 301 and the general task word classifying unit 302.
[0127]
Steps S404 to S406 are processes executed by the language model generation unit 105A, and correspond to steps S209 to S211 described above (see FIG. 2), respectively.
[0128]
First, the target task word classifying means 301 and the general task word classifying means 302 each read word class definition data (not shown) (step S401).
[0129]
Further, the target task word classifying means 301 reads the target task language data 101, generates a text in which words are replaced with classes for the words defined in the word class definition, and outputs this (step S402).
[0130]
Similarly, the general task word classifying means 302 reads the general task language data 102, generates text in which words are replaced with classes for the words defined in the word class definition, and outputs this (step S403).
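The classing in steps S401 to S403 can be sketched as a lookup that replaces each defined word with its class and leaves other words unchanged. The class labels, the word-to-class table, and the function name below are hypothetical. A full word class definition would also carry P(w | c), which this sketch omits.

```python
def classify_words(sentences, word_to_class):
    """Replace every word defined in the class definition with its class label."""
    return [[word_to_class.get(w, w) for w in words] for words in sentences]

# Hypothetical word class definition (S401).
WORD_TO_CLASS = {
    "Yokohama Station": "<PLACE>",
    "Narita Airport": "<PLACE>",
}

# S402: target task text, S403: general task text.
target_classed = classify_words([["Yokohama Station", "to"]], WORD_TO_CLASS)
general_classed = classify_words([["from", "Narita Airport"]], WORD_TO_CLASS)
```

Because both place names map to the same class here, counts collected from the general task text also support the target task word, which is how classing reduces the number of parameters to be estimated.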
[0131]
Next, the language model generation unit 105A first reads the weight parameters from the weight parameter storage unit (not shown) (step S404). It then reads the learning text, which consists of word strings including classes, from the target task word classifying unit 301 and the general task word classifying unit 302, and cumulatively calculates the frequencies of words and word strings, multiplying each by the weight parameter given for its source (step S405).
[0132]
Here, in the case of the class N-gram language model, the frequency is calculated for class strings of n words or less, as described above.
Finally, the language model generation unit 105A smooths the calculated frequencies, estimates the parameters of the language model, generates a task-adapted class language model (step S406), and ends the processing routine of FIG. 4.
[0133]
A class language model is obtained from the above processing procedure and predefined word class definition data (not shown).
Thus, it is possible to estimate a highly accurate language model for speech recognition from a large amount of data including general tasks (general task language data 102) and a small amount of data related to the target task (target task language data 101).
[0134]
The language model thus obtained is applicable not only to speech recognition but also to character recognition that requires language processing and natural language text processing.
[0135]
The language model learning device for speech recognition shown in FIG. 3 can also be recorded on a recording medium as a program.
[0136]
That is, a language model learning program for speech recognition can be realized by software comprising a target word classifying function that performs the same processing as the target task word classifying means 301 in FIG. 3, a general word classifying function that performs the same processing as the general task word classifying means 302, and a language model generation function that performs the same processing as the language model generating means 105A.
[0137]
Embodiment 3.
In the second embodiment, only the language model generation unit 105A is used. However, a similar word pair extraction unit and a similar word string synthesis unit like those in FIG. 1 (first embodiment) may be used in combination with it.
[0138]
FIG. 5 is a block diagram schematically showing a language model learning apparatus for a speech recognition apparatus according to Embodiment 3 of the present invention. Components similar to those described above (see FIGS. 1 and 3) are denoted by the same reference numerals, or by the same reference numerals followed by “B”, and their detailed description is omitted.
[0139]
The characteristic feature in this case is that the target task word classifying means 301 and the general task word classifying means 302, which follow a single class definition, are provided to class words and reduce the number of language model parameters, and that the similar word pair extracting unit 103B and the similar word string synthesizing unit 104B are also provided. As a result, high recognition accuracy is possible even when only a small amount of target-task data is available for constructing the language model.
[0140]
Next, a language model learning procedure by task adaptation based on the third embodiment of the present invention shown in FIG. 5 will be described more specifically with reference to the flowchart of FIG.
[0141]
In FIG. 6, steps S601 to S603 correspond to steps S401 to S403 described above (see FIG. 4), respectively, and steps S604 to S614 correspond to steps S201 to S211 described above (see FIG. 2), respectively.
[0142]
First, the target task word classifying means 301 and the general task word classifying means 302 each read word class definition data (not shown) (step S601).
[0143]
The target task word classifying unit 301 reads the target task language data 101, and generates and outputs text in which words are replaced with classes for words defined in the word class definition (step S602).
[0144]
Further, the general task word classifying means 302 reads the general task language data 102, and generates and outputs text in which words are replaced with classes for words defined in the word class definition (step S603).
[0145]
The similar word pair extraction unit 103B receives, from the target task word classifying unit 301 and the general task word classifying unit 302, the class cT included in the target task language data and the class cG (different from the class cT) included in the general task language data, and creates and stores a list of word class pairs (cT, cG) consisting of their combinations (step S604).
[0146]
The similar word pair extraction unit 103B also obtains the distance d (cT, cG) between the word class pair for the class cT included in the target task language data and the class cG (different from the class cT) included in the general task language data (step S605), and determines whether or not it is smaller than a predetermined threshold thc (step S606).
[0147]
If it is determined in step S606 that d (cT, cG) ≧ thc (that is, No), the process returns to step S605. If it is determined that d (cT, cG) < thc (that is, Yes), the class pair (cT, cG) at that time is output to the similar word string synthesizing means 104B as a similar word pair (step S606).
[0148]
The similar word string synthesizing means 104B reads the learning text data divided into classes from the target task word classifying means 301 and the general task word classifying means 302, divides it into class strings of length n or less, and stores them (step S607).
[0149]
Further, from the class strings read from the word classifying means 301 and 302, the class strings “... cG ...” containing the general-task class cG of the class pairs (cT, cG) selected by the similar word pair extracting means 103B are extracted (step S608).
[0150]
Further, the similar word string synthesizing unit 104B refers to the class strings read and stored from the word classifying units 301 and 302, and determines whether the class string “... cT ...”, obtained by replacing the general-task class cG with the target-task class cT, exists in the target task language data 101 or the general task language data 102 (step S609).
[0151]
If it is determined in step S609 that the class string “... cT ...” exists in the language data 101 or 102 (that is, Yes), the process returns to step S608. If it is determined that the class string does not exist (that is, No), the class string “... cT ...” is synthesized and output as learning text data (step S610).
[0152]
Next, it is determined whether or not the processing has been completed for all similar class pairs (step S611). If it is determined that the processing has not been completed (that is, No), the process returns to step S605. If it is determined that the processing has been completed (that is, Yes), the process proceeds to the processing steps (S612 to S614) executed by the language model generation unit 105B.
Thus, the above processing is repeated for all similar word class pairs (cT, cG).
[0153]
The language model generation unit 105B first reads the weight parameters from the weight parameter storage unit (not shown) (step S612), and then reads the learning text divided into words from the target task language data 101, the general task language data 102, and the similar word string synthesis unit 104B, weighting it by the weight parameters (step S613).
[0154]
Further, by performing frequency smoothing, the parameters of the language model are estimated (step S614), and the processing routine of FIG. 6 is terminated.
A class language model with task adaptation is obtained by the above processing procedure and word class definition data (not shown) defined in advance.
[0155]
Thus, a highly accurate language model for speech recognition can be learned from a large amount of data including tasks other than the target and a small amount of data related to the target task.
[0156]
The language model thus obtained can be applied not only to speech recognition but also to character recognition that requires language processing, text processing using natural language, and the like.
[0157]
The language model learning apparatus for speech recognition shown in FIG. 5 can also be recorded on a recording medium as a program.
[0158]
That is, a language model learning program for speech recognition can be realized by software comprising a target word classifying function that performs the same processing as the target task word classifying means 301 in FIG. 5, a general word classifying function that performs the same processing as the general task word classifying means 302, a similar word pair extraction function that performs the same processing as the similar word pair extraction means 103B, a similar word string synthesis function that performs the same processing as the similar word string synthesis means 104B, and a language model generation function that performs the same processing as the language model generation means 105B.
[0159]
Embodiment 4.
In the first to third embodiments, the language model generation unit 105, 105A, or 105B is used to generate the task-adapted language model. However, an initial language model created in advance and similar word probability correcting means for smoothing word appearance probabilities may be used instead.
[0160]
FIG. 7 is a block diagram schematically showing a language model learning apparatus for a speech recognition apparatus according to Embodiment 4 of the present invention. Components similar to those described above (see FIG. 1) are denoted by the same reference numerals, and their detailed description is omitted.
[0161]
In FIG. 7, reference numeral 701 denotes an initial language model, and reference numeral 702 denotes similar word probability correction means.
The similar word probability correcting unit 702 generates a task-adapted statistical language model based on the similar word pair from the similar word pair extracting unit 103 and the prior language model from the initial language model 701.
[0162]
The characteristic feature in this case is that the similar word pair extraction unit 103 and the similar word probability correction unit 702 are provided so that, for a word specific to the target task, the language model reflects the properties of the similar word appearing in the text data of the general task. High recognition accuracy is therefore possible even when only a small amount of target-task data is available for constructing the language model.
[0163]
In the following, the function of each means in FIG. 7 will be specifically described with reference to various models and various data.
The initial language model 701 is a statistical language model whose parameters are estimated by a known conventional method or the method of the first embodiment.
[0164]
The similar word probability correcting unit 702 reads the initial language model 701, reads the similar word pairs between the target task and the general task from the similar word pair extracting unit 103, and corrects the conditional appearance probabilities of word strings including words of the target task.
In the word string appearance probability correction processing at this time, conditional appearance probabilities of word strings including words of similar general tasks are used.
[0165]
The probability assigned by the similar word probability correcting means 702 is a part of the probability that was removed (discounted) from the conditional probabilities of word strings that appeared, and set aside as the appearance probability of word strings that did not appear in the learning text data. That is, the conditional appearance probabilities of word strings existing in the learning text data are kept equal to those of the initial language model 701.
[0166]
Next, the language model learning procedure by task adaptation based on the fourth embodiment of the present invention shown in FIG. 7 will be described more specifically with reference to the flowchart of FIG.
[0167]
In FIG. 8, steps S801 to S803 and S805 correspond to steps S201 to S203 and S208 described above (see FIG. 2), respectively.
Steps S806 to S812 are processes executed by the similar word probability correcting unit 702.
[0168]
First, the similar word pair extraction unit 103 reads the learning text divided into words from the target task language data 101 and the general task language data 102 (step S801), and the word wT and the general task included in the target task language data. A distance d (wT, wG) is obtained for a word wG (different from wT) included in the language data (step S802).
[0169]
Subsequently, it is determined whether or not the distance d (wT, wG) between words is smaller than the threshold value th (step S803). If it is determined that d (wT, wG) ≧ th (that is, No), the process returns to step S802. If it is determined that d (wT, wG) < th (that is, Yes), the word pair (wT, wG) at that time is added to the similar word pairs (step S804).
[0170]
Thereafter, it is determined whether or not the above processing has been completed for all word pairs (step S805). If it is determined that the processing has not been completed (that is, No), the process returns to step S802. If it is determined that the processing has been completed (that is, Yes), the process proceeds to the next processing step S806.
As a result, calculations for all word pairs are sequentially performed, and a list of created similar word pairs (wT, wG) is output to the similar word probability correcting means 702.
[0171]
The similar word probability correcting unit 702 first reads the initial language model 701 (step S806). Then, for each similar word pair (wT, wG) read from the similar word pair extracting unit 103, it extracts, from among the conditional probabilities defined in the initial language model 701, the conditional probabilities PwG (w_n | w_1, ..., w_{n-1}) that include the general-task word wG (step S807).
[0172]
Next, for each extracted conditional probability, it is determined whether or not the conditional probability PwT(w_n | w_1, ..., w_{n-1}), obtained by replacing the general task word wG with the target task word wT, is defined in the initial language model 701 (step S808).
[0173]
If it is determined in step S808 that the conditional probability PwT(w_n | w_1, ..., w_{n-1}) is not defined in the initial language model 701 (that is, No), the conditional probability is corrected by assigning to it a part of the probability excluded for unknown word strings (step S809), and the process proceeds to the next determination step S810.
[0174]
On the other hand, if the conditional probability PwG is defined and it is determined in step S808 that the conditional probability PwT is defined (that is, Yes), the process immediately proceeds to the next determination step S810.
[0175]
At this time, the probability assigned in step S809 is, for example, the minimum value of the conditional probabilities having the same word history (w_1, ..., w_{n-1}).
[0176]
Next, it is determined whether or not there is another conditional probability of a word string including the general word wG (step S810). If it is determined that such a word string exists (that is, Yes), the process returns to step S808.
[0177]
On the other hand, if it is determined in step S810 that there is no other conditional probability including the general word wG (that is, No), it is determined whether or not the above processing has been completed for all word pairs (wT, wG) (step S811).
[0178]
If it is determined in step S811 that the processing of all word pairs has not been completed (that is, No), the process returns to step S807. If it is determined that the processing has been completed (that is, Yes), the process proceeds to the next processing step S812.
[0179]
As a result, the above processing is executed for all word strings including the general word wG and for all word pairs (wT, wG).
Finally, the sum of the probabilities removed from the language model for unknown word strings is normalized so that the probabilities of the language model sum to “1” (step S812), and the processing routine of FIG. 8 ends.
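The correction of steps S807 to S812 can be sketched, under stated assumptions, as follows. The dictionary layout of the model (word history mapped to a distribution over the next word, with the unassigned remainder treated as the mass reserved for unknown word strings), the function names, and the fallback used when the replaced history is itself unseen are all assumptions of this sketch and are not taken from the specification.

```python
from typing import Dict, List, Tuple

History = Tuple[str, ...]
Model = Dict[History, Dict[str, float]]


def correct_similar_word_probs(model: Model, pairs: List[Tuple[str, str]]) -> Model:
    """Give undefined target-task n-grams the minimum probability of their history."""
    corrected = {h: dict(dist) for h, dist in model.items()}
    for w_t, w_g in pairs:
        # Step S807: collect n-grams (history + predicted word) that contain wG.
        hits = [(h, w) for h, dist in model.items() for w in dist if w_g in h + (w,)]
        for hist, word in hits:
            # Step S808: replace wG by wT and check whether the result is defined.
            new_hist = tuple(w_t if x == w_g else x for x in hist)
            new_word = w_t if word == w_g else word
            dist = corrected.setdefault(new_hist, {})
            if new_word in dist:
                continue
            # Step S809: assign the minimum probability seen for that history,
            # limited by the mass still reserved for unknown word strings.
            # Assumption: if the new history is itself unseen, fall back to the
            # minimum under the original history.
            pool = dist or model[hist]
            reserved = max(0.0, 1.0 - sum(dist.values()))
            dist[new_word] = min(min(pool.values()), reserved)
    return corrected


def backoff_mass(model: Model) -> Dict[History, float]:
    """Step S812: mass left for unknown words so that each history sums to 1."""
    return {h: 1.0 - sum(dist.values()) for h, dist in model.items()}


initial = {("go", "to"): {"tokyo": 0.5, "school": 0.125}}
adapted = correct_similar_word_probs(initial, [("kyoto", "tokyo")])
remaining = backoff_mass(adapted)
```

In this toy example the word string "go to kyoto" is undefined in the initial model, so it receives the minimum probability found under the history ("go", "to"), and the mass reserved for unknown word strings shrinks by the same amount.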
[0180]
If a conditional probability is not defined, the probability given by a simple language model is usually used.
For example, in an N-gram language model with Katz's back-off, a lower-order (N-1)-gram language model is referenced and a small probability is assigned. Because that probability is of low accuracy, when a word string containing a similar word exists for the target task, a probability larger than that value is assigned.
[0181]
Other conditional probabilities PwG including the general word wG are processed in the same way via step S810, and the processes in steps S806 to S810 are executed for all similar word pairs (wT, wG) via step S811.
[0182]
As described above, by using the similar word probability correcting unit 702, words having similar properties in the general task and the target task are smoothed using the appearance probabilities of the general task words, so that a more accurate language model for speech recognition can be estimated.
[0183]
Note that the language model obtained in this way can also be applied to character recognition requiring language processing, natural language text processing, and the like, as described above.
[0184]
The language model learning device for speech recognition shown in FIG. 7 can also be recorded on a recording medium as a program.
That is, a language model learning program for speech recognition can be realized by software composed of a similar word pair extraction function that performs the same process as the similar word pair extraction unit 103 in FIG. 7 and a similar word probability correction function that performs the same process as the similar word probability correction unit 702.
[0185]
Embodiment 5
In the fourth embodiment, the text data from the target task language data 101 and the general task language data 102 are used as they are. However, text data classified into word classes as in the third embodiment (see FIG. 5) may be used instead.
[0186]
FIG. 9 is a block diagram schematically showing a language model learning apparatus for a speech recognition apparatus according to Embodiment 5 of the present invention. Components similar to those described above (see FIGS. 5 and 7) are denoted by the same reference numerals, and their detailed description is omitted.
[0187]
In FIG. 9, reference numeral 901 denotes an initial class language model, which is connected to the similar word probability correcting means 702 instead of the initial language model 701 described above (see FIG. 7).
[0188]
A characteristic function in this case is that similar word pair extraction means 103B, target task word classifying means 301, general task word classifying means 302 and similar word probability correcting means 702 are provided, and that, by reflecting in a class specific to the target task the properties of a similar class appearing in the text data of general tasks, a class language model with higher recognition accuracy can be generated from the initial class language model 901 even if the amount of target task data is small.
[0189]
In the following, the function of each means in FIG. 9 will be specifically described with reference to various models and various data.
The initial class language model 901 includes a statistical class language model whose parameters are estimated by a well-known conventional method or the methods of the second and third embodiments.
[0190]
The probability assigned by the similar word probability correction means 702 is a part of the probability that was excluded (discounted) from the conditional probabilities of the word class sequences that appeared, in order to cover word class sequences that did not appear in the learning text data. The conditional appearance probabilities of the word classes included in the learning text data are stored.
[0191]
For example, the conditional probability P(c_n | c_1, ..., c_{n-1}) is assigned a probability that is greater than the original conditional probability of the word class sequence.
[0192]
Next, the language model learning procedure by task adaptation based on the fifth embodiment of the present invention shown in FIG. 9 will be described more specifically with reference to the flowchart of FIG. 10.
[0193]
In FIG. 10, steps S1001 to S1003 correspond to steps S601 to S603, respectively (see FIG. 6), and steps S1004 to S1015 correspond to steps S801 to S812, respectively (see FIG. 8).
[0194]
First, the target task word classifying means 301 and the general task word classifying means 302 each read word class definition data (not shown) (step S1001).
[0195]
The target task word classifying unit 301 reads the target task language data 101, and generates and outputs text in which words are replaced with classes for words defined in the word class definition (step S1002).
[0196]
Further, the general task word classifying means 302 reads the general task language data 102, and generates and outputs text in which words are replaced with classes for words defined in the word class definition (step S1003).
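The word-to-class replacement of steps S1001 to S1003 can be illustrated with the following minimal sketch. The class labels, the sample sentences, and the function name `classify_text` are hypothetical; the specification does not define the format of the word class definition data.

```python
from typing import Dict, List


def classify_text(sentences: List[List[str]], class_def: Dict[str, str]) -> List[List[str]]:
    """Replace each word that has a class definition by its class label."""
    # Words without an entry in the class definition are left unchanged.
    return [[class_def.get(w, w) for w in sent] for sent in sentences]


# Step S1001: word class definition data (hypothetical contents).
class_def = {"tokyo": "<CITY>", "kyoto": "<CITY>", "monday": "<DAY>"}

# Steps S1002 and S1003: classify the target task and general task texts.
target_text = [["go", "to", "kyoto"]]
general_text = [["arrive", "in", "tokyo", "on", "monday"]]
classed_target = classify_text(target_text, class_def)
classed_general = classify_text(general_text, class_def)
```

Both texts are processed with the same class definition, so the class strings produced for the two tasks are directly comparable in the subsequent steps.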
[0197]
Next, the similar word pair extraction unit 103B reads the class strings through the target task word classifying unit 301 and the general task word classifying unit 302 (step S1004).
[0198]
Further, a distance d(cT, cG) is obtained between the class cT included in the target task language data and the class cG (different from cT) included in the general task language data (step S1005), and it is determined whether or not the distance d(cT, cG) is smaller than the threshold value thc (step S1006).
[0199]
If it is determined in step S1006 that d(cT, cG) ≧ thc (that is, No), the process returns to step S1005, and if it is determined that d(cT, cG) < thc (that is, Yes), the class pair (cT, cG) at that time is added to the similar class pairs (step S1007).
[0200]
Thereafter, the above process is sequentially executed for all the class pairs through the determination step S1008, and the created list of similar class pairs (cT, cG) is output to the similar word probability correcting unit 702.
[0201]
Next, the similar word probability correcting unit 702 first reads the initial class language model 901 (step S1009), and then sequentially reads the similar class pairs (cT, cG) from the similar word pair extracting unit 103B (step S1010).
[0202]
Of the conditional probabilities defined in the initial class language model 901, for each conditional probability PcG(c_n | c_1, ..., c_{n-1}) that includes the general task class cG, it is determined whether or not the conditional probability PcT(c_n | c_1, ..., c_{n-1}), obtained by replacing the general task class cG with the target task class cT, is defined in the learning data (step S1011).
[0203]
If it is determined in step S1011 that the conditional probability PcT(c_n | c_1, ..., c_{n-1}) is not defined in the initial class language model 901 (that is, No), a part of the probability excluded for unknown class sequences is assigned to correct the conditional probability (step S1012), and the process proceeds to the next determination step S1013.
[0204]
On the other hand, if the conditional probability PcG is defined and it is determined in step S1011 that the conditional probability PcT is defined (that is, Yes), the process immediately proceeds to the next determination step S1013.
[0205]
At this time, the probability assigned in step S1012 is, for example, the minimum value of the conditional probabilities having the same class history (c_1, ..., c_{n-1}).
[0206]
Thereafter, the same processing is performed for other conditional probabilities PcG including the class cG via step S1013. In addition, the processing in steps S1006 to S1010 is executed for all similar class pairs (cG, cT) via step S1014.
[0207]
Finally, the similar word probability correction unit 702 normalizes the back-off probability so that the sum of the probabilities of the class language model is 1, and generates a task-adapted class language model (step S1015). The processing routine ends.
[0208]
As described above, the similar word pair extracting unit 103B and the similar word probability correcting unit 702 are provided together with the word classifying units 301 and 302, and word classes that are similar in nature in the general task and the target task are smoothed using the appearance probabilities of the general task word classes, so that a class language model for speech recognition can be estimated with high accuracy.
[0209]
The class language model obtained in this way can also be applied to character recognition that requires language processing, text processing in natural language, and the like.
[0210]
Further, the speech recognition language model learning device shown in FIG. 9 can also be recorded on a recording medium as a program.
[0211]
That is, a language model learning program for speech recognition can be realized by software composed of a similar word pair extraction function that performs the same processing as the similar word pair extraction unit 103B in FIG. 9, a target task word classifying function that performs the same processing as the target task word classifying unit 301, a general task word classifying function that performs the same processing as the general task word classifying means 302, and a similar word probability correcting function that performs the same processing as the similar word probability correcting means 702.
[0212]
Embodiment 6
In the first embodiment, the functional configuration of the similar word pair extraction unit is not specifically mentioned, but may be configured as shown in FIG. 11, for example.
[0213]
FIG. 11 is a functional block diagram showing a specific configuration example of similar word pair extraction means 103C used in the language model learning apparatus for speech recognition according to the sixth embodiment of the present invention. Components similar to those described above are denoted by the same reference numerals, or by the same numerals followed by “C”, and their detailed description is omitted.
[0214]
In FIG. 11, 1101 is a statistical word distance calculation means, 1102 is a threshold value determination means, and 1105 is a distance calculation language model generation means in the similar word pair extraction means 103C.
[0215]
A characteristic function in this case is that the similar word pair extraction unit 103C is provided with a distance calculation language model generation unit 1105, a statistical inter-word distance calculation unit 1101 and a threshold value determination unit 1102, and that similar word pairs are determined with high accuracy by calculating the inter-word distance d(wT, wG) between the word wT of the target task and the word wG of the general task on the basis of a statistical distance measure derived from the language data.
[0216]
In the following, the function of each means in FIG. 11 will be specifically described with reference to various models and various data.
In the similar word pair extraction unit 103C, the statistical inter-word distance calculation unit 1101 reads the language model estimated by the distance calculation language model generation unit 1105, obtains the inter-word distance based on that language model for each pair of different words extracted from the target task language data 101 and the general task language data 102, and outputs each word pair together with its inter-word distance.
[0217]
The threshold determination unit 1102 sequentially reads the word pairs and their statistical inter-word distances from the statistical inter-word distance calculation unit 1101 and, when the inter-word distance is equal to or less than a certain threshold, outputs the word pair (wT, wG).
[0218]
At this time, the statistical inter-word distance calculation unit 1101 uses, for example, the Euclidean distance between conditional probabilities of the N-gram language model as the method of calculating the statistical inter-word distance between the target task word wT and the general task word wG, and obtains the statistical inter-word distance D_1(wT, wG) as shown in equation (7) below.
[0219]
[Equation 7]

[0220]
In equation (7), V is the set over which the vocabulary item (word) x of the language data ranges, and represents the entire vocabulary included in the language model.
[0221]
The statistical inter-word distance calculation means 1101 can also use the Euclidean distance based on the conditional probability of the preceding word with respect to the succeeding word, and obtain the statistical inter-word distance D_2(wT, wG) as shown in the following equation (8).
[0222]
[Equation 8]

[0223]
Moreover, not only can the above equations (7) and (8) be used individually, but also the sum of equations (7) and (8) can be used.
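Equations (7) and (8) themselves are not reproduced in this text, so the following is only a plausible sketch of a Euclidean inter-word distance over bigram conditional probabilities. It assumes a bigram table keyed by (history, word), treats missing entries as probability 0, and interprets D_1 as comparing P(x | w) and D_2 as comparing P(w | x) over the vocabulary V; these interpretations, the function names `d1` and `d2`, and the sample values are assumptions.

```python
import math
from typing import Dict, Iterable, Tuple

Bigram = Dict[Tuple[str, str], float]  # (history, word) -> P(word | history)


def d1(p: Bigram, w_t: str, w_g: str, vocab: Iterable[str]) -> float:
    """Euclidean distance between P(x | wT) and P(x | wG) over x in V."""
    return math.sqrt(sum((p.get((w_t, x), 0.0) - p.get((w_g, x), 0.0)) ** 2 for x in vocab))


def d2(p: Bigram, w_t: str, w_g: str, vocab: Iterable[str]) -> float:
    """Euclidean distance between P(wT | x) and P(wG | x) over x in V."""
    return math.sqrt(sum((p.get((x, w_t), 0.0) - p.get((x, w_g), 0.0)) ** 2 for x in vocab))


# Hypothetical bigram probabilities for two city names.
p = {
    ("kyoto", "station"): 0.5, ("tokyo", "station"): 0.5,
    ("kyoto", "tower"): 0.5, ("tokyo", "tower"): 0.25,
    ("to", "kyoto"): 0.25, ("to", "tokyo"): 0.5,
}
vocab = ["kyoto", "tokyo", "station", "tower", "to"]
dist_forward = d1(p, "kyoto", "tokyo", vocab)
dist_backward = d2(p, "kyoto", "tokyo", vocab)
dist_sum = dist_forward + dist_backward  # the two measures may also be summed
```

Words that are followed and preceded by the same words with similar probabilities yield a small distance and therefore become candidates for similar word pairs.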
[0224]
Further, the statistical inter-word distance calculation unit 1101 can use, for example, the cross-entropy with respect to the word wT, and obtain the statistical inter-word distance D_3(wT, wG) as expressed by the following equation (9).
[0225]
[Equation 9]

[0226]
Similarly to the case where the Euclidean distance is used, the conditional probability of the preceding word regarding the succeeding word can be used as shown in the following equation (10).
[0227]
[Equation 10]

[0228]
Moreover, not only can the above equations (9) and (10) be used individually, but also the sum of equations (9) and (10) can be used.
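Equations (9) and (10) are likewise not reproduced here. As a sketch only, a cross-entropy based measure could take the following form, assuming the standard definition of cross-entropy, a bigram table keyed by (history, word), and a small floor value to avoid log(0); the function name and the floor are assumptions of this sketch.

```python
import math
from typing import Dict, Iterable, Tuple

Bigram = Dict[Tuple[str, str], float]  # (history, word) -> P(word | history)


def cross_entropy_distance(
    p: Bigram, w_t: str, w_g: str, vocab: Iterable[str], floor: float = 1e-10
) -> float:
    """Compute -sum_x P(x | wT) * log P(x | wG), flooring P(x | wG) to avoid log(0)."""
    total = 0.0
    for x in vocab:
        p_t = p.get((w_t, x), 0.0)
        if p_t > 0.0:
            total -= p_t * math.log(max(p.get((w_g, x), 0.0), floor))
    return total


# Hypothetical distributions: "b" matches "a" exactly, "c" is skewed.
p = {
    ("a", "x"): 0.5, ("a", "y"): 0.5,
    ("b", "x"): 0.5, ("b", "y"): 0.5,
    ("c", "x"): 0.9, ("c", "y"): 0.1,
}
same = cross_entropy_distance(p, "a", "b", ["x", "y"])
skewed = cross_entropy_distance(p, "a", "c", ["x", "y"])
```

Note that cross-entropy does not reach zero for identical distributions (it equals the entropy of P(x | wT)), so the threshold would have to be chosen with that offset in mind.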
[0229]
Further, the statistical measure and linguistic information can be used in combination.
For example, in the case where a word represents a morpheme, if the parts of speech of two words are not the same, the distance can be made infinite and excluded from similar word candidates.
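The combination of a statistical measure with part-of-speech information can be sketched as a simple gate placed in front of any distance function. The part-of-speech table, the tag names, and the function name `pos_gated_distance` are hypothetical.

```python
import math
from typing import Callable, Dict


def pos_gated_distance(
    w_t: str,
    w_g: str,
    pos: Dict[str, str],
    stat_distance: Callable[[str, str], float],
) -> float:
    """Return infinity when the parts of speech differ, else the statistical distance."""
    # Assumption: a word missing from the table is treated as having no tag,
    # so two untagged words are compared as if their tags matched.
    if pos.get(w_t) != pos.get(w_g):
        return math.inf
    return stat_distance(w_t, w_g)


pos = {"kyoto": "NOUN", "tokyo": "NOUN", "go": "VERB"}
constant = lambda a, b: 0.25  # stand-in for a statistical distance
noun_pair = pos_gated_distance("kyoto", "tokyo", pos, constant)
mixed_pair = pos_gated_distance("kyoto", "go", pos, constant)
```

Because an infinite distance never falls below a finite threshold, pairs with different parts of speech are excluded from the similar word candidates without any change to the threshold determination.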
[0230]
Next, the operation of similar word pair extraction section 103C in task adaptation based on Embodiment 6 of the present invention shown in FIG. 11 will be described more specifically with reference to the flowchart of FIG. 12.
In FIG. 12, steps S1203 to S1207 correspond to steps S201 to S203, S207, and S208 described above (see FIG. 2), respectively.
[0231]
First, the distance calculation language model generation unit 1105 reads the target task language data 101 and the general task language data 102 (step S1201), and performs language model parameter estimation from the input text data (step S1202).
[0232]
Further, the statistical inter-word distance calculating unit 1101 creates word pairs (wT, wG) consisting of every combination of a word wT included in the target task and a word wG included in the general task (step S1203), and calculates the statistical distance d(wT, wG) on the language model estimated by the distance calculation language model generation means 1105 (step S1204).
[0233]
Subsequently, the threshold value determination unit 1102 compares the distance d(wT, wG) of the word pair (wT, wG) obtained from the statistical inter-word distance calculation unit 1101 with the threshold value th, and determines whether or not the distance d(wT, wG) is less than the threshold value th (step S1205).
[0234]
If it is determined in step S1205 that d(wT, wG) ≧ th (that is, No), the process returns to step S1204, and if it is determined that d(wT, wG) < th (that is, Yes), the word pair (wT, wG) at that time is output as a similar word pair (step S1206).
[0235]
Thereafter, the above process is performed for all word pairs (wT, wG) via the end determination step S1207.
[0236]
In this way, the similar word pair extraction unit 103C can determine a highly accurate similar word pair by estimating the language model and using the distance measure based on the statistic.
[0237]
The language model obtained in this way can also be applied to character recognition that requires language processing, natural language text processing, and the like.
Further, the function of the similar word pair extraction unit 103C in FIG. 11 can be recorded as a program on a recording medium.
[0238]
That is, the similar word pair extraction program of the language model learning apparatus for speech recognition can be realized by software composed of a language model generation function that performs the same processing as the distance calculation language model generation means 1105 in FIG. 11, a statistical inter-word distance calculation function that performs the same processing as the statistical inter-word distance calculation means 1101, and a threshold value determination function that performs the same processing as the threshold value determination unit 1102.
[0239]
In FIG. 11, the distance calculation language model generation means 1105 is used, but a distance calculation language model 1301 may be used instead, as shown in FIG. 13.
In FIG. 13, a distance calculation language model 1301 in the similar word pair extraction unit 103D is the same as the initial language model 701 described above (see FIG. 7), and is created in advance.
[0240]
Here, the input data to the similar word pair extraction unit 103C is a word, but a word class may be used as shown in FIG. 14 instead of a word.
In FIG. 14, a distance calculation language model generation unit 1105E and a statistical inter-word distance calculation unit 1101E in the similar word pair extraction unit 103E fetch the word classes from the respective word classifying units 301 and 302.
Also in this case, class pairs can be extracted as described above.
[0241]
Furthermore, although the distance calculation language model generation means 1105E is used in FIG. 14, a distance calculation class language model 1501 may be used instead, as shown in FIG. 15.
In FIG. 15, a distance calculation class language model 1501 in the similar word pair extraction unit 103F is the same as the initial class language model 901 described above (see FIG. 9), and is created in advance.
[0242]
Embodiment 7
In the first to sixth embodiments, attention is paid only to the language model learning device and the speech recognition device is not specifically mentioned. The speech recognition device may be configured, for example, as shown in FIG. 16.
[0243]
FIG. 16 is a block diagram schematically showing a speech recognition apparatus using a language model according to the seventh embodiment of the present invention, and shows the case of using a language model generated by the conventional method or by the method described in the first, fourth and sixth embodiments.
[0244]
In FIG. 16, 1601 is an acoustic feature extracting means, 1602 is an acoustic model, 1603 is an acoustic matching means, 1604 is a word dictionary, 1605 is a language model, and 1606 is a language matching means.
[0245]
The language model 1605 is constructed using the language model learning apparatus and method described in the first, fourth, and sixth embodiments.
A characteristic function in this case is that language collation means 1606 using the language model 1605 is provided together with the means 1601 to 1604, so that high-accuracy speech recognition is possible even when the amount of data of the target task is small.
[0246]
In the following, the function of each means in FIG. 16 will be specifically described with reference to various models and various data.
First, the acoustic feature extraction unit 1601 performs A/D conversion on the input speech waveform, cuts it out for each analysis time frame, and converts it into a parameter vector, such as a mel cepstrum, that represents the speech features well.
[0247]
The acoustic model 1602 represents the properties of acoustic feature vectors in speech recognition units (phonemes, words, etc.) by using probability distributions, state transitions, and the like, using, for example, an HMM.
[0248]
The acoustic matching unit 1603 collates the phoneme acoustic feature vector obtained from the acoustic feature extraction unit 1601 with the acoustic model 1602 and outputs a score representing the degree of matching.
[0249]
The word dictionary 1604 describes the correspondence between the arrangement of the acoustic models 1602 and words that are linguistic units.
The language model 1605 is obtained from a language model learning device and describes connection information of the words to be recognized. For example, using a word N-gram language model, transitions between words are expressed as an (N-1)-th order Markov process.
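For the simplest case N = 2, the Markov property means that the probability of a word string is a product of bigram probabilities, each conditioned only on the preceding word. The sketch below illustrates this; the sentence-start symbol `<s>`, the constant probability for unseen bigrams, and the sample values are assumptions and stand in for whatever smoothing the actual language model uses.

```python
import math
from typing import Dict, List, Tuple


def sentence_log_prob(
    words: List[str], bigram: Dict[Tuple[str, str], float], unseen: float = 1e-6
) -> float:
    """log P(w_1..w_M) = sum_i log P(w_i | w_{i-1}), starting from <s>."""
    history = "<s>"
    total = 0.0
    for w in words:
        # Only the immediately preceding word is used as history (N - 1 = 1).
        total += math.log(bigram.get((history, w), unseen))
        history = w
    return total


bigram = {("<s>", "go"): 0.5, ("go", "to"): 0.5, ("to", "kyoto"): 0.25}
log_p = sentence_log_prob(["go", "to", "kyoto"], bigram)
```

Working in the log domain keeps the accumulated score numerically stable for long word strings.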
[0250]
The language matching unit 1606 receives the matching score between the acoustic feature quantity and the acoustic model from the acoustic matching unit 1603, refers to the word dictionary 1604 and the language model 1605, and takes the word string with the highest score among the word strings to be recognized as the recognition result.
[0251]
Next, the procedure of speech recognition based on the seventh embodiment of the present invention shown in FIG. 16 will be described more specifically with reference to the flowchart of FIG. 17.
First, the speech recognition apparatus shown in FIG. 16 reads the acoustic model 1602 and the word dictionary 1604 prepared in advance, together with the language model 1605 generated by the first, fourth, and sixth embodiments (see FIGS. 1, 2, 7, 8, and 11 to 13) (step S1701).
[0252]
The acoustic feature extraction unit 1601 performs A/D conversion on the input speech to be recognized, reads a speech frame cut out at a fixed time interval (step S1702), and applies a signal processing method to that frame to extract an acoustic feature vector, such as a mel cepstrum, that represents the speech features well (step S1703).
[0253]
Subsequently, the acoustic matching unit 1603 matches the acoustic feature vector obtained in step S1703 with the acoustic model 1602 to obtain an acoustic matching score (step S1704).
[0254]
Next, the language collating unit 1606 refers to the word dictionary 1604 and the language model 1605 and accumulates the acoustic collation score for the word to be recognized (step S1705).
[0255]
The language collation unit 1606 determines whether or not the final frame of the target speech has been reached while executing the collation process for each frame (step S1706), and if it is determined that the final frame has not been reached (that is, No), the process returns to step S1702.
[0256]
If it is determined in step S1706 that the final frame of the target speech has been reached (that is, Yes), collation is considered complete, the candidate with the best score at that time is output as the recognition result (step S1707), and the processing routine of FIG. 17 ends.
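The frame loop of steps S1702 to S1707 can be illustrated with a heavily simplified sketch. It assumes that the candidate word strings are fixed in advance and that each frame supplies an acoustic log score per candidate; a real decoder expands hypotheses through the word dictionary and is far more involved. The function name, the weighting parameter, and all scores are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[str, ...]


def pick_best_hypothesis(
    frame_scores: List[Dict[Hypothesis, float]],
    lm_log_prob: Callable[[Hypothesis], float],
    lm_weight: float = 1.0,
) -> Hypothesis:
    """Accumulate per-frame acoustic scores, add the language score, return the best."""
    totals: Dict[Hypothesis, float] = {}
    for frame in frame_scores:  # loop until the final frame (step S1706)
        for hyp, score in frame.items():
            totals[hyp] = totals.get(hyp, 0.0) + score  # accumulation (step S1705)
    # Step S1707: output the candidate with the best combined score.
    return max(totals, key=lambda h: totals[h] + lm_weight * lm_log_prob(h))


kyoto = ("go", "to", "kyoto")
tokyo = ("go", "to", "tokyo")
frames = [{kyoto: -1.0, tokyo: -0.5}, {kyoto: -1.0, tokyo: -1.0}]
lm_scores = {kyoto: -1.0, tokyo: -3.0}  # a task-adapted model favouring "kyoto"
best = pick_best_hypothesis(frames, lambda h: lm_scores[h])
```

In this toy case the acoustic scores alone slightly favour the general task word, and the language score of the task-adapted model tips the decision toward the target task word.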
[0257]
As described above, the language model 1605 makes it possible to use a highly accurate language model constructed from a large amount of data including tasks other than the target and a small amount of data related to the target task.
[0258]
Embodiment 8
In the seventh embodiment, the language model generated in the first, fourth, and sixth embodiments is used. However, the class language model generated in the second, third, fifth, and sixth embodiments may be used instead.
[0259]
FIG. 18 is a block diagram schematically showing a speech recognition apparatus using a language model according to Embodiment 8 of the present invention, and shows the case of using a language model generated by the apparatus and method described in Embodiments 2, 3, 5, and 6 above.
[0260]
In FIG. 18, the units 1601 to 1604 are the same as those described above (see FIG. 16), and the language collating unit 1606A corresponds to the language collating unit 1606 described above.
Reference numeral 1801 denotes a class definition representing a correspondence relationship between a class and a word in the language model, and 1802 denotes a class language model that gives an appearance probability of the class.
[0261]
The class language model 1802 is constructed using the apparatus and method described in the second, third, fifth, and sixth embodiments (see FIGS. 3 to 6, 9, 10, 14, and 15).
[0262]
A characteristic function in this case is that, by providing language collation means 1606A using the class language model 1802, high-accuracy speech recognition is possible even when the amount of target task data used for learning is small.
[0263]
Next, the speech recognition procedure based on the eighth embodiment of the present invention shown in FIG. 18 will be described more specifically with reference to the flowchart of FIG. 19.
In FIG. 19, steps S1901 to S1907 respectively correspond to steps S1701 to S1707 described above (see FIG. 17).
[0264]
First, the class language model 1802 generated by the second, third, fifth, and sixth embodiments is read together with the acoustic model 1602, the word dictionary 1604, and the class definition 1801 prepared in advance (step S1901).
[0265]
The acoustic feature extraction unit 1601 performs A/D conversion on the input speech to be recognized, reads a speech frame cut out at a fixed time interval (step S1902), and applies a signal processing method to that frame to extract an acoustic feature vector, such as a mel cepstrum, that represents the speech features well (step S1903).
[0266]
Subsequently, the acoustic matching unit 1603 compares the obtained acoustic feature vector with the acoustic model 1602 to obtain an acoustic matching score (step S1904).
[0267]
Next, the language collating unit 1606A refers to the word dictionary 1604, the class definition 1801, and the class language model 1802, and accumulates the acoustic collation score for the word to be recognized (step S1905).
[0268]
Thereafter, the above collation processing is executed for each frame through step S1906. When the final frame of the target speech is reached and collation is complete, the candidate with the best score is output as the recognition result (step S1907), and the processing routine of FIG. 19 ends.
[0269]
As described above, by using the class language model 1802, highly accurate speech recognition can be realized from a large amount of data including tasks other than the target and a small amount of data related to the target task.
[0270]
【Effects of the Invention】
As described above, according to claim 1 of the present invention, the apparatus comprises target task language data in which text data of the target task is accumulated, general task language data in which text data of general tasks including tasks other than the target task is accumulated, and similar word pair extracting means, similar word string synthesizing means, and language model generating means for reading text data for language model learning from the target task language data and the general task language data and constructing a task-adapted language model. The similar word pair extracting means reads each text data from the target task language data and the general task language data, and extracts similar word pairs from combinations of a word included in the target task text data and a word included in the general task text data. The similar word string synthesizing means reads each text data, reads the similar word pairs from the similar word pair extracting means, and synthesizes and outputs word strings that include words of the target task and are not included in the language data. The language model generating means reads each text data, reads the word strings from the similar word string synthesizing means, and generates the task-adapted language model by obtaining the word string statistics with a weight applied to each text data. As a result, a language model learning device with improved recognition accuracy can be obtained.
[0271]
According to claim 2 of the present invention, the apparatus comprises target task language data in which text data of the target task is accumulated, general task language data in which text data of general tasks including tasks other than the target task is accumulated, and target task word classifying means, general task word classifying means, and language model generating means for constructing a task-adapted language model from the target task language data and the general task language data. The target task word classifying means reads the text data of the target task from the target task language data, replaces words with the classes indicated in the class definition, and outputs first text data classified for language model learning. The general task word classifying means reads the text data of the general task from the general task language data, replaces words with the classes indicated in the class definition, and outputs second text data classified for language model learning. The language model generating means reads the first and second text data and generates the language model by obtaining the word string statistics with a weight applied to each text data. As a result, a language model learning device with improved recognition accuracy can be obtained.
[0272]
According to claim 3 of the present invention, the apparatus comprises target task language data in which text data of the target task is accumulated, general task language data in which text data of general tasks including tasks other than the target task is accumulated, and target task word classifying means, general task word classifying means, similar word pair extracting means, similar word string synthesizing means, and language model generating means for constructing a task-adapted language model from the target task language data and the general task language data. The target task word classifying means reads the text data of the target task from the target task language data, replaces words with the classes indicated in the class definition, and outputs first text data classified for language model learning. The general task word classifying means reads the text data of the general task from the general task language data, replaces words with the classes indicated in the class definition, and outputs second text data classified for language model learning. The similar word pair extracting means reads the first and second text data, and extracts similar word pairs from combinations of a word included in the text data of the target task and a word included in the text data of the general task. The similar word string synthesizing means reads the first and second text data, reads the similar word pairs from the similar word pair extracting means, and synthesizes and outputs word strings that include words of the target task and are not included in the language data. The language model generating means reads the first and second text data, reads the word strings from the similar word string synthesizing means, and generates the task-adapted language model by obtaining the word string statistics with a weight applied to each text data. As a result, a language model learning device with improved recognition accuracy can be obtained.
[0273]
According to claim 4 of the present invention, the apparatus comprises target task language data in which text data of a target task is accumulated, general task language data in which text data of general tasks including tasks other than the target task is accumulated, an initial language model created using text data prepared in advance, and a similar word pair extracting means and a similar word probability correcting means for constructing a task-adapted statistical language model from the target task language data, the general task language data, and the initial language model. The similar word pair extracting means reads text data for language model learning from the target task language data and the general task language data, respectively, and extracts similar word pairs from combinations of a word included in the text data of the target task and a word included in the text data of the general task. The similar word probability correcting means reads the similar word pairs from the similar word pair extracting means, reads the initial language model, and generates the task-adapted statistical language model by smoothing the appearance probabilities of words appearing in the target task. Consequently, a language model learning device with improved recognition accuracy can be obtained.
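The probability correction can be pictured with the following Python sketch, which works on unigram probabilities for brevity. It assumes, purely for illustration, that smoothing means linearly interpolating the probability of a target task word with that of its similar general task word and then renormalizing; the interpolation weight `lam`, the function name, and the example values are hypothetical.

```python
def smooth_with_similar_words(probs, similar_pairs, lam=0.5):
    """Correct word probabilities of an initial model using similar word pairs.

    probs: dict mapping word -> probability in the initial language model.
    similar_pairs: iterable of (target_word, general_word) tuples.
    lam: assumed interpolation weight given to the similar word's probability.
    """
    corrected = dict(probs)
    for t_word, g_word in similar_pairs:
        p_t = probs.get(t_word, 0.0)
        p_g = probs.get(g_word, 0.0)
        # Move the target word's probability toward its similar word's.
        corrected[t_word] = (1.0 - lam) * p_t + lam * p_g
    # Renormalize so the corrected values form a distribution again.
    total = sum(corrected.values())
    return {w: p / total for w, p in corrected.items()}

initial = {"hotel": 0.4, "ryokan": 0.0, "a": 0.6}
adapted = smooth_with_similar_words(initial, [("ryokan", "hotel")])
```

In this toy case the word "ryokan", which had zero probability in the initial model, receives non-zero probability after correction.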
[0274]
According to claim 5 of the present invention, the apparatus comprises target task language data in which text data of a target task is accumulated, general task language data in which text data of general tasks including tasks other than the target task is accumulated, an initial class language model created in advance, and a target task word classifying means, a general task word classifying means, a similar word pair extracting means, and a similar word probability correcting means for constructing a task-adapted class language model from the target task language data, the general task language data, and the initial class language model. The target task word classifying means reads the text data of the target task from the target task language data, replaces each word with the class indicated in the class definition, and outputs first classified text data for language model learning. The general task word classifying means reads the text data of the general task from the general task language data, replaces each word with the class indicated in the class definition, and outputs second classified text data for language model learning. The similar word pair extracting means reads the first and second text data and extracts similar word pairs from combinations of a word included in the text data of the target task and a word included in the text data of the general task. The similar word probability correcting means reads the similar word pairs from the similar word pair extracting means, reads the initial class language model, and generates the task-adapted class language model by smoothing the appearance probabilities of words appearing in the target task. Consequently, a language model learning device with improved recognition accuracy can be obtained.
[0275]
According to claim 6 of the present invention, in claim 1 or claim 4, the similar word pair extracting means includes a distance calculation language model generating means, a statistical inter-word distance calculating means, and a threshold determining means. The distance calculation language model generating means reads text data for language model learning from the target task language data and the general task language data, respectively, obtains the statistics of word strings with a weight applied to each text data, and generates a statistical language model for distance calculation. The statistical inter-word distance calculating means reads the statistical language model from the distance calculation language model generating means and, for each word pair consisting of words extracted from the respective text data, obtains the statistical distance on the statistical language model as the inter-word distance. The threshold determining means reads the word pairs and the inter-word distances from the statistical inter-word distance calculating means and outputs the word pairs exceeding a predetermined threshold. Consequently, a language model learning device with improved recognition accuracy can be obtained.
[0276]
According to claim 7 of the present invention, in claim 1 or claim 4, the similar word pair extracting means includes a distance calculation language model, a statistical inter-word distance calculating means, and a threshold determining means. The distance calculation language model is created using text data prepared in advance. The statistical inter-word distance calculating means reads the distance calculation language model and, for each word pair consisting of words extracted from the respective text data, obtains the statistical distance on the distance calculation language model as the inter-word distance. The threshold determining means reads the word pairs and the inter-word distances from the statistical inter-word distance calculating means and outputs the word pairs exceeding a predetermined threshold. Consequently, a language model learning device with improved recognition accuracy can be obtained.
[0277]
According to claim 8 of the present invention, in claim 3 or claim 5, the similar word pair extracting means includes a distance calculation language model generating means, a statistical inter-word distance calculating means, and a threshold determining means. The distance calculation language model generating means reads the first and second text data from the target task word classifying means and the general task word classifying means, obtains the statistics of word strings with a weight applied to each text data, and generates a statistical language model for distance calculation. The statistical inter-word distance calculating means reads the statistical language model from the distance calculation language model generating means and, for each word pair consisting of words extracted from the respective text data, obtains the statistical distance on the statistical language model as the inter-word distance. The threshold determining means reads the word pairs and the inter-word distances from the statistical inter-word distance calculating means and outputs the word pairs exceeding a predetermined threshold. Consequently, a language model learning device with improved recognition accuracy can be obtained.
[0278]
According to claim 9 of the present invention, in claim 3 or claim 5, the similar word pair extracting means includes a distance calculation class language model, a statistical inter-word distance calculating means, and a threshold determining means. The distance calculation class language model is created using text data prepared in advance. The statistical inter-word distance calculating means reads the distance calculation class language model, reads the first and second text data from the target task word classifying means and the general task word classifying means, and, for each word pair consisting of words extracted from the respective text data, obtains the statistical distance on the distance calculation class language model as the inter-word distance. The threshold determining means reads the word pairs and the inter-word distances from the statistical inter-word distance calculating means and outputs the word pairs exceeding a predetermined threshold. Consequently, a language model learning device with improved recognition accuracy can be obtained.
[0279]
According to claim 10 of the present invention, in any one of claims 6 to 9, the statistical inter-word distance calculating means measures the inter-word distance using the Euclidean distance on an N-gram language model. Consequently, a language model learning device with improved recognition accuracy can be obtained.
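As a rough illustration of a Euclidean distance measured on an N-gram language model, the Python sketch below represents each word by its bigram successor distribution P(next | word) and takes the Euclidean distance between two such distributions. Representing a word by its successor distribution, the use of unsmoothed bigram estimates, and all names and example sentences are assumptions for this sketch; the specification may define the vectors differently.

```python
import math
from collections import Counter, defaultdict

def successor_distributions(sentences):
    """Estimate P(next | word) for every word from bigram counts."""
    counts = defaultdict(Counter)
    for s in sentences:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    return {
        w: {v: c / sum(nxt.values()) for v, c in nxt.items()}
        for w, nxt in counts.items()
    }

def euclidean_distance(dists, w1, w2):
    """Euclidean distance between the successor distributions of two words."""
    p, q = dists.get(w1, {}), dists.get(w2, {})
    support = set(p) | set(q)
    return math.sqrt(sum((p.get(v, 0.0) - q.get(v, 0.0)) ** 2 for v in support))

sentences = [["hotel", "room"], ["ryokan", "room"], ["train", "ticket"]]
dists = successor_distributions(sentences)
d_similar = euclidean_distance(dists, "hotel", "ryokan")
d_different = euclidean_distance(dists, "hotel", "train")
```

Words used in the same contexts ("hotel" and "ryokan" here) end up closer together than words used in different contexts.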
[0280]
According to claim 11 of the present invention, in any one of claims 6 to 9, the statistical inter-word distance calculating means measures the inter-word distance using the cross entropy on an N-gram language model. Consequently, a language model learning device with improved recognition accuracy can be obtained.
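A cross-entropy-based inter-word distance can be sketched in Python as follows, again treating each word as a distribution over its N-gram successors. The probability floor used to avoid log(0), the symmetrized variant, and the function names are assumptions added for illustration, not details given in the specification.

```python
import math

def cross_entropy(p, q, floor=1e-6):
    """H(p, q) = -sum_v p(v) * log q(v), with q floored to avoid log(0)."""
    return -sum(
        pv * math.log(max(q.get(v, 0.0), floor))
        for v, pv in p.items()
        if pv > 0.0
    )

def symmetric_cross_entropy(p, q, floor=1e-6):
    """Average of H(p, q) and H(q, p), so the measure does not depend on order."""
    return 0.5 * (cross_entropy(p, q, floor) + cross_entropy(q, p, floor))

# Hypothetical successor distributions P(next | word) for three words.
hotel = {"room": 1.0}
ryokan = {"room": 1.0}
train = {"ticket": 1.0}

d_similar = symmetric_cross_entropy(hotel, ryokan)
d_different = symmetric_cross_entropy(hotel, train)
```

Unlike the Euclidean distance, the plain cross entropy is asymmetric, which is why the sketch averages both directions before comparing word pairs.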
[0281]
According to claim 12 of the present invention, the speech recognition apparatus uses the language model learning apparatus according to any one of claims 1 to 11, and the language model or the class language model is used for speech recognition. Consequently, a highly accurate speech recognition apparatus can be obtained.
[Brief description of the drawings]
FIG. 1 is a block configuration diagram schematically showing a language model learning apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing a processing procedure of the language model learning apparatus according to Embodiment 1 of the present invention.
FIG. 3 is a block configuration diagram schematically showing a language model learning apparatus according to Embodiment 2 of the present invention.
FIG. 4 is a flowchart showing a processing procedure of the language model learning apparatus according to Embodiment 2 of the present invention.
FIG. 5 is a block configuration diagram schematically showing a language model learning apparatus according to Embodiment 3 of the present invention.
FIG. 6 is a flowchart showing a processing procedure of the language model learning apparatus according to Embodiment 3 of the present invention.
FIG. 7 is a block configuration diagram schematically showing a language model learning apparatus according to Embodiment 4 of the present invention.
FIG. 8 is a flowchart showing a processing procedure of the language model learning apparatus according to Embodiment 4 of the present invention.
FIG. 9 is a block configuration diagram schematically showing a language model learning apparatus according to Embodiment 5 of the present invention.
FIG. 10 is a flowchart showing a processing procedure of the language model learning apparatus according to Embodiment 5 of the present invention.
FIG. 11 is a functional block diagram showing a specific example of the similar word pair extraction means of the language model learning apparatus according to Embodiment 6 of the present invention.
FIG. 12 is a flowchart showing a processing procedure of the similar word pair extraction means of the language model learning apparatus according to Embodiment 6 of the present invention.
FIG. 13 is a functional block diagram showing a second specific example of the similar word pair extraction means according to Embodiment 6 of the present invention.
FIG. 14 is a functional block diagram showing a third specific example of the similar word pair extraction means according to Embodiment 6 of the present invention.
FIG. 15 is a functional block diagram showing a fourth specific example of the similar word pair extraction means according to Embodiment 6 of the present invention.
FIG. 16 is a block configuration diagram schematically showing a speech recognition apparatus using a language model learning apparatus according to Embodiment 7 of the present invention.
FIG. 17 is a flowchart showing a processing procedure of the speech recognition apparatus using the language model learning apparatus according to Embodiment 7 of the present invention.
FIG. 18 is a block configuration diagram schematically showing a speech recognition apparatus using a language model learning apparatus according to Embodiment 8 of the present invention.
FIG. 19 is a flowchart showing a processing procedure of the speech recognition apparatus using the language model learning apparatus according to Embodiment 8 of the present invention.
FIG. 20 is an explanatory diagram showing an example of a general class definition.
FIG. 21 is a block diagram schematically showing a conventional language model learning device.
FIG. 22 is a flowchart showing a processing procedure by a conventional language model learning apparatus and method.
[Explanation of symbols]
101 target task language data, 102 general task language data, 103, 103B, 103C, 103D, 103E, 103F similar word pair extraction means, 104, 104B similar word string synthesis means, 105, 105A, 105B language model generation means, 301 target task word classifying means, 302 general task word classifying means, and language model generating means, 701 initial language model, 702 similar word probability correcting means, 901 initial class language model, 1101, 1101D, 1101F statistical inter-word distance calculation means, 1102, 1102E threshold determination means, 1105, 1105E distance calculation language model generation means, 1301 distance calculation language model, 1501 distance calculation class language model, 1605 language model, 1802 class language model.

Claims

The target task language data that accumulates the text data of the target task, and
General task language data that integrates general task text data including tasks other than the target task,
Similar word pair extracting means, similar word string synthesizing means, and language model generating means for reading text data for language model learning from the target task language data and the general task language data, respectively, and constructing a task-adapted language model,
The similar word pair extracting means reads the respective text data from the target task language data and the general task language data, and extracts similar word pairs from combinations of a word included in the text data of the target task and a word included in the text data of the general task,
The similar word string synthesizing means reads the respective text data, reads the similar word pairs from the similar word pair extracting means, and synthesizes and outputs a word string that contains a word of the target task and is not contained in the language data,
The language model generating means reads the respective text data, reads the word string from the similar word string synthesizing means, and obtains the statistics of word strings with a weight applied to each text data, thereby generating the task-adapted language model. A language model learning apparatus characterized by the above.

The target task language data that accumulates the text data of the target task, and
General task language data that integrates general task text data including tasks other than the target task,
A target task word classifying means, a general task word classifying means, and a language model generating means for constructing a task-adapted language model from the target task language data and the general task language data,
The target task word classifying means reads the text data of the target task from the target task language data, replaces each word with the class indicated in the class definition, and outputs first classified text data for language model learning,
The general task word classifying means reads the text data of the general task from the general task language data, replaces each word with the class indicated in the class definition, and outputs second classified text data for language model learning,
The language model generating means reads the first and second text data and obtains the statistics of word strings with a weight applied to each text data, thereby generating the language model. A language model learning apparatus characterized by the above.

The target task language data that accumulates the text data of the target task, and
General task language data that integrates general task text data including tasks other than the target task,
Target task word classifying means, general task word classifying means, similar word pair extracting means, similar word string synthesizing means, and language model generating means for constructing a task-adapted language model from the target task language data and the general task language data,
The target task word classifying means reads the text data of the target task from the target task language data, replaces each word with the class indicated in the class definition, and outputs first classified text data for language model learning,
The general task word classifying means reads the text data of the general task from the general task language data, replaces each word with the class indicated in the class definition, and outputs second classified text data for language model learning,
The similar word pair extracting means reads the first and second text data, and extracts similar word pairs from combinations of a word included in the text data of the target task and a word included in the text data of the general task,
The similar word string synthesizing means reads the first and second text data, reads the similar word pairs from the similar word pair extracting means, and synthesizes and outputs a word string that contains a word of the target task and is not contained in the language data,
The language model generating means reads the first and second text data, reads the word string from the similar word string synthesizing means, and obtains a statistic of the word string by weighting each text data. The language model learning device according to claim 1, wherein the task-adapted language model is generated.

The target task language data that accumulates the text data of the target task, and
General task language data that integrates general task text data including tasks other than the target task,
An initial language model created using text data prepared in advance;
A similar word pair extracting means and a similar word probability correcting means for constructing a task-adapted statistical language model from the target task language data, the general task language data, and the initial language model,
The similar word pair extracting means reads text data for language model learning from the target task language data and the general task language data, respectively, and extracts similar word pairs from combinations of a word included in the text data of the target task and a word included in the text data of the general task,
The similar word probability correcting means reads the similar word pairs from the similar word pair extracting means, reads the initial language model, and smooths the appearance probabilities of words appearing in the target task, thereby generating the task-adapted statistical language model. A language model learning apparatus characterized by the above.

The target task language data that accumulates the text data of the target task, and
General task language data that integrates general task text data including tasks other than the target task,
A pre-created initial class language model,
Target task word classifying means, general task word classifying means, similar word pair extracting means, and similar word probability correcting means for constructing a task-adapted class language model from the target task language data, the general task language data, and the initial class language model,
The target task word classifying means reads the text data of the target task from the target task language data, replaces each word with the class indicated in the class definition, and outputs first classified text data for language model learning,
The general task word classifying means reads the text data of the general task from the general task language data, replaces each word with the class indicated in the class definition, and outputs second classified text data for language model learning,
The similar word pair extracting means reads the first and second text data, and extracts similar word pairs from combinations of a word included in the text data of the target task and a word included in the text data of the general task,
The similar word probability correcting means reads the similar word pairs from the similar word pair extracting means, reads the initial class language model, and smooths the appearance probabilities of words appearing in the target task, thereby generating the task-adapted class language model. A language model learning apparatus characterized by the above.

The similar word pair extracting means includes a distance calculation language model generating means, a statistical inter-word distance calculating means, and a threshold determining means,
The distance calculation language model generating means reads text data for language model learning from the target task language data and the general task language data, respectively, obtains the statistics of word strings with a weight applied to each text data, and generates a statistical language model for distance calculation,
The statistical inter-word distance calculating means reads the statistical language model from the distance calculation language model generating means and, for each word pair consisting of words extracted from the respective text data, obtains the statistical distance on the statistical language model as the inter-word distance,
The threshold determining means reads the word pairs and the inter-word distances from the statistical inter-word distance calculating means, and outputs the word pairs exceeding a predetermined threshold. The language model learning device according to claim 1 or claim 4, characterized by the above.

The similar word pair extracting means includes a distance calculation language model, a statistical inter-word distance calculating means, and a threshold determining means,
The distance calculation language model is created using text data prepared in advance,
The statistical inter-word distance calculating means reads the distance calculation language model and, for each word pair consisting of words extracted from the respective text data, obtains the statistical distance on the distance calculation language model as the inter-word distance,
The threshold determining means reads the word pairs and the inter-word distances from the statistical inter-word distance calculating means, and outputs the word pairs exceeding a predetermined threshold. The language model learning device according to claim 1 or claim 4, characterized by the above.

The similar word pair extracting means includes a distance calculation language model generating means, a statistical inter-word distance calculating means, and a threshold determining means,
The distance calculation language model generating means reads the first and second text data from the target task word classifying means and the general task word classifying means, obtains the statistics of word strings with a weight applied to each text data, and generates a statistical language model for distance calculation,
The statistical inter-word distance calculating means reads the statistical language model from the distance calculation language model generating means and, for each word pair consisting of words extracted from the respective text data, obtains the statistical distance on the statistical language model as the inter-word distance,
The threshold determining means reads the word pairs and the inter-word distances from the statistical inter-word distance calculating means, and outputs the word pairs exceeding a predetermined threshold. The language model learning device according to claim 3 or claim 5, characterized by the above.

The similar word pair extracting means includes a distance calculation class language model, a statistical inter-word distance calculating means, and a threshold determining means,
The distance calculation class language model is created using text data prepared in advance,
The statistical inter-word distance calculating means reads the distance calculation class language model, reads the first and second text data from the target task word classifying means and the general task word classifying means, and, for each word pair consisting of words extracted from the respective text data, obtains the statistical distance on the distance calculation class language model as the inter-word distance,
The threshold determining means reads the word pairs and the inter-word distances from the statistical inter-word distance calculating means, and outputs the word pairs exceeding a predetermined threshold. The language model learning device according to claim 3 or claim 5, characterized by the above.

The language model learning device according to any one of claims 6 to 9, wherein the statistical inter-word distance calculating means measures the inter-word distance using the Euclidean distance on an N-gram language model.

The language model learning device according to any one of claims 6 to 9, wherein the statistical inter-word distance calculating means measures the inter-word distance using the cross entropy on an N-gram language model.

The speech recognition device using the language model learning device according to any one of claims 1 to 11, wherein the language model or the class language model is used for speech recognition.