JP2002229588A

JP2002229588A - Statistical language model forming system, speech recognizer and statistical language model forming method as well as recording medium

Info

Publication number: JP2002229588A
Application number: JP2001020583A
Authority: JP
Inventors: Yuzo Maruta; 裕三丸田; Yoshiharu Abe; 芳春阿部; Hirotaka Goi; 啓恭伍井
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2001-01-29
Filing date: 2001-01-29
Publication date: 2002-08-16

Abstract

PROBLEM TO BE SOLVED: When a statistical language model is built from a language corpus 201 that is extremely sparse, speech recognition results obtained with that model tend to include words with a high unigram frequency, and recognition errors are likely to occur. SOLUTION: The system includes word connection probability calculating means which, when calculating the conditional probability of a low-frequency word chain by backoff smoothing, inputs the backoff probability to a monotonically increasing function and takes the result of that function as the new conditional probability of the word chain.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001] TECHNICAL FIELD OF THE INVENTION: The present invention relates to a statistical language model generation apparatus that creates a statistical language model used for speech recognition. In particular, it relates to a statistical language model generation apparatus that reduces the dynamic range of the frequency differences arising on backoff, and thereby prevents a word string containing a word with a high lower-order word (chain) frequency from receiving an unduly high language probability. It also relates to a speech recognition apparatus using that apparatus, to a statistical language model generation method, and to a recording medium.

[0002] DESCRIPTION OF THE RELATED ART: A statistical language model generally describes constraints on word chains. In continuous speech recognition it is used to provide information about which word strings are likely to follow one another.

[0003] FIG. 13 is a block diagram showing the configuration of a conventional speech recognition apparatus for continuous speech, disclosed in Japanese Unexamined Patent Publication No. 2000-99086. In the figure, 101 is an acoustic model used to pattern-match the input speech in units of phonemes, syllables, or words; pattern matching yields acoustically similar recognition candidates. In the illustrated example, pattern matching uses acoustic models 101 stored per word. 102 is acoustic probability calculation means that pattern-matches the input speech using the acoustic model 101; in the illustrated example it uses the per-word acoustic models 101 to calculate, for every word, the probability that the input speech was produced from that word. 103 is a (statistical) language model describing constraints on word chains. 104 is language probability calculation means, which uses the language model 103 to calculate the probability of occurrence of each word scored by the acoustic probability calculation means 102, and outputs as the recognition result the word for which the combination of this probability and the probability calculated by the acoustic probability calculation means 102 is largest.

[0004] The operation is described next. First, the acoustic probability calculation means 102 pattern-matches the input speech against the acoustic models 101 stored in advance on a storage medium for each word (string). Letting $X$ be the input speech and $w_k$ a word (string), the acoustic probability calculation means 102 calculates, for all words (strings), the probability $P(X \mid w_k)$ that the input speech $X$ is produced from the word (string) $w_k$.

[0005] Next, the language probability calculation means 104 uses the language model 103, stored in advance on a storage medium (not shown), to calculate the language probability $P(w_k)$ of every word (string) $w_k$. From this language probability and the probability calculated by the acoustic probability calculation means 102 it computes $P(X \mid w_k) P(w_k)$, and outputs the word (string) with the largest value as the recognition result.
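The selection rule in paragraphs [0004] and [0005] can be sketched as a small argmax. This is a minimal illustration, not the apparatus of FIG. 13: the function name `recognize`, the candidate strings, and all scores are hypothetical, and log probabilities are used in place of the product purely for numerical convenience.

```python
import math

def recognize(acoustic_logp, language_logp):
    """Return the candidate w_k that maximizes P(X | w_k) * P(w_k).

    Both arguments map a candidate word (string) to a log probability;
    adding logs is equivalent to multiplying the probabilities.
    """
    return max(acoustic_logp, key=lambda w: acoustic_logp[w] + language_logp[w])

# Hypothetical scores for two acoustically similar candidates.
acoustic = {"recognize speech": math.log(0.020), "wreck a nice beach": math.log(0.025)}
language = {"recognize speech": math.log(1e-4), "wreck a nice beach": math.log(1e-6)}
best = recognize(acoustic, language)
```

In this toy case the second candidate has the slightly better acoustic score, but the language probability decides the outcome.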

[0006] FIG. 14 is an explanatory diagram of how the statistical language model used in the speech recognition apparatus of FIG. 13 is generated. In the figure, 201 is a language corpus used to build the language model, that is, a language database made up of a large amount of text such as newspaper articles. 202 is N-gram estimation means, which uses the language corpus 201 and the 0-gram to (N-1)-gram probability models to perform smoothing, adjusting the probability values so that the N-gram conditional probabilities of all word chains move closer to one another, and outputs the N-gram language model 203. 203 is an N-gram language model of the kind generally used as the language model 103; it describes the probability of occurrence of chains of N words. The N-gram language model 203 obtains the N-gram conditional probability, which is the probability that a given word comes next when the immediately preceding word string consists of any (N-1) words, and holds this conditional probability for every possible word string of N words. 204-0 to 204-(N-1) are storage means that store the 0-gram to (N-1)-gram probability models used during N-gram estimation by the N-gram estimation means 202. Here the 0-gram (zero-chain) probability of each word is assumed to be, for example, the reciprocal of the maximum number of word types, so that it takes the same value for all words.

[0007] First, an outline of language model generation is given. If N in the N-gram language model 203 is large, the number of combinations becomes enormous, so values such as 2 or 3 are normally used. The case N = 2 is called a bigram (2-chain) and the case N = 3 a trigram (3-chain). The following description uses the trigram as an example. The trigram conditional probability is written $P(w_3 \mid w_1, w_2)$: the conditional probability that the next word is $w_3$ given that the immediately preceding word string (the two preceding words) was $w_1, w_2$. A trigram language model holds this conditional probability, or makes it computable, for every possible word string $(w_1, w_2, w_3)$.

[0008] This conditional probability is generally obtained by counting events that actually occur in a large database. For a trigram, for example, the conditional probability $P(w_3 \mid w_1, w_2)$ is obtained by dividing the count $N(w_1, w_2, w_3)$ of occurrences of the word string $(w_1, w_2, w_3)$ by the count $N(w_1, w_2)$ of occurrences of the word string $(w_1, w_2)$. In practice, however, the number of combinations in such an N-gram model is enormous, so many N-grams never appear in the database at all and many others appear only a few times. This problem is called sparsity. The occurrence probability of a word string whose N-gram never appears in the database becomes zero, and an N-gram that appears only a few times receives a very small probability with low estimation accuracy; both degrade recognition performance.
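The count-based estimate and the sparsity problem described in paragraph [0008] can be seen on a toy corpus. The sketch below is illustrative only; the helper names `ngram_counts` and `mle_trigram` and the sample sentence are assumptions, not part of the source.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_trigram(tri, bi, w1, w2, w3):
    """P(w3 | w1, w2) = N(w1, w2, w3) / N(w1, w2); 0.0 if the history is unseen."""
    history = bi[(w1, w2)]
    return tri[(w1, w2, w3)] / history if history else 0.0

tokens = "the cat sat on the mat the cat ran".split()
tri, bi = ngram_counts(tokens, 3), ngram_counts(tokens, 2)
p_seen = mle_trigram(tri, bi, "the", "cat", "sat")      # chain observed in the corpus
p_unseen = mle_trigram(tri, bi, "the", "cat", "slept")  # chain never observed
```

The unobserved chain receives probability zero even though it is a perfectly plausible continuation, which is the defect that smoothing is meant to repair.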

[0009] To solve this problem, the N-gram estimation means 202 performs smoothing, which adjusts the values of the N-gram conditional probabilities. Smoothing of N-gram conditional probabilities is often carried out using the (N-1)-gram conditional probabilities. For example, the conditional probability that the word $w_3$ appears after the word chain $w_1$-$w_2$, that is, the trigram conditional probability $P(w_3 \mid w_1, w_2)$, is smoothed using the bigram conditional probability $P(w_3 \mid w_2)$.

[0010] The language model generation operation is described next. First, the 0-gram probability is set in the N-gram estimation means 202 and stored in the storage means 204-0 as the 0-gram probability model. Next, the N-gram estimation means 202 performs n-gram smoothing in order for n = 1 to N. Estimation of the n-gram uses the (n-1)-gram model ((n-1)-chain word strings) stored in the storage means 204-(n-1), and while n < N the estimated n-gram conditional probabilities are stored in the storage means 204-n as the n-gram probability model. When n = N, the N-gram conditional probabilities, obtained through the models stored in sequence in the storage means 204-0 to 204-(N-1), are output as the N-gram language model 203.

[0011] The smoothing performed by the N-gram estimation means 202 as described above generally uses a technique called backoff. As described on p. 34 of "Speech Information Processing" (Kenji Kita, Satoshi Nakamura, Masaaki Nagata; Morikita Publishing Co., Ltd., 1996), backoff is an approximation method that estimates the sum of the probability estimates of the word chains present in the language corpus 201 to be less than 1 and assigns the remainder to the chain probabilities of word chains absent from the language corpus 201. In other words, the counts of rarely observed N-grams are discounted further, and the total leftover probability mass is distributed among the unobserved N-grams according to the (N-1)-gram values.

[0012] When the word chain $w_1$-$w_2$-$w_3$ does not exist in the language corpus 201 (that is, the word chain frequency in the language corpus 201 is $N(w_1, w_2, w_3) = 0$), the estimated probability $P(w_3 \mid w_1, w_2)$ of the word chain $w_1$-$w_2$-$w_3$ is given by equation (1):

$P(w_3 \mid w_1, w_2) \propto P(w_3 \mid w_2)$ (1)

If the word chain $w_2$-$w_3$ is also absent from the language corpus 201 (the word chain frequency is $N(w_2, w_3) = 0$), the estimated probability $P(w_3 \mid w_2)$ of the word chain $w_2$-$w_3$ is given by equation (2):

$P(w_3 \mid w_2) \propto P(w_3)$ (2)

[0013] A well-known method that makes equations (1) and (2) concrete is the Witten-Bell method, cited for example in P. Clarkson et al., "Statistical Language Modeling Using the CMU-Cambridge Toolkit," EuroSpeech '97, Vol. 5, pp. 2707-2710 (1998), which describes typical backoff methods. The Witten-Bell method expresses the probability estimates $P(w_3 \mid w_1, w_2)$ and $P(w_2 \mid w_1)$ by equations (3) and (4) below.

[Equation 1]

Here, $\alpha(w_1)$ and $\beta(w_1, w_2)$ are probability normalization constants, $R(w_1, w_2)$ is the number of distinct word types that appear after $w_1 w_2$ in the language corpus 201, and $R(w_1)$ is the number of distinct word types that appear after $w_1$ in the language corpus 201. $N(w_1, w_2, w_3)$ is the number of times the word chain $w_1 w_2 w_3$ appears in the language corpus 201, $N(w_1, w_2)$ is the number of times the word chain $w_1 w_2$ appears in the language corpus 201, and $N(w_1)$ is the number of times the word $w_1$ appears in the language corpus 201. $P(w_1)$ is generally given by $P(w_1) = N(w_1) / N(*)$ and is called the unigram probability of the word $w_1$, where $*$ denotes all words in the language corpus 201.
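Equations (3) and (4) themselves are not reproduced in this text. As a rough illustration only, the sketch below implements the bigram case in the form in which Witten-Bell backoff is commonly stated: a seen pair gets $N(w_1, w_2) / (N(w_1) + R(w_1))$, and the reserved mass $R(w_1) / (N(w_1) + R(w_1))$ is shared among unseen words in proportion to their unigram probability, with $\alpha(w_1)$ as the normalizer. That form is an assumption taken from the general literature rather than from the patent, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

def witten_bell_bigram(tokens):
    """Return p(w2, w1) = P(w2 | w1) with Witten-Bell backoff to unigrams."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    followers = defaultdict(set)            # followers[w1] = distinct words seen after w1
    for w1, w2 in bi:
        followers[w1].add(w2)
    total = sum(uni.values())               # N(*)
    p_uni = {w: c / total for w, c in uni.items()}

    def p(w2, w1):
        n1 = sum(bi[(w1, w)] for w in followers[w1])  # N(w1), counted as a history
        r1 = len(followers[w1])                        # R(w1)
        if n1 == 0:
            return p_uni.get(w2, 0.0)                  # unseen history: plain unigram
        if bi[(w1, w2)] > 0:
            return bi[(w1, w2)] / (n1 + r1)            # discounted estimate for a seen pair
        unseen_mass = 1.0 - sum(p_uni[w] for w in followers[w1])
        if unseen_mass <= 0.0:
            return 0.0                                 # nothing left to back off to
        alpha = (r1 / (n1 + r1)) / unseen_mass         # normalization constant alpha(w1)
        return alpha * p_uni.get(w2, 0.0)
    return p

p = witten_bell_bigram("a b a c a b b a".split())
row_sum = sum(p(w, "a") for w in ("a", "b", "c"))      # probabilities after "a"
```

Counting $N(w_1)$ only where $w_1$ is followed by another word is a choice made here so that each row sums to one on a short corpus.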

[0014] Another approach to the sparsity problem described above is the class language model described on p. 35 of "Speech Information Processing" (Kenji Kita, Satoshi Nakamura, Masaaki Nagata; Morikita Publishing Co., Ltd., 1996). In this approach, words are grouped into classes and the chain probability estimates between classes are used in estimating the chain probabilities of word chains. Let the words be $w_1, w_2, \ldots$ and assume that each word belongs to a class, $w_i \in C_i$. The conditional probability that the word $w_i$ appears after the word string $w_1, w_2, \ldots, w_{i-1}$ is then approximated as in equations (5) to (7):

$P(w_i \mid w_1, w_2, w_3, \ldots, w_{i-1}) \approx P(w_i \mid C_i) \cdot P(C_i \mid C_1, C_2, \ldots, C_{i-1})$ (5)

For a bigram model in particular, this becomes equation (6):

$P(w_i \mid w_1, w_2, w_3, \ldots, w_{i-1}) \approx P(w_i \mid C_i) \cdot P(C_i \mid C_{i-1})$ (6)

For a trigram model, it becomes equation (7):

$P(w_i \mid w_1, w_2, w_3, \ldots, w_{i-1}) \approx P(w_i \mid C_i) \cdot P(C_i \mid C_{i-2}, C_{i-1})$ (7)

[0015] Each of these conditional probabilities is generally determined from the word chain frequencies in the language corpus 201, as in equation (8):

$P(w_i \mid C_i) = N(w_i) / N(\forall w \in C_i)$
$P(C_i \mid C_{i-2}, C_{i-1}) = N(C_{i-2}, C_{i-1}, C_i) / N(C_{i-2}, C_{i-1}, *)$ (8)

Here, $N(w_i)$ is the frequency with which the word $w_i$ appears in the language corpus 201, $N(w, x)$ is the frequency with which the word chain $w, x$ appears in the language corpus 201, and $N(C, D)$ is the frequency with which a word belonging to class $C$ is followed by a word belonging to class $D$.
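The class trigram of equations (7) and (8) can be sketched directly from counts. This is a minimal illustration under assumed inputs: the function name, the word-to-class mapping, and the toy corpus are hypothetical, and every word is assumed to belong to exactly one class.

```python
from collections import Counter

def class_trigram_model(tokens, word2class):
    """Return p(w3, w1, w2) ~= P(w3 | C3) * P(C3 | C1, C2), per equations (7) and (8)."""
    classes = [word2class[w] for w in tokens]
    word_n = Counter(tokens)                                # N(w)
    class_n = Counter(classes)                              # N(all w in C)
    tri = Counter(zip(classes, classes[1:], classes[2:]))   # N(C1, C2, C3)
    hist = Counter(zip(classes[:-2], classes[1:-1]))        # N(C1, C2, *)

    def p(w3, w1, w2):
        c1, c2, c3 = word2class[w1], word2class[w2], word2class[w3]
        if hist[(c1, c2)] == 0:
            return 0.0                                      # class history never observed
        return (word_n[w3] / class_n[c3]) * (tri[(c1, c2, c3)] / hist[(c1, c2)])
    return p

word2class = {"tokyo": "CITY", "osaka": "CITY", "to": "PREP", "go": "VERB", "fly": "VERB"}
tokens = "go to tokyo fly to tokyo go to osaka".split()
p = class_trigram_model(tokens, word2class)
p_new_chain = p("osaka", "fly", "to")   # "fly to osaka" never occurs as a word chain
```

Because the estimate is built from class counts, a word chain that never occurs in the corpus still receives a nonzero probability, which is how the class model addresses sparsity.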

[0016] PROBLEMS TO BE SOLVED BY THE INVENTION: Because the conventional statistical language model generation apparatus is configured as described above, when a statistical language model is generated from a language corpus 201 with very high sparsity (that is, one for which backoff occurs very often), speech recognition results obtained with that model tend to include words with a high unigram frequency, and recognition errors are likely to occur.

[0017] This problem is explained concretely. Consider a method using backoff smoothing in which the vocabulary contains "日本" (nihon, "Japan"), "の" (no, a case particle), "へ" (e, a case particle), and "絵" (e, "picture"), and the utterance is "にほんのえ" (nihon no e). When the language probabilities are calculated and none of these word chains exists in the language corpus 201, backing off gives equation (9):

$P(\text{絵} \mid \text{日本}, \text{の}) \propto P(\text{絵})$
$P(\text{へ} \mid \text{日本}, \text{の}) \propto P(\text{へ})$ (9)

The case particle "へ" generally has a far higher frequency than "絵", so equation (10) follows:

$P(\text{絵} \mid \text{日本}, \text{の}) \ll P(\text{へ} \mid \text{日本}, \text{の})$ (10)

As equation (10) shows, a large gap opens between the language probabilities $P(\text{絵} \mid \text{日本}, \text{の})$ and $P(\text{へ} \mid \text{日本}, \text{の})$. Consequently, with a language model created from a language corpus 201 with very high sparsity (that is, one for which backoff occurs very often), recognition results tend to include words with a high unigram frequency, and recognition errors are likely to occur.
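The size of the gap in equation (10) can be made concrete with a little arithmetic. The sketch below uses invented counts and an assumed shared backoff weight; none of these numbers come from the source.

```python
# Hypothetical unigram counts: a case particle is far more frequent than a noun.
unigram_count = {"へ": 50_000, "絵": 50}
total = 1_000_000
backoff_weight = 0.3   # assumed; shared because both candidates follow the same history

def backed_off(word):
    """P(word | "日本", "の") when neither the trigram nor the bigram was observed."""
    return backoff_weight * unigram_count[word] / total

ratio = backed_off("へ") / backed_off("絵")   # equals the ratio of unigram counts
```

Since the backoff weight cancels, the two candidates differ by exactly the ratio of their unigram counts, a factor of about a thousand here, regardless of which word fits the context.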

[0018] In addition, using a class language model also lowers the accuracy of the word chain probabilities of the statistical language model.

[0019] This problem is explained concretely. In a class language model, the chain probability estimate $P(w_3 \mid w_1, w_2)$ of the word chain $w_1$-$w_2$-$w_3$ is generally given by equation (11):

$P(w_3 \mid w_1, w_2) = P(C_3 \mid C_1, C_2) \cdot P(w_3 \mid C_3)$ (11)

Here, $C_1$, $C_2$, and $C_3$ are the classes to which $w_1$, $w_2$, and $w_3$ respectively belong. Given another word $w_3' \in C_3$, the conditional probability that $w_3'$ follows the words $w_1, w_2$ is given by equation (12):

$P(w_3' \mid w_1, w_2) = P(C_3 \mid C_1, C_2) \cdot P(w_3' \mid C_3)$ (12)

The relative size of $P(w_3 \mid w_1, w_2)$ and $P(w_3' \mid w_1, w_2)$ therefore reduces to the relative size of $P(w_3 \mid C_3)$ and $P(w_3' \mid C_3)$. As described above, $P(w_3 \mid C_3)$ is generally given by equation (13):

$P(w_3 \mid C_3) = N(w_3) / N(C_3)$ (13)

From equation (13), the relative size of $P(w_3 \mid w_1, w_2)$ and $P(w_3' \mid w_1, w_2)$ ultimately depends on the relative size of the in-class occurrence frequencies (unigrams).

[0020] In such a case, even if the word chain frequencies satisfy $N(w_1 w_2 w_3) \gg N(w_1 w_2 w_3')$, if the in-class occurrence frequencies satisfy $N(w_3) / N(C_3) < N(w_3') / N(C_3)$, the relation of equation (14) results:

$P(w_3 \mid w_1, w_2) < P(w_3' \mid w_1, w_2)$ (14)

The ordering of the probabilities is therefore not represented correctly, and the accuracy of the word chain probabilities deteriorates.
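The inversion in equation (14) can be reproduced with a handful of invented counts. The values and names below are hypothetical and serve only to show the arithmetic of equations (11) to (13).

```python
# Hypothetical counts. w3 and w3_alt belong to the same class C3.
chain_count = {"w3": 40, "w3_alt": 1}         # N(w1 w2 w3) versus N(w1 w2 w3')
unigram_count = {"w3": 100, "w3_alt": 900}    # N(w3) versus N(w3')
class_total = sum(unigram_count.values())     # N(C3)
p_class_given_history = 0.2                   # P(C3 | C1, C2), identical for both words

def class_model(word):
    """Equations (11) and (12): P(C3 | C1, C2) * P(word | C3)."""
    return p_class_given_history * unigram_count[word] / class_total

inverted = (chain_count["w3"] > chain_count["w3_alt"]
            and class_model("w3") < class_model("w3_alt"))
```

The chain containing `w3` is observed forty times as often, yet the class model ranks `w3_alt` higher because only the in-class unigram frequency distinguishes the two words.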

[0021] The present invention has been made to solve the problems described above. Its object is to obtain a statistical language model generation apparatus and method that reduce the dynamic range of the frequency differences arising on backoff, and thereby prevent a word string containing a word with a high lower-order word (chain) frequency, such as a high unigram frequency, from receiving an unduly high language probability.

[0022] A further object of the present invention is to obtain a statistical language model generation apparatus and method that prevent the loss of word chain probability accuracy caused by using a class language model and that provide a highly accurate language model.

[0023] A further object of the present invention is to obtain a speech recognition apparatus whose speech recognition accuracy is improved by using the statistical language model generated by the above statistical language model generation apparatus.

[0024] A further object of the present invention is to obtain a computer-readable recording medium on which is recorded a program for causing a computer to carry out the above statistical language model generation method.

[0025] MEANS FOR SOLVING THE PROBLEMS: A statistical language model generation apparatus according to the present invention comprises: corpus storage means for storing a language corpus; word chain frequency counting means for counting word chain frequencies, which comprise the occurrence frequencies of the word chains present in the language corpus and the numbers of word types; word connection probability calculation means for calculating the conditional probabilities of word chains based on the word chain frequencies counted by the word chain frequency counting means; and word connection probability recalculation means which, when the word connection probability calculation means calculates the conditional probability of a word chain with a low occurrence frequency by backoff smoothing, inputs the backoff probability to a monotonically increasing function and takes the result of that function as the new conditional probability of the word chain.

[0026] In the statistical language model generation apparatus according to the present invention, the monotonically increasing function is a power function.

[0027] In the statistical language model generation apparatus according to the present invention, the monotonically increasing function is a sigmoid function.

[0028] In the statistical language model generation apparatus according to the present invention, the monotonically increasing function has upper and lower limits on its result.
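Paragraphs [0025] to [0028] name three kinds of function for transforming the backoff probability. The sketch below shows one possible reading of each; the exponent, gain, center, and limit values are illustrative assumptions, since this passage does not fix any parameter, and any renormalization of the transformed values is left out because the passage does not describe it.

```python
import math

def power(p, gamma=0.5):
    """p ** gamma; for 0 < gamma < 1 small values are raised more than large ones."""
    return p ** gamma

def sigmoid(p, gain=10.0, center=0.5):
    """Logistic curve in p; gain and center are illustrative parameters."""
    return 1.0 / (1.0 + math.exp(-gain * (p - center)))

def clipped(p, lower=1e-4, upper=1e-2):
    """Non-decreasing map whose result is bounded below and above."""
    return min(max(p, lower), upper)

p_high, p_low = 1.5e-2, 1.5e-5            # backoff probabilities of a frequent and a rare word
before = p_high / p_low                    # ratio of about 1000
after = power(p_high) / power(p_low)       # ratio shrinks to about sqrt(1000)
```

With an exponent of 0.5 the ordering of the two words is preserved while the gap between them narrows from three orders of magnitude to about one and a half, which is the kind of dynamic range reduction the stated object calls for.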

[0029] A statistical language model generation apparatus according to the present invention comprises: corpus dividing means for dividing the language corpus into a plurality of parts; parameter setting means for classifying the divided language corpora into language corpora used by the word chain frequency counting means for counting word chain frequencies and a language corpus used for language performance evaluation, and for setting the parameter of the monotonically increasing function used by the word connection probability recalculation means; language performance evaluation means for calculating the conditional probabilities of the word chains in the language corpus for language performance evaluation based on the word chain frequencies counted in the language corpora for frequency counting, and for evaluating language performance using these conditional probabilities; and optimum parameter determination means for detecting the parameter that gives the best language performance each time the parameter setting means swaps the classification of the divided language corpora and changes the parameter of the monotonically increasing function, and for determining the average of these parameters as the optimum parameter.
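The procedure in paragraph [0029] resembles cross-validation: rotate which part is held out, find the best parameter for each split, then average. The sketch below assumes a toy score (held-out log-likelihood of a power-transformed unigram model) as a stand-in for "language performance", which the passage does not define; all names, the floor value, and the candidate list are hypothetical.

```python
import math
from collections import Counter

def held_out_score(train, held_out, gamma, floor=1e-6):
    """Toy stand-in for language performance: held-out log-likelihood of a unigram
    model whose relative frequencies are raised to gamma and renormalized."""
    counts = Counter(train)
    total = sum(counts.values())
    scaled = {w: (c / total) ** gamma for w, c in counts.items()}
    norm = sum(scaled.values())
    return sum(math.log(scaled[w] / norm if w in scaled else floor) for w in held_out)

def select_parameter(parts, candidates, score):
    """Rotate the evaluation part, keep the best candidate per split, average them."""
    best_per_split = []
    for i, held_out in enumerate(parts):
        # Every part except the held-out one is used for frequency counting.
        train = [w for j, part in enumerate(parts) if j != i for w in part]
        best_per_split.append(max(candidates, key=lambda g: score(train, held_out, g)))
    return sum(best_per_split) / len(best_per_split)

parts = [s.split() for s in ("a a a b", "a a b c", "a b b c")]
gamma = select_parameter(parts, [0.25, 0.5, 1.0], held_out_score)
```

Averaging the per-split winners, rather than picking the single best run, follows the wording of the paragraph; the averaged value need not coincide with any one of the candidates.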

[0030] In the statistical language model generation apparatus according to the present invention, the parameter setting means classifies the divided language corpora into language corpora for counting word chain frequencies and a language corpus for language performance evaluation in a plurality of different combinations, and sets a different parameter for each combination; the optimum parameter determination means detects, for each combination, the parameter that gives the best language performance and determines the average of these parameters as the optimum parameter.

[0031] A statistical language model generation apparatus according to the present invention comprises: corpus storage means for storing a language corpus; word chain frequency counting means for counting word chain frequencies, which comprise the occurrence frequencies of the word chains present in the language corpus and the numbers of word types; word connection probability calculation means for calculating the conditional probabilities of word chains based on the word chain frequencies counted by the word chain frequency counting means; and similar word chain addition means for newly adding to the language corpus word chains similar to the word chains present in the language corpus.

[0032] The statistical language model generation apparatus according to the present invention comprises word class storage means for storing class information in which the words present in the language corpus are grouped into word classes of similar meaning. Based on the class information, the similar word chain addition means determines whether a word chain in the language corpus contains a word belonging to a given word class, and newly adds to the language corpus, as a similar word chain, the word chain obtained by replacing the word belonging to that word class with another word belonging to the same word class.

[0033] A statistical language model generation apparatus according to the present invention comprises: corpus storage means for storing a language corpus; word chain frequency counting means for counting word chain frequencies, which comprise the occurrence frequencies of the word chains present in the language corpus and the numbers of word types; word class storage means for storing class information in which the words present in the language corpus are grouped into word classes of similar meaning; inter-word-class frequency counting means for counting, based on the class information, the occurrence frequencies of the word chains present in the language corpus according to the word classes to which the words in each chain belong; and word connection probability calculation means for calculating the conditional probability of a word chain based on the word chain frequencies, calculating a conditional probability relating to the word classes of the words in the chain based on the occurrence frequencies calculated by the inter-word-class frequency counting means, and taking the linear combination of these probabilities as the new conditional probability of the word chain.

【００３４】The speech recognition apparatus according to the present invention recognizes input speech using the statistical language model obtained by the statistical language model generating apparatus described above.

【００３５】The statistical language model generating method according to the present invention comprises: a word chain frequency counting step of counting frequencies relating to word chains, including the occurrence frequency of each word chain in a language corpus and the number of word types; a word connection probability calculating step of calculating the conditional probability of a word chain from the frequencies counted in the word chain frequency counting step; a word connection probability recalculating step in which, when the conditional probability of a word chain with a low occurrence frequency is calculated by back-off smoothing in the word connection probability calculating step, the back-off probability is input to a monotonically increasing function and the output of that function is taken as the new conditional probability of the word chain; and a statistical language model generating step of generating a statistical language model from the conditional probabilities of the word chains obtained in the word connection probability calculating step and the word connection probability recalculating step.

【００３６】In the statistical language model generating method according to the present invention, the monotonically increasing function is a power function.

【００３７】In the statistical language model generating method according to the present invention, the monotonically increasing function is a sigmoid function.

【００３８】In the statistical language model generating method according to the present invention, the output of the monotonically increasing function has an upper limit and a lower limit.

【００３９】The statistical language model generating method according to the present invention divides the language corpus into a plurality of parts and comprises: a parameter setting step of classifying the parts into a language corpus for counting word chain frequencies and a language corpus for evaluating language performance, and setting the parameter of the monotonically increasing function used in the word connection probability recalculating step; a language performance evaluating step of calculating the conditional probabilities of the word chains in the evaluation corpus from the word chain frequencies counted in the frequency-counting corpus, and evaluating language performance using these conditional probabilities; and an optimal parameter determining step of obtaining, each time the parameter setting step and the language performance evaluating step are repeated, the parameter that gives the best language performance, and determining the average of these parameters as the optimal parameter.

【００４０】In the statistical language model generating method according to the present invention, the parameter setting step classifies the divided language corpora into a frequency-counting corpus and a performance-evaluation corpus in several different combinations and sets different parameters for each combination, and the optimal parameter determining step finds the parameter that gives the best language performance for each combination and determines the average of these parameters as the final optimal parameter.
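The selection procedure described here can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `select_parameter`, the leave-one-part-out pairing of counting and evaluation corpora, and the caller-supplied `perplexity` callback are all assumptions introduced for the example.

```python
from statistics import mean
from typing import Callable, Sequence


def select_parameter(
    parts: Sequence[list],
    candidates: Sequence[float],
    perplexity: Callable[[list, list, float], float],
) -> float:
    """Average the best candidate parameter over several corpus pairings.

    parts      -- the divided corpus, one list of sentences per part
    candidates -- parameter values to try (for example, power exponents)
    perplexity -- hypothetical callback: (counting corpus, evaluation corpus,
                  parameter) -> perplexity, where lower is better
    """
    best_per_pairing = []
    for i, eval_part in enumerate(parts):
        # Assumption: each part serves once as the evaluation corpus and the
        # remaining parts are merged into the frequency-counting corpus.
        counting = [s for j, part in enumerate(parts) if j != i for s in part]
        best = min(candidates, key=lambda c: perplexity(counting, eval_part, c))
        best_per_pairing.append(best)
    # The final parameter is the average of the per-pairing winners.
    return mean(best_per_pairing)
```

Averaging the per-pairing winners, rather than keeping a single winner, makes the final parameter less sensitive to the accident of which part was held out.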

【００４１】The statistical language model generating method according to the present invention comprises: a similar word chain adding step of newly adding to a language corpus word chains that are similar to word chains already in the corpus; a word chain frequency counting step of counting frequencies relating to word chains, including the occurrence frequency of each word chain in the language corpus and the number of word types; a word connection probability calculating step of calculating the conditional probability of a word chain from the frequencies counted in the word chain frequency counting step; and a statistical language model generating step of generating a statistical language model from the conditional probabilities of the word chains obtained in the word connection probability calculating step.

【００４２】In the statistical language model generating method according to the present invention, the similar word chain adding step determines, based on class information that classifies the words in the language corpus, whether a word belonging to a predetermined word class is contained in a word chain in the language corpus, and newly adds to the language corpus, as a similar word chain, the word chain obtained by replacing that word with another word belonging to the same word class.
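As a rough illustration of this substitution, the sketch below generates similar word chains by swapping one class member for another. The function name, the tuple representation of word chains, and the class table are assumptions made for the example; the source does not specify a data layout.

```python
def add_similar_chains(chains, word_classes, target_class):
    """Return the chains plus copies in which a class member is swapped out.

    chains       -- list of word chains, each a tuple of words
    word_classes -- hypothetical class table: class name -> list of member words
    target_class -- the predetermined word class to substitute within
    """
    members = word_classes[target_class]
    added = []
    for chain in chains:
        for pos, word in enumerate(chain):
            if word not in members:
                continue
            # Replace the class member with every other word of the same class.
            for other in members:
                if other != word:
                    added.append(chain[:pos] + (other,) + chain[pos + 1:])
    return list(chains) + added
```

For instance, with a class containing "tomorrow" and "today", the chain ("tomorrow", "no", "weather") would contribute the additional chain ("today", "no", "weather").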

【００４３】The statistical language model generating method according to the present invention comprises: a word chain frequency counting step of counting frequencies relating to word chains, including the occurrence frequency of each word chain in a language corpus and the number of word types; an inter-word-class frequency counting step of counting the occurrence frequency of the word chains in the language corpus by the word classes to which the words in each chain belong, based on class information in which the words in the language corpus are grouped into word classes of similar meaning; a word connection probability calculating step of calculating the conditional probability of a word chain from the word chain frequencies, calculating the conditional probability relating to the word classes to which the words in the chain belong from the occurrence frequencies counted in the inter-word-class frequency counting step, and taking the linear combination of the two as the new conditional probability of the word chain; and a statistical language model generating step of generating a statistical language model from the conditional probabilities of the word chains obtained in the word connection probability calculating step.

【００４４】The recording medium according to the present invention records a program for causing a computer to execute the statistical language model generating method described above.

【００４５】[0045]

【発明の実施の形態】DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS An embodiment of the present invention is described below.
Embodiment 1. FIG. 1 shows the configuration of a statistical language model generating apparatus according to Embodiment 1 of the present invention. In the figure, reference numeral 1 denotes corpus storage means for storing the language corpus used to build the language model; it is realized by the hard disk of a computer system or by another large-capacity storage medium. As in the related art described earlier, the language corpus is, for example, a language database made up of a large number of sentences such as newspaper text. Reference numeral 2 denotes corpus input means, which reads the desired language corpus from the corpus storage means 1 and holds it so that the word chain frequency counting means 3 can use it. Reference numeral 3 denotes the word chain frequency counting means, which counts the frequencies relating to the word chains in the language corpus held by the corpus input means 2. These frequencies include the occurrence frequency of each word chain in the language corpus and the number of word types that follow each word chain. Reference numeral 4 denotes word connection probability calculating means, which calculates the conditional probability with which words are chained from the frequencies counted by the word chain frequency counting means 3. In the illustrated example it calculates the conditional probability with smoothing by the Witten-Bell method described in the related art.

【００４６】Reference numeral 5 denotes word connection probability recalculating means. When the conditional probability of a word chain whose counted occurrence frequency is 0, or extremely close to 0, is calculated by back-off smoothing, this means inputs the back-off probability to a monotonically increasing function and recalculates the conditional probability as the output of that function. Reference numeral 6 denotes language model generating means, which holds the conditional probability of each word chain calculated by the word connection probability calculating means 4 and the word connection probability recalculating means 5 and generates the statistical language model.

【００４７】Next, the operation is described. FIG. 2 is a flowchart of the statistical language model generating operation of the apparatus of FIG. 1; the description follows this flowchart and takes the creation of a trigram language model as an example. First, the corpus input means 2 reads the language corpus from the corpus storage means 1 and holds it so that the word chain frequency counting means 3 can use it (step ST1).

【００４８】Next, the word chain frequency counting means 3 counts the frequencies relating to word chains in the language corpus read into the corpus input means 2 (step ST2, word chain frequency counting step). As a concrete example, assume that the vocabulary of the language corpus contains the words "tomorrow" (明日), "no" (の, a particle), "weather" (天気), "train" (電車), and "situation" (状況), and that the corpus contains the fragment "... tomorrow's weather ..." (明日の天気) five times, "... tomorrow's train ..." (明日の電車) three times, and "... tomorrow's situation ..." (明日の状況) four times. Writing the occurrence frequency of the word chain $w_1, w_2, w_3$ as $N(w_1, w_2, w_3)$, the word chain frequency counting means 3 counts the occurrence frequencies of these chains as in equation (15):

$$N(\text{tomorrow}, \text{no}, \text{weather}) = 5,\quad N(\text{tomorrow}, \text{no}, \text{train}) = 3,\quad N(\text{tomorrow}, \text{no}, \text{situation}) = 4 \tag{15}$$

It follows that $N(\text{tomorrow}, \text{no}) = 5 + 3 + 4 = 12$. In the same way, the word chain frequency counting means 3 counts the occurrence frequency $N(w_1, w_2)$ of each two-word chain and the occurrence frequency $N(w_1)$ of each word.

【００４９】The word chain frequency counting means 3 also counts the number of word types $R(w_1, w_2)$ that follow a given word chain $w_1 w_2$. In the example above, three types of word follow "tomorrow"-"no", namely "weather", "train", and "situation", so $R(\text{tomorrow}, \text{no}) = 3$. $R(w_1)$ is counted in the same way. In the example above, the only word that follows "tomorrow" is "no" (in all three chains), so $R(\text{tomorrow}) = 1$, while $R(\text{no}) = 3$.
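The counts in this example can be reproduced with a short sketch. The function name `count_chains` and the dictionary-based representation are assumptions for illustration; the English glosses stand in for the Japanese words of the example.

```python
from collections import Counter, defaultdict


def count_chains(sentences, max_order=3):
    """Count N (chain occurrences) and R (distinct following word types).

    sentences -- iterable of word lists
    Returns (n_counts, r_counts), both keyed by tuples of words.
    """
    n_counts = Counter()
    followers = defaultdict(set)
    for words in sentences:
        for order in range(1, max_order + 1):
            for i in range(len(words) - order + 1):
                chain = tuple(words[i:i + order])
                n_counts[chain] += 1
                if order > 1:
                    # Record the last word as a follower of the preceding chain.
                    followers[chain[:-1]].add(chain[-1])
    r_counts = {context: len(types) for context, types in followers.items()}
    return n_counts, r_counts


corpus = (
    [["tomorrow", "no", "weather"]] * 5
    + [["tomorrow", "no", "train"]] * 3
    + [["tomorrow", "no", "situation"]] * 4
)
N, R = count_chains(corpus)
print(N[("tomorrow", "no", "weather")], N[("tomorrow", "no")])  # 5 12
print(R[("tomorrow", "no")], R[("tomorrow",)], R[("no",)])      # 3 1 3
```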

【００５０】Next, the word connection probability calculating means 4 takes a three-word chain $w_1 w_2 w_3$ and estimates its word chain probability $P(w_3 \mid w_1, w_2)$. Specifically, when the given word chain $w_1 w_2 w_3$ exists in the language corpus, that is, when $N(w_1, w_2, w_3) > 0$, the word connection probability calculating means 4 estimates the probability according to the first expression of equation (16) below. This is the same as the Witten-Bell smoothing described in the related art (step ST3, word connection probability calculating step).
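For orientation, the sketch below shows the standard Witten-Bell estimate for a chain that was observed in the corpus. It assumes that the first expression of equation (16), which is not reproduced in this text, takes this textbook form; the function name is likewise an assumption.

```python
def witten_bell_seen(n_chain: int, n_context: int, r_context: int) -> float:
    """Standard Witten-Bell estimate for an observed chain.

    n_chain   -- N(w1, w2, w3), occurrences of the full chain (must be > 0)
    n_context -- N(w1, w2), occurrences of the context
    r_context -- R(w1, w2), number of distinct word types following the context
    Assumption: this is the textbook formula, not a transcription of eq. (16).
    """
    if n_chain <= 0:
        raise ValueError("use the back-off branch for unseen chains")
    # Probability mass r / (n + r) is reserved for unseen continuations.
    return n_chain / (n_context + r_context)
```

With the counts of the running example, N(tomorrow, no, weather) = 5, N(tomorrow, no) = 12, and R(tomorrow, no) = 3, this estimate would be 5 / 15.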

【００５１】When $N(w_1, w_2, w_3) = 0$ and $N(w_1, w_2) > 0$, the word connection probability recalculating means 5 uses the back-off probability $P(w_3 \mid w_2)$ to calculate the conditional probability $P'(w_3 \mid w_2)$ given by equation (17) below, and then estimates the conditional probability of the word chain whose occurrence frequency is 0 according to the second expression of equation (16) below (step ST4, word connection probability calculating step, word connection probability recalculating step).

【００５２】When $N(w_1, w_2, w_3) = 0$ and $N(w_1, w_2) = 0$, the word connection probability recalculating means 5 uses the back-off probability $P(w_3 \mid w_2)$ to calculate the conditional probability $P'(w_3 \mid w_2)$ given by equation (17) below, and then estimates the conditional probability of the word chain whose occurrence frequency is 0 according to the third expression of equation (16) below (step ST5, word connection probability calculating step, word connection probability recalculating step). In doing so, the word connection probability recalculating means 5 calculates the conditional probability $P'(w_3 \mid w_2)$ by equation (17), using the back-off probability $P(w_2)$ together with equations (18) and (19) below.

【数２】(Equation 2) Here, in equations (16) to (19), $\alpha(w_1)$ and $\beta(w_1, w_2)$ are normalization constants, and $\delta$ and $\lambda$ are the power parameters, with $\delta \le 1$ and $\lambda \le 1$. The denominators of $P'(w_3 \mid w_2)$ and $P'(w_2)$ are also normalization constants. The sum in the denominator on the right-hand side of equation (17) is taken over all $w_3$ for which $N(w_2, w_3) > 0$ in the language corpus, and the sum in the denominator on the right-hand side of equation (19) is taken over all words $w_2$ for which $N(w_2) > 0$ in the language corpus.
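The power transformation with renormalization can be sketched as follows. This assumes that equation (17) has the form the text describes, namely the back-off probability raised to the power δ and divided by the sum of the same quantity over all w3 with N(w2, w3) > 0; the function name and the dictionary input are assumptions for illustration.

```python
def power_renormalize(backoff_probs: dict, delta: float) -> dict:
    """Raise back-off probabilities to the power delta and renormalize.

    backoff_probs -- maps each w3 with N(w2, w3) > 0 to P(w3 | w2)
    delta         -- power parameter; the text requires delta <= 1, and a
                     positive value is assumed here
    """
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta is expected to satisfy 0 < delta <= 1")
    powered = {w: p ** delta for w, p in backoff_probs.items()}
    # Normalization constant: the denominator described for equation (17).
    z = sum(powered.values())
    return {w: v / z for w, v in powered.items()}
```

For example, probabilities 0.8 and 0.2 differ by a factor of 4; with δ = 0.5 the transformed values differ by a factor of about 2 and still sum to 1.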

【００５３】Under the control of the language model generating means 6, the word connection probability calculating means 4 and the word connection probability recalculating means 5 carry out the above processing for every possible word chain $(w_1, w_2, w_3)$. The conditional probability of each word chain is thereby held in the language model generating means 6, and the trigram language model is created (step ST5, statistical language model generating step).

【００５４】In this way, when backing off in the calculation of the conditional probability of a word chain, raising the conditional probability of the lower-order word chain to a power reduces the dynamic range of the frequency differences. For example, if the unigram probability of a word $w_2$ is twice that of a word $w_1$, then $P(w_2) = 2P(w_1)$. Raising this to the power 0.5 gives $P'(w_2) = P(w_2)^{0.5} = (2P(w_1))^{0.5} = 2^{0.5} \cdot P(w_1)^{0.5} \approx 1.4 \cdot P'(w_1)$, so the ratio falls to about 1.4 and the dynamic range becomes smaller. This prevents a word string containing a word with a high unigram frequency from receiving an unduly high language probability.
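The same reasoning holds for any ratio and any exponent. As a sketch, write $r$ for the ratio of the two unigram probabilities and $Z$ for the shared normalization constant in the denominator of the power-transformed probability; both symbols are introduced here for illustration only, and $\lambda$ is the unigram power parameter.

```latex
\[
P(w_2) = r\,P(w_1)
\;\Longrightarrow\;
\frac{P'(w_2)}{P'(w_1)}
= \frac{P(w_2)^{\lambda}/Z}{P(w_1)^{\lambda}/Z}
= r^{\lambda},
\qquad
r^{\lambda} < r \quad \text{for } r > 1,\; 0 < \lambda < 1 .
\]
```

With $r = 2$ and $\lambda = 0.5$ this gives $2^{0.5} \approx 1.41$, in line with the example in the text. The normalization constant cancels, so the compression of the ratio depends only on the exponent.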

【００５５】It is clear that the same effect is obtained when, instead of raising the conditional probability of the lower-order word chain to a power at back-off, any monotonically increasing function that reduces the dynamic range is applied. FIG. 3 shows monotonically increasing functions used by the statistical language model generating apparatus of Embodiment 1: (a) shows a power function, (b) shows a sigmoid function, and (c) and (d) show functions whose output has upper and lower limits. In the description so far, the conditional probability of the lower-order word chain calculated during back-off smoothing was substituted into a monotonically increasing function (a power function) with a positive power parameter smaller than 1, as shown in FIG. 3(a), and the result was taken as the new conditional probability of the word chain (for example, $f(P) = aP^{b}$, where $b$ is the power parameter).

【００５６】The apparatus of the present invention may instead use, as the monotonically increasing function $f$, a sigmoid function as shown in FIG. 3(b) or a function with upper and lower limits as shown in FIGS. 3(c) and 3(d). The sigmoid function of FIG. 3(b), $f(P) = 1/(1 + e^{-\alpha(P + \beta)})$ with $\alpha > 0$, is a nonlinear function, and the degree of nonlinearity can be changed by adjusting the value of the parameter $\alpha$ (according to the source, increasing $\alpha$ makes the function more linear and decreasing it makes the function more nonlinear). By adjusting $\alpha$, the range of values that the conditional probability can take can therefore be set widely, and the processing can be adapted flexibly so that the dynamic range caused by frequency differences between the word chains of interest takes an appropriate value. With a function whose output is limited above and below, as in FIGS. 3(c) and 3(d), the probability is compressed only for words with a high occurrence frequency, such as case particles. That is, the estimated conditional probability of a frequent word such as a case particle is correspondingly large, but mapping it through the function forces the estimate to the upper limit. Such a function with upper and lower limits can be realized by adding output-limiting parameters to the sigmoid function of FIG. 3(b). The example of FIG. 3(c) can be realized by a sigmoid function with output-limiting parameters and a large value of $\alpha$, and the example of FIG. 3(d) by a sigmoid function with output-limiting parameters and a small value of $\alpha$.
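The sigmoid variant and a limited-output variant can be sketched as below. The sigmoid follows the formula given in the text; realizing the upper and lower limits by clamping the output is an assumption, since the source does not say how the limiting parameters enter the function. All function and parameter names are illustrative.

```python
import math


def sigmoid(p: float, alpha: float, beta: float) -> float:
    """f(P) = 1 / (1 + exp(-alpha * (P + beta))), with alpha > 0."""
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    return 1.0 / (1.0 + math.exp(-alpha * (p + beta)))


def bounded_sigmoid(p: float, alpha: float, beta: float,
                    lower: float, upper: float) -> float:
    """Sigmoid whose output is clamped to [lower, upper].

    Assumption: clamping is one simple way to impose the limits shown in
    FIGS. 3(c) and 3(d); the source may intend a different construction.
    """
    return min(upper, max(lower, sigmoid(p, alpha, beta)))


def renormalize(values: dict) -> dict:
    """Rescale transformed values so that they sum to 1 again."""
    z = sum(values.values())
    return {w: v / z for w, v in values.items()}
```

Because the transformed values no longer sum to 1, a renormalization such as `renormalize` is needed before they are used as probabilities, which corresponds to the normalization conditions the text places on the functions.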

【００５７】Concrete examples of the conditional probability of a word chain when such a monotonically increasing function $f$ is used are given in equations (20) and (21) below.

【数３】(Equation 3) However, because $f$ and $g$ take the place of probabilities, they must satisfy the normalization conditions given in equations (22) and (23) below.

【数４】(Equation 4) Using monotonically increasing functions $f$ and $g$ of this kind gives the same effect as substituting the probability into the power function in the embodiment described above.

【００５８】Although Embodiment 1 has been described using smoothing by the Witten-Bell method as an example, the same effect is naturally obtained with other conventional smoothing methods.

【００５９】As described above, according to Embodiment 1, when the conditional probability of a word chain with a low occurrence frequency (0 or extremely close to 0) is calculated by back-off smoothing, the back-off probability is input to a monotonically increasing function and the output of that function is taken as the new conditional probability of the word chain. This prevents a word chain containing a frequently occurring word from receiving an unduly high language probability.

【００６０】Furthermore, according to Embodiment 1, a power function, a sigmoid function, or a function whose output has upper and lower limits is used as the monotonically increasing function, so that by adjusting the respective parameters the dynamic range caused by frequency differences between the word chains of interest can be brought to an appropriate value.

【００６１】Embodiment 2. Embodiment 2 shows a configuration for determining the parameter of the power function (or of the monotonically increasing function used in its place) in Embodiment 1.

【００６２】FIG. 4 shows the configuration of a statistical language model generating apparatus according to Embodiment 2 of the present invention. In the figure, reference numeral 7 denotes corpus dividing means for dividing the language corpus into a plurality of parts, for example by arranging the texts of the language corpus stored in the corpus storage means 1 in chronological order and dividing them by period, or by selecting texts at random. Reference numeral 8 denotes parameter setting means, which classifies the divided language corpora into a language corpus used by the word chain frequency counting means 3 for counting word chain frequencies and a language corpus for evaluating language performance, and which sets the parameter of the monotonically increasing function used by the word connection probability recalculating means 5, for example the power parameter. Reference numeral 9 denotes perplexity calculating means (language performance evaluating means), which calculates the perplexity, an index of language performance, from the conditional probabilities of the word chains calculated on the evaluation corpus. Reference numeral 10 denotes optimal parameter determining means, which determines the parameter of the monotonically increasing function that gives the best perplexity: each time the parameter setting means 8 changes the classification of the language corpora or the parameter of the monotonically increasing function, it obtains the perplexity value calculated by the perplexity calculating means 9, and it determines as the optimal parameter the parameter for which the perplexity is lowest. The language model generating means 6 is omitted from the illustrated example, but the apparatus is assumed to include it and, as in Embodiment 1, to generate the statistical language model from the conditional probabilities of the word chains calculated by the word connection probability calculating means 4 and the word connection probability recalculating means 5. Components identical to those in FIG. 1 are given the same reference numerals, and their description is not repeated.

【００６３】Next, the operation is described. FIG. 5 is a flowchart of the statistical language model generating operation of the apparatus of FIG. 4; the description follows this flowchart and takes the creation of a trigram language model as an example. First, as in Embodiment 1, the corpus input means 2 reads the language corpus from the corpus storage means 1 and holds it so that the corpus dividing means 7 can use it (step ST1a).

【００６４】Next, the corpus dividing means 7 divides the language corpus read into the corpus input means 2 into two parts, referred to as language corpora $C_1$ and $C_2$ (step ST2a, parameter setting step). As mentioned above, the corpus may be divided by arranging its texts in chronological order and dividing them by period, or by selecting texts at random.
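The two division methods can be sketched as follows. The function name, the assumption that the input list is already in chronological order, and the use of a seeded shuffle for the random variant are all illustrative choices rather than details from the source.

```python
import random


def split_corpus(texts, parts=2, shuffle=False, seed=0):
    """Divide a corpus into `parts` contiguous pieces.

    texts   -- list of texts, assumed to be in chronological order
    shuffle -- if True, divide after a random shuffle instead of by time order
    seed    -- seed for the shuffle, so that the split is reproducible
    """
    items = list(texts)
    if shuffle:
        random.Random(seed).shuffle(items)
    n = len(items)
    # Integer arithmetic keeps the pieces contiguous and nearly equal in size.
    return [items[k * n // parts:(k + 1) * n // parts] for k in range(parts)]
```

A call such as `split_corpus(texts, parts=2)` would yield the two corpora referred to as C1 and C2.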

【００６５】Once the corpus dividing means 7 has divided the language corpus, the parameter setting means 8 assigns one of the divided corpora (here, say, $C_1$) to counting word chain frequencies and the other (language corpus $C_2$) to evaluating language performance. The word chain frequency counting means 3 then counts, in the frequency-counting corpus $C_1$, the occurrence frequency $N(w_1, w_2, w_3)$ of each three-word chain, the occurrence frequency $N(w_1, w_2)$ of each two-word chain, the occurrence frequency $N(w_1)$ of each word, and the numbers of word types $R(w_1, w_2)$ and $R(w_1)$ that follow a given word chain (step ST3a, word chain frequency counting step).

【００６６】At this point the parameter setting means 8 sets the parameter of the monotonically increasing function described in Embodiment 1 to a suitable value (step ST4a, parameter setting step). The parameter may be set, for example, by automatically reading a number of predetermined parameter values stored in advance in a memory (not shown), or by having the user enter a desired value through input means (not shown).

[0067] Next, the perplexity calculating means 9 calculates the perplexity PP for the language corpus C₂. Perplexity PP is one measure of the performance of a statistical language model, and it is related to the appearance probability P(w₁w₂w₃...wₙ) of a sentence S = w₁w₂w₃...wₙ by equation (24) below. Concretely, perplexity PP is the average branching number, corresponding to the number of trials needed to determine the word that follows; the smaller this value, the better the language model is considered to perform.

$$PP = P(w_1 w_2 w_3 \cdots w_n)^{-1/n} \qquad (24)$$

In the case of a trigram model, the sentence probability is approximated by equation (25) below.

$$P(w_1 w_2 w_3 \cdots w_n) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1 w_2) \cdot P(w_4 \mid w_2 w_3) \cdots \qquad (25)$$

The terms P(w₃|w₁w₂), P(w₄|w₂w₃), ... are calculated by the word connection probability calculating means 4 and the word connection probability recalculating means 5 using equations (16) to (19) (or equations (20) to (23)) given in the first embodiment (step ST5a, language performance evaluation step). The perplexity PP obtained in step ST5a is output to the optimum parameter determining means 10.
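As an illustration of equations (24) and (25), the following Python sketch computes the perplexity of one sentence from trigram conditional probabilities. The function name `perplexity` and the callable `cond_prob` are hypothetical stand-ins for the probabilities supplied by the calculating means; logarithms are used only to avoid numerical underflow, and every probability is assumed to be positive.

```python
import math

def perplexity(sentence, cond_prob):
    """PP = P(w_1 ... w_n) ** (-1/n), with P factored as in the trigram approximation.

    cond_prob(word, history) is an assumed callable returning P(word | history),
    where history is a tuple of up to two preceding words.
    """
    log_p = 0.0
    for i, word in enumerate(sentence):
        history = tuple(sentence[max(0, i - 2):i])
        log_p += math.log(cond_prob(word, history))
    return math.exp(-log_p / len(sentence))
```

For a model that assigns every word the same probability 1/V, this returns V, which matches the reading of perplexity as an average branching number.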

[0068] Next, the parameter setting means 8 sets the parameter of the monotonically increasing function to another value, and the perplexity calculating means 9 calculates the perplexity PP again. The operations up to this point are repeated for several parameter values (step ST6a, language performance evaluation step). As a result, the perplexity PP corresponding to each parameter value is held in the optimum parameter determining means 10.

[0069] Once the optimum parameter determining means 10 holds all of the perplexities PP calculated for the several parameter values, it determines the parameter value that gives the smallest perplexity PP (the best performance) as parameter 1 (step ST7a, optimum parameter determination step).

[0070] Further, the assignment of the language corpora to frequency counting and language performance evaluation made by the parameter setting means 8 in step ST3a is swapped, so that the language corpus C₂ is used for frequency counting and the language corpus C₁ for language performance evaluation. After the swap, the same processing is performed: the word chain frequencies are counted in the language corpus C₂, the perplexity PP is calculated using the language corpus C₁, and the parameter value that gives the smallest perplexity PP is determined as parameter 2 (steps ST8a to ST12a; parameter setting step, language performance evaluation step, optimum parameter determination step).

[0071] When the above operations are complete and parameters 1 and 2, each giving the smallest perplexity PP for its assignment, have been obtained, the optimum parameter determining means 10 calculates the average of parameters 1 and 2 and takes it as the final optimum parameter (step ST13a).
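The search over parameter values and the final averaging can be sketched as follows. The names `best_parameter`, `select_parameter`, and the callable `evaluate` are hypothetical; `evaluate(count_corpus, eval_corpus, param)` is assumed to count word chain frequencies on the first corpus and return the perplexity measured on the second for the given parameter value.

```python
def best_parameter(count_corpus, eval_corpus, candidates, evaluate):
    """Return the candidate value giving the smallest perplexity on eval_corpus."""
    return min(candidates, key=lambda p: evaluate(count_corpus, eval_corpus, p))

def select_parameter(c1, c2, candidates, evaluate):
    """Two-way procedure: pick the best value for each assignment, then average."""
    param_1 = best_parameter(c1, c2, candidates, evaluate)  # count on C1, evaluate on C2
    param_2 = best_parameter(c2, c1, candidates, evaluate)  # roles swapped
    return (param_1 + param_2) / 2.0
```

Averaging the two winners, rather than pooling the perplexities, follows the procedure described in the text.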

[0072] As described above, according to the second embodiment, the language corpus is divided into a plurality of parts, which are assigned to a language corpus for counting word chain frequencies and a language corpus for language performance evaluation, and a parameter of the monotonically increasing function is set. The conditional probabilities of word chains in the evaluation corpus are calculated from the word chain frequencies counted in the frequency-counting corpus, and these conditional probabilities are used to obtain the perplexity PP, an index of language performance. These operations are repeated to obtain the parameters that give the smallest perplexity PP, and the average of these parameters is determined as the optimum parameter. The parameter of the monotonically increasing function used by the word connection probability recalculating means 5 of the first embodiment can therefore be set so as to generate the statistical language model with the best language performance.

[0073] In addition to what is described in the second embodiment, the corpus dividing means 7 may divide the language corpus into three or more parts, assign them to language corpora for counting word chain frequencies and language corpora for calculating the perplexity PP for language performance evaluation, set different parameters for each combination, detect for each combination the parameter that gives the smallest perplexity PP (the best language performance), and take the average of these parameters as the final optimum parameter. This provides the same effect as the second embodiment.

[0074] Embodiment 3. The third embodiment prevents, in particular, the loss of accuracy in the word chain probability (the conditional probability of a word chain) caused by using a class language model, and provides a highly accurate language model.

[0075] FIG. 6 is a diagram showing the configuration of a statistical language model generation device according to the third embodiment of the present invention. In the figure, 11 is class information storage means (word class storage means), which stores class information on the words in the language corpus. The class information groups words in the language corpus that are close in meaning into classes and associates each class with its words. 12 is class information input means, which reads the class information from the class information storage means 11 and holds it so that the similar word chain adding means 13 can use it. 13 is similar word chain adding means, which newly adds to the language corpus word chains similar to those present in it. 14 is an additional corpus, that is, the language corpus to which similar word chains have been newly added by the similar word chain adding means 13. Although the word connection probability calculating means 4, the language model generating means 6, and the word connection probability recalculating means 5 are omitted from the illustrated example, the device is assumed to actually include them, and the language model generating means 6 generates the statistical language model using the conditional probabilities of word chains calculated by the word connection probability calculating means 4. Components identical to those in FIG. 1 are given the same reference numerals, and redundant description is omitted.

[0076] Next, the operation will be described. FIGS. 7, 8, and 9 are flowcharts showing the statistical language model generation operation of the statistical language model generation device of FIG. 6; the case of creating a trigram language model is described as an example along these flowcharts. First, as in the first embodiment, the corpus input means 2 reads the language corpus from the corpus storage means 1 and holds it so that the word chain frequency counting means 3 and the similar word chain adding means 13 can use it (step ST1b).

[0077] Next, the word chain frequency counting means 3 counts the frequencies related to word chains from the language corpus read by the corpus input means 2 (step ST2b, word chain frequency counting step). Specifically, the word chain frequency counting means 3 counts the appearance frequency N(w₁, w₂, w₃) of three-word chains, the appearance frequency N(w₁, w₂) of two-word chains, and the appearance frequency N(w₁) of words in the language corpus.
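The counting step can be sketched in Python as below. The function name `count_chains` and the representation of the corpus as a list of already segmented sentences (lists of words) are assumptions made for illustration; word segmentation itself is outside the scope of the sketch.

```python
from collections import Counter

def count_chains(sentences):
    """Count N(w1), N(w1, w2) and N(w1, w2, w3) over a list of word lists."""
    n1, n2, n3 = Counter(), Counter(), Counter()
    for words in sentences:
        n1.update(words)                              # single words
        n2.update(zip(words, words[1:]))              # two-word chains
        n3.update(zip(words, words[1:], words[2:]))   # three-word chains
    return n1, n2, n3
```

Chains are counted within each sentence only; whether chains may span sentence boundaries is not stated in the text.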

[0078] After this, the class information input means 12 reads the class information from the class information storage means 11 and holds it so that the similar word chain adding means 13 can use it (step ST3b). The classification may be an intuitive manual classification or a mechanical classification based on some criterion. Not every word needs to belong to a class. Moreover, a word need not belong to only one class; it may belong to several. FIG. 10 is a diagram showing an example of the class information used in the statistical language model generation device according to the third embodiment. In the illustrated example, words are classified by type and efficacy of medicine, and the words in the language corpus are managed in association with classes 1, 2, 3, .... This class information is an example of intuitive manual classification for the medical field: words are grouped into classes such as topical medicines, oral medicines, and abdominal diseases, and the related words are assigned to each of classes 1, 2, 3, ....

[0079] Next, the similar word chain adding means 13 selects one arbitrary two-word chain w₁-w₂ present in the language corpus. There are usually several kinds of words w₃ that follow this two-word chain w₁-w₂. The similar word chain adding means 13 then selects a class C and checks whether any word belonging to class C follows w₁-w₂ (step ST4b). If some word x belonging to class C follows w₁-w₂ (that is, N(w₁, w₂, x) > 0), then another word y belonging to class C is regarded as potentially able to follow it as well, even if N(w₁, w₂, y) = 0, and N(w₁, w₂, y) is given a predetermined appearance frequency k (step ST5b, similar word chain adding step).

[0080] The operation described above is intended to prevent the problem that a word chain which would appear if the language corpus were very large (holding a large amount of text) has an appearance frequency, expressed as an integer, of 0 because the language corpus is small. For this reason, a value in the range 0 < k ≤ 1 is appropriate for the appearance frequency k. For example, for class 1 "topical medicines", if the word chain w₁ = "every day", w₂ = "evening", x = "powder" is in the language corpus, then the word chain "every day"-"evening"-"Vaseline", formed with the word "Vaseline" that belongs to the same class 1 as "powder" in FIG. 10, is given the appearance frequency k even if it does not occur in the corpus. In this way, the similar word chain adding means 13 assigns appearance frequencies to similar word chains, newly adds them to the language corpus, and holds the result as the additional corpus 14.
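A minimal Python sketch of this step for the last position of a three-word chain is shown below. The function name `add_similar_trigrams`, the dictionary layout of the counts and classes, and the default k = 0.5 are assumptions for illustration; the text only requires 0 < k ≤ 1.

```python
def add_similar_trigrams(n3, classes, k=0.5):
    """Give frequency k to unseen chains (w1, w2, y) when a classmate x of y
    was observed after the same two-word chain, i.e. N(w1, w2, x) > 0.

    n3: dict mapping (w1, w2, w3) to an observed count.
    classes: dict mapping a class name to the set of its member words.
    """
    augmented = dict(n3)
    for members in classes.values():
        for (w1, w2, x), count in n3.items():
            if count > 0 and x in members:
                for y in members:
                    if augmented.get((w1, w2, y), 0) == 0:
                        augmented[(w1, w2, y)] = k
    return augmented
```

Observed counts are left unchanged; only chains whose count was 0 receive the fractional frequency k.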

[0081] The similar word chain adding means 13 then performs the above processing for all classes C and all two-word chains w₁-w₂ (steps ST6b and ST7b, similar word chain adding step).

[0082] Next, the similar word chain adding means 13 selects a word chain w₁-*-w₃ present in the language corpus. Here, * is an arbitrary word in the middle position of the three-word chain, and there are usually several kinds of such words. The similar word chain adding means 13 then selects a class C and checks whether any word belonging to class C occurs at the * position of w₁-*-w₃ (step ST8b). If some word x belonging to class C occurs at the * position of w₁-*-w₃ (that is, N(w₁, x, w₃) > 0), then another word y belonging to class C is regarded as potentially able to occur there as well, even if N(w₁, y, w₃) = 0, and N(w₁, y, w₃) is given a predetermined appearance frequency k (step ST9b, similar word chain adding step). In this way, the similar word chain adding means 13 assigns appearance frequencies to similar word chains, newly adds them to the language corpus, and holds the result as the additional corpus 14.

[0083] The similar word chain adding means 13 then performs the above processing for all classes C and all combinations of w₁ and w₃ (steps ST10b and ST11b, similar word chain adding step).

[0084] Next, the similar word chain adding means 13 selects a word chain *-w₂-w₃ present in the language corpus. Here, * is an arbitrary word in the first position for which the three-word chain exists, and there are usually several kinds of such words. The similar word chain adding means 13 then selects a class C and checks whether any word belonging to class C occurs at the * position of *-w₂-w₃ (step ST12b). If some word x belonging to class C occurs at the * position of *-w₂-w₃ (that is, N(x, w₂, w₃) > 0), then another word y belonging to class C is regarded as potentially able to occur there as well, even if N(y, w₂, w₃) = 0, and N(y, w₂, w₃) is given a predetermined appearance frequency k (step ST13b, similar word chain adding step). In this way, the similar word chain adding means 13 assigns appearance frequencies to similar word chains, newly adds them to the language corpus, and holds the result as the additional corpus 14.

[0085] The similar word chain adding means 13 then performs the above processing for all classes C and all combinations of w₂-w₃ (steps ST14b and ST15b, similar word chain adding step).
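The substitution passes over the last, middle, and first positions differ only in which slot of the three-word chain is replaced, so they can be expressed with a single helper. The following Python sketch assumes this reading; the function name `add_similar_chains`, the 0-based `position` argument, and the default k = 0.5 are hypothetical.

```python
def add_similar_chains(n3, members, position, k=0.5):
    """Substitute classmates at one slot (0, 1 or 2) of each observed trigram.

    n3: dict mapping (w1, w2, w3) to an observed count.
    members: set of words belonging to one class C.
    Unseen chains obtained by substitution receive the frequency k.
    """
    augmented = dict(n3)
    for chain, count in n3.items():
        if count > 0 and chain[position] in members:
            for y in members:
                similar = chain[:position] + (y,) + chain[position + 1:]
                if augmented.get(similar, 0) == 0:
                    augmented[similar] = k
    return augmented
```

Each pass here reads the original counts `n3`, so chains added in one pass do not trigger further additions in the same pass; whether the device chains the passes is not specified in the text.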

[0086] As described above, according to the third embodiment, word chains similar to those present in the language corpus are newly added to the language corpus. Specifically, based on class information that classifies the words in the language corpus, it is determined whether a word belonging to a given word class is contained in a word chain in the language corpus, and the word chain obtained by replacing that word with another word belonging to the same word class is newly added to the language corpus as a similar word chain. Words that may form a chain are thus given a predetermined appearance frequency even if their observed appearance frequency is 0, which reduces sparseness. As a result, backoff occurs less often when calculating language probabilities, and the performance of the statistical language model can be improved. Furthermore, in the third embodiment the class information is used only when creating similar word chains and not when calculating word chain probabilities (conditional probabilities of word chains), so the performance degradation associated with class language models can be prevented.

[0087] A similar effect is obtained even if part of the processing described in the third embodiment, for example steps ST8b to ST15b, is omitted.

[0088] Embodiment 4. The fourth embodiment shows a configuration, different from that of the third embodiment, that prevents the loss of accuracy in the word chain probability (the conditional probability of a word chain) caused by using a class language model and provides a highly accurate language model.

[0089] FIG. 11 is a diagram showing the configuration of a statistical language model generation device according to the fourth embodiment of the present invention. In the figure, 15 is word-class chain counting means (inter-word-class frequency counting means), which counts the appearance frequencies of the word chains present in the language corpus read by the corpus input means 2 from the corpus storage means 1, separately for each word class to which the last word of the chain belongs, and calculates the average of the appearance frequency over the number of word types in that word class. 16 is word connection probability calculating means, which calculates the conditional probability of a word chain based on the word chain frequencies counted by the word chain frequency counting means 3, calculates the conditional probability for the word class to which the last word of the chain belongs based on the average appearance frequencies calculated by the word-class chain counting means 15, and takes the linear combination of these as the new conditional probability of the word chain. 17 is coefficient storage means, which stores the weight coefficients that the word connection probability calculating means 16 applies to each of the linearly combined conditional probabilities. Components identical to those in FIGS. 1 and 6 are given the same reference numerals, and redundant description is omitted.

[0090] Next, the operation will be described. FIG. 12 is a flowchart showing the statistical language model generation operation of the statistical language model generation device of FIG. 11; the case of creating a trigram language model is described as an example along this flowchart. First, as in the first embodiment, the corpus input means 2 reads the language corpus from the corpus storage means 1 and holds it so that the word chain frequency counting means 3 and the word-class chain counting means 15 can use it (step ST1c).

[0091] Next, the word chain frequency counting means 3 counts the frequencies related to word chains from the language corpus read by the corpus input means 2 (step ST2c, word chain frequency counting step). Specifically, the word chain frequency counting means 3 counts the appearance frequency N(w₁, w₂, w₃) of three-word chains, the appearance frequency N(w₁, w₂) of two-word chains, and the appearance frequency N(w₁) of words in the language corpus.

[0092] After this, the class information input means 12 reads the class information from the class information storage means 11 and holds it so that the word-class chain counting means 15 and the word connection probability calculating means 16 can use it (step ST3c). The classification may be an intuitive manual classification or a mechanical classification based on some criterion. Not every word needs to belong to a class. Moreover, a word need not belong to only one class; it may belong to several.

[0093] Next, the word-class chain counting means 15 counts, for each class C, the number of chains formed by a word (string) and the word that follows it, and obtains N(w₁, w₂, C), N(w₂, C), and N(C) given by equation (26) below (step ST4c).

[Equation 5]

For example, let w₁ = "today", w₂ = "wa" (the Japanese topic particle), and let class C₁ be class 1 (topical medicines) in FIG. 10. Suppose the language corpus contains the word chains "... today wa powder ..." twice, "... today wa Vaseline ..." four times, "... today wa Rinderon ..." four times, "... night wa Vantelin ..." three times, and "... prescription ni Vantelin ..." once. Then N("today", "wa", C₁) = 2 + 4 + 4 = 10, N("wa", C₁) = 2 + 4 + 4 + 3 = 13, and N(C₁) = 2 + 4 + 4 + 3 + 1 = 14.

[0094] Further, the word-class chain counting means 15 calculates, according to equation (27) below, the averages N_AV(w₁, w₂, C), N_AV(w₂, C), and N_AV(C) of the appearance frequencies obtained as described above over the number of word types in class C (step ST5c).

[Equation 6]

In the above example, assuming that five kinds of words belong to class C₁ in FIG. 10, N_AV("today", "wa", C₁) = 10/5 = 2, N_AV("wa", C₁) = 13/5 = 2.6, and N_AV(C₁) = 14/5 = 2.8. Steps ST4c and ST5c above correspond to the inter-word-class frequency counting step.
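The worked example can be reproduced with a short Python sketch. Equations (26) and (27) themselves are not reproduced in this text, so the sketch assumes the reading implied by the example: the class-level counts are sums of word-level counts over the words in the class, and the averages divide those sums by the number of word types in the class. The function names and the fifth class member "ointment" are hypothetical.

```python
from collections import Counter

def class_counts(n3, n2, n1, members, w1, w2):
    """Return N(w1, w2, C), N(w2, C), N(C) as sums over the words in class C."""
    n_w1w2_c = sum(n3[(w1, w2, w)] for w in members)
    n_w2_c = sum(n2[(w2, w)] for w in members)
    n_c = sum(n1[w] for w in members)
    return n_w1w2_c, n_w2_c, n_c

def class_averages(counts, members):
    """Divide each class-level count by the number of word types in the class."""
    return tuple(c / len(members) for c in counts)

# Counts from the worked example; "ointment" is a hypothetical fifth member
# that never occurs, so that the class has five word types.
members = {"powder", "vaseline", "rinderon", "vantelin", "ointment"}
n3 = Counter({("today", "wa", "powder"): 2, ("today", "wa", "vaseline"): 4,
              ("today", "wa", "rinderon"): 4, ("night", "wa", "vantelin"): 3,
              ("prescription", "ni", "vantelin"): 1})
n2 = Counter({("wa", "powder"): 2, ("wa", "vaseline"): 4, ("wa", "rinderon"): 4,
              ("wa", "vantelin"): 3, ("ni", "vantelin"): 1})
n1 = Counter({"powder": 2, "vaseline": 4, "rinderon": 4, "vantelin": 4})

counts = class_counts(n3, n2, n1, members, "today", "wa")  # (10, 13, 14)
averages = class_averages(counts, members)                 # approx. (2.0, 2.6, 2.8)
```

Using `Counter` means that chains absent from the corpus contribute 0 to the sums without any special handling.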

[0095] Next, the word connection probability calculating means 16 reads the coefficients λ₁ to λ₆ from the coefficient storage means 17 (step ST6c). Because the probabilities must sum to 1, these coefficients satisfy λ₁ + λ₂ + ... + λ₆ = 1. The coefficients λ₁ to λ₆ can be determined by the well-known EM algorithm, a method that starts from initial values and obtains suitable converged parameter values through a recurrence formula. Denoting the k-th iterate of the coefficient λ_j in the recurrence by λ_j^k, the (k+1)-th iterate λ_j^{k+1} is given by equation (28) below.

[Equation 7]

Starting from λ_1^1 as the initial value of the coefficient λ, the calculation of equation (28) is continued until the coefficients λ converge. In equation (28), the first Σ is taken over all word chains appearing in the language corpus. N is the total frequency of all word chains appearing in the language corpus.
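Equation (28) is not reproduced in this text. As a hedged illustration only, the following Python sketch implements the standard EM update for the weights of a linear interpolation, which matches the description above (a sum over all word chains, normalized by their total frequency N) but may differ in detail from the patent's formula. The names `em_step`, `fit_lambdas`, and the layout of `observations` are assumptions.

```python
def em_step(lambdas, observations):
    """One EM update of interpolation weights.

    observations: list of (count, probs) pairs, one per distinct word chain,
    where probs[j] is the j-th component probability for that chain.
    Assumes the weighted mixture is positive for every chain.
    """
    total = sum(count for count, _ in observations)  # N in the text
    updated = [0.0] * len(lambdas)
    for count, probs in observations:
        mix = sum(l * p for l, p in zip(lambdas, probs))
        for j, (l, p) in enumerate(zip(lambdas, probs)):
            updated[j] += count * l * p / mix        # posterior share of component j
    return [v / total for v in updated]

def fit_lambdas(observations, n_components, tol=1e-8, max_iter=1000):
    """Iterate em_step from uniform initial weights until the change is below tol."""
    lambdas = [1.0 / n_components] * n_components
    for _ in range(max_iter):
        updated = em_step(lambdas, observations)
        if max(abs(a - b) for a, b in zip(updated, lambdas)) < tol:
            return updated
        lambdas = updated
    return lambdas
```

Each update keeps the weights non-negative and summing to 1, so the constraint on the coefficients is preserved throughout the iteration.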

[0096] Next, the word connection probability calculating means 16 extracts a word chain w₁w₂w₃ consisting of any three words and calculates its word chain probability P(w₃|w₁, w₂) as follows. First, based on the class information read by the class information input means 12 from the class information storage means 11, the word connection probability calculating means 16 finds the class to which the word w₃ at the end of the word chain w₁w₂w₃ belongs, and denotes it class C₃ (step ST7c).

[0097] After this, the word connection probability calculating means 16 obtains the word chain probability P(w₃|w₁, w₂) according to equation (29) below (step ST8c, word connection probability calculating step).

[Equation 8]

Here, * denotes all words. As noted above, the coefficients λ₁ to λ₆ in equation (29) satisfy λ₁ + λ₂ + ... + λ₆ = 1 because the probabilities must sum to 1.

[0098] Under the control of the language model generating means 6, the word-class chain counting means 15 and the word connection probability calculating means 16 perform the above processing for all possible word chains (w₁, w₂, w₃). The conditional probability of each word chain is thereby held in the language model generating means 6, and a trigram language model is created (step ST9c, statistical language model generation step).

[0099] Equation (29) is now explained. The first to third terms on the right-hand side of equation (29) correspond to the conventionally well-known method of linear interpolation. In the fourth embodiment, the fourth to sixth terms are added to the right-hand side to improve the accuracy of conventional linear interpolation. For example, when the amount of text in the language corpus is small, the numerator of the fourth term, which takes into account the class C₃ to which the same group of words belongs, is often nonzero even when the numerator of the first term is 0. In that case, the word chain probability increases by the amount of the fourth term. Thus, even for word chains whose appearance frequency is 0 or very close to 0 in a small language corpus, the fourth to sixth terms on the right-hand side supplement the linear interpolation and allow an appropriate word chain probability to be estimated, which in turn improves the accuracy of the speech recognition device. Further, in the fourth embodiment the class information is used only when creating similar word chains and not when calculating word chain probabilities, so there is no performance degradation associated with class language models.
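Equation (29) is not reproduced in this text, so the following Python sketch shows only one plausible reading of the six-term combination: three word-level relative frequencies (ordinary linear interpolation) plus three terms whose numerators are the class-averaged counts N_AV for the class of w₃, with the same denominators N(w₁, w₂, *), N(w₂, *), and N(*). The function name `interpolated_prob` and the dictionary layout are assumptions, and the exact form in the patent may differ.

```python
def interpolated_prob(w1, w2, w3, n1, n2, n3, members, lambdas):
    """Six-term linear interpolation of word-level and class-averaged frequencies.

    n1, n2, n3: dicts of word, two-word and three-word chain counts.
    members: set of word types in the class C3 that contains w3.
    lambdas: six weights that sum to 1.
    """
    size = len(members)
    n_hist2 = sum(c for (a, b, _), c in n3.items() if (a, b) == (w1, w2))  # N(w1, w2, *)
    n_hist1 = sum(c for (a, _), c in n2.items() if a == w2)                # N(w2, *)
    n_total = sum(n1.values())                                             # N(*)
    avg3 = sum(n3.get((w1, w2, w), 0) for w in members) / size  # N_AV(w1, w2, C3)
    avg2 = sum(n2.get((w2, w), 0) for w in members) / size      # N_AV(w2, C3)
    avg1 = sum(n1.get(w, 0) for w in members) / size            # N_AV(C3)
    terms = [
        (n3.get((w1, w2, w3), 0), n_hist2),
        (n2.get((w2, w3), 0), n_hist1),
        (n1.get(w3, 0), n_total),
        (avg3, n_hist2),
        (avg2, n_hist1),
        (avg1, n_total),
    ]
    # Terms with an empty history (zero denominator) are skipped in this sketch.
    return sum(l * num / den for l, (num, den) in zip(lambdas, terms) if den > 0)
```

Under this reading, if every word belongs to exactly one class and all denominators are positive, the probabilities over all w₃ sum to 1, because averaging over the class size spreads each class-level count evenly across the members of the class.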

【０１００】As described above, according to the fourth embodiment, frequencies relating to word chains, comprising the appearance frequency of each word chain in the language corpus and the number of word types, are counted. In addition, based on these frequencies and on class information that groups the words in the language corpus into classes of words with similar meanings, the appearance frequency of each word chain is counted per word class of the chain's last word, and the average appearance frequency over the number of word types in that class is calculated. The conditional probability of the word chain calculated from the word chain frequencies is then linearly combined with the conditional probability for the word class of the last word, calculated from the average appearance frequency, and the result is taken as the new conditional probability of the word chain. Because the word chain probability is computed using information that classifies words into classes, it can be estimated accurately even when the corpus is small.
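The combination described in this paragraph can be illustrated with a minimal Python sketch. It is not equation (29) itself: it keeps only one word-based trigram term and one class-averaged term, and the function name `class_interpolated_prob`, the weight `lam`, and the toy corpus and class table are all assumptions introduced for illustration.

```python
from collections import Counter

def class_interpolated_prob(w1, w2, w3, tokens, word_class, lam=0.7):
    """Estimate P(w3 | w1, w2) as a linear combination of a word-based
    trigram estimate and a class-averaged estimate (hypothetical helper)."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    # Number of times the context (w1, w2) is followed by any word.
    context = sum(n for (a, b, _), n in trigrams.items() if (a, b) == (w1, w2))
    if context == 0:
        return 0.0  # this sketch does not fall back to shorter contexts
    # All words that share the class of the last word w3.
    members = [w for w, c in word_class.items() if c == word_class[w3]]
    # Frequency of (w1, w2, any class member), averaged over the number
    # of word types in the class.
    class_count = sum(trigrams[(w1, w2, m)] for m in members)
    class_average = class_count / len(members)
    word_term = trigrams[(w1, w2, w3)] / context
    class_term = class_average / context
    return lam * word_term + (1.0 - lam) * class_term

# Toy data (assumed): "coffee" never follows "i drink" in the corpus.
tokens = "i drink tea i drink tea i drink water".split()
word_class = {"i": "PRONOUN", "drink": "VERB",
              "tea": "BEVERAGE", "water": "BEVERAGE", "coffee": "BEVERAGE"}
p_coffee = class_interpolated_prob("i", "drink", "coffee", tokens, word_class)
p_tea = class_interpolated_prob("i", "drink", "tea", tokens, word_class)
print(p_coffee, p_tea)
```

In this toy corpus the chain "i drink coffee" has zero observed frequency, yet it receives a non-zero probability from the class term, which is the behaviour the paragraph attributes to the class-based terms.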

【０１０１】In the fourth embodiment above, an example using only one kind of class information was shown, but multiple kinds of class information may be used. In that case, if the word w₃ belongs to C_a under the first class information, to D_b under the second class information, and so on, the word chain probability P(w₃ | w₁, w₂) is given by equation (30) below.

【数９】
Here, λ₁ + λ₂ + λ₃ + ... = 1. Using class information based on multiple criteria in this way has the advantage of improving robustness against zero appearance frequencies and the sparseness problem.

【０１０２】Furthermore, since the statistical language model generation devices described in the first to fourth embodiments can generate a statistical language model with high language performance, the accuracy of speech recognition can be improved by configuring a speech recognition device that recognizes input speech using the statistical language model they generate.

【０１０３】Furthermore, each means described in the first to fourth embodiments above can be implemented in either hardware or software. When implemented in software, providing a computer-readable recording medium on which is recorded a software program that causes a computer to execute the statistical language model generation method of the present invention makes it easy to realize a computer system that functions as a statistical language model generation device capable of preventing word chains containing high-frequency words from having an unduly high language probability, or of preventing the performance degradation associated with class language models.

【０１０４】[0104]

[Effects of the Invention]

As described above, according to the present invention, frequencies relating to word chains, comprising the appearance frequency of each word chain in the language corpus and the number of word types, are counted, and the conditional probability of each word chain is calculated from these frequencies. When the conditional probability of a word chain with a low appearance frequency is calculated by back-off smoothing, the back-off probability is input to a monotonically increasing function and the output of that function is taken as the new conditional probability of the word chain. This reduces the dynamic range caused by the frequency difference relative to word chains that contain high-frequency words, and therefore prevents word chains containing high-frequency words from having an unduly high language probability.

【０１０５】According to the present invention, the monotonically increasing function is a power function, so the dynamic range caused by the frequency difference relative to word chains containing high-frequency words can be reduced.

【０１０６】According to the present invention, the monotonically increasing function is a sigmoid function, so by changing its parameters as appropriate, the range of values the conditional probability can take can be set widely, and processing can be carried out flexibly so that the dynamic range caused by frequency differences between the word chains of interest takes an appropriate value.

【０１０７】According to the present invention, the monotonically increasing function has upper and lower limits on its output, so the dynamic range caused by the frequency difference relative to word chains containing high-frequency words can be reduced and, in addition, the probability can be compressed only for words with a high appearance frequency.
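The three kinds of monotonically increasing function named in paragraphs 0105 to 0107 can be compared with a minimal Python sketch. The parameter names and default values (`alpha`, `gain`, `center`, `lower`, `upper`) are assumptions chosen for illustration; this part of the document does not specify them.

```python
import math

def power_fn(p, alpha=0.5):
    """Power function; an exponent between 0 and 1 compresses ratios."""
    return p ** alpha

def sigmoid_fn(p, gain=10.0, center=0.5):
    """Sigmoid; gain and center are the adjustable parameters."""
    return 1.0 / (1.0 + math.exp(-gain * (p - center)))

def bounded_fn(p, lower=1e-4, upper=1e-2):
    """Identity with lower and upper limits on the output."""
    return min(max(p, lower), upper)

def dynamic_range(probs, fn):
    """Ratio between the largest and smallest transformed probability."""
    values = [fn(p) for p in probs]
    return max(values) / min(values)

backoff_probs = [0.001, 0.01, 0.1]  # assumed back-off probabilities
for fn in (power_fn, sigmoid_fn, bounded_fn):
    print(fn.__name__, dynamic_range(backoff_probs, fn))
```

Under these assumed settings every function keeps the ordering of the inputs while reducing the ratio between the largest and smallest value. The bounded variant leaves values below its upper limit unchanged, so only the high-probability input is compressed.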

【０１０８】According to the present invention, the language corpus is divided into a plurality of parts, and the parts are classified into a language corpus for counting word chain frequencies and a language corpus for evaluating language performance. The parameter of the monotonically increasing function is set to a desired value, the conditional probabilities of word chains in the evaluation corpus are calculated from the word chain frequencies counted in the frequency-counting corpus, and language performance is evaluated using these conditional probabilities. Each time the classification of the divided corpora is swapped and the parameter of the monotonically increasing function is changed, the parameter giving the best language performance is found, and the average of these parameters is determined as the optimal parameter. The parameter of the monotonically increasing function can therefore be set so as to generate the statistical language model with the best language performance.
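The parameter search described in this paragraph can be sketched in Python as follows. This is a minimal illustration, assuming that language performance is measured by a perplexity-like score where lower is better (the list of reference symbols names a perplexity calculation means). The function `choose_parameter` and the scoring callback are hypothetical, and the scorer used in the demonstration is a stand-in, not a real language model.

```python
def choose_parameter(parts, candidates, perplexity):
    """Hold out each corpus part in turn, find the candidate parameter
    with the lowest perplexity on it, and average the per-split winners.

    parts      : list of corpus parts (each a list of items)
    candidates : parameter values to try
    perplexity : callable (train, held_out, param) -> float, lower is better
    """
    best_per_split = []
    for i, held_out in enumerate(parts):
        # All other parts form the frequency-counting corpus.
        train = [item for j, part in enumerate(parts) if j != i for item in part]
        best = min(candidates, key=lambda a: perplexity(train, held_out, a))
        best_per_split.append(best)
    return sum(best_per_split) / len(best_per_split)

# Stand-in scorer for demonstration only: it pretends that each held-out
# part has a known best parameter stored as its first element.
def toy_perplexity(train, held_out, param):
    return abs(param - held_out[0])

parts = [[0.2], [0.4], [0.6]]
best = choose_parameter(parts, [0.2, 0.4, 0.6, 0.8], toy_perplexity)
print(best)
```

In practice the callback would build a model from the frequency-counting parts with the given function parameter and score it on the held-out part. Averaging the per-split winners follows the paragraph's description of the optimal parameter.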

【０１０９】According to the present invention, the divided language corpora are classified into a language corpus for counting word chain frequencies and a language corpus for evaluating language performance so as to form several different combinations, a different parameter is set for each combination, the parameter giving the best language performance is found for each combination, and the average of these parameters is determined as the final optimal parameter. The same effect as in paragraph 0108 above is therefore obtained.

【０１１０】According to the present invention, a word chain similar to a word chain existing in the language corpus is newly added to the language corpus, the frequencies relating to the word chains in the language corpus are counted, and the conditional probability of word chaining is calculated from these frequencies. Words that could plausibly be chained are thus given a certain appearance frequency even if their observed frequency is 0, which reduces sparseness. As a result, back-off occurs less often when language probabilities are calculated, and the performance of the statistical language model improves.

【０１１１】According to the present invention, it is determined, based on class information that classifies the words existing in the language corpus, whether a word belonging to a given word class is included in a word chain in the language corpus, and a word chain in which that word is replaced with another word belonging to the same word class is newly added to the language corpus as a similar word chain. The same effect as in paragraph 0110 above is therefore obtained. In addition, because the class information is used only when creating similar word chains and not when calculating the conditional probabilities of word chains, the performance degradation associated with class language models can be prevented.
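The creation of similar word chains described in paragraphs 0110 and 0111 can be sketched in Python as follows. The function name `add_similar_chains`, the data layout and the choice to add only chains that are not already present are assumptions made for illustration.

```python
def add_similar_chains(chains, classes):
    """Return new word chains made by replacing one word of an existing
    chain with another member of the same word class (hypothetical helper).

    chains  : list of word tuples observed in the corpus
    classes : dict mapping a class name to a set of member words
    """
    seen = set(chains)
    added = []
    for chain in chains:
        for pos, word in enumerate(chain):
            for members in classes.values():
                if word not in members:
                    continue
                # Substitute every other member of the class at this position.
                for other in sorted(members - {word}):
                    new_chain = chain[:pos] + (other,) + chain[pos + 1:]
                    if new_chain not in seen:
                        seen.add(new_chain)
                        added.append(new_chain)
    return added

chains = [("drink", "tea")]
classes = {"BEVERAGE": {"tea", "coffee", "water"}}
extra = add_similar_chains(chains, classes)
print(extra)
```

Frequencies would then be counted over the original chains together with the added ones, so that the class information influences only the corpus contents and not the probability calculation itself.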

【０１１２】According to the present invention, frequencies relating to word chains, comprising the appearance frequency of each word chain in the language corpus and the number of word types, are counted. In addition, based on these frequencies and on class information that groups the words in the language corpus into classes of words with similar meanings, the appearance frequency of each word chain in the language corpus is counted per word class to which a word included in the chain belongs. The conditional probability of the word chain calculated from the word chain frequencies is then linearly combined with the conditional probability relating to the word class to which the word included in the chain belongs, and the result is taken as the new conditional probability of the word chain. Because the word chain probability is computed using information that classifies words into classes, it can be estimated accurately even when the corpus is small.

【０１１３】According to the present invention, a speech recognition device is provided that recognizes input speech using the statistical language model obtained by the statistical language model generation device described above, so a speech recognition device with improved recognition accuracy can be provided.

【０１１４】According to the present invention, a recording medium is provided on which is recorded a program for causing a computer to execute the statistical language model generation method described above, so a computer system that functions as a statistical language model generation device having the above effects can be realized easily.

[Brief description of the drawings]

FIG. 1 is a diagram showing the configuration of a statistical language model generation device according to the first embodiment of the present invention.

FIG. 2 is a flowchart showing the statistical language model generation operation of the statistical language model generation device of FIG. 1.

FIG. 3 is a diagram showing the monotonically increasing function used by the statistical language model generation device according to the first embodiment.

FIG. 4 is a diagram showing the configuration of a statistical language model generation device according to the second embodiment of the present invention.

FIG. 5 is a flowchart showing the statistical language model generation operation of the statistical language model generation device of FIG. 4.

FIG. 6 is a diagram showing the configuration of a statistical language model generation device according to the third embodiment of the present invention.

FIG. 7 is a flowchart showing the statistical language model generation operation of the statistical language model generation device of FIG. 6.

FIG. 8 is a flowchart showing a continuation of the statistical language model generation operation of FIG. 7.

FIG. 9 is a flowchart showing a continuation of the statistical language model generation operation of FIG. 8.

FIG. 10 is a diagram showing an example of the class information used by the statistical language model generation device according to the third embodiment.

FIG. 11 is a diagram showing the configuration of a statistical language model generation device according to the fourth embodiment of the present invention.

FIG. 12 is a flowchart showing the statistical language model generation operation of the statistical language model generation device of FIG. 11.

FIG. 13 is a block diagram showing the configuration of a conventional speech recognition device that recognizes continuous speech.

FIG. 14 is an explanatory diagram illustrating the generation of the statistical language model used in the speech recognition device of FIG. 13.

[Explanation of symbols]

1 corpus storage means, 2 corpus input means, 3 word chain frequency counting means, 4, 16 word connection probability calculation means, 5 word connection probability recalculation means, 6 language model generation means, 7 corpus division means, 8 parameter setting means, 9 perplexity calculation means (language performance evaluation means), 10 optimal parameter determination means, 11 class information storage means (word class storage means), 12 class information input means, 13 similar word chain addition means, 14 additional corpus, 15 word-class chain counting means (inter-word-class frequency counting means), 17 coefficient storage means.

Continued from the front page
(72) Inventor: Hirotaka Goi, c/o Mitsubishi Electric Corporation, 2-2-3 Marunouchi, Chiyoda-ku, Tokyo
F-term (reference): 5D015 HH23

Claims

[Claims]

1. A statistical language model generation device comprising: corpus storage means for storing a language corpus; word chain frequency counting means for counting frequencies relating to word chains, comprising the appearance frequency of word chains existing in the language corpus and the number of word types; word connection probability calculation means for calculating the conditional probability of a word chain based on the frequencies counted by the word chain frequency counting means; and word connection probability recalculation means for, when the conditional probability of a word chain having a low appearance frequency is calculated by back-off smoothing, inputting the back-off probability to a monotonically increasing function and taking the output of that function as the new conditional probability of the word chain.

2. The statistical language model generation device according to claim 1, wherein the monotonically increasing function is a power function.

3. The statistical language model generation device according to claim 1, wherein the monotonically increasing function is a sigmoid function.

4. The statistical language model generation device according to claim 1, wherein the monotonically increasing function has upper and lower limits on its output.

5. The statistical language model generation device according to claim 1, further comprising: corpus dividing means for dividing the language corpus into a plurality of parts; parameter setting means for classifying the divided language corpora into a language corpus in which the word chain frequency counting means counts word chain frequencies and a language corpus for language performance evaluation, and for setting the parameter of the monotonically increasing function used by the word connection probability recalculation means; language performance evaluation means for calculating the conditional probability of word chains in the language corpus for language performance evaluation based on the word chain frequencies counted in the language corpus for frequency counting, and evaluating language performance using that conditional probability; and optimal parameter determination means for finding the parameter that gives the best language performance each time the classification of the divided language corpora and the parameter of the monotonically increasing function are changed, and determining the average of these parameters as the optimal parameter.

6. The statistical language model generation device according to claim 5, wherein the parameter setting means classifies the divided language corpora into a language corpus for frequency counting and a language corpus for language performance evaluation so as to form a plurality of combinations and sets a different parameter for each combination, and the optimal parameter determination means finds the parameter that gives the best language performance for each combination and determines the average of these parameters as the optimal parameter.

7. A statistical language model generation device comprising: corpus storage means for storing a language corpus; word chain frequency counting means for counting frequencies relating to word chains, comprising the appearance frequency of word chains existing in the language corpus and the number of word types; word connection probability calculation means for calculating the conditional probability of a word chain based on the frequencies counted by the word chain frequency counting means; and similar word chain addition means for adding, to the language corpus, a word chain similar to a word chain existing in the language corpus.

8. The statistical language model generation device according to claim 7, further comprising word class storage means for storing class information in which the words existing in the language corpus are classified into word classes of similar meaning, wherein the similar word chain addition means determines, based on the class information, whether a word belonging to a given word class is included in a word chain in the language corpus, and newly adds to the language corpus, as a similar word chain, a word chain in which the word belonging to that word class is replaced with another word belonging to the same word class.

9. A statistical language model generation device comprising: corpus storage means for storing a language corpus; word chain frequency counting means for counting frequencies relating to word chains, comprising the appearance frequency of word chains existing in the language corpus and the number of word types; word class storage means for storing class information in which the words existing in the language corpus are classified into word classes of similar meaning; inter-word-class frequency counting means for counting, based on the class information, the appearance frequency of word chains existing in the language corpus for each word class to which a word included in the word chain belongs; and word connection probability calculation means for calculating the conditional probability of the word chain based on the frequencies relating to the word chain, calculating, based on the appearance frequency counted by the inter-word-class frequency counting means, the conditional probability relating to the word class to which the word included in the word chain belongs, and taking the probability obtained by linearly combining these as the new conditional probability of the word chain.

10. A speech recognition device for recognizing an input speech using a statistical language model obtained by the statistical language model generation device according to any one of claims 1 to 9.

11. A statistical language model generation method comprising: a word chain frequency counting step of counting frequencies relating to word chains, comprising the appearance frequency of word chains existing in a language corpus and the number of word types; a word connection probability calculation step of calculating the conditional probability of a word chain based on the frequencies counted in the word chain frequency counting step; a word connection probability recalculation step of, when the conditional probability of a word chain having a low appearance frequency is calculated by back-off smoothing in the word connection probability calculation step, inputting the back-off probability to a monotonically increasing function and taking the output of that function as the new conditional probability of the word chain; and a statistical language model generation step of generating a statistical language model using the conditional probabilities of word chains obtained in the word connection probability recalculation step.

12. The statistical language model generation method according to claim 11, wherein the monotonically increasing function is a power function.

13. The statistical language model generation method according to claim 11, wherein the monotonically increasing function is a sigmoid function.

14. The statistical language model generation method according to claim 11, wherein the monotonically increasing function has upper and lower limits in the operation result.

15. The statistical language model generation method according to claim 11, further comprising: a parameter setting step of dividing the language corpus into a plurality of parts, classifying them into a language corpus for counting word chain frequencies and a language corpus for language performance evaluation, and setting the parameter of the monotonically increasing function used in the word connection probability recalculation step; a language performance evaluation step of calculating the conditional probability of word chains in the language corpus for language performance evaluation based on the word chain frequencies counted in the language corpus for frequency counting, and evaluating language performance using that conditional probability; and an optimal parameter determination step of finding the parameter that gives the best language performance each time the parameter setting step and the language performance evaluation step are repeated, and determining the average of these parameters as the optimal parameter.

16. The statistical language model generation method according to claim 15, wherein, in the parameter setting step, the divided language corpora are classified into a language corpus for counting word chain frequencies and a language corpus for language performance evaluation so as to form a plurality of combinations and a different parameter is set for each combination, and, in the optimal parameter determination step, the parameter that gives the best language performance is found for each combination and the average of these parameters is determined as the final optimal parameter.

17. A statistical language model generation method comprising: a similar word chain addition step of newly adding, to a language corpus, a word chain similar to a word chain existing in the language corpus; a word chain frequency counting step of counting frequencies relating to word chains, comprising the appearance frequency of word chains existing in the language corpus and the number of word types; a word connection probability calculation step of calculating the conditional probability of a word chain based on the frequencies counted in the word chain frequency counting step; and a statistical language model generation step of generating a statistical language model using the conditional probabilities of word chains obtained in the word connection probability calculation step.

18. The statistical language model generation method according to claim 17, wherein, in the similar word chain addition step, it is determined, based on class information in which the words existing in the language corpus are classified into word classes of similar meaning, whether a word belonging to a given word class is included in a word chain in the language corpus, and a word chain in which the word belonging to that word class is replaced with another word belonging to the same word class is added to the language corpus as a similar word chain.

19. A statistical language model generation method comprising: a word chain frequency counting step of counting frequencies relating to word chains, comprising the appearance frequency of word chains existing in a language corpus and the number of word types; an inter-word-class frequency counting step of counting, based on class information in which the words existing in the language corpus are classified into word classes of similar meaning, the appearance frequency of word chains existing in the language corpus for each word class to which a word included in the word chain belongs; a word connection probability calculation step of calculating the conditional probability of the word chain based on the frequencies relating to the word chain, calculating, based on the appearance frequency counted in the inter-word-class frequency counting step, the conditional probability relating to the word class to which the word included in the word chain belongs, and taking the probability obtained by linearly combining these as the new conditional probability of the word chain; and a statistical language model generation step of generating a statistical language model using the conditional probabilities of word chains obtained in the word connection probability calculation step.

20. A computer-readable recording medium having recorded thereon a program for causing a computer to execute the statistical language model generation method according to any one of claims 11 to 19.