JP2886121B2

JP2886121B2 - Statistical language model generation device and speech recognition device

Info

Publication number: JP2886121B2
Application number: JP7292685A
Authority: JP
Inventors: 浩和政瀧; 芳典匂坂; 昭一松永
Original assignee: Ei Tei Aaru Onsei Honyaku Tsushin Kenkyusho Kk
Current assignee: Ei Tei Aaru Onsei Honyaku Tsushin Kenkyusho Kk
Priority date: 1995-11-10
Filing date: 1995-11-10
Publication date: 1999-04-26
Anticipated expiration: 2015-11-10
Also published as: JPH09134192A

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

[Technical Field of the Invention] The present invention relates to a statistical language model generation device that generates a statistical language model from training text data, and to a speech recognition device that uses this statistical language model to recognize the speech signal of an input uttered sentence.

【０００２】[0002]

[Prior Art] In recent years, methods that use a language model to improve the performance of continuous speech recognition devices have been studied. The language model predicts the next word and thereby reduces the search space, with the aim of raising the recognition rate and cutting computation time. A language model in wide use recently is the N-gram, which is trained on large-scale text data and statistically gives the transition probability from the immediately preceding N-1 words to the next word. The generation probability $P(w_1^L)$ of a string of L words $w_1^L = w_1, w_2, \ldots, w_L$ is expressed by the following equation.

【０００３】[0003]

(Equation 1) $$P(w_1^L) = \prod_{t=1}^{L} P(w_t \mid w_{t+1-N}^{t-1})$$

[0004] Here, $w_t$ denotes the t-th word of the word string $w_1^L$, and $w_i^j$ denotes the word string from the i-th to the j-th word. In Equation 1, the probability $P(w_t \mid w_{t+1-N}^{t-1})$ is the probability that word $w_t$ is uttered after the word string $w_{t+1-N}^{t-1}$, consisting of the preceding N-1 words, has been uttered. Likewise, in what follows, $P(A \mid B)$ denotes the probability that word A is uttered after the word or word string B has been uttered. The symbol $\prod$ in Equation 1 denotes the product of the probabilities $P(w_t \mid w_{t+1-N}^{t-1})$ over t = 1 to L, and likewise below.
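
As a rough illustration of Equation 1, the following Python sketch estimates the N-gram probabilities by relative frequency and multiplies them over a sentence. The function names, the `<s>` padding symbol for sentence starts, and the toy corpus are assumptions made for this example and do not come from the patent.

```python
from collections import Counter

def train_ngram(sentences, n):
    """Count n-grams and their (n-1)-word histories, padding with <s>."""
    ngrams, histories = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] * (n - 1) + list(words)
        for t in range(n - 1, len(padded)):
            history = tuple(padded[t - n + 1:t])
            ngrams[history + (padded[t],)] += 1
            histories[history] += 1
    return ngrams, histories

def sentence_probability(words, ngrams, histories, n):
    """Product over t of P(w_t | preceding n-1 words), by relative frequency."""
    padded = ["<s>"] * (n - 1) + list(words)
    prob = 1.0
    for t in range(n - 1, len(padded)):
        history = tuple(padded[t - n + 1:t])
        if histories[history] == 0:
            return 0.0  # history never seen in training
        prob *= ngrams[history + (padded[t],)] / histories[history]
    return prob

corpus = [["a", "b", "a"], ["a", "b", "b"]]  # toy training text (assumption)
ngrams, histories = train_ngram(corpus, 2)
p = sentence_probability(["a", "b", "a"], ngrams, histories, 2)
```

With this toy corpus, the bigram (N = 2) probability of the sentence a b a is 1 × 1 × 0.5 = 0.5, because b is followed by a in one of its two occurrences as a history.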

[0005] Although the N-gram is extremely simple, it is very effective for continuous speech recognition because it is easy to construct, combines well with statistical acoustic models, and yields large gains in recognition rate and reductions in computation time (see, for example, Prior Art Document 1: L. R. Bahl et al., "A Maximum Likelihood Approach to Continuous Speech Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 179-190, 1983; Prior Art Document 2: P. C. Woodland et al., "The 1994 HTK Large Vocabulary Speech Recognition System", Proceedings of ICASSP '95, Vol. 1, pp. 73-76, 1995; and Prior Art Document 3: Murakami et al., "Sentence Speech Recognition Using Word Trigrams and Its Extension to Spontaneous Speech Recognition", IEICE Technical Report, SP93-127, pp. 71-78, 1994).

[0006] In general, increasing N lets an N-gram language model handle longer word chains and so predict the next word more accurately, but it also increases the number of parameters; when the amount of training data is small, reliable transition probabilities cannot be assigned to words that occur infrequently. For example, with a vocabulary of 5,000 words, the number of possible word-transition combinations for a trigram (N = 3) is 5,000³ = 125 billion, so obtaining reliable transition probabilities would require an enormous body of text, on the order of hundreds of billions of words or more. Collecting that much text is practically impossible. Conversely, decreasing N makes the transition probabilities more reliable, but only short word chains can be handled and the prediction accuracy for the next word falls.

【０００７】[0007]

[Problems to Be Solved by the Invention] The following methods have been proposed to address this problem.

(1) Estimating untrained transition probabilities by interpolation

This approach is typified by deleted interpolation (see, for example, Prior Art Document 4: F. Jelinek et al., "Interpolated Estimation of Markov Source Parameters from Sparse Data", Proceedings of the Workshop on Pattern Recognition in Practice, pp. 381-397, 1980) and back-off smoothing (see Prior Art Document 5: S. M. Katz, "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-35, No. 3, pp. 400-401, March 1987). By interpolating transition probabilities with the values of an N-gram of smaller N, it can assign a transition probability even to word transitions that do not appear in the training text data. However, it may fail to give reliable transition probabilities for words that occur infrequently.
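
The idea of interpolating with a lower-order model can be sketched as follows. This is a simplified linear interpolation with a fixed weight `lam`, which is an assumption of this example: deleted interpolation estimates the weights from held-out data, and back-off smoothing relies on discounting instead. Neither procedure is reproduced here, and all names are illustrative.

```python
from collections import Counter

def interpolated_bigram(prev, word, unigrams, histories, bigrams, lam=0.5):
    """Mix the bigram estimate with the unigram estimate (fixed weight)."""
    p_uni = unigrams[word] / sum(unigrams.values())
    p_bi = bigrams[(prev, word)] / histories[prev] if histories[prev] else 0.0
    return lam * p_bi + (1.0 - lam) * p_uni

words = ["a", "b", "a", "a"]               # toy training text (assumption)
unigrams = Counter(words)
histories = Counter(words[:-1])            # words that have a successor
bigrams = Counter(zip(words, words[1:]))

# The transition "b b" never occurs in the training text,
# yet it receives a non-zero probability from the unigram term.
p_unseen = interpolated_bigram("b", "b", unigrams, histories, bigrams)
```

The unseen transition is rescued only by the unigram term, so its estimate is only as reliable as the unigram count of the word, which is small for infrequent words.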

[0008] (2) Reducing the number of parameters with a class N-gram

This approach considers N-grams between classes obtained, for example, by clustering based on mutual information (see, for example, Prior Art Document 6: P. F. Brown et al., "Class-Based n-gram Models of Natural Language", Computational Linguistics, Vol. 18, No. 4, pp. 467-479, 1992) or by part of speech (see Prior Art Document 7: Zhou et al., "Large-Vocabulary Continuous Speech Recognition of Japanese Using a Probabilistic Model", Proceedings of the 51st National Convention of the Information Processing Society of Japan, pp. 119-120, 1995). The generation probability $P(w_1^L)$ of a sentence of L words is generally expressed by the following equation.

【０００９】[0009]

(Equation 2) $$P(w_1^L) = \prod_{t=1}^{L} P(w_t \mid c_t)\, P(c_t \mid c_{t-N+1}^{t-1})$$

[0010] Here, $c_t$ denotes the class to which word $w_t$ belongs, and $c_i^j$ denotes the class sequence from the i-th to the j-th class. In Equation 2, $P(c_t \mid c_{t-N+1}^{t-1})$ is the transition probability from the classes of the immediately preceding (N-1) words to the class of the next word. With 50 classes, the number of possible class-transition combinations for a trigram is 50³ = 125,000, so the transition probabilities can presumably be estimated from text data of a few hundred thousand words, far less than a word N-gram requires. However, because the characteristic connections between individual words cannot be represented, the prediction accuracy for the next word is expected to be worse.
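
The two parameter counts quoted for the word trigram and the class trigram follow from simple arithmetic, which this short Python check reproduces. The helper name `num_transitions` is made up for this illustration.

```python
def num_transitions(num_symbols, n):
    """Number of distinct n-symbol sequences, i.e. N-gram transition entries."""
    return num_symbols ** n

word_trigram = num_transitions(5000, 3)   # word trigram, 5,000-word vocabulary
class_trigram = num_transitions(50, 3)    # class trigram, 50 classes
ratio = word_trigram // class_trigram     # how many times fewer parameters
```

Under these figures the class trigram has a million times fewer transition entries than the word trigram, which is why far less training text suffices.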

[0011] An object of the present invention is to solve the above problems and to provide a statistical language model generation device capable of generating a statistical language model whose transition probabilities are predicted more accurately and more reliably than in the prior art, and a speech recognition device that uses this statistical language model to recognize speech at a higher recognition rate than in the prior art.

【００１２】[0012]

[Means for Solving the Problems] The statistical language model generation device according to claim 1 of the present invention comprises: generation means for classifying the entire vocabulary into part-of-speech classes, clustered by part of speech, on the basis of training text data transcribed from the uttered sentences of a predetermined speaker, and for generating the bigram between those part-of-speech classes as an initial-state statistical language model; search means for searching, on the basis of the initial-state statistical language model generated by the generation means, for first separation class candidates, which can be separated by splitting a word off from its part-of-speech class, and second separation class candidates, which can be separated from a word's part-of-speech class by joining consecutive words or consecutive word strings, where such joining includes joining one word with one word, one word with a word string of several words, a word string of several words with one word, and a word string of several words with a word string of several words; calculation means for calculating, for the first and second separation class candidates found by the search means, the reduction in a predetermined entropy, which represents the difficulty of predicting the next word, that results from separating the class; separation means for selecting, among the entropy reductions calculated by the calculation means for the first and second separation class candidates, the class separation with the largest reduction and executing the selected class separation, thereby generating a statistical language model that includes a part-of-speech bigram and a variable-length word N-gram; and control means for repeating the processing of the search means, the calculation means, and the separation means, with the statistical language model generated by the separation means as the model to be processed, until the number of classes of that model reaches a predetermined number of classes, thereby generating a statistical language model having the predetermined number of classes.

[0013] The speech recognition device according to claim 2 of the present invention comprises speech recognition means for recognizing speech, on the basis of the speech signal of an input uttered sentence, using a predetermined statistical language model, wherein the speech recognition means performs speech recognition using a statistical language model that includes a part-of-speech bigram and a variable-length word N-gram.

[0014] In the speech recognition device according to claim 3, the statistical language model is one generated by the statistical language model generation device according to claim 1.

[0015] The continuous speech recognition device according to claim 4 of the present invention comprises speech recognition means for performing continuous speech recognition by detecting word hypotheses for an input uttered sentence from its speech signal and calculating their likelihoods, wherein the speech recognition means refers to the statistical language model generated by the statistical language model generation device according to claim 1 and, for word hypotheses of the same word that have the same end time but different start times, narrows the word hypotheses so that, for each head-phoneme environment of the word, they are represented by the single word hypothesis with the highest total likelihood calculated from the utterance start time to the end time of the word.

【００１６】[0016]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram of a continuous speech recognition device according to one embodiment of the present invention. The continuous speech recognition device of this embodiment includes a word matching unit 4 that uses the well-known one-pass Viterbi decoding method to detect word hypotheses for an input uttered sentence from the feature parameters of its speech signal, and to calculate and output their likelihoods. It is characterized by further including a word hypothesis narrowing unit 6 that, for word hypotheses of the same word with the same end time but different start times output from the word matching unit 4 via a buffer memory 5, refers to a statistical language model 22 and narrows the word hypotheses so that, for each head-phoneme environment of the word, they are represented by the single word hypothesis with the highest total likelihood calculated from the utterance start time to the end time of the word.

[0017] The statistical language model 22 used here is generated by a language model generation unit 20 from training text data. It is based on a bigram (N = 2) between part-of-speech classes, but words that are reliable on their own are separated from their part-of-speech class and treated as classes of their own. In addition, to improve prediction accuracy, frequently occurring word strings are joined and treated as a single class, which makes it possible to represent long word chains. The resulting model is a statistical language model that has the characteristics of both a part-of-speech bigram and a variable-length word N-gram, and it balances the accuracy and the reliability of the transition probabilities.

[0018] First, the concept of the variable-length N-gram used in this embodiment is described. An N-gram is an (N-1)-th order Markov model, which can be interpreted as a simple (first-order) Markov model whose states have each been split so as to remember the past (N-1) state transitions. As examples, FIG. 3 is a state transition diagram depicting a bigram as a Markov model, and FIG. 4 is a state transition diagram depicting a trigram as a Markov model.

[0019] In FIG. 3, outputting symbol a in state $s_1$ leaves the model in state $s_1$, whereas outputting symbol b in state $s_1$ causes a transition to state $s_2$. Outputting symbol b in state $s_2$ leaves the model in state $s_2$, whereas outputting symbol a in state $s_2$ returns it to state $s_1$. The trigram of FIG. 4 can be regarded as the bigram with state $s_1$ split into states $s_{11}$ and $s_{12}$ and state $s_2$ split into states $s_{21}$ and $s_{22}$. Splitting all states further yields higher-order N-grams.

[0020] The variable-length N-gram shown in FIG. 5 is obtained by splitting only some of the states of the simple Markov model. Specifically, in the bigram of FIG. 3, the output of symbol a from state $s_2$ is divided into two cases: the case where symbol b is output next (written ab, and referred to as outputting symbol ab) and the case where a symbol other than b is output next (written $a\bar{b}$, and referred to as outputting symbol $a\bar{b}$, where the overbar denotes negation). In the former case the transition is made from state $s_1$ to state $s_{12}$, and in the latter case from state $s_2$ to state $s_{11}$. In other words, for the former case state $s_{12}$ is split off from state $s_1$, and the remaining transitions that output symbol a (that is, $a\bar{b}$) are left in state $s_{11}$. In this model, outputting symbol ab in state $s_{11}$ causes a transition to state $s_{12}$, whereas outputting symbol $a\bar{b}$ in state $s_{11}$ leaves the model in state $s_{11}$. Likewise, outputting symbol ab in state $s_{12}$ leaves the model in state $s_{12}$, whereas outputting symbol $a\bar{b}$ in state $s_{12}$ causes a transition to state $s_{11}$.

[0021] This model is characterized in that, by treating several consecutive symbols as a new symbol, it can represent long chains while keeping the structure of a simple Markov model. Repeating the same kind of state splitting makes it possible to represent even longer chains locally. This is the variable-length N-gram. Accordingly, a variable-length word N-gram, as a language model in which the symbols are words, is expressed as a bigram between word strings (including single words).
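
The effect of treating consecutive symbols as one new symbol can be sketched in Python. The helper `merge_pair` and the toy symbol sequence are assumptions for illustration; the patent describes the idea in terms of state splitting rather than this token rewriting.

```python
def merge_pair(symbols, first, second):
    """Replace each adjacent (first, second) pair by one joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == first and symbols[i + 1] == second:
            merged.append(first + second)   # the new symbol, e.g. "ab"
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

sequence = list("abbaab")                   # toy symbol sequence (assumption)
units = merge_pair(sequence, "a", "b")      # ['ab', 'b', 'a', 'ab']
bigrams = list(zip(units, units[1:]))       # bigrams over the merged units
```

Here the bigram `('a', 'ab')` over the merged units spans three original symbols (a, a, b), although the model structure is still that of a bigram.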

[0022] Next, the operation of the variable-length N-gram is described. The statistical language model 22 used in this embodiment is a variable-length N-gram over part-of-speech classes and words, and is expressed as a bigram between the following three kinds of classes:
(1) part-of-speech classes (hereinafter, the first class);
(2) classes of words separated from a part-of-speech class (hereinafter, the second class); and
(3) classes formed by joining consecutive words (hereinafter, the third class).

[0023] Words belonging to the first class are mainly those with a low frequency of occurrence; handling them through their class makes the transition probabilities more reliable than handling each word individually. Words belonging to the second class are mainly those with a high frequency of occurrence, and are sufficiently reliable even when handled individually. Furthermore, by joining consecutive words and assigning them to the third class, the model operates as a variable-length N-gram and the prediction accuracy for the next word improves. In this embodiment, however, joining a part-of-speech class with an adjacent part-of-speech class, or a part-of-speech class with an adjacent word, is not considered. The generation probability $P(w_1^L)$ of a sentence of L words is given by the following equation.

【００２４】[0024]

(Equation 3) $$P(w_1^L) = \prod_{t=1}^{K} P(ws_t \mid c_t)\, P(c_t \mid c_{t-1})$$

[0025] Here, $ws_t$ denotes the t-th word string (which may be a single word) when the sentence is divided into the above classes. Thus $P(ws_t \mid c_t)$ is the probability that the word string $ws_t$ appears given the t-th class, and $P(c_t \mid c_{t-1})$ is the probability that a word of the t-th class appears after the preceding (t-1)-th class. K is the number of word strings in the sentence, with K ≤ L. The $\prod$ in Equation 3 is therefore a product over t = 1 to K. As an example, consider the following uttered sentence of seven words.

【００２６】[0026]

(Equation 4) 「わたくし−村山−と−言−い−ま−す」 (watakushi - Murayama - to - i - i - ma - su; "I am called Murayama")

[0027] Using Equation 3, the generation probability $P(w_1^L)$ of this sentence is given by the following equation.

【００２８】[0028]

(Equation 5) $$P(w_1^L) = P(\{\text{わたくし}\}) \cdot P(\text{わたくし} \mid \{\text{わたくし}\}) \cdot P(\langle\text{固有名詞}\rangle \mid \{\text{わたくし}\}) \cdot P(\text{村山} \mid \langle\text{固有名詞}\rangle) \cdot P(\{\text{と}\} \mid \langle\text{固有名詞}\rangle) \cdot P(\text{と} \mid \{\text{と}\}) \cdot P([\text{言います}] \mid \{\text{と}\}) \cdot P(\text{言います} \mid [\text{言います}])$$

[0029] Here, < >, { }, and [ ] indicate membership in the first, second, and third class, respectively. The words and word strings are assigned as follows:
(1) 「村山」 (Murayama) is a proper noun (固有名詞) and belongs to the first class.
(2) 「わたくし」 (watakushi, "I") and 「と」 (to, a quotative particle) are words separated from the noun class and the particle class, respectively, and belong to the second class.
(3) 「言います」 (iimasu, "say") is the combination of a verb, a verb suffix, an auxiliary verb, and an auxiliary-verb suffix, and belongs to the third class.
In the second and third classes, the word and its class occur equally often, so P(わたくし | {わたくし}) = 1, P(と | {と}) = 1, and P(言います | [言います]) = 1. Equation 5 above therefore reduces to the following.

【００３０】[0030]

(Equation 6) $$P(w_1^L) = P(\text{わたくし}) \cdot P(\text{村山} \mid \langle\text{固有名詞}\rangle) \cdot P(\langle\text{固有名詞}\rangle \mid \text{わたくし}) \cdot P(\text{と} \mid \langle\text{固有名詞}\rangle) \cdot P(\text{言います} \mid \text{と})$$
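
The computation for this example sentence can be traced with a small Python sketch. Every probability value below is hypothetical and was chosen only to make the arithmetic easy to follow; the function name, the dictionary layout, and the `<s>` start marker used to hold the sentence-initial term P(わたくし) are likewise assumptions of this example.

```python
def sentence_probability(units, emit, trans, start="<s>"):
    """Product over t of P(ws_t | c_t) * P(c_t | c_{t-1}), as in Equation 3."""
    prob, prev = 1.0, start
    for word_string, cls in units:
        prob *= emit[(word_string, cls)] * trans[(prev, cls)]
        prev = cls
    return prob

# The sentence divided into K = 4 word strings with their classes.
units = [("わたくし", "{わたくし}"), ("村山", "<固有名詞>"),
         ("と", "{と}"), ("言います", "[言います]")]

# HYPOTHETICAL values. Second- and third-class emissions are 1 by construction.
emit = {("わたくし", "{わたくし}"): 1.0, ("村山", "<固有名詞>"): 0.125,
        ("と", "{と}"): 1.0, ("言います", "[言います]"): 1.0}
trans = {("<s>", "{わたくし}"): 0.5, ("{わたくし}", "<固有名詞>"): 0.25,
         ("<固有名詞>", "{と}"): 0.5, ("{と}", "[言います]"): 0.5}

p = sentence_probability(units, emit, trans)
```

Because three of the four emission probabilities equal 1, only the first-class word 「村山」 contributes an emission factor, which mirrors the simplification from Equation 5 to Equation 6.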

[0031] Next, the language model generation process for generating the statistical language model 22, the variable-length N-gram used in this embodiment, is described. The statistical language model 22 is generated by starting from the part-of-speech class bigram and separating classes according to an entropy-minimization criterion. The reduction in entropy is guaranteed to be positive, so class separation makes the entropy on the training text data decrease monotonically. Entropy here is, in general, a measure of "ambiguity": a language model with small entropy has little ambiguity as a language, and its next word is easy to predict. In other words, entropy represents the difficulty of predicting the next word. The entropy $H(X \mid Y)$ of the conditional probability $P(x \mid y)$, the probability of x given y, is expressed by the following equation.

【００３２】[0032]

(Equation 7) $$H(X \mid Y) = -\sum_{y} P(y) \sum_{x} P(x \mid y) \log_2 P(x \mid y)$$
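
Equation 7 can be evaluated directly from observed pairs, as in the following Python sketch. Estimating the probabilities by relative frequency from a list of (y, x) samples is an assumption of this example, as are the function name and the toy data.

```python
import math
from collections import Counter

def conditional_entropy(pairs):
    """H(X|Y) = -sum_y P(y) sum_x P(x|y) log2 P(x|y), from (y, x) samples."""
    joint = Counter(pairs)                     # counts of (y, x)
    marginal = Counter(y for y, _ in pairs)    # counts of y
    total = len(pairs)
    h = 0.0
    for (y, x), count in joint.items():
        p_y = marginal[y] / total
        p_x_given_y = count / marginal[y]
        h -= p_y * p_x_given_y * math.log2(p_x_given_y)
    return h

# After "a" the next symbol is a coin flip; after "b" it is always "a".
pairs = [("a", "b"), ("a", "a"), ("b", "a"), ("b", "a")]
h = conditional_entropy(pairs)
```

In this toy data the context "a" contributes one bit of uncertainty and the context "b" none, so the result is 0.5 bits: the lower the value, the easier the next symbol is to predict.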

[0033] Based on Equation 7, the entropy used in this embodiment is calculated by the following equation.

【００３４】[0034]

(Equation 8) Here, $w_k \in c_j$.

[0035] FIG. 6 is a flowchart showing the details of the language model generation process executed by the language model generation unit 20; the process is described below with reference to FIG. 6. First, in step S1, the entire vocabulary contained in the training text data, transcribed from the uttered sentences of a predetermined speaker, is classified into part-of-speech classes (classes clustered by part of speech), and the bigram between those classes is taken as the initial-state statistical language model. Classes are then separated in steps S2 to S4. In step S2, the candidate classes that can be separated are searched for and listed. Two kinds of class separation are considered here:
(1) separation of a word from its part-of-speech class (hereinafter, the first class separation); and
(2) class separation by joining consecutive words or consecutive word strings (hereinafter, the second class separation).
Here, joining consecutive words or word strings covers joining one word with one adjacent word (adjacent meaning input consecutively in time), one word with a word string of several words, a word string of several words with one word, and a word string of several words with another word string of several words.

[0036] In the former, the separation of a word from its part-of-speech class, a word that initially belongs to a part-of-speech class is separated from that class, and the separated word forms a class of its own.

【００３７】[0037]

(Equation 9) $$c_\xi \rightarrow \{w_x\} + c_\xi \setminus \{w_x\}, \quad \text{where } w_x \in c_\xi$$

[0038] Here, $c_\xi \setminus \{w_x\}$ denotes the class obtained by removing the class of word $w_x$ from class $c_\xi$, and word $w_x$ belongs to class $c_\xi$. Equation 9 thus means, for example, that the noun class $c_\xi$ is separated into the class $\{w_x\}$ of the word 「机」 (tsukue, "desk") and the class that remains when $\{w_x\}$ is removed from $c_\xi$.
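
The first class separation of Equation 9 amounts to a small set operation, sketched here in Python. The table layout (class name mapped to a set of member words), the function name, and the extra nouns 「椅子」 (chair) and 「本」 (book) are assumptions added for illustration; only 「机」 comes from the text.

```python
def separate_word(classes, class_name, word):
    """Return a new class table with `word` split out of `class_name`."""
    if word not in classes[class_name]:
        raise ValueError(f"{word!r} is not a member of {class_name!r}")
    result = {name: set(members) for name, members in classes.items()}
    result[class_name].discard(word)       # the remainder, c_xi \ {w_x}
    result["{" + word + "}"] = {word}      # the new single-word class {w_x}
    return result

classes = {"noun": {"机", "椅子", "本"}}    # hypothetical initial noun class
after = separate_word(classes, "noun", "机")
```

The original table is left untouched, so several candidate separations can be evaluated from the same starting point before one of them is actually carried out.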

[0039] In the latter, class separation by joining consecutive words or word strings, joining is considered between two adjacent classes among the word classes and word-string classes that have already been separated from the initial classes. The joined word string forms a class of its own.

【００４０】[0040]

(Equation 10) $$\{w_x\} \rightarrow \{w_x, w_y\} + \{w_x, \bar{w}_y\}$$

[0041] Here, $\{w_x, w_y\}$ denotes the class of the consecutive word string $w_x, w_y$, and $\{w_x, \bar{w}_y\}$ denotes the class of word $w_x$ when it is not followed by word $w_y$; that is, $\bar{w}_y$ stands for any word other than $w_y$. Equation 10 means, for example, that the class $\{w_x\}$ of the word 「机」 (tsukue, "desk") is separated into the class $\{w_x, w_y\}$ of the word string 「机の」 (tsukue no) and the class $\{w_x, \bar{w}_y\}$ of word strings other than 「机の」, such as 「机は」 (tsukue wa) and 「机が」 (tsukue ga). Equation 10 is written for joining two words, but joining a word string with a word and joining a word string with a word string are expressed in the same way. The second class separation therefore includes these class separations.
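
The candidate listing of step S2 might be organized as in the following Python sketch. It assumes the training text is held as a list of units (words or already-joined word strings) together with the set of units that already form their own class, and it applies the restriction that only classes already separated from the initial classes may be joined. The function name and the toy data are hypothetical.

```python
def list_candidates(units, separated):
    """Step S2 sketch: return (first-kind, second-kind) separation candidates.

    units: training text as a list of words or already-joined word strings.
    separated: set of units that already form a class of their own.
    """
    # First kind: words that still belong to a part-of-speech class.
    first_kind = sorted(set(units) - separated)
    # Second kind: adjacent pairs whose two members are both already separated.
    second_kind = sorted({(x, y) for x, y in zip(units, units[1:])
                          if x in separated and y in separated})
    return first_kind, second_kind

units = ["机", "の", "上", "机", "が"]   # toy text (assumption)
separated = {"机", "の"}                 # units already split from their classes
first_kind, second_kind = list_candidates(units, separated)
```

In this toy text the only second-kind candidate is the pair (「机」, 「の」), because 「が」 and 「上」 still belong to their part-of-speech classes and can only be first-kind candidates.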

【００４２】次いで、ステップＳ３で、ステップＳ２で
リストアップされた上記第１と第２の分離クラス候補に
対して次の数１１及び数１２を用いてエントロピー減少
量を計算する。ここで、上記第１のクラス分離である初
期クラスの分離に対して数１１を用いる一方、上記第２
のクラス分離である連接単語又は連接単語列の結合によ
るクラス分離に対して数１２を用いる。Next, in step S3, the entropy reduction is calculated for the first and second separation class candidates listed in step S2, using Equations 11 and 12 below. Equation 11 is used for the first type of class separation, the separation of a word from an initial class, and Equation 12 is used for the second type, the class separation by joining adjacent words or word strings.

【００４３】[0043]

【数１１】 ΔＨ＝Ｈ（｛ｃ_i｝）−Ｈ（｛ｃ_i＼ｃ_ξ｝＋｛ｗ_x｝＋｛ｃ_ξ＼ｗ_x｝）[Equation 11] ΔH = H({c_i}) − H({c_i\c_ξ} + {w_x} + {c_ξ\w_x})

【数１２】 ΔＨ＝Ｈ（｛ｃ_i｝）−Ｈ（｛ｃ_i＼ｗ_x｝＋｛ｗ_x，ｗ_y｝＋｛ｗ_x，／ｗ_y｝）[Equation 12] ΔH = H({c_i}) − H({c_i\w_x} + {w_x, w_y} + {w_x, /w_y})

【００４４】ここで、数１１及び数１２において、Ｈ
（｛ｃ_i｝）は元のすべての品詞クラスｃ_iについてのエ
ントロピーであり、数１１においてＨ（｛ｃ_i＼ｃ_ξ｝
＋｛ｗ_x｝＋｛ｃ_ξ＼ｗ_x｝）は元のすべての品詞クラス
ｃ_iから単語ｗ_xのクラスを分離したときのエントロピー
であり、数１１のΔＨは単語ｗ_xのクラスを分離したと
きのエントロピーの減少量である。また、数１２におい
てＨ（｛ｃ_i＼ｗ_x｝＋｛ｗ_x，ｗ_y｝＋｛ｗ_x，／ｗ_y｝）
は、元のすべての品詞クラスｃ_iから単語列｛ｗ_x，
ｗ_y｝のクラスを分離したときのエントロピーであり、
数１２のΔＨは単語列｛ｗ_x，ｗ_y｝のクラスを分離した
ときのエントロピーの減少量である。Here, in Equations 11 and 12, H({c_i}) is the entropy over all of the original part-of-speech classes c_i. In Equation 11, H({c_i\c_ξ} + {w_x} + {c_ξ\w_x}) is the entropy after the class of the word w_x has been separated from the original part-of-speech classes c_i, and ΔH is the resulting decrease in entropy. In Equation 12, H({c_i\w_x} + {w_x, w_y} + {w_x, /w_y}) is the entropy after the class of the word string {w_x, w_y} has been separated from the original part-of-speech classes c_i, and ΔH is the resulting decrease in entropy.
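The entropy reduction of Equation 11 can be illustrated with a small sketch. The entropy definition itself is not restated in this passage, so the code below assumes a class bigram model, P(w_i | w_i-1) = P(c_i | c_i-1) · P(w_i | c_i), with maximum-likelihood estimates taken from the same word sequence on which the entropy is measured. The function names and the ("WORD", word) class label are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter


def class_bigram_entropy(words, word2class):
    """Per-word entropy in bits of a class bigram model (assumed form:
    P(class | previous class) * P(word | class)), estimated by maximum
    likelihood from `words` and evaluated on the same sequence."""
    classes = [word2class[w] for w in words]
    class_count = Counter(classes)
    word_count = Counter(words)
    pair_count = Counter(zip(classes, classes[1:]))
    prev_count = Counter(classes[:-1])
    log_prob = 0.0
    for prev, cur, w in zip(classes, classes[1:], words[1:]):
        p_transition = pair_count[(prev, cur)] / prev_count[prev]
        p_emission = word_count[w] / class_count[cur]
        log_prob += math.log2(p_transition * p_emission)
    return -log_prob / (len(words) - 1)


def entropy_reduction_for_word_split(words, word2class, word):
    """Delta H in the sense of Equation 11: entropy before minus entropy
    after `word` is moved out of its class into a class of its own."""
    split = dict(word2class)
    split[word] = ("WORD", word)  # hypothetical label for the new word class
    before = class_bigram_entropy(words, word2class)
    after = class_bigram_entropy(words, split)
    return before - after


# Toy usage: "x" and "y" share class N but follow different words.
corpus = ["a", "x", "b", "y", "a", "x", "b", "y"]
pos = {"a": "A", "b": "B", "x": "N", "y": "N"}
print(entropy_reduction_for_word_split(corpus, pos, "x"))
```

Equation 12 could be handled the same way by first rewriting each occurrence of the pair w_x, w_y in the corpus as a single joined token; that step is omitted here.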

【００４５】次いで、ステップＳ４においては、ステッ
プＳ２でリストアップされたすべての分離クラス候補の
中で、ステップＳ３で計算したエントロピー減少量ΔＨ
を最大にするクラスのみを実際にクラス分離する。そし
て、ステップＳ５で分離クラス数が所定のしきい値の所
望分離クラス数（例えば、５００、１０００など）以上
になったか否かを判断し、なっていないときは、ステッ
プＳ２に戻って上記の処理を繰り返す。一方、ステップ
Ｓ５で所望分離クラス数以上になっているときは、ステ
ップＳ６で、得られた統計的言語モデル２２をメモリに
格納した後、当該言語モデル生成処理を終了する。この
言語モデル生成処理のアルゴリズムは、品詞間、およ
び、品詞と単語間の結合は行なわないため、生成完了時
点では、品詞のバイグラムと可変長単語のＮ−グラムの
特徴を併せた統計的言語モデル２２となる。Next, in step S4, among all the separation class candidates listed in step S2, only the class separation that maximizes the entropy reduction ΔH calculated in step S3 is actually carried out. In step S5, it is determined whether the number of separated classes has reached a predetermined threshold, the desired number of separated classes (for example, 500 or 1000); if not, the process returns to step S2 and the above processing is repeated. If the desired number has been reached in step S5, the resulting statistical language model 22 is stored in memory in step S6 and the language model generation process ends. Because this algorithm does not join parts of speech with each other or parts of speech with words, the model obtained on completion is a statistical language model 22 that combines the features of a part-of-speech bigram and a variable-length word N-gram.
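The loop over steps S2 to S6 can be sketched as a greedy search. This is a minimal illustration, not the apparatus itself: the model representation and the three callables (list_candidates, entropy_reduction, apply_split) are hypothetical placeholders supplied by the caller, and the early exit when no candidates remain is an added safeguard that the text does not describe.

```python
def generate_language_model(model, list_candidates, entropy_reduction,
                            apply_split, target_splits):
    """Greedy class separation: at each iteration, carry out only the
    candidate separation with the largest entropy reduction."""
    num_splits = 0
    while num_splits < target_splits:            # step S5: enough classes yet?
        candidates = list_candidates(model)      # step S2: list both candidate types
        if not candidates:
            break                                # safeguard, not part of the source
        best = max(candidates,                   # steps S3 and S4: score, pick the max
                   key=lambda cand: entropy_reduction(model, cand))
        model = apply_split(model, best)
        num_splits += 1
    return model                                 # step S6: the caller stores the model
```

Recomputing every candidate's ΔH after each separation keeps the sketch close to the described flow; a practical implementation would probably cache scores that a separation does not affect.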

【００４６】図１において、単語照合部４に接続され、
例えばハードディスクメモリに格納される音素ＨＭＭ１
１は、各状態を含んで表され、各状態はそれぞれ以下の
情報を有する。（ａ）状態番号（ｂ）受理可能なコンテキストクラス（ｃ）先行状態、及び後続状態のリスト（ｄ）出力確率密度分布のパラメータ（ｅ）自己遷移確率及び後続状態への遷移確率なお、本実施形態において用いる音素ＨＭＭ１１は、各
分布がどの話者に由来するかを特定する必要があるた
め、所定の話者混合ＨＭＭを変換して生成する。ここ
で、出力確率密度関数は３４次元の対角共分散行列をも
つ混合ガウス分布である。In FIG. 1, the phoneme HMM 11, which is connected to the word matching unit 4 and stored in, for example, a hard disk memory, is represented as a set of states, and each state holds the following information: (a) a state number; (b) the acceptable context classes; (c) lists of preceding and succeeding states; (d) the parameters of the output probability density distribution; and (e) the self-transition probability and the transition probabilities to succeeding states. Because it is necessary to identify which speaker each distribution originates from, the phoneme HMM 11 used in this embodiment is generated by converting a predetermined speaker-mixture HMM. The output probability density function is a Gaussian mixture distribution with 34-dimensional diagonal covariance matrices.

【００４７】また、単語照合部４に接続され、例えばハ
ードディスクに格納される単語辞書１２は、音素ＨＭＭ
１１の各単語毎にシンボルで表した読みを示すシンボル
列を格納する。The word dictionary 12, which is connected to the word matching unit 4 and stored on, for example, a hard disk, stores for each word a symbol string that gives the word's reading in the symbols of the phoneme HMM 11.

【００４８】図１において、話者の発声音声はマイクロ
ホン１に入力されて音声信号に変換された後、特徴抽出
部２に入力される。特徴抽出部２は、入力された音声信
号をＡ／Ｄ変換した後、例えばＬＰＣ分析を実行し、対
数パワー、１６次ケプストラム係数、Δ対数パワー及び
１６次Δケプストラム係数を含む３４次元の特徴パラメ
ータを抽出する。抽出された特徴パラメータの時系列は
バッファメモリ３を介して単語照合部４に入力される。In FIG. 1, a speaker's utterance is input to the microphone 1, converted into a speech signal, and passed to the feature extraction unit 2. The feature extraction unit 2 A/D-converts the input speech signal and then performs, for example, LPC analysis to extract 34-dimensional feature parameters consisting of log power, 16th-order cepstrum coefficients, Δ log power, and 16th-order Δ cepstrum coefficients. The time series of extracted feature parameters is input to the word matching unit 4 via the buffer memory 3.
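The 34 dimensions break down as 1 + 16 + 1 + 16. The sketch below only assembles such vectors from per-frame log power and cepstra that are assumed to be already available. The text does not say how the Δ terms are computed, so the two-frame central difference used here is an assumption, as are the function name and the ordering of components within the vector.

```python
CEPSTRUM_ORDER = 16


def assemble_feature_vectors(log_power, cepstra):
    """Build one 34-dimensional vector per frame:
    [log power, 16 cepstra, delta log power, 16 delta cepstra].
    Deltas are central differences clamped at the sequence edges
    (an assumed choice; the source does not specify the delta formula)."""
    num_frames = len(log_power)
    vectors = []
    for t in range(num_frames):
        lo = max(t - 1, 0)
        hi = min(t + 1, num_frames - 1)
        delta_power = (log_power[hi] - log_power[lo]) / 2.0
        delta_cepstra = [(cepstra[hi][k] - cepstra[lo][k]) / 2.0
                         for k in range(CEPSTRUM_ORDER)]
        vectors.append([log_power[t]] + list(cepstra[t])
                       + [delta_power] + delta_cepstra)
    return vectors
```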

【００４９】単語照合部４は、ワン−パス・ビタビ復号
化法を用いて、バッファメモリ３を介して入力される特
徴パラメータのデータに基づいて、音素ＨＭＭ１１と単
語辞書１２とを用いて単語仮説を検出し尤度を計算して
出力する。ここで、単語照合部４は、各時刻の各ＨＭＭ
の状態毎に、単語内の尤度と発声開始からの尤度を計算
する。尤度は、単語の識別番号、単語の開始時刻、先行
単語の違い毎に個別にもつ。また、計算処理量の削減の
ために、音素ＨＭＭ１１及び単語辞書１２とに基づいて
計算される総尤度のうちの低い尤度のグリッド仮説を削
減する。単語照合部４は、その結果の単語仮説と尤度の
情報を発声開始時刻からの時間情報（具体的には、例え
ばフレーム番号）とともにバッファメモリ５を介して単
語仮説絞込部６に出力する。The word matching unit 4 uses the one-pass Viterbi decoding method to detect word hypotheses, and to calculate and output their likelihoods, from the feature parameter data input via the buffer memory 3, using the phoneme HMM 11 and the word dictionary 12. For each HMM state at each time, the word matching unit 4 calculates the likelihood within the word and the likelihood from the start of the utterance. Likelihoods are held separately for each combination of word identification number, word start time, and preceding word. To reduce the amount of computation, grid hypotheses whose total likelihood, calculated from the phoneme HMM 11 and the word dictionary 12, is low are pruned. The word matching unit 4 outputs the resulting word hypotheses and likelihoods, together with time information measured from the start of the utterance (specifically, for example, a frame number), to the word hypothesis narrowing unit 6 via the buffer memory 5.

【００５０】単語仮説絞込部６は、単語照合部４からバ
ッファメモリ５を介して出力される単語仮説に基づい
て、統計的言語モデル２２を参照して、終了時刻が等し
く開始時刻が異なる同一の単語の単語仮説に対して、当
該単語の先頭音素環境毎に、発声開始時刻から当該単語
の終了時刻に至る計算された総尤度のうちの最も高い尤
度を有する１つの単語仮説で代表させるように単語仮説
の絞り込みを行った後、絞り込み後のすべての単語仮説
の単語列のうち、最大の総尤度を有する仮説の単語列を
認識結果として出力する。本実施形態においては、好ま
しくは、処理すべき当該単語の先頭音素環境とは、当該
単語より先行する単語仮説の最終音素と、当該単語の単
語仮説の最初の２つの音素とを含む３つの音素並びをい
う。Based on the word hypotheses output from the word matching unit 4 via the buffer memory 5, the word hypothesis narrowing unit 6 refers to the statistical language model 22 and narrows down the word hypotheses as follows: among word hypotheses for the same word that have the same end time but different start times, it keeps, for each head phoneme environment of the word, only the one hypothesis with the highest total likelihood calculated from the start of the utterance to the end time of the word. It then outputs, as the recognition result, the word string of the hypothesis with the highest total likelihood among the word strings of all the remaining word hypotheses. In this embodiment, the head phoneme environment of the word being processed preferably means the sequence of three phonemes made up of the final phoneme of the preceding word hypothesis and the first two phonemes of the word hypothesis for the word itself.

【００５１】例えば、図２に示すように、（ｉ−１）番
目の単語Ｗ_i-1の次に、音素列ａ₁，ａ₂，…，ａ_nからな
るｉ番目の単語Ｗ_iがくるときに、単語Ｗ_i-1の単語仮説
として６つの仮説Ｗａ，Ｗｂ，Ｗｃ，Ｗｄ，Ｗｅ，Ｗｆ
が存在している。ここで、前者３つの単語仮説Ｗａ，Ｗ
ｂ，Ｗｃの最終音素は／ｘ／であるとし、後者３つの単
語仮説Ｗｄ，Ｗｅ，Ｗｆの最終音素は／ｙ／であるとす
る。終了時刻ｔ_eと先頭音素環境が等しい仮説（図２で
は先頭音素環境が“ｘ／ａ₁／ａ₂”である上から３つの
単語仮説）のうち総尤度が最も高い仮説（例えば、図２
において１番上の仮説）以外を削除する。なお、上から
４番めの仮説は先頭音素環境が違うため、すなわち、先
行する単語仮説の最終音素がｘではなくｙであるので、
上から４番めの仮説を削除しない。すなわち、先行する
単語仮説の最終音素毎に１つのみ仮説を残す。図２の例
では、最終音素／ｘ／に対して１つの仮説を残し、最終
音素／ｙ／に対して１つの仮説を残す。For example, as shown in FIG. 2, suppose that the i-th word W_i, consisting of the phoneme string a_1, a_2, ..., a_n, follows the (i−1)-th word W_{i-1}, and that six hypotheses Wa, Wb, Wc, Wd, We, and Wf exist as word hypotheses for the word W_{i-1}. Suppose further that the final phoneme of the first three word hypotheses Wa, Wb, and Wc is /x/ and that the final phoneme of the last three word hypotheses Wd, We, and Wf is /y/. Among the hypotheses that share the same end time t_e and the same head phoneme environment (in FIG. 2, the top three word hypotheses, whose head phoneme environment is "x/a_1/a_2"), all but the hypothesis with the highest total likelihood (for example, the topmost hypothesis in FIG. 2) are deleted. The fourth hypothesis from the top is not deleted, because its head phoneme environment is different: the final phoneme of its preceding word hypothesis is y rather than x. In other words, only one hypothesis is kept for each final phoneme of the preceding word hypothesis. In the example of FIG. 2, one hypothesis is kept for the final phoneme /x/ and one for the final phoneme /y/.
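The narrowing rule of FIG. 2 amounts to grouping hypotheses by word, end time, and head phoneme environment and keeping the best-scoring one in each group. The sketch below assumes a simple record layout for a word hypothesis; the class and field names are hypothetical, and the total likelihood is treated as a log score for which larger is better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordHypothesis:
    word: str                  # identity of the word W_i
    phonemes: tuple            # phoneme string a_1, a_2, ..., a_n of the word
    start: int                 # start frame
    end: int                   # end frame t_e
    prev_last_phoneme: str     # final phoneme of the preceding word hypothesis
    total_likelihood: float    # score from the start of the utterance to `end`


def head_phoneme_environment(hyp):
    """Three-phoneme context: the previous word's final phoneme plus the
    first two phonemes of this word."""
    return (hyp.prev_last_phoneme,) + tuple(hyp.phonemes[:2])


def narrow_hypotheses(hypotheses):
    """Keep one hypothesis per (word, end time, head phoneme environment),
    namely the one with the highest total likelihood."""
    best = {}
    for hyp in hypotheses:
        key = (hyp.word, hyp.end, head_phoneme_environment(hyp))
        if key not in best or hyp.total_likelihood > best[key].total_likelihood:
            best[key] = hyp
    return list(best.values())
```

With six hypotheses that share a word and an end time, three preceded by /x/ and three by /y/, this returns two survivors, one per preceding final phoneme, as in the FIG. 2 example.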

【００５２】以上の実施形態においては、当該単語の先
頭音素環境とは、当該単語より先行する単語仮説の最終
音素と、当該単語の単語仮説の最初の２つの音素とを含
む３つの音素並びとして定義されているが、本発明はこ
れに限らず、先行する単語仮説の最終音素と、最終音素
と連続する先行する単語仮説の少なくとも１つの音素と
を含む先行単語仮説の音素列と、当該単語の単語仮説の
最初の音素を含む音素列とを含む音素並びとしてもよ
い。In the above embodiment, the head phoneme environment of a word is defined as the sequence of three phonemes made up of the final phoneme of the preceding word hypothesis and the first two phonemes of the word hypothesis for the word itself. The present invention is not limited to this definition: the head phoneme environment may instead be a phoneme sequence made up of a phoneme string of the preceding word hypothesis, containing its final phoneme and at least one phoneme of the preceding word hypothesis contiguous with that final phoneme, and a phoneme string containing the first phoneme of the word hypothesis for the word itself.

【００５３】以上の実施形態において、特徴抽出部２
と、単語照合部４と、単語仮説絞込部６と、言語モデル
生成部２０とは、例えば、デジタル電子計算機で構成さ
れ、バッファメモリ３，５は例えばハードデイスクメモ
リで構成され、音素ＨＭＭ１１と単語辞書１２と学習用
テキストデータ２１と統計的言語モデル２２とは、例え
ばハードデイスクメモリなどの記憶装置に記憶される。In the above embodiment, the feature extraction unit 2, the word matching unit 4, the word hypothesis narrowing unit 6, and the language model generation unit 20 are implemented by, for example, a digital computer; the buffer memories 3 and 5 are implemented by, for example, hard disk memories; and the phoneme HMM 11, the word dictionary 12, the learning text data 21, and the statistical language model 22 are stored in a storage device such as a hard disk memory.

【００５４】以上実施形態においては、単語照合部４と
単語仮説絞込部６とを用いて音声認識を行っているが、
本発明はこれに限らず、例えば、音素ＨＭＭ１１を参照
する音素照合部と、例えばＯｎｅＰａｓｓＤＰアル
ゴリズムを用いて統計的言語モデル２２を参照して単語
の音声認識を行う音声認識部とで構成してもよい。In the above embodiment, speech recognition is performed using the word matching unit 4 and the word hypothesis narrowing unit 6. The present invention is not limited to this configuration; the apparatus may instead consist of, for example, a phoneme matching unit that refers to the phoneme HMM 11 and a speech recognition unit that recognizes words by referring to the statistical language model 22 using, for example, the One Pass DP algorithm.

【００５５】[0055]

【実施例】本発明者は、本実施形態で用いる統計的言語
モデル２２の性能を確認するため、パープレキシティお
よびパラメータ数について従来の単語Ｎ−グラムとの比
較を行った。実験に用いたデータは本出願人が所有する
自然発話旅行会話データベース（従来文献８「Ｍｏｒｉ
ｍｏｔｏほか，“ＡＳｐｅｅｃｈａｎｄＬａｎｇ
ｕａｇｅＤａｔａｂａｓｅｆｏｒＳｐｅｅｃｈ
ＴｒａｎｓｌａｔｉｏｎＲｅｓｅａｒｃｈ”，ＩＣＳ
ＬＰ，ｐｐ１７９１−１７９４，１９９４年」参照。）
であって、８４６対話、３５４，７００語から構成され
る。このうち、統計的言語モデル２２を生成するための
学習用テキストデータ（トレーニングセットともい
う。）として、８２８対話、３４７，２９９語を使用
し、残りのデータをテスト用テキストデータ（テストセ
ットともいう。）とした。本実施形態に係る統計的言語
モデル２２は、初期クラスを活用形も含めた８０品詞と
し、１０００個まで分離を行い、１００個おきにデータ
を採取した。また、本実施形態に係る統計的言語モデル
２２と、単語Ｎ−グラムとともに、未知単語遷移に対す
る対策として、クラスおよび単語の遷移確率を削除補間
法（従来文献４参照。）によって補間し、テストセット
において、未知語が出現したときは、所定の固定値（例
えば、７．０×１０^-6）を与えた。ここで、本発明に係
る統計的言語モデル２２を評価するために、パープレキ
シティを用いる。例えば、複数ｎ個の単語からなる長い
単語列ｗ₁ ⁿ＝ｗ₁ｗ₂…ｗ_nがあるときのエントロピーＨ
（ｎ）は次式で表される。EXAMPLE To confirm the performance of the statistical language model 22 used in this embodiment, the present inventor compared it with conventional word N-grams in terms of perplexity and number of parameters. The data used in the experiment is the spontaneous-speech travel conversation database owned by the present applicant (see prior art document 8: Morimoto et al., "A Speech and Language Database for Speech Translation Research", ICSLP, pp. 1791-1794, 1994), which consists of 846 dialogues and 354,700 words. Of these, 828 dialogues (347,299 words) were used as the learning text data (also called the training set) for generating the statistical language model 22, and the remaining data was used as the test text data (also called the test set). For the statistical language model 22 of this embodiment, the initial classes were 80 parts of speech, including conjugated forms; class separation was carried out up to 1000 separations, and data was collected every 100 separations. For both the statistical language model 22 of this embodiment and the word N-grams, class and word transition probabilities were smoothed by deleted interpolation (see prior art document 4) as a measure against unknown word transitions, and a predetermined fixed value (for example, 7.0 × 10^-6) was assigned whenever an unknown word appeared in the test set. Perplexity is used to evaluate the statistical language model 22 according to the present invention. For example, for a long word string w_1^n = w_1 w_2 ... w_n consisting of n words, the entropy H(n) is given by the following equation.

【００５６】[0056]

【数１３】Ｈ（ｎ）＝−（１／ｎ）・ｌｏｇ₂Ｐ（ｗ₁ ⁿ）[Equation 13] H(n) = −(1/n) · log₂ P(w_1^n)

【００５７】ここで、Ｐ（ｗ₁ ⁿ）は単語列ｗ₁ ⁿの生成確
率であり、パープレキシティＰＰ（ｎ）は次式で表され
る。Here, P (w ₁ ⁿ ) is the generation probability of the word string w ₁ ⁿ , and the perplexity PP (n) is expressed by the following equation.

【００５８】[0058]

【数１４】ＰＰ（ｎ）＝２^H(n) [Equation 14] PP(n) = 2^H(n)

【００５９】ここで、単語列としてテスト用テキストデ
ータを用いたときのパープレキシティをテストセットパ
ープレキシティといい、単語列として学習用テキストデ
ータを用いたときのパープレキシティをトレーニングセ
ットパープレキシティという。Here, the perplexity obtained when the test text data is used as the word string is called the test-set perplexity, and the perplexity obtained when the learning text data is used as the word string is called the training-set perplexity.
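Equations 13 and 14 can be evaluated directly once a model supplies a probability for each word of the string. The sketch below assumes that P(w_1^n) factorizes into per-word conditional probabilities, so that log₂ P(w_1^n) is the sum of their logarithms; the function names are illustrative only.

```python
import math


def entropy_per_word(word_probabilities):
    """Equation 13: H(n) = -(1/n) * log2 P(w_1^n), with P(w_1^n) taken as
    the product of the per-word conditional probabilities supplied."""
    n = len(word_probabilities)
    return -sum(math.log2(p) for p in word_probabilities) / n


def perplexity(word_probabilities):
    """Equation 14: PP(n) = 2 ** H(n)."""
    return 2.0 ** entropy_per_word(word_probabilities)
```

As a sanity check, a model that assigns probability 1/4 to every word has H(n) = 2 bits and a perplexity of 4, which matches the usual reading of perplexity as an effective branching factor.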

【００６０】当該実験結果におけるテストセットパープ
レキシティの値の変化の様子を図７に示す。図７から明
らかなように、分離クラス数が増加するに従って、テス
トセットパープレキシティは減少し、分離クラス数が２
００で単語バイグラムと、分離クラス数が６００で単語
トライグラムと同程度の値となることが分かる。分離ク
ラス数が６００以上になると、パープレキシティの減少
の割合が極端に小さくなるため、分離クラス６００程度
で、本実施形態の統計的言語モデル２２が最も有効に働
いていると考えられる。従って、本実施形態の統計的言
語モデル２２は単語バイグラム以上、単語トライグラム
と同程度の予測精度の言語モデルと考えられる。FIG. 7 shows how the test-set perplexity changes in the experiment. As FIG. 7 makes clear, the test-set perplexity decreases as the number of separated classes increases; it is comparable to that of the word bigram at 200 separated classes and to that of the word trigram at 600 separated classes. Beyond 600 separated classes the perplexity decreases only very slowly, so the statistical language model 22 of this embodiment is considered to work most effectively at around 600 separated classes. The statistical language model 22 of this embodiment can therefore be regarded as a language model whose prediction accuracy is at least that of the word bigram and comparable to that of the word trigram.

【００６１】表１にまた、分離クラス数が０，５００，
１０００の時のパープレキシティの値、およびパラメー
タ数を示す。Table 1 shows the perplexity values and the numbers of parameters when the number of separated classes is 0, 500, and 1000.

【００６２】[0062]

【表１】[Table 1] Performance comparison of each language model
───────────────────────────────────────────────────────────────────────────
                            Bigram      Trigram      This embodiment (number of separated classes)
                                                     0           500         1000
───────────────────────────────────────────────────────────────────────────
Test-set perplexity         20.31       16.96        41.68       17.61       16.75
───────────────────────────────────────────────────────────────────────────
Training-set perplexity     13.50       5.99         48.77       18.77       15.05
───────────────────────────────────────────────────────────────────────────
Number of parameters (1)    4.10×10⁷    2.62×10¹¹    1.28×10⁴    3.43×10⁵    1.17×10⁶
───────────────────────────────────────────────────────────────────────────
Number of parameters (2)    52,244      165,139      7,991       27,830      43,075
───────────────────────────────────────────────────────────────────────────

【００６３】ここで、パラメータ数（１）は全クラス
（単語）の遷移の組み合わせ数を意味し、パラメータ数
（２）は、トレーニングセットにおいて実際に存在する
クラス（単語）遷移の組み合わせ数を意味する。表１よ
り、本実施形態の統計的言語モデル２２は、テストセッ
トとトレーニングセットとのパープレキシティの差が、
単語バイグラム及び単語トライグラムと比較して非常に
小さいことが分かる。また、パラメータ数は、１０００
クラス分離した時でも、単語バイグラムよりも少なく、
単語トライグラムよりもはるかに少ない。したがって、
本実施形態の統計的言語モデル２２は、与えられたパラ
メータで言語特徴を効率的に表現できる優れた言語モデ
ルであると言える。従って、当該統計的言語モデル２２
は従来の単語バイグラム、単語トライグラムよりも信頼
性が高い言語モデルであると考えられる。Here, the number of parameters (1) is the number of all possible class (word) transition combinations, and the number of parameters (2) is the number of class (word) transition combinations that actually occur in the training set. Table 1 shows that, for the statistical language model 22 of this embodiment, the difference in perplexity between the test set and the training set is very small compared with the word bigram and the word trigram. Moreover, even after 1000 class separations, its number of parameters is smaller than that of the word bigram and far smaller than that of the word trigram. The statistical language model 22 of this embodiment can therefore be said to be an excellent language model that expresses linguistic features efficiently with the parameters available, and it is considered to be more reliable than the conventional word bigram and word trigram.
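The figures for the number of parameters (1) in Table 1 can be reproduced with simple arithmetic, using the 6,400-word vocabulary given in the description of the recognition experiment and the 80 initial part-of-speech classes. The word N-gram counts follow directly as 6,400 raised to the power N. For the class-based model the text gives no formula; the one below (squared number of classes plus one emission parameter per vocabulary word, with each separation adding one class) is an assumption that happens to match the table to three significant figures.

```python
VOCABULARY_SIZE = 6400   # full-dictionary size stated for the experiment
INITIAL_POS_CLASSES = 80


def word_ngram_parameters(order):
    """All possible word transitions for a word N-gram of the given order."""
    return VOCABULARY_SIZE ** order


def class_model_parameters(num_separations):
    """Assumed count: class-to-class transitions plus one word-given-class
    parameter per vocabulary word; each separation adds one class."""
    num_classes = INITIAL_POS_CLASSES + num_separations
    return num_classes ** 2 + VOCABULARY_SIZE


for separations in (0, 500, 1000):
    print(separations, f"{class_model_parameters(separations):.2e}")
print("bigram", f"{word_ngram_parameters(2):.2e}")
print("trigram", f"{word_ngram_parameters(3):.2e}")
```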

【００６４】また、本実施形態の統計的言語モデル２２
の信頼性を確認するため、学習単語数を変化させてテス
トセットパープレキシティの値の変化を調べた結果を図
８に示す。この図８から明らかなように、全ての学習セ
ット（約３５万語）を用いたときは、単語バイグラム
と、本実施形態の統計的言語モデル２２（２００クラ
ス）（カッコ内の数字は分離クラス数を表す、以下同様
である。）とは、ほぼ同じパープレキシティ値である
が、学習単語数を減少させても当該統計的言語モデル２
２のパープレキシティの増加は比較的小さく、単語バイ
グラムよりも値が低くなることが分かる。同様に、単語
トライグラムと、当該統計的言語モデル２２（６００ク
ラス）とを比較しても、学習単語数が減少すると、当該
統計的言語モデル２２の方が低いパープレキシティを呈
する。To confirm the reliability of the statistical language model 22 of this embodiment, the test-set perplexity was also examined while varying the number of training words; FIG. 8 shows the result. As FIG. 8 makes clear, when the entire training set (about 350,000 words) is used, the word bigram and the statistical language model 22 of this embodiment (200 classes; the number in parentheses is the number of separated classes, here and below) have almost the same perplexity, but as the number of training words is reduced the perplexity of the statistical language model 22 increases relatively little and falls below that of the word bigram. Likewise, in a comparison between the word trigram and the statistical language model 22 (600 classes), the statistical language model 22 shows the lower perplexity as the number of training words decreases.

【００６５】次いで、本発明者は、本実施形態の統計的
言語モデル２２を図１の連続音声認識装置に適用し、統
計的言語モデル２２の効果を確認した。音素認識の実験
条件を表２に示す。また、音響をパラメータもあわせて
表２に示す。Next, the present inventor applied the statistical language model 22 of this embodiment to the continuous speech recognition apparatus of FIG. 1 and confirmed its effect. Table 2 shows the experimental conditions for the phoneme recognition, together with the acoustic parameters.

【００６６】[0066]

【表２】[Table 2] Experimental conditions
───────────────────────────────────────────────────────────────────────────
Analysis conditions    Sampling frequency: 12 kHz; Hamming window: 20 ms; frame period: 10 ms
───────────────────────────────────────────────────────────────────────────
Parameters used        16th-order LPC cepstrum + 16th-order Δ cepstrum + log power + Δ log power
───────────────────────────────────────────────────────────────────────────
Acoustic model         Gender-dependent speaker-independent HM-net models, 400 states, 5 mixtures
───────────────────────────────────────────────────────────────────────────

【００６７】表２において、ＨＭ網の男女別不特定話者
モデルについては、従来文献９「小坂ほか，“話者混合
ＳＳＳによる不特定話者音声認識”，日本音響学会講演
論文集，２−５−９，ｐｐ１３５−１３６，平成４年」
に開示されている。この実験では、単語グラフを用いた
連続音声認識法を用いて音響モデルおよび言語モデルを
連続音声認識装置に適用した。また、認識の対象は、統
計的言語モデル２２のトレーニングセット中の１６対話
であり、学習に用いられていないテストセットは１８対
話である。各言語モデルで尤度１位の文認識候補の正解
単語含有率を表３に示す。In Table 2, the gender-dependent speaker-independent HM-net models are those disclosed in prior art document 9 (Kosaka et al., "Speaker-independent speech recognition using speaker-mixture SSS", Proceedings of the Acoustical Society of Japan, 2-5-9, pp. 135-136, 1992). In this experiment, the acoustic model and the language model were applied to the continuous speech recognition apparatus using a continuous speech recognition method based on word graphs. The recognition targets were 16 dialogues from the training set of the statistical language model 22 and 18 dialogues from the test set, which was not used for training. Table 3 shows the correct-word rate of the top-likelihood sentence recognition candidate for each language model.

【００６８】[0068]

【表３】[Table 3] Correct-word rate
───────────────────────────────────────────────────────────────────────────
                                       Bigram    This embodiment (number of separated classes)
                                                 0         500
───────────────────────────────────────────────────────────────────────────
Dictionary subset    Test set          71.4      67.3      72.2
                     Training set      69.4      63.4      69.7
───────────────────────────────────────────────────────────────────────────
Full dictionary      Test set          --        57.1      58.4
                     Training set      --        54.6      56.0
───────────────────────────────────────────────────────────────────────────

【００６９】表３において、辞書サブセットは認識対象
に含まれる単語のみを辞書に登録したもの（７５０
語）、辞書フルセットは、統計的言語モデルの生成のた
めの学習に用いた全単語よりなる辞書（６，４００語）
を表す。ただし、従来の単語バイグラムは、メモリ容量
と計算時間の都合上で、辞書フルセットの辞書の認識
は、今回の実験では計算を行なっていない。この場合
は、言い換えれば、大容量のメモリと莫大な処理時間が
必要である。In Table 3, the dictionary subset is a dictionary in which only the words contained in the recognition targets are registered (750 words), and the full dictionary is a dictionary consisting of all the words used in training the statistical language model (6,400 words). Recognition with the full dictionary was not carried out for the conventional word bigram in this experiment because of memory capacity and computation time; in other words, that case requires a large amount of memory and an enormous processing time.

【００７０】テストセットに関しては、パープレキシテ
ィの低い順、すなわち本実施形態の統計的言語モデル２
２（０クラス）→単語バイグラム→本実施形態の統計的
言語モデル２２（５００クラス）の順で正解単語含有率
が良くなっており、本実施形態の統計的言語モデル２２
（５００クラス）は、単語のバイグラムよりも若干では
あるが正解単語含有率が向上している。トレーニングセ
ットに関しては、本実施形態の統計的言語モデル２２
（５００クラス）は単語バイグラムよりも高いパープレ
キシティであるが、正解単語含有率は高くなっている。
また、本実施形態の統計的言語モデル２２はパラメータ
数が少ないので、大語彙の認識への拡張が容易ある。し
たがって、本実施形態の統計的言語モデル２２は連続音
声認識に対しても単語バイグラムより有効な言語モデル
であると考えられる。For the test set, the correct-word rate improves in order of decreasing perplexity, that is, in the order statistical language model 22 of this embodiment (0 classes) → word bigram → statistical language model 22 of this embodiment (500 classes); the statistical language model 22 (500 classes) gives a slightly higher correct-word rate than the word bigram. For the training set, the statistical language model 22 (500 classes) has a higher perplexity than the word bigram but still gives a higher correct-word rate. In addition, because the statistical language model 22 of this embodiment has few parameters, it is easy to extend to large-vocabulary recognition. The statistical language model 22 of this embodiment is therefore considered to be a more effective language model than the word bigram for continuous speech recognition as well.

【００７１】以上説明したように、Ｎ−グラムの精度・
信頼性の向上を目的とした可変長Ｎ−グラムの統計的言
語モデル２２の生成装置及びこれを用いた連続音声認識
装置を実現することができる。当該統計的言語モデル２
２は、品詞バイグラムを初期状態とし、品詞クラスから
の単語分離、および、連接単語の結合という、２種類の
状態分離を行なうことにより生成されるもので、品詞バ
イグラムと可変長単語Ｎ−グラムの特徴を併せ持つモデ
ルである。当該統計的言語モデル２２の評価実験の結
果、当該統計的言語モデル２２は、単語バイグラム以
上、単語トライグラムと同等のパープレキシティを、は
るかに少ないパラメータで実現できることが分かり、目
的とした性能が実現されていることが確認できた。ま
た、連続音声認識に適用した結果、単語バイグラムと同
じ程度の正解単語含有率を得ることができた。当該統計
的言語モデル２２は少ないパラメータで実現できるた
め、大語彙の音声認識にも容易に拡張可能である。As described above, a generation apparatus for the variable-length N-gram statistical language model 22, which aims to improve the accuracy and reliability of N-grams, and a continuous speech recognition apparatus using it can be realized. The statistical language model 22 is generated by taking a part-of-speech bigram as the initial state and performing two kinds of state separation, the separation of words from part-of-speech classes and the joining of adjacent words; it is a model that combines the features of a part-of-speech bigram and a variable-length word N-gram. The evaluation experiment showed that the statistical language model 22 achieves a perplexity at least as good as the word bigram and equivalent to the word trigram with far fewer parameters, confirming that the intended performance has been achieved. When applied to continuous speech recognition, it gave a correct-word rate comparable to that of the word bigram. Because the statistical language model 22 can be realized with few parameters, it can also be extended easily to large-vocabulary speech recognition.

【００７２】従って、遷移確率の予測精度及び信頼性を
改善することができる統計的言語モデル２２を生成する
ことができる統計的言語モデル生成装置を提供すること
ができるとともに、当該統計的言語モデル２２を用いて
より高い音声認識率で連続的に音声認識することができ
る連続音声認識装置を提供することができる。Accordingly, it is possible to provide a statistical language model generation apparatus capable of generating a statistical language model 22 that improves the prediction accuracy and reliability of transition probabilities, and to provide a continuous speech recognition apparatus that uses the statistical language model 22 to perform continuous speech recognition at a higher recognition rate.

【００７３】[0073]

【発明の効果】以上詳述したように本発明に係る請求項
１記載の統計的言語モデル生成装置によれば、所定の話
者の発声音声文を書き下した学習用テキストデータに基
づいて、すべての語彙を品詞毎にクラスタリングされた
品詞クラスに分類し、それらの品詞クラス間のバイグラ
ムを初期状態の統計的言語モデルとして生成する生成手
段と、上記生成手段によって生成された初期状態の統計
的言語モデルに基づいて、単語の品詞クラスからの分離
することができる第１の分離クラス候補と、１つの単語
と１つの単語との結合、１つの単語と複数の単語の単語
列との結合、複数の単語の単語列と１つの単語との結
合、複数の単語の単語列と、複数の単語の単語列との結
合とを含む連接単語又は連接単語列の結合によって単語
の品詞クラスから分離することができる第２の分離クラ
ス候補とを検索する検索手段と、上記検索手段によって
検索された第１と第２の分離クラス候補に対して、次単
語の予測の難易度を表わす所定のエントロピーを用い
て、クラスを分離することによる当該エントロピーの減
少量を計算する計算手段と、上記計算手段によって計算
された上記第１と第２の分離クラス候補に対するエント
ロピーの減少量の中で最大のクラス分離を選択して、選
択されたクラスの分離を実行することにより、品詞のバ
イグラムと可変長Ｎの単語のＮ−グラムとを含む統計的
言語モデルを生成する分離手段と、上記分離手段によっ
て生成された統計的言語モデルのクラス数が所定のクラ
ス数になるまで、上記分離手段によって生成された統計
的言語モデルを処理対象モデルとして、上記検索手段の
処理と、上記計算手段の処理と、上記分離手段の処理と
を繰り返すことにより、所定のクラス数を有する統計的
言語モデルを生成する制御手段とを備える。従って、遷
移確率の予測精度及び信頼性を改善することができる統
計的言語モデルを生成することができる。また、当該統
計的言語モデルは少ないパラメータで実現できるため、
大語彙の音声認識にも容易に拡張可能であるという特有
の利点を有する。As described in detail above, the statistical language model generation apparatus according to claim 1 of the present invention comprises: generation means for classifying all vocabulary items into part-of-speech classes clustered by part of speech, based on learning text data in which the uttered sentences of predetermined speakers are transcribed, and for generating a bigram between those part-of-speech classes as an initial-state statistical language model; search means for searching, based on the initial-state statistical language model generated by the generation means, for first separation class candidates, in which a word can be separated from its part-of-speech class, and second separation class candidates, which can be separated from the part-of-speech class of a word by joining adjacent words or word strings, including the joining of one word with one word, of one word with a word string of a plurality of words, of a word string of a plurality of words with one word, and of a word string of a plurality of words with a word string of a plurality of words; calculation means for calculating, for the first and second separation class candidates found by the search means, the decrease in a predetermined entropy, which represents the difficulty of predicting the next word, that results from separating the class; separation means for selecting the class separation with the largest decrease in entropy among those calculated by the calculation means for the first and second separation class candidates and carrying out the selected class separation, thereby generating a statistical language model that includes a part-of-speech bigram and word N-grams of variable length N; and control means for generating a statistical language model having a predetermined number of classes by repeating the processing of the search means, the calculation means, and the separation means, with the statistical language model generated by the separation means as the model to be processed, until the number of classes of the statistical language model generated by the separation means reaches the predetermined number of classes. A statistical language model that improves the prediction accuracy and reliability of transition probabilities can therefore be generated. Moreover, because the statistical language model can be realized with few parameters, it has the particular advantage that it can be extended easily to large-vocabulary speech recognition.

【００７４】本発明に係る請求項２記載の音声認識装置
においては、入力される発声音声文の音声信号に基づい
て、所定の統計的言語モデルを用いて音声認識する音声
認識手段を備えた音声認識装置において、上記音声認識
手段は、品詞のバイグラムと可変長Ｎの単語のＮ−グラ
ムとを含む統計的言語モデルを用いて音声認識する。従
って、遷移確率の予測精度及び信頼性を改善することが
できる統計的言語モデルを用いて音声認識するので、よ
り高い音声認識率で音声認識することができる音声認識
装置を提供することができる。In the speech recognition apparatus according to claim 2 of the present invention, which comprises speech recognition means for recognizing speech from the speech signal of an input uttered sentence using a predetermined statistical language model, the speech recognition means performs speech recognition using a statistical language model that includes a part-of-speech bigram and word N-grams of variable length N. Because speech is recognized using a statistical language model that improves the prediction accuracy and reliability of transition probabilities, a speech recognition apparatus capable of a higher recognition rate can be provided.

【００７５】また、請求項３記載の音声認識装置におい
ては、上記統計的言語モデルは、請求項１記載の統計的
言語モデル生成装置によって生成された。従って、遷移
確率の予測精度及び信頼性を改善することができる統計
的言語モデルを用いて音声認識するので、より高い音声
認識率で音声認識することができる音声認識装置を提供
することができる。In the speech recognition apparatus according to claim 3, the statistical language model is one generated by the statistical language model generation apparatus according to claim 1. Because speech is recognized using a statistical language model that improves the prediction accuracy and reliability of transition probabilities, a speech recognition apparatus capable of a higher recognition rate can be provided.

[0076] The continuous speech recognition apparatus according to claim 4 of the present invention comprises speech recognition means for continuously recognizing speech by detecting word hypotheses for an uttered speech sentence on the basis of the speech signal of the input uttered speech sentence and calculating their likelihoods. The speech recognition means refers to the statistical language model generated by the statistical language model generating apparatus according to claim 1 and, for word hypotheses of the same word that have the same end time but different start times, narrows down the word hypotheses so that, for each leading-phoneme environment of the word, they are represented by the single word hypothesis having the highest of the total likelihoods calculated from the utterance start time to the end time of the word. That is, compared with the conventional word-pair approximation, in which one representative word hypothesis is kept for each preceding word, hypotheses that share the same phoneme preceding the word's first phoneme (that is, the same final phoneme of the preceding word) are handled together, so the number of word hypotheses can be reduced and the approximation is highly effective. The reduction is especially large when the vocabulary size increases. Therefore, even when the continuous speech recognition apparatus is used to recognize spontaneous speech, in which inserted interjections, hesitations, and rephrasings occur frequently, the computational cost of merging or splitting word hypotheses is smaller than in the conventional example. In other words, less processing is required for speech recognition, so the storage capacity required of the storage device is reduced and the processing time for speech recognition is shortened. Furthermore, because recognition uses a statistical language model that improves the prediction accuracy and reliability of the transition probabilities, a continuous speech recognition apparatus that recognizes speech continuously at a higher speech recognition rate can be provided.
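The narrowing of word hypotheses described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the `WordHypothesis` record, its field names, and the use of log-likelihoods are assumptions made for illustration, and the leading-phoneme environment is represented here only by the final phoneme of the preceding word.

```python
from typing import NamedTuple


class WordHypothesis(NamedTuple):
    # Hypothetical record layout, assumed for this sketch only.
    word: str             # hypothesized word
    start: int            # start time (frame index)
    end: int              # end time (frame index)
    prev_phoneme: str     # final phoneme of the preceding word
    total_loglik: float   # total log-likelihood from utterance start to `end`


def narrow_hypotheses(hyps):
    """Keep one hypothesis per (word, end time, preceding phoneme).

    Hypotheses of the same word with the same end time but different
    start times are represented by the one with the highest total
    likelihood, separately for each preceding-phoneme context.
    """
    best = {}
    for h in hyps:
        key = (h.word, h.end, h.prev_phoneme)
        if key not in best or h.total_loglik > best[key].total_loglik:
            best[key] = h
    return list(best.values())
```

Under a word-pair approximation the key would instead contain the whole preceding word, so grouping by a single preceding phoneme yields at most as many surviving hypotheses, and usually fewer as the vocabulary grows.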

[Brief description of the drawings]

FIG. 1 is a block diagram of a continuous speech recognition apparatus according to an embodiment of the present invention.

FIG. 2 is a timing chart showing the processing of the word hypothesis narrowing unit 6 in the continuous speech recognition apparatus of FIG. 1.

FIG. 3 is a state transition diagram showing a bigram statistical language model.

FIG. 4 is a state transition diagram showing a trigram statistical language model.

FIG. 5 is a state transition diagram showing a model under the variable-length N-gram used in the continuous speech recognition apparatus of FIG. 1.

FIG. 6 is a flowchart showing the language model generation process executed by the language model generation unit 20 of FIG. 1.

FIG. 7 is a graph showing test-set perplexity versus the number of separated classes for the statistical language model generated by the language model generation unit 20 of FIG. 1.

FIG. 8 is a graph showing test-set perplexity versus the number of words in the learning data for the statistical language model generated by the language model generation unit 20 of FIG. 1.

[Explanation of symbols]

1: microphone, 2: feature extraction unit, 3, 5: buffer memories, 4: word matching unit, 6: word hypothesis narrowing unit, 11: phoneme HMM, 12: word dictionary, 20: language model generation unit, 21: learning text data, 22: statistical language model.

Continuation of front page

(72) Inventor: Shoichi Matsunaga, c/o ATR Interpreting Telecommunications Research Laboratories (株式会社エイ・ティ・アール音声翻訳通信研究所), 5 Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto

(56) References: JP-A-5-108704 (JP, A); JP-A-5-250405 (JP, A); Proceedings of the Acoustical Society of Japan (March 1996), 1-P-17, pp. 195-196

(58) Fields searched (Int. Cl.6, DB name): G10L 3/00 535; G10L 3/00 561; JICST file (JOIS)

Claims

(57) [Claims]

1. A statistical language model generating apparatus comprising:
generating means for classifying all vocabulary items into part-of-speech classes, clustered by part of speech, on the basis of learning text data transcribed from the uttered speech sentences of a predetermined speaker, and for generating a bigram between those part-of-speech classes as an initial-state statistical language model;
search means for searching, on the basis of the initial-state statistical language model generated by the generating means, for first separation class candidates, each of which allows a word to be separated from its part-of-speech class, and second separation class candidates, each of which allows separation from a word class by joining connected words or connected word strings, the joining including joining one word with one word, one word with a word string of a plurality of words, a word string of a plurality of words with one word, and a word string of a plurality of words with a word string of a plurality of words;
calculating means for calculating, for the first and second separation class candidates found by the search means, the amount by which separating the class reduces a predetermined entropy representing the difficulty of predicting the next word;
separating means for selecting, from among the first and second separation class candidates, the class separation with the largest entropy reduction calculated by the calculating means and executing the selected class separation, thereby generating a statistical language model that includes a part-of-speech bigram and word N-grams of variable length N; and
control means for repeating the processing of the search means, the processing of the calculating means, and the processing of the separating means, taking the statistical language model generated by the separating means as the model to be processed, until the number of classes of that statistical language model reaches a predetermined number of classes, thereby generating a statistical language model having the predetermined number of classes.

2. A speech recognition apparatus comprising speech recognition means for recognizing speech, on the basis of the speech signal of an input uttered speech sentence, using a predetermined statistical language model, wherein the speech recognition means performs speech recognition using a statistical language model that includes a part-of-speech bigram and word N-grams of variable length N.

3. A speech recognition apparatus, wherein the statistical language model is generated by the statistical language model generating apparatus according to claim 1.

4. A continuous speech recognition apparatus comprising speech recognition means for continuously recognizing speech by detecting word hypotheses for an uttered speech sentence on the basis of the speech signal of the input uttered speech sentence and calculating their likelihoods, wherein the speech recognition means refers to the statistical language model generated by the statistical language model generating apparatus according to claim 1 and, for word hypotheses of the same word having the same end time and different start times, narrows down the word hypotheses so that, for each leading-phoneme environment of the word, they are represented by the one word hypothesis having the highest likelihood among the total likelihoods calculated from the utterance start time to the end time of the word.
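As an informal illustration only, and not part of the claims, the repeated search, calculation, and separation recited in claim 1 can be sketched in Python as a greedy loop. The sketch rests on several assumptions: it handles only the first kind of separation candidate (a single word separated from its part-of-speech class) and omits the joining of connected word strings; entropy is estimated on the learning text itself with maximum-likelihood counts; and the function names and the `"word:"` class-label prefix are hypothetical.

```python
import math
from collections import Counter


def entropy(words, cls):
    """Per-word entropy (bits) of a class bigram model on `words`.

    P(w_i | w_{i-1}) is modeled as P(c_i | c_{i-1}) * P(w_i | c_i), with
    maximum-likelihood estimates taken from `words` itself (an assumption).
    """
    classes = [cls[w] for w in words]
    word_count = Counter(words[1:])
    class_count = Counter(classes[1:])
    prev_count = Counter(classes[:-1])
    pair_count = Counter(zip(classes, classes[1:]))
    total = 0.0
    for i in range(1, len(words)):
        p_class = pair_count[(classes[i - 1], classes[i])] / prev_count[classes[i - 1]]
        p_word = word_count[words[i]] / class_count[classes[i]]
        total -= math.log2(p_class * p_word)
    return total / (len(words) - 1)


def separate_classes(words, pos_of, target_classes):
    """Greedily split words out of their classes until the target count."""
    cls = dict(pos_of)  # initial state: every word sits in its POS class
    while len(set(cls.values())) < target_classes:
        base = entropy(words, cls)
        sizes = Counter(cls.values())
        best_word, best_gain = None, -math.inf
        for w in sorted(cls):  # search step: enumerate candidates
            if sizes[cls[w]] < 2:
                continue  # the word is already alone in its class
            trial = dict(cls)
            trial[w] = "word:" + w
            gain = base - entropy(words, trial)  # entropy reduction
            if gain > best_gain:
                best_word, best_gain = w, gain
        if best_word is None:
            break  # no candidate left to separate
        cls[best_word] = "word:" + best_word  # separation step
    return cls
```

Each pass re-evaluates every candidate against the current model, which mirrors the claim's requirement that the model produced by one separation becomes the model processed in the next iteration; a practical system would need a cheaper way to score candidates on a large vocabulary.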