JP2002091484A

JP2002091484A - Language model generator and voice recognition device using the generator, language model generating method and voice recognition method using the method, computer readable recording medium which records language model generating program and computer readable recording medium which records voice recognition program

Info

Publication number: JP2002091484A
Application number: JP2000280655A
Authority: JP
Inventors: Jun Ishii; 純石井
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2000-09-14
Filing date: 2000-09-14
Publication date: 2002-03-27
Anticipated expiration: 2020-09-14
Also published as: JP4270732B2

Abstract

PROBLEM TO BE SOLVED: To obtain a language model having high estimation precision and a voice recognition device having high recognition precision. SOLUTION: A learning text data tree structure clustering means 2001 performs tree structure clustering, which hierarchically divides learning text data 1001 so that each division has linguistically similar properties, and generates a tree structure learning text data cluster 2002. A language model generating means 1004 generates a tree structure cluster language model 2003 using the learning text data belonging to each cluster of the tree structure learning text data cluster 2002.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a language model generation device whose output is referred to when performing speech recognition and a speech recognition device using the same, to a language model generation method and a speech recognition method using the same, and to a computer-readable recording medium on which a language model generation program is recorded and a computer-readable recording medium on which a speech recognition program is recorded.

[0002]

2. Description of the Related Art

In recent years, the practical application of continuous speech recognition, in which a speaker can input words continuously, has been actively studied. Continuous speech recognition decodes the acoustic observation sequence of the speech so that the decoded word sequence has the maximum posterior probability. This is expressed by the following equation (1).

(Equation 1)

Here, O is the acoustic observation sequence of the speech, [o_1, o_2, o_3, ..., o_T], and W is the word sequence [w_1, w_2, w_3, ..., w_n]. P(O|W) is the probability of the observation sequence O given the word sequence W and is computed by the acoustic model. P(W) is the occurrence probability (appearance probability) of the word sequence W and is computed by the language model.
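The image of equation (1) did not survive extraction. A standard form of maximum a posteriori decoding that is consistent with the definitions of O, W, P(O|W), and P(W) given above is sketched below; the notation \hat{W} for the decoded word sequence is an assumption, not taken from the source.

```latex
\hat{W} \;=\; \operatorname*{arg\,max}_{W} P(W \mid O)
        \;=\; \operatorname*{arg\,max}_{W} \frac{P(O \mid W)\,P(W)}{P(O)}
        \;=\; \operatorname*{arg\,max}_{W} P(O \mid W)\,P(W)
```

The last step holds because P(O) does not depend on W, which is why only the acoustic model score P(O|W) and the language model score P(W) are needed during decoding.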

[0003] Speech recognition is described in detail in "Speech Information Processing" by Sadaoki Furui, published by Morikita Publishing Co., Ltd. (hereinafter, Reference 1); "Speech Recognition by Stochastic Models" by Seiichi Nakagawa, published by the Institute of Electronics, Information and Communication Engineers (hereinafter, Reference 2); and "Fundamentals of Speech Recognition (Volumes 1 and 2)" by Lawrence Rabiner and Biing-Hwang Juang, translation supervised by Sadaoki Furui, published by NTT Advanced Technology Corporation (hereinafter, Reference 3).

[0004] For P(O|W), which is computed by the acoustic model, the hidden Markov model (HMM), a statistical technique, has recently been studied actively. Acoustic models using hidden Markov models are described in detail in, for example, Chapter 6 of Reference 3.

[0005] P(W), which is computed by the language model, is usually obtained by a statistical technique, a representative example being the N-gram model (N is 2 or more). These are described in detail in Chapter 3 of "Probabilistic Language Models" by Kenji Kita, published by the University of Tokyo Press (hereinafter, Reference 4). An N-gram model statistically gives the transition probability from the immediately preceding (N-1) words to the next word. The occurrence probability of a word sequence w_1^L = w_1 ... w_L under an N-gram model is given by the following equation (2).

(Equation 2)
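The image of equation (2) did not survive extraction. The following reconstruction is inferred from the surrounding text, which defines the factor P(w_t | w_{t+1-N}^{t-1}) and states that Π denotes a product; the index range of the product is an assumption, since the original image is not available.

```latex
P\!\left(w_1^{L}\right) \;=\; \prod_{t=1}^{L} P\!\left(w_t \,\middle|\, w_{t+1-N}^{\,t-1}\right)
```

For N = 2 each factor reduces to P(w_t | w_{t-1}), which is the bigram case worked through in equation (3); there the sentence boundary symbol # contributes one additional factor at the end of the sentence.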

[0006] In equation (2), the probability P(w_t | w_{t+1-N}^{t-1}) is the probability that the word w_t occurs after the word sequence w_{t+1-N}^{t-1} consisting of (N-1) words, and Π denotes a product. For example, when the occurrence probability of the word sequence "watashi · wa · eki · e · iku" ("I go to the station"; "·" marks a word boundary) is computed with a 2-gram (bigram) model, it is given by the following equation (3). In equation (3), # is a symbol denoting the beginning and the end of a sentence.

P(watashi · wa · eki · e · iku) = P(watashi | #) P(wa | watashi) P(eki | wa) P(e | eki) P(iku | e) P(# | iku)   (3)

[0007] The probability P(w_t | w_{t+1-N}^{t-1}) is obtained from the relative frequency of word sequences in the learning text data. Letting C(W) be the number of occurrences of the word sequence W in the learning text data, the 2-gram probability P(wa | watashi) of the word pair "watashi · wa" ("watashi" means "I", and "wa" is the topic particle), for example, is computed by the following equation (4). In equation (4), C(watashi · wa) is the number of occurrences of the word sequence "watashi · wa" and C(watashi) is the number of occurrences of "watashi".

P(wa | watashi) = C(watashi · wa) / C(watashi)   (4)
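The relative-frequency estimate of equation (4) and the bigram product of equation (3) can be sketched in a few lines of Python. This is a minimal illustration, not code from the patent; the function names, the romanized toy corpus, and the choice to count C(prev) as the number of times a word appears as a context are assumptions made for the example.

```python
from collections import Counter

BOUNDARY = "#"  # sentence start/end symbol, as in equation (3)


def count_ngrams(sentences):
    """Count contexts and bigrams, padding each sentence with '#' on both sides."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = [BOUNDARY] + list(words) + [BOUNDARY]
        for prev, cur in zip(padded, padded[1:]):
            unigrams[prev] += 1        # C(prev), counted as a bigram context
            bigrams[(prev, cur)] += 1  # C(prev . cur)
    return unigrams, bigrams


def bigram_prob(prev, cur, unigrams, bigrams):
    """Equation (4): P(cur | prev) = C(prev . cur) / C(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, cur)] / unigrams[prev]


def sentence_prob(words, unigrams, bigrams):
    """Equation (3): product of bigram probabilities, including both boundaries."""
    padded = [BOUNDARY] + list(words) + [BOUNDARY]
    p = 1.0
    for prev, cur in zip(padded, padded[1:]):
        p *= bigram_prob(prev, cur, unigrams, bigrams)
    return p


# Toy corpus (hypothetical): two romanized sentences.
corpus = [
    ["watashi", "wa", "eki", "e", "iku"],
    ["watashi", "wa", "gakkou", "e", "iku"],
]
uni, bi = count_ngrams(corpus)
```

With this toy corpus, a word pair that never occurs in the data (for example "eki" directly after "#") receives probability zero, which makes the whole sentence probability zero as well.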

[0008] However, if the probability values of an N-gram model are estimated simply from relative frequencies, there is a major drawback: any word combination that does not appear in the learning text data is assigned a probability of 0 (the zero-frequency problem). In addition, even for word sequences that do appear in the learning text data, it is difficult to estimate statistically reliable probability values when their frequency of appearance is low (the sparseness problem). To deal with these problems, a technique called smoothing is usually used. Several smoothing techniques are described in Section 3.3 of Reference 4, so a specific description is omitted here.

[0009] A language model is trained using, as learning text data, sentences from the field, scene, or situation targeted by speech recognition. In actual applications, however, the speech to be recognized often comes from a variety of fields, scenes, and situations. Because the occurrence probability of a word sequence differs from one field, scene, or situation to another, a language model generated by training on all of the learning text data together, ignoring those differences, has poor accuracy.

[0010] To improve the performance of a speech recognition device that targets such a variety of fields, scenes, and situations, methods have been studied that cluster the learning text data of the language model and create a language model for each resulting cluster. One example of the prior art is Japanese Unexamined Patent Publication No. 2000-75886, "Statistical Language Model Generation Device and Speech Recognition Device" (hereinafter, Reference 5). Here, clusters can be obtained, for example, by dividing the data by field, such as politics for cluster 1 and sports for cluster 2, or by defining a distance between sentences and clustering the sentences.

[0011] When the learning text data is divided into clusters, the amount of learning text data per cluster becomes smaller, so the zero-frequency and sparseness problems described above become more severe. To counter this, Reference 5 obtains an accurate language model by estimating LM_map^k with maximum a posteriori (MAP) estimation, using a language model LM_a estimated from all of the learning text data without division into clusters and cluster-specific language models LM_c^k (k is the cluster number) estimated from the learning text data divided into clusters.

[0012] FIG. 13 is a block diagram showing the configuration of the conventional language model generation device described in Reference 5. In the figure, 1001 denotes learning text data for the language model, 1002 denotes learning text data clustering means, 1003 denotes a learning text data cluster, 1003-1 to 1003-M denote the learning text data of clusters 1 to M, 1004 denotes language model generating means, 1005 denotes cluster-specific language models, and 1005-1 to 1005-M denote the language models of clusters 1 to M.

[0013] Next, the operation will be described. The learning text data 1001 is text data for training the language model, and consists of the words and sentences to be recognized by the speech recognition device, written out as text. The learning text data 1001 is input to the learning text data clustering means 1002.

[0014] The learning text data clustering means 1002 clusters the learning text data 1001. In Reference 5, the text is clustered sentence by sentence using a method similar to the k-means method. It differs from the ordinary k-means method in that (1) the cluster center vector is the language model generated from the sentences belonging to the cluster, and (2) the generation probability of a sentence is used as the distance measure. An N-gram model is used as the language model.
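The k-means-like procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of Reference 5: each cluster "center" is a unigram model with add-one smoothing (an N-gram model with N of 2 or more would follow the same pattern), the initial assignment is supplied by the caller, and all names are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab, delta=1.0):
    """Cluster 'center': a smoothed unigram model built from the cluster's sentences."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values()) + delta * len(vocab)
    return {w: (counts[w] + delta) / total for w in vocab}


def log_prob(sentence, model):
    """Log generation probability of a sentence; higher means 'closer' to the cluster."""
    return sum(math.log(model[w]) for w in sentence)


def fit_models(sentences, assignment, num_clusters, vocab):
    """Train one model per cluster from the sentences currently assigned to it."""
    return [
        train_unigram([s for s, c in zip(sentences, assignment) if c == k], vocab)
        for k in range(num_clusters)
    ]


def cluster_sentences(sentences, initial_assignment, num_clusters, max_iterations=10):
    """Alternate between re-estimating cluster models and re-assigning sentences."""
    vocab = sorted({w for s in sentences for w in s})
    assignment = list(initial_assignment)
    for _ in range(max_iterations):
        models = fit_models(sentences, assignment, num_clusters, vocab)
        new_assignment = [
            max(range(num_clusters), key=lambda k: log_prob(s, models[k]))
            for s in sentences
        ]
        if new_assignment == assignment:
            break  # converged: no sentence changed cluster
        assignment = new_assignment
    return assignment, fit_models(sentences, assignment, num_clusters, vocab)


# Toy data (hypothetical): two politics-like and two sports-like sentences.
toy_sentences = [
    ["vote", "election", "party"],
    ["election", "vote", "minister"],
    ["goal", "match", "team"],
    ["team", "goal", "score"],
]
toy_assignment, toy_models = cluster_sentences(toy_sentences, [0, 1, 1, 1], num_clusters=2)
```

Starting from a deliberately poor initial assignment, the second sentence moves to the cluster whose model gives it the higher generation probability, and the loop stops once no sentence changes cluster.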

[0015] The learning text data cluster 1003 consists of the learning text data 1003-1 of cluster 1 through the learning text data 1003-M of cluster M, obtained by clustering the data into M clusters with the learning text data clustering means 1002.

[0016] The language model generating means 1004 receives the learning text data 1003-1 of cluster 1 through the learning text data 1003-M of cluster M obtained by the learning text data clustering means 1002, and generates the cluster-specific language models 1005, which consist of the language model 1005-1 of cluster 1 through the language model 1005-M of cluster M. To prevent the estimation accuracy of the language models from dropping because each cluster has less learning text data, the language model generating means 1004 estimates a cluster-specific language model LM_map^k by maximum a posteriori estimation, using the language model LM_a estimated from all of the learning text data without division into clusters and the cluster-specific language model LM_c^k estimated from the learning text data divided into clusters.

[0017] Next, a conventional speech recognition device using the above language model generation device will be described. FIG. 14 is a block diagram showing the configuration of the conventional speech recognition device disclosed in Reference 5. In the figure, 1101 denotes the recognition target speech, 1102 denotes speech feature extraction means, 1103 denotes an acoustic model, 1104 denotes language model selecting means, 1105 denotes matching means, and 1106 denotes the speech recognition result. The cluster-specific language models 1005 are the same functional block as in FIG. 13; they are given the same reference numeral and their description is omitted.

[0018] Next, the operation will be described. The recognition target speech 1101 is the speech to be recognized and is input to the speech feature extraction means 1102. The speech feature extraction means 1102 extracts the speech features contained in the recognition target speech 1101. The acoustic model 1103 is a model for acoustically matching speech. As the acoustic model 1103, for example, HMMs are used whose recognition units are phonemes that take the preceding and following phoneme context into account, trained on sentences and words spoken by a large number of speakers.

[0019] The language model selecting means 1104 selects the language model to be used by the matching means 1105 from the cluster-specific language models 1005, which consist of the language model 1005-1 of cluster 1 through the language model 1005-M of cluster M generated by the language model generation device. In Reference 5, matching is first performed using the unspecific language model obtained before division into clusters, and then, for the word sequence of the resulting recognition candidate, the one cluster-specific language model that gives the highest occurrence probability is selected from the language model 1005-1 of cluster 1 through the language model 1005-M of cluster M.
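The selection step described above amounts to scoring the first-pass candidate word sequence with every cluster-specific model and taking the best one. The sketch below illustrates this with bigram models stored as dictionaries; the data layout, the probability floor for unseen pairs, and the function names are assumptions made for the example.

```python
import math


def sequence_log_prob(words, bigram_model, floor=1e-6):
    """Log occurrence probability of a word sequence under a bigram model.

    `bigram_model` maps (previous_word, word) to a probability; pairs missing
    from the model receive a small floor value (an assumption of this sketch).
    """
    padded = ["#"] + list(words) + ["#"]
    return sum(
        math.log(bigram_model.get((prev, cur), floor))
        for prev, cur in zip(padded, padded[1:])
    )


def select_language_model(candidate_words, cluster_models):
    """Return the index of the cluster model giving the candidate the highest probability."""
    return max(
        range(len(cluster_models)),
        key=lambda k: sequence_log_prob(candidate_words, cluster_models[k]),
    )


# Hypothetical cluster models: one travel-like, one sports-like.
travel_model = {("#", "eki"): 0.5, ("eki", "e"): 0.8, ("e", "iku"): 0.9, ("iku", "#"): 1.0}
sports_model = {("#", "shiai"): 0.5, ("shiai", "e"): 0.8, ("e", "iku"): 0.9, ("iku", "#"): 1.0}

best = select_language_model(["eki", "e", "iku"], [travel_model, sports_model])
```

Log probabilities are used instead of raw products so that long candidate sequences do not underflow.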

[0020] The matching means 1105 converts the pronunciation notation of the recognition target words [W(1), W(2), ..., W(wn)] (wn is the number of recognition target words) defined by the language model selected by the language model selecting means 1104 into recognition unit label notation, concatenates the phoneme-unit HMMs stored in the acoustic model 1103 according to these labels, and creates the standard patterns of the recognition target words [λ_W(1), λ_W(2), ..., λ_W(wn)].

[0021] The matching means 1105 then matches the speech features output by the speech feature extraction means 1102 against the standard patterns of the recognition target words, using the occurrence probabilities of word sequences given by the selected language model, and outputs the speech recognition result 1106. To produce the speech recognition result 1106, the word number sequence Rn = [r(1), r(2), ..., r(m)] of the recognition target words with the highest matching score for the recognition target speech 1101 is computed, and the words corresponding to those word numbers, Rw = [W(r(1)), W(r(2)), ..., W(r(m))], are output. Here, r(i) is the word number of the i-th word in the word sequence of the speech recognition result 1106, and m is the number of words in the recognized word sequence.

[0022]

Problems to Be Solved by the Invention

Because the conventional language model generation device is configured as described above, when the number of clusters produced by clustering becomes large, the amount of learning text data per cluster becomes small and the estimation accuracy of the language models deteriorates, so there has been a problem that speech recognition accuracy does not improve.

[0023] In addition, when the number of clusters becomes large, there has been a problem that the recognition rate does not improve in cases where a single utterance has the linguistic properties of several clusters.

[0024] The present invention has been made to solve the above problems, and an object thereof is to obtain a language model generation device, a language model generation method, and a computer-readable recording medium on which a language model generation program is recorded, each capable of creating a language model with high estimation accuracy.

[0025] Another object of the present invention is to obtain a speech recognition device, a speech recognition method, and a computer-readable recording medium on which a speech recognition program is recorded, each achieving high speech recognition accuracy by using a language model with high estimation accuracy.

[0026]

Means for Solving the Problems

A language model generation device according to the present invention receives learning text data and generates a language model for obtaining the occurrence probability of a word sequence, and comprises: learning text data tree structure clustering means for performing tree structure clustering, which hierarchically divides the learning text data so that each division has linguistically similar properties, and generating a tree structure learning text data cluster; and language model generating means for generating tree structure cluster language models using the learning text data belonging to each cluster of the tree structure learning text data cluster.
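The hierarchical division described above can be pictured as a tree whose root holds all of the learning text data and whose children hold successively finer subsets, with one language model trained per node. The sketch below shows only the tree-building skeleton. The splitting criterion is passed in as `split_fn` because this passage does not specify how the division is made; the stopping rules, names, and toy split are likewise assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

Sentence = Sequence[str]
SplitFn = Callable[[List[Sentence]], Tuple[List[Sentence], List[Sentence]]]


@dataclass
class TreeCluster:
    """One node of the tree: the sentences it covers and its child clusters."""
    sentences: List[Sentence]
    children: List["TreeCluster"] = field(default_factory=list)


def build_tree(sentences, split_fn: SplitFn, min_size=1, max_depth=3, depth=0):
    """Recursively divide the data; every node keeps its own sentences for training."""
    node = TreeCluster(list(sentences))
    if depth >= max_depth or len(node.sentences) < 2 * min_size:
        return node
    left, right = split_fn(node.sentences)
    if len(left) < min_size or len(right) < min_size:
        return node  # refuse a split that would leave a child with too little data
    node.children = [
        build_tree(left, split_fn, min_size, max_depth, depth + 1),
        build_tree(right, split_fn, min_size, max_depth, depth + 1),
    ]
    return node


def leaves(node):
    """Collect the lowest-level (leaf) clusters of the tree."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def split_on_goal(sentences):
    """Toy split (hypothetical): sentences mentioning 'goal' versus the rest."""
    with_goal = [s for s in sentences if "goal" in s]
    without_goal = [s for s in sentences if "goal" not in s]
    return with_goal, without_goal


data = [
    ["vote", "election", "party"],
    ["election", "vote", "minister"],
    ["goal", "match", "team"],
    ["team", "goal", "score"],
]
tree = build_tree(data, split_on_goal)
```

In a fuller implementation, `split_fn` could be a two-cluster run of a likelihood-based clustering step, and a language model would be trained from `node.sentences` at every node, not only at the leaves.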

[0027] The language model generation device according to the present invention further comprises language model interpolation means for performing interpolation using the tree structure cluster language model located at a higher level of the tree structure than the tree structure cluster language model in question, and generating interpolated tree structure cluster language models.

[0028] A speech recognition device according to the present invention receives recognition target speech, performs speech recognition, and outputs a speech recognition result, and comprises: speech feature extraction means for receiving the recognition target speech and extracting speech features; an acoustic model for obtaining the probability of an acoustic observation sequence of speech; tree structure cluster language models generated by performing tree structure clustering, which hierarchically divides learning text data so that each division has linguistically similar properties, and using the learning text data of each tree structure cluster; language model selecting means for selecting, from the tree structure cluster language models, the language model that gives the highest occurrence probability for the word sequence of a speech recognition result candidate; and matching means for matching the speech features extracted by the speech feature extraction means using the language model selected by the language model selecting means and the acoustic model, and outputting the speech recognition result.

[0029] In the speech recognition device according to the present invention, the language model selecting means selects the language model from the cluster language models of the lowest-level leaf nodes of the tree structure cluster language models.

[0030] A speech recognition device according to the present invention receives recognition target speech, performs speech recognition, and outputs a speech recognition result, and comprises: speech feature extraction means for receiving the recognition target speech and extracting speech features; an acoustic model for obtaining the probability of an acoustic observation sequence of speech; tree structure cluster language models generated by performing tree structure clustering, which hierarchically divides learning text data so that each division has linguistically similar properties, and using the learning text data of each tree structure cluster; multiple language model selecting means for selecting, from the tree structure cluster language models, a plurality of language models that give high occurrence probabilities for the word sequence of a speech recognition result candidate; mixed language model generating means for receiving the plurality of language models selected by the multiple language model selecting means and generating a mixed language model; and matching means for matching the speech features extracted by the speech feature extraction means using the language model generated by the mixed language model generating means and the acoustic model, and outputting the speech recognition result.

[0031] In the speech recognition device according to the present invention, the multiple language model selecting means selects the plurality of language models from the cluster language models of the lowest-level leaf nodes of the tree structure cluster language models.

[0032] In the speech recognition device according to the present invention, the tree structure cluster language models are interpolated tree structure cluster language models, obtained by interpolation using the tree structure cluster language model located at a higher level of the tree structure.

[0033] A language model generation method according to the present invention receives learning text data and generates a language model for obtaining the occurrence probability of a word sequence, and comprises: a first step of performing tree structure clustering, which hierarchically divides the learning text data so that each division has linguistically similar properties, and generating a tree structure learning text data cluster; and a second step of generating tree structure cluster language models using the learning text data belonging to each cluster of the tree structure learning text data cluster.

[0034] The language model generation method according to the present invention further comprises a third step of performing interpolation using the tree structure cluster language model located at a higher level of the tree structure than the tree structure cluster language model in question, and generating interpolated tree structure cluster language models.

[0035] A speech recognition method according to the present invention receives recognition target speech, performs speech recognition, and outputs a speech recognition result, and comprises: a first step of receiving the recognition target speech and extracting speech features; a second step of selecting, from tree structure cluster language models generated by performing tree structure clustering, which hierarchically divides learning text data so that each division has linguistically similar properties, and using the learning text data of each tree structure cluster, the language model that gives the highest occurrence probability for the word sequence of a speech recognition result candidate; and a third step of matching the speech features extracted in the first step using an acoustic model for obtaining the probability of an acoustic observation sequence of speech and the language model selected in the second step, and outputting the speech recognition result.

[0036] In the speech recognition method according to the present invention, in the second step, the language model is selected from the cluster language models of the lowest-level leaf nodes of the tree structure cluster language models.

[0037] A speech recognition method according to the present invention receives recognition target speech, performs speech recognition, and outputs a speech recognition result, and comprises: a first step of receiving the recognition target speech and extracting speech features; a second step of selecting, from tree structure cluster language models generated by performing tree structure clustering, which hierarchically divides learning text data so that each division has linguistically similar properties, and using the learning text data of each tree structure cluster, a plurality of language models that give high occurrence probabilities for the word sequence of a speech recognition result candidate; a third step of receiving the plurality of language models selected in the second step and generating a mixed language model; and a fourth step of matching the speech features extracted in the first step using an acoustic model for obtaining the probability of an acoustic observation sequence of speech and the language model generated in the third step, and outputting the speech recognition result.

[0038] In the speech recognition method according to the present invention, in the second step, the plurality of language models are selected from the cluster language models of the lowest-level leaf nodes of the tree structure cluster language models.

[0039] A computer-readable recording medium on which a language model generation program according to the present invention is recorded is for receiving learning text data and generating a language model for obtaining the occurrence probability of a word sequence, and causes a computer to realize: a learning text data tree structure clustering procedure of performing tree structure clustering, which hierarchically divides the learning text data so that each division has linguistically similar properties, and generating a tree structure learning text data cluster; and a language model generation procedure of generating tree structure cluster language models using the learning text data belonging to each cluster of the tree structure learning text data cluster.

[0040] The computer-readable recording medium on which the language model generation program according to the present invention is recorded further causes the computer to realize a language model interpolation procedure of performing interpolation using the tree structure cluster language model located at a higher level of the tree structure than the tree structure cluster language model in question, and generating interpolated tree structure cluster language models.

[0041] A computer-readable recording medium on which a speech recognition program according to the present invention is recorded is for receiving recognition target speech, performing speech recognition, and outputting a speech recognition result, and causes a computer to realize: a speech feature extraction procedure of receiving the recognition target speech and extracting speech features; a language model selection procedure of selecting, from tree structure cluster language models generated by performing tree structure clustering, which hierarchically divides learning text data so that each division has linguistically similar properties, and using the learning text data of each tree structure cluster, the language model that gives the highest occurrence probability for the word sequence of a speech recognition result candidate; and a matching procedure of matching the speech features extracted by the speech feature extraction procedure using an acoustic model for obtaining the probability of an acoustic observation sequence of speech and the language model selected by the language model selection procedure, and outputting the speech recognition result.

【００４２】In the computer-readable recording medium recording the speech recognition program according to the present invention, the language model selection procedure selects the language model from the cluster-specific language models of the lowest-level leaf nodes among the tree-structure cluster-specific language models.

【００４３】A computer-readable recording medium according to the present invention records a speech recognition program that takes speech to be recognized as input, performs speech recognition, and outputs a speech recognition result. The program causes a computer to execute: a speech feature extraction procedure that takes the speech to be recognized as input and extracts speech features; a multiple language model selection procedure that selects a plurality of language models with high occurrence probabilities for the word string of a speech recognition result candidate from tree-structure cluster-specific language models, these models having been generated from the learning text data of each tree-structure cluster obtained by tree-structure clustering that hierarchically divides learning text data so that each division has linguistically similar properties; a mixed language model generation procedure that takes the plurality of language models selected by the multiple language model selection procedure as input and generates a mixed language model; and a matching procedure that matches the speech features extracted by the speech feature extraction procedure using an acoustic model, which gives the probability of an acoustic observation sequence of speech, together with the language model generated by the mixed language model generation procedure, and outputs the speech recognition result.

【００４４】In the computer-readable recording medium recording the speech recognition program according to the present invention, the multiple language model selection procedure selects the plurality of language models from the cluster-specific language models of the lowest-level leaf nodes among the tree-structure cluster-specific language models.

【００４５】[0045]

DESCRIPTION OF THE PREFERRED EMBODIMENTS An embodiment of the present invention is described below. Embodiment 1. FIG. 1 is a block diagram showing the configuration of a language model generation device according to Embodiment 1 of the present invention. In the figure, 2001 denotes learning text data tree-structure clustering means, 2002 denotes the tree-structure learning text data clusters, 2002-1 to 2002-M denote the learning text data of tree-structure clusters 1 to M, 2003 denotes the tree-structure cluster-specific language models, and 2003-1 to 2003-M denote the language models of tree-structure clusters 1 to M. Functional blocks identical to those in FIG. 13, which shows the configuration of a conventional language model generation device, are given the same reference numerals and their description is omitted.

【００４６】The learning text data 1001 for the language model is a written record of the words and sentences used in the field, scene, or situation targeted by speech recognition. For example, when the recognition target is political news read by an announcer, the learning text data consists of articles from the politics section of newspapers and text transcribed from the spoken content of political broadcast news.

【００４７】Next, the operation is described. FIG. 2 is a flowchart showing the language model generation method in the language model generation device according to Embodiment 1 of the present invention. The learning text data tree-structure clustering means 2001 receives the learning text data 1001 in step ST101, sets the clustering level I to 0 in step ST102, and in step ST103 first clusters the whole of the learning text data 1001. This clustering of the whole of the learning text data 1001 is the level-0 clustering.

【００４８】Here, clustering means dividing the learning text data into two or more sets, either by manually sorting it into two or more fields or by using a method similar to the k-means algorithm described in Reference 5. The learning text data belonging to a cluster obtained by clustering has linguistically similar properties.

【００４９】FIG. 3 illustrates the learning text data tree-structure clustering performed by the learning text data tree-structure clustering means 2001 and shows sentences being clustered hierarchically, with the sentence as the unit. In FIG. 3, clustering tree-structure cluster 00 at level 0 divides the whole of the learning text data 1001 into two clusters. The resulting sets of learning text data are tree-structure cluster 10 and tree-structure cluster 11 at level 1.

【００５０】In step ST104 of FIG. 2, the learning text data tree-structure clustering means 2001 increments the level I. In step ST105, it clusters the learning text data belonging to each cluster produced at level I-1 (level 0 at this point). In FIG. 3, the level-1 clustering generates tree-structure clusters 20 and 21 at level 2 from tree-structure cluster 10 at level 1, and tree-structure clusters 22 and 23 at level 2 from tree-structure cluster 11.

【００５１】In step ST106, the means checks whether the number of clusters has reached a predetermined number M. If it has not, processing returns to step ST104, the level I is incremented, and the clustering of step ST105 is repeated. This processing is repeated until the number of clusters reaches the predetermined number M, generating the learning text data 2002-1 of tree-structure cluster 1 through the learning text data 2002-M of tree-structure cluster M.
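The loop of steps ST101 to ST106 can be sketched in Python as follows. This is only an illustration under stated assumptions: `split_in_two` is a hypothetical stand-in for a single clustering step (a crude word-overlap rule, not the method of Reference 5), every node of the tree is counted toward M, and the cluster identifiers follow the level-plus-index naming of FIG. 3 (00, 10, 11, 20, ...).

```python
from collections import Counter


def split_in_two(sentences):
    """Hypothetical single clustering step: split sentences into two groups.

    Each sentence joins whichever of two seed sentences (the first and the
    last) it shares more words with. Returns one group if no split is possible.
    """
    if len(sentences) < 2:
        return [sentences]
    seeds = [Counter(sentences[0].split()), Counter(sentences[-1].split())]
    groups = [[], []]
    for sentence in sentences:
        words = Counter(sentence.split())
        overlap = [sum((words & seed).values()) for seed in seeds]
        groups[0 if overlap[0] >= overlap[1] else 1].append(sentence)
    return [group for group in groups if group]


def build_cluster_tree(sentences, max_clusters):
    """Cluster level by level until at least max_clusters nodes exist (ST102-ST106)."""
    clusters = {"00": list(sentences)}  # level 0 holds all learning text data
    parent = {}                         # child cluster id -> parent cluster id
    current = ["00"]                    # cluster ids produced at the previous level
    level = 0
    while len(clusters) < max_clusters:
        level += 1                      # ST104: increment the level
        next_level = []
        for pid in current:             # ST105: cluster each cluster of level I-1
            parts = split_in_two(clusters[pid])
            if len(parts) < 2:
                continue                # this cluster cannot be divided further
            for part in parts:
                cid = f"{level}{len(next_level)}"
                clusters[cid] = part
                parent[cid] = pid
                next_level.append(cid)
        if not next_level:
            break                       # nothing left to split
        current = next_level
    return clusters, parent
```

Because the cluster count is checked once per level, as in step ST106, the sketch can overshoot M when a level adds several clusters at once.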

【００５２】After tree-structure clustering of the learning text data has proceeded to the predetermined number of clusters, in step ST107 the language model generation means 1004 generates a language model for each tree-structure cluster using the learning text data belonging to that cluster. This produces the tree-structure cluster-specific language models 2003, which consist of the language model 2003-1 of tree-structure cluster 1 through the language model 2003-M of tree-structure cluster M.

【００５３】In step ST106 above, hierarchical clustering of the learning text data is repeated until the number of clusters reaches the predetermined number M. Here the number of clusters is the termination criterion, but the number of levels may be used as the criterion instead, or clustering may be terminated when the amount of learning text data in a cluster falls to or below a certain value. Among the clusters obtained by hierarchical clustering, the lower the level, the more clearly the learning text data belonging to a cluster reflects differences in field, scene, and situation.

【００５４】FIG. 4 illustrates language model generation for each tree-structure cluster. In FIG. 4, the nodes of the tree represent the tree-structure clusters of the learning text data, and a language model is generated for each tree-structure cluster using the learning text data belonging to it. The tree-structure cluster of a parent node contains all the learning text data belonging to the tree-structure clusters of its child nodes. In FIG. 4, for example, the language model generated from the learning text data belonging to tree-structure cluster 00 is the language model LM00 of tree-structure cluster 00, and the language model generated from the learning text data belonging to tree-structure cluster 10 is the language model LM10 of tree-structure cluster 10.

【００５５】The lower a tree-structure cluster lies in the tree, the better its language model expresses the differences in language properties that arise from differences in field, scene, and situation. The language models of upper-level tree-structure clusters do not capture these differences in detail, but because they contain the linguistic features of several fields, scenes, and situations, they are effective when an utterance spans several fields, scenes, or situations. Furthermore, because the upper-level tree-structure clusters have more learning text data, their parameters are estimated more accurately than when the data is divided all at once into the same number of clusters as there are tree-structure clusters.

【００５６】Specific forms of the generated language model include the N-gram model, the hidden Markov model, and the stochastic context-free grammar described in Chapters 3 to 5 of Reference 4.
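As a minimal sketch of step ST107 with an N-gram model, the following Python code estimates an unsmoothed maximum-likelihood bigram model (N = 2) for every tree-structure cluster. The function names, the `<s>` and `</s>` sentence-boundary tokens, and the dictionary layout are assumptions made for illustration; the patent does not prescribe an estimation procedure beyond citing Reference 4.

```python
from collections import Counter


def train_bigram(sentences):
    """Maximum-likelihood bigram model: P(word | prev) = count(prev, word) / count(prev)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return {
        pair: count / context_counts[pair[0]]
        for pair, count in bigram_counts.items()
    }


def train_cluster_models(clusters):
    """Generate one language model per tree-structure cluster (step ST107)."""
    return {cid: train_bigram(sentences) for cid, sentences in clusters.items()}
```

With `clusters = {"00": ["a b", "a c"], "10": ["a b"], "11": ["a c"]}`, the parent model LM00 spreads probability over both continuations of "a", while the leaf model LM10 has no entry at all for ("a", "c"). This is the zero-frequency problem that the parent-node models help to mitigate.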

【００５７】The language model generation method of Embodiment 1 can also be recorded on a recording medium as a language model generation program. In this case, the program recorded on the medium consists of a learning text data tree-structure clustering procedure that performs the same processing as the learning text data tree-structure clustering means 2001 and a language model generation procedure that performs the same processing as the language model generation means 1004.

【００５８】As described above, the language model generation device and language model generation method of Embodiment 1 hierarchically cluster the learning text data into a tree structure and generate tree-structure cluster-specific language models using the learning text data belonging to each tree-structure cluster. This mitigates the zero-frequency and sparseness problems that arise when a language model is trained on a small amount of learning text data, so language models that yield a high recognition rate can be generated. In addition, even when a single utterance to be recognized spans several fields, scenes, or situations, a language model that has learned the linguistic features of those fields, scenes, and situations is available, so language models that yield a high recognition rate can be generated.

【００５９】Embodiment 2. FIG. 5 is a block diagram showing the configuration of a language model generation device according to Embodiment 2 of the present invention. In the figure, 3001 denotes language model interpolation means, 3002 denotes the interpolated tree-structure cluster-specific language models, and 3002-1 to 3002-M denote the interpolated language models of tree-structure clusters 1 to M. Functional blocks identical to those in FIG. 1 of Embodiment 1 are given the same reference numerals and their description is omitted.

【００６０】Next, the operation is described. FIG. 6 is a flowchart showing the language model generation method in the language model generation device according to Embodiment 2 of the present invention. Steps ST201 to ST207 are identical to steps ST101 to ST107 of FIG. 2 in Embodiment 1.

【００６１】In step ST208, the language model interpolation means 3001 receives the tree-structure cluster-specific language models generated by the language model generation means 1004, namely the language model 2003-1 of tree-structure cluster 1 through the language model 2003-M of tree-structure cluster M, and generates the interpolated language model 3002-1 of tree-structure cluster 1 through the interpolated language model 3002-M of tree-structure cluster M. The interpolation uses the language models of the tree-structure clusters at the parent nodes of the node where the cluster language model being interpolated is located.

【００６２】In the example of FIG. 4, the language model LM20 of tree-structure cluster 20 is interpolated using the language model LM10 of its parent node, tree-structure cluster 10, and the language model LM00 of tree-structure cluster 00, the parent node one level further up. If the language model is an N-gram model, for example, its parameters are the probabilities that $w_n$ occurs following the word string $w_{n+1-N}^{n-1}$, and in this interpolation they are obtained by the following equation (5).

【数３】 (Equation 3)
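The image of equation (5) is not reproduced in this text. Judging from the symbol definitions given in paragraph [0063], it is presumably the usual linear interpolation over the cluster and its parent nodes; the form below is a reconstruction on that assumption, and the normalization of the weights is likewise an assumption (it is standard for deleted interpolation but is not stated in the text).

```latex
P'_{s}\left(w_n \mid w_{n+1-N}^{n-1}\right)
  = \sum_{i \in \Omega} \alpha_i \, P_i\left(w_n \mid w_{n+1-N}^{n-1}\right),
\qquad \sum_{i \in \Omega} \alpha_i = 1 \quad \text{(assumed)} \tag{5}
```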

【００６３】In equation (5), $P'_s(w_n \mid w_{n+1-N}^{n-1})$ is the probability that $w_n$ occurs following the word string $w_{n+1-N}^{n-1}$ in the interpolated language model of tree-structure cluster $S$; $\Omega$ is the set of cluster numbers of tree-structure cluster $S$ and its parent nodes; $P_i(w_n \mid w_{n+1-N}^{n-1})$ is the probability that $w_n$ occurs following the word string $w_{n+1-N}^{n-1}$ in the language model of tree-structure cluster $i$; and $\alpha_i$ is a weighting coefficient. The coefficients $\alpha_i$ can be estimated, for example, by the deleted interpolation method described in Chapter 3 of Reference 4.

【００６４】In this description, $P_i(w_n \mid w_{n+1-N}^{n-1})$ is the occurrence probability before interpolation, but interpolation may instead proceed from the top of the tree downward and use the already interpolated occurrence probability $P'_i(w_n \mid w_{n+1-N}^{n-1})$. In the tree-structure clusters, the lower-level clusters have little learning text data, so the zero-frequency and sparseness problems tend to arise when their language models are generated. Because the parameters, that is, the probabilities that $w_n$ occurs following the word string $w_{n+1-N}^{n-1}$, are interpolated using the language models of parent-node clusters that have more learning text data, the estimation accuracy of the language model improves.
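The interpolation of step ST208 can be sketched in Python as below, assuming that the interpolation is linear, that each bigram model is stored as a dictionary from `(prev, word)` pairs to probabilities, and that the weights α are supplied by the caller and sum to 1. The function name `interpolate` and this data layout are illustrative assumptions; estimating the weights by deleted interpolation is outside the sketch.

```python
def interpolate(models, chain, weights):
    """Linearly interpolate a cluster's model with the models of its parent nodes.

    models  -- {cluster_id: {(prev, word): probability}}
    chain   -- cluster ids from the target cluster up to the root (the set Omega)
    weights -- {cluster_id: alpha}; assumed non-negative and summing to 1
    """
    events = set()
    for cid in chain:
        events.update(models[cid])      # every bigram seen in any model of the chain
    return {
        event: sum(weights[cid] * models[cid].get(event, 0.0) for cid in chain)
        for event in events
    }
```

A bigram that the leaf cluster never observed still receives probability mass from the parent models, which is how the interpolation addresses the zero-frequency problem. For the top-down variant mentioned in paragraph [0064], the same function could be applied level by level, passing the already interpolated parent model in place of the raw one.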

【００６５】The language model generation method of Embodiment 2 can also be recorded on a recording medium as a language model generation program. In this case, the program recorded on the medium consists of a learning text data tree-structure clustering procedure that performs the same processing as the learning text data tree-structure clustering means 2001, a language model generation procedure that performs the same processing as the language model generation means 1004, and a language model interpolation procedure that performs the same processing as the language model interpolation means 3001.

【００６６】As described above, the language model generation device and language model generation method of Embodiment 2 hierarchically cluster the learning text data into a tree structure, generate tree-structure cluster-specific language models using the learning text data belonging to each tree-structure cluster, and interpolate each generated cluster language model using the cluster language models of its parent nodes in the tree. This mitigates the zero-frequency and sparseness problems that arise when a language model is trained on a small amount of learning text data, so language models that yield an even higher recognition rate can be generated.

【００６７】In addition, even when a single utterance to be recognized spans several fields, scenes, or situations, a language model that has learned the linguistic features of those fields, scenes, and situations is available, so language models that yield a high recognition rate can be generated.

【００６８】Embodiment 3. FIG. 7 is a block diagram showing the configuration of a speech recognition device according to Embodiment 3 of the present invention. In the figure, functional blocks identical to those in FIG. 1 of Embodiment 1 and in FIG. 14, which shows a conventional speech recognition device, are given the same reference numerals and their description is omitted.

【００６９】Next, the operation is described. FIG. 8 is a flowchart showing the speech recognition method in the speech recognition device according to Embodiment 3 of the present invention. The speech feature extraction means 1102 receives the speech to be recognized 1101 in step ST301 and extracts speech features in step ST302. Speech features represent the characteristics of speech with a small amount of information; an example is a feature vector composed of the cepstrum and its dynamic features, as described in Chapter 5 of Reference 1.

【００７０】In step ST303, the language model selection means 1104 selects the one language model to be used by the matching means 1105 from the tree-structure cluster-specific language models 2003, that is, from the language model 2003-1 of tree-structure cluster 1 through the language model 2003-M of tree-structure cluster M. The selection uses, for example, the method described in Reference 5 and picks the language model of the tree-structure cluster with the highest occurrence probability.
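The selection in step ST303 can be illustrated with the Python sketch below, which scores a candidate word string under every cluster's bigram model and returns the cluster with the highest occurrence probability. It is not the method of Reference 5; the function names, the dictionary-based bigram layout, and the `floor` probability used for unseen bigrams are all assumptions made for illustration.

```python
import math


def sentence_log_prob(model, words, floor=1e-6):
    """Log probability of a word string under a bigram model.

    `floor` is an assumed stand-in for proper smoothing of unseen bigrams.
    """
    padded = ["<s>"] + list(words) + ["</s>"]
    return sum(math.log(model.get(pair, floor)) for pair in zip(padded, padded[1:]))


def select_model(models, candidate_words):
    """Return the id of the cluster model giving the candidate the highest probability."""
    return max(models, key=lambda cid: sentence_log_prob(models[cid], candidate_words))
```

Log probabilities are used so that long word strings do not underflow; because the logarithm is monotonic, the selected model is the same as with raw probabilities.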

【００７１】In step ST304, the matching means 1105 receives the tree-structure cluster language model selected by the language model selection means 1104 and the acoustic model 1103, matches them against the speech features of the speech to be recognized 1101, and outputs the word string with the highest likelihood (matching score) as the speech recognition result 1106.

【００７２】The matching in this case is described concretely. The matching means 1105 converts the phonetic transcriptions of the recognition target words $[W(1), W(2), \ldots, W(wn)]$ (where $wn$ is the number of words to be recognized), as defined by the tree-structure cluster language model selected by the language model selection means 1104, into recognition unit labels. Following these labels, it concatenates the HMMs of the phoneme units stored in the acoustic model 1103 to create the reference patterns of the recognition target words, $[\lambda_{W(1)}, \lambda_{W(2)}, \ldots, \lambda_{W(wn)}]$.

【００７３】The matching means 1105 then matches the speech features output by the speech feature extraction means 1102 using the reference patterns of the recognition target words and the word string occurrence probabilities given by the selected tree-structure cluster language model, and outputs the speech recognition result 1106. To produce the speech recognition result 1106, it computes the word number sequence $Rn = [r(1), r(2), \ldots, r(m)]$ of the recognition target words with the highest likelihood for the speech to be recognized and outputs the words corresponding to those word numbers, $Rw = [W(r(1)), W(r(2)), \ldots, W(r(m))]$. Here $r(i)$ is the word number of the $i$-th word in the recognized word sequence, and $m$ is the number of words in the recognized word sequence.

【００７４】The description above takes the language models generated in Embodiment 1, namely the language model 2003-1 of tree-structure cluster 1 through the language model 2003-M of tree-structure cluster M, as the tree-structure cluster-specific language models to select from. The interpolated language models generated in Embodiment 2, namely the interpolated language model 3002-1 of tree-structure cluster 1 through the interpolated language model 3002-M of tree-structure cluster M, may be used instead.

【００７５】The speech recognition method of Embodiment 3 can also be recorded on a recording medium as a speech recognition program. In this case, the program recorded on the medium includes, in addition to the language model generation program of Embodiment 1, a speech feature extraction procedure that performs the same processing as the speech feature extraction means 1102, a language model selection procedure that performs the same processing as the language model selection means 1104, and a matching procedure that performs the same processing as the matching means 1105.

【００７６】As described above, the speech recognition device and speech recognition method of Embodiment 3 hierarchically cluster the learning text data 1001 into a tree structure and generate the tree-structure cluster-specific language models 2003-1 to 2003-M using the learning text data 2002-1 to 2002-M belonging to each tree-structure cluster. This mitigates the zero-frequency and sparseness problems that arise when a language model is trained on a small amount of learning text data. Because speech recognition is performed with a language model selected from the tree-structure cluster-specific language models 2003, highly accurate speech recognition is achieved.

【００７７】In addition, even when the speech to be recognized spans several fields, scenes, or situations, speech recognition is performed with a selected tree-structure cluster language model that has learned the linguistic features of those fields, scenes, and situations, so speech recognition with high recognition performance is achieved.

【００７８】Embodiment 4. FIG. 9 is a block diagram showing the configuration of a speech recognition device according to Embodiment 4 of the present invention. In the figure, 5001 denotes multiple language model selection means and 5002 denotes mixed language model generation means. Functional blocks identical to those in FIG. 7 of Embodiment 3 are given the same reference numerals and their description is omitted.

【００７９】Next, the operation is described. FIG. 10 is a flowchart showing the speech recognition method in the speech recognition device according to Embodiment 4 of the present invention. Steps ST401 and ST402 are identical to steps ST301 and ST302 of FIG. 8 in Embodiment 3.

【００８０】In step ST403, the multiple language model selection means 5001 selects the language models of two or more (at most K) tree-structure clusters from the language model 2003-1 of tree-structure cluster 1 through the language model 2003-M of tree-structure cluster M. The selection uses, for example, an extension of the method described in Reference 5 that picks K language models in descending order of occurrence probability.

【００８１】In step ST404, the mixed language model generation means 5002 receives the plurality of tree-structure cluster language models selected by the multiple language model selection means 5001 and generates a single mixed language model. If the models are N-gram models, for example, the occurrence probability of the mixed model is calculated by the following equation (6).

【数４】 (Equation 4)
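The image of equation (6) is not reproduced in this text either. From the symbol definitions in paragraph [0082], it is presumably a weighted sum of the selected models' occurrence probabilities; the form below is a reconstruction on that assumption, and the constraint that the weights sum to 1 is an added assumption needed for the mixture to remain a probability distribution.

```latex
P_{m}\left(w_n \mid w_{n+1-N}^{n-1}\right)
  = \sum_{i \in \Psi} \beta_i \, P_i\left(w_n \mid w_{n+1-N}^{n-1}\right),
\qquad \sum_{i \in \Psi} \beta_i = 1 \quad \text{(assumed)} \tag{6}
```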

【００８２】上記（６）式において、Ｐ_m（ｗ_n｜ｗ
_n+1-N ^n-1）は混合言語モデルの生起確率であり、Ψは複
数言語モデル選択手段５００１によって選択された木構
造クラスタ言語モデルの番号の集合、Ｐ_i（ｗ_n｜ｗ
_n+1-N ^n-1）は選択された言語モデルの生起確率であり、
β_iは重み係数である。ここでβ_iについては、例えば
文献５に示されている言語モデル選択時の生起確率にし
たがって、生起確率が高い言語モデルはβ_iが大きくな
るように設定する。In equation (6) above, P_m(w_n | w_{n+1-N}^{n-1}) is the occurrence probability of the mixed language model, Ψ is the set of indices of the tree-structure cluster language models selected by the multiple language model selecting means 5001, P_i(w_n | w_{n+1-N}^{n-1}) is the occurrence probability of the i-th selected language model, and β_i is a weight coefficient. The weights β_i are set, for example, according to the occurrence probabilities obtained when the language models were selected by the method of Document 5, so that a language model with a higher occurrence probability receives a larger β_i.
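A minimal sketch of the mixing step follows, assuming that the mixed probability is a weighted sum of the selected models' probabilities and that the weights are obtained by normalizing the selection scores. The text only requires that a more probable model receive a larger β_i, so the exponentiate-and-normalize rule used here is an assumption. All names and the toy models are hypothetical.

```python
import math

def mixture_weights(selection_log_probs):
    """Turn per-model selection log-probabilities into weights beta_i that sum to 1."""
    top = max(selection_log_probs.values())
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    unnorm = {i: math.exp(lp - top) for i, lp in selection_log_probs.items()}
    total = sum(unnorm.values())
    return {i: v / total for i, v in unnorm.items()}

def mixed_probability(word, history, models, betas):
    """Weighted sum of the selected models' N-gram probabilities (assumed mixture form)."""
    return sum(betas[i] * models[i](word, history) for i in betas)

# Toy stand-ins for two selected cluster models: each maps (word, history) to a probability.
models = {
    "2003-2": lambda w, h: {"tokyo": 0.6, "osaka": 0.4}[w],
    "2003-4": lambda w, h: {"tokyo": 0.2, "osaka": 0.8}[w],
}
betas = mixture_weights({"2003-2": -35.5, "2003-4": -36.1})
p = mixed_probability("tokyo", ("to",), models, betas)
```

Because the weights sum to 1, the mixed probabilities over the vocabulary still sum to 1 for a given history.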

【００８３】ステップＳＴ４０５において、照合手段１
１０５は、混合言語モデル生成手段５００２によって生
成された混合言語モデルと、音響モデル１１０３を入力
し、認識対象音声１１０１の音声特徴量に対して照合を
行い、最も尤度が高い単語列を音声認識結果１１０６と
して出力する。In step ST405, the matching means 1105 takes as input the mixed language model generated by the mixed language model generating means 5002 and the acoustic model 1103, performs matching against the speech features of the recognition target speech 1101, and outputs the word string with the highest likelihood as the speech recognition result 1106.
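The text does not describe how the matching means searches for the best word string. One common simplification, shown here purely as an assumption, is to rescore a short list of candidate word strings by adding the acoustic log-likelihood to a weighted language-model log-probability. The candidates, scores, and the `lm_weight` parameter are hypothetical.

```python
def best_hypothesis(hypotheses, lm_log_prob, lm_weight=1.0):
    """Pick the word string with the highest combined acoustic + language score."""
    def total(hyp):
        words, acoustic_log_lik = hyp
        return acoustic_log_lik + lm_weight * lm_log_prob(words)
    return max(hypotheses, key=total)[0]

# Hypothetical candidates: (word string, acoustic log-likelihood).
hyps = [
    (("recognize", "speech"), -120.0),
    (("wreck", "a", "nice", "beach"), -118.5),
]
# Hypothetical language-model log-probabilities for the same candidates.
lm_scores = {
    ("recognize", "speech"): -6.0,
    ("wreck", "a", "nice", "beach"): -14.0,
}
result = best_hypothesis(hyps, lm_scores.get)
```

Here the acoustically better candidate loses once the language model score is added, which is the role the selected or mixed language model plays in the matching step.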

【００８４】以上は、選択対象の木構造クラスタ言語モ
デルを、実施の形態１で生成した木構造クラスタ１の言
語モデル２００３−１〜木構造クラスタＭの言語モデル
２００３−Ｍとして説明したが、実施の形態２で生成し
た補間処理された木構造クラスタ１の言語モデル３００
２−１〜補間処理された木構造クラスタＭの言語モデル
３００２−Ｍとしても良い。In the above description, the tree-structure cluster language models to be selected from are the language model 2003-1 of tree-structure cluster 1 through the language model 2003-M of tree-structure cluster M generated in Embodiment 1. Alternatively, they may be the interpolated language model 3002-1 of tree-structure cluster 1 through the interpolated language model 3002-M of tree-structure cluster M generated in Embodiment 2.

【００８５】また、実施の形態４における音声認識方法
を音声認識プログラムとして記録媒体に記録することも
できる。この場合には、実施の形態１の言語モデル生成
プログラムに加えて、音声特徴量抽出手段１１０２と同
様の処理を実現する音声特徴量抽出手順と、照合手段１
１０５と同様の処理を実現する照合手順と、複数言語モ
デル選択手段５００１と同様の処理を実現する複数言語
モデル選択手順と、混合言語モデル生成手段５００２と
同様の処理を実現する混合言語モデル生成手順とを含む
音声認識プログラムを記録媒体に記録する。The speech recognition method of Embodiment 4 can also be recorded on a recording medium as a speech recognition program. In this case, the recorded speech recognition program includes, in addition to the language model generation program of Embodiment 1, a speech feature extraction procedure that performs the same processing as the speech feature extracting means 1102, a matching procedure that performs the same processing as the matching means 1105, a multiple language model selection procedure that performs the same processing as the multiple language model selecting means 5001, and a mixed language model generation procedure that performs the same processing as the mixed language model generating means 5002.

【００８６】以上のように、この実施の形態４における
音声認識装置及び音声認識方法によれば、学習用テキス
トデータ１００１を階層的に木構造クラスタリングし、
各木構造クラスタの学習用テキストデータ２００２−１
〜２００２−Ｍを用いて、木構造クラスタ別言語モデル
２００３−１〜２００３−Ｍを生成し、学習用テキスト
データが少量であることによって生じる言語モデルのゼ
ロ頻度問題やスパースネスの問題を軽減でき、この木構
造クラスタ別言語モデル２００３から複数選択した木構
造クラスタ言語モデルによって混合言語モデルを生成し
て、音声認識に用いるので、さらに認識精度が高い音声
認識ができるという効果が得られる。As described above, according to the speech recognition apparatus and speech recognition method of Embodiment 4, the learning text data 1001 is hierarchically clustered into a tree structure, and the tree-structure cluster language models 2003-1 to 2003-M are generated from the learning text data 2002-1 to 2002-M of the respective tree-structure clusters. This reduces the zero-frequency and sparseness problems that arise in a language model when the learning text data is small. In addition, a mixed language model is generated from a plurality of tree-structure cluster language models selected from the tree-structure cluster language models 2003 and is used for speech recognition, so speech recognition with still higher accuracy is obtained.

【００８７】また、認識対象の１発声が複数の分野や場
面・状況を含む場合であっても、複数の分野や場面・状
況の言語特徴を学習した言語モデルを選択し混合言語モ
デルを生成して音声認識に用いるので、認識性能が高い
音声認識ができる効果が得られる。Even when a single utterance to be recognized spans multiple fields, scenes, or situations, language models that have learned the linguistic features of those fields, scenes, and situations are selected and combined into a mixed language model for speech recognition, so speech recognition with high recognition performance is obtained.

【００８８】実施の形態５．図１１はこの発明の実施の
形態５による音声認識装置の構成を示すブロック図であ
る。図において、６００１は葉ノードのクラスタ別言語
モデル、６００１−１〜６００１−Ｌは葉ノードクラス
タ１〜Ｌの言語モデルである。実施の形態３の図７と同
一の機能ブロックについては、同一の符号を付し説明を
省略する。Embodiment 5 FIG. 11 is a block diagram showing a configuration of a speech recognition apparatus according to Embodiment 5 of the present invention. In the figure, 6001 is a language model for each leaf node cluster, and 6001-1 to 6001-L are language models for leaf node clusters 1 to L. The same functional blocks as in FIG. 7 of the third embodiment are denoted by the same reference numerals, and description thereof is omitted.

【００８９】次に動作について説明する。図１２はこの
発明の実施の形態５による音声認識装置における音声認
識方法を示すフローチャートである。ステップＳＴ５０
１及びステップＳＴ５０２の処理は、実施の形態３にお
ける図８のステップＳＴ３０１及びステップＳＴ３０２
の処理と同一である。Next, the operation is described. FIG. 12 is a flowchart showing the speech recognition method used by the speech recognition apparatus according to Embodiment 5 of the present invention. The processing in steps ST501 and ST502 is the same as that in steps ST301 and ST302 of FIG. 8 in Embodiment 3.

【００９０】ステップＳＴ５０３において、言語モデル
選択手段１１０４は、木構造クラスタの葉ノードクラス
タの言語モデルから、照合手段１１０５で用いる言語モ
デルを、葉ノードクラスタ１の言語モデル６００１−１
〜葉ノードクラスタＬの言語モデル６００１−Ｌから１
つ選択する。ここで、葉ノードクラスタの言語モデルと
は、木構造の最も下層の木構造クラスタの言語モデルで
ある。図４の例では、木構造クラスタ２０の言語モデル
ＬＭ２０，木構造クラスタ２１の言語モデルＬＭ２１，
木構造クラスタ２２の言語モデルＬＭ２２，木構造クラ
スタ２３の言語モデルＬＭ２３が葉ノードクラスタの言
語モデルに相当する。In step ST503, the language model selecting means 1104 selects the one language model to be used by the matching means 1105 from among the leaf-node cluster language models of the tree-structure clusters, that is, from the language model 6001-1 of leaf-node cluster 1 through the language model 6001-L of leaf-node cluster L. Here, a leaf-node cluster language model is the language model of a tree-structure cluster in the lowest layer of the tree. In the example of FIG. 4, the language model LM20 of tree-structure cluster 20, LM21 of tree-structure cluster 21, LM22 of tree-structure cluster 22, and LM23 of tree-structure cluster 23 are the leaf-node cluster language models.
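Finding the leaf-node clusters can be sketched as a simple tree walk. The text names only clusters 20 to 23 as the leaves of FIG. 4, so the upper part of the tree used below (a root 0 that splits into 10 and 11) is an assumed shape for illustration, and `leaf_clusters` is a hypothetical helper. The sketch treats any cluster without children as a leaf.

```python
def leaf_clusters(children, root):
    """Return the IDs of clusters that have no child clusters, in depth-first order."""
    kids = children.get(root, [])
    if not kids:
        return [root]
    leaves = []
    for kid in kids:
        leaves.extend(leaf_clusters(children, kid))
    return leaves

# Assumed tree shape: cluster 0 splits into 10 and 11, which split into 20-23.
tree = {0: [10, 11], 10: [20, 21], 11: [22, 23]}
leaves = leaf_clusters(tree, 0)
print(leaves)
```

In this assumed tree, selection considers 4 leaf models instead of all 7 cluster models.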

【００９１】このような葉ノードクラスタの言語モデル
は、分野や場面・状況の違いによる言語の性質の違いを
詳細に表現するモデルとなっているので、分野や場面・
状況が明確に分かれるような認識対象の音声である場合
は有効である。また、全ての木構造クラスタ別の言語モ
デルを用いる場合に比べて、選択対象のクラスタ言語モ
デルの数が少ないので、省メモリー、演算量削減の効果
がある。葉ノードクラスタの言語モデルの選択は、例え
ば文献５に示されている方法を用い、最も生起確率が高
い葉ノードクラスタの言語モデルを選択する。Such leaf-node cluster language models express in detail the differences in linguistic properties that arise from differences in field, scene, or situation, so they are effective when the speech to be recognized falls clearly into distinct fields, scenes, or situations. Moreover, because there are fewer cluster language models to choose from than when the language models of all tree-structure clusters are used, memory use and computation are reduced. The leaf-node cluster language model is selected, for example, by the method described in Document 5, choosing the leaf-node cluster language model with the highest occurrence probability.
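Choosing the single most probable leaf model can be sketched as an argmax over the leaf models. The method of Document 5 is not given in this text, so the sketch stands in toy unigram tables for the leaf-node cluster language models and scores a candidate word string under each. The model contents and the `floor` probability for unseen words are assumptions.

```python
import math

def sequence_log_prob(words, unigram_probs, floor=1e-6):
    """Log-probability of a word string under a toy unigram model (unseen words get `floor`)."""
    return sum(math.log(unigram_probs.get(w, floor)) for w in words)

def select_leaf_model(candidate, leaf_models):
    """Return the ID of the leaf-cluster model under which `candidate` is most probable."""
    return max(leaf_models, key=lambda i: sequence_log_prob(candidate, leaf_models[i]))

# Hypothetical leaf-cluster models, reduced to unigram tables for brevity.
leaf_models = {
    "6001-1": {"flight": 0.5, "ticket": 0.4, "doctor": 0.1},
    "6001-2": {"doctor": 0.6, "fever": 0.3, "ticket": 0.1},
}
chosen = select_leaf_model(["flight", "ticket"], leaf_models)
```

Only the L leaf models are scored, which is where the reduction in computation relative to scoring all M cluster models comes from.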

【００９２】ステップＳＴ５０４において、照合手段１
１０５は、言語モデル選択手段１１０４によって選択さ
れた葉ノードクラスタの言語モデルと、音響モデル１１
０３を入力して、認識対象音声１１０１の音声特徴量に
対して照合を行い、最も尤度が高い単語列を音声認識結
果１１０６として出力する。In step ST504, the matching means 1105 takes as input the leaf-node cluster language model selected by the language model selecting means 1104 and the acoustic model 1103, performs matching against the speech features of the recognition target speech 1101, and outputs the word string with the highest likelihood as the speech recognition result 1106.

【００９３】以上は、選択対象の葉ノードクラスタの言
語モデルを、実施の形態１で生成した木構造クラスタ別
言語モデル２００３の葉ノードクラスタの言語モデルと
したが、実施の形態２で生成した補間処理された木構造
クラスタ別言語モデル３００２の葉ノードクラスタの言
語モデルとしても良い。また、言語モデル選択手段１１
０４を複数言語モデル選択手段５００１とし、後段に混
合言語モデル生成手段５００２を接続し、混合言語モデ
ルを用いて照合処理を行っても良い。In the above description, the leaf-node cluster language models to be selected from are those of the tree-structure cluster language models 2003 generated in Embodiment 1, but they may instead be the leaf-node cluster language models of the interpolated tree-structure cluster language models 3002 generated in Embodiment 2. Also, the language model selecting means 1104 may be replaced by the multiple language model selecting means 5001, with the mixed language model generating means 5002 connected after it, so that matching is performed using a mixed language model.

【００９４】また、実施の形態５における音声認識方法
を音声認識プログラムとして記録媒体に記録することも
できる。この場合には、実施の形態１の言語モデル生成
プログラムに加えて、音声特徴量抽出手段１１０２と同
様の処理を実現する音声特徴量抽出手順と、言語モデル
選択手段１１０４と同様の処理を実現する言語モデル選
択手順と、照合手段１１０５と同様の処理を実現する照
合手順を含む音声認識プログラムを記録媒体に記録す
る。The speech recognition method of Embodiment 5 can also be recorded on a recording medium as a speech recognition program. In this case, the recorded speech recognition program includes, in addition to the language model generation program of Embodiment 1, a speech feature extraction procedure that performs the same processing as the speech feature extracting means 1102, a language model selection procedure that performs the same processing as the language model selecting means 1104, and a matching procedure that performs the same processing as the matching means 1105.

【００９５】以上のように、この実施の形態５における
音声認識装置及び音声認識方法によれば、学習用テキス
トデータ１００１を階層的に木構造クラスタリングし、
各木構造クラスタの学習用テキストデータ２００２−１
〜２００２−Ｍを用いて、木構造クラスタ言語モデル２
００３を生成するので、学習用テキストデータが少量で
あることによって生じる言語モデルのゼロ頻度問題やス
パースネスの問題を軽減でき、この木構造クラスタ言語
モデル２００３の葉ノードクラスタの言語モデル６００
１から選択した言語モデルを音声認識に用いるので、認
識精度が高い音声認識ができると共に、言語モデルのメ
モリ容量を削減でき、言語モデルを選択する際の演算量
を削減できるという効果が得られる。As described above, according to the speech recognition apparatus and speech recognition method of Embodiment 5, the learning text data 1001 is hierarchically clustered into a tree structure, and the tree-structure cluster language models 2003 are generated from the learning text data 2002-1 to 2002-M of the respective tree-structure clusters. This reduces the zero-frequency and sparseness problems that arise in a language model when the learning text data is small. Because a language model selected from the leaf-node cluster language models 6001 of the tree-structure cluster language models 2003 is used for speech recognition, speech recognition with high accuracy is obtained, the memory required for the language models is reduced, and the amount of computation needed to select a language model is reduced.

【００９６】また、認識対象の１発声が複数の分野や場
面・状況を含む場合であっても、複数の葉ノードクラス
タの言語モデルを選択し混合言語モデルを生成すれば、
複数の分野や場面・状況の言語特徴を学習した言語モデ
ルを音声認識に用いることになるので、認識性能が高い
音声認識ができる効果が得られる。Even when a single utterance to be recognized spans multiple fields, scenes, or situations, selecting a plurality of leaf-node cluster language models and generating a mixed language model from them means that a language model that has learned the linguistic features of those fields, scenes, and situations is used for speech recognition, so speech recognition with high recognition performance is obtained.

【００９７】[0097]

【発明の効果】以上のように、この発明によれば、言語
モデル生成装置が、学習用テキストデータを言語的に類
似した性質を持つように階層的に分割する木構造クラス
タリングを行い、木構造学習用テキストデータクラスタ
を生成する学習用テキストデータ木構造クラスタリング
手段と、木構造学習用テキストデータクラスタに属する
各学習用テキストデータを用いて、木構造クラスタ別言
語モデルを生成する言語モデル生成手段とを備えたこと
により、学習用テキストデータが少量であることによっ
て生じる言語モデルのゼロ頻度問題やスパースネスの問
題を軽減でき、認識率の高い言語モデルが生成できると
共に、認識対象の１発声が複数の分野や場面・状況を含
む場合であっても、複数の分野や場面・状況の言語特徴
を学習した言語モデルが存在するので、認識率の高い言
語モデルが生成できる効果がある。As described above, according to the present invention, the language model generating apparatus comprises learning text data tree-structure clustering means that performs tree-structure clustering, hierarchically dividing the learning text data into groups with linguistically similar properties, and generates tree-structure learning text data clusters, and language model generating means that generates tree-structure cluster language models using the learning text data belonging to each tree-structure learning text data cluster. This reduces the zero-frequency and sparseness problems that arise in a language model when the learning text data is small, so a language model with a high recognition rate can be generated. In addition, even when a single utterance to be recognized spans multiple fields, scenes, or situations, language models that have learned the linguistic features of those fields, scenes, and situations are available, so a language model with a high recognition rate can be generated.

【００９８】この発明によれば、言語モデル生成装置
が、木構造クラスタ別言語モデルが位置する木構造の上
位に位置する木構造クラスタ別言語モデルを用いて補間
処理を行い、補間処理された木構造クラスタ別言語モデ
ルを生成する言語モデル補間手段を備えたことにより、
学習用テキストデータが少量であることによって生じる
言語モデルのゼロ頻度問題やスパースネスの問題を軽減
でき、さらに認識率の高い言語モデルを生成できると共
に、認識対象の１発声が複数の分野や場面・状況を含む
場合であっても、複数の分野や場面・状況の言語特徴を
学習した言語モデルが存在するので、認識率の高い言語
モデルが生成できるという効果がある。According to the present invention, the language model generating apparatus comprises language model interpolating means that interpolates each tree-structure cluster language model using the tree-structure cluster language models located above it in the tree and generates interpolated tree-structure cluster language models. This reduces the zero-frequency and sparseness problems that arise in a language model when the learning text data is small, so a language model with a still higher recognition rate can be generated. In addition, even when a single utterance to be recognized spans multiple fields, scenes, or situations, language models that have learned the linguistic features of those fields, scenes, and situations are available, so a language model with a high recognition rate can be generated.
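How interpolation with a higher-level cluster can ease the zero-frequency problem may be illustrated with a small sketch. The interpolation formula of Embodiment 2 does not appear in this part of the text, so the sketch assumes plain linear interpolation between a cluster model and its parent, using toy unigram tables. The weight `lam` and all values are hypothetical.

```python
def interpolate_with_parent(child_probs, parent_probs, lam=0.7):
    """Linearly interpolate a cluster's word probabilities with its parent cluster's."""
    vocab = set(child_probs) | set(parent_probs)
    return {
        w: lam * child_probs.get(w, 0.0) + (1.0 - lam) * parent_probs.get(w, 0.0)
        for w in vocab
    }

# Hypothetical unigram tables: the child cluster never saw "fever" in its small training text.
child = {"doctor": 0.7, "ticket": 0.3}
parent = {"doctor": 0.4, "ticket": 0.4, "fever": 0.2}
smoothed = interpolate_with_parent(child, parent)
```

The word that had zero probability in the child cluster receives a nonzero probability from the parent, while the interpolated table still sums to 1.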

【００９９】この発明によれば、音声認識装置が、音声
特徴量抽出手段と、音響モデルと、学習用テキストデー
タを言語的に類似した性質を持つように階層的に分割す
る木構造クラスタリングを行い、各木構造クラスタの学
習用テキストデータを用いて生成された木構造クラスタ
別言語モデルと、木構造クラスタ別言語モデルから、音
声認識結果候補の単語列に対して最も生起確率が高い言
語モデルを選択する言語モデル選択手段と、選択された
言語モデルと音響モデルを用いて、音声特徴量抽出手段
が抽出した音声特徴量に対して照合を行い音声認識結果
を出力する照合手段とを備えたことにより、学習用テキ
ストデータが少量であることによって生じる言語モデル
のゼロ頻度問題やスパースネスの問題を軽減でき、木構
造クラスタ別言語モデルから言語モデルを選択して音声
認識を行うので、認識精度が高い音声認識ができると共
に、認識対象の音声が複数の分野や場面・状況を含む場
合であっても、複数の分野や場面・状況の言語特徴を学
習した木構造クラスタ言語モデルを選択し音声認識を行
うので、認識性能が高い音声認識ができる効果がある。According to the present invention, the speech recognition apparatus comprises: speech feature extracting means; an acoustic model; tree-structure cluster language models generated by performing tree-structure clustering that hierarchically divides learning text data into groups with linguistically similar properties and using the learning text data of each tree-structure cluster; language model selecting means that selects, from the tree-structure cluster language models, the language model with the highest occurrence probability for the word string of a speech recognition result candidate; and matching means that uses the selected language model and the acoustic model to perform matching against the speech features extracted by the speech feature extracting means and outputs a speech recognition result. This reduces the zero-frequency and sparseness problems that arise in a language model when the learning text data is small. Because speech recognition is performed with a language model selected from the tree-structure cluster language models, speech recognition with high accuracy is obtained. In addition, even when the speech to be recognized spans multiple fields, scenes, or situations, a tree-structure cluster language model that has learned the linguistic features of those fields, scenes, and situations is selected for speech recognition, so speech recognition with high recognition performance is obtained.

【０１００】この発明によれば、音声認識装置の言語モ
デル選択手段が、木構造クラスタ別言語モデルにおける
最も下層の葉ノードのクラスタ別言語モデルから言語モ
デルを選択することにより、言語モデルのメモリ容量を
削減でき、言語モデルを選択する際の演算量を削減でき
るという効果がある。According to the present invention, the language model selecting means of the speech recognition apparatus selects the language model from among the leaf-node cluster language models in the lowest layer of the tree-structure cluster language models, which reduces the memory required for the language models and the amount of computation needed to select a language model.

【０１０１】この発明によれば、音声認識装置が、音声
特徴量抽出手段と、音響モデルと、学習用テキストデー
タを言語的に類似した性質を持つように階層的に分割す
る木構造クラスタリングを行い、各木構造クラスタの学
習用テキストデータを用いて生成された木構造クラスタ
別言語モデルと、木構造クラスタ別言語モデルから、音
声認識結果候補の単語列に対して生起確率の高い複数の
言語モデルを選択する複数言語モデル選択手段と、選択
された複数の言語モデルを入力して混合言語モデルを生
成する混合言語モデル生成手段と、生成された言語モデ
ルと音響モデルを用いて、音声特徴量抽出手段が抽出し
た音声特徴量に対して照合を行い音声認識結果を出力す
る照合手段とを備えたことにより、学習用テキストデー
タが少量であることによって生じる言語モデルのゼロ頻
度問題やスパースネスの問題を軽減でき、木構造クラス
タ別言語モデルから複数選択した木構造クラスタ言語モ
デルによって混合言語モデルを生成して、音声認識に用
いるので、さらに認識精度が高い音声認識ができると共
に、認識対象の１発声が複数の分野や場面・状況を含む
場合であっても、複数の分野や場面・状況の言語特徴を
学習した言語モデルを選択し混合言語モデルを生成して
音声認識に用いるので、認識性能が高い音声認識ができ
る効果がある。According to the present invention, the speech recognition apparatus comprises: speech feature extracting means; an acoustic model; tree-structure cluster language models generated by performing tree-structure clustering that hierarchically divides learning text data into groups with linguistically similar properties and using the learning text data of each tree-structure cluster; multiple language model selecting means that selects, from the tree-structure cluster language models, a plurality of language models with high occurrence probabilities for the word string of a speech recognition result candidate; mixed language model generating means that takes the selected language models as input and generates a mixed language model; and matching means that uses the generated language model and the acoustic model to perform matching against the speech features extracted by the speech feature extracting means and outputs a speech recognition result. This reduces the zero-frequency and sparseness problems that arise in a language model when the learning text data is small. Because a mixed language model is generated from a plurality of tree-structure cluster language models selected from the tree-structure cluster language models and is used for speech recognition, speech recognition with still higher accuracy is obtained. In addition, even when a single utterance to be recognized spans multiple fields, scenes, or situations, language models that have learned the linguistic features of those fields, scenes, and situations are selected and combined into a mixed language model for speech recognition, so speech recognition with high recognition performance is obtained.

【０１０２】この発明によれば、音声認識装置の複数言
語モデル選択手段が、木構造クラスタ別言語モデルにお
ける最も下層の葉ノードのクラスタ別言語モデルから複
数の言語モデルを選択することにより、言語モデルのメ
モリ容量を削減でき、言語モデルを選択する際の演算量
を削減できるという効果がある。According to the present invention, the multiple language model selecting means of the speech recognition apparatus selects the plurality of language models from among the leaf-node cluster language models in the lowest layer of the tree-structure cluster language models, which reduces the memory required for the language models and the amount of computation needed to select the language models.

【０１０３】この発明によれば、音声認識装置の木構造
クラスタ別言語モデルが、木構造の上位に位置する木構
造クラスタ別言語モデルを用いて補間処理が行われた補
間処理された木構造クラスタ別言語モデルであることに
より、学習用テキストデータが少量であることによって
生じる言語モデルのゼロ頻度問題やスパースネスの問題
を軽減でき、さらに認識率の高い言語モデルを生成でき
ると共に、認識対象の１発声が複数の分野や場面・状況
を含む場合であっても、複数の分野や場面・状況の言語
特徴を学習した言語モデルが存在するので、認識率の高
い言語モデルが生成できるという効果がある。According to the present invention, the tree-structure cluster language models of the speech recognition apparatus are interpolated tree-structure cluster language models, obtained by interpolation using the tree-structure cluster language models located higher in the tree. This reduces the zero-frequency and sparseness problems that arise in a language model when the learning text data is small, so a language model with a still higher recognition rate can be generated. In addition, even when a single utterance to be recognized spans multiple fields, scenes, or situations, language models that have learned the linguistic features of those fields, scenes, and situations are available, so a language model with a high recognition rate can be generated.

【０１０４】この発明によれば、言語モデル生成方法と
して、学習用テキストデータを言語的に類似した性質を
持つように階層的に分割する木構造クラスタリングを行
い、木構造学習用テキストデータクラスタを生成する第
１のステップと、木構造学習用テキストデータクラスタ
に属する各学習用テキストデータを用いて、木構造クラ
スタ別言語モデルを生成する第２のステップとを備えた
ことにより、学習用テキストデータが少量であることに
よって生じる言語モデルのゼロ頻度問題やスパースネス
の問題を軽減でき、認識率の高い言語モデルが生成でき
ると共に、認識対象の１発声が複数の分野や場面・状況
を含む場合であっても、複数の分野や場面・状況の言語
特徴を学習した言語モデルが存在するので、認識率の高
い言語モデルが生成できる効果がある。According to the present invention, the language model generating method comprises a first step of performing tree-structure clustering, which hierarchically divides learning text data into groups with linguistically similar properties, to generate tree-structure learning text data clusters, and a second step of generating tree-structure cluster language models using the learning text data belonging to each tree-structure learning text data cluster. This reduces the zero-frequency and sparseness problems that arise in a language model when the learning text data is small, so a language model with a high recognition rate can be generated. In addition, even when a single utterance to be recognized spans multiple fields, scenes, or situations, language models that have learned the linguistic features of those fields, scenes, and situations are available, so a language model with a high recognition rate can be generated.

【０１０５】この発明によれば、言語モデル生成方法と
して、木構造クラスタ別言語モデルが位置する木構造の
上位に位置する木構造クラスタ別言語モデルを用いて補
間処理を行い、補間処理された木構造クラスタ別言語モ
デルを生成する第３のステップを備えたことにより、学
習用テキストデータが少量であることによって生じる言
語モデルのゼロ頻度問題やスパースネスの問題を軽減で
き、さらに認識率の高い言語モデルを生成できると共
に、認識対象の１発声が複数の分野や場面・状況を含む
場合であっても、複数の分野や場面・状況の言語特徴を
学習した言語モデルが存在するので、認識率の高い言語
モデルが生成できるという効果がある。According to the present invention, the language model generating method comprises a third step of interpolating each tree-structure cluster language model using the tree-structure cluster language models located above it in the tree and generating interpolated tree-structure cluster language models. This reduces the zero-frequency and sparseness problems that arise in a language model when the learning text data is small, so a language model with a still higher recognition rate can be generated. In addition, even when a single utterance to be recognized spans multiple fields, scenes, or situations, language models that have learned the linguistic features of those fields, scenes, and situations are available, so a language model with a high recognition rate can be generated.

【０１０６】この発明によれば、音声認識方法として、
音声特徴量を抽出する第１のステップと、学習用テキス
トデータを言語的に類似した性質を持つように階層的に
分割する木構造クラスタリングを行い、各木構造クラス
タの学習用テキストデータを用いて生成された木構造ク
ラスタ別言語モデルから、音声認識結果候補の単語列に
対して最も生起確率が高い言語モデルを選択する第２の
ステップと、音響モデルと選択された言語モデルを用い
て、音声特徴量に対して照合を行い音声認識結果を出力
する第３のステップとを備えたことにより、学習用テキ
ストデータが少量であることによって生じる言語モデル
のゼロ頻度問題やスパースネスの問題を軽減でき、木構
造クラスタ別言語モデルから言語モデルを選択して音声
認識を行うので、認識精度が高い音声認識ができると共
に、認識対象の音声が複数の分野や場面・状況を含む場
合であっても、複数の分野や場面・状況の言語特徴を学
習した木構造クラスタ言語モデルを選択し音声認識を行
うので、認識性能が高い音声認識ができる効果がある。According to the present invention, the speech recognition method comprises: a first step of extracting speech features; a second step of selecting, from tree-structure cluster language models generated by performing tree-structure clustering that hierarchically divides learning text data into groups with linguistically similar properties and using the learning text data of each tree-structure cluster, the language model with the highest occurrence probability for the word string of a speech recognition result candidate; and a third step of using the acoustic model and the selected language model to perform matching against the speech features and outputting a speech recognition result. This reduces the zero-frequency and sparseness problems that arise in a language model when the learning text data is small. Because speech recognition is performed with a language model selected from the tree-structure cluster language models, speech recognition with high accuracy is obtained. In addition, even when the speech to be recognized spans multiple fields, scenes, or situations, a tree-structure cluster language model that has learned the linguistic features of those fields, scenes, and situations is selected for speech recognition, so speech recognition with high recognition performance is obtained.

【０１０７】この発明によれば、音声認識方法の第２の
ステップで、木構造クラスタ別言語モデルにおける最も
下層の葉ノードのクラスタ別言語モデルから言語モデル
を選択することにより、言語モデルを選択する際の演算
量を削減できるという効果がある。According to the present invention, in the second step of the speech recognition method the language model is selected from among the leaf-node cluster language models in the lowest layer of the tree-structure cluster language models, which reduces the amount of computation needed to select a language model.

【０１０８】この発明によれば、音声認識方法として、
音声特徴量を抽出する第１のステップと、学習用テキス
トデータを言語的に類似した性質を持つように階層的に
分割する木構造クラスタリングを行い、各木構造クラス
タの学習用テキストデータを用いて生成された木構造ク
ラスタ別言語モデルから、音声認識結果候補の単語列に
対して生起確率が高い複数の言語モデルを選択する第２
のステップと、選択された複数の言語モデルを入力して
混合言語モデルを生成する第３のステップと、音響モデ
ルと生成された言語モデルを用いて、抽出した音声特徴
量に対して照合を行い音声認識結果を出力する第４のス
テップとを備えたことにより、学習用テキストデータが
少量であることによって生じる言語モデルのゼロ頻度問
題やスパースネスの問題を軽減でき、木構造クラスタ別
言語モデルから複数選択した木構造クラスタ言語モデル
によって混合言語モデルを生成して、音声認識に用いる
ので、さらに認識精度が高い音声認識ができると共に、
認識対象の１発声が複数の分野や場面・状況を含む場合
であっても、複数の分野や場面・状況の言語特徴を学習
した言語モデルを選択し混合言語モデルを生成して音声
認識に用いるので、認識性能が高い音声認識ができる効
果がある。According to the present invention, the speech recognition method comprises: a first step of extracting speech features; a second step of selecting, from tree-structure cluster language models generated by performing tree-structure clustering that hierarchically divides learning text data into groups with linguistically similar properties and using the learning text data of each tree-structure cluster, a plurality of language models with high occurrence probabilities for the word string of a speech recognition result candidate; a third step of taking the selected language models as input and generating a mixed language model; and a fourth step of using the acoustic model and the generated language model to perform matching against the extracted speech features and outputting a speech recognition result. This reduces the zero-frequency and sparseness problems that arise in a language model when the learning text data is small. Because a mixed language model is generated from a plurality of tree-structure cluster language models selected from the tree-structure cluster language models and is used for speech recognition, speech recognition with still higher accuracy is obtained. In addition, even when a single utterance to be recognized spans multiple fields, scenes, or situations, language models that have learned the linguistic features of those fields, scenes, and situations are selected and combined into a mixed language model for speech recognition, so speech recognition with high recognition performance is obtained.

【０１０９】この発明によれば、音声認識方法の第２の
ステップで、木構造クラスタ別言語モデルにおける最も
下層の葉ノードのクラスタ別言語モデルから複数の言語
モデルを選択することにより、言語モデルを選択する際
の演算量を削減できるという効果がある。According to the present invention, in the second step of the speech recognition method the plurality of language models are selected from among the leaf-node cluster language models in the lowest layer of the tree-structure cluster language models, which reduces the amount of computation needed to select the language models.

【０１１０】この発明によれば、言語モデル生成プログ
ラムを記録した記録媒体で、学習用テキストデータを言
語的に類似した性質を持つように階層的に分割する木構
造クラスタリングを行い、木構造学習用テキストデータ
クラスタを生成する学習用テキストデータ木構造クラス
タリング手順と、木構造学習用テキストデータクラスタ
に属する各学習用テキストデータを用いて、木構造クラ
スタ別言語モデルを生成する言語モデル生成手順とを実
現させることにより、学習用テキストデータが少量であ
ることによって生じる言語モデルのゼロ頻度問題やスパ
ースネスの問題を軽減でき、認識率の高い言語モデルが
生成できると共に、認識対象の１発声が複数の分野や場
面・状況を含む場合であっても、複数の分野や場面・状
況の言語特徴を学習した言語モデルが存在するので、認
識率の高い言語モデルが生成できる効果がある。According to the present invention, the recording medium on which the language model generation program is recorded causes a computer to execute a learning text data tree-structure clustering procedure that performs tree-structure clustering, hierarchically dividing learning text data into groups with linguistically similar properties, to generate tree-structure learning text data clusters, and a language model generation procedure that generates tree-structure cluster language models using the learning text data belonging to each tree-structure learning text data cluster. This reduces the zero-frequency and sparseness problems that arise in a language model when the learning text data is small, so a language model with a high recognition rate can be generated. In addition, even when a single utterance to be recognized spans multiple fields, scenes, or situations, language models that have learned the linguistic features of those fields, scenes, and situations are available, so a language model with a high recognition rate can be generated.

【０１１１】この発明によれば、言語モデル生成プログ
ラムを記録した記録媒体で、木構造クラスタ別言語モデ
ルが位置する木構造の上位に位置する木構造クラスタ別
言語モデルを用いて補間処理を行い、補間処理された木
構造クラスタ別言語モデルを生成する言語モデル補間手
順を実現させることにより、学習用テキストデータが少
量であることによって生じる言語モデルのゼロ頻度問題
やスパースネスの問題を軽減でき、さらに認識率の高い
言語モデルを生成できると共に、認識対象の１発声が複
数の分野や場面・状況を含む場合であっても、複数の分
野や場面・状況の言語特徴を学習した言語モデルが存在
するので、認識率の高い言語モデルが生成できるという
効果がある。According to the present invention, the recording medium on which the language model generation program is recorded causes a computer to execute a language model interpolation procedure that interpolates each tree-structure cluster language model using the tree-structure cluster language models located above it in the tree and generates interpolated tree-structure cluster language models. This reduces the zero-frequency and sparseness problems that arise in a language model when the learning text data is small, so a language model with a still higher recognition rate can be generated. In addition, even when a single utterance to be recognized spans multiple fields, scenes, or situations, language models that have learned the linguistic features of those fields, scenes, and situations are available, so a language model with a high recognition rate can be generated.

【０１１２】この発明によれば、音声認識プログラムを
記録した記録媒体で、音声特徴量を抽出する音声特徴量
抽出手順と、学習用テキストデータを言語的に類似した
性質を持つように階層的に分割する木構造クラスタリン
グを行い、各木構造クラスタの学習用テキストデータを
用いて生成された木構造クラスタ別言語モデルから、音
声認識結果候補の単語列に対して最も生起確率が高い言
語モデルを選択する言語モデル選択手順と、音響モデル
と選択された言語モデルを用いて、抽出された音声特徴
量に対して照合を行い音声認識結果を出力する照合手順
とを実現させることにより、学習用テキストデータが少
量であることによって生じる言語モデルのゼロ頻度問題
やスパースネスの問題を軽減でき、木構造クラスタ別言
語モデルから言語モデルを選択して音声認識を行うの
で、認識精度が高い音声認識ができると共に、認識対象
の音声が複数の分野や場面・状況を含む場合であって
も、複数の分野や場面・状況の言語特徴を学習した木構
造クラスタ言語モデルを選択し音声認識を行うので、認
識性能が高い音声認識ができる効果がある。According to the present invention, the recording medium on which the speech recognition program is recorded causes a computer to execute: a speech feature extraction procedure that extracts speech features; a language model selection procedure that selects, from tree-structure cluster language models generated by performing tree-structure clustering that hierarchically divides learning text data into groups with linguistically similar properties and using the learning text data of each tree-structure cluster, the language model with the highest occurrence probability for the word string of a speech recognition result candidate; and a matching procedure that uses the acoustic model and the selected language model to perform matching against the extracted speech features and outputs a speech recognition result. This reduces the zero-frequency and sparseness problems that arise in a language model when the learning text data is small. Because speech recognition is performed with a language model selected from the tree-structure cluster language models, speech recognition with high accuracy is obtained. In addition, even when the speech to be recognized spans multiple fields, scenes, or situations, a tree-structure cluster language model that has learned the linguistic features of those fields, scenes, and situations is selected for speech recognition, so speech recognition with high recognition performance is obtained.

[0113] According to the present invention, the language model selection procedure of the speech recognition program selects the language model from the cluster-based language models of the lowest-level leaf nodes among the tree-structure cluster-based language models, which reduces the amount of computation required to select a language model.
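To see why restricting selection to leaf nodes reduces computation, the following sketch counts how many models must be scored in each case. The tree shape, the node names, and the helper `leaf_nodes` are hypothetical and chosen only for illustration.

```python
def leaf_nodes(children, root):
    """Return the nodes that have no children, in depth-first order."""
    stack, leaves = [root], []
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if kids:
            # Push children in reverse so they are visited left to right.
            stack.extend(reversed(kids))
        else:
            leaves.append(node)
    return leaves


# Hypothetical two-level binary tree of clusters.
children = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1", "B2"]}
all_nodes = ["root", "A", "B", "A1", "A2", "B1", "B2"]

leaves = leaf_nodes(children, "root")
# Scoring only the leaves means 4 model evaluations instead of 7.
print(len(leaves), "of", len(all_nodes), "models scored")
```

For a full binary tree, the leaves are roughly half of all nodes, so the number of occurrence-probability evaluations per candidate is roughly halved in this toy setting.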

[0114] According to the present invention, a recording medium on which a speech recognition program is recorded causes a computer to realize: a speech feature extraction procedure that extracts a speech feature amount; a multiple language model selection procedure that selects, from tree-structure cluster-based language models, a plurality of language models with high occurrence probabilities for the word string of a speech recognition result candidate, the tree-structure cluster-based language models having been generated by performing tree structure clustering that hierarchically divides learning text data so as to have linguistically similar properties and by using the learning text data of each tree structure cluster; a mixed language model generation procedure that takes the selected language models as input and generates a mixed language model; and a collation procedure that collates the extracted speech feature amount using an acoustic model and the generated language model and outputs a speech recognition result. This reduces the zero-frequency and sparseness problems of a language model caused by a small amount of learning text data. Because a mixed language model is generated from several language models selected from the tree-structure cluster-based language models and is used for speech recognition, even more accurate speech recognition is possible. In addition, even when a single utterance to be recognized spans multiple fields, scenes, and situations, language models that have learned the linguistic features of those fields, scenes, and situations are selected and combined into a mixed language model for recognition, so speech recognition with high recognition performance is achieved.
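A minimal sketch of selecting several models and mixing them is given below. It assumes unigram models, a probability floor for unseen words, and uniform mixture weights; this passage does not state how the mixture weights are determined, so that choice is an assumption. The function names and data are hypothetical.

```python
import math


def log_probability(model, words, floor=1e-6):
    """Log occurrence probability of a word string; `floor` covers unseen words."""
    return sum(math.log(model.get(word, floor)) for word in words)


def select_top_models(cluster_models, candidate_words, count=2):
    """Return the names of the `count` models that score the candidate highest."""
    ranked = sorted(
        cluster_models,
        key=lambda name: log_probability(cluster_models[name], candidate_words),
        reverse=True,
    )
    return ranked[:count]


def mix_models(models, weights=None):
    """Weighted sum of word probabilities; uniform weights if none are given."""
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    vocabulary = set().union(*models)
    return {
        word: sum(w * model.get(word, 0.0) for w, model in zip(weights, models))
        for word in vocabulary
    }


# Hypothetical cluster models; the candidate spans two topics in one utterance.
cluster_models = {
    "travel": {"book": 0.3, "a": 0.3, "flight": 0.4},
    "food": {"order": 0.3, "a": 0.3, "pizza": 0.4},
    "sports": {"play": 0.5, "a": 0.1, "match": 0.4},
}
candidate = ["book", "a", "flight", "order", "a", "pizza"]

selected = select_top_models(cluster_models, candidate, count=2)
mixed = mix_models([cluster_models[name] for name in selected])
print(sorted(selected), mixed["flight"], mixed["pizza"])
```

In this toy case no single cluster covers the whole utterance, while the mixed model assigns nonzero probability to every word in it.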

[0115] According to the present invention, the multiple language model selection procedure of the speech recognition program selects the plurality of language models from the cluster-based language models of the lowest-level leaf nodes among the tree-structure cluster-based language models, which reduces the amount of computation required to select the language models.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a language model generating apparatus according to Embodiment 1 of the present invention.

FIG. 2 is a flowchart showing the language model generating method used in the language model generating apparatus according to Embodiment 1 of the present invention.

FIG. 3 is an explanatory diagram of the tree structure clustering of learning text data according to Embodiment 1 of the present invention.

FIG. 4 is an explanatory diagram of the generation of tree-structure cluster-based language models according to Embodiment 1 of the present invention.

FIG. 5 is a block diagram showing the configuration of a language model generating apparatus according to Embodiment 2 of the present invention.

FIG. 6 is a flowchart showing the language model generating method used in the language model generating apparatus according to Embodiment 2 of the present invention.

FIG. 7 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 3 of the present invention.

FIG. 8 is a flowchart showing the speech recognition method used in the speech recognition apparatus according to Embodiment 3 of the present invention.

FIG. 9 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 4 of the present invention.

FIG. 10 is a flowchart showing the speech recognition method used in the speech recognition apparatus according to Embodiment 4 of the present invention.

FIG. 11 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 5 of the present invention.

FIG. 12 is a flowchart showing the speech recognition method used in the speech recognition apparatus according to Embodiment 5 of the present invention.

FIG. 13 is a block diagram showing the configuration of a conventional language model generating apparatus.

FIG. 14 is a block diagram showing the configuration of a conventional speech recognition apparatus.

[Explanation of symbols]

1001: learning text data
1004: language model generating means
1101: speech to be recognized
1102: speech feature extraction means
1103: acoustic model
1104: language model selecting means
1105: collating means
1106: speech recognition result
2001: learning text data tree structure clustering means
2002: tree-structured learning text data clusters
2002-1: learning text data of tree structure cluster 1
2002-2: learning text data of tree structure cluster 2
2002-M: learning text data of tree structure cluster M
2003: tree-structure cluster-based language models
2003-1: language model of tree structure cluster 1
2003-2: language model of tree structure cluster 2
2003-M: language model of tree structure cluster M
3001: language model interpolating means
3002: interpolated tree-structure cluster-based language models
3002-1: interpolated language model of tree structure cluster 1
3002-2: interpolated language model of tree structure cluster 2
3002-M: interpolated language model of tree structure cluster M
5001: multiple language model selecting means
5002: mixed language model generating means
6001: cluster-based language models of leaf nodes
6001-1: language model of leaf node cluster 1
6001-2: language model of leaf node cluster 2
6001-L: language model of leaf node cluster L


Claims

1. A language model generating apparatus for inputting learning text data and generating a language model for obtaining an occurrence probability of a word string, comprising: learning text data tree structure clustering means for performing tree structure clustering that hierarchically divides the learning text data so as to have linguistically similar properties, thereby generating tree-structured learning text data clusters; and language model generating means for generating tree-structure cluster-based language models using the learning text data belonging to each of the tree-structured learning text data clusters.

2. The language model generating apparatus according to claim 1, further comprising language model interpolating means for performing an interpolation process on each tree-structure cluster-based language model using the tree-structure cluster-based language models located above it in the tree structure, thereby generating interpolated tree-structure cluster-based language models.

3. A speech recognition apparatus for inputting speech to be recognized, performing speech recognition, and outputting a speech recognition result, comprising: speech feature extraction means for inputting the speech to be recognized and extracting a speech feature amount; an acoustic model for obtaining the probability of an acoustic observation value sequence of speech; tree-structure cluster-based language models generated by performing tree structure clustering that hierarchically divides learning text data so as to have linguistically similar properties and by using the learning text data of each tree structure cluster; language model selecting means for selecting, from the tree-structure cluster-based language models, the language model with the highest occurrence probability for the word string of a speech recognition result candidate; and collating means for collating the speech feature amount extracted by the speech feature extraction means using the language model selected by the language model selecting means and the acoustic model, and outputting the speech recognition result.

4. The speech recognition apparatus according to claim 3, wherein the language model selecting means selects a language model from a cluster-based language model of a leaf node at the lowest layer in the tree-structure cluster-based language model.

5. A speech recognition apparatus for inputting speech to be recognized, performing speech recognition, and outputting a speech recognition result, comprising: speech feature extraction means for inputting the speech to be recognized and extracting a speech feature amount; an acoustic model for obtaining the probability of an acoustic observation value sequence of speech; tree-structure cluster-based language models generated by performing tree structure clustering that hierarchically divides learning text data so as to have linguistically similar properties and by using the learning text data of each tree structure cluster; multiple language model selecting means for selecting, from the tree-structure cluster-based language models, a plurality of language models with high occurrence probabilities for the word string of a speech recognition result candidate; mixed language model generating means for taking the plurality of language models selected by the multiple language model selecting means as input and generating a mixed language model; and collating means for collating the speech feature amount extracted by the speech feature extraction means using the language model generated by the mixed language model generating means and the acoustic model, and outputting the speech recognition result.

6. The speech recognition apparatus according to claim 5, wherein the multiple language model selecting means selects the plurality of language models from the cluster-based language models of the lowest-level leaf nodes among the tree-structure cluster-based language models.

7. The speech recognition apparatus according to claim 3 or claim 5, wherein the tree-structure cluster-based language model is an interpolated tree-structure cluster-based language model obtained by performing an interpolation process using the tree-structure cluster-based language models located above it in the tree structure.

8. A language model generating method for inputting learning text data and generating a language model for obtaining an occurrence probability of a word string, comprising: a first step of performing tree structure clustering that hierarchically divides the learning text data so as to have linguistically similar properties, thereby generating tree-structured learning text data clusters; and a second step of generating tree-structure cluster-based language models using the learning text data belonging to each of the tree-structured learning text data clusters.

9. The language model generating method according to claim 8, further comprising a third step of performing an interpolation process on each tree-structure cluster-based language model using the tree-structure cluster-based language models located above it in the tree structure, thereby generating interpolated tree-structure cluster-based language models.

10. A speech recognition method for inputting speech to be recognized, performing speech recognition, and outputting a speech recognition result, comprising: a first step of inputting the speech to be recognized and extracting a speech feature amount; a second step of selecting, from tree-structure cluster-based language models generated by performing tree structure clustering that hierarchically divides learning text data so as to have linguistically similar properties and by using the learning text data of each tree structure cluster, the language model with the highest occurrence probability for the word string of a speech recognition result candidate; and a third step of collating the speech feature amount extracted in the first step using an acoustic model for obtaining the probability of an acoustic observation value sequence of speech and the language model selected in the second step, and outputting the speech recognition result.

11. The speech recognition method according to claim 10, wherein in the second step, a language model is selected from the cluster-based language model of the lowest leaf node in the tree-structure cluster-based language model.

12. A speech recognition method for inputting speech to be recognized, performing speech recognition, and outputting a speech recognition result, comprising: a first step of inputting the speech to be recognized and extracting a speech feature amount; a second step of selecting, from tree-structure cluster-based language models generated by performing tree structure clustering that hierarchically divides learning text data so as to have linguistically similar properties and by using the learning text data of each tree structure cluster, a plurality of language models with high occurrence probabilities for the word string of a speech recognition result candidate; a third step of taking the plurality of language models selected in the second step as input and generating a mixed language model; and a fourth step of collating the speech feature amount extracted in the first step using an acoustic model for obtaining the probability of an acoustic observation value sequence of speech and the language model generated in the third step, and outputting the speech recognition result.

13. The speech recognition method according to claim 12, wherein in the second step, a plurality of language models are selected from the cluster-based language models of the lowest leaf node in the tree-structure cluster-based language model.

14. A computer-readable recording medium on which is recorded a language model generating program for inputting learning text data and generating a language model for obtaining an occurrence probability of a word string, the program causing a computer to realize: a learning text data tree structure clustering procedure for performing tree structure clustering that hierarchically divides the learning text data so as to have linguistically similar properties, thereby generating tree-structured learning text data clusters; and a language model generation procedure for generating tree-structure cluster-based language models using the learning text data belonging to each of the tree-structured learning text data clusters.

15. The computer-readable recording medium on which the language model generating program according to claim 14 is recorded, wherein the program further realizes a language model interpolation procedure for performing an interpolation process on each tree-structure cluster-based language model using the tree-structure cluster-based language models located above it in the tree structure, thereby generating interpolated tree-structure cluster-based language models.

16. A computer-readable recording medium on which is recorded a speech recognition program for inputting speech to be recognized, performing speech recognition, and outputting a speech recognition result, the program causing a computer to realize: a speech feature extraction procedure for inputting the speech to be recognized and extracting a speech feature amount; a language model selection procedure for selecting, from tree-structure cluster-based language models generated by performing tree structure clustering that hierarchically divides learning text data so as to have linguistically similar properties and by using the learning text data of each tree structure cluster, the language model with the highest occurrence probability for the word string of a speech recognition result candidate; and a collation procedure for collating the speech feature amount extracted by the speech feature extraction procedure using an acoustic model for obtaining the probability of an acoustic observation value sequence of speech and the language model selected by the language model selection procedure, and outputting the speech recognition result.

17. The computer-readable recording medium on which the speech recognition program according to claim 16 is recorded, wherein the language model selection procedure selects the language model from the cluster-based language models of the lowest-level leaf nodes among the tree-structure cluster-based language models.

18. A computer-readable recording medium on which is recorded a speech recognition program for inputting speech to be recognized, performing speech recognition, and outputting a speech recognition result, the program causing a computer to realize: a speech feature extraction procedure for inputting the speech to be recognized and extracting a speech feature amount; a multiple language model selection procedure for selecting, from tree-structure cluster-based language models generated by performing tree structure clustering that hierarchically divides learning text data so as to have linguistically similar properties and by using the learning text data of each tree structure cluster, a plurality of language models with high occurrence probabilities for the word string of a speech recognition result candidate; a mixed language model generation procedure for taking the plurality of language models selected by the multiple language model selection procedure as input and generating a mixed language model; and a collation procedure for collating the speech feature amount extracted by the speech feature extraction procedure using an acoustic model for obtaining the probability of an acoustic observation value sequence of speech and the language model generated by the mixed language model generation procedure, and outputting the speech recognition result.

19. The computer-readable recording medium on which the speech recognition program according to claim 18 is recorded, wherein the multiple language model selection procedure selects the plurality of language models from the cluster-based language models of the lowest-level leaf nodes among the tree-structure cluster-based language models.