JP2001034287A

JP2001034287A - Class determining method for language model, voice recognition device, and program recording medium

Info

Publication number: JP2001034287A
Application number: JP11202307A
Authority: JP
Inventors: Takahiro Kudo; 貴弘工藤; Yumi Wakita; 由実脇田
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1999-07-15
Filing date: 1999-07-15
Publication date: 2001-02-09

Abstract

PROBLEM TO BE SOLVED: To generate classes that are free of redundancy yet retain generality. Each class is associated with a node of thesaurus information, in which words are classified hierarchically in a tree structure, and the hierarchy depth suitable for classing is not necessarily the same for every node. SOLUTION: In the initial state, each word is classified into the classes at the shallowest level of the thesaurus, i.e., the classes with the highest degree of abstraction (1, 2). At this point the number of classes is at its minimum. One of these classes is then selected (3), its thesaurus level is made one level deeper to increase the number of classes, and the entropy is calculated (4). The selected class is restored, and the same trial is carried out for every class (5). The class whose subdivision reduces the entropy the most is then determined and subdivided (6). The classes determined in this way generally correspond to nodes at different depths of the thesaurus hierarchy.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field] The present invention relates to a method for determining classes in a language model used in language analysis, language understanding, speech recognition, and the like. It also relates to a speech recognition device that performs speech recognition using a language model constructed from the determined classes.

[0002]

[Prior Art] The prior art is described below using the example of how a class N-gram model is constructed. The class N-gram model is one of the statistical language models used for speech recognition.

[0003] Some conventional methods of constructing a class N-gram model create classes from attributes such as part-of-speech information and word occurrence frequency (e.g., Masataki, Matsunaga, Sagisaka: IEICE Technical Report SP95-73).

[0004] In this method, words are first classified into part-of-speech classes. Words are then separated from their class one at a time, starting with the word that has the highest adjacent co-occurrence frequency, so that each separated word forms a class of its own and is handled independently. The aim is to improve prediction performance. For example, in FIG. 6, suppose that "apple", "cheek", and "blood" appear as nouns following the word "red" with the adjacent co-occurrence frequencies shown in the figure. Suppose further that "apple" follows "red" with a probability of 80% and is therefore the most strongly connected. In that case "apple" is separated from noun class 7 and handled as a single word, so that it forms a class by itself. A class N-gram model can then be created by recording, for each class pair, how often the two classes created in this way co-occur adjacently.
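The conventional splitting procedure of [0004] can be sketched as follows. This is a minimal illustration under stated assumptions, not the cited authors' implementation. The function name `split_by_cooccurrence`, the dictionary layout of the counts, and the rule of ranking each word by its highest probability of following any left-context word are all hypothetical. The toy counts were chosen only so that "apple" follows "red" 80% of the time.

```python
from collections import Counter


def split_by_cooccurrence(pos_class, bigram_counts, n_splits):
    """Peel up to n_splits words off a part-of-speech class, highest adjacent
    co-occurrence probability first; each peeled word becomes a one-word class.

    pos_class:     set of words in the part-of-speech class (e.g. nouns)
    bigram_counts: {(left_word, right_word): count} adjacent co-occurrence counts
    """
    # Total count of each left-context word, to turn counts into P(right | left).
    left_totals = Counter()
    for (left, _right), count in bigram_counts.items():
        left_totals[left] += count

    # For each class member, its highest probability of following any left word
    # (assumed ranking criterion).
    best_prob = {}
    for (left, right), count in bigram_counts.items():
        if right in pos_class:
            prob = count / left_totals[left]
            best_prob[right] = max(best_prob.get(right, 0.0), prob)

    ranked = sorted(best_prob, key=best_prob.get, reverse=True)
    singletons = [{word} for word in ranked[:n_splits]]
    remainder = set(pos_class) - set(ranked[:n_splits])
    return singletons, remainder


# Hypothetical counts: "apple" follows "red" 8 times out of 10 (80%).
nouns = {"apple", "cheek", "blood"}
counts = {("red", "apple"): 8, ("red", "cheek"): 1, ("red", "blood"): 1}
singletons, remainder = split_by_cooccurrence(nouns, counts, n_splits=1)
# singletons == [{"apple"}]; remainder == {"cheek", "blood"}
```

Repeating the split until a fixed number of classes is reached is what produces the many one-word classes criticized in [0009] and [0010].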

[0005] Other methods use thesaurus information to build classes that are more finely subdivided than classes based on part-of-speech information alone (Ando et al., Proceedings of the Acoustical Society of Japan, Autumn 1997). A thesaurus classifies words systematically in a tree structure. At any level of its hierarchy, words in the same class can be regarded as similar in usage and meaning.

[0006] In this method, every branch of the thesaurus is treated as having the same hierarchy depth, that is, the same level of subdivision. The depth of the thesaurus hierarchy used for class determination is fixed; for example, classes may be set to correspond to the categories at the third level of the thesaurus.

[0007] The method also attempts to determine the class of a word not seen in training from that word's thesaurus information.

[0008] A class N-gram model can be created by recording, for each pair of classes built from the thesaurus information, how often the two classes co-occur adjacently.

[0009]

[Problems to Be Solved by the Invention] Consider methods that create classes from attributes such as part-of-speech information and word occurrence frequency. If such a method generates many classes that each contain only one word, the generality of connections with adjacent words is lost, however high the adjacent co-occurrence frequency of those words may be. This loss is most pronounced for content words (nouns, verbs, adjectives, adjectival verbs, adverbs, adnominals, interjections, conjunctions, and so on), and speech recognition performance does not improve in proportion to the increase in the number of classes.

[0010] Here, losing the generality of connections with adjacent words means the following. For a word with an extremely high adjacent co-occurrence frequency, making it a class of its own improves prediction performance. However, when classes are separated until a predetermined number of classes is reached, a word that is not sufficiently frequent (call it word A) often also ends up forming a class by itself. For a word or class adjacent to word A (call it D), it may be more efficient to place word A in one class together with words B, C, and so on, which connect in the same way as A. Making word A a class of its own splits off the specialized connection between D and word A. It also eliminates the general adjacency relation between D and the group of words A, B, and C, which have similar connection relations.

[0011] The many words that are not separated remain in the very broad framework of parts of speech. The part-of-speech framework was not originally designed with word connectivity in mind. It is therefore highly redundant when used for speech recognition, which must predict the next word. The classes must be subdivided further to eliminate this redundancy.

[0012] Here, "redundant" means the following. For example, the nouns that connect to the adjective "sturdy" include "arm" and "man", but the noun "cup" would not be expected to connect to it. With part-of-speech classes, these three nouns fall into the same class even though "cup" connects differently from "arm" and "man". Classing by part of speech is therefore redundant. Nouns can of course be subdivided into common nouns, numerals, verbal (sahen) nouns, adjectival nouns, and so on. However, the difference between "arm" and "cup" cannot be captured merely by subdividing parts of speech. In short, "redundant" means that words that do not connect to a given adjacent word are nevertheless assigned to the same class as words that do.

[0013] Methods that simply use thesaurus information treat the hierarchy depth of every branch as uniform. However, when words are actually used in spoken or written language, there is no guarantee that the appropriate depth is the same for all branches. Thesaurus systems also differ depending on how the classification is made. An appropriate hierarchy depth must therefore be determined. Otherwise some classes are subdivided more than necessary and lack sufficient generality, while others are not subdivided enough and remain redundant.

[0014] As described above, conventional methods cannot eliminate redundancy from the classes. Alternatively, because one-word classes are overly specialized, they can produce only classes that lack generality and versatility.

[0015] A method is therefore needed that adds new classing information to remove redundancy and that generates optimal classes that retain generality.

[0016] In view of the above problems, an object of the present invention is to provide a class determination method for a language model that generates classes free of redundancy yet general. A further object is to provide a speech recognition device.

[0017]

[Means for Solving the Problems] To solve the above problems, a first aspect of the present invention (corresponding to claim 1) is a class determination method for a language model. Each class is determined in association with one of the nodes of thesaurus information in which words are classified hierarchically in a tree structure. The hierarchy depths of those nodes suitable for classing are not necessarily all the same.

[0018] A second aspect of the present invention (corresponding to claim 2) is the class determination method for a language model according to the first aspect, wherein each class is determined using an entropy value as the evaluation measure.

[0019] A third aspect of the present invention (corresponding to claim 3) is the class determination method for a language model according to the second aspect. The determination is made by trace-expanding the thesaurus information toward its lower levels, stopping the trace expansion based on a predetermined criterion, and associating each node that is unbranched at that point with a class. In each trace expansion step, among all unbranched nodes of the nodes trace-expanded so far, only the node whose next branching gives the smallest entropy is expanded.

[0020] A fourth aspect of the present invention (corresponding to claim 4) is the class determination method for a language model according to the third aspect, wherein the predetermined criterion is that the number of unbranched nodes reaches a predetermined number.

[0021] A fifth aspect of the present invention (corresponding to claim 5) is a speech recognition device comprising the following means:
- language model storage means for storing a class N-gram model created from classes determined by the class determination method for a language model according to any one of the first to fourth aspects;
- speech input means for inputting speech;
- speech recognition means for recognizing the input speech using the language model; and
- output means for outputting the result of the speech recognition as text.

[0022]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

[0023] (First Embodiment) A first embodiment is described first.

[0024] Thesaurus information is described first. FIG. 5 shows an example of thesaurus information created by the National Language Research Institute. The entries shown belong to the category "1.3 Human activity: mind and conduct". In the example of FIG. 5, each word has a five-digit code that assigns it to a category. For simplicity, consider only three words: "1.3823 Architecture", "1.3830 Transportation, traffic, and communication", and "1.3831 Medical care". Their first four code digits are "1.382" for "Architecture" and "1.383" for both "Transportation, traffic, and communication" and "Medical care", so the words fall into two categories. Taking one more digit, i.e., all five digits, divides them into three categories. In this way, the codes classify the words of the thesaurus into a tree structure.
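The prefix structure can be illustrated with a short sketch that groups words by the first n digits of their codes. Each distinct prefix then corresponds to one thesaurus node at that depth. The function name `group_by_code_prefix` is hypothetical, and the three codes are those of the FIG. 5 example above.

```python
from collections import defaultdict


def group_by_code_prefix(word_codes, n_digits):
    """Group words by the first n_digits digits of their thesaurus code
    (the dot is ignored); each distinct prefix is one node at that depth."""
    groups = defaultdict(list)
    for word, code in word_codes.items():
        digits = code.replace(".", "")
        groups[digits[:n_digits]].append(word)
    return dict(groups)


codes = {
    "Architecture": "1.3823",
    "Transportation, traffic, and communication": "1.3830",
    "Medical care": "1.3831",
}
four = group_by_code_prefix(codes, 4)
five = group_by_code_prefix(codes, 5)
# four == {"1382": ["Architecture"],
#          "1383": ["Transportation, traffic, and communication", "Medical care"]}
# five has three groups, one word each
```

The four-digit grouping yields the two categories described above, and the five-digit grouping yields three.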

[0025] In this embodiment, each class corresponds to the words belonging to a category of such thesaurus information. In the example above, the codes up to the first four digits divide the words into two categories. Correspondingly, two classes are obtained: one consisting of "Architecture", and one consisting of "Transportation, traffic, and communication" and "Medical care". The codes up to the first five digits divide the words into three categories, so three classes are obtained. In this embodiment, classes are thus created to correspond to nodes of the thesaurus information.

[0026] FIG. 1 is a block diagram of the class determination method for a language model according to claims 1 to 4 of the present invention. In this embodiment, each class is determined to correspond to a node of the thesaurus information.

[0027] First, as the initial state, each word is classified using the thesaurus information into the classes at the shallowest level of the hierarchy, that is, the classes with the highest degree of abstraction (1, 2). At this point the number of classes is at its minimum. Next, in step 3, one of these classes is selected and its thesaurus hierarchy is made one level deeper to increase the number of classes. In step 4, the entropy is calculated. The entropy is H as given in Equation 1 below.

[0028]

[Equation 1]
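Equation 1 is not reproduced in this text, so its exact form is not known here. As an assumption only, a common choice for such a measure is the per-word cross-entropy of a bigram model over a text, H = -(1/N) Σ log2 P(w_i | w_{i-1}). The sketch below computes that quantity for any supplied probability function and should not be read as the patent's Equation 1. The name `cross_entropy_bits` is hypothetical.

```python
import math


def cross_entropy_bits(corpus, bigram_prob):
    """Per-word cross-entropy in bits:
    H = -(1/N) * sum of log2 P(w_i | w_{i-1}) over all adjacent word pairs.

    corpus:      list of sentences, each a list of words
    bigram_prob: function (previous_word, word) -> probability in (0, 1]
    """
    total_log_prob = 0.0
    n_predictions = 0
    for sentence in corpus:
        for prev, word in zip(sentence, sentence[1:]):
            total_log_prob += math.log2(bigram_prob(prev, word))
            n_predictions += 1
    return -total_log_prob / n_predictions


# A model that always splits its probability evenly between two candidates
# has an entropy of exactly 1 bit per word.
h = cross_entropy_bits([["red", "apple", "pie"]], lambda prev, word: 0.5)
```

In a class determination setting, `bigram_prob` would be a class N-gram model rebuilt for each candidate set of classes, so that the candidates can be compared by their H values.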

[0029] The selected class is then restored to its original state, and the same trial is performed for every class (5). In step 6, the class whose subdivision reduces the entropy the most is determined and subdivided.

[0030] For example, in FIG. 2, suppose the classes have been subdivided up to the state shown by the hatched circles. Suppose also that, among these classes, expanding class 113 reduces the entropy the most. Class 113 is then made one level deeper and subdivided, as shown in FIG. 2. The number after each class in FIG. 2 (for example, 113 in class 113) is part of the code assigned to each word of the thesaurus information.

[0031] This state becomes the initial state of the next step, and the same trial is repeated until the target number of classes is reached.
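The loop of steps 3 to 6, repeated until the target number of classes is reached, can be sketched as a greedy top-down search over the thesaurus tree. This is a sketch under stated assumptions, not the patent's implementation. The tree is given as a hypothetical `children` mapping from a node code to its child codes. Because Equation 1 is not reproduced, the evaluation function is passed in as `evaluate`. Every trial starts from the same current state, so choosing the trial with the lowest resulting entropy is equivalent to choosing the largest entropy decrease.

```python
def greedy_thesaurus_classes(roots, children, evaluate, target_classes):
    """Greedy top-down class determination over a thesaurus tree.

    roots:          shallowest thesaurus nodes (initial, most abstract classes)
    children:       {node: [child nodes]}; nodes missing from it are leaves
    evaluate:       entropy of a model built on a list of class nodes (lower is better)
    target_classes: stop once at least this many classes exist
    """
    frontier = list(roots)  # the current classes, i.e. the unbranched nodes
    while len(frontier) < target_classes:
        best = None  # (entropy after expansion, index of the node to expand)
        for i, node in enumerate(frontier):
            kids = children.get(node, [])
            if not kids:
                continue  # a leaf cannot be subdivided any further
            # Trial: replace this node by its children (one level deeper).
            trial = frontier[:i] + kids + frontier[i + 1:]
            h = evaluate(trial)
            if best is None or h < best[0]:
                best = (h, i)
        if best is None:
            break  # every class is already a leaf
        i = best[1]
        # Commit the best expansion; it may add several classes at once,
        # so the final count can exceed target_classes.
        frontier = frontier[:i] + children[frontier[i]] + frontier[i + 1:]
    return frontier


# Toy tree and a toy stand-in for Equation 1, for illustration only.
tree = {"1": ["11", "12"], "2": ["21", "22", "23"], "11": ["111", "112"]}


def toy_entropy(nodes):
    return 10.0 - sum(len(n) for n in nodes if n.startswith("1"))


classes = greedy_thesaurus_classes(["1", "2"], tree, toy_entropy, target_classes=4)
# classes == ["111", "112", "12", "2"]: nodes at three different depths
```

The result shows the property stated in [0032]: the final classes sit at different depths of the tree rather than at one uniform level.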

[0032] Unlike in the prior art, the classes determined in this way generally correspond to nodes at different depths of the thesaurus hierarchy.

[0033] To evaluate the performance of the method of the present invention, the target words were restricted to nouns, and the method was compared experimentally with the following two conventional methods:
(I) Starting from an initial state in which all nouns form a single class, nouns are separated from the noun class one at a time, in descending order of adjacent co-occurrence probability, each forming an independent class.
(II) Starting from the shallowest level of the noun thesaurus hierarchy as the initial state, the semantic hierarchy of the thesaurus is deepened uniformly, one level at a time.

[0034] FIG. 3 shows the relationship between the number of classes and perplexity for each method. Perplexity measures how easily the next word can be predicted and is calculated from entropy. In language model evaluation, a model that achieves a low perplexity with a small number of classes is generally considered to perform well.
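The relationship between entropy and perplexity mentioned above is the standard one. With H measured in bits per word, perplexity is PP = 2^H. It can be read as the average number of equally likely candidates for the next word. The following is a minimal sketch; the function name is hypothetical.

```python
import math


def perplexity(word_probs):
    """Perplexity from the probabilities a model assigned to the N words of a
    test text: PP = 2 ** H, where H = -(1/N) * sum(log2 p) in bits per word."""
    h_bits = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2.0 ** h_bits


# A model that always hesitates between 8 equally likely next words:
pp_uniform = perplexity([1 / 8] * 4)   # H = 3 bits -> 8.0
pp_mixed = perplexity([0.5, 0.125])    # H = (1 + 3) / 2 = 2 bits -> 4.0
```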

[0035] In general, when there are many classes (the extreme case being a word N-gram model), not every word reaches a sufficiently high occurrence frequency, and a large corpus is needed to obtain reliable statistics. When several words are grouped into one class and the number of classes is reduced, the number of parameters decreases. The occurrence frequency of a class is the sum of the frequencies of the words it contains, so statistically reliable values are easier to obtain. A model that shows a low perplexity with a small number of classes can therefore be said to perform well.

[0036] FIG. 3 shows that the method of the present invention reduces perplexity most efficiently, particularly when the number of classes is small, and therefore performs better than the conventional methods.

[0037] These results indicate two effects. Adding thesaurus information to the classing eliminates redundancy in the classes. Not making the depth of the thesaurus hierarchy uniform makes it possible to generate classes that retain generality.

[0038] The meaning of FIG. 3 is explained below with a concrete example. Modifying the example of FIG. 6, suppose that the members of the noun class following the adjective "red" occur as follows: "cheek" 100 times, "apple" 20 times, "tomato" 15 times, and "blood" 10 times.

[0039] If words with high adjacent frequency are treated as single-word classes, "cheek", which occurs extremely often, forms a class by itself, and performance improves. However, as class separation is repeated until a predetermined number of classes is reached, words such as "apple" and "tomato", whose counts are not exceptionally high, also come to be treated as classes of their own. Because "apple" and "tomato" occur only a few times, their probabilities cannot be treated as statistically reliable. Moreover, apart from "red", the words that connect to each of these words are likely to differ from word to word.
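The counts of [0038] make the effect of pooling concrete. Grouping the three less frequent nouns, as in the class discussed in [0041], gives that class 20 + 15 + 10 = 45 observations after "red", compared with 20, 15, and 10 for the individual words. The sketch only sums counts per class. The class labels `C1` and `C2` and the grouping itself are hypothetical.

```python
from collections import Counter

# Counts of nouns following the adjective "red", from the example in [0038].
counts_after_red = {"cheek": 100, "apple": 20, "tomato": 15, "blood": 10}


def pool_counts(word_counts, word_to_class):
    """The frequency of a class is the sum of the frequencies of its words."""
    pooled = Counter()
    for word, count in word_counts.items():
        pooled[word_to_class[word]] += count
    return pooled


# Hypothetical grouping: "cheek" alone, the three rarer nouns together.
grouping = {"cheek": "C1", "apple": "C2", "tomato": "C2", "blood": "C2"}
pooled = pool_counts(counts_after_red, grouping)  # C1 -> 100, C2 -> 45
```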

[0040] Performance is therefore expected to improve if several such infrequent words are grouped together into one class. Classing within such a broad framework captures relations such as "class A and class B generally connect easily". Classes that retain this kind of generality can be generated by not making the depth of the thesaurus hierarchy uniform.

[0041] Now consider a class consisting of "apple", "tomato", and "blood". Even if these words connect to "red" equally easily, they do not necessarily connect in the same way to other words (for example, "delicious"). It is better to place "apple" and "tomato" in one class and "blood" in a separate class. A class consisting of all three words is therefore considered redundant. The part-of-speech framework, however, cannot distinguish among these words.

[0042] Using thesaurus information when determining classes makes this distinction possible and eliminates class redundancy.

[0043] (Second Embodiment) Next, a second embodiment is described.

[0044] This embodiment describes a speech recognition device that performs speech recognition with a language model built from the classes created in the first embodiment.

[0045] FIG. 4 is a block diagram of a speech recognition device according to claim 5 of the present invention.

[0046] The speech recognition device comprises a microphone 41, speech recognition means 42, an acoustic model 43, language model storage means 44, and output means 45.

[0047] The microphone 41 is the means for inputting speech. The speech recognition means 42 converts the input speech into a continuous word string using the acoustic model 43 and the language model stored in the language model storage means 44. The output means 45 converts the linguistic symbols into text and outputs the text.

[0048] The microphone 41 of this embodiment is an example of the speech input means of the present invention. The language model storage means 44 of this embodiment is an example of the language model storage means of the present invention.

[0049] Next, the operation of this embodiment is described.

[0050] First, a class N-gram model is created from the classes determined by the class determination method described in the first embodiment. Each class pair is recorded together with the frequency with which the pair co-occurs adjacently. The following are also recorded:
- the class unigram values, which give the total number of occurrences of each class;
- the word information for each class, with the words in the class represented by IDs or the like; and
- the number of occurrences of each word.
The class N-gram model created in this way is stored in advance in the language model storage means 44.
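One possible in-memory layout for the stored contents is sketched below, together with the usual class bigram factorization P(w | v) = P(c(w) | c(v)) · P(w | c(w)). Both the layout and the factorization are assumptions: the document lists what is stored but not how it is combined. The class and method names are hypothetical. The `left_counts` field, which counts how often each class occurs as a left context, is an added assumption used for normalization. No smoothing is applied.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class ClassBigramModel:
    pair_counts: Counter = field(default_factory=Counter)    # (class, next class) -> count
    class_unigram: Counter = field(default_factory=Counter)  # class -> total occurrences
    class_members: dict = field(default_factory=lambda: defaultdict(set))  # class -> word IDs
    word_counts: Counter = field(default_factory=Counter)    # word -> occurrences
    left_counts: Counter = field(default_factory=Counter)    # class -> count as left context (assumed)

    @classmethod
    def build(cls, corpus, word_to_class, word_ids):
        model = cls()
        for sentence in corpus:
            classes = [word_to_class[w] for w in sentence]
            model.word_counts.update(sentence)
            model.class_unigram.update(classes)
            model.left_counts.update(classes[:-1])
            model.pair_counts.update(zip(classes, classes[1:]))
            for word, cls_name in zip(sentence, classes):
                model.class_members[cls_name].add(word_ids[word])
        return model

    def prob(self, prev_word, word, word_to_class):
        """Unsmoothed P(word | prev_word) = P(c | c_prev) * P(word | c)."""
        c_prev, c = word_to_class[prev_word], word_to_class[word]
        p_class = self.pair_counts[(c_prev, c)] / self.left_counts[c_prev]
        p_word = self.word_counts[word] / self.class_unigram[c]
        return p_class * p_word


# Toy corpus and hypothetical class labels.
corpus = [["red", "apple"], ["red", "tomato"], ["red", "cheek"], ["red", "apple"]]
word_to_class = {"red": "ADJ", "apple": "FOOD", "tomato": "FOOD", "cheek": "BODY"}
word_ids = {w: i for i, w in enumerate(sorted(word_to_class))}
model = ClassBigramModel.build(corpus, word_to_class, word_ids)
p = model.prob("red", "apple", word_to_class)  # 3/4 * 2/3 = 0.5
```

A real system would add smoothing so that unseen class pairs do not receive zero probability.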

[0051] Next, the operation during speech recognition is described. The uttered speech is input through the microphone 41 to the speech recognition means 42. Using the language model stored in the language model storage means 44, the speech recognition means 42 predicts recognition word candidates sequentially in time order. The recognition score is the sum of two parts: an acoustic score based on the distance between the input speech and the previously trained acoustic model 43, and a language score given by the class N-gram. An N-best search then determines the continuous word strings that are recognition candidates. The continuous word string determined in this way is passed to the output means 45 and output as text.
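The scoring rule above can be sketched as follows. This is a deliberately simplified illustration: instead of performing the time-synchronous N-best search, it rescores a precomputed list of candidate word strings. It assumes both scores are log values (higher is better), so the recognition score is their plain sum as described. The function name, the acoustic-score dictionary, and the toy probabilities are hypothetical.

```python
import heapq
import math


def n_best(candidates, acoustic_score, lm_prob, n):
    """Return the n candidates with the highest recognition score, where
    recognition score = acoustic score + language score (both log values).

    candidates:     list of word sequences (lists of words)
    acoustic_score: {tuple(words): log acoustic score}
    lm_prob:        function (previous_word, word) -> class N-gram probability
    """
    def recognition_score(words):
        language_score = sum(math.log(lm_prob(prev, word))
                             for prev, word in zip(words, words[1:]))
        return acoustic_score[tuple(words)] + language_score

    return heapq.nlargest(n, candidates, key=recognition_score)


# Toy values: "ample" matches the audio slightly better,
# but the language model strongly prefers "apple" after "red".
candidates = [["red", "apple"], ["red", "ample"]]
acoustic = {("red", "apple"): -10.0, ("red", "ample"): -9.5}
lm = {("red", "apple"): 0.5, ("red", "ample"): 0.01}
best = n_best(candidates, acoustic, lambda p, w: lm[(p, w)], n=1)  # [["red", "apple"]]
```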

[0052] Speech recognition performed with a language model created from classes determined by the class determination method of the first embodiment therefore achieves high performance.

[0053] All or part of the functions of each step or each means constituting the class determination method for a language model or the speech recognition device of the present invention may be implemented in hardware, or in software by a computer program.

[0054] A program recording medium storing a program that causes a computer to execute all or part of the functions of each step or each means constituting the class determination method for a language model or the speech recognition device of the present invention also belongs to the present invention.

[0055]

[Effects of the Invention] As is clear from the above description, the invention according to claims 1 to 4 remedies two problems of the conventional methods: redundancy in the classes and lack of generality. It thus provides a class determination method for a language model with which a high-performance language model can be constructed.

[0056] The invention according to claim 5 provides a speech recognition device capable of high-performance speech recognition.

[Brief description of the drawings]

FIG. 1 is a block diagram of the class determination method for a language model according to the first embodiment of the present invention.

FIG. 2 is a diagram showing the subdivision of classes in the first embodiment of the present invention.

FIG. 3 is a diagram comparing the performance of the class determination method for a language model of the present invention with that of conventional class determination methods.

FIG. 4 is a block diagram showing the configuration of a speech recognition device according to the second embodiment of the present invention.

FIG. 5 is a diagram showing an example of thesaurus information.

FIG. 6 is a diagram showing an example in which classes are separated by a conventional class determination method for a language model.

[Explanation of symbols]

1 Word classing step
2 Initial state setting step
3 Subdivision class candidate selection step
4 Entropy calculation step
5 Number of classes representing the number of loops
6 Subdivision class determination step
7 Noun class and words included in the noun class
41 Microphone
42 Speech recognition means
43 Acoustic model
44 Language model storage means
45 Output means

Claims


1. A class determination method for a language model, characterized in that each class is determined in association with one of the nodes of thesaurus information in which words are classified hierarchically in a tree structure, the hierarchy depths of those nodes suitable for classing not necessarily all being the same.

2. The class determination method for a language model according to claim 1, wherein each class is determined using an entropy value as the evaluation measure.

3. The class determination method for a language model according to claim 2, wherein the determination is made by trace-expanding the thesaurus information toward its lower levels, stopping the trace expansion based on a predetermined criterion, and associating each node that is unbranched at that point with a class, and wherein the trace expansion expands, among all unbranched nodes of the nodes trace-expanded so far, only the node whose next branching gives the smallest entropy.

4. The class determination method for a language model according to claim 3, wherein the predetermined criterion is that the number of unbranched nodes reaches a predetermined number.

5. A speech recognition device comprising: language model storage means for storing a class N-gram model created from classes determined by the class determination method for a language model according to any one of claims 1 to 4; speech input means for inputting speech; speech recognition means for performing speech recognition of the input speech using the language model; and output means for outputting the result of the speech recognition as text.

6. A program recording medium storing a program for causing a computer to execute all or part of the functions of each step or each means constituting the class determination method for a language model or the speech recognition device according to any one of claims 1 to 5.