JPH11110385A

JPH11110385A - Device and method for processing language

Info

Publication number: JPH11110385A
Application number: JP9268592A
Authority: JP
Inventors: Akio Ando; 彰男安藤
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1997-10-01
Filing date: 1997-10-01
Publication date: 1999-04-23

Abstract

PROBLEM TO BE SOLVED: To obtain classes whose member words are more closely related, and to improve the accuracy of chain probabilities, by classifying words semantically according to a thesaurus. SOLUTION: A processing block 3 divides each sentence in a set of input sentences into three parts and generates meaningless sentences by exchanging these parts. A morphological analysis block 2 divides the generated sets of meaningless sentences into words. A storage block 4 stores the thesaurus. A word classification block 6 converts each word to a class code on the basis of the thesaurus. A storage block 8 stores a table of inter-class trigram probabilities. A processing block 10 calculates the perplexity of every generated set of meaningless sentences on the basis of the class trigram and selects the set with the highest perplexity. In this way, the thesaurus classification is stored and words are classified according to it.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a language processing device and method for selecting or generating an optimal word string in natural language processing, and more particularly to a language processing device and method that use statistical information between word classes set on the basis of a thesaurus.

[0002]

[Prior Art] Conventionally, word n-gram models, which use statistical chain information on n words, have been used in language processing devices. Class n-gram models, which create word classes by classifying words by part of speech and use statistical chain information between those word classes (that is, between parts of speech), have also been used (for both, see, for example, Kita et al., "Spoken Language Processing", Morikita Publishing). With the word n-gram model, as the number of words handled grows and n becomes large, the chain probabilities between words can no longer be estimated adequately; this is the so-called sparseness problem. The part-of-speech class n-gram model, on the other hand, has the drawback that only coarse language processing can be performed.

[0003]

[Problem to be Solved by the Invention] In psychoacoustic experiments and the like, tests are conducted to determine whether a subject can hear speech correctly. In such tests there is a need to remove the influence of semantic inference so that only the inherent auditory characteristics are examined. One way to meet this need is to create sentences that are grammatically correct but make no sense (for example, Kazuo Watanabe, "Effect of speech-rate delay by speech-rate conversion on elderly hearing-impaired listeners", Journal of the Oto-Rhino-Laryngological Society of Japan, Vol. 99, pp. 445-453 (1996)). However, no good method of creating meaningless test sentences had been found, and it was practically impossible to remove the author's arbitrary choices from the resulting test sentences.

[0004] In speech recognition, meanwhile, word n-grams are used to improve recognition performance, both to select a linguistically plausible word string from the recognition results at the acoustic level and to reduce, using the recognition results obtained so far, the number of candidates that need to be considered for subsequent results. In large-vocabulary continuous speech recognition in particular, the use of word n-grams is considered indispensable. Word n-grams are also used in natural language processing other than speech recognition to select a plausible word string from among multiple candidates. In the word n-gram model, however, taking n = 3 (a three-word chain model, that is, a trigram model) as an example, a vocabulary of 20,000 words requires considering 20,000 cubed (8,000,000,000,000) combinations. Processing therefore requires an enormous amount of memory, and the statistics cannot be estimated from the available language databases (this is called the sparseness problem). The problem becomes even more pronounced when n is greater than 3. In speech recognition and similar tasks, it has had a large effect on the processing results.

[0005] Accordingly, an object of the present invention is to provide a language processing device and method that achieve more accurate chain probabilities between words than classification by part of speech does, and that relax the memory constraints of the language processing device.

[0006]

[Means for Solving the Problem] To achieve this object, the invention of claim 1 is a language processing device that classifies a plurality of words into a plurality of classes, each class being a set of words, obtains chain probabilities indicating the degree of chaining between the classes, and selects a class and/or word string on the basis of the obtained chain probabilities, the device comprising storage means for storing a thesaurus classification and classification means for classifying the plurality of words in accordance with the thesaurus classification.

[0007] The invention of claim 2 is the language processing device of claim 1, wherein the classification means converts each word to a classification code when classifying the plurality of words, and the chain probability is calculated with the word string in the form of a classification code string.

[0008] The invention of claim 3 is the language processing device of claim 2, wherein, for a function word attached to the word, the classification means uses the function word itself as the classification code.

[0009] The invention of claim 4 is the language processing device of claim 2, wherein, when the word is an independent word not listed in the thesaurus classification, the classification means classifies it by part of speech.

[0010] The invention of claim 5 is the language processing device of claim 1, further comprising selection means for selecting the word string with the smallest chain probability as a meaningless sentence for use in a hearing experiment.

[0011] The invention of claim 6 is the language processing device of claim 1, further comprising selection means for selecting the word string with the largest chain probability as the best candidate among speech recognition candidates.

[0012] The invention of claim 7 is the language processing device of claim 1, wherein, in connection with calculating the chain probability between words, the occurrence probability of each word within its class and the chain probabilities between classes are obtained in advance, and, when a chain probability between words cannot be obtained directly, it is estimated from the word occurrence probability and the inter-class chain probability.

[0013] The invention of claim 8 is a language processing method that classifies a plurality of words into a plurality of classes, each class being a set of words, obtains chain probabilities indicating the degree of chaining between the classes, and selects a class and/or word string on the basis of the obtained chain probabilities, wherein a thesaurus classification is stored and the plurality of words are classified in accordance with the thesaurus classification.

[0014] The invention of claim 9 is the language processing method of claim 8, wherein each word is converted to a classification code when the plurality of words are classified, and the chain probability is calculated with the word string in the form of a classification code string.

[0015] The invention of claim 10 is the language processing method of claim 9, wherein, for a function word attached to the word, the function word itself is used as the classification code.

[0016] The invention of claim 11 is the language processing method of claim 9, wherein, when the word is an independent word not listed in the thesaurus classification, it is classified by part of speech.

[0017] The invention of claim 12 is the language processing method of claim 8, wherein the word string with the smallest chain probability is selected as a meaningless sentence for use in a hearing experiment.

[0018] The invention of claim 13 is the language processing method of claim 8, wherein the word string with the largest chain probability is selected as the best candidate among speech recognition candidates.

[0019] The invention of claim 14 is the language processing method of claim 8, wherein, in connection with calculating the chain probability between words, the occurrence probability of each word within its class and the chain probabilities between classes are obtained in advance, and, when a chain probability between words cannot be obtained directly, it is estimated from the word occurrence probability and the inter-class chain probability.

[0020]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings.

[0021] First, the language processing method to which the present invention is applied is described. This embodiment uses a class n-gram model based on a thesaurus. A word n-gram model approximates the generation of a word string by an (n-1)-th order Markov process. Accordingly, given a word string $w_1 w_2 \cdots w_{m-1}$, the probability that the word $w_m$ occurs is approximated by

[0022]

[Equation 1]
$$P(w_m \mid w_1 w_2 \cdots w_{m-1}) \approx P(w_m \mid w_{m-n+1} \cdots w_{m-1})$$

[0023] (where $m \ge n$). Assume now that each word belongs to exactly one class, and let $c_i$ denote the class to which word $w_i$ belongs. In the class n-gram model, the probability that the word $w_m$ occurs, given the word string $w_1 w_2 \cdots w_{m-1}$, is approximated by:

[0024]

[Equation 2]
$$P(w_m \mid w_1 w_2 \cdots w_{m-1}) \approx P(w_m \mid c_m)\, P(c_m \mid c_{m-n+1} \cdots c_{m-1})$$

[0025] For generating test sentences for psychoacoustic experiments, the following simplified form of Equation 2 is also used:

[0026]

[Equation 3]
$$P(c_m \mid c_{m-2}\, c_{m-1})$$

[0027] In the class n-gram model, words are classified into classes, and the model uses the probability indicating the degree of chaining between classes (the chain probability) together with the occurrence probability of each word within its class. This greatly reduces the number of parameters compared with the word n-gram model. For example, if the vocabulary size (number of words) is V and the number of classes is C, the number of independent parameters of the word n-gram model is $V^n - 1$, whereas that of the class n-gram model is $C^n - 1 + V - C$. Table 1 shows the number of independent parameters for V = 20,000 and C = 1,000.

[0028]

[Table 1]
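The two parameter-count formulas in paragraph [0027] can be evaluated directly. The following is a minimal sketch; the function names are illustrative, and the values of n used here (2 and 3) are chosen for illustration only and are not taken from Table 1, whose contents are not reproduced in this text.

```python
def word_ngram_params(V: int, n: int) -> int:
    """Independent parameters of a word n-gram model: V^n - 1."""
    return V**n - 1


def class_ngram_params(V: int, C: int, n: int) -> int:
    """Independent parameters of a class n-gram model: C^n - 1 + V - C."""
    return C**n - 1 + V - C


if __name__ == "__main__":
    V, C = 20_000, 1_000  # vocabulary size and class count from [0027]
    for n in (2, 3):      # illustrative choices of n
        print(n, word_ngram_params(V, n), class_ngram_params(V, C, n))
```

For n = 3 the word model needs on the order of 8 x 10^12 parameters while the class model needs on the order of 10^9, which is the reduction the paragraph describes.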

[0029] As class n-gram models, methods such as setting word classes by part of speech have been proposed, but they allow only coarse word classification. Methods that classify words using the mutual information between words have also been proposed, but they have known problems, such as the lack of an algorithm that finds the optimal classification. Meanwhile, a thesaurus is known as a means of classifying words semantically. The embodiments therefore use a class n-gram model based on a thesaurus to address problems of the word n-gram model, such as the sparseness problem.

[0030] To this end, two cases are illustrated: an example of generating meaningless sentences for listening tests in psychoacoustic experiments (first embodiment), and an example of dealing with the sparseness problem of the word n-gram model for speech recognition (second embodiment). In both cases, n is set to 3 (a trigram, that is, chain probabilities over triples).

[0031] In the first and second embodiments, the classification vocabulary table (Bunrui Goi Hyo, 分類語彙表) of the National Language Research Institute (国立国語研究所) is used as the thesaurus. In this table, each word is given a code number consisting of five digits with a period, such as 1.3062. Using this code number as is to denote a class would yield too many classes, so in this embodiment the last digit is dropped and the set of words that then share the same code number is treated as one class.
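The truncation rule in paragraph [0031] can be sketched as follows. This is an illustration under stated assumptions: codes are held as strings such as "1.3062", and the words and the codes other than 1.3062 in the usage example are hypothetical placeholders, not entries from the actual table.

```python
from collections import defaultdict


def class_code(code: str) -> str:
    """Drop the last digit of a code number, e.g. '1.3062' -> '1.306'."""
    return code[:-1]


def build_classes(word_codes: dict[str, str]) -> dict[str, set[str]]:
    """Group words whose truncated code numbers coincide into one class."""
    classes: dict[str, set[str]] = defaultdict(set)
    for word, code in word_codes.items():
        classes[class_code(code)].add(word)
    return dict(classes)


if __name__ == "__main__":
    # Hypothetical entries, for illustration only.
    sample = {"word_a": "1.3062", "word_b": "1.3063", "word_c": "1.3071"}
    print(build_classes(sample))
```

Here word_a and word_b fall into the same class because their codes differ only in the dropped digit.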

[0032] Because the classification vocabulary table basically lists only independent words, the handling of function words becomes an issue when the class model of Equation 2 is used. How to handle words not registered in the table is also an issue. Function words could be grouped into classes grammatically, but the first and second embodiments let each function word form a class by itself. Words not registered in the classification vocabulary table are registered automatically during class model training on the basis of the morphological analysis results.

[0033] Word n-grams are usually trained by dividing the sentences in a language database into words by morphological analysis and then estimating the probability values of Equation 1 from the frequencies of word chains. Class n-grams are trained by the same procedure.
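The frequency-based estimation mentioned in paragraph [0033] might look like the sketch below for n = 3. The source does not say which estimator is used; plain relative frequency (maximum likelihood, no smoothing) is assumed here, and the input is assumed to be already split into tokens. The same function applies to class trigrams if class codes are passed instead of words.

```python
from collections import Counter


def train_trigram(sentences):
    """Estimate P(c | a, b) as count(a, b, c) / count(a, b as a trigram history).

    Assumption: plain relative frequencies, without smoothing.
    """
    tri, history = Counter(), Counter()
    for tokens in sentences:
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            tri[(a, b, c)] += 1
            history[(a, b)] += 1
    return {key: count / history[key[:2]] for key, count in tri.items()}


if __name__ == "__main__":
    corpus = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]]
    print(train_trigram(corpus))
```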

[0034] First, the language database is divided into words, and each word is converted to a class code by looking it up in the classification vocabulary table. Function words are not converted to codes; the word itself serves as the class code. Independent words not listed in the classification vocabulary table are processed as follows. Inflected forms of verbs and the like are looked up by their stem, and the inflected form and its corresponding classification code number are then added to a separately prepared word-class correspondence table (a table showing which words each class contains). Numerals are forcibly assigned the numeral code and are likewise added to the word-class correspondence table. Nouns are assigned, according to the morphological analysis results, to one of the classes "common noun", "sa-hen noun", "proper noun", "formal noun", "adverbial noun", or "temporal noun", and are also added to the word-class correspondence table. All other words are assigned to the unknown class.
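The fallback rules of paragraph [0034] can be sketched as one lookup function. Everything named here is an assumption for illustration: the part-of-speech labels, the class labels "NUMERAL", "NOUN:..." and "UNKNOWN", and the idea that the morphological analyzer supplies the stem and noun subtype as arguments.

```python
NOUN_SUBTYPES = {"common", "sa-hen", "proper", "formal", "adverbial", "temporal"}


def to_class(word, pos, thesaurus, word_class_table, stem=None, noun_subtype=None):
    """Map a word to a class code following the order of rules in [0034].

    thesaurus: word -> class code (truncated code number).
    word_class_table: separately kept word -> class table, updated in place.
    """
    if pos == "function":                 # function word: the word is its own class
        return word
    if word in thesaurus:                 # listed independent word
        return thesaurus[word]
    if word in word_class_table:          # registered earlier
        return word_class_table[word]
    if stem is not None and stem in thesaurus:
        cls = thesaurus[stem]             # inflected form matched by its stem
    elif pos == "numeral":
        cls = "NUMERAL"                   # forced numeral code
    elif pos == "noun" and noun_subtype in NOUN_SUBTYPES:
        cls = "NOUN:" + noun_subtype      # noun subtype from morphological analysis
    else:
        return "UNKNOWN"                  # everything else: unknown class
    word_class_table[word] = cls          # register for later lookups
    return cls
```

A call such as `to_class("walked", "verb", {"walk": "2.152"}, table, stem="walk")` would return the stem's class and record the inflected form in `table`; the words and code in that call are placeholders.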

[0035] The word trigram and the class trigram were trained on a broadcaster's news manuscript database, which collects news manuscripts written by the broadcaster's reporters from 1992 to May 1996. For class trigram training, the vocabulary size was set to 20,000 words, and all other words were assigned to a class dedicated to out-of-vocabulary words. This class is distinct from the unknown class: the former contains words outside the vocabulary, whereas the latter contains words that are within the vocabulary but whose class is unknown. The number of classes obtained was 936, of which 518 were derived from the classification vocabulary table. The average number of words assigned to each class of independent words was 41. The word trigram was created under the same conditions.

[0036] (First Embodiment) The first example is described in detail below. To check whether speech has actually been heard correctly in a listening test, one approach is to compose meaningless sentences out of plain words and to run the experiment with speech read aloud from those sentences. The generation of such meaningless sentences is illustrated below.

[0037] Meaningless sentences are created by dividing short meaningful sentences prepared in advance into three parts each and exchanging those parts. Examples of meaningful sentences are shown below.

[0038] You cannot smoke on a motorcycle (オートバイでは煙草がすえない). My tooth has hurt since this morning (今朝から歯が痛い). The drawer is locked (引き出しに鍵がかかっている). A pigeon landed on my head (頭に鳩が止った). There are many tall buildings (高いビルがたくさんある).

[0039] For simplicity, however, each sentence is converted to a class code string using the classification vocabulary table and the word-class correspondence table, the occurrence probability of each sentence is calculated using only the inter-class trigram probabilities of Equation 3, and the meaningless sentences are created by selecting the set of sentences with the lowest probability. To show the effect of the invention by comparison, the same processing was also performed with a word trigram. For convenience, results below are expressed as perplexity rather than as probability values. Here, with the occurrence probability of each sentence (a word string $w_1 \cdots w_n$) written $P(w_1 \cdots w_n)$ and its entropy given by

[0040]

[Equation 4]
$$H(L) = -\frac{1}{n} \log_2 P(w_1 \cdots w_n)$$

[0041] the perplexity is the quantity

[0042]

[Equation 5]
$$PP = 2^{H(L)}$$

Thus, the smaller the occurrence probability of a sentence, the larger its perplexity.
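The relation between sentence probability and perplexity can be checked numerically with the short sketch below. It assumes the usual per-word entropy, H = -(1/n) log2 P, which is an assumption about the normalization since the equation image is not reproduced in this text; the function name is illustrative.

```python
import math


def perplexity(sentence_prob: float, n_words: int) -> float:
    """Perplexity 2^H with per-word entropy H = -(1/n) * log2(P) (assumed form)."""
    entropy = -math.log2(sentence_prob) / n_words
    return 2.0 ** entropy


if __name__ == "__main__":
    # A less probable three-word sentence gets the larger perplexity.
    print(perplexity(1 / 8, 3), perplexity(1 / 64, 3))
```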

[0043] Next, the functional configuration of a language processing device that uses this language processing method to generate meaningless sentences for listening tests is described. In FIG. 1, reference numeral 1 denotes a processing block that divides each sentence in the input set (the five meaningful sentences shown above) into three parts and generates meaningless sentences by exchanging those parts. The number of sentence sets generated in this case is 14,400.
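The figure of 14,400 is consistent with keeping the first part of each of the five sentences fixed and permuting the second and third parts independently across sentences (5! x 5! = 14,400), as block B4 describes. A minimal sketch of that enumeration, with illustrative names and placeholder part labels, is:

```python
from itertools import permutations


def generate_sentence_sets(blocks):
    """Yield every set of recombined sentences.

    blocks[i][j] is part j (0, 1, 2) of sentence i. The first part stays in
    place; the second and third parts are reassigned by permutations r and s.
    """
    k = len(blocks)
    for r in permutations(range(k)):
        for s in permutations(range(k)):
            yield [(blocks[i][0], blocks[r[i]][1], blocks[s[i]][2]) for i in range(k)]


if __name__ == "__main__":
    # Placeholder labels b11 ... b53 stand in for the actual sentence parts.
    blocks = [[f"b{i}{j}" for j in (1, 2, 3)] for i in range(1, 6)]
    print(sum(1 for _ in generate_sentence_sets(blocks)))
```

Note that the enumeration includes the identity permutations, that is, the original meaningful sentences; they are simply unlikely to score highest in the later selection step.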

[0044] Reference numeral 2 denotes a morphological analysis block that divides the generated sets of meaningless sentences into words. 4 denotes a storage block that stores the thesaurus. 6 denotes a word classification block that converts words to class codes on the basis of the thesaurus.

[0045] 8 denotes a storage block that stores a table of inter-class trigram probabilities (the probability values of Equation 3). 10 denotes a processing block that calculates the perplexity of every generated set of meaningless sentences on the basis of the class trigram and selects the set with the largest perplexity.

[0046] FIG. 2 shows the language processing executed in the language processing device. The actual processing is described with reference to FIG. 2. The operations performed in each block of FIG. 2 are as follows.

[0047] B2: Reads the set of input sentences.

[0048] B4: Generates meaningless sentences by dividing each input sentence into three parts, keeping the first part fixed, and exchanging the second and third parts. The five sentences are written

[0049]

[Equation 6]
$$\begin{matrix}
b_{11}\, b_{12}\, b_{13} \\
b_{21}\, b_{22}\, b_{23} \\
b_{31}\, b_{32}\, b_{33} \\
b_{41}\, b_{42}\, b_{43} \\
b_{51}\, b_{52}\, b_{53}
\end{matrix}$$
where $b_{ij}$ is the j-th block of the i-th sentence. Let r be a function from {1, 2, 3, 4, 5} to {1, 2, 3, 4, 5} satisfying $i \ne j \Rightarrow r(i) \ne r(j)$, and let the function s be defined in the same way. This block then generates the sentences

[0050]

[Equation 7]
$$\begin{matrix}
b_{11}\, b_{r_1 2}\, b_{s_1 3} \\
b_{21}\, b_{r_2 2}\, b_{s_2 3} \\
b_{31}\, b_{r_3 2}\, b_{s_3 3} \\
b_{41}\, b_{r_4 2}\, b_{s_4 3} \\
b_{51}\, b_{r_5 2}\, b_{s_5 3}
\end{matrix}$$

[0051] where $r_i$ and $s_i$ abbreviate r(i) and s(i), respectively.

[0052] B6: Assigns 0 to the variable Max.

[0053] B8: Determines whether every set of meaningless sentences generated in B4 has been selected.

[0054] B10: Selects a set not yet selected from among the sets of meaningless sentences generated in B4.

[0055] B12: Subroutine that calculates the perplexity of the set of meaningless sentences selected in B10. Details are shown in FIG. 3. The input is a set of meaningless sentences, and the result is returned in the variable out.

[0056] B14: Determines whether the value of the variable out is greater than the variable Max.

[0057] B16: Determines whether the value of the variable out is equal to the variable Max.

[0058] B18: Initializes the array that stores the output sentence sets and stores the currently selected set of meaningless sentences in it.

[0059] B20: Replaces the value of the variable Max with the value of the variable out.

[0060] B22: Appends the currently selected set of meaningless sentences to the array that stores the output sentence sets.

[0061] B24: Outputs the stored output sentence sets.
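The loop formed by blocks B6 through B24 can be sketched as below. The scoring function is passed in as a parameter and stands for the B12 subroutine; the function and variable names other than Max and out are illustrative, and the branch order (greater than, then equal) is read from the block descriptions rather than from the flowchart itself.

```python
def select_max_sets(candidate_sets, score):
    """Keep every candidate set whose score equals the running maximum."""
    best = []           # array of output sentence sets
    max_score = 0.0     # B6: Max = 0
    for cand in candidate_sets:      # B8 / B10: visit each set once
        out = score(cand)            # B12: perplexity of the set
        if out > max_score:          # B14
            best = [cand]            # B18: reinitialize and store
            max_score = out          # B20
        elif out == max_score:       # B16
            best.append(cand)        # B22
    return best, max_score           # B24: output the stored sets


if __name__ == "__main__":
    # Toy usage: string length stands in for the perplexity score.
    print(select_max_sets(["a", "bb", "cc", "d"], len))
```

Keeping ties rather than only the first maximum matches the separate equality test in B16.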

[0062] FIG. 3 shows the processing in the subroutine indicated by B12 in FIG. 2. The processing of FIG. 3 consists of the following blocks.

[0063] B32: Assigns 0 to the variable out.

[0064] B34: Determines whether every sentence in the input set of meaningless sentences has been selected.

[0065] B36: Selects a sentence not yet selected from the input set of sentences.

[0066] B37: Morphological analysis block that divides the selected sentence into words. Any morphological analysis method may be used; this embodiment uses JUMAN, software developed at Kyoto University.

[0067] B38: Decomposes the sentence into word triples. If the sentence is the concatenation of the words $w_1, w_2, \dots, w_n$, the triples $w_1 w_2 w_3$, $w_2 w_3 w_4$, ..., $w_{n-2} w_{n-1} w_n$ are generated.

[0068] B40: Classifies each word in the word triples into a class and converts the triples to class code triples. The word triples $w_1 w_2 w_3$, $w_2 w_3 w_4$, ..., $w_{n-2} w_{n-1} w_n$ are converted to the class triples $c_1 c_2 c_3$, $c_2 c_3 c_4$, ..., $c_{n-2} c_{n-1} c_n$.

[0069] B41: Reads out the trigram probability of Equation 3 for each triple generated in B40.

[0070] B42: Using the trigram probabilities read out in B41 and the equation

[0071]

[Equation 8]

[0072] calculates the perplexity of the sentence.

[0073] B44: Assigns to the variable out the sum of the value of out and the perplexity obtained in B42.

[0074] B46: Outputs the value of the variable out.
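The subroutine of FIG. 3 can be sketched as follows. Several points are assumptions: Equation 8 is not reproduced in this text, so the per-sentence perplexity is taken here as 2 raised to the negative mean log2 trigram probability over the sentence's triples; `tokenize` and `classify` are hypothetical stand-ins for the morphological analyzer and the thesaurus lookup; and every class triple is assumed to be present in the probability table.

```python
import math


def sentence_perplexity(class_codes, trigram_prob):
    """Perplexity of one sentence from class trigram probabilities (assumed form).

    Requires at least three class codes, so that at least one triple exists.
    """
    triples = list(zip(class_codes, class_codes[1:], class_codes[2:]))  # B38 / B40
    log_sum = sum(math.log2(trigram_prob[t]) for t in triples)          # B41
    return 2.0 ** (-log_sum / len(triples))                             # B42


def set_perplexity(sentences, tokenize, classify, trigram_prob):
    """Sum of sentence perplexities over a set of sentences (B32 to B46)."""
    out = 0.0                                      # B32
    for sentence in sentences:                     # B34 / B36
        words = tokenize(sentence)                 # B37
        codes = [classify(w) for w in words]       # B40
        out += sentence_perplexity(codes, trigram_prob)  # B42 / B44
    return out                                     # B46
```

For a quick check, whitespace splitting and `str.upper` can serve as toy replacements for the tokenizer and classifier.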

[0075] (Second Embodiment) The second example is described in detail below. In general, when a word trigram model is used in speech recognition, a word triple contained in the spoken sentence may not have occurred in the training data, so that its probability value has not been estimated. In this example, the probability value in such cases is estimated using the class trigram of Equation 2.
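The fallback described in paragraph [0075] might be sketched as below: use the word trigram probability when it was observed in training, and otherwise estimate it as the product of the word-in-class probability and the class trigram probability, following the form of Equation 2. The table layouts (plain dictionaries keyed by tuples) and all names are assumptions for illustration.

```python
def trigram_probability(w1, w2, w3, word_tri, class_of, class_tri, word_given_class):
    """Return P(w3 | w1 w2), falling back to the class-based estimate.

    word_tri: (w1, w2, w3) -> probability, for triples seen in training.
    class_of: word -> class code.
    class_tri: (c1, c2, c3) -> P(c3 | c1 c2).
    word_given_class: (word, class) -> P(word | class).
    """
    key = (w1, w2, w3)
    if key in word_tri:                       # word trigram was estimated
        return word_tri[key]
    c1, c2, c3 = class_of[w1], class_of[w2], class_of[w3]
    # Class-based estimate: P(w3 | c3) * P(c3 | c1 c2)
    return word_given_class[(w3, c3)] * class_tri[(c1, c2, c3)]
```

Because the class tables are far smaller than the word trigram table, an estimate is available for many word triples that never appeared in the training data.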

[0076] FIG. 4 shows the functional configuration of a language processing apparatus that executes the language processing of this embodiment. This embodiment assumes that some continuous speech recognition system exists and that it outputs a plurality of recognition candidate sentences. In FIG. 4, reference numeral 22 denotes a morphological analysis block that divides the plurality of sentences input as speech recognition candidates into words, and 24 denotes a storage block that stores the word trigram probabilities (Equation 1). Reference numeral 26 denotes a processing block that reads out the word chain probability corresponding to the input. Reference numeral 28 denotes a storage block that stores the thesaurus, and 30 denotes a word classification block that converts words into class codes based on the thesaurus.

[0077] Reference numeral 32 denotes a storage block that stores tables of the inter-class trigram probabilities and the class-word probabilities (the probability that a word appears in its class). Reference numeral 34 denotes a processing block that calculates the class trigram probability (the probability of Equation 2) corresponding to the input sentence. Reference numeral 36 denotes a processing block that calculates the Perplexity of each input sentence (recognition candidate), using the class trigram probability wherever a word trigram probability has not been obtained, and then selects and outputs the sentence with the smallest Perplexity (that is, the highest probability value).

[0078] FIG. 5 shows the language processing of the second embodiment. The actual processing is described with reference to FIG. 5. Each block in FIG. 5 performs the following.

【００７９】Ｂ５２：入力文の組（複数の認識候補）を
読み込むブロック。B52: Block for reading a set of input sentences (a plurality of recognition candidates).

[0080] B54: A block that assigns the largest possible value to the variable Min, so that the comparison in B62 succeeds for the first candidate.

【００８１】Ｂ５６：入力された文を全て選択されたか
どうかを判定するブロック。B56: A block for determining whether or not all the input sentences have been selected.

【００８２】Ｂ５８：それまでに選択されていない文を
選択するブロック。B58: Block for selecting a sentence that has not been selected before.

[0083] B60: A subroutine that calculates the Perplexity of the selected sentence. Its input is the selected sentence, and it returns the Perplexity of that sentence in the variable out.

【００８４】Ｂ６２：変数ｏｕｔが、変数Ｍｉｎより小
さいかどうかを判定するブロック。B62: Block for determining whether or not the variable out is smaller than the variable Min.

【００８５】Ｂ６４：出力文を格納する配列を初期化
し、その配列に、現在選択されている文を格納する。B64: Initialize the array for storing the output sentence, and store the currently selected sentence in the array.

【００８６】Ｂ６６：変数Ｍｉｎに変数ｏｕｔの値を代
入する。B66: The value of the variable out is substituted for the variable Min.

【００８７】Ｂ６８：出力文用の配列に格納されている
文を出力する。B68: The sentence stored in the output sentence array is output.
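The selection loop of B52 to B68 can be sketched as follows. The function name and the `perplexity` callback (standing in for subroutine B60) are assumptions introduced for illustration. Min is initialized to positive infinity here, an assumption made so that the comparison in B62 can succeed for the first candidate.

```python
import math
from typing import Callable, List, Sequence


def select_best_candidate(candidates: Sequence[str],
                          perplexity: Callable[[str], float]) -> List[str]:
    """Return the recognition candidate with the smallest Perplexity, as a one-element list."""
    output: List[str] = []        # array that holds the output sentence
    min_value = math.inf          # B54 (assumed: largest possible value)
    for sentence in candidates:   # B56 / B58: visit every candidate once
        out = perplexity(sentence)    # B60: Perplexity of the selected sentence
        if out < min_value:           # B62
            output = [sentence]       # B64: reset the array and store the sentence
            min_value = out           # B66
    return output                 # B68
```

Because the comparison is strict, the first candidate wins when several candidates share the minimum value; the source text does not say how ties are meant to be handled.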

【００８８】図６は、サブルーチンＢ６０の処理内容を
示す。図６の処理ブロックは以下のブロックで構成され
ている。FIG. 6 shows the processing contents of the subroutine B60. The processing block in FIG. 6 includes the following blocks.

【００８９】Ｂ７２：変数ｏｕｔに０を代入するブロッ
ク。B72: Block for assigning 0 to a variable out.

【００９０】Ｂ７３：入力された文を単語単位に分割す
る形態素解析ブロック。B73: A morphological analysis block for dividing the input sentence into words.

[0091] B74: A block that decomposes the sentence into word triples. When the sentence is the concatenation of the words $w_1, w_2, \ldots, w_n$, it generates the word triples $w_1 w_2 w_3,\ w_2 w_3 w_4,\ \ldots,\ w_{n-2} w_{n-1} w_n$.

【００９２】Ｂ７６：全ての単語の３つ組が選択された
かどうかを判定するブロック。B76: A block for judging whether a triple of all words has been selected.

【００９３】Ｂ７８：それまでに選択されていない単語
３つ組を選択するブロック。B78: Block for selecting a word triplet that has not been selected.

【００９４】Ｂ８０：選択された３つ組に対し、単語ト
ライグラムが存在するかどうかを判定するブロック。B80: A block for determining whether or not a word trigram exists for the selected triplet.

【００９５】Ｂ８２：単語トライグラムモデルから求め
られた確率の対数値を、変数ｏｕｔに加算するブロッ
ク。B82: A block for adding the logarithmic value of the probability obtained from the word trigram model to the variable out.

【００９６】Ｂ８４：選択されている単語の３つ組か
ら、クラスの３つ組を求める単語分類ブロック。B84: A word classification block for obtaining a class triple from the selected word triple.

[0097] B86: A block that, where the word triple is $w_{m-2} w_{m-1} w_m$ and $c_i$ is the word class to which word $w_i$ belongs, obtains the logarithm of the chain probability of the word triple $w_{m-2} w_{m-1} w_m$ according to Equation 2, from the chain probability of the class triple $c_{m-2} c_{m-1} c_m$ and the occurrence probability of the word $w_m$ given $c_m$.

[0098] B88: A block that adds the value obtained in B86 to the variable out.

【００９９】Ｂ９０：変数ｏｕｔを出力するブロック。B90: Block for outputting variable out.
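Blocks B72 to B90 can be sketched as below. This is a minimal illustration under stated assumptions: the function name and the four dictionaries are hypothetical, and the fallback uses the product form that B86 describes for Equation 2 (class chain probability times the probability of the word within its class).

```python
import math
from typing import Dict, Sequence, Tuple

Triple = Tuple[str, str, str]


def sentence_log_prob(words: Sequence[str],
                      word_tri: Dict[Triple, float],
                      class_tri: Dict[Triple, float],
                      word_given_class: Dict[Tuple[str, str], float],
                      word_to_class: Dict[str, str]) -> float:
    """Accumulate log probabilities over all word triples, with a class-trigram fallback."""
    out = 0.0                                         # B72
    for i in range(len(words) - 2):                   # B74 to B78: every word triple
        tri = (words[i], words[i + 1], words[i + 2])
        if tri in word_tri:                           # B80: word trigram available?
            out += math.log(word_tri[tri])            # B82
        else:
            classes = tuple(word_to_class[w] for w in tri)           # B84
            # B86: class chain probability times P(word | class);
            # a missing table entry raises KeyError in this sketch.
            p = class_tri[classes] * word_given_class[(tri[2], classes[2])]
            out += math.log(p)                        # B88
    return out                                        # B90
```

The value returned here is a sum of logarithms. How the source converts it into the Perplexity that B60 is said to return is not stated in this text; the usual conversion would be `math.exp(-out / N)` for N triples.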

[0100] In the present embodiment, the word classes were created by truncating the last digit of the code numbers in the classified vocabulary table (分類語彙表), and each attached word (function word) was treated as a class of its own; however, the invention is also applicable when the word classes are created by other methods. Furthermore, although the classified vocabulary table of the National Language Research Institute (国立国語研究所) was used as the thesaurus, the invention is of course applicable when a different thesaurus is used. In addition, although only the case n = 3 of the n-gram model (trigram) was illustrated, the invention is of course applicable to other values of n.

[0101] Experimental results for the example of generating meaningless sentences from the five sentences described above are shown below. The generated sets of meaningless sentences, arranged in descending order of Perplexity (that is, in ascending order of occurrence probability), are as follows. Each set is given in the original Japanese, followed by a literal English gloss.

Perplexity: 260.36
オートバイでは煙草がかかっている。今朝からビルがすえない。引き出しに鳩がたくさんある。頭に鍵が痛い。高い歯が止った。
(On the motorcycle, the cigarettes are locked. Since this morning, the building cannot be smoked. In the drawer, there are many pigeons. On the head, the key hurts. The high teeth landed.)

[0102] Perplexity: 260.36
オートバイでは煙草が止った。今朝からビルがすえない。引き出しに鳩がたくさんある。頭に鍵が痛い。高い歯がかかっている。
(On the motorcycle, the cigarettes landed. Since this morning, the building cannot be smoked. In the drawer, there are many pigeons. On the head, the key hurts. The high teeth are locked.)

[0103] Perplexity: 260.27
オートバイでは煙草がかかっている。今朝からビルが痛い。引き出しに鳩がたくさんある。頭に鍵がすえない。高い歯が止った。
(On the motorcycle, the cigarettes are locked. Since this morning, the building hurts. In the drawer, there are many pigeons. On the head, the key cannot be smoked. The high teeth landed.)

[0104] Perplexity: 260.27
オートバイでは煙草が止った。今朝からビルが痛い。引き出しに鳩がたくさんある。頭に鍵がすえない。高い歯がかかっている。
(On the motorcycle, the cigarettes landed. Since this morning, the building hurts. In the drawer, there are many pigeons. On the head, the key cannot be smoked. The high teeth are locked.)

[0105] Perplexity: 258.92
オートバイでは鳩がたくさんある。今朝からビルがすえない。引き出しに煙草がかかっている。頭に鍵が痛い。高い歯が止った。
(On the motorcycle, there are many pigeons. Since this morning, the building cannot be smoked. In the drawer, the cigarettes are locked. On the head, the key hurts. The high teeth landed.)

[0106] Perplexity: 258.92
オートバイでは鳩がたくさんある。今朝からビルがすえない。引き出しに煙草が止った。頭に鍵が痛い。高い歯がかかっている。
(On the motorcycle, there are many pigeons. Since this morning, the building cannot be smoked. In the drawer, the cigarettes landed. On the head, the key hurts. The high teeth are locked.)

[0107] Perplexity: 258.83
オートバイでは鳩がたくさんある。今朝からビルが痛い。引き出しに煙草がかかっている。頭に鍵がすえない。高い歯が止った。
(On the motorcycle, there are many pigeons. Since this morning, the building hurts. In the drawer, the cigarettes are locked. On the head, the key cannot be smoked. The high teeth landed.)

[0108] Perplexity: 258.83
オートバイでは鳩がたくさんある。今朝からビルが痛い。引き出しに煙草が止った。頭に鍵がすえない。高い歯がかかっている。
(On the motorcycle, there are many pigeons. Since this morning, the building hurts. In the drawer, the cigarettes landed. On the head, the key cannot be smoked. The high teeth are locked.)

[0109] For comparison, the same processing was carried out using word trigrams (Equation 1) in block B40 of FIG. 3. As a result, 16 sets of sentences attained the maximum Perplexity (the minimum probability value). Some of them are shown below, each in the original Japanese followed by a literal English gloss.

Perplexity: 718.67
オートバイでは煙草がすえない。今朝から歯が痛い。引き出しに鳩が止った。頭にビルがかかっている。高い鍵がたくさんある。
(On the motorcycle, cigarettes cannot be smoked. Since this morning, the tooth hurts. In the drawer, a pigeon landed. On the head, the building is locked. There are many high keys.)

[0110] Perplexity: 718.67
オートバイでは煙草がすえない。今朝から歯がかかっている。引き出しに鳩が止った。頭にビルが痛い。高い鍵がたくさんある。
(On the motorcycle, cigarettes cannot be smoked. Since this morning, the tooth is locked. In the drawer, a pigeon landed. On the head, the building hurts. There are many high keys.)

[0111] Perplexity: 718.67
オートバイでは煙草がすえない。今朝から歯がかかっている。引き出しに鳩が止った。頭にビルがたくさんある。高い鍵が痛い。
(On the motorcycle, cigarettes cannot be smoked. Since this morning, the tooth is locked. In the drawer, a pigeon landed. On the head, there are many buildings. The high key hurts.)

[0112] Perplexity: 718.67
オートバイでは煙草がすえない。今朝から歯がたくさんある。引き出しに鳩が止った。頭にビルがかかっている。高い鍵が痛い。
(On the motorcycle, cigarettes cannot be smoked. Since this morning, there are many teeth. In the drawer, a pigeon landed. On the head, the building is locked. The high key hurts.)

[0113] Perplexity: 718.67
オートバイでは煙草が止った。今朝から歯がかかっている。引き出しに鳩がすえない。頭にビルが痛い。高い鍵がたくさんある。
(On the motorcycle, the cigarettes landed. Since this morning, the tooth is locked. In the drawer, the pigeon cannot be smoked. On the head, the building hurts. There are many high keys.)

[0114] Perplexity: 718.67
オートバイでは煙草が止った。今朝から歯がかかっている。引き出しに鳩がすえない。頭にビルがたくさんある。高い鍵が痛い。
(On the motorcycle, the cigarettes landed. Since this morning, the tooth is locked. In the drawer, the pigeon cannot be smoked. On the head, there are many buildings. The high key hurts.)

[0115] Perplexity: 718.67
オートバイでは煙草が止った。今朝から歯がたくさんある。引き出しに鳩がすえない。頭にビルがかかっている。高い鍵が痛い。
(On the motorcycle, the cigarettes landed. Since this morning, there are many teeth. In the drawer, the pigeon cannot be smoked. On the head, the building is locked. The high key hurts.)

[0116] Comparing these experimental results, the language processing method of the present embodiment consistently produces meaningless sentences, whereas the sets obtained with word trigrams also include original meaningful sentences. This shows that the present invention is sufficiently effective.
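The search that produces such sets can be sketched as follows. The abstract describes splitting each input sentence into three parts and exchanging the parts; this sketch keeps the first parts in place and tries every permutation of the second and third parts, which is an assumption about the enumeration order rather than the patent's stated procedure. The function names and the `set_score` callback (standing in for the Perplexity of a whole set) are hypothetical.

```python
from itertools import permutations
from typing import Callable, Iterator, List, Sequence, Tuple

Parts = Tuple[str, str, str]


def recombined_sets(sentences: Sequence[Parts]) -> Iterator[List[Parts]]:
    """Yield every sentence set obtained by exchanging second and third parts."""
    firsts = [s[0] for s in sentences]
    seconds = [s[1] for s in sentences]
    thirds = [s[2] for s in sentences]
    for p2 in permutations(seconds):
        for p3 in permutations(thirds):
            # Re-assemble one candidate set; the identity permutation
            # reproduces the original sentences.
            yield list(zip(firsts, p2, p3))


def most_perplexing(sentences: Sequence[Parts],
                    set_score: Callable[[List[Parts]], float]) -> List[Parts]:
    """Return the recombined set with the highest score (e.g. summed Perplexity)."""
    return max(recombined_sets(sentences), key=set_score)
```

For five sentences this enumeration visits 5! x 5! = 14,400 candidate sets, so an exhaustive search is feasible.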

[0117] Next, to examine the effect of the second embodiment, the present invention was compared with back-off smoothing by calculating the Perplexity on transcripts of news that was actually broadcast. The evaluation data are text transcribed from the anchor's speech in a news program broadcast on June 4, 1996, and contain no sentences that overlap with the training database. As a result, the Perplexity was 62.38 with back-off smoothing, whereas it was 57.97 with the language processing method of the present embodiment, confirming the effect of the present invention.

[0118] In this experiment, a word trigram model was first adopted as the word n-gram model and a class trigram model as the class n-gram model, and only these two were compared in terms of perplexity. Katz's back-off smoothing was applied to the word trigram model. The results are shown in Table 2. As Table 2 shows, the word trigram model gave a lower perplexity than the class trigram model.

【０１１９】[0119]

【表２】 [Table 2]

[0120] To investigate the reason, the results on the evaluation data were examined in detail. The following are examples in which the trigram probability took a high value, which lowers the perplexity of the word trigram model.

【０１２１】[0121]

[Equation 9]
$$P(\text{演習} \mid \text{環太平洋}, \text{合同}) = 1$$
$$P(\text{審議会} \mid \text{中央}, \text{教育}) = 1$$

In these examples the trigram probability is 1. That is, whenever the two-word chain 環太平洋 ("Pacific Rim") 合同 ("joint") appeared in the training data, it was always followed by 演習 ("exercise"). The same holds for 中央 ("central"), 教育 ("education"), 審議会 ("council"). Because these three-word chains also appeared in the evaluation data, the word perplexity is thought to have decreased.

[0122] In the class trigram model, by contrast, even if 演習 ("exercise") is the only word that follows 環太平洋 ("Pacific Rim") 合同 ("joint") in the training data, each word is replaced by its class, so the class chain probability is not necessarily 1. Moreover, this class chain probability is multiplied by the occurrence probability of 演習 within the class to which it belongs (1.305), which makes the probability value of the class trigram model smaller still. This is thought to be why the perplexity of the class trigram model was higher than that of the word trigram model.
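Assuming the product form that paragraph [0097] describes for Equation 2 (class chain probability times the probability of the word within its class), the argument above can be written as a short inequality. The symbols $c_1, c_2, c_3$ are illustrative names for the classes of 環太平洋, 合同 and 演習 and do not appear in the original text.

```latex
\begin{aligned}
P_{\text{class}}(\text{演習} \mid \text{環太平洋}, \text{合同})
  &= P(c_3 \mid c_1, c_2)\, P(\text{演習} \mid c_3) \\
  &\le P(\text{演習} \mid c_3) \;\le\; 1
   = P_{\text{word}}(\text{演習} \mid \text{環太平洋}, \text{合同})
\end{aligned}
```

Equality would require both factors to be 1, that is, class $c_3$ always following $c_1 c_2$ and 演習 being the only word observed in $c_3$.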

[0123] The above examination leads to the following conclusions: (1) The word trigram model fits the training data more closely than the class trigram model. It therefore shows high expressive power when the statistical properties of the evaluation data and the training data are similar.

[0124] (2) Because the class trigram model uses the statistical properties of each class, it has an effect similar to smoothing. Therefore, although probability values of 1 as in Equation 9 cannot be expected, it can estimate a probability value even for word chains that do not occur in the training data.

[0125] Conclusion (2) suggests using the class trigram as a smoothing model for the word trigram. A composite model was therefore constructed that, whenever a word triple not contained in the word trigram model appears, estimates its probability value with the class trigram model instead of back-off smoothing, and the perplexity on the evaluation data was computed with this model. The result was 57.97, compared with a perplexity of 62.38 for the word trigram with back-off smoothing, so the perplexity was reduced further.
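As a quick check on the size of this improvement, the relative reduction follows directly from the two reported values (the percentage itself is derived here and is not stated in the original text):

```latex
\frac{62.38 - 57.97}{62.38} = \frac{4.41}{62.38} \approx 0.0707
```

That is, the composite model lowers the perplexity by about 7.1% relative to the word trigram with back-off smoothing.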

【０１２６】[0126]

[Effects of the Invention] As described above, in the inventions of claims 1 and 8, words are classified semantically according to a thesaurus, so the number of classes is larger than with part-of-speech classification and the words within a class are more closely related, which raises the accuracy of the chain probabilities.

[0127] In the inventions of claims 2 and 9, the chain probability is obtained using not the words themselves but the classification codes of the classes that contain them. As a result, for example, the number of classification-code combinations for which learned chain probabilities must be stored is far smaller than the number of word combinations. The memory capacity of the language processing apparatus can therefore be made smaller than before.
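The scale of this saving can be illustrated with purely hypothetical sizes; the vocabulary size $V = 20{,}000$ and class count $C = 500$ below are assumptions chosen for the arithmetic and do not come from the original text. Counting every possible triple as an upper bound on the table size:

```latex
V^{3} = (2\times10^{4})^{3} = 8\times10^{12},
\qquad
C^{3} = (5\times10^{2})^{3} = 1.25\times10^{8},
\qquad
\frac{V^{3}}{C^{3}} = 6.4\times10^{4}
```

In practice only observed triples are stored, so the real tables are much smaller than these bounds, but the ratio shows why a class-code table needs far fewer entries.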

[0128] In the inventions of claims 3 and 10, an attached word (function word) that is not listed in the thesaurus is itself used as its classification code, so attached words can also be handled.

[0129] In the inventions of claims 4 and 11, independent words (content words) that are not listed in the thesaurus classification are classified by part of speech, so a large number of words can be handled.

[0130] In the inventions of claims 5 and 12, selecting the word string with the smallest chain probability makes it possible to create meaningless sentences for use in psychoacoustic experiments, and the sentences created contain no arbitrary elements.

[0131] In the inventions of claims 6 and 13, the word string with the largest chain probability is taken as the best of the speech recognition candidates; compared with conventional speech recognition based on words alone, the chaining between words is taken into account, so speech recognition accuracy improves.

[0132] In the inventions of claims 7 and 14, the chain probability between words can be estimated even when it has not been given by training or the like, so a large number of words can be handled.

[Brief description of the drawings]

【図１】本発明第１実施形態の言語処理装置の機能構成
を示すブロック図である。FIG. 1 is a block diagram illustrating a functional configuration of a language processing apparatus according to a first embodiment of the present invention.

【図２】本発明第１実施形態の言語処理装置の処理手順
を示すフローチャートである。FIG. 2 is a flowchart showing a processing procedure of the language processing device according to the first embodiment of the present invention.

【図３】本発明第１実施形態の言語処理装置の処理手順
を示すフローチャートである。FIG. 3 is a flowchart illustrating a processing procedure of the language processing apparatus according to the first embodiment of the present invention.

【図４】本発明第２実施形態の言語処理装置の機能構成
を示すブロック図である。FIG. 4 is a block diagram illustrating a functional configuration of a language processing device according to a second embodiment of the present invention.

【図５】本発明第２実施形態の言語処理装置の処理手順
を示すフローチャートである。FIG. 5 is a flowchart illustrating a processing procedure of a language processing device according to a second embodiment of the present invention.

【図６】本発明第２実施形態の言語処理装置の処理手順
を示すフローチャートである。FIG. 6 is a flowchart illustrating a processing procedure of a language processing apparatus according to a second embodiment of the present invention.

[Explanation of symbols]

1: meaningless sentence generation block
2: morphological analysis block
4: thesaurus table storage block
6: word classification block
8: processing block storing the table of inter-class trigram probabilities
10: processing block that selects a set of meaningless sentences
22: morphological analysis block
24: word trigram probability storage block
26: processing block that reads out word chain probabilities
28: thesaurus storage block
30: word classification block
32: storage block storing the tables of inter-class trigram probabilities and class-word probabilities
34: processing block that calculates class trigrams
36: processing block that selects a recognition candidate using class trigram probabilities

Claims

[Claims]

1. A language processing apparatus that classifies a plurality of words into a plurality of classes, each class consisting of a plurality of words, acquires a chain probability indicating the degree of chaining between the classes, and selects a class and/or word string based on the acquired chain probability, the apparatus comprising: storage means for storing a thesaurus classification; and classification means for classifying the plurality of words according to the thesaurus classification.

2. The language processing apparatus according to claim 1, wherein the classification means converts the words into classification codes when classifying the plurality of words, and the chain probability is calculated with the word string expressed as a classification code string.

3. The language processing apparatus according to claim 2, wherein, for an attached word attached to the word, the classification means uses the attached word itself as its classification code.

4. The language processing apparatus according to claim 2, wherein the classification means classifies the word by part of speech when the word is an independent word that is not listed in the thesaurus classification.

5. The language processing apparatus according to claim 1, further comprising selection means for selecting the word string with the smallest chain probability as a meaningless sentence for use in a hearing experiment.

6. The language processing apparatus according to claim 1, further comprising selection means for selecting the word string with the largest chain probability as the optimal candidate among speech recognition candidates.

7. The language processing apparatus according to claim 1, wherein, in connection with calculating the chain probability between words, the occurrence probability of a word within a class and the chain probability between classes are obtained, and, when the chain probability between words cannot be obtained directly, the chain probability between the words is estimated based on the occurrence probability of the word and the chain probability between the classes.

8. A language processing method that classifies a plurality of words into a plurality of classes, each class consisting of a plurality of words, acquires a chain probability indicating the degree of chaining between the classes, and selects a class and/or word string based on the acquired chain probability, the method comprising: storing a thesaurus classification; and classifying the plurality of words according to the thesaurus classification.

9. The language processing method according to claim 8, wherein, when the plurality of words are classified, the words are converted into classification codes, and, when the chain probability is calculated, it is calculated with the word string expressed as a classification code string.

10. The language processing method according to claim 9, wherein, for an attached word attached to the word, the attached word itself is used as its classification code.

11. The language processing method according to claim 9, wherein, when the word is an independent word that is not listed in the thesaurus classification, the word is classified by part of speech.

12. The language processing method according to claim 8, wherein the word string with the smallest chain probability is selected as a meaningless sentence for use in a hearing experiment.

13. The language processing method according to claim 8, wherein the word string with the largest chain probability is selected as the optimal candidate among speech recognition candidates.

14. The language processing method according to claim 8, wherein, in connection with calculating the chain probability between words, the occurrence probability of a word within a class and the chain probability between classes are obtained, and, when the chain probability between words cannot be obtained directly, the chain probability between the words is estimated based on the occurrence probability of the word and the chain probability between the classes.