JP2000231572A

JP2000231572A - Method and device for registering unknown word with noun thesaurus and recording medium with unknown word registration program recorded therein

Info

Publication number: JP2000231572A
Application number: JP11032475A
Authority: JP
Inventors: Yasunari Maeda; 康成前田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1999-02-10
Filing date: 1999-02-10
Publication date: 2000-08-22

Abstract

PROBLEM TO BE SOLVED: To determine, with statistical rigor, the node of a noun thesaurus whose multinomial distribution is closest to that of an unknown word as the unknown-word registration node, by using a Bayesian estimator with theoretical guarantees under a finite sample instead of raw co-occurrence frequencies, and the Kullback-Leibler information (KL information), a distance between distributions in probability-distribution space, instead of the cosine between vectors in vector space. SOLUTION: The device consists of a means 100 that calculates the co-occurrence frequency of an unknown word with each verb in the document data of a corpus 120 and the co-occurrence frequency of each node of a noun thesaurus 130 with each verb in the same document data; a means 200 that uses this co-occurrence frequency information to calculate the Bayesian estimator of the multinomial distribution with which the unknown word co-occurs with each verb and the Bayesian estimator of the multinomial distribution with which each node of the noun thesaurus co-occurs with each verb; and a means 300 that uses the Bayesian estimators and outputs, as the unknown-word registration node, the node of the noun thesaurus whose multinomial distribution is closest to that of the unknown word, with the Kullback-Leibler information as the measure.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a technique for registering an unknown word in a noun thesaurus. More specifically, it relates to a method and an apparatus that assume multinomial distributions for the co-occurrence of the unknown word with each verb in a corpus and for the co-occurrence of each node of the noun thesaurus with each verb in the corpus, and that register the unknown word in an existing noun thesaurus by taking, as the unknown-word registration node, the node whose multinomial distribution is closest to that of the unknown word as measured by the Kullback-Leibler information. It also relates to a recording medium on which the corresponding program is recorded.

[0002]

2. Description of the Related Art

In the field of natural language processing, many noun thesauruses that classify nouns semantically into a tree structure have been constructed for use in information retrieval, document clustering, and similar tasks (for example, the "NTT Thesaurus"; Ikehara et al., "Nihongo Goi-Taikei" (Japanese vocabulary system), Iwanami Shoten, 1997). In addition, methods for registering unknown words in an existing noun thesaurus have been proposed as part of the maintenance of such thesauruses.

[0003] Most conventional methods for registering unknown words in a noun thesaurus rest on Harris's distributional hypothesis, which holds that "the meaning of a word is characterized by the words it co-occurs with" (Harris, Zelig S., "Mathematical Structures of Language", New York: Wiley (1968)). They determine the node in which to register an unknown word by using, as the measure, the cosine between co-occurrence vectors, which record which words an item co-occurs with and how often.

[0004] The unknown-word registration method based on the cosine between the co-occurrence vectors of the unknown word and of each node of the noun thesaurus, which is common to many conventional methods such as Uramoto, "Corpus-based thesaurus" (Transactions of the Information Processing Society of Japan, Vol. 37, No. 12, pp. 2182-2189, 1996), can be summarized as follows.
(1) Given an unknown word, calculate the co-occurrence frequency of the unknown word with each verb in the document data in the corpus, and the co-occurrence frequency of each node of the noun thesaurus with each verb in the same document data.
(2) Calculate the cosine between the co-occurrence vectors built from these co-occurrence frequencies, and select the node of the noun thesaurus with the largest cosine as the unknown-word registration node.

[0005] The co-occurrence vector of a noun n_i is given by the following equation.

[0006]

(Equation 1)
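Equation 1 is not reproduced in this text. Going by the definitions in paragraph [0007], one consistent form would be the vector of co-occurrence counts over all verbs. This is a reconstruction rather than the original equation, and the function name vec is an assumption borrowed from the notation in paragraph [0018].

```latex
\mathrm{vec}(n_i) = \bigl( co(n_i, v_1),\ co(n_i, v_2),\ \ldots,\ co(n_i, v_{|V|}) \bigr)
```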

[0007] Here, n_i ∈ N (N is the set of nouns), v_j ∈ V (V is the set of verbs), |·| denotes the cardinality of a set, and co(n_i, v_j) denotes the co-occurrence frequency, that is, the total number of times the noun n_i and the verb v_j co-occur in the same sentence in the corpus, which is a database of documents.

[0008] The co-occurrence vector of a node node_i of the noun thesaurus is given by the following equation.

[0009]

(Equation 2)

[0010] Here, node_i ∈ NODE (NODE is the set of nodes of the noun thesaurus), and co(node_i, v_j) denotes the co-occurrence frequency, that is, the total number of times the node node_i of the noun thesaurus and the verb v_j co-occur in the same sentence in the corpus. It is calculated by the following equation.

[0011]

(Equation 3)
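Equation 3 is not reproduced in this text. Since the description later refers to nouns n_k contained in a node (n_k ∈ node_i), a plausible reading is that the node count is the sum of the counts of its member nouns. The following is a reconstruction under that assumption, not the original equation.

```latex
co(\mathrm{node}_i, v_j) = \sum_{n_k \in \mathrm{node}_i} co(n_k, v_j)
```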

[0012] Thesauruses include classification thesauruses, in which words are registered only at the leaves of the tree structure, and hypernym-hyponym thesauruses, in which words are registered at intermediate nodes as well as at the leaves. No distinction between the two is made here.

[0013] The co-occurrence vector of the unknown word unknown is given by the following equation.

[0014]

(Equation 4)

[0015] Here, co(unknown, v_j) denotes the co-occurrence frequency, that is, the total number of times the unknown word unknown and the verb v_j co-occur in the same sentence in the corpus, which is a database of documents.

[0016] The unknown-word registration node dnode_vec(unknown), in which the unknown word unknown is registered, is determined by the following equation.

[0017]

(Equation 5)
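Equation 5 is not reproduced in this text. Based on step (2) of paragraph [0004] and the symbols explained in paragraph [0018], a consistent reconstruction (not the original equation) would select the node with the largest cosine:

```latex
\mathrm{dnode\_vec}(\mathrm{unknown})
  = \operatorname*{arg\,max}_{\mathrm{node}_i \in \mathrm{NODE}}
    \cos\bigl(\mathrm{vec}(\mathrm{unknown}),\ \mathrm{vec}(\mathrm{node}_i)\bigr),
\qquad
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
```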

[0018] Here, cos denotes the cosine between vectors, "·" the inner product of vectors, and ‖vec‖ the norm of the vector vec.
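The conventional cosine-based selection described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the node names, and the toy counts are hypothetical and do not come from the patent.

```python
import math

def cosine(u, v):
    """Cosine between two equal-length count vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def register_by_cosine(unknown_vec, node_vecs):
    """Return the name of the node whose co-occurrence vector has the largest cosine."""
    return max(node_vecs, key=lambda name: cosine(unknown_vec, node_vecs[name]))

# Hypothetical co-occurrence counts over three verbs (eat, drive, read).
node_vecs = {"food": [9, 0, 1], "vehicle": [0, 8, 1]}
best = register_by_cosine([4, 1, 0], node_vecs)
```

Note that the raw counts are used directly, which is the property the patent goes on to criticize.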

[0019]

[Problems to be solved by the invention]

The conventional methods above have several problems. First, because they are based on the distributional hypothesis, they should be assuming some probability distribution behind the co-occurrence frequencies. In practice, however, they simply use the raw co-occurrence frequency values and make no attempt to estimate the true distribution. Under these conditions, whatever measure is used for the similarity or distance between the unknown word and each node of the noun thesaurus, a rigorous comparison of magnitudes is not possible.

[0020] Second, because they are based on the distributional hypothesis, they should be assuming some probability distribution behind the co-occurrence frequencies, yet they actually use the cosine between co-occurrence vectors in vector space as the measure. The cosine between co-occurrence vectors in vector space is not necessarily a valid measure in probability-distribution space. Once a probability distribution is assumed, a rigorous similarity or distance cannot be calculated unless some measure defined in probability-distribution space is used.

[0021] In view of the above problems, the object of the present invention is to determine, with greater statistical rigor than the conventional methods, the node of the noun thesaurus whose multinomial distribution is close to that of the unknown word as the unknown-word registration node. It does so by using a Bayesian estimator, which has theoretical guarantees under a finite sample, in place of co-occurrence frequencies, and the Kullback-Leibler information (KL information), which is a distance between distributions in probability-distribution space, in place of the cosine between vectors in vector space.

[0022]

[Means for solving the problems]

In the method of the present invention for registering an unknown word in a noun thesaurus, when an unknown word is given as input data, Bayesian estimators and the Kullback-Leibler information (KL information) are used to calculate the distance between the multinomial distribution of the unknown word over the verbs and the multinomial distribution of each node of the noun thesaurus over the verbs. The node of the noun thesaurus with the smallest distance, that is, the node whose co-occurrence behavior is statistically most similar to that of the unknown word, is output as the unknown-word registration node.

[0023] The KL information between probability distributions p(x) and q(x) is given by the following equation (see, for example, Hirasawa, "Information Theory" (Information Mathematics Series B-1), pp. 22-23, Baifukan, 1996).

[0024]

(Equation 6)
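Equation 6 is not reproduced in this text. The standard definition of the Kullback-Leibler information for discrete distributions, written in the D(p; q) notation that the description uses later, is:

```latex
D(p; q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}
```

This quantity is non-negative and equals zero only when p and q coincide. It is not symmetric in p and q, so the order of the arguments matters.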

[0025] The Bayesian estimator with respect to the Kullback-Leibler information is given by the following equation (see, for example, IEEE Transactions on Information Theory, Vol. 37, No. 5, pp. 1288-1291, September 1991).

[0026]

(Equation 7)
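Equation 7 is not reproduced in this text. A well-known result is that, when the loss is the Kullback-Leibler information, the Bayes-optimal estimate of a distribution is the posterior predictive distribution. Assuming that is what Equation 7 expresses, it would take roughly the following form, where x^n (a hypothetical symbol, not from the patent) denotes the observed sample:

```latex
p'(x \mid x^n) = \int_{\Theta} p(x \mid \theta)\, p(\theta \mid x^n)\, d\theta,
\qquad
p(\theta \mid x^n) = \frac{p(x^n \mid \theta)\, p(\theta)}{\int_{\Theta} p(x^n \mid \theta')\, p(\theta')\, d\theta'}
```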

[0027] Here, θ is the continuous parameter governing the distribution, and p(θ) is the prior distribution over that parameter.

[0028] FIG. 1 shows the principle configuration of the present invention. The apparatus of the present invention for registering an unknown word in a noun thesaurus comprises: a co-occurrence frequency calculation unit 100 that, when an unknown word is given, calculates the co-occurrence frequency of the unknown word with each verb in the document data in the corpus and the co-occurrence frequency of each node of the noun thesaurus with each verb in the same document data, and outputs the unknown word and the co-occurrence frequency information; a Bayesian estimator calculation unit 200 that, given the unknown word and the co-occurrence frequency information, calculates the Bayesian estimator of the multinomial distribution with which the unknown word co-occurs with each verb and the Bayesian estimator of the multinomial distribution with which each node of the noun thesaurus co-occurs with each verb, and outputs the unknown word and the Bayesian estimators; and an unknown-word registration node determination unit 300 that, given the unknown word and the Bayesian estimators, outputs as the unknown-word registration node the node of the noun thesaurus whose multinomial distribution is closest to that of the unknown word, with the Kullback-Leibler information as the measure.

[0029] FIG. 2 is a flowchart explaining the principle configuration of FIG. 1. The method of the present invention for registering an unknown word in a noun thesaurus comprises: a step S10 of inputting an unknown word; a step S20 of calculating the co-occurrence frequency of the unknown word with each verb in the document data in the corpus and the co-occurrence frequency of each node of the noun thesaurus with each verb in the same document data, and outputting the co-occurrence frequency information; a step S30 of using the co-occurrence frequency information to calculate and output the Bayesian estimators for the unknown word and for each node of the noun thesaurus, these estimators being guaranteed, in the multinomial case, to minimize under the Bayes criterion the Kullback-Leibler information with respect to the true distribution for a finite sample; a step S40 of using the Bayesian estimators to calculate and output the Kullback-Leibler information between the multinomial distribution of the unknown word and the multinomial distribution of each node of the noun thesaurus; a step S50 of determining the unknown-word registration node using the Kullback-Leibler information; and a step S60 of outputting, as the unknown-word registration node, the node obtained in step S50, namely the node of the noun thesaurus whose multinomial distribution is, in a statistically rigorous sense, closest to that of the unknown word.

[0030] First, the method of the present invention uses a Bayesian estimator that is guaranteed to minimize, under the Bayes criterion, the Kullback-Leibler information with respect to the true distribution for a finite sample. Unlike the conventional practice of using the co-occurrence frequencies as they are, it can therefore make full use of the statistical information the co-occurrence frequencies carry. Second, because it uses the Kullback-Leibler information, a distance in probability-distribution space, as the measure, it can output as the unknown-word registration node the node of the noun thesaurus whose multinomial distribution is, in a statistically rigorous sense, closest to that of the unknown word.

[0031]

[Embodiments of the invention]

FIG. 3 is a configuration diagram of one embodiment of the present invention. In FIG. 3, 100 is the co-occurrence frequency calculation unit, 200 is the Bayesian estimator calculation unit, and 300 is the unknown-word registration node determination unit. The co-occurrence frequency calculation unit 100 consists of a co-occurrence frequency calculator 110, a corpus 120 serving as the document database, and an existing noun thesaurus 130. Thesauruses include classification thesauruses, in which words are registered only at the leaves of the tree structure, and hypernym-hyponym thesauruses, in which words are registered at intermediate nodes as well as at the leaves; no distinction between the two is made here. The Bayesian estimator calculation unit 200 consists of a Bayesian estimator calculator 210 and a beta distribution parameter table 220. The unknown-word registration node determination unit 300 consists of a KL information calculator 310 and an unknown-word registration node determiner 320. The operation of the co-occurrence frequency calculation unit 100, the Bayesian estimator calculation unit 200, and the unknown-word registration node determination unit 300 is described below.

[0032] FIG. 4 is an operation flowchart of the co-occurrence frequency calculation unit 100. First, an unknown word is input to the co-occurrence frequency calculator 110 (step 70). The co-occurrence frequency calculator 110 then calculates the co-occurrence frequency co(unknown, v_j) of the unknown word unknown and the verb v_j in the document data in the corpus 120 (step 72). Next, the co-occurrence frequency calculator 110 calculates the co-occurrence frequency co(node_i, v_j) of the node node_i of the noun thesaurus 130 and the verb v_j in the document data in the corpus 120 (step 74). Here, the following equation holds.

[0033]

(Equation 8)

[0034] The equation refers to the co-occurrence frequency of the noun n_k and the verb v_j in the corpus 120.

[0035] After calculating the co-occurrence frequencies, the co-occurrence frequency calculator 110 outputs the unknown word and the respective co-occurrence frequency information (step 76).
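Steps 70 to 76 amount to counting sentence-level co-occurrences. The sketch below illustrates one way to do this in Python under several assumptions that are not stated in the patent: each sentence has already been reduced to a list of nouns and a list of verbs (for example by a morphological analyzer), a noun-verb pair is counted once per sentence, and the count for a node is the sum of the counts of its member nouns. All names and data are hypothetical.

```python
from collections import Counter

def count_cooccurrences(sentences):
    """Map (noun, verb) -> number of sentences that contain both."""
    co = Counter()
    for nouns, verbs in sentences:
        for noun in set(nouns):
            for verb in set(verbs):
                co[(noun, verb)] += 1
    return co

def node_counts(co, members, verbs):
    """Per-verb counts for a node: the sum over the node's member nouns (assumed)."""
    return [sum(co[(noun, verb)] for noun in members) for verb in verbs]

# Hypothetical pre-analyzed sentences: (nouns, verbs).
sentences = [
    (["apple"], ["eat"]),
    (["bread", "apple"], ["eat", "buy"]),
    (["car"], ["drive"]),
]
verbs = ["eat", "buy", "drive"]
co = count_cooccurrences(sentences)
food = node_counts(co, ["apple", "bread"], verbs)
```

The unknown word can be treated as a node with a single member, so the same helper yields its count vector.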

[0036] FIG. 5 is an operation flowchart of the Bayesian estimator calculation unit 200. First, the unknown word and the co-occurrence frequency information are input to the Bayesian estimator calculator 210 from the co-occurrence frequency calculator 110 (step 80). The Bayesian estimator calculator 210 then uses the co-occurrence frequency information and the beta distribution parameters β(v_j|node_i) and β(v_j|unknown) held in the beta distribution parameter table 220 to calculate the Bayesian estimator for each node of the noun thesaurus 130 and the Bayesian estimator for the unknown word (steps 82 and 84). Here, β(v_j|node_i) and β(v_j|unknown) represent the prior distribution p(θ) of the parameter θ governing the multinomial distribution that describes how the node node_i of the noun thesaurus 130, or the unknown word unknown, co-occurs with the verb v_j. FIG. 6 shows a conceptual diagram of the beta distribution parameter table 220. Each Bayesian estimator is calculated as follows.

[0037] Consider the conditional probability distribution that the verb v_j is present given that some noun n_k contained in the node node_i of the noun thesaurus 130 (n_k ∈ node_i) is present in the corpus 120. The Bayesian estimator p'(v_j|node_i) of this distribution with respect to the Kullback-Leibler information is given by Equation 9.

[0038]

(Equation 9)

[0039] This estimator is calculated in step 82. Here, β(v_j|node_i) is, as shown in FIG. 6, a parameter of the beta distribution, and represents the prior distribution p(θ) of the parameter θ governing the true multinomial distribution p(v_j|node_i, θ'(node_i)).
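Equation 9 is not reproduced in this text. If the β values are read as pseudo-counts of a conjugate (Dirichlet-type) prior for the multinomial distribution, which is an assumption rather than something the text states explicitly, the usual posterior-mean estimator would be as follows. It is offered as a plausible reconstruction, not as the original equation.

```latex
p'(v_j \mid \mathrm{node}_i)
  = \frac{co(\mathrm{node}_i, v_j) + \beta(v_j \mid \mathrm{node}_i)}
         {\sum_{v_l \in V} \bigl( co(\mathrm{node}_i, v_l) + \beta(v_l \mid \mathrm{node}_i) \bigr)}
```

With strictly positive β, every verb receives a non-zero probability even when its observed count is zero, which keeps the Kullback-Leibler information between two such estimates finite.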

[0040] Note that θ'(node_i) ∈ Θ is the true parameter governing this multinomial distribution p(v_j|node_i, θ'(node_i)). The estimator of equation (7) is guaranteed to minimize, under the Bayes criterion, the Kullback-Leibler information with respect to the true distribution for a finite sample, and the following equation holds.

[0041]

(Equation 10)

[0042] Similarly, consider the conditional probability distribution that the verb v_j is present given that the unknown word unknown is present in the corpus 120. The Bayesian estimator p'(v_j|unknown) of this distribution with respect to the Kullback-Leibler information is given by Equation 11.

[0043]

(Equation 11)

[0044] This estimator is calculated in step 84. It is the Bayesian estimator of the true multinomial distribution p(v_j|unknown, θ'(unknown)) of the unknown word unknown over the verbs.
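The estimation in steps 82 and 84 can be illustrated with a small Python sketch. It assumes the estimator has the posterior-mean form (count + β) / (total count + total β), which is the standard result for a multinomial distribution with a Dirichlet-type prior; the patent's own Equations 9 and 11 are not available in this text, so this form, the function name, and the numbers are assumptions.

```python
def bayes_estimate(counts, betas):
    """Smoothed multinomial estimate: (count_j + beta_j) / sum_l (count_l + beta_l).

    counts: co-occurrence counts of a node (or the unknown word) with each verb.
    betas:  prior parameters for the same verbs; assumed strictly positive.
    """
    if len(counts) != len(betas):
        raise ValueError("counts and betas must have the same length")
    total = sum(counts) + sum(betas)
    return [(c + b) / total for c, b in zip(counts, betas)]

# Hypothetical counts over three verbs, with a uniform prior parameter of 0.5.
p_unknown = bayes_estimate([4, 1, 0], [0.5, 0.5, 0.5])
```

The same function serves for thesaurus nodes and for the unknown word; only the counts and the table entries differ.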

[0045] After calculating the Bayesian estimators, the Bayesian estimator calculator 210 outputs the unknown word and the Bayesian estimators (step 86).

[0046] FIG. 7 is an operation flowchart of the unknown-word registration node determination unit 300. First, the unknown word and the Bayesian estimators are input to the KL information calculator 310 from the Bayesian estimator calculator 210 (step 90). The KL information calculator 310 then calculates the Kullback-Leibler information between the Bayesian estimator p'(v_j|unknown) of the unknown word unknown and the Bayesian estimator p'(v_j|node_i) of each node node_i of the noun thesaurus according to the following equation (12) (step 92).

[0047]

(Equation 12)
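Equation 12 is not reproduced in this text. Applying the standard definition of the Kullback-Leibler information to the two estimators, in the argument order used in the description, D(p'(·|unknown); p'(·|node_i)), gives the following reconstruction:

```latex
D\bigl(p'(\cdot \mid \mathrm{unknown});\ p'(\cdot \mid \mathrm{node}_i)\bigr)
  = \sum_{v_j \in V} p'(v_j \mid \mathrm{unknown})
    \log \frac{p'(v_j \mid \mathrm{unknown})}{p'(v_j \mid \mathrm{node}_i)}
```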

[0048] Next, the unknown-word registration node determiner 320 compares the values of the Kullback-Leibler information D(p'(·|unknown); p'(·|node_i)) calculated by the KL information calculator 310, determines by the following equation (13) the node node_i of the noun thesaurus for which the Kullback-Leibler information is smallest, and takes it as the unknown-word registration node dnode_prob(unknown) for the unknown word unknown (step 94).

[0049]

(Equation 13)
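Equation 13 is not reproduced in this text. From the description of step 94, which selects the node with the smallest Kullback-Leibler information, it can be reconstructed as:

```latex
\mathrm{dnode\_prob}(\mathrm{unknown})
  = \operatorname*{arg\,min}_{\mathrm{node}_i \in \mathrm{NODE}}
    D\bigl(p'(\cdot \mid \mathrm{unknown});\ p'(\cdot \mid \mathrm{node}_i)\bigr)
```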

[0050] Finally, the unknown-word registration node determiner 320 outputs the unknown-word registration node dnode_prob(unknown) (step 96).
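Steps 90 to 96 can be sketched in Python as follows. The sketch assumes the estimated distributions are given as lists of probabilities over the same verb order and that every node probability is strictly positive (as it would be with positive prior parameters). The function names, node names, and numbers are hypothetical.

```python
import math

def kl_information(p, q):
    """D(p; q) = sum_j p_j * log(p_j / q_j); terms with p_j == 0 contribute 0."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0.0)

def registration_node(p_unknown, node_estimates):
    """Return the node whose estimate has the smallest D(p_unknown; p_node)."""
    return min(
        node_estimates,
        key=lambda name: kl_information(p_unknown, node_estimates[name]),
    )

# Hypothetical estimated distributions over three verbs.
p_unknown = [0.7, 0.2, 0.1]
node_estimates = {
    "food": [0.8, 0.1, 0.1],
    "vehicle": [0.1, 0.8, 0.1],
}
chosen = registration_node(p_unknown, node_estimates)
```

Sorting the nodes by the same key instead of taking the minimum would give a ranked candidate list rather than a single node.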

[0051] FIG. 8 illustrates the results of a simulation of unknown-word registration in a noun thesaurus according to the present invention. About 1000 nouns already registered in an actual existing noun thesaurus were extracted, and a registration experiment was run treating these 1000 words as unknown words, in order to compare the conventional method of registering unknown words in a noun thesaurus with the method of the present invention. The noun thesaurus used was the NTT Thesaurus (Ikehara et al., "Nihongo Goi-Taikei" (Japanese vocabulary system), Iwanami Shoten, 1997). The corpus, a database of sentences, was the EDR corpus (Japan Electronic Dictionary Research Institute, "EDR Electronic Dictionary User's Manual, Version 2.1", 1994), and co-occurrence frequencies with the 500 most frequent verbs in the EDR corpus were used.

[0052] In FIG. 8, the horizontal axis is the candidate rank: rather than considering only the single node with the smallest Kullback-Leibler information, as many nodes as the candidate rank are considered. The vertical axis shows the cumulative accuracy as a percentage, where a trial counts as correct if the same node as in the original NTT Thesaurus appears among the nodes from the smallest one up to the node at that candidate rank. The curve labeled "conventional" shows the results of registration by the conventional cosine between co-occurrence vectors, and the curve labeled "present invention" shows the results of registration by the method of the present invention. As FIG. 8 shows, the accuracy of the method of the present invention is consistently 20% or more higher than that of the conventional method.
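The cumulative accuracy plotted in FIG. 8 can be computed as sketched below. This is an illustration with made-up data, not the patent's experiment: it assumes each test word has exactly one correct node and a candidate list already sorted from best to worst, and all names are hypothetical.

```python
def cumulative_accuracy(ranked_candidates, gold_nodes, max_rank):
    """Percentage of words whose correct node is within the top k candidates, k = 1..max_rank."""
    n = len(ranked_candidates)
    rates = []
    for k in range(1, max_rank + 1):
        hits = sum(
            1
            for candidates, gold in zip(ranked_candidates, gold_nodes)
            if gold in candidates[:k]
        )
        rates.append(100.0 * hits / n)
    return rates

# Hypothetical ranked candidate nodes for three held-out words and their original nodes.
ranked = [["food", "tool"], ["tool", "vehicle"], ["animal", "food"]]
gold = ["food", "vehicle", "plant"]
rates = cumulative_accuracy(ranked, gold, 2)
```

Because a hit at rank k is also a hit at every larger rank, the resulting curve never decreases as the candidate rank grows.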

[0053] One embodiment of the present invention has been described above. In practice, the configuration of FIG. 3 is built on a computer. The processing procedures and algorithms of each unit in FIG. 3 can be written together in a computer-executable form and provided on a computer-readable recording medium such as a floppy disk or a compact disc (CD-ROM).

[0054]

[Effects of the invention]

As described above, the present invention uses a Bayesian estimator that is guaranteed to minimize, under the Bayes criterion, the Kullback-Leibler information with respect to the true distribution for a finite sample, and determines the unknown-word registration node using the Kullback-Leibler information, a distance in probability-distribution space, as the measure. It is therefore possible to output, as the unknown-word registration node, the node of the noun thesaurus whose multinomial distribution is, in a statistically rigorous sense, closest to that of the unknown word.

[Brief description of the drawings]

FIG. 1 is a diagram showing the principle configuration of the present invention.

FIG. 2 is a flowchart explaining the principle configuration of the present invention.

FIG. 3 is a configuration diagram of one embodiment of the present invention.

FIG. 4 is an operation flowchart of the co-occurrence frequency calculation unit of FIG. 3.

FIG. 5 is an operation flowchart of the Bayesian estimator calculation unit of FIG. 3.

FIG. 6 is a conceptual diagram of the beta distribution parameter table of FIG. 3.

FIG. 7 is an operation flowchart of the unknown-word registration node determination unit of FIG. 3.

FIG. 8 is an explanatory diagram of the simulation results of unknown-word registration in a noun thesaurus according to the present invention.

[Explanation of symbols]

100: co-occurrence frequency calculation unit
120: corpus
130: noun thesaurus
200: Bayesian estimator calculation unit
220: beta distribution parameter table
300: unknown-word registration node determination unit

Claims


1. A method of registering an unknown word in an existing noun thesaurus, comprising: a step of calculating, when an unknown word is given, the co-occurrence frequency of the unknown word with each verb in document data in a corpus and the co-occurrence frequency of each node of the noun thesaurus with each verb in the document data in the corpus; a step of calculating, using the co-occurrence frequencies, a Bayesian estimator of the multinomial distribution with which the unknown word co-occurs with each verb and a Bayesian estimator of the multinomial distribution with which each node of the noun thesaurus co-occurs with each verb; and a step of determining, using the Bayesian estimators, the node of the noun thesaurus whose multinomial distribution is closest to that of the unknown word, with the Kullback-Leibler information as the measure, as the unknown-word registration node.

2. An apparatus for registering an unknown word in an existing noun thesaurus, comprising: a co-occurrence frequency calculation unit that, when an unknown word is given, calculates the co-occurrence frequency of the unknown word with each verb in document data in a corpus and the co-occurrence frequency of each node of the noun thesaurus with each verb in the document data in the corpus; a Bayesian estimator calculation unit that, given the unknown word and the co-occurrence frequency information, calculates a Bayesian estimator of the multinomial distribution with which the unknown word co-occurs with each verb and a Bayesian estimator of the multinomial distribution with which each node of the noun thesaurus co-occurs with each verb; and an unknown-word registration node determination unit that, given the unknown word and the Bayesian estimators, outputs as the unknown-word registration node the node of the noun thesaurus whose multinomial distribution is closest to that of the unknown word, with the Kullback-Leibler information as the measure.

3. A computer-readable recording medium storing a program for registering an unknown word in an existing noun thesaurus, the program comprising: a process of calculating, when an unknown word is given, the co-occurrence frequency of the unknown word with each verb in document data in a corpus and the co-occurrence frequency of each node of the noun thesaurus with each verb in the document data in the corpus; a process of calculating, using the co-occurrence frequencies, a Bayesian estimator of the multinomial distribution with which the unknown word co-occurs with each verb and a Bayesian estimator of the multinomial distribution with which each node of the noun thesaurus co-occurs with each verb; and a process of determining, using the Bayesian estimators, the node of the noun thesaurus whose multinomial distribution is closest to that of the unknown word, with the Kullback-Leibler information as the measure, as the unknown-word registration node.