JP3379608B2

JP3379608B2 - Method of determining meaning similarity between words

Info

Publication number: JP3379608B2
Application number: JP29000494A
Authority: JP
Inventors: 勉石川; 和光松沢; 要笠原
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1994-11-24
Filing date: 1994-11-24
Publication date: 2003-02-24
Anticipated expiration: 2018-02-24
Also published as: JPH08147324A

Description

Detailed Description of the Invention

【０００１】[0001]

[Field of Industrial Application] The present invention relates to a method for determining semantic similarity between words, and more particularly to a method for determining the similarity in meaning between words that is required in various kinds of natural language processing, such as fuzzy database search and machine translation.

【０００２】[0002]

[Prior Art] With the development of workstations (WS) and other data processing devices, large-capacity databases whose stored contents are expressed in natural language rather than as numerical values, such as document databases, are coming into wide use. With such a database, a set of words representing the content of the desired document is entered as keywords, and that document itself or documents related to it are retrieved. The core technique in this kind of database search is the determination of the similarity in meaning between the words being compared, and the quality of this similarity determination decides the performance of the search. This holds not only for database search but also for various kinds of natural language processing, beginning with machine translation.

[0003] Such determination of semantic similarity between words is basically performed on the basis of a word meaning database, called a concept base, that is prepared in advance and stores the meanings of a large number of words. Referring to FIG. 1, in this concept base it is common practice to prepare, for each word w_i, a plurality of pairs each consisting of an attribute v_j that represents a feature of the word w_i and an importance a_ij that indicates the depth of the relation between the word w_i and the attribute v_j. Expressed in table form, this concept base is as shown in FIG. 1. In FIG. 1, w_1 to w_n denote the words and v_1 to v_m denote the attributes. The attributes v_1 to v_m as a whole are called the attribute set. When the word w_i is not characterized by the attribute v_j, that is, when the word w_i is not related to the attribute v_j, the importance is a_ij = 0. The importance a_ij is normalized by some method, for example so that the square root of the sum of the squares of the importances equals 1.

[0004] When the semantic similarity between words is determined using such a concept base, the approaches proposed so far basically compute some kind of semantic distance between the two words to be compared. For example, the two words w_i and w_j are treated as being represented by vectors whose elements correspond to the attributes v_h, and the similarity, which serves as the measure of likeness, is calculated as their inner product according to equation (1) (for details, see the specification of Japanese Patent Application No. Hei 4-315233, filed by the present applicant).
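The formula referred to as equation (1) does not appear in this text. From the description (an inner product of the two importance vectors over the attributes v_h, h = 1 to m), it can be written as follows; the symbol S for the similarity is an assumed name.

```latex
S(w_i, w_j) = \sum_{h=1}^{m} a_{ih}\, a_{jh} \tag{1}
```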

[0005] The above is explained taking the words w_1 = "horse" and w_2 = "pig" as an example. For "horse", the attributes can be v_2 = "mane" (0.3), v_3 = "animal" (0.8), and v_8 = 尾 ("tail") (0.7). For "pig", w_2, the attributes can be v_3 = "animal" (0.8), v_7 = "meat" (0.5), and v_9 = しっぽ (a synonym that also means "tail") (0.2). The figures in parentheses are the importances, normalized to 1. When the similarity between the words "horse" and "pig" is calculated by equation (1), the only common attribute is v_3, and from its importances the similarity is obtained as follows.

similarity = 0.8 × 0.8 = 0.64

When the similarity is calculated by equation (1) in this way, what is chosen as the attribute set of the concept base becomes important. The simplest choice, given that words are what is being characterized, is to use those same words as attributes; the attribute set then coincides with the set of all words. In this case, taking the word "horse" as the example again, "mane", 尾, and so on can serve as its attributes. However, when words are chosen as attributes, synonyms may exist within the attribute set, and the calculation of equation (1) then no longer yields an accurate similarity. That is, in equation (1) the semantically identical 尾 and しっぽ are treated as different attributes, and the importances of these attributes do not contribute to the similarity. For each of these two attributes, the importance for either "horse" or "pig" is 0, so the product for that attribute is zero and nothing is added to the similarity.
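The horse and pig calculation can be reproduced with a few lines of Python. This is a minimal sketch, assuming each word is stored as a mapping from attribute to importance; the labels `tail_o` and `tail_shippo` are hypothetical names for the two synonymous "tail" attributes v_8 and v_9.

```python
# Concept-base rows for the two example words (attribute: importance).
# "tail_o" and "tail_shippo" are hypothetical labels for the two
# synonymous "tail" attributes v_8 and v_9.
horse = {"mane": 0.3, "animal": 0.8, "tail_o": 0.7}
pig = {"animal": 0.8, "meat": 0.5, "tail_shippo": 0.2}


def inner_product_similarity(word_a, word_b):
    """Equation (1): sum of importance products over shared attributes."""
    shared = word_a.keys() & word_b.keys()
    return sum(word_a[attr] * word_b[attr] for attr in shared)


similarity = inner_product_similarity(horse, pig)
print(round(similarity, 2))  # 0.64: only "animal" is shared
```

The two "tail" attributes carry importances 0.7 and 0.2, yet they add nothing to the result because they are stored under different keys.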

[0006] For this reason, to improve the accuracy of determination, the attribute set has been constructed either by letting one word stand as the attribute for a group of synonyms, or by using a thesaurus and taking as the attribute a word that expresses the superordinate concept of the characterizing words. In the first approach, for the synonyms 尾 and しっぽ (both meaning "tail"), 尾 represents them all. In the thesaurus approach, if words denoting kinds of finger such as "thumb" and "little finger" have been selected as characterizing words, their superordinate concept "finger" is used as the representative attribute. In other words, to obtain high accuracy the attribute set should ideally be constructed so that the attributes are semantically independent of, and orthogonal to, one another, and various devices have been tried for that purpose. However, whatever device is used, as long as things that carry meaning, such as words, are chosen as the attribute set, it is impossible to select a group of attributes that are semantically independent of one another, and the devices above are no more than ways of approaching that ideal. Even if 尾 is made the attribute representing its synonyms, there is no guarantee that every other attribute is semantically independent of, in other words dissimilar to, 尾. The same is true when a word expressing the superordinate concept of the characterizing words is used as the attribute.
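The first workaround, letting one word represent its synonyms, can be sketched as a preprocessing step before equation (1). The synonym table, the attribute labels, and the rule of keeping the larger importance when two attributes collapse onto one representative are all assumptions of this sketch; the text does not specify them, and renormalization of the importances is left out.

```python
# Hypothetical synonym table: each attribute maps to its representative.
REPRESENTATIVE = {"tail_shippo": "tail_o"}


def merge_synonyms(word):
    """Rewrite a word's attributes onto their representative attributes.

    If two attributes collapse onto the same representative, the larger
    importance is kept; that rule is an assumption of this sketch.
    """
    merged = {}
    for attr, importance in word.items():
        rep = REPRESENTATIVE.get(attr, attr)
        merged[rep] = max(merged.get(rep, 0.0), importance)
    return merged


def inner_product_similarity(word_a, word_b):
    """Equation (1): sum of importance products over shared attributes."""
    shared = word_a.keys() & word_b.keys()
    return sum(word_a[attr] * word_b[attr] for attr in shared)


horse = {"mane": 0.3, "animal": 0.8, "tail_o": 0.7}
pig = {"animal": 0.8, "meat": 0.5, "tail_shippo": 0.2}

before = inner_product_similarity(horse, pig)
after = inner_product_similarity(merge_synonyms(horse), merge_synonyms(pig))
print(round(before, 2), round(after, 2))  # 0.64 0.78
```

After merging, the "tail" attributes contribute 0.7 × 0.2 = 0.14. Attributes that are merely related rather than synonymous still contribute nothing, which is the limitation the paragraph describes.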

[0007] As described above, when the semantic similarity between words is determined using a concept base, a method that relies on computing some semantic distance between two words, as in equation (1), carries an inherent factor that lowers the accuracy of determination. One could also consider applying some transformation so that the attributes no longer carry any meaning, thereby constructing an attribute set of mutually independent attributes, but this too is considered extremely difficult in practice.

【０００８】[0008]

[Problem to Be Solved by the Invention] The present invention provides, for a method that determines semantic similarity between words using a concept base as described above, a method that selects a group of words as the attributes of the attribute set in the concept base and obtains high determination accuracy even when those words are not semantically independent of one another.

【０００９】[0009]

[Means for Solving the Problem] In a method for determining semantic similarity between words that uses a word meaning database, in which the meaning of each word is expressed by a set of pairs of an attribute v representing a feature of the word w and an importance a of that attribute indicating the depth of the relation between the word and the attribute, and that determines the similarity between words by calculating a semantic distance between them, the method is configured so that attributes v_k and v_l are selected for the two words w_i and w_j under comparison, respectively, and the product a_ik × a_jl × L_kl of the importances a_ik and a_jl of the two selected attributes and a quantity L_kl representing the semantic closeness between the two attributes v_k and v_l is summed over all combinations of the attributes of the two words, the result being taken as the similarity.

[0010] Further, the method is configured so that the quantity representing the semantic closeness between two attributes is the reciprocal of the distance on a thesaurus between the two selected attributes. Alternatively, the method is configured so that this quantity is the information amount of the superordinate concept, on the thesaurus, that is common to the two selected attributes.

【００１１】[0011]

[Embodiments] Embodiments of the present invention will now be described. In the present invention, the similarity that serves as the index for determining the semantic similarity between words is not obtained by simply computing the distance between two words as in the conventional example described above. Instead, an attribute v_k and an attribute v_l are selected for the two words w_i and w_j, respectively, and the product of the importance a_ik of the selected attribute v_k, the importance a_jl of the attribute v_l, and a quantity L_kl representing the semantic closeness between the attributes v_k and v_l is summed over all combinations of attributes according to equation (2).
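The formula referred to as equation (2) does not appear in this text. From the description (the product a_ik × a_jl × L_kl summed over all combinations of the attributes v_k and v_l, with m attributes in the attribute set), it can be written as follows; the symbol S for the similarity is an assumed name.

```latex
S(w_i, w_j) = \sum_{k=1}^{m} \sum_{l=1}^{m} a_{ik}\, a_{jl}\, L_{kl} \tag{2}
```

If L_kl were 1 for k = l and 0 otherwise, this double sum would reduce to the inner product of equation (1).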

[0012] Here L_kl is a quantity representing the semantic closeness, on the thesaurus, between the attributes v_k and v_l.

[0013] As the quantity representing semantic closeness, the reciprocal of the distance between the attributes on the thesaurus, the information amount of the superordinate concept on the thesaurus common to the attributes, or some other quantity can be used. The thesaurus is organized as a tree in which all words are arranged in a semantic hierarchy. An example of such a thesaurus is shown in FIG. 2. The present invention takes into account that, whatever word is selected as an attribute of the attribute set, that word becomes semantically identical to every other word at a superordinate concept of some level. In the calculation by equation (1), for corresponding attributes of the two words, that is, an attribute v_h with the same index h, if either importance a_ih or a_jh is 0, their product is 0, and the importance of the other attribute contributes nothing to the similarity. In the present invention, by contrast, the product of the two importances and the quantity representing the semantic closeness between the two attributes is computed for every combination of attributes of the two words whose similarity is being determined, and the similarity is obtained by summing these products. Consequently, even when one of the importances of a pair of corresponding attributes is 0, if the other attribute, whose importance is not 0, is semantically similar to any of the non-corresponding attributes, the product for that pair is not 0, and this similarity is added into and reflected in the similarity calculated by equation (2).
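The difference between the two calculations can be seen in a short Python sketch. It assumes each word is a mapping from attribute to importance and that the closeness L_kl is supplied as a function. The attribute labels and the closeness values in `toy_closeness` are hypothetical; in the method described here every pair of attributes has some closeness, whereas the toy function uses zeros to keep the arithmetic easy to follow.

```python
def weighted_similarity(word_a, word_b, closeness):
    """Equation (2): sum of a_ik * a_jl * L_kl over all attribute pairs.

    Attributes missing from a word have importance 0 and are skipped,
    since their products vanish.
    """
    return sum(
        imp_a * imp_b * closeness(attr_a, attr_b)
        for attr_a, imp_a in word_a.items()
        for attr_b, imp_b in word_b.items()
    )


def identity_closeness(attr_a, attr_b):
    # L_kl = 1 only for identical attributes: reduces (2) to (1).
    return 1.0 if attr_a == attr_b else 0.0


def toy_closeness(attr_a, attr_b):
    # Hypothetical values: identical attributes 1.0, the two "tail"
    # synonyms 0.5, every other pair 0.0.
    if attr_a == attr_b:
        return 1.0
    if {attr_a, attr_b} == {"tail_o", "tail_shippo"}:
        return 0.5
    return 0.0


horse = {"mane": 0.3, "animal": 0.8, "tail_o": 0.7}
pig = {"animal": 0.8, "meat": 0.5, "tail_shippo": 0.2}

baseline = weighted_similarity(horse, pig, identity_closeness)
with_synonyms = weighted_similarity(horse, pig, toy_closeness)
print(round(baseline, 2), round(with_synonyms, 2))  # 0.64 0.71
```

The extra 0.07 is the cross term 0.7 × 0.2 × 0.5 between the two "tail" attributes, which equation (1) drops.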

[0014] In the similarity calculation of equation (2) according to the present invention, each product is further multiplied by the quantity L_kl representing semantic closeness, so the more semantically similar two attributes are, the more strongly their similarity is reflected in the calculated value. That is, because the similarity between all attributes is taken into account when the similarity is calculated, high determination accuracy can be obtained even when the attribute set contains semantically similar attributes and the attributes are therefore not independent.

[0015] An embodiment of the present invention will be described more concretely. Described here is the case where the reciprocal of the distance on the thesaurus between attributes is used as the quantity representing the semantic closeness between them. In this case the semantic similarity between words is calculated by using the reciprocal of the thesaurus distance as L in equation (2).

[0016] Referring to FIG. 1, let the two words to be compared be w_i and w_j, and let the attribute set be v_1 to v_m. The words w_i and w_j are then m-dimensional vectors whose elements are the importances (a_i1 to a_im, a_j1 to a_jm) of the attributes v_1 to v_m, and are written as follows.

w_i = (a_i1, a_i2, ..., a_ik, ..., a_im)
w_j = (a_j1, a_j2, ..., a_jl, ..., a_jm)

Needless to say, when the attribute v_h is not an attribute that characterizes the word w_i, the importance a_ih is 0. The same applies to the word w_j. Each attribute in the attribute set carries meaning, and here the attributes are taken to be words themselves. That is, the word w_i has the words v_1 to v_m as its attributes and is characterized by their respective importances a_i1 to a_im.

[0017] FIG. 3 shows an example of a thesaurus of the group of words that serve as the attributes of the attribute set. The thesaurus has a tree structure: the word v_1, which is an attribute, belongs to the superordinate concept J_11, and this superordinate concept J_11 in turn belongs to its own superordinate concept J_21. A superordinate concept may itself be a word. Given such a thesaurus, a distance always exists between any two attributes, and the shorter that distance, the closer the two attributes can be regarded as being in meaning. Here the distance between two attributes is the number of branches traversed in going from one attribute to the other. In FIG. 3, for example, the distance between the attribute v_k and the attribute v_l is 6.
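Counting branches between two attributes amounts to walking up from each one to their lowest common superordinate concept. The sketch below assumes the thesaurus is stored as a child-to-parent map; the tree itself is hypothetical and only loosely modeled on the description of FIG. 3 (v_1 under J_11 under J_21), since the figure is not reproduced here.

```python
# Hypothetical thesaurus as a child-to-parent map (the root has no entry).
PARENT = {
    "v1": "J11", "v2": "J11", "v3": "J12", "v4": "J12", "v5": "J13",
    "J11": "J21", "J12": "J21", "J13": "J22",
    "J21": "J31", "J22": "J31",
}


def path_to_root(node):
    """Return the chain node, parent, grandparent, and so on up to the root."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def thesaurus_distance(attr_a, attr_b):
    """Number of branches on the path between two nodes of the tree."""
    steps_from_b = {node: steps for steps, node in enumerate(path_to_root(attr_b))}
    for steps_a, node in enumerate(path_to_root(attr_a)):
        if node in steps_from_b:  # first shared ancestor is the lowest one
            return steps_a + steps_from_b[node]
    raise ValueError("nodes are not in the same tree")


print(thesaurus_distance("v1", "v2"))  # 2
print(thesaurus_distance("v2", "v4"))  # 4
print(thesaurus_distance("v1", "v5"))  # 6
```

In this toy tree, v1 and v5 meet only at the root, three branches above each of them, which gives a distance of 6 like the v_k and v_l example.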

[0018] In this embodiment, the similarity is calculated on the basis of equation (2) as the sum of the products of three quantities: the importances of the two attributes selected for the respective words, and the reciprocal of the distance on the thesaurus between those attributes. That is, the product of the vector elements of the word w_i and the word w_j is further multiplied by the reciprocal of the distance between the corresponding attributes, and the sum of these products is the similarity. In the example above, the reciprocal of the distance between the attributes is 1/6. Accordingly, between the two words being compared, the similarity is high when the attributes whose importance is not 0 lie close to one another on the thesaurus, and low when they lie far apart. Needless to say, this holds even when the two words have no attribute whose importance is non-zero for both. In that case the conventional calculation by equation (1), which takes as the similarity the sum of the products of the importances of common attributes, always gave a similarity of 0.
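A small Python sketch can illustrate this embodiment for words that share no attribute. The thesaurus, the word vectors, and their importances are hypothetical. The text does not say what closeness to use when k = l, where the distance is 0 and its reciprocal is undefined; the sketch assumes a closeness of 1.0 in that case.

```python
# Hypothetical thesaurus as a child-to-parent map (the root has no entry).
PARENT = {
    "v1": "J11", "v2": "J11", "v3": "J12", "v4": "J12", "v5": "J13",
    "J11": "J21", "J12": "J21", "J13": "J22",
    "J21": "J31", "J22": "J31",
}


def path_to_root(node):
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def thesaurus_distance(attr_a, attr_b):
    """Number of branches between two nodes via their lowest common ancestor."""
    steps_from_b = {node: steps for steps, node in enumerate(path_to_root(attr_b))}
    for steps_a, node in enumerate(path_to_root(attr_a)):
        if node in steps_from_b:
            return steps_a + steps_from_b[node]
    raise ValueError("nodes are not in the same tree")


def reciprocal_closeness(attr_a, attr_b):
    distance = thesaurus_distance(attr_a, attr_b)
    # Assumption: identical attributes (distance 0) get closeness 1.0.
    return 1.0 if distance == 0 else 1.0 / distance


def similarity(word_a, word_b):
    """Equation (2) with L_kl taken as the reciprocal thesaurus distance."""
    return sum(
        imp_a * imp_b * reciprocal_closeness(attr_a, attr_b)
        for attr_a, imp_a in word_a.items()
        for attr_b, imp_b in word_b.items()
    )


# Three hypothetical words; no two of them share an attribute.
word_a = {"v1": 0.6, "v3": 0.8}
word_b = {"v2": 0.6, "v4": 0.8}  # attributes close to those of word_a
word_c = {"v5": 1.0}             # attribute far from those of word_a

near = similarity(word_a, word_b)
far = similarity(word_a, word_c)
print(round(near, 2), round(far, 2))  # 0.74 0.23
```

Equation (1) would return 0 for both pairs, whereas this calculation ranks word_b above word_c because its attributes sit closer to those of word_a on the tree.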

[0019] Next, a method that uses, as the quantity representing the semantic closeness between attributes, the information amount of the superordinate concept on the thesaurus common to those attributes is described as another embodiment. It differs from the preceding embodiment only in this quantity and is otherwise identical, so only the way of obtaining the information amount is described. The information amount l of the superordinate concept is defined as follows.

[0020] l = log_2(A / n)   (3)

where
A: the total number of words for which the importance of at least one of the attributes located below the common superordinate concept is non-zero
n: the total number of words in the concept base

With reference to FIG. 3, consider the relation between the attributes v_2 and v_4. The superordinate concept common to these attributes is J_21. Here A is the total number of words for which the importance of any of the attributes v_1 to v_4 is non-zero. If the concept base is as shown in FIG. 4, where a circle (○) indicates a non-zero importance, the words in question are w_1, w_2, w_3, w_5, w_6, and w_8, six in all, so A = 6. With the information amount defined in this way, the information amount of the common superordinate concept is large when two attributes are close to each other on the thesaurus and small when they are far apart. In other words, this information amount equivalently expresses the semantic closeness between attributes.
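The count A and equation (3) can be sketched as follows. FIG. 4 is not reproduced in this text, so the concept base below is a hypothetical stand-in chosen only so that the words w_1, w_2, w_3, w_5, w_6, and w_8 have a non-zero importance for some attribute among v_1 to v_4. The total of eight words (n = 8) and the attributes v5 and v6 are likewise assumptions.

```python
import math

# Hypothetical stand-in for FIG. 4: each word maps to the set of
# attributes whose importance for that word is non-zero.
CONCEPT_BASE = {
    "w1": {"v1"}, "w2": {"v2", "v5"}, "w3": {"v3"}, "w4": {"v5"},
    "w5": {"v4"}, "w6": {"v1", "v4"}, "w7": {"v6"}, "w8": {"v2"},
}

# Attributes located below the common superordinate concept J_21.
UNDER_J21 = {"v1", "v2", "v3", "v4"}


def count_words_with_any(concept_base, attributes):
    """A: number of words with a non-zero importance for any given attribute."""
    return sum(1 for attrs in concept_base.values() if attrs & attributes)


def information_amount(concept_base, attributes):
    """Equation (3) as written in the text: l = log2(A / n)."""
    a = count_words_with_any(concept_base, attributes)
    n = len(concept_base)
    if a == 0:
        raise ValueError("no word uses any of the given attributes")
    return math.log2(a / n)


A = count_words_with_any(CONCEPT_BASE, UNDER_J21)
l_value = information_amount(CONCEPT_BASE, UNDER_J21)
print(A)  # 6
```

The sketch follows equation (3) literally. With A no larger than n, the value it returns is zero or negative and decreases as A shrinks, so a reader applying it as L in equation (2) would need to check the intended sign convention against the original specification.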

[0021] Examples have been described above in which the reciprocal of the distance between attributes on the thesaurus, or the information amount of the superordinate concept on the thesaurus common to the attributes, is chosen as the quantity representing the semantic closeness between attributes. Any comparable quantity may be chosen instead, provided that it represents the semantic closeness between attributes.

【００２２】[0022]

[Effects of the Invention] As described above, when determining the semantic similarity between words, the present invention computes, for every combination of attributes of the two words to be compared, the product of the two importances and the quantity representing the semantic closeness between the two attributes, and obtains the similarity by summing these products. High determination accuracy can therefore be obtained even when the attribute set contains semantically similar attributes that are not independent of one another. Furthermore, with the conventional calculation, accuracy could be expected to fall for pairs of words that have few characterizing attributes of non-zero importance, because such words share few attributes; in the present invention the similarity is calculated with all combinations of attributes taken into account, so this drawback is also eliminated.
High similarity discrimination accuracy can be obtained. Furthermore, depending on the conventional calculation method, for words with few non-zero importance attributes that characterize, the similarity determination accuracy may decrease because there are few common attributes. In the present invention, since the similarity is calculated in consideration of all the attribute combinations, such a drawback is eliminated.

[Brief description of drawings]

[FIG. 1] Diagram showing a concept base.

[FIG. 2] Diagram showing the structure of a thesaurus.

[FIG. 3] Diagram explaining the semantic distance between attributes.

[FIG. 4] Diagram explaining how the information amount of a common superordinate concept is calculated.

[Explanation of symbols]

w_i: word
w_j: word
v_k: attribute
v_l: attribute
a_ik: importance
a_jl: importance
L_kl: quantity representing the semantic closeness between two attributes

Continuation of front page

(56) References
JP-A-6-162099 (JP, A)
JP-A-6-208590 (JP, A)
JP-A-6-131184 (JP, A)
沢田裕司, 大川剛直, 馬場口登, "Realization of an Association Mechanism Considering Viewpoints," Transactions of the Information Processing Society of Japan, May 15, 1994, Vol. 35, No. 5, pp. 714-724
酒匂孝之, 中村順一, 吉田将, "Generation of Metaphorical Expressions Using Apparent Similarity between Concepts and Psychological Evaluation," IEICE Technical Report, July 8, 1993, Vol. 93, No. 131, pp. 17-24
笠原要, 松澤和光, 石川勉, 河岡司, "Viewpoint-Based Determination of Similarity between Concepts," Transactions of the Information Processing Society of Japan, March 15, 1994, Vol. 35, No. 3, pp. 505-509

(58) Fields searched (Int.Cl.⁷, DB name)
G06F 17/30
G06F 17/27
JICST file (JOIS)

Claims

(57) [Claims]

1. A method for determining semantic similarity between words, which uses a word meaning database expressing the meaning of a word by a set of pairs of an attribute representing a feature of the word and an importance of the attribute indicating the depth of the relation between the word and the attribute, and which determines the similarity between words by calculating a semantic distance between them, characterized in that an attribute is selected for each of the two words whose similarity is to be determined, and the product of three quantities, namely the importances of the two selected attributes and a quantity representing the semantic closeness between the two attributes, is summed over all combinations of the attributes of the two words, the result being taken as the similarity.

2. The method for determining semantic similarity between words according to claim 1, characterized in that the quantity representing the semantic closeness between two attributes is the reciprocal of the distance on a thesaurus between the two selected attributes.

3. The method for determining semantic similarity between words according to claim 1, characterized in that the quantity representing the semantic closeness between two attributes is the information amount of the superordinate concept on a thesaurus that is common to the two selected attributes.