JP2010117764A

JP2010117764A - Inter-word association degree determining device, inter-word association degree determination method, program, and recording medium

Info

Publication number: JP2010117764A
Application number: JP2008288649A
Authority: JP
Inventors: Masashi Uchiyama; 匡内山; Toshiro Uchiyama; 俊郎内山; Katsuto Bessho; 克人別所
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-11-11
Filing date: 2008-11-11
Publication date: 2010-05-27
Anticipated expiration: 2028-11-11
Also published as: JP5131923B2

Abstract

PROBLEM TO BE SOLVED: To provide an inter-word association degree determination device that efficiently and successively calculates an approximate value of the degree of association between two words; that is, one that makes the accumulation processing, which would otherwise grow in proportion to the number of dimensions, more efficient without significantly degrading the overall accuracy.

SOLUTION: A corpus is input and, for each arbitrary word, the co-occurrence frequency between that word and the word semantic attribute corresponding to each dimension is tabulated to generate a frequency vector. The frequency vector is normalized as probabilities according to the tree structure of the word semantic attributes to generate a concept vector. The distance between the two concept vectors obtained for two arbitrary words is computed by accumulating, from the root node toward the leaf nodes, local distances calculated according to the tree structure from the probability corresponding to a parent node and the probabilities corresponding to its child nodes, thereby calculating the degree of association between the two words.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to an inter-word relevance determination device, an inter-word relevance determination method, a program, and a recording medium that efficiently calculate the relevance between words.

A concept base is used as one of the databases for determining the similarity between words, searching for synonyms, and searching for related documents.

A “concept base” is a database consisting of pairs of a word and the concept vector corresponding to that word. One known type is the dictionary concept base, which is created from the definition sentences of a Japanese-language dictionary (see, for example, Patent Document 1).

Another known type is the corpus concept base, which is created from a corpus, that is, a large collection of documents such as newspaper articles (see, for example, Non-Patent Document 1).

The “concept vector” of a given word is calculated from the frequency with which the word co-occurs with each of a plurality of predetermined co-occurrence words within the range (for example, a sentence) in which the word appears.

In a dictionary concept base, the co-occurrence words are the words that appear in the definition sentence obtained by looking the word up in the dictionary; in a corpus concept base, they are words that appear with high frequency in the corpus. A co-occurrence matrix is created in which each row is a word, each column is a co-occurrence word, and each element is the co-occurrence frequency of the word and the co-occurrence word.

In a dictionary concept base, each row vector of the co-occurrence matrix is the concept vector of a word. The concept vector is usually refined using, for example, the definition sentences obtained by looking up, in turn, the words contained in the definition sentence.

In a corpus concept base, singular value decomposition is used to create a matrix in which the column dimension of the co-occurrence matrix is compressed, and each row vector of this compressed matrix is a concept vector.

A concept base created in this way has the property that the more similar two words are, the closer their concept vectors are, so it is effective for determining the similarity between words. That is, the shorter the distance between the concept vectors of two words, the higher the similarity between the two words can be judged to be.

Patent Document 1: Japanese Patent No. 3379603
Non-Patent Document 1: H. Schutze, “Dimensions of meaning”, Proceedings of Supercomputing '92, pp. 787-796, 1992

In a concept base created by these methods, the number of dimensions of the concept vector greatly affects both the accuracy of the distance calculation and the calculation time.

That is, in the conventional examples above, increasing the number of dimensions to improve the accuracy of the distance calculation requires calculation time proportional to the number of dimensions. Conversely, reducing the number of dimensions to limit the calculation time causes information that is actually needed to be lost from the concept vector, which lowers the accuracy of the distance calculation.

An object of the present invention is to provide an inter-word relevance determination device, an inter-word relevance determination method, a program, and a recording medium that can efficiently and successively calculate an approximate value of the relevance between two words; that is, that can make the accumulation processing, which would otherwise increase in proportion to the number of dimensions, more efficient without greatly impairing the overall accuracy.

The present invention is an inter-word relevance determination device comprising: frequency vector generation means that receives a corpus as input and, for an arbitrary word, generates a frequency vector by counting the co-occurrence frequencies of the word and the word semantic attribute corresponding to each dimension; concept vector generation means that generates a concept vector by normalizing the frequency vector as probabilities according to the tree structure of the word semantic attributes; storage means that stores the concept vector; and inter-word relevance calculation means that calculates the relevance between any two words by obtaining the distance between the two concept vectors, which result from applying the frequency vector generation means and the concept vector generation means to the two words, as the accumulation, from the root node toward the leaf nodes, of local distances calculated according to the tree structure from the probability corresponding to a parent node and the probabilities corresponding to its child nodes.

According to the present invention, each dimension of the concept vector is associated with a node of the tree structure of the word semantic attributes, and local distances calculated from the node value of a parent node and the node values of its child nodes are accumulated from the root node toward the leaf nodes. Even if this accumulation is terminated partway, relatively good accuracy is assured. In other words, terminating the accumulation partway reduces the amount of calculation while preserving the accuracy of the approximate value of the relevance between the two words.

The best mode for carrying out the invention is the following embodiment.

FIG. 1 is a block diagram showing an inter-word relevance determination device 100 according to Embodiment 1 of the present invention.

The inter-word relevance determination device 100 includes corpus input means 10, frequency vector generation means 20, concept vector generation means 30, concept vector storage means 40, and inter-word relevance calculation means 50.

The corpus to be processed is input via the corpus input means 10. Using the input corpus, the frequency vector generation means 20 generates frequency vectors. The concept vector generation means 30 normalizes each frequency vector as probabilities and stores the result in the concept vector storage means 40.

Here, “normalizing a frequency vector as probabilities” means the operation of converting a frequency vector (x1, x2, …, xN), whose elements sum to S, into (x1/S, x2/S, …, xN/S) so that the elements sum to 1.
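The normalization just described can be illustrated with a minimal Python sketch. The function name `normalize_as_probabilities` and the handling of an all-zero vector are assumptions made for illustration; the publication does not define either.

```python
def normalize_as_probabilities(freq):
    """Scale a frequency vector so that its elements sum to 1."""
    total = sum(freq)  # S in the text
    if total == 0:
        # Assumption: an all-zero vector is returned unchanged,
        # because the source does not define this case.
        return list(freq)
    return [x / total for x in freq]


probs = normalize_as_probabilities([2, 1, 1])
print(probs)  # [0.5, 0.25, 0.25]
```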

Finally, the inter-word relevance calculation means 50 calculates the relevance between words.

Next, the frequency vector generation means 20 will be described.

FIG. 2 is a block diagram showing the frequency vector generation means 20.

The frequency vector generation means 20 includes sentence division means 21, word extraction means 22, word-word semantic attribute co-occurrence counting means 23, and a word-word semantic attribute dictionary 24.

The sentence division means 21 divides the input corpus into sentences. The corpus is split immediately after each sentence-final mark, such as the full stop 「。」 or the symbols 「？」 and 「．」. The word extraction means 22 consists of a morphological analyzer and divides each sentence into words. Based on the part-of-speech information that the morphological analyzer outputs for each word, words other than content words are removed. For each extracted word, the corresponding word semantic attributes are retrieved from the word-word semantic attribute dictionary 24.
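As a rough illustration of the sentence division step only, the following Python sketch splits text after the three sentence-final marks named above. The name `split_sentences`, the treatment of consecutive marks as one boundary, and the keeping of trailing text without a final mark are assumptions; morphological analysis and content-word filtering are not shown because they depend on an external analyzer.

```python
import re

# Sentence-final marks named in the text (full-width forms).
SENTENCE_END = "。？．"

# A sentence is a run of other characters followed by any run of end marks.
_SENTENCE = re.compile(f"[^{SENTENCE_END}]+[{SENTENCE_END}]*")


def split_sentences(corpus):
    """Split a corpus immediately after each run of sentence-final marks."""
    return [s.strip() for s in _SENTENCE.findall(corpus) if s.strip()]


sentences = split_sentences("与党は幹事長が総選挙を指揮した。本当か？？はい。")
print(sentences)
```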

FIG. 3 is a diagram showing an example of the word-word semantic attribute dictionary 24 in Embodiment 1.

Words are classified in advance into a plurality of categories according to their semantic attributes. As shown in FIG. 3, for example, the dictionary lists each word together with its corresponding word semantic attributes. Such attributes are described in, for example, Non-Patent Document 2 (Satoru Ikehara, Masahiro Miyazaki, Satoshi Shirai, Akio Yokoo, Hiromi Nakaiwa, Kentaro Ogura, Yoshifumi Ooyama, Yoshihiko Hayashi, “Nihongo Goi-Taikei 1: Imi Taikei” (Japanese Lexicon, Vol. 1: Semantic System), Iwanami Shoten, 1999).

As shown in FIG. 3, the word “general election” (総選挙) is given the word semantic attribute “election” (選挙), which corresponds to node 1804 in FIG. 4, described later. The attributes above “election” are “recommendation” (推挙), “personnel affairs” (人事), “management” (管理), “control” (支配), “act” (行為), “human activity” (人間活動), “matter” (事), “abstract” (抽象), and “noun” (名詞).

FIG. 13 is a diagram showing an example of the structure of the word semantic attributes in Embodiment 1 (the part related to “secretary-general” (幹事長)).

FIG. 14 is a diagram showing an example of the structure of the word semantic attributes in Embodiment 1 (the part related to “command” (指揮)).

The words “secretary-general” and “command” correspond to the nodes in FIGS. 13 and 14, respectively, in the same way. That is, as shown in FIG. 3, “secretary-general” is given the word semantic attributes “chief” (長), “politician” (政治家), and “ruler” (治者), which correspond to nodes 323, 260, and 167 in FIG. 13. For example, the attributes above “chief” are “person <status>” (人＜地位＞), “person” (人), “agent” (主体), “concrete” (具体), and “noun” (名詞). Likewise, as shown in FIG. 3, “command” is given the word semantic attributes “management” (管理), “order” (命令), and “creation (other)” (創造（その他）), which correspond to nodes 1779, 1824, and 1559 in FIG. 14. For example, the attributes above “management” are “control”, “act”, “human activity”, “matter”, “abstract”, and “noun”.

Based on these, the word-word semantic attribute co-occurrence counting means 23 totals the co-occurrence frequencies of words and word semantic attributes as frequency vectors.

FIG. 4 is a diagram showing an example of the structure of the word semantic attributes in Embodiment 1.

Each dimension of the frequency vector corresponds to a node obtained by serializing the tree structure shown in FIG. 4. Serializing the tree structure means arranging the nodes in numerical order.

FIG. 4 shows only part of the tree structure; the whole tree contains 2715 nodes corresponding to word semantic attributes (see Non-Patent Document 2 above). The frequency vector is therefore a 2715-dimensional vector.

An example of counting word-word semantic attribute co-occurrence frequencies is given below.

For the frequency vector of each word extracted from a sentence, 1 is added to the dimensions corresponding to the word semantic attributes of every other word appearing in the same sentence and to the dimensions corresponding to the attributes above them. For example, if the word “general election” (総選挙) appears in a sentence of the corpus that contains the word “ruling party” (与党), 1 is added to dimensions 1804, 1802, 1793, 1779, 1778, 1560, 1236, 1235, 1000, and 1 of the frequency vector of “ruling party”.
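The counting rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the child-to-parent links in `PARENT` pair the node numbers quoted above with the listed attributes in order, `WORD_ATTRS` is a hypothetical two-entry dictionary (only the “management” attribute of 指揮 is included, because the text gives the ancestor chain for that attribute only), and a sparse `Counter` keyed by node number stands in for the 2715-dimensional vector.

```python
from collections import Counter

# Child -> parent links along the path quoted in the text (assumed pairing).
PARENT = {1804: 1802, 1802: 1793, 1793: 1779, 1779: 1778, 1778: 1560,
          1560: 1236, 1236: 1235, 1235: 1000, 1000: 1}

# Hypothetical, partial word -> attribute-node dictionary.
WORD_ATTRS = {"総選挙": [1804], "指揮": [1779]}


def with_ancestors(node):
    """Return the node followed by all of its ancestors up to the root."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def frequency_vectors(sentences):
    """sentences: list of lists of content words, one list per sentence."""
    vectors = {}
    for words in sentences:
        for i, target in enumerate(words):
            vec = vectors.setdefault(target, Counter())
            for j, other in enumerate(words):
                if i == j:
                    continue  # a word does not co-occur with itself
                for attr in WORD_ATTRS.get(other, []):
                    # Add 1 to the attribute's dimension and every ancestor.
                    vec.update(with_ancestors(attr))
    return vectors


vectors = frequency_vectors([["与党", "総選挙", "指揮"]])
print(vectors["与党"][1804], vectors["与党"][1779])  # 1 2
```

In this sketch, dimensions shared by both co-occurring words (node 1779 and above) receive 2, while dimensions reached only through “election” receive 1.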

For “general election” (総選挙), attention is paid to its word semantic attribute “election” (see FIG. 3), and the relationships in the tree structure of FIG. 4 are consulted.

Words appearing in the same sentence as “general election” are assumed to have a meaning related to “general election”, so the attributes of “general election” are assigned to those words. Specifically, one point is added to the value of each of the dimensions 1804, 1802, 1793, 1779, 1778, 1560, 1236, 1235, 1000, and 1, which correspond to the attributes of “general election”. If “secretary-general” (幹事長) and “command” (指揮) appear in the same sentence, their attributes are assigned in the same way. The dimensions to which one point is added in that case are as shown in FIG. 5. In the end, the dimensions that best represent the concept of the sentence receive high scores.

The same operation is performed over the entire corpus, and a frequency vector is obtained for every word contained in the corpus.

The following describes an example of obtaining the frequency vector of the word “ruling party” (与党), treating the single sentence 「与党は幹事長が総選挙を指揮した」 (roughly, “In the ruling party, the secretary-general directed the general election”) as the corpus.

In this example, “ruling party”, “secretary-general”, “general election”, and “command” are extracted as content words. The word semantic attributes of “secretary-general”, “general election”, and “command”, which co-occur with “ruling party”, are defined as shown in FIG. 3.

FIG. 5 is a diagram showing an example of words and their higher-level word semantic attributes in Embodiment 1.

FIG. 5 shows the higher-level word semantic attributes obtained for these words according to the tree structure shown in FIG. 4.

FIG. 6 is a diagram in which the occurrence frequencies of the word semantic attributes are totaled for the words “secretary-general”, “general election”, and “command”, and the totals are shown on the tree structure.

In FIG. 6, the notation “= 0” is omitted for nodes whose value is 0.

FIG. 7 is a diagram showing an example of a frequency vector in Embodiment 1.

The frequency vector shown in FIG. 7 is obtained by serializing the tree structure shown in FIG. 6 (arranging the nodes in numerical order).

Next, the concept vector generation means 30 will be described in detail.

FIG. 8 is a block diagram showing the configuration of the concept vector generation means 30.

First, the vector-tree structure conversion means 31 converts the frequency vector into a tree structure by the reverse of the serialization described above.

Next, the sibling node normalization means 32 normalizes the values within each group of sibling nodes. Specifically, the value of the root node is first set to 1. The values of the child nodes immediately below the root node are then defined in proportion to their totals so that they sum to 1. Similarly, the values of the child nodes immediately below each parent node are defined in proportion to their totals so that they sum to 1. The value of each node corresponds to the conditional probability that the word of interest has the semantic attribute corresponding to that node, given that it has the semantic attribute corresponding to the node's parent.

These conditional probabilities are the numbers written to the right of the “=” that follows each word in FIG. 9, described later.

Finally, the tree structure-vector conversion means 33 serializes the tree structure again to obtain the concept vector.
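The sibling-wise normalization can be sketched in Python as below. The tree, node numbers, and counts in the example are hypothetical and are not taken from FIG. 6; assigning 0 to every child of a group whose total count is zero is also an assumption, since the text does not cover that case.

```python
def normalize_siblings(children, counts, root):
    """Normalize counts so that each group of sibling nodes sums to 1.

    children: dict mapping a node to the list of its child nodes
    counts:   dict mapping a node to its tallied frequency
    Returns a dict mapping each node to its conditional probability.
    """
    values = {root: 1.0}  # the root node is fixed at 1
    stack = [root]
    while stack:
        parent = stack.pop()
        kids = children.get(parent, [])
        total = sum(counts.get(c, 0) for c in kids)
        for c in kids:
            # Assumption: a sibling group with zero total gets all zeros.
            values[c] = counts.get(c, 0) / total if total else 0.0
            stack.append(c)
    return values


# Hypothetical five-node tree: 1 -> (2, 3), 2 -> (4, 5).
children = {1: [2, 3], 2: [4, 5]}
counts = {1: 4, 2: 3, 3: 1, 4: 1, 5: 2}
values = normalize_siblings(children, counts, root=1)
print(values[2], values[3])  # 0.75 0.25
```

Serializing `values` in node-number order would then give the concept vector in the sense used above.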

FIG. 9 is a diagram showing an example of concept vector calculation in Embodiment 1.

FIG. 10 is a diagram showing an example of a concept vector in Embodiment 1.

In the example above, the frequency vector (FIG. 7) corresponding to the tree structure shown in FIG. 6 is converted into a tree structure, normalized within each group of sibling nodes (FIG. 9), and serialized again, which yields the concept vector shown in FIG. 10.

Next, the inter-word relevance calculation means 50 will be described in detail.

The relevance between two concept vectors v and w is calculated as follows. Let S be the set of all internal nodes, and let v_k and w_k be the values of node k in the concept vectors v and w, respectively.

Let S_k be the set of child nodes immediately below node k. The distance between the child nodes (sibling nodes) is defined as the Kullback-Leibler distance by Equation (1), and the local distance relating to parent node k and its immediate child nodes is defined by Equation (2). The quantity appearing in Equation (2) is defined by Equation (3), where R_k is the set of all nodes consisting of the parent node k and the nodes above it, including the root node.

The sum in Equation (1) is taken over those nodes i for which the values v_i and w_i of the concept vectors v and w are both nonzero.
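Because the equations themselves are not reproduced in this text, the following LaTeX is only a hedged reading of the definitions, not the publication's own formulas. The symbols d_k, p_k, and D_k are introduced here for illustration. The first line is the standard Kullback-Leibler form restricted to nonzero entries, with the direction from v to w assumed. The second line assumes that the quantity defined over R_k is the product of the node values of v along the path from the root to node k, and that it weights the sibling distance.

```latex
% Sibling distance under parent node k (standard KL form; direction assumed)
d_k(v, w) = \sum_{\substack{i \in S_k \\ v_i \neq 0,\; w_i \neq 0}} v_i \log \frac{v_i}{w_i}

% Assumed weighting: probability of reaching node k, then the local distance
p_k(v) = \prod_{j \in R_k} v_j , \qquad D_k(v, w) = p_k(v)\, d_k(v, w)
```

Under this reading, summing D_k over all internal nodes corresponds to the chain rule for the Kullback-Leibler divergence of a tree-structured distribution, which is one reason the weighting is plausible.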

Instead of the Kullback-Leibler distance, the α-divergence can be introduced; by setting the parameter α appropriately, a relevance that emphasizes the superordinate or subordinate concepts of the words can be obtained. The α-divergence is described in, for example, Non-Patent Document 3 (T. Minka, “Divergence measures and message passing,” MSR-TR-2005-173, 2005).

In this case, the distance between the child nodes (sibling nodes) is defined by Equation (4), and the local distance relating to parent node k and its immediate child nodes is defined by Equation (5).

As with the sum in Equation (1), the sum in Equation (4) is calculated over those nodes i for which the values v_i and w_i of the concept vectors v and w are both nonzero.
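For orientation, the sketch below computes the α-divergence between two discrete distributions in the form given in the cited Minka report. It is not a reconstruction of Equation (4): the function name is hypothetical, the sum runs over all entries rather than only those where both values are nonzero, and only 0 < α < 1 is handled.

```python
def alpha_divergence(p, q, alpha):
    """Alpha-divergence of two discrete distributions (Minka's form).

    D_alpha(p || q) = sum(alpha*p + (1-alpha)*q - p**alpha * q**(1-alpha))
                      / (alpha * (1 - alpha))
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("this sketch only handles 0 < alpha < 1")
    s = sum(alpha * pi + (1 - alpha) * qi - (pi ** alpha) * (qi ** (1 - alpha))
            for pi, qi in zip(p, q))
    return s / (alpha * (1 - alpha))


# As alpha approaches 1 the value approaches the KL distance from p to q.
near_kl = alpha_divergence([0.5, 0.5], [0.9, 0.1], 0.999)
print(near_kl)
```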

The relevance between the two concept vectors v and w is evaluated by Equation (6), which accumulates the local distance over the internal nodes, proceeding from the root node toward the leaf nodes.

Carrying out the accumulation of Equation (6) from the root node toward the leaf nodes amounts to accumulating the inter-word relevance from the higher-level word semantic attributes toward the lower-level ones.

A value obtained partway through this process can be used as a suitable approximation of the inter-word relevance that takes into account the word semantic attributes from the top level down to that point.
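The root-to-leaf accumulation with optional early termination might be sketched in Python as follows. Several points are assumptions rather than statements from the publication: the local distance is taken to be the sibling Kullback-Leibler distance weighted by the product of the values of v along the path from the root, nodes are visited breadth-first, and the names `accumulate_distance` and `threshold` are hypothetical. The tree and the values are toy data.

```python
import math
from collections import deque


def kl_children(v, w, kids):
    """KL distance over sibling nodes, skipping entries where either is 0."""
    return sum(v[i] * math.log(v[i] / w[i])
               for i in kids if v.get(i, 0) > 0 and w.get(i, 0) > 0)


def accumulate_distance(children, v, w, root, threshold=None):
    """Accumulate local distances from the root toward the leaves.

    Returns (total, completed); completed is False if the running total
    exceeded `threshold` and the accumulation was cut short.
    """
    total = 0.0
    queue = deque([(root, v[root])])  # (node, assumed path weight)
    while queue:
        node, weight = queue.popleft()
        kids = children.get(node, [])
        if not kids:
            continue  # leaf nodes contribute no local distance
        total += weight * kl_children(v, w, kids)
        if threshold is not None and total > threshold:
            return total, False  # stop before reaching the leaves
        for c in kids:
            queue.append((c, weight * v.get(c, 0.0)))
    return total, True


children = {1: [2, 3], 2: [4, 5]}
v = {1: 1.0, 2: 0.5, 3: 0.5, 4: 0.5, 5: 0.5}
w = {1: 1.0, 2: 0.9, 3: 0.1, 4: 0.5, 5: 0.5}
full = accumulate_distance(children, v, w, root=1)
early = accumulate_distance(children, v, w, root=1, threshold=0.1)
print(full, early)
```

With these toy values the two vectors already differ at the first level, so the thresholded call stops after the root node without visiting the lower levels.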

FIG. 11 is a diagram showing, for node 1779 as an example, the set R_1779 of all nodes above node 1779 and the set S_1779 of child nodes immediately below node 1779.

FIG. 12 is a flowchart showing the operation of Embodiment 1.

In S1, the corpus to be processed is input. In S2, for an arbitrary word, a frequency vector is generated from the input corpus by counting the co-occurrence frequencies of the word and the word semantic attribute corresponding to each dimension, and is stored in a storage device.

In S3, a concept vector is generated by normalizing the frequency vector within each group of sibling nodes according to the tree structure of the word semantic attributes, and in S4 the concept vector is stored.

In S5, the inter-word relevance is calculated from the stored concept vectors according to Equations (1) to (6) and is stored in the storage device.

According to the embodiment above, the similarity determination is accumulated from the root toward the leaves, so a determination that two words are not similar can be made at an intermediate node. In that case, processing need not continue down to the leaves (the terminal child nodes), so the processing time is reduced.

If each means in the embodiment above is replaced with a step, the embodiment is an example of the invention of an inter-word relevance determination method.

The frequency vector generation means 20 is an example of frequency vector generation means that receives a corpus as input and, for an arbitrary word, generates a frequency vector by counting the co-occurrence frequencies of the word and the word semantic attribute corresponding to each dimension.

The concept vector generation means 30 is an example of concept vector generation means that generates a concept vector by normalizing the frequency vector as probabilities according to the tree structure of the word semantic attributes.

The concept vector storage means 40 is an example of storage means that stores the concept vector.

The inter-word relevance calculation means 50 is an example of inter-word relevance calculation means that calculates the relevance between any two words by obtaining the distance between the two concept vectors, which result from applying the frequency vector generation means and the concept vector generation means to the two words, as the accumulation, from the root node toward the leaf nodes, of local distances calculated according to the tree structure from the probability corresponding to a parent node and the probabilities corresponding to its child nodes.

The inter-word relevance calculation means may also be means that interrupts the accumulation when the approximate value of the relevance between the two words exceeds a preset value. This limits the time needed to calculate the relevance without greatly impairing its accuracy, and makes the accumulation processing, which would otherwise increase in proportion to the number of dimensions, more efficient.

Alternatively, the inter-word relevance calculation means may be means that carries out the accumulation down to the last child node of the tree structure, which gives the highest accuracy for the relevance between the two words.

FIG. 1 is a block diagram showing the inter-word relevance determination device 100 according to Embodiment 1 of the present invention.
FIG. 2 is a block diagram showing the frequency vector generation means 20.
FIG. 3 is a diagram showing an example of the word-word semantic attribute dictionary 24 in Embodiment 1.
FIG. 4 is a diagram showing an example of the structure of the word semantic attributes in Embodiment 1.
FIG. 5 is a diagram showing an example of words and their higher-level word semantic attributes in Embodiment 1.
FIG. 6 is a diagram in which the occurrence frequencies of the word semantic attributes are totaled for the words “secretary-general” (幹事長), “general election” (総選挙), and “command” (指揮), and the totals are shown on the tree structure.
FIG. 7 is a diagram showing an example of a frequency vector in Embodiment 1.
FIG. 8 is a block diagram showing the concept vector generation means 30.
FIG. 9 is a diagram showing an example of concept vector calculation in Embodiment 1.
FIG. 10 is a diagram showing an example of a concept vector in the embodiment.
FIG. 11 is a diagram showing an example of the set R_1779 of all nodes above node 1779 and the set S_1779 of child nodes immediately below node 1779.
FIG. 12 is a flowchart showing the operation of Embodiment 1.
FIG. 13 is a diagram showing an example of the structure of the word semantic attributes in Embodiment 1 (the part related to “secretary-general”).
FIG. 14 is a diagram showing an example of the structure of the word semantic attributes in Embodiment 1 (the part related to “command”).

Explanation of symbols

100: inter-word relevance determination device
10: corpus input means
20: frequency vector generation means
21: sentence division means
22: word extraction means
23: word-word semantic attribute co-occurrence counting means
24: word-word semantic attribute dictionary
30: concept vector generation means
31: vector-tree structure conversion means
32: sibling node normalization means
33: tree structure-vector conversion means
40: concept vector storage means
50: inter-word relevance calculation means

Claims

An inter-word relevance determination device comprising:
frequency vector generation means for receiving a corpus as input and, for an arbitrary word, generating a frequency vector by counting the co-occurrence frequencies of the word and the word semantic attribute corresponding to each dimension;
concept vector generation means for generating a concept vector by normalizing the frequency vector as probabilities according to the tree structure of the word semantic attributes;
storage means for storing the concept vector; and
inter-word relevance calculation means for calculating the relevance between any two words by obtaining the distance between the two concept vectors, which result from applying the frequency vector generation means and the concept vector generation means to the two words, as the accumulation, from the root node toward the leaf nodes, of local distances calculated according to the tree structure from the probability corresponding to a parent node and the probabilities corresponding to its child nodes.

The inter-word relevance determination device according to claim 1,
wherein the inter-word relevance calculation means is means for interrupting the accumulation when the approximate value of the relevance between the two words exceeds a preset value.

The inter-word relevance determination device according to claim 1,
wherein the inter-word relevance calculation means is means for carrying out the accumulation down to the last child node of the tree structure.

An inter-word relevance determination method comprising:
a frequency vector generation step of receiving a corpus as input, generating, for an arbitrary word, a frequency vector by counting the co-occurrence frequencies of the word and the word semantic attribute corresponding to each dimension, and storing the frequency vector in a storage device;
a concept vector generation step of generating a concept vector by normalizing the frequency vector as probabilities according to the tree structure of the word semantic attributes, and storing the concept vector in a storage device; and
an inter-word relevance calculation step of calculating the relevance between any two words by obtaining the distance between the two concept vectors, which result from applying the frequency vector generation step and the concept vector generation step to the two words, as the accumulation, from the root node toward the leaf nodes, of local distances calculated according to the tree structure from the probability corresponding to a parent node and the probabilities corresponding to its child nodes, and storing the relevance in a storage device.

The inter-word relevance determination method according to claim 4,
wherein the inter-word relevance calculation step is a step of interrupting the accumulation when the approximate value of the relevance between the two words exceeds a preset value.

The inter-word relevance determination method according to claim 4,
wherein the inter-word relevance calculation step is a step of carrying out the accumulation down to the last child node of the tree structure.

A program that causes a computer to function as each means that constitutes the inter-word relevance determination device according to claim 1.

A computer-readable recording medium in which a program that causes a computer to function as each means constituting the inter-word relevance determination device according to claim 1 is recorded.