JP2012043294A

JP2012043294A - Binomial relationship categorization program, method, and device for categorizing semantically similar word pair by binomial relationship

Info

Publication number: JP2012043294A
Application number: JP2010185391A
Authority: JP
Inventors: Asuka Sumida; 飛鳥隅田; Kazunori Matsumoto; 一則松本; Hajime Hattori; 元服部; Toshihiro Ono; 智弘小野
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2010-08-20
Filing date: 2010-08-20
Publication date: 2012-03-01
Anticipated expiration: 2030-08-20
Also published as: JP5504097B2

Abstract

PROBLEM TO BE SOLVED: To provide a binomial relationship categorization program and the like that can categorize semantically similar word pairs by binomial relationship, handling inter-noun relationships and inter-verb/adjective relationships collectively as inter-word relationships, without predefining the inter-word relationships to be acquired. SOLUTION: A plurality of word pairs that tend to co-occur at or above a prescribed threshold are extracted from a sentence set storage section, and for each word in a word pair the set of dependency words that co-occur with it is extracted from the sentence set storage section. Then, a first set of characteristic dependency words, consisting of dependency words that appear in the first set of dependency words but not in the second set, and a second set of characteristic dependency words, consisting of dependency words that appear in the second set of dependency words but not in the first set, are extracted. Further, for each dependency word, the appearance frequency of its co-occurrences with the word within the set of sentences is counted, a vector is derived, and word pair clusters are generated by partitioning optimization clustering based on the degree of similarity between the vectors.

Description

The present invention relates to a technique for extracting inter-word relationships from text.

Conventionally, there are techniques for extracting from text inter-word relationships composed of two terms, such as inter-noun relationships (for example, hypernym-hyponym relationships and part-whole relationships) and inter-verb/adjective relationships. To extract inter-word relationships from text automatically, there are a first conventional technique that uses lexico-syntactic patterns and a second conventional technique that uses co-occurrence information on nouns in a dependency relationship.

According to the first conventional technique, inter-word relationships are extracted by applying to text lexico-syntactic patterns that include conjugated forms of parts of speech, conjunctions, and the like (see, for example, Non-Patent Documents 1, 2, 3, 8, 9, and 10). A lexico-syntactic pattern is a pattern that uses words and dependency relationships, such as 「＊などの＊」 ("* such as *") (see, for example, Non-Patent Document 8). For example, a hypernym-hyponym relationship can be extracted by applying a lexico-syntactic pattern to an example sentence as follows.
Example sentence: 「ソメイヨシノなどの桜」 ("cherry trees such as Somei Yoshino")
Lexico-syntactic pattern: 「＊などの＊」 ("* such as *")
Hypernym-hyponym relationship: (桜, ソメイヨシノ) (cherry tree, Somei Yoshino)
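The pattern application described above can be sketched as follows. This is a minimal illustration, not the method of the cited documents: the helper name `extract_hypernym_pair` is hypothetical, and matching on the raw character string is a simplifying assumption (a practical system would match over morphologically analyzed and parsed text).

```python
import re

# Hypothetical helper: matches the surface pattern "<hyponym>などの<hypernym>"
# on a short phrase. Raw string matching is a simplification.
_NADO_NO = re.compile(r"(.+)などの(.+)")


def extract_hypernym_pair(phrase):
    """Return (hypernym, hyponym) if the whole phrase matches the pattern, else None."""
    m = _NADO_NO.fullmatch(phrase)
    if m is None:
        return None
    hyponym, hypernym = m.group(1), m.group(2)
    return (hypernym, hyponym)


print(extract_hypernym_pair("ソメイヨシノなどの桜"))
```

A phrase that does not contain 「などの」, such as 「庭の桜」, yields `None`, which mirrors the limitation that relationships not matching any pattern cannot be extracted.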

According to the second conventional technique, for a given verb/adjective pair, the higher the similarity between the sets of words that depend on the elements of the pair, the more strongly a semantic relationship between the two elements is presumed (see, for example, Patent Documents 1 and 2 and Non-Patent Documents 4, 5, and 11). For example, as shown below, the two predicates 「ぶらつく」 (stroll) and 「行く」 (go) are presumed to be semantically related because nouns depend on both of them in common.
Dependent nouns of the predicate 「ぶらつく」 (stroll): 「河原」 (riverbank), 「街」 (town), 「公園」 (park)
Dependent nouns of the predicate 「行く」 (go): 「街」 (town), 「公園」 (park), 「砂浜」 (sandy beach)
Nouns that depend on both predicates: 「街」 (town), 「公園」 (park)
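The idea of comparing dependent-noun sets can be sketched as below, using the sets from the example. The Jaccard coefficient is an assumption chosen for illustration; the text only says that a higher similarity between the sets suggests a semantic relationship and does not name a specific measure. The function names are hypothetical.

```python
# Dependent-noun sets taken from the example in the text.
dependents = {
    "ぶらつく": {"河原", "街", "公園"},
    "行く": {"街", "公園", "砂浜"},
}


def shared_nouns(p1, p2, table):
    """Nouns that depend on both predicates."""
    return table[p1] & table[p2]


def jaccard(p1, p2, table):
    """Jaccard coefficient of the two dependent-noun sets (an assumed measure)."""
    a, b = table[p1], table[p2]
    union = a | b
    return len(a & b) / len(union) if union else 0.0


common = shared_nouns("ぶらつく", "行く", dependents)
score = jaccard("ぶらつく", "行く", dependents)
print(sorted(common), score)
```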

Patent Document 1: JP 2010-129025 A
Patent Document 2: JP 2009-265889 A

Non-Patent Document 1: T. Inui and M. Okumura, "Investigating the characteristics of causal relations in Japanese text," in Proceedings of the Workshop on Frontiers in Corpus Annotations II, 2005, 37-44.
Non-Patent Document 2: K. Torisawa, "Automatic acquisition of expressions representing preparation and utilization of an object," in Proceedings of the Recent Advances in Natural Language Processing, 2005, 556-560.
Non-Patent Document 3: S. Abe, K. Inui, and Y. Matsumoto, "Acquiring event relation knowledge by learning cooccurrence patterns and fertilizing cooccurrence samples with verbal nouns," in Proceedings of the 3rd International Joint Conference on Natural Language Processing, 2008, 497-504.
Non-Patent Document 4: D. Lin and P. Pantel, "DIRT - discovery of inference rules from text," in Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2001, 323-328.
Non-Patent Document 5: C. Hashimoto et al., "Large-scale verb entailment acquisition from the web," in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3, 2009, 1172-1181.
Non-Patent Document 6: Hideyuki Takahashi and Koichi Takeuchi, "Extraction of verb synonyms by simultaneous co-occurrence clustering considering polysemy," IEICE Technical Report. NLC, Natural Language Understanding and Communication, vol. 108, 2009, 37-42.
Non-Patent Document 7: C. Fellbaum, WordNet: An Electronic Lexical Database, The MIT Press, 1998.
Non-Patent Document 8: M. A. Hearst, "Automatic acquisition of hyponyms from large text corpora," in Proceedings of the 14th conference on Computational linguistics - Volume 2 (Association for Computational Linguistics, Morristown, NJ, USA, 1992), 539-545.
Non-Patent Document 9: T. Chklovski and P. Pantel, "VerbOcean: Mining the web for fine-grained semantic verb relations," in Proceedings of EMNLP, vol. 4, 2004, 33-40.
Non-Patent Document 10: O. Etzioni et al., "Unsupervised named-entity extraction from the web: An experimental study," Artificial Intelligence 165, no. 1 (2005), 91-134.
Non-Patent Document 11: J. Kazama and K. Torisawa, "Inducing Gazetteers for Named Entity Recognition by Large-scale Clustering of Dependency Relations," in Proceedings of ACL-08: HLT (2008), 407-415.

With the first conventional technique described above, inter-word relationships that match a lexico-syntactic pattern can be acquired with high accuracy, but inter-word relationships that do not match any lexico-syntactic pattern cannot be extracted. With the second conventional technique, inter-word relationships can be extracted even when they do not match a lexico-syntactic pattern; however, inter-noun relationships cannot be classified into semantic relationships, and for inter-verb/adjective relationships only those belonging to the entailment relationship can be extracted.

The technique described in Patent Document 2 uses machine learning so that inter-noun relationships can be classified into semantic relationships. Machine learning, however, requires training data.

Furthermore, the first and second conventional techniques determine in advance the semantic relationships to be acquired and then acquire only those relationships. Because it is difficult to determine every existing semantic relationship in advance, these techniques can acquire specific semantic relationships but cannot acquire diverse or unexpected semantic relationships.

In addition, the first and second conventional techniques each target either inter-noun relationships or inter-verb/adjective relationships for classification. Neither is a general-purpose technique that can handle both kinds collectively as inter-word relationships.

Accordingly, an object of the present invention is to provide a binary relation classification program, method, and apparatus that handle inter-noun relationships and inter-verb/adjective relationships collectively as inter-word relationships and can classify semantically similar word pairs into binary relations without defining in advance the inter-word relationships to be acquired.

According to the present invention, a binary relation classification program causes a computer mounted in an apparatus to function so as to classify word pairs into semantic binary relations. The apparatus has a sentence set storage unit that stores a large amount of document information, and the program causes the computer to execute:
a first step of extracting, from the sentence set storage unit, a plurality of word pairs each consisting of a first word and a second word;
a second step of extracting, from among the word pairs, those whose similarity, which represents how readily the two words co-occur, is equal to or greater than a predetermined threshold;
a third step of extracting from the sentence set storage unit, for each word pair extracted in the second step, a first dependency word set that co-occurs with the first word and a second dependency word set that co-occurs with the second word;
a fourth step of extracting a first feature dependency word set consisting of the dependency words that appear in the first dependency word set but not in the second dependency word set, and a second feature dependency word set consisting of the dependency words that appear in the second dependency word set but not in the first dependency word set;
a fifth step of counting, for each dependency word in the first feature dependency word set, its appearance frequency in the document set in co-occurrence with the first word, and, for each dependency word in the second feature dependency word set, its appearance frequency in the document set in co-occurrence with the second word;
a sixth step of deriving a vector by concatenating a first vector, which represents the appearance frequency of each dependency word in the first feature dependency word set based on the first word, with a second vector, which represents the appearance frequency of each dependency word in the second feature dependency word set based on the second word; and
a seventh step of generating word pair clusters by partitioning optimization clustering based on the similarity between the vectors.

According to another embodiment of the binary relation classification program of the present invention, it is also preferable that the computer be made to function such that:
the word is a noun;
the word pair is a noun pair;
the dependency word set is a predicate set;
a predicate is a verb or an adjective; and
the feature dependency word set is a feature predicate set.

According to another embodiment of the binary relation classification program of the present invention, it is also preferable that the computer be made to function such that:
the word is a predicate, that is, a verb or an adjective;
the word pair is a predicate pair;
the dependency word set is a noun set; and
the feature dependency word set is a feature noun set.

According to another embodiment of the binary relation classification program of the present invention, it is also preferable that the computer be made to function such that, in the seventh step, only the pairs for which mutual information, used as the similarity in the second step, is equal to or greater than the predetermined threshold are clustered.

According to the present invention, a binary relation classification method is performed in an apparatus that classifies word pairs into semantic binary relations. The apparatus has a sentence set storage unit that stores a large amount of document information, and the method comprises:
a first step of extracting, from the sentence set storage unit, a plurality of word pairs each consisting of a first word and a second word;
a second step of extracting, from among the word pairs, those whose similarity, which represents how readily the two words co-occur, is equal to or greater than a predetermined threshold;
a third step of extracting from the sentence set storage unit, for each word pair extracted in the second step, a first dependency word set that co-occurs with the first word and a second dependency word set that co-occurs with the second word;
a fourth step of extracting a first feature dependency word set consisting of the dependency words that appear in the first dependency word set but not in the second dependency word set, and a second feature dependency word set consisting of the dependency words that appear in the second dependency word set but not in the first dependency word set;
a fifth step of counting, for each dependency word in the first feature dependency word set, its appearance frequency in the document set in co-occurrence with the first word, and, for each dependency word in the second feature dependency word set, its appearance frequency in the document set in co-occurrence with the second word;
a sixth step of deriving a vector by concatenating a first vector, which represents the appearance frequency of each dependency word in the first feature dependency word set based on the first word, with a second vector, which represents the appearance frequency of each dependency word in the second feature dependency word set based on the second word; and
a seventh step of generating word pair clusters by partitioning optimization clustering based on the similarity between the vectors.

According to the present invention, a binary relation classification apparatus that classifies word pairs into semantic binary relations comprises:
sentence set storage means for storing a large amount of document information;
word pair extraction means for extracting, from the sentence set storage means, a plurality of word pairs each consisting of a first word and a second word;
similar word pair extraction means for extracting, from among the word pairs, those whose similarity, which represents how readily the two words co-occur, is equal to or greater than a predetermined threshold;
dependency word set extraction means for extracting from the sentence set storage means, for each word pair extracted by the similar word pair extraction means, a first dependency word set that co-occurs with the first word and a second dependency word set that co-occurs with the second word;
feature dependency word set extraction means for extracting a first feature dependency word set consisting of the dependency words that appear in the first dependency word set but not in the second dependency word set, and a second feature dependency word set consisting of the dependency words that appear in the second dependency word set but not in the first dependency word set;
dependency word appearance frequency counting means for counting, for each dependency word in the first feature dependency word set, its appearance frequency in the document set in co-occurrence with the first word, and, for each dependency word in the second feature dependency word set, its appearance frequency in the document set in co-occurrence with the second word;
word pair similarity calculation means for deriving a vector by concatenating a first vector, which represents the appearance frequency of each dependency word in the first feature dependency word set based on the first word, with a second vector, which represents the appearance frequency of each word in the second feature dependency word set based on the second word; and
word pair cluster generation means for generating word pair clusters by partitioning optimization clustering based on the similarity between the vectors.

According to the binary relation classification program, method, and apparatus of the present invention, semantically similar word pairs can be classified into binary relations without defining in advance the inter-word relationships to be acquired.

A flowchart showing the processing of the binary relation classification program of the present invention.
A flowchart of the generation of noun pair clusters.
A flowchart of the generation of predicate pair clusters.
A functional block diagram of the binary relation classification apparatus of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

The present invention can extract semantically similar word pairs from a set of sentences and classify those word pairs into binary relations, each of which represents a semantic relationship.

Here, a "word pair" is a pair of "words". A "word" is one of the structural units of a language and consists of one or more morphemes. A morpheme is the smallest unit that carries meaning in a language, that is, each of the groups of phonemes obtained by decomposing down to units that would lose their meaning if decomposed any further. Words include simple words, which consist of a single morpheme, and compound words, which consist of multiple morphemes. In the following, "word" covers both simple words and compound words.
Simple word: 「山」 (mountain)
Compound word: 「山登り」 (mountain climbing)

Words combine to form phrases, clauses, sentences, and texts. For example, the sentence 「吉野山に行く」 ("go to Mt. Yoshino") consists of the three words 「吉野山」, 「に」, and 「行く」. Words can be broadly divided into function words, which play a grammatical role, and content words, which carry other, general meanings. In the following, "word" refers to a content word.
Examples of content words: nouns (吉野山, Mt. Yoshino), verbs (行く, go), adjectives (きれい, beautiful)
Examples of function words: particles (が, を, に, の), auxiliary verbs (れる, られる, た)

A "word pair" is a pair of such words. Examples include the following.
Noun pairs: (桜, ソメイヨシノ) (cherry tree, Somei Yoshino), (ビアパーティー, 枝豆) (beer party, edamame)
Verb/adjective pairs: (寝る, 起きる) (sleep, wake up), (早い, 起きる) (early, wake up), (速い, すばやい) (fast, quick)

In general, a "binary relation" refers to relationships such as entailment, synonymy, antonymy, causality, and temporal order as defined by Fellbaum (see, for example, Non-Patent Document 7). In contrast, the present invention can classify word pairs not only into such binary relations but also into unexpected semantic relationships that cannot be fully defined by hand.
(a) Inter-noun relationship (桜, ソメイヨシノ) (cherry tree, Somei Yoshino): hypernym-hyponym relationship
(b) Inter-verb relationship (寝る, 起きる) (sleep, wake up): causal relationship
(c) Inter-noun relationship (ビアパーティー, 枝豆) (beer party, edamame): "event - item essential to the event" relationship
For example, (a) and (b) above belong to the relationship classification of Fellbaum et al., whereas (c) does not. According to the present invention, classification can take such semantic relationships into account as well, without the enormous manual cost of defining the relationship classification.

FIG. 1 is a flowchart showing the processing of the binary relation classification program of the present invention.

The binary relation classification program is executed by a processor (computer) mounted in the apparatus in order to classify "word pairs", each a pair of words, into semantic binary relations. The apparatus has a sentence set storage unit that stores a large amount of text information.

According to the present invention, word pairs that tend to co-occur at or above a predetermined threshold are extracted from the sentence set storage unit. Next, for each word in a word pair, the set of dependency words that co-occur with that word is extracted from the sentence set storage unit. Word pair clusters are then generated based on vectors that represent, for each word, the appearance frequencies of the words in its dependency word set.

The binary relation classification program executes the following seven steps.
(S1) A plurality of word pairs are extracted from the text stored in the sentence set storage unit. A "word pair" consists of a first word and a second word.
(S2) From the extracted word pairs, only those in which the first word and the second word tend to co-occur are extracted.
(S3) For each word pair extracted in S2, a first dependency word set that co-occurs with the first word and a second dependency word set that co-occurs with the second word are extracted from the sentence set storage unit.
(S4) A first feature dependency word set, consisting of the dependency words that appear in the first dependency word set but not in the second dependency word set, and a second feature dependency word set, consisting of the dependency words that appear in the second dependency word set but not in the first dependency word set, are extracted.
(S5) For each dependency word in the first feature dependency word set, the frequency with which it appears in co-occurrence with the first word in the text stored in the sentence set storage unit is counted. Likewise, for each dependency word in the second feature dependency word set, the frequency with which it appears in co-occurrence with the second word in the stored text is counted.
(S6) A first vector is generated that represents the appearance frequency of each dependency word in the first feature dependency word set based on the first word. Likewise, a second vector is generated that represents the appearance frequency of each dependency word in the second feature dependency word set based on the second word. A vector is then derived by concatenating the first vector and the second vector.
(S7) Word pair clusters are generated by partitioning optimization clustering based on the similarity between the vectors generated in S6.
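Steps S5 to S7 can be illustrated with a small sketch of how the concatenated vector might be built and compared. Several points here are assumptions, not statements from the text: the two halves of the vector are indexed by one shared vocabulary of dependency words, cosine similarity is used as the similarity between vectors, and all frequencies are toy values invented for illustration. The clustering itself (for instance with k-means, one common partitioning method) is left out.

```python
import math


def pair_vector(first_counts, second_counts, vocab):
    """Concatenate two frequency vectors indexed by a shared vocabulary (S6)."""
    v1 = [first_counts.get(w, 0) for w in vocab]
    v2 = [second_counts.get(w, 0) for w in vocab]
    return v1 + v2


def cosine(u, v):
    """Cosine similarity; an assumed choice for the similarity between vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy frequencies per word pair (invented): (counts for first word, counts for second word).
pairs = {
    ("吉野山", "桜"): ({"行く": 3, "立ち寄る": 1}, {"守る": 1, "みる": 2}),
    ("新宿御苑", "ソメイヨシノ"): (
        {"行く": 2, "整備する": 1, "立ち寄る": 1},
        {"咲く": 2, "植樹する": 1, "守る": 1, "みる": 2},
    ),
    ("庭", "桜"): ({"手入れする": 2, "掃除する": 1}, {"咲く": 1, "みる": 1}),
}

# Shared vocabulary of all dependency words seen in any pair.
vocab = sorted({w for c1, c2 in pairs.values() for w in (*c1, *c2)})
vectors = {pair: pair_vector(c1, c2, vocab) for pair, (c1, c2) in pairs.items()}

sim_place = cosine(vectors[("吉野山", "桜")], vectors[("新宿御苑", "ソメイヨシノ")])
sim_garden = cosine(vectors[("吉野山", "桜")], vectors[("庭", "桜")])
print(sim_place > sim_garden)
```

With these toy values the first two pairs come out closer to each other than to the third, which is the kind of signal a partitioning clustering step would then use to group word pairs.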

FIG. 2 is a flowchart of the generation of noun pair clusters.

FIG. 2 illustrates the generation of word pair clusters for the case of noun pair clusters. In FIG. 2, the "word" of FIG. 1 is a "noun" and the "dependency word" is a "predicate".

(S21) A plurality of noun pairs are extracted from the text stored in the sentence set storage unit. A "noun pair" consists of a first noun n1 and a second noun n2. Noun pairs are extracted using lexico-syntactic patterns such as the following.
Patterns: 「<n1>の<n2>」 ("<n2> of <n1>")
「<n1>で<n2>」 ("<n2> at <n1>")
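The pattern-based extraction in S21 can be sketched as follows. The sketch assumes the text has already been split into (surface, part-of-speech) tokens by some morphological analyzer; the token format, the tag names, and the function `extract_noun_pairs` are hypothetical and are not specified in this description. The sample sentences are ones used as examples in this description.

```python
def extract_noun_pairs(tokens, particles=("の", "で")):
    """Find noun, particle, noun sequences and return (n1, n2) pairs."""
    pairs = []
    for (w1, p1), (w2, p2), (w3, p3) in zip(tokens, tokens[1:], tokens[2:]):
        if p1 == "noun" and p2 == "particle" and w2 in particles and p3 == "noun":
            pairs.append((w1, w3))
    return pairs


# Pre-tokenized sentences (tokenization done by hand for illustration).
sentences = [
    [("吉野山", "noun"), ("の", "particle"), ("桜", "noun"),
     ("を", "particle"), ("みたい", "verb")],
    [("新宿御苑", "noun"), ("の", "particle"), ("ソメイヨシノ", "noun"),
     ("は", "particle"), ("きれいだ", "adjective")],
    [("庭", "noun"), ("の", "particle"), ("桜", "noun"),
     ("が", "particle"), ("咲く", "verb")],
]

noun_pairs = [pair for tokens in sentences for pair in extract_noun_pairs(tokens)]
print(noun_pairs)
```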

For example, the following noun pairs are extracted from the text stored in the sentence set storage unit using the pattern 「<n1>の<n2>」.
[Sentence] ==> [Noun pair]
「吉野山の桜をみたい」 ("I want to see the cherry blossoms of Mt. Yoshino") ==> (吉野山, 桜) (Mt. Yoshino, cherry tree)
「新宿御苑のソメイヨシノはきれいだ」 ("The Somei Yoshino trees in Shinjuku Gyoen are beautiful") ==> (新宿御苑, ソメイヨシノ) (Shinjuku Gyoen, Somei Yoshino)
「庭の桜が咲く」 ("The cherry tree in the garden blooms") ==> (庭, 桜) (garden, cherry tree)

(S22) From the extracted noun pairs, only those in which the first noun and the second noun tend to co-occur are extracted. Mutual information can be used as one kind of similarity that measures how readily two words co-occur. Only the noun pairs whose similarity is equal to or greater than a predetermined threshold are extracted.

With mutual information, a noun pair that tends to co-occur across a variety of sentences, such as (吉野山, 桜) (Mt. Yoshino, cherry tree), receives a high similarity. In contrast, a noun pair that co-occurs only in particular sentences, such as (隅田さん, 靴下) (Sumida-san, socks), receives a low similarity. In this way, noun pairs that have some semantic relationship can be extracted.
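One way to realize the thresholding in S22 is sketched below. The text says only that mutual information can be used; the pointwise form computed here, PMI(n1, n2) = log(p(n1, n2) / (p(n1) p(n2))), with probabilities estimated from counts of extracted pairs, is an assumption, as are the function names, the threshold of 0.0, and the toy counts.

```python
import math
from collections import Counter


def pmi_scores(pair_counts):
    """Pointwise mutual information for each (n1, n2) pair, estimated from pair counts."""
    total = sum(pair_counts.values())
    first = Counter()
    second = Counter()
    for (w1, w2), c in pair_counts.items():
        first[w1] += c
        second[w2] += c
    scores = {}
    for (w1, w2), c in pair_counts.items():
        p_xy = c / total
        p_x = first[w1] / total
        p_y = second[w2] / total
        scores[(w1, w2)] = math.log(p_xy / (p_x * p_y))
    return scores


def select_pairs(pair_counts, threshold):
    """Keep only the pairs whose score is at or above the threshold."""
    return {pair for pair, s in pmi_scores(pair_counts).items() if s >= threshold}


# Toy counts (invented for illustration only).
toy_counts = Counter({
    ("吉野山", "桜"): 4,
    ("吉野山", "枝豆"): 1,
    ("ビアパーティー", "桜"): 1,
    ("ビアパーティー", "枝豆"): 4,
})

selected = select_pairs(toy_counts, threshold=0.0)
print(sorted(selected))
```

In this toy data the two frequent pairings score above zero and are kept, while the two incidental pairings score below zero and are dropped.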

(S23) For each noun pair extracted in S22, a first predicate set that co-occurs with the first noun and a second predicate set that co-occurs with the second noun are extracted from the sentence set storage unit.

The aim is to classify pairs by semantic relationship based on the difference between the concepts that the two words denote, for example 「ソメイヨシノ」 (Somei Yoshino) and 「桜」 (cherry tree). It is difficult, however, to express explicitly the range of the concept that a word represents. The set of predicates that co-occur with a noun is therefore assumed to represent its concept. The second conventional technique described above estimates the relationship from the intersection of the sets. In contrast, the present invention explicitly uses the difference between the predicate sets that co-occur with each noun as an indicator of some semantic relationship. Moreover, because the present invention uses the set of noun pairs extracted in S22, which tend to co-occur (have high similarity), it can classify pairs by semantic relationship while taking into account the range of the concept that each word represents.

For example, for the noun pair <Yoshinoyama, Sakura>, the first predicate set co-occurring with the first noun "Yoshinoyama" and the second predicate set co-occurring with the second noun "Sakura" are extracted as follows.
Noun pair <Yoshinoyama, Sakura>
Noun "Yoshinoyama" ==> Predicate set {go, plant, stop by, bloom}
Noun "Sakura" ==> Predicate set {bloom, plant, protect, see}
Noun pair <Shinjuku Gyoen, Somei Yoshino>
Noun "Shinjuku Gyoen" ==> Predicate set {go, maintain, stop by}
Noun "Somei Yoshino" ==> Predicate set {bloom, plant, protect, see}
Noun pair <garden, Sakura>
Noun "garden" ==> Predicate set {tend, plant, clean}
Noun "Sakura" ==> Predicate set {bloom, plant, see}

(S24) Next, a first feature predicate set, consisting of the predicates that appear in the first predicate set but not in the second predicate set, and a second feature predicate set, consisting of the predicates that appear in the second predicate set but not in the first predicate set, are extracted.

For example, for the noun pair <Yoshinoyama, Sakura>, both predicate sets contain {bloom, plant}, so these predicates are removed. Processing the other noun pairs in the same way gives the following feature predicate sets.
Noun pair <Yoshinoyama, Sakura>
Noun "Yoshinoyama" ==> Feature predicate set {go, stop by}
Noun "Sakura" ==> Feature predicate set {protect, see}
Noun pair <Shinjuku Gyoen, Somei Yoshino>
Noun "Shinjuku Gyoen" ==> Feature predicate set {go, maintain, stop by}
Noun "Somei Yoshino" ==> Feature predicate set {bloom, plant, protect, see}
Noun pair <garden, Sakura>
Noun "garden" ==> Feature predicate set {tend, clean}
Noun "Sakura" ==> Feature predicate set {bloom, see}
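The extraction in S24 amounts to two set differences. A minimal sketch, using the predicate sets from the example above; the function name `feature_sets` is a hypothetical choice, not a name from the patent.

```python
def feature_sets(first_set, second_set):
    """Return (first - second, second - first), i.e. the words unique to each side."""
    first_set, second_set = set(first_set), set(second_set)
    return first_set - second_set, second_set - first_set

# Predicate sets from the example in the text.
predicate_sets = {
    ("Yoshinoyama", "Sakura"): ({"go", "plant", "stop by", "bloom"},
                                {"bloom", "plant", "protect", "see"}),
    ("Shinjuku Gyoen", "Somei Yoshino"): ({"go", "maintain", "stop by"},
                                          {"bloom", "plant", "protect", "see"}),
    ("garden", "Sakura"): ({"tend", "plant", "clean"},
                           {"bloom", "plant", "see"}),
}

for pair, (p1, p2) in predicate_sets.items():
    f1, f2 = feature_sets(p1, p2)
    print(pair, sorted(f1), sorted(f2))
```

Note that "Sakura" ends up with {protect, see} when paired with "Yoshinoyama" but with {bloom, see} when paired with "garden", which is the pair-dependent behavior the description relies on.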

(1) Similarity between the noun pair <Yoshinoyama, Sakura> and the noun pair <Shinjuku Gyoen, Somei Yoshino>
For both noun pairs, the first feature predicate set co-occurring with the first noun contains "go" and "stop by", and the second feature predicate set co-occurring with the second noun contains "see" and "protect". These noun pairs are therefore judged to be highly similar.

(2) Similarity between the noun pair <Yoshinoyama, Sakura> and the noun pair <garden, Sakura>
Both noun pairs share the second noun "Sakura". However, the first feature predicate set co-occurring with the first noun "Yoshinoyama" of <Yoshinoyama, Sakura> consists of predicates about visiting a place, such as "go" and "stop by". In contrast, the first feature predicate set co-occurring with the first noun "garden" of <garden, Sakura> consists of gardening predicates such as "tend" and "clean". That is, the first feature predicate sets of the two noun pairs do not overlap. These noun pairs are therefore judged to have low similarity.

Thus, for the two noun pairs <Yoshinoyama, Sakura> and <garden, Sakura>, even for the very same noun "Sakura", the predicates that do not co-occur with the paired noun are extracted as the feature predicate set. In other words, even for the very same noun, the feature predicate set differs depending on the noun it is paired with.

When noun pairs are highly similar, as with (1) the noun pair <Yoshinoyama, Sakura> and the noun pair <Shinjuku Gyoen, Somei Yoshino> above, they are judged to share a common semantic relationship.

(S25) For each predicate belonging to the first feature predicate set, the frequency with which that predicate co-occurs with the first noun is counted over the sentences stored in the sentence set storage unit. Similarly, for each predicate belonging to the second feature predicate set, the frequency with which that predicate co-occurs with the second noun is counted over the stored sentences.

For example, the number of expressions in which the noun "Yoshinoyama" and the predicate "go" stand in a direct dependency relation is counted over the sentences stored in the sentence set storage unit 10. The counts are written, for example, as follows.
freq(Yoshinoyama, go) = 132
freq(Yoshinoyama, stop by) = 76
freq(Sakura, protect) = 63
freq(Sakura, see) = 142

(S26) A first vector (freq_np1') is derived that represents the predicates belonging to the first feature predicate set (np1') for the first noun together with the appearance frequency of each of those predicates. Similarly, a second vector (freq_np2') is derived that represents the predicates belonging to the second feature predicate set (np2') for the second noun together with the appearance frequency of each of those predicates.

Each element of a vector corresponds to a predicate and is expressed as follows.
freq(n,p): appearance frequency of predicate p co-occurring with noun n
freq_np = [freq(n,p1), freq(n,p2), ...]^T

The vectors for the individual nouns are expressed as follows.
Vector for the noun "Yoshinoyama": freq_np1' = [132, 76]^T
Vector for the noun "Sakura": freq_np2' = [63, 142]^T

The generated vectors freq_np1' and freq_np2' are then concatenated so that each occupies its own dimensions.
f(Yoshinoyama, Sakura) = [go, stop by, protect, see]^T
f(Yoshinoyama, Sakura) = [132, 76, 63, 142]^T
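Steps S25 and S26 can be sketched together as below. The sketch assumes the corpus has already been dependency-parsed into (word, dependent) tuples, which the text does not detail, and it stores the concatenated vector as a sparse dict keyed by (slot, dependent) so that the two halves never share a dimension. The name `pair_vector` and the extra "bloom" count are hypothetical.

```python
from collections import Counter

def pair_vector(w1, w2, feat1, feat2, dependencies):
    """Build the concatenated frequency vector f(w1, w2) as a sparse dict.

    dependencies: iterable of (word, dependent) tuples, one per direct
    dependency found in the corpus (assumed to be pre-parsed).
    """
    freq = Counter(dependencies)
    vec = {}
    for p in sorted(feat1):
        vec[(1, p)] = freq[(w1, p)]  # dimensions owned by the first word
    for p in sorted(feat2):
        vec[(2, p)] = freq[(w2, p)]  # dimensions owned by the second word
    return vec

# Dependency tuples reproducing the counts in the text; the "bloom" entries
# are invented to show that non-feature predicates are ignored.
deps = ([("Yoshinoyama", "go")] * 132 + [("Yoshinoyama", "stop by")] * 76
        + [("Sakura", "protect")] * 63 + [("Sakura", "see")] * 142
        + [("Sakura", "bloom")] * 10)

f = pair_vector("Yoshinoyama", "Sakura", {"go", "stop by"}, {"protect", "see"}, deps)
print(f)
```

Keying each dimension by slot as well as by word is one way to read "concatenated so that each occupies its own dimensions": a predicate that is characteristic of the first noun in one pair and of the second noun in another pair does not collide.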

(S27) Noun pair clusters are generated by division optimization clustering based on the similarity between the vectors derived in S26.

In S27, for each noun pair (n1, n2), the cluster ID (identifier) of the cluster to which the noun pair belongs and the pair's contribution to that cluster are obtained. Suppose the vector generated in S26 is [132,76,63,142]^T for the noun pair (Yoshinoyama, Sakura) and [130,78,63,140]^T for the noun pair (Shinjuku Gyoen, Somei Yoshino), and that clustering based on the similarity between vectors places both noun pairs in the same cluster. In this case, both noun pairs receive the same cluster ID, and a cluster contribution is obtained for each of them.
Noun pair                           Vector               Cluster ID   Cluster contribution
f(Yoshinoyama, Sakura)              [132,76,63,142]^T    r1           0.8
f(Shinjuku Gyoen, Somei Yoshino)    [130,78,63,140]^T    r1           0.85

In the above example, the noun pair (Yoshinoyama, Sakura) and the noun pair (Shinjuku Gyoen, Somei Yoshino) have a high inter-vector similarity, so they are classified into the same cluster as pairs sharing some common semantic relationship. The noun pair (garden, Sakura), on the other hand, has a lower inter-vector similarity to both of them and is therefore classified into a different cluster. In this way, noun pairs can be clustered by treating the difference between the concepts represented by the first and second nouns of each pair as an expression of some semantic relationship.
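The comparison described above can be illustrated with a small similarity computation. The text does not name the similarity measure, so cosine similarity is assumed here. The first two vectors use the numbers given in the text; assigning the second vector to the same four dimensions, and all counts for the (garden, Sakura) pair, are assumptions made only for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Keys are (slot, predicate): slot 1 is the first noun, slot 2 the second.
yoshino = {(1, "go"): 132, (1, "stop by"): 76, (2, "protect"): 63, (2, "see"): 142}
# Dimension labels for this vector are assumed to match the one above.
gyoen = {(1, "go"): 130, (1, "stop by"): 78, (2, "protect"): 63, (2, "see"): 140}
# No counts are given for this pair in the text; these values are made up.
garden = {(1, "tend"): 90, (1, "clean"): 40, (2, "bloom"): 120, (2, "see"): 100}

print(round(cosine(yoshino, gyoen), 4))
print(round(cosine(yoshino, garden), 4))
```

Under these assumptions the first two pairs are nearly parallel, while the garden pair overlaps with them only in the (2, "see") dimension.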

As the clustering technique, for example, K-means or the EM algorithm (probabilistic clustering) can be used. K-means can cluster without training data. With the EM algorithm, on the other hand, supervised learning can be performed by supplying in advance, as training data, the clusters to which noun pairs belong. With K-means, noun pairs can be clustered into unknown relationships whose members are semantically close but which do not belong to any existing classification. With the EM algorithm, noun pairs can be clustered into relationships designed in advance through the training data, for example hypernym-hyponym relationships or part-whole relationships.

Each cluster obtained by the clustering is regarded as a set of noun pairs that express one and the same semantic relationship. Each cluster is assigned a distinct cluster ID.

Each noun pair is also given a contribution to each cluster. How the contribution is defined depends on the clustering method. With K-means, the distance between a noun pair belonging to a cluster and the centroid of that cluster corresponds to the contribution to the cluster. With the EM algorithm, the probability that a noun pair belongs to each cluster corresponds to the contribution to that cluster.
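The K-means case can be sketched in plain Python as follows. This is a minimal illustration, not the patent's implementation: the farthest-point initialization, the dense dimension order, and the counts for the (garden, Sakura) pair are assumptions, and the text does not say how a raw centroid distance is turned into a contribution value such as 0.8. `math.dist` requires Python 3.8 or later.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain K-means; returns (labels, distance of each point to its centroid)."""
    # Deterministic farthest-point initialisation (an illustrative choice).
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels = None
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are final
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]

    distances = [math.dist(p, centroids[lab]) for p, lab in zip(points, labels)]
    return labels, distances

# Assumed dimension order: [go, stop by, tend, clean, protect, see, bloom]
yoshino = [132, 76, 0, 0, 63, 142, 0]
gyoen = [130, 78, 0, 0, 63, 140, 0]
garden = [0, 0, 90, 40, 0, 100, 120]  # made-up counts

labels, distances = kmeans([yoshino, gyoen, garden], k=2)
print(labels, [round(d, 3) for d in distances])
```

A smaller distance means the pair sits closer to the center of its cluster, so a score that grows with membership strength would have to be some decreasing function of this distance.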

FIG. 3 is a flowchart of predicate pair cluster generation.

FIG. 3 explains word pair cluster generation using the generation of predicate pair clusters as an example. Specifically, the "word" of FIG. 1 is taken to be a "predicate" and the "dependency word" to be a "noun".

FIG. 2 described an example of clustering noun pairs by their similarity, using the feature predicate sets that co-occur with each noun. FIG. 3, in contrast, describes an example of clustering predicate pairs by their similarity, using the feature noun sets that co-occur with each predicate. Apart from the difference between noun pairs and predicate pairs, the processing in FIG. 3 is exactly the same as in FIG. 2.

(S31) A plurality of predicate pairs is extracted from the sentences stored in the sentence set storage unit. A "predicate pair" consists of a first predicate p1 and a second predicate p2. Predicate pairs are extracted using, for example, lexico-syntactic patterns such as the following.
Patterns: "<p1>ながら<p2>" (do <p2> while doing <p1>)
"<p1>て<p2>" (do <p1> and <p2>)

For example, the following predicate pairs are extracted from the sentences stored in the sentence set storage unit using the patterns "<p1>ながら<p2>" and "<p1>て<p2>".
[Sentence] [Predicate pair]
"See the Sakura while walking on Yoshinoyama" ==> (walk, see)
"Go to Shinjuku Gyoen and see the Sakura" ==> (go, see)
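The pattern matching in S31 can be sketched over morphologically analyzed input. The sketch assumes each sentence has already been split into (lemma, part-of-speech) tuples by some analyzer, and it assumes that <p2> is simply the next verb after the connective; neither detail is specified in the text, and the tag names and function name are hypothetical.

```python
CONNECTIVES = {"ながら", "て"}  # the two connectives named in the patterns

def extract_predicate_pairs(tokens):
    """tokens: list of (lemma, pos) tuples from a morphological analyser."""
    pairs = []
    for i in range(1, len(tokens)):
        lemma, _pos = tokens[i]
        prev_lemma, prev_pos = tokens[i - 1]
        if lemma in CONNECTIVES and prev_pos == "verb":
            # Assumption: p2 is the first verb that follows the connective.
            for next_lemma, next_pos in tokens[i + 1:]:
                if next_pos == "verb":
                    pairs.append((prev_lemma, next_lemma))
                    break
    return pairs

# The two example sentences, hand-tokenized for illustration.
walking = [("吉野山", "noun"), ("を", "particle"), ("歩く", "verb"), ("ながら", "particle"),
           ("、", "symbol"), ("桜", "noun"), ("を", "particle"), ("みる", "verb")]
going = [("新宿御苑", "noun"), ("に", "particle"), ("行く", "verb"), ("て", "particle"),
         ("、", "symbol"), ("桜", "noun"), ("を", "particle"), ("みる", "verb")]

print(extract_predicate_pairs(walking))
print(extract_predicate_pairs(going))
```

Working on lemmas rather than surface strings matters here because the inflected forms in the sentences (歩き, 行っ) differ from the dictionary forms that appear in the predicate pairs.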

(S32) From the extracted predicate pairs, only those in which the first predicate and the second predicate are likely to co-occur are extracted. As in S22 (FIG. 2), mutual information can be used as a similarity measure of how readily two words co-occur. Only predicate pairs whose similarity is equal to or greater than a predetermined threshold are extracted.

(S33) For each predicate pair extracted in S32, a first noun set that co-occurs with the first predicate and a second noun set that co-occurs with the second predicate are extracted from the sentence set storage unit.

For example, in order to classify predicate pairs by semantic relationship based on the difference between the concepts denoted by two words such as "walk" and "go", the set of nouns that co-occur with a predicate is assumed to represent its concept. The present invention explicitly uses the difference between the noun sets co-occurring with each predicate as an indicator of some semantic relationship. In addition, because the present invention uses the set of predicate pairs extracted in S32 that are likely to co-occur (have high similarity), predicate pairs can be classified by semantic relationship while taking into account the range of the concepts that the words represent.

For example, for the predicate pair <walk, see>, the first noun set co-occurring with the first predicate "walk" and the second noun set co-occurring with the second predicate "see" are extracted as follows.
Predicate pair <walk, see>
Predicate "walk" ==> Noun set {park, road, mountain, town}
Predicate "see" ==> Noun set {flower, mountain, forest, town}
Predicate pair <go, see>
Predicate "go" ==> Noun set {company, school, mountain, town}
Predicate "see" ==> Noun set {flower, mountain, forest, town}

(S34) Next, a first feature noun set, consisting of the nouns that appear in the first noun set but not in the second noun set, and a second feature noun set, consisting of the nouns that appear in the second noun set but not in the first noun set, are extracted.

For example, for the predicate pair <walk, see>, both noun sets contain {mountain, town}, so these nouns are removed. Processing the other predicate pair in the same way gives the following feature noun sets.
Predicate pair <walk, see>
Predicate "walk" ==> Feature noun set {park, road}
Predicate "see" ==> Feature noun set {flower, forest}
Predicate pair <go, see>
Predicate "go" ==> Feature noun set {company, school}
Predicate "see" ==> Feature noun set {flower, forest}

(S35) For each noun belonging to the first feature noun set, the frequency with which that noun co-occurs with the first predicate is counted over the sentences stored in the sentence set storage unit. Similarly, for each noun belonging to the second feature noun set, the frequency with which that noun co-occurs with the second predicate is counted over the stored sentences.

For example, the number of expressions in which the predicate "walk" and the noun "park" stand in a direct dependency relation is counted over the sentences stored in the sentence set storage unit. The counts are written, for example, as follows.
freq(walk, park) = 128
freq(walk, road) = 60
freq(see, flower) = 48
freq(see, forest) = 122

(S36) A vector representing the appearance frequency of each noun belonging to the first feature noun set for the first predicate and a vector representing the appearance frequency of each noun belonging to the second feature noun set for the second predicate are derived.

Each element of a vector corresponds to a noun and is expressed as follows.
freq(p,n): appearance frequency of noun n co-occurring with predicate p
freq_pn = [freq(p,n1), freq(p,n2), ...]^T

The vectors for the individual predicates are expressed as follows.
Vector for the predicate "walk": freq_pn1' = [128, 60]^T
Vector for the predicate "see": freq_pn2' = [48, 122]^T

The generated vectors freq_pn1' and freq_pn2' are then concatenated so that each occupies its own dimensions.
f(walk, see) = [park, road, flower, forest]^T
f(walk, see) = [128, 60, 48, 122]^T

(S37) Predicate pair clusters are generated by division optimization clustering based on the similarity between the vectors derived in S36.

In S37, for each predicate pair (p1, p2), the cluster ID of the cluster to which the predicate pair belongs and the pair's contribution to that cluster are obtained. Suppose that clustering based on the similarity between vectors places the predicate pairs (walk, see) and (go, see) in the same cluster. As in S27, both predicate pairs receive the same cluster ID, and a cluster contribution is obtained for each of them.
Predicate pair    Vector               Cluster ID   Cluster contribution
f(walk, see)      [128,60,48,122]^T    r1           0.9
f(go, see)        [130,60,45,121]^T    r1           0.7

The clustering of noun pairs in S27 and the clustering of predicate pairs in S37 do not differ greatly in their processing. The difference is that S27 clusters noun pairs by the similarity of vectors built from per-predicate appearance frequencies, whereas S37 clusters predicate pairs by the similarity of vectors built from per-noun appearance frequencies.

Note that the word pair clustering (S7), noun pair clustering (S27), and predicate pair clustering (S37) described above are not limited to hard clustering, in which each element must correspond to exactly one cluster. Soft clustering, in which one element may belong to several clusters, can also be used. With hard clustering, the similarity between two vectors is 0 when their noun pairs or predicate pairs belong to different clusters. With soft clustering, one noun pair can belong to several clusters, so the number of vector pairs whose similarity is 0 can be reduced.
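The effect described above can be illustrated by comparing cluster-membership vectors under hard and soft assignment. This is one possible reading of the passage, not the patent's stated procedure: the soft weights below are a softmax over negative squared centroid distances (which matches the E-step of a Gaussian mixture with equal priors and a shared spherical variance), and the centroids, points, and `temperature` value are made up.

```python
import math

def hard_memberships(point, centroids):
    """One-hot vector: 1.0 for the nearest centroid, 0.0 elsewhere."""
    nearest = min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))
    return [1.0 if j == nearest else 0.0 for j in range(len(centroids))]

def soft_memberships(point, centroids, temperature=50.0):
    """Softmax over negative squared distances: one weight per cluster."""
    scores = [-math.dist(point, c) ** 2 / temperature for c in centroids]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

centroids = [[0.0, 0.0], [10.0, 0.0]]  # hypothetical cluster centres
a, b = [4.0, 0.0], [6.0, 0.0]          # two items on opposite sides of the boundary

hard_sim = dot(hard_memberships(a, centroids), hard_memberships(b, centroids))
soft_sim = dot(soft_memberships(a, centroids), soft_memberships(b, centroids))
print(hard_sim, round(soft_sim, 3))
```

With hard assignment the two items fall into different clusters and share no dimension, whereas the soft weights overlap and give a nonzero similarity.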

FIG. 4 is a functional configuration diagram of the binary relation classification device according to the present invention.

As shown in FIG. 4, the binary relation classification device 1 includes a sentence set storage unit 10, a word pair extraction unit 11, a similar word pair extraction unit 12, a dependency word set extraction unit 13, a feature dependency word set extraction unit 14, a word appearance frequency counting unit 15, a word pair similarity calculation unit 16, and a word pair cluster generation unit 17. The functional units other than the sentence set storage unit 10 may be realized by executing a binary relation classification program that causes a computer installed in the device to function.

The sentence set storage unit 10 stores a large amount of sentence information.

The word pair extraction unit 11 extracts a plurality of word pairs, each consisting of a first word and a second word, from the sentence set storage unit 10 (see S1 in FIG. 1 described above). The extracted word pairs are output to the similar word pair extraction unit 12.

The similar word pair extraction unit 12 receives the extracted word pairs (see S2 in FIG. 1 described above). It extracts, from among them, the word pairs whose similarity, which expresses how readily the two words co-occur, is equal to or greater than a predetermined threshold. The extracted word pairs are output to the dependency word set extraction unit 13.

The dependency word set extraction unit 13 receives the extracted word pairs (see S3 in FIG. 1 described above). Referring to the sentence set storage unit 10, it extracts, for each input word pair, a first dependency word set co-occurring with the first word and a second dependency word set co-occurring with the second word. The extracted first and second dependency word sets are output to the feature dependency word set extraction unit 14.

The feature dependency word set extraction unit 14 receives the first dependency word set and the second dependency word set (see S4 in FIG. 1 described above). It extracts a first feature dependency word set, consisting of the dependency words that appear in the first dependency word set but not in the second dependency word set, and a second feature dependency word set, consisting of the dependency words that appear in the second dependency word set but not in the first dependency word set. The extracted first and second feature dependency word sets are output to the word appearance frequency counting unit 15.

The word appearance frequency counting unit 15 receives the first feature dependency word set and the second feature dependency word set (see S5 in FIG. 1 described above). For each dependency word belonging to the first feature dependency word set, it refers to the sentence set storage unit 10 and counts how often that dependency word appears in co-occurrence with the first word in the stored sentence set. Similarly, for each dependency word belonging to the second feature dependency word set, it refers to the sentence set storage unit 10 and counts how often that dependency word appears in co-occurrence with the second word in the stored sentence set. The counted appearance frequencies of the dependency words belonging to the first and second feature dependency word sets are output to the word pair similarity calculation unit 16.

The word pair similarity calculation unit 16 receives the counted appearance frequencies (see S6 in FIG. 1 described above). It derives a vector by concatenating a first vector, representing the appearance frequency of each dependency word belonging to the first feature dependency word set for the first word, with a second vector, representing the appearance frequency of each dependency word belonging to the second feature dependency word set for the second word. The resulting dependency word vector is output to the word pair cluster generation unit 17.

The word pair cluster generation unit 17 receives the dependency word vectors (see S7 in FIG. 1 described above). Referring to the similar word pair extraction unit 12, it clusters the word pairs held in the similar word pair extraction unit 12 on the basis of the input dependency word vectors, by division optimization clustering based on the similarity between vectors. The clustered word pairs are stored in the word pair cluster generation unit 17.

As described in detail above, the binary relation classification program, method, and device of the present invention can classify semantically similar word pairs into binary relations without defining in advance the inter-word relationships to be acquired.

According to the present invention, acquiring a wide variety of semantic relationships makes it easier to extract a user's intention and to discover a user's hidden behavior. The present invention can therefore provide, for example, a search keyword expansion function for a question answering system.

For example, suppose a user enters "beer party" as a search keyword. From the acquired semantic relationships, the inter-noun relationship between "beer party" and "edamame" is found to be an "event / item essential to the event" relationship. It can thus be inferred that "edamame" is an essential item for a "beer party". Adding "edamame" to the search query can then reduce noise in the search results.
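The keyword expansion described above might look like the following sketch. The relation table, the relation labels, and the function name `expand_query` are hypothetical; the text only states that a related word such as "edamame" is added to the query.

```python
def expand_query(keyword, relations, wanted_relation):
    """Return the keyword plus every word linked to it by the wanted relation."""
    extra = [w2 for (w1, w2), rel in relations.items()
             if w1 == keyword and rel == wanted_relation]
    return [keyword] + sorted(extra)

# Hypothetical output of the clustering step: word pair -> relation label.
relations = {
    ("beer party", "edamame"): "event / essential item",
    ("beer party", "summer"): "event / season",  # invented extra entry
}

query = expand_query("beer party", relations, "event / essential item")
print(query)
```

Filtering by relation label is what keeps loosely related words such as "summer" out of the expanded query.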

Those skilled in the art can readily make various changes, modifications, and omissions to the embodiments of the present invention described above within the scope of the technical idea and viewpoint of the present invention. The foregoing description is merely an example and is not intended to be limiting. The present invention is limited only by the claims and their equivalents.

DESCRIPTION OF SYMBOLS
1 Binary relation classification device
10 Sentence set storage unit
11 Word pair extraction unit
12 Similar word pair extraction unit
13 Dependency word set extraction unit
14 Feature dependency word set extraction unit
15 Word appearance frequency counting unit
16 Word pair similarity calculation unit
17 Word pair cluster generation unit

Claims

A binary relation classification program that causes a computer installed in a device to classify word pairs by semantic binary relation,
The device having a sentence set storage unit that stores a large amount of document information, and the program causing the computer to execute:
A first step of extracting, from the sentence set storage unit, a plurality of word pairs each consisting of a first word and a second word;
A second step of extracting, from among the word pairs, word pairs whose similarity is equal to or greater than a predetermined threshold;
A third step of extracting from the sentence set storage unit, for each word pair extracted in the second step, a first dependency word set that co-occurs with the first word and a second dependency word set that co-occurs with the second word;
A fourth step of extracting a first feature dependency word set consisting of dependency words that appear in the first dependency word set and do not appear in the second dependency word set, and a second feature dependency word set consisting of dependency words that appear in the second dependency word set and do not appear in the first dependency word set;
A fifth step of counting, for each dependency word belonging to the first feature dependency word set, the appearance frequency with which it co-occurs with the first word in the sentence set, and, for each dependency word belonging to the second feature dependency word set, the appearance frequency with which it co-occurs with the second word in the sentence set;
A sixth step of deriving a vector that combines a first vector, representing the appearance frequency of each dependency word belonging to the first feature dependency word set with respect to the first word, and a second vector, representing the appearance frequency of each dependency word belonging to the second feature dependency word set with respect to the second word; and
A seventh step of generating word pair clusters by division optimization clustering based on the similarity between the vectors.
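
The third through sixth steps above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the toy data in `DEPENDENCIES` and the function names `dependency_set`, `feature_sets`, and `pair_vector` are hypothetical, and the input is assumed to be (word, dependency word) pairs already obtained by dependency parsing of the sentence set.

```python
from collections import Counter

# Hypothetical toy input: (word, dependency word) pairs from a parsed sentence set.
DEPENDENCIES = [
    ("coffee", "drink"), ("coffee", "drink"), ("coffee", "brew"),
    ("coffee", "buy"), ("caffeine", "contain"), ("caffeine", "buy"),
    ("caffeine", "contain"),
]

def dependency_set(word, deps):
    """Third step: the set of dependency words that co-occur with `word`."""
    return {d for w, d in deps if w == word}

def feature_sets(first, second, deps):
    """Fourth step: dependency words that appear for one word but not the other."""
    s1, s2 = dependency_set(first, deps), dependency_set(second, deps)
    return s1 - s2, s2 - s1

def pair_vector(first, second, deps):
    """Fifth and sixth steps: count frequencies, then combine the two vectors."""
    f1, f2 = feature_sets(first, second, deps)
    counts = Counter(deps)  # frequency of each (word, dependency word) pair
    v1 = {("first", d): counts[(first, d)] for d in f1}
    v2 = {("second", d): counts[(second, d)] for d in f2}
    return {**v1, **v2}  # keys are tagged so the two halves never collide

vec = pair_vector("coffee", "caffeine", DEPENDENCIES)
```

In this toy data the shared dependency word "buy" is dropped by the fourth step, so only the dependency words exclusive to each side contribute to the combined vector.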

The binary relation classification program according to claim 1, further causing the computer to function so that:
The word is a noun,
The word pair is a noun pair,
The dependency word set is a predicate set, each predicate being a verb or an adjective, and
The feature dependency word set is a feature predicate set.

The binary relation classification program according to claim 1, further causing the computer to function so that:
The word is a predicate that is a verb or an adjective,
The word pair is a predicate pair,
The dependency word set is a noun set, and the feature dependency word set is a feature noun set.

The binary relation classification program according to claim 1, further causing the computer to function so that, in the second step, mutual information is used as the similarity and only word pairs whose mutual information is equal to or greater than the predetermined threshold are passed on for clustering.
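
A mutual-information filter of this kind might look like the following sketch. It assumes pointwise mutual information computed from sentence-level co-occurrence, which the claim does not specify. The function name `pmi_pairs`, the toy `SENTENCES`, and the threshold value are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_pairs(sentences, threshold):
    """Keep word pairs whose pointwise mutual information is >= threshold."""
    n = len(sentences)
    word_count = Counter()  # number of sentences containing each word
    pair_count = Counter()  # number of sentences containing both words
    for sent in sentences:
        words = sorted(set(sent))
        word_count.update(words)
        pair_count.update(combinations(words, 2))
    kept = {}
    for (a, b), c in pair_count.items():
        # PMI = log2( p(a, b) / (p(a) * p(b)) ), with probabilities as sentence ratios
        pmi = math.log2(c * n / (word_count[a] * word_count[b]))
        if pmi >= threshold:
            kept[(a, b)] = pmi
    return kept

# Hypothetical tokenized sentences.
SENTENCES = [
    ["coffee", "caffeine"],
    ["coffee", "caffeine"],
    ["tea", "milk"],
    ["coffee", "milk"],
]
kept = pmi_pairs(SENTENCES, threshold=0.3)
```

Here the pair ("coffee", "milk") co-occurs less often than chance would predict, so its score is negative and it is filtered out before any clustering.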

A binary relation classification method in a device that classifies word pairs by semantic binary relation,
The device having a sentence set storage unit that stores a large amount of document information, the method comprising:
A first step of extracting, from the sentence set storage unit, a plurality of word pairs each consisting of a first word and a second word;
A second step of extracting, from among the word pairs, word pairs whose similarity is equal to or greater than a predetermined threshold;
A third step of extracting from the sentence set storage unit, for each word pair extracted in the second step, a first dependency word set that co-occurs with the first word and a second dependency word set that co-occurs with the second word;
A fourth step of extracting a first feature dependency word set consisting of dependency words that appear in the first dependency word set and do not appear in the second dependency word set, and a second feature dependency word set consisting of dependency words that appear in the second dependency word set and do not appear in the first dependency word set;
A fifth step of counting, for each dependency word belonging to the first feature dependency word set, the appearance frequency with which it co-occurs with the first word in the sentence set, and, for each dependency word belonging to the second feature dependency word set, the appearance frequency with which it co-occurs with the second word in the sentence set;
A sixth step of deriving a vector that combines a first vector, representing the appearance frequency of each dependency word belonging to the first feature dependency word set with respect to the first word, and a second vector, representing the appearance frequency of each dependency word belonging to the second feature dependency word set with respect to the second word; and
A seventh step of generating word pair clusters by division optimization clustering based on the similarity between the vectors.

A binary relation classification device that classifies word pairs by semantic binary relation, comprising:
Sentence set storage means for storing a large amount of document information;
Word pair extraction means for extracting, from the sentence set storage means, a plurality of word pairs each consisting of a first word and a second word;
Similar word pair extraction means for extracting, from among the word pairs, word pairs whose similarity, which indicates how readily the two words co-occur, is equal to or greater than a predetermined threshold;
Dependency word set extraction means for extracting from the sentence set storage means, for each word pair extracted by the similar word pair extraction means, a first dependency word set that co-occurs with the first word and a second dependency word set that co-occurs with the second word;
Feature dependency word set extraction means for extracting a first feature dependency word set consisting of dependency words that appear in the first dependency word set and do not appear in the second dependency word set, and a second feature dependency word set consisting of dependency words that appear in the second dependency word set and do not appear in the first dependency word set;
Dependency word appearance frequency counting means for counting, for each dependency word belonging to the first feature dependency word set, the appearance frequency with which it co-occurs with the first word in the sentence set, and, for each dependency word belonging to the second feature dependency word set, the appearance frequency with which it co-occurs with the second word in the sentence set;
Word pair similarity calculation means for deriving a vector that combines a first vector, representing the appearance frequency of each dependency word belonging to the first feature dependency word set with respect to the first word, and a second vector, representing the appearance frequency of each dependency word belonging to the second feature dependency word set with respect to the second word; and
Word pair cluster generation means for generating word pair clusters by division optimization clustering based on the similarity between the vectors.
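
The cluster generation recited in the claims can be sketched as follows. The claims name only "division optimization clustering based on similarity between vectors"; this sketch assumes a k-means-style partitioning with cosine similarity, which is one possible reading and not necessarily the method of the embodiment. The names `cosine`, `centroid`, and `cluster`, the seeding rule, and the toy `VECTORS` are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Component-wise mean of sparse vectors."""
    total = {}
    for v in vectors:
        for k, x in v.items():
            total[k] = total.get(k, 0.0) + x
    return {k: x / len(vectors) for k, x in total.items()}

def cluster(vectors, k, iterations=10):
    """Assign each vector to the most similar centroid, then update centroids."""
    # Assumption: the first k vectors seed the centroids (deterministic, for illustration).
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = centroid(members)
    return labels

# Hypothetical combined vectors, one per word pair ("w1:"/"w2:" mark the two halves).
VECTORS = [
    {"w1:drink": 2, "w2:contain": 2},
    {"w1:eat": 1, "w2:spoil": 3},
    {"w1:drink": 1, "w2:contain": 3},
    {"w1:eat": 2, "w2:spoil": 2},
]
labels = cluster(VECTORS, k=2)
```

Word pairs whose combined vectors share the same characteristic dependency words end up with the same label, so each cluster can be read as one candidate binary relation.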