JP2014119988A

JP2014119988A - Synonym determination device, synonym learning device, and program

Info

Publication number: JP2014119988A
Application number: JP2012274963A
Authority: JP
Inventors: Tomoko Izumi; 朋子泉; Kuniko Saito; 邦子齋藤; Yoshihiro Matsuo; 義博松尾; Sadao Kurohashi; 禎夫黒橋; Daisuke Kawahara; 大輔河原; Tomohide Shibata; 知秀柴田
Original assignee: Kyoto University; Nippon Telegraph and Telephone Corp
Current assignee: Kyoto University; Nippon Telegraph and Telephone Corp
Priority date: 2012-12-17
Filing date: 2012-12-17
Publication date: 2014-06-30
Anticipated expiration: 2032-12-17
Also published as: JP5916016B2

Abstract

PROBLEM TO BE SOLVED: To determine with high accuracy whether predicates are synonymous. SOLUTION: A feature construction unit 222 extracts a distributional similarity, a dictionary definition sentence feature, a semantic attribute feature, and a functional expression feature for an "argument-predicate" pair input by an input unit 210. A synonym determination unit 230 determines whether the "argument-predicate" pair is synonymous on the basis of the extracted distributional similarity, dictionary definition sentence feature, semantic attribute feature, and functional expression feature, and of a synonym determination model, stored in a synonym determination model storage unit 232, for determining whether an "argument-predicate" pair is synonymous.

Description

The present invention relates to a synonym determination device, a synonym learning device, and a program, and more particularly to a synonym determination device, a synonym learning device, and a program for determining whether a predicate pair is synonymous.

There is currently a demand for more accurate search technology that finds desired information in large volumes of text, such as blogs on the Web and spoken-dialogue logs, and for more accurate text mining technology that automatically extracts and aggregates only the useful information. Realizing these requires that a computer understand the meaning of natural-language sentences.

For example, suppose there are two sentences: (1) 「ＸＸのランチに満足だった。」 (“I was satisfied with the lunch at XX.”) and (2) 「ＸＸのランチを堪能しました。」 (“I thoroughly enjoyed the lunch at XX.”). Unless these can be judged to “express the same thing”, the information a user is looking for cannot be retrieved correctly, and the “aggregation of identical information” that text mining requires cannot be performed.

Sentences (1) and (2) express the same thing, but it cannot be recognized from their character strings that they have the same meaning (that is, that they are synonymous). This hinders the information retrieval and information extraction that the user wants. It is therefore necessary to understand the meaning of natural-language sentences using clues other than the surface character strings.

In particular, predicates such as 「満足だった」 (“was satisfied”) and 「堪能しました」 (“thoroughly enjoyed”), which express the “what happened” of a sentence, carry the core information of the sentence. If their synonymy can be determined, more accurate information retrieval and information extraction become possible.

As a conventional method for determining predicate synonymy, a method has been proposed that also covers predicates that become synonymous depending on the element they combine with, such as 「冷え込む」 (“cool down”) and 「悪化する」 (“worsen”) in 「（景気が）冷え込む」 (“(the economy) cools down”) and 「（景気が）悪化する」 (“(the economy) worsens”) (Non-Patent Document 1). Hereinafter, the part expressing “what happened” (here, “cool down” and “worsen”) is called the predicate, and the part expressing “what” as subject or object (here, “the economy”) is called the term.

This method pairs a term expressing “what” with a predicate, as in 「景気が-冷え込む」 (“economy-cools down”), and computes synonymy using a measure called distributional similarity.

Distributional similarity is based on the idea that words with similar meanings also appear in similar contexts. Taking the elements that appear around the word under comparison as features, it computes whether two words appear in similar contexts from which elements appear around them and how often.
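The idea can be illustrated with a minimal sketch. It assumes that contexts are the other tokens in the same sentence and that vectors are compared with cosine similarity; the source does not name a particular similarity measure, and the toy corpus and function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def context_vector(target, sentences):
    """Count the tokens that co-occur with `target` in the same sentence."""
    counts = Counter()
    for tokens in sentences:
        if target in tokens:
            counts.update(t for t in tokens if t != target)
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical, pre-tokenized toy corpus.
corpus = [
    ["economy", "cools_down", "exports", "fall"],
    ["economy", "worsens", "exports", "fall"],
    ["weather", "improves", "tourists", "return"],
]

sim_close = cosine(context_vector("cools_down", corpus), context_vector("worsens", corpus))
sim_far = cosine(context_vector("cools_down", corpus), context_vector("improves", corpus))
```

Here the two predicates that share all of their surrounding words score higher than the pair that shares none, which is the behavior the measure is designed to capture.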

The above method takes as features, for a target “term-predicate”, the other “term-predicates” or “predicates” that appear around it, and determines synonymy using the distributional similarity of those surrounding elements.

Non-Patent Document 1: Tomohide Shibata and Sadao Kurohashi (2010). 文脈に依存した述語の同義関係獲得 (Acquisition of context-dependent synonym relations between predicates). IPSJ SIG Notes 2010-NL-199(13), 1-6.

However, the method described in Non-Patent Document 1 has the problem that even two sentences expressing exactly opposite things, such as (3) 「棚を撤去する」 (“remove the shelf”) and (4) 「棚を設置する」 (“install the shelf”), are judged to be “similar in meaning”.

There is also the problem that, for two sentences expressing a “sequence of actions”, such as (5) 「テキストを作成する」 (“create a text”) and (6) 「テキストを用いる」 (“use a text”), the predicates receive a high similarity score. This relation between predicates is hereinafter called a “time-elapse relation”.

These problems arise because distributional similarity judges synonymy by the similarity of contexts. For example, the method of Non-Patent Document 1 determines synonymy using as clues the words that appear within one sentence together with the target “term-predicate” or “predicate”. However, antonymous predicates can appear in contexts that are identical except for the predicate itself, so they are difficult to distinguish by distributional similarity. Similarly, predicates in a time-elapse relation often occur within a single sentence, as in 「教師が協力して、テキストを作成し、授業ではそのテキストを用いる。」 (“The teachers cooperate to create a text, and use that text in class.”). In that case, 「テキスト-ヲ-作成する」 (“text-wo-create”) and 「テキスト-ヲ-用いる」 (“text-wo-use”) both share the context element 「教師-ガ-協力」 (“teacher-ga-cooperate”), and their similarity is high as a result.

Thus, because the method of Non-Patent Document 1 uses distributional similarity, which measures word similarity with the occurrence of surrounding words as features, “antonym relations” and “time-elapse relations”, in which the surrounding words are similar, are also judged to be “similar”. As a result, predicate synonymy cannot be determined with high accuracy.

Words whose part of speech is verb, adjective, adjectival noun (形容動詞), or noun are called “content words”, and sentence-final expressions such as particles and auxiliary verbs that appear after the content word of a predicate are called “functional expressions” (Non-Patent Document 6: Suguru Matsuyoshi, Satoshi Sato, Takehito Utsuro (2007). 日本語機能表現辞書の編纂 (Compilation of a dictionary of Japanese functional expressions). 自然言語処理 (Journal of Natural Language Processing), vol. 14, no. 5, 123-146). A predicate subject to synonym determination, for example 「募集している」 (“is recruiting”), is composed of the content word 「募集する」 (“recruit”) and the functional expression 「ている」, which expresses continuation of the action.

A functional expression gives the predicate an important meaning (for example, “expresses continuation of the action”), so synonym determination must consider the predicate together with its functional expression. For example, consider (7) 「サポーターを募集している」 (“is recruiting supporters”), (8) 「サポーターを募っている」 (“is seeking supporters”; synonymous with (7)), and (9) 「サポーターを募っているかもしれない」 (“may be seeking supporters”; not synonymous with (7)). The predicates 「募集する」 and 「募る」 are synonymous or not depending on which functional expression they are combined with.

In addition, the pattern in which functional expressions appear can itself be an important feature indicating whether predicates are synonymous. Take as an example 「片付く」 (“be tidied up”), the content word of the predicate in 「キッチンが片付いている」 (“the kitchen is tidy”), and 「整う」 (“be in order”), the content word of the predicate in 「キッチンが整っている」 (“the kitchen is in order”). FIG. 2 shows an example of the frequencies of functional expressions for the content words 「片付く」 and 「整う」, extracted from 8 million blog sentences. For both words, most of the functional expressions that follow express “continuation”, such as 「ている」 and 「ていた」. This is because both 「片付く」 and 「整う」 are verbs that take inanimate subjects (for example, “room” or “desktop”), so attaching 「ている」 readily yields an expression of the subject's “state”. Desiderative expressions such as 「片付けたい」 (“want to tidy up”) and 「整いたい」 (“want to be in order”), on the other hand, rarely appear. Thus, synonymous content words tend to appear with similar functional expressions. For predicate synonym determination, the appearance pattern of functional expressions within predicates can be an important feature for judging synonymy, so the functional expressions of predicates need to be considered.
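As a rough illustration of comparing appearance patterns, the sketch below normalizes the functional-expression counts of each content word into relative frequencies and measures how much the two distributions overlap. The overlap measure (histogram intersection) is an assumption, and the counts are hypothetical placeholders, not the values of FIG. 2.

```python
def relative_freq(counts):
    """Turn raw counts into relative frequencies."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

def pattern_overlap(counts_a, counts_b):
    """Histogram intersection of two functional-expression distributions (assumed measure)."""
    pa, pb = relative_freq(counts_a), relative_freq(counts_b)
    return sum(min(pa[k], pb[k]) for k in pa.keys() & pb.keys())

# Hypothetical counts of functional expressions following each content word.
katazuku = {"ている": 60, "ていた": 30, "たい": 10}    # 片付く
totonou = {"ている": 50, "ていた": 40, "たい": 10}     # 整う
other_verb = {"たい": 70, "ている": 30}                 # some unrelated verb

similar_pattern = pattern_overlap(katazuku, totonou)      # close to 0.9
different_pattern = pattern_overlap(katazuku, other_verb)  # close to 0.4
```

Because the comparison is made per content word rather than per full predicate, the counts stay dense even when individual predicate forms are rare.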

However, with the method described in Non-Patent Document 1, making a synonym determination that takes the functional expressions of predicates into account requires, taking (7), (8), and (9) above as examples, computing the distributional similarity separately for three distinct predicates: 「募集している」 (“is recruiting”), 「募っている」 (“is seeking”), and 「募るかもしれない」 (“may seek”). In that case the data for the distributional similarity computation becomes sparse (that is, each full predicate occurs far less often than its content word), so an enormous amount of data is needed to compute the distributional similarity correctly.

The present invention has been made to solve the above problems, and its object is to provide a synonym determination device, a synonym learning device, and a program that can determine with high accuracy whether a predicate pair is synonymous.

To achieve the above object, the synonym determination device of the first invention includes a feature construction unit, a synonym determination model storage unit, and a synonym determination unit. The feature construction unit extracts at least one of a first feature and a second feature. The first feature is whether the definition sentence of each predicate of an input predicate pair contains the other predicate of the pair, where the definition sentences are obtained from a definition sentence set consisting of definition sentences prepared in advance for each of a plurality of predicates. The second feature is the semantic attributes shared by the predicate pair, extracted from the semantic attributes of each predicate of the input pair, where the semantic attributes are obtained from a semantic attribute set consisting of semantic attributes prepared in advance for each of a plurality of predicates. The synonym determination model storage unit stores a synonym determination model obtained in advance. The synonym determination unit determines whether the input predicate pair is synonymous on the basis of the features extracted by the feature construction unit.

According to the first invention, the feature construction unit extracts at least one of the first feature and the second feature. The first feature is whether the definition sentence of each predicate of the input predicate pair contains the other predicate of the pair, where the definition sentences are obtained from a definition sentence set consisting of definition sentences prepared in advance for each of a plurality of predicates. The second feature is the semantic attributes shared by the predicate pair, extracted from the semantic attributes of each predicate of the input pair, where the semantic attributes are obtained from a semantic attribute set consisting of semantic attributes prepared in advance for each of a plurality of predicates. The synonym determination unit then determines whether the input predicate pair is synonymous, on the basis of at least one of the first and second features extracted by the feature construction unit and of the synonym determination model stored in the synonym determination model storage unit, which is a model obtained in advance for determining whether a predicate pair is synonymous.

In this way, the first feature is whether the definition sentence of each predicate of the input predicate pair contains the other predicate of the pair, and the second feature is the semantic attributes shared by the predicates of the input pair. By extracting at least one of the first and second features and determining whether the input predicate pair is synonymous, it is possible to determine with high accuracy whether a predicate pair is synonymous.
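The two features for a predicate pair could be sketched as follows. This is only an illustration under stated assumptions: the definition sentences are assumed to be pre-tokenized, the first feature is returned separately for each direction, and the dictionaries, predicates, and attribute names are hypothetical toy data rather than the resources used in the invention.

```python
def definition_feature(pred_a, pred_b, definitions):
    """First feature: does each predicate's definition sentence contain the other predicate?

    Returns (pred_b found in definition of pred_a, pred_a found in definition of pred_b).
    """
    def_a = definitions.get(pred_a, [])
    def_b = definitions.get(pred_b, [])
    return (pred_b in def_a, pred_a in def_b)

def shared_attributes(pred_a, pred_b, attributes):
    """Second feature: semantic attributes common to both predicates."""
    return attributes.get(pred_a, set()) & attributes.get(pred_b, set())

# Hypothetical definition sentence set (tokenized) and semantic attribute set.
definitions = {
    "purchase": ["to", "buy", "something", "for", "money"],
    "buy": ["to", "obtain", "something", "by", "paying", "money"],
    "sell": ["to", "give", "something", "in", "exchange", "for", "money"],
}
attributes = {
    "purchase": {"acquisition", "transaction"},
    "buy": {"acquisition", "transaction"},
    "sell": {"transfer", "transaction"},
}

f1 = definition_feature("purchase", "buy", definitions)
f2 = shared_attributes("purchase", "sell", attributes)
```

Neither feature depends on surrounding context, which is why they can help separate antonyms that distributional similarity alone scores as similar.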

The synonym determination device of the second invention includes a feature construction unit, a synonym determination model storage unit, and a synonym determination unit. The feature construction unit extracts at least one of a first feature and a second feature. The first feature is at least (a) out of the following two: (a) whether the definition sentence of each predicate of an input “term-predicate” pair contains the predicate of the other member of the pair, and (b) whether the definition sentence of each predicate of the “term-predicate” pair contains the term of the other member of the pair, where the definition sentences are obtained from a definition sentence set consisting of definition sentences prepared in advance for each of a plurality of predicates. The second feature is the semantic attributes shared by the “term-predicate” pair, extracted from the semantic attributes of each predicate of the input pair, where the semantic attributes are obtained from a semantic attribute set consisting of semantic attributes prepared in advance for each of a plurality of predicates. The synonym determination model storage unit stores a synonym determination model obtained in advance. The synonym determination unit determines whether the input “term-predicate” pair is synonymous on the basis of the features extracted by the feature construction unit.

According to the second invention, the feature construction unit extracts at least one of the first feature and the second feature. The first feature is at least (a) out of the following two: (a) whether the definition sentence of each predicate of the input “term-predicate” pair contains the predicate of the other member of the pair, and (b) whether the definition sentence of each predicate of the “term-predicate” pair contains the term of the other member of the pair, where the definition sentences are obtained from a definition sentence set consisting of definition sentences prepared in advance for each of a plurality of predicates. The second feature is the semantic attributes shared by the “term-predicate” pair, extracted from the semantic attributes of each predicate of the input pair, where the semantic attributes are obtained from a semantic attribute set consisting of semantic attributes prepared in advance for each of a plurality of predicates. The synonym determination unit then determines whether the input “term-predicate” pair is synonymous, on the basis of at least one of the first and second features extracted by the feature construction unit and of the synonym determination model stored in the synonym determination model storage unit, which is a model obtained in advance for determining whether a “term-predicate” pair is synonymous.

In this way, the first feature is at least (a) out of (a) whether the definition sentence of each predicate of the input “term-predicate” pair contains the predicate of the other member of the pair and (b) whether that definition sentence contains the term of the other member of the pair, and the second feature is the semantic attributes shared by the predicates of the input “term-predicate” pair. By extracting at least one of the first and second features and determining whether the input “term-predicate” pair is synonymous, it is possible to determine with high accuracy whether a predicate pair is synonymous.
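For “term-predicate” pairs, the definition-sentence check can additionally look for the paired term. The sketch below assumes tokenized definition sentences and reports each check separately; the `TermPredicate` type, the feature names, and the toy definitions are hypothetical.

```python
from typing import NamedTuple

class TermPredicate(NamedTuple):
    term: str
    predicate: str

def definition_features(a, b, definitions):
    """First feature for a term-predicate pair: predicate checks (a) and term checks (b)."""
    def_a = definitions.get(a.predicate, [])
    def_b = definitions.get(b.predicate, [])
    return {
        "pred_b_in_def_a": b.predicate in def_a,  # check (a), one direction
        "pred_a_in_def_b": a.predicate in def_b,  # check (a), other direction
        "term_b_in_def_a": b.term in def_a,       # check (b), one direction
        "term_a_in_def_b": a.term in def_b,       # check (b), other direction
    }

# Hypothetical tokenized definition sentences.
definitions = {
    "cools_down": ["temperature", "drops", "or", "economy", "worsens"],
    "worsens": ["become", "worse"],
}

pair_a = TermPredicate("economy", "cools_down")
pair_b = TermPredicate("economy", "worsens")
feats = definition_features(pair_a, pair_b, definitions)
```

In this toy case the definition of one predicate mentions both the other predicate and the shared term, which is the kind of evidence the term check is meant to capture for predicates that are synonymous only with a particular term.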

The synonym determination learning device of the third invention includes a feature construction unit and a synonym determination model learning unit. The first feature is whether the definition sentence of each predicate of a predicate pair contains the other predicate of the pair, where the definition sentences are obtained from a definition sentence set consisting of definition sentences prepared in advance for each of a plurality of predicates. The second feature is the semantic attributes shared by the semantic attributes of each predicate of the predicate pair, where the semantic attributes are obtained from a semantic attribute set consisting of semantic attributes prepared in advance for each of a plurality of predicates. The feature construction unit extracts at least one of the first feature and the second feature for each of a plurality of predicate pairs that are prepared in advance and labeled with information on whether they are synonymous. The synonym determination model learning unit learns a synonym determination model for determining whether a predicate pair is synonymous, on the basis of at least one of the first and second features extracted by the feature construction unit for the plurality of predicate pairs and of the synonymy information for the plurality of predicate pairs.

According to the third invention, the feature construction unit extracts at least one of the first feature and the second feature for each of a plurality of predicate pairs that are prepared in advance and labeled with information on whether they are synonymous. The first feature is whether the definition sentence of each predicate of a predicate pair contains the other predicate of the pair, where the definition sentences are obtained from a definition sentence set consisting of definition sentences prepared in advance for each of a plurality of predicates. The second feature is the semantic attributes shared by the semantic attributes of each predicate of the predicate pair, where the semantic attributes are obtained from a semantic attribute set consisting of semantic attributes prepared in advance for each of a plurality of predicates. A synonym determination model for determining whether a predicate pair is synonymous is then learned on the basis of at least one of the first and second features extracted by the feature construction unit for the plurality of predicate pairs and of the synonymy information for the plurality of predicate pairs.

In this way, the first feature is whether the definition sentence of each predicate of a predicate pair contains the other predicate of the pair, and the second feature is the semantic attributes shared by the predicates of the pair. By extracting at least one of the first and second features for each of a plurality of predicate pairs that are prepared in advance and labeled with information on whether they are synonymous, and learning a synonym determination model for determining whether a predicate pair is synonymous, it is possible to determine with high accuracy whether a predicate pair is synonymous.
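The learning step can be pictured as fitting a binary classifier to labeled feature vectors. This passage does not state which learning algorithm is used, so the sketch below substitutes a simple perceptron as a stand-in; the feature encoding (a 0/1 value for the first feature and the number of shared semantic attributes for the second) and the training examples are likewise hypothetical.

```python
def score(weights, bias, x):
    """Linear score of a feature vector."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def train_perceptron(examples, epochs=20):
    """examples: list of (feature_vector, label), label 1 = synonymous, 0 = not."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in examples:
            pred = 1 if score(weights, bias, x) > 0 else 0
            err = y - pred
            if err:  # update only on mistakes
                weights = [w + err * xi for w, xi in zip(weights, x)]
                bias += err
    return weights, bias

def predict(model, x):
    """Apply the learned model to the features of a new pair."""
    weights, bias = model
    return 1 if score(weights, bias, x) > 0 else 0

# Hypothetical labeled pairs: [first feature (0/1), number of shared semantic attributes].
training = [
    ([1, 2], 1),
    ([1, 1], 1),
    ([0, 0], 0),
    ([0, 1], 0),
]
model = train_perceptron(training)
```

The learned `model` plays the role of the stored synonym determination model: at determination time the same features are extracted for an input pair and passed to `predict`.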

The synonym determination learning device of the fourth invention includes a feature construction unit and a synonym determination model learning unit. The first feature is at least (a) out of the following two: (a) whether the definition sentence of each predicate of a “term-predicate” pair contains the predicate of the other member of the pair, and (b) whether the definition sentence of each predicate of the “term-predicate” pair contains the term of the other member of the pair, where the definition sentences are obtained from a definition sentence set consisting of definition sentences prepared in advance for each of a plurality of predicates. The second feature is the semantic attributes shared by the semantic attributes of each predicate of the “term-predicate” pair, where the semantic attributes are obtained from a semantic attribute set consisting of semantic attributes prepared in advance for each of a plurality of predicates. The feature construction unit extracts at least one of the first feature and the second feature for each of a plurality of “term-predicate” pairs that are prepared in advance and labeled with information on whether they are synonymous. The synonym determination model learning unit learns a synonym determination model for determining whether a “term-predicate” pair is synonymous, on the basis of at least one of the first and second features extracted by the feature construction unit for the plurality of “term-predicate” pairs and of the synonymy information for the plurality of “term-predicate” pairs.

According to the fourth invention, the feature construction unit extracts at least one of the first feature and the second feature for each of a plurality of “term-predicate” pairs that are prepared in advance and labeled with information on whether they are synonymous. The first feature is at least (a) out of the following two: (a) whether the definition sentence of each predicate of a “term-predicate” pair contains the predicate of the other member of the pair, and (b) whether the definition sentence of each predicate of the “term-predicate” pair contains the term of the other member of the pair, where the definition sentences are obtained from a definition sentence set consisting of definition sentences prepared in advance for each of a plurality of predicates. The second feature is the semantic attributes shared by the semantic attributes of each predicate of the “term-predicate” pair, where the semantic attributes are obtained from a semantic attribute set consisting of semantic attributes prepared in advance for each of a plurality of predicates. A synonym determination model for determining whether a “term-predicate” pair is synonymous is then learned on the basis of at least one of the first and second features extracted by the feature construction unit for the plurality of “term-predicate” pairs and of the synonymy information for the plurality of “term-predicate” pairs.

In this way, the first feature is at least (a) out of (a) whether the definition sentence of each predicate of a “term-predicate” pair contains the predicate of the other member of the pair and (b) whether that definition sentence contains the term of the other member of the pair, and the second feature is the semantic attributes shared by the predicates of the “term-predicate” pair. By extracting at least one of the first and second features for each of a plurality of “term-predicate” pairs that are prepared in advance and labeled with information on whether they are synonymous, and learning a synonym determination model for determining whether a “term-predicate” pair is synonymous, it is possible to determine with high accuracy whether a “term-predicate” pair is synonymous.

In the synonym determination device according to the first invention, the feature construction unit may, in addition to extracting at least one of the first feature and the second feature, extract at least one of the following third to seventh features. The third feature is the number of vocabulary items that appear in both of the definition sentences of the predicates of the predicate pair. The fourth feature is the degree of overlap of the semantic attributes of the predicate pair, weighted according to the level of detail of the semantic attributes shared by the pair, which are extracted from the semantic attributes of each predicate of the input pair. The fifth feature is a distributional similarity obtained by comparing, for each predicate of the input predicate pair, the words that appear around that predicate in a text corpus. The sixth feature is the semantic labels shared by the predicate pair, extracted from the semantic labels of the functional expressions of each predicate of the input pair, where the semantic labels are obtained from a semantic label set consisting of semantic labels prepared in advance for the functional expressions of each of a plurality of predicates. The seventh feature is the degree of overlap of those shared semantic labels of the predicate pair. The synonym determination unit may then determine whether the predicate pair is synonymous on the basis of the features extracted by the feature construction unit.
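Three of these additional features lend themselves to a short sketch. The passage does not give the formulas, so the choices below are assumptions: the third feature counts distinct shared tokens, the fourth weights each semantic attribute by a hypothetical depth in an attribute hierarchy (deeper meaning more detailed) and normalizes by the weighted union, and the seventh uses a Jaccard ratio over functional-expression semantic labels. All names and data are illustrative.

```python
def shared_vocab_count(def_a, def_b):
    """Third feature: number of distinct tokens appearing in both definition sentences."""
    return len(set(def_a) & set(def_b))

def weighted_attribute_overlap(attrs_a, attrs_b, depth):
    """Fourth feature (assumed form): overlap weighted by how detailed each attribute is."""
    total = sum(depth[a] for a in attrs_a | attrs_b)
    return sum(depth[a] for a in attrs_a & attrs_b) / total if total else 0.0

def label_overlap(labels_a, labels_b):
    """Seventh feature (assumed form): Jaccard overlap of functional-expression labels."""
    union = labels_a | labels_b
    return len(labels_a & labels_b) / len(union) if union else 0.0

# Hypothetical inputs.
def_a = ["to", "buy", "something", "for", "money"]
def_b = ["to", "obtain", "something", "by", "paying", "money"]
depth = {"action": 1, "transaction": 2, "acquisition": 3, "transfer": 3}
attrs_a = {"action", "transaction", "acquisition"}
attrs_b = {"action", "transaction", "transfer"}

f3 = shared_vocab_count(def_a, def_b)                     # to, something, money
f4 = weighted_attribute_overlap(attrs_a, attrs_b, depth)  # (1 + 2) / (1 + 2 + 3 + 3)
f7 = label_overlap({"continuation"}, {"continuation", "conjecture"})
```

Weighting by depth means that sharing only a coarse attribute such as "action" contributes less than sharing a detailed one, which matches the stated intent of weighting by level of detail.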

また、上記第２の発明に係る同義判定装置は、前記第１の素性及び前記第２の素性のうち少なくとも１つの素性を抽出すると共に、前記「項-述部」ペアの述部各々の前記定義文の両方に出現する語彙の数を第３の素性とし、前記入力された前記「項-述部」ペアの述部各々の意味属性に基づいて抽出される前記「項-述部」ペアに共通する意味属性の詳細の度合いに応じた重みを付加した前記「項-述部」ペアの意味属性の重なり度合いを第４の素性とし、入力された前記「項-述部」ペアの「項-述部」の各々について、テキストコーパスにおいて前記「項-述部」の周辺に出現する単語を比較した分布類似度、及び前記「項-述部」ペアの述部の各々について、テキストコーパスにおいて前記「項-述部」の述部の周辺に出現する単語を比較した分布類似度のうち少なくとも前記「項-述部」の周辺に出現する単語を比較した分布類似度を第５の素性とし、予め用意された複数の述部の各々の機能表現の意味ラベルからなる意味ラベル集合から得られる、前記入力された前記「項-述部」ペアの述部各々の機能表現の意味ラベルに基づいて抽出される前記「項-述部」ペアで共通する意味ラベルを第６の素性とし、前記「項-述部」ペアの前記共通する意味ラベルの重なり度合いを第７の素性とし、前記第３の素性乃至第７の素性のうち少なくとも１つの素性を抽出し、前記判定手段は、前記素性抽出手段によって抽出された素性に基づいて前記「項-述部」ペアが同義であるか否かを判定するようにすることができる。 Further, the synonym determination device according to the second invention may extract at least one of the first feature and the second feature and, in addition, at least one of the third to seventh features, which are defined as follows. The third feature is the number of words that appear in the definition sentences of both predicates of the “term-predicate” pair. The fourth feature is the degree of overlap of the semantic attributes of the “term-predicate” pair, weighted according to the level of detail of the semantic attributes common to the pair, which are extracted based on the semantic attributes of each predicate of the input pair. The fifth feature comprises at least the first of the following two distribution similarities: one obtained by comparing, for each “term-predicate” of the input pair, the words that appear around that “term-predicate” in a text corpus, and one obtained by comparing, for each predicate of the pair, the words that appear around that predicate in the text corpus. The sixth feature is the semantic labels common to the “term-predicate” pair, extracted based on the semantic labels of the functional expressions of each predicate of the input pair, which are obtained from a semantic label set made up of the semantic labels of the functional expressions of a plurality of predicates prepared in advance. The seventh feature is the degree of overlap of those common semantic labels. The determination means then determines whether or not the “term-predicate” pair is synonymous based on the features extracted by the feature extraction means.

また、本発明のプログラムは、コンピュータを、上記の同義判定装置又は同義学習装置を構成する各手段として機能させるためのプログラムである。 The program of the present invention is a program for causing a computer to function as each of the means constituting the above synonym determination device or synonym learning device.

以上説明したように、本発明の同義判定装置、同義学習装置、及びプログラムによれば、述部ペアが同義であるか否かを高精度に判定することができる。 As described above, according to the synonym determination device, the synonym learning device, and the program of the present invention, it is possible to determine with high accuracy whether or not the predicate pair is synonymous.

本実施の形態の素性ベクトル構築装置の機能的構成を示すブロック図である。 A block diagram showing the functional configuration of the feature vector construction device of the present embodiment.
ブログ８００万文から抽出した、内容語に対する機能表現の出現頻度の一例である。 An example of how often functional expressions occur with content words, extracted from 8 million blog sentences.
基本解析部における解析結果の例を示す図である。 A diagram showing an example of an analysis result from the basic analysis unit.
素性抽出部において抽出される素性の例を示す図である。 A diagram showing an example of the features extracted by the feature extraction unit.
素性ベクトル構築部において構築される素性ベクトルの例を示す図である。 A diagram showing an example of the feature vectors constructed by the feature vector construction unit.
本実施の形態の同義学習装置の機能的構成を示すブロック図である。 A block diagram showing the functional configuration of the synonym learning device of the present embodiment.
正解コーパスの例を示す図である。 A diagram showing an example of the correct corpus.
分布類似度算出部において算出される分布類似度の例を示す図である。 A diagram showing an example of the distribution similarity calculated by the distribution similarity calculation unit.
定義文抽出部において抽出される定義文の例を示す図である（定義文相互補完性）。 A diagram showing an example of the definition sentences extracted by the definition sentence extraction unit (definition sentence mutual complementarity).
定義文抽出部において抽出される定義文の例を示す図である（語彙の重なり）。 A diagram showing an example of the definition sentences extracted by the definition sentence extraction unit (vocabulary overlap).
本実施の形態の辞書定義文素性抽出部の機能的構成を示すブロック図である。 A block diagram showing the functional configuration of the dictionary definition sentence feature extraction unit of the present embodiment.
辞書定義文素性抽出部で抽出する素性の一覧を示す図である。 A diagram showing a list of the features extracted by the dictionary definition sentence feature extraction unit.
辞書定義文素性抽出部で抽出する素性の例を示す図である。 A diagram showing an example of the features extracted by the dictionary definition sentence feature extraction unit.
用言属性の例を示す図である。 A diagram showing an example of declinable-word (yōgen) attributes.
用言属性の階層の例を示す図である。 A diagram showing an example of the hierarchy of declinable-word (yōgen) attributes.
本実施形態の意味属性素性抽出部の機能的構成を示すブロック図である。 A block diagram showing the functional configuration of the semantic attribute feature extraction unit of the present embodiment.
意味属性素性抽出部において抽出される素性の例を示す図である。 A diagram showing an example of the features extracted by the semantic attribute feature extraction unit.
意味ラベル付与部において付与される意味ラベルの例を示す図である。 A diagram showing an example of the semantic labels assigned by the semantic label assignment unit.
本実施形態の機能表現素性抽出部の機能的構成を示すブロック図である。 A block diagram showing the functional configuration of the functional expression feature extraction unit of the present embodiment.
機能表現の例を示す図である。 A diagram showing an example of functional expressions.
同義判定モデル学習部において作成される素性の例を示す図である。 A diagram showing an example of the features created by the synonym determination model learning unit.
本実施の形態の同義判定装置の機能的構成を示すブロック図である。 A block diagram showing the functional configuration of the synonym determination device of the present embodiment.
素性構築部において作成される素性の例を示す図である。 A diagram showing an example of the features created by the feature construction unit.
本実施の形態の素性ベクトル構築装置における同義判定モデル学習処理ルーチン中の素性ベクトル構築処理ルーチンを示すフローチャートである。 A flowchart showing the feature vector construction processing routine, within the synonym determination model learning processing routine, in the feature vector construction device of the present embodiment.
本実施の形態の同義学習装置における同義判定モデル学習処理ルーチンを示すフローチャートである。 A flowchart showing the synonym determination model learning processing routine in the synonym learning device of the present embodiment.
本実施の形態の同義学習装置における同義判定モデル学習処理ルーチン中の分布類似度算出処理ルーチンを示すフローチャートである。 A flowchart showing the distribution similarity calculation processing routine, within the synonym determination model learning processing routine, in the synonym learning device of the present embodiment.
本実施の形態の同義学習装置における同義判定モデル学習処理ルーチン中の辞書定義文素性抽出処理ルーチンを示すフローチャートである。 A flowchart showing the dictionary definition sentence feature extraction processing routine, within the synonym determination model learning processing routine, in the synonym learning device of the present embodiment.
本実施の形態の同義学習装置における同義判定モデル学習処理ルーチン中の意味属性素性抽出処理ルーチンを示すフローチャートである。 A flowchart showing the semantic attribute feature extraction processing routine, within the synonym determination model learning processing routine, in the synonym learning device of the present embodiment.
本実施の形態の同義学習装置における同義判定モデル学習処理ルーチン中の機能表現素性抽出処理ルーチンを示すフローチャートである。 A flowchart showing the functional expression feature extraction processing routine, within the synonym determination model learning processing routine, in the synonym learning device of the present embodiment.
本実施の形態の同義判定装置における同義判定処理ルーチンを示すフローチャートである。 A flowchart showing the synonym determination processing routine in the synonym determination device of the present embodiment.
同義判定の結果と入力した素性の例を示す図である。 A diagram showing an example of a synonym determination result and the input features.
同義判定の結果と入力した素性の例を示す図である。 A diagram showing an example of a synonym determination result and the input features.
同義判定の結果と入力した素性の例を示す図である。 A diagram showing an example of a synonym determination result and the input features.
同義判定の結果と入力した素性の例を示す図である。 A diagram showing an example of a synonym determination result and the input features.

以下、図面を参照して本発明の実施の形態を詳細に説明する。 Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

＜素性ベクトル構築装置の構成＞ <Configuration of the feature vector construction device>
本発明の実施の形態に係る素性ベクトル構築装置について説明する。図１に示すように、本発明の実施の形態に係る素性ベクトル構築装置１００は、ＣＰＵとＲＡＭと後述する素性ベクトル構築処理ルーチンを実行するためのプログラムや各種データを記憶したＲＯＭと、を含むコンピュータで構成することが出来る。この素性ベクトル構築装置１００は、機能的には図１に示すように入力部１０と、演算部２０と、出力部３０とを備えている。 A feature vector construction device according to an embodiment of the present invention will now be described. As shown in FIG. 1, the feature vector construction device 100 can be implemented on a computer that includes a CPU, a RAM, and a ROM storing various data and a program for executing the feature vector construction processing routine described later. Functionally, as shown in FIG. 1, the feature vector construction device 100 comprises an input unit 10, a calculation unit 20, and an output unit 30.

入力部１０は、キーボードなどの入力装置から自然言語で記載され且つ電子化された複数の文を受け付ける。この複数の文による集合をテキストコーパスと呼ぶ。なお、入力部１０は、ネットワーク等を介して外部から入力されたものを受け付けるようにしてもよい。 The input unit 10 receives a plurality of sentences written in a natural language and digitized from an input device such as a keyboard. This set of sentences is called a text corpus. Note that the input unit 10 may accept input from the outside via a network or the like.

演算部２０は、基本解析部２４と、素性抽出部２６と、素性ベクトル生成部２８とを備えている。 The calculation unit 20 includes a basic analysis unit 24, a feature extraction unit 26, and a feature vector generation unit 28.

基本解析部２４には、入力部１０が受け付けたテキストコーパスが入力される。基本解析部２４は、入力されたテキストコーパスの各文について、形態素解析及び係り受け解析を行い、形態素毎の表記と標準形と品詞、および文節ごとの係り受け情報が少なくとも含まれる解析結果を素性抽出部２６に出力する。図３に「花を植えて、花壇が完成した。」という文に対する基本解析部２４の出力の一例を示す。なお、形態素解析、係り受け解析は既存のものを用いて良い。 The text corpus received by the input unit 10 is input to the basic analysis unit 24. For each sentence of the input text corpus, the basic analysis unit 24 performs morphological analysis and dependency analysis, and outputs to the feature extraction unit 26 an analysis result that includes at least the surface form, base form, and part of speech of each morpheme and the dependency information of each phrase (bunsetsu). FIG. 3 shows an example of the output of the basic analysis unit 24 for the sentence 「花を植えて、花壇が完成した。」 ("Flowers were planted, and the flowerbed was completed."). Existing tools may be used for the morphological analysis and the dependency analysis.

素性抽出部２６は、基本解析部２４から入力される各文の解析結果を用い、各文に含まれる「項-述部」に対して、その「項−述部」の周辺に現れる単語の情報（文脈情報）を項述部素性として抽出して出力する。また、素性抽出部２６は、基本解析部２４から入力される各文の解析結果を用い、各文に含まれる述部に対して、その述部の周辺に現れる単語の情報（文脈情報）を述部素性として抽出して出力する。本実施形態では、例えば上記の非特許文献１と同じ方法で項述部素性や述部素性を抽出する。具体的には、対象の「項−述部」に係っている別の「項−述部」、「述部」を項述部素性として抽出する。さらに、「述部」単体に係っている項（格助詞をもつ名詞句）、及び別の「述部」を述部素性として抽出する。本実施形態での素性抽出部２６の出力の一例を図４に示す。図４に示す通り、「花壇-ガ-完成する」という「項-述部」に対して、「植える」という「述部」および、「花-ヲ-植える」という別の「項-述部」を項述部素性として抽出する。また、「完成する」という述部に対しては、「花壇-ガ」という項と「植える」という述部を述部素性として抽出する。 The feature extraction unit 26 uses the analysis result of each sentence input from the basic analysis unit 24 to extract, for each “term-predicate” contained in the sentence, information on the words that appear around that “term-predicate” (context information) as term predicate features, and outputs them. Likewise, for each predicate contained in the sentence, it extracts information on the words that appear around that predicate (context information) as predicate features and outputs them. In this embodiment, the term predicate features and predicate features are extracted, for example, by the same method as in Non-Patent Document 1. Specifically, other “term-predicates” and “predicates” that depend on the target “term-predicate” are extracted as term predicate features. In addition, the terms (noun phrases with a case particle) and other “predicates” that depend on a “predicate” taken alone are extracted as predicate features. FIG. 4 shows an example of the output of the feature extraction unit 26. As shown there, for the “term-predicate” 「花壇-ガ-完成する」 (flowerbed-ga-complete), the “predicate” 「植える」 (plant) and the other “term-predicate” 「花-ヲ-植える」 (flower-wo-plant) are extracted as term predicate features. For the predicate 「完成する」 (complete), the term 「花壇-ガ」 (flowerbed-ga) and the predicate 「植える」 (plant) are extracted as predicate features.
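
The extraction rule for the example sentence can be sketched in Python as follows. This is only an illustration under stated assumptions: the `units` structure (one dict per predicate, with its case-marked terms and the index of the unit it depends on) and the function name `extract_context_features` are hypothetical simplifications of the analysis result, not the format actually produced by the basic analysis unit 24 or the method of Non-Patent Document 1.

```python
def extract_context_features(units):
    """Collect context features for each term-predicate and each predicate.

    `units` is a hypothetical, simplified stand-in for the analysis result:
    one dict per predicate with its case-marked terms ("args") and the index
    of the unit it depends on ("head", or None for the sentence root).
    """
    term_pred_feats = {}
    pred_feats = {}
    for i, unit in enumerate(units):
        dependents = [d for d in units if d["head"] == i]
        # Predicates and term-predicates that depend on this unit.
        context = []
        for d in dependents:
            context.append(d["pred"])
            context.extend(f"{a}-{d['pred']}" for a in d["args"])
        for arg in unit["args"]:
            key = f"{arg}-{unit['pred']}"
            term_pred_feats.setdefault(key, []).extend(context)
        # For the bare predicate: its own terms plus the dependent predicates.
        pred_feats.setdefault(unit["pred"], []).extend(
            unit["args"] + [d["pred"] for d in dependents]
        )
    return term_pred_feats, pred_feats


# 「花を植えて、花壇が完成した。」: unit 0 depends on unit 1.
units = [
    {"pred": "植える", "args": ["花-ヲ"], "head": 1},
    {"pred": "完成する", "args": ["花壇-ガ"], "head": None},
]
term_pred_feats, pred_feats = extract_context_features(units)
```

With this input, the features collected for 「花壇-ガ-完成する」 are 「植える」 and 「花-ヲ-植える」, and those for 「完成する」 are 「花壇-ガ」 and 「植える」, matching the example described for FIG. 4.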

素性ベクトル生成部２８は、基本解析部２４から入力される各文の解析結果を用いて、各文に含まれる「項−述部」を素性ベクトル構築対象の「項−述部」として抽出する。そして、素性ベクトル生成部２８は、抽出された素性ベクトル構築対象の「項-述部」ごとに素性抽出部２６から入力される項述部素性を用いて算出される値を要素とする素性ベクトルを構築し、出力部３０に出力する。また、素性ベクトル生成部２８は、基本解析部２４から入力される各文の解析結果を用いて、各文に含まれる述部を素性ベクトル構築対象の述部として抽出する。そして、素性ベクトル生成部２８は、抽出された素性ベクトル構築対象の述部ごとに素性抽出部２６から入力される述部素性を用いて算出される値を要素とする素性ベクトルを構築し、出力部３０に出力する。本実施形態では、例えば非特許文献１と同じ方法で素性ベクトルを構築する。具体的には、素性ベクトル構築対象の「項−述部」と各項述部素性の相互情報量（ＭＩ）をもとに算出されるweightの値を要素の値とする素性ベクトルを構築する。また、素性ベクトル構築対象の述部と各述部素性の相互情報量（ＭＩ）をもとに算出されるweightの値を要素の値とする素性ベクトルを構築する。weightは、下記（１）式を用いて算出される。また、相互情報量（ＭＩ）は、下記（２）式を用いて算出される。 The feature vector generation unit 28 uses the analysis result of each sentence input from the basic analysis unit 24 to extract each “term-predicate” contained in the sentence as a “term-predicate” for which a feature vector is to be constructed. For each such “term-predicate”, it constructs a feature vector whose elements are values calculated from the term predicate features input from the feature extraction unit 26, and outputs the vector to the output unit 30. In the same way, it extracts each predicate contained in the sentence as a predicate for which a feature vector is to be constructed, constructs for each such predicate a feature vector whose elements are values calculated from the predicate features input from the feature extraction unit 26, and outputs the vector to the output unit 30. In this embodiment, the feature vectors are constructed, for example, by the same method as in Non-Patent Document 1. Specifically, for a target “term-predicate”, the value of each element is the weight calculated from the mutual information (MI) between the target “term-predicate” and the corresponding term predicate feature; for a target predicate, the value of each element is the weight calculated from the MI between the target predicate and the corresponding predicate feature. The weight is calculated using equation (1) below, and the mutual information (MI) is calculated using equation (2) below.

素性ベクトル構築対象が「項-述部」の場合の素性ベクトルを項述部素性ベクトルと呼ぶ。素性ベクトル構築対象が「項-述部」の場合、ｕは「項-述部」を表し、ｆは項述部素性を表す。Ｐ（ｕ）は素性ベクトル構築対象の「項−述部」がテキストコーパスに出現する確率を、Ｐ（ｆ）は項述部素性がテキストコーパスに出現する確率、Ｐ（ｕ，ｆ）は素性ベクトル構築対象の「項-述部」と項述部素性が同時に現れる確率を表す。ＭＩが０より大きい場合、weightの値は１となる。ＭＩが０以下の場合、weightの値は０となる。図５上の表に構築された項述部素性ベクトルの例を示す。例では、uが「花壇-ガ-完成」、fが「花-ヲ-植える」の場合、ＭＩが０より大きくweightの値が１であることを示している。また、uが「花壇-ガ-出来上がる」、fが「時間-ヲ-かける」の場合、ＭＩが０以下でweightの値が０であることを示している。 A feature vector whose construction target is a “term-predicate” is called a term predicate feature vector. In this case, u denotes the “term-predicate” and f denotes a term predicate feature. P(u) is the probability that the target “term-predicate” appears in the text corpus, P(f) is the probability that the term predicate feature appears in the text corpus, and P(u, f) is the probability that the target “term-predicate” and the term predicate feature appear together. When MI is greater than 0, the weight is 1; when MI is 0 or less, the weight is 0. The upper table of FIG. 5 shows examples of constructed term predicate feature vectors. In the example, when u is 「花壇-ガ-完成」 (flowerbed-ga-completion) and f is 「花-ヲ-植える」 (flower-wo-plant), MI is greater than 0 and the weight is 1; when u is 「花壇-ガ-出来上がる」 (flowerbed-ga-be finished) and f is 「時間-ヲ-かける」 (time-wo-spend), MI is 0 or less and the weight is 0.

素性ベクトル構築対象が述部の場合の素性ベクトルを述部素性ベクトルと呼ぶ。素性ベクトル構築対象が述部の場合、ｕは述部を表し、ｆは述部素性を表す。Ｐ（ｕ）は素性ベクトル構築対象の述部がテキストコーパスに出現する確率を、Ｐ（ｆ）は述部素性がテキストコーパスに出現する確率、Ｐ（ｕ，ｆ）は素性ベクトル構築対象の述部と述部素性が同時に現れる確率を表す。ＭＩが０より大きい場合、weightの値は１となる。ＭＩが０以下の場合、weightの値は０となる。図５下の表に構築された述部素性ベクトルの例を示す。例では、uが「完成」、fが「花壇-ガ」の場合、ＭＩが０より大きくweightの値が１であることを示している。また、uが「出来上がる」、fが「家-ガ」の場合、ＭＩが０以下でweightの値が０であることを示している。 A feature vector whose construction target is a predicate is called a predicate feature vector. In this case, u denotes the predicate and f denotes a predicate feature. P(u) is the probability that the target predicate appears in the text corpus, P(f) is the probability that the predicate feature appears in the text corpus, and P(u, f) is the probability that the target predicate and the predicate feature appear together. When MI is greater than 0, the weight is 1; when MI is 0 or less, the weight is 0. The lower table of FIG. 5 shows examples of constructed predicate feature vectors. In the example, when u is 「完成」 (completion) and f is 「花壇-ガ」 (flowerbed-ga), MI is greater than 0 and the weight is 1; when u is 「出来上がる」 (be finished) and f is 「家-ガ」 (house-ga), MI is 0 or less and the weight is 0.
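
The binary weighting described above can be illustrated with a short Python sketch. Equations (1) and (2) are not reproduced in this text, so the sketch assumes the usual pointwise form MI(u, f) = log(P(u, f) / (P(u) P(f))) and estimates the probabilities as relative frequencies over a list of observed (u, f) co-occurrences. Both the estimation scheme and the function name `binary_mi_weights` are assumptions made for illustration.

```python
import math
from collections import Counter


def binary_mi_weights(pairs):
    """Return {(u, f): 0 or 1} for every observed (u, f) co-occurrence.

    Assumption: P(u), P(f) and P(u, f) are relative frequencies over `pairs`,
    and MI is the pointwise mutual information log(P(u,f) / (P(u) * P(f))).
    """
    n = len(pairs)
    u_count = Counter(u for u, _ in pairs)
    f_count = Counter(f for _, f in pairs)
    uf_count = Counter(pairs)
    weights = {}
    for (u, f), c in uf_count.items():
        p_uf = c / n
        p_u = u_count[u] / n
        p_f = f_count[f] / n
        mi = math.log(p_uf / (p_u * p_f))
        # weight is 1 when MI > 0 and 0 when MI <= 0, as stated in the text.
        weights[(u, f)] = 1 if mi > 0 else 0
    return weights


# Toy observations with placeholder symbols (not data from the patent).
observations = [("a", "x"), ("a", "x"), ("b", "y"), ("a", "y")]
weights = binary_mi_weights(observations)
```

Pairs that never co-occur do not appear in the result; treating them as weight 0 is consistent with the rule, since their MI would not be greater than 0.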

このように、本実施形態では、入力されたテキストコーパスに含まれる各述部を素性ベクトルの構築対象とした述部素性ベクトルと、入力されたテキストコーパスに含まれる各「項-述部」を素性ベクトルの構築対象とした項述部素性ベクトルの２種類（以下、両者を合わせて「素性ベクトル」とする。）を作成する。 As described above, this embodiment creates two types of feature vectors (hereinafter collectively referred to as “feature vectors”): predicate feature vectors, constructed for each predicate contained in the input text corpus, and term predicate feature vectors, constructed for each “term-predicate” contained in the input text corpus.
本実施形態では、述部素性ベクトルと項述部素性ベクトルの両方を作成したが、後述する同義学習装置及び同義判定装置で使用される素性ベクトルのみを作成すれば良い。 Although both predicate feature vectors and term predicate feature vectors are created in this embodiment, it suffices to create only the feature vectors that are used by the synonym learning device and the synonym determination device described later.

＜同義学習装置の構成＞ <Configuration of the synonym learning device>
次に、本発明の実施の形態に係る同義学習装置の構成について説明する。図６に示すように、本発明の実施の形態に係る同義学習装置２００は、ＣＰＵとＲＡＭと後述する判定モデル学習処理ルーチンを実行するためのプログラムや各種データを記憶したＲＯＭと、を含むコンピュータで構成することができる。この同義学習装置２００は、機能的には図６に示すように入力部１１０と、演算部１２０と、出力部１５０とを備えている。 Next, the configuration of the synonym learning device according to the embodiment of the present invention will be described. As shown in FIG. 6, the synonym learning device 200 can be implemented on a computer that includes a CPU, a RAM, and a ROM storing various data and a program for executing the determination model learning processing routine described later. Functionally, as shown in FIG. 6, the synonym learning device 200 comprises an input unit 110, a calculation unit 120, and an output unit 150.

入力部１１０は、キーボードなどの入力装置から同義か否かの情報が付与された複数の述部ペア及び複数の「項−述部」ペアを受け付ける。この、同義か否かの情報が付与された複数の述部ペア及び複数の「項−述部」ペアを正解コーパスと呼ぶ。なお、入力部１１０は、ネットワーク等を介して外部から入力されたものを受け付けるようにしてもよい。 The input unit 110 receives, from an input device such as a keyboard, a plurality of predicate pairs and a plurality of “term-predicate” pairs, each annotated with information indicating whether or not the pair is synonymous. This collection of annotated predicate pairs and “term-predicate” pairs is called the correct corpus. The input unit 110 may also accept input from outside via a network or the like.

演算部１２０は、素性ベクトル記憶部１２４と、分布類似度計算部１３２と、定義文辞書記憶部１３４と、辞書定義文素性抽出部１３６と、意味属性辞書記憶部１３８と、機能表現辞書記憶部１３９と、意味属性素性抽出部１４０と、機能表現素性抽出部１４１と、素性集合記憶部１４２と、同義判定モデル学習部１４４と、を含んだ構成で表すことができる。 The calculation unit 120 includes a feature vector storage unit 124, a distribution similarity calculation unit 132, a definition sentence dictionary storage unit 134, a dictionary definition sentence feature extraction unit 136, a semantic attribute dictionary storage unit 138, and a function expression dictionary storage unit. 139, a semantic attribute feature extraction unit 140, a function expression feature extraction unit 141, a feature set storage unit 142, and a synonym determination model learning unit 144.

正解コーパス１２２（図１１に記載）は、入力部１１０において受け付けた、図７に示すような人手であらかじめ意味関係情報「同義」が付された述部ペア及び「項−述部」ペアである正解データの集合（正例）と、意味関係情報「反義」又は「その他」が付された述部ペア及び「項−述部」ペアである正解データの集合（負例）である。 The correct corpus 122 (shown in FIG. 11) is received by the input unit 110 and consists of two sets of correct-answer data: positive examples, which are predicate pairs and “term-predicate” pairs that have been manually labeled in advance with the semantic relation “synonymous” as illustrated in FIG. 7, and negative examples, which are predicate pairs and “term-predicate” pairs labeled with the semantic relation “antonymous” or “other”.

素性ベクトル記憶部１２４は、素性ベクトル構築装置１００により出力された素性ベクトルを記憶している。 The feature vector storage unit 124 stores the feature vector output by the feature vector construction device 100.

分布類似度計算部１３２は、入力された正解コーパス１２２のすべての述部ペア又は「項−述部」ペア各々に対して、素性ベクトル記憶部１２４から得られる、対応する素性ベクトルを用いて、分布類似度を計算し、計算結果をその計算対象の述部ペア又は「項-述部」ペアとともに素性集合記憶部１４２に出力する。述部素性ベクトルを用いて計算される分布類似度を述部分布類似度と呼び、項述部素性ベクトルを用いて計算される分布類似度を項述部分布類似度と呼ぶ。述部ペアに対して分布類似度を計算する場合は、述部に関しての情報のみを使うため、述部分布類似度のみが算出されるが、「項-述部」ペアに対して分布類似度を計算する場合は、「項-述部」に関しての情報のみならず、述部に関する情報も使うことができるため、項述部分布類似度と述部分布類似度の両方を算出することができる。分布類似度の計算は、素性ベクトル構築装置１００によってテキストコーパスから得られた述部素性ベクトル及び項述部素性ベクトルの少なくとも一方を用いて分布類似度を計算する処理である。また、本実施形態では、非特許文献１と同じ方法で分布類似度を計算する。具体的には、下記（３）〜（５）式を用いて分布類似度を計算し、式（３）のｍｅａｓｕｒｅが分布類似度である。 For every predicate pair and every “term-predicate” pair in the input correct corpus 122, the distribution similarity calculation unit 132 calculates a distribution similarity using the corresponding feature vectors obtained from the feature vector storage unit 124, and outputs the result, together with the pair it was calculated for, to the feature set storage unit 142. A distribution similarity calculated from predicate feature vectors is called a predicate distribution similarity, and one calculated from term predicate feature vectors is called a term predicate distribution similarity. For a predicate pair, only information about the predicates is available, so only the predicate distribution similarity is calculated. For a “term-predicate” pair, information about the predicates can be used in addition to information about the “term-predicates”, so both the term predicate distribution similarity and the predicate distribution similarity can be calculated. The calculation uses at least one of the predicate feature vectors and the term predicate feature vectors obtained from the text corpus by the feature vector construction device 100. In this embodiment, the distribution similarity is calculated by the same method as in Non-Patent Document 1, using equations (3) to (5) below; the value of measure in equation (3) is the distribution similarity.

ただし、上記（４）式の、ＪＡＣＣＡＲＤ係数の分子は、項述部素性ベクトルを用いて分布類似度の算出を行う場合、２つの項述部素性ベクトルを要素毎に比較したときに、一方の項述部素性ベクトルの要素の値が１であり、且つ他方の項述部素性ベクトルの要素の値も１である要素の個数である。また、ＪＡＣＣＡＲＤ係数の分母は、分布類似度の算出に用いるペアの２つの項述部素性ベクトルを要素毎に比較したときに、一方の項述部素性ベクトルの要素および他方の項述部素性ベクトルの要素の少なくとも一方の要素の値が１である要素の個数である。 When the distribution similarity is calculated from term predicate feature vectors, the numerator of the JACCARD coefficient in equation (4) is the number of elements whose value is 1 in both vectors when the two term predicate feature vectors are compared element by element. The denominator of the JACCARD coefficient is the number of elements whose value is 1 in at least one of the two term predicate feature vectors of the pair.

また、上記（５）式の、ＳＩＭＰＳＯＮ係数の分子は、ＪＡＣＣＡＲＤ係数の分子と同様であり、ＳＩＭＰＳＯＮ係数の分母は、算出に用いるペアの項述部素性ベクトルにおいて要素の値が１である要素の個数と、他方の項述部素性ベクトルにおいて要素の値が１である要素の個数のうち、少ないほうの個数である。 The numerator of the SIMPSON coefficient in equation (5) is the same as that of the JACCARD coefficient. Its denominator is the smaller of two counts: the number of elements whose value is 1 in one term predicate feature vector of the pair, and the number of elements whose value is 1 in the other.

ただし、上記（４）式の、ＪＡＣＣＡＲＤ係数の分子は、述部素性ベクトルを用いて分布類似度の算出を行う場合、２つの述部素性ベクトルを要素毎に比較したときに、一方の述部素性ベクトルの要素の値が１であり、且つ他方の述部素性ベクトルの要素の値も１である要素の個数である。また、ＪＡＣＣＡＲＤ係数の分母は、分布類似度の算出に用いるペアの２つの述部素性ベクトルを要素毎に比較したときに、一方の述部素性ベクトルの要素および他方の述部素性ベクトルの要素の少なくとも一方の要素の値が１である要素の個数である。 When the distribution similarity is calculated from predicate feature vectors, the numerator of the JACCARD coefficient in equation (4) is the number of elements whose value is 1 in both vectors when the two predicate feature vectors are compared element by element. The denominator of the JACCARD coefficient is the number of elements whose value is 1 in at least one of the two predicate feature vectors of the pair.

また、上記（５）式の、ＳＩＭＰＳＯＮ係数の分子は、ＪＡＣＣＡＲＤ係数の分子と同様であり、ＳＩＭＰＳＯＮ係数の分母は、算出に用いるペアの述部素性ベクトルにおいて要素の値が１である要素の個数と、他方の述部素性ベクトルにおいて要素の値が１である要素の個数のうち、少ないほうの個数である。 The numerator of the SIMPSON coefficient in equation (5) is the same as that of the JACCARD coefficient. Its denominator is the smaller of two counts: the number of elements whose value is 1 in one predicate feature vector of the pair, and the number of elements whose value is 1 in the other.
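
The two coefficients described above can be computed on binary feature vectors as in the following Python sketch. It covers only the JACCARD and SIMPSON coefficients as worded in the text; how equation (3) combines them into measure is not reproduced here, so no combination is attempted. Returning 0.0 when a denominator is zero is an assumption of this sketch, and the example vectors are placeholders.

```python
def jaccard(v1, v2):
    """Elements that are 1 in both vectors / elements that are 1 in either."""
    both = sum(1 for a, b in zip(v1, v2) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(v1, v2) if a == 1 or b == 1)
    return both / either if either else 0.0  # zero-case handling is assumed


def simpson(v1, v2):
    """Elements that are 1 in both vectors / the smaller count of 1s."""
    both = sum(1 for a, b in zip(v1, v2) if a == 1 and b == 1)
    smaller = min(sum(v1), sum(v2))
    return both / smaller if smaller else 0.0  # zero-case handling is assumed


# Placeholder binary feature vectors over the same five features.
vec_a = [1, 1, 0, 1, 0]
vec_b = [1, 0, 0, 1, 1]
j = jaccard(vec_a, vec_b)   # 2 shared / 4 in either
s = simpson(vec_a, vec_b)   # 2 shared / min(3, 3)
```

The same two functions apply unchanged to term predicate feature vectors and to predicate feature vectors, since the text defines both cases identically.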

なお、本実施形態では述部分布類似度と項述部分布類似度の両方を用いる。図８に「花壇-ガ-完成する」と「花壇-ガ-出来上がる」の分布類似度の例を示す。図８の表の上段が項述部分布類似度の例であり、下段が述部分布類似度の例である。述部分布類似度と項述部分布類似度あわせて分布類似度と呼ぶ。また、分布類似度が第５の素性の一例である。 In this embodiment, both the predicate distribution similarity and the term predicate distribution similarity are used. FIG. 8 shows an example of the distribution similarities of 「花壇-ガ-完成する」 (flowerbed-ga-complete) and 「花壇-ガ-出来上がる」 (flowerbed-ga-be finished). The upper row of the table in FIG. 8 is an example of the term predicate distribution similarity, and the lower row is an example of the predicate distribution similarity. The two are collectively called the distribution similarity, which is an example of the fifth feature.

＜辞書定義文素性抽出部の構成＞ <Configuration of the dictionary definition sentence feature extraction unit>
辞書定義文素性抽出部１３６は、入力された正解コーパスのすべての述部ペアの内容語又はすべての「項−述部」ペアの内容語の各々に関して、定義文辞書記憶部１３４に記憶されている定義文辞書に基づいて「定義文相互補完性」を示す素性と「語彙の重なり」を示す素性を抽出し、抽出対象のペアとともに素性集合記憶部１４２に出力する。定義文辞書は、複数の述部の内容語の各々に対応する1つ以上の定義文からなり、定義文辞書から抽出される各々の１つ以上の定義文のセットを定義文セットと呼ぶ。なお、定義文相互補完性を示す素性が第１の素性の一例であり、語彙の重なりを示す素性が第３の素性の一例である。また、第１の素性と第３の素性をあわせて辞書定義文素性と呼ぶ。 For the content words of every predicate pair, or of every “term-predicate” pair, in the input correct corpus, the dictionary definition sentence feature extraction unit 136 extracts a feature indicating “definition sentence mutual complementarity” and a feature indicating “vocabulary overlap” based on the definition sentence dictionary stored in the definition sentence dictionary storage unit 134, and outputs them, together with the pair they were extracted for, to the feature set storage unit 142. The definition sentence dictionary holds one or more definition sentences for each content word of a plurality of predicates, and the one or more definition sentences extracted from the dictionary for a given content word are called a definition sentence set. The feature indicating definition sentence mutual complementarity is an example of the first feature, and the feature indicating vocabulary overlap is an example of the third feature. The first and third features are collectively called dictionary definition sentence features.

「同義の述部はその語義を説明する相互の定義文セットに類似性がある」という特徴から、辞書定義文素性抽出部１３６で、「定義文相互補完性」と「語彙の重なり」を示す素性を抽出することによって、従来手法で問題であった「まったく逆のことを表す述部を誤って同義と判定する」という問題が起きるのを回避することができる。 Synonymous predicates have the property that the definition sentence sets explaining their meanings resemble each other. Because the dictionary definition sentence feature extraction unit 136 exploits this property by extracting features indicating “definition sentence mutual complementarity” and “vocabulary overlap”, it can avoid a problem of the conventional method, namely that predicates expressing exactly opposite meanings are mistakenly judged to be synonymous.

ここで、「定義文相互補完性」とは、相手の述部の定義文セット内に自分の述部が出現することをいい、図９に示す「完成する」と「出来上がる」の２つの同義である述部を例にとると、「完成」という述部が、同義である「出来上がる」の辞書定義文セット内に現れており、また「出来上がる」という述部が、同義である「完成」の辞書定義文セット内に出現していることをいう。 Here, “definition sentence mutual complementarity” means that a predicate appears in the definition sentence set of the other predicate of the pair. Taking the two synonymous predicates 「完成する」 (complete) and 「出来上がる」 (be finished) shown in FIG. 9 as an example, the predicate 「完成」 appears in the dictionary definition sentence set of its synonym 「出来上がる」, and the predicate 「出来上がる」 appears in the dictionary definition sentence set of its synonym 「完成」.

また、「語彙の重なり」とは、定義文セット同士で語彙が重なっていることをいい、図１０に示す「値段−ガ−高値」と「値段−ガ−高い」の２つの同義である述部を例にとると、双方の定義文セットに「値段」という語彙が共通に出現していることをいう。 “Vocabulary overlap” means that the two definition sentence sets share vocabulary. Taking the two synonymous predicates 「値段−ガ−高値」 (price-ga-high price) and 「値段−ガ−高い」 (price-ga-expensive) shown in FIG. 10 as an example, the word 「値段」 (price) appears in both definition sentence sets.
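
As a rough illustration of the vocabulary-overlap idea, the number of shared words can be counted from already tokenized definition sentence sets, as in the Python sketch below. The function name `vocabulary_overlap`, the choice to count distinct words, and the token lists are assumptions made for illustration; the token lists are invented placeholders, not the dictionary entries shown in FIG. 10.

```python
def vocabulary_overlap(def_tokens_1, def_tokens_2):
    """Count the distinct words that occur in both definition sentence sets.

    Assumption: each argument is a flat list of base-form words obtained by
    morphological analysis of one predicate's definition sentence set.
    """
    return len(set(def_tokens_1) & set(def_tokens_2))


# Invented placeholder tokens; only 「値段」 is shared between the two lists.
tokens_takane = ["値段", "高い", "こと"]
tokens_takai = ["値段", "多い", "かかる"]
overlap = vocabulary_overlap(tokens_takane, tokens_takai)
```

Whether the count should include function words, or be normalized by the size of the definition sentence sets, is a design choice this sketch leaves open.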

このように、「同義の述部はその語義を説明する相互の定義文セットに類似性がある」という特徴があり、この特徴を「定義文相互補完性」もしくは「語彙の重なり」という形で表現することができる。なお、以後は説明のため、「花壇−ガ−出来上がる」と「花壇−ガ−完成する」のような「項−述部」ペアに対して、最初の「項−述部」の述部（すなわち、「出来上がる」）をＰｒｅｄ１、「項」（すなわち、「花壇」）をＡｒｇ１、２つ目の「項−述部」の述部（すなわち、「完成する」）をＰｒｅｄ２、項（すなわち、「花壇」）をＡｒｇ２とする。同様に、「出来上がる」と「完成する」のような述部ペアに対しても、最初の述部をＰｒｅｄ１、２つ目の述部をＰｒｅｄ２とする。 Thus, synonymous predicates have the property that the definition sentence sets explaining their meanings resemble each other, and this property can be expressed as “definition sentence mutual complementarity” or “vocabulary overlap”. In the following, for a “term-predicate” pair such as 「花壇−ガ−出来上がる」 (flowerbed-ga-be finished) and 「花壇−ガ−完成する」 (flowerbed-ga-complete), the predicate of the first “term-predicate” (here 「出来上がる」) is denoted Pred1 and its term (here 「花壇」) is denoted Arg1, while the predicate of the second “term-predicate” (here 「完成する」) is denoted Pred2 and its term (here 「花壇」) is denoted Arg2. Likewise, for a predicate pair such as 「出来上がる」 and 「完成する」, the first predicate is denoted Pred1 and the second is denoted Pred2.
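
Using the Pred1/Pred2/Arg1/Arg2 notation just introduced, definition sentence mutual complementarity can be sketched as simple string matching in Python. The feature names Pred1Match, Pred2Match, Arg1Match and Arg2Match are the ones the description uses for the definition sentence mutual complementarity extraction unit 1362. The function name, its signature, and the definition strings are hypothetical; the definitions are invented for illustration and are not the entries shown in FIG. 9.

```python
def complementarity_features(pred1, pred2, defs1, defs2, arg1=None, arg2=None):
    """Binary features: does each word occur in the other's definition set?

    defs1 / defs2 are the definition sentence sets (lists of strings) of
    Pred1 and Pred2. Matching is plain substring matching (an assumption
    about how the string match is carried out).
    """
    feats = {
        # Pred2 appears in the definition sentence set of Pred1.
        "Pred1Match": int(any(pred2 in d for d in defs1)),
        # Pred1 appears in the definition sentence set of Pred2.
        "Pred2Match": int(any(pred1 in d for d in defs2)),
    }
    if arg1 is not None and arg2 is not None:
        # Arg2 appears in Pred1's set; Arg1 appears in Pred2's set.
        feats["Arg1Match"] = int(any(arg2 in d for d in defs1))
        feats["Arg2Match"] = int(any(arg1 in d for d in defs2))
    return feats


# Invented definition sentences, for illustration only.
defs_dekiagaru = ["すっかりできる。完成する。"]
defs_kansei = ["すっかり出来上がること。"]
feats = complementarity_features(
    "出来上がる", "完成", defs_dekiagaru, defs_kansei, arg1="花壇", arg2="花壇"
)
```

For a plain predicate pair the two Arg features are simply omitted by leaving `arg1` and `arg2` unset.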

辞書定義文素性抽出部１３６の詳細構成を図１１に示す。辞書定義文素性抽出部１３６は、定義文抽出部１３６０と、定義文相互補完性抽出部１３６２と、語彙の重なり抽出部１３６４とから構成される。 The detailed configuration of the dictionary definition sentence feature extraction unit 136 is shown in FIG. The dictionary definition sentence feature extraction unit 136 includes a definition sentence extraction unit 1360, a definition sentence mutual complementarity extraction unit 1362, and a vocabulary overlap extraction unit 1364.

定義文抽出部１３６０は、入力された正解コーパスのすべての述部ペア又は「項−述部」ペアの各々の述部の内容語に対して、定義文辞書記憶部１３４に記憶されている定義文辞書の辞書引きを行い、述部ペア又は「項−述部」ペアごとにそれぞれの述部の定義文セットを抽出する。そして、抽出した定義文セットの形態素解析を行い、形態素毎の表記と標準形と品詞、および読みが少なくとも含まれる解析結果を定義文相互補完性抽出部１３６２及び語彙の重なり抽出部１３６４に出力する。 The definition sentence extraction unit 1360 stores the definition stored in the definition sentence dictionary storage unit 134 for all the predicate pairs of the input correct corpus or the content words of each predicate of the “term-predicate” pair. The dictionary lookup of the sentence dictionary is performed, and the definition sentence set of each predicate is extracted for each predicate pair or “term-predicate” pair. The extracted definition sentence set is subjected to morphological analysis, and an analysis result including at least the notation, the standard form, the part of speech, and the reading for each morpheme is output to the definition sentence mutual complementarity extraction unit 1362 and the vocabulary overlap extraction unit 1364. .

定義文相互補完性抽出部１３６２は、抽出対象のペア（述部ペア又は「項−述部」ペア）ごとに、定義文抽出部１３６０から入力された定義文セットの形態素解析の結果から、定義文相互補完性を示す素性を抽出し、抽出対象のペアとともに素性集合記憶部１４２に出力する。具体的には、Ｐｒｅｄ１の定義文セット内にＰｒｅｄ２が出現したか、また、Ｐｒｅｄ２の定義文セット内にＰｒｅｄ１が出現したかを文字列マッチで抽出する。Ｐｒｅｄ１の定義文セット内にＰｒｅｄ２が出現したかどうかをPred1Match、また、Ｐｒｅｄ２の定義文セット内にＰｒｅｄ１が出現したかどうかをPred2Matchとする。本実施形態では、出現した場合には、素性の値を１とする。同様に、Ｐｒｅｄ１の定義文セット内に、Ａｒｇ２が出現したか、またＰｒｅｄ２の定義文セット内にＡｒｇ１が出現したかを抽出し、出現した場合には、素性の値を１とする。Ｐｒｅｄ１の定義文セット内に、Ａｒｇ２が出現したかどうかをArg1Match、また、Ｐｒｅｄ２の定義文セット内にＡｒｇ１が出現したかどうかをArg2Matchとする。なお、これらの値は、重なり回数や重なり回数を定義文セットの総単語数で正規化した値など実数値を入れてもよい。 For each extraction target pair (predicate pair or “term-predicate” pair), the definition sentence mutual complementarity extraction unit 1362 extracts features indicating definition sentence mutual complementarity from the morphological analysis result of the definition sentence sets input from the definition sentence extraction unit 1360, and outputs them to the feature set storage unit 142 together with the pair. Specifically, it determines by character string matching whether Pred2 appears in the definition sentence set of Pred1 and whether Pred1 appears in the definition sentence set of Pred2. Whether Pred2 appears in the definition sentence set of Pred1 is denoted Pred1Match, and whether Pred1 appears in the definition sentence set of Pred2 is denoted Pred2Match. In the present embodiment, the feature value is set to 1 when the word appears. Similarly, it determines whether Arg2 appears in the definition sentence set of Pred1 and whether Arg1 appears in the definition sentence set of Pred2, and sets the feature value to 1 when the word appears. Whether Arg2 appears in the definition sentence set of Pred1 is denoted Arg1Match, and whether Arg1 appears in the definition sentence set of Pred2 is denoted Arg2Match. These features may instead take real values, such as the number of matches, or the number of matches normalized by the total number of words in the definition sentence set.

本実施形態では、第１の素性として、Pred1Match、Pred2Match、Arg1Match、Arg2Matchの全てを使っているが、第１の素性を使う場合において、Arg1MatchとArg2Matchは使わなくてもよい。 In the present embodiment, Pred1Match, Pred2Match, Arg1Match, and Arg2Match are all used as the first feature; however, Arg1Match and Arg2Match may be omitted from the first feature.
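
The string-matching step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `complementarity_features` and the sample definition sentences are hypothetical, and the definition sentence sets are assumed to have been looked up already and passed in as lists of strings.

```python
def complementarity_features(pred1, pred2, defs1, defs2, arg1=None, arg2=None):
    """Binary definition sentence mutual complementarity features.

    defs1 / defs2 are the definition sentence sets (lists of strings) of
    pred1 / pred2. Matching is plain substring matching, as in the text.
    arg1 / arg2 may be None for a bare predicate pair.
    """
    def appears(word, definitions):
        # 1 if the word occurs in any definition sentence, else 0.
        return int(word is not None and any(word in d for d in definitions))

    return {
        "Pred1Match": appears(pred2, defs1),  # Pred2 inside Pred1's definitions
        "Pred2Match": appears(pred1, defs2),  # Pred1 inside Pred2's definitions
        "Arg1Match": appears(arg2, defs1),    # Arg2 inside Pred1's definitions
        "Arg2Match": appears(arg1, defs2),    # Arg1 inside Pred2's definitions
    }


# Made-up definition sentences, for illustration only.
example = complementarity_features(
    pred1="出来上がる",
    pred2="完成",
    defs1=["すっかりできる。完成する。"],
    defs2=["すっかり出来上がること。"],
    arg1="花壇",
    arg2="花壇",
)
```

Replacing `int(...)` with a match count, optionally divided by the number of words in the definition set, would give the real-valued variant the text mentions.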

語彙の重なり抽出部１３６４は、抽出対象のペア（述部ペア又は「項−述部」ペア）ごとに、定義文抽出部１３６０から入力した定義文セットの形態素解析の結果から、抽出対象のペアの述部の定義文セット同士に語彙の重なりがあるかを示す素性を抽出し、抽出対象のペアとともに素性集合記憶部１４２に出力する。本実施形態では、両方の定義文セットに共通して出現する語彙の個数を素性とし、語彙の品詞（非自立性を除く名詞、動詞、形容詞、形容動詞と４種類の品詞）毎に集計する。ここで、両方の定義文セットに共通して出現する名詞の品詞をもつ語彙の個数をNounMatch、動詞の品詞をもつ語彙の個数をVerbMatch、形容詞の品詞をもつ語彙の個数をAdjMatch、及び形容動詞の品詞をもつ語彙の個数をAdjNMatchと呼ぶ（図１２）。本実施形態では、第３の素性として、NounMatch、VerbMatch、AdjMatch、AdjNMatchの全てを使っているが、少なくとも１つの素性があれば良い。なお、両方の定義文セットに共通して出現する語彙の個数ではなく、両方の定義文セットに共通して出現する語彙の有無を素性とし、語彙の品詞毎に集計してもよく、有りの場合に値を１にして無の場合に値を０にしてもよい。さらに、両方の定義文セットに共通して出現する語彙の個数を定義文セットの総単語数で正規化した値など用いてもよい。図１３に作成された素性の一覧の例を示す。 For each extraction target pair (predicate pair or “term-predicate” pair), the vocabulary overlap extraction unit 1364 extracts, from the morphological analysis result of the definition sentence sets input from the definition sentence extraction unit 1360, features indicating whether the definition sentence sets of the pair's predicates share vocabulary, and outputs them to the feature set storage unit 142 together with the pair. In the present embodiment, the feature is the number of words that appear in both definition sentence sets, counted separately for each of four parts of speech: nouns (excluding non-independent nouns), verbs, adjectives, and adjectival nouns (形容動詞). The number of shared words that are nouns is called NounMatch, the number that are verbs VerbMatch, the number that are adjectives AdjMatch, and the number that are adjectival nouns AdjNMatch (FIG. 12). In the present embodiment, NounMatch, VerbMatch, AdjMatch, and AdjNMatch are all used as the third feature, but at least one of them is sufficient. Instead of the number of shared words, the presence or absence of shared words may be used as the feature for each part of speech, with the value 1 when shared words exist and 0 otherwise. A value obtained by normalizing the number of shared words by the total number of words in the definition sentence sets may also be used. FIG. 13 shows an example of the list of created features.
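
The per-part-of-speech counting can be sketched as below. It is only an illustration under stated assumptions: the morphological analysis is assumed to have been run already and to yield `(lemma, pos)` pairs, the part-of-speech tag strings (`"noun"`, `"verb"`, and so on) are hypothetical, and non-independent nouns are assumed to have been filtered out upstream.

```python
# Hypothetical tag names mapped to the feature names used in the text.
POS_FEATURES = {
    "noun": "NounMatch",
    "verb": "VerbMatch",
    "adjective": "AdjMatch",
    "adjectival_noun": "AdjNMatch",
}


def vocabulary_overlap(morphs1, morphs2):
    """Count words shared by two definition sentence sets, per part of speech.

    morphs1 / morphs2 are iterables of (lemma, pos) pairs, one per morpheme.
    Each distinct (lemma, pos) pair is counted once, however often it occurs.
    """
    shared = set(morphs1) & set(morphs2)
    counts = {name: 0 for name in POS_FEATURES.values()}
    for _lemma, pos in shared:
        if pos in POS_FEATURES:
            counts[POS_FEATURES[pos]] += 1
    return counts


# Toy analysis results, for illustration only.
overlap = vocabulary_overlap(
    [("値段", "noun"), ("高い", "adjective"), ("こと", "noun")],
    [("値段", "noun"), ("高い", "adjective"), ("ある", "verb")],
)
```

The presence/absence variant mentioned in the text corresponds to replacing each count `c` with `int(c > 0)`.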

定義文辞書記憶部１３４は、複数の述部の各々に対応する定義文セットを格納した定義文辞書を記憶している。定義文辞書は、既存の国語辞書や、複数のユーザによって加筆・編集されたＷｅｂ上のフリー辞書を用いても良い。なお、定義文辞書が定義文集合の一例である。 The definition sentence dictionary storage unit 134 stores a definition sentence dictionary that stores a definition sentence set corresponding to each of the plurality of predicates. As the definition sentence dictionary, an existing national language dictionary or a free dictionary on the Web that has been added and edited by a plurality of users may be used. The definition sentence dictionary is an example of a definition sentence set.

＜意味属性素性抽出部の構成＞ <Configuration of semantic attribute feature extraction unit>
意味属性素性抽出部１４０は、入力された正解コーパスのすべてのペア（述部ペア又は「項−述部」ペア）の各々に関して、当該ペアの述部の内容語の各々の抽象的な意味属性の重なりを示す素性を抽出し、抽出対象のペアとともに素性集合記憶部１４２に出力する。本実施形態では、抽象的な意味属性の重なりを示す素性として、後述する「重なり用言属性」と「意味属性重み付き重なり率」の二つを抽出する。本実施形態においては、抽出対象の述部の抽象的な意味属性として用言属性を用いる。意味属性辞書は、複数の述部の各々に対応する1つ以上の用言属性からなり、意味属性辞書から抽出される各々の１つ以上の用言属性のセットを用言属性集合と呼ぶ。両方の述語の用言属性集合に共通して出現する用言属性を「重なり用言属性」の素性として抽出する。また、その両方に共通して出現する用言属性が属する階層に重みを付与して算出する「意味属性重み付き重なり率」も素性として抽出することができる。意味属性素性抽出部１４０は、これらの二つ素性を抽出対象のペアとともに素性集合記憶部１４２に出力する。 For each pair in the input correct corpus (predicate pair or “term-predicate” pair), the semantic attribute feature extraction unit 140 extracts features indicating the overlap of the abstract semantic attributes of the content words of the pair's predicates, and outputs them to the feature set storage unit 142 together with the pair. In the present embodiment, two such features are extracted: the “overlapping verbal attributes” and the “semantic attribute weighted overlap rate”, both described later. Verbal attributes (用言属性) are used as the abstract semantic attributes of the predicates. The semantic attribute dictionary associates each of a plurality of predicates with one or more verbal attributes, and the set of verbal attributes extracted from the dictionary for a predicate is called its verbal attribute set. The verbal attributes that appear in the verbal attribute sets of both predicates are extracted as the “overlapping verbal attributes” feature. The “semantic attribute weighted overlap rate”, which is calculated by weighting each shared verbal attribute according to the hierarchy level it belongs to, can also be extracted as a feature. The semantic attribute feature extraction unit 140 outputs these two features to the feature set storage unit 142 together with the pair.

「同義の述部同士は、その述部の抽象的な意味属性も似ている」という特徴から、意味属性素性抽出部１４０で、述部そのものの抽象的な意味属性の重なりと階層的重なりの「深さ」を考慮し素性として抽出することによって、従来手法で問題であった「時間経過を表す述部を誤って同義と判定する」という問題が起きるのを回避することができる。 Based on the characteristic that synonymous predicates also have similar abstract semantic attributes, the semantic attribute feature extraction unit 140 extracts as features the overlap of the abstract semantic attributes of the predicates themselves, taking into account the “depth” of the hierarchical overlap. This avoids the problem of the conventional method in which predicates expressing the passage of time are erroneously determined to be synonymous.

本実施形態において、抽象的な意味属性として、「用言属性」（非特許文献２：池原悟, 宮崎正弘, 白井諭, 横尾昭男, 中岩浩巳, 小倉健太郎, 大山芳史, 林良彦 (1999) 日本語語彙大系 CD-ROM版. 岩波書店.）を用いる。用言属性集合の一例を、述部が「完成する」の場合と「出来上がる」の場合を例に図１４に示す。図１５が示すように、これらの属性はしばし、階層的な構造をもち、階層が下位に進むにつれ、より属性が詳細化される。たとえば、「行動」という上位属性に対して、さらに「物理的行動」という中間属性を経て、「所有的移動」というようなより詳細な属性が明記されている。 In the present embodiment, “verbal attributes” (用言属性) are used as the abstract semantic attributes (Non-Patent Document 2: Satoru Ikehara, Masahiro Miyazaki, Satoshi Shirai, Akio Yokoo, Hiromi Nakaiwa, Kentaro Ogura, Yoshifumi Ooyama, Yoshihiko Hayashi (1999). Nihongo Goi-Taikei (日本語語彙大系), CD-ROM edition. Iwanami Shoten). FIG. 14 shows examples of verbal attribute sets for the predicates “完成する” and “出来上がる”. As FIG. 15 shows, these attributes often form a hierarchical structure and become more specific at lower levels of the hierarchy. For example, under the upper-level attribute “行動” (action), and via the intermediate attribute “物理的行動” (physical action), a more detailed attribute such as “所有的移動” (possessive transfer) is specified.

意味属性素性抽出部１４０の詳細構成を図１６に示す。意味属性素性抽出部１４０は、意味属性重なり抽出部１４００と、意味属性重み付き重なり率計算部１４０２とから構成される。 A detailed configuration of the semantic attribute feature extraction unit 140 is shown in FIG. The semantic attribute feature extraction unit 140 includes a semantic attribute overlap extraction unit 1400 and a semantic attribute weighted overlap rate calculation unit 1402.

意味属性重なり抽出部１４００は、入力された正解コーパスのすべてのペア（述部ペア又は「項−述部」ペア）の各々に関して、当該ペアの各述部の内容語の意味属性である用言属性集合を意味属性辞書から抽出し、ペア同士の用言属性集合の両方に出現する用言属性を素性として抽出し、抽出対象のペアとともに意味属性重み付き重なり率計算部１４０２へ出力する。図１７に「花壇−ガ−完成する」の「完成する」に対する用言属性集合（属性変化、生成）と、「花壇−ガ−出来上がる」の「出来上がる」に対する用言属性集合（生成）から重なり用言属性として「生成」が抽出された例を示す。なお、重なり用言属性が第２の素性の一例である。 For each pair in the input correct corpus (predicate pair or “term-predicate” pair), the semantic attribute overlap extraction unit 1400 extracts from the semantic attribute dictionary the verbal attribute sets, which are the semantic attributes of the content words of the pair's predicates, extracts as a feature the verbal attributes that appear in both sets, and outputs them to the semantic attribute weighted overlap rate calculation unit 1402 together with the pair. FIG. 17 shows an example in which “生成” (generation) is extracted as the overlapping verbal attribute from the verbal attribute set (属性変化 (attribute change), 生成 (generation)) of “完成する” in “花壇−ガ−完成する” and the verbal attribute set (生成) of “出来上がる” in “花壇−ガ−出来上がる”. The overlapping verbal attributes are an example of the second feature.
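
The FIG. 17 example described above amounts to a set intersection over dictionary lookups. The sketch below illustrates that step only; `ATTRIBUTE_DICT` is a hypothetical two-entry stand-in for the semantic attribute dictionary, and the function name is likewise an assumption. The weighted overlap rate of equations (6) and (7) is not reproduced here.

```python
# Hypothetical excerpt of a semantic attribute dictionary; the two entries
# follow the FIG. 17 example as described in the text.
ATTRIBUTE_DICT = {
    "完成する": {"属性変化", "生成"},
    "出来上がる": {"生成"},
}


def overlapping_attributes(pred1, pred2, attribute_dict=ATTRIBUTE_DICT):
    """Return the attributes found in the attribute sets of both predicates."""
    attrs1 = attribute_dict.get(pred1, set())
    attrs2 = attribute_dict.get(pred2, set())
    # Sorted so that the result is deterministic.
    return sorted(attrs1 & attrs2)


shared = overlapping_attributes("完成する", "出来上がる")
```

A predicate missing from the dictionary simply yields an empty overlap in this sketch; how the actual device handles unknown predicates is not stated in this section.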

意味属性重み付き重なり率計算部１４０２は、ペア（述部ペア又は「項−述部」ペア）毎に意味属性重なり抽出部１４００から入力された全ての用言属性と意味属性辞書から抽出した用言属性の階層情報に基づいて、用言属性の重なり度合いを示す素性としての「意味属性重み付き重なり率」を下記（６）式及び（７）式に従って計算し、意味属性重なり抽出部１４００から入力された素性及び抽出対象のペアとともに、素性集合記憶部１４２に出力する。重なり用言属性と意味属性重み付き重なり率の二つの素性は、それぞれ「よりたくさんの属性を共有するほど、述部同士が類似している」という特徴と、「より詳細な属性を共有するほど、述部同士は類似している」という特徴を表わしている。なお、意味属性重み付き重なり率が第４の素性の一例である。第２の素性と第４の素性をあわせて意味属性素性と呼ぶ。 For each pair (predicate pair or “term-predicate” pair), the semantic attribute weighted overlap rate calculation unit 1402 calculates the “semantic attribute weighted overlap rate”, a feature indicating the degree of overlap of the verbal attributes, according to equations (6) and (7) below, based on all the verbal attributes input from the semantic attribute overlap extraction unit 1400 and the hierarchy information of the verbal attributes extracted from the semantic attribute dictionary. It outputs the result to the feature set storage unit 142 together with the features and the pair input from the semantic attribute overlap extraction unit 1400. The two features, the overlapping verbal attributes and the semantic attribute weighted overlap rate, respectively express the characteristics that “the more attributes two predicates share, the more similar they are” and “the more detailed the attributes two predicates share, the more similar they are”. The semantic attribute weighted overlap rate is an example of the fourth feature. The second feature and the fourth feature are collectively called the semantic attribute features.

例えば、「花壇−ガ−完成する」と「花壇−ガ−出来上がる」の意味属性重み付き重なり率を計算する場合、「完成する」の用言属性集合と「出来上がる」の用言属性集合の両方に出現する用言属性は「生成」という用言属性である。さらに、「生成」は図１５に示すとおり一番詳細な階層４の属性であるため、下記の（８）式のように重み付き重なり率が計算される。 For example, when calculating the semantic attribute weighted overlap rate of “花壇−ガ−完成する” and “花壇−ガ−出来上がる”, the verbal attribute that appears in both the verbal attribute set of “完成する” and that of “出来上がる” is “生成” (generation). Since “生成” is an attribute at level 4, the most detailed level of the hierarchy as shown in FIG. 15, the weighted overlap rate is calculated as in equation (8) below.

意味属性辞書記憶部１３８は、複数の述部の内容語の各々に対応する意味属性を格納した意味属性辞書を記憶している。本実施形態においては、意味属性辞書として用言属性辞書が記憶されている。なお、意味属性辞書が意味属性集合の一例である。 The semantic attribute dictionary storage unit 138 stores a semantic attribute dictionary that holds the semantic attributes corresponding to the content words of a plurality of predicates. In the present embodiment, a verbal attribute dictionary is stored as the semantic attribute dictionary. The semantic attribute dictionary is an example of a semantic attribute set.

＜機能表現素性抽出部の構成＞ <Configuration of functional expression feature extraction unit>
機能表現素性抽出部１４１は、正解コーパスのすべてのペア（述部ペア又は「項-述部」ペア）の各々に関して、当該ペアの述部の機能表現の意味の重なりを示す素性を抽出し、抽出対象のペアとともに素性集合記憶部１４２に出力する。本実施形態では、機能表現の意味の重なりを示す素性として、後述する「重なり意味ラベル」と「意味ラベル重なり率」の二つを抽出する。 For each pair in the correct corpus (predicate pair or “term-predicate” pair), the functional expression feature extraction unit 141 extracts features indicating the overlap in meaning of the functional expressions of the pair's predicates, and outputs them to the feature set storage unit 142 together with the pair. In the present embodiment, two such features are extracted: the “overlapping meaning label” and the “semantic label overlap rate”, both described later.

機能表現素性抽出部１４１で、述部の機能表現の重なりを示す素性を抽出することによって、述部の内容語のみを用いて分布類似度の計算を行なっていても述部の同義判定を高精度に行うことができる。これは、分布類似度のみを用いて機能表現を考慮した同義判定を行う場合に生じる、述部を個々の内容語と機能表現の組み合わせとした分布類似度計算のために膨大なデータを必要とする問題が起きるのを回避することができる。 Because the functional expression feature extraction unit 141 extracts features indicating the overlap of the predicates' functional expressions, predicate synonymy can be determined with high accuracy even when the distribution similarity is calculated using only the content words of the predicates. This avoids a problem that arises when synonymy that takes functional expressions into account is determined from distribution similarity alone, namely that an enormous amount of data is needed to calculate distribution similarity over predicates treated as combinations of individual content words and functional expressions.

機能表現素性抽出部１４１の詳細構成を図１９に示す。機能表現素性抽出部１４１は、意味ラベル付与部１５００と、重なり意味ラベル抽出部１５０２と、意味ラベル重なり率計算部１５０４とから構成される。 FIG. 19 shows a detailed configuration of the functional expression feature extraction unit 141. The functional expression feature extraction unit 141 includes a semantic label assignment unit 1500, an overlapping meaning label extraction unit 1502, and a semantic label overlap rate calculation unit 1504.

意味ラベル付与部１５００は、入力された正解コーパスのすべてのペア（述部ペア又は「項-述部」ペア）の各々に関して、当該ペアの各述部の機能表現の意味ラベルを機能表現辞書から抽出し、当該ペアとともに重なり意味ラベル抽出部１５０２へ出力する。本実施形態では、統計的な意味ラベル付与方法を用いる（非特許文献７：今村賢治,泉朋子,菊井玄一郎,佐藤理史 (2011).述部機能表現の意味ラベルタガー言語処理学会第１７回年次大会.518-521）。具体的には、入力された正解コーパスのすべてのペアの各述部の形態素解析を行い、形態素毎の表記と標準形と品詞が少なくとも含まれる解析結果を用いて最も尤もらしい意味ラベル列を付与する。図２０に機能表現辞書の例を示す。図１８に、「花壇-ガ-出来上がった」と「花壇-ガ-完成した」という「項-述部」ペアが入力された場合に意味ラベル列を付与した例を示す。図１８に示される例においては、述部はそれぞれ「出来上がった」と「完成した」であり、それぞれを内容語部分と機能表現部分を識別し、それぞれの機能表現の「た」に「完了」の意味ラベルを付与している。 For each pair in the input correct corpus (predicate pair or “term-predicate” pair), the semantic label assignment unit 1500 extracts from the functional expression dictionary the semantic labels of the functional expressions of the pair's predicates, and outputs them to the overlapping meaning label extraction unit 1502 together with the pair. In the present embodiment, a statistical semantic labeling method is used (Non-Patent Document 7: Kenji Imamura, Tomoko Izumi, Genichiro Kikui, Satoshi Sato (2011). A semantic label tagger for predicate functional expressions (述部機能表現の意味ラベルタガー). Proceedings of the 17th Annual Meeting of the Association for Natural Language Processing, 518-521). Specifically, each predicate of every pair in the input correct corpus is morphologically analyzed, and the most likely semantic label sequence is assigned using an analysis result that includes at least the surface form, standard form, and part of speech of each morpheme. FIG. 20 shows an example of the functional expression dictionary. FIG. 18 shows an example of the semantic label sequences assigned when the “term-predicate” pair “花壇-ガ-出来上がった” (the flowerbed was finished) and “花壇-ガ-完成した” (the flowerbed was completed) is input. In this example the predicates are “出来上がった” and “完成した”; each is split into a content word part and a functional expression part, and the functional expression “た” of each is given the semantic label “完了” (completion).

本実施形態では、述部の内容語部分と機能表現部分の識別を意味ラベル付与部１５００で行なっているが、出来事の意味に影響を与える機能表現のみを残す事前処理を実施するようにしてもよい。（非特許文献８：Izumi T., Imamura K., Kikui G., & Sato S. (2010). Standardizing Complex Functional Expressions in Japanese Predicates: Applying Theoretically-Based Paraphrasing Rules. Proceedings of the Workshop on Multiword Expressions: From theory to applications (MWE 2010), 63-71） In the present embodiment, the semantic label assignment unit 1500 distinguishes the content word part of a predicate from its functional expression part, but preprocessing that keeps only the functional expressions affecting the meaning of the event may be performed instead. (Non-Patent Document 8: Izumi T., Imamura K., Kikui G., & Sato S. (2010). Standardizing Complex Functional Expressions in Japanese Predicates: Applying Theoretically-Based Paraphrasing Rules. Proceedings of the Workshop on Multiword Expressions: From theory to applications (MWE 2010), 63-71)

重なり意味ラベル抽出部１５０２は、意味ラベル付与部１５００から入力された抽出対象のペアとそれぞれの意味ラベルから、両方の述部の意味ラベルに共通して出現する意味ラベルを重なり意味ラベルとして抽出し、当該ペアとともに意味ラベル重なり計算部１５０４に出力する。「花壇-ガ-出来上がった」と「花壇-ガ-完成した」という「項-述部」ペアの例では、述部「出来上がった」の意味ラベル「完了」と述部「完成した」の意味ラベル「完了」から、重なり意味ラベルとして「完了」が抽出される。なお、重なり意味ラベルが第６の素性の一例である。 The overlapping meaning label extraction unit 1502 extracts, as overlapping meaning labels, meaning labels that appear in common in the meaning labels of both predicates from the extraction target pairs input from the meaning label assignment unit 1500 and the respective meaning labels. , And output to the semantic label overlap calculator 1504 together with the pair. In the example of the “term-predicate” pair “flowerbed-ga-completed” and “flowerbed-ga-completed”, the meaning label “completed” and predicate “completed” mean the predicate “completed”. From the label “completed”, “completed” is extracted as the overlapping meaning label. Note that the overlapping meaning label is an example of a sixth feature.

意味ラベル重なり率計算部１５０４は、重なり意味ラベル抽出部１５０２から入力された抽出対象のペアと重なり意味ラベルに基づいて、意味ラベル重なり率を下記（９）式に従って計算し、当該ペア及び重なり意味ラベルとともに計算結果を素性集合記憶部１４２に出力する。なお、意味ラベル重なり率が第７の素性の一例である。第６の素性と第７の素性をあわせて機能表現素性と呼ぶ。 The semantic label overlap rate calculation unit 1504 calculates a semantic label overlap rate according to the following equation (9) based on the extraction target pair and the overlap meaning label input from the overlap meaning label extraction unit 1502, and the pair and overlap meaning The calculation result together with the label is output to the feature set storage unit 142. The semantic label overlap rate is an example of a seventh feature. The sixth feature and the seventh feature are collectively called a function expression feature.

機能表現辞書記憶部１３９は、複数の述部の機能表現に対する意味ラベルを格納した機能表現辞書を記憶している。本実施形態では、非特許文献６の辞書を用いているが、これに限られるものではなく、他の文末表現辞書、モダリティ表現辞書を用いてもよい。 The function expression dictionary storage unit 139 stores a function expression dictionary that stores semantic labels for function expressions of a plurality of predicates. In the present embodiment, the dictionary of Non-Patent Document 6 is used, but the present invention is not limited to this, and other sentence ending expression dictionaries and modality expression dictionaries may be used.

図６の素性集合記憶部１４２には、分布類似度計算部１３２、辞書定義文素性抽出部１３６、意味属性素性抽出部１４０、及び機能表現素性抽出部１４１で得られた各ペア（述部ペア又は「項−述部」ペア）の各素性及び素性の抽出対象のペアが入力され、素性集合記憶部１４２は、入力された各素性を各ペアごとに記憶している。 In the feature set storage unit 142 of FIG. 6, each pair (predicate pair) obtained by the distribution similarity calculation unit 132, the dictionary definition sentence feature extraction unit 136, the semantic attribute feature extraction unit 140, and the function expression feature extraction unit 141 is stored. Alternatively, each feature of “term-predicate” pair) and a pair of feature extraction targets are input, and the feature set storage unit 142 stores each input feature for each pair.

図６の同義判定モデル学習部１４４は、入力された正解コーパスのすべてのペア（述部ペア又は「項−述部」ペア）の各々に対し、素性集合記憶部１４２から入力された各素性をもとに、同義判定モデルの学習を行い、学習した同義判定モデルを出力部１５０に出力する。同義判定モデルの学習にはＳＶＭを用いる（非特許文献３：Chang, C.-C. and Lin, C.-J.(2011). LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27, 1-27.）。 The synonym determination model learning unit 144 in FIG. 6 calculates each feature input from the feature set storage unit 142 for each of all the pairs of correct corpus that are input (predicate pair or “term-predicate” pair). Originally, the synonym determination model is learned, and the learned synonym determination model is output to the output unit 150. SVM is used for learning the synonym determination model (Non-patent Document 3: Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27, 1-27.).

具体的には、同義である述部ペア又は「項-述部」ペアを「正例（ＳＶＭでは＋１）」、それ以外の同義ではない述部ペア又は「項-述部」ペアを「負例（ＳＶＭでは−１）」として、述部分布類似度及び項述部分布類似度の少なくとも１つと、辞書定義文素性、意味属性素性、機能表現素性を用いて同義判定モデルの学習を行う。図２１に「花壇-ガ-完成する」と「花壇-ガ-出来上がる」に対する素性の一覧を示す。 Specifically, a predicate pair or “term-predicate” pair that is synonymous is “positive example (+1 in SVM)”, and a non-synonymous predicate pair or “term-predicate” pair is “negative”. As an example (-1 in SVM) ", the synonym determination model is learned using at least one of the predicate distribution similarity and the term predicate distribution similarity, and the dictionary definition sentence feature, semantic attribute feature, and functional expression feature. FIG. 21 shows a list of features for “flowerbed-ga-complete” and “flowerbed-ga-complete”.
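
Since the cited SVM library is LIBSVM, one way to picture the training data is as lines in LIBSVM's sparse text format (`<label> <index>:<value> ...`, with 1-based ascending indices). The sketch below is an assumption about how the features could be serialized, not something the patent specifies: the feature names in `FEATURE_ORDER` (in particular `DistSim`) and their order are hypothetical, and only numeric features are handled.

```python
# Hypothetical, fixed ordering of numeric features; index i+1 in the output
# line corresponds to FEATURE_ORDER[i].
FEATURE_ORDER = ["DistSim", "Pred1Match", "Pred2Match", "NounMatch", "VerbMatch"]


def to_libsvm_line(features, is_synonym, feature_order=FEATURE_ORDER):
    """Serialize one pair as a line in LIBSVM's sparse text format.

    Synonymous pairs are positive examples (+1), all others negative (-1).
    Zero-valued features are omitted, as the sparse format allows.
    """
    parts = ["+1" if is_synonym else "-1"]
    for index, name in enumerate(feature_order, start=1):
        value = features.get(name, 0)
        if value != 0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


positive = to_libsvm_line({"DistSim": 0.5, "Pred1Match": 1, "NounMatch": 2}, True)
negative = to_libsvm_line({}, False)
```

Categorical features such as the overlapping attributes themselves would need an additional encoding step (for example one index per attribute), which this sketch leaves out.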

＜同義判定部の構成＞ <Configuration of synonym determination unit>
次に、本発明の実施の形態に係る同義判定装置３００の構成について詳細に説明する。図２２に示すように、本発明の実施の形態に係る同義判定装置３００は、ＣＰＵとＲＡＭと後述する同義判定処理ルーチンを実行するためのプログラムや各種データを記憶したＲＯＭと、を含むコンピュータで構成することができる。この同義判定装置３００は、機能的には図２２に示すように入力部２１０と、演算部２２０と、出力部２５０とを備えている。 Next, the configuration of the synonym determination device 300 according to the embodiment of the present invention will be described in detail. As shown in FIG. 22, the synonym determination device 300 can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the synonym determination processing routine described later. Functionally, as shown in FIG. 22, the synonym determination device 300 includes an input unit 210, a calculation unit 220, and an output unit 250.

入力部２１０は、キーボードなどの入力装置から同義判定対象のペア（述部ペア又は「項−述部」ペア）を受け付ける。なお、入力部２１０は、ネットワーク等を介して外部から入力されたものを受け付けるようにしてもよい。 The input unit 210 receives a synonym determination target pair (predicate pair or “term-predicate” pair) from an input device such as a keyboard. Note that the input unit 210 may accept input from the outside via a network or the like.

演算部２２０は、素性ベクトル記憶部２２４と、定義文辞書記憶部２２６と、意味属性辞書記憶部２２８と、機能表現辞書記憶部２２９と、素性構築部２２２と、同義判定部２３０と、同義判定モデル記憶部２３２とを含んだ構成で表すことができる。 The calculation unit 220 includes a feature vector storage unit 224, a definition sentence dictionary storage unit 226, a semantic attribute dictionary storage unit 228, a function expression dictionary storage unit 229, a feature construction unit 222, a synonym determination unit 230, and a synonym determination. It can be expressed by a configuration including the model storage unit 232.

素性ベクトル記憶部２２４には、素性ベクトル構築装置１００において出力された素性ベクトルが記憶されている。 The feature vector storage unit 224 stores feature vectors output from the feature vector construction device 100.

素性構築部２２２は、入力部２１０において受け付けた同義判定対象のペア（述部ペア又は「項−述部」ペア）に対して、同義学習装置２００と同じ種類の「分布類似度」、「辞書定義文素性」、「意味属性素性」及び「機能表現素性」をそれぞれ抽出し、同義判定部２３０に出力する。図２３に抽出された素性の例を示す。 The feature construction unit 222 uses the same type of “distribution similarity” and “dictionary” as the synonym learning device 200 for the synonym determination target pair (predicate pair or “term-predicate” pair) received by the input unit 210. The “definition sentence feature”, “semantic attribute feature”, and “functional expression feature” are extracted and output to the synonym determination unit 230. FIG. 23 shows an example of the extracted features.

同義判定部２３０は、素性構築部２２２から入力された素性に基づいて、同義学習装置２００において出力され、予め求められた同義判定モデルが記憶されている同義判定モデル記憶部２３２に記憶された同義判定モデルを基にＳＶＭ識別器を用いて、入力された同義判定対象のペア（述部ペア又は「項−述部」ペア）が「同義か否か」を判定し、出力部２５０に出力する。 Based on the features input from the feature construction unit 222, the synonym determination unit 230 uses an SVM classifier with the synonym determination model stored in the synonym determination model storage unit 232, which holds the model obtained in advance and output by the synonym learning device 200, to determine whether the input synonym determination target pair (predicate pair or “term-predicate” pair) is synonymous, and outputs the result to the output unit 250.

＜素性ベクトル構築装置の作用＞ <Operation of feature vector construction device>
次に、本発明の実施の形態に係る素性ベクトル構築装置１００の作用について説明する。まず、入力部１０によりテキストコーパスが入力される。そして、素性ベクトル構築装置１００のＲＯＭに記憶されたプログラムを、ＣＰＵが実行することにより、図２４に示す素性ベクトル構築処理ルーチンが実行される。 Next, the operation of the feature vector construction device 100 according to the embodiment of the present invention will be described. First, a text corpus is input through the input unit 10. Then, the CPU executes the program stored in the ROM of the feature vector construction device 100, whereby the feature vector construction processing routine shown in FIG. 24 is executed.

まず、ステップＳ１００において、複数の文の集合であるテキストコーパスを読み込む。 First, in step S100, a text corpus that is a set of a plurality of sentences is read.

次に、ステップＳ１０２において、ステップＳ１００において受け付けた複数の文のうちの一つの文について形態素解析を行う。 Next, in step S102, morphological analysis is performed on one of the plurality of sentences received in step S100.

次に、ステップＳ１０４において、ステップＳ１０２において形態素解析を行った文について係り受け解析を行う。 Next, in step S104, dependency analysis is performed on the sentence that was morphologically analyzed in step S102.

次に、ステップＳ１０６において、ステップＳ１０２及びステップＳ１０４において得られた形態素毎の表記と標準型と品詞、および文節ごとの係り受け情報が少なくとも含まれる解析結果から当該文に含まれる「項−述部」の周辺に現れる単語の情報（文脈情報）を項述部素性として抽出する。また、ステップＳ１０６において、ステップＳ１０２及びステップＳ１０４において得られた形態素毎の表記と標準型と品詞、および文節ごとの係り受け情報が少なくとも含まれる解析結果から当該文に含まれる述部の周辺に現れる単語の情報（文脈情報）を述部素性として抽出する。本実施形態では、例えば上記の非特許文献１と同じ方法で項述部素性や述部素性を抽出する。具体的には、対象の「項−述部」に係っている別の「項−述部」、「述部」を項述部素性として抽出する。さらに、「述部」単体に係っている項（格助詞をもつ名詞句）、及び別の「述部」を述部素性として抽出する。 Next, in step S106, the “term-predicate” included in the sentence from the analysis result including at least the notation for each morpheme, the standard type, the part of speech, and the dependency information for each clause obtained in step S102 and step S104. Information (context information) of words appearing around "is extracted as a term predicate feature. In step S106, the morpheme notation obtained in steps S102 and S104, the standard type, the part of speech, and the analysis result including at least dependency information for each clause appear around the predicate included in the sentence. Extract word information (context information) as predicate features. In the present embodiment, for example, the term predicate feature and the predicate feature are extracted by the same method as in Non-Patent Document 1 described above. Specifically, another “term-predicate” and “predicate” related to the target “term-predicate” are extracted as term predicate features. Furthermore, a term (noun phrase having a case particle) related to a single “predicate” and another “predicate” are extracted as predicate features.

次に、ステップＳ１０８において、ステップＳ１００において受け付けた複数の文のすべてについて上記ステップＳ１０２〜Ｓ１０６の処理を実行したか否かを判定する。すべての文について上記ステップＳ１０２〜Ｓ１０６の処理を実行した場合にはステップＳ１１０に移行し、上記ステップＳ１０２〜Ｓ１０６の処理を実行していない文が存在する場合には、ステップＳ１０２に移行して各処理を繰り返す。 Next, in step S108, it is determined whether or not the processing in steps S102 to S106 has been executed for all of the plurality of sentences received in step S100. When the processes of steps S102 to S106 are executed for all sentences, the process proceeds to step S110, and when there is a sentence that does not execute the processes of steps S102 to S106, the process proceeds to step S102. Repeat the process.

次に、ステップＳ１１０において、少なくとも一つの文に含まれる「項−述部」についてステップＳ１０６において抽出された素性に基づいて素性ベクトルを構築する。また、ステップＳ１１０において、少なくとも一つの文に含まれる述部についてステップＳ１０６において抽出された素性に基づいて素性ベクトルを構築する。 Next, in step S110, a feature vector is constructed based on the features extracted in step S106 for the “term-predicate” included in at least one sentence. In step S110, a feature vector is constructed based on the features extracted in step S106 for the predicates included in at least one sentence.

次に、ステップＳ１１２において、ステップＳ１００において読み込んだ少なくとも一つの文に含まれるすべての「項−述部」の各々について項述部素性ベクトルを構築したか否かを判定する。また、ステップＳ１１２において、ステップＳ１００において読み込んだ少なくとも一つの文に含まれるすべての述部の各々について述部素性ベクトルを構築したか否かを判定する。すべての「項−述部」の各々について項述部素性ベクトルを構築した場合には、ステップＳ１１４に移行し、項述部素性ベクトルを構築していない「項−述部」が存在する場合には、ステップＳ１１０に移行して各処理を繰り返す。また、すべての述部の各々について述部素性ベクトルを構築した場合には、ステップＳ１１４に移行し、述部素性ベクトルを構築していない述部が存在する場合には、ステップＳ１１０に移行して各処理を繰り返す。 Next, in step S112, it is determined whether or not a term predicate feature vector has been constructed for each of all “term-predicate” included in at least one sentence read in step S100. In step S112, it is determined whether a predicate feature vector has been constructed for each of all predicates included in at least one sentence read in step S100. If a term predicate feature vector is constructed for each of all “term-predicates”, the process proceeds to step S114, and there is a “term-predicate” in which no term predicate feature vector is constructed. Moves to step S110 and repeats each process. If a predicate feature vector is constructed for each of all predicates, the process proceeds to step S114. If there is a predicate that does not construct a predicate feature vector, the process proceeds to step S110. Repeat each process.

次に、ステップＳ１１４において、ステップＳ１１０において構築された素性ベクトルの全てを出力部３０により出力して処理を終了する。 Next, in step S114, all the feature vectors constructed in step S110 are output by the output unit 30, and the process is terminated.

In the present embodiment, both the predicate feature vectors and the term predicate feature vectors are created; however, it suffices to create only the feature vectors that are used by the synonym learning device and the synonym determination device.

<Operation of the synonym learning device>
Next, the operation of the synonym learning device 200 according to the embodiment of the present invention will be described. First, the feature vectors output by the feature vector construction device 100 are input via the input unit 110 and stored in the feature vector storage unit 124. The correct corpus is also input via the input unit 110. Then, the CPU executes the program stored in the ROM of the synonym learning device 200, whereby the synonym determination model learning processing routine shown in FIG. 25 is executed.

まず、ステップＳ２０２において、正解コーパスを読み込む。 First, in step S202, a correct corpus is read.

In the subsequent steps, when a synonym determination model for predicate pairs is learned, only the correct data for predicate pairs in the read correct corpus is used; when a synonym determination model for “term-predicate” pairs is learned, only the correct data for “term-predicate” pairs in the read correct corpus is used.

Next, in step S204, for each of the plural pieces of correct data obtained in step S202, the predicate distribution similarity of the pair (predicate pair or “term-predicate” pair) of that correct data is calculated based on the predicate feature vectors of the pair stored in the feature vector storage unit 124. Also in step S204, for each of the plural pieces of correct data obtained in step S202, the term predicate distribution similarity of the “term-predicate” pair of that correct data is calculated based on the term predicate feature vectors of the pair stored in the feature vector storage unit 124. In the present embodiment, the predicate distribution similarity is calculated for “term-predicate” pairs as well, but only the term predicate distribution similarity may be calculated for them.

次に、ステップＳ２０６において、ステップＳ２０２において得られた複数の正解データの各々について、当該正解データのペア（述部ペア又は「項−述部」ペア）の辞書定義文素性を抽出する。 Next, in step S206, for each of the plurality of correct answer data obtained in step S202, the dictionary definition sentence feature of the correct answer data pair (predicate pair or “term-predicate” pair) is extracted.

次に、ステップＳ２０８において、ステップＳ２０２において得られた複数の正解データの各々について、当該正解データのペア（述部ペア又は「項−述部」ペア）の意味属性素性を抽出する。 Next, in step S208, for each of the plurality of correct answer data obtained in step S202, a semantic attribute feature of the correct answer data pair (predicate pair or “term-predicate” pair) is extracted.

Next, in step S209, for each of the plural pieces of correct data obtained in step S202, the functional expression feature of the pair (predicate pair or “term-predicate” pair) of that correct data is extracted.

Next, in step S210, for each of the plural pieces of correct data, positive-example or negative-example learning data containing all the features extracted for the pair (predicate pair or “term-predicate” pair) of that correct data is created, based on the distribution similarity obtained in step S204, the dictionary definition sentence feature obtained in step S206, the semantic attribute feature obtained in step S208, the functional expression feature obtained in step S209, and the information, obtained in step S202, on whether each piece of correct data is synonymous.

次に、ステップＳ２１２において、ステップＳ２１０において作成された各学習データに基づいて同義判定モデルを学習し、同義判定モデルを出力部１５０により出力して処理を終了する。 Next, in step S212, the synonym determination model is learned based on each learning data created in step S210, the synonym determination model is output by the output unit 150, and the process ends.
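The assembly of learning data in step S210 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `build_training_example`, the dictionary-based feature representation, the +1/-1 label encoding, and all feature names and values in the usage example are assumptions introduced here.

```python
from typing import Dict, Tuple

Features = Dict[str, float]


def build_training_example(
    distribution_similarity: Features,
    dictionary_features: Features,
    semantic_attribute_features: Features,
    functional_expression_features: Features,
    is_synonymous: bool,
) -> Tuple[Features, int]:
    """Merge the four feature groups of one correct-data pair (step S210).

    Returns the merged feature dictionary and a label:
    +1 for a positive example (synonymous), -1 for a negative example.
    The +1/-1 encoding is an assumption made for this sketch.
    """
    merged: Features = {}
    for group in (
        distribution_similarity,
        dictionary_features,
        semantic_attribute_features,
        functional_expression_features,
    ):
        merged.update(group)
    return merged, (1 if is_synonymous else -1)


# Hypothetical feature names and values, for illustration only.
example, label = build_training_example(
    {"pred_dist_sim": 0.4},
    {"Pred1Match": 1.0},
    {"attr_overlap_rate": 0.5},
    {"label_overlap_rate": 1.0},
    True,
)
```

The list of such (features, label) pairs would then be handed to the learner used in step S212.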

Step S204 is realized by the distribution similarity calculation routine shown in FIG. 26.

First, in step S250, for a predicate pair of the correct data to be calculated, the predicate feature vectors of the pair stored in the feature vector storage unit 124 are read. Also in step S250, for a “term-predicate” pair of the correct data to be calculated, the predicate feature vectors and the term predicate feature vectors of the pair stored in the feature vector storage unit 124 are read. In the present embodiment, the predicate feature vectors are read for “term-predicate” pairs as well, but they need not be read.

次に、ステップＳ２５２において、ステップＳ２５０において読み出された素性ベクトルに基づいて、当該正解データのペア（述部ペア又は「項−述部」ペア）について分布類似度を算出する。 Next, in step S252, based on the feature vector read in step S250, the distribution similarity is calculated for the correct data pair (predicate pair or “term-predicate” pair).

次に、ステップＳ２５４において、ステップＳ２０２において読み込んだ全ての正解データの各々について、当該正解データのペアの分布類似度が算出されたか否かを判定する。全ての正解データの各々のペア（述部ペア又は「項−述部」ペア）について分布類似度が算出されている場合には処理を終了し、分布類似度が算出されていない正解データのペア（述部ペア又は「項−述部」ペア）が存在する場合には、ステップＳ２５０に移行し当該正解データを算出対象の正解データとして各処理を繰り返す。 Next, in step S254, it is determined whether or not the distribution similarity of the correct data pair has been calculated for each of the correct data read in step S202. If the distribution similarity is calculated for each pair of all correct data (predicate pair or “term-predicate” pair), the process is terminated, and the correct data pair for which distribution similarity is not calculated If there is a (predicate pair or “term-predicate” pair), the process proceeds to step S250, and each process is repeated using the correct answer data as correct answer data to be calculated.
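As an illustration of step S252, the following sketch computes a similarity between two sparse feature vectors. The embodiment follows the method of Non-Patent Document 1, whose exact measure is not reproduced in this passage; cosine similarity is used here purely as a stand-in assumption, and the function name and the dictionary representation of a feature vector are hypothetical.

```python
import math
from typing import Dict


def cosine_similarity(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse feature vectors.

    Each vector maps a feature name to its weight. Returns 0.0 when
    either vector has zero length, so the caller never divides by zero.
    """
    dot = sum(weight * v.get(name, 0.0) for name, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

For a “term-predicate” pair, the same function could be applied once to the predicate feature vectors and once to the term predicate feature vectors, yielding the two similarity values described in step S204.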

Step S206 is realized by the dictionary definition sentence feature extraction routine shown in FIG. 27.

まず、ステップＳ３００において、抽出対象の正解データのペア（述部ペア又は「項−述部」ペア）の各述部の定義文セットを定義文辞書記憶部１３４から抽出する。 First, in step S300, a definition sentence set of each predicate of a pair of correct data to be extracted (predicate pair or “term-predicate” pair) is extracted from the definition sentence dictionary storage unit 134.

次に、ステップＳ３０２において、ステップＳ３００において抽出された定義文セットの各々の定義文について形態素解析を行う。 Next, in step S302, morphological analysis is performed on each definition sentence of the definition sentence set extracted in step S300.

次に、ステップＳ３０４において、ステップＳ３０２において得られた形態素毎の表記と標準形と品詞、および読みが少なくとも含まれる形態素解析の結果に基づいて、定義文相互補完性を示す素性を抽出する。定義文相互補完性を示す素性として、Pred1Match、Pred2Matchを抽出する。抽出対象の正解データのペアが「項-述部」ペアの場合、Arg1Match、Arg2Matchをさらに抽出することもできる。 Next, in step S304, based on the result of the morpheme analysis including at least the notation, the standard form, the part of speech, and the reading for each morpheme obtained in step S302, the feature indicating the definition sentence mutual complementarity is extracted. Pred1Match and Pred2Match are extracted as features indicating definition sentence mutual complementarity. If the pair of correct data to be extracted is a “term-predicate” pair, Arg1Match and Arg2Match can be further extracted.

次に、ステップＳ３０６において、ステップＳ３０２において得られた形態素毎の表記と標準形と品詞、および読みが少なくとも含まれる形態素解析の結果に基づいて、語彙の重なりを示す素性を抽出する。語彙の重なりを示す素性として、NounMatch,VerbMatch,AdjMatch,AdjNMatchの少なくとも１つを抽出する。 Next, in step S306, based on the result of morpheme analysis including at least the notation, the standard form, the part of speech, and the reading for each morpheme obtained in step S302, a feature indicating vocabulary overlap is extracted. At least one of NounMatch, VerbMatch, AdjMatch, and AdjNMatch is extracted as a feature indicating vocabulary overlap.

Next, in step S308, the dictionary definition sentence feature is constructed based on the feature indicating definition sentence mutual complementarity extracted in step S304 and the feature indicating vocabulary overlap extracted in step S306.

次に、ステップＳ３１０において、ステップＳ２０２において読み込んだ全ての正解データの各々のペア（述部ペア又は「項−述部」ペア）について辞書定義文素性を構築したか判定する。すべての正解データのペア（述部ペア又は「項−述部」ペア）について辞書定義文素性を構築した場合には、処理を終了し、辞書定義文素性を構築していない正解データのペア（述部ペア又は「項−述部」ペア）が存在する場合には、ステップＳ３００に移行して当該正解データを抽出対象の正解データとして各処理を繰り返す。 Next, in step S310, it is determined whether a dictionary definition sentence feature has been constructed for each pair (predicate pair or “term-predicate” pair) of all correct data read in step S202. When the dictionary definition sentence features are constructed for all correct data pairs (predicate pairs or “term-predicate” pairs), the processing is terminated, and the correct data pairs that have not constructed dictionary definition sentence features ( If there is a predicate pair or “term-predicate” pair), the process proceeds to step S300, and each process is repeated with the correct answer data as the correct answer data to be extracted.
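Steps S304 to S308 can be sketched as below. The precise definitions of Pred1Match, Pred2Match and the part-of-speech match features are given elsewhere in the specification and are not repeated in this passage, so the sketch makes the following assumptions: Pred1Match is 1 when the first predicate appears in the definition sentences of the second predicate (and Pred2Match the reverse), and each part-of-speech feature is 1 when the two definition sentence sets share at least one word of that part of speech. The function name, the `(standard form, part of speech)` tuple format, and the part-of-speech tag strings are hypothetical.

```python
from typing import Dict, List, Tuple

Morpheme = Tuple[str, str]  # (standard form, part of speech)

# Assumed mapping from feature name to part-of-speech tag.
POS_FEATURES = (
    ("NounMatch", "noun"),
    ("VerbMatch", "verb"),
    ("AdjMatch", "adjective"),
    ("AdjNMatch", "adjectival noun"),
)


def dictionary_definition_features(
    pred1: str,
    pred2: str,
    definition1: List[Morpheme],
    definition2: List[Morpheme],
) -> Dict[str, int]:
    """Build binary dictionary definition sentence features for one pair.

    definition1 / definition2 are the morphologically analyzed
    definition sentences of pred1 / pred2.
    """
    forms1 = {form for form, _ in definition1}
    forms2 = {form for form, _ in definition2}
    features = {
        "Pred1Match": int(pred1 in forms2),  # pred1 occurs in pred2's definition
        "Pred2Match": int(pred2 in forms1),  # pred2 occurs in pred1's definition
    }
    for name, pos in POS_FEATURES:
        words1 = {form for form, p in definition1 if p == pos}
        words2 = {form for form, p in definition2 if p == pos}
        features[name] = int(bool(words1 & words2))
    return features


# Toy English data standing in for analyzed Japanese definitions.
toy = dictionary_definition_features(
    "finish",
    "complete",
    [("complete", "verb"), ("work", "noun")],
    [("make", "verb"), ("work", "noun"), ("finish", "verb")],
)
```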

Step S208 is realized by the semantic attribute feature extraction routine shown in FIG. 28.

First, in step S400, the overlapping predicate attributes of the predicates of the pair (predicate pair or “term-predicate” pair) of the correct data to be extracted are extracted.

Next, in step S402, the semantic attribute weighted overlap rate is calculated for the pair (predicate pair or “term-predicate” pair) of the correct data to be extracted, based on the overlapping predicate attributes obtained in step S400.

Next, in step S404, the semantic attribute feature is constructed for the pair (predicate pair or “term-predicate” pair) of the correct data to be extracted, based on the overlapping predicate attributes obtained in step S400 and the semantic attribute weighted overlap rate obtained in step S402.

次に、ステップＳ４０６において、ステップＳ２０２において読み込んだ全ての正解データの各々のペア（述部ペア又は「項−述部」ペア）について意味属性素性の構築をしたか否かを判定する。すべての正解データの各々のペア（述部ペア又は「項−述部」ペア）について意味属性素性の構築をした場合には処理を終了し、意味属性素性の構築をしていない正解データのペア（述部ペア又は「項−述部」ペア）が存在する場合には、ステップＳ４００に移行し、当該正解データを抽出対象の正解データとして各処理を繰り返す。 Next, in step S406, it is determined whether or not a semantic attribute feature has been constructed for each pair (predicate pair or “term-predicate” pair) of all correct data read in step S202. When the semantic attribute feature is constructed for each pair of all correct data (predicate pair or “term-predicate” pair), the processing is terminated, and the correct data pair for which no semantic attribute feature is constructed If there is a (predicate pair or “term-predicate” pair), the process proceeds to step S400, and each process is repeated with the correct data as the correct data to be extracted.
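The calculation in steps S400 and S402 can be illustrated with the sketch below. The actual weighting formula is defined elsewhere in the specification and is not reproduced here. The sketch assumes that each attribute is written as a slash-separated path in the attribute hierarchy, that an attribute's weight is its depth in that hierarchy (so more detailed attributes count more), and that the rate is the weight of the shared attributes divided by the weight of all attributes of either predicate. All names and the path notation are hypothetical.

```python
from typing import Set, Tuple


def attribute_weight(attribute: str) -> int:
    """Depth of an attribute in the hierarchy, e.g. 'action/creation' -> 2."""
    return attribute.count("/") + 1


def weighted_overlap(attrs1: Set[str], attrs2: Set[str]) -> Tuple[Set[str], float]:
    """Return the overlapping attributes and a depth-weighted overlap rate.

    The rate is 0.0 when neither predicate has any attribute.
    """
    shared = attrs1 & attrs2
    union = attrs1 | attrs2
    if not union:
        return shared, 0.0
    rate = sum(map(attribute_weight, shared)) / sum(map(attribute_weight, union))
    return shared, rate


# Hypothetical attribute paths, for illustration only.
shared, rate = weighted_overlap(
    {"state", "action/creation"},
    {"action/creation", "action/removal"},
)
```

In this toy case the shared attribute has weight 2 and the three distinct attributes have a combined weight of 5, so the rate is 2/5.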

Step S209 is realized by the functional expression feature extraction routine shown in FIG. 29.

First, in step S600, morphological analysis is performed on each predicate of the pair (predicate pair or “term-predicate” pair) of the correct data to be extracted, and the most likely sequence of semantic labels is assigned using the analysis result, which includes at least the notation, standard form, and part of speech of each morpheme.

次に、ステップＳ６０２において、ステップＳ６００において得られたペア各々の述部の意味ラベルの両方に共通して出現する重なり意味ラベルを抽出する。 Next, in step S602, overlapping semantic labels that appear in common in both semantic labels of the predicates of each pair obtained in step S600 are extracted.

次に、ステップＳ６０４において、ステップＳ６０２において得られたペアの重なり意味ラベルを用いて、意味ラベル重なり率を計算する。 Next, in step S604, a semantic label overlap rate is calculated using the paired overlapping semantic labels obtained in step S602.

Next, in step S606, the functional expression feature is constructed for the pair (predicate pair or “term-predicate” pair) of the correct data to be extracted, based on the overlapping semantic labels obtained in step S602 and the semantic label overlap rate obtained in step S604.

Next, in step S608, it is determined whether the functional expression feature has been constructed for every pair (predicate pair or “term-predicate” pair) of the correct data read in step S202. If it has been constructed for every pair, the processing ends; if any pair (predicate pair or “term-predicate” pair) of correct data remains without a functional expression feature, the process returns to step S600 and the processing is repeated with that correct data as the correct data to be extracted.
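Steps S602 and S604 can be sketched as follows, taking the semantic label sequences assigned in step S600 as given. The formula for the semantic label overlap rate is not stated in this passage; the sketch assumes the number of shared labels divided by the number of distinct labels across both predicates. The function name and the label strings are hypothetical.

```python
from typing import List, Set, Tuple


def semantic_label_overlap(labels1: List[str], labels2: List[str]) -> Tuple[Set[str], float]:
    """Return the overlapping semantic labels and their overlap rate.

    labels1 / labels2 are the semantic label sequences assigned to the
    functional expressions of the two predicates. The rate is 0.0 when
    neither predicate carries any label.
    """
    set1, set2 = set(labels1), set(labels2)
    shared = set1 & set2
    union = set1 | set2
    rate = len(shared) / len(union) if union else 0.0
    return shared, rate
```

For instance, two predicates that both carry only a hypothetical "continuation" label would share that label with a rate of 1.0, while a pair in which one predicate additionally carries a "negation" label would drop to 0.5.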

<Operation of the synonym determination device>
Next, the operation of the synonym determination device 300 according to the embodiment of the present invention will be described. First, the synonym determination model output by the synonym learning device 200 is input via the input unit 210 and stored in the synonym determination model storage unit 232. When a synonym determination target pair (predicate pair or “term-predicate” pair) is input via the input unit 210, the CPU executes the program stored in the ROM of the synonym determination device 300, whereby the synonym determination processing routine shown in FIG. 30 is executed.

まず、ステップＳ５００において、入力された同義判定対象のペア（述部ペア又は「項−述部」ペア）を受け付ける。 First, in step S500, the input synonym determination target pair (predicate pair or “term-predicate” pair) is received.

次に、ステップＳ５０２において、上記ステップＳ２５０、Ｓ２５２と同様に、同義判定対象のペア（述部ペア又は「項−述部」ペア）の分布類似度を算出する。 Next, in step S502, similar to steps S250 and S252 above, the distribution similarity of the synonym determination target pair (predicate pair or “term-predicate” pair) is calculated.

次に、ステップＳ５０４において、上記ステップＳ３００、Ｓ３０２、Ｓ３０４、Ｓ３０６、Ｓ３０８と同様に、同義判定対象のペア（述部ペア又は「項−述部」ペア）の辞書定義文素性を抽出する。 Next, in step S504, similar to steps S300, S302, S304, S306, and S308, a dictionary definition sentence feature of a synonym determination target pair (predicate pair or “term-predicate” pair) is extracted.

次に、ステップＳ５０６において、上記ステップＳ４００、Ｓ４０２、Ｓ４０４と同様に、同義判定対象のペア（述部ペア又は「項−述部」ペア）の意味属性素性を抽出する。 Next, in step S506, the semantic attribute features of the synonym determination target pair (predicate pair or “term-predicate pair”) are extracted in the same manner as in steps S400, S402, and S404.

Next, in step S507, the functional expression feature of the synonym determination target pair (predicate pair or “term-predicate” pair) is extracted in the same manner as in steps S600, S602, S604, and S606.

次に、ステップＳ５０８において、ステップＳ５０２において得られた分布類似度と、ステップＳ５０４において得られた辞書定義文素性と、ステップＳ５０６において得られた意味属性素性と、ステップＳ５０７において得られた機能表現素性とに基づいて、同義判定対象のペア（述部ペア又は「項−述部」ペア）の素性を作成する。 Next, in step S508, the distribution similarity obtained in step S502, the dictionary definition sentence feature obtained in step S504, the semantic attribute feature obtained in step S506, and the functional expression feature obtained in step S507. Based on the above, a feature of a synonym determination target pair (predicate pair or “term-predicate” pair) is created.

Next, in step S510, it is determined whether the synonym determination target pair (predicate pair or “term-predicate” pair) is synonymous, based on the features created in step S508 and the synonym determination model stored in the synonym determination model storage unit 232.

次に、ステップＳ５１２において、ステップＳ５１０において同義判定された結果を出力部２５０により出力して処理を終了する。 Next, in step S512, the result determined synonymously in step S510 is output by the output unit 250, and the process ends.
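The determination in step S510 can be sketched with a linear decision function. The embodiment learns the model with an SVM; the sketch below only shows how a learned linear model of that kind could be applied, and is not the implementation of the embodiment. The function name, the feature names, the weights, the bias, and the feature values are all hypothetical and are not taken from the figures.

```python
from typing import Dict


def is_synonymous(features: Dict[str, float], weights: Dict[str, float], bias: float) -> bool:
    """Apply a linear decision function: synonymous if w . x + b >= 0.

    Features that have no learned weight contribute nothing to the score.
    """
    score = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return score >= 0.0


# Hypothetical model parameters, for illustration only.
weights = {"pred_dist_sim": 1.0, "Pred1Match": 2.0, "Pred2Match": 2.0}
bias = -1.5

# A pair with high distribution similarity but no support from the definitions.
antonym_like = {"pred_dist_sim": 0.5, "Pred1Match": 0.0, "Pred2Match": 0.0}
# A pair whose definition sentences refer to each other.
synonym_like = {"pred_dist_sim": 0.5, "Pred1Match": 1.0, "Pred2Match": 1.0}

antonym_result = is_synonymous(antonym_like, weights, bias)
synonym_result = is_synonymous(synonym_like, weights, bias)
```

With these made-up numbers the first pair scores -1.0 and is rejected despite its high distribution similarity, while the second scores 3.0 and is accepted, which mirrors the role the additional features are meant to play.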

<Example of synonym determination results>
An example of executing the synonym determination processing routine will be described with reference to FIGS. 31 to 34. FIG. 31 shows the synonym determination result, and the list of features for the “term-predicate” pair, when the “term-predicate” pair “shelf-wo-install” and “shelf-wo-remove” is input.

In the example of FIG. 31, the pair is determined to be “not synonymous”. The predicates are in an antonymous relationship, which was difficult to determine when only the distribution similarity, the method of Non-Patent Document 1, was used. As the feature list created for this “term-predicate” pair shows, for “shelf-wo-install” and “shelf-wo-remove” both the calculated predicate distribution similarity and the calculated term predicate distribution similarity are high (a distribution similarity of 0.2 or more is usually taken as the threshold indicating “synonymous”). However, because the extracted dictionary definition sentences show neither vocabulary overlap nor definition sentence mutual complementarity, the pair is correctly determined to be “not synonymous”.

Next, FIG. 32 shows the list of features created when the “term-predicate” pair “text-wo-create” and “text-wo-use” is input. For the “term-predicate” pair of FIG. 32, the calculated term predicate distribution similarity is relatively high, but the pair is correctly determined to be “not synonymous” because the predicate distribution similarity is low, the extracted dictionary definition sentences show neither vocabulary overlap nor definition sentence mutual complementarity, and the extracted semantic attribute features do not overlap.

次に、図３３に「花壇−ガ−出来上がる」と「花壇−ガ−完成する」との「項−述部」ペアが入力された場合に、作成される素性の一覧示す。図３３の場合、算出された述部分布類似度及び項述部分布類似度に加え、本発明で提案している抽出された辞書定義文素性と抽出された意味属性素性と機能表現素性が特徴となって、正しく「同義」と判定できる。 Next, FIG. 33 shows a list of features to be created when a “term-predicate” pair of “flowerbed-ga-finish” and “flowerbed-ga-complete” is input. In the case of FIG. 33, in addition to the calculated predicate distribution similarity and term predicate distribution similarity, the extracted dictionary definition sentence feature proposed in the present invention, the extracted semantic attribute feature, and the functional expression feature are characterized. Thus, it can be correctly determined as “synonymous”.

Next, FIG. 34 shows the list of features created when the “term-predicate” pair “supporter-wo-recruiting” (サポーター-ヲ-募っている) and “supporter-wo-recruiting” (サポーター-ヲ-募集している) is input. In the case of FIG. 34, in addition to the calculated predicate distribution similarity, the extracted dictionary definition sentence feature, semantic attribute feature, and functional expression feature proposed in the present invention serve as cues, and the pair is correctly determined to be “synonymous”. In this example, if the synonym determination model were learned without the functional expression feature and the determination were made with that model, the pair would be erroneously determined to be “not synonymous”.

上記のように、「同義の述部」に関する複数の言語的特徴を組み込むことで、同義の述部は正しく同義と、それ以外の述部は正しく「同義ではない」と判定できるようになる。 As described above, by incorporating a plurality of linguistic features related to “synonymous predicates”, it becomes possible to determine that synonymous predicates are correctly synonymous and other predicates are correctly “not synonymous”.

Note that although the features in FIG. 33 are the same as those of the learning model, the determination result was obtained with a synonym determination model learned from the correct corpus with that predicate pair excluded. In the case of FIG. 33, in addition to the predicate distribution similarity and the term predicate distribution similarity, the dictionary definition sentence feature, semantic attribute feature, and functional expression feature proposed in the present invention serve as cues, and the pair is correctly determined to be “synonymous”.

As described above, the synonym determination device according to the embodiment of the present invention focuses on predicates and can thereby automatically and accurately determine whether predicate pairs or “term-predicate” pairs that differ in surface form but express the same thing are synonymous.

In addition, the device uses the “vocabulary overlap” and “definition sentence mutual complementarity” between the dictionary definition sentence sets of the predicates of the target predicate pair or “term-predicate” pair, the “overlapping predicate attributes”, a feature indicating the overlap of the abstract semantic attributes of the predicates, and the “semantic attribute weighted overlap rate”, which reflects the hierarchy information of the overlapping predicate attributes in the determination. By determining synonymy from these multiple linguistic features, the determination can be made automatically and accurately even for predicate pairs or “term-predicate” pairs that involve antonymous relationships or temporal-progression relationships.

Furthermore, the “overlapping semantic labels” and “semantic label overlap rate” of the functional expressions of the predicates of the target predicate pair or “term-predicate” pair are used as features in addition to the “vocabulary overlap”, “definition sentence mutual complementarity”, “overlapping predicate attributes”, “semantic attribute weighted overlap rate”, and “distribution similarity”. This makes it possible to take the meaning of functional expressions into account without computing distribution similarity over enormous amounts of data, so that synonym determination can be made automatically and accurately.

Moreover, according to the present embodiment, in a synonym determination method in which a computer determines whether predicate pairs and “term-predicate” pairs that differ in surface form but express the same thing are synonymous, using multiple linguistic features as features enables more accurate synonym determination for predicate pairs and “term-predicate” pairs. As a result, text mining technology that extracts, aggregates, and presents only the important information from a large amount of text can correctly count the same event even when its surface forms differ.

Likewise, because multiple linguistic features are used as features, synonym determination for predicate pairs and “term-predicate” pairs that differ in surface form but express the same thing becomes more accurate. As a result, search technology that looks for the information a user wants can return text expressing the same thing even when the query uses a different character string, which improves search accuracy.

In the present embodiment, the linguistic characteristic that the predicates of a synonymous predicate pair or “term-predicate” pair also have similar word senses is turned into features through two properties of dictionary definition sentences, “definition sentence mutual complementarity” and “vocabulary overlap”. As a result, “antonymous predicates”, which were difficult to determine with the existing distribution similarity method, can now be determined correctly.

Similarly, the linguistic characteristic that the predicates of a synonymous predicate pair or “term-predicate” pair also have similar abstract semantic attributes is captured by the semantic attribute feature. Specifically, the property that “the more attributes two predicates share, the more similar they are” and the property that “the more detailed the attributes two predicates share, the more similar they are” are turned into the features “overlapping predicate attributes” and “semantic attribute weighted overlap rate”, respectively. As a result, “predicates expressing temporal progression”, which were difficult to determine with the existing distribution similarity method, can now be correctly determined to be “not synonymous”.

また、「同義の述部」に関する複数の言語的特徴を素性として同義判定モデルを学習することで、既存の手法よりもより正確に同義を判定できるようになった。 Moreover, by learning a synonym determination model using a plurality of linguistic features related to “synonymous predicates” as features, synonyms can be determined more accurately than existing methods.

In the above embodiment, all seven features are extracted: the distribution similarity, the feature indicating definition sentence mutual complementarity, the feature indicating vocabulary overlap, the overlapping predicate attributes, the semantic attribute weighted overlap rate, the overlapping semantic labels, and the semantic label overlap rate. However, the invention is not limited to this; only at least one of the feature indicating definition sentence mutual complementarity and the feature indicating semantic attribute overlap may be extracted. To further improve accuracy, at least one of the distribution similarity, the feature indicating vocabulary overlap, the overlapping predicate attributes, the semantic attribute weighted overlap rate, the overlapping semantic labels, and the semantic label overlap rate may additionally be extracted.

In the above embodiment, the feature vector construction device 100 constructs feature vectors by the same method as in Non-Patent Document 1; however, the invention is not limited to this, and other vector construction methods may be used. (Non-Patent Document 4: Manning, C., & Schutze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.)

また、上記の実施の形態では、分布類似度計算部１３２において、素性ベクトル記憶部１２４から入力された素性ベクトルを用いて、分布類似度を算出しているが、これに限定されるものではなく、他の類似度計算の手法を用いてもよい。(非特許文献４：Manning, C., & Schutze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.) In the above embodiment, the distribution similarity calculation unit 132 calculates the distribution similarity using the feature vector input from the feature vector storage unit 124. However, the present invention is not limited to this. Other similarity calculation methods may be used. (Non-Patent Document 4: Manning, C., & Schutze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.)

また、上記の実施の形態では、抽象的な意味属性として、「用言属性」を用いているが、これに限定されるものではなく、ＬＣＳ構造などを言語リソースとして用いてもよい。(非特許文献５：竹内孔一，乾健太郎，藤田篤(2006).語彙概念構造に基づく日本語動詞の統語・意味特性の記述，レキシコンフォーラム，No.2, pp.85-120.) In the above embodiment, the “predicate attribute” is used as an abstract semantic attribute. However, the present invention is not limited to this, and an LCS structure or the like may be used as a language resource. (Non-Patent Document 5: Koichi Takeuchi, Kentaro Inui, Atsushi Fujita (2006). Description of syntactic and semantic characteristics of Japanese verbs based on lexical concept structure, Lexicon Forum, No.2, pp.85-120.)

また、上記の実施の形態では、同義判定モデルの学習にはＳＶＭを用いているが、これに限定されるものではなく、ＤｅｃｉｓｉｏｎＴｒｅｅなど別の判定モデルを用いてもよい。 In the above embodiment, SVM is used for learning the synonymous determination model. However, the present invention is not limited to this, and another determination model such as Decision Tree may be used.

また、本発明は、上記実施の形態に限定されるものではなく、この発明の要旨を逸脱しない範囲内で様々な変形や応用が可能である。 The present invention is not limited to the above-described embodiment, and various modifications and applications are possible without departing from the gist of the present invention.

The synonym determination device 300 described above contains a computer system. Where a WWW system is used, the term "computer system" also includes the homepage providing environment (or display environment).

Although this specification describes an embodiment in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium or provided via a network. Each unit of the synonym determination device 300 of this embodiment may also be implemented in hardware. The databases that store the correct-answer corpus, text corpus, feature vectors, definition sentence dictionary, semantic attribute dictionary, functional expression dictionary, feature sets, and determination model can be realized by storage means such as a hard disk device or a file server, and may be provided inside the synonym determination device or in an external device.

10, 110, 210 Input unit
20, 120, 220 Calculation unit
24 Basic analysis unit
26 Feature extraction unit
28 Feature vector generation unit
30, 150, 250 Output unit
100 Feature vector construction device
122 Correct-answer corpus
124 Feature vector storage unit
132 Distribution similarity calculation unit
134 Definition sentence dictionary storage unit
136 Dictionary definition sentence feature extraction unit
138 Semantic attribute dictionary storage unit
139 Functional expression dictionary storage unit
140 Semantic attribute feature extraction unit
141 Functional expression feature extraction unit
142 Feature set storage unit
144 Synonym determination model learning unit
200 Synonym learning device
222 Feature construction unit
224 Feature vector storage unit
226 Definition sentence dictionary storage unit
228 Semantic attribute dictionary storage unit
229 Functional expression dictionary storage unit
230 Synonym determination unit
232 Synonym determination model storage unit
300 Synonym determination device
1360 Definition sentence extraction unit
1362 Definition sentence mutual complementarity extraction unit
1364 Vocabulary overlap extraction unit
1400 Semantic attribute overlap extraction unit
1402 Semantic attribute weighted overlap ratio calculation unit
1500 Semantic label assignment unit
1502 Overlapping semantic label extraction unit
1504 Semantic label overlap ratio calculation unit

Claims

A synonym determination device comprising:
a feature construction unit that extracts at least one of a first feature and a second feature, the first feature being whether the other predicate of an input predicate pair appears in the definition sentence of each predicate of the pair, extracted based on the definition sentences of the respective predicates obtained from a definition sentence set consisting of definition sentences for each of a plurality of predicates prepared in advance, and the second feature being a semantic attribute common to the predicate pair, extracted based on the semantic attributes of the respective predicates of the input predicate pair obtained from a semantic attribute set consisting of semantic attributes for each of a plurality of predicates prepared in advance;
a synonym determination model storage unit that stores a synonym determination model obtained in advance; and
a synonym determination unit that determines whether the input predicate pair is synonymous, based on the features extracted by the feature construction unit and on the synonym determination model stored in the synonym determination model storage unit.

The synonym determination device according to claim 1, wherein the feature construction unit extracts at least one of the first feature and the second feature and further extracts at least one of third to seventh features, where:
the third feature is the number of vocabulary items appearing in both of the definition sentences of the respective predicates of the predicate pair;
the fourth feature is a degree of overlap of the semantic attributes of the predicate pair, weighted according to the level of detail of the semantic attributes common to the predicate pair, the common semantic attributes being extracted based on the semantic attributes of the respective predicates of the input predicate pair;
the fifth feature is a distribution similarity obtained by comparing, for each predicate of the input predicate pair, the words appearing around that predicate in a text corpus;
the sixth feature is a semantic label common to the predicate pair, extracted based on the semantic labels of the functional expressions of the respective predicates of the input predicate pair obtained from a semantic label set consisting of semantic labels of the functional expressions of each of a plurality of predicates prepared in advance; and
the seventh feature is a degree of overlap of the semantic labels common to the predicate pair.

A synonym determination device comprising:
a feature construction unit that extracts at least one of a first feature and a second feature, the first feature being, out of (a) whether the other predicate of an input "argument-predicate" pair appears in the definition sentence of each predicate of the pair and (b) whether the other argument of the "argument-predicate" pair appears in the definition sentence of each predicate of the pair, at least (a), extracted based on the definition sentences of the respective predicates of the input "argument-predicate" pair obtained from a definition sentence set consisting of definition sentences for each of a plurality of predicates prepared in advance, and the second feature being a semantic attribute common to the "argument-predicate" pair, extracted based on the semantic attributes of the respective predicates of the input "argument-predicate" pair obtained from a semantic attribute set consisting of semantic attributes for each of a plurality of predicates prepared in advance;
a synonym determination model storage unit that stores a synonym determination model obtained in advance; and
a synonym determination unit that determines whether the input "argument-predicate" pair is synonymous, based on the features extracted by the feature construction unit and on the synonym determination model stored in the synonym determination model storage unit.

The synonym determination device according to claim 3, wherein the feature construction unit extracts at least one of the first feature and the second feature and further extracts at least one of third to seventh features, where:
the third feature is the number of vocabulary items appearing in both of the definition sentences of the respective predicates of the "argument-predicate" pair;
the fourth feature is a degree of overlap of the semantic attributes of the "argument-predicate" pair, weighted according to the level of detail of the semantic attributes common to the "argument-predicate" pair, the common semantic attributes being extracted based on the semantic attributes of the respective predicates of the input "argument-predicate" pair;
the fifth feature is, out of (a) a distribution similarity obtained by comparing, for each "argument-predicate" of the input "argument-predicate" pair, the words appearing around that "argument-predicate" in a text corpus and (b) a distribution similarity obtained by comparing, for each predicate of the "argument-predicate" pair, the words appearing around that predicate in the text corpus, at least (a);
the sixth feature is a semantic label common to the "argument-predicate" pair, extracted based on the semantic labels of the functional expressions of the respective predicates of the input "argument-predicate" pair obtained from a semantic label set consisting of semantic labels of the functional expressions of each of a plurality of predicates prepared in advance; and
the seventh feature is a degree of overlap of the semantic labels common to the "argument-predicate" pair.

A synonym learning device comprising:
a feature construction unit that extracts, for each of a plurality of predicate pairs prepared in advance and each given information on whether it is synonymous, at least one of a first feature and a second feature, the first feature being whether the other predicate of the predicate pair appears in the definition sentence of each predicate of the pair, extracted based on the definition sentences of the respective predicates of the predicate pair obtained from a definition sentence set consisting of definition sentences for each of a plurality of predicates prepared in advance, and the second feature being a semantic attribute common to the semantic attributes of the respective predicates of the predicate pair, obtained from a semantic attribute set consisting of semantic attributes for each of a plurality of predicates prepared in advance; and
a synonym determination model learning unit that learns a synonym determination model based on at least one of the first feature and the second feature extracted for the plurality of predicate pairs by the feature construction unit and on the information on whether each of the plurality of predicate pairs is synonymous.

A synonym learning device comprising:
a feature construction unit that extracts, for each of a plurality of "argument-predicate" pairs prepared in advance and each given information on whether it is synonymous, at least one of a first feature and a second feature, the first feature being, out of (a) whether the other predicate of the "argument-predicate" pair appears in the definition sentence of each predicate of the pair and (b) whether the other argument of the "argument-predicate" pair appears in the definition sentence of each predicate of the pair, at least (a), extracted based on the definition sentences of the respective predicates of the "argument-predicate" pair obtained from a definition sentence set consisting of definition sentences for each of a plurality of predicates prepared in advance, and the second feature being a semantic attribute common to the semantic attributes of the respective predicates of the "argument-predicate" pair, obtained from a semantic attribute set consisting of semantic attributes for each of a plurality of predicates prepared in advance; and
a synonym determination model learning unit that learns a synonym determination model based on at least one of the first feature and the second feature extracted for the plurality of "argument-predicate" pairs by the feature construction unit and on the information on whether each of the plurality of "argument-predicate" pairs is synonymous.

A program for causing a computer to function as each unit constituting the synonym determination device according to any one of claims 1 to 4 or the synonym learning device according to claim 5 or 6.