JP2007219947A

JP2007219947A - Causal relation knowledge extraction device and program

Info

Publication number: JP2007219947A
Application number: JP2006041281A
Authority: JP
Inventors: Ichiro Yamada (山田一郎); Kikuka Miura (三浦菊佳); Hideki Sumiyoshi (住吉英樹); Nobuyuki Yagi (八木伸行); Takeshi Kobayakawa (小早川健)
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2006-02-17
Filing date: 2006-02-17
Publication date: 2007-08-30

Abstract

PROBLEM TO BE SOLVED: To provide a causal relation knowledge extraction device and program capable of extracting a causal relation between nouns (a noun pair) even when no clue word appears or when the verbs in parallel clauses do not share a common object.

SOLUTION: A noun pair extraction part 2 extracts noun pairs from text data, and a feature extraction part 3 extracts features of the syntactic structure of the text data containing each noun pair and features of the attributes of the noun pair to generate a triplet. A tagging part 4 stores unsupervised (unlabeled) data, in which the presence or absence of a causal relation is not indicated, and supervised (labeled) data, in which it is indicated. A machine learning part 5 uses the stored data to judge the causal relation and stores the result. A text analysis interface part 6 uses the stored result to determine whether a noun pair in the text data under analysis has a causal relation.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to information extraction and natural language processing for digitized text data, and in particular to a technique that uses natural language processing to extract, from text data, noun pairs that have a causal relationship and syntactic structures that express a causal relationship.

At present, digital broadcasting services multiplex and broadcast large amounts of text data, such as data broadcasts and closed captions. If a device receiving digital broadcasts could continuously monitor this text data and extract and accumulate useful information, it should become possible to realize a television that answers viewers' questions. Against this background, research is under way on automatically extracting relationships between words such as "cause and result" (hereinafter, causal relationships).

For example, Non-Patent Document 1 proposes a causal relationship extraction method for Japanese text that uses the word 「ため」 (tame, "because") as a clue word. Non-Patent Document 2 proposes a method that extracts causal relationships statistically, on the assumption that a causal relationship tends to hold when parallel clauses appear in one sentence and the verbs in the parallel clauses share a common object.

Non-Patent Document 1: Takashi Inui, Kentaro Inui, Yuji Matsumoto, "Automatic acquisition of causal knowledge from document collections based on the connective marker 'tame'" (接続標識「ため」に基づく文書集合からの因果関係知識の自動獲得), IPSJ Journal (情処論), Vol. 45, No. 3, pp. 919-933 (2004)
Non-Patent Document 2: Kentaro Torisawa, "Automatic extraction of 'commonsense' inference rules from corpora" (「常識的」推論規則のコーパスからの自動抽出), 9th Annual Meeting of the Association for Natural Language Processing, pp. 318-321 (2003)

However, the method of Non-Patent Document 1 cannot extract a causal relationship when no clue word appears. Likewise, the method of Non-Patent Document 2 cannot extract a causal relationship when the verbs in the parallel clauses do not share a common object.

The present invention has been made to solve these problems. Its object is to provide a causal relationship knowledge extraction device and program capable of extracting causal relationships between nouns (noun pairs) even when no clue word appears or when the verbs in parallel clauses do not share a common object.

To solve the above problems, the causal relationship knowledge extraction device according to the present invention is a device that extracts, as knowledge, the presence or absence of a causal relationship for noun pairs contained in text data, and comprises: a noun pair extraction unit that extracts noun pairs contained in the text data; a feature extraction unit that, for text data containing a noun pair extracted by the noun pair extraction unit, extracts the syntactic structure of the text data and the attributes of the noun pair as feature data; a tagging unit that attaches, to a part of the feature data extracted by the feature extraction unit, a tag indicating the presence or absence of a causal relationship in the noun pair; a storage unit in which feature data tagged by the tagging unit is stored as supervised data and feature data not tagged by the tagging unit is stored as unsupervised data; and a causal relationship determination unit that determines, based on the supervised data and the unsupervised data stored in the storage unit, the presence or absence of a causal relationship in the noun pairs contained in the unsupervised data.

Further, in the causal relationship knowledge extraction device according to the present invention, the feature extraction unit comprises: tree structure generation means that generates, for text data containing a noun pair, a tree structure representing the dependency relations of the noun pair; PSE generation means that separates the nodes of the tree structure generated by the tree structure generation means into independent words and function words and generates a PSE (Preorder String Expression) representing the syntactic structure of the text data; superordinate concept extraction means that extracts the superordinate concepts of the noun pair; and triplet generation means that generates a triplet as feature data from the PSE generated by the PSE generation means and the superordinate concepts extracted by the superordinate concept extraction means.

Preferably, the causal relationship determination unit uses the EM algorithm to: calculate, for the supervised data stored in the storage unit, the probability of the presence or absence of a causal relationship in that data; calculate, for the supervised and unsupervised data stored in the storage unit, the probability that a syntactic structure that is identical or has a similarity at or above a predetermined value appears; calculate, for the supervised and unsupervised data stored in the storage unit, the probability that the same attributes as those of the noun pair contained in the data appear; and, based on these probabilities, calculate the probability of the presence or absence of a causal relationship for the feature data, for the syntactic structure contained in the feature data, and for the noun pair contained in the feature data, thereby determining the presence or absence of a causal relationship.

The causal relationship knowledge extraction device according to the present invention further comprises a text determination unit that receives new text data, extracts the noun pairs contained in it, extracts the syntactic structure of the text data and the attributes of the noun pairs as feature data, and determines the presence or absence of a causal relationship in the noun pairs contained in the new text data based on the probabilities of the presence or absence of a causal relationship calculated by the causal relationship determination unit.

Preferably, the text determination unit uses the extracted feature data to calculate the probability of the presence or absence of a causal relationship for the syntactic structure contained in the extracted feature data, based on the syntactic-structure probabilities calculated by the causal relationship determination unit, and determines the presence or absence of a causal relationship from that probability.

Also preferably, the text determination unit uses the extracted feature data to calculate the probability of the presence or absence of a causal relationship for the noun pair contained in the extracted feature data, based on the noun-pair probabilities calculated by the causal relationship determination unit, and determines the presence or absence of a causal relationship from that probability.

Although the present invention has been described as a causal relationship knowledge extraction device, it can also be realized as a program executed by the computer that constitutes the device, and the invention therefore also covers a causal relationship knowledge extraction program. That is, the causal relationship knowledge extraction program according to the present invention is a program for a device that extracts the presence or absence of a causal relationship for noun pairs contained in text data, and causes the computer constituting the device to execute: a process of extracting the noun pairs contained in the text data; a process of extracting, as feature data, the syntactic structure of the text data containing a noun pair and the attributes of the noun pair; a process of attaching, to a part of the extracted feature data, a tag indicating the presence or absence of a causal relationship in the noun pair; and a process of determining the presence or absence of a causal relationship in the noun pairs contained in the unsupervised data, based on supervised data (the tagged feature data) and unsupervised data (the untagged feature data).

As described above, the present invention makes it possible to extract causal relationships in noun pairs even when the text data contains no clue word or when the verbs in parallel clauses do not share a common object.

The best mode for carrying out the present invention is described in detail below with reference to the drawings.
[Configuration]
First, the configuration of the causal relationship knowledge extraction device according to an embodiment of the present invention is described. FIG. 1 is a block diagram showing the configuration of the device. The causal relationship knowledge extraction device 1 comprises a noun pair extraction unit 2, a feature extraction unit 3, a tagging unit 4, a machine learning unit 5, a text analysis interface unit 6, and storage units 7-1 to 7-4. Working from the text data held in storage unit 7-1, the noun pair extraction unit 2, the feature extraction unit 3, and the tagging unit 4 store unsupervised data, for which the presence or absence of a causal relationship is not indicated, in storage unit 7-2, and supervised data, for which it is indicated, in storage unit 7-3. The machine learning unit 5 judges the causal relationships for the unsupervised data and stores output data, including the resulting probabilities, in storage unit 7-4. The text analysis interface unit 6 uses the output data stored in storage unit 7-4 to determine the presence or absence of a causal relationship in the text data to be analyzed.

The noun pair extraction unit 2 reads text data from storage unit 7-1, splits it into morphemes by morphological analysis, and extracts pairs of nouns. Since extracting noun pairs by morphological analysis is a known technique, its description is omitted here.

For each noun pair extracted by the noun pair extraction unit 2, the feature extraction unit 3 extracts structural (syntactic) features of the text data containing the noun pair and features of the attributes of the noun pair. FIG. 2 is a block diagram showing the configuration of the feature extraction unit 3 shown in FIG. 1. The feature extraction unit 3 comprises tree structure generation means 31, superordinate concept extraction means 32, PSE (Preorder String Expression) generation means 33, triplet generation means 34, and a thesaurus storage unit 35. The tree structure generation means 31 generates a tree structure for each sentence of text data containing a noun pair, based on the result of syntactic analysis. Since this syntactic analysis method is known, its description is omitted here; for details, see Kudo et al., "Dependency analysis by stepwise application of chunking" (チャンキングの段階適用による係り受け解析), IPSJ Journal (情処論), Vol. 43, No. 6, pp. 1834-1842 (2002). The PSE generation means 33 generates, from the tree structure produced by the tree structure generation means 31, a PSE that can represent that tree structure. The superordinate concept extraction means 32 searches the thesaurus storage unit 35 and extracts the superordinate concept of each noun of the noun pair. The triplet generation means 34 uses the PSE generated by the PSE generation means 33 and the superordinate concepts extracted by the superordinate concept extraction means 32 to generate a triplet that represents the structural features of the text data and the attribute features of the noun pair.

In response to user operation, the tagging unit 4 selects, from the triplets generated by the feature extraction unit 3, those for which the presence or absence of a causal relationship is to be specified, tags each selected triplet with the presence or absence of a causal relationship, and stores it in storage unit 7-3 as supervised data. Triplets not tagged by the tagging unit 4 are stored in storage unit 7-2 as unsupervised data. The triplets are thus divided into unsupervised and supervised data; the unsupervised data is large in volume, while only a small amount of supervised data exists.

The machine learning unit 5 reads the large volume of unsupervised data, in which causal relationships are not specified, from storage unit 7-2 and the small volume of supervised data, in which they are specified, from storage unit 7-3. Based on the structural features of the text data and the attribute features of the noun pairs in the supervised data, it calculates quantities such as the probability that each triplet of the supervised and unsupervised data has a causal relationship, and stores them in storage unit 7-4 as output data.

The text analysis interface unit 6 receives the text data to be analyzed and uses the output data stored in storage unit 7-4 to determine, among other things, whether the text contains a noun pair that has a causal relationship.

[Operation]
Next, the operation of the causal relationship knowledge extraction device 1 shown in FIG. 1 is described. First, the noun pair extraction unit 2 splits the text data read from storage unit 7-1 into morphemes by morphological analysis and extracts noun pairs. Here, every combination of two nouns appearing in one sentence is treated as a noun pair.
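The pairing rule above (every combination of nouns in one sentence) can be sketched as follows. This is a minimal illustration that assumes morphological analysis has already produced the list of nouns for a sentence; the function name `noun_pairs` is hypothetical and not from the patent.

```python
from itertools import combinations


def noun_pairs(nouns):
    """Return every pair of nouns from one sentence, keeping order of appearance.

    `nouns` is assumed to be the list of nouns that a morphological analyzer
    found in a single sentence.
    """
    return list(combinations(nouns, 2))


# Nouns of the example sentence used later in the description.
pairs = noun_pairs(["動脈硬化", "脳卒中"])
```

A sentence with n nouns yields n(n-1)/2 candidate pairs, so most candidates will not be causal; deciding which ones are is left to the later learning stage.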

The feature extraction unit 3 then performs syntactic analysis for the noun pairs extracted by the noun pair extraction unit 2, analyzing the dependency relations between the phrases (bunsetsu) of each sentence of text data that contains a noun pair. Since the syntactic analysis method is known, its description is omitted here. From the analysis result, the feature extraction unit 3 extracts the syntactic structure in which the two nouns of the pair are located. For example, in the sentence 「動脈硬化が起きると脳卒中につながります。」 ("When arteriosclerosis occurs, it leads to stroke."), the dependency relation between the noun pair 動脈硬化 (arteriosclerosis) and 脳卒中 (stroke) can be represented by the tree structure shown in FIG. 3, whose nodes are phrases. In FIG. 3, each boxed node modifies its parent; for example, 「動脈硬化が」 modifies 「起きると」. Similarly, the tree structure of FIG. 4 shows, for the sentence 「隠れ肥満が糖尿病につながります。」 ("Hidden obesity leads to diabetes.") and the noun pair 肥満 (obesity) and 糖尿病 (diabetes), the dependency relation between the phrase containing 肥満 and the phrase containing 糖尿病. In other words, the tree structure generation means 31 of the feature extraction unit 3 generates tree structures such as those of FIG. 3 and FIG. 4 from the syntactic analysis result of a sentence containing a noun pair.

The PSE generation means 33 of the feature extraction unit 3 then generates a PSE from the tree structure. In doing so, it separates each node of the tree structure into an independent word (noun, verb, adjective, adverb, adjectival noun, conjunction, etc.) and a function word (particle, auxiliary verb, etc.). For example, the node 「動脈硬化が」 shown in FIG. 3 is separated into the independent word 「動脈硬化」 and the function word 「が」. The separated function word is inserted into the tree structure as the parent of the original node. FIG. 5 shows the new tree structure obtained by separating the function words from the tree structure of FIG. 3, that is, a tree structure whose nodes are independent words and function words. The PSE generation means 33 generates a PSE from such a tree structure with separated function words, and then replaces the two nouns of the noun pair in the generated PSE with 「名詞１」 (noun 1) and 「名詞２」 (noun 2). Since the method of generating a PSE from a tree structure is known, its description is omitted here; for details, see Fabrizio Luccio et al., "Exact Rooted Subtree Matching in Sublinear Time", Technical Report TR-01-14 (2001).

For example, the PSE generation means 33 separates the function words from the tree structure of FIG. 3 generated by the tree structure generation means 31, producing the new tree structure of FIG. 5, and generates the following PSE from it. (Glosses: つながる "lead to", と "when", 起きる "occur", が subject particle, 動脈硬化 "arteriosclerosis", に "to", 脳卒中 "stroke".)
PSE = {"つながる", "と", "起きる", "が", "動脈硬化", 0, 0, 0, 0, "に", "脳卒中", 0, 0, 0}
The PSE generation means 33 then replaces the nouns of the noun pair in the generated PSE with 「名詞１」 (noun 1) and 「名詞２」 (noun 2) in order of appearance. For the example above, this gives:
PSE = {"つながる", "と", "起きる", "が", "名詞１", 0, 0, 0, 0, "に", "名詞２", 0, 0, 0}
Similarly, separating the function words from the tree structure of FIG. 4, generating a PSE from the new tree structure, and replacing the nouns with 「名詞１」 and 「名詞２」 gives:
PSE = {"つながる", "が", "名詞１", 0, 0, "に", "名詞２", 0, 0, 0}
The original tree structure can be restored from a PSE using the word order and the "0" elements.
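The PSE construction described above can be sketched as follows. The sketch assumes that each "0" marks a return from a node to its parent in a preorder traversal, which reproduces the example PSEs in the text; the tree encoding and all function names (`split_function_words`, `to_pse`, `from_pse`, `mask_nouns`) are hypothetical choices for illustration, not the patent's implementation.

```python
def split_function_words(node):
    """Convert a phrase node (content, function, children) into a (label, children)
    tree in which each function word becomes the parent of its content word."""
    content, function, children = node
    inner = (content, [split_function_words(child) for child in children])
    return (function, [inner]) if function else inner


def to_pse(tree):
    """Preorder traversal; a 0 is emitted each time we return from a node."""
    label, children = tree
    out = [label]
    for child in children:
        out.extend(to_pse(child))
    out.append(0)
    return out


def from_pse(pse):
    """Rebuild the (label, children) tree from a PSE."""
    stack = [("", [])]                 # sentinel above the real root
    for item in pse:
        if item == 0:
            stack.pop()                # go back up to the parent
        else:
            node = (item, [])
            stack[-1][1].append(node)  # attach under the current node
            stack.append(node)
    return stack[0][1][0]


def mask_nouns(pse, noun_pair):
    """Replace the two nouns with 名詞1 / 名詞2 in order of appearance."""
    order = [x for x in pse if x in noun_pair]
    names = {noun: f"名詞{i}" for i, noun in enumerate(order, start=1)}
    return [names.get(x, x) for x in pse]


# Phrase-level tree for the first example sentence (cf. FIG. 3).
fig3 = ("つながる", None, [("起きる", "と", [("動脈硬化", "が", [])]),
                         ("脳卒中", "に", [])])
tree = split_function_words(fig3)
pse = to_pse(tree)
masked = mask_nouns(pse, ("動脈硬化", "脳卒中"))
```

Because every node contributes exactly one label and one `0`, `from_pse(to_pse(tree))` returns the same tree, which is the restoration property stated in the text.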

Returning to FIG. 3, the superordinate concept extraction means 32 extracts the superordinate-concept features of the nouns. Specifically, for each noun of the noun pair, it searches an existing resource such as a classified vocabulary table (thesaurus) stored in the thesaurus storage unit 35 and extracts the noun's superordinate concept. If the superordinate concept can be determined uniquely in the thesaurus, that concept is extracted; if it cannot be determined uniquely, for example because the noun has multiple attributes, the surface form of the noun itself is extracted as the superordinate concept. If no superordinate concept exists, none can be extracted, so the noun itself is likewise treated as the superordinate concept. For example, the nouns 「脳卒中」 (stroke) and 「動脈硬化」 (arteriosclerosis) both have the superordinate concept 「病気・体調」 (disease / physical condition), so the superordinate concept extraction means 32 extracts 「病気・体調」 for each of them.
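The lookup rule above, including its two fallback cases, can be sketched as follows. The thesaurus is modeled here as a plain dictionary from a noun to its list of superordinate concepts, which is an assumption; the entry for 「熱」 is invented purely to illustrate the ambiguous case.

```python
def superordinate(noun, thesaurus):
    """Return the noun's superordinate concept, or the noun itself when the
    concept is ambiguous or missing."""
    concepts = thesaurus.get(noun, [])
    if len(concepts) == 1:
        return concepts[0]   # uniquely determined
    return noun              # ambiguous or absent: fall back to the surface form


# Hypothetical thesaurus contents; only the first two entries follow the text.
thesaurus = {
    "脳卒中": ["病気・体調"],
    "動脈硬化": ["病気・体調"],
    "熱": ["病気・体調", "自然現象"],   # invented ambiguous entry
}
```

Falling back to the surface form keeps every noun usable as a feature while avoiding a guess among competing concepts.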

The triplet generation means 34 then generates a triplet from the PSE generated by the PSE generation means 33 and the superordinate concepts extracted by the superordinate concept extraction means 32. For example, for the noun pair 「動脈硬化」 and 「脳卒中」 in the sentence 「動脈硬化が起きると脳卒中につながります。」, the triplet is as follows.
Triplet = <{病気・体調}, {病気・体調}, {"つながる", "と", "起きる", "が", "名詞１", 0, 0, 0, 0, "に", "名詞２", 0, 0, 0}>
The triplet thus consists of {"つながる", "と", "起きる", "が", "名詞１", 0, 0, 0, 0, "に", "名詞２", 0, 0, 0}, which is the structural feature of the text data, and {病気・体調}, {病気・体調} (disease / physical condition), which are the attribute features of the noun pair.
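One possible in-memory representation of such a triplet is sketched below. The class name `Triplet` and its field names are hypothetical; the PSE is stored as a tuple so that triplets are hashable and can be counted or used as dictionary keys during learning.

```python
from typing import NamedTuple, Tuple, Union


class Triplet(NamedTuple):
    concept1: str                        # superordinate concept of noun 1
    concept2: str                        # superordinate concept of noun 2
    pse: Tuple[Union[str, int], ...]     # noun-masked PSE


example = Triplet(
    "病気・体調",
    "病気・体調",
    ("つながる", "と", "起きる", "が", "名詞1", 0, 0, 0, 0, "に", "名詞2", 0, 0, 0),
)
```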

In response to user operation, the tagging unit 4 then tags a part of the triplets generated by the feature extraction unit 3 according to whether the expression of the sentence for that noun pair has a causal relationship, and stores the tagged triplets in storage unit 7-3 as supervised data. Untagged triplets have already been stored in storage unit 7-2 as unsupervised data by the feature extraction unit 3. This yields a large amount of unsupervised data and a small amount of supervised data.

The machine learning unit 5 then reads the unsupervised data from storage unit 7-2 and the supervised data from storage unit 7-3, and calculates quantities such as the probability that each triplet of the unsupervised data has a causal relationship by machine learning with the EM algorithm, using equations (1) and (2). These probabilities make it possible to determine the presence or absence of a causal relationship. The probability that a triplet has a causal relationship and the probability that it does not are expressed by equation (1).

Here, t_i is a triplet, P(c_1 | t_i) is the probability that triplet t_i has a causal relationship, P(c_0 | t_i) is the probability that triplet t_i does not have a causal relationship, P(t_i) is the probability that triplet t_i appears, P(c_1) is the probability that a triplet with a causal relationship appears, P(c_0) is the probability that a triplet without a causal relationship appears, and P(t_i | c_j) is the probability that triplet t_i appears given class c_j. The class c_j (c_0 or c_1) with the larger probability P(c_j | t_i) from equation (1) is taken as the determination result. That is, when P(c_0 | t_i) > P(c_1 | t_i), triplet t_i is judged to have no causal relationship, and when P(c_0 | t_i) < P(c_1 | t_i), it is judged to have one. When the two probabilities are equal, a preset determination result is used.
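Equation (1) itself is not reproduced in this text. A form consistent with the symbols defined above is Bayes' rule; the following is a reconstruction under that assumption, not a quotation of the patent's equation, together with the decision rule stated in the text.

```latex
P(c_j \mid t_i) = \frac{P(c_j)\, P(t_i \mid c_j)}{P(t_i)}, \qquad j \in \{0, 1\}

\hat{c}(t_i) =
\begin{cases}
c_1 & \text{if } P(c_1 \mid t_i) > P(c_0 \mid t_i) \\
c_0 & \text{if } P(c_1 \mid t_i) < P(c_0 \mid t_i)
\end{cases}
```

Since P(t_i) is common to both classes, comparing P(c_1) P(t_i | c_1) with P(c_0) P(t_i | c_0) gives the same decision.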

The probability P(t_i | c_j) that a triplet appears given class c_j, used in equation (1), is expressed by equation (2).

Here, CP_ti is the syntactic structure between the two nouns contained in triplet t_i, that is, the syntactic structure of the text data; SP_ti is the noun pair contained in triplet t_i; P(CP_ti | c_j) is the probability that the syntactic structure CP_ti between the two nouns contained in triplet t_i appears given class c_j; and P(SP_ti | c_j) is the probability that the noun pair SP_ti contained in triplet t_i appears given class c_j.
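How the two conditional probabilities could combine into a class decision is sketched below. The sketch assumes that equation (2) is the product P(t_i | c_j) = P(CP_ti | c_j) P(SP_ti | c_j), a naive-Bayes-style independence assumption that is not confirmed by this text, and that equation (1) is Bayes' rule. The function names and the `tie_default` parameter are hypothetical.

```python
def posterior(prior, p_cp, p_sp):
    """Return [P(c_0 | t_i), P(c_1 | t_i)].

    prior[j] = P(c_j), p_cp[j] = P(CP_ti | c_j), p_sp[j] = P(SP_ti | c_j).
    Assumes P(t_i | c_j) = p_cp[j] * p_sp[j] (an assumption, see lead-in).
    """
    joint = [prior[j] * p_cp[j] * p_sp[j] for j in (0, 1)]
    total = sum(joint)            # P(t_i) by the law of total probability
    return [value / total for value in joint]


def has_causal_relation(post, tie_default=False):
    """Pick the class with the larger posterior; ties use a preset result."""
    if post[1] == post[0]:
        return tie_default
    return post[1] > post[0]


post = posterior(prior=[0.5, 0.5], p_cp=[0.2, 0.6], p_sp=[0.5, 0.5])
decision = has_causal_relation(post)
```

The sketch does not guard against all joint probabilities being zero; a smoothed estimate of the conditionals avoids that case.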

The machine learning unit 5 performs machine learning with the EM algorithm using equations (1) and (2). The EM algorithm is a procedure that, for incomplete data whose internal state is unknown, repeats learning so as to maximize the likelihood and thereby estimates the internal state. Since the EM algorithm is known, its description is omitted here; for details, see Kamal Nigam et al., "Text Classification from Labeled and Unlabeled Documents using EM", Machine Learning, Vol. 39, No. 2/3, pp. 103-134 (2000).
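A semi-supervised EM loop of the general kind described in the cited reference can be sketched as follows. This is a generic illustration, not the patent's update equations: it treats the syntactic structure (`cp`) and the noun-pair attributes (`sp`) as two categorical features assumed independent given the class, uses add-alpha smoothing, and does not weight counts by structural similarity. All names (`train_em`, `alpha`, `n_iter`) are hypothetical.

```python
def train_em(labeled, unlabeled, n_iter=10, alpha=1.0):
    """labeled: list of ((cp, sp), y) with y in {0, 1}; unlabeled: list of (cp, sp).

    Returns (params, posteriors), where posteriors[i] = [P(c_0|x_i), P(c_1|x_i)]
    for each unlabeled item x_i.
    """
    cps = {x[0] for x, _ in labeled} | {x[0] for x in unlabeled}
    sps = {x[1] for x, _ in labeled} | {x[1] for x in unlabeled}

    def estimate(items):
        # Re-estimate priors and conditionals from soft class memberships.
        mass = [sum(r[j] for _, r in items) for j in (0, 1)]
        prior = [(alpha + mass[j]) / (2 * alpha + len(items)) for j in (0, 1)]
        p_cp, p_sp = [], []
        for j in (0, 1):
            cp_count = {cp: alpha for cp in cps}
            sp_count = {sp: alpha for sp in sps}
            for (cp, sp), r in items:
                cp_count[cp] += r[j]
                sp_count[sp] += r[j]
            p_cp.append({k: v / (alpha * len(cps) + mass[j]) for k, v in cp_count.items()})
            p_sp.append({k: v / (alpha * len(sps) + mass[j]) for k, v in sp_count.items()})
        return prior, p_cp, p_sp

    def classify(params, x):
        # Posterior class membership of one item under the current parameters.
        prior, p_cp, p_sp = params
        joint = [prior[j] * p_cp[j][x[0]] * p_sp[j][x[1]] for j in (0, 1)]
        total = sum(joint)
        return [v / total for v in joint]

    fixed = [(x, [1.0 - y, float(y)]) for x, y in labeled]  # labels stay fixed
    params = estimate(fixed)                                # start from labeled data only
    for _ in range(n_iter):
        resp = [classify(params, x) for x in unlabeled]
        params = estimate(fixed + list(zip(unlabeled, resp)))
    return params, [classify(params, x) for x in unlabeled]


labeled = [(("A", "x"), 1), (("B", "y"), 0)]
unlabeled = [("A", "x"), ("B", "y"), ("A", "y")]
params, posteriors = train_em(labeled, unlabeled)
```

With `alpha > 0` every probability stays strictly positive, so the normalization in `classify` never divides by zero.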

Next, the processing of the machine learning unit 5 is described in detail. FIG. 6 is a flowchart illustrating this processing. First, for the supervised data read from storage unit 7-3, the machine learning unit 5 calculates the initial probability P(c_1 | t_i) of the class c_j to which triplet t_i belongs (c_1 if it has a causal relationship, c_0 if it does not) using the following equation (step S6-1). This step is the E step of the EM algorithm.

Next, for the set of all text data combining the unsupervised data read from the storage unit 7-2 and the supervised data read from the storage unit 7-3, the machine learning unit 5 calculates, by the following equations, the probability P(CP_ti | c_j) that CP_ti (the syntactic structure of the text data including the two nouns of the ternary set t_i) appears under class c_j, and the probability P(SP_ti | c_j) that SP_ti (the noun pair included in the ternary set t_i) appears under class c_j (step S6-2). This step is the M step of the EM algorithm.

Here, |CP| denotes the total number of syntactic structures of text data including noun pairs, |SP| the total number of noun pairs, and |T| the total number of ternary sets. N(SP, t_k) is a function indicating whether the ternary set t_k includes the noun pair, and takes the value 1 only when it does. sim'(CP_ti, CP_tk) denotes the similarity between the syntactic structures of text data including noun pairs, and is derived from the similarity calculated by equation (6) below. When that similarity is greater than 0.5, sim'(CP_ti, CP_tk) is set to sim(CP_ti, CP_tk), the result of equation (6); when it is 0.5 or less, sim'(CP_ti, CP_tk) is set to 0.

The similarity sim(P_1, P_2) between the syntactic structures of noun pairs mentioned above is calculated by the following equation.

wc(p_i) denotes the number of words appearing in p_i, which denotes a PSE, excluding the attributes of the two nouns of the noun pair and the element "0". com(P_1, P_2) denotes the number of those words that the two PSEs have in common, taking the PSE structure into account. For example, for the two PSEs P_1 and P_2 obtained by separating the function words from the tree structures of FIGS. 3 and 4, wc(p_1) = 5 and wc(p_2) = 3. Since the words "つながります" (tsunagarimasu, "leads to"), "が" (ga) and "に" (ni) are common to both, the number of common words is com(P_1, P_2) = 3, which gives sim(P_1, P_2) = 6/8 = 0.75.
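The worked example (com = 3, wc = 5 and 3, result 6/8) is consistent with a Dice-style ratio, 2 × com / (wc(p_1) + wc(p_2)). The sketch below assumes that form, since the equation image is not reproduced here, and takes the common-word count as an input rather than deriving it from the PSE structure. The function names are illustrative only.

```python
def similarity(common_words: int, wc1: int, wc2: int) -> float:
    """sim(P_1, P_2) under the assumed form 2 * com / (wc(p_1) + wc(p_2))."""
    total = wc1 + wc2
    if total == 0:
        return 0.0  # assumption: two empty PSEs are treated as dissimilar
    return 2.0 * common_words / total


def thresholded_similarity(sim: float, cutoff: float = 0.5) -> float:
    """sim': keep the similarity only when it is strictly greater than the cutoff."""
    return sim if sim > cutoff else 0.0


# Worked example from the text: wc(p_1) = 5, wc(p_2) = 3, com(P_1, P_2) = 3.
example = similarity(3, 5, 3)
print(example, thresholded_similarity(example))
```

Because the cutoff comparison is strict, a pair whose similarity is exactly 0.5 contributes nothing, matching the "0.5 or less" rule in the description.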

In step S6-2, after calculating the probability P(CP_ti | c_j) that the syntactic structure CP_ti appears and the probability P(SP_ti | c_j) that the noun pair SP_ti appears by equations (4) and (5), the machine learning unit 5 uses these results to calculate, by the following equation, the expected value of the probability P(c_j | t_i) that the ternary set t_i has or does not have a causal relationship (step S6-3).

Then, using the result of equation (7), the machine learning unit 5 calculates, by the following equation, the probability P(c_j) that a ternary set t_i having or not having a causal relationship appears (step S6-4).

Here, |c| denotes the number of classes to be classified, which is 2 in this case.
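The images for equations (7) and (8) are not reproduced in this text. A plausible form, assuming Bayes' rule for the class posterior and the add-one smoothed prior used by Nigam et al. (which would explain the appearance of |c| and |T|), is sketched below. It is a reading of the surrounding definitions, not a transcription of the original equations.

```latex
P(c_j \mid t_i) = \frac{P(c_j)\, P(t_i \mid c_j)}{\sum_{k} P(c_k)\, P(t_i \mid c_k)}
\qquad
P(c_j) = \frac{1 + \sum_{i=1}^{|T|} P(c_j \mid t_i)}{|c| + |T|}
```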

Then, using the result of equation (8), the machine learning unit 5 compares the amount of change in the probability P(c_j) that a ternary set t_i having or not having a causal relationship appears with a fixed threshold (for example, 1.0 × 10^-3) (step S6-5). When the amount of change in P(c_j) is equal to or greater than the threshold, the process returns to step S6-2 and, using the new probability P(c_j | t_i) calculated in step S6-3, recalculates the probability P(CP_ti | c_j) that a syntactic structure appears and the probability P(SP_ti | c_j) that a noun pair appears by equations (4) and (5) over the set of all text data combining the unsupervised and supervised data (step S6-2). Steps S6-2 to S6-5 are repeated until the amount of change in P(c_j) falls below the threshold in step S6-5. When it does, the machine learning unit 5 obtains the most recently calculated values of the probability P(c_j | t_i) that the ternary set t_i has or does not have a causal relationship, the probability P(CP_ti | c_j) that the syntactic structure CP_ti appears, and the probability P(SP_ti | c_j) that the noun pair SP_ti appears.
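The loop of steps S6-1 to S6-5 can be sketched as a small semi-supervised naive Bayes EM routine. This is a simplified illustration under stated assumptions, not the patented procedure. In particular, syntactic structures are matched exactly instead of being weighted by sim', add-one smoothing is assumed in every estimate, untagged ternary sets start from a uniform posterior, and user tags are kept fixed. The name `run_em` and the toy data are hypothetical.

```python
from collections import defaultdict


def run_em(triples, labels, threshold=1e-3, max_iter=200):
    """Semi-supervised naive Bayes EM over (cp, sp) feature pairs.

    triples: list of (cp, sp) tuples; labels: {index: 0 or 1} for tagged items.
    Returns (prior, post), where post[i][j] approximates P(c_j | t_i).
    """
    n_classes = 2
    n_cp = len({cp for cp, _ in triples})
    n_sp = len({sp for _, sp in triples})

    # Step S6-1: tagged items get certain posteriors; untagged ones start
    # uniform (an assumption, the text only initializes the tagged data).
    post = [[1.0 / n_classes] * n_classes for _ in triples]
    for i, c in labels.items():
        post[i] = [1.0 if j == c else 0.0 for j in range(n_classes)]
    prior = [1.0 / n_classes] * n_classes

    for _ in range(max_iter):
        # Step S6-2: re-estimate P(CP | c_j) and P(SP | c_j) from soft counts.
        cp_mass = [defaultdict(float) for _ in range(n_classes)]
        sp_mass = [defaultdict(float) for _ in range(n_classes)]
        class_mass = [0.0] * n_classes
        for (cp, sp), p in zip(triples, post):
            for j in range(n_classes):
                cp_mass[j][cp] += p[j]
                sp_mass[j][sp] += p[j]
                class_mass[j] += p[j]

        def likelihood(j, cp, sp):
            p_cp = (1.0 + cp_mass[j][cp]) / (n_cp + class_mass[j])
            p_sp = (1.0 + sp_mass[j][sp]) / (n_sp + class_mass[j])
            return p_cp * p_sp

        # Step S6-3: update P(c_j | t_i) for untagged items by Bayes' rule.
        for i, (cp, sp) in enumerate(triples):
            if i in labels:
                continue  # assumption: user tags stay fixed
            scores = [prior[j] * likelihood(j, cp, sp) for j in range(n_classes)]
            total = sum(scores)
            post[i] = [s / total for s in scores]

        # Step S6-4: update the class priors P(c_j) with add-one smoothing.
        new_prior = [
            (1.0 + sum(p[j] for p in post)) / (n_classes + len(triples))
            for j in range(n_classes)
        ]

        # Step S6-5: stop once the priors change by less than the threshold.
        change = max(abs(a - b) for a, b in zip(new_prior, prior))
        prior = new_prior
        if change < threshold:
            break
    return prior, post


# Toy data: index 0 is tagged causal (class 1), index 4 is tagged non-causal.
triples = [("A", "x")] * 4 + [("B", "y")] * 2
prior, post = run_em(triples, labels={0: 1, 4: 0})
```

On this toy input the untagged items that share features with the causal example drift toward class 1 and the others toward class 0, which is the behavior the description relies on when only a small part of the data is tagged.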

Table 1 shows part of the original sentences designated by the user as ternary sets having a causal relationship, together with the probability P(c_1 | t_i) that each ternary set has a causal relationship, as calculated by the processing of the machine learning unit 5 shown in FIG. 6. This is an example of a causal relationship experiment by machine learning that targeted 16 programs of "Kyo no Kenko" ("Today's Health") dealing with circulatory system topics, used the 2180 closed-caption sentences from those programs as text data, and generated 1495 ternary sets. For the 149 ternary sets of one program selected at random from the 16, the presence or absence of a causal relationship was tagged by user operation to form the supervised data, and the machine learning unit 5 then calculated the probability P(c_1 | t_i). According to Table 1, each probability P(c_1 | t_i) is close to 1, so the causal relationship knowledge extraction device 1 can judge that P(c_1 | t_i) > P(c_0 | t_i) and determine that the ternary sets t_i shown in Table 1 have a causal relationship.

Furthermore, from the probability P(CP_ti | c_j) finally calculated in step S6-2 shown in FIG. 6, the machine learning unit 5 calculates the probability P(c_j | CP_ti) using the following equation.

Here, the probability P(c_1 | CP_ti) denotes the probability that the syntactic feature CP_ti included in the ternary set t_i has a causal relationship, and the probability P(c_0 | CP_ti) denotes the probability that it does not. By, for example, comparing this value with a preset value, the causal relationship knowledge extraction device 1 can identify the syntactic structures that are characteristic of a causal relationship.

Likewise, from the probability P(SP_ti | c_j) finally calculated in step S6-2 shown in FIG. 6, the machine learning unit 5 calculates the probability P(c_j | SP_ti) using the following equation.

Here, the probability P(c_1 | SP_ti) denotes the probability that the noun pair SP_ti included in the ternary set t_i has a causal relationship, and the probability P(c_0 | SP_ti) denotes the probability that it does not. By, for example, comparing this value with a preset value, the causal relationship knowledge extraction device 1 can identify noun pairs that have a causal relationship.
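Equations (9) and (10) both turn a class-conditional probability into a per-class probability for a single feature. A minimal sketch, assuming they are plain applications of Bayes' rule over the two classes (the equation images are not reproduced here), is shown below. The function name and the numbers are hypothetical.

```python
def class_posterior(prior, feature_likelihood):
    """Return P(c_j | feature) for each class j via Bayes' rule.

    prior[j] is P(c_j); feature_likelihood[j] is P(feature | c_j), where the
    feature is either a syntactic structure CP_ti or a noun pair SP_ti.
    """
    joint = [p * l for p, l in zip(prior, feature_likelihood)]
    total = sum(joint)
    if total == 0.0:
        raise ValueError("feature has zero probability under every class")
    return [x / total for x in joint]


# Hypothetical numbers: index 0 is c_0 (no causal relation), index 1 is c_1.
posterior = class_posterior(prior=[0.5, 0.5], feature_likelihood=[0.2, 0.6])
print(posterior)
```

The same function serves for both equations, since only the likelihood passed in differs.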

Next, the processing of the text analysis interface unit 6 will be described in detail. FIG. 7 is a flowchart for explaining the processing of the text analysis interface unit 6. First, the text analysis interface unit 6 receives the text data to be analyzed, divides it into morphemes by morphological analysis, and extracts noun pairs (step S7-1). The method of extracting noun pairs is the same as that of the noun pair extraction unit 2 shown in FIG. 1.

Then, for every extracted noun pair, the text analysis interface unit 6 extracts the structural features of the text data including the noun pair and the features of the attributes of the noun pair (step S7-2). Specifically, it analyzes the dependency relations between the clauses of the text data including the noun pair by syntactic analysis to generate a tree structure, generates a PSE from the tree structure, and extracts the superordinate concepts of the noun pair using a thesaurus. Using the PSE and the superordinate concepts, it then generates a ternary set indicating the structural features of the text data and the features of the attributes of the noun pair. In this way, the text analysis interface unit 6 generates a ternary set for every noun pair. The method of generating the ternary set is the same as that of the feature extraction unit 3 shown in FIGS. 1 and 2.

Then, for the syntactic feature CP_ti of the ternary set of the noun pair under analysis, the text analysis interface unit 6 searches the output data stored in the storage unit 7-1 for the probability P(c_j | CP_ti) of equation (9) that has the same syntactic feature. It compares the retrieved probability P(c_1 | CP_ti) with a preset threshold (for example, 0.5) (step S7-4), and when the probability is greater than the threshold, it determines that the noun pair has a causal relationship (step S7-4). When the probability is less than or equal to the threshold, it determines that the noun pair has no causal relationship (step S7-5). The processing of steps S7-4 to S7-6 is performed for all noun pairs (step S7-3).
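The lookup-and-threshold decision can be sketched as follows. The dictionary standing in for the results held in the storage unit 7-1, the string keys used for syntactic features, and the handling of unseen features are all assumptions made for illustration. The description does not say what happens when no matching syntactic feature is found.

```python
from typing import Mapping, Optional


def has_causal_relation(
    cp_feature: str,
    causal_prob_by_cp: Mapping[str, float],
    threshold: float = 0.5,
) -> Optional[bool]:
    """Look up P(c_1 | CP) for a syntactic feature and compare it with the threshold.

    Returns None when the feature is not in the stored results (an assumption;
    the text does not say how unseen structures are handled).
    """
    prob = causal_prob_by_cp.get(cp_feature)
    if prob is None:
        return None
    return prob > threshold


# Hypothetical stored results, keyed by a string form of the PSE.
stored = {"N1 ga N2 ni tsunagaru": 0.9, "N1 to N2": 0.5}
queries = ["N1 ga N2 ni tsunagaru", "N1 to N2", "unseen"]
decisions = {cp: has_causal_relation(cp, stored) for cp in queries}
```

A probability exactly equal to the threshold is treated as "no causal relationship", following the "less than or equal to" wording of the description.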

In the flowchart shown in FIG. 7, the presence or absence of a causal relationship for a noun pair is determined in step S7-4 by comparing the probability P(c_1 | CP_ti) of equation (9) with a preset threshold. However, the probability P(c_1 | t_i) of equation (7) may be used instead of the probability P(c_1 | CP_ti) of equation (9).

As described above, the causal relationship knowledge extraction device 1 generates a large amount of unsupervised data in which the causal relationship is not specified and a small amount of supervised data in which it is specified, and calculates, by machine learning using the EM algorithm, a probability representing the presence or absence of a causal relationship for every noun pair. Using this probability makes it possible to determine the presence or absence of a causal relationship, that is, to extract noun pairs having a causal relationship and sentence structures having a causal relationship. It also becomes possible to determine the presence or absence of a causal relationship for unknown text data.

In addition, when such a causal relationship knowledge extraction device 1 is applied, for example, to a broadcast receiving device that receives broadcast waves, the broadcast receiving device can accumulate causal relationship knowledge data automatically by constantly monitoring and analyzing text data from a reliable information source, such as closed captions sent by broadcast. This makes it possible to learn automatically the knowledge about causal relationships that humans possess. By using the causal relationship knowledge learned in this way, it becomes possible to build a system that can respond automatically to advanced questions of the "why" type.

The causal relationship knowledge extraction device 1 is constituted by a computer including a CPU, a volatile storage medium such as RAM, a non-volatile storage medium such as ROM, interfaces, and the like. The functions of the noun pair extraction unit 2, the feature extraction unit 3, the tagging unit 4 and the machine learning unit 5 of the causal relationship knowledge extraction device 1 are each realized by causing the CPU to execute a program describing these functions. These programs can also be stored and distributed on a storage medium such as a magnetic disk (floppy disk, hard disk, etc.), an optical disk (CD-ROM, DVD, etc.) or a semiconductor memory.

The present invention has been described above with reference to an embodiment, but it is not limited to that embodiment and can be modified in various ways without departing from its technical idea. For example, the causal relationship knowledge extraction device 1 shown in FIG. 1 is constituted by a single computer device, but this is not a limitation: the text analysis interface unit 6 alone may be provided in another computer device and connected via a network. The noun pair extraction unit 2, the feature extraction unit 3 and the other units may also be provided in different computer devices for each processing unit and connected via a network, and the storage units 7-1 to 7-4, or some of them, may be provided in another computer device.

FIG. 1 is a block diagram showing the configuration of a causal relationship knowledge extraction device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the feature extraction unit of FIG. 1.
FIG. 3 is a diagram showing tree structure generation example 1.
FIG. 4 is a diagram showing tree structure generation example 2.
FIG. 5 is a diagram showing an example of a transformed tree structure.
FIG. 6 is a flowchart explaining the processing of the machine learning unit.
FIG. 7 is a flowchart explaining the processing of the text analysis interface unit.

Explanation of symbols

1 Causal relationship knowledge extraction device
2 Noun pair extraction unit
3 Feature extraction unit
4 Tagging unit
5 Machine learning unit
6 Text analysis interface unit
7 Storage unit
31 Tree structure generation means
32 Superordinate concept extraction means
33 PSE generation means
34 Ternary set generation means
35 Thesaurus storage unit

Claims

A causal relationship knowledge extraction device that extracts, as knowledge, the presence or absence of a causal relationship for noun pairs included in text data, the device comprising:
a noun pair extraction unit that extracts noun pairs included in the text data;
a feature extraction unit that, for text data including a noun pair extracted by the noun pair extraction unit, extracts the syntactic structure of the text data and the attributes of the noun pair as feature data;
a tagging unit that assigns, to a part of the feature data extracted by the feature extraction unit, a tag indicating the presence or absence of a causal relationship in the noun pair;
a storage unit that stores the feature data to which a tag has been assigned by the tagging unit as supervised data, and the feature data to which no tag has been assigned by the tagging unit as unsupervised data; and
a causal relationship determination unit that determines the presence or absence of a causal relationship in the noun pairs included in the unsupervised data, based on the supervised data and the unsupervised data stored in the storage unit.

The causal relationship knowledge extraction device according to claim 1,
wherein the feature extraction unit comprises:
tree structure generation means for generating, for text data including a noun pair, a tree structure indicating the dependency relations of the noun pair;
PSE generation means for separating the nodes constituting the tree structure generated by the tree structure generation means into independent words and function words, and generating a PSE (Preorder String Expression) indicating the syntactic structure of the text data;
superordinate concept extraction means for extracting superordinate concepts of the noun pair; and
ternary set generation means for generating a ternary set as feature data using the PSE generated by the PSE generation means and the superordinate concepts extracted by the superordinate concept extraction means.

The causal relationship knowledge extraction device according to claim 1 or 2,
wherein the causal relationship determination unit, by an EM algorithm, calculates for the supervised data stored in the storage unit the probability of the presence or absence of a causal relationship in that data; calculates for the supervised data and the unsupervised data stored in the storage unit the probability of appearance of a syntactic structure that is identical to the syntactic structure included in the data or whose similarity to it is equal to or higher than a predetermined value; calculates for the supervised data and the unsupervised data stored in the storage unit the probability of appearance of a noun pair having the same attributes as the noun pair included in the data; and, based on these probabilities, calculates the probability of the presence or absence of a causal relationship in the feature data, in the syntactic structure included in the feature data, and in the noun pair included in the feature data, and determines the presence or absence of a causal relationship.

The causal relationship knowledge extraction device according to claim 3,
further comprising a text determination unit that receives new text data, extracts the noun pairs included in the text data, extracts the syntactic structure of the text data and the attributes of the noun pairs as feature data, and determines the presence or absence of a causal relationship in the noun pairs included in the new text data based on the probability of the presence or absence of a causal relationship calculated by the causal relationship determination unit.

The causal relationship knowledge extraction device according to claim 4,
wherein the text determination unit, using the extracted feature data and based on the probability of the presence or absence of a causal relationship in the syntactic structure included in the feature data as calculated by the causal relationship determination unit, calculates the probability of the presence or absence of a causal relationship in the syntactic structure included in the extracted feature data, and determines the presence or absence of a causal relationship based on that probability.

The causal relationship knowledge extraction device according to claim 4,
wherein the text determination unit, using the extracted feature data and based on the probability of the presence or absence of a causal relationship in the noun pair included in the feature data as calculated by the causal relationship determination unit, calculates the probability of the presence or absence of a causal relationship in the noun pair included in the extracted feature data, and determines the presence or absence of a causal relationship based on that probability.

A causal relationship knowledge extraction program for a device that extracts the presence or absence of a causal relationship for noun pairs included in text data, the program causing a computer constituting the device to execute:
a process of extracting noun pairs included in the text data;
a process of extracting the syntactic structure of text data including a noun pair and the attributes of the noun pair as feature data;
a process of assigning, to a part of the extracted feature data, a tag indicating the presence or absence of a causal relationship in the noun pair; and
a process of determining the presence or absence of a causal relationship in the noun pairs included in unsupervised data, based on supervised data, which is the feature data to which the tag has been assigned, and the unsupervised data, which is the feature data to which no tag has been assigned.