JP2009265889A

JP2009265889A - Language processor and program

Info

Publication number: JP2009265889A
Application number: JP2008113908A
Authority: JP
Inventors: Ichiro Yamada; 一郎山田; Kikuka Miura; 菊佳三浦; Hideki Sumiyoshi; 英樹住吉; Masahiro Shibata; 正啓柴田; Nobuyuki Yagi; 伸行八木
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2008-04-24
Filing date: 2008-04-24
Publication date: 2009-11-12
Anticipated expiration: 2028-04-24
Also published as: JP5184195B2

Abstract

PROBLEM TO BE SOLVED: To provide a language processor capable of accurately extracting only pairs of nouns that are highly likely to be mutually related, and capable of also extracting the relation between the paired nouns.

SOLUTION: This language processor includes: a processing target word pair feature extracting part that selects a pair of words included in one sentence as a processing target word pair and extracts an appearance frequency feature of the processing target word pair from input text data; a co-occurrence word feature extracting part that selects a co-occurrence word and extracts an appearance frequency feature of the co-occurrence word; a syntax structure feature extracting part that extracts the syntax structure, within the sentence, of the processing target word pair and the co-occurrence word, and extracts an appearance frequency feature of the syntax structure; and a machine learning processing part that uses the obtained appearance frequency feature data to calculate, by machine learning processing, the conditional probability of the processing target word pair, the conditional probability of the co-occurrence word, and the conditional probability of the syntax structure, and writes them as learning result data into a learning result data storage part.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to natural language processing, and in particular to a language processing apparatus that extracts information from text and to a computer program therefor.

Conventionally, one method for extracting related nouns that appear in the same sentence uses mutual information. Mutual information is a measure of the dependence between two random variables; applied to words, it measures how strongly one word depends on another. Non-Patent Document 1 describes mutual information.
Non-Patent Document 1: Kenji Kita, "Language and Computation 4: Probabilistic Language Models", University of Tokyo Press, p. 11, 1999

However, when mutual information is used to extract pairs of related nouns, the syntax structure in which the two words appear is not taken into account. Consequently, two low-frequency words that are entirely unrelated can receive a high mutual information score merely because they happen to appear in the same sentence. A further problem is that, even when a noun pair is extracted using mutual information, the kind of relationship between the two words cannot be determined.
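The low-frequency problem described above can be illustrated with a small sketch. It assumes the pointwise form of mutual information computed from sentence-level counts; the function name `pmi_table` and the toy corpus are invented for illustration and do not come from the patent.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(sentences):
    """Return pointwise mutual information (bits) for each co-occurring noun pair.

    `sentences` is a list of noun lists, one list per sentence.
    Probabilities are estimated as (number of sentences containing x) / n.
    """
    n = len(sentences)
    word_count = Counter()
    pair_count = Counter()
    for nouns in sentences:
        unique = sorted(set(nouns))
        word_count.update(unique)
        pair_count.update(combinations(unique, 2))
    return {
        (a, b): math.log2((c / n) / ((word_count[a] / n) * (word_count[b] / n)))
        for (a, b), c in pair_count.items()
    }

# Toy corpus (hypothetical data).
corpus = (
    [["eagle", "prairie dog", "enemy"]] * 4   # related pair, seen together often
    + [["eagle"], ["prairie dog"]] * 2        # the same nouns also occur apart
    + [["tuesday", "granite"]]                # unrelated nouns, together once
    + [["river"]] * 3
)
scores = pmi_table(corpus)
related = scores[("eagle", "prairie dog")]
accidental = scores[("granite", "tuesday")]
```

In this toy corpus the pair that co-occurs once by chance scores higher than the frequently co-occurring related pair, which is the weakness the paragraph above points out.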

The present invention has been made in recognition of the above problems. Its object is to provide a language processing apparatus, and a computer program therefor, that can accurately extract only pairs of nouns that are highly likely to be related to each other and can also extract the relationship between the two nouns of each pair.

[1] In order to solve the above problems, a language processing apparatus according to one aspect of the present invention comprises: a processing target word pair feature extraction unit (11) that selects, from input text data containing a plurality of sentences, a pair of words contained in one sentence as a processing target word pair, and extracts a predetermined feature of the appearance frequency of the processing target word pair in the input text data; a co-occurrence word feature extraction unit (co-occurrence noun feature extraction unit 12) that selects, as a co-occurrence word, another word appearing in a sentence of the input text data that contains the processing target word pair, and extracts a predetermined feature of the appearance frequency of the co-occurrence word in the input text data; a syntax structure feature extraction unit (13) that extracts the syntax structure of a sentence of the input text data that contains the processing target word pair and the co-occurrence word, and extracts a predetermined feature of the appearance frequency of the syntax structure in the input text data; and a machine learning processing unit (14). The machine learning processing unit operates on the basis of information on those sentences in the input text data for which it can be determined, by referring to processing target concept related word data in which words related to a processing target concept are stored in advance, that the co-occurrence word belongs to a class representing the relationship of the processing target word pair, together with the appearance frequency feature of the processing target word pair, the appearance frequency feature of the co-occurrence word, and the appearance frequency feature of the syntax structure. It writes to a learning result data storage unit (3), as learning result data, the conditional probability that the processing target word pair appears given that a sentence belongs to the class representing the relationship of the processing target word pair, the conditional probability that the co-occurrence word appears given that a sentence belongs to that class, and the conditional probability that the syntax structure appears given that a sentence belongs to that class.

With this configuration, the processing target word pair feature extraction unit extracts appearance frequency features for the processing target word pairs contained in sentences. The co-occurrence word feature extraction unit extracts appearance frequency features for co-occurrence words. The syntax structure feature extraction unit extracts the syntax structure, within the sentence, of the processing target word pair and the co-occurrence word, and extracts appearance frequency features for that syntax structure. When input text data containing a large number of sentences is used, the extracted appearance frequency values represent features that are statistically valid for the language. The machine learning processing unit refers to processing target concept related word data, in which words related to the processing target concept are stored in advance. Specifically, by determining for example whether a sentence contains a word corresponding to this data, it extracts from the given sentences those for which the co-occurrence word can be determined to belong to the class representing the relationship of the processing target word pair. The extracted sentences can be regarded as sentences in which the co-occurrence word clearly belongs to that class, and they serve as correct samples for the machine learning processing. Based on these correct samples, machine learning processing using, for example, the EM algorithm statistically yields, for all sentences in the input text data including those other than the correct samples, the conditional probability of the processing target word pair given the class, the conditional probability of the co-occurrence word given the class, and the conditional probability of the syntax structure given the class (the learning result data). The class here (C_1) corresponds to the proposition that the co-occurrence word represents the relationship of the processing target word pair. The respective probabilities for the other class formed by its complement (C_0, the class in which the co-occurrence word does not represent the relationship of the processing target word pair) are also obtained by subtracting the respective conditional probabilities above from the total probability (1). The learning result data can be used to calculate the probability that a processing target word pair belongs to the class, the probability that a co-occurrence word belongs to the class, and the probability that a syntax structure belongs to the class.
In other words, it is possible to determine whether another word appearing in the same sentence (a co-occurrence word) indicates the name of the relationship between the two nouns being processed. As a result, relationships between low-frequency words can also be estimated with high accuracy.
The words processed by this language processing apparatus are typically nouns. In that case the processing target word pair is a processing target noun pair, and the co-occurrence word is typically a co-occurrence noun.
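The seeded learning step described above can be sketched as follows. This is only an illustration under stated assumptions: the text so far names the EM algorithm and the three conditional probabilities but gives no model, so the sketch assumes a two-class naive Bayes model over the three features (word pair, co-occurrence word, syntax structure), pins sentences whose co-occurrence word is in the related-word data to class C_1, and uses additive smoothing. The function name `em_naive_bayes`, the parameter `alpha`, and the toy data are hypothetical.

```python
from collections import defaultdict

def em_naive_bayes(instances, seed_words, iterations=20, alpha=0.1):
    """Fit P(feature | C1) and P(feature | C0) with seeded EM.

    instances: list of (pair, cooc_word, structure) tuples, one per sentence.
    seed_words: co-occurrence words known to name a relationship.
    Returns (prior_c1, cond, resp), where cond[f][value] is the tuple
    (P(value | C1), P(value | C0)) for feature index f, and resp[i] is the
    final estimate of P(C1 | instance i).
    """
    is_seed = [cooc in seed_words for _, cooc, _ in instances]
    # Seeds are pinned to C1; all other sentences start undecided.
    resp = [1.0 if s else 0.5 for s in is_seed]
    vocab = [sorted({inst[f] for inst in instances}) for f in range(3)]
    prior, cond = 0.5, []
    for _ in range(iterations):
        # M-step: re-estimate the parameters from smoothed soft counts.
        n1 = sum(resp)
        n0 = len(instances) - n1
        prior = n1 / len(instances)
        cond = []
        for f in range(3):
            c1, c0 = defaultdict(float), defaultdict(float)
            for inst, r in zip(instances, resp):
                c1[inst[f]] += r
                c0[inst[f]] += 1.0 - r
            k = len(vocab[f])
            cond.append({
                v: ((c1[v] + alpha) / (n1 + alpha * k),
                    (c0[v] + alpha) / (n0 + alpha * k))
                for v in vocab[f]
            })
        # E-step: update P(C1 | sentence) for the non-seed sentences.
        for i, inst in enumerate(instances):
            if is_seed[i]:
                continue
            p1, p0 = prior, 1.0 - prior
            for f in range(3):
                given_c1, given_c0 = cond[f][inst[f]]
                p1 *= given_c1
                p0 *= given_c0
            resp[i] = p1 / (p1 + p0)
    return prior, cond, resp

# Toy data (hypothetical): "S1" and "S2" stand for two syntax structures.
instances = [
    (("eagle", "prairie dog"), "enemy", "S1"),   # seed sentence
    (("lion", "zebra"), "enemy", "S1"),          # seed sentence
    (("owl", "mouse"), "predator", "S1"),        # shares a structure with seeds
    (("cat", "dog"), "photo", "S2"),
    (("cow", "horse"), "photo", "S2"),
]
prior, cond, resp = em_naive_bayes(instances, seed_words={"enemy"})
```

In this toy run the unlabeled sentence that shares its syntax structure with the seed sentences drifts toward C_1, while the sentences with the other structure drift toward C_0; `cond` plays the role of the learning result data.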

In this language processing apparatus, the processing target word pairs may also be selected only from words having a predetermined processing target attribute. This restriction keeps irrelevant word pairs out of the candidates and increases the reliability of the calculated appearance frequency features.

[2] According to one aspect of the present invention, in the above language processing apparatus, the syntax structure feature extraction unit extracts, based on the result of parsing the sentence, a common dependency destination clause shared by the first word of the processing target word pair, the second word of the processing target word pair, and the co-occurrence word, and identifies the syntax structure of the sentence by the combination of the syntax structure from the first word to the common dependency destination clause, the syntax structure from the second word to the common dependency destination clause, and the syntax structure that modifies the common dependency destination clause.

This configuration yields a syntax structure that is particularly well suited to the statistical processing of the present invention. As a result, the accuracy of word extraction improves.

[3] According to one aspect of the present invention, in the above language processing apparatus, the syntax structure feature extraction unit treats, as a syntax structure group having similar syntax structures, a plurality of syntax structures for which the proportion of shared words is equal to or greater than a predetermined threshold, where the words considered are those that appear in the word lists representing the syntax structures and are none of the first word, the second word, and the co-occurrence word. The unit extracts the appearance frequency feature of this syntax structure group as the appearance frequency feature of the syntax structure.

With this configuration, appearance frequency features can be extracted for syntax structure groups that bring together sentences with similar syntax structures. This absorbs minor variations in wording and notation within sentences and yields statistically stable appearance frequency features for syntax structures. As a result, words can be extracted with high accuracy even when the input text data contains relatively few sentences.
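The grouping of similar syntax structures can be sketched as below. The text above does not fix how the proportion of shared words is computed, so this sketch assumes the size of the intersection divided by the size of the union of the two word sets, and a simple greedy grouping against the first member of each group. The placeholders `"N1"`, `"N2"`, `"N3"` (standing for the first word, the second word, and the co-occurrence word), the function names, and the toy structures are all hypothetical.

```python
PLACEHOLDERS = {"N1", "N2", "N3"}

def structure_words(structure):
    """Flatten the word lists of a structure, dropping the three placeholders."""
    return {w for word_list in structure for w in word_list if w not in PLACEHOLDERS}

def common_word_ratio(a, b):
    """Assumed similarity: shared words divided by all distinct words."""
    wa, wb = structure_words(a), structure_words(b)
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

def group_structures(structures, threshold):
    """Greedily place each structure in the first group it is similar enough to."""
    groups = []
    for s in structures:
        for g in groups:
            if common_word_ratio(g[0], s) >= threshold:
                g.append(s)
                break
        else:
            groups.append([s])
    return groups

# Toy structures: each is a triple of word lists.
a = (["N1", "にとって"], ["N2", "は"], ["恐ろしい", "NULL", "N3"])
b = (["N1", "にとって"], ["N2", "は"], ["N3"])
c = (["N1", "の"], ["N2", "を"], ["N3"])
groups = group_structures([a, b, c], threshold=0.5)
```

With a threshold of 0.5, `a` and `b` fall into one group and `c` forms its own, so the frequency of the group would be counted instead of the frequency of each exact structure.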

[4] According to one aspect of the present invention, the above language processing apparatus further comprises a probability value calculation processing unit that uses the learning result data read from the learning result data storage unit to calculate the conditional probability that a sentence belongs to the class given that the processing target word pair appears in the sentence, the conditional probability that a sentence belongs to the class given that the co-occurrence word appears in the sentence, and the conditional probability that a sentence belongs to the class given that the syntax structure appears in the sentence.

With this configuration, it is possible to calculate the probability that a processing target word pair belongs to the class, the probability that a co-occurrence word belongs to the class, and the probability that a syntax structure belongs to the class. Thus, for example by applying a suitable threshold to these probabilities, it can be determined whether a sentence, and the processing target word pair, co-occurrence word, and syntax structure it contains, belong to the class.
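One way to turn the learned probabilities into a class membership decision is Bayes' rule followed by a threshold. The sketch below assumes that, for a feature value x, the learning result supplies P(x | C_1) and P(x | C_0) and that a class prior P(C_1) is also available; the prior, the function names, and the default threshold of 0.5 are assumptions made for illustration.

```python
def posterior_c1(p_x_given_c1, p_x_given_c0, prior_c1):
    """P(C1 | x) by Bayes' rule for a two-class model."""
    numerator = p_x_given_c1 * prior_c1
    denominator = numerator + p_x_given_c0 * (1.0 - prior_c1)
    return numerator / denominator if denominator > 0 else 0.0

def belongs_to_class(p_x_given_c1, p_x_given_c0, prior_c1, threshold=0.5):
    """Threshold the posterior to decide whether x belongs to class C1."""
    return posterior_c1(p_x_given_c1, p_x_given_c0, prior_c1) >= threshold

# Example: a co-occurrence word four times as likely under C1 as under C0.
p = posterior_c1(0.2, 0.05, prior_c1=0.5)
decision = belongs_to_class(0.2, 0.05, prior_c1=0.5)
```

The same function applies unchanged to a word pair, a co-occurrence word, or a syntax structure, since each is handled as a single feature value.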

[5] A language processing apparatus according to one aspect of the present invention comprises a probability value calculation processing unit that uses the learning result data written to the learning result data storage unit by the above language processing apparatus to calculate the conditional probability that a sentence belongs to the class given that the processing target word pair appears in the sentence, the conditional probability that a sentence belongs to the class given that the co-occurrence word appears in the sentence, and the conditional probability that a sentence belongs to the class given that the syntax structure appears in the sentence.

With this configuration, learning result data obtained from machine learning processing performed in advance on input text data can be used to determine whether a sentence, and the processing target word pair, co-occurrence word, and syntax structure it contains, belong to the class.
Sentences that were not included in the original input text data used for the machine learning processing can also be judged.

[6] One aspect of the present invention is a program that causes a computer to execute: a processing target word pair feature extraction process of selecting, from input text data containing a plurality of sentences, a pair of words contained in one sentence as a processing target word pair, and extracting a predetermined feature of the appearance frequency of the processing target word pair in the input text data; a co-occurrence word feature extraction process of selecting, as a co-occurrence word, another word appearing in a sentence of the input text data that contains the processing target word pair, and extracting a predetermined feature of the appearance frequency of the co-occurrence word in the input text data; a syntax structure feature extraction process of extracting the syntax structure of a sentence of the input text data that contains the processing target word pair and the co-occurrence word, and extracting a predetermined feature of the appearance frequency of the syntax structure in the input text data; and a machine learning process. The machine learning process operates on the basis of information on those sentences in the input text data for which it can be determined, by referring to processing target concept related word data in which words related to a processing target concept are stored in advance, that the co-occurrence word belongs to a class representing the relationship of the processing target word pair, together with the appearance frequency feature of the processing target word pair, the appearance frequency feature of the co-occurrence word, and the appearance frequency feature of the syntax structure. It writes to a learning result data storage unit, as learning result data, the conditional probability that the processing target word pair appears given that a sentence belongs to the class representing the relationship of the processing target word pair, the conditional probability that the co-occurrence word appears given that a sentence belongs to that class, and the conditional probability that the syntax structure appears given that a sentence belongs to that class.

According to the present invention, related word pairs contained in text, and the words that name their relationships, can be extracted. Relationships between low-frequency words can also be estimated with high accuracy, and the features of the syntax structure through which a relationship is expressed can be extracted as well. Such technology is also useful for machine understanding of text, and is expected to find application in fields such as information analysis, in which large volumes of text are analyzed by machine to extract important information.

[First Embodiment]
Next, an embodiment of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing the functional configuration of the language processing apparatus according to this embodiment. In the figure, reference numeral 1 denotes the language processing apparatus. As illustrated, the language processing apparatus 1 comprises an input text storage unit 2, a learning result data storage unit 3, output data 4, a processing target word pair feature extraction unit 11, a co-occurrence noun feature extraction unit 12 (co-occurrence word feature extraction unit), a syntax structure feature extraction unit 13, a machine learning processing unit 14, and a probability value calculation processing unit 15.

The input text storage unit 2 stores the input text data to be processed. This input text data contains a large number of sentences.

The processing target word pair feature extraction unit 11 takes a predetermined processing target attribute and treats noun pairs belonging to that attribute as processing target word pairs. For each processing target word pair, it extracts features such as the number of appearances from the input text. In other words, the processing target word pair feature extraction unit 11 selects, from input text data containing a plurality of sentences, a pair of words contained in one sentence as a processing target word pair, and extracts a predetermined feature of the appearance frequency of that pair in the input text data.

For two nouns (a noun pair) that belong to the processing target attribute and appear in one sentence, the co-occurrence noun feature extraction unit 12 extracts from the input text features, such as the number of appearances, of other nouns that appear in the same sentence and are candidates for the relationship. In other words, the co-occurrence noun feature extraction unit 12 selects, as a co-occurrence word, another word appearing in a sentence of the input text data that contains the processing target word pair, and extracts a predetermined feature of the appearance frequency of the co-occurrence word in the input text data.

The syntax structure feature extraction unit 13 extracts syntax structure features between the processing target word pair and the co-occurring noun. Specifically, it extracts from the input text features such as the number of appearances of the syntax structure among three clauses: those of the two nouns that belong to the processing target attribute and appear in one sentence, and that of another noun appearing in the same sentence. In other words, the syntax structure feature extraction unit 13 extracts the syntax structure of a sentence of the input text data that contains the processing target word pair and the co-occurrence word, and extracts a predetermined feature of the appearance frequency of the syntax structure in the input text data.

The machine learning processing unit 14 takes as input the results of the processing target word pair feature extraction unit 11, the co-occurrence noun feature extraction unit 12, and the syntax structure feature extraction unit 13, and performs machine learning processing using the EM algorithm. In detail, the machine learning processing unit 14 operates on the basis of information on those sentences in the input text data for which it can be determined, by referring to processing target concept related word data in which words related to the processing target concept are stored in advance, that the co-occurrence word belongs to the class representing the relationship of the processing target word pair, together with the appearance frequency feature of the processing target word pair, the appearance frequency feature of the co-occurrence word, and the appearance frequency feature of the syntax structure. It writes to the learning result data storage unit, as learning result data, the conditional probability that the processing target word pair appears given that a sentence belongs to the class representing the relationship of the processing target word pair, the conditional probability that the co-occurrence word appears given that a sentence belongs to that class, and the conditional probability that the syntax structure appears given that a sentence belongs to that class.

The learning result data storage unit 3 stores the data (probability values) obtained as a result of the machine learning processing.

The probability value calculation processing unit 15 reads the learning result data produced by the machine learning processing unit 14 from the learning result data storage unit 3, and calculates and outputs the probability that the nouns of a processing target noun pair are related to each other, the probability that a noun co-occurring with the processing target noun pair is a word expressing the relationship, and the probability that the syntax structure between the processing target noun pair and the co-occurring noun is a structure indicating a relationship. In other words, the probability value calculation processing unit 15 uses the learning result data to calculate the conditional probability that a sentence belongs to the class given that the processing target word pair appears in the sentence, the conditional probability that a sentence belongs to the class given that the co-occurrence word appears in the sentence, and the conditional probability that a sentence belongs to the class given that the syntax structure appears in the sentence.
The output data 4 is the data output by the probability value calculation processing unit 15.

FIG. 2 is a flowchart showing the overall processing procedure of the language processing apparatus 1. The overall processing flow of the language processing apparatus 1 is described below with reference to this flowchart.

In this apparatus, first, in step S01, the processing target word pair feature extraction unit 11 decides on a processing target attribute and extracts noun pairs belonging to that attribute from the text read from the input text storage unit 2. Examples of processing target attributes include "animal", "person", "country", and "organization". The processing target word pair feature extraction unit 11 determines the noun pairs that belong to the given processing target attribute and calculates the appearance frequency feature of each noun pair in the input text.
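Step S01 can be sketched as follows, assuming sentences have already been reduced to lists of nouns and that membership in a processing target attribute is looked up in a lexicon. The lexicon `ATTRIBUTE_LEXICON`, the function name, and the data are hypothetical, and a plain sentence count stands in for the appearance frequency feature.

```python
from collections import Counter
from itertools import combinations

# Hypothetical attribute lexicon; the patent does not say how attributes are assigned.
ATTRIBUTE_LEXICON = {
    "animal": {"prairie dog", "golden eagle", "coyote"},
}

def count_target_pairs(sentences, attribute):
    """Count, per noun pair, the sentences in which both nouns of the attribute appear."""
    members = ATTRIBUTE_LEXICON[attribute]
    counts = Counter()
    for nouns in sentences:
        targets = sorted({n for n in nouns if n in members})
        counts.update(combinations(targets, 2))
    return counts

sentences = [
    ["prairie dog", "golden eagle", "enemy"],
    ["prairie dog", "golden eagle"],
    ["coyote", "prairie dog", "prey"],
    ["golden eagle", "mountain"],
]
pair_counts = count_target_pairs(sentences, "animal")
```

Sorting the nouns before pairing gives each pair a canonical order, so the same two nouns are always counted under one key.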

Next, in step S02, the co-occurrence noun feature extraction unit 12 selects, as a candidate for the relationship, one other noun that appears in the same sentence as a noun pair belonging to the processing target attribute determined above, and calculates the appearance frequency feature of that noun in the input text. This noun is the relationship candidate; because it co-occurs with the processing target word pair, it is hereinafter called the "co-occurrence noun" for convenience.
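Step S02 can be sketched in the same style. This sketch assumes the frequency of a co-occurrence noun is counted over the sentences that contain both nouns of the pair; the patent text here leaves the exact feature open, and the function name and data are hypothetical.

```python
from collections import Counter

def count_cooccurring_nouns(sentences, pair):
    """Count nouns appearing in sentences that contain both members of `pair`."""
    first, second = pair
    counts = Counter()
    for nouns in sentences:
        present = set(nouns)
        if first in present and second in present:
            # Every other noun in the sentence is a relationship candidate.
            counts.update(present - {first, second})
    return counts

sentences = [
    ["prairie dog", "golden eagle", "enemy"],
    ["prairie dog", "golden eagle", "enemy", "grassland"],
    ["prairie dog", "golden eagle"],
    ["coyote", "prairie dog", "prey"],
]
cooc_counts = count_cooccurring_nouns(sentences, ("golden eagle", "prairie dog"))
```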

Then, in step S03, the syntax structure feature extraction unit 13 extracts the syntax structure between the noun pair belonging to the processing target attribute, determined by the processing target word pair feature extraction unit 11, and the co-occurrence noun determined by the co-occurrence noun feature extraction unit 12, and counts the number of times it appears. Here, a syntax feature is the dependency structure between the clauses of a sentence. The syntax structure feature extraction unit 13 parses the sentence using existing technology and, from the resulting parse tree data, generates three lists representing the dependency structure of the sentence.

First, the clause on which the target noun pair and the co-occurrence noun determined by the co-occurrence noun feature extraction unit 12 commonly depend (the common dependency destination clause) is extracted. The three lists are then as follows. The first list represents the dependency structure from the first noun of the target noun pair to the common dependency destination clause. The second list represents the dependency structure from the second noun of the target noun pair to the common dependency destination clause. The third list represents the structure, in the remaining part of the sentence, that modifies the common dependency destination clause. In this processing, a clause is itself also counted among its own dependency destinations.

At this time, the syntax structure feature extraction unit 13 divides each clause into an independent-word part, such as a noun or verb, and an attached-word part, such as a particle. For example, consider the sentence 「プレーリードッグにとってイヌワシは恐ろしい天敵です。」 ("For the prairie dog, the golden eagle is a fearsome natural enemy."). If "プレーリードッグ" (prairie dog) and "イヌワシ" (golden eagle) form the noun pair determined by the processing target word pair feature extraction unit 11, and "天敵" (natural enemy) is the co-occurrence noun determined by the co-occurrence noun feature extraction unit 12, the following three lists are extracted as the syntax structure.

First list = syntax structure from "プレーリードッグ" to the common dependency-destination clause "天敵です": "noun 1", にとって
Second list = syntax structure from "イヌワシ" to the common dependency-destination clause "天敵です": "noun 2", は
Third list = syntax structure that modifies "天敵です": 恐ろしい, NULL, "noun 3"

In this example, "noun 1" is "プレーリードッグ" (prairie dog), "noun 2" is "イヌワシ" (golden eagle), and "noun 3" is "天敵" (natural enemy). The dependency structures from which the three lists above are extracted are "イヌワシ-は-天敵-です", "プレーリードッグ-にとって-天敵-です", "恐ろしい-天敵", and so on. These dependency structures can be obtained by syntactic parsing.

The syntax structure feature extraction unit 13 then counts the appearance frequency of each set of three lists, treating two sets as the same only when all three lists are exactly identical.
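As a minimal sketch of this counting step, each syntax structure can be held as a tuple of three tuples and tallied with a counter. The tuple representation, the placeholder names `N1`, `N2`, `N3`, and the function name `count_structures` are assumptions made for illustration and do not appear in the description.

```python
from collections import Counter

def count_structures(structures):
    """Count how often each exactly identical set of three lists appears.

    Each structure is a tuple (list1, list2, list3) of tuples of strings,
    so two structures compare equal only if all three lists match.
    """
    return Counter(structures)

# Structure of the example sentence (with the modifier 恐ろしい).
s1 = (("N1", "にとって"), ("N2", "は"), ("恐ろしい", "NULL", "N3"))
# Structure of the same sentence without the modifier.
s2 = (("N1", "にとって"), ("N2", "は"), ("N3",))

# Hypothetical input: s1 observed twice, s2 once.
counts = count_structures([s1, s2, s1])
```

Tuples are used instead of lists so that a whole structure is hashable and can serve directly as a counter key.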

However, instead of counting the appearance frequency of sets of three lists that are exactly identical, the unit may count the appearance frequency of sets whose mutual similarity is equal to or greater than a predetermined value. The similarity used here can be determined, for example, from the proportion of common words, other than nouns 1 to 3, that appear in the three lists. For example, the proportion of common words between the structure taken from the first sentence shown above, 「プレーリードッグにとってイヌワシは恐ろしい天敵です。」, and the structure taken from a second sentence, 「プレーリードッグにとってイヌワシは天敵です。」 ("For the prairie dog, the golden eagle is a natural enemy."), can be calculated as follows. The three lists obtained from the second sentence are:
First list = syntax structure from "プレーリードッグ" to "天敵です": "noun 1", にとって
Second list = syntax structure from "イヌワシ" to "天敵です": "noun 2", は
Third list = syntax structure that modifies "天敵です": "noun 3"
The common words between the three lists of the first sentence and the three lists of the second sentence are "にとって" and "は". Each of them appears in both the first sentence and the second sentence, so the number of common words is 4. The words that are not common are "恐ろしい" and "NULL", so the number of non-common words is 2. The similarity of these sentences is therefore 4/(4+2), that is, 4/6.
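The similarity calculation in this example can be sketched as follows. The counting rule is inferred from the worked numbers (each occurrence of a shared word is counted once per sentence), and the placeholder names and the function names `content_words` and `similarity` are assumptions made for illustration.

```python
from fractions import Fraction

# Placeholders standing for noun 1, noun 2, and the co-occurrence noun.
PLACEHOLDERS = {"N1", "N2", "N3"}

def content_words(structure):
    """Flatten the three lists, dropping the noun placeholders."""
    return [w for lst in structure for w in lst if w not in PLACEHOLDERS]

def similarity(a, b):
    """Proportion of word occurrences that are shared by both structures."""
    words_a, words_b = content_words(a), content_words(b)
    shared = set(words_a) & set(words_b)
    all_words = words_a + words_b
    if not all_words:
        return Fraction(1)  # two structures with no content words are identical
    common = sum(1 for w in all_words if w in shared)
    return Fraction(common, len(all_words))

first = (("N1", "にとって"), ("N2", "は"), ("恐ろしい", "NULL", "N3"))
second = (("N1", "にとって"), ("N2", "は"), ("N3",))
sim = similarity(first, second)  # 4 common occurrences out of 6
```

`Fraction` keeps the result exact, so it can be compared against a threshold without floating-point rounding.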

As described above, the syntax structure feature extraction unit 13 takes out the common dependency-destination clause of noun 1 (the first word of the processing target word pair), noun 2 (the second word of the processing target word pair), and the co-occurrence word. It then identifies the syntax structure of the sentence by the combination of three parts: the syntax structure from the first word to the common dependency-destination clause, the syntax structure from the second word to the common dependency-destination clause, and the syntax structure that modifies the common dependency-destination clause in the remaining parts.

When the syntax structure feature extraction unit 13 counts the appearance frequency of sets whose mutual similarity is equal to or greater than a predetermined value, instead of counting exactly identical syntax structures, it can be described as follows. Among the words that appear in the word lists representing the syntax structures, the unit considers those that are none of noun 1, noun 2, or the co-occurrence noun. Syntax structures for which the proportion of such words in common is equal to or greater than a predetermined threshold are grouped into a syntax structure group of similar structures, and the appearance frequency feature of that group is extracted as the appearance frequency feature of the syntax structure.

The appearance frequency features calculated by the processing target word pair feature extraction unit 11, the co-occurrence noun feature extraction unit 12, and the syntax structure feature extraction unit 13 are described next. These features are used later in the machine learning performed by the machine learning processing unit 14.
A triple consisting of a noun pair that belongs to the given processing target attribute and appears in one sentence, another noun that appears in that sentence, and the syntax structure among these three nouns is written t_i. The noun pair contained in this triple is written CPt_i, the other noun that appears in the same sentence and serves as the relation candidate is written RPt_i, and the syntax structure among the three nouns is written SPt_i.

Based on the above extraction results, the processing target word pair feature extraction unit 11 counts the total number of noun pair types that appear. It also obtains, for the noun pair CPt_i contained in a triple t_i, information on whether that noun pair is contained in a triple t_k.

Based on the above extraction results, the co-occurrence noun feature extraction unit 12 counts the total number of co-occurrence noun types that appear. It also obtains, for the co-occurrence noun RPt_i contained in a triple t_i, information on whether that co-occurrence noun is contained in a triple t_k.

Based on the results of the above analysis, the syntax structure feature extraction unit 13 counts the total number of syntax structure types that appear. It also obtains, for the syntax structure SPt_i contained in a triple t_i, information on whether that syntax structure is contained in a triple t_k.

The machine learning processing unit 14 performs learning using, as input data, the outputs of the processing target word pair feature extraction unit 11, the co-occurrence noun feature extraction unit 12, and the syntax structure feature extraction unit 13.
First, in step S04, the machine learning processing unit 14 extracts from the input data the sentences that can be judged to clearly express a relation. For example, when the processing target concept is animals, it extracts sentences in which the co-occurrence noun obtained by the co-occurrence noun feature extraction unit 12 (a candidate for expressing the relation of the word pair) is one of "弱い" (weak), "大好物" (great favorite food), "好物" (favorite food), "天敵" (natural enemy), "敵" (enemy), "仲間" (companion), "大敵" (archenemy), "得意" (good at), "種類" (kind), "獲物" (prey), "食べる" (eat), or a synonym or near-synonym of these. These are words that can clearly be judged to express a relation for the processing target concept of animals. The association between a processing target concept and the words to be extracted here is stored in advance in a storage unit (not shown) as predefined processing-target-concept related word data. Concept dictionary data, for example, can be used for this purpose. The machine learning processing unit 14 reads this related word data from the storage unit and compares it with the co-occurrence noun obtained by the co-occurrence noun feature extraction unit 12 to judge whether that noun expresses a relation for the processing target concept. Based on that judgment, it extracts from the input data the sentences that can be determined to express a relation.
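The selection in step S04 amounts to a lookup of the co-occurrence noun in a predefined word set. The sketch below assumes a small in-memory set in place of the stored related word data. The dictionary layout, the name `is_clear_relation`, and the second sample triple are hypothetical.

```python
# A subset of the relation words listed for the concept "animals".
# In the described apparatus this data is read from a storage unit.
RELATION_WORDS = {"天敵", "敵", "仲間", "獲物", "好物", "大好物"}

def is_clear_relation(co_noun, relation_words=RELATION_WORDS):
    """Return True when the co-occurrence noun is a known relation word."""
    return co_noun in relation_words

# Hypothetical triples: (noun pair, co-occurrence noun).
triples = [
    {"pair": ("プレーリードッグ", "イヌワシ"), "co_noun": "天敵"},
    {"pair": ("イルカ", "ボラ"), "co_noun": "海"},  # "海" (sea) is not a relation word
]

# Triples from sentences judged to clearly express a relation.
seeds = [t for t in triples if is_clear_relation(t["co_noun"])]
```

Synonym and near-synonym matching, which the description also mentions, is left out of this sketch.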

For a triple t_i, the case (class) in which its components CPt_i, RPt_i, and SPt_i express a relation is denoted c_1, and the case (class) in which they do not express a relation is denoted c_0. Their probabilities can be defined by equation (1).

In equation (1), P(CPt_i|c_j) is the probability that the noun pair CPt_i (two nouns) contained in t_i appears given class c_j. P(RPt_i|c_j) is the probability that the relation-candidate noun RPt_i contained in t_i (the co-occurrence noun appearing in the same sentence) appears given class c_j. P(SPt_i|c_j) is the probability that the syntax structure SPt_i among the three nouns contained in t_i appears given class c_j.

Using this equation, the machine learning processing unit 14 next performs, in step S05, machine learning based on the EM algorithm (expectation-maximization algorithm). The procedure of the learning process using the EM algorithm is described next, and it is also described in the following reference.
Reference: Kamal Nigam et al., "Text Classification from Labeled and Unlabeled Documents using EM," Machine Learning, Vol. 39, No. 2/3, pp. 103-134 (2000).

This machine learning process (the processing within step S05) is described with reference to a separate flowchart.
FIG. 3 is a flowchart showing the procedure of the machine learning process that the machine learning processing unit 14 performs using the EM algorithm.
First, in step S21, the machine learning processing unit 14 reads the text data to be processed from the input text storage unit 2 and calculates, by equation (2), the initial probability P(c_j|t_i) of the class c_j to which each t_i obtained from this text data belongs. The class c_j is either c_0 or c_1, as defined above.

In calculating this initial probability, for a triple t_i extracted from a sentence that the machine learning processing unit 14 judged to clearly express a relation (see above), the number of times it belongs to class c_1, the class that expresses a relation, is counted as 1. For a triple t_i extracted from any other sentence, the number of times it belongs to class c_1 is counted as a predetermined value of at least 0 and less than 1 (for example, 0.5). This predetermined value is not limited to 0.5 and can be changed as appropriate. Once c_1 has been determined in this way for a given t_i of a given sentence, c_0 for that t_i of that sentence is determined by c_0 = 1 - c_1. The counts are then summed over all sentences in which that t_i appears, and the resulting c_0 and c_1 counts are used to calculate the numerator of equation (2).
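The counting rule above can be sketched as follows. Equation (2) itself is not reproduced here, so dividing by the sum of the two counts is an assumption about its denominator. The function name `initial_probability` and the boolean input format are hypothetical.

```python
def initial_probability(occurrences, default=0.5):
    """Initial P(c_0|t_i) and P(c_1|t_i) for one triple t_i.

    occurrences: one boolean per sentence containing t_i, True when the
    sentence was judged to clearly express a relation.
    default: the predetermined c_1 count (0 <= default < 1) for other sentences.
    """
    # Per sentence: c_1 is 1 for a clear-relation sentence, else `default`.
    c1 = sum(1.0 if clear else default for clear in occurrences)
    # Per sentence: c_0 = 1 - c_1.
    c0 = sum(0.0 if clear else 1.0 - default for clear in occurrences)
    total = c0 + c1  # assumed normalizer; equals the number of sentences
    return {"c0": c0 / total, "c1": c1 / total}

# t_i appears in one clear-relation sentence and one other sentence.
p = initial_probability([True, False])
```

With the default of 0.5, a triple seen only in unlabeled sentences starts at 0.5 for both classes, so it carries no initial preference.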

The calculation of the initial probability in step S21 corresponds to the E step of the EM algorithm.

Next, in step S22, the machine learning processing unit 14 calculates three probabilities: the probability P(CPt_i|c_j) that the noun pair CPt_i occurs under class c_j, by equation (3); the probability P(RPt_i|c_j) that the co-occurrence noun RPt_i occurs under class c_j, by equation (4); and the probability P(SPt_i|c_j) that the syntax structure SPt_i among the three nouns occurs under class c_j, by equation (5).

In other words, equation (3) calculates the conditional probability of the processing target word pair given class c_j, equation (4) calculates the conditional probability of the co-occurrence word given class c_j, and equation (5) calculates the conditional probability of the syntax structure given class c_j.
The calculation of these probabilities in step S22 corresponds to the M step of the EM algorithm.

In equations (3), (4), and (5), |CP| denotes the total number of noun pairs that appear, |RP| the total number of relation-candidate nouns that appear, |SP| the total number of three-noun syntax structures that appear, and |T| the total number of triples that appear. N(CPt_i|t_k) is a function indicating whether the noun pair contained in triple t_i is contained in triple t_k. N(RPt_i|t_k) is a function indicating whether the relation-candidate noun (co-occurrence noun) contained in triple t_i is contained in triple t_k. N(SPt_i|t_k) is a function indicating whether the three-noun syntax structure contained in triple t_i is contained in triple t_k. Each of these functions returns 1 when the item is contained and 0 when it is not.

As equation (3) shows, the first term of the denominator of the probability P(CPt_i|c_j) is the total number of noun pairs that appear. The second term of the denominator is the sum, over all triples and all noun pairs, of the conditional probability of c_j given t_k in the case where triple t_k contains noun pair CPt_m (called Xc for convenience). The first term of the numerator is the constant 1. The second term of the numerator is the sum of Xc over all triples for the noun pair CPt_i in question.
Likewise, as equation (4) shows, the first term of the denominator of the probability P(RPt_i|c_j) is the total number of co-occurrence nouns that appear. The second term of the denominator is the sum, over all triples and all co-occurrence nouns, of the conditional probability of c_j given t_k in the case where triple t_k contains co-occurrence noun RPt_m (called Xr for convenience). The first term of the numerator is the constant 1. The second term of the numerator is the sum of Xr over all triples for the co-occurrence noun RPt_i in question.
Likewise, as equation (5) shows, the first term of the denominator of the probability P(SPt_i|c_j) is the total number of syntax structures that appear. The second term of the denominator is the sum, over all triples and all syntax structures, of the conditional probability of c_j given t_k in the case where triple t_k contains syntax structure SPt_m (called Xs for convenience). The first term of the numerator is the constant 1. The second term of the numerator is the sum of Xs over all triples for the syntax structure SPt_i in question.
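Read term by term, the description of equations (3) to (5) corresponds to the following Laplace-smoothed estimates. This is a reconstruction from the prose, not a copy of the original equations. The explicit summation ranges, the index m over all noun pairs, co-occurrence nouns, and syntax structures, and the subscript notation CP_{t_i} for what the text writes as CPt_i are assumptions of this sketch.

```latex
\begin{align}
P(CP_{t_i}\mid c_j) &=
  \frac{1+\sum_{k=1}^{|T|} N(CP_{t_i}\mid t_k)\,P(c_j\mid t_k)}
       {|CP|+\sum_{m=1}^{|CP|}\sum_{k=1}^{|T|} N(CP_{t_m}\mid t_k)\,P(c_j\mid t_k)} \tag{3}\\[4pt]
P(RP_{t_i}\mid c_j) &=
  \frac{1+\sum_{k=1}^{|T|} N(RP_{t_i}\mid t_k)\,P(c_j\mid t_k)}
       {|RP|+\sum_{m=1}^{|RP|}\sum_{k=1}^{|T|} N(RP_{t_m}\mid t_k)\,P(c_j\mid t_k)} \tag{4}\\[4pt]
P(SP_{t_i}\mid c_j) &=
  \frac{1+\sum_{k=1}^{|T|} N(SP_{t_i}\mid t_k)\,P(c_j\mid t_k)}
       {|SP|+\sum_{m=1}^{|SP|}\sum_{k=1}^{|T|} N(SP_{t_m}\mid t_k)\,P(c_j\mid t_k)} \tag{5}
\end{align}
```

In each case the constant 1 in the numerator and the count term in the denominator keep the estimate nonzero for items that have little or no weight under class c_j.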

Next, in step S23, the machine learning processing unit 14 uses the values of P(CPt_i|c_j), P(RPt_i|c_j), and P(SPt_i|c_j) calculated above by equations (3), (4), and (5) to calculate the expected value of P(c_j|t_i) by equation (6).

Then, in step S24, the machine learning processing unit 14 uses the result of equation (6) to calculate the value of P(c_j) by equation (7).

In equation (7), |c| denotes the number of classes into which items are classified. Here there are two classes, c_0 and c_1, so |c| is 2.

Then, in step S25, the machine learning processing unit 14 checks the convergence condition. If the process has not converged (step S25: NO), it returns to step S22. If it has converged (step S25: YES), the entire learning process shown in this flowchart ends.

Specifically, convergence is judged by whether the change ΔP(c_j) in the value of P(c_j) calculated in step S24, relative to its previously calculated value, is less than a predetermined threshold (for example, 1.0 × 10^-3). If the change ΔP(c_j) is equal to or greater than the threshold (step S25: NO), the process returns to step S22 and, following the procedure of this flowchart again, calculates the values of P(CPt_i|c_j), P(RPt_i|c_j), and P(SPt_i|c_j) using the new values of P(c_j) and P(c_j|t_i) (step S22). Steps S22 to S25 are repeated until the change ΔP(c_j) in step S25 falls below the threshold. When the change ΔP(c_j) in P(c_j) is smaller than the threshold in step S25 (step S25: YES), the entire learning process shown in this flowchart ends.
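The loop of steps S22 to S25 can be sketched as below. The M step follows the term-by-term description of equations (3) to (5). The forms used for equations (6) and (7) are assumed to be the naive Bayes updates of the Nigam et al. reference cited earlier and are not confirmed by this text. All function names, the data layout, the initial prior, and the `max_iter` guard are hypothetical.

```python
from collections import defaultdict

def m_step(triples, post, n_classes=2):
    """S22: smoothed P(value | c_j) for each position (pair, noun, structure)."""
    tables = []
    for pos in range(3):
        values = {t[pos] for t in triples}
        table = {}
        for j in range(n_classes):
            weight = defaultdict(float)
            for t, p in zip(triples, post):
                weight[t[pos]] += p[j]  # sum of P(c_j|t_k) over triples containing it
            denom = len(values) + sum(weight.values())
            table[j] = {v: (1.0 + weight[v]) / denom for v in values}
        tables.append(table)
    return tables

def e_step(triples, tables, prior):
    """S23 (assumed form of eq. 6): posterior P(c_j | t_i) by Bayes' rule."""
    post = []
    for cp, rp, sp in triples:
        scores = [prior[j] * tables[0][j][cp] * tables[1][j][rp] * tables[2][j][sp]
                  for j in range(len(prior))]
        z = sum(scores)
        post.append([s / z for s in scores])
    return post

def update_prior(post, n_classes=2):
    """S24 (assumed form of eq. 7): smoothed class prior P(c_j)."""
    return [(1.0 + sum(p[j] for p in post)) / (n_classes + len(post))
            for j in range(n_classes)]

def run_em(triples, post, threshold=1e-3, max_iter=100):
    """Repeat S22 to S25 until the change in P(c_j) is below the threshold."""
    prior = update_prior(post)  # assumption: prior derived from initial posteriors
    for _ in range(max_iter):   # max_iter is a safety guard, not in the description
        tables = m_step(triples, post)
        post = e_step(triples, tables, prior)
        new_prior = update_prior(post)
        delta = max(abs(a - b) for a, b in zip(new_prior, prior))
        prior = new_prior
        if delta < threshold:   # S25: YES
            break
    return prior, tables, post
```

Each triple is given as a `(noun_pair, co_noun, structure)` tuple of hashable values, and `post` holds one `[P(c_0|t), P(c_1|t)]` list per triple, for example the initial probabilities from step S21.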

Returning to the flowchart of FIG. 2, in step S06 the machine learning processing unit 14 writes the probability values P(c_j|t_i), P(CPt_i|c_j), P(RPt_i|c_j), and P(SPt_i|c_j) that were calculated last in the above procedure to the learning result data storage unit 3.

Then, in step S07, the probability value calculation processing unit 15 reads P(c_j|t_i), P(CPt_i|c_j), P(RPt_i|c_j), and P(SPt_i|c_j), which were output by the machine learning processing unit 14 and written to the learning result data storage unit 3, and calculates probability values from them. The probability value calculation processing unit 15 calculates three values: the probability P(c_j|CPt_i) that the processing target noun pair has a relation, the probability P(c_j|RPt_i) that the noun co-occurring with the processing target noun pair expresses a relation, and the probability P(c_j|SPt_i) that the syntax structure between the processing target noun pair and the co-occurring noun is a structure indicating a relation. These are calculated by equations (8), (9), and (10), respectively.

The probability value calculation processing unit 15 outputs these calculated values as output data 4. The probability value P(c_j|CPt_i) indicates the degree to which the noun pair CPt_i has a relation. The probability value P(c_j|RPt_i) indicates the degree to which the noun RPt_i co-occurring with the noun pair expresses a relation. The probability value P(c_j|SPt_i) indicates the degree to which the syntax structure SPt_i between the processing target noun pair and the co-occurring noun is a structure indicating a relation. Judgments can be made on the basis of these output data.

The probability value calculation processing unit 15 may also judge whether each of the probability values P(c_j|CPt_i), P(c_j|RPt_i), and P(c_j|SPt_i) is equal to or greater than a predetermined threshold and output the result of that judgment.

＜Example of processing results＞
The results of applying the above series of processes for extracting relations from text to real data are described next. Here, the processing target attribute (target concept) is "animals", and the processing target data is closed caption data from television programs about animals produced and broadcast by the Japan Broadcasting Corporation (NHK).

FIG. 4 is a schematic diagram listing the extracted noun pairs CPt_i together with their probability values P(c_0|CPt_i). The data in this figure are the data output by the probability value calculation processing unit 15, sorted in ascending order of P(c_0|CPt_i). Because P(c_1|CPt_i) = 1 - P(c_0|CPt_i), noun pairs listed higher in this figure are more likely to have a relation. For example, P(c_0|CPt_i) is 0.031 for the noun pair in which "noun 1" is "イルカ" (dolphin) and "noun 2" is "ボラ" (mullet). P(c_0|CPt_i) is 0.044 for the noun pair in which "noun 1" is "サケ" (salmon) and "noun 2" is "ヒグマ" (brown bear). P(c_0|CPt_i) is also 0.044 for the noun pair in which "noun 1" is "シロフクロウ" (snowy owl) and "noun 2" is "レミング" (lemming). The same applies to the remaining rows.
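The ordering used in FIG. 4 can be reproduced with a plain ascending sort on P(c_0|CPt_i). The sketch below uses only the three values quoted above. The list layout and variable names are assumptions made for illustration.

```python
# (noun pair, P(c_0 | CP)) for the three rows quoted from FIG. 4.
rows = [
    (("サケ", "ヒグマ"), 0.044),
    (("イルカ", "ボラ"), 0.031),
    (("シロフクロウ", "レミング"), 0.044),
]

# Ascending P(c_0) puts the pairs most likely to have a relation first.
ranked = sorted(rows, key=lambda row: row[1])

# P(c_1 | CP) = 1 - P(c_0 | CP) for each ranked pair.
with_c1 = [(pair, p0, 1.0 - p0) for pair, p0 in ranked]
```

Python's `sorted` is stable, so rows with equal probability keep their input order.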

FIG. 5 is a schematic diagram listing the extracted co-occurrence nouns RPt_i together with their probability values P(c_0|RPt_i). The data in this figure are the data output by the probability value calculation processing unit 15, sorted in ascending order of P(c_0|RPt_i). Because P(c_1|RPt_i) = 1 - P(c_0|RPt_i), co-occurrence nouns listed higher in this figure are more likely to be nouns that express a relation. For example, P(c_0|RPt_i) is 0.011 when "noun 3" is "仲間" (companion), and 0.012 when "noun 3" is "食べる" (eat). The same applies to the remaining rows.

FIG. 6 is a schematic diagram listing the syntax structures SPt_i between the processing target noun pair and the co-occurring noun together with their probability values P(c_0|SPt_i). The data in this figure are the data output by the probability value calculation processing unit 15, sorted in ascending order of P(c_0|SPt_i). Because P(c_1|SPt_i) = 1 - P(c_0|SPt_i), syntax structures listed higher in this figure are more likely to be structures that express a relation.

The notation for syntax structures in this figure is as follows. Among the symbols used, "NP1" stands for noun 1, "NP2" for noun 2, and "REL" for the relation-candidate noun. The notation follows this pattern: the common dependency-destination clause of noun 1, noun 2, and the co-occurrence word is taken out, and three structures are written separated by the separator character "=", namely the syntax structure from noun 1 to the common dependency-destination clause, the syntax structure from noun 2 to the common dependency-destination clause, and the syntax structure that modifies the common dependency-destination clause. This first pattern is used when the relation-candidate noun appears after noun 1 and noun 2.

For example, the data in the first row of this figure corresponds to a syntax structure in which the structure from noun 1 to the relation-candidate noun is "NP1, は", the structure from noun 2 to the relation-candidate noun is "NP2, を", and the structure that modifies the relation-candidate noun is "REL". P(c_0|SPt_i) for this structure is 0.034. The data in the other rows are read in the same way.
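A row in this notation can be split back into its three lists as sketched below. The exact characters used in the figure are not shown in the text, so the ASCII separator "=", the ASCII comma, and the function name `parse_structure` are assumptions.

```python
def parse_structure(notation, sep="=", item_sep=","):
    """Split a FIG. 6 style notation into its three word lists."""
    return [[item.strip() for item in part.split(item_sep)]
            for part in notation.split(sep)]

# First-row example: NP1 marked by は, NP2 marked by を, then REL.
parsed = parse_structure("NP1,は=NP2,を=REL")
# parsed[0] is the structure from noun 1, parsed[1] from noun 2,
# and parsed[2] the structure modifying the common destination clause.
```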

The language processing apparatus 1 according to this embodiment extracts the common dependency destination of noun 1 and noun 2 in the sentence and processes only those cases in which the relation-candidate noun is contained in the syntax structure from noun 1 to the common dependency destination or in the syntax structure from noun 2 to the common dependency destination. Cases in which neither of these syntax structures contains the relation-candidate noun, and cases in which the relation-candidate noun precedes noun 1, are excluded from processing.

The example results shown in these figures can be judged to be appropriate. That is, plausible word pairs, co-occurrence nouns naming relations, and syntax structures expressing relations appear near the top of the processing results. This confirms that the language processing apparatus 1 according to this embodiment is effective.

All or some of the functions of the language processing apparatus in the embodiment described above may be implemented by a computer. In that case, a program for implementing these functions may be recorded on a computer-readable recording medium, and the functions may be realized by having a computer system read and execute the program recorded on that medium. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. The "computer-readable recording medium" may further include a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or over a communication line such as a telephone line, and a medium that holds the program for a certain period, such as volatile memory inside the computer system acting as the server or client in that case. The program may implement only part of the functions described above, or it may implement those functions in combination with a program already recorded in the computer system.

Although a plurality of embodiments have been described above, the present invention can also be implemented in the following modifications.
In the above embodiment, the language processing apparatus integrally includes the input text storage unit 2, the learning result data storage unit 3, the output data 4, the processing target word pair feature extraction unit 11, the co-occurrence noun feature extraction unit 12, the syntax structure feature extraction unit 13, the machine learning processing unit 14, and the probability value calculation processing unit 15. Alternatively, the configuration may be divided into, for example, an apparatus that includes the processing target word pair feature extraction unit 11, the co-occurrence noun feature extraction unit 12, the syntax structure feature extraction unit 13, and the machine learning processing unit 14 and performs the processing up to and including machine learning, and an apparatus that includes the probability value calculation processing unit 15 and performs the probability value calculation processing (judgment processing) using given learning result data. In this case, the learning result data is passed between the two apparatuses via storage means shared by both or via a communication line. Dividing the apparatus in this way allows the processing up to machine learning and the probability value calculation processing to be performed separately. The machine learning processing can also be performed in advance, and the probability value calculation processing can then be performed repeatedly using the resulting learning result data. Furthermore, the probability value calculation processing can be applied to sentences not included in the original input text, provided that they belong to a field similar to that of the input text (that is, sentences for which the learning result data is valid).
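The division into a learning-side apparatus and a judgment-side apparatus can be illustrated with the following minimal sketch. It is not the implementation of the embodiment. The JSON file used as shared storage, the field names `prior` and `pair_given_class`, the class labels `c0` and `c1`, the function names, and all numeric values are assumptions introduced only for illustration. The sketch also assumes that class priors are available alongside the stored conditional probabilities, and obtains the probability that a sentence belongs to the relation class by applying Bayes' rule.

```python
import json
import os
import tempfile


def write_learning_result(path, prior, pair_given_class):
    """Learning side: write learning result data to shared storage.

    prior[c] is an assumed class prior P(c); pair_given_class[c][pair]
    is the conditional probability P(pair | c). Both are hypothetical names.
    """
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"prior": prior, "pair_given_class": pair_given_class}, f)


def posterior_c0(path, pair):
    """Judgment side: read the data and compute P(c0 | pair) by Bayes' rule."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Joint probability P(c) * P(pair | c) for every class c.
    joint = {
        c: data["prior"][c] * data["pair_given_class"][c].get(pair, 0.0)
        for c in data["prior"]
    }
    total = sum(joint.values())
    # An unseen pair has zero mass in every class; return 0.0 in that case.
    return joint["c0"] / total if total > 0 else 0.0


# Illustrative values only ("c0": relation class, "c1": any other sentence).
path = os.path.join(tempfile.mkdtemp(), "learning_result.json")
write_learning_result(
    path,
    prior={"c0": 0.5, "c1": 0.5},
    pair_given_class={"c0": {"A|B": 0.3}, "c1": {"A|B": 0.1}},
)
p = posterior_c0(path, "A|B")
```

Because the judgment side depends only on the stored file, it can be run repeatedly, or on sentences outside the original input text, without repeating the learning step. The same pattern would apply if the data were passed over a communication line instead of a shared file.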

The embodiment of the present invention has been described above in detail with reference to the drawings. The specific configuration is not limited to this embodiment, however, and also includes designs and the like that do not depart from the gist of the present invention.

The present invention can be used for automatic extraction of information from large amounts of text, for knowledge acquisition, and the like.

A block diagram showing the functional configuration of the language processing apparatus according to an embodiment of the present invention.
A flowchart showing the overall processing procedure of the language processing apparatus according to the embodiment.
A flowchart showing the procedure of the machine learning processing using the EM algorithm, among the processing of the language processing apparatus according to the embodiment.
A schematic diagram of processing result data according to the embodiment, showing the processing target noun pairs with the highest probability of having a relation, in ascending order of P(c_0 | CPt_i).
A schematic diagram of processing result data according to the embodiment, showing the nouns co-occurring with a noun pair that have the highest probability of representing a relation, in ascending order of P(c_0 | RPt_i).
A schematic diagram of processing result data according to the embodiment, showing the syntax structures between a processing target noun pair and a co-occurring noun that have the highest probability of being a structure indicating a relation, in ascending order of P(c_0 | SPt_i).

Explanation of symbols

1 Language processing apparatus
2 Input text storage unit
3 Learning result data storage unit
4 Output data
11 Processing target word pair feature extraction unit
12 Co-occurrence noun feature extraction unit (co-occurrence word feature extraction unit)
13 Syntax structure feature extraction unit
14 Machine learning processing unit
15 Probability value calculation processing unit

Claims

A language processing apparatus comprising:
a processing target word pair feature extraction unit that selects, from input text data including a plurality of sentences, a pair of words included in one sentence as a processing target word pair, and extracts a predetermined feature of the appearance frequency of the processing target word pair in the input text data;
a co-occurrence word feature extraction unit that selects, as a co-occurrence word, another word appearing in a sentence that includes the processing target word pair in the input text data, and extracts a predetermined feature of the appearance frequency of the co-occurrence word in the input text data;
a syntax structure feature extraction unit that extracts a syntax structure of the sentence including the processing target word pair and the co-occurrence word in the input text data, and extracts a predetermined feature of the appearance frequency of the syntax structure in the input text data; and
a machine learning processing unit that performs machine learning processing based on information on the sentences in the input text data for which it can be determined, by referring to processing target concept related word data in which words related to a processing target concept are stored in advance, that the co-occurrence word belongs to a class representing a relation of the processing target word pair, on the appearance frequency feature of the processing target word pair, on the appearance frequency feature of the co-occurrence word, and on the appearance frequency feature of the syntax structure, thereby calculates the conditional probability that the processing target word pair appears given that a sentence belongs to the class representing the relation of the processing target word pair, the conditional probability that the co-occurrence word appears given that a sentence belongs to the class, and the conditional probability that the syntax structure appears given that a sentence belongs to the class, and writes these probabilities into a learning result data storage unit as learning result data.

The language processing apparatus according to claim 1,
wherein the syntax structure feature extraction unit identifies, based on a syntax analysis result of the sentence, a common dependency clause shared by a first word included in the processing target word pair, a second word included in the processing target word pair, and the co-occurrence word, and identifies the syntax structure of the sentence by combining the syntax structure from the first word to the common dependency clause, the syntax structure from the second word to the common dependency clause, and the syntax structure that modifies the common dependency clause.

The language processing apparatus according to claim 2,
wherein the syntax structure feature extraction unit treats, as a syntax structure group having similar syntax structures, a plurality of syntax structures for which the ratio of shared words, among the words that appear in the word sequences representing the syntax structures other than the first word, the second word, and the co-occurrence word, is equal to or greater than a predetermined threshold, and extracts the appearance frequency feature of the syntax structure group as the appearance frequency feature of the syntax structure.

The language processing apparatus according to any one of claims 1 to 3,
further comprising a probability value calculation processing unit that, using the learning result data read from the learning result data storage unit, calculates the conditional probability that a sentence belongs to the class given that the processing target word pair appears in the sentence, the conditional probability that a sentence belongs to the class given that the co-occurrence word appears in the sentence, and the conditional probability that a sentence belongs to the class given that the syntax structure appears in the sentence.

A language processing apparatus comprising a probability value calculation processing unit that, using the learning result data written into the learning result data storage unit by the language processing apparatus according to claim 1, calculates the conditional probability that a sentence belongs to the class given that the processing target word pair appears in the sentence, the conditional probability that a sentence belongs to the class given that the co-occurrence word appears in the sentence, and the conditional probability that a sentence belongs to the class given that the syntax structure appears in the sentence.

A program that causes a computer to execute:
a processing target word pair feature extraction process of selecting, from input text data including a plurality of sentences, a pair of words included in one sentence as a processing target word pair, and extracting a predetermined feature of the appearance frequency of the processing target word pair in the input text data;
a co-occurrence word feature extraction process of selecting, as a co-occurrence word, another word appearing in a sentence that includes the processing target word pair in the input text data, and extracting a predetermined feature of the appearance frequency of the co-occurrence word in the input text data;
a syntax structure feature extraction process of extracting a syntax structure of the sentence including the processing target word pair and the co-occurrence word in the input text data, and extracting a predetermined feature of the appearance frequency of the syntax structure in the input text data; and
a machine learning process of calculating, based on information on the sentences in the input text data for which it can be determined, by referring to processing target concept related word data in which words related to a processing target concept are stored in advance, that the co-occurrence word belongs to a class representing a relation of the processing target word pair, on the appearance frequency feature of the processing target word pair, on the appearance frequency feature of the co-occurrence word, and on the appearance frequency feature of the syntax structure, the conditional probability that the processing target word pair appears given that a sentence belongs to the class representing the relation of the processing target word pair, the conditional probability that the co-occurrence word appears given that a sentence belongs to the class, and the conditional probability that the syntax structure appears given that a sentence belongs to the class, and writing these probabilities into a learning result data storage unit as learning result data.