JP2546515B2

JP2546515B2 - Information extraction device

Info

Publication number: JP2546515B2
Application number: JP5230701A
Authority: JP
Inventors: 真一安藤; 伸一土井
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1993-09-17
Filing date: 1993-09-17
Publication date: 1996-10-23
Anticipated expiration: 2011-10-23
Also published as: JPH0785071A

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to an information extraction system that analyzes a document written in natural language, extracts information in a predetermined field, and outputs information, including the relationships between words in the document, in a fixed format.

[0002]

[Prior Art] Conventional information extraction methods, which extract information in a specific field from text down to the relationships between words and output it in a fixed format, include methods based on keyword occurrence and methods based on syntactic analysis. The keyword-based method is given, in advance, keywords related to the field of the information to be extracted and to the output format, and extracts information based on the occurrence and co-occurrence of those keywords in the input document. However, because this method ignores sentence structure, it often extracts inappropriate information in which the keywords are present but the relationship between the words does not actually hold. The method based on syntactic analysis re-analyzes the syntax tree obtained by parsing and attempts to generate a fixed interpretation tree that does not depend on meaning. However, because the object of analysis is the syntax tree, a slight difference in the syntax tree changes the extraction result.

[0003]

[Problems to Be Solved by the Invention] The method based on keyword occurrence can output more of the information to be extracted when the number of keywords supplied is increased. However, because it ignores sentence structure and determines the output simply by whether keywords appear, the extracted output contains much inappropriate information in which the relationship between words does not hold. In addition, no information can be extracted for words that are not registered as keywords. The method based on syntactic analysis identifies the syntactic structure of the sentence and can therefore obtain correct extraction results. However, parsing technology alone cannot fully resolve syntactic ambiguity, and it is difficult to obtain the correct parse tree. Consequently, when applied to real documents, this method often fails to extract documents that do contain the information.

[0004] The object of the present invention is to extract a large amount of correct information, and to extract it accurately, by performing syntactic analysis while identifying the relationships between keywords.

[0005]

[Means for Solving the Problems] The first invention is characterized by comprising: a document input unit that receives as input a document written in natural language; a morpheme dictionary unit that records morphemes and syntactic information for each morpheme; a keyword dictionary unit that records keywords relating to a predetermined field of information to be extracted and, for each keyword, the role that the keyword plays in the format to be finally output; a morphological analysis unit that segments a sentence input from the document input unit into words and assigns the dictionary contents of the morpheme dictionary unit and the keyword dictionary unit to each word; a syntactic analysis rule storage unit that stores rules for parsing the input sentence using the syntactic information stored in the morpheme dictionary unit; an inter-keyword relation calculation rule storage unit that stores rules for generating a semantic structure indicating inter-keyword relationships while controlling the syntactic analysis rules by the keyword information given to the keywords; a document information extraction unit that analyzes the sentence morphologically analyzed by the morphological analysis unit, using the syntactic analysis rules stored in the syntactic analysis rule storage unit and the inter-keyword relation calculation rules stored in the inter-keyword relation calculation rule storage unit, and outputs a semantic structure indicating inter-keyword relationships; and an extraction result output unit that converts the semantic structure of the entire document output by the document information extraction unit into the output format and outputs it.

[0006] The second invention is characterized in that, in the first invention, it further comprises a keyword estimation rule storage unit that stores rules for estimating, from the syntactic structure, a morpheme string that does not exist in the keyword dictionary unit as a keyword, and in that the document information extraction unit analyzes the sentence morphologically analyzed by the morphological analysis unit, using the syntactic analysis rules stored in the syntactic analysis rule storage unit, the inter-keyword relation calculation rules stored in the inter-keyword relation calculation rule storage unit, and the keyword estimation rules stored in the keyword estimation rule storage unit, and outputs a semantic structure indicating inter-keyword relationships.

[0007]

[Embodiments] The present invention will now be described with reference to the drawings.

[0008] FIG. 1 is a block diagram showing an embodiment of the first invention. Referring to FIG. 1, the present invention comprises: a document input unit 1 that receives as input a document written in natural language; a morpheme dictionary unit 2 that records morphemes and syntactic information for each morpheme; a keyword dictionary unit 3 that records keywords relating to a predetermined field of information to be extracted and, for each keyword, the role that the keyword plays in the format to be finally output; a morphological analysis unit 4 that segments a sentence input from the document input unit 1 into words and assigns the dictionary contents of the morpheme dictionary unit 2 and the keyword dictionary unit 3 to each word; a syntactic analysis rule storage unit 5 that stores rules for parsing the input sentence using the syntactic information stored in the morpheme dictionary unit 2; an inter-keyword relation calculation rule storage unit 6 that stores rules for generating a semantic structure indicating inter-keyword relationships while controlling the syntactic analysis rules by the keyword information given to the keywords; a document information extraction unit 7 that analyzes the sentence morphologically analyzed by the morphological analysis unit 4, using the syntactic analysis rules stored in the syntactic analysis rule storage unit 5 and the inter-keyword relation calculation rules stored in the inter-keyword relation calculation rule storage unit 6, and outputs a semantic structure indicating inter-keyword relationships; and an extraction result output unit 8 that converts the semantic structure of the entire document output by the document information extraction unit 7 into the output format and outputs it.

[0009] FIG. 2 is a block diagram showing an embodiment of the second invention. Referring to FIG. 2, the present invention comprises, in addition to the elements of the first invention, a keyword estimation rule storage unit 9 that stores rules for estimating, from the syntactic structure, a morpheme string that does not exist in the keyword dictionary unit 3 as a keyword. In this case, the document information extraction unit 7 analyzes the sentence morphologically analyzed by the morphological analysis unit 4, using the keyword estimation rules stored in the keyword estimation rule storage unit 9 in addition to the syntactic analysis rules stored in the syntactic analysis rule storage unit 5 and the inter-keyword relation calculation rules stored in the inter-keyword relation calculation rule storage unit 6, and outputs a semantic structure indicating inter-keyword relationships.

[0010] Next, the operation of the embodiments of the present invention will be described with reference to FIGS. 1 and 2.

[0011] As an embodiment of the present invention, consider the case of extracting information about which companies develop, manufacture, sell, or use apparatus for layering, a semiconductor manufacturing process. Suppose, for example, that an output format such as that of FIG. 3 is given. The relation column shown in FIG. 3 holds whichever of development, manufacture, sale, and use apply, and the company and apparatus columns are filled with the company and the apparatus that stand in the relation given in the relation column.
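The output format of FIG. 3 can be pictured as a record with three slots. The following is a minimal Python sketch of such a record, not the format defined by the patent: the class name `ExtractionRecord` and the field names `relations`, `company`, and `apparatus` are hypothetical, and the check on the relation slot simply assumes it is limited to the four values named above.

```python
from dataclasses import dataclass

# The four relation values named in paragraph [0011].
ALLOWED_RELATIONS = {"development", "manufacture", "sale", "use"}


@dataclass(frozen=True)
class ExtractionRecord:
    """Hypothetical stand-in for one row of the FIG. 3 output format."""
    relations: frozenset  # one or more of ALLOWED_RELATIONS
    company: str
    apparatus: str

    def __post_init__(self):
        # Reject an empty relation slot or a value outside the four relations.
        if not self.relations or not self.relations <= ALLOWED_RELATIONS:
            raise ValueError(f"unsupported relations: {set(self.relations)}")


record = ExtractionRecord(frozenset({"development"}), "NEC", "sputtering apparatus")
```

Using a set for the relation slot reflects the statement that several of the four relations may apply to the same company and apparatus.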

[0012] For example, suppose that the sentence "NEC developed a sputtering apparatus" is input from the document input unit 1. The morphological analysis unit 4 segments this sentence into words and gives each word the dictionary information of the morpheme dictionary unit 2 and the keyword dictionary unit 3. When "NEC", "sputtering apparatus", and "developed" are keywords for a company name, an apparatus name, and a relation, respectively, the syntactic structure shows that these three keywords are related, and the output of FIG. 3 is obtained.
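The segmentation and dictionary lookup performed by the morphological analysis unit 4 can be sketched as follows. This is an illustration under stated assumptions, not the patent's algorithm: the patent does not say how segmentation is done, so greedy longest-match is swapped in as a simple stand-in, and the dictionary contents, part-of-speech labels, and the function name `segment` are all hypothetical.

```python
# Toy morpheme dictionary: surface form -> syntactic information (assumed labels).
MORPHEMES = {
    "日本電気": "noun", "スパッタリング装置": "noun", "開発": "verbal-noun",
    "が": "particle", "を": "particle", "し": "verb", "た": "auxiliary",
}
# Toy keyword dictionary: keyword -> role in the output format.
KEYWORDS = {"日本電気": "company", "スパッタリング装置": "apparatus", "開発": "relation"}


def segment(sentence):
    """Greedy longest-match segmentation that attaches both dictionaries' info."""
    tokens, i = [], 0
    max_len = max(map(len, MORPHEMES))
    while i < len(sentence):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            word = sentence[i:i + n]
            if word in MORPHEMES:
                tokens.append((word, MORPHEMES[word], KEYWORDS.get(word)))
                i += n
                break
        else:
            raise ValueError(f"no dictionary entry at position {i}")
    return tokens


tokens = segment("日本電気がスパッタリング装置を開発した")
```

Each token carries its surface form, its syntactic information, and its keyword role (or `None` for non-keywords), which is the information the later parsing and relation rules would consume.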

[0013] With the keyword-based method, if, for example, "NEC" is not registered as a company-name keyword, the information of FIG. 3 cannot be extracted. In the present invention, by contrast, the keyword estimation rules stored in the keyword estimation rule storage unit 9 can infer that the character string "NEC" is the developer of the apparatus, because it stands as the subject of the clause "developed the apparatus", and can thereby identify "NEC" as a company name. The output of FIG. 3 can therefore be obtained even when "NEC" is not registered as a keyword.
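The estimation step can be illustrated with a single rule. The sketch below assumes the clause has already been parsed into subject, object, and predicate slots and uses English glosses instead of Japanese morphemes; the function name `estimate_roles`, the slot names, and the rule itself are hypothetical stand-ins for whatever the keyword estimation rule storage unit 9 actually holds.

```python
# "NEC" is deliberately absent to mimic an unregistered company name.
KEYWORDS = {"sputtering apparatus": "apparatus", "developed": "relation"}


def estimate_roles(clause):
    """clause maps 'subject', 'object', 'predicate' to words of a parsed clause."""
    roles = {slot: KEYWORDS.get(word) for slot, word in clause.items()}
    # Assumed estimation rule: an unregistered subject of a relation predicate
    # whose object is an apparatus keyword is presumed to be a company.
    if (roles["subject"] is None
            and roles["predicate"] == "relation"
            and roles["object"] == "apparatus"):
        roles["subject"] = "company"
    return roles


roles = estimate_roles(
    {"subject": "NEC", "object": "sputtering apparatus", "predicate": "developed"}
)
```

The rule fires only when the rest of the clause is already explained by registered keywords, so an unknown subject next to an unknown object is left unlabeled.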

[0014] Furthermore, for a sentence from which keyword occurrence alone would extract information erroneously, the syntactic structure can be used to determine that the sentence contains no information to be extracted.

[0015] For example, the sentence "NEC developed a sputtering material" contains three kinds of keywords, "NEC", "sputtering", and "developed", so a judgment based on keyword occurrence alone would extract the information of FIG. 3. However, the apparatus keyword "sputtering" modifies "material", and what NEC developed is a material. The present invention can detect from the syntactic structure that "NEC developed a material" and, as shown in FIG. 4, can determine that what was developed is not a layering apparatus.
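The contrast between co-occurrence and structure can be made concrete with a toy dependency representation. In this sketch each token is a `(word, head_index)` pair, with `None` marking the root. The hand-written parses, the English glosses, and both function names are assumptions for illustration, and the structural check is a simplified stand-in for the patent's inter-keyword relation rules.

```python
KEYWORDS = {
    "NEC": "company",
    "sputtering": "apparatus",
    "sputtering apparatus": "apparatus",
    "developed": "relation",
}


def cooccurrence_extract(tokens):
    """Keyword method: fire whenever all three keyword roles appear."""
    roles = {KEYWORDS[w] for w, _ in tokens if w in KEYWORDS}
    return roles >= {"company", "apparatus", "relation"}


def structural_extract(tokens):
    """Fire only if a company and an apparatus keyword both depend directly
    on the same relation keyword."""
    for idx, (word, _) in enumerate(tokens):
        if KEYWORDS.get(word) != "relation":
            continue
        dependents = {KEYWORDS.get(w) for w, head in tokens if head == idx}
        if {"company", "apparatus"} <= dependents:
            return True
    return False


# "NEC developed a sputtering material": "sputtering" depends on "material".
material = [("NEC", 3), ("sputtering", 2), ("material", 3), ("developed", None)]
# "NEC developed a sputtering apparatus".
apparatus = [("NEC", 2), ("sputtering apparatus", 2), ("developed", None)]
```

On the material sentence `cooccurrence_extract` fires while `structural_extract` does not, because the apparatus keyword attaches to "material" and not to the relation word.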

[0016] Furthermore, by using the lexical knowledge contained in the keywords, the present invention can reduce the ambiguity that arises during syntactic analysis.

[0017] For example, the two sentences 1) "NEC developed a sputtering apparatus and a CVD apparatus" and 2) "NEC developed a CVD apparatus with Sumitomo Metal Industries" show that the syntactic structure is ambiguous. In Japanese, both follow the pattern "A-ga B-to C-o developed" with the same sequence of case particles (the particle "to" can mean either "and" or "with"), yet in 1) B forms a coordinate structure with C, whereas in 2) B, as a comitative, attaches to "developed" together with A. Grammatical information alone cannot resolve this ambiguity. In the present invention, by contrast, "NEC" and "Sumitomo Metal Industries" can be identified as company keywords, and "sputtering apparatus" and "CVD apparatus" as apparatus keywords. By applying the inter-keyword relation rule that a coordinate structure is formed by keywords of the same kind, the invention can therefore recognize, as shown in FIG. 5, "sputtering apparatus" and "CVD apparatus" as coordinated in 1), and "NEC" and "Sumitomo Metal Industries" as coordinated in 2).
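The same-kind rule for coordination can be sketched in a few lines. The example assumes the sentence has already been reduced to the three noun phrases A, B, and C of the pattern "A-ga B-to C-o developed"; the function name `resolve_coordination`, the returned dictionary layout, and the English glosses are hypothetical.

```python
KEYWORD_TYPES = {
    "NEC": "company",
    "Sumitomo Metal Industries": "company",
    "sputtering apparatus": "apparatus",
    "CVD apparatus": "apparatus",
}


def resolve_coordination(a, b, c):
    """Decide whether B coordinates with C (objects) or with A (developers)."""
    type_a, type_b, type_c = (KEYWORD_TYPES.get(x) for x in (a, b, c))
    if type_b is not None and type_b == type_c:
        # B and C are the same kind of keyword: coordinated objects.
        return {"developers": [a], "objects": [b, c]}
    if type_b is not None and type_b == type_a:
        # B matches A: B takes part in the development together with A.
        return {"developers": [a, b], "objects": [c]}
    return None  # the keyword types do not settle the ambiguity


case1 = resolve_coordination("NEC", "sputtering apparatus", "CVD apparatus")
case2 = resolve_coordination("NEC", "Sumitomo Metal Industries", "CVD apparatus")
```

Returning `None` when B is not a registered keyword keeps the sketch honest about its limits: without keyword types, the two readings remain indistinguishable.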

[0018]

[Effects of the Invention] In the present invention, supplying sufficient keywords allows many correct outputs to be extracted. Because the information to be extracted is generated as the result of identifying the relationships between keywords, the output of erroneous extraction results can be reduced. Even when the supplied keywords are insufficient, words that should be keywords can be identified from the syntactic structure, so more correct outputs can be extracted. Furthermore, because a semantic structure consisting of inter-keyword relationships is generated while syntactic analysis is performed, syntactic ambiguity can be greatly reduced. Because the syntactic analysis uses the lexical knowledge contained in the keywords to control the syntax rules, ambiguity can be reduced. In addition, words that are not registered as keywords can be inferred to be keywords from the syntactic structure, and information can be extracted for them as well.

[Brief description of drawings]

FIG. 1 is a block diagram of an embodiment of the first invention.

FIG. 2 is a block diagram of an embodiment of the second invention.

FIG. 3 is a diagram illustrating the input and output of an embodiment of the present invention.

FIG. 4 is a diagram illustrating the input and output of an embodiment of the present invention.

FIG. 5 is a diagram illustrating the input and output of an embodiment of the present invention.

[Explanation of symbols]

1 document input unit
2 morpheme dictionary unit
3 keyword dictionary unit
4 morphological analysis unit
5 syntactic analysis rule storage unit
6 inter-keyword relation calculation rule storage unit
7 document information extraction unit
8 extraction result output unit
9 keyword estimation rule storage unit

Claims

(57) [Claims]

1. An information extraction device comprising: a document input unit that receives as input a document written in natural language; a morpheme dictionary unit that records morphemes and syntactic information for each morpheme; a keyword dictionary unit that records keywords relating to a predetermined field of information to be extracted and, for each keyword, the role that the keyword plays in the format to be finally output; a morphological analysis unit that segments a sentence input from the document input unit into words and assigns the dictionary contents of the morpheme dictionary unit and the keyword dictionary unit to each word; a syntactic analysis rule storage unit that stores rules for parsing the input sentence using the syntactic information stored in the morpheme dictionary unit; an inter-keyword relation calculation rule storage unit that stores rules for generating a semantic structure indicating inter-keyword relationships while controlling the syntactic analysis rules by the keyword information given to the keywords; a document information extraction unit that analyzes the sentence morphologically analyzed by the morphological analysis unit, using the syntactic analysis rules stored in the syntactic analysis rule storage unit and the inter-keyword relation calculation rules stored in the inter-keyword relation calculation rule storage unit, and outputs a semantic structure indicating inter-keyword relationships; and an extraction result output unit that converts the semantic structure of the entire document output by the document information extraction unit into the output format and outputs it.

2. The information extraction device according to claim 1, further comprising a keyword estimation rule storage unit that stores rules for estimating, from the syntactic structure, a morpheme string that does not exist in the keyword dictionary unit as a keyword, wherein the document information extraction unit analyzes the sentence morphologically analyzed by the morphological analysis unit, using the syntactic analysis rules stored in the syntactic analysis rule storage unit, the inter-keyword relation calculation rules stored in the inter-keyword relation calculation rule storage unit, and the keyword estimation rules stored in the keyword estimation rule storage unit, and outputs a semantic structure indicating inter-keyword relationships.