JP5317922B2 - Information extraction rule creation support system

Info

Publication number: JP5317922B2
Application number: JP2009239416A
Authority: JP
Inventors: 修大島; 智靖岡田; 一彰竹原
Original assignee: Nomura Research Institute Ltd
Current assignee: Nomura Research Institute Ltd
Priority date: 2009-10-16
Filing date: 2009-10-16
Publication date: 2013-10-16
Anticipated expiration: 2029-10-16
Also published as: JP2011086167A

Abstract

PROBLEM TO BE SOLVED: To provide a technique that helps general users easily create the information extraction rules used to extract, from natural-language sentences, company information having the semantic structure company name - company activity - activity target.

SOLUTION: A support system 10 includes a dictionary storage unit 22; a morphological analysis processing unit 12; a subject identification processing unit 14; a syntax analysis processing unit 16; an abstraction processing unit 18 that attaches an abstraction tag to each clause and stores the result in a material sentence storage unit 26; and an information extraction rule generation unit 20. The rule generation unit 20 sends a rule editing screen 40 listing a plurality of material sentences to a client terminal 30 and prompts the user to select the material sentences that describe the company activity and activity target of a specific company. When material sentence selection information is received, it takes out, among other things, the dependency structure among the clause containing the company name, the clause containing the company activity, and the clause containing the activity target in the selected material sentence, generates an information extraction rule 52 written in a predetermined format, and stores it in an information extraction rule storage unit 28.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to an information extraction rule creation support system, and in particular to a technique that helps even general users without specialized knowledge easily create the information extraction rules used when extracting information elements having a semantic structure from natural-language sentences by means of syntactic parsing.

Various techniques that use syntactic parsing have been proposed for extracting sets of information elements having a semantic structure from unstructured text data written in natural language, such as Web pages published on Internet news sites.

For example, the information extraction device described in Patent Document 1 comprises: a character pattern processing unit that sequentially matches character strings in a document written in natural language against predetermined character patterns and attaches tag information indicating the type of proper noun to each matching character string portion; a morphological analysis processing unit that, leaving the tag information as it is, sequentially divides the remaining character string portions into word information; a syntax analysis unit that groups the word information obtained by morphological analysis into clauses and parses the grouped word information using grammatical syntax rules together with syntax patterns that characteristically appear in expressions of a certain kind of information; and an information extraction unit that extracts, as the required information, the information identified from the dependency relations obtained by the pattern-based analysis and from the tag information contained in those dependency relations.

With this conventional information extraction device, the structured information "<person name> Isamu Suzuki | <place name> Chuo-cho, Osaka City | <industry name> disinfection business" can be extracted from, for example, the sentence "At around 0:35 a.m. on the 5th, a fire broke out at the home of Mr. Isamu Suzuki (50), a disinfection business operator, in Chuo-cho, Osaka City, and about 125 square meters of a wooden one-story building burned down."
Patent Document 1: JP-A-11-272695 (特開平11-272695号)

However, an information extraction method premised on syntactic parsing of this kind requires a large number of syntax patterns, which designate the morphemes to be extracted, to be prepared in advance if the required information is to be extracted accurately.

For example, the sentence "Pylon, a transmission manufacturer and subsidiary of Toyo Motor, announced that it will produce continuously variable transmissions, which are highly fuel-efficient transmissions, in Guangzhou, China" formally contains several company names (Toyo Motor, Pylon), several objects (transmission, continuously variable transmission), and several activities (produce, announce). To extract the true actor (subject), activity target (object), and activity (predicate) from this sentence, the whole sentence must first be parsed to reveal the dependency structure between clauses, as shown in FIG. 17, and a tag indicating its type (<company name> and so on) must be attached to each clause by referring to various dictionaries. A syntax pattern A that defines the types of the morphemes to be extracted and their positional relationship in the sentence must then be prepared, as shown in FIG. 18.

Applying syntax pattern A to the parsing result of FIG. 17 extracts the clauses shown in FIG. 19(a) from the sentence, and applying the necessary shaping to them yields structured company information consisting of a combination of subject, predicate, and object, as shown in FIG. 19(b).

A separate syntax pattern B, shown in FIG. 20, must also be prepared to cover the case where the sentence is instead phrased as "Pylon, a transmission manufacturer and subsidiary of Toyo Motor, produces continuously variable transmissions, which are highly fuel-efficient transmissions, in Guangzhou, China."

Moreover, natural-language sentences sometimes use inverted word order for rhetorical purposes, and extracting the required information from such sentences as well required a large number of syntax patterns premised on inversion to be prepared in advance.

Because natural-language sentences have such complex and varied structures, a large number of information extraction rules (syntax patterns) are indispensable for accurately extracting information with a specific semantic structure, even when the sentences to be processed have already been parsed. Preparing those rules required mobilizing many experts versed in language analysis.

The present invention was devised to solve this conventional problem. Its object is to provide a technique that helps even general users without specialized knowledge easily create the information extraction rules used when extracting information having a specific semantic structure from parsed text.

To achieve the above object, the information extraction rule creation support system of claim 1 comprises: a dictionary in which correspondences between concrete expression strings and abstraction strings indicating their types are registered; means for decomposing each sentence in text data into morphemes; means for referring to the dictionary and associating the corresponding abstraction tag with each morpheme that corresponds to a company name, a company activity, or an activity target; means for searching each sentence for a company-name subject; means for storing sentences that have a company-name subject in material sentence storage means as material sentences; means for retrieving material sentences from the material sentence storage means and filling them into a predetermined template to generate a rule editing screen; means for sending the rule editing screen to a client terminal and prompting selection of the material sentences in which the company activity and activity target of a specific company are described; information extraction rule generation means which, when selection information for a specific material sentence is received from the client terminal, takes out as extraction conditions either the dependency structure among the clause containing the company name, the clause containing the company activity, and the clause containing the activity target in that material sentence, or the dependency structure between those clauses and clauses containing character strings other than a company name, company activity, or activity target, and generates an information extraction rule describing these conditions in a predetermined format; and means for storing the information extraction rule in information extraction rule storage means.

These information extraction rules are applied as extraction conditions when structured company information is generated from natural-language sentences. That is, the target sentence is subjected to morphological analysis and syntactic parsing, the abstraction tags for company name, company activity, and activity target are attached to the corresponding clauses by referring to the dictionary, and each information extraction rule is then tried in turn. When a rule matches, the character strings at the positions specified by that rule are taken out as the company name, company activity, and activity target.
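The matching step described above can be illustrated with a minimal sketch. The rule format used by the system is not given in this passage, so the representation here (a predicate tag plus the tags that must depend directly on it) is purely an assumption, as are the names `Clause`, `apply_rule`, the tag strings, and the parse values of the sample sentence.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Clause:
    clause_id: int
    term: str            # the tagged expression inside the clause
    head_id: int         # ID of the clause this one depends on; -1 for the root
    tag: Optional[str]   # abstraction tag, or None if the clause has none


def apply_rule(predicate_tag: str, dependent_tags: List[str],
               clauses: List[Clause]) -> Optional[Dict[str, str]]:
    """Return the extracted terms if some predicate clause satisfies the rule."""
    for predicate in clauses:
        if predicate.tag != predicate_tag:
            continue
        # Collect the tagged clauses that depend directly on this predicate.
        dependents = {c.tag: c.term for c in clauses
                      if c.head_id == predicate.clause_id and c.tag is not None}
        if all(tag in dependents for tag in dependent_tags):
            result = {predicate_tag: predicate.term}
            result.update({tag: dependents[tag] for tag in dependent_tags})
            return result
    return None  # the rule does not match this sentence


# Hypothetical parse of 「初芝は液晶テレビを関西で発売する。」
clauses = [
    Clause(0, "初芝", 3, "company_name"),
    Clause(1, "液晶テレビ", 3, "activity_target"),
    Clause(2, "関西", 3, "place"),
    Clause(3, "発売", -1, "company_activity"),
]
extracted = apply_rule("company_activity",
                       ["company_name", "activity_target"], clauses)
print(extracted)
```

A rule that fails on one sentence simply returns `None`, so a caller could try each stored rule in turn until one matches, which mirrors the "apply each rule and extract on a match" flow in the text.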

The information extraction rule creation support system of claim 2 is the system of claim 1, wherein, in each material sentence on the rule editing screen, the character string portion indicating the company name, the portion indicating the company activity, and the portion indicating the activity target are each given a marking (text decoration such as shading or coloring) that identifies its type.

The information extraction rule creation support system of claim 3 is the system of claim 1 or 2, further comprising means for referring to the dictionary and associating the corresponding abstraction tag with each morpheme that corresponds to a place. When selection information for a specific material sentence is received from the client terminal and that material sentence contains a morpheme associated with the place abstraction tag, the information extraction rule generation means takes out the dependency structure between the clause containing the place morpheme and the other clauses as an optional extraction condition and generates an information extraction rule that includes it.

The information extraction rule creation support system of claim 4 is the system of any one of claims 1 to 3, further comprising means which, when a sentence has no subject, associates a company-name subject with that sentence, provided that a preceding sentence has the company-name subject and that no sentence having another company-name subject, or having a morpheme other than a company name as its subject, lies between the two sentences.

The information extraction rule creation support system of claim 5 is the system of any one of claims 1 to 4, further comprising means which, when a sentence contains a company pronoun as its subject, replaces that pronoun with the company name of a company-name subject, provided that a preceding sentence has the company-name subject and that no sentence having another company-name subject, or having a morpheme other than a company name as its subject, lies between the two sentences.

The information extraction rule creation support system of claim 6 is the system of any one of claims 1 to 5, further comprising means for applying preset regular expression rules to each sentence, recognizing a morpheme that matches a regular expression rule as a company name, a company activity, or an activity target, and associating the corresponding abstraction tag with that morpheme.

The information extraction rule creation support system of claim 7 is the system of any one of claims 1 to 6, wherein preset dependency rules are further applied to each sentence, a morpheme in a clause that matches a dependency rule is recognized as a company name, a company activity, or an activity target, and the corresponding abstraction tag is associated with that morpheme.

Because natural-language sentences have a complex and ambiguous structure, selecting the material sentences on which information extraction rules are based inevitably relies on human judgment. With the system of claim 1, however, the user only has to judge the meaning of each material sentence on the rule editing screen displayed on the client terminal and select the sentences that substantively describe the company activity and activity target of a specific company. Generation of the material sentences and of the information extraction rules is handled automatically by the system, so even a user without expertise in language analysis can easily create information extraction rules.

In the information extraction rule creation support system of claim 2, the character string portions indicating the company name, the company activity, and the activity target are marked, so the user can recognize the points to check in each material sentence at a glance and the selection work becomes more efficient.

As described above, company information consisting of a combination of company name, company activity, and activity target takes the form subject (company name) → predicate (company activity) → object (activity target), and can be said to express a complete semantic structure in itself. Incorporating a location element into it, however, further increases the value and usefulness of the company information.
The information extraction rule creation support system of claim 3 makes it possible to create information extraction rules that can extract the location element of a company activity from natural-language sentences.

According to the information extraction rule creation support system of claim 4, even when a sentence contains no subject, if an earlier sentence contains a company-name subject, the company name is presumed to have been omitted and is recognized as the company-name subject of the subjectless sentence. As a result, sentences that contain no subject can also be put to effective use as material sentences.

According to the information extraction rule creation support system of claim 5, when a sentence has a pronoun representing a company, such as 同社 ("the company"), as its subject, the pronoun is presumed to refer to the company-name subject of an earlier sentence and is replaced with that company name. As a result, sentences whose subject is a company pronoun can also be put to effective use as material sentences.

According to the information extraction rule creation support systems of claims 6 and 7, the required abstraction tags for company name, company activity, and activity target can be associated with morphemes on a rule basis, which complements dictionary-based abstraction.

FIG. 1 is a block diagram showing the overall configuration of an information extraction rule creation support system 10 according to the present invention. The system comprises a morphological analysis processing unit 12, a subject identification processing unit 14, a syntax analysis processing unit 16, an abstraction processing unit 18, an information extraction rule generation unit 20, a dictionary storage unit 22, an abstraction rule storage unit 24, a material sentence storage unit 26, and an information extraction rule storage unit 28.
A client terminal 30 operated by a user is connected to the information extraction rule generation unit 20 via a communication network.

The morphological analysis processing unit 12, subject identification processing unit 14, syntax analysis processing unit 16, abstraction processing unit 18, and information extraction rule generation unit 20 are realized by the CPU of a computer executing the necessary processing in accordance with the OS and application programs.
The dictionary storage unit 22, abstraction rule storage unit 24, material sentence storage unit 26, and information extraction rule storage unit 28 are provided on the hard disk of the same computer.

The dictionary storage unit 22 stores a company name dictionary, a company activity dictionary, an activity target dictionary, a person name dictionary, a country name dictionary, a region name dictionary, a prefecture name dictionary, a municipality name dictionary, an animal and plant name dictionary, a synonym dictionary, and the like.

FIG. 2 illustrates the contents of the company activity dictionary. The abstraction string "production activity", a superordinate concept that is one kind of company activity, is associated in advance with concrete expression strings that can serve as predicates, such as 生産 (produce), 製造 (manufacture), 加工 (process), and 組立 (assemble). Similarly, the abstraction string "sales activity", another kind of company activity, is associated in advance with concrete expression strings such as 販売 (sell), 発売 (release), and 売り出す (put on sale). Instead of "production activity" and "sales activity", the entries may be grouped under the still more general abstraction string "company activity".

FIG. 3 illustrates the contents of the activity target dictionary. The superordinate abstraction string "production target" is associated in advance with concrete expression strings that can serve as objects, such as 液晶 (liquid crystal), 液晶テレビ (LCD television), 液晶パネル (LCD panel), 液晶モニター (LCD monitor), 携帯音楽プレーヤー (portable music player), and ギガバイト (Gigabyte). The abstraction string "sales target" is likewise associated in advance with the expression strings 液晶, 液晶テレビ, 液晶パネル, 液晶モニター, 携帯音楽プレーヤー, ギガバイト, and so on. In the figure, "common" in parentheses indicates a common noun and "proper" indicates a proper noun.
Instead of "production target" and "sales target", the more general abstraction string "activity target" may be used.

Although not illustrated, the company name dictionary contains a large number of concrete company names (official names and abbreviations) that can serve as subjects, each registered in association with the abstraction string "company name".
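As a rough illustration of how such dictionaries might be held in memory, the sketch below maps each abstraction string to its expression strings and then inverts the mapping so that an expression can be looked up directly. The English tag identifiers, the variable names, and the two company-name entries (taken from the example sentences in this description rather than from the unillustrated dictionary) are assumptions.

```python
from typing import Dict, List

# Abstraction string -> concrete expression strings (after FIG. 2 and FIG. 3).
ACTIVITY_DICTIONARY: Dict[str, List[str]] = {
    "production_activity": ["生産", "製造", "加工", "組立"],
    "sales_activity": ["販売", "発売", "売り出す"],
}
_TARGETS = ["液晶", "液晶テレビ", "液晶パネル", "液晶モニター",
            "携帯音楽プレーヤー", "ギガバイト"]
TARGET_DICTIONARY: Dict[str, List[str]] = {
    "production_target": list(_TARGETS),
    "sales_target": list(_TARGETS),
}
# Assumed sample entries; the real company name dictionary is not illustrated.
COMPANY_DICTIONARY: Dict[str, List[str]] = {
    "company_name": ["初芝", "サリー"],
}


def build_reverse_index(*dictionaries: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each expression string to every abstraction tag it is registered under."""
    index: Dict[str, List[str]] = {}
    for dictionary in dictionaries:
        for tag, expressions in dictionary.items():
            for expression in expressions:
                index.setdefault(expression, []).append(tag)
    return index


REVERSE_INDEX = build_reverse_index(
    ACTIVITY_DICTIONARY, TARGET_DICTIONARY, COMPANY_DICTIONARY)
print(REVERSE_INDEX["ギガバイト"])
```

Keeping a list of tags per expression reflects the fact that one expression can be registered under more than one abstraction string, as the target entries are here.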

Next, the processing performed by the system 10 is described.
First, the morphological analysis processing unit 12 performs morphological analysis on text data 32, such as a Web file, input from outside. Here, "morphological analysis" means the process of decomposing a sentence written in natural language into morphemes, the smallest meaningful units of language, and identifying the part of speech of each.

For example, given the sentence 「初芝は、携帯音楽プレーヤー『ギガバイト』の新製品を１５日、関西で発売する。」 ("Hatsushiba will release a new model of its portable music player 'Gigabyte' in the Kansai region on the 15th."), the morphological analysis processing unit 12 decomposes it, as shown in FIG. 4, into morphemes such as "初芝 / noun, proper noun", "は / particle, binding particle", and "携帯音楽プレーヤー / noun, common", and identifies the part of speech of each.
Morphological analysis itself is a known technique; for example, free software such as the following can be used as the morphological analysis engine.
(1) MeCab (http://mecab.sourceforge.net/)
(2) ChaSen (http://chasen.naist.jp/hiki/ChaSen/)

The morphological analysis processing unit 12 can refer to the activity target dictionary stored in the dictionary storage unit 22 and reflect its entries in the result of morphological analysis.
For example, a general-purpose morphological analysis engine would split 携帯音楽プレーヤー (portable music player) into "携帯 / noun, suru-verb stem", "音楽 / noun, common noun", and "プレーヤー / noun, common noun". Because "携帯音楽プレーヤー (common)" is registered in the activity target dictionary as a production target and a sales target, the morphological analysis processing unit 12 recognizes the compound 携帯音楽プレーヤー as a single morpheme.
Likewise, a general-purpose engine would split ギガバイト (Gigabyte) into "ギガ / noun, common noun" and "バイト / noun, common noun". Because "ギガバイト (proper)" is registered in the activity target dictionary as a production target and a sales target, the morphological analysis processing unit 12 can recognize the compound ギガバイト as a single morpheme.
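The passage does not say how the dictionary entries are fed into the analysis. One possible approach, shown here only as an assumption, is a post-processing pass that greedily rejoins adjacent morphemes whenever their concatenation is a registered entry (supplying the entries to the engine as a user dictionary would be another route). The function name `merge_compounds` and the `max_span` limit are hypothetical.

```python
from typing import List, Set


def merge_compounds(tokens: List[str], registered: Set[str],
                    max_span: int = 4) -> List[str]:
    """Rejoin runs of adjacent tokens that together form a registered entry.

    Longer runs are tried first, so the longest registered compound wins.
    """
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in registered:
                merged.append(candidate)
                i += span
                break
        else:
            # No registered compound starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Output of a general-purpose engine for part of the example sentence.
raw_tokens = ["携帯", "音楽", "プレーヤー", "『", "ギガ", "バイト", "』"]
registered_entries = {"携帯音楽プレーヤー", "ギガバイト"}
merged_tokens = merge_compounds(raw_tokens, registered_entries)
print(merged_tokens)
```

This sketch works on surface strings only; a fuller version would also carry the part of speech recorded in the dictionary ("common" or "proper") onto the merged morpheme.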

Next, the morphological analysis processing unit 12 refers to the company name dictionary, company activity dictionary, and activity target dictionary stored in the dictionary storage unit 22 and adds the corresponding abstraction tags to the part-of-speech field of the relevant morphemes.
For example, because 初芝 (Hatsushiba) has an entry in the company name dictionary, the abstraction tag "company name" is appended to its part-of-speech field. Because 携帯音楽プレーヤー and ギガバイト are registered in the activity target dictionary as production targets and sales targets, the abstraction tags "production target" and "sales target" are appended to their part-of-speech fields. Because 関西 (Kansai) is registered as a place in the region name dictionary, the abstraction tag "place" is appended. Finally, because 発売 (release) has an entry in the company activity dictionary as one type of sales activity, the abstraction tag "sales activity" is appended.
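The tagging step can be sketched as a simple lookup of each morpheme's surface string in an expression-to-tag index. The `Morpheme` class, the English tag identifiers, and the small index below are illustrative assumptions, not the system's actual data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Morpheme:
    surface: str
    pos: str                                       # part of speech from the engine
    tags: List[str] = field(default_factory=list)  # abstraction tags appended later


def attach_tags(morphemes: List[Morpheme],
                index: Dict[str, List[str]]) -> List[Morpheme]:
    """Append the abstraction tags registered for each morpheme's surface string."""
    for morpheme in morphemes:
        morpheme.tags.extend(index.get(morpheme.surface, []))
    return morphemes


# Assumed expression -> tags index covering the example in the text.
index = {
    "初芝": ["company_name"],
    "ギガバイト": ["production_target", "sales_target"],
    "関西": ["place"],
    "発売": ["sales_activity"],
}
morphemes = [
    Morpheme("初芝", "名詞-固有名詞"),
    Morpheme("は", "助詞-係助詞"),
    Morpheme("ギガバイト", "名詞-固有名詞"),
    Morpheme("関西", "名詞-固有名詞"),
    Morpheme("発売", "名詞-サ変接続"),
]
tagged = attach_tags(morphemes, index)
for m in tagged:
    print(m.surface, m.pos, m.tags)
```

Morphemes with no dictionary entry, such as the particle は, simply keep an empty tag list.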

Next, the subject identification processing unit 14 starts and, based on the data decomposed into morphemes by the morphological analysis processing unit 12, identifies the company name that serves as the subject of each sentence.
Specifically, as shown in FIG. 5, when the unit detects in a sentence a company name immediately followed by が (particle, case particle), は (particle, binding particle), or 、 (symbol, comma), the subject identification processing unit 14 recognizes that company name as the subject.
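A minimal sketch of this check follows, assuming that morphemes arrive as (surface, part of speech, tags) triples and that company names carry a `company_name` tag. The part-of-speech labels follow the ones quoted in the text; the type and function names are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Morpheme(NamedTuple):
    surface: str
    pos: str
    tags: Tuple[str, ...] = ()


# A company name directly followed by one of these marks the subject.
SUBJECT_MARKERS = {
    ("が", "助詞-格助詞"),
    ("は", "助詞-係助詞"),
    ("、", "記号-読点"),
}


def find_company_subject(morphemes: List[Morpheme]) -> Optional[str]:
    """Return the first company name immediately followed by a subject marker."""
    for current, following in zip(morphemes, morphemes[1:]):
        if ("company_name" in current.tags
                and (following.surface, following.pos) in SUBJECT_MARKERS):
            return current.surface
    return None


with_subject = [
    Morpheme("初芝", "名詞-固有名詞", ("company_name",)),
    Morpheme("は", "助詞-係助詞"),
    Morpheme("発売", "名詞-サ変接続", ("sales_activity",)),
]
without_subject = [
    Morpheme("初芝", "名詞-固有名詞", ("company_name",)),
    Morpheme("の", "助詞-連体化"),
    Morpheme("新製品", "名詞-一般"),
]
print(find_company_subject(with_subject))
print(find_company_subject(without_subject))
```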

In sentences (a) and (b) of FIG. 6, the strings 「初芝、」 and 「初芝は」 are present, so the company name 初芝 (Hatsushiba) is recognized as the company-name subject of each.
In addition, when a sentence for which a company-name subject has been recognized is followed by a sentence with no subject, the subject identification processing unit 14 lets the latter sentence inherit the previously recognized company-name subject.
For example, no string corresponding to a subject is detected in sentence (c) of FIG. 6. In this case the company-name subject that appeared in the preceding sentence is presumed to have been omitted, and 初芝 is associated with the sentence as its company-name subject.

Furthermore, for a sentence whose subject is a pronoun representing a company, such as 同社 ("the company"), the subject identification processing unit 14 performs processing (anaphora resolution) that replaces the pronoun with the company-name subject of the preceding sentence.
For example, when the sentence 「同社の新製品は、ソフコムが提供する公衆無線LANサービス…」 ("The company's new product ... the public wireless LAN service provided by Sofcom ...") of FIG. 6(d) is present, the company pronoun 同社 is replaced with the company-name subject 初芝 associated with the immediately preceding sentence, as shown in (d)'.

Inheritance and anaphora resolution of a company-name subject continue until a sentence with a different company-name subject appears. For example, sentence (x) of FIG. 6 contains the string "サリーは、" (Sally wa), and "Sally" is a company name, so the subject identification processing unit 14 recognizes "Sally" as the company-name subject of sentence (x).
If sentences without a string corresponding to a subject follow, the subject identification processing unit 14 associates "Sally" with them as their company-name subject.
If a sentence whose subject is a company pronoun appears after that, the subject identification processing unit 14 replaces the company pronoun with "Sally".
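The inheritance and anaphora behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it works on raw strings rather than on morphological analysis output, and the names `COMPANY_NAMES`, `COMPANY_PRONOUNS`, `explicit_company_subject`, and `resolve_subjects`, as well as the sample sentences, are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical dictionary contents, for illustration only.
COMPANY_NAMES = ("初芝", "サリー")
COMPANY_PRONOUNS = ("同社",)
SUBJECT_MARKERS = ("が", "は", "、")  # case particle, binding particle, comma


def explicit_company_subject(sentence: str) -> Optional[str]:
    """Return the earliest company name directly followed by が, は or 、."""
    best: Optional[Tuple[int, str]] = None
    for name in COMPANY_NAMES:
        for marker in SUBJECT_MARKERS:
            pos = sentence.find(name + marker)
            if pos != -1 and (best is None or pos < best[0]):
                best = (pos, name)
    return best[1] if best else None


def resolve_subjects(sentences: List[str]) -> List[Tuple[Optional[str], str]]:
    """Pair each sentence with its company-name subject.

    A sentence without an explicit company subject inherits the most recent
    one, and company pronouns in it are replaced by that name.
    """
    current: Optional[str] = None
    resolved = []
    for sentence in sentences:
        found = explicit_company_subject(sentence)
        if found is not None:
            current = found  # a new company-name subject takes over
        elif current is not None:
            for pronoun in COMPANY_PRONOUNS:
                sentence = sentence.replace(pronoun, current)
        resolved.append((current, sentence))
    return resolved


if __name__ == "__main__":
    sample = [
        "初芝は、新製品を発売する。",
        "価格は未定。",
        "同社の新製品は好評だ。",
        "サリーは、新工場を建設する。",
        "同社は来月発売する。",
    ]
    for subject, text in resolve_subjects(sample):
        print(subject, text)
```

The sketch does not model the stop condition for non-company subjects (person names, country names, and so on); that would require the part-of-speech tags produced by morphological analysis.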

Inheritance of a company-name subject also stops when a sentence appears whose subject is something other than a company name (for example, a person name, country name, prefecture name, animal name, or plant name). This presupposes that, during morphological analysis, identification tags such as person name, country name, and prefecture name have been added to the part-of-speech field of each morpheme.
However, because the system 10 is intended for creating rules that extract company-related information, sentences whose subject is a morpheme other than a company name are excluded from processing.

Next, the syntax analysis processing unit 16 analyzes the dependency structure of each sentence.
FIG. 7 is a table showing the analysis result, in which the ID of the clause that each clause depends on is stored. FIG. 8 visualizes this dependency structure.
Dependency parsing itself is a known technique. For example, the following free software can be used as the parsing engine.
(1) KNP (http://nlp.kuee.kyoto-u.ac.jp/nl-resource/knp.html)
(2) CaboCha (http://chasen.org/~taku/software/cabocha/)

Next, the abstraction processing unit 18 determines, clause by clause, whether each clause contains a string (morpheme) corresponding to a company name, a corporate activity (production activity, sales activity), an activity target (production target, sales target, and so on), or a place. If it does, the corresponding abstraction tag is associated with the clause, as shown in FIG. 9(b).

Before this abstraction, the morphological analysis processing unit 12 has already consulted the dictionary DB 22 and assigned abstraction tags such as company name, sales activity, and sales target at the morpheme level. As a first step, therefore, the abstraction tag associated with each morpheme is carried over unchanged to the clause that contains it.
Specifically, as shown in FIG. 9(a), a table with morpheme ID, morpheme, part of speech, and clause ID columns is generated in memory, and the abstraction tag attached to each morpheme is copied to the corresponding clause.
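Carrying morpheme-level tags over to clauses can be sketched as below. The `Morpheme` record mirrors the table columns named in the text (morpheme ID, morpheme, part of speech, clause ID); the field names, the tag strings, and the function `propagate_tags` are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Morpheme:
    morpheme_id: int
    surface: str
    pos: str                   # part of speech
    clause_id: int             # clause that contains this morpheme
    tag: Optional[str] = None  # abstraction tag from the dictionary lookup


def propagate_tags(morphemes: List[Morpheme]) -> Dict[int, Set[str]]:
    """Copy each morpheme's abstraction tag to the clause containing it."""
    clause_tags: Dict[int, Set[str]] = {}
    for m in morphemes:
        tags = clause_tags.setdefault(m.clause_id, set())
        if m.tag is not None:
            tags.add(m.tag)
    return clause_tags


if __name__ == "__main__":
    table = [
        Morpheme(0, "初芝", "noun", 0, "company_name"),
        Morpheme(1, "は", "particle", 0),
        Morpheme(2, "新製品", "noun", 1),
        Morpheme(3, "を", "particle", 1),
        Morpheme(4, "発売", "noun", 2, "sales_activity"),
        Morpheme(5, "する", "verb", 2),
    ]
    print(propagate_tags(table))
```

Clauses without any tagged morpheme still appear in the result with an empty set, so later steps can treat them as literal strings.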

However, the number of words a dictionary can hold is inherently limited, and dictionary-based abstraction alone may miss some items. The abstraction processing unit 18 therefore also performs abstraction by regular-expression rules and abstraction by dependency rules, so that company names and activity targets not listed in the dictionary can still be associated with the corresponding abstraction tags.

[Abstraction by regular-expression rules]
For example, when the expression "新製品であるABCを〜" ("ABC, which is a new product, ...") appears in a sentence, the "ABC" part is recognized as a "production target" and the "production target" abstraction tag is assigned to the clause "ABCを". Similarly, when the expression "小売り大手の米AAAマートは、〜" ("AAA Mart of the US, a major retailer, ...") appears, the "AAAマート" (AAA Mart) part is recognized as a "company name" and the "company name" abstraction tag is assigned to the clause "AAAマートは".
For this purpose, the abstraction rule storage unit 24 stores a large number of abstraction rules in advance.

FIG. 10(a) shows an example of an abstraction rule. The rule "<company_size>の<country>(<feature:名詞>+)" defines that the nouns immediately following "company_size (a string denoting company size)" + "の" + "country (a string denoting a country)" are recognized as a company name. The alias expressions "首位, 大手, 中堅" (top-ranked, major, mid-sized) are defined for "company_size", and the alias expressions "米, 英, 欧州" (US, UK, Europe) are defined for "country".

Given the sentence "小売大手の米AAAマートは、人員削減計画を発表した。" ("AAA Mart of the US, a major retailer, announced a workforce reduction plan.") shown in FIG. 10(b), the abstraction processing unit 18 converts it into an OR expression of noun units as shown in FIG. 10(c), extracts the string "小売り大手の米AAAマート" that matches the rule, and then uses a regular-expression back-reference to take out "AAAマート" (AAA Mart), which it recognizes as a company name.
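One way to read the rule of FIG. 10 is as a regular expression in which the aliases become alternations and the noun run becomes a captured group. The sketch below follows that reading with Python's standard `re` module. It assumes the nouns of the sentence are already known from morphological analysis and are passed in as a list; the names `ALIASES`, `compile_rule`, and `extract_company` are hypothetical, and the capture group stands in for what the source calls a back-reference.

```python
import re
from typing import List, Optional

# Alias expressions as given for the rule in FIG. 10(a).
ALIASES = {
    "company_size": ["首位", "大手", "中堅"],
    "country": ["米", "英", "欧州"],
}


def _alternation(words: List[str]) -> str:
    """Build an escaped OR expression, longest alternatives first."""
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))


def compile_rule(nouns: List[str]) -> "re.Pattern[str]":
    """Compile <company_size>の<country>(<noun>+) for the given noun list."""
    size = _alternation(ALIASES["company_size"])
    country = _alternation(ALIASES["country"])
    noun = _alternation(nouns)
    return re.compile(f"(?:{size})の(?:{country})((?:{noun})+)")


def extract_company(sentence: str, nouns: List[str]) -> Optional[str]:
    """Return the noun run captured after size + の + country, if any."""
    if not nouns:
        return None
    match = compile_rule(nouns).search(sentence)
    return match.group(1) if match else None


if __name__ == "__main__":
    sentence = "小売大手の米AAAマートは、人員削減計画を発表した。"
    nouns = ["小売", "大手", "米", "AAA", "マート", "人員", "削減", "計画", "発表"]
    print(extract_company(sentence, nouns))
```

Restricting the captured group to the sentence's own nouns is what stops the match at the particle "は" instead of running on through the rest of the sentence.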

[Abstraction by dependency rules]
(1) Abstraction of company names
Under this rule, when a clause depends on a clause containing a corporate activity and its particle is "が" (ga) or "は" (wa), the morpheme immediately before that particle is recognized as a company name.
In the example of FIG. 11, the clause "米オラテルは" depends on the clause "出荷する" ("ship"), which denotes a sales activity, and carries the particle "は", so the immediately preceding morpheme "オラテル" (Oratel) is recognized as a company name.
(2) Abstraction of activity targets
Under this rule, when a clause depends on a clause containing a corporate activity and its particle is "を" (o), the morpheme immediately before that particle is recognized as an activity target.
In the example of FIG. 11, the clause "『Core o7』を" depends on the clause "出荷する", which denotes a sales activity, and carries the particle "を", so the immediately preceding morpheme "Core o7" is recognized as a sales target.
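The two dependency rules can be sketched as a single pass over parsed clauses. The `Clause` structure, the tag strings, and the function `apply_dependency_rules` are assumptions for this illustration. Each clause is simplified to a list of morpheme surfaces with the particle last, and quotation brackets are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Clause:
    morphemes: List[str]   # surface forms in order; the particle, if any, is last
    head: Optional[int]    # index of the clause this clause depends on
    tags: Set[str] = field(default_factory=set)


ACTIVITY_TAGS = {"sales_activity", "production_activity"}
SUBJECT_PARTICLES = {"が", "は"}
OBJECT_PARTICLE = "を"


def apply_dependency_rules(clauses: List[Clause]) -> Dict[str, str]:
    """Tag clauses that depend on an activity clause; return morpheme -> tag."""
    found: Dict[str, str] = {}
    for clause in clauses:
        if clause.head is None or len(clause.morphemes) < 2:
            continue
        if not (clauses[clause.head].tags & ACTIVITY_TAGS):
            continue  # the head clause does not describe a corporate activity
        particle, previous = clause.morphemes[-1], clause.morphemes[-2]
        if particle in SUBJECT_PARTICLES:
            clause.tags.add("company_name")
            found[previous] = "company_name"
        elif particle == OBJECT_PARTICLE:
            clause.tags.add("activity_target")
            found[previous] = "activity_target"
    return found


if __name__ == "__main__":
    parsed = [
        Clause(["米", "オラテル", "は"], head=2),
        Clause(["Core o7", "を"], head=2),
        Clause(["出荷", "する"], head=None, tags={"sales_activity"}),
    ]
    print(apply_dependency_rules(parsed))
```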

Sentences for which abstraction is complete are stored in the material sentence storage unit 26 by the abstraction processing unit 18. Each sentence stored there is used as material for generating information extraction rules.

That is, when a request to create an information extraction rule is sent from the client terminal 30 operated by the user, the information extraction rule generation unit 20 retrieves material sentences from the material sentence storage unit 26 according to a fixed criterion (for example, newest first), generates a rule editing screen with these sentences embedded, and sends it to the client terminal 30.

FIG. 12 shows an example of the rule editing screen 40 displayed in the Web browser of the client terminal 30. The sentence display field 42 lists a large number of structured texts.
In each text, the string portions corresponding to a company name, a corporate activity, and an activity target are each given a distinctive marking (decoration). When a sentence has inherited the company-name subject of a preceding sentence, the company name is shown in parentheses, as in "（初芝）", to distinguish it from sentences in which the company name is stated explicitly.
The figure shows an example in which different shading patterns are applied to company names, corporate activities, activity targets, and places, but markers in different colors can be used instead (for example, company name = blue, corporate activity = yellow-green, activity target = orange, place = yellow-green).
As a result, the user can identify at a glance the company names, corporate activities, activity targets, and places to check in the text.
These markings are applied by referring to the abstraction tag associated with each string and to the company-name information (company-name inheritance information) associated with the sentence.

In this case the "Sales activity" tab 44 of the rule editing screen 40 is selected, so among corporate activities the strings relating to sales activity (for example, "発売", release) are marked. When the user switches to the "Production activity" tab 46, strings relating to production (for example, "製造", manufacture) are marked and the marking on "発売" is removed.

Similarly, while the "Sales activity" tab 44 is selected, the strings relating to sales targets among the activity targets (for example, "携帯音楽プレーヤー", portable music player) are marked. When the user switches to the "Production activity" tab 46, strings relating to production targets are marked and the markings on sales targets are removed. In many cases, however, the same string is registered as both a production target and a sales target, so the display may not visibly change.

On this rule editing screen 40, the user reads through the texts listed in the sentence display field 42 and, on finding a sentence in which the subject (company name), predicate (sales activity), and object (sales target) are all correctly present, clicks the marked portion of each of those strings.

For example, the first sentence, "［初芝］、外出先で［動画］を視聴できる［携帯音楽プレーヤー］。" ("[Hatsushiba]: a [portable music player] for watching [video] on the go."), is a headline-style sentence with its predicate omitted, so the user excludes it from the sentences used for creating information extraction rules.

By contrast, the second sentence, "［初芝］は、［携帯音楽プレーヤー］『［ギガバイト］』の新製品を１５日、［関西］で［発売］する。" ("[Hatsushiba] will [release] a new model of its [portable music player] "[Gigabyte]" in [Kansai] on the 15th."), has its subject, predicate, and object correctly in place, and the company name, sales target, sales activity, and place can all be extracted. The user therefore clicks the marked portions [初芝] (Hatsushiba), [携帯音楽プレーヤー] (portable music player), [ギガバイト] (Gigabyte), [関西] (Kansai), and [発売] (release).
As a result, the strings selected by the user (Hatsushiba, portable music player, Gigabyte, release, Kansai) are displayed in the extraction node display field 48. When the user then clicks the "Generate rule" anchor text 50, the information extraction rule generation unit 20 generates an information extraction rule.

FIG. 13 illustrates how this information extraction rule is generated. FIG. 13(a) shows the dependency structure among the clauses of the sentence.
As shown there, the clause "初芝は", which contains the user-selected "初芝" (Hatsushiba), depends on the clause "発売する" (release), which carries the sales-activity abstraction tag. The clause "携帯音楽プレーヤー" depends on the clause "『ギガバイト』の", the clause "『ギガバイト』の" on the clause "新製品を" (new product), the clause "新製品を" on the clause "発売する", and the clause "関西で" (in Kansai) on the clause "発売する".

The information extraction rule generation unit 20 then, as shown in FIG. 13(b), represents each clause that carries an abstraction tag by that tag, as in <company name>, <corporate activity>, <sales target>, and <place>, keeps the literal string for each clause without an abstraction tag, as in (新製品を), and determines the dependency relations among the clauses. Because the clauses "携帯音楽プレーヤー" and "『ギガバイト』の" form a run of the same abstraction tag ("sales target → sales target"), they are merged into a single abstraction tag.
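The abstraction of the dependency structure, including the merging of adjacent clauses with the same tag, can be sketched as follows. The `Clause` structure, the tag strings, and the helper names are assumptions for this illustration. The clause list follows the dependencies enumerated in the description of FIG. 13(a), which leaves out the date clause.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Clause:
    text: str                  # clause as written
    head: Optional[int]        # index of the clause it depends on
    tag: Optional[str] = None  # abstraction tag, if any


def node_label(clause: Clause) -> str:
    """Tagged clauses become <tag>; untagged ones keep their literal text."""
    return f"<{clause.tag}>" if clause.tag else f"({clause.text})"


def abstract_edges(clauses: List[Clause]) -> List[Tuple[str, str]]:
    """List dependency edges between abstracted nodes, in sentence order."""
    edges: List[Tuple[str, str]] = []
    for clause in clauses:
        if clause.head is None:
            continue
        head = clauses[clause.head]
        if clause.tag is not None and clause.tag == head.tag:
            continue  # same tag in a row: collapse the two clauses into one node
        edge = (node_label(clause), node_label(head))
        if edge not in edges:
            edges.append(edge)
    return edges


if __name__ == "__main__":
    sentence = [
        Clause("初芝は", 5, "company_name"),
        Clause("携帯音楽プレーヤー", 2, "sales_target"),
        Clause("『ギガバイト』の", 3, "sales_target"),
        Clause("新製品を", 5),
        Clause("関西で", 5, "place"),
        Clause("発売する", None, "sales_activity"),
    ]
    for child, parent in abstract_edges(sentence):
        print(child, "->", parent)
```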

Next, as shown in FIG. 13(c), the information extraction rule generation unit 20 generates the information extraction rule 52 in a predetermined notation based on the dependency relations in (b).
In the information extraction rule 52, the "extraction" section (extraction targets) first defines "subject = <company name>", "predicate = <sales activity>", "object = <sales target>", and "option = <place>".
The "necessary" section (required conditions) specifies that the <company name> clause depends on the <sales activity> clause, that the <sales target> clause depends on the (新製品を) clause, and that the (新製品を) clause depends on the <sales activity> clause.
The "optional" section (optional conditions) specifies that the <place> clause depends on the <sales activity> clause.
The information extraction rule generation unit 20 stores the information extraction rule 52 in the information extraction rule storage unit 28.
Because "place" is only an optional condition, not a required one, the rule also applies to sentences that contain no place string.

Applying the information extraction rule 52 makes it possible, for example, to extract the following information from the sentence "東洋自動車は来月、スカイライナーの新製品をアメリカで発売する。" ("Toyo Motors will release a new Skyliner model in the US next month.").
Subject (company name): Toyo Motors
Predicate (corporate activity): Sales
Object (activity target): Skyliner
Option (place): US
Because these information elements are related to one another as subject, predicate, object, and place, accumulating many of them in a search database enables high-precision search with little noise.
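Applying a rule such as the information extraction rule 52 can be sketched as a subset test on abstracted dependency edges followed by a lookup of the tagged strings. The structure of `RULE_52`, the `Clause` fields, and the function names below are assumptions for this illustration, and the matching is deliberately simplified: it checks that the required edges exist but does not bind them to specific clauses.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

Edge = Tuple[str, str]


@dataclass
class Clause:
    text: str                  # clause as written
    head: Optional[int]        # index of the clause it depends on
    tag: Optional[str] = None  # abstraction tag, if any
    core: str = ""             # tagged string inside the clause


# Hypothetical in-memory form of rule 52 as described in the text.
RULE_52 = {
    "extraction": {"subject": "company_name", "predicate": "sales_activity",
                   "object": "sales_target", "option": "place"},
    "necessary": {("<company_name>", "<sales_activity>"),
                  ("<sales_target>", "(新製品を)"),
                  ("(新製品を)", "<sales_activity>")},
    "optional": {("<place>", "<sales_activity>")},
}


def label(clause: Clause) -> str:
    return f"<{clause.tag}>" if clause.tag else f"({clause.text})"


def edges_of(clauses: List[Clause]) -> Set[Edge]:
    return {(label(c), label(clauses[c.head])) for c in clauses if c.head is not None}


def apply_rule(rule: dict, clauses: List[Clause]) -> Optional[Dict[str, str]]:
    """Return role -> extracted string, or None if a required edge is missing."""
    edges = edges_of(clauses)
    if not rule["necessary"] <= edges:
        return None
    by_tag = {c.tag: c.core for c in clauses if c.tag}
    result: Dict[str, str] = {}
    for role, tag in rule["extraction"].items():
        if role == "option" and not rule["optional"] <= edges:
            continue  # optional condition not met: leave the role out
        if tag in by_tag:
            result[role] = by_tag[tag]
    return result


if __name__ == "__main__":
    sentence = [
        Clause("東洋自動車は", 5, "company_name", "東洋自動車"),
        Clause("来月、", 5),
        Clause("スカイライナーの", 3, "sales_target", "スカイライナー"),
        Clause("新製品を", 5),
        Clause("アメリカで", 5, "place", "アメリカ"),
        Clause("発売する。", None, "sales_activity", "発売"),
    ]
    print(apply_rule(RULE_52, sentence))
```

The sketch returns the matched string "発売" for the predicate, whereas the example in the text reports the activity type (sales). Mapping the string back to its activity type would be a dictionary lookup.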

Note that the information extraction rule 52 requires the clause "新製品を" to depend on the sales-activity clause. If the sentence is changed to "東洋自動車は来月、スカイライナーを発売する。" ("Toyo Motors will release the Skyliner next month."), it no longer matches the extraction conditions, and the information "subject (company name): Toyo Motors; predicate (corporate activity): sales; object (activity target): Skyliner" cannot be extracted.
However, by repeatedly creating information extraction rules from many other material sentences, the user can obtain rules that do fit such sentences.

For example, as shown in FIG. 14(a), suppose the sentence "アニーは来月、２０代をターゲットにした男性化粧品を発売する。" ("Annie will release men's cosmetics aimed at people in their twenties next month.") is listed in the sentence display field 42 of the rule editing screen 40. The user clicks the marked portions "アニー" (Annie, company name), "男性化粧品" (men's cosmetics, sales target), and "発売" (release, sales activity), and then clicks the "Generate rule" anchor text 50.

The clauses of this sentence have the dependency structure shown in FIG. 14(b), so the information extraction rule generation unit 20 takes out the dependency relations "<company name> → <sales activity>" and "<sales target> → <sales activity>", as shown in FIG. 14(c), and generates the information extraction rule 54 shown in FIG. 14(d).
Applying the information extraction rule 54 to the earlier sentence "東洋自動車は来月、スカイライナーを発売する。" successfully extracts the following information.
Subject (company name): Toyo Motors
Predicate (corporate activity): Sales
Object (activity target): Skyliner
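Why rule 52 misses the shortened sentence while rule 54 matches it comes down to which dependency edges each rule requires. The comparison below is a minimal sketch in which the edge sets are written out by hand. The set names, the tag strings, and the `matches` helper are assumptions for this illustration.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]

# Required dependency edges of the two rules, as described in the text.
RULE_52_NECESSARY: Set[Edge] = {
    ("<company_name>", "<sales_activity>"),
    ("<sales_target>", "(新製品を)"),
    ("(新製品を)", "<sales_activity>"),
}
RULE_54_NECESSARY: Set[Edge] = {
    ("<company_name>", "<sales_activity>"),
    ("<sales_target>", "<sales_activity>"),
}

# Abstracted edges of 「東洋自動車は来月、スカイライナーを発売する。」
VARIANT_EDGES: Set[Edge] = {
    ("<company_name>", "<sales_activity>"),
    ("(来月、)", "<sales_activity>"),
    ("<sales_target>", "<sales_activity>"),
}


def matches(necessary: Set[Edge], sentence_edges: Set[Edge]) -> bool:
    """A rule matches when every required edge occurs in the sentence."""
    return necessary <= sentence_edges


if __name__ == "__main__":
    print("rule 52:", matches(RULE_52_NECESSARY, VARIANT_EDGES))
    print("rule 54:", matches(RULE_54_NECESSARY, VARIANT_EDGES))
```

Extra edges in the sentence, such as the one from "来月、", do not prevent a match because only the required edges are checked.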

When editing information extraction rules, the user reads each material sentence for its meaning and selects the marked portions only when judging that the sentence describes a corporate activity that is actually being carried out.
For example, given the sentence shown in FIG. 15(b), "初芝は１５日、携帯音楽プレーヤー『ギガバイト』の新製品の発売を延期すると発表した。" ("Hatsushiba announced on the 15th that it would postpone the release of a new model of its portable music player "Gigabyte"."), the strings "初芝", "携帯音楽プレーヤー", "ギガバイト", and "発売" are marked just as in the sentence of FIG. 15(a), "初芝は、携帯音楽プレーヤー『ギガバイト』の新製品を１５日、発売する。" ("Hatsushiba will release a new model of its portable music player "Gigabyte" on the 15th.").
If an information extraction rule were generated from sentence (b), however, sentences in which a company announces the postponement of a corporate activity would wrongly yield information indicating that the company had actively carried out that activity.
The user therefore deliberately excludes such sentences from rule creation.

By contrast, given the sentence shown in FIG. 15(c), "初芝は、携帯音楽プレーヤー『ギガバイト』をアメリカで発売すると発表した。" ("Hatsushiba announced that it will release the portable music player "Gigabyte" in the US."), the predicate of the sentence is the general term "発表した" (announced) rather than a term denoting a corporate activity. Taken as a whole, however, the sentence states that the company will sell the sales target, so the user clicks each marked portion and then the "Generate rule" anchor text 50.

In response, the information extraction rule generation unit 20 generates an information extraction rule from the sentence.
The clauses of this sentence have the dependency structure shown in FIG. 16(a), so the information extraction rule generation unit 20 takes out the dependency relations "<company name> → <sales activity>", "<sales target> → <sales activity>", "<sales activity> → (発表した)", and "<place> → <sales activity>", as shown in FIG. 16(b), and generates the information extraction rule 56 shown in FIG. 16(c).

Applying the information extraction rule 56 makes it possible, for example, to extract the following company information from the sentence "東洋自動車は来月、ハイブリッド車をヨーロッパで売り出すと発表した。" ("Toyo Motors announced that it will put a hybrid car on sale in Europe next month.").
Subject (company name): Toyo Motors
Predicate (corporate activity): Sales
Object (activity target): Hybrid car
Place: Europe

Although not illustrated, an information extraction rule can also be generated from a sentence that has inherited the company-name subject of a preceding sentence. For example, suppose the sentence "［（初芝）］『［ギガバイト］』は１０月下旬に［発売］する。" ("[(Hatsushiba)] will [release] "[Gigabyte]" in late October.") is listed in the sentence display field 42 of the rule editing screen 40. When the user clicks the marked portions [（初芝）] (the inherited company-name subject), [ギガバイト], and [発売], and then clicks the "Generate rule" anchor text 50, the information extraction rule generation unit 20 generates the following information extraction rule.
============================
extraction:
subject: <company name>
predicate: <sales activity>
object: <sales target>

necessary:
<sales target> -> <sales activity>
============================

In this information extraction rule, the "necessary" section (required conditions) specifies no dependency for <company name>. When the rule is applied, the company name that the target sentence has inherited from a preceding sentence is extracted as the subject.

The user can also add entries to the dictionaries through the rule editing screen 40.
For example, if the string "投入" (launch) in the sentence display field 42 carries no marking indicating a corporate activity (here, a sales activity), the user selects "投入" with the mouse and then clicks the "Add to dictionary as activity" anchor text 60.

In response, the information extraction rule generation unit 20 registers the string "投入" in the corporate activity dictionary of the dictionary storage unit 22 as a sales activity.
As a result, the next time a sentence containing "投入" is displayed in the sentence display field 42, the string is marked as a sales activity.

Similarly, if a string that is clearly a company name has not been marked, the user selects the string and clicks the "Add to dictionary as company" anchor text 58.
In response, the information extraction rule generation unit 20 registers the string in the company name dictionary of the dictionary storage unit 22 as a company name.

Likewise, if a string that clearly denotes an activity target has not been marked, the user selects the string, clicks the "Add to dictionary as target" anchor text 62, and chooses between common noun and proper noun in the selection window that opens (not shown).
In response, the information extraction rule generation unit 20 registers the string in the activity target dictionary of the dictionary storage unit 22 as a sales target.
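The effect of adding a dictionary entry from the editing screen can be sketched with plain sets keyed by abstraction tag. The dictionary layout, the tag strings, and the functions `make_dictionaries`, `lookup_tag`, and `register_term` are assumptions for this illustration. The source does not describe how the dictionary storage unit 22 is organized internally.

```python
from typing import Dict, Optional, Set


def make_dictionaries() -> Dict[str, Set[str]]:
    """Return small sample dictionaries keyed by abstraction tag."""
    return {
        "company_name": {"初芝"},
        "sales_activity": {"発売"},
        "sales_target": {"携帯音楽プレーヤー"},
    }


def lookup_tag(dictionaries: Dict[str, Set[str]], term: str) -> Optional[str]:
    """Return the abstraction tag whose dictionary lists the term, if any."""
    for tag, terms in dictionaries.items():
        if term in terms:
            return tag
    return None


def register_term(dictionaries: Dict[str, Set[str]], term: str, tag: str) -> None:
    """Add a term under the given tag, as the 'Add to dictionary' links do."""
    dictionaries.setdefault(tag, set()).add(term)


if __name__ == "__main__":
    dictionaries = make_dictionaries()
    print(lookup_tag(dictionaries, "投入"))   # not registered yet
    register_term(dictionaries, "投入", "sales_activity")
    print(lookup_tag(dictionaries, "投入"))   # now marked as a sales activity
```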

A block diagram showing the functional configuration of the information extraction rule creation support system according to the present invention. A table illustrating entries registered in the corporate activity dictionary. A table illustrating entries registered in the activity target dictionary. A table showing the result of morphological analysis. An explanatory diagram illustrating how the company-name subject of a sentence is recognized. A table explaining inheritance and anaphora resolution of a company-name subject. A table showing the result of dependency parsing. A diagram visualizing the dependency structure obtained by parsing. A table showing how abstraction tags assigned to morphemes are carried over to clauses. An explanatory diagram showing abstraction by regular-expression rules. An explanatory diagram showing abstraction by dependency rules. A screen layout diagram showing an example of the rule editing screen. An explanatory diagram showing the process of generating an information extraction rule. An explanatory diagram showing the process of generating another information extraction rule. An explanatory diagram showing other examples of sentences used as material for generating information extraction rules. An explanatory diagram showing the process of generating another information extraction rule.
An explanatory diagram visualizing the dependency structure obtained by parsing. An explanatory diagram visualizing an information extraction rule. An explanatory diagram showing how structured company information is extracted by applying an information extraction rule to a parsed sentence. An explanatory diagram visualizing another information extraction rule.

10 Information extraction rule creation support system
12 Morphological analysis processing unit
14 Subject identification processing unit
16 Syntax analysis processing unit
18 Abstraction processing unit
20 Information extraction rule generation unit
22 Dictionary storage unit
24 Abstraction rule storage unit
26 Material sentence storage unit
28 Information extraction rule storage unit
30 Client terminal
32 Text data
40 Rule editing screen
42 Sentence display field
44 "Sales activity" tab
46 "Production activity" tab
48 Extraction node display field
50 "Generate rule" anchor text
52 Information extraction rule
54 Information extraction rule
56 Information extraction rule
58 "Add to dictionary as company" anchor text
60 "Add to dictionary as activity" anchor text
62 "Add to dictionary as target" anchor text

Claims

An information extraction rule creation support system comprising:
a dictionary in which correspondences between specific expression strings and the abstract strings indicating their types are registered;
means for decomposing each sentence in text data into morphemes;
means for referring to the dictionary and associating a corresponding abstraction tag with each morpheme that corresponds to a company name, a company activity, or an activity target;
means for searching each sentence for a company name subject;
means for storing each sentence that has a company name subject in material sentence storage means as a material sentence;
means for retrieving material sentences from the material sentence storage means and filling them into a predetermined template to generate a rule editing screen;
means for sending the rule editing screen to a client terminal and prompting selection of a material sentence that describes a company activity of a specific company and its activity target;
information extraction rule generation means for, when selection information for a specific material sentence is transmitted from the client terminal, extracting as extraction conditions either the dependency structure among the clause including the company name, the clause including the company activity, and the clause including the activity target in that material sentence, or the dependency structure among those three clauses and a clause including a character string other than the company name, the company activity, and the activity target, and generating an information extraction rule that describes the extraction conditions in a predetermined format; and
means for storing the information extraction rule in information extraction rule storage means.
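
As an illustration only, the dictionary tagging and the extraction of dependency relations among the company name, company activity, and activity target clauses might be sketched as follows. The `Clause` type, the `DICTIONARY` entries, the tag names, and the dictionary-shaped rule are all hypothetical; the claim does not fix any particular data structure or rule format, and the clause segmentation and dependency heads are assumed to come from an upstream parser.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical dictionary: specific expression -> abstraction tag.
DICTIONARY = {"Acme Corp": "COMPANY", "sells": "ACTIVITY", "routers": "TARGET"}
CORE_TAGS = {"COMPANY", "ACTIVITY", "TARGET"}


@dataclass
class Clause:
    index: int                 # position of the clause in the sentence
    text: str                  # surface text of the clause
    head: int                  # index of the clause it depends on (-1 = root)
    tag: Optional[str] = None  # abstraction tag, filled in by tag_clauses


def tag_clauses(clauses, dictionary):
    """Attach the abstraction tag of the first dictionary entry found in each clause."""
    for clause in clauses:
        for expression, tag in dictionary.items():
            if expression in clause.text:
                clause.tag = tag
                break
    return clauses


def build_rule(clauses):
    """Collect (dependent tag, head tag) pairs among the core-tagged clauses."""
    tagged = {c.index: c for c in clauses if c.tag in CORE_TAGS}
    conditions = [
        (c.tag, tagged[c.head].tag) for c in tagged.values() if c.head in tagged
    ]
    return {"conditions": sorted(conditions)}


# Clause order follows a Japanese-style sentence: company, target, activity.
sentence = [Clause(0, "Acme Corp", 2), Clause(1, "routers", 2), Clause(2, "sells", -1)]
rule = build_rule(tag_clauses(sentence, DICTIONARY))
```

Here the generated conditions state that the company clause and the target clause both depend on the activity clause; a real rule would be serialized in whatever predetermined format the system uses.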

The information extraction rule creation support system according to claim 1, wherein each material sentence in the rule editing screen is marked so that the character string portion indicating the company name, the character string portion indicating the company activity, and the character string portion indicating the activity target can be distinguished by type.

The information extraction rule creation support system according to claim 1 or 2, further comprising means for referring to the dictionary and associating a corresponding abstraction tag with each morpheme that corresponds to a location,
wherein, when selection information for a specific material sentence is transmitted from the client terminal and the material sentence includes a morpheme associated with the location abstraction tag, the information extraction rule generation means extracts the dependency structure between the clause including the morpheme indicating the location and the other clauses as an optional extraction condition, and generates an information extraction rule including that condition.

The information extraction rule creation support system according to any one of claims 1 to 3, further comprising means for, when a sentence has no subject, a company name subject exists in a sentence preceding that sentence, and no sentence having another company name subject or a subject that is a morpheme other than a company name exists between that sentence and the sentence in which the company name subject exists, associating that company name subject with the subject-less sentence as its company name subject.
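
The carry-over condition for subject-less sentences can be read as: inherit the company name only when the nearest preceding explicit subject is a company name. A minimal sketch under that reading follows; the `(kind, text)` tuple representation and the function name are assumptions made for illustration, and subject detection itself is taken as given.

```python
def carry_over_subjects(subjects):
    """Return the company name subject resolved for each sentence, or None.

    `subjects` holds one entry per sentence: None when the sentence has no
    subject, otherwise a (kind, text) tuple where kind is "company" or "other".
    """
    resolved = []
    last_explicit = None  # most recent explicit subject seen so far
    for subject in subjects:
        if subject is not None:
            last_explicit = subject
            resolved.append(subject[1] if subject[0] == "company" else None)
        elif last_explicit is not None and last_explicit[0] == "company":
            # No intervening explicit subject: inherit the company name.
            resolved.append(last_explicit[1])
        else:
            resolved.append(None)
    return resolved


document = [
    ("company", "Acme Corp"),
    None,                      # inherits "Acme Corp"
    ("other", "The market"),   # breaks the chain
    None,                      # nothing to inherit
    ("company", "Beta Inc"),
    None,                      # inherits "Beta Inc"
]
resolved = carry_over_subjects(document)
```

Consecutive subject-less sentences all inherit the same company name, because a sentence without a subject does not count as an intervening sentence.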

The information extraction rule creation support system according to any one of the preceding claims, further comprising means for, when a sentence contains a company pronoun as its subject, a company name subject exists in a sentence preceding that sentence, and no sentence having another company name subject or a subject that is a morpheme other than a company name exists between that sentence and the sentence in which the company name subject exists, replacing the company pronoun with the company name of that company name subject.
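
The pronoun replacement can be sketched in the same spirit. The function below is a hypothetical illustration: the input shape (a subject string plus the sentence text), the `company_names` and `pronouns` sets, and the choice to let a resolved pronoun keep the antecedent alive for later sentences are all assumptions, not details taken from the claim.

```python
def resolve_company_pronouns(sentences, company_names, pronouns):
    """Replace a company pronoun subject with the preceding company name.

    `sentences` is a list of (subject, text) pairs; subject is None when the
    sentence has no subject.
    """
    resolved = []
    antecedent = None  # company name currently available for substitution
    for subject, text in sentences:
        if subject in pronouns:
            if antecedent is not None:
                # Replace only the first occurrence, i.e. the subject itself.
                text = text.replace(subject, antecedent, 1)
        elif subject in company_names:
            antecedent = subject
        elif subject is not None:
            antecedent = None  # a non-company subject breaks the chain
        resolved.append(text)
    return resolved


texts = resolve_company_pronouns(
    [
        ("Acme Corp", "Acme Corp sells routers."),
        ("The company", "The company also builds switches."),
        ("The market", "The market is growing."),
        ("The company", "The company exports cables."),
    ],
    company_names={"Acme Corp"},
    pronouns={"The company"},
)
```

In the last sentence the pronoun is left untouched, since "The market" intervenes between it and the company name subject.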

The information extraction rule creation support system according to claim 1, further comprising means for applying a preset regular expression rule to each sentence, identifying a morpheme that matches the regular expression rule as a company name, a company activity, or an activity target, and associating the corresponding abstraction tag with that morpheme.
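
A regular expression rule of this kind might look like the following sketch. The patterns, the tag names, and the simplification of matching against the raw sentence text rather than against individual morphemes are assumptions for illustration only.

```python
import re

# Hypothetical preset rules: (compiled pattern, abstraction tag).
REGEX_RULES = [
    (re.compile(r"\b[A-Z][A-Za-z]* (?:Corp|Inc|Ltd)\b"), "COMPANY"),
    (re.compile(r"\b(?:sells|manufactures|exports)\b"), "ACTIVITY"),
]


def tag_by_regex(sentence, rules=REGEX_RULES):
    """Return (matched text, abstraction tag) pairs in order of appearance."""
    hits = []
    for pattern, tag in rules:
        for match in pattern.finditer(sentence):
            hits.append((match.start(), match.group(), tag))
    # Sort by start offset so the tags follow the sentence order.
    return [(text, tag) for _, text, tag in sorted(hits)]


tags = tag_by_regex("Acme Corp sells routers to Beta Inc.")
```

Such rules complement the dictionary: they can catch expressions, such as unregistered company names with a recognizable suffix, that have no dictionary entry.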

The information extraction rule creation support system according to claim 1, further comprising means for applying a preset dependency rule to each sentence, identifying a morpheme of a clause that matches the dependency rule as a company name, a company activity, or an activity target, and associating the corresponding abstraction tag with that morpheme.