JP2006023968A

JP2006023968A - Unique expression extracting method and device and program to be used for the same

Info

Publication number: JP2006023968A
Application number: JP2004201272A
Authority: JP
Inventors: Yoshiaki Kudo; 嘉晃工藤; Toshiko Aizono; 敏子相薗; Atsuko Koizumi; 敦子小泉
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2004-07-08
Filing date: 2004-07-08
Publication date: 2006-01-26

Abstract

PROBLEM TO BE SOLVED: To provide a named entity extraction method, device, and program that support the manual parts of extracting named entities from target documents, thereby reducing the large manual workload that conventional named entity extraction systems require for preparing large amounts of teacher data and for judging which extracted named entity candidates are correct. SOLUTION: Extraction rules are learned incrementally, and candidates for teacher data are presented to the operator to reduce the work of preparing teacher data. In addition, reference materials related to each candidate are presented to the operator to support the judgment of named entity candidates. COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to named entity extraction and text mining.

Text mining is a technique for discovering useful information in a database that holds a large amount of document data. For example, text mining is used to find important information, such as complaints about products, in the inquiry histories accumulated at a call center. Finding such complaints requires identifying the product names that appear in the documents, but text mining alone cannot accurately identify product names in a huge volume of document data (tens of thousands of documents per day).

Several methods have therefore been proposed for extracting, from document data, proper-noun expressions such as names of persons, places, organizations, and artifacts (product names, law names, and so on), temporal expressions such as dates and times, and numerical expressions such as prices and ratios. These expressions are collectively called named entities. Every conventional method generates rules for extracting named entities (hereinafter, extraction rules) and extracts named entities from document data based on those rules. Extraction rules are generally generated automatically with a learning algorithm. In that case, document data in which tags indicating named entities (hereinafter, named entity tags) have been added by hand must be prepared in advance; that is, teacher data is created as input to the learning algorithm. For example, if the word "Hitachi" appears in the document data, a tag denoting an organization is added to it, as in "<ORG>Hitachi</ORG>".
Patent Documents 1 and 2 and Non-Patent Documents 1 and 2 are examples of conventional named entity extraction methods.

Patent Document 1: JP 2004-46775 A
Patent Document 2: JP 2001-318792 A
Non-Patent Document 1: Hideki Isozaki, "Japanese Named Entity Extraction Using Meta-Rules and Decision Tree Learning," IPSJ Journal, Vol. 43, No. 5, pp. 1481-1491 (2002)
Non-Patent Document 2: Takehito Utsuro and Manabu Sassano, "Low-Manual-Cost Japanese Named Entity Extraction by Bootstrapping," IPSJ SIG Technical Report, 2000-NL-139, pp. 9-16 (2000)

One conventional method is the bootstrap method, which learns extraction rules incrementally from a small amount of teacher data and can extract named entities efficiently. However, applying the learned rules to the target documents yields a large number of named entity candidates, and judging whether each candidate is a correct named entity takes an enormous amount of work. An operator without knowledge of the target field cannot judge an extracted candidate at a glance and has to investigate the documents in which it appears and related materials.

For example, with daily business reports, an operator who does not know the product names that appear in the reports cannot judge whether a product name candidate obtained by applying the extraction rules is a correct product name. The operator must therefore consult reference materials such as product catalogs while judging each candidate, which takes a great deal of time.
The problem to be solved by the present invention is therefore to reduce the enormous workload of creating teacher data and judging named entity candidates.

In the present invention, during the incremental learning of extraction rules by the bootstrap method, candidates for teacher data are presented to the operator to support the input of teacher data to the learning algorithm. In the daily business report example above, the system supports this input by presenting known product names as candidates for correct examples (hereinafter, positive examples) and words other than product names as candidates for incorrect examples (hereinafter, negative examples).

The present invention also presents reference materials related to each extracted candidate to the operator, in order to reduce the work of judging whether a named entity candidate extracted from the target documents by the extraction rules is a correct named entity. For example, when product name candidates are extracted from daily business reports, the operator can browse information such as product catalogs on the Web to help judge each candidate.

Conventional named entity extraction requires an enormous amount of work to create a large amount of teacher data and to judge the extracted candidates by hand. With the present invention, a large amount of teacher data no longer needs to be prepared, so the workload is reduced. When the small amount of teacher data is created, candidates for it are presented to the operator, which makes it easy to create. Furthermore, because reference materials related to each named entity candidate are presented to the operator, the time needed to judge whether the candidate is a named entity is shortened.

An embodiment of the present invention is described below with reference to the drawings.

1. Description of the overall system
This section describes the configuration and processing flow of a named entity extraction system that is one embodiment of the present invention.
1.1 Configuration
FIG. 1 shows the overall system configuration. In this system, one or more operators use the terminal 100 to extract named entities from a large amount of document data. The system consists of the following parts.
・A terminal 100 with input/output means that accepts the operator's input of positive and negative examples and presents the information needed at each step, such as extraction rule generation and named entity candidate judgment.
・A database 103 that stores a set of document data (hereinafter, document data 104), the document data after morphological analysis (hereinafter, morphologically analyzed document data 105), and a document ID table 109 that records the words (morphemes) in the morphologically analyzed document data 105 together with document IDs.
・An extraction rule learning unit 101 that, using the positive and negative examples specified by the operator at the terminal 100, learns regularities from the morphologically analyzed document data 105 in the database 103 and generates extraction rules.
・An extraction rule storage unit 106 that stores the extraction rules generated by the extraction rule learning unit 101 and used for named entity extraction.
・A named entity storage unit 107 that stores the extracted named entities.
・A named entity extraction unit 102 that uses the extraction rules generated by the extraction rule learning unit 101 to extract named entity candidates from the morphologically analyzed document data 105 stored in the database 103, and that accepts, via the terminal 100, the operator's input judging whether each extracted candidate is a correct named entity.
The named entity extraction unit 102 is connected to the Internet 108.

The terminal 100 is a general-purpose computer with an arithmetic unit, a storage unit, user input/output devices such as a keyboard and mouse, a display unit, and a communication unit for communicating with a server. The extraction rule learning unit 101 and the named entity extraction unit 102 are implemented as programs executed on a computer. These programs are stored on a medium such as a CD-ROM or hard disk and are executed by the arithmetic unit of the terminal 100 or of a server device responsible for other functions. The database 103, the extraction rule storage unit 106, and the named entity storage unit 107 are external storage devices. They store the data generated by the system and are read and written by the arithmetic unit that executes the programs above.

A document ID is attached to each document in the document data 104 and in the morphologically analyzed document data 105. A document in the document data 104 and a document in the morphologically analyzed document data 105 that share the same document ID are the same document before and after morphological analysis. As shown in FIG. 11, the document ID table 109 consists of a word storage unit 1100 that stores a word, a frequency storage unit 1101 that stores how often the word appears in the morphologically analyzed document data 105, and a document ID storage unit 1102 that stores the IDs of the documents in the morphologically analyzed document data 105. For example, the word "Server01" has a frequency of 948, and the documents containing "Server01" have the document IDs 00001, 00009, 00203, and so on.
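The document ID table described above behaves like a small inverted index. The following is a minimal sketch in Python, not the patent's implementation: the function name, the dictionary layout, and the choice to count every occurrence of a word (rather than the number of documents containing it) are assumptions made for illustration.

```python
from collections import defaultdict


def build_document_id_table(analyzed_docs):
    """Map each word to (frequency, sorted list of document IDs).

    analyzed_docs: dict of document ID -> list of words (morphemes).
    Assumption: frequency counts every occurrence of the word.
    """
    frequency = defaultdict(int)
    doc_ids = defaultdict(set)
    for doc_id, words in analyzed_docs.items():
        for word in words:
            frequency[word] += 1
            doc_ids[word].add(doc_id)
    return {word: (frequency[word], sorted(doc_ids[word])) for word in frequency}


# Hypothetical, tiny stand-in for the morphologically analyzed document data 105.
analyzed_docs = {
    "00001": ["Server01", "提案"],
    "00009": ["Server01", "説明", "Server01"],
}
table = build_document_id_table(analyzed_docs)
print(table["Server01"])  # (3, ['00001', '00009'])
```

Keeping the document IDs alongside each word lets later steps jump straight to the documents that contain a given positive example or candidate.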

1.2 Flow of named entity extraction
The processing of this system is divided into the following two phases.
・Extraction rule generation phase
・Named entity extraction phase
1.2.1 Extraction rule generation phase
In the extraction rule generation phase, the extraction rule learning unit 101 learns regularities from the morphologically analyzed document data 105 stored in the database 103, based on input from the operator, and generates extraction rules. FIG. 2 shows the flow of processing between the operator and the system; the arrows in the figure represent the flow of data. Steps S201 to S205 make up the extraction rule generation phase and are outlined below.

・S201: The system displays the named entities stored in the named entity storage unit 107 on the screen of the terminal 100 as positive example candidates, together with their named entity types. If no named entities are stored in the named entity storage unit 107, nothing is displayed.
・S202: The operator selects positive examples from the positive example candidates displayed on the screen of the terminal 100 and inputs them to the system. The operator can also enter already known named entities directly as positive examples.
・S203: The system extracts negative example candidates from those documents in the morphologically analyzed document data 105 that contain a positive example, and displays them on the screen of the terminal 100 in order of their frequency in the document data.
・S204: The operator selects negative examples from the negative example candidates displayed on the screen of the terminal 100 and inputs them to the system. The operator can also enter words that are not named entities directly as negative examples.
・S205: Based on the positive and negative examples entered by the operator, the system learns regularities from the morphologically analyzed document data 105, generates extraction rules, and displays them on the screen of the terminal 100.
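Step S203 above can be sketched as follows. This is a hedged illustration rather than the patent's procedure: the function name and the `known_entities` parameter are hypothetical, and the sketch assumes that every word in a document containing a positive example, other than the positive examples and already known named entities, counts as a negative example candidate.

```python
from collections import Counter


def negative_example_candidates(analyzed_docs, positives, known_entities=()):
    """Return (word, frequency) pairs, most frequent first.

    Only documents that contain at least one positive example are scanned.
    """
    positives = set(positives)
    excluded = positives | set(known_entities)
    counts = Counter()
    for words in analyzed_docs.values():
        if positives.intersection(words):
            counts.update(word for word in words if word not in excluded)
    return counts.most_common()


# Hypothetical morphologically analyzed documents (document ID -> words).
analyzed_docs = {
    "00001": ["運用", "管理", "ツール", "JP01", "提案"],
    "00002": ["PC02", "提案", "説明"],
    "00003": ["会議", "説明"],  # contains no positive example, so it is skipped
}
candidates = negative_example_candidates(analyzed_docs, ["JP01", "PC02"])
print(candidates[0])  # ('提案', 2)
```

The operator would then tick the genuinely non-entity words in this list (step S204) instead of typing negative examples from scratch.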

The screens that the system displays on the terminal 100 in this phase are described using an example in which extraction rules are generated to extract the names of IT products (hereinafter, product names) from the daily business report database of an IT company. In this example, the document data 104 stored in the database 103 consists of daily business reports, and the morphologically analyzed document data 105 consists of those reports after morphological analysis.

FIG. 3 shows an example of the screen configuration of this system. FIG. 3 is an extraction rule generation support screen 300, which consists of a positive example candidate list display unit 301 that displays positive example candidates, a positive example direct input unit 302 that accepts input of positive examples, a positive example input confirmation button 303, a negative example candidate list display unit 304 that displays negative example candidates, a negative example direct input unit 305 that accepts input of negative examples, a negative example input confirmation button 306, and an extraction rule list display unit 307 that displays the generated extraction rules. As shown in FIG. 3, the positive example candidate list display unit 301 consists of a positive example candidate selection unit 308 that displays check boxes for selecting positive example candidates, a positive example candidate display unit 309 that displays the candidates, and a type display unit 310 that displays the named entity type of each candidate.
As shown in FIG. 4, the negative example candidate list display unit 304 consists of a negative example candidate selection unit 400 that displays check boxes for selecting negative example candidates, a negative example candidate display unit 401 that displays the candidates, and a frequency display unit 402 that displays how often each candidate appears in the morphologically analyzed document data 105.

Furthermore, as in the example shown in FIG. 6, while extraction rules are displayed, the extraction rule list display unit 307 consists of an extraction rule selection unit 600 that displays check boxes for selecting extraction rules, a rule ID display unit 601 that displays the ID of each rule, a condition display unit 602 that displays the condition part of each rule, a conclusion display unit 603 that displays the conclusion part of each rule, a rule confidence display unit 604 that displays the confidence of each rule, an extract button 605 that starts named entity extraction with the selected rules, and a delete button 606 that deletes the selected rules. The confidence of an extraction rule is an index of how correct the rule is: if P is the number of times the rule classifies words used as teacher data as positive examples and N is the number of times it classifies them as negative examples, the confidence is P/(P+N).
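As a worked instance of the rule confidence formula, take hypothetical counts chosen only for illustration: a rule that classifies 8 teacher-data words as positive examples and 2 as negative examples has

```latex
\[
\text{confidence} = \frac{P}{P + N} = \frac{8}{8 + 2} = 0.80
\]
```

A rule with this value would just meet a cutoff of 0.80 such as the one the operator applies in the example of step S206.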
Here, the on-screen operation examples shown in FIGS. 3, 4, and 6 will be described in association with steps S201 to S205.

・S201 (FIG. 3): The system displays the named entities "JP01", "PC02", "DB04", and "DOCP04" stored in the named entity storage unit 107 in the positive example candidate list display unit 301 as positive example candidates, together with their type "artifact name (IT product)".
・S202 (FIG. 3): The operator selects all of the positive example candidates "JP01", "PC02", "DB04", and "DOCP04" displayed in the positive example candidate list display unit 301 as positive examples. The operator also enters "PC01", "JP02", and "DB01" as positive examples in the positive example direct input unit 302 and presses the positive example input confirmation button 303.

・S203 (FIG. 4): When the positive example input confirmation button 303 is pressed, the system accepts "JP01", "PC02", "DB04", "DOCP04", "PC01", "JP02", and "DB01" as positive examples. From the document data containing these IT products, it then displays words other than IT products in the negative example candidate list display unit 304 as negative example candidates, together with their frequency in the documents. In the example of FIG. 4, the system displays "提案" (proposal), "説明" (explanation), "対応" (handling), and "ユーザ" (user) as negative example candidates.

・S204 (FIG. 4): From the negative example candidates displayed in the negative example candidate list display unit 304, the operator selects "提案" (proposal), "説明" (explanation), and "対応" (handling) as negative examples. The operator also enters "導入" (introduction) and "向け" (for) as negative examples in the negative example direct input unit 305 and presses the negative example input confirmation button 306.
・S205 (FIG. 6): When the negative example input confirmation button 306 is pressed, the system first accepts "提案", "説明", "対応", "導入", and "向け" as negative examples. Next, the system generates teacher data for extraction rule generation from the positive and negative examples. The teacher data can be generated by the method disclosed in "Japanese Named Entity Extraction Using Meta-Rules and Decision Tree Learning" (Hideki Isozaki, IPSJ Journal, Vol. 43, No. 5, 2002) (Non-Patent Document 1).

FIG. 5 shows an example of the format in which teacher data is temporarily stored in the extraction rule storage unit 106. The format is a relational table whose attributes are the character string, part of speech, and character type of each of the n words before and after the word used as a positive or negative example (n is an integer of 1 or more). In the example of FIG. 5, the table consists of a positive/negative label storage unit 500, a preceding-word string storage unit 501, a preceding-word part-of-speech storage unit 502, a preceding-word character type storage unit 503, a following-word string storage unit 504, a following-word part-of-speech storage unit 505, a following-word character type storage unit 506, a storage area 507 for the second and earlier preceding words, and a storage area 508 for the second and later following words. For example, generating teacher data for the positive example "JP01" from the morphologically analyzed document data "運用/管理/ツール/JP01/提案" (original document: "運用管理ツールのJP01を提案する", that is, "propose JP01, an operations management tool") yields the first row of the table shown in FIG. 5. Finally, the system learns regularities from the generated teacher data, generates extraction rules, and displays them in the extraction rule list display unit 307. A learning algorithm is used to learn the regularities; its details are disclosed in Non-Patent Document 1 cited above.
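One row of the teacher data table described above could be built along the following lines. This is a minimal sketch under stated assumptions: the column names, the part-of-speech tags in the example, and the character type categories are all hypothetical, since the actual values of FIG. 5 are not reproduced in the text.

```python
import unicodedata


def char_type(word):
    """Classify a word by script; the categories are an assumption for illustration."""
    kinds = set()
    for ch in unicodedata.normalize("NFKC", word):  # fold full-width letters/digits to ASCII
        if "\u30a0" <= ch <= "\u30ff":
            kinds.add("katakana")
        elif "\u3040" <= ch <= "\u309f":
            kinds.add("hiragana")
        elif "\u4e00" <= ch <= "\u9fff":
            kinds.add("kanji")
        elif ch.isascii() and ch.isalnum():
            kinds.add("alphanumeric")
        else:
            kinds.add("other")
    return kinds.pop() if len(kinds) == 1 else "mixed"


def teacher_row(tokens, index, label, n=1):
    """Build one teacher-data row for tokens[index] from its n neighbours on each side.

    tokens: list of (surface string, part of speech) pairs from morphological analysis.
    """
    row = {"label": label}
    for offset in range(1, n + 1):
        for side, position in (("prev", index - offset), ("next", index + offset)):
            if 0 <= position < len(tokens):
                surface, part_of_speech = tokens[position]
                values = (surface, part_of_speech, char_type(surface))
            else:
                values = (None, None, None)  # no neighbour at the document edge
            row[f"{side}{offset}_string"] = values[0]
            row[f"{side}{offset}_pos"] = values[1]
            row[f"{side}{offset}_chartype"] = values[2]
    return row


# Token sequence from the text; the "noun" tags are assumed, not taken from FIG. 5.
tokens = [("運用", "noun"), ("管理", "noun"), ("ツール", "noun"), ("JP01", "noun"), ("提案", "noun")]
row = teacher_row(tokens, index=3, label="positive")
print(row["prev1_string"], row["prev1_chartype"], row["next1_string"], row["next1_chartype"])
```

Raising `n` adds the columns that correspond to the storage areas for the second and further neighbouring words.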

1.2.2 Named entity extraction phase
In the named entity extraction phase, the named entity extraction unit 102 extracts named entity candidates from the morphologically analyzed document data 105 stored in the database 103, based on the extraction rules generated in the extraction rule generation phase. In the processing flow shown in FIG. 2, steps S206 to S209 make up this phase and are outlined below.

・S206: The operator selects appropriate rules from the extraction rules displayed on the terminal 100 and requests that the selected rules be applied to the morphologically analyzed document data 105.
・S207: The system accepts the operator's request, extracts named entity candidates from the morphologically analyzed document data 105 based on the selected extraction rules, and displays the candidates on the terminal 100.
・S208: The operator judges which of the displayed named entity candidates are correct named entities and requests that they be registered in the named entity storage unit 107.
・S209: The system accepts the operator's request and registers the candidates that the operator selected as correct named entities in the named entity storage unit 107. It also registers the extraction rules used to extract the registered candidates in the extraction rule storage unit 106.
The screens that the system displays on the terminal 100 in this phase are described using the IT company daily business report database example from the previous section. The screens displayed on the terminal 100 are shown in FIGS. 6, 7, and 10. FIG. 6 is as described in the previous section.
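Steps S206 and S207 can be sketched as a filter over rules followed by a scan of the analyzed documents. The rule representation below (a condition on the neighbouring words plus a conclusion and a confidence) is an assumption made for illustration; the patent states only that a rule has a condition part, a conclusion part, and a confidence, and the rule contents and confidence values here are invented.

```python
from collections import Counter


def select_rules(rules, threshold):
    """S206 (sketch): keep the rules whose confidence is at least the threshold."""
    return [rule for rule in rules if rule["confidence"] >= threshold]


def extract_candidates(analyzed_docs, rules):
    """S207 (sketch): count words whose context satisfies the condition of any rule."""
    candidates = Counter()
    for words in analyzed_docs.values():
        for i, word in enumerate(words):
            context = {
                "prev1_string": words[i - 1] if i > 0 else None,
                "next1_string": words[i + 1] if i + 1 < len(words) else None,
            }
            for rule in rules:
                if all(context.get(key) == value for key, value in rule["condition"].items()):
                    candidates[word] += 1
                    break  # count each occurrence once even if several rules match
    return candidates


# Hypothetical rules and documents.
rules = [
    {"id": "001", "condition": {"next1_string": "提案"}, "conclusion": "positive", "confidence": 0.85},
    {"id": "003", "condition": {"prev1_string": "提案"}, "conclusion": "positive", "confidence": 0.60},
]
analyzed_docs = {
    "00001": ["ツール", "Server01", "提案"],
    "00002": ["Tool02", "提案", "説明"],
}
selected = select_rules(rules, threshold=0.80)
candidates = extract_candidates(analyzed_docs, selected)
print(sorted(candidates))  # ['Server01', 'Tool02']
```

Rule 003 falls below the cutoff, so the word "説明" that it would have matched is not proposed as a candidate.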

FIG. 7 is a named entity candidate judgment support screen 700, which consists of an extraction rule list display unit 701 that displays extraction rules, a named entity candidate list display unit 702 that displays the candidates extracted by the system, and a related information display unit 703 that displays information about the candidate being judged. As shown in FIG. 7, the extraction rule list display unit 701 consists of an extraction rule selection unit 704, a rule ID display unit 705, a condition display unit 706, a conclusion display unit 707, and a rule confidence display unit 708. The named entity candidate list display unit 702 consists of a candidate selection unit 709 that displays check boxes for selecting candidates, a candidate display unit 710 that displays the candidates, a candidate confidence display unit 711 that displays the confidence of each candidate, a frequency display unit 712 that displays how often each candidate appears in the morphologically analyzed document data 105, a document display unit 713 that displays the content of a document containing the candidate, a register button 720 that registers the selected candidates in the named entity storage unit 107, and a delete button 721 that deletes the selected candidates.

The related information display unit 703 consists of a search key display unit 714 that displays the candidate to be searched for, a search option selection unit 715 for specifying search options, and a search result list display unit 716 that displays search results. The search result list display unit 716 in turn consists of a search result selection unit 717 that displays check boxes for selecting search results, a material name display unit 718 that displays the names of the materials found, a content display unit 719 that displays the content of each material, and a search button 722 that searches information sources other than the database 103 for information about the candidate displayed in the search key display unit 714. The confidence of a named entity candidate is an index of how correct the candidate is: if P is the number of times the candidate was extracted as a positive example and N is the number of times it was extracted as a negative example, the confidence is P/(P+N).
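The candidate confidence defined above could be computed from a log of extraction events, as in the following minimal sketch. The event format (candidate, label) and the function name are assumptions for illustration; the text gives only the ratio P/(P+N).

```python
from collections import Counter


def candidate_confidences(extractions):
    """Return candidate -> P / (P + N) from (candidate, label) extraction events."""
    positives, negatives = Counter(), Counter()
    for candidate, label in extractions:
        if label == "positive":
            positives[candidate] += 1
        else:
            negatives[candidate] += 1
    return {
        candidate: positives[candidate] / (positives[candidate] + negatives[candidate])
        for candidate in positives.keys() | negatives.keys()
    }


# Hypothetical extraction events.
extractions = [
    ("Server01", "positive"),
    ("Server01", "positive"),
    ("Server01", "positive"),
    ("Server01", "negative"),
    ("Tool02", "positive"),
]
confidences = candidate_confidences(extractions)
print(confidences["Server01"], confidences["Tool02"])  # 0.75 1.0
```

Sorting the candidate list by this value would let the operator review the most doubtful candidates first or last, as preferred.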

Here, the on-screen operations shown in FIGS. 6 and 7 are described in association with steps S206 to S209.
・S206 (FIG. 6): Based on the certainty of each extraction rule displayed in the extraction rule list display unit 307, the operator decides to adopt the extraction rules whose certainty is 0.80 or higher and selects them (rule IDs 001 and 002 on the screen). The operator then presses the extraction button 605 to request that the system start named entity extraction using the selected rules.

・S207 (FIG. 7): When the extraction button 605 is pressed, the system first displays the extraction rules selected by the operator in step S206 (rule IDs 001 and 002) in the extraction rule list display unit 701 of the named entity candidate discrimination support screen 700 (the example screen also shows rule ID 005). Next, using the extraction rules with rule IDs 001, 002, and 005, the system extracts named entity candidates such as "Server01" and "Tool02" from the morphologically analyzed document data 105. Finally, it displays the extracted candidates in the named entity candidate list display unit 702. At this point, unit 702 shows every candidate extracted by the rules displayed in the extraction rule list display unit 701.
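Steps S206 and S207 can be sketched in Python as follows, assuming that each document is already split into words and that a rule tests a single neighbouring word. Only the condition of rule 001 (the following word is "紹介") and the 0.80 threshold come from the text; the class, the function names, the conditions of rules 002 and 003, the certainty values of rules 002 and 003, and the sample documents are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionRule:
    rule_id: str
    position: str     # "next" or "prev": which neighbouring word is tested
    word: str         # surface string the neighbouring word must equal
    conclusion: str   # e.g. "positive (IT product)"
    certainty: float


def select_rules(rules, threshold=0.80):
    """S206: keep the rules whose certainty is at or above the threshold."""
    return [r for r in rules if r.certainty >= threshold]


def extract_candidates(rules, tokenized_docs):
    """S207: return {candidate: set of IDs of the rules that extracted it}."""
    found = {}
    for tokens in tokenized_docs:
        for i, token in enumerate(tokens):
            for rule in rules:
                j = i + 1 if rule.position == "next" else i - 1
                if 0 <= j < len(tokens) and tokens[j] == rule.word:
                    found.setdefault(token, set()).add(rule.rule_id)
    return found


rules = [
    ExtractionRule("001", "next", "紹介", "positive (IT product)", 0.95),
    ExtractionRule("002", "prev", "新製品", "positive (IT product)", 0.85),  # hypothetical
    ExtractionRule("003", "next", "です", "positive (IT product)", 0.40),    # hypothetical
]
docs = [  # hypothetical, already tokenized documents
    ["Server01", "紹介", "ページ"],
    ["新製品", "Tool02", "発売"],
    ["これ", "です"],
]
selected = select_rules(rules)
candidates = extract_candidates(selected, docs)
```

Keeping the rule IDs alongside each candidate mirrors the screen behaviour in which the operator can later view the candidates rule by rule.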

・S208 (FIG. 7): To view the extraction results rule by rule, the operator selects some of the extraction rules displayed in the extraction rule list display unit 701. In the example of FIG. 7, the rule with rule ID 001 is selected, and the named entity candidate list display unit 702 then shows the candidates extracted by that rule. Next, to judge whether candidates are correct named entities, the operator selects some of the candidates displayed in unit 702. In this example the operator selects the candidate "Server01". When "Server01" is selected, the document display unit 713 shows the contents of the documents containing it, and "Server01" appears in the search key display unit 714.

If the operator reads the documents and judges the candidate to be a correct named entity, the operator presses the registration button 720 to register "Server01" in the named entity storage unit 107. Conversely, if the operator judges that it is not a named entity, the operator presses the delete button 721 to delete "Server01" from the named entity candidate list display unit 702. If it is hard to judge whether "Server01" is a named entity, the operator presses the search button 722 to search for related information about it. To specify the search scope, the operator selects either "Web (external)" or "Web (internal)" in the search option selection unit 715. "Web (internal)" searches web pages on the company intranet, and "Web (external)" searches web pages on the Internet. In the example of FIG. 7, the operator selects "Web (external)" and presses the search button 722, and the results are displayed in the search result list display unit 716.

The material name display unit 718 then shows the URLs of the web pages containing the search key ("Server01"), and the operator selects one of them (in this example, "http://www.abcd.co.xx/Products"). When a URL is selected, the content display unit 719 shows the contents of the web page containing the search key. Because the search option is "Web (external)", however, only the table portions of the page are displayed. Based on what is shown in the content display unit 719, the operator judges whether "Server01" is a correct named entity and, if so, presses the registration button 720.

・S209 (FIG. 7): When the operator presses the registration button 720, the system registers the candidates checked in the candidate selection unit 709 in the named entity storage unit 107. In the example of FIG. 7, when the operator selects the candidate "Server01" and presses the registration button 720, the system registers it as a named entity in the named entity storage unit 107.

FIG. 8 shows the format in which named entities are stored in the named entity storage unit 107. The format consists of a named entity storage unit 800 that stores the named entity, a named entity type storage unit 801 that stores its type, a certainty storage unit 802 that stores the certainty with which it was extracted, an appearance frequency storage unit 803 that stores how often it appears in the morphologically analyzed document data 105, and an extraction rule ID storage unit 804 that stores the IDs of the extraction rules used to extract it. When "Server01" is registered in the named entity storage unit 107, as shown in FIG. 8, "Server01" is stored in unit 800, "artifact name (IT product)" in unit 801, "0.98" in unit 802, "948" in unit 803, and "00001,00009,00203" in unit 804. The example of FIG. 8 also contains "Tool02", "PC03", "DB04", and "DB05".

Next, after registering the named entity candidate in the named entity storage unit 107, the system registers the extraction rule used to extract that candidate in the extraction rule storage unit 106. In the example of FIG. 7, the rule that extracted "Server01" (rule ID: 001; condition: the following word is "紹介" (introduction); conclusion: positive example (IT product)) is registered in the extraction rule storage unit 106.

FIG. 9 shows the format in which extraction rules are stored in the extraction rule storage unit 106. The format consists of a rule ID storage unit 900 that stores the rule ID, a condition storage unit 901 that stores the condition part of the rule, a conclusion storage unit 902 that stores the conclusion part of the rule, and a certainty storage unit 903 that stores the certainty of the rule. When the rule that extracted "Server01" is stored, as shown in FIG. 9, "00001" is stored in unit 900, the condition that the following word equals "紹介" (introduction) in unit 901, "positive example (IT product)" in unit 902, and "0.95" in unit 903. The rule IDs in the rule ID storage unit 900 correspond to the rule IDs stored in the extraction rule ID storage unit 804 described above. Finally, once registration of the named entity candidates and extraction rules is complete, the system displays the message "Registration complete" on the screen of the terminal 100.
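The two storage formats and the link between them through rule IDs can be sketched as plain Python records. The field values are the ones given for FIG. 8 and FIG. 9; the class names, field names, the choice of a list and a dictionary as containers, and the helper `rules_for` are assumptions for illustration. Rules 00009 and 00203 are omitted because the text does not give their contents.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NamedEntityRecord:      # one row of the format in FIG. 8
    entity: str               # 800: named entity
    entity_type: str          # 801: type of the named entity
    certainty: float          # 802: certainty at extraction time
    frequency: int            # 803: frequency in the analyzed document data
    rule_ids: List[str]       # 804: IDs of the rules that extracted it


@dataclass
class RuleRecord:             # one row of the format in FIG. 9
    rule_id: str              # 900: rule ID
    condition: str            # 901: condition part
    conclusion: str           # 902: conclusion part
    certainty: float          # 903: certainty of the rule


entity_store = [
    NamedEntityRecord("Server01", "artifact name (IT product)", 0.98, 948,
                      ["00001", "00009", "00203"]),
]
rule_store: Dict[str, RuleRecord] = {
    "00001": RuleRecord("00001", 'following word = "紹介"',
                        "positive example (IT product)", 0.95),
}


def rules_for(entity: str, entities: List[NamedEntityRecord],
              rules: Dict[str, RuleRecord]) -> List[RuleRecord]:
    """Look up the stored rules that extracted a registered entity."""
    for record in entities:
        if record.entity == entity:
            return [rules[rid] for rid in record.rule_ids if rid in rules]
    return []


server_rules = rules_for("Server01", entity_store, rule_store)
```

Storing only rule IDs on the entity side keeps each rule in one place, so a change to a rule's certainty does not have to be copied into every entity record.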

1.3 Teacher data creation support function
This system adopts a bootstrap method that repeatedly executes the processing shown in FIG. 2 (the extraction rule generation phase and the named entity extraction phase). For the bootstrap method, see "Japanese named entity extraction with low manual cost by bootstrapping" (Takehito Utsuro and Manabu Sassano, IPSJ SIG Technical Report 2000-NL-139, 2000) (Non-Patent Document 2). With this method, the operator can carry out named entity extraction merely by repeating the simple operation of entering a small amount of teacher data into the system. As already described, the named entities registered in the named entity storage unit 107 are used as this small amount of teacher data. To help the user enter more accurate teacher data during this work, the extraction rule learning unit 101 of the system provides the operator with a function that supports the creation of teacher data.
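The alternation of the two phases can be sketched as a loop. This is a toy illustration under stated assumptions, not the learning algorithm of the system or of the cited report: a "rule" here is just a word observed immediately after a known entity, the operator's judgment is simulated by a fixed set of approved words, and the stopping condition, `max_rounds`, all function names, and the sample documents are hypothetical.

```python
def learn_rules(known, docs):
    """Rule generation phase (toy): words seen right after a known entity."""
    return {doc[i + 1] for doc in docs
            for i in range(len(doc) - 1) if doc[i] in known}


def extract(rules, docs):
    """Extraction phase (toy): words immediately followed by a learned word."""
    return {doc[i] for doc in docs
            for i in range(len(doc) - 1) if doc[i + 1] in rules}


def bootstrap(seed, docs, review, max_rounds=10):
    """Repeat both phases, feeding accepted entities back in as teacher data."""
    known = set(seed)
    for _ in range(max_rounds):
        rules = learn_rules(known, docs)
        candidates = extract(rules, docs) - known
        accepted = review(candidates)   # the operator registers correct candidates
        if not accepted:                # assumption: stop when nothing new is accepted
            break
        known |= accepted
    return known


docs = [  # hypothetical tokenized documents
    ["Server01", "紹介"],
    ["Tool02", "紹介"],
    ["Tool02", "発売"],
    ["PC03", "発売"],
    ["会社", "紹介"],
]
approved = {"Tool02", "PC03"}   # stands in for the operator's judgment
result = bootstrap({"Server01"}, docs, lambda found: found & approved)
```

Starting from the single seed "Server01", the first round accepts "Tool02", which in turn yields a new context word that lets the second round reach "PC03"; "会社" is extracted as a candidate but never accepted.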

FIG. 10 shows the teacher data creation support screen 1000 used to access this function. The screen consists of the extraction rule list display unit 701, the named entity candidate list display unit 702, and a misapplied document list display unit 1001 that displays documents to which an extraction rule was applied erroneously (hereinafter, misapplied documents). Units 701 and 702 are almost the same as those of the named entity candidate discrimination support screen 700 in FIG. 7. The difference is that unit 702 additionally has a misapplied document selection unit 1002 for checking which of the documents displayed in the document display unit 713 are misapplied documents, and an error confirmation button 1005 that confirms the checked documents as misapplied documents. The misapplied document list display unit 1001 consists of a misapplied document candidate display unit 1003 that displays candidate misapplied documents, a misapplied document selection unit 1004 for checking which of those candidates really are misapplied documents, and an error confirmation button 1006 that confirms the checked documents as misapplied documents.

The operating procedure for the teacher data creation support screen 1000 is as follows. In the example of FIG. 10, the operator selects the rule with rule ID 001 from the extraction rules displayed in the extraction rule list display unit 701, and then selects "くまくん" (Kuma-kun) from the candidates extracted by that rule and shown in the named entity candidate list display unit 702. Up to this point the operation is the same as on the named entity candidate discrimination support screen 700 of FIG. 7. The system searches the document data 104 for documents containing "くまくん" and displays them in the document display unit 713. The operator reads the displayed documents and selects those in which "くまくん" is not used as a named entity. In this example, the documents other than document ID 90003 use "くまくん" as a named entity of the type artifact name (IT product), whereas the document with ID 90003 uses it merely as a nickname for an animal. The operator checks the check box for document ID 90003 in the misapplied document selection unit 1002, thereby selecting that document as a misapplied document.

When the error confirmation button 1005 is then pressed, the system treats "くまくん" in the document selected by the operator (document ID 90003) as a negative example and "くまくん" in the unselected documents as positive examples. In addition, when the user marks, in the selected document (document ID 90003), the part that showed "くまくん" not to be a named entity, the misapplied document list display unit 1001 displays the documents (other than document ID 90003) that contain the marked part. In the example of FIG. 10, the operator marks "動物園" (zoo) (marked portion 1007). The system then searches the document data 104 for documents containing both "くまくん" and "動物園" and displays the hits in the misapplied document list display unit 1001. The operator reads the documents shown in the misapplied document candidate display unit 1003 and selects those in which "くまくん" is not used as a named entity. In this example, document IDs 90001, 90023, and 90234 are selected. When the selection is finished, the operator presses the error confirmation button 1006. As when the error confirmation button 1005 is pressed, the system treats "くまくん" in the selected documents (document IDs 90001, 90023, and 90234) as negative examples and "くまくん" in the unselected documents as positive examples.
In this way, when creating teacher data, the operator can not only decide positive and negative examples from the surface form of a word, but can also look at the documents in which the word appears and specify in detail which occurrences of the same word are positive examples and which are negative examples.
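The per-document labeling described above can be sketched in Python. The document IDs 90001, 90003, 90023, and 90234 appear in the text, but the document contents, the ID 90002, and the function names are hypothetical, and the sketch assumes for simplicity that the operator confirms every document returned by the co-occurrence search as misapplied.

```python
def find_misapplied_candidates(docs, entity, marked_word, exclude_ids=()):
    """IDs of documents that contain both the entity and the marked word."""
    return sorted(
        doc_id for doc_id, tokens in docs.items()
        if entity in tokens and marked_word in tokens and doc_id not in exclude_ids
    )


def label_occurrences(docs, entity, misapplied_ids):
    """Label each document containing the entity as negative or positive."""
    return {
        doc_id: ("negative" if doc_id in misapplied_ids else "positive")
        for doc_id, tokens in docs.items() if entity in tokens
    }


docs = {  # hypothetical tokenized contents
    "90001": ["動物園", "の", "くまくん"],
    "90002": ["くまくん", "紹介"],
    "90003": ["動物園", "で", "くまくん", "を", "見た"],
    "90023": ["くまくん", "と", "動物園"],
    "90234": ["動物園", "くまくん", "人気"],
}

# The operator first confirms 90003, then marks "動物園" in it.
hits = find_misapplied_candidates(docs, "くまくん", "動物園", exclude_ids={"90003"})
# Assumption: the operator confirms all hits as misapplied documents.
labels = label_occurrences(docs, "くまくん", {"90003"} | set(hits))
```

The same surface string thus receives different labels depending on the document it occurs in, which is the point of the support function.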

2. Description of components
2.1 Extraction rule learning unit
The extraction rule learning unit 101 performs the extraction rule creation phase shown in FIG. 2 (steps S201 to S205) and the processing of the teacher data creation support function shown in FIG. 10. The processing performed in the extraction rule creation phase has already been described. The flow of processing as seen from the extraction rule learning unit 101 is summarized below.

・Display the named entities accumulated in the named entity storage unit 107 in the positive example candidate list display unit 301 (step S201).
・Accept the positive examples entered by the operator, take the words other than the positive examples from the documents in the morphologically analyzed document data 105 that contain the positive examples, and display them as negative example candidates in the negative example candidate list display unit 304 (step S203).
・Generate teacher data (for example, FIG. 5) from the positive and negative examples entered by the operator, and learn regularities from the teacher data to generate extraction rules. Then display the generated extraction rules in the extraction rule list display unit 307 (step S205).
The flow of processing of the teacher data creation support function, as seen from the extraction rule learning unit 101, is summarized below.

・Display the generated extraction rules in the extraction rule list display unit 701 (FIG. 10).
・Apply the extraction rule selected by the operator in the extraction rule list display unit 701, and display the named entity candidates extracted from the morphologically analyzed document data 105 in the named entity candidate list display unit 702.
・Find, in the document data 104, the documents containing the named entity candidate selected by the operator in the named entity candidate list display unit 702, and display them in the document display unit 713.

・When the error confirmation button 1005 is pressed, specify the labels to be stored in the positive/negative example label storage unit 500 when teacher data is created, so that the documents selected by the operator in the named entity candidate list display unit 702 are treated as negative examples and the unselected documents as positive examples.
・When the document display unit 713 accepts a mark entered by the user (for example, marked portion 1007), find the documents containing the marked portion in the document data 104 and display them in the misapplied document list display unit 1001.
・When the error confirmation button 1006 is pressed, as when the error confirmation button 1005 is pressed, specify the labels to be stored in the positive/negative example label storage unit 500 when teacher data is created, so that the documents selected by the operator in the misapplied document list display unit 1001 are treated as negative examples and the unselected documents as positive examples.

2.2 Named entity extraction unit
The named entity extraction unit 102 performs the named entity extraction phase shown in FIG. 2 (steps S206 to S209). The processing performed in the named entity extraction phase has already been described. An overview of the processing as seen from the named entity extraction unit 102 is given below.
・Apply the extraction rules selected by the operator in the extraction rule list display unit 307 (FIG. 6) to extract named entity candidates from the morphologically analyzed document data 105, and display them in the named entity candidate list display unit 702 (FIG. 7). Also display the extraction rules used to extract the candidates in the extraction rule list display unit 701.

・Display, in the named entity candidate list display unit 702, the named entity candidates extracted by the extraction rule that the operator selected in the extraction rule list display unit 701.
・When a check box in the candidate selection unit 709 is checked, search the document data 104 for documents containing the checked candidate and display their contents in the document display unit 713. Also display the candidate in the search key display unit 714.
・When the search button 722 is pressed, search for related information according to the option checked in the search option selection unit 715, and display the results in the search result list display unit 716.
・When a check box in the search result selection unit 717 is checked, display the contents of the checked material in the content display unit 719.
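Looking up the documents that contain a checked candidate can be sketched with a word-to-document index, in the spirit of the document ID table 109, whose word, frequency, and document ID fields are listed in the explanation of symbols. The table layout, the function names, and the choice to count every occurrence in the frequency (rather than the number of documents) are assumptions; the text in this section does not specify them.

```python
from collections import defaultdict


def build_document_id_table(docs):
    """Map each word to its total frequency and the IDs of documents containing it."""
    table = defaultdict(lambda: {"frequency": 0, "doc_ids": []})
    for doc_id, tokens in docs.items():
        for token in tokens:
            entry = table[token]
            entry["frequency"] += 1          # assumption: count every occurrence
            if doc_id not in entry["doc_ids"]:
                entry["doc_ids"].append(doc_id)
    return dict(table)


def documents_containing(table, word):
    """Document IDs to show in the document display unit for a checked candidate."""
    return table.get(word, {"doc_ids": []})["doc_ids"]


docs = {  # hypothetical tokenized documents
    "d1": ["Server01", "紹介", "Server01"],
    "d2": ["Server01", "発売"],
    "d3": ["Tool02", "紹介"],
}
table = build_document_id_table(docs)
server_docs = documents_containing(table, "Server01")
```

Building the table once avoids scanning every document each time the operator checks a candidate.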

The invention can be used in text mining systems and information retrieval systems.

FIG. 1 is a diagram showing the overall system configuration.
FIG. 2 is a diagram showing the flow of the named entity extraction work.
FIG. 3 is a diagram showing an example of positive example input on the extraction rule generation support screen.
FIG. 4 is a diagram showing an example of negative example input on the extraction rule generation support screen.
FIG. 5 is a diagram showing an example of the teacher data input to the extraction rule learning unit 101.
FIG. 6 is a diagram showing an example of extraction rule display on the extraction rule generation support screen.
FIG. 7 is a diagram showing an example of judging named entity candidates on the named entity candidate discrimination support screen.
FIG. 8 is a diagram showing an example of the storage format of named entities in the named entity storage unit 107.
FIG. 9 is a diagram showing an example of the storage format of extraction rules in the extraction rule storage unit 106.
FIG. 10 is a diagram showing an example of identifying misapplied documents on the teacher data creation support screen.
FIG. 11 is a diagram showing an example of the storage format of words, frequencies, and document IDs in the document ID table 109.

Explanation of symbols

100: terminal, 101: extraction rule learning unit, 102: named entity extraction unit, 103: database, 104: document data, 105: morphologically analyzed document data, 106: extraction rule storage unit, 107: named entity storage unit, 300: extraction rule generation support screen, 301: positive example candidate list display unit, 302: positive example direct input unit, 304: negative example candidate list display unit, 305: negative example direct input unit, 307: extraction rule list display unit, 500: positive/negative example label storage unit, 501: preceding-word character string storage unit, 502: preceding-word part-of-speech storage unit, 503: preceding-word character type storage unit, 504: following-word character string storage unit, 505: following-word part-of-speech storage unit, 506: following-word character type storage unit, 700: named entity candidate discrimination support screen, 701: extraction rule list display unit, 702: named entity candidate list display unit, 703: related information display unit, 800: named entity storage unit, 801: named entity type storage unit, 802: certainty storage unit, 803: appearance frequency storage unit, 804: extraction rule ID storage unit, 900: extraction rule ID storage unit, 901: condition storage unit, 902: conclusion storage unit, 903: certainty storage unit, 1000: teacher data creation support screen, 1001: misapplied document list display unit, 1100: word storage unit, 1101: frequency storage unit, 1102: document ID storage unit.

Claims

A named entity extraction method for extracting named entities from a plurality of documents, the method comprising: a step of receiving a first user input consisting of at least one named entity; a step of extracting, from the plurality of documents, one or more words that differ from every named entity included in the first user input, and displaying at least some of the extracted words; a step of receiving a second user input that selects at least some of the displayed words; a step of learning regularities from the first user input and the second user input to generate one or more rules, extracting one or more named entity candidates from the plurality of documents using the generated rules, and displaying at least some of the extracted named entity candidates; a step of receiving a third user input that selects at least some of the displayed named entity candidates; and a step of storing the named entity candidates selected by the third user input as extracted named entities, together with the one or more rules used to extract those candidates.

The named entity extraction method according to claim 1, further comprising a step of displaying the stored extracted named entities and a step of receiving a fourth user input that selects at least some of the displayed extracted named entities, wherein the fourth user input is treated as the first user input and the subsequent steps are continued.

The named entity extraction method according to claim 1, wherein, in the step of extracting from the plurality of documents one or more words that differ from every named entity included in the first user input and displaying at least some of the extracted words, the frequency with which each extracted word appears in the plurality of documents is calculated, and at least some of the extracted words are displayed in an order based on the calculated frequency.

The named entity extraction method, wherein, in the step of learning regularities from the first user input and the second user input to generate one or more rules, extracting one or more named entity candidates from the plurality of documents using the generated rules, and displaying at least some of the extracted named entity candidates, the displayed named entity candidates are shown together with the rules used to extract them.

The named entity extraction method according to claim 4, further comprising: a step of receiving a fifth user input that selects one or more rules from the displayed rules; a step of displaying a list of the named entity candidates extracted using the rules selected by the fifth user input; a step of receiving a sixth user input that selects one or more error candidates from the displayed list of named entity candidates; a step of displaying a list of documents containing the error candidates selected by the sixth user input, together with the locations at which the error candidates appear in those documents; a step of receiving a seventh user input that designates, from the displayed list of documents, a misapplied document to which the rule selected by the fifth user input was applied erroneously; and a step of displaying a list of documents similar to the misapplied document designated by the seventh user input.

A named entity extraction apparatus comprising: a database that stores a plurality of documents; input means for receiving a first user input consisting of at least one named entity; learning means for extracting, from the plurality of documents stored in the database, one or more words that differ from every named entity included in the first user input; display means for displaying at least some of the words extracted by the learning means; extraction means for extracting named entity candidates using extraction rules; and storage means for storing named entities, wherein the input means further receives a second user input that selects at least some of the words displayed on the display means, the learning means further learns regularities from the first user input and the second user input to generate one or more extraction rules, the extraction means extracts one or more named entity candidates from the plurality of documents stored in the database using the extraction rules generated by the learning means, the display means displays at least some of the named entity candidates extracted by the extraction means, the input means further receives a third user input that selects at least some of the displayed named entity candidates, and the storage means stores the named entity candidates selected by the third user input as extracted named entities, together with the one or more rules used to extract those candidates.