JP4186774B2 - Information extraction apparatus, information extraction method, and program

Info

Publication number: JP4186774B2
Application number: JP2003332796A
Authority: JP
Inventors: 宏行大沼
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2003-09-25
Filing date: 2003-09-25
Publication date: 2008-11-26
Anticipated expiration: 2023-09-25
Also published as: JP2005100082A

Description

The present invention relates to an information extraction apparatus, an information extraction method, and a program that extract information required by a user from a document and associate the information in the document with a template indicating the attributes to be extracted.

Conventional information extraction apparatuses that extract information required by a user from a document include those disclosed in Papers 1 and 2 below.

Paper 1, "Automatic Digest Generation of Electronic News", Transactions of the Information Processing Society of Japan, Vol. 36, No. 10 (1995), extracts conference announcement information from newsgroups for announcement articles. For each attribute to be extracted (hereinafter, an extraction attribute), feature information (such as heading label words and linguistic expression patterns) is prepared and used for extraction. These extraction attributes form a flat template consisting of "Title", "Date", "Venue", "Paper deadline", "Article type", and so on.

Similarly, in Paper 2, "Collecting, classifying, and integrating information from wide-area networks based on ontology", Transactions of the Information Processing Society of Japan, Vol. 38, No. 3 (1997), the integrated information also takes the form of a flat template.

"Automatic Digest Generation of Electronic News", Transactions of the Information Processing Society of Japan, Vol. 36, No. 10 (1995)
"Collecting, classifying, and integrating information from wide-area networks based on ontology", Transactions of the Information Processing Society of Japan, Vol. 38, No. 3 (1997)

However, using the flat templates described above causes the following problems.

(Problem 1)
For example, to extract "address of the venue" in addition to "venue" among the items of Table 1 of Paper 1, the approach of Paper 1 is to add "address of the venue" as another item of Table 1. This, however, cannot express the relationship between "venue" and "address of the venue". Because "venue" represents information about a place, information such as "address" and "zip code" tends to be written near it, but a flat template cannot express this.

(Problem 2)
For a different purpose, such as extracting the location of a company or the mailing address of a contact, place information such as place name, zip code, and address is extracted by very similar rules. Place information should therefore be reusable not only for the venue of a conference or similar event, but also for extracting company information and contact information. With a structure like that of Paper 1, however, separate templates must be prepared, and templates cannot be reused.

The present invention has been made in view of the above problems of conventional information extraction apparatuses. Its main object is to provide a new and improved information extraction apparatus, information extraction method, and program capable of attribute extraction in which existing templates can be reused.

To solve the above problems, a first aspect of the present invention provides an information extraction apparatus that associates information appearing in an input document with attributes. The information extraction apparatus (1) of the present invention includes: an extraction knowledge storage unit (11) that stores templates, each composed of a set of one or more attributes to be extracted, and that stores each attribute in association with its extraction knowledge; an extraction attribute determination unit (12) that refers to the extraction knowledge storage unit and associates information extracted from the document with attributes; and an output unit (13) that outputs each attribute and its value. The apparatus is characterized in that an attribute stored in the extraction knowledge storage unit (11) refers to another related template.

With this configuration, information appearing close together in a document is preferentially associated with extraction attributes in the same template, which makes it possible to perform attribute extraction that reuses existing templates.

The information extraction apparatus of the present invention can be applied in the following ways.

The extraction attribute determination unit (12) includes: a named entity extraction unit (121) that extracts named entities from the document and attaches a named entity type to each extracted named entity; a candidate attribute determination unit (123) that refers to the extraction knowledge storage unit and determines, as candidate attributes for a named entity, the attributes corresponding to the named entity type attached by the named entity extraction unit and the attributes that refer to a template containing an attribute corresponding to that type; and an attribute determination unit (124) that, for the named entities extracted from a predetermined document range, determines the template to which the largest number of the determined candidate attributes belong, and determines the candidate attributes belonging to that template as the attributes of the named entities extracted from that range.

The predetermined document range may be a range delimited by blank lines in the document, or a range of consecutive lines in which named entities appear.

The extraction knowledge storage unit (11) may store information indicating that, when a template containing an attribute to be extracted is referenced by another attribute, that other attribute is not added to the candidate attributes if a predetermined condition is satisfied. In general, keeping attributes that do not appear in the document out of the candidate attributes avoids a growth in the number of candidates that would increase the amount of computation or lower the extraction accuracy.

The extraction attribute determination unit (12) may further include a character string element division unit (122) that extracts, as character string elements, the ranges obtained by delimiting the character strings in the document at line breaks, spaces, and periods. When a character string element is a range delimited by a period, the candidate attribute determination unit may skip the determination of candidate attributes for the named entities extracted from that element.

To solve the above problems, a second aspect of the present invention provides an information extraction method that associates information appearing in an input document with attributes. The information extraction method of the present invention includes: a document input step (S100), executed by a document input unit that accepts a document as input; an extraction attribute determination step (S110 to S140), executed by an extraction attribute determination unit that associates information extracted from the document with attributes by referring to an extraction knowledge storage unit, which stores templates each composed of a set of one or more attributes to be extracted and stores each attribute in association with its extraction knowledge; and an output step (S150), executed by an output unit that outputs each attribute and its value. The method is characterized in that an attribute stored in the extraction knowledge storage unit refers to another related template.

The information extraction method of the present invention can be applied in the following ways.

The extraction attribute determination step includes: a named entity extraction step, executed by a named entity extraction unit that extracts named entities from the document and attaches a named entity type to each extracted named entity; a candidate attribute determination step (S130), executed by a candidate attribute determination unit that refers to the extraction knowledge storage unit and determines, as candidate attributes for a named entity, the attributes corresponding to the named entity type attached in the named entity extraction step and the attributes that refer to a template containing an attribute corresponding to that type; and an attribute determination step (S140), executed by an attribute determination unit that, for the named entities extracted from a predetermined document range, determines the template to which the largest number of the determined candidate attributes belong, and determines the candidate attributes belonging to that template as the attributes of the named entities extracted from that range.

The extraction attribute determination step may include a character string element division step (S120), executed by a character string element division unit that extracts, as character string elements, the ranges obtained by delimiting the character strings in the document at line breaks, spaces, and periods. When a character string element is delimited by a period in the character string element division step, the candidate attribute determination step may be skipped for the named entities extracted from that element.

According to another aspect of the present invention, there are provided a program for causing a computer to function as the information extraction apparatus of the present invention, and a computer-readable recording medium on which the program is recorded. The program may be written in any programming language. The recording medium may be any medium currently in general use for recording programs, such as a CD-ROM, a DVD-ROM, or a floppy disk (FD), or any recording medium used in the future.

In the above description, the reference numerals and step numbers given in parentheses after the constituent elements merely indicate, as examples and for ease of understanding, the corresponding elements in the embodiments and drawings described later. The present invention is not limited to them.

As described above, according to the present invention, information appearing close together in a document is preferentially associated with extraction attributes in the same template, which makes it possible to perform attribute extraction that reuses existing templates.

Preferred embodiments of the information extraction apparatus, information extraction method, and program according to the present invention are described in detail below with reference to the accompanying drawings. In this specification and the drawings, constituent elements having substantially the same functional configuration are given the same reference numerals, and duplicate descriptions are omitted.

(First embodiment)
In the first embodiment, to solve Problems 1 and 2, the extraction knowledge is expressed as a network structure in which the feature information of one extraction attribute resides in another template. The embodiment is characterized in that multiple pieces of information appearing close together in a document are preferentially associated with extraction attributes in the same template.

FIG. 1 is an explanatory diagram showing the information extraction apparatus according to the present embodiment.
As shown in FIG. 1, the information extraction apparatus 1 includes a document input unit 10, an extraction knowledge storage unit 11, an extraction attribute determination unit 12, and an output unit 13. The extraction attribute determination unit 12 includes a named entity extraction unit 121, a character string element division unit 122, a candidate attribute determination unit 123, and an attribute determination unit 124. The attribute determination unit 124 in turn includes a block division unit 125, a template ranking unit 126, and an attribute calculation unit 127. These constituent elements are described in detail below.

(Document input unit 10)
The document input unit 10 accepts a text document specified by the user as input.

(Extraction knowledge storage unit 11)
The extraction knowledge storage unit 11 stores information on what the extraction attribute determination unit 12 extracts (the extraction attributes) and how it extracts them (the feature information of each extraction attribute). This knowledge is expressed as a network structure in which the feature information of an extraction attribute can point to another template.

FIG. 2 shows the structure of the extraction knowledge storage unit 11. Each template has the items "extraction attribute name (attribute name)", "attribute value type", "reference to another template", "words that serve as headwords of the attribute (headwords)", and "whether the attribute is associated with an element in the document even without a headword". Each row of FIG. 2 represents the feature information of one extraction attribute. The "attribute value type" item indicates which kind of named entity (person name, organization name, event name, and so on) the value of the extraction attribute is. The "headwords" item lists the words written in a document as the heading of the attribute. The last item indicates whether a headword is required in order to associate an element in the document with the attribute. In the table, "Yes" means that the element is associated even without a headword, and "No" means that it is not.

For example, in the first row of the event information template, the value of the extraction attribute "Event name" is a pattern representing an event name, and "名称" (name) is a possible headword. When a character string matching the event-name pattern appears in the document, it is associated with the extraction attribute "Event name" even if no corresponding headword precedes it.

In the fifth row of the event information template, the value of the extraction attribute "Venue" is given by the place information template. Possible headwords are "会場" (venue) and "場所" (place). Even when no word matching a headword appears in the document, the element is associated with the venue of the event information.
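One possible in-memory representation of the rows described above is sketched below in Python. It is an illustration only, not the storage format of the patent: the class name `Attribute`, the dictionary `TEMPLATES`, the helper `resolve`, the English attribute names, and the "Address" row with its value type are all assumptions, since FIG. 2 is not reproduced in this text. Only the "Event name" and "Venue" rows follow the description.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attribute:
    """One row of a template (field names are hypothetical)."""
    name: str                              # extraction attribute name
    value_type: Optional[str] = None       # named entity type of the value
    reference: Optional[str] = None        # name of another template, if any
    headwords: List[str] = field(default_factory=list)
    match_without_headword: bool = True    # the "Yes"/"No" item


TEMPLATES = {
    "Event information": [
        Attribute("Event name", value_type="event name", headwords=["名称"]),
        Attribute("Venue", reference="Place information",
                  headwords=["会場", "場所"]),
    ],
    "Place information": [
        # Assumed row: its value type and flag are not given in the text.
        Attribute("Address", value_type="address"),
    ],
}


def resolve(path: str) -> Attribute:
    """Follow a dotted path such as 'Event information.Venue.Address'."""
    template, *names = path.split(".")
    if not names:
        raise ValueError("path must name at least one attribute")
    attr = None
    for i, name in enumerate(names):
        attr = next(a for a in TEMPLATES[template] if a.name == name)
        if i < len(names) - 1:
            # Move to the template that this attribute refers to.
            template = attr.reference
    return attr
```

Because "Venue" only stores the name of the place information template, the same place rows could be referenced from other templates as well, which is the reuse that the flat templates of Papers 1 and 2 do not allow.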

As another example, the address of the venue of an event is reached by following the "reference to another template" item from "Venue" in the event information template to "Address" in the place information template. This is written in dot notation as "Event information.Venue.Address", that is, in the form [template 1].[attribute name of template 1].[attribute name of template 2]...[attribute name of template n], where n is called the depth of the template part.

Within such an expression, the part that points to a template is called the template part. For example, in "Event information.Venue.Address", the part "Event information.Venue" ([template 1].[attribute name of template 1]) points to the place information template (template 2), so this part is the template part. In this example, the depth of the template part is 2.

When [attribute name of template n] is itself a reference to another template, the whole expression [template 1].[attribute name of template 1].[attribute name of template 2]...[attribute name of template n] is the template part. For example, the template part of "Event information.Lecture" is "Event information.Lecture", and its depth is 2.
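The two cases of the template part can be checked with a small Python sketch. The table `REFERENCES`, the function `template_part`, and the template name "Lecture information" are hypothetical: the text states only that "Venue" refers to the place information template and that "Lecture" refers to some other template.

```python
# (template, attribute) -> referenced template. "Lecture information" is an
# assumed name; the text does not name the template that "Lecture" refers to.
REFERENCES = {
    ("Event information", "Venue"): "Place information",
    ("Event information", "Lecture"): "Lecture information",
}


def template_part(path):
    """Return (template part, depth) for a dotted attribute path."""
    template, *names = path.split(".")
    current = template
    # Walk through every attribute except the last, following references.
    for name in names[:-1]:
        current = REFERENCES[(current, name)]
    # If the last attribute is itself a reference, it belongs to the part.
    last_is_reference = bool(names) and (current, names[-1]) in REFERENCES
    kept = names if last_is_reference else names[:-1]
    return ".".join([template, *kept]), len(kept) + 1


print(template_part("Event information.Venue.Address"))
print(template_part("Event information.Lecture"))
```

Both calls yield a depth of 2, matching the two examples in the text: the first drops the final attribute "Address", while the second keeps "Lecture" because it is a reference.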

(Extraction attribute determination unit 12)
The extraction attribute determination unit 12 refers to the extraction knowledge storage unit 11 and associates the extracted information with attributes. The extraction attribute determination unit 12 includes a named entity extraction unit 121, a character string element division unit 122, a candidate attribute determination unit 123, and an attribute determination unit 124. The attribute determination unit 124 in turn includes a block division unit 125, a template ranking unit 126, and an attribute calculation unit 127. These constituent elements are described in detail below.

(Named entity extraction unit 121)
The named entity extraction unit 121 attaches tags to the named entities in the input document. Tag types include person name, organization name, job title, place name, and event name. As the named entity extraction unit 121, for example, the apparatus described in Fukumoto et al., "Comparison of Japanese and English in named entity extraction" (IEICE Technical Report NLC98-21 (1998)), can be used. The apparatus in that document is only one example of how the function of the named entity extraction unit 121 can be realized, so a detailed description is omitted. To extract an organization name, for example, a regular expression is used to extract parts such as "○○株式会社" or "株式会社××", which contain "株式会社" (corporation) at the beginning or end. Event names are likewise extracted as parts ending in "シンポジウム" (symposium) or "セミナー" (seminar). For person names and job titles, possible names may be held in advance in the dictionary used for morphological analysis and extracted by string matching. Needless to say, an apparatus other than the one described in that document may be used to realize the function of the named entity extraction unit 121.
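The regular-expression approach mentioned above for organization and event names can be sketched in Python as follows. This is a minimal illustration, not the apparatus of Fukumoto et al.: the pattern definitions, the tag names, and the function `extract_named_entities` are assumptions, and a practical extractor would need far more patterns and dictionaries.

```python
import re

# Characters that are assumed to end a name: whitespace, commas, periods, colons.
_NAME = r"[^\s、。，:：]"

# "株式会社" at the beginning or at the end of the name.
ORG_PATTERN = re.compile(rf"株式会社{_NAME}+|{_NAME}+株式会社")
# A name ending in "シンポジウム" or "セミナー".
EVENT_PATTERN = re.compile(rf"{_NAME}*(?:シンポジウム|セミナー)")


def extract_named_entities(line):
    """Return (start offset, text, tag) tuples found in one line."""
    entities = []
    for tag, pattern in (("organization", ORG_PATTERN), ("event", EVENT_PATTERN)):
        for m in pattern.finditer(line):
            entities.append((m.start(), m.group(), tag))
    return sorted(entities)


print(extract_named_entities("主催: ABC株式会社"))
print(extract_named_entities("第1回ABCセミナー"))
```

The start offset is kept with each match because the later steps locate a named entity by its line and start position.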

(Character string element division unit 122)
The character string element division unit 122 extracts the character strings in the document as ranges delimited by line breaks, spaces, and periods. When a character string element is delimited by a period, it is regarded as a sentence, and the character string in that range is not processed by the subsequent candidate attribute determination unit 123. This is because, when a character string element is a sentence, it is easily associated with a wrong attribute unless processing such as syntactic parsing is performed.
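A minimal Python sketch of this division is shown below. The function name `split_elements`, the tuple layout, the 1-based positions, the inclusion of the period in a sentence element, and the treatment of the end of the text as a line break are all assumptions made for illustration.

```python
def split_elements(text):
    """Split text into (line, start, element, delimiter) tuples.

    Positions are 1-based (an assumption). The delimiter is "space",
    "period", or "newline"; an element cut by a period is treated as a
    sentence by the later steps.
    """
    elements = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        start = 0
        for pos, ch in enumerate(line):
            if ch == "。":
                # Keep the period as part of the sentence element.
                elements.append((line_no, start + 1, line[start:pos + 1], "period"))
                start = pos + 1
            elif ch in (" ", "\u3000"):
                if pos > start:
                    elements.append((line_no, start + 1, line[start:pos], "space"))
                start = pos + 1
        if start < len(line):
            # Whatever remains runs up to the line break (or end of text).
            elements.append((line_no, start + 1, line[start:], "newline"))
    return elements


for element in split_elements("第1回ABCセミナー\n7月23日 13:00\n参加します。"):
    print(element)
```

Recording the delimiter with each element is what later allows sentence-like elements to be skipped without any syntactic analysis.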

(Candidate attribute determination unit 123)
The candidate attribute determination unit 123 refers to the extraction knowledge storage unit 11 and determines, for each named entity extracted by the named entity extraction unit 121, the candidate attributes to which it might correspond.

(Attribute determination unit 124)
The attribute determination unit 124 determines the attribute of each named entity, taking into account the candidate attributes of the named entities before and after it.

(Block division unit 125)
The block division unit 125 finds ranges of consecutive lines in which named entities appear and separates each such range as a block.
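Grouping consecutive lines that contain named entities can be sketched in Python as follows. The function name `split_blocks` and the representation of the input as a list of line numbers are assumptions for illustration.

```python
def split_blocks(entity_lines):
    """Group line numbers that contain named entities into runs of
    consecutive lines; each run is one block."""
    blocks = []
    for n in sorted(set(entity_lines)):
        if blocks and n == blocks[-1][-1] + 1:
            blocks[-1].append(n)   # continues the current block
        else:
            blocks.append([n])     # a gap starts a new block
    return blocks


# Lines 1-2, 4-6 and 9 contain named entities; lines 3, 7 and 8 do not.
print(split_blocks([1, 2, 2, 4, 5, 6, 9]))
```

A line without any named entity therefore acts as a block boundary, much as a blank line does in the alternative definition of the document range.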

(Template ranking unit 126)
The template ranking unit 126 calculates, for each block, which templates contain the attributes corresponding to the named entities in the block.

(Attribute calculation unit 127)
The attribute calculation unit 127 determines the attribute of each named entity, block by block.
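Taken together with the earlier statement that the template to which the largest number of candidate attributes belong is selected for a document range, the ranking and attribute calculation for one block might be sketched in Python as below. This is one interpretation, not the procedure of the patent: counting each named entity once per template, the simplified `template_part` (everything before the last dot), the function names, and the "Company information" template are assumptions.

```python
from collections import Counter


def template_part(path):
    # Simplification: everything before the final attribute name.
    return path.rsplit(".", 1)[0]


def decide_attributes(block):
    """block maps each named entity to its candidate attribute paths.

    The template part supported by the most named entities wins, and only
    candidates under that template part are kept.
    """
    votes = Counter()
    for candidates in block.values():
        for part in {template_part(c) for c in candidates}:
            votes[part] += 1
    if not votes:
        return {}
    best = votes.most_common(1)[0][0]
    decided = {}
    for entity, candidates in block.items():
        kept = [c for c in candidates if template_part(c) == best]
        if kept:
            decided[entity] = kept
    return decided


block = {
    "ABCホール": ["Event information.Venue.Place name",
                 "Company information.Location.Place name"],
    "○○市××町1-1": ["Event information.Venue.Address"],
}
print(decide_attributes(block))
```

In this example the place name alone is ambiguous, but the neighbouring address supports only "Event information.Venue", so both named entities are assigned under that template part. How ties are broken is not specified here.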

(Output unit 13)
The output unit 13 outputs each attribute and its value.

The information extraction apparatus 1 according to the present embodiment is configured as described above.
Next, the operation of the information extraction apparatus 1 is described.
FIG. 3 is a flowchart showing the operation of the first embodiment. The operation is described below along this flowchart, using as an example the case where the document shown in FIG. 4 is accepted as input.

(Step S100)
The document input unit 10 accepts a document as input. Here, as an example, the document shown in FIG. 4 is accepted as input.

(Step S110)
The extraction attribute determination unit 12 starts the named entity extraction unit 121, passes the entire document to it as input, and has it attach tags to the named entities in the document. FIG. 5 shows the result of tagging the document of FIG. 4. The line position item gives the line number counted from the top of the document, and the start position item gives the character position counted from the beginning of the line. A headword is extracted from each line as a character string that begins with a string of two or more characters and contains a delimiter such as ":" or "]". This can be done by pattern matching. The fourth line of FIG. 4 is an example of such a line. Headword extraction is not limited to pattern matching. A dictionary of headwords may be used instead, or some other method.
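The pattern matching used for headwords could look like the following Python sketch. The exact pattern is an assumption: the text states only that a headword begins a line, has two or more characters, and comes with a delimiter such as ":" or "]". The optional opening bracket and the full-width delimiter variants are added here for illustration.

```python
import re
from typing import Optional

# Optional opening bracket, two or more non-delimiter characters,
# then a colon or closing bracket (half-width or full-width).
HEADWORD_PATTERN = re.compile(r"^\s*[\[［【]?([^\s:：\]］】]{2,})\s*[:：\]］】]")


def extract_headword(line: str) -> Optional[str]:
    """Return the headword at the start of a line, or None if there is none."""
    m = HEADWORD_PATTERN.match(line)
    return m.group(1) if m else None


print(extract_headword("主催: ABC株式会社"))
print(extract_headword("[会場] ABCホール"))
print(extract_headword("第1回ABCセミナー"))
```

The last call returns `None` because the line contains no delimiter, so it is not treated as a heading line.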

(Step S120)
The character string element division unit 122 takes the document as input and delimits the character string elements at line break characters, space characters, and periods. FIG. 6 shows the result of dividing the document of FIG. 4 into character string elements. As shown in FIG. 6, each character string element has the items line position, start position of the element from the beginning of the line, character string element, and delimiter. For example, "第1回○○セミナー" (1st ○○ Seminar) at line position 1 ends with a line break, so its delimiter is "line break". In contrast, "7月23日" (July 23) at line position 2 is followed by a space character, so its delimiter is "space".

(Step S130)
The candidate attribute determination unit 123 determines, for each named entity extracted in step S110, the attributes that are candidates for association (candidate attributes). For each named entity, the attributes whose "attribute value type" item in the extraction knowledge storage unit 11 matches the named entity become candidate attributes.

FIG. 7 is a flowchart showing the processing of the candidate attribute determination unit 123. The named entities obtained in step S110 are denoted N_1, N_2, ..., N_m in order from the top of the document. In FIG. 5, for example, there are 17 named entities, so m is 17.

(Step S1000)
The counter i, which indicates the named entity being processed, is initialized to 1.

(Step S1010)
If i is less than or equal to m, the process goes to step S1020. Otherwise, the process ends (all named entities have been processed).

(Step S1020)
The named entity N_i is taken as the processing target, and it is determined whether N_i is part of a sentence. The character string element corresponding to the position information of N_i (its line position item and start position item) is looked up in the result of step S120. That is, among the character string elements whose line position item equals that of N_i and for which (start position of N_i) - (start position of the element) is 0 or greater, the one with the smallest difference is selected. For example, the character string element corresponding to N_5, "○○市××町1-1", is "○○市××町1-1" in the fifth row of FIG. 6, whose line position item and start position item both match.
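The lookup rule of this step (same line, non-negative offset, smallest offset) can be sketched in Python as follows. The tuple layouts and the function name `find_element` are assumptions, and the sample values are illustrative rather than taken from FIG. 5 or FIG. 6.

```python
def find_element(entity, elements):
    """Return the character string element that contains the named entity.

    entity is (line, start); each element is (line, start, text, delimiter).
    Among the elements on the same line whose start is not after the
    entity's start, the one with the smallest offset is chosen.
    """
    line, start = entity
    best = None
    best_offset = None
    for element in elements:
        offset = start - element[1]
        if element[0] != line or offset < 0:
            continue
        if best_offset is None or offset < best_offset:
            best, best_offset = element, offset
    return best


elements = [
    (2, 1, "7月23日", "space"),
    (2, 7, "13:00", "newline"),
    (3, 1, "この○○セミナーでは...目的とします。", "period"),
]
print(find_element((2, 7), elements))  # starts exactly at the second element
print(find_element((3, 3), elements))  # lies inside the sentence on line 3
```

The delimiter of the element found here is what the next step inspects to decide whether the named entity sits inside a sentence.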

(Step S1030)
If the delimiter item of the corresponding character string element is "period", the element is regarded as a sentence and excluded from further processing, and the process goes to step S1120. Otherwise, the process goes to step S1040. For example, the character string element corresponding to N_8 "○○セミナー" (○○ Seminar) in FIG. 5 is, in FIG. 6, "この○○セミナーでは，今後必要となる能力の向上を目的とします。" (This ○○ Seminar aims to improve the abilities that will be needed in the future.). Its delimiter item is "period", so N_8 is not processed. For all other named entities, the delimiter of the corresponding character string element is not "period", so they are processed.

(Step S1040)
N_i is compared with the attribute-value type item or the headword item in the extraction knowledge storage unit 11. If the named-entity type item of N_i is "headword", go to step S1050. If it is anything other than "headword", go to step S1060.

(Step S1050)
It is checked whether the named entity item of N_i matches any headword item in the extraction knowledge storage unit 11. The matching attributes become candidate attributes, and the process goes to step S1070. For example, "Organizer" (N_6) in FIG. 5 matches the headword item of "event information.organizer" in FIG. 2. "Program" (N_9) matches the headword of "event information.lecture" in FIG. 2.

(Step S1060)
It is checked whether the "named-entity type" item of N_i matches any "attribute-value type" item in the extraction knowledge storage unit 11. For each matching attribute, the following conditions 1 and 2 are checked.

[Condition 1] If the value of the attribute's "associate this attribute even without a headword" item is "Yes", the attribute is added to the candidate attributes. Otherwise, condition 2 is checked.
[Condition 2] If the value of the attribute's "associate this attribute even without a headword" item is "No", and the same row as N_i contains a named entity whose "named-entity type" item is "headword" and whose candidate attribute matches the attribute, the attribute is added. Otherwise, it is not added.
Go to step S1070.

FIG. 8 shows the candidate attributes for FIG. 5. For example, the "named-entity type" item of N_1 "1st XX Seminar" is event name, so it matches "event information.event name" in FIG. 2. The named-entity type of N_2 "July 23" is date/time, so it matches "event information.date" and "lecture information.lecture time" in FIG. 2. The matching attributes are added as candidate attributes.

Likewise, the "named-entity type" item of N_7 "△△ University" is organization name, so it matches "event information.organizer" and "organization information.name" in FIG. 2. The value of the "associate this attribute even without a headword" item of "event information.organizer" is "No", so the named entities on the same row are checked. The "named-entity type" item of N_6 is "headword" and its candidate attribute is "event information.organizer", so "event information.organizer" is added as a candidate attribute of N_7.
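Steps S1040 to S1060 can be sketched as a single function. The `KnowledgeEntry` class, the function name, and the `KNOWLEDGE` rows are assumptions made for illustration; in particular, the "Yes" flags on three of the rows are guesses, since the contents of FIG. 2 are not reproduced in this text. Only the "No" flag on "event information.organizer" is stated in the description.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Set


@dataclass(frozen=True)
class KnowledgeEntry:
    attribute: str             # e.g. "event information.organizer"
    value_type: str            # "attribute-value type" item
    headwords: FrozenSet[str]  # "headword" item
    no_headword_ok: bool       # "associate this attribute even without a headword"


def initial_candidates(text: str, entity_type: str,
                       same_row_headword_attrs: Set[str],
                       knowledge: Sequence[KnowledgeEntry]) -> List[str]:
    """Return the candidate attributes of one named entity."""
    if entity_type == "headword":
        # Step S1050: match the entity text against the headword items.
        return [e.attribute for e in knowledge if text in e.headwords]
    candidates: List[str] = []
    for entry in knowledge:  # step S1060: match on the value type
        if entry.value_type != entity_type:
            continue
        if entry.no_headword_ok:                           # condition 1
            candidates.append(entry.attribute)
        elif entry.attribute in same_row_headword_attrs:   # condition 2
            candidates.append(entry.attribute)
    return candidates


# Hypothetical knowledge rows; flags other than the first are assumptions.
KNOWLEDGE = [
    KnowledgeEntry("event information.organizer", "organization name",
                   frozenset({"Organizer"}), False),
    KnowledgeEntry("organization information.name", "organization name",
                   frozenset(), True),
    KnowledgeEntry("event information.date", "date/time", frozenset(), True),
    KnowledgeEntry("lecture information.lecture time", "date/time",
                   frozenset(), True),
]
```

Here `same_row_headword_attrs` stands for the candidate attributes of the headword entities found on the same row as N_i, which is the information condition 2 needs.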

(Step S1070)
If no attribute information exists, go to step S1120 to process the next named entity. Otherwise, go to step S1080.

(Step S1080)
Let P_{i,1}, P_{i,2}, ..., P_{i,n} be the candidate attributes determined in step S1050 or step S1060. For example, in FIG. 8, P_{1,1} is "event information.event name", P_{2,1} is "event information.date", and P_{2,2} is "lecture information.lecture time". The counter j, which indicates the candidate attribute being processed, is initialized to 1.

(Step S1090)
If j is less than or equal to n, go to step S1100. Otherwise, go to step S1120 (all candidate attributes have been processed).

(Step S1100)
The candidate attribute P_{i,j} is taken as the processing target. The attributes that refer to the template to which P_{i,j} belongs are found using the "reference to another template" item. Whether each such referring attribute is added to the candidate attributes is decided by the following conditions.
[Condition 1] If the value of the referring attribute's "associate this attribute even without a headword" item is "Yes", the attribute is added to the candidate attributes. Otherwise, condition 2 is checked.
[Condition 2] If the value of the referring attribute's "associate this attribute even without a headword" item is "No", and the same row as N_i contains a named entity whose "named-entity type" item is "headword" and whose candidate attribute matches the referring attribute, the attribute is added. Otherwise, it is not added.

For example, the candidate attribute of N_12 "Taro Yamada" is "person information.name". The attributes that refer to the person information template are "lecture information.speaker" and "organization information.president name". The "associate this attribute even without a headword" item of "lecture information.speaker" is "Yes", so this attribute is added to the candidate attributes. In contrast, the same item of "organization information.president name" is "No", and the row of N_12 (row position = 9) contains no named entity whose "named-entity type" item is "headword", so "organization information.president name" is not made a candidate attribute.

Each added candidate attribute is in turn checked, recursively, for references from other templates, and step S1100 is executed for it. For example, for P_{1,1} and P_{2,1}, the event information template is not referenced from any other template, so there is no candidate attribute to add. For P_{2,2}, the lecture information template is referenced from "event information.lecture", so "event information.lecture.lecture time" is added to the candidate attributes. For the added "event information.lecture.lecture time", the event information template is not referenced from any other template, so no further candidate attributes are added. FIG. 8 shows the candidate attributes added in step S1100.
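The recursive expansion in step S1100 can be sketched as below. The function names and the dictionary layout are assumptions made for illustration, and the sketch assumes the reference graph between templates has no cycles. The "Yes" flag on "event information.lecture" is also an assumption, chosen so that the result agrees with the example in the text.

```python
from typing import Callable, Dict, List, Set, Tuple

# template name -> attributes (template, attribute name) that refer to it
Referrers = Dict[str, List[Tuple[str, str]]]


def make_may_add(no_headword_ok: Dict[str, bool],
                 same_row_headword_attrs: Set[str]) -> Callable[[str], bool]:
    """Build the condition 1 / condition 2 check for referring attributes."""
    def may_add(referrer: str) -> bool:
        if no_headword_ok.get(referrer, False):      # condition 1
            return True
        return referrer in same_row_headword_attrs   # condition 2
    return may_add


def expand_candidates(candidate: str, referrers: Referrers,
                      may_add: Callable[[str], bool]) -> List[str]:
    """Return the candidate attributes reached through referring templates."""
    template, rest = candidate.split(".", 1)
    added: List[str] = []
    for ref_template, ref_attr in referrers.get(template, []):
        referrer = f"{ref_template}.{ref_attr}"
        if not may_add(referrer):
            continue
        new_candidate = f"{referrer}.{rest}"
        added.append(new_candidate)
        # Recurse: the referring template may itself be referenced.
        added.extend(expand_candidates(new_candidate, referrers, may_add))
    return added


# Hypothetical reference table and flags modeled on the example in the text.
REFERRERS: Referrers = {
    "person information": [("lecture information", "speaker"),
                           ("organization information", "president name")],
    "lecture information": [("event information", "lecture")],
}
FLAGS = {
    "lecture information.speaker": True,
    "organization information.president name": False,
    "event information.lecture": True,  # assumption, not stated in the text
}
```

With no headword on the row, `expand_candidates("person information.name", REFERRERS, make_may_add(FLAGS, set()))` adds the speaker-based attributes and skips the president-name one, as in the N_12 example.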

(Step S1110)
To process the next candidate attribute, 1 is added to j. The process returns to step S1090.

(Step S1120)
To process the next named entity, 1 is added to i. The process returns to step S1010.

The processing of the candidate attribute determination unit 123 has been described above.
The description now returns to step S140 and the subsequent steps in FIG. 3.

(Step S140)
The attribute determination unit 124 determines, from the candidate attributes determined in step S130, the attribute corresponding to each named entity. To do so, each region in which named entities appear on consecutive rows is defined as a block, and the template to associate is determined block by block. For example, in FIG. 4 the run of named entities is interrupted at rows 6 and 10, so rows 1 to 5, rows 7 to 9, and rows 11 to 12 each form a block. In rows 1 to 5, the named entities N_1 to N_8 in FIG. 8 belong to one block.

To determine which template's attributes the named entities in this range are associated with, one point is assigned to a template part each time it appears in a candidate attribute. Attributes in high-scoring templates are then preferentially associated with the named entities. For example, in the block from N_1 to N_8, N_1 contributes one point to "event information". N_2 contributes one point each to "event information", "event information.lecture", and "lecture information". Similarly, N_3 contributes one point each to "place information", "event information.venue", "person information.residence", "lecture information.speaker.residence", "event information.lecture.speaker.residence", and "organization information.location".

In total, for the block from N_1 to N_8, "event information" has 4 points; "event information.venue", "event information.lecture.speaker.residence", "place information", "person information.residence", "lecture information.speaker.residence", and "organization information.location" have 3 points each; and "event information.lecture", "event information.lecture information.speaker.status.affiliated organization", "lecture information", "organization information", "status information.affiliated organization", "person information.status.affiliated organization", and "lecture information.speaker.status.affiliated organization" have 1 point each.

Next, the attribute corresponding to each of N_1 to N_8 is determined from its candidate attributes. Because the event information template has the highest score, candidate attributes that include the event information template are preferred. In addition, when the attribute for N_i is determined, attributes in the template that contains the attribute assigned to N_{i-1} are given priority. The result is:
N_1: event information.event name
N_2: event information.date
N_3: event information.venue.building name
N_4: event information.venue.postal code
N_5: event information.venue.address
N_6: event information.organizer
N_7: event information.organizer
N_8: no candidate

FIG. 9 is a flowchart showing the processing of the block dividing unit 125 of the attribute determination unit 124, which carries out this processing. Let the named entities obtained in step S110 be N_1, N_2, ..., N_m from the top. The variable start_block denotes the first named entity of the current block, and the counter i indicates the named entity being processed.

(Step S1200)
i is initialized to 1 and start_block to 0.

(Step S1210)
If i is less than or equal to m, go to step S1220. Otherwise, go to step S1250 (block calculation has finished for all named entities).

(Step S1220)
The named entity N_i is taken as the processing target. If start_block is 0, start_block is set to i.

(Step S1230)
If i is 1 or more and (row position of N_i) - (row position of N_{i-1}) is 2 or more, this is regarded as the end of a block, and the process goes to step S1260. Otherwise, go to step S1240.

(Step S1240)
To process the next named entity, 1 is added to i. The process returns to step S1210.

(Step S1250)
The process of determining the template weights within the block, shown in step S1300 and the subsequent steps, is performed, and the process ends.

(Step S1260)
The process of determining the template weights within the block, shown in step S1300 and the subsequent steps, is performed.

(Step S1270)
start_block is initialized to 0, and the process returns to step S1240.
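The effect of the block division in FIG. 9 can be sketched compactly. This sketch follows the worked example (a gap of two or more rows closes the block before the current named entity) rather than reproducing the flowchart's index bookkeeping step by step; the function name and the sample row positions are assumptions.

```python
from typing import List, Sequence


def split_into_blocks(rows: Sequence[int]) -> List[List[int]]:
    """Group entity indices into blocks; a row gap of 2 or more starts a new block.

    `rows` holds the row position of each named entity, in document order.
    Indices in the result are 0-based positions into `rows`.
    """
    blocks: List[List[int]] = []
    current: List[int] = []
    for index, row in enumerate(rows):
        # Compare with the previous entity's row, as in step S1230.
        if current and row - rows[current[-1]] >= 2:
            blocks.append(current)
            current = []
        current.append(index)
    if current:  # the last block is closed when the entities run out (step S1250)
        blocks.append(current)
    return blocks


# Hypothetical row positions with gaps at rows 6 and 10.
EXAMPLE_ROWS = [1, 2, 3, 4, 5, 7, 8, 9, 11, 12]
```

For `EXAMPLE_ROWS`, the three blocks cover rows 1 to 5, rows 7 to 9, and rows 11 to 12, matching the division described for FIG. 4. Named entities on the same row always fall in the same block.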

FIG. 10 is a flowchart showing how the template ranking unit 126 determines the template weights within a block. The variable end_block denotes the last named entity of the block, and the counter i2 indicates the named entity being processed.

(Step S1300)
end_block is set to i, and i2 is set to start_block.

(Step S1310)
If i2 is less than or equal to end_block, go to step S1320. Otherwise, go to step S1370.

(Step S1320)
Let P_{i2,1}, P_{i2,2}, ..., P_{i2,n} be the candidate attributes of N_{i2}. The counter j2, which indicates the candidate attribute being processed, is initialized to 1. The block information temporary storage unit is initialized. The block information temporary storage unit consists of the template part items of the block's candidate attributes and their scores.

(Step S1330)
If j2 is less than or equal to n, go to step S1340. Otherwise, go to step S1360 (all candidate attributes have been processed).

(Step S1340)
The candidate attribute P_{i2,j2} is taken as the processing target. If the template part of P_{i2,j2} already exists in the template information temporary storage unit, 1 is added to the occurrence count of that template part. The value added need not be 1; it may instead be 1/n. This example uses 1. If the template part does not exist, it is added to the template part items and its occurrence count is set to 1 (or 1/n). Go to step S1350.

(Step S1350)
To process the next candidate attribute, 1 is added to j2. The process returns to step S1330.

(Step S1360)
To process the next named entity, 1 is added to i2. The process returns to step S1310.

(Step S1370)
The attribute determination process within the block, shown in step S1400 and the subsequent steps, is performed, and the process returns to the attribute determination processing of step S1250 or step S1260.
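The scoring loop of steps S1300 to S1360 amounts to counting template parts over all candidate attributes in a block. The sketch below assumes that a candidate attribute is written as a dotted path and that its template part is the path without the final attribute name; the function names are introduced here for illustration, and the score table is created once per block so that the totals accumulate as in the worked example.

```python
from collections import Counter
from typing import Dict, Sequence


def template_part(candidate: str) -> str:
    """Drop the final attribute name: 'a.b.c' -> 'a.b'."""
    return candidate.rsplit(".", 1)[0]


def score_template_parts(block_candidates: Sequence[Sequence[str]],
                         normalize: bool = False) -> Dict[str, float]:
    """Add 1 (or 1/n when normalize=True) per occurrence of each template part.

    block_candidates[k] lists the candidate attributes of the k-th named
    entity in the block; n is the number of candidates of that entity.
    """
    scores: Counter = Counter()
    for candidates in block_candidates:
        n = len(candidates)
        for candidate in candidates:
            scores[template_part(candidate)] += 1.0 / n if normalize else 1.0
    return dict(scores)


# Candidate attributes of N_1 and N_2 as listed in the description.
N1_N2 = [
    ["event information.event name"],
    ["event information.date",
     "lecture information.lecture time",
     "event information.lecture.lecture time"],
]
```

For `N1_N2`, "event information" receives one point from N_1 and one from N_2, while "lecture information" and "event information.lecture" receive one point each, in line with the per-entity tallies given earlier for this block.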

FIG. 11 is a flowchart showing the attribute determination process within a block. The counter i3 indicates the named entity being processed, and the counter j3 indicates the candidate attribute being processed.

(Step S1400)
i3 is set to start_block.

(Step S1410)
If i3 is less than or equal to end_block, go to step S1420. Otherwise, the process ends and returns to step S1370.

(Step S1420)
The counter j3, which indicates the candidate attribute being checked, is initialized to 1.

(Step S1430)
If j3 is less than or equal to n, go to step S1440. Otherwise, all candidate attributes have been processed, so go to step S1510.

(Step S1440)
Let T_1, T_2, ..., T_t be the template parts in the block information temporary storage unit. The counter k, which indicates the template part being processed, is initialized to 1.

(Step S1450)
If k is less than or equal to t, go to step S1460. Otherwise, go to step S1490 (all template parts have been processed).

(Step S1460)
It is checked whether T_k matches the template part of P_{i3,j3}. If it matches, go to step S1470. Otherwise, go to step S1480 to process the next template part.

(Step S1470)
The score of P_{i3,j3} is calculated by the following formula, where w_0 is a weight.
[score of P_{i3,j3}] = [score of T_k] + [score calculated in step S1490] × w_0
To obtain [score calculated in step S1490], go to step S1490.

(Step S1480)
To process the next template part, 1 is added to k. The process returns to step S1450.

(Step S1490)
Candidate attributes whose template part is the same as that of the attribute assigned to N_{i3-1}, the named entity immediately before N_{i3}, are favored. If the template part of P_{i3,j3} is exactly equal to the template part of the attribute assigned to N_{i3-1}, [score calculated in step S1490] is set to 1. If the template parts match only as a prefix, then with p the depth of the matching part, q the depth of the template part of P_{i3,j3}, and r the depth of the template part of the attribute assigned to N_{i3-1}, the score is (0.8)^{max(q,r)-p}. For example, suppose P_{3,2} is "event information.venue.building name", P_{3,6} is "event information.lecture.speaker.residence.building name", and the attribute assigned to N_2 is "event information". The score added to P_{3,2} is (0.8)^{2-1}, because the part shared by the template part of P_{3,2} and that of the attribute assigned to N_2 is "event information", with depth 1, and the template part of P_{3,2} has depth 2. The score added to P_{3,6} is (0.8)^{4-1}, because the shared part is again "event information", with depth 1, and the template part of P_{3,6} has depth 4. Go to step S1500. However, if i3-1 is 0, [score calculated in step S1490] is set to 0.

(Step S1500)
To process the next candidate attribute, 1 is added to j3. The process returns to step S1430.

(Step S1510)
Among P_{i3,1}, P_{i3,2}, ..., P_{i3,n}, the candidate attribute with the highest score is determined to be the attribute of N_{i3}. Go to step S1520.
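Steps S1470, S1490, and S1510 can be sketched together as follows. The function names are introduced here for illustration. Two points are assumptions not settled by the text: the bonus is taken to be 0 when the two template parts share no leading component, and ties are broken in favor of the candidate listed first.

```python
from typing import Dict, Optional, Sequence

DECAY = 0.8  # base of the depth penalty in step S1490


def template_part(attribute: str) -> str:
    """Drop the final attribute name: 'a.b.c' -> 'a.b'."""
    return attribute.rsplit(".", 1)[0]


def adjacency_bonus(candidate_part: str, previous_part: Optional[str]) -> float:
    """Step S1490: favor template parts close to the previous entity's."""
    if previous_part is None:  # i3 - 1 == 0: there is no previous entity
        return 0.0
    cand = candidate_part.split(".")
    prev = previous_part.split(".")
    p = 0  # depth of the shared leading part
    for a, b in zip(cand, prev):
        if a != b:
            break
        p += 1
    if p == 0:
        return 0.0  # assumption: no shared prefix earns no bonus
    # Identical parts give an exponent of 0 and therefore a bonus of 1.
    return DECAY ** (max(len(cand), len(prev)) - p)


def choose_attribute(candidates: Sequence[str],
                     template_scores: Dict[str, float],
                     previous_attribute: Optional[str],
                     w0: float = 1.0) -> Optional[str]:
    """Steps S1470 and S1510: score each candidate and keep the best one."""
    if not candidates:
        return None
    previous_part = template_part(previous_attribute) if previous_attribute else None

    def score(candidate: str) -> float:
        part = template_part(candidate)
        return template_scores.get(part, 0.0) + adjacency_bonus(part, previous_part) * w0

    return max(candidates, key=score)  # the first maximal candidate wins ties
```

Using the block totals from the description (3 points each for "event information.venue", "place information", and "event information.lecture.speaker.residence") and "event information.date" as the attribute of N_2, the candidate "event information.venue.building name" scores 3 + 0.8 and is chosen for N_3.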

(Step S1520)
To process the next named entity, 1 is added to i3. The process returns to step S1410. As stated earlier, for the block from N_1 to N_8, with w_0 = 1, the result is:
N_1: event information.event name (4 points)
N_2: event information.date (4 points + 1.0 × 1 point)
N_3: event information.venue.building name (3 points + 0.8 × 1 point)
N_4: event information.venue.postal code (3 points + 0.8 × 1 point)
N_5: event information.venue.address (3 points + 0.8 × 1 point)
N_6: event information.organizer (4 points + 0.8 × 1 point)
N_7: event information.organizer (4 points + 1.0 × 1 point)
N_8: no candidate

For the block from N_9 to N_13, "event information.lecture" has 3 points, "lecture information" has 2 points, and "event information", "person information", "lecture information.speaker", "event information.lecture.speaker", "organization information", "status information.affiliated organization", "person information.status.affiliated organization", "lecture information.speaker.status.affiliated organization", and "event information.lecture information.speaker.status.affiliated organization" have 1 point each. The candidate attributes that include the top-ranked "event information.lecture" are then associated as follows.
N_9: event information.lecture (3 points)
N_10: event information.lecture.lecture time (3 points + 1.0 × 1 point)
N_11: event information.lecture.title (3 points + 1.0 × 1 point)
N_12: event information.lecture.speaker.name (1 point + 0.8 × 1 point)
N_13: event information.lecture.speaker.status.affiliated organization.name (1 point + 0.64 × 1 point)
The block from N_14 to N_17 is handled in much the same way.

(Effects of the First Embodiment)
As described above, in this embodiment the knowledge for associating named entities with extraction attributes is expressed as a network structure in which the feature information of an extraction attribute may reside in another template. By preferentially associating information that appears close together in a document with extraction attributes in the same template, attribute extraction that can reuse existing templates becomes possible.

(Second Embodiment)
In the first embodiment, when a template refers to another template, every attribute in the referenced template becomes a candidate attribute. For example, when an address appears in a document, the address of a speaker at an event ("event.lecture.speaker.address") also becomes a candidate attribute. In general, however, a document announcing the lectures of an event does not give the speakers' addresses. Making such attributes candidates increases the number of candidate attributes, which increases the amount of computation and lowers extraction accuracy. The second embodiment therefore makes it possible to restrict the candidate attributes when a template is referenced from a particular template.

The configuration of the information extraction apparatus according to the second embodiment is substantially the same as that of the first embodiment. However, an item is added to the extraction knowledge storage unit 11 that keeps an attribute out of the candidate attributes depending on the referring template. FIG. 12 shows an example of the extraction knowledge storage unit 11. Its rightmost item is the item for not adding an attribute to the candidate attributes when the template is referenced. The value of this item is an attribute name of another template. For example, for "person information.residence" in FIG. 12, this item is set to "lecture information.speaker". This means that, among the attributes through which "person information.residence" is referenced from other templates (for example, "lecture information.speaker.residence", "organization information.president.residence", and "event.lecture.speaker.residence"), those whose template part contains "lecture information.speaker" (for example, "lecture information.speaker.residence" and "event.lecture.speaker.residence") are not made candidate attributes.

The operation of the information extraction apparatus according to this embodiment differs from the first embodiment only in the candidate attribute determination unit 123; all other processing is the same as in the first embodiment.

The following condition 0 is added to step S1100 of the first embodiment in FIG. 7. The resulting step, which replaces step S1100 in FIG. 7, is step S2000.

(Step S2000)
The candidate attribute P_{i,j} is taken as the processing target. Other templates that refer to the template name of P_{i,j} are found using the "reference to another template" item. Whether each referring attribute is added to the candidate attributes is decided by the following conditions.
[Condition 0] If the "not a candidate attribute when referenced from another template" item has a value and that value matches the referring attribute, the attribute is not added to the candidate attributes. Otherwise, conditions 1 and 2 are checked.
[Condition 1] If the value of the referring attribute's "associate this attribute even without a headword" item is "Yes", the attribute is added to the candidate attributes. Otherwise, condition 2 is checked.
[Condition 2] If the value of the referring attribute's "associate this attribute even without a headword" item is "No", and the same row as N_i contains a named entity whose "named-entity type" item is "headword" and whose candidate attribute matches the referring attribute, the attribute is added. Otherwise, it is not added.

The "reference to another template" item is also followed recursively for each added candidate attribute. When condition 0 is checked at that point, the values of the "not a candidate attribute when referenced from another template" item encountered earlier are carried over and checked as well.
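Condition 0 and the carrying over of exclusion values can be sketched as below. This is one possible reading of step S2000, not a definitive implementation: the function name, the dictionary layout, and the sample tables are assumptions, conditions 1 and 2 are omitted to keep the focus on condition 0, and the reference graph is assumed to have no cycles.

```python
from typing import Dict, FrozenSet, List, Tuple

# template name -> attributes (template, attribute name) that refer to it
Referrers = Dict[str, List[Tuple[str, str]]]
# attribute -> values of its "not a candidate attribute when referenced" item
Exclusions = Dict[str, FrozenSet[str]]


def expand_with_exclusions(candidate: str, referrers: Referrers,
                           exclusions: Exclusions,
                           inherited: FrozenSet[str] = frozenset()) -> List[str]:
    """Expand references as in step S2000, applying condition 0 only."""
    template, rest = candidate.split(".", 1)
    added: List[str] = []
    for ref_template, ref_attr in referrers.get(template, []):
        referrer = f"{ref_template}.{ref_attr}"
        # Exclusion values seen earlier are carried over into deeper levels.
        active = inherited | exclusions.get(referrer, frozenset())
        if referrer in active:  # condition 0: do not add this attribute
            continue
        new_candidate = f"{referrer}.{rest}"
        added.append(new_candidate)
        added.extend(expand_with_exclusions(new_candidate, referrers,
                                            exclusions, active))
    return added


# Hypothetical tables modeled on the example in the description.
REFERRERS: Referrers = {
    "place information": [("person information", "residence"),
                          ("organization information", "location"),
                          ("event information", "venue")],
    "person information": [("lecture information", "speaker")],
    "lecture information": [("event information", "lecture")],
}
EXCLUSIONS: Exclusions = {
    "person information.residence": frozenset({"lecture information.speaker"}),
    "organization information.location": frozenset({"lecture information.speaker"}),
}
```

With `EXCLUSIONS` in place, expanding "place information.building name" stops at "person information.residence.building name" on the person branch; without it, the speaker-based attributes up to "event information.lecture.speaker.residence.building name" would also be added.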

例えば，Ｎ_５「○○市××町１−１」の候補属性は，「場所情報．建物名」である。場所情報テンプレートを参照している属性として，「人物情報．居住地」，「組織情報．所在地」，「イベント情報．開催地」が存在している。このうち，「人物情報．居住地」と「組織情報．所在地」の「他のテンプレートから参照されたときに，候補属性にしないもの」の項目は「講演情報．講演者」である。「人物情報．居住地」，「組織情報．所在地」，「イベント情報．開催地」は，すべて「講演情報．講演者」を含まないから，条件０は満たす。 For example, the candidate attribute of N ₅ “XX city XX town 1-1” is “location information. Building name”. As attributes referring to the place information template, there are “person information. Place of residence”, “organization information. Location”, and “event information. Place”. Among these items, “Personal information. Resident” and “Organization information. Since “person information. Place of residence”, “organization information. Location”, and “event information. Venue” do not include “lecture information. Lecturer”, condition 0 is satisfied.

しかし，これらを参照しているテンプレートを考慮すると，「人物情報．居住地」について「講演情報．講演．講演者．居住地」は，「他のテンプレートから参照されたときに，候補属性にしないもの」の項目に一致する。したがって，「講演情報．講演。講演者．居住地」は，条件０を満たさないため，候補属性にしない。 However, considering the templates that refer to them, “Lecture Information. Lecture. Speaker. Residential” for “Personal Information. Residential” is not considered as a candidate attribute when referenced from other templates. Matches the “thing” item. Therefore, “lecture information. Lecture. Lecturer. Residence” is not set as a candidate attribute because condition 0 is not satisfied.

同様に，「組織情報．所在地」についても，これを参照しているテンプレートとして，「講演情報．講演．講演者．身分．所属組織．所在地」は，条件０を満たさないため，候補属性にしない。 Similarly, for “organization information. Address”, as a template referring to this, “lecture information. Lecture. Lecturer. Identity. Affiliation organization. Address” is not a candidate attribute because condition 0 is not satisfied. .

FIG. 13 shows the resulting candidate attributes. Compared with FIG. 8, the candidate attributes of N_3 to N_5 no longer include those beginning with "event.lecture.lecturer.residence" or "lecture information.lecturer.residence".
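The recursive expansion with inherited exclusions described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the embodiment: the `Attr` class, the `expand` function, the reduced attribute table, and the "position information" template are all hypothetical names introduced here, and condition 0 is simplified to a membership test against the exclusion values carried over along the reference chain.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Attr:
    template: str                      # template that owns the attribute
    name: str                          # attribute name inside that template
    refers_to: Optional[str] = None    # template this attribute refers to
    excluded: frozenset = frozenset()  # "not a candidate when referenced from" item

    @property
    def key(self) -> str:
        return f"{self.template}.{self.name}"


NO_LECTURER = frozenset({"lecture information.lecturer"})

# Hypothetical extraction knowledge; "position information" is an assumed template.
ATTRS = [
    Attr("location information", "building name"),
    Attr("person information", "residence", "location information", NO_LECTURER),
    Attr("person information", "position", "position information"),
    Attr("position information", "affiliated organization", "organization information"),
    Attr("organization information", "location", "location information", NO_LECTURER),
    Attr("event information", "venue", "location information"),
    Attr("lecture information", "lecturer", "person information"),
]


def expand(attr, attrs, suffix=(), inherited=frozenset(), seen=frozenset()):
    """Return the candidate path for `attr` plus paths of attributes that reach it."""
    names = (attr.name,) + suffix
    paths = [".".join((attr.template,) + names)]
    inherited = inherited | attr.excluded  # carry earlier exclusion values over
    for ref in attrs:
        if ref.refers_to != attr.template or ref.key in seen:
            continue                       # not a referrer, or already on this chain
        if ref.key in inherited:
            continue                       # condition 0 fails: do not add
        paths += expand(ref, attrs, names, inherited, seen | {attr.key})
    return paths


base = ATTRS[0]  # "location information.building name", the candidate of N_5
candidates = expand(base, ATTRS)
for path in candidates:
    print(path)
```

In this sketch, no path that passes through "lecture information.lecturer" is produced, because the exclusion value attached to "person information.residence" and "organization information.location" is inherited by every attribute further up the chain.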

(Effects of the second embodiment)
As described above, in this embodiment, attributes that generally do not appear in documents are not made candidate attributes. This prevents the number of candidate attributes from growing, which would otherwise increase the amount of computation and lower the extraction accuracy.

Preferred embodiments of the information extraction apparatus according to the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to these examples. It is obvious that those skilled in the art can conceive of various changes or modifications within the scope of the technical idea described in the claims, and these are naturally understood to belong to the technical scope of the present invention.

For example, the following applications are possible.

(1) In the first embodiment, the attributes corresponding to each named entity are narrowed down to one, but in some cases it is better to list second and third candidates as well. By specifying a threshold for the score of the candidate attributes to be output, a plurality of candidates can be presented to the user.

(2) In the first embodiment, a text document is used as the example input to the document input unit 10, but the approach can be extended to HTML documents. To do so, the character string element division processing extracts, in addition to the line positions, the range that is displayed as a single line when the HTML document is actually rendered in a browser. For example, the range enclosed by the tags <tr> and </tr>, which represent one row of a table, is displayed as a single line even if it runs on, as long as it contains no tag such as <br> or <p>. This can be analyzed by using the HTML tag analysis function of a conventional browser.

(3) The above embodiments describe the processing up to the point where an attribute is associated with each named entity. After this processing, for example in the first embodiment, processing is performed to detect that N_10 to N_13 and N_14 to N_17 each represent a separate lecture. If, during that processing, the system determines that the association of attributes with named entities proposed here contains an error and that the subsequent processing cannot produce a correct result, the processing that associates attributes with named entities may be performed again.

(4) When a character string element is a sentence, erroneous extraction is likely unless processing such as syntactic analysis is performed, so parts that form sentences are not processed by the candidate attribute determination unit. However, the approach can also be combined with a method that determines candidate attributes by performing syntactic analysis or the like.

(5) In the above embodiments, the templates are in tabular form, but they may instead be described in a triple format using an ontology description language such as RDFS (Resource Description Framework Schema).

(6) In step S1470, points are added only when the template part matches completely, but points may also be added when only part of the template part matches.

(7) In step S140, the document is divided into blocks at each range in which named entities appear consecutively, but it may instead be divided into blocks at each blank line.
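The two block division policies mentioned in (7) can be contrasted with a short sketch. It is an illustration only: the function names, the sample lines, and the set `entity_lines` (indices of lines assumed to contain at least one named entity) are hypothetical and do not come from the embodiments.

```python
def split_by_entity_runs(lines, entity_lines):
    """Group consecutive line indices that each contain a named entity."""
    blocks, current = [], []
    for i in range(len(lines)):
        if i in entity_lines:
            current.append(i)
        elif current:          # a line without a named entity closes the block
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def split_by_blank_lines(lines):
    """Group consecutive non-blank line indices; a blank line closes the block."""
    blocks, current = [], []
    for i, line in enumerate(lines):
        if line.strip():
            current.append(i)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


# Hypothetical input document and named entity positions.
lines = ["Lecture 1", "Taro Yamada", "(note)", "Tokyo", "", "Lecture 2", "Hanako Sato"]
entity_lines = {1, 3, 6}

runs = split_by_entity_runs(lines, entity_lines)
paragraphs = split_by_blank_lines(lines)
print(runs)
print(paragraphs)
```

With this input, dividing by blank lines keeps the named entities on lines 1 and 3 in the same block, whereas dividing by consecutive named entity lines separates them because the line between them contains no named entity.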

The present invention is applicable to an information extraction apparatus, an information extraction method, and a program that extract information required by a user from a document and associate information in the document with a template indicating the attributes to be extracted.

FIG. 1 is a block diagram showing the schematic configuration of the information extraction apparatus according to the first embodiment.
FIG. 2 is an explanatory diagram showing an example of the extraction knowledge storage unit 11.
FIG. 3 is a flowchart showing the operation of the information extraction apparatus according to the first embodiment.
FIG. 4 is an explanatory diagram showing an example of an input document input to the document input unit 10.
FIG. 5 is an explanatory diagram showing an output example of the named entity extraction unit 121.
FIG. 6 is an explanatory diagram showing an output example of the character string element dividing unit 122.
FIG. 7 is a flowchart showing the operation of the candidate attribute determination unit 123.
FIG. 8 is an explanatory diagram showing an example of candidate attributes.
FIG. 9 is a flowchart showing the operation of the block dividing unit 125.
FIG. 10 is a flowchart showing the operation of the template ranking unit 126.
FIG. 11 is a flowchart showing the operation of the attribute calculation unit 127.
FIG. 12 is an explanatory diagram showing an example of the extraction knowledge storage unit 11.
FIG. 13 is an explanatory diagram showing an example of candidate attributes.

Explanation of symbols

1 Information extraction apparatus
10 Document input unit
11 Extraction knowledge storage unit
12 Extraction attribute storage unit
13 Output unit
121 Named entity extraction unit
122 Character string element dividing unit
123 Attribute information determination unit
124 Attribute determination unit
125 Block dividing unit
126 Template ranking unit
127 Attribute calculation unit

Claims

An information extraction apparatus that associates information appearing in an input document with attributes, comprising:
a document input unit that accepts a document as input;
an extraction knowledge storage unit that stores templates, each composed of a set of one or more attributes to be extracted, and stores each attribute in association with extraction knowledge for that attribute;
an extraction attribute determination unit that refers to the extraction knowledge storage unit and associates information extracted from the document with the attributes; and
an output unit that outputs the attributes and the values of the attributes,
wherein an attribute stored in the extraction knowledge storage unit refers to another related template, and the extraction knowledge of the attribute stored in the extraction knowledge storage unit includes the type of the attribute, and
wherein the extraction attribute determination unit includes:
a named entity extraction unit that extracts named entities from the document and adds a named entity type to each extracted named entity;
a candidate attribute determination unit that refers to the extraction knowledge storage unit, determines, as a candidate attribute for a named entity, the attribute associated with the attribute type that matches the named entity type added by the named entity extraction unit, and also determines, as a candidate attribute for the named entity, an attribute that refers to a template containing that attribute; and
an attribute determination unit that determines, for the named entities extracted from a predetermined document range, the template to which the largest number of the determined candidate attributes belong, and determines the candidate attributes belonging to that template as the attributes of the named entities extracted from the document range.

The information extraction apparatus according to claim 1, wherein the predetermined document range is a range delimited by blank lines in the document.

The information extraction apparatus according to claim 1, wherein the predetermined document range is a range in which lines containing a named entity are consecutive.

The information extraction apparatus according to claim 1, wherein the extraction knowledge storage unit stores a condition that, when a template containing the attribute to be extracted is referred to by other attributes, is satisfied by those of the other attributes that are not to be added to the candidate attributes.

The information extraction apparatus according to claim 1, wherein the extraction attribute determination unit includes a character string element dividing unit that extracts, as character string elements, the ranges obtained by delimiting character strings in the document at line feeds, spaces, or punctuation marks, and
when a character string element is a range delimited by a punctuation mark, the candidate attribute determination unit does not perform the processing that determines candidate attributes for the named entities extracted from that character string element.

An information extraction method for associating information appearing in an input document with attributes, performed by an information extraction apparatus comprising: an extraction knowledge storage unit that stores templates, each composed of a set of one or more attributes to be extracted, and stores each attribute in association with extraction knowledge for that attribute; a document input unit; an extraction attribute determination unit; and an output unit,
the extraction attribute determination unit including a named entity extraction unit, a candidate attribute determination unit, and an attribute determination unit,
the method comprising:
a document input step in which the document input unit accepts a document as input;
an extraction attribute determination step in which the extraction attribute determination unit refers to the extraction knowledge storage unit and associates information extracted from the document with the attributes; and
an output step in which the output unit outputs the attributes and the values of the attributes,
wherein an attribute stored in the extraction knowledge storage unit refers to another related template, and the extraction knowledge of the attribute stored in the extraction knowledge storage unit includes the type of the attribute, and
wherein the extraction attribute determination step includes:
a named entity extraction step in which the named entity extraction unit extracts named entities from the document and adds a named entity type to each extracted named entity;
a candidate attribute determination step in which the candidate attribute determination unit refers to the extraction knowledge storage unit, determines, as a candidate attribute for a named entity, the attribute associated with the attribute type that matches the named entity type added in the named entity extraction step, and also determines, as a candidate attribute for the named entity, an attribute that refers to a template containing that attribute; and
an attribute determination step in which the attribute determination unit determines, for the named entities extracted from a predetermined document range, the template to which the largest number of the determined candidate attributes belong, and determines the candidate attributes belonging to that template as the attributes of the named entities extracted from the document range.

The information extraction method according to claim 6, wherein the predetermined document range is a range delimited by blank lines in the document.

The information extraction method according to claim 6, wherein the predetermined document range is a range in which lines containing a named entity are consecutive.

The information extraction method according to any one of claims 6 to 8, wherein the extraction attribute determination step includes a character string element dividing step in which a character string element dividing unit extracts, as character string elements, the ranges obtained by delimiting character strings in the document at line feeds, spaces, or punctuation marks, and
when a character string element is a range delimited by a punctuation mark, the candidate attribute determination step does not perform its processing for the named entities extracted from that character string element.

A program for causing a computer to function as the information extraction apparatus according to any one of claims 1 to 5.