JP2009237640A

JP2009237640A - Information extraction device, information extraction method, and information extraction program

Info

Publication number: JP2009237640A
Application number: JP2008079460A
Authority: JP
Inventors: Maki Murata; 真樹村田
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2008-03-26
Filing date: 2008-03-26
Publication date: 2009-10-15

Abstract

PROBLEM TO BE SOLVED: To provide a new technique for extracting useful information from a text document group without being affected by noise information.

SOLUTION: Item expressions, types of extracted expressions, and unit expressions are extracted from a text document group as primary expressions. Locations in the text document group where the primary expressions appear together are identified. Sets consisting of the item expressions found there, the extracted expressions belonging to the types of extracted expressions, and the numerical expressions related to the unit expressions are extracted as information sets. Then, for each group of extracted information sets sorted by item expression, the information sets that should not be treated as belonging to the same group as the others are identified according to statistical information on the numerical expressions or the degree of semantic difference between the extracted expressions. Either all the identified information sets are deleted from the extracted information sets, or only those identified information sets designated for deletion are deleted.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to an information extraction apparatus and method for extracting useful information from a text document group stored in a storage device, and to an information extraction program used to implement the apparatus. In particular, it relates to an information extraction apparatus, method, and program that make it possible to extract useful information without being affected by noise information.

The present inventors have recently disclosed a series of inventions that automatically extract useful information from an enormous collection of articles stored in a database (see Patent Documents 1 and 2).

In the invention described in Patent Document 1, main item expressions and main unit expressions are extracted from an enormous collection of articles, the locations in the text document group where they co-occur are identified, pairs consisting of a main item expression found at an identified location and a numerical expression related to a main unit expression are extracted as information pairs, and the information pairs are displayed as a graph. In the invention described in Patent Document 2, main item expressions, main unit expressions, and main time expressions are extracted from the collection of articles, the locations where they co-occur are identified, sets consisting of a main item expression, a numerical expression related to a main unit expression, and a main time expression found at an identified location are extracted as information pairs, and the information pairs are displayed as a graph.

These inventions make it possible to automatically extract useful information from an enormous collection of articles.
JP 2008-021052 A
JP 2007-199902 A

To automatically extract even more useful information from an enormous collection of articles, the present inventors also filed Japanese Patent Application No. 2007-130218. In that invention, main item expressions, types of main named entities, and main unit expressions are extracted from the collection of articles, the locations in the text document group where they co-occur are identified, sets consisting of a main item expression, a main named entity, and a numerical expression related to a main unit expression found at an identified location are extracted as information pairs, and the information pairs are displayed as a graph.

The series of inventions disclosed by the present inventors does indeed make it possible to automatically extract useful information from an enormous collection of articles.

However, an enormous collection of articles can contain unanticipated information, so the information pairs extracted by this series of inventions may include pairs that do not properly belong among them.

In this respect, the series of inventions disclosed by the present inventors leaves room for improvement.

The present invention has been made in view of these circumstances. Its object is to provide a new information extraction technique that, when useful information is extracted from a text document group stored in a storage device, allows that information to be extracted without being affected by noise information.

[1] First Configuration
To achieve the above object, the information extraction apparatus of the present invention, in order to extract information from a text document group stored in a storage device, comprises: (1) main expression setting means that either extracts item expressions and types of extraction-target expressions from the text document group as main expressions, or receives those main expressions as input through interactive processing with a user; (2) information pair extracting means that identifies locations in the text document group where the main expressions co-occur and extracts, as information pairs, pairs consisting of an item expression found at an identified location and an extraction-target expression belonging to an extraction-target expression type; (3) information pair identifying means that, for each group of extracted information pairs partitioned by item expression, identifies information pairs that should not be treated as belonging to the same group as the other information pairs; and (4) information pair processing means that deletes all of the identified information pairs from the extracted information pairs, or deletes from the extracted information pairs those identified pairs for which a deletion instruction has been given, or, instead of deleting the identified pairs, displays the extracted information pairs on a display with the identified pairs explicitly marked.

In this configuration, the apparatus may further comprise graph display means that displays, as a graph, the information pairs not deleted by the information pair processing means.

The information pair identifying means may also determine the attribute group to which each extraction-target expression belongs and, on that basis, identify information pairs that should not be treated as belonging to the same group as the other information pairs.

The main expression setting means may also extract time expressions from the text document group, in which case the information pair extracting means extracts information pairs for which a time expression also appears at the co-occurrence location. The information pair identifying means may then identify, based on the time expressions, information pairs that should not be treated as belonging to the same group as the other information pairs.

Each of the above processing means can also be realized by a computer program. The program may be provided on a suitable computer-readable recording medium or via a network; when the invention is practiced, it is installed and runs on control means such as a CPU, thereby realizing the invention.

In the information extraction apparatus so configured, two kinds of main expressions are extracted from the text document group: item expressions, which specify what items the information to be extracted relates to, and types of extraction-target expressions, which specify the concrete kind of information to be extracted (for example, a named entity for person names, or a common noun for organizations and institutions).

Besides being extracted from the text document group, the main expressions may be entered by presenting the user with an input screen and receiving the user's input, or by presenting the user with a selection screen that lists the available main expressions and receiving the user's selection.

Next, locations in the text document group where the main expressions co-occur are identified, and pairs consisting of an item expression found at an identified location and an extraction-target expression belonging to the extraction-target expression type are extracted as information pairs.
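The co-occurrence step can be illustrated with a minimal Python sketch. It rests on assumptions that are not in the source: a "location" is taken to be a single sentence, matching is plain substring search, and the hypothetical `lexicon` dictionary stands in for whatever recognizer assigns a type to each extraction-target expression.

```python
def extract_pairs(sentences, item_expressions, lexicon, target_types):
    """Return (item expression, extraction-target expression) pairs.

    A pair is emitted whenever an item expression and an expression whose
    type is in target_types occur in the same sentence. `lexicon` maps an
    expression to its type and is a hypothetical stand-in for a recognizer.
    """
    pairs = []
    for sentence in sentences:
        lowered = sentence.lower()
        # Item expressions are matched case-insensitively.
        items = [i for i in item_expressions if i.lower() in lowered]
        # Extraction-target expressions are matched as written.
        targets = [e for e, t in lexicon.items()
                   if t in target_types and e in sentence]
        for item in items:
            for target in targets:
                pairs.append((item, target))
    return pairs
```

A real system would likely use a proper tokenizer and recognizer instead of substring search; the sketch only shows the pairing logic.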

Next, to remove noise information extracted from the text document group, for each group of extracted information pairs partitioned by item expression, the information pairs that should not be treated as belonging to the same group as the others are identified and deleted from the extracted information pairs.

For example, when the extraction-target expressions are classified in a hierarchy, the attribute group to which each extraction-target expression belongs is determined by identifying its position in the hierarchy. On that basis, for each group of extracted information pairs partitioned by item expression, the information pairs that should not be treated as belonging to the same group as the others are identified and deleted from the extracted information pairs.
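One way to picture the hierarchy-based check is the Python sketch below. The source does not say how the attribute group is derived from the hierarchy or what makes a pair an outsider, so two assumptions are made here: the attribute group is the ancestor at a fixed depth below the root, and a pair is flagged when its attribute group differs from the majority group for the same item expression. The `parent` mapping is a hypothetical child-to-parent taxonomy.

```python
from collections import Counter, defaultdict


def attribute_group(expr, parent, depth=1):
    """Return the ancestor of expr that sits `depth` levels below the root."""
    path = [expr]
    while path[-1] in parent:          # walk from the leaf up to the root
        path.append(parent[path[-1]])
    path.reverse()                     # now ordered root -> leaf
    return path[min(depth, len(path) - 1)]


def flag_by_attribute_group(pairs, parent, depth=1):
    """Flag pairs whose attribute group differs from the per-item majority."""
    by_item = defaultdict(list)
    for item, expr in pairs:
        by_item[item].append(expr)
    flagged = []
    for item, exprs in by_item.items():
        groups = [attribute_group(e, parent, depth) for e in exprs]
        majority, _ = Counter(groups).most_common(1)[0]
        flagged += [(item, e) for e, g in zip(exprs, groups) if g != majority]
    return flagged
```

The majority rule is only one possible criterion; a threshold on group share or a distance measured within the hierarchy would fit the description equally well.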

Rather than unconditionally deleting all of the identified information pairs (those that should not be treated as belonging to the same group as the others), the identified pairs may be presented to the user, and only those for which the user gives a deletion instruction may be deleted.

Alternatively, instead of deleting the identified information pairs, the extracted information pairs may be shown on the display with the identified pairs explicitly marked, so that the user can still use the identified pairs as they are while being informed of the identification result.
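The three ways of handling identified pairs (delete all, delete only those the user approves, or keep everything and mark the identified ones) can be sketched as a single function. The mode names and the `approved` parameter are illustrative assumptions, not terms from the source.

```python
def process_pairs(pairs, flagged, mode, approved=()):
    """Return a list of (pair, is_marked) tuples according to `mode`.

    mode = "delete_all"      : drop every flagged pair
    mode = "delete_approved" : drop only flagged pairs the user approved
    mode = "mark_only"       : keep everything, mark flagged pairs
    """
    flagged = set(flagged)
    if mode == "delete_all":
        return [(p, False) for p in pairs if p not in flagged]
    if mode == "delete_approved":
        drop = flagged & set(approved)
        # Flagged pairs that the user did not approve stay, still marked.
        return [(p, p in flagged) for p in pairs if p not in drop]
    if mode == "mark_only":
        return [(p, p in flagged) for p in pairs]
    raise ValueError(f"unknown mode: {mode}")
```

Returning a marker alongside each pair lets a display layer highlight identified pairs without a second lookup.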

Next, the information pairs that were not deleted (pairs of an item expression and an extraction-target expression belonging to the extraction-target expression type) are displayed as a graph.

That is, many kinds of extraction-target expressions are extracted from the text document group, but knowing which type each one belongs to makes it possible to determine which extraction-target expressions should be plotted on the same graph axis and which on separate axes. The information pairs that were not deleted are graphed on the basis of that determination.

In this configuration, if time expressions are also extracted from the text document group and information pairs in which a time expression also co-occurs are extracted, trend information over time can be displayed as a graph.

In this case, to remove noise information extracted from the text document group, for each group of extracted information pairs partitioned by item expression, information pairs whose time expression deviates greatly from the other time expressions are identified and deleted from the extracted information pairs.
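A time-based check of this kind might look like the Python sketch below. The source does not define what counts as a large deviation, so the sketch assumes that time expressions have already been normalized to years and flags any record whose year lies more than `max_gap_years` from the median year of its item group. Both the median and the threshold are illustrative choices.

```python
from collections import defaultdict
from statistics import median


def flag_time_outliers(records, max_gap_years=5):
    """Flag (item, expression, year) records far from their group's median year."""
    by_item = defaultdict(list)
    for record in records:
        by_item[record[0]].append(record)
    flagged = []
    for group in by_item.values():
        centre = median(year for _, _, year in group)
        flagged += [r for r in group if abs(r[2] - centre) > max_gap_years]
    return flagged
```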

[2] Second Configuration
To achieve the above object, the information extraction apparatus of the present invention, in order to extract information from a text document group stored in a storage device, comprises: (1) main expression setting means that either extracts item expressions and unit expressions from the text document group as main expressions, or receives those main expressions as input through interactive processing with a user; (2) information pair extracting means that identifies locations in the text document group where the main expressions co-occur and extracts, as information pairs, pairs consisting of an item expression found at an identified location and a numerical expression related to a unit expression; (3) information pair identifying means that, for each group of extracted information pairs partitioned by item expression, identifies information pairs that should not be treated as belonging to the same group as the other information pairs; and (4) information pair processing means that deletes all of the identified information pairs from the extracted information pairs, or deletes from the extracted information pairs those identified pairs for which a deletion instruction has been given, or, instead of deleting the identified pairs, displays the extracted information pairs on a display with the identified pairs explicitly marked.

The information pair identifying means may also calculate statistical information on the numerical expressions and, on that basis, identify information pairs that should not be treated as belonging to the same group as the other information pairs.

In the information extraction apparatus so configured, two kinds of main expressions are extracted from the text document group: item expressions, which specify what items the information to be extracted relates to, and unit expressions, which specify what unit the numerical expressions to be extracted carry.

Next, locations in the text document group where the main expressions co-occur are identified, and pairs consisting of an item expression found at an identified location and a numerical expression related to the unit expression are extracted as information pairs.
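For the numerical case, the pairing can be sketched with a regular expression that picks up a number immediately followed by the unit expression. This is a simplified illustration under stated assumptions: the text is English, a "location" is one sentence, numbers are written with digits and optional thousands separators, and `extract_numeric_pairs` is a hypothetical name.

```python
import re


def extract_numeric_pairs(sentences, item_expressions, unit_expressions):
    """Return (item expression, numeric value, unit expression) tuples.

    A tuple is emitted when an item expression and a number followed by a
    unit expression occur in the same sentence.
    """
    results = []
    for sentence in sentences:
        lowered = sentence.lower()
        items = [i for i in item_expressions if i.lower() in lowered]
        if not items:
            continue
        for unit in unit_expressions:
            # A number such as 5,000,000 or 12.5 directly before the unit.
            pattern = rf"(\d[\d,]*(?:\.\d+)?)\s*{re.escape(unit)}\b"
            for match in re.finditer(pattern, sentence):
                value = float(match.group(1).replace(",", ""))
                results += [(item, value, unit) for item in items]
    return results
```

Japanese numerals with magnitude characters, as in the source's own examples, would need a dedicated parser rather than this pattern.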

For example, for each group of extracted information pairs partitioned by item expression, statistical information on the numerical expressions (for example, the mean and standard deviation) is calculated. On that basis, information pairs whose numerical expression falls outside, for example, the 3σ range (σ: standard deviation) are identified as pairs that should not be treated as belonging to the same group as the others, and are deleted from the extracted information pairs.
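The 3σ rule can be sketched in a few lines of Python. The sketch assumes the numerical expressions have already been converted to numbers, uses the population standard deviation, and exposes the multiplier as a parameter `k`; the source names 3σ only as an example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def flag_three_sigma(pairs, k=3.0):
    """Flag (item, value) pairs lying more than k standard deviations from
    the mean of their item group."""
    by_item = defaultdict(list)
    for item, value in pairs:
        by_item[item].append(value)
    flagged = []
    for item, values in by_item.items():
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # all values identical, nothing to flag
        flagged += [(item, v) for v in values if abs(v - mu) > k * sigma]
    return flagged
```

One practical caveat: with the population standard deviation, a single value among n can sit at most √(n−1) standard deviations from the mean, so a strict 3σ test cannot flag anything in a group of fewer than 11 values. Small groups may call for a smaller `k` or a median-based rule.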

Next, the information pairs that were not deleted (pairs of an item expression and a numerical expression related to the unit expression) are displayed as a graph.

That is, many numerical expressions are extracted from the text document group, but their combination with a unit expression reveals what kind of quantity each one represents. This makes it possible to determine which numerical expressions should be plotted on the same graph axis and which on separate axes, and the information pairs that were not deleted are graphed on the basis of that determination.

[3] Third Configuration
To achieve the above object, the information extraction apparatus of the present invention, in order to extract information from a text document group stored in a storage device, comprises: (1) main expression setting means that either extracts item expressions, types of extraction-target expressions, and unit expressions from the text document group as main expressions, or receives those main expressions as input through interactive processing with a user; (2) information pair extracting means that identifies locations in the text document group where the main expressions co-occur and extracts, as information pairs, sets consisting of an item expression found at an identified location, an extraction-target expression belonging to an extraction-target expression type, and a numerical expression related to a unit expression; (3) information pair identifying means that, for each group of extracted information pairs partitioned by item expression, identifies information pairs that should not be treated as belonging to the same group as the other information pairs; and (4) information pair processing means that deletes all of the identified information pairs from the extracted information pairs, or deletes from the extracted information pairs those identified pairs for which a deletion instruction has been given, or, instead of deleting the identified pairs, displays the extracted information pairs on a display with the identified pairs explicitly marked.

The information pair identifying means may identify information pairs that should not be treated as belonging to the same group as the other information pairs by determining the attribute group to which each extraction-target expression belongs, by calculating statistical information on the numerical expressions, or by using both algorithms.

In the information extraction apparatus so configured, three kinds of main expressions are extracted from the text document group: item expressions, which specify what items the information to be extracted relates to; types of extraction-target expressions, which specify the concrete kind of information to be extracted (for example, a named entity for person names, or a common noun for organizations and institutions); and unit expressions, which specify what unit the numerical expressions to be extracted carry.

Next, locations in the text document group where the main expressions co-occur are identified, and sets consisting of an item expression found at an identified location, an extraction-target expression belonging to the extraction-target expression type, and a numerical expression related to the unit expression are extracted as information pairs.

For example, when the extraction-target expressions are classified in a hierarchy, the attribute group to which each extraction-target expression belongs is determined by identifying its position in the hierarchy; on that basis, for each group of extracted information pairs partitioned by item expression, the information pairs that should not be treated as belonging to the same group as the others are identified and deleted from the extracted information pairs. As another example, for each group of extracted information pairs partitioned by item expression, statistical information on the numerical expressions (for example, the mean and standard deviation) is calculated; on that basis, information pairs whose numerical expression falls outside, for example, the 3σ range (σ: standard deviation) are identified as pairs that should not be treated as belonging to the same group as the others, and are deleted from the extracted information pairs.

Next, the information pairs that were not deleted (sets of an item expression, an extraction-target expression belonging to the extraction-target expression type, and a numerical expression related to the unit expression) are displayed as a graph.

That is, many kinds of extraction-target expressions are extracted from the text document group, but knowing which type each one belongs to makes it possible to determine which extraction-target expressions should be plotted on the same graph axis and which on separate axes. Likewise, many numerical expressions are extracted from the text document group, but their combination with a unit expression reveals what kind of quantity each one represents, which makes it possible to determine which numerical expressions should be plotted on the same graph axis and which on separate axes. The information pairs that were not deleted are graphed on the basis of these determinations.

According to the present invention, when useful information is extracted from a text document group stored in a storage device, that information can be extracted without being affected by noise information.

The present invention is described in detail below with reference to embodiments.

First, the invention on which the present invention is premised (the invention filed as Japanese Patent Application No. 2007-130218, among others) is described, followed by the present invention itself.

Because the invention on which the present invention is premised also forms part of the present invention, it is described below as the present invention.

FIG. 1 shows an example of the system configuration of the present invention. The information extraction apparatus 1 is a processing device that extracts sets of multiple pieces of information from an article group as information pairs. For example, it extracts pairs of one or more item expressions and one or more named entities as information pairs from the article group stored in the related article database (DB) 14 described later. It also extracts, from the article group stored in the related article DB 14, sets of one or more item expressions, one or more named entities, and one or more numerical expressions as information pairs.

The information extraction apparatus 1 includes a main expression extraction unit 11, an information pair extraction unit 12, a display unit 13, and a related article database (DB) 14. The main expression extraction unit 11 extracts main expressions from the article group stored in the related article DB 14. For example, it extracts one or more item expressions and one or more named entity types as main expressions, or it extracts one or more item expressions, one or more named entity types, and one or more unit expressions as main expressions. The main expressions are used when the information pair extraction unit 12, described later, extracts information pairs. When main expressions are extracted, expressions of the relevant kind that appear frequently and evenly across the entire target article group are selected, for example.
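The selection criterion, expressions that appear frequently and evenly across the whole article group, can be approximated by document frequency. The Python sketch below assumes a list of candidate expressions is already available and keeps those found in at least a given share of the articles; the `min_ratio` threshold and the use of document frequency as a proxy for "evenly" are assumptions, not details from the source.

```python
def select_main_expressions(articles, candidates, min_ratio=0.5):
    """Keep candidates that occur in at least `min_ratio` of the articles."""
    total = len(articles)
    if total == 0:
        return []
    selected = []
    for candidate in candidates:
        # Document frequency: the number of articles containing the candidate.
        doc_freq = sum(1 for article in articles if candidate in article)
        if doc_freq / total >= min_ratio:
            selected.append(candidate)
    return selected
```

Counting articles rather than raw occurrences keeps one article that repeats an expression many times from dominating the selection.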

The main expression extraction unit 11 includes a main unit expression extraction unit 111, a main item expression extraction unit 112, and a main named entity extraction unit 113. The main unit expression extraction unit 111 extracts the unit expressions (main unit expressions) needed to extract and organize information pairs. For example, from an article group about movies, it extracts "yen" from the box office revenue "500 million yen" and "people" from the audience attendance "300,000 people" as main unit expressions.

The main item expression extraction unit 112 extracts the item expressions (main item expressions) needed to extract and organize information pairs. For example, from an article group about movies, it extracts "box office revenue" and "audience attendance" as main item expressions.

The main named entity extraction unit 113 extracts the named entity types (main named entity types) needed to extract and organize information pairs. For example, from a group of articles about movies, it extracts the named entity type "PERSON", which denotes a person, and the named entity type "LOCATION", which denotes a place, as main named entity types.

Based on the main expressions extracted by the main expression extraction unit 11, the information pair extraction unit 12 extracts sets of information as information pairs from the articles that make up the article group in the related article DB 14 (for example, a set of one or more item expressions and one or more named entities, or a set of one or more item expressions, one or more numerical expressions, and one or more named entities). The named entities here are those belonging to the named entity types above; for example, 「日本」 (Japan) and 「アメリカ」 (America) belong to the named entity type "LOCATION". Named entities belonging to a named entity type are extracted using the named entity extraction technique described later.

For example, the information pair extraction unit 12 identifies locations in the article group stored in the related article DB 14 where the main expressions extracted by the main expression extraction unit 11 (for example, an item expression and a named entity type) appear together, and takes the pair of named entity and item expression found at each such location as an information pair. Likewise, it may identify locations where the main expressions (for example, an item expression, a named entity type, and a unit expression) appear together, and take the set of item expression, named entity (one belonging to the named entity type), and numerical expression found there as an information pair. The numerical expression here is one related to the unit expression serving as a main expression.

That is, for a unit expression among the main expressions, the information pair extraction unit 12 also extracts the number related to that unit expression (for example, a number appearing adjacent to the unit expression in the article), and extracts the number and the unit expression together as a numerical expression.
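The combination of a number with its adjacent unit expression can be sketched as follows. This is a minimal illustration, not the method of the embodiment: the function name `extract_numerical_expressions`, the use of a regular expression, and the handling of the multipliers 万 and 億 are assumptions made here, and only numbers that directly precede the unit are covered.

```python
import re

def extract_numerical_expressions(text, unit_expressions):
    """Return (number, unit) pairs where a number is directly followed by a known unit.

    Hypothetical helper: the regex-based matching is an assumption for illustration.
    """
    # Try longer units first so that a longer unit wins over its own prefix.
    units = sorted(unit_expressions, key=len, reverse=True)
    unit_alt = "|".join(re.escape(u) for u in units)
    # ASCII or full-width digits, optional decimal part, optional 万/億 multiplier.
    pattern = re.compile(r"([0-9０-９]+(?:\.[0-9０-９]+)?[万億]?)(" + unit_alt + ")")
    return [(m.group(1), m.group(2)) for m in pattern.finditer(text)]

pairs = extract_numerical_expressions("興行収入は５億円、観客動員数は３０万人だった。", ["円", "人"])
```

Here the number and the unit are kept as separate fields so that a later step can group values by unit.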

For example, in the case of a movie article, the information pair extraction unit 12 extracts an information pair such as "item expression: 台風 (typhoon)", "numerical expression: ４号 (No. 4)", "LOCATION: 南大東島 (Minamidaito Island)".
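A minimal sketch of co-occurrence-based extraction is given below. It assumes that the "location" where main expressions appear together is a single sentence and that named entities have already been tagged by some earlier step; the data layout (`text` and `entities` keys) and the function name are hypothetical.

```python
import re

def extract_information_pairs(sentences, item_expressions, entity_types, unit_expressions):
    """Collect item expression / numerical expression / named entity sets per sentence.

    Assumption: co-occurrence is judged at sentence level, and each sentence is a dict
    with "text" and "entities" (a list of (surface, type) tuples).
    """
    units = sorted(unit_expressions, key=len, reverse=True)
    number_unit = re.compile(r"[0-9０-９]+(?:" + "|".join(re.escape(u) for u in units) + ")")
    pairs = []
    for sent in sentences:
        items = [e for e in item_expressions if e in sent["text"]]
        entities = [(s, t) for s, t in sent["entities"] if t in entity_types]
        numbers = number_unit.findall(sent["text"])
        # Keep the sentence only if all three kinds of main expression co-occur.
        if items and entities and numbers:
            pairs.append({"items": items, "numbers": numbers, "entities": entities})
    return pairs

sentences = [
    {"text": "台風４号が南大東島に接近した。", "entities": [("南大東島", "LOCATION")]},
    {"text": "台風が接近した。", "entities": []},
]
result = extract_information_pairs(sentences, ["台風"], {"LOCATION"}, ["号"])
```

The second sentence is skipped because it contains neither a numerical expression nor a named entity of a main type.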

The display unit 13 organizes and displays the information pairs extracted by the information pair extraction unit 12 (for example, as a graph).

The related article DB 14 stores the article group.

According to one embodiment of the present invention, the main expression extraction unit 11 may further classify the extracted named entity types according to whether a predetermined word accompanies the type before or after it, and treat each resulting class of named entity type as a main expression.

According to one embodiment of the present invention, the main expression extraction unit 11 may extract item expressions and unit expressions as main expressions from the article group stored in the related article DB 14; the information pair extraction unit 12 may identify locations in the article group where the main expressions appear together and extract, as an information pair, the pair of the item expression found at each such location and the numerical expression related to the unit expression; and the main expression extraction unit 11 may classify the numerical expressions according to whether a predetermined word accompanies the numerical expression before or after it, and treat the unit expression related to each resulting class of numerical expression as a main expression.

Here, "a word accompanies" does not only mean that the word appears directly adjacent to the named entity type or numerical expression. It also covers, for example, the case where the word appears in the same sentence as the named entity type or numerical expression. Words in a dependency relation with the named entity type or numerical expression are likewise included among the words that accompany it.

According to one embodiment of the present invention, the main expression extraction unit 11 may further extract the words that accompany, before or after, the extracted named entity types or the numerical expressions related to the extracted unit expressions, and treat as main expressions the named entity types accompanied by a word selected from the extracted words, or the unit expressions related to numerical expressions accompanied by such a selected word.

According to one embodiment of the present invention, the main expression extraction unit 11 may extract item expressions and unit expressions as main expressions from the article group stored in the related article DB 14; the information pair extraction unit 12 may identify locations in the article group where the main expressions appear together and extract, as an information pair, the pair of the item expression found at each such location and the numerical expression related to the unit expression; and the main expression extraction unit 11 may further extract the words that accompany, before or after, the numerical expressions related to the extracted unit expressions, and treat as a main expression the unit expression related to numerical expressions accompanied by a word selected from the extracted words.

According to one embodiment of the present invention, the main expression extraction unit 11 may further determine a word for which the main expressions are well separated into those the word accompanies and those it does not, and extract the main expression accompanied by that word and the main expression not accompanied by it as separate main expressions.

According to one embodiment of the present invention, the main expression extraction unit 11 may extract item expressions and unit expressions as main expressions from the article group; the information pair extraction unit 12 may identify locations in the article group where the main expressions appear together and extract, as an information pair, the pair of the item expression found at each such location and the numerical expression related to the unit expression; and the main expression extraction unit 11 may further determine a word for which the main expressions are well separated into those the word accompanies and those it does not, and extract the main expression accompanied by that word and the main expression not accompanied by it as separate main expressions.

According to one embodiment of the present invention, the main expression extraction unit 11 may further extract the words that accompany, before or after, the numerical expressions related to the extracted unit expressions; obtain, for each extracted word, the normal distribution of the numerical expressions the word accompanies and the normal distribution of the numerical expressions it does not accompany; and, for the word whose two distributions overlap least, treat as main expressions the unit expression related to the numerical expressions the word accompanies and the unit expression related to the numerical expressions it does not accompany.

According to one embodiment of the present invention, the main expression extraction unit 11 may further extract the words that accompany the extracted named entity types before or after them; obtain the similarity between named entities belonging to an extracted named entity type; determine, based on a score value derived from those similarities and from information indicating whether each named entity is accompanied by the extracted word, a word for which the named entities are well separated into those the word accompanies and those it does not; and treat as main expressions the named entity type accompanied by the determined word and the named entity type not accompanied by it.

According to one embodiment of the present invention, the main expression extraction unit 11 may further extract the words that accompany the extracted unit expressions before or after them; obtain the similarity between numerical expressions related to an extracted unit expression; determine, based on a score value derived from those similarities and from information indicating whether each numerical expression is accompanied by the extracted word, a word for which the unit expressions are well separated into those the word accompanies and those it does not; and treat as main expressions the unit expression accompanied by the determined word and the unit expression not accompanied by it.

According to one embodiment of the present invention, the main expression extraction unit 11 may further decide which main expressions are finally extracted based on a score value calculated, by a predetermined formula, from the frequency of each extracted main expression in the article group stored in the related article DB 14.

According to one embodiment of the present invention, the main expression extraction unit 11 may further extract, from the related article DB 14, the clusters to which the words in its article group belong, and treat each extracted cluster as a named entity type.

According to one embodiment of the present invention, a manually created named entity dictionary (for example, information mapping words to station names, movie names, space shuttle names, and the like) may be stored in advance in a storage means (not shown). The main expression extraction unit 11 may then refer to the dictionary to determine the named entity corresponding to each word in the article group of the related article DB 14, and extract the named entity type to which that named entity belongs as a main expression.

According to one embodiment of the present invention, the information pair extraction unit 12 may further select specific main expressions, based on input specified by the user, from among the main expressions extracted by the main expression extraction unit 11, and extract information pairs from the articles that make up the article group based on the selected main expressions.

According to one embodiment of the present invention, the information pair extraction unit 12 may further take a named entity type specified in advance (according to input specified by the user) as the named entity type, and extract, as an information pair, the pair of item expression and named entity found at each location in the related article DB 14 where an item expression extracted by the main expression extraction unit 11 and that named entity type appear together.

According to one embodiment of the present invention, the information pair extraction unit 12 may further extract the information pairs using a machine learning method.

Detailed examples of each component of the information extraction device 1 according to the embodiment of the present invention are described below.
(Main expression extraction unit 11)
The main expression extraction unit 11 extracts the main expressions needed to extract and organize information pairs. For example, it extracts item expressions and named entity types as main expressions, or item expressions, named entity types, and unit expressions.

The main expression extraction unit 11 extracts item expressions and unit expressions using, for example, ChaSen (see reference (1) below).

Reference (1): Y. Matsumoto, A. Kitauchi, T. Yamashita, Y. Hirano, H. Matsuda and M. Asahara: "Japanese morphological analysis system ChaSen version 2.0 manual 2nd edition" (1999).
Each expression is extracted from the ChaSen output using part-of-speech information. For unit expressions, noun sequences connected immediately before or after a number are extracted. For item expressions, noun sequences, for example, are extracted. In addition, expressions containing time-related expressions (e.g., 「年」 (year), 「月」 (month), 「日」 (day)) may be removed from the expressions obtained as unit expressions.
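The noun-sequence extraction described above can be sketched as follows. The sketch does not call ChaSen itself: it assumes the text has already been analyzed into `(surface, part_of_speech)` tuples, with numbers tagged `名詞-数` and other nouns tagged with labels starting with `名詞`. The function names and this simplified tag handling are assumptions made for illustration.

```python
from itertools import groupby

TIME_WORDS = ("年", "月", "日")  # time-related expressions to drop from unit expressions

def _kind(pos):
    """Map a part-of-speech label to 'num', 'noun', or 'other' (simplified assumption)."""
    if pos.startswith("名詞-数"):
        return "num"
    if pos.startswith("名詞"):
        return "noun"
    return "other"

def extract_expressions(tokens):
    """Return (item_candidates, unit_candidates) from (surface, pos) tokens."""
    # Merge consecutive tokens of the same kind into runs, e.g. 興行 + 収入 -> 興行収入.
    runs = [(kind, "".join(surface for surface, _ in group))
            for kind, group in groupby(tokens, key=lambda t: _kind(t[1]))]
    items = [text for kind, text in runs if kind == "noun"]
    units = []
    for idx, (kind, text) in enumerate(runs):
        if kind != "noun":
            continue
        prev_is_num = idx > 0 and runs[idx - 1][0] == "num"
        next_is_num = idx + 1 < len(runs) and runs[idx + 1][0] == "num"
        # A unit candidate is a noun run directly before or after a number.
        if (prev_is_num or next_is_num) and not any(w in text for w in TIME_WORDS):
            units.append(text)
    return items, units

tokens = [("興行", "名詞-一般"), ("収入", "名詞-一般"), ("は", "助詞-係助詞"),
          ("5", "名詞-数"), ("億", "名詞-数"), ("円", "名詞-接尾-助数詞"),
          ("、", "記号-読点"), ("2008", "名詞-数"), ("年", "名詞-接尾-助数詞")]
items, units = extract_expressions(tokens)
```

In this sketch every noun run is returned as an item expression candidate, so unit-like nouns also appear in `items`; a scoring step would be expected to rank the candidates afterwards.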

The main expression extraction unit 11 also extracts named entity types using, for example, the named entity extraction technique described below. When a named entity type is extracted, the named entities belonging to that type are extracted as well.

A named entity is a linguistic expression that denotes a specific thing or quantity, such as a proper noun (a person name, place name, organization name, and so on) or a numerical expression such as a monetary amount. Named entity types include, for example, "ORGANIZATION" for organizations, "PERSON" for persons, "LOCATION" for places, "ARTIFACT" for artifacts, "DATE" for dates, "TIME" for times, "MONEY" for monetary amounts, and "PERCENT" for percentages.

Named entity extraction is a technique for automatically extracting, by computer, named entity types such as those above and the named entities belonging to them from text. For example, applying named entity extraction to the sentence 「日本の首相は小泉純一郎である」 ("Japan's prime minister is Junichiro Koizumi") yields the named entity types (for example, "PERSON" and "LOCATION") and the named entities belonging to them (for example, 「小泉純一郎」 (Junichiro Koizumi) belonging to "PERSON" and 「日本」 (Japan) belonging to "LOCATION").

Examples of general methods for named entity extraction are described below.
(1) Methods using machine learning
There are methods that extract named entities using machine learning (see, for example, reference (2) below).

Reference (2): Masayuki Asahara, Yuji Matsumoto, "Use of redundant morphological analysis in Japanese named entity extraction", IPSJ SIG Natural Language Processing, NL153-7, 2002.
First, for example, the sentence 「日本の首相は小泉さんです。」 ("Japan's prime minister is Mr. Koizumi.") is split into individual characters, and the correct answers are set by assigning each character a correct tag such as B-LOCATION or I-LOCATION, as shown below. The first column is the character and the second column is its correct tag.
日 B-LOCATION
本 I-LOCATION
の O
首 O
相 O
は O
小 B-PERSON
泉 I-PERSON
さ O
ん O
で O
す O
。 O
In the above, a tag of the form B-??? marks the beginning of a named entity of the type given after the hyphen. For example, B-LOCATION marks the beginning of a named entity denoting a place, and B-PERSON marks the beginning of a named entity denoting a person name. A tag of the form I-??? marks any position other than the beginning of a named entity of the type given after the hyphen, and O marks everything else. Thus, for example, the character 「日」 is the beginning of a named entity denoting a place, and that named entity extends through the character 「本」.
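Recovering named entities from such per-character tags can be sketched as follows. This is a minimal illustration of the tag scheme only; the function name `decode_bio` is hypothetical, and tags are written here with an ASCII hyphen.

```python
def decode_bio(chars, tags):
    """Turn per-character B-/I-/O tags into (entity_text, entity_type) tuples."""
    entities = []
    current_chars, current_type = [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first.
            if current_type is not None:
                entities.append(("".join(current_chars), current_type))
            current_chars, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_chars.append(ch)
        else:
            # O (or an inconsistent I- tag) ends the open entity.
            if current_type is not None:
                entities.append(("".join(current_chars), current_type))
            current_chars, current_type = [], None
    if current_type is not None:
        entities.append(("".join(current_chars), current_type))
    return entities

chars = list("日本の首相は小泉さんです。")
tags = ["B-LOCATION", "I-LOCATION", "O", "O", "O", "O",
        "B-PERSON", "I-PERSON", "O", "O", "O", "O", "O"]
entities = decode_bio(chars, tags)
```

For the example sentence this yields the place 日本 and the person name 小泉.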

In this way, the correct answer is set for each character, a model is trained on such data, and the correct tags are then estimated for new data. From the estimated tags, the beginning and the extent of each named entity are recognized, and the named entities are thereby estimated.

When learning from the correct answers set for each character, a system uses various kinds of information in the form of features. For example, for the part
日 B-LOCATION
information such as
日本-B 名詞-B
is used. Here 日本-B means the beginning of the word 日本 (Japan), and 名詞-B means the beginning of a noun (名詞). Words and parts of speech are identified using, for example, the morphological analysis by ChaSen mentioned above. ChaSen splits input Japanese text into words and also estimates the part of speech of each word. For example, entering 「学校へ行く」 ("go to school") yields the following result.

学校 ガッコウ 学校 名詞-一般
へ ヘ へ 助詞-格助詞-一般
行く イク 行く 動詞-自立 五段・カ行促音便 基本形
EOS
In this way the input is split so that each line holds one word, and each word is given reading and part-of-speech information.

In reference (2) above, for example, the features used for each character of the input sentence are the character itself (for example, the character 「小」), the character type (for example, hiragana or katakana), part-of-speech information, and tag information (for example, "B-PERSON").

Learning is performed using these features. The system examines which features appear at the character whose tag is to be estimated and at the surrounding characters, learns which tags are likely when which features appear, and uses the learned result to estimate tags for new data. A support vector machine, for example, is used for the machine learning.
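The feature side of this procedure can be sketched as follows. Only two of the feature kinds mentioned above (the character itself and its character type) are covered; part-of-speech and tag features, and the training of a classifier such as a support vector machine, are omitted. The function names, the feature key format, and the window size are assumptions made for illustration.

```python
import unicodedata

def char_type(ch):
    """Classify a character by script, using its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("CJK UNIFIED"):
        return "kanji"
    if ch.isdigit():
        return "digit"
    return "other"

def window_features(chars, index, window=2):
    """Collect character and character-type features around position `index`."""
    feats = {}
    for offset in range(-window, window + 1):
        pos = index + offset
        if 0 <= pos < len(chars):
            feats[f"char[{offset}]"] = chars[pos]
            feats[f"type[{offset}]"] = char_type(chars[pos])
    return feats

feats = window_features(list("は小泉さん"), 1, window=1)
```

Each dictionary of this kind would be one training instance for the tag of the character at `index`.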

Besides the method above, there are various other methods for named entity extraction. For example, there is a method that extracts named entities using a maximum entropy model and rewriting rules (see reference (3)).

Reference (3): Kiyotaka Uchimoto, Qing Ma, Masaki Murata, Hiromi Ozaku, Masao Utiyama, Hitoshi Isahara, "Named entity extraction based on a maximum entropy model and rewriting rules", Journal of the Association for Natural Language Processing, Vol.7, No.2, 2000.
In addition, for example, reference (4) below describes a method for extracting Japanese named entities using a support vector machine.

Reference (4): Hiroyasu Yamada, Taku Kudo, Yuji Matsumoto, "Japanese named entity extraction using Support Vector Machine", IPSJ Journal, Vol.43, No.1, 2002.
(2) Methods using hand-written rules
Another approach is to write rules by hand and use them to extract named entities.

For example:
noun + 「さん」 (san): person name
noun + 「首相」 (prime minister): person name
noun + 「町」 (town): place
noun + 「市」 (city): place
and so on.
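Rules of this kind can be sketched as a small lookup table. The sketch assumes the text has already been split into `(surface, part_of_speech)` tokens; the table `SUFFIX_RULES` and the function name are hypothetical and contain only the four example rules.

```python
# Example rules only: the word following a noun decides the noun's entity type.
SUFFIX_RULES = {"さん": "PERSON", "首相": "PERSON", "町": "LOCATION", "市": "LOCATION"}

def apply_suffix_rules(tokens):
    """Return (surface, entity_type) for each noun followed by a rule word."""
    entities = []
    for (surface, pos), (next_surface, _) in zip(tokens, tokens[1:]):
        if pos.startswith("名詞") and next_surface in SUFFIX_RULES:
            entities.append((surface, SUFFIX_RULES[next_surface]))
    return entities

tokens = [("小泉", "名詞"), ("さん", "名詞"), ("は", "助詞"),
          ("横浜", "名詞"), ("市", "名詞"), ("に", "助詞")]
entities = apply_suffix_rules(tokens)
```

Keeping the rules in a table makes it easy to add further suffixes without changing the matching code.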

According to one embodiment of the present invention, the main expression extraction unit 11 may select specific main expressions from the extracted main expressions according to input specified by the user.

The main expression extraction unit 11 extracts, as main expressions, the item expressions, named entity types, and unit expressions that play a central role in the article group being handled. For example, it extracts as main expressions those expressions that appear frequently and evenly across the entire target article group.

Specifically, main expressions are extracted using Score values such as those given by formulas (1) to (3) below, and expressions with large scores are extracted as main expressions.
(1) Okapi TF term formula

(2) Total frequency

(3) Total number of articles in which the expression appears

Here, i is an article number, Docs is the set of article numbers, TF_i is the number of occurrences of the expression in article i, l_i is the length of article i, and Δ is the average article length in the article group Docs. The Okapi TF term formula has the effect of raising the score of expressions that appear evenly across multiple articles and also occur frequently. The length of an article is, for example, the number of words or characters it contains. For a named entity type, TF_i is the number of occurrences in article i of named entities belonging to that type.
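The bodies of formulas (1) to (3) are not reproduced in this text. One reading that is consistent with the symbol definitions above is sketched below. The simplified Okapi TF term in (1) is an assumption, as is the restriction of (3) to articles in which the expression occurs at least once.

```latex
\begin{align}
\mathrm{Score} &= \sum_{i \in \mathrm{Docs}} \frac{TF_i}{TF_i + l_i / \Delta} \tag{1} \\
\mathrm{Score} &= \sum_{i \in \mathrm{Docs}} TF_i \tag{2} \\
\mathrm{Score} &= \sum_{i \in \mathrm{Docs},\; TF_i > 0} 1 \tag{3}
\end{align}
```

Under this reading, each article contributes less than 1 to (1) however large TF_i becomes, so an expression spread over many articles scores higher than one concentrated in a single article, which matches the stated effect of the Okapi TF term.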

For item expressions, so that longer strings are preferred, another option is to define TF_i not as the number of occurrences of the expression in article i but, for example, as the product of that number of occurrences and the string length of the expression.
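The three scores and the length-weighted variant can be sketched as follows. The sketch assumes that the Okapi TF term for one article has the simplified form TF_i / (TF_i + l_i / Δ); that form, and all function names, are assumptions made for illustration rather than the formulas of the embodiment.

```python
def okapi_tf_score(tf_per_article, article_lengths):
    """Sum of TF_i / (TF_i + l_i / avg_len) over articles (assumed simplified form)."""
    avg_len = sum(article_lengths) / len(article_lengths)
    return sum(tf / (tf + length / avg_len)
               for tf, length in zip(tf_per_article, article_lengths) if tf > 0)

def total_frequency(tf_per_article):
    """Total number of occurrences over all articles."""
    return sum(tf_per_article)

def article_count(tf_per_article):
    """Number of articles in which the expression occurs at least once."""
    return sum(1 for tf in tf_per_article if tf > 0)

def length_weighted_tf(tf_per_article, expression):
    """Replace each TF_i by TF_i times the string length of the expression."""
    return [tf * len(expression) for tf in tf_per_article]

lengths = [100, 100, 100, 100]
even = [1, 1, 1, 1]    # one occurrence in every article
burst = [4, 0, 0, 0]   # same total frequency, concentrated in one article
```

With these illustrative counts, `okapi_tf_score(even, lengths)` is larger than `okapi_tf_score(burst, lengths)` even though the total frequency is the same, which is the "evenly distributed and frequent" preference described above.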

In the embodiment of the present invention, each score may instead be the value of formula (1), (2), or (3) multiplied by the IDF, that is, log(N/DF). Here, N is the total number of articles in a large-scale corpus (not shown), and DF is, for example, the number of articles in that corpus in which the expression appears.
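The IDF weighting can be sketched as follows. The sketch reads the IDF as log(N/DF) and uses the natural logarithm; the base of the logarithm is not stated in the text, so that choice is an assumption, as are the function names. An expression that never appears in the large corpus (DF = 0) is not handled here.

```python
import math

def idf(total_articles, document_frequency):
    """IDF read as log(N / DF); natural logarithm assumed."""
    return math.log(total_articles / document_frequency)

def score_with_idf(score, total_articles, document_frequency):
    """Multiply a score from formula (1), (2), or (3) by the IDF."""
    return score * idf(total_articles, document_frequency)

rare = score_with_idf(2.0, 1000, 10)      # expression found in 10 of 1000 articles
common = score_with_idf(2.0, 1000, 1000)  # expression found in every article
```

An expression that appears in every article of the large corpus gets an IDF of 0, so general-purpose words are pushed down relative to expressions specific to the target article group.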

In the embodiment of the present invention, the main expression extraction unit 11 extracts, for example, the expression with the highest calculated score as the main expression. Alternatively, it may extract the expressions whose calculated score is at or above a predetermined threshold, or a predetermined number of expressions in descending order of calculated score.
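The three selection strategies can be sketched as follows, assuming the scores are held in a dictionary from expression to score. The function names and the sample scores are illustrative only.

```python
def select_top(scores):
    """Return the single expression with the highest score."""
    return [max(scores, key=scores.get)]

def select_by_threshold(scores, threshold):
    """Return every expression whose score is at or above the threshold."""
    return [expr for expr, score in scores.items() if score >= threshold]

def select_top_k(scores, k):
    """Return the k expressions with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up scores for illustration.
scores = {"興行収入": 3.2, "観客動員数": 2.5, "公開": 0.4}
```

Which strategy is appropriate depends on whether a fixed number of main expressions or a quality cut-off matters more for the article group at hand.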

According to one embodiment of the present invention, the main expression extraction unit 11 may further classify the extracted named entity types according to whether a predetermined word accompanies the type before or after it, and treat each resulting class of named entity type as a main expression. For example, the main expression extraction unit 11 may classify the extracted named entity type "ORGANIZATION" according to whether the word 「警」 ("police") follows it, and treat "ORGANIZATION" accompanied by 「警」 and "ORGANIZATION" not accompanied by 「警」 as separate main expressions.

According to one embodiment of the present invention, the main expression extraction unit 11 may further extract the words that accompany the extracted named entity types before or after them, and treat as a main expression the named entity type accompanied by a word selected from the extracted words. For example, the main expression extraction unit 11 extracts all words that accompany the extracted named entity type "ORGANIZATION" before or after it, sorts and displays them by their frequency of appearance in the related article DB 14, and treats as a main expression the named entity type accompanied by the word the user selects from the displayed words. Instead of extracting the words that accompany the extracted named entity type before or after it, the character strings that accompany it before or after may be extracted.

また、本発明の一実施形態によれば、上記主要表現抽出部１１が、更に、抽出された単位表現に関連する数値表現の前又は後に付随する単語を抽出し、該抽出された単語が付随する数値表現の正規分布と該抽出された単語が付随しない数値表現の正規分布とを求め、求めた正規分布同士が重なっている割合が最も小さい場合の単語が付随する数値表現に関連する単位表現と該単語が付随しない数値表現に関連する単位表現とを上記主要表現とするようにしてもよい。抽出された単位表現に関連する数値表現の前又は後に付随する単語を抽出する代わりに、抽出された単位表現に関連する数値表現の前又は後に付随する文字列を抽出するようにしてもよい。 Further, according to an embodiment of the present invention, the main expression extraction unit 11 may further extract words attached before or after the numerical expressions related to an extracted unit expression, obtain the normal distribution of the numerical expressions to which an extracted word is attached and the normal distribution of the numerical expressions to which it is not attached, and, for the word whose two normal distributions overlap the least, use as the main expressions the unit expression related to the numerical expressions accompanied by that word and the unit expression related to the numerical expressions not accompanied by it. Instead of extracting words attached before or after the numerical expressions related to the extracted unit expression, character strings attached before or after those numerical expressions may be extracted.

すなわち、例えば、主要表現抽出部１１は、上記数値表現の前後に出現する単語又は文字列を関連記事ＤＢ１４から抽出する。ここでは、説明の便宜上、該単語又は文字列を「パターン」と呼ぶ。なお、数値表現の前後に隣接して出現するパターンの代わりに、同一文に出現するパターンを抽出するようにしてもよい。そして、主要表現抽出部１１は、例えば、抽出されたパターンから、関連記事ＤＢ１４に出現する頻度に基づいて所定の数のパターンを選択し、選択されたパターンをパターンの候補とする。 That is, for example, the main expression extraction unit 11 extracts words or character strings that appear before and after the numerical expression from the related article DB 14. Here, for convenience of explanation, the word or character string is referred to as a “pattern”. Note that patterns appearing in the same sentence may be extracted instead of patterns appearing adjacently before and after the numerical expression. Then, the main expression extraction unit 11 selects a predetermined number of patterns based on the frequency of appearance in the related article DB 14 from the extracted patterns, and sets the selected patterns as pattern candidates.

次に、主要表現抽出部１１は、各パターンの候補について、以下の計算をする。まず、主要表現抽出部１１は、パターンが前後に付随する数値表現の平均と、分散を求める。そして、求まった平均と分散とから正規分布（第１の正規分布）を求める。また、主要表現抽出部１１は、パターンが前後に付随しない数値表現の平均と、分散を求める。そして、求まった平均と分散とから正規分布（第２の正規分布）を求める。そして、主要表現抽出部１１は、上記求めた第１の正規分布と第２の正規分布同士が重なっている割合を求める。上記正規分布同士の重なっている割合が小さいときのパターンほど、数値表現同士を区分けする度合い（分解能力）が高いパターンとなる。主要表現抽出部１１は、上記求まった割合が最も小さい場合のパターンの候補を最終的なパターンとして決定し、該決定された最終的なパターンが付随する数値表現に関連する単位表現と該パターンが付随しない数値表現に関連する単位表現とを主要表現とする。例えば、主要表現抽出部１１は、「時速」という単語を最終的なパターンとして決定し、該単語「時速」が付随する数値表現に関連する単位表現と「時速」が付随しない数値表現に関連する単位表現とを主要表現とする。 Next, the main expression extraction unit 11 performs the following calculation for each pattern candidate. First, the main expression extraction unit 11 obtains an average and variance of numerical expressions with patterns preceding and following. Then, a normal distribution (first normal distribution) is obtained from the obtained average and variance. In addition, the main expression extraction unit 11 obtains an average and variance of numerical expressions in which no pattern is attached before and after. Then, a normal distribution (second normal distribution) is obtained from the obtained average and variance. Then, the main expression extraction unit 11 obtains a ratio in which the obtained first normal distribution and second normal distribution overlap each other. A pattern having a smaller overlapping ratio of the normal distributions has a higher degree of separating numerical expressions (decomposition ability). The main expression extraction unit 11 determines a pattern candidate when the obtained ratio is the smallest as a final pattern, and the unit expression related to the numerical expression accompanied by the determined final pattern and the pattern are A unit expression related to a numerical expression that is not attached is a main expression. 
For example, the main expression extraction unit 11 determines the word “時速” (speed per hour) as the final pattern, and uses as the main expressions the unit expression related to numerical expressions accompanied by “時速” and the unit expression related to numerical expressions not accompanied by “時速”.
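The overlap-based selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the data layout (a list of value/pattern-set pairs) and the sample values are assumptions, and the overlap is computed with the standard library's `NormalDist.overlap`, which is one possible way to measure how much two normal distributions overlap.

```python
from statistics import NormalDist, fmean, pstdev

def overlap_for_pattern(observations, pattern):
    """Overlap of the normal distributions fitted to values with / without `pattern`.

    `observations` is a list of (numeric value, set of attached patterns).
    Assumes each group has at least two distinct values (non-zero variance).
    """
    with_p = [v for v, pats in observations if pattern in pats]
    without_p = [v for v, pats in observations if pattern not in pats]
    first = NormalDist(fmean(with_p), pstdev(with_p))         # first normal distribution
    second = NormalDist(fmean(without_p), pstdev(without_p))  # second normal distribution
    return first.overlap(second)

def best_pattern(observations, candidates):
    """Pattern candidate whose two distributions overlap the least."""
    return min(candidates, key=lambda p: overlap_for_pattern(observations, p))

# Hypothetical sample data: speeds carry "時速", other numbers do not.
observations = [
    (120.0, {"時速"}), (100.0, {"時速"}), (110.0, {"時速", "約"}),
    (5.0, {"約"}), (7.0, set()), (6.0, set()), (8.0, {"約"}),
]
final_pattern = best_pattern(observations, ["時速", "約"])
```

With this sample, the values accompanied by “時速” sit far from the rest, so its two distributions barely overlap and it is chosen over “約”.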

本発明の一実施形態によれば、後述する情報対抽出部１２が、上記決定された最終的なパターンが付随する数値表現を含む情報対を抽出するようにしてもよい。 According to an embodiment of the present invention, the information pair extraction unit 12 to be described later may extract an information pair including a numerical expression accompanied by the determined final pattern.

本発明の一実施形態によれば、主要表現抽出部１１が、上記パターンの候補のそれぞれについて、上記正規分布同士の重なっている割合の少ない順に所定の数選択し、該選択されたパターンの候補が付随する数値表現に関連する単位表現と該パターンの候補が付随しない数値表現に関連する単位表現とを主要表現とするようにしてもよい。 According to an embodiment of the present invention, the main expression extraction unit 11 may select a predetermined number of the pattern candidates in ascending order of the overlapping ratio of the normal distributions, and may use as the main expressions the unit expression related to numerical expressions accompanied by each selected pattern candidate and the unit expression related to numerical expressions not accompanied by that pattern candidate.

また、本発明の一実施形態によれば、主要表現抽出部１１が、上記パターンの候補のそれぞれについて、上記正規分布同士の重なっている割合の少ない順に所定の数選択してリストとして表示し、該リストとして表示されたパターンの候補からユーザの指定入力に従って指定したパターンの候補が付随する数値表現に関連する単位表現と該パターンの候補が付随しない数値表現に関連する単位表現とを主要表現とするようにしてもよい。 Further, according to an embodiment of the present invention, the main expression extraction unit 11 may select a predetermined number of the pattern candidates in ascending order of the overlapping ratio of the normal distributions and display them as a list, and may use as the main expressions the unit expression related to numerical expressions accompanied by the pattern candidate that the user designates from the displayed list and the unit expression related to numerical expressions not accompanied by that pattern candidate.

なお、本発明においては、上述した方法以外の、最終的なパターンの決定方法を用いるようにしてもよい。 In the present invention, a final pattern determination method other than the method described above may be used.

すなわち、例えば、主要表現抽出部１１は、上記固有表現の種類の前後に出現する単語又は文字列を関連記事ＤＢ１４から抽出する。前述したように、説明の便宜上、該単語又は文字列を「パターン」と呼ぶ。なお、数値表現の前後に隣接して出現するパターンの代わりに、同一文に出現するパターンを抽出するようにしてもよい。そして、主要表現抽出部１１は、例えば、抽出されたパターンから、関連記事ＤＢ１４に出現する頻度に基づいて所定の数のパターンを選択し、選択されたパターンをパターンの候補とする。 That is, for example, the main expression extraction unit 11 extracts, from the related article DB 14, words or character strings that appear before and after the type of the unique expression. As described above, for convenience of explanation, the word or character string is referred to as a “pattern”. Note that patterns appearing in the same sentence may be extracted instead of patterns appearing adjacently before and after the numerical expression. Then, the main expression extraction unit 11 selects a predetermined number of patterns based on the frequency of appearance in the related article DB 14 from the extracted patterns, and sets the selected patterns as pattern candidates.

次に、主要表現抽出部１１は、各パターンの候補について、以下の計算をする。例えば、主要表現抽出部１１は、予め記憶手段に記憶された、分類語彙表などの、単語を分類した辞書を用いて、当該辞書の記述において、近い意味とされた単語ほど類似度を高く設定しておくことによって、単語同士の類似度を予め決定する。 Next, the main expression extraction unit 11 performs the following calculation for each pattern candidate. For example, the main expression extraction unit 11 uses a dictionary that classifies words, such as a classification vocabulary table that is stored in advance in the storage unit, and sets a higher similarity for words that have a closer meaning in the dictionary description. By doing so, the similarity between words is determined in advance.

辞書を利用する代わりに、以下の方法で単語同士の類似度を決定するようにしてもよい。すなわち、主要表現抽出部１１が、予め記憶された大規模言語コーパスから、ある単語と、該単語とよく共起する単語（例えば、同一文に共起して出現する頻度が高い単語）を取得する。そして、該共起する単語をベクトルの次元、該共起する単語の共起した回数（頻度）をベクトルの要素とするベクトルを、単語毎に作成する。単語同士の類似度を、単語のベクトル同士の角度（又はｃｏｓ）と定義して、この角度（又はｃｏｓ）が小さい（又は大きい）ほど、類似度が高いと定義する。 Instead of using a dictionary, the similarity between words may be determined by the following method. That is, the main expression extraction unit 11 acquires, from a large-scale language corpus stored in advance, a word and the words that frequently co-occur with it (for example, words that frequently appear in the same sentence). Then, for each word, a vector is created whose dimensions are the co-occurring words and whose elements are the numbers of co-occurrences (frequencies). The similarity between words is defined by the angle (or cosine) between their vectors: the smaller the angle (or the larger the cosine), the higher the similarity.
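The co-occurrence vectors and their cosine can be sketched as below. The function names and the toy corpus are assumptions made for illustration; sentences are assumed to be already tokenized into word lists.

```python
import math
from collections import Counter

def cooccurrence_vector(target, sentences):
    """Count the words that appear in the same sentence as `target`."""
    vec = Counter()
    for sent in sentences:
        if target in sent:
            vec.update(w for w in sent if w != target)
    return vec

def cosine(u, v):
    """Cosine between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical tokenized corpus.
sentences = [
    ["警視庁", "が", "発表"],
    ["県警", "が", "発表"],
    ["銀行", "の", "金利"],
]
sim_police = cosine(cooccurrence_vector("警視庁", sentences),
                    cooccurrence_vector("県警", sentences))
sim_bank = cosine(cooccurrence_vector("警視庁", sentences),
                  cooccurrence_vector("銀行", sentences))
```

Here the two police-related words share all their context words, so their cosine is close to 1, while the unrelated pair shares none and scores 0.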

上記のようにして単語同士の類似度を決定した後、主要表現抽出部１１は、上記各パターンの候補について、以下の計算式に従って、ｓｃｏｒｅ（スコア値）を算出する。 After determining the similarity between words as described above, the main expression extraction unit 11 calculates score (score value) for each pattern candidate according to the following calculation formula.

ｓｃｏｒｅ＝Σ２つの固有表現の類似度×ｆ（第１の固有表現，第２の固有表現）
但し、上記式において、Σは、関連記事ＤＢ１４において出現する、上記固有表現の種類に属する固有表現のあらゆる２つの組合せ毎に加算する処理である。また、２つの固有表現の類似度は、上述した単語同士の類似度の決定方法に従って決まる、上記固有表現同士の類似度である。第１の固有表現，第２の固有表現は、上記固有表現の種類に属する固有表現に含まれる固有表現のうちの２つの固有表現である。また、ｆ（第１の固有表現，第２の固有表現）は、第１の固有表現と第２の固有表現とで共にパターンの候補が出現した（付随する）場合、又は共にパターンの候補が出現しなかった場合は１、どちらか一方のみに上記パターンの候補が出現した場合は−１である関数である。
score = Σ (similarity between two specific expressions) × f(first specific expression, second specific expression)
In the above formula, Σ denotes summation over every combination of two specific expressions that appear in the related article DB 14 and belong to the specific expression type. The similarity between two specific expressions is determined according to the method for determining the similarity between words described above. The first specific expression and the second specific expression are two of the specific expressions belonging to the specific expression type. f(first specific expression, second specific expression) is a function whose value is 1 when the pattern candidate appears with (is attached to) both the first and the second specific expression or with neither of them, and −1 when the pattern candidate appears with only one of them.
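The score formula can be sketched as follows. The names `f`, `pattern_score` and the toy similarity table are assumptions for illustration; any similarity function with values determined as described above could be passed in.

```python
from itertools import combinations

def f(first_has_pattern, second_has_pattern):
    """1 if the pattern is attached to both or to neither, -1 if to only one."""
    return 1 if first_has_pattern == second_has_pattern else -1

def pattern_score(entities, similarity):
    """Sum of similarity x f over every pair of specific expressions.

    `entities` is a list of (expression, has_pattern) pairs and
    `similarity` is a callable taking two expressions.
    """
    return sum(
        similarity(a, b) * f(has_a, has_b)
        for (a, has_a), (b, has_b) in combinations(entities, 2)
    )

# Hypothetical similarities between three expressions.
table = {
    frozenset(("org_a", "org_b")): 0.9,
    frozenset(("org_a", "org_c")): 0.1,
    frozenset(("org_b", "org_c")): 0.2,
}
entities = [("org_a", True), ("org_b", True), ("org_c", False)]
score = pattern_score(entities, lambda a, b: table[frozenset((a, b))])
```

In this toy case the pattern is attached to the two similar expressions and not to the dissimilar one, so the similar pair contributes positively and the mixed pairs subtract only a little, giving a score of about 0.6.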

上記ｓｃｏｒｅを各々のパターンの候補毎に計算し、ｓｃｏｒｅの値を求める。求まったｓｃｏｒｅの値が高いときのパターンの候補ほど、固有表現に付随した場合と付随しない場合とにおける固有表現の区分けの度合い（分解能力）が高くなるパターンの候補である。 The score is calculated for each pattern candidate, and the score value is obtained. A pattern candidate with a higher score value is a candidate pattern that has a higher degree of distinction (decomposition ability) of the unique expression when it is associated with the unique expression and when it is not associated with the specific expression.

主要表現抽出部１１は、例えば、最も高いｓｃｏｒｅの値のときのパターンの候補が付随する上記固有表現の種類と該パターンの候補が付随しない固有表現の種類とを主要表現とする。 The main expression extraction unit 11 uses, for example, the above-described specific expression type accompanied by a pattern candidate at the highest score value and the specific expression type not accompanied by the pattern candidate as the main expression.

本発明の一実施形態によれば、後述する情報対抽出部１２が、上記最も高いｓｃｏｒｅの値のときのパターンの候補が付随する固有表現を含む情報対を抽出するようにしてもよい。 According to an embodiment of the present invention, the information pair extraction unit 12 to be described later may extract an information pair including a unique expression accompanied by a pattern candidate at the highest score value.

また、本発明の一実施形態によれば、上記ｓｃｏｒｅの値の高い順にパターンの候補を所定の数選択し、該選択されたパターンの候補それぞれが付随する上記固有表現の種類と該パターンの候補それぞれが付随しない上記固有表現の種類とを主要表現とするようにしてもよい。 In addition, according to an embodiment of the present invention, a predetermined number of pattern candidates are selected in descending order of the score value, and the types of the unique expressions and the pattern candidates to which the selected pattern candidates are attached respectively. You may make it make it the main expression the kind of said specific expression which each does not accompany.

また、本発明の一実施形態によれば、主要表現抽出部１１が、上記ｓｃｏｒｅの値の高い順にパターンの候補を所定の数選択し、該選択されたパターンの候補をリストとして表示し、該リストとして表示されたパターンの候補からユーザの指定入力に従って指定したパターンの候補が付随する固有表現の種類と該パターンの候補が付随しない固有表現の種類とを主要表現とするようにしてもよい。 Further, according to an embodiment of the present invention, the main expression extraction unit 11 selects a predetermined number of pattern candidates in descending order of the score value, displays the selected pattern candidates as a list, and The types of specific expressions accompanied by the pattern candidates designated according to the user's designated input from the pattern candidates displayed as a list and the types of unique expressions not accompanied by the pattern candidates may be used as the main expressions.

なお、本発明においては、上述した方法以外の、主要表現の決定方法を用いるようにしてもよい。 In the present invention, a method for determining a main expression other than the method described above may be used.

本発明の一実施形態によれば、上記ｓｃｏｒｅの値を利用する方法で、前述した数値表現に付随する最終的なパターンを決定するようにしてもよい。この場合は、例えば、上記ｓｃｏｒｅの値を利用する方法において、固有表現の種類を単位表現、固有表現を数値表現（数値、数値データ）として、数値表現同士の類似度を、数値の近さを示すものとして定義すればよい。例えば、数値表現同士の差を、数値表現同士の差の最大値で割った値を求め、１から該求まった値を引いたものを、数値表現同士の類似度とする。このようにして定義される数値表現同士の類似度と、各々の数値表現に前記抽出された単語が付随するか否かを示す情報とに基づいて決まるスコア値（ｓｃｏｒｅ＝Σ２つの数値表現の類似度×ｆ（第１の数値表現，第２の数値表現）に基づいて、例えば、最も高いｓｃｏｒｅの値のときのパターンの候補が付随する単位表現と該パターンの候補が付随しない単位表現とを主要表現とする。 According to an embodiment of the present invention, the final pattern attached to the numerical expressions described above may be determined by the method using the score value. In this case, for example, in the method using the score value, the specific expression type is replaced by the unit expression and the specific expression by the numerical expression (numerical value, numerical data), and the similarity between numerical expressions is defined as a measure of how close the numerical values are. For example, the difference between two numerical expressions is divided by the maximum difference between numerical expressions, and the result subtracted from 1 is used as the similarity between the two numerical expressions. Based on the score value determined from the similarity defined in this way and from information indicating whether the extracted word is attached to each numerical expression (score = Σ (similarity between two numerical expressions) × f(first numerical expression, second numerical expression)), for example, the unit expression accompanied by the pattern candidate with the highest score value and the unit expression not accompanied by that pattern candidate are used as the main expressions.

但し、上記スコア値を示す式において、Σは、関連記事ＤＢ１４において出現する、数値表現のあらゆる２つの組合せ毎に加算する処理である。ｆ（第１の数値表現，第２の数値表現）は、第１の数値表現と第２の数値表現とで共にパターンの候補が出現した（付随する）場合、又は共にパターンの候補が出現しなかった場合は１、どちらか一方のみに上記パターンの候補が出現した場合は−１である関数である。 However, in the formula indicating the score value, Σ is a process of adding every two combinations of numerical expressions that appear in the related article DB 14. f (first numerical expression, second numerical expression) is a case where a pattern candidate appears (attached) in both the first numerical expression and the second numerical expression, or a pattern candidate appears. The function is 1 when there is no pattern, and is -1 when a candidate for the pattern appears only in one of them.
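A minimal sketch of the score for numerical expressions follows. The function name and sample values are assumptions; the "maximum difference" is taken here as the largest value minus the smallest, and the case where all values are equal (maximum difference of zero) is handled by treating every pair as fully similar, which the text does not specify.

```python
from itertools import combinations

def numeric_pattern_score(values):
    """Score for one pattern candidate over (value, has_pattern) pairs."""
    numbers = [v for v, _ in values]
    max_diff = max(numbers) - min(numbers)

    def similarity(a, b):
        # 1 - |a - b| / (maximum difference); assumption: 1.0 if all values are equal.
        return 1.0 if max_diff == 0 else 1.0 - abs(a - b) / max_diff

    return sum(
        similarity(a, b) * (1 if has_a == has_b else -1)
        for (a, has_a), (b, has_b) in combinations(values, 2)
    )

# Hypothetical values: two speeds carrying the pattern, one unrelated number.
numeric_score = numeric_pattern_score([(100.0, True), (110.0, True), (5.0, False)])
```

For this sample the close pair (100, 110) shares the pattern and contributes 95/105, the pair (100, 5) differs in the pattern and contributes −10/105, and the pair (110, 5) has similarity 0, giving 85/105.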

なお、本発明の一実施形態によれば、上記数値表現同士の類似度を、値が大きい方の数値表現を値が小さい方の数値表現で除算した値と定義するようにしてもよい。 According to one embodiment of the present invention, the similarity between the numerical expressions may be defined as a value obtained by dividing the numerical expression having a larger value by the numerical expression having a smaller value.

また、本発明の一実施形態によれば、上記主要表現抽出部１１が、更に、以下に示すクラスタリングの方法を用いて、上記関連記事ＤＢ１４から、該関連記事ＤＢ１４の記事群中の単語が属するクラスターを抽出して、該抽出された各クラスターを、主要表現としての固有表現の種類とするようにしてもよい。 In addition, according to an embodiment of the present invention, the main expression extraction unit 11 further includes a word in an article group of the related article DB 14 from the related article DB 14 by using the following clustering method. Clusters may be extracted, and each extracted cluster may be a kind of unique expression as a main expression.

以下に、クラスタリングの方法の例について説明する。
（階層クラスタリングによる方法）
クラスターの成員のうち、距離が最も近い成員同士を結合していき、クラスターを作る。そして、距離が最も近いクラスター同士を結合する。クラスター間の距離の定義は様々ある。例えば、クラスターＡとクラスターＢとの距離を、クラスターＡの成員（すなわち、クラスターＡに属する単語）とクラスターＢの成員（すなわち、クラスターＢに属する単語）との距離の中で最も小さいものとしてもよい。ここで、ある成員と他の成員との距離とは、ある成員の位置ベクトルと他の成員の位置ベクトルとの間の距離である。位置ベクトルとは、ベクトル空間上における成員の位置を示すベクトルである。また、例えば、クラスターＡとクラスターＢとの距離を、クラスターＡの成員とクラスターＢの成員との距離の中で最も大きいものとしてもよい。また、例えば、クラスターＡとクラスターＢとの距離を、全てのクラスターＡの成員とクラスターＢの成員との距離の平均としてもよい。また、全てのクラスターＡの成員の位置の平均をクラスターＡの位置とし、全てのクラスターＢの成員の位置の平均をクラスターＢの位置とし、当該クラスターＡの位置とクラスターＢの位置との距離をクラスターＡとクラスターＢとの距離としてもよい。
（ウォード法による方法）
以下に示すＷを定義する。
Ｗ＝ ΣΣ（ｘ（ｉ，ｊ）−ａｖｅ＿ｘ（ｉ））＾２
＾は指数を意味する。例えば、上記の式における１つ目のΣは、ｉ＝１からｉ＝ｇまでの加算、２つ目のΣは、ｊ＝１からｊ＝ｎｉまでの加算を意味する。また、ｘ（ｉ，ｊ）は、ｉ番目のクラスターのｊ番目の成員の位置、ａｖｅ＿ｘ（ｉ）は、ｉ番目のクラスターの全ての成員の位置の平均を意味する。クラスター同士を結合していくと、Ｗの値が増加するが、ウォード法では、Ｗの値がなるべく大きくならないようにクラスター同士を結合していく。 Hereinafter, an example of a clustering method will be described.
(Hierarchical clustering method)
Among the members, those closest to each other are merged to form clusters. The clusters closest to each other are then merged. The distance between clusters can be defined in various ways. For example, the distance between cluster A and cluster B may be the smallest of the distances between members of cluster A (i.e., words belonging to cluster A) and members of cluster B (i.e., words belonging to cluster B). Here, the distance between one member and another is the distance between their position vectors, where a position vector indicates the position of a member in the vector space. Alternatively, the distance between cluster A and cluster B may be the largest of the distances between members of cluster A and members of cluster B, or the average of the distances over all pairs of a member of cluster A and a member of cluster B. Alternatively, the average of the positions of all members of cluster A may be taken as the position of cluster A, the average of the positions of all members of cluster B as the position of cluster B, and the distance between these two positions as the distance between cluster A and cluster B.
(Ward's method)
W is defined as follows.
W = ΣΣ (x(i, j) − ave_x(i))^2
Here ^ denotes exponentiation. The first Σ in the above formula is the sum from i = 1 to i = g, and the second Σ is the sum from j = 1 to j = ni. x(i, j) is the position of the j-th member of the i-th cluster, and ave_x(i) is the average of the positions of all members of the i-th cluster. Merging clusters increases the value of W; in Ward's method, clusters are merged so that W increases as little as possible.
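The quantity W and the choice of which clusters to merge can be sketched as follows. The function names are assumptions, and (x − ave_x)^2 is interpreted here as the squared Euclidean distance between a member's position vector and the cluster mean.

```python
from itertools import combinations

def mean_position(members):
    """Component-wise average of the members' position vectors."""
    n = len(members)
    return tuple(sum(coords) / n for coords in zip(*members))

def ward_w(clusters):
    """W = sum over clusters i and members j of |x(i, j) - ave_x(i)|^2."""
    total = 0.0
    for members in clusters:
        center = mean_position(members)
        for x in members:
            total += sum((a - b) ** 2 for a, b in zip(x, center))
    return total

def best_ward_merge(clusters):
    """Indices (i, j) of the pair of clusters whose merge gives the smallest W."""
    def w_after_merge(pair):
        i, j = pair
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        return ward_w(rest + [clusters[i] + clusters[j]])
    return min(combinations(range(len(clusters)), 2), key=w_after_merge)

w_example = ward_w([[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]])
merge_example = best_ward_merge([[(0.0, 0.0)], [(1.0, 0.0)], [(10.0, 0.0)]])
```

In the second example, merging the two nearby singletons raises W the least, so that pair is selected.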

（クラスタリングの終了条件）
予めクラスターの個数を決めておいて、クラスターの個数が当該予め決められた数になったときに、クラスター同士を結合するのをやめるようにしてもよい。また、予め距離の閾値を決めておいて、その閾値以上離れているクラスター同士を結合するのをやめるようにしてもよい。 (Ending condition for clustering)
The number of clusters may be determined in advance, and the merging of clusters may be stopped when the number of clusters reaches that predetermined number. Alternatively, a distance threshold may be determined in advance, and clusters separated by that threshold or more may be left unmerged.
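The bottom-up merging with the minimum, maximum and average inter-cluster distances and the two end conditions can be sketched as below. This is an illustrative sketch under assumptions: the function and parameter names are hypothetical, members are given as coordinate tuples, and Euclidean distance is used.

```python
import math
from itertools import combinations

def cluster_distance(a, b, points, linkage):
    """Distance between clusters a and b (lists of point indices)."""
    dists = [math.dist(points[i], points[j]) for i in a for j in b]
    if linkage == "single":    # smallest member-to-member distance
        return min(dists)
    if linkage == "complete":  # largest member-to-member distance
        return max(dists)
    if linkage == "average":   # average over all member pairs
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(points, linkage="single", n_clusters=1, max_distance=None):
    """Merge the closest clusters until an end condition is met."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        (i, j), dist = min(
            (((i, j), cluster_distance(clusters[i], clusters[j], points, linkage))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        # Stop if even the closest pair is at least the threshold apart.
        if max_distance is not None and dist >= max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
by_count = agglomerate(points, n_clusters=2)
by_threshold = agglomerate(points, max_distance=5.0)
```

Both end conditions give the same two clusters for this sample, because the two tight groups are much farther apart than the threshold.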

（各成員の位置）
各成員（単語）の位置は、各成員に関する種々の情報（例えば、各成員の属性情報）を用いて求める。各成員の属性情報としては、例えば、各成員（単語）に含まれる文字の種類（例えば、ひらがな、カタカナ、漢字、それ以外が、それぞれあるかないか) 、単語の長さ、単語の語義等を用いる。 (Position of each member)
The position of each member (word) is obtained using various information about the member (for example, its attribute information). The attribute information used includes, for example, the types of characters contained in the member (word) (for example, whether hiragana, katakana, kanji, and other characters are each present), the length of the word, and the meaning of the word.

本発明の一実施形態によれば、例えば、主要表現抽出部１１が、関連記事ＤＢ１４内の記事群に含まれる記事、又は、該記事のタイトルや記事の先頭文から、公知のキーワード抽出技術を用いて単語を抽出する。そして、各単語（成員）の位置をベクトル（位置ベクトル）で表現する。成員の位置を示す位置ベクトルの要素の値は、例えば、各単語の出現頻度や、当該単語のＯｋａｐｉの式（例えば上述した式（１）で示される値）、当該単語のｔｆｉｄｆ（前述した式（１）の値にｌｏｇＮ／ＤＦを乗じた値）等としてもよい。なお、例えば、位置ベクトルの次元を単位表現や時間表現の個数分増やして、当該記事において単位表現、時間表現に隣接して記事中に出現している数値を成員の位置ベクトルの要素の値としてもよい。 According to one embodiment of the present invention, for example, the main expression extraction unit 11 uses a known keyword extraction technique to extract words from the articles included in the article group in the related article DB 14, or from the titles or first sentences of those articles. The position of each word (member) is then expressed as a vector (position vector). The values of the elements of the position vector may be, for example, the appearance frequency of each word, the Okapi value of the word (for example, the value given by expression (1) above), or the tf-idf of the word (the value of expression (1) multiplied by log N/DF). In addition, for example, the dimension of the position vector may be increased by the number of unit expressions and time expressions, and the numerical values appearing in the article adjacent to those unit expressions and time expressions may be used as element values of the member's position vector.

主要表現抽出部１１が、複数の記事中の単語（成員）の位置を位置ベクトルで表現し、記事間の距離を、それぞれの記事の成員同士の距離の中で最も小さいものとして、距離が最も近い記事同士を結合して、クラスターを作ってもよい。 The main expression extraction unit 11 expresses the position of a word (member) in a plurality of articles as a position vector, and sets the distance between articles as the smallest of the distances between members of each article, and the distance is the longest. You may create a cluster by combining nearby articles.

次にトップダウンのクラスタリング（非階層クラスタリング）の方法を説明する。
（最大距離アルゴリズムによるクラスタリング）
ある成員と、当該成員と距離が最も離れた成員を求め、これらの成員をそれぞれのクラスターの中心とする。次に、それぞれのクラスターの中心と各成員との距離の最小値を各成員の距離とし、その距離が最も大きい成員を新たなクラスターの中心とする。当該クラスターの中心を求める処理を繰り返す。例えば、予め定めた数のクラスターになったときに、当該クラスターの中心を求める処理の繰り返しをやめる。また、例えば、クラスター間の距離が予め定めた数以下になったときに、当該クラスターの中心を求める処理の繰り返しをやめる。 Next, a method of top-down clustering (non-hierarchical clustering) will be described.
(Clustering with maximum distance algorithm)
First, a member and the member farthest from it are found, and these two members are taken as the centers of their respective clusters. Next, for each member, the minimum of its distances to the cluster centers is taken as that member's distance, and the member with the largest such distance becomes the center of a new cluster. This process of obtaining cluster centers is repeated. For example, the repetition is stopped when a predetermined number of clusters has been reached. Alternatively, for example, the repetition is stopped when the distance between clusters falls to or below a predetermined value.
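The maximum distance algorithm with the "predetermined number of clusters" end condition can be sketched as follows. The function name, the `start` parameter (which member to begin from) and the sample points are assumptions; the sketch also assumes k is at least 2 and no larger than the number of distinct points.

```python
import math

def max_distance_clustering(points, k, start=0):
    """Pick k cluster centers by repeatedly taking the farthest member.

    Returns (indices of the center members, cluster label of each member).
    """
    # First two centers: the starting member and the member farthest from it.
    centers = [start]
    centers.append(max(range(len(points)),
                       key=lambda i: math.dist(points[i], points[start])))
    while len(centers) < k:
        def distance_to_nearest_center(i):
            return min(math.dist(points[i], points[c]) for c in centers)
        # The member farthest from all existing centers becomes a new center.
        centers.append(max(range(len(points)), key=distance_to_nearest_center))
    # Each member joins the cluster whose center is closest to it.
    labels = [
        min(range(len(centers)), key=lambda c: math.dist(p, points[centers[c]]))
        for p in points
    ]
    return centers, labels

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
centers, labels = max_distance_clustering(points, k=3)
```

For this sample the centers are the members at (0, 0), (10, 0) and (5, 8), and the remaining member at (1, 0) joins the first cluster.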

また、クラスターの良さを例えばＡＩＣ情報量基準などで評価して、評価によって求まった値と予め定めた閾値との比較結果に基づいて、当該クラスターの中心を求める処理の繰り返しをやめるようにしてもよい。上記の最大距離アルゴリズムによるクラスタリングによれば、各成員は、各成員と最も近いクラスター中心を持つクラスターの成員となる。
（ｋ平均法）
例えば、以下に示すｋ平均法によって、予め定めた個数（ｋ個）にクラスタリングする。まず、ｋ個の成員をランダムに選択し、選択されたｋ個の成員をクラスターの中心とする。そして、各成員を、当該各成員に最も近いクラスター中心を持つクラスターの成員とする。 In addition, the quality of the clusters may be evaluated by, for example, the AIC (Akaike information criterion), and the repetition of the process for obtaining cluster centers may be stopped based on a comparison between the value obtained by the evaluation and a predetermined threshold. In clustering by the maximum distance algorithm described above, each member becomes a member of the cluster whose center is closest to it.
(k-means method)
For example, clustering into a predetermined number (k) of clusters is performed by the k-means method described below. First, k members are selected at random and taken as the cluster centers. Each member is then assigned to the cluster whose center is closest to it.

次に、クラスター内の各成員の平均の位置に最も近い成員を、それぞれのクラスターの中心とする。そして、各成員を、当該各成員に最も近いクラスター中心を持つクラスターの成員とする。また、クラスター内の各成員の平均の位置に最も近い成員をそれぞれのクラスターの中心とする。上記のクラスターの中心を求める処理を繰り返し、クラスターの中心が移動しなくなったときに、クラスターの中心を求める処理の繰り返しをやめる。本発明の一実施形態によれば、予め定めた回数だけクラスターの中心を求める処理を繰り返してやめるようにしてもよい。そして、最終的なクラスター中心を持つクラスターを決定する。そして、各成員を、当該各成員が最も近いクラスター中心を持つクラスターの成員とする。上記の手法によって、成員のクラスタリングをする。本発明において用いるクラスタリングの方法は、上述した方法に限定されるものではない。 Next, the member closest to the average position of each member in the cluster is set as the center of each cluster. Each member is a member of a cluster having a cluster center closest to each member. The member closest to the average position of each member in the cluster is set as the center of each cluster. The process for obtaining the center of the cluster is repeated, and when the center of the cluster stops moving, the process for obtaining the center of the cluster is stopped. According to an embodiment of the present invention, the process for obtaining the center of the cluster may be repeated for a predetermined number of times. Then, the cluster having the final cluster center is determined. Each member is a member of a cluster having the closest cluster center. Cluster members by the above method. The clustering method used in the present invention is not limited to the method described above.
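The variant of the k-means procedure described above, in which each cluster center is the member closest to the cluster's average position, can be sketched as below. The function names, the optional `initial_centers` parameter (added so the example is reproducible) and the sample points are assumptions; keeping the old center for an empty cluster is also an assumption not stated in the text.

```python
import math
import random

def assign(points, centers):
    """Label each member with the cluster whose center member is closest."""
    return [
        min(range(len(centers)), key=lambda c: math.dist(p, points[centers[c]]))
        for p in points
    ]

def k_means_members(points, k, initial_centers=None, max_iter=100, seed=0):
    """Iterate center updates until the centers stop moving or max_iter is hit."""
    if initial_centers is not None:
        centers = list(initial_centers)
    else:
        centers = random.Random(seed).sample(range(len(points)), k)
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for c in range(len(centers)):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # assumption: keep the old center for an empty cluster
                new_centers.append(centers[c])
                continue
            mean = tuple(sum(x) / len(members)
                         for x in zip(*(points[i] for i in members)))
            # New center: the member closest to the cluster's average position.
            new_centers.append(min(members, key=lambda i: math.dist(points[i], mean)))
        if new_centers == centers:  # centers no longer move
            break
        centers = new_centers
    return centers, assign(points, centers)

points = [(0.0, 0.0), (0.0, 2.0), (0.0, 1.0), (10.0, 0.0), (10.0, 2.0), (10.0, 1.0)]
centers, labels = k_means_members(points, k=2, initial_centers=[0, 3])
```

Starting from one center in each group, the centers move to the middle member of each group and then stay put, which ends the iteration.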

本発明に係る情報抽出装置１は、上述したクラスタリングの方法以外の様々な方法を用いて、クラスタリングをするようにしてもよい。例えば、予め情報抽出装置１内の記憶手段（図示を省略）内に、単語と単語が属するクラスター（例えば、当該単語を含む記事の文）との対応情報を予め記憶させておき、主要表現抽出部１１が、関連記事ＤＢ１４中の記事群から特定の単語を選択し、当該記憶手段内の、当該選択された単語と当該選択された単語が属するクラスター（例えば、当該単語を含む記事の文）との対応情報に基づいて、上記選択された単語が属するクラスターを決定し、該決定されたクラスターを、主要表現としての固有表現の種類としてもよい。 The information extraction apparatus 1 according to the present invention may perform clustering using various methods other than the above-described clustering method. For example, correspondence information between a word and a cluster to which the word belongs (for example, a sentence of an article including the word) is stored in advance in a storage unit (not shown) in the information extraction apparatus 1, and main expression extraction is performed. The unit 11 selects a specific word from the group of articles in the related article DB 14, and the selected word and a cluster to which the selected word belongs in the storage unit (for example, a sentence of an article including the word). Based on the correspondence information, the cluster to which the selected word belongs is determined, and the determined cluster may be the type of the unique expression as the main expression.

また、本発明の一実施形態によれば、主要表現抽出部１１が、ユーザの指定入力に従って、項目表現を入力し、入力された項目表現と共起して出現する単位表現を関連記事ＤＢ１４中の記事群または予め記憶手段に記憶された書誌データから抽出し、上記項目表現と抽出した単位表現を主要表現とするようにしてもよい。そして、情報対抽出部１２が、上記主要表現に基づいて、当該記事群を構成する記事から上記単位表現に関連する数値表現と当該項目表現との対を情報対として抽出し、表示部１３が、上記主要表現に基づいて抽出された情報対をグラフ表示する際に、項目表現を、当該項目表現と偏って共起して出現する単位表現と対応付けてグラフ表示するようにしてもよい。 Further, according to an embodiment of the present invention, the main expression extraction unit 11 inputs an item expression in accordance with a user's designation input, and displays a unit expression that co-occurs with the input item expression in the related article DB 14. May be extracted from the article group or bibliographic data previously stored in the storage means, and the item expression and the extracted unit expression may be used as the main expression. Then, the information pair extraction unit 12 extracts a pair of the numerical expression related to the unit expression and the item expression as an information pair from the articles constituting the article group based on the main expression, and the display unit 13 When the information pair extracted based on the main expression is displayed in a graph, the item expression may be displayed in a graph in association with a unit expression that appears in co-occurrence with the item expression.

また、本発明の一実施形態によれば、主要表現抽出部１１が、ユーザの指定入力に従って、項目表現を入力し、入力された項目表現と共起して出現する単位表現を関連記事ＤＢ１４中の記事群または予め記憶手段に記憶された書誌データから抽出し、上記項目表現と抽出した単位表現を主要表現とするようにしてもよい。そして、情報対抽出部１２が、上記主要表現に基づいて、当該記事群を構成する記事から上記単位表現に関連する数値表現と当該項目表現との対を情報対として抽出するようにしてもよい。 Further, according to an embodiment of the present invention, the main expression extraction unit 11 inputs an item expression in accordance with a user's designation input, and displays a unit expression that co-occurs with the input item expression in the related article DB 14. May be extracted from the article group or bibliographic data previously stored in the storage means, and the item expression and the extracted unit expression may be used as the main expression. Then, the information pair extraction unit 12 may extract a pair of the numerical expression related to the unit expression and the item expression as an information pair from the articles constituting the article group based on the main expression. .

また、本発明の一実施形態によれば、主要表現抽出部１１が、ユーザの指定入力に従って、項目表現を入力し、入力された項目表現と共起して出現する固有表現を関連記事ＤＢ１４中の記事群または予め記憶手段に記憶された書誌データから抽出し、上記項目表現と抽出した固有表現が属する固有表現の種類とを主要表現とするようにしてもよい。そして、情報対抽出部１２が、上記主要表現に基づいて、当該記事群を構成する記事から固有表現と項目表現との対を情報対として抽出するようにしてもよい。 According to one embodiment of the present invention, the main expression extraction unit 11 inputs an item expression in accordance with a user's designation input, and the unique expression that appears along with the input item expression is included in the related article DB 14. The above-mentioned item expression and the kind of specific expression to which the extracted specific expression belongs may be used as the main expression. Then, the information pair extraction unit 12 may extract a pair of the unique expression and the item expression as an information pair from the articles constituting the article group based on the main expression.

ここで、一般に、表現Ｂと偏って共起して出現する単語Ａの抽出方法（共起語抽出方法）について説明する。当該共起語抽出方法を用いれば、例えば、項目表現「観客動員数」から単位表現「人」を求めることができる。また、逆に、単位表現「人」から項目表現「観客動員数」などを求めることができる。また、当該共起語抽出方法を用いれば、例えば、項目表現「選手」から固有表現「北島康介」を求めることができる。また、逆に、固有表現「北島康介」から項目表現「選手」などを求めることができる。 Here, a method of extracting a word A that appears co-occurring with the expression B in general (co-occurrence word extraction method) will be described. If the co-occurrence word extraction method is used, for example, the unit expression “person” can be obtained from the item expression “number of spectators mobilized”. Conversely, the item expression “number of spectators” can be obtained from the unit expression “people”. Further, if the co-occurrence word extraction method is used, for example, the unique expression “Kousuke Kitajima” can be obtained from the item expression “player”. Conversely, the item expression “player” or the like can be obtained from the unique expression “Kosuke Kitajima”.

例えば、項目表現「観客動員数」から単位表現「人」を求める場合は、単位表現の候補を取り出し，それぞれをＡとして以下の計算をする。単位表現「人」から項目表現「観客動員数」などを求める場合は、項目表現の候補を取り出し、それぞれをＡとして以下の計算をする。また、例えば、項目表現「選手」から固有表現「北島康介」を求める場合は、固有表現の候補を取り出し，それぞれをＡとして以下の計算をする。固有表現「北島康介」から項目表現「選手」などを求める場合は、項目表現の候補を取り出し、それぞれをＡとして以下の計算をする。 For example, when the unit expression “person” is to be obtained from the item expression “number of spectators”, candidates for the unit expression are extracted and the following calculation is performed with each candidate as A. When the item expression “number of spectators” or the like is to be obtained from the unit expression “person”, candidates for the item expression are extracted and the following calculation is performed with each candidate as A. Similarly, when the specific expression “Kosuke Kitajima” is to be obtained from the item expression “player”, candidates for the specific expression are extracted and the following calculation is performed with each candidate as A. When the item expression “player” or the like is to be obtained from the specific expression “Kosuke Kitajima”, candidates for the item expression are extracted and the following calculation is performed with each candidate as A.

Ｃ中のＡの出現率、Ｂ中のＡの出現率を求める。ここで、
Ｃ中のＡの出現率＝Ｃ中のＡの出現回数／Ｃ中の単語総数
Ｂ中のＡの出現率＝Ｂ中のＡの出現回数／Ｂ中の単語総数
である。そして、Ｂ中のＡの出現率／Ｃ中のＡの出現率を求めて、この値が大きいものほど、単語Ａを、表現Ｂに偏って共起して出現する単語とする。
The appearance rate of A in C and the appearance rate of A in B are obtained, where
Appearance rate of A in C = Number of appearances of A in C / Total number of words in C
Appearance rate of A in B = Number of appearances of A in B / Total number of words in B
Then the ratio (appearance rate of A in B) / (appearance rate of A in C) is obtained; the larger this value, the more strongly word A is regarded as a word that co-occurs with a bias toward expression B.

Ｂ中のＡの出現率とは、ＢとＡが共起している場合のＡの出現率という意味であり、Ｃ中のＡの出現率とは、予め記憶手段に記憶された書誌データにおけるＡの出現率または出現回数という意味である。 The appearance rate of A in B means the appearance rate of A when B and A co-occur, and the appearance rate of A in C is the bibliographic data stored in the storage means in advance. It means the appearance rate or the number of appearances of A.
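
The ratio described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `cooccurrence_bias`, the toy word lists, and the decision to skip candidates that never occur in C are not taken from the specification.

```python
from collections import Counter

def cooccurrence_bias(candidates, words_in_b, words_in_c):
    """Rank candidate words A by (appearance rate in B) / (appearance rate in C)."""
    count_b, count_c = Counter(words_in_b), Counter(words_in_c)
    scores = {}
    for a in candidates:
        rate_b = count_b[a] / len(words_in_b)
        rate_c = count_c[a] / len(words_in_c)
        if rate_c > 0:  # assumption: candidates absent from C are skipped
            scores[a] = rate_b / rate_c
    # Larger ratio first: these words are the ones most biased toward B.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data: words co-occurring with B versus words in all of C.
words_in_b = ["people", "people", "yen", "people"]
words_in_c = ["people", "yen", "yen", "kg", "people", "yen", "people", "people"]
ranking = cooccurrence_bias(["people", "yen", "kg"], words_in_b, words_in_c)
```

With this toy data the candidate “people” ranks first, because its rate in B (0.75) is higher than its rate in C (0.5).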

本発明の他の実施形態によれば、Ｂ中のＡの出現率とは、関連記事ＤＢ１４中の記事群における、ＢとＡが共起している場合のＡの出現率という意味であり、Ｃ中のＡの出現率とは、関連記事ＤＢ１４中の記事群におけるＡの出現率または出現回数という意味としてもよい。 According to another embodiment of the present invention, the appearance rate of A in B means the appearance rate of A when B and A co-occur in the article group in the related article DB 14, The appearance rate of A in C may mean the appearance rate or the number of appearances of A in the article group in the related article DB 14.

According to one embodiment of the present invention, the item expression “player” may, for example, be obtained from a plurality of player names. For each player name, expressions F that appear disproportionately often with that name are obtained together with their degree of bias. For each expression F, the degrees of bias over all players are then summed or multiplied to give a score (when multiplying, a degree of zero is replaced by a very small value such as 0.000001 instead of multiplying by zero), and the expression F with the largest score is taken as the item expression. The degree of bias is, for example, the value used when identifying frequently co-occurring words.
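
A minimal sketch of this scoring follows, assuming the degrees of bias have already been computed. The function name `pick_item_expression`, the data layout, and the name “Player X” are hypothetical; only the sum/product rule and the small substitute value 0.000001 come from the text.

```python
import math

def pick_item_expression(bias_by_name, mode="sum", epsilon=0.000001):
    """bias_by_name maps each player name to {expression F: degree of bias}."""
    expressions = {f for biases in bias_by_name.values() for f in biases}
    scores = {}
    for f in expressions:
        # An expression never seen with a name has degree 0 for that name.
        values = [biases.get(f, 0.0) for biases in bias_by_name.values()]
        if mode == "sum":
            scores[f] = sum(values)
        else:
            # Product: a zero degree is replaced by a tiny value.
            scores[f] = math.prod(v if v != 0 else epsilon for v in values)
    return max(scores, key=scores.get), scores

bias = {
    "Kosuke Kitajima": {"player": 3.0, "pool": 5.0},
    "Player X": {"player": 4.0},  # hypothetical second name
}
best_sum, _ = pick_item_expression(bias, mode="sum")
best_prod, prod_scores = pick_item_expression(bias, mode="product")
```

In both modes “player” receives the largest score here, since “pool” is biased toward only one of the two names.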

As a method of extracting a word A that co-occurs disproportionately with expression B, a method using a statistical significance test may also be used, as follows.
(Binomial test)
Let N be the number of appearances of A in C, N1 the number of appearances of A in B, and N2 = N − N1. Assuming that, when A appears in C, the probability that the appearance falls within B is 0.5, the probability that, out of the N total appearances, A appears in C but not in B at most N2 times is obtained.

This probability is

\[ P_1 = \sum_{x=0}^{N_2} C(N_1 + N_2, x) \cdot 0.5^{x} \cdot 0.5^{N_1 + N_2 - x} \]

where the sum Σ runs from x = 0 to x = N2, and C(N1 + N2, x) is the number of ways of choosing x items from N1 + N2 distinct items.

If the probability given by the above formula is sufficiently small, it can be judged that N1 and N2 do not arise with equal probability, that is, that N1 is significantly larger than N2. The criterion for judging N1 significantly larger is that P1 is smaller than 5% for a test at the 5% level, and smaller than 10% for a test at the 10% level.
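
The probability P1 can be evaluated directly from its definition. The following is a minimal sketch; the function name `binomial_p1` and the sample counts are illustrative assumptions.

```python
from math import comb

def binomial_p1(n1, n2):
    """P1: sum over x = 0..N2 of C(N1+N2, x) * 0.5**x * 0.5**(N1+N2-x)."""
    n = n1 + n2
    return sum(comb(n, x) * 0.5**x * 0.5**(n - x) for x in range(n2 + 1))

# A appears 10 times in C, 9 of them in B (N1 = 9, N2 = 1).
p1 = binomial_p1(9, 1)       # (C(10,0) + C(10,1)) / 2**10 = 11/1024
significant_5pct = p1 < 0.05
```

Here P1 is about 0.011, so under a test at the 5% level N1 would be judged significantly larger than N2.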

For example, words for which N1 is judged to be significantly larger than N2 are taken as words that co-occur disproportionately with expression B. Alternatively, the smaller P1 is, the more strongly the word is regarded as co-occurring disproportionately with expression B.
(Chi-square test)
Let N1 be the number of appearances of A in B, F1 the total number of word occurrences in B, N2 the number of appearances of A in the part of C that is not in B, and F2 the total number of word occurrences in the part of C that is not in B. Let R1 = F1/N1 and R2 = F2/N2.

Here, with N = N1 + N2, the following value is obtained:

\[ \chi^2 = \frac{N \left( F_1 (N_2 - F_2) - (N_1 - F_1) F_2 \right)^2}{(F_1 + F_2)\,\left(N - (F_1 + F_2)\right)\,N_1 N_2} \]

The larger the chi-square value, the more significant the difference between R1 and R2: a value greater than 3.84 indicates a significant difference at the 5% level, and a value greater than 6.63 indicates a significant difference at the 1% level.
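
The chi-square formula can be checked numerically with a short sketch. As an assumption, the sketch reads F1 and F2 as the counts of A and N1 and N2 as the total word counts of the two groups, because that is the reading under which the formula is the ordinary 2×2 chi-square statistic and R1 = F1/N1 is a rate. The function name and the sample counts are illustrative.

```python
def chi_square(f1, n1, f2, n2):
    """2x2 chi-square value, written term by term as in the formula above."""
    n = n1 + n2
    numerator = n * (f1 * (n2 - f2) - (n1 - f1) * f2) ** 2
    denominator = (f1 + f2) * (n - (f1 + f2)) * n1 * n2
    return numerator / denominator

# Assumed counts: A occurs 30 times in 100 words of B, 10 times in 100 other words.
value = chi_square(30, 100, 10, 100)   # 12.5
significant_5pct = value > 3.84
significant_1pct = value > 6.63
```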

For example, among words with N1 > N2, the larger the chi-square value, the more strongly the word is regarded as co-occurring disproportionately with expression B.
(Test of ratios (test of the difference between ratios))
With p = (F1 + F2)/(N1 + N2), p1 = R1, and p2 = R2, the following value is obtained:

\[ Z = \frac{|p_1 - p_2|}{\sqrt{p\,(1 - p)\left(\frac{1}{N_1} + \frac{1}{N_2}\right)}} \]

The source writes the denominator with sqrt, which denotes the square root.

The larger Z is, the more significant the difference between R1 and R2: Z greater than 1.96 indicates a significant difference at the 5% level, and Z greater than 2.58 indicates a significant difference at the 1% level.

For example, among words with N1 > N2, the larger Z is, the more strongly the word is regarded as co-occurring disproportionately with expression B.
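
A corresponding sketch for the test of ratios is given below. As an assumption it again reads F1 and F2 as the counts of A and N1 and N2 as the group totals, so that p, p1, and p2 are rates between 0 and 1; the function name and counts are illustrative.

```python
from math import sqrt

def ratio_z(f1, n1, f2, n2):
    """Z = |p1 - p2| / sqrt(p * (1 - p) * (1/N1 + 1/N2))."""
    p = (f1 + f2) / (n1 + n2)
    p1, p2 = f1 / n1, f2 / n2
    return abs(p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

z = ratio_z(30, 100, 10, 100)
significant_5pct = z > 1.96
significant_1pct = z > 2.58
```

For a 2×2 table the square of Z equals the chi-square value computed from the same counts, which is consistent with the thresholds 1.96² ≈ 3.84 and 2.58² ≈ 6.63 quoted in the text.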

The three tests above may be combined with the method described earlier, which simply evaluates (appearance rate of A in B) / (appearance rate of A in C). For example, among the words showing a significant difference at the 5% level, the larger the value of that ratio, the more strongly the word is regarded as co-occurring disproportionately with expression B.
(Information pair extraction unit 12)
In the article group stored in the related article DB 14, the information pair extraction unit 12 identifies locations where the expressions extracted by the main expression extraction unit 11 (for example, an item expression and a named entity type) appear, for example, simultaneously, and takes, for example, the pair of the item expression and the named entity written at that location as an information pair. The named entity included in the information pair is a named entity belonging to the named entity type, and it is extracted using the named entity extraction technique described above.

Also, for example, the information pair extraction unit 12 identifies locations where the expressions extracted by the main expression extraction unit 11 (for example, an item expression, a named entity type, and a unit expression) appear simultaneously, and takes the set of the item expression, the named entity, and the numerical expression related to the unit expression written at that location as an information pair. The numerical expression related to the unit expression is, for example, the expression obtained by joining a number that appears in the article adjacent to the unit expression with that unit expression.

In the embodiment of the present invention, for example, Japanese full stops (句点), line breaks, and special symbols marking document boundaries are treated as delimiters, and a location where the main expressions appear together with no delimiter between them is regarded as a location of simultaneous appearance.
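
The delimiter rule can be illustrated with a small sketch that pairs an item expression with numerical expressions found in the same delimited segment. The function name, the `<DOC>` boundary symbol, the digit pattern, and the sample text are assumptions made for illustration; named entity handling is omitted.

```python
import re

def cooccurrence_segments(text, item, unit, doc_break="<DOC>"):
    """Return (item, numerical expression) pairs found within one segment."""
    # Delimiters: Japanese full stop, line break, and an assumed document-break symbol.
    segments = re.split(r"[。\n]|" + re.escape(doc_break), text)
    hits = []
    for seg in segments:
        if item in seg:
            # Numerical expression: digits immediately followed by the unit expression.
            for m in re.finditer(r"[0-9０-９,，]+" + re.escape(unit), seg):
                hits.append((item, m.group()))
    return hits

text = "観客動員数は5000人だった。別の会場には300人が集まった。"
pairs = cooccurrence_segments(text, "観客動員数", "人")
```

Only “5000人” is paired with the item expression, because “300人” lies in a different segment.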

According to one embodiment of the present invention, the information pair extraction unit 12 may further select specific main expressions from among the main expressions extracted by the main expression extraction unit 11 and extract information pairs from the article group based on the selected main expressions.

According to one example of the present invention, the information pair extraction unit 12 may extract from the related article DB 14, using the named entity extraction technique described above, named entities belonging to a named entity type specified in advance by the user (for example, LOCATION). It may then extract as an information pair the pair of the item expression and the named entity written at a location in the related article DB 14 where an item expression extracted by the main expression extraction unit 11 and an extracted named entity appear simultaneously.

本発明の一実施形態によれば、情報対抽出部１２が、更に、機械学習の手法を用いて、上記情報対を抽出するようにしてもよい。 According to an embodiment of the present invention, the information pair extraction unit 12 may further extract the information pair using a machine learning technique.

Also, according to one embodiment of the present invention, the information pair extraction unit 12 may select the principal information pairs, based on a predetermined evaluation value for each main expression, from among the multiple kinds of information pairs that it extracted based on the plurality of main expressions extracted by the main expression extraction unit 11 (for example, by selecting the information pair with the largest evaluation value).

図２は、本発明の実施の形態において、機械学習の手法を用いて情報対を抽出する構成を採る場合の、情報対抽出部１２の構成例を示す図である。情報対抽出部１２は、教師データ記憶手段１２１、解−素性対抽出手段１２２、機械学習手段１２３、学習結果記憶手段１２４、表現対抽出手段１２５、素性抽出手段１２６、解推定手段１２７、情報対抽出手段１２８を備える。 FIG. 2 is a diagram illustrating a configuration example of the information pair extraction unit 12 when a configuration for extracting information pairs using a machine learning technique is employed in the embodiment of the present invention. The information pair extraction unit 12 includes a teacher data storage unit 121, a solution-feature pair extraction unit 122, a machine learning unit 123, a learning result storage unit 124, an expression pair extraction unit 125, a feature extraction unit 126, a solution estimation unit 127, an information pair Extraction means 128 is provided.

The teacher data storage means 121 stores text data serving as teacher data for the machine learning process. For example, let the item expressions be ai (i = 1, 2, 3, ...), the named entities belonging to the named entity type be bi (i = 1, 2, 3, ...), and the numerical expressions related to the unit expression be ci (i = 1, 2, 3, ...). The teacher data then consists of cases in which the problem is a combination of ai, bi, and ci appearing in a sentence of the text data (an expression pair) and the solution is the information of whether or not that expression pair should be extracted as an information pair. Specifically, every combination of ai, bi, and ci appearing in the text data is manually given a tag indicating its solution: either an expression pair that should be extracted as an information pair (positive example) or one that should not be extracted (negative example).

That is, in the embodiment of the present invention, combinations of an expression pair and a solution such as the following are generated:
(a1, b1, c1) - solution “positive example”
(a1, b2, c1) - solution “negative example”
・
・
・
(a2, b2, c2) - solution “negative example”

The solution-feature pair extraction means 122 extracts pairs of a solution and a set of features from the cases of text data stored in the teacher data storage means 121. Features are the information used in the machine learning process. For each expression pair to which a solution has been given in a piece of text data, the solution-feature pair extraction means 122 uses as features, for example, the distances between ai and bi, bi and ci, and ai and ci (in characters, words, or the like), the span of the text data that contains the expression pair ai, bi, ci, and the character strings, words, and part-of-speech information before and after each of ai, bi, and ci. It may also use as features information such as whether ai, bi, and ci are included in the title of the text data, or the parts of speech that appear between ai and bi, bi and ci, and ai and ci. In the embodiment of the present invention, the position of each of ai, bi, and ci within the article may also be used as a feature, because in articles such as newspaper articles the main expression that appears first is often the important one.

機械学習手段１２３は、解−素性対抽出手段１２２によって抽出された解と素性の集合との組から、どのような素性のときにどのような解になりやすいかを、教師あり機械学習法により学習する。その学習結果は、学習結果記憶手段１２４内に記憶される。 The machine learning means 123 uses a supervised machine learning method to determine what kind of solution is likely to be generated from a set of the solution extracted by the solution-feature pair extraction means 122 and the feature set. learn. The learning result is stored in the learning result storage unit 124.

The expression pair extraction means 125 uses the main expressions extracted by the main expression extraction unit 11 (for example, item expressions, named entity types, and unit expressions) to extract every combination (expression pair) of the three kinds of expressions contained in each article in the related article DB 14: ai (item expression), bi (named entity belonging to the named entity type), and ci (numerical expression related to the unit expression). The combination of a number that appears in an article directly adjoining a unit expression and that unit expression is treated as the numerical expression.
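
Enumerating every combination of the three kinds of expressions is a Cartesian product. The sketch below assumes the expressions found in one article have already been collected into three lists; the function name and placeholder values are illustrative.

```python
from itertools import product

def expression_pairs(items, entities, numerics):
    """Every (ai, bi, ci) combination found in one article."""
    return list(product(items, entities, numerics))

# Placeholder expressions standing in for those found in a single article.
candidates = expression_pairs(["a1", "a2"], ["b1", "b2"], ["c1"])
# Each candidate would then be passed to feature extraction and solution estimation.
```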

素性抽出手段１２６は、解−素性対抽出手段１２２と同様の処理によって、表現対抽出手段１２５によって抽出された各表現対について、素性を抽出する。 The feature extraction unit 126 extracts a feature for each expression pair extracted by the expression pair extraction unit 125 by the same processing as the solution-feature pair extraction unit 122.

解推定手段１２７は、学習結果記憶手段１２４の学習結果を参照して、各表現対について、その素性の集合の場合に、どのような解（分類先）になりやすいかの度合いを推定する。 The solution estimation unit 127 refers to the learning result of the learning result storage unit 124, and estimates the degree of the solution (classification destination) that is likely to be obtained in the case of a set of features for each expression pair.

情報対抽出手段１２８は、解推定手段１２７の推定結果に基づいて、情報対として抽出すべき表現対（正例）となる度合いが高いと推定されたものを、情報対として抽出する。 Based on the estimation result of the solution estimation unit 127, the information pair extraction unit 128 extracts an information pair that is estimated to have a high degree of expression pair (positive example) to be extracted as an information pair.

Here, the machine learning technique used by the machine learning means 123 is described. Machine learning prepares many problem-solution pairs, learns from them which solution goes with which kind of problem, and uses the learning result to infer the solution for a new problem as well (see, for example, references (5) to (7) below).

Reference (5): Masaki Murata, Language Processing Based on Machine Learning, Faculty of Science and Engineering, Ryukoku University, invited lecture, 2004. http://www2.nict.go.jp/jt/a132/members/murata/ps/rk1-siryou.pdf
Reference (6): Masaki Murata, Qing Ma, Kiyotaka Uchimoto, Hitoshi Isahara, Japanese-English Translation of Tense, Aspect, and Modality Using Support Vector Machines, IEICE Technical Group on Language Understanding and Communication, NLC2000-78, 2001.
Reference (7): Masaki Murata, Masao Uchiyama, Kiyotaka Uchimoto, Qing Ma, Hitoshi Isahara, CRL's Approach to the SENSEVAL2J Dictionary Task, IEICE Technical Group on Language Understanding and Communication, NLC2001-40, 2001.
To tell the machine what kind of problem it is facing, features (the individual elements that make up the problem, that is, the information used for analysis) are needed; the problem is expressed by its features. For example, in the problem of estimating the tense of a Japanese sentence-final expression, given the problem “彼が話す。” (“He speaks.”) with the solution “present”, one example of the features is “彼が話す。”, “が話す。”, “話す。”, “す”, “。”.

すなわち、機械学習の手法は、素性の集合−解の組のセットを多く用意し、それで学習を行ない、どういう素性の集合のときにどういう解になるかを学習し、その学習結果を利用して、新しい問題のときもその問題から素性の集合を取り出し、その素性の場合の解を推測する方法である。 In other words, the machine learning method prepares many sets of feature set-solution pairs, performs learning, learns what kind of solution the feature set becomes, and uses the learning result. This is a method of extracting a set of features from a new problem and inferring a solution in the case of the feature.

機械学習手段１２３は、機械学習の手法として、例えば、ｋ近傍法、シンプルベイズ法、決定リスト法、最大エントロピー法、サポートベクトルマシン法などの手法を用いる。 The machine learning unit 123 uses a technique such as a k-nearest neighbor method, a simple Bayes method, a decision list method, a maximum entropy method, or a support vector machine method as a machine learning method.

ｋ近傍法は、最も類似する一つの事例のかわりに、最も類似するｋ個の事例を用いて、このｋ個の事例での多数決によって分類先（解）を求める手法である。ｋは、あらかじめ定める整数の数字であって、一般的に、１から９の間の奇数を用いる。 The k-nearest neighbor method is a method for obtaining a classification destination (solution) by using the k most similar cases instead of the most similar case, and by majority decision of the k cases. k is a predetermined integer number, and generally an odd number between 1 and 9 is used.

シンプルベイズ法は、ベイズの定理にもとづいて各分類になる確率を推定し、その確率値が最も大きい分類を求める分類先とする方法である。 The Simple Bayes method is a method of estimating the probability of each classification based on Bayes' theorem and determining the classification having the highest probability value as a classification destination.

シンプルベイズ法において、文脈ｂで分類ａを出力する確率は、以下の式（４）で与えられる。 In the simple Bayes method, the probability of outputting the classification a in the context b is given by the following equation (4).

Here, the context b is a set of preset features f_j (∈ F, 1 ≤ j ≤ k). p(b) is the appearance probability of the context b; it is not calculated because it does not depend on the classification a and is therefore a constant. p̃(a) and p̃(f_i|a) are probabilities estimated from the teacher data and denote, respectively, the appearance probability of classification a and the probability of having feature f_i given classification a. If values obtained by maximum likelihood estimation are used for p̃(f_i|a), the value is often zero, in which case the value of equation (5) becomes zero and it can be difficult to determine the classification. Smoothing is therefore applied; here, values smoothed using equation (6) below are used.

Here, freq(f_i, a) denotes the number of cases that have feature f_i and whose classification is a, and freq(a) denotes the number of cases whose classification is a.
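
The simple Bayes procedure can be sketched as follows. Because the smoothing formula itself is not reproduced in this text, the sketch substitutes additive smoothing with an assumed constant `alpha`; that choice, the function names, and the toy cases are illustrative and should not be read as the patent's equation (6). As stated in the text, p(b) is omitted because it is constant across classifications.

```python
from collections import Counter, defaultdict

def train(cases):
    """cases: list of (set of features, classification)."""
    class_freq = Counter(label for _, label in cases)   # freq(a)
    feat_freq = defaultdict(Counter)                    # freq(f_i, a)
    for feats, label in cases:
        for f in feats:
            feat_freq[label][f] += 1
    return class_freq, feat_freq

def classify(feats, class_freq, feat_freq, alpha=0.01):
    total = sum(class_freq.values())
    best, best_score = None, -1.0
    for a, freq_a in class_freq.items():
        score = freq_a / total                          # estimate of p(a)
        for f in feats:
            # Smoothed estimate of p(f|a); alpha is an assumed constant that
            # keeps the product from collapsing to zero for unseen features.
            score *= (feat_freq[a][f] + alpha) / (freq_a + 2 * alpha)
        if score > best_score:
            best, best_score = a, score
    return best

cases = [({"near", "title"}, "positive"), ({"near"}, "positive"), ({"far"}, "negative")]
class_freq, feat_freq = train(cases)
```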

決定リスト法は、素性と分類先の組とを規則とし、それらをあらかじめ定めた優先順序でリストに蓄えおき、検出する対象となる入力が与えられたときに、リストで優先順位の高いところから入力のデータと規則の素性とを比較し、素性が一致した規則の分類先をその入力の分類先とする方法である。 The decision list method uses features and combinations of classification destinations as rules, stores them in the list in a predetermined priority order, and when input to be detected is given, from the highest priority in the list This is a method in which input data is compared with the feature of the rule, and the classification destination of the rule having the same feature is set as the classification destination of the input.

In the decision list method, the probability of each classification is obtained using only one of the preset features f_j (∈ F, 1 ≤ j ≤ k) as the context. The probability of outputting classification a in a context b is given by the following equation.

\[ p(a \mid b) = p(a \mid f_{\max}) \tag{7} \]

where f_max is given by the following equation.

Here, p̃(a_i|f_j) is the proportion of occurrences of classification a_i among the cases whose context contains feature f_j.
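
A decision list of this kind can be sketched as below. The equation defining f_max is not reproduced in this text, so the sketch assumes that rules are ordered by the proportion p̃(a_i|f_j) of their most frequent classification and that the first rule whose feature is present in the context decides. The function names and toy cases are illustrative.

```python
from collections import Counter, defaultdict

def build_decision_list(cases):
    """cases: list of (set of features, classification)."""
    by_feature = defaultdict(Counter)
    for feats, label in cases:
        for f in feats:
            by_feature[f][label] += 1
    rules = []
    for f, counts in by_feature.items():
        label, hits = counts.most_common(1)[0]
        # Proportion of the most frequent classification given feature f.
        rules.append((hits / sum(counts.values()), f, label))
    rules.sort(key=lambda r: r[0], reverse=True)   # highest priority first
    return rules

def classify(feats, rules, default=None):
    for _, f, label in rules:
        if f in feats:          # first matching rule decides
            return label
    return default

cases = [({"near", "title"}, "positive"), ({"near"}, "positive"),
         ({"near", "far"}, "negative"), ({"far"}, "negative")]
rules = build_decision_list(cases)
```

In this toy data the feature “far” always goes with “negative” while “near” does so only a third of the time, so a context containing both is classified by the “far” rule.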

The maximum entropy method is as follows. With F denoting the set of preset features f_j (1 ≤ j ≤ k), the probability distribution p(a, b) that maximizes equation (10), which expresses the entropy, while satisfying the prescribed constraint (equation (9)) is obtained; among the probabilities of the classifications derived from that distribution, the classification with the largest probability is taken as the classification to be output.

Here, A and B denote the sets of classifications and contexts, and g_j(a, b) is a function that is 1 when context b has feature f_j and the classification is a, and 0 otherwise. p̃(a_i|f_j) denotes the proportion of occurrences of (a, b) in the known data.

Equation (9) computes the expected frequency of each output-feature combination by multiplying the probability p by the function g, which indicates the occurrence of that combination. Under the constraint that the expected value in the known data on the right-hand side equals the expected value computed from the probability distribution being sought on the left-hand side, entropy maximization (smoothing of the probability distribution) is carried out to obtain the probability distribution over outputs and contexts. Details of the maximum entropy method are described in references (8) and (9) below.

Reference (8): Eric Sven Ristad, Maximum Entropy Modeling for Natural Language (ACL/EACL Tutorial Program, Madrid, 1997)
Reference (9): Eric Sven Ristad, Maximum Entropy Modeling Toolkit, Release 1.6beta (http://www.mnemonic.com/software/memt, 1998)
The support vector machine method classifies data consisting of two classes by dividing the space with a hyperplane.

FIG. 3 shows the concept of margin maximization in the support vector machine method. In FIG. 3, white circles denote positive examples, black circles denote negative examples, the solid line denotes the hyperplane that divides the space, and the broken lines denote the surfaces bounding the margin region. FIG. 3(A) is a conceptual diagram of the case where the gap between the positive and negative examples is narrow (small margin), and FIG. 3(B) is a conceptual diagram of the case where the gap is wide (large margin).

このとき、二つの分類が正例と負例からなるものとすると、学習データにおける正例と負例の間隔（マージン) が大きいものほどオープンデータで誤った分類をする可能性が低いと考えられ、図３（Ｂ）に示すように、このマージンを最大にする超平面を求めそれを用いて分類を行なう。 At this time, if the two classifications consist of positive and negative examples, the larger the interval (margin) between the positive and negative examples in the learning data, the less likely it is to make an incorrect classification with open data. As shown in FIG. 3B, a hyperplane that maximizes this margin is obtained, and classification is performed using it.

This is the basic idea. In practice, however, the method is usually used with two extensions: one allows a small number of cases in the training data to fall inside the margin region, and the other makes the linear hyperplane nonlinear (by introducing a kernel function).

この拡張された方法は、以下の識別関数を用いて分類することと等価であり、その識別関数の出力値が正か負かによって二つの分類を判別することができる。 This extended method is equivalent to classification using the following discriminant function, and the two classes can be discriminated depending on whether the output value of the discriminant function is positive or negative.

Here, x is the context (set of features) of the case to be classified, x_i and y_j (i = 1, ..., l, y_j ∈ {1, −1}) are the contexts and classifications of the training data, and the function sgn is

\[ \operatorname{sgn}(x) = \begin{cases} 1 & (x \ge 0) \\ -1 & (\text{otherwise}) \end{cases} \]

Each α_i is the value that maximizes equation (12) under the constraints of equations (13) and (14).

また、関数Ｋはカーネル関数と呼ばれ、様々なものが用いられるが、本形態では以下の多項式のものを用いる。 The function K is called a kernel function, and various functions are used. In this embodiment, the following polynomial is used.

\[ K(x, y) = (x \cdot y + 1)^{d} \tag{15} \]

C and d are constants set experimentally. For example, C was fixed at 1 throughout all processing, and two values of d, 1 and 2, were tried. The x_i for which α_i > 0 are called support vectors, and the summation in equation (11) is normally computed using only these cases. In other words, only the cases in the training data that are support vectors are used in the actual analysis.
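
The polynomial kernel of equation (15) is straightforward to compute. The sketch below assumes contexts are encoded as numeric vectors (for example, 1 where a feature is present and 0 otherwise); the encoding and the function name are illustrative assumptions.

```python
def poly_kernel(x, y, d=2):
    """K(x, y) = (x . y + 1) ** d for two equal-length numeric vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return (dot + 1) ** d

x = [1, 0, 1]   # assumed binary feature vectors
y = [1, 1, 0]
k1 = poly_kernel(x, y, d=1)   # (1 + 1) ** 1
k2 = poly_kernel(x, y, d=2)   # (1 + 1) ** 2
```

With d = 1 the kernel is linear in the shared features; with d = 2 it also accounts for pairs of features, which is one reason both values are worth trying.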

なお、拡張されたサポートベクトルマシン法の詳細については、以下の参考文献（１０）および参考文献（１１）に記載されている。 The details of the extended support vector machine method are described in the following references (10) and (11).

Reference (10): Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and other kernel-based learning methods (Cambridge University Press, 2000)
Reference (11): Taku Kudoh, Tinysvm: Support Vector machines (http://cl.aist-nara.ac.jp/taku-ku//software/Tiny SVM/index.html, 2000)
The support vector machine method handles data with two classes. To handle cases with three or more classes, it is therefore usually combined with a technique such as the pairwise method or the one-vs-rest method.

ペアワイズ法は、ｎ個の分類を持つデータの場合に、異なる二つの分類先のあらゆるペア（ｎ（ｎ−１）／２個）を生成し、各ペアごとにどちらがよいかを二値分類器、すなわちサポートベクトルマシン法処理モジュールで求めて、最終的に、ｎ（ｎ−１）／２個の二値分類による分類先の多数決によって、分類先を求める方法である。 In the pairwise method, in the case of data having n classifications, every pair (n (n-1) / 2) of two different classification destinations is generated, and a binary classifier indicates which is better for each pair. That is, it is obtained by the support vector machine method processing module and finally obtains the classification destination by majority decision of the classification destination by n (n−1) / 2 binary classification.
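
The pairwise vote can be sketched as follows. The callback `binary_decide` stands in for a trained two-class support vector machine and is a hypothetical placeholder, as are the toy class centers used to drive it.

```python
from collections import Counter
from itertools import combinations

def pairwise_classify(x, classes, binary_decide):
    """Majority vote over all n(n-1)/2 pairs of classes."""
    votes = Counter(binary_decide(x, a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

# Toy stand-in for the binary classifiers: choose the class whose center is nearer.
centers = {"a": 0.0, "b": 5.0, "c": 10.0}

def nearer(x, p, q):
    return p if abs(x - centers[p]) <= abs(x - centers[q]) else q

result = pairwise_classify(4.0, ["a", "b", "c"], nearer)
```

With three classes there are three pairwise decisions; here “b” wins two of them and is returned.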

In the one-vs-rest method, when there are, for example, three classes a, b, and c, three sets are generated (class a versus the rest, class b versus the rest, and class c versus the rest), and each set is trained with the support vector machine method. In the estimation process, the learning results of the three support vector machines are used: the candidate to be classified is evaluated by each of the three machines, and among the machines that assign it to their own class rather than to “the rest”, the class of the machine for which the candidate lies farthest from the separating plane is taken as the solution. For example, if a candidate lies farthest from the separating plane in the support vector machine trained on the set “class a versus the rest”, the class of the candidate is estimated to be a.
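
The one-vs-rest decision can be sketched as picking the class whose machine gives the largest signed distance from its separating plane. The scoring functions below are hypothetical stand-ins for trained support vector machines, and falling back to the least negative score when no machine claims the candidate is an assumption, not something stated in the text.

```python
def one_vs_rest_classify(x, scorers):
    """scorers maps each class to a function returning the signed distance of x
    from that class's separating plane (positive = own class, negative = the rest)."""
    return max(scorers, key=lambda c: scorers[c](x))

# Hypothetical signed-distance functions for classes a, b, and c.
scorers = {
    "a": lambda x: 2.0 - x,
    "b": lambda x: x - 3.0,
    "c": lambda x: x - 8.0,
}
result = one_vs_rest_classify(1.0, scorers)   # only the "a" machine scores positive
```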

How the solution estimation means 127 obtains, for each expression pair, the degree to which a given solution (class) is likely depends on the machine learning method used by the machine learning means 123.

For example, in the embodiment of the present invention, when the machine learning means 123 uses the k-nearest neighbor method as the machine learning method, it defines the similarity between cases of the teacher data based on the proportion of overlapping features among the sets of features extracted from those cases (the proportion of identical features the cases share), and stores the defined similarity and the cases in the learning result storage means 124 as learning result information.

Then, when a new expression pair (candidate) is extracted by the expression pair extraction means 125, the solution estimation means 127 refers to the similarity and the cases defined in the learning result storage means 124, selects from those cases the k cases most similar to the candidate, and estimates the class decided by majority vote among the selected k cases as the class (solution) of the candidate. That is, the solution estimation means 127 takes, as the degree to which each expression pair is likely to have a given solution (class), the number of votes in the majority vote among the selected k cases, here the number of votes obtained by the class "should be extracted".
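The k-nearest neighbor scoring above can be illustrated with a short sketch. The text defines similarity only as the proportion of overlapping features, so the Jaccard ratio used here (shared features divided by all features of the two cases) is an assumption, as are the function names, the labels "extract" and "skip", and the toy feature sets.

```python
def overlap_similarity(f1, f2):
    """Proportion of shared features (Jaccard ratio; an assumed definition)."""
    union = f1 | f2
    return len(f1 & f2) / len(union) if union else 0.0


def knn_votes(candidate, cases, k, target="extract"):
    """Number of votes for `target` among the k stored cases most similar
    to the candidate. Each case is a (feature_set, label) tuple."""
    ranked = sorted(cases,
                    key=lambda case: overlap_similarity(candidate, case[0]),
                    reverse=True)
    return sum(1 for _, label in ranked[:k] if label == target)


# Hypothetical teacher data.
cases = [
    ({"price", "kilo", "yen"}, "extract"),
    ({"price", "yen"}, "extract"),
    ({"weather", "kilo"}, "skip"),
    ({"sports"}, "skip"),
]
print(knn_votes({"price", "kilo"}, cases, k=3))
```

The returned vote count plays the role of the "degree" for the class "should be extracted"; the majority class among the same k cases gives the estimated solution.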

When the simple Bayes (naive Bayes) method is used as the machine learning method, the machine learning means 123 stores, for each case of the teacher data, the pair consisting of the solution of the case and its set of features in the learning result storage means 124 as learning result information. Then, when a new expression pair (candidate) is extracted by the expression pair extraction means 125, the solution estimation means 127 uses the pairs of solutions and feature sets in the learning result information of the learning result storage means 124 to calculate, based on Bayes' theorem, the probability of each class given the set of features of the candidate obtained by the feature extraction means 126, and estimates the class with the largest probability as the class (solution) of the candidate. That is, the solution estimation means 127 takes, as the degree to which a given solution is likely for the set of features of the candidate, the probability of each class, here the probability of the class "should be extracted".
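As a rough illustration of the Bayes-based scoring above, the sketch below estimates class probabilities from stored (feature set, solution) pairs. Several details are assumptions not stated in the text: features are treated as independent presence indicators, only the features present in the candidate contribute, and add-one smoothing is used to avoid zero probabilities. All names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(cases):
    """Count classes and per-class feature occurrences from (features, label) pairs."""
    class_counts = Counter(label for _, label in cases)
    feature_counts = defaultdict(Counter)
    for features, label in cases:
        feature_counts[label].update(features)
    return class_counts, feature_counts


def class_probabilities(features, model):
    """Posterior probability of each class given the candidate's features."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        score = math.log(n / total)  # prior
        for f in features:
            # Add-one smoothing over presence/absence (an assumption).
            score += math.log((feature_counts[label][f] + 1) / (n + 2))
        log_scores[label] = score
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}


cases = [
    ({"price", "yen"}, "extract"),
    ({"price", "kilo"}, "extract"),
    ({"weather"}, "skip"),
    ({"weather", "kilo"}, "skip"),
]
model = train_naive_bayes(cases)
print(class_probabilities({"price"}, model))
```

The probability assigned to "extract" is the value that would serve as the degree for the class "should be extracted".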

When the decision list method is used as the machine learning method, the machine learning means 123 stores in the learning result storage means 124 a list in which rules pairing a feature with a class, derived from the cases of the teacher data, are arranged in a predetermined priority order. Then, when a new expression pair (candidate) is extracted by the expression pair extraction means 125, the solution estimation means 127 compares the features of the candidate with the feature of each rule, in descending order of priority in the list of the learning result storage means 124, and estimates the class of the rule whose feature matches as the class (solution) of the candidate. That is, the solution estimation means 127 takes, as the degree to which a given solution is likely for the set of features of the candidate, the predetermined priority or a numerical value or measure corresponding to it, here the priority in the list of the probability of the class "should be extracted".
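The decision list lookup above amounts to scanning an ordered rule list and stopping at the first match. A minimal sketch, with a hypothetical function name and hypothetical rules; how the priority order is learned is outside its scope.

```python
def apply_decision_list(features, rules, default=None):
    """Return (label, rank) of the highest-priority rule whose feature is
    present in `features`. `rules` is a list of (feature, label) tuples
    already sorted from highest to lowest priority."""
    for rank, (feature, label) in enumerate(rules, start=1):
        if feature in features:
            return label, rank
    return default, None  # no rule matched


# Hypothetical rule list in priority order.
rules = [("yen", "extract"), ("weather", "skip"), ("kilo", "extract")]
print(apply_decision_list({"kilo", "weather"}, rules))
```

The rank of the matching rule (1 is the highest priority) corresponds to the priority that the text uses as the degree of likelihood.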

When the maximum entropy method is used as the machine learning method, the machine learning means 123 identifies, from the cases of the teacher data, the classes that can be solutions, obtains the probability distribution over the two terms, a set of features and a class that can be a solution, that satisfies predetermined conditional expressions and maximizes the expression representing entropy, and stores it in the learning result storage means 124. Then, when a new expression pair (candidate) is extracted by the expression pair extraction means 125, the solution estimation means 127 uses the probability distribution in the learning result storage means 124 to obtain, for the set of features of the candidate, the probability of each class that can be a solution, identifies the class with the largest probability, and estimates that class as the solution of the candidate. That is, the solution estimation means 127 takes, as the degree to which a given solution is likely for the set of features of the candidate, the probability of each class, here the probability of the class "should be extracted".

When the support vector machine method is used as the machine learning method, the machine learning means 123 identifies, from the cases of the teacher data, the classes that can be solutions, divides the classes into positive and negative examples, and, in a space whose dimensions are the features of the cases, obtains according to a predetermined execution function using a kernel function the hyperplane that separates the positive examples from the negative examples while maximizing the margin between them, and stores it in the learning result storage means 124. Then, when a new expression pair (candidate) is extracted by the expression pair extraction means 125, the solution estimation means 127 uses the hyperplane in the learning result storage means 124 to identify whether the set of features of the candidate lies on the positive side or the negative side of the space divided by the hyperplane, and estimates the class determined by that result as the solution of the candidate. That is, the solution estimation means 127 takes, as the degree to which a given solution is likely for the set of features of the candidate, the distance from the separating plane into the space of the positive examples (expression pairs that should be extracted). More specifically, when expression pairs that should be extracted are the positive examples and expression pairs that should not be extracted are the negative examples, a case located on the positive side of the separating plane is judged to be a "case that should be extracted", and the distance of the case from the separating plane is used as the degree for that case.
The above describes an example in which the information pair extraction unit 12 extracts information pairs by a machine learning method using an item expression, types of named entity, and a unit expression as the main expressions. Using the same machine learning method, the information pair extraction unit 12 may instead extract information pairs using only an item expression and types of named entity as the main expressions.
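The distance-based degree used with the support vector machine can be sketched for the simplest case. The sketch assumes a linear separating plane w·x + b = 0 with known weights; with a kernel function, the decision value of the trained machine would take the place of this expression. The function name and the numbers are hypothetical.

```python
import math


def signed_distance(w, b, x):
    """Signed distance of point x from the plane w.x + b = 0.
    Positive values lie on the positive-example side."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


# Hypothetical plane and candidates in a two-feature space.
w, b = (3.0, 4.0), -5.0
for candidate in [(3.0, 4.0), (0.0, 0.0)]:
    d = signed_distance(w, b, candidate)
    verdict = "should be extracted" if d > 0 else "should not be extracted"
    print(candidate, d, verdict)
```

A candidate with a larger positive distance is treated as more strongly a "case that should be extracted".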

According to an embodiment of the present invention, the information pair extraction unit 12 may further determine, among the extracted information pairs, information pairs to be finally extracted as positive examples and information pairs not to be finally extracted as negative examples, perform machine learning on the extracted information pairs by the machine learning method described above using the determined positive and negative examples as teacher data, and thereby determine the information pairs to be finally extracted.

For example, after extracting 100 information pairs, the information pair extraction unit 12, in accordance with a user's designation input, determines three of six information pairs among the 100 as information pairs to be finally extracted (positive examples) and the remaining three as information pairs not to be finally extracted (negative examples). The machine learning described above is then performed using these six information pairs as teacher data, and the learning result is used to classify the remaining 94 information pairs as positive or negative examples. The three information pairs designated as positive examples by the user and the information pairs classified as positive examples are finally determined as the information pairs to be extracted.
(Display unit 13)
The display unit 13 organizes the information pairs extracted by the information pair extraction unit 12 and displays them, for example, as a graph.

According to an embodiment of the present invention, for example, from the plural kinds of information pairs that the information pair extraction unit 12 has extracted based on the plural main expressions extracted by the main expression extraction unit 11, the display unit 13 may select the main information pairs based on a predetermined evaluation value for each main expression (for example, select the information pair with the largest evaluation value) and then graph the selected information pairs.

As the method of calculating the evaluation value, for example, one of the following four calculation formulas is used. Here, the case where the main expressions extracted by the main expression extraction unit 11 are one item expression, two types of named entity, and one unit expression is described as an example.

(Method 1): The frequency of numerical expressions and the scores of the main expressions are used.
Evaluation value M = Freq × S1 × S2 × S2' × S3

(Method 2): The frequency of numerical expressions and the scores of the main expressions are used.
Evaluation value M = Freq × (S1 × S2 × S2' × S3)

(Method 3): The frequency of numerical expressions is used.
Evaluation value M = Freq

(Method 4): The scores of the main expressions are used.
Evaluation value M = S1 × S2 × S2' × S3

Here, Freq is the number of numerical expressions extracted by the information pair extraction unit 12 based on the main expressions; S1 is the score of the item expression as given by formulas (1) to (3) described above; S2 and S2' are the scores of the two types of named entity, each as given by formulas (1) to (3) described above; and S3 is the score of the unit expression as given by formulas (1) to (3) described above.
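The four evaluation formulas can be written as a single helper. This is a sketch that transcribes the formulas as they are printed here; the function name, argument names, and example scores are hypothetical. Note that, as printed, Methods 1 and 2 evaluate to the same value, so the sketch treats them identically.

```python
def evaluation_value(method, freq, s1, s2, s2_prime, s3):
    """Evaluation value M for Methods 1 to 4 as printed in the text.

    freq: number of extracted numerical expressions (Freq)
    s1: item expression score; s2, s2_prime: scores of the two named
    entity types; s3: unit expression score.
    """
    score = s1 * s2 * s2_prime * s3
    if method in (1, 2):
        return freq * score
    if method == 3:
        return freq
    if method == 4:
        return score
    raise ValueError("method must be 1, 2, 3, or 4")


# Hypothetical scores.
for m in (1, 2, 3, 4):
    print(m, evaluation_value(m, freq=5, s1=2.0, s2=3.0, s2_prime=0.5, s3=4.0))
```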

According to an embodiment of the present invention, for example, the main expression extraction unit 11 selects a predetermined number of item expressions, types of named entity, and unit expressions, each in descending order of the score described above. The display unit 13 then selects from them, for example, one item expression, two types of named entity, and one unit expression, calculates the evaluation value M for every such combination, and judges that a larger evaluation value M indicates a more useful graph. Among the information pairs extracted by the information pair extraction unit 12, it displays as a graph, for example, the information pairs extracted based on the one item expression, one type of named entity, and one unit expression that have the largest evaluation value M.
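Enumerating the combinations and keeping the one with the largest M might look like the sketch below. It assumes Method 1 for the evaluation value and assumes that Freq is available as a lookup keyed by combination; the function name, the scores, and the frequencies are all hypothetical.

```python
from itertools import combinations, product


def best_combination(items, ne_types, units, freq):
    """Return (M, item, (ne1, ne2), unit) with the largest M = Freq x S1 x S2 x S2' x S3.

    items, ne_types, units map each expression to its score; freq maps
    (item, (ne1, ne2), unit) to the number of extracted numerical expressions.
    """
    best = None
    for item, ne_pair, unit in product(items, combinations(ne_types, 2), units):
        ne1, ne2 = ne_pair
        m = (freq.get((item, ne_pair, unit), 0)
             * items[item] * ne_types[ne1] * ne_types[ne2] * units[unit])
        if best is None or m > best[0]:
            best = (m, item, ne_pair, unit)
    return best


# Hypothetical scores and frequencies.
items = {"typhoon": 0.9, "price": 0.8}
ne_types = {"LOCATION": 0.7, "ORGANIZATION": 0.6, "PERSON": 0.5}
units = {"kilo": 0.9, "yen": 0.8}
freq = {
    ("typhoon", ("LOCATION", "ORGANIZATION"), "kilo"): 12,
    ("price", ("LOCATION", "ORGANIZATION"), "yen"): 30,
}
print(best_combination(items, ne_types, units, freq))
```

With 2 item expressions, 3 named entity types, and 2 unit expressions, 2 × 3 × 2 = 12 combinations are scored.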

According to an embodiment of the present invention, the information extraction device 1 may further include means (not shown in FIG. 1) for performing correlation analysis on the information pairs extracted by the information pair extraction unit 12. Alternatively, the display unit 13 may perform the correlation analysis and display its result.

FIG. 4 is a diagram showing an example of the information extraction processing flow in the embodiment of the present invention. First, the information extraction device 1 extracts main expressions from the article group in the related article DB 14 (step S1). Next, the information extraction device 1 extracts information pairs using the extracted main expressions (step S2). Then, the information extraction device 1 displays the extracted information pairs (step S3).

FIGS. 5 to 14 are diagrams showing display examples produced by the display unit. FIG. 5 shows a display example of information pairs for the case where the item expression serving as a main expression is "末端価格" (street price), the named entity types are "LOCATION" and "ORGANIZATION", and the unit expressions are "キロ" (kilo) and "円" (yen). The column "文" (sentence) in the table shown in FIG. 5 gives the sentence in which each information pair co-occurs.

FIG. 6 shows a display example of information pairs for the case where the item expression serving as a main expression is "弾道ミサイル" (ballistic missile) and the named entity types are "ARTIFACT" and "LOCATION". The column "文" (sentence) in the table shown in FIG. 6 gives the sentence in which each information pair co-occurs.

FIG. 7 shows a display example of information pairs for the case where the item expression serving as a main expression is "毎日新聞社主催" (sponsored by the Mainichi Newspapers) and the named entity types are "DATE", "LOCATION", "ORGANIZATION", "PERSON", and "TIME". The column "文" (sentence) in the table shown in FIG. 7 gives the sentence in which each information pair co-occurs.

FIG. 8 shows a display example of information pairs for the case where the item expression serving as a main expression is "台風" (typhoon), the named entity type is "LOCATION", and the unit expressions are "号" (No.) and "キロ" (kilo). The column "文" (sentence) in the table shown in FIG. 8 gives the sentence in which each information pair co-occurs.

FIG. 9 shows a display example of information pairs for the case where the item expression serving as a main expression is "中前打" (single to center field), the named entity types are "ORGANIZATION" and "PERSON", and the unit expressions are "回" (inning) and "点" (run). The column "文" (sentence) in the table shown in FIG. 9 gives the sentence in which each information pair co-occurs.

FIG. 10 shows a display example of information pairs for the case where the item expression serving as a main expression is "無職" (unemployed), the named entity types are "PERSON" and "TIME", and the unit expressions are "階" (floor) and "階建て" (stories). The column "文" (sentence) in the table shown in FIG. 10 gives the sentence in which each information pair co-occurs.

FIG. 11 shows a display example of information pairs for the case where the item expression serving as a main expression is "男子" (men's), the named entity types are "LOCATION", "ORGANIZATION", and "PERSON", and the unit expressions are "位" (place) and "メートル" (meter). The column "文" (sentence) in the table shown in FIG. 11 gives the sentence in which each information pair co-occurs.

FIG. 12 shows a display example of information pairs for the case where the item expression serving as a main expression is "収賄罪" (bribery), the named entity types are "DATE", "LOCATION", "MONEY", and "PERSON", and the unit expressions are "人" (person) and "円" (yen). The column "文" (sentence) in the table shown in FIG. 12 gives the sentence in which each information pair co-occurs.

FIG. 13 shows a graph display example of an information pair for the case where the item expression serving as a main expression is "台風" (typhoon), the named entity type is "LOCATION", and the unit expressions are "号" (No.) and "キロ" (kilo). FIG. 13 graphs the information pair in the first row of the display example shown in FIG. 8 (No. 4, 210 kilometers, Minamidaito Island). The display example shown in FIG. 13 shows that Typhoon No. 4 is at a point 210 kilometers from Minamidaito Island.

FIG. 14 shows a graph display example of an information pair for the case where the item expression serving as a main expression is "末端価格" (street price), the named entity types are "LOCATION" and "ORGANIZATION", and the unit expressions are "キロ" (kilo) and "円" (yen). The display example shown in FIG. 14 shows that the Fushiki Customs Branch of Osaka Customs seized 9.3 kilograms of stimulant drugs with a street price of 740,000 yen from a Russian-registered ship.

A modification of the present invention will now be described. In this modification, the information pair extraction unit 12 clusters the articles constituting the article group from which the information pairs were extracted, assigning each article to a cluster. The display unit 13 then performs correlation analysis, cluster by cluster, on the information pairs extracted from the articles belonging to each cluster and, based on the result of the correlation analysis, graphs and displays the information pairs of each cluster. Correlation analysis means, for example, analyzing the correlation between two sets of data. For example, when the information pairs are plotted on a graph with an x-axis and a y-axis, the analysis examines whether the data on the x-axis (for example, numerical expressions) correlate with the data on the y-axis (for example, numerical expressions). As another example, with the same kind of plot, the analysis examines whether the data on the x-axis (for example, named entities belonging to the named entity type DATE) correlate with the data on the y-axis (for example, numerical expressions).

If the plotted points of the graph lie close to a straight line, the data can be said to be correlated. When performing correlation analysis on the plotted points, the display unit 13 may, for example, calculate a correlation coefficient indicating how close to linear the relationship between the two sets of data is.
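One common measure of how close two sets of data are to a straight-line relationship is the Pearson correlation coefficient. The text does not name a specific coefficient, so its use here is an assumption, and the function name and sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of paired values xs and ys.
    Returns 0.0 when either series is constant (coefficient undefined)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical x-axis and y-axis numerical expressions from information pairs.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson([1, 2, 3, 4], [5, 1, 4, 2]))
```

Values near 1 or -1 indicate points close to a straight line; values near 0 indicate little linear relationship.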

For the per-cluster graph data generated from the information pairs extracted from the articles belonging to each cluster, the display unit 13 may analyze whether the numerical expressions on the x-axis correlate with the numerical expressions on the y-axis, and graph and display only the graph data found to be correlated, or only the graph data whose correlation coefficient is equal to or greater than a predetermined value.

Likewise, for the per-cluster graph data generated from the information pairs extracted from the articles belonging to each cluster, the display unit 13 may analyze whether the named entities on the x-axis correlate with the numerical expressions on the y-axis, and graph and display only the graph data found to be correlated, or only the graph data whose correlation coefficient is equal to or greater than a predetermined value.

The display unit 13 may also sort the graph data found to be correlated, or the graph data whose correlation coefficient is equal to or greater than a predetermined value, in descending order of the evaluation value M described above of the corresponding information pairs, and display each set of graph data as a graph.
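The filtering and ordering of graph data described above can be sketched as follows, assuming each candidate graph already carries its correlation coefficient and evaluation value M. The dictionary keys, the threshold, and the sample entries are hypothetical. The comparison follows the wording "equal to or greater than a predetermined value"; comparing the absolute value instead, so that strong negative correlations are also kept, would be a variation.

```python
def select_graphs(graphs, threshold):
    """Keep graphs whose correlation coefficient r is at least `threshold`,
    sorted by evaluation value M in descending order."""
    kept = [g for g in graphs if g["r"] >= threshold]
    return sorted(kept, key=lambda g: g["M"], reverse=True)


# Hypothetical per-cluster graph data.
graphs = [
    {"cluster": "A", "r": 0.92, "M": 3.1},
    {"cluster": "B", "r": 0.35, "M": 9.4},
    {"cluster": "C", "r": 0.81, "M": 5.6},
]
for g in select_graphs(graphs, threshold=0.8):
    print(g["cluster"], g["r"], g["M"])
```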

According to an embodiment of the present invention, the display unit 13 may also perform correlation analysis on the information pairs extracted by the information pair extraction unit 12 and graph and display the information pairs based on the result. For example, the display unit 13 may perform the correlation analysis described above on as many kinds of information pairs as there are combinations of five kinds of numerical expression and five kinds of item expression (25 combinations), and graph and display only the information pairs found to be correlated, or only those whose correlation coefficient is equal to or greater than a predetermined value. Similarly, the display unit 13 may perform the correlation analysis on as many kinds of information pairs as there are combinations of five kinds of named entity and five kinds of item expression (25 combinations), and graph and display only the information pairs found to be correlated, or only those whose correlation coefficient is equal to or greater than a predetermined value.

When named entities are displayed on, for example, the x-axis of a graph, the display unit 13 determines the display order of the named entities before displaying them on the graph. There are five ways of determining the display order, (1) to (5) below.
(1) The display order is decided manually in advance and stored in storage means as tabular data. The order is determined by referring to the tabular data stored in the storage means.
(2) Each named entity and the words that co-occur with it (for example, words in the same sentence as the named entity in a large corpus) are extracted, the co-occurrence counts are obtained, and the result is summarized in a table such as the following.

                  Word 1   Word 2   Word 3
Named entity 1       2        0        1
Named entity 2       3        2        1
Named entity 3       1        0        0

Numerical analysis such as principal component analysis or dual scaling is performed on this tabular data (see, for example, Reference (12): "Multivariate Analysis Explained with Illustrations" (Nippon Jitsugyo Publishing), and Reference (13): "Practical Workshop: Making Full Use of Excel for Multivariate Analysis" (Shuwa System)).

The named entities are sorted by their values corresponding to the first eigenvalue, and the resulting order is used as the order of the named entities.
(3) Each named entity and the words that co-occur with it (for example, words in the same sentence as the named entity in a large corpus) are extracted, and the co-occurrence counts are obtained. The result is summarized in a table such as the following.

                  Word 1   Word 2   Word 3
Named entity 1       2        0        1
Named entity 2       3        2        1
Named entity 3       1        0        0

For each named entity, a vector is created whose dimensions are the co-occurring words and whose elements are the co-occurrence counts of those words. The inner product of two vectors (for example, v1 and v2), or Cos(v1, v2), is used as the similarity between the corresponding named entities. Named entities are connected one after another in descending order of similarity, with the restriction that each named entity is connected to at most two other named entities. Once all named entities are connected, the chain is stretched out into a straight line, and the sequence of named entities from one end is used as their order.
(4) The named entities are displayed in the order of the Japanese syllabary (gojuon order).
(5) The named entities are arranged in descending order of character string length, and that arrangement is used as their display order.
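Method (3) above can be illustrated with the co-occurrence table from the text. The sketch uses cosine similarity and connects pairs greedily in descending order of similarity, never giving a named entity more than two neighbours. Two details are assumptions: a link is skipped when both named entities are already in the same chain (so the result stays a line rather than a ring), and the walk starts from the chain end that appears first in the input. The function names and the keys "NE1" to "NE3" are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two co-occurrence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def chain_order(vectors):
    """Order named entities by greedily chaining the most similar pairs."""
    names = list(vectors)
    pairs = sorted(combinations(names, 2),
                   key=lambda p: cosine(vectors[p[0]], vectors[p[1]]),
                   reverse=True)
    neighbours = {n: [] for n in names}
    chain_id = {n: i for i, n in enumerate(names)}
    for a, b in pairs:
        # At most two links per named entity, and no link that closes a ring.
        if (len(neighbours[a]) < 2 and len(neighbours[b]) < 2
                and chain_id[a] != chain_id[b]):
            neighbours[a].append(b)
            neighbours[b].append(a)
            old, new = chain_id[b], chain_id[a]
            for n in names:
                if chain_id[n] == old:
                    chain_id[n] = new
    # Walk the chain from one of its ends.
    start = next(n for n in names if len(neighbours[n]) < 2)
    order, prev, cur = [], None, start
    while cur is not None:
        order.append(cur)
        following = [n for n in neighbours[cur] if n != prev]
        prev, cur = cur, (following[0] if following else None)
    return order


# Co-occurrence counts (Word 1, Word 2, Word 3) from the table in the text.
table = {"NE1": (2, 0, 1), "NE2": (3, 2, 1), "NE3": (1, 0, 0)}
print(chain_order(table))
```

Methods (4) and (5) need no such machinery: they are plain sorts by reading and by string length, respectively.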

FIGS. 15 to 24 show evaluation examples of information pairs extracted using the present invention. FIG. 15 is a diagram showing the number of information pairs evaluated. Only the number of information pairs given in the table shown in FIG. 15 is evaluated manually. NEx (x = 1 to 8) denotes x named entities and one item expression. "Number of articles X" denotes an article group containing X articles. That is, the number in the cell where the column for number of articles X and the row for NEx intersect is the number of information pairs, each consisting of x named entities and one item expression, extracted from an article group of X articles.

図１６（Ａ）、（Ｂ）、図１７（Ａ）、（Ｂ）は、図１５に示す表に記述されている数の情報対についての評価結果の一例を示す図である。図１６（Ａ）は、本発明を用いて抽出された情報対が７５％以上正しい（例えば、情報対が記事中に同時に出現する）場合の、当該情報対の数を示し、図１６（Ｂ）は、図１６（Ａ）に示す評価結果に示す情報対の数を前述した図１５に示す対応する情報対の評価数で除算した結果を示す。 FIGS. 16A, 16B, 17A, and 17B are diagrams showing an example of evaluation results for the number of information pairs described in the table shown in FIG. FIG. 16A shows the number of information pairs when information pairs extracted using the present invention are more than 75% correct (for example, information pairs appear simultaneously in an article). ) Shows the result of dividing the number of information pairs shown in the evaluation result shown in FIG. 16A by the evaluation number of the corresponding information pair shown in FIG.

また、図１７（Ａ）は、本発明を用いて抽出された情報対が５０％以上正しい（例えば、情報対が記事中に同時に出現する）場合の、当該情報対の数を示し、図１７（Ｂ）は、図１７（Ａ）に示す評価結果に示す情報対の数を前述した図１５に示す対応する情報対の評価数で除算した結果を示す。なお、図１８は、上記図１６（Ｂ）に示す評価結果を示すグラフであり、図１９は、上記図１７（Ｂ）に示す評価結果を示すグラフである。 FIG. 17A shows the number of information pairs when the information pairs extracted using the present invention are 50% or more correct (for example, information pairs appear simultaneously in an article). (B) shows the result of dividing the number of information pairs shown in the evaluation result shown in FIG. 17A by the evaluation number of the corresponding information pair shown in FIG. 18 is a graph showing the evaluation results shown in FIG. 16B, and FIG. 19 is a graph showing the evaluation results shown in FIG. 17B.

FIG. 20 shows the number of information pairs evaluated; only the number of information pairs listed in the table of FIG. 20 were evaluated manually. Here NEx (x = 1 to 8) denotes x named entities, one item expression, and two numerical expressions (numerical expressions related to unit expressions), and "number of articles X" denotes an article group containing X articles. That is, the number in the cell where the column for number of articles X intersects the row for NEx is the number of information pairs, each consisting of x named entities, one item expression, and two numerical expressions, extracted from article groups containing X articles.

FIGS. 21(A), 21(B), 22(A), and 22(B) show examples of evaluation results for the information pairs counted in the table of FIG. 20. FIG. 21(A) shows the number of information pairs extracted using the present invention that are at least 75% correct (for example, the expressions of the information pair appear together in an article). FIG. 21(B) shows the result of dividing each count in FIG. 21(A) by the corresponding number of evaluated information pairs in FIG. 20.

FIG. 22(A) shows the number of information pairs extracted using the present invention that are at least 50% correct (for example, the expressions of the information pair appear together in an article). FIG. 22(B) shows the result of dividing each count in FIG. 22(A) by the corresponding number of evaluated information pairs in FIG. 20. FIG. 23 is a graph of the evaluation results in FIG. 21(B), and FIG. 24 is a graph of the evaluation results in FIG. 22(B).

The experiments on the present invention are described further. Using the information extraction apparatus of the present invention, an information pair extraction experiment was carried out with item expressions and named entity types as the main expressions. The data were Mainichi Shimbun newspaper articles from the two years 1998 and 1999 (220,078 articles). All extracted information pairs were the subject of the experiment, and a selection of them was evaluated manually. The evaluation results are shown in FIGS. 25 to 28. FIGS. 25 and 26 show the results when only named entities were used as main expressions (FIG. 25 shows the accuracy and FIG. 26 the total number of extractions). FIGS. 27 and 28 show the results when one to six named entities and one unit (numerical) expression were used as main expressions (FIG. 27 shows the accuracy and FIG. 28 the total number of extractions). NEx in FIGS. 25 to 28 denotes the case in which x named entities are used.

For the evaluation, ten data items were drawn at random from each set of data whose number of extracted articles (the number of articles containing a sentence in which the main expressions appear together) was exactly 10, 30, 50, 70, or 90, and each item was checked manually for correctness. Under "evaluation A", a data item is judged correct when 75% of its numerical/named-entity information pairs (one per extracted article) give correct information about a single topic, and the reported value is the proportion of data items judged correct. Under "evaluation B", a data item is judged correct when 50% of its numerical/named-entity information pairs give correct information about a single topic, and the reported value is again the proportion of data items judged correct. When several named entities of the same type appeared in the same sentence, the pair was counted as correct if any one of them could be interpreted as correct. FIG. 27 shows only the results of evaluation B. Under evaluation A, the accuracy over all data was 0.084 with named entity information only and 0.018 with numerical/named-entity information. FIGS. 26 and 27 show the total number of extracted data.

The number of data items obtained under evaluation A was 24 in total, combining the case using named entities only and the case using numerical expressions and named entities. The accuracy of evaluation B over all data was 0.28 with named entities only and 0.26 with numerical/named-entity information. The total amount of useful data that could be extracted was also estimated by multiplying the total number of extractions by the accuracy. For example, the product of the total number of NE2 extractions for article counts of 21 to 40 and the accuracy at an article count of 30 was taken as the amount of useful data extractable for NE2 with article counts of 1 to 40. By this estimate, the amount of extractable data satisfying evaluation A is about 20,000 items in total, combining the case using named entities only and the case using numerical expressions and named entities.
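The estimate above is a sum of products, one per article-count range. A minimal sketch follows; the function name `estimate_useful` and all figures are hypothetical and are not taken from FIGS. 25 to 28.

```python
def estimate_useful(total_extracted, accuracy):
    """Sum of (number of extracted items x sampled accuracy) over bins."""
    return sum(total_extracted[b] * accuracy[b] for b in total_extracted)

# Hypothetical figures for illustration only; not taken from FIGS. 25 to 28.
total_extracted = {"21-40": 1000, "41-60": 400}
accuracy = {"21-40": 0.25, "41-60": 0.5}
print(estimate_useful(total_extracted, accuracy))
```

Each range is represented by the accuracy measured at a single sampled article count, so the result is only a rough estimate.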

Data (information pairs) extracted by the information extraction apparatus 1 of the present invention are shown in FIGS. 29 and 30. FIG. 29 shows information pairs obtained when named entities and item expressions are used as the main expressions. FIG. 29(A) shows the information pairs obtained with the item expression "slider" and the named entity types person name and organization name as the main expression set. FIG. 29(A) reveals which players were throwing sliders at the time and which teams they belonged to. FIG. 29(B) shows the information pairs obtained with the item expression "ballistic missile" and the named entity types artifact name and location name as the main expressions. FIG. 29(B) reveals the names of missiles associated with ballistic missiles at the time and the countries that possessed them. A wide variety of other data was also obtained, such as the dates, venues, organizers, and player names of events sponsored by the Mainichi Newspapers (for example, go and shogi tournaments), and the organizations, dates, places, persons, amounts of money, and related laws involved in house searches.

FIG. 30 is a display example of information pairs obtained when named entities and item expressions are used as the main expressions. Here the main expressions are the item expression "bribery", the unit expressions "persons" and "yen", and the named entity types person name and location name. The horizontal axis of FIG. 30 is the number of persons who committed bribery, and the vertical axis is the amount of money involved. Each plotted point is labeled with a person name and the related place. The person names were obtained by the system but are shown anonymized here. Various other graphs were also obtained, for example showing on which floor of a building of how many stories a fire broke out together with the residents' names and the time, or the ranking in a sports event together with the distance in meters, the athlete, the organization, and the venue.

Another modification of the present invention is described next. In this example, the display unit 13 extracts, from the article group in the related article DB 14, sentences containing an information pair extracted by the information pair extraction unit 12, and highlights the information pair within each extracted sentence.

For example, suppose the information pair (a set of a numerical expression, a named entity, and an item expression) extracted by the information pair extraction unit 12 is "No. ○" (○号), "day ○" (○日), and "typhoon" (台風). The display unit 13 extracts the sentences in which these three expressions appear together and highlights the three expressions in each extracted sentence. When an expression occurs more than once in the same sentence, for example, the first occurrence is highlighted with a double underline and the others with a single underline. The result is shown in FIG. 31. The three expressions may also be displayed in different colors.
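The highlighting behavior can be sketched as below. This is an illustration under stated assumptions: the function name `highlight`, the use of regular expressions to stand for the three expressions, the English sample sentence, and the text markers (`==` standing in for a double underline, `__` for a single underline) are all invented for this example.

```python
import re

def highlight(sentence, patterns):
    """Mark every match of every pattern; return None unless all patterns occur.

    The first match of each pattern is wrapped in '==' (double underline),
    later matches in '__' (single underline).
    """
    spans = []
    for pattern in patterns:
        matches = list(re.finditer(pattern, sentence))
        if not matches:
            return None  # the expressions do not all appear in this sentence
        for i, m in enumerate(matches):
            spans.append((m.start(), m.end(), i == 0))
    spans.sort()
    out, pos = [], 0
    for start, end, first in spans:
        if start < pos:
            continue  # skip a match that overlaps one already marked
        mark = "==" if first else "__"
        out.append(sentence[pos:start])
        out.append(mark + sentence[start:end] + mark)
        pos = end
    out.append(sentence[pos:])
    return "".join(out)

# Hypothetical English stand-ins for the expressions "No. ○", "day ○", "typhoon".
patterns = [r"No\. \d+", r"\d+(?:st|nd|rd|th)", r"Typhoon"]
sentence = "Typhoon No. 7 made landfall on the 22nd; Typhoon No. 8 followed."
print(highlight(sentence, patterns))
```

Returning `None` for sentences that lack one of the expressions doubles as the sentence filter, so the same function can be mapped over every sentence of an article.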

The extracted sentences succinctly describe the typhoon at the time, and it appears that sentences with an effect comparable to important-sentence extraction in summarization research are being extracted. That is, the extracted sentences name the places the typhoon passed through and in some cases describe the damage, so they contain the important statements about that typhoon.

The seventh data item in the figure contains data for both Typhoon No. 7 and Typhoon No. 8. If main expressions currently of interest, other than the extracted information itself, are also highlighted with a single underline, it is immediately apparent that the item contains more than one piece of data. The system may also choose the wrong set of expressions for an information pair, and this highlighting helps find such errors quickly. Here only the extracted sentences are highlighted, but sentences that should have been extracted may remain in the article. Applying the same highlighting to the whole article may reveal such omissions, so a configuration that highlights the entire original article may also be adopted.

Next, consider the case in which, as described above, the main expression extraction unit 11 in an embodiment of the present invention extracts the words that immediately precede or follow an extracted named entity type, or a numerical expression related to an extracted unit expression, and takes as a main expression the named entity type accompanied by a word selected from those extracted words, or the unit expression whose numerical expression is accompanied by such a selected word. The following describes how that word is selected.

FIGS. 32 and 33 are tables listing, in order of frequency, the words that precede or follow the extracted named entity types and the words that precede or follow the numerical expressions related to the extracted unit expressions. The examples in FIGS. 32 and 33 list words of up to three characters before or after the named entity type or numerical expression, sorted by frequency of occurrence.

For example, in FIG. 32, the word that follows the numerical expressions "9.3 kilo | 9.3 kilo | 120 kilo | ..." related to the unit expression "kilo" (キロ), as the three characters after them, is "（末端", and the word that precedes the named entities "玉丸｜シテロワテ・・・" belonging to the named entity type "ARTIFACT", as the one character before them, is "船" (ship). The main expression extraction unit 11, for example, creates and displays a table such as that of FIG. 32 and selects the word "ship" according to the input of the user who views it.

Similarly, in FIG. 33, the word that precedes the numerical expressions "No. 4 | No. 5 | No. 6 | ..." related to the unit expression "kilo", as the two characters before them, is "台風" (typhoon), and the word that precedes the numerical expressions "15 kilo | 25 kilo | 75 kilo" related to the unit expression "kilo", as the two characters before them, is "時速" (speed per hour). The main expression extraction unit 11, for example, creates and displays a table such as that of FIG. 33 and selects the word "speed per hour" according to the input of the user who views it.
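A frequency table of this kind can be sketched as follows. This is only an illustration: the function name `context_counts` and the sample sentences are invented, and raw character substrings are used in place of real word segmentation, which the source does not specify.

```python
import re
from collections import Counter

def context_counts(sentences, pattern, max_len=3):
    """Count the strings of 1..max_len characters directly before and after
    each match of `pattern`, most frequent first."""
    counts = Counter()
    for s in sentences:
        for m in re.finditer(pattern, s):
            for k in range(1, max_len + 1):
                if m.start() - k >= 0:
                    counts[("before", s[m.start() - k:m.start()])] += 1
                if m.end() + k <= len(s):
                    counts[("after", s[m.end():m.end() + k])] += 1
    return counts.most_common()

# Hypothetical sentences; the pattern stands for a numerical expression
# related to the unit expression "号" (No.).
table = context_counts(["台風4号が上陸", "台風5号が接近"], r"\d+号")
for (side, text), n in table:
    print(side, text, n)
```

A user (or a scoring step) would then pick a context string such as "台風" from the top of the table to narrow down the main expression.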

Next, an experiment is described in which the information extraction apparatus 1 performed correlation analysis on its own information pair extraction results. A correlation coefficient is generally used as an index of the correlation between data. Given data x_i, y_i (i = 1, 2, 3, ..., n), the correlation coefficient r between x and y is

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.$$

r always takes a value between -1 and 1. When r is close to 1 (or -1) there is a strong correlation, and when it is close to 0 there is no correlation. A positive correlation coefficient indicates a positive correlation: y increases as x increases. A negative correlation coefficient indicates a negative correlation: y decreases as x increases. Whether a correlation is present is determined using a statistical test such as a t-test.
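As a minimal sketch of this analysis, assuming the standard Pearson product-moment coefficient and the usual t statistic for testing zero correlation (t = r·sqrt((n − 2)/(1 − r²)) with n − 2 degrees of freedom), the computation might look like this. The function names and sample data are illustrative; the critical value to compare t against would come from a t table and is not computed here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for H0: no correlation, with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 data points")
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Hypothetical paired numerical expressions from five information pairs.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
r = pearson_r(xs, ys)
print(r, t_statistic(r, len(xs)))
```

Data sets whose |t| exceeds the chosen critical value would be the ones flagged as correlated.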

For example, the information extraction apparatus 1 extracts information pairs for a number of sets (sets from various fields such as marathons, typhoons, and bribery) from large-scale data such as two years of newspaper articles, and performs correlation analysis on the extracted information pairs.

Correlation analysis results such as those shown in FIG. 34 are obtained. The correlation coefficients in FIG. 34 are the correlation coefficients between the numerical expressions related to the unit expressions. Data judged by the test to be correlated are flagged with "1".

For example, FIG. 35 shows the information pair extraction results that form the source data (the data subjected to correlation analysis) for the sixth data item in FIG. 34, whose item expression is "間" (between), whose unit expressions are "区" (section) and "キロ" (kilo), and whose named entity type is "ORGANIZATION".

FIG. 36 is a graph of the correlation analysis results for the numerical expressions related to the unit expression "区" and those related to the unit expression "キロ" shown in FIG. 35. The correlation coefficient between the two is positive (about 0.783). The graph of FIG. 36 also shows, for example, how the "キロ" value (for example, the running distance in an ekiden road relay) grows as the "区" value increases.

The correlation analysis performed by the information extraction apparatus 1 thus makes it easy to extract series of data that are correlated.

According to an embodiment of the present invention, the information extraction apparatus 1 may extract several kinds of main expression sets from the data of a single field (for example, typhoons) and, based on correlation analysis, decide from those sets which main expressions are finally to be extracted.

Next, another experiment using the information extraction apparatus 1 of the present invention is described. FIG. 37 shows the score for each group of named entities that are preceded or followed by a pattern candidate (a word or character string). Here, a pattern candidate is a word or character string that appears before or after a named entity type in the article group.

Specifically, FIG. 37 is based on the information pairs extracted by the information extraction apparatus 1 with the item expression "末端価格" (street price), the unit expressions "キロ" (kilo) and "円" (yen), and the named entity types "LOCATION" and "ORGANIZATION". For the named entity type "LOCATION" (location name), it shows the score for each group of named entities preceded or followed by a pattern candidate. The data in FIG. 37 are listed in descending order of score.

In FIG. 37, the "expression" column gives the named entity together with the pattern candidate that precedes or follows it, the "score" column gives the score for that expression, and the "example" column gives concrete examples of the named entities.

For example, the first data item in FIG. 37 shows that the score for named entities followed by the word "人" (person) as the one character after them is 798, and that concrete examples of those named entities are Bhutan | Bhutan | Colombia | Colombia | China | ....

The score is calculated based on the formula given earlier:

$$\mathrm{score} = \sum \bigl(\text{similarity of the two named entities}\bigr) \times f(\text{first named entity}, \text{second named entity})$$

The "similarity of the two named entities" is the similarity of each pair of named entities (a first named entity and a second named entity) taken from the named entities listed in the "example" column of FIG. 37. For example, the similarity is set to 1 when both named entities are country names or both are not country names, and to -1 when one is a country name and the other is not.

According to an embodiment of the present invention, the similarity between named entities may be converted with a predetermined conversion formula and the score calculated from the converted similarity. For example, when the similarity between two named entities is the angle (or cosine) between the vectors assigned to them by a predetermined vector generation method, the similarity lies between 0 and 1, so the value obtained by, for example, doubling the cosine and subtracting 1 may be used as the similarity, which then lies between -1 and 1. Any other similarity conversion method may also be used in the present invention.

f(first named entity, second named entity) is a function that is 1 when the pattern candidate appears with (accompanies) both the first and second named entities, or with neither of them, and -1 when the pattern candidate appears with only one of them.
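The score can be sketched as follows. The sketch assumes the sum runs over all unordered pairs of distinct named entities, which the source does not state explicitly; the names `pattern_score`, `country_similarity`, and `rescale_cosine`, and the sample entities, are invented for illustration.

```python
from itertools import combinations

def f(has_a, has_b):
    """1 if the pattern accompanies both entities or neither, else -1."""
    return 1 if has_a == has_b else -1

def rescale_cosine(c):
    """Map a cosine similarity in [0, 1] onto [-1, 1] (2 * cos - 1)."""
    return 2.0 * c - 1.0

def pattern_score(entities, similarity, has_pattern):
    """Sum of similarity(a, b) * f(a, b) over unordered entity pairs (assumed)."""
    return sum(
        similarity(a, b) * f(has_pattern[a], has_pattern[b])
        for a, b in combinations(entities, 2)
    )

# Hypothetical data: similarity is 1 when both or neither entity is a country.
countries = {"Bhutan", "Colombia"}
def country_similarity(a, b):
    return 1 if (a in countries) == (b in countries) else -1

entities = ["Bhutan", "Colombia", "Narita Airport", "Fushiki Port"]
good = {"Bhutan": True, "Colombia": True, "Narita Airport": False, "Fushiki Port": False}
poor = {"Bhutan": True, "Colombia": False, "Narita Airport": True, "Fushiki Port": False}
print(pattern_score(entities, country_similarity, good))
print(pattern_score(entities, country_similarity, poor))
```

A pattern that splits the entities along the same lines as the similarity measure makes every term positive, which is why such patterns rise to the top of the score ranking.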

In FIG. 37, of the two rows that correspond to the same "expression", the upper row contains the data for named entities with which the pattern candidate appeared, and the lower row contains the data for named entities with which the pattern candidate did not appear.

For example, the named entities "Bhutan | Bhutan | Colombia | Colombia | China | ..." listed in the "example" column of the first data item in FIG. 37 are concrete examples of named entities followed by the pattern candidate "人" (person) as the one character after them, and the named entities "Russia | Fushiki Port | Russia | Narita Airport | ..." listed in the "example" column of the second data item are concrete examples of named entities not followed by the pattern candidate "人".

The data in FIG. 37 show that patterns useful for dividing (grouping) the named entities extracted as information pairs, such as "人" (person) and "国籍" (nationality), can be obtained automatically.

The information extracted from the article group of the related article database (DB) 14 by the present invention as described above unavoidably contains noise information.

The aspect of the present invention that removes this noise information is described next.

FIG. 38 shows an example of a flowchart executed by the information extraction apparatus 1 of the present invention to remove noise information.

The information extraction processing executed by the information extraction apparatus 1 of the present invention is described below with reference to this flowchart.

When the information extraction apparatus 1 of the present invention extracts useful information from the article group of the related article database (DB) 14 and displays it as a graph while removing noise information, it first, in step S101 of the flowchart of FIG. 38, extracts main item expressions (principal item expressions selected, for example, because of a high score) from the article group according to the processing described above.

Next, in step S102, it extracts from the article group, according to the processing described above, the main unit expressions (principal unit expressions selected, for example, because of a high frequency of occurrence) and the numerical values related to them (for example, numerical values located immediately before or after a main unit expression), thereby obtaining the main unit expressions and main numerical expressions.

Next, in step S103, it extracts from the article group, according to the processing described above, the main named entities (principal named entities selected, for example, because of a high score) and identifies the types to which those main named entities belong (types such as "PERSON" for persons and "LOCATION" for places).

Next, in step S104, it extracts information pairs, according to the processing described above, by identifying the places in the article group where an extracted main item expression, an extracted main unit expression, and a main named entity belonging to an identified type (an extracted main named entity) appear together (for example, appear together with no full stop between them).

At this point, only the principal information pairs may be extracted by extracting information pairs that appear in many articles and not extracting those that appear in few articles.
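Step S104 and the article-count filter can be sketched as follows. This is a simplified illustration: the function name `extract_pairs`, the sample articles, and the choice of plain substring and regular-expression matching (instead of the expression extraction of steps S101 to S103) are assumptions. Sentences are split on the Japanese full stop "。" only.

```python
import re
from collections import defaultdict

def extract_pairs(articles, item, number_pattern, entities, min_articles=2):
    """Return {(item, number, entity): set of article ids} for combinations that
    co-occur within one sentence in at least `min_articles` articles."""
    support = defaultdict(set)
    for article_id, text in articles.items():
        for sentence in text.split("。"):
            if item not in sentence:
                continue
            numbers = re.findall(number_pattern, sentence)
            found = [e for e in entities if e in sentence]
            for number in numbers:
                for entity in found:
                    support[(item, number, entity)].add(article_id)
    return {k: v for k, v in support.items() if len(v) >= min_articles}

# Hypothetical articles, item expression, unit pattern, and named entities.
articles = {
    1: "台風7号が沖縄に上陸した。東京は晴れ。",
    2: "沖縄で台風7号の被害。",
    3: "台風8号が東京に接近。",
}
pairs = extract_pairs(articles, "台風", r"\d+号", ["沖縄", "東京"])
print(pairs)
```

Raising `min_articles` trades recall for precision: combinations supported by a single article are the ones most likely to be accidental co-occurrences.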

Next, in step S105, for each group of information pairs partitioned by main item expression, the mean and standard deviation of the main numerical expressions are calculated, and on that basis the information pairs that should not be treated as belonging to the same group as the others are identified and deleted.

For example, as shown in FIG. 39, for each group of information pairs partitioned by main item expression, the mean m and standard deviation σ of the main numerical expressions of the information pairs are calculated, and the information pairs whose main numerical expression falls outside the range m ± 3σ, for example, are identified and deleted.

Instead of unconditionally deleting every information pair whose main numerical expression falls outside the range m ± 3σ, a list of those information pairs may be presented to the user, and only the ones for which the user issues a deletion instruction may be deleted.

Note that, when the data are $x_1, x_2, \ldots, x_n$, their arithmetic mean, geometric mean, and harmonic mean are calculated according to the formulas

$$\text{arithmetic mean} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

$$\text{geometric mean} = (x_1 \times x_2 \times \cdots \times x_n)^{1/n}$$

$$\text{harmonic mean} = \frac{n}{(1/x_1) + (1/x_2) + \cdots + (1/x_n)}$$

The variance $V$ of the data is calculated, using the value $m$ of this arithmetic mean, according to the formula

$$V = \frac{(x_1 - m)^2 + (x_2 - m)^2 + \cdots + (x_n - m)^2}{n - 1} = \frac{S}{n - 1}$$

where $S$ is the sum of squared deviations in the numerator. The standard deviation $\sigma$ of the data is calculated, using this variance $V$, according to the formula

$$\sigma = \left(\frac{S}{n - 1}\right)^{1/2}$$
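The formulas above can be written out directly. The following is a minimal Python sketch, not the apparatus's actual implementation; the function name `summarize` and the dictionary keys are chosen here for illustration only. It assumes all data values are positive, since the geometric and harmonic means are otherwise not meaningful.

```python
import math


def summarize(xs):
    """Compute the means, variance and standard deviation of xs.

    Follows the formulas in the text: the variance divides the sum of
    squared deviations S by (n - 1).
    """
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two data points for the variance")
    m = sum(xs) / n                          # arithmetic mean
    geometric = math.prod(xs) ** (1.0 / n)   # geometric mean
    harmonic = n / sum(1.0 / x for x in xs)  # harmonic mean
    s = sum((x - m) ** 2 for x in xs)        # sum of squared deviations S
    variance = s / (n - 1)                   # V = S / (n - 1)
    sigma = math.sqrt(variance)              # sigma = (S / (n - 1)) ** (1/2)
    return {
        "arithmetic_mean": m,
        "geometric_mean": geometric,
        "harmonic_mean": harmonic,
        "variance": variance,
        "std_dev": sigma,
    }
```

Note that `math.prod` requires Python 3.8 or later.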

As described above, in step S105, the mean m and standard deviation σ of the main numerical expressions calculated in this way are used to identify, for example, information pairs whose main numerical expression value falls outside the range m ± 3σ, and those information pairs are deleted.
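As an illustration of the deletion step in S105, the sketch below groups information pairs by main item expression, computes m and σ for each group, and separates the pairs that fall outside m ± kσ. It is a hedged sketch under stated assumptions: the representation of an information pair as an `(item_expression, value)` tuple and the name `split_outliers` are hypothetical and do not come from the patent.

```python
import math
from collections import defaultdict


def split_outliers(pairs, k=3.0):
    """Split (item_expression, value) pairs into (kept, flagged).

    A pair is flagged when its value lies outside m +/- k * sigma of its
    own item-expression group. Groups with a single value are kept as-is
    because the sample standard deviation is undefined for them.
    """
    groups = defaultdict(list)
    for item, value in pairs:
        groups[item].append(value)

    bounds = {}
    for item, values in groups.items():
        n = len(values)
        if n < 2:
            continue
        m = sum(values) / n
        sigma = math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
        bounds[item] = (m - k * sigma, m + k * sigma)

    kept, flagged = [], []
    for item, value in pairs:
        low, high = bounds.get(item, (-math.inf, math.inf))
        (kept if low <= value <= high else flagged).append((item, value))
    return kept, flagged
```

The `flagged` list can either be deleted outright or shown to the user for confirmation, matching the two variants described in the text. With the (n − 1) divisor, a value can only exceed 3σ when its group has at least eleven members, so small groups are never pruned by this rule.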

Subsequently, in step S106, for each group of information pairs partitioned by main item expression, the attribute group to which each main named entity belongs is identified. Based on this, information pairs that should not be treated as belonging to the same group as the other information pairs are identified and deleted.

For example, following the clustering technique described earlier, and as shown in FIG. 40, the main named entities held by the information pairs in each group partitioned by main item expression are classified into a plurality of clusters. Information pairs belonging to, for example, a cluster whose number of elements is below a prescribed threshold, or a cluster whose share of the elements is below a prescribed threshold, are then identified and deleted.

The threshold used here may be fixed in advance, or it may be set in a form that allows the user to change its value as needed. Furthermore, instead of identifying every information pair belonging to a cluster whose number of elements, or share of elements, is below the prescribed threshold, a predetermined number of clusters may be selected, in ascending order of value, from among those at or below the threshold.

At this point, instead of unconditionally deleting every information pair belonging to, for example, a cluster with few elements, the apparatus may present a list of those information pairs to the user, receive deletion instructions for them, and delete only the pairs for which a deletion instruction was given.
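The cluster-size rule of step S106 can be sketched as follows. This is only an illustrative sketch: it assumes the clustering has already been performed and that each information pair in one main-item-expression group arrives as a hypothetical `(cluster_label, payload)` tuple. The function and parameter names are not from the patent.

```python
from collections import Counter


def split_by_cluster_size(pairs, min_count=2, min_ratio=0.0):
    """Split (cluster_label, payload) pairs into (kept, flagged).

    A cluster is treated as noise when it has fewer than min_count
    elements, or when its share of all elements is below min_ratio.
    """
    counts = Counter(label for label, _ in pairs)
    total = len(pairs)
    small = {
        label
        for label, count in counts.items()
        if count < min_count or count / total < min_ratio
    }
    kept = [p for p in pairs if p[0] not in small]
    flagged = [p for p in pairs if p[0] in small]
    return kept, flagged
```

Both thresholds are exposed as parameters so that they can be fixed in advance or adjusted by the user, as the text allows.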

Subsequently, in step S107, the information extracted from the article group is displayed as a graph, following the graphing method described earlier, based on the main item expressions, the main unit expressions and main numerical expressions, and the main named entities and their types that make up the information pairs remaining after deletion.

In this way, by executing the flowchart of FIG. 38, the information extraction apparatus 1 of the present invention extracts useful information from the article group in the related article database (DB) 14 and displays it as a graph, with named entities as the extraction target, while removing noise information.

In the flowchart of FIG. 38, useful information is extracted and graphed by extracting the main named entities (proper nouns) from the article group. Instead, useful information can also be extracted and graphed by extracting main general nouns from the article group, including common nouns (proper nouns may also be included). Here, general nouns are extracted by morphological analysis.

To extract general nouns from the article group and display them as a graph, the extracted general nouns must be divided into groups, so the type to which each extracted general noun belongs must be identified.

The type of a general noun can be identified by using, for example, the classification vocabulary table (Bunrui Goi Hyo; National Institute for Japanese Language, 1964).

The classification vocabulary table is a table in which words are organized bottom-up according to their meaning, and in which a number called a classification number is assigned to each word, as shown below.
あ, あ,4.310,1,10,*,
あ, 亜,1.104,2,40,,
あ, 亜,3.100,10,40,,
ああ, ああ,3.100,3,40,*,
ああ, ああ,4.310,1,20,*,
ああくとう, アーク燈,1.460,2,70,,
ああす, アース,1.462,6,10,,
ああち, アーチ,1.442,2,20,,
ああむほおる, アームホール,1.184,5,30,,
あある, アール,1.1961,4,10,,
あい, 愛,1.3020,9,10,*,
あい, 相,3.112,1,10,*,
あい, 藍,1.502,6,40,,
あいいく, 愛育,1.3642,1,40,,
あいいん, 愛飲,1.3332,3,60,,
あいいん, 合印,1.3114,1,30,Y,
あいうち, あい打ち,1.357,4,30,,
あいかぎ, 合鍵,1.454,8,50,,
あいかわらず, 相変らず,3.165,2,10,*,
あいかん, 哀歓,1.3011,4,60,,
あいがん, 哀願,1.366,1,100,,
あいがん, 愛翫,1.3852,2,10,,
あいぎ, 合着,1.421,4,40,,
あいきょう, 愛郷,1.3020,11,170,,
あいきょう, 愛嬌,1.3030,4,40,,
Here, the fields separated by "," are, in order: the reading of the word, the headword, the classification number of the word, sub-number 1 of the classification number, sub-number 2 of the classification number, and information indicating whether the word has a sample usage frequency of 7 or more.
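A record in the format described above can be split into its six fields as in the following sketch. The names `Entry` and `parse_entry` are hypothetical. The sketch keeps the frequency field as the raw marker string, because the text states only what the field indicates and not how each marker value should be read.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    reading: str          # reading of the word
    headword: str         # headword
    class_number: str     # classification number, kept as text
    sub_number_1: int     # sub-number 1 of the classification number
    sub_number_2: int     # sub-number 2 of the classification number
    frequency_mark: str   # raw marker for "sample usage frequency >= 7"


def parse_entry(line):
    """Parse one comma-separated record of the classification vocabulary table."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(fields)}")
    reading, headword, class_number, sub1, sub2, mark = fields[:6]
    return Entry(reading, headword, class_number, int(sub1), int(sub2), mark)
```

The classification number is kept as a string rather than a float so that trailing zeros such as those in "1.3020" are preserved.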

In the electronic edition of the classification vocabulary table, each word is given a 10-digit classification number, as shown in FIG. 41 (in the printed edition the classification number has at most 5 digits, whereas the electronic edition has 10). This 10-digit classification number represents a 7-level hierarchy: the top 5 levels are expressed by the first 5 digits of the classification number, the 6th level by the next 2 digits, and the lowest level by the last 3 digits.

In reference (14) below, the present inventors modified the classification numbers of this classification vocabulary table to match the semantic features of nouns. FIG. 42 shows the conversion table between the semantic features of nouns and the classification numbers of the classification vocabulary table. The numbers in FIG. 42 convert the leading digits of a classification number. For example, "[1-3]56" and "511" in the first row mean that if the first three digits of a classification number are "156", "256", or "356", they are converted to "511" ([1-3] denotes 1, 2, or 3).
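The prefix conversion described for FIG. 42 behaves like an anchored pattern substitution. The sketch below encodes only the single rule quoted in the text ("[1-3]56" to "511"); the remaining rows of FIG. 42 are not reproduced here and would have to be added to `RULES`. It assumes, as an illustration, that a classification number is held as a plain digit string. The names `RULES` and `convert` are hypothetical.

```python
import re

# Each rule is (pattern on the leading digits, replacement for those digits).
# Only the first row of FIG. 42 is quoted in the text, so only it appears here.
RULES = [
    (re.compile(r"^[1-3]56"), "511"),
]


def convert(class_number, rules=RULES):
    """Rewrite the leading digits of class_number using the first matching rule."""
    for pattern, replacement in rules:
        if pattern.match(class_number):
            return pattern.sub(replacement, class_number, count=1)
    return class_number  # no rule applies: leave the number unchanged
```

For a made-up 10-digit input such as "2560010101", the first three digits match "[1-3]56", so the result starts with "511" and the remaining seven digits are untouched.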

Reference (14): Masaki Murata, Kyoko Kanzaki, Kiyotaka Uchimoto, Qing Ma, Hitoshi Isahara, "Meaning Sort (msort): examples of dictionary construction, tagged corpus creation, and an information presentation system using a semantic sorting method", Journal of the Association for Natural Language Processing, Vol. 7, No. 1 (2000), pp. 51-66.

Through this conversion of classification numbers, the classification numbers shown in FIG. 41 are converted as shown in FIG. 43.

As can be seen from FIG. 42, in the converted classification numbers, the upper two digits indicate the category to which a word relates, as follows ("aa", "ab", and "d0" are in hexadecimal notation):
"51": animals
"52": humans
"53": organizations and institutions
"61": products and tools
"62": parts of animals
"63": plants
"64": natural objects
"65": space and direction
"71": quantity
"81": time
"91": phenomenon nouns
"aa": abstract relations
"ab": human activity
"d0": other
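Because the category is carried by the upper two digits of the converted classification number, identifying the type of a word reduces to a table lookup. The following sketch uses the fourteen categories listed above; the names `CATEGORY_BY_PREFIX` and `category_of` are hypothetical, and the converted number is assumed to be held as a string.

```python
# Upper two digits of the converted classification number -> category.
CATEGORY_BY_PREFIX = {
    "51": "animal",
    "52": "human",
    "53": "organization/institution",
    "61": "product/tool",
    "62": "animal part",
    "63": "plant",
    "64": "natural object",
    "65": "space/direction",
    "71": "quantity",
    "81": "time",
    "91": "phenomenon noun",
    "aa": "abstract relation",
    "ab": "human activity",
    "d0": "other",
}


def category_of(converted_number):
    """Return the category for a converted classification number, or None."""
    return CATEGORY_BY_PREFIX.get(converted_number[:2].lower())
```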

By looking up a classification vocabulary table to which such classification numbers have been assigned, the type of each general noun extracted from the article group can be identified. By then executing processing similar to the flowchart of FIG. 38, main general nouns can be extracted from the article group, and useful information can thereby be extracted and displayed as a graph.

In an experiment in which one classification type from the classification vocabulary table, two unit expressions, and an item expression were used together as the set of main expressions, the present inventor obtained data on train delays and cancellations. FIG. 44 shows the experimental data. These data were obtained by using the "activity" classification of FIG. 42 as the classification of the classification vocabulary table, "本" (hon, the counter used for trains) and "人" (nin, the counter for people) as the unit expressions, and "影響" (influence) as the item expression. The data show what cause affected the operation of how many trains and disrupted the travel of how many people. The causes of train delays and cancellations are expressed by general nouns rather than by named entities, and these experimental data confirmed that the present invention can also be applied to general nouns.

Next, the information extraction process that the information extraction apparatus 1 of the present invention executes when extracting main general nouns from the article group is described with reference to the flowchart of FIG. 45.

In the flowchart of FIG. 45, the type of a general noun is identified using the classification vocabulary table. Accordingly, the main expression extraction unit 11 shown in FIG. 1 executes its extraction process with general nouns, rather than proper nouns, as the extraction target, and the information pair extraction unit 12 shown in FIG. 1 identifies the types of the extracted general nouns by looking up the classification vocabulary table, which is not shown in FIG. 1.

When the information extraction apparatus 1 of the present invention extracts useful information from the article group in the related article database (DB) 14 and displays it as a graph, with general nouns as the extraction target and with noise information removed, it first extracts the main item expressions from the article group in step S201, as shown in the flowchart of FIG. 45.

Subsequently, in step S202, the main unit expressions and main numerical expressions are extracted by extracting, from the article group, the main unit expressions and the numerical values associated with them.

Subsequently, in step S203, main general nouns (general nouns selected as important, for example because of their high frequency of occurrence) are extracted from the article group. The classification vocabulary table described above, which manages the correspondence between general nouns and classification numbers, is then looked up to obtain the classification numbers assigned to those main general nouns, and the types of the main general nouns are identified from the values of the prescribed number of upper digits of those classification numbers.

Subsequently, in step S204, information pairs are extracted by identifying the locations in the article group at which an extracted main item expression, an extracted main unit expression, and a main general noun belonging to an identified main general noun type (that is, an extracted main general noun) appear together.
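The co-occurrence test of step S204 can be pictured with the sketch below. It is a simplified illustration under several assumptions that are not taken from this passage: co-occurrence is judged within a single sentence delimited by "。", the numerical value is the number written immediately before the unit expression, and substring matching stands in for morphological analysis. The name `extract_pairs` is hypothetical.

```python
import re


def extract_pairs(text, item, unit, nouns):
    """Return (item, value, unit, noun) tuples for sentences where all co-occur."""
    number_with_unit = re.compile(r"(\d+(?:\.\d+)?)\s*" + re.escape(unit))
    pairs = []
    for sentence in text.split("。"):
        if item not in sentence:
            continue  # the main item expression must appear
        number = number_with_unit.search(sentence)
        found = [noun for noun in nouns if noun in sentence]
        if number and found:
            # All three kinds of expression appear in the same sentence.
            pairs.append((item, float(number.group(1)), unit, found[0]))
    return pairs
```

A sentence that mentions the item expression but lacks either a number with the unit or one of the nouns yields no information pair.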

Subsequently, in step S205, for each group of information pairs partitioned by main item expression, the mean and standard deviation of the main numerical expressions are calculated. Based on these values, information pairs that should not be treated as belonging to the same group as the other information pairs are identified and deleted.

Subsequently, in step S206, for each group of information pairs partitioned by main item expression, the attribute group to which each main general noun belongs is identified from the classification number obtained from the classification vocabulary table. Based on this, information pairs that should not be treated as belonging to the same group as the other information pairs are identified and deleted.

That is, for each group of information pairs partitioned by main item expression, the main general nouns held by the information pairs are classified into a plurality of clusters according to the values of the upper digits of the classification numbers obtained from the classification vocabulary table. Then, for example, as shown in FIG. 40, information pairs belonging to a cluster whose number of elements is below a prescribed threshold are identified and deleted.
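Clustering by the upper digits of the classification number amounts to grouping on a string prefix. The following is a minimal sketch; the tuple layout `(noun, classification_number)`, the default of two upper digits, and the function names are assumptions made for illustration, and the classification numbers in any example call are made up rather than taken from the table.

```python
from collections import defaultdict


def cluster_by_prefix(nouns_with_numbers, digits=2):
    """Group nouns by the first `digits` characters of their classification number."""
    clusters = defaultdict(list)
    for noun, number in nouns_with_numbers:
        clusters[number[:digits]].append(noun)
    return dict(clusters)


def small_clusters(clusters, min_count):
    """Return the clusters that have fewer than min_count members."""
    return {
        prefix: members
        for prefix, members in clusters.items()
        if len(members) < min_count
    }
```

Information pairs whose main general noun falls in a cluster returned by `small_clusters` would be the deletion candidates of step S206.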

Subsequently, in step S207, the information extracted from the article group is displayed as a graph, following the graphing method described earlier, based on the main item expressions, the main unit expressions and main numerical expressions, and the main general nouns and their types that make up the information pairs remaining after deletion.

In this way, by executing the flowchart of FIG. 45, the information extraction apparatus 1 of the present invention extracts useful information from the article group in the related article database (DB) 14 and displays it as a graph, with general nouns as the extraction target, while removing noise information.

The flowchart of FIG. 38 extracts information pairs with proper nouns as the extraction target, whereas the flowchart of FIG. 45 extracts information pairs with general nouns as the extraction target. By combining the named entity extraction and named entity type identification techniques executed in the flowchart of FIG. 38 with the general noun extraction and general noun type identification techniques executed in the flowchart of FIG. 45, information pairs can be extracted with both named entities and general nouns as the extraction target.

Accordingly, the present invention can meet a request such as extracting information on executive appointments at companies from an article group. Extracting such information requires general nouns such as job titles to be extracted in addition to proper nouns such as personal names and company names. Because the present invention can extract information pairs with both named entities and general nouns as the extraction target, it can meet this request.

That is, according to the present invention, information such as that shown in FIG. 46(a) can be extracted from the article group, so that a graph of that information, such as that shown in FIG. 46(b), can be displayed.

The flowchart of FIG. 45 was described as using the classification numbers defined by the classification vocabulary table both to identify the types of the main general nouns and to cluster them. However, the types of the main general nouns can also be identified, and the main general nouns clustered, by using a hierarchical thesaurus that has no classification numbers of the kind assigned in the classification vocabulary table. The 10-digit classification number of the classification vocabulary table represents a 7-level hierarchy; likewise, by treating the definition of the concept at each node of the hierarchical thesaurus as if it were the number for that level, the types of the main general nouns can be identified and the main general nouns clustered without newly assigning numbers like classification numbers.

In the flowchart of FIG. 38, when named entities are the extraction target, the main item expressions, main unit expressions, and main named entity types are extracted from the article group. This information can instead be specified by the user.

Next, the information extraction process that the information extraction apparatus 1 of the present invention executes when the user specifies this information is described with reference to the flowchart of FIG. 47.

When executing the information extraction process according to the flowchart of FIG. 47, the information extraction apparatus 1 of the present invention first, in step S301, receives a main item expression from the user by interacting with the user through an input screen.

Subsequently, in step S302, it receives a main unit expression from the user through the input screen, and in the following step S303 it extracts the main numerical expressions by extracting, from the article group, the numerical values associated with the entered main unit expression.

Subsequently, in step S304, it receives a main named entity type from the user through the input screen, and in the following step S305 it extracts, from the article group, the main named entities belonging to the entered main named entity type.

Subsequently, in step S306, information pairs are extracted by identifying the locations in the article group at which the entered main item expression, the entered main unit expression, and a main named entity belonging to the entered main named entity type (that is, an extracted main named entity) appear together.

Subsequently, in step S307, processing similar to that of step S105 described above is executed: for each group of information pairs partitioned by main item expression, the mean and standard deviation of the main numerical expressions are calculated, and based on these values, information pairs that should not be treated as belonging to the same group as the other information pairs are identified and deleted.

Subsequently, in step S308, processing similar to that of step S106 described above is executed: for each group of information pairs partitioned by main item expression, the attribute group to which each main named entity belongs is identified, and based on this, information pairs that should not be treated as belonging to the same group as the other information pairs are identified and deleted.

Subsequently, in step S309, the information extracted from the article group is displayed as a graph, following the graphing method described earlier, based on the main item expressions, the main unit expressions and main numerical expressions, and the main named entities and their types that make up the information pairs remaining after deletion.

In this way, when the user enters the main item expression, main unit expression, and main named entity type, the information extraction apparatus 1 of the present invention extracts useful information from the article group in the related article database (DB) 14 in accordance with that input and displays it as a graph, while removing noise information.

The flowchart of FIG. 47 described the processing for the user-input configuration using the case where named entities are the extraction target as a concrete example. This is only an example, and the same processing applies as-is to other cases, such as when general nouns are the extraction target.

The information extraction apparatus 1 of the present invention may also extract main time expressions from the article group and, based on them, extract trend information that changes over time and display it as a graph.

When executing such graph display processing of trend information, the information extraction apparatus 1 of the present invention follows the flowchart of FIG. 48 to extract useful trend information from the article group in the related article database (DB) 14 and display it as a graph, while removing noise information.

That is, when executing the information extraction process according to the flowchart of FIG. 48, the information extraction apparatus 1 of the present invention first extracts the main item expressions from the article group in step S401. Subsequently, in step S402, the main unit expressions and main numerical expressions are extracted by extracting, from the article group, the main unit expressions and the numerical values associated with them.

Subsequently, in step S403, main time expressions (time expressions having a main time unit, identified for example by a high frequency of occurrence) are extracted from the article group. Subsequently, in step S404, trend information pairs are extracted by identifying the locations in the article group at which an extracted main item expression, an extracted main unit expression, and an extracted main time expression appear together.

Subsequently, in step S405, for each group of trend information pairs partitioned by main item expression, the mean and standard deviation of the main numerical expressions are calculated. Based on these values, trend information pairs that should not be treated as belonging to the same group as the other trend information pairs are identified and deleted.

Subsequently, in step S406, for each group of trend information pairs partitioned by main item expression, trend information pairs that should not be treated as belonging to the same group as the other trend information pairs (for example, trend information pairs whose main time expressions are far removed in time) are identified based on the times of the main time expressions, and those trend information pairs are deleted.
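The text does not specify how "far removed in time" is measured in step S406. The sketch below shows one possible reading, stated as an assumption: a trend information pair is flagged when its date lies more than `max_gap_days` from the median date of its group. The `(date, payload)` tuple layout, the use of the median, and the function name are all hypothetical.

```python
import statistics
from datetime import date


def split_by_time(pairs, max_gap_days):
    """Split (date, payload) pairs into (kept, flagged) by distance from the median date."""
    if not pairs:
        return [], []
    # Work on day ordinals so that the median of the dates can be computed.
    center = statistics.median(d.toordinal() for d, _ in pairs)
    kept, flagged = [], []
    for pair in pairs:
        distance = abs(pair[0].toordinal() - center)
        (kept if distance <= max_gap_days else flagged).append(pair)
    return kept, flagged
```

The median is used here instead of the mean so that a single distant date does not shift the reference point; other criteria, such as the m ± kσ rule applied to the ordinals, would fit the text equally well.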

Subsequently, in step S407, the trend information extracted from the article group is displayed as a graph, following the graphing method described earlier, based on the main item expressions, the main unit expressions and main numerical expressions, and the main time expressions that make up the trend information pairs remaining after deletion.

In this way, by executing the flowchart of FIG. 48, the information extraction apparatus 1 of the present invention extracts useful trend information from the article group in the related article database (DB) 14 and displays it as a graph, while removing noise information.

FIG. 49 shows experimental data for the graph display of trend information. The graph of FIG. 49(A) shows typhoon data, obtained with "号" (the counter for typhoon numbers) as the main unit expression, "日" (day) as the main time expression, and "台風" (typhoon) as the main item expression. The graph of FIG. 49(B) shows Major League Baseball data, obtained with "号" (here the counter for home runs) as the main unit expression, "日" (day) as the main time expression, and "マグワイア" (McGwire) as the main item expression. The graph of FIG. 49(C) shows political trend data, obtained with "％" as the main unit expression, "月" (month) as the main time expression, and "内閣支持率" (cabinet approval rating) as the main item expression.

The flowchart of FIG. 48 described the extraction of trend information using the case where neither named entities nor general nouns are the extraction target as a concrete example. This is only an example, and the same processing applies as-is to other cases, such as when named entities or general nouns are the extraction target.

次に、図３８のフローチャートのステップＳ１０５や図４５のフローチャートのステップＳ２０５で実行することになる、主要数値表現の標準偏差を用いた情報対の削除処理に関して行った実験データについて説明する。 Next, description will be given of experimental data performed in relation to the information pair deletion processing using the standard deviation of the main numerical expressions, which is executed in step S105 of the flowchart of FIG. 38 and step S205 of the flowchart of FIG.

この実験は、１９９８年と１９９９年の２年分の毎日新聞の記事群（２２０，０７８記事）を抽出対象として、主要項目表現を「自動車」、主要単位表現を「キロ」および「周」として、その抽出対象の記事群から、「自動車」と「キロ」・「周」を持つ主要数値表現との同時出現箇所の対データで定義される情報対を抽出することで行った。この実験により抽出される情報対は、自動車レースで、平均時速が何キロで、何周のレースであったのかを示すデータとなる。 In this experiment, articles from the Mainichi Newspapers (220,078 articles) for two years in 1998 and 1999 were extracted, the main item expression was “car”, the main unit expression was “kilo” and “lap”. This is done by extracting the information pairs defined by the pair data of the locations of simultaneous appearances of the main numerical expression with “car” and “kilo” and “lap” from the article group to be extracted. The information pairs extracted by this experiment are data indicating how many kilometers and how many laps the average speed was in a car race.

In this experiment, whether a main numerical expression is an outlier was judged by whether it falls within the range "mean ± 2.6σ".

FIG. 50 shows the experimental data. The data in the third row, "3.238 km, 1 lap", lies outside the "mean ± 2.6σ" range, so it is an outlier and the information pair for this data is subject to deletion.

Here there are two kinds of main numerical expression, "キロ" (km) and "周" (laps). The data may be removed as an outlier when either one of them lies outside the "mean ± 2.6σ" range, or only when both of them lie outside it.
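The outlier test described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiment: the function name `find_outlier_rows`, the `require_all` switch, and the sample rows are hypothetical, and the population standard deviation is assumed because the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def find_outlier_rows(rows, k=2.6, require_all=False):
    """Return the indices of rows judged to be outliers.

    rows: list of equal-length tuples, one column per main unit expression
          (for example (km, laps)).
    k: width of the accepted range in standard deviations (2.6 in the text).
    require_all: if False, a row is an outlier when ANY column is outside
                 mean +/- k*sigma; if True, only when ALL columns are outside.
    """
    columns = list(zip(*rows))
    # Assumption: population standard deviation (pstdev).
    stats = [(mean(col), pstdev(col)) for col in columns]
    outliers = []
    for index, row in enumerate(rows):
        outside = [abs(value - m) > k * s for value, (m, s) in zip(row, stats)]
        flagged = all(outside) if require_all else any(outside)
        if flagged:
            outliers.append(index)
    return outliers


# Hypothetical data: 15 plausible (km, laps) rows plus one row whose lap
# count is far from the rest. These are NOT the values of FIG. 50.
rows = [(4.0 + 0.1 * (i % 3), 50 + (i % 4)) for i in range(15)] + [(4.1, 1)]
print(find_outlier_rows(rows))                    # "either column" rule
print(find_outlier_rows(rows, require_all=True))  # "both columns" rule
```

The `require_all` flag corresponds to the two alternatives in the text: removing a row when either value is out of range, or only when both are. Note that with few rows a single extreme value cannot exceed 2.6σ, so the test is only meaningful once a group contains enough information pairs.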

FIG. 51 shows data from another experiment. With "末端価格" (street price) as the main item expression and "キロ" (kg) and "円" (yen) as the main unit expressions, information pairs were extracted from the articles. Each information pair is defined by the paired data found at a location where "末端価格" appears together with main numerical expressions carrying "キロ" and "円". The information pairs extracted in this experiment indicate the weight of stimulant drugs and their price.

In this experimental data, the data in the ninth row, "7000 kg, 60000 yen", lies outside the "mean ± 2.6σ" range, so it is an outlier and the information pair for this data is subject to deletion.

As the weight of the stimulant increases, the price rises roughly in proportion. Plotting the experimental data gives FIG. 52, in which only the ninth-row data is plotted far away at the lower right, showing that it is an extraction error. Once this erroneous data was removed, the remaining data were confirmed to lie neatly along a straight line.

Next, experimental data are described for the information pair deletion process that uses the classification vocabulary table, which is executed in step S206 of the flowchart of FIG. 45.

This experiment used two years of Mainichi Shimbun newspaper articles, from 1998 and 1999 (220,078 articles), as the extraction target. The main item expression was "風" (wind), the main unit expressions were "度" (degrees) and "%", and the type of main general noun was the "space/direction" category of the classification vocabulary table (the entries whose classification number, after conversion as shown in FIG. 42, has 65 as its first two digits). Information pairs were extracted from the articles, each defined by the paired data found at a location where "風" appears together with main numerical expressions carrying "度" and "%" and with a general noun belonging to "space/direction" in the classification vocabulary table. The information pairs extracted in this experiment indicate the weather conditions (temperature, humidity, and wind direction) of competitions such as marathons.

FIG. 53 shows the experimental data. As the data show, the following words were extracted as words belonging to "space/direction" in the classification vocabulary table: "north-northeast (10)", "east-southeast (4)", "northeast (3)", "start (3)", "northwest (3)", "a.m. (3)", "south-southwest (2)", "west-northwest (2)", "south (2)", "south-southeast (2)", "southwest (1)", "east (1)", "start / a.m. (1)", "southeast (1)", "north (1)", and "north-northwest (1)". The number in parentheses is the appearance frequency.

The classification numbers of these words are: "start (スタート): 6571005022", "east (東): 6573116013", "south (南): 6573116032", "north (北): 6573116033", "northeast (北東): 6573118012", "north-northeast (北北東): 6573118012", "east-southeast (東南東): 6573118015", "southeast (南東): 6573118015", "south-southeast (南南東): 6573118015", "southwest (南西): 6573118021", "south-southwest (南南西): 6573118021", "west-northwest (西北西): 6573118032", "northwest (北西): 6573118032", "north-northwest (北北西): 6573118032", "start / a.m. (スタート・午前): 6576001081", and "a.m. (午前): 6576001081".

A word that is not listed in the classification vocabulary table is assigned the classification number of the longest run of consecutive nouns that includes the word's final morpheme. For example, the table lists 北東 (northeast) but not 北北東 (north-northeast). In that case 北北東 is morphologically analyzed, which splits it into, for example, "北: noun" and "北東: noun". Since 北東 is found in the table, 北北東 is assigned the classification number of 北東.

Depending on the morphological analysis result, the output may instead be the single morpheme "北北東: noun". Since that is not in the classification vocabulary table, character string processing is performed: the table is searched for the longest character string that includes the final character, and the classification number of that string is assigned. For example, 北東 is listed, so it is found, and 北北東 is assigned the classification number of 北東.

Because the noun-sequence lookup is more reliable, the character-string method is used only when the noun-sequence method fails to find a word. Both methods assign a classification number to an unlisted word by prediction, and that prediction may be wrong. It is therefore convenient to provide these methods as options so that the user can choose whether or not to use them.
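The two fallback lookups can be sketched as follows. This is only an illustration: the function name `lookup_classification`, its parameters, and the three-entry table are hypothetical, and the morpheme list is assumed to come from an external morphological analyzer and to contain only the consecutive nouns that make up the word.

```python
def lookup_classification(word, morphemes, table,
                          use_noun_sequence=True, use_string_suffix=True):
    """Return a classification number for `word`, or None if none is found.

    word: surface string, e.g. "北北東".
    morphemes: the word split into nouns by a morphological analyzer
               (assumed input), e.g. ["北", "北東"] or ["北北東"].
    table: dict mapping listed words to classification numbers.
    The two flags model the on/off options suggested in the text.
    """
    if word in table:
        return table[word]
    if use_noun_sequence:
        # Longest run of trailing nouns first; it always includes the last noun.
        for start in range(1, len(morphemes)):
            candidate = "".join(morphemes[start:])
            if candidate in table:
                return table[candidate]
    if use_string_suffix:
        # Fallback: longest character suffix; it always includes the last character.
        for start in range(1, len(word)):
            if word[start:] in table:
                return table[word[start:]]
    return None


# Small excerpt of the classification numbers quoted in the text.
table = {"北東": "6573118012", "北": "6573116033", "東": "6573116013"}
print(lookup_classification("北北東", ["北", "北東"], table))  # noun-sequence method
print(lookup_classification("北北東", ["北北東"], table))      # string-suffix method
```

Trying the noun-sequence method first and the string method second mirrors the priority stated in the text; disabling both flags returns `None` for any unlisted word.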

Dividing the words into groups whose classification numbers, obtained as described above, match in the first five digits gives:
(1) First group
6571005022: start (スタート)
(2) Second group
6573116013: east (東), 6573116032: south (南),
6573116033: north (北), 6573118012: northeast (北東),
6573118012: north-northeast (北北東), 6573118015: east-southeast (東南東),
6573118015: southeast (南東), 6573118015: south-southeast (南南東),
6573118021: southwest (南西), 6573118021: south-southwest (南南西),
6573118032: west-northwest (西北西), 6573118032: northwest (北西),
6573118032: north-northwest (北北西)
(3) Third group
6576001081: start / a.m. (スタート・午前), 6576001081: a.m. (午前)
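The grouping by the first five digits can be reproduced with a short Python sketch. The classification numbers are those quoted in the text; the names `CLASSIFICATION` and `group_by_prefix` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict

# Word -> converted classification number, as quoted in the text.
CLASSIFICATION = {
    "スタート": "6571005022",
    "東": "6573116013",
    "南": "6573116032",
    "北": "6573116033",
    "北東": "6573118012",
    "北北東": "6573118012",
    "東南東": "6573118015",
    "南東": "6573118015",
    "南南東": "6573118015",
    "南西": "6573118021",
    "南南西": "6573118021",
    "西北西": "6573118032",
    "北西": "6573118032",
    "北北西": "6573118032",
    "スタート・午前": "6576001081",
    "午前": "6576001081",
}


def group_by_prefix(numbers, digits=5):
    """Group words whose classification numbers share the first `digits` digits."""
    groups = defaultdict(list)
    for word, number in numbers.items():
        groups[number[:digits]].append(word)
    return dict(groups)


for prefix, words in sorted(group_by_prefix(CLASSIFICATION).items()):
    print(prefix, words)
```

Changing `digits` gives coarser or finer groups; the text uses five digits for this experiment.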

Here, the first group, 65710 (category name: {点・目・許}), contains 3 word occurrences in total, and its appearance ratio within category 65 is 3/41 = 0.07.

The second group, 65731 (category name: {方面・方角}), contains 34 word occurrences in total, and its appearance ratio within category 65 is 34/41 = 0.83.

The third group, 65760 (category name: {前後・間・端}), contains 4 word occurrences in total, and its appearance ratio within category 65 is 4/41 = 0.10.

If only the groups whose appearance ratio within the two-digit category is 50% or more are used, the words belonging to the first and third groups are judged to be outlier words and removed, and only the words belonging to the second group are used. As a result, only the words
"north-northeast (10)", "east-southeast (4)", "northeast (3)", "northwest (3)",
"south-southwest (2)", "south-southeast (2)", "south (2)", "west-northwest (2)",
"southeast (1)", "north-northwest (1)", "southwest (1)", "east (1)", "north (1)"
are used, so that only the expressions representing wind direction are correctly extracted.

Accordingly, among the data shown in FIG. 53, the information pairs in which none of the words "north-northeast, east-southeast, northeast, northwest, south-southwest, south-southeast, south, west-northwest, southeast, north-northwest, southwest, east, north" appears are removed. This makes it possible to correctly extract only the information pairs that also state the wind direction, and to correctly extract the wind direction information itself.
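The ratio-based selection and the resulting removal of information pairs can be sketched as follows. The group totals (3, 34, and 4 out of 41) are the figures stated in the text; the function names, the dictionary layout of an information pair, and the two sample pairs are hypothetical.

```python
def select_groups_by_ratio(group_counts, min_ratio=0.5):
    """Return the groups whose share of all word occurrences is at least min_ratio."""
    total = sum(group_counts.values())
    return {group for group, count in group_counts.items()
            if count / total >= min_ratio}


def filter_pairs(pairs, allowed_words):
    """Keep only the information pairs that contain at least one allowed word."""
    return [pair for pair in pairs
            if any(noun in allowed_words for noun in pair["nouns"])]


# Group totals as stated in the text for category 65.
group_counts = {"65710": 3, "65731": 34, "65760": 4}
kept_groups = select_groups_by_ratio(group_counts)  # 50% threshold from the text

# Words of the kept (second) group, as listed in the text.
direction_words = {"北北東", "東南東", "北東", "北西", "南南西", "南南東", "南",
                   "西北西", "南東", "北北西", "南西", "東", "北"}

# Hypothetical information pairs, for illustration only.
pairs = [
    {"item": "風", "values": ["15度", "60%"], "nouns": ["北北東"]},
    {"item": "風", "values": ["12度", "55%"], "nouns": ["スタート"]},
]
print(kept_groups)
print(filter_pairs(pairs, direction_words))
```

In this sketch a pair survives if any one of its nouns is an allowed word, which matches the rule in the text that pairs containing none of the listed words are removed.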

In the experiment shown in FIG. 53, only "度" (degrees) and "%" were used as the main unit expressions. If "メートル" (meters, for wind speed) is also used as a main unit expression, the wind speed can be extracted as well.

The above experiment used only the words belonging to groups whose appearance ratio is at or above a predetermined threshold. Alternatively, only the words belonging to groups whose word count is at or above a predetermined threshold may be used. It is also possible to fix the number of groups to be used, select that number of groups in descending order of appearance ratio or word count, and use only the words belonging to the selected groups.
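The two alternative selection rules can be sketched in the same style. The function names and the threshold values used in the example calls are hypothetical; the group totals are those stated in the text.

```python
def select_groups_by_count(group_counts, min_count):
    """Return the groups whose word count is at least min_count."""
    return {group for group, count in group_counts.items() if count >= min_count}


def select_top_groups(group_counts, n):
    """Return the n groups with the largest word counts."""
    ranked = sorted(group_counts, key=group_counts.get, reverse=True)
    return set(ranked[:n])


# Group totals as stated in the text for category 65.
group_counts = {"65710": 3, "65731": 34, "65760": 4}
print(select_groups_by_count(group_counts, min_count=10))
print(select_top_groups(group_counts, n=1))
```

Ranking by count and ranking by appearance ratio give the same order within one category, since the ratio is the count divided by a common total.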

In the above experiment, the words belonging to the first and third groups were simply not used. Alternatively, the groups may be displayed to the user as follows:
(1) First group: start
(2) Second group: east, south, north, northeast, north-northeast, east-southeast, southeast, south-southeast, southwest, south-southwest, west-northwest, northwest, north-northwest
(3) Third group: start / a.m., a.m.
and the user may be asked to specify whether to use all of the groups and, if not, which groups to use. In that case, the first and third groups, whose appearance ratio or the like is at or below a predetermined threshold, may be displayed in a form different from that of the second group (for example, in a different color).

The present invention described above can also be implemented as a program that is read and executed by a computer. The program that realizes the present invention can be stored on a suitable computer-readable recording medium such as a portable medium memory, a semiconductor memory, or a hard disk, and is provided either recorded on such a medium or by transmission and reception over a network via a communication interface.

The present invention is applicable to extracting useful information from a text document group stored in a database. When automatic extraction of such useful information is implemented using the present invention, the useful information can be extracted without being affected by noise information.

FIG. 1 is a diagram showing an example of the system configuration of the present invention.
FIG. 2 is a diagram showing a configuration example of the information pair extraction unit.
FIG. 3 is a diagram showing the concept of margin maximization in the support vector machine method.
FIG. 4 is a diagram showing an example of the information extraction processing flow.
FIGS. 5 to 14 are diagrams showing display examples produced by the display unit.
FIG. 15 is a diagram showing the number of information pairs evaluated.
FIGS. 16 and 17 are diagrams showing examples of evaluation results for information pairs.
FIGS. 18 and 19 are graphs showing examples of evaluation results for information pairs.
FIG. 20 is a diagram showing the number of information pairs evaluated.
FIGS. 21 and 22 are diagrams showing examples of evaluation results for information pairs.
FIGS. 23 and 24 are graphs showing examples of evaluation results for information pairs.
FIGS. 25 to 28 are diagrams showing examples of evaluation results for information pairs.
FIGS. 29 and 30 are diagrams showing information pairs.
FIG. 31 is a diagram showing an example of highlighted display of information pairs.
FIGS. 32 and 33 are diagrams showing tables in which the words accompanying, before or after, the extracted types of named entity, and the words accompanying, before or after, the numerical expressions related to the extracted unit expressions, are arranged in order of appearance frequency.
FIG. 34 is a diagram showing correlation analysis results.
FIG. 35 is a diagram showing extraction results for information pairs.
FIG. 36 is a graph showing correlation analysis results.
FIG. 37 is a diagram showing the score of each named entity that is accompanied, before or after, by a pattern candidate.
FIG. 38 is an example of a flowchart executed by the present invention.
FIGS. 39 and 40 are explanatory diagrams of the information pair deletion process.
FIG. 41 is an explanatory diagram of the classification numbers of the classification vocabulary table.
FIG. 42 is an explanatory diagram of the conversion process for the classification numbers of the classification vocabulary table.
FIG. 43 is an explanatory diagram of the converted classification numbers.
FIG. 44 is an explanatory diagram of experimental data.
FIG. 45 is an example of a flowchart executed by the present invention.
FIG. 46 is an explanatory diagram of the information extraction made possible by the present invention.
FIGS. 47 and 48 are examples of flowcharts executed by the present invention.
FIGS. 49 to 53 are explanatory diagrams of experimental data.

Explanation of symbols

1 Information extraction device
11 Main expression extraction unit
12 Information pair extraction unit
13 Display unit
14 Related article DB
111 Main unit expression extraction unit
112 Main item expression extraction unit
113 Main named entity extraction unit
121 Teacher data storage means
122 Solution-feature pair extraction means
123 Machine learning means
124 Learning result storage means
125 Expression pair extraction means
126 Feature extraction means
127 Solution estimation means
128 Information pair extraction means

Claims

An information extraction device that extracts information from a text document group stored in a storage device, comprising:
main expression setting means for extracting an item expression and a type of extraction target expression from the text document group as main expressions, or for receiving the main expressions as input through interactive processing with a user;
information pair extraction means for identifying locations in the text document group where the main expressions appear together, and extracting, as an information pair, a pair consisting of the item expression described at an identified location and an extraction target expression belonging to the type of extraction target expression;
information pair identification means for identifying, for each group of the extracted information pairs classified by the item expression, an information pair that should not be treated as belonging to the same group as the other information pairs; and
information pair processing means for deleting all of the identified information pairs from the extracted information pairs, for deleting from the extracted information pairs those identified information pairs for which a deletion instruction has been given, or, instead of deleting the identified information pairs, for displaying the extracted information pairs on a display in a form that explicitly indicates the identified information pairs.

An information extraction device that extracts information from a text document group stored in a storage device, comprising:
main expression setting means for extracting an item expression and a unit expression from the text document group as main expressions, or for receiving the main expressions as input through interactive processing with a user;
information pair extraction means for identifying locations in the text document group where the main expressions appear together, and extracting, as an information pair, a pair consisting of the item expression described at an identified location and a numerical expression related to the unit expression;
information pair identification means for identifying, for each group of the extracted information pairs classified by the item expression, an information pair that should not be treated as belonging to the same group as the other information pairs; and
information pair processing means for deleting all of the identified information pairs from the extracted information pairs, for deleting from the extracted information pairs those identified information pairs for which a deletion instruction has been given, or, instead of deleting the identified information pairs, for displaying the extracted information pairs on a display in a form that explicitly indicates the identified information pairs.

An information extraction device that extracts information from a text document group stored in a storage device, comprising:
main expression setting means for extracting an item expression, a type of extraction target expression, and a unit expression from the text document group as main expressions, or for receiving the main expressions as input through interactive processing with a user;
information pair extraction means for identifying locations in the text document group where the main expressions appear together, and extracting, as an information pair, a set consisting of the item expression described at an identified location, an extraction target expression belonging to the type of extraction target expression, and a numerical expression related to the unit expression;
information pair identification means for identifying, for each group of the extracted information pairs classified by the item expression, an information pair that should not be treated as belonging to the same group as the other information pairs; and
information pair processing means for deleting all of the identified information pairs from the extracted information pairs, for deleting from the extracted information pairs those identified information pairs for which a deletion instruction has been given, or, instead of deleting the identified information pairs, for displaying the extracted information pairs on a display in a form that explicitly indicates the identified information pairs.

The information extraction device according to claim 1 or 3,
wherein the information pair identification means identifies an attribute group to which the extraction target expression belongs and, based on the attribute group, identifies an information pair that should not be treated as belonging to the same group as the other information pairs.

The information extraction device according to claim 2 or 3,
wherein the information pair identification means calculates statistical information on the numerical expressions and, based on the statistical information, identifies an information pair that should not be treated as belonging to the same group as the other information pairs.

The information extraction device according to claim 3,
wherein the information pair identification means identifies an attribute group to which the extraction target expression belongs and, based on the attribute group, identifies an information pair that should not be treated as belonging to the same group as the other information pairs, and also calculates statistical information on the numerical expressions and, based on the statistical information, identifies an information pair that should not be treated as belonging to the same group as the other information pairs.

The information extraction device according to any one of claims 1 to 6,
wherein the main expression setting means further extracts a time expression from the text document group, and
the information pair extraction means extracts information pairs in which the time expression also appears at the same location.

The information extraction device according to claim 7,
wherein the information pair identification means identifies, based on the time expression, an information pair that should not be treated as belonging to the same group as the other information pairs.

The information extraction device according to any one of claims 1 to 8,
further comprising graph display means for displaying, as a graph, the information pairs that have not been deleted by the information pair processing means.

An information extraction method executed by an information extraction device that extracts information from a text document group stored in a storage device, comprising:
a main expression setting step of extracting an item expression and a type of extraction target expression from the text document group as main expressions, or of receiving the main expressions as input through interactive processing with a user;
an information pair extraction step of identifying locations in the text document group where the main expressions appear together, and extracting, as an information pair, a pair consisting of the item expression described at an identified location and an extraction target expression belonging to the type of extraction target expression;
an information pair identification step of identifying, for each group of the extracted information pairs classified by the item expression, an information pair that should not be treated as belonging to the same group as the other information pairs; and
an information pair processing step of deleting all of the identified information pairs from the extracted information pairs, of deleting from the extracted information pairs those identified information pairs for which a deletion instruction has been given, or, instead of deleting the identified information pairs, of displaying the extracted information pairs on a display in a form that explicitly indicates the identified information pairs.

An information extraction method executed by an information extraction device that extracts information from a text document group stored in a storage device, comprising:
a main expression setting step of extracting an item expression and a unit expression from the text document group as main expressions, or of receiving the main expressions as input through interactive processing with a user;
an information pair extraction step of identifying locations in the text document group where the main expressions appear together, and extracting, as an information pair, a pair consisting of the item expression described at an identified location and a numerical expression related to the unit expression;
an information pair identification step of identifying, for each group of the extracted information pairs classified by the item expression, an information pair that should not be treated as belonging to the same group as the other information pairs; and
an information pair processing step of deleting all of the identified information pairs from the extracted information pairs, of deleting from the extracted information pairs those identified information pairs for which a deletion instruction has been given, or, instead of deleting the identified information pairs, of displaying the extracted information pairs on a display in a form that explicitly indicates the identified information pairs.

An information extraction method executed by an information extraction device that extracts information from a text document group stored in a storage device, comprising:
a main expression setting step of extracting an item expression, a type of extraction target expression, and a unit expression from the text document group as main expressions, or of receiving the main expressions as input through interactive processing with a user;
an information pair extraction step of identifying locations in the text document group where the main expressions appear together, and extracting, as an information pair, a set consisting of the item expression described at an identified location, an extraction target expression belonging to the type of extraction target expression, and a numerical expression related to the unit expression;
an information pair identification step of identifying, for each group of the extracted information pairs classified by the item expression, an information pair that should not be treated as belonging to the same group as the other information pairs; and
an information pair processing step of deleting all of the identified information pairs from the extracted information pairs, of deleting from the extracted information pairs those identified information pairs for which a deletion instruction has been given, or, instead of deleting the identified information pairs, of displaying the extracted information pairs on a display in a form that explicitly indicates the identified information pairs.

An information extraction program used to realize an information extraction device that extracts information from a text document group stored in a storage device,
the program causing a computer to function as:
main expression setting means for extracting an item expression and a type of extraction target expression from the text document group as main expressions, or for receiving the main expressions as input through interactive processing with a user;
information pair extraction means for identifying locations in the text document group where the main expressions appear together, and extracting, as an information pair, a pair consisting of the item expression described at an identified location and an extraction target expression belonging to the type of extraction target expression;
information pair identification means for identifying, for each group of the extracted information pairs classified by the item expression, an information pair that should not be treated as belonging to the same group as the other information pairs; and
information pair processing means for deleting all of the identified information pairs from the extracted information pairs, for deleting from the extracted information pairs those identified information pairs for which a deletion instruction has been given, or, instead of deleting the identified information pairs, for displaying the extracted information pairs on a display in a form that explicitly indicates the identified information pairs.

An information extraction program used to realize an information extraction device that extracts information from a text document group stored in a storage device, the program causing a computer to function as:
Main expression setting means for extracting an item expression and a unit expression from the text document group as main expressions, or receiving the main expressions as input through interactive processing with a user;
Information pair extraction means for identifying locations in the text document group where the main expressions co-occur, and extracting, as an information pair, a pair consisting of the item expression described at an identified location and a numerical expression related to the unit expression;
Information pair identification means for identifying, for each group of the extracted information pairs classified by item expression, an information pair that should not be treated as belonging to the same group as the other information pairs; and
Information pair processing means for deleting all of the identified information pairs from the extracted information pairs, deleting from the extracted information pairs those identified information pairs for which a deletion instruction has been given, or displaying the extracted information pairs on a display in a form that explicitly indicates the identified information pairs.

An information extraction program used to realize an information extraction device that extracts information from a text document group stored in a storage device, the program causing a computer to function as:
Main expression setting means for extracting an item expression, a type of extraction target expression, and a unit expression from the text document group as main expressions, or receiving the main expressions as input through interactive processing with a user;
Information pair extraction means for identifying locations in the text document group where the main expressions co-occur, and extracting, as an information pair, the combination of the item expression described at an identified location, an extraction target expression belonging to the type of extraction target expression, and a numerical expression related to the unit expression;
Information pair identification means for identifying, for each group of the extracted information pairs classified by item expression, an information pair that should not be treated as belonging to the same group as the other information pairs; and
Information pair processing means for deleting all of the identified information pairs from the extracted information pairs, deleting from the extracted information pairs those identified information pairs for which a deletion instruction has been given, or displaying the extracted information pairs on a display in a form that explicitly indicates the identified information pairs.
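
As a non-limiting illustration, the extraction, identification, and processing means recited above might be sketched in Python as follows, for the case where the main expressions are item expressions and a unit expression. The function names, the use of a sentence as the co-occurrence location, and the median-ratio rule for identifying a pair that does not belong to its group are all assumptions made for this sketch; the claims do not fix any of them.

```python
import re
from collections import defaultdict
from statistics import median


def extract_pairs(documents, items, unit):
    """Extract (item, value) pairs from sentences where an item and the unit co-occur.

    Assumption: a "location" is a single sentence, and the numerical expression
    is the first number written directly before the unit.
    """
    number = re.compile(r"(\d+(?:\.\d+)?)\s*" + re.escape(unit))
    pairs = []
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            match = number.search(sentence)
            if not match:
                continue
            for item in items:
                if item in sentence:
                    pairs.append((item, float(match.group(1))))
    return pairs


def identify_outliers(pairs, ratio=3.0):
    """Flag pairs whose value differs from the group median by more than `ratio` times.

    Assumption: this median-ratio test is a stand-in for whatever statistical
    criterion an actual embodiment uses.
    """
    groups = defaultdict(list)
    for item, value in pairs:
        groups[item].append(value)
    flagged = []
    for item, value in pairs:
        med = median(groups[item])
        if med > 0 and (value > med * ratio or value < med / ratio):
            flagged.append((item, value))
    return flagged


def process_pairs(pairs, flagged, mode="delete_all", selected=()):
    """Apply one of the three alternatives of the information pair processing means."""
    if mode == "delete_all":
        return [p for p in pairs if p not in flagged]
    if mode == "delete_selected":
        # Delete only the flagged pairs for which a deletion instruction was given.
        return [p for p in pairs if not (p in flagged and p in selected)]
    # Otherwise keep every pair and mark the flagged ones explicitly for display.
    return [(item, value, (item, value) in flagged) for item, value in pairs]


if __name__ == "__main__":
    docs = [
        "Tokyo had 30 mm of rain.",
        "Tokyo had 28 mm of rain.",
        "Tokyo had 32 mm of rain.",
        "The Tokyo map is 900 mm wide.",
        "Osaka had 12 mm of rain.",
    ]
    pairs = extract_pairs(docs, ["Tokyo", "Osaka"], "mm")
    flagged = identify_outliers(pairs)
    for row in process_pairs(pairs, flagged, mode="mark"):
        print(row)
```

In this hypothetical data the 900 mm value is noise: it shares the item expression and the unit expression with the rainfall figures but describes something else, so it is flagged within the Tokyo group and can then be deleted or marked for display.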