JP4919386B2

JP4919386B2 - Information extraction / display device

Info

Publication number: JP4919386B2
Application number: JP2006016052A
Authority: JP
Inventors: Masaki Murata (村田真樹)
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2006-01-25
Filing date: 2006-01-25
Publication date: 2012-04-18
Anticipated expiration: 2026-01-25
Also published as: JP2007199902A

Description

The present invention relates to information extraction and display technology, and in particular to an information extraction/display device that extracts trend information from a group of articles and displays it.

As a conventional technique for extracting trend information, Non-Patent Document 1 below, for example, describes a technique that examines the relationship between sentences, determines whether that relationship is a transition or an update, and uses this information to extract trend information.

Here, trend information refers to information that focuses on the quantity of a certain item and summarizes how that quantity changes over time. Examples include the change in a player's number of home runs and the change in the cabinet approval rating. To extract trend information, it is not enough to extract an item and the expression of its quantity; the time expression to which the quantity corresponds must be extracted at the same time.
Non-Patent Document 1: Namba, Kunimasa, Fukushima, Aizawa, Okumura: "Extraction and visualization of trend information considering cross-document inter-sentence relations", IPSJ SIG on Natural Language Processing, 2005-NL-168, pp. 67-74 (2005).

However, the above prior art neither automatically extracts the unit expressions of numerical information nor automatically extracts the items to which the numerical information corresponds. With the prior art it is therefore difficult to automatically extract the main unit expressions, time expressions, and item expressions from a group of articles related to a certain field, use them to extract trend information, and display the extracted trend information as a graph.

An object of the present invention is to solve the above problems of the prior art and to provide an information extraction/display device that automatically extracts trend information from a group of articles related to a certain field and displays it.

To solve the above problems, the present invention is an information extraction/display device that extracts and displays trend information, characterized by comprising main unit expression extraction means for extracting, from a group of articles related to a certain field, the main unit expressions in that group of articles.

The present invention is also an information extraction/display device that extracts and displays trend information, characterized by comprising: main expression extraction means for extracting main expressions from a group of articles related to a certain field; trend information pair extraction means for extracting trend information pairs from the group of articles based on the main expressions extracted by the main expression extraction means; and display means for displaying the trend information pairs extracted by the trend information pair extraction means.

In the above information extraction/display device, the present invention is further characterized in that the main expression extraction means comprises main unit expression extraction means for extracting main unit expressions from the group of articles, main time expression extraction means for extracting main time expressions from the group of articles, and main item expression extraction means for extracting main item expressions from the group of articles.

In the above information extraction/display device, the present invention is further characterized in that the trend information pair extraction means extracts the trend information pairs using a machine learning technique.

The present invention is also an information extraction/display device that extracts and displays trend information, characterized by comprising: trend information pair extraction means for extracting trend information pairs from a group of articles related to a certain field based on input main expressions; and display means for displaying the trend information pairs extracted by the trend information pair extraction means; wherein the trend information pair extraction means extracts the trend information pairs using a machine learning technique.

In the above information extraction/display device, the present invention is further characterized in that the main expression extraction means extracts a plurality of main expressions, and the display means extracts the main trend information pairs from the plural kinds of trend information pairs that the trend information pair extraction means extracted based on the extracted main expressions, and displays the extracted main trend information pairs.

In the above information extraction/display device, the present invention is further characterized in that the trend information pair extraction means extracts trend information pairs from the group of articles based on a main expression selected from among the main expressions extracted by the main expression extraction means.

In the above information extraction/display device, the present invention is further characterized by comprising keyword input means for inputting a keyword, and article group extraction means for extracting a group of articles related to the input keyword from bibliographic data stored in storage means, wherein the main expression extraction means extracts the main expressions from the group of articles extracted by the article group extraction means.

In the above information extraction/display device, the present invention is further characterized in that the display means displays the trend information pairs extracted by the trend information pair extraction means as a graph.

In the above information extraction/display device, the present invention is further characterized in that the display means extracts from the group of articles the sentences containing the trend information pairs extracted by the trend information pair extraction means, and highlights the trend information pairs within the extracted sentences.

The present invention is also an information extraction/display device that extracts and displays trend information, characterized by comprising: trend information pair extraction means for extracting trend information pairs from a group of articles related to a certain field based on input main expressions; and display means for displaying the trend information pairs extracted by the trend information pair extraction means; wherein the display means extracts from the group of articles the sentences containing the trend information pairs extracted by the trend information pair extraction means, and highlights the trend information pairs within the extracted sentences.

The present invention is also an information extraction/display method for extracting and displaying trend information, characterized by comprising a step of extracting, from a group of articles related to a certain field, the main unit expressions in that group of articles.

The present invention is also a program to be executed by a computer provided in an information extraction/display device that extracts and displays trend information, characterized by causing the computer to execute a process of extracting, from a group of articles related to a certain field, the main unit expressions in that group of articles.

According to the information extraction/display device, information extraction/display method, and information extraction/display program of the present invention, once a group of articles related to a certain field is given, trend information can be automatically extracted from that group of articles and displayed.

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 shows an example of the system configuration of the present invention. The information extraction/display device 1 is a processing device that extracts trend information from a group of articles related to a certain field and displays it. In the embodiment of the present invention, the information extraction/display device 1 may instead be a processing device that either extracts or displays trend information.

The information extraction/display device 1 comprises a main expression extraction unit 11, a trend information pair extraction unit 12, a main trend information pair extraction/display unit 13, and a related article database (DB) 14.

The main expression extraction unit 11 extracts main expressions from the group of articles related to a certain field that is stored in the related article DB 14 described later. The main expressions extracted are, for example, unit expressions, time expressions, and item expressions. The main expressions are used when the trend information pair extraction unit 12, described later, extracts trend information pairs. When extracting main expressions, for example, expressions of the relevant kind that appear evenly and frequently throughout the target group of articles are extracted.

The main expression extraction unit 11 comprises a main unit expression extraction unit 111, a main time expression extraction unit 112, and a main item expression extraction unit 113.

The main unit expression extraction unit 111 extracts the unit expressions needed to extract and organize trend information. For example, for articles about a home run race, the counters "本" (hon) and "号" (gō) in expressions such as "50本" (50 home runs) and "50号" (home run No. 50) are extracted as unit expressions. The main time expression extraction unit 112 extracts the time expressions needed to extract and organize trend information, for example expressions of day, month, and year. The main item expression extraction unit 113 extracts the item expressions needed to extract and organize trend information. For example, for articles about a home run race, expressions that are the subject of the trend survey, such as "McGwire" and "Sosa", are extracted as item expressions.

The trend information pair extraction unit 12 identifies the places in the target group of articles where the main expressions obtained by the main expression extraction unit 11 appear together, and extracts as a trend information pair the combination of a numerical expression (the main unit expression together with the number that corresponds to it), the main time expression, and the main item expression written there. The number corresponding to a main unit expression is, for example, the number that appears adjacent to that unit expression in the article. In other words, for a main unit expression the associated number is extracted as well, and the number and the unit expression together form the numerical expression. For example, for articles about a home run race, the combination "item expression: McGwire", "time expression: 11日 (the 11th)", "numerical expression: 47号 (home run No. 47)" is extracted as a trend information pair.
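As a rough illustration of the co-occurrence idea described above, the following Python sketch scans sentences for places where a main item expression, a main time expression, and a main unit expression all appear, and pairs each unit with the number written directly before it. The function name `extract_trend_pairs`, the choice of the sentence as the co-occurrence scope, and the regular-expression matching are assumptions made for this example and are not taken from the patent text.

```python
import re

def extract_trend_pairs(sentences, items, time_units, value_units):
    """Return (item, time expression, numerical expression) tuples.

    A tuple is produced only when all three kinds of main expression
    occur in the same sentence (the co-occurrence scope is an assumption).
    """
    pairs = []
    for sentence in sentences:
        for item in items:
            if item not in sentence:
                continue
            for time_unit in time_units:
                # A number written directly before the time unit, e.g. "11日".
                time_match = re.search(r"\d+" + re.escape(time_unit), sentence)
                if not time_match:
                    continue
                for value_unit in value_units:
                    # A number written directly before the unit, e.g. "47号".
                    value_match = re.search(r"\d+" + re.escape(value_unit), sentence)
                    if value_match:
                        pairs.append((item, time_match.group(), value_match.group()))
    return pairs
```

Called with the sentences `["マグワイアは11日、47号を放った。"]`, the item `"マグワイア"`, the time unit `"日"`, and the value unit `"号"`, the sketch pairs the item with `"11日"` and `"47号"`. Only the first match of each unit in a sentence is used, which is a simplification.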

The main trend information pair extraction/display unit 13 organizes the trend information pairs that the trend information pair extraction unit 12 extracted from the target group of articles, and presents the trend information as a graph or as text with highlighting. For example, for articles about a home run race, it graphs the trend information pairs about "McGwire" obtained by the trend information pair extraction unit 12, taking time on the horizontal axis and the number of home runs on the vertical axis. As another example, the main trend information pair extraction/display unit 13 extracts sentences containing trend information pairs from the group of articles in the related article DB 14 and highlights the trend information pairs within the extracted sentences.
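The highlighted-text presentation can be pictured with a small Python sketch. It wraps each expression of a trend information pair in marker strings wherever it occurs in a sentence. The function name `highlight` and the bracket markers are assumptions for this example; a real display might use color or bold type instead.

```python
import re

def highlight(sentence, expressions, left="[", right="]"):
    """Wrap every occurrence of the given expressions in marker strings."""
    if not expressions:
        return sentence
    # Try longer expressions first so a short one cannot split a long one.
    ordered = sorted(expressions, key=len, reverse=True)
    pattern = "|".join(re.escape(e) for e in ordered)
    return re.sub(pattern, lambda m: left + m.group(0) + right, sentence)
```

For the sentence "マグワイアは11日、47号を放った。" and the expressions "マグワイア", "11日", and "47号", each of the three expressions is returned enclosed in brackets while the rest of the sentence is unchanged.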

When the main expression extraction unit 11 has extracted a plurality of main expressions, the main trend information pair extraction/display unit 13 may extract the main trend information pairs from the plural kinds of trend information pairs that the trend information pair extraction unit 12 extracted based on each main expression, and then graph or highlight the extracted main trend information pairs.

The related article DB 14 stores a group of articles related to a certain field. In the embodiment of the present invention, the related article DB 14 may be omitted, and the information extraction/display device 1 may extract main expressions, and also trend information, from an input group of articles related to a certain field.

In the embodiment of the present invention, the information extraction/display device 1 may further comprise a keyword input unit (not shown) for inputting a keyword, and an article group extraction unit (not shown) that extracts a group of articles related to the input keyword from bibliographic data in storage means (not shown) and stores it in the related article DB 14. The bibliographic data is, for example, a large-scale corpus. The article group extraction unit may also input the extracted group of articles to the main expression extraction unit 11, which then extracts main expressions from that group. In the embodiment of the present invention, the input keyword itself may be used as a main item expression.

Detailed examples of each component of the information extraction/display device 1 according to the embodiment of the present invention are described below.
(Main expression extraction unit 11)
The main expression extraction unit 11 extracts the main expressions needed to extract and organize trend information. The following are extracted as main expressions.

Unit expressions
Time expressions
Item expressions
Each kind of expression is extracted using, for example, ChaSen (see Reference (1) below).

Reference (1): Y. Matsumoto, A. Kitauchi, T. Yamashita, Y. Hirano, H. Matsuda and M. Asahara: "Japanese morphological analysis system ChaSen version 2.0 manual 2nd edition" (1999).
Each expression is extracted using the part-of-speech information in ChaSen's output. For unit expressions, the noun sequence attached directly before or after a numerical value is taken. For time expressions, for example, those expressions obtained as unit expressions that contain a time-related word (e.g., "年" (year), "月" (month), "日" (day), "時" (hour), "秒" (second)) are treated as time expressions. For item expressions, noun sequences are taken, for example.

In each of the above processes, however, unknown words are treated as nouns, for example. For a noun sequence that follows a number, if a noun whose part-of-speech subcategory is suffix is followed by a non-suffix noun, the sequence is taken only up to that suffix noun, for example. Typical ChaSen errors and hard-to-handle outputs are also corrected automatically by a program. For example, outputs that wrongly tag the unknown-word fragments "ス" or "ナ" as a verb or a particle are changed so that each becomes an unknown word. When a number and its unit, such as "１月" (January), are output as a single noun, they are split into "１" and "月" so that the unit expression is easy to take out.
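The rules above for noun sequences that follow a number can be sketched in Python as follows. The input is assumed to be a list of `(surface, pos)` tuples with a simplified, hypothetical tag set (`"number"`, `"noun"`, `"noun-suffix"`, `"unknown"`, and anything else); real ChaSen part-of-speech labels are more detailed, and the case of a noun sequence that precedes a number is left out for brevity.

```python
TIME_MARKERS = ("年", "月", "日", "時", "秒")
# Unknown words are treated as nouns, as described in the text.
NOUN_LIKE = {"noun", "noun-suffix", "unknown"}

def units_after_numbers(tokens):
    """Collect the noun run that directly follows each number token."""
    units = []
    for i, (_, pos) in enumerate(tokens):
        if pos != "number":
            continue
        run, prev_pos = [], None
        for surface, next_pos in tokens[i + 1:]:
            if next_pos not in NOUN_LIKE:
                break
            # A suffix noun followed by a non-suffix noun ends the unit.
            if prev_pos == "noun-suffix" and next_pos != "noun-suffix":
                break
            run.append(surface)
            prev_pos = next_pos
        if run:
            units.append("".join(run))
    return units

def is_time_expression(unit):
    """A unit expression containing a time-related word counts as a time expression."""
    return any(marker in unit for marker in TIME_MARKERS)
```

With tokens for "11日、47号本塁打を", the sketch returns the units "日" and "号"; the second run stops at the suffix noun "号" because a non-suffix noun follows it, and only "日" passes `is_time_expression`.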

Besides the ChaSen-based method described above, unit expressions, time expressions, and item expressions can be treated as named entities, and the main expressions can be extracted using the named entity extraction techniques described below.

A named entity is a linguistic expression that denotes a specific thing or quantity, such as a proper noun (a person name, place name, or organization name) or a numerical expression (such as a monetary amount). Named entity extraction is the technique of automatically extracting such expressions from text by computer. For example, applying named entity extraction to the sentence "日本の首相は小泉純一郎である" ("The prime minister of Japan is Junichiro Koizumi") extracts the named entities "日本" (Japan) and "小泉純一郎" (Junichiro Koizumi) as a place name and a person name, respectively.

An example of a general technique for named entity extraction is described below.
(1) Method using machine learning
There is a method of extracting named entities using machine learning (see, for example, Reference (2) below).

Reference (2): Masayuki Asahara, Yuji Matsumoto, "Use of redundant morphological analysis in Japanese named entity extraction", IPSJ SIG on Natural Language Processing, NL153-7, 2002.

First, for example, the sentence "日本の首相は小泉さんです。" ("The prime minister of Japan is Mr. Koizumi.") is split into individual characters, and the correct answer is set by assigning each character a correct tag such as B-LOCATION or I-LOCATION, as follows. The first column is the character and the second column is its correct tag.
日 B-LOCATION
本 I-LOCATION
の O
首 O
相 O
は O
小 B-PERSON
泉 I-PERSON
さ O
ん O
で O
す O
。 O
In the above, B-??? is a tag marking the beginning of a named entity of the type given after the hyphen. For example, B-LOCATION marks the beginning of a place name, and B-PERSON marks the beginning of a person name. I-??? is a tag marking a position other than the beginning of a named entity of the type given after the hyphen, and O marks everything else. Thus, for example, the character "日" is the character at the beginning of a place name, and the place name extends through the character "本".

With the correct answer set for each character in this way, the system learns from such data, estimates the tags for new data, and, from the estimated tags, recognizes where each named entity begins and how far it extends, thereby estimating the named entities.
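The last step, recovering named entities from per-character tags, can be sketched in Python as follows. The function name `decode_bio` and the `(character, tag)` input format are assumptions for this example; the tags themselves follow the B-/I-/O scheme described above.

```python
def decode_bio(tagged_chars):
    """Group (character, tag) pairs into (entity text, entity type) tuples."""
    entities, chars, label = [], [], None
    for ch, tag in tagged_chars:
        # An I- tag of the same type continues the current entity.
        if tag.startswith("I-") and label == tag[2:]:
            chars.append(ch)
            continue
        # Anything else closes the entity collected so far.
        if chars:
            entities.append(("".join(chars), label))
        if tag.startswith("B-"):
            chars, label = [ch], tag[2:]
        else:
            chars, label = [], None
    if chars:
        entities.append(("".join(chars), label))
    return entities
```

Applied to the tagged characters of "日本の首相は小泉さんです。", the sketch yields "日本" as a LOCATION and "小泉" as a PERSON. An I- tag that does not continue an entity of the same type is simply ignored here, which is one of several possible conventions.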

When learning from the correct-answer data set for each character, each system uses various kinds of information in the form of features. For example, for the part
日 B-LOCATION
information such as
日本-B 名詞-B
is used. 日本-B means the beginning of the word "日本" (Japan), and 名詞-B means the beginning of a noun (名詞). Words and parts of speech are identified using, for example, morphological analysis by ChaSen as described above. ChaSen splits input Japanese text into words and also estimates the part of speech of each word. For example, entering "学校へ行く" ("go to school") yields the following result.

学校  ガッコウ  学校  名詞-一般
へ  ヘ  へ  助詞-格助詞-一般
行く  イク  行く  動詞-自立  五段・カ行促音便  基本形
EOS
The text is thus split so that each line holds one word, and each word is given its reading and part-of-speech information.
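Output of this shape can be turned into word records with a few lines of Python. The sketch below does not call ChaSen; it only parses text that is assumed to be tab-separated with the columns surface form, reading, base form, and part of speech, ending with an `EOS` line. The function name `parse_chasen_output` and the dictionary keys are hypothetical.

```python
def parse_chasen_output(text):
    """Parse tab-separated analyzer output into a list of word records."""
    tokens = []
    for line in text.splitlines():
        if not line or line == "EOS":
            continue
        cols = line.split("\t")
        if len(cols) < 4:
            # Skip lines that do not have the assumed four leading columns.
            continue
        tokens.append({
            "surface": cols[0],
            "reading": cols[1],
            "base": cols[2],
            "pos": cols[3],
        })
    return tokens
```

Any further columns, such as the conjugation type and form on the verb line, are ignored by this sketch.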

In Reference (2) above, for example, the features used are, for each character of the input sentence, the character itself (e.g., the character "小"), the character type (e.g., hiragana or katakana), part-of-speech information, and tag information (e.g., "B-PERSON").

Learning is carried out using these features. The system examines which features appear at the character whose tag is to be estimated and at the surrounding characters, learns which tag is likely when which features appear, and uses the learning result to estimate tags for new data. A support vector machine, for example, is used for the machine learning.
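The character-level features can be pictured with the Python sketch below, which builds a feature dictionary from the character itself and its character type for the target character and its neighbors. The window size of one character on each side, the feature names, and the Unicode ranges used to decide the character type are assumptions for this example; part-of-speech and tag features, and the support vector machine training itself, are omitted.

```python
def char_type(ch):
    """Classify a character by Unicode range (a simplified assumption)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isdigit():
        return "digit"
    return "other"

def char_features(sentence, index, window=1):
    """Features of the character at `index` and of its neighbors."""
    features = {}
    for offset in range(-window, window + 1):
        pos = index + offset
        if 0 <= pos < len(sentence):
            ch = sentence[pos]
            features[f"char[{offset}]"] = ch
            features[f"type[{offset}]"] = char_type(ch)
    return features
```

A dictionary of this kind would then be converted to a feature vector and passed to whatever learner is used.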

Besides the above method, there are various other methods of named entity extraction. For example, there is a method that extracts named entities using a maximum entropy model and rewriting rules (see Reference (3)).

Reference (3): Kiyotaka Uchimoto, Qing Ma, Masaki Murata, Hiromi Ozaku, Masao Uchiyama, Hitoshi Isahara, "Named entity extraction based on a maximum entropy model and rewriting rules", Journal of the Association for Natural Language Processing, Vol. 7, No. 2, 2000.

Also, Reference (4) below, for example, describes a method of Japanese named entity extraction using a support vector machine.

Reference (4): Hiroyasu Yamada, Taku Kudo, Yuji Matsumoto, "Japanese named entity extraction using Support Vector Machine", IPSJ Journal, Vol. 43, No. 1, 2002.
(2) Method using hand-written rules
Another method is to write rules by hand and use them to extract named entities.

For example:
a noun followed by "さん" (Mr./Ms.) is a person name;
a noun followed by "首相" (prime minister) is a person name;
a noun followed by "町" (town) is a place name;
a noun followed by "市" (city) is a place name;
and so on.
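Hand-written rules of this kind translate almost directly into code. The Python sketch below applies the four example rules to a list of `(surface, pos)` tuples; the simplified tag `"noun"`, the function name `apply_rules`, and the output labels are assumptions for this example.

```python
PERSON_SUFFIXES = ("さん", "首相")
LOCATION_SUFFIXES = ("町", "市")

def apply_rules(tokens):
    """Label a noun by the word that immediately follows it."""
    entities = []
    for (surface, pos), (next_surface, _) in zip(tokens, tokens[1:]):
        if pos != "noun":
            continue
        if next_surface in PERSON_SUFFIXES:
            entities.append((surface, "PERSON"))
        elif next_surface in LOCATION_SUFFIXES:
            entities.append((surface, "LOCATION"))
    return entities
```

Further rules can be added by extending the two suffix tuples, at the cost of the usual maintenance burden of rule-based systems.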

The above named entity extraction techniques were described using the extraction of person names and place names as an example. In the embodiment of the present invention, unit expressions, time expressions, and item expressions may each be treated as named entities and extracted using the same techniques.

Next, the main unit expressions, time expressions, and item expressions, that is, those that play a central role in the group of articles for the field being handled, are extracted. For example, expressions of the relevant kind that appear evenly and frequently throughout the target group of articles are extracted as main expressions.

Specifically, the main expressions are extracted using Score values such as those given by equations (1) to (3) below; the larger an expression's score, the more it is judged to be a main expression.
(1) Okapi TF term

(2) Total frequency

(3) Total number of articles in which the expression appears

Here, i is an article number, Docs is the set of article numbers, TF_i is the number of occurrences of the expression in article i, l_i is the length of article i, and Δ is the average article length in the article group Docs. The Okapi TF term has the effect of raising the score of expressions that occur evenly across many articles and with high frequency. The length of an article is, for example, the number of words or characters it contains.
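The three scores can be sketched in Python as below. The equations themselves are not reproduced in this text, so the exact form of score (1) is an assumption: the sketch uses the commonly cited Okapi-style term TF_i / (TF_i + l_i / Δ), summed over the articles in Docs, which is consistent with the symbols defined above. Scores (2) and (3) follow from their names. All function names are hypothetical.

```python
def okapi_tf_score(counts, lengths):
    """Assumed form of score (1): sum over articles of TF_i / (TF_i + l_i / avg)."""
    average_length = sum(lengths) / len(lengths)
    return sum(
        tf / (tf + length / average_length)
        for tf, length in zip(counts, lengths)
        if tf > 0
    )

def total_frequency(counts):
    """Score (2): total number of occurrences over all articles."""
    return sum(counts)

def article_count(counts):
    """Score (3): number of articles in which the expression appears."""
    return sum(1 for tf in counts if tf > 0)
```

Under this assumed form, three occurrences spread over three equally long articles score 1.5, whereas three occurrences in a single article score 0.75, which matches the stated preference for expressions that appear evenly across articles.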

For item expressions, so that long strings are preferred, a variant was also used in which TF_i is not the number of occurrences of the expression in article i but the product of that number and the string length of the expression.

In the embodiment of the present invention, the score may instead be the value of equation (1), (2), or (3) multiplied by IDF, that is, log(N/DF). Here, N is the total number of articles in a large-scale corpus (not shown), and DF is, for example, the number of articles in that corpus in which the expression appears.
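The IDF weighting is a one-line computation, sketched here in Python. The base of the logarithm is not stated in the text, so the natural logarithm is an assumption, and the function names are hypothetical.

```python
import math

def idf(total_articles, articles_with_expression):
    """IDF = log(N / DF); the natural logarithm is an assumption."""
    return math.log(total_articles / articles_with_expression)

def idf_weighted(score, total_articles, articles_with_expression):
    """Multiply any of the three scores by the IDF of the expression."""
    return score * idf(total_articles, articles_with_expression)
```

An expression that appears in every article of the large corpus gets an IDF of zero, so common words are suppressed however frequent they are in the target group of articles.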

In the embodiment of the present invention, the main expression extraction unit 11 extracts, for example, the expression with the highest calculated score as the main expression. Alternatively, it may extract as main expressions those expressions whose calculated score is at or above a predetermined threshold, or a predetermined number of expressions in descending order of score. It may also sort the extracted expressions by score in descending or ascending order and output them.
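The four selection options can be sketched in Python over a dictionary that maps each candidate expression to its score. The function names and the dictionary representation are assumptions for this example.

```python
def top_expression(scores):
    """The single expression with the highest score."""
    return max(scores, key=scores.get)

def above_threshold(scores, threshold):
    """All expressions whose score is at or above the threshold."""
    return [expr for expr, score in scores.items() if score >= threshold]

def top_k(scores, k):
    """The k highest-scoring expressions, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def ranked(scores, descending=True):
    """All (expression, score) pairs sorted by score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=descending)
```

Which option is appropriate depends on whether the later stages expect exactly one main expression of each kind or a list of candidates for a user to choose from.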

In the embodiment of the present invention, the main unit expression extraction unit 111 comprises, for example, unit expression extraction means 200, score value calculation means 201, and main unit expression extraction means 202, as shown in FIG. 2(A). The unit expression extraction means 200 extracts unit expressions from the group of articles in the related article DB 14. The score value calculation means 201 calculates a score for each extracted unit expression. The main unit expression extraction means 202 extracts the main unit expressions based on the calculated scores.

Similarly, the main time expression extraction unit 112 comprises, for example, time expression extraction means 300, score value calculation means 301, and main time expression extraction means 302, as shown in FIG. 2(B). The time expression extraction means 300 extracts time expressions from the group of articles in the related article DB 14. The score value calculation means 301 calculates a score for each extracted time expression. The main time expression extraction means 302 extracts the main time expressions based on the calculated scores.

また、例えば、主要項目表現抽出部１１３は、図２（Ｃ）に示すような、項目表現抽出手段４００、スコア値算出手段４０１、主要項目表現抽出手段４０２を備える。項目表現抽出手段４００は、関連記事ＤＢ１４中の記事群から項目表現を抽出する。スコア値算出手段４０１は、抽出した項目表現についてのスコア値を算出する。主要項目表現抽出手段４０２は、算出されたスコア値に基づいて、主要項目表現を抽出する。 Further, for example, the main item expression extraction unit 113 includes item expression extraction means 400, score value calculation means 401, and main item expression extraction means 402, as shown in FIG. 2(C). The item expression extraction means 400 extracts item expressions from the article group in the related article DB 14. The score value calculation means 401 calculates a score value for each extracted item expression. The main item expression extraction means 402 extracts the main item expression based on the calculated score values.

（動向情報対抽出部１２） (Trend information pair extraction unit 12)
動向情報対抽出部１２は、対象の記事群において、主要表現抽出部１１において取り出した表現が例えば同時に出現している箇所を特定し、その箇所に記載されている主要表現対に基づいて、動向情報対を抽出する。すなわち、動向情報対抽出部１２は、例えば、対象の記事群において主要単位表現と主要時間表現と主要項目表現とが同時に出現している部分に記載されている、主要単位表現と主要単位表現に対応する数値とからなる数値表現と、主要時間表現と、主要項目表現との対を、動向情報対として抽出する動向情報対抽出手段である。 The trend information pair extraction unit 12 identifies, in the target article group, places where the expressions extracted by the main expression extraction unit 11 appear, for example, simultaneously, and extracts trend information pairs based on the main expression pairs described in those places. That is, the trend information pair extraction unit 12 is, for example, trend information pair extraction means that extracts, as a trend information pair, the combination of a numerical expression (a main unit expression together with its corresponding numerical value), a main time expression, and a main item expression described in a part of the target article group where the main unit expression, the main time expression, and the main item expression appear simultaneously.

本発明の実施の形態においては、例えば、句点、改行、文書の切れ目を示す特殊記号を切れ目とし、これらをはさまずに同時に単位表現、時間表現、項目表現が出現した箇所を、同時に出現した箇所とする。また、例えば、一記事につき、動向情報対は一つとし、記事中で最も最初に現れた動向情報対のみを取り出す。また、単位表現と連接した数値と単位表現を組み合わせたものを数値表現として取り出す。 In the embodiment of the present invention, for example, full stops, line breaks, and special symbols marking document boundaries are treated as breaks, and a place where a unit expression, a time expression, and an item expression all appear with no break between them is regarded as a place where they appear simultaneously. Also, for example, one trend information pair is taken per article: only the trend information pair that appears first in the article is extracted. A numerical value adjacent to a unit expression is combined with that unit expression and extracted as a numerical expression.
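
The co-occurrence rule above can be sketched as follows. This is a simplified illustration under stated assumptions: only the Japanese full stop and the line break are used as breaks, the numerical value is assumed to sit immediately before the unit expression, and the function name is hypothetical.

```python
import re

def first_trend_pair(article, unit, time_expr, item):
    """Return the first (numerical expression, time, item) triple found in one segment.

    A segment is the text between breaks; here only the full stop and the
    line break are treated as breaks (an assumption for this sketch).
    """
    for segment in re.split(r"[。\n]", article):
        if time_expr in segment and item in segment:
            # A number directly followed by the unit forms the numerical expression.
            match = re.search(r"\d+(?:\.\d+)?" + re.escape(unit), segment)
            if match:
                return (match.group(0), time_expr, item)
    return None  # no segment contains all three expressions
```

Returning after the first match mirrors the "one trend information pair per article" option; collecting all matches instead would give the multiple-pair variant.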

本発明の実施の形態においては、一記事中に複数の動向情報対が記述されていることもあるので、動向情報対抽出部１２は、複数の動向情報対を取り出す構成を採ってもよい。また、本発明の実施の形態においては、動向情報対抽出部１２は、例えば単位表現、時間表現、項目表現がより近接して出現している箇所の情報を重視して、その箇所に記載されている主要表現対に基づいて、動向情報対を抽出する構成を採ってもよい。例えば、単位表現、時間表現、項目表現それぞれの間の文字数または単語数が所定の閾値以下である箇所に記載されている主要表現対に基づいて、動向情報対を抽出してもよい。 In the embodiment of the present invention, since a plurality of trend information pairs may be described in one article, the trend information pair extraction unit 12 may adopt a configuration for extracting a plurality of trend information pairs. Further, in the embodiment of the present invention, the trend information pair extraction unit 12 places importance on the information of the part where the unit expression, the time expression, and the item expression appear more closely, and is described in the part. A configuration may be adopted in which trend information pairs are extracted based on the main expression pairs. For example, a trend information pair may be extracted based on a main expression pair described in a place where the number of characters or the number of words between the unit expression, time expression, and item expression is equal to or less than a predetermined threshold.

すなわち、動向情報対抽出部１２は、例えば、対象の記事群において主要単位表現と主要時間表現と主要項目表現とがより近接して出現している部分に記載されている、主要単位表現と主要単位表現に対応する数値とからなる数値表現と、主要時間表現と、主要項目表現との対を、動向情報対として抽出する動向情報対抽出手段である。 In other words, the trend information pair extraction unit 12 describes, for example, the main unit expression and the main item described in the portion where the main unit expression, the main time expression, and the main item expression appear more closely in the target article group. It is a trend information pair extraction means for extracting a pair of a numeric expression consisting of a numerical value corresponding to a unit expression, a main time expression, and a main item expression as a trend information pair.

また、本発明の実施の形態においては、後述するように、機械学習の方法を利用して、動向情報対を取り出す構成を採ってもよい。 In the embodiment of the present invention, as described later, a configuration may be adopted in which trend information pairs are extracted using a machine learning method.

また、本発明の実施の形態においては、主要表現抽出部１１が複数の主要項目表現を抽出した場合、動向情報対抽出部１２は、例えば、対象の記事群中の一記事中において当該複数の主要項目表現が同時に出現すること等を条件に加えて、動向情報対を抽出する構成を採ってもよい。 In the embodiment of the present invention, when the main expression extraction unit 11 extracts a plurality of main item expressions, the trend information pair extraction unit 12 may be configured to extract trend information pairs under an additional condition, for example that the plurality of main item expressions appear together in a single article of the target article group.

（主要動向情報対抽出・表示部１３） (Main trend information pair extraction / display unit 13)
主要動向情報対抽出・表示部１３では、対象の記事群において、動向情報対抽出部１２において抽出した動向情報対を整理して、グラフ化または強調表示したテキストで提示する。動向情報対の時間表現を横軸に、数値表現を縦軸にしたグラフを作成する。また、動向情報対を取り出した文を、関連記事ＤＢ１４中の記事群から抽出して、その文において動向情報対を強調表示する。 The main trend information pair extraction / display unit 13 organizes the trend information pairs extracted by the trend information pair extraction unit 12 from the target article group and presents them as a graph or as highlighted text. It creates a graph with the time expressions of the trend information pairs on the horizontal axis and the numerical expressions on the vertical axis. It also extracts, from the article group in the related article DB 14, the sentences from which the trend information pairs were taken and highlights the trend information pairs in those sentences.

また、本発明の実施の形態においては、抽出した文において、複数の動向情報対がある場合は、例えば、最初に出現している動向情報対を、当該動向情報対に含まれる数量表現、時間表現、項目表現以外の数量表現、時間表現、項目表現と区別して表示してもよい。例えば、例えば、最初に出現している動向情報対を二重線で、当該動向情報対に含まれる数量表現、時間表現、項目表現以外の数量表現、時間表現、項目表現を一重線で強調表示するようにしてもよい。 In the embodiment of the present invention, when an extracted sentence contains a plurality of trend information pairs, the first trend information pair to appear may, for example, be displayed so that it is distinguished from the quantity, time, and item expressions that are not part of that trend information pair. For example, the first trend information pair to appear may be highlighted with a double line, and the other quantity, time, and item expressions with a single line.

また、本発明の実施の形態において、主要動向情報対抽出・表示部１３は、主要表現抽出部１１が複数の主要単位表現、主要時間表現、主要項目表現を取り出た場合、それら複数の表現のすべての組み合わせ分のデータに基づいて動向情報対抽出部１２が抽出した複数種類の動向情報対において、より多く抽出された動向情報対ほど有用な動向情報として判断して、当該動向情報対を主要動向情報対とする。例えば、最も多く抽出された動向情報対を主要動向情報として抽出する。そして、主要動向情報対抽出・表示部１３は、抽出された主要動向情報対をグラフ化または強調表示する。 Further, in the embodiment of the present invention, when the main expression extraction unit 11 has extracted a plurality of main unit expressions, main time expressions, and main item expressions, the main trend information pair extraction / display unit 13 considers the plural kinds of trend information pairs that the trend information pair extraction unit 12 extracted for every combination of those expressions, judges a kind of trend information pair to be more useful the more often it was extracted, and takes such pairs as the main trend information pairs. For example, the most frequently extracted trend information pair is extracted as the main trend information. The main trend information pair extraction / display unit 13 then graphs or highlights the extracted main trend information pairs.

例えば、主要表現抽出部１１によって、主要単位表現としてａ１とａ２が、主要時間表現としてｂ１とｂ２が、主要項目表現としてｃ１とｃ２が抽出されたとする。抽出されたこれら複数の表現の組み合わせによって、（ａ１，ｂ１，ｃ１），（ａ１，ｂ１，ｃ２），（ａ１，ｂ２，ｃ１），（ａ１，ｂ２，ｃ２），（ａ２，ｂ１，ｃ１），（ａ２，ｂ１，ｃ２），（ａ２，ｂ２，ｃ１），（ａ２，ｂ２，ｃ２）といった８組の表現対が得られる。動向情報対抽出部１２は、例えば、対象とする記事群から、（ａ１，ｂ１，ｃ１）という表現対が同時に出現している箇所に記載されている表現対を動向情報対として抽出する。動向情報対抽出部１２は、同様の方法で、各表現対に基づく動向情報対を抽出する。そして、例えば、抽出された動向情報対の数が最も多かった表現対に基づいて抽出された動向情報対を主要動向情報対とする。 For example, suppose the main expression extraction unit 11 has extracted a1 and a2 as main unit expressions, b1 and b2 as main time expressions, and c1 and c2 as main item expressions. Combining these extracted expressions yields eight expression pairs: (a1, b1, c1), (a1, b1, c2), (a1, b2, c1), (a1, b2, c2), (a2, b1, c1), (a2, b1, c2), (a2, b2, c1), and (a2, b2, c2). The trend information pair extraction unit 12 extracts from the target article group, for example, the expression pairs described in places where the expression pair (a1, b1, c1) appears simultaneously, as trend information pairs. It extracts trend information pairs for each of the other expression pairs in the same way. Then, for example, the trend information pairs extracted for the expression pair that yielded the largest number of trend information pairs are taken as the main trend information pairs.
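
The enumeration of the eight expression pairs and the "most frequently extracted" rule can be sketched as below. The callback `count_pairs` is a hypothetical stand-in for running the trend information pair extraction unit 12 on one combination and counting what it extracts. Breaking ties by enumeration order is also an assumption.

```python
from itertools import product

def pick_main_triple(units, times, items, count_pairs):
    """Enumerate every (unit, time, item) combination and pick the most productive one.

    count_pairs(triple) is assumed to return how many trend information pairs
    were extracted from the article group for that combination.
    """
    triples = list(product(units, times, items))
    counts = {triple: count_pairs(triple) for triple in triples}
    best = max(triples, key=lambda triple: counts[triple])  # ties: first enumerated wins
    return best, counts
```

With two candidates of each kind, `product` yields exactly the 2 x 2 x 2 = 8 combinations listed in the text.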

ここまでの記述では、それぞれの部分的な構成要素は自動で行うことになっているが、本発明の実施の形態は、例えば、主要表現抽出部１１の構成を取り除き、動向情報対抽出部１２が、ユーザによって入力された単位表現、時間表現、項目表現に基づいて動向情報対を抽出し、主要動向情報対抽出・表示部１３が動向情報対を表示する構成を採ってもよい。また、本発明の実施の形態では、主要表現抽出部１１が、抽出された主要な単位表現、時間表現、項目表現をユーザに提示（表示）し、ユーザがその提示（表示）されたものの中から主要な単位表現、時間表現、項目表現を選択し、動向情報対抽出部１２が、選択された主要な単位表現、時間表現、項目表現に基づいて動向情報対を抽出し、主要動向情報対抽出・表示部１３が、動向情報対をグラフ化して表示してもよい。また、本発明の実施の形態においては、逆に、主要表現抽出部１１と動向情報対抽出部１２による処理を通じて、ある程度、動向情報対を抽出してから、それを元の新聞記事データと人手で照らし合わせて抽出情報を修正、改善した後に、グラフ化して表示する構成にしてもよい。情報抽出・表示装置１は、複数の構成要素に分割して構築しており、情報抽出・表示装置１の構成の一部を人手と置き換えたり、情報抽出・表示装置１の一部だけを単独で利用したりすることが可能である。 In the description so far, each component operates automatically. However, the embodiment of the present invention may, for example, omit the main expression extraction unit 11, so that the trend information pair extraction unit 12 extracts trend information pairs based on unit, time, and item expressions input by the user, and the main trend information pair extraction / display unit 13 displays them. Alternatively, the main expression extraction unit 11 may present (display) the extracted main unit, time, and item expressions to the user, the user may select the main unit, time, and item expressions from among those presented, the trend information pair extraction unit 12 may extract trend information pairs based on the selected expressions, and the main trend information pair extraction / display unit 13 may graph and display them. Conversely, trend information pairs may first be extracted to some extent through the processing of the main expression extraction unit 11 and the trend information pair extraction unit 12, then manually checked against the original newspaper article data so that the extracted information is corrected and improved, and only then graphed and displayed. Because the information extraction / display device 1 is built from a plurality of separate components, part of its configuration can be replaced by manual work, or a part of the device can be used on its own.

図３は、本発明の実施の形態において、機械学習の手法を用いて動向情報対を抽出する構成を採る場合の、動向情報対抽出部１２の構成例を示す図である。動向情報対抽出部１２は、教師データ記憶手段１２１、解−素性対抽出手段１２２、機械学習手段１２３、学習結果記憶手段１２４、表現対抽出手段１２５、素性抽出手段１２６、解推定手段１２７、動向情報対抽出手段１２８を備える。 FIG. 3 shows a configuration example of the trend information pair extraction unit 12 when trend information pairs are extracted using a machine learning technique in the embodiment of the present invention. The trend information pair extraction unit 12 includes teacher data storage means 121, solution-feature pair extraction means 122, machine learning means 123, learning result storage means 124, expression pair extraction means 125, feature extraction means 126, solution estimation means 127, and trend information pair extraction means 128.

教師データ記憶手段１２１は、機械学習処理において使用される教師データとなるテキストデータを記憶する。例えば、数量表現をａｉ（ｉ＝１，２，３，．．．）、時間表現をｂｉ（ｉ＝１，２，３，．．．）、項目表現をｃｉ（ｉ＝１，２，３，．．．）とすると、教師データとして、テキストデータの文中に出現しているａｉ、ｂｉ、ｃｉの対（表現対）を問題、動向情報対として抽出するべき表現対であるか否かの情報を解とする事例を記憶する。具体的には、テキストデータ中に現れるあらゆるａｉ、ｂｉ、ｃｉの対について、動向情報対として抽出すべき表現対（正例）であるか、抽出するべきでない表現対（負例）かのいずれかの解を示すタグを人手によって付与する。例えば、図４中に示すテキストデータ中の表現ａ１，ａ２，ｂ１，ｂ２，ｃ１，ｃ２に基づいて構成される表現対である（ａ１，ｂ１，ｃ１），（ａ１，ｂ２，ｃ１），．．．（ａ２，ｂ２，ｃ２）のそれぞれについて、正例か負例かの解を示すタグを付与する。 The teacher data storage means 121 stores text data serving as the teacher data used in the machine learning process. For example, letting the quantity expressions be ai (i = 1, 2, 3, ...), the time expressions bi (i = 1, 2, 3, ...), and the item expressions ci (i = 1, 2, 3, ...), it stores as teacher data cases in which the problem is a combination (expression pair) of ai, bi, and ci appearing in a sentence of the text data and the solution is whether that expression pair should be extracted as a trend information pair. Specifically, every combination of ai, bi, and ci appearing in the text data is manually given a tag indicating the solution: either an expression pair that should be extracted as a trend information pair (positive example) or one that should not (negative example). For example, each of the expression pairs (a1, b1, c1), (a1, b2, c1), ..., (a2, b2, c2) formed from the expressions a1, a2, b1, b2, c1, and c2 in the text data shown in FIG. 4 is given a tag indicating whether it is a positive or a negative example.

すなわち、本発明の実施の形態においては、例えば、
（ａ１，ｂ１，ｃ１）−解「正例」
（ａ１，ｂ２，ｃ１）−解「負例」
・
・
・
（ａ２，ｂ２，ｃ２）−解「負例」
といった、表現対と解との組を生成する。 That is, in the embodiment of the present invention, pairs of an expression pair and a solution such as the following are generated:
(a1, b1, c1) - solution "positive example"
(a1, b2, c1) - solution "negative example"
...
(a2, b2, c2) - solution "negative example"

解−素性対抽出手段１２２は、教師データ記憶手段１２１内に記憶されているテキストデータの事例から、解と素性の集合との組を抽出する。素性は、機械学習処理で使用する情報である。解−素性対抽出手段１２２は、素性として、例えば、あるテキストデータ中の、解が付与された各表現対についての、ａｉとｂｉ、ｂｉとｃｉ、ａｉとｃｉの間の距離（文字または単語数等）や、テキストデータ中におけるａｉとｂｉとｃｉの表現対を含む範囲や、ａｉ、ｂｉ、ｃｉそれぞれの前後の品詞情報等を用いる。また、解−素性対抽出手段１２２は、例えば、ａｉ，ｂｉ，ｃｉがテキストデータのタイトルに含まれるか等の情報や、ａｉとｂｉ、ｂｉとｃｉ、ａｉとｃｉの間に出現する品詞の情報や、ａｉが小数点を含むか、また、ｂｉが年、月、日か、また、ｃｉが人名か地名かの情報を素性としてもよい。また、本発明の実施の形態においては、記事中におけるａｉ、ｂｉ、ｃｉそれぞれの位置情報を素性としてもよい。例えば、新聞等の記事においては、最初に出現する主要表現が重要となることが多いからである。 The solution-feature pair extraction means 122 extracts pairs of a solution and a feature set from the cases of text data stored in the teacher data storage means 121. Features are the information used in the machine learning process. As features, the solution-feature pair extraction means 122 uses, for each expression pair in the text data to which a solution has been given, for example, the distances between ai and bi, bi and ci, and ai and ci (in characters, words, or the like), the span of the text data that contains the expression pair ai, bi, ci, and the part-of-speech information before and after each of ai, bi, and ci. It may also use as features, for example, whether ai, bi, and ci are included in the title of the text data, the parts of speech that appear between ai and bi, bi and ci, and ai and ci, whether ai contains a decimal point, whether bi is a year, a month, or a day, and whether ci is a person name or a place name. In the embodiment of the present invention, the positions of ai, bi, and ci in the article may also be used as features, because in newspaper articles and the like the main expressions that appear first are often the important ones.

機械学習手段１２３は、解−素性対抽出手段１２２によって抽出された解と素性の集合との組から、どのような素性のときにどのような解になりやすいかを、教師あり機械学習法により学習する。その学習結果は、学習結果記憶手段１２４内に記憶される。 The machine learning means 123 learns, by a supervised machine learning method, which solution is likely for which features from the pairs of a solution and a feature set extracted by the solution-feature pair extraction means 122. The learning result is stored in the learning result storage means 124.
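
A few of the features named above (distances between the expressions, decimal point in the quantity, presence in the title, position in the article) can be sketched as follows. Part-of-speech features are left out because they need a morphological analyzer. The function name, feature names, and the use of start-offset differences as the character distance are assumptions for this example.

```python
def pair_features(text, a, b, c, title=""):
    """Build a small feature dictionary for one (quantity a, time b, item c) expression pair."""
    pos = {"a": text.find(a), "b": text.find(b), "c": text.find(c)}
    if min(pos.values()) < 0:
        raise ValueError("every expression must occur in the text")
    return {
        # Character distances, measured between start offsets (a simplification).
        "dist_ab": abs(pos["a"] - pos["b"]),
        "dist_bc": abs(pos["b"] - pos["c"]),
        "dist_ac": abs(pos["a"] - pos["c"]),
        "a_has_decimal": "." in a,
        "c_in_title": c in title,
        # Earliest position of any of the three expressions in the article.
        "first_pos": min(pos.values()),
    }
```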

表現対抽出手段１２５は、主要表現抽出部１１によって抽出された主要表現（例えば、単位表現、時間表現、項目表現）を用いて、関連記事ＤＢ１４中の各記事に含まれるａｉ（数量表現）、ｂｉ（時間表現）、ｃｉ（項目表現）という３種類の表現のあらゆる組み合わせ（表現対）を抽出する。なお、単位表現と連接して記事中に出現する数値と当該単位表現との組み合わせを数量表現とする。 The expression pair extraction unit 125 uses a main expression extracted by the main expression extraction unit 11 (for example, unit expression, time expression, item expression), ai (quantity expression) included in each article in the related article DB 14, All combinations (expression pairs) of three types of expressions, bi (time expression) and ci (item expression), are extracted. A combination of a numerical value appearing in an article connected to the unit expression and the unit expression is a quantity expression.

素性抽出手段１２６は、解−素性対抽出手段１２２と同様の処理によって、表現対抽出手段１２５によって抽出された各表現対について、素性を抽出する。 The feature extraction unit 126 extracts a feature for each expression pair extracted by the expression pair extraction unit 125 by the same processing as the solution-feature pair extraction unit 122.

解推定手段１２７は、学習結果記憶手段１２４の学習結果を参照して、各表現対について、その素性の集合の場合に、どのような解（分類先）になりやすいかの度合いを推定する。 The solution estimation unit 127 refers to the learning result of the learning result storage unit 124, and estimates the degree of the solution (classification destination) that is likely to be obtained in the case of a set of features for each expression pair.

動向情報対抽出手段１２８は、解推定手段１２７の推定結果に基づいて、動向情報対として抽出すべき表現対（正例）となる度合いが高いと推定されたものを、動向情報対として抽出する。 Based on the estimation result of the solution estimation means 127, the trend information pair extraction means 128 extracts, as trend information pairs, the expression pairs estimated to have a high degree of being expression pairs that should be extracted as trend information pairs (positive examples).

ここで、機械学習手段１２３による機械学習の手法について説明する。機械学習の手法は、問題−解の組のセットを多く用意し、それで学習を行ない、どういう問題のときにどういう解になるかを学習し、その学習結果を利用して、新しい問題のときも解を推測できるようにする方法である（例えば、下記の参考文献（５）〜参考文献（７）参照）。 Here, the machine learning method used by the machine learning means 123 is described. A machine learning method prepares many problem-solution pairs, learns from them which solution corresponds to which kind of problem, and uses the learning result so that a solution can also be inferred for a new problem (see, for example, references (5) to (7) below).

参考文献（５）：村田真樹，機械学習に基づく言語処理，龍谷大学理工学部．招待講演．2004. http://www2.nict.go.jp/jt/a132/members/murata/ps/rk1-siryou.pdf Reference (5): Masaki Murata, Language Processing Based on Machine Learning, Faculty of Science and Engineering, Ryukoku University, invited lecture, 2004. http://www2.nict.go.jp/jt/a132/members/murata/ps/rk1-siryou.pdf
参考文献（６）：サポートベクトルマシンを用いたテンス・アスペクト・モダリティの日英翻訳，村田真樹，馬青，内元清貴，井佐原均，電子情報通信学会言語理解とコミュニケーション研究会 NLC2000-78 ，2001年． Reference (6): Japanese-English translation of tense, aspect, and modality using support vector machines, Masaki Murata, Qing Ma, Kiyotaka Uchimoto, Hitoshi Isahara, IEICE Technical Committee on Natural Language Understanding and Communication, NLC2000-78, 2001.
参考文献（７）：SENSEVAL2J辞書タスクでのＣＲＬの取り組み，村田真樹，内山将夫，内元清貴，馬青，井佐原均，電子情報通信学会言語理解とコミュニケーション研究会 NLC2001-40 ，2001年． Reference (7): CRL's approach to the SENSEVAL2J dictionary task, Masaki Murata, Masao Uchiyama, Kiyotaka Uchimoto, Qing Ma, Hitoshi Isahara, IEICE Technical Committee on Natural Language Understanding and Communication, NLC2001-40, 2001.

どういう問題のときに、という、問題の状況を機械に伝える際に、素性（解析に用いる情報で問題を構成する各要素）というものが必要になる。問題を素性によって表現するのである。例えば、日本語文末表現の時制の推定の問題において、問題：「彼が話す。」−−−解「現在」が与えられた場合に、素性の一例は、「彼が話す。」「が話す。」「話す。」「す」「。」となる。 To tell the machine what kind of problem it is facing, that is, to convey the situation of the problem, features (the pieces of information used in the analysis, which are the individual elements that make up the problem) are needed. A problem is represented by its features. For example, in the problem of estimating the tense of a Japanese sentence-final expression, given the problem 「彼が話す。」 ("He speaks.") with the solution "present", one example of the features is the set of sentence-final strings 「彼が話す。」, 「が話す。」, 「話す。」, 「す」, and 「。」.

すなわち、機械学習の手法は、素性の集合−解の組のセットを多く用意し、それで学習を行ない、どういう素性の集合のときにどういう解になるかを学習し、その学習結果を利用して、新しい問題のときもその問題から素性の集合を取り出し、その素性の場合の解を推測する方法である。 In other words, the machine learning method prepares many sets of feature set-solution pairs, performs learning, learns what kind of solution the feature set becomes, and uses the learning result. This is a method of extracting a set of features from a new problem and inferring a solution in the case of the feature.

機械学習手段１２３は、機械学習の手法として、例えば、ｋ近傍法、シンプルベイズ法、決定リスト法、最大エントロピー法、サポートベクトルマシン法などの手法を用いる。 The machine learning unit 123 uses a technique such as a k-nearest neighbor method, a simple Bayes method, a decision list method, a maximum entropy method, or a support vector machine method as a machine learning method.

ｋ近傍法は、最も類似する一つの事例のかわりに、最も類似するｋ個の事例を用いて、このｋ個の事例での多数決によって分類先（解）を求める手法である。ｋは、あらかじめ定める整数の数字であって、一般的に、１から９の間の奇数を用いる。 The k-nearest neighbor method is a method for obtaining a classification destination (solution) by using the k most similar cases instead of the most similar case, and by majority decision of the k cases. k is a predetermined integer number, and generally an odd number between 1 and 9 is used.
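
A minimal sketch of the k-nearest neighbor vote follows. The text does not specify the similarity measure, so the number of shared features is used here as an assumption. The function name and data layout are likewise hypothetical.

```python
from collections import Counter

def knn_classify(query, examples, k=3):
    """Classify a feature set by majority vote among the k most similar training cases.

    examples is a list of (feature_set, label) pairs; similarity is the number of
    shared features (an assumption, not specified in the text).
    """
    ranked = sorted(examples, key=lambda ex: len(query & ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Using an odd k, as the text suggests, avoids tied votes when there are two classes.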

シンプルベイズ法は、ベイズの定理にもとづいて各分類になる確率を推定し、その確率値が最も大きい分類を求める分類先とする方法である。 The Simple Bayes method is a method of estimating the probability of each classification based on Bayes' theorem and determining the classification having the highest probability value as a classification destination.

シンプルベイズ法において、文脈ｂで分類ａを出力する確率は、以下の式（４）で与えられる。 In the simple Bayes method, the probability of outputting the classification a in the context b is given by the following equation (4).

ただし、ここで文脈ｂは、あらかじめ設定しておいた素性ｆ_j（∈Ｆ，１≦ｊ≦ｋ）の集合である。ｐ（ｂ）は、文脈ｂの出現確率である。ここで、分類ａに非依存であって定数のために計算しない。Ｐ（ａ）（ここでＰはｐの上部にチルダ）とＰ（ｆ_i｜ａ）は、それぞれ教師データから推定された確率であって、分類ａの出現確率、分類ａのときに素性ｆ_iを持つ確率を意味する。Ｐ（ｆ_i｜ａ）として最尤推定を行って求めた値を用いると、しばしば値がゼロとなり、式（５）の値がゼロで分類先を決定することが困難な場合が生じる。そのため、スームージングを行う。ここでは、以下の式（６）を用いてスームージングを行ったものを用いる。 Here, the context b is a set of preset features f_j (∈ F, 1 ≦ j ≦ k). p(b) is the probability of occurrence of the context b; it does not depend on the classification a and is a constant, so it is not calculated. P̃(a) and P̃(f_i | a) (P̃ denotes p with a tilde above it) are probabilities estimated from the teacher data: the probability of occurrence of the classification a, and the probability of having the feature f_i given the classification a, respectively. If maximum likelihood estimates are used for P̃(f_i | a), the value is often zero, so that the value of equation (5) becomes zero and the classification is difficult to determine. Smoothing is therefore performed; here, values smoothed using the following equation (6) are used.

ただし、ｆｒｅｑ（ｆ_i，ａ）は、素性ｆ_iを持ちかつ分類がａである事例の個数、ｆｒｅｑ（ａ）は、分類がａである事例の個数を意味する。 Here, freq (f _i , a) means the number of cases having the feature f _i and the classification a, and freq (a) means the number of cases having the classification a.
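
Equations (4) to (6) are not reproduced in this text, so the following is only a sketch of a generic simple (naive) Bayes classifier using the counts freq(f_i, a) and freq(a) defined above. The additive smoothing with a small constant `eps` is an assumption made for the example and is not claimed to be equation (6). The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """Count freq(a) and freq(f, a) from (feature_set, label) training cases."""
    class_count = Counter()
    feat_count = defaultdict(Counter)
    for feats, label in examples:
        class_count[label] += 1
        for f in set(feats):
            feat_count[label][f] += 1
    return class_count, feat_count

def classify_nb(feats, class_count, feat_count, eps=0.01):
    """Pick the class maximizing P(a) * prod P(f|a); p(b) is constant and skipped."""
    total = sum(class_count.values())
    best, best_logp = None, -math.inf
    for a, n_a in class_count.items():
        logp = math.log(n_a / total)
        for f in feats:
            # Assumed additive smoothing so an unseen feature never gives probability zero.
            logp += math.log((feat_count[a][f] + eps) / (n_a + 2 * eps))
        if logp > best_logp:
            best, best_logp = a, logp
    return best
```

Working in log space avoids underflow when many feature probabilities are multiplied.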

決定リスト法は、素性と分類先の組とを規則とし、それらをあらかじめ定めた優先順序でリストに蓄えおき、検出する対象となる入力が与えられたときに、リストで優先順位の高いところから入力のデータと規則の素性とを比較し、素性が一致した規則の分類先をその入力の分類先とする方法である。 The decision list method uses features and combinations of classification destinations as rules, stores them in the list in a predetermined priority order, and when input to be detected is given, from the highest priority in the list This is a method in which input data is compared with the feature of the rule, and the classification destination of the rule having the same feature is set as the classification destination of the input.

決定リスト方法では、あらかじめ設定しておいた素性ｆ_j( ∈Ｆ，１≦ｊ≦ｋ）のうち、いずれか一つの素性のみを文脈として各分類の確率値を求める。ある文脈ｂで分類ａを出力する確率は以下の式によって与えられる。 In the decision list method, the probability value of each classification is obtained using only one of the features f _j (εF, 1 ≦ j ≦ k) set in advance as a context. The probability of outputting classification a in a context b is given by

ｐ（ａ｜ｂ）＝ｐ（ａ｜ｆmax ）　式（７） $p(a \mid b) = p(a \mid f_{\max})$  Equation (7)
ただし、ｆmax は以下の式によって与えられる。 Here, $f_{\max}$ is given by the following equation.

また、Ｐ（ａ_i｜ｆ_j）（ここでＰはｐの上部にチルダ）は、素性ｆ_jを文脈に持つ場合の分類ａ_iの出現の割合である。 P (a _i | f _j ) (where P is a tilde at the top of p) is the rate of appearance of the classification a _i when the feature f _j is in the context.
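
The decision list idea (one rule per feature, ordered by priority, first matching rule wins) can be sketched as follows. The defining equation for f_max is not reproduced in this text, so ordering the rules by the observed proportion of the majority class for each feature is an assumption of this sketch. The function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_decision_list(examples):
    """Create (confidence, feature, label) rules from (feature_set, label) cases.

    Confidence is the proportion of cases containing the feature that carry the
    feature's most frequent label; rules are sorted by descending confidence.
    """
    by_feat = defaultdict(Counter)
    for feats, label in examples:
        for f in set(feats):
            by_feat[f][label] += 1
    rules = []
    for f, labels in by_feat.items():
        label, n = labels.most_common(1)[0]
        rules.append((n / sum(labels.values()), f, label))
    rules.sort(key=lambda rule: rule[0], reverse=True)
    return rules

def classify_dl(feats, rules, default=None):
    """Return the label of the highest-priority rule whose feature occurs in feats."""
    for _, f, label in rules:
        if f in feats:
            return label
    return default
```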

最大エントロピー法は、あらかじめ設定しておいた素性ｆ_j（１≦ｊ≦ｋ）の集合をＦとするとき、以下所定の条件式（式（９））を満足しながらエントロピーを意味する式（１０）を最大にするときの確率分布ｐ（ａ，ｂ）を求め、その確率分布にしたがって求まる各分類の確率のうち、最も大きい確率値を持つ分類を求める分類先とする方法である。 The maximum entropy method works as follows: letting F be the set of preset features f_j (1 ≦ j ≦ k), it finds the probability distribution p(a, b) that maximizes expression (10), which represents entropy, while satisfying the predetermined constraint (expression (9)), and takes as the classification the class with the largest probability among the class probabilities obtained from that distribution.

ただし、Ａ、Ｂは分類と文脈の集合を意味し、ｇ_j（ａ，ｂ）は文脈ｂに素性ｆ_jがあって、なおかつ分類がａの場合１となり、それ以外で０となる関数を意味する。また、Ｐ（ａ_i｜ｆ_j）（ここでＰはｐの上部にチルダ）は、既知データでの（ａ，ｂ）の出現の割合を意味する。 However, A and B mean a set of classifications and contexts, and g _j (a, b) is a function that is 1 if the context b has a feature f _j and the classification is a, and is 0 otherwise. means. Further, P (a _i | f _j ) (where P is a tilde at the top of p) means the rate of appearance of (a, b) in the known data.

式（９）は、確率ｐと出力と素性の組の出現を意味する関数ｇをかけることで出力と素性の組の頻度の期待値を求めることになっており、右辺の既知データにおける期待値と、左辺の求める確率分布に基づいて計算される期待値が等しいことを制約として、エントロピー最大化( 確率分布の平滑化) を行なって、出力と文脈の確率分布を求めるものとなっている。最大エントロピー法の詳細については、以下の参考文献（８）および参考文献（９）に記載されている。 Formula (9) is to obtain the expected value of the frequency of the output and feature pair by multiplying the probability p and the function g meaning the appearance of the pair of output and feature. And the expected value calculated based on the probability distribution calculated on the left side is the constraint, and entropy maximization (smoothing of the probability distribution) is performed to determine the probability distribution of the output and the context. Details of the maximum entropy method are described in the following references (8) and (9).
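
Expressions (9) and (10) are not reproduced in this text. For orientation only, the standard maximum entropy formulation that matches the verbal description above (expected feature counts under the sought distribution on the left, expected counts in the known data on the right, and entropy as the objective) can be written as below. This is a reconstruction from the surrounding definitions and should be read as an assumption, not as the patent's own equations.

```latex
% Constraint: for every feature f_j (1 <= j <= k), the expectation of g_j under p
% equals its expectation under the empirical distribution \tilde{p}.
\sum_{a \in A,\, b \in B} p(a, b)\, g_j(a, b)
  = \sum_{a \in A,\, b \in B} \tilde{p}(a, b)\, g_j(a, b)
  \qquad (1 \le j \le k)

% Objective: the entropy of p, to be maximized subject to the constraint above.
H(p) = - \sum_{a \in A,\, b \in B} p(a, b) \log p(a, b)
```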

参考文献（８）：Eric Sven Ristad, Maximum Entropy Modeling for Natural Language, (ACL/EACL Tutorial Program, Madrid, 1997) Reference (8): Eric Sven Ristad, Maximum Entropy Modeling for Natural Language, (ACL/EACL Tutorial Program, Madrid, 1997)
参考文献（９）：Eric Sven Ristad, Maximum Entropy Modeling Toolkit, Release 1.6beta, (http://www.mnemonic.com/software/memt, 1998) Reference (9): Eric Sven Ristad, Maximum Entropy Modeling Toolkit, Release 1.6beta, (http://www.mnemonic.com/software/memt, 1998)

サポートベクトルマシン法は、空間を超平面で分割することにより、二つの分類からなるデータを分類する手法である。 The support vector machine method classifies data consisting of two classes by dividing the space with a hyperplane.

図５にサポートベクトルマシン法のマージン最大化の概念を示す。図５において、白丸は正例、黒丸は負例を意味し、実線は空間を分割する超平面を意味し、破線はマージン領域の境界を表す面を意味する。図５（Ａ）は、正例と負例の間隔が狭い場合（スモールマージン）の概念図、図５（Ｂ）は、正例と負例の間隔が広い場合（ラージマージン）の概念図である。 FIG. 5 shows the concept of margin maximization in the support vector machine method. In FIG. 5, white circles are positive examples, black circles are negative examples, the solid line is the hyperplane that divides the space, and the broken lines are the surfaces marking the boundary of the margin region. FIG. 5(A) is a conceptual diagram of the case where the gap between positive and negative examples is narrow (small margin), and FIG. 5(B) of the case where it is wide (large margin).

このとき、二つの分類が正例と負例からなるものとすると、学習データにおける正例と負例の間隔（マージン) が大きいものほどオープンデータで誤った分類をする可能性が低いと考えられ、図５（Ｂ）に示すように、このマージンを最大にする超平面を求めそれを用いて分類を行なう。 At this time, if the two classifications consist of positive and negative examples, the larger the interval (margin) between the positive and negative examples in the learning data, the less likely it is to make an incorrect classification with open data. As shown in FIG. 5B, a hyperplane that maximizes this margin is obtained, and classification is performed using it.

基本的には上記のとおりであるが、通常、学習データにおいてマージンの内部領域に少数の事例が含まれてもよいとする手法の拡張や、超平面の線形の部分を非線型にする拡張（カーネル関数の導入) がなされたものが用いられる。 This is the basic idea, but in practice the method is usually used with two extensions: one that allows a small number of cases in the training data to fall inside the margin, and one that makes the linear hyperplane nonlinear (the introduction of a kernel function).

この拡張された方法は、以下の識別関数を用いて分類することと等価であり、その識別関数の出力値が正か負かによって二つの分類を判別することができる。 This extended method is equivalent to classification using the following discriminant function, and the two classes can be discriminated depending on whether the output value of the discriminant function is positive or negative.

ただし、ｘは識別したい事例の文脈（素性の集合) を、ｘ_iとｙ_j（ｉ＝１，…，ｌ，ｙ_j∈｛１，−１｝）は学習データの文脈と分類先を意味し、関数ｓｇｎは、
ｓｇｎ（ｘ）＝１（ｘ≧０）
−１（otherwise ）
であり、また、各α_iは式（１３）と式（１４）の制約のもと式（１２）を最大にする場合のものである。 Here, x is the context (set of features) of the case to be classified, x_i and y_j (i = 1, ..., l; y_j ∈ {1, -1}) are the contexts and classifications of the training data, and the function sgn is
$\operatorname{sgn}(x) = \begin{cases} 1 & (x \ge 0) \\ -1 & (\text{otherwise}) \end{cases}$
and each α_i is the value that maximizes expression (12) under the constraints of expressions (13) and (14).

また、関数Ｋはカーネル関数と呼ばれ、様々なものが用いられるが、本形態では以下の多項式のものを用いる。 The function K is called a kernel function, and various functions are used. In this embodiment, the following polynomial is used.

Ｋ（ｘ，ｙ）＝（ｘ・ｙ＋１）^ｄ　式（１５） $K(x, y) = (x \cdot y + 1)^d$  Equation (15)
Ｃ、ｄは実験的に設定される定数である。例えば、Ｃはすべての処理を通して１に固定した。また、ｄは、１と２の二種類を試している。ここで、α_i＞０となるｘ_iは、サポートベクトルと呼ばれ、通常、式（１１）の和をとっている部分は、この事例のみを用いて計算される。つまり、実際の解析には学習データのうちサポートベクトルと呼ばれる事例のみしか用いられない。 C and d are constants set experimentally. For example, C was fixed at 1 throughout all processing, and two values, 1 and 2, were tried for d. An x_i for which α_i > 0 is called a support vector, and the sum in equation (11) is usually computed using only these cases. In other words, only the training cases called support vectors are used in the actual analysis.
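
The polynomial kernel of equation (15) and the sign-based decision can be sketched as follows. Equation (11) itself is not reproduced in this text, so the usual kernel SVM decision form sgn(sum of α_i y_i K(x_i, x) + b), including the bias term b, is an assumption here. The α values in the usage below are made up for illustration rather than obtained by training.

```python
def poly_kernel(x, y, d=2):
    """Polynomial kernel K(x, y) = (x . y + 1)^d from equation (15)."""
    return (sum(xi * yi for xi, yi in zip(x, y)) + 1) ** d

def sgn(value):
    """sgn as defined in the text: 1 when value >= 0, otherwise -1."""
    return 1 if value >= 0 else -1

def svm_decide(x, support_vectors, alphas, labels, bias=0.0, d=2):
    """Assumed decision form: sign of the kernel-weighted sum over support vectors."""
    total = sum(
        alpha * y * poly_kernel(sv, x, d)
        for sv, alpha, y in zip(support_vectors, alphas, labels)
    )
    return sgn(total + bias)
```

Only cases with α_i > 0 need to be passed in, which is why the sum runs over support vectors alone.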

なお、拡張されたサポートベクトルマシン法の詳細については、以下の参考文献（１０）および参考文献（１１）に記載されている。 The details of the extended support vector machine method are described in the following references (10) and (11).

参考文献（１０）：Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and other kernel-based learning methods, (Cambridge University Press, 2000) Reference (10): Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and other kernel-based learning methods, (Cambridge University Press, 2000)
参考文献（１１）：Taku Kudoh, Tinysvm:Support Vector machines, (http://cl.aist-nara.ac.jp/taku-ku//software/Tiny SVM/index.html, 2000) Reference (11): Taku Kudoh, Tinysvm: Support Vector machines, (http://cl.aist-nara.ac.jp/taku-ku//software/Tiny SVM/index.html, 2000)

サポートベクトルマシン法は、分類の数が２個のデータを扱うものである。したがって、分類の数が３個以上の事例を扱う場合には、通常、これにペアワイズ法またはワンＶＳレスト法などの手法を組み合わせて用いることになる。 The support vector machine method handles data with two classes. Therefore, to handle cases with three or more classes, it is usually combined with a technique such as the pairwise method or the one-vs-rest method.

In the pairwise method, for data with n classification destinations, every pair of two different classification destinations is generated (n(n-1)/2 pairs). For each pair, a binary classifier, that is, the support vector machine processing module, determines which of the two destinations is better. Finally, the classification destination is decided by a majority vote over the results of the n(n-1)/2 binary classifications.
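
The pairwise voting described above can be sketched as follows. The binary classifiers here are stand-in stubs rather than trained support vector machines, and the tie-breaking behavior (the first label to reach the top count wins) is an assumption; the description does not say how ties are resolved.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple

BinaryClassifier = Callable[[dict], str]  # returns one of its two labels


def pairwise_classify(x: dict,
                      labels: Sequence[str],
                      classifiers: Dict[Tuple[str, str], BinaryClassifier]) -> str:
    """Run all n(n-1)/2 binary classifiers and return the majority label."""
    votes = Counter()
    for a, b in combinations(labels, 2):
        votes[classifiers[(a, b)](x)] += 1
    return votes.most_common(1)[0][0]


def make_stub(a: str, b: str) -> BinaryClassifier:
    """Hypothetical stand-in for a trained binary SVM: picks the larger score."""
    return lambda x: a if x[a] >= x[b] else b


LABELS = ["a", "b", "c"]
CLASSIFIERS = {(a, b): make_stub(a, b) for a, b in combinations(LABELS, 2)}

print(pairwise_classify({"a": 0.2, "b": 0.9, "c": 0.5}, LABELS, CLASSIFIERS))
```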

In the one-vs-rest method, when there are, for example, three classification destinations a, b, and c, three sets are generated: destination a versus the rest, destination b versus the rest, and destination c versus the rest. Each set is trained with the support vector machine method. In the estimation process, the learning results of these three support vector machines are used. The candidate to be estimated is evaluated by each of the three support vector machines, and the answer is the classification destination, among those that are not "the rest", for which the candidate lies farthest from the separating plane of the corresponding support vector machine. For example, if a candidate lies farthest from the separating plane in the support vector machine trained on the set "destination a versus the rest", its classification destination is estimated to be a.
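
A minimal sketch of this selection rule is shown below. Each margin function is assumed to return a signed distance from the separating plane, positive on the side of its own classification destination; the linear stubs are hypothetical. Taking the largest signed margin also returns an answer when every machine says "the rest", which is an assumption that goes beyond the description.

```python
from typing import Callable, Dict

MarginFunction = Callable[[dict], float]  # signed distance; > 0 means "this label"


def one_vs_rest_classify(x: dict,
                         margin_functions: Dict[str, MarginFunction]) -> str:
    """Return the label whose machine places x farthest on its own side."""
    margins = {label: f(x) for label, f in margin_functions.items()}
    return max(margins, key=margins.get)


# Hypothetical stand-ins for three trained machines (a/rest, b/rest, c/rest).
MARGIN_FUNCTIONS: Dict[str, MarginFunction] = {
    "a": lambda x: 0.3 * x["f"] - 0.1,
    "b": lambda x: 1.2 * x["f"] - 0.1,
    "c": lambda x: -0.5 * x["f"],
}

print(one_vs_rest_classify({"f": 1.0}, MARGIN_FUNCTIONS))
```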

How the solution estimation means 127 obtains, for each expression pair, the degree to which each solution (classification destination) is likely depends on the method that the machine learning means 123 uses for machine learning.

For example, in the embodiment of the present invention, when the machine learning means 123 uses the k-nearest neighbor method, it defines the similarity between two cases of the teacher data on the basis of the proportion of overlapping features in the feature sets extracted from those cases (that is, the proportion of features the two cases have in common), and stores the defined similarity and the cases in the learning result storage means 124 as learning result information.

Then, when the expression pair extraction means 125 extracts a new expression pair (candidate), the solution estimation means 127 refers to the similarity definition and the cases in the learning result storage means 124, selects from those cases the k cases most similar to the candidate, and estimates the classification destination decided by a majority vote among the selected k cases as the classification destination (solution) of the candidate. That is, the solution estimation means 127 takes, as the degree to which each expression pair is likely to have a given solution (classification destination), the number of votes in the majority vote among the selected k cases, here the number of votes won by the classification "should be extracted".
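
The k-nearest neighbor degree can be sketched as below. The description only says that similarity is based on the proportion of overlapping features; the concrete choice of intersection size divided by union size, the feature names, and the label strings "extract" and "skip" are assumptions for illustration.

```python
from typing import List, Set, Tuple

Example = Tuple[Set[str], str]  # (feature set, classification destination)


def overlap_similarity(a: Set[str], b: Set[str]) -> float:
    """Proportion of shared features (assumed here to be |A & B| / |A | B|)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_degree(candidate: Set[str], examples: List[Example],
               k: int, target: str = "extract") -> int:
    """Votes for `target` among the k stored cases most similar to the candidate."""
    ranked = sorted(examples,
                    key=lambda ex: overlap_similarity(candidate, ex[0]),
                    reverse=True)
    return sum(1 for _, label in ranked[:k] if label == target)


# Made-up teacher data.
EXAMPLES: List[Example] = [
    ({"a", "b"}, "extract"),
    ({"a", "c"}, "extract"),
    ({"x", "y"}, "skip"),
    ({"a", "b", "c"}, "skip"),
]

print(knn_degree({"a", "b"}, EXAMPLES, k=3))
```

With k = 3 the selected cases have similarities 1, 2/3, and 1/3, and two of them carry the label "extract", so the degree is 2.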

When the simple Bayes method is used as the machine learning method, the machine learning means 123 stores, for each case of the teacher data, the pair consisting of the solution of the case and its feature set in the learning result storage means 124 as learning result information. Then, when the expression pair extraction means 125 extracts a new expression pair (candidate), the solution estimation means 127 uses the solution and feature-set pairs in the learning result storage means 124 to calculate, on the basis of Bayes' theorem, the probability of each classification given the feature set that the feature extraction means 126 obtained for the candidate, and estimates the classification with the largest probability as the classification (solution) of the candidate. That is, the solution estimation means 127 takes, as the degree to which the candidate's feature set is likely to have a given solution, the probability of each classification, here the probability of the classification "should be extracted".
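
As a rough sketch of the simple Bayes estimate, the fragment below stores class and feature counts and computes normalized class probabilities for a candidate. The conditional-independence factorization is the usual naive Bayes assumption; the add-one smoothing, the feature names, and the label strings are assumptions not stated in the description.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Example = Tuple[Set[str], str]  # (feature set, solution)


def train(examples: Iterable[Example]):
    """Count how often each class and each (class, feature) pair occurs."""
    class_counts: Counter = Counter()
    feature_counts: Dict[str, Counter] = defaultdict(Counter)
    for features, label in examples:
        class_counts[label] += 1
        for f in features:
            feature_counts[label][f] += 1
    return class_counts, feature_counts


def class_probabilities(features: Set[str], class_counts, feature_counts) -> Dict[str, float]:
    """P(class | features) proportional to P(class) * product of P(feature | class)."""
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        log_p = math.log(n / total)
        for f in features:
            # Add-one smoothing over presence/absence (an assumption).
            log_p += math.log((feature_counts[label][f] + 1) / (n + 2))
        log_scores[label] = log_p
    top = max(log_scores.values())
    exps = {label: math.exp(s - top) for label, s in log_scores.items()}
    z = sum(exps.values())
    return {label: v / z for label, v in exps.items()}


# Made-up teacher data.
TEACHER: List[Example] = [
    ({"number_before_unit", "same_sentence"}, "extract"),
    ({"same_sentence"}, "extract"),
    ({"far_apart"}, "skip"),
]
CLASS_COUNTS, FEATURE_COUNTS = train(TEACHER)
PROBS = class_probabilities({"same_sentence"}, CLASS_COUNTS, FEATURE_COUNTS)
print(PROBS["extract"])  # used as the degree for "should be extracted"
```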

When the decision list method is used as the machine learning method, the machine learning means 123 stores in the learning result storage means 124 a list in which rules, each pairing a feature with a classification destination, are arranged for the cases of the teacher data in a predetermined order of priority. Then, when the expression pair extraction means 125 extracts a new expression pair (candidate), the solution estimation means 127 compares the features of the candidate with the feature of each rule in the list, in descending order of priority, and estimates the classification destination of the first rule whose feature matches as the classification destination (solution) of the candidate. That is, the solution estimation means 127 takes, as the degree to which the candidate's feature set is likely to have a given solution, the predetermined priority or an equivalent numerical value or scale, here the priority in the list that reflects the probability of the classification "should be extracted".
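
The lookup step of the decision list can be sketched as follows. The rule contents, the default label returned when no rule matches, and the use of the 1-based rank as the reported degree are assumptions for illustration.

```python
from typing import List, Optional, Set, Tuple

Rule = Tuple[str, str]  # (feature, classification destination), highest priority first


def decision_list_classify(features: Set[str], rules: List[Rule],
                           default: str = "skip") -> Tuple[str, Optional[int]]:
    """Return the label of the first matching rule and its 1-based rank."""
    for rank, (feature, label) in enumerate(rules, start=1):
        if feature in features:
            return label, rank
    return default, None  # no rule matched (fallback is an assumption)


# Made-up rule list, already sorted by priority.
RULES: List[Rule] = [
    ("number_before_unit", "extract"),
    ("different_sentence", "skip"),
    ("same_sentence", "extract"),
]

print(decision_list_classify({"same_sentence", "different_sentence"}, RULES))
```

The candidate matches both the second and third rules, but only the higher-priority second rule is applied.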

When the maximum entropy method is used as the machine learning method, the machine learning means 123 identifies the classifications that can be solutions from the cases of the teacher data, obtains the probability distribution over the two terms, feature set and candidate classification, that satisfies predetermined conditional expressions while maximizing the expression for entropy, and stores it in the learning result storage means 124. Then, when the expression pair extraction means 125 extracts a new expression pair (candidate), the solution estimation means 127 uses the probability distribution in the learning result storage means 124 to obtain the probability of each candidate classification for the feature set of the candidate, identifies the classification with the largest probability, and estimates that classification as the solution for the candidate. That is, the solution estimation means 127 takes, as the degree to which the candidate's feature set is likely to have a given solution, the probability of each classification, here the probability of the classification "should be extracted".
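
For the estimation step only, a maximum entropy model is commonly evaluated in its log-linear form, where each (feature, classification) pair carries a learned weight. The sketch below assumes such weights are already available; the weight values, feature names, and labels are made up, and the training procedure that maximizes entropy under the constraints is not shown.

```python
import math
from typing import Dict, Sequence, Set, Tuple

Weights = Dict[Tuple[str, str], float]  # (feature, label) -> learned weight


def maxent_probabilities(features: Set[str], weights: Weights,
                         labels: Sequence[str]) -> Dict[str, float]:
    """p(label | features) = exp(sum of weights) / normalizer (log-linear form)."""
    scores = {label: sum(weights.get((f, label), 0.0) for f in features)
              for label in labels}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    z = sum(exps.values())
    return {label: v / z for label, v in exps.items()}


# Hypothetical learned weights.
WEIGHTS: Weights = {("same_sentence", "extract"): 1.0}
P = maxent_probabilities({"same_sentence"}, WEIGHTS, ["extract", "skip"])
print(max(P, key=P.get), P["extract"])
```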

When the support vector machine method is used as the machine learning method, the machine learning means 123 identifies the classifications that can be solutions from the cases of the teacher data and divides them into positive and negative examples. Then, in a space whose dimensions are the features of the cases, and according to a predetermined function using a kernel function, it finds the hyperplane that separates the positive examples from the negative examples while maximizing the margin between them, and stores that hyperplane in the learning result storage means 124. When the expression pair extraction means 125 extracts a new expression pair (candidate), the solution estimation means 127 uses the hyperplane in the learning result storage means 124 to determine whether the feature set of the candidate lies on the positive side or the negative side of the space divided by the hyperplane, and estimates the classification determined by that result as the solution for the candidate. That is, the solution estimation means 127 takes, as the degree to which the candidate's feature set is likely to have a given solution, the distance from the separating plane into the space of the positive examples (expression pairs that should be extracted). More specifically, when expression pairs that should be extracted are the positive examples and expression pairs that should not be extracted are the negative examples, a case located on the positive side of the separating plane is judged to be a "case that should be extracted", and its distance from the separating plane is taken as its degree.

FIG. 6 is a diagram showing an example of the information extraction / display processing flow in the embodiment of the present invention. First, the information extraction / display device 1 extracts main expressions from the article group in the related article DB 14 (step S1). Next, the information extraction / display device 1 extracts trend information pairs using the extracted main expressions (step S2). Then, the information extraction / display device 1 displays the extracted trend information pairs (step S3).
(Experiment and discussion)
(1) Main expression extraction
An experiment on main expression extraction was performed using the information extraction / display device 1 of the present invention. The Okapi TF-term formula (Equation (1)) was used, and for item expressions TF_i was taken to be the product of the number of occurrences of the expression and its character-string length. In this experiment, main expressions were extracted from each of three article groups: articles related to typhoons, articles related to the major leagues, and articles related to political trends. The extraction results are shown in FIG. 7. The table in FIG. 7 lists the five expressions with the largest values of the Okapi TF-term formula.
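
The item-expression weighting described above (occurrence count multiplied by character-string length) can be sketched as follows. This fragment ranks expressions by the raw TF_i value only; the Okapi formula of Equation (1) is not reproduced in this section and is not applied here. The occurrence list is made up for illustration.

```python
from collections import Counter
from typing import List, Tuple


def rank_expressions(occurrences: List[str], top_n: int = 5) -> List[Tuple[str, int]]:
    """Score each expression as (occurrence count) x (string length) and rank."""
    counts = Counter(occurrences)
    scored = {expr: n * len(expr) for expr, n in counts.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Made-up occurrences of candidate expressions in an article group.
OCCURRENCES = ["台風"] * 3 + ["号"] * 5 + ["写真説明"]

print(rank_expressions(OCCURRENCES))
```

Multiplying by the length favors longer expressions, so a two-character expression seen three times (score 6) outranks a one-character expression seen five times (score 5).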

FIG. 7 shows that the main expressions of each field were extracted well. For example, in the typhoon data, the principal item expression "台風" (typhoon) and the unit expression "号" (No.), which indicates the ordinal number of a typhoon, were extracted. The major league data was collected around the time when the home run race between McGwire and Sosa was making headlines, and McGwire and Sosa were successfully extracted near the top. The unit expressions central to the home run race, such as "号" (No.) and "本" (a counter used for home runs), were also extracted well. In the political trend data, "内閣支持率" (cabinet approval rating) was successfully extracted as an item expression and "%" as a unit expression.

Incidentally, in the typhoon data the expression "写真説明" (photo caption) appears near the top. This suggests that the typhoon articles were probably accompanied by photographs of typhoon damage.
(2) Graphing of trend information
An experiment on graphing trend information was performed using the information extraction / display device 1 of the present invention. In this experiment, the unit expression, time expression, and item expression with the highest score values calculated by the main expression extraction unit 11 were used, and graphs were created through the processing of the trend information pair extraction unit 12 and the main trend information pair extraction / display unit 13. The resulting graphs are shown in FIGS. 8(A) to 8(C). Excel was used for graphing in this experiment. Here the time axis shows one position per obtained time expression, but it may instead be drawn at actual time intervals. Where only the day, or only the day and month, could be obtained automatically as the time expression, the month and year information was added by hand.

In the typhoon data, the main expression extraction unit 11 extracted "号" (No.) as the unit expression, "日" (day) as the time expression, and "台風" (typhoon) as the item expression, and the graph was created from these. The trend information pair extraction unit 12 extracted the places where these three expressions appear together. From the extracted data, a graph was created with the time expression on the horizontal axis and the number immediately preceding the unit expression "号" on the vertical axis. The related articles in the related article DB 14 contain typhoon data only for around September and October, so nothing is known about other periods; nevertheless, the graph in FIG. 8(A) helps in grasping which typhoon arrived when during September and October. Comparing the data for 1998 and 1999 also shows that there were more typhoons in 1999.
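
A minimal sketch of this co-occurrence extraction is given below. It treats a single sentence as the span in which the three expressions must appear together and takes only the first number before the unit and time expressions; both choices, the function name, and the sample sentences are assumptions made for illustration.

```python
import re
from typing import List, Tuple


def extract_trend_points(sentences: List[str], item: str,
                         unit: str, time_unit: str) -> List[Tuple[int, int]]:
    """Return (time value, unit value) pairs from sentences containing all three expressions."""
    unit_re = re.compile(r"(\d+)" + re.escape(unit))
    time_re = re.compile(r"(\d+)" + re.escape(time_unit))
    points = []
    for sentence in sentences:
        if item not in sentence:
            continue
        unit_match = unit_re.search(sentence)
        time_match = time_re.search(sentence)
        if unit_match and time_match:
            # x-axis: number before the time expression; y-axis: number before the unit.
            points.append((int(time_match.group(1)), int(unit_match.group(1))))
    return points


# Made-up sentences in the style of typhoon articles.
SENTENCES = [
    "台風7号は15日に上陸した。",
    "16日は晴れた。",
    "台風8号が22日に接近した。",
]

print(extract_trend_points(SENTENCES, item="台風", unit="号", time_unit="日"))
```

The resulting pairs can be handed to any plotting tool, with the first element on the horizontal axis and the second on the vertical axis.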

In the major league data, the main expression extraction unit 11 extracted "号" (No.) as the unit expression, "日" (day) as the time expression, and "マグワイア" (McGwire) as the item expression, and the graph was created from these. The trend information pair extraction unit 12 extracted the places where these three expressions appear together. From the extracted data, a graph was created with the time expression on the horizontal axis and the number immediately preceding the unit expression "号" on the vertical axis. The related articles in the related article DB 14 contain major league data only from August onward, so nothing is known about other periods; nevertheless, the graph in FIG. 8(B) shows the pace at which McGwire hit home runs from August onward.

In the political trend data, the main expression extraction unit 11 extracted, at the top rank, "%" as the unit expression, "月" (month) as the time expression, and "内閣支持率" (cabinet approval rating) as the item expression, and the graph was created from these. The trend information pair extraction unit 12 extracted the places where these three expressions appear together. From the extracted data, a graph was created with the time expression on the horizontal axis and the number immediately preceding the unit expression "%" on the vertical axis. The related articles on the cabinet approval rating in the related article DB 14 cover 1998 and 1999, and the graph in FIG. 8(C) shows the approval rating of the Obuchi Cabinet. Although the approval rating fluctuates, it is on a generally upward trend.

Although the results are omitted here, an experiment was also conducted with the following framework: the main expression extraction unit 11 extracts a plurality of unit expressions, time expressions, and item expressions; for every combination of those expressions, the trend information pair extraction unit 12 extracts trend information pairs, yielding a plurality of types of trend information pairs; and among those types, the more trend information pairs a type yields, the more useful it is judged to be as trend information, and it is extracted accordingly. This framework was useful when a useful unit expression, time expression, or item expression could not be extracted at the top rank.
(3) Sentence extraction and highlighting
An experiment on sentence extraction and highlighting for trend information was performed using the information extraction / display device 1 of the present invention. The article group related to typhoons was used. In this experiment, the unit expression, time expression, and item expression extracted at the top rank by the main expression extraction unit 11 were used, and sentence extraction and highlighting were carried out through the processing of the trend information pair extraction unit 12 and the main trend information pair extraction / display unit 13. That is, the main trend information pair extraction / display unit 13 extracted, from the related article group in the related article DB 14, the sentences containing the trend information pairs extracted by the trend information pair extraction unit 12, and highlighted the trend information in the extracted sentences.
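
The combination framework mentioned before item (3) can be sketched as follows: every (unit, time, item) combination is tried, and the combination that yields the most trend information pairs is preferred. Counting sentences in which all three expressions occur is used here as a simple stand-in for the number of extracted trend information pairs; that simplification and the sample data are assumptions.

```python
from itertools import product
from typing import List, Sequence, Tuple

Combination = Tuple[str, str, str]  # (unit expression, time expression, item expression)


def count_pairs(sentences: Sequence[str], combo: Combination) -> int:
    """Number of sentences in which all three expressions co-occur."""
    return sum(1 for s in sentences if all(expr in s for expr in combo))


def best_combination(sentences: Sequence[str], units: Sequence[str],
                     times: Sequence[str], items: Sequence[str]) -> Tuple[Combination, int]:
    """Try every combination and return the one yielding the most pairs."""
    scored: List[Tuple[Combination, int]] = [
        (combo, count_pairs(sentences, combo))
        for combo in product(units, times, items)
    ]
    return max(scored, key=lambda pair: pair[1])


# Made-up sentences and candidate main expressions.
SENTENCES = ["台風7号は15日に上陸した。", "台風8号が22日に接近した。", "写真説明 台風の被害"]

print(best_combination(SENTENCES, ["号", "本"], ["日", "月"], ["台風", "写真説明"]))
```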

For example, the unit expression, time expression, and item expression are "号" (No.), "日" (day), and "台風" (typhoon). In sentence extraction, the sentences in which these three expressions appear together were extracted, and the three expressions were highlighted in each sentence. When an expression occurs more than once in the same sentence, for example, the first occurrence is highlighted with a double underline and the others with a single underline. The result is shown in FIG. 9. In the embodiment of the present invention, the three expressions may instead be displayed in different colors as appropriate. In the highlighting example shown in FIG. 9, the extracted time expression and numerical expression are placed in front of each extracted sentence.
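
The highlighting rule can be sketched in plain text as below, using `[[...]]` in place of a double underline and `[...]` in place of a single underline. The marker syntax, the function name, and the sample sentence are assumptions for illustration; an actual display would use underlines or colors.

```python
import re
from typing import Sequence


def highlight(sentence: str, expressions: Sequence[str]) -> str:
    """Mark the first occurrence of each expression with [[ ]] and later ones with [ ]."""
    # Try longer expressions first so they are not split by shorter ones.
    ordered = sorted(expressions, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(e) for e in ordered))
    seen = set()

    def mark(match: re.Match) -> str:
        text = match.group(0)
        if text in seen:
            return "[" + text + "]"
        seen.add(text)
        return "[[" + text + "]]"

    return pattern.sub(mark, sentence)


print(highlight("台風7号と台風8号は15日に接近した。", ["号", "日", "台風"]))
# prints: [[台風]]7[[号]]と[台風]8[号]は15[[日]]に接近した。
```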

The extracted sentences concisely describe the state of each typhoon at the time, and they appear to have an effect equivalent to the important-sentence extraction studied in summarization research. That is, they state the places the typhoon passed through and, in some cases, the damage it caused, so the important descriptions of the typhoon are contained in the extracted sentences. This shows that important sentences can be extracted simply by taking the sentences in which the unit expression, the time expression, and the item expression appear together.

The seventh data item in the figure contains data for both Typhoon No. 7 and Typhoon No. 8. If, in addition to the extracted information, any other occurrence of the main expressions currently in focus is highlighted with a single underline, it is immediately apparent that the item contains more than one piece of data. The system may also choose the wrong combination of unit expression, time expression, and item expression to extract, and this highlighting helps such errors to be found quickly. Here, highlighting was applied only to the extracted sentences, but sentences that should have been extracted may remain in the articles. Applying the same highlighting to the whole article may allow such omissions to be caught. A configuration that highlights the entire original article may therefore also be adopted.

The present invention can also be implemented as a program that is read and executed by a computer. The program that realizes the present invention can be stored in an appropriate computer-readable recording medium such as a portable medium memory, a semiconductor memory, or a hard disk. It is provided recorded on such a recording medium, or provided by transmission and reception over a network via a communication interface.

FIG. 1 is a diagram showing an example of the system configuration of the present invention.
FIG. 2 is a diagram showing a configuration example of the main unit expression extraction unit, the main time expression extraction unit, and the main item expression extraction unit.
FIG. 3 is a diagram showing a configuration example of the trend information pair extraction unit.
FIG. 4 is a diagram showing an example of text data.
FIG. 5 is a diagram showing the concept of margin maximization in the support vector machine method.
FIG. 6 is a diagram showing an example of the information extraction / display processing flow.
FIG. 7 is a diagram showing the extraction results for main expressions.
FIG. 8 is a diagram showing an example of graphing trend information.
FIG. 9 is a diagram showing an example of highlighting trend information.

Explanation of symbols

1 Information extraction / display device
11 Main expression extraction unit
12 Trend information pair extraction unit
13 Main trend information pair extraction / display unit
14 Related article DB
111 Main unit expression extraction unit
112 Main time expression extraction unit
113 Main item expression extraction unit
121 Teacher data storage means
122 Solution-feature pair extraction means
123 Machine learning means
124 Learning result storage means
125 Expression pair extraction means
126 Feature extraction means
127 Solution estimation means
128 Trend information pair extraction means
200 Unit expression extraction means
201, 301, 401 Score value calculation means
202 Main unit expression extraction means
300 Time expression extraction means
302 Main time expression extraction means
400 Item expression extraction means
402 Main item expression extraction means

Claims

An information extraction / display device that extracts and displays trend information, comprising:
main expression extraction means having unit expression extraction means for extracting, from an article group related to a certain field, a unit expression that appears frequently in the entire article group as a main unit expression, time expression extraction means for extracting, from the article group, a time expression that appears frequently in the entire article group as a main time expression, and item expression extraction means for extracting, from the article group, an item expression that appears frequently in the entire article group as a main item expression, the main expression extraction means extracting the main unit expression, the main time expression, and the main item expression as main expressions;
trend information pair extraction means for extracting from the article group, based on the main expressions extracted by the main expression extraction means, a combination of the main unit expression, the main time expression, and the main item expression as a trend information pair when the main unit expression, the main time expression, and the main item expression appear together in the document without an intervening break; and
display means for displaying the trend information pairs extracted by the trend information pair extraction means.

The information extraction / display device according to claim 1, wherein
the trend information pair extraction means comprises machine learning means for extracting features from teacher data, given in advance, that stores what should be extracted as a trend information pair and what should not be extracted, learning for which features an item tends to be a trend information pair, and storing the resulting learning result data in learning result data storage means, and
the trend information pair extraction means extracts features from the trend information pairs obtained in claim 1, uses the learning result data to estimate how likely each is to be a trend information pair, and extracts only those with a large estimated likelihood as trend information pairs.

The information extraction / display device according to claim 1 or 2, wherein
the main expression extraction means extracts a plurality of main expressions, and
the display means extracts a main trend information pair from the plurality of types of trend information pairs extracted by the trend information pair extraction means based on the extracted main expressions, and displays the extracted main trend information pair.

The information extraction / display device according to any one of claims 1 to 3, wherein
the trend information pair extraction means further extracts trend information pairs from the article group based on a main expression selected from among the main expressions extracted by the main expression extraction means.

The information extraction / display device according to any one of claims 1 to 4, further comprising:
keyword input means for inputting a keyword; and
article group extraction means for extracting an article group related to the input keyword from bibliographic data stored in storage means,
wherein the main expression extraction means extracts the main expressions from the article group extracted by the article group extraction means.

The information extraction / display device according to any one of claims 1 to 5, wherein
the display means displays the trend information pairs extracted by the trend information pair extraction means as a graph with the main time expression on the horizontal axis and the numerical value of the main unit expression on the vertical axis.

The information extraction / display device according to any one of claims 1 to 6, wherein
the display means extracts from the article group the sentences containing the trend information pairs extracted by the trend information pair extraction means, and highlights the trend information pairs in the extracted sentences.