JP6059598B2 - Information extraction method, information extraction apparatus, and information extraction program

Info

Publication number: JP6059598B2
Application number: JP2013106917A
Authority: JP
Inventors: 数原 良彦; 戸田 浩之; 西岡 秀一; 鷲崎 誠司
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-05-21
Filing date: 2013-05-21
Publication date: 2017-01-11
Anticipated expiration: 2033-05-21
Also published as: JP2014228993A

Description

The present invention relates to a technique for retrieving information from documents.

Automatically extracting event information from web pages and blog posts that describe local events makes it possible to build an event information database without manual labor. Such a database can then be used for services such as event recommendation.

To extract event information from text, the techniques of Non-Patent Documents 1 to 3, for example, can be used to extract candidates for each category, such as event name, place, and date and time. In addition, if correct-answer data manually tagged for each category is available, a classifier that automatically identifies event names, places, and dates and times can be built within a supervised machine learning framework, so that these items can be extracted automatically from web documents and the like.

Non-Patent Document 1: Yamada and two others, "Japanese Named Entity Extraction Using Support Vector Machine", Transactions of Information Processing Society of Japan, Information Processing Society of Japan, January 2002, Vol. 43, No. 1, pp. 44-53
Non-Patent Document 2: Hirano and two others, "Disambiguation of Place Names Using Geographical Distance and Popularity", 70th National Convention of the Information Processing Society of Japan, Information Processing Society of Japan, 2008, pp. 2-85 to 2-86
Non-Patent Document 3: Hiroshima and three others, "Date and Time Specified Search Considering the Valid Range of Described Dates and Times", 3rd Forum on Web and Databases, 2010
Non-Patent Document 4: Taira and one other, "Predicate-Argument Structure Analysis Using Structured Learning", Proceedings of the 14th Annual Meeting of the Association for Natural Language Processing, Association for Natural Language Processing, 2008, pp. 556-559
Non-Patent Document 5: Crammer K., et al., "Online Passive-Aggressive Algorithms", Journal of Machine Learning Research, 2006, Vol. 7, pp. 551-585

Rather than applying classifiers individually, one conceivable approach is structured output learning, which gives the prediction a structure and builds a model that selects the correct combination from among the extracted candidates. In this case, a prediction model can be built using, for example, Non-Patent Document 5.

However, methods based on named entity extraction, such as Non-Patent Document 1, are not designed for extracting event names, so they can fail to obtain appropriate event name candidates. Because the prediction model of the present invention selects a combination from the given prediction candidates, if no appropriate candidate exists, an appropriate event name cannot be obtained and accuracy may decrease.

The present invention has been made in view of the above. Its object is to extract appropriate candidates while preventing an increase in learning cost when extracting information of related categories from a document.

The information extraction apparatus according to a first aspect of the present invention includes: extraction candidate storage means that stores candidates for each of a set of related categories extracted from document information; candidate removal means that, when one of the candidates extracted from the document information is a substring of another candidate of the same category, removes that candidate if the ratio of its character count to the character count of the other candidate is less than a preset removal ratio; correct answer storage means that stores the correct answer for each category; extraction model storage means that stores an extraction model for extracting the information of the related categories; feature vector calculation means that calculates, for every possible combination of the candidates of the respective categories stored in the extraction candidate storage means, a feature vector representing the features of that combination; combination acquisition means that acquires the combination that maximizes a score calculated using the extraction model stored in the extraction model storage means and the feature vector; and extraction model update means that reads the correct combination from the correct answer storage means, calculates its score, and updates the extraction model stored in the extraction model storage means when the loss of the score of the combination acquired by the combination acquisition means relative to the score of the correct combination is within a predetermined range. The apparatus further includes: prediction target extraction candidate storage means that stores candidates for each category extracted from prediction target document information; and information extraction means that calculates, for every possible combination of the candidates of the respective categories stored in the prediction target extraction candidate storage means, a feature vector representing the features of that combination, and acquires the combination that maximizes a score calculated using the extraction model stored in the extraction model storage means and the feature vector.

The above information extraction apparatus further includes cost storage means that stores error costs for the case where a candidate is a substring of the correct answer and for the case where it is not. The extraction model update means reads the error costs from the cost storage means, obtains a cost according to the degree to which the candidate matches the correct answer, and uses the obtained cost as the predetermined range.

The information extraction method according to a second aspect of the present invention is an information extraction method executed by a computer, and includes the steps of: when one of the candidates stored in extraction candidate storage means, which stores candidates for each of a set of related categories extracted from document information, is a substring of another candidate of the same category, removing that candidate if the ratio of its character count to the character count of the other candidate is less than a preset removal ratio; calculating, for every possible combination of the candidates of the respective categories stored in the extraction candidate storage means, a feature vector representing the features of that combination; acquiring the combination that maximizes a score calculated using the extraction model stored in extraction model storage means and the feature vector; and reading the correct combination from correct answer storage means, which stores the correct answer for each category, calculating its score, and updating the extraction model stored in the extraction model storage means when the loss of the score of the combination acquired in the acquiring step relative to the score of the correct combination is within a predetermined range. The method further includes the step of calculating, for every possible combination of the candidates of the respective categories stored in prediction target extraction candidate storage means, which stores candidates for each category extracted from prediction target document information, a feature vector representing the features of that combination, and acquiring the combination that maximizes a score calculated using the extraction model stored in the extraction model storage means and the feature vector.

In the above information extraction method, the step of updating the extraction model reads the error costs from cost storage means, which stores error costs for the case where a candidate is a substring of the correct answer and for the case where it is not, obtains a cost according to the degree to which the candidate matches the correct answer, and uses the obtained cost as the predetermined range.

The information extraction program according to a third aspect of the present invention causes a computer to operate as each means of the above information extraction apparatus.

According to the present invention, when information of related categories is extracted from a document, appropriate candidates can be extracted and an increase in learning cost can be prevented.

FIG. 1 is a functional block diagram showing the configuration of the event information extraction apparatus in the present embodiment.
FIG. 2 is a diagram showing an example of data stored in the document database.
FIG. 3 is a diagram showing an example of data stored in the extraction candidate database.
FIG. 4 is a diagram showing an example of data stored in the correct answer database.
FIG. 5 is a diagram showing an example of data stored in the cost database.
FIG. 6 is a flowchart showing the processing flow of the event name candidate extraction function.
FIG. 7 is a flowchart showing the processing flow of the event extraction model learning function.
FIG. 8 is a diagram showing an example of the event extraction model stored in the event extraction model database.
FIG. 9 is a diagram showing an example of data stored in the prediction target document database.
FIG. 10 is a diagram showing an example of data stored in the prediction target extraction candidate database.
FIG. 11 is a flowchart showing the processing flow of the event extraction function.
FIG. 12 is a diagram showing an example of event information stored in the event database.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a functional block diagram showing the configuration of the event information extraction apparatus in the present embodiment. The event information extraction apparatus shown in FIG. 1 includes a document DB 10, an event name candidate extraction function 15 (corresponding to the candidate removal means), an extraction candidate DB 20 (corresponding to the extraction candidate storage means), a correct answer DB 30 (corresponding to the correct answer storage means), a cost DB 40 (corresponding to the cost storage means), an event extraction model learning function 50 (corresponding to the feature vector calculation means, the combination acquisition means, and the extraction model update means), an event extraction model DB 60 (corresponding to the extraction model storage means), a prediction target document DB 70, a prediction target extraction candidate DB 80 (corresponding to the prediction target extraction candidate storage means), an event extraction function 90 (corresponding to the information extraction means), and an event DB 100. Each unit of the apparatus may be implemented on a computer equipped with a processing unit, a storage device, and the like, with the processing of each unit executed by a program. This program is stored in a storage device of the event information extraction apparatus. It can also be recorded on a recording medium such as a magnetic disk, an optical disk, or a semiconductor memory, or provided over a network.

First, the information stored in the document DB 10, the extraction candidate DB 20, the correct answer DB 30, and the cost DB 40, which are used in the process of generating the event extraction model, is described.

The document DB 10 stores the body text (text data) from which event information is to be extracted, with a document ID assigned to each entry. FIG. 2 shows an example of data stored in the document DB 10.

The extraction candidate DB 20 stores, for each document ID, the event name candidates, place candidates, and date and time candidates extracted from each body text stored in the document DB 10. FIG. 3 shows an example of data stored in the extraction candidate DB 20. In the example of FIG. 3, two event name candidates, "Event" and "Otaru Long Christmas 2012 Final", are stored for the body text with document ID 1. Although not shown in FIG. 3, each event name candidate, place candidate, and date and time candidate carries its position of appearance in the body text. Event name candidates are extracted by the event name candidate extraction function 15 from the body text stored in the document DB 10. Place candidates can be extracted using Non-Patent Document 2 and date and time candidates using Non-Patent Document 3, so the data to be stored in the extraction candidate DB 20 can be generated from the body text stored in the document DB 10. If the event name, place, and date and time can be extracted, events can be recommended based on place and date and time, so the present embodiment treats these three items as one unit of event information. The event name, place, and date and time are called the categories of event information. The present embodiment uses these three categories, but other information (for example, fee or organizer) can be handled within the same framework.

The correct answer DB 30 stores, for each body text stored in the document DB 10, the correct answer for each category, that is, the correct event name, the correct place, and the correct date and time. FIG. 4 shows an example of data stored in the correct answer DB 30. These correct answers are assumed to have been prepared manually in advance. As in the extraction candidate DB 20, each entry carries its position of appearance in the body text.

The cost DB 40 stores a cost for each error type. FIG. 5 shows an example of data stored in the cost DB 40. In FIG. 5, substring is the error cost applied when the candidate is a substring of the correct event name, and others is the error cost applied to all other errors. The value of substring is assumed to be set smaller than the value of others. These costs are assumed to have been set manually in advance.

Next, the processing flow of the event name candidate extraction function 15 is described.

FIG. 6 is a flowchart showing the processing flow of the event name candidate extraction function 15.

First, an unprocessed record is selected from the document DB 10 (step S11).

Event name candidates are extracted from the text data of the selected record (step S12). To extract them, the text data is segmented into words and each word is given part-of-speech information (for example, using Takeshi Fuchi, "Japanese Morphological Analyzer using Word Co-occurrence - JTAG", COLING-ACL, pp. 409-413, 1998). Based on the part-of-speech information, every sequence of consecutive nouns, including words joined by the particle 「の」 ("no"), is added to the event name candidate set S. For example, when the text is segmented into the consecutive nouns 「第２０回」 (20th), 「横須賀」 (Yokosuka), and 「祭り」 (festival), the following six event name candidates are extracted.

・第２０回 (20th)
・第２０回横須賀 (20th Yokosuka)
・第２０回横須賀祭り (20th Yokosuka Festival)
・横須賀 (Yokosuka)
・横須賀祭り (Yokosuka Festival)
・祭り (Festival)
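The enumeration in this example can be sketched as follows. This is a minimal illustration that assumes the text has already been segmented into a run of consecutive nouns; the function name `enumerate_candidates` is hypothetical, and morphological analysis and the handling of the particle 「の」 are not modeled.

```python
def enumerate_candidates(nouns):
    """Return every contiguous run of the noun sequence, joined into one string.

    Assumption: `nouns` is a single run of consecutive nouns produced by a
    morphological analyzer (not shown here).
    """
    return [
        "".join(nouns[i:j])
        for i in range(len(nouns))
        for j in range(i + 1, len(nouns) + 1)
    ]


candidates = enumerate_candidates(["第２０回", "横須賀", "祭り"])
for candidate in candidates:
    print(candidate)
```

A run of n nouns yields n(n+1)/2 candidates, which is why the number of candidates grows quickly for long noun sequences.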

Then, some of the candidates that are substrings of other candidates are removed from the obtained event name candidate set S (step S13). As the number of event name candidates grows, the number of possible candidate combinations across all categories grows with it, which raises the cost of model generation in the event extraction model learning function 50. The present embodiment therefore removes some of the event name candidates. Removal uses a removal ratio λ (0 < λ ≦ 1) set manually in advance: when an event name candidate is a substring of another event name candidate, and the ratio of the substring's character count to the character count of the other candidate is less than λ, the substring is removed. For example, when λ = 1, every event name candidate in the above example other than 「第２０回横須賀祭り」 (20th Yokosuka Festival) is removed.
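A minimal sketch of the removal rule in step S13 follows. The function name `remove_short_substrings` is hypothetical, and the sketch assumes a candidate is removed as soon as the ratio falls below λ for any longer candidate that contains it; the text does not say how several containing candidates should be combined.

```python
def remove_short_substrings(candidates, removal_ratio):
    """Drop candidates that are short substrings of another candidate.

    Assumption: one containing candidate with length ratio below
    `removal_ratio` is enough to remove the substring.
    """
    kept = []
    for cand in candidates:
        removable = any(
            cand != other
            and cand in other
            and len(cand) / len(other) < removal_ratio
            for other in candidates
        )
        if not removable:
            kept.append(cand)
    return kept


candidates = [
    "第２０回", "第２０回横須賀", "第２０回横須賀祭り",
    "横須賀", "横須賀祭り", "祭り",
]
print(remove_short_substrings(candidates, removal_ratio=1.0))
```

With `removal_ratio=1.0` only the longest candidate survives, as in the example in the text; smaller values keep more of the shorter candidates.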

The event name candidates obtained in step S13 are written to the corresponding record of the extraction candidate DB 20 (step S14).

If the document DB 10 still contains an unprocessed record (Yes in step S15), the process returns to step S11 and fetches the next record. If no unprocessed record remains (No in step S15), the process ends.

Next, the processing flow of the event extraction model learning function 50 is described.

FIG. 7 is a flowchart showing the processing flow of the event extraction model learning function 50.

First, the weight vector w is initialized to w = (0, 0, 0, ..., 0)^T and the iteration counter t is initialized to t ← 1 (step S21). The weight vector w is the event extraction model stored in the event extraction model DB 60. Its dimensionality is M, the same as that of the feature vectors described later.

Next, one record is selected at random from the correct answer DB 30 (step S22). Let d be the document ID of the selected record.

Next, the record with document ID d is selected from the extraction candidate DB 20, the set of all possible combinations of candidates across the categories (hereinafter, the "category combination set") is created, and a feature vector is created for every combination in the category combination set (step S23). The category combination set created from the record with document ID 2 in the extraction candidate DB 20 shown in FIG. 3 is as follows.

20th Yokosuka Festival - Yokosuka City, Kanagawa Prefecture - October 20, 2012
20th Yokosuka Festival - Yokosuka City, Kanagawa Prefecture - December 20, 2012
...
Festival - Tokyo - January 1, 2013

Thus, the category combination set in the present embodiment is the set of all possible combinations of event name candidates, place candidates, and date and time candidates. In the example of document ID 2 in FIG. 3, there are two event name candidates, two place candidates, and three date and time candidates, so there are 2 × 2 × 3 = 12 combinations. The combination whose event name, place, and date and time match those stored in the selected record of the correct answer DB 30 is treated as the correct answer. Every other combination is treated as an error combination, and information on which categories are wrong is kept with it. For example, if the place and the date and time differ from the correct answer, the combination is judged to be a place and date-and-time error. An error combination therefore has one or more wrong categories.
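The construction of the category combination set and the labeling of wrong categories can be sketched as below. The candidate lists follow the document ID 2 example; the `correct` tuple and the function name `wrong_categories` are hypothetical, since the contents of the correct answer DB for this document are not given in the text.

```python
from itertools import product

event_names = ["20th Yokosuka Festival", "Festival"]
places = ["Yokosuka City, Kanagawa Prefecture", "Tokyo"]
dates = ["October 20, 2012", "December 20, 2012", "January 1, 2013"]

# Every possible (event name, place, date) combination.
combinations = list(product(event_names, places, dates))

# Hypothetical correct answer, used only for illustration.
correct = (
    "20th Yokosuka Festival",
    "Yokosuka City, Kanagawa Prefecture",
    "October 20, 2012",
)


def wrong_categories(combo, correct):
    """Return the names of the categories in which `combo` differs from `correct`."""
    names = ("event_name", "place", "date")
    return [name for name, c, g in zip(names, combo, correct) if c != g]


print(len(combinations))
print(wrong_categories(("20th Yokosuka Festival", "Tokyo", "January 1, 2013"), correct))
```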

In step S23, a feature vector Φ(y, x) representing the features of each category combination is also created from the category combination set and the body text obtained from the document DB 10. Here, x is the vector representation of the body text for the document ID, and y is an element of the category combination set (a category combination). Φ(y, x) is an M-dimensional vector made up of the outputs of M feature functions φ(y, x), each of which takes y and x as input and captures some aspect of how plausible the combination is as event information. One example of a feature function φ(y, x) captures the property that "the three candidates in y appear close together in the document" by outputting 1 when the three expressions appear within 50 characters of each other and 0 otherwise. Another example outputs 1 when a character string contained in the event name also appears elsewhere in the body text and 0 otherwise. For other basic feature functions based on character strings, the method of Non-Patent Document 4, for example, can be used.
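The two example feature functions can be sketched as follows. This is an illustration under stated assumptions rather than the embodiment's implementation: positions are taken from the first occurrence found by `str.find` instead of the stored appearance positions, "within 50 characters" is read as the spread of start offsets, the second feature is simplified to "the full event name occurs at least twice", and the sample text and all function names are hypothetical.

```python
def within_window(positions, window=50):
    """1 if the start offsets of all candidates lie within `window` characters, else 0."""
    return 1 if max(positions) - min(positions) <= window else 0


def name_repeats(event_name, text):
    """1 if the event name occurs at least twice in the text, else 0 (simplified)."""
    return 1 if text.count(event_name) >= 2 else 0


def feature_vector(combo, text, window=50):
    """Build a 2-dimensional feature vector for (event name, place, date)."""
    # Assumption: use the first occurrence of each candidate as its position.
    positions = [text.find(part) for part in combo]
    return [within_window(positions, window), name_repeats(combo[0], text)]


# Hypothetical body text and combination.
text = "The Harbor Fair opens on May 3 at City Park. The Harbor Fair is free."
combo = ("Harbor Fair", "City Park", "May 3")
phi = feature_vector(combo, text)
print(phi)
```

In practice M would be much larger than 2, with one entry per feature function.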

Next, the cost of an event name error is calculated using the costs stored in the cost DB 40 (step S24). Specifically, when the event name in a category combination is a substring of the correct event name, the value of substring in the cost DB 40 is taken as cost_substring, and the cost is calculated using equation (1).

Here, length_substring is the length of the substring and length_{correct_string} is the length of the correct event name. Because the cost grows as the substring becomes shorter, it acts as a penalty term that keeps the learned model from selecting such event name candidates.

When the event name in a category combination is not a substring of the correct event name, the value of others in the cost DB 40 is used as the cost. Because the value of others is set larger than the value of substring, the learned model prefers an error that is a substring of the correct event name over an error that is not, and, among substring errors, prefers the string closer to the correct event name.
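The cost rule of step S24 can be illustrated with the sketch below. The exact form of equation (1) is not reproduced in this text, so the scaling `cost_substring * (1 - length_substring / length_correct_string)` is an assumption chosen only because it grows as the substring gets shorter and stays below `others`; the values in `COST_DB` and the function name `event_name_cost` are likewise hypothetical.

```python
# Hypothetical cost values; the text only requires substring < others.
COST_DB = {"substring": 0.5, "others": 1.0}


def event_name_cost(candidate, correct, cost_db=COST_DB):
    """Cost of choosing `candidate` when `correct` is the correct event name.

    Assumed form for the substring case (not taken from the source):
    cost_substring * (1 - len(candidate) / len(correct)).
    """
    if candidate == correct:
        return 0.0
    if candidate in correct:
        return cost_db["substring"] * (1 - len(candidate) / len(correct))
    return cost_db["others"]


correct_name = "第２０回横須賀祭り"
for name in ["第２０回横須賀祭り", "横須賀祭り", "祭り", "花火大会"]:
    print(name, event_name_cost(name, correct_name))
```

Under this assumed form, a shorter substring receives a larger cost than a longer one, and any non-substring error receives the largest cost.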

Next, the category combination that attains the maximum score under the current weight vector w is obtained (step S25). The maximum-score category combination is computed by equation (2).

Here, Y_t is the category combination set for the document selected in the t-th iteration, y_t is the correct category combination, and x_t is the body text of that document. Note that the cost term need not be added.
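For orientation, one form of equation (2) that is consistent with the symbols defined here is a cost-augmented argmax. This is an assumed reconstruction, not the equation as published; the symbol \hat{y}_t for the selected combination and the argument order of cost are introduced only for illustration.

```latex
% Assumed form of equation (2): cost-augmented maximum-score combination
\hat{y}_t = \operatorname*{arg\,max}_{y \in Y_t}
  \left( w^{\top} \Phi(y, x_t) + \mathrm{cost}(y, y_t) \right)

% Variant without the cost term, as the text permits
\hat{y}_t = \operatorname*{arg\,max}_{y \in Y_t} \; w^{\top} \Phi(y, x_t)
```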

Next, the loss of the maximum-score category combination obtained in step S25 is calculated, and if the loss is greater than 0, the weight vector w is updated (step S26). The loss l_t in the t-th iteration is computed by equation (3).

When the loss l_t > 0, the weight vector w is updated according to l_t. The method of Non-Patent Document 5, for example, can be used for this update.
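Steps S25 and S26 can be sketched as a single passive-aggressive style update. The loss used here, score of the selected combination minus score of the correct combination plus the cost, and the step size `loss / ||ΔΦ||²` are assumptions modeled on the basic passive-aggressive update; they are not taken from equation (3), and the function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def pa_step(w, feats, costs, correct_idx):
    """One update of the weight vector.

    feats[i] is the feature vector of combination i, costs[i] its cost
    (0 for the correct combination). Returns (new_w, loss).
    """
    # Cost-augmented argmax over all combinations (assumed form).
    pred = max(range(len(feats)), key=lambda i: dot(w, feats[i]) + costs[i])
    loss = dot(w, feats[pred]) - dot(w, feats[correct_idx]) + costs[pred]
    if loss <= 0:
        return w, loss  # passive: the model already satisfies the margin
    diff = [c - p for c, p in zip(feats[correct_idx], feats[pred])]
    norm_sq = dot(diff, diff)
    if norm_sq == 0:
        return w, loss  # identical features: no direction to move in
    tau = loss / norm_sq  # aggressive: smallest step that removes the loss
    return [wi + tau * di for wi, di in zip(w, diff)], loss


w0 = [0.0, 0.0]
feats = [[1.0, 0.0], [0.0, 1.0]]  # combination 0 is the correct one
costs = [0.0, 1.0]
w1, loss1 = pa_step(w0, feats, costs, correct_idx=0)
w2, loss2 = pa_step(w1, feats, costs, correct_idx=0)
print(w1, loss1, w2, loss2)
```

After the first update the correct combination outscores the wrong one by exactly its cost, so the second call leaves the weights unchanged.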

Then the iteration counter t is incremented (step S27). If t is less than or equal to the predetermined number of iterations T (Yes in step S28), the process returns to step S22. If t exceeds T (No in step S28), the weight vector w is written to the event extraction model DB 60 (step S29).

Next, the event extraction function 90 is described. The event extraction function 90 uses the event extraction model stored in the event extraction model DB 60 to extract event information from the information stored in the prediction target document DB 70 and the prediction target extraction candidate DB 80.

The event extraction model DB 60 stores the event extraction model obtained by the event extraction model learning function 50. The event extraction model consists of a weight vector w = (w_1, w_2, ..., w_M)^T over the M-dimensional features. FIG. 8 shows an example of the event extraction model stored in the event extraction model DB 60.

Like the document DB 10, the prediction target document DB 70 stores the body text from which event information is to be extracted, with a document ID assigned to each entry. FIG. 9 shows an example of data stored in the prediction target document DB 70.

Like the extraction candidate DB 20, the prediction target extraction candidate DB 80 stores the event name candidates, place candidates, and date and time candidates extracted from each body text stored in the prediction target document DB 70. FIG. 10 shows an example of data stored in the prediction target extraction candidate DB 80.

Next, the processing flow of the event extraction function 90 will be described.

FIG. 11 is a flowchart showing the processing flow of the event extraction function 90.

First, an unprocessed record is selected from the prediction target document DB 70 (step S31). Let d' be the document ID of the selected record.

Records whose document ID is d' are selected from the prediction target extraction candidate DB 80 to create a category combination set, and a feature vector is created for every category combination in the set (step S32). The feature vector Φ(y, x) is created by the same processing as step S13 of the event extraction model learning function 50.
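Step S32 can be illustrated with a minimal Python sketch. It assumes that the category combination set is the Cartesian product of the event name, place, and date/time candidates of one document. The function names, the sample data, and the toy feature function (one binary feature per category) are hypothetical and stand in for the feature design of step S13, which is not reproduced here.

```python
from itertools import product


def category_combinations(event_names, places, dates):
    """Return every (event name, place, date) combination of the candidates."""
    return list(product(event_names, places, dates))


def feature_vector(y, x):
    """Toy stand-in for Phi(y, x) (an assumption, not the features of step S13):
    one binary feature per category, 1.0 when the candidate occurs in text x."""
    return [1.0 if candidate in x else 0.0 for candidate in y]


# Hypothetical candidates for a single document d'.
candidates = {
    "event": ["Summer Festival", "Festival"],
    "place": ["Central Park"],
    "date": ["2013-08-01", "2013-08-02"],
}
x = "Summer Festival will be held at Central Park on 2013-08-01."

Y = category_combinations(candidates["event"], candidates["place"], candidates["date"])
features = {y: feature_vector(y, x) for y in Y}
```

The number of combinations is the product of the candidate counts per category, which is why reducing the number of event name candidates directly reduces the work done in this step.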

Next, the category combination that gives the maximum score is obtained using the event extraction model stored in the event extraction model DB 60 (step S33). Specifically, as shown in equation (4), the inner product of the weight vector w stored in the event extraction model DB 60 and each feature vector Φ(y, x) created in step S32 is calculated, and the category combination that gives the maximum score is obtained.
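Equation (4) is not reproduced in this text. Based on the description of step S33 (the inner product of w and Φ(y, x), maximized over the category combination set), it can plausibly be written as follows. The symbol \hat{y} for the selected combination is an assumption introduced here.

```latex
\hat{y} = \operatorname*{arg\,max}_{y \in Y_{\mathrm{test}}} \; w^{\top} \Phi(y, x) \qquad (4)
```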

Here, Y_test is the category combination set of the input document, and x is the text information of the input document.

Each category of the category combination obtained in step S33 is output to the event DB 100 (step S34).

If the prediction target document DB 70 still contains an unprocessed record (Yes in step S35), the process returns to step S31 and the next record is selected. If no unprocessed record remains (No in step S35), the process ends.
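The flow of steps S31 to S35 can be sketched in Python as below. This is an illustration under stated assumptions, not the implementation of the embodiment: the databases are modeled as plain dictionaries, every category is assumed to have at least one candidate, and `toy_features`, the weights, and the sample document are hypothetical.

```python
from itertools import product


def score(w, phi):
    """Inner product of the weight vector w and a feature vector phi."""
    return sum(wi * fi for wi, fi in zip(w, phi))


def extract_events(documents, candidates, w, feature_fn):
    """For each document, pick the category combination with the maximum score."""
    event_db = {}
    for doc_id, text in documents.items():                     # S31, S35
        c = candidates[doc_id]
        combos = product(c["event"], c["place"], c["date"])    # S32
        best = max(combos, key=lambda y: score(w, feature_fn(y, text)))  # S33
        event_db[doc_id] = dict(zip(("event", "place", "date"), best))   # S34
    return event_db


def toy_features(y, x):
    """Hypothetical features: relative length of each candidate found in x."""
    return [len(c) / len(x) if c in x else 0.0 for c in y]


documents = {"d1": "Summer Festival will be held at Central Park on 2013-08-01."}
candidates = {
    "d1": {
        "event": ["Festival", "Summer Festival"],
        "place": ["Central Park"],
        "date": ["2013-08-02", "2013-08-01"],
    }
}
events = extract_events(documents, candidates, [1.0, 1.0, 1.0], toy_features)
```

With these toy features and equal weights, the longer event name that occurs in the text and the date that occurs in the text are selected for document d1.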

FIG. 12 shows an example of the event information stored in the event DB 100. The event DB 100 stores the event information extracted for each document ID.

As described above, according to the present embodiment, when an event name candidate is a substring of another event name candidate and the ratio of the number of characters of the substring to the number of characters of the other candidate is less than the removal ratio λ, the event name candidate extraction function 15 removes the substring. This yields appropriate event name candidates and reduces their number, which prevents an increase in learning cost.
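The removal rule can be sketched in Python as follows. The function name and the sample candidates are hypothetical, and the value 0.5 for the removal ratio λ is only illustrative. Removing a candidate as soon as the condition holds with respect to any longer candidate is an assumption about how multiple containing candidates are handled.

```python
def remove_short_substrings(candidates, removal_ratio):
    """Drop a candidate that is a substring of another candidate when
    len(candidate) / len(other) is below removal_ratio (lambda)."""
    kept = []
    for c in candidates:
        removable = any(
            c != other and c in other and len(c) / len(other) < removal_ratio
            for other in candidates
        )
        if not removable:
            kept.append(c)
    return kept


names = ["Summer Festival 2013", "Summer Festival", "Festival"]
filtered = remove_short_substrings(names, 0.5)
# "Summer Festival" (15 of 20 characters, ratio 0.75) is kept;
# "Festival" (8 of 20 characters, ratio 0.4) is removed.
```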

Furthermore, according to the present embodiment, the error cost is determined according to the ratio at which an event name candidate matches the correct event name. For candidates that are substrings of the correct event name, the error cost can therefore be set higher for shorter substrings, while still being lower than the error cost for candidates that are not substrings.
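One cost function with these two properties is sketched below. The formula is an assumption chosen only to satisfy the stated properties (higher cost for shorter substrings, lower cost than for non-substring errors) and is not the formula of the embodiment. The parameters `cost_substring` and `cost_other` are hypothetical names for the two error costs held in the cost DB 40, with `cost_substring` assumed not to exceed `cost_other`.

```python
def error_cost(candidate, correct, cost_substring, cost_other):
    """Illustrative error cost of predicting `candidate` when `correct` is right."""
    if candidate == correct:
        return 0.0
    if candidate in correct:
        # Substring of the correct answer: scale the cost by the unmatched share.
        match_ratio = len(candidate) / len(correct)
        return cost_substring * (1.0 - match_ratio)
    # Not a substring of the correct answer: full non-substring cost.
    return cost_other


correct = "Summer Festival 2013"
cost_long = error_cost("Summer Festival", correct, 1.0, 2.0)
cost_short = error_cost("Festival", correct, 1.0, 2.0)
cost_wrong = error_cost("Concert", correct, 1.0, 2.0)
```

In this sketch the costs are ordered as cost_long < cost_short < cost_wrong, which matches the ordering described in the paragraph above.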

10 ... Document DB
15 ... Event name candidate extraction function
20 ... Extraction candidate DB
30 ... Correct answer DB
40 ... Cost DB
50 ... Event extraction model learning function
60 ... Event extraction model DB
70 ... Prediction target document DB
80 ... Prediction target extraction candidate DB
90 ... Event extraction function
100 ... Event DB

Claims

An information extraction apparatus comprising:
extraction candidate storage means for extracting and storing candidates for each related category from document information;
candidate removal means for removing, from among the candidates extracted from the document information, a candidate that is a substring of another candidate of the same category when the ratio of the number of characters of the candidate to the number of characters of the other candidate is less than a preset removal ratio;
correct answer storage means for storing the correct answer for each category;
extraction model storage means for storing an extraction model for extracting the related category information;
feature vector calculation means for calculating, for every possible combination of the candidates of the respective categories stored in the extraction candidate storage means, a feature vector representing the features of the combination;
combination acquisition means for acquiring the combination that maximizes a score calculated using the extraction model stored in the extraction model storage means and the feature vector;
extraction model update means for reading the combination of correct answers from the correct answer storage means, calculating its score, and updating the extraction model stored in the extraction model storage means when the loss of the score of the combination acquired by the combination acquisition means relative to the score of the correct combination is within a predetermined range;
prediction target extraction candidate storage means for extracting and storing candidates for each category from prediction target document information; and
information extraction means for calculating, for every possible combination of the candidates of the respective categories stored in the prediction target extraction candidate storage means, a feature vector representing the features of the combination, and acquiring the combination that maximizes the score calculated using the extraction model stored in the extraction model storage means and the feature vector.

The information extraction apparatus according to claim 1, further comprising cost storage means for storing error costs for the case where a candidate is a substring of the correct answer and for the case where it is not,
wherein the extraction model update means reads the error costs from the cost storage means, obtains a cost according to the ratio at which the candidate matches the correct answer, and sets the obtained cost as the predetermined range.

An information extraction method executed by a computer, comprising the steps of:
removing, from among the candidates stored in extraction candidate storage means that extracts and stores candidates for each related category from document information, a candidate that is a substring of another candidate of the same category when the ratio of the number of characters of the candidate to the number of characters of the other candidate is less than a preset removal ratio;
calculating, for every possible combination of the candidates of the respective categories stored in the extraction candidate storage means, a feature vector representing the features of the combination;
acquiring the combination that maximizes a score calculated using the extraction model stored in extraction model storage means and the feature vector;
reading the combination of correct answers from correct answer storage means storing the correct answer for each category, calculating its score, and updating the extraction model stored in the extraction model storage means when the loss of the score of the combination acquired in the acquiring step relative to the score of the correct combination is within a predetermined range; and
calculating, for every possible combination of the candidates of the respective categories stored in prediction target extraction candidate storage means that extracts and stores candidates for each category from prediction target document information, a feature vector representing the features of the combination, and acquiring the combination that maximizes the score calculated using the extraction model stored in the extraction model storage means and the feature vector.

The information extraction method according to claim 3, wherein the step of updating the extraction model reads the error costs from cost storage means storing error costs for the case where a candidate is a substring of the correct answer and for the case where it is not, obtains a cost according to the ratio at which the candidate matches the correct answer, and sets the obtained cost as the predetermined range.

An information extraction program for operating a computer as each means of the information extraction apparatus according to claim 1.