JP2018014003A

JP2018014003A - Item value extraction model learning device, item value extraction device, method and program

Info

Publication number: JP2018014003A
Application number: JP2016143807A
Authority: JP
Inventors: Itsumi Saito; Kugatsu Sadamitsu; Hisako Asano; Yoshihiro Matsuo
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-07-21
Filing date: 2016-07-21
Publication date: 2018-01-25
Anticipated expiration: 2036-07-21
Also published as: JP6665050B2

Abstract

PROBLEM TO BE SOLVED: To learn, with high coverage, an extraction model for extracting the value of an item to be extracted. SOLUTION: A designated item extraction unit 24 extracts, from each piece of structured data in a text data group, the value of an item that matches a designated item name list as the value of the item to be extracted. A pseudo-supervised data creation unit 26 creates, as pseudo-supervised data, text annotated to indicate the extracted value of the item to be extracted. An extraction model learning unit 30 learns an extraction model for extracting the value of the item to be extracted from text, on the basis of features extracted from the text of the pseudo-supervised data and the annotations added to the text. SELECTED DRAWING: Figure 14

Description

The present invention relates to an item value extraction model learning device, an item value extraction device, a method, and a program, and more particularly to an item value extraction model learning device, an item value extraction device, a method, and a program for extracting the value of an item to be extracted from text.

Conventionally, a synonym relation extraction method targeting entry-redirect pairs in Wikipedia (R) is known (Non-Patent Document 1). This method extracts words in a synonymous relation by using the redirect relations of Wikipedia (R).

A method that uses Wikipedia (R) to extract person aliases and assign labels for person identification is also known (Non-Patent Document 2). This method defines patterns of expressions that tend to introduce aliases in the body text and extracts an alias from each span that matches a pattern. For example, aliases are extracted using the patterns “Xと呼ばれ” (“called X”) and “Xとも称され” (“also referred to as X”).

[Non-Patent Document 1]: Junichi Ohno et al., “Extraction of synonym relations between entry and redirect in Wikipedia”, Proceedings of the 17th Annual Meeting of the Association for Natural Language Processing, pp. 296-299, March 2011.
[Non-Patent Document 2]: Daiki Saito et al., “Extraction of Person Alias Using Wikipedia and Labeling for Person Discrimination”, Proceedings of the 20th Annual Meeting of the Association for Natural Language Processing, pp. 63-66, March 2014.

However, alternative names for a title are often written in the body of the article and therefore cannot be covered by redirect relations alone, which is all that the method of Non-Patent Document 1 uses.

Actual description patterns for alternative names are diverse, so manually defined patterns do not achieve high coverage, while expanding the patterns by hand is costly. Indeed, Non-Patent Document 2 targets only aliases of person names, whereas Wikipedia (R) contains many alternative names for things other than persons.

The present invention has been made in view of the above circumstances, and an object thereof is to provide an item value extraction model learning device, method, and program capable of learning, with high coverage, an extraction model for extracting the value of an item to be extracted.

Another object of the present invention is to provide an item value extraction device, method, and program capable of extracting the value of an item to be extracted with high coverage.

To achieve the above object, the item value extraction model learning device according to the first invention includes an input unit, a designated item extraction unit, a pseudo-teacher data creation unit, and an extraction model learning unit. The input unit accepts an item name designated as the item name that defines the item to be extracted, and a text data group consisting of pairs of structured data and text, where the structured data includes a value and pairs of an item name and an item value relating to that value, and the text contains the value. The designated item extraction unit extracts, from each piece of structured data in the text data group, the value of an item that matches the designated item name as the value of the item to be extracted. The pseudo-teacher data creation unit creates, as pseudo-teacher data, the text annotated to indicate the value of the item to be extracted that was extracted by the designated item extraction unit. The extraction model learning unit learns an extraction model for extracting the value of the item to be extracted from text, on the basis of features extracted from the text of the pseudo-teacher data created by the pseudo-teacher data creation unit and the annotations attached to that text.

In the item value extraction model learning method according to the second invention, an input unit accepts an item name designated as the item name that defines the item to be extracted, and a text data group consisting of pairs of structured data and text, where the structured data includes a value and pairs of an item name and an item value relating to that value, and the text contains the value. A designated item extraction unit extracts, from each piece of structured data in the text data group, the value of an item that matches the designated item name as the value of the item to be extracted. A pseudo-teacher data creation unit creates, as pseudo-teacher data, the text annotated to indicate the value of the item to be extracted that was extracted by the designated item extraction unit. An extraction model learning unit learns an extraction model for extracting the value of the item to be extracted from text, on the basis of features extracted from the text of the pseudo-teacher data created by the pseudo-teacher data creation unit and the annotations attached to that text.

The item value extraction device according to the third invention includes an extraction model storage unit, a feature extraction unit, and an item value extraction unit. The extraction model storage unit stores an extraction model for extracting the value of an item to be extracted from text. The extraction model is learned in advance on the basis of features extracted from the text of pseudo-teacher data and the annotations attached to that text. The pseudo-teacher data is text drawn from a text data group, which consists of pairs of structured data (including a value and pairs of an item name and an item value relating to that value) and text containing the value; in this text, an annotation indicating the value of the item to be extracted is attached to the value of the item whose name is designated as defining the item to be extracted, that value having been extracted from each piece of structured data. The feature extraction unit extracts the features from input text. The item value extraction unit extracts the value of the item to be extracted from the input text on the basis of the features extracted by the feature extraction unit and the extraction model.

The item value extraction method according to the fourth invention is a method performed in an item value extraction device that includes an extraction model storage unit storing an extraction model for extracting the value of an item to be extracted from text. The extraction model is learned in advance on the basis of features extracted from the text of pseudo-teacher data and the annotations attached to that text. The pseudo-teacher data is text drawn from a text data group, which consists of pairs of structured data (including a value and pairs of an item name and an item value relating to that value) and text containing the value; in this text, an annotation indicating the value of the item to be extracted is attached to the value of the item whose name is designated as defining the item to be extracted, that value having been extracted from each piece of structured data. In the method, a feature extraction unit extracts the features from input text, and an item value extraction unit extracts the value of the item to be extracted from the input text on the basis of the features extracted by the feature extraction unit and the extraction model.

The item value extraction model learning device according to the fifth invention includes an input unit, a designated item extraction unit, a pseudo-teacher data creation unit, and an extraction model learning unit. The input unit accepts an article group in which each article consists of a title, structured data including pairs of an item name and an item value, and text containing the value, and also accepts an item name designated as the item name that defines the item to be extracted with respect to the title. The designated item extraction unit extracts, from each piece of structured data in the article group, the value of an item that matches the designated item name as the value of the item to be extracted. The pseudo-teacher data creation unit creates, as pseudo-teacher data, the text annotated to indicate the value of the item to be extracted that was extracted by the designated item extraction unit. The extraction model learning unit learns an extraction model for extracting the value of the item to be extracted from text, on the basis of features extracted from the text of the pseudo-teacher data created by the pseudo-teacher data creation unit and the annotations attached to that text.

The item value extraction device according to the sixth invention includes an extraction model storage unit, a feature extraction unit, and an item value extraction unit. The extraction model storage unit stores an extraction model for extracting the value of an item to be extracted from text. The extraction model is learned in advance on the basis of features extracted from the text of pseudo-teacher data and the annotations attached to that text. The pseudo-teacher data is text drawn from an article group in which each article consists of a title, structured data including pairs of an item name and an item value, and text containing the value; in this text, an annotation indicating the value of the item to be extracted is attached to the value of the item whose name is designated as defining the item to be extracted with respect to the title, that value having been extracted from each piece of structured data. The feature extraction unit extracts the features from input text. The item value extraction unit extracts the value of the item to be extracted from the input text on the basis of the features extracted by the feature extraction unit and the extraction model.

The program according to the seventh invention is a program for causing a computer to function as each unit of the above item value extraction model learning device or the above item value extraction device.

According to the item value extraction model learning device, method, and program of the present invention, the value of an item that matches the designated item name is extracted from each piece of structured data as the value of the item to be extracted; the text annotated to indicate the extracted value is created as pseudo-teacher data; and an extraction model is learned on the basis of features extracted from the text of the pseudo-teacher data and the annotations attached to that text. This makes it possible to learn, with high coverage, an extraction model for extracting the value of the item to be extracted.

According to the item value extraction device, method, and program of the present invention, the value of the item to be extracted is extracted from input text on the basis of features extracted from that text and an extraction model learned in advance. The extraction model is learned from the features of the text of pseudo-teacher data and the annotations attached to that text, where the pseudo-teacher data is text in which an annotation is attached to the value of the item whose name is designated as defining the item to be extracted, that value having been extracted from each piece of structured data. This makes it possible to extract the value of the item to be extracted with high coverage.

FIG. 1 is a diagram showing the display of an Infobox on the Web.
FIG. 2 is a diagram showing the XML notation of an Infobox.
FIG. 3 is a diagram showing an example of the Infobox and text of an article.
FIG. 4 is a block diagram showing the configuration of the item value extraction model learning device according to the first embodiment of the present invention.
FIG. 5 is a diagram showing an input article group.
FIG. 6 is a diagram showing an example of structured data.
FIG. 7 is a diagram showing a pair of a title and an extracted alternative name.
FIG. 8 is a diagram for explaining a method of creating pseudo-teacher data.
FIG. 9 is a block diagram showing the configuration of the item value extraction device according to the first embodiment of the present invention.
FIG. 10 is a diagram showing an input article.
FIG. 11 is a diagram showing an alternative name extracted from an input article.
FIG. 12 is a flowchart showing the item value extraction model learning processing routine in the item value extraction model learning device according to the first embodiment of the present invention.
FIG. 13 is a flowchart showing the item value extraction processing routine in the item value extraction device according to the first embodiment of the present invention.
FIG. 14 is a block diagram showing the configuration of the item value extraction model learning device according to the second embodiment of the present invention.
FIG. 15 is a diagram showing an input text data group.
FIG. 16 is a block diagram showing the configuration of the item value extraction device according to the second embodiment of the present invention.
FIG. 17 is a diagram showing input text.
FIG. 18 is a flowchart showing the item value extraction model learning processing routine in the item value extraction model learning device according to the second embodiment of the present invention.
FIG. 19 is a flowchart showing the item value extraction processing routine in the item value extraction device according to the second embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<Overview>

From the alternative-name information described in the structured data (for example, an Infobox) included in an article, correct-answer data for alternative-name descriptions in the article text is generated in a pseudo manner, and an extraction model for extracting alternative names of the title from article text is constructed. As a result, the only thing that must be set in advance is the designation of item names of the structured data; a large number of alternative-name description patterns can be obtained at low cost, and an extraction model for extracting alternative names can be constructed.

Here, as shown by the Web display in FIG. 1 and the XML notation in FIG. 2, an Infobox is the part of a Wikipedia (R) article in which information is structured, so that item names and their values can be extracted easily.

Some Wikipedia (R) articles, as shown in FIGS. 3(A) and 3(B), describe an alternative name of the title in both the Infobox and the text. For such articles, the alternative name can be acquired automatically from the Infobox, so this information is used to locate the description pattern in the text, which is then used as training data for the extraction model.

[First Embodiment]
<Configuration of Item Value Extraction Model Learning Device According to First Embodiment of the Present Invention>
Next, the configuration of the item value extraction model learning device according to the first embodiment of the present invention will be described. As shown in FIG. 4, the item value extraction model learning device 100 according to the first embodiment can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing an item value extraction model learning processing routine described later. Functionally, as shown in FIG. 4, the item value extraction model learning device 100 includes an input unit 10, a calculation unit 20, and an output unit 50.

The input unit 10 accepts an article group, that is, a set of articles each consisting of a title, text, and structured data, as shown in FIG. 5. The structured data includes pairs of an item name and an item value, and the text contains the item value.

The input unit 10 also extracts the title, the structured data, and the text portion from each article in the accepted article group. For example, the value of the first <h1> in <body> is extracted as the title.
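
The title extraction described above can be sketched as follows. This is a minimal illustration that assumes each article is available as an HTML string; the class name FirstH1Parser and the function extract_title are hypothetical and do not appear in the source.

```python
from html.parser import HTMLParser


class FirstH1Parser(HTMLParser):
    """Collects the text of the first <h1> element found inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.in_h1 = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag == "h1" and self.in_body and not self.done:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_h1:
            self.in_h1 = False
            self.done = True  # ignore any later <h1>

    def handle_data(self, data):
        if self.in_h1:
            self.parts.append(data)


def extract_title(html: str) -> str:
    """Return the text of the first <h1> in <body>, or "" if there is none."""
    parser = FirstH1Parser()
    parser.feed(html)
    return "".join(parser.parts).strip()
```

The structured data and the text portion would be pulled out in a similar way, but the source does not say which elements hold them, so they are left out of the sketch.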

The input unit 10 further accepts an item name list designated as the item names that define alternative names of the title. For example, the following item name list is accepted.

The calculation unit 20 includes an article group storage unit 22, a designated item extraction unit 24, a pseudo-teacher data creation unit 26, a feature extraction unit 28, and an extraction model learning unit 30.

The article group storage unit 22 stores the article group and the item name list accepted by the input unit 10.

The designated item extraction unit 24 extracts, from each piece of structured data included in each article of the article group, the value of an item that matches the designated item name list as an alternative name of the title. For example, as shown in FIG. 6, the item value “water monster” corresponding to the item name that matches “nickname” in the designated item name list is extracted from the structured data (see FIG. 7). Cleaning, such as removing link information and splitting multiple candidates, is also performed.
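
A minimal sketch of this extraction and cleaning step is given below. It assumes the structured data has already been parsed into a dictionary of item names to raw values; the wiki-style link markup [[target|label]] and the separator characters are assumptions made for illustration, since the source does not specify them. The function name extract_alternative_names is hypothetical.

```python
import re

# Assumed link markup: [[label]] or [[target|label]]. Not specified in the source.
LINK_PATTERN = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")
# Assumed separators between multiple candidates. Not specified in the source.
SEPARATORS = re.compile(r"[、,/]|<br\s*/?>")


def extract_alternative_names(structured_data, item_names):
    """Return cleaned values of items whose name is in the designated list."""
    names = []
    for item_name, value in structured_data.items():
        if item_name not in item_names:
            continue
        # Cleaning 1: drop link information, keeping only the displayed label.
        cleaned = LINK_PATTERN.sub(r"\1", value)
        # Cleaning 2: split a value that lists several candidates.
        for candidate in SEPARATORS.split(cleaned):
            candidate = candidate.strip()
            if candidate and candidate not in names:
                names.append(candidate)
    return names
```

For example, with the designated list {"nickname"}, an item "nickname" whose raw value is "[[water monster]], [[Somebody|HHH]]" yields the two candidates "water monster" and "HHH".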

Based on the alternative name extracted by the designated item extraction unit 24, the pseudo-teacher data creation unit 26 identifies the location in the text of the article where the alternative name is written, attaches to that location an annotation indicating that it is an alternative name, and creates the annotated text as pseudo-teacher data.

Here, the location where the value of the item whose name matches the item name list first appears in the article is identified and annotated. The annotation is limited to the first occurrence because, as a characteristic of Wikipedia (R) articles, an alternative name is often explained as being an alternative name when it first appears.

As a result, as shown in FIG. 8, the diverse alternative-name description patterns in article text can be annotated automatically.
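
The first-occurrence annotation can be sketched as below. The representation of an annotation as a (start, end) character span is an assumption for illustration, and the function name create_pseudo_teacher_data is hypothetical.

```python
def create_pseudo_teacher_data(text, alternative_names):
    """Annotate the first occurrence of each alternative name in the text.

    Returns the text together with a sorted list of (start, end) character
    spans; names that do not occur in the text produce no span.
    """
    spans = []
    for name in alternative_names:
        start = text.find(name)  # str.find returns the first occurrence or -1
        if start >= 0:
            spans.append((start, start + len(name)))
    return text, sorted(spans)
```

Later occurrences of the same name are deliberately left unannotated, mirroring the restriction to the first appearance described above.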

The feature extraction unit 28 generates a feature vector for the text of each piece of pseudo-teacher data created by the pseudo-teacher data creation unit 26. For example, the feature vector is generated based on the surface form of each character in the text, the distributed representation of each word, and the like.
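
As a rough illustration of character-level features, the sketch below builds one feature dictionary per character from its surface form, a coarse character type, and neighboring characters. The specific features, the window size, and the names char_type and char_features are assumptions; the source names only character surface forms and word distributed representations, and the latter are omitted here because they require a trained embedding model.

```python
import unicodedata


def char_type(ch):
    """Coarse character class derived from the Unicode character name."""
    name = unicodedata.name(ch, "")
    if "HIRAGANA" in name:
        return "hiragana"
    if "KATAKANA" in name:
        return "katakana"
    if "CJK UNIFIED" in name:
        return "kanji"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alpha"
    return "other"


def char_features(text, window=1):
    """Return one feature dict per character of the text."""
    features = []
    for i, ch in enumerate(text):
        f = {"char": ch, "type": char_type(ch)}
        for off in range(1, window + 1):
            # Context characters, with sentinels at the text boundaries.
            f[f"char-{off}"] = text[i - off] if i - off >= 0 else "<BOS>"
            f[f"char+{off}"] = text[i + off] if i + off < len(text) else "<EOS>"
        features.append(f)
    return features
```

Feature dictionaries of this shape are a common input format for CRF toolkits, which is why a dictionary is used instead of a dense vector.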

Based on the feature vectors extracted by the feature extraction unit 28 from the text of each piece of pseudo-teacher data and the annotations attached to that text, the extraction model learning unit 30 learns an extraction model for extracting alternative names of the title from article text, and outputs the model via the output unit 50.

Specifically, the extraction model learning unit 30 assigns character-level tags (for example, BIO tags) to the text of each piece of pseudo-teacher data according to the annotation. An O tag (other) is assigned to every character outside the annotated location; within the annotated location, a B tag marks the beginning of the character string to be extracted and an I tag marks its inside. For example, when the annotated location is a three-character string, the three characters receive a B tag, an I tag, and an I tag, respectively.
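
The character-level BIO tagging can be sketched as follows, assuming annotations are given as (start, end) character spans. The function name bio_tags is hypothetical.

```python
def bio_tags(text, spans):
    """Convert annotated (start, end) spans into one BIO tag per character."""
    tags = ["O"] * len(text)
    for start, end in spans:
        if start >= end:
            continue  # skip empty spans
        tags[start] = "B"  # first character of the string to be extracted
        for i in range(start + 1, end):
            tags[i] = "I"  # remaining characters of the string
    return tags
```

For a six-character text whose third to fifth characters are annotated, the result is O, O, B, I, I, O, which matches the three-character example above.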

The extraction model learning unit 30 then learns the extraction model with a character-level sequence labeling model, based on the feature vectors generated for each piece of pseudo-teacher data and the assigned tags. The extraction model can be built with a general sequence labeling technique such as CRF (for example, the technique described in Non-Patent Document 3).

[Non-Patent Document 3]: J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, 2001.

<Configuration of Item Value Extraction Device According to First Embodiment of the Present Invention>
Next, the configuration of the item value extraction device according to the first embodiment of the present invention will be described. As shown in FIG. 9, the item value extraction device 150 according to the first embodiment can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing an item value extraction processing routine described later. Functionally, as shown in FIG. 9, the item value extraction device 150 includes an input unit 60, a calculation unit 70, and an output unit 90.

The input unit 60 accepts an article consisting of a title and text, as shown in FIG. 10. The accepted article contains no structured data, and its text contains an alternative name of the title. Like the input unit 10, the input unit 60 extracts the title and the text portion from the accepted article.

The calculation unit 70 includes a feature extraction unit 72, an extraction model storage unit 74, and a title item value extraction unit 76.

The feature extraction unit 72 generates, for the text of the article accepted by the input unit 60, feature vectors of the same kind as those generated by the feature extraction unit 28.

The extraction model storage unit 74 stores the extraction model learned by the item value extraction model learning device 100.

The title item value extraction unit 76 performs sequence labeling based on the feature vectors extracted by the feature extraction unit 72 and the extraction model stored in the extraction model storage unit 74, assigns a tag (for example, a BIO tag) to each character of the article text, and extracts the alternative name of the title. For example, in the example of FIG. 11, if the three characters of “HHH” are assigned a B tag, an I tag, and an I tag, this portion is extracted as an alternative name of the title, and the pair of the title and the alternative name is output via the output unit 90.
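
Turning predicted tags back into extracted strings can be sketched as follows. The sketch assumes the sequence labeling model has already produced one tag per character; the function names decode_spans and title_alias_pairs are hypothetical, and the tolerance for an I tag without a preceding B tag is a design choice made here, not something stated in the source.

```python
def decode_spans(text, tags):
    """Return the substrings covered by B/I tag runs, in order of appearance."""
    results = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                results.append(text[start:i])  # close the previous run
            start = i
        elif tag == "I":
            if start is None:
                start = i  # tolerate an I tag with no preceding B tag
        else:
            if start is not None:
                results.append(text[start:i])
                start = None
    if start is not None:
        results.append(text[start:])  # run that reaches the end of the text
    return results


def title_alias_pairs(title, text, tags):
    """Pair the title with every alternative name decoded from the tags."""
    return [(title, name) for name in decode_spans(text, tags)]
```

Each returned pair corresponds to one title and alternative name pair of the kind output via the output unit 90.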

In the present embodiment, the portion of the text from which alternative names are extracted is the definition sentence of the title. Tallying the created pseudo-correct-answer data showed that about 85% of the locations where alternative names are written fall in the definition sentence, so limiting extraction to this portion makes extraction efficient.
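
The restriction to the definition sentence might be sketched as below. The source does not say how the definition sentence is identified; this sketch assumes it is the first sentence of the text, ending at the first Japanese full stop "。". Both that assumption and the function name definition_sentence are hypothetical.

```python
def definition_sentence(text, terminator="。"):
    """Return the text up to and including the first sentence terminator.

    If the terminator does not occur, the whole text is returned.
    """
    end = text.find(terminator)
    return text if end < 0 else text[: end + len(terminator)]
```

Under this assumption, feature extraction and sequence labeling would be applied only to the returned sentence instead of the full article text.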

<Operation of Item Value Extraction Model Learning Device According to First Embodiment of the Present Invention>
Next, the operation of the item value extraction model learning device 100 according to the first embodiment of the present invention will be described. When the input unit 10 accepts an article group consisting of titles, text, and structured data, the title, the structured data, and the text portion are extracted from each article of the accepted article group and stored in the article group storage unit 22.

When the input unit 10 accepts an item name list designated as the item names that define alternative names of the title, the item name list is stored in the article group storage unit 22. The item value extraction model learning device 100 then executes the item value extraction model learning processing routine shown in FIG. 12.

First, in step S100, the value of an item that matches the designated item name list is extracted, as an alternative name of the title, from the structured data of each article in the article group stored in the article group storage unit 22.

Next, in step S102, for each article in the article group stored in the article group storage unit 22, the location in the text of the article where the alternative name of the title extracted from that article's structured data in step S100 is written is identified, and an annotation indicating that it is an alternative name of the title is attached to the identified location to create pseudo-teacher data.

In step S104, a feature vector is generated for the text of each piece of pseudo-teacher data created in step S102, based on, for example, the surface form of each character of the text and the distributed representation of each word. In step S106, character-level tags are assigned to the text of each piece of pseudo-teacher data in accordance with the annotations.
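Steps S104 and S106 can be sketched as below, assuming character-level BIO tags (B for the first character of an annotated span, I for the rest, O elsewhere). The feature set shown, a character with a one-character context window, is an assumption for illustration; word distributed representations are omitted, and the function names are hypothetical.

```python
def bio_tags(text, spans):
    """Assign one B/I/O tag per character from annotated (start, end) spans."""
    tags = ["O"] * len(text)
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def char_features(text, i, window=1):
    """Features for character i: its surface form and neighbouring characters."""
    feats = {"char": text[i]}
    for offset in range(1, window + 1):
        feats[f"prev{offset}"] = text[i - offset] if i - offset >= 0 else "<s>"
        feats[f"next{offset}"] = text[i + offset] if i + offset < len(text) else "</s>"
    return feats


text = "aka NTT."
tags = bio_tags(text, [(4, 7)])  # span of "NTT"
features = [char_features(text, i) for i in range(len(text))]
print(list(zip(text, tags)))
```

The resulting per-character feature dictionaries and tags form the training pairs that a sequence labeling model would consume in step S108.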

Then, in step S108, an extraction model is learned with a character-level sequence labeling model, based on the feature vectors generated for the pseudo-teacher data in step S104 and the tags assigned to the text of the pseudo-teacher data in step S106. The extraction model is output by the output unit 50, and the item value extraction model learning processing routine ends.

<Operation of the Item Value Extraction Device According to the First Embodiment of the Present Invention>
Next, the operation of the item value extraction device 150 according to the first embodiment of the present invention will be described. First, the extraction model learned by the item value extraction model learning device 100 is stored in the extraction model storage unit 74 of the item value extraction device 150. When the input unit 60 receives an article consisting of a title and text, the item value extraction device 150 executes the item value extraction processing routine shown in FIG. 13.

In step S110, a feature vector is generated for the text of the article received by the input unit 60, based on, for example, the surface form of each character of the text and the distributed representation of each word.

Then, in step S112, sequence labeling is performed based on the feature vector generated in step S110 and the extraction model stored in the extraction model storage unit 74, so that a tag is assigned to each character of the article's text. The alternative names of the title are extracted based on the assigned tags, the pairs of the article's title and each alternative name are output by the output unit 90, and the item value extraction processing routine ends.
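The final part of step S112, turning per-character tags into extracted values, can be sketched as follows. The tags here are written by hand to stand in for the output of the learned model; the function name `decode_bio` and the sample title are assumptions.

```python
def decode_bio(text, tags):
    """Collect the substrings covered by B/I tag runs."""
    values, current = [], []
    for ch, tag in zip(text, tags):
        if tag == "B":
            if current:
                values.append("".join(current))
            current = [ch]  # start a new span
        elif tag == "I" and current:
            current.append(ch)  # continue the open span
        else:
            if current:
                values.append("".join(current))
            current = []  # "O", or a stray "I" with no open span
    if current:
        values.append("".join(current))
    return values


title = "Nippon Telegraph and Telephone"
text = "aka NTT."
predicted_tags = list("OOOOBIIO")  # stand-in for model output
pairs = [(title, alias) for alias in decode_bio(text, predicted_tags)]
print(pairs)
```

A stray I tag that does not follow a B is dropped in this sketch; other decoding policies are possible.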

As described above, the item value extraction model learning device according to the first embodiment of the present invention extracts, from each piece of structured data in the article group, the values of items that match the designated item name list as alternative names of the title; creates, as pseudo-teacher data, text annotated to indicate the alternative names of the title; and learns an extraction model based on the features extracted from the text of the pseudo-teacher data and the annotations added to that text. An extraction model for extracting alternative names of a title can thus be learned with high coverage.

Likewise, the item value extraction device according to the first embodiment of the present invention extracts alternative names of the title from the text of an article, based on the extraction model learned in advance by the item value extraction model learning device and the features extracted from the input text, and can therefore extract alternative names of a title with high coverage.

In addition, by linking the Infobox of a Wikipedia (R) article with the descriptions in its text, the places where alternative names are described in the text are identified automatically and the model is built from them. This makes it possible to acquire and extend natural-sentence description patterns without supplying such patterns in advance, so an alternative-name extraction model with high recall can be built at low cost. Automatic extraction is particularly effective for alternative names because their values are diverse and new ones keep appearing.

In the embodiment described above, the case where alternative names of a title are the extraction target has been described as an example. However, the invention is not limited to this, and other items related to the title may be the extraction target.

[Second Embodiment]
Next, a second embodiment will be described. Parts configured in the same way as in the first embodiment are given the same reference numerals, and their description is omitted.

The second embodiment differs from the first embodiment in that it learns an extraction model for extracting the value of an item to be extracted from text data consisting of a pair of text and structured data.

<Configuration of the Item Value Extraction Model Learning Device According to the Second Embodiment of the Present Invention>
Next, the configuration of the item value extraction model learning device according to the second embodiment of the present invention will be described. As shown in FIG. 14, the item value extraction model learning device 200 according to the second embodiment of the present invention includes an input unit 10, a calculation unit 220, and an output unit 50.

The input unit 10 receives a text data group, that is, a group of text data each consisting of a pair of text and structured data, as shown in FIG. 15. The structured data includes a triple of a value (for example, a target word), an item name related to that value, and the item value. The text includes the target word and the item value.
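The input described above can be modelled with two small record types. This is only a sketch: the class names `Triple` and `TextData`, the field names, and the sample sentence are assumptions for illustration and do not appear in the specification.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triple:
    """One entry of structured data: (value, item name, item value)."""
    target: str      # the value, e.g. a target word
    item_name: str   # the item name related to that value
    item_value: str  # the value of the item


@dataclass
class TextData:
    """A text paired with its structured data."""
    text: str
    triples: list = field(default_factory=list)

    def usable_triples(self):
        """Triples whose target word and item value both occur in the text."""
        return [t for t in self.triples
                if t.target in self.text and t.item_value in self.text]


data = TextData(
    "Tokyo Tower, also called Nippon Denpato, opened in 1958.",
    [Triple("Tokyo Tower", "alias", "Nippon Denpato"),
     Triple("Tokyo Tower", "height", "333 m")],
)
print(data.usable_triples())
```

Only triples whose target word and item value both appear in the text can be turned into annotations, which is why the sketch filters on that condition.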

The input unit 10 also extracts the structured data and the text portion from each piece of text data in the received text data group.

The input unit 10 also receives an item name list designating the item names that define the item to be extracted (for example, an alternative name) for the target word.

The calculation unit 220 includes a text data group storage unit 222, a designated item extraction unit 24, a pseudo-teacher data creation unit 26, a feature extraction unit 28, and an extraction model learning unit 30.

The text data group storage unit 222 stores the text data group and the item name list received by the input unit 10.

The designated item extraction unit 24 extracts, from each piece of structured data included in the text data of the text data group, the values of items that match the designated item name list as the values of the item to be extracted.

Based on the value of the item to be extracted, as extracted by the designated item extraction unit 24, the pseudo-teacher data creation unit 26 identifies the places in the text of the text data where that value is described and adds to each identified place an annotation indicating the value of the item to be extracted. It also adds an annotation indicating the target word to the places where the target word is described, and creates the annotated text as pseudo-teacher data.
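The two kinds of annotation can be sketched with character-level tags that carry a label. The label names `TARGET` and `VALUE`, the function names, and the exact-match search are assumptions made for this example and are not taken from the specification.

```python
def label_spans(text, phrase, label, tags):
    """Tag every occurrence of phrase with B-<label>/I-<label>, skipping tagged characters."""
    if not phrase:
        return tags
    start = text.find(phrase)
    while start != -1:
        end = start + len(phrase)
        if all(t == "O" for t in tags[start:end]):
            tags[start] = "B-" + label
            for i in range(start + 1, end):
                tags[i] = "I-" + label
        start = text.find(phrase, start + 1)
    return tags


def annotate(text, target, item_value):
    """Annotate the item value and the target word in one pass each."""
    tags = ["O"] * len(text)
    label_spans(text, item_value, "VALUE", tags)
    label_spans(text, target, "TARGET", tags)
    return tags


text = "Tokyo Tower is also called Nippon Denpato."
tags = annotate(text, "Tokyo Tower", "Nippon Denpato")
print([(ch, tag) for ch, tag in zip(text, tags) if tag != "O"])
```

Tagging the item value first means that, if the two phrases overlap, the value annotation takes precedence in this sketch.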

As in the first embodiment, the extraction model learning unit 30 learns an extraction model for extracting the value of the item to be extracted from the text of text data, based on the features extracted from the text of the pseudo-teacher data created by the pseudo-teacher data creation unit 26 and the annotations added to that text, and outputs the model through the output unit 50. In the present embodiment, the features also include features related to the target word.

<Configuration of the Item Value Extraction Device According to the Second Embodiment of the Present Invention>
Next, the configuration of the item value extraction device according to the second embodiment of the present invention will be described. As shown in FIG. 16, the item value extraction device 250 according to the second embodiment of the present invention includes an input unit 60, a calculation unit 270, and an output unit 90.

The input unit 60 receives a text and a target word, as shown in FIG. 17. The text has no structured data attached, and it includes the target word and the value of the item to be extracted.

The calculation unit 270 includes a feature extraction unit 72, an extraction model storage unit 74, and an item value extraction unit 276.

The feature extraction unit 72 generates, for the text received by the input unit 60, a feature vector in the same way as the feature extraction unit 28. In the present embodiment, the features also include features related to the target word.

The extraction model storage unit 74 stores the extraction model learned by the item value extraction model learning device 200.

The item value extraction unit 276 performs sequence labeling based on the feature vector extracted by the feature extraction unit 72 and the extraction model stored in the extraction model storage unit 74, assigns a tag to each character of the text, extracts the value of the item to be extracted, and outputs it through the output unit 90.

<Operation of the Item Value Extraction Model Learning Device According to the Second Embodiment of the Present Invention>
Next, the operation of the item value extraction model learning device 200 according to the second embodiment of the present invention will be described. When the input unit 10 receives a text data group consisting of pairs of text and structured data, it extracts the structured data and the text portion from each piece of text data and stores them in the text data group storage unit 222.

When the input unit 10 receives an item name list designating the item names that define the item to be extracted, it stores the item name list in the text data group storage unit 222. The item value extraction model learning device 200 then executes the item value extraction model learning processing routine shown in FIG. 18. Processing that is the same as in the first embodiment is given the same reference numerals, and its detailed description is omitted.

First, in step S200, the values of items that match the designated item name list are extracted, as the values of the item to be extracted, from the structured data of each piece of text data in the text data group stored in the text data group storage unit 222.

Next, in step S202, for each piece of text data in the text data group stored in the text data group storage unit 222, the places in its text where the value of the item to be extracted, as extracted from its structured data in step S200, is described are identified, and an annotation indicating the value of the item to be extracted is added to each identified place. An annotation indicating the target word is also added to the places where the target word included in the structured data is described, and pseudo-teacher data is thereby created.

In step S104, a feature vector is generated for the text of each piece of pseudo-teacher data created in step S202. In step S106, character-level tags are assigned to the text of each piece of pseudo-teacher data.

Then, in step S108, the extraction model is learned based on the feature vectors generated for the pseudo-teacher data in step S104 and the tags assigned to the text of the pseudo-teacher data in step S106. The extraction model is output by the output unit 50, and the item value extraction model learning processing routine ends.

<Operation of the Item Value Extraction Device According to the Second Embodiment of the Present Invention>
Next, the operation of the item value extraction device 250 according to the second embodiment of the present invention will be described. First, the extraction model learned by the item value extraction model learning device 200 is stored in the extraction model storage unit 74 of the item value extraction device 250. When the input unit 60 receives a target word and a text related to the target word, the item value extraction device 250 executes the item value extraction processing routine shown in FIG. 19.

In step S210, a feature vector is generated for the text received by the input unit 60, based on, for example, the surface form of each character of the text, the distributed representation of each word, and the target word.
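One way to bring the target word into the features of step S210 is a per-character flag that marks whether the character lies inside an occurrence of the target word. The specification does not say how target-word features are encoded, so this encoding, the function names, and the sample text are assumptions for illustration only.

```python
def target_flags(text, target):
    """Return one boolean per character: True inside an occurrence of target."""
    flags = [False] * len(text)
    if target:
        start = text.find(target)
        while start != -1:
            for i in range(start, start + len(target)):
                flags[i] = True
            start = text.find(target, start + 1)
    return flags


def features_with_target(text, target):
    """Per-character features: surface form plus the target-word flag."""
    flags = target_flags(text, target)
    return [{"char": ch, "in_target": flags[i]} for i, ch in enumerate(text)]


text = "NTT is also called Denden."
feats = features_with_target(text, "NTT")
print(feats[:4])
```

With such a flag the labeling model can condition on where the target word sits relative to each candidate character.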

Then, in step S212, sequence labeling is performed based on the feature vector generated in step S210 and the extraction model stored in the extraction model storage unit 74, so that a tag is assigned to each character of the text. The value of the item to be extracted is extracted based on the assigned tags and output by the output unit 90, and the item value extraction processing routine ends.

As described above, the item value extraction model learning device according to the second embodiment of the present invention extracts, from each piece of structured data in the text data group, the values of items that match the designated item name list as the values of the item to be extracted; creates, as pseudo-teacher data, text annotated to indicate the extracted values of the item to be extracted; and learns an extraction model based on the features extracted from the text of the pseudo-teacher data and the annotations added to that text. An extraction model for extracting the value of the item to be extracted can thus be learned with high coverage.

Likewise, the item value extraction device according to the second embodiment of the present invention extracts the value of the item to be extracted from the text, based on the extraction model learned in advance by the item value extraction model learning device and the features extracted from the input text, and can therefore extract the value of the item to be extracted with high coverage.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, in the first and second embodiments, the case where BIO tags are assigned at the character level has been described as an example. However, the invention is not limited to this, and tagging may be performed at the word level. In that case, the text is segmented into words beforehand.

In the second embodiment, the case where the target word is given as input and the value of the item to be extracted is extracted from the text has been described as an example. However, the invention is not limited to this, and pairs of a target word and an item value that stand in the relation of the item to be extracted may be extracted from the text.
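The word-level tagging variant mentioned above can be sketched as follows. Whitespace splitting stands in for word segmentation purely for illustration; Japanese text would need a proper word segmenter, which is outside this sketch. The function name `word_bio_tags` and the sample sentence are assumptions.

```python
def word_bio_tags(words, value_words):
    """Assign B/I/O tags per word where the word sequence value_words occurs."""
    tags = ["O"] * len(words)
    n = len(value_words)
    if n == 0:
        return tags
    for i in range(len(words) - n + 1):
        window_free = all(t == "O" for t in tags[i:i + n])
        if words[i:i + n] == value_words and window_free:
            tags[i] = "B"
            for j in range(i + 1, i + n):
                tags[j] = "I"
    return tags


# Whitespace split as a stand-in for word segmentation (an assumption).
words = "The tower is also called Nippon Denpato".split()
word_tags = word_bio_tags(words, ["Nippon", "Denpato"])
print(list(zip(words, word_tags)))
```

Word-level tags shorten the sequences the model has to label, at the cost of depending on the quality of the segmentation.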

10, 60 Input unit
20, 70, 220, 270 Calculation unit
22 Article group storage unit
24 Designated item extraction unit
26 Pseudo-teacher data creation unit
28, 72 Feature extraction unit
30 Extraction model learning unit
50, 90 Output unit
74 Extraction model storage unit
76 Title item value extraction unit
100, 200 Item value extraction model learning device
150, 250 Item value extraction device
222 Text data group storage unit
276 Item value extraction unit

Claims

An item value extraction model learning device comprising:
an input unit that accepts an item name designated as the item name defining an item to be extracted, and a text data group consisting of pairs of structured data, which includes a triple of a value, an item name related to the value, and an item value, and a text including the value;
a designated item extraction unit that extracts, from each piece of structured data in the text data group, the value of the item that matches the designated item name as the value of the item to be extracted;
a pseudo-teacher data creation unit that creates, as pseudo-teacher data, the text to which an annotation indicating the value of the item to be extracted, as extracted by the designated item extraction unit, has been added; and
an extraction model learning unit that learns an extraction model for extracting the value of the item to be extracted from text, based on features extracted from the text of the pseudo-teacher data created by the pseudo-teacher data creation unit and the annotation added to the text.

An item value extraction device comprising:
an extraction model storage unit that stores an extraction model for extracting the value of an item to be extracted from text, the extraction model having been learned in advance based on features extracted from the text of pseudo-teacher data and on the annotation added to that text, the pseudo-teacher data being text that belongs to a text data group consisting of pairs of structured data, which includes a triple of a value, an item name related to the value, and an item value, and a text including the value, and to which an annotation has been added indicating the value of the item to be extracted, that value being the value of the item whose name is designated as defining the item to be extracted, as extracted from each piece of the structured data;
a feature extraction unit that extracts the features from input text; and
an item value extraction unit that extracts the value of the item to be extracted from the text, based on the features extracted by the feature extraction unit and the extraction model.

An item value extraction model learning device comprising:
an input unit that accepts an article group in which each article consists of a title, structured data including pairs of an item name and an item value, and a text including the value, together with an item name designated as the item name defining an item to be extracted with respect to the title;
a designated item extraction unit that extracts, from each piece of structured data in the article group, the value of the item that matches the designated item name as the value of the item to be extracted;
a pseudo-teacher data creation unit that creates, as pseudo-teacher data, the text to which an annotation indicating the value of the item to be extracted, as extracted by the designated item extraction unit, has been added; and
an extraction model learning unit that learns an extraction model for extracting the value of the item to be extracted from text, based on features extracted from the text of the pseudo-teacher data created by the pseudo-teacher data creation unit and the annotation added to the text.

An item value extraction device comprising:
an extraction model storage unit that stores an extraction model for extracting the value of an item to be extracted from text, the extraction model having been learned in advance based on features extracted from the text of pseudo-teacher data and on the annotation added to that text, the pseudo-teacher data being text that belongs to an article group in which each article consists of a title, structured data including pairs of an item name and an item value, and a text including the value, and to which an annotation has been added indicating the value of the item to be extracted, that value being the value of the item whose name is designated as defining the item to be extracted with respect to the title, as extracted from each piece of the structured data;
a feature extraction unit that extracts the features from input text; and
an item value extraction unit that extracts the value of the item to be extracted from the text, based on the features extracted by the feature extraction unit and the extraction model.

The item value extraction model learning device according to claim 1, wherein the item to be extracted is an alternative name.

An item value extraction model learning method in which:
an input unit accepts an item name designated as the item name defining an item to be extracted, and a text data group consisting of pairs of structured data, which includes a triple of a value, an item name related to the value, and an item value, and a text including the value;
a designated item extraction unit extracts, from each piece of structured data in the text data group, the value of the item that matches the designated item name as the value of the item to be extracted;
a pseudo-teacher data creation unit creates, as pseudo-teacher data, the text to which an annotation indicating the value of the item to be extracted, as extracted by the designated item extraction unit, has been added; and
an extraction model learning unit learns an extraction model for extracting the value of the item to be extracted from text, based on features extracted from the text of the pseudo-teacher data created by the pseudo-teacher data creation unit and the annotation added to the text.

An item value extraction method for an item value extraction device including an extraction model storage unit that stores an extraction model for extracting the value of an item to be extracted from text, the extraction model having been learned in advance based on features extracted from the text of pseudo-teacher data and on the annotation added to that text, the pseudo-teacher data being text that belongs to a text data group consisting of pairs of structured data, which includes a triple of a value, an item name related to the value, and an item value, and a text including the value, and to which an annotation has been added indicating the value of the item to be extracted, that value being the value of the item whose name is designated as defining the item to be extracted, as extracted from each piece of the structured data, wherein:
a feature extraction unit extracts the features from input text; and
an item value extraction unit extracts the value of the item to be extracted from the text, based on the features extracted by the feature extraction unit and the extraction model.

A program for causing a computer to function as each unit of the item value extraction model learning device according to claim 1, claim 3, or claim 5, or of the item value extraction device according to claim 2 or claim 4.