JP5542732B2

JP5542732B2 - Data extraction apparatus, data extraction method, and program thereof

Info

Publication number: JP5542732B2
Application number: JP2011094885A
Authority: JP
Inventors: 九月貞光; 玄一郎菊井; 邦子齋藤; 賢治今村
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-10-29
Filing date: 2011-04-21
Publication date: 2014-07-09
Anticipated expiration: 2031-04-21
Also published as: JP2012108867A

Description

The present invention relates to a technique for extracting data from a set of text data, and in particular to a technique that takes as input character strings related to specific content and extracts character strings with a similar relation from the set of text data.

Research and development on natural language processing is advancing on many fronts, and many methods exist for extracting needed information from vast knowledge sources such as the web. One such method takes as input character strings that have some relation to specific content (for example, <Hiroshima> and <Hanshin>) and collects character strings with a similar relation (for example, <Yakult>) from a large amount of text data (for example, document data). This kind of method is called "set expansion". A character string handled in set expansion is called an "entity"; an entity to be extracted is called a "positive example entity", and an entity that is not to be extracted is called a "negative example entity". The entities supplied first in set expansion are called "seed entities"; a positive seed is called a "positive example seed entity", and a negative seed is called a "negative example seed entity".

An example of conventional set expansion is outlined below.
Step I: Extract the features of each positive example entity (for example, <Hiroshima> and <Hanshin>) using text data that contains it, and extract the features of each negative example entity (for example, <Comet>) using text data that contains it (featurization). The initial positive example entities are the positive example seed entities, and the initial negative example entities are the negative example seed entities.
Step II: Using the features of the positive example entities and of the negative example entities obtained in Step I as learning data, generate an identification model for identifying whether an arbitrary entity is a positive example entity or a negative example entity.
Step III: Extract not-yet-identified entities (for example, <Yakult>) and their features from the text data, and identify these unknown entities using the identification model obtained in Step II.
Step IV: Add to the learning data the features of those entities identified as positive example entities that have high reliability, and the features of those entities estimated to be negative example entities that have low reliability.
Step V: Determine whether a convergence condition is satisfied. If it is not, return to Step I and repeat the process. If it is, terminate the process.
This framework of iterative learning, in which identification is performed with a previously learned model and the results are used as new learning data, is called the bootstrap method.
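Steps I to V can be sketched as a small self-training loop. This is a minimal illustration under stated assumptions, not the method of any particular system: the co-occurrence features, the count-difference scoring model, the `threshold` parameter, and the explicit `candidates` list are all hypothetical choices made for the example. The sketch also adds only confidently scored candidates on both the positive and the negative side, which is a simplification of Step IV.

```python
from collections import Counter

def featurize(entity, corpus):
    """Step I: bag of words that co-occur with `entity` in a sentence."""
    feats = Counter()
    for sentence in corpus:
        words = sentence.split()
        if entity in words:
            feats.update(w for w in words if w != entity)
    return feats

def train(pos, neg, corpus):
    """Step II: toy model, weight = positive count minus negative count."""
    weights = Counter()
    for e in pos:
        weights.update(featurize(e, corpus))
    for e in neg:
        weights.subtract(featurize(e, corpus))
    return weights

def score(entity, weights, corpus):
    """Step III: weighted sum over the candidate's features."""
    return sum(weights[w] * c for w, c in featurize(entity, corpus).items())

def bootstrap(pos_seeds, neg_seeds, candidates, corpus, threshold=2, max_iter=10):
    pos, neg = set(pos_seeds), set(neg_seeds)
    for _ in range(max_iter):
        weights = train(pos, neg, corpus)
        new_pos, new_neg = set(), set()
        for cand in set(candidates) - pos - neg:
            s = score(cand, weights, corpus)
            if s >= threshold:          # Step IV: confident positive
                new_pos.add(cand)
            elif s <= -threshold:       # Step IV: confident negative
                new_neg.add(cand)
        if not new_pos and not new_neg:  # Step V: nothing changed, stop
            break
        pos |= new_pos
        neg |= new_neg
    return pos, neg

corpus = ["hiroshima beat hanshin", "yakult beat hiroshima", "comet orbits sun"]
pos, neg = bootstrap({"hiroshima", "hanshin"}, {"comet"}, ["yakult", "sun"], corpus)
```

In this toy corpus <yakult> shares the context word "beat" with the seeds and is promoted to a positive example, while <sun> scores too weakly in either direction to be added.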

Next, TChai (see, for example, Non-Patent Document 1), another example of set expansion, is outlined. TChai uses a search query log (hereinafter, "query log") as its resource. A query log is a collection of user queries (keywords consisting of a few words) used for keyword search.
Step A: Extract from the query log the patterns p, which are words that co-occur with the positive example seed entities, and use them as the features of the positive example seed entities. This step is performed only once, at the start.
Step B: Compute the PMI (Pointwise Mutual Information) between an entity e, for which it is unknown whether it is a positive example entity, and a pattern p that co-occurs with it. The PMI is computed from the following frequencies. |e, p| is the frequency with which the pair of the entity e and the co-occurring pattern p appears in the query log, and * is a wildcard for p or e. That is, |e, *| is the frequency with which the entity e appears in the query log paired with any pattern, and |*, p| is the frequency with which any entity appears in the query log paired with the pattern p.
In addition, an entity reliability r_E and a pattern reliability r_P are computed for this entity e. In the definitions of r_E and r_P, |E| and |P| are the total numbers of entities e and patterns p, respectively, max_e pmi is the maximum value of the PMI with the entity fixed to e, and max_p pmi is the maximum value of the PMI with the pattern fixed to p.
Step C: Determine, based on the entity reliability r_E, whether to make the entity e a new positive example entity.
Step D: If the required number of positive example entities has not been obtained, return to Step B and repeat the process. If the required number has been obtained, terminate the process.
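For reference, the quantities in Step B can be written out as follows. The PMI line follows directly from the three frequencies defined above. The two reliability lines are an assumption: they use the mutually recursive form familiar from Espresso-style bootstrapping, and the choice of which normalizer (max_e pmi or max_p pmi) appears in which formula is a guess consistent with the symbol definitions, not a confirmed transcription.

```latex
\begin{align}
\mathrm{pmi}(e,p) &= \log \frac{|e,p|}{|e,*|\;|*,p|} \\
r_E(e) &= \frac{1}{|P|} \sum_{p \in P} \frac{\mathrm{pmi}(e,p)}{\max_e \mathrm{pmi}}\; r_P(p) \\
r_P(p) &= \frac{1}{|E|} \sum_{e \in E} \frac{\mathrm{pmi}(e,p)}{\max_p \mathrm{pmi}}\; r_E(e)
\end{align}
```

Under this form, an entity is reliable when it is strongly associated with reliable patterns, and a pattern is reliable when it is strongly associated with reliable entities; dividing by the maximum PMI keeps each term in a comparable range.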

Non-Patent Document 1: 小町守，鈴木久美，「検索ログからの半教師あり意味知識獲得の改善」，人工知能学会論文誌，Vol. 23，No. 3，2008，p. 217-225 (Mamoru Komachi, Hisami Suzuki, "Improvement of Semi-Supervised Semantic Knowledge Acquisition from Search Logs", Transactions of the Japanese Society for Artificial Intelligence, Vol. 23, No. 3, 2008, pp. 217-225)

Conventional set expansion suffers from a problem called semantic drift.
For example, suppose that, for the positive example seed entities <Hiroshima> and <Hanshin>, which are team names, conventional set expansion acquires the positive example entity <Yakult>. Because <Yakult> is also the name of a beverage, adding it as a new positive example entity can cause beverage-related entities such as <Cola> to be acquired as positive example entities in the next iteration, so that the topic of the acquired positive example entities may shift. This shift in the topic of the acquired positive example entities is called semantic drift.
To limit the effect of semantic drift, TChai uses the reliabilities described above so that highly generic entities and patterns, which tend to appear in any query, are not selected. Semantic drift can nevertheless still occur in TChai, so a way of reducing it from a different angle is desirable.

The present invention has been made in view of the above, and its object is to provide a technique capable of reducing semantic drift.

In a first aspect of the present invention, a topic model is learned using unsupervised learning data obtained from text data. The topic model describes the relation between text data and topic information, where the topic information expresses, as index values, the appropriateness of each of a plurality of topic candidates for that text data. Positive example topic information, extracted from the topic model in correspondence with the topic of text data containing a positive example entity (a character string to be extracted), is used as at least part of the features of the positive example entity. Negative example topic information, extracted from the topic model in correspondence with the topic of text data containing a negative example entity (a character string not to be extracted), is used as at least part of the features of the negative example entity. A learning process that uses the features of the positive example entities and the features of the negative example entities as supervised learning data generates an identification model, which is a function that takes the features of an arbitrary entity as input and outputs information for identifying whether that entity is a positive example entity or a negative example entity. An entity that is a character string contained in text data selected from the set of text data is taken as a target entity, topic information extracted from the topic model in correspondence with the topic of the selected text data is used as at least part of the features of the target entity, and the features of the target entity are input to the identification model to identify whether the target entity is a positive example entity or a negative example entity. If the target entity is identified as a positive example entity, it is made a positive example entity; if it is identified as a negative example entity, it is made a negative example entity.
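The first aspect can be illustrated with topic information used directly as a feature vector. The sketch below is only an illustration under assumptions: the aspect does not name a learning algorithm, so the perceptron is a stand-in, and the three-dimensional topic vectors (one index value per hypothetical topic candidate, for example team, other, beverage) are invented values.

```python
def train_perceptron(samples, epochs=20):
    """samples: list of (feature_vector, label) with label +1 or -1."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in samples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified: move the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def identify(x, model):
    """Return +1 (positive example entity) or -1 (negative example entity)."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Hypothetical topic information of the text data containing each entity
train_data = [
    ([0.8, 0.1, 0.1], +1),  # positive example entity in a team-topic document
    ([0.7, 0.2, 0.1], +1),
    ([0.1, 0.1, 0.8], -1),  # negative example entity in a beverage-topic document
]
model = train_perceptron(train_data)
label_team_context = identify([0.75, 0.15, 0.10], model)
label_drink_context = identify([0.10, 0.10, 0.80], model)
```

The same surface string can therefore be accepted when it occurs in a team-topic document and rejected when it occurs in a beverage-topic document, which is how topic information can counteract drift.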

In a second aspect of the present invention, a first positive example entity-positive example attribute pair and a first negative example entity-negative example attribute pair are generated. The former is a pair of a first positive example entity, selected from a set of positive example entities (character strings to be extracted), and a first positive example attribute, selected from a set of positive example attributes (character strings representing attributes of positive example entities). The latter is a pair of a first negative example entity, selected from a set of negative example entities (character strings not to be extracted), and a first negative example attribute, selected from a set of negative example attributes (character strings representing attributes of negative example entities). A character string containing the pair of the first positive example entity and the first positive example attribute is selected from the set of text data, and information representing the characteristics of the first positive example entity-positive example attribute pair with respect to the selected character string is used as at least part of the features of that pair. A character string containing the pair of the first negative example entity and the first negative example attribute is selected from the set of text data, and information representing the characteristics of the first negative example entity-negative example attribute pair with respect to the selected character string is used as at least part of the features of that pair. A learning process that uses the features of the first positive example entity-positive example attribute pair and the features of the first negative example entity-negative example attribute pair as supervised learning data generates a first identification model. The first identification model is a function that takes as input the features of an entity-attribute pair (a pair of an entity, which is an arbitrary character string, and an attribute of that entity) and outputs information for identifying whether the entity-attribute pair is a positive example entity-positive example attribute pair or a negative example entity-negative example attribute pair. A piece of text data is selected from the set of text data, a character string contained in the selected text data is selected as a first target entity, a character string different from the first target entity is selected from the selected text data as a first target attribute, and the pair of the first target entity and the first target attribute is taken as a first target entity-target attribute pair. Information representing the characteristics of the first target entity-target attribute pair within the selected text data is used as at least part of the features of that pair, and the features are input to the first identification model to identify whether the first target entity-target attribute pair is a positive example entity-positive example attribute pair or a negative example entity-negative example attribute pair. If the pair is identified as a positive example entity-positive example attribute pair, the first target attribute is added to the set of positive example attributes; if it is identified as a negative example entity-negative example attribute pair, the first target attribute is added to the set of negative example attributes.

Next, a second positive example entity-positive example attribute pair and a second negative example entity-negative example attribute pair are generated. The former is a pair of a second positive example entity selected from the set of positive example entities and a second positive example attribute selected from the set of positive example attributes. The latter is a pair of a second negative example entity selected from the set of negative example entities and a second negative example attribute selected from the set of negative example attributes. A character string containing the pair of the second positive example entity and the second positive example attribute is selected from the set of text data, and information representing the characteristics of the second positive example entity-positive example attribute pair with respect to the selected character string is used as at least part of the features of that pair. A character string containing the pair of the second negative example entity and the second negative example attribute is selected from the set of text data, and information representing the characteristics of the second negative example entity-negative example attribute pair with respect to the selected character string is used as at least part of the features of that pair. A learning process that uses the features of the second positive example entity-positive example attribute pair and the features of the second negative example entity-negative example attribute pair as supervised learning data generates a second identification model, which is a function that takes as input the features of an entity-attribute pair and outputs information for identifying whether the entity-attribute pair is a positive example entity-positive example attribute pair or a negative example entity-negative example attribute pair. A piece of text data is selected from the set of text data, a character string contained in the selected text data is selected as a second target entity, a character string different from the second target entity is selected from the selected text data as a second target attribute, and the pair of the second target entity and the second target attribute is taken as a second target entity-target attribute pair. Information representing the characteristics of the second target entity-target attribute pair within the selected text data is used as at least part of the features of that pair, and the features are input to the second identification model to identify whether the second target entity-target attribute pair is a positive example entity-positive example attribute pair or a negative example entity-negative example attribute pair. If the pair is identified as a positive example entity-positive example attribute pair, the second target entity is added to the set of positive example entities; if it is identified as a negative example entity-negative example attribute pair, the second target entity is added to the set of negative example entities.

As described above, in the present invention at least one of topic information and attributes is reflected in the identification of entities, so semantic drift can be reduced.

FIG. 1 is a block diagram illustrating the functional configuration of the data extraction device of the first embodiment.
FIGS. 2A and 2B are block diagrams illustrating the functional configuration of the automatic generation unit.
FIG. 3 is a diagram illustrating the data extraction process of the data extraction device of the first embodiment.
FIG. 4 is a diagram illustrating a set D of text data stored in the storage unit.
FIG. 5A is a diagram illustrating a set D' of text data with topic information. FIG. 5B is a diagram illustrating the pairs (fP_e^j, <+1>) and (fN_e^j, <-1>) output by the topic information extraction unit.
FIG. 6 is a block diagram illustrating the functional configuration of the data extraction device of the second embodiment.
FIG. 7 is a diagram illustrating the data extraction process of the data extraction device of the second embodiment.
FIG. 8A is a diagram illustrating the pairs (fP_a^j, <+1>) and (fN_a^j, <-1>) output by the attribute identification feature extraction unit. FIG. 8B is a diagram illustrating the pairs (fP_e^j, <+1>) and (fN_e^j, <-1>) output by the entity identification feature extraction unit.
FIG. 9 is a block diagram illustrating the functional configuration of the data extraction device 3 of the third embodiment.
FIG. 10 is a diagram illustrating the data extraction process of the data extraction device 3 of the third embodiment.

Embodiments of the present invention are described below with reference to the drawings.
[First Embodiment]
<Configuration>
FIG. 1 is a block diagram illustrating the functional configuration of the data extraction device 1 of the first embodiment.
As illustrated in FIG. 1, the data extraction device 1 has storage units 11a-11e, a topic assignment unit 12, a feature extraction unit 13, a topic information extraction unit 14, an identification learning unit 15, an entity identification unit 16, a convergence determination unit 17, an output unit 18, and a control unit 19, and executes each process under the control of the control unit 19. The data extraction device 1 is a special device configured by loading a special program into a known or dedicated computer that includes, for example, a CPU (central processing unit), RAM (random-access memory), and ROM (read-only memory). For example, the storage units 11a-11e are hard disks, semiconductor memories, or the like, and the topic assignment unit 12, feature extraction unit 13, topic information extraction unit 14, identification learning unit 15, entity identification unit 16, convergence determination unit 17, output unit 18, and control unit 19 are a CPU or the like into which the special program has been loaded. At least some of these may instead be implemented by integrated circuits or the like. The arrows in FIG. 1 represent the flow of information, but some arrows are omitted for notational convenience (the same applies to the other block diagrams described below).

<Pre-processing>
As pre-processing, a set D of text data is stored in the storage unit 11a, and a topic model TM⁰ is stored in the storage unit 11b.
Text data means data that contains character text. Examples of text data are document data, queries, chart data containing words and phrases, phrase data, and word-string data. This embodiment uses as text data document data that has undergone preprocessing such as morphological analysis, named entity extraction, dependency parsing, and sentence boundary identification.

The "topic model TM⁰" is a model (a function or formula) that describes the relation between the topic information corresponding to the topic of a piece of text data and the character strings contained in that text data. Specific examples of a "character string" are a word, a word sequence, a phrase, a sentence, a character, and a symbol. The topic corresponding to text data means the topic of the text data (its title, subject, matter, event, theme, category, and so on). The text data does not necessarily contain the very word that names its topic. The topic information may be any information that corresponds to the topic of the text data. For example, for each topic candidate for the text data (such as <team name> or <company name>), an index expressing the appropriateness of that candidate for the text data (for example, a probability, a weight coefficient, or a score that is a function of a probability or weight coefficient) may be given, and at least some of these indices may be used as the topic information of the text data.

The topic model is obtained in advance from unsupervised learning data (learning data obtained from text data whose relation to topic information has not been specified). For example, to acquire desired entities from one million documents on the web, the topic model is learned beforehand using learning data obtained from those one million documents.
Specific examples of the topic model TM⁰ are UM (Unigram Mixtures) (see, e.g., Andrew K. McCallum, Kamal Nigam, "Employing EM and Pool-Based Active Learning for Text Classification", ICML'98, 1998), LDA (Latent Dirichlet Allocation), and DM (Dirichlet Mixtures). An example using UM as the topic model TM⁰ is given below.
In this case the topic model TM⁰ is defined in the following form.

$$p(d) = \sum_{z=1}^{Z} p(z) \prod_{v \in V} p(v|z)^{n_{dv}} \tag{4}$$

Here d denotes a piece of text data d∈D belonging to the set D of text data, and p(d) is the probability of occurrence of the text data d in the set D. z∈Z is a hidden variable, and each z corresponds to one topic candidate. Z denotes the set of hidden variables z. In what follows, z is a natural number from 1 to Z, where Z also denotes the total number of hidden variables (the total number of topic candidates). p(z) is the probability of the hidden variable z and satisfies

$$\sum_{z=1}^{Z} p(z) = 1.$$

v denotes a character string, and V denotes the set of character strings v. p(v|z) is the probability of generating the character string v under the hidden variable z (the conditional probability of v given z) and satisfies

$$\sum_{v \in V} p(v|z) = 1.$$

n_dv is the number of times the character string v occurs in the text data d.

The topic model TM⁰ is learned using the EM algorithm, a type of iterative optimization method, and learning yields the parameters p(z) and p(v|z). The resulting parameters p(z) and p(v|z) are stored in the storage unit 11b as the information that specifies the topic model TM⁰. This is equivalent to storing the topic model TM⁰ in the storage unit 11b.
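EM learning of a unigram mixture can be sketched as follows. This is an illustrative sketch rather than the embodiment's implementation: the function name `em_unigram_mixture`, the additive smoothing constant, the fixed iteration count, and the deterministic initialization (document i initially leans toward topic i mod Z) are all assumptions made for the example.

```python
import math

def em_unigram_mixture(docs, num_topics, iters=50, smooth=0.01):
    """docs: list of {string: count} dicts (the n_dv counts).
    Returns (p_z, p_v_given_z), the parameters p(z) and p(v|z)."""
    vocab = sorted({v for d in docs for v in d})
    off = 0.1 / max(num_topics - 1, 1)
    # Initial responsibilities p(z|d): each row sums to 1
    resp = [[0.9 if z == i % num_topics else off for z in range(num_topics)]
            for i in range(len(docs))]
    for _ in range(iters):
        # M-step: re-estimate p(z) and p(v|z) from the responsibilities
        p_z = [sum(r[z] for r in resp) / len(docs) for z in range(num_topics)]
        p_v_given_z = []
        for z in range(num_topics):
            counts = {v: smooth for v in vocab}
            for d, r in zip(docs, resp):
                for v, n in d.items():
                    counts[v] += r[z] * n
            total = sum(counts.values())
            p_v_given_z.append({v: c / total for v, c in counts.items()})
        # E-step: responsibilities p(z|d), computed in log space for stability
        resp = []
        for d in docs:
            logs = [math.log(max(p_z[z], 1e-300))
                    + sum(n * math.log(p_v_given_z[z][v]) for v, n in d.items())
                    for z in range(num_topics)]
            m = max(logs)
            w = [math.exp(l - m) for l in logs]
            s = sum(w)
            resp.append([x / s for x in w])
    return p_z, p_v_given_z

docs = [{"baseball": 3, "pitcher": 2}, {"drink": 3, "bottle": 2},
        {"baseball": 2, "pitcher": 1}, {"drink": 2, "bottle": 2}]
p_z, p_v_given_z = em_unigram_mixture(docs, num_topics=2)
```

The returned `p_z` and `p_v_given_z` play the role of the stored parameters p(z) and p(v|z); on this toy input the first topic concentrates on the baseball vocabulary and the second on the beverage vocabulary.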

In this embodiment, a character string contained in text data is called an "entity", an entity to be extracted is called a "positive example entity", and an entity that is not to be extracted is called a "negative example entity". The entities supplied first are called "seed entities"; a positive seed is called a "positive example seed entity", and a negative seed is called a "negative example seed entity".

<Data extraction process>
FIG. 3 is a diagram illustrating the data extraction process of the data extraction device 1 of the first embodiment. The data extraction process of the data extraction device 1 is described below with reference to FIG. 3.
<< Initialization: Step S11 >>
The control unit 19 initializes the value of j to j = 1.
<< Topic Assignment: Step S12 >>
Using the topic model TM⁰ stored in the storage unit 11b, the topic assignment unit 12 generates topic information corresponding to the topic of each piece of text data in the set D of text data stored in the storage unit 11a. The topic assignment unit 12 associates each piece of generated topic information with its corresponding text data, producing text data with topic information, which contains both the text data and the topic information. The resulting set D' of text data with topic information is stored in the storage unit 11c. Any information may be used as topic information as long as it corresponds to the topic of each piece of text data. Topic information generated with UM as the topic model TM⁰ is illustrated below.

[トピック情報の例]
トピック付与部１２は、記憶部１１ｂに格納されたトピックモデルTM⁰のパラメータp(z), p(v|z)とテキストデータd及び文字列vから得られるn_dvを用い、式(4)に従って、記憶部１１ａに格納されたテキストデータの集合Dに属するテキストデータdに対応するp(d)を計算できる。また、確率の乗法定理より、トピック付与部１２は、p(z), p(v|z)を用い、z, vについての同時確率p(z,v)を以下のように求めることができる。
p(z,v)=p(z)p(v|z) …(5)
また、トピック付与部１２は、p(z,v)及びn_dvを用い、z, dについての同時確率p(z,d)を以下のように求めることができる。

さらに、確率の乗法定理より、トピック付与部１２は、p(z,d)及びp(z)を用い、隠れ変数zが与えられたときのテキストデータdの事後確率p(d|z)を、以下のように求めることができる。
p(d|z)=p(z,d)/p(z) …(7)
またさらに、ベイズの定理より、トピック付与部１２は、得られたp(d), p(d|z)及びp(z)を用い、テキストデータdが与えられたときの隠れ変数zの事後確率p(z|d)を以下のように求めることができる。
p(z|d)=p(d|z)p(z)/p(d) …(8)
[Example of topic information]
The topic assigning unit 12 uses the parameters p(z) and p(v|z) of the topic model TM⁰ stored in the storage unit 11b, together with n_dv obtained from the text data d and the character string v, to calculate p(d) according to equation (4) for each text data d belonging to the set D of text data stored in the storage unit 11a. Also, by the multiplication theorem of probability, the topic assigning unit 12 can obtain the joint probability p(z,v) of z and v from p(z) and p(v|z) as follows.
p(z,v)=p(z)p(v|z) …(5)
Also, the topic assigning unit 12 can obtain the joint probability p(z,d) of z and d from p(z,v) and n_dv as follows.

Further, by the multiplication theorem of probability, the topic assigning unit 12 can use p(z,d) and p(z) to obtain the posterior probability p(d|z) of the text data d given the hidden variable z as follows.
p(d|z)=p(z,d)/p(z) …(7)
Furthermore, by Bayes' theorem, the topic assigning unit 12 can use the obtained p(d), p(d|z), and p(z) to obtain the posterior probability p(z|d) of the hidden variable z given the text data d as follows.
p(z|d)=p(d|z)p(z)/p(d) …(8)

すなわち、トピック付与部１２は、記憶部１１ｂに格納されたトピックモデルTM⁰のパラメータp(z), p(v|z)を用い、任意のテキストデータdに対する隠れ変数zの事後確率p(z|d)を計算できる。なお、事後確率p(z|d)の計算手順は上記のものに限定されない。最終的にp(z|d)が得られるのであればどのような計算手順で事後確率p(z|d)が計算されてもよい。 That is, using the parameters p(z) and p(v|z) of the topic model TM⁰ stored in the storage unit 11b, the topic assigning unit 12 can calculate the posterior probability p(z|d) of the hidden variable z for any text data d. Note that the procedure for calculating the posterior probability p(z|d) is not limited to the above; any calculation procedure may be used as long as p(z|d) is finally obtained.
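The calculation of p(z|d) can be sketched as follows. This is a minimal illustration, not the patented procedure: it assumes the usual unigram mixture form p(z,d) = p(z) Π_v p(v|z)^n_dv, because equations (4) and (6) are not reproduced here. The function name `topic_posterior` and the data layout are hypothetical.

```python
import math

def topic_posterior(p_z, p_v_given_z, counts):
    """Return [p(z|d)] for one text data d.

    p_z:          list of p(z), one entry per hidden variable z
    p_v_given_z:  list of dicts, p_v_given_z[z][v] = p(v|z), assumed > 0
    counts:       dict mapping each character string v in d to n_dv
    """
    # log p(z,d) = log p(z) + sum_v n_dv * log p(v|z)   (assumed UM form)
    log_joint = []
    for z, prior in enumerate(p_z):
        s = math.log(prior)
        for v, n in counts.items():
            s += n * math.log(p_v_given_z[z][v])
        log_joint.append(s)
    # p(z|d) = p(z,d) / p(d), with p(d) = sum_z p(z,d).
    # Subtracting the maximum keeps the exponentials in a safe range.
    m = max(log_joint)
    weights = [math.exp(s - m) for s in log_joint]
    total = sum(weights)
    return [w / total for w in weights]
```

Working in log space is a design choice made here for numerical stability on long text data; it gives the same p(z|d) as applying equations (7) and (8) directly.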

事後確率p(z|d)はトピックの候補の適切さを表す指標であり、これらをトピック情報とすることができる。以下、このようなトピック情報を例示する。
トピック情報の例１：各トピックの候補に対応する各隠れ変数z_nに対応する各事後確率p(z_n|d)(n=1,...,Z)のうち、事後確率の大きな上位N個の隠れ変数z_n'とそれらにそれぞれ対応する事後確率p(z_n'|d)又は当該事後確率p(z_n'|d)の写像との組をテキストデータdのトピック情報とする。なお、Nは1以上Z以下の自然数定数である。例えば、N=1であり、p(z₅|d)=0.95が最大の事後確率である場合、隠れ変数z₅と事後確率p(z₅|d)=0.95との組をテキストデータdのトピック情報とする。 The posterior probability p (z | d) is an index representing the appropriateness of the topic candidates, and these can be used as topic information. Hereinafter, such topic information will be exemplified.
Topic information example 1: Among the posterior probabilities p(z_n|d) (n=1,...,Z) corresponding to the hidden variables z_n of the topic candidates, the N hidden variables z_n' with the largest posterior probabilities are paired with their corresponding posterior probabilities p(z_n'|d), or with mappings of those posterior probabilities, and the resulting pairs are used as the topic information of the text data d. N is a natural number constant from 1 to Z inclusive. For example, if N=1 and p(z₅|d)=0.95 is the maximum posterior probability, the pair of the hidden variable z₅ and the posterior probability p(z₅|d)=0.95 is used as the topic information of the text data d.

トピック情報の例２：各トピックの候補に対応する各隠れ変数z_nに対応する各事後確率p(z_n|d)(n=1,...,Z)のうち、事後確率の大きな上位N個の隠れ変数z_n'又は当該隠れ変数z_n'の写像をテキストデータdのトピック情報とする。例えば、N=1であり、p(z₅|d)=0.95が最大の事後確率である場合、隠れ変数z₅をテキストデータdのトピック情報とする。 Topic information example 2: Among the posterior probabilities p(z_n|d) (n=1,...,Z) corresponding to the hidden variables z_n of the topic candidates, the N hidden variables z_n' with the largest posterior probabilities, or mappings of those hidden variables, are used as the topic information of the text data d. For example, if N=1 and p(z₅|d)=0.95 is the maximum posterior probability, the hidden variable z₅ is used as the topic information of the text data d.

トピック情報の例３：各トピックの候補に対応する各隠れ変数z_nに対応する各事後確率p(z_n|d)(n=1,...,Z)のうち、上位N個の事後確率p(z_n'|d)又は当該事後確率p(z_n'|d)の写像をそれぞれn'次元目の要素とし、他のZ-N個の要素を0としたZ次元ベクトルをテキストデータdのトピック情報とする。例えばN=1であり、p(z₂|d)=0.95が最大の事後確率である場合、Z次元ベクトル(0, 0.95, 0,...,0)をテキストデータdのトピック情報とする。 Topic information example 3: Among the posterior probabilities p(z_n|d) (n=1,...,Z) corresponding to the hidden variables z_n of the topic candidates, the top N posterior probabilities p(z_n'|d), or mappings of them, are each placed in the n'-th dimension of a Z-dimensional vector whose other Z-N elements are 0, and this vector is used as the topic information of the text data d. For example, if N=1 and p(z₂|d)=0.95 is the maximum posterior probability, the Z-dimensional vector (0, 0.95, 0,...,0) is used as the topic information of the text data d.

トピック情報の例４：各トピックの候補に対応する各隠れ変数z_nに対応する各事後確率p(z_n|d)(n=1,...,Z)のうち、上位N個の事後確率p(z_n'|d)にそれぞれ対応するn'次元目の要素を第１定数（例えば1）とし、他のZ-N個の要素を第２定数（例えば0）としたZ次元ベクトルをテキストデータdのトピック情報とする。例えばN=1であり、p(z₂|d)=0.95が最大の事後確率である場合、Z次元ベクトル(0, 1, 0,...,0)をテキストデータdのトピック情報とする。 Topic information example 4: Among the posterior probabilities p(z_n|d) (n=1,...,Z) corresponding to the hidden variables z_n of the topic candidates, the n'-th dimension elements corresponding to the top N posterior probabilities p(z_n'|d) are set to a first constant (for example, 1) and the other Z-N elements are set to a second constant (for example, 0), and the resulting Z-dimensional vector is used as the topic information of the text data d. For example, if N=1 and p(z₂|d)=0.95 is the maximum posterior probability, the Z-dimensional vector (0, 1, 0,...,0) is used as the topic information of the text data d.

トピック情報の例５：トピック情報の例１又は２において、「事後確率の大きな上位N個の隠れ変数z_n'」を「事後確率が閾値以上となる隠れ変数z_n'」に置換した方法でテキストデータdのトピック情報を定める。 Topic information example 5: The topic information of the text data d is determined as in topic information example 1 or 2, except that "the N hidden variables z_n' with the largest posterior probabilities" is replaced with "the hidden variables z_n' whose posterior probability is equal to or greater than a threshold".

トピック情報の例６：トピック情報の例３又は４において、「上位N個の事後確率p(z_n'|d)」を「閾値以上の事後確率p(z_n'|d)」に置換した方法でテキストデータdのトピック情報を定める。 Topic information example 6: The topic information of the text data d is determined as in topic information example 3 or 4, except that "the top N posterior probabilities p(z_n'|d)" is replaced with "the posterior probabilities p(z_n'|d) equal to or greater than a threshold".
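The topic information examples above can be illustrated with a short sketch. The function names are hypothetical, and topic indices are 0-based here, whereas the text numbers the hidden variables from 1.

```python
def top_n_pairs(posterior, n):
    """Example 1: (topic index, posterior) pairs for the N largest posteriors."""
    order = sorted(range(len(posterior)), key=lambda i: posterior[i], reverse=True)
    return [(i, posterior[i]) for i in order[:n]]

def top_n_vector(posterior, n, binary=False):
    """Examples 3 and 4: Z-dimensional vector that keeps only the top N entries.

    With binary=True the kept entries become the first constant 1 (example 4);
    all other entries are the second constant 0.
    """
    keep = {i for i, _ in top_n_pairs(posterior, n)}
    return [(1 if binary else p) if i in keep else 0
            for i, p in enumerate(posterior)]

def threshold_vector(posterior, threshold, binary=False):
    """Example 6: like top_n_vector, but keeps entries at or above a threshold."""
    return [(1 if binary else p) if p >= threshold else 0 for p in posterior]
```

For a posterior list such as `[0.03, 0.95, 0.02]`, `top_n_vector(posterior, 1)` keeps only the second element, which corresponds to the vector (0, 0.95, 0,...,0) in example 3.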

図４は、記憶部１１ａに格納されたテキストデータの集合Dを例示した図であり、図５Ａは、トピック情報付きテキストデータの集合D'を例示した図である。図４に例示したテキストデータの集合Dは、前処理を行った後の文書データであるテキストデータと当該テキストデータのIDとが対応付けされたデータである。また、図５Ａに例示したトピック情報付きテキストデータの集合D'は、テキストデータと、当該テキストデータのIDと、当該テキストデータに対してトピック情報の例１によって生成されたトピック情報とが対応付けされたデータである。このように、トピック情報はテキストデータごとに付与されており、同じテキストデータ内に表れるエンティティには同じトピック情報が対応する。
なお、予めテキストデータにトピック情報が付与されている場合には、そのトピック情報を用いればよい。また、事前にトピック情報付きテキストデータの集合D'が生成されている場合にはステップＳ１２の処理を実行しなくてもよい（[トピック情報の例]の説明終わり）。 FIG. 4 is a diagram illustrating a set D of text data stored in the storage unit 11a, and FIG. 5A is a diagram illustrating a set D ′ of text data with topic information. The text data set D illustrated in FIG. 4 is data in which text data, which is document data after preprocessing, is associated with an ID of the text data. Further, in the set D ′ of text data with topic information illustrated in FIG. 5A, the text data, the ID of the text data, and the topic information generated by the topic information example 1 are associated with the text data. Data. Thus, topic information is assigned to each text data, and the same topic information corresponds to entities appearing in the same text data.
If topic information is previously assigned to text data, the topic information may be used. Further, when the set D ′ of text data with topic information is generated in advance, the process of step S12 may not be executed (end of description of [example of topic information]).

《素性抽出：ステップＳ１３》
ユーザが欲するエンティティの例が正例シードエンティティRP_e ⁰として素性抽出部１３に入力される。例えば、<広島>などが正例シードエンティティとして入力される。また、負例シードエンティティRN_e ⁰が素性抽出部１３に入力される。例えば、<日本>などが負例シードエンティティとして入力される。正例シードエンティティRP_e ⁰は初回の処理（j=1）における正例エンティティであり、負例シードエンティティRN_e ⁰は初回の処理（j=1）における負例エンティティである。
<< Feature Extraction: Step S13 >>
An example of an entity that the user desires is input to the feature extraction unit 13 as a positive example seed entity RP_e ⁰. For example, <Hiroshima> is input as a positive example seed entity. A negative example seed entity RN_e ⁰ is also input to the feature extraction unit 13. For example, <Japan> is input as a negative example seed entity. The positive example seed entity RP_e ⁰ is the positive example entity in the first iteration (j=1), and the negative example seed entity RN_e ⁰ is the negative example entity in the first iteration (j=1).

正例シードエンティティRP_e ⁰は、ユーザによって選択されたものである。負例シードエンティティRN_e ⁰は、ユーザによって選択されたものであってもよいし、テキストデータの集合Dから半自動または全自動で生成されたものであってもよい。以下に負例シードエンティティRN_e ⁰を半自動または全自動で生成する方法を例示する。 The positive seed entity RP _e ⁰ has been selected by the user. The negative example seed entity RN _e ⁰ may be selected by the user, or may be generated semi-automatically or fully automatically from the text data set D. A method for generating the negative example seed entity RN _e ⁰ semi-automatically or fully automatically will be exemplified below.

[負例シードエンティティRN_e ⁰の半自動生成方法の例]
負例シードエンティティ生成部（図示せず）が、テキストデータの集合Dから、何れの正例シードエンティティRP_e ⁰も含まないテキストデータを所定個数抽出し、抽出した各テキストデータから１つずつランダムに名詞を選択し、それらを負例エンティティ候補として出力する。表示部（図示せず）はこれらの負例エンティティ候補を表示し、これらから負例シードエンティティを選択するようにユーザに促す表示を行う。ユーザによる選択内容は負例シードエンティティ生成部に入力され、負例シードエンティティ生成部は、選択された負例エンティティ候補を負例シードエンティティRN_e ⁰として出力する（[負例シードエンティティRN_e ⁰の半自動生成方法の例]の説明終わり）。 [Example of semi-automatic generation of negative example seed entity RN _e ⁰ ]
A negative example seed entity generation unit (not shown) extracts, from the set D of text data, a predetermined number of text data that contain none of the positive example seed entities RP_e ⁰, randomly selects one noun from each extracted text data, and outputs these nouns as negative example entity candidates. A display unit (not shown) displays these negative example entity candidates and prompts the user to select negative example seed entities from among them. The user's selection is input to the negative example seed entity generation unit, which outputs the selected negative example entity candidates as the negative example seed entities RN_e ⁰ (end of description of [Example of semi-automatic generation method of negative example seed entity RN_e ⁰]).
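The candidate generation step of the semi-automatic method can be sketched as below. This is only an illustration under stated assumptions: `nouns_of` is a hypothetical callable standing in for whatever noun extraction (for example, morphological analysis) a real system would use, and containment of a seed is tested by simple substring match.

```python
import random

def negative_candidates(docs, positive_seeds, nouns_of, how_many, rng=None):
    """Pick one random noun from each of up to how_many seed-free text data."""
    rng = rng or random.Random()
    # Keep only text data that contain none of the positive seed entities.
    pool = [d for d in docs if not any(seed in d for seed in positive_seeds)]
    chosen = rng.sample(pool, min(how_many, len(pool)))
    candidates = []
    for d in chosen:
        nouns = nouns_of(d)          # hypothetical noun extractor
        if nouns:                    # skip text data without any noun
            candidates.append(rng.choice(nouns))
    return candidates
```

The returned candidates would then be shown to the user, who chooses which of them become negative example seed entities.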

［負例シードエンティティRN_e ⁰の自動生成方法の例］
＜方法１＞
この例のデータ抽出装置１は、負例シードエンティティRN_e ⁰を自動生成する自動生成部１１０を有する（図１）。自動生成部１１０は、正例分布処理部１１１、負例トピック決定部１１２、及び負例シードエンティティ生成部１１３を有する（図２Ａ）。
[Example of automatic generation method of negative example seed entity RN_e ⁰]
<Method 1>
The data extraction apparatus 1 of this example includes an automatic generation unit 110 that automatically generates a negative example seed entity RN _e ⁰ (FIG. 1). The automatic generation unit 110 includes a positive example distribution processing unit 111, a negative example topic determination unit 112, and a negative example seed entity generation unit 113 (FIG. 2A).

まず、正例分布処理部１１１が、記憶部１１ａに格納されたテキストデータの集合Dのうち、正例シードエンティティRP_e ⁰を含むテキストデータの集合PDに含まれる全エンティティの出現確率分布である正例確率分布を表す情報（パラメータ）を得る。正例確率分布の代表例は、Bag-of-Wordsの仮定に従う全エンティティの多項分布である。以下に単語などの文字列をエンティティとし、Bag-of-Wordsの仮定に従う全エンティティの多項分布を正例確率分布とする例を示す。
この例での正例確率分布を表すパラメータは、エンティティである文字列vの生成確率p(v)である。生成確率p(v)は以下の関係を満たす。

ここでdはPDに含まれるある１テキストデータを示し、P(PD)はテキストデータの集合PDの出現確率を表す。vは単語などの文字列を表し、Vは文字列vの集合を表す。p(v)は文字列vの生成確率であり、以下の関係を満たす。

ここでn_dvはテキストデータd中に文字列vが出現した回数である。この生成確率p(v)（正例確率分布を表すパラメータ）は最尤推定法を用いて容易に求めることができる。具体的には、生成確率p(v)は以下の式によって計算され得る。

ここでn_PDvはテキストデータの集合PD中に文字列vが出現した回数を表す。N_PDはテキストデータの集合PDに含まれる文字列の総数（例えば、総単語数）を表し、N_PD＝Σ_v n_PDvである。これらの値は、記憶部１１ａに格納されたテキストデータの集合Dから得ることができる。
この例のデータ抽出装置１は、各文字列v=v₁,...,v_V∈Vに対応する各生成確率p(v₁),..., p(v_V)を正例確率分布のパラメータβ_p={p(v₁),..., p(v_V)}として出力する。 First, the positive example distribution processing unit 111 obtains information (parameters) representing a positive example probability distribution, which is the appearance probability distribution of all entities included in the set PD of text data that contain a positive example seed entity RP_e ⁰, out of the set D of text data stored in the storage unit 11a. A typical example of the positive example probability distribution is a multinomial distribution over all entities under the Bag-of-Words assumption. The following shows an example in which character strings such as words are the entities and the multinomial distribution over all entities under the Bag-of-Words assumption is the positive example probability distribution.
The parameter representing the positive example probability distribution in this example is the generation probability p (v) of the character string v that is an entity. The generation probability p (v) satisfies the following relationship.

Here, d indicates one text data included in the PD, and P (PD) indicates the appearance probability of the text data set PD. v represents a character string such as a word, and V represents a set of character strings v. p (v) is the generation probability of the character string v and satisfies the following relationship.

Here, n_dv is the number of times the character string v appears in the text data d. The generation probability p(v) (the parameter representing the positive example probability distribution) can easily be obtained by maximum likelihood estimation. Specifically, the generation probability p(v) can be calculated by the following equation.
p(v)=n_PDv/N_PD
Here, n_PDv represents the number of times the character string v appears in the set PD of text data. N_PD represents the total number of character strings (for example, the total number of words) included in the set PD of text data, and N_PD=Σ_v n_PDv. These values can be obtained from the set D of text data stored in the storage unit 11a.
The data extraction apparatus 1 of this example outputs the generation probabilities p(v₁),..., p(v_V) corresponding to the character strings v=v₁,...,v_V∈V as the parameter β_p={p(v₁),..., p(v_V)} of the positive example probability distribution.
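A minimal sketch of estimating the positive example probability distribution follows. It assumes that the maximum likelihood estimate takes the usual relative-frequency form n_PDv/N_PD, and that each text data is already split into a list of character strings; the function name is hypothetical.

```python
from collections import Counter

def positive_distribution(docs, positive_seeds):
    """Return {v: p(v)} estimated from the text data that contain a positive seed.

    docs:            list of text data, each a list of character strings
    positive_seeds:  positive example seed entities
    """
    counts = Counter()
    for tokens in docs:
        # Only text data containing at least one positive seed belong to PD.
        if any(seed in tokens for seed in positive_seeds):
            counts.update(tokens)            # accumulates n_PDv
    total = sum(counts.values())             # N_PD
    return {v: n / total for v, n in counts.items()}
```

Character strings that never occur in PD are simply absent from the result, so a caller comparing this distribution with topic distributions over the full vocabulary V would need to treat missing entries as probability 0 or apply smoothing.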

次に、負例トピック決定部１１２が、同一のトピック情報に対応するテキストデータの集合が含むエンティティの出現確率分布であるトピック確率分布を表す情報をトピック情報ごとに得て、正例確率分布を表す情報及びトピック確率分布を表す情報を用いて得られる正例確率分布とトピック確率分布との距離に基づいて、少なくとも一部のトピック情報を負例トピック情報として選択する。すなわち、負例トピック決定部１１２は、正例確率分布とトピック確率分布との情報量距離を求めて、情報量距離の大きなトピック確率分布に対応するトピック情報の中から負例トピック情報を選択する。 Next, the negative example topic determination unit 112 obtains, for each topic information, information representing a topic probability distribution that is an appearance probability distribution of entities included in a set of text data corresponding to the same topic information. At least a part of topic information is selected as negative example topic information based on the distance between the positive example probability distribution and the topic probability distribution obtained using the information and the topic probability distribution. That is, the negative example topic determination unit 112 obtains an information amount distance between the positive example probability distribution and the topic probability distribution, and selects negative example topic information from the topic information corresponding to the topic probability distribution having a large information amount distance. .

以下に正例確率分布のパラメータが上述したβ_p={p(v₁),..., p(v_V)}であり、トピック確率分布を表す情報（パラメータ）がβ_t={p(v₁|z_t),..., p(v_V|z_t)}(t=1,...,T、Tは正整数)である場合の例を示す。確率分布間の距離尺度にはKL-divergenceやJS-divergenceが用いられるが、ここでは距離の対称性のあるJS-divergenceが用いられる。２つの確率分布p,qの間のJS-divergence D_JS(q||p)は以下のように定義される。

ただし、0≦λ≦1であり、D_KLはKL-divergenceを表す。確率分布p,qの間のKL-divergenceは以下のように定義される。

ただし、xは確率変数を表す。
この場合、負例トピック決定部１１２は、p=β_P、q∈{β_1, β_2, … , β_T}とし、β_Pと各β_t(t=1,2,…,T)との間のJS-divergenceを計算する。負例トピック決定部１１２は、例えば、(1)正例確率分布とのJS-divergenceがある一定の閾値以上のパラメータβ_tに対応するトピック情報、或いは(2)正例確率分布とのJS-divergenceの大きな方から順にN個のパラメータβ_tに対応するトピック情報を負例トピック情報とする。負例トピック決定部１１２は、負例トピック情報を特定する情報（例えばt）を出力する。 In the following example, the parameter of the positive example probability distribution is β_p={p(v₁),..., p(v_V)} described above, and the information (parameter) representing each topic probability distribution is β_t={p(v₁|z_t),..., p(v_V|z_t)} (t=1,...,T, where T is a positive integer). KL-divergence and JS-divergence are used as distance measures between probability distributions; here JS-divergence, which is symmetric, is used. The JS-divergence D_JS(q||p) between two probability distributions p and q is defined as follows.

However, 0 ≦ λ ≦ 1, and D _KL represents KL-divergence. The KL-divergence between the probability distributions p and q is defined as follows.

Where x represents a random variable.
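For reference, and as an assumption rather than a reproduction of the original equations, a λ-weighted JS-divergence consistent with the constraint 0≦λ≦1, together with the standard discrete KL-divergence, can be written as follows. The exact form intended by the original may differ.

```latex
D_{JS}(q\,\|\,p) = \lambda\, D_{KL}\bigl(q \,\|\, \lambda q + (1-\lambda)p\bigr)
                 + (1-\lambda)\, D_{KL}\bigl(p \,\|\, \lambda q + (1-\lambda)p\bigr)

D_{KL}(q\,\|\,p) = \sum_{x} q(x)\,\log\frac{q(x)}{p(x)}
```

With λ=1/2 the first expression is symmetric in p and q.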
In this case, the negative example topic determination unit 112 sets p=β_P and q∈{β_1, β_2, …, β_T}, and calculates the JS-divergence between β_P and each β_t (t=1,2,…,T). The negative example topic determination unit 112 then uses as negative example topic information, for example, (1) the topic information corresponding to each parameter β_t whose JS-divergence from the positive example probability distribution is equal to or greater than a certain threshold, or (2) the topic information corresponding to the N parameters β_t with the largest JS-divergence from the positive example probability distribution. The negative example topic determination unit 112 outputs information specifying the negative example topic information (for example, t).
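The selection of negative example topics by divergence can be sketched as follows. The λ-weighted JS-divergence used here is an assumed form (symmetric for λ=0.5, and requiring 0<λ<1 to avoid division by zero in this simple implementation), and all function names are hypothetical.

```python
import math

def kl(q, p):
    """D_KL(q||p) for two distributions given as equal-length lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def js(q, p, lam=0.5):
    """Lambda-weighted JS-divergence (assumed form); symmetric when lam=0.5."""
    m = [lam * qi + (1 - lam) * pi for qi, pi in zip(q, p)]
    return lam * kl(q, m) + (1 - lam) * kl(p, m)

def negative_topics(beta_p, betas, top_n=None, threshold=None):
    """Return topic indices t selected as negative example topics.

    If top_n is given, rule (2) is used: the top_n topics with the largest
    divergence. Otherwise rule (1) is used: every topic whose divergence is
    at or above threshold.
    """
    dist = [js(beta_t, beta_p) for beta_t in betas]
    if top_n is not None:
        order = sorted(range(len(betas)), key=lambda t: dist[t], reverse=True)
        return order[:top_n]
    return [t for t, d in enumerate(dist) if d >= threshold]
```

Each β is represented as a list of probabilities over the same ordered vocabulary, so β_P and every β_t must have equal length.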

負例トピック情報を特定する情報は、負例シードエンティティ生成部１１３に入力される。負例シードエンティティ生成部１１３は、負例トピック決定部１１２で選択された負例トピック情報に対応するエンティティを負例シードエンティティRN_e ⁰として選択する。このような負例シードエンティティRN_e ⁰の選択方法の例は以下の通りである。
（選択方法１）単語などの文字列が負例シードエンティティRN_e ⁰として選択される。
（選択方法２）文書などのテキストデータが負例シードエンティティとして選択される。
何れの選択方法の場合も負例トピック情報との関連性が強い（負例トピックの寄与度の高い）エンティティが負例シードエンティティRN_e ⁰とされる。以下に選択方法１，２の具体例を示す。
（選択方法１）単語などの文字列vが負例シードエンティティRN_e ⁰として選択される場合、負例シードエンティティ生成部１１３は、負例トピック情報に対応するパラメータβ_tからp(z_t|v)を以下のように計算する。
p(z_t|v)=p(v|z_t)p(z_t)/Σ_z p(v|z)p(z)
負例シードエンティティ生成部１１３は、この値p(z_t|v)の大きな文字列vを負例シードエンティティRN_e ⁰として選択する。例えば、負例シードエンティティ生成部１１３は、p(z_t|v)の大きい順に所定個のp(z_t|v)を選択し、選択したp(z_t|v)に対応する文字列vを負例シードエンティティRN_e ⁰とする。或いは、負例シードエンティティ生成部１１３は、例えば、閾値よりも大きなp(z_t|v)を選択し、選択したp(z_t|v)に対応する文字列vを負例シードエンティティRN_e ⁰とする。
（選択方法２）文書などのテキストデータが負例シードエンティティRN_e ⁰として選択される場合、例えば、あらかじめ全テキストデータdに対応するトピック情報である事後確率p(z|d)を計算しておき、トピックごと（隠れ変数zごと）にp(z|d)の値の大きなテキストデータdを記憶部１１ａに格納しておく。例えば、トピックごとにp(z|d)の大きい順に所定個のテキストデータdを選択しておき、それらを記憶部１１ａに格納しておく、又は、トピックzごとに閾値よりも大きなp(z|d)に対応するテキストデータdを選択しておき、それらを記憶部１１ａに格納しておく。負例シードエンティティ生成部１１３は、このように記憶部１１ａに格納しておいたテキストデータdから、負例トピック決定部１１２で得られた負例トピック情報に対応するp(z_t|d)に対応するテキストデータdを負例シードエンティティRN_e ⁰として選択する（＜方法１＞の説明終わり）。 Information specifying negative example topic information is input to the negative example seed entity generation unit 113. The negative example seed entity generation unit 113 selects an entity corresponding to the negative example topic information selected by the negative example topic determination unit 112 as the negative example seed entity RN _e ⁰ . An example of a method for selecting such a negative example seed entity RN _e ⁰ is as follows.
(Selection Method 1) A character string such as a word is selected as a negative example seed entity RN _e ⁰ .
(Selection Method 2) Text data such as a document is selected as a negative example seed entity.
In any of the selection methods, an entity that is strongly related to the negative example topic information (a negative example topic has a high degree of contribution) is set as a negative example seed entity RN _e ⁰ . Specific examples of the selection methods 1 and 2 are shown below.
(Selection Method 1) When a character string v such as a word is selected as the negative example seed entity RN_e ⁰, the negative example seed entity generation unit 113 calculates p(z_t|v) from the parameter β_t corresponding to the negative example topic information as follows.
p(z_t|v)=p(v|z_t)p(z_t)/Σ_z p(v|z)p(z)
The negative example seed entity generation unit 113 selects character strings v with large values of p(z_t|v) as the negative example seed entities RN_e ⁰. For example, the negative example seed entity generation unit 113 selects a predetermined number of p(z_t|v) in descending order and uses the character strings v corresponding to the selected p(z_t|v) as the negative example seed entities RN_e ⁰. Alternatively, the negative example seed entity generation unit 113 selects, for example, the p(z_t|v) larger than a threshold and uses the character strings v corresponding to the selected p(z_t|v) as the negative example seed entities RN_e ⁰.
(Selection Method 2) When text data such as a document is selected as the negative example seed entity RN_e ⁰, for example, the posterior probability p(z|d), which is the topic information corresponding to every text data d, is calculated in advance, and the text data d with large values of p(z|d) are stored in the storage unit 11a for each topic (for each hidden variable z). For example, a predetermined number of text data d are selected for each topic in descending order of p(z|d) and stored in the storage unit 11a, or the text data d whose p(z|d) is larger than a threshold are selected for each topic z and stored in the storage unit 11a. From the text data d stored in the storage unit 11a in this way, the negative example seed entity generation unit 113 selects, as the negative example seed entity RN_e ⁰, the text data d corresponding to p(z_t|d) for the negative example topic information obtained by the negative example topic determination unit 112 (end of description of <Method 1>).
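Selection method 1 can be sketched as below, applying Bayes' theorem in the form p(z_t|v) = p(v|z_t)p(z_t)/Σ_z p(v|z)p(z). The function names and the dictionary layout for p(v|z) are hypothetical.

```python
def topic_given_word(p_z, p_v_given_z, t, v):
    """p(z_t|v) = p(v|z_t) p(z_t) / sum_z p(v|z) p(z)."""
    evidence = sum(p_v_given_z[z][v] * p_z[z] for z in range(len(p_z)))
    return p_v_given_z[t][v] * p_z[t] / evidence

def negative_seed_words(p_z, p_v_given_z, t, how_many):
    """Return the how_many character strings v with the largest p(z_t|v)."""
    vocab = list(p_v_given_z[t])
    vocab.sort(key=lambda v: topic_given_word(p_z, p_v_given_z, t, v),
               reverse=True)
    return vocab[:how_many]
```

The threshold variant described in the text would replace the final slice with a filter on `topic_given_word(...)` exceeding the threshold.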

＜方法２＞
この例のデータ抽出装置１は、負例シードエンティティRN_e ⁰を自動生成する自動生成部１２０を有する（図１）。自動生成部１２０は、シード正例トピックスコア生成部１２１、負例トピック決定部１２２、及び負例シードエンティティ生成部１１３を有する（図２Ｂ）。
<Method 2>
The data extraction apparatus 1 of this example includes an automatic generation unit 120 that automatically generates a negative example seed entity RN _e ⁰ (FIG. 1). The automatic generation unit 120 includes a seed positive example topic score generation unit 121, a negative example topic determination unit 122, and a negative example seed entity generation unit 113 (FIG. 2B).

まずシード正例トピックスコア作成部１２１が、正例シードエンティティRP_e ⁰を含むテキストデータdに対する各トピックzの適切さを表すシード正例トピック情報を当該トピックごと（隠れ変数zごと）に集計し、それによって得られる当該トピックごと（隠れ変数zごと）の集計結果を当該トピックのシード正例トピックスコアとして得る。例えば、シード正例トピックスコア作成部１２１は、トピック付与部１２で得られた事後確率p(z|d)のうち正例文書PDに対応するものの和、すなわちΣ_{d∈PD} p(z|d)をトピックごと（隠れ変数zごと）に計算し、それを当該トピック（隠れ変数z）に対するシード正例トピックスコアとする。或いは、例えばΣ_{d∈PD} p(z|d)の単調増加関数値が当該トピック（隠れ変数z）に対するシード正例トピックスコアとされてもよい。 First, the seed positive example topic score generation unit 121 aggregates, for each topic (for each hidden variable z), seed positive example topic information representing the appropriateness of each topic z for the text data d that contain a positive example seed entity RP_e ⁰, and obtains the aggregation result for each topic (for each hidden variable z) as the seed positive example topic score of that topic. For example, the seed positive example topic score generation unit 121 calculates, for each topic (for each hidden variable z), the sum of the posterior probabilities p(z|d) obtained by the topic assigning unit 12 over the positive example documents PD, that is, Σ_{d∈PD} p(z|d), and uses it as the seed positive example topic score for that topic (hidden variable z). Alternatively, for example, the value of a monotonically increasing function of Σ_{d∈PD} p(z|d) may be used as the seed positive example topic score for that topic (hidden variable z).

次に、負例トピック決定部１２２は、トピックのシード正例トピックスコアの大きさに基づいて選択したトピックに対応するトピック情報を負例トピック情報とする。例えば、負例トピック決定部１２２は、シード正例トピックスコアの低い順に所定個のトピック（隠れ変数z）を選択し、選択したトピックに対応するトピック情報を負例トピック情報とする。或いは、負例トピック決定部１２２は、シード正例トピックスコアが所定の閾値以下となるトピック（隠れ変数z）を選択し、選択したトピックに対応するトピック情報を負例トピック情報とする。負例トピック決定部１２２は、負例トピック情報を特定する情報（例えば隠れ変数z_tに対応するt）を出力する。 Next, the negative example topic determination unit 122 uses, as negative example topic information, the topic information corresponding to topics selected based on the magnitude of their seed positive example topic scores. For example, the negative example topic determination unit 122 selects a predetermined number of topics (hidden variables z) in ascending order of seed positive example topic score, and uses the topic information corresponding to the selected topics as negative example topic information. Alternatively, the negative example topic determination unit 122 selects the topics (hidden variables z) whose seed positive example topic score is equal to or less than a predetermined threshold, and uses the topic information corresponding to the selected topics as negative example topic information. The negative example topic determination unit 122 outputs information specifying the negative example topic information (for example, t corresponding to the hidden variable z_t).

その後、＜方法１＞と同様に、負例シードエンティティ生成部１１３が、負例トピック決定部１２２で選択された負例トピック情報に対応するエンティティを負例シードエンティティRN_e ⁰として選択する（＜方法２＞の説明終わり）。 Thereafter, as in <Method 1>, the negative example seed entity generation unit 113 selects the entities corresponding to the negative example topic information selected by the negative example topic determination unit 122 as the negative example seed entities RN_e ⁰ (end of description of <Method 2>).
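The scoring step of Method 2 can be sketched as follows, using the plain sum Σ_{d∈PD} p(z|d) as the seed positive example topic score. The function names and the mapping from text data IDs to posterior lists are hypothetical.

```python
def seed_positive_topic_scores(posteriors, positive_doc_ids):
    """Sum p(z|d) over the positive text data PD, separately for each topic z.

    posteriors:        dict mapping a text data ID to the list [p(z|d)]
    positive_doc_ids:  IDs of text data that contain a positive seed entity
    """
    num_topics = len(next(iter(posteriors.values())))
    scores = [0.0] * num_topics
    for d in positive_doc_ids:
        for z, p in enumerate(posteriors[d]):
            scores[z] += p
    return scores

def lowest_scoring_topics(scores, how_many):
    """Return the how_many topic indices with the smallest scores."""
    return sorted(range(len(scores)), key=lambda z: scores[z])[:how_many]
```

Topics returned by `lowest_scoring_topics` are the ones least associated with the positive seeds, which is why they serve as negative example topics.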

以上の負例シードエンティティRN_e ⁰の自動生成方法によれば、正例エンティティとの関連性が低いトピック情報から負例の初期集合（負例シードエンティティ）が自動生成されるため、早期にセマンティックドリフトが起こる可能性を減らすことができ、結果として最終的に得られるエンティティ集合の精度を高めることができる。 According to the method for automatically generating the negative example seed entity RN _e ⁰ described above, an initial set of negative examples (negative example seed entity) is automatically generated from topic information having low relevance to the positive example entity. The possibility of drifting can be reduced and the accuracy of the resulting entity set can be increased.

素性抽出部１３は、記憶部１１ａに格納されたテキストデータの集合Dから、何れかの正例エンティティRP_e ^j-1（初期の正例エンティティRP_e ⁰は正例シードエンティティRP_e ⁰）を含む文字列である「正例テキスト」を抽出する。正例テキストの例は、テキストデータが含む文、フレーズ、単語列などである。正例テキストは、正例エンティティRP_e ^j-1とテキストデータとの組に対して１個以上抽出される。素性抽出部１３は、抽出した正例テキストとの関係で定まる正例エンティティRP_e ^j-1の特徴を表す素性fP'_e ^jを抽出する。この例では、正例エンティティRP_e ^j-1を含む正例テキストごとに当該正例エンティティRP_e ^j-1の素性fP'_e ^jが抽出される。以下に、正例エンティティRP_e ^j-1の素性fP'_e ^jを例示する。 The feature extraction unit 13 extracts, from the set D of text data stored in the storage unit 11a, "positive example text", which is a character string containing any positive example entity RP_e ^j-1 (the initial positive example entity RP_e ⁰ is the positive example seed entity RP_e ⁰). Examples of positive example text are sentences, phrases, and word strings included in the text data. One or more positive example texts are extracted for each pair of a positive example entity RP_e ^j-1 and text data. The feature extraction unit 13 extracts a feature fP'_e ^j representing a characteristic of the positive example entity RP_e ^j-1 determined by its relationship with the extracted positive example text. In this example, the feature fP'_e ^j of the positive example entity RP_e ^j-1 is extracted for each positive example text containing that positive example entity RP_e ^j-1. The feature fP'_e ^j of the positive example entity RP_e ^j-1 is exemplified below.

[正例エンティティRP_e ^j-1の素性fP'_e ^jの例]
正例エンティティRP_e ^j-1の素性fP'_e ^jは、正例テキスト（正例エンティティRP_e ^j-1を含む文字列であってテキストデータが含むもの）に対応し、正例テキストと当該正例エンティティRP_e ^j-1との関係を表す情報を含む。このような情報であればどのようなものを素性として用いてもよい。
例えば、何れかの正例エンティティRP_e ^j-1を含むテキストデータ内における当該正例エンティティRP_e ^j-1に一致するエンティティ（一致エンティティ）から前後所定単語数以内（正例テキスト内）に位置する単語（周辺単語）の表記と当該一致エンティティに対する当該周辺単語の相対位置を表す情報との組（表層素性）、一致エンティティ又は周辺単語の品詞情報（品詞素性）や固有名詞情報（固有名詞素性）や構文情報（構文素性）、テキストデータ内での正例エンティティRP_e ^j-1の出現回数やテキストデータの集合D内での正例エンティティRP_e ^j-1の出現回数（出現回数素性）のうち、少なくとも一つに対応する情報を素性fP'_e ^jとすることができる。 [Example of feature fP ' _e ^j of positive entity RP _e ^j-1 ]
The feature fP'_e ^j of the positive example entity RP_e ^j-1 corresponds to a positive example text (a character string that contains the positive example entity RP_e ^j-1 and is included in the text data) and includes information representing the relationship between the positive example text and that positive example entity RP_e ^j-1. Any such information may be used as a feature.
For example, the feature fP'_e ^j can be information corresponding to at least one of the following: a pair of the surface form of a word (surrounding word) located within a predetermined number of words before or after (within the positive example text) an entity that matches a positive example entity RP_e ^j-1 (matching entity) in text data containing that positive example entity, and information representing the position of the surrounding word relative to the matching entity (surface feature); part-of-speech information (part-of-speech feature), proper noun information (proper noun feature), or syntax information (syntax feature) of the matching entity or a surrounding word; and the number of occurrences of the positive example entity RP_e ^j-1 in the text data or in the set D of text data (occurrence count feature).

表層素性の例は「ex+1="は"」「ex-1="で"」などであり、これらは周辺単語（前者の例では「は」）と一致エンティティに対する周辺単語の相対位置（前者の例では「ex+1」）を表す情報との組を特定する情報である。「ex」は一致エンティティを表し、「ex+β」は一致エンティティexのβ単語後の単語を表し、「ex-β」は一致エンティティexのβ単語前の単語を表す。品詞素性の例は「ex+1=POS：助詞」「ex=POS：名詞」などであり、これらは一致エンティティに対する周辺単語の相対位置（前者の例では「ex+1」、後者の例では「ex」）と一致エンティティ又は周辺単語の品詞との組を特定する情報である。固有名詞素性の例は「ex=ORG」「ex-1=ORG」などであり、これらは一致エンティティに対する周辺単語の相対位置と一致エンティティ又は周辺単語の固有名詞との組を特定する情報である。構文素性の例は、正例テキスト内での一致エンティティの「係り受けの階層」を表す情報である。出現回数素性の例は、テキストデータやテキストデータの集合Dが含む正例エンティティRP_e ^j-1の個数である（[正例エンティティRP_e ^j-1の素性fP'_e ^jの例]の説明終わり）。 Examples of surface features are "ex+1="は"" and "ex-1="で"", which are information specifying a pair of a surrounding word ("は" in the former example) and information representing the position of the surrounding word relative to the matching entity ("ex+1" in the former example). "ex" represents the matching entity, "ex+β" represents the word β words after the matching entity ex, and "ex-β" represents the word β words before the matching entity ex. Examples of part-of-speech features are "ex+1=POS:particle" and "ex=POS:noun", which are information specifying a pair of a position relative to the matching entity ("ex+1" in the former example, "ex" in the latter) and the part of speech of the matching entity or surrounding word. Examples of proper noun features are "ex=ORG" and "ex-1=ORG", which are information specifying a pair of a position relative to the matching entity and the proper noun of the matching entity or surrounding word. An example of a syntax feature is information representing the "dependency hierarchy" of the matching entity within the positive example text. An example of an occurrence count feature is the number of positive example entities RP_e ^j-1 included in the text data or in the set D of text data (end of description of [Example of feature fP'_e ^j of positive example entity RP_e ^j-1]).
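Surface features of the kind shown above can be produced with a short sketch like the following. It assumes the text is already split into words and that an entity is a single word; the function name and the `window` parameter (the predetermined number of words) are hypothetical.

```python
def surface_features(tokens, entity, window=1):
    """Return surface features such as 'ex+1="は"' for each match of entity.

    tokens:  list of words of one positive (or negative) example text
    entity:  the entity to match
    window:  how many words before and after the match to inspect
    """
    features = []
    for i, tok in enumerate(tokens):
        if tok != entity:
            continue
        for offset in range(-window, window + 1):
            j = i + offset
            # Skip the entity itself and positions outside the text.
            if offset == 0 or not 0 <= j < len(tokens):
                continue
            features.append(f'ex{offset:+d}="{tokens[j]}"')
    return features
```

For the word list `["昨日", "で", "広島", "は", "勝った"]` and the entity `広島`, this yields `ex-1="で"` and `ex+1="は"`. Part-of-speech, proper noun, and syntax features would require an external analyzer and are not covered by this sketch.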

同様に、素性抽出部１３は、記憶部１１ａに格納されたテキストデータの集合Dから、何れかの負例エンティティRN_e ^j-1（初期の負例エンティティRN_e ⁰は負例シードエンティティRN_e ⁰）を含む文字列である「負例テキスト」を抽出する。負例テキストの例は、テキストデータが含む文、フレーズ、単語列などである。負例テキストは、負例エンティティRN_e ^j-1とテキストデータとの組に対して一つ以上抽出される。素性抽出部１３は、抽出した負例テキストとの関係で定まる負例エンティティRN_e ^j-1の特徴を表す素性fN'_e ^jを抽出する。負例エンティティRN_e ^j-1の素性fN'_e ^jは、負例テキスト（負例エンティティRN_e ^j-1を含む文字列であってテキストデータが含むもの）に対応し、負例テキストと当該負例エンティティRN_e ^j-1との関係を表す情報を含む。この例では、負例エンティティRN_e ^j-1を含む負例テキストごとに当該負例エンティティRN_e ^j-1の素性fN'_e ^jが抽出される。負例エンティティRN_e ^j-1の素性fN'_e ^jの具体例は、上述した正例エンティティRP_e ^j-1の素性fP'_e ^jの場合と同様である。例えば、上述した正例エンティティRP_e ^j-1の素性fP'_e ^jの具体例の「正例」が「負例」に「RP_e ^j-1」が「RN_e ^j-1」に「fP'_e ^j-1」が「fN'_e ^j-1」にそれぞれ置換されたものである。
素性抽出部１３は、正例エンティティRP_e ^j-1の素性fP'_e ^jと正例を表すラベル<+1>との組(fP'_e ^j, <+1>)、及び、負例エンティティRN_e ^j-1の素性fN_e ^jと負例を表すラベル<-1>との組(fN'_e ^j, <-1>)を出力する。 Similarly, the feature extraction unit 13 extracts, from the set D of text data stored in the storage unit 11a, "negative example text", which is a character string containing any negative example entity RN_e ^j-1 (the initial negative example entity RN_e ⁰ is the negative example seed entity RN_e ⁰). Examples of negative example text are sentences, phrases, and word strings included in the text data. One or more negative example texts are extracted for each pair of a negative example entity RN_e ^j-1 and text data. The feature extraction unit 13 extracts a feature fN'_e ^j representing a characteristic of the negative example entity RN_e ^j-1 determined by its relationship with the extracted negative example text. The feature fN'_e ^j of the negative example entity RN_e ^j-1 corresponds to a negative example text (a character string that contains the negative example entity RN_e ^j-1 and is included in the text data) and includes information representing the relationship between the negative example text and that negative example entity RN_e ^j-1. In this example, the feature fN'_e ^j of the negative example entity RN_e ^j-1 is extracted for each negative example text containing that negative example entity RN_e ^j-1. Specific examples of the feature fN'_e ^j are the same as those of the feature fP'_e ^j of the positive example entity RP_e ^j-1 described above, with "positive example" replaced by "negative example", "RP_e ^j-1" by "RN_e ^j-1", and "fP'_e ^j" by "fN'_e ^j".
The feature extraction unit 13 outputs the pairs (fP'_e^j, <+1>), each consisting of a feature fP'_e^j of the positive example entity RP_e^{j-1} and the label <+1> representing a positive example, and the pairs (fN'_e^j, <-1>), each consisting of a feature fN'_e^j of the negative example entity RN_e^{j-1} and the label <-1> representing a negative example.

《トピック情報抽出：ステップＳ１４》
正例エンティティRP_e ^j-1、負例エンティティRN_e ^j-1、正例エンティティRP_e ^j-1の素性fP'_e ^jと正例を表すラベル<+1>との組(fP'_e ^j, <+1>)、及び、負例エンティティRN_e ^j-1の素性fN_e ^jと負例を表すラベル<-1>との組(fN'_e ^j, <-1>)がトピック情報抽出部１４に入力される。
トピック情報抽出部１４は、記憶部１１ｃに格納されたトピック情報付きテキストデータの集合D'から、正例エンティティRP_e ^j-1を含むテキストデータを含むトピック情報付きテキストデータが含むトピック情報（正例エンティティRP_e ^j-1を含むテキストデータに対応するトピック情報）を選択する。このように選択されたトピック情報を、正例エンティティRP_e ^j-1とテキストデータとの組に対応する「正例トピック情報」と呼ぶことにする。なお、トピック情報はテキストデータごとに与えられているため、同一のテキストデータが含む各正例テキストには、同じ正例トピック情報が対応する。トピック情報抽出部１４は、正例エンティティRP_e ^j-1とテキストデータとの組に対応する正例トピック情報を、当該テキストデータが含む各正例テキストに対応する各正例エンティティRP_e ^j-1の素性fP'_e ^jに加え、各正例テキストに対応する各正例エンティティRP_e ^j-1の素性をfP_e ^jに更新する。すなわち、正例テキストに対応する正例エンティティRP_e ^j-1の素性fP_e ^jは、当該正例テキストに対応する正例エンティティRP_e ^j-1の素性fP'_e ^jと正例トピック情報とを含む。このように正例トピック情報は素性fP_e ^jの一部とされる。 << Topic Information Extraction: Step S14 >>
The positive example entity RP_e^{j-1}, the negative example entity RN_e^{j-1}, the pairs (fP'_e^j, <+1>) of a feature fP'_e^j of the positive example entity RP_e^{j-1} and the label <+1> representing a positive example, and the pairs (fN'_e^j, <-1>) of a feature fN'_e^j of the negative example entity RN_e^{j-1} and the label <-1> representing a negative example are input to the topic information extraction unit 14.
The topic information extraction unit 14 selects, from the set D' of text data with topic information stored in the storage unit 11c, the topic information contained in the text data with topic information whose text data contains the positive example entity RP_e^{j-1} (that is, the topic information corresponding to the text data containing the positive example entity RP_e^{j-1}). The topic information selected in this way is called the "positive example topic information" corresponding to the pair of the positive example entity RP_e^{j-1} and the text data. Because topic information is given per text data, the same positive example topic information corresponds to every positive example text contained in the same text data. The topic information extraction unit 14 adds the positive example topic information corresponding to the pair of the positive example entity RP_e^{j-1} and the text data to the feature fP'_e^j of each positive example entity RP_e^{j-1} corresponding to each positive example text contained in that text data, and thereby updates the feature of each positive example entity RP_e^{j-1} corresponding to each positive example text to fP_e^j. That is, the feature fP_e^j of the positive example entity RP_e^{j-1} corresponding to a positive example text includes the feature fP'_e^j of that entity for the same positive example text and the positive example topic information. In this way, the positive example topic information becomes part of the feature fP_e^j.

同様に、トピック情報抽出部１４は、記憶部１１ｃに格納されたトピック情報付きテキストデータの集合D'から、負例エンティティRN_e ^j-1を含むテキストデータを含むトピック情報付きテキストデータが含むトピック情報（負例エンティティRN_e ^j-1を含むテキストデータに対応するトピック情報）を選択する。このように選択されたトピック情報を、負例エンティティRN_e ^j-1とテキストデータとの組に対応する「負例トピック情報」と呼ぶことにする。なお、トピック情報はテキストデータごとに与えられているため、同一のテキストデータが含む各負例テキストには、同じ負例トピック情報が対応する。トピック情報抽出部１４は、負例エンティティRN_e ^j-1とテキストデータとの組に対応する負例トピック情報を、当該テキストデータが含む各負例テキストに対応する各負例エンティティRN_e ^j-1の素性fN'_e ^jに加え、各負例テキストに対応する各負例エンティティRN_e ^j-1の素性をfN_e ^jに更新する。すなわち、負例テキストに対応する負例エンティティRN_e ^j-1の素性fN_e ^jは、負例テキストに対応する負例エンティティRN_e ^j-1の素性fN'_e ^jと負例トピック情報とを含む。このように負例トピック情報は素性fN_e ^jの一部とされる。 Similarly, the topic information extraction unit 14 includes topics included in the text data with topic information including text data including the negative entity RN _e ^j-1 from the set D ′ of text data with topic information stored in the storage unit 11c. Select information (topic information corresponding to text data including negative example entity RN _e ^j-1 ). The topic information selected in this way will be referred to as “negative example topic information” corresponding to the set of negative example entity RN _e ^j−1 and text data. Since topic information is provided for each text data, the same negative example topic information corresponds to each negative example text included in the same text data. The topic information extraction unit 14 sets negative example topic information corresponding to a set of negative example entity RN _e ^j-1 and text data, and each negative example entity RN _e ^j− corresponding to each negative example text included in the text data. ^In addition to the feature fN ′ _e ^j of ^1, the feature of each negative example entity RN _e ^j−1 corresponding to each negative example text is updated to fN _e ^j . 
That is, the feature fN_e^j of the negative example entity RN_e^{j-1} corresponding to a negative example text includes the feature fN'_e^j of that entity for the same negative example text and the negative example topic information. In this way, the negative example topic information becomes part of the feature fN_e^j.
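The update from fP'_e^j to fP_e^j (and from fN'_e^j to fN_e^j) described above can be sketched as follows. This is only a minimal illustration under assumed data structures: features are represented as dictionaries, topic information as a (topic, score) pair per text data, and the names `add_topic_info`, `examples`, and `doc_topics` as well as all values are hypothetical, not taken from the specification.

```python
from typing import Dict, List, Tuple

# Assumed layout: topic information is one (topic id, score) pair per text data.
TopicInfo = Tuple[str, float]
# One training example: (text data id, feature dictionary, label +1 / -1).
Example = Tuple[str, Dict[str, object], int]


def add_topic_info(examples: List[Example],
                   doc_topics: Dict[str, TopicInfo]) -> List[Example]:
    """Return copies of the examples whose features also carry the topic
    information of the text data they were extracted from."""
    updated = []
    for doc_id, feats, label in examples:
        merged = dict(feats)                  # leave the primed feature untouched
        merged["topic"] = doc_topics[doc_id]  # same topic for every text in doc_id
        updated.append((doc_id, merged, label))
    return updated


# Illustrative values only.
examples = [
    ("T1", {"ex-1": "VS"}, +1),
    ("T1", {"ex-1": "BOS"}, +1),
    ("T2", {"ex-1": "the"}, -1),
]
doc_topics = {"T1": ("z2", 0.8), "T2": ("z5", 0.6)}
with_topics = add_topic_info(examples, doc_topics)
```

Because the topic information is looked up by text data rather than by individual text, both examples drawn from `T1` receive the same topic entry, which mirrors the statement that every positive or negative example text in the same text data shares one topic information.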

なお、トピック情報付きテキストデータの集合D'が含むすべての正例テキストや負例テキストに対応する素性fP_e ^j,fN_e ^jが生成されてもよいし、一部の正例テキストや負例テキストに対応する素性fP_e ^j,fN_e ^jのみが生成されてもよい。以下に、一部の正例テキストや負例テキストに対応する素性fP_e ^j,fN_e ^jのみが生成される例を示す。
[一部の正例テキストや負例テキストに対応する素性のみが生成される例]
多義的な正例エンティティRP_e ^j-1が素性抽出部１３に入力される場合がある。例えば<阪神>のような正例エンティティRP_e ^j-1は、球団名として用いられる場合もあれば、企業名として用いられる場合もある。この例では、トピック情報付きテキストデータの集合D'において正例エンティティRP_e ^j-1がどのような意味で使用されることが多いのかを推定し、当該推定された意味で正例エンティティRP_e ^j-1が使用されていると推定される文字列（正例テキスト及び負例テキスト）のみを対象として正例トピック情報及び負例トピック情報を選択し、正例エンティティRP_e ^j-1の素性fP_e ^jや負例エンティティRN_e ^j-1の素性fN_e ^jを生成する。これにより、後述する識別学習部１５での学習精度やエンティティ識別部１６での識別精度の向上が見込まれる。 Note that features fP _e ^j and fN _e ^j corresponding to all positive example texts and negative example texts included in the set D ′ of text data with topic information may be generated, or some positive example texts and negative examples Only the features fP _e ^j and fN _e ^j corresponding to the text may be generated. Hereinafter, an example in which only the features fP _e ^j and fN _e ^j corresponding to some positive example texts and negative example texts are generated will be described.
[Example of generating only features corresponding to some positive texts and negative texts]
An ambiguous positive example entity RP_e^{j-1} may be input to the feature extraction unit 13. For example, a positive example entity RP_e^{j-1} such as <Hanshin> may be used as the name of a baseball team or as a company name. In this example, the sense in which the positive example entity RP_e^{j-1} is most often used in the set D' of text data with topic information is estimated; positive example topic information and negative example topic information are selected only for the character strings (positive example texts and negative example texts) in which the entity is estimated to be used in that sense; and the features fP_e^j of the positive example entity RP_e^{j-1} and fN_e^j of the negative example entity RN_e^{j-1} are generated. This is expected to improve the learning accuracy of the identification learning unit 15 and the identification accuracy of the entity identification unit 16, both described later.

まず、トピック情報付きテキストデータの集合D'において正例エンティティRP_e ^j-1がどのような意味で使用されているかを推定するために、素性抽出部１３は、トピック情報付きテキストデータの集合D'が含む各テキストデータが含む文字列（正例テキスト及び負例テキスト）に、当該文字列のトピックの候補と、当該トピックの候補それぞれの当該文字列に対する適切さを表すトピック候補スコアとを与える。トピック候補スコアは、例えば、前述のトピックモデルTM⁰を用いて計算されるか、前述のステップＳ１２の過程で得られた情報から計算され、記憶部１１ｃに格納される。以下に、各トピックの候補に対応するz_n(n=1,...,Z)とテキストデータが含む文字列vとに対応するトピック候補スコアs(z_n,v)を例示する。
s(z_n,v)=p(z_n|v)=p(v|z_n)p(z_n)/p(v) …(9)
なお、p(v|z_n), p(z_n)は、z=z_nでのトピックモデルTM⁰のパラメータとして得られ、p(v)は、z=z_nでの式(5)の同時確率p(z_n,v)とパラメータp(z_n)=Σ_z p(v|z)p(z_n)とから得られる。 First, in order to estimate in what sense the positive example entity RP_e^{j-1} is used in the set D' of text data with topic information, the feature extraction unit 13 assigns, to each character string (positive example text or negative example text) contained in each text data of the set D', topic candidates for that character string and a topic candidate score representing how appropriate each topic candidate is for that character string. The topic candidate score is, for example, calculated using the topic model TM⁰ described above, or calculated from information obtained in the course of step S12 described above, and is stored in the storage unit 11c. The following exemplifies the topic candidate score s(z_n, v) corresponding to z_n (n = 1, ..., Z), which corresponds to each topic candidate, and a character string v contained in the text data.
s(z_n, v) = p(z_n | v) = p(v | z_n) p(z_n) / p(v)   … (9)
Here, p(v | z_n) and p(z_n) are obtained as parameters of the topic model TM⁰ at z = z_n, and p(v) is obtained from the joint probability p(z_n, v) of equation (5) and these parameters by summing over the topics, p(v) = Σ_z p(v | z) p(z).
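As an illustration of equation (9), the following minimal sketch computes the topic candidate scores for one character string v by Bayes' rule. It assumes that the topic model parameters p(v | z) and p(z) are available as plain dictionaries and that p(v) is the sum of p(v | z) p(z) over all topics; the function name and the numbers are hypothetical.

```python
from typing import Dict


def topic_candidate_scores(p_v_given_z: Dict[str, float],
                           p_z: Dict[str, float]) -> Dict[str, float]:
    """Return s(z, v) = p(z | v) = p(v | z) p(z) / p(v) for every topic z."""
    # Joint probability p(z, v) for each topic candidate.
    joint = {z: p_v_given_z[z] * p_z[z] for z in p_z}
    # Marginal p(v), assumed here to be the sum of the joint over topics.
    p_v = sum(joint.values())
    if p_v == 0.0:
        return {z: 0.0 for z in joint}
    return {z: joint[z] / p_v for z in joint}


# Illustrative parameters for a single string v and two topic candidates.
scores = topic_candidate_scores(
    p_v_given_z={"z1": 0.2, "z2": 0.6},
    p_z={"z1": 0.5, "z2": 0.5},
)
```

With these example values the joint probabilities are 0.1 and 0.3, so the scores are roughly 0.25 for `z1` and 0.75 for `z2`, and they sum to 1 as a posterior distribution should.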

また、以下のトピック候補スコアs(z_n,v)を用いてもよい。

その他、式(9)(10)の写像をトピック候補スコアs(z_n,v)としてもよい。 Further, the following topic candidate scores s (z _n , v) may be used.

Alternatively, a value obtained by applying a mapping to equation (9) or (10) may be used as the topic candidate score s(z_n, v).

次に素性抽出部１３は、同一のトピックの候補に対応するトピック候補スコアを集計し、当該トピックの候補ごとの集計結果を当該トピックの候補それぞれの正例トピックスコアとする。例えば、素性抽出部１３は、式(11)に従ってトピックの候補のそれぞれに対応する各正例トピックスコアS(z_n) (n=1,...,Z)を計算し、記憶部１１ｃに格納する。なお、V_pは正例テキストの集合を表す。

その他、トピック候補スコアs(z_n,v)(v∈V_p)の単調増加関数値を各正例トピックスコアS(z_n)(n=1,...,Z)とするなど、その他の集計方法で正例トピックスコアS(z_n)が計算されてもよい。 Next, the feature extraction unit 13 totals the topic candidate scores corresponding to the same topic candidates, and sets the total result for each topic candidate as a positive example topic score for each of the topic candidates. For example, the feature extraction unit 13 calculates each positive example topic score S (z _n ) (n = 1,..., Z) corresponding to each of the topic candidates according to the equation (11), and stores it in the storage unit 11c. Store. V _p represents a set of positive example texts.

Alternatively, the positive example topic score S(z_n) may be calculated by another aggregation method, for example by taking a monotonically increasing function of the topic candidate scores s(z_n, v) (v ∈ V_p) as each positive example topic score S(z_n) (n = 1, ..., Z).

次に素性抽出部１３は、各正例トピックスコアS(z_n)(n=1,...,Z)が特定の基準を満たすトピックの候補を選択し、それを正例基準トピックS_eとして記憶部１１ｃに格納する。選択される正例基準トピックS_eの個数は１個であってもよいし２個以上であってもよい。例えば、最も値の大きな正例トピックスコアS(z_n)に対応するトピックの候補が正例基準トピックS_eとされてもよいし、値の大きな順に選択された所定個の正例トピックスコアS(z_n)にそれぞれ対応するトピックの候補が正例基準トピックS_eとされてもよいし、基準値以上の正例トピックスコアS(z_n)に対応するトピックの候補が正例基準トピックS_eとされてもよい。 Next, the feature extraction unit 13 selects the topic candidates whose positive example topic scores S(z_n) (n = 1, ..., Z) satisfy a specific criterion, and stores them in the storage unit 11c as positive example reference topics S_e. One positive example reference topic S_e may be selected, or two or more. For example, the topic candidate with the largest positive example topic score S(z_n) may be taken as the positive example reference topic S_e; the topic candidates corresponding to a predetermined number of positive example topic scores S(z_n) chosen in descending order of value may be taken as the positive example reference topics S_e; or the topic candidates whose positive example topic scores S(z_n) are greater than or equal to a reference value may be taken as the positive example reference topics S_e.
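The aggregation of topic candidate scores and the selection of positive example reference topics can be sketched as follows. Equation (11) is not reproduced in this text, so the sketch assumes a plain sum over the positive example texts as the aggregation, which is only one possible choice; the function names, the `top_k` and `threshold` parameters, and the sample scores are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Optional, Set


def positive_topic_scores(scores_by_text: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Aggregate s(z, v) over the positive example texts v in V_p.
    Assumption: the aggregate S(z) is the sum of the per-text scores."""
    totals = defaultdict(float)
    for scores in scores_by_text.values():
        for topic, score in scores.items():
            totals[topic] += score
    return dict(totals)


def select_reference_topics(totals: Dict[str, float],
                            top_k: Optional[int] = None,
                            threshold: Optional[float] = None) -> Set[str]:
    """Pick reference topics by threshold, or else the top_k largest (default 1)."""
    if threshold is not None:
        return {z for z, s in totals.items() if s >= threshold}
    k = 1 if top_k is None else top_k
    ranked = sorted(totals, key=totals.get, reverse=True)
    return set(ranked[:k])


# Illustrative scores for three positive example texts of an entity like <Hanshin>.
scores_by_text = {
    "v1": {"baseball": 0.9, "company": 0.1},
    "v2": {"baseball": 0.6, "company": 0.4},
    "v3": {"baseball": 0.2, "company": 0.8},
}
totals = positive_topic_scores(scores_by_text)
best = select_reference_topics(totals)
above = select_reference_topics(totals, threshold=1.0)
```

The three branches of `select_reference_topics` correspond to the three selection rules in the text: the single largest score, a predetermined number of topics in descending order, and all topics at or above a reference value.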

素性抽出部１３は、例えば、正例基準トピックS_eの何れかと同一のトピックの候補に対応する正例テキスト（テキストデータが含む文字列）であり、なおかつ、当該同一のトピックの候補に対応するトピック候補スコアが特定の基準を満たす正例テキストに対応する正例エンティティRP_e ^j-1の素性fP_e ^jを生成するが、それ以外の正例テキストに対応する正例エンティティの素性を生成しない。また、素性抽出部１３は、例えば、正例基準トピックS_eの何れかと同一のトピックの候補に対応する負例テキスト（テキストデータが含む文字列）であり、なおかつ、当該同一のトピックの候補に対応するトピック候補スコアが特定の基準を満たす負例テキストに対応する負例エンティティRN_e ^j-1の素性fN_e ^jを生成するが、それ以外の負例テキストに対応する負例エンティティの素性を生成しない。 Feature extraction unit 13, for example, a positive sample text corresponding to the candidate of the same topic and one of positive cases reference topic S _e (string containing the text data), yet, corresponding to the candidate of the same topic Generates the feature fP _e ^j of the positive example entity RP _e ^j-1 corresponding to the positive example text whose topic candidate score satisfies a certain criterion, but does not generate the features of the positive example entity corresponding to other positive example texts . Further, feature extraction unit 13 is, for example, a negative example text corresponding to the candidate of the same topic and one of positive cases reference topic S _e (string containing the text data), yet, the candidate of the same topic Generates the feature fN _e ^j of the negative example entity RN _e ^j-1 corresponding to the negative example text whose corresponding topic candidate score satisfies a certain criterion, but the negative example entity corresponding to other negative example texts Do not generate.

以下に具体的な素性生成例を示す。
素性生成例１：素性抽出部１３は、正例基準トピックS_eの何れかと同一のトピックの候補に対応する正例テキストであり、なおかつ、当該同一のトピックの候補に対応するトピック候補スコアが当該正例テキストに対応するトピック候補スコアの中で最大となる正例テキストに対応する正例エンティティRP_e ^j-1の素性fP_e ^jを生成するが、それ以外の正例テキストに対応する正例エンティティの素性を生成しない。また、例えば、素性抽出部１３は、正例基準トピックS_eの何れかと同一のトピックの候補に対応する負例テキストであり、なおかつ、当該同一のトピックの候補に対応するトピック候補スコアが当該負例テキストに対応するトピック候補スコアの中で最大となる負例テキストに対応する負例エンティティRN_e ^j-1の素性fN_e ^jを生成するが、それ以外の負例テキストに対応する負例エンティティの素性を生成しない。 A specific feature generation example is shown below.
Feature generation example 1: The feature extraction unit 13 generates the feature fP_e^j of the positive example entity RP_e^{j-1} for each positive example text that corresponds to the same topic candidate as one of the positive example reference topics S_e and for which the topic candidate score of that topic candidate is the largest among the topic candidate scores of that positive example text; it does not generate positive example entity features for any other positive example text. Likewise, for example, the feature extraction unit 13 generates the feature fN_e^j of the negative example entity RN_e^{j-1} for each negative example text that corresponds to the same topic candidate as one of the positive example reference topics S_e and for which the topic candidate score of that topic candidate is the largest among the topic candidate scores of that negative example text; it does not generate negative example entity features for any other negative example text.

素性生成例２：素性抽出部１３は、正例基準トピックS_eの何れかと同一のトピックの候補に対応する正例テキストであり、なおかつ、当該同一のトピックの候補に対応するトピック候補スコアが基準値以上となる正例テキストに対応する正例エンティティRP_e ^j-1の素性fP_e ^jを生成するが、それ以外の正例テキストに対応する正例エンティティの素性を生成しない。また、例えば、素性抽出部１３は、正例基準トピックS_eの何れかと同一のトピックの候補に対応する負例テキストであり、なおかつ、当該同一のトピックの候補に対応するトピック候補スコアが基準値以上となる負例テキストに対応する負例エンティティRN_e ^j-1の素性fN_e ^jを生成するが、それ以外の負例テキストに対応する負例エンティティの素性を生成しない。 Feature Generation Example 2: feature extraction unit 13 is a positive example text corresponding to the same topic candidates either positive cases reference topic S _e, yet, topic candidate score corresponding to the candidate of the same topic reference The feature fP _e ^j of the positive example entity RP _e ^j-1 corresponding to the positive example text that is greater than or equal to the value is generated, but the feature of the positive example entity corresponding to the other positive example text is not generated. Further, for example, feature extraction unit 13, positive example reference topic is negative example text corresponding to the candidate of the same topic and one of S _e, yet, topic candidate score reference value corresponding to the candidate of the same topic The feature fN _e ^j of the negative example entity RN _e ^j-1 corresponding to the negative example text as described above is generated, but the feature of the negative example entity corresponding to other negative example texts is not generated.

素性生成例３：素性抽出部１３は、正例基準トピックS_eの何れかと同一のトピックの候補に対応する正例テキストであり、なおかつ、当該同一のトピックの候補に対応するトピック候補スコアが当該正例テキストに対応するトピック候補スコアの中で最大となる正例テキストに対応する正例エンティティRP_e ^j-1の素性fP_e ^jを生成するが、それ以外の正例テキストに対応する正例エンティティの素性を生成しない。一方、負例エンティティRN_e ^j-1の素性fN_e ^jについては、すべての負例テキストに対応する負例エンティティRN_e ^j-1の素性fN_e ^jが生成される。 Feature Generation Example 3: feature extraction unit 13 is a positive example text corresponding to the same topic candidates either positive cases reference topic S _e, yet, topic candidate score corresponding to the candidate of the same topic the Generates the feature fP _e ^j of the positive example entity RP _e ^j-1 corresponding to the maximum positive example text among the topic candidate scores corresponding to the positive example text, but positive examples corresponding to other positive example texts Do not generate entity features. On the other hand, the negative examples entity RN _e ^j-1 of a feature fN _e ^j, all negative examples entity RN corresponding to negative sample text _e ^j-1 of a feature fN _e ^j is generated.

素性生成例４：素性抽出部１３は、正例基準トピックS_eの何れかと同一のトピックの候補に対応する正例テキストであり、なおかつ、当該同一のトピックの候補に対応するトピック候補スコアが基準値以上となる正例テキストに対応する正例エンティティRP_e ^j-1の素性fP_e ^jを生成するが、それ以外の正例テキストに対応する正例エンティティの素性を生成しない。一方、負例エンティティRN_e ^j-1の素性fN_e ^jについては、すべての負例テキストに対応する負例エンティティRN_e ^j-1の素性fN_e ^jが生成される（[一部の正例テキストや負例テキストに対応する素性のみが生成される例]の説明終わり）。 Feature Generation Example 4: feature extraction unit 13 is a positive example text corresponding to the same topic candidates either positive cases reference topic S _e, yet, topic candidate score corresponding to the candidate of the same topic reference The feature fP _e ^j of the positive example entity RP _e ^j-1 corresponding to the positive example text that is greater than or equal to the value is generated, but the feature of the positive example entity corresponding to the other positive example text is not generated. On the other hand, the negative examples entity RN _e ^j-1 of a feature fN _e ^j, negative examples entity RN _e ^j-1 of a feature fN _e ^j are generated for all of the negative sample text ([part of the positive examples End of description of example] [Only features corresponding to text and negative example text are generated].
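The four feature generation examples differ only in the criterion applied to a text (largest score versus reference value) and in whether negative example texts are filtered as well. A minimal sketch under that reading is given below; the function names, the `mode` and `filter_negatives` parameters, the default reference value of 0.5, and the sample scores are hypothetical choices made for illustration.

```python
from typing import Dict, List, Set, Tuple

Scores = Dict[str, float]  # topic candidate -> topic candidate score for one text


def keep_text(scores: Scores, reference_topics: Set[str],
              mode: str = "argmax", threshold: float = 0.5) -> bool:
    """Decide whether features are generated for a text with these scores."""
    if not scores:
        return False
    if mode == "argmax":      # examples 1 and 3: best-scoring topic is a reference topic
        return max(scores, key=scores.get) in reference_topics
    if mode == "threshold":   # examples 2 and 4: a reference topic scores >= threshold
        return any(scores.get(z, 0.0) >= threshold for z in reference_topics)
    raise ValueError(f"unknown mode: {mode}")


def filter_texts(pos_texts: Dict[str, Scores], neg_texts: Dict[str, Scores],
                 reference_topics: Set[str], mode: str = "argmax",
                 threshold: float = 0.5,
                 filter_negatives: bool = True) -> Tuple[List[str], List[str]]:
    """Return the ids of positive and negative texts that get features.
    filter_negatives=False keeps every negative text (examples 3 and 4)."""
    pos = [t for t, s in pos_texts.items()
           if keep_text(s, reference_topics, mode, threshold)]
    if filter_negatives:
        neg = [t for t, s in neg_texts.items()
               if keep_text(s, reference_topics, mode, threshold)]
    else:
        neg = list(neg_texts)
    return pos, neg


# Illustrative scores; "baseball" plays the role of the reference topic S_e.
pos_texts = {"p1": {"baseball": 0.9, "company": 0.1},
             "p2": {"baseball": 0.4, "company": 0.6}}
neg_texts = {"n1": {"baseball": 0.7, "company": 0.3},
             "n2": {"baseball": 0.2, "company": 0.8}}
ref = {"baseball"}
example1 = filter_texts(pos_texts, neg_texts, ref, mode="argmax")
example3 = filter_texts(pos_texts, neg_texts, ref, mode="argmax", filter_negatives=False)
```

In this sketch `example1` keeps only the texts whose dominant topic is the reference topic, while `example3` applies the same filter to positive texts but retains all negative texts.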

トピック情報抽出部１４は、正例エンティティRP_e ^j-1の素性fP_e ^jと正例を表すラベル<+1>との組(fP_e ^j, <+1>)、及び、負例エンティティRN_e ^j-1の素性fN_e ^jと負例を表すラベル<-1>との組(fN_e ^j, <-1>)を出力する。 The topic information extraction unit 14 outputs the pairs (fP_e^j, <+1>), each consisting of a feature fP_e^j of the positive example entity RP_e^{j-1} and the label <+1> representing a positive example, and the pairs (fN_e^j, <-1>), each consisting of a feature fN_e^j of the negative example entity RN_e^{j-1} and the label <-1> representing a negative example.

図５Ｂは、トピック情報抽出部１４が出力する組(fP_e ^j, <+1>)及び組(fN_e ^j, <-1>)を例示した図である。なお、「POS」は品詞素性を表し、「BOS」は対応する位置に単語が存在しないことを表す。例えば、テキストデータ<T1>が含む正例テキストに対応する正例エンティティex=<広島>の素性はfP_e ^j=（ex-2="ヤクルト", ex-2=POS：名詞, ex-1="VS", ex-1=POS：名詞, ex+1="の", ex+1=POS：助詞, ex+2="ヤクルト", ex+2=POS：助詞, トピック情報=(z₂,08))である。 FIG. 5B illustrates the pairs (fP_e^j, <+1>) and (fN_e^j, <-1>) output by the topic information extraction unit 14. "POS" denotes a part-of-speech feature, and "BOS" denotes that no word exists at the corresponding position. For example, the feature of the positive example entity ex = <Hiroshima> corresponding to a positive example text contained in the text data <T1> is fP_e^j = (ex-2 = "Yakult", ex-2 = POS: noun, ex-1 = "VS", ex-1 = POS: noun, ex+1 = "の", ex+1 = POS: particle, ex+2 = "Yakult", ex+2 = POS: particle, topic information = (z₂, 08)).
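The context-window layout of the feature in this example (surface form and part of speech of the two words before and after the entity, with "BOS" where no word exists) can be sketched as below. This is a hedged illustration only: it assumes the text is already tokenized and POS-tagged, uses "BOS" for the part-of-speech slot of a missing word as well, and the function name, the `width` parameter, and the sample tokens are hypothetical.

```python
from typing import Dict, List


def window_features(tokens: List[str], pos_tags: List[str],
                    index: int, width: int = 2) -> Dict[str, str]:
    """Collect surface and part-of-speech features around tokens[index]."""
    feats: Dict[str, str] = {}
    for offset in range(-width, width + 1):
        if offset == 0:
            continue  # the entity itself is not part of its own context
        i = index + offset
        key = f"ex{offset:+d}"  # "ex-2", "ex-1", "ex+1", "ex+2"
        if 0 <= i < len(tokens):
            feats[key] = tokens[i]
            feats[f"{key}=POS"] = pos_tags[i]
        else:
            # No word at this position (assumed to apply to both ends of the text).
            feats[key] = "BOS"
            feats[f"{key}=POS"] = "BOS"
    return feats


# Illustrative, romanized tokens; the entity is at index 2.
tokens = ["Yakult", "VS", "Hiroshima", "no"]
pos_tags = ["noun", "noun", "noun", "particle"]
feats = window_features(tokens, pos_tags, index=2)
```

Topic information would then be added to this dictionary as one more entry, so that the surface, part-of-speech, and topic features are all handled uniformly by the learner.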

《識別学習：ステップＳ１５》
正例エンティティRP_e ^j-1の素性fP_e ^jと正例を表すラベル<+1>との組(fP_e ^j, <+1>)、及び、負例エンティティRN_e ^j-1の素性fN_e ^jと負例を表すラベル<-1>との組(fN_e ^j, <-1>)は識別学習部１５に入力される。正例エンティティRP_e ^j-1の素性fP_e ^jは正例に対する教師あり学習データとして利用でき、負例エンティティRN_e ^j-1の素性fN_e ^jは負例に対する教師あり学習データとして利用できる。識別学習部１５は、正例エンティティRP_e ^j-1の素性fP_e ^jと負例エンティティRN_e ^j-1の素性fN_e ^jとを教師あり学習データとした学習処理によって、識別モデルME_e ^jを生成する。この識別モデルME_e ^jは、任意のエンティティの素性を入力として当該エンティティが正例エンティティか負例エンティティかを識別するための情報を出力する関数である。このような識別モデルME_e ^jであればどのようなモデルであってもよい。
識別モデルME_e ^jの例は、正則化項付き最大エントロピーモデル（参考文献１「Berger, A.L. , Pietra, V.J.D. and Pietra, "A maximum entropy approach to natural language processing", S.A.D. 1996.」）、正則化項付きの条件付きランダム場(CRFs、参考文献２「Lafferty, J. and McCallum, A. and Pereira, F. "Conditional random fields: Probabilistic models for segmenting and labeling sequence data", MACHINE LEARNING, pp. 282-289, 2001.」、サポートベクタマシン(SVMs、参考文献３「Vapnik, V. N. "The nature of statistical learning theory", Springer Verlag, 1995.」)などである。各例の識別モデルME_e ^jの学習では、教師あり学習データとして用いられた正例エンティティRP_e ^j-1の素性fP_e ^j及び負例エンティティRN_e ^j-1の素性fN_e ^jに対し、当該識別モデルME_e ^jへの影響度の大きさを表す指標（素性に対する重み）が付され、これらが識別モデルME_e ^jを特定するパラメータとなる。特に参考文献１−３で例示したようなモデルは、すべての素性に対して重みが付されるモデル（例えば正則化項のない最大エントロピーモデル)ではなく、識別に有効と判断された素性のみについて重みが付される。以下、正則化項付き最大エントロピーモデルの具体例を示す。 << Identification Learning: Step S15 >>
The pairs (fP_e^j, <+1>) of a feature fP_e^j of the positive example entity RP_e^{j-1} and the label <+1> representing a positive example, and the pairs (fN_e^j, <-1>) of a feature fN_e^j of the negative example entity RN_e^{j-1} and the label <-1> representing a negative example, are input to the identification learning unit 15. The features fP_e^j of the positive example entity RP_e^{j-1} can be used as supervised learning data for positive examples, and the features fN_e^j of the negative example entity RN_e^{j-1} can be used as supervised learning data for negative examples. The identification learning unit 15 generates an identification model ME_e^j by a learning process that uses the features fP_e^j and fN_e^j as supervised learning data. The identification model ME_e^j is a function that takes the feature of an arbitrary entity as input and outputs information for identifying whether that entity is a positive example entity or a negative example entity. Any model of this kind may be used as the identification model ME_e^j.
Examples of the identification model ME_e^j include a maximum entropy model with a regularization term (Reference 1: Berger, A. L., Pietra, V. J. D. and Pietra, S. A. D., "A maximum entropy approach to natural language processing", 1996), conditional random fields with a regularization term (CRFs; Reference 2: Lafferty, J., McCallum, A. and Pereira, F., "Conditional random fields: Probabilistic models for segmenting and labeling sequence data", MACHINE LEARNING, pp. 282-289, 2001), and support vector machines (SVMs; Reference 3: Vapnik, V. N., "The nature of statistical learning theory", Springer Verlag, 1995). In training each of these identification models ME_e^j, an index representing the magnitude of the influence on the identification model (a weight for the feature) is assigned to the features fP_e^j of the positive example entity RP_e^{j-1} and fN_e^j of the negative example entity RN_e^{j-1} used as supervised learning data, and these weights become the parameters that specify the identification model ME_e^j. In particular, the models exemplified in References 1 to 3 are not models in which every feature receives a weight (for example, a maximum entropy model without a regularization term); weights are assigned only to the features judged effective for identification. A specific example using the maximum entropy model with a regularization term follows.

正則化項付き最大エントロピーモデルが用いられる場合、識別学習部１５は、(x,y)∈{(fP_e ^j, <+1>), (fN_e ^j, <-1>)}を学習データとして用い、条件付確率

に対するエントロピー

を最大化する各重み（パラメータ）λ_qに対応するP_λ(y|x)であるP(y|x)を識別モデルME_e ^jとする。ただし、

であり、qは各学習データ(x,y)の組にそれぞれ対応するラベルであり、p'(x)は学習データ(x,y)におけるxの出現頻度であり、f_q(x,y)はqに対応する素性関数(feature function)である。 When the maximum entropy model with a regularization term is used, the discriminative learning unit 15 uses (x, y) ∈ {(fP _e ^j , <+1>), (fN _e ^j , <-1>)} as learning data. As a conditional probability

Entropy for

The P(y | x) that equals P_λ(y | x) for the weights (parameters) λ_q maximizing this entropy is taken as the identification model ME_e^j, where

and q is a label corresponding to each pair of learning data (x, y), p'(x) is the frequency of occurrence of x in the learning data (x, y), and f_q(x, y) is the feature function corresponding to q.
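The displayed equations of this passage are not reproduced in the text. For orientation only, the standard conditional maximum entropy formulation of Reference 1 is shown below in the symbols defined above; it is an assumption that the omitted equations take this form, and the regularization term in particular is not shown because its exact form cannot be recovered from the surrounding text.

```latex
% Conditional entropy to be maximized (standard form, assumed here)
H(P) = -\sum_{x,y} p'(x)\, P(y \mid x)\, \log P(y \mid x)

% Parametric form of the model and its normalizer (standard form, assumed here)
P_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\!\Bigl( \sum_q \lambda_q f_q(x, y) \Bigr),
\qquad
Z_\lambda(x) = \sum_{y'} \exp\!\Bigl( \sum_q \lambda_q f_q(x, y') \Bigr)
```

Under this reading, y ranges over the two labels <+1> and <-1>, and a weight λ_q of 0 removes the corresponding feature function from the exponent, so that feature has no effect on the model's output.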

ここで、各重みλ_qはqに対応する学習データ(x,y)の素性fP_e ^j又はfN_e ^jの識別モデルME_e ^jへの影響度の大きさを表す指標となる。また、正則化項付き最大エントロピーモデルの例では、すべての学習データ(x,y)の素性fP_e ^j又はfN_e ^jに対して重みλ_qが付されるわけではなく、重要度の低い素性に対応する重みλ_qは付されない。すなわち、重要度の低い素性に対応する重みλ_qは0とされる。 Here, each weight λ_q is an index of how strongly the feature fP_e^j or fN_e^j of the learning data (x, y) corresponding to q influences the identification model ME_e^j. In the example of the maximum entropy model with a regularization term, a weight λ_q is not assigned to the features fP_e^j or fN_e^j of all the learning data (x, y); no weight λ_q is assigned to features of low importance. That is, the weight λ_q corresponding to a feature of low importance is set to 0.

また、ステップＳ１４で一部の正例テキストや負例テキストに対応する素性fP_e ^j,fN_e ^jのみが生成されていた場合には、一部の正例テキストや負例テキストに対応する素性fP_e ^j,fN_e ^jに対応する(x,y)∈{(fP_e ^j, <+1>), (fN_e ^j, <-1>)}のみが学習データとされる。例えば、前述した「一部の正例テキストや負例テキストに対応する素性のみが生成される例」のように素性fP_e ^j,fN_e ^jが生成された場合には、前述した正例基準トピックの何れかと同一のトピック候補に対応する正例テキスト（テキストデータが含む文字列）であり、なおかつ、当該同一のトピック候補のトピック候補スコアが特定の基準を満たす正例テキストに対応する正例エンティティ及び／又は負例エンティティの素性のみが教師あり学習データとされる。
学習処理によって生成された識別モデルME_e ^jは記憶部１１ｄに格納される。例えば、学習処理によって生成された識別モデルME_e ^jのパラメータが記憶部１１ｄに格納される。 If only the features fP _e ^j and fN _e ^j corresponding to some positive example texts and negative example texts are generated in step S14, the features corresponding to some positive example texts and negative example texts are generated. Only (x, y) ε {(fP _e ^j , <+1>), (fN _e ^j , <-1>)} corresponding to fP _e ^j and fN _e ^j is used as learning data. For example, when the features fP _e ^j and fN _e ^j are generated as in the above-mentioned “example in which only features corresponding to some positive example texts and negative example texts are generated”, the above-described positive example criteria Positive example text corresponding to the same topic candidate as any of the topics (a character string included in the text data) and corresponding to the positive example text in which the topic candidate score of the same topic candidate satisfies a specific criterion Only the features of entities and / or negative example entities are taken as supervised learning data.
The identification model ME_e^j generated by the learning process is stored in the storage unit 11d. For example, the parameters of the identification model ME_e^j generated by the learning process are stored in the storage unit 11d.

《エンティティ識別：ステップＳ１６》
エンティティ識別部１６は、記憶部１１ｃに格納されたトピック情報付きテキストデータの集合D'から何れかのトピック情報付きテキストデータを選択し、選択したトピック情報付きテキストデータが含むテキストデータが含む文字列であるエンティティを対象エンティティRD_e ^jとする。 << Entity Identification: Step S16 >>
The entity identification unit 16 selects text data with topic information from the set D' of text data with topic information stored in the storage unit 11c, and takes as a target entity RD_e^j an entity that is a character string contained in the text data of the selected text data with topic information.

なお、トピック情報付きテキストデータの集合D'からすべてのトピック情報付きテキストデータが選択されてもよいが、すべてのテキストデータを識別対象とすることは計算効率上好ましくない。そのため、特定の方法で識別対象を限定して選択を行うことが望ましい。以下にその具体例を示す。 Note that all the text data with topic information may be selected from the set D ′ of text data with topic information, but it is not preferable in terms of calculation efficiency to make all the text data to be identified. For this reason, it is desirable to select an identification target by a specific method. Specific examples are shown below.

[選択方法の例]
選択方法の例１：
選択方法の例１では、エンティティ識別部１６は、識別学習部１５で教師あり学習データとして用いられた正例エンティティRP_e ^j-1の素性fP_e ^j及び負例エンティティRN_e ^j-1の素性fN_e ^jのうち、それらから生成された識別モデルME_e ^jへの影響度の大きさを表す指標（例えば前述の重みλ_q）が特定の基準を満たす素性、つまり、当該識別モデルME_e ^jへの影響度が大きな素性fP_e ^j及び／又はfN_e ^jを選択する。例えば、エンティティ識別部１６は、前述の重みλ_qの絶対値が閾値よりも大きな素性fP_e ^j及び／又はfN_e ^jを選択する。
また、エンティティ識別部１６は、選択した素性fP_e ^j及び／又はfN_e ^jに対応する文字列を含むテキストデータを含むトピック情報付きテキストデータを選択し、当該選択したトピック情報付きテキストデータが含むテキストデータが含む文字列であるエンティティを対象エンティティRD_e ^jとする。例えば、エンティティ識別部１６は、選択した素性fP_e ^j及び／又はfN_e ^jから表層素性の単語を抽出し、当該表層素性の単語を含むテキストデータを含むトピック情報付きテキストデータを選択する。一例を挙げると、選択された素性がエンティティexの前２単語が表層素性と品詞素性の組み合わせで成り立つ素性FNC(x−2=“POS:名詞”, x−1=“で”)（FNCは関数）であった場合、エンティティ識別部１６は、選択した素性FNC(x−2=“POS:名詞”, x−1=“で”)から表層素性の単語“で”を抽出し、単語“で”を含むテキストデータを含むトピック情報付きテキストデータを選択する。 [Example of selection method]
Selection method example 1:
In selection method example 1, the entity identification unit 16 selects, from among the features fP_e^j of the positive example entity RP_e^{j-1} and the features fN_e^j of the negative example entity RN_e^{j-1} used as supervised learning data by the identification learning unit 15, the features whose index of influence on the identification model ME_e^j generated from them (for example, the weight λ_q described above) satisfies a specific criterion, that is, the features fP_e^j and/or fN_e^j that strongly influence the identification model ME_e^j. For example, the entity identification unit 16 selects the features fP_e^j and/or fN_e^j for which the absolute value of the weight λ_q is larger than a threshold.
The entity identification unit 16 then selects the text data with topic information whose text data contains a character string corresponding to the selected features fP_e^j and/or fN_e^j, and takes as a target entity RD_e^j an entity that is a character string contained in the text data of the selected text data with topic information. For example, the entity identification unit 16 extracts the surface-feature word from the selected features fP_e^j and/or fN_e^j and selects the text data with topic information whose text data contains that word. As an example, suppose the selected feature is FNC(x−2 = "POS: noun", x−1 = "で") (FNC being a function), a feature in which the two words preceding the entity ex are described by a combination of a surface feature and a part-of-speech feature. The entity identification unit 16 then extracts the surface-feature word "で" from the selected feature and selects the text data with topic information whose text data contains the word "で".
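Selection method example 1 can be sketched as a two-step filter: keep the features whose weight has a large absolute value, then keep the text data that contain the surface word of any such feature. The sketch below assumes that each feature name is mapped to its learned weight and, separately, to its surface word (or `None` for a pure part-of-speech feature), and that text data are token lists; all names and values are hypothetical.

```python
from typing import Dict, List, Optional


def select_documents(weights: Dict[str, float],
                     surface_words: Dict[str, Optional[str]],
                     documents: Dict[str, List[str]],
                     threshold: float) -> List[str]:
    """Return ids of documents containing a surface word of an influential feature."""
    # Step 1: surface words of features with |lambda_q| above the threshold.
    words = {
        surface_words[name]
        for name, weight in weights.items()
        if abs(weight) > threshold and surface_words.get(name)
    }
    # Step 2: documents that contain at least one of those words.
    return [doc_id for doc_id, tokens in documents.items() if words & set(tokens)]


# Illustrative weights; "de" stands in for the romanized particle of the example.
weights = {"ex-1=de": 1.4, "ex-2=POS:noun": 0.9, "ex+1=wa": 0.05}
surface_words = {"ex-1=de": "de", "ex-2=POS:noun": None, "ex+1=wa": "wa"}
documents = {
    "T1": ["Hiroshima", "de", "game"],
    "T2": ["company", "wa", "large"],
    "T3": ["team", "de", "win"],
}
selected = select_documents(weights, surface_words, documents, threshold=0.5)
```

Here the part-of-speech feature passes the weight test but contributes no surface word, so only the documents containing "de" are selected as the source of target entities.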

選択方法の例２：
選択方法の例２では、エンティティ識別部１６は、前述した正例基準トピックS_eの何れかと同一のトピック候補に対応する正例テキスト（テキストデータが含む文字列）であり、なおかつ、当該同一のトピック候補のトピック候補スコアが特定の基準を満たす正例テキストが含むエンティティを対象エンティティRD_e ^jとする。
例えば、エンティティ識別部１６は、正例基準トピックS_eの何れかと同一のトピックの候補に対応する正例テキストであり、なおかつ、当該同一のトピックの候補に対応するトピック候補スコアが当該正例テキストに対応するトピック候補スコアの中で最大となる正例テキストが含むエンティティを対象エンティティRD_e ^jとする。
或いは、例えば、エンティティ識別部１６は、正例基準トピックS_eの何れかと同一のトピックの候補に対応する正例テキストであり、なおかつ、当該同一のトピックの候補に対応するトピック候補スコアが基準値以上となる正例テキストが含むエンティティを対象エンティティRD_e ^jとする（[選択方法の例]の説明終わり）。 Selection method example 2:
In selection method example 2, the entity identification unit 16 takes as a target entity RD_e^j an entity contained in a positive example text (a character string contained in the text data) that corresponds to the same topic candidate as one of the positive example reference topics S_e described above and whose topic candidate score for that topic candidate satisfies a specific criterion.
For example, the entity identification unit 16 takes as a target entity RD_e^j an entity contained in a positive example text that corresponds to the same topic candidate as one of the positive example reference topics S_e and for which the topic candidate score of that topic candidate is the largest among the topic candidate scores of that positive example text.
Alternatively, for example, the entity identification unit 16 takes as a target entity RD_e^j an entity contained in a positive example text that corresponds to the same topic candidate as one of the positive example reference topics S_e and for which the topic candidate score of that topic candidate is greater than or equal to a reference value (end of the description of [Example of selection method]).

素性抽出部１３は、記憶部１１ａに格納されたテキストデータの集合Dから、対象エンティティRD_e ^jを含む文字列である「対象テキスト」を抽出する。対象テキストの例は、テキストデータが含む文、フレーズ、単語列などである。対象テキストは、対象エンティティRD_e ^jとテキストデータとの組に対して１個以上抽出される。
素性抽出部１３は、抽出した対象テキストとの関係で定まる対象エンティティRD_e ^jの特徴を表す素性fD'_e ^jを抽出する。対象エンティティRD_e ^jの素性fD'_e ^jは、対象テキスト（対象エンティティRD_e ^jを含む文字列であってテキストデータに含まれるもの）に対応し、対象テキストと当該対象エンティティRD_e ^jとの関係を表す情報を含む。具体的な処理は、前述した正例エンティティRP_e ^j-1の素性fP'_e ^jを抽出する場合と同様である。例えば、「正例エンティティRP_e ^j」が「対象エンティティRD_e ^j」に「素性fP'_e ^j」が「素性fD'_e ^j」に「正例テキスト」が「対象テキスト」に置換される以外は、前述した正例エンティティRP_e ^j-1の素性fP'_e ^jを抽出する処理と同じである。 The feature extraction unit 13 extracts, from the set D of text data stored in the storage unit 11a, a “target text”, which is a character string containing the target entity RD _e ^j . Examples of the target text are sentences, phrases, and word strings contained in the text data. One or more target texts are extracted for each pair of a target entity RD _e ^j and text data.
The feature extraction unit 13 extracts a feature fD ′ _e ^j representing the characteristics of the target entity RD _e ^j as determined by its relationship with the extracted target text. The feature fD ′ _e ^j of the target entity RD _e ^j corresponds to a target text (a character string that contains the target entity RD _e ^j and is contained in the text data) and includes information representing the relationship between the target text and the target entity RD _e ^j . The specific process is the same as that for extracting the feature fP ′ _e ^j of the positive example entity RP _e ^j−1 described above. For example, it is identical to that process except that “positive example entity RP _e ^j ” is replaced with “target entity RD _e ^j ”, “feature fP ′ _e ^j ” with “feature fD ′ _e ^j ”, and “positive example text” with “target text”.

対象テキストに対応する対象エンティティRD_e ^jの素性fD'_e ^jは、トピック情報抽出部１４に入力される。トピック情報抽出部１４は、記憶部１１ｃから、対象テキストを含むトピック情報付きテキストデータが含むトピック情報（対象テキストに対応するトピック情報）を選択する。このように選択されたトピック情報を、対象エンティティRD_e ^jとテキストデータとの組に対応する「対象トピック情報」と呼ぶことにする。なお、トピック情報はテキストデータごとに与えられているため、同一のテキストデータが含む各対象テキストには、同じ対象トピック情報が対応する。トピック情報抽出部１４は、対象エンティティRD_e ^jとテキストデータとの組に対応する対象トピック情報を、当該テキストデータが含む各対象テキストに対応する各対象エンティティRD_e ^jの素性fD'_e ^jに加え、各対象テキストに対応する各対象エンティティRD_e ^jの素性をfD_e ^jに更新する。すなわち、対象テキストに対応する対象エンティティRD_e ^jの素性fD_e ^jは、当該対象テキストに対応する対象エンティティRD_e ^jの素性fD'_e ^jと対象トピック情報とを含む。このように対象トピック情報は素性fD_e ^jの一部とされる。 The feature fD ′ _e ^j of the target entity RD _e ^j corresponding to the target text is input to the topic information extraction unit 14. The topic information extraction unit 14 selects, from the storage unit 11c, the topic information contained in the text data with topic information that includes the target text (the topic information corresponding to the target text). The topic information selected in this way is referred to as the “target topic information” corresponding to the pair of the target entity RD _e ^j and the text data. Since topic information is given per text data, the same target topic information corresponds to every target text contained in the same text data. The topic information extraction unit 14 adds the target topic information corresponding to the pair of the target entity RD _e ^j and the text data to the feature fD ′ _e ^j of each target entity RD _e ^j corresponding to each target text contained in that text data, thereby updating the feature of each such target entity RD _e ^j to fD _e ^j . That is, the feature fD _e ^j of the target entity RD _e ^j corresponding to a target text includes the feature fD ′ _e ^j of that target entity and the target topic information. In this way, the target topic information becomes part of the feature fD _e ^j .
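
The update from fD′ to fD described above amounts to appending the topic information of the containing text data to the context feature. A minimal sketch, assuming the feature is held as a dict and the topic information as a list of (topic, score) pairs; both representations and the `topic:` key prefix are choices made for this illustration only:

```python
def add_topic_information(context_feature, topic_information):
    """Return the updated feature fD: the context feature fD' plus the
    target topic information of the text data containing the target text."""
    feature = dict(context_feature)              # leave fD' itself unchanged
    for topic, score in topic_information:
        feature[f"topic:{topic}"] = score        # hypothetical key naming
    return feature


# Hypothetical context feature and topic information for one target text.
fD_prime = {"x-1=de": 1.0}
fD = add_topic_information(fD_prime, [("team name", 0.8), ("company name", 0.2)])
print(fD)
```

Because topic information is given per text data, the same `topic_information` argument would be passed for every target text taken from the same text data.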

対象エンティティRD_e ^jの素性fD_e ^jは、エンティティ識別部１６に入力される。エンティティ識別部１６は、対象エンティティRD_e ^jの素性fD_e ^jを記憶部１１ｄから読み出した識別モデルME_e ^jに入力し、対象エンティティRD_e ^jが正例エンティティか負例エンティティかを識別する。例えば、識別モデルME_e ^jとして正則化項付き最大エントロピーモデルが用いられる場合には、x=fD_e ^jを識別モデルME_e ^jであるP(y|x)に代入してP(1|x)とP(-1|x)とを求め、それらに対応する指標（信頼度など）と閾値とを比較することで、対象エンティティRD_e ^jが正例エンティティか負例エンティティかを識別する。 The feature fD _e ^j of the target entity RD _e ^j is input to the entity identification unit 16. The entity identification unit 16 inputs the feature fD _e ^j of the target entity RD _e ^j to the identification model ME _e ^j read from the storage unit 11d, and identifies whether the target entity RD _e ^j is a positive example entity or a negative example entity. For example, when a maximum entropy model with a regularization term is used as the identification model ME _e ^j , x = fD _e ^j is substituted into P (y | x), which is the identification model ME _e ^j , to obtain P (1 | x) and P (−1 | x), and an index corresponding to them (such as a confidence score) is compared with a threshold to identify whether the target entity RD _e ^j is a positive example entity or a negative example entity.
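
The identification step can be illustrated with a small maximum entropy scorer. This is a sketch under assumptions: the weights below are invented for illustration (a real ME _e ^j would be learned in step S15), and using P(1 | x) itself as the index compared with the threshold is one possible choice, not something the text prescribes.

```python
import math


def maxent_probs(features, weights):
    """Return {label: P(label | x)} for a maximum entropy model.

    features: {feature name: value}
    weights:  {label: {feature name: weight}}  (illustrative values only)
    """
    scores = {
        label: sum(w.get(name, 0.0) * value for name, value in features.items())
        for label, w in weights.items()
    }
    max_score = max(scores.values())             # subtract the max for numerical stability
    exps = {label: math.exp(s - max_score) for label, s in scores.items()}
    z = sum(exps.values())
    return {label: e / z for label, e in exps.items()}


def identify(features, weights, threshold=0.5):
    """Label the target entity +1 if P(+1 | x) >= threshold, otherwise -1."""
    probs = maxent_probs(features, weights)
    return (1 if probs[1] >= threshold else -1), probs


# Invented weights standing in for a learned identification model.
weights = {
    1: {"topic:team name": 2.0, "topic:company name": -2.0},
    -1: {"topic:team name": -2.0, "topic:company name": 2.0},
}
fD = {"topic:team name": 0.8, "topic:company name": 0.2}
label, probs = identify(fD, weights)
print(label, probs)
```

Raising `threshold` above 0.5 makes the identification more conservative, which is one way to trade recall for resistance to drift.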

ここで、対象エンティティが正例エンティティであると識別された場合、エンティティ識別部１６は、対象エンティティRD_e ^jを新たな正例エンティティRP_e ^jとして記憶部１１ｅに格納する。一方、対象エンティティが負例エンティティであると識別された場合、エンティティ識別部１６は、対象エンティティRD_e ^jを新たな負例エンティティRN_e ^jして記憶部１１ｅに格納する。 If the target entity is identified as a positive entity, the entity identifying unit 16 stores the target entity RD _e ^j as a new positive entity RP _e ^j in the storage unit 11e. On the other hand, when the target entity is identified as a negative example entity, the entity identification unit 16 stores the target entity RD _e ^j as a new negative example entity RN _e ^j in the storage unit 11e.

《収束判定：ステップＳ１７−Ｓ１９》
収束判定部１７は、収束条件を満たしたかを判定する。以下に収束条件を例示する。
[収束条件の例]
収束条件の例１：この例の収束判定部１７は、正例エンティティRP_e ^jに新たに割り当てられる対象エンティティRD_e ^jが存在しない場合に、収束条件を満たしたと判断する。
収束条件の例２：この例の収束判定部１７は、ステップＳ１３からＳ１７のイテレーションを基準回数以上繰り返しても新たに割り当てられる対象エンティティRD_e ^j-1が存在しない場合に、収束条件を満たしたと判断する。
収束条件の例３：この例の収束判定部１７は、jの値が基準値以上となった場合に収束条件を満たしたと判断する（[収束条件の例]の説明終わり／ステップＳ１７）。
収束判定部１７が収束条件を満たしたと判断した場合、ステップＳ１３からＳ１７のイテレーションが終了し、出力部１８が記憶部１１ｅに格納されているすべての正例エンティティRP^j _eを出力して処理を終了する（ステップＳ１９）。それ以外の場合は、制御部１９がj+1を新たなjの値とし（ステップＳ１８）、記憶部１１ｅに格納されている正例エンティティRP^j _e及び負例エンティティRN^j _eを素性抽出部１３に入力し、ステップＳ１３からＳ１６のイテレーションが実行される。 << Convergence determination: steps S17 to S19 >>
The convergence determination unit 17 determines whether the convergence condition is satisfied. Examples of convergence conditions are given below.
[Example of convergence condition]
Convergence condition example 1: The convergence determination unit 17 in this example determines that the convergence condition is satisfied when there is no target entity RD _e ^j newly assigned to the positive example entities RP _e ^j .
Convergence condition example 2: The convergence determination unit 17 in this example determines that the convergence condition is satisfied when no target entity RD _e ^j-1 is newly assigned even after the iterations of steps S13 to S17 have been repeated a reference number of times or more.
Convergence condition example 3: The convergence determination unit 17 in this example determines that the convergence condition is satisfied when the value of j is equal to or greater than a reference value (end of the description of [Examples of convergence conditions] / step S17).
When the convergence determination unit 17 determines that the convergence condition is satisfied, the iterations of steps S13 to S17 end, and the output unit 18 outputs all the positive example entities RP ^j _e stored in the storage unit 11e and ends the process (step S19). Otherwise, the control unit 19 sets j + 1 as the new value of j (step S18), inputs the positive example entities RP ^j _e and the negative example entities RN ^j _e stored in the storage unit 11e to the feature extraction unit 13, and the iterations of steps S13 to S16 are executed.
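
The control flow of steps S11 and S17 to S19 can be sketched as a loop. The callback `identify_new_positives` is a hypothetical stand-in for steps S13 to S16, and only convergence condition examples 1 and 3 are modeled; the toy neighbour table is invented for illustration.

```python
def run_iterations(identify_new_positives, seeds, max_iterations=10):
    """Repeat the extraction iteration until a convergence condition holds.

    identify_new_positives: hypothetical callback standing in for steps
        S13-S16; given the current positive set, it returns the entities
        identified as positive in this iteration.
    """
    positives = set(seeds)
    j = 1                                            # step S11
    while True:
        newly_added = set(identify_new_positives(positives)) - positives
        positives |= newly_added
        # Condition example 1: no entity was newly assigned.
        # Condition example 3: j reached the reference value.
        if not newly_added or j >= max_iterations:   # step S17
            return positives, j                      # step S19
        j += 1                                       # step S18


# Toy stand-in for identification: each entity "leads to" fixed neighbours.
neighbours = {"Hiroshima": ["Hanshin"], "Hanshin": ["Yakult"], "Yakult": []}


def toy_step(positives):
    return {n for e in positives for n in neighbours.get(e, [])}


result, iterations = run_iterations(toy_step, {"Hiroshima"})
print(sorted(result), iterations)
```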

＜識別事例＞
第１実施形態における具体的な識別事例を例示する。
この例では２つのトピックの候補に対応するトピックモデルTM⁰を用いる。具体的には「球団名」と「企業名」とがトピックの候補とされたトピックモデルTM⁰を用いる。また、正例シードエンティティが<広島>であり、負例シードエンティティが<毎日新聞>であり、ユーザは球団名についてのエンティティのセットを要求していると仮定する。
また、トピックモデルTM⁰を用いて計算された、正例シードエンティティ<広島>を含むテキストデータd₁に対するトピック「球団名」の事後確率がp(球団名|d₁)=0.9であり、当該テキストデータd₁に対するトピック「企業名」の事後確率がp(企業名|d₁)=0.1であったとする。一方、負例シードエンティティ<毎日新聞>を含むテキストデータd₂に対するトピック「球団名」の事後確率がp(球団名|d₂)=0.1であり、当該テキストデータd₂に対するトピック「企業名」の事後確率がp(企業名|d₂)=0.9であったとする。ここで前述の「トピック情報の例１(N=2)」のようにトピック情報が定められていたとすると、テキストデータd₁に対するトピック情報は((球団名,0.9), (企業名,0.1))となり、テキストデータd₂に対するトピック情報は((球団名,0.1), (企業名,0.9))となる（ステップＳ１２）。 <Identification examples>
A specific identification example in the first embodiment is given below.
This example uses a topic model TM ⁰ with two topic candidates. Specifically, it uses a topic model TM ⁰ in which “team name” and “company name” are the topic candidates. Assume also that the positive example seed entity is <Hiroshima>, the negative example seed entity is <Mainichi Shimbun>, and the user is requesting a set of entities that are team names.
Further, suppose that the posterior probability of the topic “team name” for the text data d ₁ containing the positive example seed entity <Hiroshima>, calculated using the topic model TM ⁰ , is p (team name | d ₁ ) = 0.9, and that the posterior probability of the topic “company name” for the text data d ₁ is p (company name | d ₁ ) = 0.1. Suppose likewise that the posterior probability of the topic “team name” for the text data d ₂ containing the negative example seed entity <Mainichi Shimbun> is p (team name | d ₂ ) = 0.1, and that the posterior probability of the topic “company name” for the text data d ₂ is p (company name | d ₂ ) = 0.9. If the topic information is defined as in “Topic information example 1 (N = 2)” described above, the topic information for the text data d ₁ is ((team name, 0.9), (company name, 0.1)), and the topic information for the text data d ₂ is ((team name, 0.1), (company name, 0.9)) (step S12).

この例ではステップＳ１３の素性抽出が行われず、トピック情報のみが素性として用いられたとする。その場合、正例シードエンティティ<広島>の素性は((球団名,0.9), (企業名,0.1))となり、負例シードエンティティ<毎日新聞>の素性は((球団名,0.1), (企業名,0.9))となる。よって、学習データは
(((球団名,0.9), (企業名,0.1)), <+1>)
(((球団名,0.1), (企業名,0.9)), <-1>)
となる（ステップＳ１４）。
このような学習データを元に識別モデルを学習する（ステップＳ１５）。学習の結果、「球団名」に対して正例側の重みが大きく、「企業名」に対して正例側の重みの小さな識別モデルが得られるであろう。 In this example, assume that the feature extraction of step S13 is not performed and only the topic information is used as the feature. In that case, the feature of the positive example seed entity <Hiroshima> is ((team name, 0.9), (company name, 0.1)), and the feature of the negative example seed entity <Mainichi Shimbun> is ((team name, 0.1), (company name, 0.9)). The learning data is therefore
(((team name, 0.9), (company name, 0.1)), <+1>)
(((team name, 0.1), (company name, 0.9)), <-1>)
(step S14).
An identification model is learned from this learning data (step S15). The learning should yield an identification model in which the positive-side weight is large for “team name” and small for “company name”.

次に、シードエンティティに含まれない対象エンティティ<阪神>が入力されたとする。ここで、上記と同様に計算された、対象エンティティ<阪神>を含むテキストデータd₃に対するトピック「球団名」の事後確率がp(球団名|d₁)=0.8であり、当該テキストデータd₃に対するトピック「企業名」の事後確率がp(企業名|d₁)=0.2であったとする。その場合、対象エンティティ<阪神>の素性は((球団名,0.8), (企業名,0.2))となる。この素性((球団名,0.8), (企業名,0.2))を上記の識別モデルに識別させてみると、その結果から素性((球団名,0.8), (企業名,0.2))は正例エンティティに対応すると判断できる（ステップＳ１６）。 Next, suppose that a target entity <Hanshin>, which is not among the seed entities, is input. Suppose that the posterior probability of the topic “team name” for the text data d ₃ containing the target entity <Hanshin>, calculated in the same manner as above, is p (team name | d ₁ ) = 0.8, and that the posterior probability of the topic “company name” for the text data d ₃ is p (company name | d ₁ ) = 0.2. In that case, the feature of the target entity <Hanshin> is ((team name, 0.8), (company name, 0.2)). When the identification model above is applied to this feature ((team name, 0.8), (company name, 0.2)), the result indicates that the feature corresponds to a positive example entity (step S16).
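
The worked example can be reproduced numerically. The sketch below fits a two-class logistic model, which is equivalent to a two-class maximum entropy model, with an L2 regularization term by plain gradient descent. The optimizer and the hyperparameters (`epochs`, `lr`, `l2`) are assumptions chosen for this illustration; the text does not specify how the model is trained.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(data, epochs=500, lr=0.5, l2=0.01):
    """Fit weights and bias by batch gradient descent with an L2 penalty on the weights."""
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [l2 * wi for wi in w], 0.0
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - (1.0 if y == 1 else 0.0)       # labels are +1 / -1
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def prob_positive(x, w, b):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Feature order: (team name, company name), as in the learning data of step S14.
training = [((0.9, 0.1), 1),      # <Hiroshima>
            ((0.1, 0.9), -1)]     # <Mainichi Shimbun>
w, b = train_logistic(training)
p_hanshin = prob_positive((0.8, 0.2), w, b)   # feature of <Hanshin>
print(w, b, p_hanshin)
```

The learned weight is positive for “team name” and negative for “company name”, so the feature of <Hanshin> receives a positive-side probability above 0.5, in line with the outcome described for step S16.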

このようにトピック情報を素性の少なくとも一部として用いることで、ユーザの要求を表した正例シードエンティティ及び負例シードエンティティに沿った識別が適切に行われ、セマンティックドリフトを抑えたエンティティの抽出が可能となる。これがトピック情報を用いるメリットである。特にset expansionでは表した正例シードエンティティの数が少ない場合が多く、利用できる情報が非常に限られるため、周辺文脈だけを素性としたのではデータが疎となり、識別精度が低下する場合が多い。トピック情報は、このように利用可能なデータの少ない場面での識別において有効な素性として作用する。 Using topic information as at least part of the feature in this way allows identification to follow the positive example and negative example seed entities that express the user's request, which makes it possible to extract entities while suppressing semantic drift. This is the advantage of using topic information. In set expansion in particular, the number of positive example seed entities given is often small and the available information is very limited, so using only the surrounding context as the feature makes the data sparse and often lowers identification accuracy. Topic information acts as an effective feature for identification in such situations where little data is available.

＜第１実施形態の特徴＞
以上のように、本形態の方法ではトピック情報を素性の少なくとも一部として用いたため、セマンティックドリフトを抑制することができる。また、本形態の方法はリソースであるテキストデータの種類によらず利用でき、適用範囲が広い。 <Features of First Embodiment>
As described above, since the topic information is used as at least part of the feature in the method of the present embodiment, the semantic drift can be suppressed. The method of this embodiment can be used regardless of the type of text data that is a resource, and has a wide range of applications.

〔第２実施形態〕
第２実施形態は第１実施形態の変形例であり、エンティティの属性を用いてセマンティックドリフトを抑制する。「属性」とは、エンティティの特徴を表すテキストデータ中の文字列である。このような文字列の例は、名詞、単語、単語列、フレーズ、文などである。属性の具体例はエンティティの前後W単語以内に存在する名詞である。なお、Wはウィンドウサイズを表す1以上の整数である。例えば「阪神の試合速報・・・」というテキストデータ中の<阪神>がエンティティであり、ウィンドウサイズをW=3とした場合、<試合>と<速報>がエンティティ<阪神>の属性の候補とされる。 [Second Embodiment]
The second embodiment is a modification of the first embodiment and suppresses semantic drift by using attributes of entities. An “attribute” is a character string in text data that represents a characteristic of an entity. Examples of such character strings are nouns, words, word strings, phrases, and sentences. A specific example of an attribute is a noun located within W words before or after the entity, where W is an integer of 1 or more representing the window size. For example, if <Hanshin> in the text data “Hanshin game breaking news ...” is the entity and the window size is W = 3, then <game> and <breaking news> are candidate attributes of the entity <Hanshin>.
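
The window-based attribute candidates in this example can be sketched as follows. Tokenization and part-of-speech tagging are assumed to have been done beforehand by a morphological analyzer; the token list below was segmented by hand, and the tag name `noun` is an assumption of this sketch.

```python
def attribute_candidates(tagged_tokens, entity, window):
    """Return the nouns within `window` tokens before or after each
    occurrence of `entity`, in order of first appearance.

    tagged_tokens: list of (surface, part_of_speech) pairs.
    """
    candidates = []
    for i, (surface, _) in enumerate(tagged_tokens):
        if surface != entity:
            continue
        start = max(0, i - window)
        end = min(len(tagged_tokens), i + window + 1)
        for k in range(start, end):
            word, pos = tagged_tokens[k]
            if k != i and pos == "noun" and word not in candidates:
                candidates.append(word)
    return candidates


# "阪神の試合速報" segmented by hand: Hanshin / (particle) / game / breaking news.
tokens = [("阪神", "noun"), ("の", "particle"), ("試合", "noun"), ("速報", "noun")]
print(attribute_candidates(tokens, "阪神", window=3))
```

With W = 3 both <試合> and <速報> fall inside the window; with W = 1 only the particle does, so no candidate is returned.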

互いに関連のある複数のエンティティには同一の属性が対応する。例えば、球団名であるエンティティ<広島>と同じく球団名であるエンティティ<ヤクルト>とは、同じ<試合>や<投手>などの属性が対応する。そのため、属性は探索対象となるエンティティが満たすべき制約条件となる。このことを利用し、第２実施形態では、エンティティとその属性との組を用いてエンティティの識別を行う。例えば、正例シードエンティティを球団名である<広島>とし、同じく球団名である<ヤクルト>というエンティティを獲得することを狙っていると仮定する。この場合、これらのエンティティに共通する属性は<試合>や<投手>などであり、例えば、正例シードエンティティ<広島>とその属性<試合>との組を用いてエンティティの識別が行われる。ここで、球団名である<ヤクルト>は正例エンティティであるが、<ヤクルト>には飲料名としての意味もある（例えば、図５ＡのT5）。よって<ヤクルト>はセマンティックドリフトが起こりやすいエンティティである。しかしながら、飲料名であるエンティティ<ヤクルト>の属性は<試合>や<投手>などではなく<乳酸菌>や<飲料>などである。本形態では、エンティティとその属性との組を用いることでエンティティがどのような意味を指すかを特定でき、セマンティックドリフトを軽減できる。
以下では第１実施形態の相違点を中心に説明し、第１実施形態と共通する事項については説明を省略する。また、第１実施形態と共通する部分については第１実施形態と同じ参照番号を用いる。 Mutually related entities share the same attributes. For example, the entity <Hiroshima>, a team name, and the entity <Yakult>, also a team name, both have attributes such as <game> and <pitcher>. An attribute therefore serves as a constraint that the entities being searched for must satisfy. Exploiting this, the second embodiment identifies entities using pairs of an entity and its attribute. For example, suppose that the positive example seed entity is the team name <Hiroshima> and that the aim is to acquire the entity <Yakult>, which is also a team name. In this case, the attributes common to these entities are <game>, <pitcher>, and so on, and entities are identified using, for example, the pair of the positive example seed entity <Hiroshima> and its attribute <game>. Here, <Yakult> as a team name is a positive example entity, but <Yakult> also has a meaning as a beverage name (for example, T5 in FIG. 5A). <Yakult> is therefore an entity prone to semantic drift. However, the attributes of the entity <Yakult> as a beverage name are not <game> or <pitcher> but <lactic acid bacteria>, <beverage>, and the like. In this embodiment, using pairs of an entity and its attribute makes it possible to determine which meaning an entity has and thus to reduce semantic drift.
The following description focuses on the differences from the first embodiment and omits matters common to the first embodiment. Parts common to the first embodiment are given the same reference numerals as in the first embodiment.

＜構成＞
図６は、第２実施形態のデータ抽出装置２の機能構成を例示するためのブロック図である。
図６に例示するように、データ抽出装置２は、記憶部１１ａ,１１ｄ,１１ｅ,２１ｄ,２１ｅ、初期属性集合生成部２２、属性識別用素性抽出部２３ａ、エンティティ識別用素性抽出部２３ｂ、属性識別学習部２５ａ、エンティティ識別学習部２５ｂ、属性識別部２６ａ、エンティティ識別部２６ｂ、収束判定部１７、出力部１８、及び制御部１９を有し、制御部１９の制御のもと各処理を実行する。
なお、データ抽出装置２は、例えば、公知又は専用のコンピュータに特別なプログラムが読み込まれて構成される特別な装置である。例えば、記憶部１１ａ,１１ｄ,１１ｅ,２１ｄ,２１ｅは、ハードディスクや半導体メモリなどであり、初期属性集合生成部２２、属性識別用素性抽出部２３ａ、エンティティ識別用素性抽出部２３ｂ、属性識別学習部２５ａ、エンティティ識別学習部２５ｂ、属性識別部２６ａ、エンティティ識別部２６ｂ、収束判定部１７、出力部１８、及び制御部１９は、特別なプログラムが読み込まれたCPUなどである。また、これらの少なくとも一部が集積回路などによって構成されてもよい。 <Configuration>
FIG. 6 is a block diagram for illustrating a functional configuration of the data extraction device 2 of the second embodiment.
As illustrated in FIG. 6, the data extraction device 2 includes storage units 11a, 11d, 11e, 21d, and 21e, an initial attribute set generation unit 22, an attribute identification feature extraction unit 23a, an entity identification feature extraction unit 23b, an attribute identification learning unit 25a, an entity identification learning unit 25b, an attribute identification unit 26a, an entity identification unit 26b, a convergence determination unit 17, an output unit 18, and a control unit 19, and executes each process under the control of the control unit 19.
Note that the data extraction device 2 is, for example, a special device configured by loading a special program into a known or dedicated computer. For example, the storage units 11a, 11d, 11e, 21d, and 21e are hard disks, semiconductor memories, or the like, and the initial attribute set generation unit 22, the attribute identification feature extraction unit 23a, the entity identification feature extraction unit 23b, the attribute identification learning unit 25a, the entity identification learning unit 25b, the attribute identification unit 26a, the entity identification unit 26b, the convergence determination unit 17, the output unit 18, and the control unit 19 are implemented by, for example, a CPU into which a special program has been loaded. At least some of these may also be configured by an integrated circuit or the like.

＜事前処理＞
事前処理として、記憶部１１ａにテキストデータの集合Dが格納される。テキストデータの集合Dは第１実施形態と同様である。 <Pre-processing>
As pre-processing, a set D of text data is stored in the storage unit 11a. The text data set D is the same as in the first embodiment.

＜データ抽出処理＞
図７は、第２実施形態のデータ抽出装置２のデータ抽出処理を例示するための図である。本形態では、エンティティと属性の更新を交互に行うco-training方式を用いる。すなわち、ステップＳ２２−Ｓ２４では正例及び負例エンティティの更新は行われず、正例及び負例属性の更新のみが行われる。一方ステップＳ２５−Ｓ２７では正例及び負例属性の更新は行われず、正例及び負例エンティティの更新のみが行われる。以下、図７を用いてデータ抽出装置２のデータ抽出処理を例示する。 <Data extraction process>
FIG. 7 is a diagram for illustrating a data extraction process of the data extraction device 2 of the second embodiment. In this embodiment, a co-training method that alternately updates entities and attributes is used. That is, in steps S22 to S24, the positive example and negative example entities are not updated, and only the positive example and negative example attributes are updated. On the other hand, in steps S25-S27, the positive and negative example attributes are not updated, and only the positive and negative example entities are updated. Hereinafter, the data extraction processing of the data extraction device 2 will be illustrated with reference to FIG.

《初期化：ステップＳ１１》
制御部１９がjの値をj=1に初期化する。
《初期属性集合生成：ステップＳ２１》
正例シードエンティティRP_e ⁰と負例シードエンティティRN_e ⁰とが初期属性集合生成部２２に入力される。例えば、正例シードエンティティとしてRP_e ⁰=<広島>、負例シードエンティティとしてRN_e ⁰=<日本>が入力される。正例シードエンティティRP_e ⁰は、ユーザによって選択されたものである。負例シードエンティティRN_e ⁰は、ユーザによって選択されたものであってもよいし、テキストデータの集合Dから半自動で生成されたものであってもよい。以下に負例シードエンティティRN_e ⁰を半自動で生成する方法を例示する。 << Initialization: Step S11 >>
The control unit 19 initializes the value of j to j = 1.
<< Initial attribute set generation: Step S21 >>
The positive example seed entity RP _e ⁰ and the negative example seed entity RN _e ⁰ are input to the initial attribute set generation unit 22. For example, RP _e ⁰ = <Hiroshima> is input as a positive example seed entity, and RN _e ⁰ = <Japan> is input as a negative example seed entity. The positive seed entity RP _e ⁰ has been selected by the user. The negative example seed entity RN _e ⁰ may be selected by the user, or may be generated semi-automatically from the text data set D. A method for generating the negative seed entity RN _e ⁰ semi-automatically will be illustrated below.

[負例シードエンティティRN_e ⁰の半自動生成方法の例]
負例シードエンティティ生成部（図示せず）が、テキストデータの集合Dから、何れの正例シードエンティティRP_e ⁰も後述する正例属性RP_a ⁰も含まないテキストデータを所定個数抽出し、抽出した各テキストデータから１つずつランダムに名詞を選択し、それらを負例エンティティ候補として出力する。表示部（図示せず）はこれらの負例エンティティ候補を表示し、これらから負例シードエンティティを選択するようにユーザに促す表示を行う。ユーザによる選択内容は負例シードエンティティ生成部に入力され、負例シードエンティティ生成部は、選択された負例エンティティ候補を正例シードエンティティRP_e ⁰として出力する（[負例シードエンティティRN_e ⁰の半自動生成方法の例]の説明終わり）。
初期属性集合生成部２２は、入力された正例シードエンティティRP_e ⁰と負例シードエンティティRN_e ⁰と記憶部１１ａに格納されたテキストデータの集合Dとを用い、正例シードエンティティRP_e ⁰の属性を表す文字列である正例属性RP_a ⁰の集合と、負例シードエンティティRN_e ⁰の属性を表す文字列である負例属性RN_a ⁰の集合とを生成する。 [Example of semi-automatic generation of negative example seed entity RN _e ⁰ ]
A negative example seed entity generation unit (not shown) extracts from the text data set D a predetermined number of text data that contain neither any positive example seed entity RP _e ⁰ nor any positive example attribute RP _a ⁰ described later, randomly selects one noun from each extracted text data, and outputs the selected nouns as negative example entity candidates. A display unit (not shown) displays these negative example entity candidates and prompts the user to select negative example seed entities from among them. The user's selection is input to the negative example seed entity generation unit, and the negative example seed entity generation unit outputs the selected negative example entity candidates as the positive example seed entity RP _e ⁰ (end of the description of [Example of semi-automatic generation of negative example seed entity RN _e ⁰ ]).
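
The candidate-generation half of this semi-automatic procedure (everything before the user's selection) might look like the sketch below. The document layout (`text` plus a pre-extracted `nouns` list), the substring test for "contains", and the fixed random seed are all assumptions of this illustration.

```python
import random


def negative_seed_candidates(documents, positive_seeds, positive_attributes,
                             count, seed=0):
    """Pick one random noun from each of up to `count` documents that mention
    neither a positive example seed entity nor a positive example attribute.

    documents: list of dicts with "text" and a pre-extracted "nouns" list
    (a hypothetical layout; noun extraction itself is out of scope here).
    """
    rng = random.Random(seed)                    # fixed seed keeps the sketch repeatable
    banned = list(positive_seeds) + list(positive_attributes)
    eligible = [d for d in documents
                if d["nouns"] and not any(term in d["text"] for term in banned)]
    chosen_docs = rng.sample(eligible, min(count, len(eligible)))
    return [rng.choice(d["nouns"]) for d in chosen_docs]


# Invented toy documents for illustration.
documents = [
    {"text": "Hiroshima pitcher wins the first game",
     "nouns": ["Hiroshima", "pitcher", "game"]},
    {"text": "The population of Japan is falling",
     "nouns": ["population", "Japan"]},
    {"text": "A newspaper reported the stock price",
     "nouns": ["newspaper", "stock", "price"]},
]
candidates = negative_seed_candidates(
    documents, {"Hiroshima"}, {"pitcher", "game"}, count=2)
print(candidates)
```

The returned candidates would then be shown to the user, who picks the actual negative example seed entities.
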
The initial attribute set generation unit 22 uses the input positive example seed entity RP _e ⁰ , the negative example seed entity RN _e ⁰ , and the text data set D stored in the storage unit 11a to generate a set of positive example attributes RP _a ⁰ , which are character strings representing attributes of the positive example seed entity RP _e ⁰ , and a set of negative example attributes RN _a ⁰ , which are character strings representing attributes of the negative example seed entity RN _e ⁰ .

（Ａ）まず初期属性集合生成部２２が、正例シードエンティティRP_e ⁰を含むテキストデータの集合から当該正例エンティティRP_e ⁰以外の何れかの文字列を正例属性候補として選択する。例えば、初期属性集合生成部２２は、記憶部１１ａから正例シードエンティティRP_e ⁰を含む正例テキストを所定数取得し、各正例テキストにおいて正例シードエンティティRP_e ⁰と直接又は１文節を挟む係り受け関係にある単語のみを正例属性候補として抽出する。 (A) First, the initial attribute set generation unit 22 selects, from a set of text data containing the positive example seed entity RP _e ⁰ , any character string other than the positive example entity RP _e ⁰ as a positive example attribute candidate. For example, the initial attribute set generation unit 22 acquires a predetermined number of positive example texts containing the positive example seed entity RP _e ⁰ from the storage unit 11a and, in each positive example text, extracts as positive example attribute candidates only those words that are in a dependency relation with the positive example seed entity RP _e ⁰ , either directly or with one intervening phrase (bunsetsu).

（Ｂ）次に初期属性集合生成部２２は、正例シードエンティティRP_e ⁰を含む文字列の集合内に当該正例属性候補が含まれる頻度とすべてのテキストデータからなる集合D内に当該正例属性候補が含まれる頻度との違いの大きさを表す指標（統計量）を求め、当該指標が大きいものから所定数の正例属性候補、つまり、これらの頻度の違いが大きい当該正例属性候補を正例属性RP_a ⁰（正例属性の初期値）とする。これらの頻度の違いが大きい正例属性候補ほど正例シードエンティティRP_e ⁰との関連が強く、正例シードエンティティRP_e ⁰の正例属性RP_a ⁰にふさわしいといえる。以下にこのような指標を例示するが、その他の統計量を用いてもかまわない。 (B) Next, the initial attribute set generation unit 22 obtains an index (statistic) representing the magnitude of the difference between the frequency with which a positive example attribute candidate appears in the set of character strings containing the positive example seed entity RP _e ⁰ and the frequency with which that candidate appears in the set D of all text data. It then takes a predetermined number of positive example attribute candidates in descending order of the index, that is, the candidates for which this frequency difference is large, as the positive example attributes RP _a ⁰ (initial values of the positive example attributes). The larger this frequency difference is for a candidate, the more strongly the candidate is related to the positive example seed entity RP _e ⁰ and the more suitable it is as a positive example attribute RP _a ⁰ of that entity. Examples of such indices are given below, but other statistics may be used.

[指標の例]
指標の例１：
指標の例１では、以下のχ²値を指標として用いる。

χ²値が高い正例属性候補αほど、正例シードエンティティRP_e ⁰と関係の深い、即ち属性としてふさわしいといえる。よって、この例の初期属性集合生成部２２は、χ²値が高い正例属性候補αを正例属性RP_a ⁰として抽出する。例えば、χ²値が基準値以上となる正例属性候補αを正例属性RP_a ⁰とする。 [Example of metrics]
Indicator example 1:
In index example 1, the following χ ² values are used as indices.

It can be said that the positive example attribute candidate α having a higher χ ² value is more closely related to the positive example seed entity RP _e ⁰ , that is, suitable as an attribute. Therefore, the initial attribute set generation unit 22 of this example extracts the positive example attribute candidate α having a high χ ² value as the positive example attribute RP _a ⁰ . For example, a positive example attribute candidate α whose χ ² value is greater than or equal to a reference value is set as a positive example attribute RP _a ⁰ .
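
The χ² formula itself does not survive in this text. As a stand-in, the sketch below uses the standard Pearson χ² statistic for a 2×2 contingency table of seed-entity presence against candidate presence; whether the patent uses exactly this form is an assumption, and the counts are invented.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 contingency table.

    a: texts containing both the seed entity and the candidate
    b: texts containing the seed entity but not the candidate
    c: texts containing the candidate but not the seed entity
    d: texts containing neither
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0                      # degenerate table: no association measurable
    return n * (a * d - b * c) ** 2 / denominator


def select_attributes(tables, reference_value):
    """Keep the candidates whose chi-square value is at or above a reference value.

    tables: {candidate: (a, b, c, d)}
    """
    return [cand for cand, counts in tables.items()
            if chi_square_2x2(*counts) >= reference_value]


# Invented counts over 1000 texts, 40 of which contain the seed entity.
tables = {"pitcher": (30, 10, 20, 940), "today": (5, 35, 45, 915)}
print(select_attributes(tables, reference_value=10.0))
```

Note that χ² is also large for candidates that co-occur with the seed entity unusually rarely, so a practical implementation may additionally check the direction of the association.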

指標の例２：
指標の例２では、正例シードエンティティRP_e ⁰と正例属性候補αとの２項における以下のPMIを指標として用いる。

ここで|RP_e ⁰, α|は正例シードエンティティRP_e ⁰の集合と正例属性候補αとの組の出現頻度を表す。また、*はRP_e ⁰又はαのワイルドカードを表す。
PMI値が大きい正例属性候補αほど、正例シードエンティティRP_e ⁰と関係の深い、即ち属性としてふさわしいといえる。よって、この例の初期属性集合生成部２２は、PMI値が大きな正例属性候補αを正例属性RP_a ⁰として抽出する。例えば、PMI値が基準値以上となる正例属性候補αを正例属性RP_a ⁰とする（[指標の例]の説明終わり）。 Indicator example 2:
In the index example 2, the following PMIs in the two terms of the positive example seed entity RP _e ⁰ and the positive example attribute candidate α are used as the index.

Here, | RP _e ⁰ , α | represents the occurrence frequency of the pair consisting of the set of positive example seed entities RP _e ⁰ and the positive example attribute candidate α, and * represents a wildcard standing for RP _e ⁰ or α.
It can be said that the positive example attribute candidate α having a larger PMI value is more closely related to the positive example seed entity RP _e ⁰ , that is, suitable as an attribute. Therefore, the initial attribute set generation unit 22 in this example extracts the positive example attribute candidate α having a large PMI value as the positive example attribute RP _a ⁰ . For example, a positive example attribute candidate α having a PMI value equal to or higher than a reference value is set as a positive example attribute RP _a ⁰ (end of description of [index example]).
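
The PMI formula is likewise missing from this text. The sketch below uses the common count-based definition of pointwise mutual information, written with the pair and wildcard counts described above; that the patent's formula matches this definition is an assumption.

```python
import math


def pmi(pair_count, entity_count, attribute_count, total_count):
    """Pointwise mutual information from co-occurrence counts.

    pair_count:      |RP_e^0, α|  (seed entity and candidate together)
    entity_count:    |RP_e^0, *|  (seed entity with any candidate)
    attribute_count: |*, α|       (candidate with any entity)
    total_count:     |*, *|       (all entity/candidate pairs)
    """
    if pair_count == 0:
        return float("-inf")            # the two never co-occur
    return math.log((pair_count * total_count) / (entity_count * attribute_count))


# Invented counts: the candidate co-occurs with the seed entity 30 times.
print(pmi(pair_count=30, entity_count=40, attribute_count=50, total_count=1000))
```

A value of 0 means the candidate co-occurs with the seed entity exactly as often as chance would predict; larger values indicate a stronger association.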

この方法では、まず（Ａ）で構文情報を用いて正例属性候補を粗く絞り込むため、（Ｂ）での計算時間を大幅に削減することができる。また、上記（Ａ）,（Ｂ）により正例属性RP_a ⁰（正例属性の初期値）を抽出した後、適切な属性が選択されているか否かを人手によりチェックし、最終的な正例属性RP_a ⁰を決定してもよい。 In this method, first, the correct attribute candidate is roughly narrowed down using the syntax information in (A), so that the calculation time in (B) can be greatly reduced. Further, after extracting the positive example attribute RP _a ⁰ (initial value of the positive example attribute) by the above (A) and (B), it is manually checked whether or not an appropriate attribute is selected, and the final positive attribute is selected. The example attribute RP _a ⁰ may be determined.

初期属性集合生成部２２は、負例シードエンティティRN_e ⁰についても同様の処理を行い、負例属性RN_a ⁰を抽出する。すなわち、初期属性集合生成部２２は、負例シードエンティティRN_e ⁰を含むテキストデータの集合から当該負例シードエンティティRN_e ⁰以外の何れかの文字列を負例属性候補として選択し、負例シードエンティティRN_e ⁰を含む文字列の集合内に当該負例属性候補が含まれる頻度とすべてのテキストデータからなる集合D内に当該負例属性候補が含まれる頻度との違いの大きさを表す指標が条件を満たす負例属性候補、つまり、これらの頻度の違いが大きな当該負例属性候補を負例属性RN_a ⁰（負例属性の初期値）とする。 The initial attribute set generation unit 22 performs the same processing for the negative example seed entity RN _e ⁰ and extracts the negative example attributes RN _a ⁰ . That is, the initial attribute set generation unit 22 selects, from a set of text data containing the negative example seed entity RN _e ⁰ , any character string other than the negative example seed entity RN _e ⁰ as a negative example attribute candidate. It then takes as the negative example attributes RN _a ⁰ (initial values of the negative example attributes) those candidates for which an index satisfies a condition, the index representing the magnitude of the difference between the frequency with which the candidate appears in the set of character strings containing the negative example seed entity RN _e ⁰ and the frequency with which it appears in the set D of all text data; in other words, the candidates for which this frequency difference is large.

また、上述した方法の代わりに、初期属性集合生成部２２が、負例シードエンティティRN_e ⁰とそれに対応する負例属性RN_a ⁰とを半自動で選択してもよい。例えば、初期属性集合生成部２２は、テキストデータの集合Dから、何れの正例シードエンティティRP_e ⁰も正例属性RP_a ⁰も含まないテキストデータを所定個数抽出し、抽出した各テキストデータから２つずつランダムに名詞を選択し、一方を負例エンティティ候補、他方を負例属性候補として出力する。表示部（図示せず）はこれらを表示し、これらから負例シードエンティティRN_e ⁰とそれに対応する負例属性RN_a ⁰とを選択するようにユーザに促す表示を行う。ユーザによる選択内容は初期属性集合生成部２２に入力され、初期属性集合生成部２２は選択された負例シードエンティティRN_e ⁰及び負例属性RN_a ⁰の集合を出力する。 Instead of the method described above, the initial attribute set generation unit 22 may select the negative example seed entity RN _e ⁰ and the negative example attribute RN _a ⁰ corresponding thereto semi-automatically. For example, the initial attribute set generation unit 22 extracts a predetermined number of text data that does not include any positive example seed entity RP _e ⁰ or positive example attribute RP _a ⁰ from the text data set D, and extracts each text data from the extracted text data Two nouns are selected at random, and one is output as a negative example entity candidate and the other as a negative example attribute candidate. A display unit (not shown) displays these and displays to prompt the user to select a negative example seed entity RN _e ⁰ and a corresponding negative example attribute RN _a ⁰ from these. The content selected by the user is input to the initial attribute set generation unit 22, and the initial attribute set generation unit 22 outputs a set of the selected negative example seed entity RN _e ⁰ and negative example attribute RN _a ⁰ .

初期属性集合生成部２２は、正例シードエンティティRP_e ⁰の集合、負例シードエンティティRN_e ⁰の集合、抽出した正例属性RP_a ⁰の集合、及び負例属性RN_a ⁰の集合を出力する。例えば、初期属性集合生成部２２は、図４のテキストデータの中から、正例シードエンティティRP_e ⁰を含むテキストとしてT1，T2，T10に対応するものを取得し、上記の処理によってT1,T2に対応するテキストが含む正例属性RP_a ⁰の集合{<VS>，<第１戦>，<投手>}を抽出して出力する。同様に初期属性集合生成部２２は、例えば、負例シードエンティティRN_e ⁰を含むテキストとしてT7に対応するものを取得し、負例属性RN_a ⁰の集合R{<人口>}を抽出して出力する。 The initial attribute set generation unit 22 outputs the set of positive example seed entities RP _e ⁰ , the set of negative example seed entities RN _e ⁰ , the set of extracted positive example attributes RP _a ⁰ , and the set of negative example attributes RN _a ⁰ . For example, the initial attribute set generation unit 22 acquires, from the text data of FIG. 4, the texts corresponding to T1, T2, and T10 as texts containing the positive example seed entity RP _e ⁰ , and by the above processing extracts and outputs the set {<VS>, <first game>, <pitcher>} of positive example attributes RP _a ⁰ contained in the texts corresponding to T1 and T2. Similarly, the initial attribute set generation unit 22 acquires, for example, the text corresponding to T7 as a text containing the negative example seed entity RN _e ⁰ , and extracts and outputs the set R {<population>} of negative example attributes RN _a ⁰ .

《属性識別用素性抽出：ステップＳ２２》
正例エンティティRP_e ^j-1の集合、負例エンティティRN_e ^j-1の集合、正例属性RP_a ^j-1の集合、及び負例属性RN_a ^j-1の集合が、属性識別用素性抽出部２３ａに入力される。
属性識別用素性抽出部２３ａは、正例エンティティRP_e ^j-1の集合から選択した第１正例エンティティと正例属性RP_a ^j-1の集合から選択した第１正例属性との組である第１正例エンティティ−正例属性ペアPP₁（RP_e ^j-1,RP_a ^j-1）と、負例エンティティRN_e ^j-1の集合から選択した第１負例エンティティと負例属性RN_a ^j-1の集合から選択した第１負例属性との組である第１負例エンティティ−負例属性ペアPN₁（RN_e ^j-1,RN_a ^j-1）とを生成する。PP₁（RP_e ^j-1,RP_a ^j-1）やPN₁（RN_e ^j-1,RN_a ^j-1）は、RP_e ^j-1とRP_a ^j-1やRN_e ^j-1とRN_a ^j-1の採り得るすべての組み合わせについて生成されてもよいし、それらの一部の組み合わせのみについて生成されてもよい。
次に属性識別用素性抽出部２３ａは、記憶部１１ａに格納されたテキストデータの集合Dから、PP₁（RP_e ^j-1,RP_a ^j-1）の正例エンティティRP_e ^j-1と正例属性RP_a ^j-1との組を含む文字列である「第１正例テキスト」を選択する。第１正例テキストの例は、テキストデータが含む文、フレーズ、単語列などである。第１正例テキストは、第１正例エンティティ−正例属性ペアPP₁（RP_e ^j-1,RP_a ^j-1）とテキストデータとの組に対して１個以上抽出される。
属性識別用素性抽出部２３ａは、第１正例テキストに対する第１正例エンティティ−正例属性ペアPP₁（RP_e ^j-1,RP_a ^j-1）の特徴を表す情報を当該第１正例エンティティ−正例属性ペアPP₁（RP_e ^j-1,RP_a ^j-1）の素性fP_a ^jとする。この例では、第１正例テキストごとにPP₁（RP_e ^j-1,RP_a ^j-1）の素性fP_a ^jが抽出される。PP₁（RP_e ^j-1,RP_a ^j-1）の素性fP_a ^jの例は、第１正例テキスト（正例エンティティRP_e ^j-1及び正例属性RP_a ^j-1を含む文字列であってテキストデータに含まれるもの）と当該第１正例エンティティRP_e ^j-1及び第１正例属性RP_a ^j-1との関係を表す情報である。 << Attribute Identification Feature Extraction: Step S22 >>
The set of positive example entities RP_e ^j-1, the set of negative example entities RN_e ^j-1, the set of positive example attributes RP_a ^j-1, and the set of negative example attributes RN_a ^j-1 are input to the attribute identifying feature extraction unit 23a.
The attribute identifying feature extraction unit 23a generates a first positive example entity-positive example attribute pair PP₁(RP_e ^j-1, RP_a ^j-1), which is a pair of a first positive example entity selected from the set of positive example entities RP_e ^j-1 and a first positive example attribute selected from the set of positive example attributes RP_a ^j-1, and a first negative example entity-negative example attribute pair PN₁(RN_e ^j-1, RN_a ^j-1), which is a pair of a first negative example entity selected from the set of negative example entities RN_e ^j-1 and a first negative example attribute selected from the set of negative example attributes RN_a ^j-1. PP₁(RP_e ^j-1, RP_a ^j-1) and PN₁(RN_e ^j-1, RN_a ^j-1) may be generated for all possible combinations of RP_e ^j-1 with RP_a ^j-1 and of RN_e ^j-1 with RN_a ^j-1, or for only some of those combinations.
Next, the attribute identifying feature extraction unit 23a selects, from the text data set D stored in the storage unit 11a, a "first positive example text", which is a character string including both the positive example entity RP_e ^j-1 and the positive example attribute RP_a ^j-1 of PP₁(RP_e ^j-1, RP_a ^j-1). Examples of the first positive example text are sentences, phrases, and word strings included in the text data. One or more first positive example texts are extracted for each combination of a first positive example entity-positive example attribute pair PP₁(RP_e ^j-1, RP_a ^j-1) and text data.
The attribute identifying feature extraction unit 23a takes information representing the characteristics of the first positive example entity-positive example attribute pair PP₁(RP_e ^j-1, RP_a ^j-1) with respect to the first positive example text as the feature fP_a ^j of that pair. In this example, the feature fP_a ^j of PP₁(RP_e ^j-1, RP_a ^j-1) is extracted for each first positive example text. An example of the feature fP_a ^j of PP₁(RP_e ^j-1, RP_a ^j-1) is information representing the relationship between the first positive example text (a character string that includes the positive example entity RP_e ^j-1 and the positive example attribute RP_a ^j-1 and is contained in the text data) and the first positive example entity RP_e ^j-1 and first positive example attribute RP_a ^j-1.
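As a rough illustration of the pair generation and text selection in step S22, the sketch below enumerates every entity-attribute combination and keeps the texts that contain both members of a pair. The function names and the representation of a text as a list of word strings are assumptions made for this example; the specification also allows using only some of the combinations.

```python
from itertools import product


def make_pairs(entities, attributes):
    """All entity-attribute combinations (the 'all possible combinations' option)."""
    return list(product(sorted(entities), sorted(attributes)))


def texts_containing_pair(texts, pair):
    """Texts (lists of words) that contain both the entity and the attribute."""
    entity, attribute = pair
    return [words for words in texts if entity in words and attribute in words]


texts = [
    ["阪神", "は", "投手", "陣", "が", "好調"],
    ["広島", "VS", "阪神"],
]
pairs = make_pairs({"阪神", "広島"}, {"投手", "VS"})
hanshin_pitcher_texts = texts_containing_pair(texts, ("阪神", "投手"))
```

The same two helpers apply unchanged to the negative example entities and attributes.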

例えば、何れかの正例エンティティRP_e ^j-1及び正例属性RP_a ^j-1を含むテキストデータ内における当該正例属性RP_a ^j-1に一致する文字列（一致属性）から前後所定単語数以内（第１正例テキスト内）に位置する単語（周辺単語）の表記と当該一致属性に対する当該周辺単語の相対位置を表す情報との組（表層素性）、一致属性又は周辺単語の品詞情報（品詞素性）や固有名詞情報（固有名詞素性）や構文情報（構文素性）、テキストデータ内での一致属性の出現回数やテキストデータの集合D内での一致属性の出現回数（出現回数素性）のうち、少なくとも一つに対応する情報を素性fP_a ^jとすることができる。この具体例は、正例属性を基準とする以外、第１実施形態の[正例エンティティRP_e ^j-1の素性fP'_e ^jの例]と同様である。例えば、正例エンティティRP_e ^j-1がex=<阪神>であり、正例属性RP_a ^j-1がey=<投手>であり、第１正例テキストが「阪神/は/投手/陣/が/好調」であるとすると、抽出される素性fP_a ^jの例は以下のようになる。ここでは素性抽出の範囲をエンティティ及び属性の前後２単語以内と仮定している。
表層素性：「ex+1="は"」「ex+2=ey」「ey−2=ex」，「ey−1="は"」，「ey+1="陣"」，「ey+2="が"」
品詞素性：「ex+1=助詞」「ey−1=助詞」，「ey + 1=名詞」，「ey + 1=助詞」
固有名詞素性：「ex=ORG(組織名)」「ey−2=ORG(組織名)」
構文素性：「exの階層=eyの階層」(両方「好調」に係る) For example, at least one of the following can be used as the feature fP_a ^j: a pair (surface feature) consisting of the notation of a word (surrounding word) located within a predetermined number of words before or after (that is, within the first positive example text) the character string (matching attribute) that matches the positive example attribute RP_a ^j-1 in text data including any positive example entity RP_e ^j-1 and that positive example attribute RP_a ^j-1, together with information indicating the position of the surrounding word relative to the matching attribute; part-of-speech information (part-of-speech feature), proper noun information (proper noun feature), or syntax information (syntactic feature) of the matching attribute or surrounding words; and the number of occurrences of the matching attribute within the text data or within the text data set D (occurrence count feature). This specific example is the same as [Example of feature fP'_e ^j of positive example entity RP_e ^j-1] of the first embodiment, except that the positive example attribute is used as the reference. For example, suppose that the positive example entity RP_e ^j-1 is ex = <阪神> (Hanshin), the positive example attribute RP_a ^j-1 is ey = <投手> (pitcher), and the first positive example text is "阪神/は/投手/陣/が/好調" (roughly, "Hanshin's pitching staff is in good form"). The extracted features fP_a ^j are then as follows. Here, the feature extraction range is assumed to be within two words before and after the entity and the attribute.
Surface features: "ex+1 = は", "ex+2 = ey", "ey−2 = ex", "ey−1 = は", "ey+1 = 陣", "ey+2 = が"
Part-of-speech features: "ex+1 = particle", "ey−1 = particle", "ey+1 = noun", "ey+2 = particle"
Proper noun features: "ex = ORG (organization name)", "ey−2 = ORG (organization name)"
Syntactic feature: "hierarchy of ex = hierarchy of ey" (both depend on "好調")
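The surface features in this example can be reproduced with a short sketch. The function name, the zero-based index arguments, and the exact string format of each feature are assumptions for illustration; only the two-word window and the ex/ey naming come from the example above. Part-of-speech, proper noun, and syntactic features would need a morphological analyzer and parser and are not covered.

```python
def surface_features(tokens, ex_index, ey_index, window=2):
    """Surface features within `window` words of the entity (ex) and attribute (ey)."""
    anchors = {"ex": ex_index, "ey": ey_index}
    features = set()
    for name, center in anchors.items():
        for offset in range(-window, window + 1):
            pos = center + offset
            if offset == 0 or not 0 <= pos < len(tokens):
                continue  # skip the anchor itself and positions outside the text
            if pos == ex_index:
                value = "ex"  # the neighbour is the entity itself
            elif pos == ey_index:
                value = "ey"  # the neighbour is the attribute itself
            else:
                value = f'"{tokens[pos]}"'
            features.add(f"{name}{offset:+d}={value}")
    return features


tokens = ["阪神", "は", "投手", "陣", "が", "好調"]
features = surface_features(tokens, ex_index=0, ey_index=2)
```

For the example text this yields the six surface features listed above: `ex+1="は"`, `ex+2=ey`, `ey-2=ex`, `ey-1="は"`, `ey+1="陣"`, and `ey+2="が"`.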

同様に、属性識別用素性抽出部２３ａは、記憶部１１ａに格納されたテキストデータの集合Dから、PN₁（RN_e ^j-1,RN_a ^j-1）の負例エンティティRN_e ^j-1と負例属性RN_a ^j-1との組を含む文字列である「第１負例テキスト」を選択する。第１負例テキストの例は、テキストデータが含む文、フレーズ、単語列などである。第１負例テキストは、第１負例エンティティ−負例属性ペアPN₁（RN_e ^j-1,RN_a ^j-1）とテキストデータとの組に対して１個以上抽出される。
属性識別用素性抽出部２３ａは、第１負例テキストに対する第１負例エンティティ−負例属性ペアPN₁（RN_e ^j-1,RN_a ^j-1）の特徴を表す情報を当該第１負例エンティティ−負例属性ペアPN₁（RN_e ^j-1,RN_a ^j-1）の素性fN_a ^jとする。この例では、第１負例テキストごとにPN₁（RN_e ^j-1,RN_a ^j-1）の素性fN_a ^jが抽出される。PN₁（RN_e ^j-1,RN_a ^j-1）の素性fN_a ^jの例は、第１負例テキスト（負例エンティティRN_e ^j-1及び負例属性RN_a ^j-1を含む文字列であってテキストデータに含まれるもの）と当該第１負例エンティティRN_e ^j-1及び第１負例属性RN_a ^j-1との関係を表す情報である。その具体例は、上述した正例に対応するPP₁（RP_e ^j-1,RP_a ^j-1）の素性fP_a ^jの場合と同様である。 Similarly, the attribute identifying feature extraction unit 23a selects, from the text data set D stored in the storage unit 11a, a "first negative example text", which is a character string including both the negative example entity RN_e ^j-1 and the negative example attribute RN_a ^j-1 of PN₁(RN_e ^j-1, RN_a ^j-1). Examples of the first negative example text are sentences, phrases, and word strings included in the text data. One or more first negative example texts are extracted for each combination of a first negative example entity-negative example attribute pair PN₁(RN_e ^j-1, RN_a ^j-1) and text data.
The attribute identifying feature extraction unit 23a takes information representing the characteristics of the first negative example entity-negative example attribute pair PN₁(RN_e ^j-1, RN_a ^j-1) with respect to the first negative example text as the feature fN_a ^j of that pair. In this example, the feature fN_a ^j of PN₁(RN_e ^j-1, RN_a ^j-1) is extracted for each first negative example text. An example of the feature fN_a ^j of PN₁(RN_e ^j-1, RN_a ^j-1) is information representing the relationship between the first negative example text (a character string that includes the negative example entity RN_e ^j-1 and the negative example attribute RN_a ^j-1 and is contained in the text data) and the first negative example entity RN_e ^j-1 and first negative example attribute RN_a ^j-1. Specific examples are the same as for the feature fP_a ^j of PP₁(RP_e ^j-1, RP_a ^j-1) described above for the positive example.

属性識別用素性抽出部２３ａは、PP₁（RP_e ^j-1,RP_a ^j-1）の素性fP_a ^jと正例を表すラベル<+1>との組(fP_a ^j, <+1>)、及び、PN₁（RN_e ^j-1,RN_a ^j-1）の素性fN_a ^jと負例を表すラベル<-1>との組(fN_a ^j, <-1>)を出力する。
図８Ａは、属性識別用素性抽出部２３ａが出力する組(fP_a ^j, <+1>)及び組(fN_a ^j, <-1>)を例示した図である。この例では、エンティティ(ex)と属性(ey)の前後２単語の表記を素性としている。 The attribute identifying feature extraction unit 23a outputs the pair (fP_a ^j, <+1>) of the feature fP_a ^j of PP₁(RP_e ^j-1, RP_a ^j-1) and the label <+1> representing a positive example, and the pair (fN_a ^j, <-1>) of the feature fN_a ^j of PN₁(RN_e ^j-1, RN_a ^j-1) and the label <-1> representing a negative example.
FIG. 8A is a diagram illustrating a pair (fP _a ^j , <+1>) and a pair (fN _a ^j , <−1>) output by the attribute identifying feature extraction unit 23a. In this example, the notation of two words before and after the entity (ex) and the attribute (ey) is used as a feature.

《属性識別学習：ステップＳ２３》
PP₁（RP_e ^j-1,RP_a ^j-1）の素性fP_a ^jと正例を表すラベル<+1>との組(fP_a ^j, <+1>)、及び、PN₁（RN_e ^j-1,RN_a ^j-1）の素性fN_a ^jと負例を表すラベル<-1>との組(fN_a ^j, <-1>)が属性識別学習部２５ａに入力される。属性識別学習部２５ａは、PP₁（RP_e ^j-1,RP_a ^j-1）の素性fP_a ^jとPN₁（RN_e ^j-1,RN_a ^j-1）の素性fN_a ^jとを教師あり学習データとした学習処理によって、第１識別モデルME_a ^jを生成する。この第１識別モデルME_a ^jは、任意の文字列であるエンティティと当該エンティティの属性との組であるエンティティ−属性ペアの素性を入力として当該ペアが正例エンティティ−正例属性ペアか負例エンティティ−負例属性ペアかを識別するための情報を出力する関数である。このような識別モデルME_e ^jであればどのようなモデルであってもよい。例えば、前述の識別モデルME_e ^jと同様に第１識別モデルME_a ^jを生成すればよい。
学習処理によって生成された第１識別モデルME_a ^jは記憶部２１ｄに格納される。例えば、学習処理によって生成された第１識別モデルME_a ^jのパラメータが記憶部２１ｄに格納される。 << Attribute Identification Learning: Step S23 >>
The pair (fP_a ^j, <+1>) of the feature fP_a ^j of PP₁(RP_e ^j-1, RP_a ^j-1) and the label <+1> representing a positive example, and the pair (fN_a ^j, <-1>) of the feature fN_a ^j of PN₁(RN_e ^j-1, RN_a ^j-1) and the label <-1> representing a negative example, are input to the attribute identification learning unit 25a. The attribute identification learning unit 25a generates a first identification model ME_a ^j by a learning process that uses the features fP_a ^j of PP₁(RP_e ^j-1, RP_a ^j-1) and the features fN_a ^j of PN₁(RN_e ^j-1, RN_a ^j-1) as supervised learning data. The first identification model ME_a ^j is a function that takes as input the feature of an entity-attribute pair, which is a pair of an entity (an arbitrary character string) and an attribute of that entity, and outputs information for identifying whether the pair is a positive example entity-positive example attribute pair or a negative example entity-negative example attribute pair. Any model may be used as long as it is an identification model of this kind. For example, the first identification model ME_a ^j may be generated in the same manner as the identification model ME_e ^j described above.
The first identification model ME_a ^j generated by the learning process is stored in the storage unit 21d. For example, the parameters of the first identification model ME_a ^j generated by the learning process are stored in the storage unit 21d.
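Since the text leaves the choice of model open, the following sketch uses a plain binary logistic regression over string-valued features as a stand-in. That choice, the function names, the learning rate, and the number of epochs are all assumptions for illustration, not the learner prescribed by the specification. The learned dictionary plays the role of the stored model parameters, with one weight per feature.

```python
import math


def train_model(examples, epochs=200, lr=0.5):
    """examples: list of (set of feature strings, label) with label in {+1, -1}."""
    weights = {}
    for _ in range(epochs):
        for features, label in examples:
            score = sum(weights.get(f, 0.0) for f in features)
            # Derivative of the log-loss log(1 + exp(-label * score)) w.r.t. score.
            grad = -label / (1.0 + math.exp(label * score))
            for f in features:
                weights[f] = weights.get(f, 0.0) - lr * grad
    return weights


def predict(weights, features):
    """Return (label, probability that the pair is a positive example)."""
    score = sum(weights.get(f, 0.0) for f in features)
    prob_positive = 1.0 / (1.0 + math.exp(-score))
    return (1 if score >= 0 else -1), prob_positive


examples = [
    ({'ex+1="VS"', 'ey-1="は"'}, +1),
    ({'ex+1="の"', 'ey-1="の"'}, -1),
]
weights = train_model(examples)
label, prob = predict(weights, {'ex+1="VS"'})
```

Features seen only in positive examples end up with positive weights and those seen only in negative examples with negative weights, so the absolute value of a weight gives a simple measure of how strongly a feature influences the model.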

《属性識別：ステップＳ２４》
属性識別部２６ａは、記憶部１１ａに格納されたテキストデータの集合Dから何れかのテキストデータを選択し、選択した当該テキストデータが含む文字列を第１対象エンティティRD_e ^jとして選択する。また属性識別部２６ａは、選択した当該テキストデータから当該第１対象エンティティRD_e ^jと異なる文字列を第１対象属性RD_a ^jとして選択する。そして属性識別部２６ａは、第１対象エンティティRD_e ^jと第１対象属性RD_a ^jとの組を第１対象エンティティ−対象属性ペアPD₁（RD_e ^j,RD_a ^j）とする。 << Attribute Identification: Step S24 >>
The attribute identification unit 26a selects any text data from the set D of text data stored in the storage unit 11a, and selects a character string included in the selected text data as the first target entity RD _e ^j . The attribute identifying unit 26a selects a character string different from the first target entity RD _e ^j from the selected text data as the first target attribute RD _a ^j . Then, the attribute identifying unit 26a sets a pair of the first target entity RD _e ^j and the first target attribute RD _a ^j as the first target entity-target attribute pair PD ₁ (RD _e ^j , RD _a ^j ).

なお、テキストデータの集合Dからすべてのテキストデータが選択されてもよいが、すべてのテキストデータを対象とすることは計算効率上好ましくない。そのため、特定の方法で対象を限定して選択を行うことが望ましい。以下にその具体例を示す。 Note that all text data may be selected from the text data set D, but it is not preferable in terms of computational efficiency to target all text data. For this reason, it is desirable to select a target by a specific method. Specific examples are shown below.

[選択方法の例]
第１条件：
属性識別部２６ａは、何れかの正例エンティティRP^j-1 _e又は負例エンティティRN^j-1 _eを含み、かつ当該エンティティRP^j-1 _e又RN^j-1 _eから任意のウィンドウサイズ内（ここでは３単語とする）に名詞を含むテキストデータを選択し、当該ウィンドウサイズ内の名詞を属性候補とする。 [Example of selection method]
First condition:
The attribute identification unit 26a selects text data that includes any positive example entity RP_e ^j-1 or negative example entity RN_e ^j-1 and that includes a noun within an arbitrary window size (three words here) of that entity, and takes the nouns within the window as attribute candidates.

第２条件：
第１条件だけでは対象の数が膨大になる場合があるため、属性識別部２６ａは、属性識別学習部２５ａで教師あり学習データとして用いられたPP₁（RP_e ^j-1,RP_a ^j-1）の素性fP_a ^jとPN₁（RN_e ^j-1,RN_a ^j-1）の素性fN_a ^jのうち、それらから生成された第１識別モデルME_a ^jへの影響度の大きさを表す指標（例えば前述の重みλ_q）が特定の基準を満たす素性、つまり、当該第１識別モデルME_a ^jへの影響度が大きな素性fP_a ^j及び／又はfN_a ^jを選択する。例えば、属性識別部２６ａは、前述の重みλ_qの絶対値が閾値よりも大きな素性fP_a ^j及び／又はfN_a ^jを選択する。 Second condition:
Because the number of targets may become enormous under the first condition alone, the attribute identification unit 26a selects, from among the features fP_a ^j of PP₁(RP_e ^j-1, RP_a ^j-1) and the features fN_a ^j of PN₁(RN_e ^j-1, RN_a ^j-1) used as supervised learning data in the attribute identification learning unit 25a, those features for which an index representing the degree of influence on the first identification model ME_a ^j generated from them (for example, the aforementioned weight λ_q) satisfies a specific criterion, that is, features fP_a ^j and/or fN_a ^j that have a large influence on the first identification model ME_a ^j. For example, the attribute identification unit 26a selects the features fP_a ^j and/or fN_a ^j for which the absolute value of the aforementioned weight λ_q is larger than a threshold value.

属性識別部２６ａは、選択した素性fP_a ^j及び／又はfN_a ^jに対応する文字列を含むテキストデータを、第１条件で選択されたテキストデータの集合から選択する。属性識別部２６ａは、当該選択したテキストデータが含む文字列を第１対象エンティティRD_e ^j及び第１対象属性RD_a ^jとする。例えば、属性識別部２６ａは、選択した素性fP_a ^j及び／又はfN_a ^jから表層素性の単語を抽出し、当該表層素性の単語を含むテキストデータを第１条件で選択されたテキストデータの集合から選択し、当該選択したテキストデータが含む文字列を第１対象エンティティRD_e ^j及び第１対象属性RD_a ^jとする。 The attribute identifying unit 26a selects text data including a character string corresponding to the selected feature fP _a ^j and / or fN _a ^j from the set of text data selected under the first condition. The attribute identifying unit 26a sets the character string included in the selected text data as the first target entity RD _e ^j and the first target attribute RD _a ^j . For example, the attribute identifying unit 26a extracts a surface feature word from the selected feature fP _a ^j and / or fN _a ^j , and sets the text data including the surface feature word as a set of text data selected under the first condition. The character string included in the selected text data is set as the first target entity RD _e ^j and the first target attribute RD _a ^j .

一例を挙げると、選択された素性がエンティティexの前２単語が表層素性と品詞素性の組み合わせで成り立つ素性FNC(x−2=“POS:名詞”, x−1=“VS”)であった場合、属性識別部２６ａは、選択した素性FNC(x−2=“POS:名詞”, x−1=“VS”)から表層素性の単語“VS”を抽出し、第１条件で選択されたテキストデータの集合から、単語“VS”を含むテキストデータを選択する（[選択方法の例]の説明終わり）。 For example, the selected feature was a feature FNC (x-2 = “POS: noun”, x−1 = “VS”) in which the two words before entity ex consisted of a combination of surface features and part-of-speech features. In this case, the attribute identification unit 26a extracts the surface feature word “VS” from the selected feature FNC (x−2 = “POS: noun”, x−1 = “VS”), and is selected under the first condition. Select text data including the word “VS” from the set of text data (end of description of [example of selection method]).
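The two selection conditions can be sketched together as follows. The data layout is an assumption made for this example: texts are lists of (surface, part-of-speech) tuples, the tags "noun", "particle", and "symbol" are placeholder labels, and each model feature is represented as a hypothetical (kind, position, value) tuple mapped to its weight. The combined surface and part-of-speech feature of the example above is simplified here into separate atomic features.

```python
def first_condition(texts, known_entities, window=3):
    """Return {text index: nouns found within `window` words of a known entity}."""
    selected = {}
    for i, tokens in enumerate(texts):
        nouns = set()
        for pos, (surface, _tag) in enumerate(tokens):
            if surface not in known_entities:
                continue
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for j in range(lo, hi):
                word, tag = tokens[j]
                if j != pos and tag == "noun":
                    nouns.add(word)  # attribute candidate
        if nouns:
            selected[i] = nouns
    return selected


def second_condition(texts, selected, weights, threshold):
    """Keep only texts that contain the surface word of a high-|weight| feature."""
    strong_words = {
        value
        for (kind, _position, value), weight in weights.items()
        if kind == "surface" and abs(weight) > threshold
    }
    return {
        i: nouns
        for i, nouns in selected.items()
        if strong_words & {word for word, _tag in texts[i]}
    }


texts = [
    [("広島", "noun"), ("VS", "symbol"), ("阪神", "noun")],
    [("広島", "noun"), ("の", "particle"), ("天気", "noun")],
    [("今日", "noun"), ("は", "particle"), ("晴れ", "noun")],
]
weights = {
    ("surface", "ex+1", "VS"): 1.8,
    ("pos", "ex+1", "noun"): 2.0,
    ("surface", "ex+1", "の"): 0.1,
}
stage1 = first_condition(texts, {"広島"})
stage2 = second_condition(texts, stage1, weights, threshold=0.5)
```

Applying the cheap window test first and the weight-based filter second keeps the number of pairs that must go through feature extraction and classification small.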

属性識別用素性抽出部２３ａは、記憶部１１ａに格納されたテキストデータの集合Dから、第１対象エンティティRD_e ^jと第１対象属性RD_a ^jとの組を含む文字列である「第１対象テキスト」を選択する。第１対象テキストの例は、テキストデータが含む文、フレーズ、単語列などである。第１対象テキストは、第１対象エンティティ−対象属性ペアPD₁（RD_e ^j,RD_a ^j）とテキストデータとの組に対して1個以上抽出される。 The attribute identifying feature extraction unit 23a selects, from the text data set D stored in the storage unit 11a, a "first target text", which is a character string including both the first target entity RD_e ^j and the first target attribute RD_a ^j. Examples of the first target text are sentences, phrases, and word strings included in the text data. One or more first target texts are extracted for each combination of a first target entity-target attribute pair PD₁(RD_e ^j, RD_a ^j) and text data.

属性識別用素性抽出部２３ａは、第１対象テキストに対する第１対象エンティティ−対象属性ペアPD₁（RD_e ^j,RD_a ^j）の特徴を表す情報を当該第１対象エンティティ−対象属性ペアPD₁（RD_e ^j,RD_a ^j）の素性fD_a ^jとする。この例では、第１対象テキストごとにPD₁（RD_e ^j,RD_a ^j）の素性fD_a ^jが抽出される。PD₁（RD_e ^j,RD_a ^j）の素性fD_a ^jの例は、第１対象テキスト（第１対象エンティティRD_e ^j及び第１対象属性RD_a ^j-1を含む文字列であってテキストデータに含まれるもの）と第１対象エンティティRD_e ^j及び第１対象属性RD_a ^j-1との関係を表す情報である。その具体例は、上述した正例に対応するPP₁（RP_e ^j-1,RP_a ^j-1）の素性fP_a ^jの場合と同様である。 The attribute identifying feature extraction unit 23a takes information representing the characteristics of the first target entity-target attribute pair PD₁(RD_e ^j, RD_a ^j) with respect to the first target text as the feature fD_a ^j of that pair. In this example, the feature fD_a ^j of PD₁(RD_e ^j, RD_a ^j) is extracted for each first target text. An example of the feature fD_a ^j of PD₁(RD_e ^j, RD_a ^j) is information representing the relationship between the first target text (a character string that includes the first target entity RD_e ^j and the first target attribute RD_a ^j and is contained in the text data) and the first target entity RD_e ^j and first target attribute RD_a ^j. Specific examples are the same as for the feature fP_a ^j of PP₁(RP_e ^j-1, RP_a ^j-1) described above for the positive example.

第１対象テキストに対応するPD₁（RD_e ^j,RD_a ^j）の素性fD_a ^jは、属性識別部２６ａに入力される。属性識別部２６ａは、PD₁（RD_e ^j,RD_a ^j）の素性fD_a ^jを記憶部２１ｄから読み出した第１識別モデルME_a ^jに入力し、PD₁（RD_e ^j,RD_a ^j）が正例エンティティ−正例属性ペアか負例エンティティ−負例属性ペアかを識別する。
ここで、属性識別部２６ａは、PD₁（RD_e ^j,RD_a ^j）を正例エンティティ−正例属性ペアであると識別した場合、当該PD₁（RD_e ^j,RD_a ^j）の第１対象属性RD_a ^jを正例属性RP_a ^jとして記憶部２１ｅに格納し、正例属性RP_a ^jの集合に追加する。また、属性識別部２６ａは、PD₁（RD_e ^j,RD_a ^j）が負例エンティティ−負例属性ペアであると識別した場合、当該PD₁（RD_e ^j,RD_a ^j）の第１対象属性RD_a ^jを負例属性RN_a ^jとして記憶部２１ｅに格納し、負例属性RN_a ^jの集合に追加する。すなわち、ステップＳ２２−Ｓ２４では正例及び負例エンティティの更新は行われず、正例及び負例属性の更新のみが行われる。 The feature fD _a ^j of PD ₁ (RD _e ^j , RD _a ^j ) corresponding to the first target text is input to the attribute identifying unit 26a. The attribute identification unit 26a inputs the feature fD _a ^j of PD ₁ (RD _e ^j , RD _a ^j ) to the first identification model ME _a ^j read from the storage unit 21d, and PD ₁ (RD _e ^j , RD _a ^j ) Identifies a positive entity-positive example attribute pair or a negative example entity-negative example attribute pair.
Here, when the attribute identification unit 26a identifies PD₁(RD_e ^j, RD_a ^j) as a positive example entity-positive example attribute pair, it stores the first target attribute RD_a ^j of that PD₁(RD_e ^j, RD_a ^j) in the storage unit 21e as a positive example attribute RP_a ^j, adding it to the set of positive example attributes RP_a ^j. When the attribute identification unit 26a identifies PD₁(RD_e ^j, RD_a ^j) as a negative example entity-negative example attribute pair, it stores the first target attribute RD_a ^j of that PD₁(RD_e ^j, RD_a ^j) in the storage unit 21e as a negative example attribute RN_a ^j, adding it to the set of negative example attributes RN_a ^j. That is, in steps S22 to S24, the positive and negative example entities are not updated; only the positive and negative example attributes are updated.

例えば、属性識別部２６ａが図４のテキストデータの集合Dから、T10のテキストデータを選択し、当該テキストデータが含む単語<広島>を第１対象エンティティRD_e ^jとし、単語<戦>を第１対象属性RD_a ^jとして選択したとする。この場合、属性識別用素性抽出部２３ａは、例えば、<広島>と<戦>との組を含むT10のテキストデータを第１対象テキストとし、T10のテキストデータに対するPD₁（RD_e ^j,RD_a ^j）="<広島>−<戦>"の素性fD_a ^jを抽出する。属性識別部２６ａは、PD₁（RD_e ^j,RD_a ^j）="<広島>−<戦>"の素性fD_a ^jを第１識別モデルME_a ^jに入力し、PD₁（RD_e ^j,RD_a ^j）が正例エンティティ−正例属性ペアであるか負例エンティティ−負例属性ペアであるかが識別される。例えば、"<広島>−<戦>"が正例エンティティ−正例属性ペアであると識別したとすると、<戦>という属性が正例属性RP_a ^jの集合に追加される。なお、正例又は負例と識別されたPD₁（RD_e ^j,RD_a ^j）のうち、閾値を超える信頼度が付与されたものの第１対象属性RD_a ^jのみを、正例属性RP_a ^j又は負例属性RN_a ^jの集合に追加してもよい。上述の例では｛<VS>，<第１戦>，<投手>，<戦>｝が正例属性RP^j _aの集合に追加される。 For example, suppose that the attribute identification unit 26a selects the text data of T10 from the text data set D of FIG. 4, takes the word <広島> (Hiroshima) included in that text data as the first target entity RD_e ^j, and selects the word <戦> (game) as the first target attribute RD_a ^j. In this case, the attribute identifying feature extraction unit 23a takes, for example, the text data of T10, which includes both <広島> and <戦>, as the first target text, and extracts the feature fD_a ^j of PD₁(RD_e ^j, RD_a ^j) = "<広島>-<戦>" for the text data of T10. The attribute identification unit 26a inputs the feature fD_a ^j of PD₁(RD_e ^j, RD_a ^j) = "<広島>-<戦>" into the first identification model ME_a ^j and identifies whether PD₁(RD_e ^j, RD_a ^j) is a positive example entity-positive example attribute pair or a negative example entity-negative example attribute pair. For example, if "<広島>-<戦>" is identified as a positive example entity-positive example attribute pair, the attribute <戦> is added to the set of positive example attributes RP_a ^j. Note that, among the pairs PD₁(RD_e ^j, RD_a ^j) identified as positive or negative examples, only the first target attributes RD_a ^j of those pairs assigned a reliability exceeding a threshold may be added to the set of positive example attributes RP_a ^j or negative example attributes RN_a ^j. In the above example, the set of positive example attributes RP_a ^j becomes {<VS>, <第１戦> (first game), <投手> (pitcher), <戦> (game)}.
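The attribute update at the end of step S24, including the optional reliability threshold, might look like the sketch below. It assumes each target pair has already been scored with a probability of being a positive example. The function name, the 0.5 decision boundary, the use of the larger class probability as the reliability, and the 0.8 threshold are hypothetical choices, not values from the specification.

```python
def update_attribute_sets(scored_pairs, pos_attrs, neg_attrs, min_confidence=0.8):
    """scored_pairs: list of ((entity, attribute), probability of being positive)."""
    new_pos, new_neg = set(pos_attrs), set(neg_attrs)
    for (_entity, attribute), prob_positive in scored_pairs:
        if prob_positive >= 0.5:
            label, confidence = +1, prob_positive
        else:
            label, confidence = -1, 1.0 - prob_positive
        if confidence <= min_confidence:
            continue  # reliability does not exceed the threshold: ignore this pair
        # Only the attribute sets change; the entity sets are left untouched here.
        (new_pos if label > 0 else new_neg).add(attribute)
    return new_pos, new_neg


scored = [
    (("広島", "戦"), 0.93),
    (("東京", "面積"), 0.05),
    (("広島", "今日"), 0.60),
]
pos, neg = update_attribute_sets(scored, {"VS", "第１戦", "投手"}, {"人口"})
```

With these inputs <戦> joins the positive example attributes, <面積> joins the negative example attributes, and <今日> is dropped because its reliability is too low.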

《エンティティ識別用素性抽出：ステップＳ２５》
正例エンティティRP_e ^j-1の集合、負例エンティティRN_e ^j-1の集合、上記のように更新された正例属性RP_a ^jの集合及び負例属性RN_a ^jの集合がエンティティ識別用素性抽出部２３ｂに入力される。
エンティティ識別用素性抽出部２３ｂは、正例エンティティRP_e ^j-1の集合から選択した第２正例エンティティと正例属性RP_a ^jの集合から選択した第２正例属性との組である第２正例エンティティ−正例属性ペアPP₂（RP_e ^j-1,RP_a ^j）と、負例エンティティRN_e ^j-1の集合から選択した第２負例エンティティと負例属性RN_a ^jの集合から選択した第２負例属性との組である第２負例エンティティ−負例属性ペアPN₂（RN_e ^j-1,RN_a ^j）とを生成する。PP₂（RP_e ^j-1,RP_a ^j）やPN₂（RN_e ^j-1,RN_a ^j）は、RP_e ^j-1とRP_a ^jやRN_e ^j-1とRN_a ^jの採り得るすべての組み合わせについて生成されてもよいし、それらの一部の組み合わせのみについて生成されてもよい。
次にエンティティ識別用素性抽出部２３ｂは、記憶部１１ａに格納されたテキストデータの集合Dから、PP₂（RP_e ^j-1,RP_a ^j）の正例エンティティRP_e ^j-1と正例属性RP_a ^jとの組を含む文字列である「第２正例テキスト」を選択する。第２正例テキストの例は、テキストデータが含む文、フレーズ、単語列などである。第２正例テキストは、第２正例エンティティ−正例属性ペアPP₂（RP_e ^j-1,RP_a ^j）とテキストデータとの組に対して１個以上抽出される。
エンティティ識別用素性抽出部２３ｂは、第２正例テキストに対する第２正例エンティティ−正例属性ペアPP₂（RP_e ^j-1,RP_a ^j）の特徴を表す情報を当該第２正例エンティティ−正例属性ペアPP₂（RP_e ^j-1,RP_a ^j）の素性fP_e ^jとする。この例では、第２正例テキストごとにPP₂（RP_e ^j-1,RP_a ^j）の素性fP_e ^jが抽出される。PP₂（RP_e ^j-1,RP_a ^j）の素性fP_e ^jの例は、第２正例テキスト（正例エンティティRP_e ^j-1及び正例属性RP_a ^jを含む文字列であってテキストデータに含まれるもの）と当該第２正例エンティティRP_e ^j-1及び第２正例属性RP_a ^jとの関係を表す情報である。その具体例は、前述（ステップＳ２２）したPP₁（RP_e ^j-1,RP_a ^j-1）の素性fP_a ^jの場合と同様である。 << Entity Identification Feature Extraction: Step S25 >>
The set of positive example entities RP_e ^j-1, the set of negative example entities RN_e ^j-1, and the set of positive example attributes RP_a ^j and set of negative example attributes RN_a ^j updated as described above are input to the entity identifying feature extraction unit 23b.
The entity identifying feature extraction unit 23b generates a second positive example entity-positive example attribute pair PP₂(RP_e ^j-1, RP_a ^j), which is a pair of a second positive example entity selected from the set of positive example entities RP_e ^j-1 and a second positive example attribute selected from the set of positive example attributes RP_a ^j, and a second negative example entity-negative example attribute pair PN₂(RN_e ^j-1, RN_a ^j), which is a pair of a second negative example entity selected from the set of negative example entities RN_e ^j-1 and a second negative example attribute selected from the set of negative example attributes RN_a ^j. PP₂(RP_e ^j-1, RP_a ^j) and PN₂(RN_e ^j-1, RN_a ^j) may be generated for all possible combinations of RP_e ^j-1 with RP_a ^j and of RN_e ^j-1 with RN_a ^j, or for only some of those combinations.
Next, the entity identifying feature extraction unit 23b selects, from the text data set D stored in the storage unit 11a, a "second positive example text", which is a character string including both the positive example entity RP_e ^j-1 and the positive example attribute RP_a ^j of PP₂(RP_e ^j-1, RP_a ^j). Examples of the second positive example text are sentences, phrases, and word strings included in the text data. One or more second positive example texts are extracted for each combination of a second positive example entity-positive example attribute pair PP₂(RP_e ^j-1, RP_a ^j) and text data.
The entity identifying feature extraction unit 23b takes information representing the characteristics of the second positive example entity-positive example attribute pair PP₂(RP_e ^j-1, RP_a ^j) with respect to the second positive example text as the feature fP_e ^j of that pair. In this example, the feature fP_e ^j of PP₂(RP_e ^j-1, RP_a ^j) is extracted for each second positive example text. An example of the feature fP_e ^j of PP₂(RP_e ^j-1, RP_a ^j) is information representing the relationship between the second positive example text (a character string that includes the positive example entity RP_e ^j-1 and the positive example attribute RP_a ^j and is contained in the text data) and the second positive example entity RP_e ^j-1 and second positive example attribute RP_a ^j. Specific examples are the same as for the feature fP_a ^j of PP₁(RP_e ^j-1, RP_a ^j-1) described above (step S22).

同様に、エンティティ識別用素性抽出部２３ｂは、記憶部１１ａに格納されたテキストデータの集合Dから、何れかの負例エンティティRN_e ^j-1と負例属性RN_a ^jとの組を含む文字列である「第２負例テキスト」を選択する。第２負例テキストの例は、テキストデータが含む文、フレーズ、単語列などである。第２負例テキストは、第２負例エンティティ−負例属性ペアPN₂（RN_e ^j-1,RN_a ^j）とテキストデータとの組に対して１個以上抽出される。
エンティティ識別用素性抽出部２３ｂは、第２負例テキストに対する第２負例エンティティ−負例属性ペアPN₂（RN_e ^j-1,RN_a ^j）の特徴を表す情報を当該第２負例エンティティ−負例属性ペアPN₂（RN_e ^j-1,RN_a ^j）の素性fN_e ^jとする。この例では、第２負例テキストごとにPN₂（RN_e ^j-1,RN_a ^j）の素性fN_e ^jが抽出される。PN₂（RN_e ^j-1,RN_a ^j）の素性fN_e ^jの例は、第２負例テキスト（負例エンティティRN_e ^j-1及び負例属性RN_a ^jを含む文字列であってテキストデータに含まれるもの）と当該第２負例エンティティRN_e ^j-1及び第２負例属性RN_a ^jとの関係を表す情報である。その具体例は、前述（ステップＳ２２）したPP₁（RP_e ^j-1,RP_a ^j-1）の素性fP_a ^jの場合と同様である。 Similarly, the entity identifying feature extraction unit 23b selects, from the text data set D stored in the storage unit 11a, a "second negative example text", which is a character string including both any negative example entity RN_e ^j-1 and a negative example attribute RN_a ^j. Examples of the second negative example text are sentences, phrases, and word strings included in the text data. One or more second negative example texts are extracted for each combination of a second negative example entity-negative example attribute pair PN₂(RN_e ^j-1, RN_a ^j) and text data.
The entity identifying feature extraction unit 23b takes information representing the characteristics of the second negative example entity-negative example attribute pair PN₂(RN_e ^j-1, RN_a ^j) with respect to the second negative example text as the feature fN_e ^j of that pair. In this example, the feature fN_e ^j of PN₂(RN_e ^j-1, RN_a ^j) is extracted for each second negative example text. An example of the feature fN_e ^j of PN₂(RN_e ^j-1, RN_a ^j) is information representing the relationship between the second negative example text (a character string that includes the negative example entity RN_e ^j-1 and the negative example attribute RN_a ^j and is contained in the text data) and the second negative example entity RN_e ^j-1 and second negative example attribute RN_a ^j. Specific examples are the same as for the feature fP_a ^j of PP₁(RP_e ^j-1, RP_a ^j-1) described above (step S22).

エンティティ識別用素性抽出部２３ｂは、PP₂（RP_e ^j-1,RP_a ^j）の素性fP_e ^jと正例を表すラベル<+1>との組(fP_e ^j, <+1>)、及び、PN₂（RN_e ^j-1,RN_a ^j）の素性fN_e ^jと負例を表すラベル<-1>との組(fN_e ^j, <-1>)を出力する。 The entity identifying feature extraction unit 23b outputs the pair (fP_e ^j, <+1>) of the feature fP_e ^j of PP₂(RP_e ^j-1, RP_a ^j) and the label <+1> representing a positive example, and the pair (fN_e ^j, <-1>) of the feature fN_e ^j of PN₂(RN_e ^j-1, RN_a ^j) and the label <-1> representing a negative example.

図８Ｂは、エンティティ識別用素性抽出部２３ｂが出力する組(fP_e ^j, <+1>)及び組(fN_e ^j, <-1>)を例示した図である。この例では、エンティティ(ex)と属性(ey)の前後２単語の表記を素性としている。
《エンティティ識別学習：ステップＳ２６》
PP₂（RP_e ^j-1,RP_a ^j）の素性fP_e ^jと正例を表すラベル<+1>との組(fP_e ^j, <+1>)、及び、PN₂（RN_e ^j-1,RN_a ^j）の素性fN_e ^jと負例を表すラベル<-1>との組(fN_e ^j, <-1>)がエンティティ識別学習部２５ｂに入力される。エンティティ識別学習部２５ｂは、PP₂（RP_e ^j-1,RP_a ^j）の素性fP_e ^jとPN₂（RN_e ^j-1,RN_a ^j）の素性fN_e ^jとを教師あり学習データとした学習処理によって、第２識別モデルME_e ^jを生成する。この２識別モデルME_e ^jは、任意の文字列であるエンティティと当該エンティティの属性との組であるエンティティ−属性ペアの素性を入力として当該ペアが正例エンティティ−正例属性ペアか負例エンティティ−負例属性ペアかを識別するための情報を出力する関数である。このような第２識別モデルME_e ^jであればどのようなモデルであってもよい。例えば、前述の識別モデルME_e ^jと同様に第２識別モデルME_e ^jを生成すればよい。
学習処理によって生成された第２識別モデルME_e ^jは記憶部１１ｄに格納される。例えば、学習処理によって生成された第２識別モデルME_e ^jのパラメータが記憶部１１ｄに格納される。 FIG. 8B is a diagram illustrating a pair (fP _e ^j , <+1>) and a pair (fN _e ^j , <-1>) output by the entity identifying feature extraction unit 23b. In this example, the notation of two words before and after the entity (ex) and the attribute (ey) is used as a feature.
<< Entity Identification Learning: Step S26 >>
The pair (fP_e ^j, <+1>) of the feature fP_e ^j of PP₂(RP_e ^j-1, RP_a ^j) and the label <+1> representing a positive example, and the pair (fN_e ^j, <-1>) of the feature fN_e ^j of PN₂(RN_e ^j-1, RN_a ^j) and the label <-1> representing a negative example, are input to the entity identification learning unit 25b. The entity identification learning unit 25b generates a second identification model ME_e ^j by a learning process that uses the features fP_e ^j of PP₂(RP_e ^j-1, RP_a ^j) and the features fN_e ^j of PN₂(RN_e ^j-1, RN_a ^j) as supervised learning data. The second identification model ME_e ^j is a function that takes as input the feature of an entity-attribute pair, which is a pair of an entity (an arbitrary character string) and an attribute of that entity, and outputs information for identifying whether the pair is a positive example entity-positive example attribute pair or a negative example entity-negative example attribute pair. Any model may be used as long as it is a second identification model ME_e ^j of this kind. For example, the second identification model ME_e ^j may be generated in the same manner as the identification model ME_e ^j described above.
The second identification model ME_e ^j generated by the learning process is stored in the storage unit 11d. For example, the parameters of the second identification model ME_e ^j generated by the learning process are stored in the storage unit 11d.

《エンティティ識別：ステップＳ２７》
エンティティ識別部２６ｂは、記憶部１１ａに格納されたテキストデータの集合Dから何れかのテキストデータを選択し、選択した当該テキストデータが含む文字列を第２対象エンティティRD_e ^jとして選択する。またエンティティ識別部２６ｂは、選択した当該テキストデータから当該第２対象エンティティRD_e ^jと異なる文字列を第２対象属性RD_a ^jとして選択する。そしてエンティティ識別部２６ｂは、第２対象エンティティRD_e ^jと第２対象属性RD_a ^jとの組を第２対象エンティティ−対象属性ペアPD₂（RD_e ^j,RD_a ^j）とする。 << Entity Identification: Step S27 >>
The entity identification unit 26b selects any text data from the set D of text data stored in the storage unit 11a, and selects a character string included in the selected text data as the second target entity RD_e ^j. The entity identification unit 26b also selects, from the selected text data, a character string different from the second target entity RD_e ^j as the second target attribute RD_a ^j. The entity identification unit 26b then takes the pair of the second target entity RD_e ^j and the second target attribute RD_a ^j as the second target entity-target attribute pair PD₂(RD_e ^j, RD_a ^j).

[Example of selection method]
First condition:
The entity identification unit 26b extracts text data that contains any positive example attribute RP_a^j or negative example attribute RN_a^j and that also contains a noun within a given window size (three words in this example) of that attribute RP_a^j or RN_a^j. The nouns within the window are taken as entity candidates.
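The first condition can be sketched as follows. This is a minimal illustration that assumes the text has already been tokenized and part-of-speech tagged, that the window extends three tokens on both sides of the attribute, and that an attribute word is never its own entity candidate. The function name, the tag label "NOUN", and the sample sentence are hypothetical.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag); the tag set is an assumption


def entity_candidates(
    sentence: List[Token],
    attributes: Set[str],
    window: int = 3,
    noun_tag: str = "NOUN",
) -> Set[str]:
    """Return the nouns found within `window` tokens of any attribute occurrence."""
    candidates: Set[str] = set()
    for i, (surface, _tag) in enumerate(sentence):
        if surface not in attributes:
            continue
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            word, tag = sentence[j]
            # Skip the attribute itself; keep only nouns inside the window.
            if j != i and tag == noun_tag and word not in attributes:
                candidates.add(word)
    return candidates


sentence = [("Hanshin", "NOUN"), ("won", "VERB"), ("the", "DET"),
            ("game", "NOUN"), ("yesterday", "ADV")]
print(entity_candidates(sentence, {"game"}))  # prints {'Hanshin'}
```

A sentence with no attribute occurrence yields an empty set, so it contributes no candidates.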

Second condition:
Because the first condition alone may yield an enormous number of targets, the entity identification unit 26b narrows them down. Among the features fP_e^j of PP₂(RP_e^{j-1}, RP_a^j) and the features fN_e^j of PN₂(RN_e^{j-1}, RN_a^j) that the entity identification learning unit 25b used as supervised learning data, it selects those for which an index of the degree of influence on the resulting second identification model ME_e^j (for example, the weight λ_q described above) satisfies a specific criterion, that is, the features fP_e^j and/or fN_e^j that have a large influence on the second identification model ME_e^j. For example, the entity identification unit 26b selects the features fP_e^j and/or fN_e^j for which the absolute value of the weight λ_q exceeds a threshold.

The entity identification unit 26b selects, from the set of text data selected under the first condition, the text data that contains a character string corresponding to the selected features fP_e^j and/or fN_e^j. The entity identification unit 26b takes character strings contained in the selected text data as the second target entity RD_e^j and the second target attribute RD_a^j. For example, the entity identification unit 26b extracts the words of surface features from the selected features fP_e^j and/or fN_e^j and selects, from the set of text data selected under the first condition, the text data that contains those words. (End of the description of [Example of selection method].)
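The second condition can be sketched as a two-stage filter: keep the features whose weight magnitude exceeds a threshold, then keep the texts that contain the surface word of any such feature. The sketch below assumes that surface features are encoded as strings of the form "surface=<word>" and that texts are plain strings; both the encoding and the function names are hypothetical.

```python
from typing import Dict, List


def influential_features(weights: Dict[str, float], threshold: float) -> List[str]:
    """Keep the features whose weight magnitude |lambda_q| exceeds the threshold."""
    return sorted(f for f, w in weights.items() if abs(w) > threshold)


def surface_words(features: List[str], prefix: str = "surface=") -> List[str]:
    """Extract the word from features encoded as 'surface=<word>' (assumed encoding)."""
    return [f[len(prefix):] for f in features if f.startswith(prefix)]


def filter_texts(texts: List[str], words: List[str]) -> List[str]:
    """Keep the texts that contain at least one of the given words."""
    return [t for t in texts if any(w in t for w in words)]


# Hypothetical model weights and first-condition texts.
weights = {"surface=pitcher": 1.8, "surface=timetable": -1.2,
           "surface=the": 0.05, "pos=NOUN": 0.9}
texts = ["Hanshin pitcher wins the game",
         "Hanshin has a new logo",
         "Hanshin timetable for the game"]

selected = influential_features(weights, threshold=1.0)
words = surface_words(selected)
print(filter_texts(texts, words))
```

Both strongly positive and strongly negative features pass the filter, which matches the use of the absolute value of the weight.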

The entity identification feature extraction unit 23b selects, from the set D of text data stored in the storage unit 11a, a "second target text", which is a character string containing the pair of the second target entity RD_e^j and the second target attribute RD_a^j. Examples of the second target text are sentences, phrases, and word strings contained in the text data. One or more second target texts are extracted for each combination of a second target entity-target attribute pair PD₂(RD_e^j, RD_a^j) and a piece of text data.

The entity identification feature extraction unit 23b takes information representing the characteristics of the second target entity-target attribute pair PD₂(RD_e^j, RD_a^j) with respect to the second target text as the feature fD_e^j of that pair. In this example, a feature fD_e^j of PD₂(RD_e^j, RD_a^j) is extracted for each second target text. An example of the feature fD_e^j of PD₂(RD_e^j, RD_a^j) is information representing the relationship between the second target text (a character string that contains the second target entity RD_e^j and the second target attribute RD_a^{j-1} and is included in the text data) and the second target entity RD_e^j and second target attribute RD_a^{j-1}. Specific examples are the same as for the feature fP_a^j of PP₁(RP_e^{j-1}, RP_a^{j-1}) described above (step S22).

The feature fD_e^j of PD₂(RD_e^j, RD_a^j) corresponding to the second target text is input to the entity identification unit 26b. The entity identification unit 26b inputs the feature fD_e^j of PD₂(RD_e^j, RD_a^j) into the second identification model ME_e^j read from the storage unit 11d and identifies whether PD₂(RD_e^j, RD_a^j) is a positive example entity-positive example attribute pair or a negative example entity-negative example attribute pair.

When the entity identification unit 26b identifies PD₂(RD_e^j, RD_a^j) as a positive example entity-positive example attribute pair, it stores the second target entity RD_e^j of that PD₂(RD_e^j, RD_a^j) in the storage unit 11e as a positive example entity RP_e^j, adding it to the set of positive example entities RP_e^j. When the entity identification unit 26b identifies PD₂(RD_e^j, RD_a^j) as a negative example entity-negative example attribute pair, it stores the second target entity RD_e^j of that PD₂(RD_e^j, RD_a^j) in the storage unit 11e as a negative example entity RN_e^j, adding it to the set of negative example entities RN_e^j. That is, steps S25 to S27 update only the positive and negative example entities; the positive and negative example attributes are not updated.
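The identification and set update described above can be sketched as follows. The sketch assumes that the model exposes a scoring function whose sign separates positive pairs from negative pairs; the function name, the toy linear scorer, and the sample pairs are hypothetical. Note that only the entity of each pair is stored, never the attribute.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]  # (target entity, target attribute)
Features = Dict[str, float]


def update_entities(
    pairs: Iterable[Tuple[Pair, Features]],
    score: Callable[[Features], float],
    positives: Set[str],
    negatives: Set[str],
) -> None:
    """Add each pair's entity to the positive or negative set by the sign of its score."""
    for (entity, _attribute), features in pairs:
        if score(features) > 0:   # assumed decision rule
            positives.add(entity)
        else:
            negatives.add(entity)


# Hypothetical linear model and target pairs.
weights = {"attr=game": 1.0, "attr=crew": -1.0}


def linear_score(features: Features) -> float:
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


pairs = [(("Yakult", "game"), {"attr=game": 1.0}),
         (("Keihan", "crew"), {"attr=crew": 1.0})]
positive_entities = {"Hanshin"}
negative_entities: Set[str] = set()
update_entities(pairs, linear_score, positive_entities, negative_entities)
print(sorted(positive_entities))  # prints ['Hanshin', 'Yakult']
print(sorted(negative_entities))  # prints ['Keihan']
```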

<< Convergence determination: steps S17 to S19 >>
The convergence determination unit 17 determines whether the convergence condition is satisfied, as in the first embodiment (step S17).
When the convergence determination unit 17 determines that the convergence condition is satisfied, the iteration of steps S22 to S27 ends, and the output unit 18 outputs all the positive example entities RP_e^j stored in the storage unit 11e and ends the processing (step S19). Otherwise, the control unit 19 sets j + 1 as the new value of j (step S18), the positive example entities RP_e^j and negative example entities RN_e^j stored in the storage unit 11e and the positive example attributes RP_a^j and negative example attributes RN_a^j stored in the storage unit 21e are input to the attribute identification feature extraction unit 23a, and the iteration of steps S22 to S27 is executed again.
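The overall control flow of steps S22 to S27 with the convergence check can be sketched as a loop that alternates an attribute update and an entity update. The convergence condition is defined in the first embodiment and is not restated here, so the sketch assumes one: stop when an entity update adds no new positive entity or when a maximum number of iterations is reached. The function names and the toy update functions are hypothetical.

```python
from typing import Callable, Set, Tuple

# (RP_e, RN_e, RP_a, RN_a): positive/negative entities and positive/negative attributes
State = Tuple[Set[str], Set[str], Set[str], Set[str]]


def run_iterations(
    state: State,
    update_attributes: Callable[[State], State],
    update_entities: Callable[[State], State],
    max_iterations: int = 10,
) -> Set[str]:
    """Alternate attribute and entity updates, then return the positive entities."""
    j = 1
    while True:
        state = update_attributes(state)       # steps S22-S24
        previous_positive = set(state[0])
        state = update_entities(state)         # steps S25-S27
        # Step S17: this convergence test is an assumption, not the patent's definition.
        if state[0] == previous_positive or j >= max_iterations:
            return state[0]                    # step S19
        j += 1                                 # step S18


def toy_update_attributes(s: State) -> State:
    rp_e, rn_e, rp_a, rn_a = s
    return rp_e, rn_e, rp_a | {"game"}, rn_a


def toy_update_entities(s: State) -> State:
    rp_e, rn_e, rp_a, rn_a = s
    found = {"Yakult"} if "game" in rp_a else set()
    return rp_e | found, rn_e, rp_a, rn_a


seeds: State = ({"Hanshin"}, set(), set(), set())
print(sorted(run_iterations(seeds, toy_update_attributes, toy_update_entities)))
# prints ['Hanshin', 'Yakult']
```

Swapping the two update calls gives the alternative ordering in which entities are updated before attributes.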

<Features of Second Embodiment>
As described above, the method of this embodiment performs identification using pairs of an entity and its attribute, which suppresses semantic drift. For example, the entity <Hanshin> is ambiguous: from the features of the entity <Hanshin> alone, it is not possible to tell whether <Hanshin> refers to a railway name or a team name. However, if <Hanshin>-<Game> or <Hanshin>-<Crew>, formed by adding the attribute <Game> or <Crew>, is used as a constraint, it becomes possible to identify that <Hanshin> is used in a different sense in each case.

In addition, this embodiment uses a co-training scheme, which enables highly accurate identification. The description above showed an example in which the positive and negative example attributes are updated (steps S22 to S24) and then the positive and negative example entities are updated (steps S25 to S27). However, the positive and negative example attributes may instead be updated after the positive and negative example entities.

Note that espresso is known as a relation extraction technique that handles entity-attribute pairs (Reference 4: Patrick Pantel and Marco Pennacchiotti, "Espresso: Leveraging Generic Patterns for Automatically Harvesting Semantic Relations", COLING-ACL, 2006). Because the purpose of espresso is to acquire entity-attribute pairs, entity-attribute pairs must be given in advance as positive and negative examples. In contrast, this embodiment uses attributes in order to acquire entities, so only entities need to be given as initial values.

Furthermore, espresso consists of an iteration that computes the reliability of entity-attribute pairs and an iteration that computes the reliability of features, whereas this embodiment consists of an iteration that computes the reliability of entities and an iteration that computes the reliability of attributes. What is wanted here is the entities alone; attribute information is obtained only as a by-product. In other words, attribute coverage need not be high: it suffices to use only attributes that are sufficiently reliable and sufficient in number to suppress semantic drift. For the purpose of this embodiment, a method that directly evaluates the reliability of entities and of attributes separately is more appropriate than one that, like espresso, obtains reliability for pairs.

Moreover, when espresso is used to acquire 100 new entity-attribute pairs, there is no way to control how many new entities and how many new attributes those pairs contain. For example, an undesirable outcome such as 1 entity × 100 attributes can occur. In the method of this embodiment, the iteration that computes entity reliability and the iteration that computes attribute reliability are executed separately, so the number of entities and the number of attributes can each be controlled freely. For example, fine-grained control such as 100 entities and 10 attributes is possible.
In addition, as in the first embodiment, the method of this embodiment can be used regardless of the type of text data serving as the resource, and therefore has a wide range of application.
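One way to picture the separate control of entity and attribute counts is to keep a reliability score per entity and per attribute and cut each list independently. The text does not specify how the counts are enforced, so the top-k cutoff, the function name, and the scores below are assumptions made for illustration only.

```python
from typing import Dict, List


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring names, breaking ties alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _score in ranked[:k]]


# Hypothetical reliability scores from the two separate iterations.
entity_scores = {"Yakult": 0.9, "Hiroshima": 0.8, "Keihan": 0.2}
attribute_scores = {"game": 0.95, "pitcher": 0.7, "crew": 0.1}

print(top_k(entity_scores, 2))     # prints ['Yakult', 'Hiroshima']
print(top_k(attribute_scores, 1))  # prints ['game']
```

Because the two cutoffs are independent, a request for many entities and few attributes cannot degenerate into a single entity paired with many attributes.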

[Third Embodiment]
The third embodiment combines the first and second embodiments. That is, both topic information and attributes are used for learning the identification models and for identification by those models. The following description focuses on the differences from the first and second embodiments. Parts in common with the first and second embodiments are given the same reference numerals as in those embodiments.

<Configuration>
FIG. 9 is a block diagram illustrating the functional configuration of the data extraction device 3 of the third embodiment.
As illustrated in FIG. 9, the data extraction device 3 includes storage units 11a-11e, 21d, and 21e, an initial attribute set generation unit 22, an attribute identification feature extraction unit 23a, an entity identification feature extraction unit 23b, topic information extraction units 34a and 34b, an attribute identification learning unit 35a, an entity identification learning unit 35b, an attribute identification unit 36a, an entity identification unit 36b, a convergence determination unit 17, an output unit 18, and a control unit 19, and executes each process under the control of the control unit 19. The data extraction device 3 is, for example, a special device configured by loading a special program into a known or dedicated computer.

<Pre-processing>
This is the same as in the first embodiment.
<Data extraction process>
FIG. 10 is a diagram illustrating the data extraction process of the data extraction device 3 of the third embodiment.
First, the same processing as steps S11, S12, S21, and S22 of the first and second embodiments is executed.

<< Topic Information Extraction: Step S321 >>
The pair (fP_a^j, <+1>) of the feature fP_a^j of PP₁(RP_e^{j-1}, RP_a^{j-1}) generated in step S22 and the label <+1> representing a positive example, and the pair (fN_a^j, <-1>) of the feature fN_a^j of PN₁(RN_e^{j-1}, RN_a^{j-1}) and the label <-1> representing a negative example, are input to the topic information extraction unit 34a. To avoid confusion, these are written below as the pair (fP''_a^j, <+1>) and the pair (fN''_a^j, <-1>).

The topic information extraction unit 34a extracts, by the same processing as step S14 described above, the first positive example topic information corresponding to the text data that contains the pair of the first positive example entity RP_e^{j-1} and the first positive example attribute RP_a^{j-1}. The topic information extraction unit 34a adds the first positive example topic information to the feature fP''_a^j of PP₁(RP_e^{j-1}, RP_a^{j-1}) corresponding to each first positive example text contained in that text data, and takes the result as the new feature fP_a^j of each PP₁(RP_e^{j-1}, RP_a^{j-1}) corresponding to each first positive example text. That is, the feature fP_a^j of PP₁(RP_e^{j-1}, RP_a^{j-1}) generated by the topic information extraction unit 34a includes the topic information contained in the text data with topic information that is selected from the set D' of text data with topic information and that includes the text data containing the pair of the first positive example entity RP_e^{j-1} and the first positive example attribute RP_a^{j-1} (see, for example, FIG. 5A).

Similarly, the topic information extraction unit 34a extracts, by the same processing as step S14 described above, the first negative example topic information corresponding to the text data that contains the pair of the first negative example entity RN_e^{j-1} and the first negative example attribute RN_a^{j-1}. The topic information extraction unit 34a adds the first negative example topic information to the feature fN''_a^j of PN₁(RN_e^{j-1}, RN_a^{j-1}) corresponding to each first negative example text contained in that text data, and takes the result as the new feature fN_a^j of each PN₁(RN_e^{j-1}, RN_a^{j-1}) corresponding to each first negative example text. That is, the feature fN_a^j of PN₁(RN_e^{j-1}, RN_a^{j-1}) generated by the topic information extraction unit 34a includes the topic information contained in the text data with topic information that is selected from the set D' of text data with topic information and that includes the text data containing the pair of the first negative example entity RN_e^{j-1} and the first negative example attribute RN_a^{j-1}.
The topic information extraction unit 34a outputs the pair (fP_a^j, <+1>) of the generated feature fP_a^j of PP₁(RP_e^{j-1}, RP_a^{j-1}) and the label <+1> representing a positive example, and the pair (fN_a^j, <-1>) of the generated feature fN_a^j of PN₁(RN_e^{j-1}, RN_a^{j-1}) and the label <-1> representing a negative example.
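The feature augmentation in this step can be sketched as merging two feature maps: the pair's original feature (written fP''_a^j or fN''_a^j above) and the topic information of the text data that contains the pair. The representation of topic information as a mapping from topic name to weight, the "topic=" prefix, and the function name are assumptions; the actual form of the topic information is defined in step S14 of the first embodiment.

```python
from typing import Dict


def add_topic_features(
    base: Dict[str, float],
    topics: Dict[str, float],
    prefix: str = "topic=",
) -> Dict[str, float]:
    """Return a copy of `base` extended with one feature per topic."""
    merged = dict(base)  # leave the original feature untouched
    for topic, weight in topics.items():
        merged[prefix + topic] = weight
    return merged


# Hypothetical original feature of a pair and topic information of its document.
original_feature = {"surface=game": 1.0}
document_topics = {"baseball": 0.8, "railway": 0.1}

new_feature = add_topic_features(original_feature, document_topics)
print(new_feature)
```

The label attached to the feature (<+1> or <-1>) is unchanged by this step; only the feature itself grows.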

<< Attribute Identification Learning: Step S33 >>
The pair (fP_a^j, <+1>) of the feature fP_a^j of PP₁(RP_e^{j-1}, RP_a^{j-1}) and the label <+1> representing a positive example, and the pair (fN_a^j, <-1>) of the feature fN_a^j of PN₁(RN_e^{j-1}, RN_a^{j-1}) and the label <-1> representing a negative example, are input to the attribute identification learning unit 35a. The attribute identification learning unit 35a uses these as supervised learning data, generates the first identification model ME_a^j in the same way as in step S23 described above, and stores it in the storage unit 21d.

<< Attribute Identification: Step S34 >>
The attribute identification unit 36a first generates the feature fD_a^j of PD₁(RD_e^j, RD_a^j) corresponding to the first target text, as in step S24. To avoid confusion, the feature fD_a^j of PD₁(RD_e^j, RD_a^j) corresponding to the first target text and created as in step S24 is written below as fD''_a^j. Next, as in step S15, the attribute identification unit 36a extracts the first target topic information corresponding to the text data that contains the pair of the first target entity RD_e^j and the first target attribute RD_a^j. The attribute identification unit 36a adds the first target topic information to the feature fD''_a^j of each PD₁(RD_e^j, RD_a^j) corresponding to each first target text contained in that text data, and takes the result as the feature fD_a^j of each PD₁(RD_e^j, RD_a^j) corresponding to each first target text. That is, the feature fD_a^j of PD₁(RD_e^j, RD_a^j) generated by the attribute identification unit 36a includes the topic information contained in the text data with topic information that is selected from the set D' of text data with topic information and that includes the text data containing the pair of the first target entity RD_e^j and the first target attribute RD_a^j (see, for example, FIG. 5A).

As in step S24, the attribute identification unit 36a inputs the feature fD_a^j of PD₁(RD_e^j, RD_a^j) into the first identification model ME_a^j read from the storage unit 21d and identifies whether PD₁(RD_e^j, RD_a^j) is a positive example entity-positive example attribute pair or a negative example entity-negative example attribute pair. When the attribute identification unit 36a identifies PD₁(RD_e^j, RD_a^j) as a positive example entity-positive example attribute pair, it stores the first target attribute RD_a^j of that PD₁(RD_e^j, RD_a^j) in the storage unit 21e as a positive example attribute RP_a^j, adding it to the set of positive example attributes RP_a^j. When the attribute identification unit 36a identifies PD₁(RD_e^j, RD_a^j) as a negative example entity-negative example attribute pair, it stores the first target attribute RD_a^j of that PD₁(RD_e^j, RD_a^j) in the storage unit 21e as a negative example attribute RN_a^j, adding it to the set of negative example attributes RN_a^j.

<< Topic Information Extraction: Step S351 >>
Next, the processing of step S25 described above is executed, and the resulting pair (fP_e^j, <+1>) of the feature fP_e^j of PP₂(RP_e^{j-1}, RP_a^j) and the label <+1> representing a positive example, and the resulting pair (fN_e^j, <-1>) of the feature fN_e^j of PN₂(RN_e^{j-1}, RN_a^j) and the label <-1> representing a negative example, are input to the topic information extraction unit 34b. To avoid confusion, these are written below as the pair (fP''_e^j, <+1>) and the pair (fN''_e^j, <-1>).

The topic information extraction unit 34b extracts, by the same processing as step S14 described above, the second positive example topic information corresponding to the text data that contains the pair of the second positive example entity RP_e^{j-1} and the second positive example attribute RP_a^j. The topic information extraction unit 34b adds the second positive example topic information to the feature fP''_e^j of PP₂(RP_e^{j-1}, RP_a^j) corresponding to each second positive example text contained in that text data, and takes the result as the new feature fP_e^j of each PP₂(RP_e^{j-1}, RP_a^j) corresponding to each second positive example text. That is, the feature fP_e^j of PP₂(RP_e^{j-1}, RP_a^j) generated by the topic information extraction unit 34b includes the topic information contained in the text data with topic information that is selected from the set D' of text data with topic information and that includes the text data containing the pair of the second positive example entity RP_e^{j-1} and the second positive example attribute RP_a^j (see, for example, FIG. 5A).

Similarly, the topic information extraction unit 34b extracts, by the same processing as step S14 described above, the second negative example topic information corresponding to the text data that contains the pair of the second negative example entity RN_e^{j-1} and the second negative example attribute RN_a^j. The topic information extraction unit 34b adds the second negative example topic information to the feature fN''_e^j of PN₂(RN_e^{j-1}, RN_a^j) corresponding to each second negative example text contained in that text data, and takes the result as the new feature fN_e^j of each PN₂(RN_e^{j-1}, RN_a^j) corresponding to each second negative example text. That is, the feature fN_e^j of PN₂(RN_e^{j-1}, RN_a^j) generated by the topic information extraction unit 34b includes the topic information contained in the text data with topic information that is selected from the set D' of text data with topic information and that includes the text data containing the pair of the second negative example entity RN_e^{j-1} and the second negative example attribute RN_a^j.

The topic information extraction unit 34b outputs the pair (fP_e^j, <+1>) of the generated feature fP_e^j of PP₂(RP_e^{j-1}, RP_a^j) and the label <+1> representing a positive example, and the pair (fN_e^j, <-1>) of the generated feature fN_e^j of PN₂(RN_e^{j-1}, RN_a^j) and the label <-1> representing a negative example.

<< Entity Identification Learning: Step S36 >>
The pair (fP_e^j, <+1>) of the feature fP_e^j of PP₂(RP_e^{j-1}, RP_a^j) and the label <+1> representing a positive example, and the pair (fN_e^j, <-1>) of the feature fN_e^j of PN₂(RN_e^{j-1}, RN_a^j) and the label <-1> representing a negative example, are input to the entity identification learning unit 35b. The entity identification learning unit 35b uses these as supervised learning data, generates the second identification model ME_e^j in the same way as in step S26 described above, and stores it in the storage unit 11d.

<< Entity Identification: Step S37 >>
The entity identification unit 36b first generates the feature fD_e^j of PD₂(RD_e^j, RD_a^j) corresponding to the second target text, as in step S27. To avoid confusion, the feature fD_e^j of PD₂(RD_e^j, RD_a^j) corresponding to the second target text and created as in step S27 is written below as fD''_e^j.

Next, as in step S15, the entity identification unit 36b extracts the second target topic information corresponding to the text data that contains the pair of the second target entity RD_e^j and the second target attribute RD_a^j. The entity identification unit 36b adds the second target topic information to the feature fD''_e^j of each PD₂(RD_e^j, RD_a^j) corresponding to each second target text contained in that text data, and takes the result as the feature fD_e^j of each PD₂(RD_e^j, RD_a^j) corresponding to each second target text. That is, the feature fD_e^j of PD₂(RD_e^j, RD_a^j) generated by the entity identification unit 36b includes the topic information contained in the text data with topic information that is selected from the set D' of text data with topic information and that includes the text data containing the pair of the second target entity RD_e^j and the second target attribute RD_a^j (see, for example, FIG. 5A).

As in step S27, the entity identification unit 36b inputs the feature fD_e^j of PD₂(RD_e^j, RD_a^j) into the second identification model ME_e^j read from the storage unit 11d and identifies whether PD₂(RD_e^j, RD_a^j) is a positive example entity-positive example attribute pair or a negative example entity-negative example attribute pair. When the entity identification unit 36b identifies PD₂(RD_e^j, RD_a^j) as a positive example entity-positive example attribute pair, it stores the second target entity RD_e^j of that PD₂(RD_e^j, RD_a^j) in the storage unit 11e as a positive example entity RP_e^j, adding it to the set of positive example entities RP_e^j. When the entity identification unit 36b identifies PD₂(RD_e^j, RD_a^j) as a negative example entity-negative example attribute pair, it stores the second target entity RD_e^j of that PD₂(RD_e^j, RD_a^j) in the storage unit 11e as a negative example entity RN_e^j, adding it to the set of negative example entities RN_e^j.
Thereafter, the processing of steps S17 to S19 described above is executed.

[Other modifications, etc.]
The present invention is not limited to the embodiments described above. For example, in the first embodiment, step S13 may be skipped so that only the topic information is used as the feature. As noted above, the topic model and the learning model are not limited to the specific examples given. The various processes described above need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as needed. It goes without saying that other modifications may be made as appropriate without departing from the spirit of the present invention.

When the above configuration is implemented by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on the computer realizes the above processing functions on the computer.
The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium is a non-transitory recording medium. Examples of such a recording medium include a magnetic recording device, an optical disc, a magneto-optical recording medium, and a semiconductor memory.
The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.
A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording device and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or may execute processing according to the received program each time a program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and result acquisition. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the present apparatus is configured by executing a predetermined program on a computer; however, at least part of this processing content may be implemented in hardware.

1-3 Data extraction device

Claims

A data extraction device comprising:
a preprocessing unit that uses unsupervised learning data obtained from text data to learn a topic model describing the relationship between the text data and topic information that represents, as index values, the appropriateness of each of a plurality of topic candidates for the text data;
a topic information extraction unit that uses positive example topic information, extracted from the topic model in correspondence with the topic of text data including a positive example entity that is a character string to be extracted, as at least part of the feature of the positive example entity, and uses negative example topic information, extracted from the topic model in correspondence with the topic of text data including a negative example entity that is a character string not to be extracted, as at least part of the feature of the negative example entity;
an identification learning unit that, by learning processing using the feature of the positive example entity and the feature of the negative example entity as supervised learning data, generates an identification model that is a function which takes the feature of an arbitrary entity as input and outputs information for identifying whether that entity is a positive example entity or a negative example entity; and
an entity identification unit that takes, as a target entity, an entity that is a character string included in text data selected from a set of text data, uses topic information extracted from the topic model in correspondence with the topic of the selected text data as at least part of the feature of the target entity, inputs the feature of the target entity to the identification model to identify whether the target entity is a positive example entity or a negative example entity, sets the target entity as the positive example entity when the target entity is identified as a positive example entity, and sets the target entity as the negative example entity when the target entity is identified as a negative example entity.

The data extraction device according to claim 1, wherein
the feature of the positive example entity corresponds to a character string that contains the positive example entity and is included in text data containing the positive example entity, and includes information representing the relationship between that character string and the positive example entity,
the feature of the negative example entity corresponds to a character string that contains the negative example entity and is included in text data containing the negative example entity, and includes information representing the relationship between that character string and the negative example entity, and
the feature of the target entity corresponds to a character string that contains the target entity and is included in text data containing the target entity, and includes information representing the relationship between that character string and the target entity.

The data extraction device according to claim 2, wherein
each character string included in the text data is associated with the topic candidates for that character string and with topic candidate scores representing the appropriateness of each of the topic candidates for that character string; the per-candidate result obtained by aggregating the topic candidate scores corresponding to each topic candidate is taken as the positive example topic score of that topic candidate; and a topic candidate whose positive example topic score satisfies a specific criterion is taken as a positive example reference topic, and
the identification learning unit uses, as the supervised learning data, the feature of the positive example entity corresponding to a character string that is included in the text data, that corresponds to the same topic candidate as one of the positive example reference topics, and whose topic candidate score for that topic candidate satisfies a specific criterion.
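
The aggregation and filtering recited in this claim can be sketched as follows. This is a minimal sketch under assumptions the claim does not state: aggregation is taken to be a plain sum, both "specific criteria" are taken to be lower-bound thresholds (`min_total`, `min_score`), and the input is taken to be the character strings that contain positive example entities. All function and parameter names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set

# One dict per character string: topic candidate -> topic candidate score.
TopicScores = Dict[str, float]


def positive_topic_scores(strings: List[TopicScores]) -> Dict[str, float]:
    """Sum the topic candidate scores per topic candidate."""
    totals: Dict[str, float] = defaultdict(float)
    for scores in strings:
        for topic, score in scores.items():
            totals[topic] += score
    return dict(totals)


def reference_topics(totals: Dict[str, float], min_total: float) -> Set[str]:
    """Topic candidates whose aggregated score meets the (assumed) threshold."""
    return {topic for topic, total in totals.items() if total >= min_total}


def select_training_strings(
    strings: List[TopicScores], refs: Set[str], min_score: float
) -> List[int]:
    """Indices of strings that score high enough on some reference topic."""
    return [
        i
        for i, scores in enumerate(strings)
        if any(scores.get(topic, 0.0) >= min_score for topic in refs)
    ]


strings = [
    {"baseball": 0.75, "travel": 0.25},
    {"baseball": 0.5, "politics": 0.5},
    {"travel": 0.75, "politics": 0.25},
]
totals = positive_topic_scores(strings)
refs = reference_topics(totals, min_total=1.25)
kept = select_training_strings(strings, refs, min_score=0.5)
```

Only the features of entities found in the strings indexed by `kept` would then be passed to the learning processing; strings dominated by topics other than the reference topics are left out of the supervised learning data.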

The data extraction device according to claim 2 or 3, wherein
each character string included in the text data is associated with the topic candidates for that character string and with topic candidate scores representing the appropriateness of each of the topic candidates for that character string; the per-candidate result obtained by aggregating the topic candidate scores corresponding to each topic candidate is taken as the positive example topic score of that topic candidate; and a topic candidate whose positive example topic score satisfies a specific criterion is taken as a positive example reference topic, and
the entity identification unit takes, as the target entity, an entity included in a character string that is included in the text data, that corresponds to the same topic candidate as one of the positive example reference topics, and whose topic candidate score for that topic candidate satisfies a specific criterion.

The data extraction device according to any one of claims 1 to 4, wherein
the entity identification unit selects, from among the features of the positive example entity and the negative example entity used as the supervised learning data in the identification learning unit, a feature whose index representing the degree of influence on the identification model generated from them meets a specific criterion, and selects, as the target entity, an entity that is a character string included in the text data containing the character string corresponding to the selected feature.

A data extraction device comprising:
an attribute identification feature extraction unit that generates a first positive example entity-positive example attribute pair, which is a pair of a first positive example entity selected from a set of positive example entities that are character strings to be extracted and a first positive example attribute selected from a set of positive example attributes that are character strings representing attributes of the positive example entities, and a first negative example entity-negative example attribute pair, which is a pair of a first negative example entity selected from a set of negative example entities that are character strings not to be extracted and a first negative example attribute selected from a set of negative example attributes that are character strings representing attributes of the negative example entities, selects from a set of text data a character string including the pair of the first positive example entity and the first positive example attribute and uses information representing the characteristics of the first positive example entity-positive example attribute pair with respect to the selected character string as at least part of the feature of the first positive example entity-positive example attribute pair, and selects from the set of text data a character string including the pair of the first negative example entity and the first negative example attribute and uses information representing the characteristics of the first negative example entity-negative example attribute pair with respect to the selected character string as at least part of the feature of the first negative example entity-negative example attribute pair;
an attribute identification learning unit that, by learning processing using the feature of the first positive example entity-positive example attribute pair and the feature of the first negative example entity-negative example attribute pair as supervised learning data, generates a first identification model that is a function which takes as input the feature of an entity-attribute pair, being a pair of an entity that is an arbitrary character string and an attribute of that entity, and outputs information for identifying whether the entity-attribute pair is a positive example entity-positive example attribute pair or a negative example entity-negative example attribute pair;
an attribute identification unit that selects one piece of text data from the set of text data, selects a character string included in the selected text data as a first target entity, selects a character string different from the first target entity from the selected text data as a first target attribute, takes the pair of the first target entity and the first target attribute as a first target entity-target attribute pair, uses information representing the characteristics of the first target entity-target attribute pair in the selected text data as at least part of the feature of the first target entity-target attribute pair, inputs the feature of the first target entity-target attribute pair to the first identification model to identify whether the first target entity-target attribute pair is a positive example entity-positive example attribute pair or a negative example entity-negative example attribute pair, adds the first target attribute to the set of positive example attributes when the first target entity-target attribute pair is identified as a positive example entity-positive example attribute pair, and adds the first target attribute to the set of negative example attributes when the first target entity-target attribute pair is identified as a negative example entity-negative example attribute pair;
an entity identification feature extraction unit that generates a second positive example entity-positive example attribute pair, which is a pair of a second positive example entity selected from the set of positive example entities and a second positive example attribute selected from the set of positive example attributes, and a second negative example entity-negative example attribute pair, which is a pair of a second negative example entity selected from the set of negative example entities and a second negative example attribute selected from the set of negative example attributes, selects from the set of text data a character string including the pair of the second positive example entity and the second positive example attribute and uses information representing the characteristics of the second positive example entity-positive example attribute pair with respect to the selected character string as at least part of the feature of the second positive example entity-positive example attribute pair, and selects from the set of text data a character string including the pair of the second negative example entity and the second negative example attribute and uses information representing the characteristics of the second negative example entity-negative example attribute pair with respect to the selected character string as at least part of the feature of the second negative example entity-negative example attribute pair;
an entity identification learning unit that, by learning processing using the feature of the second positive example entity-positive example attribute pair and the feature of the second negative example entity-negative example attribute pair as supervised learning data, generates a second identification model that is a function which takes as input the feature of an entity-attribute pair, being a pair of an entity that is an arbitrary character string and an attribute of that entity, and outputs information for identifying whether the entity-attribute pair is a positive example entity-positive example attribute pair or a negative example entity-negative example attribute pair; and
an entity identification unit that selects one piece of text data from the set of text data, selects a character string included in the selected text data as a second target entity, selects a character string different from the second target entity from the selected text data as a second target attribute, takes the pair of the second target entity and the second target attribute as a second target entity-target attribute pair, uses information representing the characteristics of the second target entity-target attribute pair in the selected text data as at least part of the feature of the second target entity-target attribute pair, inputs the feature of the second target entity-target attribute pair to the second identification model to identify whether the second target entity-target attribute pair is a positive example entity-positive example attribute pair or a negative example entity-negative example attribute pair, adds the second target entity to the set of positive example entities when the second target entity-target attribute pair is identified as a positive example entity-positive example attribute pair, and adds the second target entity to the set of negative example entities when the second target entity-target attribute pair is identified as a negative example entity-negative example attribute pair.

The data extraction device according to claim 6, wherein
the feature of the first positive example entity-positive example attribute pair includes information representing the relationship between a character string that contains the first positive example entity and the first positive example attribute and is included in text data containing them, and the first positive example entity and the first positive example attribute,
the feature of the first negative example entity-negative example attribute pair includes information representing the relationship between a character string that contains the first negative example entity and the first negative example attribute and is included in text data containing them, and the first negative example entity and the first negative example attribute,
the feature of the first target entity-target attribute pair includes information representing the relationship between a character string that contains the first target entity and the first target attribute and is included in text data containing them, and the first target entity and the first target attribute,
the feature of the second positive example entity-positive example attribute pair includes information representing the relationship between a character string that contains the second positive example entity and the second positive example attribute and is included in text data containing them, and the second positive example entity and the second positive example attribute,
the feature of the second negative example entity-negative example attribute pair includes information representing the relationship between a character string that contains the second negative example entity and the second negative example attribute and is included in text data containing them, and the second negative example entity and the second negative example attribute, and
the feature of the second target entity-target attribute pair includes information representing the relationship between a character string that contains the second target entity and the second target attribute and is included in text data containing them, and the second target entity and the second target attribute.

The data extraction device according to claim 6 or 7, further comprising
an initial attribute set generation unit that
selects, as positive example attribute candidates, character strings other than the positive example entity from a set of text data containing the positive example entity, and takes as initial values of the positive example attributes a predetermined number of positive example attribute candidates in descending order of an index representing the magnitude of the difference between the frequency with which the positive example attribute candidate is included in the set of character strings containing the positive example entity and the frequency with which the positive example attribute candidate is included in the set of all text data, and
selects, as negative example attribute candidates, character strings other than the negative example entity from a set of text data containing the negative example entity, and takes as initial values of the negative example attributes a predetermined number of negative example attribute candidates in descending order of an index representing the magnitude of the difference between the frequency with which the negative example attribute candidate is included in the set of character strings containing the negative example entity and the frequency with which the negative example attribute candidate is included in the set of all text data.
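
The initial attribute selection recited in this claim can be sketched as follows. This is a minimal sketch under assumptions the claim does not state: each text is already split into tokens, frequency is counted once per text, and the index is taken to be the simple difference of the two relative frequencies. The function name `initial_attributes` and its parameters are hypothetical.

```python
from collections import Counter
from typing import List, Set


def initial_attributes(docs: List[List[str]], entities: Set[str], k: int) -> List[str]:
    """Pick the k candidates most over-represented in texts containing an entity."""
    with_entity = [doc for doc in docs if entities & set(doc)]
    if not with_entity:
        return []

    # Count each candidate once per text; the entities themselves are excluded.
    in_subset = Counter(
        tok for doc in with_entity for tok in set(doc) if tok not in entities
    )
    in_all = Counter(tok for doc in docs for tok in set(doc))

    def index(tok: str) -> float:
        # Assumed index: relative frequency near the entities minus overall.
        return in_subset[tok] / len(with_entity) - in_all[tok] / len(docs)

    # Descending index; ties are broken alphabetically for determinism.
    ranked = sorted(in_subset, key=lambda tok: (-index(tok), tok))
    return ranked[:k]


docs = [
    ["Hanshin", "manager", "win"],
    ["Hiroshima", "manager", "stadium"],
    ["Hanshin", "stadium", "win"],
    ["election", "win", "party"],
    ["party", "budget", "win"],
]
seeds = {"Hanshin", "Hiroshima"}
attrs = initial_attributes(docs, seeds, k=2)
```

Here "win" appears with the seed entities but is even more common across all texts, so it ranks below "manager" and "stadium". The same function applied to the negative example entities would give the initial negative example attributes.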

The data extraction device according to any one of claims 6 to 8, wherein
the attribute identification unit selects, from among the features of the first positive example entity-positive example attribute pair and the features of the first negative example entity-negative example attribute pair used as the supervised learning data in the attribute identification learning unit, a feature whose index representing the degree of influence on the first identification model generated from them is greater than a specific criterion, and selects the text data including a character string corresponding to the selected feature, the selected text data being a character string that includes the first target entity and the first target attribute.

The data extraction device according to any one of claims 6 to 9, wherein
the entity identification unit selects, from among the features of the second positive example entity-positive example attribute pair and the features of the second negative example entity-negative example attribute pair used as the supervised learning data in the entity identification learning unit, a feature whose index representing the degree of influence on the second identification model generated from them is greater than a specific criterion, and selects the text data including a character string corresponding to the selected feature, the selected text data being a character string that includes the second target entity and the second target attribute.

The data extraction device according to any one of claims 6 to 10, wherein
the feature of the first positive example entity-positive example attribute pair includes topic information corresponding to a topic of text data including the pair of the first positive example entity and the first positive example attribute,
the feature of the first negative example entity-negative example attribute pair includes topic information corresponding to a topic of text data including the pair of the first negative example entity and the first negative example attribute,
the feature of the first target entity-target attribute pair includes topic information corresponding to a topic of text data including the pair of the first target entity and the first target attribute,
the feature of the second positive example entity-positive example attribute pair includes topic information corresponding to a topic of text data including the pair of the second positive example entity and the second positive example attribute,
the feature of the second negative example entity-negative example attribute pair includes topic information corresponding to a topic of text data including the pair of the second negative example entity and the second negative example attribute, and
the feature of the second target entity-target attribute pair includes topic information corresponding to a topic of text data including the pair of the second target entity and the second target attribute.

The data extraction device according to any one of claims 1 to 5, further comprising:
a positive example distribution processing unit that obtains information representing a positive example probability distribution, which is the appearance probability distribution of all entities included in a set of text data containing a positive example seed entity;
a negative example topic determination unit that obtains, for each piece of topic information, information representing a topic probability distribution, which is the appearance probability distribution of all entities included in a set of text data corresponding to the same topic information, and selects at least part of the topic information as the negative example topic information on the basis of the distance between the positive example probability distribution and the topic probability distribution obtained using the information representing the positive example probability distribution and the information representing the topic probability distribution; and
a negative example seed entity generation unit that selects, as a negative example seed entity, an entity corresponding to the negative example topic information selected by the negative example topic determination unit,
wherein the processing by the topic information extraction unit, the identification learning unit, and the entity identification unit is repeated one or more times,
the positive example seed entity is the positive example entity in the initial processing by the topic information extraction unit, and
the negative example seed entity is the negative example entity in the initial processing by the topic information extraction unit.
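
The distance-based selection of negative example topics recited in this claim can be sketched as follows. This is a minimal sketch under assumptions the claim does not state: the distance is taken to be the total variation distance (the claim does not name a measure), the topics farthest from the positive example distribution are taken as negative example topics, and each text is represented as a list of the entities it contains. All names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def entity_distribution(docs: List[List[str]]) -> Dict[str, float]:
    """Appearance probability of every entity over a set of texts."""
    counts = Counter(entity for doc in docs for entity in doc)
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()} if total else {}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def pick_negative_topics(
    positive_docs: List[List[str]],
    docs_by_topic: Dict[str, List[List[str]]],
    n: int,
) -> List[str]:
    """Return the n topics whose entity distribution is farthest from the positive one."""
    p = entity_distribution(positive_docs)
    dist = {
        topic: total_variation(p, entity_distribution(docs))
        for topic, docs in docs_by_topic.items()
    }
    return sorted(dist, key=lambda t: (-dist[t], t))[:n]


positive_docs = [["Hanshin", "Hiroshima"], ["Hanshin", "Yakult"]]
docs_by_topic = {
    "baseball": [["Hanshin", "Yakult"]],
    "politics": [["Diet", "party"]],
    "travel": [["Hiroshima", "Kyoto"]],
}
negative_topics = pick_negative_topics(positive_docs, docs_by_topic, n=1)
```

Entities appearing in texts of the selected topics (here, those under "politics") would then be candidates for negative example seed entities. Other measures such as Kullback-Leibler divergence could be substituted for `total_variation`, although they typically need smoothing for entities absent from one distribution.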

The data extraction device according to any one of claims 1 to 5, further comprising:
a seed positive example topic score creation unit that aggregates, for each topic, seed positive example topic information representing the appropriateness of each topic for the text data containing the positive example seed entity, and obtains the aggregation result for each topic as the seed positive example topic score of that topic;
a negative example topic determination unit that takes, as the negative example topic information, the topic information corresponding to a topic selected on the basis of the magnitude of the seed positive example topic score of the topic; and
a negative example seed entity generation unit that selects, as a negative example seed entity, an entity corresponding to the negative example topic information selected by the negative example topic determination unit,
wherein the processing by the topic information extraction unit, the identification learning unit, and the entity identification unit is repeated one or more times,
the positive example seed entity is the positive example entity used in the initial processing by the topic information extraction unit, and
the negative example seed entity is the negative example entity used in the initial processing by the topic information extraction unit.

A data extraction method executed by a data extraction device, the method comprising:
a preprocessing step in which a preprocessing unit uses unsupervised learning data obtained from text data to learn a topic model describing the relationship between the text data and topic information that represents, as index values, the appropriateness of each of a plurality of topic candidates for the text data;
a topic information extraction step in which a topic information extraction unit uses positive example topic information, extracted from the topic model in correspondence with the topic of text data including a positive example entity that is a character string to be extracted, as at least part of the feature of the positive example entity, and uses negative example topic information, extracted from the topic model in correspondence with the topic of text data including a negative example entity that is a character string not to be extracted, as at least part of the feature of the negative example entity;
an identification learning step in which an identification learning unit, by learning processing using the feature of the positive example entity and the feature of the negative example entity as supervised learning data, generates an identification model that is a function which takes the feature of an arbitrary entity as input and outputs information for identifying whether that entity is a positive example entity or a negative example entity; and
an entity identification step in which an entity identification unit takes, as a target entity, an entity that is a character string included in text data selected from a set of text data, uses topic information extracted from the topic model in correspondence with the topic of the selected text data as at least part of the feature of the target entity, inputs the feature of the target entity to the identification model to identify whether the target entity is a positive example entity or a negative example entity, sets the target entity as the positive example entity when the target entity is identified as a positive example entity, and sets the target entity as the negative example entity when the target entity is identified as a negative example entity.

A data extraction method executed by a data extraction device,
The attribute identifying feature extraction unit selects the first positive example entity selected from the set of positive example entities that are character strings to be extracted and the first example attribute selected from the set of positive example attributes that are character strings representing the attributes of the positive example entities. A first positive example entity-positive example attribute pair that is a set of one positive example attribute, a first negative example entity selected from a set of negative example entities that are character strings not to be extracted, and attributes of the negative example entity Generating a first negative example entity-negative example attribute pair that is a set with a first negative example attribute selected from a set of negative example attributes that is a character string to represent the first positive example entity from the set of text data And a character string including a pair of the first positive example attribute, and information indicating characteristics of the first positive example entity-positive example attribute pair for the selected character string is used as the first positive example entity-positive example. A character string including at least a part of a feature of a sex pair and including a set of the first negative example entity and the first negative example attribute from the set of text data, and selecting the first character string for the selected character string An attribute identifying feature extraction step in which information representing the characteristics of the negative example entity-negative example attribute pair is at least part of the features of the first negative example entity-negative example attribute pair;
The attribute identification learning unit performs an arbitrary character string by learning processing using the feature of the first positive example entity-positive example attribute pair and the feature of the first negative example entity-negative example attribute pair as supervised learning data. To identify whether the entity-attribute pair is a positive entity-positive example attribute pair or a negative example entity-negative example attribute pair by inputting the identity of an entity-attribute pair that is a set of an entity and the attribute of the entity An attribute identification learning step for generating a first identification model which is a function for outputting information;
The attribute identification unit selects any one of the text data from the set of text data, selects a character string included in the selected text data as a first target entity, and selects the first target entity from the selected text data. The first target attribute is selected as a first target attribute, and a set of the first target entity and the first target attribute is set as a first target entity-target attribute pair, and the first target in the selected text data is selected. Information representing the characteristics of the entity-target attribute pair is set as at least a part of the feature of the first target entity-target attribute pair, and the feature of the first target entity-target attribute pair is input to the first identification model. The first target entity-target attribute pair is a positive entity-positive example attribute pair or a negative example entity-negative example attribute pair. When the first target entity-target attribute pair is identified as a positive entity-positive attribute pair, the first target attribute is added to the set of positive attribute, An attribute identifying step of adding the first target attribute to the set of negative example attributes when the target entity-target attribute pair is identified as a negative example entity-negative example attribute pair;
An entity identification feature extraction step in which the entity identification feature extraction unit generates a second positive example entity-positive example attribute pair, which is a pair of a second positive example entity selected from the set of positive example entities and a second positive example attribute selected from the set of positive example attributes, and a second negative example entity-negative example attribute pair, which is a pair of a second negative example entity selected from the set of negative example entities and a second negative example attribute selected from the set of negative example attributes, selects from the set of text data a character string that includes the pair of the second positive example entity and the second positive example attribute, uses information representing the characteristics of the second positive example entity-positive example attribute pair in the selected character string as at least a part of the features of the second positive example entity-positive example attribute pair, selects from the set of text data a character string that includes the pair of the second negative example entity and the second negative example attribute, and uses information representing the characteristics of the second negative example entity-negative example attribute pair in the selected character string as at least a part of the features of the second negative example entity-negative example attribute pair;
An entity identification learning step in which the entity identification learning unit performs learning processing that uses the features of the second positive example entity-positive example attribute pair and the features of the second negative example entity-negative example attribute pair as supervised learning data, and thereby generates a second identification model, which is a function that takes as input the features of an entity-attribute pair, that is, a pair of an entity that is an arbitrary character string and an attribute of that entity, and outputs information for identifying whether the entity-attribute pair is a positive example entity-positive example attribute pair or a negative example entity-negative example attribute pair;
An entity identification step in which the entity identification unit selects any text data from the set of text data, selects a character string included in the selected text data as a second target entity, selects from the selected text data a character string different from the second target entity as a second target attribute, sets the pair of the second target entity and the second target attribute as a second target entity-target attribute pair, uses information representing the characteristics of the second target entity-target attribute pair in the selected text data as at least a part of the features of the second target entity-target attribute pair, inputs the features of the second target entity-target attribute pair to the second identification model to identify whether the second target entity-target attribute pair is a positive example entity-positive example attribute pair or a negative example entity-negative example attribute pair, adds the second target entity to the set of positive example entities when the second target entity-target attribute pair is identified as a positive example entity-positive example attribute pair, and adds the second target entity to the set of negative example entities when the second target entity-target attribute pair is identified as a negative example entity-negative example attribute pair;
A data extraction method comprising:
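
For illustration only, the learning and identification steps recited above can be sketched as one alternating round in Python. This is a minimal sketch under stated assumptions, not the implementation of this publication: all function names are hypothetical, the only feature of a pair is the run of tokens between the two strings, a count-based linear scorer stands in for whatever supervised learner is actually used, each candidate pair is restricted to have one already known string, pairs that score zero are left unclassified, and the toy corpus is invented.

```python
def pair_features(tokens, first, second):
    """Features of a pair in one tokenized text: the tokens between the two strings (toy choice)."""
    lo, hi = sorted((tokens.index(first), tokens.index(second)))
    return {"between=" + " ".join(tokens[lo + 1:hi])}

def extract_training_data(corpus, pos_e, pos_a, neg_e, neg_a):
    """Feature extraction: labelled features for known pairs that co-occur in a text."""
    data = []
    for tokens in corpus:
        present = set(tokens)
        for label, entities, attrs in ((+1, pos_e, pos_a), (-1, neg_e, neg_a)):
            for e in entities & present:
                for a in attrs & present:
                    data.append((pair_features(tokens, e, a), label))
    return data

def learn_model(data):
    """Learning: a count-based linear scorer (assumed stand-in for a supervised learner)."""
    weights = {}
    for features, label in data:
        for f in features:
            weights[f] = weights.get(f, 0) + label
    return lambda features: sum(weights.get(f, 0) for f in features)

def identify(corpus, model, known):
    """Identification: pair each known string with every other token and classify by score sign."""
    positives, negatives = set(), set()
    for tokens in corpus:
        for anchor in known & set(tokens):
            for cand in set(tokens) - {anchor}:
                score = model(pair_features(tokens, anchor, cand))
                if score > 0:
                    positives.add(cand)
                elif score < 0:
                    negatives.add(cand)
    return positives, negatives

def one_round(corpus, pos_e, neg_e, pos_a, neg_a):
    # Attribute side: learn the first model, then grow the attribute sets.
    model1 = learn_model(extract_training_data(corpus, pos_e, pos_a, neg_e, neg_a))
    new_pos_a, new_neg_a = identify(corpus, model1, pos_e | neg_e)
    pos_a, neg_a = pos_a | new_pos_a, neg_a | new_neg_a
    # Entity side: re-extract features with the grown attribute sets,
    # learn the second model, then grow the entity sets.
    model2 = learn_model(extract_training_data(corpus, pos_e, pos_a, neg_e, neg_a))
    new_pos_e, new_neg_e = identify(corpus, model2, pos_a | neg_a)
    return pos_e | new_pos_e, neg_e | new_neg_e, pos_a, neg_a

# Invented toy corpus; each text is split into tokens on whitespace.
corpus = [s.split() for s in (
    "Hiroshima signed a pitcher",
    "Hanshin signed a catcher",
    "Osaka opened a station",
    "Osaka opened a terminal",
    "Yakult signed a catcher",
    "Kyoto opened a terminal",
)]
pos_e, neg_e, pos_a, neg_a = one_round(
    corpus, {"Hiroshima", "Hanshin"}, {"Osaka"}, {"pitcher"}, {"station"})
```

In this toy run the attribute side first picks up <catcher> and <terminal>, and the enlarged attribute sets then let the entity side reach <Yakult> and <Kyoto>, neither of which co-occurs with a seed attribute. Leaving zero-score pairs unclassified is a design choice of the sketch that keeps function words such as "a" out of the sets.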

A program for causing a computer to function as each unit of the data extraction device according to claim 1.