JP6062879B2 - Model learning apparatus, method and program

Info

Publication number: JP6062879B2
Application number: JP2014052437A
Authority: JP
Inventors: 九月貞光; 松尾義博; 久子浅野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-03-14
Filing date: 2014-03-14
Publication date: 2017-01-18
Anticipated expiration: 2034-03-14
Also published as: JP2015176355A

Description

The present invention relates to a model learning apparatus, method, and program, and more particularly to a model learning apparatus, method, and program for extracting named entities to which labels indicating named entity categories are assigned.

Techniques are known that automatically extract a "named entity", that is, an expression referring to a unique thing such as "the Nile" (ナイル川), from a sentence such as "The Nile is the longest river in the world" and assign it a detailed label (for example, "river name") (see, for example, Non-Patent Document 1). A named entity is an expression, such as a proper noun, that refers to a specific place or thing. For example, "NTT (registered trademark)" and "Osaka" are named entities of the categories "organization" and "location", respectively. Here, "organization" and "location" are called named entity categories. The eight named entity categories defined at a conference called IREX have conventionally been used as the standard named entity categories. However, categories such as "location" and "organization" are coarse-grained, and some applications require a finer classification. Sekine et al. of New York University have proposed the concept of extended named entities, which subdivides named entities; that framework defines 200 named entity categories (Non-Patent Document 2).

In Non-Patent Document 1, extended named entity extraction is solved using conventional clue information (surrounding context information) and CRF, which is one machine learning method. Here, CRF (Conditional Random Fields) is a discriminative learner used to solve sequence labeling problems such as morphological analysis and named entity extraction.

A technique is also known that uses the discriminative learner of Non-Patent Document 1 together with "training data", in which named entity labels are assigned to text, to create a model for extracting named entities and assigning detailed labels to them (for example, Patent Document 1).
A technique is also known that uses an existing dictionary such as Wikipedia to acquire an extended named entity dictionary from pairs of a named entity and a label (for example, Non-Patent Document 3).

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2013-246795

Non-Patent Document 1: Hashimoto, Nakamura, "Construction of a Corpus Tagged with Extended Named Entities: White Papers, Books, and Yahoo! Chiebukuro Core Data", 16th Annual Meeting of the Association for Natural Language Processing, March 2010
Non-Patent Document 2: Satoshi Sekine, Chikashi Nobata, "Definition, dictionaries and tagger for Extended Named Entity Hierarchy", LREC2004, pp. 1977-1980
Non-Patent Document 3: Ryuichiro Higashinaka, Kugatsu Sadamitsu, Kuniko Saito, Toshiro Makino and Yoshihiro Matsuo, "Creating an Extended Named Entity Dictionary from Wikipedia", The 24th International Conference on Computational Linguistics (COLING 2012)

However, the technique of Patent Document 1 has the problem that the training data can only be created manually, so improving the accuracy of the model is costly.

In addition, because new named entities are created every day, matching that relies only on the dictionary obtained by Non-Patent Document 3 has a problem with named entity recall.

The present invention has been made to solve the above problems, and an object thereof is to provide a model learning apparatus, method, and program capable of learning a model for assigning labels to extraction-target character strings with high accuracy.

To achieve the above object, a model learning apparatus according to the present invention includes: an input unit that receives first training data, which is a document in which a label has been assigned in advance to an extraction-target character string, and second training data, which is a document to which the label has not been assigned; a dictionary matching unit that matches character strings contained in the document of the second training data against a dictionary in which character strings and labels are stored in association with each other, and that assigns, in the document of the second training data, the label to each character string that matches a character string stored in the dictionary in association with that label; a feature generation unit that extracts features from the first training data as target domain features, duplicates the target domain features as common domain features, and generates a feature vector consisting of the target domain features, other domain features, and the common domain features, and that extracts the features from the second training data in which labels have been assigned to the character strings as the other domain features, duplicates the other domain features as the common domain features, and generates the feature vector; and a learning unit that learns parameters for each of the target domain features, the other domain features, and the common domain features on the basis of the feature vectors of the first training data and the feature vectors of the second training data, extracts the parameters for the common domain features from the learned parameters, and outputs them as parameters of a model for extracting the extraction-target character strings and assigning the labels.

In the model learning apparatus according to the present invention, the extraction-target character string may be a named entity, the label may be a named entity class, and the model may be a model for extracting the named entity and assigning the named entity class to it.

A model learning method according to the present invention includes: a step in which an input unit receives first training data, which is a document in which a label has been assigned in advance to an extraction-target character string, and second training data, which is a document to which the label has not been assigned; a step in which a dictionary matching unit matches character strings contained in the document of the second training data against a dictionary in which character strings and labels are stored in association with each other, and assigns, in the document of the second training data, the label to each character string that matches a character string stored in the dictionary in association with that label; a step in which a feature generation unit extracts features from the first training data as target domain features, duplicates the target domain features as common domain features, and generates a feature vector consisting of the target domain features, other domain features, and the common domain features, and extracts the features from the second training data in which labels have been assigned to the character strings as the other domain features, duplicates the other domain features as the common domain features, and generates the feature vector; and a step in which a learning unit learns parameters for each of the target domain features, the other domain features, and the common domain features on the basis of the feature vectors of the first training data and the feature vectors of the second training data, extracts the parameters for the common domain features from the learned parameters, and outputs them as parameters of a model for extracting the extraction-target character strings and assigning the labels.

A program according to the present invention is a program for causing a computer to function as each unit of the above model learning apparatus.

As described above, according to the model learning apparatus, method, and program of the present invention, a dictionary is used to assign labels to character strings in the unlabeled second training data; features are extracted from the first training data and the second training data; feature vectors consisting of target domain features, other domain features, and common domain features are generated for the first training data and the second training data; parameters are learned for each of the target domain features, the other domain features, and the common domain features; and the parameters for the common domain features are extracted. This makes it possible to learn a model for assigning labels to extraction-target character strings with high accuracy.

FIG. 1 is a block diagram showing the functional configuration of the model learning apparatus according to the embodiment of the present invention.
FIG. 2 is a flowchart showing the model learning processing routine in the model learning apparatus according to the embodiment of the present invention.
FIG. 3 is a diagram for explaining the method of learning the model.
FIG. 4 is a diagram showing an example of duplicating target domain features and other domain features.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.

<Principle of the Embodiment of the Present Invention>

In the embodiment of the present invention, a model for extracting character strings and assigning appropriate labels to them is learned by transfer learning, using existing training data together with training data obtained from the Web or the like.

Transfer learning is a method of learning a model using not only data of the target domain (for example, newspapers) but also data of another domain (for example, blogs) in a way that benefits the target domain.

One transfer learning technique is the Feature Augmentation method described in Non-Patent Document 4 (Hal Daume III, "Frustratingly Easy Domain Adaptation", In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 256-263 (ACL-2007)), and the embodiment of the present invention uses the same technique.

The Feature Augmentation method treats the first training data described later (existing training data) as the target domain and the second training data described later (a document set obtained from the Web or the like, to which pseudo-correct labels have been assigned) as the other domain, and learns the target domain model using the other domain as well. In the Feature Augmentation method, learning that takes the domain into account is performed by duplicating the many features for each domain. Any learning method may be used.

For example, suppose that features a and c, each with a weight of 1.0, are obtained from a document of the target domain, and that features a, b, and d, each with a weight of 1.0, are obtained from a document of the other domain. Features c and d are assumed to conflict (for example, in the target domain "を / case particle" was always corrected to "が / case particle", whereas in the other domain it was always corrected to "に / case particle"). The Feature Augmentation method takes the union of the feature types and handles features a, b, c, and d. The features a, b, c, and d are then copied into each of "common", "target domain", and "other domain" to construct the feature vector.

During learning, for features obtained from the target domain, the "common" and "target domain" features are enabled when the feature vector is generated. For features obtained from the other domain, the "common" and "other domain" features are enabled when the feature vector is generated.
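The layout described above can be made concrete with a small sketch. It assumes a dense vector ordered as [common | target domain | other domain] over the feature inventory a, b, c, d from the example; the function name `augmented_vector` and the block order are illustrative choices, not something the patent specifies.

```python
# Feature inventory: the union of the feature types seen in both domains.
FEATURES = ["a", "b", "c", "d"]


def augmented_vector(active, domain):
    """Return the blocks [common | target | other] for one example.

    `active` maps a feature name to its value; `domain` is "target" or "other".
    The block that does not belong to the example's domain is filled with zeros.
    """
    base = [active.get(name, 0.0) for name in FEATURES]
    zeros = [0.0] * len(FEATURES)
    if domain == "target":
        return base + base + zeros
    if domain == "other":
        return base + zeros + base
    raise ValueError(f"unknown domain: {domain!r}")


# Target-domain document with features a and c; other-domain document with a, b, d.
target_vec = augmented_vector({"a": 1.0, "c": 1.0}, "target")
other_vec = augmented_vector({"a": 1.0, "b": 1.0, "d": 1.0}, "other")
```

Both vectors have 12 components: the first four are the shared copy, and only one of the two remaining blocks is non-zero for any given example.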

When the weight of each feature is optimized using the feature values computed in this way, a feature that appears in both the target domain and the other domain receives a large weight in "common". A feature that appears in only one of the target domain and the other domain is also given a weight in "common". By contrast, for a feature that conflicts between the target domain and the other domain, the weight in "target domain" or "other domain" becomes large, while the weight in "common" approaches 0.
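The weight behaviour described above can be observed on a toy problem. The following is only a sketch under several assumptions: it swaps in a binary perceptron as the learner (the text leaves the learning method open), it keeps one weight table per block rather than one long vector, and it uses two invented features, `shared` (same label in both domains) and `conflict` (opposite labels in the two domains).

```python
from collections import defaultdict


def score(weights, features, domain):
    """Add the shared weight and the domain-specific weight of each active feature."""
    return sum(
        (weights["common"][name] + weights[domain][name]) * value
        for name, value in features.items()
    )


def train(samples, epochs=5):
    """Binary perceptron over the three weight blocks (labels are +1 or -1)."""
    weights = {block: defaultdict(float) for block in ("common", "target", "other")}
    for _ in range(epochs):
        for features, domain, label in samples:
            if label * score(weights, features, domain) <= 0:  # misclassified
                for name, value in features.items():
                    # Only the common block and the example's own block are active.
                    weights["common"][name] += label * value
                    weights[domain][name] += label * value
    return weights


samples = [
    ({"shared": 1.0}, "target", +1),
    ({"shared": 1.0}, "other", +1),
    ({"conflict": 1.0}, "target", +1),
    ({"conflict": 1.0}, "other", -1),
]
weights = train(samples)
print(weights["common"]["shared"], weights["common"]["conflict"])  # prints "1.0 0.0"
```

In this toy run the common weight of `shared` stays positive, while the common weight of `conflict` cancels to 0.0 and the disagreement is absorbed by the target block (1.0) and the other block (-1.0).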

The embodiment of the present invention is further characterized in that pseudo-correct data is used as the second training data, which is the other domain used for transfer learning. Specifically, unlabeled documents are retrieved from the Web or the like, and an existing dictionary (one that stores pairs of a character string and a label) is used to assign labels to the matching positions, yielding pseudo training data. For example, if the dictionary contains the string-label pair "田中 (Tanaka) - person name", the training example "今日 / 田中 (pseudo-correct label = person name) / と / 会った" ("I met Tanaka today") is obtained from an unlabeled document.
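A minimal sketch of this pseudo-labeling step follows. It assumes the document is already split into tokens and that dictionary entries match single tokens exactly; the function name `pseudo_label`, the label string `person_name`, and the `NIL` marker for unmatched tokens are illustrative choices.

```python
def pseudo_label(tokens, dictionary, nil="NIL"):
    """Pair each token with its dictionary label, or with `nil` when it has no entry."""
    return [(token, dictionary.get(token, nil)) for token in tokens]


# Dictionary of (character string -> label) pairs, as in the Tanaka example.
dictionary = {"田中": "person_name"}
tokens = ["今日", "田中", "と", "会った"]
labeled = pseudo_label(tokens, dictionary)
```

Every token without a dictionary entry receives `NIL` here, whether or not it is actually a named entity, so the quality of the resulting labels depends entirely on the dictionary.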

However, this pseudo training data has the following problems.

First, the word-label correspondences in the dictionary may be wrong. For example, the label "product name" may be assigned to "織田信長" (Oda Nobunaga).

Second, words may be ambiguous. For example, the word "田中" (Tanaka) can be either a "person name" or a "place name"; the dictionary alone cannot take into account the surrounding context that would serve as a clue, so the label to be assigned cannot be determined.

Third, the possibility that other labels occur in the context cannot be taken into account. For example, in "今日 / 田中 (pseudo-correct label = person name) / と / 吉田 (the correct label is person name, but the word is not in the dictionary, so it becomes NIL) / に / 会った" ("I met Tanaka and Yoshida today"), no label can be assigned to words that are not in the dictionary.

To solve these problems, the embodiment of the present invention uses transfer learning: the pseudo-generated data is treated as the other domain, the original training data is treated as the target domain, and both are learned at the same time.

For ease of understanding, the embodiment of the present invention is described using named entity extraction as an example.

<Configuration of the Model Learning Apparatus According to the Embodiment of the Present Invention>

Next, the configuration of the model learning apparatus according to the embodiment of the present invention will be described. As shown in FIG. 1, the model learning apparatus 100 according to the embodiment of the present invention takes as input morphologically analyzed document data in which labels are assigned to character strings, and learns a model for extracting extraction-target character strings and assigning labels to them. The model learning apparatus 100 is implemented by a computer including a CPU, a RAM, and a ROM that stores a program for executing the model learning processing routine described later. Functionally, as shown in FIG. 1, the model learning apparatus 100 includes an input unit 10, an operation unit 20, and an output unit 50.

The input unit 10 receives, as a set of first training data, a set of morphologically analyzed documents (teacher data) in which correct labels have been assigned in advance to the extraction-target character strings. The input unit 10 also receives, as a set of second training data, a set of morphologically analyzed documents, such as Web text, to which no explicit labels have been assigned.

The operation unit 20 includes a teacher database 30, a document set database 32, a dictionary database 34, a dictionary matching unit 36, a transfer-learning feature generation unit 38, a model learning unit 40, and a model storage unit 42.

The teacher database 30 stores the set of first training data.

The document set database 32 stores the set of second training data.

The dictionary database 34 stores a dictionary containing pairs of an extraction-target character string and a label. In the embodiment of the present invention, a manually created dictionary is used. Alternatively, an extended named entity dictionary obtained by the technique described in Non-Patent Document 5, which stores pairs of a named entity and a named entity class label (for example, the pair "利根川 (Tone River) - River"), may be used. It is also possible to create a dictionary storing pairs of a named entity and a named entity class label from a dictionary without labels, by assigning a virtual label to each named entity.

The dictionary matching unit 36 performs dictionary lookup by matching each item of second training data stored in the document set database 32 against the string-label pair entries contained in the dictionary stored in the dictionary database 34. If, as a result of the lookup, the document of an item of second training data contains a character string that matches a character string stored in the dictionary in association with a named entity class label, the dictionary matching unit 36 extracts that document and assigns a pseudo-correct label to the matching character string.

To control the amount of pseudo training data, the following constraints may be imposed. For example, the number of matched instances may be limited to a total number of instances N. The number of matched instances may also be limited to N_L for each label, or to N_E for each dictionary entry. For example, when N_L = 100, each label is assigned to at most 100 instances of training data.
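The limits N, N_L, and N_E could be applied as in the following sketch. It assumes matches are processed in input order and that the first ones are kept; the text does not say how instances are selected, and the tuple layout `(doc_id, surface, label)` and the function name `cap_matches` are hypothetical.

```python
from collections import Counter


def cap_matches(matches, n_total=None, n_per_label=None, n_per_entry=None):
    """Keep matches in input order until a limit is reached.

    Each match is a (doc_id, surface, label) tuple. A limit of None means
    that the corresponding constraint is not applied.
    """
    kept = []
    by_label, by_entry = Counter(), Counter()
    for doc_id, surface, label in matches:
        if n_total is not None and len(kept) >= n_total:
            break  # total budget N is used up
        if n_per_label is not None and by_label[label] >= n_per_label:
            continue  # this label already has N_L instances
        if n_per_entry is not None and by_entry[(surface, label)] >= n_per_entry:
            continue  # this dictionary entry already has N_E instances
        kept.append((doc_id, surface, label))
        by_label[label] += 1
        by_entry[(surface, label)] += 1
    return kept


# 150 matches for one River entry plus a single Person match, with N_L = 100.
matches = [(i, "利根川", "River") for i in range(150)] + [(150, "田中", "Person")]
kept = cap_matches(matches, n_per_label=100)
```

With N_L = 100, the River label keeps its first 100 matches and the single Person match is unaffected, so 101 instances remain.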

The transfer-learning feature generation unit 38 extracts, for each item of first training data, the features of each word contained in the document of that item. The same features as in the conventional technique are used (Patent Document 1). Taking named entity extraction as an example, suppose that the document of the first training data is "新潟/を/流れる/<River:信濃川>/は/日本一/長い/川" ("The Shinano River, which flows through Niigata, is the longest river in Japan"), that the word of interest is "信濃川" (Shinano River), and that the named entity class label "river name" is assigned to it. In this case, the unit generates features representing label information such as "the word is a [river name]", word-internal information such as "the last character of the word is '川'", and surrounding context information such as "the next word is 'は'". As a result, featurized text as shown in FIG. 3(A) is output. Likewise, when the document of the second training data to which a pseudo-correct label has been assigned is "関東/の/<River:利根川>/は/すごい" ("The Tone River in Kanto is amazing"), the word of interest is "利根川" (Tone River), and the named entity class label "river name" is assigned to it, featurized text as shown in FIG. 3(B) is output. In the example of FIG. 3, a B tag is assigned to the first word of a named entity extraction span, an I tag is assigned to the other words in the span, and an O tag is assigned to all remaining words. In the present embodiment, sequential label features relating to the preceding and following labels when multiple named entities appear are not used.
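The word-internal and context features and the B/I/O tags described above might be produced as in this sketch. The feature names (`word=`, `last_char=`, `prev_word=`, `next_word=`), the boundary markers, and the span format are assumptions for illustration; the actual layout of FIG. 3 is not reproduced here.

```python
def token_features(tokens, i):
    """Surface and context features for tokens[i] (feature names are illustrative)."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<BOS>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "<EOS>"
    return [
        f"word={word}",
        f"last_char={word[-1]}",   # word-internal information
        f"prev_word={prev_word}",  # surrounding context information
        f"next_word={next_word}",
    ]


def bio_tags(n_tokens, spans):
    """Convert (start, end_exclusive, label) spans into B-/I-/O tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for j in range(start + 1, end):
            tags[j] = f"I-{label}"
    return tags


tokens = ["新潟", "を", "流れる", "信濃川", "は", "日本一", "長い", "川"]
tags = bio_tags(len(tokens), [(3, 4, "River")])  # 信濃川 is labeled River
rows = [(tag, token_features(tokens, i)) for i, tag in enumerate(tags)]
```

Each row pairs the tag of a word with its feature strings, which is the kind of per-word record that a sequence labeler such as a CRF consumes.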

The transfer-learning feature generation unit 38 then duplicates each feature extracted for each word of the first training data as a target domain feature, and further duplicates the target domain feature to obtain a common domain feature. Specifically, each extracted feature is duplicated with a prefix such as IN_ or COMMON_ placed before the identifier representing the feature type (see FIG. 4(A)). IN_ denotes the target domain, and COMMON_ denotes the common domain. For each word of the document of the first training data, the transfer-learning feature generation unit 38 then generates a feature vector consisting of the duplicated target domain features, other domain features whose values are set to 0, and the common domain features duplicated from the target domain features.

Similarly, the transfer-learning feature generation unit 38 extracts, for each item of second training data, the features of each word contained in the document of that item. The transfer-learning feature generation unit 38 duplicates each feature extracted for each word of the second training data as an other domain feature, and further duplicates the other domain feature to obtain a common domain feature. Specifically, each extracted feature is duplicated with a prefix such as OUT_ or COMMON_ placed before the identifier representing the feature type (see FIG. 4(B)). OUT_ denotes the other domain. For each word of the document of the second training data, the transfer-learning feature generation unit 38 then generates a feature vector consisting of target domain features whose values are set to 0, the duplicated other domain features, and the common domain features duplicated from the other domain features.
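The prefix-based duplication can be sketched as follows, assuming each feature is a string such as `last_char=川` and that the representation is sparse, so features whose value is 0 are simply omitted. The prefixes IN_, OUT_, and COMMON_ come from the text; the function name `augment` and the feature strings are illustrative.

```python
# Domain-specific prefixes named in the text; COMMON_ is shared by both domains.
PREFIX = {"target": "IN_", "other": "OUT_"}


def augment(features, domain):
    """Duplicate each feature string into a domain copy and a COMMON_ copy."""
    prefix = PREFIX[domain]
    return [prefix + f for f in features] + ["COMMON_" + f for f in features]


word_features = ["last_char=川", "next_word=は"]
in_features = augment(word_features, "target")   # word from the first training data
out_features = augment(word_features, "other")   # word from the second training data
```

Because the IN_ and OUT_ copies never co-occur on the same word, the learner can give the same underlying feature a different weight in each domain while the COMMON_ copy is shared.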

In this way, the transfer-learning feature generation unit 38 generates a transfer-learning feature file consisting of the feature vectors generated for each word of the first training data and the feature vectors generated for each word of the second training data.

The model learning unit 40 uses a machine learner to learn parameters for each of the target domain features, the other domain features, and the common domain features, on the basis of the transfer-learning feature file generated by the transfer-learning feature generation unit 38, which consists of the feature vector of each word of the first training data and the feature vector of each word of the second training data. Learning by the machine learner may use a conventionally known technique (for example, the technique described in Non-Patent Document 1), so its description is omitted.

The model learning unit 40 then extracts the parameters for the common domain features from among the parameters for the target domain features, the other domain features, and the common domain features obtained by learning, stores them in the model storage unit 42 as the parameters of a model for extracting the extraction-target character strings and assigning labels, and outputs them to the output unit 50.
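Extracting the common domain parameters can be as simple as filtering by prefix, as in this sketch. It assumes the learned parameters are available as a mapping from prefixed feature name to weight; stripping the COMMON_ prefix is an extra assumption so that the result lines up with unprefixed features at extraction time. The weights shown are invented numbers, and a real CRF would hold one weight per feature and label pair.

```python
def extract_common(params, prefix="COMMON_"):
    """Keep only the common domain parameters and drop the prefix from their names."""
    return {
        name[len(prefix):]: weight
        for name, weight in params.items()
        if name.startswith(prefix)
    }


# Hypothetical learned parameters; the values are for illustration only.
learned = {
    "IN_last_char=川": 0.4,
    "OUT_last_char=川": -0.1,
    "COMMON_last_char=川": 1.2,
    "COMMON_next_word=は": 0.3,
}
model = extract_common(learned)
```

The IN_ and OUT_ weights are discarded, so the output model has the same feature space as a model trained without augmentation.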

<Operation of the Model Learning Apparatus According to the Embodiment of the Present Invention>

Next, the operation of the model learning apparatus 100 according to the embodiment of the present invention will be described. When the input unit 10 receives the set of first training data and the set of second training data, the model learning apparatus 100 executes the model learning processing routine shown in FIG. 2.

First, in step S100, the dictionary stored in the dictionary database 34 is read.

Next, in step S102, the set of first training data stored in the teacher database 30 is acquired.

In step S104, an item of second training data to be processed is acquired from the set of second training data stored in the document set database 32.

In step S106, dictionary lookup is performed on the document of the second training data acquired in step S104, using the dictionary read in step S100. If the document contains a character string that matches a character string stored in the dictionary in association with a label, that document is extracted and a pseudo-correct label is assigned to the matching character string.

In step S108, it is determined whether the processing of step S106 has been executed for all items of second training data. If there is an item of second training data for which the processing of step S106 has not been executed, the routine returns to step S104 and that item becomes the processing target. If the processing of step S106 has been executed for all items of second training data, the routine proceeds to step S110.

In step S110, for each item of the first learning data received by the input unit 10, features are extracted for each word included in the document of that first learning data.
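Per-word feature extraction of this kind can be sketched as follows. The concrete features used here (the surface form of the word and of its immediate neighbours), the function name extract_features, and the boundary markers are hypothetical choices made only for illustration; they are not taken from the embodiment.

```python
from typing import Dict, List


def extract_features(words: List[str], i: int) -> Dict[str, float]:
    """Return sparse features for the i-th word of a tokenized document.

    The feature templates below are assumptions for illustration only.
    """
    prev_word = words[i - 1] if i > 0 else "<BOS>"
    next_word = words[i + 1] if i + 1 < len(words) else "<EOS>"
    return {
        f"w={words[i]}": 1.0,     # surface form of the current word
        f"w-1={prev_word}": 1.0,  # surface form of the preceding word
        f"w+1={next_word}": 1.0,  # surface form of the following word
    }
```

The same function would be applied to the second learning data in step S114, so that both domains share one feature space before the domain-specific copies are made.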

Next, in step S112, for each word of each document of the first learning data, the features extracted in step S110 are duplicated to obtain target domain features and common domain features, and a feature vector is generated that consists of the target domain features, the other domain features with their values set to 0, and the common domain features.

In step S114, for each item of the second learning data in which a pseudo-correct label was assigned to a character string in step S106, features are extracted for each word included in the document of that second learning data.

Next, in step S116, for each word of each document of the second learning data, the features extracted in step S114 are duplicated to obtain other domain features and common domain features, and a feature vector is generated that consists of the target domain features with their values set to 0, the other domain features, and the common domain features.
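The feature vectors of steps S112 and S116 can be illustrated with the following minimal sketch. It assumes sparse features held in a dict; the function name augment and the key prefixes target:, other:, and common: are hypothetical names introduced here.

```python
from typing import Dict


def augment(features: Dict[str, float], domain: str) -> Dict[str, float]:
    """Build a three-block feature vector from the features of one word.

    domain is "target" for first learning data (step S112) and
    "other" for second learning data (step S116).
    """
    if domain not in ("target", "other"):
        raise ValueError("domain must be 'target' or 'other'")
    inactive = "other" if domain == "target" else "target"
    vector: Dict[str, float] = {}
    for name, value in features.items():
        vector[f"{domain}:{name}"] = value   # copy for the word's own domain
        vector[f"{inactive}:{name}"] = 0.0   # block of the other domain is zeroed
        vector[f"common:{name}"] = value     # copy shared by both domains
    return vector
```

In a sparse representation the zero-valued entries could simply be omitted; they are written out here only so that the three blocks described in the text are visible.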

In step S118, based on the feature vector for each word of the first learning data generated in step S112 and the feature vector for each word of the second learning data generated in step S116, a machine learner learns parameters for each of the target domain features, the other domain features, and the common domain features.

In step S120, the parameters for the common domain features are extracted from the parameters learned in step S118. They are stored in the model storage unit 42 as the parameters of a model for extracting the character string to be extracted and assigning a label, and are output to the output unit 50, whereupon the process ends.
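Steps S118 and S120 can be sketched as follows. Note the substitutions: a simple multiclass perceptron that classifies each word independently stands in for the machine learner, which is not necessarily the learner used in the embodiment, and the names train_perceptron and extract_common as well as the common: key prefix are assumptions made for this illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = Dict[str, float]
Weights = Dict[Tuple[str, str], float]  # (label, feature name) -> parameter


def score(weights: Weights, vector: Vector, label: str) -> float:
    """Linear score of one label for one feature vector."""
    return sum(weights.get((label, n), 0.0) * v for n, v in vector.items())


def train_perceptron(
    examples: List[Tuple[Vector, str]], labels: List[str], epochs: int = 5
) -> Weights:
    """Step S118 (sketch): learn parameters for all three feature blocks."""
    weights: Weights = defaultdict(float)
    for _ in range(epochs):
        for vector, gold in examples:
            predicted = max(labels, key=lambda lab: score(weights, vector, lab))
            if predicted != gold:  # standard perceptron update on a mistake
                for name, value in vector.items():
                    weights[(gold, name)] += value
                    weights[(predicted, name)] -= value
    return dict(weights)


def extract_common(weights: Weights) -> Weights:
    """Step S120 (sketch): keep only the parameters of the common block."""
    prefix = "common:"
    return {
        (label, name[len(prefix):]): w
        for (label, name), w in weights.items()
        if name.startswith(prefix)
    }
```

Stripping the prefix in extract_common is a design choice of this sketch: it lets the stored parameters be applied directly to plain, non-augmented features at analysis time.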

Subsequently, when the process of extracting the character string to be extracted and assigning a label is performed, features are extracted from the data to be analyzed. Based on the extracted features and the parameters of the model for extracting the character string to be extracted and assigning a label, the character string to be extracted is extracted from the data to be analyzed and a label is assigned to it.
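The analysis-time use of the stored parameters can be sketched as follows. This is an illustration under stated assumptions: each word is labelled independently with a linear score, consecutive words sharing a label are merged into one extracted string, and the names predict and extract_strings and the outside label "O" are hypothetical. The embodiment may use a different decoding scheme.

```python
from typing import Dict, List, Tuple

Weights = Dict[Tuple[str, str], float]  # (label, feature name) -> parameter


def predict(features: Dict[str, float], model: Weights, labels: List[str]) -> str:
    """Pick the label with the highest linear score for one word."""
    return max(
        labels,
        key=lambda lab: sum(model.get((lab, n), 0.0) * v for n, v in features.items()),
    )


def extract_strings(
    words: List[str],
    feature_seq: List[Dict[str, float]],
    model: Weights,
    labels: List[str],
    outside: str = "O",
) -> List[Tuple[str, str]]:
    """Return (extracted string, label) pairs for the data to be analyzed."""
    results: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_label = outside
    for word, features in zip(words, feature_seq):
        label = predict(features, model, labels)
        if label != current_label and current_words:
            # the label changed, so the string collected so far is complete
            results.append((" ".join(current_words), current_label))
            current_words = []
        if label != outside:
            current_words.append(word)
        current_label = label
    if current_words:
        results.append((" ".join(current_words), current_label))
    return results
```

Joining words with a space suits the English example data used here; for unsegmented languages such as Japanese the words would be concatenated without a separator.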

As described above, the model learning device according to the embodiment of the present invention uses a dictionary to assign pseudo-correct labels to character strings in the unlabeled second learning data, and extracts features for each word of the first learning data and each word of the second learning data. For each of these words, it generates a feature vector consisting of target domain features, other domain features, and common domain features, learns parameters for each of the target domain features, the other domain features, and the common domain features, and extracts the parameters for the common domain features. In this way, a model for accurately assigning labels to the character strings to be extracted can be learned.

Furthermore, expanding the learning data in a pseudo manner with a dictionary and applying a transfer learning technique makes it possible to augment the learning data, so that a highly accurate character string extraction model can be obtained.

In addition, by applying the model learned according to the present invention to a named entity extraction device, named entities can be extracted from a document and assigned appropriate labels.

Note that the present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, although the above embodiment was described using named entity extraction as an example, the invention is also applicable to predicate extraction. In that case, learning data for predicates is used as the learning data, and a dictionary for predicates is used as the dictionary. Needless to say, by supplying learning data and a dictionary suited to the purpose, the invention can also be applied to tasks other than named entity extraction and predicate extraction.

Although this specification describes an embodiment in which the program is installed in advance, the program can also be provided stored on a computer-readable recording medium.

DESCRIPTION OF SYMBOLS
10 Input unit
20 Computation unit
30 Teacher database
32 Document set database
34 Dictionary database
36 Dictionary collation unit
38 Transfer learning feature generation unit
40 Model learning unit
42 Model storage unit
50 Output unit
100 Model learning device

Claims

A model learning apparatus comprising:
an input unit that receives input of first learning data, which is a document in which a label has been assigned in advance to a character string to be extracted, and second learning data, which is a document to which the label has not been assigned;
a dictionary collation unit that collates character strings included in the document of the second learning data against a dictionary in which character strings and the label are stored in association with each other, and assigns the label to a character string in the document of the second learning data that matches a character string stored in the dictionary in association with the label;
a feature generation unit that extracts features from the first learning data as target domain features, duplicates the target domain features as common domain features, and generates a feature vector consisting of the target domain features, other domain features, and the common domain features, and that extracts features from the second learning data in which the label has been assigned to the character string as the other domain features, duplicates the other domain features as the common domain features, and generates the feature vector; and
a learning unit that learns parameters for each of the target domain features, the other domain features, and the common domain features based on the feature vector of the first learning data and the feature vector of the second learning data, extracts the parameters for the common domain features from the learned parameters, and outputs the extracted parameters as parameters of a model for extracting the character string to be extracted and assigning the label.

The model learning apparatus according to claim 1, wherein
the character string to be extracted is a named entity,
the label is a named entity category, and
the model is a model for extracting the named entity and assigning the named entity category to the named entity.

A model learning method including:
a step in which an input unit receives input of first learning data, which is a document in which a label has been assigned in advance to a character string to be extracted, and second learning data, which is a document to which the label has not been assigned;
a step in which a dictionary collation unit collates character strings included in the document of the second learning data against a dictionary in which character strings and the label are stored in association with each other, and assigns the label to a character string in the document of the second learning data that matches a character string stored in the dictionary in association with the label;
a step in which a feature generation unit extracts features from the first learning data as target domain features, duplicates the target domain features as common domain features, and generates a feature vector consisting of the target domain features, other domain features, and the common domain features, and extracts features from the second learning data in which the label has been assigned to the character string as the other domain features, duplicates the other domain features as the common domain features, and generates the feature vector; and
a step in which a learning unit learns parameters for each of the target domain features, the other domain features, and the common domain features based on the feature vector of the first learning data and the feature vector of the second learning data, extracts the parameters for the common domain features from the learned parameters, and outputs the extracted parameters as parameters of a model for extracting the character string to be extracted and assigning the label.

A program for causing a computer to function as each unit constituting the model learning apparatus according to claim 1 or 2.