JP2017204219A

JP2017204219A - Model learning apparatus, word extraction apparatus, method and program

Info

Publication number: JP2017204219A
Application number: JP2016096905A
Authority: JP
Inventors: Kugatsu Sadamitsu (九月貞光); Yoshihiro Matsuo (松尾 義博); Ryuichiro Higashinaka (東中 竜一郎); Yukinori Homma (幸徳本間)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-05-13
Filing date: 2016-05-13
Publication date: 2017-11-16

Abstract

PROBLEM TO BE SOLVED: To learn a model that accurately extracts the word corresponding to a question sentence when there are a plurality of other domains to learn from. SOLUTION: For each of a plurality of original domains, a sequence model featurization unit 38 extracts, for each word included in a question sentence and for each table component, a feature of the original domain, and duplicates it as a feature of a common domain. A sequence model learning unit 40 learns a sequence model 42 from the features of the original domain and the features of the common domain obtained for each of the plurality of original domains. A regression model featurization unit 44 extracts regression model features. A regression model learning unit 46 learns a regression model 48. SELECTED DRAWING: Figure 2

Description

The present invention relates to a model learning device, a word extraction device, a method, and a program, and more particularly to a model learning device, a word extraction device, a method, and a program for extracting from a sentence the words needed to answer a question.

Conventionally, a technique is known that uses a given sentence and a database storing knowledge to extract from the sentence a word string close to an expression that exists in the database. For example, with a database structured as triples, if the question sentence is found to contain two of the expressions in the database, the remaining one can be presented as the answer.

In addition, as a way of calculating the similarity of a word pair, a method is known that vectorizes the words in a semantic space and then measures their similarity (see Non-Patent Document 3).

Furthermore, a method is known that achieves highly accurate word string extraction at low cost even when no learning data exists for the target domain, by reusing learning data that exists for another domain (Non-Patent Document 5).

Non-Patent Document 1: K. Yao et al., "Recurrent Conditional Random Field for Language Understanding", ICASSP 2014.
Non-Patent Document 2: Mikolov, Tomas, et al., "Efficient estimation of word representations in vector space", arXiv preprint arXiv:1301.3781 (2013).
Non-Patent Document 3: Lafferty, John, Andrew McCallum, and Fernando C. N. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data" (2001).
Non-Patent Document 4: Hal Daume III and Daniel Marcu, "Domain adaptation for statistical classifiers", Journal of Artificial Intelligence Research, pages 101-126 (2006).
Non-Patent Document 5: Yazdani, Majid and Henderson, James, "A Model of Zero-Shot Learning of Spoken Language Understanding", EMNLP 2015.

However, the method described in Non-Patent Document 5 assumes that the other domains are independent of one another, and its accuracy is therefore low.

The present invention has been made in view of the above circumstances, and an object thereof is to provide a model learning device, method, and program capable of learning a model that can accurately extract the word corresponding to a question sentence when there are a plurality of learned other domains.

Another object is to provide a word extraction device, method, and program capable of accurately extracting the word corresponding to a question sentence.

To achieve the above object, a model learning device according to the present invention is a model learning device for extracting, from among words representing entries of a database of an unlearned target domain, the word representing the entry that corresponds to an input question sentence of the target domain. The device includes: a sequence model featurization unit that, for each of a plurality of original domains, based on a set of question sentences of the original domain annotated with words representing entries of the database of the original domain and labels indicating table components, uses semantic vectors created in advance for each word to extract, for each word included in a question sentence and for each table component, the similarity to the words representing the entries of that table component as a feature of the original domain, and duplicates the feature of the original domain as a feature of a common domain; a sequence model learning unit that learns a sequence model for extracting the table component corresponding to a word chunk, based on the features of the original domain and the features of the common domain obtained for each word included in the question sentences for each of the plurality of original domains, and on the labels assigned to each word included in the question sentences; a regression model featurization unit that, for each of the plurality of original domains, based on the set of question sentences of the original domain, uses the semantic vector of each word to extract, for each labeled word chunk, the similarity to each of the words representing the entries of the table component indicated by the label assigned to the word chunk, as regression model features; and a regression model learning unit that learns a regression model for extracting the word representing the entry corresponding to a word, based on the regression model features obtained for each labeled word chunk for each of the plurality of original domains, and on the labels assigned to each of the word chunks.

A model learning method according to the present invention is a model learning method in a model learning device for extracting, from among words representing entries of a database of an unlearned target domain, the word representing the entry that corresponds to an input question sentence of the target domain. In the method, a sequence model featurization unit, for each of a plurality of original domains, based on a set of question sentences of the original domain annotated with words representing entries of the database of the original domain and labels indicating table components, uses semantic vectors created in advance for each word to extract, for each word included in a question sentence and for each table component, the similarity to the words representing the entries of that table component as a feature of the original domain, and duplicates the feature of the original domain as a feature of a common domain; a sequence model learning unit learns a sequence model for extracting the table component corresponding to a word chunk, based on the features of the original domain and the features of the common domain obtained for each word included in the question sentences for each of the plurality of original domains, and on the labels assigned to each word included in the question sentences; a regression model featurization unit, for each of the plurality of original domains, based on the set of question sentences of the original domain, uses the semantic vector of each word to extract, for each labeled word chunk, the similarity to each of the words representing the entries of the table component indicated by the label assigned to the word chunk, as regression model features; and a regression model learning unit learns a regression model for extracting the word representing the entry corresponding to a word, based on the regression model features obtained for each labeled word chunk for each of the plurality of original domains, and on the labels assigned to each of the word chunks.

A word extraction device according to the present invention is a word extraction device that extracts, from among words representing entries of a database of an unlearned target domain, the word representing the entry that corresponds to an input question sentence of the target domain. The device includes: an application-time sequence model featurization unit that uses semantic vectors created in advance for each word to extract, for each word included in the question sentence and for each table component of the database of the target domain, the similarity to the words representing the entries of that table component as a feature of the common domain; a sequence model application unit that assigns a label representing a table component to each word chunk included in the question sentence, based on the sequence model learned by the above model learning device and on the common-domain features for each table component extracted for each word included in the question sentence by the application-time sequence model featurization unit; an application-time regression model featurization unit that uses the semantic vector of each word to extract, for each word chunk labeled by the sequence model application unit, the similarity to each of the words representing the entries of the table component indicated by the label assigned to the word chunk, as regression model features; and a regression model application unit that extracts the word representing the entry of the database of the target domain that corresponds to the question sentence, based on the regression model learned by the model learning device and on the regression model features of each labeled word chunk extracted by the application-time regression model featurization unit.

A word extraction method according to the present invention is a word extraction method in a word extraction device that extracts, from among words representing entries of a database of an unlearned target domain, the word representing the entry that corresponds to an input question sentence of the target domain. In the method, an application-time sequence model featurization unit uses semantic vectors created in advance for each word to extract, for each word included in the question sentence and for each table component of the database of the target domain, the similarity to the words representing the entries of that table component as a feature of the common domain; a sequence model application unit assigns a label representing a table component to each word chunk included in the question sentence, based on the sequence model learned by the above model learning method and on the common-domain features for each table component extracted for each word included in the question sentence by the application-time sequence model featurization unit; an application-time regression model featurization unit uses the semantic vector of each word to extract, for each word chunk labeled by the sequence model application unit, the similarity to each of the words representing the entries of the table component indicated by the label assigned to the word chunk, as regression model features; and a regression model application unit extracts the word representing the entry of the database of the target domain that corresponds to the question sentence, based on the regression model learned by the model learning method and on the regression model features of each labeled word chunk extracted by the application-time regression model featurization unit.

A program according to the present invention is a program for causing a computer to function as each unit of the above model learning device or word extraction device.

According to the model learning device, method, and program of the present invention, for each of a plurality of original domains, the similarity to the words representing the entries of a table component is extracted as a feature of the original domain and duplicated as a feature of the common domain, and a sequence model for extracting the table component corresponding to a word chunk is learned from the features of the original domain and the features of the common domain obtained for each of the plurality of original domains. This provides the effect that, when there are a plurality of other domains to learn from, a model that can accurately extract the word corresponding to a question sentence can be learned.

According to the word extraction device, method, and program of the present invention, the word corresponding to a question sentence can be extracted accurately.

An abstract diagram showing the relationship between the table components in the databases of the plurality of original domains to be learned and of the unlearned target domain.
A block diagram showing the configuration of the model learning device according to the embodiment of the present invention.
A block diagram showing the configuration of the word extraction device according to the embodiment of the present invention.
A flowchart showing the model learning processing routine in the model learning device according to the embodiment of the present invention.
A flowchart showing the word extraction processing routine in the word extraction device according to the embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<Outline of the Embodiment of the Present Invention>
First, an outline of the embodiment of the present invention will be described.

Here, the problem solved in the embodiment of the present invention will be described. As shown in FIG. 1, at learning time the plurality of original domains to be learned are assumed to be a toy domain and a travel domain. A word extraction model is learned from a learning-time input sentence "本はいくら？" ("How much is the book?"), the annotation result for that sentence (for example, "本 (book) ⇒ subject-絵本 (picture book)" and "いくら (how much) ⇒ predicate-価格 (price)"), and the original domain 1 DB and original domain 2 DB, each made up of three table components. The table components are the elements that make up a database of subject/predicate/object triples. Here, the object is the value of the entry at which a subject and a predicate intersect in the table. This embodiment is described using triples as an example, but it is also applicable to tuples of four or more elements.
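The subject/predicate/object structure described above can be pictured with a small data-structure sketch. The table contents, the names `TOY_DB`, `lookup_object`, and `table_components` are hypothetical and chosen only for illustration; the patent does not prescribe a storage format.

```python
# Hypothetical toy-domain table: each (subject, predicate) cell holds an object.
# All entries are made up for illustration.
TOY_DB = {
    ("picture book", "price"): "1,200 yen",
    ("picture book", "store"): "bookstore",
    ("building blocks", "price"): "3,000 yen",
}


def lookup_object(db, subject, predicate):
    """Return the object where subject and predicate intersect, or None."""
    return db.get((subject, predicate))


def table_components(db):
    """Collect the distinct entries of each of the three table components."""
    return {
        "subject": sorted({s for s, _ in db}),
        "predicate": sorted({p for _, p in db}),
        "object": sorted(set(db.values())),
    }
```

Once a question has been mapped to a subject entry and a predicate entry, the remaining element of the triple is a single dictionary lookup, which is why the rest of the description concentrates on the mapping step.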

The learned model is then applied to the home appliance domain, which is an unlearned target domain. Given an application-time input sentence "掃除機はいくら" ("How much is the vacuum cleaner") and the target domain DB as input, the output is the correspondence between the words of the input sentence and the table components and entries, for example "「掃除機」 (vacuum cleaner) ⇒ subject-掃除機 (vacuum cleaner)" and "「いくら」 (how much) ⇒ predicate-値段 (price)".

In the embodiment of the present invention, the above problem is solved in two stages. The first stage uses an extraction model at a high level of abstraction, namely the level of the semantic structure (table components) that the databases share.

In the second stage, the table components are subdivided and linked to entries, which is treated as a classification problem. When the learning database of the learned original domain differs from the database of the target domain, a direct classification-based approach over the learning database is not possible, so an approach that computes the similarity to each entry of the target domain is adopted instead. This makes it possible to extract words linked to entries even in an unknown domain.

Furthermore, because the conventional technique cannot make full use of a plurality of learned other domains, the embodiment of the present invention performs featurization using the technique of Non-Patent Document 4, which is prior art in transfer learning, and gives weight to information common across domains at learning time. This improves accuracy in the target domain when there are a plurality of other domains.

<Configuration of the Model Learning Device According to the Embodiment of the Present Invention>
Next, the configuration of the model learning device according to the embodiment of the present invention will be described. As shown in FIG. 2, the model learning device 100 according to the embodiment of the present invention can be configured by a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing a model learning processing routine described later. Functionally, the model learning device 100 includes an input unit 10 and a calculation unit 20, as shown in FIG. 2.

The input unit 10 receives, for the plurality of original domains to be learned, an annotated original domain 1 question sentence set 21A, an annotated original domain 2 question sentence set 21B, an original domain 1 DB 22A, and an original domain 2 DB 22B. An annotated original-domain question sentence is a question sentence to which slashes indicating morpheme boundaries and the correspondence between table components and entries have been added, for example "<subj=絵本>本</subj>は<pred=販売店>どこ/で/買える</pred>の？" ("Where can I buy the book?", in which 本 (book) is mapped to the subject entry 絵本 (picture book) and どこ/で/買える (where / at / can buy) is mapped to the predicate entry 販売店 (store)). Original domain 1 is, for example, the toy domain, and original domain 2 is the travel domain.

The input unit 10 also receives a semantic vector model 32.

The calculation unit 20 includes the annotated original domain 1 question sentence set 21A, the original domain 1 DB 22A, the annotated original domain 2 question sentence set 21B, the original domain 2 DB 22B, the semantic vector model 32, a sequence model featurization unit 38, a sequence model learning unit 40, a sequence model 42, a regression model featurization unit 44, a regression model learning unit 46, and a regression model 48.

The semantic vector model 32 is a model that can output, for example, that the semantic similarity between 本 (book) and 絵本 (picture book) is 0.5. The semantic vector model 32 may be learned in advance by building a vector model from the words contained in unlabeled text. The existing modeling method described in Non-Patent Document 2 is used for the modeling.
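As a rough illustration of how a semantic vector model can turn two words into a similarity score, the sketch below uses cosine similarity over word vectors. Cosine similarity is an assumption here (a common choice for word vectors), not something the description specifies, and the three-dimensional vectors in `VECTORS` are invented placeholders rather than learned values.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors; a learned model would have far more dimensions.
VECTORS = {
    "book": [1.0, 0.0, 0.0],
    "picture book": [1.0, 1.0, 0.0],
    "where": [0.0, 0.0, 1.0],
}


def semantic_similarity(word_a, word_b, vectors=VECTORS):
    """Similarity of two words; 0.0 when either word is out of vocabulary."""
    if word_a not in vectors or word_b not in vectors:
        return 0.0
    return cosine_similarity(vectors[word_a], vectors[word_b])
```

Returning 0.0 for out-of-vocabulary words is one simple fallback; a real system might back off to the surface similarity instead.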

Using the semantic vector model 32, the sequence model featurization unit 38 extracts, for each word included in a question sentence in the annotated original domain 1 question sentence set 21A and for each table component of the original domain 1 DB 22A, the similarity to the words representing the entries of that table component as a feature of original domain 1, and duplicates it as a feature of the common domain. It is assumed that the annotation data of the question sentences includes the mapping down to individual entries. Of the annotation data, the sequence model featurization unit 38 uses only the mapping information to table components. A header representing original domain 1 (for example, "toy-") is attached to the features of original domain 1, and a header representing the common domain (for example, "common-") is attached to the features of the common domain.

For example, taking the output labels of the sequence model to be the table components of the original domain 1 DB 22A, the sequence model featurization unit 38 extracts the features of original domain 1 and the features of the common domain for each word included in a question sentence and for each table component by the following processing.

The sequence model featurization unit 38 extracts the surface similarity and the semantic similarity to each entry and, for each table component of the original domain 1 DB 22A, takes the maximum of the surface similarities and of the semantic similarities over the entries of that table component as the original domain 1 features of that table component. For example, if there are three table components, subject, predicate, and object, a surface similarity and a semantic similarity are extracted for each of the three.
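The per-component aggregation described above can be sketched as follows, assuming edit distance as the surface measure. Because edit distance is a distance rather than a similarity, this sketch keeps the smallest distance as the "best" surface score; that reading, the feature names, and the function names `edit_distance` and `component_features` are assumptions for illustration. The semantic similarity function is passed in as a parameter, and each component is assumed to have at least one entry.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def component_features(word, components, semantic_similarity):
    """For each table component, keep the best score over its (non-empty) entries."""
    features = {}
    for name, entries in components.items():
        # Closest entry on the surface: smallest edit distance.
        features[name + " edit distance"] = min(
            edit_distance(word, entry) for entry in entries)
        # Closest entry in meaning: largest semantic similarity.
        features[name + " semantic similarity"] = max(
            semantic_similarity(word, entry) for entry in entries)
    return features
```

With `components = {"subj": ["絵本", "積み木"]}`, the word 本 is one edit away from 絵本, so the subject edit-distance feature is 1 regardless of how far 積み木 is.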

The surface similarity is, for example, the edit distance between the target word in the question sentence and each entry, and the maximum of the surface similarities to the entries of a table component is extracted as the surface similarity of that table component. For example, the similarity is calculated from results for the target word and an entry such as "本 ⇔ subject: 絵本, edit distance = 1, character overlap rate = 0.5, word match rate = 0" to obtain the surface similarity. The semantic similarity is the similarity calculated with the semantic vector model 32 for a pair of the target word and an entry, and the maximum of the semantic similarities to the entries of a table component is extracted as the semantic similarity of that table component. For example, the similarity is calculated from results for the target word and an entry such as "本 ⇔ subject: 絵本, semantic similarity = 0.5", and the maximum value is extracted as the semantic similarity between the target word and the table component. An example of the sequence model featurization output for the table components is as follows. If the target word of interest is 本 (book), the following features of original domain 1 and of the common domain are extracted.

Correct label = B-subj
toy-subj edit distance = 1 toy-subj semantic similarity = 0.5 toy-pred edit distance = 2 toy-pred semantic similarity = 0.1 toy-obj edit distance = 2 toy-obj semantic similarity = 0.1
common-subj edit distance = 1 common-subj semantic similarity = 0.5 common-pred edit distance = 2 common-pred semantic similarity = 0.1 common-obj edit distance = 2 common-obj semantic similarity = 0.1

If the target word of interest is は (the topic particle), the following features of original domain 1 and of the common domain are extracted.

Correct label = O
toy-subj edit distance = 2 toy-subj semantic similarity = 0.1 toy-pred edit distance = 2 toy-pred semantic similarity = 0.1 toy-obj edit distance = 2 toy-obj semantic similarity = 0.1
common-subj edit distance = 2 common-subj semantic similarity = 0.1 common-pred edit distance = 2 common-pred semantic similarity = 0.1 common-obj edit distance = 2 common-obj semantic similarity = 0.1

If the target word of interest is どこ (where), the following features of original domain 1 and of the common domain are extracted.

Correct label = B-pred
toy-subj edit distance = 6 toy-subj semantic similarity = 0.1 toy-pred edit distance = 6 toy-pred semantic similarity = 0.1 toy-obj edit distance = 6 toy-obj semantic similarity = 0.1
common-subj edit distance = 6 common-subj semantic similarity = 0.1 common-pred edit distance = 6 common-pred semantic similarity = 0.1 common-obj edit distance = 6 common-obj semantic similarity = 0.1

Here, the B/I/O at the head of a correct label means the following: B marks the first word of a word string to be extracted, I marks a word of that word string other than the first, and O marks a word that is not extracted. In the toy domain, the headers "toy-" and "common-" are attached to the features as shown above.
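The header scheme in the examples above amounts to emitting every feature twice at learning time, once under the domain header and once under "common-". A minimal sketch follows, with hypothetical function names `add_domain_copies` and `target_domain_features`; the second function reflects the statement that, at application time, only common-domain features are extracted for the unlearned target domain.

```python
def add_domain_copies(features, domain):
    """Learning time: copy each feature under '<domain>-' and under 'common-'."""
    augmented = {}
    for name, value in features.items():
        augmented[f"{domain}-{name}"] = value
        augmented[f"common-{name}"] = value
    return augmented


def target_domain_features(features):
    """Application time: the unlearned target domain only has the common copy."""
    return {f"common-{name}": value for name, value in features.items()}
```

Because the "common-" copies receive evidence from every original domain while the "toy-" or "travel-" copies see only their own data, weights learned on the common copies are the ones that carry over to a new domain.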

Likewise, using the semantic vector model 32, the sequence model featurization unit 38 extracts, for each word included in a question sentence in the annotated original domain 2 question sentence set 21B and for each table component of the original domain 2 DB 22B, the similarity to the words representing the entries of that table component as a feature of original domain 2, in the same way as for the features of original domain 1, and duplicates it as a feature of the common domain. A header representing original domain 2 (for example, "travel-") is attached to the features of original domain 2, and a header representing the common domain (for example, "common-") is attached to the features of the common domain.

The sequence model learning unit 40 learns the sequence model 42 for extracting table components, using an existing method such as CRF (Non-Patent Document 3). The learning is based on: the original domain 1 features and common domain features for each table component, extracted by the sequence model featurization unit 38 for each word included in the question sentences of the annotated original domain 1 question sentence set 21A; the labels assigned to the question sentences of the annotated original domain 1 question sentence set 21A; the original domain 2 features and common domain features for each table component, extracted by the sequence model featurization unit 38 for each word included in the question sentences of the annotated original domain 2 question sentence set 21B; and the labels assigned to the question sentences of the annotated original domain 2 question sentence set 21B. The sequence model 42 consists of weight parameters for each feature corresponding to each label (table component). With the learned sequence model, a label can be assigned to a word chunk, that is, a single word or a word string of two or more concatenated words, as in "<subj>本</subj>は<pred>どこ/で/買える</pred>の" (roughly "Where can I buy a book?", where 本 "book" is the subj chunk and どこ/で/買える "where / at / can buy" is the pred chunk).
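To make the relation between per-word B/I/O labels and labeled word chunks concrete, the following sketch groups tagged words into chunks. The function name `bio_to_chunks` is hypothetical, and the tag sequence is an illustrative assumption for the example sentence, not output from a trained model.

```python
from typing import List, Optional, Tuple


def bio_to_chunks(words: List[str], tags: List[str]) -> List[Tuple[str, List[str]]]:
    """Group words tagged B-x / I-x / O into (label, words) chunks."""
    chunks: List[Tuple[str, List[str]]] = []
    label: Optional[str] = None
    current: List[str] = []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new chunk, closing any open one.
            if label is not None:
                chunks.append((label, current))
            label, current = tag[2:], [word]
        elif tag.startswith("I-") and label == tag[2:]:
            # An I- tag with the same label extends the open chunk.
            current.append(word)
        else:
            # "O" (or an I- tag that does not match) closes any open chunk.
            if label is not None:
                chunks.append((label, current))
            label, current = None, []
    if label is not None:
        chunks.append((label, current))
    return chunks


words = ["本", "は", "どこ", "で", "買える", "の"]
tags = ["B-subj", "O", "B-pred", "I-pred", "I-pred", "O"]
chunks = bio_to_chunks(words, tags)
```

Here `chunks` holds one subj chunk containing 本 and one pred chunk containing どこ, で, 買える, matching the tagged example sentence.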

Using the semantic vector of each word included in the semantic vector model 32, the regression model featurization unit 44 extracts, for each labeled word chunk included in the question sentences of the annotated original domain 1 question sentence set 21A, the similarity between the word chunk and the word representing each entry candidate of the table component indicated by the label, as a regression model feature.

Unlike the sequence model featurization unit 38, the regression model featurization unit 44 performs featurization per entry candidate. Through the following processing, for each labeled word chunk included in a question sentence, it computes features for each entry candidate of the table component of the original domain 1 DB 22A that is indicated by the label assigned to the word chunk.

Specifically, for each word chunk included in a question sentence, the regression model featurization unit 44 extracts, as regression model features, the surface similarity and the semantic similarity between the word chunk and each entry candidate, in the original domain 1 DB 22A, of the table component indicated by the label assigned to the word chunk. These similarities are computed in the same way as in the sequence model featurization unit 38.
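A minimal sketch of per-candidate featurization is shown below. It makes several assumptions not stated in the text: the surface measure is a Levenshtein edit distance (as the feature names in the examples suggest), the semantic measure is the cosine between vectors, a chunk vector is the average of its word vectors, and the two-dimensional vectors are invented for illustration. All function names are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_vector(words: List[str], vectors: Dict[str, List[float]]) -> Optional[List[float]]:
    """Average the vectors of the known words (an assumed way to embed a chunk)."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    return [sum(col) / len(known) for col in zip(*known)]


def regression_features(chunk: List[str], entry: str,
                        vectors: Dict[str, List[float]]) -> Dict[str, float]:
    """Features for one (word chunk, entry candidate) pair."""
    chunk_vec = mean_vector(chunk, vectors)
    entry_vec = vectors.get(entry)
    semantic = cosine(chunk_vec, entry_vec) if chunk_vec and entry_vec else 0.0
    return {
        "edit_distance": float(edit_distance("".join(chunk), entry)),
        "semantic_similarity": semantic,
    }


# Made-up 2-d vectors: 本 (book), 絵本 (picture book), ぬいぐるみ (stuffed toy).
vectors = {"本": [1.0, 0.0], "絵本": [0.9, 0.1], "ぬいぐるみ": [0.0, 1.0]}
f_book = regression_features(["本"], "絵本", vectors)
f_toy = regression_features(["本"], "ぬいぐるみ", vectors)
```

One feature dictionary is produced per entry candidate, which is the difference from the sequence model featurization, where features are produced per table component.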

Likewise, using the semantic vector of each word included in the semantic vector model 32, the regression model featurization unit 44 extracts, for each labeled word chunk included in the question sentences of the annotated original domain 2 question sentence set 21B, the similarity between the word chunk and the word representing each entry candidate of the table component indicated by the label, as a regression model feature.

The regression model learning unit 46 learns the regression model 48 for extracting a word representing an entry, using an existing method such as a logistic regression model. The learning is based on: the regression model features for each entry candidate, extracted by the regression model featurization unit 44 for each word chunk included in the question sentences of the annotated original domain 1 question sentence set 21A; the labels assigned to each question sentence of the annotated original domain 1 question sentence set 21A; the regression model features for each entry candidate, extracted by the regression model featurization unit 44 for each word chunk included in the question sentences of the annotated original domain 2 question sentence set 21B; and the labels assigned to each question sentence of the annotated original domain 2 question sentence set 21B. Specifically, regression learning is performed by assigning the value 1 to a correctly annotated pair of a labeled word chunk (surface string) in a question sentence and a word representing an entry (for example, "1: 本-絵本", that is, book / picture book), and the value 0 to every other pair (for example, "0: 本-ぬいぐるみ", that is, book / stuffed toy).
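The 1/0 targets for pairs can be illustrated with a small logistic regression fitted by stochastic gradient descent. This is a sketch under assumptions: the text only says that an existing logistic regression method may be used, so the optimizer, learning rate, and the two feature values per pair (edit distance, semantic similarity) are illustrative choices, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(x: List[float], w: List[float], b: float) -> float:
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))


def train_logistic(rows: List[List[float]], targets: List[int],
                   epochs: int = 200, lr: float = 0.5) -> Tuple[List[float], float]:
    """Fit weights and bias by plain stochastic gradient descent on the log loss."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, targets):
            g = predict(x, w, b) - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


# (target, word chunk, entry candidate, [edit distance, semantic similarity])
# Target 1 marks the correctly annotated pair, 0 every other pair.
# The feature values are made-up illustrative numbers.
pairs = [
    (1, "本", "絵本", [1.0, 0.9]),
    (0, "本", "ぬいぐるみ", [5.0, 0.1]),
]
rows = [features for _, _, _, features in pairs]
targets = [target for target, _, _, _ in pairs]
w, b = train_logistic(rows, targets)
scores = {entry: predict(features, w, b) for _, _, entry, features in pairs}
```

After fitting, the correct pair receives a higher output value than the incorrect one, which is the property later used to rank entry candidates.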

<Configuration of Word Extraction Device According to Embodiment of the Present Invention>
Next, the configuration of the word extraction device according to the embodiment of the present invention will be described. As shown in FIG. 3, the word extraction device 200 according to the embodiment of the present invention can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing a word extraction processing routine described later. Functionally, as shown in FIG. 3, the word extraction device 200 includes an input unit 210, a calculation unit 220, and an output unit 250.

The input unit 210 receives a target domain question sentence, which is a question sentence about an unlearned target domain, and the target domain DB 225 of that target domain. In the following description, the target domain question sentence is simply referred to as the question sentence.

The calculation unit 220 includes the semantic vector model 32, the sequence model 42, the regression model 48, an application-time sequence model featurization unit 238, a sequence model application unit 240, an application-time regression model featurization unit 244, and a regression model application unit 246.

The semantic vector model 32, the sequence model 42, and the regression model 48 store the same models as those of the model learning device 100 described above.

Using the semantic vector model 32 created in advance, the application-time sequence model featurization unit 238 extracts, for each word included in the question sentence and for each table component of the target domain DB 225, the similarity between the word and the words representing the entries of that table component, as a common domain feature. Specifically, it extracts features by the same processing as the sequence model featurization unit 38 of the model learning device 100 described above, except for the points described below.

The application-time sequence model featurization unit 238 extracts the surface similarity and the semantic similarity with each entry of the target domain DB 225 and, for each table component of the target domain DB 225, takes the maximum of the surface similarity and of the semantic similarity over the entries of that table component as the features of that table component. The extracted features are treated as common domain features, and a header representing the common domain (for example, "common-") is prepended to them.

An example of the sequence model featurization output for the table components is as follows. If the target word of interest is "掃除機" (vacuum cleaner), the following common domain features are extracted.

common-subj edit distance = 1 common-subj semantic similarity = 0.5 common-pred edit distance = 2 common-pred semantic similarity = 0.1 common-obj edit distance = 2 common-obj semantic similarity = 0.1

If the target word of interest is "は" (the topic particle), the following common domain features are extracted.

common-subj edit distance = 2 common-subj semantic similarity = 0.1 common-pred edit distance = 2 common-pred semantic similarity = 0.1 common-obj edit distance = 2 common-obj semantic similarity = 0.1

If the target word of interest is "どこ" (where), the following common domain features are extracted.

common-subj edit distance = 6 common-subj semantic similarity = 0.1 common-pred edit distance = 6 common-pred semantic similarity = 0.1 common-obj edit distance = 6 common-obj semantic similarity = 0.1
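The application-time featurization, which keeps only the best value per table component and emits only "common-" features, might look like the sketch below. It assumes every measure is a similarity where a larger value means a closer match; a raw edit distance would have to be converted to a similarity or minimized instead. The character-overlap measure is a hypothetical stand-in for the surface similarity, and the small database is invented.

```python
from typing import Callable, Dict, List

Measure = Callable[[str, str], float]


def char_overlap(a: str, b: str) -> float:
    """Stand-in surface similarity: Jaccard overlap of the character sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def common_features(word: str, db: Dict[str, List[str]],
                    measures: Dict[str, Measure]) -> Dict[str, float]:
    """For each table component, keep the maximum of each measure over its entries."""
    features: Dict[str, float] = {}
    for component, entries in db.items():
        for name, measure in measures.items():
            best = max((measure(word, entry) for entry in entries), default=0.0)
            features[f"common-{component}_{name}"] = best  # no domain-specific copy
    return features


# Hypothetical target domain DB: table component -> entries.
target_db = {"subj": ["掃除機", "洗濯機"], "pred": ["買える"]}
features = common_features("掃除機", target_db, {"surface": char_overlap})
```

No domain-headed copy is produced because the target domain was never seen during learning, so only the common domain weights can be applied to it.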

The sequence model application unit 240 assigns labels representing table components to the word chunks included in the question sentence. The assignment is based on the portion of the sequence model 42 that corresponds to the common domain features, the sequence model 42 having been learned by the model learning device 100 for extracting the table components corresponding to words, and on the common domain features for each table component extracted by the application-time sequence model featurization unit 238 for each word included in the question sentence. The sequence model 42 may be applied using an existing method such as CRF (Non-Patent Document 3). For example, the weight parameters for each common domain feature corresponding to each table component are applied to the extracted common domain features for each table component, and labels are assigned as in "<subj>掃除機</subj>は<pred>どこ/で/買える</pred>の".
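Selecting the common domain portion of the learned weights can be sketched as a simple filter on feature names. This illustration is deliberately partial: it scores each label for a single word from its features only, whereas a real CRF also uses label-transition weights and decodes the whole sentence jointly. The weight values, feature names, and function names are hypothetical.

```python
from typing import Dict, List, Tuple

Weights = Dict[Tuple[str, str], float]  # (label, feature name) -> weight


def common_part(weights: Weights) -> Weights:
    """Keep only the weights attached to features with the 'common-' header."""
    return {key: w for key, w in weights.items() if key[1].startswith("common-")}


def label_scores(features: Dict[str, float], weights: Weights,
                 labels: List[str]) -> Dict[str, float]:
    """Per-word linear score of each label (transition weights are omitted)."""
    return {
        label: sum(weights.get((label, name), 0.0) * value
                   for name, value in features.items())
        for label in labels
    }


# Hypothetical learned weights mixing domain-specific and common features.
weights: Weights = {
    ("B-subj", "toy-subj_similarity"): 1.2,
    ("B-subj", "travel-subj_similarity"): 0.7,
    ("B-subj", "common-subj_similarity"): 0.8,
    ("O", "common-subj_similarity"): -0.5,
}
common_weights = common_part(weights)
scores = label_scores({"common-subj_similarity": 0.5}, common_weights, ["B-subj", "O"])
best_label = max(scores, key=scores.get)
```

The "toy-" and "travel-" weights are dropped, so the unseen target domain is labeled using only what was learned across all original domains.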

Using the semantic vector model 32, the application-time regression model featurization unit 244 extracts, for each word chunk labeled by the sequence model application unit 240, the similarity with each word representing an entry candidate of the table component indicated by the label assigned to the word chunk, as a regression model feature. Specifically, it extracts features by the same processing as the regression model featurization unit 44 of the model learning device 100 described above.

The regression model application unit 246 extracts, for each table component, the word representing the entry of the target domain DB 225 that corresponds to the question sentence, and outputs it to the output unit 250. The extraction is based on the regression model 48, learned by the model learning device 100 for extracting the word representing the entry corresponding to a word, and on the regression model features for each entry candidate of the table component indicated by the label, extracted by the application-time regression model featurization unit 244 for each labeled word chunk. To extract a word, the regression model is applied to the regression model features of each entry candidate to compute a value for each pair of a word chunk and an entry candidate, and for each table component the entry candidate with the highest output value is output as the final result. For example, for the word chunk "掃除機" (vacuum cleaner) labeled as subject, if the output values of the pairs with the entry candidates are "1.0 掃除機-"subj-掃除機"" and "0.2 掃除機-"subj-洗濯機"" (washing machine), the former is output.
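The final selection step reduces to taking, per table component, the entry candidate with the highest regression output. A minimal sketch, with a hypothetical function name and the output values from the example above:

```python
from typing import Dict, List, Tuple


def select_entries(scored: Dict[str, List[Tuple[float, str]]]) -> Dict[str, str]:
    """For each table component, return the entry candidate with the highest value."""
    return {
        component: max(candidates, key=lambda pair: pair[0])[1]
        for component, candidates in scored.items()
        if candidates  # components without candidates yield no output
    }


# Regression output values for the chunk 掃除機 labeled as subj.
scored = {"subj": [(1.0, "掃除機"), (0.2, "洗濯機")]}
result = select_entries(scored)
```

With these values, the subj component resolves to the entry 掃除機.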

<Operation of Model Learning Device According to Embodiment of the Present Invention>
Next, the operation of the model learning device 100 according to the embodiment of the present invention will be described. When the input unit 10 receives the annotated original domain 1 question sentence set 21A and the original domain 1 DB 22A for original domain 1 to be learned, and the annotated original domain 2 question sentence set 21B and the original domain 2 DB 22B for original domain 2 to be learned, the model learning device 100 executes the model learning processing routine shown in FIG. 4.

In step S100, for original domain 1 to be learned, based on the annotated original domain 1 question sentence set 21A and the original domain 1 DB 22A received by the input unit 10, the semantic vector model 32 is used to extract, for each word included in the question sentences of the annotated original domain 1 question sentence set 21A and for each table component of the original domain 1 DB 22A, the similarity with the words representing the entries of that table component, as a feature of original domain 1 and as a feature of the common domain.

In step S102, for original domain 2 to be learned, based on the annotated original domain 2 question sentence set 21B and the original domain 2 DB 22B received by the input unit 10, the semantic vector model 32 is used to extract, for each word included in the question sentences of the annotated original domain 2 question sentence set 21B and for each table component of the original domain 2 DB 22B, the similarity with the words representing the entries of that table component, as a feature of original domain 2 and as a feature of the common domain.

In step S104, the sequence model 42 for extracting table components is learned based on: the original domain 1 features and common domain features for each table component, extracted in step S100 for each word included in the question sentences of the annotated original domain 1 question sentence set 21A; the labels assigned to each question sentence of the annotated original domain 1 question sentence set 21A; the original domain 2 features and common domain features for each table component, extracted in step S102 for each word included in the question sentences of the annotated original domain 2 question sentence set 21B; and the labels assigned to each question sentence of the annotated original domain 2 question sentence set 21B.

In step S106, based on the annotated original domain 1 question sentence set 21A and the original domain 1 DB 22A, the semantic vector model 32 is used to extract, for each labeled word chunk included in each question sentence of the annotated original domain 1 question sentence set 21A, the similarity between the word chunk and the word representing each entry candidate of the table component indicated by the label, as a regression model feature.

Likewise, based on the annotated original domain 2 question sentence set 21B and the original domain 2 DB 22B, the semantic vector model 32 is used to extract, for each labeled word chunk included in each question sentence of the annotated original domain 2 question sentence set 21B, the similarity between the word chunk and the word representing each entry candidate of the table component indicated by the label, as a regression model feature.

In step S108, the regression model 48 for extracting a word representing an entry is learned, using an existing method such as a logistic regression model, based on: the regression model features for each entry candidate, extracted in step S106 for each word chunk included in the question sentences of the annotated original domain 1 question sentence set 21A; the labels assigned to each question sentence of the annotated original domain 1 question sentence set 21A; the regression model features for each entry candidate, extracted in step S106 for each word chunk included in the question sentences of the annotated original domain 2 question sentence set 21B; and the labels assigned to each question sentence of the annotated original domain 2 question sentence set 21B. The model learning processing routine then ends.

<Operation of Word Extraction Device According to Embodiment of the Present Invention>
Next, the operation of the word extraction device 200 according to the embodiment of the present invention will be described. When the input unit 210 receives a target domain question sentence, which is a question sentence about an unlearned target domain, and the target domain DB 225 of that target domain, the word extraction device 200 executes the word extraction processing routine shown in FIG. 5.

First, in step S200, the semantic vector model 32 is used to extract, for each word included in the question sentence and for each table component of the target domain DB 225, the similarity with the words representing the entries of that table component, as a common domain feature.

Next, in step S202, labels representing table components are assigned to the word chunks included in the question sentence, based on the portion of the sequence model 42 that corresponds to the common domain features and on the common domain features for each table component extracted in step S200 for each word included in the question sentence.

In step S204, the semantic vector model 32 is used to extract, for each word chunk labeled in step S202, the similarity with each word representing an entry candidate of the table component indicated by the label assigned to the word chunk, as a regression model feature.

In step S206, for each table component, the word representing the entry of the target domain DB 225 that corresponds to the question sentence is extracted, based on the regression model 48 and on the regression model features for each entry candidate of the table component indicated by the label, extracted by the application-time regression model featurization unit 244 for each labeled word chunk. The extracted word is output to the output unit 250, and the processing ends.

As described above, the model learning device according to the embodiment of the present invention extracts, for each of a plurality of original domains to be learned, the similarity with the words representing the entries of each table component as a feature of the original domain, duplicates the feature of the original domain as a feature of the common domain, and learns a sequence model for extracting the table component corresponding to a word chunk based on the original domain features and common domain features obtained for each of the plurality of original domains. As a result, when there are a plurality of other domains available for learning, a model capable of accurately extracting the word corresponding to a question sentence can be learned.

Further, the word extraction device according to the embodiment of the present invention can accurately extract the word corresponding to a question sentence.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, although the case of two original domains to be learned has been described, the present invention is not limited to this, and there may be three or more original domains to be learned.

10, 210 Input unit
20, 220 Calculation unit
21A Annotated original domain 1 question sentence set
21B Annotated original domain 2 question sentence set
22A Original domain 1 DB
22B Original domain 2 DB
32 Semantic vector model
38 Sequence model featurization unit
40 Sequence model learning unit
42 Sequence model
44 Regression model featurization unit
46 Regression model learning unit
48 Regression model
100 Model learning device
200 Word extraction device
238 Application-time sequence model featurization unit
240 Sequence model application unit
244 Application-time regression model featurization unit
246 Regression model application unit
250 Output unit
225 Target domain DB

Claims

A model learning device that learns models for extracting, from words representing entries in a database of an unlearned target domain, a word representing an entry corresponding to an input question sentence of the target domain, the model learning device including:
a sequence model featurization unit that, for each of a plurality of original domains, based on a set of question sentences of the original domain to which words representing entries of a database of the original domain and labels indicating table components are assigned, uses a semantic vector of each word created in advance to extract, for each word included in the question sentences and for each table component, a similarity with a word representing an entry of the table component as a feature of the original domain, and duplicates the feature of the original domain as a feature of a common domain;
a sequence model learning unit that learns a sequence model for extracting the table component corresponding to a word chunk, based on the features of the original domain and the features of the common domain obtained for each word included in the question sentences and on the labels assigned to each word included in the question sentences, for each of the plurality of original domains;
a regression model featurization unit that, for each of the plurality of original domains, based on the set of question sentences of the original domain, uses the semantic vector of each word to extract, for each word chunk to which a label is assigned, a similarity with each word representing an entry of the table component indicated by the label assigned to the word chunk, as a regression model feature; and
a regression model learning unit that learns a regression model for extracting a word representing the entry corresponding to a word, based on the regression model features obtained for each word chunk to which a label is assigned and on the labels assigned to each word chunk, for each of the plurality of original domains.

A word extraction device that extracts, from words representing entries in a database of an unlearned target domain, a word representing an entry corresponding to an input question sentence of the target domain, the word extraction device including:
an application-time sequence model featurization unit that, for each word included in the question sentence and for each table component of the database of the target domain, uses a semantic vector of each word created in advance to extract a similarity with a word representing an entry of the table component as a feature of a common domain;
a sequence model application unit that assigns a label representing a table component to each word chunk included in the question sentence, based on the sequence model learned by the model learning device according to claim 1 and on the features of the common domain for each table component extracted by the application-time sequence model featurization unit for each word included in the question sentence;
an application-time regression model featurization unit that uses the semantic vector of each word to extract, for each word chunk to which a label is assigned by the sequence model application unit, a similarity with each word representing an entry of the table component indicated by the label assigned to the word chunk, as a regression model feature; and
a regression model application unit that extracts a word representing the entry of the database of the target domain corresponding to the question sentence, based on the regression model learned by the model learning device and on the regression model features of each labeled word chunk extracted by the application-time regression model featurization unit.

The word extraction device according to claim 2, wherein the sequence model application unit assigns a label representing a table component to each word chunk included in the question sentence, based on the portion of the sequence model learned by the model learning device that corresponds to the features of the common domain, and on the features of the common domain for each table component extracted by the application-time sequence model featurization unit for each word included in the question sentence.

A model learning method in a model learning device that learns models for extracting, from words representing entries in a database of an unlearned target domain, a word representing an entry corresponding to an input question sentence of the target domain, the method including:
a step in which a sequence model featurization unit, for each of a plurality of original domains, based on a set of question sentences of the original domain to which words representing entries of a database of the original domain and labels indicating table components are assigned, uses a semantic vector of each word created in advance to extract, for each word included in the question sentences and for each table component, a similarity with a word representing an entry of the table component as a feature of the original domain, and duplicates the feature of the original domain as a feature of a common domain;
a step in which a sequence model learning unit learns a sequence model for extracting the table component corresponding to a word chunk, based on the features of the original domain and the features of the common domain obtained for each word included in the question sentences and on the labels assigned to each word included in the question sentences, for each of the plurality of original domains;
a step in which a regression model featurization unit, for each of the plurality of original domains, based on the set of question sentences of the original domain, uses the semantic vector of each word to extract, for each word chunk to which a label is assigned, a similarity with each word representing an entry of the table component indicated by the label assigned to the word chunk, as a regression model feature; and
a step in which a regression model learning unit learns a regression model for extracting a word representing the entry corresponding to a word, based on the regression model features obtained for each word chunk to which a label is assigned and on the labels assigned to each word chunk, for each of the plurality of original domains.

A word extraction method in a word extraction device that extracts, from words representing entries in a database of an unlearned target domain, a word representing an entry corresponding to an input question sentence of the target domain, the method including:
a step in which an application-time sequence model featurization unit, for each word included in the question sentence and for each table component of the database of the target domain, uses a semantic vector of each word created in advance to extract a similarity with a word representing an entry of the table component as a feature of a common domain;
a step in which a sequence model application unit assigns a label representing a table component to each word chunk included in the question sentence, based on the sequence model learned by the model learning method according to claim 4 and on the features of the common domain for each table component extracted by the application-time sequence model featurization unit for each word included in the question sentence;
a step in which an application-time regression model featurization unit uses the semantic vector of each word to extract, for each word chunk to which a label is assigned by the sequence model application unit, a similarity with each word representing an entry of the table component indicated by the label assigned to the word chunk, as a regression model feature; and
a step in which a regression model application unit extracts a word representing the entry of the database of the target domain corresponding to the question sentence, based on the regression model learned by the model learning method and on the regression model features of each labeled word chunk extracted by the application-time regression model featurization unit.

The word extraction method according to claim 5, wherein, when the sequence model application unit assigns labels, a label representing a table component is assigned to each word chunk included in the question sentence, based on the portion of the sequence model learned by the model learning method that corresponds to the features of the common domain, and on the features of the common domain for each table component extracted by the application-time sequence model featurization unit for each word included in the question sentence.

A program for causing a computer to function as each unit of the model learning device according to claim 1, or of the word extraction device according to claim 2 or claim 3.