JP5518757B2

JP5518757B2 - Document classification learning control apparatus, document classification apparatus, and computer program

Info

Publication number: JP5518757B2
Application number: JP2011011905A
Authority: JP
Inventors: 正柳原; 一則松本; 元服部; 智弘小野
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2011-01-24
Filing date: 2011-01-24
Publication date: 2014-06-11
Anticipated expiration: 2031-01-24
Also published as: JP2012155394A

Description

The present invention relates to a document classification learning control device, a document classification device, and a computer program.

Conventionally, document classification devices are known that determine what type of information an electronic document relates to and classify the document by assigning it a label corresponding to that type. Some document classification devices use a classifier whose document classification ability is improved through active learning with learning data.

In active learning for a classifier C, the classifier C is first trained using learning data L. The learning data consists of documents with positive example labels. A positive example label indicates that a document is a positive example document, that is, a document related to a specific type of information. Next, the classifier C determines whether each document in the reinforcement learning data U, which consists of unlabeled documents, is a positive example document or a negative example document (a document that is not a positive example document). Next, a person labels (annotates) only those cases whose determination results have low reliability, that is, the ambiguous cases (annotation target data). The labeled annotation target data and the learning data L then become the new learning data, and the classifier C is trained again. This active learning process is repeated until an end condition is satisfied.
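The loop described above can be sketched in Python as follows. This is a minimal illustration, not the method of any particular implementation: the function name `active_learning`, the callables `train`, `pick_ambiguous`, and `annotate`, and the round limit used as the end condition are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Doc = str
Labeled = Tuple[Doc, int]  # (document, label), label is +1 or -1


def active_learning(
    labeled: List[Labeled],
    pool: List[Doc],
    train: Callable[[Sequence[Labeled]], Callable[[Doc], float]],
    pick_ambiguous: Callable[[Sequence[Tuple[Doc, float]]], List[Doc]],
    annotate: Callable[[Doc], int],
    max_rounds: int = 5,  # assumed end condition: a fixed number of rounds
) -> List[Labeled]:
    """Grow the labeled set by asking a person only about ambiguous documents."""
    for _ in range(max_rounds):
        score = train(labeled)                   # train classifier C on L
        scored = [(d, score(d)) for d in pool]   # judge the unlabeled data U
        targets = pick_ambiguous(scored)         # low-reliability cases only
        if not targets:                          # nothing ambiguous is left
            break
        labeled = labeled + [(d, annotate(d)) for d in targets]
        pool = [d for d in pool if d not in targets]
    return labeled
```

Passing the selection rule in as `pick_ambiguous` keeps the loop independent of how annotation targets are chosen, which is the part the invention changes.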

For example, Non-Patent Document 1 describes a technique relating to a classifier that uses an SVM (Support Vector Machine). A classifier using an SVM outputs a positive example side soft margin and a negative example side soft margin. The positive example side soft margin is the range on the positive example side of the boundary surface used to determine whether a document is a positive example document or a negative example document, within which the reliability of the determination result is low. The negative example side soft margin is the range on the negative example side of the boundary surface within which the reliability of the determination result is low.

In the prior art described in Non-Patent Document 1, among the results of the classifier's determination on the reinforcement learning data, the documents within the positive example side soft margin (positive example cases) and the documents within the negative example side soft margin (negative example cases) are taken together, and similar documents are grouped using the k-means method. Then, in each cluster, the case at the centroid or the case closest to the centroid is extracted, and only the extracted cases are used as annotation target data.

Z. Xu, K. Yu, V. Tresp, X. Xu, and J. Wang, "Representative Sampling for Text Classification using Support Vector Machines.", In Proceedings of the 25th European Conference on IR Research (ECIR'03), pp. 393-407, 2003.

However, the prior art described in Non-Patent Document 1 has the problem that the learning efficiency of the classifier is insufficient when the distribution of positive example cases and negative example cases within a cluster is biased. FIG. 5 is a conceptual diagram showing the conventional method for generating annotation target data. In FIG. 5, the positive example cases within the positive example side soft margin (circles) and the negative example cases within the negative example side soft margin (crosses) are taken together, and similar documents are grouped using the k-means method. In cluster G100, created as a result of this grouping, the case P100 closest to the centroid becomes annotation target data. However, the centroid lies within the positive example side soft margin and belongs to the positive examples, whereas case P100 lies within the negative example side soft margin and belongs to the negative examples, so case P100, the annotation target data, does not represent group G100. In group G120, the case P120 closest to the centroid becomes annotation target data, but case P121, which is closer to the boundary surface than case P120 and whose determination result is less reliable, does not. Similarly, in group G130, case P130, which is closest to the centroid, becomes annotation target data, while case P131, which is closer to the boundary surface and whose determination result is less reliable, does not. Cases like these can prevent the learning efficiency of the classifier from improving.

The present invention has been made in view of these circumstances, and its object is to improve the efficiency of the learning that enhances the document classification ability of a classifier used in a document classification device.

In order to solve the above problem, a document classification learning control device according to the present invention causes a classifier to execute learning. The classifier learns using learning data consisting of documents with a positive example label, which indicates a positive example document related to a specific type of information, or documents with a negative example label, which indicates a negative example document not related to the specific type of information. It determines whether an input document is a positive example document or a negative example document, and it outputs a positive example side soft margin and a negative example side soft margin, each being a range, measured from the boundary surface used for that determination, within which the reliability of the determination result is low. The document classification learning control device comprises: an input control unit that switches among inputting the learning data to the classifier, inputting reinforcement learning data consisting of unlabeled documents, and inputting the learning data together with labeled annotation target data; a clustering unit that, for the reinforcement learning data that the classifier has determined to be positive or negative example documents, groups similar documents taking only the documents within the positive example side soft margin as targets, and separately groups similar documents taking only the documents within the negative example side soft margin as targets; and a data classification unit that uses documents in the clusters formed by the clustering unit as annotation target data.

The document classification learning control device according to the present invention further comprises a weighting factor calculation unit that calculates, for each document in the determined reinforcement learning data, a weighting factor that is larger the closer the document is to the boundary surface, and the documents to be grouped are weighted using the weighting factor.

In the document classification learning control device according to the present invention, the weighting factor calculation unit calculates the degree of membership in the positive examples and the degree of membership in the negative examples using the distance from the boundary surface, and uses the larger of the two degrees of membership for the weighting factor.

In the document classification learning control device according to the present invention, the data classification unit uses the document closest to the centroid of each cluster as annotation target data.

A document classification device according to the present invention comprises: a classifier that learns using learning data consisting of documents with a positive example label, which indicates a positive example document related to a specific type of information, or documents with a negative example label, which indicates a negative example document not related to the specific type of information, that determines whether an input document is a positive example document or a negative example document, and that outputs a positive example side soft margin and a negative example side soft margin, each being a range, measured from the boundary surface used for that determination, within which the reliability of the determination result is low; and any one of the document classification learning control devices described above.

A computer program according to the present invention is a computer program for performing document classification learning control processing that causes a classifier to execute learning. The classifier learns using learning data consisting of documents with a positive example label, which indicates a positive example document related to a specific type of information, or documents with a negative example label, which indicates a negative example document not related to the specific type of information. It determines whether an input document is a positive example document or a negative example document, and it outputs a positive example side soft margin and a negative example side soft margin, each being a range, measured from the boundary surface used for that determination, within which the reliability of the determination result is low. The computer program causes a computer to execute: a step of switching among inputting the learning data to the classifier, inputting reinforcement learning data consisting of unlabeled documents, and inputting the learning data together with labeled annotation target data; a step of grouping, for the reinforcement learning data that the classifier has determined to be positive or negative example documents, similar documents taking only the documents within the positive example side soft margin as targets, and separately grouping similar documents taking only the documents within the negative example side soft margin as targets; and a step of using documents in the grouped clusters as annotation target data.
As a result, the document classification learning control device described above can be realized using a computer.

According to the present invention, when active learning is performed on a classifier, a case having the same label as the centroid of its cluster can be reliably selected as annotation target data even when the distribution of positive example cases and negative example cases within a cluster is biased. In addition, data for which the classifier's determination result has low reliability becomes more likely to be selected as annotation target data. This improves the efficiency of the learning that enhances the document classification ability of the classifier used in a document classification device.

FIG. 1 is a block diagram showing the configuration of a document classification device 10 according to an embodiment of the present invention.
FIG. 2 is a flowchart of a document classification learning control method according to an embodiment of the present invention.
FIG. 3 is a conceptual diagram showing a method for generating annotation target data according to an embodiment of the present invention.
FIG. 4 is a conceptual diagram showing a method for generating annotation target data according to an embodiment of the present invention.
FIG. 5 is a conceptual diagram showing a conventional method for generating annotation target data.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of a document classification device 10 according to an embodiment of the present invention. In FIG. 1, the document classification device 10 includes an input control unit 11, an identification unit 12, a weighting factor calculation unit 13, a clustering unit 14, and a data classification unit 15.

Learning data 110 (labeled), reinforcement learning data 120 (unlabeled), and annotation target data 310 (labeled) are input to the document classification device 10. The learning data 110 consists of documents with positive example labels. A positive example label indicates that a document is a positive example document related to a specific type of information. The reinforcement learning data 120 consists of unlabeled documents. The annotation target data 310 consists of the documents that received a positive example label when a person labeled (annotated) the annotation target data 210 output by the document classification device 10.

In the present embodiment, documents with positive example labels are used as the learning data 110, but documents with negative example labels, which indicate negative example documents not related to the specific type of information, may be used instead. Alternatively, both documents with positive example labels and documents with negative example labels may be used as the learning data 110.

Likewise, in the present embodiment, the documents in the annotation target data 210 that were given a positive example label are used as the annotation target data 310, but the documents in the annotation target data 210 that were given a negative example label may be used instead. Alternatively, both the documents given a positive example label and the documents given a negative example label may be used as the annotation target data 310.

The input control unit 11 switches among inputting the learning data 110 (labeled) to the identification unit 12, inputting the reinforcement learning data 120 (unlabeled), and inputting the learning data 110 (labeled) together with the annotation target data 310 (labeled).

The identification unit 12 determines whether an input document is a positive example document or a negative example document (a document that is not a positive example document) and outputs the determination result. The identification unit 12 also outputs a positive example side soft margin and a negative example side soft margin. The identification unit 12 has a boundary surface for determining whether a document is a positive example document or a negative example document. The positive example side soft margin is the range on the positive example side of the boundary surface within which the reliability of the determination result is low. The negative example side soft margin is the range on the negative example side of the boundary surface within which the reliability of the determination result is low. In addition, the identification unit 12 performs learning that improves its determination ability, using learning data consisting of documents with positive example labels. In the present embodiment, an SVM (Support Vector Machine) is used as the identification unit 12.

For the reinforcement learning data 120 that the identification unit 12 has determined to be positive or negative example documents, the clustering unit 14 groups similar documents taking only the documents within the positive example side soft margin as targets, and separately groups similar documents taking only the documents within the negative example side soft margin as targets.

The data classification unit 15 uses documents in the clusters formed by the clustering unit 14 as annotation target data 210. Of the reinforcement learning data 120 that the identification unit 12 has determined to be positive or negative example documents, the data classification unit 15 treats the documents other than the annotation target data 210 as non-annotation target data 220. The data classification unit 15 outputs the annotation target data 210 and the non-annotation target data 220.

For the reinforcement learning data 120 that the identification unit 12 has determined to be positive or negative example documents, the weighting factor calculation unit 13 calculates, for each document, a weighting factor that is larger the closer the document is to the boundary surface. This weighting factor is output to the clustering unit 14.

FIG. 2 is a flowchart of the document classification learning control method according to the present embodiment. The document classification learning control operation of the document classification device 10 shown in FIG. 1 is described below with reference to FIG. 2.

Step S1: The input control unit 11 inputs the learning data 110 (labeled) to the identification unit 12. The identification unit 12 then learns using the learning data 110 (labeled).

Step S2: The input control unit 11 inputs the reinforcement learning data 120 (unlabeled) to the identification unit 12. The identification unit 12 then determines, for each document in the reinforcement learning data 120 (unlabeled), whether it is a positive example document or a negative example document (a document that is not a positive example document), and outputs the determination result. In the present embodiment, according to the determination result, the identification unit 12 attaches the positive example label "+1" to each positive example document in the reinforcement learning data 120 and the negative example label "-1" to each negative example document.

The identification unit 12 also outputs the positive example side soft margin and the negative example side soft margin. This makes it possible to identify, among the documents in the reinforcement learning data 120, the documents within the positive example side soft margin and the documents within the negative example side soft margin.
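As an illustration of how the two soft margins single out documents, the following Python sketch partitions scored documents by their signed distance f(x) from the boundary surface. It assumes the common SVM convention that the margin lies at f(x) = ±1 and that f(x) = 0 counts as the positive side; neither threshold is stated in this description, and the function name `split_by_soft_margin` is hypothetical.

```python
from typing import Dict, List, Tuple


def split_by_soft_margin(
    scored: List[Tuple[str, float]], margin: float = 1.0
) -> Dict[str, List[str]]:
    """Split (doc_id, f(x)) pairs into the two soft-margin sets and the rest."""
    result: Dict[str, List[str]] = {
        "positive_margin": [],  # judged positive, but close to the boundary
        "negative_margin": [],  # judged negative, but close to the boundary
        "confident": [],        # outside both soft margins
    }
    for doc_id, f in scored:
        if 0.0 <= f < margin:          # assumed positive-side range
            result["positive_margin"].append(doc_id)
        elif -margin < f < 0.0:        # assumed negative-side range
            result["negative_margin"].append(doc_id)
        else:
            result["confident"].append(doc_id)
    return result
```

Only the two margin sets are candidates for annotation, and keeping them as separate lists matches the requirement that each side be clustered on its own.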

Step S3: For each document (case) in the reinforcement learning data 120 labeled by the identification unit 12, the weighting factor calculation unit 13 converts the distance from the boundary surface, which the identification unit 12 uses to determine whether a document is a positive or negative example document, into a degree of membership. The identification unit 12 outputs this distance for each case. In the present embodiment, an SVM is used as the identification unit 12, and in the SVM the distance f(x) of a case x is calculated by Equation (1). Equation (1) is used to obtain the degree of membership under the assumption of a sigmoid distribution. If the sigmoid distribution does not hold, the distance between the boundary surface and the case farthest from it may be divided into equal intervals, a distribution created from the number of cases in each interval, and the degree of membership obtained from that distribution.

Here, N is the number of cases, α_i is the weight for case x_i, y_i is the value (+1 or -1) of the label attached to case x_i, k(x_i, x) is the kernel function for case x, and b is a constant.
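Equation (1) itself is not reproduced in this text. Assuming it is the standard kernel SVM decision function, which uses exactly the symbols defined above, it would read as follows. This is a reconstruction from those definitions, not a copy of the original equation.

```latex
f(x) = \sum_{i=1}^{N} \alpha_i \, y_i \, k(x_i, x) + b \tag{1}
```

Under this form, the sign of f(x) gives the side of the boundary surface and its magnitude grows with the distance from it.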

For the distance f(x) of a case x, the degree of membership in the positive examples (+1), P(y=+1|f(x)), is calculated by Equation (2), and the degree of membership in the negative examples (-1), P(y=-1|f(x)), is calculated by Equation (3).

Here, the values of A and B are the combination that maximizes P(y=+1|f(x)) and P(y=-1|f(x)), respectively. This combination can generally be obtained using a maximum likelihood estimation technique, of which Newton's method is a representative example.
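Equations (2) and (3) are not reproduced in this text. A minimal Python sketch, assuming the widely used sigmoid form P(y=+1|f) = 1 / (1 + exp(A·f + B)) with P(y=-1|f) = 1 - P(y=+1|f), is shown below. The exact form in the original may differ, fitting A and B is omitted, and the function name `membership` is hypothetical.

```python
import math
from typing import Tuple


def membership(f: float, a: float, b: float) -> Tuple[float, float]:
    """Return (P(y=+1|f), P(y=-1|f)) under an assumed sigmoid model.

    a and b stand for the fitted parameters A and B; with a < 0 the
    positive-side membership rises as f moves to the positive side.
    """
    p_pos = 1.0 / (1.0 + math.exp(a * f + b))
    return p_pos, 1.0 - p_pos


# On the boundary surface (f = 0) with b = 0, both memberships are 0.5.
p_pos, p_neg = membership(0.0, a=-2.0, b=0.0)
```

With this form the two degrees of membership always sum to 1, so the larger of the two is never below 0.5.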

Step S4: The weighting factor calculation unit 13 calculates a weighting factor for each document (case) in the reinforcement learning data 120 labeled by the identification unit 12. The weighting factor w_i of case x_i is calculated by Equation (4).

The denominator of Equation (4) is obtained by selecting the larger of the degrees of membership P(y=+1|f(x_i)) and P(y=-1|f(x_i)) and subtracting 0.5 from the selected value. This is because, according to Equations (1), (2), and (3), the degree of membership on the boundary surface is 0.5, so subtracting 0.5 from the degree of membership gives the distance.
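Equation (4) is not reproduced in this text; only its denominator is described. The Python sketch below assumes a numerator of 1, so that the weight is the reciprocal of that denominator and grows as a case approaches the boundary surface. The numerator, the function name `weight`, and the small floor `eps` that avoids division by zero exactly on the boundary are all assumptions.

```python
def weight(p_pos: float, p_neg: float, eps: float = 1e-6) -> float:
    """Assumed form of Equation (4): 1 / (max(P+, P-) - 0.5).

    eps is a hypothetical guard for a case lying exactly on the
    boundary surface, where the denominator would be zero.
    """
    larger = max(p_pos, p_neg)          # the larger degree of membership
    return 1.0 / max(larger - 0.5, eps)


# A case near the boundary (0.6 vs 0.4) outweighs a confident one (0.9 vs 0.1).
near, far = weight(0.6, 0.4), weight(0.9, 0.1)
```

Because the larger degree of membership is used, positive example cases and negative example cases at the same distance from the boundary surface receive the same weight.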

Step S5: The clustering unit 14 groups similar documents using the k-means method for the documents (cases) in the reinforcement learning data 120 labeled by the identification unit 12.

The document clustering process using the k-means method according to the present embodiment is as follows. The process is performed separately for the documents within the positive example side soft margin (positive example cases) and for the documents within the negative example side soft margin (negative example cases). The case of clustering only the documents within the positive example side soft margin is described below; clustering only the documents within the negative example side soft margin is done in the same way.

(1) First, all documents within the positive example side soft margin (positive example cases) are extracted from the reinforcement learning data 120 labeled by the identification unit 12, and a document set D consisting of all the extracted positive example cases is created.
(2) Next, each case x in the document set D is randomly assigned one of k cluster IDs (k is a natural number of 2 or more; each ID is a value from 1 to k).
(3) Next, the vector representing each case x that has been given the same cluster ID is multiplied by its weighting factor w_x. At this point, the weighting factors may be normalized by dividing the weighting factor w_x of each such vector by the sum of the weighting factors w_x.
(4) Next, the centroid is obtained using the weighted vectors of the cases x that have been given the same cluster ID. The centroid is the vector whose elements are the averages, element by element, of those weighted vectors.
(5) For each cluster ID, the case closest to the centroid is taken as the representative point. One representative point is thus determined for each of the k cluster IDs, giving k representative points in total.
(6) For every case in the document set D, the cluster ID is changed to the cluster ID of the nearest centroid. The process then returns to (2) and repeats, and ends when (6) produces no change.
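Steps (2) to (6) can be sketched in Python as below. This is an interpretation rather than the embodiment itself: it uses the normalized-weight variant of step (3), so the centroid is a weighted mean, and after each reassignment it repeats from the centroid computation instead of drawing new random IDs. The function names, the Euclidean distance, the `seed` and `max_iter` parameters, and the handling of empty clusters are assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def weighted_centroid(points: List[Vector], weights: List[float]) -> List[float]:
    """Weighted mean of the points, with weights normalized by their sum."""
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * p[d] for p, w in zip(points, weights)) / total for d in range(dim)]


def weighted_kmeans(
    docs: List[Vector], weights: List[float], k: int, seed: int = 0, max_iter: int = 100
) -> Tuple[List[int], List[List[float]]]:
    """Cluster docs into k groups; returns (cluster ID per doc, centroids)."""
    rng = random.Random(seed)
    ids = [rng.randrange(k) for _ in docs]            # step (2): random cluster IDs
    # Fallback centroids, used only if a cluster starts out empty (assumption).
    centroids: List[List[float]] = [list(rng.choice(docs)) for _ in range(k)]
    for _ in range(max_iter):
        for c in range(k):                            # steps (3)-(4): weighted centroids
            members = [i for i, cid in enumerate(ids) if cid == c]
            if members:                               # an empty cluster keeps its old centroid
                centroids[c] = weighted_centroid(
                    [docs[i] for i in members], [weights[i] for i in members]
                )
        new_ids = [                                   # step (6): move to nearest centroid
            min(range(k), key=lambda c: math.dist(doc, centroids[c])) for doc in docs
        ]
        if new_ids == ids:                            # no change, so the process ends
            break
        ids = new_ids
    return ids, centroids
```

Running this once on the positive example cases and once on the negative example cases gives the two separate sets of clusters; the representative point of step (5) is then the member nearest each returned centroid.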

The document clustering process above yields k clusters on the positive example side. On the negative example side, the same document clustering process yields j clusters (j is a natural number of 2 or more).

The description now returns to FIG. 2.
Step S6: For each cluster created by the clustering unit 14, the data classification unit 15 selects the case closest to the centroid and uses it as annotation target data 210. Here, the similarity between the centroid vector and the vector of each case is calculated, and the case with the highest similarity is used as annotation target data 210. The cosine similarity expressed by Equation (5) can be used as the similarity between vectors.

According to equation (5), for n-dimensional vectors x and y, with y taken as the center of gravity, the case x that maximizes the cosine value of equation (5) within the same cluster is searched for, and the case x thus found is set as annotation target data 210.

The data classification unit 15 includes in the annotation target data 210 one document from each of the k clusters on the positive example side, for a total of k documents (positive example cases). Likewise, the data classification unit 15 includes in the annotation target data 210 one document from each of the j clusters on the negative example side, for a total of j documents (negative example cases). As a result, the annotation target data 210 contains k documents (positive example cases) and j documents (negative example cases). The data classification unit 15 places all documents other than the annotation target data 210 in the non-annotation target data 220.
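The selection in step S6 can be sketched as below. This is an illustrative sketch only: the function name, the document IDs, and the representation of a cluster as a (center of gravity, members) pair are assumptions, not structures named in the description.

```python
import math

def cosine(x, y):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def split_documents(positive_clusters, negative_clusters):
    """Pick one document per cluster as annotation target data (210);
    every other document becomes non-annotation target data (220).

    Each cluster is an assumed (centroid, members) pair, where members
    maps a document ID to its vector.
    """
    targets, non_targets = [], []
    for centroid, members in positive_clusters + negative_clusters:
        # The document most similar to the center of gravity represents the cluster.
        best = max(members, key=lambda doc_id: cosine(members[doc_id], centroid))
        targets.append(best)
        non_targets.extend(doc_id for doc_id in members if doc_id != best)
    return targets, non_targets

# Hypothetical example with k = 1 positive cluster and j = 1 negative cluster.
positive = [([1.0, 0.0], {"a": [1.0, 0.1], "b": [0.5, 0.5]})]
negative = [([0.0, 1.0], {"c": [0.2, 1.0], "d": [1.0, 1.0]})]
targets, non_targets = split_documents(positive, negative)
```

With k positive clusters and j negative clusters, the returned target list always has k + j entries, matching the count stated above.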

The user attaches labels to the (k + j) documents in the annotation target data 210. Annotation target data 310, which consists of those documents in the annotation target data 210 to which the user has attached a positive example label, is then input to the document classification apparatus 10.

Step S7: The input control unit 11 inputs the learning data 110 (labeled) and the annotation target data 310 (labeled) to the identification unit 12. The identification unit 12 then learns using the learning data 110 (labeled) and the annotation target data 310 (labeled).

Step S8: The input control unit 11 determines whether a predetermined end condition is satisfied. If the end condition is satisfied, the process of FIG. 2 ends. If it is not satisfied, the process returns to step S2.
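The control flow of steps S7 and S8 can be pictured with the following sketch. Everything in it is hypothetical: the driver function, the four callables supplied by the caller, the initial training call, and the round cap are assumptions made for illustration, and the end condition is left to the caller because the description only calls it "predetermined".

```python
def run_active_learning(train, pick_targets, annotate, is_done,
                        learning_data, unlabeled, max_rounds=10):
    """Hypothetical driver for the loop that returns to step S2 until step S8 ends it."""
    model = train(learning_data)                  # assumed initial learning
    for round_index in range(max_rounds):
        targets = pick_targets(model, unlabeled)  # steps S2-S6: choose annotation target data 210
        labeled = annotate(targets)               # user labelling, yielding data 310
        model = train(learning_data + labeled)    # step S7: learn from data 110 and data 310
        if is_done(model, round_index):           # step S8: predetermined end condition
            break
    return model

# Stub callables, used only to exercise the control flow.
training_sizes = []

def train(data):
    training_sizes.append(len(data))
    return len(data)  # the stand-in "model" is just the training-set size

model = run_active_learning(
    train,
    pick_targets=lambda m, pool: pool[:2],
    annotate=lambda docs: [(d, "positive") for d in docs],
    is_done=lambda m, round_index: round_index >= 1,
    learning_data=[("x", "positive")],
    unlabeled=["d1", "d2", "d3"],
)
```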

FIGS. 3 and 4 are conceptual diagrams showing a method for generating annotation target data according to the present embodiment. In FIG. 3, similar documents are grouped using only the positive example cases (circles) within the positive example side soft margin. Likewise, similar documents are grouped using only the negative example cases (x marks) within the negative example side soft margin. In the positive example side cluster G1 created by this grouping, both the center of gravity and the case P1 closest to it lie within the positive example side soft margin and belong to the positive example, so the case P1 represents the group G1 as annotation target data. Similarly, in the positive example side cluster G2, both the center of gravity and the case P2 closest to it lie within the positive example side soft margin and belong to the positive example, so the case P2 represents the group G2 as annotation target data. In the negative example side cluster G3, both the center of gravity and the case P3 closest to it lie within the negative example side soft margin and belong to the negative example, so the case P3 represents the group G3 as annotation target data. Similarly, in the negative example side cluster G4, both the center of gravity and the case P4 closest to it lie within the negative example side soft margin and belong to the negative example, so the case P4 represents the group G4 as annotation target data. This improves the efficiency of the learning that raises the document classification ability of the identification unit 12.

In FIG. 4, each case (document) subject to grouping in FIG. 3 is weighted using a weighting factor. The weighting factor is calculated for each case (document) so that it becomes larger as the distance from the boundary surface becomes shorter. As a result, in FIG. 4, the weighted center of gravity of each of the clusters G1, G2, G3, and G4 lies closer to the boundary surface than the center of gravity in FIG. 3. Consequently, in FIG. 4, a case that is closer to the boundary surface, and whose determination result is therefore less reliable, is more likely to become annotation target data than in FIG. 3. In the example of FIG. 4, in the positive example side cluster G1, the case P11, which is closer to the boundary surface than the case P1 closest to the original center of gravity, becomes annotation target data. In the positive example side cluster G2, the case P12, which is closer to the boundary surface than the case P2 closest to the original center of gravity, becomes annotation target data. In the negative example side cluster G3, the case P13, which is closer to the boundary surface than the case P3 closest to the original center of gravity, becomes annotation target data. In the negative example side cluster G4, the case P14, which is closer to the boundary surface than the case P4 closest to the original center of gravity, becomes annotation target data. This further improves the efficiency of the learning that raises the document classification ability of the identification unit 12.
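The shift of the center of gravity toward the boundary surface can be checked with a small numeric sketch. The weight function 1 / (1 + distance) used here is only an assumed stand-in that grows as the distance shrinks; the points and the convention that the first coordinate is the distance from the boundary surface are likewise hypothetical.

```python
def centroid(points, weights=None):
    """Weighted mean of the points; with no weights this is the plain mean."""
    if weights is None:
        weights = [1.0] * len(points)
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * p[i] for p, w in zip(points, weights)) / total
            for i in range(dim)]

# Hypothetical cluster: the first coordinate is the distance from the boundary surface.
points = [[0.2, 1.0], [0.5, 2.0], [0.9, 3.0]]

# Assumed weight: larger for cases closer to the boundary surface.
weights = [1.0 / (1.0 + p[0]) for p in points]

plain = centroid(points)              # center of gravity as in FIG. 3
weighted = centroid(points, weights)  # weighted center of gravity as in FIG. 4

# The weighted center of gravity lies nearer the boundary surface.
shift_toward_boundary = plain[0] - weighted[0]
```

Because the cases near the boundary surface carry more weight, the first coordinate of the weighted center of gravity is smaller than that of the plain one, so the case nearest to it also tends to lie closer to the boundary surface.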

Note that the document classification learning control processing may be performed by recording a program for realizing each step shown in FIG. 2 on a computer-readable recording medium and causing a computer system to read and execute the program recorded on that medium. The "computer system" here may include an OS and hardware such as peripheral devices.
When a WWW system is used, the "computer system" also includes a homepage providing environment (or display environment).
The "computer-readable recording medium" refers to a flexible disk, a magneto-optical disk, a ROM, a writable nonvolatile memory such as a flash memory, a portable medium such as a DVD (Digital Versatile Disc), or a storage device such as a hard disk built into a computer system.

Furthermore, the "computer-readable recording medium" also includes a medium that holds a program for a certain period of time, such as a volatile memory (for example, DRAM (Dynamic Random Access Memory)) inside a computer system that serves as a server or a client when the program is transmitted over a network such as the Internet or a communication line such as a telephone line.
The program may be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium, or by a transmission wave in the transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having a function of transmitting information, such as a network (communication network) like the Internet or a communication line (communication wire) like a telephone line.
The program may realize only a part of the functions described above. It may also be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.

An embodiment of the present invention has been described in detail above with reference to the drawings. The specific configuration is not limited to this embodiment, however, and also includes design changes and the like within a scope that does not depart from the gist of the present invention.

DESCRIPTION OF SYMBOLS 10 ... Document classification apparatus, 11 ... Input control unit, 12 ... Identification unit, 13 ... Weighting factor calculation unit, 14 ... Clustering unit, 15 ... Data classification unit

Claims

1. A document classification learning control apparatus that causes a discriminator to execute learning, the discriminator learning, by using learning data composed of documents to which a positive example label indicating a positive example document related to a specific type of information or a negative example label indicating a negative example document not related to the specific type of information is attached, to determine whether an input document is a positive example document or a negative example document, and outputting a positive example side soft margin and a negative example side soft margin that are ranges in which the reliability of the determination result is low, the apparatus comprising:
an input control unit that switches among inputting the learning data to the discriminator, inputting reinforcement learning data composed of unlabeled documents, and inputting the learning data together with annotation target data to which labels are attached;
a clustering unit that, for the reinforcement learning data determined by the discriminator to be positive example documents or negative example documents, groups similar documents using only the documents within the positive example side soft margin, and groups similar documents using only the documents within the negative example side soft margin; and
a data classification unit that sets documents in the clusters grouped by the clustering unit as annotation target data.

2. The document classification learning control apparatus according to claim 1, further comprising a weighting factor calculation unit that calculates, for each document of the determined reinforcement learning data, a weighting factor that is larger the closer the document is to the boundary surface,
wherein the documents to be grouped are weighted using the weighting factors.

3. The document classification learning control apparatus according to claim 2, wherein the weighting factor calculation unit calculates a degree of belonging to the positive example and a degree of belonging to the negative example using the distance from the boundary surface, and uses the larger of the two degrees as the weighting factor.

4. The document classification learning control apparatus according to claim 1, wherein the data classification unit sets the document closest to the center of gravity in the cluster as annotation target data.

5. A document classification apparatus comprising:
a discriminator that learns, by using learning data composed of documents to which a positive example label indicating a positive example document related to a specific type of information or a negative example label indicating a negative example document not related to the specific type of information is attached, to determine whether an input document is a positive example document or a negative example document, and that outputs a positive example side soft margin and a negative example side soft margin that are ranges in which the reliability of the determination result is low; and
the document classification learning control apparatus according to any one of claims 1 to 4.

6. A computer program for performing document classification learning control processing that causes a discriminator to execute learning, the discriminator learning, by using learning data composed of documents to which a positive example label indicating a positive example document related to a specific type of information or a negative example label indicating a negative example document not related to the specific type of information is attached, to determine whether an input document is a positive example document or a negative example document, and outputting a positive example side soft margin and a negative example side soft margin that are ranges in which the reliability of the determination result is low, the computer program causing a computer to execute:
a step of switching among inputting the learning data to the discriminator, inputting reinforcement learning data composed of unlabeled documents, and inputting the learning data together with annotation target data to which labels are attached;
a step of grouping, for the reinforcement learning data determined by the discriminator to be positive example documents or negative example documents, similar documents using only the documents within the positive example side soft margin, and grouping similar documents using only the documents within the negative example side soft margin; and
a step of setting documents in the grouped clusters as annotation target data.