JP2014085948A

JP2014085948A - Misclassification detection apparatus, method, and program

Info

Publication number: JP2014085948A
Application number: JP2012236020A
Authority: JP
Inventors: Akinori Fujino; 昭典藤野; Masaaki Nagata; 昌明永田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-10-25
Filing date: 2012-10-25
Publication date: 2014-05-12
Anticipated expiration: 2032-10-25
Also published as: JP5905375B2

Abstract

PROBLEM TO BE SOLVED: To detect misclassified samples with high accuracy by suppressing the adverse effect that misclassified samples have on the training of the statistical classifier used for misclassification detection. SOLUTION: A probability model generation unit 12 divides a misclassification detection target sample set D into two subsample sets D_1 and D_2, and calculates a parameter estimate ^Ψ of a conditional class probability model P(y|x;Ψ), a statistical classifier fitted to both the subsample set D_1 and an unlabeled sample set D_u. A misclassification determination unit 16 uses the parameter estimate ^Ψ to calculate the conditional class probability P(y|x;^Ψ) of each category for each misclassification detection target sample (x, y) in the subsample set D_2, and determines whether each of these samples is a misclassified sample.

Description

The present invention relates to a misclassification detection apparatus, method, and program, and more particularly to a misclassification detection apparatus, method, and program for detecting, from a sample set, samples of content that have been classified into an incorrect category.

Content is in many cases categorized by manual classification work. Alternatively, a statistical classifier is trained using, as training data, a plurality of content items that have been manually classified into categories, and this classifier is then used to estimate the category of content whose category is unknown, so that content is categorized automatically.

However, manual classification work always carries the risk of misclassification, in which content is assigned to an incorrect category. Moreover, when samples of content classified into an incorrect category (hereinafter, "misclassified samples") are used as training data for a statistical classifier, the automatic classification performance of the classifier degrades. A misclassification detection technique that detects misclassified samples among given classified samples is therefore important.

In the conventional techniques, to estimate the misclassified samples in a set of categorized samples, all of the categorized samples are first used as training data, and the category of each sample is estimated with a statistical classifier trained using cross-validation. Next, a sample whose estimated category does not match the category it is classified into is detected as a misclassified sample. To improve detection accuracy, the techniques of Non-Patent Documents 1 and 2 take a majority vote over the category estimates obtained from a plurality of statistical classifiers, thereby suppressing the adverse effect of category estimation bias that depends on the type of statistical classifier. The technique of Non-Patent Document 3 improves the detection accuracy of misclassified samples by improving the classification performance of the statistical classifier: in addition to the categorized samples, it uses for training an unlabeled sample set, a collection of samples whose category is unknown.

Non-Patent Document 1: Carla E. Brodley and Mark A. Friedl, "Identifying mislabeled training data", Journal of Artificial Intelligence Research, 11(11):131-166, 1999.
Non-Patent Document 2: Sundara Venkataraman, Dimitris Metaxas, Dmitriy Fradkin, Casimir Kulikowski, and Ilya Muchnik, "Distinguishing mislabeled data from correctly labeled data in classifier design", In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'04), pages 668-672, 2004.
Non-Patent Document 3: D. Guan, W. Yuan, Y.-K. Lee, and S. Lee, "Identifying mislabeled training data with the aid of unlabeled data", Applied Intelligence, 35(3):345-358, 2010.

The techniques of Non-Patent Documents 1 and 2 apply cross-validation and train the statistical classifier on training data from which the samples targeted for misclassification detection have been removed. However, because the training data itself contains misclassified samples, there is no guarantee that a statistical classifier with high classification performance is obtained. Consequently, the classifier may assign the wrong category to many samples that are in fact correctly classified, so that correctly classified samples are falsely detected as misclassified. Conversely, for a misclassified sample, the manually assigned category may coincide with the category assigned by the classifier, so that the misclassified sample goes undetected. When the misclassification detection of each individual classifier is this unreliable, high detection accuracy cannot be expected even if the final decision is made by a majority vote of several classifiers.

The technique of Non-Patent Document 3 improves the performance of the statistical classifier used to detect misclassified samples by using, for training, samples of content that is of the same type as the content represented by the samples targeted for misclassification detection and whose category is unknown (hereinafter, "unlabeled samples"). In this technique, a plurality of statistical classifiers are first trained using, as training data, samples whose content category is known (hereinafter, "labeled samples"), misclassified samples included, and the trained classifiers are used to predict the category of each unlabeled sample. Each unlabeled sample on which the predicted categories of all classifiers agree is then paired with its predicted category and added to the training data as a new labeled sample, and the classifiers are retrained. By creating labeled samples from the unlabeled samples and adding them to the training data, the technique lowers the proportion of the misclassified samples originally contained in the training data, with the expectation that this suppresses their adverse effect on the training of the classifier.

However, a statistical classifier trained on training data that contains misclassified samples does not necessarily have high category prediction ability, so it may assign incorrect category predictions to many unlabeled samples. If a large number of unlabeled samples whose categories were predicted incorrectly, owing to the adverse effect of the originally included misclassified samples, are added to the training data, retraining cannot be expected to improve the performance of the classifier.

As described above, the conventional techniques have the problem that misclassified samples contained in the training data adversely affect the training of the statistical classifier used for misclassification detection.

The present invention has been made in view of the above circumstances, and its object is to provide a misclassification detection apparatus, method, and program that can detect misclassified samples with high accuracy by suppressing the adverse effect that misclassified samples have on the training of the statistical classifier used for misclassification detection.

To achieve the above object, the misclassification detection apparatus of the present invention comprises: probability model generation means for generating a conditional class probability model that indicates the probability that content belongs to each category, the model integrating a generative model and a discriminative model that are trained using both a first subsample set, which is a part of a labeled sample set in which the category of the content is known, and an unlabeled sample set in which the category of the content is unknown; and misclassification determination means for determining, based on the conditional class probability model generated by the probability model generation means, whether each labeled sample in a second subsample set, obtained by removing the first subsample set from the labeled sample set, is a sample of content classified into an incorrect category.

According to the misclassification detection apparatus of the present invention, the probability model generation means generates a conditional class probability model that indicates the probability that content belongs to each category, the model integrating a generative model and a discriminative model that are trained using both the first subsample set, which is a part of the labeled sample set in which the category of the content is known, and the unlabeled sample set in which the category of the content is unknown. The misclassification determination means determines, based on the conditional class probability model generated by the probability model generation means, whether each labeled sample in the second subsample set, obtained by removing the first subsample set from the labeled sample set, is a sample of content classified into an incorrect category.

In this way, whether each labeled sample in the second subsample set, the remainder of the labeled sample set, is misclassified is determined based on a conditional class probability model that integrates a generative model and a discriminative model trained using both the first subsample set, a part of the labeled sample set, and the unlabeled sample set. The adverse effect of misclassified samples on the training of the statistical classifier used for misclassification detection is thereby suppressed, and misclassified samples can be detected with high accuracy.

The misclassification determination means can determine a labeled sample in the second subsample set to be a misclassified sample in any of the following cases: when the category assigned to the sample's content does not match the category whose probability under the conditional class probability model is largest for that content; when the probability of the assigned category under the conditional class probability model is less than a predetermined first threshold; or when the ratio of the probability of the assigned category to the maximum of the probabilities of the other categories, all under the conditional class probability model, is less than a predetermined second threshold.

The misclassification detection method of the present invention is a method performed in a misclassification detection apparatus that includes probability model generation means and misclassification determination means. In this method, the probability model generation means generates a conditional class probability model that indicates the probability that content belongs to each category, the model integrating a generative model and a discriminative model that are trained using both a first subsample set, which is a part of a labeled sample set in which the category of the content is known, and an unlabeled sample set in which the category of the content is unknown; and the misclassification determination means determines, based on the conditional class probability model generated by the probability model generation means, whether each labeled sample in a second subsample set, obtained by removing the first subsample set from the labeled sample set, is a sample of content classified into an incorrect category.

The misclassification detection program of the present invention is a program for causing a computer to function as each of the means constituting the above misclassification detection apparatus.

As described above, the misclassification detection apparatus, method, and program of the present invention determine whether each labeled sample in the second subsample set, the remainder of the labeled sample set, is misclassified based on a conditional class probability model that integrates a generative model and a discriminative model trained using both the first subsample set, a part of the labeled sample set, and the unlabeled sample set. This provides the effect that the adverse effect of misclassified samples on the training of the statistical classifier used for misclassification detection is suppressed and misclassified samples can be detected with high accuracy.

FIG. 1 is a schematic diagram showing the configuration of the misclassification detection apparatus according to the present embodiment.
FIG. 2 is a flowchart showing the misclassification detection processing routine in the misclassification detection apparatus according to the present embodiment.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. The present embodiment describes the case where the invention is applied to a misclassification detection apparatus that detects misclassified samples in a set of samples in which content representable by feature vectors has been classified into categories denoting a type, such as sports, music, or mathematics. Such content includes content consisting of text information, such as papers, patents, and other documents held in a database, online news data, and e-mail; content consisting of text information and link information, such as Web data and blog data; and content such as image data.

<System configuration>
The misclassification detection apparatus 10 according to the present embodiment receives as input a set of content samples, each labeled with the category it belongs to, and detects and outputs the misclassified samples in that set, that is, the samples labeled with an incorrect category. The misclassification detection apparatus 10 is configured as a computer including a CPU, a RAM, and a ROM that stores a program for executing the misclassification detection processing routine described later. Functionally, as shown in FIG. 1, this computer can be represented as a configuration including a probability model generation unit 12, a parameter storage unit 14, and a misclassification determination unit 16.

The misclassification detection apparatus 10 accepts as input a misclassification detection target sample set D and an unlabeled sample set D_u.

The misclassification detection target sample set D is a set of content samples labeled with the category they belong to, and is the set of samples subject to misclassification detection by this apparatus. Let T = {t_1, ..., t_i, ..., t_V} be the feature space formed by the words, pixels, links, or combinations thereof contained in content. The feature vector x of a content item is then expressed as x = (x_1, ..., x_i, ..., x_V)^T, where x_i is the frequency of t_i in the content. V is the number of kinds of features that may be contained in content; for example, when the content is text data, V is the total number of vocabulary words that may appear in the content. Each misclassification detection target sample in the set consists of the feature vector x of the content and the label y of the category to which the content belongs. The set accepted here is written D = {(x_n, y_n)}_{n=1}^{N}, where n is the ID number of a misclassification detection target sample in the set, N is the total number of such samples, x_n is the feature vector of the n-th sample, and y_n is the category label assigned to the n-th sample, with y ∈ {1, ..., k, ..., K}. K is the total number of categories.
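The frequency-based feature vector defined above can be illustrated with a minimal sketch for text content. The whitespace tokenizer, the helper names `build_vocabulary` and `feature_vector`, and the two toy documents are assumptions made for illustration only; the source does not specify how features are extracted.

```python
from collections import Counter

def build_vocabulary(documents):
    """Collect the feature space T = {t_1, ..., t_V} as a sorted list of distinct tokens."""
    return sorted({token for doc in documents for token in doc.split()})

def feature_vector(document, vocabulary):
    """Return x = (x_1, ..., x_V), where x_i is the frequency of t_i in the document."""
    counts = Counter(document.split())
    return [counts[t] for t in vocabulary]

# Toy labeled set D = {(x_n, y_n)}: two documents with category labels 1 and 2.
docs = ["the match ended in a draw", "the band played the song"]
labels = [1, 2]
vocab = build_vocabulary(docs)  # V = len(vocab)
D = [(feature_vector(doc, vocab), y) for doc, y in zip(docs, labels)]
```

Here every vector has length V, and a word that does not occur in a document simply contributes a zero entry.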

Each unlabeled sample in the unlabeled sample set D_u consists only of the feature vector x of the content. The set accepted here is written D_u = {x_m}_{m=1}^{M}, where m is the ID number of an unlabeled sample in the set, M is the total number of unlabeled samples, and x_m is the feature vector of the m-th unlabeled sample.

Using the misclassification detection target sample set D and the unlabeled sample set D_u, the probability model generation unit 12 generates a conditional class probability model, which indicates the probability that content belongs to each category, as the statistical classifier used to judge whether each misclassification detection target sample is misclassified.

Specifically, the probability model generation unit 12 divides the misclassification detection target sample set D into two subsample sets D_1 and D_2, and calculates the parameter estimate ^Ψ of the conditional class probability model P(y|x;Ψ) using one subsample set, D_1, together with the unlabeled sample set D_u. The probability model generation unit 12 stores the calculated parameter estimate ^Ψ in the parameter storage unit 14.

The parameter Ψ is shorthand, Ψ = [W, Θ, β], for the parameters W, Θ, and β of the statistical classifier described in Non-Patent Document 4 (Akinori Fujino, Naonori Ueda, and Masaaki Nagata, "Semi-supervised learning robust to selection bias of labeled data", IPSJ Transactions on Mathematical Modeling and Its Applications (TOM), 4(2), 31-42 (2011)). The statistical classifier of Non-Patent Document 4 integrates two models, both trained using the subsample set D_1 and the unlabeled sample set D_u together: a generative model, designed according to the type of content so that unlabeled samples can be used for training, and a discriminative model, which directly learns the classification boundary of the training data (labeled samples). This suppresses overfitting of the classifier to the misclassified samples contained in D_1, and thus suppresses the adverse effect of those samples on the performance of the classifier. Here W denotes the parameters of the discriminative model, Θ denotes the parameters of the generative model, and β is the weight given to the generative model when the discriminative and generative models are integrated.

Using the parameter estimate ^Ψ stored in the parameter storage unit 14, the misclassification determination unit 16 calculates the conditional class probability P(y|x_s;^Ψ) for each misclassification detection target sample (x_s, y_s) in the subsample set D_2, which was not used to calculate ^Ψ, and determines whether each of these samples is a misclassified sample.

For example, as in equation (1), the category that gives the maximum probability P(y|x_s;^Ψ), that is, ^y_s = argmax_y P(y|x_s;^Ψ), is taken as the predicted category ^y_s, and when y_s ≠ ^y_s the misclassification detection target sample (x_s, y_s) is determined to be a misclassified sample.

Alternatively, the misclassification determination unit 16 may set a threshold a (0 < a < 1) and determine that a misclassification detection target sample (x_s, y_s) satisfying P(y_s|x_s;^Ψ) < a is a misclassified sample.

The misclassification determination unit 16 may also set a real-valued threshold b and determine that a misclassification detection target sample (x_s, y_s) satisfying equation (2) is a misclassified sample, where equation (2) compares against b the ratio of P(y_s|x_s;^Ψ) to the maximum of P(y|x_s;^Ψ) over the categories y ≠ y_s.
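Equations (1) and (2) do not survive in this text, so the following is a reconstruction from the surrounding wording rather than the published formulas. Equation (1) follows directly from the definition of the predicted category. Equation (2) is written here as a plain ratio, as in the summary of the invention; the published form may differ (for example, a log-ratio would be consistent with b being an arbitrary real number), so its exact shape should be treated as an assumption.

```latex
\[
  \hat{y}_s = \operatorname*{arg\,max}_{y \in \{1,\dots,K\}} P(y \mid x_s; \hat{\Psi})
  \tag{1}
\]
\[
  \frac{P(y_s \mid x_s; \hat{\Psi})}
       {\max_{y \neq y_s} P(y \mid x_s; \hat{\Psi})} < b
  \tag{2}
\]
```

Under this reading, a sample whose assigned category is also the most probable one has a ratio of at least 1, so a threshold b below 1 only flags samples that the first criterion would flag as well, whereas b above 1 additionally flags samples whose assigned category wins by a narrow margin.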

The misclassification determination unit 16 determines, for every misclassification detection target sample in the set D, whether it is a misclassified sample, and outputs information on the samples determined to be misclassified as the misclassification detection result.
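The three decision criteria described above can be sketched as small predicates over the conditional class probabilities of one sample. The function names, the dictionary representation of P(y|x_s;^Ψ), and the example probabilities are assumptions for illustration; the ratio rule follows the wording of the summary (ratio of the assigned category's probability to the largest probability among the other categories) and assumes at least two categories.

```python
def argmax_rule(probs, y):
    """Flag the sample when the most probable category differs from the assigned label y."""
    predicted = max(probs, key=probs.get)
    return predicted != y

def threshold_rule(probs, y, a):
    """Flag the sample when P(y | x) for the assigned label is below threshold a (0 < a < 1)."""
    return probs[y] < a

def ratio_rule(probs, y, b):
    """Flag the sample when P(y | x) / max over other categories of P(y' | x) is below b."""
    best_other = max(p for k, p in probs.items() if k != y)
    if best_other == 0.0:
        return False  # the assigned label holds all the probability mass
    return probs[y] / best_other < b

# Hypothetical probabilities for one sample over K = 3 categories.
probs = {1: 0.7, 2: 0.2, 3: 0.1}
flags_for_label_2 = (argmax_rule(probs, 2), threshold_rule(probs, 2, 0.3), ratio_rule(probs, 2, 0.5))
flags_for_label_1 = (argmax_rule(probs, 1), threshold_rule(probs, 1, 0.3), ratio_rule(probs, 1, 0.5))
```

With these numbers, a sample labeled 2 is flagged by all three rules, while a sample labeled 1 is flagged by none.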

<Operation of the misclassification detection apparatus>
Next, the operation of the misclassification detection apparatus 10 according to the present embodiment will be described. When the misclassification detection target sample set D and the unlabeled sample set D_u are input to the misclassification detection apparatus 10, the apparatus executes the misclassification detection processing routine shown in FIG. 2.

First, in step 100, the probability model generation unit 12 divides the accepted misclassification detection target sample set D into two subsample sets D_1 and D_2.

Next, in step 102, the probability model generation unit 12 uses both the subsample set D_1 obtained in step 100 and the accepted unlabeled sample set D_u to calculate the parameter estimate ^Ψ of the conditional class probability model P(y|x;Ψ), the statistical classifier that integrates the generative model and the discriminative model. The probability model generation unit 12 also stores the calculated parameter estimate ^Ψ in the parameter storage unit 14.

Next, in step 104, the misclassification determination unit 16 uses the parameter estimate ^Ψ stored in the parameter storage unit 14 to calculate, for each misclassification detection target sample (x_s, y_s) in the other subsample set D_2 obtained in step 100, the conditional class probability P(y|x_s;^Ψ) of each category, and determines whether each sample (x_s, y_s) is a misclassified sample. The misclassification determination unit 16 makes this determination for every misclassification detection target sample in D_2 and stores the ID numbers of the samples determined to be misclassified in a predetermined storage area.

Next, in step 106, the misclassification determination unit 16 determines whether the above determination has been performed for every misclassification detection target sample in the set D. If unprocessed samples remain, the routine returns to step 100 and repeats steps 100 to 106. In each repetition, the subsample set D_2 is selected from the samples in D that have not yet been judged, and the remaining samples of D form the subsample set D_1. Because the determination results obtained in one repetition do not affect the next, the misclassification detection processing may also be performed in parallel for each pair of subsample sets D_1 and D_2.
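The repetition of steps 100 to 106 can be sketched as a loop over disjoint held-out subsets. This is a minimal sketch under stated assumptions: the interleaved split, the number of folds, and the names `detect_misclassified`, `fit`, `flag`, and `toy_fit` are all hypothetical, indices are 0-based rather than the ID numbers 1..N of the text, and `toy_fit` is a deliberately trivial stand-in (class means of a scalar feature, ignoring D_u), not the hybrid generative/discriminative classifier of Non-Patent Document 4.

```python
import math

def detect_misclassified(D, D_u, fit, flag, n_folds=5):
    """Judge every sample in D exactly once, each time with a model that never saw it.

    D: list of (x, y); D_u: list of x.
    fit(D_1, D_u) returns a function x -> {category: probability}.
    flag(probs, y) returns True when the sample should be reported.
    """
    flagged = []
    for f in range(n_folds):
        held_out = list(range(f, len(D), n_folds))        # step 100: indices forming D_2
        held = set(held_out)
        D_1 = [D[n] for n in range(len(D)) if n not in held]
        predict_proba = fit(D_1, D_u)                     # step 102: train on D_1 and D_u
        for n in held_out:                                # step 104: judge samples in D_2
            x, y = D[n]
            if flag(predict_proba(x), y):
                flagged.append(n)
    return sorted(flagged)                                # step 108: report flagged indices

def toy_fit(D_1, D_u):
    """Hypothetical stand-in learner: per-class mean of a scalar feature (D_u is ignored)."""
    sums, counts = {}, {}
    for x, y in D_1:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    means = {y: sums[y] / counts[y] for y in sums}

    def predict_proba(x):
        # Normalized scores that decay with squared distance to each class mean.
        scores = {y: math.exp(-(x - m) ** 2) for y, m in means.items()}
        total = sum(scores.values())
        return {y: s / total for y, s in scores.items()}

    return predict_proba

def argmax_flag(probs, y):
    return max(probs, key=probs.get) != y

# Toy data: the sample at index 8 (x = 0.15) carries label 2 but sits among the class-1 samples.
D = [(0.0, 1), (5.1, 2), (0.1, 1), (5.2, 2), (0.2, 1),
     (5.0, 2), (0.3, 1), (5.3, 2), (0.15, 2), (0.25, 1)]
result = detect_misclassified(D, [], toy_fit, argmax_flag)
```

Because each pass depends only on its own pair (D_1, D_2), the body of the outer loop could be dispatched to separate workers, which matches the remark that the pairs may be processed in parallel.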

When the misclassification determination has been completed for all misclassification detection target samples, the routine proceeds to step 108, outputs the ID numbers of the misclassified samples stored in the predetermined storage area in step 104 as the misclassification detection result, and ends the misclassification detection processing routine.

As described above, in the misclassification detection apparatus according to the present embodiment, the statistical classifier (conditional class probability model) used to judge each misclassification detection target sample is trained on two sets: the misclassification detection target sample set from which the samples to be judged have been excluded (the labeled sample set), and the unlabeled sample set. This statistical classifier integrates a generative model and a discriminative model learned using both the labeled sample set and the unlabeled sample set, and is obtained with a learning technique that fits the model to both the unlabeled sample set and the misclassification detection target sample set from which the samples to be judged have been excluded.

Consequently, the adverse effect that misclassified samples have on the training of the statistical classifier used for misclassification detection is suppressed, and misclassified samples can be detected with high accuracy.

<Experimental example>
Next, the results of an experiment performed by applying the method according to the above embodiment will be described.

An evaluation experiment was conducted on the problem of classifying document data belonging to the top-level category "computer" into one of five subcategories, with the goal of detecting document data assigned to the wrong subcategory. The experiment used the 20 newsgroups database, which is commonly used for performance evaluation in text classification problems (see Non-Patent Document 5: K. Nigam, J. Lafferty, and A. McCallum, "Using maximum entropy for text classification", In IJCAI-99 Workshop on Machine Learning for Information Filtering, 61-67, 1999).

To create the evaluation data set, 1000 samples were randomly extracted from the document data belonging to the five subcategories. Then r_m% of these 1000 samples were randomly selected, and misclassified samples were created by randomly changing the subcategory of each selected document to one of the other four subcategories. In addition, to create the unlabeled sample set, 2000 samples different from those selected above were randomly extracted from the document data. The data set containing the misclassified samples obtained in this way was used as the misclassification detection target sample set for performance evaluation. The F value, which is widely used in information retrieval tasks and the like, was used as the performance measure. The F value is calculated by equation (3) from the number N_c of misclassified samples that the apparatus correctly detected, the number N_r of samples that the apparatus flagged, and the number N_n of misclassified samples contained in the misclassification detection target sample set. A larger F value indicates higher detection performance for misclassified samples.
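
The data preparation and scoring described above can be sketched as follows. Equation (3) is not reproduced in this text, so the sketch assumes the standard F-measure, that is, the harmonic mean of precision N_c/N_r and recall N_c/N_n, which simplifies to 2·N_c/(N_r + N_n). The function names, the fixed random seed, and the rounding of the number of altered samples are likewise assumptions made for illustration.

```python
import random
from typing import List, Sequence, Set, Tuple


def inject_label_noise(
    labels: Sequence[str], categories: Sequence[str], r_m: float, seed: int = 0
) -> Tuple[List[str], Set[int]]:
    """Change the category of r_m percent of the samples to a different,
    randomly chosen category. Returns the noisy labels and the indices
    of the samples that were altered."""
    rng = random.Random(seed)
    n_flip = round(len(labels) * r_m / 100)
    flipped = set(rng.sample(range(len(labels)), n_flip))
    noisy = list(labels)
    for i in sorted(flipped):
        # Pick uniformly among the categories other than the original one.
        noisy[i] = rng.choice([c for c in categories if c != labels[i]])
    return noisy, flipped


def f_value(detected: Set[int], truly_misclassified: Set[int]) -> float:
    """Assumed form of equation (3): F = 2 * N_c / (N_r + N_n)."""
    n_c = len(detected & truly_misclassified)  # correctly detected
    n_r = len(detected)                        # flagged by the apparatus
    n_n = len(truly_misclassified)             # actually misclassified
    if n_r + n_n == 0:
        return 0.0
    return 2 * n_c / (n_r + n_n)
```

With 1000 samples and r_m = 10, `inject_label_noise` would alter 100 labels, and the returned index set serves as the ground truth against which the detected ID numbers are scored.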

Table 1 shows example F values obtained with the misclassification detection apparatus according to the present embodiment (Method 1) and with a method that combines 5-fold cross-validation with TSVM, a representative statistical classifier that uses unlabeled samples for learning (Method 2; see Non-Patent Document 6: R. Collobert, F. Sinz, J. Weston, and L. Bottou, "Large scale transductive SVMs", Journal of Machine Learning Research, 7:1687-1712, 2006). It also shows example results for Method 3, which applies the unlabeled-data usage described in Non-Patent Document 3 to three statistical classifiers (a naive Bayes model, the 1-nearest-neighbor method, and a maximum entropy model) and reports the classifier that gave the highest F value. In Non-Patent Document 3, multiple statistical classifiers are trained using unlabeled data, and the misclassification determination is then made by majority vote or by unanimous agreement among their predictions. In Method 3 of this experiment, however, the results shown are those obtained before any majority or unanimous vote among the trained classifiers, so that the ways of using unlabeled data when training a statistical classifier can be compared directly. Methods 1 and 2 can each be used as one of the multiple statistical classifiers when building a misclassification detection apparatus that decides by majority vote or unanimous agreement.

Table 1 shows that, in every experiment performed with different values of r_m, the F value obtained with Method 1 was equal to or higher than the F values obtained with Methods 2 and 3. These results indicate that the misclassification detection apparatus according to the present invention is effective for detecting misclassified samples.

Note that the present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, although the present specification describes an embodiment in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium.

Description of symbols
10 Misclassification detection apparatus
12 Probability model generation unit
14 Parameter storage unit
16 Misclassification determination unit

Claims

A misclassification detection apparatus comprising:
probability model generation means for generating a conditional class probability model that indicates the probability that content belongs to each category, the conditional class probability model being a model that integrates a generative model and a discriminative model learned using both a first subsample set, which is a part of a labeled sample set in which the category to which each content belongs is known, and an unlabeled sample set in which the category to which each content belongs is unknown; and
misclassification determination means for determining, based on the conditional class probability model generated by the probability model generation means, whether each labeled sample included in a second subsample set, obtained by removing the first subsample set from the labeled sample set, is a sample of content classified into an incorrect category.

The misclassification detection apparatus according to claim 1, wherein the misclassification determination means determines a labeled sample included in the second subsample set to be a misclassified sample
when the category to which the content of the labeled sample belongs does not match the category having the maximum probability obtained from the conditional class probability model for that content,
when the probability, based on the conditional class probability model, of the category to which the content of the labeled sample belongs is less than a predetermined first threshold, or
when the ratio of the probability, based on the conditional class probability model, of the category to which the content of the labeled sample belongs to the maximum of the probabilities, based on the conditional class probability model, of the categories other than that category is less than a predetermined second threshold.

A misclassification detection method in a misclassification detection apparatus including probability model generation means and misclassification determination means, wherein
the probability model generation means generates a conditional class probability model that indicates the probability that content belongs to each category, the conditional class probability model being a model that integrates a generative model and a discriminative model learned using both a first subsample set, which is a part of a labeled sample set in which the category to which each content belongs is known, and an unlabeled sample set in which the category to which each content belongs is unknown, and
the misclassification determination means determines, based on the conditional class probability model generated by the probability model generation means, whether each labeled sample included in a second subsample set, obtained by removing the first subsample set from the labeled sample set, is a sample of content classified into an incorrect category.

A misclassification detection program for causing a computer to function as each of the means constituting the misclassification detection apparatus according to claim 1 or claim 2.