JP5667004B2 - Data classification apparatus, method and program

Info

Publication number: JP5667004B2
Application number: JP2011158410A
Authority: JP
Inventors: 内山 俊郎; 甲谷 優; 堤田 恭太
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-07-19
Filing date: 2011-07-19
Publication date: 2015-02-12
Anticipated expiration: 2031-07-19
Also published as: JP2013025496A

Description

The present invention relates to a data classification apparatus, method, and program, and more particularly to a data classification apparatus, method, and program that classify data using classification scores.

One known way to use the outputs of multiple classification score calculation means simultaneously when classifying input data is the following classification technique: for each classification score calculation means, compute a confidence that the class with the maximum classification score is the correct class for the input data, and take the class output by the classification score calculation means with the maximum confidence as the class to which the input data belongs (see, for example, Non-Patent Document 1).

Non-Patent Document 1: 内山俊郎, 別所克人, 内山匡, 奥雅博, "Combination of multiple classifiers using confidence estimation" (確信度推定を用いた複数分類器の結合), Japanese Society for Artificial Intelligence, Proceedings of the Intelligence-Based Systems Study Group (知能ベースシステム研究会), January 2009.

In the prior art described above, however, the technique of Non-Patent Document 1 computes, for each classification score calculation means and from that means' classification scores alone, the probability (the confidence) that the class with the top classification score is correct. The confidence calculation therefore cannot reflect the classification scores of the other classification score calculation means, which raises the concern that prediction accuracy is degraded.

The present invention has been made in view of the above, and its object is to provide a data classification apparatus, method, and program that can select a classification score calculation means more accurately than a method that predicts the confidence of each classifier's output from only that classification score calculation means' own classification scores.

To solve the above problem, the present invention (claim 1) is a data classification apparatus that classifies predetermined input data into classes, comprising:
calculation parameter storage means for storing relative confidence calculation parameters obtained in advance;
a plurality of classification score calculation means that differ from one another in classification technique, or in components or features;
classification score storage means for storing classification scores;
relative confidence calculation means for calculating a relative confidence; and
data class determination means for determining the class or class group to which the input data belongs,
wherein the classification score calculation means includes means for calculating a classification score, which is a posterior probability that the predetermined input data belongs to each of a plurality of classes, a likelihood, or a distance from a separating hyperplane, a larger value indicating a higher probability of belonging to the corresponding class, and for storing the classification score in the classification score storage means,
the relative confidence calculation means includes means for acquiring from the classification score storage means all the classification scores output by all the classification score calculation means and for calculating from them, using the relative confidence calculation parameters stored in the calculation parameter storage means, a relative confidence for each classification score calculation means, the relative confidence representing the probability that the correct class is in the class or class group corresponding to the top N (N ≥ 1) classification scores of that classification score calculation means and is not in the class or class group corresponding to the top N classification scores of any other classification score calculation means, and
the data class determination means includes means for determining the class or class group to which the input data belongs on the basis of the classification scores of the classification score calculation means with the highest relative confidence.

Further, in the present invention (claim 2), the relative confidence calculation means includes means for performing the calculation using the top N+1 classification scores of every classification score calculation means when a class group of N classes is to be determined.
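The input to the relative confidence calculation can be pictured as one vector holding the top N+1 scores of every classification score calculation means. The following is a minimal sketch under that reading; the function names `top_scores` and `build_feature_vector` and the list-of-lists layout are assumptions made for illustration and do not appear in the patent.

```python
from typing import List, Sequence


def top_scores(scores: Sequence[float], count: int) -> List[float]:
    """Return the `count` largest scores in descending order."""
    return sorted(scores, reverse=True)[:count]


def build_feature_vector(all_scores: Sequence[Sequence[float]], n: int) -> List[float]:
    """Concatenate the top N+1 scores of every score calculator.

    all_scores[i][k] is the score that calculator i assigns to class k
    (hypothetical layout). The result has len(all_scores) * (n + 1) entries.
    """
    features: List[float] = []
    for scores in all_scores:
        if len(scores) < n + 1:
            raise ValueError("each calculator needs at least N+1 class scores")
        features.extend(top_scores(scores, n + 1))
    return features


# Two calculators, three classes, N = 1: two scores are kept per calculator.
example = build_feature_vector([[0.7, 0.2, 0.1], [0.25, 0.35, 0.4]], n=1)
```

Keeping one score beyond the top N presumably lets the model see the margin between the last class inside the group and the first class outside it, although the patent does not state this rationale.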

Further, in the present invention (claim 3), the relative confidence calculation means uses two-class or multi-class logistic regression (including kernel-based variants), a support vector machine, or another machine learning technique based on a discriminative model.

The present invention (claim 4) is a data classification method for classifying predetermined input data into classes, in which the following steps are performed:
a classification score calculation step in which a plurality of classification score calculation means, which differ from one another in classification technique, or in components or features, calculate a classification score, which is a posterior probability that the predetermined input data belongs to each of a plurality of classes, a likelihood, or a distance from a separating hyperplane, a larger value indicating a higher probability of belonging to the corresponding class, and store the classification score in classification score storage means;
a relative confidence calculation step in which relative confidence calculation means acquires from the classification score storage means all the classification scores output by all the classification score calculation means and calculates from them, using relative confidence calculation parameters stored in calculation parameter storage means, a relative confidence for each classification score calculation means, the relative confidence representing the probability that the correct class is in the class or class group corresponding to the top N (N ≥ 1) classification scores of that classification score calculation means and is not in the class or class group corresponding to the top N classification scores of any other classification score calculation means; and
a data class determination step in which data class determination means determines the class or class group to which the input data belongs on the basis of the classification scores of the classification score calculation means with the highest relative confidence.

Further, in the present invention (claim 5), when a class group of N classes is to be determined in the relative confidence calculation step, the calculation is performed using the top N+1 classification scores of every classification score calculation means.

Further, in the present invention (claim 6), the relative confidence calculation step uses two-class or multi-class logistic regression (including kernel-based variants), a support vector machine, or another machine learning technique based on a discriminative model.

The present invention (claim 7) is a data classification program that causes a computer to function as each means of the data classification apparatus according to any one of claims 1 to 3.

By using all the classification score information output by the plurality of classification score calculation means, the present invention obtains a relative confidence that reflects the classification scores of the other classification score calculation means, and infers which classification score calculation means has a high relative confidence. This is more accurate than predicting the confidence of each classification score calculation means' output from that means' own classification scores alone, allows the classification score calculation means to be selected appropriately, and as a result yields highly accurate classification results.

FIG. 1 is a configuration diagram of the data classification apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart of the processing of the data classification apparatus according to an embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

The present invention takes as input all the classification scores output by the plurality of classification score calculation means and predicts the relative confidence of each classification score calculation means by treating the prediction as a classification problem.

FIG. 1 shows the configuration of a data classification apparatus according to an embodiment of the present invention.

The data classification apparatus 10 shown in the figure comprises an input unit 11, a classification score calculation control unit 12, a plurality of classification score calculation means 13, a relative confidence calculation unit 14, a data class determination unit 15, a memory 16, a classification score storage unit 17, a calculation parameter storage unit 18, a relative confidence storage unit 19, and a class group number storage unit 20. A processing target storage unit 1 and a keyboard 2 are connected to the input unit 11, and a display 3 is connected to the data class determination unit 15, which also serves as the output unit.

The processing target storage unit 1 is a database that stores processing targets such as documents, and it is read by the input unit 11.

The memory 16 stores the processing target input by the input unit 11.

The calculation parameter storage unit 18 stores the relative confidence calculation parameters obtained in advance by the model parameter estimation procedure for the logistic regression model described later.

The relative confidence storage unit 19 stores the relative confidence obtained by the relative confidence calculation unit 14.

The class group number storage unit 20 stores the class numbers corresponding to the top N classification scores input to the relative confidence calculation unit 14.

The input unit 11 reads the processing target data from the processing target storage unit 1 and stores it in the memory 16. It also acquires the number of classification score calculation means 13 and the number of destination classes entered from the keyboard 2, and passes them to the classification score calculation control unit 12.

The classification score calculation control unit 12 selects the classification score calculation means 13 to which the features of the processing target data in the memory 16 are to be input, and has it calculate classification scores.

Each classification score calculation means 13 calculates a classification score, such as a posterior probability that the input data belongs to each of a plurality of classes, a likelihood, or a distance from a separating hyperplane, and stores it in the classification score storage unit 17. The classification score calculation means 13 differ from one another in classification technique, or in components or features. Examples include a generative approach, which models the joint probability distribution of the data's feature vector and class label and estimates the class label from the class posterior probability derived by Bayes' rule, and a discriminative approach, which models the class posterior probability directly.

The relative confidence calculation unit 14 calculates a relative confidence for each classification score calculation means from all the classification scores stored in the classification score storage unit 17, using the relative confidence calculation parameters stored in the calculation parameter storage unit 18, and stores it in the relative confidence storage unit 19. The relative confidence represents the probability that the correct class is in the class or class group corresponding to the top N (N ≥ 1) scores of that classification score calculation means and is not in the class or class group corresponding to the top N scores of any other classification score calculation means.
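The relative confidence described above can be written as a conditional probability. The notation below is an assumption made for illustration: c* denotes the correct class, h_i the class group of the top N scores of the i-th classification score calculation means, n the number of such means, and s the collection of all classification scores.

```latex
P_i = \Pr\left( c^{*} \in h_i \ \text{and}\ c^{*} \notin h_j \ \text{for all } j \neq i \;\middle|\; \mathbf{s} \right), \qquad i = 1, \dots, n
```

Because the events for different i are mutually exclusive, at most one of them can hold for a given input.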

The data class determination unit 15 acquires the relative confidences from the relative confidence storage unit 19, determines the class or class group to which the input data belongs on the basis of the classification scores of the classification score calculation means with the highest relative confidence, and outputs it.

FIG. 2 is a flowchart of the processing of the data classification apparatus according to an embodiment of the present invention. The description assumes that the class group containing the class to which the input data belongs consists of N (≥ 1) classes.

Step 101) The input unit 11 reads the input data to be processed from the processing target storage unit 1 into the memory 16.

Step 102) The input unit 11 acquires the number n of classification score calculation means 13 and the number of classes K entered from the keyboard 2, and passes them to the classification score calculation control unit 12.

Step 103) The classification score calculation control unit 12 initializes the index i of the classification score calculation means 13 to 1 (i = 1).

Step 104) If i ≤ n, the classification score calculation control unit 12 proceeds to Step 105; otherwise it proceeds to Step 110.

Step 105) The classification score calculation control unit 12 inputs the features of the input data W stored in the memory 16 to the i-th classification score calculation means 13i, which calculates the classification score with which the input data belongs to each class C_k (k = 1, …, K). The classification score is a posterior probability, a likelihood, a distance from a separating hyperplane, or the like, and a larger value indicates a higher probability of belonging to the corresponding class. One possible calculation method is naive Bayes, described in Reference 1 (上田修功, 斉藤和巳, "Probabilistic models for multi-topic text: the forefront of text model research (1)" (多重トピックテキストの確率モデル テキストモデル研究の最前線（１）), Information Processing Society of Japan, journal 情報処理, Vol. 45, No. 2, pp. 184-190, February 2004). The calculated classification scores are stored in the classification score storage unit 17.

Step 106) The class group h_i corresponding to the top N classification scores of the i-th classification score calculation means 13i is stored in the class group number storage unit 20.

Step 107) The index i is incremented to i + 1, and the process returns to Step 104.
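The loop of Steps 103 to 107 can be sketched as follows. This is an illustrative sketch only: the two toy score calculators, the function names, and the use of Python lists in place of the storage units are assumptions, not part of the patent.

```python
from typing import Callable, List, Sequence, Tuple

# A score calculator maps a feature vector to one score per class (hypothetical type).
ScoreCalculator = Callable[[Sequence[float]], List[float]]


def top_n_classes(scores: Sequence[float], n: int) -> List[int]:
    """Class indices of the N highest scores, best first (ties keep the lower index first)."""
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    return order[:n]


def run_score_loop(
    features: Sequence[float], calculators: Sequence[ScoreCalculator], n: int
) -> Tuple[List[List[float]], List[List[int]]]:
    score_store: List[List[float]] = []   # stands in for classification score storage unit 17
    class_groups: List[List[int]] = []    # stands in for class group number storage unit 20
    i = 1                                 # Step 103
    while i <= len(calculators):          # Step 104
        scores = calculators[i - 1](features)            # Step 105
        score_store.append(scores)
        class_groups.append(top_n_classes(scores, n))    # Step 106
        i += 1                                           # Step 107
    return score_store, class_groups


# Toy calculators, for illustration only.
def identity_scores(x: Sequence[float]) -> List[float]:
    return list(x)


def reversed_scores(x: Sequence[float]) -> List[float]:
    return list(reversed(x))


score_store, class_groups = run_score_loop(
    [0.6, 0.3, 0.1], [identity_scores, reversed_scores], n=2
)
```

Here the first calculator ranks classes 0 and 1 highest and the second ranks classes 2 and 1 highest, so the two class groups differ, which is the situation the relative confidence is meant to resolve.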

Step 110) The classification scores with which the input data W belongs to each class are read from the classification score storage unit 17, and the top N+1 scores of every classification score calculation means 13 are denoted L_m (m = 1, …, N+1). The relative confidence calculation parameters are read from the calculation parameter storage unit 18, and the relative confidence P_i of the i-th classification score calculation means 13i is obtained by Equations (1) and (2) and stored in the relative confidence storage unit 19. (The parameter symbol and Equations (1) and (2) are missing from this text.)
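Since the equations themselves are not available in this text, the following sketch only illustrates the standard softmax form of multi-class logistic regression applied to the concatenated top N+1 scores. The weight matrix `weights`, the bias vector `biases`, and the function name are assumptions; the patent's actual parameterization may differ.

```python
import math
from typing import List, Sequence


def relative_confidence(
    features: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> List[float]:
    """Softmax over one linear activation per score calculator.

    features: concatenated top N+1 scores of all calculators.
    weights[i], biases[i]: assumed parameters for calculator i.
    """
    activations = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    peak = max(activations)  # subtract the maximum for numerical stability
    exps = [math.exp(a - peak) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative parameters: each calculator is scored by its top-1 minus top-2 margin.
confidences = relative_confidence(
    [0.7, 0.2, 0.4, 0.35],
    weights=[[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]],
    biases=[0.0, 0.0],
)
```

With these made-up weights the first calculator, whose top score is well separated from its runner-up, receives the larger relative confidence.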

Equations (1) and (2) are multi-class logistic regression equations. The calculation parameters may be obtained as follows, using the multi-class logistic regression analysis described in Reference 2 (C. M. Bishop, "Pattern Recognition and Machine Learning" (Japanese edition: パターン認識と機械学習), pp. 208-209, Springer, 2006). The explanatory variables are the n × (N+1) classification scores formed from the top N+1 scores calculated by every classification score calculation means 13. The objective variable is an n-dimensional vector defined only for cases in which exactly one classification score calculation means has the correct class in the class or class group corresponding to its top N scores and no other classification score calculation means does; the element corresponding to that classification score calculation means 13i is 1 and all other elements are 0. Using as training data only the samples (explanatory and objective variables) for which the objective variable vector can be defined, the logistic regression parameters are obtained by maximum likelihood estimation and used as the calculation parameters for the relative confidence.
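The construction of the objective variable can be sketched as follows: a training sample is kept only when exactly one score calculator has the correct class in its top N. The function names and the data layout are assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def top_n_classes(scores: Sequence[float], n: int) -> List[int]:
    """Class indices of the N highest scores, best first."""
    return sorted(range(len(scores)), key=lambda k: -scores[k])[:n]


def make_target(
    all_scores: Sequence[Sequence[float]], true_class: int, n: int
) -> Optional[List[int]]:
    """Return the one-hot objective vector, or None if the sample is unusable.

    all_scores[i][k] is the score calculator i gives to class k (hypothetical layout).
    The sample is usable only if exactly one calculator has the correct class
    among its top N classes.
    """
    hits = [true_class in top_n_classes(scores, n) for scores in all_scores]
    if sum(hits) != 1:
        return None  # zero or several calculators were right: no target is defined
    return [1 if hit else 0 for hit in hits]


scores_a = [0.7, 0.2, 0.1]   # calculator 1 prefers class 0
scores_b = [0.1, 0.3, 0.6]   # calculator 2 prefers class 2

only_first_right = make_target([scores_a, scores_b], true_class=0, n=1)
nobody_right = make_target([scores_a, scores_b], true_class=1, n=1)
both_right = make_target([scores_a, scores_a], true_class=0, n=1)
```

Samples for which `make_target` returns `None` would simply be left out of the training set before the maximum likelihood fit.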

Step 111) The relative confidences P_i (i = 1, …, n) of the classification score calculation means 13 are read from the relative confidence storage unit 19, and the classification score calculation means with the maximum relative confidence is selected.

Step 112) The class group h corresponding to the top N classification scores output by the selected classification score calculation means is read from the class group number storage unit 20 and output as the class group that contains the class to which the input data belongs.
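Steps 111 and 112 amount to an argmax over the relative confidences followed by a lookup of the stored class group. A minimal sketch, with hypothetical names and plain lists standing in for the storage units:

```python
from typing import List, Sequence


def decide_class_group(
    confidences: Sequence[float], class_groups: Sequence[List[int]]
) -> List[int]:
    """Return the class group of the calculator with the highest relative confidence."""
    if len(confidences) != len(class_groups):
        raise ValueError("one confidence and one class group per calculator are required")
    best = max(range(len(confidences)), key=lambda i: confidences[i])  # Step 111
    return class_groups[best]                                          # Step 112


chosen = decide_class_group([0.2, 0.8], [[0, 1], [2, 1]])
```

If two calculators tie, this sketch keeps the one with the lower index; the patent does not specify a tie-breaking rule.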

The operations of the components of the data classification apparatus described above can be implemented as a program, which can be installed and executed on a computer used as the data classification apparatus or distributed over a network.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible within the scope of the claims.

Description of reference numerals
1 Processing target storage unit
2 Keyboard
3 Display
10 Data classification apparatus
11 Input unit
12 Classification score calculation control unit
13 Classification score calculation means
14 Relative confidence calculation unit
15 Data class determination unit
16 Memory
17 Classification score storage unit
18 Calculation parameter storage unit
19 Relative confidence storage unit
20 Class group number storage unit

Claims

1. A data classification apparatus for classifying predetermined input data into classes, comprising:
calculation parameter storage means for storing relative confidence calculation parameters obtained in advance;
a plurality of classification score calculation means that differ from one another in classification technique, or in components or features;
classification score storage means for storing classification scores;
relative confidence calculation means for calculating a relative confidence; and
data class determination means for determining the class or class group to which the input data belongs,
wherein the classification score calculation means includes means for calculating a classification score, which is a posterior probability that the predetermined input data belongs to each of a plurality of classes, a likelihood, or a distance from a separating hyperplane, a larger value indicating a higher probability of belonging to the corresponding class, and for storing the classification score in the classification score storage means,
the relative confidence calculation means includes means for acquiring from the classification score storage means all the classification scores output by all the classification score calculation means and for calculating from them, using the relative confidence calculation parameters stored in the calculation parameter storage means, a relative confidence for each classification score calculation means, the relative confidence representing the probability that the correct class is in the class or class group corresponding to the top N (N ≥ 1) classification scores of that classification score calculation means and is not in the class or class group corresponding to the top N classification scores of any other classification score calculation means, and
the data class determination means includes means for determining the class or class group to which the input data belongs on the basis of the classification scores of the classification score calculation means with the highest relative confidence.

2. The data classification apparatus according to claim 1, wherein the relative confidence calculation means includes means for performing the calculation using the top N+1 classification scores of every classification score calculation means when a class group of N classes is to be determined.

3. The data classification apparatus according to claim 1, wherein the relative confidence calculation means uses two-class or multi-class logistic regression (including kernel-based variants), a support vector machine, or another machine learning technique based on a discriminative model.

4. A data classification method for classifying predetermined input data into classes, characterized by performing:
a classification score calculation step in which a plurality of classification score calculation means, which differ from one another in classification technique, or in components or features, calculate a classification score, which is a posterior probability that the predetermined input data belongs to each of a plurality of classes, a likelihood, or a distance from a separating hyperplane, a larger value indicating a higher probability of belonging to the corresponding class, and store the classification score in classification score storage means;
a relative confidence calculation step in which relative confidence calculation means acquires from the classification score storage means all the classification scores output by all the classification score calculation means and calculates from them, using relative confidence calculation parameters stored in calculation parameter storage means, a relative confidence for each classification score calculation means, the relative confidence representing the probability that the correct class is in the class or class group corresponding to the top N (N ≥ 1) classification scores of that classification score calculation means and is not in the class or class group corresponding to the top N classification scores of any other classification score calculation means; and
a data class determination step in which data class determination means determines the class or class group to which the input data belongs on the basis of the classification scores of the classification score calculation means with the highest relative confidence.

5. The data classification method according to claim 4, wherein, when a class group of N classes is to be determined in the relative confidence calculation step, the calculation is performed using the top N+1 classification scores of every classification score calculation means.

6. The data classification method according to claim 4, wherein the relative confidence calculation step uses two-class or multi-class logistic regression (including kernel-based variants), a support vector machine, or another machine learning technique based on a discriminative model.

7. A data classification program for causing a computer to function as each means of the data classification apparatus according to any one of claims 1 to 3.