JP5063639B2

JP5063639B2 - Data classification method, apparatus and program

Info

Publication number: JP5063639B2
Application number: JP2009096415A
Authority: JP
Inventors: 俊郎内山; 克人別所; 毅晴江田; 千尋山本; 匡内山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-04-10
Filing date: 2009-04-10
Publication date: 2012-10-31
Anticipated expiration: 2029-04-10
Also published as: JP2010250391A

Description

The present invention relates to a data classification method, apparatus, and program, and more particularly to a data classification method, apparatus, and program that classify data by integrating the results calculated by a plurality of likelihood calculating means, each of which calculates likelihoods representing how plausible it is that input data belongs to each of a plurality of classes.

One known way of using the outputs of a plurality of likelihood calculating means at the same time when classifying input data is to take the logarithm of the likelihood output by each means (the likelihood that the input data belongs to each class), integrate the resulting log likelihoods by linear combination, and classify on the basis of the integrated log likelihood (see, for example, Non-Patent Document 1). Provided that the relative reliability of the likelihood calculating means is constant, integration by linear combination is reasonable, and it is widely used.

In this linear combination, if, for example, likelihood calculating means A is more reliable than likelihood calculating means B, the output of A is given a larger weight in the integration. In other words, the log likelihoods output by the likelihood calculating means are linearly combined on the basis of the maximum entropy principle, and classification is performed by obtaining the posterior probabilities.

Here, "likelihood" means plausibility; a probability, for example, can be regarded as a likelihood.

Non-Patent Document 1: Akinori Fujino, Naonori Ueda, Kazumi Saito, "Text Classification by Effective Use of Additional Information Based on Maximum Entropy Principle", IPSJ Journal, Vol. 47, No. 10, pp. 2929-2937, 2006

However, the conventional method described above, in which the likelihoods (or their logarithms, the log likelihoods) output by the likelihood calculating means are linearly combined and classification is based on the integrated likelihood (log likelihood), has the following problem.

Suppose, for example, that likelihood calculating means A judges the input to be "baseball", that likelihood calculating means B, which uses a different classification technique from A, judges it to be "soccer", and that the true class of the input data is "baseball". Because the likelihoods are linearly combined, the incorrect result from B can influence the outcome so that the input is wrongly judged to be "soccer". That is, the strengths of the individual likelihood calculating means are diluted, and correct classification results are frequently not obtained.

This problem arises from the assumption that the reliability of each likelihood calculating means is constant.
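The failure mode described above can be reproduced with a small sketch of a fixed-weight log-linear combination. The function name `combine_log_linear`, the toy likelihood values, and the equal weights are illustrative assumptions and are not taken from the patent or from Non-Patent Document 1.

```python
import math

def combine_log_linear(likelihoods, weights):
    """Fixed-weight log-linear combination of several likelihood sources.

    likelihoods: one dict per source, mapping class label -> likelihood (> 0).
    weights:     one weight per source (hypothetical values).
    Returns a dict mapping class label -> normalized posterior.
    """
    classes = likelihoods[0].keys()
    # Weighted sum of log likelihoods for each class.
    scores = {
        c: sum(w * math.log(src[c]) for src, w in zip(likelihoods, weights))
        for c in classes
    }
    # Normalize with a softmax, shifted by the maximum for numerical stability.
    top = max(scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exp_scores.values())
    return {c: v / total for c, v in exp_scores.items()}

# Source A is right but only mildly confident; source B is wrong and confident.
source_a = {"baseball": 0.6, "soccer": 0.4}
source_b = {"baseball": 0.1, "soccer": 0.9}
posterior = combine_log_linear([source_a, source_b], weights=[1.0, 1.0])
predicted = max(posterior, key=posterior.get)  # "soccer", although the truth is "baseball"
```

With these toy numbers the confident but wrong source dominates the sum, so the combined decision is "soccer". A fixed weight cannot correct this for one input without also changing the decision for every other input.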

The present invention has been made in view of the above, and its object is to provide a data classification method, apparatus, and program capable of obtaining more accurate classification results than when likelihoods are linearly combined and integrated.

FIG. 1 is a diagram for explaining the principle of the present invention.

The present invention (claim 1) is a data classification method for classifying predetermined input data into classes, comprising:
a likelihood calculation step (step 1) in which each of n (n is 2 or more) likelihood calculating means, which differ from one another in the classification technique used for the calculation or in their constituent elements or features, calculates likelihoods representing how plausible it is that the predetermined input data belongs to each of a plurality of classes, and stores them in likelihood storage means;
a certainty factor calculation step (step 2) in which the likelihoods output by the likelihood calculating means are acquired from the likelihood storage means and, for a class of interest, two of the likelihoods calculated by each likelihood calculating means are used, so that a total of 2n likelihoods from all n likelihood calculating means are used to calculate a certainty factor, which is the degree of certainty that the class of interest is the correct class for the input data, this processing being performed so as to calculate the certainty factors of all of the plurality of classes; and
a data class determination step (step 3) in which the class to which the input data belongs is determined on the basis of the certainty factor values.

Further, in the present invention (claim 2), in the certainty factor calculation step (step 2), the two likelihoods for each likelihood calculating means are the likelihood calculated by that likelihood calculating means for the class of interest and the largest of the likelihoods calculated by that likelihood calculating means for classes other than the class of interest.
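The selection of two likelihoods per likelihood calculating means, giving 2n values for a class of interest, can be sketched as follows. The function name `certainty_features`, the dictionary representation, and the toy values are assumptions made for illustration only.

```python
def certainty_features(likelihoods, target):
    """Return the 2n values used for the class of interest `target`.

    likelihoods: list of n dicts (one per likelihood calculating means),
                 each mapping class label -> likelihood.
    For every means, two values are taken: the likelihood of `target` and
    the largest likelihood among all other classes.
    """
    features = []
    for per_class in likelihoods:
        own = per_class[target]
        best_rival = max(v for c, v in per_class.items() if c != target)
        features.extend([own, best_rival])
    return features

# Two hypothetical likelihood calculating means over three classes.
example = [
    {"baseball": 0.6, "soccer": 0.3, "tennis": 0.1},
    {"baseball": 0.2, "soccer": 0.7, "tennis": 0.1},
]
baseball_features = certainty_features(example, "baseball")  # [0.6, 0.3, 0.2, 0.7]
```

Keeping only the strongest rival means the number of inputs stays at 2n regardless of how many classes there are.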

FIG. 2 is a diagram showing the basic configuration of the present invention.

The present invention (claim 3) is a data classification device for classifying predetermined input data into classes, comprising:
input means 11 for inputting the predetermined input data;
n (n is 2 or more) likelihood calculating means 13, which differ from one another in the classification technique used for the calculation or in their constituent elements or features, and each of which calculates likelihoods representing how plausible it is that the predetermined input data belongs to each of a plurality of classes and stores them in likelihood storage means 17;
certainty factor calculating means 14 that acquires from the likelihood storage means 17 the likelihoods output by the likelihood calculating means 13 and, for a class of interest, uses two of the likelihoods calculated by each likelihood calculating means 13, so that a total of 2n likelihoods from all n likelihood calculating means are used to calculate a certainty factor, which is the degree of certainty that the class of interest is the correct class for the input data, this processing being performed so as to calculate the certainty factors of all of the plurality of classes; and
data class determining means 15 for determining the class to which the input data belongs on the basis of the certainty factor values.

Further, in the present invention (claim 4), the two likelihoods used by the certainty factor calculating means 14 for each likelihood calculating means 13 are the likelihood calculated by that likelihood calculating means 13 for the class of interest and the largest of the likelihoods calculated by that likelihood calculating means for classes other than the class of interest.

The present invention (claim 5) is a data classification program for causing a computer to function as each of the means constituting the data classification device according to claim 3 or 4.

The present invention predicts, from all of the likelihoods output by all of the likelihood calculating means, a certainty factor indicating that each class is the correct class for the input data, and takes the class with the highest predicted certainty factor as the class of the input data. This yields more accurate classification results than linearly combining and integrating the likelihoods.

FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a diagram showing the basic configuration of the present invention.
FIG. 3 is a configuration diagram of a data classification device according to an embodiment of the present invention.
FIG. 4 is a flowchart of the processing of the data classification device according to an embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 3 is a configuration diagram of a data classification device according to an embodiment of the present invention.

The data classification device 10 shown in the figure comprises an input unit 11, a likelihood calculation control unit 12, a plurality of likelihood calculating means 13, a certainty factor calculation unit 14, a data class determination unit 15, a memory 16, a likelihood storage unit 17, a calculation parameter storage unit 18, and a certainty factor storage unit 19. A processing target storage unit 1 and a keyboard 2 are connected to the input unit 11, and a display 3 is connected to the output unit 15 (the data class determination unit 15).

The processing target storage unit 1 is a database that stores processing targets such as documents, and is read by the input unit 11.

The memory 16 stores the processing target input by the input unit 11.

The likelihood storage unit 17 stores the likelihoods calculated by each likelihood calculating means 13.

The calculation parameter storage unit 18 stores the certainty factor calculation parameters obtained in advance by the model parameter estimation procedure for a logistic regression model, described later.

The certainty factor storage unit 19 stores the certainty factors obtained by the certainty factor calculation unit 14.

The input unit 11 reads the processing target data from the processing target storage unit 1 and stores it in the memory 16. It also acquires the number of likelihood calculating means 13 and the number of destination classes entered from the keyboard 2, and passes them to the likelihood calculation control unit 12.

The likelihood calculation control unit 12 selects the likelihood calculating means 13 to which the features of the processing target data in the memory 16 are to be input, and causes it to calculate likelihoods.

Each likelihood calculating means 13 calculates likelihoods representing how plausible it is that the input data belongs to each of the plurality of classes, and stores them in the likelihood storage unit 17. The likelihood calculating means 13 differ from one another in classification technique or in constituent elements or features. Examples include a generative approach, which models the joint probability distribution of the data's feature vector and class label and estimates the class label by calculating class posterior probabilities according to Bayes' rule, and a discriminative approach, which models the class posterior probabilities directly.

The certainty factor calculation unit 14 uses the certainty factor calculation parameters stored in the calculation parameter storage unit 18 to calculate, from all of the likelihoods stored in the likelihood storage unit 17, the certainty factor of each class, that is, the degree of certainty that the class is the correct class for the input data, and stores it in the certainty factor storage unit 19. The certainty factor is calculated in an integrated manner from the likelihoods output by all of the likelihood calculating means. Roughly speaking, the certainty factor of a class becomes higher the more likelihood calculating means give that class a high likelihood and the lower the likelihoods of the competing classes are.

The data class determination unit 15 acquires the certainty factors from the certainty factor storage unit 19 and outputs the class with the highest certainty factor as the class to which the input data belongs.

FIG. 4 is a flowchart of the processing of the data classification device according to an embodiment of the present invention.

Step 101) The input unit 11 reads the input data to be processed from the processing target storage unit 1 into the memory 16.

Step 102) The input unit 11 acquires the number n of likelihood calculating means 13 entered from the keyboard 2 and passes it to the likelihood calculation control unit 12.

Step 103) The input unit 11 acquires the number K of destination classes entered from the keyboard 2 and passes it to the likelihood calculation control unit 12.

Step 104) The likelihood calculation control unit 12 initializes the index i of the likelihood calculating means 13 to 1 (i = 1).

Step 105) If i ≤ n, the likelihood calculation control unit 12 proceeds to step 106; otherwise it proceeds to step 109.

Step 106) The likelihood calculation control unit 12 inputs the features of the input data W stored in the memory 16 to the i-th likelihood calculating means LC_i, and LC_i calculates the likelihood P_i(W|C_k) that the input data belongs to class C_k (k = 1, ..., K). For each class, when the input data consists of a plurality of features, the product of their occurrence probabilities is taken as the likelihood that the input data W belongs to that class. An example of the input data W is a document, and examples of its features are words or vectors assigned to words.

For example, if an input document contains the words "home run" and "game", and the occurrence probability of "home run" in the class "baseball" has been set in advance to, say, 1/20 and that of "game" to, say, 1/5, then the product of the occurrence probabilities for the input data is

(1/20) × (1/5) = 1/100.

This product of probabilities is taken as the likelihood, which indicates how plausible it is that the input belongs to the class.
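The arithmetic of this example can be checked with a short sketch. The function name `document_likelihood` and the use of a dictionary of per-class word probabilities are illustrative assumptions; only the two probabilities 1/20 and 1/5 come from the text.

```python
from fractions import Fraction

def document_likelihood(words, word_probs):
    """Multiply the per-class occurrence probabilities of the given words."""
    likelihood = Fraction(1)
    for word in words:
        likelihood *= word_probs[word]
    return likelihood

# Occurrence probabilities for the class "baseball", as given in the example.
baseball_probs = {"home run": Fraction(1, 20), "game": Fraction(1, 5)}
result = document_likelihood(["home run", "game"], baseball_probs)  # Fraction(1, 100)
```

Exact fractions are used here only to make the result easy to compare with the text; a practical implementation would more likely sum log probabilities to avoid underflow on long documents.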

Alternatively, the naive Bayes method described in the following references may be used: Naonori Ueda, Kazumi Saito, "Probabilistic Models for Multi-Topic Text: The Forefront of Text Model Research (1)", Journal of the Information Processing Society of Japan, Vol. 45, No. 2, pp. 184-190, 2004; and Naonori Ueda, Kazumi Saito, "Probabilistic Models for Multi-Topic Text: The Forefront of Text Model Research (2)", Journal of the Information Processing Society of Japan, Vol. 45, No. 3, pp. 282-289, 2004.

Step 107) The likelihood calculating means LC_i stores the calculated likelihoods in the likelihood storage unit 17.

Step 108) The likelihood calculation control unit 12 sets the index i of the likelihood calculating means 13 to i + 1 and returns to step 105.

Step 109) When i > n in step 105, the certainty factor calculation unit 14 initializes the class number k to 1.

Step 110) The certainty factor calculation unit 14 retrieves the likelihoods from the likelihood storage unit 17 and calculates the certainty factor P_c^k that class k is the correct class, as follows.

The certainty factor calculation parameters given by Equation (1) are read from the calculation parameter storage unit 18, and the certainty factor P_c^k is calculated by Equation (2). [Equations (1) and (2) appear as images in the original publication and are not reproduced in this text.]

Here, the third term of Equation (2), max_{j≠k} P^i(W|C_j), is the largest likelihood among the classes other than k. As the equation shows, the certainty factor is calculated in an integrated manner from the likelihoods output by all of the likelihood calculating means 13. Roughly speaking, the certainty factor becomes higher the more likelihood calculating means give class k a high likelihood P^i(W|C_k) and give the classes j (≠ k) a low likelihood P^i(W|C_j), that is, the lower the likelihoods of the competing classes are. This information cannot be exploited by looking at the likelihood P^i(W|C_k) alone, and improving classification accuracy by making use of it is the core idea of the present invention.
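Since the equations themselves are not reproduced here, the following is only one plausible form that is consistent with the surrounding description: a logistic function whose argument has a constant term, a term in the class-k likelihoods, and a third term in the largest rival likelihoods. The parameter symbols and the choice to use the likelihoods directly, rather than, say, their logarithms, are assumptions and may differ from the published equations.

```latex
P_c^k = \frac{1}{1 + \exp(-z_k)}, \qquad
z_k = \beta_0^k
    + \sum_{i=1}^{n} \beta_{i,1}^k \, P^i(W \mid C_k)
    + \sum_{i=1}^{n} \beta_{i,2}^k \, \max_{j \ne k} P^i(W \mid C_j)
```

Under this reading, the parameters read from the calculation parameter storage unit 18 for class k would be the 2n + 1 values \beta_0^k, \beta_{i,1}^k, and \beta_{i,2}^k (i = 1, ..., n).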

The certainty factor calculation parameters stored in the calculation parameter storage unit 18 are calculated in advance. Equations (3) and (4) are logistic regression equations, and the parameters can be calculated by the usual model parameter estimation procedure for a logistic regression model, described below.

A logistic regression model is, in general, a model for explaining the probability y (the outcome variable) that some phenomenon occurs in terms of explanatory variables x = (x_1, ..., x_r) observed to explain its occurrence, and is expressed using model parameters β = (β_0, ..., β_r) by Equations (3) and (4). [Equations (3) and (4) appear as images in the original publication and are not reproduced in this text.]

The model parameters are estimated from training data. When the size of the training data is M, the m-th training datum (m = 1, ..., M) consists of:

Outcome variable: y_m, where y_m = 1 if the event occurred and y_m = 0 if it did not
Explanatory variables: r variables (x_m1, x_m2, ..., x_mr)

The training data are fitted to Equations (3) and (4), and suitable model parameters β = (β_0, ..., β_r) are obtained by, for example, the maximum likelihood method (which selects the most plausible values). Applying this approach to Equations (1) and (2), the outcome variable is set to y = 1 when class k is the correct class of the input data and y = 0 when it is not, the likelihood P^i(W|C_k) of belonging to class k and the largest likelihood among the other classes, max_{j≠k} P^i(W|C_j), are treated as the explanatory variables x (= x_1, ...), and the certainty factor calculation parameters of Equation (1), which correspond to the model parameters β, are obtained.
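As a rough illustration of this estimation procedure, the sketch below fits a logistic regression by plain gradient ascent on the log likelihood. The patent only names the maximum likelihood method, so the optimizer, the learning rate, the number of iterations, and the toy training data (one hypothetical likelihood calculating means, hence two explanatory variables) are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(beta, x):
    """Probability that y = 1 for explanatory variables x under parameters beta."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Estimate beta = (beta_0, ..., beta_r) by gradient ascent on the log likelihood.

    xs: list of M explanatory-variable vectors, each of length r.
    ys: list of M outcomes (1 = event occurred, 0 = it did not).
    lr and epochs are illustrative settings, not values from the source.
    """
    r = len(xs[0])
    beta = [0.0] * (r + 1)
    for _ in range(epochs):
        grad = [0.0] * (r + 1)
        for x, y in zip(xs, ys):
            err = y - predict(beta, x)  # gradient of the log likelihood w.r.t. z
            grad[0] += err
            for s, v in enumerate(x, start=1):
                grad[s] += err * v
        beta = [b + lr * g / len(xs) for b, g in zip(beta, grad)]
    return beta

# Each row: (likelihood of class k, largest rival likelihood); y = 1 if k was correct.
train_x = [(0.9, 0.1), (0.8, 0.2), (0.7, 0.2), (0.2, 0.8), (0.1, 0.7), (0.3, 0.6)]
train_y = [1, 1, 1, 0, 0, 0]
beta = fit_logistic(train_x, train_y)
```

After fitting, `predict(beta, (0.9, 0.1))` is above 0.5 and `predict(beta, (0.1, 0.9))` is below 0.5 on this toy data. In practice an off-the-shelf logistic regression routine, typically with regularization, would usually replace this hand-written loop.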

Alternatively, the model described in Toshiro Tango, Kazue Yamaoka, Haruyoshi Takagi, "Logistic Regression Analysis", pp. 3, Asakura Shoten, 1996, may be used, in which a composite variate Z of the explanatory variables, ranging over −∞ to ∞, is linked by a logistic function to an occurrence probability p(x) taking values in the range (0, 1). Also, although Equations (1) and (2) use separate calculation parameters for each class k, the parameters may instead be shared across all classes.

The certainty factors calculated as described above are stored in the certainty factor storage unit 19.

Step 112) The certainty factor calculation unit 14 sets the class number k to k + 1 and returns to step 110.

Step 113) When k > K in step 110, the data class determination unit 15 reads the certainty factors P_c^k (k = 1, ..., K) from the certainty factor storage unit 19, determines the class with the largest value to be the class of the data, and outputs it to the display 3.
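The loop over classes and the final selection can be sketched end to end as follows. The logistic form of `certainty`, the use of raw likelihoods as inputs, the parameter vector `beta` shared across classes, and all numeric values are assumptions made for illustration; the published Equations (1) and (2) may differ.

```python
import math

def certainty(likelihoods, target, beta):
    """Logistic certainty that `target` is the correct class.

    likelihoods: list of n dicts mapping class label -> likelihood.
    beta: [beta_0, own_1, rival_1, ..., own_n, rival_n] (hypothetical parameters).
    """
    z = beta[0]
    for i, per_class in enumerate(likelihoods):
        own = per_class[target]
        rival = max(v for c, v in per_class.items() if c != target)
        z += beta[1 + 2 * i] * own + beta[2 + 2 * i] * rival
    return 1.0 / (1.0 + math.exp(-z))

def classify(likelihoods, beta):
    """Compute the certainty of every class and return the most certain one."""
    certainties = {c: certainty(likelihoods, c, beta) for c in likelihoods[0]}
    return max(certainties, key=certainties.get), certainties

source_a = {"baseball": 0.5, "soccer": 0.3, "tennis": 0.2}
source_b = {"baseball": 0.3, "soccer": 0.4, "tennis": 0.3}
shared_beta = [0.0, 3.0, -3.0, 2.0, -2.0]
label, scores = classify([source_a, source_b], shared_beta)  # label is "baseball"
```

Negative weights on the rival likelihoods encode the idea that a strong competing class should lower the certainty of the class of interest.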

The operations of the components of the data classification device shown in FIG. 3 can be implemented as a program, which can be installed and executed on a computer used as the data classification device, or distributed over a network.

The program can also be stored on a hard disk or on a portable storage medium such as a flexible disk or CD-ROM, and installed on a computer or distributed.

The present invention is not limited to the embodiments and examples described above, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to the classification of text data such as web pages, blogs, and e-mail.

Description of symbols
1 Processing target storage unit
2 Keyboard
10 Data classification device
11 Input means, input unit
12 Likelihood calculation control unit
13 Likelihood calculating means
14 Certainty factor calculating means, certainty factor calculation unit
15 Data class determining means, data class determination unit
16 Memory
17 Likelihood storage means, likelihood storage unit
18 Calculation parameter storage unit
19 Certainty factor storage unit

Claims

A data classification method for classifying predetermined input data into classes, the method comprising:
a likelihood calculating step in which each of n (n is 2 or more) likelihood calculating means, which differ from one another in the classification technique used for the calculation or in their constituent elements or features, calculates likelihoods representing how plausible it is that the predetermined input data belongs to each of a plurality of classes, and stores them in likelihood storage means;
a certainty factor calculating step in which the likelihoods output by the likelihood calculating means are acquired from the likelihood storage means and, for a class of interest, two of the likelihoods calculated by each likelihood calculating means are used, so that a total of 2n likelihoods from all n likelihood calculating means are used to calculate a certainty factor, which is the degree of certainty that the class of interest is the correct class for the input data, this processing being performed so as to calculate the certainty factors of all of the plurality of classes; and
a data class determining step in which the class to which the input data belongs is determined on the basis of the certainty factor values.

The data classification method according to claim 1, wherein, in the certainty factor calculating step, the two likelihoods for each likelihood calculating means are the likelihood calculated by that likelihood calculating means for the class of interest and the largest of the likelihoods calculated by that likelihood calculating means for classes other than the class of interest.

A data classification device for classifying predetermined input data into classes, the device comprising:
input means for inputting the predetermined input data;
n (n is 2 or more) likelihood calculating means, which differ from one another in the classification technique used for the calculation or in their constituent elements or features, and each of which calculates likelihoods representing how plausible it is that the predetermined input data belongs to each of a plurality of classes and stores them in likelihood storage means;
certainty factor calculating means that acquires from the likelihood storage means the likelihoods output by the likelihood calculating means and, for a class of interest, uses two of the likelihoods calculated by each likelihood calculating means, so that a total of 2n likelihoods from all n likelihood calculating means are used to calculate a certainty factor, which is the degree of certainty that the class of interest is the correct class for the input data, this processing being performed so as to calculate the certainty factors of all of the plurality of classes; and
data class determining means for determining the class to which the input data belongs on the basis of the certainty factor values.

The data classification device according to claim 3, wherein, in the certainty factor calculating means, the two likelihoods for each likelihood calculating means are the likelihood calculated by that likelihood calculating means for the class of interest and the largest of the likelihoods calculated by that likelihood calculating means for classes other than the class of interest.

A data classification program for causing a computer to function as each of the means constituting the data classification device according to claim 3 or 4.