CN101980202A - Semi-supervised classification method of unbalanced data - Google Patents


Info

Publication number
CN101980202A
CN101980202A (application CN2010105309121A / CN201010530912A)
Authority
CN
China
Prior art keywords
sample
samples
sample set
data
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2010105309121A
Other languages
Chinese (zh)
Inventor
王爽
焦李成
冯吭雨
钟桦
侯彪
缑水平
马文萍
张青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN2010105309121A
Publication of CN101980202A
Legal status: Pending


Abstract

The invention discloses a semi-supervised classification method for imbalanced data, aimed mainly at the low classification precision that prior-art methods achieve on minority-class data when labeled samples are few and the degree of imbalance is high. The method is implemented by the following steps: (1) initialize a labeled sample set and an unlabeled sample set; (2) initialize the cluster centers; (3) perform fuzzy clustering; (4) update the labeled and unlabeled sample sets according to the clustering result; (5) perform self-training based on a support vector machine (SVM) classifier; (6) update the labeled and unlabeled sample sets according to the self-training result; (7) classify with the support vector machine with differing penalty parameters, Biased-SVM; (8) evaluate the classification result and output it. For imbalanced data with few labeled samples, the method improves the classification precision of the minority class, and it can be used to classify and recognize imbalanced data with few training samples.

Description

Semi-supervised classification method for imbalanced data
Technical field
The invention belongs to the field of data processing and relates to the classification of imbalanced data. It is an application of pattern recognition and machine learning in data mining: specifically, an imbalanced-data classification method based on fuzzy clustering and semi-supervised learning, which can be used for the classification and recognition of imbalanced data with few training samples.
Background art
With the rapid global development of information technology, powerful computers, data-collection facilities and storage devices provide people with large amounts of data for transaction management, information retrieval and data analysis. Although the volume of data obtained is very large, the data useful to people often account for only a small fraction of the total. A data set in which the number of samples of one class is clearly smaller than that of the other classes is called an imbalanced data set, and classification problems over imbalanced data sets abound in real life. For example, in detecting whether a citizen's credit application is fraudulent, fraudulent applications are generally far fewer than legitimate ones; in diagnosing a patient's disease from clinical data, heart-disease patients are far fewer than healthy people. In these practical applications, what people care more about is the minority class of the data set, i.e., the class whose number of samples is far smaller than that of the other classes, and the cost of misclassifying these minority-class samples is often very large; the classification precision of the minority class therefore needs to be improved effectively.
At the same time, with the development of data-acquisition technology, obtaining large numbers of unlabeled samples has become very easy, while obtaining labeled samples remains relatively difficult because it requires substantial manpower and material resources. It is therefore necessary to study how to make effective use of the large number of available unlabeled samples to assist a small number of labeled samples in improving the learning performance of a classifier. Introducing the idea of semi-supervised learning makes it possible to use labeled and unlabeled samples simultaneously to train on the data set and make predictions. The transductive support vector machine (TSVM), based on the SVM classifier, is a representative semi-supervised classification method; it requires the ratios of the class sample counts among the unlabeled samples to be set in advance, which is usually estimated from the data distribution of the labeled sample set. In practice, if the data distribution of the unlabeled samples deviates strongly from that of the labeled samples, the classification and prediction results of the TSVM method on the data set are seriously affected.
In recent years, the classification of imbalanced data sets has received more and more attention in data mining and machine learning. Research on imbalanced data, at home and abroad, falls mainly into two strands. One is based on data sampling, whose main purpose is to reduce the degree of imbalance by preprocessing the data, such as the synthetic minority oversampling technique SMOTE, which increases the number of minority-class samples by synthesizing new ones. The other is based on the classification algorithm: the support vector machine with differing penalty parameters, Biased-SVM, proposed by Veropoulos et al., assigns a different penalty parameter to each class and thereby offsets, to some extent, the influence of the data imbalance on the SVM classifier.
The difficulty of learning from imbalanced data sets stems mainly from the characteristics of the data themselves: the minority-class samples are insufficient, so their distribution cannot well reflect the actual distribution of the whole class, and the majority class is usually mixed with noisy data, so the two classes tend to overlap to varying degrees. Moreover, when classification methods from traditional machine learning are applied directly to an imbalanced data set without considering the imbalance, minority-class samples are easily misclassified into the majority class; the overall classification precision is then high, but the precision on the minority class is very low. Conversely, if the imbalance is over-compensated, overfitting easily occurs: the classification precision on the training set is very high, but when the data set is updated or changed the classification effect is unsatisfactory.
Summary of the invention
The purpose of the invention is to overcome the above shortcomings of the prior art and, for imbalanced data with few labeled samples, to propose an imbalanced-data classification method based on fuzzy clustering and semi-supervised learning that takes the data imbalance into account while introducing the idea of semi-supervised learning, avoids overfitting, and improves the classifier's precision on the minority class of the data set.
The technical idea for achieving the purpose of the invention is as follows: by performing fuzzy clustering, combined with a self-training process based on an SVM classifier, unlabeled samples are continually labeled and exploited, expanding the minority class in the labeled sample set; while balancing the numbers of samples of the classes, this provides the classifier with more effective information on the sample distribution and thereby improves its classification performance on imbalanced data. The technical scheme comprises the following steps:
(1) Read an imbalanced data set containing two classes, denote the two classes as the minority class and the majority class according to their numbers of samples, randomly select a part of the two-class imbalanced samples as the initial labeled sample set {x_i}, and take the remaining samples as the initial unlabeled sample set {x_j};
(2) initialize the cluster centers of the imbalanced data set:
(2a) take the mean of the minority-class samples and of the majority-class samples in the current labeled sample set {x_i} respectively, obtaining the mean-center set M = {m_+, m_-}, where m_+ is the mean center of the minority-class samples and m_- is the mean center of the majority-class samples;
(2b) apply the mean shift algorithm to each center in the mean-center set M, finding the initial cluster centers M* = {m*_+, m*_-}, where m*_+ is the initial cluster center of the minority-class samples and m*_- is the initial cluster center of the majority-class samples;
(3) based on the initial cluster centers M*, apply fuzzy C-means clustering to the current labeled and unlabeled samples, obtaining the cluster centers V = {v_+, v_-}, where v_+ is the cluster center of the minority-class samples and v_- is the cluster center of the majority-class samples, and denote the set of memberships of all current unlabeled samples to each cluster center as U = {u_cj | j ∈ (1, 2, …, u), c ∈ (+, −)}, where u_cj is the membership of the j-th unlabeled sample to the cluster center labeled c, and u is the number of samples in the current unlabeled sample set;
(4) after the above fuzzy clustering step, according to the membership set U, select from the current unlabeled sample set {x_j} the H samples whose cluster label is positive and whose corresponding membership is largest and label them, with H = p × N_+, so that the current labeled and unlabeled sample sets are updated to {x_i^(1)} and {x_j^(1)} respectively, where N_+ is the number of minority-class samples in the current labeled sample set and p is the proportion of samples selected and labeled from the unlabeled set;
(5) perform self-training based on an SVM classifier on the cluster-updated data sets {x_i^(1)} and {x_j^(1)};
(6) after the above self-training step, select from the cluster-updated unlabeled sample set {x_j^(1)} the H* samples with the largest discriminant scores and label them, with H* = p × N_+^(1), so that the current labeled and unlabeled sample sets are updated once more, to {x_i^(2)} and {x_j^(2)} respectively, where N_+^(1) is the number of minority-class samples in the cluster-updated labeled sample set {x_i^(1)} and p is the proportion of samples selected and labeled from the unlabeled set;
(7) classify the self-training-updated data sets {x_i^(2)} and {x_j^(2)} with the support vector machine with differing penalty parameters, Biased-SVM;
(8) evaluate the above Biased-SVM classification result on the imbalanced data using the geometric mean Gm;
(9) take whether the obtained geometric mean has reached its optimum as the termination condition: if it is satisfied, stop iterating and return to step (8) to output the classification result; otherwise return to step (2) until the termination condition is satisfied.
Compared with the prior art, the present invention has the following advantages:
(1) Because the invention introduces an unsupervised fuzzy clustering algorithm to mine the data-distribution information implicit in the unlabeled samples, the labels of training samples need not be predetermined manually, avoiding tedious and time-consuming labeling work in practice; at the same time, because the invention uses labeled samples to guide the clustering process and does not depend on the initial distribution of the labeled samples, it is not affected by updates and changes of the data set, which improves the generalization ability of the classifier on imbalanced data.
(2) The invention takes into account the frequently encountered practical situation in which labeled samples are few or hard to obtain while the degree of imbalance of the data is very high. By performing fuzzy clustering combined with SVM-based self-training, unlabeled samples are continually labeled and exploited, expanding the minority class in the labeled sample set; while balancing the numbers of samples of the classes, this provides the classifier with more effective sample-distribution information, avoids overfitting, and improves the classification performance of the classifier on imbalanced data.
Description of drawings
Fig. 1 is the flow chart of the present invention;
Fig. 2 is a schematic diagram of initializing the cluster centers with the mean shift algorithm in the present invention;
Fig. 3 is an analysis chart of the influence of the setting of parameter p on classifier performance in the present invention;
Fig. 4 is a comparison chart of the geometric mean Gm obtained by the present invention and the prior art on the imbalanced data sets.
Detailed description of the embodiments
With reference to Fig. 1, the specific implementation steps of the present invention are as follows:
Step 1: select the initial labeled sample set and the initial unlabeled sample set.
Given an imbalanced data set whose samples are divided into two classes according to their features and attributes, denote the two classes as the minority class and the majority class according to their numbers of samples, randomly select a part of the two-class imbalanced data as the initial labeled sample set {x_i}, and take the remaining samples as the initial unlabeled sample set {x_j}.
Step 2: initialize the cluster centers of the imbalanced data set.
(2a) Take the mean of the minority-class samples and of the majority-class samples in the current labeled sample set {x_i} respectively, obtaining the mean-center set M = {m_+, m_-}, where m_+ is the mean center of the minority-class samples and m_- is the mean center of the majority-class samples;
(2b) Using the labeled and unlabeled samples {x_k | k = 1, …, n}, apply the mean shift algorithm to each center point in M = {m_+, m_-}, finding the initial cluster centers M* = {m*_+, m*_-}, where m*_+ is the initial cluster center of the minority-class samples and m*_- is the initial cluster center of the majority-class samples.
When applying the mean shift algorithm to each center point in M = {m_+, m_-}, the mean shift vector is first defined by

M_h(x) = [ Σ_{k=1}^n G((x_k − x)/h) · x_k ] / [ Σ_{k=1}^n G((x_k − x)/h) ] − x,    (1)

where x is the corresponding center point, G(·) is a Gaussian kernel, the kernel bandwidth h is taken as the standard deviation of the data set, and n is the number of samples. The first term on the right-hand side of (1) is denoted m_h(x). Given an allowable error ε, the following three steps are carried out until the termination condition is satisfied:
(a) compute m_h(x);
(b) assign m_h(x) to x;
(c) if ||m_h(x) − x|| < ε, end the loop; otherwise return to (a).
In the above mean shift algorithm, since m_h(x) = x + M_h(x) and M_h(x) points in the direction of the probability-density gradient, i.e., the direction in which the probability density increases fastest, carrying out the above steps makes the center point under search move continually along the gradient direction of the probability density, finally finding the center point of the densest region of the sample distribution.
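The mean shift iteration of steps (a) to (c) can be sketched as follows. This is a minimal illustration under the stated Gaussian-kernel assumption; the function name `mean_shift` and the default tolerance are choices of this sketch, not part of the patent.

```python
import numpy as np

def mean_shift(x0, samples, h, eps=1e-3, max_iter=100):
    """Move a start point x0 toward the densest region of `samples`
    via the mean shift update m_h(x), the first term on the right of (1)."""
    x = np.asarray(x0, dtype=float)
    samples = np.asarray(samples, dtype=float)
    for _ in range(max_iter):
        # Gaussian kernel weights G((x_k - x)/h)
        d2 = np.sum((samples - x) ** 2, axis=1)
        w = np.exp(-0.5 * d2 / h ** 2)
        m = (w[:, None] * samples).sum(axis=0) / w.sum()  # m_h(x)
        if np.linalg.norm(m - x) < eps:                   # ||m_h(x) - x|| < eps
            return m
        x = m                                             # assign m_h(x) to x
    return x
```

Applied to each mean center m_+ and m_-, with h set to the standard deviation of the data set, such a routine would yield the initial cluster centers M*.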
Fig. 2 shows the effectiveness of initializing the cluster centers with the mean shift algorithm. Two classes are taken arbitrarily from the classic four-class square data set, with a ratio of 1:5 between the class sample counts; 6% of the samples of each class are randomly selected as labeled samples and the rest serve as unlabeled samples. The data distribution is shown in Fig. 2(a), where "+" and "*" represent the labeled samples of the two classes. In Fig. 2(b), the diamonds "◇" represent the center points of the mean-center set M = {m_+, m_-}, and the stars "☆" represent the initial cluster centers M* = {m*_+, m*_-} obtained by the mean shift algorithm. As can be seen from Fig. 2, the initial cluster center points obtained with the mean shift algorithm used by the invention are closer to the distribution centers of the classes of the data set.
Step 3: based on the initial cluster centers M* obtained in step 2, apply fuzzy C-means clustering to the current labeled and unlabeled samples, obtaining the cluster centers V = {v_+, v_-}, where v_+ is the cluster center of the minority-class samples and v_- is the cluster center of the majority-class samples, and denote the set of memberships of all current unlabeled samples to each cluster center as U = {u_cj | j ∈ (1, 2, …, u), c ∈ (+, −)}, where u_cj is the membership of the j-th unlabeled sample to the cluster center labeled c, and u is the number of samples in the current unlabeled sample set.
The steps of the fuzzy C-means algorithm are as follows:
(a) given the initial cluster centers;
(b) repeat the following computations until the membership values of the labeled and unlabeled samples stabilize:
(b1) compute the memberships:

u_ck = (1/||x_k − v_c||²)^{1/(b−1)} / Σ_c (1/||x_k − v_c||²)^{1/(b−1)},  k = 1, …, n,  c ∈ (+, −)    (2)

(b2) using the memberships computed in (b1), compute the cluster centers:

v_c = Σ_{k=1}^n [u_ck]^b · x_k / Σ_{k=1}^n [u_ck]^b,  c ∈ (+, −)    (3)

where v_c is the corresponding cluster center point, u_ck is the membership of the k-th sample to the cluster center labeled c, {x_k} is the set of labeled and unlabeled samples, n is the number of samples, and b is the fuzziness coefficient.
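The alternation of equations (2) and (3) can be sketched as follows; a minimal illustration in which the two clusters (+, −) are indexed 0 and 1, and where the function name and default values are choices of this sketch rather than of the patent.

```python
import numpy as np

def fuzzy_c_means(X, centers, b=2.0, tol=1e-4, max_iter=100):
    """Alternate the membership update (2) and the center update (3)
    until the memberships stabilise; returns centers V and memberships U."""
    X = np.asarray(X, dtype=float)
    V = np.asarray(centers, dtype=float)
    U_prev = None
    for _ in range(max_iter):
        # (2): u_ck proportional to (1/||x_k - v_c||^2)^(1/(b-1))
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + 1e-12
        inv = (1.0 / d2) ** (1.0 / (b - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)   # n x c membership matrix
        # (3): v_c = sum_k u_ck^b x_k / sum_k u_ck^b
        Ub = U ** b
        V = (Ub.T @ X) / Ub.sum(axis=0)[:, None]
        if U_prev is not None and np.abs(U - U_prev).max() < tol:
            break
        U_prev = U
    return V, U
```

Seeding `centers` with the mean-shift result M* corresponds to step (a) of the algorithm above.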
Step 4: after the above fuzzy clustering step, according to the obtained membership set U, select from the current unlabeled sample set {x_j} the H samples whose cluster label is positive and whose corresponding membership is largest and label them, with H = p × N_+, so that the current labeled and unlabeled sample sets are updated to {x_i^(1)} and {x_j^(1)} respectively, where N_+ is the number of minority-class samples in the current labeled sample set and p is the proportion of samples selected and labeled from the unlabeled set.
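The selection rule of step 4 amounts to sorting the minority-class memberships of the unlabeled samples and taking the top H = p × N_+. The helper below, `pick_by_membership`, is a hypothetical illustration of that rule, not code from the patent.

```python
import numpy as np

def pick_by_membership(u_plus, n_minority_labeled, p=0.1):
    """Return the indices of the H = p * N_+ unlabeled samples whose
    membership u_{+j} to the minority cluster is largest (step 4)."""
    H = max(1, int(round(p * n_minority_labeled)))
    order = np.argsort(u_plus)[::-1]   # indices by descending membership
    return order[:H]
```

The selected samples would then be moved from the unlabeled set {x_j} to the labeled set with a minority-class label.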
Step 5: perform self-training based on an SVM classifier on the cluster-updated data sets {x_i^(1)} and {x_j^(1)}.
(5a) Train an SVM classifier with the cluster-updated labeled sample set {x_i^(1)}. The SVM classifier maps the data to a high-dimensional space by a nonlinear transformation and seeks the optimal linear separating hyperplane in that space, separating the two classes of samples to the greatest extent. The objective of training the SVM classifier can be expressed as

min ( (1/2)||w||² + C Σ_i ξ_i )    (4)
s.t.  y_i (w · x_i^(1) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,

where C is the penalty parameter, w is the weight vector of the optimal separating hyperplane obtained by training the SVM classifier, b is its bias, ξ_i are the slack variables, and x_i^(1) are the labeled samples used for training;
(5b) Using the discriminant function of the SVM classifier, f(x_j^(1)) = w · x_j^(1) + b, obtain the test label of each sample in the cluster-updated unlabeled sample set {x_j^(1)} as label(x_j^(1)) = sgn(w · x_j^(1) + b), where sgn(·) is the sign function and x_j^(1) are the unlabeled samples used for testing.
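Steps (5a) and (5b) can be sketched as follows. The patent's experiments use the SVMlight toolbox; this stand-in instead minimises the linear form of objective (4) by simple subgradient descent and then labels the unlabeled samples with sgn(w·x + b). Function names and optimiser settings are choices of this sketch.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimise (1/2)||w||^2 + C * sum_i max(0, 1 - y_i(w.x_i + b))
    by subgradient descent: the soft-margin objective (4), linear kernel."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    bias = 0.0
    for _ in range(epochs):
        viol = y * (X @ w + bias) < 1.0          # margin violators
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        bias -= lr * grad_b
    return w, bias

def self_train_labels(w, bias, X_unlabeled):
    """Step (5b): test labels sgn(w.x_j + b) plus the raw discriminant
    scores, from which step 6 picks the H* highest-scoring samples."""
    scores = np.asarray(X_unlabeled, dtype=float) @ w + bias
    return np.sign(scores), scores
```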
Step 6: after the above self-training step, select from the cluster-updated unlabeled sample set {x_j^(1)} the H* samples with the largest discriminant scores and label them, with H* = p × N_+^(1), so that the current labeled and unlabeled sample sets are updated once more, to {x_i^(2)} and {x_j^(2)} respectively, where N_+^(1) is the number of minority-class samples in the cluster-updated labeled sample set {x_i^(1)}.
Step 7: classify the self-training-updated data sets {x_i^(2)} and {x_j^(2)} with the support vector machine with differing penalty parameters, Biased-SVM.
(7a) Train the Biased-SVM with differing penalty parameters using the self-training-updated labeled sample set {x_i^(2)}. Aimed at the classification of imbalanced data sets, this training assigns a different penalty parameter to each class, changing the objective of training the SVM classifier described in formula (4) to

min ( (1/2)||w||² + C_+ Σ_{i | y_i = +1} ξ_i + C_− Σ_{i | y_i = −1} ξ_i )    (5)

where ξ_i are the slack variables, y_i is the label of the labeled training sample x_i, C_+ is the penalty parameter assigned to the minority class, and C_− is the penalty parameter assigned to the majority class;
(7b) Using the Biased-SVM with the penalty parameters C_+ and C_− assigned to the two classes, its discriminant function f(x_j) = w · x_j + b gives the test label of each sample in the initial unlabeled sample set {x_j} as label(x_j) = sgn(w · x_j + b), where x_j are the unlabeled samples used for testing.
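Objective (5) differs from (4) only in that each sample's slack is weighted by its class penalty, C_+ for the minority class and C_− for the majority class. A rough subgradient sketch under the same linear-kernel assumption (the function name and optimiser settings are illustrative, not the solver used in the patent's experiments):

```python
import numpy as np

def train_biased_svm(X, y, C_plus, C_minus, lr=0.01, epochs=200):
    """Minimise (1/2)||w||^2 + C_+ sum_{y_i=+1} xi_i + C_- sum_{y_i=-1} xi_i
    by subgradient descent, i.e. objective (5) with a linear kernel."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    C = np.where(y > 0, C_plus, C_minus)   # per-sample class penalty
    w = np.zeros(X.shape[1])
    bias = 0.0
    for _ in range(epochs):
        viol = y * (X @ w + bias) < 1.0    # margin violators
        grad_w = w - ((C * y)[viol, None] * X[viol]).sum(axis=0)
        grad_b = -(C * y)[viol].sum()
        w -= lr * grad_w
        bias -= lr * grad_b
    return w, bias
```

In the experiments described below, the ratio C_+/C_− = N_−/N_+ is used, so the smaller the minority class, the heavier its penalty.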
Step 8: evaluate the above Biased-SVM classification result on the imbalanced data using the geometric mean Gm.
(8a) Compute the classification precision of the minority class, Se = TP / (TP + FN), and that of the majority class, Sp = TN / (TN + FP), where, with respect to the prediction on the data, TP is the number of samples predicted as the minority class that actually belong to the minority class, FP is the number predicted as the minority class that actually belong to the majority class, FN is the number predicted as the majority class that actually belong to the minority class, and TN is the number predicted as the majority class that actually belong to the majority class;
(8b) from the Se and Sp values computed above, compute the geometric mean Gm = √(Se × Sp).
Step 9: taking whether the obtained geometric mean has reached its optimum as the termination condition, stop iterating if it is satisfied and return to step 8 to output the classification result; otherwise return to step 2 until the termination condition is satisfied.
The effect of the present invention can be further illustrated by the following simulation experiments:
1. Experimental conditions and parameter settings
In the MATLAB simulation environment, based on the SVMlight support vector machine toolbox, the method of the invention and the prior art were compared in classification experiments on the imbalanced data sets; the parameters used by each method are set as follows:
a) t = 30% of the samples of each class of the imbalanced data set are randomly selected as the initial labeled sample set, and the rest serve as the initial unlabeled sample set;
b) in the support vector machine SVM method, a linear kernel is adopted, with penalty parameter C = 100;
c) in the Biased-SVM with differing penalty parameters, the penalty parameter of the majority class is taken as C_− = 100, and the penalty parameter C_+ of the minority class is computed from C_+ / C_− = N_− / N_+, where N_+ and N_− are the numbers of minority-class and majority-class samples in the current labeled sample set;
d) in the synthetic minority oversampling technique SMOTE, the parameter N = 100 is taken, i.e., the resulting number of minority-class samples is twice the original;
e) in the method of the invention, the allowable error in the mean shift algorithm is ε = 0.1, and the proportion of samples selected and labeled from the unlabeled set in each iteration is p = 0.1.
2. Experimental content and analysis of results
To verify the advantage of the method of the invention over the prior art on the imbalanced-data classification problem, a group of biological data sets with a high degree of imbalance is used in the experiments to compare the classification of each method; the biological data sets are described in Table 1.
Table 1: description of the imbalanced data sets
(The contents of Table 1 are provided as an image in the original publication.)
The degree of imbalance in Table 1 refers to the ratio of the number of minority-class samples to the number of majority-class samples in the imbalanced data set. The comparison methods used in the experiments include: the method of the invention and, from the prior art, the support vector machine SVM method, the Biased-SVM method with differing penalty parameters, the synthetic minority oversampling technique SMOTE method, and the transductive support vector machine TSVM method.
a) The experiments carried out on each method with the imbalanced data of Table 1 are as follows:
a1) Analysis of the influence of the setting of parameter p on classification performance in the present invention.
The method of the invention is used to classify the imbalanced data set Data1.2 with the parameter p taking the values {0.01, 0.03, 0.1, 0.3, 0.5} in turn; the results are shown in Fig. 3, where each curve shows, for one value of p, how the classification performance of the method changes with the number of iterations. As can be seen from Fig. 3, the larger the value of p, i.e., the larger the proportion of unlabeled samples selected and labeled in each iteration, the fewer iterations the method needs to reach its optimal classification performance; although the time complexity is thereby reduced, the probability of mislabeling unlabeled samples in each iteration becomes larger, so the classification performance of the method also decreases somewhat. This shows that the choice of p is a trade-off between classification performance and time complexity; following a large number of experimental results, the empirical value p = 0.1 is used uniformly in the experiments.
a2) Classification comparison of the method of the invention and the prior art on the imbalanced data sets.
The classification precision Se of the minority class and Sp of the majority class for the imbalanced data sets under the various classification methods are shown in Table 2. To better assess the overall classification performance of the methods, Table 3 gives the geometric mean Gm of the imbalanced data sets under the various methods, where the last row of the table is the average classification result of each method on the imbalanced data. To show the advantage of the method of the invention on imbalanced-data classification more clearly, the experimental results of Table 3 are drawn as a histogram in Fig. 4.
Table 2: comparison of Se and Sp values in the experiments
(The contents of Table 2 are provided as an image in the original publication.)
Table 3: comparison of Gm values in the experiments
(The contents of Table 3 are provided as an image in the original publication.)
b) Analysis of the results.
As can be seen from Table 2, the classification precision Se of the minority class is low for the prior-art methods while their classification precision Sp of the majority class is comparatively very high; this is because, when handling the imbalanced classification problem, the prior art misclassifies nearly all unlabeled samples into the majority class.
The current key issue in imbalanced-data classification research is how to improve the classification precision of the minority class as much as possible while maintaining that of the majority class, thereby improving the classification precision on the imbalanced data.
As can be seen from Table 3 and Fig. 4, the method of the invention obtains a higher geometric mean Gm than the prior art, and thus a better classification precision on the imbalanced data.
In summary, for the classification of imbalanced data with few labeled samples, the invention proposes an imbalanced-data classification method based on fuzzy clustering and semi-supervised learning; by comparing the method of the invention with the prior art in classification experiments on a group of biological data, it is verified that the method obtains a better classification precision on imbalanced data than the prior art.

Claims (4)

1. A semi-supervised classification method for unbalanced data, comprising the steps of:
(1) reading an unbalanced data set containing two classes; according to their respective sample counts, denoting the two classes as the minority class and the majority class; randomly selecting a portion of the samples of both classes as the initial labeled sample set {x_i}, and taking the remaining data samples as the initial unlabeled sample set {x_j};
(2) initializing the cluster centers of the unbalanced data set:
(2a) averaging the minority-class samples and the majority-class samples in the current labeled sample set {x_i} separately, obtaining the mean-center set M = {m+, m-}, where m+ is the mean center of the minority-class samples and m- is the mean center of the majority-class samples;
(2b) applying the mean-shift algorithm to each center in M to find the initial cluster centers M* = {m+*, m-*}, where m+* is the initial cluster center of the minority class and m-* is the initial cluster center of the majority class;
(3) starting from the initial cluster centers M*, performing fuzzy C-means clustering on the current labeled and unlabeled samples to obtain the cluster centers M' = {m+', m-'}, where m+' is the cluster center of the minority class and m-' is the cluster center of the majority class, and recording the memberships of all current unlabeled samples to each cluster center as the set U = {u_cj | j ∈ {1, 2, ..., u}, c ∈ {+, -}}, where u_cj is the membership of the j-th unlabeled sample to the cluster center labeled c, and u is the number of samples in the current unlabeled sample set;
(4) after the above fuzzy clustering step, according to the membership set U, selecting from the current unlabeled sample set {x_j} the H samples with the largest membership to the positively labeled cluster, where H = p × N+, and labeling them, thereby updating the current labeled and unlabeled sample sets to {x_i}^(1) and {x_j}^(1) respectively, where N+ is the number of minority-class samples in the current labeled sample set and p is the proportion of samples selected and labeled from the unlabeled set;
(5) performing SVM-classifier-based self-training on the cluster-updated data sets {x_i}^(1) and {x_j}^(1);
(6) after the above self-training step, selecting from the cluster-updated unlabeled sample set {x_j}^(1) the H* samples with the largest discriminant scores, where H* = p × N+^(1), and labeling them, thereby updating the current labeled and unlabeled sample sets once more to {x_i}^(2) and {x_j}^(2) respectively, where N+^(1) is the number of minority-class samples in the cluster-updated labeled sample set {x_i}^(1) and p is the proportion of samples selected and labeled from the unlabeled set;
(7) classifying the self-training-updated data sets {x_i}^(2) and {x_j}^(2) with the support vector machine Biased-SVM, which uses different penalty parameters for the two classes;
(8) evaluating the above Biased-SVM classification result of the unbalanced data using the geometric mean Gm;
(9) checking whether the obtained geometric mean has reached its optimum, this being the termination condition: if so, stopping the iteration and outputting the classification result of step (8); otherwise returning to step (2), until the termination condition is satisfied.
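Steps (3)-(4) can be sketched in a few lines of numpy. The following is a minimal illustration, not the patented implementation: it computes standard fuzzy C-means memberships (with the common fuzzifier m = 2, an assumption; the claim does not state a value) for the unlabeled samples against two fixed cluster centers, then selects the H = p × N+ samples with the largest membership to the minority-class cluster. The function names and the convention that index 0 is the minority center are hypothetical.

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0):
    """Fuzzy C-means membership of each sample in X to each center:
    u_cj = 1 / sum_k (d(x_j, v_c) / d(x_j, v_k))^(2/(m-1))."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)                        # avoid division by zero
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)  # each row sums to 1

def select_minority_candidates(X_unlabeled, centers, n_pos_labeled, p=0.1):
    """Pick the H = p * N+ unlabeled samples with the largest membership
    to the minority-class cluster (assumed here to be centers[0])."""
    U = fcm_memberships(X_unlabeled, centers)
    H = max(1, int(p * n_pos_labeled))
    return np.argsort(-U[:, 0])[:H], U
```

The returned indices would then be labeled as minority class and moved from the unlabeled to the labeled set, as in step (4).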
2. The semi-supervised classification method for unbalanced data according to claim 1, wherein the SVM-classifier-based self-training on the cluster-updated data sets {x_i}^(1) and {x_j}^(1) described in step (5) is carried out as follows:
(5a) training the SVM classifier with the cluster-updated labeled sample set {x_i}^(1);
(5b) using the discriminant function f(x_j^(1)) = w·x_j^(1) + b of the SVM classifier to obtain the test label label(x_j^(1)) = sgn(w·x_j^(1) + b) of each sample in the cluster-updated unlabeled sample set {x_j}^(1), where w is the weight vector of the optimal separating hyperplane obtained by training the SVM classifier, b is its bias, sgn(·) is the sign function, and x_j^(1) is an unlabeled sample used for testing.
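The decision rule in (5b), together with the confidence-based selection of step (6), can be sketched as follows; `w` and `b` would come from the trained SVM, and the helper names are hypothetical:

```python
import numpy as np

def svm_label(X, w, b):
    """label(x) = sgn(w.x + b); a score of exactly 0 is mapped to +1."""
    return np.where(X @ w + b >= 0.0, 1, -1)

def self_training_select(X_unlabeled, w, b, H):
    """Step (6): take the H unlabeled samples with the largest
    discriminant score f(x) = w.x + b and return their indices together
    with the labels the classifier assigns them."""
    scores = X_unlabeled @ w + b
    idx = np.argsort(-scores)[:H]
    return idx, svm_label(X_unlabeled[idx], w, b)
```

Selecting by largest score biases the newly labeled pool toward confident positive (minority-class) predictions, which is consistent with the method's goal of enlarging the minority class in the labeled set.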
3. The semi-supervised classification method for unbalanced data according to claim 1, wherein the classification of the self-training-updated data sets {x_i}^(2) and {x_j}^(2) by the support vector machine Biased-SVM with different penalty parameters described in step (7) is carried out as follows:
(7a) training the Biased-SVM, a support vector machine with different penalty parameters for the two classes, on the self-training-updated labeled sample set {x_i}^(2);
(7b) using the discriminant function f(x_j) = w·x_j + b of the Biased-SVM to obtain the test label label(x_j) = sgn(w·x_j + b) of each sample in the initial unlabeled sample set {x_j}, where w is the weight vector of the optimal separating hyperplane obtained by training the Biased-SVM with different penalty parameters, b is its bias, sgn(·) is the sign function, and x_j is an unlabeled sample used for testing.
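Biased-SVM replaces the single penalty C with class-specific penalties C+ and C-, so that errors on the minority class cost more than errors on the majority class. Below is a minimal numpy sketch of a linear Biased-SVM trained by subgradient descent on the weighted hinge loss; the learning rate, epoch count, and penalty values are illustrative assumptions, not the patent's settings:

```python
import numpy as np

def biased_svm_train(X, y, C_pos=10.0, C_neg=1.0, lr=0.01, epochs=300):
    """Minimize 1/2 ||w||^2 + C+ * sum_{y=+1} hinge + C- * sum_{y=-1} hinge
    by plain subgradient descent (linear kernel, labels in {+1, -1})."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    C = np.where(y > 0, C_pos, C_neg)        # per-sample penalty
    for _ in range(epochs):
        margins = y * (X @ w + b)
        v = margins < 1.0                    # margin violators
        w -= lr * (w - (C[v, None] * y[v, None] * X[v]).sum(axis=0))
        b -= lr * (-(C[v] * y[v]).sum())
    return w, b
```

With C+ > C-, a minority-class margin violation contributes more to the subgradient, pushing the hyperplane away from the minority class and counteracting the usual bias of SVMs toward the majority class.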
4. The semi-supervised classification method for unbalanced data according to claim 1, wherein the evaluation of the Biased-SVM classification result of the unbalanced data using the geometric mean Gm described in step (8) is carried out as follows:
(8a) computing the classification accuracy of the minority class, Se = TP / (TP + FN), and the classification accuracy of the majority class, Sp = TN / (TN + FP), where, with respect to the prediction results on the data, TP is the number of samples predicted as minority class that are actually minority class, FP is the number of samples predicted as minority class that are actually majority class, FN is the number of samples predicted as majority class that are actually minority class, and TN is the number of samples predicted as majority class that are actually majority class;
(8b) from the Se and Sp values computed above, computing the geometric mean Gm = √(Se × Sp).
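The evaluation measure of step (8) follows directly from the confusion-matrix counts; this small sketch mirrors the standard definitions used in the claim:

```python
import math

def g_mean(tp, fn, tn, fp):
    """Gm = sqrt(Se * Sp), with Se = TP/(TP+FN) (minority-class accuracy)
    and Sp = TN/(TN+FP) (majority-class accuracy)."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    return math.sqrt(se * sp)
```

Because Gm is the geometric mean of the two per-class accuracies, it stays low whenever either class is classified poorly, which is why it is preferred over overall accuracy as a stopping and evaluation criterion for unbalanced data.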
CN2010105309121A 2010-11-04 2010-11-04 Semi-supervised classification method of unbalance data Pending CN101980202A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010105309121A CN101980202A (en) 2010-11-04 2010-11-04 Semi-supervised classification method of unbalance data

Publications (1)

Publication Number Publication Date
CN101980202A true CN101980202A (en) 2011-02-23

Family

ID=43600704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010105309121A Pending CN101980202A (en) 2010-11-04 2010-11-04 Semi-supervised classification method of unbalance data

Country Status (1)

Country Link
CN (1) CN101980202A (en)

Cited By (52)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102208037A (en) * 2011-06-10 2011-10-05 西安电子科技大学 Hyper-spectral image classification method based on Gaussian process classifier collaborative training algorithm
CN102254177A (en) * 2011-04-22 2011-11-23 哈尔滨工程大学 Bearing fault detection method for unbalanced data SVM (support vector machine)
CN102402690A (en) * 2011-09-28 2012-04-04 南京师范大学 Data classification method based on intuitive fuzzy integration and system
CN102495901A (en) * 2011-12-16 2012-06-13 山东师范大学 Method for keeping balance of implementation class data through local mean
CN103020122A (en) * 2012-11-16 2013-04-03 哈尔滨工程大学 Transfer learning method based on semi-supervised clustering
CN103390171A (en) * 2013-07-24 2013-11-13 南京大学 Safe semi-supervised learning method
CN103605990A (en) * 2013-10-23 2014-02-26 江苏大学 Integrated multi-classifier fusion classification method and integrated multi-classifier fusion classification system based on graph clustering label propagation
CN103886330A (en) * 2014-03-27 2014-06-25 西安电子科技大学 Classification method based on semi-supervised SVM ensemble learning
CN103914704A (en) * 2014-03-04 2014-07-09 西安电子科技大学 Polarimetric SAR image classification method based on semi-supervised SVM and mean shift
CN104063520A (en) * 2014-07-17 2014-09-24 哈尔滨理工大学 Unbalance data classifying method based on cluster sampling kernel transformation
CN104573708A (en) * 2014-12-19 2015-04-29 天津大学 Ensemble-of-under-sampled extreme learning machine
CN104598930A (en) * 2015-02-05 2015-05-06 清华大学无锡应用技术研究院 Quick measurement method of characteristic resolutions
CN104809476A (en) * 2015-05-12 2015-07-29 西安电子科技大学 Multi-target evolutionary fuzzy rule classification method based on decomposition
CN104933053A (en) * 2014-03-18 2015-09-23 中国银联股份有限公司 Classification of class-imbalanced data
CN104951809A (en) * 2015-07-14 2015-09-30 西安电子科技大学 Unbalanced data classification method based on unbalanced classification indexes and integrated learning
CN105243394A (en) * 2015-11-03 2016-01-13 中国矿业大学 Evaluation method for performance influence degree of classification models by class imbalance
CN105320677A (en) * 2014-07-10 2016-02-10 香港中文大学深圳研究院 Method and device for training streamed unbalance data
CN105354583A (en) * 2015-08-24 2016-02-24 西安电子科技大学 Local mean based imbalance data classification method
CN106127225A (en) * 2016-06-13 2016-11-16 西安电子科技大学 Semi-supervised hyperspectral image classification method based on rarefaction representation
CN106156789A (en) * 2016-05-09 2016-11-23 浙江师范大学 Towards the validity feature sample identification techniques strengthening grader popularization performance
CN106201897A (en) * 2016-07-26 2016-12-07 南京航空航天大学 Software defect based on main constituent distribution function prediction unbalanced data processing method
CN106294593A (en) * 2016-07-28 2017-01-04 浙江大学 In conjunction with subordinate clause level remote supervisory and the Relation extraction method of semi-supervised integrated study
CN106599618A (en) * 2016-12-23 2017-04-26 吉林大学 Non-supervision classification method for metagenome contigs
CN106596900A (en) * 2016-12-13 2017-04-26 贵州电网有限责任公司电力科学研究院 Transformer fault diagnosis method based on improved semi-supervised classification of graph
CN103902706B (en) * 2014-03-31 2017-05-03 东华大学 Method for classifying and predicting big data on basis of SVM (support vector machine)
CN107038330A (en) * 2016-10-27 2017-08-11 北京郁金香伙伴科技有限公司 A kind of compensation method of shortage of data and device
CN107239789A (en) * 2017-05-09 2017-10-10 浙江大学 A kind of industrial Fault Classification of the unbalanced data based on k means
CN107391569A (en) * 2017-06-16 2017-11-24 阿里巴巴集团控股有限公司 Identification, model training, Risk Identification Method, device and the equipment of data type
CN107657282A (en) * 2017-09-29 2018-02-02 中国石油大学(华东) Peptide identification from step-length learning method
CN107958216A (en) * 2017-11-27 2018-04-24 沈阳航空航天大学 Based on semi-supervised multi-modal deep learning sorting technique
CN108304427A (en) * 2017-04-28 2018-07-20 腾讯科技(深圳)有限公司 A kind of user visitor's heap sort method and apparatus
CN108509973A (en) * 2018-01-19 2018-09-07 南京航空航天大学 Based on the Cholesky least square method supporting vector machine learning algorithms decomposed and its application
CN108647728A (en) * 2018-05-10 2018-10-12 广州大学 Unbalanced data classification oversampler method, device, equipment and medium
CN108960561A (en) * 2018-05-04 2018-12-07 阿里巴巴集团控股有限公司 A kind of air control model treatment method, device and equipment based on unbalanced data
CN109165694A (en) * 2018-09-12 2019-01-08 太原理工大学 The classification method and system of a kind of pair of non-equilibrium data collection
CN109508350A (en) * 2018-11-05 2019-03-22 北京邮电大学 The method and apparatus that a kind of pair of data are sampled
CN109726821A (en) * 2018-11-27 2019-05-07 东软集团股份有限公司 Data balancing method, device, computer readable storage medium and electronic equipment
CN109829487A (en) * 2019-01-16 2019-05-31 上海上塔软件开发有限公司 A kind of clustering method based on segmentation statistical nature distance
CN110138784A (en) * 2019-05-15 2019-08-16 重庆大学 A kind of Network Intrusion Detection System based on feature selecting
CN110263166A (en) * 2019-06-18 2019-09-20 北京海致星图科技有限公司 Public sentiment file classification method based on deep learning
CN110377587A (en) * 2019-07-15 2019-10-25 腾讯科技(深圳)有限公司 Method, apparatus, equipment and medium are determined based on the migrating data of machine learning
CN110442722A (en) * 2019-08-13 2019-11-12 北京金山数字娱乐科技有限公司 Method and device for training classification model and method and device for data classification
CN110569876A (en) * 2019-08-07 2019-12-13 武汉中原电子信息有限公司 Non-invasive load identification method and device and computing equipment
CN110579709A (en) * 2019-08-30 2019-12-17 西南交通大学 fault diagnosis method for proton exchange membrane fuel cell for tramcar
CN110795732A (en) * 2019-10-10 2020-02-14 南京航空航天大学 SVM-based dynamic and static combination detection method for malicious codes of Android mobile network terminal
CN110796262A (en) * 2019-09-26 2020-02-14 北京淇瑀信息科技有限公司 Test data optimization method and device of machine learning model and electronic equipment
CN110889445A (en) * 2019-11-22 2020-03-17 咪咕文化科技有限公司 Video CDN hotlinking detection method and device, electronic equipment and storage medium
CN110933102A (en) * 2019-12-11 2020-03-27 支付宝(杭州)信息技术有限公司 Abnormal flow detection model training method and device based on semi-supervised learning
CN111241360A (en) * 2020-01-09 2020-06-05 腾讯科技(深圳)有限公司 Information recommendation method, device, equipment and storage medium
CN111291818A (en) * 2020-02-18 2020-06-16 浙江工业大学 Non-uniform class sample equalization method for cloud mask
CN111814851A (en) * 2020-06-24 2020-10-23 重庆邮电大学 Coal mine gas data marking method based on single-class support vector machine
CN111832627A (en) * 2020-06-19 2020-10-27 华中科技大学 Image classification model training method, classification method and system for suppressing label noise

Similar Documents

Publication Publication Date Title
CN101980202A (en) Semi-supervised classification method of unbalance data
Bansal et al. Improved k-mean clustering algorithm for prediction analysis using classification technique in data mining
Guillaumin et al. Large-scale knowledge transfer for object localization in imagenet
CN105389583A (en) Image classifier generation method, and image classification method and device
CN102722713B (en) Handwritten numeral recognition method based on lie group structure data and system thereof
CN103617429A (en) Sorting method and system for active learning
CN103745233B (en) The hyperspectral image classification method migrated based on spatial information
CN106156805A (en) A kind of classifier training method of sample label missing data
CN103839078A (en) Hyperspectral image classifying method based on active learning
CN102819688A (en) Two-dimensional seismic data full-layer tracking method based on semi-supervised classification
CN106228027A (en) A kind of semi-supervised feature selection approach of various visual angles data
CN103020167A (en) Chinese text classification method for computer
CN104750875A (en) Machine error data classification method and system
CN106600046A (en) Multi-classifier fusion-based land unused condition prediction method and device
CN111859983A (en) Natural language labeling method based on artificial intelligence and related equipment
CN106326938A (en) SAR image target discrimination method based on weakly supervised learning
CN103473308B (en) High-dimensional multimedia data classifying method based on maximum margin tensor study
CN109933619A (en) A kind of semisupervised classification prediction technique
CN110175657A (en) A kind of image multi-tag labeling method, device, equipment and readable storage medium storing program for executing
Malakar et al. Offline music symbol recognition using Daisy feature and quantum Grey wolf optimization based feature selection
CN106250913A (en) A kind of combining classifiers licence plate recognition method based on local canonical correlation analysis
CN104268557A (en) Polarization SAR classification method based on cooperative training and depth SVM
CN105894035B (en) SAR image classification method based on SAR-SIFT and DBN
CN109615002A (en) Decision tree SVM university student's consumer behavior evaluation method based on PSO
CN104268555A (en) Polarization SAR image classification method based on fuzzy sparse LSSVM

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20110223