CN113469251A - Method for classifying unbalanced data - Google Patents

Method for classifying unbalanced data

Info

Publication number
CN113469251A
CN113469251A
Authority
CN
China
Prior art keywords: data, marked, sample, samples, classifying
Prior art date
Legal status
Pending
Application number
CN202110748670.1A
Other languages
Chinese (zh)
Inventor
赵正旦
章韵
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date: 2021-07-02
Filing date: 2021-07-02
Publication date: 2021-10-01
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202110748670.1A
Publication of CN113469251A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention discloses a method for classifying unbalanced data, belonging to the technical field of machine learning, that combines an active learning method with an oversampling method. The unbalanced data comprise marked data and unmarked data, and the method specifically comprises the following steps: preprocessing the marked data and calculating distance features to obtain an initial training set; training on the initial training set to obtain an initial classifier; calculating the uncertainty of the unmarked data with the initial classifier; sorting the unmarked data by uncertainty and manually marking them to obtain a marked data set; performing probability oversampling on the marked data set to obtain a balanced data set; and training on the balanced data set to obtain a classifier for classifying the unbalanced data. By combining active learning with oversampling, the method reduces the number of samples participating in training while ensuring that the classifier attains high classification precision on both majority-class and minority-class data.

Description

Method for classifying unbalanced data
Technical Field
The invention relates to a classification method of unbalanced data, and belongs to the field of machine learning.
Background
At present, research on the data imbalance problem proceeds mainly at three levels, data preprocessing, features, and the classification algorithm, with the shared goal of ensuring that the classifier attains high classification precision on both majority-class and minority-class data. At the data preprocessing level, imbalance is reduced or eliminated by changing the sample distribution of the training set, using a range of undersampling and oversampling techniques. At the feature level, imbalance in the sample count distribution is usually accompanied by imbalance in the distribution of feature attributes, so feature selection methods are used to pick discriminative features and thereby improve classification precision on the minority classes. At the classification algorithm level, the algorithm is improved to raise the recognition rate of minority-class samples, guided by the algorithm's shortcomings on imbalance problems and the characteristics of unbalanced data; typical methods include ensemble learning, cost-sensitive learning, and one-class learning.
The main idea of active learning is to introduce interactive capability into the training process: in each iteration, the best candidate sample is actively selected and added to the training set, reducing the number of samples participating in training and saving computation. By actively choosing which samples to learn from as learning progresses, it breaks with the traditional approach of passively learning from a sample set whose labels are already known. Such an algorithm can effectively reduce the number of samples that must be evaluated, improve the prediction accuracy of the initial classifier, and actively screen useful samples while preserving most of the useful information. Active learning avoids a large amount of manual marking work and alleviates the slow learning and large memory footprint caused by very large training sets.
Active learning sample selection strategies fall mainly into stream-based and pool-based sample selection. Pool-based selection criteria chiefly include the uncertainty criterion, the version-space reduction criterion, and the generalization-error reduction criterion. Under the uncertainty criterion, the degree of uncertainty is represented mainly by probability or by distance. Sample selection based on version-space reduction aims to choose samples that shrink the version space as much as possible, where the version space refers to the set of candidate reference classifiers of different types; query-by-committee is a typical algorithm based on this criterion. The generalization error of a classifier is a common index of its robustness, and sample selection based on the generalization-error reduction criterion ultimately aims to reduce the classifier's generalization error.
In machine learning, the sample imbalance problem refers to an unbalanced class distribution. When a conventional algorithm is applied to such data, the classification result is often biased toward the majority class, so the minority class cannot be correctly identified: most conventional algorithms train the classifier to maximize overall accuracy, which downplays the influence of minority-class samples and causes them to be misclassified. Yet in many practical problems the minority class carries more information and is of greater value than the majority class. The unbalanced data classification problem arises widely in biomedicine, finance, information security, industry, computer vision, and other fields.
Disclosure of Invention
The invention aims to provide a method for classifying unbalanced data that reduces the number of training samples, lowers the misclassification rate on the minority class, and improves classification precision.
In order to achieve the above object, the present invention provides a method for classifying unbalanced data, comprising an active learning method and an oversampling method, where the unbalanced data includes a first type of data and a second type of data, and the first type of data and/or the second type of data includes marked data and unmarked data. The specific steps are:
step 1, preprocessing marked data, and calculating distance features to obtain an initial training set;
step 2, training the initial training set to obtain an initial classifier;
step 3, calculating the uncertainty of the unmarked data by using the initial classifier;
step 4, sorting the unmarked data according to the uncertainty, and manually marking to obtain a marked data set;
step 5, carrying out probability oversampling on the marked data set by using an oversampling method to obtain a balanced data set;
step 6, training the balanced data set to obtain a classifier for classifying the unbalanced data.
As a further improvement of the present invention, the active learning method is a sample selection mode based on an uncertainty sampling strategy. The oversampling method is specifically as follows: the features of the samples comprise discrete features and continuous features; the continuous features are fitted with the EM algorithm, using the Akaike Information Criterion (AIC) to select the model, to obtain a Gaussian mixture distribution model P, the conditional distribution function of each feature given the other features is calculated, and new samples are obtained by Gibbs sampling; for the discrete features, the frequency with which each discrete feature value occurs in the first type of data is first counted, and new samples are then generated randomly according to the corresponding frequencies.
As a further improvement of the present invention, the preprocessing in step 1 is: calculating an internal distance between the marked data and the unmarked data, where the internal distance is calculated according to the following formula:

$$Dis_{inner}(x, z) = \sqrt{\sum_{i=1}^{n} (x_i - z_i)^2}$$

where $n$ is the dimension of the data, and $x_i$ and $z_i$ respectively denote the $i$-th dimension feature value of the unmarked data and the marked data.
As a further improvement of the present invention, the minimum value of the internal distance is the distance feature. All distance features are calculated over all samples of the unmarked data and the marked data and arranged in ascending order; the first t samples with the smallest distance features, together with the marked data, form the initial training set. The calculation formula of the distance feature is:

$$feature\_dis(x) = \min_{z \in B} Dis_{inner}(x, z), \quad x \in A$$

where $z$ ranges over all samples of the marked data.
As a further improvement of the present invention, step 2 is specifically: training the initial training set with a support vector machine to obtain the initial classifier.
As a further improvement of the present invention, step 3 is specifically: classifying the unmarked data with the initial classifier to obtain the probability $p(y_i|x_i)$ that sample $x_i$ belongs to category $y_i$; from $p(y_i|x_i)$, the information entropy is calculated, and the information entropy is the uncertainty. The calculation formula of the information entropy is:

$$ENT = \arg\max_{x_i \in U} \left( -\sum_{y_i \in Y} p(y_i|x_i) \log p(y_i|x_i) \right)$$
as a further improvement of the invention, the sample x is judged according to the optimal label and suboptimal label criterioniThe calculation formula of the optimal label and suboptimal label criterion is as follows:
Figure BDA0003145285990000042
wherein, p (y)best|xi) And p (y)second_best|xi) Are respectively a sample xiThe optimal classification probability and the suboptimal classification probability.
As a further improvement of the present invention, step 4 is specifically: arranging the unmarked data in descending order of uncertainty, manually marking the sample with the greatest uncertainty, adding the marked sample to the initial training set to retrain the initial classifier, and stopping training once the initial classifier reaches a performance threshold, thereby obtaining the marked data set.
As a further improvement of the present invention, step 5 is specifically: representing the true distribution of the marked data set with a Gaussian mixture model and performing probability oversampling to obtain the balanced data set, where the distribution probability density of the Gaussian mixture model is:

$$P(x) = \sum_{l=1}^{L} \omega_l \, N(x \mid \mu_l, \Sigma_l)$$

where $\omega_l$, $l = 1, 2, \ldots, L$, are the weighting coefficients and satisfy $\sum_{l=1}^{L} \omega_l = 1$; $\mu_l$ is the mean of the $l$-th Gaussian component; $\Sigma_l$ is its variance; and $N(x \mid \mu_l, \Sigma_l)$ is the $l$-th Gaussian probability distribution, with expression:

$$N(x \mid \mu_l, \Sigma_l) = \frac{1}{(2\pi)^{n/2} |\Sigma_l|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_l)^{\mathsf{T}} \Sigma_l^{-1} (x - \mu_l) \right)$$
as a further improvement of the present invention, the probability oversampling specifically includes: and circularly using the oversampling method for the samples in the marked data set until s new samples are generated, and balancing the first type data and the second type data to obtain the balanced data set.
The invention has the beneficial effects that: the unbalanced data classification method combines active learning with an oversampling method. Active learning is realized through a BvSB-based uncertainty sample selection method, which reduces the number of training samples and saves computation. An oversampling method based on probability selection then balances the unbalanced data set, reducing the misclassification rate on the minority class and ensuring that the classifier attains high classification precision on both majority-class and minority-class data.
Drawings
FIG. 1 is a flow chart of an unbalanced data classification method of the present invention.
FIG. 2 is a flow chart of the active learning method.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
Referring to fig. 1, the present invention provides a classification method for unbalanced data, where the classification method includes an active learning method and an oversampling method, the unbalanced data includes a first type of data and a second type of data, and the first type of data and/or the second type of data includes marked data and unmarked data.
Unbalanced data means data whose classes are unbalanced, that is, the number of first type data samples differs greatly from the number of second type data samples. In this embodiment, the first type data account for the smaller proportion of the unbalanced data and the second type data for the larger proportion; that is, the first type data is the minority class and the second type data is the majority class.
The active learning method comprises an initial training set selection strategy adopted for the current unbalanced data and is a sample selection mode based on an uncertainty sampling strategy, which reduces the number of training samples and saves computation.
The oversampling method is a probability-based oversampling method, specifically as follows: the features of the samples comprise discrete features and continuous features. The continuous features are fitted with the EM algorithm, using the Akaike Information Criterion (AIC) to select the model, to obtain a Gaussian mixture distribution model P; the conditional distribution function of each feature given the other features is calculated, and new samples are obtained by Gibbs sampling. For the discrete features, the frequency with which each discrete feature value occurs in the minority class is first counted, and new samples are then generated randomly according to the corresponding frequencies. This ensures that the classifier attains high classification precision on both majority-class and minority-class data.
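To illustrate the discrete-feature branch, the following is a minimal sketch, not the patented implementation (the function name and the `n_new` parameter are illustrative assumptions): it counts the empirical frequency of each discrete feature value within the minority class and draws new values accordingly.

```python
import numpy as np

def sample_discrete_features(X_minority, n_new, rng=None):
    """Generate n_new synthetic rows for discrete features by sampling each
    column independently from its empirical value frequencies in the minority class."""
    rng = np.random.default_rng(rng)
    n_features = X_minority.shape[1]
    new_rows = np.empty((n_new, n_features), dtype=X_minority.dtype)
    for j in range(n_features):
        values, counts = np.unique(X_minority[:, j], return_counts=True)
        freqs = counts / counts.sum()                  # frequency of each value
        new_rows[:, j] = rng.choice(values, size=n_new, p=freqs)
    return new_rows
```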
Referring to fig. 2, the active learning method is specifically: first, an initial classifier S is trained on the marked sample set L. Using the initial classifier S and a query algorithm q that evaluates the information content of samples, the information of each sample in the unmarked sample set U is evaluated; the sample carrying the largest amount of information is selected from U and submitted to a human annotator T for manual marking, and the manually marked sample is then added to the marked sample set L to optimize the initial classifier S.
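This loop can be sketched as follows; it is a minimal sketch assuming a scikit-learn `SVC` as the classifier S and an `oracle_label` callback standing in for the human annotator T (both illustrative assumptions), with predictive entropy playing the role of the query algorithm q:

```python
import numpy as np
from sklearn.svm import SVC

def active_learning_loop(X_labeled, y_labeled, X_pool, oracle_label, n_queries=50):
    """Iteratively query the most informative pool sample, have it labeled,
    and retrain: a sketch of the loop in Fig. 2."""
    X_lab, y_lab = X_labeled.copy(), list(y_labeled)
    pool = list(range(len(X_pool)))                    # indices into the unmarked set U
    clf = SVC(probability=True).fit(X_lab, y_lab)      # initial classifier S
    for _ in range(n_queries):
        if not pool:
            break
        proba = clf.predict_proba(X_pool[pool])        # p(y|x) for remaining pool
        entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)
        best = pool.pop(int(np.argmax(entropy)))       # sample with most information
        X_lab = np.vstack([X_lab, X_pool[best]])
        y_lab.append(oracle_label(X_pool[best]))       # manual marking by T
        clf = SVC(probability=True).fit(X_lab, y_lab)  # optimize S with enlarged L
    return clf
```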
The unbalanced data classification method of the invention comprises the following steps:
step 1, preprocessing marked data, and calculating distance features to obtain an initial training set.
Calculate the internal distance between the marked data and the unmarked data according to the following formula:

$$Dis_{inner}(x, z) = \sqrt{\sum_{i=1}^{n} (x_i - z_i)^2}$$

where $n$ is the dimension of the data, and $x_i$ and $z_i$ are respectively the $i$-th dimension feature values of the unmarked data (A) and the marked data (B).
For all samples in the unmarked data and the marked data, calculate the distance feature of each sample x, arrange all samples in ascending order of distance feature, and select the first t samples with the smallest distance features together with the marked data to form the initial training set.

Among all the non-homogeneous point pairs formed by the samples, the minimum internal distance is the distance feature; when x belongs to A, its calculation formula is:

$$feature\_dis(x) = \min_{z \in B} Dis_{inner}(x, z)$$

where $z$ ranges over all samples of the marked data (B).
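A minimal sketch of this selection, assuming the Euclidean form of the internal distance given above (array names and the parameter t are illustrative):

```python
import numpy as np

def select_initial_samples(X_unlabeled, X_labeled, t):
    """Distance feature of each unmarked sample x in A: the minimum internal
    distance to any marked sample z in B. Returns the indices of the t unmarked
    samples with the smallest distance features; together with the marked data
    they form the initial training set."""
    diffs = X_unlabeled[:, None, :] - X_labeled[None, :, :]   # shape (|A|, |B|, n)
    dists = np.sqrt((diffs ** 2).sum(axis=2))                 # Dis_inner(x, z)
    feature_dis = dists.min(axis=1)                           # min over z in B
    return np.argsort(feature_dis)[:t]                        # ascending order, first t
```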
Step 2, training the initial training set to obtain an initial classifier.
Train on the initial training set obtained in step 1 with a support vector machine (SVM) to obtain the initial classifier $f_{first}$, which is used for the subsequent active learning sample selection.
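Assuming scikit-learn's `SVC` as the support vector machine (the text does not name a library, so this is an illustrative choice), a minimal sketch of this step with a toy stand-in for the initial training set:

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in for the initial training set produced in step 1 (illustrative).
rng = np.random.default_rng(0)
X_initial = rng.normal(size=(40, 5))
y_initial = np.array([0] * 30 + [1] * 10)            # unbalanced labels

# probability=True enables predict_proba, which the entropy and BvSB
# computations in step 3 require (Platt scaling is fitted internally).
f_first = SVC(kernel="rbf", probability=True).fit(X_initial, y_initial)
print(f_first.predict_proba(X_initial[:3]))          # p(y_i | x_i) per class
```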
Step 3, calculating the uncertainty of the unmarked data by using the initial classifier.
Classify all the unmarked data samples with the initial classifier obtained in step 2, obtaining the probability that sample $x_i$ belongs to category $y_i$, denoted $p(y_i|x_i)$. From $p(y_i|x_i)$, the information entropy (namely the uncertainty) is calculated as:

$$ENT = \arg\max_{x_i \in U} \left( -\sum_{y_i \in Y} p(y_i|x_i) \log p(y_i|x_i) \right)$$
where $\arg\max_{x_i \in U}$ denotes taking the maximum over the unmarked pool U. The higher a sample's information entropy, the fuzzier its class attribute, and the greater the value and amount of information it brings to the model, which helps improve the accuracy of the classifier.
In the multi-classification problem, the sample $x_i$ is judged according to the best-versus-second-best (BvSB) criterion, which considers only the two classes with the highest classification probability for the sample and ignores the other classification results. The calculation formula of the BvSB criterion is:

$$BvSB = \arg\min_{x_i \in U} \left( p(y_{best}|x_i) - p(y_{second\_best}|x_i) \right)$$

where $p(y_{best}|x_i)$ and $p(y_{second\_best}|x_i)$ are respectively the optimal and suboptimal classification probabilities of sample $x_i$.
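A corresponding sketch of the BvSB margin; the sample with the smallest margin between the best and second-best class probabilities is the one queried:

```python
import numpy as np

def bvsb_query(clf, X_pool):
    """BvSB margin per pooled sample: p(y_best|x_i) - p(y_second_best|x_i);
    the argmin identifies the sample to submit for manual marking."""
    proba = clf.predict_proba(X_pool)
    ordered = np.sort(proba, axis=1)              # ascending per row
    margin = ordered[:, -1] - ordered[:, -2]      # best minus second-best
    return margin, int(np.argmin(margin))         # argmin over x_i in U
```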
Step 4, sorting the unmarked data according to the uncertainty, and manually marking to obtain a marked data set.
Manually mark the sample with the largest amount of information (namely the sample with the largest information entropy) selected in step 3. Specifically, arrange the unmarked data in descending order of uncertainty and manually mark the sample with the greatest uncertainty. Add the marked sample to the initial training set and retrain the initial classifier on the updated training set, stopping once the initial classifier reaches the performance threshold; all samples then in the training set are the required training samples, namely the marked data set.
Step 5, performing probability oversampling on the marked data set by using the oversampling method to obtain a balanced data set.
Represent the true distribution of the marked data set obtained in step 4 with a Gaussian mixture model and perform probability oversampling to obtain the balanced data set. The Gaussian mixture model is an extension of the single Gaussian density function and can approximate a probability density of any shape; its parameters are obtained by weighting L single Gaussian models, and its distribution probability density expression is:

$$P(x) = \sum_{l=1}^{L} \omega_l \, N(x \mid \mu_l, \Sigma_l)$$

where $\omega_l$, $l = 1, 2, \ldots, L$, are the weighting coefficients and satisfy $\sum_{l=1}^{L} \omega_l = 1$; $\mu_l$ is the mean of the $l$-th Gaussian component; $\Sigma_l$ is its variance; and $N(x \mid \mu_l, \Sigma_l)$ is the $l$-th Gaussian probability distribution, with expression:

$$N(x \mid \mu_l, \Sigma_l) = \frac{1}{(2\pi)^{n/2} |\Sigma_l|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_l)^{\mathsf{T}} \Sigma_l^{-1} (x - \mu_l) \right)$$
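A sketch of the mixture fit using scikit-learn's `GaussianMixture`, whose `fit` runs the EM algorithm and which exposes an `aic` score; the candidate component range is an illustrative assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_by_aic(X_marked, max_components=10, seed=0):
    """Fit Gaussian mixtures by EM for each candidate L and keep the model
    with the lowest Akaike Information Criterion."""
    best_model, best_aic = None, np.inf
    for L in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=L, covariance_type="full",
                              random_state=seed).fit(X_marked)   # EM fit
        aic = gmm.aic(X_marked)
        if aic < best_aic:
            best_model, best_aic = gmm, aic
    return best_model
```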
the probability oversampling specifically includes: and (4) circularly using the oversampling method for the samples in the marked data set until s new samples are generated, and balancing the first class data (minority class) and the second class data (majority class) to obtain a balanced data set.
Step 6, training the balanced data set to obtain a classifier for classifying the unbalanced data.
Train on the labeled balanced data set obtained in step 5 to obtain the final classifier $f_{final}$.
In summary, the invention provides a method for classifying unbalanced data that introduces interactive capability into the training process through active learning and selects samples by the uncertainty of the BvSB criterion, thereby reducing the number of training samples and saving computation. Meanwhile, an oversampling method added to the training process balances the unbalanced data set, reducing the misclassification rate on the minority class and ensuring that the classifier attains high classification precision on both majority-class and minority-class data.
Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.

Claims (10)

1. A method for classifying unbalanced data, comprising: the method comprises an active learning method and an oversampling method, wherein the unbalanced data comprises first class data and second class data, and the first class data and/or the second class data comprise marked data and unmarked data, and the specific steps are as follows:
step 1, preprocessing marked data, and calculating distance features to obtain an initial training set;
step 2, training the initial training set to obtain an initial classifier;
step 3, calculating the uncertainty of the unmarked data by using the initial classifier;
step 4, sorting the unmarked data according to the uncertainty, and manually marking to obtain a marked data set;
step 5, carrying out probability oversampling on the marked data set by using an oversampling method to obtain a balanced data set;
step 6, training the balanced data set to obtain a classifier for classifying the unbalanced data.
2. The method of classifying unbalanced data according to claim 1, wherein: the active learning method is a sample selection mode based on an uncertainty sampling strategy; the oversampling method is specifically as follows: the features of the samples comprise discrete features and continuous features; the continuous features are fitted with the EM algorithm, using the Akaike Information Criterion (AIC) to select the model, to obtain a Gaussian mixture distribution model P, the conditional distribution function of each feature given the other features is calculated, and new samples are obtained by Gibbs sampling; for the discrete features, the frequency with which each discrete feature value occurs in the first type of data is first counted, and new samples are then generated randomly according to the corresponding frequencies.
3. The method for classifying unbalanced data according to claim 1, wherein the preprocessing in step 1 is: calculating an internal distance between the marked data and the unmarked data, where the internal distance is calculated according to the following formula:

$$Dis_{inner}(x, z) = \sqrt{\sum_{i=1}^{n} (x_i - z_i)^2}$$

where $n$ is the dimension of the data, and $x_i$ and $z_i$ respectively denote the $i$-th dimension feature value of the unmarked data and the marked data.
4. A method of classifying imbalance data according to claim 3, characterized in that: the minimum value of the internal distance is a distance feature, all distance features of each sample x are calculated for all samples of unmarked data and marked data, the distance features are arranged in a sequence from small to large, the first t samples with the minimum distance features and the marked data are selected to form the initial training set, and the calculation formula of the distance features is as follows:
$$feature\_dis(x) = \min_{z \in B} Dis_{inner}(x, z), \quad x \in A$$
where z is all samples with labeled data.
5. The method for classifying unbalanced data according to claim 1, wherein step 2 is specifically: training the initial training set with a support vector machine to obtain the initial classifier.
6. The method for classifying unbalanced data according to claim 1, wherein step 3 is specifically: classifying the unmarked data with the initial classifier to obtain the probability $p(y_i|x_i)$ that sample $x_i$ belongs to category $y_i$; from $p(y_i|x_i)$, the information entropy is calculated, and the information entropy is the uncertainty. The calculation formula of the information entropy is:

$$ENT = \arg\max_{x_i \in U} \left( -\sum_{y_i \in Y} p(y_i|x_i) \log p(y_i|x_i) \right)$$

where $\arg\max_{x_i \in U}$ denotes taking the maximum over the unmarked pool U.
7. The method of classifying unbalanced data according to claim 6, wherein: the sample $x_i$ is judged according to the best-versus-second-best (BvSB) label criterion, whose calculation formula is:

$$BvSB = \arg\min_{x_i \in U} \left( p(y_{best}|x_i) - p(y_{second\_best}|x_i) \right)$$

where $p(y_{best}|x_i)$ and $p(y_{second\_best}|x_i)$ are respectively the optimal and suboptimal classification probabilities of sample $x_i$.
8. The method for classifying unbalanced data according to claim 1, wherein step 4 is specifically: arranging the unmarked data according to the sequence of the uncertainty from large to small, manually marking the sample with the maximum uncertainty, adding the marked sample into the initial training set to train the initial classifier, and stopping training until the initial classifier reaches a threshold value to obtain a marked data set.
9. The method for classifying unbalanced data according to claim 2, wherein step 5 specifically comprises: representing the real distribution of the marked data set by using a mixed Gaussian model, and performing probability oversampling to obtain a balanced data set, wherein the distribution probability density expression of the mixed Gaussian model is as follows:
$$P(x) = \sum_{l=1}^{L} \omega_l \, N(x \mid \mu_l, \Sigma_l)$$

where $\omega_l$, $l = 1, 2, \ldots, L$, are the weighting coefficients and satisfy $\sum_{l=1}^{L} \omega_l = 1$; $\mu_l$ is the mean of the $l$-th Gaussian component; $\Sigma_l$ is its variance; and $N(x \mid \mu_l, \Sigma_l)$ is the $l$-th Gaussian probability distribution, with expression:

$$N(x \mid \mu_l, \Sigma_l) = \frac{1}{(2\pi)^{n/2} |\Sigma_l|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_l)^{\mathsf{T}} \Sigma_l^{-1} (x - \mu_l) \right)$$
10. The method of classifying unbalanced data according to claim 9, wherein the probability oversampling is specifically: applying the oversampling method cyclically to the samples in the marked data set until s new samples are generated and the first type data and the second type data are balanced, obtaining the balanced data set.
CN202110748670.1A (filed 2021-07-02): Method for classifying unbalanced data, status Pending, published as CN113469251A.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110748670.1A 2021-07-02 2021-07-02 Method for classifying unbalanced data (published as CN113469251A)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110748670.1A 2021-07-02 2021-07-02 Method for classifying unbalanced data (published as CN113469251A)

Publications (1)

Publication Number Publication Date
CN113469251A true CN113469251A (en) 2021-10-01

Family

ID=77877340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110748670.1A Pending CN113469251A (en) 2021-07-02 2021-07-02 Method for classifying unbalanced data

Country Status (1)

Country Link
CN (1) CN113469251A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20150069424A (en) * 2013-12-13 2015-06-23 건국대학교 산학협력단 System and method for large unbalanced data classification based on hadoop for prediction of traffic accidents
CN104239516A (en) * 2014-09-17 2014-12-24 南京大学 Unbalanced data classification method
CN108154178A (en) * 2017-12-25 2018-06-12 北京工业大学 Semi-supervised support attack detection method based on improved SVM-KNN algorithms
AU2018101315A4 (en) * 2018-09-07 2018-10-11 Liu, Ruiqi Mr A Solution For Data Imbalance Classification Problem In Model Construction In Banking Industry
CN109492776A (en) * 2018-11-21 2019-03-19 哈尔滨工程大学 Microblogging Popularity prediction method based on Active Learning
CN111461855A (en) * 2019-01-18 2020-07-28 同济大学 Credit card fraud detection method and system based on undersampling, medium, and device
CN110222785A (en) * 2019-06-13 2019-09-10 重庆大学 Self-adapting confidence degree Active Learning Method for gas sensor drift correction
CN110443281A (en) * 2019-07-05 2019-11-12 重庆信科设计有限公司 Adaptive oversampler method based on HDBSCAN cluster
CN110569982A (en) * 2019-08-07 2019-12-13 南京智谷人工智能研究院有限公司 Active sampling method based on meta-learning
CN110516722A (en) * 2019-08-15 2019-11-29 南京航空航天大学 The automatic generation method of traceability between a kind of demand and code based on Active Learning
CN111368924A (en) * 2020-03-05 2020-07-03 南京理工大学 Unbalanced data classification method based on active learning
CN112069310A (en) * 2020-06-18 2020-12-11 中国科学院计算技术研究所 Text classification method and system based on active learning strategy
CN112508092A (en) * 2020-12-03 2021-03-16 上海云从企业发展有限公司 Sample screening method, system, equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination