CN110991653A - Method for classifying unbalanced data sets - Google Patents

Method for classifying unbalanced data sets

Info

Publication number
CN110991653A
CN110991653A
Authority
CN
China
Prior art keywords
data set
sample
unbalanced data
samples
minority
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911256817.4A
Other languages
Chinese (zh)
Inventor
简玉琳
叶茂
闵艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201911256817.4A priority Critical patent/CN110991653A/en
Publication of CN110991653A publication Critical patent/CN110991653A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers


Abstract

The invention discloses a method for classifying unbalanced data sets, applicable to fields such as network intrusion detection, animal age prediction, and vehicle performance evaluation. To address the low classification accuracy on minority classes in the prior art, the method exploits the relationship between the minority and majority classes in the original data: on the basis of the original training data, the SMOTE and K-nearest-neighbor algorithms are used to construct a new set that focuses on minority samples and the related majority samples. Two random forests of the same size are built, one from the original training data and one from the new set; the decision trees of the two forests are then merged into one large forest, which jointly tests the test set to obtain the classification result. Compared with the prior art, classification accuracy is greatly improved.

Description

Method for classifying unbalanced data sets
Technical Field
The invention belongs to the fields of network intrusion detection, animal age prediction, vehicle performance evaluation and the like, and particularly relates to a technology for unbalanced data set classification.
Background
Classification is one of the important research directions in machine learning; through years of development, many mature algorithms have been formed and successfully applied in practice. These conventional classification algorithms aim at the highest overall classification accuracy and assume that the number of samples in each class of the data set is roughly balanced. In practical problems, however, the following situation arises: a data set contains two classes, one of which has far fewer samples than the other; the former is called the minority class and the latter the majority class. Owing to the difficulties encountered when classifying such unbalanced data sets, the problem has recently attracted increasing attention. How to classify unbalanced data sets correctly while improving the classification accuracy of the minority class has become a research focus in data mining.
Conventional learning algorithms assume an essentially balanced data distribution and target overall classification accuracy, so on unbalanced data they show a clear preference for the numerically dominant majority class: accuracy on the majority class rises while accuracy on the minority class falls. In practical problems, however, it is usually the minority class that matters. For example, intrusion samples in network intrusion detection are generally below 1% of the data; a classifier that predicts every sample as the majority class still attains 99% accuracy yet identifies no intrusions, which is obviously of no help. Similarly, when prospecting for oil with sea-surface photographs returned by satellites, most pictures show no obvious sign of oil and only a few indicate possible oil resources; finding natural oil resources as accurately as possible among the mass of satellite images is what matters. Classifying unbalanced data sets therefore has high application value and broad application prospects.
A related prior patent is "A classification method for unbalanced data", invented by a team at Beihang University (Beijing University of Aeronautics and Astronautics), published on June 3, 2015, publication number CN104679860A. That method learns two kinds of decision functions from a training sample set, then derives membership and membership-classification decision functions in turn, and finally classifies the samples falling in a second overlapping region determined in the test set. However, that patent focuses only on optimizing the decision function; its steps are overly complex and involve many parameters. For unbalanced data sets, the information about the majority and minority classes contained in the data itself is also important for classification, so how to collect and use this information matters as well; in this respect the patent is incomplete.
Another related patent, "Oversampling method, apparatus, device and medium for unbalanced data classification", was filed with the Chinese intellectual property office by a team at Guangzhou University on May 10, 2018, published on October 12, 2018, publication number CN108647728B. That patent processes the minority-class samples of an unbalanced data set, obtains the number of corresponding majority samples with a K-nearest-neighbor algorithm, and determines the class of minority samples by back-inference from the number of majority samples. In other words, it focuses only on processing the data set and, not being fused with a subsequent algorithm for improvement, has certain limitations.
Disclosure of Invention
In order to solve the above technical problem, the present invention provides a method for classifying unbalanced data sets.
The technical scheme adopted by the invention is as follows: a method for classifying unbalanced data sets uses the SMOTE and K-nearest-neighbor algorithms to process the original training data set and construct a new set that focuses on minority samples and the majority samples related to them; two random forests of the same size are then built, one trained on the original training data set and the other on the newly constructed data set; finally, the decision trees of the two forests are merged into one large forest, which jointly tests the test set to obtain the classification result.
The invention comprises the following technologies:
1. K-nearest-neighbor algorithm
The part used by the invention: with the Euclidean distance as the metric, find the k nearest majority-class samples and the k nearest minority-class samples around a minority-class sample point. The Euclidean distance in n-dimensional space is expressed as:
d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
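As an illustration, a minimal sketch of this nearest-neighbor search (function and variable names are our own; NumPy is assumed):

```python
import numpy as np

def k_nearest(center, candidates, k):
    """Return the k rows of `candidates` closest to `center` in Euclidean distance."""
    dists = np.sqrt(((candidates - center) ** 2).sum(axis=1))
    return candidates[np.argsort(dists)[:k]]

# For one minority-class sample, find its nearest majority and minority neighbours.
x = np.array([0.0, 0.0])                      # a minority-class sample
X_maj = np.array([[3.0, 3.0], [1.0, 0.0], [5.0, 5.0]])
X_min = np.array([[0.0, 1.0], [4.0, 4.0]])    # the other minority-class samples
maj_nn = k_nearest(x, X_maj, k=2)             # two closest majority samples
min_nn = k_nearest(x, X_min, k=1)             # closest minority sample
```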
2. SMOTE algorithm
SMOTE, the Synthetic Minority Oversampling Technique, is an improved scheme based on the random-oversampling algorithm. Its basic idea is to analyze the minority-class samples and add artificially synthesized new samples, derived from them, to the data set. The general flow is as follows:
for each sample x in the minority class, compute, with the Euclidean distance as the metric, the distance from x to every sample in the minority-class set Smin, obtaining the k nearest neighbors of x;
according to the sample imbalance ratio, set a sampling proportion to determine the sampling multiplier N, and randomly select several samples from the k nearest neighbors of each minority sample x; a selected neighbor is denoted xn;
the sampling multiplying power N is set according to the unbalanced proportion of each data set so as to achieve the purpose of balancing less types and various samples after oversampling. Assume that the imbalance ratio is of the multiple classes: and if the minority is 7:1, setting the sampling multiplying factor N to 7.
For each randomly selected neighbor xn, a new sample is constructed with the original sample according to the following formula:
xnew=x+rand(0,1)*|x-xn|
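A minimal sketch of this synthesis step (names are our own; note the patent writes |x − xn|, while the standard SMOTE form uses the signed difference (xn − x), which keeps the new point on the segment between x and xn — the sketch uses the signed form):

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_synthesize(x, x_n):
    """Create one synthetic minority sample on the segment between a minority
    sample x and a chosen neighbour x_n: x_new = x + rand(0,1) * (x_n - x)."""
    return x + rng.random() * (x_n - x)

x = np.array([1.0, 2.0])
x_n = np.array([3.0, 4.0])
x_new = smote_synthesize(x, x_n)   # lies componentwise between x and x_n
```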
3. ORF (online random forest)
ORF is the online-learning form of RF (random forest). Unlike a batch random forest, an online random forest updates each of its decision trees whenever a new sample arrives, which improves the stability of the forest and extends the life cycle of the whole model.
Suppose a forest of size T, [t1, t2, ..., tT], and a set S = {(x1, y1), (x2, y2), ..., (xN, yN)} of N random samples, where xi = [xi1, xi2, ..., xim]^T ∈ R^m and yi ∈ {1, ..., K} is the class label of each sample pair. A bootstrap sampling method is applied to each decision tree to construct its training set. The training set enters the forest serially; for each sample si = <xi, yi> about to enter the forest, a random integer k is obtained from a Poisson distribution:
P(k) = λ^k · e^(−λ) / k!
This can be written k ~ Poisson(λ), where λ is in general a constant. For the generated k there are two cases:
1) if k > 0, the sample si = <xi, yi> is used to train the forest k times;
2) otherwise, the sample si = <xi, yi> is used to compute the out-of-bag error rate of the forest and adjust it.
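The per-sample update rule above (Oza–Russell style online bagging) can be sketched as follows; `StubTree` is a stand-in for a real incrementally updatable decision tree, and all names are illustrative:

```python
import math
import random

def poisson_k(lam=1.0, rng=random.random):
    """Draw k ~ Poisson(lam) by Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng()
        if p <= limit:
            return k
        k += 1

class StubTree:
    """Stand-in for an online decision tree: counts updates and OOB uses."""
    def __init__(self):
        self.updates = 0
        self.oob_uses = 0
    def update(self, sample):
        self.updates += 1
    def record_oob(self, sample):
        self.oob_uses += 1

def online_bagging_step(trees, sample, lam=1.0):
    """Each tree sees the incoming sample k ~ Poisson(lam) times; if k == 0,
    the sample is used for that tree's out-of-bag error instead."""
    for tree in trees:
        k = poisson_k(lam)
        if k > 0:
            for _ in range(k):
                tree.update(sample)
        else:
            tree.record_oob(sample)

forest = [StubTree() for _ in range(10)]
online_bagging_step(forest, ([0.5, 1.2], 1))
```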
4. Data sets
The data sets used in the experiments comprise 9 real-world unbalanced data sets, relating to vehicle performance evaluation, animal age prediction and similar fields, from the UCI machine learning repository, and 9 artificial unbalanced data sets from the KEEL repository.
5. Fusion algorithm
The invention makes full use of the learning ability of the online random forest and, exploiting the generalization advantages of the K-nearest-neighbor and SMOTE algorithms, processes the unbalanced data set and fuses the result into the classification stage, achieving a brand-new improvement. In other words, unlike previous algorithms that focus only on the data stage or only on the algorithm stage, the proposed model builds two classifiers, attending respectively to the original data set and to its minority samples, thereby improving classification performance.
In summary, the implementation process of the invention is as follows:
S1, for each data set, divide it into a training set and a test set according to the ten-fold cross-validation principle;
S2, train a standard online random forest on the training set of S1;
S3, split the training set of S1 into majority and minority classes, and construct a new data set of the same size as the training set using the K-nearest-neighbor algorithm and SMOTE;
S4, train an online random forest of the same size as in S2 on the new data set of S3;
S5, merge the two online random forests into one and jointly test the test set of S1.
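The five steps can be sketched with scikit-learn's batch `RandomForestClassifier` standing in for the online random forests (assumptions: scikit-learn has no online forest; the S3 set here is a plain resampling stand-in for the KNN+SMOTE construction; and the S5 merge concatenates the fitted trees of both forests, a known scikit-learn idiom rather than the patent's literal procedure):

```python
import copy
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# S1: an imbalanced two-class problem, split into train / test sets.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# S2: first forest, trained on the original training set.
rf1 = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# S3: minority-focused set (the real method uses KNN + SMOTE; here we simply
# oversample the minority class and add some majority samples).
rng = np.random.RandomState(0)
idx_min = np.where(y_tr == 1)[0]
idx_maj = rng.choice(np.where(y_tr == 0)[0], size=len(idx_min), replace=False)
idx_up = rng.choice(idx_min, size=3 * len(idx_min), replace=True)
idx = np.concatenate([idx_min, idx_up, idx_maj])

# S4: second forest of the same size, trained on the new set.
rf2 = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_tr[idx], y_tr[idx])

# S5: merge the trees of both forests and classify the test set jointly.
merged = copy.deepcopy(rf1)
merged.estimators_ = rf1.estimators_ + rf2.estimators_
merged.n_estimators = len(merged.estimators_)
pred = merged.predict(X_te)
```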
The invention has the following beneficial effects: the method combines a classification algorithm with the information contained in the data set so that the two jointly classify the unbalanced data. Specifically, the SMOTE and K-nearest-neighbor algorithms process the original training data to oversample the minority class and undersample the majority class; two online random forests of the same size are then trained as classifiers for the original data and the newly built data respectively, and finally fused into one forest to test the test set.
Drawings
FIG. 1 is a general flow chart for implementing an unbalanced data set classification algorithm;
FIG. 2 is a generalized algorithm diagram for implementing an unbalanced data set classification algorithm;
FIG. 3 is a schematic diagram of the K-nearest neighbor algorithm;
FIG. 4 is a diagram of the results of the SMOTE algorithm;
wherein, fig. 4(a) is a schematic diagram of neighbors found in the SMOTE algorithm, and fig. 4(b) is a schematic diagram of synthesized few-class samples in the SMOTE algorithm;
FIG. 5 is a rough diagram of online random forest implementation classification.
Detailed Description
When handling the classification of an unbalanced data set, searching for and exploiting the relation between the minority and majority classes of the original data can effectively improve classification performance. The choice of classifier is equally important: if the classification algorithm can be fused with the information in the data set so that the two jointly classify the unbalanced data, performance improves greatly and generalization is good. Based on this idea, the invention uses the familiar SMOTE and K-nearest-neighbor algorithms to process the original data, oversampling the minority class and undersampling the majority class; it then trains two online random forests of the same size as classifiers for the original data and the newly built data respectively, and finally fuses them into one forest to test the test set.
Fig. 1 is a flowchart of a scheme of the present invention, and this embodiment takes network intrusion detection as an example to explain the contents of the present invention:
In the first step, the original unbalanced data corresponding to network intrusion are obtained. Specifically, the unbalanced data of this embodiment comprise 9 real-world unbalanced data sets from the UCI machine learning repository and 9 artificial unbalanced data sets from the KEEL repository. Each original unbalanced data set is first divided into a training set and a test set according to the ten-fold cross-validation principle;
then the training set is divided into a majority-class set Xmaj and a minority-class set Xmin. For each data set, the forest size T is determined, together with an appropriate Q (Q is a variable integer, set per data set in this embodiment, chosen so that the new data set generated in the subsequent steps is comparable in size to the original data set).
In the second step, as shown in Fig. 2, a new data set is constructed using KNN and SMOTE. Fig. 3 is a schematic diagram of the K-nearest-neighbor algorithm: to decide which class the sample represented by the circle marked "?" at the center belongs to, its K nearest neighbors are found, and the class with the most neighbors among them is taken as the sample's class. The present invention uses only the nearest-neighbor search part of this procedure. Specifically:
using the K-nearest-neighbor algorithm, select the k majority-class samples and the k minority-class samples nearest to each minority-class sample; a variable integer Q is set per data set, and for each minority-class sample the following operations are performed:
randomly select an integer q, randomly draw q samples from the k majority-class neighbors, and de-duplicate any repeats among them, so that at most q samples are finally kept;
as shown in Fig. 4, randomly construct (Q − q) synthetic minority samples from the k minority-class neighbors using the SMOTE strategy; in Fig. 4(a), xi is the minority sample currently selected as the center, and the sample pointed to by the arrow is one of its currently selected k nearest minority-class neighbors; in Fig. 4(b), the block on the segment between xi and that neighbor, labeled "Generated Synthetic Instance", is a newly constructed minority sample;
combine the constructed samples and the original minority-class samples into a new sample set;
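One reading of this second step, for a single minority sample, can be sketched as follows (helper names and the exact handling of q are our interpretation, not the patent's literal procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def k_nearest(center, candidates, k):
    """Return the k rows of `candidates` closest to `center` in Euclidean distance."""
    d = np.sqrt(((candidates - center) ** 2).sum(axis=1))
    return candidates[np.argsort(d)[:k]]

def build_for_minority_sample(x, X_maj, X_min, k=3, Q=6):
    """For one minority sample x: keep up to q de-duplicated nearby majority
    samples and synthesize (Q - q) minority samples with SMOTE among the k
    nearest minority neighbours."""
    maj_nn = k_nearest(x, X_maj, min(k, len(X_maj)))
    min_nn = k_nearest(x, X_min, min(k, len(X_min)))
    q = int(rng.integers(1, Q))                         # random integer q < Q
    picked = maj_nn[rng.integers(0, len(maj_nn), size=q)]
    picked = np.unique(picked, axis=0)                  # de-duplicate: at most q remain
    synth = [x + rng.random() * (min_nn[rng.integers(0, len(min_nn))] - x)
             for _ in range(Q - q)]
    return picked, np.array(synth)

x = np.array([0.0, 0.0])
X_maj = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [5.0, 1.0]])
X_min = np.array([[0.5, 0.5], [2.0, 2.0], [0.0, 1.0]])
picked, synth = build_for_minority_sample(x, X_maj, X_min)
```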
In the third step, an online random forest of size T is trained on the new sample set and another on the original unbalanced training set; the forest trained on the original unbalanced training set is denoted ORF1, and the forest trained on the new sample set of the second step is denoted ORF2.
In the fourth step, ORF1 and ORF2 are merged into one random forest, which jointly tests the test set obtained in the first step to produce the classification result. Table 1 compares the classification results of the method of the invention with the prior art. To show the effectiveness of the proposed algorithm clearly, algorithms of the same type were selected for comparison: RF, BRAF, TempC and AdaBoost, with FORF-S denoting the method of the invention; the evaluation index is G-mean (%). Table 1 shows that the proposed FORF-S model performs best on animal age prediction, disease assessment monitoring and the artificial unbalanced data, with an improvement of up to 20% over the best of the other methods (Hepatitis data set). Although TempC performs best on the vehicle evaluation (Car-good) and protein cellular-localization (Yeast) data sets, our model is comparable and competitive.
TABLE 1 comparison of the classification results obtained by the method of the invention with the prior art
[Table 1 appears as an image in the original publication; its numerical values are not reproduced here.]
In Table 1, Dataset denotes the data set: Abalone19 predicts the age of abalone from physical measurements; Breast is breast-cancer recurrence data; Car-good is vehicle evaluation data; Haberman is survival data of breast-cancer patients; Hepatitis is survival data of hepatitis patients; Yeast is protein cellular-localization prediction data; Clover70, Paw70 and Subc70 come from the KEEL repository, and the rest from the UCI machine learning repository.
When a new sample (xi, yi) arrives, the forest is updated once, as shown in Fig. 5. The bars beside each node represent samples of different classes, and each update aims to classify the samples better. Specifically: starting from the root node at the top, each bar represents one class and its length represents the number of samples of that class; walking downward, each node branches over different classes with different sample counts, so at each node the bars may have different lengths.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, which is not limited to the specifically recited embodiments and examples. Various modifications and alterations will occur to those skilled in the art; any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the scope of its claims.

Claims (6)

1. A method for classifying unbalanced data sets, characterized in that two classifiers are constructed, respectively: a first classifier constructed from an original unbalanced data set, and a second classifier constructed from the minority-class samples of the original unbalanced data set; the unbalanced data set is then classified by a third classifier obtained by merging the first classifier and the second classifier.
2. A method for classification of an unbalanced data set as claimed in claim 1, wherein the first classifier is a first online random forest obtained by a process comprising:
dividing the acquired original unbalanced data set into two parts, wherein one part is used as a training set, and the other part is used as a test set;
and training according to the training set to obtain a first online random forest.
3. The method for classifying an unbalanced data set according to claim 2, wherein the collected original unbalanced data set is divided into a training set and a testing set according to a ten-fold cross validation principle.
4. A method for classification of an unbalanced data set as claimed in claim 2, wherein the second classifier is a second online random forest obtained by a process comprising:
dividing the training set into a majority sample set and a minority sample set, and constructing a new data set with the same size as the training set by using a K neighbor algorithm and SMOTE;
and training a second online random forest with the same size as the first online random forest by using the new data set.
5. The method for classifying an unbalanced data set according to claim 4, further comprising classifying the test set using a third classifier to obtain a classification result.
6. The method for classifying an unbalanced data set according to claim 5, wherein the construction of the new data set specifically comprises:
for each sample in the minority class, calculating the distance from the sample to all samples in the minority class sample set by taking the Euclidean distance as a standard to obtain k neighbor of the sample;
setting a sampling proportion as a sampling multiplying factor according to the sample unbalance proportion, and randomly selecting a plurality of samples from k neighbors of each minority sample x according to the sampling multiplying factor to be used as the neighbors of the minority sample x;
for each randomly selected neighbor xn, respectively constructing a new sample with the corresponding few class samples according to the following formula:
xnew=x+rand(0,1)*|x-xn|;
where xn is the neighbor corresponding to the minority sample x.
CN201911256817.4A 2019-12-10 2019-12-10 Method for classifying unbalanced data sets Pending CN110991653A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911256817.4A CN110991653A (en) 2019-12-10 2019-12-10 Method for classifying unbalanced data sets

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911256817.4A CN110991653A (en) 2019-12-10 2019-12-10 Method for classifying unbalanced data sets

Publications (1)

Publication Number Publication Date
CN110991653A true CN110991653A (en) 2020-04-10

Family

ID=70091753

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911256817.4A Pending CN110991653A (en) 2019-12-10 2019-12-10 Method for classifying unbalanced data sets

Country Status (1)

Country Link
CN (1) CN110991653A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111967343A (en) * 2020-07-27 2020-11-20 广东工业大学 Detection method based on simple neural network and extreme gradient lifting model fusion
CN112465245A (en) * 2020-12-04 2021-03-09 复旦大学青岛研究院 Product quality prediction method for unbalanced data set
CN112836735A (en) * 2021-01-27 2021-05-25 中山大学 Optimized random forest processing unbalanced data set method
CN112836735B (en) * 2021-01-27 2023-09-01 中山大学 Method for processing unbalanced data set by optimized random forest
CN112932497A (en) * 2021-03-10 2021-06-11 中山大学 Unbalanced single-lead electrocardiogram data classification method and system
CN113178264A (en) * 2021-05-04 2021-07-27 温州医科大学附属第一医院 Deep muscle layer infiltration data prediction method and system
CN113553580A (en) * 2021-07-12 2021-10-26 华东师范大学 Intrusion detection method for unbalanced data
CN113628697A (en) * 2021-07-28 2021-11-09 上海基绪康生物科技有限公司 Random forest model training method for classification unbalance data optimization
CN113689053A (en) * 2021-09-09 2021-11-23 国网安徽省电力有限公司电力科学研究院 Strong convection weather overhead line power failure prediction method based on random forest
CN113689053B (en) * 2021-09-09 2024-03-29 国网安徽省电力有限公司电力科学研究院 Strong convection weather overhead line power failure prediction method based on random forest
CN115050477A (en) * 2022-06-21 2022-09-13 河南科技大学 Bayesian optimization based RF and LightGBM disease prediction method
CN115050477B (en) * 2022-06-21 2023-06-20 河南科技大学 Bethes-optimized RF and LightGBM disease prediction method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination