CN113792765A - Oversampling method based on triangular centroid weight

Oversampling method based on triangular centroid weight

Info

Publication number
CN113792765A
CN113792765A
Authority
CN
China
Prior art keywords
sample
centroid
weight
danger
samples
Prior art date
Legal status
Pending
Application number
CN202110976931.5A
Other languages
Chinese (zh)
Inventor
周红芳
陈佳琳
Current Assignee
Shenzhen Wanzhida Technology Co ltd
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202110976931.5A
Publication of CN113792765A
Legal status: Pending

Classifications

    • G06F 18/24323 Pattern recognition; classification techniques; tree-organised classifiers
    • G06F 18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G06N 3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming


Abstract

The invention discloses an oversampling method based on triangular centroid weight, comprising the following steps: step 1, quantize the samples to be processed into numerical values and then calculate the feature weights; step 2, extract the danger-class samples from the quantized samples; step 3, search for the neighbor samples of each danger sample; step 4, randomly select two neighbors of each danger sample and compute the centroid of the triangle formed by the three points to obtain a centroid sample; step 5, multiply each coordinate of the centroid by the corresponding feature weight to obtain an offset centroid, the offset centroids together forming the centroid offset samples; step 6, determine a weight coefficient for the centroid offset samples by a genetic algorithm and multiply it by the offset centroid to obtain the final new samples. The invention addresses the limitations of traditional methods, in which a new sample is synthesized on the straight line between two points, so the space of synthesized samples is restricted to that segment and little information is extracted from the original samples.

Description

Oversampling method based on triangular centroid weight
Technical Field
The invention belongs to the technical field of data mining and machine learning data processing, and relates to an oversampling method based on triangular centroid weight.
Background
With the advent of the big data age, a great variety of data floods into our lives, and imbalanced data is one of its typical representatives. Imbalanced data refers to an uneven distribution of samples among the different classes, and the classification of imbalanced data is a ubiquitous problem in artificial intelligence and data mining. In the binary or multi-class problems affected by imbalance, the class with more samples is called the majority or positive class, and the class with fewer samples the minority or negative class. Traditional classification algorithms are mainly designed for data with a fairly balanced distribution; when processing imbalanced data, the classifier becomes inefficient and has difficulty identifying the minority-class samples in the set. The classification performance of the classifier is therefore critical when handling imbalanced samples.
Because imbalanced data exists widely, the invention can be applied in many fields. In the real world, problems such as disease diagnosis and credit assessment often require accurate classification, and sample imbalance frequently makes this very difficult. For example, the diagnosis of COVID-19 patients involves a large amount of personal data: features such as sex, age, weight, blood pressure and lung information form one person's sample, and many such samples form the data set, in which the classes are patients and non-patients. If only 10 out of 1000 people are ill, the patients always make up a small proportion, i.e. the minority class, while the non-patients are the majority; misclassifying a patient as a non-patient would be disastrous. Similarly, in bank credit assessment, age, income, purchasing power and so on can serve as the features of a person's sample for judging whether, and how much, credit should be granted; the number of people with low credit is always small, and the resulting data imbalance must be handled well.
Among methods for processing imbalanced samples, sampling techniques such as undersampling, oversampling and hybrid sampling are widely applied. Oversampling balances the two classes by increasing the number of minority-class samples, thereby improving classification performance. However, traditional random oversampling merely duplicates the original minority samples: it extracts little information from them, and the model learns repeated rather than generalized information, so overfitting arises very easily. On this basis, researchers gradually proposed classical oversampling methods such as SMOTE and Borderline-SMOTE.
SMOTE is an improvement on random oversampling, as shown in Fig. 2. For each minority sample it computes the k nearest neighbors, sets a sampling ratio according to the class imbalance to determine the sampling magnification, selects suitable neighbors, and synthesizes a new sample according to the following formula; the samples generated on the connecting line are treated as new samples with minority-class features and added to the training set.
x_new = x + rand(0,1) * (x' - x) (1)
In formula (1), x_new is the finally synthesized sample, x is an input minority-class sample, x' is a selected neighbor of x, and rand(0,1) is a random number between 0 and 1. New samples are synthesized by this formula until the sampling rate is reached.
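The interpolation of formula (1) can be sketched as follows; the function name and the use of NumPy are illustrative assumptions, not part of the patent:

```python
import numpy as np

def smote_interpolate(x, x_neighbor, rng=np.random.default_rng(0)):
    """Synthesize one sample on the segment between a minority sample x
    and a chosen neighbor x', per formula (1): x_new = x + rand(0,1)*(x' - x)."""
    x = np.asarray(x, dtype=float)
    x_neighbor = np.asarray(x_neighbor, dtype=float)
    gap = rng.random()                 # random number in [0, 1)
    return x + gap * (x_neighbor - x)
```

The synthesized point always lies on the segment between the two inputs, which is exactly the limitation the invention later addresses.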
However, the SMOTE algorithm still has problems. On one hand, a suitable neighbor count, i.e. the value of k, must be chosen, and neighbors are then selected at random; this parameter cannot be determined effectively and often requires repeated experiments and demonstration. On the other hand, the distribution of the data is fixed and marginalization easily occurs: some minority (negative-class) samples lie at the class edge, so synthesized samples gradually drift toward the edge, the boundary between positive and negative samples becomes blurred, and classification becomes harder. The Borderline-SMOTE algorithm was proposed to solve this problem.
The Borderline-SMOTE algorithm improves on SMOTE and currently comes in two variants, Borderline-SMOTE1 and Borderline-SMOTE2. The algorithm separates the majority and minority classes of the training set and, for each minority sample, finds its k nearest neighbors and counts the number m of majority-class samples among them (k >= m >= 0). If m = k, the minority (negative-class) sample lies entirely among the majority class, is judged to be noise, and is skipped; if m is at least half of k, the sample is considered easily misclassified and is called danger; if m is less than half of k, the sample is considered safe and is skipped. Borderline-SMOTE1 operates on the danger samples: it finds the k' nearest neighbors of each danger sample within the minority-class set, then randomly selects among them and synthesizes new samples by the formula above until the required sampling multiple is reached.
Disclosure of Invention
The invention aims to provide an oversampling method based on triangular centroid weight, solving the problems of the prior art, in which a new sample is synthesized on the straight line between two points, so the space of synthesized samples is restricted to that segment and little information is extracted from the samples.
The technical scheme adopted by the invention is as follows:
an oversampling method based on triangular centroid weight applies the initial operation of Borderline-SMOTE method, divides data into noise (noise), danger (danger) and safety (safe), then selects similar neighbors for danger samples, and determines the final position of a new sample according to weight and weight coefficient to strengthen the characteristics of related samples, and the specific steps are as follows:
step 1, quantize the samples to be processed into numerical values and calculate the feature weights of the quantized samples;
step 2, extract the danger-class samples from the quantized samples;
step 3, search for the neighbor samples of each danger sample;
step 4, randomly select two neighbors from the neighbor samples of each danger sample, and compute the centroid of the triangle formed by the three points to obtain a centroid sample;
step 5, multiply each coordinate of the centroid sample by the feature weight obtained in step 1 to obtain an offset centroid; all the offset centroids form the centroid offset samples;
step 6, determine a weight coefficient for the centroid offset samples by a genetic algorithm, and multiply it by the offset centroid to finally obtain the new samples.
The present invention is also characterized in that,
in step 1, the feature weight is calculated by a Relief method.
The danger-class samples are extracted as follows: applying the idea of the Borderline-SMOTE method, divide the samples to be processed into majority and minority classes, and for each minority sample find its k nearest neighbors and count the number m of majority-class samples among them (k >= m >= 0). If m = k, the minority sample lies entirely among the majority class, is judged to be noise, and is skipped; if m is half of k or more, the sample is considered easily misclassified and is called danger, i.e. the class of samples to be obtained; if m is less than half of k, the sample is considered safe and is skipped.
The calculation mode of the triangle centroid coordinate is as follows:
Centroid=(D(A)+D(B)+D(C))/3 (3)
in formula (3), Centroid is the coordinate of the triangle centroid, D(A) is the coordinate of the danger sample, and D(B) and D(C) are the coordinates of its two selected neighbors.
The synthesis method of the new sample comprises the following steps:
New sample = Centroid * Feature Weight * Weight Coefficient (4)
where Centroid is the coordinate of the triangle centroid, Feature Weight is the feature weight, and Weight Coefficient is the weight coefficient.
The weight coefficient is determined by a genetic algorithm. An initial population is generated between 0 and 1; the weight coefficients, decimal numbers encoded as binary strings, are repeatedly selected, crossed and mutated while the best individuals are retained, and the algorithm terminates after the set number of iterations, yielding the optimal weight coefficient and the classification result. The initial population size is 10, 20 generations are iterated, tournament selection and two-point crossover are used, the crossover probability is 0.7, and the mutation probability is the reciprocal of the chromosome length.
The invention has the beneficial effects that:
firstly, the invention applies the initial operation of the Borderline-SMOTE method to divide the data into three parts, noise, danger and safe, then selects the same-class neighbors of the danger samples and determines the final position of each new sample according to the feature weights and a weight coefficient, so as to strengthen the relevant sample features.
Secondly, using the accuracy and the area under the ROC curve (AUC) as evaluation indices on a decision tree (Gini) classifier and a KNN classifier respectively, the method ultimately achieves better classification results than Borderline-SMOTE1.
Drawings
FIG. 1 is a flow chart of an oversampling method based on triangular centroid weights of the present invention;
fig. 2 is a schematic diagram of the SMOTE oversampling process.
FIG. 3 is a flow chart of a method for obtaining weight coefficients in an oversampling method based on triangular centroid weights according to the present invention;
FIG. 4 is a comparison of AUC values as the evaluation criterion using a decision tree classifier;
FIG. 5 is a comparison of AUC values as the evaluation criterion using a KNN classifier;
FIG. 6 is a comparison of accuracy as the evaluation criterion using a decision tree classifier;
FIG. 7 is a comparison of accuracy as the evaluation criterion using a KNN classifier.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention relates to an oversampling method based on triangular centroid weight. As shown in Fig. 1, it applies the initial operation of the Borderline-SMOTE method to divide the data into three parts, noise, danger and safe, then selects same-class neighbors for the danger samples and determines the final position of each new sample according to the feature weights and a weight coefficient, so as to strengthen the relevant sample features. The specific steps are as follows:
step 1, quantize the samples to be processed into numerical values and calculate the feature weights of the quantized samples;
step 2, extract the danger-class samples from the quantized samples;
step 3, search for the neighbor samples of each danger sample;
step 4, randomly select two neighbors from the neighbor samples of each danger sample, and compute the centroid of the triangle formed by the three points to obtain a centroid sample;
step 5, multiply each coordinate of the centroid sample by the feature weight obtained in step 1 to obtain an offset centroid; all the offset centroids form the centroid offset samples;
step 6, determine a weight coefficient for the centroid offset samples by a genetic algorithm, and multiply it by the offset centroid to finally obtain the new samples.
In the step 1:
the operation of quantizing the values is done in euclidean space;
wherein the feature weights are calculated by the Relief method:
the Relief method can give different weights to the features according to the types and the correlations of the features, and finally, a weight set of each feature with the threshold as the center can be obtained according to the set threshold. The algorithm randomly selects a sample B from the training set, then finds a nearest neighbor sample H from samples of the same class of B, called Hit, and finds a nearest neighbor sample M from samples of a different class from M, called Miss. If the distance between B and Hit on a certain feature is smaller than the distance between B and Miss, the feature is beneficial to classification, and the weight of the feature is increased; conversely, if the distance between B and Hit is greater than the distance between B and Miss, indicating that the feature negatively affects the classification, the weight of the feature is decreased.
We set the initial weight to 1 and calculate the weight as follows:
W(A)=W(A)-diff(A,B,H)/m+diff(A,B,M)/m (2)
in equation (2), W(A) is the weight of feature A, m is the number of rounds, and the two diff terms are the distances between the sample and Hit and Miss respectively on feature A; looping feature A from 1 to N yields the first-generation weight values. The above process is then repeated m times, finally giving the average weight of each feature.
Considering the influence of the initial weight setting, the output average weights are renormalized as percentages, so that each feature weight is less than 1 while the relative order among them is unchanged.
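A minimal sketch of the Relief update in formula (2), assuming the per-feature diff is the absolute coordinate difference; the final normalization mirrors the percentage rescaling described above (function name and interface are illustrative):

```python
import numpy as np

def relief_weights(X, y, m=100, rng=np.random.default_rng(0)):
    """Relief sketch: for a randomly chosen sample B, shift each feature
    weight away from its distance to the nearest Hit and toward its
    distance to the nearest Miss, averaged over m rounds (formula (2))."""
    n, d = X.shape
    W = np.ones(d)                     # initial weights set to 1, as in the text
    for _ in range(m):
        i = rng.integers(n)
        B = X[i]
        same = (y == y[i]); same[i] = False
        other = (y != y[i])
        H = X[same][np.argmin(np.linalg.norm(X[same] - B, axis=1))]  # Hit
        M = X[other][np.argmin(np.linalg.norm(X[other] - B, axis=1))]  # Miss
        W = W - np.abs(B - H) / m + np.abs(B - M) / m
    return W / W.sum()                 # renormalize so each weight is < 1
```

Features that separate the classes receive larger weights, as the document describes.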
In the step 2:
the method for extracting the danger samples comprises the following steps: and dividing a plurality of classes and a few classes in the sample to be processed by applying the Borderline-SMOTE method idea, and searching k neighbors for each few class sample to obtain the number m of the plurality of classes near the sample (k is more than or equal to m and more than or equal to 0). If m is k, the negative class samples are all near the positive class, the samples are determined to be noise (noise), and the operation is stopped; if m is half or more of k, the negative class sample is considered to be a sample which is easily misclassified, and is called danger (danger), namely a safety class sample needing to be obtained; if m is less than half of k, the negative class sample is considered safe (safe) and operation is stopped.
In the step 3:
after the danger sample is extracted, its neighbors need to be found. And finding the neighbor of the danger sample in a few samples of the original training set as a parent sample for subsequently synthesizing a new sample. The algorithm can find sample points needing attention and make the method of taking the points closer to a few types of sample points which have trends but cannot be easily distinguished, wherein k is 5.
In the triangular centroid weight-based oversampling method, the idea of searching for neighbors of danger samples among a small number of classes is also applied.
In the step 4:
the calculation mode of the triangle centroid coordinate is as follows:
Centroid=(D(A)+D(B)+D(C))/3 (3)
in formula (3), Centroid is the coordinate of the triangle centroid, D(A) is the coordinate of the danger sample, and D(B) and D(C) are the coordinates of its two selected neighbors.
In the step 5: after the centroid is obtained in step 4, the feature weight output according to step 1 is multiplied by the centroid, so that our newly generated sample is not limited to a fixed position, but is shifted within a reasonable range according to the importance of each feature.
We determine how many new samples of the minority class are synthesized according to the input sampling rate, and the purpose of the algorithm is to balance the unbalanced samples, so the number of synthesized new samples should be the difference between the minority sample and the majority sample in the training set, and the danger class samples will be synthesized all the time before the number of samples is not synthesized.
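Steps 4 and 5, the triangle centroid of formula (3) followed by the feature-weight offset, can be sketched as:

```python
import numpy as np

def offset_centroid(danger_sample, neighbor1, neighbor2, feature_weights):
    """Centroid of the triangle formed by a danger sample and two of its
    minority-class neighbors (formula (3)), with each coordinate then
    multiplied by its feature weight to produce the offset centroid."""
    centroid = (np.asarray(danger_sample, float)
                + np.asarray(neighbor1, float)
                + np.asarray(neighbor2, float)) / 3.0
    return centroid * np.asarray(feature_weights, float)
```

For a danger sample at the origin with neighbors (3, 0) and (0, 3), the centroid is (1, 1); feature weights below 1 then pull each coordinate toward zero in proportion to feature importance.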
In the step 6:
the concept of the weight coefficient is introduced, because the feature weight obtained by optimization in step 1 still cannot accurately guide the mass center to shift to a proper position, therefore, a weight coefficient between 0 and 1 is set, the front part is opened and the rear part is closed, a new sample is obtained by multiplying the weight coefficient by the shifted mass center, and the new sample can effectively reach the most proper position.
Referring to Fig. 3, the weight coefficient is determined by a genetic algorithm. An initial population is generated between 0 and 1; the weight coefficients, decimal numbers encoded as binary strings, are repeatedly selected, crossed and mutated, the population achieving the best evaluation index is retained, and the algorithm terminates after the set number of iterations, at which point the best weight coefficient and the classification result are obtained. The initial population size is set to 10, 20 generations are iterated, tournament selection and two-point crossover are used, the crossover probability is 0.7, and the mutation probability is the reciprocal of the chromosome length, as shown in Table 1.
TABLE 1 Genetic algorithm parameter settings

Parameter                Value
Population size          10
Number of generations    20
Selection                Tournament
Crossover                Two-point
Crossover probability    0.7
Mutation probability     1 / chromosome length
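A compact sketch of the genetic search for the weight coefficient with the parameters above; the bit length, the decoding into (0, 1], and the fitness interface are illustrative assumptions, not specified by the patent:

```python
import random

def ga_weight_coefficient(fitness, n_bits=10, pop_size=10, generations=20,
                          p_cross=0.7, seed=0):
    """Binary-string GA: tournament selection, two-point crossover with
    probability 0.7, mutation probability 1/chromosome length, best
    individual retained; `fitness` maps a coefficient in (0, 1] to a
    score to maximize (e.g. a classifier's evaluation index)."""
    rng = random.Random(seed)
    p_mut = 1.0 / n_bits
    decode = lambda bits: (int("".join(map(str, bits)), 2) + 1) / (2 ** n_bits)

    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=lambda b: fitness(decode(b)))
    for _ in range(generations):
        def tournament():
            a, b = rng.sample(pop, 2)
            return list(max(a, b, key=lambda c: fitness(decode(c))))
        children = []
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < p_cross:            # two-point crossover
                i, j = sorted(rng.sample(range(n_bits), 2))
                p1[i:j], p2[i:j] = p2[i:j], p1[i:j]
            for child in (p1, p2):
                for k in range(n_bits):
                    if rng.random() < p_mut:      # bit-flip mutation
                        child[k] ^= 1
                children.append(child)
        pop = children[:pop_size]
        cand = max(pop, key=lambda b: fitness(decode(b)))
        if fitness(decode(cand)) > fitness(decode(best)):
            best = cand                            # retain the best individual
    return decode(best)
```

In the patent's setting, the fitness would be the classification result obtained after synthesizing samples with the candidate coefficient.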
The new sample is synthesized as follows:
New sample = Centroid * Feature Weight * Weight Coefficient (4)
where Centroid is the coordinate of the triangle centroid, Feature Weight is the weight of each feature, and Weight Coefficient is the weight coefficient.
The method of the invention uses 10-fold cross-validation to evaluate classification accuracy and the area under the ROC curve (AUC) on the 7 datasets, with decision trees (DT) and KNN as classifiers. k-fold cross-validation divides the selected dataset evenly into k folds, uses 1 fold as the test set and the other k-1 folds as the training set, and so on, giving k trained and tested models in total; the average over the k validations is taken as the result for both accuracy and AUC. In the invention, 10-fold cross-validation is repeated 10 times, and the average of the 100 evaluations is taken as the final result.
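The evaluation protocol above, k-fold splitting repeated several times with the scores averaged, can be sketched as follows; the `fit_predict` callback is an illustrative stand-in for any classifier:

```python
import numpy as np

def repeated_kfold_scores(X, y, fit_predict, k=10, repeats=10, seed=0):
    """Shuffle the data, split it into k folds, use each fold once as the
    test set and the rest as training, repeat the whole procedure
    `repeats` times, and average all k * repeats accuracy scores."""
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = []
    for _ in range(repeats):
        order = rng.permutation(n)
        folds = np.array_split(order, k)
        for f in folds:
            test = np.zeros(n, dtype=bool)
            test[f] = True
            y_pred = fit_predict(X[~test], y[~test], X[test])
            scores.append(np.mean(y_pred == y[test]))
    return float(np.mean(scores))      # average of the k * repeats evaluations
```

With k = 10 and repeats = 10 this yields exactly the 100 evaluations averaged in the text.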
Accuracy and the area under the receiver operating characteristic (ROC) curve, AUC, are commonly used classifier evaluation metrics in the classification of imbalanced data.
The accuracy is the most common classifier evaluation index, and the calculation method is as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
where:
TP (true positive) is the number of instances that are actually positive and are classified as positive by the classifier;
TN (true negative) is the number of instances that are actually negative and are classified as negative by the classifier;
FP (false positive) is the number of instances that are actually negative but are classified as positive by the classifier;
FN (false negative) is the number of instances that are actually positive but are classified as negative by the classifier.
The receiver operating characteristic (ROC) curve is obtained by plotting the false positive rate on the abscissa against the hit rate (true positive rate) on the ordinate at different classification thresholds; in the unit square of the first quadrant, the closer the curve is to the upper-left corner, the better the classifier's performance is generally considered. However, to avoid unintuitive comparisons when curves cross, the area under the ROC curve (AUC) is used as the evaluation index: the larger the AUC value, the better the classification performance.
where:
FPR = FP / (FP + TN)
TPR = TP / (TP + FN)
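AUC can equivalently be computed as the probability that a randomly chosen positive instance is scored above a randomly chosen negative one; a pure-Python sketch (function name illustrative):

```python
def auc_from_scores(y_true, scores):
    """Pairwise-rank formulation of AUC: over all positive/negative
    pairs, count those where the positive scores higher (ties count
    one half); equivalent to the area under the ROC curve traced by
    TPR against FPR across thresholds."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking gives 1.0, a random one about 0.5, which is why larger AUC indicates better classification.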
to verify the effectiveness of the present invention, the KEEL Dataset is used as the sample to be processed, the KEEL Dataset is described in Table 2, Dataset is the name of the Dataset, Instances is the sample volume in the Dataset, Features is the feature number of the Dataset, and Classes is the class number of the Dataset.
The invention is verified on each of the 8 datasets with the present method, using two classification algorithms, decision tree and KNN. Experiments show that the proposed oversampling method based on triangular centroid weight achieves higher accuracy and AUC values than Borderline-SMOTE1 on the classification of the 8 datasets.
TABLE 2 KEEL data set description
[Table 2 is shown as an image in the original document and is not reproduced here.]
Table 3 compares the two methods with classification accuracy as the evaluation criterion, and Table 4 compares them with AUC. In both tables, "My Method" is the method provided by the invention, DT indicates that the base classifier is a decision tree, and KNN indicates the K-nearest-neighbor classification algorithm.
TABLE 3 Comparison of classification with accuracy as the evaluation criterion (%)
[Table 3 is shown as an image in the original document and is not reproduced here.]
TABLE 4 Comparison of classification with AUC values as the evaluation criterion
[Table 4 is shown as an image in the original document and is not reproduced here.]
Figs. 4 to 7 in the specification are experimental comparison charts of the two algorithms classifying the oversampled samples under different evaluation indices. The Borderline-SMOTE1 algorithm is on the left, and the method proposed by the invention is on the right.
FIG. 4 is a comparative bar chart of the AUC values of the two algorithms with DT as the base classifier; experiments show that the proposed oversampling algorithm achieves higher AUC values than the Borderline-SMOTE1 algorithm on the 8 datasets.
Fig. 5 is a comparative bar chart of the AUC values of the two algorithms with KNN as the base classifier; experiments show that the proposed oversampling algorithm achieves higher AUC values than the Borderline-SMOTE1 algorithm on the 8 datasets.
FIG. 6 is a comparative bar chart of the accuracy of the two algorithms with DT as the base classifier; experiments show that the proposed oversampling algorithm achieves higher accuracy than the Borderline-SMOTE1 algorithm on the 8 datasets.
Fig. 7 is a comparative bar chart of the accuracy of the two algorithms with KNN as the base classifier; experiments show that the proposed oversampling algorithm achieves higher accuracy than the Borderline-SMOTE1 algorithm on the 8 datasets.

Claims (6)

1. An oversampling method based on triangular centroid weight, characterized in that the initial operation of the Borderline-SMOTE method is applied to divide the data into three parts, noise, danger and safe, then same-class neighbors of the danger samples are selected, and the final position of each new sample is determined according to the feature weights and a weight coefficient so as to strengthen the relevant sample features, the method specifically comprising the following steps:
step 1, quantizing the samples to be processed into numerical values, and calculating the feature weights of the quantized samples;
step 2, extracting the danger-class samples from the quantized samples;
step 3, searching for the neighbor samples of each danger sample;
step 4, randomly finding two neighbor samples of each danger sample, and calculating the centroid of the triangle formed by the three points to obtain a centroid sample;
step 5, multiplying each coordinate of the centroid sample by the feature weight obtained in step 1 to obtain an offset centroid, all the offset centroids forming the centroid offset samples;
step 6, determining a weight coefficient for the centroid offset samples by a genetic algorithm, and multiplying it by the offset centroid to finally obtain the new samples.
2. The triangular centroid weight-based oversampling method as claimed in claim 1, wherein said feature weight is calculated in said step 1 by a Relief method.
3. The triangular centroid weight-based oversampling method as claimed in claim 1, wherein said danger-class samples are extracted by: applying the idea of the Borderline-SMOTE method, dividing the samples to be processed into majority and minority classes, and for each minority sample finding its k nearest neighbors and counting the number m of majority-class samples among them (k >= m >= 0); if m = k, the minority sample lies entirely among the majority class, is judged to be noise, and is skipped; if m is half of k or more, the sample is considered easily misclassified and is called danger, i.e. the class of samples to be obtained; if m is less than half of k, the sample is considered safe and is skipped.
4. The triangular centroid weight-based oversampling method according to claim 1, wherein the triangular centroid coordinates are calculated by:
Centroid = (D(A) + D(B) + D(C)) / 3    (3)
In formula (3), Centroid is the coordinate of the triangle centroid, D(A) is the coordinate of the danger sample itself, and D(B) and D(C) are the coordinates of its two selected neighbors.
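Formula (3) is an elementwise average over the feature vectors; a one-line sketch:

```python
import numpy as np

def triangle_centroid(danger, nb1, nb2):
    """Centroid = (D(A) + D(B) + D(C)) / 3, elementwise over features."""
    return (np.asarray(danger, float) + np.asarray(nb1, float)
            + np.asarray(nb2, float)) / 3.0
```

For the triangle (0, 0), (3, 0), (0, 3) this yields the centroid (1, 1).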
5. The triangular centroid weight-based oversampling method according to claim 1, wherein said new sample is synthesized by:
Newsample = Centroid * Feature Weight * Weight Coefficient    (4)
In formula (4), Centroid is the coordinate of the triangle centroid, Feature Weight is the weight of each feature, and Weight Coefficient is the weight coefficient determined by the genetic algorithm.
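Formula (4) combines an elementwise feature-weight product with a scalar weight coefficient; a minimal sketch:

```python
import numpy as np

def synthesize(centroid, feature_weight, weight_coefficient):
    """Newsample = Centroid * Feature Weight * Weight Coefficient (eq. 4):
    elementwise over features, scaled by a scalar weight coefficient."""
    return (np.asarray(centroid, float)
            * np.asarray(feature_weight, float)
            * float(weight_coefficient))
```

Each feature of the centroid is scaled by its Relief weight, then the whole vector is scaled by the GA-optimized coefficient to place the new minority sample.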
6. The triangular centroid weight-based oversampling method as claimed in claim 1, wherein the weight coefficient is determined by a genetic algorithm: an initial population of values between 0 and 1 is generated, each decimal weight coefficient is encoded as a binary string, and selection, crossover, mutation and elitist retention are applied repeatedly; the algorithm terminates after a set number of iterations, yielding the optimal weight coefficient and classification result. The initial population size is set to 10 and the algorithm runs for 20 generations; selection uses tournament selection, crossover uses two-point crossover with probability 0.7, and the mutation probability is the reciprocal of the chromosome length.
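The GA settings of claim 6 can be sketched as follows. This is an illustrative implementation, not the patent's: the `fitness` function is assumed to be supplied by the caller (in the patent it would score the classification result obtained with a candidate coefficient), and the 16-bit chromosome length is a chosen assumption.

```python
import random

def ga_weight_coefficient(fitness, n_bits=16, pop_size=10, n_gen=20,
                          cx_prob=0.7, seed=0):
    """Search a weight coefficient in [0, 1] with the settings of claim 6:
    population 10, 20 generations, tournament selection, two-point
    crossover (p = 0.7), mutation probability 1 / chromosome length."""
    rng = random.Random(seed)
    decode = lambda bits: int("".join(map(str, bits)), 2) / (2 ** n_bits - 1)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    mut_prob = 1.0 / n_bits                    # reciprocal of chromosome length
    for _ in range(n_gen):
        scored = [(fitness(decode(ind)), ind) for ind in pop]
        elite = max(scored)[1]                 # elitist retention

        def tournament():
            a, b = rng.sample(scored, 2)       # size-2 tournament
            return list(max(a, b, key=lambda t: t[0])[1])

        nxt = [list(elite)]
        while len(nxt) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < cx_prob:         # two-point crossover
                i, j = sorted(rng.sample(range(n_bits), 2))
                p1[i:j], p2[i:j] = p2[i:j], p1[i:j]
            for child in (p1, p2):
                for g in range(n_bits):        # bit-flip mutation
                    if rng.random() < mut_prob:
                        child[g] ^= 1
                nxt.append(child)
        pop = nxt[:pop_size]
    return decode(max(pop, key=lambda ind: fitness(decode(ind))))
```

With a smooth fitness such as the distance to a target value, the elitist copy guarantees the best coefficient found never regresses across generations.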
CN202110976931.5A 2021-08-24 2021-08-24 Oversampling method based on triangular centroid weight Pending CN113792765A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110976931.5A CN113792765A (en) 2021-08-24 2021-08-24 Oversampling method based on triangular centroid weight


Publications (1)

Publication Number Publication Date
CN113792765A true CN113792765A (en) 2021-12-14

Family

ID=79182293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110976931.5A Pending CN113792765A (en) 2021-08-24 2021-08-24 Oversampling method based on triangular centroid weight

Country Status (1)

Country Link
CN (1) CN113792765A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2198685A1 (en) * 1996-02-28 1997-08-28 Liberty Technologies, Inc. System and method for stable analysis of sampled transients arbitrarily aligned with their sample points
WO2019041629A1 (en) * 2017-08-30 2019-03-07 哈尔滨工业大学深圳研究生院 Method for classifying high-dimensional imbalanced data based on svm
CN111259964A (en) * 2020-01-17 2020-06-09 上海海事大学 Over-sampling method for unbalanced data set
CN111931853A (en) * 2020-08-12 2020-11-13 桂林电子科技大学 Oversampling method based on hierarchical clustering and improved SMOTE
CN112633337A (en) * 2020-12-14 2021-04-09 哈尔滨理工大学 Unbalanced data processing method based on clustering and boundary points


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
赵清华; 张艺豪; 马建芬; 段倩倩: "Research on a classification algorithm for imbalanced datasets based on improved SMOTE", Computer Engineering and Applications (计算机工程与应用), no. 18 *
霍玉丹; 谷琼; 蔡之华; 袁磊: "Classification algorithm for imbalanced datasets based on SMOTE improved by a genetic algorithm", Journal of Computer Applications (计算机应用), no. 01 *

Similar Documents

Publication Publication Date Title
Wei et al. NI-MWMOTE: An improving noise-immunity majority weighted minority oversampling technique for imbalanced classification problems
De Amorim Feature relevance in Ward's hierarchical clustering using the Lp norm
CN111400180B (en) Software defect prediction method based on feature set division and ensemble learning
CN111898689B (en) Image classification method based on neural network architecture search
Pradipta et al. Radius-SMOTE: a new oversampling technique of minority samples based on radius distance for learning from imbalanced data
CN104392253B (en) Interactive classification labeling method for sketch data set
JP2018181290A (en) Filter type feature selection algorithm based on improved information measurement and ga
Zainudin et al. Feature Selection Optimization using Hybrid Relief-f with Self-adaptive Differential Evolution.
CN108280236A (en) A kind of random forest visualization data analysing method based on LargeVis
CN110826618A (en) Personal credit risk assessment method based on random forest
Zhang et al. A hybrid feature selection algorithm for classification unbalanced data processsing
CN112801140A (en) XGboost breast cancer rapid diagnosis method based on moth fire suppression optimization algorithm
Phienthrakul et al. Evolutionary strategies for multi-scale radial basis function kernels in support vector machines
CN113792765A (en) Oversampling method based on triangular centroid weight
Antoniades et al. Speeding up feature selection: A deep-inspired network pruning algorithm
CN115017988A (en) Competitive clustering method for state anomaly diagnosis
CN113392908A (en) Unbalanced data oversampling algorithm based on boundary density
Patil et al. Pattern recognition using genetic algorithm
CN112784908A (en) Dynamic self-stepping integration method based on extremely unbalanced data classification
CN111709460A (en) Mutual information characteristic selection method based on correlation coefficient
Bosio et al. Feature set enhancement via hierarchical clustering for microarray classification
Cui et al. Research on Credit Card Fraud Classification Based on GA-SVM
CN109934274A (en) Based on L2,pThe GEPSVM classification method of norm distance measure
Vluymans et al. Instance selection for imbalanced data
CN112580606B (en) Large-scale human body behavior identification method based on clustering grouping

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240417

Address after: 518000 1002, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Applicant after: Shenzhen Wanzhida Technology Co.,Ltd.

Country or region after: China

Address before: 710048 Shaanxi province Xi'an Beilin District Jinhua Road No. 5

Applicant before: XI'AN University OF TECHNOLOGY

Country or region before: China