CN115878999A - Oversampling method and system for differential evolution of highly unbalanced data sets - Google Patents

Publication number
CN115878999A
Authority
CN
China
Legal status
Pending
Application number
CN202211583309.9A
Other languages
Chinese (zh)
Inventor
李艳颖
张姣妮
王夏琳
李文
蒋语聪
Current Assignee
Baoji University of Arts and Sciences
Original Assignee
Baoji University of Arts and Sciences
Priority date
Filing date
Publication date
Application filed by Baoji University of Arts and Sciences filed Critical Baoji University of Arts and Sciences

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of data processing and discloses an oversampling method and system for differential evolution of highly unbalanced data sets. A k-nearest-neighbor search method (kNN) is used to screen out the minority-class region close to the minority-class center and identify a safe region; new minority-class samples are synthesized in the safe region using the DEBOHID oversampling method; finally, the oversampling method for differential evolution of highly unbalanced data sets is subjected to the Friedman test and the Wilcoxon signed-rank test on highly unbalanced data sets using SVM, KNN, and DT classification models. The proposed SS_DEBOHID method finds the safe region within the minority class according to k-nearest neighbors, from the viewpoints of reducing the generation of noise samples and increasing the reliability of synthesized samples. Experimental results show that the proposed SS_DEBOHID is superior to the other methods.

Description

Oversampling method and system for differential evolution of highly unbalanced data set
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to an oversampling method and system for differential evolution of a highly unbalanced data set.
Background
Currently, a data set containing 1% positive cases and 99% negative cases is considered highly unbalanced. In an unbalanced data set, the class with more samples is called the majority class, whereas the class with fewer samples is called the minority class. Unbalanced data sets abound in real life: software bug prediction, fraud detection, text classification, mechanical fault diagnosis, medical diagnosis, sarcasm detection in social media, and the like.
Methods for handling the class-imbalance problem are classified as data resampling, ensemble, and cost-sensitive methods. Data resampling is divided into undersampling, oversampling, and mixed sampling. Undersampling reduces the number of samples in the majority class; oversampling increases the number of samples in the minority class; the combination of undersampling and oversampling is mixed sampling. Undersampling removes majority-class samples, so some important information about the majority class is lost during model learning; oversampling increases the complexity of model training and can produce overfitting.
A common problem with oversampling is that of noise samples. Noise samples are minority-class samples lying at the classification boundary or minority-class samples surrounded by the majority class. Because of the uneven distribution of samples, if the minority samples used for synthesis are noise samples, the generalization capability of the resulting synthesized samples is low.
SMOTE (synthetic minority oversampling technique) is a common oversampling method, and many oversampling methods have been developed on its basis. The SMOTE method was proposed by Chawla et al. The idea of SMOTE comprises four steps. First, the oversampling ratio N, i.e., the number of new samples to generate, is set manually based on the imbalance ratio. Second, a minority-class sample is selected at random. Third, the k nearest minority-class neighbors of the selected sample are computed, and one minority neighbor is selected from them at random. Fourth, the selected minority sample is combined with its randomly selected minority neighbor to form a new sample. One drawback of SMOTE is that it does not take the distribution of the data set into account, so minority-class samples are likely to be misclassified.
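The four steps above can be sketched in a few lines of Python; the function name, parameter defaults, and brute-force NumPy neighbor search are our own illustrative choices, not the patent's implementation:

```python
import numpy as np

def smote_sample(minority, k=5, n_new=10, rng=None):
    """Minimal SMOTE sketch: linearly interpolate between a random
    minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    new_samples = []
    for _ in range(n_new):
        i = rng.integers(len(minority))           # step 2: random minority sample
        x = minority[i]
        d = np.linalg.norm(minority - x, axis=1)  # step 3: k nearest minority neighbors
        nn = np.argsort(d)[1:k + 1]               # index 0 is the sample itself
        z = minority[rng.choice(nn)]
        gap = rng.random()                        # step 4: interpolate on the segment x-z
        new_samples.append(x + gap * (z - x))
    return np.array(new_samples)
```

Because each new point is a convex combination of two minority points, it always lies inside the per-dimension bounds of the minority class, which is also why SMOTE cannot correct a bad minority distribution.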
The Borderline-SMOTE1 and Borderline-SMOTE2 algorithms were proposed by Han et al. They synthesize new samples at the sample boundary on the basis of the SMOTE method. Boundary samples matter more for classification because samples on the boundary are more easily misclassified than samples far from it. Borderline-SMOTE1 and Borderline-SMOTE2 therefore achieve class balance by linearly interpolating new samples at the minority-class boundary. Before oversampling, both methods separate the minority-class samples into a safe set, a danger set, and a noise set, and both use the SMOTE mechanism to synthesize new samples in the danger set. The difference is that Borderline1 computes the k nearest minority-class neighbors of a danger sample and then randomly selects one of them to synthesize a new sample (as in SMOTE), whereas Borderline2 computes the k nearest neighbors of the danger sample in the whole training data set and then randomly selects one nearest neighbor to synthesize a new sample (regardless of the neighbor's class). Borderline1 and Borderline2 consider the influence of boundary samples on classification success, and experimental results show that both achieve a good classification effect.
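A minimal sketch of the safe/danger/noise partition described above, assuming the label convention 1 = minority and 0 = majority; the helper name and thresholds follow the usual Borderline-SMOTE description, not code from the patent:

```python
import numpy as np

def partition_minority(X, y, m=5):
    """Partition minority samples by counting majority samples
    among the m nearest neighbors in the whole training set."""
    X, y = np.asarray(X, float), np.asarray(y)
    noise, danger, safe = [], [], []
    for i in np.where(y == 1)[0]:              # iterate over minority samples
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:m + 1]            # m nearest neighbors, excluding self
        n_maj = int(np.sum(y[nn] == 0))
        if n_maj == m:
            noise.append(i)                    # fully surrounded by majority
        elif n_maj >= m / 2:
            danger.append(i)                   # borderline: oversample here
        else:
            safe.append(i)
    return noise, danger, safe
```

Only the danger set is oversampled by Borderline-SMOTE; as described later, SS_DEBOHID instead oversamples within the safe set.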
The SMOTE-TomekLinks algorithm is a mixed sampling method. SMOTE-TomekLinks first creates new samples using SMOTE to achieve class balance, and then removes the majority-class samples that participate in Tomek links. Tomek proposed the Tomek-link undersampling method in 1976. The method focuses on cleaning overlapping samples in the data set and comprises two steps. The first step is to form a Tomek link: a pair of samples in the data set that are each other's nearest neighbors but belong to different classes. The second step is to delete the Tomek link. The advantage of this approach is that the overlap of samples between different classes is reduced, but the undersampling cannot control the number of removed samples, so the number of majority-class samples that can be eliminated is limited.
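The two-step Tomek-link procedure can be illustrated with a brute-force sketch; the function name and pairwise-distance approach are ours, for exposition only:

```python
import numpy as np

def tomek_links(X, y):
    """A Tomek link is a pair of mutual nearest neighbors with
    different labels. Returns the index pairs forming links."""
    X = np.asarray(X, float)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                                # each sample's nearest neighbor
    links = []
    for i, j in enumerate(nn):
        if nn[j] == i and y[i] != y[j] and i < j:        # mutual + different class
            links.append((i, int(j)))
    return links
```

An undersampler would then drop the majority-class member of each returned pair.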
Batista et al. proposed SMOTE-ENN, a mixed sampling method that combines the oversampling method SMOTE with the undersampling method ENN. SMOTE first oversamples the unbalanced data set to balance, and then majority-class samples in the balanced data set are removed according to ENN. Wilson proposed the edited nearest neighbor rule (ENN). The algorithm flow of ENN is as follows: first, find the three nearest neighbors of each majority-class sample in the training data set; second, count the number of minority-class neighbors among these nearest neighbors; finally, if the number of minority-class neighbors is greater than 1, remove the majority-class sample. Since most majority-class samples are surrounded by the majority class, the number of majority-class samples that can be culled is relatively limited.
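A hedged sketch of Wilson's ENN rule as it is commonly stated, where a majority sample is dropped when it disagrees with the majority vote of its three nearest neighbors; the label convention 0 = majority and the helper name are ours:

```python
import numpy as np

def enn_filter(X, y, k=3):
    """Drop each majority sample whose label disagrees with the
    majority vote of its k nearest neighbors (Wilson's ENN)."""
    X, y = np.asarray(X, float), np.asarray(y)
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        if y[i] != 0:                    # only majority samples are filtered here
            continue
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]      # k nearest neighbors, excluding self
        if np.sum(y[nn] != 0) > k / 2:   # misclassified by its neighborhood
            keep[i] = False
    return X[keep], y[keep]
```

As the text notes, a majority sample deep inside the majority region is never removed by this rule, which is why ENN culls relatively few samples.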
In 2009, Bunkhumpornpat et al. proposed the Safe-Level-SMOTE algorithm. The locations of the new samples synthesized by Safe-Level-SMOTE lean toward dense minority-class regions. The method takes a minority-class sample X, calculates its safe level according to the nearest neighbors of X, and then generates a new sample with the SMOTE synthesis mechanism. The method avoids SMOTE synthesizing minority samples in random regions and reduces the possibility that the synthesized minority samples overlap with majority samples.
In 2008, He et al. proposed the ADASYN algorithm, an adaptive synthetic sampling method for imbalanced learning. The algorithm first determines the total number of samples to synthesize according to the imbalance ratio, then determines the number of samples each minority-class sample should generate, and finally generates new minority-class samples by linear interpolation as in SMOTE. The disadvantage of ADASYN is that it is susceptible to outliers in the data set.
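ADASYN's adaptive step, deciding how many samples each minority point should generate, can be sketched as follows; this is a simplified illustration, and the function name, label convention (1 = minority), and rounding scheme are our assumptions:

```python
import numpy as np

def adasyn_allocation(X, y, k=5, beta=1.0):
    """Allocate the total number of samples to synthesize across
    minority points in proportion to the fraction of majority
    samples among each point's k nearest neighbors."""
    X, y = np.asarray(X, float), np.asarray(y)
    minority = np.where(y == 1)[0]
    G = int((np.sum(y == 0) - len(minority)) * beta)  # total samples to synthesize
    r = []
    for i in minority:
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]
        r.append(np.sum(y[nn] == 0) / k)              # local majority density
    r = np.array(r)
    r = r / r.sum() if r.sum() > 0 else np.full(len(r), 1 / len(r))
    return np.round(r * G).astype(int)                # per-sample synthesis counts
```

Minority points near the class boundary receive larger counts; an outlier stranded inside the majority region also receives a large count, which is exactly the sensitivity to outliers noted above.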
In 2012, Ramentol et al. proposed a hybrid of SMOTE and rough set theory (RST) called SMOTE-RSB. SMOTE-RSB is an extension of the SMOTE algorithm and contains two main phases: generating new minority-class samples with SMOTE, and cleaning the generated samples with rough set theory. SMOTE-RSB eliminates incompatible synthesized samples, because the RST cleans up synthesized samples that are inconsistent with the original data set. The method uses SMOTE to synthesize new samples and does not change the profile of the original sample distribution. SMOTE adds information to the original data set that helps classification, but it also adds noise samples that affect the determination of classification boundaries.
Through the above analysis, the problems and defects of the prior art are as follows:
(1) Traditional undersampling removes majority-class samples, so important information about the majority class is lost during model learning; oversampling increases the complexity of model training and produces overfitting.
(2) Existing SMOTE-based oversampling methods do not consider the distribution of the data set, so minority-class samples have a high probability of being misclassified.
(3) In the existing mixed sampling method based on the SMOTE-TomekLinks algorithm, the undersampling step cannot control the number of removed samples, so the number of majority-class samples that can be eliminated is limited.
(4) In the existing SMOTE-ENN-based hybrid sampling method, most majority-class samples are surrounded by the majority class, so the number of majority-class samples that can be eliminated is relatively limited.
(5) The existing ADASYN-based adaptive synthetic sampling method for imbalanced learning is easily influenced by outliers in the data set.
(6) The existing oversampling technique based on the mixture of SMOTE and rough set theory (RST) adds information to the original data set to help classification, but also adds noise samples that affect the determination of classification boundaries.
Disclosure of Invention
The invention provides an oversampling method and system for differential evolution of a highly unbalanced data set, and particularly relates to an oversampling method, system, medium, device and terminal for differential evolution of a highly unbalanced data set based on a security set.
The invention is realized in such a way that an oversampling method for differential evolution of highly unbalanced data sets comprises the following steps: screening out the minority-class region close to the minority-class center within the minority class using the k-nearest-neighbor search method kNN, and identifying the safe region; synthesizing new minority-class samples in the safe region using the DEBOHID oversampling method; and finally, performing the Friedman test and the Wilcoxon signed-rank test on the proposed oversampling method on highly unbalanced data sets using SVM, KNN, and DT classification models.
Further, the oversampling method for differential evolution of highly unbalanced data sets comprises the steps of:
Step one, SS_DEBOHID identifies the safe set within the minority class;
Step two, samples are synthesized with the DEBOHID method in the safe set.
Further, the SS_DEBOHID identification of the safe set within the minority class in step one comprises: taking a sample from the minority class and calculating the Euclidean distances between the sample and the training data set to obtain the safe region within the minority class; SS_DEBOHID selects the first k1 nearest neighbors from the resulting ascending distance array and counts the numbers of majority-class and minority-class neighbors among those k1 nearest neighbors; if the number of minority-class neighbors is greater than or equal to k1/2, the selected sample is a safe sample, and the safe set consists of these safe samples.
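The safe-set rule in this step can be sketched as follows; the function name and the label convention 1 = minority are ours, while the k1/2 threshold follows the description:

```python
import numpy as np

def safe_set(X, y, k1=5):
    """A minority sample is 'safe' when at least k1/2 of its k1
    nearest neighbors in the training set are also minority."""
    X, y = np.asarray(X, float), np.asarray(y)
    safe = []
    for i in np.where(y == 1)[0]:
        d = np.linalg.norm(X - X[i], axis=1)    # Euclidean distance to training set
        nn = np.argsort(d)[1:k1 + 1]            # first k1 neighbors, self excluded
        if np.sum(y[nn] == 1) >= k1 / 2:        # minority neighbors >= k1/2
            safe.append(i)
    return safe
```

Minority samples surrounded by the majority class (the noise samples discussed in the background) fail the threshold and are excluded from synthesis.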
Further, in the step of finding the safe set within the minority class, N represents the number of samples in the training data set and D represents the number of attributes in the data set. Computing the five nearest neighbors of the minority-class samples in the training data set has complexity O((N-1)D). In the process of generating new samples, N_maj denotes the number of samples in the majority class and N_min the number of samples in the minority class. The computational complexity of finding neighbors for the minority-class samples is O((N_min-1)D), the computational complexity of balancing the data set is O((N_maj-N_min)(N_min-1)D), and the overall complexity of the SS_DEBOHID method is O((N-1)D) + O((N_maj-N_min)(N_min-1)D).
Further, step two of synthesizing samples with the DEBOHID method in the safe set comprises: DEBOHID takes a safe sample α from the safe set, calculates the Euclidean distances between α and the minority-class samples, and selects the k nearest neighbors of the selected sample α; new minority samples are then synthesized through the mutation, crossover, and selection processes. During mutation, the DEBOHID method uses DE's basic strategy DE/rand/1 to create donor vectors; during crossover, the donor vector and the target vector are combined into a new trial vector; during selection, a greedy criterion is used to make the optimal decision between the trial vector and the target vector.
Further, the target vector is the selected sample α.
Another object of the invention is to provide an oversampling system for differential evolution of highly unbalanced data sets, applying the above oversampling method for differential evolution of highly unbalanced data sets, the system comprising:
the safe-set identification module, used for identifying the safe set within the minority class using SS_DEBOHID;
and the sample synthesis module, used for synthesizing samples with the DEBOHID method in the safe set.
Another object of the invention is to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the oversampling method for differential evolution of highly unbalanced data sets.
It is a further object of the invention to provide a computer readable storage medium, in which a computer program is stored which, when being executed by a processor, causes the processor to carry out the steps of the described oversampling method for differential evolution of highly unbalanced data sets.
Another object of the present invention is to provide an information data processing terminal for implementing the above-mentioned oversampling system for differential evolution of highly unbalanced data sets.
By combining the technical scheme and the technical problem to be solved, the technical scheme to be protected by the invention has the advantages and positive effects that:
first, aiming at the technical problems and difficulties in solving the problems in the prior art, the technical problems to be solved by the technical scheme of the present invention are closely combined with results, data and the like in the research and development process, and some creative technical effects are brought after the problems are solved. The specific description is as follows:
the oversampling approach is used to solve the class imbalance problem. Some existing over-sampling methods do not remove noise samples well and avoid synthesizing noise samples. Therefore, the present invention proposes a new oversampling approach, called SS-DEBOHID, a new oversampling approach for differential evolution on a safe set of highly unbalanced data sets. The SS _ DEBOHID provided by the invention firstly screens out a minority class region close to a minority class center, namely a safety region, by using a k nearest neighbor (kNN) search method; a new minority sample is then synthesized in the safe region using the debodid oversampling method. The advantages of the SS _ DEBOHID of the invention include: (a) The generation of noise samples is reduced, so that the quality of a synthesized sample is improved; (b) increasing the confidence level of the new sample; (c) The sample is synthesized by a DEBOHID method, so that the sample boundary is improved; (d) the method is applicable to highly unbalanced data sets. The method proposed by the present invention was compared to the 9 most advanced over-sampling methods on 43 highly unbalanced datasets and evaluated on AUC and G-Mean indices. Experimental results show that the SS _ DEBOHID obtains good classification performance and robustness.
The invention is primarily directed to data resampling. The invention considers the distribution of samples, with the aim of finding samples near the minority-class center and then oversampling those samples to achieve class balance. The proposed method SS_DEBOHID comprises two parts:
First part: determining the safe region within the minority class. The invention determines the minority-class safe region by drawing on the strategy of the Borderline-SMOTE method for finding safe samples.
Second part: synthesizing new minority-class samples in the safe region. Unlike the SMOTE method, the invention uses the DEBOHID method to synthesize the new minority samples.
The invention provides a new oversampling method called SS_DEBOHID. The method first identifies the safe region within the minority class; second, minority samples are synthesized in the safe region using the DEBOHID method. The SS_DEBOHID method not only improves the quality of new samples but also improves the generalization of the synthesized samples, which makes its performance superior to the other nine oversampling methods. In experiments, the invention compared SS_DEBOHID with nine methods using SVM, KNN, and DT classification models on 43 highly unbalanced data sets. The Friedman test and the Wilcoxon signed-rank test indicate that the SS_DEBOHID method is significantly superior to the other methods.
Secondly, considering the technical scheme as a whole or from the perspective of products, the technical effect and advantages of the technical scheme to be protected by the invention are specifically described as follows:
the SS _ DEBOHID method provided by the invention finds the safety region in a minority class according to k-nearest neighbor from the viewpoints of reducing the generation of noise samples and increasing the reliability of synthesized samples. The k-neighbor search is chosen because it is simple, efficient, and accurate in classification, performing the DEBOHID oversampling method in a safe area. Therefore, the SS _ DEBOHID method proposed by the present invention is meaningful.
The contributions of the SS_DEBOHID method provided by the invention are summarized as follows: the method improves the generalization capability of synthesized samples and avoids generating noise samples; the DEBOHID framework performs oversampling in the minority-class safe region to improve the credibility of synthesized samples; SS_DEBOHID achieves higher classification performance than DEBOHID while maintaining the same complexity as DEBOHID; SS_DEBOHID was compared against 9 oversampling methods on 43 data sets, reporting the means of the AUC and G_Mean indices after 5 runs of 5-fold cross-validation. Experimental results show that SS_DEBOHID is superior to the other oversampling methods.
Third, as inventive supplementary proof of the claims of the present invention, there are several important aspects as follows:
(1) The expected income and commercial value after conversion of the technical scheme of the invention;
(2) Whether the technical scheme of the invention fills a technical blank in the industry at home and abroad;
(3) Whether the technical scheme of the invention solves a technical problem that people have long been eager to solve but without success;
(4) Whether the technical scheme of the invention overcomes technical prejudice.
drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a differential evolution oversampling method for highly unbalanced data sets according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a differential evolution oversampling method for a highly unbalanced data set according to an embodiment of the present invention;
FIG. 3A is a schematic diagram of the average AUC values of SS _ DEBOHID and DEBOHID on SVM classifier provided by the embodiment of the present invention;
FIG. 3B is a schematic diagram of the average G _ Mean values of SS _ DEBOHID and DEBOHID over SVM classifiers provided by embodiments of the present invention;
FIG. 4A is a schematic representation of the average AUC values of SS _ DEBOHID and DEBOHID on the KNN classifier provided by an embodiment of the present invention;
FIG. 4B is a schematic diagram of the average G _ Mean values of SS _ DEBOHID and DEBOHID on the KNN classifier provided by an embodiment of the invention;
FIG. 5A is a schematic diagram of the average AUC values of SS _ DEBOHID and DEBOHID over DT classifier provided by an embodiment of the present invention;
FIG. 5B is a schematic diagram of the average G _ Mean values of SS _ DEBOHID and DEBOHID on DT classifier provided by an embodiment of the present invention;
FIG. 6A is a graph illustrating the average AUC of an SVM classifier according to an embodiment of the present invention;
FIG. 6B is a schematic diagram of the average value of G _ Mean on the SVM classifier provided by the embodiment of the present invention;
fig. 7A is a schematic graph of the average values of AUC on the KNN classifier provided by the embodiment of the present invention;
fig. 7B is a schematic diagram of the average value of G _ Mean on the KNN classifier provided by the embodiment of the present invention;
FIG. 8A is a graph illustrating the average AUC of the DT classifier provided by an embodiment of the present invention;
FIG. 8B is a graph illustrating the average value of G _ Mean on the DT classifier according to an embodiment of the present invention;
FIG. 9A is a diagram illustrating the results of a Wilcoxon signed rank test of the AUC mean values on an SVM classifier provided by an embodiment of the present invention;
FIG. 9B is a diagram illustrating the results of a Wilcoxon signed rank test of the G _ Mean average value on the SVM classifier provided by the embodiment of the present invention;
fig. 9C is a schematic diagram of Wilcoxon signed rank test results of AUC averages on KNN classifiers provided by an embodiment of the present invention;
fig. 9D is a schematic diagram of a Wilcoxon signed rank test result of the G _ Mean average value on the KNN classifier provided by the embodiment of the present invention;
FIG. 9E is a graphical representation of the Wilcoxon signed ranks test result of the AUC means on DT classifier as provided by an embodiment of the present invention;
fig. 9F is a diagram illustrating the result of Wilcoxon signed rank test of G _ Mean average on DT classifier provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
In view of the problems in the prior art, the present invention provides an oversampling method and system for differential evolution of a highly unbalanced data set, which is described in detail below with reference to the accompanying drawings.
1. The embodiments are explained. This section is an explanatory embodiment expanding on the claims so as to fully understand how the present invention is embodied by those skilled in the art.
As shown in fig. 1, an oversampling method for differential evolution of a highly unbalanced data set provided by an embodiment of the present invention includes the following steps:
s101, screening a minority class area close to the center of the minority class in the minority class by using a k nearest neighbor search method kNN, and identifying to obtain a safe area;
s102, synthesizing a new minority sample in a safe area by using a DEBOHID oversampling method;
s103, performing Friedman test and Wilcoxon symbolic rank test on the SS _ DEBOHID method on the highly unbalanced data set by using SVM, KNN and DT classification models.
As a preferred embodiment, the oversampling method SS_DEBOHID for differential evolution of highly unbalanced data sets provided by the embodiment of the invention is a novel oversampling method based on DEBOHID.
The hyperparameters in the DEBOHID method include the following:
- Minor is the minority-class sample array in the training data set, and Major is the majority-class sample array in the training data set.
- Min is the minimum boundary vector of the data attributes, and Maj is the maximum boundary vector of the data attributes.
- k is the kNN-search parameter that determines the number of nearest neighbors per minority-class sample.
- F is the scaling factor, F ∈ [0, 2]; the parameter F influences the convergence speed and optimization performance of DEBOHID.
- CR is the crossover rate, CR ∈ [0, 1], determined by the user.
- D is the dimensionality of the data set.
- NOS is the number of new minority samples to be generated.
Second, DEBOHID calculates the number of new minority samples to be generated, i.e., NOS. Each sample in Minor is represented as a target vector X_i, which is D-dimensional. NOS is computed as shown in (1):

NOS = (N_maj - N_min) × β   (1)

where N_maj is the number of samples in Major and N_min is the number of samples in Minor, and β is a nonnegative coefficient. Here β equals 1, i.e., the ratio of the newly generated minority class plus the original minority class to the original majority class is 1. This means the numbers of majority-class and minority-class samples will be balanced after oversampling.
Third, DEBOHID synthesizes samples using the mutation, crossover, and selection operations. With NOS as the loop condition, each time DEBOHID generates a new sample the value of NOS is decreased by 1. When NOS equals 0, the data set has been balanced by the DEBOHID oversampling method.
During mutation, for the target vector X_i a donor vector D_V_i is generated according to DE's basic strategy DE/rand/1, as shown in equation (2):

D_V_i = X_r1 + F × (X_r2 - X_r3)   (2)

where D_V_i is the ith donor vector, i ∈ {1, 2, 3, …, N_min}; N_min is the number of samples in Minor; F is the scale factor; and X_r1, X_r2, X_r3 are three different neighbors of the target vector X_i.
A new trial vector T_i is generated by the crossover operation between the donor vector D_V_i and the target vector X_i, as shown in equation (3):

t_ij = d_v_ij if rand(0, 1) ≤ CR, otherwise t_ij = x_ij   (3)

where t_ij is the value of the jth dimension of the ith trial vector; x_ij is the value of the jth dimension of the ith target vector; CR is the crossover rate; rand(0, 1) is a randomly generated decimal between 0 and 1; and d_v_ij is the value of the jth dimension of the ith donor vector.
During selection, DEBOHID implements the greedy criterion between the target vector and the trial vector, together with boundary handling as shown in equation (4):

t_ij = Max_j if t_ij > Max_j; t_ij = Min_j if t_ij < Min_j   (4)

where t_ij is the value of the jth dimension of the ith trial vector, and Max_j and Min_j are the values of the jth dimension of the maximum and minimum boundary vectors of the data attributes. If t_ij is greater than Max_j, DEBOHID replaces t_ij with the maximum; if t_ij is less than Min_j, DEBOHID replaces t_ij with the minimum.
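The mutation, crossover, and boundary-handling operations above can be sketched for a single synthesis step; the parameter defaults F = 0.8 and CR = 0.9 are typical DE values, not values taken from the patent:

```python
import numpy as np

def debohid_generate(alpha, neighbors, min_b, max_b, F=0.8, CR=0.9, rng=None):
    """One DEBOHID-style synthesis step for a selected sample alpha
    (the target vector): DE/rand/1 mutation over three distinct
    minority neighbors, binomial crossover, then boundary clipping."""
    rng = np.random.default_rng(rng)
    alpha = np.asarray(alpha, float)
    neighbors = np.asarray(neighbors, float)
    r1, r2, r3 = neighbors[rng.choice(len(neighbors), 3, replace=False)]
    donor = r1 + F * (r2 - r3)               # mutation: DE/rand/1
    cross = rng.random(alpha.size) <= CR     # binomial crossover mask
    trial = np.where(cross, donor, alpha)
    return np.clip(trial, min_b, max_b)      # boundary handling
```

The clipping step guarantees that every synthesized attribute stays inside the boundary vectors Min and Maj, so no new sample falls outside the attribute ranges of the training data.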
DEBOHID is the first oversampling method proposed on the basis of DE. It has been shown to be superior to other oversampling methods on highly unbalanced data sets, and the DEBOHID method has good stability. However, DEBOHID does not consider the distribution of the samples: if the selected minority samples are noise or redundant samples, the synthesized samples are of poor quality and may themselves be noise samples. SS_DEBOHID screens out the safe region within the minority class, taking the sample distribution into account, and synthesizes samples in the safe region, which to a great extent ensures the high quality of the synthesized samples. The invention therefore proposes a new oversampling method named SS_DEBOHID.
Considering the influence of the original sample distribution on the synthesized samples, the invention provides a safe-region division technique within the minority class and uses the DEBOHID oversampling method to achieve balance between the minority and majority classes. The workflow of SS_DEBOHID is shown in Table 1.
Table 1. Pseudocode for SS_DEBOHID (presented as an image in the original document).
The SS_DEBOHID method provided by the embodiment of the invention comprises two steps:
in the first step, SS _ DEBOHID identifies a security set in a small number of classes. To obtain the safe regions in the minority class, a sample is taken in the minority class and the euclidean distance between the sample and the training data set is calculated. The SS _ DEBOHID selects the front k in the distance array according to the obtained incremental distance array 1 Nearest neighbor, count k 1 The number of the majority type neighbors and the number of the minority type neighbors in the nearest neighbor. If the number of minority class neighbors is greater than or equal to k 1 And/2, the selected sample is a safe sample. The security set consists of these security samples.
In the second step, samples are synthesized from the safe set using the DEBOHID method. DEBOHID takes a safe sample α from the safe set, computes its Euclidean distances to the minority-class samples, and selects the k nearest neighbors of the selected sample α. New minority-class samples are then synthesized through the mutation, crossover, and selection processes. During mutation, the DEBOHID method uses the basic DE strategy, DE/rand/1, to create donor vectors. During crossover, the donor vector and the target vector (the selected sample α) are combined into a new trial vector. During selection, a greedy criterion is used to make the optimal decision between the trial vector and the target vector. The flow chart of the SS_DEBOHID method is shown in FIG. 2.
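One synthesis step can be sketched as follows. This is a hypothetical numpy implementation of the DE/rand/1 mutation and binomial crossover described above; the greedy selection step is omitted, and all names (and the choice of random generator) are illustrative, not the patented code:

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_one(alpha, minority, k=3, F=0.3, CR=0.6):
    """Create one synthetic minority sample around a safe sample alpha.
    Parameter names k, F, CR mirror the settings reported in the text."""
    # k nearest minority-class neighbours of alpha (excluding alpha itself)
    d = np.linalg.norm(minority - alpha, axis=1)
    nn = minority[np.argsort(d)[1:k + 1]]
    # DE/rand/1 mutation: donor = x_r1 + F * (x_r2 - x_r3)
    r1, r2, r3 = rng.choice(len(nn), size=3, replace=False)
    donor = nn[r1] + F * (nn[r2] - nn[r3])
    # binomial crossover between the donor and the target vector alpha
    mask = rng.random(alpha.shape[0]) < CR
    mask[rng.integers(alpha.shape[0])] = True   # keep at least one donor gene
    return np.where(mask, donor, alpha)

minority = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
new_sample = synthesize_one(minority[0], minority)
print(new_sample)  # a new point close to the minority cluster
```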
The SS_DEBOHID method provided by the embodiment of the invention has the following computational complexity:
in the step of finding a safe set in the minority class, let N denote the number of samples in the training data set and D the number of attributes in the data set. Computing the five nearest neighbors of a minority-class sample in the training data set has complexity O((N-1)D). In the process of generating new samples, let N_maj denote the number of majority-class samples and N_min the number of minority-class samples. The computational complexity of finding the neighbors of a minority-class sample is O((N_min-1)D), and the computational complexity of balancing the data set is O((N_maj-N_min)(N_min-1)D). The overall complexity of the SS_DEBOHID method is O((N-1)D) + O((N_maj-N_min)(N_min-1)D).
The over-sampling system for differential evolution of highly unbalanced data sets provided by the embodiment of the invention comprises:
the safe set identification module is used for identifying a safe set in the minority class by using SS_DEBOHID;
and the sample synthesis module is used for synthesizing samples in the safe set by using the DEBOHID method.
2. Application examples. To demonstrate the creativity and technical value of the technical scheme of the invention, this part presents application examples of the claimed technical scheme on specific products or related technologies.
3. Evidence of the relevant effects of the embodiments. The embodiments of the invention achieved positive effects during research, development, and use, and offer substantial advantages over the prior art, as described below with reference to the data and charts obtained during testing.
1. Experimental setup
The present invention evaluated 11 methods on 43 unbalanced datasets using three classification models to demonstrate the classification performance of the proposed method. The following is a description of the experimental setup.
1.1 data set
The experiments used 43 data sets with 5-fold cross-validation, retrieved from the KEEL data set repository; detailed information is given in Table 2. The invention mainly studies binary imbalanced data sets, i.e., the class labels are divided into a positive class and a negative class, where the positive class is the minority class and the negative class is the majority class. In the experiments, the positive class is assigned the numeric label 0 and the negative class the numeric label 1. The imbalance ratio is defined in equation (5).
IR = m+ / m-    (5)
where m+ denotes the number of majority-class samples in the data set and m- denotes the number of minority-class samples.
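Equation (5) can be computed directly from the labels (a minimal sketch; the label conventions follow the experiments, with the majority class labeled 1 and the minority class labeled 0):

```python
def imbalance_ratio(y, majority_label=1, minority_label=0):
    """IR = m+ / m- as in equation (5): majority count over minority count."""
    m_plus = sum(1 for v in y if v == majority_label)
    m_minus = sum(1 for v in y if v == minority_label)
    return m_plus / m_minus

# A data set with nine majority samples per minority sample has IR = 9.0
labels = [1] * 9 + [0]
print(imbalance_ratio(labels))  # 9.0
```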
Table 2 data set description
1.2 Performance measurements
This section introduces the performance indexes used in the tests. Table 3 shows the confusion matrix for the binary classification problem.
TABLE 3 confusion matrix
                    Predicted positive      Predicted negative
Actual positive     TP (true positive)      FN (false negative)
Actual negative     FP (false positive)     TN (true negative)
AUC (area under the curve) is defined as the area enclosed under the ROC (receiver operating characteristic) curve and the coordinate axes. This area is clearly no greater than 1, and the closer the AUC value is to 1, the better the classification performance of the method.
See equation (6) for a specific calculation.
AUC = (1 + TPR - FPR) / 2    (6)
G _ Mean is defined as the geometric Mean of TPR (True Positive Rate) and TNR (True Negative Rate) and is a good indicator of the overall performance of the classifier, regardless of the degree of imbalance between classes. The closer the value of G _ Mean is to 1, the better the method works. See equation (7) for a specific calculation.
G_Mean = sqrt(TPR * TNR)    (7)
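Equations (6) and (7) can be sketched as simple functions of the per-class rates (illustrative names; FPR = 1 - TNR):

```python
import math

def auc_from_rates(tpr, fpr):
    """Single-threshold AUC estimate, equation (6): AUC = (1 + TPR - FPR) / 2."""
    return (1.0 + tpr - fpr) / 2.0

def g_mean(tpr, tnr):
    """Geometric mean of TPR and TNR, equation (7)."""
    return math.sqrt(tpr * tnr)

# Example: TPR = 0.9 and TNR = 0.8 (so FPR = 0.2)
print(round(auc_from_rates(0.9, 0.2), 4))  # 0.85
print(round(g_mean(0.9, 0.8), 4))          # 0.8485
```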
1.3 parameter settings
In the experiments, python 3.6.4 was used in the present invention. In the experimental results, the present invention compared SS _ DEBOHID with Base, DEBOHID, SMOTE, S-RSB, SMOTE-ENN, SL-SMOTE, borderline1, borderline2, SMOTE-TL, and ADASYN methods. In order to verify the validity of the experimental result, the invention uses SVM, KNN and DT classifiers and carries out 5-fold cross validation. The results are shown in tables 4 to 9.k is a radical of formula 1 Set to 5 in knnsearch for searching for a secure area. k is set to 3 in knnsearch for synthesizing the sample. For the KNN classifier, k equals 5, CR equals 0.6, and F equals 0.3.
2. Results and discussion
To verify the validity of SS_DEBOHID, the experiments compared the values of the two indicators for DEBOHID and SS_DEBOHID on the three classifiers. The comparison of the two methods is given in Section 2.1, the comparison of all methods in Section 2.2, and the significance tests for all methods in Section 2.3.
2.1 Comparison of SS_DEBOHID and DEBOHID
First, the unbalanced data sets were oversampled using DEBOHID and SS_DEBOHID; the classification performance of both methods was then evaluated by measuring AUC and G_Mean with the SVM, KNN, and DT classifiers. The corresponding results are shown in FIGS. 3, 4, and 5.
FIG. 3 shows the AUC and G_Mean results of SS_DEBOHID and DEBOHID with the SVM classifier. In FIG. 3A, the average AUC of SS_DEBOHID is 0.9359 and that of DEBOHID is 0.8308; in FIG. 3B, the average G_Mean of SS_DEBOHID is 0.9248 and that of DEBOHID is 0.8304. FIG. 3 shows that with the SVM classifier SS_DEBOHID is 0.1051 higher than DEBOHID in AUC and 0.0944 higher in G_Mean.

FIG. 4 shows the AUC and G_Mean results of SS_DEBOHID and DEBOHID with the KNN classifier. In FIG. 4A, the average AUC of SS_DEBOHID is 0.9405 and that of DEBOHID is 0.8525; in FIG. 4B, the average G_Mean of SS_DEBOHID is 0.9098 and that of DEBOHID is 0.8402. FIG. 4 shows that SS_DEBOHID performs better than DEBOHID with the KNN classifier, performing well on more than 30 of the data sets.

FIG. 5 shows the AUC and G_Mean results of SS_DEBOHID and DEBOHID with the DT classifier. In FIG. 5A, the average AUC of SS_DEBOHID is 0.9501 and that of DEBOHID is 0.8261; in FIG. 5B, the average G_Mean of SS_DEBOHID is 0.8959 and that of DEBOHID is 0.8066. FIG. 5 shows that SS_DEBOHID performs better than DEBOHID with the DT classifier: SS_DEBOHID is 0.1240 higher than DEBOHID in AUC and 0.0893 higher in G_Mean.
The above results show that SS_DEBOHID performs better on highly unbalanced data sets such as abalone19, ecoli0147vs2356, ecoli0147vs56, and led7digit02456789vs1.
2.2 comparison of all methods
The invention compares the ten methods with SS_DEBOHID. The last two rows of Tables 4-9 give the average of the metric obtained by each method and the proportion of data sets on which that method performs best; bold numbers in the tables indicate the best results. The experimental results are shown in Tables 4 to 9.

Tables 4-9 list the average values of the two indexes obtained by 5-fold cross-validation of the SVM, KNN, and DT classifiers for the 11 methods on the 43 highly unbalanced data sets. These six tables show that the SS_DEBOHID method performs better than the other methods, achieving excellent performance on more than 30 data sets with each of the SVM, KNN, and DT classifiers.

To evaluate the stability of the algorithms, six box plots were drawn from the experimental results; each is based on the averages of the five results on the 43 5-fold cross-validation data sets. A shorter box indicates better stability of an algorithm, and a higher box indicates a better classification effect. Box plots of the AUC and G_Mean values for each classifier are shown in FIGS. 6, 7, and 8. The 25%-75% box of the SS_DEBOHID method is flatter than those of the other methods, indicating that SS_DEBOHID has good robustness.
TABLE 4 AUC mean values for all methods on SVM classifier
TABLE 5 Mean value of G_Mean of all methods on SVM classifier
TABLE 6 average of AUC for all methods on KNN classifier
TABLE 7 Average of G_Mean of all methods on KNN classifier
TABLE 8 average AUC of all methods on DT classifier
TABLE 9 Mean value of G_Mean of all methods on DT classifier
2.3 significance testing of all methods
In addition to the experimental evaluation methods and performance measures, statistical hypothesis tests are needed to compare the performance of the learners. The Friedman test is a non-parametric statistical test that can compare the performance of multiple methods on the data sets simultaneously. The significance level was set to 0.05: if the p-value is less than 0.05, there is a significant difference between the methods. The results of the Friedman test are given in Tables 10 and 11.
According to Tables 10 and 11, the p-values of the tests are less than 0.05, indicating that SS_DEBOHID is statistically significantly different from the other methods. In addition, SS_DEBOHID is superior to the other methods in terms of the average value, the average ranking, and the final ranking.
Significance between each pair of methods was tested using the non-parametric two-sided Wilcoxon signed-rank test, with a significance level of 0.05. The X-axis in FIG. 9 represents all methods and the Y-axis is the average of the metric. The "+" in FIGS. 9(a) to 9(f) indicates that SS_DEBOHID is significantly different from the other ten methods.
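The two significance tests can be reproduced with scipy. The per-data-set AUC scores below are made-up placeholders, not the patent's results (the real study used 43 data sets and eleven methods):

```python
import numpy as np
from scipy import stats

# Hypothetical per-data-set AUC scores for three methods.
ss_debohid = np.array([0.95, 0.93, 0.96, 0.91, 0.94, 0.92, 0.97, 0.90])
debohid    = np.array([0.86, 0.84, 0.88, 0.80, 0.85, 0.83, 0.87, 0.82])
smote      = np.array([0.84, 0.83, 0.85, 0.79, 0.84, 0.81, 0.86, 0.80])

# Friedman test: compares all methods at once across the data sets.
stat, p = stats.friedmanchisquare(ss_debohid, debohid, smote)
print(f"Friedman p = {p:.4f}")

# Two-sided Wilcoxon signed-rank test for one pairwise comparison.
w_stat, w_p = stats.wilcoxon(ss_debohid, debohid, alternative="two-sided")
print(f"Wilcoxon p = {w_p:.4f}")

# At the 0.05 level, p < 0.05 indicates a significant difference.
```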
Table 10 Results of the Friedman test on AUC for all methods
Table 11 Results of the Friedman test on G_Mean for all methods
The invention provides a new oversampling method called SS_DEBOHID. The method first identifies a safe region in the minority class, and then synthesizes minority-class samples in that safe region using the DEBOHID method. The SS_DEBOHID method not only improves the quality of the new samples but also reduces the generation of noisy synthesized samples, which makes its performance superior to the other nine oversampling methods. In the experiments, SS_DEBOHID was compared with ten methods on 43 highly unbalanced data sets using the SVM, KNN, and DT classification models. The Friedman test and the Wilcoxon signed-rank test indicate that the SS_DEBOHID method is significantly superior to the other methods, and the experimental results confirm that SS_DEBOHID outperforms them.
It should be noted that the embodiments of the present invention can be realized by hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portion may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the apparatus and methods described above may be implemented using computer-executable instructions and/or embodied in processor control code, such code being provided on a carrier medium such as a disk, CD- or DVD-ROM, programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The apparatus and its modules of the present invention may be implemented by hardware circuits such as very-large-scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field-programmable gate arrays and programmable logic devices; or by software executed by various types of processors; or by a combination of hardware circuits and software, e.g., firmware.
The above description is only for the purpose of illustrating the present invention and the appended claims are not to be construed as limiting the scope of the invention, which is intended to cover all modifications, equivalents and improvements that are within the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. An oversampling method for differential evolution of a highly unbalanced data set, the oversampling method comprising: screening out a minority-class region close to the minority-class center within the minority class by using the k-nearest-neighbor search method kNN, and identifying a safe region; synthesizing new minority-class samples in the safe region by using the DEBOHID oversampling method; and finally performing the Friedman test and the Wilcoxon signed-rank test on the oversampling method for differential evolution of the highly unbalanced data set, using SVM, KNN, and DT classification models on highly unbalanced data sets.
2. The method of oversampling for differential evolution of a highly unbalanced data set according to claim 1, characterized in that the method of oversampling for differential evolution of a highly unbalanced data set comprises the steps of:
step one, SS_DEBOHID identifies a safe set in the minority class;
and step two, synthesizing samples in the safe set by using the DEBOHID method.
3. The oversampling method for differential evolution of a highly unbalanced data set according to claim 2, wherein the identification of the safe set in the minority class by SS_DEBOHID in step one comprises: taking a sample from the minority class and calculating the Euclidean distances between the sample and the training data set to obtain the safe region in the minority class; SS_DEBOHID sorts the distances in ascending order, selects the first k1 nearest neighbors from the distance array, and counts the numbers of majority-class and minority-class neighbors among the k1 nearest neighbors; if the number of minority-class neighbors is greater than or equal to k1/2, the selected sample is a safe sample, and the safe set consists of the safe samples.
4. The oversampling method for differential evolution of a highly unbalanced data set according to claim 2, wherein in the step of finding a safe set in the minority class, N represents the number of samples in the training data set and D represents the number of attributes in the data set; computing the five nearest neighbors of a minority-class sample in the training data set has complexity O((N-1)D); in the process of generating new samples, N_maj denotes the number of majority-class samples and N_min the number of minority-class samples; the computational complexity of finding the neighbors of a minority-class sample is O((N_min-1)D), the computational complexity of balancing the data set is O((N_maj-N_min)(N_min-1)D), and the complexity of the SS_DEBOHID method is O((N-1)D) + O((N_maj-N_min)(N_min-1)D).
5. The oversampling method for differential evolution of a highly imbalanced data set according to claim 2, wherein synthesizing samples in the safe set using the DEBOHID method in step two comprises: DEBOHID takes a safe sample α from the safe set, calculates the Euclidean distances between the safe sample α and the minority-class samples, and selects the k nearest neighbors closest to the selected sample α; new minority-class samples are synthesized through mutation, crossover, and selection processes; during mutation, the DEBOHID method uses the basic strategy DE/rand/1 of DE to create donor vectors; during crossover, the donor vector and the target vector are combined into a new trial vector; during selection, a greedy criterion is used to make the optimal decision between the trial vector and the target vector.
6. The method of oversampling for differential evolution of a highly unbalanced data set according to claim 5, characterized in that the target vector is a selected sample α.
7. A differentially evolving over-sampling system for highly unbalanced data sets applying the differentially evolving over-sampling method for highly unbalanced data sets as claimed in any one of claims 1 to 6, characterized in that the differentially evolving over-sampling system for highly unbalanced data sets comprises:
the safe set identification module is used for identifying a safe set in the minority class by using SS_DEBOHID;
and the sample synthesis module is used for synthesizing samples in the safe set by using the DEBOHID method.
8. A computer arrangement, characterized in that the computer arrangement comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the method of oversampling for differential evolution of highly unbalanced data sets as claimed in any one of claims 1 to 6.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, causes the processor to carry out the steps of the oversampling method for differential evolution of highly unbalanced data sets as claimed in any one of claims 1 to 6.
10. An information data processing terminal characterized in that the information data processing terminal is configured to implement the differential evolutionary oversampling system for highly unbalanced data sets as claimed in claim 7.
CN202211583309.9A 2022-12-09 2022-12-09 Oversampling method and system for differential evolution of highly unbalanced data sets Pending CN115878999A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211583309.9A CN115878999A (en) 2022-12-09 2022-12-09 Oversampling method and system for differential evolution of highly unbalanced data sets


Publications (1)

Publication Number Publication Date
CN115878999A true CN115878999A (en) 2023-03-31

Family

ID=85766904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211583309.9A Pending CN115878999A (en) 2022-12-09 2022-12-09 Oversampling method and system for differential evolution of highly unbalanced data sets

Country Status (1)

Country Link
CN (1) CN115878999A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117351293A (en) * 2023-12-04 2024-01-05 天津医科大学口腔医院 Combined learning periodontal disease image classification method and device
CN117351293B (en) * 2023-12-04 2024-02-06 天津医科大学口腔医院 Combined learning periodontal disease image classification method and device
CN117633538A (en) * 2024-01-25 2024-03-01 中国传媒大学 Method, system, electronic device and storage medium for processing unbalanced data

Similar Documents

Publication Publication Date Title
CN110443281B (en) Text classification self-adaptive oversampling method based on HDBSCAN (high-density binary-coded decimal) clustering
Maulidevi et al. SMOTE-LOF for noise identification in imbalanced data classification
CN115878999A (en) Oversampling method and system for differential evolution of highly unbalanced data sets
Demidova et al. SVM classification: Optimization with the SMOTE algorithm for the class imbalance problem
Apté et al. Data mining with decision trees and decision rules
Brandt et al. A comparative review of SMOTE and ADASYN in imbalanced data classification
Zhu et al. Minority oversampling for imbalanced ordinal regression
Zhu et al. EHSO: Evolutionary Hybrid Sampling in overlapping scenarios for imbalanced learning
Wang et al. Graph-mamba: Towards long-range graph sequence modeling with selective state spaces
Pradipta et al. Radius-SMOTE: a new oversampling technique of minority samples based on radius distance for learning from imbalanced data
CN113344075A (en) High-dimensional unbalanced data classification method based on feature learning and ensemble learning
Zian et al. An empirical evaluation of stacked ensembles with different meta-learners in imbalanced classification
Pradipta et al. SMOTE for handling imbalanced data problem: A review
Meng et al. An imbalanced learning method by combining SMOTE with Center Offset Factor
Wang et al. Nearest Neighbor with Double Neighborhoods Algorithm for Imbalanced Classification.
Lamba et al. Feature Selection of Micro-array expression data (FSM)-A Review
CN114154557A (en) Cancer tissue classification method, apparatus, electronic device, and storage medium
Wu et al. Forestdet: Large-vocabulary long-tailed object detection and instance segmentation
Wu et al. A weighted ensemble classification algorithm based on nearest neighbors for multi-label data stream
Bach New undersampling method based on the kNN approach
CN112037174A (en) Chromosome abnormality detection method, device, equipment and computer readable storage medium
Lu et al. An overlapping minimization-based over-sampling algorithm for binary imbalanced classification
CN113780419A (en) Synthetic oversampling method and system for processing unbalanced data
Zhang et al. A new oversampling approach based differential evolution on the safe set for highly imbalanced datasets
Vluymans et al. Instance selection for imbalanced data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination