CN117312920A - Weighted integration unbalanced classification method, system, storage medium, equipment and terminal - Google Patents

Weighted integration unbalanced classification method, system, storage medium, equipment and terminal

Info

Publication number
CN117312920A
Authority
CN
China
Prior art keywords
sub
data
samples
class
classification
Prior art date
Legal status
Pending
Application number
CN202311322442.3A
Other languages
Chinese (zh)
Inventor
李艳颖
王夏琳
张姣妮
Current Assignee
Baoji University of Arts and Sciences
Original Assignee
Baoji University of Arts and Sciences
Priority date
Filing date
Publication date
Application filed by Baoji University of Arts and Sciences filed Critical Baoji University of Arts and Sciences
Priority to CN202311322442.3A priority Critical patent/CN117312920A/en
Publication of CN117312920A publication Critical patent/CN117312920A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention belongs to the technical field of data processing and discloses a weighted integration unbalanced classification method, system, storage medium, equipment and terminal. The weighted integration method based on differential sampling rates, called KSDE, is used for credit risk assessment. Its core idea is to construct multiple subsets with data-distribution characteristics through different sampling rates and then train multiple sub-models on them; next, the G-Mean of each sub-model on the validation set is calculated as its prediction weight; finally, the final classification result is obtained by weighting the sub-models' prediction results with their prediction weights, so as to improve the prediction stability of the method. On 33 datasets, KSDE was compared with 11 advanced comparison methods. Experimental results show that the classification accuracy of KSDE is the highest, with a particularly clear advantage in identifying minority-class samples. In addition, when the KSDE method contains enough sub-models, its classification performance is stable and robust.

Description

Weighted integration unbalanced classification method, system, storage medium, equipment and terminal
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a weighted integration unbalanced classification method, system, storage medium, equipment and terminal.
Background
Currently, the class-imbalance characteristic of data means that the number of samples of one or more classes is much smaller than the number of samples of the other classes. Class-imbalance problems are widely present in fields such as medical diagnosis, software defect prediction, fraud detection, credit scoring, image recognition and text classification. Traditional classification methods tend to be biased towards the majority classes when dealing with class-imbalanced datasets, because they generally assume that the class distribution of the data is balanced and that the costs of misclassifying samples are equal. However, researchers tend to pay more attention to the minority class in the data. For example, in the field of credit risk assessment, attention is paid to whether an algorithm has good classification accuracy for default samples (minority-class samples), because an algorithm with stronger learning ability on default samples can reduce or avoid economic loss and increase income. Therefore, many methods have been proposed to solve the problem of classifying unbalanced data, and they can be divided into the following three categories. Data-level methods: data-level methods are preprocessing techniques independent of the classification algorithm. They achieve better classification performance by balancing the class distribution of the data and are broadly divided into undersampling, oversampling and mixed-sampling methods. Algorithm-level methods: algorithm-level methods make corresponding improvements, aimed at the shortcomings of existing classifiers on the class-imbalance problem and at the characteristics of imbalanced data, in order to improve the recognition of minority-class samples; such methods have no effect on the data distribution. For example, cost-sensitive methods achieve better classification by increasing the cost of misclassifying the minority class. Integration methods: integration methods fuse the classification results of multiple sub-models to obtain better generalization performance and thereby improve classification accuracy; common integration methods are AdaBoost, Bagging, random forest and the like. In addition, combining integration methods with data-level methods provides a new idea for solving the class-imbalance problem. Sun et al. in 2017 proposed an ensemble method, DTE-SBD, for enterprise credit assessment. The method generates multiple balanced datasets based on differential sampling rates: SMOTE with differential sampling rates generates different numbers of minority-class samples, and Bagging with differential sampling rates generates different numbers of majority-class samples; DTE-SBD combines the two kinds of samples to train decision tree classifiers. Although integration methods have a certain effect on the imbalance problem, they still fall short in data sampling. In particular, when constructing the classifier, these methods do not consider well whether the training data contains noise samples and whether the selected data represents the data distribution; both cause the classification results to deviate greatly from the actual results.
Data-level methods are an effective way to solve the class-imbalance problem and are widely welcomed because they are independent of the classifier. They are mainly divided into three types: oversampling, undersampling and mixed sampling. Oversampling adjusts for data imbalance by adding minority-class samples. SMOTE, one of the most important methods in the oversampling field, has inspired many SMOTE-based methods, whose variants can be broadly divided into three classes. The first removes noise samples to improve class separability, e.g., the recently proposed IR-SMOTE and SMOTE-RkNN: SMOTE-RkNN uses reverse k-nearest neighbors (RkNN) to calculate probability density for removing noise samples, while IR-SMOTE sorts the samples first and then applies K-means clustering to remove noise samples. The second synthesizes samples at class boundaries to strengthen the class distribution; for example, IW-SMOTE determines a sample's distribution characteristics by computing its aliasing information, and samples in the boundary region get more opportunities to synthesize new samples. The third assigns different weights to samples; W-SMOTE and RSMOTE are such methods. W-SMOTE determines the generation position of synthesized samples by calculating the chaos level among samples, so the synthesized samples are safer; RSMOTE uses relative density to divide samples into two classes and assigns the number of synthesized samples by chaos level. Undersampling, as opposed to oversampling, balances the data by reducing the majority-class samples, so a subset of the data is obtained after undersampling. Common undersampling methods are Random Undersampling (RUS), Edited Nearest Neighbors (ENN), NearMiss, etc.
Mixed sampling combines oversampling and undersampling; its core idea is to balance the data distribution by adding minority-class samples and reducing majority-class samples. Representative mixed-sampling methods include SMOTE-Tomek and SMOTE-ENN. SMOTE-Tomek alleviates the inter-class overlap problem by removing pairs of samples that are mutual nearest neighbors but belong to different classes (Tomek links). SMOTE-ENN, after the SMOTE algorithm adds synthesized samples, deletes samples for which more than half of the k nearest neighbors disagree with the sample's own class. To obtain a better classification effect, integration methods fuse the prediction results of multiple sub-models into a final classification result; the classification performance of integration methods is therefore superior to that of a single classifier, and they are applied in many fields, such as medical health, biometric information and network intrusion detection. Depending on the manner of integration, integration-based methods can be divided into data-preprocessing integration and cost-sensitive integration. Data-preprocessing integration is further divided into three types: Bagging-based integration, Boosting-based integration and hybrid integration. The core idea of Bagging-based integration is to train independent models in parallel; random forest, as a representative method, randomly selects attributes as node-splitting attributes and builds a forest from a large number of decision trees. Boosting-based integration trains a series of models that depend on one another; classical methods of this kind include AdaBoost, GBDT and XGBoost. AdaBoost increases the weights of weak classifiers with small classification error rates so that they play a major role in the final classification decision, and decreases the weights of weak classifiers with large error rates so that they play a smaller role. GBDT takes the current prediction as reference in each round and uses the next weak classifier to fit the error between the predicted and true values; the sum of the results of all weak classifiers in GBDT equals the final predicted value. XGBoost implements and improves the GBDT algorithm efficiently: it continually performs feature splitting to generate trees, each tree fitting the residual of the previous prediction, and after training k trees the scores of all trees are summed to obtain the predicted value of a sample.
Hybrid integration adopts a double-integration scheme that combines Bagging and Boosting. EasyEnsemble and BalanceCascade are both classical hybrid integration methods. EasyEnsemble randomly divides the majority-class samples into several subsets, combines each subset with the minority-class samples as a training set, and then learns an AdaBoost classifier on that training set. BalanceCascade also uses AdaBoost as the base classifier; its core idea is to train the classifier on a training set with equal numbers of majority- and minority-class samples, control the false positive rate through the classification threshold, then count the correctly predicted majority-class samples, delete the same number of majority-class samples, and enter the next iteration. Cost-sensitive integration embeds cost-minimization techniques in the ensemble learning process; such methods do not require constant adjustment of the model, e.g., AdaCost, BoostedCS-SVM and BayEnsBNN. In addition, methods that combine sampling with integration have also attracted attention in order to better adapt to imbalanced data, for example the undersampling-based integration method SPE, the oversampling-based integration method SMOTEBagging, and the mixed-sampling-based integration method Logistic-BWE.
The first prior art, application number 202110671353.4, discloses a method and device for adaptive enhancement of network traffic data. The method comprises: clustering the original network traffic dataset with the hierarchical agglomerative clustering (HAC) algorithm and determining the minority-class clusters according to the imbalance ratio; obtaining the sparsity weights and quantity weights of the minority-class samples in the minority-class clusters; determining the number of minority-class synthesized samples from the sparsity and quantity weights; and enhancing the original network traffic dataset based on an oversampling algorithm and the number of minority-class synthesized samples. The device is used to execute the method. The parameter-free characteristic of the HAC clustering algorithm reduces the parameters that must be preset and the influence of noise; a scheme is provided for allocating each cluster's number of synthesized samples according to sample sparsity and the proportion of samples within the clusters, so the number of new samples each cluster needs can be allocated adaptively, solving the prior art's insufficient use of the information carried by synthesized samples.
The second prior art, application number 201610397836.9, discloses a shapelet-based community evolution prediction method comprising the following steps: step one, combining shapelets with a radial basis function (RBF) network; step two, using the found shapelets that represent the class characteristics of the time-series data as hidden-layer nodes of the RBF network; step three, obtaining the connection weights between the hidden layer and the output layer by gradient-descent learning. The method has higher precision and better generalization ability; with the two optimization strategies proposed by that invention, algorithm efficiency is improved by 1-2 orders of magnitude, and weighting the distance function by time improves classification accuracy by about 2% on average.
Through the above analysis, the problems and defects of the prior art are as follows: existing integration methods do not adequately consider the representativeness of the training data or the presence of noise samples when training the classifier.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides a weighted integration unbalanced classification method, system, storage medium, equipment and terminal.
The invention is realized as follows: the weighted integration unbalanced classification method KSDE removes noise samples from each class of data through an outlier detection technique; K-means clustering and the SMOTE algorithm are used on the majority-class and minority-class data respectively, and a plurality of balanced training subsets are generated with differential sampling rates to train sub-models; the G_Mean weights of the screened sub-models serve as prediction weights, and the final classification result is calculated by weighting the sub-models' prediction results.
Further, the preprocessing stage of the weighted integration unbalanced classification method comprises three parts: first, filling the missing values in the data, where discrete variables are filled with the mode of the corresponding attribute and continuous variables with the mean of the corresponding attribute; second, normalizing the data through the Min-Max formula to improve computational efficiency and algorithm accuracy; finally, applying an isolation forest algorithm to remove noise samples from each class of data:

x_new = (x − x_min) / (x_max − x_min)

where x is the selected sample, x_max is the maximum value of the corresponding attribute, x_min is the minimum value of the corresponding attribute, and x_new is the normalized result.
Further, the training stage of the weighted integration unbalanced classification method processes the majority class and the minority class separately. For the minority class, minority-class samples are selected in the numbers given by the differential sampling rate after SMOTE; the original minority samples and the synthesized minority samples are then combined into a dataset containing NS_i minority-class samples. For the majority class, the majority class is first clustered into K clusters by K-means clustering; then NS_i majority-class samples are selected, each cluster contributing a number of samples proportional to its share of the majority class; the selected majority-class and minority-class samples form C balanced training subsets. Finally, the base classifier is trained on each training subset to obtain C sub-models.
Further, K-means first selects K samples as cluster centers; then the remaining samples are classified, i.e., assigned to the nearest cluster center; finally, the cluster centers are updated iteratively until the best clustering effect is obtained.
Further, in the validation stage of the weighted integration unbalanced classification method, 10% of the data is selected as the validation set. First, the G_Mean scores of all sub-models on the validation set are calculated; second, sub-models with scores below the G_Mean average are deleted; finally, the prediction weights of the screened sub-models are calculated with the following formula:

G_i = G_Mean_i / Σ_j G_Mean_j

where G_i is the prediction weight of the i-th sub-model and G_Mean_i is the G_Mean of the i-th sub-model.
Further, in the test stage of the weighted integration unbalanced classification method, 10% of the data is used as the test set. The number of sub-models in the test stage is N; for each sample in the test set, the N prediction results are multiplied by the prediction weights of the corresponding sub-models and summed, giving the probability that the sample is classified as the positive class:

P(x) = Σ_{i=1}^{N} G_i × p_i(x)

where x is a sample in the test set, G_i is the weight of the i-th sub-model, p_i(x) is the prediction result of sample x on the i-th sub-model, and P(x) is the probability that sample x is classified as positive. If the prediction probability is greater than or equal to the threshold T (T is set to 0.5), the sample is classified as a positive example, i.e., the minority class; otherwise as a negative example, i.e., the majority class.
it is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the weighted integrated imbalance classification method.
It is a further object of the present invention to provide a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the weighted integrated imbalance classification method.
Another object of the present invention is to provide an information data processing terminal for implementing the weighted integration unbalanced classification method.
Another object of the present invention is to provide a weighted integration unbalanced classification system based on the weighted integration unbalanced classification method, the weighted integration unbalanced classification system comprising:
the noise removing module is used for removing noise samples in each type of data by the KSDE through an outlier detection technology;
the model generation module is used for respectively using K-means clustering and SMOTE algorithm for the majority class data and the minority class data, and generating a plurality of balanced training subsets to train the sub-model by using the differential sampling rate;
and the classification result module is used for taking the G_Mean weights of the screened sub-models as prediction weights and calculating the final classification result by weighting the prediction results.
In combination with the technical scheme and the technical problems to be solved, the technical scheme to be protected has the following advantages and positive effects:
First, the method of the present invention makes the training data purer and more representative. In addition, the weight calculation gives KSDE stronger generalization ability and better robustness. Experimental results indicate that, compared with 11 advanced methods, the classification performance of KSDE is the best over 33 datasets (including 8 credit risk datasets), and it is more competitive in identifying minority-class samples.
Second, the weighted integration method of the present invention based on differential sampling rates is called KSDE. The key idea of KSDE is to construct multiple sub-classifiers with data-distribution characteristics based on differential sampling rates and to weight and integrate the sub-classifiers' prediction results into the final classification result. On the 33 datasets (including 8 credit risk datasets), KSDE was compared with 11 advanced comparison methods. Experimental results show that the classification performance of KSDE is the best, with a particularly clear advantage in identifying minority-class samples; moreover, when KSDE contains enough sub-models, its classification performance is stable and robust. (1) A denoising strategy is designed: each class of data in the dataset is denoised through isolation forest to avoid interference of noise samples with the classifier. (2) A sampling method based on differential sampling rates is presented for generating balanced training subsets: a series of balanced training subsets is formed by K-means clustering and SMOTE under different sampling rates for the majority and minority classes respectively, so that the selected samples better reflect the data distribution and strengthen the distribution of minority samples to a certain extent. (3) A weighting strategy is designed: the final classification result is obtained by weighting the sub-models' prediction results with the weights derived from the comprehensive evaluation index G_Mean, improving the classification performance of the algorithm.
Third, S101: an outlier detection method is used to detect and remove noise samples in each class of data. Outlier detection helps to improve the robustness of the model and ensures the quality of the training data.
S102: the sub-models are trained using K-means clustering and the SMOTE algorithm for the majority- and minority-class data respectively, with differential sampling rates generating a plurality of balanced training subsets. The combination of the two sub-steps improves the balance of the training data and facilitates better training of the sub-models.
S103: the final classification result is calculated by weighting with the G_Mean weights of the screened sub-models (i.e., the sub-models that performed better). This ensures that the predictions of the best-performing sub-models carry more weight in the integration.
The method combines outlier detection, clustering, oversampling and sub-model weight allocation to solve the classification problem of unbalanced classes. Its technical advances include the improvement of data quality, the improvement of balance, and the sub-model weight-allocation strategy, all of which help to improve classifier performance. It should be noted, however, that the specific implementation details and algorithms require more background information for a thorough understanding.
Drawings
FIG. 1 is a flow chart of a weighted integration imbalance classification method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a sample of SMOTE synthesis provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a weighted integrated imbalance classification method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a confusion matrix for two classification problems provided by embodiments of the present invention;
FIG. 5 is a graph showing a comparison of the average of four indicators over 25 data sets for 12 methods according to an embodiment of the present invention;
FIG. 6 is a graph showing a comparison of the average of four metrics over 8 credit risk datasets provided by an embodiment of the invention;
FIG. 7 is a schematic diagram showing the influence of the number of submodels on the algorithm according to the embodiment of the present invention;
FIG. 8 is a graph showing the post-hoc test results of the KSDE and comparative method according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
As shown in fig. 1, the weighted integration unbalanced classification method provided by the embodiment of the invention includes the following steps:
S101: the KSDE removes noise samples in each type of data through an outlier detection technology;
S102: K-means clustering and the SMOTE algorithm are used on the majority-class and minority-class data respectively, and a plurality of balanced training subsets are generated with differential sampling rates to train sub-models;
S103: the G_Mean weights of the screened sub-models serve as prediction weights, and the final classification result is calculated by weighting the prediction results.
Example 1:
1. SMOTE
SMOTE is accepted by academia and industry due to its simple operation and excellent results, and it has a wide range of practical applications in industry.
The synthesis process of SMOTE is shown in FIG. 2. First, the k nearest neighbors of each minority-class sample are found. Second, for a minority-class sample, one sample is randomly selected from its k neighbors at a time. Finally, a new minority-class sample is synthesized by the linear interpolation of equation (1):

x_syn = x_min + rand(0,1) × (x_nn − x_min) (1)

where x_syn is the synthesized sample, x_min is a minority-class sample, rand(0,1) is a random number between 0 and 1, and x_nn is one of the k nearest neighbors of x_min.
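To make the interpolation concrete, the following is a minimal Python sketch of equation (1); it is an illustration rather than the patent's implementation (in practice a library implementation such as the one in imbalanced-learn would be used), and the function name, the k value and the brute-force neighbor search are assumptions:

```python
import numpy as np

def smote_synthesize(X_min, n_new, k=5, seed=0):
    """Sketch of SMOTE interpolation per equation (1): each synthetic
    sample lies on the segment between a minority sample x_min and one
    of its k nearest minority-class neighbors x_nn."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # pairwise distances within the minority class (brute force, small data)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                   # exclude each sample itself
    nn = np.argsort(d, axis=1)[:, :k]             # k nearest neighbors per sample
    synthetic = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        i = rng.integers(n)                       # x_min: a random minority sample
        j = nn[i, rng.integers(min(k, n - 1))]    # x_nn: one of its neighbors
        synthetic[t] = X_min[i] + rng.random() * (X_min[j] - X_min[i])
    return synthetic
```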
2. Differential sampling rate
Differential sampling rates are an effective sampling strategy for integrated models. Different numbers of minority-class samples are selected through different sampling rates, thereby differentiating the training data, which contributes to the diversity of the sub-models in the integration method. The differential sampling rate is calculated as follows.
a) The sample difference between the majority class and minority class is calculated as shown in equation (2).
D = N_Maj − N_Min (2)
b) The number of base classifiers M that need to be trained is determined. The sequence number of a classifier is denoted by m, m = 1, 2, …, M.
c) The corresponding sampling rate is generated according to equation (3):

SR_m = m / M (m = 1, 2, …, M) (3)
d) The number of minority-class samples after oversampling, determined by the differential sampling rate, is shown in equation (4), where the round function outputs an integer.
NS_m = N_Min + round(D × SR_m) (m = 1, 2, …, M) (4)
For example, given a dataset that contains 120 minority-class samples and 350 majority-class samples, D = 350 − 120 = 230. Assuming that 5 base classifiers need to be trained, SR_m ∈ [20%, 40%, 60%, 80%, 100%]. Thus, the numbers of minority-class samples used to train the base classifiers are NS_m ∈ [166, 212, 258, 304, 350].
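This worked example can be checked with a short sketch of equations (2)-(4) (equation (3) is written here as SR_m = m/M, which is what the sampling rates in the example imply):

```python
def differential_sampling_plan(n_maj, n_min, M):
    """Return (SR_m, NS_m) for each of the M base classifiers."""
    D = n_maj - n_min                                # equation (2)
    plan = []
    for m in range(1, M + 1):
        sr = m / M                                   # equation (3)
        plan.append((sr, n_min + round(D * sr)))     # equation (4)
    return plan

# Example from the text: 350 majority / 120 minority samples, 5 classifiers
print(differential_sampling_plan(350, 120, 5))
# -> [(0.2, 166), (0.4, 212), (0.6, 258), (0.8, 304), (1.0, 350)]
```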
Example 2:
1. KSDE algorithm
The flow chart of the KSDE is shown in FIG. 3. Next, the present invention will be described in terms of the preprocessing, training, validation and testing sections of KSDE.
1.1. Preprocessing stage
Preprocessing, as the first stage of KSDE, provides clean, complete data for subsequent processing. This stage includes three parts. First, the missing values in the data are filled: discrete variables are filled with the mode of the corresponding attribute, and continuous variables with the mean of the corresponding attribute. Second, the data is normalized (Min-Max Normalization) through formula (5) to improve computational efficiency and algorithm accuracy. Finally, an isolation forest algorithm is applied to remove noise samples from each class of data.

x_new = (x − x_min) / (x_max − x_min) (5)

where x is the selected sample, x_max is the maximum value of the corresponding attribute, x_min is the minimum value of the corresponding attribute, and x_new is the normalized result.
The isolation forest algorithm is selected because it is a fast anomaly detection method with linear time complexity and high accuracy.
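A hedged sketch of the three preprocessing parts, using pandas and scikit-learn's IsolationForest (the function and column names are illustrative assumptions, and the contamination parameter is left at its library default):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

def preprocess(df, label_col="label"):
    """Sketch: impute (mode for discrete, mean for continuous columns),
    Min-Max normalize per equation (5), then drop the samples that
    Isolation Forest flags as outliers within each class."""
    X = df.copy()
    feats = [c for c in X.columns if c != label_col]
    for c in feats:
        fill = X[c].mode()[0] if X[c].dtype == object else X[c].mean()
        X[c] = X[c].fillna(fill)
    num = [c for c in feats if X[c].dtype != object]
    X[num] = (X[num] - X[num].min()) / (X[num].max() - X[num].min())  # eq. (5)
    kept = []
    for _, part in X.groupby(label_col):               # denoise class by class
        flags = IsolationForest(random_state=0).fit_predict(part[num])
        kept.append(part[flags == 1])                  # -1 marks noise samples
    return pd.concat(kept)
```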
1.2. Training phase
The second stage is the training stage: 80% of the data is used to train sub-models by constructing C balanced training subsets with different sampling rates. The invention processes the majority class and the minority class separately. For the minority class, minority-class samples are selected in the numbers given by the differential sampling rate after SMOTE; the original minority samples and the synthesized minority samples are then combined into a dataset containing NS_i minority-class samples. For the majority class, the majority class is first clustered into K clusters by K-means clustering; then NS_i majority-class samples are selected, each cluster contributing a number of samples proportional to its share of the majority class. The selected majority-class and minority-class samples thus form C balanced training subsets. Finally, the base classifier is trained on each training subset to obtain C sub-models.
K-means is a classical unsupervised clustering algorithm. First, k samples are selected as cluster centers. Second, the remaining samples are classified, i.e., assigned to the nearest cluster center. Finally, the cluster centers are updated iteratively until the best clustering effect is obtained. After K-means clustering, samples within the same cluster are more similar and samples in different clusters less similar. In particular, the advantages of K-means are as follows: the principle is simple and easy to implement; the clustering effect is good and widely applicable; it runs efficiently and is suitable for data with large sample sizes; and the algorithm is well interpretable.
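A hedged sketch of building one balanced subset D_i in this stage (helper names are illustrative; the minority matrix is assumed to already contain the original plus SMOTE-synthesized samples, ordered so that its first ns_i rows can be taken):

```python
import numpy as np
from sklearn.cluster import KMeans

def one_balanced_subset(X_maj, X_min_aug, ns_i, K, seed=0):
    """Draw ns_i majority samples in proportion to K-means cluster sizes
    and combine them with ns_i minority samples into one subset."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X_maj)
    chosen = []
    for k in range(K):
        idx = np.flatnonzero(labels == k)
        if len(idx) == 0:
            continue
        take = round(ns_i * len(idx) / len(X_maj))   # cluster's proportional share
        chosen.append(rng.choice(idx, size=min(take, len(idx)), replace=False))
    X_maj_sel = X_maj[np.concatenate(chosen)]
    X_min_sel = X_min_aug[:ns_i]                     # ns_i minority samples in total
    X = np.vstack([X_maj_sel, X_min_sel])
    y = np.r_[np.zeros(len(X_maj_sel)), np.ones(len(X_min_sel))]  # 1 = minority
    return X, y
```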
1.3. Verification stage
The validation stage is the third stage, in which 10% of the data is selected as the validation set. First, the G_Mean scores of all sub-models on the validation set are calculated. Second, sub-models with scores below the G_Mean average are deleted. Finally, the prediction weights of the screened sub-models are calculated, as shown in equation (6):

G_i = G_Mean_i / Σ_j G_Mean_j (6)

where G_i is the prediction weight of the i-th sub-model and G_Mean_i is the G_Mean of the i-th sub-model.
The G_Mean weight is chosen as the prediction weight because it comprehensively evaluates the accuracy on both the majority and the minority class, and it is a simple and effective index for evaluating classification performance. Compared with indices such as accuracy and F-score, G_Mean reflects the classification effect more comprehensively and is easy to understand. Therefore, G_Mean is also one of the widely used evaluation indices for imbalanced classification problems.
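The screening and weighting of equation (6) can be sketched as follows (scikit-learn conventions are assumed, with the minority class labeled 1):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def g_mean_score(y_true, y_pred):
    """G_Mean = sqrt(TPR * TNR), computed from the confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))

def screen_and_weight(models, X_val, y_val):
    """Drop sub-models below the average G_Mean, then normalize the
    survivors' scores into prediction weights (equation (6))."""
    scores = np.array([g_mean_score(y_val, m.predict(X_val)) for m in models])
    keep = scores >= scores.mean()
    kept_models = [m for m, flag in zip(models, keep) if flag]
    weights = scores[keep] / scores[keep].sum()   # G_i = G_Mean_i / sum G_Mean_j
    return kept_models, weights
```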
1.4. Test phase
Finally, in the test stage, 10% of the data is used as the test set. Since sub-models that performed poorly were filtered out in the validation stage, the number of sub-models in the test stage is N. For a sample in the test set, the N prediction results are multiplied by the prediction weights of the corresponding sub-models and summed, giving the probability that the sample is classified as the positive class, as shown in equation (7):

P(x) = Σ_{i=1}^{N} G_i × p_i(x) (7)

where x is a sample in the test set, G_i is the weight of the i-th sub-model, p_i(x) is the prediction result of sample x on the i-th sub-model, and P(x) is the probability that sample x is classified as positive. A sample is classified as a positive example (minority class) if the prediction probability is greater than or equal to a threshold T (typically T is set to 0.5); otherwise it is classified as a negative example (majority class). The specific calculation is shown in formula (8).
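Equations (7) and (8) amount to a weighted soft vote. A minimal sketch, assuming the sub-models expose predict_proba as scikit-learn's logistic regression does and that class 1 is the minority class:

```python
import numpy as np

def ksde_predict(kept_models, weights, X_test, T=0.5):
    """Equation (7): P(x) = sum_i G_i * p_i(x); equation (8): threshold at T."""
    P = np.zeros(len(X_test))
    for w, model in zip(weights, kept_models):
        P += w * model.predict_proba(X_test)[:, 1]   # p_i(x), positive class
    return (P >= T).astype(int)                      # 1 = minority, 0 = majority
```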
The base classifier used in the invention is a logistic regression classifier. Logistic regression is widely used as one of the most common methods in the machine learning field; its application areas include credit evaluation, marketing prediction, fault detection, natural language processing, medical diagnosis and the like. It has the following advantages: it is simple to implement and widely applicable; its computational cost is low and its calculation speed high; and it is strongly interpretable and easy to understand.
2. Pseudocode and example of the algorithm
The pseudocode of KSDE is shown below. Based on differential sampling rates, the proposed method uses the SMOTE algorithm and K-means clustering for the minority class and the majority class respectively; on this basis, balanced subsets are constructed for training Logistic-based classifiers, which are then integrated. KSDE computes prediction weights for the sub-models to increase the robustness of the algorithm, and it removes noise samples from each class of data.
Input: a data set D; number of submodels C;
and (3) outputting: classification results of the KSDE algorithm;
pretreatment stage
1) Filling the missing value of the data set D, and normalizing the data;
2) Removing noise samples from most classes and few classes by adopting an isolated forest algorithm respectively;
3) Non-repeated sampling is carried out on the data set D, and a training data set, a verification data set and a test data set are generated;
training phase
4) Cluster the majority-class data of the training dataset into K clusters C_k (k = 1, 2, …, K) using K-means;
5) Calculate the proportion of each of the K clusters in the majority class, CM_k;
6) For i = 1 to C, perform the following steps in parallel;
7) Calculate the differential sampling rate SR_i;
8) Calculate the number of samples corresponding to the differential sampling rate, NS_i;
9) For the minority class, use the SMOTE-DSR algorithm to generate NS_i samples, forming the minority subset D_i^min;
10) For the majority class, use the K-means-DSR algorithm to extract NS_i samples in proportion to CM_k, forming the majority subset D_i^maj;
11) Combine D_i^min and D_i^maj into a balanced training subset for training the base classifier and generating the corresponding sub-model;
12) End
verification phase
13) Remove the sub-models that are below the average G_Mean on the validation dataset;
14) Calculate the prediction weight of each screened sub-model according to equation (6);
test phase
15) Calculate the probability that a sample in the test set is classified as a positive example according to equation (7);
16) Compare the probability with the threshold and obtain the final classification result of the sample according to equation (8);
The KSDE method is illustrated by an example. Assume that the number of sub-models is 8 and the sampling rates take the values {12.5%, 25.0%, 37.5%, 50.0%, 62.5%, 75.0%, 87.5%, 100%}. Assume the G_Mean scores of the 8 sub-models are 0.1, 0.11, 0.12, 0.12, 0.13, 0.13, 0.14 and 0.15, respectively. First, the average G_Mean of the sub-models is calculated to be 0.125; after screening, 4 sub-models are selected for prediction. Second, the weights of the 4 selected sub-models are calculated: the weight of the fifth sub-model is 0.13/(0.13+0.13+0.14+0.15) = 0.24, and the weights of the other three sub-models are similarly 0.24, 0.25 and 0.27. Assume that the sample s obtains predictions 0.45, 0.5, 0.59 and 0.65 on these sub-models, and the threshold T is set to 0.5. Finally, the prediction result of sample s is calculated as follows:
P(s)=0.24×0.45+0.24×0.5+0.25×0.59+0.27×0.65=0.551
Since 0.551 is greater than the threshold 0.5, the sample s is classified into the minority class.
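The arithmetic of this example can be reproduced directly (with unrounded weights the sum is ≈0.552; the 0.551 in the text comes from the rounded weights):

```python
g = [0.13, 0.13, 0.14, 0.15]               # G_Mean of the 4 retained sub-models
w = [gi / sum(g) for gi in g]              # ~[0.236, 0.236, 0.255, 0.273]
p = [0.45, 0.50, 0.59, 0.65]               # predictions of sample s
P = sum(wi * pi for wi, pi in zip(w, p))   # ~0.552
print("minority" if P >= 0.5 else "majority")   # -> minority
```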
The weighted integrated unbalanced classification system provided by the embodiment of the invention comprises:
the noise removing module is used for removing noise samples in each type of data by the KSDE through an outlier detection technology;
the model generation module is used for respectively using K-means clustering and SMOTE algorithm for the majority class data and the minority class data, and generating a plurality of balanced training subsets to train the sub-model by using the differential sampling rate;
And the classification result module is used for taking the G_Mean weights of the screened sub-models as prediction weights and calculating the final classification result by weighting the prediction results.
Performance of the KSDE algorithm. All experiments were performed on a 2.8 GHz Intel Core i7-1165G7 CPU with 32 GB RAM under a 64-bit Windows 10 operating system, and were implemented in Python. The experiments are divided into four parts. First, the datasets used for the experiments, the comparison algorithms, the evaluation indices and the parameter settings are presented. Second, the classification performance of KSDE and the 11 comparison methods on 25 datasets is analyzed, and the application in the field of credit risk assessment is discussed. Then, the effect of the number of sub-models on the stability of the algorithm is studied. Finally, the invention performs significance tests on the proposed algorithm, including the Wilcoxon signed-rank test and the Nemenyi post-hoc test.
1. Experimental setup
1.1. Data set
The 33 real datasets used in the experiments include 25 common datasets and 8 credit risk datasets. These datasets all come from UCI, Kaggle and a website on Chinese client loans (https://www.datafountain.cn/datasets/6218). In addition, the invention selects representative datasets from different fields, such as Pima, German credit and Waveform. Table 1 shows the number of features, number of samples, majority-class count, minority-class count and imbalance ratio (IR) of each dataset. The imbalance ratio ranges from 1.25 to 577.88, and the maximum sample size exceeds 280,000. It should be noted that the datasets used in the experiments are binary-classification data. For example, in the vehicle2 dataset, the 'opel' class samples are treated as the minority class and the remaining samples as the majority class.
Table 1 details of 33 data sets
1.2. Comparison methods. In order to comprehensively compare the performance of the proposed method KSDE, the invention compares it with 11 advanced methods, which are divided into three classes. (1) Four basic comparison algorithms: Decision Tree (DT), Logistic, k-nearest neighbors (kNN) and back-propagation artificial neural network (BP). (2) Five non-sampling integration algorithms: Bagging, Random Forest (RF), adaptive boosting (AdaBoost), Gradient Boosting Decision Tree (GBDT) and extreme gradient boosting (XGBoost). (3) Two sampling integration algorithms: Self-Paced Ensemble (SPE, based on undersampling) and SMOTEBagging (based on oversampling).
1.3. Evaluation index
FIG. 4 shows the confusion matrix for the two-class problem. TP represents the number of positive examples that are correctly predicted, FN represents the number of positive examples that are incorrectly predicted as negative, and FP and TN can be interpreted similarly. The evaluation indices used in the experiments are calculated from the confusion matrix.
AUC (Area Under Curve) is defined as the area under the ROC curve, as shown in equation (12). The ROC curve is composed of multiple pairs of TPR and FPR values. The greater the AUC value, the better the classification performance of the algorithm.
TPR is the true positive rate, also known as Recall, as shown in equation (11):

TPR = TP / (TP + FN) (11)

The higher the value of TPR, the greater the number of positive examples (minority class) that are classified correctly. In the imbalanced classification problem, the invention is more concerned with the classification accuracy of the minority class; thus, TPR is an important evaluation index.
The calculation of G_Mean combines TPR and TNR, making it a comprehensive evaluation index, as shown in equation (13):

G_Mean = sqrt(TPR × TNR) (13)

A larger G_Mean value indicates better classification of both the minority and the majority class.
The above evaluation indices cannot well highlight the key class (the minority class) or the influence of an individual class on the classification performance. Therefore, the invention adds the IBA_γ(G_Mean) (IBA) evaluation index, which combines the comprehensive classification accuracy G_Mean with the classification accuracy of the individual class, as shown in equation (14), where Dom = TPR − TNR denotes dominance and γ (γ ≥ 0) is the weight of Dom, adjusting the effect of Dom on G_Mean. As in previous work, γ is set to 0.05. The higher the value of IBA, the better the classification effect.

IBA_γ(G_Mean) = (1 + γ × Dom) × G_Mean (14)
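A short sketch of indices (11), (13) and (14) computed from confusion-matrix counts (AUC is omitted because it requires ranked scores rather than counts):

```python
def imbalance_metrics(tp, fn, fp, tn, gamma=0.05):
    """TPR (eq. 11), G_Mean (eq. 13) and IBA (eq. 14) from counts."""
    tpr = tp / (tp + fn)                   # true positive rate / recall
    tnr = tn / (tn + fp)                   # true negative rate
    g_mean = (tpr * tnr) ** 0.5            # eq. (13)
    dom = tpr - tnr                        # dominance
    iba = (1 + gamma * dom) * g_mean       # eq. (14), gamma = 0.05 as in the text
    return {"TPR": tpr, "G_Mean": g_mean, "IBA": iba}
```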
1.4. Parameter settings. All comparison algorithms use their default parameters in Python. The proposed KSDE algorithm needs two parameters to be set, namely the number of sub-models C and the number of majority-class clusters K; the remaining parameters use defaults. In the stability experiment, the number of sub-models C was set to 20 according to the experimental results. The number of clusters K ∈ [1, 20] is determined as follows: first, K-means clustering is applied to the training set; second, AUC scores are calculated for the 20 different K values; finally, the clustering corresponding to the best AUC is found and the number of clusters that make up the bulk of the majority class is taken as the value of K.
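The description of the K-selection procedure is brief; the following hedged sketch approximates it by scoring each candidate K with the validation AUC of a logistic sub-model trained on one proportionally drawn balanced subset (the subset construction here is simplified relative to the full pipeline, and all names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def select_k(X_tr, y_tr, X_val, y_val, k_max=20, seed=0):
    """Pick the K in [1, k_max] whose validation AUC is best."""
    rng = np.random.default_rng(seed)
    X_maj, X_min = X_tr[y_tr == 0], X_tr[y_tr == 1]
    best_k, best_auc = 1, -1.0
    for K in range(1, k_max + 1):
        labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X_maj)
        idx = []                   # draw len(X_min) majority samples proportionally
        for k in range(K):
            members = np.flatnonzero(labels == k)
            if len(members) == 0:
                continue
            take = max(1, round(len(X_min) * len(members) / len(X_maj)))
            idx.extend(rng.choice(members, size=min(take, len(members)), replace=False))
        X_bal = np.vstack([X_maj[idx], X_min])
        y_bal = np.r_[np.zeros(len(idx)), np.ones(len(X_min))]
        clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
        auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
        if auc > best_auc:
            best_k, best_auc = K, auc
    return best_k
```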
All experimental results were obtained by five runs of ten-fold cross-validation. In addition, the training set of the proposed method accounts for 80% of the dataset, and the validation set and the test set each account for 10%, all generated by sampling without replacement.
2. Results and analysis
Comparison of KSDE with 11 comparison methods
To verify the classification performance of the proposed method, the invention compares KSDE with the 11 advanced comparison methods in terms of AUC, TPR, G_Mean and IBA. Tables 2-5 show the results of the four evaluation indices for the 12 methods over the 25 datasets. It can be seen that the method of the invention achieves the highest average and ranking on every index. In particular, in terms of TPR, KSDE is about 12% higher than the recently proposed SPE method, which shows that KSDE's ability to correctly identify the minority class is more prominent.
FIG. 5 shows the average of the four indices over the 25 datasets for all methods. As can be seen from FIG. 5, KSDE not only achieves the highest score on all four indices but also scores nearly evenly across them. Thus, the proposed method is both effective and balanced for solving the imbalanced classification problem.
FIG. 6 shows box plots of the 12 methods over the 25 datasets. The four sub-graphs represent the AUC, TPR, G_Mean and IBA indices, respectively, with one box per method. Each box contains the outliers (black circles), maximum, minimum, median (horizontal line in the box), mean (green triangle), and upper and lower quartiles of a set of data. As can be seen from FIG. 6, the box of KSDE is located highest on all four indices and is shorter. Furthermore, KSDE also achieves the best median on all indices. In conclusion, the overall classification effect of KSDE is better and its performance is stable.
Table 2 AUC results for the 12 methods over 25 data sets
Table 3 TPR results for the 12 methods over 25 data sets
Table 4 g_mean results of 12 methods over 25 datasets
TABLE 5 IBA results for the 12 methods over 25 datasets
Application of KSDE to Credit Risk assessment
To illustrate the practical value of the proposed method, the invention applies KSDE in the field of credit risk assessment. Specifically, KSDE is compared against the 11 comparison methods on 8 real credit risk datasets, using AUC, TPR, G_Mean and IBA as evaluation indices. Table 6 shows the results of the four evaluation indices for the 12 methods over the 8 datasets. Looking at Table 6, the index scores of the proposed method are the highest on most datasets, and KSDE obtains the best ranking on the four evaluation indices over all datasets. In particular, on the Chinese personal loan data, the minority-class classification accuracy TPR of KSDE is 0.9433 and the comprehensive classification accuracy G_Mean is 0.8199; these scores are far higher than the results of the comparison methods.
Furthermore, FIG. 6 shows the average of the four indices over the 8 credit risk datasets for all methods. As can be seen from FIG. 6, the classification performance of KSDE on credit risk datasets exceeds the other 11 advanced comparison methods. In conclusion, the method has good application prospects in the field of credit risk assessment.
Table 6 results of four metrics on 8 credit risk datasets for 12 methods
3. Stability test of KSDE
Regarding the effect of the number of sub-models C on the stability of the proposed method, the invention finds the best value of C by experiment. For this purpose, the parameter C is selected from [5, 10, 15, 20, 25, 30]. The averages of the six C values over the four evaluation indices are shown in FIG. 7. As can be seen from FIG. 7, when the number of sub-models is 20, every index of KSDE scores highest and performs best; when the parameter C is greater than 20, the performance of the algorithm tends to stabilize. This shows that the stability of the proposed method is good and that performance does not keep improving with more sub-models once the best classification performance has been reached. Therefore, the invention finally selects 20 sub-models, i.e., C is 20.
4. Significance test
To verify whether the differences between the methods are statistically significant, the invention analyzes the results of the four indices of the 12 methods on the 33 datasets with two non-parametric statistical tests.
First, the invention uses the Wilcoxon signed-rank test with a significance level of 0.05 to test pairwise whether the performance of the proposed method is the same as that of each comparison method. The test results are shown in Table 7. Clearly, the KSDE method rejects the null hypothesis, i.e., it is statistically different from the comparison methods.
TABLE 7 Wilcoxon signed rank test results for KSDE and comparison method
The invention further distinguishes the performance differences between all methods by the Nemenyi post-hoc test with a significance level of 0.1. The test results are shown in FIG. 8. As can be seen from FIG. 8, the classification of KSDE performs best and is superior to almost all comparison methods; in particular, KSDE is superior to classical integration methods such as Bagging, AdaBoost, random forest and GBDT. Furthermore, in terms of TPR, the proposed method rejects the null hypothesis against all 11 methods; that is, KSDE has significant advantages in the recognition of the minority class.
The invention provides a weighted integration method based on differential sampling rates for credit risk assessment, called KSDE. The core idea is to construct multiple subsets with data-distribution characteristics through different sampling rates and then train multiple sub-models on them. Second, the G_Mean of each sub-model on the validation set is calculated as its prediction weight. Finally, the final classification result is obtained by weighting the sub-models' prediction results with their prediction weights, so as to improve the prediction stability of the method. On the 33 datasets (including 8 credit risk datasets), KSDE was compared with 11 advanced comparison methods. Experimental results show that the classification accuracy of KSDE is the highest, with a particularly clear advantage in identifying minority-class samples. In addition, when the KSDE method contains enough sub-models, its classification performance is stable and robust.
Two specific examples are listed below, as well as the implementations they employ:
example 1: bank credit scoring in this embodiment, a bank's credit scoring problem will be considered, where the majority class is good-credit customers and the minority class is bad-credit customers.
Ksde outlier detection:
the implementation scheme is as follows: an outlier detection method based on statistical analysis, such as the Z-Score or IQR (quartile range) method, is used to identify outliers in credit Score data.
2.K-means clustering and SMOTE algorithm:
k-means clustering implementation scheme: the credit score data of the customers is divided into clusters to better understand the distribution of credit scores. K-means implementations in machine learning libraries such as scikit-learn may be used.
The SMOTE algorithm implementation scheme: the SMOTE algorithm is used to synthesize samples of bad-credit customers to balance the dataset. The SMOTE implementation in the imbalanced-learn library may be used.
3. Sub-model training:
multiple classification models, such as decision trees, random forests, logistic regression, etc., are trained using the K-means and SMOTE generated balance training subsets.
G_mean weight and integration:
Calculate the G_Mean weight of each sub-model and use these weights to weight the sub-models' prediction results, so as to obtain the final credit score classification result.
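Putting the embodiment's four steps together, a hedged end-to-end sketch using scikit-learn and imbalanced-learn (bad-credit customers are assumed to be labeled 1, SMOTE's sampling_strategy ratio is used to emulate the differential sampling rates, and all names are illustrative):

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

def g_mean_score(y, yhat):
    tn, fp, fn, tp = confusion_matrix(y, yhat).ravel()
    return np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))

def credit_scoring_ensemble(X_tr, y_tr, X_val, y_val, X_test, C=5):
    """Train C sub-models on subsets oversampled at increasing rates,
    keep and weight the above-average ones by validation G_Mean, and
    combine their predicted probabilities."""
    n_maj, n_min = int(np.sum(y_tr == 0)), int(np.sum(y_tr == 1))
    models = []
    for m in range(1, C + 1):
        ns = n_min + round((n_maj - n_min) * m / C)    # differential target size
        Xb, yb = SMOTE(sampling_strategy=ns / n_maj,
                       random_state=m).fit_resample(X_tr, y_tr)
        models.append(LogisticRegression(max_iter=1000).fit(Xb, yb))
    scores = np.array([g_mean_score(y_val, m.predict(X_val)) for m in models])
    keep = scores >= scores.mean()
    weights = scores[keep] / scores[keep].sum()
    kept = [m for m, f in zip(models, keep) if f]
    P = sum(w * m.predict_proba(X_test)[:, 1] for w, m in zip(weights, kept))
    return (P >= 0.5).astype(int)                      # 1 = bad credit (minority)
```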
Example 2: medical diagnosis. In this embodiment, a medical diagnosis problem is considered in which the majority class is healthy patients and the minority class is patients with rare diseases.
1. KSDE outlier detection:
the implementation scheme is as follows: outlier detection methods based on medical data are used, such as outlier detection based on statistics or based on domain knowledge.
2.K-means clustering and SMOTE algorithm:
k-means clustering implementation scheme: medical data is K-means clustered to better understand data distribution.
The SMOTE algorithm implementation scheme comprises the following steps: synthetic samples of rare disease patients were synthesized using SMOTE algorithm to balance the dataset.
3. Sub-model training:
a plurality of medical diagnostic models, such as support vector machines, neural networks, convolutional neural networks, and the like, are trained using a balanced training subset generated by K-means and SMOTE.
G_mean weight and integration:
Calculate the G_Mean weight of each sub-model and use these weights to weight the sub-models' prediction results, so as to obtain the final medical diagnosis classification result.
These two embodiments demonstrate how to apply weighted integration methods in unbalanced class problems, including outlier detection, clustering, oversampling, and weight distribution strategies for sub-models. The specific implementation will vary depending on the nature of the data and task, but these steps provide a basic framework to address the classification challenges of the unbalanced categories.
It should be noted that the embodiments of the present invention can be realized in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or special purpose design hardware. Those of ordinary skill in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such as provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The device of the present invention and its modules may be implemented by hardware circuitry, such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, etc., or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., as well as software executed by various types of processors, or by a combination of the above hardware circuitry and software, such as firmware.
The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of the invention is not limited thereto, but any modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present invention will be apparent to those skilled in the art within the scope of the present invention.

Claims (10)

1. A weighted integrated imbalance classification method, characterized in that the weighted integrated imbalance classification method removes noise samples in each class of data through the KSDE outlier detection technique; K-means clustering and the SMOTE algorithm are applied to the majority-class data and the minority-class data respectively, and a plurality of balanced training subsets are generated with differentiated sampling rates to train sub-models; and the G_mean weights of the screened sub-models are taken as prediction weights, the prediction results are weighted, and the final classification result is calculated.
2. The weighted integrated imbalance classification method of claim 1, wherein the preprocessing stage of the weighted integrated imbalance classification method comprises three parts: first, missing values in the data are filled, with discrete attributes filled by the mode of the corresponding attribute and continuous attributes by the mean of the corresponding attribute; second, the data are normalized with the Min-Max formula to improve computational efficiency and algorithm accuracy; finally, an isolation forest algorithm is applied to remove noise samples in each class of data;

x_new = (x - x_min) / (x_max - x_min)

where x is the selected sample, x_max is the maximum value of the corresponding attribute, x_min is the minimum value of the corresponding attribute, and x_new is the normalized result.
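A minimal sketch of this preprocessing stage, assuming scikit-learn (the toy data, missing-value rate, and isolation-forest contamination of 5% are assumptions):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.05] = np.nan   # toy continuous data with missing values

# Mean fill for continuous attributes; discrete attributes would instead use
# SimpleImputer(strategy="most_frequent"), i.e. the mode
X = SimpleImputer(strategy="mean").fit_transform(X)

# Min-Max normalization: x_new = (x - x_min) / (x_max - x_min)
X = MinMaxScaler().fit_transform(X)

# Remove the samples that an isolation forest flags as noise
keep = IsolationForest(contamination=0.05, random_state=0).fit_predict(X) == 1
X = X[keep]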
3. The weighted integrated imbalance classification method of claim 1, wherein the training phase of the weighted integrated imbalance classification method processes the majority class and the minority class separately; for the minority class, a corresponding number of minority-class samples is selected according to the differentiated sampling rate after SMOTE, and the original minority-class samples and the synthesized minority-class samples are then combined into a dataset of minority-class samples; for the majority class, the samples are first clustered into K clusters by K-means clustering; next, an equal number of majority-class samples is selected, with each cluster contributing samples in proportion to its share of the majority class, and the selected majority-class and minority-class samples form C balanced training subsets; finally, the base classifier is trained on the training subsets to obtain C sub-models.
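A minimal sketch of the subset construction in this claim; folding the differentiated sampling rates into a single target size per subset is a simplification:

import numpy as np

def balanced_subsets(X_maj, X_min_aug, cluster_labels, C, seed=0):
    # Build C subsets; each pairs all (original + synthesized) minority samples
    # with an equal number of majority samples, drawn from every K-means
    # cluster in proportion to that cluster's share of the majority class.
    rng = np.random.default_rng(seed)
    n_target = len(X_min_aug)
    subsets = []
    for _ in range(C):
        idx = []
        for c in np.unique(cluster_labels):
            members = np.flatnonzero(cluster_labels == c)
            n_c = min(len(members),
                      max(1, round(n_target * len(members) / len(X_maj))))
            idx.extend(rng.choice(members, size=n_c, replace=False))
        subsets.append((X_maj[idx], X_min_aug))
    return subsets

Each subset then trains one base classifier, yielding the C sub-models.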
4. The weighted integrated imbalance classification method of claim 3, wherein K-means first selects K samples as initial cluster centers; next, the remaining samples are classified by assigning each to its nearest cluster center; finally, the cluster centers are iteratively updated until the best clustering result is obtained.
5. The weighted integrated imbalance classification method of claim 1, wherein 10% of the data is selected as the validation set in the validation phase of the weighted integrated imbalance classification method; first, the G_mean score of each sub-model on the validation set is calculated; second, sub-models with scores lower than the average G_mean are deleted; finally, the prediction weights of the screened sub-models are calculated; the specific calculation formula is as follows:

G_i = G_mean_i / Σ_j G_mean_j

where G_i represents the prediction weight of the i-th sub-model, G_mean_i represents the G_mean of the i-th sub-model, and the sum runs over the sub-models retained after screening.
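A minimal sketch of this screening and weighting rule (the toy G_mean scores are assumptions):

import numpy as np

def screen_and_weight(g_means):
    # Delete sub-models scoring below the average G_mean, then normalize the
    # survivors' scores into prediction weights G_i that sum to 1
    g = np.asarray(g_means, dtype=float)
    keep = g >= g.mean()
    return keep, g[keep] / g[keep].sum()

keep, G = screen_and_weight([0.81, 0.62, 0.78, 0.90])
# keep -> [True, False, True, True]; G holds the three survivors' weights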
6. The weighted integrated imbalance classification method of claim 1, wherein 10% of the data is used as the test set in the test stage of the weighted integrated imbalance classification method; the number of sub-models in the test stage is N, and the N prediction results for a sample in the test set are multiplied by the prediction weights of the corresponding sub-models and summed to obtain the probability that the sample is classified as the positive class:

P(x) = Σ_{i=1}^{N} G_i · p_i(x)

where x represents a sample in the test set, G_i is the weight of the i-th sub-model, p_i(x) represents the prediction result of sample x in the i-th sub-model, and P(x) is the probability that sample x is classified as the positive class; the threshold T is set to 0.5; if the predicted probability is greater than or equal to T, the sample is classified as a positive example, i.e. the minority class, and otherwise as a negative example, i.e. the majority class; the specific decision rule is as follows:

y(x) = 1 (minority class) if P(x) ≥ T; y(x) = 0 (majority class) otherwise.
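A minimal sketch of this test-stage rule, assuming each retained sub-model exposes predict_proba (the models and weights would come from the earlier sketches):

def ensemble_predict(models, weights, X, T=0.5):
    # P(x) = sum_i G_i * p_i(x); classify as the positive (minority) class
    # when P(x) >= T, with T = 0.5 as in the claim
    P = sum(w * m.predict_proba(X)[:, 1] for m, w in zip(models, weights))
    return (P >= T).astype(int), P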
7. A computer device, characterized in that it comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the weighted integrated imbalance classification method of any one of claims 1-6.
8. A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the weighted integrated imbalance classification method of any one of claims 1-6.
9. An information data processing terminal, characterized in that the information data processing terminal is adapted to implement the weighted integration imbalance classification method of any one of claims 1 to 6.
10. A weighted integrated imbalance classification system based on the weighted integrated imbalance classification method of any one of claims 1-6, characterized in that the weighted integrated imbalance classification system comprises:
a noise removal module, used for removing noise samples in each class of data through the KSDE outlier detection technique;
a model generation module, used for applying K-means clustering and the SMOTE algorithm to the majority-class and minority-class data respectively, and generating a plurality of balanced training subsets with differentiated sampling rates to train the sub-models;
and a classification result module, used for taking the G_mean weights of the screened sub-models as prediction weights and calculating the final classification result by weighting the prediction results.
CN202311322442.3A 2023-10-12 2023-10-12 Weighting integration unbalance classification method, system, storage medium, equipment and terminal Pending CN117312920A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311322442.3A CN117312920A (en) 2023-10-12 2023-10-12 Weighting integration unbalance classification method, system, storage medium, equipment and terminal


Publications (1)

Publication Number Publication Date
CN117312920A true CN117312920A (en) 2023-12-29

Family

ID=89236975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311322442.3A Pending CN117312920A (en) 2023-10-12 2023-10-12 Weighting integration unbalance classification method, system, storage medium, equipment and terminal

Country Status (1)

Country Link
CN (1) CN117312920A (en)


Legal Events

Date Code Title Description
PB01 Publication