CN113378927A - Clustering-based self-adaptive weighted oversampling method


Info

Publication number: CN113378927A
Authority: CN (China)
Prior art keywords: cluster, samples, data, score, clustering
Legal status: Pending
Application number: CN202110650447.3A
Other languages: Chinese (zh)
Inventors: 张爽 (Zhang Shuang), 何云斌 (He Yunbin), 杨海波 (Yang Haibo)
Current Assignee: Harbin University of Science and Technology
Original Assignee: Harbin University of Science and Technology
Filing date: 2021-06-11
Publication date: 2021-09-10
Application filed by Harbin University of Science and Technology

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation with fixed number of clusters, e.g. K-means clustering
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers


Abstract

The invention relates to a clustering-based adaptive weighted oversampling method. First, k-means clustering is applied to the minority-class samples, and each resulting cluster is combined with the majority-class samples to form a separate data set. Each data set is classified with a random forest algorithm, a score value is computed under 5-fold cross validation, and the average of the fold values is taken as the cluster's score. A synthesis weight is then derived from each cluster's score, the number of samples to generate per cluster is computed from these weights, and that number of new samples is synthesized by random linear interpolation between samples within each cluster, finally balancing the data set.

Description

Clustering-based self-adaptive weighted oversampling method
Technical Field
The invention relates to the field of data mining, and in particular to a clustering-based adaptive weighted oversampling method.
Background
Imbalanced data is widespread in practical applications: when the numbers of samples in different classes are unbalanced, or differ greatly, a data set with such a distribution is considered imbalanced. The fundamental problem in imbalanced learning is that the skewed class distribution severely degrades the performance of many traditional machine-learning classification algorithms.
As research on handling imbalanced data sets has deepened, current work on the problem falls into two main lines: algorithm-level and data-level approaches. At the data level, the main techniques are oversampling, undersampling, and hybrid sampling. Compared with the other two, oversampling balances the data set by generating minority-class samples, and so avoids discarding majority-class samples that may carry important information. Many oversampling methods have been developed, such as SMOTE, Borderline-SMOTE, and ADASYN, but these methods sample based only on minority-class information and do not consider how the minority class is actually classified against the majority class, which lowers the quality of the synthesized samples.
Disclosure of Invention
The invention aims to provide an oversampling method that clusters the minority-class samples, determines each cluster's sampling weight from how well that cluster is classified against the majority-class samples, and thereby improves the quality of the generated minority-class samples.
The technical solution realizing this purpose is a clustering-based adaptive weighted oversampling method comprising the following steps.
Step 1: Take the unbalanced data set as input, separate the minority-class samples from the majority-class samples, and compute the number of samples to be generated.
Step 2: Divide the minority-class data into several clusters with the k-means clustering algorithm, and combine each cluster with the majority-class data to form several data sets.
Step 3: For each data set, compute a score value with a random forest algorithm under 5-fold cross validation, and take it as the cluster's score.
Step 4: Compute each cluster's sampling weight from its score, and determine the number of samples to synthesize per cluster.
Step 5: Within each cluster, perform random linear interpolation between samples according to that number.
In step 2, the minority-class data are divided into several clusters with the k-means clustering algorithm and combined with the majority-class data into several data sets, as follows.
Step 2.1: Randomly select k data points from the minority-class samples as initial cluster centers.
Step 2.2: Compute the Euclidean distance d(s_i, c_j) between each data point s_i and each of the k cluster centers c_j, and assign each point to the cluster of its nearest center.
Step 2.3: Compute the mean of the data points in each cluster and use it as that cluster's center for the next iteration.
Step 2.4: Repeat steps 2.2 and 2.3 until the maximum number of iterations is reached or the cluster centers no longer change appreciably.
Step 2.5: Combine each of the k clusters obtained in step 2.4 with the majority-class samples to form k data sets.
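As a minimal sketch of steps 2.1 to 2.5, assuming scikit-learn's `KMeans` in place of the hand-rolled iteration (the function and variable names are illustrative, not from the patent):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_cluster_datasets(X_min, X_maj, k=3, seed=0):
    """Cluster the minority class (steps 2.1-2.4), then pair each
    cluster with the full majority class (step 2.5)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_min)
    datasets = []
    for c in range(k):
        Xc = X_min[km.labels_ == c]                        # one minority cluster
        X = np.vstack([Xc, X_maj])                         # cluster + majority
        y = np.r_[np.ones(len(Xc)), np.zeros(len(X_maj))]  # 1 = minority class
        datasets.append((X, y))
    return datasets, km.labels_
```

Each returned pair `(X, y)` is one of the k data sets that step 3 scores independently.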
In step 3, for each data set a random forest algorithm is evaluated under 5-fold cross validation to compute a score value, which is taken as the cluster's score, as follows.
Step 3.1: Split each data set obtained in step 2 into 5 groups for 5-fold cross validation.
Step 3.2: In each fold, use 1 group as the test set and the remaining 4 groups as the training set; train a random forest on the training set, predict the test set with the trained model, compute the AUC, F-measure, and G-mean of the predictions, and average them.
Step 3.3: Repeat step 3.2 for the 5 folds and average the 5 fold values to obtain the cluster's score.
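A sketch of steps 3.1 to 3.3, under stated assumptions: the patent lists AUC, "F-mean" (read here as the F-measure), and G-mean but does not say how the three metrics combine into one score, so averaging them is an assumption, and the function name is illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cluster_score(X, y, seed=0):
    """5-fold CV score of one cluster's data set (steps 3.1-3.3)."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    fold_scores = []
    for tr, te in skf.split(X, y):
        clf = RandomForestClassifier(n_estimators=50, random_state=seed)
        clf.fit(X[tr], y[tr])
        pred = clf.predict(X[te])
        prob = clf.predict_proba(X[te])[:, 1]
        auc = roc_auc_score(y[te], prob)
        f_measure = f1_score(y[te], pred)
        sens = recall_score(y[te], pred)                # minority-class recall
        spec = recall_score(y[te], pred, pos_label=0)   # majority-class recall
        g_mean = np.sqrt(sens * spec)
        fold_scores.append((auc + f_measure + g_mean) / 3.0)  # assumed combination
    return float(np.mean(fold_scores))
```

A cluster whose samples the random forest separates cleanly from the majority class scores near 1; overlapping, hard-to-classify clusters score lower and will receive more synthetic samples in step 4.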
In step 4, each cluster's sampling weight is computed from its score and the number of samples to synthesize per cluster is determined, as follows.
Step 4.1: For each cluster, take 1 minus the cluster's score as its sampling score, and compute the sum of the sampling scores over all clusters.
Step 4.2: Divide each cluster's sampling score by this sum to obtain the cluster's sampling weight.
Step 4.3: Multiply the difference between the numbers of majority-class and minority-class samples in the original data by the cluster's sampling weight to obtain the cluster's synthesis count.
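The arithmetic of steps 4.1 to 4.3 can be written out directly. This is a sketch; the rounding rule is an assumption, since the patent does not say how fractional counts are handled.

```python
def synthesis_counts(scores, n_maj, n_min):
    """Clusters that classify worse (lower score) get more synthetic samples."""
    sampling_scores = [1.0 - s for s in scores]      # step 4.1
    total = sum(sampling_scores)
    weights = [v / total for v in sampling_scores]   # step 4.2
    gap = n_maj - n_min                              # samples needed overall
    return [round(w * gap) for w in weights]         # step 4.3

# e.g. cluster scores 0.9 and 0.7 with 100 majority vs 40 minority samples:
# sampling scores are 0.1 and 0.3, weights 0.25 and 0.75,
# so the gap of 60 splits as 15 and 45 synthetic samples.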
In step 5, random linear interpolation between samples is performed within each cluster according to its synthesis count, as follows.
Step 5.1: Randomly select two sample points within the cluster and synthesize a new point between them by random interpolation.
Step 5.2: Repeat step 5.1 until the number of new points equals the cluster's synthesis count.
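Steps 5.1 and 5.2 are SMOTE-style interpolation restricted to pairs inside one cluster. A sketch, with illustrative names:

```python
import numpy as np

def interpolate_cluster(Xc, n_new, seed=0):
    """Synthesize n_new points by random linear interpolation
    between random pairs of samples in one cluster."""
    rng = np.random.default_rng(seed)
    new_points = []
    for _ in range(n_new):
        i, j = rng.choice(len(Xc), size=2, replace=False)  # step 5.1: pick a pair
        lam = rng.random()                                 # position on the segment
        new_points.append(Xc[i] + lam * (Xc[j] - Xc[i]))
    return np.array(new_points)                            # step 5.2: n_new points
```

Because every new point lies on a segment between two cluster members, the synthetic samples stay inside the cluster's convex hull rather than straying into majority-class regions.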
Compared with the prior art, the invention partitions the minority-class samples by clustering during oversampling of imbalanced data, so that resampling can be targeted according to how each part of the minority class is classified against the majority class. This raises the recognition rate of the minority class during classification and is more conducive to solving the imbalanced-data problem.
Drawings
FIG. 1 is a flow chart of a cluster-based adaptive weighted oversampling method of the present invention.
Detailed Description
The invention is described further below with reference to the figure and a detailed embodiment.
With reference to fig. 1, the invention relates to a clustering-based adaptive weighted oversampling method, comprising the following steps:
Step 1: Take the unbalanced data set as input, separate the minority-class samples from the majority-class samples, and compute the number of samples to be generated.
Step 2: Divide the minority-class data into several clusters with the k-means clustering algorithm, and combine each cluster with the majority-class data into several data sets.
Step 2.1: Randomly select k data points from the minority-class samples as initial cluster centers.
Step 2.2: Compute the Euclidean distance d(s_i, c_j) between each data point s_i and each of the k cluster centers c_j, and assign each point to the cluster of its nearest center.
Step 2.3: Compute the mean of the data points in each cluster and use it as that cluster's center for the next iteration.
Step 2.4: Repeat steps 2.2 and 2.3 until the maximum number of iterations is reached or the cluster centers no longer change appreciably.
Step 2.5: Combine each of the k clusters obtained in step 2.4 with the majority-class samples to form k data sets.
Step 3: For each data set, compute a score value with a random forest algorithm under 5-fold cross validation, and take it as the cluster's score.
Step 3.1: Split each data set obtained in step 2 into 5 groups for 5-fold cross validation.
Step 3.2: In each fold, use 1 group as the test set and the remaining 4 groups as the training set; train a random forest on the training set, predict the test set with the trained model, compute the AUC, F-measure, and G-mean of the predictions, and average them.
Step 3.3: Repeat step 3.2 for the 5 folds and average the 5 fold values to obtain the cluster's score.
Step 4: Compute each cluster's sampling weight from its score and determine the number of samples to synthesize per cluster.
Step 4.1: For each cluster, take 1 minus the cluster's score as its sampling score, and compute the sum of the sampling scores over all clusters.
Step 4.2: Divide each cluster's sampling score by this sum to obtain the cluster's sampling weight.
Step 4.3: Multiply the difference between the numbers of majority-class and minority-class samples in the original data by the cluster's sampling weight to obtain the cluster's synthesis count.
Step 5: Within each cluster, perform random linear interpolation between samples according to its synthesis count.
Step 5.1: Randomly select two sample points within the cluster and synthesize a new point between them by random interpolation.
Step 5.2: Repeat step 5.1 until the number of new points equals the cluster's synthesis count.
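Putting the five steps together, a self-contained end-to-end sketch, with one stated simplification: the cluster score here is the 5-fold AUC alone rather than the patent's average of AUC, F-measure, and G-mean, and all names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def adaptive_weighted_oversample(X_min, X_maj, k=3, seed=0):
    """Generate synthetic minority samples per the five steps above."""
    rng = np.random.default_rng(seed)
    n_to_generate = len(X_maj) - len(X_min)                             # step 1
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_min)  # step 2
    scores = []
    for c in range(k):                                                  # step 3
        Xc = X_min[km.labels_ == c]
        X = np.vstack([Xc, X_maj])
        y = np.r_[np.ones(len(Xc)), np.zeros(len(X_maj))]
        clf = RandomForestClassifier(n_estimators=50, random_state=seed)
        scores.append(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
    sampling_scores = 1.0 - np.array(scores)                            # step 4
    if sampling_scores.sum() == 0:            # every cluster separable perfectly:
        weights = np.full(k, 1.0 / k)         # fall back to uniform weights
    else:
        weights = sampling_scores / sampling_scores.sum()
    counts = np.round(weights * n_to_generate).astype(int)
    synthetic = []
    for c in range(k):                                                  # step 5
        Xc = X_min[km.labels_ == c]
        for _ in range(counts[c]):
            i, j = rng.choice(len(Xc), size=2, replace=False)
            synthetic.append(Xc[i] + rng.random() * (Xc[j] - Xc[i]))
    return np.array(synthetic)
```

The uniform-weight fallback is an added guard for the degenerate case where every cluster scores exactly 1; the patent does not discuss this case.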

Claims (5)

1. A clustering-based adaptive weighted oversampling method, characterized by comprising the following steps:
step 1: taking an unbalanced data set as input, separating the minority-class samples from the majority-class samples, and computing the number of samples to be generated;
step 2: dividing the minority-class data into several clusters with the k-means clustering algorithm, and combining each cluster with the majority-class data into several data sets;
step 3: computing a score value for each data set with a random forest algorithm under 5-fold cross validation, and taking it as the score of the cluster;
step 4: computing a sampling weight from the score of each cluster, and determining the number of samples to synthesize per cluster;
step 5: performing random linear interpolation between samples within each cluster according to that number.
2. The clustering-based adaptive weighted oversampling method of claim 1, characterized in that in step 2, the k-means clustering algorithm divides the minority-class data into several clusters, which are combined with the majority-class data into several data sets, by the following steps:
step 2.1: randomly selecting k data points from the minority-class samples as initial cluster centers;
step 2.2: computing the Euclidean distance d(s_i, c_j) between each data point s_i and each of the k cluster centers c_j, and assigning each point to the cluster of its nearest center;
step 2.3: computing the mean of the data points in each cluster and using it as that cluster's center for the next iteration;
step 2.4: repeating steps 2.2 and 2.3 until the maximum number of iterations is reached or the cluster centers no longer change appreciably;
step 2.5: combining each of the k clusters obtained in step 2.4 with the majority-class samples to form k data sets.
3. The clustering-based adaptive weighted oversampling method of claim 1, characterized in that in step 3, for each data set a random forest algorithm is evaluated under 5-fold cross validation to compute a score value, which is taken as the cluster's score, by the following steps:
step 3.1: splitting each data set obtained in step 2 into 5 groups for 5-fold cross validation;
step 3.2: in each fold, using 1 group as the test set and the remaining 4 groups as the training set, training a random forest on the training set, predicting the test set with the trained model, computing the AUC, F-measure, and G-mean of the predictions, and averaging them;
step 3.3: repeating step 3.2 for the 5 folds and averaging the 5 fold values to obtain the cluster's score value.
4. The clustering-based adaptive weighted oversampling method of claim 1, characterized in that in step 4, the sampling weight is computed from the score of each cluster and the cluster synthesis count is determined, by the following steps:
step 4.1: for each cluster, taking 1 minus the cluster's score as its sampling score, and computing the sum of the sampling scores over all clusters;
step 4.2: dividing each cluster's sampling score by this sum to obtain the cluster's sampling weight;
step 4.3: multiplying the difference between the numbers of majority-class and minority-class samples in the original data by the cluster's sampling weight to obtain the cluster's synthesis count.
5. The clustering-based adaptive weighted oversampling method of claim 1, characterized in that in step 5, for each cluster, random linear interpolation between samples is performed within the cluster according to its synthesis count, by the following steps:
step 5.1: randomly selecting two sample points within the cluster and synthesizing a new point between them by random interpolation;
step 5.2: repeating step 5.1 until the number of new points equals the cluster's synthesis count.
CN202110650447.3A 2021-06-11 2021-06-11 Clustering-based self-adaptive weighted oversampling method Pending CN113378927A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110650447.3A CN113378927A (en) 2021-06-11 2021-06-11 Clustering-based self-adaptive weighted oversampling method

Publications (1)

Publication Number Publication Date
CN113378927A true CN113378927A (en) 2021-09-10

Family

ID=77573780



Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115545111A (en) * 2022-10-13 2022-12-30 重庆工商大学 Network intrusion detection method and system based on clustering self-adaptive mixed sampling
CN116051288A (en) * 2023-03-30 2023-05-02 华南理工大学 Financial credit scoring data enhancement method based on resampling
CN116051288B (en) * 2023-03-30 2023-07-18 华南理工大学 Financial credit scoring data enhancement method based on resampling
CN116499748A (en) * 2023-06-27 2023-07-28 昆明理工大学 Bearing fault diagnosis method and system based on improved SMOTE and classifier
CN116499748B (en) * 2023-06-27 2023-08-29 昆明理工大学 Bearing fault diagnosis method and system based on improved SMOTE and classifier
CN117332287A (en) * 2023-09-28 2024-01-02 中国人民解放军63856部队 Evaluation index weight data processing method based on cluster analysis


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
Application publication date: 20210910