CN113378927A - Clustering-based self-adaptive weighted oversampling method
- Publication number
- CN113378927A (application CN202110650447.3A)
- Authority
- CN
- China
- Prior art keywords
- cluster
- samples
- data
- score
- clustering
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
Abstract
The invention relates to a clustering-based adaptive weighted oversampling method. The method first applies k-means clustering to the minority class samples and combines each resulting cluster with the majority class samples to form several data sets. Each data set is then classified with a random forest algorithm under 5-fold cross validation, and the average of the resulting score values is taken as that cluster's score. A synthesis weight is computed for each cluster from its score, the number of samples to generate per cluster is derived from the weights, and random linear interpolation between samples within each cluster produces the specified number of new samples, finally balancing the data set.
Description
Technical Field
The invention relates to the field of data mining, and in particular to a clustering-based adaptive weighted oversampling method.
Background
Imbalanced data is widespread in practical applications: when the numbers of samples in different classes are unbalanced, or even differ greatly, a data set with such a distribution is considered imbalanced. The fundamental problem of imbalanced learning is that a skewed class distribution greatly weakens the performance of many traditional machine-learning classification algorithms.
As research on imbalanced data sets has deepened, current work on the imbalance problem falls mainly into two areas: the algorithm level and the data level. At the data level, the main approaches are oversampling, undersampling and hybrid sampling. Compared with the other two, oversampling balances the data set by generating minority class samples, and thereby avoids discarding majority class samples that carry important information. Many oversampling methods have been developed, such as SMOTE, Borderline-SMOTE and ADASYN, but these methods sample based only on minority class information and do not consider how the minority class is actually classified against the majority class, which lowers the quality of the synthesized samples.
Disclosure of Invention
The invention aims to provide an oversampling method that clusters the minority class samples, determines each cluster's sampling weight from how well that cluster is classified against the majority class samples, and thereby improves the quality of the generated minority class samples.
The technical solution realizing this purpose is as follows: a clustering-based adaptive weighted oversampling method comprising the following steps.
Step 1: and taking the unbalanced data set as input, distinguishing a few samples and a plurality of samples, and calculating the number of samples needing to be generated.
Step 2: and dividing the minority class data into a plurality of clusters by using a k-means clustering algorithm, and combining the minority class data and the majority class data into a plurality of data sets.
And step 3: and calculating a corresponding score value by a random forest algorithm and adopting a 5-fold cross validation mode for each data set, and determining the score of the cluster.
And 4, step 4: and calculating sampling weight through the score of each cluster, and determining the synthesis number of cluster samples.
And 5: and carrying out random linear interpolation between samples in each cluster according to the number of the samples.
The clustering-based adaptive weighted oversampling method is characterized in that in step 2, the minority class data are divided into several clusters with the k-means clustering algorithm and combined with the majority class data into several data sets, as follows.
Step 2.1: Randomly select k data points from the minority class samples as the initial cluster centers.
Step 2.2: Compute the Euclidean distance d(si, cj) between each data point si and each of the k cluster centers, find the center with the smallest distance to each point, and assign the point to that cluster.
Step 2.3: Compute the mean of the data points in each cluster and set it as the cluster center for the next iteration.
Step 2.4: Repeat steps 2.2 and 2.3 until the maximum number of iterations is reached or the cluster centers no longer change significantly.
Step 2.5: Combine each of the k clusters obtained in step 2.4 with the majority class samples to form k data sets.
According to the clustering-based adaptive weighted oversampling method, in step 3 a random forest algorithm is applied to each data set and a score value is computed under 5-fold cross validation to determine the cluster's score, as follows.
Step 3.1: Divide each data set obtained in step 2 into 5 groups for 5-fold cross validation.
Step 3.2: Each time, select 1 group as the test set and the remaining 4 groups as the training set; train a random forest on the training set, predict the test set with the trained model, compute the corresponding AUC, F-measure and G-mean values from the predictions, and average them.
Step 3.3: Repeat step 3.2 five times, obtaining 5 values, and take their average as the score value of the cluster.
According to the clustering-based adaptive weighted oversampling method, in step 4 the sampling weight is calculated from each cluster's score and the number of samples to synthesize per cluster is determined, as follows.
Step 4.1: For each cluster, take 1 minus the cluster's score value as its sampling score, and compute the sum of all clusters' sampling scores.
Step 4.2: Take the ratio of the cluster's sampling score to this sum as the cluster's sampling weight.
Step 4.3: Multiply the difference between the numbers of majority and minority samples in the original data by the cluster's sampling weight to obtain the cluster's synthesis count.
According to the clustering-based adaptive weighted oversampling method, in step 5 random linear interpolation between samples is performed within each cluster according to its synthesis count, as follows.
Step 5.1: Randomly select two sample points within the cluster and synthesize a new sample point between them by random interpolation.
Step 5.2: Repeat step 5.1 until the number of new sample points equals the cluster's synthesis count.
Compared with the prior art, the invention clusters and partitions the minority class samples during oversampling of imbalanced data, so that resampling can be targeted according to how each part of the minority class is classified against the majority class. This raises the recognition rate of the minority class during classification and is more conducive to solving the imbalanced-data problem.
Drawings
FIG. 1 is a flow chart of a cluster-based adaptive weighted oversampling method of the present invention.
Detailed Description
The invention is further described below with reference to the figure and the detailed description.
With reference to FIG. 1, the clustering-based adaptive weighted oversampling method of the invention comprises the following steps:
step 1: and taking the unbalanced data set as input, distinguishing a few samples and a plurality of samples, and calculating the number of samples needing to be generated.
Step 2: and dividing the minority class data into a plurality of clusters by using a kmeans clustering algorithm, and combining the minority class data and the majority class data into a plurality of data sets.
Step 2.1 randomly finding k data points from a few class samples as initial cluster centers.
Step 2.2 respectively calculates the Euclidean distance d (si, cj) between each data point si and the selected k cluster centers, finds the cluster center with the minimum distance value from each data point and distributes the cluster center to the cluster.
And 2.3, respectively calculating the average value of the data points in each cluster, and setting the average value as the clustering center of the next iteration.
And step 2.4, circularly iterating step 2.2-step 2.3 until the maximum iteration times are met or each cluster center does not change greatly.
And 2.5, combining the k clusters obtained in the step 2.4 with a plurality of types of samples to form k data sets respectively.
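Steps 2.1-2.5 can be sketched in NumPy roughly as follows; `X_min`, `X_maj` and the function names are illustrative, not taken from the patent text, and the sketch assumes no cluster ever becomes empty during the iterations.

```python
import numpy as np

def kmeans_minority(X_min, k, max_iter=100, tol=1e-4, seed=0):
    """Plain k-means over the minority class samples (steps 2.1-2.4)."""
    rng = np.random.default_rng(seed)
    centers = X_min[rng.choice(len(X_min), size=k, replace=False)]  # 2.1: random initial centers
    for _ in range(max_iter):                                       # 2.4: iterate until stable
        # 2.2: Euclidean distance from every point to every center
        d = np.linalg.norm(X_min[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                                   # assign to nearest center
        # 2.3: new centers are the per-cluster means
        centers_new = np.array([X_min[labels == j].mean(axis=0) for j in range(k)])
        if np.linalg.norm(centers_new - centers) < tol:             # centers barely moved
            break
        centers = centers_new
    return labels, centers

def combine_with_majority(X_min, labels, X_maj, k):
    """Step 2.5: pair each minority cluster with the full majority class."""
    datasets = []
    for j in range(k):
        cluster = X_min[labels == j]
        X = np.vstack([cluster, X_maj])
        y = np.hstack([np.ones(len(cluster)), np.zeros(len(X_maj))])
        datasets.append((X, y))
    return datasets
```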
And step 3: and calculating a corresponding score value for each data set by a random forest algorithm in a k-fold cross validation mode, and determining the score of the cluster.
And 3.1, dividing the data set obtained in each step 2 into k groups of data sets according to a k-fold cross validation mode.
And 3.2, selecting 1 group as a test set and 4 groups as a training set each time, training a random forest algorithm by using the training set, predicting a test set result according to a model obtained by training, obtaining corresponding AUC, F-mean and G-mean values according to the result, and calculating corresponding average values.
And 3.3, circulating the step 3.2 for k times to obtain k values and calculating an average value to be used as a score value corresponding to the cluster.
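Steps 3.1-3.3 might look like the following sketch, assuming scikit-learn's RandomForestClassifier and StratifiedKFold; averaging AUC, the F-measure and the G-mean with equal weight is an assumption, since the text does not give the exact aggregation formula.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cluster_score(X, y, seed=0):
    """Average of AUC, F-measure and G-mean over 5-fold cross validation."""
    fold_scores = []
    for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=seed).split(X, y):
        clf = RandomForestClassifier(n_estimators=50, random_state=seed).fit(X[tr], y[tr])
        pred = clf.predict(X[te])
        auc = roc_auc_score(y[te], clf.predict_proba(X[te])[:, 1])
        f1 = f1_score(y[te], pred)                      # F-measure on the minority class
        sens = recall_score(y[te], pred)                # sensitivity: minority recall
        spec = recall_score(y[te], pred, pos_label=0)   # specificity: majority recall
        gmean = np.sqrt(sens * spec)                    # G-mean
        fold_scores.append((auc + f1 + gmean) / 3.0)    # equal-weight average (assumption)
    return float(np.mean(fold_scores))
```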
And 4, step 4: and calculating sampling weight through the score of each cluster, and determining the synthesis number of cluster samples.
And 4.1, regarding each cluster, taking the difference value between the score value of 1 and the cluster as the sampling score of the cluster, and calculating the sum of the sampling score values.
And 4.2, comparing the sampling weight value corresponding to the cluster with the sum to serve as the sampling weight value of the cluster.
And 4.3, multiplying the difference value of the majority sample points and the minority sample points of the original data by the sampling weight value of the cluster to obtain the number of the synthesized samples of the cluster.
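Steps 4.1-4.3 reduce to simple arithmetic, sketched below with illustrative names: clusters that are classified poorly (low score) receive a larger share of the samples to be synthesized.

```python
def synthesis_counts(cluster_scores, n_majority, n_minority):
    """Per-cluster synthesis counts from cluster scores (steps 4.1-4.3)."""
    sampling = [1.0 - s for s in cluster_scores]   # 4.1: worse clusters score higher
    total = sum(sampling)
    weights = [s / total for s in sampling]        # 4.2: ratio to the sum
    deficit = n_majority - n_minority              # samples needed to balance the set
    return [round(w * deficit) for w in weights]   # 4.3: weight times the deficit
```

For example, two clusters with scores 0.9 and 0.6 get sampling scores 0.1 and 0.4, so the weaker cluster receives four times as many synthesized samples.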
And 5: and carrying out random linear interpolation among samples in each cluster according to the number of the samples.
And 5.1, randomly selecting two sample points in the cluster, and synthesizing a new sample point between the two sample points in a random interpolation mode.
Step 5.2 repeat step 5.1 until the number of new sample points equals the number of synthesized samples of the cluster.
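Steps 5.1-5.2 are the SMOTE-style interpolation sketched below; the names are illustrative, and the cluster is assumed to contain at least two points.

```python
import numpy as np

def interpolate(cluster, n_new, seed=0):
    """Synthesize n_new points by random linear interpolation within a cluster."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        a, b = cluster[rng.choice(len(cluster), size=2, replace=False)]  # 5.1: random pair
        out.append(a + rng.random() * (b - a))  # random point on the segment from a to b
    return np.array(out)                        # 5.2: repeat until n_new points exist
```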
Claims (5)
1. A clustering-based adaptive weighted oversampling method, characterized by comprising the following steps:
step 1: taking the imbalanced data set as input, distinguishing the minority class samples from the majority class samples, and calculating the number of samples that need to be generated;
step 2: dividing the minority class data into several clusters with a k-means clustering algorithm, and combining each cluster with the majority class data to form several data sets;
step 3: computing a score value for each data set with a random forest algorithm under 5-fold cross validation, and taking it as the score of the corresponding cluster;
step 4: calculating each cluster's sampling weight from its score, and determining the number of samples to synthesize per cluster;
step 5: performing random linear interpolation between samples within each cluster according to that number.
2. The clustering-based adaptive weighted oversampling method of claim 1, wherein in step 2 the minority class data are divided into several clusters with the k-means clustering algorithm and combined with the majority class data into several data sets, specifically:
step 2.1: randomly selecting k data points from the minority class samples as the initial cluster centers;
step 2.2: computing the Euclidean distance d(si, cj) between each data point si and each of the k cluster centers, finding the center with the smallest distance to each point, and assigning the point to that cluster;
step 2.3: computing the mean of the data points in each cluster and setting it as the cluster center for the next iteration;
step 2.4: repeating steps 2.2 and 2.3 until the maximum number of iterations is reached or the cluster centers no longer change significantly;
step 2.5: combining each of the k clusters obtained in step 2.4 with the majority class samples to form k data sets.
3. The clustering-based adaptive weighted oversampling method of claim 1, wherein in step 3 a random forest algorithm is applied to each data set and a score value is computed under 5-fold cross validation to determine the cluster's score, specifically:
step 3.1: dividing each data set obtained in step 2 into 5 groups for 5-fold cross validation;
step 3.2: each time, selecting 1 group as the test set and the remaining 4 groups as the training set, training a random forest on the training set, predicting the test set with the trained model, computing the corresponding AUC, F-measure and G-mean values from the predictions, and averaging them;
step 3.3: repeating step 3.2 five times to obtain 5 values and taking their average as the score value of the cluster.
4. The clustering-based adaptive weighted oversampling method of claim 1, wherein in step 4 the sampling weight is calculated from each cluster's score and the number of samples to synthesize per cluster is determined, specifically:
step 4.1: for each cluster, taking 1 minus the cluster's score value as its sampling score, and computing the sum of all clusters' sampling scores;
step 4.2: taking the ratio of the cluster's sampling score to this sum as the cluster's sampling weight;
step 4.3: multiplying the difference between the numbers of majority and minority samples in the original data by the cluster's sampling weight to obtain the cluster's synthesis count.
5. The clustering-based adaptive weighted oversampling method of claim 1, wherein in step 5, for each cluster, random linear interpolation between samples is performed within the cluster according to its synthesis count, specifically:
step 5.1: randomly selecting two sample points within the cluster and synthesizing a new sample point between them by random interpolation;
step 5.2: repeating step 5.1 until the number of new sample points equals the cluster's synthesis count.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110650447.3A CN113378927A (en) | 2021-06-11 | 2021-06-11 | Clustering-based self-adaptive weighted oversampling method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113378927A true CN113378927A (en) | 2021-09-10 |
Family
ID=77573780
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110650447.3A Pending CN113378927A (en) | 2021-06-11 | 2021-06-11 | Clustering-based self-adaptive weighted oversampling method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113378927A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115545111A (en) * | 2022-10-13 | 2022-12-30 | 重庆工商大学 | Network intrusion detection method and system based on clustering self-adaptive mixed sampling |
CN116051288A (en) * | 2023-03-30 | 2023-05-02 | 华南理工大学 | Financial credit scoring data enhancement method based on resampling |
CN116499748A (en) * | 2023-06-27 | 2023-07-28 | 昆明理工大学 | Bearing fault diagnosis method and system based on improved SMOTE and classifier |
CN117332287A (en) * | 2023-09-28 | 2024-01-02 | 中国人民解放军63856部队 | Evaluation index weight data processing method based on cluster analysis |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20210910 |