CN113378927A - Clustering-based self-adaptive weighted oversampling method


Info

Publication number: CN113378927A
Authority: CN (China)
Prior art keywords: cluster, samples, data, score, clustering
Legal status: Pending
Application number: CN202110650447.3A
Other languages: Chinese (zh)
Inventors: 张爽 (Zhang Shuang), 何云斌 (He Yunbin), 杨海波 (Yang Haibo)
Current Assignee: Harbin University of Science and Technology
Original Assignee: Harbin University of Science and Technology
Filing date: 2021-06-11
Publication date: 2021-09-10
Application filed by Harbin University of Science and Technology

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation with fixed number of clusters, e.g. K-means clustering
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers


Abstract

The invention relates to a clustering-based adaptive weighted oversampling method. First, k-means clustering is applied to the minority-class samples, and each resulting cluster is combined with the majority-class samples to form a separate data set. Each data set is classified with a random forest algorithm, a score value is computed under 5-fold cross validation, and the average of the fold values is taken as the cluster's score. A synthesis weight is then derived from each cluster's score, the number of samples to generate per cluster is computed from these weights, and that number of new samples is synthesized by random linear interpolation between samples within each cluster, finally balancing the data set.

Description

Clustering-based self-adaptive weighted oversampling method
Technical Field
The invention relates to the field of data mining, and in particular to a clustering-based adaptive weighted oversampling method.
Background
Imbalanced data is widespread in practical applications: when the numbers of samples in different classes are unbalanced, or differ greatly, a data set with such a distribution is considered imbalanced. The fundamental problem in imbalanced learning is that the skewed class distribution severely degrades the performance of many traditional machine-learning classification algorithms.
As research on handling imbalanced data sets has deepened, current work on the problem falls into two main lines: algorithm-level and data-level approaches. At the data level, the main techniques are oversampling, undersampling, and hybrid sampling. Compared with the other two, oversampling balances the data set by generating minority-class samples, and so avoids discarding majority-class samples that may carry important information. Many oversampling methods have been developed, such as SMOTE, Borderline-SMOTE, and ADASYN, but these methods sample based only on minority-class information and do not consider how the minority class is actually classified against the majority class, which lowers the quality of the synthesized samples.
Disclosure of Invention
The invention aims to provide an oversampling method that clusters the minority-class samples, determines each cluster's sampling weight from how well that cluster is classified against the majority-class samples, and thereby improves the quality of the generated minority-class samples.
The technical solution realizing this purpose is a clustering-based adaptive weighted oversampling method comprising the following steps.
Step 1: Take the unbalanced data set as input, separate the minority-class samples from the majority-class samples, and compute the number of samples to be generated.
Step 2: Divide the minority-class data into several clusters with the k-means clustering algorithm, and combine each cluster with the majority-class data to form several data sets.
Step 3: For each data set, compute a score value with a random forest algorithm under 5-fold cross validation, and take it as the cluster's score.
Step 4: Compute each cluster's sampling weight from its score, and determine the number of samples to synthesize per cluster.
Step 5: Within each cluster, perform random linear interpolation between samples according to that number.
In step 2, the minority-class data are divided into several clusters with the k-means clustering algorithm and combined with the majority-class data into several data sets, as follows.
Step 2.1: Randomly select k data points from the minority-class samples as initial cluster centers.
Step 2.2: Compute the Euclidean distance d(s_i, c_j) between each data point s_i and each of the k cluster centers c_j, and assign each point to the cluster of its nearest center.
Step 2.3: Compute the mean of the data points in each cluster and use it as that cluster's center for the next iteration.
Step 2.4: Repeat steps 2.2 and 2.3 until the maximum number of iterations is reached or the cluster centers no longer change appreciably.
Step 2.5: Combine each of the k clusters obtained in step 2.4 with the majority-class samples to form k data sets.
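As a minimal sketch of steps 2.1 to 2.5, assuming scikit-learn's `KMeans` in place of the hand-rolled iteration (the function and variable names are illustrative, not from the patent):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_cluster_datasets(X_min, X_maj, k=3, seed=0):
    """Cluster the minority class (steps 2.1-2.4), then pair each
    cluster with the full majority class (step 2.5)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_min)
    datasets = []
    for c in range(k):
        Xc = X_min[km.labels_ == c]                        # one minority cluster
        X = np.vstack([Xc, X_maj])                         # cluster + majority
        y = np.r_[np.ones(len(Xc)), np.zeros(len(X_maj))]  # 1 = minority class
        datasets.append((X, y))
    return datasets, km.labels_
```

Each returned pair `(X, y)` is one of the k data sets that step 3 scores independently.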
In step 3, for each data set a random forest algorithm is evaluated under 5-fold cross validation to compute a score value, which is taken as the cluster's score, as follows.
Step 3.1: Split each data set obtained in step 2 into 5 groups for 5-fold cross validation.
Step 3.2: In each fold, use 1 group as the test set and the remaining 4 groups as the training set; train a random forest on the training set, predict the test set with the trained model, compute the AUC, F-measure, and G-mean of the predictions, and average them.
Step 3.3: Repeat step 3.2 for the 5 folds and average the 5 fold values to obtain the cluster's score.
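A sketch of steps 3.1 to 3.3, under stated assumptions: the patent lists AUC, "F-mean" (read here as the F-measure), and G-mean but does not say how the three metrics combine into one score, so averaging them is an assumption, and the function name is illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cluster_score(X, y, seed=0):
    """5-fold CV score of one cluster's data set (steps 3.1-3.3)."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    fold_scores = []
    for tr, te in skf.split(X, y):
        clf = RandomForestClassifier(n_estimators=50, random_state=seed)
        clf.fit(X[tr], y[tr])
        pred = clf.predict(X[te])
        prob = clf.predict_proba(X[te])[:, 1]
        auc = roc_auc_score(y[te], prob)
        f_measure = f1_score(y[te], pred)
        sens = recall_score(y[te], pred)                # minority-class recall
        spec = recall_score(y[te], pred, pos_label=0)   # majority-class recall
        g_mean = np.sqrt(sens * spec)
        fold_scores.append((auc + f_measure + g_mean) / 3.0)  # assumed combination
    return float(np.mean(fold_scores))
```

A cluster whose samples the random forest separates cleanly from the majority class scores near 1; overlapping, hard-to-classify clusters score lower and will receive more synthetic samples in step 4.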
In step 4, each cluster's sampling weight is computed from its score and the number of samples to synthesize per cluster is determined, as follows.
Step 4.1: For each cluster, take 1 minus the cluster's score as its sampling score, and compute the sum of the sampling scores over all clusters.
Step 4.2: Divide each cluster's sampling score by this sum to obtain the cluster's sampling weight.
Step 4.3: Multiply the difference between the numbers of majority-class and minority-class samples in the original data by the cluster's sampling weight to obtain the cluster's synthesis count.
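The arithmetic of steps 4.1 to 4.3 can be written out directly. This is a sketch; the rounding rule is an assumption, since the patent does not say how fractional counts are handled.

```python
def synthesis_counts(scores, n_maj, n_min):
    """Clusters that classify worse (lower score) get more synthetic samples."""
    sampling_scores = [1.0 - s for s in scores]      # step 4.1
    total = sum(sampling_scores)
    weights = [v / total for v in sampling_scores]   # step 4.2
    gap = n_maj - n_min                              # samples needed overall
    return [round(w * gap) for w in weights]         # step 4.3

# e.g. cluster scores 0.9 and 0.7 with 100 majority vs 40 minority samples:
# sampling scores are 0.1 and 0.3, weights 0.25 and 0.75,
# so the gap of 60 splits as 15 and 45 synthetic samples.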
In step 5, random linear interpolation between samples is performed within each cluster according to its synthesis count, as follows.
Step 5.1: Randomly select two sample points within the cluster and synthesize a new point between them by random interpolation.
Step 5.2: Repeat step 5.1 until the number of new points equals the cluster's synthesis count.
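Steps 5.1 and 5.2 are SMOTE-style interpolation restricted to pairs inside one cluster. A sketch, with illustrative names:

```python
import numpy as np

def interpolate_cluster(Xc, n_new, seed=0):
    """Synthesize n_new points by random linear interpolation
    between random pairs of samples in one cluster."""
    rng = np.random.default_rng(seed)
    new_points = []
    for _ in range(n_new):
        i, j = rng.choice(len(Xc), size=2, replace=False)  # step 5.1: pick a pair
        lam = rng.random()                                 # position on the segment
        new_points.append(Xc[i] + lam * (Xc[j] - Xc[i]))
    return np.array(new_points)                            # step 5.2: n_new points
```

Because every new point lies on a segment between two cluster members, the synthetic samples stay inside the cluster's convex hull rather than straying into majority-class regions.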
Compared with the prior art, the invention partitions the minority-class samples by clustering during oversampling of imbalanced data, so that resampling can be targeted according to how each part of the minority class is classified against the majority class. This raises the recognition rate of the minority class during classification and is more conducive to solving the imbalanced-data problem.
Drawings
FIG. 1 is a flow chart of a cluster-based adaptive weighted oversampling method of the present invention.
Detailed Description
The invention is described further below with reference to the figure and a detailed embodiment.
With reference to fig. 1, the invention relates to a clustering-based adaptive weighted oversampling method, comprising the following steps:
Step 1: Take the unbalanced data set as input, separate the minority-class samples from the majority-class samples, and compute the number of samples to be generated.
Step 2: Divide the minority-class data into several clusters with the k-means clustering algorithm, and combine each cluster with the majority-class data into several data sets.
Step 2.1: Randomly select k data points from the minority-class samples as initial cluster centers.
Step 2.2: Compute the Euclidean distance d(s_i, c_j) between each data point s_i and each of the k cluster centers c_j, and assign each point to the cluster of its nearest center.
Step 2.3: Compute the mean of the data points in each cluster and use it as that cluster's center for the next iteration.
Step 2.4: Repeat steps 2.2 and 2.3 until the maximum number of iterations is reached or the cluster centers no longer change appreciably.
Step 2.5: Combine each of the k clusters obtained in step 2.4 with the majority-class samples to form k data sets.
Step 3: For each data set, compute a score value with a random forest algorithm under 5-fold cross validation, and take it as the cluster's score.
Step 3.1: Split each data set obtained in step 2 into 5 groups for 5-fold cross validation.
Step 3.2: In each fold, use 1 group as the test set and the remaining 4 groups as the training set; train a random forest on the training set, predict the test set with the trained model, compute the AUC, F-measure, and G-mean of the predictions, and average them.
Step 3.3: Repeat step 3.2 for the 5 folds and average the 5 fold values to obtain the cluster's score.
Step 4: Compute each cluster's sampling weight from its score and determine the number of samples to synthesize per cluster.
Step 4.1: For each cluster, take 1 minus the cluster's score as its sampling score, and compute the sum of the sampling scores over all clusters.
Step 4.2: Divide each cluster's sampling score by this sum to obtain the cluster's sampling weight.
Step 4.3: Multiply the difference between the numbers of majority-class and minority-class samples in the original data by the cluster's sampling weight to obtain the cluster's synthesis count.
Step 5: Within each cluster, perform random linear interpolation between samples according to its synthesis count.
Step 5.1: Randomly select two sample points within the cluster and synthesize a new point between them by random interpolation.
Step 5.2: Repeat step 5.1 until the number of new points equals the cluster's synthesis count.
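Putting the five steps together, a self-contained end-to-end sketch, with one stated simplification: the cluster score here is the 5-fold AUC alone rather than the patent's average of AUC, F-measure, and G-mean, and all names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def adaptive_weighted_oversample(X_min, X_maj, k=3, seed=0):
    """Generate synthetic minority samples per the five steps above."""
    rng = np.random.default_rng(seed)
    n_to_generate = len(X_maj) - len(X_min)                             # step 1
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_min)  # step 2
    scores = []
    for c in range(k):                                                  # step 3
        Xc = X_min[km.labels_ == c]
        X = np.vstack([Xc, X_maj])
        y = np.r_[np.ones(len(Xc)), np.zeros(len(X_maj))]
        clf = RandomForestClassifier(n_estimators=50, random_state=seed)
        scores.append(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
    sampling_scores = 1.0 - np.array(scores)                            # step 4
    if sampling_scores.sum() == 0:            # every cluster separable perfectly:
        weights = np.full(k, 1.0 / k)         # fall back to uniform weights
    else:
        weights = sampling_scores / sampling_scores.sum()
    counts = np.round(weights * n_to_generate).astype(int)
    synthetic = []
    for c in range(k):                                                  # step 5
        Xc = X_min[km.labels_ == c]
        for _ in range(counts[c]):
            i, j = rng.choice(len(Xc), size=2, replace=False)
            synthetic.append(Xc[i] + rng.random() * (Xc[j] - Xc[i]))
    return np.array(synthetic)
```

The uniform-weight fallback is an added guard for the degenerate case where every cluster scores exactly 1; the patent does not discuss this case.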

Claims (5)

1. A clustering-based adaptive weighted oversampling method, characterized by comprising the following steps:
step 1: taking an unbalanced data set as input, separating the minority-class samples from the majority-class samples, and computing the number of samples to be generated;
step 2: dividing the minority-class data into several clusters with the k-means clustering algorithm, and combining each cluster with the majority-class data into several data sets;
step 3: computing a score value for each data set with a random forest algorithm under 5-fold cross validation, and taking it as the score of the cluster;
step 4: computing a sampling weight from the score of each cluster, and determining the number of samples to synthesize per cluster;
step 5: performing random linear interpolation between samples within each cluster according to that number.
2. The clustering-based adaptive weighted oversampling method of claim 1, characterized in that in step 2, the k-means clustering algorithm divides the minority-class data into several clusters, which are combined with the majority-class data into several data sets, by the following steps:
step 2.1: randomly selecting k data points from the minority-class samples as initial cluster centers;
step 2.2: computing the Euclidean distance d(s_i, c_j) between each data point s_i and each of the k cluster centers c_j, and assigning each point to the cluster of its nearest center;
step 2.3: computing the mean of the data points in each cluster and using it as that cluster's center for the next iteration;
step 2.4: repeating steps 2.2 and 2.3 until the maximum number of iterations is reached or the cluster centers no longer change appreciably;
step 2.5: combining each of the k clusters obtained in step 2.4 with the majority-class samples to form k data sets.
3. The clustering-based adaptive weighted oversampling method of claim 1, characterized in that in step 3, for each data set a random forest algorithm is evaluated under 5-fold cross validation to compute a score value, which is taken as the cluster's score, by the following steps:
step 3.1: splitting each data set obtained in step 2 into 5 groups for 5-fold cross validation;
step 3.2: in each fold, using 1 group as the test set and the remaining 4 groups as the training set, training a random forest on the training set, predicting the test set with the trained model, computing the AUC, F-measure, and G-mean of the predictions, and averaging them;
step 3.3: repeating step 3.2 for the 5 folds and averaging the 5 fold values to obtain the cluster's score value.
4. The clustering-based adaptive weighted oversampling method of claim 1, characterized in that in step 4, the sampling weight is computed from the score of each cluster and the cluster synthesis count is determined, by the following steps:
step 4.1: for each cluster, taking 1 minus the cluster's score as its sampling score, and computing the sum of the sampling scores over all clusters;
step 4.2: dividing each cluster's sampling score by this sum to obtain the cluster's sampling weight;
step 4.3: multiplying the difference between the numbers of majority-class and minority-class samples in the original data by the cluster's sampling weight to obtain the cluster's synthesis count.
5. The clustering-based adaptive weighted oversampling method of claim 1, characterized in that in step 5, for each cluster, random linear interpolation between samples is performed within the cluster according to its synthesis count, by the following steps:
step 5.1: randomly selecting two sample points within the cluster and synthesizing a new point between them by random interpolation;
step 5.2: repeating step 5.1 until the number of new points equals the cluster's synthesis count.
CN202110650447.3A 2021-06-11 2021-06-11 Clustering-based self-adaptive weighted oversampling method Pending CN113378927A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110650447.3A CN113378927A (en) 2021-06-11 2021-06-11 Clustering-based self-adaptive weighted oversampling method

Publications (1)

Publication Number Publication Date
CN113378927A true CN113378927A (en) 2021-09-10

Family

ID=77573780



Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115545111A (en) * 2022-10-13 2022-12-30 重庆工商大学 Network intrusion detection method and system based on clustering self-adaptive mixed sampling
CN116051288A (en) * 2023-03-30 2023-05-02 华南理工大学 Financial credit scoring data enhancement method based on resampling
CN116051288B (en) * 2023-03-30 2023-07-18 华南理工大学 Financial credit scoring data enhancement method based on resampling
CN116499748A (en) * 2023-06-27 2023-07-28 昆明理工大学 Bearing fault diagnosis method and system based on improved SMOTE and classifier
CN116499748B (en) * 2023-06-27 2023-08-29 昆明理工大学 Bearing fault diagnosis method and system based on improved SMOTE and classifier
CN117332287A (en) * 2023-09-28 2024-01-02 中国人民解放军63856部队 Evaluation index weight data processing method based on cluster analysis


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
Application publication date: 20210910