CN113792765A - Oversampling method based on triangular centroid weight

Oversampling method based on triangular centroid weight

Info

Publication number
CN113792765A
CN113792765A
Authority
CN
China
Prior art keywords
sample
centroid
weight
danger
samples
Prior art date
Legal status
Pending
Application number
CN202110976931.5A
Other languages
Chinese (zh)
Inventor
周红芳
陈佳琳
Current Assignee
Shenzhen Wanzhida Technology Co ltd
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202110976931.5A
Publication of CN113792765A
Legal status: Pending

Classifications

    • G06F 18/24323 Pattern recognition; classification techniques; tree-organised classifiers
    • G06F 18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G06N 3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming


Abstract

The invention discloses an oversampling method based on triangular centroid weight, comprising the following steps: step 1, quantize the samples to be processed into numerical values and then calculate the feature weights; step 2, extract the danger-class samples from the quantized samples; step 3, search for the neighbor samples of each danger sample; step 4, randomly select two neighbors of each danger sample and compute the centroid of the triangle formed by the three points to obtain a centroid sample; step 5, multiply each coordinate of the centroid by the corresponding feature weight to obtain an offset centroid, the offset centroids together forming the centroid offset samples; step 6, determine a weight coefficient for the centroid offset samples by a genetic algorithm and multiply it by the offset centroid to obtain the final new samples. The invention addresses the limitations of traditional methods, in which a new sample is synthesized on the straight line between two points, so the space of synthesized samples is restricted to that segment and little information is extracted from the original samples.

Description

Oversampling method based on triangular centroid weight
Technical Field
The invention belongs to the technical field of data mining and machine learning data processing, and relates to an oversampling method based on triangular centroid weight.
Background
With the advent of the big data age, a great variety of data floods into our lives, and imbalanced data is one of its typical representatives. Imbalanced data refers to an uneven distribution of samples among the different classes, and the classification of imbalanced data is a ubiquitous problem in artificial intelligence and data mining. In the binary or multi-class problems affected by imbalance, the class with more samples is called the majority or positive class, and the class with fewer samples the minority or negative class. Traditional classification algorithms are mainly designed for data with a fairly balanced distribution; when processing imbalanced data, the classifier becomes inefficient and has difficulty identifying the minority-class samples in the set. The classification performance of the classifier is therefore critical when handling imbalanced samples.
Because imbalanced data exists widely, the invention can be applied in many fields. In the real world, problems such as disease diagnosis and credit assessment often require accurate classification, and sample imbalance frequently makes this very difficult. For example, the diagnosis of COVID-19 patients involves a large amount of personal data: features such as sex, age, weight, blood pressure and lung information form one person's sample, and many such samples form the data set, in which the classes are patients and non-patients. If only 10 out of 1000 people are ill, the patients always make up a small proportion, i.e. the minority class, while the non-patients are the majority; misclassifying a patient as a non-patient would be disastrous. Similarly, in bank credit assessment, age, income, purchasing power and so on can serve as the features of a person's sample for judging whether, and how much, credit should be granted; the number of people with low credit is always small, and the resulting data imbalance must be handled well.
Among methods for processing imbalanced samples, sampling techniques such as undersampling, oversampling and hybrid sampling are widely applied. Oversampling balances the two classes by increasing the number of minority-class samples, thereby improving classification performance. However, traditional random oversampling merely duplicates the original minority samples: it extracts little information from them, and the model learns repeated rather than generalized information, so overfitting arises very easily. On this basis, researchers gradually proposed classical oversampling methods such as SMOTE and Borderline-SMOTE.
SMOTE is an improvement on random oversampling, as shown in Fig. 2. For each minority sample it computes the k nearest neighbors, sets a sampling ratio according to the class imbalance to determine the sampling magnification, selects suitable neighbors, and synthesizes a new sample according to the following formula; the samples generated on the connecting line are treated as new samples with minority-class features and added to the training set.
x_new = x + rand(0,1) * (x' - x) (1)
In formula (1), x_new is the finally synthesized sample, x is an input minority-class sample, x' is a selected neighbor of x, and rand(0,1) is a random number between 0 and 1. New samples are synthesized by this formula until the sampling rate is reached.
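The interpolation of formula (1) can be sketched as follows; the function name and the use of NumPy are illustrative assumptions, not part of the patent:

```python
import numpy as np

def smote_interpolate(x, x_neighbor, rng=np.random.default_rng(0)):
    """Synthesize one sample on the segment between a minority sample x
    and a chosen neighbor x', per formula (1): x_new = x + rand(0,1)*(x' - x)."""
    x = np.asarray(x, dtype=float)
    x_neighbor = np.asarray(x_neighbor, dtype=float)
    gap = rng.random()                 # random number in [0, 1)
    return x + gap * (x_neighbor - x)
```

The synthesized point always lies on the segment between the two inputs, which is exactly the limitation the invention later addresses.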
However, the SMOTE algorithm still has problems. On one hand, a suitable neighbor count, i.e. the value of k, must be chosen, and neighbors are then selected at random; this parameter cannot be determined effectively and often requires repeated experiments and demonstration. On the other hand, the distribution of the data is fixed and marginalization easily occurs: some minority (negative-class) samples lie at the class edge, so synthesized samples gradually drift toward the edge, the boundary between positive and negative samples becomes blurred, and classification becomes harder. The Borderline-SMOTE algorithm was proposed to solve this problem.
The Borderline-SMOTE algorithm improves on SMOTE and currently comes in two variants, Borderline-SMOTE1 and Borderline-SMOTE2. The algorithm separates the majority and minority classes of the training set and, for each minority sample, finds its k nearest neighbors and counts the number m of majority-class samples among them (k >= m >= 0). If m = k, the minority (negative-class) sample lies entirely among the majority class, is judged to be noise, and is skipped; if m is at least half of k, the sample is considered easily misclassified and is called danger; if m is less than half of k, the sample is considered safe and is skipped. Borderline-SMOTE1 operates on the danger samples: it finds the k' nearest neighbors of each danger sample within the minority-class set, then randomly selects among them and synthesizes new samples by the formula above until the required sampling multiple is reached.
Disclosure of Invention
The invention aims to provide an oversampling method based on triangular centroid weight, solving the problems of the prior art, in which a new sample is synthesized on the straight line between two points, so the space of synthesized samples is restricted to that segment and little information is extracted from the samples.
The technical scheme adopted by the invention is as follows:
an oversampling method based on triangular centroid weight applies the initial operation of Borderline-SMOTE method, divides data into noise (noise), danger (danger) and safety (safe), then selects similar neighbors for danger samples, and determines the final position of a new sample according to weight and weight coefficient to strengthen the characteristics of related samples, and the specific steps are as follows:
step 1, quantize the samples to be processed into numerical values and calculate the feature weights of the quantized samples;
step 2, extract the danger-class samples from the quantized samples;
step 3, search for the neighbor samples of each danger sample;
step 4, randomly select two neighbors from the neighbor samples of each danger sample, and compute the centroid of the triangle formed by the three points to obtain a centroid sample;
step 5, multiply each coordinate of the centroid sample by the feature weight obtained in step 1 to obtain an offset centroid; all the offset centroids form the centroid offset samples;
step 6, determine a weight coefficient for the centroid offset samples by a genetic algorithm, and multiply it by the offset centroid to finally obtain the new samples.
The present invention is also characterized in that,
in step 1, the feature weight is calculated by a Relief method.
The danger-class samples are extracted as follows: applying the idea of the Borderline-SMOTE method, divide the samples to be processed into majority and minority classes, and for each minority sample find its k nearest neighbors and count the number m of majority-class samples among them (k >= m >= 0). If m = k, the minority sample lies entirely among the majority class, is judged to be noise, and is skipped; if m is half of k or more, the sample is considered easily misclassified and is called danger, i.e. the class of samples to be obtained; if m is less than half of k, the sample is considered safe and is skipped.
The calculation mode of the triangle centroid coordinate is as follows:
Centroid=(D(A)+D(B)+D(C))/3 (3)
in formula (3), Centroid is the coordinate of the triangle centroid, D(A) is the coordinate of the danger sample, and D(B) and D(C) are the coordinates of its two selected neighbors.
The synthesis method of the new sample comprises the following steps:
New sample = Centroid * Feature Weight * Weight Coefficient (4)
where Centroid is the coordinate of the triangle centroid, Feature Weight is the feature weight, and Weight Coefficient is the weight coefficient.
The weight coefficient is determined by a genetic algorithm. An initial population is generated between 0 and 1; the weight coefficients, decimal numbers encoded as binary strings, are repeatedly selected, crossed and mutated while the best individuals are retained, and the algorithm terminates after the set number of iterations, yielding the optimal weight coefficient and the classification result. The initial population size is 10, 20 generations are iterated, tournament selection and two-point crossover are used, the crossover probability is 0.7, and the mutation probability is the reciprocal of the chromosome length.
The invention has the beneficial effects that:
firstly, the invention applies the initial operation of the Borderline-SMOTE method to divide the data into three parts, noise, danger and safe, then selects the same-class neighbors of the danger samples and determines the final position of each new sample according to the feature weights and a weight coefficient, so as to strengthen the relevant sample features.
Secondly, using the accuracy and the area under the ROC curve (AUC) as evaluation indices on a decision tree (Gini) classifier and a KNN classifier respectively, the method ultimately achieves better classification results than Borderline-SMOTE1.
Drawings
FIG. 1 is a flow chart of an oversampling method based on triangular centroid weights of the present invention;
fig. 2 is a schematic diagram of the SMOTE oversampling process.
FIG. 3 is a flow chart of a method for obtaining weight coefficients in an oversampling method based on triangular centroid weights according to the present invention;
FIG. 4 is a comparison of AUC values as the evaluation criterion using a decision tree classifier;
FIG. 5 is a comparison of AUC values as the evaluation criterion using a KNN classifier;
FIG. 6 is a comparison of accuracy as the evaluation criterion using a decision tree classifier;
FIG. 7 is a comparison of accuracy as the evaluation criterion using a KNN classifier.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention relates to an oversampling method based on triangular centroid weight. As shown in Fig. 1, it applies the initial operation of the Borderline-SMOTE method to divide the data into three parts, noise, danger and safe, then selects same-class neighbors for the danger samples and determines the final position of each new sample according to the feature weights and a weight coefficient, so as to strengthen the relevant sample features. The specific steps are as follows:
step 1, quantize the samples to be processed into numerical values and calculate the feature weights of the quantized samples;
step 2, extract the danger-class samples from the quantized samples;
step 3, search for the neighbor samples of each danger sample;
step 4, randomly select two neighbors from the neighbor samples of each danger sample, and compute the centroid of the triangle formed by the three points to obtain a centroid sample;
step 5, multiply each coordinate of the centroid sample by the feature weight obtained in step 1 to obtain an offset centroid; all the offset centroids form the centroid offset samples;
step 6, determine a weight coefficient for the centroid offset samples by a genetic algorithm, and multiply it by the offset centroid to finally obtain the new samples.
In the step 1:
the operation of quantizing the values is done in euclidean space;
wherein the feature weights are calculated by the Relief method:
the Relief method can give different weights to the features according to the types and the correlations of the features, and finally, a weight set of each feature with the threshold as the center can be obtained according to the set threshold. The algorithm randomly selects a sample B from the training set, then finds a nearest neighbor sample H from samples of the same class of B, called Hit, and finds a nearest neighbor sample M from samples of a different class from M, called Miss. If the distance between B and Hit on a certain feature is smaller than the distance between B and Miss, the feature is beneficial to classification, and the weight of the feature is increased; conversely, if the distance between B and Hit is greater than the distance between B and Miss, indicating that the feature negatively affects the classification, the weight of the feature is decreased.
We set the initial weight to 1 and calculate the weight as follows:
W(A)=W(A)-diff(A,B,H)/m+diff(A,B,M)/m (2)
in equation (2), W(A) is the weight of feature A, m is the number of rounds, and the two diff terms are the distances between the sample and Hit and Miss respectively on feature A; looping feature A from 1 to N yields the first-generation weight values. The above process is then repeated m times, finally giving the average weight of each feature.
Considering the influence of the initial weight setting, the output average weights are renormalized as percentages, so that each feature weight is less than 1 while the relative order among them is unchanged.
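A minimal sketch of the Relief update in formula (2), assuming the per-feature diff is the absolute coordinate difference; the final normalization mirrors the percentage rescaling described above (function name and interface are illustrative):

```python
import numpy as np

def relief_weights(X, y, m=100, rng=np.random.default_rng(0)):
    """Relief sketch: for a randomly chosen sample B, shift each feature
    weight away from its distance to the nearest Hit and toward its
    distance to the nearest Miss, averaged over m rounds (formula (2))."""
    n, d = X.shape
    W = np.ones(d)                     # initial weights set to 1, as in the text
    for _ in range(m):
        i = rng.integers(n)
        B = X[i]
        same = (y == y[i]); same[i] = False
        other = (y != y[i])
        H = X[same][np.argmin(np.linalg.norm(X[same] - B, axis=1))]  # Hit
        M = X[other][np.argmin(np.linalg.norm(X[other] - B, axis=1))]  # Miss
        W = W - np.abs(B - H) / m + np.abs(B - M) / m
    return W / W.sum()                 # renormalize so each weight is < 1
```

Features that separate the classes receive larger weights, as the document describes.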
In the step 2:
the method for extracting the danger samples comprises the following steps: and dividing a plurality of classes and a few classes in the sample to be processed by applying the Borderline-SMOTE method idea, and searching k neighbors for each few class sample to obtain the number m of the plurality of classes near the sample (k is more than or equal to m and more than or equal to 0). If m is k, the negative class samples are all near the positive class, the samples are determined to be noise (noise), and the operation is stopped; if m is half or more of k, the negative class sample is considered to be a sample which is easily misclassified, and is called danger (danger), namely a safety class sample needing to be obtained; if m is less than half of k, the negative class sample is considered safe (safe) and operation is stopped.
In the step 3:
after the danger sample is extracted, its neighbors need to be found. And finding the neighbor of the danger sample in a few samples of the original training set as a parent sample for subsequently synthesizing a new sample. The algorithm can find sample points needing attention and make the method of taking the points closer to a few types of sample points which have trends but cannot be easily distinguished, wherein k is 5.
In the triangular centroid weight-based oversampling method, the idea of searching for neighbors of danger samples among a small number of classes is also applied.
In the step 4:
the calculation mode of the triangle centroid coordinate is as follows:
Centroid=(D(A)+D(B)+D(C))/3 (3)
in formula (3), Centroid is the coordinate of the triangle centroid, D(A) is the coordinate of the danger sample, and D(B) and D(C) are the coordinates of its two selected neighbors.
In the step 5: after the centroid is obtained in step 4, the feature weight output according to step 1 is multiplied by the centroid, so that our newly generated sample is not limited to a fixed position, but is shifted within a reasonable range according to the importance of each feature.
We determine how many new samples of the minority class are synthesized according to the input sampling rate, and the purpose of the algorithm is to balance the unbalanced samples, so the number of synthesized new samples should be the difference between the minority sample and the majority sample in the training set, and the danger class samples will be synthesized all the time before the number of samples is not synthesized.
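Steps 4 and 5, the triangle centroid of formula (3) followed by the feature-weight offset, can be sketched as:

```python
import numpy as np

def offset_centroid(danger_sample, neighbor1, neighbor2, feature_weights):
    """Centroid of the triangle formed by a danger sample and two of its
    minority-class neighbors (formula (3)), with each coordinate then
    multiplied by its feature weight to produce the offset centroid."""
    centroid = (np.asarray(danger_sample, float)
                + np.asarray(neighbor1, float)
                + np.asarray(neighbor2, float)) / 3.0
    return centroid * np.asarray(feature_weights, float)
```

For a danger sample at the origin with neighbors (3, 0) and (0, 3), the centroid is (1, 1); feature weights below 1 then pull each coordinate toward zero in proportion to feature importance.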
In the step 6:
the concept of the weight coefficient is introduced, because the feature weight obtained by optimization in step 1 still cannot accurately guide the mass center to shift to a proper position, therefore, a weight coefficient between 0 and 1 is set, the front part is opened and the rear part is closed, a new sample is obtained by multiplying the weight coefficient by the shifted mass center, and the new sample can effectively reach the most proper position.
Referring to Fig. 3, the weight coefficient is determined by a genetic algorithm. An initial population is generated between 0 and 1; the weight coefficients, decimal numbers encoded as binary strings, are repeatedly selected, crossed and mutated, the population achieving the best evaluation index is retained, and the algorithm terminates after the set number of iterations, at which point the best weight coefficient and the classification result are obtained. The initial population size is set to 10, 20 generations are iterated, tournament selection and two-point crossover are used, the crossover probability is 0.7, and the mutation probability is the reciprocal of the chromosome length, as shown in Table 1.
TABLE 1 Genetic algorithm parameter settings

Parameter                Value
Population size          10
Number of generations    20
Selection                Tournament
Crossover                Two-point
Crossover probability    0.7
Mutation probability     1 / chromosome length
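A compact sketch of the genetic search for the weight coefficient with the parameters above; the bit length, the decoding into (0, 1], and the fitness interface are illustrative assumptions, not specified by the patent:

```python
import random

def ga_weight_coefficient(fitness, n_bits=10, pop_size=10, generations=20,
                          p_cross=0.7, seed=0):
    """Binary-string GA: tournament selection, two-point crossover with
    probability 0.7, mutation probability 1/chromosome length, best
    individual retained; `fitness` maps a coefficient in (0, 1] to a
    score to maximize (e.g. a classifier's evaluation index)."""
    rng = random.Random(seed)
    p_mut = 1.0 / n_bits
    decode = lambda bits: (int("".join(map(str, bits)), 2) + 1) / (2 ** n_bits)

    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=lambda b: fitness(decode(b)))
    for _ in range(generations):
        def tournament():
            a, b = rng.sample(pop, 2)
            return list(max(a, b, key=lambda c: fitness(decode(c))))
        children = []
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < p_cross:            # two-point crossover
                i, j = sorted(rng.sample(range(n_bits), 2))
                p1[i:j], p2[i:j] = p2[i:j], p1[i:j]
            for child in (p1, p2):
                for k in range(n_bits):
                    if rng.random() < p_mut:      # bit-flip mutation
                        child[k] ^= 1
                children.append(child)
        pop = children[:pop_size]
        cand = max(pop, key=lambda b: fitness(decode(b)))
        if fitness(decode(cand)) > fitness(decode(best)):
            best = cand                            # retain the best individual
    return decode(best)
```

In the patent's setting, the fitness would be the classification result obtained after synthesizing samples with the candidate coefficient.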
The new sample is synthesized as follows:
New sample = Centroid * Feature Weight * Weight Coefficient (4)
where Centroid is the coordinate of the triangle centroid, Feature Weight is the weight of each feature, and Weight Coefficient is the weight coefficient.
The method of the invention uses 10-fold cross-validation to evaluate classification accuracy and the area under the ROC curve (AUC) on the 7 datasets, with decision trees (DT) and KNN as classifiers. k-fold cross-validation divides the selected dataset evenly into k folds, uses 1 fold as the test set and the other k-1 folds as the training set, and so on, giving k trained and tested models in total; the average over the k validations is taken as the result for both accuracy and AUC. In the invention, 10-fold cross-validation is repeated 10 times, and the average of the 100 evaluations is taken as the final result.
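The evaluation protocol above, k-fold splitting repeated several times with the scores averaged, can be sketched as follows; the `fit_predict` callback is an illustrative stand-in for any classifier:

```python
import numpy as np

def repeated_kfold_scores(X, y, fit_predict, k=10, repeats=10, seed=0):
    """Shuffle the data, split it into k folds, use each fold once as the
    test set and the rest as training, repeat the whole procedure
    `repeats` times, and average all k * repeats accuracy scores."""
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = []
    for _ in range(repeats):
        order = rng.permutation(n)
        folds = np.array_split(order, k)
        for f in folds:
            test = np.zeros(n, dtype=bool)
            test[f] = True
            y_pred = fit_predict(X[~test], y[~test], X[test])
            scores.append(np.mean(y_pred == y[test]))
    return float(np.mean(scores))      # average of the k * repeats evaluations
```

With k = 10 and repeats = 10 this yields exactly the 100 evaluations averaged in the text.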
Accuracy and the area under the receiver operating characteristic (ROC) curve, AUC, are commonly used classifier evaluation metrics in the classification of imbalanced data.
The accuracy is the most common classifier evaluation index, and the calculation method is as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
where:
TP (true positive) is the number of instances that are actually positive and are classified as positive by the classifier;
TN (true negative) is the number of instances that are actually negative and are classified as negative by the classifier;
FP (false positive) is the number of instances that are actually negative but are classified as positive by the classifier;
FN (false negative) is the number of instances that are actually positive but are classified as negative by the classifier.
The receiver operating characteristic (ROC) curve is obtained by plotting the false positive rate on the abscissa against the hit rate (true positive rate) on the ordinate at different classification thresholds; in the unit square of the first quadrant, the closer the curve is to the upper-left corner, the better the classifier's performance is generally considered. However, to avoid unintuitive comparisons when curves cross, the area under the ROC curve (AUC) is used as the evaluation index: the larger the AUC value, the better the classification performance.
where:
FPR = FP / (FP + TN)
TPR = TP / (TP + FN)
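AUC can equivalently be computed as the probability that a randomly chosen positive instance is scored above a randomly chosen negative one; a pure-Python sketch (function name illustrative):

```python
def auc_from_scores(y_true, scores):
    """Pairwise-rank formulation of AUC: over all positive/negative
    pairs, count those where the positive scores higher (ties count
    one half); equivalent to the area under the ROC curve traced by
    TPR against FPR across thresholds."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking gives 1.0, a random one about 0.5, which is why larger AUC indicates better classification.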
to verify the effectiveness of the present invention, the KEEL Dataset is used as the sample to be processed, the KEEL Dataset is described in Table 2, Dataset is the name of the Dataset, Instances is the sample volume in the Dataset, Features is the feature number of the Dataset, and Classes is the class number of the Dataset.
The invention is verified on each of the 8 datasets with the present method, using two classification algorithms, decision tree and KNN. Experiments show that the proposed oversampling method based on triangular centroid weight achieves higher accuracy and AUC values than Borderline-SMOTE1 on the classification of the 8 datasets.
TABLE 2 KEEL data set description
[Table 2 is shown as an image in the original document and is not reproduced here.]
Table 3 compares the two methods with classification accuracy as the evaluation criterion, and Table 4 compares them with AUC. In both tables, "My Method" is the method provided by the invention, DT indicates that the base classifier is a decision tree, and KNN indicates the K-nearest-neighbor classification algorithm.
TABLE 3 Comparison of classification with accuracy as the evaluation criterion (%)
[Table 3 is shown as an image in the original document and is not reproduced here.]
TABLE 4 Comparison of classification with AUC values as the evaluation criterion
[Table 4 is shown as an image in the original document and is not reproduced here.]
Figs. 4 to 7 in the specification are experimental comparison charts of the two algorithms classifying the oversampled samples under different evaluation indices. The Borderline-SMOTE1 algorithm is on the left, and the method proposed by the invention is on the right.
FIG. 4 is a comparative bar chart of the AUC values of the two algorithms with DT as the base classifier; experiments show that the proposed oversampling algorithm achieves higher AUC values than the Borderline-SMOTE1 algorithm on the 8 datasets.
Fig. 5 is a comparative bar chart of the AUC values of the two algorithms with KNN as the base classifier; experiments show that the proposed oversampling algorithm achieves higher AUC values than the Borderline-SMOTE1 algorithm on the 8 datasets.
FIG. 6 is a comparative bar chart of the accuracy of the two algorithms with DT as the base classifier; experiments show that the proposed oversampling algorithm achieves higher accuracy than the Borderline-SMOTE1 algorithm on the 8 datasets.
Fig. 7 is a comparative bar chart of the accuracy of the two algorithms with KNN as the base classifier; experiments show that the proposed oversampling algorithm achieves higher accuracy than the Borderline-SMOTE1 algorithm on the 8 datasets.

Claims (6)

1. An oversampling method based on triangular centroid weight, characterized in that the initial operation of the Borderline-SMOTE method is applied to divide the data into three parts, noise, danger and safe, then same-class neighbors of the danger samples are selected, and the final position of each new sample is determined according to the feature weights and a weight coefficient so as to strengthen the relevant sample features, the method specifically comprising the following steps:
step 1, quantizing the samples to be processed into numerical values, and calculating the feature weights of the quantized samples;
step 2, extracting the danger-class samples from the quantized samples;
step 3, searching for the neighbor samples of each danger sample;
step 4, randomly finding two neighbor samples of each danger sample, and calculating the centroid of the triangle formed by the three points to obtain a centroid sample;
step 5, multiplying each coordinate of the centroid sample by the feature weight obtained in step 1 to obtain an offset centroid, all the offset centroids forming the centroid offset samples;
step 6, determining a weight coefficient for the centroid offset samples by a genetic algorithm, and multiplying it by the offset centroid to finally obtain the new samples.
2. The triangular centroid weight-based oversampling method as claimed in claim 1, wherein said feature weight is calculated in said step 1 by a Relief method.
3. The triangular centroid weight-based oversampling method as claimed in claim 1, wherein said danger-class samples are extracted by: applying the idea of the Borderline-SMOTE method, dividing the samples to be processed into majority and minority classes, and for each minority sample finding its k nearest neighbors and counting the number m of majority-class samples among them (k >= m >= 0); if m = k, the minority sample lies entirely among the majority class, is judged to be noise, and is skipped; if m is half of k or more, the sample is considered easily misclassified and is called danger, i.e. the class of samples to be obtained; if m is less than half of k, the sample is considered safe and is skipped.
4. The triangular centroid weight-based oversampling method according to claim 1, wherein the triangular centroid coordinates are calculated by:
Centroid = (D(A) + D(B) + D(C)) / 3    (3)
In formula (3), Centroid is the coordinate of the triangle centroid, D(A) is the coordinate of the danger sample itself, and D(B) and D(C) are the coordinates of its two selected neighbors.
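Formula (3) is an elementwise average over the feature vectors; a one-line sketch:

```python
import numpy as np

def triangle_centroid(danger, nb1, nb2):
    """Centroid = (D(A) + D(B) + D(C)) / 3, elementwise over features."""
    return (np.asarray(danger, float) + np.asarray(nb1, float)
            + np.asarray(nb2, float)) / 3.0
```

For the triangle (0, 0), (3, 0), (0, 3) this yields the centroid (1, 1).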
5. The triangular centroid weight-based oversampling method according to claim 1, wherein said new sample is synthesized by:
Newsample = Centroid * Feature Weight * Weight Coefficient    (4)
In formula (4), Centroid is the coordinate of the triangle centroid, Feature Weight is the weight of each feature, and Weight Coefficient is the weight coefficient determined by the genetic algorithm.
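Formula (4) combines an elementwise feature-weight product with a scalar weight coefficient; a minimal sketch:

```python
import numpy as np

def synthesize(centroid, feature_weight, weight_coefficient):
    """Newsample = Centroid * Feature Weight * Weight Coefficient (eq. 4):
    elementwise over features, scaled by a scalar weight coefficient."""
    return (np.asarray(centroid, float)
            * np.asarray(feature_weight, float)
            * float(weight_coefficient))
```

Each feature of the centroid is scaled by its Relief weight, then the whole vector is scaled by the GA-optimized coefficient to place the new minority sample.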
6. The triangular centroid weight-based oversampling method as claimed in claim 1, wherein the weight coefficient is determined by a genetic algorithm: an initial population of values between 0 and 1 is generated, each decimal weight coefficient is encoded as a binary string, and selection, crossover, mutation and elitist retention are applied repeatedly; the algorithm terminates after a set number of iterations, yielding the optimal weight coefficient and classification result. The initial population size is set to 10 and the algorithm runs for 20 generations; selection uses tournament selection, crossover uses two-point crossover with probability 0.7, and the mutation probability is the reciprocal of the chromosome length.
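The GA settings of claim 6 can be sketched as follows. This is an illustrative implementation, not the patent's: the `fitness` function is assumed to be supplied by the caller (in the patent it would score the classification result obtained with a candidate coefficient), and the 16-bit chromosome length is a chosen assumption.

```python
import random

def ga_weight_coefficient(fitness, n_bits=16, pop_size=10, n_gen=20,
                          cx_prob=0.7, seed=0):
    """Search a weight coefficient in [0, 1] with the settings of claim 6:
    population 10, 20 generations, tournament selection, two-point
    crossover (p = 0.7), mutation probability 1 / chromosome length."""
    rng = random.Random(seed)
    decode = lambda bits: int("".join(map(str, bits)), 2) / (2 ** n_bits - 1)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    mut_prob = 1.0 / n_bits                    # reciprocal of chromosome length
    for _ in range(n_gen):
        scored = [(fitness(decode(ind)), ind) for ind in pop]
        elite = max(scored)[1]                 # elitist retention

        def tournament():
            a, b = rng.sample(scored, 2)       # size-2 tournament
            return list(max(a, b, key=lambda t: t[0])[1])

        nxt = [list(elite)]
        while len(nxt) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < cx_prob:         # two-point crossover
                i, j = sorted(rng.sample(range(n_bits), 2))
                p1[i:j], p2[i:j] = p2[i:j], p1[i:j]
            for child in (p1, p2):
                for g in range(n_bits):        # bit-flip mutation
                    if rng.random() < mut_prob:
                        child[g] ^= 1
                nxt.append(child)
        pop = nxt[:pop_size]
    return decode(max(pop, key=lambda ind: fitness(decode(ind))))
```

With a smooth fitness such as the distance to a target value, the elitist copy guarantees the best coefficient found never regresses across generations.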
CN202110976931.5A 2021-08-24 2021-08-24 Oversampling method based on triangular centroid weight Pending CN113792765A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110976931.5A CN113792765A (en) 2021-08-24 2021-08-24 Oversampling method based on triangular centroid weight


Publications (1)

Publication Number Publication Date
CN113792765A true CN113792765A (en) 2021-12-14

Family

ID=79182293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110976931.5A Pending CN113792765A (en) 2021-08-24 2021-08-24 Oversampling method based on triangular centroid weight

Country Status (1)

Country Link
CN (1) CN113792765A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2198685A1 (en) * 1996-02-28 1997-08-28 Liberty Technologies, Inc. System and method for stable analysis of sampled transients arbitrarily aligned with their sample points
WO2019041629A1 (en) * 2017-08-30 2019-03-07 哈尔滨工业大学深圳研究生院 Method for classifying high-dimensional imbalanced data based on svm
CN111259964A (en) * 2020-01-17 2020-06-09 上海海事大学 Over-sampling method for unbalanced data set
CN111931853A (en) * 2020-08-12 2020-11-13 桂林电子科技大学 Oversampling method based on hierarchical clustering and improved SMOTE
CN112633337A (en) * 2020-12-14 2021-04-09 哈尔滨理工大学 Unbalanced data processing method based on clustering and boundary points


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
赵清华; 张艺豪; 马建芬; 段倩倩: "Research on a classification algorithm for imbalanced datasets based on improved SMOTE", Computer Engineering and Applications (计算机工程与应用), no. 18 *
霍玉丹; 谷琼; 蔡之华; 袁磊: "Classification algorithm for imbalanced datasets based on SMOTE improved by a genetic algorithm", Journal of Computer Applications (计算机应用), no. 01 *

Similar Documents

Publication Publication Date Title
Wei et al. NI-MWMOTE: An improving noise-immunity majority weighted minority oversampling technique for imbalanced classification problems
De Amorim Feature relevance in Ward's hierarchical clustering using the Lp norm
CN111400180B (en) Software defect prediction method based on feature set division and ensemble learning
CN111898689B (en) Image classification method based on neural network architecture search
Pradipta et al. Radius-SMOTE: a new oversampling technique of minority samples based on radius distance for learning from imbalanced data
CN104392253B (en) Interactive classification labeling method for sketch data set
JP2018181290A (en) Filter type feature selection algorithm based on improved information measurement and ga
Zainudin et al. Feature Selection Optimization using Hybrid Relief-f with Self-adaptive Differential Evolution.
CN108280236A (en) A kind of random forest visualization data analysing method based on LargeVis
CN110826618A (en) Personal credit risk assessment method based on random forest
Zhang et al. A hybrid feature selection algorithm for classification unbalanced data processsing
CN112801140A (en) XGboost breast cancer rapid diagnosis method based on moth fire suppression optimization algorithm
Phienthrakul et al. Evolutionary strategies for multi-scale radial basis function kernels in support vector machines
CN113792765A (en) Oversampling method based on triangular centroid weight
Antoniades et al. Speeding up feature selection: A deep-inspired network pruning algorithm
CN115017988A (en) Competitive clustering method for state anomaly diagnosis
CN113392908A (en) Unbalanced data oversampling algorithm based on boundary density
Patil et al. Pattern recognition using genetic algorithm
CN112784908A (en) Dynamic self-stepping integration method based on extremely unbalanced data classification
CN111709460A (en) Mutual information characteristic selection method based on correlation coefficient
Bosio et al. Feature set enhancement via hierarchical clustering for microarray classification
Cui et al. Research on Credit Card Fraud Classification Based on GA-SVM
CN109934274A (en) Based on L2,pThe GEPSVM classification method of norm distance measure
Vluymans et al. Instance selection for imbalanced data
CN112580606B (en) Large-scale human body behavior identification method based on clustering grouping

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240417

Address after: 518000 1002, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Applicant after: Shenzhen Wanzhida Technology Co.,Ltd.

Country or region after: China

Address before: 710048 Shaanxi province Xi'an Beilin District Jinhua Road No. 5

Applicant before: XI'AN University OF TECHNOLOGY

Country or region before: China