CN105868775A - Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm - Google Patents
- Publication number
- CN105868775A CN105868775A CN201610172812.3A CN201610172812A CN105868775A CN 105868775 A CN105868775 A CN 105868775A CN 201610172812 A CN201610172812 A CN 201610172812A CN 105868775 A CN105868775 A CN 105868775A
- Authority
- CN
- China
- Prior art keywords
- sample
- algorithm
- minority class
- pso
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
Abstract
The invention provides an imbalance sample classification method based on the PSO (Particle Swarm Optimization) algorithm. The PSO algorithm optimizes the sampling rates of boundary samples and safe samples during oversampling to obtain the optimal oversampling ratio, and simultaneously optimizes the features so as to select the most representative feature combination, one that both simplifies computation and improves the classification result. The method uses AUC/F-Measure as the fitness function of the algorithm, thereby improving the final classifier.
Description
Technical field
The invention belongs to the field of classification and optimization in data mining, and in particular relates to a classification method suitable for imbalanced samples.
Background technology
In imbalanced-data classification, an imbalanced data set is one in which, across the whole sample space, the number of samples of one class differs greatly from that of the remaining class or classes. In such cases it is usually the minority class that deserves more attention. In medical diagnosis, for example, cancer or heart-disease data sets are imbalanced, and the samples of interest are the diseased ones; accurately classifying the attributes of these samples makes it possible to diagnose a patient's condition precisely and give the patient timely, targeted treatment.
When handling imbalanced data, a traditional classifier pursues a high overall classification accuracy and therefore tends to classify minority-class samples directly as majority-class samples. This can label a potential patient's information as healthy, so the patient may miss the best treatment window, causing irreparable harm.
In imbalanced-data classification, many parameters influence the final classification result; different parameter combinations can produce widely different results. These parameters are usually specified manually, and there is no index for validating them; they can only be judged by the final classification result. Manually specified parameters tuned by repeated adjustment often end up at merely a local optimum; the global optimum is rarely obtained.
In the prior art, approaches to imbalanced-data classification fall into two groups: methods based on data resampling and methods based on algorithm improvement.
Methods based on data resampling: the imbalance manifests itself mainly in the quantitative imbalance between the two classes of samples, which makes classification difficult for traditional classifiers. The first idea is therefore to work at the data level: change the number of original data samples so that the classes become roughly balanced, resolving the classification difficulty. At the data level, researchers at home and abroad have mainly proposed oversampling, undersampling, and mixed sampling.
Oversampling, as the name suggests, adds minority-class samples by some method so that the originally imbalanced data set becomes roughly balanced. If the oversampling method is poorly chosen, however, it may cause overfitting.
SMOTE is currently the most widely applied oversampling technique. Instead of simply duplicating nearby samples, it inserts synthetic minority-class samples between existing minority-class samples. The technique was mainly inspired by an algorithm proposed in a handwriting-recognition project. SMOTE oversampling proceeds as follows: for each minority-class sample, select the K other minority-class sample points nearest to it; then linearly interpolate between the sample and these K neighbours, using a random interpolation factor together with the oversampling rate to generate a number of synthetic minority-class sample points. The SMOTE interpolation formula is:
p_i = x + rand(0, 1) × (y_i − x),  i = 1, 2, …, N   (1)
Because SMOTE interpolates between similar (i.e. neighbouring) minority-class samples, the synthetic samples it generates are reasonably representative. The overfitting problem is therefore largely avoided by the SMOTE algorithm, and the decision region of the minority class is usefully extended. The same idea can also be applied in the majority-class sample space to shrink the decision region of the majority class.
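As a sketch, formula (1) can be implemented in a few lines of Python. The toy minority set, the neighbour count k, and the number of synthetic points per sample are illustrative assumptions, not values from the patent:

```python
import math
import random

def smote(minority, k=3, n_new=2):
    """Generate synthetic minority samples by SMOTE-style linear
    interpolation: p = x + rand(0,1) * (y - x), as in formula (1)."""
    synthetic = []
    for x in minority:
        # k nearest *other* minority samples, by Euclidean distance
        neighbours = sorted(
            (y for y in minority if y is not x),
            key=lambda y: math.dist(x, y))[:k]
        for _ in range(n_new):            # n_new plays the oversampling-rate role
            y = random.choice(neighbours)
            r = random.random()           # random interpolation factor
            synthetic.append(tuple(xi + r * (yi - xi)
                                   for xi, yi in zip(x, y)))
    return synthetic

random.seed(0)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new = smote(minority, k=2, n_new=1)
```

Every synthetic point lies on the segment between a minority sample and one of its minority neighbours, which is why the generated samples stay inside the minority region rather than duplicating existing points.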
In 2013, Lou Xiaojun et al. proposed an oversampling method based on clustering that takes boundary-sample information into account. The method uses one-sided selection to determine the boundary samples of the minority and majority classes, clusters all minority-class samples into clusters that contain the boundary, and performs appropriate oversampling within each cluster. The synthetic minority samples so produced are more similar to the originals and better represent the parent sample-space distribution. At the same time, oversampling the boundary samples in a targeted way sharpens the boundary and thus emphasizes the importance of minority-class boundary samples.
The idea of undersampling is to remove certain majority-class samples according to some rule so that the sample data space becomes roughly balanced. Undersampling has drawbacks too: it can easily discard the important information carried by representative samples.
Mixed sampling fuses an oversampling algorithm with an undersampling algorithm, applying both to balance the data set while resampling it. Many studies show that mixed sampling outperforms any single resampling method, and that oversampling first and undersampling afterwards generally gives better results.
Methods based on algorithm improvement: in imbalanced-data classification, resampling at the data level can eliminate the effect of the imbalance to some extent and improve minority-class classification, but it essentially changes the distribution of the data set, and this change may affect the faithfulness of the classification result. Improvements at the algorithm level avoid this data-level drawback and can likewise improve minority-class classification to some extent. Through the research of scholars at home and abroad in recent years, work at the algorithm level has focused mainly on cost-sensitive learning, ensemble learning, and traditional one-class learning.
Ensemble learning is a widely used algorithmic framework for imbalanced-data classification. Boosting is one of the classic ensemble methods, and AdaBoost is the most commonly used Boosting algorithm. AdaBoost iteratively trains different weak classifiers and combines them into a strong classifier; according to the classification result of each iteration's classifier, it updates the sample weights for the next round of training. From AdaBoost's training process we can see that it mainly offers high classification accuracy, sub-classifiers that self-adjust as needed, easily understood classification results, and resistance to overfitting.
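The training process described above can be sketched with one-dimensional threshold stumps as the weak classifiers; the toy data and the number of rounds are illustrative assumptions, not part of the patent:

```python
import math

def stump_factory(data):
    """Enumerate threshold stumps h(x) = s if x > t else -s."""
    xs = sorted(x for x, _ in data)
    ts = [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[0] - 1, xs[-1] + 1]
    for t in ts:
        for s in (1, -1):
            yield lambda x, t=t, s=s: s if x > t else -s

def adaboost(data, rounds=5):
    """AdaBoost: reweight samples after each round; each weak classifier
    gets voting weight alpha = 0.5 * ln((1 - err) / err)."""
    n = len(data)
    w = [1.0 / n] * n                       # normalized sample weights
    ensemble = []
    for _ in range(rounds):
        # pick the stump with the lowest weighted error
        best, best_err = None, 1.0
        for h in stump_factory(data):
            err = sum(wi for wi, (x, y) in zip(w, data) if h(x) != y)
            if err < best_err:
                best, best_err = h, err
        best_err = max(best_err, 1e-10)     # avoid log(0) on a perfect stump
        alpha = 0.5 * math.log((1 - best_err) / best_err)
        ensemble.append((alpha, best))
        # concentrate weight on misclassified samples, then renormalize
        w = [wi * math.exp(-alpha * y * best(x))
             for wi, (x, y) in zip(w, data)]
        z = sum(w)
        w = [wi / z for wi in w]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

data = [(0.1, -1), (0.2, -1), (0.3, -1), (0.7, 1), (0.8, 1), (0.9, 1)]
clf = adaboost(data, rounds=3)
```

The `alpha` computed from the error rate is exactly the voting weight coefficient the patent later proposes to optimize with PSO instead.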
When balancing an imbalanced sample space at the data level, some parameters must be specified manually. In the Borderline algorithm, for example, the parameter K is needed: the K-nearest-neighbour method finds each sample's K nearest samples in the whole data set S and stores them in the set KNNsmin corresponding to each Smin sample. Likewise, when oversampling with SMOTE, the oversampling ratio (OsRate) is also specified manually, and different oversampling ratios have a very large effect on the final classification result. When mixed sampling is performed, the undersampling ratio of the majority-class samples (UsRate) poses the same parameter problem.
At the algorithm level, some parameters likewise need to be specified manually. In AdaBoost, a Boosting algorithm, for example, each base classifier has a voting weight to be determined, and this voting weight coefficient is determined by the classification error rate of that base classifier. The weight coefficients of the base classifiers influence one another, however, and determining them simply from the classification error often does not yield the optimal final classification result.
Summary of the invention
To solve the problems of the prior art, the invention provides an imbalanced-sample classification method based on the PSO algorithm. The PSO (Particle Swarm Optimization) algorithm adaptively optimizes the parameters, so that a series of optimal coefficient combinations can be obtained and the final classification result is greatly improved.
The present invention is achieved through the following technical solution:
An imbalanced-sample classification method based on the PSO algorithm, in which a particle in the PSO algorithm is represented as: OsRate_i, UsRate, and Fec_j, where OsRate_i is the oversampling rate of the i-th cluster, i = 1, 2, …, N, N being the number of clusters formed by the DBSCAN algorithm, and Fec_j is the j-th feature of a sample, j = 1, 2, …, M. The method comprises: (1) dividing the data set into a training set Train and a test set Test; (2) generating initial solutions x_i, i = 1, 2, …, SN, in the search space, where SN is the population size; (3) setting the global optimum gbest = 0; (4) executing steps (5) and (6) for the maximum number of cycles MCN; (5) for j from 1 to SN, obtaining the current solution x_j, reselecting the features according to that solution to generate a new data set, and obtaining the corresponding AUC or F-Measure through k-fold cross-validation and classifier learning; (6) obtaining each particle's pbest according to the velocity and position update formulas of the PSO algorithm, while updating the global optimum gbest; (7) combining the resulting oversampling rates, undersampling rate, and features, performing mixed sampling, building a classifier, training it on the data set, and obtaining the final AUC.
The beneficial effects of the invention are: by using particle swarm optimization, the sampling rates of boundary samples and safe samples during oversampling are optimized to obtain the optimal oversampling ratio; the features are optimized at the same time, so that the most representative feature combination is selected, one that simplifies computation and improves the classification result; and AUC/F-Mea is used as the fitness function of the algorithm, thereby improving the final classifier.
Brief description of the drawings
Fig. 1 is the pseudocode of the PSO algorithm;
Fig. 2 is a schematic diagram of the particle representation in the PSO algorithm;
Fig. 3 is the pseudocode of the PSO optimization of the oversampling rates in the present invention.
Detailed description of the invention
The invention is further described below with reference to the accompanying drawings and specific embodiments.
Particle Swarm Optimization (PSO) is an evolutionary algorithm proposed by Kennedy and Eberhart in 1995, inspired by collective behaviour in nature such as insect swarms, herds, and bird flocks. These colonial organisms search for food, mates, and other things in a manner they themselves understand; each member of the group changes its individual or collective behaviour by learning from its own experience and that of the other members, and in this way the group completes a global search. A real-world example illustrates what the PSO procedure looks like: suppose a flock of birds searches aimlessly for food in a given region. The birds know nothing about the food's exact location and do not know in which direction to fly to approach it. How can the flock find the food along an optimal search route? The right approach is to obtain information from the birds (particles) nearer to the food, and then move toward and search the region close to those birds.
In computer science, PSO is a computational method that seeks the optimal candidate solution by iterative optimization against a specific fitness function. In PSO, a "particle" represents an individual in the whole collective search space. In the bird-flock example above, the optimization proceeds as follows: the positions and flight velocities of the birds are initialized, the velocity encoding the trajectory a particle will follow in the next search step, and the optimum of the current region is then searched for. Two important values are retained at each PSO update: one is the individual optimum of each particle in the swarm, denoted pbest; the other is the global optimum over the whole search space of the swarm, denoted gbest. A particle's velocity and position are continually updated according to these two values; every iteration performs an update according to the following two formulas:
v_i(t+1) = w·v_i(t) + c1·r1·(pbest_i − x_i(t)) + c2·r2·(gbest − x_i(t))   (2)
x_i(t+1) = x_i(t) + v_i(t+1)   (3)
where i = {1, 2, …, SN}; the remaining parameters are explained in Table 1 below:
Table 1: Explanation of the PSO algorithm parameters
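A minimal sketch of the velocity/position update loop just described; the inertia weight, acceleration coefficients, search bounds, and test fitness function are illustrative assumptions, not values prescribed by the patent:

```python
import random

def pso(fitness, dim, sn=20, mcn=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximize `fitness`: v = w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x),
    then x = x + v, retaining pbest per particle and a global gbest."""
    rng = random.Random(seed)
    xs = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(sn)]
    vs = [[0.0] * dim for _ in range(sn)]
    pbest = [x[:] for x in xs]
    pfit = [fitness(x) for x in xs]
    g = max(range(sn), key=lambda i: pfit[i])
    gx, gf = pbest[g][:], pfit[g]           # global best position / fitness
    for _ in range(mcn):
        for i in range(sn):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (pbest[i][d] - xs[i][d])
                            + c2 * r2 * (gx[d] - xs[i][d]))
                xs[i][d] += vs[i][d]
            f = fitness(xs[i])
            if f > pfit[i]:                 # update individual best
                pbest[i], pfit[i] = xs[i][:], f
                if f > gf:                  # update global best
                    gx, gf = xs[i][:], f
    return gx, gf

# maximize a toy fitness whose optimum is at x = (1, 2)
best_x, best_f = pso(lambda x: -((x[0] - 1) ** 2 + (x[1] - 2) ** 2), dim=2)
```

In the patent's setting, `fitness` would be the AUC or F-Measure obtained by resampling and cross-validating with the parameters encoded in the particle.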
When particle swarm optimization is applied to the parameter optimization of a concrete application, or to the optimization of weight coefficients, it can usually obtain the global optimum and thereby improve the final experimental result. The pseudocode of the PSO algorithm is shown in Fig. 1.
By using particle swarm optimization, the sampling rates of boundary samples and safe samples during oversampling are optimized to obtain the optimal oversampling ratio; the features are optimized at the same time, selecting the most representative feature combination that simplifies computation and improves the classification result; and AUC/F-Mea is used as the fitness function of the algorithm, improving the final classifier. The representation of a particle in the PSO algorithm is shown in Fig. 2. Here, OsRate denotes the oversampling rate, OsRate_i being the oversampling rate of the i-th cluster (i = 1, 2, …, N), where N is the number of clusters formed by the improved DBSCAN algorithm; UsRate denotes the undersampling rate of the majority-class boundary samples; Fec denotes a feature, Fec_i being the i-th feature of a sample (i = 1, 2, …, M). The feature vector here is binary: a 1 means the feature at that position helps the classification result and should be kept, while a 0 means the feature contributes nothing to improving the classification result and can be removed to simplify training.
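The particle layout just described can be sketched as a simple decoder; the vector sizes and the 0.5 rounding threshold for the feature bits are illustrative assumptions:

```python
def decode_particle(particle, n_clusters, n_features):
    """Split a flat particle vector into (OsRate_1..N, UsRate, Fec_1..M),
    matching the representation OsRate_i, UsRate, Fec_j described above."""
    assert len(particle) == n_clusters + 1 + n_features
    os_rates = particle[:n_clusters]        # per-cluster oversampling rates
    us_rate = particle[n_clusters]          # majority-class undersampling rate
    fec = [1 if v >= 0.5 else 0             # binary feature mask: keep / drop
           for v in particle[n_clusters + 1:]]
    return os_rates, us_rate, fec

p = [2.0, 1.5, 3.0, 0.4, 0.9, 0.1, 0.7]
os_rates, us_rate, fec = decode_particle(p, n_clusters=3, n_features=3)
```

Each PSO particle thus carries the whole parameter set, so one fitness evaluation covers resampling rates and feature selection together.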
AUC (Area Under the Receiver Operating Characteristic Curve) is a visual method of evaluating classifier performance, represented in a two-dimensional coordinate system. The X-axis is the fraction of misclassified minority (positive) samples (FP_Rate); the Y-axis is the fraction of correctly classified minority (positive) samples (TP_Rate). After classifying a group of samples, a classifier produces points (FP_Rate, TP_Rate); adjusting the classifier's threshold produces multiple such points, which form the ROC curve. AUC is the area under this curve: the larger the AUC, the stronger the classifier's discriminative ability.
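A sketch of the AUC computation; instead of tracing the curve threshold by threshold, it uses the equivalent rank (Mann-Whitney) formulation, the probability that a randomly chosen positive scores above a randomly chosen negative. The scores and labels are illustrative:

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) score pairs the
    classifier ranks correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 1])
```

This pairwise formulation is convenient as a PSO fitness function because it needs only the scores, not an explicit curve.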
F-Measure is the measure most often applied in the evaluation of imbalanced-data classification, as shown in the following formula. F-Measure is a combination of recall, precision, and a balance factor; when both Recall and Precision take high values, F-Measure gives an ideal result.
F-Measure = (1 + β²) × Recall × Precision / (β² × Precision + Recall)   (4)
where β is the balance factor regulating recall against precision (β is usually set to 1).
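The combination of precision, recall, and the balance factor β can be sketched as follows; the confusion-matrix counts are illustrative:

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-Measure = (1 + beta^2) * P * R / (beta^2 * P + R), where
    P = precision = tp/(tp+fp) and R = recall = tp/(tp+fn)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_measure(tp=8, fp=2, fn=2)   # precision = recall = 0.8
```

With β = 1 this is the harmonic mean of precision and recall, which is why a high F-Measure requires both to be high at once.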
The standard DBSCAN algorithm clusters as follows: using a preset neighbourhood radius and density, find the points in the whole data set that satisfy the condition, called core points, and then expand each core point. Expansion means finding, starting from the core point, all other sample points density-connected to it: traverse all core points within the ε-neighbourhood of the core point (boundary points cannot be expanded), find the other minority-class sample points density-connected to it, determine whether each such point is itself a core point, and keep expanding from core points until no further density-connected, expandable core point can be found. Next, rescan the minority-class samples that have not been clustered, find in the remaining data another core point never assigned to any cluster, and expand it by the same method, until the whole data set contains no unclustered core point. When clustering ends, the minority-class sample points not assigned to any cluster are noise points.
The improved DBSCAN algorithm can handle within-class imbalance. Its basic idea is as follows. First, considering the non-uniform distribution density of the minority-class samples within the class, a set of density-based EPS values can be obtained. In a data set with within-class imbalance, each minority-class data sample lies at a different distance from the other minority-class samples, i.e. the distribution density differs from point to point. The distribution density is measured by the distance from any given minority-class sample to its K nearest minority-class sample points. Concretely: for an arbitrary minority-class sample point X_i, take its K nearest other minority-class sample points, compute the distance from X_i to each of these K samples, and average these distances. The mean distance so obtained gives the distribution density of X_i, and this mean-distance measure of distribution density can be computed for every minority-class sample point.
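The per-point mean-distance density measure can be sketched as follows; the toy points and K are illustrative:

```python
import math

def knn_mean_distances(points, k=3):
    """For each minority sample X_i, average the distances to its K
    nearest *other* minority samples -- the per-point density measure."""
    out = []
    for i, x in enumerate(points):
        dists = sorted(math.dist(x, y)
                       for j, y in enumerate(points) if j != i)
        out.append(sum(dists[:k]) / k)      # mean of the k smallest distances
    return out

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
dens = knn_mean_distances(pts, k=2)
```

A small mean distance indicates a dense region; clustering these values is what yields the density-based EPS thresholds described next.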
Then, the computed mean distances of all minority-class sample points are assembled into a distance vector array. Using these mean distances as a raw data set, the array is clustered by distance. After the distance array has been clustered into N clusters, all the distances within each cluster are summed and averaged, and the resulting mean is taken as the neighbourhood threshold of that cluster; computing this mean for each of the N clusters yields N neighbourhood thresholds EPS_i (i = 1, 2, …, N). Next, the N thresholds are sorted in ascending order and saved in an array, to be used one after another as the EPS parameter of the improved DBSCAN algorithm in the next step.
In the subsequent clustering, the smallest value in the threshold array is first taken as the EPS value of the DBSCAN algorithm (MinPts is specified manually and kept constant during training), and all minority-class samples are clustered, yielding the several minority-class clusters that satisfy this density; the minority-class samples not satisfying the condition are marked as noise samples. The next threshold in the array is then used to run DBSCAN again on the minority-class samples marked as noise points, likewise producing some clusters and remaining noise points.
Finally, the above operation is repeated, applying the different thresholds in the array from small to large to the minority-class samples still marked as noise points. Once all minority-class samples have been clustered with the different EPS values, all clustering of the minority class is complete; any data still not assigned to a cluster at the end is noise data.
The improved DBSCAN algorithm not only produces the minority-class clusters in which oversampling is subsequently carried out, but also adequately resolves within-class imbalance, uneven distribution, and the problem of data fragmentation or small disjuncts.
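The small-to-large EPS cascade can be sketched as follows, assuming a plain DBSCAN helper; the toy points, EPS list, and MinPts are illustrative, and a real implementation would use an indexed neighbour search:

```python
import math

def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point (cluster id, or -1 = noise)."""
    labels = [None] * len(points)
    def neighbours(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1                  # provisionally noise
            continue
        labels[i] = cid
        queue = list(nbrs)
        while queue:                        # expand density-connected points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid             # border point joins, no expansion
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = neighbours(j)
            if len(jn) >= min_pts:          # j is a core point: keep expanding
                queue.extend(jn)
        cid += 1
    return labels

def multi_eps_dbscan(points, eps_list, min_pts=2):
    """Improved-DBSCAN cascade: apply EPS thresholds from small to large,
    re-clustering only the samples still marked as noise."""
    final = [-1] * len(points)
    remaining = list(range(len(points)))
    next_cid = 0
    for eps in sorted(eps_list):
        sub = [points[i] for i in remaining]
        labels = dbscan(sub, eps, min_pts)
        still_noise = []
        for idx, lab in zip(remaining, labels):
            if lab == -1:
                still_noise.append(idx)     # try again with a larger EPS
            else:
                final[idx] = next_cid + lab
        next_cid += max(labels, default=-1) + 1
        remaining = still_noise
    return final

pts = [(0, 0), (0.1, 0), (0.2, 0), (5, 5), (5, 6), (5, 7)]
labels = multi_eps_dbscan(pts, eps_list=[0.15, 1.1], min_pts=2)
```

The dense group is captured at the small EPS and the sparse group at the larger one, which is how the cascade copes with within-class density differences.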
The pseudocode of the PSO optimization of the oversampling rates and of the feature selection in the imbalanced-classification problem is shown in Fig. 3, where MCN is the maximum number of cycles and NumFolds is the parameter K of the K-fold cross-validation.
Particle swarm optimization mainly optimizes continuous values, whereas the feature vector here is discrete. So that PSO can likewise process the feature vector, a sigmoid function is used: the generated continuous velocity v is converted to a 0/1 discrete value via the following formulas (5) and (6), making PSO applicable to the selection of the feature set as well.
s(v_ij) = 1 / (1 + e^(−v_ij))   (5)
x_ij = 1 if rand(0, 1) < s(v_ij), otherwise x_ij = 0   (6)
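The sigmoid conversion of formulas (5) and (6) corresponds to the standard binary-PSO rule, sketched here; the sample velocities are illustrative:

```python
import math
import random

def binarize_velocity(v, rng=random):
    """Binary-PSO rule: s(v) = 1 / (1 + e^(-v)); the position bit is 1
    with probability s(v), otherwise 0."""
    s = 1.0 / (1.0 + math.exp(-v))
    return 1 if rng.random() < s else 0

random.seed(0)
bits = [binarize_velocity(v) for v in (-10.0, 0.0, 10.0)]
```

A strongly negative velocity almost surely yields 0 (drop the feature) and a strongly positive one almost surely yields 1 (keep it), so the continuous PSO dynamics still steer the discrete feature mask.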
Once the PSO algorithm has determined the boundary-sample and safe-sample sampling rates of each cluster and the features that improve the classification result, feature extraction selects the most representative data features and sample set; the minority-class samples are then oversampled with the oversampling rates obtained by the optimization, finally yielding the desired balanced data set.
The base classifier of AdaBoost can be any single classifier; nowadays common machine-learning algorithms such as decision trees, neural networks, and SVMs are typically used as base classifiers. Most ensemble methods either use the same single base learning algorithm to construct a homogeneous ensemble-learning framework, or use different base classifiers to construct a heterogeneous ensemble-learning framework.
The idea of AdaBoost is to adjust and normalize the sample weights according to the classification result of the current iteration's classifier, ensuring that in the next iteration the weak classifier concentrates more effort on the samples misclassified by the previous classifier; the sample-weight update policy is therefore one of the factors affecting classification performance. At the same time, the weight coefficients of the base classifiers are another factor affecting AdaBoost's final classification, so how to set the weight coefficients of the base classifiers is also one of the main research topics of the present invention.
The voting weight coefficients of the base classifiers in AdaBoost are determined according to the same idea: the set of base-classifier coefficients is treated as a "particle" in the PSO optimization algorithm, and after the optimization the optimal combination of voting weight coefficients for the base classifiers in AdaBoost is obtained.
The above is a further detailed description of the present invention in combination with specific preferred embodiments, and the concrete implementation of the invention shall not be deemed limited to these descriptions. For those of ordinary skill in the art to which the invention belongs, simple deductions or substitutions made without departing from the concept of the invention shall all be deemed to fall within the protection scope of the invention.
Claims (4)
1. An imbalanced-sample classification method based on the PSO algorithm, characterized in that a particle in the PSO algorithm is represented as: OsRate_i, UsRate, and Fec_j, where OsRate_i is the oversampling rate of the i-th cluster, i = 1, 2, …, N, N being the number of clusters formed by the DBSCAN algorithm, and Fec_j is the j-th feature of a sample, j = 1, 2, …, M; the method comprising: (1) dividing the data set into a training set Train and a test set Test; (2) generating initial solutions x_i, i = 1, 2, …, SN, in the search space, where SN is the population size; (3) setting the global optimum gbest = 0; (4) executing steps (5) and (6) for the maximum number of cycles MCN; (5) for j from 1 to SN, obtaining the current solution x_j, reselecting the features according to that solution to generate a new data set, and obtaining the corresponding AUC or F-Measure through k-fold cross-validation and classifier learning; (6) obtaining each particle's pbest according to the velocity and position update formulas of the PSO algorithm, while updating the global optimum gbest; (7) combining the resulting oversampling rates, undersampling rate, and features, performing mixed sampling, building a classifier, training it on the data set, and obtaining the final AUC.
2. The imbalanced-sample classification method according to claim 1, characterized in that: the DBSCAN algorithm is an improved DBSCAN algorithm capable of handling within-class imbalance, whose basic idea is: first, considering the non-uniform distribution density of the minority-class samples within the class, obtaining a set of density-based EPS values; then assembling the computed mean distances of the minority-class sample points into a distance vector array, and, using these mean distances as a raw data set, clustering this array by distance; after the distance array has been clustered into N clusters, summing and averaging all the distances within each cluster, taking the resulting mean as the neighbourhood threshold of that cluster, and by computing the mean of each of the N clusters obtaining N neighbourhood thresholds EPS_i, i = 1, 2, …, N; next, sorting the N thresholds in ascending order and saving them in an array; in the subsequent clustering, first taking the smallest value in the threshold array as the EPS value of the DBSCAN algorithm and clustering all minority-class samples, then using the next threshold in the array to run DBSCAN again on the minority-class samples marked as noise points, likewise obtaining some clusters and remaining noise points; finally, repeating the above operation, and once all minority-class samples have been clustered with the different EPS values, all clustering of the minority class is complete, any data still not assigned to a cluster at the end being noise data.
3. The imbalanced-sample classification method according to claim 1, characterized in that: the PSO algorithm mainly optimizes continuous values, while the feature vector here is discrete; to enable the PSO algorithm to process the discrete feature vector, a sigmoid function is used to convert the generated continuous velocity to a 0/1 discrete value.
The imbalanced data classification method according to claim 1, characterized in that: the method uses the PSO algorithm to determine the boundary-sample sampling rate and safe-sample sampling rate of each clustering cluster, together with the features that improve classification performance; the most representative data features and data sample set are then selected by feature extraction, the minority class samples are oversampled at the oversampling rate obtained from the optimization, and the desired balanced data set is finally obtained.
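The final per-cluster oversampling step might look like the following sketch. It is a simplification under stated assumptions: the per-cluster rates are assumed to come from the PSO search, and where the claim distinguishes boundary-sample and safe-sample rates, this sketch uses a single rate per cluster and plain random duplication rather than any specific synthetic-sample scheme.

```python
import numpy as np

def oversample_clusters(X_min, labels, rates, rng):
    """Oversample minority samples cluster by cluster.
    rates[c] = number of extra copies to draw per original sample in cluster c."""
    parts = [X_min]                           # keep all original minority samples
    for c, rate in rates.items():
        members = X_min[labels == c]
        n_new = int(len(members) * rate)
        if n_new > 0:
            picks = rng.integers(0, len(members), n_new)
            parts.append(members[picks])      # duplicate randomly chosen members
    return np.vstack(parts)

rng = np.random.default_rng(0)
X_min = np.arange(10, dtype=float).reshape(5, 2)   # 5 minority samples
labels = np.array([0, 0, 0, 1, 1])                 # two clusters
balanced = oversample_clusters(X_min, labels, {0: 1.0, 1: 2.0}, rng)
# cluster 0 gains 3 copies, cluster 1 gains 4, so 5 + 7 = 12 rows total
```

Assigning each cluster its own rate lets the optimizer oversample sparse or boundary regions of the minority class more aggressively than dense, safe regions.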
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610172812.3A CN105868775A (en) | 2016-03-23 | 2016-03-23 | Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105868775A true CN105868775A (en) | 2016-08-17 |
Family
ID=56624706
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610172812.3A Pending CN105868775A (en) | 2016-03-23 | 2016-03-23 | Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105868775A (en) |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108154163A (en) * | 2016-12-06 | 2018-06-12 | 北京京东尚科信息技术有限公司 | Data processing method, data identification and learning method and its device |
CN108154163B (en) * | 2016-12-06 | 2020-11-24 | 北京京东尚科信息技术有限公司 | Data processing method, data recognition and learning method, apparatus thereof, and computer readable medium |
CN106951665A (en) * | 2017-04-28 | 2017-07-14 | 成都理工大学 | Swarm optimization method and device based on crossover operator |
CN107145659A (en) * | 2017-04-28 | 2017-09-08 | 成都理工大学 | A kind of predictor selection method and device for being used to evaluate risk of landslip |
WO2019033636A1 (en) * | 2017-08-16 | 2019-02-21 | 哈尔滨工业大学深圳研究生院 | Method of using minimized-loss learning to classify imbalanced samples |
CN107578061A (en) * | 2017-08-16 | 2018-01-12 | 哈尔滨工业大学深圳研究生院 | Based on the imbalanced data classification issue method for minimizing loss study |
CN107563435A (en) * | 2017-08-30 | 2018-01-09 | 哈尔滨工业大学深圳研究生院 | Higher-dimension unbalanced data sorting technique based on SVM |
WO2019041629A1 (en) * | 2017-08-30 | 2019-03-07 | 哈尔滨工业大学深圳研究生院 | Method for classifying high-dimensional imbalanced data based on svm |
CN109409433A (en) * | 2018-10-31 | 2019-03-01 | 北京邮电大学 | A kind of the personality identifying system and method for social network user |
CN109447158A (en) * | 2018-10-31 | 2019-03-08 | 中国石油大学(华东) | A kind of Adaboost Favorable Reservoir development area prediction technique based on unbalanced data |
CN109858564A (en) * | 2019-02-21 | 2019-06-07 | 上海电力学院 | Modified Adaboost-SVM model generating method suitable for wind electric converter fault diagnosis |
CN109858564B (en) * | 2019-02-21 | 2023-05-05 | 上海电力学院 | Improved Adaboost-SVM model generation method suitable for wind power converter fault diagnosis |
CN109887005A (en) * | 2019-02-26 | 2019-06-14 | 华北理工大学 | The TLD target tracking algorism of view-based access control model attention mechanism |
CN109887005B (en) * | 2019-02-26 | 2023-05-30 | 天津城建大学 | TLD target tracking method based on visual attention mechanism |
CN109948732A (en) * | 2019-03-29 | 2019-06-28 | 济南大学 | Abnormal cell DISTANT METASTASES IN classification method and system based on non-equilibrium study |
CN109948732B (en) * | 2019-03-29 | 2020-12-22 | 济南大学 | Abnormal cell distant metastasis classification method and system based on unbalanced learning |
CN111710427A (en) * | 2020-06-17 | 2020-09-25 | 广州市金域转化医学研究院有限公司 | Cervical precancerous early lesion stage diagnosis model and establishment method |
CN112022149B (en) * | 2020-09-04 | 2022-10-04 | 无锡博智芯科技有限公司 | Atrial fibrillation detection method based on electrocardiosignals |
CN112022149A (en) * | 2020-09-04 | 2020-12-04 | 无锡博智芯科技有限公司 | Atrial fibrillation detection method based on electrocardiosignals |
CN112770256B (en) * | 2021-01-06 | 2022-09-09 | 重庆邮电大学 | Node track prediction method in unmanned aerial vehicle self-organizing network |
CN112770256A (en) * | 2021-01-06 | 2021-05-07 | 重庆邮电大学 | Node track prediction method in unmanned aerial vehicle self-organizing network |
CN113628701A (en) * | 2021-08-12 | 2021-11-09 | 上海大学 | Material performance prediction method and system based on density unbalance sample data |
CN113628701B (en) * | 2021-08-12 | 2024-04-26 | 上海大学 | Material performance prediction method and system based on density imbalance sample data |
CN115905894A (en) * | 2023-01-10 | 2023-04-04 | 佰聆数据股份有限公司 | Equipment residual period analysis method and device based on small sample unbalanced data |
CN116108387A (en) * | 2023-04-14 | 2023-05-12 | 湖南工商大学 | Unbalanced data oversampling method and related equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105868775A (en) | Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm | |
Alswaitti et al. | Density-based particle swarm optimization algorithm for data clustering | |
Karaboga et al. | Fuzzy clustering with artificial bee colony algorithm | |
Lan et al. | A two-phase learning-based swarm optimizer for large-scale optimization | |
CN106682682A (en) | Method for optimizing support vector machine based on Particle Swarm Optimization | |
CN106202952A (en) | A kind of Parkinson disease diagnostic method based on machine learning | |
Örkcü et al. | Estimating the parameters of 3-p Weibull distribution using particle swarm optimization: A comprehensive experimental comparison | |
CN110245252A (en) | Machine learning model automatic generation method based on genetic algorithm | |
CN106604229A (en) | Indoor positioning method based on manifold learning and improved support vector machine | |
CN111986811A (en) | Disease prediction system based on big data | |
CN107992895A (en) | A kind of Boosting support vector machines learning method | |
CN109086412A (en) | A kind of unbalanced data classification method based on adaptive weighted Bagging-GBDT | |
CN110287985B (en) | Depth neural network image identification method based on variable topology structure with variation particle swarm optimization | |
Sasmal et al. | A comprehensive survey on aquila optimizer | |
CN110070116A (en) | Segmented based on the tree-shaped Training strategy of depth selects integrated image classification method | |
CN104091038A (en) | Method for weighting multiple example studying features based on master space classifying criterion | |
CN109492748A (en) | A kind of Mid-long term load forecasting method for establishing model of the electric system based on convolutional neural networks | |
Fei et al. | Research on data mining algorithm based on neural network and particle swarm optimization | |
CN110598836B (en) | Metabolic analysis method based on improved particle swarm optimization algorithm | |
Naik et al. | A global-best harmony search based gradient descent learning FLANN (GbHS-GDL-FLANN) for data classification | |
Ding et al. | An improved SFLA-kmeans algorithm based on approximate backbone and its application in retinal fundus image | |
CN105550711A (en) | Firefly algorithm based selective ensemble learning method | |
CN106056167A (en) | Normalization possibilistic fuzzy entropy clustering method based on Gaussian kernel hybrid artificial bee colony algorithm | |
CN105913085A (en) | Tensor model-based multi-source data classification optimizing method and system | |
CN108549936A (en) | The Enhancement Method that self organizing neural network topology based on deep learning is kept |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20160817 |