CN105868775A - Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm - Google Patents

Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm

Info

Publication number
CN105868775A
Authority
CN
China
Prior art keywords
sample
algorithm
minority class
pso
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610172812.3A
Other languages
Chinese (zh)
Inventor
张春慨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yitong Technology Co Ltd
Original Assignee
Shenzhen Yitong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Yitong Technology Co Ltd filed Critical Shenzhen Yitong Technology Co Ltd
Priority to CN201610172812.3A priority Critical patent/CN105868775A/en
Publication of CN105868775A publication Critical patent/CN105868775A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an imbalanced sample classification method based on the PSO (Particle Swarm Optimization) algorithm. The PSO algorithm is used to optimize the sampling rates of boundary samples and safe samples during oversampling so as to obtain the optimal oversampling rate, and the features are optimized at the same time so as to select the most representative feature combination that simplifies computation and improves the classification result. The method uses AUC/F-Measure as the fitness function of the algorithm, thereby improving the performance of the final classifier.

Description

Imbalanced sample classification method based on the PSO algorithm
Technical field
The invention belongs to the field of classification and optimization in data mining, and in particular relates to a classification method suitable for imbalanced samples.
Background art
In imbalanced classification problems, an imbalanced data set is one in which, over the whole sample space, the number of samples of one class differs greatly from the number of samples of the remaining class or classes. In such cases it is usually the minority class that requires more attention. In medical diagnosis applications, for example, data sets for cancer or heart disease are imbalanced samples; the objects of interest in such data are the diseased samples, and only by classifying the attributes of these samples accurately can a patient's condition be diagnosed correctly and timely, targeted treatment be given.
When handling imbalanced classification problems, traditional classifiers pursue a high overall classification accuracy and therefore often simply classify minority-class samples as majority-class samples. This practice can label a potential patient's information as healthy, so that the patient may miss the best period for treatment, causing irreparable harm.
In imbalanced classification problems, many parameters affect the final classification result to some degree, and different parameter combinations can lead to very different results. These parameters are usually specified manually, and there is no index for verifying their effectiveness; they can only be judged by the final classification result. Parameters that are tuned by repeated manual adjustment usually converge to only a local optimum, and a global optimum is rarely obtained.
In the prior art, imbalanced classification problems are handled either by data-resampling-based methods or by algorithm-improvement-based methods.
Data-resampling-based methods: the imbalance of a sample set is mainly reflected in the imbalance of the numbers of samples of the two classes, which makes classification difficult for traditional classifiers. The first idea is therefore to work at the data level and change the number of original data samples so that the sample set becomes roughly balanced, thereby removing the classification difficulty. At the data level, scholars at home and abroad have mainly proposed the ideas of oversampling, undersampling and hybrid sampling.
Oversampling, as the term suggests, adds minority-class samples by some method so that the originally imbalanced data set becomes roughly balanced. If the oversampling method is inappropriate, however, overfitting may result.
The SMOTE algorithm is currently the most widely used oversampling technique. Instead of simply oversampling by duplicating neighbouring samples, it inserts synthetic copies of minority-class samples between existing minority-class samples. The technique was mainly inspired by an algorithm proposed in a handwriting-recognition project. The SMOTE oversampling procedure is as follows: for each minority-class sample, select the K other minority-class sample points nearest to it, then generate a number of synthetic minority-class sample points by linear interpolation between this point and those K neighbours, using a random interpolation factor together with the oversampling rate. The SMOTE interpolation formula is shown below:
p_i = x + rand(0,1) × (y_i − x),  i = 1, 2, …, N    (1)
The SMOTE algorithm mainly interpolates between similar (that is, neighbouring) minority-class samples, so the synthetic samples it generates are reasonably representative. The overfitting problem is therefore avoided in the SMOTE algorithm, and the decision space of the minority class is extended; likewise, the method can also be applied to the majority-class sample space to shrink the decision space of the majority class.
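For illustration only, the following minimal Python sketch shows how formula (1) generates synthetic minority-class samples by interpolating between a sample and its K nearest minority-class neighbours; the function name `smote_oversample` and the default values of `os_rate` and `k` are assumptions, not part of the patent.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, os_rate=1.0, k=5, rng=None):
    """Generate synthetic minority samples with formula (1):
    p = x + rand(0,1) * (y - x), y drawn from x's k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    n_new = int(os_rate * len(X_min))            # number of synthetic points
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)                # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))             # pick a minority sample x
        j = rng.choice(idx[i, 1:])               # pick one of its k neighbours y
        gap = rng.random()                       # rand(0,1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic) if synthetic else np.empty((0, X_min.shape[1]))
```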
In 2013 Lou Xiaojun et al. proposed a clustering-based oversampling method that takes boundary-sample information into account. The method determines the boundary samples of the minority and majority classes by one-sided selection, clusters all minority-class samples into clusters that include the boundary region, and performs an appropriate amount of oversampling within each cluster. The synthetic minority-class samples produced in this way are more similar to the original samples and better represent the distribution of the parent sample space. At the same time, by oversampling the boundary samples in a targeted manner, the definition of the boundary is sharpened and the importance of the minority-class boundary samples is emphasised.
The idea of undersampling is to remove, by some method and according to certain rules, particular samples from the majority class so that the sample space becomes roughly balanced. Undersampling also has drawbacks: it can easily cause the loss of important information carried by representative samples.
Hybrid sampling fuses an oversampling algorithm with an undersampling algorithm and applies both when resampling the original data set to balance it. A large body of research shows that hybrid sampling is superior to a single resampling method, and that oversampling first and undersampling afterwards generally gives better results.
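For orientation only, a hybrid-sampling step of this kind can be sketched with the imbalanced-learn library; the sampling ratios 0.5 and 0.8 below are placeholder assumptions, and the patent's own method instead optimizes these rates with PSO.

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

def hybrid_resample(X, y, os_ratio=0.5, us_ratio=0.8, k=5, seed=0):
    # Step 1: oversample the minority class up to os_ratio * (majority size).
    X_os, y_os = SMOTE(sampling_strategy=os_ratio, k_neighbors=k,
                       random_state=seed).fit_resample(X, y)
    # Step 2: undersample the majority class so that minority/majority = us_ratio.
    X_bal, y_bal = RandomUnderSampler(sampling_strategy=us_ratio,
                                      random_state=seed).fit_resample(X_os, y_os)
    return X_bal, y_bal
```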
Algorithm-improvement-based methods: in imbalanced classification problems, resampling at the data level can eliminate the influence of imbalance to some extent and improve the classification of the minority class, but the essence of data-level methods is to change the distribution of the data set, and this change may affect the authenticity of the classification result. Improvements at the algorithm level avoid this drawback of the data level and can also improve minority-class classification to some extent. According to the research of scholars at home and abroad in recent years, work at the algorithm level mainly involves cost-sensitive learning, ensemble learning and one-class learning.
Ensemble learning is a widely used algorithmic framework for imbalanced classification. Boosting is one of the most classical ensemble methods, and AdaBoost is the most frequently used algorithm of the Boosting family. AdaBoost iteratively trains different weak classifiers and combines them into a strong classifier, updating the sample weights used for the next round of training according to the classification result of each iteration. From the training process of AdaBoost its main advantages can be seen: high classification accuracy, sub-classifiers that can be adjusted as required, classification results that are easy to understand, and no overfitting.
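For context, a standard AdaBoost classifier can be trained as follows with scikit-learn; the synthetic data set here is purely illustrative and not part of the patent.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Toy imbalanced data set: roughly 90% majority class, 10% minority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# AdaBoost iteratively reweights samples and combines the weak learners by weighted vote.
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```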
When balancing an imbalanced sample space at the data level, some parameters must be specified manually. In the Borderline algorithm, for example, the parameter K is needed: the K-nearest-neighbour method finds the K nearest samples of each minority-class sample in the whole data set S and stores them in the set KNNsmin corresponding to that Smin sample. Likewise, when oversampling with the SMOTE algorithm, the oversampling rate (OsRate) is also specified manually, and different oversampling rates have a large influence on the final classification result. When hybrid sampling is used, the undersampling rate of the majority-class sample points (UsRate) raises the same parameter problem.
At the algorithm level, some parameters must also be specified manually. In AdaBoost, an algorithm of the Boosting family, every base classifier has a voting weight, and this weight coefficient is determined from the classification error rate of that base classifier. The weight coefficients of the base classifiers, however, influence each other; determining them simply from the classification error seldom yields the optimal final classification result.
Summary of the invention
To solve the problems of the prior art, the invention provides an imbalanced sample classification method based on the PSO algorithm. The PSO (Particle Swarm Optimization) algorithm is used for adaptive parameter optimization, so that a series of optimal coefficient combinations is obtained and the final classification result is greatly improved.
The present invention is achieved through the following technical solution:
An imbalanced sample classification method based on the PSO algorithm, in which a particle of the PSO algorithm is represented as OsRate_i, UsRate and Fec_j, where OsRate_i is the oversampling rate of the i-th cluster, i = 1, 2, …, N, N is the number of clusters formed by the DBSCAN algorithm, and Fec_j is the j-th feature of a sample, j = 1, 2, …, M. The method includes: (1) dividing the data set into a training set Train and a test set Test; (2) generating initial solutions x_i, i = 1, 2, …, SN, in the search space, where SN is the population size; (3) setting the global optimum gbest = 0; (4) executing steps (5) and (6) for MCN (maximum number of cycles) iterations; (5) for j from 1 to SN, obtaining the current solution x_j; according to the obtained solution, reselecting the features to generate a new data set, and learning through k-fold cross-validation and a classifier to obtain the corresponding AUC or F-Measure; (6) obtaining each particle's pbest according to the velocity and position update formulas of the PSO algorithm, and updating the global optimum gbest at the same time; (7) combining the obtained oversampling rates, undersampling rate and features, performing hybrid sampling to build the training data set, then training with the classifier to obtain the final AUC.
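A minimal sketch of the fitness evaluation in steps (5)–(6), assuming a particle is laid out as [OsRate_1 … OsRate_N, UsRate, Fec_1 … Fec_M]; the helper `resample_with_rates` is a hypothetical stand-in for the hybrid-sampling step, the decision tree is only an illustrative classifier, and the simple 0.5 threshold on Fec_j is a simplification of the sigmoid rule described later.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

def fitness(particle, X, y, n_clusters, resample_with_rates, k_folds=5):
    """Decode one particle and return its AUC fitness (steps (5)-(6))."""
    os_rates = particle[:n_clusters]              # OsRate_i per minority cluster
    us_rate = particle[n_clusters]                # UsRate for the majority boundary samples
    feat_mask = particle[n_clusters + 1:] > 0.5   # Fec_j -> 0/1 feature selection (simplified)
    if not feat_mask.any():                       # keep at least one feature
        feat_mask[0] = True
    aucs = []
    cv = StratifiedKFold(n_splits=k_folds, shuffle=True, random_state=0)
    for tr, te in cv.split(X, y):
        # Hybrid sampling is applied to the training fold only, with the decoded rates.
        X_tr, y_tr = resample_with_rates(X[tr][:, feat_mask], y[tr], os_rates, us_rate)
        clf = DecisionTreeClassifier().fit(X_tr, y_tr)
        score = clf.predict_proba(X[te][:, feat_mask])[:, 1]
        aucs.append(roc_auc_score(y[te], score))
    return float(np.mean(aucs))                   # AUC used as the fitness value
```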
The beneficial effects of the invention are as follows: by using the particle swarm optimization algorithm, the sampling rates of boundary samples and safe samples during oversampling are optimized to obtain the optimal oversampling rate; the features are optimized at the same time, so that the most representative feature combination that simplifies computation and improves the classification result is selected; AUC/F-Measure is used as the fitness function of the algorithm, thereby improving the performance of the final classifier.
Brief description of the drawings
Fig. 1 is the pseudo-code of the PSO algorithm;
Fig. 2 is a schematic diagram of the representation of a particle in the PSO algorithm;
Fig. 3 is the pseudo-code of the oversampling-rate optimization by the PSO algorithm of the present invention.
Detailed description of the invention
The present invention is further described below with reference to the accompanying drawings and specific embodiments.
The Particle Swarm Optimization (PSO) algorithm was proposed by Kennedy and Dr. Eberhart in 1995. It is an evolutionary algorithm inspired by imitating the group behaviour of organisms in nature, such as insect swarms, herds and flocks of birds. These colonial organisms search for food, mates and other things in a way they themselves understand; each member of the group can change its own behaviour pattern, or that of the group, by learning from its own experience and from the experience of other group members, finally completing the global search of the group. A real-life example illustrates the process of the PSO algorithm: suppose a flock of birds searches aimlessly for food in a given region; the birds know nothing about the exact location of the food and do not know in which direction to fly to approach it. How can the birds find the food along an optimal search route? The correct way is to obtain information from the other particles (birds) that are closer to the food, and then move toward the region near the bird closest to the food and search there.
In computer science, PSO is a computational method that tries to obtain an optimal candidate solution by iteratively optimizing a problem with respect to a specific fitness function. In PSO, a "particle" represents an individual in the whole collective search space. Taking the flock-of-birds example above, the optimization process of PSO is as follows: initialise the positions and flight velocities of the birds, where the velocity encodes the trajectory a particle will follow in the next search step, then search for the optimal solution in the current region. In each update of the PSO algorithm two important values are retained: one is the individual best solution of each particle in the swarm, denoted pbest; the other is the global best solution, which represents the best solution found in the whole search space of the swarm, denoted gbest. The velocity and position of each particle are continually updated according to these two values, and every iteration is updated according to the following two formulas:
v_i^(t+1) = w × v_i^t + c1 × r1 × (pbest_i^t − x_i^t) + c2 × r2 × (gbest^t − x_i^t)    (2)
x_i^(t+1) = x_i^t + v_i^(t+1)    (3)
where i = {1, 2, …, SN}; the other parameters are explained in Table 1 below:
Table 1 Explanation of the PSO algorithm parameters
w: inertia weight
c1, c2: learning (acceleration) factors
r1, r2: random numbers uniformly distributed in [0, 1]
v_i^t, x_i^t: velocity and position of particle i at iteration t
pbest: individual best solution of a particle
gbest: global best solution of the whole swarm
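A minimal NumPy sketch of the update rules (2) and (3); the default values of w, c1 and c2 are common choices and are assumptions here, not values specified by the patent.

```python
import numpy as np

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=None):
    """One velocity/position update per formulas (2) and (3).
    x, v, pbest: arrays of shape (SN, D); gbest: array of shape (D,)."""
    rng = np.random.default_rng(rng)
    r1 = rng.random(x.shape)        # r1, r2 ~ U(0, 1)
    r2 = rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # formula (2)
    x = x + v                                                    # formula (3)
    return x, v
```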
When the particle swarm optimization algorithm is applied to the optimization of parameters or weight coefficients in a concrete application, a globally optimal solution can generally be obtained, thereby improving the final experimental results. The pseudo-code of the PSO algorithm is shown in Fig. 1.
By using the particle swarm optimization algorithm, the sampling rates of boundary samples and safe samples during oversampling are optimized to obtain the optimal oversampling rate; the features are optimized at the same time, so that the most representative feature combination that simplifies computation and improves the classification result is selected; AUC/F-Measure is used as the fitness function of the algorithm, thereby improving the performance of the final classifier. The representation of a particle in the particle swarm optimization algorithm is shown in Fig. 2. Here OsRate denotes the oversampling rate, OsRate_i being the oversampling rate of the i-th cluster (i = 1, 2, …, N), where N is the number of clusters formed by the improved DBSCAN algorithm; UsRate denotes the undersampling rate of the majority-class boundary samples; Fec denotes a feature, Fec_i being the i-th feature of a sample (i = 1, 2, …, M). The feature vector here is represented in binary form: a 1 means the feature at that position is helpful to the classification result and needs to be retained, while a 0 means the feature does not help improve the classification result and can be removed to simplify the training process.
AUC (Area under the Receiver Operating Characteristic Curve) is a visualised evaluation method for classifier performance, represented in a two-dimensional coordinate system. The X-axis is the proportion of misclassified minority (positive) samples (FP_Rate) and the Y-axis is the proportion of correctly classified minority (positive) samples (TP_Rate). After classifying a group of sample data, a classifier produces points (FP_Rate, TP_Rate); adjusting the classifier's threshold produces multiple points that form the ROC curve, and the AUC is the area enclosed below the curve (toward the lower-right corner). The larger the AUC, the stronger the recognition capability of the classifier.
F-Measure is the metric most frequently used in evaluating imbalanced classification, given by the formula below. F-Measure is a compound of the recall, the precision and a balance factor; when both Recall and Precision reach high values, the F-Measure also reaches an ideal result.
F-Measure = (1 + β²) × Recall × Precision / (β² × Recall + Precision)    (4)
where β is the balance factor regulating recall and precision (β is usually set to 1).
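For reference, the two fitness measures can be computed with scikit-learn; β = 1 here, matching the usual setting for formula (4), and the variable names are illustrative.

```python
from sklearn.metrics import roc_auc_score, fbeta_score

def evaluate(y_true, y_score, y_pred, beta=1.0):
    """y_score: predicted positive-class probabilities; y_pred: hard 0/1 predictions."""
    auc = roc_auc_score(y_true, y_score)                              # area under the ROC curve
    f_measure = fbeta_score(y_true, y_pred, beta=beta, pos_label=1)   # formula (4)
    return auc, f_measure
```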
The clustering process of the standard DBSCAN algorithm is as follows: according to a preset neighbourhood radius and density, find in the whole data set the points that satisfy the condition, called core points, and then expand from these core points. The expansion method is to find, starting from each core point, the other sample points that are density-connected to it: traverse all core points in the ε-neighbourhood of the core point (boundary points cannot be expanded), find the other minority-class sample points density-connected to this core point, judge whether each such point is itself a core point, and continue expanding from it until no further density-connected core point that can be expanded is found. Next, rescan the minority-class samples that have not yet been clustered, find among the remaining data the other core points that have not been assigned to any cluster, and expand them with the same method until no core points remain unclustered in the whole data set. After clustering finishes, the minority-class sample points that do not belong to any cluster are noise points.
The improved DBSCAN algorithm can handle imbalance within a class; its basic idea is as follows:
First, considering that the distribution density of the minority-class samples within a class is uneven, a group of EPS values based on the distribution density can be obtained. In a data set with within-class imbalance, each minority-class sample point lies at a different distance from the other minority-class sample points, i.e. the distribution densities differ. The distribution density is measured by the distances from any minority-class sample to its K nearest minority-class sample points. The concrete method is as follows: for an arbitrary minority-class sample point X_i, find its K nearest minority-class sample points, compute the distance from X_i to each of these K minority-class samples, and then average these distances. From the average distance obtained, the distribution density of the minority-class sample point X_i is obtained; this average-distance measure of distribution density can be computed for every minority-class sample point.
Then the computed average distances of all minority-class sample points are assembled into a distance vector array. Taking these average distances as a raw data set, this data set is clustered by distance. After the distance array has been clustered into N clusters, all distances within each cluster are summed and averaged, and the resulting mean value is taken as the neighbourhood threshold of that cluster; by computing this mean value for each of the N clusters, N neighbourhood thresholds EPS_i (i = 1, 2, …, N) are obtained.
Next, these N neighbourhood thresholds are sorted from small to large and the sorted thresholds are saved in an array, ready for the successive determination of the EPS parameter of the improved DBSCAN algorithm in the next step.
In the subsequent clustering, the smallest value in the threshold array is first selected as the EPS value of the DBSCAN algorithm (MinPts can be specified manually and remains constant during training), and all minority-class samples are clustered, yielding several minority-class clusters that satisfy this density; the minority-class samples that do not satisfy the condition are labelled as noise samples. The next threshold in the array is then used to run DBSCAN again on the minority-class samples labelled as noise points, likewise yielding some clusters and the remaining noise sample points.
Finally, the above operation is repeated, using the thresholds in the array from small to large to cluster the minority-class samples labelled as noise points with DBSCAN. When all minority-class samples have been clustered with the different EPS values, all clustering operations on the minority-class samples are complete; the data that are still not assigned to any cluster at the end are noise data.
With the improved DBSCAN algorithm, not only can the clusters of the minority class be produced so that oversampling can then be performed within these clusters, but the problems of uneven within-class distribution and of data fragmentation (small disjuncts) are also fully addressed.
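A compressed sketch of this multi-threshold procedure, assuming scikit-learn's NearestNeighbors, KMeans and DBSCAN as building blocks; the values of K, N and MinPts are illustrative, not those of the patent.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans, DBSCAN

def improved_dbscan(X_min, k=5, n_eps=3, min_pts=5):
    # 1) Average distance of each minority sample to its k nearest minority neighbours.
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(X_min)
    avg_dist = dist[:, 1:].mean(axis=1)
    # 2) Cluster the average distances and take each cluster mean as a threshold EPS_i.
    km = KMeans(n_clusters=n_eps, n_init=10, random_state=0).fit(avg_dist.reshape(-1, 1))
    eps_list = sorted(avg_dist[km.labels_ == c].mean() for c in range(n_eps))
    # 3) Apply DBSCAN with increasing EPS; only still-unclustered ("noise") points move on.
    labels = np.full(len(X_min), -1)
    next_id, remaining = 0, np.arange(len(X_min))
    for eps in eps_list:
        if len(remaining) < min_pts:
            break
        sub = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X_min[remaining])
        for c in set(sub) - {-1}:
            labels[remaining[sub == c]] = next_id
            next_id += 1
        remaining = remaining[sub == -1]
    return labels          # entries still equal to -1 are the final noise points
```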
The pseudo-code of the PSO algorithm for the oversampling-rate optimization problem and the feature-selection optimization in imbalanced classification is shown in Fig. 3, where MCN is the maximum number of cycles and NumFolds is the parameter K of K-fold cross-validation.
The particle swarm optimization algorithm mainly optimizes continuous values, whereas the feature vector here is discrete. To let PSO handle the feature vector as well, a sigmoid function is used: the generated continuous velocity v is converted into 0/1 discrete values with formulas (5) and (6) below, so that PSO can be applied to the selection of the feature-optimization set.
v_i^t′ = sigmoid(v_i^t) = 1 / (1 + e^(−v_i^t))    (5)
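A small sketch of how the continuous velocities of the feature part of a particle can be squashed with formula (5) and then thresholded into a 0/1 feature mask; the stochastic thresholding step is the usual binary-PSO rule and is an assumption here, since formula (6) is not reproduced above.

```python
import numpy as np

def binarize_features(v_feat, rng=None):
    """Map continuous feature velocities to a 0/1 selection mask via the sigmoid of formula (5)."""
    rng = np.random.default_rng(rng)
    s = 1.0 / (1.0 + np.exp(-v_feat))                   # formula (5): sigmoid(v)
    return (rng.random(v_feat.shape) < s).astype(int)   # 1 = keep the feature, 0 = drop it
```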
After the boundary-sample sampling rate and the safe-sample sampling rate of each cluster, together with the features that improve the classification result, have been determined by the PSO algorithm, the most representative data features and data-sample set can be selected by feature extraction; the minority-class samples are then oversampled with the oversampling rate obtained by the optimization, finally yielding the desired balanced data set.
The base classifier of the AdaBoost algorithm can be any single classifier; common machine learning algorithms such as decision trees, neural networks and SVMs are generally used as base classifiers nowadays. Most ensemble learning methods use the same single base-classifier learning algorithm to construct a homogeneous ensemble learning framework, or use different base classifiers to construct a heterogeneous ensemble learning framework.
The idea of AdaBoost is to adjust and normalise the sample weights according to the classification result of the current iteration, thereby ensuring that in the next iteration the weak classifier concentrates more on the samples misclassified by the previous classifier; the update policy of the sample weights is therefore one factor affecting the classification result. At the same time, the weight coefficients of the base classifiers also affect the final classification of the AdaBoost algorithm, so how to set the voting weight coefficients of the base classifiers is also one of the main research topics of the present invention.
The voting weight coefficients of the base classifiers in the AdaBoost algorithm are determined following the same idea: the coefficients of the multiple base classifiers are taken as a "particle" in the PSO optimization algorithm, a specific setting is carried out, and the optimal combination of the voting weight coefficients of these base classifiers in the AdaBoost algorithm is finally obtained.
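Conceptually, each candidate weight vector can be scored by the AUC of the weighted vote it produces; the following is a hedged sketch under that assumption, with illustrative function names and with predict_proba outputs used as the base-classifier votes.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def voting_weight_fitness(weights, base_clfs, X_val, y_val):
    """Fitness of one 'particle' of base-classifier voting weights: AUC of the weighted vote."""
    weights = np.clip(weights, 0.0, None)
    weights = weights / (weights.sum() + 1e-12)              # normalise the voting weights
    probs = [clf.predict_proba(X_val)[:, 1] for clf in base_clfs]
    combined = np.average(np.vstack(probs), axis=0, weights=weights)
    return roc_auc_score(y_val, combined)
```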
The above content is a further detailed description of the present invention in combination with concrete preferred embodiments, and it cannot be concluded that the concrete implementation of the present invention is confined to these descriptions. For those of ordinary skill in the technical field to which the present invention belongs, several simple deductions or substitutions can also be made without departing from the inventive concept of the present invention, and all of them shall be regarded as falling within the protection scope of the present invention.

Claims (4)

1. An imbalanced sample classification method based on the PSO algorithm, characterised in that a particle of the PSO algorithm in the method is represented as OsRate_i, UsRate and Fec_j, where OsRate_i is the oversampling rate of the i-th cluster, i = 1, 2, …, N, N is the number of clusters formed by the DBSCAN algorithm, and Fec_j is the j-th feature of a sample, j = 1, 2, …, M; the method comprises: (1) dividing the data set into a training set Train and a test set Test; (2) generating initial solutions x_i, i = 1, 2, …, SN, in the search space, where SN is the population size; (3) setting the global optimum gbest = 0; (4) executing steps (5) and (6) for MCN (maximum number of cycles) iterations; (5) for j from 1 to SN, obtaining the current solution x_j; according to the obtained solution, reselecting the features to generate a new data set, and learning through k-fold cross-validation and a classifier to obtain the corresponding AUC or F-Measure; (6) obtaining each particle's pbest according to the velocity and position update formulas of the PSO algorithm, and updating the global optimum gbest at the same time; (7) combining the obtained oversampling rates, undersampling rate and features, performing hybrid sampling to build the training data set, then training with the classifier to obtain the final AUC.
2. The imbalanced sample classification method according to claim 1, characterised in that: the DBSCAN algorithm is an improved DBSCAN algorithm that can handle within-class imbalance, and its basic idea is: first, considering that the distribution density of the minority-class samples within a class is uneven, a group of EPS values based on the distribution density is obtained; then the computed average distances of all minority-class sample points are assembled into a distance vector array, and, taking these average distances as a raw data set, this data set is clustered by distance; after the distance array has been clustered into N clusters, all distances within each cluster are summed and averaged, the resulting mean value is taken as the neighbourhood threshold of that cluster, and by computing this mean value for each of the N clusters, N neighbourhood thresholds EPS_i, i = 1, 2, …, N, are obtained; next, the N neighbourhood thresholds are sorted from small to large and saved in an array; in the subsequent clustering, the smallest value in the threshold array is first selected as the EPS value of the DBSCAN algorithm and all minority-class samples are clustered, then the next threshold in the array is used to run DBSCAN again on the minority-class samples labelled as noise points, likewise yielding some clusters and the remaining noise sample points; finally, the above operation is repeated, and when all minority-class samples have been clustered with the different EPS values, all clustering operations on the minority-class samples are complete; the data that are still not assigned to any cluster at the end are noise data.
3. The imbalanced sample classification method according to claim 1, characterised in that: the PSO algorithm mainly optimizes continuous values, whereas the feature vector here is of discrete type; to enable the PSO algorithm to process the discrete feature vector, a sigmoid function is used to convert the generated continuous velocity values into 0/1 discrete values.
4. The imbalanced sample classification method according to claim 1, characterised in that: after the boundary-sample sampling rate and the safe-sample sampling rate of each cluster, together with the features that improve the classification result, have been determined by the PSO algorithm, the most representative data features and data-sample set are selected by feature extraction, the minority-class samples are then oversampled with the oversampling rate obtained by the optimization, and the desired balanced data set is finally obtained.
CN201610172812.3A 2016-03-23 2016-03-23 Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm Pending CN105868775A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610172812.3A CN105868775A (en) 2016-03-23 2016-03-23 Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610172812.3A CN105868775A (en) 2016-03-23 2016-03-23 Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm

Publications (1)

Publication Number Publication Date
CN105868775A true CN105868775A (en) 2016-08-17

Family

ID=56624706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610172812.3A Pending CN105868775A (en) 2016-03-23 2016-03-23 Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm

Country Status (1)

Country Link
CN (1) CN105868775A (en)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154163A (en) * 2016-12-06 2018-06-12 北京京东尚科信息技术有限公司 Data processing method, data identification and learning method and its device
CN108154163B (en) * 2016-12-06 2020-11-24 北京京东尚科信息技术有限公司 Data processing method, data recognition and learning method, apparatus thereof, and computer readable medium
CN106951665A (en) * 2017-04-28 2017-07-14 成都理工大学 Swarm optimization method and device based on crossover operator
CN107145659A (en) * 2017-04-28 2017-09-08 成都理工大学 A kind of predictor selection method and device for being used to evaluate risk of landslip
WO2019033636A1 (en) * 2017-08-16 2019-02-21 哈尔滨工业大学深圳研究生院 Method of using minimized-loss learning to classify imbalanced samples
CN107578061A (en) * 2017-08-16 2018-01-12 哈尔滨工业大学深圳研究生院 Based on the imbalanced data classification issue method for minimizing loss study
CN107563435A (en) * 2017-08-30 2018-01-09 哈尔滨工业大学深圳研究生院 Higher-dimension unbalanced data sorting technique based on SVM
WO2019041629A1 (en) * 2017-08-30 2019-03-07 哈尔滨工业大学深圳研究生院 Method for classifying high-dimensional imbalanced data based on svm
CN109409433A (en) * 2018-10-31 2019-03-01 北京邮电大学 A kind of the personality identifying system and method for social network user
CN109447158A (en) * 2018-10-31 2019-03-08 中国石油大学(华东) A kind of Adaboost Favorable Reservoir development area prediction technique based on unbalanced data
CN109858564A (en) * 2019-02-21 2019-06-07 上海电力学院 Modified Adaboost-SVM model generating method suitable for wind electric converter fault diagnosis
CN109858564B (en) * 2019-02-21 2023-05-05 上海电力学院 Improved Adaboost-SVM model generation method suitable for wind power converter fault diagnosis
CN109887005A (en) * 2019-02-26 2019-06-14 华北理工大学 The TLD target tracking algorism of view-based access control model attention mechanism
CN109887005B (en) * 2019-02-26 2023-05-30 天津城建大学 TLD target tracking method based on visual attention mechanism
CN109948732A (en) * 2019-03-29 2019-06-28 济南大学 Abnormal cell DISTANT METASTASES IN classification method and system based on non-equilibrium study
CN109948732B (en) * 2019-03-29 2020-12-22 济南大学 Abnormal cell distant metastasis classification method and system based on unbalanced learning
CN111710427A (en) * 2020-06-17 2020-09-25 广州市金域转化医学研究院有限公司 Cervical precancerous early lesion stage diagnosis model and establishment method
CN112022149B (en) * 2020-09-04 2022-10-04 无锡博智芯科技有限公司 Atrial fibrillation detection method based on electrocardiosignals
CN112022149A (en) * 2020-09-04 2020-12-04 无锡博智芯科技有限公司 Atrial fibrillation detection method based on electrocardiosignals
CN112770256B (en) * 2021-01-06 2022-09-09 重庆邮电大学 Node track prediction method in unmanned aerial vehicle self-organizing network
CN112770256A (en) * 2021-01-06 2021-05-07 重庆邮电大学 Node track prediction method in unmanned aerial vehicle self-organizing network
CN113628701A (en) * 2021-08-12 2021-11-09 上海大学 Material performance prediction method and system based on density unbalance sample data
CN113628701B (en) * 2021-08-12 2024-04-26 上海大学 Material performance prediction method and system based on density imbalance sample data
CN115905894A (en) * 2023-01-10 2023-04-04 佰聆数据股份有限公司 Equipment residual period analysis method and device based on small sample unbalanced data
CN116108387A (en) * 2023-04-14 2023-05-12 湖南工商大学 Unbalanced data oversampling method and related equipment

Similar Documents

Publication Publication Date Title
CN105868775A (en) Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm
Sun et al. Improved monarch butterfly optimization algorithm based on opposition-based learning and random local perturbation
Alswaitti et al. Density-based particle swarm optimization algorithm for data clustering
Karaboga et al. Fuzzy clustering with artificial bee colony algorithm
CN106682682A (en) Method for optimizing support vector machine based on Particle Swarm Optimization
Lan et al. A two-phase learning-based swarm optimizer for large-scale optimization
CN106202952A (en) A kind of Parkinson disease diagnostic method based on machine learning
Örkcü et al. Estimating the parameters of 3-p Weibull distribution using particle swarm optimization: A comprehensive experimental comparison
CN110245252A (en) Machine learning model automatic generation method based on genetic algorithm
CN106604229A (en) Indoor positioning method based on manifold learning and improved support vector machine
CN107992895A (en) A kind of Boosting support vector machines learning method
CN111986811A (en) Disease prediction system based on big data
CN109086412A (en) A kind of unbalanced data classification method based on adaptive weighted Bagging-GBDT
CN109784405A (en) Cross-module state search method and system based on pseudo label study and semantic consistency
Sasmal et al. A comprehensive survey on aquila optimizer
CN110070116A (en) Segmented based on the tree-shaped Training strategy of depth selects integrated image classification method
CN109492748A (en) A kind of Mid-long term load forecasting method for establishing model of the electric system based on convolutional neural networks
Fei et al. Research on data mining algorithm based on neural network and particle swarm optimization
CN110598836B (en) Metabolic analysis method based on improved particle swarm optimization algorithm
Naik et al. A global-best harmony search based gradient descent learning FLANN (GbHS-GDL-FLANN) for data classification
Ding et al. An improved SFLA-kmeans algorithm based on approximate backbone and its application in retinal fundus image
CN105550711A (en) Firefly algorithm based selective ensemble learning method
CN106056167A (en) Normalization possibilistic fuzzy entropy clustering method based on Gaussian kernel hybrid artificial bee colony algorithm
CN105913085A (en) Tensor model-based multi-source data classification optimizing method and system
CN108549936A (en) The Enhancement Method that self organizing neural network topology based on deep learning is kept

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160817