CN105760889A - Efficient imbalanced data set classification method - Google Patents


Info

Publication number
CN105760889A
CN105760889A
Authority
CN
China
Prior art keywords
sample
decision tree
feature
minority class
class sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610114730.3A
Other languages
Chinese (zh)
Inventor
陈宗海
曹璨
王鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201610114730.3A priority Critical patent/CN105760889A/en
Publication of CN105760889A publication Critical patent/CN105760889A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Abstract

The invention discloses an efficient imbalanced data set classification method. Building on the traditional SMOTE method, it takes boundary samples and isolated points into account to obtain an approximately balanced data set; it then makes the class decision with a subspace-selection and tree-model scheme based on an ensemble random forest, so that the two classes of data can be separated clearly. In this way precision and recall are improved, the result is close to the actual situation, and the invention can be applied in real industrial analysis.

Description

An efficient imbalanced data set classification method
Technical field
The present invention relates to the technical field of data processing, and in particular to an efficient imbalanced data set classification method.
Background art
Classification is one of the most important problems in data analysis, and real data sets often suffer from class imbalance. On imbalanced data sets, conventional classification methods such as decision trees, SVMs and Bayesian networks perform poorly, because these traditional algorithms implicitly assume balanced classes. When a data set is imbalanced, the majority class dominates the training of the classification model, so minority-class samples are hard to identify. Yet in data classification it is usually precisely the minority class that matters, for example predicting telecom customer churn, detecting abnormal credit-card transactions, predicting network intrusions, and diagnosing diseases in the medical domain.
At present, research on imbalanced data set classification falls into two areas: the data level and the algorithm level. Data-level work focuses on adding, deleting and reconstructing data, while algorithm-level work improves the classifier itself.
However, the new samples synthesized by the current data-level SMOTE method have ambiguous class membership, and the method does not consider the particularities of isolated points and boundary samples. At the algorithm level, the random forest method has outstanding advantages: it is a strong classifier composed of many decision trees. Independent, identically distributed random vectors determine the growth of each tree, and the final model output is determined by the vote of all trees. Although the traditional random forest algorithm guarantees randomness, it does not account for the particularity of imbalanced data sets: it gives no extra resources to the minority class, and it lets every tree vote when forming the decision forest, which is unfavorable for classifying imbalanced data sets.
Summary of the invention
The object of the present invention is to provide an efficient imbalanced data set classification method that can clearly separate the two classes of data, stay close to the true result, and at the same time avoid over-fitting to a certain extent.
This object is achieved through the following technical solution:
An efficient imbalanced data set classification method, comprising:
performing k-nearest-neighbor search and linear interpolation on the majority-class and minority-class samples of the imbalanced data set based on the BSMOTE sampling technique, and merging the newly synthesized minority-class samples with the imbalanced data set, thereby obtaining an approximately balanced data set;
sampling the approximately balanced data set to obtain N subsets, each containing approximately equal numbers of minority-class and majority-class samples; extracting features of the minority-class and majority-class samples from the N subsets to obtain a combined feature set; and dividing the combined feature set into two feature subsets by comparing each feature's occurrence count against a threshold;
extracting S reference feature sets from the two feature subsets by stratified sampling, each set being used to construct one decision tree, thereby obtaining S decision trees;
screening a number of decision trees out of the S decision trees based on an index measuring each tree's classification performance and a metric measuring the similarity between trees, to form the final decision forest and thereby classify the imbalanced data set.
Further, performing k-nearest-neighbor search and linear interpolation on the majority-class and minority-class samples of the imbalanced data set based on the BSMOTE sampling technique, merging the newly synthesized minority-class samples with the imbalanced data set, and thereby obtaining an approximately balanced data set includes:
the imbalanced data set contains two classes of samples: the more numerous class is the majority class and the less numerous class is the minority class;
finding the k nearest neighbors of every minority-class sample in the sample space of the imbalanced data set, and dividing the minority class into 3 sample sets according to the ratio of majority-class to minority-class samples in each k-neighborhood: a safe sample set, a danger sample set and an isolated sample set;
safe, danger and isolated samples are defined as follows: if, in the k-neighborhood of the current minority-class sample, the number of majority-class samples is smaller than the number of minority-class samples, the current sample is a safe sample; if the number of majority-class samples is not smaller than the number of minority-class samples and the number of minority-class samples is not 0, it is a danger sample; if the k-neighborhood consists entirely of majority-class samples, it is an isolated sample;
for each minority-class sample in the danger set, finding its k nearest neighbors among all minority-class samples and performing linear interpolation between the original sample and s minority-class samples randomly selected from the k-neighborhood, synthesizing s new minority-class samples; if the danger set contains d_num minority-class samples in total, s × d_num new minority-class samples are synthesized altogether;
merging the newly synthesized minority-class samples with the imbalanced data set to form the approximately balanced data set.
Further, sampling the approximately balanced data set to obtain N subsets each containing approximately equal numbers of minority-class and majority-class samples, extracting the features of the minority-class and majority-class samples from the N subsets to obtain a combined feature set, and dividing the combined feature set into two feature subsets by comparing each feature's occurrence count against a threshold includes:
sampling the majority-class samples of the approximately balanced data set with replacement into a bag based on the Bagging method, and loading minority-class samples into the bag so that the bag holds approximately equal numbers of the two classes; repeating this operation N−1 more times to obtain N subsets in total;
taking the N subsets as the training sample sets for feature selection and performing feature extraction with a correlation-based feature selection method to obtain a combined feature set;
counting how many times each feature occurs in the combined feature set (the more often a feature occurs, the more important it is) and comparing each count against a threshold δ1; features occurring more than δ1 times are placed into feature subset A as good features, and features occurring no more than δ1 times are placed into feature subset B as poor features.
Further, extracting S reference feature sets from the two feature subsets by stratified sampling, each set being used to construct one decision tree, thereby obtaining S decision trees, includes:
determining the ratio of the numbers of features in the sorted feature subsets A and B;
drawing features from A and B according to that ratio, merging the features drawn from A and B each time into one reference feature set, and performing S draws in total to obtain S reference feature sets;
for each reference feature set, choosing the split points of the tree model using an attribute-selection metric and constructing a decision tree, thereby obtaining S decision trees.
Further, screening a number of decision trees out of the S decision trees based on an index measuring each tree's classification performance and a metric measuring the similarity between trees, forming the final decision forest, and thereby classifying the imbalanced data set includes:
computing the index measuring each tree's classification performance: for every decision tree, the area under the ROC curve (AUC) is used as the index; the larger the AUC, the closer the ROC curve lies to the upper left, and the better the classification performance;
computing the metric measuring the similarity between decision trees: each tree's classification results on the imbalanced data are used to obtain the similarity metric between every pair of trees;
setting a classification-strength threshold δ2, comparing each tree's classification-performance index against δ2, and keeping the trees above the threshold; then setting a similarity threshold δ3, traversing the pairwise correlations of the remaining trees, and for every pair whose similarity exceeds δ3 deleting the tree with the weaker classification strength;
the decision trees surviving this screening form the final decision forest; the outputs of the trees in the forest are tallied to obtain the corresponding classification result.
As can be seen from the above technical solution, on the basis of the traditional SMOTE method the invention adds consideration of boundary samples and isolated points, thereby obtaining an approximately balanced data set; it then makes the class decision with a subspace-selection and tree-model scheme based on an ensemble random forest, so the two classes can be separated clearly, precision and recall are improved, the result is close to the true situation, and the method can be applied in real industrial analysis.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the efficient imbalanced data set classification method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the BSMOTE sampling technique provided by an embodiment of the present invention;
Fig. 3 is a flowchart of the feature classification provided by an embodiment of the present invention;
Fig. 4 is a flowchart of decision tree construction based on stratified sampling provided by an embodiment of the present invention;
Fig. 5 is a flowchart of screening decision trees to form the decision forest provided by an embodiment of the present invention;
Fig. 6 is a schematic diagram of decision forest voting provided by an embodiment of the present invention.
Detailed description of the invention
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides an efficient imbalanced data set classification method which, as shown in Fig. 1, mainly comprises the following steps:
Step 11: performing k-nearest-neighbor search and linear interpolation on the majority-class and minority-class samples of the imbalanced data set based on the BSMOTE sampling technique, and merging the newly synthesized minority-class samples with the imbalanced data set to obtain an approximately balanced data set.
In this step the traditional BSMOTE method is improved: following the k-nearest-neighbor and linear-interpolation ideas, consideration of the samples near the decision boundary is added on top of the traditional SMOTE method. The principle is shown in Fig. 2 and is as follows:
In this embodiment the imbalanced data set contains two classes of samples: the more numerous class is the majority class (the black-background samples in Fig. 2, 10 in number) and the less numerous class is the minority class (the white-background samples in Fig. 2, 4 in number);
the k nearest neighbors of every minority-class sample are found in the sample space of the imbalanced data set, and the minority class is divided into 3 sample sets according to the ratio of majority-class to minority-class samples in each k-neighborhood: a safe sample set, a danger sample set and an isolated sample set;
safe, danger and isolated samples are defined as follows: if, in the k-neighborhood of the current minority-class sample, the number of majority-class samples is smaller than the number of minority-class samples, the current sample is a safe sample; if the number of majority-class samples is not smaller than the number of minority-class samples and the number of minority-class samples is not 0, it is a danger sample; if the k-neighborhood consists entirely of majority-class samples, it is an isolated sample;
In this embodiment, samples in the isolated set are likely to be noise points and are therefore not considered. Likewise, safe samples are generally held to have little influence on minority-class classification performance and are not considered either.
For each minority-class sample in the danger set, its k nearest neighbors are found among all minority-class samples, and linear interpolation is performed between the original sample and s minority-class samples randomly selected from the k-neighborhood, synthesizing s new minority-class samples; if the danger set contains d_num minority-class samples in total, s × d_num new minority-class samples are synthesized altogether;
the newly synthesized minority-class samples are merged with the imbalanced data set to form the approximately balanced data set.
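The three-way split and interpolation of step 11 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: samples are assumed to be 2-D tuples, distance is assumed Euclidean, and the names `split_minority` and `bsmote` are ours.

```python
import math
import random

def knn(x, pool, k):
    """Return the k points of `pool` nearest to `x` (Euclidean distance)."""
    return sorted(pool, key=lambda p: math.dist(x, p))[:k]

def split_minority(minority, majority, k):
    """Divide minority samples into safe / danger / isolated sets by the
    majority-class count among each sample's k nearest neighbors."""
    maj = set(majority)
    safe, danger, isolated = [], [], []
    for x in minority:
        pool = [p for p in minority if p != x] + list(majority)
        n_maj = sum(p in maj for p in knn(x, pool, k))
        n_min = k - n_maj
        if n_min == 0:
            isolated.append(x)   # neighborhood is all majority class
        elif n_maj < n_min:
            safe.append(x)       # mostly minority-class neighbors
        else:
            danger.append(x)     # near the decision boundary
    return safe, danger, isolated

def bsmote(minority, majority, k, s, rng):
    """Synthesize s new samples per danger sample by linear interpolation
    toward randomly chosen minority-class neighbors."""
    _, danger, _ = split_minority(minority, majority, k)
    new = []
    for x in danger:
        neighbors = knn(x, [p for p in minority if p != x], k)
        for nb in rng.sample(neighbors, s):
            t = rng.random()     # interpolation factor in [0, 1)
            new.append(tuple(a + t * (b - a) for a, b in zip(x, nb)))
    return new
```

With d_num danger samples, `bsmote` yields s × d_num new samples, which are then appended to the original data set.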
Step 12: sampling the approximately balanced data set to obtain N subsets, each containing approximately equal numbers of minority-class and majority-class samples; extracting the features of the minority-class and majority-class samples from the N subsets to obtain a combined feature set; and dividing the combined feature set into two feature subsets by comparing each feature's occurrence count against a threshold.
In this step, the approximately balanced data set obtained in step 11 is repeatedly sampled in a predetermined way to obtain N balanced bag subsets; ensemble feature selection is then performed on the N bag subsets, and according to the importance of the features a good feature set and a poor feature set are selected for the imbalanced data. The procedure is shown in Fig. 3 and mainly comprises:
1) sampling the majority-class samples of the approximately balanced data set with replacement into a bag based on the Bagging method, and loading minority-class samples into the bag so that the bag holds approximately equal numbers of the two classes; repeating this operation N−1 more times to obtain N subsets in total;
2) taking the N subsets as the training sample sets for feature selection and performing feature extraction with a correlation-based feature selection method to obtain a combined feature set;
3) counting how many times each feature occurs in the combined feature set (the more often a feature occurs, the more important it is) and comparing each count against a threshold δ1; features occurring more than δ1 times are placed into feature subset A as good features, and features occurring no more than δ1 times are placed into feature subset B as poor features.
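Step 12 can be sketched as follows. The per-bag feature selector is left abstract (passed in as a callable), since the text only specifies "a correlation-based feature selection method" without naming one; function names are ours.

```python
import random
from collections import Counter

def balanced_bags(majority, minority, n_bags, rng):
    """Build N bags; each pairs all minority samples with an equal-size
    with-replacement draw from the majority class (the Bagging step)."""
    return [[rng.choice(majority) for _ in range(len(minority))] + list(minority)
            for _ in range(n_bags)]

def split_features(bags, select_features, delta1):
    """Run the per-bag feature selector, count how often each feature is
    chosen across the bags, and split features at the occurrence threshold."""
    counts = Counter()
    for bag in bags:
        counts.update(select_features(bag))
    good = {f for f, c in counts.items() if c > delta1}   # feature subset A
    poor = {f for f, c in counts.items() if c <= delta1}  # feature subset B
    return good, poor
```

A feature chosen in most bags is stable under resampling and lands in subset A; a feature chosen only occasionally lands in subset B.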
Step 13: extracting S reference feature sets from the two feature subsets by stratified sampling; each set is used to construct one decision tree, giving S decision trees.
This step is the decision tree construction process; as shown in Fig. 4 it mainly comprises:
1) determining the ratio of the numbers of features in the sorted feature subsets A and B;
2) drawing features from A and B according to that ratio, merging the features drawn from A and B each time into one reference feature set, and performing S draws in total to obtain S reference feature sets;
3) for each reference feature set, choosing the split points of the tree model using an attribute-selection metric and constructing a decision tree, thereby obtaining S decision trees.
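The stratified draw of step 13 can be sketched as below. One assumption is ours: each reference set uses half of all available features, which reproduces the 6-from-A, 4-from-B split of the worked example (12 good and 8 poor features); the tree construction itself is omitted, since any attribute-selection metric (e.g. Gini or information gain) could be plugged in.

```python
import random

def reference_feature_sets(subset_a, subset_b, n_trees, rng):
    """Draw S reference feature sets by stratified sampling: the per-set
    counts from A and B follow the size ratio of the two subsets.
    Assumed: each set uses half of all features."""
    a, b = sorted(subset_a), sorted(subset_b)
    per_set = (len(a) + len(b)) // 2
    n_a = round(per_set * len(a) / (len(a) + len(b)))
    n_b = per_set - n_a
    return [rng.sample(a, n_a) + rng.sample(b, n_b) for _ in range(n_trees)]
```

Because every reference set mixes good and poor features in a fixed ratio, each tree sees a mostly strong but still diverse feature subspace.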
Step 14: screening a number of decision trees out of the S decision trees based on an index measuring each tree's classification performance and a metric measuring the similarity between trees, forming the final decision forest and thereby classifying the imbalanced data set.
The implementation of this step is shown in Fig. 5 and mainly comprises:
1) computing the index measuring each tree's classification performance: for every decision tree, the area under the ROC curve (AUC) is used as the index; the larger the AUC, the closer the ROC curve lies to the upper left, and the better the classification performance;
2) computing the metric measuring the similarity between decision trees: each tree's classification results on the imbalanced data are used to obtain the similarity metric between every pair of trees;
3) setting a classification-strength threshold δ2, comparing each tree's classification-performance index against δ2, and keeping the trees above the threshold; then setting a similarity threshold δ3, traversing the pairwise correlations of the remaining trees, and for every pair whose similarity exceeds δ3 deleting the tree with the weaker classification strength;
4) the decision trees surviving this screening form the final decision forest, as shown in Fig. 6; the outputs of the trees in the forest are tallied to obtain the corresponding classification result.
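The two-stage screening of step 14 can be sketched as below. Trees are abstracted to identifiers with precomputed AUC scores and pairwise correlations; how AUC and the correlations are computed is outside this sketch, and the function name is ours.

```python
def prune_forest(auc, corr, delta2, delta3):
    """auc: {tree_id: AUC}. corr: {(i, j): similarity for pairs with i < j}.
    Keep trees whose AUC exceeds delta2, then for each remaining pair more
    similar than delta3 drop the tree with the lower AUC."""
    kept = {t for t, a in auc.items() if a > delta2}
    for (i, j), c in sorted(corr.items()):
        if c > delta3 and i in kept and j in kept:
            kept.discard(i if auc[i] < auc[j] else j)
    return kept
```

Applied to numbers like those in the worked example (10 trees, δ2 = 0.75, δ3 = 0.5, correlated pairs 1–3 and 4–7), this keeps 6 trees, dropping the weaker tree of each correlated pair.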
To aid understanding, the present invention is introduced below with a concrete example. Note that the specific parameter values involved in the example are only illustrative and not limiting; in practice, users can set the parameter values according to their actual situation.
The example follows the same steps as the foregoing embodiment; the detailed process is as follows:
Following step 11 with Fig. 2: in the figure k = 4, s = 1 and d_num = 2. There are 4 minority-class samples and 10 majority-class samples. Numbering the minority-class samples 1, 2, 3, 4 from left to right, with k = 4 neighbors it can be determined from the definitions that samples 1 and 4 are isolated samples while samples 2 and 3 are danger samples, so samples 2 and 3 undergo linear interpolation. Sample 2 has 2 minority-class samples in its 4-neighborhood; 1 of them is randomly selected and 1 new minority-class sample is synthesized. Likewise, sample 3 has only 1 minority-class sample in its 4-neighborhood, so that one is chosen and 1 sample is synthesized. In total s × d_num = 1 × 2 = 2 new samples are synthesized. The minority class then has 4 + 2 = 6 samples, so the gap with the majority class shrinks and the data set tends toward balance. When s and d_num are larger, more new samples than the original number of minority-class samples can be synthesized.
Take a data set of 5000 samples and first balance it according to step 11. Suppose the original data set has 4800 majority-class samples and 200 minority-class samples, and the BSMOTE method synthesizes 400 minority-class samples; the new data set then has 5400 samples in total, with 4800 majority-class and 600 minority-class samples. The imbalance ratio of majority to minority is thus adjusted from 24:1 to 8:1. Bagging-based feature subset selection according to step 12: all 600 minority-class samples are taken, 600 majority-class samples are drawn at random with replacement, and together they form one subset. This is repeated N = 10 times in total, giving 10 subsets. Feature extraction is performed on each subset, yielding one feature set per subset, for example g1 = {D1, D2, D4, …, D20}, g2 = {D2, D3, …, D19}, g3 = {D1, D3, …, D18, D20}, …, g10 = {D1, D2, D3, …, D19, D20}; the 10 feature sets are merged into one large combined feature set. The number of occurrences of each feature Di is then counted; with the occurrence threshold set to δ1 = 8, features occurring 8 or more times are placed into feature subset A, and features occurring fewer than 8 times are placed into the poor feature subset B.
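The arithmetic of this paragraph can be checked in a few lines (the variable names are ours, chosen for the sketch):

```python
# Worked example: 4800 majority / 200 minority samples; BSMOTE synthesizes
# 400 new minority samples.
majority, minority, synthesized = 4800, 200, 400
new_minority = minority + synthesized   # minority class after oversampling
total = majority + new_minority         # overall size of the new data set
ratio_before = majority // minority     # original imbalance ratio (x:1)
ratio_after = majority // new_minority  # imbalance ratio after BSMOTE
print(total, ratio_before, ratio_after)  # 5400 24 8
```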
Following step 13: feature subset A now has 12 features and B has 8, a ratio of 3:2. Using stratified sampling, 6 features are drawn from A and 4 from B each time and merged into one reference feature set. S = 10 draws (with replacement across draws) are performed in total, giving 10 reference feature sets. Tree models are then split and constructed on each, yielding 10 decision trees.
Following step 14: each decision tree classifies the data set, and the AUC of each tree is computed as the index measuring the classification performance of the 10 trees. For example, with the classification-strength threshold δ2 = 0.75, 8 trees meet the requirement. For the similarity metric, a separate test set that was not part of the training set is prepared in advance and split into 3 groups; each tree classifies the 3 groups, giving 3 × 8 results in total. The results of every pair of trees are compared and the correlation coefficient between the two trees is computed, which requires C(8,2) = 8 × 7 / 2 = 28 comparisons. Suppose the similarity threshold is δ3 = 0.5 and trees 1 and 3, as well as trees 4 and 7, are found to exceed it. The weaker tree of each pair is then deleted according to classification strength: for example, if trees 1 and 3 have strengths 0.81 and 0.77 and trees 4 and 7 have strengths 0.82 and 0.85, trees 3 and 4 are deleted and trees 1 and 7 remain. After screening, the remaining 6 decision trees form the decision forest. Finally, the majority vote of the 6 decision trees gives the final decision and prediction.
From the above description of the embodiments, a person skilled in the art can clearly understand that the above embodiments can be implemented in software, or in software plus a necessary general hardware platform. Based on this understanding, the technical solution of the above embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (a CD-ROM, USB flash drive, portable hard drive, etc.) and includes instructions that cause a computer device (a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present invention.
The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person familiar with the art can readily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (5)

1. An efficient imbalanced data set classification method, characterized by comprising:
performing k-nearest-neighbor search and linear interpolation on the majority-class and minority-class samples of the imbalanced data set based on the BSMOTE sampling technique, and merging the newly synthesized minority-class samples with the imbalanced data set, thereby obtaining an approximately balanced data set;
sampling the approximately balanced data set to obtain N subsets, each containing approximately equal numbers of minority-class and majority-class samples; extracting features of the minority-class and majority-class samples from the N subsets to obtain a combined feature set; and dividing the combined feature set into two feature subsets by comparing each feature's occurrence count against a threshold;
extracting S reference feature sets from the two feature subsets by stratified sampling, each set being used to construct one decision tree, thereby obtaining S decision trees;
screening a number of decision trees out of the S decision trees based on an index measuring each tree's classification performance and a metric measuring the similarity between trees, to form the final decision forest and thereby classify the imbalanced data set.
2. The efficient imbalanced data set classification method according to claim 1, characterized in that performing k-nearest-neighbor and linear interpolation calculations on the majority class samples and minority class samples in the imbalanced data set based on the BSMOTE sampling technique, and merging the resulting new minority class samples with the imbalanced data set to obtain an approximately balanced data set, comprises:
The imbalanced data set comprises two classes of samples: the class with more samples is the majority class, and the class with fewer samples is the minority class;
Finding the k nearest neighbors of every minority class sample in the sample space of the imbalanced data set, and dividing the minority class into 3 sample sets according to the ratio of the number of majority class samples to the number of minority class samples within each k-neighborhood: a safe sample set, a danger sample set and an isolated sample set;
Safe, danger and isolated samples are defined as follows: if the number of majority class samples in the k-neighborhood of the current minority class sample is less than the number of minority class samples, the current minority class sample is a safe sample; if the number of majority class samples is not less than the number of minority class samples and the number of minority class samples is not 0, the current minority class sample is a danger sample; if the k-neighborhood of the current minority class sample contains only majority class samples, the current minority class sample is an isolated sample;
For each minority class sample in the danger sample set, finding its k nearest neighbors within the minority class sample space, and performing linear interpolation between the original minority class sample and s minority class samples randomly selected from the k-neighborhood, thereby synthesizing s new minority class samples; if the danger sample set contains d_num minority class samples, a total of s × d_num new minority class samples are finally synthesized;
Merging the synthesized new minority class samples with the imbalanced data set to form an approximately balanced data set.
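The BSMOTE resampling described above can be sketched in Python as follows. This is a minimal illustration of the claimed steps, not the patented implementation: the function name `borderline_smote`, the plain numpy k-nearest-neighbor search, and the uniform random interpolation gap are assumptions added for the example.

```python
import numpy as np

def borderline_smote(X_maj, X_min, k=5, s=2, rng=None):
    """Classify minority samples as safe/danger/isolated by k-NN over the
    whole sample space, then linearly interpolate around the danger samples.
    Safe and isolated samples are left untouched, as in the claim."""
    rng = np.random.default_rng(rng)
    X_all = np.vstack([X_maj, X_min])
    is_maj = np.array([True] * len(X_maj) + [False] * len(X_min))
    danger = []
    for x in X_min:
        # k nearest neighbours in the full sample space (index 0 is x itself)
        d = np.linalg.norm(X_all - x, axis=1)
        nn = np.argsort(d)[1:k + 1]
        n_maj = is_maj[nn].sum()
        n_min = k - n_maj
        # danger: majority >= minority in the neighbourhood, but not all majority
        if n_maj >= n_min and n_min > 0:
            danger.append(x)
    synthetic = []
    for x in danger:
        # k nearest neighbours among minority samples only
        d = np.linalg.norm(X_min - x, axis=1)
        nn = np.argsort(d)[1:k + 1]
        for j in rng.choice(nn, size=s, replace=False):
            gap = rng.random()  # linear interpolation with a random gap in [0, 1)
            synthetic.append(x + gap * (X_min[j] - x))
    return np.array(synthetic)  # s * d_num new minority samples
```

Given majority samples `X_maj` and minority samples `X_min`, the returned array holds the s × d_num synthetic minority samples that are then merged back into the original data set.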
3. The efficient imbalanced data set classification method according to claim 1, characterized in that sampling the approximately balanced data set to obtain N subsets each containing approximately equal numbers of minority class samples and majority class samples, extracting the features of the minority class samples and majority class samples in the N subsets to obtain a combined feature set, and dividing the combined feature set into two feature subsets by comparing the occurrence count of each feature in the combined feature set with a threshold, comprises:
Sampling the majority class samples of the approximately balanced data set with replacement into a bag based on the Bagging method, and loading minority class samples into the bag so that the bag contains approximately equal numbers of the two classes of samples; repeating the above operation N-1 times to obtain N subsets in total;
Taking the N subsets as the input training sample sets for feature selection, and performing feature extraction with a correlation-based feature selection method to obtain the combined feature set;
Counting the number of occurrences of each feature in the combined feature set, where more occurrences indicate higher feature importance, and comparing the occurrence count of each feature with a threshold δ1; features whose occurrence count is greater than the threshold δ1 are put into feature subset A as good features, and features whose occurrence count is not greater than the threshold δ1 are put into feature subset B as poor features.
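The subset construction and occurrence-based feature split described above can be sketched as follows. The helper names (`make_balanced_bags`, `split_features_by_occurrence`) are assumptions, and the correlation-based feature selector itself is left abstract: it is represented only by the per-subset feature lists it would return.

```python
import numpy as np
from collections import Counter

def make_balanced_bags(X_maj, X_min, N, rng=None):
    """Draw N Bagging subsets, each pairing a with-replacement bootstrap of the
    majority class (sized to match the minority class) with the minority class."""
    rng = np.random.default_rng(rng)
    bags = []
    for _ in range(N):
        idx = rng.choice(len(X_maj), size=len(X_min), replace=True)
        bags.append((X_maj[idx], X_min))
    return bags

def split_features_by_occurrence(feature_lists, delta1):
    """Count how often each feature was selected across the N subsets and split
    the combined feature set into a good subset A and a poor subset B."""
    counts = Counter(f for feats in feature_lists for f in feats)
    A = {f for f, c in counts.items() if c > delta1}   # good features
    B = {f for f, c in counts.items() if c <= delta1}  # poor features
    return A, B
```

A feature selected on most of the N balanced subsets is stable under resampling, which is why occurrence count serves as an importance proxy here.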
4. The efficient imbalanced data set classification method according to claim 1, characterized in that extracting S reference feature sets from the two feature subsets by stratified sampling, each reference feature set being used to construct a decision tree so as to obtain S decision trees, comprises:
Determining the ratio of the numbers of features in the sorted feature subset A and feature subset B;
Extracting features according to the ratio of feature subset A to feature subset B, and merging the features extracted each time from feature subset A and feature subset B into one reference feature set; performing S such extractions, thereby obtaining S reference feature sets;
For each reference feature set, selecting the split points of the tree model using an attribute selection metric and constructing a decision tree, thereby obtaining S decision trees.
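The stratified extraction of reference feature sets can be sketched as follows; this is a hedged example under assumed names, drawing from subsets A and B in proportion to their sizes. The construction of each decision tree from a reference feature set (with an attribute selection metric such as Gini index or information gain) is left out of the sketch.

```python
import numpy as np

def stratified_feature_sets(A, B, S, n_feats, rng=None):
    """Draw S reference feature sets, taking features from the good subset A
    and the poor subset B in proportion to |A| : |B| (stratified sampling)."""
    rng = np.random.default_rng(rng)
    A, B = sorted(A), sorted(B)  # fix an order so sampling is reproducible
    n_a = max(1, round(n_feats * len(A) / (len(A) + len(B))))
    n_b = n_feats - n_a
    sets = []
    for _ in range(S):
        fa = list(rng.choice(A, size=min(n_a, len(A)), replace=False))
        fb = list(rng.choice(B, size=min(n_b, len(B)), replace=False))
        sets.append(fa + fb)
    return sets
```

Each of the S returned feature sets would then be used to train one decision tree restricted to those feature columns.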
5. The efficient imbalanced data set classification method according to claim 1, 3 or 4, characterized in that screening a number of decision trees from the S decision trees, based on an index measuring decision tree classification performance and a metric measuring the similarity between decision trees, to form the final decision forest and thereby realize imbalanced data set classification, comprises:
Calculating the index measuring decision tree classification performance: for each decision tree, taking the area under the ROC curve (AUC) as the index of classification performance; the larger the AUC, the closer the ROC curve is to the upper left corner and the better the classification performance;
Calculating the metric measuring the similarity between decision trees: using the classification results of the decision trees on the imbalanced data to obtain the similarity metric between each pair of decision trees;
Setting a classification strength threshold δ2, comparing the classification performance index of each decision tree with the threshold δ2, and retaining the decision trees above the threshold; setting a similarity threshold δ3, traversing the pairwise correlations of the retained decision trees, and for each pair of decision trees whose similarity exceeds the threshold δ3, deleting the one with the weaker classification strength;
The decision trees that pass the above screening form the final decision forest; the outputs of the decision trees in the decision forest are tallied to obtain the corresponding classification result.
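The two-stage screening of claim 5 can be sketched as follows. The function name and the concrete similarity measure (fraction of identical predictions on the data) are assumptions for illustration; the claim itself leaves the exact similarity metric open.

```python
import numpy as np

def prune_forest(aucs, predictions, delta2, delta3):
    """Keep trees whose AUC exceeds delta2 (classification strength threshold),
    then for every pair of kept trees that is too similar (> delta3), drop the
    tree with the lower AUC. Returns the indices of the surviving trees."""
    keep = [i for i, auc in enumerate(aucs) if auc > delta2]
    removed = set()
    for i in range(len(keep)):
        for j in range(i + 1, len(keep)):
            a, b = keep[i], keep[j]
            if a in removed or b in removed:
                continue
            # similarity = fraction of identical predictions on the data
            sim = np.mean(predictions[a] == predictions[b])
            if sim > delta3:
                removed.add(a if aucs[a] < aucs[b] else b)
    return [i for i in keep if i not in removed]
```

The surviving trees form the decision forest; a final label would then be obtained by tallying (e.g. majority-voting) their per-sample outputs.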
CN201610114730.3A 2016-03-01 2016-03-01 Efficient imbalanced data set classification method Pending CN105760889A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610114730.3A CN105760889A (en) 2016-03-01 2016-03-01 Efficient imbalanced data set classification method


Publications (1)

Publication Number Publication Date
CN105760889A true CN105760889A (en) 2016-07-13

Family

ID=56331548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610114730.3A Pending CN105760889A (en) 2016-03-01 2016-03-01 Efficient imbalanced data set classification method

Country Status (1)

Country Link
CN (1) CN105760889A (en)


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104794195A (en) * 2015-04-17 2015-07-22 南京大学 Data mining method for finding potential telecommunication users changing cell phones


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HUI HAN, WEN-YUAN WANG, AND BING-HUAN MAO: "Advances in Intelligent Computing", 31 December 2005, Springer Berlin Heidelberg *
WANG HE-YONG: "Combination approach of SMOTE and biased-SVM for imbalanced datasets", IJCNN: IEEE International Joint Conference on Neural Networks *
XIAO Jian: "Research on imbalanced data classification methods based on random forests", China Masters' Theses Full-text Database *

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294490B (en) * 2015-06-08 2019-12-24 富士通株式会社 Feature enhancement method and device for data sample and classifier training method and device
CN106294490A (en) * 2015-06-08 2017-01-04 富士通株式会社 The feature Enhancement Method of data sample and device and classifier training method and apparatus
CN106203519A (en) * 2016-07-17 2016-12-07 合肥赑歌数据科技有限公司 Fault pre-alarming algorithm based on taxonomic clustering
CN106372655A (en) * 2016-08-26 2017-02-01 南京邮电大学 Synthetic method for minority class samples in non-balanced IPTV data set
CN106529598A (en) * 2016-11-11 2017-03-22 北京工业大学 Classification method and system based on imbalanced medical image data set
CN106529598B (en) * 2016-11-11 2020-05-08 北京工业大学 Method and system for classifying medical image data sets based on imbalance
CN107391569A (en) * 2017-06-16 2017-11-24 阿里巴巴集团控股有限公司 Identification, model training, Risk Identification Method, device and the equipment of data type
US11100220B2 (en) 2017-06-16 2021-08-24 Advanced New Technologies Co., Ltd. Data type recognition, model training and risk recognition methods, apparatuses and devices
US11113394B2 (en) 2017-06-16 2021-09-07 Advanced New Technologies Co., Ltd. Data type recognition, model training and risk recognition methods, apparatuses and devices
CN107391569B (en) * 2017-06-16 2020-09-15 阿里巴巴集团控股有限公司 Data type identification, model training and risk identification method, device and equipment
CN107437095A (en) * 2017-07-24 2017-12-05 腾讯科技(深圳)有限公司 Classification determines method and device
CN107451694B (en) * 2017-08-03 2020-10-02 重庆大学 Application prediction method for context awareness and self-adaptation in mobile system
CN107451694A (en) * 2017-08-03 2017-12-08 重庆大学 It is a kind of to be used for context-aware and adaptive applied forecasting method in mobile system
CN109558962A (en) * 2017-09-26 2019-04-02 中国移动通信集团山西有限公司 Predict device, method and storage medium that telecommunication user is lost
CN108647138A (en) * 2018-02-27 2018-10-12 中国电子科技集团公司电子科学研究院 A kind of Software Defects Predict Methods, device, storage medium and electronic equipment
CN108960561A (en) * 2018-05-04 2018-12-07 阿里巴巴集团控股有限公司 A kind of air control model treatment method, device and equipment based on unbalanced data
CN109902805A (en) * 2019-02-22 2019-06-18 清华大学 The depth measure study of adaptive sample synthesis and device
CN110135614A (en) * 2019-03-26 2019-08-16 广东工业大学 It is a kind of to be tripped prediction technique based on rejecting outliers and the 10kV distribution low-voltage of sampling techniques
CN110084609A (en) * 2019-04-23 2019-08-02 东华大学 A kind of transaction swindling behavior depth detection method based on representative learning
CN110084609B (en) * 2019-04-23 2023-06-02 东华大学 Transaction fraud behavior deep detection method based on characterization learning
WO2020220220A1 (en) * 2019-04-29 2020-11-05 西门子(中国)有限公司 Classification model training method and device, and computer-readable medium
CN110991551A (en) * 2019-12-13 2020-04-10 北京百度网讯科技有限公司 Sample processing method, sample processing device, electronic device and storage medium
CN110991551B (en) * 2019-12-13 2023-09-15 北京百度网讯科技有限公司 Sample processing method, device, electronic equipment and storage medium
CN112560900A (en) * 2020-09-08 2021-03-26 同济大学 Multi-disease classifier design method for sample imbalance
CN112560900B (en) * 2020-09-08 2023-01-20 同济大学 Multi-disease classifier design method for sample imbalance
CN112463972B (en) * 2021-01-28 2021-05-18 成都数联铭品科技有限公司 Text sample classification method based on class imbalance
CN112463972A (en) * 2021-01-28 2021-03-09 成都数联铭品科技有限公司 Sample classification method based on class imbalance
CN113434401A (en) * 2021-06-24 2021-09-24 杭州电子科技大学 Software defect prediction method based on sample distribution characteristics and SPY algorithm
CN113628701A (en) * 2021-08-12 2021-11-09 上海大学 Material performance prediction method and system based on density unbalance sample data
CN113628701B (en) * 2021-08-12 2024-04-26 上海大学 Material performance prediction method and system based on density imbalance sample data
CN115544902A (en) * 2022-11-29 2022-12-30 四川骏逸富顿科技有限公司 Pharmacy risk level identification model generation method and pharmacy risk level identification method

Similar Documents

Publication Publication Date Title
CN105760889A (en) Efficient imbalanced data set classification method
Skryjomski et al. Influence of minority class instance types on SMOTE imbalanced data oversampling
CN107391772B (en) Text classification method based on naive Bayes
CN103164713B (en) Image classification method and device
CN109344618B (en) Malicious code classification method based on deep forest
CN104598586B (en) The method of large-scale text categorization
CN106096727A (en) A kind of network model based on machine learning building method and device
CN105426426A (en) KNN text classification method based on improved K-Medoids
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN101807254A (en) Implementation method for data characteristic-oriented synthetic kernel support vector machine
CN111062425B (en) Unbalanced data set processing method based on C-K-SMOTE algorithm
CN112036476A (en) Data feature selection method and device based on two-classification service and computer equipment
CN112699936B (en) Electric power CPS generalized false data injection attack identification method
Xie et al. Imbalanced big data classification based on virtual reality in cloud computing
JP5765583B2 (en) Multi-class classifier, multi-class classifying method, and program
CN117371511A (en) Training method, device, equipment and storage medium for image classification model
JP5892275B2 (en) Multi-class classifier generation device, data identification device, multi-class classifier generation method, data identification method, and program
CN115545111B (en) Network intrusion detection method and system based on clustering self-adaptive mixed sampling
Chang The application of machine learning models in company bankruptcy prediction
Ma The Research of Stock Predictive Model based on the Combination of CART and DBSCAN
CN111383716B (en) Screening method, screening device, screening computer device and screening storage medium
CN103207893A (en) Classification method of two types of texts on basis of vector group mapping
Dražić et al. Technology matching of the patent documents using clustering algorithms
CN114185956A (en) Data mining method based on canty and k-means algorithm
JP4125951B2 (en) Text automatic classification method and apparatus, program, and recording medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160713