CN105760889A - Efficient imbalanced data set classification method - Google Patents
- Publication number: CN105760889A (application CN201610114730.3A)
- Authority: CN (China)
- Legal status: Pending (an assumption, not a legal conclusion)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
Abstract
The invention discloses an efficient imbalanced data set classification method. Building on the traditional SMOTE method, it takes boundary samples and isolated points into account to obtain an approximately balanced data set; it then classifies the data with a subspace-selection and tree-model scheme based on ensemble random forests, clearly separating the two classes of data. In this way, precision and recall are improved, the results are close to the actual situation, and the method can be applied in real industrial analysis.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to an efficient imbalanced data set classification method.
Background technology
Classification is one of the most important problems in data analysis, and real-world data sets often suffer from imbalanced class sizes. On imbalanced data sets, conventional classification methods such as decision trees, SVMs and Bayesian networks perform poorly, because traditional algorithms implicitly assume that the classes are balanced. When the data set is imbalanced, the majority class dominates the training of the classification model, so minority class samples are hard to identify. Yet the minority class is usually exactly what the classification problem cares about; practical applications include predicting telecom customer churn, detecting abnormal credit card transactions, predicting network intrusions, and diagnosing diseases in the medical domain.
At present, research on imbalanced data set classification falls into two areas: the data level and the algorithm level. Data-level methods focus on reconstructing the data by adding and deleting samples, while algorithm-level methods improve the classifier itself.
However, the new samples synthesized by the current data-level SMOTE method have ambiguous class membership, and the method does not consider the particularities of isolated points and boundary samples. At the algorithm level, the random forest method has outstanding advantages: it is a strong classifier composed of many decision trees. Independent, identically distributed random vectors govern the growth of each tree, and the output of the final model is determined by the mode (majority vote) of all the trees. Although the traditional random forest algorithm guarantees randomness, it does not take the particularity of imbalanced data sets into account, gives no extra resources to the minority class, and lets every tree vote when forming the decision forest, which is unfavorable for classifying imbalanced data sets.
Summary of the invention
The object of the present invention is to provide an efficient imbalanced data set classification method that can clearly separate the two classes of data and produce results close to the ground truth, while avoiding over-fitting to a certain extent.
The object of the invention is achieved through the following technical solutions:
An efficient imbalanced data set classification method, comprising:
performing k-nearest-neighbor and linear-interpolation calculations on the majority class samples and minority class samples in an imbalanced data set based on the BSMOTE sampling technique, and merging the resulting new minority class samples into the imbalanced data set, thereby obtaining a nearly balanced data set;
sampling the nearly balanced data set to obtain N subsets each containing approximately equal numbers of minority class and majority class samples, and extracting the features of the minority class and majority class samples in the N subsets to obtain a feature union; then dividing the feature union into two feature subsets by comparing each feature's occurrence count in the union with a threshold;
using stratified sampling to extract S reference feature sets from the two feature subsets, each reference feature set being used to construct one decision tree, thereby obtaining S decision trees;
based on an index measuring each decision tree's classification performance and a metric measuring the similarity between decision trees, selecting a number of decision trees from the S decision trees to form the final decision forest, thereby classifying the imbalanced data set.
Further, the step of performing k-nearest-neighbor and linear-interpolation calculations on the majority class samples and minority class samples in the imbalanced data set based on the BSMOTE sampling technique, and merging the resulting new minority class samples into the imbalanced data set, thereby obtaining a nearly balanced data set, includes:
The imbalanced data set contains two classes of samples: the larger class is the majority class and the smaller class is the minority class;
For every minority class sample, its k nearest neighbors are found in the sample space of the imbalanced data set; according to the ratio of majority class samples to minority class samples in the k-neighborhood, the minority class is divided into three sample sets: a safe sample set, a danger sample set and an isolated sample set;
Safe, danger and isolated samples are defined as follows: if the number of majority class samples in the k-neighborhood of the current minority class sample is less than the number of minority class samples, the current sample is a safe sample; if the number of majority class samples is no less than the number of minority class samples and the number of minority class samples is not zero, the current sample is a danger sample; if the k-neighborhood contains only majority class samples, the current sample is an isolated sample;
For each minority class sample in the danger set, its k nearest neighbors are found among all minority class samples, and s minority class samples randomly selected from the k-neighborhood are linearly interpolated with the original sample, synthesizing s new minority class samples; if the danger set contains d_num minority class samples, then s × d_num new minority class samples are synthesized in total;
The newly synthesized minority class samples are merged with the imbalanced data set to form the nearly balanced data set.
Further, the step of sampling the nearly balanced data set to obtain N subsets each containing approximately equal numbers of minority class and majority class samples, extracting the features of the two classes in the N subsets to obtain a feature union, and dividing the feature union into two feature subsets by comparing each feature's occurrence count with a threshold, includes:
The majority class samples of the nearly balanced data set are sampled with replacement into a bag based on the Bagging method, and the minority class samples are loaded into the same bag, so that the bag holds approximately equal numbers of the two classes; this operation is repeated N−1 more times, yielding N subsets in total;
The N subsets are used as the input training sets for feature selection, and a relevant feature selection method is applied for feature extraction, obtaining the feature union;
The occurrence count of each feature in the feature union is tallied; the more times a feature appears, the more important it is. Each feature's count is compared with a threshold δ1: features appearing more than δ1 times are put into feature subset A as good features, and features appearing no more than δ1 times are put into feature subset B as poor features.
Further, the step of using stratified sampling to extract S reference feature sets from the two feature subsets, each reference feature set being used to construct one decision tree, thereby obtaining S decision trees, includes:
Determining the ratio of the numbers of features in the sorted feature subsets A and B;
Drawing features from subsets A and B according to that ratio, and merging the features drawn from A and B each time into one reference feature set; performing S draws in total, thereby obtaining S reference feature sets;
For each reference feature set, choosing the split points of the tree model with an attribute-selection metric and constructing a decision tree, thereby obtaining S decision trees.
Further, the step of selecting a number of decision trees from the S decision trees based on the index measuring decision tree classification performance and the metric measuring the similarity between decision trees, forming the final decision forest and thereby classifying the imbalanced data set, includes:
Computing the index measuring each decision tree's classification performance: for every decision tree, the area under the ROC curve (AUC) is used as the index; the larger the AUC, the closer the ROC curve lies to the upper left and the better the classification performance;
Computing the metric measuring the similarity between decision trees: it is calculated from each tree's classification results on the imbalanced data;
Setting a classification-strength threshold δ2 and comparing each tree's classification-performance index with δ2, keeping the trees above the threshold; then setting a similarity threshold δ3, traversing the pairwise correlations of the remaining trees, and for each pair whose similarity exceeds δ3, deleting the tree with the weaker classification strength;
The decision trees surviving the above screening form the final decision forest; the results of the trees in the forest are tallied to obtain the corresponding classification result.
As can be seen from the technical solution provided above, the invention adds consideration of boundary samples and isolated points on the basis of the traditional SMOTE method, so that an approximately balanced data set can be obtained; decisions on the data class are then made with a subspace-selection and tree-model scheme based on ensemble random forests, which clearly separates the two classes of data, improving precision and recall. The results are close to the actual situation and the method can be applied in real industrial analysis.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed to describe the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the efficient imbalanced data set classification method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the BSMOTE sampling technique provided by an embodiment of the present invention;
Fig. 3 is a flow chart of the feature classification provided by an embodiment of the present invention;
Fig. 4 is a flow chart of decision tree construction based on stratified sampling provided by an embodiment of the present invention;
Fig. 5 is a flow chart of screening decision trees to form a decision forest provided by an embodiment of the present invention;
Fig. 6 is a schematic diagram of decision forest voting provided by an embodiment of the present invention.
Detailed description of the invention
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides an efficient imbalanced data set classification method which, as shown in Fig. 1, mainly comprises the following steps:
Step 11: performing k-nearest-neighbor and linear-interpolation calculations on the majority class samples and minority class samples in the imbalanced data set based on the BSMOTE sampling technique, and merging the resulting new minority class samples into the imbalanced data set, thereby obtaining a nearly balanced data set.
In this step, the traditional BSMOTE method is improved: following the k-nearest-neighbor and linear-interpolation idea, consideration of the samples near the decision boundary is added on top of the traditional SMOTE method. The principle is shown in Fig. 2 and detailed below:
In the embodiment of the present invention, the imbalanced data set contains two classes of samples: the larger class is the majority class (the black-background samples in Fig. 2, 10 in number) and the smaller class is the minority class (the white-background samples in Fig. 2, 4 in number);
For every minority class sample, its k nearest neighbors are found in the sample space of the imbalanced data set; according to the ratio of majority class samples to minority class samples in the k-neighborhood, the minority class is divided into three sample sets: a safe sample set, a danger sample set and an isolated sample set;
Safe, danger and isolated samples are defined as follows: if the number of majority class samples in the k-neighborhood of the current minority class sample is less than the number of minority class samples, the current sample is a safe sample; if the number of majority class samples is no less than the number of minority class samples and the number of minority class samples is not zero, the current sample is a danger sample; if the k-neighborhood contains only majority class samples, the current sample is an isolated sample;
In the embodiment of the present invention, the samples in the isolated set are likely to be noise points and are therefore not considered. Likewise, safe samples are generally held to have little impact on the classification performance of the minority class and are also not considered.
For each minority class sample in the danger set, its k nearest neighbors are found among all minority class samples, and s minority class samples randomly selected from the k-neighborhood are linearly interpolated with the original sample, synthesizing s new minority class samples; if the danger set contains d_num minority class samples, then s × d_num new minority class samples are synthesized in total;
The newly synthesized minority class samples are merged with the imbalanced data set to form the nearly balanced data set.
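As a minimal sketch of step 11 under illustrative assumptions (samples as coordinate tuples, squared-Euclidean neighbors, pure Python; the function names and data layout are ours, not the patent's), the safe/danger/isolated split and the interpolation over danger samples might look like:

```python
import random

def k_nearest(x, samples, k):
    """Indices of the k samples nearest to x (squared Euclidean distance),
    excluding x itself when it occurs in `samples`."""
    order = sorted(range(len(samples)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(x, samples[i])))
    return [i for i in order if samples[i] is not x][:k]

def categorize_minority(minority, majority, k):
    """Split minority samples into safe / danger / isolated sets by the
    majority-vs-minority count in each sample's k-neighborhood."""
    space = list(minority) + list(majority)
    is_minority = [True] * len(minority) + [False] * len(majority)
    safe, danger, isolated = [], [], []
    for x in minority:
        neigh = k_nearest(x, space, k)
        n_min = sum(1 for i in neigh if is_minority[i])
        n_maj = k - n_min
        if n_maj < n_min:
            safe.append(x)          # mostly minority neighbors
        elif n_min == 0:
            isolated.append(x)      # all neighbors are majority
        else:
            danger.append(x)        # near the decision boundary
    return safe, danger, isolated

def interpolate_danger(danger, minority, k, s, rng):
    """For each danger sample, pick s random minority neighbors from its
    k-neighborhood and linearly interpolate, yielding s new samples each."""
    new_samples = []
    for x in danger:
        neigh = [minority[i] for i in k_nearest(x, minority, k)]
        for nb in rng.sample(neigh, min(s, len(neigh))):
            gap = rng.random()  # random point on the segment x -> nb
            new_samples.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return new_samples
```

With s = 1 and two danger samples, this yields s × d_num = 2 synthetic minority samples, matching the count rule above.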
Step 12: sampling the nearly balanced data set to obtain N subsets each containing approximately equal numbers of minority class and majority class samples, and extracting the features of the minority class and majority class samples in the N subsets to obtain a feature union; then dividing the feature union into two feature subsets by comparing each feature's occurrence count with a threshold.
In this step, the nearly balanced data set obtained in step 11 is repeatedly sampled in a predetermined way to obtain N balanced bag subsets; ensemble feature selection is then performed on the N bag subsets, and according to the importance of the features, a good feature set and a poor feature set are selected for the imbalanced data. The procedure is shown in Fig. 3 and mainly includes:
1) The majority class samples of the nearly balanced data set are sampled with replacement into a bag based on the Bagging method, and the minority class samples are loaded into the same bag, so that the bag holds approximately equal numbers of the two classes; this operation is repeated N−1 more times, yielding N subsets in total.
2) The N subsets are used as the input training sets for feature selection, and a relevant feature selection method is applied for feature extraction, obtaining the feature union.
3) The occurrence count of each feature in the feature union is tallied; the more times a feature appears, the more important it is. Each feature's count is compared with a threshold δ1: features appearing more than δ1 times are put into feature subset A as good features, and features appearing no more than δ1 times are put into feature subset B as poor features.
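A sketch of step 12 under illustrative assumptions: the per-subset feature selection method itself is abstracted away, and each subset is simply assumed to yield a set of selected feature names. The helper names are ours:

```python
import random
from collections import Counter

def balanced_bags(majority, minority, n_bags, rng):
    """Step 12.1: each bag pairs the full minority set with an equal-sized
    with-replacement draw from the majority class."""
    return [[rng.choice(majority) for _ in range(len(minority))] + list(minority)
            for _ in range(n_bags)]

def split_feature_union(selected_feature_sets, delta1):
    """Steps 12.2-12.3: count each feature's occurrences across the N
    per-subset selections; counts above delta1 go to subset A (good),
    the rest to subset B (poor)."""
    counts = Counter(f for fs in selected_feature_sets for f in fs)
    good = {f for f, c in counts.items() if c > delta1}
    poor = set(counts) - good
    return good, poor
```

The with-replacement draw keeps each bag at roughly a 1:1 class ratio regardless of how skewed the original set is.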
Step 13: using stratified sampling to extract S reference feature sets from the two feature subsets, each reference feature set being used to construct one decision tree, thereby obtaining S decision trees.
This step is the decision tree construction process; as shown in Fig. 4, it mainly includes:
1) Determining the ratio of the numbers of features in the sorted feature subsets A and B.
2) Drawing features from subsets A and B according to that ratio, and merging the features drawn from A and B each time into one reference feature set; performing S draws in total, thereby obtaining S reference feature sets.
3) For each reference feature set, choosing the split points of the tree model with an attribute-selection metric and constructing a decision tree, thereby obtaining S decision trees.
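The stratified draw of step 13 can be sketched as below, under the assumption that each reference set is drawn without replacement within a single draw and that its A/B split follows the |A|:|B| ratio; the function name and size parameter are illustrative:

```python
import random

def reference_feature_sets(good, poor, s_trees, n_per_set, rng):
    """Step 13: draw S reference feature sets, splitting each draw between
    subsets A (good) and B (poor) in proportion to their sizes."""
    good, poor = sorted(good), sorted(poor)
    n_good = round(n_per_set * len(good) / (len(good) + len(poor)))
    n_poor = n_per_set - n_good
    return [rng.sample(good, n_good) + rng.sample(poor, n_poor)
            for _ in range(s_trees)]
```

With |A| = 12 and |B| = 8 (the 3:2 ratio of the worked example) and 10 features per set, each reference set takes 6 features from A and 4 from B.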
Step 14: based on the index measuring decision tree classification performance and the metric measuring the similarity between decision trees, selecting a number of decision trees from the S decision trees to form the final decision forest, thereby classifying the imbalanced data set.
This step is implemented as shown in Fig. 5 and mainly includes:
1) Computing the index measuring each decision tree's classification performance: for every decision tree, the area under the ROC curve (AUC) is used as the index; the larger the AUC, the closer the ROC curve lies to the upper left and the better the classification performance;
2) Computing the metric measuring the similarity between decision trees: it is calculated from each tree's classification results on the imbalanced data;
3) Setting a classification-strength threshold δ2 and comparing each tree's classification-performance index with δ2, keeping the trees above the threshold; then setting a similarity threshold δ3, traversing the pairwise correlations of the remaining trees, and for each pair whose similarity exceeds δ3, deleting the tree with the weaker classification strength;
4) The decision trees surviving the above screening form the final decision forest, as shown in Fig. 6; the results of the trees in the forest are tallied to obtain the corresponding classification result.
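A sketch of the screening in step 14, with two stated simplifications: similarity is measured as plain prediction agreement on a held-out set (a stand-in for the correlation coefficient in the text), and the pairwise "delete the weaker tree" rule is approximated by a greedy strongest-first scan. Names and tuple layout are illustrative:

```python
def agreement(p, q):
    """Fraction of held-out samples on which two trees' predictions agree;
    a simple stand-in for the correlation metric in the text."""
    return sum(a == b for a, b in zip(p, q)) / len(p)

def prune_forest(trees, delta2, delta3):
    """Step 14: `trees` is a list of (name, auc, predictions). Keep trees
    with AUC above delta2; then, scanning strongest first, drop any tree
    whose agreement with an already-kept tree exceeds delta3 (i.e. the
    weaker member of each over-similar pair)."""
    strong = sorted((t for t in trees if t[1] > delta2), key=lambda t: -t[1])
    kept = []
    for t in strong:
        if all(agreement(t[2], u[2]) <= delta3 for u in kept):
            kept.append(t)
    return kept
```

Scanning strongest-first guarantees that whenever two surviving candidates are over-similar, it is the lower-AUC one that is discarded, matching the deletion rule above.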
For ease of understanding, the invention is introduced below with a concrete example. Note that the specific values of the parameters in the following example are illustrative only and are not limiting; in practice, users can set them according to the actual situation.
The example follows the same steps as the embodiment above; the detailed process is as follows:
Following step 11, in Fig. 2 we have k = 4, s = 1 and d_num = 2. The figure shows 4 minority class samples and 10 majority class samples. Numbering the minority class samples 1 to 4 from left to right and choosing k = 4 neighbors, samples 1 and 4 are isolated samples by definition, while samples 2 and 3 are danger samples, so samples 2 and 3 undergo linear interpolation. The 4-neighborhood of sample 2 contains 2 minority class samples, of which 1 is selected at random, synthesizing 1 new minority class sample; similarly, the 4-neighborhood of sample 3 contains only 1 minority class sample, which is chosen to synthesize 1 new sample. In total, s × d_num = 1 × 2 = 2 new samples are synthesized, so the minority class grows to 4 + 2 = 6 samples; the gap with the majority class narrows and the data set approaches balance. When s and d_num are larger, more new samples than the original number of minority class samples can be synthesized.
Take a data set of 5000 samples and first balance it according to step 11. Suppose the original data set has 4800 majority class samples and 200 minority class samples, and the BSMOTE method synthesizes 400 minority class samples; the new data set then has 5400 samples in total: 4800 majority class and 600 minority class. The imbalance ratio of majority to minority class is thus adjusted from 24:1 to 8:1. Following the Bagging-based feature subset selection of step 12: all 600 minority class samples are chosen, 600 majority class samples are drawn at random with replacement, and together they form one subset. The operation is repeated N = 10 times, yielding 10 subsets. Feature extraction is performed on each subset, giving one feature set each, e.g. g1 = {D1, D2, D4, ..., D20}, g2 = {D2, D3, ..., D19}, g3 = {D1, D3, ..., D18, D20}, ..., g10 = {D1, D2, D3, ..., D19, D20}; these 10 feature sets are merged into one large feature union. The occurrence count of each feature Di in the union is then tallied; with the count threshold set to δ1 = 8, features appearing 8 or more times are put into feature set A, and features appearing fewer than 8 times are put into the poor feature set B.
Following step 13, feature set A now has 12 features and B has 8, a ratio of 3:2. Using stratified sampling, 6 features are drawn from A and 4 from B each time and merged into one reference feature set. S = 10 draws with replacement are performed in total, giving 10 reference feature sets. Tree models are then split and constructed on each, yielding 10 decision trees.
Following step 14, each decision tree makes decisions on the data set and its AUC is computed, giving the classification-performance index of the 10 trees. Suppose the classification-strength threshold is set to δ2 = 0.75; then 8 trees meet the requirement. For the similarity metric, a test set not included in training is prepared in advance and divided into 3 groups; each tree makes decisions on them, giving 3 results per tree, i.e. 3 × 8 results in total. The results of each pair of trees are then compared and their correlation coefficient computed, requiring C(8,2) = 8 × 7 / 2 = 28 comparisons. Suppose the similarity threshold is δ3 = 0.5, and it is found that trees 1 and 3, and trees 4 and 7, exceed it. The weaker tree of each pair is then deleted according to classification strength: e.g. trees 1 and 3 have strengths 0.81 and 0.77, and trees 4 and 7 have 0.82 and 0.85, so trees 3 and 4 are deleted, leaving trees 1 and 7. After screening, the remaining 6 decision trees form the decision forest. Finally, the mode of the 6 trees' results gives the final decision and prediction.
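The final mode vote over the surviving trees can be sketched as follows (the function name is illustrative):

```python
from collections import Counter

def mode_vote(tree_predictions):
    """Return the class predicted by the most trees for one sample."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

For example, if the 6 remaining trees predict [1, 0, 1, 1, 0, 1] for a test sample, the forest outputs class 1.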
Through the above description of the embodiments, those skilled in the art will clearly understand that the above embodiments can be implemented in software, or in software plus a necessary general hardware platform. Based on this understanding, the technical solution of the above embodiments can be embodied in the form of a software product, which can be stored on a non-volatile storage medium (a CD-ROM, USB flash drive, portable hard drive, etc.) and includes instructions causing a computer device (a personal computer, server, network device, etc.) to perform the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement readily conceivable by those familiar with the art within the technical scope disclosed by the present invention shall be encompassed within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (5)
1. An efficient imbalanced data set classification method, characterized by comprising:
performing k-nearest-neighbor and linear-interpolation calculations on the majority class samples and minority class samples in an imbalanced data set based on the BSMOTE sampling technique, and merging the resulting new minority class samples into the imbalanced data set, thereby obtaining a nearly balanced data set;
sampling the nearly balanced data set to obtain N subsets each containing approximately equal numbers of minority class and majority class samples, and extracting the features of the minority class and majority class samples in the N subsets to obtain a feature union; then dividing the feature union into two feature subsets by comparing each feature's occurrence count in the union with a threshold;
using stratified sampling to extract S reference feature sets from the two feature subsets, each reference feature set being used to construct one decision tree, thereby obtaining S decision trees;
based on an index measuring each decision tree's classification performance and a metric measuring the similarity between decision trees, selecting a number of decision trees from the S decision trees to form the final decision forest, thereby classifying the imbalanced data set.
2. The efficient imbalanced data set classification method according to claim 1, characterized in that the step of performing k-nearest-neighbor and linear-interpolation calculations on the majority class samples and minority class samples in the imbalanced data set based on the BSMOTE sampling technique, and merging the resulting new minority class samples into the imbalanced data set, thereby obtaining a nearly balanced data set, includes:
The imbalanced data set contains two classes of samples: the larger class is the majority class and the smaller class is the minority class;
For every minority class sample, its k nearest neighbors are found in the sample space of the imbalanced data set; according to the ratio of majority class samples to minority class samples in the k-neighborhood, the minority class is divided into three sample sets: a safe sample set, a danger sample set and an isolated sample set;
Safe, danger and isolated samples are defined as follows: if the number of majority class samples in the k-neighborhood of the current minority class sample is less than the number of minority class samples, the current sample is a safe sample; if the number of majority class samples is no less than the number of minority class samples and the number of minority class samples is not zero, the current sample is a danger sample; if the k-neighborhood contains only majority class samples, the current sample is an isolated sample;
For each minority class sample in the danger set, its k nearest neighbors are found among all minority class samples, and s minority class samples randomly selected from the k-neighborhood are linearly interpolated with the original sample, synthesizing s new minority class samples; if the danger set contains d_num minority class samples, then s × d_num new minority class samples are synthesized in total;
The newly synthesized minority class samples are merged with the imbalanced data set to form the nearly balanced data set.
3. The efficient imbalanced data set classification method according to claim 1, characterized in that sampling the approximately balanced data set to obtain N subsets each containing approximately equal numbers of minority and majority class samples, extracting the features of the minority and majority class samples of the N subsets respectively to obtain a feature set, and dividing the feature set into two feature subsets by comparing the occurrence count of each feature with a threshold, comprises:
Sampling the majority class samples of the approximately balanced data set with replacement into a bag based on the Bagging method, then loading the minority class samples into the bag, so that the bag contains approximately equal numbers of samples of the two classes; repeating this operation N−1 more times to obtain N subsets in total;
Using the N subsets as the training sample sets for feature selection, and applying a correlation-based feature selection method to extract features and obtain the feature set;
Counting the number of occurrences of each feature in the feature set, where a higher occurrence count indicates a more important feature, and comparing each feature's occurrence count with a threshold δ1: features occurring more than δ1 times are regarded as good features and placed into feature subset A, and features occurring no more than δ1 times are regarded as poor features and placed into feature subset B.
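The bagging-and-counting procedure of claim 3 can be sketched as below. The |corr(feature, label)| score is an illustrative stand-in for the unspecified "correlation-based feature selection method", and the parameter names (`top_m`, `delta1`) are assumptions.

```python
import numpy as np

def split_features_by_occurrence(X, y, n_subsets=10, top_m=1, delta1=5,
                                 minority=1, seed=0):
    """Build N approximately balanced bags (majority class drawn with
    replacement, all minority samples added), select the top-m features in
    each bag, and split features into subset A (occurrence count > delta1,
    'good') and subset B (count <= delta1, 'poor')."""
    rng = np.random.default_rng(seed)
    maj_idx = np.where(y != minority)[0]
    min_idx = np.where(y == minority)[0]
    counts = np.zeros(X.shape[1], dtype=int)
    for _ in range(n_subsets):
        bag = rng.choice(maj_idx, size=len(min_idx), replace=True)  # Bagging
        idx = np.concatenate([bag, min_idx])        # balanced subset
        lab = (y[idx] == minority).astype(float)
        # stand-in selector: absolute correlation of each feature with the label
        score = np.array([abs(np.corrcoef(X[idx, j], lab)[0, 1])
                          for j in range(X.shape[1])])
        score = np.nan_to_num(score)                # guard constant features
        counts[np.argsort(-score)[:top_m]] += 1     # top-m features of this bag
    A = np.where(counts > delta1)[0]                # 'good' features
    B = np.where(counts <= delta1)[0]               # 'poor' features
    return A, B, counts
```

On synthetic data where only feature 0 tracks the label, feature 0 is selected in every bag and lands in subset A, while the noise features fall into subset B.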
4. The efficient imbalanced data set classification method according to claim 1, characterized in that extracting S reference feature sets from the two feature subsets by stratified sampling, and using each reference feature set to construct a decision tree so as to obtain S decision trees, comprises:
Determining the ratio between the numbers of features in the sorted feature subsets A and B;
Extracting features according to the ratio of feature subset A to feature subset B, and merging the features extracted from subsets A and B in each draw into one reference feature set; performing S draws in total to obtain S reference feature sets;
For each reference feature set, choosing the split points of the tree model with an attribute-selection metric and constructing a decision tree, thereby obtaining S decision trees.
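The two ingredients of claim 4, ratio-preserving feature draws and an attribute-selection metric for split points, can be sketched as follows. Gini impurity is used as the (unspecified) attribute-selection metric, and full tree growth is omitted; this is a sketch, not the patent's implementation.

```python
import numpy as np

def stratified_feature_groups(A, B, group_size, n_groups, seed=0):
    """Draw n_groups reference feature sets, each keeping the |A| : |B|
    ratio between 'good' and 'poor' features (without replacement
    inside each draw)."""
    rng = np.random.default_rng(seed)
    n_a = max(1, round(group_size * len(A) / (len(A) + len(B))))
    n_b = group_size - n_a
    groups = []
    for _ in range(n_groups):
        g = np.concatenate([rng.choice(A, size=min(n_a, len(A)), replace=False),
                            rng.choice(B, size=min(n_b, len(B)), replace=False)])
        groups.append(np.sort(g))
    return groups

def gini(labels):
    # Gini impurity of a label vector
    _, c = np.unique(labels, return_counts=True)
    p = c / c.sum()
    return 1.0 - float(np.sum(p ** 2))

def best_split(X, y, feats):
    """Pick the (feature, threshold) with the lowest weighted Gini impurity:
    the split-point choice made at each node when growing a tree."""
    best_j, best_t, best_imp = None, None, np.inf
    for j in feats:
        for t in np.unique(X[:, j])[:-1]:           # candidate split points
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if imp < best_imp:
                best_j, best_t, best_imp = j, t, imp
    return best_j, best_t, best_imp
```

For example, with 4 'good' and 8 'poor' features and groups of size 3, each draw contains one feature from A and two from B, preserving the 1:2 ratio.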
5. The efficient imbalanced data set classification method according to claim 1, 3 or 4, characterized in that screening a number of decision trees out of the S decision trees, based on an index measuring the classification performance of a decision tree and a metric measuring the similarity between decision trees, so as to form the final decision forest and thereby classify the imbalanced data set, comprises:
Computing the index measuring classification performance: for each decision tree, the area under the ROC curve (AUC) is used as the index; the larger the AUC, the closer the ROC curve lies to the upper left and the better the classification performance;
Computing the metric of similarity between decision trees: the classification behaviour of the decision trees on the unbalanced data is used to obtain the similarity metric between pairs of decision trees;
Setting a classification-strength threshold δ2 and comparing the performance index of each decision tree with δ2, retaining the decision trees whose index exceeds the threshold; setting a similarity threshold δ3, traversing the pairwise correlations of the retained decision trees, and for each pair whose similarity exceeds δ3, deleting the tree with the weaker classification strength;
The decision trees surviving the screening form the final decision forest; the results of the decision trees in the forest are aggregated to obtain the corresponding classification result.
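The two-stage screening of claim 5 can be sketched as below. The claim does not pin down the similarity metric, so prediction agreement is used as an illustrative stand-in; the AUC is computed rank-based (Mann-Whitney U) so that 0/1 tree outputs with ties are handled correctly.

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U statistic); ties receive the
    average rank of their group."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(scores, kind="stable")
    ranks = np.empty(len(scores))
    i = 0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1    # average rank over the tie
        i = j + 1
    pos = labels == 1
    n_pos, n_neg = int(pos.sum()), int((~pos).sum())
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def prune_forest(preds, y, delta2=0.7, delta3=0.8):
    """Keep trees whose AUC exceeds delta2; among the survivors, for every
    pair whose prediction agreement exceeds delta3, drop the weaker tree."""
    preds = np.asarray(preds)
    aucs = np.array([auc(p, y) for p in preds])
    keep = [i for i in range(len(preds)) if aucs[i] > delta2]
    removed = set()
    for a in range(len(keep)):
        for b in range(a + 1, len(keep)):
            i, j = keep[a], keep[b]
            if i in removed or j in removed:
                continue
            if np.mean(preds[i] == preds[j]) > delta3:      # too similar
                removed.add(i if aucs[i] < aucs[j] else j)  # drop weaker
    return [i for i in keep if i not in removed], aucs
```

With three trees, one perfect, one near-identical to it, and one random, the random tree is rejected by δ2 and the near-duplicate is rejected by δ3, leaving only the strongest tree in the forest.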
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610114730.3A CN105760889A (en) | 2016-03-01 | 2016-03-01 | Efficient imbalanced data set classification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105760889A true CN105760889A (en) | 2016-07-13 |
Family
ID=56331548
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610114730.3A Pending CN105760889A (en) | 2016-03-01 | 2016-03-01 | Efficient imbalanced data set classification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105760889A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104794195A (en) * | 2015-04-17 | 2015-07-22 | 南京大学 | Data mining method for finding potential telecommunication users changing cell phones |
Non-Patent Citations (3)
Title |
---|
HUI HAN, WEN-YUAN WANG, AND BING-HUAN MAO: "Advances in Intelligent Computing", 31 December 2005, Springer Berlin Heidelberg *
WANG HE-YONG: "Combination approach of SMOTE and biased-SVM for imbalanced datasets", IEEE International Joint Conference on Neural Networks (IJCNN) *
XIAO Jian: "Research on Imbalanced Data Classification Methods Based on Random Forest", China Master's Theses Full-text Database *
Cited By (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106294490B (en) * | 2015-06-08 | 2019-12-24 | 富士通株式会社 | Feature enhancement method and device for data sample and classifier training method and device |
CN106294490A (en) * | 2015-06-08 | 2017-01-04 | 富士通株式会社 | The feature Enhancement Method of data sample and device and classifier training method and apparatus |
CN106203519A (en) * | 2016-07-17 | 2016-12-07 | 合肥赑歌数据科技有限公司 | Fault pre-alarming algorithm based on taxonomic clustering |
CN106372655A (en) * | 2016-08-26 | 2017-02-01 | 南京邮电大学 | Synthetic method for minority class samples in non-balanced IPTV data set |
CN106529598A (en) * | 2016-11-11 | 2017-03-22 | 北京工业大学 | Classification method and system based on imbalanced medical image data set |
CN106529598B (en) * | 2016-11-11 | 2020-05-08 | 北京工业大学 | Method and system for classifying medical image data sets based on imbalance |
CN107391569A (en) * | 2017-06-16 | 2017-11-24 | 阿里巴巴集团控股有限公司 | Identification, model training, Risk Identification Method, device and the equipment of data type |
US11100220B2 (en) | 2017-06-16 | 2021-08-24 | Advanced New Technologies Co., Ltd. | Data type recognition, model training and risk recognition methods, apparatuses and devices |
US11113394B2 (en) | 2017-06-16 | 2021-09-07 | Advanced New Technologies Co., Ltd. | Data type recognition, model training and risk recognition methods, apparatuses and devices |
CN107391569B (en) * | 2017-06-16 | 2020-09-15 | 阿里巴巴集团控股有限公司 | Data type identification, model training and risk identification method, device and equipment |
CN107437095A (en) * | 2017-07-24 | 2017-12-05 | 腾讯科技(深圳)有限公司 | Classification determines method and device |
CN107451694B (en) * | 2017-08-03 | 2020-10-02 | 重庆大学 | Application prediction method for context awareness and self-adaptation in mobile system |
CN107451694A (en) * | 2017-08-03 | 2017-12-08 | 重庆大学 | It is a kind of to be used for context-aware and adaptive applied forecasting method in mobile system |
CN109558962A (en) * | 2017-09-26 | 2019-04-02 | 中国移动通信集团山西有限公司 | Predict device, method and storage medium that telecommunication user is lost |
CN108647138A (en) * | 2018-02-27 | 2018-10-12 | 中国电子科技集团公司电子科学研究院 | A kind of Software Defects Predict Methods, device, storage medium and electronic equipment |
CN108960561A (en) * | 2018-05-04 | 2018-12-07 | 阿里巴巴集团控股有限公司 | A kind of air control model treatment method, device and equipment based on unbalanced data |
CN109902805A (en) * | 2019-02-22 | 2019-06-18 | 清华大学 | The depth measure study of adaptive sample synthesis and device |
CN110135614A (en) * | 2019-03-26 | 2019-08-16 | 广东工业大学 | It is a kind of to be tripped prediction technique based on rejecting outliers and the 10kV distribution low-voltage of sampling techniques |
CN110084609A (en) * | 2019-04-23 | 2019-08-02 | 东华大学 | A kind of transaction swindling behavior depth detection method based on representative learning |
CN110084609B (en) * | 2019-04-23 | 2023-06-02 | 东华大学 | Transaction fraud behavior deep detection method based on characterization learning |
WO2020220220A1 (en) * | 2019-04-29 | 2020-11-05 | 西门子(中国)有限公司 | Classification model training method and device, and computer-readable medium |
CN110991551A (en) * | 2019-12-13 | 2020-04-10 | 北京百度网讯科技有限公司 | Sample processing method, sample processing device, electronic device and storage medium |
CN110991551B (en) * | 2019-12-13 | 2023-09-15 | 北京百度网讯科技有限公司 | Sample processing method, device, electronic equipment and storage medium |
CN112560900A (en) * | 2020-09-08 | 2021-03-26 | 同济大学 | Multi-disease classifier design method for sample imbalance |
CN112560900B (en) * | 2020-09-08 | 2023-01-20 | 同济大学 | Multi-disease classifier design method for sample imbalance |
CN112463972B (en) * | 2021-01-28 | 2021-05-18 | 成都数联铭品科技有限公司 | Text sample classification method based on class imbalance |
CN112463972A (en) * | 2021-01-28 | 2021-03-09 | 成都数联铭品科技有限公司 | Sample classification method based on class imbalance |
CN113434401A (en) * | 2021-06-24 | 2021-09-24 | 杭州电子科技大学 | Software defect prediction method based on sample distribution characteristics and SPY algorithm |
CN113628701A (en) * | 2021-08-12 | 2021-11-09 | 上海大学 | Material performance prediction method and system based on density unbalance sample data |
CN113628701B (en) * | 2021-08-12 | 2024-04-26 | 上海大学 | Material performance prediction method and system based on density imbalance sample data |
CN115544902A (en) * | 2022-11-29 | 2022-12-30 | 四川骏逸富顿科技有限公司 | Pharmacy risk level identification model generation method and pharmacy risk level identification method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105760889A (en) | Efficient imbalanced data set classification method | |
Skryjomski et al. | Influence of minority class instance types on SMOTE imbalanced data oversampling | |
CN107391772B (en) | Text classification method based on naive Bayes | |
CN103164713B (en) | Image classification method and device | |
CN109344618B (en) | Malicious code classification method based on deep forest | |
CN104598586B (en) | The method of large-scale text categorization | |
CN106096727A (en) | A kind of network model based on machine learning building method and device | |
CN105426426A (en) | KNN text classification method based on improved K-Medoids | |
CN104834940A (en) | Medical image inspection disease classification method based on support vector machine (SVM) | |
CN101807254A (en) | Implementation method for data characteristic-oriented synthetic kernel support vector machine | |
CN111062425B (en) | Unbalanced data set processing method based on C-K-SMOTE algorithm | |
CN112036476A (en) | Data feature selection method and device based on two-classification service and computer equipment | |
CN112699936B (en) | Electric power CPS generalized false data injection attack identification method | |
Xie et al. | Imbalanced big data classification based on virtual reality in cloud computing | |
JP5765583B2 (en) | Multi-class classifier, multi-class classifying method, and program | |
CN117371511A (en) | Training method, device, equipment and storage medium for image classification model | |
JP5892275B2 (en) | Multi-class classifier generation device, data identification device, multi-class classifier generation method, data identification method, and program | |
CN115545111B (en) | Network intrusion detection method and system based on clustering self-adaptive mixed sampling | |
Chang | The application of machine learning models in company bankruptcy prediction | |
Ma | The Research of Stock Predictive Model based on the Combination of CART and DBSCAN | |
CN111383716B (en) | Screening method, screening device, screening computer device and screening storage medium | |
CN103207893A (en) | Classification method of two types of texts on basis of vector group mapping | |
Dražić et al. | Technology matching of the patent documents using clustering algorithms | |
CN114185956A (en) | Data mining method based on canty and k-means algorithm | |
JP4125951B2 (en) | Text automatic classification method and apparatus, program, and recording medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20160713 |