CN106548196A - A random forest sampling method and device for imbalanced data - Google Patents

A random forest sampling method and device for imbalanced data

Info

Publication number
CN106548196A
CN106548196A (application CN201610914533.XA)
Authority
CN
China
Prior art keywords
cluster
clustering cluster
clustering
random forest
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610914533.XA
Other languages
Chinese (zh)
Inventor
陈会
赵鹤
蔡芷铃
姜青山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201610914533.XA
Publication of CN106548196A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the field of imbalanced-data classification, and in particular to a random forest sampling method and device for imbalanced data. The method includes: Step a: cluster the training set multiple times to obtain multiple groups of clustering results; Step b: compute the information entropy of each cluster in the clustering results and, according to the entropy of each cluster, partition the clusters into different cluster groups; Step c: draw a certain number of samples from each cluster group by stratified sampling to obtain class-balanced random forest training data subsets. By clustering repeatedly and grouping the clustering results, the invention captures the internal subclass structure of each class, improving classification precision; and by adjusting each sample's extraction probability in subsequent sampling according to how often it has already been drawn, it improves classification performance on imbalanced data.

Description

A random forest sampling method and device for imbalanced data
Technical field
The present invention relates to the field of imbalanced-data classification, and in particular to a random forest sampling method and device for imbalanced data.
Background technology
In a classification dataset, when one class accounts for the overwhelming majority of the data while the other classes are comparatively rare, the dataset is called class-imbalanced (Class-Imbalanced Data). In imbalanced data, the dominant class is called the majority class (Majority Class) and the other classes are called minority classes (Minority Class). Imbalanced data is ubiquitous in real applications, for example rare-disease prediction, fraud detection, and risk management. Traditional machine learning algorithms assume balanced data when building a classifier, so the classifier tends toward the majority class and its performance degrades on imbalanced data.
The random forest algorithm (Random Forests, RF) is an ensemble learning method that builds a model from multiple unpruned decision trees (Decision Trees). Because it performs well on classification and regression problems, it is widely used in data mining. However, it likewise suffers low classification performance on imbalanced data. The algorithm uses the Bagging method with sampling with replacement (Sampling with Replacement), so each randomly drawn subset may contain even fewer minority-class samples than the original dataset; the trees constructed from such subsets tend toward the majority class, making the performance drop of the random forest algorithm on imbalanced data even more pronounced.
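The effect described above is easy to see numerically: drawing n samples with replacement (Bagging's bootstrap) from a 95:5 class mix yields subsets whose minority count fluctuates around 5 and can drop to zero. A minimal sketch with illustrative toy labels (not data from the patent):

```python
import random

def bootstrap(items, rng):
    """One Bagging subset: len(items) draws with replacement, as RF does."""
    return [rng.choice(items) for _ in range(len(items))]

rng = random.Random(0)
labels = ['maj'] * 95 + ['min'] * 5           # a 95:5 imbalanced toy label set
counts = [bootstrap(labels, rng).count('min') for _ in range(1000)]
avg_minority = sum(counts) / len(counts)      # fluctuates around the true count of 5
```

Across many bootstraps the minority count averages near 5, but individual subsets can nearly or entirely miss the minority class, which is exactly the bias the patent's sampling scheme targets.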
Various sampling methods have been proposed and studied for imbalanced data. Chawla et al. proposed a sampling method called SMOTE, which synthetically generates minority-class samples and combines undersampling and oversampling to improve performance on imbalanced data. A method called SMOTE-Boost was later proposed to improve performance further. Chen et al. proposed the Balanced Random Forests model (BRF) and the Weighted Random Forests model (WRF). BRF combines sampling with ensemble learning: equal numbers of samples are drawn repeatedly from the minority and majority classes to form each Bagging subset. WRF borrows from cost-sensitive learning (Cost Sensitive Learning), setting larger penalty coefficients when building decision-tree nodes and higher weights for the minority class at leaf nodes to improve minority-class classification performance. Krawczyk et al. proposed another cost-sensitive ensemble method, EG2Ensemble, which builds cost-sensitive decision trees with the EG2 method and then forms the ensemble using random subspaces and a genetic algorithm.
As noted above, the existing sampling algorithms share a drawback: although they account for the small share of the minority class and increase its sample count by oversampling, they ignore the internal structure of the minority class and sample it as a whole. Consequently, the subclasses of the already-rare minority class are sampled with even lower probability, so they are insufficiently distinguished during learning, which hurts algorithm performance.
Summary of the invention
The present invention provides a random forest sampling method and device for imbalanced data, intended to solve, at least to some extent, one of the above technical problems in the prior art.
To solve the above problems, the present invention provides the following technical solutions:
A random forest sampling method for imbalanced data, including:
Step a: cluster the training set multiple times to obtain multiple groups of clustering results;
Step b: compute the information entropy of each cluster in the clustering results and, according to the information entropy of each cluster, partition the clusters into different cluster groups;
Step c: draw a certain number of samples from each cluster group by stratified sampling to obtain class-balanced random forest training data subsets.
The technical solution adopted by the embodiments of the present invention further includes: in step a, the training set is clustered multiple times as follows: let C be the set of all clustering results, initialized to the empty set. In each clustering pass, randomly select k ∈ [q, min{4q, n/10}] as the number of clusters, generate k disjoint clusters P = {p_1, …, p_k} with the k-means algorithm, and merge the generated clusters into C, obtaining C = C ∪ P.
The technical solution adopted by the embodiments of the present invention further includes: in step b, the information entropy of each cluster is computed as Entropy(c) = −∑_{l∈L} p(l|c) log p(l|c), and according to the information entropy the cluster set C is partitioned into q disjoint cluster groups G = {g_1, …, g_q} satisfying:
Entropy(c_i) > Entropy(c_j), ∀ c_i ∈ g_i′, c_j ∈ g_j′, i′ > j′;  |g_i| ≡ |g_j|, ∀ i, j ∈ [1, q]
In the above formula, |g_i| denotes the number of clusters in group g_i.
The technical solution adopted by the embodiments of the present invention further includes: in step c, the class-balanced random forest training data subsets are obtained as follows: a certain proportion of samples is drawn from each cluster group to construct a Bagging subset, and the extraction is repeated t times to obtain t class-balanced Bagging subsets. When the first Bagging subset is constructed, n samples are drawn repeatedly from the cluster groups G: for each sample, a cluster group g is first selected from G with equal probability, a cluster c is then drawn from g with probability p(c|g_j), and a sample d is drawn from c with probability p_j(d), where c ∈ g_j.
The technical solution adopted by the embodiments of the present invention further includes: in step c, obtaining the class-balanced random forest training data subsets also includes updating the extraction probability p_b(c_l|g_j) of each cluster c_l ∈ g_j according to f_jl, the frequency with which cluster c_l ∈ g_j has been drawn.
Another technical solution adopted by the embodiments of the present invention is a random forest sampling device for imbalanced data, including:
a clustering module, used to cluster the training set multiple times to obtain multiple groups of clustering results;
an information entropy computing module, used to compute the information entropy of each cluster in the clustering results and, according to the information entropy of each cluster, partition the clusters into different cluster groups;
a sampling module, used to draw a certain number of samples from each cluster group by stratified sampling to obtain class-balanced random forest training data subsets.
The technical solution adopted by the embodiments of the present invention further includes: the clustering module clusters the training set multiple times as follows: let C be the set of all clustering results, initialized to the empty set. In each clustering pass, randomly select k ∈ [q, min{4q, n/10}] as the number of clusters, generate k disjoint clusters P = {p_1, …, p_k} with the k-means algorithm, and merge the generated clusters into C, obtaining C = C ∪ P.
The technical solution adopted by the embodiments of the present invention further includes: the information entropy computing module computes the information entropy of each cluster as Entropy(c) = −∑_{l∈L} p(l|c) log p(l|c), and according to the information entropy partitions the cluster set C into q disjoint cluster groups G = {g_1, …, g_q} satisfying:
Entropy(c_i) > Entropy(c_j), ∀ c_i ∈ g_i′, c_j ∈ g_j′, i′ > j′;  |g_i| ≡ |g_j|, ∀ i, j ∈ [1, q]
In the above formula, |g_i| denotes the number of clusters in group g_i.
The technical solution adopted by the embodiments of the present invention further includes: the sampling module obtains the class-balanced random forest training data subsets as follows: a certain proportion of samples is drawn from each cluster group to construct a Bagging subset, and the extraction is repeated t times to obtain t class-balanced Bagging subsets. When the first Bagging subset is constructed, n samples are drawn repeatedly from the cluster groups G: for each sample, a cluster group g is first selected from G with equal probability, a cluster c is then drawn from g with probability p(c|g_j), and a sample d is drawn from c with probability p_j(d), where c ∈ g_j.
The technical solution adopted by the embodiments of the present invention further includes: the sampling module also updates the extraction probability p_b(c_l|g_j) of each cluster c_l ∈ g_j according to f_jl, the frequency with which cluster c_l ∈ g_j has been drawn.
Compared with the prior art, the embodiments of the present invention have the following beneficial effects: by clustering repeatedly and grouping the clustering results, the random forest sampling method and device for imbalanced data capture the internal subclass structure of each class, improving classification precision; and by adjusting each sample's extraction probability in subsequent sampling according to the frequency with which it has been drawn, they guarantee balanced, diverse, weakly coupled training subsets, improving the diversity of the random forest model and its classification performance on imbalanced data.
Description of the drawings
Fig. 1 is a flowchart of the random forest sampling method for imbalanced data according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the random forest sampling device for imbalanced data according to an embodiment of the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein serve only to explain the present invention, not to limit it.
The random forest sampling method and device for imbalanced data of the embodiments of the present invention improve the classification performance of the random forest algorithm on imbalanced data by constructing a group of weakly correlated, balanced data subsets from the imbalanced data. First, the training set is clustered multiple times to obtain multiple groups of clustering results; then, by stratified sampling, a number of samples directly proportional to cluster entropy and inversely proportional to cluster size is drawn from each cluster to form the Bagging subsets of the random forest, guaranteeing the diversity and balance of the Bagging subsets and thereby improving random forest classification performance.
Specifically, referring to Fig. 1, the flowchart of the random forest sampling method for imbalanced data of the embodiment of the present invention: assume the training set D is an n × m matrix with n samples and m attributes. Given the number of clustering passes r and the number t of Bagging subsets to construct, the method comprises the following steps:
Step 100: cluster the training set r times to obtain r clustering results;
In step 100, the training set is clustered as follows: let C be the set of all clustering results, initially the empty set. In each pass, randomly select k ∈ [q, min{4q, n/10}] as the number of clusters, generate k disjoint clusters P = {p_1, …, p_k} with the k-means algorithm (a hard clustering algorithm), and merge the result into C, obtaining C = C ∪ P.
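Step 100 can be sketched in Python. The minimal k-means below (Lloyd's algorithm on 2-D points) is a stand-in for any k-means implementation, and the `q`, `r` values in the usage are illustrative, not from the patent:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means (Lloyd's algorithm); stands in for the patent's
    k-means step -- any k-means implementation would do."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center (squared Euclidean)
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # recompute each non-empty cluster's center as the mean
                centers[i] = [sum(xs) / len(cl) for xs in zip(*cl)]
    return [cl for cl in clusters if cl]

def repeated_clustering(points, q, r, seed=0):
    """Step 100: cluster the training set r times, with k drawn uniformly
    from [q, min(4q, n/10)], collecting every resulting cluster into C."""
    rng = random.Random(seed)
    n = len(points)
    pool = []  # the set C of all clustering results
    for t in range(r):
        k = rng.randint(q, max(q, min(4 * q, n // 10)))
        pool.extend(kmeans(points, k, seed=seed + t + 1))
    return pool
```

Since each pass partitions all n points, the pooled clusters from r passes cover every sample r times.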
Step 200: compute the information entropy of each cluster in the r clustering results and, according to the information entropy of each cluster, partition the cluster set C into q different cluster groups;
In step 200, for any cluster c ∈ C, the information entropy of the cluster is first computed to measure its class distribution:
Entropy(c) = −∑_{l∈L} p(l|c) log p(l|c)   (1)
Then, according to the information entropy of each cluster, the cluster set C is partitioned into q disjoint cluster groups G = {g_1, …, g_q} satisfying:
Entropy(c_i) > Entropy(c_j), ∀ c_i ∈ g_i′, c_j ∈ g_j′, i′ > j′;  |g_i| ≡ |g_j|, ∀ i, j ∈ [1, q]   (2)
In formula (2), |g_i| denotes the number of clusters in group g_i.
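Formulas (1) and (2) can be sketched as follows: each cluster is represented by the class labels of its samples, entropy is computed per formula (1), and the clusters are ranked by entropy and cut into q equally sized groups per formula (2). The divisibility of the cluster count by q is assumed, matching the |g_i| ≡ |g_j| constraint:

```python
import math

def cluster_entropy(labels):
    """Formula (1): Entropy(c) = -sum over classes l of p(l|c) * log p(l|c),
    computed from the class labels of the samples in one cluster."""
    n = len(labels)
    return -sum((labels.count(l) / n) * math.log(labels.count(l) / n)
                for l in set(labels))

def group_by_entropy(clusters, q):
    """Formula (2), sketched: rank clusters by entropy and cut the ranking
    into q equally sized, disjoint groups; higher-entropy clusters land in
    higher-indexed groups. Assumes len(clusters) is divisible by q."""
    ranked = sorted(clusters, key=cluster_entropy)
    size = len(ranked) // q
    return [ranked[i * size:(i + 1) * size] for i in range(q)]
```

A pure single-class cluster has entropy 0, while a 50/50 two-class cluster has entropy log 2, so the former sorts into a lower-indexed group than the latter.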
Step 300: draw a certain proportion of samples from each cluster group to construct a Bagging subset, and repeat the extraction t times to obtain t class-balanced Bagging subsets;
In step 300, when the first Bagging subset is constructed, n samples are drawn repeatedly from the cluster groups G. For each sample, a cluster group g is first selected from G with equal probability; a cluster c is then drawn from g with probability p(c|g_j), and a sample d is drawn from c with probability p_j(d), where c ∈ g_j and p_j(d) is defined in terms of count_j(l), the frequency with which class l ∈ L appears in cluster group g_j.
Let F = {f_jl | 1 ≤ j ≤ q, 1 ≤ l ≤ |g_j|} be a frequency matrix, where f_jl is the number of times cluster c_l ∈ g_j has been drawn. To obtain the b-th Bagging subset, the extraction probability p_b(c_l|g_j) of each cluster c_l ∈ g_j is updated according to F.
Adjusting each sample's extraction probability in subsequent sampling according to the frequency with which it has been drawn guarantees balanced and diverse Bagging subsets, improving the classification performance of the random forest algorithm on imbalanced data.
The pseudo-code of the algorithm of the invention combines steps 100-300 above.
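The sampling loop of step 300 can be sketched in Python. This is an illustrative reconstruction, not the patent's own listing: the exact cluster-selection probability p(c|g_j) and sample-selection probability p_j(d) are given by formulas not reproduced in this text, so the uniform within-cluster draws and the inverse-frequency update p_b(c_l|g_j) ∝ 1/(1 + f_jl) below are stated assumptions, consistent with the description's frequency-based adjustment:

```python
import random

def draw_bagging_subsets(groups, n, t, seed=0):
    """Step 300, sketched: build t Bagging subsets of n samples each by
    stratified sampling over cluster groups. `groups` is a list of cluster
    groups, each a list of clusters, each a list of samples."""
    rng = random.Random(seed)
    freq = [[0] * len(g) for g in groups]  # f_jl: draws of cluster l in group j
    subsets = []
    for _ in range(t):
        subset = []
        for _ in range(n):
            j = rng.randrange(len(groups))            # group g: equiprobable
            w = [1.0 / (1 + f) for f in freq[j]]      # assumed inverse-frequency rule
            l = rng.choices(range(len(groups[j])), weights=w)[0]
            freq[j][l] += 1
            subset.append(rng.choice(groups[j][l]))   # sample d: uniform in cluster
        subsets.append(subset)
    return subsets
```

Because clusters that have already been drawn often are down-weighted, later subsets favor so-far-underrepresented clusters, which is the balancing behavior the description ascribes to the frequency matrix F.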
Step 400: train the random forest model with the t Bagging subsets.
Referring to Fig. 2, the schematic structural diagram of the random forest sampling device for imbalanced data of the embodiment of the present invention: the device comprises a clustering module, an information entropy computing module, a sampling module, and a model training module.
Clustering module: used to cluster the training set r times to obtain r clustering results. The training set is clustered as follows: let C be the set of all clustering results, initially the empty set. In each pass, randomly select k ∈ [q, min{4q, n/10}] as the number of clusters, generate k disjoint clusters P = {p_1, …, p_k} with the k-means algorithm (a hard clustering algorithm), and merge the result into C, obtaining C = C ∪ P.
Information entropy computing module: used to compute the information entropy of each cluster in the r clustering results and, according to the information entropy of each cluster, partition the cluster set C into q different cluster groups. For any cluster c ∈ C, the information entropy of the cluster is first computed to measure its class distribution:
Entropy(c) = −∑_{l∈L} p(l|c) log p(l|c)   (1)
Then, according to the information entropy of each cluster, the cluster set C is partitioned into q disjoint cluster groups G = {g_1, …, g_q} satisfying:
Entropy(c_i) > Entropy(c_j), ∀ c_i ∈ g_i′, c_j ∈ g_j′, i′ > j′;  |g_i| ≡ |g_j|, ∀ i, j ∈ [1, q]   (2)
In formula (2), |g_i| denotes the number of clusters in group g_i.
Sampling module: used to draw a certain proportion of samples from each cluster group to construct a Bagging subset, repeating the extraction t times to obtain t class-balanced Bagging subsets. When the first Bagging subset is constructed, n samples are drawn repeatedly from the cluster groups G: for each sample, a cluster group g is first selected from G with equal probability; a cluster c is then drawn from g with probability p(c|g_j), and a sample d is drawn from c with probability p_j(d), where c ∈ g_j and p_j(d) is defined in terms of count_j(l), the frequency with which class l ∈ L appears in cluster group g_j. Let F = {f_jl | 1 ≤ j ≤ q, 1 ≤ l ≤ |g_j|} be a frequency matrix, where f_jl is the number of times cluster c_l ∈ g_j has been drawn; to obtain the b-th Bagging subset, the extraction probability p_b(c_l|g_j) of each cluster c_l ∈ g_j is updated according to F.
Adjusting each sample's extraction probability in subsequent sampling according to the frequency with which it has been drawn guarantees balanced and diverse Bagging subsets, improving the classification performance of the random forest algorithm on imbalanced data.
Model training module: used to train the random forest model with the t Bagging subsets.
By clustering repeatedly and grouping the clustering results, the random forest sampling method and device for imbalanced data of the embodiments of the present invention capture the internal subclass structure of each class, improving classification precision; and by adjusting each sample's extraction probability in subsequent sampling according to the frequency with which it has been drawn, they guarantee balanced, diverse, weakly coupled training subsets, improving the diversity of the random forest model and its classification performance on imbalanced data.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A random forest sampling method for imbalanced data, characterized by comprising:
Step a: clustering the training set multiple times to obtain multiple groups of clustering results;
Step b: computing the information entropy of each cluster in the clustering results and, according to the information entropy of each cluster, partitioning the clusters into different cluster groups;
Step c: drawing a certain number of samples from each cluster group by stratified sampling to obtain class-balanced random forest training data subsets.
2. The random forest sampling method for imbalanced data according to claim 1, characterized in that in step a the training set is clustered multiple times as follows: let C be the set of all clustering results, initialized to the empty set; in each clustering pass, randomly select k ∈ [q, min{4q, n/10}] as the number of clusters, generate k disjoint clusters P = {p_1, …, p_k} with the k-means algorithm, and merge the generated clusters into C, obtaining C = C ∪ P.
3. The random forest sampling method for imbalanced data according to claim 2, characterized in that in step b the information entropy of each cluster is computed as Entropy(c) = −∑_{l∈L} p(l|c) log p(l|c), and according to the information entropy the cluster set C is partitioned into q disjoint cluster groups G = {g_1, …, g_q} satisfying:
Entropy(c_i) > Entropy(c_j), ∀ c_i ∈ g_i′, c_j ∈ g_j′, i′ > j′;  |g_i| ≡ |g_j|, ∀ i, j ∈ [1, q]
In the above formula, |g_i| denotes the number of clusters in group g_i.
4. The random forest sampling method for imbalanced data according to claim 3, characterized in that in step c the class-balanced random forest training data subsets are obtained as follows: a certain proportion of samples is drawn from each cluster group to construct a Bagging subset, and the extraction is repeated t times to obtain t class-balanced Bagging subsets; when the first Bagging subset is constructed, n samples are drawn repeatedly from the cluster groups G: for each sample, a cluster group g is first selected from G with equal probability, a cluster c is then drawn from g with probability p(c|g_j), and a sample d is drawn from c with probability p_j(d), where c ∈ g_j.
5. The random forest sampling method for imbalanced data according to claim 4, characterized in that step c further comprises: updating the extraction probability p_b(c_l|g_j) of each cluster c_l ∈ g_j according to f_jl, the frequency with which cluster c_l ∈ g_j has been drawn.
6. A random forest sampling device for imbalanced data, characterized by comprising:
a clustering module, used to cluster the training set multiple times to obtain multiple groups of clustering results;
an information entropy computing module, used to compute the information entropy of each cluster in the clustering results and, according to the information entropy of each cluster, partition the clusters into different cluster groups;
a sampling module, used to draw a certain number of samples from each cluster group by stratified sampling to obtain class-balanced random forest training data subsets.
7. The random forest sampling device for imbalanced data according to claim 6, characterized in that the clustering module clusters the training set multiple times as follows: let C be the set of all clustering results, initialized to the empty set; in each clustering pass, randomly select k ∈ [q, min{4q, n/10}] as the number of clusters, generate k disjoint clusters P = {p_1, …, p_k} with the k-means algorithm, and merge the generated clusters into C, obtaining C = C ∪ P.
8. The random forest sampling device for imbalanced data according to claim 7, characterized in that the information entropy computing module computes the information entropy of each cluster as Entropy(c) = −∑_{l∈L} p(l|c) log p(l|c), and according to the information entropy partitions the cluster set C into q disjoint cluster groups G = {g_1, …, g_q} satisfying:
Entropy(c_i) > Entropy(c_j), ∀ c_i ∈ g_i′, c_j ∈ g_j′, i′ > j′;  |g_i| ≡ |g_j|, ∀ i, j ∈ [1, q]
In the above formula, |g_i| denotes the number of clusters in group g_i.
9. The random forest sampling device for imbalanced data according to claim 8, characterized in that the sampling module obtains the class-balanced random forest training data subsets as follows: a certain proportion of samples is drawn from each cluster group to construct a Bagging subset, and the extraction is repeated t times to obtain t class-balanced Bagging subsets; when the first Bagging subset is constructed, n samples are drawn repeatedly from the cluster groups G: for each sample, a cluster group g is first selected from G with equal probability, a cluster c is then drawn from g with probability p(c|g_j), and a sample d is drawn from c with probability p_j(d), where c ∈ g_j.
10. The random forest sampling device for imbalanced data according to claim 9, characterized in that the sampling module further updates the extraction probability p_b(c_l|g_j) of each cluster c_l ∈ g_j according to f_jl, the frequency with which cluster c_l ∈ g_j has been drawn.
CN201610914533.XA 2016-10-20 2016-10-20 A random forest sampling method and device for imbalanced data Pending CN106548196A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610914533.XA CN106548196A (en) 2016-10-20 2016-10-20 A random forest sampling method and device for imbalanced data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610914533.XA CN106548196A (en) 2016-10-20 2016-10-20 A random forest sampling method and device for imbalanced data

Publications (1)

Publication Number Publication Date
CN106548196A true CN106548196A (en) 2017-03-29

Family

ID=58391981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610914533.XA Pending CN106548196A (en) A random forest sampling method and device for imbalanced data

Country Status (1)

Country Link
CN (1) CN106548196A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106973057A (en) * 2017-03-31 2017-07-21 浙江大学 A kind of sorting technique suitable for intrusion detection
CN107368892A (en) * 2017-06-07 2017-11-21 无锡小天鹅股份有限公司 Model training method and device based on machine learning
CN107392176A (en) * 2017-08-10 2017-11-24 华南理工大学 A kind of high efficiency vehicle detection method based on kmeans
CN107766486A (en) * 2017-10-16 2018-03-06 山东浪潮通软信息科技有限公司 Method, apparatus, computer-readable recording medium and the storage control of randomly drawing sample data
CN108038448A (en) * 2017-12-13 2018-05-15 河南理工大学 Semi-supervised random forest Hyperspectral Remote Sensing Imagery Classification method based on weighted entropy
CN108681433A (en) * 2018-05-04 2018-10-19 南京信息工程大学 A kind of sampling selection method for data de-duplication
CN108805416A (en) * 2018-05-22 2018-11-13 阿里巴巴集团控股有限公司 A kind of risk prevention system processing method, device and equipment
CN108805142A (en) * 2018-05-31 2018-11-13 中国华戎科技集团有限公司 A kind of crime high-risk personnel analysis method and system
CN109598349A (en) * 2018-11-23 2019-04-09 华南理工大学 Overhead transmission line fault detection data sample batch processing training method based on classification stochastical sampling
CN109726821A (en) * 2018-11-27 2019-05-07 东软集团股份有限公司 Data balancing method, device, computer readable storage medium and electronic equipment
CN109993179A (en) * 2017-12-29 2019-07-09 北京京东尚科信息技术有限公司 The method and apparatus that a kind of pair of data are clustered
WO2019232748A1 (en) * 2018-06-06 2019-12-12 北京大学 Computer screening method for chemical small molecule medication targeting rna
CN111046891A (en) * 2018-10-11 2020-04-21 杭州海康威视数字技术股份有限公司 Training method of license plate recognition model, and license plate recognition method and device
CN112562320A (en) * 2020-11-19 2021-03-26 东南大学 Self-adaptive traffic incident detection method based on improved random forest
CN113779150A (en) * 2021-09-14 2021-12-10 杭州数梦工场科技有限公司 Data quality evaluation method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049514A (en) * 2012-12-14 2013-04-17 杭州淘淘搜科技有限公司 Balanced image clustering method based on hierarchical clustering
CN105930856A (en) * 2016-03-23 2016-09-07 深圳市颐通科技有限公司 Classification method based on improved DBSCAN-SMOTE algorithm


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HE ZHAO et al.: "Stratified Over-Sampling Bagging Method for Random Forests on Imbalanced Data", Intelligence and Security Informatics

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106973057B (en) * 2017-03-31 2018-12-14 浙江大学 A kind of classification method suitable for intrusion detection
CN106973057A (en) * 2017-03-31 2017-07-21 浙江大学 A kind of sorting technique suitable for intrusion detection
CN107368892A (en) * 2017-06-07 2017-11-21 无锡小天鹅股份有限公司 Model training method and device based on machine learning
CN107368892B (en) * 2017-06-07 2020-06-16 无锡小天鹅电器有限公司 Model training method and device based on machine learning
CN107392176A (en) * 2017-08-10 2017-11-24 华南理工大学 A kind of high efficiency vehicle detection method based on kmeans
CN107392176B (en) * 2017-08-10 2020-05-22 华南理工大学 High-efficiency vehicle detection method based on kmeans
CN107766486A (en) * 2017-10-16 2018-03-06 山东浪潮通软信息科技有限公司 Method, apparatus, computer-readable recording medium and the storage control of randomly drawing sample data
CN107766486B (en) * 2017-10-16 2021-04-20 浪潮通用软件有限公司 Method, device, readable medium and storage controller for randomly extracting sample data
CN108038448A (en) * 2017-12-13 2018-05-15 河南理工大学 Semi-supervised random forest Hyperspectral Remote Sensing Imagery Classification method based on weighted entropy
CN109993179A (en) * 2017-12-29 2019-07-09 北京京东尚科信息技术有限公司 The method and apparatus that a kind of pair of data are clustered
CN108681433A (en) * 2018-05-04 2018-10-19 南京信息工程大学 A kind of sampling selection method for data de-duplication
CN108805416A (en) * 2018-05-22 2018-11-13 阿里巴巴集团控股有限公司 A kind of risk prevention system processing method, device and equipment
CN108805142A (en) * 2018-05-31 2018-11-13 中国华戎科技集团有限公司 A kind of crime high-risk personnel analysis method and system
WO2019232748A1 (en) * 2018-06-06 2019-12-12 北京大学 Computer screening method for chemical small molecule medication targeting rna
CN111046891A (en) * 2018-10-11 2020-04-21 杭州海康威视数字技术股份有限公司 Training method of license plate recognition model, and license plate recognition method and device
CN109598349A (en) * 2018-11-23 2019-04-09 华南理工大学 Overhead transmission line fault detection data sample batch processing training method based on classification stochastical sampling
CN109726821A (en) * 2018-11-27 2019-05-07 东软集团股份有限公司 Data balancing method, device, computer readable storage medium and electronic equipment
CN112562320A (en) * 2020-11-19 2021-03-26 东南大学 Self-adaptive traffic incident detection method based on improved random forest
CN113779150A (en) * 2021-09-14 2021-12-10 杭州数梦工场科技有限公司 Data quality evaluation method and device

Similar Documents

Publication Publication Date Title
CN106548196A (en) A kind of random forest sampling approach and device for non-equilibrium data
CN111860638B (en) Parallel intrusion detection method and system based on unbalanced data deep belief network
WO2019179403A1 (en) Fraud transaction detection method based on sequence width depth learning
Karaboga et al. Fuzzy clustering with artificial bee colony algorithm
CN107194433A (en) A kind of Radar range profile's target identification method based on depth autoencoder network
CN103632168A (en) Classifier integration method for machine learning
CN105373606A (en) Unbalanced data sampling method in improved C4.5 decision tree algorithm
CN107103332A (en) A kind of Method Using Relevance Vector Machine sorting technique towards large-scale dataset
CN111754345B (en) Bit currency address classification method based on improved random forest
CN112087447B (en) Rare attack-oriented network intrusion detection method
CN105574547B (en) Adapt to integrated learning approach and device that dynamic adjusts base classifier weight
CN102521656A (en) Integrated transfer learning method for classification of unbalance samples
CN110991549A (en) Countermeasure sample generation method and system for image data
CN111125358A (en) Text classification method based on hypergraph
CN109299741A (en) A kind of network attack kind identification method based on multilayer detection
CN106503731A (en) A kind of based on conditional mutual information and the unsupervised feature selection approach of K means
CN107947921A (en) Based on recurrent neural network and the password of probability context-free grammar generation system
CN109086412A (en) A kind of unbalanced data classification method based on adaptive weighted Bagging-GBDT
CN104008420A (en) Distributed outlier detection method and system based on automatic coding machine
CN108596264A (en) A kind of community discovery method based on deep learning
CN110298434A (en) A kind of integrated deepness belief network based on fuzzy division and FUZZY WEIGHTED
CN103914705A (en) Hyperspectral image classification and wave band selection method based on multi-target immune cloning
CN102750286A (en) Novel decision tree classifier method for processing missing data
CN104679911B (en) It is a kind of based on discrete weak related cloud platform decision forest sorting technique
CN108491864A (en) Based on the classification hyperspectral imagery for automatically determining convolution kernel size convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170329