CN106548196A - A random forest sampling method and device for imbalanced data - Google Patents

A random forest sampling method and device for imbalanced data

Info

Publication number
CN106548196A
CN106548196A (application CN201610914533.XA)
Authority
CN
China
Prior art keywords
cluster
clustering cluster
clustering
random forest
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610914533.XA
Other languages
Chinese (zh)
Inventor
陈会
赵鹤
蔡芷铃
姜青山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201610914533.XA
Publication of CN106548196A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the field of imbalanced-data classification, and in particular to a random forest sampling method and device for imbalanced data. The method includes: Step a: cluster the training set multiple times to obtain multiple groups of clustering results; Step b: compute the information entropy of each cluster in the clustering results and, according to the entropy of each cluster, partition the clusters into different cluster groups; Step c: draw a certain number of samples from each cluster group by stratified sampling to obtain class-balanced random forest training data subsets. By clustering repeatedly and grouping the clustering results, the invention captures the internal subclass structure of each class, improving classification precision; and by adjusting each sample's extraction probability in subsequent sampling according to how often it has already been drawn, it improves classification performance on imbalanced data.

Description

A random forest sampling method and device for imbalanced data
Technical field
The present invention relates to the field of imbalanced-data classification, and in particular to a random forest sampling method and device for imbalanced data.
Background technology
In a classification dataset, when one class accounts for the overwhelming majority of the data while the other classes are comparatively rare, the dataset is called class-imbalanced (Class-Imbalanced Data). In imbalanced data, the dominant class is called the majority class (Majority Class) and the other classes are called minority classes (Minority Class). Imbalanced data is ubiquitous in real applications, for example rare-disease prediction, fraud detection, and risk management. Traditional machine learning algorithms assume balanced data when building a classifier, so the classifier tends toward the majority class and its performance degrades on imbalanced data.
The random forest algorithm (Random Forests, RF) is an ensemble learning method that builds a model from multiple unpruned decision trees (Decision Trees). Because it performs well on classification and regression problems, it is widely used in data mining. However, it likewise suffers low classification performance on imbalanced data. The algorithm uses the Bagging method with sampling with replacement (Sampling with Replacement), so each randomly drawn subset may contain even fewer minority-class samples than the original dataset; the trees constructed from such subsets tend toward the majority class, making the performance drop of the random forest algorithm on imbalanced data even more pronounced.
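The effect described above is easy to see numerically: drawing n samples with replacement (Bagging's bootstrap) from a 95:5 class mix yields subsets whose minority count fluctuates around 5 and can drop to zero. A minimal sketch with illustrative toy labels (not data from the patent):

```python
import random

def bootstrap(items, rng):
    """One Bagging subset: len(items) draws with replacement, as RF does."""
    return [rng.choice(items) for _ in range(len(items))]

rng = random.Random(0)
labels = ['maj'] * 95 + ['min'] * 5           # a 95:5 imbalanced toy label set
counts = [bootstrap(labels, rng).count('min') for _ in range(1000)]
avg_minority = sum(counts) / len(counts)      # fluctuates around the true count of 5
```

Across many bootstraps the minority count averages near 5, but individual subsets can nearly or entirely miss the minority class, which is exactly the bias the patent's sampling scheme targets.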
Various sampling methods have been proposed and studied for imbalanced data. Chawla et al. proposed a sampling method called SMOTE, which synthetically generates minority-class samples and combines undersampling and oversampling to improve performance on imbalanced data. A method called SMOTE-Boost was later proposed to improve performance further. Chen et al. proposed the Balanced Random Forests model (BRF) and the Weighted Random Forests model (WRF). BRF combines sampling with ensemble learning: equal numbers of samples are drawn repeatedly from the minority and majority classes to form each Bagging subset. WRF borrows from cost-sensitive learning (Cost Sensitive Learning), setting larger penalty coefficients when building decision-tree nodes and higher weights for the minority class at leaf nodes to improve minority-class classification performance. Krawczyk et al. proposed another cost-sensitive ensemble method, EG2Ensemble, which builds cost-sensitive decision trees with the EG2 method and then forms the ensemble using random subspaces and a genetic algorithm.
As noted above, the existing sampling algorithms share a drawback: although they account for the small share of the minority class and increase its sample count by oversampling, they ignore the internal structure of the minority class and sample it as a whole. Consequently, the subclasses of the already-rare minority class are sampled with even lower probability, so they are insufficiently distinguished during learning, which hurts algorithm performance.
Summary of the invention
The present invention provides a random forest sampling method and device for imbalanced data, intended to solve, at least to some extent, one of the above technical problems in the prior art.
To solve the above problems, the present invention provides the following technical solutions:
A random forest sampling method for imbalanced data, including:
Step a: cluster the training set multiple times to obtain multiple groups of clustering results;
Step b: compute the information entropy of each cluster in the clustering results and, according to the information entropy of each cluster, partition the clusters into different cluster groups;
Step c: draw a certain number of samples from each cluster group by stratified sampling to obtain class-balanced random forest training data subsets.
The technical solution adopted by the embodiments of the present invention further includes: in step a, the training set is clustered multiple times as follows: let C be the set of all clustering results, initialized to the empty set. In each clustering pass, randomly select k ∈ [q, min{4q, n/10}] as the number of clusters, generate k disjoint clusters P = {p_1, …, p_k} with the k-means algorithm, and merge the generated clusters into C, obtaining C = C ∪ P.
The technical solution adopted by the embodiments of the present invention further includes: in step b, the information entropy of each cluster is computed as Entropy(c) = −∑_{l∈L} p(l|c) log p(l|c), and according to the information entropy the cluster set C is partitioned into q disjoint cluster groups G = {g_1, …, g_q} satisfying:
Entropy(c_i) > Entropy(c_j), ∀ c_i ∈ g_i′, c_j ∈ g_j′, i′ > j′;  |g_i| ≡ |g_j|, ∀ i, j ∈ [1, q]
In the above formula, |g_i| denotes the number of clusters in group g_i.
The technical solution adopted by the embodiments of the present invention further includes: in step c, the class-balanced random forest training data subsets are obtained as follows: a certain proportion of samples is drawn from each cluster group to construct a Bagging subset, and the extraction is repeated t times to obtain t class-balanced Bagging subsets. When the first Bagging subset is constructed, n samples are drawn repeatedly from the cluster groups G: for each sample, a cluster group g is first selected from G with equal probability, a cluster c is then drawn from g with probability p(c|g_j), and a sample d is drawn from c with probability p_j(d), where c ∈ g_j.
The technical solution adopted by the embodiments of the present invention further includes: in step c, obtaining the class-balanced random forest training data subsets also includes updating the extraction probability p_b(c_l|g_j) of each cluster c_l ∈ g_j according to f_jl, the frequency with which cluster c_l ∈ g_j has been drawn.
Another technical solution adopted by the embodiments of the present invention is a random forest sampling device for imbalanced data, including:
a clustering module, used to cluster the training set multiple times to obtain multiple groups of clustering results;
an information entropy computing module, used to compute the information entropy of each cluster in the clustering results and, according to the information entropy of each cluster, partition the clusters into different cluster groups;
a sampling module, used to draw a certain number of samples from each cluster group by stratified sampling to obtain class-balanced random forest training data subsets.
The technical solution adopted by the embodiments of the present invention further includes: the clustering module clusters the training set multiple times as follows: let C be the set of all clustering results, initialized to the empty set. In each clustering pass, randomly select k ∈ [q, min{4q, n/10}] as the number of clusters, generate k disjoint clusters P = {p_1, …, p_k} with the k-means algorithm, and merge the generated clusters into C, obtaining C = C ∪ P.
The technical solution adopted by the embodiments of the present invention further includes: the information entropy computing module computes the information entropy of each cluster as Entropy(c) = −∑_{l∈L} p(l|c) log p(l|c), and according to the information entropy partitions the cluster set C into q disjoint cluster groups G = {g_1, …, g_q} satisfying:
Entropy(c_i) > Entropy(c_j), ∀ c_i ∈ g_i′, c_j ∈ g_j′, i′ > j′;  |g_i| ≡ |g_j|, ∀ i, j ∈ [1, q]
In the above formula, |g_i| denotes the number of clusters in group g_i.
The technical solution adopted by the embodiments of the present invention further includes: the sampling module obtains the class-balanced random forest training data subsets as follows: a certain proportion of samples is drawn from each cluster group to construct a Bagging subset, and the extraction is repeated t times to obtain t class-balanced Bagging subsets. When the first Bagging subset is constructed, n samples are drawn repeatedly from the cluster groups G: for each sample, a cluster group g is first selected from G with equal probability, a cluster c is then drawn from g with probability p(c|g_j), and a sample d is drawn from c with probability p_j(d), where c ∈ g_j.
The technical solution adopted by the embodiments of the present invention further includes: the sampling module also updates the extraction probability p_b(c_l|g_j) of each cluster c_l ∈ g_j according to f_jl, the frequency with which cluster c_l ∈ g_j has been drawn.
Compared with the prior art, the embodiments of the present invention have the following beneficial effects: by clustering repeatedly and grouping the clustering results, the random forest sampling method and device for imbalanced data capture the internal subclass structure of each class, improving classification precision; and by adjusting each sample's extraction probability in subsequent sampling according to the frequency with which it has been drawn, they guarantee balanced, diverse, weakly coupled training subsets, improving the diversity of the random forest model and its classification performance on imbalanced data.
Description of the drawings
Fig. 1 is a flowchart of the random forest sampling method for imbalanced data according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the random forest sampling device for imbalanced data according to an embodiment of the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein serve only to explain the present invention, not to limit it.
The random forest sampling method and device for imbalanced data of the embodiments of the present invention improve the classification performance of the random forest algorithm on imbalanced data by constructing a group of weakly correlated, balanced data subsets from the imbalanced data. First, the training set is clustered multiple times to obtain multiple groups of clustering results; then, by stratified sampling, a number of samples directly proportional to cluster entropy and inversely proportional to cluster size is drawn from each cluster to form the Bagging subsets of the random forest, guaranteeing the diversity and balance of the Bagging subsets and thereby improving random forest classification performance.
Specifically, referring to Fig. 1, the flowchart of the random forest sampling method for imbalanced data of the embodiment of the present invention: assume the training set D is an n × m matrix with n samples and m attributes. Given the number of clustering passes r and the number t of Bagging subsets to construct, the method comprises the following steps:
Step 100: cluster the training set r times to obtain r clustering results;
In step 100, the training set is clustered as follows: let C be the set of all clustering results, initially the empty set. In each pass, randomly select k ∈ [q, min{4q, n/10}] as the number of clusters, generate k disjoint clusters P = {p_1, …, p_k} with the k-means algorithm (a hard clustering algorithm), and merge the result into C, obtaining C = C ∪ P.
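Step 100 can be sketched in Python. The minimal k-means below (Lloyd's algorithm on 2-D points) is a stand-in for any k-means implementation, and the `q`, `r` values in the usage are illustrative, not from the patent:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means (Lloyd's algorithm); stands in for the patent's
    k-means step -- any k-means implementation would do."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center (squared Euclidean)
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # recompute each non-empty cluster's center as the mean
                centers[i] = [sum(xs) / len(cl) for xs in zip(*cl)]
    return [cl for cl in clusters if cl]

def repeated_clustering(points, q, r, seed=0):
    """Step 100: cluster the training set r times, with k drawn uniformly
    from [q, min(4q, n/10)], collecting every resulting cluster into C."""
    rng = random.Random(seed)
    n = len(points)
    pool = []  # the set C of all clustering results
    for t in range(r):
        k = rng.randint(q, max(q, min(4 * q, n // 10)))
        pool.extend(kmeans(points, k, seed=seed + t + 1))
    return pool
```

Since each pass partitions all n points, the pooled clusters from r passes cover every sample r times.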
Step 200: compute the information entropy of each cluster in the r clustering results and, according to the information entropy of each cluster, partition the cluster set C into q different cluster groups;
In step 200, for any cluster c ∈ C, the information entropy of the cluster is first computed to measure its class distribution:
Entropy(c) = −∑_{l∈L} p(l|c) log p(l|c)   (1)
Then, according to the information entropy of each cluster, the cluster set C is partitioned into q disjoint cluster groups G = {g_1, …, g_q} satisfying:
Entropy(c_i) > Entropy(c_j), ∀ c_i ∈ g_i′, c_j ∈ g_j′, i′ > j′;  |g_i| ≡ |g_j|, ∀ i, j ∈ [1, q]   (2)
In formula (2), |g_i| denotes the number of clusters in group g_i.
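Formulas (1) and (2) can be sketched as follows: each cluster is represented by the class labels of its samples, entropy is computed per formula (1), and the clusters are ranked by entropy and cut into q equally sized groups per formula (2). The divisibility of the cluster count by q is assumed, matching the |g_i| ≡ |g_j| constraint:

```python
import math

def cluster_entropy(labels):
    """Formula (1): Entropy(c) = -sum over classes l of p(l|c) * log p(l|c),
    computed from the class labels of the samples in one cluster."""
    n = len(labels)
    return -sum((labels.count(l) / n) * math.log(labels.count(l) / n)
                for l in set(labels))

def group_by_entropy(clusters, q):
    """Formula (2), sketched: rank clusters by entropy and cut the ranking
    into q equally sized, disjoint groups; higher-entropy clusters land in
    higher-indexed groups. Assumes len(clusters) is divisible by q."""
    ranked = sorted(clusters, key=cluster_entropy)
    size = len(ranked) // q
    return [ranked[i * size:(i + 1) * size] for i in range(q)]
```

A pure single-class cluster has entropy 0, while a 50/50 two-class cluster has entropy log 2, so the former sorts into a lower-indexed group than the latter.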
Step 300: draw a certain proportion of samples from each cluster group to construct a Bagging subset, and repeat the extraction t times to obtain t class-balanced Bagging subsets;
In step 300, when the first Bagging subset is constructed, n samples are drawn repeatedly from the cluster groups G. For each sample, a cluster group g is first selected from G with equal probability; a cluster c is then drawn from g with probability p(c|g_j), and a sample d is drawn from c with probability p_j(d), where c ∈ g_j and p_j(d) is defined in terms of count_j(l), the frequency with which class l ∈ L appears in cluster group g_j.
Let F = {f_jl | 1 ≤ j ≤ q, 1 ≤ l ≤ |g_j|} be a frequency matrix, where f_jl is the number of times cluster c_l ∈ g_j has been drawn. To obtain the b-th Bagging subset, the extraction probability p_b(c_l|g_j) of each cluster c_l ∈ g_j is updated according to F.
Adjusting each sample's extraction probability in subsequent sampling according to the frequency with which it has been drawn guarantees balanced and diverse Bagging subsets, improving the classification performance of the random forest algorithm on imbalanced data.
The pseudo-code of the algorithm of the invention combines steps 100-300 above.
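The sampling loop of step 300 can be sketched in Python. This is an illustrative reconstruction, not the patent's own listing: the exact cluster-selection probability p(c|g_j) and sample-selection probability p_j(d) are given by formulas not reproduced in this text, so the uniform within-cluster draws and the inverse-frequency update p_b(c_l|g_j) ∝ 1/(1 + f_jl) below are stated assumptions, consistent with the description's frequency-based adjustment:

```python
import random

def draw_bagging_subsets(groups, n, t, seed=0):
    """Step 300, sketched: build t Bagging subsets of n samples each by
    stratified sampling over cluster groups. `groups` is a list of cluster
    groups, each a list of clusters, each a list of samples."""
    rng = random.Random(seed)
    freq = [[0] * len(g) for g in groups]  # f_jl: draws of cluster l in group j
    subsets = []
    for _ in range(t):
        subset = []
        for _ in range(n):
            j = rng.randrange(len(groups))            # group g: equiprobable
            w = [1.0 / (1 + f) for f in freq[j]]      # assumed inverse-frequency rule
            l = rng.choices(range(len(groups[j])), weights=w)[0]
            freq[j][l] += 1
            subset.append(rng.choice(groups[j][l]))   # sample d: uniform in cluster
        subsets.append(subset)
    return subsets
```

Because clusters that have already been drawn often are down-weighted, later subsets favor so-far-underrepresented clusters, which is the balancing behavior the description ascribes to the frequency matrix F.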
Step 400: train the random forest model with the t Bagging subsets.
Referring to Fig. 2, the schematic structural diagram of the random forest sampling device for imbalanced data of the embodiment of the present invention: the device comprises a clustering module, an information entropy computing module, a sampling module, and a model training module.
Clustering module: used to cluster the training set r times to obtain r clustering results. The training set is clustered as follows: let C be the set of all clustering results, initially the empty set. In each pass, randomly select k ∈ [q, min{4q, n/10}] as the number of clusters, generate k disjoint clusters P = {p_1, …, p_k} with the k-means algorithm (a hard clustering algorithm), and merge the result into C, obtaining C = C ∪ P.
Information entropy computing module: used to compute the information entropy of each cluster in the r clustering results and, according to the information entropy of each cluster, partition the cluster set C into q different cluster groups. For any cluster c ∈ C, the information entropy of the cluster is first computed to measure its class distribution:
Entropy(c) = −∑_{l∈L} p(l|c) log p(l|c)   (1)
Then, according to the information entropy of each cluster, the cluster set C is partitioned into q disjoint cluster groups G = {g_1, …, g_q} satisfying:
Entropy(c_i) > Entropy(c_j), ∀ c_i ∈ g_i′, c_j ∈ g_j′, i′ > j′;  |g_i| ≡ |g_j|, ∀ i, j ∈ [1, q]   (2)
In formula (2), |g_i| denotes the number of clusters in group g_i.
Sampling module: used to draw a certain proportion of samples from each cluster group to construct a Bagging subset, repeating the extraction t times to obtain t class-balanced Bagging subsets. When the first Bagging subset is constructed, n samples are drawn repeatedly from the cluster groups G: for each sample, a cluster group g is first selected from G with equal probability; a cluster c is then drawn from g with probability p(c|g_j), and a sample d is drawn from c with probability p_j(d), where c ∈ g_j and p_j(d) is defined in terms of count_j(l), the frequency with which class l ∈ L appears in cluster group g_j. Let F = {f_jl | 1 ≤ j ≤ q, 1 ≤ l ≤ |g_j|} be a frequency matrix, where f_jl is the number of times cluster c_l ∈ g_j has been drawn; to obtain the b-th Bagging subset, the extraction probability p_b(c_l|g_j) of each cluster c_l ∈ g_j is updated according to F.
Adjusting each sample's extraction probability in subsequent sampling according to the frequency with which it has been drawn guarantees balanced and diverse Bagging subsets, improving the classification performance of the random forest algorithm on imbalanced data.
Model training module: used to train the random forest model with the t Bagging subsets.
By clustering repeatedly and grouping the clustering results, the random forest sampling method and device for imbalanced data of the embodiments of the present invention capture the internal subclass structure of each class, improving classification precision; and by adjusting each sample's extraction probability in subsequent sampling according to the frequency with which it has been drawn, they guarantee balanced, diverse, weakly coupled training subsets, improving the diversity of the random forest model and its classification performance on imbalanced data.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A random forest sampling method for imbalanced data, characterized by comprising:
Step a: clustering the training set multiple times to obtain multiple groups of clustering results;
Step b: computing the information entropy of each cluster in the clustering results and, according to the information entropy of each cluster, partitioning the clusters into different cluster groups;
Step c: drawing a certain number of samples from each cluster group by stratified sampling to obtain class-balanced random forest training data subsets.
2. The random forest sampling method for imbalanced data according to claim 1, characterized in that in step a the training set is clustered multiple times as follows: let C be the set of all clustering results, initialized to the empty set; in each clustering pass, randomly select k ∈ [q, min{4q, n/10}] as the number of clusters, generate k disjoint clusters P = {p_1, …, p_k} with the k-means algorithm, and merge the generated clusters into C, obtaining C = C ∪ P.
3. The random forest sampling method for imbalanced data according to claim 2, characterized in that in step b the information entropy of each cluster is computed as Entropy(c) = −∑_{l∈L} p(l|c) log p(l|c), and according to the information entropy the cluster set C is partitioned into q disjoint cluster groups G = {g_1, …, g_q} satisfying:
Entropy(c_i) > Entropy(c_j), ∀ c_i ∈ g_i′, c_j ∈ g_j′, i′ > j′;  |g_i| ≡ |g_j|, ∀ i, j ∈ [1, q]
In the above formula, |g_i| denotes the number of clusters in group g_i.
4. The random forest sampling method for imbalanced data according to claim 3, characterized in that in step c the class-balanced random forest training data subsets are obtained as follows: a certain proportion of samples is drawn from each cluster group to construct a Bagging subset, and the extraction is repeated t times to obtain t class-balanced Bagging subsets; when the first Bagging subset is constructed, n samples are drawn repeatedly from the cluster groups G: for each sample, a cluster group g is first selected from G with equal probability, a cluster c is then drawn from g with probability p(c|g_j), and a sample d is drawn from c with probability p_j(d), where c ∈ g_j.
5. The random forest sampling method for imbalanced data according to claim 4, characterized in that step c further comprises: updating the extraction probability p_b(c_l|g_j) of each cluster c_l ∈ g_j according to f_jl, the frequency with which cluster c_l ∈ g_j has been drawn.
6. A random forest sampling device for imbalanced data, characterized by comprising:
a clustering module, used to cluster the training set multiple times to obtain multiple groups of clustering results;
an information entropy computing module, used to compute the information entropy of each cluster in the clustering results and, according to the information entropy of each cluster, partition the clusters into different cluster groups;
a sampling module, used to draw a certain number of samples from each cluster group by stratified sampling to obtain class-balanced random forest training data subsets.
7. The random forest sampling device for imbalanced data according to claim 6, characterized in that the clustering module clusters the training set multiple times as follows: let C be the set of all clustering results, initialized to the empty set; in each clustering pass, randomly select k ∈ [q, min{4q, n/10}] as the number of clusters, generate k disjoint clusters P = {p_1, …, p_k} with the k-means algorithm, and merge the generated clusters into C, obtaining C = C ∪ P.
8. The random forest sampling device for imbalanced data according to claim 7, characterized in that the information entropy computing module computes the information entropy of each cluster as Entropy(c) = −∑_{l∈L} p(l|c) log p(l|c), and according to the information entropy partitions the cluster set C into q disjoint cluster groups G = {g_1, …, g_q} satisfying:
Entropy(c_i) > Entropy(c_j), ∀ c_i ∈ g_i′, c_j ∈ g_j′, i′ > j′;  |g_i| ≡ |g_j|, ∀ i, j ∈ [1, q]
In the above formula, |g_i| denotes the number of clusters in group g_i.
9. The random forest sampling device for imbalanced data according to claim 8, characterized in that the sampling module obtains the class-balanced random forest training data subsets as follows: a certain proportion of samples is drawn from each cluster group to construct a Bagging subset, and the extraction is repeated t times to obtain t class-balanced Bagging subsets; when the first Bagging subset is constructed, n samples are drawn repeatedly from the cluster groups G: for each sample, a cluster group g is first selected from G with equal probability, a cluster c is then drawn from g with probability p(c|g_j), and a sample d is drawn from c with probability p_j(d), where c ∈ g_j.
10. The random forest sampling device for imbalanced data according to claim 9, characterized in that the sampling module further updates the extraction probability p_b(c_l|g_j) of each cluster c_l ∈ g_j according to f_jl, the frequency with which cluster c_l ∈ g_j has been drawn.
CN201610914533.XA 2016-10-20 2016-10-20 A random forest sampling method and device for imbalanced data Pending CN106548196A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610914533.XA CN106548196A (en) 2016-10-20 2016-10-20 A random forest sampling method and device for imbalanced data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610914533.XA CN106548196A (en) 2016-10-20 2016-10-20 A random forest sampling method and device for imbalanced data

Publications (1)

Publication Number Publication Date
CN106548196A true CN106548196A (en) 2017-03-29

Family

ID=58391981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610914533.XA Pending CN106548196A (en) A random forest sampling method and device for imbalanced data

Country Status (1)

Country Link
CN (1) CN106548196A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106973057A (en) * 2017-03-31 2017-07-21 浙江大学 A kind of sorting technique suitable for intrusion detection
CN107368892A (en) * 2017-06-07 2017-11-21 无锡小天鹅股份有限公司 Model training method and device based on machine learning
CN107392176A (en) * 2017-08-10 2017-11-24 华南理工大学 A kind of high efficiency vehicle detection method based on kmeans
CN107766486A (en) * 2017-10-16 2018-03-06 山东浪潮通软信息科技有限公司 Method, apparatus, computer-readable recording medium and the storage control of randomly drawing sample data
CN108038448A (en) * 2017-12-13 2018-05-15 河南理工大学 Semi-supervised random forest Hyperspectral Remote Sensing Imagery Classification method based on weighted entropy
CN108681433A (en) * 2018-05-04 2018-10-19 南京信息工程大学 A kind of sampling selection method for data de-duplication
CN108805416A (en) * 2018-05-22 2018-11-13 阿里巴巴集团控股有限公司 A kind of risk prevention system processing method, device and equipment
CN108805142A (en) * 2018-05-31 2018-11-13 中国华戎科技集团有限公司 A kind of crime high-risk personnel analysis method and system
CN109598349A (en) * 2018-11-23 2019-04-09 华南理工大学 Overhead transmission line fault detection data sample batch processing training method based on classification stochastical sampling
CN109726821A (en) * 2018-11-27 2019-05-07 东软集团股份有限公司 Data balancing method, device, computer readable storage medium and electronic equipment
CN109993179A (en) * 2017-12-29 2019-07-09 北京京东尚科信息技术有限公司 The method and apparatus that a kind of pair of data are clustered
WO2019232748A1 (en) * 2018-06-06 2019-12-12 北京大学 Computer screening method for chemical small molecule medication targeting rna
CN111046891A (en) * 2018-10-11 2020-04-21 杭州海康威视数字技术股份有限公司 Training method of license plate recognition model, and license plate recognition method and device
CN112562320A (en) * 2020-11-19 2021-03-26 东南大学 Self-adaptive traffic incident detection method based on improved random forest
CN113779150A (en) * 2021-09-14 2021-12-10 杭州数梦工场科技有限公司 Data quality evaluation method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049514A (en) * 2012-12-14 2013-04-17 杭州淘淘搜科技有限公司 Balanced image clustering method based on hierarchical clustering
CN105930856A (en) * 2016-03-23 2016-09-07 深圳市颐通科技有限公司 Classification method based on improved DBSCAN-SMOTE algorithm


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HE ZHAO et al.: "Stratified Over-Sampling Bagging Method for Random Forests on Imbalanced Data", Intelligence and Security Informatics

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106973057B (en) * 2017-03-31 2018-12-14 浙江大学 A kind of classification method suitable for intrusion detection
CN106973057A (en) * 2017-03-31 2017-07-21 浙江大学 A kind of sorting technique suitable for intrusion detection
CN107368892A (en) * 2017-06-07 2017-11-21 无锡小天鹅股份有限公司 Model training method and device based on machine learning
CN107368892B (en) * 2017-06-07 2020-06-16 无锡小天鹅电器有限公司 Model training method and device based on machine learning
CN107392176A (en) * 2017-08-10 2017-11-24 华南理工大学 A kind of high efficiency vehicle detection method based on kmeans
CN107392176B (en) * 2017-08-10 2020-05-22 华南理工大学 High-efficiency vehicle detection method based on kmeans
CN107766486A (en) * 2017-10-16 2018-03-06 山东浪潮通软信息科技有限公司 Method, apparatus, computer-readable recording medium and the storage control of randomly drawing sample data
CN107766486B (en) * 2017-10-16 2021-04-20 浪潮通用软件有限公司 Method, device, readable medium and storage controller for randomly extracting sample data
CN108038448A (en) * 2017-12-13 2018-05-15 河南理工大学 Semi-supervised random forest Hyperspectral Remote Sensing Imagery Classification method based on weighted entropy
CN109993179A (en) * 2017-12-29 2019-07-09 北京京东尚科信息技术有限公司 The method and apparatus that a kind of pair of data are clustered
CN108681433A (en) * 2018-05-04 2018-10-19 南京信息工程大学 A kind of sampling selection method for data de-duplication
CN108805416A (en) * 2018-05-22 2018-11-13 阿里巴巴集团控股有限公司 A kind of risk prevention system processing method, device and equipment
CN108805142A (en) * 2018-05-31 2018-11-13 中国华戎科技集团有限公司 A kind of crime high-risk personnel analysis method and system
WO2019232748A1 (en) * 2018-06-06 2019-12-12 北京大学 Computer screening method for chemical small molecule medication targeting rna
CN111046891A (en) * 2018-10-11 2020-04-21 杭州海康威视数字技术股份有限公司 Training method of license plate recognition model, and license plate recognition method and device
CN109598349A (en) * 2018-11-23 2019-04-09 华南理工大学 Overhead transmission line fault detection data sample batch processing training method based on classification stochastical sampling
CN109726821A (en) * 2018-11-27 2019-05-07 东软集团股份有限公司 Data balancing method, device, computer readable storage medium and electronic equipment
CN112562320A (en) * 2020-11-19 2021-03-26 东南大学 Self-adaptive traffic incident detection method based on improved random forest
CN113779150A (en) * 2021-09-14 2021-12-10 杭州数梦工场科技有限公司 Data quality evaluation method and device

Similar Documents

Publication Publication Date Title
CN106548196A (en) A kind of random forest sampling approach and device for non-equilibrium data
CN111860638B (en) Parallel intrusion detection method and system based on unbalanced data deep belief network
WO2019179403A1 (en) Fraud transaction detection method based on sequence width depth learning
Karaboga et al. Fuzzy clustering with artificial bee colony algorithm
CN107194433A (en) A kind of Radar range profile's target identification method based on depth autoencoder network
CN103632168A (en) Classifier integration method for machine learning
CN105373606A (en) Unbalanced data sampling method in improved C4.5 decision tree algorithm
CN107103332A (en) A kind of Method Using Relevance Vector Machine sorting technique towards large-scale dataset
CN111754345B (en) Bit currency address classification method based on improved random forest
CN112087447B (en) Rare attack-oriented network intrusion detection method
CN105574547B (en) Adapt to integrated learning approach and device that dynamic adjusts base classifier weight
CN102521656A (en) Integrated transfer learning method for classification of unbalance samples
CN110991549A (en) Countermeasure sample generation method and system for image data
CN111125358A (en) Text classification method based on hypergraph
CN109299741A (en) A kind of network attack kind identification method based on multilayer detection
CN106503731A (en) A kind of based on conditional mutual information and the unsupervised feature selection approach of K means
CN107947921A (en) Based on recurrent neural network and the password of probability context-free grammar generation system
CN109086412A (en) A kind of unbalanced data classification method based on adaptive weighted Bagging-GBDT
CN104008420A (en) Distributed outlier detection method and system based on automatic coding machine
CN108596264A (en) A kind of community discovery method based on deep learning
CN110298434A (en) A kind of integrated deepness belief network based on fuzzy division and FUZZY WEIGHTED
CN103914705A (en) Hyperspectral image classification and wave band selection method based on multi-target immune cloning
CN102750286A (en) Novel decision tree classifier method for processing missing data
CN104679911B (en) It is a kind of based on discrete weak related cloud platform decision forest sorting technique
CN108491864A (en) Based on the classification hyperspectral imagery for automatically determining convolution kernel size convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170329