CN109492682A - Multi-branch random forest data classification method - Google Patents

Multi-branch random forest data classification method

Info

Publication number
CN109492682A
CN109492682A
Authority
CN
China
Prior art keywords
sample
cluster
center
sample point
random forest
Prior art date
Legal status
Pending
Application number
CN201811273813.2A
Other languages
Chinese (zh)
Inventor
江泽涛
马伟康
胡硕
Current Assignee
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date: 2018-10-30
Filing date: 2018-10-30
Publication date: 2019-03-19
Application filed by Guilin University of Electronic Technology
Priority to CN201811273813.2A
Publication of CN109492682A
Legal status: Pending

Classifications

    • G06F18/23213 — Pattern recognition; non-hierarchical clustering techniques using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
    • G06F18/2148 — Generating training patterns; bootstrap methods, e.g. bagging or boosting, characterised by the process organisation or structure, e.g. boosting cascade
    • G06F18/24323 — Classification techniques; tree-organised classifiers


Abstract

The invention discloses a multi-branch random forest data classification method, relating to the technical field of random forest data classification. The technical problem solved is to provide a classification method that improves the performance and accuracy of data classification. The method comprises the following steps: (1) provide an unclassified data set and use the PCA algorithm to reduce its dimensionality and remove noise; (2) complete the clustering of the data using the K-means algorithm; (3) construct the multi-branch random forest; (4) use the multi-branch random forest model to complete the classification of the data. The technical solution of the present invention improves the performance and accuracy of data classification.

Description

Multi-branch random forest data classification method
Technical field
The present invention relates to the technical field of random forest data classification, and more particularly to a multi-branch random forest data classification method.
Background technique
With the development of artificial intelligence, fields such as image research and information security all require its participation. Clustering and classification algorithms have important applications in artificial intelligence, with K-means and random forest being representative clustering and classification algorithms, respectively. The random forest is one of the better-performing classification algorithms; it is an ensemble learning algorithm based on decision trees. However, when the prior-art random forest data classification method performs classification, the sample set is overly redundant and disordered and its data purity is low, which has a certain impact on classification performance.
Summary of the invention
In view of the deficiencies of the prior art, the technical problem solved by the invention is to provide a classification method that improves the performance and accuracy of data classification.
To solve the above technical problem, the technical solution adopted by the present invention is a multi-branch random forest data classification method comprising the following steps:
(1) Provide an unclassified data set and use the PCA algorithm to reduce its dimensionality and remove noise, with the following sub-steps (an illustrative code sketch follows this list):
(1) Express the sample set as an N × M matrix X.
(2) Zero-mean each row, i.e. compute the average value $R_i$ of every row of the matrix and subtract it from that row, $N_i - R_i$; compute the covariance matrix $C = \frac{1}{M} X X^{\mathsf{T}}$; and compute the eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_m$ of the covariance matrix C together with the normalised eigenvectors $x_1, x_2, \dots, x_m$.
(3) Arrange the eigenvectors as the rows of a matrix, ordered from top to bottom by decreasing eigenvalue, and take the first k rows to form the matrix P.
(4) Multiply the matrix P by the matrix X to obtain the reduced-dimension data, removing the redundant part of the data.
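For illustration only (not part of the patent text), the following is a minimal NumPy sketch of the PCA sub-steps above; the function name pca_denoise is an assumption, and the covariance $C = \frac{1}{M} X X^{\mathsf{T}}$ follows the row-feature convention of sub-step (1):

```python
import numpy as np

def pca_denoise(X, k):
    """PCA reduction of an N x M sample matrix X (rows = features,
    columns = samples) onto its top-k principal components."""
    # Sub-step (2): zero-mean each row of X.
    Xc = X - X.mean(axis=1, keepdims=True)
    # Covariance matrix C = (1/M) X X^T over the M samples.
    C = Xc @ Xc.T / X.shape[1]
    # Eigenvalues and eigenvectors of the symmetric matrix C.
    eigvals, eigvecs = np.linalg.eigh(C)
    # Sub-step (3): order eigenvectors by decreasing eigenvalue and
    # keep the first k as the rows of the projection matrix P.
    order = np.argsort(eigvals)[::-1][:k]
    P = eigvecs[:, order].T                 # shape (k, N)
    # Sub-step (4): project to obtain the k x M reduced data.
    return P @ Xc
```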
(2) Complete the clustering of the data set with the K-means algorithm and output the clusters $C = \{C_1, C_2, \dots, C_k\}$, with the following sub-steps (an illustrative code sketch follows this list):
(1) Compute the density value $p_{ij}$ of each sample point from the pairwise distances $d_{ijk} = \lVert x_{ij} - x_{kj} \rVert$, where $p_{ij}$ is the density of the i-th sample point in class j, $n_j$ is the total number of sample points in class j, and $d_{ijk}$ is the distance between the sample points $x_{ij}$ and $x_{kj}$ in the vector space. Take the sample point with the largest density value $p_{ij}$ as the first cluster centre.
(2) Distance is also considered in the selection of the remaining cluster centres: for a given sample $y_n$, normalise the distance from the sample point to it, giving the normalised distance $D_{ijt}$.
(3) Form $w_{ij}$, the sum of the density value of the sample point and its normalised distances to the already-selected cluster centres: $w_{ij} = p_{ij} + \sum_t D_{ijt}$, where $p_{ij}$ denotes the density of the i-th sample point in class j and $D_{ijt}$ denotes the normalised distance from the sample point $x_{ij}$ to the centre $y_t$ of the t-th selected class. The number of clusters K is determined by the elbow method.
(4) Sort the $w_{ij}$ from large to small and select the first k−1 sample points, together with the point of maximum $p_{ij}$ value, as the initial cluster centres $C_1, C_2, \dots, C_k$.
(5) Take $c_1, c_2, \dots, c_k$ as the initial cluster centres, re-denoted $\mu_1, \mu_2, \dots, \mu_k$, and set the maximum number of iterations R.
(6) Compute the distance of each sample to every cluster centre, $\mathrm{dist}(x_i, \mu_j) = \lVert x_i - \mu_j \rVert_2$, where i = 1, 2, …, N and j = 1, 2, …, k.
(7) Assign $x_i$ the cluster label of the nearest cluster centre: $\lambda_i = \arg\min_{j \in \{1, 2, \dots, k\}} \mathrm{dist}(x_i, \mu_j)$.
(8) Place the sample $x_i$ into the corresponding cluster: $C_{\lambda_i} = C_{\lambda_i} \cup \{x_i\}$.
(9) After all samples have been clustered, compute the new mean centre of each cluster, $\mu_i' = \frac{1}{|C_i|} \sum_{x \in C_i} x$. If $\mu_i'$ and $\mu_i$ are unequal, update the cluster centre to $\mu_i'$; if they are equal, keep $\mu_i$ unchanged. Then recompute the cluster to which each sample belongs.
(10) Repeat sub-step (9) until no cluster centre changes or the maximum number of iterations is reached.
(11) Output the cluster division $C = \{C_1, C_2, \dots, C_k\}$.
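A minimal sketch of this clustering procedure follows, for illustration only. The exact density formula is garbled in the source text, so as a stated assumption the sketch takes each point's density as the inverse of its total distance to all other points; function and variable names are illustrative:

```python
import numpy as np

def density_init_kmeans(X, K, max_iter=100):
    """K-means with density- and distance-based initial centres (step (2)).

    ASSUMPTION: the source's density formula is lost, so p_i here is an
    inverse distance-sum stand-in; remaining centres maximise
    w_i = p_i + sum of normalised distances to already-chosen centres.
    """
    # Pairwise distances d_ik = ||x_i - x_k|| in the vector space.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    p = 1.0 / (d.sum(axis=1) + 1e-12)        # assumed density value
    centres = [X[np.argmax(p)]]              # first centre: maximum density
    while len(centres) < K:
        # Normalised distance of every point to each chosen centre (D_it).
        dc = np.array([np.linalg.norm(X - c, axis=1) for c in centres])
        dc /= dc.max(axis=1, keepdims=True) + 1e-12
        w = p + dc.sum(axis=0)               # w_i = p_i + sum_t D_it
        centres.append(X[np.argmax(w)])
    mu = np.array(centres)
    for _ in range(max_iter):                # maximum number of iterations R
        # Sub-steps (6)-(8): assign each sample to the nearest centre.
        labels = np.argmin(
            np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2), axis=1)
        # Sub-step (9): new mean centre of every cluster.
        new_mu = np.array([X[labels == j].mean(axis=0)
                           if np.any(labels == j) else mu[j]
                           for j in range(K)])
        if np.allclose(new_mu, mu):          # sub-step (10): centres unchanged
            break
        mu = new_mu
    return labels, mu
```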
(3) Construct the multi-branch random forest, with the following sub-steps:
(1) Complete the construction with a training set of known labels: provide the training set and pre-process it with the K-means algorithm, following exactly sub-steps (1)–(11) of step (2) above, to obtain the clusters $C = \{C_1, C_2, \dots, C_k\}$.
(2) Use the bootstrap sampling method to complete the sampling of each cluster $C_i$ and construct the multi-branch random forest, as follows (an illustrative code sketch follows this list):
1) Using bootstrap sampling, draw with replacement from the cluster $C_i$ to obtain T training sets $D_i$, each containing m training samples.
2) Suppose each sample has M features. For the splitting of a base decision tree, randomly select m features (m < M) and, for each feature A and each of its values a, compute the Gini index Gini(D, A).
The Gini index Gini(D, A): for a given sample set D, if the set of samples belonging to class $c_k$ is $C_k$, then
$$\mathrm{Gini}(D) = 1 - \sum_{k} \left( \frac{|C_k|}{|D|} \right)^{2} .$$
Under the condition of feature A, the Gini index of the set D is obtained as follows: according to whether the given feature A takes a certain possible value a, the sample set D is divided into two subsets $D_1 = \{(x, y) \in D \mid A(x) = a\}$ and $D_2 = D \setminus D_1$; then
$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2) .$$
3) Choose the optimal feature and optimal cut point: among all features A and all cut points a, the A and a with the smallest Gini index are the optimal feature and optimal cut point, and they serve as the tree node. Split the data set $D_i$ into two child nodes according to the optimal feature and optimal cut point.
4) Recursively apply processes 2) and 3) to the child nodes until the Gini index of the data set falls below a predetermined value, which completes the construction of a base decision tree.
5) The base decision trees together form the multi-branch random forest.
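For illustration only, a minimal sketch of the bootstrap sampling and Gini-index split selection described above; the recursive tree growth and forest assembly are omitted, and names such as gini_index and the ≤-threshold split convention are assumptions:

```python
import numpy as np

def bootstrap_sample(X, y, m_samples, rng):
    """Sub-step 1): draw a training set D_i of m_samples with replacement."""
    idx = rng.integers(0, len(X), size=m_samples)
    return X[idx], y[idx]

def gini(y):
    """Gini(D) = 1 - sum_k (|C_k| / |D|)^2 for a label vector y."""
    _, counts = np.unique(y, return_counts=True)
    frac = counts / len(y)
    return 1.0 - np.sum(frac ** 2)

def gini_index(X, y, feature, a):
    """Gini(D, A): split D on feature A at value a, then weight the
    two subset Ginis by their relative sizes."""
    left = X[:, feature] <= a
    n_l, n_r = left.sum(), len(y) - left.sum()
    if n_l == 0 or n_r == 0:
        return np.inf                        # degenerate split, skip it
    return (n_l * gini(y[left]) + n_r * gini(y[~left])) / len(y)

def best_split(X, y, m, rng):
    """Sub-steps 2)-3): among m randomly chosen features and all their
    values, return the (feature, cut point) minimising Gini(D, A)."""
    feats = rng.choice(X.shape[1], size=m, replace=False)
    best_g, best_f, best_a = np.inf, None, None
    for f in feats:
        for a in np.unique(X[:, f]):
            g = gini_index(X, y, f, a)
            if g < best_g:
                best_g, best_f, best_a = g, f, a
    return best_f, best_a, best_g
```

A base decision tree would then be grown by splitting each $D_i$ at best_split recursively until the Gini index falls below the predetermined value, as in sub-step 4).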
(4) Use the constructed multi-branch random forest model to complete the classification of the data, with the following sub-steps (an illustrative code sketch follows this list):
(1) Input the clusters $C = \{C_1, C_2, \dots, C_k\}$ output by the clustering of step (2) into the multi-branch random forest in turn.
(2) Denote the output of base decision tree $h_i$ on category label $c_j$ as $h_i^j(x)$.
(3) Determine the class of each sample by relative majority voting: $H(x) = c_{\arg\max_{j} \sum_{i=1}^{T} h_i^j(x)}$. Repeat sub-steps (2) and (3) until all clusters have been classified.
(4) Output the classification result.
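A minimal sketch of the relative-majority vote of sub-step (3), for illustration only; it assumes each base decision tree exposes a predict method (an assumption about the tree interface, not something the patent specifies):

```python
import numpy as np

def forest_vote(trees, X, classes):
    """H(x) = c_{argmax_j sum_i h_i^j(x)}: every tree casts one vote per
    sample, and the class with the most votes wins."""
    classes = np.asarray(classes)
    votes = np.zeros((len(X), len(classes)), dtype=int)
    for tree in trees:
        pred = tree.predict(X)              # this tree's label for each sample
        for j, c in enumerate(classes):
            votes[:, j] += (pred == c)      # accumulate h_i^j(x)
    return classes[np.argmax(votes, axis=1)]
```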
The technical solution of the present invention improves the performance and accuracy of data classification.
Detailed description of the invention
Fig. 1 is the flow chart of the present invention;
Fig. 2 is the flow diagram of the construction of the multi-branch random forest.
Specific embodiment
A specific embodiment of the invention is further described below with reference to the accompanying drawings, but this is not a limitation of the invention.
Fig. 1 shows a multi-branch random forest data classification method comprising the following steps:
(1) Provide an unclassified data set and use the PCA algorithm to reduce its dimensionality and remove noise, following sub-steps (1)–(4) of step (1) in the Summary above.
(2) Complete the clustering of the data set with the K-means algorithm, following sub-steps (1)–(11) of step (2) above, and output the clusters $C = \{C_1, C_2, \dots, C_k\}$.
(3) Construct the multi-branch random forest; the detailed process is as shown in Fig. 2 and follows sub-steps (1) and (2) of step (3) above: pre-process the labelled training set with the K-means algorithm, then bootstrap-sample each cluster $C_i$ and grow the Gini-split base decision trees that form the forest.
(4) Use the constructed multi-branch random forest model to complete the classification of the data, following sub-steps (1)–(4) of step (4) above: input the clusters into the forest in turn and determine each sample's class by relative majority voting, then output the classification result.
The technical solution of the present invention improves the performance and accuracy of data classification.
The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. For those skilled in the art, various changes, modifications, substitutions and variations made to these embodiments without departing from the principles and spirit of the present invention still fall within the protection scope of the present invention.

Claims (8)

1. A multi-branch random forest data classification method, characterised by comprising the following steps:
(1) providing an unclassified data set and using the PCA algorithm to reduce its dimensionality and remove noise;
(2) completing the clustering of the data using the K-means algorithm;
(3) constructing the multi-branch random forest;
(4) using the multi-branch random forest model to complete the classification of the data.
2. The multi-branch random forest data classification method of claim 1, characterised in that step (1) comprises the following sub-steps:
(1) expressing the sample set as an N × M matrix X;
(2) zero-meaning each row, i.e. computing the average value $R_i$ of every row of the matrix and subtracting it from that row, $N_i - R_i$; computing the covariance matrix $C = \frac{1}{M} X X^{\mathsf{T}}$; and computing the eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_m$ of the covariance matrix C together with the normalised eigenvectors $x_1, x_2, \dots, x_m$;
(3) arranging the eigenvectors as the rows of a matrix, from top to bottom by decreasing eigenvalue, and taking the first k rows to form the matrix P;
(4) multiplying the matrix P by the matrix X to obtain the reduced-dimension data, removing the redundant part of the data.
3. The multi-branch random forest data classification method of claim 1, characterised in that step (2) comprises the following sub-steps:
(1) computing the density value $p_{ij}$ of each sample point from the pairwise distances $d_{ijk} = \lVert x_{ij} - x_{kj} \rVert$, where $p_{ij}$ is the density of the i-th sample point in class j, $n_j$ is the total number of sample points in class j, and $d_{ijk}$ is the distance between the sample points $x_{ij}$ and $x_{kj}$ in the vector space; and taking the sample point with the largest density value $p_{ij}$ as the first cluster centre;
(2) also considering distance in the selection of the remaining cluster centres: for a given sample $y_n$, normalising the distance from the sample point to it, giving the normalised distance $D_{ijt}$;
(3) forming $w_{ij}$, the sum of the density value of the sample point and its normalised distances to the already-selected cluster centres, $w_{ij} = p_{ij} + \sum_t D_{ijt}$, where $p_{ij}$ denotes the density of the i-th sample point in class j and $D_{ijt}$ denotes the normalised distance from the sample point $x_{ij}$ to the centre $y_t$ of the t-th selected class; the number of clusters K being determined by the elbow method;
(4) sorting the $w_{ij}$ from large to small and selecting the first k−1 sample points, together with the point of maximum $p_{ij}$ value, as the initial cluster centres $C_1, C_2, \dots, C_k$;
(5) taking $c_1, c_2, \dots, c_k$ as the initial cluster centres, re-denoted $\mu_1, \mu_2, \dots, \mu_k$, and setting the maximum number of iterations R;
(6) computing the distance of each sample to every cluster centre, $\mathrm{dist}(x_i, \mu_j) = \lVert x_i - \mu_j \rVert_2$, where i = 1, 2, …, N and j = 1, 2, …, k;
(7) assigning $x_i$ the cluster label of the nearest cluster centre: $\lambda_i = \arg\min_{j \in \{1, 2, \dots, k\}} \mathrm{dist}(x_i, \mu_j)$;
(8) placing the sample $x_i$ into the corresponding cluster: $C_{\lambda_i} = C_{\lambda_i} \cup \{x_i\}$;
(9) after all samples have been clustered, computing the new mean centre of each cluster, $\mu_i' = \frac{1}{|C_i|} \sum_{x \in C_i} x$; if $\mu_i'$ and $\mu_i$ are unequal, updating the cluster centre to $\mu_i'$, and if they are equal, keeping $\mu_i$ unchanged; then recomputing the cluster to which each sample belongs;
(10) repeating sub-step (9) until no cluster centre changes or the maximum number of iterations is reached;
(11) outputting the cluster division $C = \{C_1, C_2, \dots, C_k\}$.
4. The multi-branch random forest data classification method of claim 1, characterised in that step (3) comprises the following sub-steps:
(1) completing the construction with a training set of known labels: providing the training set and pre-processing it with the K-means algorithm to obtain the clusters $C = \{C_1, C_2, \dots, C_k\}$;
(2) using the bootstrap sampling method to complete the sampling of each cluster $C_i$ and construct the multi-branch random forest.
5. The multi-branch random forest data classification method of claim 4, characterised in that sub-step (1) of step (3) proceeds as follows:
1) computing the density value $p_{ij}$ of each sample point from the pairwise distances $d_{ijk} = \lVert x_{ij} - x_{kj} \rVert$, where $p_{ij}$ is the density of the i-th sample point in class j, $n_j$ is the total number of sample points in class j, and $d_{ijk}$ is the distance between the sample points $x_{ij}$ and $x_{kj}$ in the vector space; and taking the sample point with the largest density value $p_{ij}$ as the first cluster centre;
2) also considering distance in the selection of the remaining cluster centres: for a given sample $y_n$, normalising the distance from the sample point to it, giving the normalised distance $D_{ijt}$;
3) forming $w_{ij}$, the sum of the density value of the sample point and its normalised distances to the already-selected cluster centres, $w_{ij} = p_{ij} + \sum_t D_{ijt}$, where $p_{ij}$ denotes the density of the i-th sample point in class j and $D_{ijt}$ denotes the normalised distance from the sample point $x_{ij}$ to the centre $y_t$ of the t-th selected class; the number of clusters K being determined by the elbow method;
4) sorting the $w_{ij}$ from large to small and selecting the first k−1 sample points, together with the point of maximum $p_{ij}$ value, as the initial cluster centres $C_1, C_2, \dots, C_k$;
5) taking $c_1, c_2, \dots, c_k$ as the initial cluster centres, re-denoted $\mu_1, \mu_2, \dots, \mu_k$, and setting the maximum number of iterations R;
6) computing the distance of each sample to every cluster centre, $\mathrm{dist}(x_i, \mu_j) = \lVert x_i - \mu_j \rVert_2$, where i = 1, 2, …, N and j = 1, 2, …, k;
7) assigning $x_i$ the cluster label of the nearest cluster centre: $\lambda_i = \arg\min_{j \in \{1, 2, \dots, k\}} \mathrm{dist}(x_i, \mu_j)$;
8) placing the sample $x_i$ into the corresponding cluster: $C_{\lambda_i} = C_{\lambda_i} \cup \{x_i\}$;
9) after all samples have been clustered, computing the new mean centre of each cluster, $\mu_i' = \frac{1}{|C_i|} \sum_{x \in C_i} x$; if $\mu_i'$ and $\mu_i$ are unequal, updating the cluster centre to $\mu_i'$, and if they are equal, keeping $\mu_i$ unchanged; then recomputing the cluster to which each sample belongs;
10) repeating process 9) until no cluster centre changes or the maximum number of iterations is reached;
11) outputting the cluster division $C = \{C_1, C_2, \dots, C_k\}$.
6. The multi-branch random forest data classification method of claim 4, characterised in that sub-step (2) of step (3) proceeds as follows:
1) using bootstrap sampling, drawing with replacement from the cluster $C_i$ to obtain T training sets $D_i$, each containing m training samples;
2) supposing each sample has M features, randomly selecting m features (m < M) for the splitting of a base decision tree and, for each feature A and each of its values a, computing the Gini index Gini(D, A);
3) choosing the optimal feature and optimal cut point: among all features A and all cut points a, the A and a with the smallest Gini index are the optimal feature and optimal cut point, and they serve as the tree node; splitting the data set $D_i$ into two child nodes according to the optimal feature and optimal cut point;
4) recursively applying processes 2) and 3) to the child nodes until the Gini index of the data set falls below a predetermined value, which completes the construction of a base decision tree;
5) forming the multi-branch random forest from the base decision trees.
7. The multi-branch random forest data classification method of claim 6, characterised in that the Gini index Gini(D, A) of sub-step (2) of step (3) is defined as follows: for a given sample set D, if the set of samples belonging to class $c_k$ is $C_k$, then
$$\mathrm{Gini}(D) = 1 - \sum_{k} \left( \frac{|C_k|}{|D|} \right)^{2} ;$$
and under the condition of feature A, according to whether the given feature A takes a certain possible value a, the sample set D is divided into two subsets $D_1 = \{(x, y) \in D \mid A(x) = a\}$ and $D_2 = D \setminus D_1$; then
$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2) .$$
8. The multi-branch random forest data classification method of claim 1, characterised in that step (4) comprises the following sub-steps:
(1) inputting the clusters $C = \{C_1, C_2, \dots, C_k\}$ output by the clustering of step (2) into the multi-branch random forest in turn;
(2) denoting the output of base decision tree $h_i$ on category label $c_j$ as $h_i^j(x)$;
(3) determining the class of each sample by relative majority voting, $H(x) = c_{\arg\max_{j} \sum_{i=1}^{T} h_i^j(x)}$, and repeating sub-steps (2) and (3) until all clusters have been classified;
(4) outputting the classification result.
CN201811273813.2A (filed 2018-10-30, priority 2018-10-30) Multi-branch random forest data classification method — Pending — published as CN109492682A

Priority Applications (1)

Application Number: CN201811273813.2A · Priority Date: 2018-10-30 · Filing Date: 2018-10-30 · Title: Multi-branch random forest data classification method

Publications (1)

Publication Number: CN109492682A · Publication Date: 2019-03-19

Family ID: 65691759

Family Applications (1): CN201811273813.2A — Multi-branch random forest data classification method (pending)

Country Status (1): CN — CN109492682A (en)



Legal Events

  • PB01 — Publication
  • SE01 — Entry into force of request for substantive examination
  • WD01 — Invention patent application deemed withdrawn after publication (application publication date: 2019-03-19)