CN106570537A - Random forest model selection method based on confusion matrix - Google Patents


Info

Publication number
CN106570537A
CN106570537A (application CN201611031244.1A)
Authority
CN
China
Prior art keywords
matrix
random forest
decision tree
decision
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611031244.1A
Other languages
Chinese (zh)
Inventor
侯春萍
张倩楠
王宝亮
常鹏
张荧允
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201611031244.1A priority Critical patent/CN106570537A/en
Publication of CN106570537A publication Critical patent/CN106570537A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes

Abstract

The invention discloses a random forest model selection method based on a confusion matrix. The method comprises the following steps: trained decision trees are used as the original random forest; each decision tree classifies a test sample set to obtain the confusion matrix of its classification results; the confusion matrices of the trees in the random forest are compared pairwise to obtain the difference matrix of any two trees; the F-norm of the difference matrix is used as the similarity measure of two trees, and the diversity measure matrix of the random forest is built from these values; the elements of the diversity measure matrix that do not exceed the similarity threshold are traversed, and the classification accuracy of the trees involved in each element is examined: if a tree's accuracy is below the classification accuracy threshold, the tree is deleted and all elements in its row and column of the matrix are set to zero, otherwise the tree is retained; the model selection of the random forest is thus completed.

Description

A random forest model selection method based on a confusion matrix
Technical field
The present invention relates to ensemble classifiers.
Background technology
Random forests
A random forest is a classifier built on the idea of ensemble learning: a number of decision tree classifiers are constructed to form a forest. The decision trees are mutually independent; each tree votes on a test sample under a given voting rule, and the final result is produced by the vote. The random forest classifier inherits the advantages of the decision tree classifier, namely a simple, transparent principle and ease of execution, while overcoming its tendency to overfit. Through the interaction between trees it gains the added benefits of an ensemble classifier, and its classification performance is improved.
Since the random forest algorithm was proposed, many researchers have studied and improved it. The improvements can generally be summarized in two directions: one is combining the random forest with other algorithms; the other is studying the constitution and building process of the random forest itself, for example improving its feature selection and model combination methods. Some researchers combined the voting process of the random forest with the Hough transform, producing a classifier called the Hough forest, which has been applied successfully in computer vision fields such as object detection and action recognition. Others introduced the idea of survival trees into the random forest: training samples are drawn with the bootstrap method, a survival tree is built on each training subset, and the survival functions of all the trees are combined to produce the overall voting decision. The resulting algorithm, called random survival forests, performs well on high-dimensional data.
Optimizations of the building process of the random forest itself have also achieved some success. One line of research combines several node-splitting algorithms into a linear function when the decision trees are grown, so that a single tree does not rely on one splitting algorithm but on a combination of different ones; with suitable combination coefficients this can improve the classification performance of the random forest.
In practice, the random forest algorithm is used for classification and regression prediction in fields such as energy, transportation, computer vision and genetic engineering. In addition, it can estimate the importance of sample attributes, so it is also widely used for dimensionality reduction and feature selection. Because randomness is introduced both in sampling and in growing the trees, the trees are independent, which makes the algorithm easy to parallelize and well suited to big-data processing environments.
Confusion matrices
Classifier combination means integrating the decisions of the base classifiers under some combination strategy to obtain stronger overall classification performance. It is generally believed that the independence, diversity and complementarity of the base classifiers are the keys to good ensemble performance. Measuring diversity, however, is not as simple as measuring classification accuracy; diversity measures are usually divided into result-based measures and structure-based measures.
The confusion matrix is a way of presenting classification results: by tabulating true classes against output classes, it exposes the behaviour of a classifier. In machine learning, the confusion matrix is typically used in supervised learning to display classification results; it is a visualization tool that represents a classifier's results explicitly and can serve as an evaluation of classification quality. Judging the similarity of two classifiers by measuring the similarity of their confusion matrices belongs to the category of result-based measures, so it can be used to assess classifier diversity and is well suited to model combination.
Summary of the invention
The purpose of the present invention is to improve the existing random forest ensemble classifier by providing a model selection method that yields decision trees with good diversity. The technical scheme is as follows:
A random forest model selection method based on a confusion matrix comprises the following steps:
A. Take the trained decision trees as the original random forest; classify the test sample set with each decision tree, obtain the confusion matrix of each tree's classification results, and normalize each matrix by the number of samples in each class.
B. Compare the confusion matrices of the decision trees in the random forest pairwise by subtraction to obtain the difference matrix of any two trees. Use the F-norm of the difference matrix as the similarity measure of the two trees and, with these values as elements, build the diversity measure matrix of the random forest; each element is the similarity measure of the two decision trees it refers to.
C. Traverse, in increasing order, the elements of the diversity measure matrix that do not exceed the similarity threshold. Examine the classification accuracy of the decision trees involved in each element: if a tree's accuracy is below the classification accuracy threshold, delete the tree and set all elements in its row and column of the matrix to zero; otherwise retain the tree.
D. Finally, combine the decision trees represented by the nonzero rows and columns of the diversity measure matrix into a new random forest, completing the model selection of the random forest.
This patent improves the model selection process of the random forest by proposing a decision tree similarity measure based on the confusion matrix. On this basis, combined with the classification performance of the trees, model selection is completed with a reverse "delete the bad" strategy. The method considers both the classification performance of the random forest and the correlations among its trees; it can effectively remove strongly correlated, poorly performing decision trees and improve the classification ability of the random forest.
Specific embodiment
The present invention is described in detail below.
1 Creating the confusion matrix from the classification results of a decision tree
In a confusion matrix, each row represents the true class of the samples and each column the predicted class. In abstract mathematical form: suppose an N-class task is to be solved, with sample vector set X = {x1, x2, ..., xM} and class vector Y = {y1, y2, ..., yN}. Let the matrix CM(T, X) denote the statistics of the classification results of tree T on sample set X, where entry (i, j) counts the samples of true class yi that T assigns to class yj. The dimension of CM(T, X) is thereby determined: it is an N × N square matrix.
On the basis of the confusion matrix we can obtain a similarity measure of two classification trees. The premise is that if two trees are similar, their classification results should be similar; in particular, the number of correct classifications in each class, and the distribution of misclassifications between classes, should be close. The similarity measure is designed as follows: starting from the confusion matrices already built for two classification trees Ti and Tj, the distance between the matrices is estimated from the viewpoint of matrices and vectors and used as the similarity measure of the two trees. If the distance between the confusion matrices is small, the classification trees Ti and Tj are considered close to each other; if the distance is large, their similarity is considered weak.
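As a concrete illustration, the confusion matrix of step A with its per-class normalization can be sketched in NumPy as follows (a minimal sketch; the function name and toy labels are illustrative, not part of the patent):

```python
import numpy as np

def confusion_matrix_normalized(y_true, y_pred, n_classes):
    """Confusion matrix CM(T, X): rows = true class, columns = predicted class.

    Each row is divided by the number of test samples of that class, as in
    step A, so that classes with many samples do not dominate the measure.
    """
    cm = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1.0                       # count sample of true class t predicted as p
    row_counts = cm.sum(axis=1, keepdims=True)
    row_counts[row_counts == 0] = 1.0         # guard against classes absent from the test set
    return cm / row_counts

# toy labels for one decision tree on a 3-class test set
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix_normalized(y_true, y_pred, 3)
```

Each normalized row then sums to 1, so every class contributes equally to the later matrix distance.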
2 Obtaining the similarity measure matrix from the confusion matrices
The distance between confusion matrices is measured with the F-norm (Frobenius norm). First the confusion matrices CM(i) and CM(j) of the two trees are subtracted; the resulting matrix is called the difference matrix DCM(i,j), also an N × N square matrix:
DCM(i,j) = CM(i) - CM(j)
In a classification task, the numbers of samples in the different classes may differ considerably. To prevent classes that dominate in quantity from dominating the matrix distance, which would make the similarity measure overweight the large classes and ignore the diversity in the other classes, each row of the difference matrix is normalized; the normalization factor is the maximum element value of that row, giving the normalized difference matrix DCMu(i,j). Finally, the F-norm of DCMu(i,j) gives the similarity between trees Ti and Tj in the random forest.
For a random forest of size T, the pairwise distances between decision trees can be arranged in a matrix, referred to as the similarity measure matrix MF of the random forest; its size is T × T. Its element mfij is obtained from the normalized difference matrix DCMu(i,j), i, j = 1, ..., T, by the relation:
mfij = || DCMu(i,j) ||F
Overall, this confusion-matrix-based similarity measure not only makes the coarse distinction between correct and incorrect decisions of a tree, but further distinguishes, within the misclassifications, the pattern of confusion between classes. It thus becomes harder for two substantially different trees to produce classification results that are mistakenly judged similar, yielding a more discriminative measure.
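The difference matrix, its row normalization and the F-norm measure of section 2 can be sketched as follows (function names are illustrative; `cms` is assumed to be a list of per-tree confusion matrices):

```python
import numpy as np

def tree_dissimilarity(cm_i, cm_j):
    """Measure for one pair of trees: F-norm of the row-normalized difference matrix."""
    dcm = cm_i - cm_j                                  # DCM(i,j) = CM(i) - CM(j)
    row_max = np.abs(dcm).max(axis=1, keepdims=True)   # per-row normalization factor
    row_max[row_max == 0] = 1.0                        # rows with no difference stay zero
    dcm_u = dcm / row_max                              # normalized difference matrix DCMu(i,j)
    return np.linalg.norm(dcm_u, ord='fro')            # || DCMu(i,j) ||_F

def similarity_measure_matrix(cms):
    """T x T matrix MF; small entries mean similar trees."""
    T = len(cms)
    mf = np.zeros((T, T))
    for i in range(T):
        for j in range(i + 1, T):
            mf[i, j] = mf[j, i] = tree_dissimilarity(cms[i], cms[j])
    return mf
```

The matrix is symmetric with a zero diagonal, so only the upper triangle needs to be computed.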
3 Obtaining the new random forest by "deleting the bad"
After the similarity measure matrix of the random forest has been created, selecting the well-performing classification trees by iterative clustering or tree selection would incur a large computational cost. We instead go the opposite way and propose a reverse "delete the bad" model selection strategy: rather than selecting the best trees from the large original set of base classifiers, the poor decision tree classifiers are deleted and the remaining classifiers are integrated. Concretely, classifiers that are strongly correlated and classify poorly are removed from the original set of base classifiers, and the remaining classifiers automatically form the new random forest model. The advantage of this approach is that it need not consider the complete set of relations within the classifier set; it only examines the two classifiers involved in each element of the similarity measure matrix that is below the similarity threshold. Of these two classifiers, if the classification ability of one is below the classification performance threshold, it is deleted. What is handled in this way is a simpler relation, while both the correlation between classifiers and their classification ability are taken into account.
The model selection algorithm of this step is as follows:
1. Set d as the similarity threshold and α as the classification ability threshold.
2. Let minij be the current minimum nonzero element of MF.
3. While minij is less than d:
(a) determine which of decision trees i and j has the lower classification ability, say tree i, and check whether its classification ability is below α;
(b) if so, delete tree i, i.e. set all elements in the row and column of MF belonging to tree i to zero;
(c) let minij point to the next minimum nonzero element of the matrix.
4. When no element less than d remains, the iteration terminates. The decision trees whose rows and columns still contain nonzero elements are combined into the new random forest RF.
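The algorithm steps above can be sketched as follows (a hedged sketch: `mf` is the similarity measure matrix and `accuracies` the per-tree test accuracies; the patent does not fully specify tie-breaking when both trees of a pair are weak, so the weaker tree is examined first here; all names are illustrative):

```python
import numpy as np

def delete_bad(mf, accuracies, d, alpha):
    """Reverse 'delete the bad' model selection.

    d     : similarity threshold (a small mf entry means two similar trees)
    alpha : classification ability threshold
    Returns the indices of the trees kept in the new random forest RF.
    """
    mf = mf.copy()
    alive = set(range(len(accuracies)))
    while True:
        mask = mf > 0                          # nonzero off-diagonal entries
        if not mask.any():
            break
        flat = np.where(mask, mf, np.inf)
        i, j = np.unravel_index(flat.argmin(), mf.shape)
        if mf[i, j] >= d:                      # no remaining element below d: terminate
            break
        # of the two similar trees, examine the weaker one
        weak = i if accuracies[i] <= accuracies[j] else j
        if accuracies[weak] < alpha:
            mf[weak, :] = 0.0                  # delete the tree: zero its row and column
            mf[:, weak] = 0.0
            alive.discard(weak)
        else:
            mf[i, j] = mf[j, i] = 0.0          # retain both trees, move to the next element
    return sorted(alive)
```

Because each iteration zeroes at least one entry, the loop always terminates.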
As an embodiment, data sets from the UCI machine learning repository are chosen. Each data set is randomly split into a training sample set and a test sample set, with 50% of the samples for training and 50% for testing. The bagging method is used on the training sample set to extract training sample subsets and generate a large number of decision tree classifiers, which form the original random forest model.
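The generation of the original random forest in the embodiment can be sketched as follows (synthetic data stands in for a UCI data set, and scikit-learn's DecisionTreeClassifier is assumed as the base learner; sizes and names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# synthetic stand-in for a UCI data set: 200 samples, 4 features, 3 classes
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int) + (X[:, 2] > 1).astype(int)

# 50% / 50% random split into training and test sets, as in the embodiment
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:100], idx[100:]

# bagging: each tree is fitted on a bootstrap sample of the training set
forest = []
for _ in range(25):
    boot = rng.choice(train_idx, size=len(train_idx), replace=True)
    forest.append(DecisionTreeClassifier(random_state=0).fit(X[boot], y[boot]))

# per-tree predictions on the test set feed the confusion matrices of step A
preds = [tree.predict(X[test_idx]) for tree in forest]
```

These per-tree predictions are the input from which the confusion matrices, the similarity measure matrix and the "delete the bad" selection are then computed.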
On this basis, the steps of the confusion-matrix-based random forest model selection method proposed in this patent are carried out, with the original random forest as input and the similarity threshold and classification accuracy threshold preset.
The proposed method steps are carried out on the test sample set, finally producing the random forest model obtained through model selection.

Claims (1)

1. A random forest model selection method based on a confusion matrix, comprising the following steps:
A. Take the trained decision trees as the original random forest; classify the test sample set with each decision tree, obtain the confusion matrix of each tree's classification results, and normalize each matrix by the number of samples in each class;
B. Compare the confusion matrices of the decision trees in the random forest pairwise by subtraction to obtain the difference matrix of any two trees; use the F-norm of the difference matrix as the similarity measure of the two trees and, with these values as elements, build the diversity measure matrix of the random forest, each element being the similarity measure of the two decision trees it refers to;
C. Traverse, in increasing order, the elements of the diversity measure matrix that do not exceed the similarity threshold; examine the classification accuracy of the decision trees involved in each element: if a tree's accuracy is below the classification accuracy threshold, delete the tree and set all elements in its row and column of the matrix to zero, otherwise retain the tree;
D. Finally, combine the decision trees represented by the nonzero rows and columns of the diversity measure matrix into a new random forest, completing the model selection of the random forest.
CN201611031244.1A 2016-11-17 2016-11-17 Random forest model selection method based on confusion matrix Pending CN106570537A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611031244.1A CN106570537A (en) 2016-11-17 2016-11-17 Random forest model selection method based on confusion matrix


Publications (1)

Publication Number Publication Date
CN106570537A (en) 2017-04-19

Family

ID=58542098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611031244.1A Pending CN106570537A (en) 2016-11-17 2016-11-17 Random forest model selection method based on confusion matrix

Country Status (1)

Country Link
CN (1) CN106570537A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103489005A (en) * 2013-09-30 2014-01-01 河海大学 High-resolution remote sensing image classifying method based on fusion of multiple classifiers
CN103500344A (en) * 2013-09-02 2014-01-08 中国测绘科学研究院 Method and module for extracting and interpreting information of remote-sensing image
CN104572786A (en) * 2013-10-29 2015-04-29 华为技术有限公司 Visualized optimization processing method and device for random forest classification model
CN105447525A (en) * 2015-12-15 2016-03-30 中国科学院软件研究所 Data prediction classification method and device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KONG Yinghui: "Research on classification methods based on confusion matrix and ensemble learning", Computer Engineering & Science *
DENG Hongwei et al.: "Random forest model for discriminating the blastability grade of rock mass and its R implementation", World Sci-Tech R&D *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697447A (en) * 2017-10-20 2019-04-30 富士通株式会社 Disaggregated model construction device, method and electronic equipment based on random forest
CN110276382A (en) * 2019-05-30 2019-09-24 平安科技(深圳)有限公司 Listener clustering method, apparatus and medium based on spectral clustering
CN110276382B (en) * 2019-05-30 2023-12-22 平安科技(深圳)有限公司 Crowd classification method, device and medium based on spectral clustering
CN110826618A (en) * 2019-11-01 2020-02-21 南京信息工程大学 Personal credit risk assessment method based on random forest
CN111125727A (en) * 2019-12-03 2020-05-08 支付宝(杭州)信息技术有限公司 Confusion circuit generation method, prediction result determination method, device and electronic equipment
CN111125727B (en) * 2019-12-03 2021-05-14 支付宝(杭州)信息技术有限公司 Confusion circuit generation method, prediction result determination method, device and electronic equipment
CN112836731A (en) * 2021-01-21 2021-05-25 黑龙江大学 Signal random forest classification method, system and device based on decision tree accuracy and relevance measurement

Similar Documents

Publication Publication Date Title
CN106570537A (en) Random forest model selection method based on confusion matrix
CN103632168B (en) Classifier integration method for machine learning
CN107766883A (en) A kind of optimization random forest classification method and system based on weighted decision tree
CN108540451A (en) A method of classification and Detection being carried out to attack with machine learning techniques
CN104484681B (en) Hyperspectral Remote Sensing Imagery Classification method based on spatial information and integrated study
CN106482967B (en) A kind of Cost Sensitive Support Vector Machines locomotive wheel detection system and method
CN103914705B (en) Hyperspectral image classification and wave band selection method based on multi-target immune cloning
CN108304884A (en) A kind of cost-sensitive stacking integrated study frame of feature based inverse mapping
CN108319987A (en) A kind of filtering based on support vector machines-packaged type combined flow feature selection approach
CN104102929A (en) Hyperspectral remote sensing data classification method based on deep learning
CN104809476B (en) A kind of multi-target evolution Fuzzy Rule Classification method based on decomposition
CN103489005A (en) High-resolution remote sensing image classifying method based on fusion of multiple classifiers
CN116108758B (en) Landslide susceptibility evaluation method
CN105825078B (en) Small sample Classification of Gene Expression Data method based on gene big data
CN104318515B (en) High spectrum image wave band dimension reduction method based on NNIA evolution algorithms
CN103886336A (en) Polarized SAR image classifying method based on sparse automatic encoder
CN102254033A (en) Entropy weight-based global K-means clustering method
CN104765839A (en) Data classifying method based on correlation coefficients between attributes
CN105893876A (en) Chip hardware Trojan horse detection method and system
CN105389505A (en) Shilling attack detection method based on stack type sparse self-encoder
CN102254020A (en) Global K-means clustering method based on feature weight
CN104966106B (en) A kind of biological age substep Forecasting Methodology based on support vector machines
Corrales et al. An empirical multi-classifier for coffee rust detection in colombian crops
CN102930291B (en) Automatic K adjacent local search heredity clustering method for graphic image
CN109800790A (en) A kind of feature selection approach towards high dimensional data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170419