CN106570537A - Random forest model selection method based on confusion matrix - Google Patents


Info

Publication number
CN106570537A
CN106570537A (application CN201611031244.1A)
Authority
CN
China
Prior art keywords
matrix
random forest
decision tree
decision
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611031244.1A
Other languages
Chinese (zh)
Inventor
侯春萍
张倩楠
王宝亮
常鹏
张荧允
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201611031244.1A priority Critical patent/CN106570537A/en
Publication of CN106570537A publication Critical patent/CN106570537A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes

Abstract

The invention discloses a random forest model selection method based on a confusion matrix. The method comprises the following steps: trained decision trees are used as the original random forest; each decision tree classifies a test sample set to obtain the confusion matrix of its classification results; the confusion matrices of the trees in the random forest are compared pairwise to obtain the difference matrix of any two trees; the F-norm of the difference matrix is used as the similarity measure of two trees, and the diversity measure matrix of the random forest is built from these values; the elements of the diversity measure matrix that do not exceed the similarity threshold are traversed, and the classification accuracy of the trees involved in each element is examined: if a tree's accuracy is below the classification accuracy threshold, the tree is deleted and all elements in its row and column of the matrix are set to zero, otherwise the tree is retained; the model selection of the random forest is thus completed.

Description

A random forest model selection method based on a confusion matrix
Technical field
The present invention relates to ensemble classifiers.
Background technology
Random forests
A random forest is a classifier built on the idea of ensemble learning: a number of decision tree classifiers are constructed to form a forest. The decision trees are mutually independent; each tree votes on a test sample under a given voting rule, and the final result is produced by the vote. The random forest classifier inherits the advantages of the decision tree classifier, namely a simple, transparent principle and ease of execution, while overcoming its tendency to overfit. Through the interaction between trees it gains the added benefits of an ensemble classifier, and its classification performance is improved.
Since the random forest algorithm was proposed, many researchers have studied and improved it. The improvements can generally be summarized in two directions: one is combining the random forest with other algorithms; the other is studying the constitution and building process of the random forest itself, for example improving its feature selection and model combination methods. Some researchers combined the voting process of the random forest with the Hough transform, producing a classifier called the Hough forest, which has been applied successfully in computer vision fields such as object detection and action recognition. Others introduced the idea of survival trees into the random forest: training samples are drawn with the bootstrap method, a survival tree is built on each training subset, and the survival functions of all the trees are combined to produce the overall voting decision. The resulting algorithm, called random survival forests, performs well on high-dimensional data.
Optimizations of the building process of the random forest itself have also achieved some success. One line of research combines several node-splitting algorithms into a linear function when the decision trees are grown, so that a single tree does not rely on one splitting algorithm but on a combination of different ones; with suitable combination coefficients this can improve the classification performance of the random forest.
In practice, the random forest algorithm is used for classification and regression prediction in fields such as energy, transportation, computer vision and genetic engineering. In addition, it can estimate the importance of sample attributes, so it is also widely used for dimensionality reduction and feature selection. Because randomness is introduced both in sampling and in growing the trees, the trees are independent, which makes the algorithm easy to parallelize and well suited to big-data processing environments.
Confusion matrices
Classifier combination means integrating the decisions of the base classifiers under some combination strategy to obtain stronger overall classification performance. It is generally believed that the independence, diversity and complementarity of the base classifiers are the keys to good ensemble performance. Measuring diversity, however, is not as simple as measuring classification accuracy; diversity measures are usually divided into result-based measures and structure-based measures.
The confusion matrix is a way of presenting classification results: by tabulating true classes against output classes, it exposes the behaviour of a classifier. In machine learning, the confusion matrix is typically used in supervised learning to display classification results; it is a visualization tool that represents a classifier's results explicitly and can serve as an evaluation of classification quality. Judging the similarity of two classifiers by measuring the similarity of their confusion matrices belongs to the category of result-based measures, so it can be used to assess classifier diversity and is well suited to model combination.
Summary of the invention
The purpose of the present invention is to improve the existing random forest ensemble classifier by providing a model selection method that yields decision trees with good diversity. The technical scheme is as follows:
A random forest model selection method based on a confusion matrix comprises the following steps:
A. Take the trained decision trees as the original random forest; classify the test sample set with each decision tree, obtain the confusion matrix of each tree's classification results, and normalize each matrix by the number of samples in each class.
B. Compare the confusion matrices of the decision trees in the random forest pairwise by subtraction to obtain the difference matrix of any two trees. Use the F-norm of the difference matrix as the similarity measure of the two trees and, with these values as elements, build the diversity measure matrix of the random forest; each element is the similarity measure of the two decision trees it refers to.
C. Traverse, in increasing order, the elements of the diversity measure matrix that do not exceed the similarity threshold. Examine the classification accuracy of the decision trees involved in each element: if a tree's accuracy is below the classification accuracy threshold, delete the tree and set all elements in its row and column of the matrix to zero; otherwise retain the tree.
D. Finally, combine the decision trees represented by the nonzero rows and columns of the diversity measure matrix into a new random forest, completing the model selection of the random forest.
This patent improves the model selection process of the random forest by proposing a decision tree similarity measure based on the confusion matrix. On this basis, combined with the classification performance of the trees, model selection is completed with a reverse "delete the bad" strategy. The method considers both the classification performance of the random forest and the correlations among its trees; it can effectively remove strongly correlated, poorly performing decision trees and improve the classification ability of the random forest.
Specific embodiment
The present invention is described in detail below.
1 Creating the confusion matrix from the classification results of a decision tree
In a confusion matrix, each row represents the true class of the samples and each column the predicted class. In abstract mathematical form: suppose an N-class task is to be solved, with sample vector set X = {x1, x2, ..., xM} and class vector Y = {y1, y2, ..., yN}. Let the matrix CM(T, X) denote the statistics of the classification results of tree T on sample set X, where entry (i, j) counts the samples of true class yi that T assigns to class yj. The dimension of CM(T, X) is thereby determined: it is an N × N square matrix.
On the basis of the confusion matrix we can obtain a similarity measure of two classification trees. The premise is that if two trees are similar, their classification results should be similar; in particular, the number of correct classifications in each class, and the distribution of misclassifications between classes, should be close. The similarity measure is designed as follows: starting from the confusion matrices already built for two classification trees Ti and Tj, the distance between the matrices is estimated from the viewpoint of matrices and vectors and used as the similarity measure of the two trees. If the distance between the confusion matrices is small, the classification trees Ti and Tj are considered close to each other; if the distance is large, their similarity is considered weak.
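As a concrete illustration, the confusion matrix of step A with its per-class normalization can be sketched in NumPy as follows (a minimal sketch; the function name and toy labels are illustrative, not part of the patent):

```python
import numpy as np

def confusion_matrix_normalized(y_true, y_pred, n_classes):
    """Confusion matrix CM(T, X): rows = true class, columns = predicted class.

    Each row is divided by the number of test samples of that class, as in
    step A, so that classes with many samples do not dominate the measure.
    """
    cm = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1.0                       # count sample of true class t predicted as p
    row_counts = cm.sum(axis=1, keepdims=True)
    row_counts[row_counts == 0] = 1.0         # guard against classes absent from the test set
    return cm / row_counts

# toy labels for one decision tree on a 3-class test set
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix_normalized(y_true, y_pred, 3)
```

Each normalized row then sums to 1, so every class contributes equally to the later matrix distance.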
2 Obtaining the similarity measure matrix from the confusion matrices
The distance between confusion matrices is measured with the F-norm (Frobenius norm). First the confusion matrices CM(i) and CM(j) of the two trees are subtracted; the resulting matrix is called the difference matrix DCM(i,j), also an N × N square matrix:
DCM(i,j) = CM(i) - CM(j)
In a classification task, the numbers of samples in the different classes may differ considerably. To prevent classes that dominate in quantity from dominating the matrix distance, which would make the similarity measure overweight the large classes and ignore the diversity in the other classes, each row of the difference matrix is normalized; the normalization factor is the maximum element value of that row, giving the normalized difference matrix DCMu(i,j). Finally, the F-norm of DCMu(i,j) gives the similarity between trees Ti and Tj in the random forest.
For a random forest of size T, the pairwise distances between decision trees can be arranged in a matrix, referred to as the similarity measure matrix MF of the random forest; its size is T × T. Its element mfij is obtained from the normalized difference matrix DCMu(i,j), i, j = 1, ..., T, by the relation:
mfij = || DCMu(i,j) ||F
Overall, this confusion-matrix-based similarity measure not only makes the coarse distinction between correct and incorrect decisions of a tree, but further distinguishes, within the misclassifications, the pattern of confusion between classes. It thus becomes harder for two substantially different trees to produce classification results that are mistakenly judged similar, yielding a more discriminative measure.
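The difference matrix, its row normalization and the F-norm measure of section 2 can be sketched as follows (function names are illustrative; `cms` is assumed to be a list of per-tree confusion matrices):

```python
import numpy as np

def tree_dissimilarity(cm_i, cm_j):
    """Measure for one pair of trees: F-norm of the row-normalized difference matrix."""
    dcm = cm_i - cm_j                                  # DCM(i,j) = CM(i) - CM(j)
    row_max = np.abs(dcm).max(axis=1, keepdims=True)   # per-row normalization factor
    row_max[row_max == 0] = 1.0                        # rows with no difference stay zero
    dcm_u = dcm / row_max                              # normalized difference matrix DCMu(i,j)
    return np.linalg.norm(dcm_u, ord='fro')            # || DCMu(i,j) ||_F

def similarity_measure_matrix(cms):
    """T x T matrix MF; small entries mean similar trees."""
    T = len(cms)
    mf = np.zeros((T, T))
    for i in range(T):
        for j in range(i + 1, T):
            mf[i, j] = mf[j, i] = tree_dissimilarity(cms[i], cms[j])
    return mf
```

The matrix is symmetric with a zero diagonal, so only the upper triangle needs to be computed.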
3 Obtaining the new random forest by "deleting the bad"
After the similarity measure matrix of the random forest has been created, selecting the well-performing classification trees by iterative clustering or tree selection would incur a large computational cost. We instead go the opposite way and propose a reverse "delete the bad" model selection strategy: rather than selecting the best trees from the large original set of base classifiers, the poor decision tree classifiers are deleted and the remaining classifiers are integrated. Concretely, classifiers that are strongly correlated and classify poorly are removed from the original set of base classifiers, and the remaining classifiers automatically form the new random forest model. The advantage of this approach is that it need not consider the complete set of relations within the classifier set; it only examines the two classifiers involved in each element of the similarity measure matrix that is below the similarity threshold. Of these two classifiers, if the classification ability of one is below the classification performance threshold, it is deleted. What is handled in this way is a simpler relation, while both the correlation between classifiers and their classification ability are taken into account.
The model selection algorithm of this step is as follows:
1. Set d as the similarity threshold and α as the classification ability threshold.
2. Let minij be the current minimum nonzero element of MF.
3. While minij is less than d:
(a) determine which of decision trees i and j has the lower classification ability, say tree i, and check whether its classification ability is below α;
(b) if so, delete tree i, i.e. set all elements in the row and column of MF belonging to tree i to zero;
(c) let minij point to the next minimum nonzero element of the matrix.
4. When no element less than d remains, the iteration terminates. The decision trees whose rows and columns still contain nonzero elements are combined into the new random forest RF.
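The algorithm steps above can be sketched as follows (a hedged sketch: `mf` is the similarity measure matrix and `accuracies` the per-tree test accuracies; the patent does not fully specify tie-breaking when both trees of a pair are weak, so the weaker tree is examined first here; all names are illustrative):

```python
import numpy as np

def delete_bad(mf, accuracies, d, alpha):
    """Reverse 'delete the bad' model selection.

    d     : similarity threshold (a small mf entry means two similar trees)
    alpha : classification ability threshold
    Returns the indices of the trees kept in the new random forest RF.
    """
    mf = mf.copy()
    alive = set(range(len(accuracies)))
    while True:
        mask = mf > 0                          # nonzero off-diagonal entries
        if not mask.any():
            break
        flat = np.where(mask, mf, np.inf)
        i, j = np.unravel_index(flat.argmin(), mf.shape)
        if mf[i, j] >= d:                      # no remaining element below d: terminate
            break
        # of the two similar trees, examine the weaker one
        weak = i if accuracies[i] <= accuracies[j] else j
        if accuracies[weak] < alpha:
            mf[weak, :] = 0.0                  # delete the tree: zero its row and column
            mf[:, weak] = 0.0
            alive.discard(weak)
        else:
            mf[i, j] = mf[j, i] = 0.0          # retain both trees, move to the next element
    return sorted(alive)
```

Because each iteration zeroes at least one entry, the loop always terminates.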
As an embodiment, data sets from the UCI machine learning repository are chosen. Each data set is randomly split into a training sample set and a test sample set, with 50% of the samples for training and 50% for testing. The bagging method is used on the training sample set to extract training sample subsets and generate a large number of decision tree classifiers, which form the original random forest model.
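The generation of the original random forest in the embodiment can be sketched as follows (synthetic data stands in for a UCI data set, and scikit-learn's DecisionTreeClassifier is assumed as the base learner; sizes and names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# synthetic stand-in for a UCI data set: 200 samples, 4 features, 3 classes
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int) + (X[:, 2] > 1).astype(int)

# 50% / 50% random split into training and test sets, as in the embodiment
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:100], idx[100:]

# bagging: each tree is fitted on a bootstrap sample of the training set
forest = []
for _ in range(25):
    boot = rng.choice(train_idx, size=len(train_idx), replace=True)
    forest.append(DecisionTreeClassifier(random_state=0).fit(X[boot], y[boot]))

# per-tree predictions on the test set feed the confusion matrices of step A
preds = [tree.predict(X[test_idx]) for tree in forest]
```

These per-tree predictions are the input from which the confusion matrices, the similarity measure matrix and the "delete the bad" selection are then computed.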
On this basis, the steps of the confusion-matrix-based random forest model selection method proposed in this patent are carried out, with the original random forest as input and the similarity threshold and classification accuracy threshold preset.
The proposed method steps are carried out on the test sample set, finally producing the random forest model obtained through model selection.

Claims (1)

1. A random forest model selection method based on a confusion matrix, comprising the following steps:
A. Take the trained decision trees as the original random forest; classify the test sample set with each decision tree, obtain the confusion matrix of each tree's classification results, and normalize each matrix by the number of samples in each class;
B. Compare the confusion matrices of the decision trees in the random forest pairwise by subtraction to obtain the difference matrix of any two trees; use the F-norm of the difference matrix as the similarity measure of the two trees and, with these values as elements, build the diversity measure matrix of the random forest, each element being the similarity measure of the two decision trees it refers to;
C. Traverse, in increasing order, the elements of the diversity measure matrix that do not exceed the similarity threshold; examine the classification accuracy of the decision trees involved in each element: if a tree's accuracy is below the classification accuracy threshold, delete the tree and set all elements in its row and column of the matrix to zero, otherwise retain the tree;
D. Finally, combine the decision trees represented by the nonzero rows and columns of the diversity measure matrix into a new random forest, completing the model selection of the random forest.
CN201611031244.1A 2016-11-17 2016-11-17 Random forest model selection method based on confusion matrix Pending CN106570537A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611031244.1A CN106570537A (en) 2016-11-17 2016-11-17 Random forest model selection method based on confusion matrix


Publications (1)

Publication Number Publication Date
CN106570537A (en) 2017-04-19

Family

ID=58542098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611031244.1A Pending CN106570537A (en) 2016-11-17 2016-11-17 Random forest model selection method based on confusion matrix

Country Status (1)

Country Link
CN (1) CN106570537A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103489005A (en) * 2013-09-30 2014-01-01 河海大学 High-resolution remote sensing image classifying method based on fusion of multiple classifiers
CN103500344A (en) * 2013-09-02 2014-01-08 中国测绘科学研究院 Method and module for extracting and interpreting information of remote-sensing image
CN104572786A (en) * 2013-10-29 2015-04-29 华为技术有限公司 Visualized optimization processing method and device for random forest classification model
CN105447525A (en) * 2015-12-15 2016-03-30 中国科学院软件研究所 Data prediction classification method and device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KONG Yinghui: "Research on classification methods based on confusion matrix and ensemble learning", Computer Engineering & Science *
DENG Hongwei et al.: "Random forest model for discriminating the blastability grade of rock mass and its R implementation", World Sci-Tech R&D *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697447A (en) * 2017-10-20 2019-04-30 富士通株式会社 Disaggregated model construction device, method and electronic equipment based on random forest
CN110276382A (en) * 2019-05-30 2019-09-24 平安科技(深圳)有限公司 Listener clustering method, apparatus and medium based on spectral clustering
CN110276382B (en) * 2019-05-30 2023-12-22 平安科技(深圳)有限公司 Crowd classification method, device and medium based on spectral clustering
CN110826618A (en) * 2019-11-01 2020-02-21 南京信息工程大学 Personal credit risk assessment method based on random forest
CN111125727A (en) * 2019-12-03 2020-05-08 支付宝(杭州)信息技术有限公司 Confusion circuit generation method, prediction result determination method, device and electronic equipment
CN111125727B (en) * 2019-12-03 2021-05-14 支付宝(杭州)信息技术有限公司 Confusion circuit generation method, prediction result determination method, device and electronic equipment
CN112836731A (en) * 2021-01-21 2021-05-25 黑龙江大学 Signal random forest classification method, system and device based on decision tree accuracy and relevance measurement

Similar Documents

Publication Publication Date Title
CN106570537A (en) Random forest model selection method based on confusion matrix
CN103632168B (en) Classifier integration method for machine learning
CN107766883A (en) A kind of optimization random forest classification method and system based on weighted decision tree
CN108540451A (en) A method of classification and Detection being carried out to attack with machine learning techniques
CN104484681B (en) Hyperspectral Remote Sensing Imagery Classification method based on spatial information and integrated study
CN106482967B (en) A kind of Cost Sensitive Support Vector Machines locomotive wheel detection system and method
CN103914705B (en) Hyperspectral image classification and wave band selection method based on multi-target immune cloning
CN108304884A (en) A kind of cost-sensitive stacking integrated study frame of feature based inverse mapping
CN108319987A (en) A kind of filtering based on support vector machines-packaged type combined flow feature selection approach
CN104102929A (en) Hyperspectral remote sensing data classification method based on deep learning
CN104809476B (en) A kind of multi-target evolution Fuzzy Rule Classification method based on decomposition
CN103489005A (en) High-resolution remote sensing image classifying method based on fusion of multiple classifiers
CN116108758B (en) Landslide susceptibility evaluation method
CN105825078B (en) Small sample Classification of Gene Expression Data method based on gene big data
CN104318515B (en) High spectrum image wave band dimension reduction method based on NNIA evolution algorithms
CN103886336A (en) Polarized SAR image classifying method based on sparse automatic encoder
CN102254033A (en) Entropy weight-based global K-means clustering method
CN104765839A (en) Data classifying method based on correlation coefficients between attributes
CN105893876A (en) Chip hardware Trojan horse detection method and system
CN105389505A (en) Shilling attack detection method based on stack type sparse self-encoder
CN102254020A (en) Global K-means clustering method based on feature weight
CN104966106B (en) A kind of biological age substep Forecasting Methodology based on support vector machines
Corrales et al. An empirical multi-classifier for coffee rust detection in colombian crops
CN102930291B (en) Automatic K adjacent local search heredity clustering method for graphic image
CN109800790A (en) A kind of feature selection approach towards high dimensional data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170419