CN106570537A - Random forest model selection method based on confusion matrix
- Publication number
- CN106570537A (application number CN201611031244.1A)
- Authority
- CN
- China
- Prior art keywords
- matrix
- random forest
- decision tree
- decision
- classification
- Prior art date
- 2016-11-17
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
Abstract
The invention discloses a random forest model selection method based on confusion matrices. The method comprises the following steps: the trained decision trees are taken as the original random forest; each decision tree classifies a test sample set, yielding a confusion matrix of each tree's classification results; the confusion matrices of the trees in the random forest are compared pairwise to obtain a difference matrix for any two trees; the F-norm of each difference matrix is used as the similarity measure of the two trees, and the diversity measure matrix of the random forest is built from these values; the elements of the diversity measure matrix that do not exceed a similarity threshold are traversed, and the classification accuracy of the decision trees involved in each such element is examined: if a tree's accuracy is below the classification accuracy threshold, the tree is deleted and all elements in its row and column of the matrix are set to zero; otherwise the tree is retained. The remaining trees complete the model selection of the random forest.
Description
Technical field
The present invention relates to ensemble classifiers.
Background technology
Random forests
A random forest is a classifier based on the idea of ensemble learning: it is built from a number of decision tree classifiers, which are independent of one another. Each decision tree votes on a test sample under a given voting rule, and the final result is produced by the vote. The random forest classifier inherits the advantages of decision tree classifiers, whose principle is simple and easy to implement, while overcoming their tendency to overfit; the interaction between trees also provides the added benefit of an ensemble classifier, improving classification performance.
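For illustration only (this sketch is ours, not from the patent), majority voting over the class votes of individual decision trees can be written as:

```python
import numpy as np

# Hypothetical class votes cast by five decision trees for one test sample
tree_votes = np.array([0, 1, 1, 2, 1])

# The forest's prediction is the class with the most votes (class 1 here)
final_class = np.bincount(tree_votes).argmax()
print(final_class)  # 1
```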
Since the random forest algorithm was proposed, many researchers have studied and improved it. The improvements can generally be summarized in two directions: one is to combine random forests with other algorithms; the other is to study the construction theory and building process of the random forest itself, for example by improving its feature selection and model combination methods. Some researchers combined the voting process of random forests with the Hough transform, producing a classifier known as the Hough forest, which has been applied successfully in computer vision fields such as object detection and action recognition. Others introduced the idea of survival trees into random forests: the Bootstrap method is used to draw training subsets, a survival tree is built on each training subset, and the survival functions of all trees are combined to determine the overall voting result. The resulting algorithm, called the random survival forest, performs well on high-dimensional data classification.
Optimizations of the building process of the random forest algorithm itself have also achieved some success. One line of research forms decision trees by combining multiple node-splitting algorithms in a linear function, so that a single tree does not rely on one splitting algorithm alone but splits according to a combination of different splitting algorithms; with suitable combination coefficients, this can improve the classification performance of the random forest.
In practice, the random forest algorithm is used for classification and regression prediction in fields such as energy, transportation, computer vision, and genetic engineering. In addition, random forests can estimate the importance of sample attributes, so they are also widely used in dimensionality reduction and feature selection. Because randomness is introduced both in sampling and in the tree-growing process, the trees are independent of one another, which makes the algorithm easy to parallelize and thus applicable in big-data processing environments.
Confusion matrices
Classifier combination means integrating the decisions of the individual base classifiers under some combination strategy to obtain stronger overall ensemble classification performance. It is generally believed that the independence, diversity, and complementarity of the base classifiers are the keys to good ensemble performance. Measuring diversity, however, is not as simple as measuring classification accuracy; diversity measures are usually divided into result-based measures and structure-based measures.
A confusion matrix is a way of presenting classification results: by tabulating the true classes of the samples against the predicted classes, it shows the classification behavior of a classifier. In machine learning, the confusion matrix is typically used in supervised learning to display classification results, and it serves as a visualization tool for evaluating classification quality. Judging the similarity of two classifiers by comparing their confusion matrices also falls into the category of result-based measures; it can be used to assess classifier diversity and is well suited to model combination.
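As a quick illustration (our example; the patent does not prescribe any library), a confusion matrix for a three-class task computed with scikit-learn:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 2, 2, 2])  # true classes
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])  # a classifier's predictions

# Row p, column q counts samples of true class p predicted as class q
print(confusion_matrix(y_true, y_pred))
# [[1 1 0]
#  [0 2 0]
#  [1 0 2]]
```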
Summary of the invention
The purpose of the present invention is to improve the existing random forest ensemble classifier by providing a random forest model selection method that produces decision trees with good diversity. The technical scheme is as follows:
A random forest model selection method based on confusion matrices comprises the following steps:
A. Take the decision trees obtained by training as the original random forest. Classify the test sample set with each decision tree, obtain the confusion matrix of each tree's classification results, and normalize the matrix by the number of samples in each class.
B. Subtract the confusion matrices of the trees in the random forest pairwise to obtain the difference matrix of any two trees in the forest. Take the F-norm of each difference matrix as the similarity measure of the two trees and use it as an element to build the diversity measure matrix of the random forest; each matrix element is the similarity measure value of the two trees it involves.
C. Traverse, in ascending order, the elements of the diversity measure matrix that do not exceed the similarity threshold, and examine the classification accuracy of the decision trees involved in each element: if a tree's accuracy is below the classification accuracy threshold, delete the tree and set all elements in its row and column of the matrix to zero; otherwise retain the tree.
D. Finally, integrate the decision trees represented by the nonzero elements of the diversity measure matrix into a new random forest, completing the model selection of the random forest.
This patent improves the model selection process of random forests by proposing a confusion-matrix-based similarity measure for decision trees. On this basis, combined with the classification performance of the individual trees, model selection is completed with a reverse "delete the bad" strategy. This model selection method takes into account both the classification performance of the random forest and the correlations inside it: it can effectively delete trees that are strongly correlated and classify poorly, improving the classification ability of the random forest.
Specific embodiment
The present invention is described in detail below.
1 Creating the confusion matrix from a decision tree's classification results
In a confusion matrix, each row represents the true class of the samples and each column represents the predicted class. In abstract mathematical form: suppose an N-class classification task is given, with sample set X = {x1, x2, ..., xM} and class set Y = {y1, y2, ..., yN}. Let the matrix CM(T, X) denote the statistics of the classification results of classification tree T on sample set X: entry (p, q) counts the samples of true class yp that T assigns to class yq. The dimension of the confusion matrix CM(T, X) is thereby determined: it is an N × N square matrix.
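A minimal sketch of this step (ours, assuming scikit-learn-style trees with a `predict` method); the per-class normalization is the one described in step A of the summary:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def tree_confusion_matrix(tree, X_test, y_test, classes):
    """N x N confusion matrix CM(T, X) of one classification tree,
    with each row normalized by that class's sample count."""
    cm = confusion_matrix(y_test, tree.predict(X_test), labels=classes).astype(float)
    counts = cm.sum(axis=1, keepdims=True)  # samples per true class
    return cm / np.maximum(counts, 1.0)     # avoid division by zero
```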
On the basis of the confusion matrix, we can obtain a similarity measure for two classification trees. The premise is that if two trees are similar, their classification results should be similar; in particular, the number of correctly classified samples in each class and the distribution of misclassifications between classes should be close. The similarity measure is therefore designed as follows: starting from the confusion matrices already built for two classification trees Ti and Tj, a distance between the matrices is used as the similarity measure of the two trees. If the distance between the confusion matrices is small, the classification trees Ti and Tj are considered close to each other; if the distance is large, their similarity is considered weak.
2 Obtaining the similarity measure matrix from the confusion matrices
The distance between confusion matrices is measured here with the F-norm (Frobenius norm). First the confusion matrices CM(i) and CM(j) of the two trees are subtracted; the result is called the difference matrix DCM(i,j), which is also an N × N square matrix:

DCM(i,j) = CM(i) − CM(j)

In a classification task, the numbers of samples in different classes may differ considerably. To prevent the classes with the most samples from dominating the matrix distance, which would make the similarity measure over-weight those classes and ignore the diversity in the others, we row-normalize the difference matrix; the normalization factor is the maximum element value of each row, which yields the normalized difference matrix DCMu(i,j). Finally, the similarity between classification trees Ti and Tj in the random forest is obtained as the F-norm ‖DCMu(i,j)‖F.
For a random forest of size T, the pairwise distances between decision trees can be presented in a matrix, called the similarity measure matrix MF of the random forest, of size T × T. Its elements mfij are obtained from the normalized difference matrices DCMu(i,j):

mfij = ‖DCMu(i,j)‖F, i, j = 1, ..., T
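A NumPy sketch of the formulas above (our rendering; handling of negative entries in the row normalization is our assumption, since the difference matrix can contain negative values):

```python
import numpy as np

def tree_similarity(cm_i, cm_j):
    """F-norm of the row-normalized difference matrix DCMu(i,j)."""
    dcm = cm_i - cm_j                                  # difference matrix DCM(i,j)
    row_max = np.abs(dcm).max(axis=1, keepdims=True)   # per-row normalization factor
    dcm_u = dcm / np.where(row_max > 0, row_max, 1.0)  # normalized difference matrix
    return np.linalg.norm(dcm_u, 'fro')

def measure_matrix(cms):
    """Similarity measure matrix MF from a list of T confusion matrices."""
    T = len(cms)
    MF = np.zeros((T, T))
    for i in range(T):
        for j in range(i + 1, T):
            MF[i, j] = MF[j, i] = tree_similarity(cms[i], cms[j])
    return MF
```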
Overall, this confusion-matrix-based similarity measure starts from the coarse division of decision tree outputs into correct and incorrect classifications and further distinguishes, within the misclassifications, how the errors are distributed among the classes. Thus it becomes harder for two structurally dissimilar decision trees to produce classification results that are mistakenly judged similar, and the resulting measure is more discriminative.
3 Obtaining the new random forest by "deleting the bad"
After the similarity measure matrix has been created for the random forest, selecting the well-performing classification trees by iterative clustering or by ranking the trees would require a very large computational cost. We therefore take the opposite approach and propose a "delete the bad" model selection strategy: rather than preferentially integrating trees from the original large set of base classifiers, the poor decision tree classifiers are deleted and the remaining classifiers are integrated. Specifically, classifiers that are strongly correlated with others and classify poorly are removed from the original set of base classifiers, and the remaining classifiers automatically form the new random forest model. The advantage of this approach is that it need not consider all the relations within the classifier set; it only examines the two classifiers involved in each element of the similarity measure matrix whose value is below the similarity threshold. If, of these two classifiers, one has a classification ability below the classification performance threshold, it is deleted. The relation handled in this way is much simpler, while both the correlation between classifiers and their classification ability are taken into account.
The model selection algorithm of this step is as follows (a sketch in code is given after the steps):
1. Set d as the similarity threshold and α as the classification ability threshold.
2. Let minij be the current minimum nonzero element of MF.
3. If minij is less than d:
(a) determine which of decision trees i and j has the lower classification ability (say tree i), and check whether its classification ability is below α;
(b) if so, delete decision tree i, i.e. set all elements of the row and column of MF where tree i is located to zero;
(c) let minij point to the next minimum nonzero element.
4. When no element less than d remains, the recursion terminates. The trees whose rows and columns still contain nonzero elements are integrated into the new random forest RF.
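A minimal sketch of this pruning loop (ours; MF is the T × T measure matrix, accuracies[i] the test accuracy of tree i, and all names are illustrative):

```python
import numpy as np

def delete_bad(MF, accuracies, d, alpha):
    """Return the indices of the trees kept for the new random forest RF."""
    T = MF.shape[0]
    alive = np.ones(T, dtype=bool)
    # All pairs below the similarity threshold d, in ascending order of measure value
    pairs = sorted((MF[i, j], i, j)
                   for i in range(T) for j in range(i + 1, T)
                   if 0 < MF[i, j] < d)
    for _, i, j in pairs:
        if alive[i] and alive[j]:
            weak = i if accuracies[i] <= accuracies[j] else j  # less accurate tree
            if accuracies[weak] < alpha:
                alive[weak] = False  # delete: zero the tree's row and column in MF
    return [t for t in range(T) if alive[t]]
```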
As an embodiment, data sets from the UCI machine learning repository are chosen. Each data set is randomly split into a training sample set and a test sample set, each containing 50% of the samples. Training sample subsets are drawn from the training set with the Bagging method, and a large number of decision tree classifiers are generated to serve as the original random forest model.
On this basis, the steps of the confusion-matrix-based random forest model selection method proposed in this patent are carried out, taking the original random forest as input, with the similarity threshold and the classification accuracy threshold preset. The proposed steps are executed on the test sample set, finally producing the model-selected random forest.
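A sketch of generating such an original forest (our assumptions: scikit-learn, the Iris data set as a stand-in for a UCI data set, and 100 trees):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)  # 50% training, 50% test

# Bagging: each tree is trained on a bootstrap sample of the training set
forest = []
for seed in range(100):
    Xb, yb = resample(X_train, y_train, random_state=seed)
    forest.append(DecisionTreeClassifier(random_state=seed).fit(Xb, yb))
```

The selection steps above would then take this forest, the test set, and the preset thresholds as input.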
Claims (1)
1. A random forest model selection method based on confusion matrices, comprising the following steps:
A. taking the decision trees obtained by training as the original random forest, classifying the test sample set with each decision tree, obtaining the confusion matrix of each tree's classification results, and normalizing the matrix by the number of samples in each class;
B. subtracting the confusion matrices of the trees in the random forest pairwise to obtain the difference matrix of any two trees in the forest; taking the F-norm of each difference matrix as the similarity measure of the two trees and using it as an element to build the diversity measure matrix of the random forest, each matrix element being the similarity measure value of the two trees it involves;
C. traversing, in ascending order, the elements of the diversity measure matrix that do not exceed the similarity threshold, and examining the classification accuracy of the decision trees involved in each element: deleting a tree whose accuracy is below the classification accuracy threshold and setting all elements in its row and column of the matrix to zero, otherwise retaining the tree;
D. finally integrating the decision trees represented by the nonzero elements of the diversity measure matrix into a new random forest, completing the model selection of the random forest.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611031244.1A (en) | 2016-11-17 | 2016-11-17 | Random forest model selection method based on confusion matrix |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106570537A (en) | 2017-04-19 |
Family
ID=58542098
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611031244.1A (Pending) | Random forest model selection method based on confusion matrix | 2016-11-17 | 2016-11-17 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106570537A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103500344A (en) * | 2013-09-02 | 2014-01-08 | 中国测绘科学研究院 | Method and module for extracting and interpreting information of remote-sensing image |
CN103489005A (en) * | 2013-09-30 | 2014-01-01 | 河海大学 | High-resolution remote sensing image classifying method based on fusion of multiple classifiers |
CN104572786A (en) * | 2013-10-29 | 2015-04-29 | 华为技术有限公司 | Visualized optimization processing method and device for random forest classification model |
CN105447525A (en) * | 2015-12-15 | 2016-03-30 | 中国科学院软件研究所 | Data prediction classification method and device |
Non-Patent Citations (2)
Title |
---|
Kong Yinghui, "Research on classification methods based on confusion matrix and ensemble learning", Computer Engineering & Science *
Deng Hongwei et al., "Random forest model for discriminating rock mass blastability grade and its R implementation", World Sci-Tech R&D *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109697447A (en) * | 2017-10-20 | 2019-04-30 | 富士通株式会社 | Disaggregated model construction device, method and electronic equipment based on random forest |
CN110276382A (en) * | 2019-05-30 | 2019-09-24 | 平安科技(深圳)有限公司 | Listener clustering method, apparatus and medium based on spectral clustering |
CN110276382B (en) * | 2019-05-30 | 2023-12-22 | 平安科技(深圳)有限公司 | Crowd classification method, device and medium based on spectral clustering |
CN110826618A (en) * | 2019-11-01 | 2020-02-21 | 南京信息工程大学 | Personal credit risk assessment method based on random forest |
CN111125727A (en) * | 2019-12-03 | 2020-05-08 | 支付宝(杭州)信息技术有限公司 | Confusion circuit generation method, prediction result determination method, device and electronic equipment |
CN111125727B (en) * | 2019-12-03 | 2021-05-14 | 支付宝(杭州)信息技术有限公司 | Confusion circuit generation method, prediction result determination method, device and electronic equipment |
CN112836731A (en) * | 2021-01-21 | 2021-05-25 | 黑龙江大学 | Signal random forest classification method, system and device based on decision tree accuracy and relevance measurement |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106570537A (en) | Random forest model selection method based on confusion matrix | |
CN103632168B (en) | Classifier integration method for machine learning | |
CN107766883A (en) | A kind of optimization random forest classification method and system based on weighted decision tree | |
CN108540451A (en) | A method of classification and Detection being carried out to attack with machine learning techniques | |
CN104484681B (en) | Hyperspectral Remote Sensing Imagery Classification method based on spatial information and integrated study | |
CN106482967B (en) | A kind of Cost Sensitive Support Vector Machines locomotive wheel detection system and method | |
CN103914705B (en) | Hyperspectral image classification and wave band selection method based on multi-target immune cloning | |
CN108304884A (en) | A kind of cost-sensitive stacking integrated study frame of feature based inverse mapping | |
CN108319987A (en) | A kind of filtering based on support vector machines-packaged type combined flow feature selection approach | |
CN104102929A (en) | Hyperspectral remote sensing data classification method based on deep learning | |
CN104809476B (en) | A kind of multi-target evolution Fuzzy Rule Classification method based on decomposition | |
CN103489005A (en) | High-resolution remote sensing image classifying method based on fusion of multiple classifiers | |
CN116108758B (en) | Landslide susceptibility evaluation method | |
CN105825078B (en) | Small sample Classification of Gene Expression Data method based on gene big data | |
CN104318515B (en) | High spectrum image wave band dimension reduction method based on NNIA evolution algorithms | |
CN103886336A (en) | Polarized SAR image classifying method based on sparse automatic encoder | |
CN102254033A (en) | Entropy weight-based global K-means clustering method | |
CN104765839A (en) | Data classifying method based on correlation coefficients between attributes | |
CN105893876A (en) | Chip hardware Trojan horse detection method and system | |
CN105389505A (en) | Shilling attack detection method based on stack type sparse self-encoder | |
CN102254020A (en) | Global K-means clustering method based on feature weight | |
CN104966106B (en) | A kind of biological age substep Forecasting Methodology based on support vector machines | |
Corrales et al. | An empirical multi-classifier for coffee rust detection in colombian crops | |
CN102930291B (en) | Automatic K adjacent local search heredity clustering method for graphic image | |
CN109800790A (en) | A kind of feature selection approach towards high dimensional data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170419 |