WO2015062209A1 - Visual optimization processing method and device for a random forest classification model - Google Patents

Visual optimization processing method and device for a random forest classification model

Info

Publication number
WO2015062209A1
WO2015062209A1 PCT/CN2014/075305 CN2014075305W WO2015062209A1 WO 2015062209 A1 WO2015062209 A1 WO 2015062209A1 CN 2014075305 W CN2014075305 W CN 2014075305W WO 2015062209 A1 WO2015062209 A1 WO 2015062209A1
Authority
WO
WIPO (PCT)
Prior art keywords
random forest
classification model
forest classification
decision tree
upper bound
Prior art date
Application number
PCT/CN2014/075305
Other languages
English (en)
French (fr)
Inventor
赫彩凤
李俊杰
郭向林
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2015062209A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/243: Classification techniques relating to the number of classes
    • G06F18/24323: Tree-organised classifiers

Definitions

  • the invention relates to the field of data mining technology, in particular to a visual optimization processing method and device for a random forest classification model.
  • Classification is one of the most fundamental tasks often encountered in the fields of statistics, data analysis, machine learning, and data mining research.
  • the main goal of this task is to build a predictive model (i.e., a learning machine) with strong generalization ability from the training data.
  • Ensemble learning has significant advantages in this respect.
  • the basic idea of ensemble learning is to use multiple learning machines to solve the same problem.
  • Two preconditions determine the feasibility of ensemble learning: first, each base learning machine must be effective, that is, the accuracy of a single base learning machine must be greater than that of random guessing; second, the base learning machines must differ from one another.
  • Random forest is a supervised ensemble learning classification technique. Its classification model consists of a set of decision tree classifiers, and the model's classification of data is determined by aggregating the classification results of the individual decision trees. It combines Leo Breiman's bagging ensemble learning theory with the random subspace method proposed by Ho. By injecting randomness into both the training sample space and the attribute space, it ensures the independence and diversity of the individual decision trees, overcomes the over-fitting problem of single decision trees, and is robust to noise and outliers.
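  • As a hedged illustration (not part of the patent text): the combination of bagging and the random subspace method described above is exactly what a standard random forest library implements; the sketch below uses scikit-learn, and the dataset and parameter choices are assumptions for the example.

```python
# Minimal sketch: a standard random forest combines bagging (randomness in
# the sample space) with random feature subspaces (randomness in the
# attribute space), as described above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)
forest = RandomForestClassifier(
    n_estimators=100,     # number of decision trees in the model
    bootstrap=True,       # bagging: each tree sees a bootstrap sample
    max_features="sqrt",  # random subspace: random features per split
    oob_score=True,       # keep out-of-bag estimates for later analysis
    random_state=0,
).fit(X, y)
print(forest.oob_score_)  # OOB estimate of the forest's accuracy
```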
  • the inventors of the present application found in long-term research and development that, although the prediction performance of a random forest is significantly better than that of a single decision tree, it has some disadvantages: compared with a single decision tree, the prediction speed is significantly lower, and as the number of decision trees increases, the required storage space also grows dramatically.
  • the technical problem mainly solved by the present invention is to provide a visual optimization processing method and device for a random forest classification model, which can reduce the number of decision trees in a random forest classification model, reduce the memory space required by the random forest classification model, and at the same time improve the prediction speed and accuracy.
  • the present invention provides a visual optimization processing method for a random forest classification model, including: estimating, for an established random forest classification model, the correlations between the decision trees of the random forest classification model by out-of-bag data; constructing a correlation matrix from the correlations between the decision trees of the random forest classification model; obtaining, according to the correlation matrix, a visualization graph of the random forest classification model in a space of at most three dimensions by a dimensionality reduction technique; and optimizing the random forest classification model according to its visualization graph, so that the upper bound of the second generalization error of the processed random forest classification model does not exceed the upper bound of the first generalization error of the random forest classification model before processing.
  • the step of acquiring, according to the correlation matrix, a visualization graph of the random forest classification model in a space of at most three dimensions by a dimensionality reduction technique comprises: obtaining, according to the correlation matrix, the visualization graph of the random forest classification model in a space of at most three dimensions by the multidimensional scaling (MDS) dimensionality reduction technique.
  • the visual graphic is a scatter graph, and each point of the scatter graph represents a decision tree.
  • the distance between each two points of the scatter plot represents the correlation between the corresponding decision trees of the random forest classification model.
  • the points of the scatter plot are rendered in different colors to express the classification strength information of the decision tree corresponding to each point.
  • the scatter plot is a heat map of a density distribution.
  • the step of performing optimization processing on the random forest classification model according to its visualization graph comprises: selecting a decision tree according to the visualization graph of the random forest classification model; deleting the K decision trees closest to the selected decision tree and obtaining the upper bound of the second generalization error corresponding to the processed random forest classification model; comparing the upper bound of the second generalization error corresponding to the processed random forest classification model with the upper bound of the first generalization error of the random forest classification model before processing; and, if the upper bound of the second generalization error corresponding to the processed random forest classification model has decreased, returning to the step of selecting a decision tree according to the visualization graph and looping until the upper bound of the second generalization error corresponding to the processed random forest classification model no longer decreases.
  • in a sixth possible implementation manner of the first aspect, after the step of comparing with the upper bound of the first generalization error of the random forest classification model before processing, the method further includes: if the upper bound of the second generalization error corresponding to the processed random forest classification model has increased, undoing the steps performed before the comparing step; and deleting the structurally similar decision trees in the random forest classification model by using a decision tree rule matching algorithm.
  • the element in the i-th row and j-th column of the correlation matrix is the correlation between the i-th decision tree and the j-th decision tree of the random forest classification model, where i and j are non-zero natural numbers.
  • the acquiring module is specifically configured to acquire, according to the correlation matrix, a visualization graph of the random forest classification model in a space of at most three dimensions by the multidimensional scaling (MDS) dimensionality reduction technique.
  • the visual graph is a scatter plot, and each point of the scatter plot represents a decision tree. The distance between each two points of the scatter plot represents the correlation between the corresponding decision trees of the random forest classification model.
  • the points of the scatter plot are rendered in different colors to express the classification strength information of the decision tree corresponding to each point.
  • the scatter plot is a heat map of a density distribution.
  • the optimization module includes: a selecting unit, an obtaining unit, a comparing unit, and a returning unit; the selecting unit is configured to select a decision tree according to the visualization graph of the random forest classification model; the obtaining unit is configured to delete the K decision trees closest to the decision tree selected by the selecting unit and to obtain the upper bound of the second generalization error corresponding to the processed random forest classification model; the comparing unit is configured to compare the upper bound of the second generalization error corresponding to the processed random forest classification model obtained by the obtaining unit with the upper bound of the first generalization error of the random forest classification model before processing; and the returning unit is configured to return to the selecting unit for looping, when the upper bound of the second generalization error corresponding to the processed random forest classification model has decreased, until that upper bound no longer decreases.
  • the optimization module further includes: a revocation unit and a deletion unit;
  • the revocation unit is configured to undo all operations performed before the comparing unit when the comparison result of the comparing unit is that the upper bound of the second generalization error corresponding to the processed random forest classification model has increased; the deleting unit is configured to delete the structurally similar decision trees in the random forest classification model by using a decision tree rule matching algorithm after the revocation unit has undone those operations.
  • the element in the i-th row and j-th column of the correlation matrix is the correlation between the i-th decision tree and the j-th decision tree of the random forest classification model, where i and j are non-zero natural numbers.
  • by the above scheme, the present invention obtains a visualization graph of the random forest classification model and optimizes the model according to that graph; this improves the learning performance of the random forest classification model and reduces the number of decision trees in it. Because the visualization graph is intuitive, the optimization effect can be seen directly during optimization, so the prediction speed and accuracy can be improved; moreover, no large amount of memory is needed to store the results of an optimization algorithm, which reduces the memory space required by the random forest classification model.
  • FIG. 3 is a flow chart of another embodiment of a visual optimization processing method of the random forest classification model of the present invention.
  • FIG. 5 is a flow chart of still another embodiment of a visual optimization processing method of the random forest classification model of the present invention.
  • FIG. 7 is a schematic structural diagram of an embodiment of a visual optimization processing apparatus for a random forest classification model of the present invention.
  • FIG. 8 is a schematic structural diagram of still another embodiment of a visual optimization processing apparatus for a random forest classification model of the present invention.
  • Figure 1 is a flow chart of an embodiment of a visual optimization processing method for a random forest classification model of the present invention, comprising:
  • a random forest classification model is a classifier containing multiple decision trees, and its output classification result is determined by a vote over the classification results output by the individual decision trees.
  • each branch corresponds to an attribute value, and the construction recurses until the termination condition is met; each leaf node then represents the class of the samples along that path.
  • the out-of-bag data can be used to estimate the classification strength s of each decision tree of the random forest classification model and the correlation between the decision trees.
  • the main factors determining the classification performance of a random forest classification model are, first, the classification strength of the individual decision trees: the greater the classification strength of a single decision tree, the better the classification performance of the model; and second, the correlation between the decision trees: the greater the correlation between decision trees, the worse the classification performance of the model.
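  • A minimal sketch of this out-of-bag estimate of per-tree strength, under the assumption that we draw the bootstrap samples ourselves so that each tree's out-of-bag indices are known explicitly:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)
n, n_trees = len(X), 50

trees, oob_masks = [], []
for _ in range(n_trees):
    idx = rng.integers(0, n, n)        # bootstrap sample (with replacement)
    oob = np.ones(n, dtype=bool)
    oob[idx] = False                   # samples never drawn are out-of-bag
    trees.append(DecisionTreeClassifier(max_features="sqrt",
                                        random_state=0).fit(X[idx], y[idx]))
    oob_masks.append(oob)

# Classification strength s of each tree, estimated on its own OOB data.
strength = [t.score(X[m], y[m]) for t, m in zip(trees, oob_masks)]
print(np.round(strength[:5], 3))
```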
  • Step S102: Construct a correlation matrix using the correlations between the decision trees of the random forest classification model.
  • the correlation matrix, also called the correlation coefficient matrix, is composed of the correlation coefficients between the columns of a matrix; that is, the element in the i-th row and j-th column of the correlation matrix is the correlation coefficient between the i-th and j-th columns of the original matrix.
  • here, the element in the i-th row and j-th column of the correlation matrix is the correlation between the i-th decision tree and the j-th decision tree of the random forest classification model, where i and j are non-zero natural numbers.
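  • A minimal sketch of one way such a matrix could be filled (an assumption; the text does not fix the exact estimator): correlate the per-sample correctness indicators of each pair of trees on a common held-out set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Row t holds 1 where tree t classifies a held-out sample correctly, else 0.
correct = np.array([tree.predict(X_te) == y_te
                    for tree in forest.estimators_], dtype=float)
corr_matrix = np.corrcoef(correct)  # element (i, j): correlation of trees i, j
print(corr_matrix.shape)            # (50, 50)
```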
  • Step S103: Obtain a visualization graph of the random forest classification model in a space of at most three dimensions by a dimensionality reduction technique, according to the correlation matrix.
  • high-dimensional feature sets have the following problems:
  • the samples in the original observation space contain a large number of redundant features; many features are irrelevant to the given task, that is, many features have only weak correlation with the categories; many features are mutually redundant for the given task, that is, strongly correlated with each other; and the data contain noise.
  • These problems increase the difficulty of training classifiers. Therefore, for data analysis and data visualization (usually two-dimensional or three-dimensional), dimensionality reduction of high-dimensional space is required.
  • the methods of dimensionality reduction mainly include linear dimensionality reduction methods, traditional nonlinear dimensionality reduction methods, nonlinear dimensionality reduction methods based on manifold learning, and so on.
  • linear dimensionality reduction methods mainly include principal component analysis (PCA), linear discriminant analysis (LDA), multidimensional scaling (MDS), and so on.
  • nonlinear dimensionality reduction methods mainly include kernel principal component analysis (KPCA), the principal curve method, self-organizing maps (SOM), generative topographic mapping (GTM), and so on.
  • nonlinear dimensionality reduction methods based on manifold learning mainly include isometric feature mapping (Isomap), locally linear embedding (LLE), Laplacian eigenmaps (LE), and so on.
  • Visualization technology is the first and most effective means of interpreting large amounts of data and has been adopted in scientific and engineering computing. Visualization transforms data into graphics, giving people deep and unexpected insights, and has fundamentally changed the way scientists do research in many fields. Its core technologies are visualization server hardware and software.
  • the main processes of visualization are modeling and rendering: modeling maps data into the geometric primitives of objects; rendering turns the geometric primitives into graphics or images. Rendering is the main technique for producing realistic graphics: strictly speaking, it uses an illumination model based on optical principles to calculate the brightness and color of the visible surfaces of an object as projected into the observer's eye, converts them into color values suitable for the graphics display device, and thereby determines the color and lighting effect of each pixel of the projected image, finally producing a realistic picture.
  • a realistic image is expressed by the colors and shading of the object surfaces; it depends on the material properties of the surfaces and on the light energy radiated from the surfaces toward the line of sight, so the calculation is complicated and computationally heavy.
  • according to the constructed correlation matrix, the visualization graph of the random forest classification model in a space of at most three dimensions can be obtained by the dimensionality reduction technique, so as to better analyze and optimize the random forest classification model.
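  • A hedged sketch of this step, assuming the simple conversion d = 1 - correlation so that highly correlated trees end up close together; scikit-learn's MDS accepts such a precomputed dissimilarity matrix.

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
A = rng.uniform(0.0, 1.0, (30, 30))
corr_matrix = (A + A.T) / 2          # stand-in for the matrix of step S102
np.fill_diagonal(corr_matrix, 1.0)

dissimilarity = 1.0 - corr_matrix    # highly correlated trees -> small distance
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dissimilarity)
print(coords.shape)                  # one 2-D point per decision tree: (30, 2)
```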
  • the performance of machine learning can be expressed by generalization errors.
  • the upper bound of the generalization error is the upper bound of the test error rate of the classification model on the new unknown data.
  • the generalization error is determined by two factors: the overall classification strength of the random forest and the average correlation between the decision trees.
  • the generalization error decreases as the overall classification strength of the random forest increases, and increases with the average correlation between the decision trees. If the learning performance of the random forest classification model is to be improved, the generalization error must be reduced, and there are two ways to do so: the first is to increase the overall classification strength of the random forest by deleting decision trees with weak classification strength; the second is to reduce the average correlation between the decision trees by deleting highly correlated decision trees.
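  • This relationship can be made concrete with Breiman's (2001) bound PE* <= rho_bar * (1 - s^2) / s^2, where s is the overall classification strength and rho_bar the mean correlation between trees; the tiny sketch below just evaluates that formula.

```python
def generalization_error_upper_bound(rho_bar: float, s: float) -> float:
    """Breiman's upper bound on random forest generalization error."""
    return rho_bar * (1.0 - s**2) / s**2

print(generalization_error_upper_bound(rho_bar=0.3, s=0.8))  # ~0.169
print(generalization_error_upper_bound(rho_bar=0.3, s=0.9))  # stronger trees -> ~0.070
print(generalization_error_upper_bound(rho_bar=0.2, s=0.8))  # less correlated -> ~0.113
```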
  • because the visualization graph of the random forest classification model is intuitive and vivid, the user can conveniently optimize the model according to it.
  • if the upper bound of the second generalization error of the processed random forest classification model does not exceed the upper bound of the first generalization error of the random forest classification model before processing, that is, the second upper bound is less than or equal to the first, the optimized random forest classification model is acceptable; otherwise, the second upper bound is larger than the first, which indicates that the learning performance of the optimized random forest classification model is worse than that of the model before optimization, and the optimized model is obviously unacceptable.
  • in this way, a visualization graph of the random forest classification model is obtained, and when the model is optimized according to this graph, the learning performance of the random forest classification model can be improved and the number of decision trees in it can be reduced; because the visualization graph is intuitive, the optimization effect can be seen directly during optimization, so the prediction speed and accuracy can be improved, and no large amount of memory is needed to store the results of an optimization algorithm, which reduces the memory space required by the random forest classification model.
  • FIG. 3 is a flowchart of another implementation manner of a visual optimization processing method of the random forest classification model of the present invention, including:
  • Step S201: For the constructed random forest classification model, estimate the correlations between the decision trees of the random forest classification model from the out-of-bag data.
  • a random forest classification model is a classifier containing multiple decision trees, and its output classification result is determined by a vote over the classification results output by the individual decision trees.
  • the out-of-bag data can be used to estimate the classification strength s of each decision tree of the random forest classification model and the correlation between the decision trees.
  • the main factors determining the classification performance of a random forest classification model are, first, the classification strength of the individual decision trees: the greater the classification strength of a single decision tree, the better the classification performance of the model; and second, the correlation between the decision trees: the greater the correlation between decision trees, the worse the classification performance of the model.
  • Step S202: Construct a correlation matrix using the correlations between the decision trees of the random forest classification model.
  • the correlation matrix, also called the correlation coefficient matrix, is composed of the correlation coefficients between the columns of a matrix; that is, the element in the i-th row and j-th column of the correlation matrix is the correlation coefficient between the i-th and j-th columns of the original matrix.
  • here, the element in the i-th row and j-th column of the correlation matrix is the correlation between the i-th decision tree and the j-th decision tree of the random forest classification model, where i and j are non-zero natural numbers.
  • Step S203: Obtain a visualization graph of the random forest classification model in a space of at most three dimensions, according to the correlation matrix, by the multidimensional scaling (MDS) dimensionality reduction technique.
  • the high-dimensional feature set has the following problems:
  • the samples in the original observation space contain a large number of redundant features; many features are irrelevant to the given task, that is, many features have only weak correlation with the categories; many features are mutually redundant for the given task, that is, strongly correlated with each other; and the data contain noise.
  • These problems increase the difficulty of training classifiers. Therefore, for data analysis and data visualization (usually two-dimensional or three-dimensional), dimensionality reduction of high-dimensional space is required.
  • MDS uses the correlations between pairs of samples; its purpose is to use this information to construct a suitable low-dimensional space such that the distances between the samples in this space are as consistent as possible with their distances in the high-dimensional space.
  • the MDS method has five key elements, namely subject, object, criterion, criterion weight, and subject weight.
  • they are defined as follows: 1) Object: the thing to be evaluated, which can be regarded as the several classes to be distinguished. 2) Subject: the unit that evaluates the objects, namely the training data. 3) Criteria: defined according to the purpose of the study, used to assess the quality of the objects. 4) Criterion weights: after the subject weighs the importance of each criterion, each criterion is given a weight value. 5) Subject weights: after the researcher weighs the importance of each subject, each subject is given a weight value.
  • given the data to be analyzed, a collection of I objects, a set of distances δij is defined, where δij is the distance between the i-th and j-th objects; these distances form the dissimilarity matrix, and MDS seeks coordinates in a low-dimensional space whose pairwise distances approximate the δij as closely as possible.
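  • A minimal classical-MDS sketch under these definitions: double-center the squared distances δij and keep the leading eigenvectors to obtain coordinates in two (or three) dimensions.

```python
import numpy as np

def classical_mds(D: np.ndarray, dim: int = 2) -> np.ndarray:
    """Classical MDS: embed objects given their pairwise distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)      # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:dim]     # keep the `dim` largest
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale            # low-dimensional coordinates

# Toy distance matrix for four objects on a line.
D = np.array([[0., 1., 2., 3.],
              [1., 0., 1., 2.],
              [2., 1., 0., 1.],
              [3., 2., 1., 0.]])
print(classical_mds(D, dim=2))
```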
  • according to the constructed correlation matrix, the visualization graph of the random forest classification model in a space of at most three dimensions can be obtained by the MDS dimensionality reduction technique, so as to better analyze and optimize the random forest classification model.
  • the visual graph is a scatter plot.
  • Each point of the scatter plot represents a decision tree.
  • the distance between each two points of the scatter plot represents the correlation between the decision trees corresponding to the random forest classification model.
  • the magnitude of the correlation between any two decision trees can thus be observed visually: if the distance between two points is small, the correlation between the two corresponding decision trees is large; if the distance between two points is large, the correlation between the two corresponding decision trees is small.
  • the above visualization is only a coarse-grained visual representation.
  • the clustering density of the decision trees in the random forest classification model can present the distribution of the decision trees in the random forest more finely.
  • a normalization method divides the population density of the decision tree models on the two-dimensional plane into 10 gradations, indicating different density levels; that is, the scatter plot becomes a heat map of the density distribution.
  • with this heat-map representation of the density distribution, users can observe the distribution of decision tree populations of different densities, as shown in Figure 4.
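  • A hedged plotting sketch of such a view (the arrays here are stand-ins for the MDS coordinates and per-tree strengths produced earlier): a scatter plot colored by classification strength next to a density heat map with 10 color gradations.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
coords = rng.normal(size=(200, 2))       # stand-in for MDS coordinates
strength = rng.uniform(0.6, 0.95, 200)   # stand-in for per-tree strength

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sc = ax1.scatter(coords[:, 0], coords[:, 1], c=strength, cmap="viridis")
fig.colorbar(sc, ax=ax1, label="classification strength")
ax1.set_title("trees colored by strength")

# Population density of the trees, split into 10 discrete color gradations.
h = ax2.hist2d(coords[:, 0], coords[:, 1], bins=20,
               cmap=plt.get_cmap("hot", 10))
fig.colorbar(h[3], ax=ax2, label="tree density")
ax2.set_title("density heat map")
plt.show()
```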
  • Step S204: Optimize the random forest classification model according to its visualization graph, so that the upper bound of the second generalization error of the processed random forest classification model does not exceed the upper bound of the first generalization error of the random forest classification model before processing.
  • the performance of machine learning can be expressed by generalization errors.
  • the upper bound of the generalization error is the upper bound of the test error rate of the classification model on the new unknown data.
  • the generalization error is determined by two factors: the overall classification strength of the random forest and the average correlation between the decision trees.
  • the generalization error decreases as the overall classification strength of the random forest increases, and increases with the average correlation between the decision trees. If the learning performance of the random forest classification model is to be improved, the generalization error must be reduced, and there are two ways to do so: the first is to increase the overall classification strength of the random forest by deleting decision trees with weak classification strength; the second is to reduce the average correlation between the decision trees by deleting highly correlated decision trees.
  • the user can conveniently optimize the random forest classification model according to the visualization of the random forest classification model.
  • if the upper bound of the second generalization error of the processed random forest classification model does not exceed the upper bound of the first generalization error of the random forest classification model before processing, that is, the second upper bound is less than or equal to the first, the optimized random forest classification model is acceptable; otherwise, the second upper bound is larger than the first, which indicates that the learning performance of the optimized random forest classification model is worse than that of the model before optimization, and the optimized model is obviously unacceptable.
  • in this way, a visualization graph of the random forest classification model is obtained, and when the model is optimized according to this graph, the learning performance of the random forest classification model can be improved and the number of decision trees in it can be reduced; because the visualization graph is intuitive, the optimization effect can be seen directly during optimization, so the prediction speed and accuracy can be improved, and no large amount of memory is needed to store the results of an optimization algorithm, which reduces the memory space required by the random forest classification model.
  • with the MDS dimensionality reduction technique, the operation is relatively simple and the results are intuitive to interpret.
  • Step S301: For the constructed random forest classification model, estimate the correlations between the decision trees of the random forest classification model from the out-of-bag data.
  • a random forest classification model is a classifier containing multiple decision trees, and its output classification result is determined by a vote over the classification results output by the individual decision trees.
  • Out-of-bag data can be used to estimate the classification strength of each decision tree in a random forest classification model and the correlation between decision trees.
  • the main factors determining the classification performance of a random forest classification model are, first, the classification strength of the individual decision trees: the greater the classification strength of a single decision tree, the better the classification performance of the model; and second, the correlation between the decision trees: the greater the correlation between decision trees, the worse the classification performance of the model.
  • the correlation matrix, also called the correlation coefficient matrix, is composed of the correlation coefficients between the columns of a matrix; that is, the element in the i-th row and j-th column of the correlation matrix is the correlation coefficient between the i-th and j-th columns of the original matrix.
  • here, the element in the i-th row and j-th column of the correlation matrix is the correlation between the i-th decision tree and the j-th decision tree of the random forest classification model, where i and j are non-zero natural numbers.
  • Step S303: Obtain a visualization graph of the random forest classification model in a space of at most three dimensions, according to the correlation matrix, by the multidimensional scaling (MDS) dimensionality reduction technique.
  • the high-dimensional feature set has the following problems:
  • the samples in the original observation space contain a large number of redundant features; many features are irrelevant to the given task, that is, many features have only weak correlation with the categories; many features are mutually redundant for the given task, that is, strongly correlated with each other; and the data contain noise.
  • These problems increase the difficulty of training classifiers. Therefore, for data analysis and data visualization (usually two-dimensional or three-dimensional), dimensionality reduction of high-dimensional space is required.
  • Visualization transforms data into graphics, giving people deep and unexpected insights, and has fundamentally changed the way scientists do research in many fields. Its core technologies are visualization server hardware and software.
  • the main processes of visualization are modeling and rendering: modeling maps data into the geometric primitives of objects; rendering turns the geometric primitives into graphics or images. Rendering is the main technique for producing realistic graphics: strictly speaking, it uses an illumination model based on optical principles to calculate the brightness and color of the visible surfaces of an object as projected into the observer's eye, converts them into color values suitable for the graphics display device, and thereby determines the color and lighting effect of each pixel of the projected image, finally producing a realistic picture.
  • according to the constructed correlation matrix, the visualization graph of the random forest classification model in a space of at most three dimensions can be obtained by the dimensionality reduction technique, so as to better analyze and optimize the random forest classification model.
  • the visual graph is a scatter plot. Each point of the scatter plot represents a decision tree. The distance between each two points of the scatter plot represents the correlation between the decision trees corresponding to the random forest classification model.
  • the points of the scatter plot are rendered in different colors to express the classification strength information of the decision tree corresponding to each point.
  • the scatter plot is a heat map of the density distribution.
  • Step S304: Optimize the random forest classification model according to its visualization graph, so that the upper bound of the second generalization error of the processed random forest classification model does not exceed the upper bound of the first generalization error of the random forest classification model before processing.
  • the performance of machine learning can be expressed by generalization errors.
  • the generalization error is determined by two factors: the overall classification strength of the random forest and the average correlation between the decision trees.
  • the generalization error decreases as the overall classification strength of the random forest increases, and increases with the average correlation between the decision trees. If the learning performance of the random forest classification model is to be improved, the generalization error must be reduced, and there are two ways to do so: the first is to increase the overall classification strength of the random forest by deleting decision trees with weak classification strength; the second is to reduce the average correlation between the decision trees by deleting highly correlated decision trees.
  • the user can conveniently optimize the random forest classification model according to the visual graph of the random forest classification model.
  • if the upper bound of the second generalization error of the processed random forest classification model does not exceed the upper bound of the first generalization error of the random forest classification model before processing, that is, the second upper bound is less than or equal to the first, the optimized random forest classification model is acceptable; otherwise, the second upper bound is larger than the first, which indicates that the learning performance of the optimized random forest classification model is worse than that of the model before optimization, and the optimized model is unacceptable.
  • Step S304 includes: sub-step S304a, sub-step S304b, sub-step S304c, sub-step S304d, sub-step S304e, and sub-step S304f.
  • Sub-step S304a: select a decision tree based on the visualization graph of the random forest classification model.
  • Sub-step S304b: delete the K decision trees closest to the selected decision tree, and obtain the upper bound of the second generalization error corresponding to the processed random forest classification model.
  • Here, the method used in sub-step S304b is the K-Nearest Neighbour (KNN) classification algorithm.
  • KNN is a theoretically mature method and one of the simplest machine learning algorithms. Its idea is: if the majority of the k samples most similar to a given sample in the feature space (i.e., its nearest neighbors) belong to a certain category, then that sample also belongs to this category.
  • Accordingly, the K decision trees closest to the selected decision tree can be considered to belong to the same category, and these K decision trees of the same category can be deleted.
  • Sub-step S304c: compare the upper bound of the second generalization error corresponding to the processed random forest classification model with the upper bound of the first generalization error of the random forest classification model before processing.
  • Sub-step S304d: if the upper bound of the second generalization error corresponding to the processed random forest classification model has decreased, return to sub-step S304a and loop until that upper bound no longer decreases.
  • Sub-step S304e: if the upper bound of the second generalization error corresponding to the processed random forest classification model has increased, undo the steps before sub-step S304c.
  • Sub-step S304f: delete the structurally similar decision trees in the random forest classification model by using the decision tree rule matching algorithm.
  • after sub-step S304b, the upper bound of the first generalization error and the upper bound of the second generalization error are compared. If the upper bound of the second generalization error has decreased, the processed random forest classification model is better optimized, and the procedure returns to sub-step S304a and loops until the upper bound of the second generalization error corresponding to the processed random forest classification model no longer decreases; at that point, the best optimization effect achievable by this method has been reached. If the upper bound of the second generalization error has increased, the processed random forest classification model performs no better than it did before processing; the steps before sub-step S304c are undone, and the structurally similar decision trees in the random forest classification model are deleted by the decision tree rule matching algorithm. A sketch of this loop is given below.
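  • A hedged sketch of the loop of sub-steps S304a-S304f; here `bound_of` (a callable returning the generalization error upper bound of a list of trees) and the selection heuristic (the tree nearest the densest region's center) are assumptions for the example, not the patent's fixed rule.

```python
import numpy as np

def knn_prune(trees, coords, bound_of, k=3):
    """Iteratively delete the K nearest neighbours of a selected tree while
    the generalization error upper bound keeps decreasing (S304a-S304e)."""
    trees, coords = list(trees), np.asarray(coords, dtype=float)
    best = bound_of(trees)                        # first generalization error bound
    while len(trees) > k + 1:
        center = coords.mean(axis=0)
        sel = np.argmin(np.linalg.norm(coords - center, axis=1))  # S304a (assumed)
        dist = np.linalg.norm(coords - coords[sel], axis=1)
        drop = set(np.argsort(dist)[1:k + 1])     # S304b: K nearest trees
        keep = [i for i in range(len(trees)) if i not in drop]
        candidate = [trees[i] for i in keep]
        new = bound_of(candidate)                 # second generalization error bound
        if new < best:                            # S304c/S304d: keep it and loop
            trees, coords, best = candidate, coords[keep], new
        else:                                     # S304e: undo the deletion, stop;
            break                                 # S304f (rule matching) follows
    return trees, best
```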
  • For example, suppose the upper bound of the generalization error of the original, unprocessed random forest classification model is 0.2.
  • In one case, the upper bound of the generalization error of the random forest classification model after the first pass through sub-step S304a and sub-step S304b is 0.3; obviously, sub-step S304a and sub-step S304b need to be undone, and the model is then processed with the decision tree rule matching algorithm, which removes the structurally similar decision trees in the random forest classification model.
  • In another case, sub-step S304a and sub-step S304b are executed several times, for example four times, and the upper bounds of the generalization error of the random forest classification model after the first, second, third, and fourth passes are 0.19, 0.17, 0.14, and 0.15, respectively. After the first, second, and third passes, the upper bound of the generalization error keeps decreasing, but the fourth upper bound, 0.15, is larger than the third, 0.14; that is, after the third pass, the upper bound of the generalization error of the random forest classification model no longer decreases, so the random forest classification model obtained after the third pass is accepted.
  • Of course, the decision tree rule matching algorithm can also be used to further delete the structurally similar decision trees in this third-pass random forest classification model.
  • In addition, there is a random forest optimization algorithm based on the distribution of the margin function. Four variants of the margin function distribution are introduced into the algorithm as metrics for evaluating the generalization ability of the random forest classification algorithm and the importance of a single decision tree.
  • This optimization algorithm uses the four margin functions above as objective functions to evaluate the degree of optimization, and gradually improves the performance of the classification algorithm by searching for the tree whose removal optimizes the objective function and removing it from the random forest model.
  • Specifically, each decision tree in the random forest is ranked according to its importance, where the importance of a decision tree is measured by how much the random forest margin function changes after that decision tree is deleted.
  • The algorithm then removes the least important decision tree from the random forest and iterates this process until the random forest model is optimal; this optimization method thus improves classification performance by reducing the size of the random forest. A sketch of the ranking step is given below.
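  • A hedged sketch of this margin-based ranking, assuming the usual margin definition mg = (votes for the true class - max votes for any other class) / n_trees; the four distribution variants themselves are not spelled out here.

```python
import numpy as np

def mean_margin(votes: np.ndarray, y: np.ndarray) -> float:
    """votes: (n_trees, n_samples) integer class labels; y: true labels."""
    n_trees, n_samples = votes.shape
    margins = np.empty(n_samples)
    for i in range(n_samples):
        counts = np.bincount(votes[:, i], minlength=y.max() + 1).astype(float)
        right = counts[y[i]]
        counts[y[i]] = -1.0                 # exclude the true class
        margins[i] = (right - counts.max()) / n_trees
    return float(margins.mean())

def least_important_tree(votes: np.ndarray, y: np.ndarray) -> int:
    """Importance = change of the mean margin when a tree is removed."""
    base = mean_margin(votes, y)
    deltas = [abs(mean_margin(np.delete(votes, t, axis=0), y) - base)
              for t in range(votes.shape[0])]
    return int(np.argmin(deltas))           # candidate for deletion
```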
  • In this way, a visualization graph of the random forest classification model is obtained, and when the model is optimized according to this graph, the learning performance of the random forest classification model can be improved and the number of decision trees in it can be reduced; moreover, no large amount of memory is needed to store the results of an optimization algorithm, which reduces the memory space required by the random forest classification model.
  • the MDS dimension reduction technology the operation is relatively simple and the interpretation of the results is intuitive.
  • the K nearest neighbor classification algorithm can quickly delete the decision trees belonging to the same category.
  • the decision tree rule matching algorithm can delete the decision tree with similar structure.
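  • The patent does not spell out the rule matching algorithm itself; a hedged sketch of one plausible realization compares the sets of (feature, threshold) split rules of scikit-learn trees by Jaccard similarity and drops near-duplicates.

```python
def rule_set(tree, precision=2):
    """Set of (split feature, rounded threshold) pairs of an sklearn tree."""
    t = tree.tree_
    return {(int(f), round(float(th), precision))
            for f, th in zip(t.feature, t.threshold) if f >= 0}

def prune_similar(trees, threshold=0.9):
    """Keep only trees whose rule sets are not near-duplicates of kept ones."""
    rules = [rule_set(t) for t in trees]
    kept = []
    for i, r in enumerate(rules):
        duplicate = any(len(r & rules[j]) / (len(r | rules[j]) or 1) >= threshold
                        for j in kept)
        if not duplicate:
            kept.append(i)
    return [trees[i] for i in kept]
```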
  • a top-down greedy algorithm is used to construct the tree structure: each branch corresponds to an attribute value, and the construction recurses until the termination condition is met; each leaf node then represents the class of the samples along that path.
  • the out-of-bag data can be used to estimate the classification strength s of each decision tree of the random forest classification model and the correlation between the decision trees.
  • the main factors determining the classification performance of a random forest classification model are, first, the classification strength of the individual decision trees: the greater the classification strength of a single decision tree, the better the classification performance of the model; and second, the correlation between the decision trees: the greater the correlation between decision trees, the worse the classification performance of the model.
  • the construction module 102 is configured to construct a correlation matrix by using the correlation between the decision trees of the random forest classification model estimated by the estimation module 101.
  • the correlation matrix, also called the correlation coefficient matrix, is composed of the correlation coefficients between the columns of a matrix; that is, the element in the i-th row and j-th column of the correlation matrix is the correlation coefficient between the i-th and j-th columns of the original matrix.
  • here, the element in the i-th row and j-th column of the correlation matrix is the correlation between the i-th decision tree and the j-th decision tree of the random forest classification model.
  • the high-dimensional feature set has the following problems:
  • the samples in the original observation space contain a large number of redundant features; many features are irrelevant to the given task, that is, many features have only weak correlation with the categories; many features are mutually redundant for the given task, that is, strongly correlated with each other; and the data contain noise.
  • These problems increase the difficulty of training classifiers. Therefore, for data analysis and data visualization (usually two-dimensional or three-dimensional), dimensionality reduction of high-dimensional space is required.
  • the methods of dimensionality reduction mainly include linear dimensionality reduction methods, traditional nonlinear dimensionality reduction methods, nonlinear dimensionality reduction methods based on manifold learning, and so on.
  • Modeling maps data into the geometric primitives of objects; rendering turns the geometric primitives into graphics or images. Rendering is the main technique for producing realistic graphics: strictly speaking, it uses an illumination model based on optical principles to calculate the brightness and color of the visible surfaces of an object as projected into the observer's eye, converts them into color values suitable for the graphics display device, and thereby determines the color and lighting effect of each pixel of the projected image, finally producing a realistic picture. According to the constructed correlation matrix, the visualization graph of the random forest classification model in a space of at most three dimensions can be obtained by the dimensionality reduction technique, so as to better analyze and optimize the random forest classification model.
  • the optimization module 104 is configured to optimize the random forest classification model according to the visualization graph of the random forest classification model acquired by the acquisition module 103, so that the upper bound of the second generalization error of the processed random forest classification model does not exceed the upper bound of the first generalization error of the random forest classification model before processing.
  • the performance of machine learning can be expressed by generalization errors.
  • the generalization error is determined by two factors: the overall classification strength of the random forest and the average correlation between the decision trees.
  • the generalization error decreases as the overall classification strength of the random forest increases, and increases with the average correlation between the decision trees. If the learning performance of the random forest classification model is to be improved, the generalization error must be reduced, and there are two ways to do so: the first is to increase the overall classification strength of the random forest by deleting decision trees with weak classification strength; the second is to reduce the average correlation between the decision trees by deleting highly correlated decision trees.
  • the user can conveniently optimize the random forest classification model according to the visual graph of the random forest classification model.
  • if the upper bound of the second generalization error of the processed random forest classification model does not exceed the upper bound of the first generalization error of the random forest classification model before processing, that is, the second upper bound is less than or equal to the first, the optimized random forest classification model is acceptable; otherwise, the second upper bound is larger than the first, which indicates that the learning performance of the optimized random forest classification model is worse than that of the model before optimization, and the optimized model is unacceptable.
  • in this way, a visualization graph of the random forest classification model is obtained, and when the model is optimized according to this graph, the learning performance of the random forest classification model can be improved and the number of decision trees in it can be reduced; because the visualization graph is intuitive, the optimization effect can be seen directly during optimization, so the prediction speed and accuracy can be improved, and no large amount of memory is needed to store the results of an optimization algorithm, which reduces the memory space required by the random forest classification model.
  • FIG. 7 is a schematic structural diagram of an implementation of a visual optimization processing apparatus for a random forest classification model according to the present invention.
  • the apparatus includes: an estimation module 201, a construction module 202, an acquisition module 203, and an optimization module 204.
  • the estimation module 201 is configured to estimate the correlation between the decision trees of the random forest classification model by the out-of-bag data for the constructed random forest classification model.
  • Out-of-bag data can be used to estimate the classification strength of each decision tree in a random forest classification model and the correlation between decision trees.
  • the main factors determining the classification performance of a random forest classification model are, first, the classification strength of the individual decision trees: the greater the classification strength of a single decision tree, the better the classification performance of the model; and second, the correlation between the decision trees: the greater the correlation between decision trees, the worse the classification performance of the model.
  • the construction module 202 is configured to construct a correlation matrix by using the correlation between the decision trees of the random forest classification model estimated by the estimation module 201.
  • the correlation matrix, also called the correlation coefficient matrix, is composed of the correlation coefficients between the columns of a matrix; that is, the element in the i-th row and j-th column of the correlation matrix is the correlation coefficient between the i-th and j-th columns of the original matrix.
  • here, the element in the i-th row and j-th column of the correlation matrix is the correlation between the i-th decision tree and the j-th decision tree of the random forest classification model, where i and j are non-zero natural numbers.
  • the obtaining module 203 is configured to obtain, according to the correlation matrix constructed by the building module 202, a visualization graph of the random forest classification model in a space of at most three dimensions by a dimensionality reduction technique.
  • according to the constructed correlation matrix, the visualization graph of the random forest classification model in a space of at most three dimensions can be obtained by the dimensionality reduction technique, so as to better analyze and optimize the random forest classification model.
  • the visual graph is a scatter plot. Each point of the scatter plot represents a decision tree. The distance between each two points of the scatter plot represents the correlation between the decision trees corresponding to the random forest classification model.
  • the points of the scatter plot are rendered in different colors to express the classification strength information of the decision tree corresponding to each point.
  • the optimization module 204 is configured to optimize the random forest classification model according to the visualization graph of the random forest classification model acquired by the acquisition module 203, so that the upper bound of the second generalization error of the processed random forest classification model does not exceed the upper bound of the first generalization error of the random forest classification model before processing.
  • the performance of machine learning can be expressed by generalization errors.
  • the generalization error is determined by two factors: the overall classification strength of the random forest and the average correlation between the decision trees.
  • the generalization error decreases as the overall classification strength of the random forest increases, and increases with the average correlation between the decision trees; that is, if the learning performance of the random forest classification model is to be improved, the generalization error must be reduced. There are two ways to do so: the first is to increase the overall classification strength of the random forest by deleting decision trees with weak classification strength; the second is to reduce the average correlation between the decision trees by deleting highly correlated decision trees.
  • the user can conveniently optimize the random forest classification model according to the visual graph of the random forest classification model.
  • if the upper bound of the second generalization error of the processed random forest classification model does not exceed the upper bound of the first generalization error of the random forest classification model before processing, that is, the second upper bound is less than or equal to the first, the optimized random forest classification model is acceptable; otherwise, the second upper bound is larger than the first, which indicates that the learning performance of the optimized random forest classification model is worse than that of the model before optimization, and the optimized model is unacceptable.
  • the optimization module 204 includes: a selection unit 2041, an obtaining unit 2042, a comparing unit 2043, and a return unit 2044.
  • the selecting unit 2041 is configured to select a decision tree according to the visual graph of the random forest classification model.
  • the obtaining unit 2042 is configured to delete the K decision trees closest to the decision tree selected by the selecting unit 2041, and to obtain the upper bound of the second generalization error corresponding to the processed random forest classification model.
  • the comparing unit 2043 is configured to compare the upper bound of the second generalization error corresponding to the processed random forest classification model obtained by the obtaining unit 2042 with the upper bound of the first generalization error of the random forest classification model before processing.
  • the returning unit 2044 is configured to return to the selecting unit 2041 for looping when the comparison result of the comparing unit 2043 is that the upper bound of the second generalization error corresponding to the processed random forest classification model has decreased, until that upper bound no longer decreases.
  • the optimization module 204 further includes: a revocation unit 2045 and a deletion unit 2046.
  • the revocation unit 2045 is configured to cancel all operations before the comparison unit 2043 when the comparison result of the comparison unit 2043 is that the upper bound of the second generalization error corresponding to the random forest classification model increases after processing.
  • the decision tree rule matching algorithm is used to delete the structurally similar decision trees in the random forest classification model.
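The patent does not spell out the rule matching algorithm itself. One plausible reading, sketched below, serializes every root-to-leaf path of a tree into a rule and drops any tree whose rule set overlaps an earlier tree's beyond a Jaccard threshold; the nested-tuple tree encoding is an assumption made purely for illustration.

```python
# Assumed encoding: a tree is either ('leaf', label) or
# (feature, threshold, left_subtree, right_subtree).

def rule_set(node, path=()):
    """Return the set of root-to-leaf paths of `node` as hashable rule tuples."""
    if node[0] == "leaf":
        return {path + (("label", node[1]),)}
    feature, threshold, left, right = node
    return (rule_set(left, path + ((feature, "<=", threshold),))
            | rule_set(right, path + ((feature, ">", threshold),)))

def drop_similar_trees(trees, overlap_threshold=0.8):
    kept, kept_rules = [], []
    for tree in trees:
        rules = rule_set(tree)
        # Jaccard overlap of rule sets as the structural-similarity measure
        if not any(len(rules & r) / len(rules | r) >= overlap_threshold
                   for r in kept_rules):
            kept.append(tree)
            kept_rules.append(rules)
    return kept

trees = [("x0", 0.5, ("leaf", 0), ("leaf", 1)),
         ("x0", 0.5, ("leaf", 0), ("leaf", 1)),   # structural duplicate
         ("x1", 2.0, ("leaf", 1), ("leaf", 0))]
print(len(drop_similar_trees(trees)))  # 2: the duplicate tree is dropped
```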
  • In this embodiment of the invention, because a visualization graph of the random forest classification model is obtained, optimizing the model according to the graph not only improves the learning performance of the random forest classification model and reduces the number of decision trees in it; since the graph is vivid and intuitive, the effect of the optimization can also be seen directly, so prediction speed and accuracy are improved, no large memory space is needed to store the results of an optimization algorithm, and the memory space required by the random forest classification model is reduced.
  • In addition, the MDS dimension-reduction technique keeps the operation relatively simple and the results intuitive to interpret; the K-nearest-neighbor classification algorithm can quickly delete decision trees belonging to the same category; and the decision tree rule matching algorithm can delete structurally similar decision trees.
  • FIG. 8 is a schematic structural diagram of still another embodiment of the visual optimization processing apparatus for a random forest classification model according to the present invention. The apparatus includes: a processor 71, a memory 72 coupled to the processor 71, and a data bus 73, where the processor 71 and the memory 72 are connected through the data bus 73.
  • In some implementations, the memory 72 stores the following elements, executable modules, or data structures, or a subset or extended set of them:
  • an operating system 721, which contains various system programs for implementing basic services and handling hardware-based tasks;
  • an application module 722, which contains various applications for implementing application services.
  • In this embodiment of the invention, by invoking the programs or instructions stored in the memory 72, the processor 71 is configured to: for a constructed random forest classification model, estimate the correlation between the decision trees of the random forest classification model from out-of-bag data; build a correlation matrix from those correlations; obtain, from the correlation matrix, a visualization graph of the random forest classification model in a space of at most three dimensions by a dimensionality-reduction technique; and optimize the random forest classification model according to the visualization graph, so that the upper bound of the second generalization error of the processed model does not exceed the upper bound of the first generalization error of the model before processing.
  • Further, the processor 71 is configured to obtain, from the correlation matrix, the visualization graph of the random forest classification model in a space of at most three dimensions by the multidimensional scaling (MDS) dimension-reduction technique.
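For concreteness, here is a sketch of how the tree-to-tree correlation matrix could be estimated from out-of-bag predictions; it is illustrative only, using the correlation of per-tree out-of-bag correctness indicators as a simplified stand-in for Breiman's raw-margin correlation, and all array names are assumptions.

```python
import numpy as np

def oob_correlation_matrix(oob_pred: np.ndarray, y: np.ndarray,
                           oob_mask: np.ndarray) -> np.ndarray:
    """oob_pred: (n_trees, n_samples) labels each tree predicts;
    y: (n_samples,) true labels;
    oob_mask: (n_trees, n_samples) True where a sample is out-of-bag for a tree."""
    n_trees = oob_pred.shape[0]
    correct = (oob_pred == y).astype(float)   # per-tree correctness indicators
    corr = np.eye(n_trees)
    for i in range(n_trees):
        for j in range(i + 1, n_trees):
            both = oob_mask[i] & oob_mask[j]  # samples out-of-bag for both trees
            if both.sum() > 1:
                c = np.corrcoef(correct[i, both], correct[j, both])[0, 1]
                corr[i, j] = corr[j, i] = 0.0 if np.isnan(c) else c
    return corr
```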
  • The visualization graph is a scatter plot. Each point of the scatter plot represents a decision tree, and the distance between any two points represents the correlation between the corresponding decision trees of the random forest classification model: nearby points correspond to highly correlated trees, distant points to weakly correlated ones.
  • The points of the scatter plot are shown in different colors to express the classification strength information of the decision trees they correspond to.
  • Further, the scatter plot may be rendered as a heat map of the density distribution, which shows the distribution of decision-tree populations of different densities at a finer granularity.
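Under the assumption that off-the-shelf tools are acceptable, such a view can be produced with scikit-learn's MDS and matplotlib; the correlation matrix and strength values below are synthetic placeholders standing in for the out-of-bag estimates.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
n_trees = 100
strengths = rng.uniform(0.3, 0.9, n_trees)   # placeholder per-tree strengths
a = rng.uniform(size=(n_trees, n_trees))
corr = (a + a.T) / 2                          # placeholder symmetric correlations
np.fill_diagonal(corr, 1.0)

dissimilarity = 1.0 - np.abs(corr)            # high correlation -> small distance
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dissimilarity)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sc = ax1.scatter(coords[:, 0], coords[:, 1], c=strengths, cmap="viridis")
fig.colorbar(sc, ax=ax1, label="estimated tree strength")
ax1.set_title("decision trees in MDS space")
ax2.hexbin(coords[:, 0], coords[:, 1], gridsize=10, cmap="hot")
ax2.set_title("density heat map")
plt.show()
```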
  • The processor 71 is further configured to: select a decision tree according to the visualization graph of the random forest classification model; delete the K decision trees closest to the selected decision tree and obtain the upper bound of the second generalization error corresponding to the processed random forest classification model; compare that upper bound with the upper bound of the first generalization error of the random forest classification model before processing; and, if the upper bound of the second generalization error corresponding to the processed model decreases, return to the step of selecting a decision tree according to the visualization graph and loop until the upper bound of the second generalization error corresponding to the processed random forest classification model no longer decreases.
  • The processor 71 is further configured to: if the upper bound of the second generalization error corresponding to the processed random forest classification model increases, cancel the steps preceding the step of comparing with the upper bound of the first generalization error of the random forest classification model before processing, and delete the structurally similar decision trees in the random forest classification model by using a decision tree rule matching algorithm.
  • In the several implementations provided by the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. The device implementations described above are merely illustrative; for example, the division into modules or units is only a division by logical function, and an actual implementation may divide them differently: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Likewise, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or of another form.
  • The components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments.
  • In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as a standalone product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present invention essentially, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
  • The above are merely implementations of the present invention and do not thereby limit the patent scope of the invention; any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, likewise falls within the patent protection scope of the present invention.

Abstract

The invention discloses a visual optimization processing method for a random forest classification model, comprising: for a constructed random forest classification model, estimating the correlation between the decision trees of the random forest classification model from out-of-bag data; building a correlation matrix from the correlations between the decision trees; obtaining, from the correlation matrix, a visualization graph of the random forest classification model in a space of at most three dimensions by a dimensionality-reduction technique; and optimizing the random forest classification model according to the visualization graph, so that the upper bound of the second generalization error of the processed random forest classification model does not exceed the upper bound of the first generalization error of the random forest classification model before processing. In this way, the invention can reduce the number of decision trees in the random forest classification model and the memory space the model requires, while also improving prediction speed and accuracy.


Claims

1. A visual optimization processing method for a random forest classification model, characterized by comprising:
for a constructed random forest classification model, calculating the correlation between the decision trees of the random forest classification model from out-of-bag data;
building a correlation matrix from the correlations between the decision trees of the random forest classification model;
obtaining, from the correlation matrix, a visualization graph of the random forest classification model in a space of at most three dimensions by a dimensionality-reduction technique; and
optimizing the random forest classification model according to the visualization graph of the random forest classification model, so that the upper bound of the second generalization error of the processed random forest classification model does not exceed the upper bound of the first generalization error of the random forest classification model before processing.

2. The method according to claim 1, characterized in that the step of obtaining, from the correlation matrix, the visualization graph of the random forest classification model in a space of at most three dimensions by a dimensionality-reduction technique comprises: obtaining, from the correlation matrix, the visualization graph of the random forest classification model in a space of at most three dimensions by the multidimensional scaling (MDS) dimension-reduction technique.

3. The method according to claim 2, characterized in that the visualization graph is a scatter plot, each point of the scatter plot represents a decision tree, and the distance between any two points of the scatter plot represents the correlation between the corresponding decision trees of the random forest classification model.

4. The method according to claim 3, characterized in that the points of the scatter plot are shown in different colors to express the classification strength information of the decision trees corresponding to the points.

5. The method according to claim 3, characterized in that the scatter plot is a heat map of the density distribution.

6. The method according to claim 1, characterized in that the step of optimizing the random forest classification model according to the visualization graph of the random forest classification model comprises:
selecting a decision tree according to the visualization graph of the random forest classification model;
deleting the K decision trees closest to the selected decision tree, and obtaining the upper bound of the second generalization error corresponding to the processed random forest classification model;
comparing the upper bound of the second generalization error corresponding to the processed random forest classification model with the upper bound of the first generalization error of the random forest classification model before processing; and
if the upper bound of the second generalization error corresponding to the processed random forest classification model decreases, returning to the step of selecting a decision tree according to the visualization graph of the random forest classification model and looping until the upper bound of the second generalization error corresponding to the processed random forest classification model no longer decreases.

7. The method according to claim 6, characterized in that, after the step of comparing with the upper bound of the first generalization error of the random forest classification model before processing, the method comprises:
if the upper bound of the second generalization error corresponding to the processed random forest classification model increases, canceling the steps preceding the step of comparing with the upper bound of the first generalization error of the random forest classification model before processing; and
deleting the structurally similar decision trees in the random forest classification model by using a decision tree rule matching algorithm.

8. The method according to claim 1, characterized in that the element in row i, column j of the correlation matrix is the correlation between the i-th decision tree and the j-th decision tree of the random forest classification model, where i and j are non-zero natural numbers.

9. A visual optimization processing apparatus for a random forest classification model, characterized in that the apparatus comprises: an estimation module, a construction module, an acquisition module, and an optimization module;
the estimation module is configured to, for a constructed random forest classification model, estimate the correlation between the decision trees of the random forest classification model from out-of-bag data;
the construction module is configured to build a correlation matrix from the correlations between the decision trees of the random forest classification model estimated by the estimation module;
the acquisition module is configured to obtain, from the correlation matrix built by the construction module, a visualization graph of the random forest classification model in a space of at most three dimensions by a dimensionality-reduction technique; and
the optimization module is configured to optimize the random forest classification model according to the visualization graph acquired by the acquisition module, so that the upper bound of the second generalization error of the processed random forest classification model does not exceed the upper bound of the first generalization error of the random forest classification model before processing.

10. The apparatus according to claim 9, characterized in that the acquisition module is specifically configured to obtain, from the correlation matrix, the visualization graph of the random forest classification model in a space of at most three dimensions by the multidimensional scaling (MDS) dimension-reduction technique.

11. The apparatus according to claim 10, characterized in that the visualization graph is a scatter plot, each point of the scatter plot represents a decision tree, and the distance between any two points of the scatter plot represents the correlation between the corresponding decision trees of the random forest classification model.

12. The apparatus according to claim 11, characterized in that the points of the scatter plot are shown in different colors to express the classification strength information of the decision trees corresponding to the points.

13. The apparatus according to claim 11, characterized in that the scatter plot is a heat map of the density distribution.

14. The apparatus according to claim 9, characterized in that the optimization module comprises: a selecting unit, an obtaining unit, a comparing unit, and a return unit;
the selecting unit is configured to select a decision tree according to the visualization graph of the random forest classification model;
the obtaining unit is configured to delete the K decision trees closest to the decision tree selected by the selecting unit and to obtain the upper bound of the second generalization error corresponding to the processed random forest classification model;
the comparing unit is configured to compare the upper bound of the second generalization error corresponding to the processed random forest classification model, obtained by the obtaining unit, with the upper bound of the first generalization error of the random forest classification model before processing; and
the return unit is configured to, when the comparison result of the comparing unit is that the upper bound of the second generalization error corresponding to the processed random forest classification model decreases, return to the selecting unit and loop until the upper bound of the second generalization error corresponding to the processed random forest classification model no longer decreases.

15. The apparatus according to claim 14, characterized in that the optimization module further comprises: a revocation unit and a deletion unit;
the revocation unit is configured to cancel all operations performed before the comparing unit when the comparison result of the comparing unit is that the upper bound of the second generalization error corresponding to the processed random forest classification model increases; and
the deletion unit is configured to delete the structurally similar decision trees in the random forest classification model by using a decision tree rule matching algorithm after the revocation unit has cancelled all operations performed before the comparing unit.

16. The apparatus according to claim 9, characterized in that the element in row i, column j of the correlation matrix is the correlation between the i-th decision tree and the j-th decision tree of the random forest classification model, where i and j are non-zero natural numbers.