CN117036781A - Image classification method based on tree comprehensive diversity deep forests


Info

Publication number
CN117036781A
CN117036781A (application number CN202310882737.XA)
Authority
CN
China
Prior art keywords
tree
decision
diversity
similarity
model
Prior art date
Legal status
Pending
Application number
CN202310882737.XA
Other languages
Chinese (zh)
Inventor
丁家满
江相渝
贾连印
付晓东
姜瑛
Current Assignee
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date
2023-07-18
Filing date
2023-07-18
Publication date
2023-11-10
Application filed by Kunming University of Science and Technology
Priority to CN202310882737.XA
Publication of CN117036781A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound


Abstract

The application relates to an image classification method based on tree comprehensive diversity deep forests and belongs to the technical field of multi-class classification in data mining. To address the high time, memory, and storage costs of the original deep forest model, the application designs a diversity measure based on the morphological structure of decision tree models, exploiting the characteristics of those models, and uses it to compute the diversity between decision trees; then, by jointly weighing decision tree diversity and accuracy, it designs a pruning strategy that optimizes the random forests in the cascade layers; finally, pruning yields a compact and efficient deep forest model, which reduces the computational complexity of the model, greatly cuts time, memory, and storage overhead, and effectively improves the prediction performance and efficiency of image classification tasks.

Description

Image classification method based on tree comprehensive diversity deep forests
Technical Field
The application relates to an image classification method based on tree comprehensive diversity deep forests and belongs to the technical field of multi-class classification in data mining.
Background
The deep forest is a recently proposed deep learning approach built from non-differentiable modules. It has fewer hyper-parameters than deep neural networks, and its complexity can be determined adaptively from the complexity of the data set, so it is easier to train. Since its introduction, the deep forest has received extensive attention from academia and industry and has been widely applied in fields such as financial applications, disease classification, and computer vision. A large body of experimental results shows that deep forests outperform deep neural networks on many classification tasks.
The deep forest consists of two main modules: multi-grained scanning and the cascade forest. Multi-grained scanning handles data sets with high dimensionality and correlated features: sliding windows of different scales scan the data features, the scanned features are fed into random forests or completely random forests, and the output features of those forests realize the feature transformation. The cascade forest is the core module of the deep forest. It consists of multiple cascade layers, each containing one or more independent random forests and completely random forests, which are in turn composed of many decision trees. The forests of each cascade layer produce class probability vectors, which are concatenated with the original data and used as new training data for the next layer. Subsequent cascade layers are trained on the new training data until a preset stopping condition is met, forming a layer-by-layer deep learning scheme.
The deep forest is an ensemble model built on a large number of decision trees and achieves performance far beyond that of a single decision tree. Its feature representation consists of predicted class probability vectors, and this prediction-based representation requires a large number of forest models to be saved during training for use at test time. This leads to low training efficiency and high time, memory, and storage overhead. On large data sets, training a deep forest model may require tens of gigabytes of memory and tens of hours of running time, which greatly hinders the application of deep forests in real-world scenarios.
Disclosure of Invention
The application provides an image classification method based on tree comprehensive diversity deep forests. It uses pruning to overcome the low training efficiency and the high time, memory, and storage overhead of the original deep forest algorithm, and applies the pruned model to image classification to improve classification efficiency.
The technical scheme of the application is as follows: first, exploiting the characteristics of decision tree models, a diversity measure defined purely on the morphological structure of the tree model is designed and used to compute the diversity between tree models; then, to jointly account for the relationship between tree diversity and accuracy, a tree comprehensive diversity ranking pruning algorithm is proposed and applied in the deep forest to optimize the random forests in the cascade layers; finally, image samples are classified with the compact and efficient pruned forest model. The specific steps are as follows:
Step 1: acquire images to form a training set;
Step 2: input the training set features into the cascade forest and train a cascade layer, where the cascade layer contains a random forest and a completely random forest;
Step 3: parse a set of decision rules from each decision tree in the trained forest;
Given a classification data set with D features and C classes, a decision tree forest model is constructed. Each decision tree in the forest can be parsed into R decision rules, described only by the split features used at the tree's split nodes: a depth-first traversal starts from the root of the tree model and collects the split features along each decision path. A simple form of a decision rule containing d split features and the corresponding class label c is shown below:
Rule_r: if feature_1 & feature_2 & ... & feature_d | then response class c
The whole decision rule divides into a front part and a back part: the split features in the first half and the corresponding class in the second half. A single decision tree generates R decision rules containing different features; taking the union of these R rules yields a sequence containing all split features used by that single decision tree, so a forest model containing N decision trees yields N rule sequences.
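The rule-parsing step can be illustrated with a short sketch. The following Python snippet is a minimal illustration, not the patented implementation: it assumes scikit-learn forests on synthetic stand-in data (make_classification is a placeholder for flattened image features) and collects, per tree, the union of split features over all decision paths, which is the rule sequence described above.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Toy stand-in data (placeholder for flattened image features).
    X, y = make_classification(n_samples=600, n_features=20, n_classes=3,
                               n_informative=10, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
    forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_train, y_train)

    def split_feature_set(estimator):
        """Union of split features over all decision paths of one tree."""
        t = estimator.tree_
        # t.feature[i] is the split feature index of node i, or -2 for a leaf,
        # so the non-negative entries are exactly the split-node features.
        return {int(f) for f in t.feature if f >= 0}

    rule_sequences = [split_feature_set(est) for est in forest.estimators_]  # N rule sequences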
Step 4: measure the front parts and the back parts of the decision rules separately to obtain a symmetric Jaccard similarity matrix P and a symmetric cosine similarity matrix Q;
For the decision rule front parts, a bag-of-words model converts each sequence into a fixed-length feature vector. The rule feature vector is binary: an element value of 1 means the decision rule uses the corresponding feature, and 0 means it does not. Vectorizing the decision rules turns similarity comparison between decision rules into similarity comparison between feature vectors. To estimate the Jaccard similarity between vectors of arbitrary length in linear time and with little memory, a MinHash algorithm is used, yielding a symmetric Jaccard similarity matrix P of size N×N.
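Continuing the sketch above, the rule sequences can be vectorized as binary bag-of-words vectors. An exact pairwise Jaccard computation is shown here only for reference; the MinHash estimation the method actually uses is sketched under step (3) further below.

    import numpy as np

    D = X.shape[1]  # total number of features
    # One fixed-length binary vector per tree: entry j is 1 iff feature j is used.
    rule_vectors = np.zeros((len(rule_sequences), D), dtype=bool)
    for n, feats in enumerate(rule_sequences):
        rule_vectors[n, list(feats)] = True

    def jaccard(u, v):
        """Exact Jaccard similarity of two binary vectors (reference only)."""
        union = np.logical_or(u, v).sum()
        return np.logical_and(u, v).sum() / union if union else 1.0

    # Symmetric N x N Jaccard similarity matrix P.
    P = np.array([[jaccard(u, v) for v in rule_vectors] for u in rule_vectors])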
For the decision rule back parts, the number of decision rules corresponding to each class in a single decision tree is counted, producing a rule number vector S_n = [s_{n0}, s_{n1}, ..., s_{nc}], where s_{nc} is the number of rules belonging to class c in the n-th decision tree; the dimension of the rule number vector changes with the number of classes of the data set. After the N rule number vectors are obtained, cosine similarity is used as the measure. For two vectors A and B it is defined as

cos(A, B) = (Σ_{i=1}^{c} A_i·B_i) / (√(Σ_{i=1}^{c} A_i²) · √(Σ_{i=1}^{c} B_i²))

where c is the vector dimension and A_i and B_i are the elements of vectors A and B in the i-th dimension. Computing this for all pairs yields a symmetric cosine similarity matrix Q of size N×N.
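The back-part measure can be sketched the same way, continuing the running example: one rule per leaf, labeled by the leaf's majority class, then cosine similarity between the per-class rule counts.

    def rule_count_vector(estimator, n_classes):
        """s_n[c] = number of decision rules (leaves) of one tree whose back part is class c."""
        t = estimator.tree_
        counts = np.zeros(n_classes)
        for node in range(t.node_count):
            if t.children_left[node] == -1:          # a leaf closes one decision rule
                counts[int(np.argmax(t.value[node]))] += 1
        return counts

    S = np.array([rule_count_vector(est, forest.n_classes_) for est in forest.estimators_])
    # Symmetric N x N cosine similarity matrix Q.
    norms = np.linalg.norm(S, axis=1, keepdims=True)
    Q = (S @ S.T) / (norms * norms.T)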
Step 5: weigh the two similarity matrices of the rule front and back parts together and compute a similarity measure of the tree models' morphological structure;
The two symmetric similarity matrices P and Q correspond to the front and back parts of the decision rules, i.e. to the split nodes and the leaf nodes of the tree structure, respectively. The similarity values in both matrices lie in [0, 1]; the larger the value, the greater the similarity. The similarity matrix T of the decision trees' morphological structure is computed with the formula

T = αP + (1−α)Q, α ∈ [0, 1]

where α is a trade-off parameter taking values between 0 and 1. The closer α is to 1, the more weight falls on the decision rule feature measure; the closer to 0, the more weight falls on the rule number measure. This measure is independent of specific sample data and reflects only the morphological differences between pairs of decision trees in the random forest.
Step 6: jointly consider the similarity of the trees' internal morphological structure and their accuracy, and select decision trees with both high accuracy and high diversity;
Assume an ensemble with N base classifiers H = {h_1, h_2, ..., h_N}. The diversity of a single classifier h_i is defined as the average of its pairwise measures with all other classifiers, taking one minus the pairwise similarity as the measure:

div(h_i) = (1/(N−1)) · Σ_{j=1, j≠i}^{N} (1 − similarity(t_i, t_j))

where similarity(t_i, t_j) is the value in row i, column j of the structural similarity matrix T. Because the numeric ranges of accuracy and diversity differ from data set to data set, weighting them directly is inappropriate. After the accuracy and diversity of every base classifier in the ensemble are computed, Min-Max scaling therefore maps both values onto the interval [0, 1]:

x′ = (x − x_min) / (x_max − x_min)

The variable x′ is the interval-scaled accuracy or diversity value of a base classifier; the larger the value, the higher the accuracy or the greater the diversity. Finally, the pruning evaluation criterion that weighs the comprehensive structural diversity div(h_i) against the model accuracy acc(h_i) is defined as

TCD(h_i) = ρ·div(h_i) + (1−ρ)·acc(h_i), ρ ∈ [0, 1]

Given a suitable trade-off parameter ρ, the tree comprehensive diversity value of every base classifier is computed, and the top-ranked base classifiers in the ordered list of the tree comprehensive diversity measure (a preset fraction of the N originals) are selected to form the pruned ensemble subset.
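Putting Step 5 and Step 6 together on the running example, a minimal pruning sketch looks as follows. The 1 − similarity diversity term, the validation-set accuracy, and the 10% retention rate are assumptions chosen to match the description above, not the patent's exact formulation.

    def min_max(x):
        """Min-Max scaling to [0, 1]: x' = (x - min) / (max - min)."""
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min())

    alpha, rho = 0.5, 0.5                      # trade-off parameters (embodiment values)
    T = alpha * P + (1 - alpha) * Q            # structural similarity matrix

    N = len(T)
    # Diagonal entries of T are 1, so 1 - T[i, i] = 0 and the row sum divided
    # by N - 1 is exactly the mean dissimilarity to the other trees.
    div = (1.0 - T).sum(axis=1) / (N - 1)
    acc = np.array([est.score(X_val, y_val) for est in forest.estimators_])

    score = rho * min_max(div) + (1 - rho) * min_max(acc)   # tree comprehensive diversity
    keep = np.argsort(score)[::-1][: max(1, int(0.1 * N))]  # assumed 10% retention rate
    pruned_forest = [forest.estimators_[i] for i in keep]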
Step 7: classify the images to be classified using the pruned forest model.
Further, the MinHash algorithm used to compute the Jaccard similarity in Step 4 comprises the following specific steps:
(1) First, K independent, uniform random hash functions are defined. For the i-th rule feature vector, every element of the vector is iterated over, and the minimum hash value of the elements under the k-th hash function is taken as the k-th signature entry, finally yielding a K-dimensional signature vector M_i = [m_{i1}, m_{i2}, ..., m_{iK}] composed of K minimum hash values.
(2) Step (1) is repeated for the N rule feature vectors to obtain a MinHash signature matrix M of size N×K, in which each element is the minimum hash value under the corresponding hash function. The signature matrix is then used to compute Jaccard similarity estimates between the N vectors.
(3) For two rows M_i and M_j of the signature matrix M, the similarity estimate is computed according to the following formula, where I(·) is an indicator function counting the positions at which the two vectors share the same hash value; the count divided by the signature length is an estimate of the Jaccard similarity between the two vectors:

Ĵ(M_i, M_j) = (1/K) · Σ_{k=1}^{K} I(m_{ik} = m_{jk})

The pairwise similarity results for all vectors are stored in a symmetric similarity matrix P of size N×N; the larger a value, the greater the similarity.
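A compact MinHash sketch of steps (1) to (3), continuing the running example. The universal hash family h_k(x) = (a_k·x + b_k) mod p is an assumption; any family of independent uniform hash functions works.

    rng = np.random.default_rng(0)
    K = 64                                  # signature length (number of hash functions)
    p = 2_147_483_647                       # a large prime for the assumed hash family
    a = rng.integers(1, p, size=K)
    b = rng.integers(0, p, size=K)

    def minhash_signature(binary_vec):
        """K minimum hash values over the set of feature indices the tree uses."""
        idx = np.flatnonzero(binary_vec)
        hashes = (a[:, None] * idx[None, :] + b[:, None]) % p   # h_k over every element
        return hashes.min(axis=1)

    M = np.array([minhash_signature(v) for v in rule_vectors])   # N x K signature matrix
    # Estimated Jaccard similarity: fraction of matching signature positions.
    P_est = (M[:, None, :] == M[None, :, :]).mean(axis=2)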
Compared with the prior art, the application has the following advantages:
the application designs a diversity measurement standard aiming at the morphological structure of the tree model, wherein the split nodes and the leaf nodes of the decision tree are respectively measured, so that the diversity of the morphological structure among the tree models can be comprehensively and accurately measured. According to the method, the relation between the diversity and the accuracy of the tree models is comprehensively considered, and the random forests in the depth forests are pruned through the tree comprehensive diversity ordering pruning algorithm, so that the time, the memory and the storage expenditure generated in the cascade layer can be effectively reduced. Compared with the original depth forest model, the integrated subset with large diversity and high accuracy can be selected from the original integrated model through pruning operation, and the prediction performance and efficiency of the image classification task are effectively improved.
Drawings
FIG. 1 is a schematic flow chart of the training process of the present application;
FIG. 2 is an exemplary decision tree structure in accordance with an embodiment of the present application;
FIG. 3 is a schematic diagram of the pruning process of the tree comprehensive diversity deep forest model;
FIG. 4 is a flow chart of the testing process of the present application.
Detailed Description
To describe the practice of the application more clearly, it is further described below with reference to the drawings and a specific embodiment.
Example 1:
The image classification method based on tree comprehensive diversity deep forests is applied to the MNIST handwritten-digit image classification task; the specific process is shown in FIG. 1:
The first step: acquire the MNIST handwritten-image data set, comprising 60,000 training images and 10,000 test images. The images show handwritten digits, and the data set has 10 label classes corresponding to the digits 0-9. The 28×28 pixels of each MNIST image are converted into 784 feature columns.
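A minimal loading sketch: scikit-learn's fetch_openml "mnist_784" already delivers the images flattened to 784 columns, and the conventional first-60,000/last-10,000 ordering is assumed to match the original MNIST train/test partition.

    import numpy as np
    from sklearn.datasets import fetch_openml

    mnist = fetch_openml("mnist_784", version=1, as_frame=False)
    X_all = mnist.data                      # 70,000 x 784, each row one flattened 28x28 image
    y_all = mnist.target.astype(int)        # digit labels 0-9
    X_train, y_train = X_all[:60_000], y_all[:60_000]
    X_test, y_test = X_all[60_000:], y_all[60_000:]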
The second step: input the training set features into the cascade forest and train one cascade layer. The cascade layer contains random forests and completely random forests; in this embodiment one random forest and one completely random forest are used, and the number of base classifiers in each forest is set to 100.
If the current cascade layer is the first one, it is trained on the original training data. Otherwise, the original training data are concatenated with the predicted class vectors output by the previous cascade layer to obtain new features, on which the cascade layer is trained.
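The layer-by-layer training loop can be sketched as follows. This is a simplified stand-in, not the patented model: ExtraTreesClassifier plays the role of the completely random forest, out-of-fold probabilities replace the original k-fold class-vector generation, and pruning is omitted for brevity.

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_predict

    def train_cascade(X, y, max_layers=10):
        """Grow cascade layers until training accuracy stops improving."""
        features, best_acc, layers = X, -1.0, []
        for _ in range(max_layers):
            layer = [RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0),
                     ExtraTreesClassifier(n_estimators=100, n_jobs=-1, random_state=0)]
            # Out-of-fold class probability vectors, one block per forest.
            probas = [cross_val_predict(f, features, y, cv=3, method="predict_proba")
                      for f in layer]
            for f in layer:
                f.fit(features, y)                   # refit on the full layer input
            acc = (np.mean(probas, axis=0).argmax(axis=1) == y).mean()
            if acc <= best_acc:                      # stopping condition from the text
                break
            best_acc = acc
            layers.append(layer)
            features = np.hstack([X] + probas)       # splice class vectors onto raw features
        return layers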
FIG. 2 illustrates an exemplary decision tree structure according to an embodiment of the application; the decision trees in both the random forest and the completely random forest use the data structure shown there. The pruning process is shown in FIG. 3.
The third step: parse decision rules from each decision tree in the trained forest. Each rule is described by the split features used at the split nodes: a depth-first traversal from the root of the tree model collects the split features along each decision path. A simple form of a decision rule containing several split features and the corresponding class is shown below:
Rule: if feature_1 & feature_3 & feature_4 & ... & feature_782 & feature_784 | then response class 2
The whole decision rule divides into a front part and a back part, the split features of the first half and the response class of the second half; the two parts are then measured separately.
The fourth step: measure the decision rules. Taking the union of each tree's decision rules yields a sequence containing all split features used by that single decision tree; the forest model with 100 decision trees in this embodiment thus generates 100 rule sequences. Each sequence is then converted into a fixed-length feature vector with a bag-of-words model, giving rule feature vectors of dimension 784, which exactly matches the total number of features used in the whole forest. Computing these in order gives the 100 rule feature vectors of the forest:
Rule feature vector 1: [1, 0, 1, ..., 1, 0, 1]
Rule feature vector 2: [1, 0, ...]
...
Rule feature vector 100: [0, ..., 1]
Estimating the Jaccard similarity between these vectors of arbitrary length with the MinHash algorithm yields a symmetric Jaccard similarity matrix P of size 100×100.
The fifth step: next, consider the information in the decision rule back parts. Count the number of decision rules corresponding to each class in each single decision tree, generating rule number vectors S_n = [s_{n0}, s_{n1}, ..., s_{nc}], where s_{nc} is the number of rules belonging to class c in the n-th decision tree; the rule number vectors in this embodiment therefore have dimension 10.
Rule number vector 1: [6578,5571,...,7453,6368,7214]
Rule number vector 2: [6698,5336,...,6153,5968,6753]
...
Rule number vector 100: [6912,5324,...,7658,4369,5631]
Likewise, after the 100 rule number vectors are obtained, the simple and easily understood cosine similarity measure is applied; from this calculation a symmetric cosine similarity matrix Q of size 100×100 is obtained.
The sixth step: apply the following formula to the two symmetric similarity matrices P and Q to compute the final structural similarity matrix T:

T = αP + (1−α)Q, α ∈ [0, 1]

The formula contains a trade-off parameter α with values in [0, 1]; in this embodiment α = 0.5 is used.
The seventh step: jointly consider the two factors of structural diversity and accuracy of the decision tree models, and obtain the diversity and accuracy of each base classifier in the ensemble. The diversity of a single classifier h_i is the average of its pairwise measures with all other classifiers:

div(h_i) = (1/(N−1)) · Σ_{j=1, j≠i}^{N} (1 − similarity(t_i, t_j))

where similarity(t_i, t_j) is the value in row i, column j of the structural similarity matrix T, and N is the number of base classifiers in the forest, set to 100 in this embodiment. Because the numeric ranges of accuracy and diversity differ across data sets, weighting them directly is inappropriate. Min-Max scaling is therefore used to map the accuracy and diversity values onto the interval [0, 1]:

x′ = (x − x_min) / (x_max − x_min)

The variable x′ is the interval-scaled accuracy or diversity value of a base classifier; the larger the value, the higher the accuracy or the greater the diversity.
The eighth step: the pruning evaluation criterion weighing the structural diversity div(h_i) against the model accuracy acc(h_i) is defined as

TCD(h_i) = ρ·div(h_i) + (1−ρ)·acc(h_i)

and the final tree comprehensive diversity value of each base classifier is computed with this formula. Finally, the top-ranked classifiers in the ordered list of the comprehensive diversity measure are selected to form the pruned ensemble subset. In this embodiment ρ = 0.5 and the retention rate is 10%, i.e. the 10 base classifiers with the largest tree comprehensive diversity values are finally selected.
Diversity list of each base classifier in forest: [0.81,0.87,0.64,...,0.72,0.93]
List of accuracy of each base classifier in forest: [0.77,0.91,0.72,...,0.84,0.89]
Comprehensive diversity measurement list of each base classifier tree in forest: [0.79,0.89,0.68,...,0.78,0.91]
The ninth step: the training samples pass through the pruned forest models to generate predicted class vectors, and the predicted class vectors produced by all forests in a layer are averaged as that layer's predicted class vector for the instance. The values in a predicted class vector sum to 1, and its dimension equals the number of classes.
The tenth step: concatenate the predicted class vector output by the cascade layer with the original feature vector, input the result into the next cascade layer as new features, and repeat the second through tenth steps in a loop. The condition for stopping layer growth is as follows: the loop stops when the accuracy of a layer is greater than the accuracy of the next layer.
The eleventh step: the test process is shown in FIG. 4. After training, the tree comprehensive diversity pruned deep forest model is established, and the test set samples are input into the model for prediction. The 10,000 test images enter the forests of each layer in turn, generate predicted class vectors, and, after concatenation with the original features, proceed to the next layer until the last layer outputs the predicted class of each test image. Comparing the predictions with the true labels gives the accuracy of the model.
Pruning the random forests in the deep forest with the tree comprehensive diversity ranking pruning algorithm effectively improves the training efficiency of the deep forest and greatly reduces the time, memory, and storage overhead incurred in the cascade structure. Compared with the original deep forest model, the pruning operation selects an ensemble subset with high diversity and high accuracy from the original ensemble, effectively improving the model's prediction performance. The model of the application shows significant advantages over the original deep forest model on image classification tasks.

Claims (6)

1. An image classification method based on tree comprehensive diversity deep forests, characterized in that a tree comprehensive diversity ranking pruning strategy is applied in the deep forest to optimize the random forests in the cascade layers, and image samples are classified according to the pruned forest model; the method comprises the following specific steps:
Step 1: acquire images to form a training set;
Step 2: input the training set features into the cascade forest and train a cascade layer, where the cascade layer contains a random forest and a completely random forest;
Step 3: parse a set of decision rules from each decision tree in the trained forest;
Step 4: measure the front parts and the back parts of the decision rules separately to obtain a symmetric Jaccard similarity matrix P and a symmetric cosine similarity matrix Q;
Step 5: weigh the two similarity matrices of the rule front and back parts together and compute a similarity measure of the tree models' morphological structure;
Step 6: jointly consider the similarity of the trees' internal morphological structure and their accuracy, and select decision trees with both high accuracy and high diversity;
Step 7: classify the images to be classified using the pruned forest model.
2. The image classification method based on tree comprehensive diversity deep forests according to claim 1, wherein Step 3 comprises the following specific steps:
Given a classification data set with D features and C classes, a decision tree forest model is constructed. Each decision tree in the forest can be parsed into R decision rules, described only by the split features used at the tree's split nodes: a depth-first traversal starts from the root of the tree model and collects the split features along each decision path. A simple form of a decision rule containing d split features and the corresponding class label c is shown below:

Rule_r: if feature_1 & feature_2 & ... & feature_d | then response class c

The whole decision rule divides into a front part and a back part: the split features in the first half and the corresponding class in the second half. A single decision tree generates R decision rules containing different features; taking the union of these R rules yields a sequence containing all split features used by that single decision tree, so a forest model containing N decision trees yields N rule sequences.
3. The image classification method based on tree comprehensive diversity deep forests according to claim 1, wherein Step 4 comprises the following specific steps:
For the decision rule front parts, a bag-of-words model converts each sequence into a fixed-length feature vector, and a MinHash algorithm computes the Jaccard similarity between the vectors, giving a symmetric Jaccard similarity matrix P of size N×N;
For the decision rule back parts, the number of decision rules corresponding to each class in a single decision tree is counted, producing a rule number vector S_n = [s_{n0}, s_{n1}, ..., s_{nc}], where s_{nc} is the number of rules belonging to class c in the n-th decision tree; cosine similarity is used as the measure, defined for two vectors A and B as

cos(A, B) = (Σ_{i=1}^{c} A_i·B_i) / (√(Σ_{i=1}^{c} A_i²) · √(Σ_{i=1}^{c} B_i²))

where c is the vector dimension and A_i and B_i are the elements of vectors A and B in the i-th dimension; computing this for all pairs gives a symmetric cosine similarity matrix Q of size N×N.
4. The image classification method based on tree comprehensive diversity deep forests according to claim 1, wherein Step 5 comprises the following specific steps:
The similarity matrix T of the decision trees' morphological structure is computed with the formula

T = αP + (1−α)Q, α ∈ [0, 1]

where α is a trade-off parameter taking values between 0 and 1.
5. The image classification method based on tree comprehensive diversity deep forests according to claim 1, wherein Step 6 comprises the following specific steps:
Assume an ensemble with N base classifiers H = {h_1, h_2, ..., h_N}, where the diversity of a single classifier h_i is defined as the average of its pairwise measures with all other classifiers:

div(h_i) = (1/(N−1)) · Σ_{j=1, j≠i}^{N} (1 − similarity(t_i, t_j))

where similarity(t_i, t_j) is the value in row i, column j of the structural similarity matrix T; after the accuracy and diversity of each base classifier in the ensemble are computed, Min-Max scaling maps both values onto [0, 1]:

x′ = (x − x_min) / (x_max − x_min)

The variable x′ is the interval-scaled accuracy or diversity value of a base classifier; finally, the pruning evaluation criterion combining the comprehensive structural diversity div(h_i) and the model accuracy acc(h_i) is defined as

TCD(h_i) = ρ·div(h_i) + (1−ρ)·acc(h_i), ρ ∈ [0, 1]

Given a suitable trade-off parameter ρ, the tree comprehensive diversity value of each base classifier is computed, and the top-ranked base classifiers in the ordered list of the tree comprehensive diversity measure are selected to form the pruned ensemble subset.
6. The method according to claim 3, wherein the MinHash algorithm used to compute the Jaccard similarity in Step 4 comprises the following steps:
(1) First, K independent, uniform random hash functions are defined; for the i-th rule feature vector, every element of the vector is iterated over, and the minimum hash value of the elements under the k-th hash function is taken as the k-th signature entry, finally yielding a K-dimensional signature vector M_i = [m_{i1}, m_{i2}, ..., m_{iK}] composed of K minimum hash values;
(2) Step (1) is repeated for the N rule feature vectors to obtain a MinHash signature matrix M of size N×K, in which each element is the minimum hash value under the corresponding hash function;
(3) For two rows M_i and M_j of the signature matrix M, the similarity estimate is computed according to the formula

Ĵ(M_i, M_j) = (1/K) · Σ_{k=1}^{K} I(m_{ik} = m_{jk})

where I(·) is an indicator function counting the positions at which the two vectors share the same hash value; the pairwise similarity results for all vectors are stored in a symmetric Jaccard similarity matrix P of size N×N, and the larger a value, the greater the similarity.
CN202310882737.XA 2023-07-18 2023-07-18 Image classification method based on tree comprehensive diversity deep forests Pending CN117036781A

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310882737.XA | 2023-07-18 | 2023-07-18 | Image classification method based on tree comprehensive diversity deep forests

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202310882737.XA | 2023-07-18 | 2023-07-18 | Image classification method based on tree comprehensive diversity deep forests

Publications (1)

Publication Number | Publication Date
CN117036781A | 2023-11-10

Family

ID=88627115

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310882737.XA | Image classification method based on tree comprehensive diversity deep forests (Pending, CN117036781A) | 2023-07-18 | 2023-07-18

Country Status (1)

Country Link
CN (1) CN117036781A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117932482A (en) * 2024-03-21 2024-04-26 泰安北航科技园信息科技有限公司 Carbon nano heating method for scarf heating
CN117932482B (en) * 2024-03-21 2024-06-11 泰安北航科技园信息科技有限公司 Carbon nano heating method for scarf heating


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination