CN113239321A - Feature selection method based on filtering and packaging type hierarchy progression - Google Patents


Info

Publication number
CN113239321A
CN113239321A (application CN202110589440.5A)
Authority
CN
China
Prior art keywords
features
sorting
feature
filtering
correlation coefficient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110589440.5A
Other languages
Chinese (zh)
Inventor
李思琪
苗世迪
胡晓慧
王瑞涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202110589440.5A priority Critical patent/CN113239321A/en
Publication of CN113239321A publication Critical patent/CN113239321A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques

Abstract

The invention relates to a feature selection method based on filtering and packaging type hierarchy progression. The method first ranks the features using a filter-based variance ranking method, an information-gain ranking method and a wrapper-based Boruta ranking method, assigns each feature a rank according to its importance, and fuses the results of the three ranking methods. It then computes the pairwise correlation between features using the Pearson correlation coefficient, sets a threshold on the coefficient, and selectively deletes features according to the correlations between them. Finally, a wrapper-based sequential forward selection method combined with a random forest model finds the best feature combination, yielding the optimal feature subset. The method is effective at selecting an optimal feature subset for a data set and supplies more accurate feature information to the learning model, thereby improving the model's accuracy.

Description

Feature selection method based on filtering and packaging type hierarchy progression
The technical field is as follows:
The invention relates to a feature selection method based on filter and wrapper hierarchical progression, and is well suited to feature selection on data sets.
Background art:
Data sets contain large numbers of redundant and irrelevant features, which pose great challenges to data mining and seriously affect the accuracy and scientific validity of its results; irrelevant and redundant features should therefore be removed from a data set before mining.
Feature selection, also called feature subset selection, removes irrelevant or redundant features from the original feature set, reduces the number of features, finds an optimal feature subset, improves model accuracy and reduces running time. Feature selection methods can be divided into filter-based and wrapper-based methods according to whether they are independent of the subsequent learning algorithm. Filter methods are independent of the learning algorithm and generally evaluate features directly from statistics of the training data; they are fast, but their evaluations can deviate substantially from the learning algorithm's actual performance. Wrapper methods evaluate feature subsets by the training accuracy of the learning algorithm; their deviation is small, but the computational cost is large, making them unsuitable for large data sets. The two approaches are complementary and can be combined: a feature selection method based on filter and wrapper hierarchical progression can reduce irrelevant features and delete redundant ones, lowering the consumption of computing resources, shortening training time and improving model performance.
The invention content is as follows:
in order to solve the problem of irrelevant or redundant features of the data set, the invention discloses a feature selection method based on filtering and packaging type hierarchical progression.
Therefore, the invention provides the following technical scheme:
1. a method for feature selection based on filter and package level progression, the method comprising the steps of:
step 1: and (3) sorting the features based on a filtering type variance sorting method, an information gain sorting method and an encapsulation type Boruta sorting method, distributing ranks to the sorted features according to the importance degree, and fusing the results of the three sorting methods.
Step 2: and calculating the correlation between every two characteristics based on the Pearson correlation coefficient, setting a Pearson correlation coefficient threshold value of the characteristics, and selectively deleting partial characteristics according to the correlation between the characteristics.
And step 3: and finding the best feature combination by combining a sequence forward selection method based on an encapsulation type with a random forest model so as to obtain the optimal feature subset.
2. The method according to claim 1, wherein in step 1 the features are ranked by the filter-based variance ranking method, the information-gain ranking method and the wrapper-based Boruta ranking method through the following specific steps:
Step 1-1: rank the features in descending order of importance with each of the variance ranking method, the information-gain method and the Boruta ranking method;
Step 1-2: assign ranks to the sorted features, the most important feature receiving rank 1 and the remaining features being assigned ranks in sequence;
Step 1-3: for each feature, sum the ranks obtained under the different ranking methods and sort the features by this sum in ascending order to obtain the final feature ordering.
3. The method according to claim 1, wherein in step 2 redundant features are deleted based on the Pearson correlation coefficient through the following specific steps:
Step 2-1: compute the pairwise Pearson correlation coefficient between the sorted features;
Step 2-2: set a threshold on the Pearson correlation coefficient;
Step 2-3: selectively delete features according to the Pearson correlation coefficients between them.
4. The method according to claim 1, wherein in step 3 the best feature combination is found by combining a wrapper-based sequential forward selection method with a random forest model, thereby obtaining the optimal feature subset, through the following specific steps:
Step 3-1: use the ranks of the features remaining after redundant-feature deletion as the evaluation function of the sequential forward selection method;
Step 3-2: add the features to the feature set in ascending order of rank;
Step 3-3: build a training model with the random forest algorithm and find the optimal feature subset by comparing the prediction accuracies of the models.
Advantages:
1. The invention discloses a filter and wrapper hierarchically progressive feature selection method, a novel method for feature selection on data sets.
2. The method compensates for the large deviation between filter-method evaluations and the learning algorithm's actual performance; the wrapper stage provides an accurate evaluation.
3. The invention overcomes the wrapper approach's unsuitability for large data, and is applicable to small and large data sets alike.
4. The filter and wrapper approaches are combined to make up for each other's deficiencies, and the Pearson correlation coefficient is used to delete redundant features, reducing the consumption of computing resources and improving model performance.
Description of the drawings:
fig. 1 is a flowchart of a feature selection method based on filtering and packaging hierarchy progression according to an embodiment of the present invention.
Fig. 2 is a flowchart of removing redundant features based on the Pearson correlation coefficient in an embodiment of the present invention.
Fig. 3 is a flowchart of combining a random forest model with a wrapper-based sequential forward selection method in an embodiment of the present invention.
Detailed description of embodiments:
In order to describe the technical solutions in the embodiments of the present invention clearly and completely, the invention is further described in detail below with reference to the drawings.
Taking feature selection of a breast cancer data set collected by a tumor hospital as an example, the embodiment of the present invention provides a flow of a feature selection method based on filtering and packaging type hierarchy progression, as shown in fig. 1, including the following steps.
Step 1: the features are ranked by the filter-based variance ranking method, the information-gain method and the wrapper-based Boruta ranking method as follows:
Step 1-1: rank the features in descending order of importance using, respectively, the filter-based variance ranking method, the information-gain method and the wrapper-based Boruta ranking method;
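Of the three rankers named in step 1-1, the wrapper-based Boruta method is the least self-explanatory. The sketch below is an illustrative simplification (assuming NumPy and scikit-learn are available; the real Boruta algorithm adds statistical accept/reject tests over many iterations): each feature is scored by how often its random-forest importance beats the best shuffled "shadow" copy.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def boruta_rank(X, y, n_rounds=5):
    """Simplified Boruta-style ranking: per round, append shuffled
    'shadow' copies of all features, fit a random forest, and count how
    often each real feature's importance exceeds the best shadow's."""
    rng = np.random.default_rng(0)
    hits = np.zeros(X.shape[1], dtype=int)
    for r in range(n_rounds):
        shadows = rng.permuted(X, axis=0)  # shuffle each column independently
        clf = RandomForestClassifier(n_estimators=50, random_state=r)
        clf.fit(np.hstack([X, shadows]), y)
        imp = clf.feature_importances_
        real, shadow = imp[:X.shape[1]], imp[X.shape[1]:]
        hits += real > shadow.max()
    return [int(i) for i in np.argsort(-hits)]  # most important first

# demo: only column 0 determines the label, columns 1-2 are noise
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
ranking = boruta_rank(X, y)
```

The hit counts then serve the same purpose as any other importance score: features are sorted by them to produce the Boruta ordering.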
the formula for calculating the variance value is as follows:
S^2 = (1/n) * Σ_{i=1..n} (X_i - M)^2
where X_1, ..., X_n are the n values taken by feature X and M is the mean of the feature.
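The variance ranking can be sketched as follows (a minimal illustration assuming the features are the columns of a NumPy array; the function name is ours):

```python
import numpy as np

def variance_ranking(X):
    """Rank features by variance S^2 = (1/n) * sum((X_i - M)^2),
    largest variance first. X: (n_samples, n_features) array."""
    variances = np.mean((X - X.mean(axis=0)) ** 2, axis=0)
    return [int(i) for i in np.argsort(-variances)]

# column 0 varies while column 1 is constant, so column 0 ranks first
X = np.array([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
print(variance_ranking(X))  # [0, 1]
```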
The calculation formula of the information gain is as follows:
IGain(X,Y)=E(X)-E(X|Y)
E(X) is the information entropy, i.e. the amount of information needed to classify a sample when only the target is considered, and E(X|Y) is the conditional entropy, i.e. the entropy of X under condition Y.
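The information gain IGain(X, Y) = E(X) - E(X|Y) can be computed directly from the standard entropy definitions (a pure-Python sketch; the function names are ours):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """E(X): Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(target, feature):
    """IGain(X, Y) = E(X) - E(X|Y), where X is the target and Y the feature."""
    n = len(target)
    cond = 0.0
    for v in set(feature):  # E(X|Y): entropy of X within each feature value, weighted
        subset = [t for t, f in zip(target, feature) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(target) - cond

# a feature that fully determines the target recovers the whole entropy E(X)
print(info_gain([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
```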
Step 1-2, distributing ranks for the sorted features, wherein the rank with the highest importance degree is 1, and the rest features are distributed in sequence;
Step 1-3: for each feature, sum the ranks obtained under the different ranking methods and sort the features by this sum in ascending order. The final feature ordering is: BI.RADS classification, elasticity score, size, age, axillary lymph node size, morphology, blood flow signal, axillary lymph node, resistance index, calcific foci and calcific foci. The ranking results are shown in Table 1.
TABLE 1
(Table 1, the fused feature ranking, appears only as an image in the original document and is not reproduced here.)
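Steps 1-1 to 1-3 amount to a rank fusion, which can be sketched as follows (illustrative; assumes each ranker returns the feature names ordered most important first):

```python
def fuse_rankings(rankings):
    """Assign rank 1, 2, ... within each ordering, sum the ranks per
    feature, and sort ascending: the smallest rank sum is most important."""
    totals = {}
    for order in rankings:
        for rank, feat in enumerate(order, start=1):
            totals[feat] = totals.get(feat, 0) + rank
    return sorted(totals, key=lambda f: totals[f])

# 'a' is ranked first by two of the three methods, so it wins overall
print(fuse_rankings([["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]]))
# ['a', 'b', 'c']
```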
Step 2: the process of deleting redundant features based on the Pearson correlation coefficient is as follows:
Fig. 2 shows the process of deleting redundant features based on the Pearson correlation coefficient, which specifically includes:
Step 2-1: compute the pairwise Pearson correlation coefficient between the sorted features;
the calculation process of the Pearson correlation coefficient is as follows:
ρ(X, Y) = Σ_{i=1..n} (X_i - X̄)(Y_i - Ȳ) / (n · δ_X · δ_Y)
δ is the standard deviation of the samples and n is the number of samples.
Step 2-2, setting the Pearson correlation coefficient threshold of the features to be 0.5;
Step 2-3: among the Pearson correlation coefficients between the features, the correlation between the elasticity score and the BI.RADS classification is the largest, exceeding the 0.5 threshold, and the two behave similarly with respect to the classification result; since the BI.RADS classification is also closely correlated with the other feature variables, the redundant elasticity score feature is deleted. The Pearson correlation coefficients between the features are shown in Table 2.
TABLE 2
(Table 2, the Pearson correlation coefficients between features, appears only as an image in the original document and is not reproduced here.)
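Steps 2-1 to 2-3 can be sketched as follows (a minimal illustration assuming NumPy; the rule of keeping the better-ranked feature of each highly correlated pair is our reading of the procedure, and the function names are ours):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: sum((x_i - x̄)(y_i - ȳ)) / (n * δx * δy)."""
    xm, ym = x - x.mean(), y - y.mean()
    return float(np.sum(xm * ym) / (len(x) * x.std() * y.std()))

def drop_redundant(X, ranked, threshold=0.5):
    """Walk features in rank order; drop any feature whose |r| with an
    already-kept, better-ranked feature exceeds the threshold."""
    kept = []
    for i in ranked:
        if all(abs(pearson(X[:, i], X[:, j])) <= threshold for j in kept):
            kept.append(i)
    return kept

# column 2 is an exact multiple of column 0 (r = 1), so it is deleted
X = np.array([[1.0, 1.0, 2.0],
              [2.0, -1.0, 4.0],
              [3.0, 1.0, 6.0],
              [4.0, -1.0, 8.0]])
print(drop_redundant(X, [0, 1, 2]))  # [0, 1]
```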
Step 3: the process of combining the wrapper-based sequential forward selection method with a random forest model is as follows:
Fig. 3 shows the flow of combining the wrapper-based sequential forward selection method with the random forest model, which specifically includes:
Step 3-1: use the ranks of the features remaining after redundant-feature deletion as the evaluation function of the sequential forward selection method;
Step 3-2: first add BI.RADS classification, the feature with the smallest rank, to the feature set;
Step 3-3: build a training model with the random forest algorithm and record the model accuracy obtained with only the BI.RADS classification feature;
Step 3-4: add the remaining feature variables in ascending order of rank, recording the model accuracy after each feature is added;
Step 3-5: compare the resulting model accuracies; the feature subset with the highest accuracy (BI.RADS classification, size, age, axillary lymph node size, morphology, axillary lymph node size, blood flow signal, axillary lymph node) is the optimal feature subset found.
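Steps 3-1 to 3-5 can be sketched as a wrapper-style sequential forward selection around a random forest (illustrative, assuming scikit-learn is available; cross-validated accuracy stands in for the model accuracy used in the embodiment):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def forward_select(X, y, ranked, cv=3):
    """Add features in ascending rank order, record the model accuracy
    after each addition, and return the best-scoring prefix of features."""
    best_acc, best_subset, current = -1.0, [], []
    for i in ranked:
        current.append(i)
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        acc = cross_val_score(clf, X[:, current], y, cv=cv).mean()
        if acc > best_acc:
            best_acc, best_subset = acc, list(current)
    return best_subset, best_acc

# demo: only column 0 carries signal, columns 1-2 are noise
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = (X[:, 0] > 0).astype(int)
subset, acc = forward_select(X, y, [0, 1, 2])
```

Because accuracy is recorded after every addition, the returned prefix is exactly the "highest-accuracy feature subset" of step 3-5.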
With the feature selection method based on filtering and packaging type hierarchical progression described above, the optimal feature subset can be selected for a data set, providing accurate feature information for subsequent learning models and improving model accuracy.
The above is a detailed description of embodiments of the present invention with reference to the accompanying drawings, provided to facilitate understanding of the invention.

Claims (4)

1. A feature selection method based on filter and wrapper hierarchical progression, the method comprising the steps of:
Step 1: rank the features using a filter-based variance ranking method, an information-gain ranking method and a wrapper-based Boruta ranking method; assign ranks to the sorted features according to their importance; and fuse the results of the three ranking methods.
Step 2: compute the pairwise correlation between features using the Pearson correlation coefficient, set a threshold on the coefficient, and selectively delete features according to the correlations between them.
Step 3: find the best feature combination by combining a wrapper-based sequential forward selection method with a random forest model, thereby obtaining the optimal feature subset.
2. The method according to claim 1, wherein in step 1 the features are ranked by the filter-based variance ranking method, the information-gain ranking method and the wrapper-based Boruta ranking method through the following specific steps:
Step 1-1: rank the features in descending order of importance with each of the variance ranking method, the information-gain method and the Boruta ranking method;
Step 1-2: assign ranks to the sorted features, the most important feature receiving rank 1 and the remaining features being assigned ranks in sequence;
Step 1-3: for each feature, sum the ranks obtained under the different ranking methods and sort the features by this sum in ascending order to obtain the final feature ordering.
3. The method according to claim 1, wherein in step 2 redundant features are deleted based on the Pearson correlation coefficient through the following specific steps:
Step 2-1: compute the pairwise Pearson correlation coefficient between the sorted features;
Step 2-2: set a threshold on the Pearson correlation coefficient;
Step 2-3: selectively delete features according to the Pearson correlation coefficients between them.
4. The method according to claim 1, wherein in step 3 the best feature combination is found by combining a wrapper-based sequential forward selection method with a random forest model, thereby obtaining the optimal feature subset, through the following specific steps:
Step 3-1: use the ranks of the features remaining after redundant-feature deletion as the evaluation function of the sequential forward selection method;
Step 3-2: add the features to the feature set in ascending order of rank;
Step 3-3: build a training model with the random forest algorithm and find the optimal feature subset by comparing the prediction accuracies of the models.
CN202110589440.5A 2021-05-28 2021-05-28 Feature selection method based on filtering and packaging type hierarchy progression Pending CN113239321A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110589440.5A CN113239321A (en) 2021-05-28 2021-05-28 Feature selection method based on filtering and packaging type hierarchy progression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110589440.5A CN113239321A (en) 2021-05-28 2021-05-28 Feature selection method based on filtering and packaging type hierarchy progression

Publications (1)

Publication Number Publication Date
CN113239321A true CN113239321A (en) 2021-08-10

Family

ID=77135502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110589440.5A Pending CN113239321A (en) 2021-05-28 2021-05-28 Feature selection method based on filtering and packaging type hierarchy progression

Country Status (1)

Country Link
CN (1) CN113239321A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114881181A (en) * 2022-07-12 2022-08-09 南昌大学第一附属医院 Feature weighting selection method, system, medium and computer based on big data
CN115409134A (en) * 2022-11-02 2022-11-29 湖南一二三智能科技有限公司 User electricity utilization safety detection method, system, equipment and storage medium
CN116561554A (en) * 2023-04-18 2023-08-08 南方电网电力科技股份有限公司 Feature extraction method, system, equipment and medium of boiler soot blower


Similar Documents

Publication Publication Date Title
CN113239321A (en) Feature selection method based on filtering and packaging type hierarchy progression
CN111460956B (en) Unbalanced electrocardiogram sample classification method based on data enhancement and loss weighting
CN107392919B (en) Adaptive genetic algorithm-based gray threshold acquisition method and image segmentation method
CN110866997A (en) Novel method for constructing running condition of electric automobile
CN112541532B (en) Target detection method based on dense connection structure
CN110942091A (en) Semi-supervised few-sample image classification method for searching reliable abnormal data center
CN110826618A (en) Personal credit risk assessment method based on random forest
CN108280236A (en) A kind of random forest visualization data analysing method based on LargeVis
CN105512175A (en) Quick image retrieval method based on color features and texture characteristics
CN113283473B (en) CNN feature mapping pruning-based rapid underwater target identification method
CN109669210A (en) Favorable method based on a variety of seismic properties interpretational criterias
CN112036476A (en) Data feature selection method and device based on two-classification service and computer equipment
CN111723897A (en) Multi-modal feature selection method based on particle swarm optimization
CN115147632A (en) Image category automatic labeling method and device based on density peak value clustering algorithm
CN112215268A (en) Method and device for classifying disaster weather satellite cloud pictures
CN110909785A (en) Multitask Triplet loss function learning method based on semantic hierarchy
CN107194468A (en) Towards the decision tree Increment Learning Algorithm of information big data
CN113033345A (en) V2V video face recognition method based on public feature subspace
CN111914930A (en) Density peak value clustering method based on self-adaptive micro-cluster fusion
CN116662832A (en) Training sample selection method based on clustering and active learning
CN110009024A (en) A kind of data classification method based on ID3 algorithm
KR101085066B1 (en) An Associative Classification Method for detecting useful knowledge from huge multi-attributes dataset
CN116089142A (en) Novel service fault root cause analysis method
CN112560900B (en) Multi-disease classifier design method for sample imbalance
Yarramalle et al. Unsupervised image segmentation using finite doubly truncated Gaussian mixture model and hierarchical clustering

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210810