CN113239321A - Feature selection method based on filtering and packaging type hierarchy progression - Google Patents
- Publication number
- CN113239321A CN113239321A CN202110589440.5A CN202110589440A CN113239321A CN 113239321 A CN113239321 A CN 113239321A CN 202110589440 A CN202110589440 A CN 202110589440A CN 113239321 A CN113239321 A CN 113239321A
- Authority
- CN
- China
- Prior art keywords
- features
- sorting
- feature
- filtering
- correlation coefficient
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/15—Correlation function computation including computation of convolution operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
Abstract
The invention relates to a feature selection method based on filter and wrapper hierarchical progression. The method first ranks the features with a filter variance ranking method, a filter information gain ranking method, and a wrapper Boruta ranking method, assigns each ranked feature a rank according to its importance, and fuses the results of the three ranking methods. It then computes the pairwise correlation between features with the Pearson correlation coefficient, sets a Pearson correlation coefficient threshold, and selectively deletes features according to their correlations. Finally, a wrapper sequential forward selection method combined with a random forest model finds the best feature combination, yielding an optimal feature subset. The method selects a good feature subset for a data set and provides more accurate feature information to the learning model, thereby improving the model's accuracy.
Description
Technical field:
The invention relates to a feature selection method based on filter and wrapper hierarchical progression, which is well suited to feature selection for data sets.
Background art:
Data sets contain large numbers of redundant and irrelevant features, which pose great challenges to data mining and seriously affect the accuracy and scientific validity of its results; irrelevant and redundant features should therefore be removed from a data set before mining.
Feature selection, also called feature subset selection, removes irrelevant or redundant features from the original feature set, reduces the number of features, finds an optimal feature subset, improves model accuracy, and reduces running time. Feature selection methods can be divided into filter methods and wrapper methods according to whether they are independent of the subsequent learning algorithm. A filter method is independent of the learning algorithm and generally evaluates features directly from the statistical properties of the training data; it is fast, but its evaluation can deviate considerably from the performance of the learning algorithm. A wrapper method evaluates feature subsets by the training accuracy of the learning algorithm; its deviation is small, but its computational cost is large, making it unsuitable for large data sets. The two approaches are complementary and can be combined: a feature selection method based on filter and wrapper hierarchical progression can reduce irrelevant features and delete redundant features, lowering the consumption of computing resources, shortening training time, and improving model performance.
Summary of the invention:
To solve the problem of irrelevant and redundant features in data sets, the invention discloses a feature selection method based on filter and wrapper hierarchical progression.
Therefore, the invention provides the following technical scheme:
1. a method for feature selection based on filter and package level progression, the method comprising the steps of:
step 1: and (3) sorting the features based on a filtering type variance sorting method, an information gain sorting method and an encapsulation type Boruta sorting method, distributing ranks to the sorted features according to the importance degree, and fusing the results of the three sorting methods.
Step 2: and calculating the correlation between every two characteristics based on the Pearson correlation coefficient, setting a Pearson correlation coefficient threshold value of the characteristics, and selectively deleting partial characteristics according to the correlation between the characteristics.
And step 3: and finding the best feature combination by combining a sequence forward selection method based on an encapsulation type with a random forest model so as to obtain the optimal feature subset.
2. The method according to claim 1, wherein in step 1 the features are ranked by the filter variance ranking method, the information gain ranking method, and the wrapper Boruta ranking method, comprising the following specific steps:
Step 1-1: rank feature importance from largest to smallest with the filter variance ranking method, the information gain method, and the wrapper Boruta ranking method;
Step 1-2: assign ranks to the sorted features, the most important feature receiving rank 1 and the remaining features being ranked in order;
Step 1-3: for each feature, add the ranks obtained under the different ranking methods, then sort the features by this sum from smallest to largest to obtain the final ranking.
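Steps 1-1 to 1-3 can be sketched as follows, assuming the three per-method orderings are already available as lists from most to least important (the helper names and feature names are illustrative, not from the patent):

```python
def assign_ranks(ordering):
    # Rank 1 goes to the most important feature; the rest follow in order.
    return {feat: rank for rank, feat in enumerate(ordering, start=1)}

def fuse_rankings(*orderings):
    # Sum each feature's rank across the ranking methods, then sort the
    # features by the summed rank from smallest to largest.
    totals = {}
    for ordering in orderings:
        for feat, rank in assign_ranks(ordering).items():
            totals[feat] = totals.get(feat, 0) + rank
    return sorted(totals, key=lambda f: totals[f])

# Hypothetical outputs of the variance, information gain, and Boruta rankers:
variance_order = ["size", "age", "morphology"]
info_gain_order = ["age", "size", "morphology"]
boruta_order = ["size", "morphology", "age"]

final_order = fuse_rankings(variance_order, info_gain_order, boruta_order)
print(final_order)  # rank sums: size 1+2+1=4, age 2+1+3=6, morphology 3+3+2=8
```

A smaller rank sum means a feature was consistently ranked near the top by the three methods, so it appears earlier in the fused ordering.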
3. The feature selection method based on filter and wrapper hierarchical progression according to claim 1, wherein in step 2 redundant features are deleted based on the Pearson correlation coefficient, comprising the following specific steps:
Step 2-1: compute the pairwise correlation between the ranked features with the Pearson correlation coefficient;
Step 2-2: set a Pearson correlation coefficient threshold for the features;
Step 2-3: selectively delete features according to the Pearson correlation coefficients between them.
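One reading of steps 2-1 to 2-3 is sketched below: when a feature pair exceeds the threshold, the lower-ranked (less important) feature of the pair is deleted. The patent does not fix which feature of a correlated pair to drop, so this tie-breaking rule, and the helper names, are assumptions:

```python
from itertools import combinations

def drop_redundant(ranked_features, corr, threshold=0.5):
    """ranked_features: list ordered best-first (fused ranking from step 1).
    corr: dict mapping frozenset pairs of feature names to |Pearson r|."""
    kept = list(ranked_features)
    for a, b in combinations(ranked_features, 2):
        # a precedes b in a best-first list, so b is the lower-ranked feature.
        if corr.get(frozenset((a, b)), 0.0) > threshold and a in kept and b in kept:
            kept.remove(b)
    return kept

# Made-up ranking and pairwise |r| values for illustration:
ranking = ["bi_rads", "elasticity", "size"]
corr = {frozenset(("bi_rads", "elasticity")): 0.8,
        frozenset(("bi_rads", "size")): 0.2,
        frozenset(("elasticity", "size")): 0.3}
print(drop_redundant(ranking, corr))  # elasticity exceeds 0.5 and is dropped
```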
4. The feature selection method based on filter and wrapper hierarchical progression according to claim 1, wherein in step 3 the best feature combination is found by combining a wrapper sequential forward selection method with a random forest model, thereby obtaining an optimal feature subset, comprising the following specific steps:
Step 3-1: the ranks of the features remaining after redundant features are deleted serve as the evaluation function of the sequential forward selection method;
Step 3-2: add the features to the feature set in order of increasing rank;
Step 3-3: build a training model with the random forest algorithm and find the optimal feature subset by comparing the prediction accuracy of the models.
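The forward selection loop above can be sketched as follows. A real implementation would score each candidate subset with a trained random forest (for example scikit-learn's RandomForestClassifier); here the model is abstracted into a caller-supplied `score` function so the sketch stays self-contained, and the scorer values are made up:

```python
def sequential_forward_selection(ranked_features, score):
    """Add features in rank order (best first) and keep the prefix whose
    score - e.g. random forest prediction accuracy - is highest."""
    best_subset, best_score = [], float("-inf")
    subset = []
    for feat in ranked_features:
        subset.append(feat)
        s = score(subset)
        if s > best_score:
            best_subset, best_score = list(subset), s
    return best_subset, best_score

# Toy stand-in scorer: accuracy peaks once the first two features are in.
toy_scores = {1: 0.80, 2: 0.91, 3: 0.88}
subset, acc = sequential_forward_selection(
    ["bi_rads", "size", "age"], lambda s: toy_scores[len(s)])
print(subset, acc)  # ['bi_rads', 'size'] 0.91
```

Note that because the rank serves as the evaluation function, the candidate order is fixed in advance; a classical sequential forward selection would instead re-evaluate every remaining feature at each round.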
Advantageous effects:
1. The invention discloses a filter and wrapper hierarchically progressive feature selection method, a novel method for selecting features of a data set.
2. The method compensates for the large deviation between a filter method's evaluation and the learning algorithm's performance, since the wrapper stage evaluates feature subsets well.
3. The invention addresses the problem that wrapper methods are unsuitable for large data sets; it is applicable to both small and large data sets.
4. The filter and wrapper methods are combined so that each compensates for the other's deficiencies; redundant features are deleted with the Pearson correlation coefficient, reducing the consumption of computing resources and improving model performance.
Description of the drawings:
Fig. 1 is a flowchart of the feature selection method based on filter and wrapper hierarchical progression according to an embodiment of the invention.
Fig. 2 is a flowchart of removing redundant features based on the Pearson correlation coefficient in an embodiment of the invention.
Fig. 3 is a flowchart of combining the wrapper sequential forward selection method with a random forest model according to an embodiment of the invention.
Detailed description of embodiments:
To describe the technical solutions in the embodiments of the invention clearly and completely, the invention is further described in detail below with reference to the drawings of the embodiments.
Taking feature selection on a breast cancer data set collected by a tumor hospital as an example, an embodiment of the invention carries out the feature selection method based on filter and wrapper hierarchical progression as shown in fig. 1, comprising the following steps.
Step 1: the features are ranked with the filter variance ranking method, the information gain method, and the wrapper Boruta ranking method, as follows:
Step 1-1: rank feature importance from largest to smallest with the variance ranking method (a filter method), the information gain method, and the wrapper Boruta ranking method;
the formula for calculating the variance value is as follows:
X1...Xnthe different eigenvalue values of the characteristic X, and M is the average value of the characteristic.
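The variance ranking can be illustrated by computing this formula directly (a sketch; the feature values are made up):

```python
def variance(values):
    # S^2 = (1/n) * sum((x - M)^2), with M the feature mean
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / n

features = {"size": [1.0, 3.0, 5.0], "age": [2.0, 2.0, 2.0]}
# Rank the features by variance, largest (most informative) first.
order = sorted(features, key=lambda f: variance(features[f]), reverse=True)
print(order)  # variance(size) = 8/3, variance(age) = 0
```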
The calculation formula of the information gain is as follows:
IGain(X,Y)=E(X)-E(X|Y)
E(X) is the information entropy, i.e., the amount of information obtained when classifying the samples by the target feature alone, and E(X|Y) is the conditional entropy, i.e., the information entropy of X under condition Y.
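A worked sketch of IGain(X, Y) = E(X) - E(X|Y) on a made-up binary sample (names are illustrative):

```python
from math import log2
from collections import Counter

def entropy(labels):
    # E(X) = -sum(p * log2(p)) over the class proportions p
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(labels, feature_values):
    # E(X) - E(X|Y): entropy of the target minus the entropy remaining
    # after splitting the samples on the candidate feature's values.
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [l for l, f in zip(labels, feature_values) if f == v]
        cond += (len(subset) / n) * entropy(subset)
    return entropy(labels) - cond

labels = [1, 1, 0, 0]           # target classes
feature = ["a", "a", "b", "b"]  # a feature that separates the classes perfectly
print(info_gain(labels, feature))  # 1.0: the split removes all uncertainty
```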
Step 1-2, distributing ranks for the sorted features, wherein the rank with the highest importance degree is 1, and the rest features are distributed in sequence;
step 1-3, sequentially adding the ranks of each feature obtained in different ranking methods, and ranking the features in a sequence from small to large to obtain the final ranking sequence of the features, i.e., BI.RADS classification, elasticity score, size, age, axillary lymph node size, morphology, blood flow signal, axillary lymph node, resistance index, calcific foci and calcific foci, wherein the ranking results are shown in Table 1.
TABLE 1
Step 2: the process of deleting redundant features based on the Pearson correlation coefficient is as follows:
Fig. 2 shows the process of deleting redundant features based on the Pearson correlation coefficient, which specifically includes:
Step 2-1: compute the pairwise correlation between the ranked features with the Pearson correlation coefficient;
the calculation process of the Pearson correlation coefficient is as follows:
δ is the standard deviation of the samples and n is the number of samples.
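The coefficient can be checked numerically against this formula (population standard deviations, matching the 1/n form; the sample values are made up):

```python
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Population standard deviations (the delta terms in the formula).
    dx = sqrt(sum((v - mx) ** 2 for v in x) / n)
    dy = sqrt(sum((v - my) ** 2 for v in y) / n)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n * dx * dy)

print(pearson([1, 2, 3], [2, 4, 6]))   # 1.0: perfectly linearly correlated
print(pearson([1, 2, 3], [6, 4, 2]))   # -1.0: perfectly anti-correlated
```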
Step 2-2: set the Pearson correlation coefficient threshold of the features to 0.5;
Step 2-3: among the Pearson correlation coefficients between features, the correlation between the elasticity score and the BI.RADS classification is the largest and exceeds the 0.5 threshold; the two behave similarly in the classification result, and the BI.RADS classification is more closely correlated with the other feature variables, so the redundant feature elasticity score is deleted. The Pearson correlation coefficients between the features are shown in Table 2.
TABLE 2
Step 3: the wrapper sequential forward selection method is combined with a random forest model as follows:
The flow of combining the wrapper sequential forward selection method with a random forest model is shown in fig. 3, and specifically includes:
Step 3-1: the ranks of the features remaining after redundant features are deleted serve as the evaluation function of the sequential forward selection method;
Step 3-2: add the feature with the smallest rank, BI.RADS classification, to the feature set;
Step 3-3: build a training model with the random forest algorithm and record the model accuracy after adding BI.RADS classification;
Step 3-4: add the remaining feature variables in order of increasing rank, recording the model accuracy after each feature is added;
Step 3-5: compare the resulting model accuracies. The feature subset with the highest accuracy is BI.RADS classification, size, age, axillary lymph node size, morphology, blood flow signal, and axillary lymph node; this subset is the optimal feature subset found.
With the feature selection method based on filter and wrapper hierarchical progression, an optimal feature subset can be selected for the data set, providing accurate feature information for subsequent learning and modeling and improving model accuracy.
The above describes embodiments of the invention in detail with reference to the accompanying drawings; the detailed description is intended to facilitate understanding of the invention.
Claims (4)
1. A feature selection method based on filter and wrapper hierarchical progression, the method comprising the following steps:
Step 1: rank the features with a filter variance ranking method, a filter information gain ranking method, and a wrapper Boruta ranking method; assign each ranked feature a rank according to its importance; and fuse the results of the three ranking methods.
Step 2: compute the pairwise correlation between features with the Pearson correlation coefficient, set a Pearson correlation coefficient threshold, and selectively delete features according to their correlations.
Step 3: find the best feature combination by combining a wrapper sequential forward selection method with a random forest model, thereby obtaining the optimal feature subset.
2. The method according to claim 1, wherein in step 1 the features are ranked by the filter variance ranking method, the information gain ranking method, and the wrapper Boruta ranking method, comprising the following specific steps:
Step 1-1: rank feature importance from largest to smallest with the filter variance ranking method, the information gain method, and the wrapper Boruta ranking method;
Step 1-2: assign ranks to the sorted features, the most important feature receiving rank 1 and the remaining features being ranked in order;
Step 1-3: for each feature, add the ranks obtained under the different ranking methods, then sort the features by this sum from smallest to largest to obtain the final ranking.
3. The feature selection method based on filter and wrapper hierarchical progression according to claim 1, wherein in step 2 redundant features are deleted based on the Pearson correlation coefficient, comprising the following specific steps:
Step 2-1: compute the pairwise correlation between the ranked features with the Pearson correlation coefficient;
Step 2-2: set a Pearson correlation coefficient threshold for the features;
Step 2-3: selectively delete features according to the Pearson correlation coefficients between them.
4. The feature selection method based on filter and wrapper hierarchical progression according to claim 1, wherein in step 3 the best feature combination is found by combining a wrapper sequential forward selection method with a random forest model, thereby obtaining an optimal feature subset, comprising the following specific steps:
Step 3-1: the ranks of the features remaining after redundant features are deleted serve as the evaluation function of the sequential forward selection method;
Step 3-2: add the features to the feature set in order of increasing rank;
Step 3-3: build a training model with the random forest algorithm and find the optimal feature subset by comparing the prediction accuracy of the models.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110589440.5A CN113239321A (en) | 2021-05-28 | 2021-05-28 | Feature selection method based on filtering and packaging type hierarchy progression |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113239321A true CN113239321A (en) | 2021-08-10 |
Family
ID=77135502
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110589440.5A Pending CN113239321A (en) | 2021-05-28 | 2021-05-28 | Feature selection method based on filtering and packaging type hierarchy progression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113239321A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114881181A (en) * | 2022-07-12 | 2022-08-09 | 南昌大学第一附属医院 | Feature weighting selection method, system, medium and computer based on big data |
CN115409134A (en) * | 2022-11-02 | 2022-11-29 | 湖南一二三智能科技有限公司 | User electricity utilization safety detection method, system, equipment and storage medium |
CN116561554A (en) * | 2023-04-18 | 2023-08-08 | 南方电网电力科技股份有限公司 | Feature extraction method, system, equipment and medium of boiler soot blower |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20210810 |