CN113239321A - Feature selection method based on filtering and packaging type hierarchy progression - Google Patents


Info

Publication number
CN113239321A
CN113239321A (application CN202110589440.5A)
Authority
CN
China
Prior art keywords
features
sorting
feature
filtering
correlation coefficient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110589440.5A
Other languages
Chinese (zh)
Inventor
李思琪
苗世迪
胡晓慧
王瑞涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202110589440.5A priority Critical patent/CN113239321A/en
Publication of CN113239321A publication Critical patent/CN113239321A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques

Abstract

The invention relates to a feature selection method based on filtering and packaging type hierarchy progression. The method first ranks the features using a filter-based variance ranking method, an information-gain ranking method and a wrapper-based Boruta ranking method, assigns each feature a rank according to its importance, and fuses the results of the three ranking methods. It then computes the pairwise correlation between features using the Pearson correlation coefficient, sets a threshold on the coefficient, and selectively deletes features according to the correlations between them. Finally, a wrapper-based sequential forward selection method combined with a random forest model finds the best feature combination, yielding the optimal feature subset. The method is effective at selecting an optimal feature subset for a data set and supplies more accurate feature information to the learning model, thereby improving the model's accuracy.

Description

Feature selection method based on filtering and packaging type hierarchy progression
The technical field is as follows:
The invention relates to a feature selection method based on filter and wrapper hierarchical progression, and is well suited to feature selection on data sets.
Background art:
Data sets contain large numbers of redundant and irrelevant features, which pose great challenges to data mining and seriously affect the accuracy and scientific validity of its results; irrelevant and redundant features should therefore be removed from a data set before mining.
Feature selection, also called feature subset selection, removes irrelevant or redundant features from the original feature set, reduces the number of features, finds an optimal feature subset, improves model accuracy and reduces running time. Feature selection methods can be divided into filter-based and wrapper-based methods according to whether they are independent of the subsequent learning algorithm. Filter methods are independent of the learning algorithm and generally evaluate features directly from statistics of the training data; they are fast, but their evaluations can deviate substantially from the learning algorithm's actual performance. Wrapper methods evaluate feature subsets by the training accuracy of the learning algorithm; their deviation is small, but the computational cost is large, making them unsuitable for large data sets. The two approaches are complementary and can be combined: a feature selection method based on filter and wrapper hierarchical progression can reduce irrelevant features and delete redundant ones, lowering the consumption of computing resources, shortening training time and improving model performance.
The invention content is as follows:
in order to solve the problem of irrelevant or redundant features of the data set, the invention discloses a feature selection method based on filtering and packaging type hierarchical progression.
Therefore, the invention provides the following technical scheme:
1. a method for feature selection based on filter and package level progression, the method comprising the steps of:
step 1: and (3) sorting the features based on a filtering type variance sorting method, an information gain sorting method and an encapsulation type Boruta sorting method, distributing ranks to the sorted features according to the importance degree, and fusing the results of the three sorting methods.
Step 2: and calculating the correlation between every two characteristics based on the Pearson correlation coefficient, setting a Pearson correlation coefficient threshold value of the characteristics, and selectively deleting partial characteristics according to the correlation between the characteristics.
And step 3: and finding the best feature combination by combining a sequence forward selection method based on an encapsulation type with a random forest model so as to obtain the optimal feature subset.
2. The method according to claim 1, wherein in step 1 the features are ranked by the filter-based variance ranking method, the information-gain ranking method and the wrapper-based Boruta ranking method through the following specific steps:
Step 1-1: rank the features in descending order of importance with each of the variance ranking method, the information-gain method and the Boruta ranking method;
Step 1-2: assign ranks to the sorted features, the most important feature receiving rank 1 and the remaining features being assigned ranks in sequence;
Step 1-3: for each feature, sum the ranks obtained under the different ranking methods and sort the features by this sum in ascending order to obtain the final feature ordering.
3. The method according to claim 1, wherein in step 2 redundant features are deleted based on the Pearson correlation coefficient through the following specific steps:
Step 2-1: compute the pairwise Pearson correlation coefficient between the sorted features;
Step 2-2: set a threshold on the Pearson correlation coefficient;
Step 2-3: selectively delete features according to the Pearson correlation coefficients between them.
4. The method according to claim 1, wherein in step 3 the best feature combination is found by combining a wrapper-based sequential forward selection method with a random forest model, thereby obtaining the optimal feature subset, through the following specific steps:
Step 3-1: use the ranks of the features remaining after redundant-feature deletion as the evaluation function of the sequential forward selection method;
Step 3-2: add the features to the feature set in ascending order of rank;
Step 3-3: build a training model with the random forest algorithm and find the optimal feature subset by comparing the prediction accuracies of the models.
Advantages:
1. The invention discloses a filter and wrapper hierarchically progressive feature selection method, a novel method for feature selection on data sets.
2. The method compensates for the large deviation between filter-method evaluations and the learning algorithm's actual performance; the wrapper stage provides an accurate evaluation.
3. The invention overcomes the wrapper approach's unsuitability for large data, and is applicable to small and large data sets alike.
4. The filter and wrapper approaches are combined to make up for each other's deficiencies, and the Pearson correlation coefficient is used to delete redundant features, reducing the consumption of computing resources and improving model performance.
Description of the drawings:
fig. 1 is a flowchart of a feature selection method based on filtering and packaging hierarchy progression according to an embodiment of the present invention.
Fig. 2 is a flowchart of removing redundant features based on the Pearson correlation coefficient in an embodiment of the present invention.
Fig. 3 is a flowchart of combining a random forest model with a wrapper-based sequential forward selection method in an embodiment of the present invention.
Detailed description of embodiments:
In order to describe the technical solutions in the embodiments of the present invention clearly and completely, the invention is further described in detail below with reference to the drawings.
Taking feature selection of a breast cancer data set collected by a tumor hospital as an example, the embodiment of the present invention provides a flow of a feature selection method based on filtering and packaging type hierarchy progression, as shown in fig. 1, including the following steps.
Step 1: the features are ranked by the filter-based variance ranking method, the information-gain method and the wrapper-based Boruta ranking method as follows:
Step 1-1: rank the features in descending order of importance using, respectively, the filter-based variance ranking method, the information-gain method and the wrapper-based Boruta ranking method;
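Of the three rankers named in step 1-1, the wrapper-based Boruta method is the least self-explanatory. The sketch below is an illustrative simplification (assuming NumPy and scikit-learn are available; the real Boruta algorithm adds statistical accept/reject tests over many iterations): each feature is scored by how often its random-forest importance beats the best shuffled "shadow" copy.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def boruta_rank(X, y, n_rounds=5):
    """Simplified Boruta-style ranking: per round, append shuffled
    'shadow' copies of all features, fit a random forest, and count how
    often each real feature's importance exceeds the best shadow's."""
    rng = np.random.default_rng(0)
    hits = np.zeros(X.shape[1], dtype=int)
    for r in range(n_rounds):
        shadows = rng.permuted(X, axis=0)  # shuffle each column independently
        clf = RandomForestClassifier(n_estimators=50, random_state=r)
        clf.fit(np.hstack([X, shadows]), y)
        imp = clf.feature_importances_
        real, shadow = imp[:X.shape[1]], imp[X.shape[1]:]
        hits += real > shadow.max()
    return [int(i) for i in np.argsort(-hits)]  # most important first

# demo: only column 0 determines the label, columns 1-2 are noise
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
ranking = boruta_rank(X, y)
```

The hit counts then serve the same purpose as any other importance score: features are sorted by them to produce the Boruta ordering.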
the formula for calculating the variance value is as follows:
S^2 = (1/n) * Σ_{i=1..n} (X_i - M)^2
where X_1, ..., X_n are the n values taken by feature X and M is the mean of the feature.
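The variance ranking can be sketched as follows (a minimal illustration assuming the features are the columns of a NumPy array; the function name is ours):

```python
import numpy as np

def variance_ranking(X):
    """Rank features by variance S^2 = (1/n) * sum((X_i - M)^2),
    largest variance first. X: (n_samples, n_features) array."""
    variances = np.mean((X - X.mean(axis=0)) ** 2, axis=0)
    return [int(i) for i in np.argsort(-variances)]

# column 0 varies while column 1 is constant, so column 0 ranks first
X = np.array([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
print(variance_ranking(X))  # [0, 1]
```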
The calculation formula of the information gain is as follows:
IGain(X,Y)=E(X)-E(X|Y)
E(X) is the information entropy, i.e. the amount of information needed to classify a sample when only the target is considered, and E(X|Y) is the conditional entropy, i.e. the entropy of X under condition Y.
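The information gain IGain(X, Y) = E(X) - E(X|Y) can be computed directly from the standard entropy definitions (a pure-Python sketch; the function names are ours):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """E(X): Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(target, feature):
    """IGain(X, Y) = E(X) - E(X|Y), where X is the target and Y the feature."""
    n = len(target)
    cond = 0.0
    for v in set(feature):  # E(X|Y): entropy of X within each feature value, weighted
        subset = [t for t, f in zip(target, feature) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(target) - cond

# a feature that fully determines the target recovers the whole entropy E(X)
print(info_gain([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
```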
Step 1-2, distributing ranks for the sorted features, wherein the rank with the highest importance degree is 1, and the rest features are distributed in sequence;
Step 1-3: for each feature, sum the ranks obtained under the different ranking methods and sort the features by this sum in ascending order. The final feature ordering is: BI.RADS classification, elasticity score, size, age, axillary lymph node size, morphology, blood flow signal, axillary lymph node, resistance index, calcific foci and calcific foci. The ranking results are shown in Table 1.
TABLE 1
(Table 1, the fused feature ranking, appears only as an image in the original document and is not reproduced here.)
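Steps 1-1 to 1-3 amount to a rank fusion, which can be sketched as follows (illustrative; assumes each ranker returns the feature names ordered most important first):

```python
def fuse_rankings(rankings):
    """Assign rank 1, 2, ... within each ordering, sum the ranks per
    feature, and sort ascending: the smallest rank sum is most important."""
    totals = {}
    for order in rankings:
        for rank, feat in enumerate(order, start=1):
            totals[feat] = totals.get(feat, 0) + rank
    return sorted(totals, key=lambda f: totals[f])

# 'a' is ranked first by two of the three methods, so it wins overall
print(fuse_rankings([["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]]))
# ['a', 'b', 'c']
```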
Step 2: the process of deleting redundant features based on the Pearson correlation coefficient is as follows:
Fig. 2 shows the process of deleting redundant features based on the Pearson correlation coefficient, which specifically includes:
Step 2-1: compute the pairwise Pearson correlation coefficient between the sorted features;
the calculation process of the Pearson correlation coefficient is as follows:
ρ(X, Y) = Σ_{i=1..n} (X_i - X̄)(Y_i - Ȳ) / (n · δ_X · δ_Y)
δ is the standard deviation of the samples and n is the number of samples.
Step 2-2, setting the Pearson correlation coefficient threshold of the features to be 0.5;
Step 2-3: among the Pearson correlation coefficients between the features, the correlation between the elasticity score and the BI.RADS classification is the largest, exceeding the 0.5 threshold, and the two behave similarly with respect to the classification result; since the BI.RADS classification is also closely correlated with the other feature variables, the redundant elasticity score feature is deleted. The Pearson correlation coefficients between the features are shown in Table 2.
TABLE 2
(Table 2, the Pearson correlation coefficients between features, appears only as an image in the original document and is not reproduced here.)
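Steps 2-1 to 2-3 can be sketched as follows (a minimal illustration assuming NumPy; the rule of keeping the better-ranked feature of each highly correlated pair is our reading of the procedure, and the function names are ours):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: sum((x_i - x̄)(y_i - ȳ)) / (n * δx * δy)."""
    xm, ym = x - x.mean(), y - y.mean()
    return float(np.sum(xm * ym) / (len(x) * x.std() * y.std()))

def drop_redundant(X, ranked, threshold=0.5):
    """Walk features in rank order; drop any feature whose |r| with an
    already-kept, better-ranked feature exceeds the threshold."""
    kept = []
    for i in ranked:
        if all(abs(pearson(X[:, i], X[:, j])) <= threshold for j in kept):
            kept.append(i)
    return kept

# column 2 is an exact multiple of column 0 (r = 1), so it is deleted
X = np.array([[1.0, 1.0, 2.0],
              [2.0, -1.0, 4.0],
              [3.0, 1.0, 6.0],
              [4.0, -1.0, 8.0]])
print(drop_redundant(X, [0, 1, 2]))  # [0, 1]
```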
Step 3: the process of combining the wrapper-based sequential forward selection method with a random forest model is as follows:
Fig. 3 shows the flow of combining the wrapper-based sequential forward selection method with the random forest model, which specifically includes:
Step 3-1: use the ranks of the features remaining after redundant-feature deletion as the evaluation function of the sequential forward selection method;
Step 3-2: first add BI.RADS classification, the feature with the smallest rank, to the feature set;
Step 3-3: build a training model with the random forest algorithm and record the model accuracy obtained with only the BI.RADS classification feature;
Step 3-4: add the remaining feature variables in ascending order of rank, recording the model accuracy after each feature is added;
Step 3-5: compare the resulting model accuracies; the feature subset with the highest accuracy (BI.RADS classification, size, age, axillary lymph node size, morphology, axillary lymph node size, blood flow signal, axillary lymph node) is the optimal feature subset found.
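Steps 3-1 to 3-5 can be sketched as a wrapper-style sequential forward selection around a random forest (illustrative, assuming scikit-learn is available; cross-validated accuracy stands in for the model accuracy used in the embodiment):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def forward_select(X, y, ranked, cv=3):
    """Add features in ascending rank order, record the model accuracy
    after each addition, and return the best-scoring prefix of features."""
    best_acc, best_subset, current = -1.0, [], []
    for i in ranked:
        current.append(i)
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        acc = cross_val_score(clf, X[:, current], y, cv=cv).mean()
        if acc > best_acc:
            best_acc, best_subset = acc, list(current)
    return best_subset, best_acc

# demo: only column 0 carries signal, columns 1-2 are noise
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = (X[:, 0] > 0).astype(int)
subset, acc = forward_select(X, y, [0, 1, 2])
```

Because accuracy is recorded after every addition, the returned prefix is exactly the "highest-accuracy feature subset" of step 3-5.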
With the feature selection method based on filtering and packaging type hierarchical progression described above, the optimal feature subset can be selected for a data set, providing accurate feature information for subsequent learning models and improving model accuracy.
The above is a detailed description of embodiments of the present invention with reference to the accompanying drawings, provided to facilitate understanding of the invention.

Claims (4)

1. A feature selection method based on filter and wrapper hierarchical progression, the method comprising the steps of:
Step 1: rank the features using a filter-based variance ranking method, an information-gain ranking method and a wrapper-based Boruta ranking method; assign ranks to the sorted features according to their importance; and fuse the results of the three ranking methods.
Step 2: compute the pairwise correlation between features using the Pearson correlation coefficient, set a threshold on the coefficient, and selectively delete features according to the correlations between them.
Step 3: find the best feature combination by combining a wrapper-based sequential forward selection method with a random forest model, thereby obtaining the optimal feature subset.
2. The method according to claim 1, wherein in step 1 the features are ranked by the filter-based variance ranking method, the information-gain ranking method and the wrapper-based Boruta ranking method through the following specific steps:
Step 1-1: rank the features in descending order of importance with each of the variance ranking method, the information-gain method and the Boruta ranking method;
Step 1-2: assign ranks to the sorted features, the most important feature receiving rank 1 and the remaining features being assigned ranks in sequence;
Step 1-3: for each feature, sum the ranks obtained under the different ranking methods and sort the features by this sum in ascending order to obtain the final feature ordering.
3. The method according to claim 1, wherein in step 2 redundant features are deleted based on the Pearson correlation coefficient through the following specific steps:
Step 2-1: compute the pairwise Pearson correlation coefficient between the sorted features;
Step 2-2: set a threshold on the Pearson correlation coefficient;
Step 2-3: selectively delete features according to the Pearson correlation coefficients between them.
4. The method according to claim 1, wherein in step 3 the best feature combination is found by combining a wrapper-based sequential forward selection method with a random forest model, thereby obtaining the optimal feature subset, through the following specific steps:
Step 3-1: use the ranks of the features remaining after redundant-feature deletion as the evaluation function of the sequential forward selection method;
Step 3-2: add the features to the feature set in ascending order of rank;
Step 3-3: build a training model with the random forest algorithm and find the optimal feature subset by comparing the prediction accuracies of the models.
CN202110589440.5A 2021-05-28 2021-05-28 Feature selection method based on filtering and packaging type hierarchy progression Pending CN113239321A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110589440.5A CN113239321A (en) 2021-05-28 2021-05-28 Feature selection method based on filtering and packaging type hierarchy progression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110589440.5A CN113239321A (en) 2021-05-28 2021-05-28 Feature selection method based on filtering and packaging type hierarchy progression

Publications (1)

Publication Number Publication Date
CN113239321A true CN113239321A (en) 2021-08-10

Family

ID=77135502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110589440.5A Pending CN113239321A (en) 2021-05-28 2021-05-28 Feature selection method based on filtering and packaging type hierarchy progression

Country Status (1)

Country Link
CN (1) CN113239321A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114881181A (en) * 2022-07-12 2022-08-09 南昌大学第一附属医院 Feature weighting selection method, system, medium and computer based on big data
CN115409134A (en) * 2022-11-02 2022-11-29 湖南一二三智能科技有限公司 User electricity utilization safety detection method, system, equipment and storage medium
CN116561554A (en) * 2023-04-18 2023-08-08 南方电网电力科技股份有限公司 Feature extraction method, system, equipment and medium of boiler soot blower


Similar Documents

Publication Publication Date Title
CN113239321A (en) Feature selection method based on filtering and packaging type hierarchy progression
CN111460956B (en) Unbalanced electrocardiogram sample classification method based on data enhancement and loss weighting
CN107392919B (en) Adaptive genetic algorithm-based gray threshold acquisition method and image segmentation method
CN110866997A (en) Novel method for constructing running condition of electric automobile
CN112541532B (en) Target detection method based on dense connection structure
CN110942091A (en) Semi-supervised few-sample image classification method for searching reliable abnormal data center
CN110826618A (en) Personal credit risk assessment method based on random forest
CN108280236A (en) A kind of random forest visualization data analysing method based on LargeVis
CN105512175A (en) Quick image retrieval method based on color features and texture characteristics
CN113283473B (en) CNN feature mapping pruning-based rapid underwater target identification method
CN109669210A (en) Favorable method based on a variety of seismic properties interpretational criterias
CN112036476A (en) Data feature selection method and device based on two-classification service and computer equipment
CN111723897A (en) Multi-modal feature selection method based on particle swarm optimization
CN115147632A (en) Image category automatic labeling method and device based on density peak value clustering algorithm
CN112215268A (en) Method and device for classifying disaster weather satellite cloud pictures
CN110909785A (en) Multitask Triplet loss function learning method based on semantic hierarchy
CN107194468A (en) Towards the decision tree Increment Learning Algorithm of information big data
CN113033345A (en) V2V video face recognition method based on public feature subspace
CN111914930A (en) Density peak value clustering method based on self-adaptive micro-cluster fusion
CN116662832A (en) Training sample selection method based on clustering and active learning
CN110009024A (en) A kind of data classification method based on ID3 algorithm
KR101085066B1 (en) An Associative Classification Method for detecting useful knowledge from huge multi-attributes dataset
CN116089142A (en) Novel service fault root cause analysis method
CN112560900B (en) Multi-disease classifier design method for sample imbalance
Yarramalle et al. Unsupervised image segmentation using finite doubly truncated Gaussian mixture model and hierarchical clustering

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210810