CN108962382B - Hierarchical important feature selection method based on breast cancer clinical high-dimensional data


Info

Publication number
CN108962382B
CN108962382B (application CN201810552686.3A)
Authority
CN
China
Prior art keywords
feature
threshold
value
statistical
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810552686.3A
Other languages
Chinese (zh)
Other versions
CN108962382A (en)
Inventor
付波
刘沛
林劼
郑鸿
邓玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201810552686.3A
Publication of CN108962382A
Application granted
Publication of CN108962382B
Legal status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for computer-aided diagnosis, e.g. based on medical expert systems
    • G16H 50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for calculating health indices; for individual health risk assessment
    • G16H 50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hierarchical important feature selection method based on breast cancer clinical high-dimensional data. The method comprises two stages: statistical feature selection and integrated feature selection. Statistical feature selection applies single-factor analysis, using different statistical tests to make a preliminary selection of the features that have a significant influence on the outcome variable. Integrated feature selection builds a gradient boosting tree model, obtains feature importance scores after training, and then applies a designed and validated importance-score threshold to select the features that have an important influence on the outcome variable. The invention effectively addresses the problems of excessively high feature dimensionality, redundant features, and disordered data encountered when building clinical breast cancer prediction models. Redundant or meaningless features in clinical breast cancer high-dimensional data are eliminated, so that as few features as possible, each with an important influence on breast cancer modelling, are selected, preserving both the accuracy and the practicality of the breast cancer model.

Description

Hierarchical important feature selection method based on breast cancer clinical high-dimensional data
Technical Field
The invention relates to the fields of computer technology, statistical machine learning, feature engineering, and related areas.
Background
Breast cancer is the malignant tumor with the highest incidence among women worldwide and seriously threatens women's health. Breast cancer patients are usually treated with surgery and chemotherapy, and they remain at risk of recurrence at any time after treatment. Scientific evaluation and prediction of the survival state of breast cancer patients can help doctors make appropriate treatment plans, and it provides new support for reducing patients' recurrence risk and improving prognosis.
Estimating and predicting the survival state of a breast cancer patient, for example the recurrence-free survival rate, can be achieved by building a machine learning prediction model on breast cancer clinical data. However, the quality of the clinical data largely determines the performance of the prediction model. In the real world, clinical data of breast cancer patients generally include the patient's basic information, diagnosis history, pathology, surgery, chemotherapy, radiotherapy, endocrine therapy, targeted therapy, and so on. These data features are high-dimensional and commonly suffer from missing, abnormal, duplicated, and inconsistent values, so raw real-world clinical data must be cleaned to ensure data quality.
Data cleaning alone, however, cannot solve the high dimensionality of breast cancer clinical data. Feature engineering and dimensionality reduction of the high-dimensional feature data are therefore necessary, mainly for the following two reasons:
(1) Practicality of the prediction model. After the prediction model is embedded in a breast cancer patient prognosis evaluation system, a doctor or patient must input the information the prediction requires. That information enters the prediction model as input feature values, and the system then makes an effective prediction from it. Too many input features consume the patient's or doctor's effort and time, which greatly reduces the practicality of the prediction model.
(2) Performance of the prediction model. Feature engineering identifies and removes unneeded, irrelevant, and redundant attributes that do not improve, and may actually degrade, the performance of the prediction model. In practice fewer features are preferable, because they reduce the complexity of the model, and a simpler model is easier to understand and interpret.
Therefore, to construct a practical and high-performance prediction model, the key is to apply feature engineering to the clinical high-dimensional data so as to screen out the features that have an important influence on recurrence-free survival of breast cancer, thereby assisting doctors in diagnosis, reducing patients' recurrence risk, and improving prognosis.
High-dimensional data feature selection methods can generally be divided into the following:
(1) Single-factor analysis. Each factor is analysed separately, and a statistical test determines whether it has a significant influence on the target variable. This method can only exclude a few clearly irrelevant features and ignores interactions between features.
(2) Feature importance analysis. A base learner (such as CART or a random forest) is fitted to the training data to obtain an importance score for each feature, and features with an importance score of 0 are excluded. This method can eliminate irrelevant features, but the dimensionality of the finally selected feature set remains high; it cannot reduce the feature dimensionality as far as possible.
(3) Recursive feature elimination, proposed by Guyon et al. Building on feature importance analysis, this method recursively eliminates the less important features one by one, evaluates the base learner on each new feature set, and recomputes every feature's importance score as the basis for the next elimination. The best-performing feature set is finally selected. On real high-dimensional data this method demands substantial computing resources and time, and the choice of base learner and the instability of the importance scores often strongly influence the result.
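A minimal sketch of recursive feature elimination (method (3) above) using scikit-learn's RFE may make the procedure concrete; the synthetic dataset, the random-forest base learner, and the target of 5 retained features are illustrative assumptions, not part of the invention.

```python
# Hedged sketch of recursive feature elimination with scikit-learn's RFE.
# Dataset, base learner and n_features_to_select are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)
# Drop the least important feature at each step, refitting the learner
# on the surviving set each time, as described above.
selector = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
               n_features_to_select=5, step=1)
selector.fit(X, y)
kept = selector.support_  # boolean mask of the surviving feature set
```

As the text notes, each elimination step refits the base learner, which is what makes this approach expensive on genuinely high-dimensional clinical data.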
A high-dimensional feature selection method should eliminate redundant or irrelevant features while preserving model performance and keeping time complexity acceptable, reducing the number of finally selected features as far as possible. How to select important features from high-dimensional data is therefore a problem that researchers at home and abroad must think about carefully.
Disclosure of Invention
The invention aims to solve the problem of excessively high dimensionality of clinical data when establishing a breast cancer survival prediction model. A hierarchical feature selection method combining statistical feature selection with integrated feature selection addresses both the extraction of important features and the practicality of the model.
The hierarchical important feature selection method based on breast cancer clinical high-dimensional data disclosed by the invention comprises the following steps:
Statistical feature selection processing:
Perform feature extraction on the original clinical data and clean it to obtain the original feature set F_n;
Compute the significance value of each feature F_i in the original feature set F_n;
The features F_i whose significance value is less than a preset threshold form the statistical feature set F_m.
Integrated feature selection processing:
Obtain the importance-score mean \overline{Score}_i of each feature F_i in the statistical feature set F_m: set different random-number seeds; for each seed, establish a gradient boosting tree model on the statistical feature set F_m and output the importance score Score_i of each feature F_i under the current seed; average Score_i over all seeds to obtain each feature's importance-score mean \overline{Score}_i;
Based on a preset importance-score threshold, the features F_i of the statistical feature set F_m whose importance-score mean \overline{Score}_i is greater than the importance-score threshold form the important feature set F_e.
Further, the significance value of feature F_i is computed as follows:
Different measures are adopted according to the attribute type of feature F_i.
For a feature F_i whose attribute is a categorical variable, first judge whether F_i is an ordered or an unordered categorical variable. If F_i is an ordered categorical variable, compute its significance value (p-value) with the Mann-Whitney U test; if F_i is an unordered categorical variable, compute its significance value with the chi-square test.
For a feature F_i whose attribute is a continuous variable, first use the KS test (Kolmogorov-Smirnov test) to check whether the distribution of F_i follows a normal distribution. If it does, compute its significance value with the independent-samples t-test; otherwise compute its significance value with the Mann-Whitney U test.
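The test dispatch described above can be sketched as a single function using SciPy; the variable-type labels ('ordinal', 'nominal', 'continuous'), the binary outcome y, and the KS-based normality check on standardized values are assumptions made for illustration.

```python
# Sketch of the per-feature significance-value dispatch described above.
# The `kind` labels and binary outcome encoding are illustrative assumptions.
import numpy as np
from scipy import stats

def significance_value(x, y, kind):
    """p-value of feature x against a binary outcome y (0/1).
    kind: 'ordinal', 'nominal', or 'continuous'."""
    g0, g1 = x[y == 0], x[y == 1]
    if kind == 'ordinal':
        return stats.mannwhitneyu(g0, g1).pvalue
    if kind == 'nominal':
        # contingency table: outcome classes x category values
        table = np.array([[np.sum((x == v) & (y == c))
                           for v in np.unique(x)] for c in (0, 1)])
        return stats.chi2_contingency(table)[1]
    # continuous: KS check on standardized values decides between
    # the independent-samples t-test and the Mann-Whitney U test
    z = (x - x.mean()) / x.std()
    if stats.kstest(z, 'norm').pvalue > 0.05:
        return stats.ttest_ind(g0, g1).pvalue
    return stats.mannwhitneyu(g0, g1).pvalue
```

A feature would then be kept in the statistical feature set when this p-value falls below the preset threshold (0.05 in the embodiment).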
Further, a preferred way to set the importance-score threshold is as follows:
Set the initial threshold to 0 and adopt backward feature selection, increasing the threshold step by step to obtain the feature set under each candidate threshold. For each such feature set, establish a gradient boosting tree model and obtain its evaluation index value on a test set. Among all feature sets whose evaluation index value is within an acceptable range of the maximum, select the threshold corresponding to the feature set with the fewest features as the feature importance-score threshold.
The method of the invention makes full use of hierarchical feature selection, screening layer by layer, and selects a combination of important features containing as few features as possible without affecting the performance of the breast cancer model. The method has the following advantages:
(1) Statistical feature selection finds the single-dimensional features that have a significant influence on the outcome variable, eliminating the influence of significantly irrelevant single features on the performance of the final prediction model;
(2) The gradient boosting tree serves as the base learner and handles interactions among multi-dimensional data features well, so the probability space of the data features is fully learned and the accuracy of the importance scores is ensured;
(3) The importance-score mean is obtained over repeated trials, which shields the scores from chance random-seed events in machine learning and ensures their reliability and stability;
(4) The importance-score threshold is chosen selectively rather than by eliminating features one by one, reducing the time and computing resources consumed by feature selection;
(5) The simplest feature set is selected within an acceptable range of model-performance loss, ensuring both the performance and the practicality of the resulting prediction model.
The invention therefore has clear advantages and broad application scenarios.
Drawings
FIG. 1 is a basic process flow diagram of the present invention;
FIG. 2 is a flow chart of statistical feature selection in accordance with the present invention;
FIG. 3 is a flow chart of the integrated feature selection of the present invention;
FIG. 4 is a schematic diagram of threshold setting for integrated feature selection;
FIG. 5 is a schematic diagram of an implementation process of an application of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.
Referring to FIG. 1, the hierarchical important feature selection method oriented to breast cancer clinical high-dimensional data comprises statistical feature calculation, integrated feature calculation, and the threshold-setting mode involved in integrated feature calculation. The invention uses a hierarchical feature selection method combining statistical feature selection and integrated feature selection to effectively solve the problems of important-feature extraction, model practicality, and the like. The specific implementation process is as follows:
S1: statistical feature selection.
Perform feature extraction on the original clinical data and clean it to obtain the original feature set F_n; compute the significance value of each feature F_i (the subscript i is the dimension identifier) in the original feature set F_n; the features F_i whose significance value is less than a preset threshold constitute the statistical feature set F_m. Referring to FIG. 2, the process is performed as follows:
S101: extract the features of the breast cancer clinical data and clean the data to obtain the original feature set F_n. Traverse each feature F_i of F_n and judge whether F_i belongs to the categorical variables or the continuous variables; if it is a categorical variable, execute step S102; if it is a continuous variable, execute step S104.
S102: if feature F_i is a categorical variable, judge whether it is an ordered or an unordered categorical variable.
S103: if feature F_i is an ordered categorical variable, compute its p-value with the Mann-Whitney U test; if it is an unordered categorical variable, compute its p-value with the chi-square test. Then jump to S106.
S104: if feature F_i is a continuous variable, apply the KS test to check whether its distribution follows a normal distribution.
S105: if the distribution follows a normal distribution (e.g., p > 0.05 is taken as following the normal distribution), compute the p-value with the independent-samples t-test; otherwise compute the p-value with the Mann-Whitney U test.
S106: if the p-value of feature F_i under its statistical test is less than 0.05, add F_i to the selected feature set, i.e. the statistical feature set F_m, where F_m is initially an empty set.
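Steps S101 to S106 amount to one filtering pass over the raw feature set, which can be sketched as follows; the function name, the `raw_features` mapping, and the injected `p_value_of` test (standing in for the type-dependent tests of S103/S105) are hypothetical.

```python
# Hypothetical sketch of steps S101-S106 as a single filter pass.
# `p_value_of(values, y, kind)` stands in for the type-dependent
# significance test described in the text (Mann-Whitney U, chi-square,
# or independent-samples t-test); its name and signature are assumptions.
def select_statistical_features(raw_features, y, p_value_of, alpha=0.05):
    """Return the statistical feature set F_m: names of features whose
    significance value (p-value) is below alpha (0.05 in the embodiment).
    raw_features maps feature name -> (values, kind)."""
    f_m = []  # F_m starts as an empty set (S106)
    for name, (values, kind) in raw_features.items():
        if p_value_of(values, y, kind) < alpha:  # S103/S105 -> S106
            f_m.append(name)
    return f_m
```

The statistical pass is deliberately cheap: each feature is tested once against the outcome, and only the survivors move on to the integrated selection stage.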
S2: and (4) integrating feature selection.
For the obtained statistical feature set F_m, a gradient boosting tree learning method is applied to further screen the important features. Referring to FIG. 3, the implementation process is as follows:
S201: score the importance of the statistical feature set F_m:
Establish a gradient boosting tree model using the training data containing the statistical feature set F_m. Through model parameter tuning and training, output the importance score Score_i of each feature in the statistical feature set F_m.
S202: obtain the importance-score mean \overline{Score}_i: set different random-number seeds and repeat the experiment of step S201 T times (T is set to 100 in the specific embodiment); finally, average the T experimental results to obtain the importance-score mean \overline{Score}_i of each feature in the statistical feature set F_m.
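Step S202 can be sketched with scikit-learn's gradient boosting classifier; the toy dataset and T = 5 repetitions (the embodiment uses 100) are illustrative assumptions.

```python
# Sketch of S201-S202: repeat the gradient-boosting fit under T different
# random seeds and average the importance scores to damp run-to-run
# variation. The dataset and T = 5 are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)
T = 5  # the specific embodiment uses T = 100
scores = np.zeros(X.shape[1])
for seed in range(T):
    model = GradientBoostingClassifier(random_state=seed)
    scores += model.fit(X, y).feature_importances_
mean_scores = scores / T  # importance-score mean of each feature in F_m
```

Averaging over seeds is what gives the importance scores the stability claimed in advantage (3): a single fit's ranking can shift with the seed, but the mean is far less sensitive.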
S203: setting a feature importance score threshold:
Sort the features (elements) of the statistical feature set F_m by importance-score mean from small to large to obtain the initial candidate feature set F_h; then apply a backward feature selection method to the initial candidate feature set F_h to obtain the feature importance-score threshold. Referring to FIG. 4, the implementation process is as follows:
(1) Set the initial threshold threshold_0 to 0.
(2) Set a random or fixed step size for the threshold increase (observing the importance-score means) to obtain the threshold threshold_d of each step and the candidate feature set F_hd under it, where threshold_d = threshold_{d-1} + step, threshold_0 = 0, and the step identifier d starts at 1. The candidate feature set F_hd is the result of screening the initial candidate feature set F_h with threshold_d: if the importance-score mean \overline{Score}_i of feature F_i in the initial candidate feature set F_h is greater than threshold_d, feature F_i is retained; otherwise F_i is removed from the set F_h. The screened result is the candidate feature set F_hd.
(3) Update the step identifier d to d + 1, and continue computing threshold_d and the candidate feature set F_hd until a preset maximum number of steps is reached (set to 10 in this embodiment). The end condition of this step can also be that the current candidate feature set F_hd is an empty set, or that threshold_d is equal to or greater than the importance-score mean of the last feature in the initial candidate feature set F_h.
(4) For the non-empty candidate feature sets F_h1, F_h2, … obtained in the above steps, establish a gradient boosting tree model using each contained candidate feature set F_hj, where the subscript j is the identifier of a non-empty candidate feature set.
(5) Tune the parameters of and train each gradient boosting tree model to obtain its evaluation index value V_j on an independent test set; the evaluation index is set according to the actual demand.
(6) Finally select the feature importance-score threshold threshold_{j*}, whose subscript j* satisfies:

j* = argmin_{ j : V_j >= max_k V_k - Δ } |F_hj|

where Δ denotes a preset deviation threshold chosen according to the actual situation. That is, among all candidate feature sets whose evaluation index value is within the acceptable range Δ of the maximum, the threshold threshold_{j*} corresponding to the feature set with the smallest number of features |F_hj| is taken as the final feature importance-score threshold (i.e., the threshold t shown in FIG. 1).
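Steps (1) to (6) can be sketched end to end as follows; the dataset, the use of accuracy as the evaluation index V_j, the fixed step size, and the deviation threshold Δ are all illustrative assumptions.

```python
# Sketch of S203 steps (1)-(6): step the threshold up from 0, build a model
# on each non-empty candidate set F_hd, score it on a held-out test set, and
# keep the smallest set whose score is within delta of the best. Dataset,
# fixed step size and delta are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=12, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# stand-in for the seed-averaged importance means of S202
mean_scores = GradientBoostingClassifier(random_state=0).fit(
    X_tr, y_tr).feature_importances_

step, delta = 0.02, 0.03
results = []                                   # (threshold_d, F_hd, V_d)
for d in range(1, 11):                         # preset maximum of 10 steps
    thr = d * step                             # threshold_d
    kept = np.where(mean_scores > thr)[0]      # candidate set F_hd
    if kept.size == 0:                         # optional stop condition
        break
    m = GradientBoostingClassifier(random_state=0).fit(X_tr[:, kept], y_tr)
    results.append((thr, kept, m.score(X_te[:, kept], y_te)))

v_max = max(v for _, _, v in results)
# among the sets within delta of the best score, take the fewest features
thr_star, kept_star, _ = min((r for r in results if r[2] >= v_max - delta),
                             key=lambda r: r[1].size)
```

Because the threshold steps over whole candidate sets instead of eliminating features one at a time, only about ten models are fitted, which is the time saving claimed in advantage (4).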
S204: an important feature is selected.
The features in the statistical feature set F_m whose importance-score mean is greater than or equal to the threshold form the important feature set F_e.
The feature selection method of the invention is applied in a breast cancer prediction system; a schematic diagram of a specific application is shown in FIG. 5, comprising a training stage and a prediction stage. The training process is as follows: in the data preprocessing module, historical data of breast cancer patients are extracted and sorted, then divided into demographic features, diagnostic features, pathological features, and treatment features. All features are input into the statistical feature selection module, which preliminarily screens out the features that are not statistically significant. The screened statistical feature data are then input into the integrated feature selection module, which eliminates the features below a threshold that has been set, through repeated trials, parameter tuning, and performance comparison, together with the feature evaluation scores. This yields the final features (important features) with strong statistical and model discrimination capability, achieving dimensionality reduction. The dimension-reduced features serve as input to construct the breast cancer prediction machine learning model.
In the prediction stage, for a given patient (prediction object), feature data corresponding to the important features screened in the training stage are extracted from the patient's breast cancer clinical data and input into the breast cancer prediction model, and the patient's disease state is output based on the prediction result.
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except mutually exclusive features and/or steps.

Claims (4)

1. A hierarchical important feature selection method based on breast cancer clinical high-dimensional data, characterized by comprising the following steps:
statistical feature selection processing:
performing feature extraction on the raw clinical data, cleaning, extracting and sorting the data, and dividing it into demographic features, diagnostic features, pathological features and treatment features to obtain the original feature set F_n;
computing the significance value of each feature F_i in the original feature set F_n, wherein the significance value of feature F_i is computed as follows:
different measures are adopted according to the attribute type of feature F_i;
for a feature F_i whose attribute is a categorical variable, first judging whether F_i is an ordered or an unordered categorical variable; if F_i is an ordered categorical variable, computing the significance value of feature F_i with the Mann-Whitney U test; if F_i is an unordered categorical variable, computing the significance value of feature F_i with the chi-square test;
for a feature F_i whose attribute is a continuous variable, first using the KS test to check whether the distribution of F_i follows a normal distribution; if it does, computing the significance value of feature F_i with the independent-samples t-test; otherwise, computing the significance value of feature F_i with the Mann-Whitney U test;
the features F_i whose significance value is less than a preset threshold forming the statistical feature set F_m;
integrated feature selection processing:
obtaining the importance-score mean \overline{Score}_i of each feature F_i in the statistical feature set F_m: setting different random-number seeds; for each seed, establishing a gradient boosting tree model on the statistical feature set F_m and outputting the importance score Score_i of each feature F_i under the current seed; averaging Score_i over all seeds to obtain the importance-score mean \overline{Score}_i of each feature F_i;
based on a preset importance-score threshold, the features F_i of the statistical feature set F_m whose importance-score mean \overline{Score}_i is greater than the importance-score threshold forming the important feature set F_e, the important feature set F_e serving as input to construct a breast cancer prediction machine learning model;
wherein the importance-score threshold is set as follows:
setting the initial threshold to 0 and adopting backward feature selection, gradually increasing the threshold to obtain the feature set under each corresponding threshold; establishing a gradient boosting tree model for the feature set corresponding to each threshold to obtain the evaluation index value of the model on a test set; and, among all feature sets whose evaluation index value is within an acceptable range of the maximum evaluation index value, selecting the threshold corresponding to the feature set with the fewest features as the feature importance-score threshold.
2. The method of claim 1, wherein the importance-score threshold is preferably set by:
setting the value of the random or fixed step size step of the threshold increase, and initializing threshold_0 = 0 and the step identifier d = 1;
computing the threshold of step d as threshold_d = threshold_{d-1} + step; the features F_i of the statistical feature set F_m whose importance-score mean \overline{Score}_i is greater than threshold_d forming the candidate feature set F_hd of step d;
increasing the step identifier d by 1, and continuing to compute threshold_d and the candidate feature set F_hd until d reaches a preset maximum number of steps, obtaining a number of non-empty candidate feature sets F_h1, F_h2, …;
for each obtained non-empty candidate feature set F_hj, establishing a gradient boosting tree model using the candidate feature set F_hj, and obtaining the evaluation index value V_j of the gradient boosting tree model on an independent test set, wherein the subscript j is the identifier of a non-empty candidate feature set;
according to the formula

j* = argmin_{ j : V_j >= max_k V_k - Δ } |F_hj|

selecting from all threshold_j the threshold_{j*} of the corresponding identifier j* as the importance-score threshold, wherein Δ denotes a preset deviation threshold and |F_hj| denotes the number of features of the candidate feature set F_hj.
3. The method of claim 2, wherein the condition for stopping the computation of threshold_d and the candidate feature set is replaced by: the current candidate feature set F_hd is an empty set, or the current threshold_d is equal to or greater than the maximum importance-score mean in the statistical feature set F_m.
4. The method of claim 2, wherein screening the candidate feature set of step d comprises first sorting the features of the statistical feature set F_m by importance-score mean from small to large to form the initial candidate feature set F_h, and then screening the initial candidate feature set F_h to obtain the candidate feature set of step d.
CN201810552686.3A 2018-05-31 2018-05-31 Hierarchical important feature selection method based on breast cancer clinical high-dimensional data Active CN108962382B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810552686.3A CN108962382B (en) 2018-05-31 2018-05-31 Hierarchical important feature selection method based on breast cancer clinical high-dimensional data


Publications (2)

Publication Number Publication Date
CN108962382A CN108962382A (en) 2018-12-07
CN108962382B true CN108962382B (en) 2022-05-03

Family

ID=64492813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810552686.3A Active CN108962382B (en) 2018-05-31 2018-05-31 Hierarchical important feature selection method based on breast cancer clinical high-dimensional data

Country Status (1)

Country Link
CN (1) CN108962382B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111383766A (en) * 2018-12-28 2020-07-07 中山大学肿瘤防治中心 Computer data processing method, device, medium and electronic equipment
CN110363333A (en) * 2019-06-21 2019-10-22 南京航空航天大学 The prediction technique of air transit ability under the influence of a kind of weather based on progressive gradient regression tree
CN112183758A (en) * 2019-07-04 2021-01-05 华为技术有限公司 Method and device for realizing model training and computer storage medium
CN112309571B (en) * 2020-10-30 2022-04-15 电子科技大学 Screening method of prognosis quantitative characteristics of digital pathological image

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999760A (en) * 2012-09-28 2013-03-27 常州工学院 Target image area tracking method for on-line self-adaptive adjustment of voting weight
CN106650314A (en) * 2016-11-25 2017-05-10 中南大学 Method and system for predicting amino acid mutation
CN107316205A (en) * 2017-05-27 2017-11-03 银联智惠信息服务(上海)有限公司 Recognize humanized method, device, computer-readable medium and the system of holding
CN107729915A (en) * 2017-09-08 2018-02-23 第四范式(北京)技术有限公司 For the method and system for the key character for determining machine learning sample
CN107909433A (en) * 2017-11-14 2018-04-13 重庆邮电大学 A kind of Method of Commodity Recommendation based on big data mobile e-business

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7693865B2 (en) * 2006-08-30 2010-04-06 Yahoo! Inc. Techniques for navigational query identification
CN107256245B (en) * 2017-06-02 2020-05-05 河海大学 Offline model improvement and selection method for spam message classification
CN114298323A (en) * 2017-09-08 2022-04-08 第四范式(北京)技术有限公司 Method and system for generating combined features of machine learning samples
CN107944913B (en) * 2017-11-21 2022-03-22 重庆邮电大学 High-potential user purchase intention prediction method based on big data user behavior analysis


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Identifying Different Transportation Modes from Trajectory Data Using Tree-Based Ensemble Classifiers;Zhibin Xiao 等;《ISPRS International Journal of Geo-Information》;20170228;1-22 *
Clinical comparative study of retroperitoneal laparoscopic partial nephrectomy for central versus peripheral renal tumors; Du Jian; China Master's Theses Full-text Database, Medicine and Health Sciences; 20180115; E072-1507 *
Short-term rainfall prediction model based on ensemble learning and deep learning; Guan Pengzhou et al.; Selected Award-winning Papers of the 2017 (5th) National College Student Statistical Modeling Competition; 20171205; 1-22 *


Similar Documents

Publication Publication Date Title
CN108962382B (en) Hierarchical important feature selection method based on breast cancer clinical high-dimensional data
CN109272048B (en) Pattern recognition method based on deep convolutional neural network
Rajini et al. Computer aided detection of ischemic stroke using segmentation and texture features
CN104424386A (en) Multi-parameter magnetic resonance image based prostate cancer computer auxiliary identification system
Merjulah et al. Classification of myocardial ischemia in delayed contrast enhancement using machine learning
CN110859624A (en) Brain age deep learning prediction system based on structural magnetic resonance image
CN112530592A (en) Non-small cell lung cancer risk prediction method based on machine learning
CN107545133A (en) A kind of Gaussian Blur cluster calculation method for antidiastole chronic bronchitis
Zhu et al. DSNN: a DenseNet-based SNN for explainable brain disease classification
Xiang et al. A novel weight pruning strategy for light weight neural networks with application to the diagnosis of skin disease
Hu et al. A GLCM embedded CNN strategy for computer-aided diagnosis in intracerebral hemorrhage
Syed et al. Detection of tumor in MRI images using artificial neural networks
He et al. Quantification of cognitive function in Alzheimer’s disease based on deep learning
JP7413295B2 (en) Image processing device, image processing method and program
Gull et al. A deep learning approach for multi‐stage classification of brain tumor through magnetic resonance images
WO2023198224A1 (en) Method for constructing magnetic resonance image preliminary screening model for mental disorders
CN113113143A (en) Myocardial infarction risk degree evaluation system considering delayed enhancement nuclear magnetic image
González et al. Deep convolutional neural network to predict 1p19q co-deletion and IDH1 mutation status from MRI in low grade gliomas
CN115067887A (en) Alzheimer disease prediction platform based on structural brain connection group method
Sünkel et al. Hybrid quantum machine learning assisted classification of COVID-19 from computed tomography scans
CN114373096A (en) Pulmonary nodule benign and malignant prediction system and method based on multi-feature fusion
CN109875522A (en) A method of prediction prostate biopsy and root value criterion pathological score consistency
Ali et al. Fuzzy classifier for classification of medical data
WO2022210473A1 (en) Prognosis prediction device, prognosis prediction method, and program
CN116487038B (en) Prediction system and storage medium for progression of mild cognitive impairment to Alzheimer's disease

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant