CN108962382A - Hierarchical important feature selection method based on high-dimensional clinical breast cancer data - Google Patents
Hierarchical important feature selection method based on high-dimensional clinical breast cancer data
- Publication number
- CN108962382A CN108962382A CN201810552686.3A CN201810552686A CN108962382A CN 108962382 A CN108962382 A CN 108962382A CN 201810552686 A CN201810552686 A CN 201810552686A CN 108962382 A CN108962382 A CN 108962382A
- Authority
- CN
- China
- Prior art keywords
- feature
- value
- threshold
- threshold value
- selection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Abstract
The invention discloses a hierarchical important feature selection method based on high-dimensional clinical breast cancer data. The feature selection method of the invention comprises statistical feature selection and ensemble feature selection. Statistical feature selection uses univariate analysis: different statistical tests preliminarily select the features that have a significant effect on the outcome variable. Ensemble feature selection builds a gradient boosted tree model, obtains feature importance scores after training the model, and then applies a designed and validated importance score threshold to select the features that have a major impact on the outcome variable. The invention effectively overcomes problems such as excessive feature dimensionality, excessive redundant features and messy data in clinical breast cancer prediction modelling. Redundant or meaningless features in the high-dimensional clinical breast cancer data are excluded, so that as few features as possible, each with a major impact on breast cancer modelling, are selected, guaranteeing the accuracy and practicality of the breast cancer model.
Description
Technical field
The present invention relates to the fields of computer technology, statistical machine learning and feature engineering.
Background technique
Breast cancer is the malignant tumour with the highest incidence among women worldwide and seriously threatens women's health. Breast cancer patients are usually treated with interventions such as surgery and chemotherapy, but after treatment they may face a risk of recurrence at any time. Scientifically assessing and predicting the survival of breast cancer patients can help doctors formulate appropriate treatment plans, providing new support for reducing recurrence risk and improving prognosis.
Assessment and prediction of patient survival, such as recurrence-free survival, can be realized by building a machine learning prediction model on clinical breast cancer data. However, the quality of the clinical data largely determines the performance of the prediction model. In the real world, the clinical data of breast cancer patients typically comprise basic patient information, diagnosis history, pathology, surgery, chemotherapy, radiotherapy, endocrine therapy and targeted therapy. These data have high feature dimensionality and commonly suffer from missing, abnormal, duplicated and inconsistent values, so the raw real-world clinical data must be cleaned to ensure data quality.
Data cleaning alone cannot solve the high dimensionality of clinical breast cancer data. Feature engineering and dimensionality reduction of the high-dimensional feature data are therefore essential, mainly for the following two reasons:
(1) Practicality of the prediction model. After the model is embedded in a breast cancer prognosis system, doctors or patients must enter the information required for prediction. This information enters the model as input feature values, and only then can the system make an effective prediction from the input. If there are too many input features, entering them consumes the patient's or doctor's time and energy, which greatly reduces the practicality of the prediction model.
(2) Performance of the prediction model. Feature engineering is used to identify and remove unneeded, irrelevant and redundant attributes, which do not improve model performance and may in fact degrade it. In practical problems we prefer fewer features, because they reduce model complexity, and a simpler model is easier to understand and explain.
Therefore, to build a practical and high-performing prediction model, the key is feature engineering of the clinical high-dimensional data, so as to screen out the features that have a major impact on breast cancer recurrence-free survival, thereby aiding diagnosis, reducing recurrence risk and improving prognosis.
High-dimensional feature selection methods broadly fall into the following categories:
(1) Univariate analysis. Each factor is analysed individually, and statistical tests determine whether it has a significant effect on the target variable. This method can only exclude a small number of clearly irrelevant features and ignores interactions between features.
(2) Feature importance analysis. A base learner (e.g. CART or a random forest) is fitted to the training data to obtain an importance score for each feature, and features scoring 0 are excluded. This can remove irrelevant features, but the dimensionality of the final selection is often still high, so the feature dimensionality cannot be reduced as far as possible.
(3) Recursive feature elimination, proposed by Guyon et al. Building on feature importance analysis, it recursively removes the least important feature one at a time, re-evaluates the base learner on the new feature set, and recomputes each feature's importance score as the basis for the next elimination, finally selecting the best-performing feature set. Under real high-dimensional scenarios this method demands substantial computing resources and time, and the choice of base learner and the instability of importance scores often significantly affect the result.
A high-dimensional feature selection method should, while preserving model performance within acceptable time complexity, exclude redundant or irrelevant features and minimise the number of features finally selected. How to select important features from high-dimensional data is therefore a problem that researchers at home and abroad must focus on.
Summary of the invention
The purpose of the present invention is to address the problem of excessively high clinical data dimensionality when building breast cancer survival prediction models. A hierarchical feature selection method combining statistical feature selection with ensemble feature selection solves the problems of important feature extraction and model practicality.
The hierarchical important feature selection method based on high-dimensional clinical breast cancer data of the invention comprises the following steps:
Statistical feature selection:
Perform feature extraction and cleaning on the raw clinical data to obtain the original feature set F_n;
Compute the significance value of each feature F_i in F_n;
Form the statistical feature set F_m from the features F_i whose significance value is below a preset threshold.
Ensemble feature selection:
Obtain the mean importance score of each feature F_i in F_m: set different random seeds; for each seed, select training data containing F_m, build a gradient boosted tree model, and output the importance score Score_i of each feature F_i in F_m under the current seed; average Score_i over all seeds to obtain the mean importance score of each feature F_i.
Based on a preset importance score threshold, form the important feature set F_e from the features F_i in F_m whose mean importance score exceeds the threshold.
Further, the significance value of a feature F_i is computed as follows:
The significance value of F_i is computed with a different metric depending on the attribute type of F_i.
For a feature F_i whose attribute is a categorical variable, first judge whether F_i is an ordinal or a nominal categorical variable: if F_i is ordinal, compute its significance value (p value) with the Mann-Whitney U test; if F_i is nominal, compute its significance value with the chi-square test.
For a feature F_i whose attribute is a continuous variable, first use the Kolmogorov-Smirnov (KS) test to check whether the distribution of F_i is normal: if normal, compute its significance value with the independent-samples t test; otherwise, compute it with the Mann-Whitney U test.
Further, a preferred setting of the importance score threshold:
Set the initial threshold to 0 and, using backward feature selection, increase the threshold step by step to obtain the feature set corresponding to each threshold. For each such feature set, build a gradient boosted tree model and obtain its evaluation metric value on the test set. Among all feature sets whose metric differs from the maximum evaluation metric value by an acceptable margin, select the threshold corresponding to the set with the fewest features as the feature importance score threshold.
The method of the present invention performs hierarchical, stage-by-stage feature selection. Without degrading the performance of the breast cancer model, it selects an important feature combination containing as few features as possible. The method has the following advantages:
(1) Statistical feature selection finds the individual features that significantly affect the outcome variable, removing the possible influence of clearly irrelevant single features on final model performance;
(2) Gradient boosted trees are used as the base learner, which handles interactions among multidimensional features well, fully learns the probability space of the data features, and ensures accurate importance scoring;
(3) The mean importance score is obtained over repeated runs, shielding against accidental random seed effects in machine learning and ensuring reliable and stable importance scores;
(4) The importance score threshold is chosen selectively rather than by eliminating features one at a time, reducing the time and computing resources consumed by feature selection;
(5) The simplest feature set is chosen within an acceptable loss of model performance, ensuring the performance and practicality of the constructed prediction model.
The invention therefore has clear advantages and broad application scenarios.
Detailed description of the invention
Fig. 1 is the basic processing flowchart of the invention;
Fig. 2 is the statistical feature selection flowchart of the invention;
Fig. 3 is the ensemble feature selection flowchart of the invention;
Fig. 4 is a schematic diagram of threshold setting in ensemble feature selection;
Fig. 5 is a schematic diagram of the implementation workflow of an application of the invention.
Specific embodiment
To make the object, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiments and drawings.
Referring to Fig. 1, the hierarchical important feature selection method for high-dimensional clinical breast cancer data of the invention comprises statistical feature computation, ensemble feature computation and the associated threshold setting. The hierarchical feature selection method combining statistical feature selection with ensemble feature selection effectively solves problems such as important feature extraction and model practicality. The specific implementation is as follows:
S1: Statistical feature selection.
Perform feature extraction and cleaning on the raw clinical data to obtain the original feature set F_n; compute the significance value of each feature F_i in F_n (the subscript is the dimension identifier), and form the statistical feature set F_m from the features whose significance value is below the preset threshold. Referring to Fig. 2, the procedure is as follows:
S101: Extract and clean features from the clinical breast cancer data to obtain the original feature set F_n. Traverse each feature F_i in F_n and judge its attribute type, i.e. whether F_i is a categorical or a continuous variable. If categorical, execute step S102; if continuous, execute step S104.
S102: If F_i is categorical, further judge whether it is an ordinal or a nominal categorical variable.
S103: If F_i is ordinal, compute the p value with the Mann-Whitney U test; if F_i is nominal, compute the p value with the chi-square test. Then jump to S106.
S104: If F_i is continuous, use the KS test to check whether its distribution is normal.
S105: If the distribution is normal (e.g. p > 0.05 is taken as normal), compute the p value with the independent-samples t test; otherwise, use the Mann-Whitney U test.
S106: If the statistical test p value of F_i is below 0.05, add F_i to the selected feature set F_m, i.e. the statistical feature set, where F_m is initially the empty set.
S2: Ensemble feature selection.
Important features are further screened from the obtained statistical feature set F_m using gradient boosted tree learning. Referring to Fig. 3, the procedure is as follows:
S201: Score the importance of F_m:
Build a gradient boosted tree model on training data containing the statistical feature set F_m. After parameter tuning and training, output the importance score Score_i of each feature in F_m.
S202: Obtain the mean importance score:
Set different random seeds and repeat step S201 T times (T = 100 in this embodiment), then average the T experimental results to obtain the mean importance score of each feature in F_m.
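The seed averaging of S201-S202 can be sketched as follows. The gradient boosted tree trainer is replaced by a hypothetical stand-in (`train_and_score`) that returns noisy per-feature scores, since the point of the step is the averaging scaffold rather than any particular GBT library; the feature names and score values are invented for illustration.

```python
# Averaging per-seed importance scores to damp the effect of any single
# random seed (step S202). In practice train_and_score would train a
# gradient boosted tree under the given seed and return its per-feature
# importance scores for the statistical feature set F_m.
import random

def train_and_score(features, seed):
    """Stand-in for one seeded training run: returns hypothetical noisy
    importance scores around fixed underlying values."""
    rng = random.Random(seed)
    true_importance = {f: i + 1.0 for i, f in enumerate(features)}
    return {f: max(0.0, true_importance[f] + rng.gauss(0, 0.5))
            for f in features}

def mean_importance(features, T=100):
    """Run T seeded trainings and average Score_i into the mean score."""
    totals = {f: 0.0 for f in features}
    for seed in range(T):
        scores = train_and_score(features, seed)
        for f in features:
            totals[f] += scores[f]
    return {f: totals[f] / T for f in features}

F_m = ["age", "tumour_size", "node_status"]
means = mean_importance(F_m, T=100)
# After averaging, the ranking reflects the underlying importance
# rather than single-seed noise.
assert means["age"] < means["tumour_size"] < means["node_status"]
```

With a real GBT library, only `train_and_score` changes (set the library's random seed parameter and read its importance attribute); the averaging loop is identical.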
S203: Set the feature importance score threshold:
Sort the features (elements) of F_m in ascending order of mean importance score to form the initial candidate feature set F_h; then obtain the feature importance score threshold from F_h by backward feature selection. Referring to Fig. 4, the process is as follows:
(1) Set the initial threshold threshold_0 = 0.
(2) Set the step size step by which the threshold grows (a variable width or a fixed step, chosen by observing the mean importance scores), and obtain the candidate feature set F_hd under the threshold of each step threshold_d, where threshold_d = threshold_{d-1} + step, threshold_0 = 0, and the step identifier d starts at 1. F_hd is obtained by screening F_h with threshold_d: if the mean importance score of a feature F_i in F_h exceeds threshold_d, the feature is kept; otherwise F_i is deleted from F_h, giving the screened candidate feature set F_hd.
(3) Update the step identifier d = d + 1 and continue computing threshold_d and the candidate feature set F_hd until the preset maximum number of steps is reached (10 in this embodiment). The termination condition of this step may instead be that the current candidate feature set F_hd is empty, or that threshold_d is equal to or greater than the mean importance score of the last feature in the initial candidate feature set F_h.
(4) For each of the multiple non-empty candidate feature sets F_h1, F_h2, ... obtained above, build a gradient boosted tree model on training data containing the candidate feature set F_hj, where the subscript j identifies a non-empty candidate feature set.
(5) Tune and train the parameters of each gradient boosted tree model and obtain its evaluation metric value V_j on an independent test set; the metric is chosen according to actual needs.
(6) The finally selected feature importance score threshold threshold_{j*} has a subscript j* satisfying j* = argmin { |F_hj| : V_j >= max_j V_j - Delta }, where Delta denotes a preset deviation margin chosen according to the actual situation. That is, among all candidate feature sets whose metric is within the acceptable margin Delta of the maximum evaluation metric value, select the threshold of the set with the fewest features |F_hj| as the final feature importance score threshold threshold (the threshold t shown in Fig. 1).
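Steps (1) through (6) can be sketched as follows, assuming a stub evaluator in place of the gradient boosted tree metric V_j; the feature names, scores and metric values are hypothetical.

```python
# Backward threshold selection (S203): grow the threshold by `step`,
# collect the non-empty candidate sets F_hd, evaluate each, and pick the
# smallest set whose metric is within `delta` of the best.

def candidate_sets(mean_scores, step, max_steps):
    """Yield (threshold_d, F_hd) for d = 1..max_steps, stopping early
    once the candidate set becomes empty."""
    threshold = 0.0
    for _ in range(max_steps):
        threshold += step
        kept = {f for f, s in mean_scores.items() if s > threshold}
        if not kept:
            break
        yield threshold, kept

def select_threshold(mean_scores, step, max_steps, evaluate, delta):
    """Among candidate sets whose metric is within delta of the maximum,
    return (threshold, feature set) of the set with fewest features."""
    candidates = [(t, fs, evaluate(fs))
                  for t, fs in candidate_sets(mean_scores, step, max_steps)]
    v_max = max(v for _, _, v in candidates)
    admissible = [(t, fs) for t, fs, v in candidates if v >= v_max - delta]
    return min(admissible, key=lambda tf: len(tf[1]))

# Hypothetical mean importance scores for four clinical features.
mean_scores = {"age": 0.05, "grade": 0.15, "size": 0.30, "nodes": 0.45}

# Stub for V_j: the metric plateaus once the two strongest features are in.
def evaluate(features):
    return 0.80 + 0.05 * len({"size", "nodes"} & features)

t, selected = select_threshold(mean_scores, step=0.1, max_steps=10,
                               evaluate=evaluate, delta=0.01)
print(t, sorted(selected))
```

Because the smallest admissible set wins, the sketch prefers {"size", "nodes"} over the larger set that scores the same, mirroring the patent's preference for the simplest feature set within the tolerated performance loss.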
S204: Select the important features.
The features of the statistical feature set F_m whose mean importance score is greater than or equal to the threshold form the important feature set F_e.
The feature selection method of the invention is applied in a breast cancer prediction system; the application workflow is shown in Fig. 5 and comprises a training stage and a prediction stage. Training: in the data preprocessing module, the historical data of breast cancer patients are extracted and organized into demographic, diagnostic, pathological and treatment features. These features are fed as a whole into the statistical feature selection module, which preliminarily screens out the features that lack statistical significance. The statistical feature data that remain are then input into the ensemble feature selection module; based on repeated trials, parameter tuning and relative performance comparison, a threshold and feature evaluation scores meeting the requirements are set, and the features below the threshold are weeded out. The result is the final set of features (the important features) with stronger statistical and model discriminative power, achieving the purpose of dimensionality reduction. With the reduced features as input, the breast cancer prediction machine learning model is built.
In the prediction stage, for a given patient (the prediction object), the values of the important features selected in the training stage are extracted from the patient's clinical breast cancer data and input into the breast cancer prediction model, which outputs the patient's disease state based on the prediction result.
The above description is merely a specific embodiment of the invention. Unless specifically stated, any feature disclosed in this specification may be replaced by other alternative features that are equivalent or serve a similar purpose; all disclosed features, or all steps of any method or process, may be combined in any way, except for mutually exclusive features and/or steps.
Claims (6)
1. A hierarchical important feature selection method based on high-dimensional clinical breast cancer data, characterized by comprising the following steps:
Statistical feature selection:
performing feature extraction and cleaning on the raw clinical data to obtain the original feature set F_n;
computing the significance value of each feature F_i in the original feature set F_n;
forming the statistical feature set F_m from the features F_i whose significance value is below a preset threshold;
Ensemble feature selection:
obtaining the mean importance score of each feature F_i in the statistical feature set F_m: setting different random seeds; for each seed, selecting training data containing F_m, building a gradient boosted tree model, and outputting the importance score Score_i of each feature F_i in F_m under the current seed; averaging Score_i over all seeds to obtain the mean importance score of each feature F_i;
based on a preset importance score threshold, forming the important feature set F_e from the features F_i in F_m whose mean importance score exceeds the importance score threshold.
2. The method as claimed in claim 1, characterized in that the significance value of a feature F_i is computed as follows:
the significance value of F_i is computed with a different metric depending on the attribute type of F_i;
for a feature F_i whose attribute is a categorical variable, first judging whether F_i is an ordinal or a nominal categorical variable: if F_i is ordinal, computing its significance value with the Mann-Whitney U test; if F_i is nominal, computing its significance value with the chi-square test;
for a feature F_i whose attribute is a continuous variable, first using the KS test to check whether the distribution of F_i is normal: if normal, computing its significance value with the independent-samples t test; otherwise, computing it with the Mann-Whitney U test.
3. The method as claimed in claim 1 or 2, characterized in that the importance score threshold is preferably set as follows:
setting the initial threshold to 0 and, using backward feature selection, increasing the threshold step by step to obtain the feature set corresponding to each threshold; for each such feature set, building a gradient boosted tree model and obtaining its evaluation metric value on the test set; among all feature sets whose metric differs from the maximum evaluation metric value by an acceptable margin, selecting the threshold corresponding to the set with the fewest features as the feature importance score threshold.
4. The method as claimed in claim 3, characterized in that the importance score threshold is preferably set as follows:
setting the value of the variable width or fixed step size step by which the threshold grows, and initializing threshold_0 = 0 and the step identifier d = 1;
computing the threshold of step d as threshold_d = threshold_{d-1} + step, where step denotes the value of the variable width or fixed step size, and forming the candidate feature set F_hd of step d from the features F_i of the statistical feature set F_m whose mean importance score exceeds threshold_d;
incrementing the step identifier d by 1 and continuing to compute threshold_d and the candidate feature set F_hd until d reaches the preset maximum number of steps;
for each non-empty candidate feature set F_hj, building a gradient boosted tree model on training data containing F_hj and obtaining the evaluation metric value V_j of the model on an independent test set, where the subscript j identifies a non-empty candidate feature set;
according to the formula j* = argmin { |F_hj| : V_j >= max_j V_j - Delta }, selecting from all threshold_j the threshold_{j*} corresponding to the identifier j* as the importance score threshold, where Delta denotes a preset deviation margin and |F_hj| denotes the number of features in the candidate feature set F_hj.
5. The method as claimed in claim 4, characterized in that the termination condition for computing threshold_d and the candidate feature sets is replaced by: the current candidate feature set F_hd is empty, or the current threshold_d is equal to or greater than the maximum mean importance score in the statistical feature set F_m.
6. The method as claimed in claim 4, characterized in that, when screening the candidate feature set of step d, the features of the statistical feature set F_m are first sorted in ascending order of mean importance score to form the initial candidate feature set F_h, and the candidate feature set of step d is then screened from F_h.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810552686.3A CN108962382B (en) | 2018-05-31 | 2018-05-31 | Hierarchical important feature selection method based on breast cancer clinical high-dimensional data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108962382A true CN108962382A (en) | 2018-12-07 |
CN108962382B CN108962382B (en) | 2022-05-03 |
Family
ID=64492813
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810552686.3A Expired - Fee Related CN108962382B (en) | 2018-05-31 | 2018-05-31 | Hierarchical important feature selection method based on breast cancer clinical high-dimensional data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108962382B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110363333A (en) * | 2019-06-21 | 2019-10-22 | 南京航空航天大学 | The prediction technique of air transit ability under the influence of a kind of weather based on progressive gradient regression tree |
CN111383766A (en) * | 2018-12-28 | 2020-07-07 | 中山大学肿瘤防治中心 | Computer data processing method, device, medium and electronic equipment |
WO2021000958A1 (en) * | 2019-07-04 | 2021-01-07 | 华为技术有限公司 | Method and apparatus for realizing model training, and computer storage medium |
CN112309571A (en) * | 2020-10-30 | 2021-02-02 | 电子科技大学 | Screening method of prognosis quantitative characteristics of digital pathological image |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080059508A1 (en) * | 2006-08-30 | 2008-03-06 | Yumao Lu | Techniques for navigational query identification |
CN102999760A (en) * | 2012-09-28 | 2013-03-27 | 常州工学院 | Target image area tracking method for on-line self-adaptive adjustment of voting weight |
CN106650314A (en) * | 2016-11-25 | 2017-05-10 | 中南大学 | Method and system for predicting amino acid mutation |
CN107256245A (en) * | 2017-06-02 | 2017-10-17 | 河海大学 | Improved and system of selection towards the off-line model that refuse messages are classified |
CN107316205A (en) * | 2017-05-27 | 2017-11-03 | 银联智惠信息服务(上海)有限公司 | Recognize humanized method, device, computer-readable medium and the system of holding |
CN107679549A (en) * | 2017-09-08 | 2018-02-09 | 第四范式(北京)技术有限公司 | Generate the method and system of the assemblage characteristic of machine learning sample |
CN107729915A (en) * | 2017-09-08 | 2018-02-23 | 第四范式(北京)技术有限公司 | For the method and system for the key character for determining machine learning sample |
CN107909433A (en) * | 2017-11-14 | 2018-04-13 | 重庆邮电大学 | A kind of Method of Commodity Recommendation based on big data mobile e-business |
CN107944913A (en) * | 2017-11-21 | 2018-04-20 | 重庆邮电大学 | High potential user's purchase intention Forecasting Methodology based on big data user behavior analysis |
Non-Patent Citations (3)
Title |
---|
ZHIBIN XIAO et al.: "Identifying Different Transportation Modes from Trajectory Data Using Tree-Based Ensemble Classifiers", ISPRS International Journal of Geo-Information * |
GUAN PENGZHOU et al.: "Short-Term Rainfall Prediction Model Based on Ensemble Learning and Deep Learning", Selected Award-Winning Papers of the 2017 (5th) National College Student Statistical Modeling Competition * |
DU JIAN: "Clinical Comparative Study of Retroperitoneal Laparoscopic Partial Nephrectomy for Central and Peripheral Renal Tumors", China Master's Theses Full-Text Database, Medicine & Health Sciences * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108962382A (en) | Hierarchical important feature selection method based on breast cancer clinical high-dimensional data | |
CN106815481B (en) | Lifetime prediction method and device based on image omics | |
CN108257135A (en) | The assistant diagnosis system of medical image features is understood based on deep learning method | |
CN104951894B (en) | Hospital's disease control intellectual analysis and assessment system | |
CN107463771B (en) | Case grouping method and system | |
CN104636631B (en) | A kind of device using diabetes system big data prediction diabetes | |
CN109785928A (en) | Diagnosis and treatment proposal recommending method, device and storage medium | |
CN107748900A (en) | Tumor of breast sorting technique and device based on distinction convolutional neural networks | |
CN107203999A (en) | A kind of skin lens image automatic division method based on full convolutional neural networks | |
CN110070540A (en) | Image generating method, device, computer equipment and storage medium | |
CN108304887A (en) | Naive Bayesian data processing system and method based on the synthesis of minority class sample | |
CN115100467B (en) | Pathological full-slice image classification method based on nuclear attention network | |
CN110245657A (en) | Pathological image similarity detection method and detection device | |
CN108509982A (en) | A method of the uneven medical data of two classification of processing | |
CN106529165A (en) | Method for identifying cancer molecular subtype based on spectral clustering algorithm of sparse similar matrix | |
CN110859624A (en) | Brain age deep learning prediction system based on structural magnetic resonance image | |
CN103678534A (en) | Physiological information and health correlation acquisition method based on rough sets and fuzzy inference | |
CN109599181A (en) | A kind of Prediction of survival system and prediction technique being directed to T3-LARC patient before the treatment | |
Rastogi et al. | Brain tumor segmentation and tumor prediction using 2D-Vnet deep learning architecture | |
CN114926396B (en) | Mental disorder magnetic resonance image preliminary screening model construction method | |
JP2024043567A (en) | Training method, training device, electronic device, storage medium, and pathological image classification system for pathological image feature extractor based on feature separation | |
Xiang et al. | A novel weight pruning strategy for light weight neural networks with application to the diagnosis of skin disease | |
CN110236497A (en) | A kind of fatty liver prediction technique based on tongue phase and BMI index | |
CN106570325A (en) | Partial-least-squares-based abnormal detection method of mammary gland cell | |
Ramos et al. | Fast and smart segmentation of paraspinal muscles in magnetic resonance imaging with CleverSeg |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20220503 |