CN108962382A - Hierarchical important feature selection method based on high-dimensional clinical breast cancer data - Google Patents
Hierarchical important feature selection method based on high-dimensional clinical breast cancer data
- Publication number
- CN108962382A CN108962382A CN201810552686.3A CN201810552686A CN108962382A CN 108962382 A CN108962382 A CN 108962382A CN 201810552686 A CN201810552686 A CN 201810552686A CN 108962382 A CN108962382 A CN 108962382A
- Authority
- CN
- China
- Prior art keywords
- feature
- value
- threshold
- threshold value
- selection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Abstract
The invention discloses a hierarchical important feature selection method based on high-dimensional clinical breast cancer data. The feature selection method of the invention comprises statistical feature selection and ensemble feature selection. Statistical feature selection uses univariate analysis: different statistical tests preliminarily select the features that have a significant effect on the outcome variable. Ensemble feature selection builds a gradient boosted tree model, obtains feature importance scores after training the model, and then applies a designed and validated importance score threshold to select the features that have a major impact on the outcome variable. The invention effectively overcomes problems such as excessive feature dimensionality, excessive redundant features and messy data in clinical breast cancer prediction modelling. Redundant or meaningless features in the high-dimensional clinical breast cancer data are excluded, so that as few features as possible, each with a major impact on breast cancer modelling, are selected, guaranteeing the accuracy and practicality of the breast cancer model.
Description
Technical field
The present invention relates to the fields of computer technology, statistical machine learning and feature engineering.
Background technique
Breast cancer is the malignant tumour with the highest incidence among women worldwide and seriously threatens women's health. Breast cancer patients are usually treated with interventions such as surgery and chemotherapy, but after treatment they may face a risk of recurrence at any time. Scientifically assessing and predicting the survival of breast cancer patients can help doctors formulate appropriate treatment plans, providing new support for reducing recurrence risk and improving prognosis.
Assessment and prediction of patient survival, such as recurrence-free survival, can be realized by building a machine learning prediction model on clinical breast cancer data. However, the quality of the clinical data largely determines the performance of the prediction model. In the real world, the clinical data of breast cancer patients typically comprise basic patient information, diagnosis history, pathology, surgery, chemotherapy, radiotherapy, endocrine therapy and targeted therapy. These data have high feature dimensionality and commonly suffer from missing, abnormal, duplicated and inconsistent values, so the raw real-world clinical data must be cleaned to ensure data quality.
Data cleaning alone cannot solve the high dimensionality of clinical breast cancer data. Feature engineering and dimensionality reduction of the high-dimensional feature data are therefore essential, mainly for the following two reasons:
(1) Practicality of the prediction model. After the model is embedded in a breast cancer prognosis system, doctors or patients must enter the information required for prediction. This information enters the model as input feature values, and only then can the system make an effective prediction from the input. If there are too many input features, entering them consumes the patient's or doctor's time and energy, which greatly reduces the practicality of the prediction model.
(2) Performance of the prediction model. Feature engineering is used to identify and remove unneeded, irrelevant and redundant attributes, which do not improve model performance and may in fact degrade it. In practical problems we prefer fewer features, because they reduce model complexity, and a simpler model is easier to understand and explain.
Therefore, to build a practical and high-performing prediction model, the key is feature engineering of the clinical high-dimensional data, so as to screen out the features that have a major impact on breast cancer recurrence-free survival, thereby aiding diagnosis, reducing recurrence risk and improving prognosis.
High-dimensional feature selection methods broadly fall into the following categories:
(1) Univariate analysis. Each factor is analysed individually, and statistical tests determine whether it has a significant effect on the target variable. This method can only exclude a small number of clearly irrelevant features and ignores interactions between features.
(2) Feature importance analysis. A base learner (e.g. CART or a random forest) is fitted to the training data to obtain an importance score for each feature, and features scoring 0 are excluded. This can remove irrelevant features, but the dimensionality of the final selection is often still high, so the feature dimensionality cannot be reduced as far as possible.
(3) Recursive feature elimination, proposed by Guyon et al. Building on feature importance analysis, it recursively removes the least important feature one at a time, re-evaluates the base learner on the new feature set, and recomputes each feature's importance score as the basis for the next elimination, finally selecting the best-performing feature set. Under real high-dimensional scenarios this method demands substantial computing resources and time, and the choice of base learner and the instability of importance scores often significantly affect the result.
A high-dimensional feature selection method should, while preserving model performance within acceptable time complexity, exclude redundant or irrelevant features and minimise the number of features finally selected. How to select important features from high-dimensional data is therefore a problem that researchers at home and abroad must focus on.
Summary of the invention
The purpose of the present invention is to address the problem of excessively high clinical data dimensionality when building breast cancer survival prediction models. A hierarchical feature selection method combining statistical feature selection with ensemble feature selection solves the problems of important feature extraction and model practicality.
The hierarchical important feature selection method based on high-dimensional clinical breast cancer data of the invention comprises the following steps:
Statistical feature selection:
Perform feature extraction and cleaning on the raw clinical data to obtain the original feature set F_n;
Compute the significance value of each feature F_i in F_n;
Form the statistical feature set F_m from the features F_i whose significance value is below a preset threshold.
Ensemble feature selection:
Obtain the mean importance score of each feature F_i in F_m: set different random seeds; for each seed, select training data containing F_m, build a gradient boosted tree model, and output the importance score Score_i of each feature F_i in F_m under the current seed; average Score_i over all seeds to obtain the mean importance score of each feature F_i.
Based on a preset importance score threshold, form the important feature set F_e from the features F_i in F_m whose mean importance score exceeds the threshold.
Further, the significance value of a feature F_i is computed as follows:
The significance value of F_i is computed with a different metric depending on the attribute type of F_i.
For a feature F_i whose attribute is a categorical variable, first judge whether F_i is an ordinal or a nominal categorical variable: if F_i is ordinal, compute its significance value (p value) with the Mann-Whitney U test; if F_i is nominal, compute its significance value with the chi-square test.
For a feature F_i whose attribute is a continuous variable, first use the Kolmogorov-Smirnov (KS) test to check whether the distribution of F_i is normal: if normal, compute its significance value with the independent-samples t test; otherwise, compute it with the Mann-Whitney U test.
Further, a preferred setting of the importance score threshold:
Set the initial threshold to 0 and, using backward feature selection, increase the threshold step by step to obtain the feature set corresponding to each threshold. For each such feature set, build a gradient boosted tree model and obtain its evaluation metric value on the test set. Among all feature sets whose metric differs from the maximum evaluation metric value by an acceptable margin, select the threshold corresponding to the set with the fewest features as the feature importance score threshold.
The method of the present invention performs hierarchical, stage-by-stage feature selection. Without degrading the performance of the breast cancer model, it selects an important feature combination containing as few features as possible. The method has the following advantages:
(1) Statistical feature selection finds the individual features that significantly affect the outcome variable, removing the possible influence of clearly irrelevant single features on final model performance;
(2) Gradient boosted trees are used as the base learner, which handles interactions among multidimensional features well, fully learns the probability space of the data features, and ensures accurate importance scoring;
(3) The mean importance score is obtained over repeated runs, shielding against accidental random seed effects in machine learning and ensuring reliable and stable importance scores;
(4) The importance score threshold is chosen selectively rather than by eliminating features one at a time, reducing the time and computing resources consumed by feature selection;
(5) The simplest feature set is chosen within an acceptable loss of model performance, ensuring the performance and practicality of the constructed prediction model.
The invention therefore has clear advantages and broad application scenarios.
Detailed description of the invention
Fig. 1 is the basic processing flowchart of the invention;
Fig. 2 is the statistical feature selection flowchart of the invention;
Fig. 3 is the ensemble feature selection flowchart of the invention;
Fig. 4 is a schematic diagram of threshold setting in ensemble feature selection;
Fig. 5 is a schematic diagram of the implementation workflow of an application of the invention.
Specific embodiment
To make the object, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiments and drawings.
Referring to Fig. 1, the hierarchical important feature selection method for high-dimensional clinical breast cancer data of the invention comprises statistical feature computation, ensemble feature computation and the associated threshold setting. The hierarchical feature selection method combining statistical feature selection with ensemble feature selection effectively solves problems such as important feature extraction and model practicality. The specific implementation is as follows:
S1: Statistical feature selection.
Perform feature extraction and cleaning on the raw clinical data to obtain the original feature set F_n; compute the significance value of each feature F_i in F_n (the subscript is the dimension identifier), and form the statistical feature set F_m from the features whose significance value is below the preset threshold. Referring to Fig. 2, the procedure is as follows:
S101: Extract and clean features from the clinical breast cancer data to obtain the original feature set F_n. Traverse each feature F_i in F_n and judge its attribute type, i.e. whether F_i is a categorical or a continuous variable. If categorical, execute step S102; if continuous, execute step S104.
S102: If F_i is categorical, further judge whether it is an ordinal or a nominal categorical variable.
S103: If F_i is ordinal, compute the p value with the Mann-Whitney U test; if F_i is nominal, compute the p value with the chi-square test. Then jump to S106.
S104: If F_i is continuous, use the KS test to check whether its distribution is normal.
S105: If the distribution is normal (e.g. p > 0.05 is taken as normal), compute the p value with the independent-samples t test; otherwise, use the Mann-Whitney U test.
S106: If the statistical test p value of F_i is below 0.05, add F_i to the selected feature set F_m, i.e. the statistical feature set, where F_m is initially the empty set.
S2: Ensemble feature selection.
Important features are further screened from the obtained statistical feature set F_m using gradient boosted tree learning. Referring to Fig. 3, the procedure is as follows:
S201: Score the importance of F_m:
Build a gradient boosted tree model on training data containing the statistical feature set F_m. After parameter tuning and training, output the importance score Score_i of each feature in F_m.
S202: Obtain the mean importance score:
Set different random seeds and repeat step S201 T times (T = 100 in this embodiment), then average the T experimental results to obtain the mean importance score of each feature in F_m.
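The seed averaging of S201-S202 can be sketched as follows. The gradient boosted tree trainer is replaced by a hypothetical stand-in (`train_and_score`) that returns noisy per-feature scores, since the point of the step is the averaging scaffold rather than any particular GBT library; the feature names and score values are invented for illustration.

```python
# Averaging per-seed importance scores to damp the effect of any single
# random seed (step S202). In practice train_and_score would train a
# gradient boosted tree under the given seed and return its per-feature
# importance scores for the statistical feature set F_m.
import random

def train_and_score(features, seed):
    """Stand-in for one seeded training run: returns hypothetical noisy
    importance scores around fixed underlying values."""
    rng = random.Random(seed)
    true_importance = {f: i + 1.0 for i, f in enumerate(features)}
    return {f: max(0.0, true_importance[f] + rng.gauss(0, 0.5))
            for f in features}

def mean_importance(features, T=100):
    """Run T seeded trainings and average Score_i into the mean score."""
    totals = {f: 0.0 for f in features}
    for seed in range(T):
        scores = train_and_score(features, seed)
        for f in features:
            totals[f] += scores[f]
    return {f: totals[f] / T for f in features}

F_m = ["age", "tumour_size", "node_status"]
means = mean_importance(F_m, T=100)
# After averaging, the ranking reflects the underlying importance
# rather than single-seed noise.
assert means["age"] < means["tumour_size"] < means["node_status"]
```

With a real GBT library, only `train_and_score` changes (set the library's random seed parameter and read its importance attribute); the averaging loop is identical.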
S203: Set the feature importance score threshold:
Sort the features (elements) of F_m in ascending order of mean importance score to form the initial candidate feature set F_h; then obtain the feature importance score threshold from F_h by backward feature selection. Referring to Fig. 4, the process is as follows:
(1) Set the initial threshold threshold_0 = 0.
(2) Set the step size step by which the threshold grows (a variable width or a fixed step, chosen by observing the mean importance scores), and obtain the candidate feature set F_hd under the threshold of each step threshold_d, where threshold_d = threshold_{d-1} + step, threshold_0 = 0, and the step identifier d starts at 1. F_hd is obtained by screening F_h with threshold_d: if the mean importance score of a feature F_i in F_h exceeds threshold_d, the feature is kept; otherwise F_i is deleted from F_h, giving the screened candidate feature set F_hd.
(3) Update the step identifier d = d + 1 and continue computing threshold_d and the candidate feature set F_hd until the preset maximum number of steps is reached (10 in this embodiment). The termination condition of this step may instead be that the current candidate feature set F_hd is empty, or that threshold_d is equal to or greater than the mean importance score of the last feature in the initial candidate feature set F_h.
(4) For each of the multiple non-empty candidate feature sets F_h1, F_h2, ... obtained above, build a gradient boosted tree model on training data containing the candidate feature set F_hj, where the subscript j identifies a non-empty candidate feature set.
(5) Tune and train the parameters of each gradient boosted tree model and obtain its evaluation metric value V_j on an independent test set; the metric is chosen according to actual needs.
(6) The finally selected feature importance score threshold threshold_{j*} has a subscript j* satisfying j* = argmin { |F_hj| : V_j >= max_j V_j - Delta }, where Delta denotes a preset deviation margin chosen according to the actual situation. That is, among all candidate feature sets whose metric is within the acceptable margin Delta of the maximum evaluation metric value, select the threshold of the set with the fewest features |F_hj| as the final feature importance score threshold threshold (the threshold t shown in Fig. 1).
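Steps (1) through (6) can be sketched as follows, assuming a stub evaluator in place of the gradient boosted tree metric V_j; the feature names, scores and metric values are hypothetical.

```python
# Backward threshold selection (S203): grow the threshold by `step`,
# collect the non-empty candidate sets F_hd, evaluate each, and pick the
# smallest set whose metric is within `delta` of the best.

def candidate_sets(mean_scores, step, max_steps):
    """Yield (threshold_d, F_hd) for d = 1..max_steps, stopping early
    once the candidate set becomes empty."""
    threshold = 0.0
    for _ in range(max_steps):
        threshold += step
        kept = {f for f, s in mean_scores.items() if s > threshold}
        if not kept:
            break
        yield threshold, kept

def select_threshold(mean_scores, step, max_steps, evaluate, delta):
    """Among candidate sets whose metric is within delta of the maximum,
    return (threshold, feature set) of the set with fewest features."""
    candidates = [(t, fs, evaluate(fs))
                  for t, fs in candidate_sets(mean_scores, step, max_steps)]
    v_max = max(v for _, _, v in candidates)
    admissible = [(t, fs) for t, fs, v in candidates if v >= v_max - delta]
    return min(admissible, key=lambda tf: len(tf[1]))

# Hypothetical mean importance scores for four clinical features.
mean_scores = {"age": 0.05, "grade": 0.15, "size": 0.30, "nodes": 0.45}

# Stub for V_j: the metric plateaus once the two strongest features are in.
def evaluate(features):
    return 0.80 + 0.05 * len({"size", "nodes"} & features)

t, selected = select_threshold(mean_scores, step=0.1, max_steps=10,
                               evaluate=evaluate, delta=0.01)
print(t, sorted(selected))
```

Because the smallest admissible set wins, the sketch prefers {"size", "nodes"} over the larger set that scores the same, mirroring the patent's preference for the simplest feature set within the tolerated performance loss.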
S204: Select the important features.
The features of the statistical feature set F_m whose mean importance score is greater than or equal to the threshold form the important feature set F_e.
The feature selection method of the invention is applied in a breast cancer prediction system; the application workflow is shown in Fig. 5 and comprises a training stage and a prediction stage. Training: in the data preprocessing module, the historical data of breast cancer patients are extracted and organized into demographic, diagnostic, pathological and treatment features. These features are fed as a whole into the statistical feature selection module, which preliminarily screens out the features that lack statistical significance. The statistical feature data that remain are then input into the ensemble feature selection module; based on repeated trials, parameter tuning and relative performance comparison, a threshold and feature evaluation scores meeting the requirements are set, and the features below the threshold are weeded out. The result is the final set of features (the important features) with stronger statistical and model discriminative power, achieving the purpose of dimensionality reduction. With the reduced features as input, the breast cancer prediction machine learning model is built.
In the prediction stage, for a given patient (the prediction object), the values of the important features selected in the training stage are extracted from the patient's clinical breast cancer data and input into the breast cancer prediction model, which outputs the patient's disease state based on the prediction result.
The above description is merely a specific embodiment of the invention. Unless specifically stated, any feature disclosed in this specification may be replaced by other alternative features that are equivalent or serve a similar purpose; all disclosed features, or all steps of any method or process, may be combined in any way, except for mutually exclusive features and/or steps.
Claims (6)
1. A hierarchical important feature selection method based on high-dimensional clinical breast cancer data, characterized by comprising the following steps:
Statistical feature selection:
performing feature extraction and cleaning on the raw clinical data to obtain the original feature set F_n;
computing the significance value of each feature F_i in the original feature set F_n;
forming the statistical feature set F_m from the features F_i whose significance value is below a preset threshold;
Ensemble feature selection:
obtaining the mean importance score of each feature F_i in the statistical feature set F_m: setting different random seeds; for each seed, selecting training data containing F_m, building a gradient boosted tree model, and outputting the importance score Score_i of each feature F_i in F_m under the current seed; averaging Score_i over all seeds to obtain the mean importance score of each feature F_i;
based on a preset importance score threshold, forming the important feature set F_e from the features F_i in F_m whose mean importance score exceeds the importance score threshold.
2. The method as claimed in claim 1, characterized in that the significance value of a feature F_i is computed as follows:
the significance value of F_i is computed with a different metric depending on the attribute type of F_i;
for a feature F_i whose attribute is a categorical variable, first judging whether F_i is an ordinal or a nominal categorical variable: if F_i is ordinal, computing its significance value with the Mann-Whitney U test; if F_i is nominal, computing its significance value with the chi-square test;
for a feature F_i whose attribute is a continuous variable, first using the KS test to check whether the distribution of F_i is normal: if normal, computing its significance value with the independent-samples t test; otherwise, computing it with the Mann-Whitney U test.
3. The method as claimed in claim 1 or 2, characterized in that the importance score threshold is preferably set as follows:
setting the initial threshold to 0 and, using backward feature selection, increasing the threshold step by step to obtain the feature set corresponding to each threshold; for each such feature set, building a gradient boosted tree model and obtaining its evaluation metric value on the test set; among all feature sets whose metric differs from the maximum evaluation metric value by an acceptable margin, selecting the threshold corresponding to the set with the fewest features as the feature importance score threshold.
4. The method as claimed in claim 3, characterized in that the importance score threshold is preferably set as follows:
setting the value of the variable width or fixed step size step by which the threshold grows, and initializing threshold_0 = 0 and the step identifier d = 1;
computing the threshold of step d as threshold_d = threshold_{d-1} + step, where step denotes the value of the variable width or fixed step size, and forming the candidate feature set F_hd of step d from the features F_i of the statistical feature set F_m whose mean importance score exceeds threshold_d;
incrementing the step identifier d by 1 and continuing to compute threshold_d and the candidate feature set F_hd until d reaches the preset maximum number of steps;
for each non-empty candidate feature set F_hj, building a gradient boosted tree model on training data containing F_hj and obtaining the evaluation metric value V_j of the model on an independent test set, where the subscript j identifies a non-empty candidate feature set;
according to the formula j* = argmin { |F_hj| : V_j >= max_j V_j - Delta }, selecting from all threshold_j the threshold_{j*} corresponding to the identifier j* as the importance score threshold, where Delta denotes a preset deviation margin and |F_hj| denotes the number of features in the candidate feature set F_hj.
5. The method as claimed in claim 4, characterized in that the termination condition for computing threshold_d and the candidate feature sets is replaced by: the current candidate feature set F_hd is empty, or the current threshold_d is equal to or greater than the maximum mean importance score in the statistical feature set F_m.
6. The method as claimed in claim 4, characterized in that, when screening the candidate feature set of step d, the features of the statistical feature set F_m are first sorted in ascending order of mean importance score to form the initial candidate feature set F_h, and the candidate feature set of step d is then screened from F_h.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810552686.3A CN108962382B (en) | 2018-05-31 | 2018-05-31 | Hierarchical important feature selection method based on breast cancer clinical high-dimensional data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108962382A true CN108962382A (en) | 2018-12-07 |
CN108962382B CN108962382B (en) | 2022-05-03 |
Family
ID=64492813
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810552686.3A Expired - Fee Related CN108962382B (en) | 2018-05-31 | 2018-05-31 | Hierarchical important feature selection method based on breast cancer clinical high-dimensional data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108962382B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110363333A (en) * | 2019-06-21 | 2019-10-22 | 南京航空航天大学 | The prediction technique of air transit ability under the influence of a kind of weather based on progressive gradient regression tree |
CN111383766A (en) * | 2018-12-28 | 2020-07-07 | 中山大学肿瘤防治中心 | Computer data processing method, device, medium and electronic equipment |
WO2021000958A1 (en) * | 2019-07-04 | 2021-01-07 | 华为技术有限公司 | Method and apparatus for realizing model training, and computer storage medium |
CN112309571A (en) * | 2020-10-30 | 2021-02-02 | 电子科技大学 | Screening method of prognosis quantitative characteristics of digital pathological image |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080059508A1 (en) * | 2006-08-30 | 2008-03-06 | Yumao Lu | Techniques for navigational query identification |
CN102999760A (en) * | 2012-09-28 | 2013-03-27 | 常州工学院 | Target image area tracking method for on-line self-adaptive adjustment of voting weight |
CN106650314A (en) * | 2016-11-25 | 2017-05-10 | 中南大学 | Method and system for predicting amino acid mutation |
CN107256245A (en) * | 2017-06-02 | 2017-10-17 | 河海大学 | Improved and system of selection towards the off-line model that refuse messages are classified |
CN107316205A (en) * | 2017-05-27 | 2017-11-03 | 银联智惠信息服务(上海)有限公司 | Recognize humanized method, device, computer-readable medium and the system of holding |
CN107679549A (en) * | 2017-09-08 | 2018-02-09 | 第四范式(北京)技术有限公司 | Generate the method and system of the assemblage characteristic of machine learning sample |
CN107729915A (en) * | 2017-09-08 | 2018-02-23 | 第四范式(北京)技术有限公司 | For the method and system for the key character for determining machine learning sample |
CN107909433A (en) * | 2017-11-14 | 2018-04-13 | 重庆邮电大学 | A kind of Method of Commodity Recommendation based on big data mobile e-business |
CN107944913A (en) * | 2017-11-21 | 2018-04-20 | 重庆邮电大学 | High potential user's purchase intention Forecasting Methodology based on big data user behavior analysis |
Non-Patent Citations (3)
Title |
---|
ZHIBIN XIAO et al.: "Identifying Different Transportation Modes from Trajectory Data Using Tree-Based Ensemble Classifiers", ISPRS International Journal of Geo-Information * |
GUAN PENGZHOU et al.: "Short-Term Rainfall Prediction Model Based on Ensemble Learning and Deep Learning", Selected Award-Winning Papers of the 2017 (5th) National College Student Statistical Modeling Competition * |
DU JIAN: "Clinical Comparative Study of Retroperitoneal Laparoscopic Partial Nephrectomy for Central and Peripheral Renal Tumors", China Master's Theses Full-Text Database, Medicine & Health Sciences * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108962382A (en) | Hierarchical important feature selection method based on breast cancer clinical high-dimensional data | |
CN106815481B (en) | Lifetime prediction method and device based on image omics | |
CN108257135A (en) | The assistant diagnosis system of medical image features is understood based on deep learning method | |
CN104951894B (en) | Hospital's disease control intellectual analysis and assessment system | |
CN107463771B (en) | Case grouping method and system | |
CN104636631B (en) | A kind of device using diabetes system big data prediction diabetes | |
CN109785928A (en) | Diagnosis and treatment proposal recommending method, device and storage medium | |
CN107748900A (en) | Tumor of breast sorting technique and device based on distinction convolutional neural networks | |
CN107203999A (en) | A kind of skin lens image automatic division method based on full convolutional neural networks | |
CN110070540A (en) | Image generating method, device, computer equipment and storage medium | |
CN108304887A (en) | Naive Bayesian data processing system and method based on the synthesis of minority class sample | |
CN115100467B (en) | Pathological full-slice image classification method based on nuclear attention network | |
CN110245657A (en) | Pathological image similarity detection method and detection device | |
CN108509982A (en) | A method of the uneven medical data of two classification of processing | |
CN106529165A (en) | Method for identifying cancer molecular subtype based on spectral clustering algorithm of sparse similar matrix | |
CN110859624A (en) | Brain age deep learning prediction system based on structural magnetic resonance image | |
CN103678534A (en) | Physiological information and health correlation acquisition method based on rough sets and fuzzy inference | |
CN109599181A (en) | A kind of Prediction of survival system and prediction technique being directed to T3-LARC patient before the treatment | |
Rastogi et al. | Brain tumor segmentation and tumor prediction using 2D-Vnet deep learning architecture | |
CN114926396B (en) | Mental disorder magnetic resonance image preliminary screening model construction method | |
JP2024043567A (en) | Training method, training device, electronic device, storage medium, and pathological image classification system for pathological image feature extractor based on feature separation | |
Xiang et al. | A novel weight pruning strategy for light weight neural networks with application to the diagnosis of skin disease | |
CN110236497A (en) | A kind of fatty liver prediction technique based on tongue phase and BMI index | |
CN106570325A (en) | Partial-least-squares-based abnormal detection method of mammary gland cell | |
Ramos et al. | Fast and smart segmentation of paraspinal muscles in magnetic resonance imaging with CleverSeg |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20220503 |