CN110825819A - Two-classification method for processing non-small cell lung cancer data with missing values and unbalance - Google Patents

Publication number
CN110825819A
Authority
CN
China
Prior art keywords
data
classification
lung cancer
small cell
cell lung
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910904648.4A
Other languages
Chinese (zh)
Inventor
赵阳
马磊
张力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN201910904648.4A
Publication of CN110825819A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Abstract

The invention relates to a binary classification method for processing non-small cell lung cancer data with missing values and class imbalance, and belongs to the technical field of data classification. First, the data are preprocessed: missing values are filled with the median, abnormal values are removed by Tukey's method, and the data are normalized by dispersion (min-max) standardization. Second, the SMOTEENN combined sampling method, which combines oversampling and undersampling, is used to balance the data. Finally, the balanced data set is used to train a random forest classifier, and the classification effect is tested on the test set, yielding a binary classification method that predicts non-small cell lung cancer survival while addressing missing values and class imbalance. Experiments on a non-small cell lung cancer data set demonstrate the effectiveness and superiority of the method, which improves the classification precision for such data and contributes to more accurate medical decisions.

Description

Two-classification method for processing non-small cell lung cancer data with missing values and unbalance
Technical Field
The invention relates to a binary classification method for processing non-small cell lung cancer data with missing values and class imbalance, in particular to a method that fills missing values with the median and balances the data by SMOTEENN combined sampling, and belongs to the technical field of data classification.
Background
Lung cancer is a malignant tumor and has become one of the deadliest diseases in the world. Non-small cell lung cancer accounts for around 85% of all lung cancer cases; because of its high morbidity and mortality, it accounts for a significant share of medical expenditure and places a heavy burden on families and society. It is therefore important to predict the survival of cancer patients more accurately and to make better clinical decisions in diagnosis and treatment, including the choice of treatment, the timing of treatment and follow-up visits, all of which can strongly affect the cost and effectiveness of treatment.
With the continuous development of medical informatization, a large amount of reliable cancer data is being used to construct machine learning models that predict survival. Survival prediction is one of the tasks of cancer prognosis, and five-year survival is a commonly used index in medicine for assessing the efficacy of surgery and therapy. Whereas earlier survival prediction relied on the clinical features of the malignancy and on physicians' experience-based estimates, the growing use of data mining techniques in healthcare now makes it possible to exploit underutilized medical data further. Medical data generally suffer from missing values and class imbalance, both of which have a large impact on classification accuracy.
Disclosure of Invention
The invention provides a binary classification method for processing non-small cell lung cancer data with missing values and class imbalance, which avoids classification that is excessively biased toward the majority class.
The technical scheme of the invention is as follows: a binary classification method for processing non-small cell lung cancer data with missing values and class imbalance, comprising the following steps. First, the data are preprocessed: samples whose proportion of missing values is lower than 70% are filled with the median, samples whose proportion of missing values is higher than 70% are deleted, abnormal values are removed by Tukey's method, and the data are normalized by dispersion standardization. Second, the SMOTEENN combined sampling method, which combines oversampling and undersampling, is used to balance the data and so resolve the class imbalance in the data set. Finally, the balanced data set is used to train a random forest classifier, and the classification effect is tested on the test set, yielding a binary classification method that predicts non-small cell lung cancer survival while addressing missing values and class imbalance.
The binary classification method for processing non-small cell lung cancer data with missing values and class imbalance comprises the following specific steps:
Step1, missing value processing: samples whose proportion of missing values is lower than 70% are kept and their missing parts are filled with the median; samples whose proportion of missing values is higher than 70% are deleted;
Step2, abnormal value processing: abnormal values in the data set distort the analysis of the overall data and therefore need to be removed; outliers are detected and removed by Tukey's method;
Step3, data normalization: normalization is important when variables have different scales and ranges, because otherwise the algorithm favors variables with larger scales; min-max normalization is therefore applied to the data;
Step4, data imbalance processing: class imbalance has a crucial influence on classification; when the proportion of the majority class is too large, the classifier tends to be biased toward the majority class and still reports good accuracy, but the classification effect cannot be evaluated correctly in that case; the SMOTEENN combined sampling method, which combines oversampling and undersampling, is therefore used for balancing;
Step5, obtaining a balanced data set: after balancing, a new balanced data set is formed in which the number of minority-class samples is increased, so that class balance is achieved and the classifier can be evaluated more reliably;
Step6, data set division: the data set is divided by 10-fold cross validation: the experimental data are randomly divided into 10 subsets, and in each experiment one subset is used as the test set and the remaining 9 subsets as the training set;
Step7, training a classifier: a random forest classifier is trained on the training set; for comparison, classifiers are also built on the original data set with Naive Bayes (NB), a Support Vector Machine (SVM) and a Decision Tree (DT);
Step8, realizing the classification method: after the classifier is trained, the binary classification results for the non-small cell lung cancer data are obtained on the test set.
The invention has the beneficial effects that:
The invention addresses the missing values and class imbalance in the data of non-small cell lung cancer patients. A median filling strategy improves data quality and thereby enables more effective classification, and the SMOTEENN combined sampling technique balances the classes and avoids classification that is excessively biased toward the majority class. The binary classification method, implemented with a random forest on the non-small cell lung cancer data set, achieves better classification accuracy than several other methods, is effective for predicting patient survival, and facilitates more accurate medical decisions.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a histogram of missing data values of the present invention;
FIG. 3 is a comparison graph of the classification performance of the present method before and after data balance according to the present invention;
FIG. 4 is a graph of feature importance after classification in accordance with the present invention.
Detailed Description
Example 1: as shown in FIGS. 1-4, a binary classification method for processing non-small cell lung cancer data with missing values and class imbalance comprises the following steps. First, the data are preprocessed: samples whose proportion of missing values is lower than 70% are filled with the median, samples whose proportion of missing values is higher than 70% are deleted, abnormal values are removed by Tukey's method, and the data are normalized by dispersion standardization. Second, the SMOTEENN combined sampling method, which combines oversampling and undersampling, is used to balance the data and so resolve the class imbalance in the data set. Finally, the balanced data set is used to train a random forest classifier, and the classification effect is tested on the test set, yielding a binary classification method that predicts non-small cell lung cancer survival while addressing missing values and class imbalance.
The binary classification method for processing non-small cell lung cancer data with missing values and class imbalance comprises the following specific steps:
Step1, missing value processing: this example uses the case data of non-small cell lung cancer patients in the SEER database, based on the November 2016 data submission. Because records before 2004 lack several key variables, only cases from 2004 onward are considered. The variables include 'Sex', 'Race recode', 'Marital status at diagnosis', 'Primary Site', 'ICD-O-3', 'Grade', 'Laterality', 'CS extension', 'CS lymph nodes', 'CS mets at dx', 'Derived AJCC Stage Group', 'RX Summ-Surg Prim Site', 'Chemotherapy recode', 'Age at diagnosis', 'CS tumor size', 'Regional nodes positive' and 'Regional nodes examined'; the outcome is the patient's 5-year survival. The missing values in the data set are shown in the histogram of FIG. 2. Deleting every sample that has a missing value would lose a large amount of information. Therefore, samples whose proportion of missing values is lower than 70% are kept and their missing parts are filled with the median, and samples whose proportion of missing values is higher than 70% are deleted.
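The rule in Step1 can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function name `impute_or_drop`, the use of `None` to mark a missing value, and the toy rows are all assumptions, the features are assumed to be numerically encoded, and because the text does not say how a sample with exactly 70% missing values is treated, the sketch keeps it.

```python
from statistics import median

def impute_or_drop(rows, max_missing_ratio=0.7):
    """Drop rows with too many missing values, then median-fill the rest.

    rows: list of equal-length lists; None marks a missing value (assumption).
    A row with exactly max_missing_ratio missing is kept (assumption).
    """
    n_cols = len(rows[0])
    kept = [r for r in rows
            if sum(v is None for v in r) / n_cols <= max_missing_ratio]
    # Column medians are computed from the observed values of the kept rows.
    medians = []
    for j in range(n_cols):
        observed = [r[j] for r in kept if r[j] is not None]
        medians.append(median(observed) if observed else None)
    return [[medians[j] if v is None else v for j, v in enumerate(r)]
            for r in kept]

# Toy data, purely illustrative.
rows = [
    [1.0, None, 3.0, 4.0],
    [2.0, 5.0, None, 8.0],
    [None, None, None, 1.0],  # 75% missing, so this row is deleted
    [4.0, 7.0, 9.0, None],
]
cleaned = impute_or_drop(rows)
print(cleaned)
```

Computing the medians after the deletion step means that heavily incomplete samples do not influence the fill values; the text does not state which order was used, so this ordering is also an assumption.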
Step2, abnormal value processing: abnormal values in the data set distort the analysis of the overall data and therefore need to be removed; outliers are detected and removed by Tukey's method;
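Tukey's method flags a value as an outlier when it falls outside the fences Q1 − 1.5·IQR and Q3 + 1.5·IQR, where IQR = Q3 − Q1. The sketch below is one possible reading of Step2, not the authors' code: the multiplier 1.5, the "inclusive" quartile convention, the function names and the toy rows are assumptions, since the text gives none of these details.

```python
from statistics import quantiles

def tukey_fences(values, k=1.5):
    """Return the lower and upper Tukey fences for one feature."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outlier_rows(rows, k=1.5):
    """Keep only rows in which every feature lies within its Tukey fences."""
    fences = [tukey_fences(col, k) for col in zip(*rows)]
    return [r for r in rows
            if all(lo <= v <= hi for v, (lo, hi) in zip(r, fences))]

# Toy data: the last row has an extreme value in the first feature.
rows = [[10, 1.0], [12, 1.2], [11, 0.9], [13, 1.1], [12, 1.0], [95, 1.05]]
kept = remove_outlier_rows(rows)
print(kept)
```

Removing the whole sample when any one feature is out of range matches the later statement that samples containing abnormal values are removed.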
Step3, data normalization: normalization is important when variables have different scales and ranges, because otherwise the algorithm favors variables with larger scales. The data are therefore normalized by min-max normalization, i.e. dispersion standardization, so that every variable lies between 0 and 1. The normalization function is as follows:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
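Min-max (dispersion) normalization maps each variable to the range 0 to 1 by subtracting the column minimum and dividing by the column range. A minimal sketch follows; the function name `min_max_scale` and the choice to map a constant column to 0.0 are assumptions, as the text does not cover the zero-range case.

```python
def min_max_scale(rows):
    """Scale every column of rows to [0, 1] with (x - min) / (max - min)."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant column: assumption
            for v, lo, hi in zip(r, lows, highs)
        ]
        for r in rows
    ]

# Two features on very different scales end up on the same scale.
scaled = min_max_scale([[50, 2.0], [60, 4.0], [70, 3.0]])
print(scaled)
```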
Step4, data imbalance processing: class imbalance has a crucial influence on classification; when the proportion of the majority class is too large, the classifier tends to be biased toward the majority class and still reports good accuracy, but the classification effect cannot be evaluated correctly in that case. The SMOTEENN combined sampling method, which combines oversampling and undersampling, is therefore used for balancing. The process is as follows. Using the SMOTE oversampling method, for each sample x in the minority class, the Euclidean distances from x to all samples in the minority class are calculated to obtain its k nearest neighbors. A sampling multiplier N is set according to the imbalance ratio, and for each minority-class sample x, several samples x_n are randomly selected from its k nearest neighbors. For each randomly selected x_n, a new sample is constructed from the original sample according to the formula x_new = x + rand(0,1) × |x − x_n|. The data set is thereby expanded into a data set T. Each sample in T is then predicted by the kNN method, and a sample is removed if its predicted label does not agree with its actual label.
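The two stages of Step4 can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: the helper names (`k_nearest`, `smote`, `enn`), the fixed random seed, the values of k and the toy points are all hypothetical, and each synthetic point starts from a randomly drawn minority sample rather than applying a fixed multiplier N per sample. The sketch uses the signed difference x_n − x, the usual form of SMOTE interpolation, so that each new point lies on the segment between x and x_n; the text writes the difference as |x − x_n|.

```python
import math
import random
from collections import Counter

def k_nearest(x, pool, k):
    """Indices of the k points in pool closest to x (Euclidean distance)."""
    order = sorted(range(len(pool)), key=lambda i: math.dist(x, pool[i]))
    return order[:k]

def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic minority points by neighbor interpolation."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = minority[:i] + minority[i + 1:]
        x_n = others[rng.choice(k_nearest(x, others, k))]
        gap = rng.random()  # rand(0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, x_n)])
    return synthetic

def enn(X, y, k=3):
    """Edited nearest neighbors: drop samples whose kNN vote disagrees."""
    keep_X, keep_y = [], []
    for i, (x, label) in enumerate(zip(X, y)):
        rest_X, rest_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        votes = Counter(rest_y[j] for j in k_nearest(x, rest_X, k))
        if votes.most_common(1)[0][0] == label:
            keep_X.append(x)
            keep_y.append(label)
    return keep_X, keep_y

# Toy data: 8 majority points (label 0) and 3 minority points (label 1).
majority = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
            [0.1, 0.4], [0.4, 0.1], [0.2, 0.3], [0.3, 0.0]]
minority = [[0.8, 0.8], [0.9, 0.7], [0.7, 0.9]]

synthetic = smote(minority, n_new=len(majority) - len(minority), k=2)
X_bal = majority + minority + synthetic
y_bal = [0] * len(majority) + [1] * (len(minority) + len(synthetic))
X_res, y_res = enn(X_bal, y_bal, k=3)
print(Counter(y_bal), Counter(y_res))
```

In practice a maintained library such as imbalanced-learn, which provides a `SMOTEENN` class, would normally be preferred over hand-written code; the text does not say which implementation was used.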
Step5, obtaining a balanced data set: after balancing, a new balanced data set is formed in which the number of minority-class samples is increased, so that class balance is achieved and the classifier can be evaluated more reliably. The classification performance of four balancing methods is compared, before and after balancing the data set, using the random forest classification algorithm. SMOTEENN achieves the highest values on the six evaluation indexes (precision, recall, G-mean, F1-Score, AUC and accuracy). Compared with the unbalanced data, the method greatly improves all indexes other than accuracy, which improves both the evaluability of the classification results and the accuracy of classification, as shown in FIG. 3;
Step6, data set division: the data set is divided by 10-fold cross validation: the experimental data are randomly divided into 10 subsets, and in each experiment one subset is used as the test set and the remaining 9 subsets as the training set;
Step7, training a classifier: a random forest classifier is trained on the training set; for comparison, classifiers are also built on the original data set with Naive Bayes (NB), a Support Vector Machine (SVM) and a Decision Tree (DT);
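The 10-fold division in Step6 can be sketched as an index generator: the sample indices are shuffled once, dealt into 10 subsets, and each subset serves as the test set exactly once. The function name `k_fold_indices`, the seed and the sample count of 25 are assumptions made for illustration.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # random division into subsets
    folds = [idx[i::k] for i in range(k)]     # k subsets of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

splits = list(k_fold_indices(25, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```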
Step8, realizing the classification method: after the classifier is trained, the binary classification results for the non-small cell lung cancer data are obtained on the test set. To illustrate the beneficial effects of the invention, the following steps are added. The results are compared with the correct labels, and the accuracy, precision, recall, G-mean, F1-Score and AUC of the classification are calculated. A comparison of the present method with other algorithms on these indexes is shown in Table 1.
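Table 1 (images BDA0002212914890000041 and BDA0002212914890000051, not reproduced): comparison of the present method with other algorithms on the evaluation indexes.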
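The six evaluation indexes named in Step8 can be computed from the confusion matrix and, for AUC, from predicted scores. The sketch below uses the common definitions (G-mean as the geometric mean of recall and specificity, AUC as the fraction of positive-negative pairs ranked correctly); the text does not give its formulas, so these definitions, the function names and the toy labels and scores are assumptions and are not results from Table 1.

```python
import math

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and G-mean for labels in {0, 1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "g_mean": math.sqrt(recall * specificity),
    }

def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores, purely illustrative.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.6, 0.35]
metrics = binary_metrics(y_true, y_pred)
metrics["auc"] = auc(y_true, scores)
print(metrics)
```

G-mean is useful on imbalanced data because it drops to zero when either class is never predicted correctly, whereas accuracy can stay high when the minority class is ignored.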
Step9, evaluating the influence of the cross-validation strategy: the influence of 5-fold and 10-fold cross validation on the random forest and on the present method is evaluated respectively. As shown in Table 1, whether 5-fold or 10-fold cross validation is used is not a decisive factor, although the random forest algorithm is affected more than the present method. The details in Table 1 show that the results vary only slightly between 5-fold and 10-fold cross validation, which demonstrates the reliability of the classification method's performance.
Step10, classification performance verification: first, the classification effect of the method before and after handling the missing-data and imbalance problems is compared; the method is then compared with other reference algorithms, namely naive Bayes, SVM and decision tree, using a series of evaluation indexes to verify its superiority. The comparison between the plain random forest and the present method shows that missing value imputation and combined sampling play an important role in the improvement.
Step11, feature importance: the data set is trained and classified by the random forest algorithm, the importance of each feature is evaluated, and the features are used to corroborate the classification accuracy. The contribution of each feature to each tree in the random forest is determined and averaged over the trees, and the averaged contributions of the features are then compared with one another. The feature importance shows that distant metastasis has a key influence on the survival prognosis of patients with non-small cell lung cancer, and that age at diagnosis, primary site and other factors also importantly affect survival. This agrees with the findings of most studies and shows the effectiveness and interpretability of the classification method in practice, as shown in FIG. 4.
In summary, missing values are filled with the median; samples containing abnormal values are detected and removed by the box-plot-based Tukey's method; the original data are linearly transformed by dispersion standardization; the unbalanced data set is balanced by the SMOTEENN combined sampling method, which combines oversampling and undersampling; and a random forest classifier is trained on the resulting data set and produces classification results on the test set. Experiments on the non-small cell lung cancer data set demonstrate the effectiveness and superiority of the method, which improves the classification precision for non-small cell lung cancer data with missing values and class imbalance and gives clinicians a data-supported reference for establishing a more beneficial treatment strategy for the patient.
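The averaging described in Step11 can be sketched as follows: each tree reports a contribution per feature, the contributions are averaged over the trees, and the features are ranked by the average. How each per-tree contribution is measured (for example by impurity decrease) is not specified in the text, so the sketch takes the per-tree values as given; the function name, the placeholder feature names and the numbers are hypothetical and are not results from FIG. 4.

```python
def mean_feature_importance(per_tree):
    """Average per-tree feature contributions and rank them, highest first.

    per_tree: list of dicts mapping feature name to that tree's contribution.
    """
    n_trees = len(per_tree)
    averaged = {
        name: sum(tree[name] for tree in per_tree) / n_trees
        for name in per_tree[0]
    }
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)

# Hypothetical contributions from three trees (placeholder feature names).
per_tree = [
    {"feature_a": 0.5, "feature_b": 0.3, "feature_c": 0.2},
    {"feature_a": 0.4, "feature_b": 0.4, "feature_c": 0.2},
    {"feature_a": 0.6, "feature_b": 0.1, "feature_c": 0.3},
]
ranking = mean_feature_importance(per_tree)
print(ranking)
```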
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (2)

1. A binary classification method for processing non-small cell lung cancer data with missing values and class imbalance, characterized by comprising the following steps: first, the data are preprocessed: samples whose proportion of missing values is lower than 70% are filled with the median, samples whose proportion of missing values is higher than 70% are deleted, abnormal values are removed by Tukey's method, and the data are normalized by dispersion standardization; second, the SMOTEENN combined sampling method, which combines oversampling and undersampling, is used to balance the data and so resolve the class imbalance in the data set; and finally, the balanced data set is used to train a random forest classifier, and the classification effect is tested on the test set, yielding a binary classification method that predicts non-small cell lung cancer survival while addressing missing values and class imbalance.
2. The binary classification method for processing non-small cell lung cancer data with missing values and class imbalance according to claim 1, characterized in that the method comprises the following specific steps:
Step1, missing value processing: samples whose proportion of missing values is lower than 70% are kept and their missing parts are filled with the median; samples whose proportion of missing values is higher than 70% are deleted;
Step2, abnormal value processing: abnormal values in the data set distort the analysis of the overall data and therefore need to be removed; outliers are detected and removed by Tukey's method;
Step3, data normalization: normalization is important when variables have different scales and ranges, because otherwise the algorithm favors variables with larger scales; min-max normalization is therefore applied to the data;
Step4, data imbalance processing: class imbalance has a crucial influence on classification; when the proportion of the majority class is too large, the classifier tends to be biased toward the majority class and still reports good accuracy, but the classification effect cannot be evaluated correctly in that case; the SMOTEENN combined sampling method, which combines oversampling and undersampling, is therefore used for balancing;
Step5, obtaining a balanced data set: after balancing, a new balanced data set is formed in which the number of minority-class samples is increased, so that class balance is achieved and the classifier can be evaluated more reliably;
Step6, data set division: the data set is divided by 10-fold cross validation: the experimental data are randomly divided into 10 subsets, and in each experiment one subset is used as the test set and the remaining 9 subsets as the training set;
Step7, training a classifier: a random forest classifier is trained on the training set, and classifiers are built on the original data set with Naive Bayes (NB), a Support Vector Machine (SVM) and a Decision Tree (DT);
Step8, realizing the classification method: after the classifier is trained, the binary classification results for the non-small cell lung cancer data are obtained on the test set.
CN201910904648.4A 2019-09-24 2019-09-24 Two-classification method for processing non-small cell lung cancer data with missing values and unbalance Pending CN110825819A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910904648.4A CN110825819A (en) 2019-09-24 2019-09-24 Two-classification method for processing non-small cell lung cancer data with missing values and unbalance

Publications (1)

Publication Number Publication Date
CN110825819A true CN110825819A (en) 2020-02-21

Family

ID=69548237

Country Status (1)

Country Link
CN (1) CN110825819A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018019696A (en) * 2010-11-30 2018-02-08 ザ チャイニーズ ユニバーシティ オブ ホンコン Detection of cancer related genes or molecular abnormalities
US20160292379A1 (en) * 2013-11-07 2016-10-06 Medial Research Ltd. Methods and systems of evaluating a risk of lung cancer
CN107506600A (en) * 2017-09-04 2017-12-22 上海美吉生物医药科技有限公司 The Forecasting Methodology and device of cancer types based on the data that methylate
CN107480839A (en) * 2017-10-13 2017-12-15 深圳市博安达信息技术股份有限公司 The classification Forecasting Methodology of high-risk pollution sources based on principal component analysis and random forest
CN108509982A (en) * 2018-03-12 2018-09-07 昆明理工大学 A method of the uneven medical data of two classification of processing
CN108764597A (en) * 2018-04-02 2018-11-06 华南理工大学 A kind of product quality control method based on integrated study
CN109783062A (en) * 2019-01-14 2019-05-21 中国科学院软件研究所 A kind of machine learning application and development method and system of people in circuit

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ANJU JAIN ET AL.: "Addressing Class Imbalance Problem in Medical Diagnosis: A Genetic Algorithm Approach", 《2017 INTERNATIONAL CONFERENCE ON INFORMATION, COMMUNICATION, INSTRUMENTATION AND CONTROL》 *
GUSTAVO E. A. P. A. BATISTA ET AL.: "A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data", 《ACM SIGKDD EXPLORATIONS NEWSLETTER》 *
MD. ANWAR HOSSEN ET AL.: "A comparison of some soft computing methods on Imbalanced data", 《INTERNATIONAL CONFERENCE ON CYBER SECURITY AND COMPUTER SCIENCE》 *
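崔琳爽: "Application of a composite XGBoost model to classification prediction on imbalanced data sets", 《China Master's Theses Full-text Database, Information Science and Technology》 *
王平 et al.: "Application of an improved random forest algorithm in breast tumor diagnosis" *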

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021179514A1 (en) * 2020-03-07 2021-09-16 华中科技大学 Novel coronavirus patient condition classification system based on artificial intelligence
CN111524599A (en) * 2020-04-24 2020-08-11 中国地质大学(武汉) New coronary pneumonia data processing method and prediction system based on machine learning
CN111524600A (en) * 2020-04-24 2020-08-11 中国地质大学(武汉) Liver cancer postoperative recurrence risk prediction system based on neighbor2vec
CN111524606A (en) * 2020-04-24 2020-08-11 郑州大学第一附属医院 Tumor data statistical method based on random forest algorithm
CN111524606B (en) * 2020-04-24 2024-01-30 郑州大学第一附属医院 Tumor data statistics method based on random forest algorithm
CN111767952A (en) * 2020-06-30 2020-10-13 重庆大学 Interpretable classification method for benign and malignant pulmonary nodules
CN111767952B (en) * 2020-06-30 2024-03-29 重庆大学 Interpretable lung nodule benign and malignant classification method
CN113223727A (en) * 2021-05-08 2021-08-06 浙江大学 Non-small cell lung cancer integrated prognosis prediction model and construction method, device and application thereof
CN113223727B (en) * 2021-05-08 2022-07-12 浙江大学 Non-small cell lung cancer integrated prognosis prediction model and construction method, device and application thereof
CN113539474A (en) * 2021-05-14 2021-10-22 内蒙古卫数数据科技有限公司 Disease identification method based on conventional inspection data
CN113539475A (en) * 2021-05-14 2021-10-22 内蒙古卫数数据科技有限公司 Disease screening and diagnosis method using blood routine test data only
CN113096814A (en) * 2021-05-28 2021-07-09 哈尔滨理工大学 Alzheimer disease classification prediction method based on multi-classifier fusion
CN113724779A (en) * 2021-09-02 2021-11-30 东北林业大学 SNAREs protein identification method, system, storage medium and equipment based on machine learning technology
CN114093448A (en) * 2021-11-24 2022-02-25 首都医科大学附属北京天坛医院 Construction method of disease risk prediction model
CN114093448B (en) * 2021-11-24 2022-07-01 首都医科大学附属北京天坛医院 Construction method of disease risk prediction model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200221