CN110825819A - Two-classification method for processing non-small cell lung cancer data with missing values and unbalance - Google Patents
- Publication number
- CN110825819A (application CN201910904648.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- classification
- lung cancer
- small cell
- cell lung
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
Abstract
The invention relates to a binary classification method for processing non-small cell lung cancer data with missing values and class imbalance, belonging to the technical field of data classification. First, the data are preprocessed: missing values are filled with the median, abnormal values are removed using Tukey's method, and the data are normalized using dispersion (min-max) standardization. Second, the SMOTEENN comprehensive sampling method, which combines oversampling and undersampling, is used to balance the data. Finally, the balanced data set is used to train a random forest classifier, and the classification effect is tested on the test set, yielding a binary classification method that predicts non-small cell lung cancer survival while addressing missing values and class imbalance. Experiments on a non-small cell lung cancer data set demonstrate the effectiveness and superiority of the method: it improves the classification precision for non-small cell lung cancer data with missing values and class imbalance and contributes to more accurate medical decisions.
Description
Technical Field
The invention relates to a binary classification method for processing non-small cell lung cancer data with missing values and class imbalance, in particular to a method that fills missing values with the median and balances the data using SMOTEENN comprehensive sampling, and belongs to the technical field of data classification.
Background
Lung cancer is a malignant tumor and has become one of the deadliest diseases in the world. Non-small cell lung cancer accounts for around 85% of all lung cancer cases; because of its high morbidity and mortality, it accounts for a significant fraction of medical expenditure and places a heavy burden on families and society. It is therefore important to predict the survival of cancer patients more accurately and to make better clinical decisions in diagnosis and treatment, including the choice of treatment, the timing of treatment and follow-up visits, all of which can have a large impact on the cost and effectiveness of treatment.
With the continuous development of medical informatization, large amounts of reliable cancer data are being used to construct machine learning models that predict disease survival. Survival prediction is one of the tasks of cancer prognosis, and five-year survival is a commonly used index in medicine for assessing the efficacy of surgery and therapy. Early cancer survival prediction relied on the clinical features of the malignancy and on physicians' estimates from medical experience; by applying more and more data mining techniques in healthcare, it is now possible to further exploit medical data that were previously underutilized. However, medical data generally suffer from missing values and class imbalance, which have a large impact on classification accuracy.
Disclosure of Invention
The invention provides a binary classification method for processing non-small cell lung cancer data with missing values and class imbalance, which avoids classification that is excessively biased toward the majority class.
The technical scheme of the invention is as follows: a binary classification method for processing non-small cell lung cancer data with missing values and class imbalance, comprising the following steps: first, samples whose proportion of missing values is lower than 70% are filled using the median, samples whose proportion of missing values is higher than 70% are deleted, abnormal values are removed using Tukey's method, and the data are normalized using dispersion standardization; second, the SMOTEENN comprehensive sampling method, which combines oversampling and undersampling, is used to balance the data and thereby solve the class imbalance problem in the data set; finally, the balanced data set is used to train a random forest classifier, and the classification effect is tested on the test set, yielding a binary classification method that predicts non-small cell lung cancer survival while addressing missing values and class imbalance.
The specific steps of the binary classification method for processing non-small cell lung cancer data with missing values and class imbalance are as follows:
Step1, missing value processing: samples in the data set whose proportion of missing values is lower than 70% are retained and their missing parts are filled with the median, and samples whose proportion of missing values is higher than 70% are deleted;
Step2, abnormal value processing: abnormal values in the data set distort the analysis of the overall data and therefore need to be removed; outliers, i.e., abnormal values, are detected and removed using Tukey's method;
Step3, data normalization: normalization is important when variables have different scales and ranges, because otherwise the algorithm favors variables with larger scales; min-max normalization is therefore used to normalize the data;
Step4, data imbalance processing: class imbalance has a crucial influence on classification; when the proportion of the majority class is too large, the classifier tends to be biased toward the majority class and still reports good classification accuracy, but the classification effect cannot be correctly evaluated in that case; the SMOTEENN comprehensive sampling method, which combines oversampling and undersampling, is therefore used for balancing;
Step5, obtaining a balanced data set: after balancing, a new balanced data set is formed in which the number of minority-class samples is increased, so that class balance is achieved and the classifier's effect can be evaluated more reliably;
Step6, data set division: the data set is divided by 10-fold cross validation: the experimental data are randomly divided into 10 subsets; in each experiment, one subset is used as the test set and the remaining 9 subsets are used as the training set;
Step7, training a classifier: a random forest classifier is trained on the training set; for comparison, classifiers are also constructed on the original data set using Naive Bayes (NB), a Support Vector Machine (SVM) and a Decision Tree (DT);
Step8, realizing the classification method: after the classifier is trained, the binary classification results for the non-small cell lung cancer data are obtained on the test set.
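The concern raised in Step4, that a classifier biased toward the majority class can still report good accuracy, can be illustrated with a small worked example. The sketch below uses made-up labels (90 majority samples, 10 minority samples) and a hypothetical classifier that always predicts the majority class; none of these numbers come from the patent's data set.

```python
# Illustrative labels only: 0 = majority class, 1 = minority class.
y_true = [0] * 90 + [1] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the minority class: fraction of minority samples that were found.
minority_hits = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = minority_hits / y_true.count(1)

print(f"accuracy = {accuracy:.2f}")                # accuracy = 0.90
print(f"minority recall = {minority_recall:.2f}")  # minority recall = 0.00
```

Accuracy alone looks good here even though no minority sample is ever identified, which is why indexes such as recall, G-mean and F1-Score are needed alongside it.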
The invention has the beneficial effects that:
The invention addresses the missing values and class imbalance in non-small cell lung cancer patient data. The median filling strategy improves data quality and thereby enables more effective classification, and the SMOTEENN comprehensive sampling technique balances the classes of the data, which avoids classification that is excessively biased toward the majority class. The binary classification method realized with a random forest on the non-small cell lung cancer data set achieves better classification accuracy than several other methods, is effective for predicting patient survival, and facilitates more accurate medical decisions.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a histogram of missing data values of the present invention;
FIG. 3 is a comparison of the classification performance of the present method before and after data balancing;
FIG. 4 is a graph of feature importance after classification in accordance with the present invention.
Detailed Description
Example 1: as shown in FIGS. 1-4, a binary classification method for processing non-small cell lung cancer data with missing values and class imbalance comprises the following steps: first, the data are preprocessed: samples whose proportion of missing values is lower than 70% are filled using the median, samples whose proportion of missing values is higher than 70% are deleted, abnormal values are removed using Tukey's method, and the data are normalized using dispersion standardization; second, the SMOTEENN comprehensive sampling method, which combines oversampling and undersampling, is used to balance the data and thereby solve the class imbalance problem in the data set; finally, the balanced data set is used to train a random forest classifier, and the classification effect is tested on the test set, yielding a binary classification method that predicts non-small cell lung cancer survival while addressing missing values and class imbalance.
The specific steps of the binary classification method for processing non-small cell lung cancer data with missing values and class imbalance are as follows:
Step1, missing value processing: this example uses the case data of patients with non-small cell lung cancer in the SEER database, based on the November 2016 submission. Records before 2004 lack a number of key variables, so only cases from 2004 onward are considered. The variables include 'Sex', 'Race recode', 'Marital status at diagnosis', 'Primary Site', 'ICD-O-3', 'Grade', 'Laterality', 'CS extension', 'CS lymph nodes', 'CS mets at dx', 'Derived AJCC Stage Group', 'RX Summ--Surg Prim Site', 'Chemotherapy recode', 'Age at diagnosis', 'CS tumor size', 'Regional nodes positive' and 'Regional nodes examined'; the outcome is the 5-year survival of the patient. The missing values in the data set are shown in the histogram of FIG. 2. Deleting every sample that has a missing value would lose a large amount of information; therefore, samples whose proportion of missing values is lower than 70% are retained and their missing parts are filled with the median, and samples whose proportion of missing values is higher than 70% are deleted.
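The missing value rule of Step1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: missing entries are encoded as `None`, all variables are already numerically encoded, the median of each variable is taken over the retained samples, and a sample with exactly 70% missing values is kept (the text does not specify that boundary case). The function name `handle_missing` and the toy rows are hypothetical.

```python
from statistics import median

MISSING = None  # assumption: missing entries are encoded as None


def handle_missing(rows, max_missing_ratio=0.7):
    """Drop rows whose missing ratio exceeds the threshold, then median-fill the rest."""
    kept = [
        row for row in rows
        if sum(v is MISSING for v in row) / len(row) <= max_missing_ratio
    ]
    n_cols = len(kept[0]) if kept else 0
    # Column medians are computed from the observed values of the retained rows.
    medians = []
    for c in range(n_cols):
        observed = [row[c] for row in kept if row[c] is not MISSING]
        medians.append(median(observed) if observed else MISSING)
    return [
        [medians[c] if v is MISSING else v for c, v in enumerate(row)]
        for row in kept
    ]


rows = [
    [1.0, 10.0, 5.0, 2.0],
    [3.0, None, 7.0, 4.0],    # 25% missing: kept and filled
    [None, None, None, 6.0],  # 75% missing: dropped
    [5.0, 30.0, None, 8.0],   # 25% missing: kept and filled
]
filled = handle_missing(rows)
print(filled)
```

In this toy input the third row is removed, and the two remaining gaps are filled with 20.0 and 6.0, the medians of the observed values in their columns.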
Step2, abnormal value processing: abnormal values in the data set distort the analysis of the overall data and therefore need to be removed; outliers, i.e., abnormal values, are detected and removed using Tukey's method;
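Tukey's method flags a value as an outlier when it falls outside the fences Q1 - k×IQR and Q3 + k×IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1. The sketch below assumes the conventional multiplier k = 1.5 and the "inclusive" quartile definition of Python's `statistics.quantiles`; the patent does not state either choice. The helper names and the toy data are hypothetical.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outlier_rows(rows, k=1.5):
    """Keep only rows in which every feature lies within that feature's fences."""
    columns = list(zip(*rows))
    fences = [tukey_fences(col, k) for col in columns]
    return [
        row for row in rows
        if all(lo <= v <= hi for v, (lo, hi) in zip(row, fences))
    ]


# Toy data: the first feature runs from 1 to 9 plus one extreme value of 100.
rows = [[float(v), 1.0] for v in range(1, 10)] + [[100.0, 1.0]]
print(tukey_fences([row[0] for row in rows]))  # (-3.5, 14.5)
clean = remove_outlier_rows(rows)
print(len(clean))  # 9
```

Removing the whole sample when any one feature is out of range is one possible reading of "removing samples containing abnormal values"; other policies, such as clipping, would need a different helper.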
Step3, data normalization: normalization is important when variables have different scales and ranges, because otherwise the algorithm favors variables with larger scales; the data are therefore normalized using min-max normalization, i.e., dispersion normalization, so that each variable lies between 0 and 1. The normalization function is as follows: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$, where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the variable.
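Min-max (dispersion) normalization rescales each variable linearly as x' = (x - min) / (max - min), so that the smallest value maps to 0 and the largest to 1. A minimal sketch follows; the function name, the handling of constant columns and the sample values are assumptions made for illustration.

```python
def min_max_normalize(column):
    """Rescale a list of numbers linearly so that min -> 0 and max -> 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column carries no scale information; map it to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


tumor_sizes = [10, 20, 30, 50]  # illustrative values, not patent data
print(min_max_normalize(tumor_sizes))  # [0.0, 0.25, 0.5, 1.0]
```

Because the transformation depends on the observed minimum and maximum, an extreme value compresses all other values toward 0, which is one reason outliers are removed before this step.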
Step4, data imbalance processing: class imbalance has a crucial influence on classification. When the proportion of the majority class is too large, the classifier tends to be biased toward the majority class and still reports good classification accuracy, but the classification effect cannot be correctly evaluated in that case; the SMOTEENN comprehensive sampling method, which combines oversampling and undersampling, is therefore used for balancing. The process is as follows. Using the SMOTE oversampling method, for each sample $x$ in the minority class, the Euclidean distances from $x$ to all samples in the minority class are calculated to obtain its $k$ nearest neighbors. A sampling rate is set according to the imbalance ratio to determine the sampling multiplier $N$. For each minority-class sample $x$, several samples $x_n$ are randomly selected from its $k$ nearest neighbors, and for each randomly selected $x_n$ a new sample is constructed from the original sample according to the formula $x_{new} = x + \mathrm{rand}(0,1) \times |x - x_n|$. The new samples expand the data set into a data set $T$. Each sample in $T$ is then predicted using the kNN method, and the sample is removed if the predicted value is not consistent with its actual label.
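The two stages of Step4 can be sketched in plain Python: SMOTE-style interpolation between a minority sample and one of its k nearest minority neighbors, followed by an edited-nearest-neighbors (ENN) pass that drops every sample whose label disagrees with the majority vote of its k nearest neighbors. This is an illustrative sketch, not the patent's implementation. The helper names (`k_nearest`, `smote`, `enn`), the values of `k` and `n_per_sample`, and the toy points are all assumptions. The interpolation uses the signed difference (x_n - x), which is the usual SMOTE form and places the new point on the segment between x and x_n; the text above writes the term as |x - x_n|.

```python
import math
import random
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def k_nearest(idx, points, k):
    """Indices of the k points closest to points[idx], excluding idx itself."""
    others = (j for j in range(len(points)) if j != idx)
    return sorted(others, key=lambda j: euclidean(points[idx], points[j]))[:k]


def smote(minority, n_per_sample, k, rng):
    """Create n_per_sample synthetic points for every minority sample."""
    synthetic = []
    for i, x in enumerate(minority):
        neighbours = k_nearest(i, minority, k)
        for _ in range(n_per_sample):
            xn = minority[rng.choice(neighbours)]
            gap = rng.random()  # rand(0, 1)
            synthetic.append([a + gap * (b - a) for a, b in zip(x, xn)])
    return synthetic


def enn(X, y, k):
    """Drop samples whose label differs from the vote of their k nearest neighbours."""
    keep = []
    for i in range(len(X)):
        votes = Counter(y[j] for j in k_nearest(i, X, k))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


rng = random.Random(0)
# Toy data: label 0 is the majority class; its last point sits inside the
# minority cluster and plays the role of a noisy sample.
majority = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],
            [0.2, 0.4], [0.4, 0.1], [5.1, 5.1]]
minority = [[5.0, 5.0], [5.2, 5.1], [5.1, 5.3]]

synthetic = smote(minority, n_per_sample=1, k=2, rng=rng)
X = majority + minority + synthetic
y = [0] * len(majority) + [1] * (len(minority) + len(synthetic))
X_res, y_res = enn(X, y, k=3)
print(y_res.count(0), y_res.count(1))  # 6 6
```

In this toy example SMOTE doubles the minority class from 3 to 6 samples, and the ENN pass removes the single majority-labelled point lying inside the minority cluster. For real data, an established implementation such as the `SMOTEENN` class in the imbalanced-learn package is likely a better starting point than this sketch.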
Step5, obtaining a balanced data set: after balancing, a new balanced data set is formed in which the number of minority-class samples is increased, so that class balance is achieved and the classifier's effect can be evaluated more reliably. Before and after data set balancing, the classification performance of 4 balancing methods is compared using the random forest classification algorithm. SMOTEENN obtains the highest values on the 6 evaluation indexes (precision, recall, G-mean, F1-Score, AUC value and accuracy). Compared with the data before balancing, the method greatly improves all indexes other than accuracy, which improves both the evaluability of the classification results and the precision of classification, as shown in FIG. 3;
Step6, data set division: the data set is divided by 10-fold cross validation: the experimental data are randomly divided into 10 subsets; in each experiment, one subset is used as the test set and the remaining 9 subsets are used as the training set;
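The 10-fold division of Step6 can be sketched as an index generator: the sample indices are shuffled once, dealt into 10 subsets, and each subset serves as the test set exactly once. The function name `k_fold_indices`, the fixed seed and the sample count of 25 are assumptions made for illustration; stratifying the folds by class, which is common with imbalanced data, is not shown because the text describes a purely random division.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k subsets of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

Passing `k=5` to the same function gives the 5-fold variant that the later evaluation step compares against.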
Step7, training a classifier: a random forest classifier is trained on the training set; for comparison, classifiers are also constructed on the original data set using Naive Bayes (NB), a Support Vector Machine (SVM) and a Decision Tree (DT);
Step8, realizing the classification method: after the classifier is trained, the binary classification results for the non-small cell lung cancer data are obtained on the test set. The results are then compared with the correct labels to calculate the accuracy, precision, recall, G-mean, F1-Score and AUC value of the classification. A comparison between the present method and other algorithms on these indexes is shown in Table 1. To illustrate the beneficial effects of the invention, the following steps are added.
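Most of the evaluation indexes named in Step8 follow directly from the confusion matrix. The sketch below computes accuracy, precision, recall, G-mean (the geometric mean of sensitivity and specificity) and F1-Score; AUC is omitted because it requires predicted scores rather than hard labels. Treating label 1 as the positive class, reporting undefined ratios as 0, the function name `binary_metrics` and the toy labels are all assumptions.

```python
import math


def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix based indexes for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Assumption: an undefined ratio (zero denominator) is reported as 0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)       # sensitivity
    specificity = ratio(tn, tn + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "g_mean": math.sqrt(recall * specificity),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Toy labels: 3 true positives, 1 false negative, 1 false positive, 5 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For these toy labels the accuracy is 0.8, precision, recall and F1-Score are all 0.75, and the G-mean is the square root of 0.625, roughly 0.79.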
Step9, evaluating the influence of the cross-validation strategy: the influence of 5-fold and 10-fold cross validation on the random forest and on the present method is evaluated separately. As shown in Table 1, the choice between 5-fold and 10-fold cross validation is not a decisive factor, although it affects the random forest algorithm more than the present method. The details in Table 1 show that the results vary only slightly between 5-fold and 10-fold, which supports the reliability of the classification method's performance.
Step10, classification performance verification: first, the classification effect of the binary classification method before and after handling the missing-data and imbalance problems is compared; the method is then compared with other reference algorithms such as naive Bayes, SVM and decision trees, using a series of evaluation indexes to verify its superiority. Comparing the random forest alone with the present method shows that missing value filling and comprehensive sampling play an important role in the improvement.
Step11, feature importance: the data set is trained and classified with the random forest algorithm, the importance of each feature is evaluated, and the classification accuracy is verified in combination with the features. The contribution of each feature to each tree in the random forest is determined and averaged over the trees, and the averaged contributions are then compared across features. The feature importance shows that distant metastasis has a key influence on the survival prognosis of patients with non-small cell lung cancer, and that age at diagnosis, primary site and the like are also important factors affecting patient survival. This is consistent with the findings of most studies and demonstrates the practical effectiveness and interpretability of the classification method, as shown in FIG. 4.
In summary, missing values are filled with the median; samples containing abnormal values are detected and removed using the box-plot-based Tukey's method; the original data are linearly transformed using dispersion standardization; the unbalanced data set is balanced using the SMOTEENN comprehensive sampling method, which combines oversampling and undersampling; and a random forest classifier is trained on the resulting data set and produces classification results on the test set. Experiments on the non-small cell lung cancer data set show that the method is effective and superior, improves the classification precision for non-small cell lung cancer data with missing values and class imbalance, and gives clinicians a data-supported reference for establishing more beneficial treatment strategies for patients.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.
Claims (2)
1. A binary classification method for processing non-small cell lung cancer data with missing values and class imbalance, characterized in that the method comprises the following steps: first, the data are preprocessed: samples whose proportion of missing values is lower than 70% are filled using the median, samples whose proportion of missing values is higher than 70% are deleted, abnormal values are removed using Tukey's method, and the data are normalized using dispersion standardization; second, the SMOTEENN comprehensive sampling method, which combines oversampling and undersampling, is used to balance the data and thereby solve the class imbalance problem in the data set; finally, the balanced data set is used to train a random forest classifier, and the classification effect is tested on the test set, yielding a binary classification method that predicts non-small cell lung cancer survival while addressing missing values and class imbalance.
2. The binary classification method for processing non-small cell lung cancer data with missing values and class imbalance according to claim 1, characterized in that the specific steps of the method are as follows:
Step1, missing value processing: samples in the data set whose proportion of missing values is lower than 70% are retained and their missing parts are filled with the median, and samples whose proportion of missing values is higher than 70% are deleted;
Step2, abnormal value processing: abnormal values in the data set distort the analysis of the overall data and therefore need to be removed; outliers, i.e., abnormal values, are detected and removed using Tukey's method;
Step3, data normalization: normalization is important when variables have different scales and ranges, because otherwise the algorithm favors variables with larger scales; min-max normalization is therefore used to normalize the data;
Step4, data imbalance processing: class imbalance has a crucial influence on classification; when the proportion of the majority class is too large, the classifier tends to be biased toward the majority class and still reports good classification accuracy, but the classification effect cannot be correctly evaluated in that case; the SMOTEENN comprehensive sampling method, which combines oversampling and undersampling, is therefore used for balancing;
Step5, obtaining a balanced data set: after balancing, a new balanced data set is formed in which the number of minority-class samples is increased, so that class balance is achieved and the classifier's effect can be evaluated more reliably;
Step6, data set division: the data set is divided by 10-fold cross validation: the experimental data are randomly divided into 10 subsets; in each experiment, one subset is used as the test set and the remaining 9 subsets are used as the training set;
Step7, training a classifier: a random forest classifier is trained on the training set; for comparison, classifiers are also constructed on the original data set using Naive Bayes (NB), a Support Vector Machine (SVM) and a Decision Tree (DT);
Step8, realizing the classification method: after the classifier is trained, the binary classification results for the non-small cell lung cancer data are obtained on the test set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910904648.4A CN110825819A (en) | 2019-09-24 | 2019-09-24 | Two-classification method for processing non-small cell lung cancer data with missing values and unbalance |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110825819A (en) | 2020-02-21 |
Family
ID=69548237
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910904648.4A Pending CN110825819A (en) | 2019-09-24 | 2019-09-24 | Two-classification method for processing non-small cell lung cancer data with missing values and unbalance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110825819A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2018019696A (en) * | 2010-11-30 | 2018-02-08 | ザ チャイニーズ ユニバーシティ オブ ホンコン | Detection of cancer related genes or molecular abnormalities |
US20160292379A1 (en) * | 2013-11-07 | 2016-10-06 | Medial Research Ltd. | Methods and systems of evaluating a risk of lung cancer |
CN107506600A (en) * | 2017-09-04 | 2017-12-22 | 上海美吉生物医药科技有限公司 | The Forecasting Methodology and device of cancer types based on the data that methylate |
CN107480839A (en) * | 2017-10-13 | 2017-12-15 | 深圳市博安达信息技术股份有限公司 | The classification Forecasting Methodology of high-risk pollution sources based on principal component analysis and random forest |
CN108509982A (en) * | 2018-03-12 | 2018-09-07 | 昆明理工大学 | A method of the uneven medical data of two classification of processing |
CN108764597A (en) * | 2018-04-02 | 2018-11-06 | 华南理工大学 | A kind of product quality control method based on integrated study |
CN109783062A (en) * | 2019-01-14 | 2019-05-21 | 中国科学院软件研究所 | A kind of machine learning application and development method and system of people in circuit |
Non-Patent Citations (5)
Title |
---|
ANJU JAIN ET AL.: "Addressing Class Imbalance Problem in Medical Diagnosis: A Genetic Algorithm Approach", 《2017 INTERNATIONAL CONFERENCE ON INFORMATION, COMMUNICATION, INSTRUMENTATION AND CONTROL》 * |
GUSTAVO E. A. P. A. BATISTA ET AL.: "A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data", 《ACM SIGKDD EXPLORATIONS NEWSLETTER》 * |
MD. ANWAR HOSSEN ET AL.: "A comparison of some soft computing methods on Imbalanced data", 《INTERNATIONAL CONFERENCE ON CYBER SECURITY AND COMPUTER SCIENCE》 * |
崔琳爽: "复合XGBoost模型在不均衡数据集分类预测上的应用", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
王平 等: "改进的随机森林算法在乳腺肿瘤诊断中的应用", 《改进的随机森林算法在乳腺肿瘤诊断中的应用》 * |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021179514A1 (en) * | 2020-03-07 | 2021-09-16 | 华中科技大学 | Novel coronavirus patient condition classification system based on artificial intelligence |
CN111524599A (en) * | 2020-04-24 | 2020-08-11 | 中国地质大学(武汉) | New coronary pneumonia data processing method and prediction system based on machine learning |
CN111524600A (en) * | 2020-04-24 | 2020-08-11 | 中国地质大学(武汉) | Liver cancer postoperative recurrence risk prediction system based on neighbor2vec |
CN111524606A (en) * | 2020-04-24 | 2020-08-11 | 郑州大学第一附属医院 | Tumor data statistical method based on random forest algorithm |
CN111524606B (en) * | 2020-04-24 | 2024-01-30 | 郑州大学第一附属医院 | Tumor data statistics method based on random forest algorithm |
CN111767952A (en) * | 2020-06-30 | 2020-10-13 | Chongqing University | Interpretable classification method for benign and malignant pulmonary nodules |
CN111767952B (en) * | 2020-06-30 | 2024-03-29 | Chongqing University | Interpretable lung nodule benign and malignant classification method |
CN113223727A (en) * | 2021-05-08 | 2021-08-06 | Zhejiang University | Non-small cell lung cancer integrated prognosis prediction model and construction method, device and application thereof |
CN113223727B (en) * | 2021-05-08 | 2022-07-12 | Zhejiang University | Non-small cell lung cancer integrated prognosis prediction model and construction method, device and application thereof |
CN113539474A (en) * | 2021-05-14 | 2021-10-22 | Inner Mongolia Weishu Data Technology Co., Ltd. | Disease identification method based on conventional inspection data |
CN113539475A (en) * | 2021-05-14 | 2021-10-22 | Inner Mongolia Weishu Data Technology Co., Ltd. | Disease screening and diagnosis method using blood routine test data only |
CN113096814A (en) * | 2021-05-28 | 2021-07-09 | Harbin University of Science and Technology | Alzheimer disease classification prediction method based on multi-classifier fusion |
CN113724779A (en) * | 2021-09-02 | 2021-11-30 | Northeast Forestry University | SNAREs protein identification method, system, storage medium and equipment based on machine learning technology |
CN114093448A (en) * | 2021-11-24 | 2022-02-25 | Beijing Tiantan Hospital, Capital Medical University | Construction method of disease risk prediction model |
CN114093448B (en) * | 2021-11-24 | 2022-07-01 | Beijing Tiantan Hospital, Capital Medical University | Construction method of disease risk prediction model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110825819A (en) | Two-classification method for processing non-small cell lung cancer data with missing values and unbalance | |
Liu et al. | Optimizing survival analysis of XGBoost for ties to predict disease progression of breast cancer | |
Venkatesan et al. | Performance analysis of decision tree algorithms for breast cancer classification | |
Pepe et al. | Testing for improvement in prediction model performance | |
Adams et al. | Clinical prediction rules | |
Afshar et al. | Prediction of breast cancer survival through knowledge discovery in databases | |
Peng et al. | Random forest can predict 30‐day mortality of spontaneous intracerebral hemorrhage with remarkable discrimination | |
Morvan et al. | Leveraging RSF and PET images for prognosis of multiple myeloma at diagnosis | |
de Sousa Costa et al. | Classification of malignant and benign lung nodules using taxonomic diversity index and phylogenetic distance | |
Belciug | Logistic regression paradigm for training a single-hidden layer feedforward neural network. Application to gene expression datasets for cancer research | |
Kaur et al. | An integrated approach for cancer survival prediction using data mining techniques | |
Frésard et al. | Multi-objective optimization for personalized prediction of venous thromboembolism in ovarian cancer patients | |
Ray et al. | Transforming Breast Cancer Identification: An In-Depth Examination of Advanced Machine Learning Models Applied to Histopathological Images | |
Lamba et al. | Breast cancer prediction and categorization in the molecular era of histologic grade | |
Barrio et al. | Comparison of two discrimination indexes in the categorisation of continuous predictors in time-to-event studies | |
Baneshi et al. | Multiple imputation in survival models: applied on breast cancer data | |
AU2021102593A4 (en) | A Method for Detection of a Disease | |
Casey et al. | A machine learning approach to prostate cancer risk classification through use of RNA sequencing data | |
Cattelani et al. | Improved NSGA-II algorithms for multi-objective biomarker discovery | |
Sim et al. | Predicting disease-free lung cancer survival using Patient Reported Outcome (PRO) measurements with comparisons of five Machine Learning Techniques (MLT) | |
Nickolas et al. | Efficient pre-processing techniques for improving classifiers performance | |
Marwah et al. | Lung Cancer Survivability prediction with Recursive Feature Elimination using Random Forest and Ensemble Classifiers | |
Mythili et al. | CTCHABC-hybrid online sequential fuzzy Extreme Kernel learning method for detection of Breast Cancer with hierarchical Artificial Bee | |
KR102305806B1 (en) | Method for predicting prognosis in lung cancer patient using clinical information and gene polymorphism information | |
Antolini et al. | Graphical representations and summary indicators to assess the performance of risk predictors |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200221 |