CN110825819A

CN110825819A - Binary classification method for processing non-small cell lung cancer data with missing values and class imbalance

Info

Publication number: CN110825819A
Application number: CN201910904648.4A
Authority: CN
Inventors: 赵阳; 马磊; 张力
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2019-09-24
Filing date: 2019-09-24
Publication date: 2020-02-21

Abstract

The invention relates to a binary classification method for processing non-small cell lung cancer data with missing values and class imbalance, and belongs to the technical field of data classification. First, the data are preprocessed: missing values are filled with the median, abnormal values are removed using Tukey's method, and the data are normalized using dispersion (min-max) standardization. Second, the SMOTEENN comprehensive sampling method, which combines oversampling and undersampling, is used to balance the data. Finally, the balanced data set is used to train a random forest classifier, and the classification effect is tested on the test set, thereby realizing a binary classification method that predicts non-small cell lung cancer survival while addressing missing values and class imbalance. Experiments on a non-small cell lung cancer data set demonstrate the effectiveness and superiority of the method, which improves the classification precision for non-small cell lung cancer data with missing values and class imbalance and contributes to more accurate medical decisions.

Description

Binary classification method for processing non-small cell lung cancer data with missing values and class imbalance

Technical Field

The invention relates to a binary classification method for processing non-small cell lung cancer data with missing values and class imbalance, in particular to a method that fills missing values with the median and balances the data using SMOTEENN comprehensive sampling, and belongs to the technical field of data classification.

Background

Lung cancer is a malignant tumor and has become one of the most fatal diseases in the world. Non-small cell lung cancer accounts for around 85% of all lung cancer cases; because of its high morbidity and mortality, it accounts for a significant fraction of medical expenditure and places a heavy burden on families and communities. It is therefore important to predict the survival of cancer patients more accurately and to make better clinical decisions in diagnosis and treatment, including the choice of treatment, the timing of treatment and the follow-up visits, all of which can have a large impact on the cost and effectiveness of treatment.

With the continuous development of medical informatization, a large amount of reliable cancer data is being used to construct machine learning models that predict patient survival. Survival prediction is one of the tasks of cancer prognosis, and five-year survival is an index commonly used in medicine to assess the efficacy of surgery and therapy. Early cancer survival prediction relied on the clinical features of the malignancy and on estimates drawn from physicians' experience; by applying more and more data mining techniques in healthcare, it is now possible to exploit medical data that were previously underutilized. However, medical data generally suffer from missing values and class imbalance, both of which have a large impact on classification accuracy.

Disclosure of Invention

The invention provides a binary classification method for processing non-small cell lung cancer data with missing values and class imbalance, which avoids classification that is excessively biased toward the majority class.

The technical scheme of the invention is as follows: a binary classification method for processing non-small cell lung cancer data with missing values and class imbalance, comprising the following steps. First, the data are preprocessed: samples whose proportion of missing values is lower than 70% are filled with the median, samples whose proportion of missing values is higher than 70% are deleted, abnormal values are removed using Tukey's method, and the data are normalized using dispersion standardization. Second, the SMOTEENN comprehensive sampling method, which combines oversampling and undersampling, is used to balance the data and thereby solve the class imbalance problem in the data set. Finally, the balanced data set is used to train a random forest classifier, and the classification effect is tested on the test set, thereby realizing a binary classification method that predicts non-small cell lung cancer survival while addressing missing values and class imbalance.

The binary classification method for processing non-small cell lung cancer data with missing values and class imbalance comprises the following specific steps:

Step 1, missing value processing: samples in the data set whose proportion of missing values is lower than 70% are retained and their missing entries are filled with the median; samples whose proportion of missing values is higher than 70% are deleted;

Step 2, abnormal value processing: abnormal values in the data set distort the analysis of the data as a whole and therefore need to be removed; outliers, i.e. abnormal values, are detected and removed using Tukey's method;

Step 3, data normalization: normalization is important when the variables have different scales and ranges, because otherwise the algorithm favors variables with larger scales; min-max normalization is therefore applied to the data;

Step 4, data imbalance processing: class imbalance has a crucial influence on classification; when the proportion of the majority class is too large, the classifier tends to be biased toward the majority class and may still report good classification accuracy, but that accuracy does not correctly reflect the classification effect; the SMOTEENN comprehensive sampling method, which combines oversampling and undersampling, is therefore used for balancing;

Step 5, obtaining a balanced data set: after balancing, a new balanced data set is formed in which the number of minority-class samples is increased, so that class balance is achieved and the classifier's performance can be evaluated more reliably;

Step 6, data set division: the data set is divided using 10-fold cross-validation, i.e. the experimental data are randomly divided into 10 subsets; in each experiment, one subset is used as the test set and the remaining 9 subsets are used as the training set;

Step 7, training a classifier: a random forest classifier is trained on the training set; for comparison, classifiers are also built on the original data set using Naive Bayes (NB), a Support Vector Machine (SVM) and a Decision Tree (DT);

Step 8, realizing the classification method: after the classifier is trained, the binary classification results for the non-small cell lung cancer data are obtained on the test set.
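The outlier removal of Step 2 can be illustrated with a short sketch of Tukey's fences. This is a minimal illustration under stated assumptions, not the patented implementation: the fence multiplier k = 1.5, the linear-interpolation quantile and the helper names `quantile`, `tukey_fences` and `remove_outlier_rows` are all choices made here, since the text only names Tukey's method.

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list (assumed convention)."""
    pos = (len(sorted_vals) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    s = sorted(values)
    q1, q3 = quantile(s, 0.25), quantile(s, 0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outlier_rows(rows, k=1.5):
    """Drop every sample that has at least one feature outside that feature's fences."""
    columns = list(zip(*rows))
    fences = [tukey_fences(col, k) for col in columns]
    return [
        row for row in rows
        if all(lo <= v <= hi for v, (lo, hi) in zip(row, fences))
    ]


# Toy data: the last sample has an extreme value in the first feature.
rows = [[1.0, 10.0], [2.0, 11.0], [3.0, 12.0], [4.0, 13.0], [100.0, 14.0]]
clean = remove_outlier_rows(rows)
print(clean)
```

In this toy data the first feature has Q1 = 2 and Q3 = 4, so the fences are -1 and 7 and the sample containing 100 is dropped, while the second feature stays within its fences for every sample.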

The beneficial effects of the invention are as follows:

The invention addresses the missing values and the class imbalance in the data of non-small cell lung cancer patients. The median filling strategy improves the data quality and thus enables more effective classification. Balancing the classes with the SMOTEENN comprehensive sampling technique avoids classification that is excessively biased toward the majority class. The binary classification method, realized with a random forest on the non-small cell lung cancer data set, achieves better classification accuracy than several other methods, is effective for predicting patient survival, and facilitates more accurate medical decisions.

Drawings

FIG. 1 is a schematic flow diagram of the present invention;

FIG. 2 is a histogram of the missing values in the data used by the present invention;

FIG. 3 is a graph comparing the classification performance of the present method before and after data balancing;

FIG. 4 is a graph of the feature importance obtained after classification according to the present invention.

Detailed Description

Example 1: as shown in FIGS. 1-4, a binary classification method for processing non-small cell lung cancer data with missing values and class imbalance comprises the following steps. First, the data are preprocessed: samples whose proportion of missing values is lower than 70% are filled with the median, samples whose proportion of missing values is higher than 70% are deleted, abnormal values are removed using Tukey's method, and the data are normalized using dispersion standardization. Second, the SMOTEENN comprehensive sampling method, which combines oversampling and undersampling, is used to balance the data and thereby solve the class imbalance problem in the data set. Finally, the balanced data set is used to train a random forest classifier, and the classification effect is tested on the test set, thereby realizing a binary classification method that predicts non-small cell lung cancer survival while addressing missing values and class imbalance.

Step 1, missing value processing: this example uses case data of non-small cell lung cancer patients from the SEER database, based on the November 2016 submission. Records before 2004 lack several key variables, so only cases after 2004 are considered. The variables include 'Sex', 'Race recode', 'Marital status at diagnosis', 'Primary Site', 'ICD-O-3', 'Grade', 'Laterality', 'CS extension', 'CS lymph nodes', 'CS mets at dx', 'Derived AJCC Stage Group', 'RX Summ-Surg Prim Site', 'Chemotherapy recode', 'Age at diagnosis', 'CS tumor size', 'Regional nodes positive' and 'Regional nodes examined'; the outcome is the 5-year survival of the patient. The missing values in the data set are shown in the histogram of FIG. 2. Deleting every sample that contains a missing value would lose a large amount of information. Therefore, samples whose proportion of missing values is lower than 70% are retained and their missing entries are filled with the median, while samples whose proportion of missing values is higher than 70% are deleted.
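The missing value rule of Step 1 can be sketched as follows. This is a minimal illustration under stated assumptions: missing entries are represented by `None`, the median is taken per feature over the retained samples, and a sample with exactly 70% missing values is kept (the text does not say how that boundary case is handled). The function name `handle_missing` and the toy values are hypothetical.

```python
from statistics import median

MISSING = None  # assumed marker for a missing entry


def handle_missing(rows, max_missing_ratio=0.7):
    """Drop samples whose missing ratio exceeds the threshold, then median-fill the rest."""
    kept = [
        row for row in rows
        if sum(v is MISSING for v in row) / len(row) <= max_missing_ratio
    ]
    n_features = len(kept[0]) if kept else 0
    # Median of each feature, computed from the observed values of the kept samples.
    medians = []
    for j in range(n_features):
        observed = [row[j] for row in kept if row[j] is not MISSING]
        medians.append(median(observed) if observed else MISSING)
    return [
        [medians[j] if v is MISSING else v for j, v in enumerate(row)]
        for row in kept
    ]


rows = [
    [1.0, 5.0, 2.0, 7.0],
    [3.0, None, 4.0, 9.0],     # 25% missing: filled
    [None, None, None, 8.0],   # 75% missing: dropped
    [5.0, 6.0, None, None],    # 50% missing: filled
]
filled = handle_missing(rows)
print(filled)
```

Computing the medians only after the heavily incomplete samples are dropped is a design choice of this sketch; the text does not state the order of the two operations.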

Step 3, data normalization: normalization is important when the variables have different scales and ranges, because otherwise the algorithm favors variables with larger scales. The data are therefore normalized using min-max normalization, i.e. dispersion standardization, so that each variable lies between 0 and 1. The normalization function is x' = (x - x_min) / (x_max - x_min), where x_min and x_max are the minimum and maximum values of the variable.
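The min-max scaling described above can be sketched in a few lines. The function name `min_max_scale`, the toy values and the handling of a constant column (mapped to 0 here) are assumptions made for illustration.

```python
def min_max_scale(column):
    """Linearly map the values of one variable onto the interval [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: the formula would divide by zero, so map to 0 (assumed).
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


ages = [40.0, 55.0, 70.0, 85.0, 100.0]  # hypothetical values of one variable
scaled = min_max_scale(ages)
print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```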

Step 4, data imbalance processing: class imbalance has a vital influence on classification. When the proportion of the majority class is too large, the classifier tends to be biased toward the majority class; it may still report good classification accuracy, but that accuracy does not correctly reflect the classification effect. The SMOTEENN comprehensive sampling method, which combines oversampling and undersampling, is therefore used for balancing. The process is as follows. The SMOTE oversampling method computes, for each sample x in the minority class, the Euclidean distance to all other samples of the minority class and obtains its k nearest neighbors. The sampling multiplier N is determined from a sampling rate set according to the imbalance ratio. For each minority-class sample x, several samples x_n are randomly selected from its k nearest neighbors, and for each selected x_n a new sample is constructed from the original sample according to the formula x_new = x + rand(0,1) × |x - x_n|. The data set is thereby expanded into a data set T. Each sample in T is then predicted using the kNN method, and a sample is removed if its predicted value is inconsistent with its actual label.
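The two stages described above, SMOTE oversampling followed by kNN-based cleaning (edited nearest neighbours), can be sketched in plain Python. This is a simplified illustration, not the patented implementation. One deliberate difference is named here: the code interpolates with the signed difference (x_n - x), as in the original SMOTE formulation, so that each new point lies on the segment between x and its neighbor, whereas the text writes the difference as |x - x_n|. The function names, the values of k, the label encoding (1 = minority) and the toy data are assumptions.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to point by Euclidean distance."""
    return sorted(candidates, key=lambda c: math.dist(point, c))[:k]


def smote(minority, n_per_sample, k=5, rng=None):
    """For each minority sample, create n_per_sample synthetic samples."""
    rng = rng or random.Random(0)
    synthetic = []
    for i, x in enumerate(minority):
        others = minority[:i] + minority[i + 1:]
        neighbors = k_nearest(x, others, k)
        for _ in range(n_per_sample):
            x_n = rng.choice(neighbors)
            gap = rng.random()  # rand(0, 1)
            synthetic.append([a + gap * (b - a) for a, b in zip(x, x_n)])
    return synthetic


def enn(samples, labels, k=3):
    """Keep only samples whose label matches the majority vote of their k neighbors."""
    kept_x, kept_y = [], []
    for i, (x, y) in enumerate(zip(samples, labels)):
        rest = [
            (math.dist(x, s), lab)
            for j, (s, lab) in enumerate(zip(samples, labels)) if j != i
        ]
        votes = [lab for _, lab in sorted(rest)[:k]]
        predicted = max(set(votes), key=votes.count)
        if predicted == y:
            kept_x.append(x)
            kept_y.append(y)
    return kept_x, kept_y


# Toy data: five clean majority samples, one noisy majority sample that sits
# inside the minority region, and three minority samples.
majority = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.2, 0.2], [0.95, 0.95]]
minority = [[0.9, 0.9], [1.0, 0.9], [0.9, 1.0]]

synthetic = smote(minority, n_per_sample=1, k=2, rng=random.Random(42))
X = majority + minority + synthetic
y = [0] * len(majority) + [1] * (len(minority) + len(synthetic))
kept_x, kept_y = enn(X, y, k=3)
print(len(X), "samples before cleaning,", len(kept_x), "after")
```

In this toy data the oversampling brings the minority class from three to six samples, and the cleaning stage removes the majority sample at (0.95, 0.95) because all of its nearest neighbors belong to the minority class. For real data, a maintained library implementation would normally be preferable to this sketch.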

Step 5, obtaining a balanced data set: after balancing, a new balanced data set is formed in which the number of minority-class samples is increased, so that class balance is achieved and the classifier's performance can be evaluated more reliably. Using the random forest classification algorithm, the classification performance of four balancing methods is compared before and after the data set is balanced. SMOTEENN obtains the highest values on all six evaluation indexes: precision, recall, G-mean, F1-Score, AUC value and accuracy. Compared with the data before balancing, the method greatly improves every index except accuracy, making the classification results easier to evaluate and the classification more accurate, as shown in FIG. 3.

Step 8, realizing the classification method: after the classifier is trained, the binary classification results for the non-small cell lung cancer data are obtained on the test set. To illustrate the beneficial effects of the invention, further evaluation steps are added: the results are compared with the true labels, and the accuracy, precision, recall, G-mean, F1-Score and AUC value of the classification are calculated. A comparison of the present method with other algorithms on these composite indicators is shown in Table 1.
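The evaluation indexes named above can be computed from the predictions as in the following sketch. The definitions are the conventional ones and are assumptions here, since the text does not define them: G-mean is taken as the square root of sensitivity times specificity, AUC is computed as the fraction of (positive, negative) pairs ranked correctly with ties counted as one half, and label 1 is treated as the positive class. The labels and scores are made-up toy values, not results from the patent.

```python
import math


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and G-mean from hard 0/1 predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "g_mean": math.sqrt(recall * specificity),
    }


def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]

metrics = binary_metrics(y_true, y_pred)
metrics["auc"] = auc(y_true, scores)
print(metrics)
```

Reporting G-mean and recall alongside accuracy matters on imbalanced data, because a classifier that always predicts the majority class can reach high accuracy while its recall and G-mean are zero.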

Step 9, evaluating the influence of the cross-validation strategy: the influence of 5-fold and 10-fold cross-validation on the random forest and on the present method is evaluated separately. As shown in Table 1, the choice between 5-fold and 10-fold cross-validation is not a decisive factor, although it affects the random forest algorithm more than the present method. The details in Table 1 show that the results vary only slightly between 5-fold and 10-fold, which demonstrates the reliability of the performance of the classification method.
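The subset partitioning behind 5-fold and 10-fold cross-validation can be sketched as follows. This is a minimal illustration: the generator name `k_fold_indices`, the fixed seed and the sample count of 25 are assumptions, and the sketch does not stratify by class.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) for each of the n_folds experiments."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random division into subsets
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        # The remaining n_folds - 1 subsets together form the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


for n_folds in (5, 10):
    splits = list(k_fold_indices(25, n_folds=n_folds))
    sizes = [len(test) for _, test in splits]
    print(n_folds, "folds, test set sizes:", sizes)
```

Each sample appears in exactly one test subset, so averaging the metric over the folds uses every sample once for testing.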

Step 10, classification performance verification: first, the classification effect of the method is compared before and after the missing-data and class-imbalance problems are handled; the method is then compared with other reference algorithms such as naive Bayes, SVM and decision trees, and a series of evaluation indexes is used to verify its superiority. Comparing the random forest alone with the present method shows that missing value imputation and comprehensive sampling play an important role in the improvement.

Step 11, feature importance: the random forest algorithm is trained on the data set and used for classification, the importance of each feature is evaluated, and the classification accuracy is verified in combination with the features. The contribution of each feature on each tree in the random forest is determined, the contributions are averaged over the trees, and the averaged contributions of the different features are compared. The feature importance shows that distant metastasis has a key influence on the survival prognosis of patients with non-small cell lung cancer, and that age at diagnosis, the site of the disease and similar factors are also important to patient survival. This agrees with the findings of most studies and indicates that the classification method is effective and interpretable in practice, as shown in FIG. 4.
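The averaging step described above, per-tree contributions averaged over the forest and then compared across features, can be sketched as follows. The per-tree values, the feature names `feature_a` to `feature_c` and both function names are hypothetical placeholders; how each tree's contribution is measured (for example by impurity decrease) is not specified in the text and is not modeled here.

```python
def mean_importance(per_tree_importances):
    """Average each feature's contribution over all trees of the forest."""
    n_trees = len(per_tree_importances)
    n_features = len(per_tree_importances[0])
    return [
        sum(tree[j] for tree in per_tree_importances) / n_trees
        for j in range(n_features)
    ]


def rank_features(names, importances):
    """Sort features from the largest to the smallest averaged contribution."""
    return sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)


# Hypothetical contributions of three features on three trees (each row sums to 1).
per_tree = [
    [0.50, 0.25, 0.25],
    [0.75, 0.00, 0.25],
    [0.25, 0.50, 0.25],
]
names = ["feature_a", "feature_b", "feature_c"]

ranking = rank_features(names, mean_importance(per_tree))
for name, value in ranking:
    print(f"{name}: {value:.2f}")
```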

In summary, missing values are filled with the median; samples containing abnormal values are detected and removed using Tukey's method, which is based on the box plot; the original data are linearly transformed using dispersion standardization; the unbalanced data set is balanced using the SMOTEENN comprehensive sampling method, which combines oversampling and undersampling; and a random forest classifier is trained on the resulting data set and produces the classification results on the test set. Experiments on the non-small cell lung cancer data set demonstrate the effectiveness and superiority of the method, which improves the classification precision for non-small cell lung cancer data with missing values and class imbalance and gives clinicians a data-supported reference for establishing a more beneficial treatment strategy for the patient.

While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims

1. A binary classification method for processing non-small cell lung cancer data with missing values and class imbalance, characterized in that the method comprises the following steps: first, the data are preprocessed, wherein samples whose proportion of missing values is lower than 70% are filled with the median, samples whose proportion of missing values is higher than 70% are deleted, abnormal values are removed using Tukey's method, and the data are normalized using dispersion standardization; second, the SMOTEENN comprehensive sampling method, which combines oversampling and undersampling, is used to balance the data so as to solve the class imbalance problem in the data set; and finally, the balanced data set is used to train a random forest classifier and the classification effect is tested on the test set, thereby realizing a binary classification method that predicts non-small cell lung cancer survival while addressing missing values and class imbalance.

2. The binary classification method for processing non-small cell lung cancer data with missing values and class imbalance according to claim 1, characterized in that the method comprises the following specific steps:

Step 4, data imbalance processing: class imbalance has a vital influence on classification; when the proportion of the majority class is too large, the classifier tends to be biased toward the majority class and may still report good classification accuracy, but that accuracy does not correctly reflect the classification effect; the SMOTEENN comprehensive sampling method, which combines oversampling and undersampling, is therefore used for balancing;

Step 7, training a classifier: a random forest classifier is trained on the training set, and classifiers are built on the original data set using naive Bayes (NB), a support vector machine (SVM) and a decision tree (DT);