CN114358169A

CN114358169A - Colorectal cancer detection system based on XGboost

Info

Publication number: CN114358169A
Application number: CN202111645179.2A
Authority: CN
Inventors: 邓菲; 赵琳; 于宁
Original assignee: Shanghai Institute of Technology
Current assignee: Shanghai Institute of Technology
Priority date: 2021-12-30
Filing date: 2021-12-30
Publication date: 2022-04-15
Anticipated expiration: 2041-12-30
Also published as: CN114358169B

Abstract

The invention relates to a colorectal cancer detection system based on XGBoost, comprising a data acquisition module, a data preprocessing module, a feature selection module, a model construction module and a result prediction module. These modules are respectively used for: constructing a colorectal cancer data set; preprocessing the data; performing feature selection with recursive feature elimination (RFE) to obtain a plurality of sub-data sets; constructing an XGBoost model, training it on the sub-data sets, and optimizing its parameters with a genetic algorithm to obtain a detection model; and using the detection model to predict the death category of colorectal cancer. Compared with the prior art, the invention selects features with RFE, combines them with the XGBoost machine learning algorithm to detect colorectal cancer death categories, and optimizes the model parameters with a genetic algorithm. It can analyze and predict colorectal cancer death categories quickly and effectively, identifies the minority classes among the multiple categories more accurately, and achieves higher accuracy, precision, recall and F1 value.

Description

Colorectal cancer detection system based on XGboost

Technical Field

The invention relates to the fields of machine learning and intelligent healthcare, and in particular to a colorectal cancer detection system based on XGBoost (eXtreme Gradient Boosting).

Background

Colorectal cancer is one of the three most common tumors in humans. It is a common malignant tumor of the digestive system, progresses rapidly, has high mortality, and has become one of the most common cancers in both men and women. Because its lethality is high, understanding the causes of death of colorectal cancer patients is important for research on the disease. Analyzing patient outcomes helps medical personnel judge a patient's condition more accurately, so that the patient receives appropriate treatment sooner and the best treatment window is not missed, which would otherwise lead to deterioration of the condition.

With the rapid development of big data and computing, machine learning (ML) techniques have been widely used in the medical field. Machine learning can mine latent information in medical data and uncover patterns that cannot be observed by the human eye. Its medical applications fall into three major areas: clinical diagnosis, precision medicine and health monitoring. In clinical diagnosis, machine learning is mainly applied to oncology, pathology and rare diseases: researchers can train models on oncology data to identify the type of cancer, use machine learning to diagnose pathological conditions and thereby overcome the limitations of traditional microscopic pathology, and combine machine learning with face recognition to help clinicians analyze and diagnose the manifestations of rare diseases. In precision medicine, machine learning uses the patient's genetic history, physical indicators, living area, environmental factors, lifestyle and similar information to provide an accurate prediction result, a treatment method and the likelihood of future illness. In health monitoring, machine learning monitors the health state of the human body in real time from the same kinds of information and issues prompts when needed. In summary, machine learning plays a very important role in the era of medical big data and is a future trend in medical diagnosis, precision treatment and health monitoring, so accurate analysis and prediction of colorectal cancer by means of machine learning is necessary.

Disclosure of Invention

The invention aims to overcome the defects in the prior art and provide a colorectal cancer detection system based on XGboost.

The purpose of the invention can be realized by the following technical scheme:

a colorectal cancer detection system based on XGboost comprises a data acquisition module, a data preprocessing module, a feature selection module, a model construction module and a result prediction module;

the data acquisition module is used for: acquiring data, and constructing a colorectal cancer data set with multiple characteristics and multiple categories;

the data preprocessing module is used for: carrying out data preprocessing on the constructed colorectal cancer data set;

the feature selection module is used for: performing feature selection with recursive feature elimination (RFE) to obtain a plurality of sub-data sets, wherein each sub-data set contains a different number of features;

the model building module is configured to: constructing an XGboost model, respectively training the XGboost model by using each subdata set, designing an optimization target, and optimizing parameters in the XGboost model by a genetic algorithm to obtain an optimal XGboost model as a detection model;

the result prediction module is used for: processing the features of the colorectal cancer data into the features corresponding to the detection model, and predicting with the detection model.

Further, in the data preprocessing module, the data preprocessing includes the following steps:

S1, converting the label of each data sample in the colorectal cancer data set to a numerical value;

S2, performing one-hot encoding on each data sample in the colorectal cancer data set, converting the categorical features of the data sample into binary data;

S3, checking for missing values and outliers, and removing the data samples that contain them;

S4, dividing the colorectal cancer data set into a training data set and a test set.

Further, in step S4, stratified sampling is implemented with the stratify parameter, and the colorectal cancer data set is divided into a training data set and a test set, so that the proportion of samples of each label in the training data set and in the test set is the same as in the colorectal cancer data set.

Further, in the feature selection module, the feature selection includes the following steps:

T1, acquiring the preprocessed colorectal cancer data set, wherein the number of features of each data sample in the data set is Num, the target feature quantities are set to K1, K2, …, Kn with K1 > K2 > … > Kn, and i is set to 1;

T2, sending the data set with Num features into the designed base classifier, which calculates and ranks the importance of each feature;

T3, if Num - Ki ≥ P, executing step T4, otherwise executing step T5, wherein P is a preset step length;

T4, deleting the P features with the lowest importance, reconstructing the data set, updating the feature number Num of the data set to Num - P, and executing step T2;

T5, deleting the (Num - Ki) features with the lowest importance, reconstructing and storing the data set, updating the feature number Num of the data set to Ki, and setting i = i + 1; if i ≤ n, executing step T2, otherwise ending, whereby the sub-data sets with K1, K2, … and Kn features respectively are obtained.

Further, the base classifier is a classifier that exposes a coef_ or feature_importances_ attribute.
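Steps T1 to T5 can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the function name `staged_rfe` and the `importance_fn` callback are hypothetical stand-ins for the base classifier (which in practice would be refitted on the surviving features at every pass), and the toy importance used in the demo is made up.

```python
from typing import Callable, Dict, List, Sequence


def staged_rfe(
    feature_ids: Sequence[int],
    targets: Sequence[int],
    step: int,
    importance_fn: Callable[[List[int]], Dict[int, float]],
) -> Dict[int, List[int]]:
    """Return {K_i: surviving feature ids} for every target size K_i.

    importance_fn is a hypothetical stand-in for fitting the base classifier
    on the current features and reading its coef_ / feature_importances_.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    if any(a <= b for a, b in zip(targets, targets[1:])):
        raise ValueError("targets must be strictly decreasing (K1 > K2 > ...)")
    if targets and targets[0] > len(feature_ids):
        raise ValueError("K1 cannot exceed the number of features")

    current = list(feature_ids)              # T1: start from all Num features
    subsets: Dict[int, List[int]] = {}
    for k in targets:                        # i = 1 .. n
        while len(current) - k >= step:      # T3: Num - Ki >= P, go to T4
            scores = importance_fn(current)  # T2: rank the features
            current.sort(key=lambda f: scores[f], reverse=True)
            del current[-step:]              # T4: drop the P least important
        scores = importance_fn(current)      # T2 again before the final cut
        current.sort(key=lambda f: scores[f], reverse=True)
        current = current[:k]                # T5: drop the remaining Num - Ki
        subsets[k] = sorted(current)         # store the sub-data set's features
    return subsets


# Toy demo: the importance of a feature is simply its id (made-up rule).
demo = staged_rfe(
    feature_ids=range(20),
    targets=[20, 12, 5],
    step=3,
    importance_fn=lambda feats: {f: float(f) for f in feats},
)
```

Because each smaller subset is derived from the previous one, the resulting feature sets are nested: every feature in the K2 subset also appears in the K1 subset, and so on.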

Further, the model building module executes the following steps:

P1, according to the division in step S4, each sub-data set comprises a training data set and a test set; stratify is used again to implement stratified sampling, dividing the training data set into a training set and a verification set; an XGBoost model is constructed and its model parameters are set;

P2, training the XGBoost model with the training set, further improving the classification performance of the XGBoost model with the verification set, and testing the performance of the XGBoost model with the test set, wherein the performance comprises accuracy, recall, precision and F1 value;

P3, calculating the sum of the accuracy and the F1 value of the XGBoost model on the test set, and optimizing the parameters of the XGBoost model with a genetic algorithm that takes this sum as the optimization target, to obtain the XGBoost models trained on the different sub-data sets;

P4, evaluating each XGBoost model on the test set to obtain the optimal XGBoost model as the detection model.

Further, in step S4, the colorectal cancer data set is divided into a training data set and a test set at a ratio of 9:1, and in step P1, the training data set is divided into a training set and a verification set at a ratio of 8:2.

Further, in step P2, the accuracy, recall, precision and F1 value are calculated as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where Accuracy denotes the accuracy, Recall the recall and Precision the precision; TP is the number of test-set samples whose true category is "positive" and which the XGBoost model also predicts as "positive"; FN is the number of test-set samples whose true category is "positive" but which the XGBoost model mispredicts as "negative"; FP is the number of test-set samples whose true category is "negative" but which the XGBoost model mispredicts as "positive"; and TN is the number of test-set samples whose true category is "negative" and which the XGBoost model also predicts as "negative".

Further, in step P3, optimizing parameters in the XGBoost model using a genetic algorithm specifically includes:

Step1, encoding the parameters to be optimized in the XGBoost model as an individual, setting the parameters of the genetic algorithm, generating an initial population containing a plurality of individuals, and initializing an empty global optimal solution;

Step2, taking the sum of the accuracy and the F1 value of the XGBoost model on the test set as the fitness function of the genetic algorithm, calculating the fitness value of each individual in the population, obtaining the optimal solution in the current population, and updating the global optimal solution;

Step3, selecting individuals with good fitness from the population by roulette-wheel selection and applying crossover and mutation to them to obtain a new population;

Step4, judging whether the population has converged; if so, outputting the global optimal solution, otherwise returning to Step2.

Further, the parameters of the genetic algorithm include: the value range of each parameter to be optimized in the XGBoost model, the population size, the number of iterations, the fitness function, the selection method, the crossover method and crossover probability, and the mutation method and mutation probability.
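The loop of Step1 to Step4 can be illustrated with a small, dependency-free genetic algorithm. This is a minimal sketch under stated assumptions, not the patented implementation: `run_ga` and `toy_fitness` are hypothetical names, individuals are real-valued vectors, single-point crossover and uniform re-sampling mutation are chosen for simplicity, convergence is a fixed iteration budget, and the toy fitness stands in for "Accuracy + F1 of a trained XGBoost model".

```python
import random
from typing import Callable, List, Sequence, Tuple


def run_ga(
    fitness: Callable[[List[float]], float],
    lb: Sequence[float],
    ub: Sequence[float],
    pop_size: int = 30,
    max_iter: int = 50,
    p_cross: float = 0.7,
    p_mut: float = 0.02,
    seed: int = 0,
) -> Tuple[List[float], float, List[float]]:
    """Maximize `fitness` over the box [lb, ub]; return (best_x, best_fit, history)."""
    if pop_size % 2:
        raise ValueError("pop_size must be even so parents can be paired")
    rng = random.Random(seed)
    dim = len(lb)

    # Step1: random initial population and an empty global optimum.
    population = [[rng.uniform(lb[j], ub[j]) for j in range(dim)] for _ in range(pop_size)]
    best_x: List[float] = []
    best_fit = float("-inf")
    history: List[float] = []

    for _ in range(max_iter):  # Step4: stop after a fixed number of iterations
        # Step2: evaluate every individual and update the global optimum.
        fits = [fitness(ind) for ind in population]
        i_best = max(range(pop_size), key=fits.__getitem__)
        if fits[i_best] > best_fit:
            best_x, best_fit = list(population[i_best]), fits[i_best]
        history.append(best_fit)

        # Step3a: roulette-wheel selection (weights shifted to be positive).
        low = min(fits)
        weights = [f - low + 1e-12 for f in fits]
        parents = rng.choices(population, weights=weights, k=pop_size)

        # Step3b: single-point crossover on consecutive pairs of parents.
        children: List[List[float]] = []
        for a, b in zip(parents[0::2], parents[1::2]):
            a, b = list(a), list(b)
            if dim > 1 and rng.random() < p_cross:
                cut = rng.randrange(1, dim)
                a[cut:], b[cut:] = b[cut:], a[cut:]
            children.extend([a, b])

        # Step3c: mutation re-samples a gene uniformly inside its bounds.
        for child in children:
            for j in range(dim):
                if rng.random() < p_mut:
                    child[j] = rng.uniform(lb[j], ub[j])
        population = children

    return best_x, best_fit, history


def toy_fitness(x: List[float]) -> float:
    """Made-up objective with its maximum (0.0) at x = (0.3, 0.7)."""
    return -((x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2)


best_x, best_fit, history = run_ga(toy_fitness, lb=[0.0, 0.0], ub=[1.0, 1.0])
```

Keeping the global optimum outside the population means the best solution found so far is never lost, even if crossover or mutation destroys it in a later generation.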

Compared with the prior art, the invention has the following beneficial effects:

(1) Feature selection on the colorectal cancer data set with RFE makes the differences between the classes of samples more distinct. The detection model built with the XGBoost machine learning method on the feature-selected colorectal cancer data can accurately identify the minority classes in the imbalanced data set and achieves high accuracy, precision, recall and F1 value; because the number of features is reduced, the running time of the model is also greatly shortened.

(2) The XGBoost model is optimized with a genetic algorithm that takes the sum of the model's accuracy and F1 value on the test set as the optimization target, so that the performance of the model reaches its best state. The optimized XGBoost model detects better and can judge the colorectal cancer death category accurately and quickly.

Drawings

FIG. 1 is a schematic diagram of the structure and working flow of the present invention;

FIG. 2 is a schematic diagram of the partitioning of a data set;

FIG. 3 is a schematic diagram of the XGboost model training;

FIG. 4 is a diagram of classification evaluation indexes on test sets with different feature quantities after the XGboost model is optimized;

reference numerals: 1, data acquisition module; 2, data preprocessing module; 3, feature selection module; 4, model construction module; 5, result prediction module.

Detailed Description

The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.

In the drawings, structurally identical elements are represented by like reference numerals, and structurally or functionally similar elements are represented by like reference numerals throughout the several views. The size and thickness of each component shown in the drawings are arbitrarily illustrated, and the present invention is not limited to the size and thickness of each component. Parts are exaggerated in the drawing where appropriate for clarity of illustration.

Example 1:

an XGboost-based colorectal cancer detection system is shown in fig. 1 and comprises a data acquisition module 1, a data preprocessing module 2, a feature selection module 3, a model construction module 4 and a result prediction module 5;

the data acquisition module 1 may be a data interface or the like, and is used for: acquiring data and constructing a colorectal cancer data set with multiple features and multiple categories, specifically as follows:

a colorectal cancer data set with multiple categories and multiple features is obtained by collecting data from medical databases at home and abroad, such as the UCI database and the TCGA database, and from relevant hospital websites. In this embodiment, the official TCGA data website is accessed first to find the clinical and transcriptome data corresponding to colorectal cancer; the data are then added to the cart and downloaded, and finally the downloaded colorectal cancer data set is read.

The data preprocessing module 2 is used for: carrying out data preprocessing on the constructed colorectal cancer data set, specifically as follows:

In step S1, the outcome field of the collected colorectal cancer data includes four case types, namely alive with no disease (AND), alive with disease (AWD), dead with no disease (DND) and dead with disease (DWD), i.e. 4 labels, and the numbers of data samples under these labels are 406, 64, 35 and 84 respectively. So that the machine learning method can distinguish the classes of the colorectal cancer data set, the four case types are converted into numbers that a computer can process, as follows: Alive no disease is represented by the label number "1", Alive with disease by the label number "2", Dead no disease by the label number "3", and Dead with disease by the label number "4".

In step S2, one-hot encoding is applied to the categorical discrete features in the original colorectal cancer data set. This removes any spurious ordering or magnitude among the category values, leaving only binary values of 0 and 1 that the computer can compare for equality. The colorectal cancer data set contains 589 data samples in 4 classes (4 labels), and each data sample has 17539 features after one-hot encoding.

In step S3, the one-hot encoded colorectal cancer data set is checked for missing values and outliers. The python3.6 statement data.isnull().any() is used to check for missing values (data is the name of the data set read in by python). It returns True for the features g_path_m_stage1, g_path_m_stage2 and g_path_m_stage3, indicating that these 3 features contain missing values; the labels of the affected samples are 4, 1, 1, 3, 1, 1 and 3. The python statement data.dropna(axis=0, how="any") is then used to remove the data samples containing missing values, leaving 582 samples in the colorectal cancer data set, of which class 1 has 402 samples, class 2 has 64, class 3 has 33 and class 4 has 83, each with 17539 features. After the missing values are handled, the 3-standard-deviation rule is used to check whether any feature of the samples contains an outlier; the python statement is data[abs(data - data.mean()) > 3 * data.std()], and all returned values are NaN, indicating that the data set contains no outliers.
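The three preprocessing steps above (label conversion, one-hot encoding, missing-value and outlier checks) can be illustrated on toy data. The embodiment uses pandas statements; the sketch below deliberately uses only the Python standard library so that it runs without dependencies. The column names `outcome`, `stage` and `expr`, the helper names, and the toy rows are all hypothetical and are not taken from the TCGA data.

```python
from statistics import mean, stdev

# Label mapping as stated in the text.
LABEL_TO_ID = {
    "Alive no disease": 1,
    "Alive with disease": 2,
    "Dead no disease": 3,
    "Dead with disease": 4,
}


def one_hot(rows, column):
    """Replace one categorical column with 0/1 indicator columns.

    A missing category (None) is propagated to every indicator column,
    so the missing-value check can still find it afterwards.
    """
    categories = sorted({r[column] for r in rows if r[column] is not None})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for c in categories:
            new_row[f"{column}_{c}"] = None if r[column] is None else int(r[column] == c)
        encoded.append(new_row)
    return encoded


def drop_missing(rows):
    """Keep only rows without any missing value, similar to dropna(how="any")."""
    return [r for r in rows if all(v is not None for v in r.values())]


def three_sigma_outliers(rows, column):
    """Return rows lying more than 3 sample standard deviations from the mean."""
    values = [r[column] for r in rows]
    mu, sigma = mean(values), stdev(values)
    return [r for r in rows if abs(r[column] - mu) > 3 * sigma]


def preprocess(rows):
    rows = [dict(r, outcome=LABEL_TO_ID[r["outcome"]]) for r in rows]  # S1
    rows = one_hot(rows, "stage")                                      # S2
    return drop_missing(rows)                                          # S3 (missing values)


# Toy data: 19 ordinary samples, one extreme value, one sample with a missing stage.
raw = [{"outcome": "Alive no disease", "stage": "m0", "expr": 10.0} for _ in range(19)]
raw.append({"outcome": "Dead with disease", "stage": "m1", "expr": 100.0})
raw.append({"outcome": "Dead no disease", "stage": None, "expr": 11.0})

clean = preprocess(raw)
outliers = three_sigma_outliers(clean, "expr")  # S3 (outlier check)
```

In this toy data the extreme value is flagged; in the embodiment the same kind of check found no outliers, so no further samples were removed.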

S4, dividing the colorectal cancer data set into a training data set and a testing data set:

In step S4, stratified sampling is implemented with the stratify parameter. The colorectal cancer data set, after missing values and outliers have been handled, is randomly divided into a training data set and a test set at a ratio of 9:1, so that the proportion of samples of each label in the training data set and in the test set is the same as in the data set before division. The python statement is:

from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(feature, label, test_size=0.1, stratify=label)

xtrain is the feature set of the training data set, xtest is the feature set of the test set, ytrain is the labels of the training data set, ytest is the labels of the test set, feature is the feature set (17539 features) of the colorectal cancer data after the samples with missing values and outliers are removed, label is the corresponding labels, test_size is the proportion of the test set in the colorectal cancer data set, and stratify specifies the labels used for stratified sampling.
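To make the effect of stratification concrete, the following dependency-free sketch splits the indices of each class separately, so that every class contributes about 10% of its samples to the test set. It is a simplified illustration: `stratified_split` is a hypothetical helper, and scikit-learn's own allocation of samples per class may differ by a sample or two from the per-class rounding used here. The class counts are the ones reported in the text (402, 64, 33 and 83).

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx) with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # random choice within the class
        n_test = round(len(indices) * test_size)  # per-class share of the test set
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [1] * 402 + [2] * 64 + [3] * 33 + [4] * 83
train_idx, test_idx = stratified_split(labels, test_size=0.1)
test_counts = Counter(labels[i] for i in test_idx)
```

Without stratification, a purely random 10% sample could easily contain no sample at all from the smallest class (33 samples), which would make the per-class test metrics meaningless.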

The colorectal cancer data have a very large number of features. Data with many features are often reduced in dimension to improve algorithm performance and reduce the complexity of calculation and analysis, while the retained features must remain representative. At present, the main dimension reduction approaches are feature extraction and feature selection. Feature extraction recombines the data linearly or nonlinearly and maps high-dimensional data into a low-dimensional space, thereby reducing the data dimension and compressing the data. Common feature extraction methods include PCA (Principal Component Analysis), MDS (Multidimensional Scaling), Isomap (Isometric Mapping) and LDA (Linear Discriminant Analysis).

Feature selection picks an optimal feature subset from the original data set according to some rule, such that the subset is better than the original feature set on the evaluation indexes. Unlike feature extraction, feature selection does not change the original characteristics of the data, which is important for medical data. Feature selection methods fall into 3 groups: filter methods, wrapper methods and embedded methods.

Filter methods select features mainly by means of statistical theory. Common filter methods include chi-square, Pearson correlation, mutual information, CFS (Correlation-based Feature Selection) and MRMR (Minimum Redundancy Maximum Relevance). Filter methods can indeed select good individual features, but good individual features do not necessarily form a good feature subset. Therefore the feature evaluation criterion needs to be combined with a feature search algorithm to obtain a feature subset with good prediction performance.

Wrapper methods bring a learning algorithm into the selection of the feature subset. After the data are input, the selected algorithm is trained on them to obtain the importance of each feature, the lowest-ranked features are removed, and the reduced subset is evaluated with the selected algorithm; training, removal and evaluation are then repeated on the reduced subset until the optimal subset is selected. Common wrapper methods include RFE and SVM-RFE.

Embedded methods combine the feature selection mechanisms of filter and wrapper methods, merge the feature selection process with the model training process, and complete the selection of the feature subset directly while the model is trained.

Weighing the advantages and disadvantages of these algorithms against the characteristics of the colorectal cancer data, the invention selects the wrapper method RFE (Recursive Feature Elimination) to perform feature selection on the colorectal cancer data set, with a customized design.

The feature selection module 3 is configured to: perform feature selection with recursive feature elimination (RFE) to obtain a plurality of sub-data sets, wherein each sub-data set contains a different number of features, specifically as follows:

T2, sending the data set with Num features into the designed base classifier, which calculates and ranks the importance of each feature; the base classifier is a classifier that exposes a coef_ or feature_importances_ attribute;

In this embodiment, an RFE feature selection model for the colorectal cancer data is established, and logistic regression, which exposes the coef_ attribute, is selected as the base classifier of the RFE. K is the number of features that RFE selects, i.e. the target feature quantity, and P is the step length, i.e. the number of features removed in each elimination round. The corresponding python statements are:

from sklearn.feature_selection import RFE

from sklearn.linear_model import LogisticRegression

rfe = RFE(LogisticRegression(), n_features_to_select=K, step=P)

In this embodiment, P = 200, n = 7, K1 = 17539, K2 = 2200, K3 = 1800, K4 = 1400, K5 = 1000, K6 = 600 and K7 = 200. In other embodiments, the values of P, n and Ki may be varied.

Running RFE with the set target feature quantities and step length yields, through the base classifier, 7 colorectal cancer sub-data sets with different numbers of features. K = 17539 corresponds to the original colorectal cancer data set and is included to show how the number of features affects the XGBoost model. The python statements are:

xtrain_rfe = rfe.fit_transform(xtrain, ytrain)

xtest_rfe = rfe.transform(xtest)

where K takes the values 200, 600, 1000, 1400, 1800, 2200 and 17539 in turn; xtrain_rfe denotes the training data set of the resulting sub-data set, and xtest_rfe denotes its test set.

The model building module 4 is configured to: construct an XGBoost model, train it on each sub-data set, design an optimization target, and optimize the parameters of the XGBoost model with a genetic algorithm to obtain the optimal XGBoost model as the detection model, specifically as follows:

The division of the data set is shown in fig. 2. After RFE feature selection, 7 sub-data sets are obtained, whose data samples contain 200, 600, 1000, 1400, 1800, 2200 and 17539 features respectively. Following the division in step S4, each sub-data set is divided into a training data set and a test set by 9:1 stratified sampling. For example, the sub-data set with 200 features has 582 samples in total, each with 200 features; the 582 samples are divided into a training data set and a test set by 9:1 stratified sampling, and the proportions of the samples of each label in the training data set and the test set are consistent with those in the sub-data set before division.

Similarly, the training data set is divided into a training set and a verification set by 8:2 stratified sampling. 7 XGBoost models are established and trained on the 7 sub-data sets respectively, with the following XGBoost parameters set: learning_rate, n_estimators, max_depth, gamma, reg_alpha, reg_lambda, min_child_weight, colsample_bytree, objective, num_class, random_state;

the XGboost model of python is established as follows:

from xgboost import XGBClassifier

x_train, validation, y_train, y_validation = train_test_split(xtrain_rfe, ytrain, test_size=0.2, stratify=ytrain, random_state=0)

eval_set = [(validation, y_validation)]

XGB = XGBClassifier(learning_rate=a, n_estimators=b, max_depth=c, gamma=d, reg_alpha=e, reg_lambda=f, min_child_weight=3, colsample_bytree=0.8, objective="multi:softmax", num_class=4, random_state=0)

wherein XGB denotes the established XGBoost model. The first 6 parameters are set to the variables a, b, c, d, e and f because these 6 parameters are later optimized with a genetic algorithm; the other parameters are set to constant values. objective is multi:softmax and num_class is 4 because the colorectal cancer data set is a multi-class data set with 4 classes.

The training of the model is shown in fig. 3. The XGBoost model is trained on the divided training set, and the verification set is used to further improve the classification performance of the model. Finally, the trained XGBoost model is tested on the test set. The python implementation is as follows:

XGB.fit(x_train, y_train, eval_metric="merror", early_stopping_rounds=40, eval_set=eval_set)

y_pred = XGB.predict(xtest_rfe)

wherein x_train denotes the feature set of the training set, y_train denotes the labels of the training set, eval_metric selects the multi-class error rate "merror" on the verification set as the evaluation metric, and early_stopping_rounds=40 stops training when the multi-class error rate on the verification set has not decreased for 40 consecutive rounds, which helps prevent the XGBoost model from over-fitting.
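The early-stopping rule can be illustrated independently of XGBoost. The sketch below is a simplified model of the idea, not XGBoost's internal code: `early_stopping_round` is a hypothetical helper that scans a list of per-round validation error rates and reports the best round and the round at which training would stop for a given patience. The error values in the demo are made up.

```python
def early_stopping_round(val_errors, patience):
    """Return (best_round, stop_round) for a sequence of validation errors.

    Training stops at the first round that lies `patience` rounds after the
    best round without any improvement (lower error) in between.
    """
    if not val_errors:
        raise ValueError("val_errors must not be empty")
    best_round = 0
    for rnd, err in enumerate(val_errors):
        if err < val_errors[best_round]:
            best_round = rnd                 # new best validation error
        elif rnd - best_round >= patience:
            return best_round, rnd           # patience exhausted: stop here
    return best_round, len(val_errors) - 1   # never triggered: ran to the end


# Made-up validation error curve: improves until round 2, then stalls.
errors = [0.50, 0.40, 0.30, 0.35, 0.36, 0.31, 0.32]
best_round, stop_round = early_stopping_round(errors, patience=3)
```

With a patience of 40, as in the embodiment, a model is allowed to go 40 boosting rounds without improving on the verification set before training is cut off.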

The performance of the XGBoost model on the test set is calculated, and a common binary confusion matrix is shown in table 1:

TABLE 1 binary confusion matrix

                     Predicted positive    Predicted negative
Actual positive      TP                    FN
Actual negative      FP                    TN

The calculation formulas of the accuracy, the recall, the precision and the F1 value are as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where Accuracy denotes the accuracy, Recall the recall and Precision the precision, and the F1 value is the harmonic mean of Precision and Recall. TP (true positive) is the number of test-set samples whose true category is "positive" and which the XGBoost model also predicts as "positive"; FN (false negative) is the number of test-set samples whose true category is "positive" but which the XGBoost model predicts as "negative"; FP (false positive) is the number of test-set samples whose true category is "negative" but which the XGBoost model predicts as "positive"; and TN (true negative) is the number of test-set samples whose true category is "negative" and which the XGBoost model also predicts as "negative".
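As a worked illustration of the four indexes, the following sketch computes them from the counts TP, FN, FP and TN using the standard definitions. The function name `binary_metrics` and the counts in the demo are hypothetical; they are not results from the embodiment.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0  # no positive predictions
    recall = tp / (tp + fn) if tp + fn else 0.0     # no actual positives
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts: 50 actual positives, 50 actual negatives.
m = binary_metrics(tp=40, fn=10, fp=5, tn=45)
```

For these counts the accuracy is 85/100, the recall is 40/50, the precision is 40/45, and the F1 value is 80/95, which lies between the precision and the recall as a harmonic mean must.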

P3, calculating the sum of the Accuracy of the XGboost model on the test set and the F1 value (Accuracy + F1), optimizing parameters in the XGboost model by using a genetic algorithm with the sum of the Accuracy of the XGboost model on the test set and the F1 value as an optimization target, and obtaining the XGboost model trained by different sub data sets;

The XGBoost colorectal cancer models trained on the 7 data sets with different numbers of features are each optimized with the genetic algorithm in the scikit-opt optimization library. The optimization target of the genetic algorithm is the sum (Accuracy + F1) of the accuracy and the F1 value of the XGBoost model on the test set, and the parameters optimized are learning_rate, n_estimators, max_depth, gamma, reg_alpha and reg_lambda of the XGBoost model.

In this embodiment, the initial population size of the genetic algorithm is 100 and the number of iterations is 300. For the 6 parameters to be optimized, the minimum values are lb = [0.01, 100, 3, 0, 0, 0], the maximum values are [0.5, 1000, 10, 1, 1, 1], and the corresponding precisions (search resolutions) are [1e-3, 1, 1, 1e-3, 1e-3, 1e-3]. The fitness function is Accuracy + F1. Individuals with higher fitness are selected from the population by roulette-wheel selection for crossover and mutation, and the crossover and mutation probabilities are set to 0.7 and 0.02 respectively.
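One way to picture the bounded, discretized search space is a decoding step that turns a raw GA individual into a valid XGBoost parameter dictionary. The sketch below is an assumption-laden illustration: the `decode` helper and the `UB` and `PRECISION` names are hypothetical, and how scikit-opt applies its precision setting internally may differ. Only the bounds, the precisions and the parameter order come from the text.

```python
PARAM_NAMES = ["learning_rate", "n_estimators", "max_depth", "gamma", "reg_alpha", "reg_lambda"]
LB = [0.01, 100, 3, 0, 0, 0]                    # minimum values from the text
UB = [0.5, 1000, 10, 1, 1, 1]                   # maximum values from the text
PRECISION = [1e-3, 1, 1, 1e-3, 1e-3, 1e-3]      # search resolution per parameter


def decode(individual):
    """Clip each gene to its bounds and snap it to the precision grid."""
    params = {}
    for name, x, lo, hi, prec in zip(PARAM_NAMES, individual, LB, UB, PRECISION):
        x = min(max(x, lo), hi)                 # keep the gene inside [lo, hi]
        steps = round((x - lo) / prec)          # nearest grid point above lo
        value = min(lo + steps * prec, hi)
        # Integer-valued parameters (precision 1) become ints, the rest floats.
        params[name] = int(round(value)) if prec >= 1 else round(value, 6)
    return params


# Made-up individual with two out-of-range genes (reg_alpha and reg_lambda).
example = decode([0.12345, 456.7, 5.4, 0.5, 2.0, -1.0])
```

A fitness function for the genetic algorithm would then build an XGBClassifier from such a dictionary, train it, and return Accuracy + F1 (negated if the optimizer minimizes).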

The convergence condition may be that the iteration number reaches the maximum iteration number, or the fitness of the global optimal solution reaches a preset threshold, or the global optimal solution is not updated in the latest Kp iterations. A genetic algorithm is run. In this embodiment, when the number of iterations reaches 300, the genetic algorithm stops the iterations, and outputs 7 sets of optimal XGBoost parameters with different feature numbers. The optimum parameters are shown in table 2. It can be seen from the table that when RFE selects 1000 features for colorectal cancer data, the value of Accuracy + F1 is the largest, i.e. it indicates that the XGBoost model can distinguish four categories of colorectal cancer more accurately when 1000 features are selected.

TABLE 2 XGboost optimized parameters

Lr represents learning_rate, Ne represents n_estimators, Md represents max_depth, g represents gamma, Ra represents reg_alpha, Rl represents reg_lambda, and Acc + F1 represents Accuracy + F1 (the sum of accuracy and F1).

The optimized parameters in Table 2 are stored to obtain 7 XGBoost models, which are trained on data sets with feature counts of 200, 600, 1000, 1400, 1800, 2200 and 17539 and optimized by the genetic algorithm. The performance (accuracy, precision, recall and F1) of each XGBoost model is tested, as shown in fig. 4, where the calculated values of accuracy and recall are the same. The confusion matrices of the 7 XGBoost models across the categories are compared in Table 3:
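One plausible reason that accuracy and recall coincide, assuming the reported recall is the support-weighted average over the categories, is that these two quantities are algebraically identical. The sketch below checks this on a made-up 4 x 4 confusion matrix; the numbers are illustrative only and are not taken from Table 3.

```python
def accuracy(cm):
    """Share of correctly classified samples; rows of cm are true classes."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def weighted_recall(cm):
    """Per-class recall averaged with weights equal to each class's support."""
    total = sum(sum(row) for row in cm)
    result = 0.0
    for i, row in enumerate(cm):
        support = sum(row)
        recall_i = cm[i][i] / support if support else 0.0
        result += recall_i * support / total
    return result


# Illustrative values only (not the patent's results).
example_cm = [
    [40, 2, 1, 0],
    [3, 5, 0, 1],
    [2, 0, 4, 1],
    [1, 1, 0, 6],
]
same = abs(accuracy(example_cm) - weighted_recall(example_cm)) < 1e-12
```

Each class's recall multiplied by its support gives its count of correct predictions, so the weighted sum reduces to the total number of correct predictions divided by the total number of samples.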

TABLE 3 XGboost optimized Classification confusion matrix

Here, AND represents the category Alive no flow, AWD represents the category Alive with flow, DND represents the category Dead no flow, and DWD represents the category Dead with flow.

The accuracy, precision, recall, and F1 values of 7 XGBoost models were compared, as shown in table 4:

TABLE 4 evaluation Performance on each class test set after optimization of XGboost model

Table 4 shows the precision, recall and F1 values of each category for the XGBoost model. Macro-avg denotes the macro average, i.e. the unweighted mean of the precision, recall and F1 values over all categories, and Weighted-avg denotes the weighted average, i.e. the mean of those values over all categories weighted by the number of samples in each category.
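The difference between the two averages can be shown with a short sketch. The function names are hypothetical, and the per-class values and sample counts below are illustrative only, not results from Table 4.

```python
def macro_avg(values):
    """Unweighted mean of a per-class metric."""
    return sum(values) / len(values)


def weighted_avg(values, supports):
    """Mean of a per-class metric weighted by each class's sample count."""
    total = sum(supports)
    return sum(v * s for v, s in zip(values, supports)) / total


# Illustrative per-class precision and sample counts (not from the patent).
per_class_precision = [1.0, 0.5]
per_class_support = [3, 1]

macro = macro_avg(per_class_precision)                           # 0.75
weighted = weighted_avg(per_class_precision, per_class_support)  # 0.875
```

On an imbalanced data set the weighted average is pulled toward the majority category, while the macro average gives every category the same influence, which is why both are useful when minority categories matter.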

The result prediction module 5 is configured to: process the features of the colorectal cancer data into the features corresponding to the detection model, and perform prediction using the detection model.
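A minimal sketch of the feature-processing step, under the assumption that each sample is a mapping from feature name to value and that the detection model expects the RFE-selected features in a fixed order. The function name `align_features`, the feature names and the choice to fill missing features with a default value are all hypothetical.

```python
def align_features(record, selected_features, fill_value=0.0):
    """Build the model input vector in the order the model was trained on.

    Features outside selected_features are dropped; features missing from
    the record are filled with fill_value (an assumption of this sketch).
    """
    return [float(record.get(name, fill_value)) for name in selected_features]


# Hypothetical feature names for illustration.
selected = ["g1", "g2", "g3"]
sample = {"g2": 1, "g1": 2, "g9": 5}
vector = align_features(sample, selected)
```

The resulting vector would then be passed to the trained detection model's prediction routine.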

In this application, XGBoost models are trained on data sets with different numbers of features and then optimized with a genetic algorithm. Comparing the optimized XGBoost models by their detection performance on the test set in Table 3 shows that the model trained on the data set with 1000 features gives the best classification result. With 1000 selected features, the optimized XGBoost classification detection system reaches an accuracy of 83%, a precision of 85%, a recall of 83% and an F1 value (F1-score) of 80%. The XGBoost model trained on the original data set (17539 features) without feature selection reaches an accuracy of 80%, a precision of 82%, a recall of 80% and an F1 value of 76%. Optimizing the number of features through feature selection therefore improves the classification performance of the XGBoost model by 3 to 4 percentage points on each metric.

The samples of the colorectal cancer data set are unevenly distributed across the categories, with the disease-free survival (AND) category having the most samples, so a model obtained with the XGBoost classifier tends to be biased toward the category with more samples; in the colorectal cancer data set, this is category 1 (402 samples). However, after RFE selects 1000 features and the XGBoost model is optimized with the genetic algorithm, analysis of the resulting confusion matrix, of the precision, recall and F1 value of each of the 4 categories, and of the overall accuracy shows that the trained XGBoost model distinguishes the four categories of the colorectal cancer data set more accurately. In other words, after feature selection with RFE, the features of each colorectal cancer category are more distinct, and the classifier can learn them better.

The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims

1. A colorectal cancer detection system based on XGboost is characterized by comprising a data acquisition module, a data preprocessing module, a feature selection module, a model construction module and a result prediction module;

2. An XGBoost-based colorectal cancer detection system according to claim 1, wherein the data pre-processing module comprises the following steps:

3. The system of claim 2, wherein in step S4, stratify is used to implement stratified sampling, and the colorectal cancer data set is divided into a training data set and a test set, so that the proportion of samples of each label in the training data set and in the test set is the same as in the colorectal cancer data set.

4. The system of claim 1, wherein the feature selection module performs feature selection by:

5. An XGboost-based colorectal cancer detection system according to claim 4, wherein the base classifier is a classifier containing a coef_ or feature_importances_ attribute.

6. An XGboost-based colorectal cancer detection system according to claim 3, wherein the model construction module performs the following steps:

7. The XGboost-based colorectal cancer detection system according to claim 6, wherein in step S4, the colorectal cancer data set is divided into a training data set and a test set according to a ratio of 9:1, and in step P1, the training data set is divided into a training set and a verification set according to a ratio of 8:2.

8. The system of claim 6, wherein the accuracy, recall, precision and F1 values in step P2 are calculated as follows:

9. The XGboost-based colorectal cancer detection system according to claim 6, wherein the step P3 of optimizing the parameters in the XGboost model by using a genetic algorithm is specifically as follows:

10. An XGBoost-based colorectal cancer detection system according to claim 9, wherein the parameters of the genetic algorithm comprise: the value range, the population scale, the iteration times, the fitness function, the selection mode, the crossing mode and the crossing probability, and the variation mode and the variation probability of the parameter to be optimized in the XGboost model.