CN112801140A

CN112801140A - XGBoost breast cancer rapid diagnosis method based on moth-flame optimization algorithm

Info

Publication number: CN112801140A
Application number: CN202110018388.8A
Authority: CN
Inventors: 胡雪梅; 徐蔚鸿; 陈沅涛
Original assignee: Changsha University of Science and Technology
Current assignee: Changsha University of Science and Technology
Priority date: 2021-01-07
Filing date: 2021-01-07
Publication date: 2021-05-14

Abstract

The invention relates to the field of breast cancer diagnosis, and in particular to an XGBoost breast cancer diagnosis method based on the moth-flame optimization (MFO) algorithm. The method comprises the following steps: acquiring an original breast cancer data set and normalizing it; screening features with an XGBoost model using default parameters to reduce the data dimensionality, and dividing the dimension-reduced breast cancer sample data set into a training sample set and a test sample set; optimizing the parameters of the XGBoost model with the moth-flame optimization algorithm; inputting the training sample set into the optimized XGBoost model for training, verifying the performance of the model with 10-fold cross validation, and measuring the trained model with the Accuracy index; and inputting the test sample set into the trained model to obtain a classification result, which is measured with the Accuracy, F1, G-mean and AUC indexes. Compared with the prior art, the method has the advantages of a simple model, interpretability, high prediction accuracy and high prediction speed.

Description

XGBoost breast cancer rapid diagnosis method based on moth-flame optimization algorithm

Technical Field

The invention relates to the field of breast cancer diagnosis, and in particular to an XGBoost breast cancer rapid diagnosis method based on the moth-flame optimization algorithm.

Background

Worldwide, breast cancer accounts for approximately 15% of all cancers affecting women and is a common cause of cancer-related death in women. Women of any age may develop breast cancer. Preventive screening is therefore the key to early detection and treatment, and screening programs have been successfully launched in many countries. For most people, however, the relevant medical resources are scarce: experts and medical staff specializing in serious diseases such as cancer are few, and medical resources are unevenly distributed. Applying machine learning to cancer diagnosis can greatly improve diagnostic efficiency.

Ensemble learning improves the stability and accuracy of a model by combining multiple weak classifiers into a strong classifier. XGBoost is an ensemble learning algorithm and an improvement on the GBDT algorithm. XGBoost applies a second-order Taylor expansion to the loss function, which improves the accuracy of the model, and adds a regularization term to the objective function in order to seek the overall optimal solution, so that the reduction of the objective function is balanced against the complexity of the model and overfitting is avoided. The algorithm performs well in feature extraction and data classification and can solve practical problems in the field of breast cancer diagnosis, but such applications are still lacking at present.

The performance of the XGBoost model depends on its parameter settings, and reasonable settings can greatly improve the overall effect of the model. Traditional manual tuning trains the algorithm on parameter sets chosen by hand and cannot guarantee that an optimal parameter combination is obtained. The conventional, widely used alternatives are grid search and random search. Random search, like manual tuning, cannot guarantee its result, while grid search consumes considerable resources; because previously evaluated parameter information is not taken into account, the search easily falls into a local minimum. A new and effective parameter tuning method is therefore needed to improve the model training effect.

The moth-flame optimization (MFO) algorithm is a novel intelligent optimization algorithm proposed by Seyedali Mirjalili in 2015. It provides a new heuristic search paradigm for the optimization field, with strong parallel optimization capability, good global search behaviour and a low tendency to fall into local extrema. The method of the invention first uses XGBoost with default parameters to extract features from the original breast cancer data set and reduce the sample data dimensionality, then uses the moth-flame optimization algorithm to optimize the parameters of the XGBoost model and obtain an optimal parameter set, and finally trains the optimized XGBoost model and verifies its performance with 10-fold cross validation.

Disclosure of Invention

The invention aims to solve the problems of diagnosis accuracy and diagnosis speed in the field of breast cancer diagnosis, and provides an XGBoost breast cancer diagnosis method based on the moth-flame optimization algorithm.

The purpose of the invention can be realized by the following technical scheme:

An XGBoost breast cancer diagnosis method based on the moth-flame optimization algorithm comprises the following steps:

1) acquiring an original breast cancer data set and normalizing it;

2) performing feature selection on the original breast cancer data set with an XGBoost feature selection method, so that the dimensionality of the data set is reduced;

3) dividing the dimension-reduced breast cancer data set into a training set and a test set according to a fixed proportion;

4) optimizing the XGBoost model parameters with the moth-flame optimization algorithm and determining an optimal parameter set;

5) training the optimized XGBoost classification model with the training sample set;

6) verifying the performance of the trained model by 10-fold cross validation, and measuring the quality of the model with the Accuracy, F1, G-Mean and AUC indexes;

7) inputting the test sample set into the trained XGBoost classification model for breast cancer classification diagnosis, and measuring the classification diagnosis effect of the model with the Accuracy, F1, G-Mean and AUC indexes.

In step 1), the original data set is a breast cancer data set downloaded from the UCI machine learning repository, and it is normalized as follows:

$$x'_{i,j} = \frac{x_{i,j} - Min(x_i)}{Max(x_i) - Min(x_i)} \tag{1}$$

In formula (1), $x'_{i,j}$ represents the data obtained by normalizing the jth data in the ith dimension of the breast cancer data set, $x_{i,j}$ represents the jth original data in the ith dimension, $Max(x_i)$ represents the maximum in the ith dimension, and $Min(x_i)$ represents the minimum in the ith dimension of the breast cancer data set;
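As an illustration of the normalization step, the following is a minimal Python sketch of per-dimension min-max scaling, assuming formula (1) is the usual (x - Min) / (Max - Min) rule implied by the definitions above. The function name `min_max_normalize`, the sample values, and the handling of a constant column are assumptions made for this example, not part of the source.

```python
def min_max_normalize(rows):
    """Scale every column (feature dimension) of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        scaled = []
        for value, low, high in zip(row, lows, highs):
            span = high - low
            # Assumption: a constant column (span == 0) is mapped to 0.0
            scaled.append((value - low) / span if span else 0.0)
        normalized.append(scaled)
    return normalized


# Hypothetical sample data: 3 samples, 2 feature dimensions
samples = [[1.0, 200.0], [3.0, 400.0], [5.0, 300.0]]
normalized_samples = min_max_normalize(samples)
```

Each column is scaled independently, so features measured on very different scales end up in the same [0, 1] range before feature selection.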

In step 2), the XGBoost feature selection method first divides the breast cancer data set into a training set and a test set at a ratio of 7:3, trains an XGBoost classification model with default parameters on the training set, ranks the features by the feature importance of the trained model, and excludes features whose importance scores are below a set threshold;

In step 4), each moth in the moth-flame optimization algorithm has 9 dimensions: the learning rate learning_rate, the number of trees n_estimators, the maximum tree depth max_depth, the minimum leaf node weight min_child_weight, the gamma value, the random sampling ratio subsample, the ratio colsample_bytree of columns randomly sampled for each tree, the weight reg_alpha of the L1 regularization term, and the weight reg_lambda of the L2 regularization term;
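To make the 9-dimensional encoding concrete, the sketch below maps a moth position (a list of 9 numbers) to a dictionary of named XGBoost parameters. The bounds in `SEARCH_SPACE` are hypothetical placeholders, since the search space table of the source is not reproduced in this text; the helper name `decode_moth` and the choice to clip and round are likewise assumptions.

```python
# Hypothetical search bounds: the source's parameter table is not reproduced
# here, so these ranges are placeholders chosen only for illustration.
SEARCH_SPACE = [
    ("learning_rate", 0.01, 0.3, float),
    ("n_estimators", 50, 500, int),
    ("max_depth", 3, 10, int),
    ("min_child_weight", 1, 10, int),
    ("gamma", 0.0, 1.0, float),
    ("subsample", 0.5, 1.0, float),
    ("colsample_bytree", 0.5, 1.0, float),
    ("reg_alpha", 0.0, 1.0, float),
    ("reg_lambda", 0.0, 1.0, float),
]


def decode_moth(position):
    """Clip a 9-dimensional moth position to the bounds and name each value."""
    if len(position) != len(SEARCH_SPACE):
        raise ValueError("position must have one value per parameter")
    params = {}
    for value, (name, low, high, kind) in zip(position, SEARCH_SPACE):
        clipped = min(max(value, low), high)
        # Integer-valued parameters are rounded to the nearest whole number
        params[name] = int(round(clipped)) if kind is int else float(clipped)
    return params


example_params = decode_moth([0.1, 120.4, 6.7, 0.2, 0.0, 0.8, 1.3, 0.05, 1.0])
```

A dictionary in this form could then be passed as keyword arguments to an XGBoost classifier when the fitness of the moth is evaluated.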

In step 4), the fitness function of a moth is the error rate err_rate of the XGBoost classification model. The fitness value of each moth is calculated by setting the parameters of the XGBoost classification model to the values specified by that moth, training the model on the training set, classifying with the trained model, and finally computing the classification error rate err_rate = 1 - Accuracy;

In step 4), the first generation of moths and flames is initialized: the positions of the moths take random values in the parameter search space, the fitness value of each moth is calculated, and the flame positions and their fitness values are set to the moth positions and fitness values sorted in ascending order of fitness;

In step 4), the number of flames in the moth-flame optimization algorithm is adaptively reduced according to the following formula:

$$flame\_no = round\left(N - l \times \frac{N - 1}{T}\right) \tag{2}$$

In formula (2), l is the current iteration number, N is the maximum number of flames, and T is the maximum number of iterations;
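The flame-reduction rule can be illustrated with a short Python sketch. It assumes formula (2) is the usual MFO schedule round(N - l × (N - 1) / T); the values N = 10 and T = 50 are the ones given later in the embodiment, and the function name `flame_count` is hypothetical.

```python
def flame_count(iteration, max_flames, max_iterations):
    """Number of flames kept at a given iteration: round(N - l * (N - 1) / T)."""
    return round(max_flames - iteration * (max_flames - 1) / max_iterations)


# Flame counts at iterations 0, 10, 20, 30, 40 and 50 for N = 10, T = 50
schedule = [flame_count(l, 10, 50) for l in range(0, 51, 10)]
```

The count starts at N and falls to 1 at the last iteration, so late iterations concentrate all moths around the single best position found.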

In step 4), the position of each moth relative to its flame is updated according to formulas (3) to (5):

$$M_i = S(M_i, F_j) \tag{3}$$

$$S(M_i, F_j) = D_i \cdot e^{bt} \cdot \cos(2\pi t) + F_j \tag{4}$$

$$D_i = |F_j - M_i| \tag{5}$$

where $M_i$ denotes the ith moth, $F_j$ denotes the jth flame, S denotes the spiral function, $D_i$ denotes the distance between the ith moth and the jth flame, b is a constant defining the shape of the logarithmic spiral (b is set to 1), and the path coefficient t is a random number in [-1, 1]. The spiral starts at the moth and ends at the position of the flame, and its range of fluctuation is the parameter search space;
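The spiral update can be sketched in Python as follows, using the expression S = D × e^(bt) × cos(2πt) + F with D = |F - M| and b = 1. Drawing a separate t for every dimension is an assumption of this example (the text only states that t is a random number in [-1, 1]), and the function name `spiral_move` and the sample positions are hypothetical.

```python
import math
import random


def spiral_move(moth, flame, b=1.0, rng=random):
    """Move one moth around its flame with S = D * e^(b t) * cos(2 pi t) + F."""
    new_position = []
    for m, f in zip(moth, flame):
        t = rng.uniform(-1.0, 1.0)   # path coefficient, drawn per dimension (assumption)
        distance = abs(f - m)        # D = |F - M| for this dimension
        new_position.append(distance * math.exp(b * t) * math.cos(2 * math.pi * t) + f)
    return new_position


# Hypothetical 3-dimensional positions; a seeded generator keeps the run repeatable
moved = spiral_move([0.2, 5.0, 0.9], [0.1, 6.0, 0.8], rng=random.Random(0))
```

Because |cos(2πt)| ≤ 1 and e^(bt) ≤ e for t in [-1, 1], the new position stays within D × e of the flame in each dimension, and a moth that already sits on its flame does not move.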

In step 4), the algorithm checks whether the current iteration number has reached the maximum iteration number; if so, it returns the position of the flame, which is the optimal parameter set found, and otherwise it continues the iterative search.

The invention has the beneficial effects that:

First, the invention provides an XGBoost breast cancer diagnosis method based on the moth-flame optimization algorithm, aimed at the problem of breast cancer diagnosis.

Second, the XGBoost classification model parameters are optimized with the moth-flame optimization algorithm, and the optimal parameter set is searched for along a heuristic spiral.

Third, the trained XGBoost classification model is used to diagnose breast cancer, with higher diagnosis accuracy and higher operating efficiency than the prior art.

Fourth, the performance of the model is verified by 10-fold cross validation; the final evaluation result is better than the prior art in Accuracy, which prevents overfitting to a certain extent and verifies the generalization of the classification model.

Drawings

Fig. 1 is a schematic flow diagram of the breast cancer diagnosis method that optimizes an XGBoost model with the moth-flame optimization algorithm in an embodiment of the present invention.

Fig. 2 is a schematic flow diagram of optimizing XGBoost with the moth-flame optimization algorithm in the embodiment.

Detailed Description

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

Examples

An XGBoost breast cancer rapid diagnosis method based on the moth-flame optimization algorithm extracts key feature attributes from an original breast cancer data set, optimizes the XGBoost model parameters with the moth-flame optimization algorithm, trains the model, and thereby realizes the diagnosis of breast cancer. Referring to Fig. 1, the specific flow of the method is as follows:

1. data set selection and preprocessing

The Breast Cancer Wisconsin (primary) data set WDBC from the UCI machine learning repository is selected. It contains 699 samples in 2 classes, namely 458 benign samples and 241 malignant samples, with 32 feature attributes. The data set is preprocessed as follows. First, feature attribute names, including id and class, are added to the original data set, and the id attribute is removed. In the class attribute column, benign samples, whose label value is B, are set to 1, and malignant samples, whose label value is M, are set to -1; this column is assigned to the class label vector y. The remaining attribute columns are assigned to the sample data X. Finally, the sample data X are normalized as follows:

$$X'_{i,j} = \frac{X_{i,j} - Min(X_i)}{Max(X_i) - Min(X_i)} \tag{1}$$

In formula (1), $X'_{i,j}$ represents the data obtained by normalizing the jth data in the ith dimension of the sample data X, $X_{i,j}$ represents the jth original data in the ith dimension, $Max(X_i)$ represents the maximum value in the ith dimension, and $Min(X_i)$ represents the minimum value in the ith dimension of the sample data X.

2. XGBoost feature selection processing and sample data set partitioning

The sample data set X and its labels y are divided into a training sample set and a test sample set at a ratio of 7:3. An XGBoost model with default parameters is trained on the training sample set, the importance score of each feature is read from the feature_importances_ attribute of the trained model, and a chart of the features sorted by ascending importance is drawn. Features with importance scores below 0.003099 are filtered out with the transform method of SelectFromModel in the sklearn feature_selection module, giving a new sample data set X with only 22 features. The new sample data set X and its labels y are again divided into a training sample set and a test sample set at a ratio of 7:3.
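The thresholding rule used in this step can be sketched in plain Python: features whose importance score reaches the threshold are kept, which is the same keep-if-greater-or-equal rule that SelectFromModel applies for a numeric threshold. The importance scores and feature names below are hypothetical placeholders, not values from the source; only the threshold 0.003099 comes from the text.

```python
def select_features(importances, threshold=0.003099):
    """Return the names of features whose importance reaches the threshold."""
    kept = [name for name, score in importances.items() if score >= threshold]
    # Sort from most to least important so the ranking is easy to inspect
    return sorted(kept, key=lambda name: importances[name], reverse=True)


# Hypothetical importance scores; the feature names are placeholders
scores = {
    "feature_a": 0.41,
    "feature_b": 0.0021,
    "feature_c": 0.12,
    "feature_d": 0.003099,
}
selected = select_features(scores)
```

In the method itself the scores would come from the trained default-parameter XGBoost model rather than from a hand-written dictionary.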

3. Optimization of XGBoost classification model parameters by the moth-flame optimization algorithm

The XGBoost model has many parameters. To further improve the classification accuracy of the model, the optimal parameter set of the model is searched for with the moth-flame optimization algorithm. Referring to Fig. 2, the specific steps by which the moth-flame optimization algorithm searches for the optimal parameter set of the XGBoost classification model are as follows:

Step one: initialize the first generation of moths and flames. The number of moths is set to n = 10, the number of variables of each moth to d = 9, and the maximum number of iterations to 50. The position matrices and fitness value vectors of the moths and the flames are expressed as follows:

$$M = \begin{bmatrix} m_{1,1} & m_{1,2} & \cdots & m_{1,d} \\ m_{2,1} & m_{2,2} & \cdots & m_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ m_{n,1} & m_{n,2} & \cdots & m_{n,d} \end{bmatrix}, \qquad OM = \begin{bmatrix} OM_1 \\ OM_2 \\ \vdots \\ OM_n \end{bmatrix}$$

$$F = \begin{bmatrix} f_{1,1} & f_{1,2} & \cdots & f_{1,d} \\ f_{2,1} & f_{2,2} & \cdots & f_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ f_{n,1} & f_{n,2} & \cdots & f_{n,d} \end{bmatrix}, \qquad OF = \begin{bmatrix} OF_1 \\ OF_2 \\ \vdots \\ OF_n \end{bmatrix}$$

where $m_{1,1}$ represents the first-dimension data value of the first moth, $OM_1$ represents the fitness value corresponding to the first moth, $f_{1,1}$ represents the first-dimension data value of the first flame, and $OF_1$ represents the fitness value corresponding to the first flame;

Step two: select 9 parameters of the XGBoost model as the variables of each moth, namely the learning rate learning_rate, the number of trees n_estimators, the maximum tree depth max_depth, the minimum leaf node weight min_child_weight, the gamma value, the random sampling ratio subsample, the ratio colsample_bytree of columns randomly sampled for each tree, the weight reg_alpha of the L1 regularization term and the weight reg_lambda of the L2 regularization term. The variables of each first-generation moth are randomly initialized within the search space in Table 1 and the fitness value of each moth is calculated; the first-generation flame positions and fitness values are set to the moth positions and fitness values sorted in ascending order of fitness;

Step three: reduce the number of flames adaptively. If the 10 moths always updated their positions with respect to 10 different positions in the search space, the local exploitation capability of the algorithm would be reduced. To solve this problem, an adaptive mechanism reduces the number of flames so as to balance the global exploration and local exploitation of the algorithm in the search space, and the number of flames is calculated according to formula (6):

$$flame\_no = round\left(10 - l \times \frac{10 - 1}{50}\right) \tag{6}$$

where l is the current iteration number, 10 is the maximum number of flames and 50 is the maximum number of iterations;

step four: updating the position of each moth, wherein the updating mechanism is as follows:

$$M_i = S(M_i, F_j) \tag{7}$$

$$S(M_i, F_j) = D_i \cdot e^{bt} \cdot \cos(2\pi t) + F_j \tag{8}$$

$$D_i = |F_j - M_i| \tag{9}$$

where $M_i$ denotes the ith moth, $F_j$ denotes the jth flame, $D_i$ denotes the distance between the ith moth and the jth flame, b is a constant defining the shape of the logarithmic spiral (b is set to 1), and the path coefficient t is a random number in [-1, 1];

Step five: calculate the fitness value of each moth. The position of each moth supplies the 9 designated parameters of an XGBoost classification model; the model is trained with the training sample set, and its classification error rate err_rate (1 - Accuracy) is taken as the fitness value of the moth;

Step six: set the flame positions and their fitness values. The updated moth positions and the flame positions are re-sorted by fitness value, and the positions with the smaller fitness values are selected as the positions of the next generation of flames;

Step seven: check whether the iteration termination condition is met. If the number of iterations has reached the maximum, the position parameters of the flame are returned as the optimal parameter set found; otherwise, the procedure returns to step two and continues the iterative search.
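The steps above can be sketched as one compact Python loop. This is a simplified illustration under stated assumptions, not the implementation of the source: the fitness function is a toy sphere function standing in for the XGBoost error rate, the bounds are hypothetical, t is drawn per dimension, positions are clipped to the bounds, and the loop repeats the flame reduction, position update and re-sorting after a single initialization. The names `mfo_minimize` and `sphere` are invented for the example.

```python
import math
import random


def mfo_minimize(fitness, bounds, n_moths=10, max_iter=50, b=1.0, seed=0):
    """Minimise `fitness` over the box `bounds` with a basic moth-flame loop."""
    rng = random.Random(seed)

    def clip(position):
        return [min(max(x, lo), hi) for x, (lo, hi) in zip(position, bounds)]

    # Random first generation; flames are the moths sorted by ascending fitness
    moths = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_moths)]
    flames = sorted(((fitness(m), m) for m in moths), key=lambda pair: pair[0])

    for iteration in range(1, max_iter + 1):
        # Adaptive flame count: round(N - l * (N - 1) / T)
        n_flames = round(n_moths - iteration * (n_moths - 1) / max_iter)
        for i in range(n_moths):
            # Moths beyond the current flame count share the last remaining flame
            flame = flames[min(i, n_flames - 1)][1]
            moved = []
            for m, f in zip(moths[i], flame):
                t = rng.uniform(-1.0, 1.0)
                moved.append(abs(f - m) * math.exp(b * t) * math.cos(2 * math.pi * t) + f)
            moths[i] = clip(moved)
        # Score the moved moths, then keep the best positions seen so far as flames
        scored = [(fitness(m), list(m)) for m in moths]
        flames = sorted(flames + scored, key=lambda pair: pair[0])[:n_moths]

    best_fitness, best_position = flames[0]
    return best_position, best_fitness


def sphere(position):
    """Toy objective standing in for err_rate = 1 - Accuracy."""
    return sum(x * x for x in position)


best_position, best_fitness = mfo_minimize(sphere, [(-5.0, 5.0)] * 3)
```

Because the flames always retain the best positions found so far, the returned fitness can never be worse than the best fitness of the initial generation. In the method described here, `fitness` would instead train an XGBoost classifier with the parameters encoded by the moth and return its error rate.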

TABLE 1 parameter search space

4. Training the optimized XGBoost classification model

An XGBoost classification model with the optimal parameters is trained on the training sample set and saved;

5. cross validation model training effect

10-fold cross validation is performed on the trained XGBoost classification model using the sample data set X and its labels y, and the classification effect of the model is measured with the evaluation indexes running time, Accuracy, F1, G-Mean and AUC;
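The 10-fold split can be illustrated with a small Python sketch that only produces the index partitions; model training and scoring on each fold are left out. The function name `k_fold_indices`, the shuffling seed, and the use of a plain (non-stratified) split are assumptions of this example, while the sample count 699 is the one stated for the data set.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding spreads the remainder so fold sizes differ by at most one
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        train = [idx for j, fold in enumerate(folds) if j != held_out for idx in fold]
        yield train, validation


splits = list(k_fold_indices(699, k=10))
```

Each sample is used for validation exactly once, and the reported indexes are the averages of the per-fold scores.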

6. breast cancer diagnosis

The test sample set is input into the trained classification model to obtain the diagnosis result, and the classification diagnosis effect of the model is measured with the evaluation indexes running time, Accuracy, F1, G-Mean and AUC;

7. description of evaluation index

In the breast cancer sample data set X, the classification label value of a benign tumor is 1 and that of a malignant tumor is -1. As can be seen from the confusion matrix in Table 2, TP represents the number of benign tumors correctly diagnosed as benign, FP represents the number of malignant tumors incorrectly diagnosed as benign, FN represents the number of benign tumors incorrectly diagnosed as malignant, and TN represents the number of malignant tumors correctly diagnosed as malignant. The indexes Accuracy, F1, G-Mean and AUC are calculated as follows:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = TPR = \frac{TP}{TP + FN}, \qquad TNR = \frac{TN}{TN + FP}$$

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

$$G\text{-}Mean = \sqrt{TPR \times TNR}$$

AUC is the area under the ROC curve; the ordinate of the ROC curve is the true positive rate TPR, and the abscissa is the false positive rate FPR = FP / (FP + TN).

Table 2 Confusion matrix of results

Actual value \ predicted value	Positive (1)	Negative (-1)
Positive (1)	TP (True Positive)	FN (False Negative)
Negative (-1)	FP (False Positive)	TN (True Negative)
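The evaluation indexes can be computed from the four confusion-matrix counts as in the sketch below. It uses the standard definitions with benign as the positive class: TP and TN are correct benign and malignant diagnoses, FP is a malignant tumor diagnosed as benign, and FN is a benign tumor diagnosed as malignant. The counts are hypothetical and are not results from the source; the function name `classification_metrics` is invented, and the sketch assumes no denominator is zero. AUC is omitted because it needs predicted scores rather than counts.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, F1 and G-Mean from confusion-matrix counts (positive = benign)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate (TPR)
    specificity = tn / (tn + fp)     # true negative rate (TNR)
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)
    return {"accuracy": accuracy, "f1": f1, "g_mean": g_mean}


# Hypothetical counts chosen only to exercise the formulas
metrics = classification_metrics(tp=90, fp=5, fn=10, tn=95)
```

G-Mean is included alongside Accuracy because it drops sharply when either class is classified poorly, which matters when benign and malignant samples are unevenly represented.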

8. Results of the experiment

The optimal parameter set of the XGBoost model found by the moth-flame optimization algorithm is shown in Table 3. Two groups of comparison experiments were also carried out. The first group applies different parameter optimization methods on top of feature selection, comparing a genetic optimization algorithm, grid search, a Bayesian optimization algorithm and the original model with the moth-flame optimization algorithm provided by the invention. The average classification Accuracy, average F1, average G-mean, average AUC and running time after 10-fold cross validation are shown in Table 4. The Tree-MFO-XGB model provided by the invention has the highest average Accuracy, average F1, average G-mean and average AUC, and the second-shortest running time; considering classification accuracy and running time together, the model provided by the invention gives the best classification effect. The second group, again after processing the original data set with feature selection, performs classification diagnosis with a support vector machine (SVM) model, a GBDT model, a random forest model and the K-nearest-neighbor algorithm, and compares them with the method provided by the invention under 10-fold cross validation. The results are shown in Table 5: the average Accuracy, average F1, average G-mean and average AUC of these models are all lower than those of the method provided by the invention, and, considering diagnosis accuracy and time together, the method provided by the invention is the most effective.

TABLE 3 optimal parameter set

Table 4 comparative experiment result 1 based on 10-fold cross validation

Table 5 comparative experiment result 2 based on 10-fold cross validation

The above embodiments describe in detail specific implementations of the XGBoost breast cancer rapid diagnosis method based on the moth-flame optimization algorithm. The description of these embodiments is intended only to help in understanding the method and core ideas of the present invention.

Claims

1. An XGBoost breast cancer rapid diagnosis method based on a moth-flame optimization algorithm, characterized by comprising the following steps:

(1) carrying out normalization processing on an original breast cancer data set to obtain a sample data set;

(2) using an XGBoost feature selection algorithm with default parameters to sort and screen the features of the sample data set according to feature importance, extract key features and reduce the dimensionality of the sample data, and dividing the dimension-reduced sample data set into a training sample set and a test sample set according to a fixed proportion;

(3) optimizing the XGBoost model parameters with the moth-flame optimization algorithm, determining an optimal parameter set, and inputting the training sample set into the optimized XGBoost model for training;

(4) verifying the performance of the trained model by 10-fold cross validation, and measuring the final classification effect of the model with the running time, Accuracy, F1, G-Mean and AUC indexes.

2. The XGBoost breast cancer diagnosis method based on the moth-flame optimization algorithm according to claim 1,

wherein, in the moth-flame optimization algorithm of step (3), the moths are assumed to be candidate parameter sets, that is, search individuals moving in the search space; the parameter variables to be solved are the positions of the moths in the search space, and the moth population is represented by the following matrix:

$$M = \begin{bmatrix} m_{1,1} & m_{1,2} & \cdots & m_{1,d} \\ m_{2,1} & m_{2,2} & \cdots & m_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ m_{n,1} & m_{n,2} & \cdots & m_{n,d} \end{bmatrix}$$

wherein n represents the number of moths and is set to 10, d represents the number of parameter variables to be solved and is set to 9, and the n moths are assumed to have a corresponding column vector of fitness values, represented as follows:

$$OM = \begin{bmatrix} OM_1 \\ OM_2 \\ \vdots \\ OM_n \end{bmatrix}$$

the flame is the best position found so far by the corresponding moth in the search space, and each moth updates its position using only the unique flame corresponding to it, which avoids being trapped in a local extremum; the positions of the moths and the positions of the flames in the search space are therefore variable matrices of the same dimension, expressed as follows:

$$F = \begin{bmatrix} f_{1,1} & f_{1,2} & \cdots & f_{1,d} \\ f_{2,1} & f_{2,2} & \cdots & f_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ f_{n,1} & f_{n,2} & \cdots & f_{n,d} \end{bmatrix}, \qquad OF = \begin{bmatrix} OF_1 \\ OF_2 \\ \vdots \\ OF_n \end{bmatrix}$$

wherein OF is the fitness value vector corresponding to the flames, and the positions of the flames and their fitness values are set to the positions and fitness values of the moths sorted in ascending order of fitness value.

3. The XGBoost breast cancer rapid diagnosis method based on the moth-flame optimization algorithm according to claim 1,

wherein, in step (3), the moth-flame optimization algorithm selects 9 parameters that strongly influence the XGBoost model, namely the learning rate learning_rate, the number of trees n_estimators, the maximum tree depth max_depth, the minimum leaf node weight min_child_weight, the gamma value, the random sampling ratio subsample, the ratio colsample_bytree of columns randomly sampled for each tree, the weight reg_alpha of the L1 regularization term and the weight reg_lambda of the L2 regularization term, and the number of control variables of each moth is equal to the number of XGBoost parameters to be optimized, which is 9.

4. The XGBoost breast cancer diagnosis method based on the moth-flame optimization algorithm according to claim 1,

wherein, in step (3), the moth-flame optimization algorithm selects the classification error rate of the XGBoost classification model as the fitness function of the moths, the error rate err_rate being calculated as err_rate = 1 - Accuracy.

5. The XGBoost breast cancer rapid diagnosis method based on the moth-flame optimization algorithm according to claim 1,

wherein, in step (3), the moth-flame optimization algorithm mathematically models the behaviour of moths flying toward a flame, and the position updating mechanism of each moth relative to the flame is represented by the following equations:

$$M_i = S(M_i, F_j) \tag{5}$$

$$S(M_i, F_j) = D_i \cdot e^{bt} \cdot \cos(2\pi t) + F_j \tag{6}$$

$$D_i = |F_j - M_i| \tag{7}$$

wherein $M_i$ denotes the ith moth, $F_j$ denotes the jth flame, S denotes the spiral function, $D_i$ denotes the distance between the ith moth and the jth flame, b is a constant defining the shape of the logarithmic spiral (b is set to 1), and the path coefficient t is a random number in [-1, 1]; the spiral starts at the moth and ends at the position of the flame, and its range of fluctuation is the parameter search space.

6. The XGBoost breast cancer rapid diagnosis method based on the moth-flame optimization algorithm according to claim 1,

wherein, in step (3), the moth-flame optimization algorithm uses an adaptive mechanism to reduce the number of flames during the iterative process, according to the following formula:

$$flame\_no = round\left(N - l \times \frac{N - 1}{T}\right)$$

where l is the current iteration number, N is the maximum number of flames, set to 10, and T is the maximum iteration number, set to 50.