CN114611719A

CN114611719A - XGBoost training method based on cuckoo search algorithm

Info

Publication number: CN114611719A
Application number: CN202210236632.2A
Authority: CN
Inventors: 胡雪梅; 徐蔚鸿
Original assignee: Changsha University of Science and Technology
Current assignee: Changsha University of Science and Technology
Priority date: 2022-03-11
Filing date: 2022-03-11
Publication date: 2022-06-10

Abstract

The invention relates to the field of machine learning, and in particular to a novel XGBoost training method based on cuckoo search (CS). XGBoost trained by this method (CS-based XGBoost) is applied to a real-world employee information data set from the enterprise personnel management field to predict employee turnover. In addition, the CS-based XGBoost is compared with existing XGBoost models trained by other optimization algorithms, including GA and PSO, and with four classifiers: GBDT, RF, SVM and KNN. Experimental results and the corresponding discussion show that the CS-based XGBoost is superior to the comparison models on the main performance indexes such as accuracy, precision and recall.

Description

XGBoost training method based on cuckoo search algorithm

Technical Field

The invention relates to the field of machine learning, and in particular to a novel XGBoost training method based on a cuckoo search algorithm.

Background

With the rapid development of artificial intelligence technology, machine learning algorithms are applied in various industries to solve practical problems. Data in every field is growing explosively along with industrial development, and such massive data cannot be processed effectively by manpower alone, so effective computer algorithms are urgently needed to analyze and exploit it. Applying artificial intelligence technology to the data of each field has therefore long been a research hotspot. XGBoost, a typical representative of ensemble learning techniques, can efficiently handle large-scale machine learning tasks. Since its introduction, owing to its performance advantages and its affordable time and memory complexity, it has been widely used in many research areas, ranging from cancer diagnosis and medical history analysis to credit risk assessment and metagenomics. Although the traditional XGBoost (i.e., XGBoost with default parameter settings) is widely applied, an original model without parameter optimization fits a given data set poorly, which results in poor generalization performance and adaptability. XGBoost has over thirty hyper-parameters, and its performance depends heavily on how they are optimized during training, so tuning them is very important.

Disclosure of Invention

The invention aims to solve the problem of parameter optimization during XGBoost model training, and provides a novel XGBoost training method based on a cuckoo search algorithm.

The purpose of the invention can be achieved by the following technical scheme:

A novel XGBoost training method based on a cuckoo search algorithm comprises the following steps:

(1) preprocessing the original data set: firstly, scaling each column of attribute values in the data set to the interval [0,1] by min-max normalization, and secondly, performing feature dimensionality reduction on the data set by a random forest feature selection method;

(2) dividing the preprocessed data set into a training set and a test set according to a user-defined proportion;

(3) training the hyper-parameters of the XGBoost by the XGBoost training method based on the cuckoo search algorithm;

(4) constructing the XGBoost according to the group of optimal parameter values obtained by training, and then inputting the training set to train the XGBoost;

(5) testing the trained XGBoost with the test set, and outputting a prediction result;

(6) evaluating the prediction performance of the XGBoost with the 4 model performance evaluation indexes Accuracy, Precision, Recall and F1 score.

In step (1), a random forest feature selection algorithm is adopted to screen the data set. Specifically, the data set is divided into a training set and a test set according to a fixed proportion; the training set is then used to train a random forest model, which outputs an importance score for each feature, and the scores are sorted in descending order; a feature importance score threshold is then set; finally, the features whose importance score is smaller than the threshold are deleted, which yields the data set after dimensionality reduction.

In step (3), the XGBoost is trained by the XGBoost training method based on the cuckoo search algorithm, specifically:

(4-1) determining the nest population size n; the dimension d of a nest position, namely the number of parameters to be optimized in the XGBoost; the discovery probability P_a; the upper and lower bounds of the nest search space; and the maximum number of iterations Max_iter. The classification Accuracy of the XGBoost model's predictions is set as the fitness function of a nest. The matrix representation Nest of the nest positions and the corresponding fitness vector NF are given by formula (1) and formula (2):

$$\mathrm{Nest} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,d} \end{bmatrix} \qquad (1)$$

$$\mathrm{NF} = \begin{bmatrix} f_1 & f_2 & \cdots & f_n \end{bmatrix}^{T} \qquad (2)$$

wherein: n represents the number of nests, d represents the dimension of a nest position, x_{i,j} represents the j-th dimension of the i-th nest, and f_i represents the fitness value corresponding to the i-th nest.

(4-2) randomly initializing the nest positions: within the search space S (S = [lb, ub]), each nest position is initialized according to x_{*,j} = random(lb_j, ub_j), wherein ub_j and lb_j are respectively the upper and lower search bounds of the j-th hyper-parameter to be optimized, and random() represents a random function that returns a random number within the range [lb_j, ub_j];

(4-3) calculating the fitness function value of each nest according to the set fitness function, and retaining the optimal nest gt (namely the nest position vector with the maximum fitness value);

(4-4) updating the nest positions by Lévy flight: the position of each current nest is changed randomly by formula (3) to obtain a group of new nest positions; each new nest position is compared with the old one, and the position with the larger fitness value is retained;

$$x_i^{(t+1)} = x_i^{(t)} + \alpha \oplus L(\lambda) \qquad (3)$$

wherein: α > 0 is the step size scaling factor, ⊕ denotes entry-wise multiplication, and L(λ) represents the Lévy flight function, i.e., Lévy ~ u = t^{-λ}, (1 < λ ≤ 3).

(4-5) discarding a fraction of the worse nests and building new ones: looping from the 1st nest to the n-th nest, a uniformly distributed random number r ∈ [0,1] is generated in each pass; if r is larger than P_a, the nest position is updated by formula (4), otherwise it is not updated. When the loop finishes, a group of new nest positions is obtained;

$$X_{ij}^{(t+1)} = X_{ij}^{(t)} + \alpha s \otimes H(P_a - \varepsilon) \otimes \left( X_{lj}^{(t)} - X_{kj}^{(t)} \right) \qquad (4)$$

wherein X_{lj} and X_{kj} are randomly selected solutions, H(μ) is the Heaviside function, P_a is a switching parameter for balancing local and global random walks, s is the step size, and ε is a uniformly distributed random number.

(4-6) calculating the fitness corresponding to the updated nest positions, and retaining the locally optimal nest pt (namely, the nest position vector with the maximum fitness value among the current nests);

(4-7) comparing the fitness values of pt and gt, and if the fitness value of pt is larger than that of gt, updating the global optimum gt;

(4-8) the update of the global optimum gt covers both its nest position GXbox and its fitness value Gfmax;

(4-9) judging whether the maximum number of iterations is reached: if not, returning to (4-4) to continue the loop; otherwise, returning the globally optimal nest position gt.
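Steps (4-1) to (4-9) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the Lévy step is drawn with Mantegna's algorithm, the step is scaled by the width of each search interval, and the defaults `pa=0.25`, `alpha=0.01` and `beta=1.5` are assumed values that the text does not give. A toy function stands in for the XGBoost accuracy fitness, and all function names are hypothetical.

```python
import math
import random


def levy_step(rng, beta=1.5):
    """One heavy-tailed step via Mantegna's algorithm (an assumed choice)."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma = (num / den) ** (1 / beta)
    u = rng.gauss(0, sigma)
    v = rng.gauss(0, 1) or 1e-12  # guard against an exact zero
    return u / abs(v) ** (1 / beta)


def clip(x, lb, ub):
    """Keep every coordinate inside its search bounds."""
    return [min(max(v, lo), hi) for v, lo, hi in zip(x, lb, ub)]


def cuckoo_search(fitness, lb, ub, n=25, pa=0.25, alpha=0.01, max_iter=100, seed=0):
    rng = random.Random(seed)
    d = len(lb)
    # (4-2) random initial nests inside [lb_j, ub_j]
    nests = [[rng.uniform(lb[j], ub[j]) for j in range(d)] for _ in range(n)]
    # (4-3) evaluate every nest and remember the global best gt
    fit = [fitness(x) for x in nests]
    best = max(range(n), key=lambda i: fit[i])
    gt, gt_fit = nests[best][:], fit[best]
    for _ in range(max_iter):
        # (4-4) Levy-flight move; keep the better of the old and new nest
        for i in range(n):
            cand = clip([nests[i][j] + alpha * (ub[j] - lb[j]) * levy_step(rng)
                         for j in range(d)], lb, ub)
            cand_fit = fitness(cand)
            if cand_fit > fit[i]:
                nests[i], fit[i] = cand, cand_fit
        # (4-5) rebuild a nest when its random draw r exceeds pa, as the text words it
        for i in range(n):
            if rng.random() > pa:
                l, k = rng.sample(range(n), 2)
                s = rng.random()
                nests[i] = clip([nests[i][j] + s * (nests[l][j] - nests[k][j])
                                 for j in range(d)], lb, ub)
                fit[i] = fitness(nests[i])
        # (4-6)/(4-7) the local best pt replaces gt only if it is fitter
        pt = max(range(n), key=lambda i: fit[i])
        if fit[pt] > gt_fit:
            gt, gt_fit = nests[pt][:], fit[pt]
    # (4-9) iteration budget exhausted: return the global best
    return gt, gt_fit


def toy_fitness(x):
    """Stand-in for XGBoost accuracy: peaks at 0 when every coordinate is 1."""
    return -sum((v - 1.0) ** 2 for v in x)


best_x, best_f = cuckoo_search(toy_fitness, lb=[-5.0, -5.0], ub=[5.0, 5.0])
print(best_x, best_f)
```

Because gt is stored separately from the population, the unconditional replacement in step (4-5) cannot lose the best solution found so far. In the actual method, `toy_fitness` would be replaced by a function that trains XGBoost with the nest's parameter values and returns its classification Accuracy.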

Compared with existing XGBoost training methods, the invention has the following beneficial effects:

(1) the invention provides a novel XGBoost training method based on a cuckoo search algorithm, which is superior to the existing PSO-based and GA-based XGBoost training methods when optimizing multimodal functions;

(2) the method keeps an effective balance between local search and diversity (randomness);

(3) the cuckoo search algorithm has only two control parameters, which makes the algorithm simpler and more general.

Drawings

Fig. 1 is a schematic flow diagram of XGBoost optimized by the cuckoo search algorithm in the embodiment.

FIG. 2 is a diagram illustrating feature score ordering according to an embodiment.

Detailed Description

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

Examples

A novel XGboost training method based on cuckoo search algorithm comprises the following specific processes:

1. Data set preprocessing

An employee data set, HR_comma_sep, is selected from the human resource management data on the Kaggle website. It contains 14999 employee records with 10 attribute features and no missing values; the attribute details are shown in Table 1. The attribute left is the classification label, indicating whether the employee has left (1 = left, 0 = not left), and is denoted y. The first 9 attributes are denoted X, and X is normalized by the min-max method.

TABLE 1 Attribute feature details for employee datasets

Attribute	Meaning	Number	Maximum	Minimum
satisfaction_level	Degree of satisfaction	f0	1.00	0.00
last_evaluation	Performance assessment	f1	1.00	0.36
number_project	Number of completed projects	f2	7.00	2.00
average_montly_hours	Average monthly working hours	f3	310.00	96.00
time_spend_company	Years of service at the company	f4	10.00	2.00
work_accident	Whether there was a work accident	f5	1.00	0.00
promotion	Whether promoted in the past 5 years	f6	1.00	0.00
department	Department	f7	9.00	0.00
Salary	Salary level	f8	2.00	0.00
left	Whether the employee left	class	1	0
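The min-max normalization applied to each column can be illustrated with a short sketch. The function name `min_max_scale` and the four sample values are assumptions made for illustration; only the extremes 96 and 310 come from the average_montly_hours row of Table 1.

```python
def min_max_scale(column):
    """Scale a list of numbers to [0, 1] using x' = (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Illustrative values; 96 and 310 are the Table 1 extremes for f3.
hours = [96, 150, 203, 310]
scaled = min_max_scale(hours)
print(scaled)
```

The smallest value maps to 0, the largest to 1, and 203 maps to (203 - 96) / (310 - 96) = 0.5.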

2. Random forest feature selection algorithm screening dataset

A feature selection method is used to screen the original data set: reducing the dimensionality of the data set improves computational efficiency, and deleting redundant or irrelevant attribute features improves the prediction accuracy of the model. The specific steps for screening the data set with the random forest feature selection algorithm are as follows:

Step one: firstly, dividing the data set (X, y) into a training set (X_train, y_train) and a test set (X_test, y_test) at a ratio of 7:3;

Step two: training a random forest classification model rf_model on the training set, calling rf_model.feature_importances_ to output the importance score of each feature, and sorting the scores in descending order, as shown in FIG. 2;

Step three: setting the importance score threshold thresh to 0.004383, using the SelectFromModel function to retain the features whose score exceeds thresh, and using its transform(X) function to convert the original sample X into the new sample X; the retained features are f0, f4, f2, f3, f1, f7 and f8, and the features f5 and f6 are deleted.

3. Data set partitioning

The data set (X, y) after dimensionality reduction is divided into a training set (X_train, y_train) and a test set (X_test, y_test) at a ratio of 7:3.

4. Training the XGBoost by the XGBoost training method based on the cuckoo search algorithm

The XGBoost has many hyper-parameters. To further improve the prediction accuracy of the model, the cuckoo search algorithm is used to search for the optimal parameter set of the model. Referring to Fig. 1, the specific steps of training the XGBoost by the XGBoost training method based on the cuckoo search algorithm are as follows:
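A 7:3 split can be sketched with the standard library alone. The function name `split_7_3`, the seed value and the dummy data are assumptions for illustration; the embodiment does not say how the split is randomized.

```python
import random


def split_7_3(X, y, train_ratio=0.7, seed=42):
    """Shuffle the row indices, then cut them into a training and a test part."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = round(train_ratio * len(idx))
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [X[i] for i in test],
            [y[i] for i in train], [y[i] for i in test])


# Dummy data with the same number of records as the employee data set.
X = [[i] for i in range(14999)]
y = [i % 2 for i in range(14999)]
X_train, X_test, y_train, y_test = split_7_3(X, y)
print(len(X_train), len(X_test))
```

With 14999 records this yields 10499 training records and 4500 test records.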

Step one: determining the nest population size n to be 25, the dimension d to be 9, and the discovery probability P_a; the upper and lower bounds of the nest search space are shown in Table 2, the maximum number of iterations MaxN is 100, and the matrix representation of the nest population is given by formula (1);

Step two: randomly initializing the nest positions in the search space according to x_{*,j} = random(lb_j, ub_j), wherein ub_j and lb_j are respectively the upper and lower search bounds of the j-th hyper-parameter to be optimized, and random() represents a random function that returns a random number within the interval [lb_j, ub_j];

Step three: calculating the fitness function value of each nest according to the set fitness function ((1) setting the XGBoost parameters to the position values of the nest in Nest; (2) training the model on the training set; (3) inputting the test set into the trained model and calculating the classification Accuracy of the model), and retaining the optimal nest gt (namely the nest position vector with the maximum fitness value);

Step four: updating the nest positions by Lévy flight: the position of each current nest is changed randomly by formula (3) to obtain a group of new nest positions; each new nest position is compared with the old one, and the position with the larger fitness value is retained;

Step five: discarding a fraction of the worse nests and building new ones: looping from the 1st nest to the n-th nest, a uniformly distributed random number r ∈ [0,1] is generated in each pass; if r is greater than P_a, the nest position is updated by formula (4), otherwise it is not updated. When the loop finishes, a group of new nest positions is obtained;

Step six: calculating the fitness corresponding to the updated nest positions, and retaining the locally optimal nest pt (namely, the nest position vector with the maximum fitness value among the current nests);

Step seven: comparing the fitness value of pt with that of gt, and if the fitness value of pt is larger than that of gt, updating the global optimum gt;

Step eight: judging whether the maximum number of iterations is reached: if not, returning to step four to continue the loop; otherwise, returning the globally optimal nest position gt.

TABLE 2 upper and lower bounds of the parameters

Parameter	Search range
learning_rate	[0.01,0.3]
n_estimators	[10,2000]
max_depth	[1,15]
min_child_weight	[0,10]
gamma	[0.01,10.0]
subsample	[0.01,1.0]
colsample_bytree	[0.01,1.0]
reg_alpha	[0.01,1.0]
reg_lambda	[0.01,1.0]
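The search ranges of Table 2 define the 9-dimensional nest space used in steps one and two. The sketch below draws one random nest inside those bounds and decodes it into a parameter dictionary. Rounding n_estimators and max_depth to integers is an assumption (the text does not say how integer-valued parameters are handled), and the names `BOUNDS`, `random_nest` and `decode` are hypothetical.

```python
import random

# Search ranges copied from Table 2, in the same order.
BOUNDS = {
    "learning_rate": (0.01, 0.3),
    "n_estimators": (10, 2000),
    "max_depth": (1, 15),
    "min_child_weight": (0, 10),
    "gamma": (0.01, 10.0),
    "subsample": (0.01, 1.0),
    "colsample_bytree": (0.01, 1.0),
    "reg_alpha": (0.01, 1.0),
    "reg_lambda": (0.01, 1.0),
}
# Assumption: these two parameters are rounded to integers before use.
INTEGER_PARAMS = {"n_estimators", "max_depth"}


def random_nest(rng):
    """x_{*,j} = random(lb_j, ub_j) for each of the d = 9 dimensions."""
    return [rng.uniform(lo, hi) for lo, hi in BOUNDS.values()]


def decode(nest):
    """Turn a nest position vector into a named hyper-parameter dictionary."""
    params = {}
    for (name, (lo, hi)), value in zip(BOUNDS.items(), nest):
        value = min(max(value, lo), hi)  # clip back into the search range
        params[name] = int(round(value)) if name in INTEGER_PARAMS else value
    return params


nest = random_nest(random.Random(0))
params = decode(nest)
print(params)
```

A fitness function for the search would decode each nest this way, train XGBoost with the resulting parameters, and return its classification Accuracy.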

TABLE 3 Optimal parameter set

Parameter	Optimal value
learning_rate	0.1457
n_estimators	85
max_depth	15
min_child_weight	0.019
gamma	0.0113
subsample	0.86916
colsample_bytree	1.0
reg_alpha	0.7277
reg_lambda	0.2664
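The optimal values in Table 3 can be written down as a parameter dictionary. The sketch below assumes the xgboost Python package and its scikit-learn style `XGBClassifier`, which the patent text does not name; the helper `build_model` is hypothetical and imports the library lazily, so the dictionary can be inspected without xgboost installed.

```python
# Optimal values copied from Table 3.
BEST_PARAMS = {
    "learning_rate": 0.1457,
    "n_estimators": 85,
    "max_depth": 15,
    "min_child_weight": 0.019,
    "gamma": 0.0113,
    "subsample": 0.86916,
    "colsample_bytree": 1.0,
    "reg_alpha": 0.7277,
    "reg_lambda": 0.2664,
}


def build_model(params):
    """Construct a classifier from the tuned parameters (assumes xgboost is installed)."""
    from xgboost import XGBClassifier  # imported lazily on purpose
    return XGBClassifier(**params)


for name, value in BEST_PARAMS.items():
    print(f"{name} = {value}")
```

A typical use would be `model = build_model(BEST_PARAMS)` followed by `model.fit(X_train, y_train)` and `model.predict(X_test)`.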

5. Training the optimized XGboost model and carrying out model evaluation

The optimized XGBoost model is trained on the training set, and the trained XGBoost classification model is evaluated with Accuracy, Precision, Recall and F1. The 4 indexes are calculated as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TP is the number of samples that left and were correctly predicted as leaving, FP is the number of samples that did not leave but were incorrectly predicted as leaving, TN is the number of samples that did not leave and were correctly predicted as not leaving, and FN is the number of samples that left but were incorrectly predicted as not leaving.
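The four indexes follow directly from the confusion-matrix counts, as the sketch below shows. The counts used in the example are hypothetical and are not taken from the experiments; the function name `classification_metrics` is likewise an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute Accuracy, Precision, Recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for 1000 test samples (not from the patent's experiments).
metrics = classification_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Here 90 leavers are caught and 20 are missed, so Recall is 90/110, while Precision is 90/100 because 10 of the predicted leavers actually stayed.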

6. Performing employee turnover prediction

The test set is input into the trained XGBoost model for prediction to obtain the final prediction result.

7. Design of experiments

To verify the effectiveness of the proposed method, two groups of comparison experiments were set up. The first group compares three models, namely the original XGBoost model (XGB), the model with random forest feature screening (RF-XGB) and the proposed model (RF-CS-XGB), on 4 indexes (Accuracy, Precision, Recall and F1); the results are shown in Table 4. The second group compares the proposed RF-CS-XGB with common classification models that receive only the random forest feature selection step: random forest (RF-RF), logistic regression (RF-LR), support vector machine (RF-SVM), gradient boosting decision tree (RF-GBDT) and K-nearest neighbors (RF-KNN); the results are shown in Table 5.

TABLE 4 Results of the first group of comparison experiments

Model	Accuracy	Precision	Recall	F1
XGB	97.40%	97.17%	91.53%	94.27%
RF-XGB	97.44%	97.27%	91.63%	94.37%
RF-CS-XGB	99.09%	99.22%	96.86%	98.03%
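As a quick consistency check, the F1 column of Table 4 can be recomputed from the Precision and Recall columns using F1 = 2PR / (P + R). The sketch below does this in Python; the names `f1_from` and `TABLE_4` are hypothetical, and the inputs are the rounded percentages from the table, so agreement is only expected to within rounding.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (both in percent here)."""
    return 2 * precision * recall / (precision + recall)


# (Precision, Recall, reported F1) in percent, copied from Table 4.
TABLE_4 = {
    "XGB": (97.17, 91.53, 94.27),
    "RF-XGB": (97.27, 91.63, 94.37),
    "RF-CS-XGB": (99.22, 96.86, 98.03),
}

for model, (p, r, reported) in TABLE_4.items():
    print(f"{model}: recomputed F1 = {f1_from(p, r):.2f}, reported F1 = {reported:.2f}")
```

For all three models the recomputed value matches the reported F1 to two decimal places.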

TABLE 5 Results of the second group of comparison experiments

Model	Accuracy	Precision	Recall	F1
RF-RF	99.04%	99.32%	96.57%	97.93%
RF-LR	76.60%	49.83%	27.40%	35.36%
RF-SVM	81.53%	92.31%	22.84%	36.61%
RF-GBDT	97.58%	97.38%	92.10%	94.67%
RF-KNN	95.62%	90.36%	90.96%	90.66%
RF-CS-XGB	99.09%	99.22%	96.86%	98.03%

The above embodiment describes in detail a specific implementation of the XGBoost training method based on the cuckoo search algorithm as applied to employee turnover prediction. The embodiment is intended only to help in understanding the proposed method and its core idea.

Claims

1. A novel XGBoost training method based on a cuckoo search algorithm, characterized by comprising the following steps:

Step 1: preprocessing an original data set, including normalization and feature dimension reduction, and dividing the processed data set into a training set and a test set according to a fixed proportion;

Step 2: training the XGBoost by an XGBoost training method based on the cuckoo search algorithm;

Step 3: constructing the XGBoost according to the group of parameter values obtained by training;

Step 4: testing the constructed XGBoost with the test set, and comprehensively evaluating the model with the 4 model evaluation indexes Accuracy, Precision, Recall and F1 score.

2. The novel XGBoost training method based on a cuckoo search algorithm as claimed in claim 1, wherein training the XGBoost by the XGBoost training method based on the cuckoo search algorithm in Step 2 specifically comprises:

Step 2-1: determining the nest population size n; the dimension d of a nest position, namely the number of parameters to be optimized in the XGBoost; the discovery probability P_a; the upper and lower bounds of the nest search space; and the maximum number of iterations Max_iter. The classification Accuracy predicted by the XGBoost model is set as the fitness function of a nest, and the matrix representation Nest of the nest positions and the corresponding fitness vector NF are represented as follows:

$$\mathrm{Nest} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,d} \end{bmatrix}$$

wherein: x_{i,j} represents the j-th dimension of the i-th nest; n represents the number of nests; d represents the dimension of a nest, namely the number of XGBoost parameters to be optimized.

$$\mathrm{NF} = \begin{bmatrix} f_1 & f_2 & \cdots & f_n \end{bmatrix}^{T}$$

wherein f_i represents the fitness value corresponding to the i-th nest, and n represents the number of nests.

Step 2-2: randomly initializing the nest positions: within the search space S (S = [lb, ub]), each nest position is initialized according to x_{*,j} = random(lb_j, ub_j), wherein ub_j and lb_j are respectively the upper and lower search bounds of the j-th hyper-parameter to be optimized, and random() represents a random function that returns a random number within the interval [lb_j, ub_j];

Step 2-3: calculating the fitness value of each nest according to the classification Accuracy of the XGBoost, and retaining the optimal nest gt (namely the nest position vector with the maximum fitness value);

Step 2-4: updating the nest positions by Lévy flight: the position of each current nest is changed randomly by the following formula to obtain a group of new nest positions; each new nest position is compared with the old one, and the position with the larger fitness value is retained;

$$x_i^{(t+1)} = x_i^{(t)} + \alpha \oplus L(\lambda)$$

wherein α > 0 is the step size scaling factor and L(λ) represents the Lévy flight function, i.e., Lévy ~ u = t^{-λ}, (1 < λ ≤ 3);

Step 2-5: discarding a fraction of the worse nests and building new ones: looping from the 1st nest to the n-th nest, a uniformly distributed random number r ∈ [0,1] is generated in each pass; if r is greater than P_a, the nest position is updated by the following formula, otherwise it is not updated. When the loop finishes, a new group of nest positions is obtained;

$$X_{ij}^{(t+1)} = X_{ij}^{(t)} + \alpha s \otimes H(P_a - \varepsilon) \otimes \left( X_{lj}^{(t)} - X_{kj}^{(t)} \right)$$

wherein X_{lj} and X_{kj} are randomly selected solutions, H(μ) is the Heaviside function, P_a is a switching parameter for balancing local and global random walks, s is the step size, and ε is a uniformly distributed random number;

Step 2-6: calculating the fitness corresponding to the updated nest positions, and retaining the locally optimal nest pt (namely, the nest position with the maximum fitness value among the current nests);

Step 2-7: comparing the fitness value of pt with that of gt, and if the fitness value of pt is greater than that of gt, updating the global optimum gt;

Step 2-8: judging whether the maximum number of iterations is reached: if not, returning to Step 2-4 to continue the loop; otherwise, returning the globally optimal nest position gt.