CN113269254A - Coal and gangue identification method for particle swarm optimization XGboost algorithm - Google Patents

Coal and gangue identification method for particle swarm optimization XGboost algorithm

Info

Publication number
CN113269254A
CN113269254A (application number CN202110580411.2A)
Authority
CN
China
Prior art keywords
coal
gangue
model
xgboost
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110580411.2A
Other languages
Chinese (zh)
Inventor
周孟然
闫鹏程
胡锋
来文豪
卞凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University of Science and Technology
Original Assignee
Anhui University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University of Science and Technology filed Critical Anhui University of Science and Technology
Priority to CN202110580411.2A priority Critical patent/CN113269254A/en
Publication of CN113269254A publication Critical patent/CN113269254A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0004Industrial image inspection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a coal and gangue identification method based on a particle swarm-optimized XGBoost algorithm, belonging to the field of coal and gangue identification and comprising the following steps: collecting multispectral image information of coal and gangue and preprocessing it; dividing the collected coal and gangue multispectral images into samples, randomly splitting the preprocessed images into independent training and test sets at a ratio of 7:3, and assigning labels to the samples; extracting features from the coal and gangue multispectral images in the training and test sets; constructing a coal and gangue identification model based on the XGBoost algorithm from the extracted multispectral image features, training the model on the training set, and optimizing the parameters of the XGBoost algorithm with a particle swarm optimization algorithm; and testing the classification accuracy of the model on coal and gangue with the test set to verify its performance. The XGBoost model adopted by the method offers high identification accuracy and strong interpretability, is not prone to overfitting, and achieves a good classification effect.

Description

Coal and gangue identification method for particle swarm optimization XGboost algorithm
Technical Field
The invention belongs to the technical field of coal and gangue identification, and particularly relates to a coal and gangue identification method based on a particle swarm-optimized XGBoost algorithm.
Background
Coal has long been the primary energy source in China. Coal that has not been processed after mining and excavation is called raw coal, and raw coal contains a large amount of gangue. Gangue has a high heavy-metal content and a low calorific value; when mixed with coal it lowers the calorific value and quality of the coal and pollutes the environment during combustion. As China develops clean coal technology, the separation of coal and gangue is an important step.
Existing methods for separating coal and gangue mainly include manual gangue picking, jigging, flotation, selective crushing, dense-medium separation, and ray-detection-based separation, but these methods generally suffer from low identification accuracy, large footprint, high investment cost, or serious environmental pollution. This application provides an identification method based on multispectral imaging and a particle swarm-optimized XGBoost algorithm. XGBoost is a gradient boosting ensemble learning algorithm built on gradient boosted decision trees; its principle is to integrate multiple weak classifiers and obtain a more accurate classification through repeated iterations. XGBoost has many advantages: it is fast, effective, able to handle large-scale data, uses second-order derivatives for a more accurate loss, and supports custom loss functions. The particle swarm optimization algorithm is an evolutionary computing technique whose basic idea is to find the optimal solution through cooperation and information sharing among individuals in a population. It approaches the optimal solution quickly, is simple and easy to implement, requires few parameter settings, and can effectively optimize system parameters. Constructing the coal and gangue identification model with a particle swarm-optimized XGBoost algorithm is therefore an effective identification method.
Apart from manual gangue picking, automatic gangue (coal) separation technology can be divided into wet and dry separation according to whether water resources are used. Wet separation consumes a large amount of water, and the resulting coal slime pollution is difficult to treat, which runs counter to the concept of producing clean coal. Gamma rays, X-rays, and other rays carry a certain amount of radiation and can harm human health, while ordinary image-based gangue identification is strongly affected by lighting and other factors, so its accuracy is not high. The coal and gangue identification method based on the particle swarm-optimized XGBoost algorithm achieves a high identification rate at high speed and can make up for the shortcomings of existing coal and gangue identification methods.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a coal and gangue identification method based on a particle swarm-optimized XGBoost algorithm.
In order to achieve the above purpose, the invention provides the following technical scheme:
a coal and gangue identification method for a particle swarm optimization XGboost algorithm comprises the following steps:
collecting multispectral image information of coal and gangue, and preprocessing the multispectral image information;
carrying out sample division on the collected coal and gangue multispectral images, randomly dividing the preprocessed coal and gangue multispectral images into independent training sets and test sets according to a ratio of 7:3, and setting labels for the samples, wherein the label of the coal is 1, and the label of the gangue is 0;
performing feature extraction on the coal and gangue multispectral images in the training set and the testing set;
constructing a coal and gangue identification model based on an XGboost algorithm by using the extracted multispectral image characteristics, training the coal and gangue identification model on a training set, and optimizing parameters of the XGboost algorithm through a particle swarm optimization algorithm;
and testing the classification accuracy of the coal and gangue identification model on coal and gangue through the test set, and verifying the performance of the model.
Preferably, a multispectral image acquisition system is used for acquiring multispectral images of a plurality of samples of coal and gangue to obtain multispectral images of the coal and the gangue.
Preferably, training the coal and gangue identification model comprises:

for a given training sample set $D=\{(x_i,y_i)\}$ with $N$ samples and $M$ features ($i=1,2,\dots,N$, $x_i\in\mathbb{R}^M$, $y_i\in\mathbb{R}$), XGBoost training finally yields an ensemble model formed by the addition of $K$ CART decision trees:

$$\hat{y}_i=\sum_{k=1}^{K}f_k(x_i),\qquad f_k\in F$$

where $\hat{y}_i$ is the output of the XGBoost model and $F=\{f(x)=w_{q(x)}\}$ ($q:\mathbb{R}^M\to\{1,\dots,T\}$, $w\in\mathbb{R}^T$) is the set of all CART decision trees in the model, $f$ denoting a specific CART tree; each decision tree function $f_k$ corresponds to a specific tree structure $q$ and a corresponding leaf node weight vector $w$; for one sample, the XGBoost model obtains the final predicted value by mapping the sample to the corresponding leaf node on each decision tree and then summing the weights of the $K$ leaf nodes corresponding to the sample; the machine learning model defines a loss function to measure the deviation between the predicted value and the true value of the model;

the loss function of the XGBoost model is:

$$L=\sum_{i=1}^{N}l\left(y_i,\hat{y}_i\right)+\sum_{k=1}^{K}\Omega\left(f_k\right)$$

the formula comprises two parts, the first part being the training loss and the second part being the regularization term;

in the XGBoost algorithm, training proceeds by iteratively adding tree models, that is, a CART decision tree function $f$ is added at each step of training so that the loss function is further reduced; after several iterations, an optimal CART tree $f_t$, i.e. the CART tree that minimizes the loss function, is added at step $t$, and the loss function becomes:

$$L^{(t)}=\sum_{i=1}^{N}l\left(y_i,\hat{y}_i^{(t-1)}+f_t(x_i)\right)+\Omega\left(f_t\right)$$

to select a tree structure $f_t$ that maximizes the reduction of the loss function $L^{(t)}$, a second-order Taylor expansion is applied to the above equation:

$$L^{(t)}\approx\sum_{i=1}^{N}\left[l\left(y_i,\hat{y}_i^{(t-1)}\right)+g_i f_t(x_i)+\tfrac{1}{2}h_i f_t^{2}(x_i)\right]+\Omega\left(f_t\right)$$

where $g_i=\partial_{\hat{y}_i^{(t-1)}}\,l\left(y_i,\hat{y}_i^{(t-1)}\right)$ and $h_i=\partial^{2}_{\hat{y}_i^{(t-1)}}\,l\left(y_i,\hat{y}_i^{(t-1)}\right)$ are respectively the first and second derivatives of the loss function at the expansion point $\hat{y}_i^{(t-1)}$;

the regularization term contained in the loss function can be used to control the complexity of the trained model, and is defined as follows:

$$\Omega(f)=\gamma T+\tfrac{1}{2}\lambda\sum_{j=1}^{T}w_j^{2}$$

where $T$ represents the number of leaf nodes and $w$ represents the leaf weights;

in the expanded form, the term $l\left(y_i,\hat{y}_i^{(t-1)}\right)$, formed by the outputs $\hat{y}_i^{(t-1)}$ of all CART tree functions obtained before step $t$ and the sample labels, is a constant; because the reduction of the loss function is unrelated to constant terms, the constant terms are removed, and combining the regularization term expression (with $I_j$ denoting the set of samples assigned to leaf $j$, $G_j=\sum_{i\in I_j}g_i$ and $H_j=\sum_{i\in I_j}h_i$) further simplifies the loss function to:

$$\tilde{L}^{(t)}=\sum_{j=1}^{T}\left[G_j w_j+\tfrac{1}{2}\left(H_j+\lambda\right)w_j^{2}\right]+\gamma T$$

taking the derivative of this formula with respect to $w_j$ and setting it to 0, the optimal leaf node weight is:

$$w_j^{*}=-\frac{G_j}{H_j+\lambda}$$

and the optimal loss function at this point is:

$$\tilde{L}^{(t)}(q)=-\tfrac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda}+\gamma T$$

which is used to measure the quality of any tree structure: the smaller it is, the better the tree structure, since the loss function of the model can then be reduced more;

the XGBoost training process adds CART functions iteratively and finally obtains the XGBoost model

$$\hat{y}_i=\sum_{k=1}^{K}f_k(x_i)$$

the iteration terminates when continuing to add tree models improves the accuracy of the model by less than $s$; each newly added function $f_t$ is obtained as follows: starting from a single leaf node, one branch is added at a time and the tree-growing scheme with the smallest loss function value is selected; this process is repeated until the maximum depth of the tree reaches a specified value or the minimum sample weight sum falls below a threshold, at which point splitting stops.
Preferably, parameters of the XGBoost algorithm such as the learning rate (learning_rate), the maximum tree depth (max_depth) and the minimum leaf weight (min_child_weight) are optimized by the particle swarm optimization algorithm, which specifically comprises the following steps:

initializing the particle swarm;

determining a fitness function according to the objective function of the optimization problem and calculating the fitness of each particle in the particle swarm;

calculating the individual extreme value (personal best) of each particle in the particle swarm, comparing the current fitness value of each particle with its individual extreme value, and replacing the individual extreme value with the current fitness value if the current fitness value is better;

comparing the current fitness values of all the particles in the particle swarm with the global extreme value (global best), and replacing the global extreme value with the current fitness value if the current fitness value is better;

updating the velocity and position of each particle by the formulas

$$\nu_{id}(t+1)=\omega\,\nu_{id}(t)+c_1 r_1\left[p_{id}(t)-x_{id}(t)\right]+c_2 r_2\left[p_{gd}(t)-x_{id}(t)\right]$$

and

$$x_{id}(t+1)=x_{id}(t)+\nu_{id}(t+1);$$

judging whether the iteration termination condition is met: if it is met, the optimization ends; if not, returning to the fitness calculation step and continuing the optimization;

iteration termination condition: the iteration stops when the number of iterations reaches the set maximum number of iterations or the set minimum error criterion is met; otherwise the iteration continues until the termination condition is satisfied.
The coal and gangue identification method based on the particle swarm-optimized XGBoost algorithm provided by the invention has the following beneficial effects:

The invention acquires images of coal and gangue with multispectral imaging technology and, through feature extraction and sample division, establishes a coal and gangue identification model based on a particle swarm-optimized XGBoost algorithm, so that coal and gangue can be identified quickly and accurately. The XGBoost model is highly interpretable and not prone to overfitting; tuning it with the particle swarm algorithm adds little complexity to the original XGBoost algorithm while improving the stability of the model and effectively improving its identification accuracy.

Combining the XGBoost algorithm with the particle swarm optimization algorithm makes it possible to find the optimal parameters quickly and build the optimal model, which improves the running speed and accuracy of the model, achieves a good classification effect, and gives the method an advantage over other machine learning algorithms.
Drawings
In order to more clearly illustrate the embodiments of the present invention and the design thereof, the drawings required for the embodiments will be briefly described below. The drawings in the following description are only some embodiments of the invention and it will be clear to a person skilled in the art that other drawings can be derived from them without inventive effort.
Fig. 1 is a flowchart of a coal and gangue identification method of a particle swarm optimization XGBoost algorithm according to embodiment 1 of the present invention;
FIG. 2 is a flow chart of XGboost parameter optimization based on particle swarm optimization.
Detailed Description
In order that those skilled in the art will better understand the technical solutions of the present invention and can practice the same, the present invention will be described in detail with reference to the accompanying drawings and specific examples. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Example 1
The invention provides a coal and gangue identification method for a particle swarm optimization XGboost algorithm, which comprises the following steps of:
step 1, acquiring multispectral image information of coal and gangue by using a multispectral image acquisition system to obtain a multispectral image data set of the coal and the gangue, and preprocessing the data; the multispectral image acquisition system selects a real-time multispectral Mosai surface camera of Shanghai five-bell photo-electronic technology Limited to acquire multispectral images of a plurality of samples of coal and gangue, and the multispectral images of the coal and the gangue are obtained, wherein the pixels of the multispectral images are 2048 x 1088.
Step 2: divide the collected coal and gangue multispectral images into samples, randomly split the preprocessed images into an independent training set and test set at a ratio of 7:3, and assign labels to the samples, with coal labelled 1 and gangue labelled 0.
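As a point of reference only, the 7:3 random split and labelling described in this step can be sketched in Python as below; the array names (`coal_images`, `gangue_images`) and the use of scikit-learn are illustrative assumptions rather than part of the claimed method.

```python
# Illustrative sketch of step 2: random 7:3 split with labels coal = 1, gangue = 0.
# `coal_images` and `gangue_images` are assumed NumPy arrays of preprocessed
# multispectral samples (one sample per row); the names are hypothetical.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.concatenate([coal_images, gangue_images], axis=0)
y = np.concatenate([np.ones(len(coal_images)),       # coal   -> label 1
                    np.zeros(len(gangue_images))])    # gangue -> label 0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)  # 7:3 random split
```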
Step 3: extract features from the collected coal and gangue multispectral images.
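The description does not fix a particular feature set for this step; one common choice for multispectral data is per-band statistics, sketched below purely as an assumed example (the use of SciPy and the function name `band_statistics` are not part of the original description).

```python
# Assumed per-band statistical features (mean, standard deviation, skewness)
# for a multispectral image of shape (height, width, n_bands).
import numpy as np
from scipy.stats import skew

def band_statistics(image):
    bands = image.reshape(-1, image.shape[-1])   # (n_pixels, n_bands)
    return np.concatenate([bands.mean(axis=0),
                           bands.std(axis=0),
                           skew(bands, axis=0)])

# Applied to the hypothetical arrays from the previous sketch.
train_features = np.stack([band_statistics(img) for img in X_train])
test_features  = np.stack([band_statistics(img) for img in X_test])
```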
Step 4: construct a coal and gangue identification model based on the XGBoost algorithm using the extracted multispectral image features, train the model on the training set, and optimize the parameters of the XGBoost algorithm with the particle swarm optimization algorithm. The particle swarm optimization algorithm is a global stochastic search algorithm based on swarm intelligence; it converges quickly, requires few parameters, and is easy to implement.
Specifically, in this embodiment, training the coal and gangue identification model includes:

For a given training sample set $D=\{(x_i,y_i)\}$ with $N$ samples and $M$ features ($i=1,2,\dots,N$, $x_i\in\mathbb{R}^M$, $y_i\in\mathbb{R}$), XGBoost training finally yields an ensemble model formed by the addition of $K$ CART decision trees:

$$\hat{y}_i=\sum_{k=1}^{K}f_k(x_i),\qquad f_k\in F$$

where $\hat{y}_i$ is the output of the XGBoost model and $F=\{f(x)=w_{q(x)}\}$ ($q:\mathbb{R}^M\to\{1,\dots,T\}$, $w\in\mathbb{R}^T$) is the set of all CART decision trees in the model, $f$ denoting a specific CART tree. Each decision tree function $f_k$ corresponds to a specific tree structure $q$ and a corresponding leaf node weight vector $w$. For one sample, the XGBoost model obtains the final predicted value by mapping the sample to the corresponding leaf node on each decision tree and then summing the weights of the $K$ leaf nodes corresponding to the sample. The machine learning model defines a loss function to measure the deviation between the predicted value and the true value of the model; during training, the objective is to make the value of the loss function as small as possible.

The loss function of the XGBoost model is:

$$L=\sum_{i=1}^{N}l\left(y_i,\hat{y}_i\right)+\sum_{k=1}^{K}\Omega\left(f_k\right)$$

The formula contains two parts, the first part being the training loss and the second part being the regularization term.

In the XGBoost algorithm, training proceeds by iteratively adding tree models, that is, a CART decision tree function $f$ is added at each step of training so that the loss function is further reduced. After several iterations, an optimal CART tree $f_t$, i.e. the CART tree that minimizes the loss function, is added at step $t$, and the loss function becomes:

$$L^{(t)}=\sum_{i=1}^{N}l\left(y_i,\hat{y}_i^{(t-1)}+f_t(x_i)\right)+\Omega\left(f_t\right)$$

To select a tree structure $f_t$ that maximizes the reduction of the loss function $L^{(t)}$, a second-order Taylor expansion is applied to the above equation:

$$L^{(t)}\approx\sum_{i=1}^{N}\left[l\left(y_i,\hat{y}_i^{(t-1)}\right)+g_i f_t(x_i)+\tfrac{1}{2}h_i f_t^{2}(x_i)\right]+\Omega\left(f_t\right)$$

where $g_i=\partial_{\hat{y}_i^{(t-1)}}\,l\left(y_i,\hat{y}_i^{(t-1)}\right)$ and $h_i=\partial^{2}_{\hat{y}_i^{(t-1)}}\,l\left(y_i,\hat{y}_i^{(t-1)}\right)$ are respectively the first and second derivatives of the loss function at the expansion point $\hat{y}_i^{(t-1)}$.

The regularization term contained in the loss function is used to control the complexity of the trained model, so that the model does not become excessively complex while maintaining accuracy on the training samples, thereby avoiding overfitting and enhancing generalization; it is defined as follows:

$$\Omega(f)=\gamma T+\tfrac{1}{2}\lambda\sum_{j=1}^{T}w_j^{2}$$

where $T$ represents the number of leaf nodes and $w$ represents the leaf weights.

In the expanded form, the term $l\left(y_i,\hat{y}_i^{(t-1)}\right)$, formed by the outputs $\hat{y}_i^{(t-1)}$ of all CART tree functions obtained before step $t$ and the sample labels, is a constant. Because the reduction of the loss function is unrelated to constant terms, the constant terms are removed, and combining the regularization term expression (with $I_j$ denoting the set of samples assigned to leaf $j$, $G_j=\sum_{i\in I_j}g_i$ and $H_j=\sum_{i\in I_j}h_i$) further simplifies the loss function to:

$$\tilde{L}^{(t)}=\sum_{j=1}^{T}\left[G_j w_j+\tfrac{1}{2}\left(H_j+\lambda\right)w_j^{2}\right]+\gamma T$$

Taking the derivative of this formula with respect to $w_j$ and setting it to 0, the optimal leaf node weight is:

$$w_j^{*}=-\frac{G_j}{H_j+\lambda}$$

and the optimal loss function at this point is:

$$\tilde{L}^{(t)}(q)=-\tfrac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda}+\gamma T$$

This quantity is used to measure the quality of any tree structure: the smaller it is, the better the tree structure, since the loss function of the model can then be reduced more.

The XGBoost training process adds CART functions iteratively and finally obtains the XGBoost model

$$\hat{y}_i=\sum_{k=1}^{K}f_k(x_i)$$

The iteration terminates when continuing to add tree models improves the accuracy of the model by less than $s$. Each newly added function $f_t$ is obtained as follows: starting from a single leaf node, one branch is added at a time and the tree-growing scheme with the smallest loss function value is selected; this process is repeated until the maximum depth of the tree reaches a specified value or the minimum sample weight sum falls below a threshold, at which point splitting stops.
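The closed-form quantities derived above translate directly into a few lines of code. The sketch below is only a numeric illustration of $w_j^{*}=-G_j/(H_j+\lambda)$ and the structure score, with made-up values for $G_j$, $H_j$, $\lambda$ and $\gamma$; it is not part of the XGBoost library itself.

```python
# Numeric illustration of the derived formulas (values are made up).
def leaf_weight(G, H, reg_lambda):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -G / (H + reg_lambda)

def structure_score(G_list, H_list, reg_lambda, gamma):
    """Optimal loss -0.5 * sum(G^2 / (H + lambda)) + gamma * T; smaller is better."""
    score = sum(-0.5 * G * G / (H + reg_lambda) for G, H in zip(G_list, H_list))
    return score + gamma * len(G_list)

print(leaf_weight(G=2.4, H=5.0, reg_lambda=1.0))            # optimal weight of one leaf
print(structure_score([2.4, -1.1], [5.0, 3.0], 1.0, 0.1))   # quality of a 2-leaf tree
```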
There are many parameters in the XGBoost algorithm, and parameter optimization is necessary to obtain good classification results.

Specifically, XGBoost parameters fall into three types: general parameters, booster parameters, and learning task parameters. The booster parameters are the main parameters during training on the data samples, and adjusting them has the greatest influence on the accuracy of the model.
Table 1: Booster parameter information for the XGBoost algorithm (the table itself is rendered as an image in the original publication; it lists the booster parameters discussed below, including learning_rate, max_depth, min_child_weight, gamma, subsample, colsample_bytree, lambda and alpha).
According to extensive XGBoost parameter tuning experience and engineering practice, a learning_rate that is too large prevents the algorithm from converging, while one that is too small leads to overfitting. If max_depth is too large, the model is more likely to fall into a local optimum and an overfitting phenomenon occurs. min_child_weight is the minimum sample weight sum in a child node; if it is too small the algorithm overfits, and if it is too large the classification performance of the algorithm on linearly inseparable data decreases. The larger the value of gamma, the more conservative the algorithm; this parameter is closely related to the loss function and needs to be adjusted. Decreasing subsample makes the algorithm more conservative and avoids overfitting, but setting it too small may cause underfitting. lambda controls the regularization part of XGBoost and helps reduce overfitting. alpha can be applied in very high-dimensional cases to make the algorithm faster. Therefore, in the present application, the parameters learning_rate, max_depth, min_child_weight, gamma, subsample, colsample_bytree, lambda and alpha are optimized, and the other parameters are left at their default values.
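For orientation, the booster parameters discussed above map onto the constructor of `xgboost.XGBClassifier` as sketched below. The numeric values are placeholders; in the described method they are the quantities tuned by the particle swarm optimizer, and `train_features`/`y_train` are the hypothetical arrays from the earlier sketches.

```python
# Sketch of the XGBoost coal/gangue classifier with the booster parameters exposed.
from xgboost import XGBClassifier

clf = XGBClassifier(
    learning_rate=0.1,        # step size shrinkage
    max_depth=6,              # maximum tree depth
    min_child_weight=1,       # minimum sample weight sum in a child node
    gamma=0.0,                # minimum loss reduction required for a split
    subsample=0.8,            # row subsampling ratio
    colsample_bytree=0.8,     # column subsampling ratio per tree
    reg_lambda=1.0,           # L2 regularization (lambda)
    reg_alpha=0.0,            # L1 regularization (alpha)
    objective="binary:logistic",
)
clf.fit(train_features, y_train)
```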
The XGBoost algorithm has many parameters, and adjusting them greatly influences the accuracy of the model, so in this embodiment the learning rate (learning_rate), the maximum tree depth (max_depth) and the minimum leaf weight (min_child_weight) of the XGBoost algorithm are optimized by the particle swarm optimization algorithm. As shown in fig. 2, this specifically includes the following steps:

Step 4.1: initialize the particle swarm.

Step 4.2: determine a fitness function according to the objective function of the optimization problem and calculate the fitness of each particle in the particle swarm.

Step 4.3: calculate the individual extreme value (personal best) of each particle in the particle swarm, compare the current fitness value of each particle with its individual extreme value, and replace the individual extreme value with the current fitness value if the current fitness value is better.

Step 4.4: compare the current fitness values of all the particles in the particle swarm with the global extreme value (global best), and replace the global extreme value with the current fitness value if the current fitness value is better.

Step 4.5: update the velocity and position of each particle by the formulas

$$\nu_{id}(t+1)=\omega\,\nu_{id}(t)+c_1 r_1\left[p_{id}(t)-x_{id}(t)\right]+c_2 r_2\left[p_{gd}(t)-x_{id}(t)\right]$$

and

$$x_{id}(t+1)=x_{id}(t)+\nu_{id}(t+1).$$

Step 4.6: judge whether the iteration termination condition is met; if it is met, the optimization ends, otherwise return to step 4.2 and continue the optimization.

Iteration termination condition: the iteration stops when the number of iterations reaches the set maximum number of iterations or the set minimum error criterion is met; otherwise the iteration continues until the termination condition is satisfied.
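A compact rendering of steps 4.1 to 4.6 is sketched below for three of the tuned parameters. The parameter bounds, swarm size, and the inertia and acceleration constants (omega, c1, c2) are assumed values chosen only for illustration; cross-validated accuracy on the training set is used as the fitness, which is one reasonable choice but is not prescribed by this description.

```python
# Illustrative PSO loop tuning (learning_rate, max_depth, min_child_weight).
# `train_features` and `y_train` are the hypothetical arrays from the earlier sketches.
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

bounds = np.array([[0.01, 0.3],    # learning_rate
                   [2.0, 10.0],    # max_depth
                   [1.0, 10.0]])   # min_child_weight
n_particles, n_iter = 20, 30
w, c1, c2 = 0.7, 1.5, 1.5          # inertia and acceleration constants (assumed)

def fitness(p):
    model = XGBClassifier(learning_rate=p[0], max_depth=int(round(p[1])),
                          min_child_weight=p[2])
    return cross_val_score(model, train_features, y_train, cv=5).mean()

pos = np.random.uniform(bounds[:, 0], bounds[:, 1], size=(n_particles, 3))
vel = np.zeros_like(pos)
pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()]

for _ in range(n_iter):
    r1 = np.random.rand(n_particles, 3)
    r2 = np.random.rand(n_particles, 3)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)   # velocity update
    pos = np.clip(pos + vel, bounds[:, 0], bounds[:, 1])                # position update
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit                                          # update personal bests
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmax()]                                   # update global best
```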
Step 5: test the classification accuracy of the coal and gangue identification model on coal and gangue using the test set and verify the performance of the model.

Specifically, the performance of the coal and gangue identification model is reflected in its identification accuracy for coal and gangue: high identification accuracy indicates good performance, and low identification accuracy indicates poor performance. Coal and gangue identification is therefore carried out as part of verifying the model performance.
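As a final illustrative step, the tuned parameters can be refit on the full training set and scored on the held-out test set; again, `gbest`, `train_features`, `test_features`, `y_train` and `y_test` refer to the hypothetical variables from the sketches above.

```python
# Sketch of step 5: accuracy of the tuned model on the independent test set.
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

best = XGBClassifier(learning_rate=gbest[0], max_depth=int(round(gbest[1])),
                     min_child_weight=gbest[2])
best.fit(train_features, y_train)
print("test accuracy:", accuracy_score(y_test, best.predict(test_features)))
```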
The identification method provided by this embodiment has the following advantages:

1. Effective identification of coal and gangue is an important prerequisite for coal and gangue separation. Coal and gangue identification based on the particle swarm-optimized XGBoost algorithm provides a method for accurate identification of coal and gangue and strongly supports the development of clean coal technology.

2. The XGBoost algorithm is a gradient boosting ensemble learning algorithm based on gradient boosted decision trees. A strong classifier is formed by integrating CART trees; a second-order Taylor expansion of the loss function exploits the information in both the first- and second-order derivatives, and a regularization term is added to reduce the complexity of the model and prevent overfitting.

3. The XGBoost algorithm has many parameters, and the booster parameters are particularly important for the accuracy of the model; optimizing them with the particle swarm optimization algorithm yields a good classification effect.
The above-mentioned embodiments are only preferred embodiments of the present invention, and the scope of the present invention is not limited thereto, and any simple modifications or equivalent substitutions of the technical solutions that can be obviously obtained by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (4)

1. A coal and gangue identification method for a particle swarm optimization XGboost algorithm is characterized by comprising the following steps:
collecting multispectral image information of coal and gangue, and preprocessing the multispectral image information;
carrying out sample division on the collected coal and gangue multispectral images, randomly dividing the preprocessed coal and gangue multispectral images into independent training sets and test sets according to a ratio of 7:3, and setting labels for the samples, wherein the label of the coal is 1, and the label of the gangue is 0;
performing feature extraction on the coal and gangue multispectral images in the training set and the testing set;
constructing a coal and gangue identification model based on an XGboost algorithm by using the extracted multispectral image characteristics, training the coal and gangue identification model on a training set, and optimizing parameters of the XGboost algorithm through a particle swarm optimization algorithm;
and testing the classification accuracy of the coal and gangue identification model on coal and gangue through the test set, and verifying the performance of the model.
2. The coal and gangue identification method based on the particle swarm-optimized XGBoost algorithm according to claim 1, characterized in that a multispectral image acquisition system is used to acquire multispectral images of a plurality of coal and gangue samples, obtaining the multispectral images of the coal and the gangue.
3. The coal and gangue identification method based on the particle swarm-optimized XGBoost algorithm according to claim 1, characterized in that training the coal and gangue identification model comprises:

for a given training sample set $D=\{(x_i,y_i)\}$ with $N$ samples and $M$ features ($i=1,2,\dots,N$, $x_i\in\mathbb{R}^M$, $y_i\in\mathbb{R}$), XGBoost training finally yields an ensemble model formed by the addition of $K$ CART decision trees:

$$\hat{y}_i=\sum_{k=1}^{K}f_k(x_i),\qquad f_k\in F$$

where $\hat{y}_i$ is the output of the XGBoost model and $F=\{f(x)=w_{q(x)}\}$ ($q:\mathbb{R}^M\to\{1,\dots,T\}$, $w\in\mathbb{R}^T$) is the set of all CART decision trees in the model, $f$ denoting a specific CART tree; each decision tree function $f_k$ corresponds to a specific tree structure $q$ and a corresponding leaf node weight vector $w$; for one sample, the XGBoost model obtains the final predicted value by mapping the sample to the corresponding leaf node on each decision tree and then summing the weights of the $K$ leaf nodes corresponding to the sample; the machine learning model defines a loss function to measure the deviation between the predicted value and the true value of the model;

the loss function of the XGBoost model is:

$$L=\sum_{i=1}^{N}l\left(y_i,\hat{y}_i\right)+\sum_{k=1}^{K}\Omega\left(f_k\right)$$

the formula comprises two parts, the first part being the training loss and the second part being the regularization term;

in the XGBoost algorithm, training proceeds by iteratively adding tree models, that is, a CART decision tree function $f$ is added at each step of training so that the loss function is further reduced; after several iterations, an optimal CART tree $f_t$, i.e. the CART tree that minimizes the loss function, is added at step $t$, and the loss function becomes:

$$L^{(t)}=\sum_{i=1}^{N}l\left(y_i,\hat{y}_i^{(t-1)}+f_t(x_i)\right)+\Omega\left(f_t\right)$$

to select a tree structure $f_t$ that maximizes the reduction of the loss function $L^{(t)}$, a second-order Taylor expansion is applied to the above equation:

$$L^{(t)}\approx\sum_{i=1}^{N}\left[l\left(y_i,\hat{y}_i^{(t-1)}\right)+g_i f_t(x_i)+\tfrac{1}{2}h_i f_t^{2}(x_i)\right]+\Omega\left(f_t\right)$$

where $g_i=\partial_{\hat{y}_i^{(t-1)}}\,l\left(y_i,\hat{y}_i^{(t-1)}\right)$ and $h_i=\partial^{2}_{\hat{y}_i^{(t-1)}}\,l\left(y_i,\hat{y}_i^{(t-1)}\right)$ are respectively the first and second derivatives of the loss function at the expansion point $\hat{y}_i^{(t-1)}$;

the regularization term contained in the loss function can be used to control the complexity of the trained model, and is defined as follows:

$$\Omega(f)=\gamma T+\tfrac{1}{2}\lambda\sum_{j=1}^{T}w_j^{2}$$

where $T$ represents the number of leaf nodes and $w$ represents the leaf weights;

in the expanded form, the term $l\left(y_i,\hat{y}_i^{(t-1)}\right)$, formed by the outputs $\hat{y}_i^{(t-1)}$ of all CART tree functions obtained before step $t$ and the sample labels, is a constant; because the reduction of the loss function is unrelated to constant terms, the constant terms are removed, and combining the regularization term expression (with $I_j$ denoting the set of samples assigned to leaf $j$, $G_j=\sum_{i\in I_j}g_i$ and $H_j=\sum_{i\in I_j}h_i$) further simplifies the loss function to:

$$\tilde{L}^{(t)}=\sum_{j=1}^{T}\left[G_j w_j+\tfrac{1}{2}\left(H_j+\lambda\right)w_j^{2}\right]+\gamma T$$

taking the derivative of this formula with respect to $w_j$ and setting it to 0, the optimal leaf node weight is:

$$w_j^{*}=-\frac{G_j}{H_j+\lambda}$$

and the optimal loss function at this point is:

$$\tilde{L}^{(t)}(q)=-\tfrac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda}+\gamma T$$

which is used to measure the quality of any tree structure: the smaller it is, the better the tree structure, since the loss function of the model can then be reduced more;

the XGBoost training process adds CART functions iteratively and finally obtains the XGBoost model

$$\hat{y}_i=\sum_{k=1}^{K}f_k(x_i)$$

the iteration terminates when continuing to add tree models improves the accuracy of the model by less than $s$; each newly added function $f_t$ is obtained as follows: starting from a single leaf node, one branch is added at a time and the tree-growing scheme with the smallest loss function value is selected; this process is repeated until the maximum depth of the tree reaches a specified value or the minimum sample weight sum falls below a threshold, at which point splitting stops.
4. The coal and gangue identification method based on the particle swarm-optimized XGBoost algorithm according to claim 3, characterized in that parameter optimization of the learning rate, the maximum tree depth and the minimum leaf weight in the XGBoost algorithm is carried out by the particle swarm optimization algorithm, which specifically comprises the following steps:

initializing the particle swarm;

determining a fitness function according to the objective function of the optimization problem and calculating the fitness of each particle in the particle swarm;

calculating the individual extreme value (personal best) of each particle in the particle swarm, comparing the current fitness value of each particle with its individual extreme value, and replacing the individual extreme value with the current fitness value if the current fitness value is better;

comparing the current fitness values of all the particles in the particle swarm with the global extreme value (global best), and replacing the global extreme value with the current fitness value if the current fitness value is better;

updating the velocity and position of each particle by the formulas

$$\nu_{id}(t+1)=\omega\,\nu_{id}(t)+c_1 r_1\left[p_{id}(t)-x_{id}(t)\right]+c_2 r_2\left[p_{gd}(t)-x_{id}(t)\right]$$

and

$$x_{id}(t+1)=x_{id}(t)+\nu_{id}(t+1);$$

judging whether the iteration termination condition is met: if it is met, the optimization ends; if not, returning to the fitness calculation step and continuing the optimization;

iteration termination condition: the iteration stops when the number of iterations reaches the set maximum number of iterations or the set minimum error criterion is met; otherwise the iteration continues until the termination condition is satisfied.
CN202110580411.2A 2021-05-26 2021-05-26 Coal and gangue identification method for particle swarm optimization XGboost algorithm Withdrawn CN113269254A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110580411.2A CN113269254A (en) 2021-05-26 2021-05-26 Coal and gangue identification method for particle swarm optimization XGboost algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110580411.2A CN113269254A (en) 2021-05-26 2021-05-26 Coal and gangue identification method for particle swarm optimization XGboost algorithm

Publications (1)

Publication Number Publication Date
CN113269254A (en) 2021-08-17

Family

ID=77233143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110580411.2A Withdrawn CN113269254A (en) 2021-05-26 2021-05-26 Coal and gangue identification method for particle swarm optimization XGboost algorithm

Country Status (1)

Country Link
CN (1) CN113269254A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688903A (en) * 2021-08-24 2021-11-23 贵州电网有限责任公司 Power transmission line micro-terrain classification method easy to cover ice
CN113688903B (en) * 2021-08-24 2024-03-22 贵州电网有限责任公司 Method for classifying ice-covered micro-topography of power transmission line Louis
CN114235740A (en) * 2021-11-12 2022-03-25 华南理工大学 XGboost model-based waste plastic spectrum identification method
CN114264620A (en) * 2021-11-24 2022-04-01 淮阴工学院 Portable spectral data analysis system and method based on python language
CN114943189A (en) * 2022-07-26 2022-08-26 广东海洋大学 XGboost-based acoustic velocity profile inversion method and system

Similar Documents

Publication Publication Date Title
CN113269254A (en) Coal and gangue identification method for particle swarm optimization XGboost algorithm
WO2022160771A1 (en) Method for classifying hyperspectral images on basis of adaptive multi-scale feature extraction model
Zhao et al. Cloud shape classification system based on multi-channel cnn and improved fdm
CN107563381B (en) Multi-feature fusion target detection method based on full convolution network
CN107609525B (en) Remote sensing image target detection method for constructing convolutional neural network based on pruning strategy
CN112464883B (en) Automatic detection and identification method and system for ship target in natural scene
CN106980858B (en) Language text detection and positioning system and language text detection and positioning method using same
CN108648191B (en) Pest image recognition method based on Bayesian width residual error neural network
CN108229550B (en) Cloud picture classification method based on multi-granularity cascade forest network
CN105825502B (en) A kind of Weakly supervised method for analyzing image of the dictionary study based on conspicuousness guidance
CN109697469A (en) A kind of self study small sample Classifying Method in Remote Sensing Image based on consistency constraint
CN110853070A (en) Underwater sea cucumber image segmentation method based on significance and Grabcut
CN107680099A (en) A kind of fusion IFOA and F ISODATA image partition method
CN110458022B (en) Autonomous learning target detection method based on domain adaptation
CN112613428B (en) Resnet-3D convolution cattle video target detection method based on balance loss
CN107945210A (en) Target tracking algorism based on deep learning and environment self-adaption
CN113435486A (en) Coal gangue identification method based on PCA-IFOA-SVM combined with gray level-texture fusion features
CN116977633A (en) Feature element segmentation model training method, feature element segmentation method and device
CN114329031A (en) Fine-grained bird image retrieval method based on graph neural network and deep hash
CN115953666B (en) Substation site progress identification method based on improved Mask-RCNN
Wicaksono et al. Improve image segmentation based on closed form matting using K-means clustering
CN116343205A (en) Automatic labeling method for fluorescence-bright field microscopic image of planktonic algae cells
Osumi et al. Domain adaptation using a gradient reversal layer with instance weighting
CN113724233B (en) Transformer equipment appearance image defect detection method based on fusion data generation and transfer learning technology
CN114913158A (en) Hydrogeological rock mass crack and crack water seepage detection method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210817

WW01 Invention patent application withdrawn after publication