CN114896860B - Soft measurement method for carbon content of fly ash based on LightGBM and XGBoost combined model

Info

Publication number: CN114896860B
Application number: CN202210318954.1A
Authority: CN
Inventors: 刘军平; 骆海瑞; 彭涛; 胡新荣; 何儒汉; 朱强; 张俊杰; 熊明福
Original assignee: Wuhan Textile University
Current assignee: Wuhan Textile University
Priority date: 2022-03-29
Filing date: 2022-03-29
Publication date: 2024-05-14
Anticipated expiration: 2042-03-29
Also published as: CN114896860A

Abstract

The invention discloses a soft measurement method for fly ash carbon content based on a combined LightGBM and XGBoost model, comprising the following steps: 1) obvious erroneous values in the DCS data are removed and steady-state data are extracted using data mining techniques; 2) the redundant-feature problem in fly ash carbon content soft measurement is addressed in feature engineering with a correlation matrix and a wrapper method; 3) LightGBM and XGBoost are each combined with a Bayesian optimization (BO) algorithm to build fly ash carbon content prediction models on the processed data set, so that the optimal hyperparameters are selected and prediction accuracy is improved; 4) the BO-XGBoost and BO-LightGBM models are combined using a sequential least squares programming algorithm. Compared with general fly ash carbon content soft measurement models, the invention provides a more detailed and reasonable feature processing method that eliminates redundant features and benefits subsequent predictive modeling. Combining the LightGBM and XGBoost models with the sequential least squares programming algorithm gives the model stronger generalization capability and higher prediction accuracy, and the results obtained on the fly ash carbon content soft measurement task are better than those of traditional methods.

Description

Soft measurement method for carbon content of fly ash based on LightGBM and XGBoost combined model

Technical Field

The invention belongs to the technical field of boiler fly ash carbon content measurement, and particularly relates to a soft measurement method for fly ash carbon content based on a combined LightGBM and XGBoost model.

Background

The carbon content of boiler fly ash is one of the important indexes for evaluating the combustion state of a coal-fired boiler. Monitoring it in real time helps keep the fly ash carbon content within a reasonable range, which reduces power generation cost and improves the economy of the unit. The fly ash heat loss of a boiler is the second largest heat loss, after the flue gas heat loss. In actual operation it is difficult to adjust the boiler to its optimal working condition, and carbon measuring instruments are expensive, so an economical and effective method for accurately obtaining the fly ash carbon content in real time is important for improving combustion efficiency and guiding the operation of thermal power units.

Current methods for obtaining the fly ash carbon content of a coal-fired boiler fall into three categories: manual sampling and laboratory assay, physical measurement methods, and soft measurement methods. Manual sampling and assay requires dedicated personnel to take and prepare samples periodically, which consumes manpower and material resources; it also suffers from data lag and is prone to errors and omissions. Physical methods typically include the loss-on-ignition method, spectroscopic analysis, and the microwave method; these are difficult to popularize widely for technical or cost reasons. The soft measurement method combines knowledge of the production process through mechanism analysis, can quickly and accurately reflect the fly ash carbon content under different working conditions, and is more economical.

There has been some prior research on soft measurement of fly ash carbon content. However, the boiler combustion process is a multivariable, nonlinear, strongly coupled thermodynamic process. For example, the DCS records parameters such as air volume, air pressure, and air temperature for each outlet of the coal mills. When these parameters are used as boiler combustion modeling variables, they are highly correlated, which introduces variable redundancy, degrades model estimation accuracy, and increases computational complexity. More elaborate feature engineering methods are therefore needed to reduce the effect of redundant variables. Most current studies test on limited data and working conditions, which cannot effectively represent the full range of boiler operating conditions. Traditional regression methods include linear regression, support vector machines, and time series analysis. These models are relatively simple and often perform poorly on complex, high-dimensional, noisy data. Ensemble learning methods fuse the predictions of multiple learners through various voting mechanisms to obtain a more accurate result. It is therefore desirable to combine an ensemble model with feature engineering, build a more accurate model through hyperparameter tuning and similar methods, and apply it to an actual combustion system.

Disclosure of Invention

The invention is made to solve the above problems, and an object of the invention is to provide a soft measurement method for fly ash carbon content based on LightGBM and XGBoost combined model, which can obtain more accurate fly ash carbon content.

In order to achieve the above object, the present invention adopts the following scheme:

As shown in FIG. 1, the invention provides a soft measurement method for fly ash carbon content based on a combined LightGBM and XGBoost model, comprising the following steps:

Step 1, acquiring DCS (distributed control system) data of a boiler and performing data mining on the acquired DCS data, including removal of obvious outliers and data resampling;

Step 2, acquiring historical data variables of the measured parameter values at the boiler working condition measuring points, including relative working condition measuring points and reference working condition measuring points, over a certain period; in view of the multivariable, nonlinear, and strongly coupled nature of the boiler combustion process, first using a correlation matrix to find the variables strongly coupled with the fly ash carbon content and to remove the variables with low correlation to it, and then further extracting the important variables with a wrapper method as the input of the subsequent models;

Step 3, dividing the important variables finally extracted in Step 2 into a training set, a validation set, and a test set, and adopting the XGBoost and LightGBM models respectively as prediction models;

Step 4, performing hyperparameter tuning with a Bayesian optimization algorithm, with 5-fold cross-validation set for evaluating the prediction models, RMSE as the evaluation metric, and N as the number of iterations; after the optimal hyperparameters are selected, establishing the BO-LightGBM and BO-XGBoost models to predict the fly ash carbon content;

Step 5, combining the fly ash carbon content predictions of the BO-XGBoost and BO-LightGBM models using a sequential least squares programming algorithm to obtain the final predicted value.

Further, in Step 2, the correlation matrix is expressed by the correlation coefficient of equation (1), whose sign indicates a positive or negative relationship with the target variable:

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} \tag{1}$$

where $r$ is the correlation coefficient, $x_i$ is the $i$-th value of the variable $x$, $y_i$ is the $i$-th value of the variable $y$, $i \in [1, n]$, $n$ is the total number of values, and $\bar{x}$ and $\bar{y}$ are the averages of the $x$ and $y$ variables, respectively.

Further, the historical data variables in the step 2 include the coal feeding amount of each coal mill, the primary air pressure, the air temperature, the air quantity, the outlet temperature and the current of the separator of each coal mill, the opening degree of the secondary air door of each layer, the temperature, the pressure, the air quantity and the oxygen content of the primary air and the secondary air related to the air preheater, the air feeding temperature, the pressure and the air quantity of the blower, the oxygen content and the exhaust gas temperature of the tail flue, the power generation power, the total primary air quantity, the total secondary air quantity, the hearth pressure and the hearth temperature.

Further, when the BO-XGBoost and BO-LightGBM models are combined in Step 5 using the sequential least squares programming algorithm, the combination problem is expressed by formula (6),

wherein the objective function Obj is a mean square error function and $\bar{Y}$ is the average of the true values over all samples;

the initial values of the weights are chosen as the ratio of the mean square errors between the predicted values and the true values of the two models, as shown in formula (7);

where $n$ is the total number of samples, $i$ denotes the $i$-th sample, $w_1$ and $w_2$ are the weight coefficients of the BO-XGBoost model and the BO-LightGBM model, $y_{1i}$ is the predicted value of the $i$-th sample obtained by the BO-XGBoost model, $y_{2i}$ is the predicted value of the $i$-th sample obtained by the BO-LightGBM model, and $y_i$ is the true value of the $i$-th sample; the predicted value of the combined model is given by formula (8);

wherein $\bar{y}_1$ and $\bar{y}_2$ are the averages of the predictions over all samples of the BO-XGBoost model and the BO-LightGBM model, respectively.

Further, for a given training set, the predicted value $\hat{y}_i$ of the LightGBM model can be expressed by formula (2):

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \tag{2}$$

where $\hat{y}_i$ is the predicted value of the LightGBM model, $K$ is the number of decision trees, $f_k$ is the prediction of the $k$-th decision tree, $x_i$ is the $i$-th input sample, and $F$ is the set of all decision trees. The objective function $L^{(t)}$ of LightGBM is given by formula (3):

$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \tag{3}$$

In formula (3), $n$ is the total number of samples, $i$ is the index of the current sample, and $l(y_i, \hat{y}_i)$ is the loss function measuring the difference between the target value $y_i$ and the predicted value $\hat{y}_i$. For the regression problem it is the mean square error loss, that is, $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$. $\hat{y}_i^{(t-1)}$ is the prediction of the first $t-1$ rounds at the $t$-th iteration, $f_t(x_i)$ is the prediction of the $t$-th round, and $\Omega(f_t)$ is the model complexity, given by formula (4):

$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{4}$$

In formula (4), $\gamma$ and $\lambda$ are regularization coefficients that keep the decision tree from becoming too complex, $T$ is the number of leaf nodes, and $w_j$ is the weight coefficient of the $j$-th leaf node.

Further, for a given training set, the predicted value of the XGBoost model can be expressed by the following formula:

$$f(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$

where $f(x_i)$ is the predicted value of the XGBoost model, $K$ is the number of decision trees, $f_k$ is the prediction of the $k$-th decision tree, $x_i$ is the $i$-th input sample, and $F$ is the set of all decision trees. The objective function of the XGBoost model is given by formula (5):

$$L^{(t)} \approx \sum_{i=1}^{n}\left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{5}$$

where $n$ is the total number of samples, $i$ is the index of the current sample, $g_i$ is the first derivative of the loss function at sample $x_i$, $h_i$ is the second derivative of the loss function at sample $x_i$, the loss function is the mean square error loss, $f_t(x_i)$ is the prediction of the $t$-th round, $\gamma$ and $\lambda$ are the regularization coefficients, $T$ is the number of leaf nodes, and $w_j$ is the weight coefficient of the $j$-th leaf node.

Compared with the prior art, the scheme of the invention has the beneficial effects that:

Based on actual operating data from a coal-fired power plant boiler, the invention for the first time integrates several machine learning algorithms with a data-driven data mining approach to analyze the relationship between fly ash carbon content and the operating parameters of the boiler. Redundant features are removed in two steps, using a correlation matrix and then a wrapper method, to extract the important features. The data are then fed to the LightGBM and XGBoost models for training, prediction, and validation, and the models are combined by a sequential least squares programming algorithm. The result reflects actual power plant operating conditions faithfully and comprehensively, yields fly ash carbon content values closest to those of the actual combustion system, improves soft measurement accuracy, and ensures the reliability and accuracy of fly ash carbon content soft measurement in power plants.

Drawings

FIG. 1 is a flow chart of the soft measurement method for fly ash carbon content according to the invention;

FIG. 2 is a flow chart of Bayesian optimization of the LightGBM model hyperparameters according to the invention;

FIG. 3 is a flow chart of Bayesian optimization of the XGBoost model hyperparameters according to the invention;

FIG. 4 is a flow chart of model combination by the sequential least squares programming algorithm according to the invention.

Detailed Description

The invention is further illustrated and described below with reference to the drawings and detailed description.

Step 1, acquiring DCS (distributed control system) data of the boiler and performing data mining on it, specifically removal of obvious outliers and data resampling. Some of the acquired DCS data contain outliers caused by system restarts or other reasons; for example, because of a plant shutdown, the actual load recorded by the DCS contains some invalid data at the beginning of the record. All measuring point data outside a reasonable range are rejected. Because a thermal power unit must adjust its output to the grid load, its load fluctuates strongly and the unit continually cycles through steady-state, transition, and steady-state working conditions, which can reduce the correlation between the data. This effect can be minimized by aggregating the data over a larger time interval, that is, by resampling the data over an appropriate period.
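The two preprocessing operations of Step 1 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation used in the invention: the function name `clean_and_resample`, the plausible-range limits, and the sample readings are hypothetical, and the 5-minute averaging interval is the one given in Example 1.

```python
from datetime import datetime
from statistics import mean


def clean_and_resample(records, low, high, interval_minutes=5):
    """Drop out-of-range readings, then average the rest per time bucket.

    records: iterable of (timestamp, value) pairs for one measuring point.
    interval_minutes is assumed to divide 60 so buckets align to the hour.
    """
    buckets = {}
    for ts, value in records:
        if not (low <= value <= high):
            continue  # reject readings outside the plausible range
        # Floor the timestamp to the start of its bucket.
        start = ts.replace(minute=ts.minute - ts.minute % interval_minutes,
                           second=0, microsecond=0)
        buckets.setdefault(start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical unit-load readings in MW; -999.0 mimics a restart artifact.
readings = [
    (datetime(2022, 1, 1, 0, 0), 300.0),
    (datetime(2022, 1, 1, 0, 2), 310.0),
    (datetime(2022, 1, 1, 0, 4), -999.0),
    (datetime(2022, 1, 1, 0, 6), 320.0),
]
resampled = clean_and_resample(readings, low=0.0, high=700.0)
for start, value in resampled:
    print(start.isoformat(), value)
```

In practice each of the roughly 70 measuring points would need its own plausible range, which the source does not specify.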

Step 2, the boiler combustion process is a multivariable, nonlinear, strongly coupled thermodynamic process. For example, the DCS records parameters such as air volume, air pressure, and air temperature for each outlet of the coal mills. When these parameters are used as boiler combustion modeling variables, they are highly correlated, which introduces variable redundancy, degrades model estimation accuracy, and increases computational complexity. Feature engineering methods are therefore needed to reduce the effect of redundant variables. First, the variables strongly coupled with the fly ash carbon content are identified through a correlation matrix and the variables with low correlation to it are removed; the important variables are then further extracted by a wrapper method.

A correlation matrix is first constructed to quantify variable dependence. The correlation matrix is a table showing how each variable relates to the predicted quantity, expressed by the correlation coefficient of equation (1). The correlation coefficient may be negative or positive, indicating an inverse or a direct relationship with the target variable.

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} \tag{1}$$

where $r$ is the correlation coefficient, $x_i$ is the $i$-th value of the variable $x$, $y_i$ is the $i$-th value of the variable $y$, $i \in [1, n]$, $n$ is the total number of values, and $\bar{x}$ and $\bar{y}$ are the averages of the $x$ and $y$ variables, respectively.
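The correlation screening can be illustrated with a short sketch that computes the correlation coefficient of equation (1) and keeps the variables whose absolute correlation with the target reaches a cutoff. The cutoff value of 0.5, the variable names, and the data are assumptions made for illustration; the source does not state the threshold it uses.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear information
    return cov / sqrt(var_x * var_y)


def select_by_correlation(features, target, threshold):
    """Keep the features whose |r| with the target reaches the threshold."""
    return [name for name, values in features.items()
            if abs(pearson_r(values, target)) >= threshold]


# Hypothetical measurements: fly ash carbon content and three DCS variables.
carbon = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "total_air": [2.0, 4.0, 6.0, 8.0, 10.0],       # perfectly direct
    "flue_o2": [5.0, 4.0, 3.0, 2.0, 1.0],          # perfectly inverse
    "mill_current": [1.0, -1.0, 0.0, -1.0, 1.0],   # uncorrelated
}
selected = select_by_correlation(features, carbon, threshold=0.5)
print(selected)
```

The absolute value is used because a strongly inverse variable is as informative as a strongly direct one.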

The wrapper method selects variables with respect to a specific prediction model. The invention adopts recursive feature elimination (RFE), a greedy optimization algorithm that selects the best set of variables through repeated iterations.
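The elimination loop of RFE can be sketched in pure Python as below. The source does not say which estimator drives the ranking, so this sketch assumes a least-squares linear model on standardized features and drops the feature with the smallest absolute coefficient at each round. All function names, variable names, and data are hypothetical, and every feature is assumed to be non-constant.

```python
def solve_linear(a, b):
    """Solve a w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear(columns, y, alpha=1e-6):
    """Least-squares coefficients; alpha is a tiny ridge term for stability."""
    k = len(columns)
    gram = [[sum(p * q for p, q in zip(columns[i], columns[j]))
             + (alpha if i == j else 0.0) for j in range(k)] for i in range(k)]
    rhs = [sum(p * t for p, t in zip(columns[i], y)) for i in range(k)]
    return solve_linear(gram, rhs)


def standardize(values):
    """Scale a non-constant series to zero mean and unit variance."""
    mu = sum(values) / len(values)
    sd = (sum((v - mu) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mu) / sd for v in values]


def rfe(features, target, n_keep):
    """Refit and drop the weakest feature until n_keep features remain."""
    kept = list(features)
    cols = {name: standardize(vals) for name, vals in features.items()}
    mu = sum(target) / len(target)
    y = [t - mu for t in target]
    while len(kept) > n_keep:
        coefs = fit_linear([cols[name] for name in kept], y)
        weakest = min(range(len(kept)), key=lambda i: abs(coefs[i]))
        kept.pop(weakest)
    return kept


# Hypothetical data: the target depends strongly on a, less on b, barely on c.
a = [1, 1, 1, 1, -1, -1, -1, -1]
b = [1, 1, -1, -1, 1, 1, -1, -1]
c = [1, -1, 1, -1, 1, -1, 1, -1]
target = [3 + 5 * p + 2 * q + 0.1 * r for p, q, r in zip(a, b, c)]
features = {"total_air": a, "flue_o2": b, "mill_current": c}
kept_two = rfe(features, target, n_keep=2)
kept_one = rfe(features, target, n_keep=1)
print(kept_two, kept_one)
```

Refitting after every removal is what distinguishes RFE from a one-shot ranking: the importance of the remaining variables is re-estimated once a redundant one is gone.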

Step 3, dividing the data processed in Steps 1 and 2 into a training set, a validation set, and a test set, and adopting the LightGBM and XGBoost models as prediction models:

LightGBM is an ensemble machine learning algorithm developed by Microsoft in 2017. It is an efficient, distributed implementation of the gradient boosting decision tree (GBDT) framework that adds the GOSS (Gradient-based One-Side Sampling) and EFB (Exclusive Feature Bundling) algorithms on top of GBDT. LightGBM supports parallel learning and fast processing of large-scale data, so it is more efficient while maintaining accuracy and interpretability. GBDT is the core of LightGBM: it builds a strong learner by iteratively adding weak learners fitted to the negative gradient of the loss function. GOSS keeps the data instances with large gradients and randomly samples only a fraction of those with small gradients when computing the information gain, so a reasonably accurate gain estimate is obtained from less data. EFB bundles mutually exclusive features together to reduce the number of features. Together, the two techniques reduce computation time and memory use, so training completes faster.
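As an illustration of the GOSS idea, the sketch below performs one sampling draw in the way described in the LightGBM paper: the instances with the largest absolute gradients are kept, a random fraction of the rest is drawn, and the drawn instances are up-weighted by (1 - a) / b to compensate for the change in data distribution. The rates, the function name, and the gradient values are illustrative assumptions; this is not code from the LightGBM library.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {sample index: weight} for one GOSS draw."""
    n = len(gradients)
    # Rank samples by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    # Draw a random subset of the small-gradient samples.
    sampled = random.Random(seed).sample(rest, n_other)
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights


# Hypothetical per-sample gradients from one boosting round.
gradients = [0.1, -2.0, 0.05, 0.3, 1.5, -0.2, 0.02, 0.4, -0.1, 0.15]
weights = goss_sample(gradients)
print(weights)
```

Only 3 of the 10 samples take part in the gain computation here, which is where the reduction in computation time comes from.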

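For a given training set $D$, the predicted value $\hat{y}_i$ of LightGBM can be expressed by formula (2):

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \tag{2}$$

where $\hat{y}_i$ is the predicted value of the model, $K$ is the number of decision trees, $f_k$ is the prediction of the $k$-th decision tree, $x_i$ is the $i$-th input sample, and $F$ is the set of all decision trees. The objective function $L^{(t)}$ of LightGBM is given by formula (3):

$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \tag{3}$$

In formula (3), $n$ is the number of samples, $i$ is the index of the current sample, and $l(y_i, \hat{y}_i)$ is the loss function measuring the difference between the target value $y_i$ and the predicted value $\hat{y}_i$. For regression problems it is usually the mean square error loss, that is, $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$. $\hat{y}_i^{(t-1)}$ is the prediction of the first $t-1$ rounds at the $t$-th iteration, $f_t(x_i)$ is the prediction of the $t$-th round, and $\Omega(f_t)$ is the model complexity, usually expressed by formula (4):

$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{4}$$

where $\gamma$ and $\lambda$ are regularization coefficients that keep the decision tree from becoming too complex, $T$ is the number of leaf nodes, and $w_j$ is the weight coefficient of the $j$-th leaf node.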

XGBoost (eXtreme Gradient Boosting), proposed by Tianqi Chen, is one of the machine learning algorithms most widely used by data scientists and has achieved good results in numerous machine learning competitions. XGBoost is an improvement on the GBDT algorithm. Unlike LightGBM, XGBoost traverses the data more exhaustively during computation; the data can be loaded entirely into memory, and computation is accelerated by parallel processing. The predicted value of XGBoost has the same form as that of LightGBM, and its objective function is given by formula (5):

$$L^{(t)} \approx \sum_{i=1}^{n}\left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{5}$$

where $n$ is the number of samples, $i$ is the index of the current sample, $g_i$ is the first derivative of the loss function at sample $x_i$, $h_i$ is the second derivative of the loss function at sample $x_i$, $f_t(x_i)$ is the prediction of the $t$-th round, $\gamma$ and $\lambda$ are the regularization coefficients, $T$ is the number of leaf nodes, and $w_j$ is the weight coefficient of the $j$-th leaf node.
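As a worked illustration of how $g_i$ and $h_i$ enter the objective, assume the squared-error loss $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$ and the standard second-order form of the gradient boosting objective. The leaf sample set $I_j$ and the sums $G_j$ and $H_j$ are notation introduced here for the illustration only. Differentiating the loss at the current prediction gives

$$g_i = \left.\frac{\partial l}{\partial \hat{y}}\right|_{\hat{y}_i^{(t-1)}} = 2\left(\hat{y}_i^{(t-1)} - y_i\right), \qquad h_i = \left.\frac{\partial^2 l}{\partial \hat{y}^2}\right|_{\hat{y}_i^{(t-1)}} = 2.$$

For the samples $I_j$ that fall in leaf $j$, every $f_t(x_i)$ equals the leaf weight $w_j$, so the part of the objective that depends on $w_j$ is a quadratic in $w_j$:

$$G_j w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^2, \qquad G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i,$$

which is minimized at

$$w_j^{*} = -\frac{G_j}{H_j + \lambda}.$$

For example, a leaf holding three samples with residuals $y_i - \hat{y}_i^{(t-1)}$ of 0.3, 0.5, and 0.1 and $\lambda = 1$ has $G_j = -2 \times 0.9 = -1.8$ and $H_j = 6$, so $w_j^{*} = 1.8 / 7 \approx 0.257$. The regularization term shrinks the leaf weight below the plain mean residual of 0.3.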

Step 4, performing hyperparameter tuning with the Bayesian optimization (BO) algorithm. For model evaluation, 5-fold cross-validation is used, the evaluation metric is RMSE, and the optimization runs for 100 sequential iterations. After the optimal hyperparameters are selected, the BO-LightGBM and BO-XGBoost models are established to predict the fly ash carbon content.
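The quantity that the tuning procedure minimizes, the 5-fold cross-validated RMSE of one hyperparameter setting, can be sketched as follows. Only the scoring step is shown; the Bayesian optimization loop that proposes the next setting is not. The trainer `make_trainer` with its single `shrinkage` hyperparameter is a hypothetical stand-in for fitting LightGBM or XGBoost, and the data are made up.

```python
from math import sqrt


def kfold_indices(n, k=5):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def rmse(y_true, y_pred):
    """Root mean square error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def cv_rmse(train_fn, xs, ys, k=5):
    """Mean RMSE over k folds; train_fn(train_x, train_y) returns a predictor."""
    scores = []
    for fold in kfold_indices(len(ys), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        predict = train_fn(train_x, train_y)
        preds = [predict(xs[i]) for i in fold]
        scores.append(rmse([ys[i] for i in fold], preds))
    return sum(scores) / k


def make_trainer(shrinkage):
    """Hypothetical model: predicts the training mean scaled by `shrinkage`."""
    def train(train_x, train_y):
        mean_y = sum(train_y) / len(train_y)
        return lambda x: shrinkage * mean_y
    return train


xs = list(range(10))
ys = [2.0] * 10
score_good = cv_rmse(make_trainer(1.0), xs, ys)
score_bad = cv_rmse(make_trainer(0.5), xs, ys)
print(score_good, score_bad)
```

An optimizer would call `cv_rmse` once per candidate setting, 100 times in the configuration described above, and keep the setting with the lowest score.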

Step 5, to improve prediction accuracy and overcome the limited robustness of a single model, the XGBoost and LightGBM models are combined using a sequential least squares programming algorithm. Sequential quadratic programming (SQP) algorithms are widely used for least squares problems, nonlinear optimization problems, economics, and system analysis. The model combination problem is expressed by formula (6),

wherein the objective function Obj is a mean square error function and $\bar{Y}$ is the average of the true values over all samples.

Since the objective in formula (6) is a nonlinear quadratic function and the constraint is linear, it is a quadratic programming problem that can be solved with the sequential least squares programming algorithm.

The ratio of the mean square errors between the predicted values and the true values of the two models, as shown in formula (7), is chosen as the initial value of the weights, which speeds up the solution and helps avoid becoming trapped in a local optimum.
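For two models the weight problem is small enough to check by hand. The sketch below assumes the constraints are that the two weights sum to 1 and each lies in [0, 1], which is an assumption about the linear constraint mentioned above, since its exact form is not reproduced in this text. Under that assumption the mean square error is a one-dimensional quadratic in the first weight, so the optimum has a closed form and no iterative solver is needed. This closed form is a stand-in for the sequential least squares programming solver and is useful as a cross-check on its result; a general-purpose alternative is `scipy.optimize.minimize` with `method="SLSQP"`. Function names and data are hypothetical.

```python
def combine_weights(y_true, pred_a, pred_b):
    """Weights (w_a, w_b) with w_a + w_b = 1 and 0 <= w <= 1 minimizing MSE."""
    d = [a - b for a, b in zip(pred_a, pred_b)]   # disagreement between models
    e = [y - b for y, b in zip(y_true, pred_b)]   # residual of model b
    denom = sum(v * v for v in d)
    if denom == 0:
        return 0.5, 0.5  # identical predictions: every split gives the same MSE
    w_a = sum(p * q for p, q in zip(d, e)) / denom
    w_a = min(1.0, max(0.0, w_a))  # clip the unconstrained optimum to [0, 1]
    return w_a, 1.0 - w_a


def mse(y_true, y_pred):
    """Mean square error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical predictions: one model reads 0.2 high, the other 0.2 low.
y_true = [1.0, 2.0, 3.0, 4.0]
pred_xgb = [1.2, 2.2, 3.2, 4.2]
pred_lgbm = [0.8, 1.8, 2.8, 3.8]
w_xgb, w_lgbm = combine_weights(y_true, pred_xgb, pred_lgbm)
combined = [w_xgb * a + w_lgbm * b for a, b in zip(pred_xgb, pred_lgbm)]
print(w_xgb, w_lgbm, mse(y_true, combined))
```

Because the two models err in opposite directions by the same amount, equal weights cancel the bias and the combined error is far below that of either model alone.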

Example 1

Step 1, acquiring historical data of all power plant working conditions over a period of time (for example, 50 days). The working condition measuring points acquired comprise the coal feed rate of each coal mill; the primary air pressure, air temperature, air volume, separator outlet temperature, and current of each coal mill; the secondary air damper opening of each layer; the temperature, pressure, air volume, and oxygen content of the primary and secondary air associated with the air preheater; the supply air temperature, pressure, and air volume of the forced draft fan; the oxygen content and exhaust gas temperature of the tail flue; and other general parameters such as generated power, total primary air volume, total secondary air volume, furnace pressure, and furnace temperature, about 70 parameters in total.

Step 2, removing obvious abnormal values from the data and resampling the data by averaging over 5-minute intervals.

Step 3, performing feature dimensionality reduction with machine learning feature selection methods. In view of the multivariable, nonlinear, and strongly coupled nature of the boiler combustion process, the variables strongly coupled with the fly ash carbon content are first identified through a correlation matrix and the variables with low correlation to it are removed; the important variables are then further extracted by a wrapper method.

The correlation matrix is expressed by the correlation coefficient of equation (1), whose sign indicates a positive or negative relationship with the target variable.

The wrapper method selects variables with respect to a specific prediction model. The invention adopts recursive feature elimination (RFE), a greedy optimization algorithm that selects the best set of variables through repeated iterations. The variables selected by the correlation matrix are further screened by the wrapper method.

Step 4, dividing all samples screened in Step 3 into a training set, a validation set, and a test set at a ratio of 4:1:1, and adopting the XGBoost and LightGBM models respectively as prediction models; validation uses five-fold cross-validation.

Step 5, based on the prediction models of the invention, performing hyperparameter tuning with the Bayesian optimization (BO) algorithm. For model evaluation, 5-fold cross-validation is used, the evaluation metric is RMSE, and the optimization runs for 100 sequential iterations. After the optimal hyperparameters are selected, the BO-LightGBM and BO-XGBoost models are established to predict the fly ash carbon content; the specific procedures are shown in FIGS. 2 and 3.

Step 6, combining the fly ash carbon content predictions of the BO-XGBoost and BO-LightGBM models using a sequential least squares programming algorithm to obtain the final predicted value.

The above method is applied to the following embodiments to embody the technical effects of the present invention, and specific steps in the embodiments will not be described in detail.

Table 1 compares the performance of the proposed method with other methods. The proposed method achieves the lowest MAPE and RMSE and the highest R². Compared with the other methods, it reduces RMSE by 1.8%-26.2% and MAPE by 0.7%-19.24%, showing that the error is further reduced and the measurement accuracy improved. R² is improved by 1.3%-20.9%, showing that the prediction curve fits better and that the method is more accurate and reliable.

Specifically, LM-Garson-BP, AQPSO-SVR, and FPA-RF all use heuristic algorithms for parameter tuning combined with a regression model for prediction, which improves the prediction accuracy of the corresponding model to some extent. From the perspective of hyperparameter tuning, however, when the BO algorithm faces the complex optimization problem of hyperparameter tuning, which is non-convex, multi-modal, and expensive to evaluate, it chooses the next evaluation point according to the information already obtained about the unknown objective function, and thus reaches the optimal solution fastest. The BO algorithm thereby avoids problems such as failing to make effective use of feedback from earlier iterations and slow search.

From the perspective of the prediction model, LightGBM and XGBoost are decision-tree ensemble algorithms whose objective functions use a second-order Taylor expansion, so the model can learn fully, and add a regularization term that reduces model complexity. They offer advantages such as preventing overfitting and supporting parallel and distributed computation, and can effectively improve prediction accuracy. The combined model builds on the single models to combine the advantages of both and improves robustness. The prediction performance is therefore better than that of the 6 models compared.
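For reference, the three metrics used in the comparison can be computed as in the sketch below, which uses their standard textbook definitions; the source does not spell out its formulas, and the sample values are made up.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean square error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; true values must be nonzero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted fly ash carbon contents (%).
measured = [2.0, 4.0, 5.0]
predicted = [2.2, 3.8, 5.0]
print(rmse(measured, predicted), mape(measured, predicted), r2(measured, predicted))
```

Lower RMSE and MAPE and a higher R² indicate a better model, which is the sense in which the comparison above ranks the methods.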

TABLE 1 prediction results for different models

The specific embodiments described herein are offered by way of example only to illustrate the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiments or substitutions thereof without departing from the spirit of the invention or exceeding the scope of the invention as defined in the accompanying claims.

Claims

1. The soft measurement method for the carbon content of the fly ash based on LightGBM and XGBoost combined model is characterized by comprising the following steps of:

Step 1, acquiring DCS data of a boiler and performing data mining on the acquired DCS data, including removal of obvious outliers and data resampling;

for a given training set, the predicted value $\hat{y}_i$ of the LightGBM model can be expressed by formula (2):

in formula (3), $l(y_i, \hat{y}_i)$ is the loss function measuring the difference between the target value $y_i$ and the predicted value $\hat{y}_i$; for the regression problem it is the mean square error loss, that is, $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$; $\hat{y}_i^{(t-1)}$ is the prediction of the first $t-1$ rounds at the $t$-th iteration, $f_t(x_i)$ is the prediction of the $t$-th round, and $\Omega(f_t)$ is the model complexity, expressed by formula (4);

in formula (4), $\gamma$ and $\lambda$ are regularization coefficients that keep the decision tree from becoming too complex, $T$ is the number of leaf nodes, and $w$ is the weight coefficient of the leaf nodes;

for a given training set, the predicted value of the XGBoost model can be expressed by the following formula:

wherein $g_i$ is the first derivative of the loss function at sample $x_i$, $h_i$ is the second derivative of the loss function at sample $x_i$, the loss function is the mean square error loss, $f_t(x_i)$ is the prediction of the $t$-th round, $\lambda$ is the regularization coefficient, $T$ is the number of leaf nodes, and $w_j$ is the weight coefficient of the $j$-th leaf node;

Step 5, combining the fly ash carbon content predictions of the BO-XGBoost and BO-LightGBM models using a sequential least squares programming algorithm to obtain the final predicted value;

when the sequential least squares programming algorithm is used to combine the BO-XGBoost and BO-LightGBM models in Step 5,

2. The soft measurement method for the carbon content of the fly ash based on LightGBM and XGBoost combined model as claimed in claim 1, wherein: in Step 2, the correlation matrix is expressed by the correlation coefficient of equation (1), whose sign indicates a positive or negative relationship with the target variable;

3. The soft measurement method for the carbon content of the fly ash based on LightGBM and XGBoost combined model as claimed in claim 1, wherein: the historical data variables in the step 2 comprise the coal feeding amount of each coal mill, the primary air pressure, the air temperature, the air quantity, the outlet temperature and the current of a separator of each coal mill, the opening degree of a secondary air door of each layer, the temperature, the pressure, the air quantity and the oxygen content of primary air and secondary air related to an air preheater, the air supply temperature, the pressure and the air quantity of a blower, the oxygen content and the exhaust gas temperature of a tail flue, and the power generation, the total primary air quantity, the total secondary air quantity, the furnace pressure and the furnace temperature.