CN114896860A

CN114896860A - Soft measurement method for carbon content in fly ash based on a combined LightGBM and XGBoost model

Info

Publication number: CN114896860A
Application number: CN202210318954.1A
Authority: CN
Inventors: 刘军平; 骆海瑞; 彭涛; 胡新荣; 何儒汉; 朱强; 张俊杰; 熊明福
Original assignee: Wuhan Textile University
Current assignee: Wuhan Textile University
Priority date: 2022-03-29
Filing date: 2022-03-29
Publication date: 2022-08-12
Anticipated expiration: 2042-03-29
Also published as: CN114896860B

Abstract

The invention discloses a soft measurement method for the carbon content of fly ash based on a combined LightGBM and XGBoost model, comprising the following steps: 1) removing obviously erroneous values from DCS data and extracting steady-state data using data mining techniques; 2) using a correlation matrix and a wrapper method from feature engineering to address redundant features in the soft measurement of fly ash carbon content; 3) combining the processed data set with the LightGBM, XGBoost and Bayesian optimization (BO) algorithms to build fly ash carbon content prediction models, selecting the optimal hyperparameters and improving prediction accuracy; 4) combining the BO-XGBoost and BO-LightGBM models using a sequential least squares programming algorithm. Compared with common soft measurement models for fly ash carbon content, the method provides a more detailed and reasonable feature processing procedure that eliminates redundant features and benefits the subsequent prediction modeling. Because the LightGBM and XGBoost models are combined by a sequential quadratic programming algorithm, the combined model generalizes better and predicts more accurately, and it performs better on the fly ash carbon content soft measurement task than traditional methods.

Description

Soft measurement method for carbon content in fly ash based on a combined LightGBM and XGBoost model

Technical Field

The invention belongs to the technical field of measuring the carbon content of boiler fly ash, and particularly relates to a soft measurement method for the carbon content of fly ash based on a combined LightGBM and XGBoost model.

Background

The carbon content of boiler fly ash is one of the important indexes for evaluating the combustion state of a coal-fired boiler. Monitoring it in real time helps keep it within a reasonable range, which reduces power generation cost and improves the economy of the unit. The fly ash heat loss of the boiler is the second largest heat loss, after the flue gas heat loss. In actual operation it is difficult to tune the boiler to its optimal working condition, and carbon measuring instruments are expensive. An economical and effective method is therefore needed to obtain the fly ash carbon content accurately, so as to improve combustion efficiency and guide the operation of boiler thermal power units.

Existing methods for obtaining the fly ash carbon content of a coal-fired boiler fall into three main types: manual sampling with laboratory assay, physical measurement, and soft measurement. Manual sampling and assay requires dedicated personnel to take and prepare samples regularly, consumes manpower and material resources, and suffers from data lag and a tendency toward errors and omissions. Commonly used physical methods include the loss-on-ignition method, spectral analysis, and the microwave method; for technical or cost reasons they are difficult to deploy widely. The soft measurement method incorporates knowledge of the production process through mechanism analysis, can quickly and accurately reflect the fly ash carbon content under different working conditions, and is more economical.

Some prior art has studied soft measurement of fly ash carbon content. However, the boiler combustion process is a multivariable, nonlinear and strongly coupled thermal process. For example, the DCS records parameters such as air volume, air pressure and air temperature at each coal mill outlet. When used as boiler combustion modeling variables, these parameters are highly correlated, which introduces variable redundancy, degrades the estimation accuracy of the model, and increases computational complexity. A more detailed feature engineering method is therefore needed to reduce the influence of redundant variables. In addition, most published studies use limited test data and working conditions and cannot represent the full operating range of the boiler. Traditional regression methods include linear regression, support vector machines, and time series analysis. These are relatively simple models and generally do not predict well on complex, high-dimensional, noisy data. Ensemble learning fuses the predictions of several learners through voting or similar mechanisms and obtains a more accurate result. For these reasons, an ensemble model combined with feature engineering and hyperparameter tuning is adopted to build a more accurate model and apply it to an actual combustion system.

Disclosure of Invention

The present invention is made to solve the above problems, and an object of the present invention is to provide a method for soft measurement of carbon content in fly ash based on a LightGBM and XGBoost combined model, which can obtain more accurate carbon content in fly ash.

In order to achieve the purpose, the invention adopts the following scheme:

As shown in FIG. 1, the present invention provides a soft measurement method for the carbon content of fly ash based on a combined LightGBM and XGBoost model, comprising:

step 1, acquiring DCS (Distributed Control System) data of a boiler and performing data mining on the acquired data, specifically the removal of obvious outliers and data resampling;

step 2, acquiring historical data variables of the measured parameter values at the boiler operating-condition measuring points, including the relevant and reference operating-condition measuring points, over a given period; in view of the multivariable, nonlinear and strongly coupled nature of the boiler combustion process, first finding the variables strongly coupled with the fly ash carbon content through a correlation matrix and removing the variables with low correlation with the fly ash carbon content, and then further extracting the important variables as inputs of the subsequent model through a wrapper method;

step 3, dividing the data of the important variables finally extracted in step 2 into a training set, a validation set and a test set, and adopting the XGBoost and LightGBM models respectively as prediction models;

step 4, performing hyperparameter optimization with a Bayesian optimization algorithm, using 5-fold cross-validation to evaluate the prediction model with RMSE as the evaluation metric and the number of iterations set to N, and, after selecting the optimal hyperparameters, establishing the BO-LightGBM and BO-XGBoost models for fly ash carbon content prediction;

step 5, combining the fly ash carbon content predictions of the BO-XGBoost and BO-LightGBM models using a sequential least squares programming algorithm to obtain the final predicted value.

Further, in step 2, the correlation matrix is expressed by correlation coefficients. The expression of the correlation coefficient is given in equation (1), and its sign indicates a direct or inverse relationship with the target variable:

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} \tag{1}$$

where $r$ is the correlation coefficient, $x_i$ is the i-th value of variable $x$, $y_i$ is the corresponding i-th value of variable $y$, $i \in [1, n]$, $n$ is the total number of values, and $\bar{x}$ and $\bar{y}$ are the mean values of the $x$ and $y$ variables, respectively.

Further, the historical data variables in step 2 include the coal feeding amount of each coal mill, the primary air pressure, the air temperature, the air volume, the separator outlet temperature and current of the coal mill, the secondary air door opening of each layer, the temperature, the pressure, the air volume and the oxygen content of the primary air and the secondary air related to the air preheater, the air supply temperature, the pressure and the air volume of the air feeder, the oxygen content and the exhaust gas temperature of the tail flue, the power generation power, the total primary air volume, the total secondary air volume, the furnace pressure and the furnace temperature.

Furthermore, when the BO-XGBoost model and the BO-LightGBM model are combined using the sequential least squares programming algorithm in step 5, the combined-model problem is expressed by formula (6),

wherein the objective function Obj is a mean square error function and Y is the average of the true values corresponding to all samples;

the ratio of the mean square errors between the predicted and true values of the two models is selected as the initial value of the weights, as shown in formula (7);

where $n$ is the total number of samples, $i$ denotes the i-th sample, $w_1$ and $w_2$ are the weight coefficients of the BO-XGBoost model and the BO-LightGBM model, $y_{1i}$ is the predicted value of the i-th sample obtained from the BO-XGBoost model, $y_{2i}$ is the predicted value of the i-th sample obtained from the BO-LightGBM model, and $y_i$ is the true value corresponding to the i-th sample; the predicted value of the combined model is given by formula (8);

wherein the two mean terms are the averages, over all samples, of the predicted values of the BO-XGBoost model and the BO-LightGBM model, respectively.

Further, for a given training set, the predicted value $\hat{y}_i$ of the LightGBM model can be expressed by equation (2):

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \tag{2}$$

where $\hat{y}_i$ represents the predicted value of the LightGBM model, $K$ represents the number of decision trees, $f_k$ denotes the predicted value of the k-th decision tree, $x_i$ represents the i-th input sample, and $F$ represents the set of all decision trees. The objective function $L^{(t)}$ of LightGBM is given by equation (3):

$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \tag{3}$$

In equation (3), $n$ represents the total number of samples, $i$ is the index of the current sample, and $l(y_i, \hat{y}_i)$ is the loss function, which represents the difference between the target value $y_i$ and the predicted value $\hat{y}_i$. For the regression problem it is the mean square error loss function, i.e. $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$. $\hat{y}_i^{(t-1)}$ is the predicted value of the previous t-1 rounds at the t-th iteration, $f_t(x_i)$ is the predicted value of the t-th round, and $\Omega(f_t)$ is the model complexity, given by equation (4):

$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{4}$$

In equation (4), $\gamma$ and $\lambda$ are regularization coefficients that prevent the decision tree from becoming too complex, $T$ represents the number of leaf nodes, and $w_j$ is the weight coefficient of the j-th leaf node.

Further, for a given training set, the predicted value of the XGBoost model can be expressed by the following formula:

$$f(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$

where $f(x_i)$ represents the predicted value of the XGBoost model, $K$ represents the number of decision trees, $f_k$ denotes the predicted value of the k-th decision tree, $x_i$ represents the i-th input sample, and $F$ represents the set of all decision trees. The objective function of the XGBoost model is given by equation (5):

$$L^{(t)} \simeq \sum_{i=1}^{n}\left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{5}$$

where $n$ represents the total number of samples, $i$ is the index of the current sample, $g_i$ is the first derivative of the loss function at sample $x_i$, $h_i$ is the second derivative of the loss function at sample $x_i$, the loss function is the mean square error loss function, $f_t(x_i)$ is the predicted value of the t-th round, $\gamma$ and $\lambda$ are the regularization coefficients, $T$ represents the number of leaf nodes, and $w_j$ represents the weight coefficient of the j-th leaf node.

Compared with the prior art, the scheme of the invention has the beneficial effects that:

The method is based on actual operating data from a coal-fired power plant boiler and, for the first time, integrates a data-driven approach that combines several machine learning algorithms and data mining techniques to analyze the relationship between the fly ash carbon content and the boiler operating parameters. Redundant features are removed and important features are extracted in two steps, using a correlation matrix and a wrapper method. The data are then fed to the LightGBM and XGBoost models for training, prediction and validation, and the models are combined through a sequential least squares programming algorithm to obtain the fly ash carbon content closest to that of the actual combustion system. This improves the soft measurement accuracy and ensures the reliability and accuracy of fly ash carbon content soft measurement in the power plant.

Drawings

FIG. 1 is a flow chart of the soft measurement method for carbon content in fly ash according to the present invention.

FIG. 2 is a flow chart of Bayesian optimization for hyperparameter optimization of the LightGBM model according to the present invention.

FIG. 3 is a flow chart of Bayesian optimization for hyperparameter optimization of the XGBoost model according to the present invention.

FIG. 4 is a flow chart of combining the models with the sequential least squares programming algorithm according to the present invention.

Detailed Description

The invention will be further elucidated and described with reference to the drawings and the detailed description.

Step 1, DCS (Distributed Control System) data of the boiler are acquired and data mining is performed on them, specifically the removal of obvious outliers and data resampling. Part of the acquired DCS data contains abnormal values caused by system restarts or other reasons, so data outside the reasonable range at each measuring point are removed. For example, the actual load recorded by the DCS contains some invalid data at the start of the record because the power plant was shut down. Because the output of a thermal power unit must follow the grid load during operation, the load changes sharply and the unit keeps moving between steady-state and transition conditions, which can weaken the correlation between the data. This effect can be minimized by merging the data into larger time intervals, that is, by resampling the data over an appropriate time period.
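The cleaning and resampling in step 1 can be sketched with pandas as follows. This is a minimal illustration, not the patented implementation: the column names `load_mw` and `o2_pct` and the valid ranges are hypothetical, and the 5-minute averaging window is the one mentioned in Example 1.

```python
import pandas as pd

# Hypothetical plausible ranges per measuring point (assumed for illustration).
VALID_RANGES = {"load_mw": (0.0, 700.0), "o2_pct": (0.0, 21.0)}

def clean_and_resample(df: pd.DataFrame, ranges: dict, rule: str = "5min") -> pd.DataFrame:
    """Mask out-of-range readings, then average over fixed time windows."""
    cleaned = df.copy()
    for col, (low, high) in ranges.items():
        # Readings outside the plausible range become NaN and are skipped by mean().
        cleaned[col] = cleaned[col].where(cleaned[col].between(low, high))
    # Window mean; windows in which a variable has no valid reading are dropped.
    return cleaned.resample(rule).mean().dropna(how="any")

# Ten minutes of synthetic 1-minute data with one obvious outlier (99 % oxygen).
index = pd.date_range("2022-01-01 00:00", periods=10, freq="min")
raw = pd.DataFrame(
    {
        "load_mw": [300.0] * 10,
        "o2_pct": [3.0, 3.0, 99.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0, 4.0],
    },
    index=index,
)
resampled = clean_and_resample(raw, VALID_RANGES)
```

Masking rather than deleting rows keeps the time index regular, so the window averages are computed only from the valid readings in each interval.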

Step 2, the boiler combustion process is a multivariable, nonlinear and strongly coupled thermal process. For example, the DCS records parameters such as air volume, air pressure and air temperature at each coal mill outlet. When used as boiler combustion modeling variables, these parameters are highly correlated, which introduces variable redundancy, degrades the estimation accuracy of the model, and increases computational complexity. A feature engineering method is therefore needed to reduce the influence of redundant variables. First, the strongly coupled variables are found through a correlation matrix and the variables with low correlation with the fly ash carbon content are removed; the important variables are then further extracted through a wrapper method.

First, a correlation matrix is constructed to quantify the dependence between variables. The correlation matrix is a table showing how the variables correlate with the predicted quantity, and its entries are correlation coefficients, as given in equation (1). The correlation coefficient can be positive or negative, indicating a direct or inverse relationship with the target variable:

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} \tag{1}$$

where $r$ is the correlation coefficient, $x_i$ and $y_i$ are the i-th values of the $x$ and $y$ variables, $n$ is the total number of values, and $\bar{x}$ and $\bar{y}$ are the mean values of the $x$ and $y$ variables, respectively.
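As an illustration of the correlation-based screening, the sketch below computes the correlation coefficient of equation (1) with NumPy and keeps only the variables whose absolute correlation with the target reaches a threshold. The threshold of 0.3, the variable names and the synthetic data are assumptions for the example; the text does not specify a cut-off.

```python
import numpy as np

def pearson_r(x: np.ndarray, y: np.ndarray) -> float:
    """Correlation coefficient between two equally long 1-D arrays."""
    dx, dy = x - x.mean(), y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

def select_by_correlation(X: np.ndarray, y: np.ndarray, names, threshold: float = 0.3):
    """Keep variables whose |r| with the target reaches the (assumed) threshold."""
    scores = {name: pearson_r(X[:, j], y) for j, name in enumerate(names)}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return kept, scores

rng = np.random.default_rng(0)
o2 = rng.normal(size=500)                          # synthetic "oxygen content"
unrelated = rng.normal(size=500)                   # synthetic irrelevant variable
carbon = -0.8 * o2 + 0.2 * rng.normal(size=500)    # synthetic target
X = np.column_stack([o2, unrelated])
kept, scores = select_by_correlation(X, carbon, ["o2", "unrelated"])
```

A negative coefficient, as for `o2` here, marks an inverse relationship and is still informative, which is why the filter uses the absolute value.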

The wrapper method selects variables with respect to a specific prediction model; here Recursive Feature Elimination (RFE) is used. RFE is a greedy optimization algorithm that selects the best set of variables through repeated iteration.
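A minimal sketch of recursive feature elimination using scikit-learn's `RFE` is given below. The estimator is an assumption: scikit-learn's `GradientBoostingRegressor` stands in for the LightGBM or XGBoost model so that the example has no extra dependencies, and the number of variables to keep and the synthetic data are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))            # six candidate variables
# Only the first two variables drive the synthetic target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=300)

# Stand-in estimator; RFE needs a model exposing feature_importances_ or coef_.
estimator = GradientBoostingRegressor(n_estimators=50, random_state=0)

# Refit and drop the least important variable each round until two remain.
selector = RFE(estimator, n_features_to_select=2, step=1).fit(X, y)
selected = np.flatnonzero(selector.support_).tolist()
```

Because the model is refitted after each elimination, the importance ranking adapts as variables are removed, which is what distinguishes a wrapper method from a one-shot filter such as the correlation matrix.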

Step 3, the data processed in steps 1 and 2 are divided into a training set, a validation set and a test set, and the LightGBM and XGBoost models are adopted as prediction models.

LightGBM is an ensemble machine learning algorithm released by Microsoft in 2017. It is an efficient, distributed implementation of the gradient boosting decision tree (GBDT) framework that adds two techniques to GBDT, GOSS (Gradient-based One-Side Sampling) and EFB (Exclusive Feature Bundling). It supports parallel learning and fast processing of large-scale data, so it is more efficient while preserving accuracy and interpretability. The GBDT algorithm is the core of LightGBM: a strong learner is built by iteratively adding weak learners, each fitted to the negative gradient of the loss function. GOSS computes the information gain mainly from the data instances with larger gradients, so a relatively accurate gain estimate can be obtained from less data. EFB bundles mutually exclusive features, which reduces the number of features. Both techniques reduce computation time and memory use so that training completes faster.
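The idea behind GOSS can be sketched in a few lines of NumPy. The sketch follows the published description of GOSS rather than anything stated in this text: the sampling ratios `a` and `b` and the re-weighting factor `(1 - a) / b` are taken from that description, and the function name is hypothetical.

```python
import numpy as np

def goss_sample(gradients: np.ndarray, a: float = 0.2, b: float = 0.1, seed: int = 0):
    """Return the selected row indices and their weights under GOSS."""
    n = len(gradients)
    top_n, rand_n = int(a * n), int(b * n)
    order = np.argsort(-np.abs(gradients))      # largest |gradient| first
    top_idx, rest_idx = order[:top_n], order[top_n:]
    rng = np.random.default_rng(seed)
    # Randomly keep a small share of the low-gradient rows.
    rand_idx = rng.choice(rest_idx, size=rand_n, replace=False)
    weights = np.ones(top_n + rand_n)
    # Up-weight the sampled low-gradient rows to compensate for the rows left out.
    weights[top_n:] = (1.0 - a) / b
    return np.concatenate([top_idx, rand_idx]), weights

grads = np.random.default_rng(1).normal(size=100)
idx, w = goss_sample(grads)
```

With 100 rows, the 20 rows with the largest gradients are always kept, 10 of the remaining 80 are sampled, and each sampled row counts eight times when the split gain is estimated.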

For a given training set D, the predicted value $\hat{y}_i$ of LightGBM can be expressed by formula (2):

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \tag{2}$$

where $\hat{y}_i$ represents the predicted value of the model, $K$ represents the number of decision trees, $f_k$ denotes the predicted value of the k-th decision tree, $x_i$ represents the i-th input sample, and $F$ represents the set of all decision trees. The objective function $L^{(t)}$ of LightGBM is given by formula (3):

$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \tag{3}$$

In formula (3), $n$ represents the number of samples, $i$ is the index of the current sample, and $l(y_i, \hat{y}_i)$ is the loss function, which represents the difference between the target value $y_i$ and the predicted value $\hat{y}_i$. For regression problems it is often the mean square error loss function, that is, $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$. $\hat{y}_i^{(t-1)}$ is the predicted value of the previous t-1 rounds at the t-th iteration, $f_t(x_i)$ is the predicted value of the t-th round, and $\Omega(f_t)$ is the model complexity, usually given by formula (4):

$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{4}$$

In formula (4), $\gamma$ and $\lambda$ are regularization coefficients that prevent the decision tree from becoming too complex, $T$ represents the number of leaf nodes, and $w_j$ is the weight coefficient of the j-th leaf node.

The XGBoost (extreme gradient boosting) algorithm, proposed by Tianqi Chen, is one of the machine learning algorithms most widely used by data scientists and has achieved good results in numerous machine learning competitions. XGBoost is also an improvement on the GBDT algorithm. It differs from LightGBM in that XGBoost traverses the data more exhaustively during computation, can load the data fully into memory, and speeds up computation through parallelism. The predicted value of XGBoost is computed in the same way as that of LightGBM, and its objective function is given by formula (5):

$$L^{(t)} \simeq \sum_{i=1}^{n}\left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{5}$$

where $n$ represents the number of samples, $i$ is the index of the current sample, $g_i$ is the first derivative of the loss function at sample $x_i$, $h_i$ is the second derivative of the loss function at sample $x_i$, $f_t(x_i)$ is the predicted value of the t-th round, $\gamma$ and $\lambda$ are the regularization coefficients, $T$ represents the number of leaf nodes, and $w_j$ represents the weight coefficient of the j-th leaf node.
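To make the roles of $g_i$, $h_i$ and $\lambda$ concrete, the sketch below evaluates them for the squared-error loss $l = (y_i - \hat{y}_i)^2$ and computes the leaf weight that minimizes the second-order objective for a single leaf. The closed form $w^* = -\sum g_i / (\sum h_i + \lambda)$ is the standard result of the XGBoost derivation and is not stated in this text; the numbers are illustrative.

```python
import numpy as np

def grad_hess_squared_error(y_true: np.ndarray, y_pred: np.ndarray):
    """First and second derivatives of l = (y - y_hat)^2 with respect to y_hat."""
    g = 2.0 * (y_pred - y_true)
    h = np.full_like(y_true, 2.0, dtype=float)
    return g, h

def optimal_leaf_weight(g: np.ndarray, h: np.ndarray, lam: float) -> float:
    """Weight w minimising sum(g*w + 0.5*h*w^2) + 0.5*lam*w^2 for one leaf."""
    return float(-g.sum() / (h.sum() + lam))

y_true = np.array([1.0, 2.0, 3.0])
y_prev = np.zeros(3)                            # prediction after round t-1
g, h = grad_hess_squared_error(y_true, y_prev)

w_no_reg = optimal_leaf_weight(g, h, lam=0.0)   # mean residual of the leaf
w_reg = optimal_leaf_weight(g, h, lam=6.0)      # shrunk towards zero
```

Without regularization the leaf weight is simply the mean residual of the samples in the leaf; a positive $\lambda$ shrinks it, which is how the regularization term limits model complexity.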

Step 4, hyperparameter tuning is performed with a Bayesian optimization (BO) algorithm. The model is evaluated with 5-fold cross-validation using RMSE as the evaluation metric, and the optimization runs for 100 sequential iterations. After the optimal hyperparameters are selected, the BO-LightGBM and BO-XGBoost models are established for predicting the fly ash carbon content.
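The tuning loop of step 4 can be sketched as a small Bayesian optimization with a Gaussian-process surrogate and an expected-improvement acquisition function. Several things here are assumptions made to keep the example self-contained and fast: scikit-learn's `GradientBoostingRegressor` stands in for LightGBM or XGBoost, only two hyperparameters are searched, the surrogate and acquisition function are common choices rather than ones named in the text, and the loop runs 10 iterations instead of the 100 used in the text.

```python
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import cross_val_score

def cv_rmse(params, X, y):
    """5-fold cross-validated RMSE for one (learning_rate, max_depth) setting."""
    learning_rate, max_depth = params
    model = GradientBoostingRegressor(
        learning_rate=float(learning_rate), max_depth=int(round(max_depth)),
        n_estimators=50, random_state=0)
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()

def bayes_opt(X, y, bounds, n_init=5, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds[:, 0], bounds[:, 1]
    # Random initial design.
    samples = rng.uniform(lo, hi, size=(n_init, len(bounds)))
    values = np.array([cv_rmse(p, X, y) for p in samples])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True, alpha=1e-6)
    for _ in range(n_iter):
        # Fit the surrogate on all evaluations so far (inputs scaled to [0, 1]).
        gp.fit((samples - lo) / (hi - lo), values)
        cand = rng.uniform(lo, hi, size=(500, len(bounds)))
        mu, sigma = gp.predict((cand - lo) / (hi - lo), return_std=True)
        sigma = np.maximum(sigma, 1e-9)
        # Expected improvement for a minimisation problem.
        improve = values.min() - mu
        z = improve / sigma
        ei = improve * norm.cdf(z) + sigma * norm.pdf(z)
        nxt = cand[np.argmax(ei)]
        samples = np.vstack([samples, nxt])
        values = np.append(values, cv_rmse(nxt, X, y))
    best = int(np.argmin(values))
    return samples[best], float(values[best]), values

# Synthetic regression data standing in for the screened boiler variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=200)
bounds = np.array([[0.01, 0.3],    # learning_rate
                   [2.0, 6.0]])    # max_depth (rounded to an integer)
best_params, best_rmse, history = bayes_opt(X, y, bounds)
```

Each new evaluation point is chosen from what the surrogate has learned from all previous evaluations, which is the property that lets Bayesian optimization reuse the feedback from earlier iterations.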

Step 5, to improve the prediction accuracy of the model and overcome the limited robustness of a single model, the XGBoost model and the LightGBM model are combined using a sequential least squares programming algorithm. The SQP (sequential quadratic programming) algorithm is widely applied in many fields, such as least squares problems, nonlinear optimization problems, economics and system analysis. The combined-model problem can be expressed by formula (6),

wherein the objective function Obj is a mean square error function and Y is the average of the true values corresponding to all samples.

Since the objective in formula (6) is a nonlinear quadratic function and the constraints are linear, it is a quadratic programming problem that can be solved with the sequential least squares programming algorithm.

The initial values of the weights are chosen as the ratio of the mean square errors between the predicted and true values of the two models, as shown in formula (7). This speeds up the solution and helps avoid getting trapped in a local optimum.

Here $n$ is the total number of samples, $i$ denotes the i-th sample, $w_1$ and $w_2$ are the weight coefficients of the BO-XGBoost model and the BO-LightGBM model, $y_{1i}$ is the predicted value of the i-th sample obtained from the BO-XGBoost model, $y_{2i}$ is the predicted value of the i-th sample obtained from the BO-LightGBM model, and $y_i$ is the true value corresponding to the i-th sample. The predicted value of the combined model is given by formula (8),

wherein the two mean terms are the averages, over all samples, of the predicted values of the BO-XGBoost model and the BO-LightGBM model, respectively.
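A sketch of the weight optimization of step 5 with SciPy's SLSQP solver follows. Formulas (6) to (8) are not reproduced in this text, so the details below are assumptions: the objective is the mean square error of the weighted sum $w_1 y_{1i} + w_2 y_{2i}$, the linear constraints are taken to be $w_1 + w_2 = 1$ with both weights in $[0, 1]$, and the MSE-ratio initialization is read as giving the more accurate model the larger starting weight.

```python
import numpy as np
from scipy.optimize import minimize

def combine_slsqp(y_true, pred_xgb, pred_lgb):
    """Find weights (w1, w2) minimising the MSE of w1*pred_xgb + w2*pred_lgb."""
    mse1 = np.mean((pred_xgb - y_true) ** 2)
    mse2 = np.mean((pred_lgb - y_true) ** 2)
    # Assumed initialisation: weights inversely proportional to each model's MSE.
    w0 = np.array([mse2, mse1]) / (mse1 + mse2)

    def obj(w):
        return np.mean((w[0] * pred_xgb + w[1] * pred_lgb - y_true) ** 2)

    result = minimize(
        obj, w0, method="SLSQP",
        bounds=[(0.0, 1.0), (0.0, 1.0)],                       # assumed bounds
        constraints=[{"type": "eq", "fun": lambda w: w[0] + w[1] - 1.0}],
    )
    return result.x, float(obj(result.x)), (float(mse1), float(mse2))

# Synthetic predictions: two models with independent errors of different size.
rng = np.random.default_rng(0)
y_true = rng.uniform(1.0, 5.0, size=200)
pred_xgb = y_true + rng.normal(0.0, 0.3, size=200)
pred_lgb = y_true + rng.normal(0.0, 0.5, size=200)
weights, combined_mse, single_mses = combine_slsqp(y_true, pred_xgb, pred_lgb)
```

In this synthetic case the errors of the two models are independent, so the weighted combination has a lower mean square error than either model alone; with real models the gain depends on how correlated their errors are.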

Example 1

Step 1, all historical operating data of a power plant over a period of time (for example, 50 days) are acquired. The data cover about 70 operating-condition measuring points in total, including the coal feeding amount of each coal mill; the primary air pressure, air temperature, air volume, separator outlet temperature and current of each coal mill; the secondary air damper opening of each layer; the temperature, pressure, air volume and oxygen content of the primary and secondary air at the air preheater; the supply air temperature, pressure and air volume of the forced draft fan; the oxygen content and exhaust gas temperature in the tail flue; and general parameters such as power output, total primary air volume, total secondary air volume, furnace pressure and furnace temperature.

Step 2, obvious outliers are removed from the data, and the data are resampled by averaging over 5-minute intervals.

Step 3, feature dimensionality is reduced with a machine learning feature reduction method. In view of the multivariable, nonlinear and strongly coupled nature of the boiler combustion process, the strongly coupled variables are first found through a correlation matrix, the variables with low correlation with the fly ash carbon content are removed, and the important variables are further extracted through a wrapper method.

The correlation matrix is expressed by correlation coefficients. The expression of the correlation coefficient is given in equation (1), and its sign indicates a direct or inverse relationship with the target variable:

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} \tag{1}$$

where $r$ is the correlation coefficient, $x_i$ is the i-th value of variable $x$, $y_i$ is the corresponding i-th value of variable $y$, $i \in [1, n]$, $n$ is the total number of values, and $\bar{x}$ and $\bar{y}$ are the mean values of the $x$ and $y$ variables, respectively.

The wrapper method selects variables with respect to a specific prediction model; here Recursive Feature Elimination (RFE) is used. RFE is a greedy optimization algorithm that selects the best set of variables through repeated iteration. The variables selected by the correlation matrix are further screened by the wrapper method.

Step 4, all samples screened in step 3 are divided into a training set, a validation set and a test set in a ratio of 4:1:1, and the XGBoost and LightGBM models are adopted respectively as prediction models; five-fold cross-validation is used for validation.

Step 5, based on the prediction models provided by the invention, hyperparameter tuning is performed with a Bayesian optimization (BO) algorithm. The model is evaluated with 5-fold cross-validation using RMSE as the evaluation metric, and the optimization runs for 100 sequential iterations. After the optimal hyperparameters are selected, the BO-LightGBM and BO-XGBoost models are established for predicting the fly ash carbon content; the specific procedure is shown in FIGS. 2 and 3.

Step 6, the fly ash carbon content predictions of the BO-XGBoost and BO-LightGBM models are combined using a sequential least squares programming algorithm to obtain the final predicted value.
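The 4:1:1 division of step 4 can be sketched with NumPy as below. The random shuffle is an assumption, since the text does not say whether the split is random or chronological; for strongly time-dependent data a chronological split may be the safer choice.

```python
import numpy as np

def split_4_1_1(n_samples: int, seed: int = 0):
    """Return index arrays for a 4:1:1 train / validation / test split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)      # assumed: random shuffle before splitting
    n_train = n_samples * 4 // 6
    n_val = n_samples // 6
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]          # remainder goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = split_4_1_1(600)
```

The three index arrays are disjoint and together cover every sample, so they can be used directly to slice the feature matrix and the target vector.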

The method is applied to the following embodiments to achieve the technical effects of the present invention, and the detailed steps in the embodiments will not be described again.

Table 1 compares the performance of the proposed method with that of other methods. The proposed method achieves the lowest MAPE and RMSE and the highest R². Compared with the other methods, RMSE is reduced by 1.8% to 26.2% and MAPE by 0.7% to 19.24%, so the error is further reduced and the measurement accuracy is improved. R² is improved by 1.3% to 20.9%, so the prediction curve fits better and the method is more accurate and reliable. Specifically, LM-Garson-BP, AQPSO-SVR and FPA-RF tune their parameters with heuristic algorithms and predict with a regression model, which improves the prediction accuracy of the corresponding models to some extent. From the perspective of hyperparameter tuning, however, hyperparameter tuning is a complex optimization problem that is non-convex, multimodal and expensive to evaluate; for such problems the BO algorithm chooses the next evaluation point from the information already gathered about the unknown objective function and so reaches the optimum quickly. The BO algorithm thus avoids the problems of feedback from earlier iterations not being used effectively and of slow search. From the perspective of the prediction model, LightGBM and XGBoost are decision-tree ensemble algorithms whose objective functions use a second-order Taylor expansion, so the models can learn the data thoroughly; the added regularization terms reduce model complexity, which helps prevent overfitting, and both support parallel and distributed computation. Together these properties effectively improve prediction accuracy. The combined model builds on the single models and combines the advantages of both, improving robustness. The prediction performance is therefore better than that of the 6 comparison models.
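For reference, the three evaluation metrics named above can be computed as in the following sketch. The definitions are the usual ones for RMSE, MAPE (in percent) and R²; the sample values are illustrative and unrelated to Table 1.

```python
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean square error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute percentage error, in percent (y_true must be non-zero)."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

def r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([2.0, 4.0, 5.0, 8.0])
y_pred = np.array([2.0, 5.0, 5.0, 8.0])
metrics = {"RMSE": rmse(y_true, y_pred),
           "MAPE": mape(y_true, y_pred),
           "R2": r2(y_true, y_pred)}
```

Lower RMSE and MAPE and a higher R² indicate a better fit, which is the sense in which the comparison above ranks the models.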

TABLE 1 prediction results of different models

The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims

1. A soft measurement method for the carbon content of fly ash based on a combined LightGBM and XGBoost model, characterized by comprising the following steps:

step 4, performing hyperparameter optimization with a Bayesian optimization algorithm, using 5-fold cross-validation to evaluate the prediction model with RMSE as the evaluation metric and the number of iterations set to N, and, after selecting the optimal hyperparameters, establishing the BO-LightGBM and BO-XGBoost models for predicting the fly ash carbon content;

2. The method for soft measurement of carbon content in fly ash based on a LightGBM and XGBoost combined model according to claim 1, wherein: in step 2, the correlation matrix is expressed by correlation coefficients, the expression of the correlation coefficient is given in equation (1), and its sign indicates a direct or inverse relationship with the target variable;

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} \tag{1}$$

where $r$ is the correlation coefficient, $x_i$ is the i-th value of variable $x$, $y_i$ is the corresponding i-th value of variable $y$, $i \in [1, n]$, $n$ is the total number of values, and $\bar{x}$ and $\bar{y}$ are the mean values of the $x$ and $y$ variables, respectively.

3. The method for soft measurement of carbon content in fly ash based on a LightGBM and XGBoost combined model according to claim 1, wherein: the historical data variables in the step 2 include the coal feeding amount of each coal mill, the primary air pressure, the air temperature, the air volume, the separator outlet temperature and the current of the coal mill, the secondary air door opening of each layer, the temperature, the pressure, the air volume and the oxygen content of the primary air and the secondary air related to the air preheater, the air supply temperature, the pressure and the air volume of the air feeder, the oxygen content and the exhaust gas temperature of the tail flue, the power generation power, the total primary air volume, the total secondary air volume, the hearth pressure and the hearth temperature.

4. The method for soft measurement of carbon content in fly ash based on a LightGBM and XGBoost combined model according to claim 1, wherein: when the BO-XGBoost model and the BO-LightGBM model are combined using the sequential least squares programming algorithm in step 5,

the objective function Obj is a mean square error function and Y is the average of the true values corresponding to all samples;

where $n$ is the total number of samples, $i$ denotes the i-th sample, $w_1$ and $w_2$ are the weight coefficients of the BO-XGBoost model and the BO-LightGBM model, $y_{1i}$ is the predicted value of the i-th sample obtained from the BO-XGBoost model, $y_{2i}$ is the predicted value of the i-th sample obtained from the BO-LightGBM model, and $y_i$ is the true value corresponding to the i-th sample; the predicted value of the combined model is given by formula (8);

wherein the two mean terms are the averages, over all samples, of the predicted values of the BO-XGBoost model and the BO-LightGBM model, respectively.

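5. The method for soft measurement of fly ash carbon content based on a LightGBM and XGBoost combined model according to claim 1, wherein: for a given training set, the predicted value $\hat{y}_i$ of the LightGBM model can be expressed by equation (2):

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \tag{2}$$

where $\hat{y}_i$ represents the predicted value of the LightGBM model, $K$ represents the number of decision trees, $f_k$ denotes the predicted value of the k-th decision tree, $x_i$ represents the i-th input sample, and $F$ represents the set of all decision trees; the objective function $L^{(t)}$ of LightGBM is given by equation (3):

$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \tag{3}$$

where $l(y_i, \hat{y}_i)$ is the loss function, which represents the difference between the target value $y_i$ and the predicted value $\hat{y}_i$; for the regression problem it is the mean square error loss function, i.e. $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$; $\hat{y}_i^{(t-1)}$ is the predicted value of the previous t-1 rounds at the t-th iteration, $f_t(x_i)$ is the predicted value of the t-th round, and $\Omega(f_t)$ is the model complexity, given by equation (4);

$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{4}$$

where $\gamma$ and $\lambda$ are regularization coefficients, $T$ represents the number of leaf nodes, and $w_j$ is the weight coefficient of the j-th leaf node.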

6. The method for soft measurement of carbon content in fly ash based on a LightGBM and XGBoost combined model according to claim 1, wherein: for a given training set, the predicted value of the XGBoost model may be represented by the following formula: