CN111639304A

CN111639304A - CSTR fault positioning method based on Xgboost regression model

Info

Publication number: CN111639304A
Application number: CN202010491108.0A
Authority: CN
Inventors: 赵忠盖; 潘磊; 李庆华; 刘成林; 刘飞
Original assignee: Jiangnan University
Current assignee: Jiangnan University
Priority date: 2020-06-02
Filing date: 2020-06-02
Publication date: 2020-09-08
Anticipated expiration: 2040-06-02
Also published as: CN111639304B

Abstract

The invention discloses a CSTR fault positioning method based on an Xgboost regression model, which comprises the following steps: 1) collecting normal data generated by sensors in the CSTR, as well as unknown offline data; 2) establishing a monitoring model from the normal data collected in step 1, where different monitoring models can be freely selected according to the requirements of different occasions; 3) bringing the unknown offline data collected in step 1 into the monitoring model established in step 2, extracting sample statistics to detect faults, and screening out the fault data. The invention has the following beneficial effect: 1) the variable importance of the Xgboost regression model measures the influence of each variable on the output prediction accuracy, and the metric value of each variable is calculated independently of the others; compared with the prior art, the variable importance measure does not contain components arising from the action of other variables, so the influence of the smearing effect is eliminated.

Description

CSTR fault positioning method based on Xgboost regression model

Technical Field

The invention relates to the field of CSTR, in particular to a CSTR fault positioning method based on an Xgboost regression model.

Background

The Continuous Stirred Tank Reactor (CSTR) is a very important reaction device in chemical production and is very widely used. In the production of the three major synthetic materials (chemical fiber, plastic and synthetic rubber), CSTRs account for more than 90% of the synthesis reactors, and they are also widely used in fields such as pharmaceuticals, pesticides and fuels. Given how widely the CSTR is used in actual production, ensuring the stability and safety of its operation is very valuable.

As modern chemical production grows in scale and complexity, faults that cannot be accurately identified and promptly corrected often cause heavy losses. Because industrial processes continuously generate large amounts of data reflecting the process mechanism, monitoring industrial processes through data-driven multivariate statistical monitoring models is becoming more and more popular.

The traditional technology has the following technical problems:

At present, a large number of fault detection techniques based on multivariate statistical analysis have been applied to actual industrial processes, but fault location, an important step to be completed after fault detection, remains a technical difficulty to be further solved. Common fault location methods based on multivariate statistical analysis mainly include the contribution plot method, the reconstruction method and the reconstruction-based contribution (RBC) method, but these methods are susceptible to the smearing effect, so misdiagnosis may occur in practical application. Meanwhile, for systems with different characteristics, such as linear, nonlinear and non-Gaussian systems, the traditional fault location methods differ greatly from one another, and few technical documents propose a unified method for locating the fault source.

Disclosure of Invention

The invention provides a CSTR fault positioning method based on an Xgboost regression model. A multivariate statistical monitoring model is first established from normal data collected in the CSTR. The fault data segment in the offline collected data is then screened out by the monitoring model; an Xgboost regression model is established with the fault data segment as input and the corresponding statistic as output; and the variable importance measure is taken as the contribution rate of each variable to the statistic. Variables with larger values are more likely to be fault variables, and the variable with the maximum value is identified as the fault variable. Unlike traditional fault positioning methods such as the reconstruction-based contribution method and the partial derivative method, the Xgboost regression model used here can be applied to fault positioning in both linear and nonlinear processes, requires little computation, exhibits little smearing effect, and performs better in locating small-magnitude faults and random faults of the CSTR.

In order to solve the technical problem, the invention provides a CSTR fault positioning method based on an Xgboost regression model, which comprises the following steps:

1) collecting normal data generated by the sensors in the CSTR, as well as unknown offline data;

2) establishing a monitoring model from the normal data collected in step 1, where different monitoring models can be freely selected according to the requirements of different occasions;

3) bringing the unknown offline data collected in step 1 into the monitoring model established in step 2, extracting sample statistics to detect faults, and screening out the fault data;

4) taking the fault data from step 3 as the input of the training samples and the corresponding statistic as the output of the training samples;

5) establishing an Xgboost regression model on the training samples of step 4 to obtain the variable importance measure of each variable, wherein variables with larger measures are more likely to be fault variables, and the variable with the largest value is identified as the fault variable.

In one embodiment, in step 2, the monitoring model of the normal data collected in step 1 is a PCA monitoring model; the method specifically comprises the following steps:

assume that the sample set under normal operating conditions is $X \in \mathbb{R}^{n \times m}$, where n is the number of samples and m is the number of variables; after standardization, each variable has mean 0 and standard deviation 1; the covariance matrix S is obtained and decomposed by singular value decomposition:

$$S = \frac{1}{n-1} X^T X = P \Lambda P^T + \tilde{P} \tilde{\Lambda} \tilde{P}^T$$

wherein $P \in \mathbb{R}^{m \times l}$ and $\tilde{P} \in \mathbb{R}^{m \times (m-l)}$ are the principal component and residual loading matrices, respectively; l is the number of principal components; and $\Lambda$ and $\tilde{\Lambda}$ are the diagonal matrices composed of the principal and residual eigenvalues, respectively;

any sample x can be decomposed into:

$$x = \hat{x} + \tilde{x} = C x + \tilde{C} x$$

in the formula, $C = P P^T$ and $\tilde{C} = I - P P^T$ are the projection matrices onto the principal component space and the residual space, respectively;

In one embodiment, fault detection is performed by extracting the SPE statistic, for which:

$$SPE = \|\tilde{C} x\|^2 = x^T (I - P P^T) x$$

The control limit of the SPE statistic can be obtained from its sampling distribution; if the statistic exceeds the control limit, the process is considered abnormal, and fault detection is thereby achieved.
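
The PCA monitoring and SPE screening described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names `fit_pca_monitor` and `spe`, the synthetic three-variable data, and the use of an empirical 99% quantile of the training SPE as the control limit are all assumptions made for the example.

```python
import numpy as np

def fit_pca_monitor(X_normal, n_components, quantile=0.99):
    """Fit a PCA monitoring model on normal-operation data (rows are samples)."""
    mean = X_normal.mean(axis=0)
    std = X_normal.std(axis=0, ddof=1)
    Z = (X_normal - mean) / std                  # mean 0, standard deviation 1
    S = Z.T @ Z / (Z.shape[0] - 1)               # covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # sort descending
    P = eigvecs[:, order[:n_components]]         # principal loading matrix P
    C_res = np.eye(S.shape[0]) - P @ P.T         # residual projection I - P P^T
    spe_train = np.sum((Z @ C_res) ** 2, axis=1)
    # Assumption: control limit taken as an empirical quantile of the training SPE.
    limit = np.quantile(spe_train, quantile)
    return {"mean": mean, "std": std, "C_res": C_res, "limit": limit}

def spe(model, X):
    """SPE = ||(I - P P^T) x||^2 for every row of X."""
    Z = (X - model["mean"]) / model["std"]
    return np.sum((Z @ model["C_res"]) ** 2, axis=1)

# Synthetic stand-in for sensor data: three variables driven by one latent factor.
rng = np.random.default_rng(0)
X_normal = rng.normal(size=(500, 1)) @ np.ones((1, 3)) + 0.1 * rng.normal(size=(500, 3))
model = fit_pca_monitor(X_normal, n_components=1)

X_new = rng.normal(size=(200, 1)) @ np.ones((1, 3)) + 0.1 * rng.normal(size=(200, 3))
X_new[100:, 1] += 1.0                            # bias fault on the second variable

spe_new = spe(model, X_new)
is_fault = spe_new > model["limit"]              # fault detection
fault_data, fault_spe = X_new[is_fault], spe_new[is_fault]   # screened fault segment
```

The pair `fault_data`, `fault_spe` corresponds to the training input and output that the later regression step uses.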

In one embodiment, the step 5 specifically includes the following steps:

5a) for a fault data set with n samples of m variables:

$$D = \{(x_i, y_i)\} \quad (|D| = n,\ x_i \in \mathbb{R}^m,\ y_i \in \mathbb{R})$$

where y is the statistic, an Xgboost regression model is defined to predict from x in D:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$

wherein K is the number of decision trees; f is a CART regression tree function; $\hat{y}_i$ is the prediction output; and $\mathcal{F}$ represents the set of possible decision tree functions;

defining the loss function L as:

$$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

wherein l is a differentiable convex function that measures the difference between the predicted value and the true value, for which the mean square error function is selected; $\Omega(f)$ is:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \|w\|^2$$

wherein T represents the number of leaves, w represents the leaf weights, and $\lambda$ and $\gamma$ are penalty coefficients;

5b) establishing a CART regression tree model on the training samples of step 4; to prevent overfitting, each tree draws a sample of the same size as the training set with replacement by resampling, and the optimal splitting variable and optimal splitting point are selected by a greedy algorithm so that the splitting gain is maximal;

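5c) new CART regression trees are generated iteratively through step 5b, each fitting the prediction residual of the previous CART regression tree, until the loss function is minimal, wherein the loss function $L^{(t)}$ at iteration t is:

$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$

expanding the loss function as a Taylor series to second order and removing the constant term, the loss function at step t becomes:

$$\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$

wherein $g_i$ and $h_i$ are the first and second derivatives of $l(y_i, \hat{y}_i^{(t-1)})$ with respect to $\hat{y}_i^{(t-1)}$, respectively; differentiating the above formula with respect to the leaf weights and setting the result to 0 gives the optimal leaf weight $w_j^* = -\dfrac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$, where $I_j$ is the set of samples falling in leaf j; substituting it back gives:

$$\tilde{L}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$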

5d) combining all CART regression trees to obtain the Xgboost regression model; dividing the sum of the split gains of each variable by its number of splits to obtain the average split gain of that variable, and dividing the average split gain of each variable by the sum of the average split gains of all variables to obtain the variable importance measure of that variable, wherein a variable with a larger measure is more likely to be a fault variable.
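
The importance calculation of step 5d can be illustrated with a short sketch. The function name `variable_importance` and the list of `(variable, gain)` split records are hypothetical; in practice the records would come from the splits of the fitted trees.

```python
from collections import defaultdict

def variable_importance(splits, n_variables):
    """splits: iterable of (variable_index, split_gain), one entry per tree split."""
    gain_sum = defaultdict(float)
    count = defaultdict(int)
    for var, gain in splits:
        gain_sum[var] += gain
        count[var] += 1
    # Average split gain: gain sum of each variable divided by its number of splits.
    avg_gain = [gain_sum[j] / count[j] if count[j] else 0.0
                for j in range(n_variables)]
    # Importance: average gain normalized by the sum of all average gains.
    total = sum(avg_gain)
    return [a / total if total > 0 else 0.0 for a in avg_gain]

# Hypothetical split records for a model with four variables.
splits = [(0, 4.0), (0, 2.0), (1, 1.0), (2, 0.5), (2, 0.5)]
importance = variable_importance(splits, n_variables=4)
fault_variable = max(range(len(importance)), key=importance.__getitem__)
```

Here variable 0 has average gain 3.0 out of a total of 4.5, so it receives the largest importance and is identified as the fault variable; a variable that never splits receives importance 0.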

In one embodiment, in step 5c, the smaller the value of the above loss function, the better the model fits; the optimal splitting variable and the optimal splitting point are selected through the loss function, and the splitting gain obtained when the optimal splitting variable is split at the optimal splitting point is calculated at the same time.
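
The leaf weight and the minimized loss of step 5c can be made concrete with a small sketch. It assumes the squared-error loss l = (y - ŷ)^2, so that g_i = 2(ŷ_i - y_i) and h_i = 2; the function names are hypothetical.

```python
def grad_hess(y_true, y_pred):
    """First and second derivatives of l = (y - y_hat)^2 with respect to y_hat."""
    g = [2.0 * (p - t) for t, p in zip(y_true, y_pred)]
    h = [2.0 for _ in y_true]
    return g, h

def leaf_weight(g, h, lam):
    """Optimal leaf weight w* = -sum(g) / (sum(h) + lambda)."""
    return -sum(g) / (sum(h) + lam)

def structure_score(leaves, lam, gamma):
    """Minimized loss: -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T.

    leaves is a list of (g_list, h_list), one pair per leaf.
    """
    score = sum(-0.5 * sum(g) ** 2 / (sum(h) + lam) for g, h in leaves)
    return score + gamma * len(leaves)

# One leaf holding two samples whose current prediction is 0.
y_true = [1.0, 3.0]
y_pred = [0.0, 0.0]
g, h = grad_hess(y_true, y_pred)
w = leaf_weight(g, h, lam=0.0)
score = structure_score([(g, h)], lam=0.0, gamma=0.0)
```

With no regularization the optimal weight equals the mean residual (2.0), and the score of -8.0 matches the actual drop in squared error from 10 to 2. A positive lambda shrinks the weight toward zero.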

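In one embodiment, in step 5c, let $I_L$ and $I_R$ be the sample sets of the left node and the right node after the split, with $I = I_L \cup I_R$; the split gain after splitting is:

$$Gain = \frac{1}{2} \left[ \frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma$$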
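
The split gain and the greedy choice of splitting point can be sketched for a single variable as follows. The gain computed is one half of the sum of the left and right node scores minus the parent score, less gamma, where a node score is (sum of g)^2 / (sum of h + lambda). The helper names and the four-sample example are assumptions; g and h are taken from the squared-error loss with a current prediction of 0.

```python
def node_score(G, H, lam):
    return G * G / (H + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Gain of splitting a node into a left and a right child."""
    GL, HL = sum(g_left), sum(h_left)
    GR, HR = sum(g_right), sum(h_right)
    return 0.5 * (node_score(GL, HL, lam) + node_score(GR, HR, lam)
                  - node_score(GL + GR, HL + HR, lam)) - gamma

def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Greedy scan over the thresholds of one variable; returns (gain, threshold)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = (float("-inf"), None)
    for k in range(1, len(order)):
        lo, hi = x[order[k - 1]], x[order[k]]
        if lo == hi:
            continue                      # no threshold separates equal values
        left, right = order[:k], order[k:]
        gain = split_gain([g[i] for i in left], [h[i] for i in left],
                          [g[i] for i in right], [h[i] for i in right],
                          lam, gamma)
        if gain > best[0]:
            best = (gain, 0.5 * (lo + hi))
    return best

# Targets y = [0, 0, 10, 10] with prediction 0 give g = -2*y and h = 2.
x = [1.0, 2.0, 3.0, 4.0]
g = [0.0, 0.0, -20.0, -20.0]
h = [2.0, 2.0, 2.0, 2.0]
gain, threshold = best_split(x, g, h, lam=0.0, gamma=0.0)
```

The scan selects the threshold 2.5, which separates the two groups of targets, with a gain of 100.0. Repeating the scan over every variable and keeping the largest gain gives the optimal splitting variable.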

In one embodiment, in step 2, different monitoring models can be freely selected according to the requirements of different occasions, specifically as follows: PCA is selected for a linear model and KPCA is selected for a nonlinear model.

Based on the same inventive concept, the present application also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods when executing the program.

Based on the same inventive concept, the present application also provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of any of the methods.

Based on the same inventive concept, the present application further provides a processor for executing a program, wherein the program executes to perform any one of the methods.

The invention has the beneficial effects that:

1) The variable importance of the Xgboost regression model measures the influence of each variable on the output prediction accuracy, and the metric value of each variable is calculated independently of the others. Compared with the prior art, the variable importance measure therefore contains no component arising from the action of other variables, so the influence of the smearing effect is eliminated.

2) Compared with the existing RBC fault identification technique, the proposed fault identification method runs fast and can be used for fault location in linear, nonlinear, multimodal and other settings.

Drawings

FIG. 1 is a flow chart of fault location in the CSTR fault location method based on the Xgboost regression model.

FIG. 2 is a generation flow chart of the CSTR fault location method based on the Xgboost regression model.

FIG. 3 shows the identification of a random disturbance fault in the feed concentration C_i by the CSTR fault location method based on the Xgboost regression model of the present invention.

FIG. 4 shows the identification of a zero-drift fault in the cooling water inlet temperature T_ci by the CSTR fault location method based on the Xgboost regression model.

FIG. 5 is a schematic diagram of a CSTR device in the CSTR fault location method based on the Xgboost regression model.

Detailed Description

The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.

As shown in FIG. 1, a CSTR fault location method based on Xgboost regression includes the following steps:

1) Normal data generated by the sensors in the CSTR are collected, as well as unknown offline data.

2) A monitoring model is established from the normal data collected in step 1; different monitoring models can be freely selected according to the requirements of different occasions.

3) The unknown offline data collected in step 1 are brought into the monitoring model established in step 2, sample statistics are extracted to detect faults, and the fault data are screened out.

4) The fault data from step 3 are taken as the input of the training samples and the corresponding statistic as the output of the training samples.

The step 2 specifically comprises the following steps:

2a) For the establishment of the monitoring model, the PCA monitoring model is taken as an example in the invention. Assume that the sample set under normal operating conditions is $X \in \mathbb{R}^{n \times m}$, where n is the number of samples and m is the number of variables. After standardization, each variable has mean 0 and standard deviation 1. The covariance matrix S is obtained and decomposed by singular value decomposition:

$$S = \frac{1}{n-1} X^T X = P \Lambda P^T + \tilde{P} \tilde{\Lambda} \tilde{P}^T$$

wherein $P \in \mathbb{R}^{m \times l}$ and $\tilde{P} \in \mathbb{R}^{m \times (m-l)}$ are the principal component and residual loading matrices, respectively; l is the number of principal components; and $\Lambda$ and $\tilde{\Lambda}$ are the diagonal matrices composed of the principal and residual eigenvalues, respectively.

Any sample x can be decomposed into:

$$x = \hat{x} + \tilde{x} = C x + \tilde{C} x$$

in the formula, $C = P P^T$ and $\tilde{C} = I - P P^T$ are the projection matrices onto the principal component space and the residual space, respectively.

Fault detection is performed by extracting the SPE statistic, for which:

$$SPE = \|\tilde{C} x\|^2 = x^T (I - P P^T) x$$

The step 5 specifically comprises the following steps:

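5a) For a fault data set with n samples of m variables:

$$D = \{(x_i, y_i)\} \quad (|D| = n,\ x_i \in \mathbb{R}^m,\ y_i \in \mathbb{R})$$

where y is the statistic, an Xgboost regression model is defined to predict from x in D:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$

wherein K is the number of decision trees; f is a CART regression tree function; $\hat{y}_i$ is the prediction output; and $\mathcal{F}$ represents the set of possible decision tree functions.

Defining the loss function L as:

$$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where l is a differentiable convex function that measures the difference between the predicted value and the true value; here the mean square error function is selected. $\Omega(f)$ is:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \|w\|^2$$

wherein T represents the number of leaves, w represents the leaf weights, and $\lambda$ and $\gamma$ are penalty coefficients.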

5b) A CART regression tree model is established on the training samples of step 4. To prevent overfitting, each tree draws a sample of the same size as the training set with replacement by resampling, and the optimal splitting variable and optimal splitting point are selected by a greedy algorithm so that the splitting gain is maximal.
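
The resampling in step 5b can be sketched as follows: each tree receives a sample of the same size as the training set, drawn with replacement. The function name `bootstrap_sample` and the toy data are assumptions made for illustration.

```python
import random

def bootstrap_sample(X, y, rng):
    """Draw len(X) training pairs with replacement (one resample per tree)."""
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]
    return [X[i] for i in idx], [y[i] for i in idx]

# Hypothetical fault samples (two variables) and their monitoring statistic.
rng = random.Random(0)
X = [[0.1, 1.0], [0.2, 0.9], [0.3, 1.1], [0.4, 1.2], [0.5, 0.8]]
y = [0.5, 0.7, 0.6, 2.1, 2.4]
Xb, yb = bootstrap_sample(X, y, rng)
```

Because the draw is with replacement, some samples typically appear more than once in `Xb` while others are left out, so each tree sees a slightly different data set.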

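5c) New CART regression trees are generated iteratively through step 5b, each fitting the prediction residual of the previous CART regression tree, until the loss function is minimal. The loss function $L^{(t)}$ at iteration t is:

$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$

Expanding the loss function as a Taylor series to second order and removing the constant term, the loss function at step t becomes:

$$\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$

wherein $g_i$ and $h_i$ are the first and second derivatives of $l(y_i, \hat{y}_i^{(t-1)})$ with respect to $\hat{y}_i^{(t-1)}$, respectively. Differentiating the above formula with respect to the leaf weights and setting the result to 0 gives the optimal leaf weight $w_j^* = -\dfrac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$, where $I_j$ is the set of samples falling in leaf j; substituting it back gives:

$$\tilde{L}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$

The smaller the value of the above loss function, the better the model fits. The optimal splitting variable and the optimal splitting point are selected through the loss function, and the splitting gain obtained when the optimal splitting variable is split at the optimal splitting point is calculated at the same time.

Let $I_L$ and $I_R$ be the sample sets of the left node and the right node after the split, with $I = I_L \cup I_R$. The split gain after splitting is:

$$Gain = \frac{1}{2} \left[ \frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma$$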

5d) All CART regression trees are combined to obtain the Xgboost regression model. The sum of the split gains of each variable is divided by its number of splits to obtain the average split gain (Average gain) of that variable, and the average split gain of each variable is divided by the sum of the average split gains of all variables to obtain the variable importance measure (Variable Importance) of that variable; a variable with a larger measure is more likely to be a fault variable.
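
Steps 5a to 5d can be tied together in a deliberately simplified stand-in for the Xgboost regression model: boosting with one-split trees (stumps) under the squared-error loss, recording the gain of every split. This is not the XGBoost library and omits resampling and deeper trees; the function names, the shrinkage factor `eta`, and the synthetic data (in which the target plays the role of the monitoring statistic and depends only on the second variable) are all assumptions.

```python
import random

def best_stump(X, g, h, lam):
    """Greedy search over all variables and thresholds for the max-gain split."""
    n, m = len(X), len(X[0])
    G, H = sum(g), sum(h)
    best = None  # (gain, variable, threshold, left weight, right weight)
    for j in range(m):
        order = sorted(range(n), key=lambda i: X[i][j])
        GL = HL = 0.0
        for k in range(n - 1):
            GL += g[order[k]]
            HL += h[order[k]]
            lo, hi = X[order[k]][j], X[order[k + 1]][j]
            if lo == hi:
                continue
            GR, HR = G - GL, H - HL
            gain = 0.5 * (GL ** 2 / (HL + lam) + GR ** 2 / (HR + lam)
                          - G ** 2 / (H + lam))
            if best is None or gain > best[0]:
                best = (gain, j, 0.5 * (lo + hi),
                        -GL / (HL + lam), -GR / (HR + lam))
    return best

def fit_importance(X, y, n_trees=30, lam=1.0, eta=0.3):
    """Boost stumps on (X, y) and return the normalized average split gains."""
    n, m = len(X), len(X[0])
    pred = [0.0] * n
    gain_sum, count = [0.0] * m, [0] * m
    for _ in range(n_trees):
        g = [2.0 * (p - t) for p, t in zip(pred, y)]   # l = (y - y_hat)^2
        h = [2.0] * n
        gain, j, thr, wl, wr = best_stump(X, g, h, lam)
        gain_sum[j] += gain
        count[j] += 1
        pred = [p + eta * (wl if x[j] < thr else wr) for p, x in zip(pred, X)]
    avg = [s / c if c else 0.0 for s, c in zip(gain_sum, count)]
    total = sum(avg)
    return [a / total if total > 0 else 0.0 for a in avg]

# Synthetic fault segment: variable 1 is disturbed and drives the statistic y.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.uniform(1, 3) + rng.gauss(0, 0.1), rng.gauss(0, 1)]
     for _ in range(200)]
y = [x[1] ** 2 + 0.05 * rng.gauss(0, 1) for x in X]

importance = fit_importance(X, y)
fault_variable = max(range(len(importance)), key=importance.__getitem__)
```

In this example the disturbed variable (index 1) receives by far the largest importance, while the two unrelated variables receive values near zero, which is the behavior the method relies on to limit smearing.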

A specific application scenario of the present invention is given below:

Take the sample data collected from a CSTR device as an example; the data comprise normal operating data and fault data. As shown in FIG. 5, the model contains the feed concentration C_i and the feed temperature T_i; the discharge concentration C and the discharge temperature T; and the cooling water inlet temperature T_ci, the cooling water outlet temperature T_c and the cooling water flow rate Q_c.

The Xgboost regression fault identification method is verified by comparison with the existing RBC identification method. FIG. 3 compares the two methods in identifying a random disturbance fault in the feed concentration C_i. Although the variable with the largest contribution under the RBC method is also C_i, its smearing effect is clearly severe, whereas the Xgboost regression method effectively removes the smearing effect. FIG. 4 compares the two methods in identifying a fault in the cooling water inlet temperature T_ci; compared with RBC, the Xgboost regression method again effectively removes the smearing effect and identifies the fault variable T_ci more clearly.

In summary, compared with the RBC method, the fault location method based on the Xgboost regression model provided by the invention can effectively identify fault variables under the PCA model and is not affected by the smearing effect. The PCA monitoring model is only an example used to illustrate the invention clearly and does not limit the fault detection method with which the invention is implemented: the Xgboost regression model may be combined with the PCA monitoring model or with other multivariate statistical monitoring models such as KPCA, locating the fault from the extracted statistic.

The CSTR fault location method based on the Xgboost regression model provided by the present invention is described in detail above, and the following points need to be explained:

A CSTR fault positioning method based on an Xgboost regression model, characterized by comprising the following steps in sequence:

a) Normal data generated by the sensors in the CSTR are collected, as well as unknown offline data.

b) A monitoring model is established from the normal data collected in step a; different monitoring models can be freely selected according to the requirements of different occasions, for example PCA for a linear model and KPCA for a nonlinear model.

c) The unknown offline data collected in step a are brought into the monitoring model established in step b to detect whether a fault exists; if a fault exists, the fault data are screened out and the subsequent fault positioning operation is performed.

d) The fault data from step c are taken as the input of the training samples and the corresponding statistic as the output of the training samples.

e) An Xgboost regression model is established on the training samples of step d to obtain the variable importance measure of each variable; variables with larger measures are more likely to be fault variables, and the variable with the largest value is identified as the fault variable.

In step b, different multivariate statistical monitoring models, such as a linear PCA model or a nonlinear KPCA model, can be selected according to the system characteristics, and all of them can be combined with the Xgboost regression method to perform fault location.

3. The industrial process fault location method based on the Xgboost regression model of claim 1, characterized in that step c specifically comprises the following steps:

Step c1: the fault data screened out by the monitoring model are taken as input and the corresponding statistic as output, and the two are combined as the training samples.

Step c2: a CART regression tree model is established on the training samples; to prevent overfitting, each tree draws a sample of equal size with replacement by resampling, and a randomly drawn subset of the variables is used as the candidate splitting variables of each tree; the optimal splitting variable and the optimal splitting point are selected so that the splitting gain is maximal.

Step c3: new CART regression trees are generated iteratively through step c2, each fitting the prediction residual of the previous tree, until the cost function is minimal.

Step c4: all CART regression trees are combined to obtain the Xgboost regression model and the variable importance measure of each variable; variables with larger measures are more likely to be fault variables, and the variable with the largest value is identified as the fault variable.

The above-mentioned embodiments are merely preferred embodiments for fully illustrating the present invention, and the scope of the present invention is not limited thereto. Equivalent substitutions or changes made by those skilled in the art on the basis of the invention all fall within the protection scope of the invention. The protection scope of the invention is subject to the claims.

Claims

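1. A CSTR fault positioning method based on an Xgboost regression model is characterized by comprising the following steps:

1) collecting normal data generated by the sensors in the CSTR, as well as unknown offline data;

2) establishing a monitoring model from the normal data collected in step 1, where different monitoring models can be freely selected according to the requirements of different occasions;

3) bringing the unknown offline data collected in step 1 into the monitoring model established in step 2, extracting sample statistics to detect faults, and screening out the fault data;

4) taking the fault data from step 3 as the input of the training samples and the corresponding statistic as the output of the training samples;

5) establishing an Xgboost regression model on the training samples of step 4 to obtain the variable importance measure of each variable, wherein variables with larger measures are more likely to be fault variables, and the variable with the largest value is identified as the fault variable.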

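2. The CSTR fault location method based on the Xgboost regression model as claimed in claim 1, wherein in step 2, the monitoring model of the normal data collected in step 1 is a PCA monitoring model; the method specifically comprises the following steps:

assume that the sample set under normal operating conditions is $X \in \mathbb{R}^{n \times m}$, where n is the number of samples and m is the number of variables; after standardization, each variable has mean 0 and standard deviation 1; the covariance matrix S is obtained and decomposed by singular value decomposition:

$$S = \frac{1}{n-1} X^T X = P \Lambda P^T + \tilde{P} \tilde{\Lambda} \tilde{P}^T$$

wherein $P \in \mathbb{R}^{m \times l}$ and $\tilde{P} \in \mathbb{R}^{m \times (m-l)}$ are the principal component and residual loading matrices, respectively; l is the number of principal components; and $\Lambda$ and $\tilde{\Lambda}$ are the diagonal matrices composed of the principal and residual eigenvalues, respectively;

any sample x can be decomposed into:

$$x = \hat{x} + \tilde{x} = C x + \tilde{C} x$$

in the formula, $C = P P^T$ and $\tilde{C} = I - P P^T$ are the projection matrices onto the principal component space and the residual space, respectively.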

3. The CSTR fault location method based on the Xgboost regression model as claimed in claim 2, characterized in that fault detection is performed by extracting the SPE statistic, for which:

$$SPE = \|\tilde{C} x\|^2 = x^T (I - P P^T) x$$

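4. The CSTR fault location method based on the Xgboost regression model as claimed in claim 1, wherein said step 5 comprises the following steps:

5a) for a fault data set with n samples of m variables:

$$D = \{(x_i, y_i)\} \quad (|D| = n,\ x_i \in \mathbb{R}^m,\ y_i \in \mathbb{R})$$

where y is the statistic, an Xgboost regression model is defined to predict from x in D:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$

wherein K is the number of decision trees; f is a CART regression tree function; $\hat{y}_i$ is the prediction output; and $\mathcal{F}$ represents the set of possible decision tree functions;

defining the loss function L as:

$$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

wherein l is a differentiable convex function that measures the difference between the predicted value and the true value, for which the mean square error function is selected; $\Omega(f)$ is:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \|w\|^2$$

wherein T represents the number of leaves, w represents the leaf weights, and $\lambda$ and $\gamma$ are penalty coefficients;

5b) establishing a CART regression tree model on the training samples of step 4; to prevent overfitting, each tree draws a sample of the same size as the training set with replacement by resampling, and the optimal splitting variable and optimal splitting point are selected by a greedy algorithm so that the splitting gain is maximal;

5c) continuously generating new CART regression trees through step 5b to fit the prediction residual of the previous CART regression tree, and iterating until the loss function is minimal, wherein the loss function $L^{(t)}$ at iteration t is:

$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$

expanding the loss function as a Taylor series to second order and removing the constant term, the loss function at step t becomes:

$$\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$

wherein $g_i$ and $h_i$ are the first and second derivatives of $l(y_i, \hat{y}_i^{(t-1)})$ with respect to $\hat{y}_i^{(t-1)}$, respectively; differentiating the above formula with respect to the leaf weights and setting the result to 0 gives the optimal leaf weight $w_j^* = -\dfrac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$, where $I_j$ is the set of samples falling in leaf j; substituting it back gives:

$$\tilde{L}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$

5d) combining all CART regression trees to obtain the Xgboost regression model; dividing the sum of the split gains of each variable by its number of splits to obtain the average split gain of that variable, and dividing the average split gain of each variable by the sum of the average split gains of all variables to obtain the variable importance measure of that variable, wherein a variable with a larger measure is more likely to be a fault variable.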

5. The CSTR fault location method based on the Xgboost regression model as claimed in claim 4, wherein in step 5c, the smaller the loss function of the above formula, the better the model fit; and selecting the optimal splitting variable and the optimal splitting point through a loss function, and simultaneously calculating the splitting gain corresponding to the optimal splitting variable when the optimal splitting point is split.

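6. The CSTR fault location method based on the Xgboost regression model as claimed in claim 4, wherein in step 5c, $I_L$ and $I_R$ are the sample sets of the left node and the right node after the split, with $I = I_L \cup I_R$; the split gain after splitting is:

$$Gain = \frac{1}{2} \left[ \frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma$$

7. The CSTR fault location method based on the Xgboost regression model as claimed in claim 1, wherein in step 2, different monitoring models can be freely selected according to the requirements of different occasions, specifically as follows: PCA is selected for a linear model and KPCA is selected for a nonlinear model.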

8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the program is executed by the processor.

9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.

10. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of any of claims 1 to 7.