CN111639304B

CN111639304B - CSTR fault positioning method based on Xgboost regression model

Info

Publication number: CN111639304B
Application number: CN202010491108.0A
Authority: CN
Inventors: 赵忠盖; 潘磊; 李庆华; 刘成林; 刘飞
Original assignee: Jiangnan University
Current assignee: Jiangnan University
Priority date: 2020-06-02
Filing date: 2020-06-02
Publication date: 2023-02-21
Anticipated expiration: 2040-06-02
Also published as: CN111639304A

Abstract

The invention discloses a CSTR fault location method based on an Xgboost regression model, comprising the following steps: 1) collecting normal data generated by sensors in the CSTR, together with unknown offline data; 2) establishing a monitoring model from the normal data collected in step 1, where different monitoring models can be freely selected according to the requirements of different occasions; 3) feeding the unknown offline data collected in step 1 into the monitoring model established in step 2, extracting sample statistics to detect faults, and screening out the fault data. The beneficial effects of the invention are: 1) the variable importance of the Xgboost regression model measures the influence of each variable on the prediction accuracy of the output, and the metric value of each variable is calculated independently of the others; compared with the prior art, the variable importance measure contains no component caused by the action of other variables, so the influence of the smearing effect is eliminated.

Description

CSTR fault positioning method based on Xgboost regression model

Technical Field

The invention relates to the field of CSTR, in particular to a CSTR fault positioning method based on an Xgboost regression model.

Background

The Continuous Stirred Tank Reactor (CSTR) is a very important reaction device in chemical production and has very wide application. In the production of three large synthetic materials of chemical fiber, plastic and synthetic rubber, the CSTR occupies more than 90% of the synthetic production reactors, and is also widely used in the fields of pharmacy, pesticides, fuels and the like. In view of the wide application of CSTR in the actual production process, it is very valuable to ensure the stability and safety of the operation.

As modern chemical production grows in scale and complexity, faults that are not accurately identified and promptly corrected often cause huge losses. Because industrial processes continuously generate large amounts of data that reflect the process mechanism, monitoring them with data-driven multivariate statistical monitoring models is becoming increasingly popular.

The traditional technology has the following technical problems:

At present, many techniques based on multivariate statistical analysis are applied to fault detection in actual industrial processes, but fault location, an important step to be completed after fault detection, remains a technical difficulty. Common fault location methods based on multivariate statistical analysis mainly include the contribution plot method, the reconstruction method and the reconstruction-based contribution (RBC) method, but these methods are susceptible to the smearing effect, so misdiagnosis may occur in practical applications. Meanwhile, for systems with different characteristics, such as linear, nonlinear and non-Gaussian systems, the traditional fault location methods differ greatly from one another, and few technical documents propose a unified method for locating the fault source.

Disclosure of Invention

The invention provides a CSTR fault location method based on an Xgboost regression model. A multivariate statistical monitoring model is first established from normal data collected in the CSTR. The monitoring model is then used to screen out the fault data segment from the data collected offline. An Xgboost regression model is established with the fault data segment as input and the corresponding statistic as output, and the variable importance measure is taken as the contribution rate of each variable to the statistic: a variable with a larger value is more likely to be a fault variable, and the variable with the largest value is identified as the fault variable. Unlike traditional fault location methods such as the reconstruction-based contribution method and the partial differential method, the Xgboost regression model used here can be applied to fault location in both nonlinear and linear processes, requires little computation, is less affected by the smearing effect, and performs better in locating minor faults and random faults of the CSTR.

In order to solve the technical problem, the invention provides a CSTR fault positioning method based on an Xgboost regression model, which comprises the following steps:

1) Collecting normal data generated by a sensor in the CSTR and unknown off-line data;

2) Establishing a monitoring model of the normal data acquired in the step 1, and freely selecting different monitoring models according to the requirements of different occasions;

3) Establishing a monitoring model through the step 2, bringing the offline unknown data acquired in the step 1 into the monitoring model, extracting sample statistics to detect faults, and screening out fault data;

4) Collecting the fault data in the step 3 as the input of the training sample and the corresponding statistic as the output of the training sample;

5) Establishing an Xgboost regression model from the training samples of step 4 to obtain the variable importance measure of each variable, wherein variables with larger measure values are more likely to be fault variables, and the variable with the largest value is identified as the fault variable.

In one embodiment, in step 2, the monitoring model of the normal data collected in step 1 is a PCA monitoring model; the method specifically comprises the following steps:

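assuming that the sample set under normal working conditions is $X \in \mathbb{R}^{n \times m}$, where n is the number of samples and m is the number of variables; after standardization, each variable has mean 0 and standard deviation 1; the covariance matrix S is computed and decomposed by singular value decomposition to obtain:

$$S = \frac{1}{n-1} X^{T} X = P \Lambda P^{T} + \tilde{P} \tilde{\Lambda} \tilde{P}^{T}$$

wherein $P \in \mathbb{R}^{m \times l}$ and $\tilde{P} \in \mathbb{R}^{m \times (m-l)}$ are the principal component and residual loading matrices respectively, l is the number of principal components, and $\Lambda$ and $\tilde{\Lambda}$ are the diagonal matrices formed by the principal component and residual eigenvalues respectively;

any sample x can be decomposed into:

$$x = \hat{x} + \tilde{x} = C x + \tilde{C} x, \qquad C = P P^{T}, \quad \tilde{C} = \tilde{P} \tilde{P}^{T}$$

in the formula, C and $\tilde{C}$ are the projection matrices onto the principal component space and the residual space respectively;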

In one embodiment, fault detection is performed by extracting the SPE statistic, which is:

$$\mathrm{SPE} = \lVert \tilde{C} x \rVert^{2} = x^{T} \tilde{C} x$$

The control limit of the SPE statistic can be obtained from its sampling distribution. If the statistic exceeds the corresponding control limit, the process is considered abnormal, and fault detection is thereby achieved.
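The PCA monitoring model and the SPE statistic described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `fit_pca_monitor` and `spe`, the synthetic data, and the use of an empirical 99% quantile as the control limit are all assumptions made for the example (the text only says the limit is obtained from the sampling distribution).

```python
import numpy as np

def fit_pca_monitor(X_normal, n_components):
    """Fit a PCA monitoring model on normal-condition data (rows = samples)."""
    mean = X_normal.mean(axis=0)
    std = X_normal.std(axis=0, ddof=1)
    Z = (X_normal - mean) / std                  # zero mean, unit standard deviation
    S = Z.T @ Z / (Z.shape[0] - 1)               # covariance matrix
    U, _, _ = np.linalg.svd(S)                   # S is symmetric positive semidefinite
    P = U[:, :n_components]                      # principal component loadings
    C_res = np.eye(S.shape[0]) - P @ P.T         # projection onto the residual space
    return mean, std, C_res

def spe(X, mean, std, C_res):
    """SPE = ||C_res z||^2 for each standardized row z."""
    Z = (X - mean) / std
    R = Z @ C_res                                # C_res is symmetric
    return np.sum(R * R, axis=1)

rng = np.random.default_rng(0)
# Hypothetical normal data: 7 variables driven by 2 latent factors plus noise.
T_latent = rng.normal(size=(500, 2))
A = rng.normal(size=(2, 7))
X_normal = T_latent @ A + 0.1 * rng.normal(size=(500, 7))

mean, std, C_res = fit_pca_monitor(X_normal, n_components=2)
spe_normal = spe(X_normal, mean, std, C_res)
limit = np.quantile(spe_normal, 0.99)            # assumed empirical control limit

x_fault = X_normal[:1].copy()
x_fault[0, 3] += 5.0 * std[3]                    # add a bias to variable 3
spe_fault = spe(x_fault, mean, std, C_res)
```

Comparing `spe_fault[0]` with `limit` then decides whether the sample is flagged as abnormal.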

In one embodiment, the step 5 specifically includes the following steps:

5a) For a fault data set with n samples of m variables:

$$D = \{(x_i, y_i)\} \quad (|D| = n,\ x_i \in \mathbb{R}^{m},\ y_i \in \mathbb{R})$$

where $y_i$ is the statistic, an Xgboost regression model is defined to predict the output for each $x_i$ in D:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$

wherein K is the number of decision trees, $f_k$ is a CART regression tree function, $\hat{y}_i$ is the predicted output, and $\mathcal{F}$ is the set of possible decision tree functions;

the loss function L is defined as:

$$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

wherein l is a differentiable convex function that measures the difference between the predicted value and the true value, for which the mean square error function is selected; $\Omega(f)$ is:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}$$

wherein T is the number of leaves, w is the vector of leaf weights, and λ and γ are penalty coefficients;

5b) Establishing a CART regression tree model for the training samples of step 4; to prevent overfitting, each tree is trained on a data set of the same size drawn with replacement by resampling, and the optimal split variable and the optimal split point are selected by a greedy algorithm so that the split gain is maximized;

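5c) Iterating step 5b continuously, so that each new CART regression tree fits the prediction residual of the previous CART regression tree, until the loss function reaches its minimum, wherein the loss function $L^{(t)}$ at iteration t is:

$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$

expanding the loss function in a Taylor series to second order and removing the constant term, the loss function at step t becomes:

$$\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t)$$

wherein $g_i$ and $h_i$ are respectively the first and second derivatives of $l(y_i, \hat{y}_i^{(t-1)})$ with respect to $\hat{y}_i^{(t-1)}$; setting the derivative of the above formula with respect to the leaf weight to 0 gives the optimal leaf weight $w_j^{*}$, which is substituted back to give:

$$w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}, \qquad \tilde{L}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^{2}}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$

wherein $I_j$ is the set of samples assigned to leaf j;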

5d) Combining all CART regression trees to obtain the Xgboost regression model; for each variable, the sum of its split gains is divided by the number of times it was used for splitting to obtain its average split gain, and the gain of each variable is divided by the sum of the average split gains of all variables to obtain the variable importance measure of that variable, wherein a variable with a larger measure value is more likely to be a fault variable.
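The importance calculation of step 5d can be illustrated with a small sketch. It reads "the gain of each variable" as that variable's average split gain, which is an interpretation rather than something the text states outright; the function name `variable_importance` and the split log are hypothetical.

```python
from collections import defaultdict

def variable_importance(splits):
    """splits: iterable of (variable_name, split_gain) pairs over all trees."""
    total = defaultdict(float)
    count = defaultdict(int)
    for var, gain in splits:
        total[var] += gain
        count[var] += 1
    # Average split gain: gain sum divided by the number of splits on that variable.
    avg = {v: total[v] / count[v] for v in total}
    norm = sum(avg.values())
    # Assumed reading: normalise the average gains so the importances sum to 1.
    return {v: a / norm for v, a in avg.items()}

# Hypothetical split log collected from a fitted ensemble.
splits = [("C_i", 8.0), ("C_i", 4.0), ("T", 1.0), ("Q_c", 2.0), ("Q_c", 4.0)]
importance = variable_importance(splits)
fault_variable = max(importance, key=importance.get)
```

Here the average gains are 6.0, 1.0 and 3.0, so the variable with the largest normalised value is reported as the fault variable.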

In one embodiment, in step 5c, the smaller the loss function of the above formula, the better the model fit is; and selecting the optimal splitting variable and the optimal splitting point through a loss function, and simultaneously calculating the splitting gain corresponding to the optimal splitting variable when the optimal splitting point is split.
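As a worked illustration of the quantities in step 5c, the sketch below computes the first and second derivatives for a squared-error loss $l = (y - \hat{y})^2$, the optimal leaf weight $-G/(H + \lambda)$, and one leaf's contribution $-\tfrac{1}{2} G^2/(H + \lambda)$ to the minimised loss, where G and H are the sums of $g_i$ and $h_i$ over the leaf. The function names and the toy data are assumptions for the example.

```python
def grad_hess_squared_error(y_true, y_pred):
    """First and second derivatives of (y - y_hat)^2 with respect to y_hat."""
    g = [2.0 * (p - t) for t, p in zip(y_true, y_pred)]
    h = [2.0 for _ in y_true]
    return g, h

def leaf_weight(g, h, lam):
    """Optimal leaf weight w* = -sum(g) / (sum(h) + lambda)."""
    return -sum(g) / (sum(h) + lam)

def leaf_score(g, h, lam):
    """One leaf's term in the minimised loss: -0.5 * G^2 / (H + lambda)."""
    return -0.5 * sum(g) ** 2 / (sum(h) + lam)

# Toy leaf: three samples, previous prediction 0 for each.
y_true = [1.0, 2.0, 3.0]
y_pred = [0.0, 0.0, 0.0]
g, h = grad_hess_squared_error(y_true, y_pred)
w = leaf_weight(g, h, lam=0.0)            # without regularisation: mean residual
w_regularised = leaf_weight(g, h, lam=6.0)
```

With λ = 0 the weight equals the mean residual of the leaf; a positive λ shrinks it toward zero.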

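In one embodiment, in step 5c, let $I_L$ and $I_R$ be the sets of samples in the left node and the right node after the split, with $I = I_L \cup I_R$; the split gain is:

$$\mathrm{Gain} = \frac{1}{2} \left[ \frac{\left(\sum_{i \in I_L} g_i\right)^{2}}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^{2}}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^{2}}{\sum_{i \in I} h_i + \lambda} \right] - \gamma$$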
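The split gain and the greedy choice of split point can be sketched for a single variable as follows. The gain used is the standard second-order form, $\tfrac{1}{2}[G_L^2/(H_L+\lambda) + G_R^2/(H_R+\lambda) - G^2/(H+\lambda)] - \gamma$; the function names, the midpoint thresholds and the toy data are assumptions for the example.

```python
def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Gain of splitting a node into left and right sample sets."""
    GL, HL = sum(g_left), sum(h_left)
    GR, HR = sum(g_right), sum(h_right)

    def score(G, H):
        return G * G / (H + lam)

    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Greedy search over the thresholds of one variable; returns (gain, threshold)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = (float("-inf"), None)
    for k in range(1, len(order)):
        left, right = order[:k], order[k:]
        if x[left[-1]] == x[right[0]]:
            continue                              # equal values cannot be separated
        gain = split_gain([g[i] for i in left], [h[i] for i in left],
                          [g[i] for i in right], [h[i] for i in right], lam, gamma)
        threshold = 0.5 * (x[left[-1]] + x[right[0]])
        if gain > best[0]:
            best = (gain, threshold)
    return best

# Toy node: residuals change sign between x = 2 and x = 8.
x = [1.0, 2.0, 8.0, 9.0]
g = [-2.0, -2.0, 2.0, 2.0]
h = [2.0, 2.0, 2.0, 2.0]
gain, threshold = best_split(x, g, h, lam=0.0, gamma=0.0)
```

Repeating this search over every candidate variable and keeping the largest gain gives the optimal split variable and split point for the node.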

In one embodiment, in step 2, different monitoring models can be freely selected according to the requirements of different occasions, specifically: PCA is selected for linear models and KPCA is selected for nonlinear models.

Based on the same inventive concept, the present application also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods when executing the program.

Based on the same inventive concept, the present application also provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of any of the methods.

Based on the same inventive concept, the present application further provides a processor for executing a program, wherein the program executes to perform any one of the methods.

The invention has the beneficial effects that:

1) The variable importance of the Xgboost regression model measures the influence of each variable on the prediction accuracy of the output, and the metric value of each variable is calculated independently of the others; compared with the prior art, the variable importance measure contains no component caused by the action of other variables, so the influence of the smearing effect is eliminated.

2) Compared with the existing RBC fault identification technology, the fault identification method for the CSTR runs fast and can be used for fault location in a variety of settings, such as linear, nonlinear and multi-mode processes.

Drawings

FIG. 1 is a flow chart of fault location in the CSTR fault location method based on the Xgboost regression model.

FIG. 2 is a generation flow chart of the CSTR fault location method based on the Xgboost regression model.

FIG. 3 shows the identification of a random disturbance fault in the feed concentration $C_i$ in the CSTR fault location method based on the Xgboost regression model of the present invention.

FIG. 4 shows the identification of a zero drift fault in the cooling water temperature $T_{ci}$ in the CSTR fault location method based on the Xgboost regression model.

FIG. 5 is a schematic diagram of a CSTR device in the CSTR fault location method based on the Xgboost regression model.

Detailed Description

The present invention is further described below in conjunction with the drawings and the embodiments so that those skilled in the art can better understand the present invention and can carry out the present invention, but the embodiments are not to be construed as limiting the present invention.

As shown in fig. 1, a CSTR fault location method based on Xgboost regression includes the following steps:

1) Normal data generated by sensors in the CSTR is collected, as well as unknown offline data.

2) A monitoring model is established from the normal data collected in step 1; different monitoring models can be freely selected according to the requirements of different occasions.

3) The unknown offline data collected in step 1 are fed into the monitoring model established in step 2, sample statistics are extracted to detect faults, and the fault data are screened out.

4) The fault data from step 3 are collected as the input of the training samples, and the corresponding statistics as the output of the training samples.
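Steps 3 and 4 amount to keeping the samples whose statistic exceeds the control limit and pairing each with its statistic. A minimal sketch, assuming the statistics and the control limit have already been computed by the monitoring model (the function name `build_training_set` and the numbers are hypothetical):

```python
def build_training_set(samples, statistics, control_limit):
    """Keep samples flagged as faulty; return (inputs, outputs) for regression."""
    inputs, outputs = [], []
    for x, s in zip(samples, statistics):
        if s > control_limit:          # statistic exceeds the control limit
            inputs.append(x)           # fault sample becomes a training input
            outputs.append(s)          # its statistic becomes the training output
    return inputs, outputs

# Hypothetical offline data: three samples of two variables with their statistics.
samples = [[0.1, -0.2], [2.5, 0.3], [0.0, 3.1]]
statistics = [0.4, 6.8, 9.7]
X_fault, y_fault = build_training_set(samples, statistics, control_limit=2.0)
```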

The step 2 specifically comprises the following steps:

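2a) The PCA monitoring model is taken as an example of establishing the monitoring model. Assume that the sample set under normal working conditions is $X \in \mathbb{R}^{n \times m}$, where n is the number of samples and m is the number of variables. After standardization, each variable has mean 0 and standard deviation 1. The covariance matrix S is computed and decomposed by singular value decomposition to obtain:

$$S = \frac{1}{n-1} X^{T} X = P \Lambda P^{T} + \tilde{P} \tilde{\Lambda} \tilde{P}^{T}$$

where $P \in \mathbb{R}^{m \times l}$ and $\tilde{P} \in \mathbb{R}^{m \times (m-l)}$ are the principal component and residual loading matrices respectively, l is the number of principal components, and $\Lambda$ and $\tilde{\Lambda}$ are the diagonal matrices formed by the principal component and residual eigenvalues respectively.

Any sample x can be decomposed into:

$$x = \hat{x} + \tilde{x} = C x + \tilde{C} x, \qquad C = P P^{T}, \quad \tilde{C} = \tilde{P} \tilde{P}^{T}$$

where C and $\tilde{C}$ are the projection matrices onto the principal component space and the residual space respectively.

Fault detection is performed by extracting the SPE statistic, which is:

$$\mathrm{SPE} = \lVert \tilde{C} x \rVert^{2} = x^{T} \tilde{C} x$$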

The step 5 specifically comprises the following steps:

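5a) For a fault data set with n samples of m variables:

$$D = \{(x_i, y_i)\} \quad (|D| = n,\ x_i \in \mathbb{R}^{m},\ y_i \in \mathbb{R})$$

where $y_i$ is the statistic, an Xgboost regression model is defined to predict the output for each $x_i$ in D:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$

where K is the number of decision trees, $f_k$ is a CART regression tree function, $\hat{y}_i$ is the predicted output, and $\mathcal{F}$ is the set of possible decision tree functions.

The loss function L is defined as:

$$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where l is a differentiable convex function that measures the difference between the predicted value and the true value; the mean square error function is selected here. $\Omega(f)$ is:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}$$

where T is the number of leaves, w is the vector of leaf weights, and λ and γ are penalty coefficients.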

5b) A CART regression tree model is established for the training samples of step 4. To prevent overfitting, each tree is trained on a data set of the same size drawn with replacement by resampling, and the optimal split variable and the optimal split point are selected by a greedy algorithm so that the split gain is maximized.

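5c) Step 5b is iterated continuously, so that each new CART regression tree fits the prediction residual of the previous CART regression tree, until the loss function reaches its minimum. The loss function $L^{(t)}$ at iteration t is:

$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$

Expanding the loss function in a Taylor series to second order and removing the constant term, the loss function at step t becomes:

$$\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t)$$

where $g_i$ and $h_i$ are respectively the first and second derivatives of $l(y_i, \hat{y}_i^{(t-1)})$ with respect to $\hat{y}_i^{(t-1)}$. Setting the derivative of the above formula with respect to the leaf weight to 0 gives the optimal leaf weight $w_j^{*}$, which is substituted back to give:

$$w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}, \qquad \tilde{L}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^{2}}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$

where $I_j$ is the set of samples assigned to leaf j. The smaller the loss function of the above formula, the better the model fits. The optimal split variable and the optimal split point are selected through the loss function, and the split gain of the optimal split variable at the optimal split point is calculated at the same time.

Let $I_L$ and $I_R$ be the sets of samples in the left node and the right node after the split, with $I = I_L \cup I_R$. The split gain is:

$$\mathrm{Gain} = \frac{1}{2} \left[ \frac{\left(\sum_{i \in I_L} g_i\right)^{2}}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^{2}}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^{2}}{\sum_{i \in I} h_i + \lambda} \right] - \gamma$$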

5d) All CART regression trees are combined to obtain the Xgboost regression model. For each variable, the sum of its split gains is divided by the number of times it was used for splitting to obtain its average split gain (Average gain), and the gain of each variable is divided by the sum of the average split gains of all variables to obtain the variable importance measure (Variable Importance) of that variable. The larger the measure value, the more likely the variable is a fault variable.

A specific application scenario of the present invention is given below:

Taking sample data collected from a CSTR device as an example, the data comprise normal working condition data and fault data. As shown in FIG. 5, the model contains the feed concentration $C_i$ and the feed temperature $T_i$; the discharge concentration C and the discharge temperature T; and the cooling water inlet temperature $T_{ci}$, the cooling water outlet temperature $T_c$ and the cooling water flow rate $Q_c$.

The Xgboost regression fault identification method is compared with the existing RBC identification method for verification. FIG. 3 compares the two methods on identifying a random disturbance fault in the feed concentration $C_i$: the Xgboost regression method clearly and effectively removes the influence of the smearing effect, whereas the RBC method, although its variable with the largest contribution rate is also $C_i$, clearly suffers from a severe smearing effect. FIG. 4 compares the two methods on identifying a random disturbance fault in the cooling water temperature $T_{ci}$ and shows that, compared with RBC, the Xgboost regression method effectively removes the smearing effect and identifies the fault variable $T_{ci}$ better.

In summary, compared with the RBC method, the Xgboost regression model-based fault location method provided by the invention can effectively identify fault variables under the PCA model, and is not affected by the smearing effect. The PCA monitoring model is only an example for clearly illustrating the present invention, and is not a limitation on the fault detection method implemented by the present invention, and the Xgboost regression model may be combined with the PCA monitoring model, or may be combined with other multivariate statistical monitoring models such as KPCA to realize the positioning of the fault by extracting statistics.

The CSTR fault location method based on the Xgboost regression model provided by the present invention is described in detail above, and the following points need to be explained:

A CSTR fault location method based on an Xgboost regression model, characterized by comprising the following steps in sequence:

a) Normal data generated by sensors in the CSTR is collected, as well as unknown offline data.

b) A monitoring model is established from the normal data collected in step a; different monitoring models can be freely selected according to the requirements of different occasions, for example PCA for linear models and KPCA for nonlinear models.

c) The unknown offline data collected in step a are fed into the monitoring model established in step b to detect whether a fault exists; if a fault exists, the fault data are screened out and the subsequent fault location operation is performed.

d) The fault data from step c are collected as the input of the training samples, and the corresponding statistics as the output of the training samples.

e) An Xgboost regression model is established from the training samples of step d to obtain the variable importance measure of each variable; variables with larger measure values are more likely to be fault variables, and the variable with the largest value is identified as the fault variable.

In the step b, correspondingly different multivariate statistical monitoring models such as a linear PCA model, a nonlinear KPCA model and the like can be selected for different system characteristics, and all the methods can be combined with an Xgboost regression method to perform fault location.

3. The Xgboost regression model-based industrial process fault location method of claim 1, characterized in that step c specifically comprises the following steps:

Step c1: the fault data screened by the monitoring model are taken as input and the corresponding statistics as output, and the two are combined to form the training samples.

Step c2: a CART regression tree model is established from the training samples. To prevent overfitting, each tree is trained on a data set of the same size drawn with replacement by resampling, and a number of randomly extracted variables are used as the candidate split variables of each tree; the optimal split variable and the optimal split point are selected so that the split gain is maximized.

Step c3: step c2 is iterated to generate a new CART regression tree that fits the prediction residual of the previous tree, until the cost function reaches its minimum.

Step c4: all CART regression trees are combined to obtain the Xgboost regression model and the variable importance measure of each variable; variables with larger measure values are more likely to be fault variables, and the variable with the largest value is identified as the fault variable.
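The per-tree resampling of step c2 can be sketched with the standard library alone. The number of randomly extracted variables is not legible in the text above, so `n_candidates` below is a purely hypothetical parameter; the function name `draw_tree_sample` is likewise an assumption.

```python
import random

def draw_tree_sample(n_samples, n_variables, n_candidates, seed):
    """Return bootstrap row indices and a random subset of variable indices."""
    rng = random.Random(seed)
    # Same-size sample drawn with replacement (bootstrap resampling).
    rows = rng.choices(range(n_samples), k=n_samples)
    # Random subset of variables allowed as split candidates for this tree.
    variables = sorted(rng.sample(range(n_variables), n_candidates))
    return rows, variables

rows, variables = draw_tree_sample(n_samples=10, n_variables=7, n_candidates=3, seed=0)
```

Each tree would then be grown only on `rows`, searching for splits only among `variables`.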

The above-mentioned embodiments are merely preferred embodiments for fully illustrating the present invention, and the scope of the present invention is not limited thereto. The equivalent substitution or change made by the technical personnel in the technical field on the basis of the invention is all within the protection scope of the invention. The protection scope of the invention is subject to the claims.

Claims

1. A CSTR fault positioning method based on an Xgboost regression model is characterized by comprising the following steps:

5) Establishing an Xgboost regression model of the training sample in the step 4 to obtain variable importance measurement of each variable, wherein the variables with larger measurement values are more likely to be fault variables, and the fault variable with the largest value is identified;

the step 5 specifically comprises the following steps:

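5a) For a fault data set with n samples of m variables:

$$D = \{(x_i, y_i)\} \quad (|D| = n,\ x_i \in \mathbb{R}^{m},\ y_i \in \mathbb{R})$$

where $y_i$ is the statistic, an Xgboost regression model is defined to predict the output for each $x_i$ in D:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$

wherein K is the number of decision trees, $f_k$ is a CART regression tree function, $\hat{y}_i$ is the predicted output, and $\mathcal{F}$ is the set of possible decision tree functions;

defining the loss function L as:

$$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

wherein l is a differentiable convex function that measures the difference between the predicted value and the true value, and $\Omega(f)$ is:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}$$

wherein T is the number of leaves, w is the vector of leaf weights, and λ and γ are penalty coefficients;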

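5c) Continuously generating new CART regression trees through step 5b to fit the prediction residual of the previous CART regression tree, and iterating until the loss function reaches its minimum, wherein the loss function $L^{(t)}$ at iteration t is:

$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$

expanding the loss function in a Taylor series to second order and removing the constant term, the loss function at step t becomes:

$$\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t)$$

wherein $g_i$ and $h_i$ are respectively the first and second derivatives of $l(y_i, \hat{y}_i^{(t-1)})$ with respect to $\hat{y}_i^{(t-1)}$; setting the derivative of the above formula with respect to the leaf weight to 0 gives the optimal leaf weight $w_j^{*}$, which is substituted back to give:

$$w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}, \qquad \tilde{L}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^{2}}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$

wherein $I_j$ is the set of samples assigned to leaf j, and the smaller the loss function of the above formula, the better the model fits; the optimal split variable and the optimal split point are selected through the loss function, and the split gain of the optimal split variable at the optimal split point is calculated at the same time;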

5d) Combining all CART regression trees together to obtain an Xgboost regression model, dividing the gain sum of each variable during splitting by the corresponding splitting times to obtain an average splitting gain, and dividing the gain sum of each variable by the average splitting gain sum of all variables to obtain the variable importance measurement of the corresponding variable, wherein the variable with larger measurement value is more likely to be a fault variable.

2. The CSTR fault location method based on the Xgboost regression model as claimed in claim 1, wherein in the step 2, the monitoring model of the normal data collected in the step 1 is a PCA monitoring model; the method specifically comprises the following steps:

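assuming that the sample set under normal working conditions is $X \in \mathbb{R}^{n \times m}$, where n is the number of samples and m is the number of variables; after standardization, each variable has mean 0 and standard deviation 1; the covariance matrix S is computed and decomposed by singular value decomposition to obtain:

$$S = \frac{1}{n-1} X^{T} X = P \Lambda P^{T} + \tilde{P} \tilde{\Lambda} \tilde{P}^{T}$$

wherein $P \in \mathbb{R}^{m \times l}$ and $\tilde{P} \in \mathbb{R}^{m \times (m-l)}$ are the principal component and residual loading matrices respectively, l is the number of principal components, and $\Lambda$ and $\tilde{\Lambda}$ are the diagonal matrices formed by the principal component and residual eigenvalues respectively;

any sample x can be decomposed into:

$$x = \hat{x} + \tilde{x} = C x + \tilde{C} x, \qquad C = P P^{T}, \quad \tilde{C} = \tilde{P} \tilde{P}^{T}$$

in the formula, C and $\tilde{C}$ are the projection matrices onto the principal component space and the residual space respectively.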

3. The CSTR fault location method based on the Xgboost regression model as claimed in claim 2, characterized in that fault detection is performed by extracting the SPE statistic, which is:

$$\mathrm{SPE} = \lVert \tilde{C} x \rVert^{2} = x^{T} \tilde{C} x$$

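4. The CSTR fault location method based on the Xgboost regression model as claimed in claim 1, wherein in step 5c, $I_L$ and $I_R$ are the sets of samples in the left node and the right node after the split, with $I = I_L \cup I_R$; the split gain is:

$$\mathrm{Gain} = \frac{1}{2} \left[ \frac{\left(\sum_{i \in I_L} g_i\right)^{2}}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^{2}}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^{2}}{\sum_{i \in I} h_i + \lambda} \right] - \gamma$$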

5. The CSTR fault location method based on the Xgboost regression model as claimed in claim 1, wherein in step 2, different monitoring models can be freely selected according to the requirements of different occasions, specifically: PCA is selected for linear models and KPCA is selected for nonlinear models.

6. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 5 are implemented when the program is executed by the processor.

7. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 5.

8. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of any of claims 1 to 5.