CN111027611A

CN111027611A - Fuzzy PLS modeling method based on dynamic Bayesian network

Info

Publication number: CN111027611A
Application number: CN201911225604.5A
Authority: CN
Inventors: 刘鸿斌; 张昊; 张凤山; 景宜
Original assignee: Nanjing Forestry University
Current assignee: Nanjing Forestry University
Priority date: 2019-12-04
Filing date: 2019-12-04
Publication date: 2020-04-17

Abstract

The invention discloses a fuzzy PLS modeling method based on a dynamic Bayesian network, which can be used for modeling industrial processes with strong nonlinearity, time-varying behavior, and uncertainty. First, a latent variable model is established using fuzzy partial least squares, which gives the model nonlinear modeling capability. Second, the score matrix extracted from the latent variable model is expanded into an augmented matrix, so that the model better adapts to the dynamic characteristics of the data. Finally, the model is combined with a Bayesian network so that it better describes the uncertainty present in actual industrial processes. To verify the prediction accuracy of the model, the method is applied to soft measurement modeling of a wastewater treatment process. The experimental results show that combining fuzzy partial least squares with a dynamic Bayesian network significantly improves prediction accuracy and makes the model more suitable for soft measurement modeling of complex industrial processes.

Description

Fuzzy PLS modeling method based on dynamic Bayesian network

Technical Field

The invention relates to a soft measurement method of effluent indexes in a wastewater treatment process, in particular to a fuzzy PLS modeling method based on a dynamic Bayesian network.

Background

With the continuous development of modern industry, production processes are becoming increasingly continuous and large-scale, which places higher demands on the monitoring of quality indices in industrial processes. The strong nonlinearity, time variability, and process uncertainty in the collected data samples pose significant challenges for conventional process monitoring. The process monitoring techniques in wide use today are online instrument detection and offline laboratory detection. Online instruments, however, are costly and difficult to maintain, while offline laboratory detection has a large time lag, its reagents can cause secondary pollution, and it can hardly meet the online monitoring requirements of an actual production process. Establishing a soft measurement model is therefore necessary for industrial process monitoring.

Commonly used soft measurement models include multiple linear regression, principal component analysis, partial least squares, support vector machines, and decision trees. However, nonlinearity and dynamic characteristics are ubiquitous in actual industrial processes, so these basic models cannot adequately describe data with complex structure. In addition, traditional methods use many variables in soft measurement modeling, which not only makes the model structure overly complex but also raises the cost of acquiring auxiliary variables. A Bayesian network (BN), as a probability-based network structure, handles process uncertainty well, but when the data dimension is high the network structure becomes complex and the model is prone to overfitting.

To keep the soft measurement model from becoming too complex, a variable selection method is usually adopted. However, the dimension of the acquired data is often far greater than the dimension actually required by the prediction model, and this information redundancy makes soft measurement modeling difficult. The problem of high data dimension can also be addressed by establishing a latent variable model: selecting the latent variables that carry the most information retains most of the original information while reducing the data dimension. Partial least squares (PLS) is a commonly used latent variable model, but traditional linear PLS cannot adequately explain the nonlinear characteristics of industrial process data. Besides nonlinearity, the time variability of industrial processes also limits modeling; the common solution at present is a simple time series model. In an actual industrial process, however, the data fluctuate strongly and are non-periodic, so a simple time series method has difficulty describing the dynamic characteristics of the samples accurately.

Disclosure of Invention

To address the above problems in the prior art, the invention provides a Dynamic-Fuzzy Partial Least Squares-Bayesian Network (D-FPLS-BN) modeling method, that is, a fuzzy PLS modeling method based on a dynamic Bayesian network.

The invention adopts a fuzzy PLS modeling method based on a dynamic Bayesian network, which comprises the following steps:

S1, data preprocessing: the input data X and the output data Y are standardized, which eliminates the dimensions (units) of the data; the data are then divided into a training set and a test set. The training set is used to construct and train the model, and the test set is used to evaluate it.

S2, constructing an FPLS latent variable model to extract nonlinear features and reduce the data dimensionality: traditional PLS is severely limited in handling the nonlinearity of actual industrial processes, so fuzzy rules and the fuzzy C-means (FCM) algorithm are introduced on the basis of PLS to construct the FPLS model; meanwhile, to keep the model structure from becoming too complex because of high data dimensionality, only the latent variables carrying the most information are extracted from the FPLS latent variable model.

S3, constructing a dynamic model: a dynamic model is constructed from the latent variables extracted from the FPLS latent variable model by means of an augmented matrix, which addresses the time variability of the process and better describes the dynamic characteristics of the process data.

S4, constructing a D-FPLS-BN model: the data expanded by the augmented matrix are taken as the input of the Bayesian network to construct the Bayesian network, which addresses the uncertainty present in actual industry and improves the accuracy with which the model predicts the quality index.

S5, denormalizing the data and evaluating the prediction capability of the model: the test set data are fed into the trained model for prediction, and the root mean square error (RMSE) is calculated from the predicted values and the corresponding true values to complete the evaluation of the model's prediction capability.

The method has the advantages that on the basis of the FPLS latent variable model, the dynamic model and the Bayesian network are combined, so that the D-FPLS-BN soft measurement model can overcome strong nonlinearity, time-varying property and uncertainty. Therefore, in the face of a complex wastewater treatment process, the model has higher accuracy and generalization capability; compared with the traditional sensor, the soft measurement method has higher reliability in process monitoring.

After adopting the scheme, compared with the prior art, the invention has the following effects:

The fuzzy PLS modeling method based on the dynamic Bayesian network has the following beneficial effects for monitoring quality indices in industrial processes: the soft measurement approach overcomes the high cost and difficult maintenance of online instruments in actual industry and avoids the large time lag of offline detection; selecting latent variables in the FPLS soft measurement model prevents the model from becoming too complex because of high data dimensionality and effectively extracts the nonlinear characteristics of the data; constructing the dynamic model enables the model to describe the dynamic characteristics of the data more accurately and effectively handles the time variability of the process; and finally, combining the model with a Bayesian network helps it describe process uncertainty, which helps ensure the accuracy and generalization capability of the soft measurement model in industrial processes.

Drawings

FIG. 1 is a flow chart of a fuzzy PLS soft measurement modeling method based on a dynamic Bayesian network;

FIG. 2 is a scatter plot of the first latent variable score vectors of the PLS model for actual wastewater treatment process data;

FIG. 3 shows scatter plots of the first latent variable score vectors for actual wastewater treatment process data when the FPLS model uses different numbers of fuzzy rules;

FIG. 4 is a graph of the RMSE results of model predictions for FPLS-BN and D-FPLS-BN under different fuzzy rules.

Detailed Description

The present invention will now be described more clearly and fully hereinafter, with the understanding that the present invention has been described in connection with what is presently considered to be the most practical and preferred embodiment of the invention.

The technical scheme adopted by the invention for predicting the effluent index of wastewater treatment is as follows:

S1, data preprocessing: the input data X and the output data Y are standardized according to formula (1); the data are divided into a training set and a test set, where the training set is used to construct the model and the test set is used to evaluate its performance;

S2, constructing an FPLS latent variable model: a latent variable model is constructed between the FPLS score vectors to explain the nonlinear characteristics of the data;

S3, constructing a dynamic model: the score matrix of the FPLS latent variable model is extracted, and the latent variables are selected by cumulative variance contribution rate: the number of latent variables of the model is taken as the point after which the cumulative variance contribution rate changes only gently; the dynamic model is then constructed by means of an augmented matrix with a time-lag coefficient;

S4, constructing a D-FPLS-BN model: the data expanded by the augmented matrix are taken as the input of the Bayesian network, the Bayesian network is constructed, and predictions are made for new input data;

S5, denormalizing the data and evaluating the prediction capability of the model: the test set data are fed into the model for prediction, and the root mean square error (RMSE) is calculated from the predicted values and the corresponding true values to complete the evaluation of the model's prediction capability.

In step S1, the data are normalized to standard data with a mean of 0 and a variance of 1, and the initial values are set to $E_0 = X$, $F_0 = Y$, $h = 1$.

The normalization formula is as follows:

$$X = \frac{X^{*} - \mu}{\sigma} \qquad (1)$$

where $X^{*}$ is the raw data, $X$ is the normalized data, and $\mu$ and $\sigma$ are the mean and standard deviation, respectively, of all sample data.
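The standardization of step S1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation: the function names `fit_standardizer`, `standardize`, and `destandardize`, the sample array `X_raw`, and the use of the sample standard deviation (`ddof=1`) are all assumptions made for the example.

```python
import numpy as np

def fit_standardizer(data):
    """Column-wise mean and standard deviation of a 2-D array (samples x variables)."""
    mu = data.mean(axis=0)
    sigma = data.std(axis=0, ddof=1)  # assumption: sample standard deviation
    return mu, sigma

def standardize(data, mu, sigma):
    """Apply X = (X* - mu) / sigma column by column."""
    return (data - mu) / sigma

def destandardize(scaled, mu, sigma):
    """Invert the standardization, as needed later in step S5."""
    return scaled * sigma + mu

# Hypothetical raw data: 4 samples, 2 variables with different units.
X_raw = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0], [4.0, 30.0]])
mu, sigma = fit_standardizer(X_raw)
X = standardize(X_raw, mu, sigma)
```

Keeping `mu` and `sigma` around matters because the same values are needed to map predictions back to physical units before the RMSE is computed.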

In step S2, the FPLS latent variable model is constructed as follows:

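S21: The input and output data are decomposed using a partial least squares model as follows:

$$X = \sum_{h=1}^{m} t_h p_h^{T} + E, \qquad Y = \sum_{h=1}^{m} u_h q_h^{T} + F$$

where $t_h$ and $u_h$ are the latent variables of X and Y, $p_h$ and $q_h$ are the corresponding loading vectors, E and F are the corresponding residual matrices, and m denotes the number of latent variables.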

S22: Computing the h-th pair of feature vectors $t_h$, $u_h$:

$$t_h = E_{h-1} w_h \qquad (5)$$

$$u_h = F_{h-1} c_h \qquad (8)$$

S23: calculating a Gaussian membership function clustering center:

where $c_i$ ($i = 1, 2, \dots, L$) is the cluster center.

S24: After the data are clustered into L classes, a sub-model is established for each class of data. The input variable is defined as $x = [x_1, x_2, \dots, x_r]^T$ and the model parameters as $b_i = [b_{i0}, b_{i1}, \dots, b_{ir}]^T$.

S241: The TSK fuzzy function is defined as:

where $G_i$ is the normalized firing strength.

S242: The normalized firing strength $G_i$ and the Gaussian firing strength $\tau_i$ of the i-th fuzzy rule are calculated as follows:

where $i = 1, 2, \dots, L$, $c_{ir}$ is the cluster center of the i-th Gaussian membership function, and $\sigma_i$ is the width of the membership function.

S243: The width $\sigma_i$ of the membership function is calculated by the nearest-neighbor method:

where $c_i$ and $c_l$ are the two nearest cluster centers, $l = 1, 2, \dots, n$.

S244: calculating the total output of the L TSK submodels:

S245: Minimizing the objective function $J_G$:

S25: calculating load vectors of the input and output matrixes X and Y:

S26: Computing the residual matrices $E_h$ and $F_h$ for the h-th group of feature vectors:

Set $h = h + 1$ and return to step S22, repeating the calculation on the residual matrices $E_h$ and $F_h$; once the effective information contained in the data has been extracted, the calculation terminates.
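The fuzzy part of step S2 (S242 to S244) can be illustrated with a short NumPy sketch. The exact formulas are not reproduced in this text, so the code uses common textbook forms as assumptions: a Gaussian firing strength exp(-||x - c_i||^2 / (2 sigma_i^2)), a width equal to the distance to the nearest other cluster center, and a first-order TSK sub-model for each rule. All function names are hypothetical, and the cluster centers are taken as given rather than computed by FCM.

```python
import numpy as np

def nearest_neighbor_widths(centers):
    """Width of each membership function: distance to the nearest other center."""
    diff = centers[:, None, :] - centers[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    np.fill_diagonal(dist, np.inf)  # ignore the distance of a center to itself
    return dist.min(axis=1)

def firing_strengths(x, centers, widths):
    """Gaussian firing strength tau (n x L) and normalized strength G (n x L)."""
    sq = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    tau = np.exp(-sq / (2.0 * widths ** 2))  # assumed Gaussian form
    G = tau / tau.sum(axis=1, keepdims=True)
    return tau, G

def tsk_output(x, G, b):
    """Total output: sum_i G_i * (b_i0 + b_i1*x_1 + ... + b_ir*x_r); b is L x (r+1)."""
    x1 = np.hstack([np.ones((x.shape[0], 1)), x])  # prepend the constant term
    local = x1 @ b.T                               # each sub-model's output (n x L)
    return (G * local).sum(axis=1)

# Hypothetical one-dimensional example with L = 3 rules.
centers = np.array([[0.0], [1.0], [3.0]])
widths = nearest_neighbor_widths(centers)
x = np.array([[0.0], [2.0]])
tau, G = firing_strengths(x, centers, widths)
```

Because each row of `G` sums to one, the total output is a convex combination of the local linear sub-models, which is what lets the inner relation bend where a single linear PLS regression cannot.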

In step S3, the score matrix of the FPLS latent variable model is extracted, and the dynamic model is implemented by constructing an augmented matrix:

S31: Extract the score matrix T from the FPLS latent variable model and select the number of latent variables according to the cumulative variance contribution rate.

S32: The dynamic model is constructed as follows:

Let the input matrix of the original FPLS latent variable model be X:

The selected latent variables are expanded into an augmented matrix; with the time-lag coefficient d introduced, the augmented matrix $X_i$ is:

where x(t) is a sample point and d is the time-lag coefficient.
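Steps S31 and S32 can be sketched as follows. This is an illustration under stated assumptions rather than the patented procedure: the threshold `min_gain` that decides when the cumulative variance "changes gently" is hypothetical, and the column ordering of the augmented matrix (current sample first, then the d previous samples) is one common layout that may differ from the original equation.

```python
import numpy as np

def select_n_latent(cum_var, min_gain=1.0):
    """Number of latent variables after which the cumulative variance
    contribution (in %) grows by less than min_gain per added variable."""
    gains = np.diff(cum_var)
    for k, gain in enumerate(gains, start=1):
        if gain < min_gain:
            return k
    return len(cum_var)

def augment_with_lags(T, d):
    """Row k of the result is [t(k), t(k-1), ..., t(k-d)] for k = d, ..., n-1."""
    n = T.shape[0]
    if not 0 <= d < n:
        raise ValueError("d must satisfy 0 <= d < number of samples")
    blocks = [T[d - j : n - j] for j in range(d + 1)]
    return np.hstack(blocks)

# Hypothetical score matrix: 5 samples, 2 latent variables, time lag d = 2.
T = np.arange(10.0).reshape(5, 2)
T_aug = augment_with_lags(T, 2)  # shape (3, 6)
```

The first d samples have no complete history, so the augmented matrix has n - d rows; the output data must be trimmed in the same way before the Bayesian network is trained.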

In step S4, a D-FPLS-BN model is constructed:

S41: The dynamically expanded data $X_i$ are used as the nodes of the Bayesian network.

S42: Divide the data set into a training set and a test set, and use the training set to train the Bayesian network structure.

S43: Calculate the prior distribution $\pi(\xi)$ of the random variable $\xi$ in the training set.

S44: Calculate the conditional density $P(x_1, x_2, x_3, \dots, x_m \mid \xi)$ of the samples $x_1, x_2, x_3, \dots, x_m$ given $\xi$.

S45: Using Bayes' formula, calculate the posterior probability density $P(\xi \mid x_1, x_2, x_3, \dots, x_m)$ from the prior distribution $\pi(\xi)$ and the conditional density $P(x_1, x_2, x_3, \dots, x_m \mid \xi)$.

S46: Make inferences about $\xi$ in the test set using the posterior probability density:
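For reference, steps S43 to S45 rely on Bayes' formula. The expression below is the standard textbook form for a continuous variable, written in the symbols used in those steps; whether the original equation is laid out identically is an assumption.

```latex
P(\xi \mid x_1, \dots, x_m)
  = \frac{\pi(\xi)\, P(x_1, \dots, x_m \mid \xi)}
         {\int \pi(\xi)\, P(x_1, \dots, x_m \mid \xi)\, \mathrm{d}\xi}
```

The denominator only normalizes the posterior so that it integrates to one; for a discrete $\xi$ the integral becomes a sum.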

In step S5, the data are denormalized and the evaluation of the model's prediction capability is completed:

The test set data are substituted into the model for prediction, and the root mean square error (RMSE) is calculated from the predicted values and the corresponding true values; the closer the RMSE is to 0, the more accurate the model. The RMSE is calculated as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ is the true value, $\hat{y}_i$ is the estimated value, and N is the number of samples.
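The evaluation of step S5 amounts to mapping the standardized predictions back to physical units and applying the RMSE formula. The sketch below assumes NumPy; the function name `rmse`, the sample values, and the output statistics `mu_y` and `sigma_y` are hypothetical and only illustrate the calculation.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical standardized predictions and the output statistics used in step S1.
y_pred_scaled = np.array([-1.0, 0.0, 1.0])
mu_y, sigma_y = 14.0, 3.0
y_pred = y_pred_scaled * sigma_y + mu_y  # denormalization
y_true = np.array([12.0, 14.0, 16.0])
score = rmse(y_true, y_pred)
```

Computing the error after denormalization reports it in the units of the effluent index, which makes RMSE values comparable across models that share the same test set.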

Example 1:

Take the wastewater treatment process of a wastewater treatment plant as an example. The wastewater treatment data used for soft measurement modeling contain six input variables and one output variable. The input variables are influent flow (Q), influent suspended solids (SS_in), influent biological oxygen demand (BOD_in), influent chemical oxygen demand (COD_in), influent total nitrogen (TN_in), and influent total phosphorus (TP_in); the output variable is effluent suspended solids (SS_eff). The invention is further detailed in conjunction with FIG. 1:

The first step: The 358 groups of data are divided into a training set and a test set; the first 238 groups form the training set used to build the model, and the last 120 groups form the test set used to test model performance.

The second step: The PLS model is decomposed, and the FPLS latent variable model is established by combining it with TSK fuzzy rules. Table 1 gives the cumulative variance of the PLS model, and Table 2 gives the cumulative variance of the FPLS model under different numbers of fuzzy rules; the appropriate number of latent variables is selected according to the change in cumulative variance, and the score matrix is extracted. In addition, Table 3 examines the information extraction capability with 4 fuzzy rules under different numbers of latent variables, listing the variance contribution rate and the cumulative variance contribution rate for the output variable. In Tables 1-3, "This LV" denotes the variance contribution rate (%) and "Total" denotes the cumulative variance contribution rate (%). The number of latent variables is selected by the cumulative variance contribution rate: the PLS method in Table 1 uses 2 latent variables; in Table 2, FPLS_1 uses 2 latent variables, and FPLS_2, FPLS_3, and FPLS_4 use 3 each; the latent variables for FPLS_5 through FPLS_9 in Table 3 are 2, 3, 4, 5, respectively.

TABLE 1 variance contribution ratio and cumulative variance contribution ratio of PLS latent variable model

TABLE 2 variance contribution rate and cumulative variance contribution rate of FPLS latent variable model to different fuzzy rules

TABLE 3 variance contribution rate and cumulative variance contribution rate of fuzzy rules of FPLS latent variable model to different numbers of latent variables

The third step: The score matrix obtained from the latent variable model is expanded into an augmented matrix to construct the dynamic model.

The fourth step: The score matrix expanded by the augmented matrix is used as the input of the Bayesian network to train the network, and the trained D-FPLS-BN model is used to predict the test set data.

The fifth step: The predicted data are denormalized to complete the evaluation of the model's prediction capability, and the prediction accuracy of the D-FPLS-BN model is compared with that of PLS, BN, PLS-BN, D-PLS-BN, and FPLS-BN.

FIG. 2 is a scatter plot of the input and output score vectors of the first latent variable in the PLS modeling process. In FIG. 3, the sub-graphs formed by t(1) and u(1) are scatter plots and inner regression plots between the input and output score vectors of the first latent variable in the FPLS modeling process under different numbers of fuzzy rules; panels (a), (b), (c), and (d) correspond to 2, 3, 4, and 5 fuzzy rules, respectively. In the sub-graphs formed by t(1) and FiringStrength, the dotted lines represent the normalized firing strength and the solid lines represent the firing strength of each fuzzy rule. The scatter plots show that, for data with a strongly nonlinear structure, FPLS fits the nonlinearity better than PLS, which indicates that the FPLS method has stronger nonlinear modeling capability.

FIG. 4 shows the prediction root mean square error of the models under different numbers of fuzzy rules. On the abscissa, 1 represents the PLS model and 2-5 represent the FPLS models with 2, 3, 4, and 5 fuzzy rules, respectively; the ordinate is the RMSE value. The blue and red lines are the RMSE values of FPLS-BN and D-FPLS-BN, respectively, under the corresponding number of fuzzy rules. With 4 fuzzy rules, both the FPLS-BN and D-FPLS-BN models have relatively good prediction performance and a strong capability to explain nonlinear data.

Table 4 lists the RMSE results of the 6 models for predicting effluent SS: the RMSE values of PLS and BN are 1.01 and 2.35, respectively, and the RMSE of the best-performing model, D-FPLS-BN, is 0.72, which is 28.63% lower than that of the PLS method.

TABLE 4 prediction results of different models on test effluent SS

Because of the nonlinearity and time variability of the data in the wastewater treatment process and the uncertainty of the industrial process, it is difficult for a prediction model in soft measurement to achieve good prediction performance. The method of the invention explains the nonlinearity of the data better through FPLS and describes the dynamic characteristics better through the construction of a dynamic model; combined with the Bayesian network, the D-FPLS-BN model is better suited to soft measurement modeling of actual industrial processes.

The foregoing has described the general principles, principal features, and advantages of the invention. The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited thereto, and those skilled in the art can easily conceive of changes or substitutions within the technical scope of the present invention, and all such changes and substitutions are intended to be covered by the protection scope of the present invention. Therefore, the scope of the present invention should be defined by the appended claims and equivalents thereof.

Claims

1. A fuzzy PLS modeling method based on a dynamic Bayesian network, characterized by comprising the following steps:

S1, data preprocessing: standardizing the input data X and the output data Y, which eliminates the dimensions (units) of the data; dividing the data into a training set and a test set, using the training set for model construction and training, and using the test set for model evaluation;

S2, constructing an FPLS latent variable model: introducing Takagi-Sugeno-Kang (TSK) fuzzy rules and the fuzzy C-means (FCM) algorithm on the basis of PLS to construct the FPLS model, and extracting the latent variables carrying more information from the FPLS latent variable model;

S3, constructing a dynamic model: constructing a dynamic model from the latent variables extracted from the FPLS latent variable model by means of an augmented matrix;

S4, constructing the dynamic Bayesian network fuzzy PLS model, namely the Dynamic-Fuzzy Partial Least Squares-Bayesian Network (D-FPLS-BN) model: taking the data expanded by the augmented matrix as the input of the Bayesian network to construct the Bayesian network.

2. The fuzzy PLS modeling method based on a dynamic Bayesian network as claimed in claim 1, wherein the data in step S1 are derived from wastewater treatment data, the input data X comprise relevant data indicating the degree of wastewater pollution, and the output data Y are a pollutant index monitored at the wastewater outlet.

3. The fuzzy PLS modeling method based on dynamic bayesian network as claimed in claim 1, wherein said step S2 is specifically performed by:

S21: the input data X and the output data Y are decomposed using a partial least squares model as follows:

where t and u are the latent variables of X and Y respectively, p and q are the corresponding loading vectors, and E and F are the corresponding residual matrices;

S22: computing the h-th pair of feature vectors $t_h$, $u_h$:

$$t_h = E_{h-1} w_h \qquad (4)$$

$$u_h = F_{h-1} c_h \qquad (7)$$

S23: calculating a Gaussian membership function clustering center:

where $c_i$ ($i = 1, 2, \dots, L$) is the cluster center;

S24: after the data are clustered into L classes, a sub-model is established for each class of data; the input variable is defined as $x = [x_1, x_2, \dots, x_r]^T$ and the model parameters as $b_i = [b_{i0}, b_{i1}, \dots, b_{ir}]^T$;

S241: the TSK fuzzy function is defined as:

where $G_i$ is the normalized firing strength;

S244: calculating the total output of the L TSK submodels:

S245: minimizing the objective function $J_G$:

4. The fuzzy PLS modeling method based on dynamic bayesian network as claimed in claim 1, wherein said step S3 is specifically performed by:

S31: extracting the score matrix T from the FPLS latent variable model, and selecting the number of latent variables according to the cumulative variance contribution rate;

S32: the dynamic model is constructed as follows:

let the input matrix of the original FPLS latent variable model be X:

the selected latent variables are expanded into an augmented matrix; with the time-lag coefficient d introduced, the augmented matrix $X_i$ is:

where x(t) is a sample point and d is the time-lag coefficient.

5. The fuzzy PLS modeling method based on dynamic bayesian network as claimed in claim 1, wherein said step S4 is specifically performed by:

S41: the dynamically expanded data $X_i$ are used as the nodes of the Bayesian network;

S42: dividing the data set into a training set and a test set, and using the training set to train the Bayesian network structure;

S43: calculating the prior distribution $\pi(\xi)$ of the random variable $\xi$ in the training set;

S44: calculating the conditional density $P(x_1, x_2, x_3, \dots, x_m \mid \xi)$ of the samples $x_1, x_2, x_3, \dots, x_m$ given $\xi$;

S45: using Bayes' formula, calculating the posterior probability density $P(\xi \mid x_1, x_2, x_3, \dots, x_m)$ from the prior distribution $\pi(\xi)$ and the conditional density $P(x_1, x_2, x_3, \dots, x_m \mid \xi)$.

6. The fuzzy PLS modeling method based on a dynamic Bayesian network according to any one of claims 1 to 5, further comprising a model prediction capability evaluation process, specifically: substituting the test set data into the trained model for prediction, and calculating the root mean square error (RMSE) from the predicted values and the corresponding true values to complete the evaluation of the model's prediction capability.