CN113741364A

CN113741364A - Multi-mode chemical process fault detection method based on improved t-SNE

Info

Publication number: CN113741364A
Application number: CN202110988594.1A
Authority: CN
Inventors: 顾昊昱; 张成功; 钱平; 王丽
Original assignee: Shanghai Institute of Technology
Current assignee: Shanghai Institute of Technology
Priority date: 2021-08-26
Filing date: 2021-08-26
Publication date: 2021-12-03

Abstract

The invention relates to a multi-modal chemical process fault detection method based on improved t-SNE, comprising the following steps: step S1: collecting multi-modal chemical process raw data X_o and standardizing it to obtain high-dimensional data X; step S2: computing the Mahalanobis distance D_M between high-dimensional data points; step S3: extracting features from the high-dimensional data X with an improved t-distributed stochastic neighbor embedding (t-SNE) method to obtain a low-dimensional matrix Y; step S4: obtaining a mapping matrix A from the high-dimensional space to the low-dimensional space; step S5: obtaining the residual space E of the training data; step S6: obtaining the feature space and the residual space of the online data; step S7: constructing LOF statistics with the local outlier factor (LOF) algorithm; step S8: calculating the LOF statistics of the training data and the corresponding control limits; step S9: calculating the LOF statistics of the online data and performing real-time fault detection. Compared with the prior art, the method meets the requirements of multi-modal process monitoring and achieves high accuracy.

Description

Multi-mode chemical process fault detection method based on improved t-SNE

Technical Field

The invention relates to the field of chemical production process monitoring, in particular to a multi-mode chemical process fault detection method based on improved t-SNE.

Background

Supported by technologies such as distributed control systems, the Internet of Things and big data, chemical production is becoming increasingly automated and intelligent. Process monitoring is the key to safe and efficient production, and fault detection is its most basic and important part. Data-driven process monitoring, developed from multivariate statistical process monitoring (MSPM) theory, mines useful information from massive process data to model the chemical process and thereby monitor it.

In chemical production, the operating point shifts with seasonal changes, demand fluctuations and other factors, which gives the production process its multi-modal character. Traditional MSPM methods are generally suitable only for monitoring a single-mode process; when the production mode switches during multi-modal operation, they cannot model or monitor the process effectively. One approach to multi-modal fault detection is the model integration method, which segments the multi-modal process into its modes and builds a separate model for each mode. Because it partitions the process hard, its fault detection performance depends strictly on how accurately the modes are partitioned. The other approach is the global modeling method, which uses soft partitioning and does not require each mode of the multi-modal process to be partitioned accurately.

The t-SNE algorithm is a manifold learning algorithm. Instead of measuring the similarity between high-dimensional and low-dimensional data directly by Euclidean distance, it expresses that similarity through conditional probabilities. When a model built with t-SNE is used for feature extraction, the low-dimensional data preserve the structural characteristics of the high-dimensional data. However, the classical t-SNE algorithm measures the distance between data points by computing the Euclidean distance and then converting it into a probability, which is only suitable for modeling single-mode chemical processes. After modeling, statistics are constructed to perform fault detection. The traditional T² and SPE (squared prediction error) statistics require the data to follow a normal distribution, but chemical process data after dimensionality reduction are usually non-Gaussian, so using the T² and SPE statistics may degrade the control limits and the final fault detection performance.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a multi-modal chemical process fault detection method based on improved t-SNE.

The purpose of the invention can be realized by the following technical scheme:

A multi-modal chemical process fault detection method based on improved t-SNE comprises the following steps:

step S1: collecting multi-modal chemical process raw data X_o and standardizing it to obtain high-dimensional data X;

step S2: computing the Mahalanobis distance D_M between high-dimensional data points;

step S3: extracting features from the high-dimensional data X with an improved t-distributed stochastic neighbor embedding (t-SNE) method to obtain a low-dimensional matrix Y;

step S4: obtaining a mapping matrix A from a high-dimensional space to a low-dimensional space;

step S5: obtaining a residual error space E of the training data;

step S6: solving a feature space and a residual space of the online data;

step S7: constructing LOF statistics by using a local outlier factor LOF algorithm;

step S8: calculating LOF statistics and corresponding control limits of training data;

step S9: calculating the LOF statistics of the online data and performing real-time fault detection.

Preferably, the multi-modal chemical process raw data in step S1 are data generated when the chemical process runs normally in several different modes, with the same number of sampling points in each mode; the raw data are X_o = [x₁, x₂, ..., x_N] ∈ R^{D×N}, where D is the number of process variables and N is the number of sample points.

Preferably, step S2 is specifically: let x_i and x_j be two sample points in the high-dimensional data set X; the Mahalanobis distance D_M(x_i, x_j) between x_i and x_j in the high-dimensional space is calculated as follows:

where μ is the sample mean of the data set X and Σ^-1 is the inverse of the covariance matrix Σ of the data set X; Σ has the following form:

where c_ij is the covariance of x_i and x_j, expressed as:

c_ij = Cov(x_i, x_j) = E{[x_i - E(x_i)][x_j - E(x_j)]}, i, j = 1, 2, ..., D (3)

where E(x_i) and E(x_j) denote the mathematical expectations of x_i and x_j, respectively.
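The formulas referenced as (1) and (2) are not reproduced in this text. Assuming the usual pairwise definition of the Mahalanobis distance, and reading c_ij as the covariance between the i-th and j-th process variables (its indices run to D, the number of variables), they would take roughly the following form. This is a reconstruction under those assumptions, not the original typesetting:

```latex
\begin{align}
D_M(x_i, x_j) &= \sqrt{(x_i - x_j)^{\mathsf T}\,\Sigma^{-1}\,(x_i - x_j)} \\
\Sigma &= \begin{bmatrix}
c_{11} & c_{12} & \cdots & c_{1D} \\
c_{21} & c_{22} & \cdots & c_{2D} \\
\vdots & \vdots & \ddots & \vdots \\
c_{D1} & c_{D2} & \cdots & c_{DD}
\end{bmatrix}
\end{align}
```

In this form the sample mean μ enters only through the covariance entries c_ij.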

Preferably, step S3 obtains the low-dimensional feature matrix Y and specifically comprises the following steps:

step S301: calculating the conditional probabilities between high-dimensional data points;

the conditional probability that the high-dimensional sample point x_j is a neighbor of x_i is denoted p_j|i and expressed as:

where σ_i is the variance of the Gaussian distribution centered at x_i, k denotes the number of neighbors, and x_k denotes any neighbor of x_i;

step S302: calculating the conditional probabilities between low-dimensional data points;

assume the finally obtained low-dimensional feature matrix is Y = [y₁, y₂, ..., y_N] ∈ R^{d×N}, where d is the dimension of the low-dimensional features; the high-dimensional sample points x_i and x_j are mapped to the points y_i and y_j in the low-dimensional space, and the variance of the Gaussian distribution centered at y_i is set to

Accordingly, the conditional probability that the low-dimensional sample point y_j is a neighbor of y_i is obtained as follows:

where y_k denotes any neighbor of y_i;

step S303: measuring the similarity between the high-dimensional and low-dimensional data distributions with the KL divergence;

the similarity cost function C constructed from the KL divergence is as follows:

where P_i denotes the set of conditional probabilities with which the high-dimensional sample x_i forms a neighbor relation with all other samples x_j, Q_i denotes the corresponding set for the low-dimensional sample y_i, and p_j|i denotes the conditional probability that x_j is a neighbor of x_i;

step S304: minimizing a cost function;

the cost function C is minimized by gradient descent; taking the gradient of C with respect to y_i gives:

step S305: simplifying the gradient formula;

joint probabilities are used instead of conditional probabilities, so that q_ji = q_ij and p_ji = p_ij for any i and j; this is achieved by symmetrization as follows:

the cost function C becomes:

the simplified gradient formula is:

step S306: redefining the joint probabilities of the low-dimensional data with the Student t-distribution;

to alleviate the crowding problem of the low-dimensional data, the Student t-distribution replaces the Gaussian distribution, and q_ij is redefined with a t-distribution with one degree of freedom, in the following form:

the gradient formula then becomes:

step S307: adding a large momentum term to the gradient update to avoid falling into a local optimum during gradient descent; the update with the momentum term is:

where

represents the solution of the t-th iteration, α(t) represents the momentum of the t-th iteration, and η represents the learning rate.
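The formulas of steps S301 to S307 are not reproduced in this text. As a sketch only, the standard t-SNE expressions are listed below in the order of the steps, with the Euclidean distance in the high-dimensional affinity replaced by the Mahalanobis distance D_M of step S2. That substitution, the Gaussian parameter 1/√2 in the low-dimensional space (the usual SNE choice, which removes the scale factor from the exponent), and the sign of the gradient term in the update (written here as a descent step) are assumptions rather than the original equations:

```latex
\begin{align}
p_{j|i} &= \frac{\exp\!\left(-D_M(x_i,x_j)^2 / 2\sigma_i^2\right)}
               {\sum_{k\neq i}\exp\!\left(-D_M(x_i,x_k)^2 / 2\sigma_i^2\right)} \\
q_{j|i} &= \frac{\exp\!\left(-\lVert y_i-y_j\rVert^2\right)}
               {\sum_{k\neq i}\exp\!\left(-\lVert y_i-y_k\rVert^2\right)} \\
C &= \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i\sum_j p_{j|i}\log\frac{p_{j|i}}{q_{j|i}} \\
\frac{\partial C}{\partial y_i} &= 2\sum_j \left(p_{j|i}-q_{j|i}+p_{i|j}-q_{i|j}\right)(y_i-y_j) \\
p_{ij} &= \frac{p_{j|i}+p_{i|j}}{2N} \\
C &= \mathrm{KL}(P \,\|\, Q) = \sum_i\sum_j p_{ij}\log\frac{p_{ij}}{q_{ij}} \\
\frac{\partial C}{\partial y_i} &= 4\sum_j (p_{ij}-q_{ij})(y_i-y_j) \\
q_{ij} &= \frac{\left(1+\lVert y_i-y_j\rVert^2\right)^{-1}}
              {\sum_{k\neq l}\left(1+\lVert y_k-y_l\rVert^2\right)^{-1}} \\
\frac{\partial C}{\partial y_i} &= 4\sum_j (p_{ij}-q_{ij})(y_i-y_j)\left(1+\lVert y_i-y_j\rVert^2\right)^{-1} \\
Y^{(t)} &= Y^{(t-1)} - \eta\,\frac{\partial C}{\partial Y} + \alpha(t)\left(Y^{(t-1)}-Y^{(t-2)}\right)
\end{align}
```

Here Y^{(t)} is an assumed symbol for the solution at the t-th iteration.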

Preferably, the value of σ_i in step S301 is chosen according to the perplexity; the perplexity is:

where H(p_i) is the Shannon entropy of p_i:

the perplexity is set between 5 and 50.
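The two perplexity formulas are not reproduced in this text. In the standard t-SNE definition, which is assumed to be the one intended here, they read:

```latex
\begin{align}
\mathrm{Perp}(p_i) &= 2^{H(p_i)} \\
H(p_i) &= -\sum_j p_{j|i}\,\log_2 p_{j|i}
\end{align}
```

Under this definition the perplexity can be read as an effective number of neighbors, which is why a range such as 5 to 50 is meaningful independently of the scale of the data.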

Preferably, step S5 is specifically: given that the feature space of the training data is Y and that the mapping matrix from the high-dimensional space X to the low-dimensional space Y is A, the mapping matrix from the low-dimensional space Y back to the high-dimensional space is denoted B:

B = A^T(AA^T)^-1 (16)

the reconstruction in the high-dimensional space is found:

and the residual space E of the training data is then calculated:

Preferably, step S6 is specifically: online data x_un are acquired and standardized to obtain x_on, and the low-dimensional projection y_on of the online high-dimensional data x_on is obtained with the mapping matrix A:

y_on = Ax_on (19)

The residual e_on of the online data is:

where

represents the projection of the low-dimensional online data y_on back into the high-dimensional space.
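Steps S4 to S6 can be sketched as follows. The sketch assumes that A is computed from the Moore-Penrose pseudoinverse as A = YX⁺ (as stated in the detailed description), that the reconstruction is B times the low-dimensional point, and that the residual is the difference between a sample and its reconstruction; the residual formulas themselves are missing from this text, so that last part is an assumption. The function names `fit_mapping` and `split_online` are hypothetical.

```python
import numpy as np

def fit_mapping(X, Y):
    """X: (D, N) standardized training data; Y: (d, N) low-dimensional features.

    Returns A (d, D), mapping high -> low, and B (D, d), mapping low -> high.
    """
    A = Y @ np.linalg.pinv(X)            # A = Y X^+
    B = A.T @ np.linalg.inv(A @ A.T)     # B = A^T (A A^T)^-1, a right inverse of A
    return A, B

def split_online(x_on, A, B):
    """Split one standardized online sample into feature and residual parts."""
    y_on = A @ x_on                      # low-dimensional projection
    e_on = x_on - B @ y_on               # assumed residual: sample minus reconstruction
    return y_on, e_on
```

Under the same assumptions the training residual would be `E = X - B @ Y`. Because B is a right inverse of A, the residual of any sample carries no component that A can see, which is what separates the two monitored spaces.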

Preferably, the step S7 includes the steps of:

step S701: searching the local neighborhood and calculating the distance to the farthest neighbor;

for a sample point y_i in the feature sample set of the training data, the k nearest neighbors of y_i are found to form its neighborhood; the distance from y_i to its k-th nearest neighbor, the farthest point of the neighborhood, is denoted k-distance(y_i);

step S702: obtaining the reachability distance;

the reachability distance between the sample y_i and a neighbor in its neighborhood is defined as follows:

where the distance used is the Euclidean distance between the sample point y_i and that neighbor;

step S703: calculating the local reachability density;

the local reachability density of the sample point y_i is denoted lrd(y_i) and is expressed as follows:

step S704: calculating the local outlier factor;

the local outlier factor of the sample point y_i is LOF(y_i), of the following form:
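The formulas of steps S702 to S704 are not reproduced in this text. In the standard LOF definition, which is assumed here, they read as follows, where N_k(y_i) is an assumed symbol for the k-neighborhood of y_i, y_j is one of its members and d(·,·) is the Euclidean distance:

```latex
\begin{align}
\text{reach-dist}_k(y_i, y_j) &= \max\{\, k\text{-distance}(y_j),\; d(y_i, y_j) \,\} \\
\mathrm{lrd}(y_i) &= \left( \frac{1}{k} \sum_{y_j \in N_k(y_i)} \text{reach-dist}_k(y_i, y_j) \right)^{-1} \\
\mathrm{LOF}(y_i) &= \frac{1}{k} \sum_{y_j \in N_k(y_i)} \frac{\mathrm{lrd}(y_j)}{\mathrm{lrd}(y_i)}
\end{align}
```

A value of LOF(y_i) close to 1 means that y_i is about as dense as its neighbors; values well above 1 indicate an outlier.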

Preferably, step S8 is specifically: with the LOF statistic construction method of step S7, the local outlier factor of every sample point in the training data feature space is obtained, giving the statistic matrix LOF_Y = [LOF(y₁), LOF(y₂), …, LOF(y_N)], and the control limit LOF_{Y_Limit} is obtained by kernel density estimation; similarly, the statistic matrix LOF_E = [LOF(e₁), LOF(e₂), …, LOF(e_N)] of the training data residual space E and the corresponding control limit LOF_{E_Limit} are obtained.

Preferably, step S9 is specifically: for a residual-space sample point e_on of the online data, the neighborhood of e_on is found in the training data residual space E

Finally, the LOF statistic LOF(e_on) of e_on is computed;

Fault detection is performed with the feature-space and residual-space statistics; when the online statistics are all larger than their control limits, a fault is judged to have occurred.

Compared with the prior art, the invention has the following advantages:

1) computing distances with the Euclidean distance ignores the different scales of the data dimensions and the correlation between variables; the Mahalanobis distance improves on the Euclidean distance by using the covariance matrix of the whole sample to remove the influence of scale and the interference of correlation between variables; in a multi-modal chemical process there are scale differences between modes, so the Euclidean distance of the traditional t-SNE algorithm is replaced by the Mahalanobis distance, and the modeling of the multi-modal chemical process is not affected by differences in units or scale;

2) the invention detects faults in the multi-modal chemical process by building a global model with the improved t-distributed stochastic neighbor embedding (t-SNE) algorithm; unlike the multi-model integration method, the global model does not require each mode to be partitioned accurately;

3) the LOF statistic adopted by the invention adapts to non-Gaussian data and is more robust, so more accurate control limits and statistics are obtained, and fault detection with higher accuracy is finally achieved.

Drawings

FIG. 1 is a flow chart of a multi-modal chemical process fault detection method based on improved t-SNE.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.

As shown in FIG. 1, the invention provides a multi-modal chemical process fault detection method based on improved t-SNE. First, data from normal operation of the chemical process in several modes are collected and standardized to obtain the training data; second, the Mahalanobis distance of the training data is calculated and the multi-modal chemical process is modeled with the improved t-SNE algorithm; finally, the local outlier factor (LOF) algorithm is introduced to construct statistics and realize real-time fault detection of the multi-modal chemical process.

The embodiment is realized by the following technical scheme, which specifically comprises the following steps:

step S1: collecting multi-modal chemical process data;

data generated during normal operation of the chemical process in several different modes are collected, with the same number of sampling points in each mode, forming the high-dimensional raw data X_o = [x₁, x₂, ..., x_N] ∈ R^{D×N}, where D is the number of process variables and N is the number of sample points.

Step S2: carrying out standardization processing on original high-dimensional data;

the raw data X_o are standardized to obtain the high-dimensional data X.
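A minimal sketch of this standardization step, assuming z-score scaling (zero mean and unit variance for each process variable) with statistics estimated from the training data. The source does not name the scaling rule, and the function names `fit_standardizer` and `standardize` are hypothetical:

```python
import numpy as np

def fit_standardizer(X_o):
    """X_o: (D, N) raw training data, one column per sample.

    Returns the per-variable mean and standard deviation as (D, 1) arrays.
    """
    mean = X_o.mean(axis=1, keepdims=True)
    std = X_o.std(axis=1, ddof=1, keepdims=True)
    std[std == 0] = 1.0          # leave constant variables unscaled
    return mean, std

def standardize(X, mean, std):
    """Apply the training statistics to training or online data."""
    return (X - mean) / std
```

Keeping `mean` and `std` from the training data matters later: online samples have to be scaled with the same statistics before they are projected.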

Step S3: calculating the mahalanobis distance between high-dimensional data points;

let x_i and x_j be two sample points in the high-dimensional data set X; the Mahalanobis distance D_M(x_i, x_j) between x_i and x_j in the high-dimensional space is calculated as follows:

c_ij＝Cov(x_i,x_j)＝E[x_i-E(x_i)][x_j-E(x_j)],i,j＝1,2,...,D (3)
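The pairwise distance computation can be sketched as below, assuming the usual definition D_M(x_i, x_j) = sqrt((x_i - x_j)^T Σ^-1 (x_i - x_j)) with Σ the D×D covariance matrix of the process variables estimated from the whole data set. The function name `pairwise_mahalanobis` is hypothetical, and the pseudoinverse is used in place of the plain inverse only as a guard against a singular Σ:

```python
import numpy as np

def pairwise_mahalanobis(X):
    """X: (D, N) standardized data with D >= 2 variables, one column per sample.

    Returns the (N, N) matrix of Mahalanobis distances between samples.
    """
    Sigma = np.cov(X)                        # (D, D); rows of X are variables
    Sigma_inv = np.linalg.pinv(Sigma)        # equals the inverse when Sigma is non-singular
    S = X.T                                  # (N, D), one row per sample
    diff = S[:, None, :] - S[None, :, :]     # (N, N, D) pairwise differences
    sq = np.einsum("ijk,kl,ijl->ij", diff, Sigma_inv, diff)
    return np.sqrt(np.maximum(sq, 0.0))      # clip tiny negative round-off
```

Unlike the Euclidean distance, this distance does not change when the variables are rescaled or linearly mixed, which is the property the method relies on to handle scale differences between modes.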

step S4: extracting features from the high-dimensional data with the improved t-distributed stochastic neighbor embedding method;

firstly, calculating the conditional probabilities between high-dimensional data points;

the conditional probability that the high-dimensional sample point x_j is a neighbor of x_i is denoted p_j|i:

where σ_i is the variance of the Gaussian distribution centered at x_i, k denotes the number of neighbors, and x_k denotes any neighbor of x_i.

The value of σ_i is chosen according to the perplexity, which is defined by the following equation:

where H(p_i) is the Shannon entropy of p_i:

the perplexity is generally set between 5 and 50, and the value of σ_i is found by binary search.
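The binary search for σ_i can be sketched as follows, assuming the standard definitions Perp = 2^H with H the Shannon entropy in bits, and assuming the squared distances passed in are the squared Mahalanobis distances from x_i to every other sample. All function names are hypothetical, and the bisection is done on a logarithmic scale, which is a design choice of this sketch rather than something the source specifies:

```python
import numpy as np

def conditional_probs(sq_dists_i, sigma):
    """p_{j|i} for one point; sq_dists_i holds squared distances to all other points."""
    logits = -np.asarray(sq_dists_i, dtype=float) / (2.0 * sigma ** 2)
    logits -= logits.max()                   # stabilize the exponentials
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """2 ** (Shannon entropy in bits) of a discrete distribution."""
    nz = p[p > 0]
    return 2.0 ** (-np.sum(nz * np.log2(nz)))

def find_sigma(sq_dists_i, target_perp, lo=1e-10, hi=1e10, iters=100):
    """Bisection on a log scale; perplexity increases monotonically with sigma."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if perplexity(conditional_probs(sq_dists_i, mid)) > target_perp:
            hi = mid
        else:
            lo = mid
    return np.sqrt(lo * hi)
```

The target perplexity must lie strictly between 1 and the number of other points, since those are the values reached as σ_i goes to zero and to infinity.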

secondly, calculating the conditional probabilities between low-dimensional data points;

assume the finally obtained low-dimensional feature matrix is Y = [y₁, y₂, ..., y_N] ∈ R^{d×N}, where d is the dimension of the low-dimensional features; the high-dimensional sample points x_i and x_j are mapped to the points y_i and y_j in the low-dimensional space, and the variance of the Gaussian distribution centered at y_i is set to

where y_k denotes any neighbor of y_i.

thirdly, measuring the similarity between the two data distributions with the KL divergence (Kullback-Leibler divergence);

the similarity cost function C constructed from the KL divergence is as follows:

fourthly, minimizing the cost function;

the cost function C is minimized by gradient descent; taking the gradient of C with respect to y_i gives:

fifthly, simplifying the gradient formula;

joint probabilities are used instead of conditional probabilities, so that q_ji = q_ij and p_ji = p_ij for any i and j. This is achieved by symmetrization as follows:

the cost function C is converted into the following expression:

the gradient formula can be simplified as:

sixthly, redefining the joint probabilities of the low-dimensional data with the Student t-distribution;

to alleviate the crowding problem of the low-dimensional data, the Student t-distribution replaces the Gaussian distribution. q_ij is redefined with a t-distribution with one degree of freedom, in the following form:

the gradient formula then becomes:

seventhly, to avoid falling into a local optimum during gradient descent, a large momentum term is added to the gradient update;

where

represents the solution of the t-th iteration, α(t) represents the momentum of the t-th iteration, and η represents the learning rate. The low-dimensional feature matrix Y is obtained through the above steps.
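The last two sub-steps can be sketched as below, assuming the standard t-SNE expressions q_ij ∝ (1 + ||y_i - y_j||²)^-1 and ∂C/∂y_i = 4 Σ_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||²)^-1, and assuming the symmetric joint probabilities P have already been computed from the Mahalanobis-based affinities. The sketch stores samples as rows, i.e. Y has shape (N, d), the transpose of the d×N layout used in the text; the function names and the default values of `eta` and `alpha` are hypothetical:

```python
import numpy as np

def tsne_q_and_grad(P, Y):
    """P: (N, N) symmetric joint probabilities with zero diagonal, summing to 1.
    Y: (N, d) current embedding, one row per sample.

    Returns the low-dimensional joint probabilities Q and the gradient dC/dY.
    """
    diff = Y[:, None, :] - Y[None, :, :]            # (N, N, d)
    W = 1.0 / (1.0 + np.sum(diff ** 2, axis=2))     # Student-t kernel, one degree of freedom
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()
    grad = 4.0 * np.sum(((P - Q) * W)[:, :, None] * diff, axis=1)
    return Q, grad

def kl_cost(P, Q, eps=1e-12):
    """KL(P || Q) over the entries where P is positive."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

def momentum_step(Y, Y_prev, grad, eta=100.0, alpha=0.8):
    """One descent update with a momentum term alpha * (Y - Y_prev)."""
    return Y - eta * grad + alpha * (Y - Y_prev)
```

A full optimizer would call `tsne_q_and_grad` and `momentum_step` in a loop, keeping the previous iterate for the momentum term; a suitable learning rate depends on the data and is not given in the source.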

Step S5: obtaining a mapping matrix;

to realize real-time fault detection and reduce the computation time for processing new data, a mapping matrix is needed to obtain the low-dimensional feature space of online high-dimensional data, so the mapping matrix A that projects the high-dimensional space onto the low-dimensional space must be obtained. A can be obtained from the Moore-Penrose generalized inverse X⁺ of X:

A＝YX⁺ (16)

step S6: obtaining a residual space of the training data;

given that the feature space of the training data is Y and that the mapping matrix from the high-dimensional space X to the low-dimensional space Y is A, the mapping matrix from the low-dimensional space Y back to the high-dimensional space is denoted B:

B = A^T(AA^T)^-1 (18)

the reconstruction in the high-dimensional space is found:

and the residual space E of the training data is then calculated:

step S7: solving a feature space and a residual space of the online data;

online data x_un are acquired and standardized to obtain x_on, and the low-dimensional projection y_on of the online high-dimensional data x_on is obtained with the mapping matrix A:

y_on = Ax_on (21)

The residual e_on of the online data is:

step S8: constructing LOF statistics with the local outlier factor algorithm;

firstly, searching the local neighborhood and calculating the distance to the farthest neighbor;

for a sample point y_i in the feature sample set of the training data, the k nearest neighbors of y_i are found to form its neighborhood; the distance from y_i to its k-th nearest neighbor, the farthest point of the neighborhood, is denoted k-distance(y_i);

secondly, obtaining the reachability distance;

the reachability distance between the sample y_i and a neighbor in its neighborhood is defined as follows:

where the distance used is the Euclidean distance between the sample point y_i and that neighbor.

thirdly, calculating the local reachability density;

the local reachability density of the sample point y_i is denoted lrd(y_i) and is expressed as follows:

fourthly, calculating the local outlier factor;

with the above statistic construction method, the local outlier factor of every sample point in the training data feature space is obtained, giving the statistic matrix LOF_Y = [LOF(y₁), LOF(y₂), …, LOF(y_N)], and the control limit LOF_{Y_Limit} is obtained by kernel density estimation. The confidence level is 0.99, i.e. α = 0.01. LOF_{Y_Limit} is derived from formula (26):

Similarly, the statistic matrix LOF_E = [LOF(e₁), LOF(e₂), …, LOF(e_N)] of the training data residual space E and the corresponding control limit LOF_{E_Limit} are obtained. LOF_{E_Limit} is derived as follows:
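A control limit of this kind can be sketched as below, assuming a Gaussian kernel density estimate and taking the limit as the point where the estimated cumulative distribution reaches the confidence level 1 - α = 0.99. The kernel, the normal-reference bandwidth rule and the function name `kde_control_limit` are assumptions; the source only states that kernel density estimation is used:

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def kde_control_limit(stats, confidence=0.99, bandwidth=None, iters=60):
    """Return the value below which a Gaussian KDE of `stats` puts `confidence` mass."""
    s = np.asarray(stats, dtype=float)
    if bandwidth is None:
        # normal-reference rule of thumb (an assumed default)
        bandwidth = 1.06 * s.std(ddof=1) * s.size ** (-0.2)

    def cdf(x):
        # the CDF of a Gaussian KDE is the average of the kernel CDFs
        z = (x - s) / (bandwidth * math.sqrt(2.0))
        return float(np.mean(0.5 * (1.0 + _erf(z))))

    lo, hi = s.min() - 10.0 * bandwidth, s.max() + 10.0 * bandwidth
    for _ in range(iters):                   # bisection on the monotone CDF
        mid = 0.5 * (lo + hi)
        if cdf(mid) < confidence:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The same function would be applied once to the training statistics of the feature space and once to those of the residual space to obtain the two limits.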

step S9: calculating the LOF statistics of the online data to realize real-time fault detection;

for a sample point y_on in the online data feature space, the K sample points closest to y_on are searched for in the training data feature space Y to form the neighborhood of y_on

Finally, the LOF statistic LOF(y_on) of y_on is computed, expressed as follows:

similarly, for the residual-space sample point e_on of the online data, the neighborhood of e_on is found in the training data residual space E

Finally, the LOF statistic LOF(e_on) of e_on is computed, expressed as follows:

fault detection is performed with the feature-space and residual-space statistics; when the online statistics are all larger than their control limits, a fault is judged to have occurred. The judgment condition is as follows:
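The online LOF statistic and the decision rule can be sketched as below, assuming the standard LOF definitions (reachability distance, local reachability density) and scoring an online point against neighbors taken only from the training set. Samples are stored as rows, i.e. shape (N, d); the function names are hypothetical, and duplicate training points, which would give a zero reachability distance, are not handled:

```python
import numpy as np

def fit_lof(Y, k):
    """Y: (N, d) training samples as rows. Precompute k-distances, lrd and LOF."""
    D = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)                     # a point is not its own neighbor
    nbrs = np.argsort(D, axis=1)[:, :k]             # indices of the k nearest neighbors
    nbr_d = np.take_along_axis(D, nbrs, axis=1)
    k_dist = nbr_d[:, -1]                           # distance to the k-th neighbor
    reach = np.maximum(nbr_d, k_dist[nbrs])         # reachability distances
    lrd = 1.0 / reach.mean(axis=1)                  # local reachability density
    lof = lrd[nbrs].mean(axis=1) / lrd
    return {"Y": Y, "k": k, "k_dist": k_dist, "lrd": lrd, "lof": lof}

def lof_online(model, y_on):
    """LOF of one online sample relative to its k nearest training samples."""
    d = np.linalg.norm(model["Y"] - y_on, axis=1)
    nbrs = np.argsort(d)[: model["k"]]
    reach = np.maximum(d[nbrs], model["k_dist"][nbrs])
    lrd_on = 1.0 / reach.mean()
    return float(model["lrd"][nbrs].mean() / lrd_on)

def is_fault(lof_y, lof_e, limit_y, limit_e):
    """Fault only when both statistics exceed their limits, following the text."""
    return lof_y > limit_y and lof_e > limit_e
```

One model would be fitted on the training feature space Y and another on the training residual space E, with `model["lof"]` supplying the training statistics from which the control limits are derived.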

Because production requirements change, a chemical production process generally has several operating modes, while traditional methods are usually suitable only for modeling a single-mode chemical process. The invention therefore improves the t-SNE algorithm by using the Mahalanobis distance to measure the distance between data points, which suits the modeling of multi-modal processes. On this basis, the statistics are built with the local outlier factor (LOF) algorithm, which makes the detection result more robust.

While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A multi-modal chemical process fault detection method based on improved t-SNE is characterized by comprising the following steps:

step S5: obtaining a residual error space E of the training data;

step S6: solving a feature space and a residual space of the online data;

2. The method according to claim 1, wherein the multi-modal chemical process raw data in step S1 are data generated when the chemical process runs normally in several different modes, with the same number of sampling points in each mode; the raw data are X_o = [x₁, x₂, ..., x_N] ∈ R^{D×N}, where D is the number of process variables and N is the number of sampling points.

3. The method for multi-modal chemical process fault detection based on improved t-SNE as claimed in claim 1, wherein step S2 is specifically as follows: let x_i and x_j be two sample points in the high-dimensional data set X; the Mahalanobis distance D_M(x_i, x_j) between x_i and x_j in the high-dimensional space is calculated as follows:

c_ij＝Cov(x_i,x_j)＝E[x_i-E(x_i)][x_j-E(x_j)],i,j＝1,2,...,D (3)

4. The multi-modal chemical process fault detection method based on improved t-SNE as claimed in claim 1, wherein the step S3 is used for obtaining a low-dimensional feature matrix Y, and specifically comprises the following steps:

where y_k denotes any neighbor of y_i;

the similarity cost function C constructed from the KL divergence is as follows:

where P_i denotes the set of conditional probabilities with which the high-dimensional sample x_i forms a neighbor relation with all other samples x_j, Q_i denotes the corresponding set for the low-dimensional sample y_i, and p_j|i denotes the conditional probability that x_j is a neighbor of x_i;

step S304: minimizing a cost function;

step S305: simplifying a gradient formula;

the transformation cost function C is:

the simplified gradient formula is:

the gradient formula is converted to the following expression:

wherein

5. The method for multi-modal chemical process fault detection based on improved t-SNE as claimed in claim 4, wherein the value of σ_i in step S301 is chosen according to the perplexity; the perplexity is:

where H(p_i) is the Shannon entropy of p_i:

the perplexity is set between 5 and 50.

6. The method for multi-modal chemical process fault detection based on improved t-SNE as claimed in claim 1, wherein step S5 is specifically as follows: given that the feature space of the training data is Y and that the mapping matrix from the high-dimensional space X to the low-dimensional space Y is A, the mapping matrix from the low-dimensional space Y back to the high-dimensional space is denoted B:

B = A^T(AA^T)^-1 (16)

the reconstruction in the high-dimensional space is found:

and the residual space E of the training data is then calculated:

7. The method for multi-modal chemical process fault detection based on improved t-SNE as claimed in claim 1, wherein step S6 is specifically as follows: online data x_un are acquired and standardized to obtain x_on, and the low-dimensional projection y_on of the online high-dimensional data x_on is obtained with the mapping matrix A:

y_on = Ax_on (19)

The residual e_on of the online data is:

wherein

8. The method for multi-modal chemical process fault detection based on improved t-SNE as claimed in claim 1, wherein the step S7 comprises the following steps:

for a sample point y_i in the feature sample set of the training data, the k nearest neighbors of y_i are found to form its neighborhood; the distance from y_i to its k-th nearest neighbor, the farthest point of the neighborhood, is denoted k-distance(y_i);

step S702: obtaining the reachability distance;

the reachability distance between the sample y_i and a neighbor in its neighborhood is defined as follows:

where the distance used is the Euclidean distance between the sample point y_i and that neighbor;

step S703: calculating the local reachability density;

the local reachability density of the sample point y_i is denoted lrd(y_i) and is expressed as follows:

step S704: calculating a local outlier factor;

9. The method for multi-modal chemical process fault detection based on improved t-SNE as claimed in claim 1, wherein step S8 is specifically as follows: with the LOF statistic construction method of step S7, the local outlier factor of every sample point in the training data feature space is obtained, giving the statistic matrix LOF_Y = [LOF(y₁), LOF(y₂), …, LOF(y_N)], and the control limit LOF_{Y_Limit} is obtained by kernel density estimation; similarly, the statistic matrix LOF_E = [LOF(e₁), LOF(e₂), …, LOF(e_N)] of the training data residual space E and the corresponding control limit LOF_{E_Limit} are obtained.

10. The method for multi-modal chemical process fault detection based on improved t-SNE as claimed in claim 1, wherein step S9 is specifically as follows: for a residual-space sample point e_on of the online data, the neighborhood of e_on is found in the training data residual space E

Finally, the LOF statistic LOF(e_on) of e_on is computed;