CN116579233A - Method for predicting residual life of mechanical equipment - Google Patents


Info

Publication number
CN116579233A
CN116579233A (application CN202310409193.5A)
Authority
CN
China
Prior art keywords
input
layer
network
follows
data
Prior art date
Legal status (an assumption, not a legal conclusion)
Pending
Application number
CN202310409193.5A
Other languages
Chinese (zh)
Inventor
石慧
冯文君
刘斌
魏琦
聂晓音
Current Assignee
Taiyuan University of Science and Technology
Original Assignee
Taiyuan University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Taiyuan University of Science and Technology filed Critical Taiyuan University of Science and Technology
Priority to CN202310409193.5A priority Critical patent/CN116579233A/en
Publication of CN116579233A publication Critical patent/CN116579233A/en
Pending legal-status Critical Current


Classifications

    • G06F30/27: Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G06N3/0442: Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06N7/01: Probabilistic graphical models, e.g. probabilistic networks
    • G06F2111/08: Probabilistic or stochastic CAD
    • G06F2119/04: Ageing analysis or optimisation against ageing
    • Y02P90/30: Computing systems specially adapted for manufacturing

Abstract

The invention discloses a method for predicting the residual life of mechanical equipment, belonging to the technical field of machinery life prediction. The specific technical scheme comprises: 1. preprocessing the multi-sensor input data; 2. using the preprocessed data as the input of a multi-channel/multi-scale adaptive attention cyclic convolution network, in which an adaptive attention mechanism is embedded into the multi-channel/multi-scale cyclic convolution network; this mechanism dynamically emphasizes degradation-related discriminative information, suppresses redundant information, and models the temporal correlation of different degradation states, so that the multi-sensor input data are fused, the completeness of the feature-information expression is enhanced, and the uncertainty of the data is reduced; 3. using the output of step two as the input of a Bayesian inference network, minimizing the objective function with an Adam optimizer, and predicting the residual life by a Monte Carlo estimation method, thereby quantifying uncertainty while improving prediction accuracy.

Description

Method for predicting residual life of mechanical equipment
Technical Field
The invention belongs to the technical field of mechanical life prediction, and particularly relates to a residual life prediction method of mechanical equipment.
Background
With the development of science and technology and the progress of production processes, modern equipment has become increasingly complex. During operation, equipment is subject to the combined action of internal and external factors, and its performance and health state inevitably decline; once the degradation reaches a certain degree, the equipment can no longer perform its normal tasks and functions, with potentially immeasurable consequences. Therefore, monitoring mechanical equipment in real time and predicting its residual service life is of great significance for ensuring stable and safe operation, reducing maintenance costs, and preventing casualties.
In recent years, with the development of sensing technology, multiple sensors can be used to monitor the running states of mechanical equipment and its parts in real time so as to observe their degradation. Prognostics and Health Management (PHM) of industrial systems has accordingly come into use: the various condition-monitoring data generated in an industrial system are exploited, through means such as signal processing and data analysis, to assess the health state of a complex system and predict its residual life distribution. Prediction of the Remaining Useful Life (RUL) is the core research problem throughout a PHM system. RUL prediction methods can be roughly classified into two types: model-based methods and data-driven methods. Model-based approaches attempt to capture the degradation process of a machine component by mining the mechanism of component failure and constructing a mathematical or physical model, and thereby predict the remaining life. Since a large amount of expert domain knowledge is required and it is difficult to build an accurate model in practice, accurate prediction of the remaining service life is difficult. Data-driven approaches do not require complete failure mechanisms or full expert knowledge, and include statistical data-driven and machine learning methods. Among machine learning methods, Deep Learning (DL) is widely used because of its ability to automatically process raw data and describe the degradation processes of complex nonlinear systems, and has become the dominant technique for predicting the remaining life of a system.
In general, deep-learning-based residual life prediction constructs an end-to-end model to reflect the relationship between the monitored signals and the residual life of the system. The most widely used algorithms in existing approaches are Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). They have achieved significant success in predicting the remaining life of machines. Based on these two network frameworks, a number of variant networks for mechanical-equipment remaining life prediction have been proposed.
Convolutional neural networks have been applied successfully to the prediction of the residual life of mechanical equipment; their artificial neurons respond to surrounding units within a partial coverage area and effectively extract the local features of the data. Babu et al. first applied them to time series to build a regression-model-based prediction system. Then, Yao et al. proposed an improved one-dimensional convolutional neural network (1D-CNN) combined with a Simple Recurrent Unit (SRU) network, using the 1D-CNN to extract signal features and, by reconstructing the serial operation mode of the conventional recurrent neural network (RNN), establishing a parallel-input SRU network for residual life prediction whose validity was verified on the XJTU-SY dataset. Huang et al. developed a new method for predicting remaining useful life based on deep convolutional neural networks; the proposed architecture includes two main parts, a deep convolutional neural network and a multi-layer perceptron forming a dual network, which simultaneously extracts the feature representation hidden in the time series and predicts the RUL, and its effectiveness was verified on the XJTU-SY bearing dataset.
Although an increase in network depth fits features better, it can lead to problems such as gradient instability and network degradation. To reduce network depth and avoid the disadvantage of insufficient information extraction at a single scale, the concept of multi-scale convolution was proposed in GoogLeNet. Xu et al. added the multi-scale module to a predictive model and verified its effectiveness by predicting the remaining life of an aircraft engine and a cutting tool and comparing against related work. Wang et al. then proposed a multi-scale convolutional attention network (MSCAN), first constructing a self-attention module to effectively fuse the multi-sensor input data. A multi-scale learning strategy was then developed to automatically learn feature representations at different time scales. Finally, the learned high-level representation is fed into a dynamic dense layer to perform RUL estimation, with validity verified on a milling dataset.
Although convolutional neural networks have achieved some success in life prediction, they only recognize local features and suffer from certain drawbacks. To remedy this, recurrent neural networks have been used. Guo et al. constructed health indicators and applied them to bearing datasets. The predictive ability of the recurrent neural network has certain limitations due to problems such as vanishing gradients. Hochreiter et al. introduced the concept of the cell state and proposed the Long Short-Term Memory (LSTM) network to address these issues. Cho et al. simplified the LSTM model and proposed the Gated Recurrent Unit (GRU), which saves computational cost. Lin proposed a model that integrates a time window, multi-scale sequences, and an LSTM structure, prepares training samples with a sliding-time-window method, and maps degradation features directly to RUL predictions; optimal prediction performance is obtained by inputting multi-scale sequences to adjust the model parameters.
In summary, scholars have carried out some research on methods for predicting the remaining life of a system, but the following problems remain:
1. The development of equipment failure is a progressive evolution process, and the degradation states at different time points are correlated on the time scale. However, convolutional neural networks are feed-forward networks that cannot establish the correlation between historical input information and current input information. In addition, a single-channel, single-scale convolutional neural network cannot effectively identify the degradation-related discriminative information in multi-sensor input information and cannot promote a more complete information expression.
2. Residual life prediction methods based on Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs) focus only on prediction accuracy: they can only give point estimates of the residual life, do not consider the uncertainty present in the model itself, and cannot give an estimate of the prediction uncertainty.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides a residual life prediction method of mechanical equipment, which realizes the quantification of prediction uncertainty and has high prediction precision.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows: a method for predicting the residual life of mechanical equipment comprises the following specific steps:
1. Preprocessing input data of multiple sensors;
2. the preprocessed data is used as the input of a multi-channel/multi-scale adaptive attention cyclic convolution network, in which the adaptive attention mechanism is embedded into the multi-channel/multi-scale cyclic convolution network, so that discriminative information related to degradation can be dynamically emphasized and redundant information suppressed. Meanwhile, the multi-channel/multi-scale cyclic convolution models the temporal correlation of different degradation states, so that the multi-sensor input data are fused, the completeness of the feature-information expression is enhanced, and the uncertainty of the data is reduced.
3. the output data of step two is used as the input of a Bayesian inference network; an Adam optimizer minimizes the objective function, and prediction of the residual life is realized by the Monte Carlo estimation method, thereby quantifying uncertainty while improving prediction accuracy.
In step one, the specific preprocessing steps are as follows: the signals x = {x_1, x_2, …, x_N} acquired by the multiple sensors are used as the input of the network, and min-max normalization is used to scale the raw data to [0, 1]; the specific expression is:

x_i' = (x_i − min(x_i)) / (max(x_i) − min(x_i))   (1)

wherein x_i is the data contained in the i-th sensor, max(x_i) is the maximum value of x_i, and min(x_i) is the minimum value of x_i;
the normalized sensor data are then processed into input data through a time window.
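The preprocessing of step one can be sketched as follows. This is a minimal numpy illustration, not the patent's implementation; the cycle count, sensor count, and value ranges are invented for the example:

```python
import numpy as np

def min_max_scale(x):
    """Min-max normalization of step one: each column (sensor channel)
    of the [T, N] signal matrix is rescaled to [0, 1] independently."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo)

# Hypothetical raw data: 100 operating cycles from N = 4 sensors.
raw = np.random.default_rng(0).uniform(5.0, 50.0, size=(100, 4))
scaled = min_max_scale(raw)
print(scaled.min(), scaled.max())  # 0.0 1.0
```

Because each sensor is scaled independently, a sensor with a large physical range cannot dominate the shared network input.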
In the second step, the specific processing procedure is as follows: establishing a multi-channel/multi-scale self-adaptive attention cyclic convolution network, wherein the network consists of three parallel channels, each channel is formed by embedding a self-adaptive attention mechanism into a cyclic convolution module, convolution kernels with different sizes are selected by a cyclic convolution layer in each channel to extract local and global characteristics, the self-adaptive attention mechanism consists of a spatial attention mechanism and a temporal attention mechanism, and the characteristics learned by the three paths are connected to be used as the input of a Bayesian inference network;
given an input vector x ', the input vector x' is input into a global average pooling layer and a global maximum pooling layer to integrate global information of each input sensor, and two different information descriptions v and m are respectively obtained, wherein the specific operation is as follows:
v=Gavgpool(x′) (2)
m=Gmaxpool(x′) (3)
wherein x' is the input data after time-window processing, Gavgpool(·) is the global average pooling operation, Gmaxpool(·) is the global maximum pooling operation, v is the feature map after the global average pooling operation, and m is the feature map after the global maximum pooling operation;
v and m are respectively fed into a fully connected layer containing one hidden layer; to reduce computational complexity, the number of neurons in each layer of the fully connected network is set as follows: the number of neurons in the first and third layers equals the number of channels of the input vector, and the number of neurons in the middle layer equals half the number of channels of the input vector, so as to capture the relationships among channels and estimate the information content of each channel. The outputs of the two fully connected layers are added element-wise and activated by a sigmoid activation function to obtain the spatial attention weight α, specifically calculated as:

α = σ(W_12(W_11 v) + W_22(W_21 m))   (4)

wherein α is the weight of the spatial attention, σ is the sigmoid activation function, and W_12, W_11, W_22, W_21 are the weights in the fully connected layers;
the weight of the spatial attention is multiplied by the input vector to obtain the output x_ca of the spatial attention mechanism; the specific operation is:

x_ca = α ⊗ x'   (5)

wherein x_ca is the output of the spatial attention, α is the weight of the spatial attention, and x' is the input vector;
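The spatial (channel) attention computation above can be sketched in numpy. This is a hedged illustration rather than the patent's implementation: biases and any hidden-layer activation are omitted, and all weight values and shapes are invented for the example:

```python
import numpy as np

def spatial_attention(x, W11, W12, W21, W22):
    """Pool a [s, N] window over time (global average and global max, as in
    equations (2) and (3)), pass each pooled descriptor through a bottleneck
    MLP (N -> N//2 -> N), sum the two outputs, squash with a sigmoid to get
    per-sensor weights alpha, and reweight the input channels with alpha."""
    v = x.mean(axis=0)                 # global average pooling -> [N]
    m = x.max(axis=0)                  # global max pooling     -> [N]
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    alpha = sigmoid(W12 @ (W11 @ v) + W22 @ (W21 @ m))
    return alpha * x                   # output of the spatial attention

rng = np.random.default_rng(1)
s, N = 30, 4                           # window length, number of sensors
x = rng.standard_normal((s, N))
W11, W21 = rng.standard_normal((2, N // 2, N)) * 0.1
W12, W22 = rng.standard_normal((2, N, N // 2)) * 0.1
x_ca = spatial_attention(x, W11, W12, W21, W22)
print(x_ca.shape)  # (30, 4)
```

The bottleneck (N to N//2 to N) is what keeps the attention branch cheap relative to the backbone.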
the output of the spatial (channel) attention mechanism is taken as the input of the temporal attention mechanism, and refined features are extracted with an adaptive convolution kernel of size f_ta to obtain the weight of the temporal attention mechanism; the specific operations are:

f_ta = |δ(s)|_odd   (6)
X_ta = σ(W_ta x_ca + B_ta)   (7)

wherein f_ta is the size of the adaptive convolution kernel, which is related to the time-window size s; δ(s) is the selected kernel function, |t|_odd denotes the nearest odd number, X_ta is the weight of the temporal attention mechanism, W_ta is the weight, B_ta is the bias, and σ is the hard-sigmoid activation function;
the weight of the temporal attention mechanism is multiplied by the output of the spatial attention mechanism to obtain the output x_ta of the temporal attention mechanism; the specific calculation is:

x_ta = X_ta ⊗ x_ca   (8)
the cyclic convolution layer adds a recursive connection between its output and input to form an information loop; the output of the cyclic convolution layer is fed back to its input through the recursive connection, so that the layer can memorize information over time;
the state x_t^k of the k-th cyclic convolution layer at time step t is expressed as:

x_t^k = σ(W_f * x_t + W_r * x_{t−1}^k + b)   (9)

where σ(·) is a nonlinear activation function, * is the convolution operation, x_t is the input vector, i.e. the time-series sensor data input at the current moment, x_{t−1}^k is the stored state fed back by the loop connection at time t−1, W_f and W_r are the feed-forward and recurrent convolution kernels, and b is the bias;
a gating mechanism is introduced into the cyclic convolution layer, with two gates, a reset gate r_t^k and an update gate z_t^k, specifically calculated as:

r_t^k = σ(W_r^x * x_t + W_r^h * x_{t−1}^k + b_r)   (10)
z_t^k = σ(W_z^x * x_t + W_z^h * x_{t−1}^k + b_z)   (11)

where σ(·) is the logistic sigmoid activation function, * is the convolution operation, W_r^x, W_r^h, W_z^x, W_z^h are convolution kernels, and b_r, b_z are biases; the state x_t^k of the cyclic convolution layer at time step t is then:

x̃_t^k = tanh(W_h^x * x_t + W_h^h * (r_t^k ∘ x_{t−1}^k) + b_h)   (12)
x_t^k = (1 − z_t^k) ∘ x_{t−1}^k + z_t^k ∘ x̃_t^k   (13)

where ∘ is the Hadamard product, x̃_t^k is the newly generated candidate state, tanh(·) is the tanh activation function, W_h^x and W_h^h are convolution kernels, and b_h is the bias term;
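The gated recurrent-convolution update described above can be sketched in one dimension with numpy. This is an illustrative ConvGRU-style cell under assumed shapes and kernel sizes, not the patent's exact layer:

```python
import numpy as np

def conv1d_same(x, k):
    """'Same'-padded 1-D convolution of a length-T signal with an odd-length kernel k."""
    pad = len(k) // 2
    return np.convolve(np.pad(x, pad), k, mode="valid")[: len(x)]

def conv_gru_step(x_t, h_prev, K, b):
    """One gated recurrent-convolution update: reset gate r, update gate z,
    candidate state, and the blended new state (a convex combination)."""
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    r = sig(conv1d_same(x_t, K["rx"]) + conv1d_same(h_prev, K["rh"]) + b["r"])
    z = sig(conv1d_same(x_t, K["zx"]) + conv1d_same(h_prev, K["zh"]) + b["z"])
    cand = np.tanh(conv1d_same(x_t, K["hx"]) + conv1d_same(r * h_prev, K["hh"]) + b["h"])
    return (1.0 - z) * h_prev + z * cand

rng = np.random.default_rng(2)
K = {name: rng.standard_normal(3) * 0.5 for name in ("rx", "rh", "zx", "zh", "hx", "hh")}
b = {name: 0.0 for name in ("r", "z", "h")}
h = np.zeros(30)                                  # initial state
for x_t in rng.standard_normal((5, 30)):          # 5 time steps, feature length 30
    h = conv_gru_step(x_t, h, K, b)
print(h.shape)  # (30,)
```

Because the new state is a gated blend of the previous state and a tanh-bounded candidate, the state stays bounded while still carrying information across time steps.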
In the third step, the specific processing procedure is as follows: the outputs of the three parallel paths of the multi-channel/multi-scale adaptive attention cyclic convolution network are concatenated as:

x_z = concatenate(x_z1, x_z2, x_z3)   (14)
the Bayesian inference network is composed of three fully connected layers and is a probabilistic model in which the random variable W follows a Gaussian prior distribution; W consists of all learnable network parameters, i.e. the weights and biases of the neurons in each fully connected layer, W = {ω_L, b_L}_{L=1}^{M}, where L denotes the L-th fully connected layer and M denotes the total number of fully connected layers. Given training data composed of inputs X and their corresponding outputs O, the posterior distribution of W is obtained by Bayes' theorem:

p(W | X, O) = p(O | X, W) p(W) / p(O | X)   (15)

On this basis, for a new input x*, the corresponding predictive distribution is obtained by:

p(o* | x*, X, O) = ∫ p(o* | x*, W) p(W | X, O) dW   (16)
an approximate variational distribution q(W) that factorizes over the weights and biases is defined as:

q(W) = Π_{L=1}^{M} q(ω_L) q(b_L)   (17)

In the above equation, the variational distribution of each weight is defined as a Gaussian mixture with two components, and each bias follows a simple Gaussian distribution; with ω_L and b_L denoting the weights and biases of the neurons in fully connected layer L:

q(ω_L) = π_L N(ω_L; M_L, τ^(−1) I) + (1 − π_L) N(ω_L; 0, τ^(−1) I)   (18)
q(b_L) = N(b_L; m_L, τ^(−1) I)   (19)

wherein π_L ∈ [0, 1] is a predetermined probability, M_L and m_L are the variational parameters of the weights and the biases respectively, and τ is the model precision;
the posterior distribution is approximated by minimizing the KL divergence between the variational distribution and the posterior distribution, i.e. minimizing KL(q(W) || p(W | X, O)), from which the final predictive distribution is further estimated:

p(o* | x*) ≈ ∫ p(o* | x*, W) q*(W) dW   (20)

where q*(W) is the distribution minimizing the KL divergence. The optimization objective of the network is to minimize the KL divergence, which corresponds to maximizing the evidence lower bound; the objective function is specifically:

L = −∫ q(W) log p(O | X, W) dW + KL(q(W) || p(W))   (21)
the first term in equation (21) is evaluated using the Monte Carlo integral, which is performed as follows:
first, a normal distribution of standards is usedAnd Bernoulli distribution q (β) =Bernou 11i (pi) reparameterizing the multiplicative function according to equations (18) and (19), ω L And b L Is rewritten as two formulas:
based on equations (22) and (23), the objective function can be further rewritten as:
for equation (24), each of the integrand of the first term in the equation depends on the weight and bias, using single-sample Monte Carlo integrationEach integral in the first term of the equation is de-estimated (24) to obtain an unbiased estimateThe objective function is further expressed as:
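The reparameterization in equations (22) and (23) can be illustrated with a small numpy sketch. The shapes and the π and τ values below are invented, and the keep/drop convention of the Bernoulli mask is one reasonable reading of the mixture:

```python
import numpy as np

def sample_weight(M_L, pi_L, tau, rng):
    """One draw from the reparameterized variational weight of equation (22):
    columns of the variational mean M_L are kept with probability pi_L via a
    Bernoulli mask beta, then perturbed by Gaussian noise of precision tau."""
    beta = rng.binomial(1, pi_L, size=M_L.shape[1])   # beta ~ Bernoulli(pi)
    eps = rng.standard_normal(M_L.shape)
    return M_L * beta + tau ** -0.5 * eps

rng = np.random.default_rng(42)
M_L = np.ones((3, 5))                  # hypothetical variational mean
w = sample_weight(M_L, pi_L=0.5, tau=1e4, rng=rng)
print(w.shape)  # (3, 5)
```

Each sampled matrix is the mean with some columns zeroed out plus small noise, which is the dropout interpretation of the mixture distribution.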
for the second term in equation (24), based on the Gaussian prior it can be approximated as an L2 regularization:

KL(q(W) || p(W)) ≈ λ Σ_{L=1}^{M} (π_L ‖M_L‖² + ‖m_L‖²)   (26)

Thus, the objective function is expressed as follows:

L ≈ (1/N) Σ_{i=1}^{N} E(o_i, ô_i) + λ Σ_{L=1}^{M} (π_L ‖M_L‖² + ‖m_L‖²)   (27)

wherein E(·) is the loss function;
from the above analysis, for the Bayesian inference network, drawing a sample β̂_L ~ Bernoulli(π_L) is equivalent to randomly masking rows in each weight matrix during the forward pass, just as dropout does in a fully connected layer; the second term in equation (24) corresponds to adding an L2 regularization term to the weights and biases during network optimization. By applying dropout governed by the probability π together with L2 regularization with weight-attenuation coefficient λ, the quantification of model uncertainty can be realized; with an optimizer minimizing the objective function, prediction of the remaining life can be achieved by Monte Carlo estimation.
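The Monte Carlo estimation step can be sketched as follows. The "network" here is a stand-in stochastic linear map with invented weights, only meant to show how repeated stochastic forward passes yield a point prediction (mean) and an uncertainty estimate (standard deviation):

```python
import numpy as np

def mc_predict(stochastic_forward, x, n_samples=500):
    """Monte Carlo estimate of the RUL predictive distribution: run the
    network with dropout kept active n_samples times; the sample mean is
    the point prediction and the sample std quantifies the uncertainty."""
    preds = np.array([stochastic_forward(x) for _ in range(n_samples)])
    return preds.mean(), preds.std()

# Stand-in stochastic predictor: a fixed linear map whose weights are
# Bernoulli-masked on every call, mimicking dropout left on at test time.
rng = np.random.default_rng(7)
w = rng.uniform(0.5, 1.5, size=8)
x = np.ones(8)

def noisy_forward(x, keep=0.8):
    mask = rng.binomial(1, keep, size=w.shape)
    return float((w * mask) @ x / keep)   # inverted-dropout rescaling

mean_rul, std_rul = mc_predict(noisy_forward, x)
print(mean_rul > 0.0, std_rul > 0.0)  # True True
```

A wide sample standard deviation flags inputs on which the model is unsure, which a single deterministic forward pass cannot reveal.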
The invention adopts a multi-channel/multi-scale adaptive attention cyclic convolution network based on Bayesian inference to study the residual life prediction problem of a system. In the feature extraction part, the network embeds the adaptive attention mechanism into the multi-channel/multi-scale cyclic convolution network, so that it can better process multi-sensor input information and dynamically identify important information in the channel and time dimensions while suppressing useless information, thereby realizing a complete expression of the information, improving the prediction capability of the model, and reducing the prediction uncertainty present in the data.
In the residual life prediction part, Bayesian inference is combined with the neural network, improving the interpretability of the neural network. Under the Bayesian framework, the probability density function of the RUL is obtained using variational inference, and the residual life is predicted probabilistically, realizing the quantification of prediction uncertainty. Experimental tests on the C-MAPSS dataset show that the model obtains good accuracy on the dataset and has clear advantages over existing prediction methods.
Drawings
FIG. 1 is a block diagram of a Bayesian inference based multichannel cyclic convolutional network.
Fig. 2 is a block diagram of an adaptive attention mechanism.
Fig. 3 is a block diagram of a cyclic convolution.
Fig. 4 is a block diagram of a bayesian inference network.
Fig. 5 is a graph of a loss function of a bayesian inference network on FD 001.
Fig. 6 is a graph of the predicted outcome of the bayesian inference network model on FD 001.
FIG. 7 is a graph of predicted results of a Bayesian inference network model on FD001 engine 34.
Fig. 8 is a graph of the prediction results of four sub-data sets under different time windows, fig. 8 (a) is a graph of the prediction results of root mean square error RMSE, and fig. 8 (b) is a graph of the prediction results of Score.
Fig. 9 is a diagram showing the influence of the number of attention mechanisms on the prediction result, fig. 9 (a) is a diagram showing the prediction result of the root mean square error RMSE, and fig. 9 (b) is a diagram showing the prediction result of the Score.
Fig. 10 shows the effect of the number of cyclic convolution layers in the cyclic convolution module on the prediction result, fig. 10 (a) shows the prediction result diagram of the root mean square error RMSE, and fig. 10 (b) shows the prediction result diagram of the Score.
FIG. 11 is a graph comparing RMSE of the present model and other predictive models over four sub-data sets of a turbofan engine.
Fig. 12 is a graph comparing Score of the present model and other predictive models over four sub-data sets of a turbofan engine.
Detailed Description
In order to make the technical problems, technical schemes and beneficial effects to be solved more clear, the invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
A method for predicting the residual life of mechanical equipment is characterized by establishing a multi-channel cyclic convolution network based on Bayesian inference as shown in figure 1, wherein the multi-channel cyclic convolution network comprises data preprocessing, a multi-channel/multi-scale self-adaptive attention cyclic convolution network and a Bayesian inference network. In the whole network system, data is preprocessed first. And then, the preprocessed data from different sensors are input as a multi-channel/multi-scale self-adaptive attention cyclic convolution network, and the distinguishing information related to degradation can be dynamically emphasized and redundant information can be restrained through the network, and the time correlation of different degradation states is modeled, so that the multi-sensor information is fused better to realize the integrity of information representation, and the uncertainty of the data itself is reduced. Finally, the output of the multi-channel/multi-scale adaptive attention-circulating convolution network is used as an input to a bayesian inference network to facilitate the prediction of RUL and to enable quantification of uncertainty.
Data preprocessing: the signals x = {x_1, x_2, …, x_N} acquired by the multiple sensors are used as the input of the network, where N represents the number of sensors. If the signal x were input into the network directly, the different magnitudes of the data could prevent the network from converging; therefore, so that the network can perform feature extraction more efficiently, the raw data are scaled to [0, 1] with min-max normalization, expressed as follows:

x_i' = (x_i − min(x_i)) / (max(x_i) − min(x_i))   (1)

wherein x_i is the data contained in the i-th sensor, and max(x_i) and min(x_i) represent its maximum and minimum values, respectively.
The normalized sensor data are then processed into input data through a time window: assuming a time window of size s sliding with step length l = 1 over normalized data with total period T, the actual remaining service life of the (j+1)-th sample is T − s − j, and the size of the corresponding input sample is s × N.
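The time-window construction and its remaining-life labels (T − s − j for the (j+1)-th window) can be sketched as follows; the series length, sensor count, and window size are invented for the example:

```python
import numpy as np

def make_samples(series, s):
    """Slice a [T, N] run-to-failure series into [s, N] windows (stride 1)
    and attach the remaining-life label T - s - j to the (j+1)-th window."""
    T = series.shape[0]
    X = np.stack([series[j:j + s] for j in range(T - s + 1)])
    y = np.array([T - s - j for j in range(T - s + 1)])
    return X, y

series = np.random.default_rng(3).random((120, 6))  # T = 120 cycles, N = 6 sensors
X, y = make_samples(series, s=40)
print(X.shape, y[0], y[-1])  # (81, 40, 6) 80 0
```

The last window ends at the failure point, so its label is 0; earlier windows receive proportionally larger remaining-life labels.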
Multichannel/multiscale adaptive attention-circulating convolutional network: as shown in fig. 1, the multi-channel/multi-scale adaptive attention-circulating convolution network is composed of three parallel channels. Each channel is formed by embedding an adaptive attention mechanism into a circular convolution module, in order to better perform feature extraction to reduce uncertainty existing between data, input data firstly passes through the adaptive attention mechanism, the mechanism comprises two different dimensions of space and time, the space dimension focuses on the importance degree of degradation information which can be represented by a sensor in input information of a multidimensional sensor, the time dimension focuses on the importance degree of degradation information of different sensors in different time windows, and therefore a network can dynamically emphasize distinguishing information related to degradation and can inhibit useless information.
Then, the time dependence of the different degradation states is modeled by a cyclic convolution module. Here, the cyclic convolution module is constructed of a plurality of cyclic convolution layers. The cyclic convolution layer connects a recursive connection between the output and the input to form an information loop. Meanwhile, in order to promote more complete feature expression, in the three parallel channels, the cyclic convolution layers respectively select convolution kernels with different sizes to extract local and global features. Finally, the learned characteristics of the three paths are connected to be used as the input of the Bayesian inference network.
Adaptive attention mechanism: as shown in fig. 2, the adaptive attention mechanism is composed of a spatial attention mechanism and a temporal attention mechanism, the spatial dimension focuses on the importance degree of degradation information which can be represented by the sensors in the input information of the multidimensional sensor, and the temporal dimension focuses on the importance degree of degradation information of different sensors in different time windows. Thus, it is possible to dynamically emphasize the distinguishable information and suppress the unnecessary information.
Given an input vector x′, where x′ is the input data after time-window processing, the global information of each input sensor is integrated by a global average pooling (GAP) layer and a global maximum pooling (GMP) layer to obtain two different information descriptions v and m, respectively. The operation is as follows:
v=Gavgpool(x′) (2)
m=Gmaxpool(x′) (3)
wherein Gavgpool(·) and Gmaxpool(·) respectively represent the global average pooling operation and the global maximum pooling operation, and v and m are the feature maps after the global average pooling and global maximum pooling operations.
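The time-window processing and the two pooling descriptors of equations (2) and (3) can be sketched in a few lines of NumPy; the function names `sliding_windows` and `gap_gmp` below are illustrative, not part of the invention:

```python
import numpy as np

def sliding_windows(series, window):
    """Cut a (time, sensors) series into overlapping windows of length `window`."""
    return np.stack([series[i:i + window] for i in range(len(series) - window + 1)])

def gap_gmp(x):
    """Global average / max pooling over the time axis of one window.

    x has shape (window, sensors); v and m each have shape (sensors,).
    """
    v = x.mean(axis=0)   # Gavgpool: average description of each sensor, eq (2)
    m = x.max(axis=0)    # Gmaxpool: peak description of each sensor, eq (3)
    return v, m

series = np.arange(20, dtype=float).reshape(10, 2)  # 10 time steps, 2 sensors
windows = sliding_windows(series, window=4)
v, m = gap_gmp(windows[0])
```

Each window then carries both an average and a peak summary per sensor, which the spatial attention mechanism combines in the next step.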
Then, v and m are respectively sent into a fully-connected network (MLP) containing one hidden layer. To reduce the computational complexity, the number of neurons in each layer of the MLP is set as follows: the number of neurons of the first and third layers is equal to the number of channels of the input vector, and the number of neurons of the middle layer is equal to half the number of channels of the input vector, so as to capture the inter-channel relationship and estimate the information amount of each channel. The outputs of the two MLPs are then added element-wise and activated by a sigmoid activation function to obtain the weight α of the spatial attention, which is calculated as follows:
α = σ(W_12(W_11 v) ⊕ W_22(W_21 m)) (4)

where α is the weight of the spatial attention, σ represents the sigmoid activation function, and W_12, W_11, W_22, W_21 are the weights in the MLP; the biases are omitted for simplicity of notation.
Finally, the weight of the spatial attention is multiplied by the input vector to obtain the output x_ca of the spatial attention mechanism. The operation is as follows:

x_ca = α ⊗ x′ (5)

wherein x_ca represents the output of the spatial attention, α is the weight of the spatial attention, and x′ is the input vector.
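Equations (2) through (6) can be sketched end to end as follows. This is a hedged NumPy illustration with randomly initialized weights (the weight shapes follow the text: first and third MLP layers have C neurons, the hidden layer C/2); it is not the trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(x, W11, W12, W21, W22):
    """Eqs (2)-(6): pool, two bottleneck MLPs, sigmoid gate, rescale.

    x: (window, channels). Biases are omitted, as in the text.
    """
    v = x.mean(axis=0)                                   # GAP, shape (C,)
    m = x.max(axis=0)                                    # GMP, shape (C,)
    alpha = sigmoid(W12 @ (W11 @ v) + W22 @ (W21 @ m))   # eq (4), element-wise add
    return alpha * x                                     # eqs (5)/(6): broadcast over time

C = 4                                        # channels (sensors)
H = C // 2                                   # bottleneck: half the channels
W11, W21 = rng.standard_normal((H, C)), rng.standard_normal((H, C))
W12, W22 = rng.standard_normal((C, H)), rng.standard_normal((C, H))
x = rng.standard_normal((30, C))             # one time window
x_ca = spatial_attention(x, W11, W12, W21, W22)
```

Because α lies in (0, 1) per channel, the rescaled output never exceeds the input in magnitude; channels judged uninformative are attenuated rather than removed.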
In order to obtain a more refined feature expression, the output must also pass through a temporal attention mechanism: the output of the spatial attention mechanism is taken as the input of the temporal attention mechanism and passed through an adaptive convolution kernel of size f_ta to extract refined features and obtain the weight of the temporal attention mechanism, which operates as follows:
X ta =σ(W ta x ca +B ta ) (7)
wherein f_ta is the size of the adaptive convolution kernel and is related to the time window length s: it is generally considered that s and f_ta have some nonlinear relationship, such that the larger s is, the stronger the long-term interactions, and otherwise the stronger the short-term interactions. In classical kernel techniques, an exponential-family function (e.g., a Gaussian function) is often used as the kernel function when dealing with unknown mapping problems, so δ(s) is chosen as the kernel function and f_ta = |δ(s)|_odd, where |t|_odd denotes the nearest odd number to t. X_ta is the weight of the temporal attention mechanism, W_ta and B_ta are respectively the weight and the bias, and σ is the Hard sigmoid activation function.
Then, the weight of the temporal attention mechanism is multiplied by the output of the spatial attention mechanism to obtain the output x_ta of the temporal attention mechanism. It is calculated as follows:

x_ta = X_ta ⊗ x_ca (8)
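The mapping from window length s to an odd kernel size can be sketched as follows. The exact form of δ(s) is not given in the text, so this example assumes a logarithmic exponential-family mapping purely for illustration; only the nearest-odd snapping is taken directly from the description:

```python
import numpy as np

def nearest_odd(t):
    """|t|_odd: snap t to the nearest odd integer (minimum 1)."""
    k = int(round(t))
    if k % 2 == 0:                     # even: move toward t's side
        k += 1 if t >= k else -1
    return max(k, 1)

def adaptive_kernel_size(s, gamma=2.0):
    """Hypothetical delta(s): maps window length s to a kernel size,
    snapped to the nearest odd integer as the text requires."""
    return nearest_odd(np.log2(s) / gamma + 1.0)

sizes = [adaptive_kernel_size(s) for s in (20, 30, 40, 50)]
```

Odd kernel sizes keep the convolution centered on the current time step, which is why the snapping step exists.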
Cyclic convolution module: in the multi-channel/multi-scale adaptive attention cyclic convolution network, the cyclic convolution module is the core building block and is constructed from a plurality of cyclic convolution layers. Unlike an ordinary convolution layer, which only propagates information in one direction, a cyclic convolution layer adds a recursive connection between its output and its input to form a circulating flow of information: the output of the layer is fed back to its input through the recursive connection, so the layer can memorize information over time. This overcomes the deficiency of conventional convolution layers and allows the cyclic convolution layer to effectively model the time dependence of different degradation states, thereby improving the completeness of the information expression and the predictive capability of the model.
The output of the cyclic convolution layer depends not only on the current input but also on the previous storage state, which maintains historical information about all past inputs. This time-dynamic behavior enables the cyclic convolution layer to take full advantage of the information in the input time-series sensor data and fully account for the time dependence of different degradation states. For the kth cyclic convolution layer, its state h_t^k at time step t can be written as:

h_t^k = σ(W^k * x_t^k + U^k * h_{t-1}^k + b^k) (9)

wherein σ(·) is a nonlinear activation function such as sigmoid, tanh or ReLU, x_t^k is the input vector, i.e. the time-series sensor data input at the present moment, h_{t-1}^k is the memory state fed back by the recursive connection at time t-1, W^k and U^k are convolution kernels, * denotes the convolution operation, and b^k is a bias term.
Theoretically, the recursive connection enables the cyclic convolution layer to learn arbitrarily long-term dependence from the input sensor data. In practice, however, the cyclic convolution layer can only trace back a few time steps, because the problem of gradient vanishing or explosion often arises during training. Therefore, to mitigate the effects of gradient vanishing and explosion and to capture long-term dependencies, a gating mechanism is introduced into the cyclic convolution layer. As shown in fig. 3, there are two gates in the cyclic convolution layer, a reset gate r_t^k and an update gate z_t^k, calculated as follows:

r_t^k = σ(W_r^k * x_t^k + U_r^k * h_{t-1}^k + b_r^k) (10)
z_t^k = σ(W_z^k * x_t^k + U_z^k * h_{t-1}^k + b_z^k) (11)
wherein σ(·) is the logistic sigmoid activation function, * denotes the convolution operation, W_r^k, U_r^k, W_z^k, U_z^k are convolution kernels, and b_r^k, b_z^k are biases. At time step t, the state h_t^k of the cyclic convolution layer can then be written as:

h̃_t^k = tanh(W_h^k * x_t^k + U_h^k * (r_t^k ∘ h_{t-1}^k) + b_h^k) (12)
h_t^k = (1 − z_t^k) ∘ h_{t-1}^k + z_t^k ∘ h̃_t^k (13)

wherein ∘ denotes the Hadamard product, h̃_t^k is the newly generated candidate state, tanh(·) is the tanh activation function, W_h^k and U_h^k are convolution kernels, and b_h^k is a bias term. From the above equations it can be seen that the current state h_t^k is jointly determined by the previous state h_{t-1}^k and the current candidate state h̃_t^k, under the control of the reset gate and the update gate. By introducing the gating mechanism, the cyclic convolution layer gains the ability to forget or memorize previous and current information. On the one hand, the reset gate decides how much past information will be forgotten: if the reset gate r_t^k is near 0, the candidate state h̃_t^k ignores the previous state h_{t-1}^k and depends only on the current input x_t^k, which causes the network to forget previously irrelevant information and learn a more compact representation. On the other hand, the update gate z_t^k controls how much information the previous state passes to the current state, which helps the network remember long-term information and alleviates the gradient vanishing problem. Furthermore, since each feature map in the cyclic convolution layer has its own reset and update gates, it can adaptively capture dependencies on different time scales: if the reset gate is activated frequently, the corresponding feature map learns to capture short-term dependencies or focuses only on the current input; conversely, if the update gate is often active, it captures long-term dependencies.
Bayesian reasoning: the multi-channel/multi-scale adaptive attention cyclic convolution network described above can effectively learn the potential degradation-related key information in the multi-sensor input data while considering the time correlation of different degradation states. The outputs of its three parallel paths can be expressed as:
x z =concatenate(x z1 ,x z2 ,x z3 ) (14)
wherein x_z1, x_z2, x_z3 respectively represent the feature expressions learned on the three parallel paths of the multi-scale adaptive attention network, and concatenate(·) denotes the feature concatenation function. The features learned by the multi-channel/multi-scale adaptive attention cyclic convolution network are thus concatenated and then used as the input of the Bayesian inference network.
Bayesian inference network: the Bayesian inference network is composed of three fully-connected layers; its structure is shown in fig. 4.
Quantification of prediction uncertainty by variational inference (VI): quantification of uncertainty plays a very important role in both prediction and maintenance decision-making. In the invention, variational inference is used to quantify the uncertainty of the Bayesian multi-layer perceptron prediction network. The proposed Bayesian multi-layer perceptron prediction network can be regarded as a probabilistic model whose random variable W is subject to a Gaussian prior distribution. The random variable W consists of all the learnable network parameters, including the weights and biases of the neurons in each fully-connected layer, W = {(ω_L, b_L)}_{L=1}^M, wherein L denotes the Lth fully-connected layer and M denotes the total number of fully-connected layers. Given input data consisting of an input X and its corresponding output O, the posterior distribution of W can be obtained by Bayes' theorem, namely:

p(W|X, O) = p(O|X, W)p(W) / p(O|X) (15)

Based on the above, for a new input x*, the corresponding predictive distribution can be obtained by:

p(o*|x*, X, O) = ∫ p(o*|x*, W) p(W|X, O) dW (16)
because of the difficulty of posterior probability p (W/X, O), the invention introduces variational reasoning and utilizes variational distribution to approximate the posterior probability.
The variational distribution is a probability distribution that is easy to evaluate, facilitating further inference. First, an approximate variational distribution q(W) is defined that factorizes over the weights and biases, which can be written as:

q(W) = ∏_{L=1}^M q(ω_L) q(b_L) (17)

In this equation, the variational distribution of each weight is defined as a Gaussian mixture distribution with two components, and each bias follows a simple Gaussian distribution, where ω_L and b_L denote the weights and biases of the neurons in the fully-connected layer L respectively. Then:

q(ω_L) = π_L N(M_L, σ² I) + (1 − π_L) N(0, σ² I) (18)
q(b_L) = N(m_L, σ² I) (19)

wherein π_L ∈ [0,1] is a predetermined probability, M_L and m_L are respectively the variational parameters of the weight and the bias, σ² is a small variance, and τ is the model precision.
Then, the invention minimizes the KL divergence between the variational distribution and the posterior distribution, i.e. minimizes KL(q(W)||p(W|X, O)), to approximate the posterior distribution; the final predictive distribution can then be further estimated as:

p(o*|x*, X, O) ≈ ∫ p(o*|x*, W) q*(W) dW (20)

wherein q*(W) is the variational distribution that minimizes the KL divergence. The optimization goal of the network is to minimize the KL divergence, which is equivalent to maximizing the evidence lower bound. Thus, the objective function can be written as:

L_VI = ∫ q(W) log p(O|X, W) dW − KL(q(W)||p(W)) (21)
the first term in equation (21) can be evaluated using the Monte Carlo integral, which is performed as follows:
first, a normal distribution of standards is usedAnd Bernoulli distribution q (β) =Bernou 11i (pi) re-parameterizing the multiplicative function, ω, according to equations (18) and (19) L And b L Can be rewritten as two formulas:
based on equations (22) and (23), the objective function can be further rewritten as:
for equation (24), each of the integrand functions of the first term in the equation depends on a weight and a bias. Then, single sampling Monte Carlo integration is usedDe-estimating (24) the firstEach integral in the term results in an unbiased estimateThe objective function is further written as:
for the second term in equation (24), the L2 regularization based on can be approximated as:
thus, the objective function can be written as:
where E (·) is the loss function, which is the use of the Mean Square Error (MSE).
From the above analysis, sampling β̂_L ~ Bernoulli(π_L) in the Bayesian inference network is equivalent to randomly masking rows in each weight matrix during the forward pass, exactly as dropout does in a fully-connected layer. Further, the second term in equation (24) corresponds to adding an L2 regularization term to the weights and biases during network optimization. Therefore, by applying dropout with probability π together with L2 regularization with weight decay coefficient λ, quantification of the model uncertainty can be achieved. Finally, the objective function can be minimized with the Adam optimizer, and prediction is then achieved by Monte Carlo estimation.
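The row-masking interpretation above can be sketched as a small Monte Carlo dropout predictor. This is an illustrative NumPy example with random weights (`mc_dropout_predict` and the two-layer net are assumptions for demonstration), not the trained Bayesian inference network:

```python
import numpy as np

rng = np.random.default_rng(2)

def mc_dropout_predict(x, W1, W2, pi=0.5, T=200):
    """Monte Carlo estimate of the predictive mean and a confidence interval:
    T stochastic forward passes, each with rows of the first weight matrix
    masked out by Bernoulli dropout kept active at prediction time."""
    preds = []
    for _ in range(T):
        mask = rng.binomial(1, 1.0 - pi, size=W1.shape[0])  # row mask, layer 1
        h = np.maximum((np.diag(mask) @ W1) @ x, 0.0)       # masked layer + ReLU
        preds.append(W2 @ h)
    preds = np.asarray(preds)
    mean = preds.mean(axis=0)
    std = preds.std(axis=0)
    return mean, (mean - 1.96 * std, mean + 1.96 * std)     # ~95% CI

W1 = rng.standard_normal((8, 4))
W2 = rng.standard_normal((1, 8))
x = rng.standard_normal(4)
mean, (lo, hi) = mc_dropout_predict(x, W1, W2)
```

The spread of the T stochastic predictions gives the confidence interval; with dropout disabled the interval would collapse to a point estimate, which is exactly the over-confidence the Bayesian treatment avoids.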
Numerical experiment: in this section, the performance of a Bayesian inference based multichannel attention cycle convolutional network was evaluated with full lifecycle data of the aircraft engine from start of operation to complete failure. To reduce the effect of randomness, all experiments were performed 10 times and the average was taken as the final result.
Introduction of the data set: the C-MAPSS data set consists of time-series data of aircraft engines operated from start to failure. The data set includes four sub-datasets, each in turn comprising a training set, a test set, and a remaining-life data set. Each engine in the training and test sets contains 21 sensor sequences and 3 operation-setting sequences. The number of engines in each sub-dataset differs, and the failure modes and operating conditions differ as well, so the four sub-datasets have different complexity. The FD001 and FD003 sub-datasets are less complex, with only one operating condition and 1 and 2 failure modes respectively. The data of FD002 and FD004 are more complex: both contain 6 operating conditions, and their failure modes are the same as FD001 and FD003 respectively. The number of samples of the data set is shown in table 1. Considering the time cost caused by data redundancy and the negative impact on the prediction results, the sensors with constant values in each subset are removed via preliminary screening before training.
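The preliminary screening step, dropping sensors whose readings never change, can be sketched as follows (the function name is illustrative):

```python
import numpy as np

def drop_constant_sensors(X):
    """Remove sensor columns whose value never changes over the whole series,
    since they carry no degradation information."""
    keep = X.std(axis=0) > 0.0
    return X[:, keep], np.flatnonzero(keep)

X = np.array([[1.0, 5.0, 0.1],
              [2.0, 5.0, 0.2],
              [3.0, 5.0, 0.3]])   # sensor 1 is constant
X_filtered, kept = drop_constant_sensors(X)
```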
TABLE 1C-MAPSS dataset
Table 2 input sensor sequence information
Table 3 input sample information
Evaluation index: the root mean square error (RMSE) and the scoring function (Score) are widely used to evaluate remaining-life prediction results; the smaller these two indexes are, the better the prediction performance of the model.

The root mean square error is specifically expressed as:

RMSE = sqrt((1/n) Σ_{i=1}^n d_i²)

where n represents the size of the test sample set and d_i represents the difference between the predicted value and the true value.

The scoring function is defined as:

Score = Σ_{i=1}^n s_i, with s_i = e^{−d_i/13} − 1 if d_i < 0 and s_i = e^{d_i/10} − 1 if d_i ≥ 0

so that late predictions (d_i ≥ 0) are penalized more heavily than early ones.
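The two evaluation indexes can be sketched as follows, assuming the asymmetric exponential form conventionally used with the C-MAPSS benchmark (penalty constants 13 for early and 10 for late predictions):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error over the n test samples, d_i = y_pred - y_true."""
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))

def score(y_true, y_pred):
    """C-MAPSS scoring function: late predictions (d_i >= 0) are
    penalized more heavily than early ones (d_i < 0)."""
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sum(np.where(d < 0,
                                 np.exp(-d / 13.0) - 1.0,
                                 np.exp(d / 10.0) - 1.0)))

r = rmse([100, 90], [100, 90])
s = score([100, 90], [100, 90])
```

A perfect prediction yields both indexes equal to zero, and predicting 10 cycles too late costs more Score than predicting 10 cycles too early.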
model parameters and prediction effect: the bayesian inference network model training set up is as follows: setting the model to train 200 times, setting the learning rate of the first 80 times to be 0.001, setting the learning rate of the last 120 times to be 0.0001, and stopping training if the model has no effect improvement in 10 epochs continuously. The model was optimized using Adam optimizer, and the other layer parameters are shown in table 4:
table 4 super-parameters of bayesian inference network model
Fig. 5 shows a convergence process of the loss function of the bayesian inference network on FD001, and it can be seen that the bayesian inference network has a relatively high convergence speed, and tends to be stable at the 100 th epoch until the final convergence.
Fig. 6 shows the remaining-life prediction results of the Bayesian inference network model on FD001, and fig. 7 shows the degradation process of the turbofan engine learned by the model, where the 34th engine is selected for separate visualization. It can be seen that the model predicts the RUL together with the corresponding confidence interval (CI). In the early stage, before the equipment degrades, the error between the predicted value and the true value is larger, the predicted curve fluctuates more, and the CI is wider; but when the equipment is about to degrade, the model tracks the degradation trend in time, the prediction stabilizes in the later stage, and the CI becomes narrow.
Effect of time window size: fig. 8 shows the effect of different time window sizes on the model prediction results when the time windows on the four sub-datasets are set to 20, 30, 40 and 50 respectively. Here, only the size of the time window is changed; the other model parameters remain unchanged. As can be seen from fig. 8, the prediction accuracy of the model on each data set changes as the time window grows. For FD001, after the time window increases beyond 30, both RMSE and Score show an increasing trend, indicating that the most suitable time window size for subset FD001 is 30. For subset FD002, RMSE decreases with increasing time window, but Score gradually increases, so a time window size of 20 is chosen. For subset FD003, both RMSE and Score are minimal when the time window size is 20, so subset FD003 uses a time window size of 20. For subset FD004, RMSE and Score take their minimum values at a time window size of 30, indicating that the model prediction is best there, so subset FD004 uses a time window size of 30. Thus, the time window sizes for the four sub-datasets are 30, 20, 20 and 30, respectively.
Influence of the number of adaptive attention mechanisms: the invention also studies the influence of the number of adaptive attention mechanisms on the prediction results of the model; fig. 9 shows the prediction results when the number of adaptive attention mechanisms is 1, 2, 3 and 4 respectively. It can be seen from the figure that for subsets FD001 and FD003, RMSE and Score are both minimal when the number of adaptive attention mechanisms is 2, and the model achieves its best performance. For subsets FD002 and FD004, RMSE and Score are minimal when the number is 3, giving the best predictive performance. Thus, the model uses 2 adaptive attention mechanisms on FD001 and FD003 and 3 on FD002 and FD004.
Influence of the number of cyclic convolution layers in the cyclic convolution module on the prediction result: the best prediction performance can be obtained by appropriately changing the number of cyclic convolution layers in the cyclic convolution module. Fig. 10 shows the influence of using 1, 2 and 3 cyclic convolution layers on the prediction result. As can be seen from fig. 10, when the number of cyclic convolution layers in the cyclic convolution module is 2, RMSE and Score on the four subsets are minimal and the model obtains the best prediction performance.
Comparison with other advanced models:
comparison of table 5 with root mean square error for some advanced prediction methods
Comparison of Table 6 with some advanced prediction method scoring function values
Here, tables 5 and 6 show the numerical evaluation results of four advanced methods on the C-MAPSS data set compared with the method proposed by the invention. It can be observed that the proposed method improves on all subsets under both the RMSE error criterion and the Score evaluation criterion. In addition, unlike the other methods, the proposed method gives a measure of prediction uncertainty, which to a certain extent alleviates the over-confidence problem of deep learning in predicting the remaining life of mechanical equipment.
The foregoing description of the preferred embodiment of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (4)

1. The method for predicting the residual life of the mechanical equipment is characterized by comprising the following specific steps:
1. preprocessing input data of multiple sensors;
2. The processed data is used as input data of a multi-channel/multi-scale self-adaptive attention cyclic convolution network, the self-adaptive attention mechanism is embedded into the multi-channel/multi-scale cyclic convolution network, the self-adaptive attention mechanism is a dynamic mechanism for dynamically emphasizing degradation-related distinguishing information, redundant information is restrained, the multi-channel/multi-scale cyclic convolution models time correlation of different degradation states, the input data of a plurality of sensors are fused, the integrity of feature information expression is enhanced, and the uncertainty of the data is reduced;
3. taking the output data of step two as the input of a Bayesian inference network, minimizing the objective function with an Adam optimizer, and realizing the prediction of the remaining life by a Monte Carlo estimation method, thereby achieving quantification of uncertainty while improving prediction accuracy.
2. The method for predicting remaining life of a mechanical device according to claim 1, wherein in the first step, a specific preprocessing step is as follows:
acquisition of signal x= { x with multiple sensors 1 ,x 2 ,…,x N As input to the network, scaling the raw data to [0,1 ] using maximum-minimum normalization]The specific expression is as follows:
Wherein x is i For the data contained in the ith sensor, max (x i ) Is x i Maximum value of (x), min (x i ) Is thatx i Is the minimum of (2);
the normalized sensor data is processed into input data via a time window.
3. The method for predicting the remaining life of a mechanical device according to claim 1, wherein in the second step, the specific process is as follows: establishing a multi-channel/multi-scale self-adaptive attention cyclic convolution network, wherein the network consists of three parallel channels, each channel is formed by embedding a self-adaptive attention mechanism into a cyclic convolution module, the cyclic convolution layers in different channels respectively select convolution kernels with different sizes to extract local and global characteristics, the self-adaptive attention mechanism consists of a spatial attention mechanism and a temporal attention mechanism, and the characteristics learned by the three paths are connected to serve as the input of a Bayesian inference network;
given an input vector x ', the input vector x' is input into a global average pooling layer and a global maximum pooling layer to integrate global information of each input sensor, and two different information descriptions v and m are respectively obtained, wherein the specific operation is as follows:
v=Gavgpool(x′) (2)
m=Gmaxpool(x′) (3)
wherein x′ is the input data after time-window processing, Gavgpool(·) is the global average pooling operation, Gmaxpool(·) is the global maximum pooling operation, v is the feature map after the global average pooling operation, and m is the feature map after the global maximum pooling operation;
v and m are respectively sent into a fully-connected network containing one hidden layer, and the number of neurons in each layer is set as follows: the number of neurons of the first and third layers is equal to the number of channels of the input vector, and the number of neurons of the middle layer is equal to half the number of channels of the input vector, so as to capture the inter-channel relationship and estimate the information amount of each channel; the outputs of the two fully-connected networks are added element-wise and then activated by a sigmoid activation function to obtain the weight α of the spatial attention, specifically calculated as follows:
α=σ(W 12 (W 11 v)⊕W 22 (W 21 m)) (4)
wherein alpha is the weight of the spatial attention, sigma is the sigmoid activation function, W 12 ,W 11 ,W 22 ,W 21 Weights in the full connection layer respectively;
multiplying the weight of the spatial attention by the input vector to obtain the output x_ca of the spatial attention mechanism, the specific operation being:

x_ca = α ⊗ x′ (5)

wherein x_ca is the output of the spatial attention, α is the weight of the spatial attention, and x′ is the input vector;
the output of the spatial attention mechanism serves as the input of the temporal attention mechanism and passes through an adaptive convolution kernel of size f_ta to extract refined features and obtain the weight of the temporal attention mechanism, the kernel size being taken as:

f_ta = |δ(s)|_odd (6)

and the specific operation being as follows:
X ta =σ(W ta x ca +B ta ) (7)
wherein f_ta is the size of the adaptive convolution kernel, which is related to the time window length s, δ(s) is selected as the kernel function, |t|_odd denotes the nearest odd number, X_ta is the weight of the temporal attention mechanism, W_ta is a weight vector, B_ta is a bias, and σ is the Hard sigmoid activation function;
multiplying the weight of the temporal attention mechanism by the output of the spatial attention mechanism to obtain the output x_ta of the temporal attention mechanism, specifically calculated as follows:

x_ta = X_ta ⊗ x_ca (8)
the cyclic convolution layer adds a recursive connection between the output and the input to form a circulating flow of information, and the output of the cyclic convolution layer is fed back to its input through the recursive connection, so that the cyclic convolution layer can memorize information over time;
for the state h_t^k of the kth cyclic convolution layer at time step t, the expression is as follows:

h_t^k = σ(W^k * x_t^k + U^k * h_{t-1}^k + b^k) (9)

wherein σ(·) is a nonlinear activation function, x_t^k is the input vector, i.e. the time-series sensor data input at the current moment, h_{t-1}^k is the storage state fed back by the loop connection at time t-1, W^k and U^k are convolution kernels, * is the convolution operation, and b^k is a bias term;
a gating mechanism is introduced into the cyclic convolution layer, and two gates are arranged in the cyclic convolution layer, a reset gate r_t^k and an update gate z_t^k, specifically calculated as follows:

r_t^k = σ(W_r^k * x_t^k + U_r^k * h_{t-1}^k + b_r^k) (10)
z_t^k = σ(W_z^k * x_t^k + U_z^k * h_{t-1}^k + b_z^k) (11)

wherein σ(·) is the logistic sigmoid activation function, * is the convolution operation, W_r^k, U_r^k, W_z^k, U_z^k are convolution kernels, and b_r^k, b_z^k are biases; when the time step is t, the state h_t^k of the cyclic convolution layer is expressed as follows:

h̃_t^k = tanh(W_h^k * x_t^k + U_h^k * (r_t^k ∘ h_{t-1}^k) + b_h^k) (12)
h_t^k = (1 − z_t^k) ∘ h_{t-1}^k + z_t^k ∘ h̃_t^k (13)

wherein ∘ is the Hadamard product, h̃_t^k is the newly generated candidate state, tanh(·) is the tanh activation function, W_h^k and U_h^k are convolution kernels, and b_h^k is a bias term.
4. The method for predicting the remaining life of a mechanical device according to claim 1, wherein in the third step, the specific process is as follows:
the outputs on three parallel paths on a multi-channel/multi-scale adaptive attention-circulating convolutional network are expressed as:
x z =concatenate(x z1 ,x z2 ,x z3 ) (14)
the Bayesian inference network is composed of three fully-connected layers and is a probabilistic model whose random variable W obeys a Gaussian prior distribution, W comprising the weights and biases of the neurons in each fully-connected layer, W = {(ω_L, b_L)}_{L=1}^M, wherein L represents the Lth fully-connected layer and M represents the total number of fully-connected layers; given input data consisting of an input X and its corresponding output O, the posterior distribution of W is obtained by Bayes' theorem, namely:

p(W|X, O) = p(O|X, W)p(W) / p(O|X) (15)

based on the above, for a new input x*, the corresponding predictive distribution is obtained by:

p(o*|x*, X, O) = ∫ p(o*|x*, W) p(W|X, O) dW (16)
an approximate variational distribution q(W) is defined to factorize the weights and biases, specifically expressed as follows:

q(W) = ∏_{L=1}^M q(ω_L) q(b_L) (17)

in the above equation, the variational distribution of each weight is defined as a Gaussian mixture distribution having two components, and each bias follows a simple Gaussian distribution, ω_L and b_L representing the weights and biases of the neurons in the fully-connected layer L respectively; then:

q(ω_L) = π_L N(M_L, σ² I) + (1 − π_L) N(0, σ² I) (18)
q(b_L) = N(m_L, σ² I) (19)

wherein π_L ∈ [0,1] is a predetermined probability, M_L and m_L are respectively the variational parameters of the weight and the bias, σ² is a small variance, and τ is the model precision;
minimizing the KL divergence between the variational distribution and the posterior distribution, i.e. minimizing KL(q(W)||p(W|X, O)), approximates the posterior distribution, and the final predictive distribution is further estimated as:

p(o*|x*, X, O) ≈ ∫ p(o*|x*, W) q*(W) dW (20)

wherein q*(W) is the variational distribution minimizing the KL divergence; the optimization objective of the network is to minimize the KL divergence, equivalent to maximizing the evidence lower bound, and the objective function is specifically:

L_VI = ∫ q(W) log p(O|X, W) dW − KL(q(W)||p(W)) (21)
the first term in equation (21) is evaluated using Monte Carlo integration, performed as follows: first, using a standard normal distribution q(ε) = N(0, I) and a Bernoulli distribution q(β) = Bernoulli(π), the variational distribution is re-parameterized, and according to equations (18) and (19), ω_L and b_L are rewritten as:

ω_L = M_L diag(β_L) + σ ε_L (22)
b_L = m_L + σ ε_L (23)
based on equations (22) and (23), the objective function can be further rewritten as:

L_VI = Σ_{i=1}^N ∫ q(ε, β) log p(o_i|x_i, W(ε, β)) dε dβ − KL(q(W)||p(W)) (24)

for equation (24), each integrand of the first term depends on a weight and a bias; single-sample Monte Carlo integration with (ε̂_i, β̂_i) ~ q(ε, β) is then used to estimate each integral in the first term of equation (24), obtaining the unbiased estimate log p(o_i|x_i, W(ε̂_i, β̂_i)); the objective function is further expressed as:

L_VI = Σ_{i=1}^N log p(o_i|x_i, W(ε̂_i, β̂_i)) − KL(q(W)||p(W)) (25)
for the second term in equation (24), the KL divergence can be approximated based on L2 regularization as:

KL(q(W)||p(W)) ≈ Σ_{L=1}^M (π_L/2 ||M_L||² + 1/2 ||m_L||²) (26)

thus, the objective function to be minimized is expressed as follows:

L = (1/N) Σ_{i=1}^N E(o_i, ô_i) + λ Σ_{L=1}^M (||M_L||² + ||m_L||²) (27)

wherein E(·) is the loss function;
from the above analysis, sampling β̂_L ~ Bernoulli(π_L) in the Bayesian inference network is equivalent to randomly masking rows in each weight matrix during the forward pass, as dropout does in the fully-connected layer; further, the second term in equation (24) corresponds to adding an L2 regularization term to the weights and biases during network optimization; then, dropout with probability π and L2 regularization with weight decay coefficient λ are applied to realize the quantification of the model uncertainty; finally, the objective function is minimized with an optimizer, and the prediction of the remaining life is achieved by Monte Carlo estimation.
CN202310409193.5A 2023-04-14 2023-04-14 Method for predicting residual life of mechanical equipment Pending CN116579233A (en)


Publications (1)

Publication Number Publication Date
CN116579233A true CN116579233A (en) 2023-08-11



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination