CN115510757A - Design method for long-time sequence prediction based on gated convolution and time attention mechanism - Google Patents

Design method for long-time sequence prediction based on gated convolution and time attention mechanism

Info

Publication number
CN115510757A
CN115510757A (application number CN202211250328.XA)
Authority
CN
China
Prior art keywords
time
model
long
layer
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211250328.XA
Other languages
Chinese (zh)
Inventor
郑洪源
卢灿尧
陆梦俊
翟象平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202211250328.XA priority Critical patent/CN115510757A/en
Publication of CN115510757A publication Critical patent/CN115510757A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2119/00 Details relating to the type or aim of the analysis or the optimisation
    • G06F 2119/02 Reliability analysis or reliability optimisation; Failure analysis, e.g. worst case scenario performance, failure mode and effects analysis [FMEA]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

For long-sequence time-series prediction, the invention provides a model that can effectively and accurately capture the long-range dependency coupling the output and the input of a time series. The model is called GCTAM (Gated Convolution Temporal Attention Mechanism) and improves the Informer with Gated Convolution and a Temporal Attention Mechanism. Two main improvements are made: (1) the proposed gated convolution makes good use of temporal information and automatically routes results based on it; (2) the proposed temporal attention mechanism effectively filters low-frequency noise. Experiments on multiple real datasets show that the method improves the expressive power of the Informer model, enhances its noise-filtering capability, and maintains high prediction accuracy in long-time-sequence prediction.

Description

Design method for long-time sequence prediction based on gated convolution and time attention mechanism
Technical Field
The invention relates to the field of long-time-sequence prediction based on gated convolution and a temporal attention mechanism. The proposed GCTAM model is mainly used to improve the expressive power of the Informer model, enhance its noise-filtering capability, and improve its prediction accuracy for long time sequences.
Background
Long-time-sequence prediction is an important research topic in fields such as weather, energy consumption, financial indexes, retail, medical monitoring, anomaly detection and traffic prediction. In recent years, with the continuous development of deep learning, the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN) and Transformer have shown good predictive performance and have been successfully applied in many large-scale real-world applications, including long-sequence time-series forecasting.
Existing RNN methods are still limited on long time series: as the sequence length increases, the gradients produced by the RNN become smaller (vanishing gradients) or larger (exploding gradients), so the long-term dependencies in time-series data cannot be learned well. With the rise of modern deep learning, LSTM employs gated structures to control the information flow and mitigate vanishing or exploding gradients. The gated structure captures long-term memory well, but still does not completely solve the vanishing-gradient problem. Subsequently, CNNs were applied to time-series prediction. CNNs have great potential for sequence modeling, even outperform RNNs on many tasks, avoid the common problems of RNNs such as exploding/vanishing gradients and poor long-term memory, and support parallel computation more efficiently than RNNs. However, compared with the Transformer, which can capture dependencies of arbitrary length, there is still much room for improvement.
The Transformer-based Informer solves the vanishing-gradient and memory-constraint problems in long-sequence time-series prediction, but it only performs coarse-grained feature extraction on the time series and does not explicitly use temporal information. When the time series is too long, all parameters are shared, so the generated noise is output directly through the fully connected layer. There has been prior work on improving the expressive power of models and enabling explicit analysis of the features of time-series samples. The Microsoft AI Cognitive Services team proposed dynamic convolution: compared with traditional static convolution (a single convolution kernel per layer), dynamically aggregating multiple convolution kernels according to attention not only significantly improves expressive power but also keeps the computational cost low. It is friendly to efficient CNNs and can easily be integrated into existing CNN architectures.
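As an illustration of the dynamic convolution idea described above, a minimal PyTorch-style sketch is given below; the module name DynamicConv1d, the number of candidate kernels and the attention branch are assumptions made for exposition, not the implementation of the cited work.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv1d(nn.Module):
    # K candidate kernels are aggregated with input-dependent attention
    # weights before the convolution is applied (one aggregated kernel per
    # sample), instead of using a single static kernel per layer.
    def __init__(self, channels, kernel_size=3, num_kernels=4):
        super().__init__()
        self.num_kernels = num_kernels
        self.kernel_size = kernel_size
        # candidate kernels, shape (K, C_out, C_in, k)
        self.weight = nn.Parameter(
            torch.randn(num_kernels, channels, channels, kernel_size) * 0.02)
        # attention over kernels computed from a pooled summary of the input
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(channels, num_kernels))

    def forward(self, x):                                   # x: (B, C, L)
        B, C, L = x.shape
        a = F.softmax(self.attn(x), dim=-1)                 # (B, K) kernel attention
        w = torch.einsum('bk,kois->bois', a, self.weight)   # per-sample aggregated kernel
        x = x.reshape(1, B * C, L)                          # grouped-conv trick:
        w = w.reshape(B * C, C, self.kernel_size)           # one group per sample
        out = F.conv1d(x, w, padding=self.kernel_size // 2, groups=B)
        return out.reshape(B, C, L)

For example, DynamicConv1d(channels=64) can be applied to a tensor of shape (batch, 64, length) in place of an ordinary nn.Conv1d with the same channel width.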
Disclosure of Invention
The invention aims to: conventional deep-network time-series prediction models suffer from exploding and vanishing gradients, cannot learn the long-term dependencies of time-series data well, and, when the time series is too long, output the generated noise directly through the fully connected layer. Therefore, how to deal with vanishing or exploding gradients, capture long-term memory, filter noise and improve prediction accuracy in long-term sequence prediction through the design of a deep neural network model becomes the main technical problem.
In order to solve the technical problem, the invention provides a design method for long-time-sequence prediction based on gated convolution and a temporal attention mechanism, which can improve the expressive power of the model, enhance its noise-filtering capability and improve its prediction accuracy.
The technical scheme is as follows: in order to achieve the technical effects, the technical scheme provided by the invention is as follows:
a design method of long time sequence prediction based on a gated convolution and time attention mechanism is characterized in that,
(1) A gating network mechanism is implemented to capture the long-term memory that is ignored in the Informer distilling layer; time embedding is introduced into the gating, temporal features are classified and automatically routed in a fine-grained manner, and the more relevant features in the time series are highlighted and encoded together.
(2) A Gated Convolutional Network combining Mixture-of-Experts (MoE) and Dynamic Convolution is added to the distilling layer of the original Informer model.
(3) A temporal attention mechanism based on dual attention gates (Dual Attention Gates) is proposed. The mechanism filters noise from the output of the model's fully connected layer and improves the prediction accuracy of the model.
Further, the primary objective of the Informer in step (1) is to solve the problem of continuous prediction over long sequences: it must not only complete the characterization of long-sequence input but also establish the link between long-sequence output and long-sequence input. As an improved Transformer, the Informer keeps the encoder-decoder structure but replaces self-attention with ProbSparse self-attention, a new attention mechanism that reduces the amount of computation. In addition, a self-attention distilling technique is adopted, which reduces the dimension and the number of network parameters and thus the space complexity. In the long-sequence prediction problem, global information such as hierarchical timestamps (week, month, year) and agnostic timestamps (holidays, events) needs to be obtained. Therefore, compared with the Transformer, the Informer also improves the data embedding: value embedding, position embedding and time embedding are summed together as the input of the encoder or decoder. After a gating network mechanism is added to the Informer, temporal features can be classified and automatically routed in a fine-grained manner, the more relevant features in the time series are highlighted and encoded together, and the expressive power of the model is improved.
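As an illustration of the data embedding described above, a minimal sketch of summing value, position and time embeddings is given below; the concrete layer choices (a Conv1d value embedding, a learned positional embedding, a linear projection of the time features) and the number of time features are assumptions made for exposition.

import torch
import torch.nn as nn

class DataEmbedding(nn.Module):
    # Value embedding + position embedding + time embedding, summed and
    # used as the input of the encoder or decoder.
    def __init__(self, c_in, d_model, n_time_feats=4, max_len=5000):
        super().__init__()
        self.value_emb = nn.Conv1d(c_in, d_model, kernel_size=3, padding=1)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.time_emb = nn.Linear(n_time_feats, d_model)  # e.g. month, day, weekday, hour

    def forward(self, x, x_mark):              # x: (B, L, c_in), x_mark: (B, L, n_time_feats)
        B, L, _ = x.shape
        val = self.value_emb(x.transpose(1, 2)).transpose(1, 2)           # (B, L, d_model)
        pos = self.pos_emb(torch.arange(L, device=x.device))[None, :, :]  # (1, L, d_model)
        tim = self.time_emb(x_mark)                                       # (B, L, d_model)
        return val + pos + tim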
Further, the gated convolutional network in step (2) is a model of the gated convolutional network that combines MoE and Dynamic Convolution.
Further, the original MoE model can be expressed as equation (1):

y = Σ_{i=1}^{n} g(x)_i f_i(x)    (1)

where g(x)_i represents the weight assigned to the i-th expert by the gating network, i.e., the corresponding gating score; f_i, i = 1, ..., n denote the n expert networks, and g denotes the gating network that integrates the results of the several experts. The multiple expert networks in MoE each fit the part of the training data they are good at fitting, which is somewhat equivalent to a local interpolation method: different subsets of the data are fitted by different local models (experts), and the gating acts as a controller of the weights.
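A minimal sketch of equation (1) follows; the linear experts and the softmax gate are illustrative assumptions, since the form of the expert networks f_i and of the gating network g is not fixed above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfExperts(nn.Module):
    # y = sum_i g(x)_i * f_i(x): n expert networks combined by a gating network.
    def __init__(self, d_in, d_out, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(d_in, d_out) for _ in range(n_experts)])
        self.gate = nn.Linear(d_in, n_experts)

    def forward(self, x):                                  # x: (B, d_in)
        g = F.softmax(self.gate(x), dim=-1)                # (B, n) weights g(x)_i
        f = torch.stack([e(x) for e in self.experts], 1)   # (B, n, d_out) outputs f_i(x)
        return (g.unsqueeze(-1) * f).sum(dim=1)            # weighted sum over experts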
Further, the gated convolution model proposed by the present invention can be expressed as equations (2)-(5):

X_{j+1}^t = MaxPool( ELU( DynRoute( [X_j^t]_AB ) ) )    (2)

where DynRoute(·) replaces the original Conv1d(·), and [·]_AB denotes the operation inside the Attention Block; in the original Informer, Conv1d(·) performs a one-dimensional convolution (kernel width 3) along the time dimension with an ELU activation function, and MaxPool is a max-pooling layer with stride 2; after stacking one layer, X^t is thus down-sampled to half its length, the distilling proceeding from the j-th layer to the (j+1)-th layer.

U_k = Conv1d_k( [X_j^t]_AB ),  k = 1, ..., K    (3)

where U_k is the output of the k-th one-dimensional convolutional layer and K is the number of parallel convolutional layers.

a = g(E)    (4)

where a is the vector of attention coefficients of the experts, g is the gating network and E is the introduced temporal embedding layer (temporal embedding).

DynRoute( [X_j^t]_AB ) = Σ_{k=1}^{K} a_k U_k    (5)

where DynRoute gives the result of dynamic routing, i.e., the output of each one-dimensional convolutional layer is multiplied by the attention coefficient of the corresponding expert and then summed.
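A minimal sketch of the gated-convolution distilling step in equations (2)-(5) follows; the way the gate maps the temporal embedding E to the expert coefficients a, and all layer sizes, are assumptions made for exposition.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvDistill(nn.Module):
    # K parallel Conv1d "experts" produce outputs U_k (eq. 3); a time-aware
    # gate derived from the temporal embedding E yields coefficients a (eq. 4);
    # DynRoute returns sum_k a_k * U_k (eq. 5); ELU and stride-2 max pooling
    # then halve the sequence length as in the distilling operation (eq. 2).
    def __init__(self, d_model, d_time, n_experts=4, kernel_size=3):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)
             for _ in range(n_experts)])
        self.gate = nn.Linear(d_time, n_experts)      # time-aware gate over experts
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x, time_emb):                   # x: (B, L, d_model), time_emb: (B, L, d_time)
        h = x.transpose(1, 2)                         # (B, d_model, L)
        U = torch.stack([conv(h) for conv in self.convs], dim=1)   # (B, K, d_model, L)
        a = F.softmax(self.gate(time_emb.mean(dim=1)), dim=-1)     # (B, K)
        routed = (a[:, :, None, None] * U).sum(dim=1)              # DynRoute output
        out = self.pool(F.elu(routed))                # MaxPool(ELU(DynRoute(.)))
        return out.transpose(1, 2)                    # length roughly halved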
Furthermore, in step (3), dual attention gates (Dual Attention Gates) are added after the original fully connected layer of the Informer, and a temporal attention mechanism is used to re-encode the output of the decoding layer (decoder), which filters the output of the model and improves its performance. The temporal attention mechanism proposed by the invention is shown in equations (6)-(8):

α = tanh( W_α h + b_α )    (6)

β = Tanhshrink( W_β h + b_β )    (7)

[equation (8), defining the model output ŷ, not reproduced]

where h is the output vector of the decoding layer; α and β are attention parameters learned through fully connected layers, and W_α, b_α, W_β, b_β are parameters that the model learns and updates during training. The attention parameters α and β are computed from the output vector through the activation functions tanh(·) and Tanhshrink(·); based on them, key information in the time series can be captured more effectively, thereby filtering out noise in the fully connected layer. ŷ in equation (8) is the output of the fully connected layer and the output of the whole model.
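A minimal sketch of the dual attention gates in equations (6)-(7) follows; because equation (8) is not reproduced above, the way α and β re-encode the decoder output before the final projection is an assumption made for exposition.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionGate(nn.Module):
    # alpha = tanh(W_a h + b_a), beta = Tanhshrink(W_b h + b_b); the gated
    # combination below (an assumed re-weighting of the decoder output h)
    # is then passed through the final fully connected layer.
    def __init__(self, d_model, d_out):
        super().__init__()
        self.w_alpha = nn.Linear(d_model, d_model)    # W_alpha, b_alpha
        self.w_beta = nn.Linear(d_model, d_model)     # W_beta, b_beta
        self.proj = nn.Linear(d_model, d_out)         # final fully connected layer

    def forward(self, h):                             # h: (B, L, d_model) decoder output
        alpha = torch.tanh(self.w_alpha(h))           # eq. (6)
        beta = F.tanhshrink(self.w_beta(h))           # eq. (7)
        gated = alpha * h + beta * h                  # assumed form of the re-encoding
        return self.proj(gated)                       # filtered model output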
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram illustrating the overall architecture of the original Informer network in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a Gated Convolution (Gated Convolution) network proposed in an embodiment of the present invention;
FIG. 4 is a schematic diagram of the overall network model in the embodiment of the present invention;
FIG. 5 is a single variable long-sequence time series prediction result graph for 4 data sets (5 cases) in an embodiment of the present invention;
FIG. 6 is a graph of the multi-variable long-sequence time-series prediction results for 4 data sets (5 cases) in the example of the present invention;
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The embodiment of the invention relates to a design method of long time sequence prediction based on a gated convolution and a time attention mechanism, which comprises the following steps as shown in figure 1:
(1) Acquiring data;
(2) Preprocessing data and constructing a data set of time series characteristics;
(3) Constructing a design method of a long-time sequence prediction model based on a gating convolution and time attention mechanism;
(4) Judging whether the noise filtering effect of the model is good or not and whether the prediction effect is accurate or not;
in step (1), the data acquisition for this experiment included 3 real datasets and 2 common reference datasets collected for the LSTF. 5 public data sets ECL, weather, ETTm1, ETTh2, and ETTh1, respectively.
In the step (2), the prediction processing is performed on the obtained data,
1) Where ETT (Electric Transformer Temperature) is the division of the data set into three according to the time granularity of the data, being on the order of 1 hour { ETTh1, ETTh2} and 15 minutes of ETTm1. The data is power transformer operation data collected by Beijing aerospace university, and comprises two sites and operation data of two continuous years, wherein the training set, the verification set and the test set are respectively data of 12 months, 4 months and 4 months;
2) Wherein an ECL (electric comfort Load) data set collects power consumption (Kwh) of 321 clients, converts data units into hourly power consumption due to data loss, and sets 'MT 320' as a target value, wherein a training set, a verification set and a test set are data of 15 months, 3 months and 4 months respectively
3) The Weather data set collects Weather data of 1600 areas in the united states in hours between 2010 and 2013 and 4 years. Each data point includes a target value and 11 climate characteristics. Wherein the training set, the verification set and the test set are respectively data of 28 months, 10 months and 10 months;
in step (3), the invention constructs a prediction model of long-time sequences based on gated convolution and time attention mechanism.
1) Based on the network architecture of the original Informer model (shown in fig. 2), a gated convolution network combining MoE and Dynamic Convolution is added to the distilling layer of the Informer model. The gated convolution model is shown in fig. 3. The gating network mechanism captures the long-term memory that is ignored in the Informer distilling layer, introduces Temporal Embedding into the gating, classifies and automatically routes temporal features in a fine-grained manner, highlights the more relevant features in the time series, and encodes them together.
2) The dual attention gates are added after the original fully connected layer of the Informer, and the output of the decoding layer is re-encoded using temporal attention, which filters the model output and improves model performance. The overall improved network structure is shown in fig. 4.
3) The invention compares against five time-series prediction methods: ARIMA, Prophet, LSTMa, LSTNet and DeepAR; to better explore the effect of GCTAM in long-sequence time-series prediction, the Transformer-based Informer and its variants Reformer and LogTrans (LogSparse self-attention) are also used.
4) For hyper-parameter tuning, the invention uses the Adam optimizer with an initial learning rate of 1e-4, decayed by a factor of two every epoch. The total number of epochs is 8 with an appropriate early-stopping strategy, and the batch size is set to 32 according to the recommended setting.
5) The input to each data set is normalized to zero mean. Under the LSTF setting, the prediction window size is gradually expanded, namely {1d,2d,7d,14d,30d,40d } in { ETTh, ECL, weather } and {6h,12h,24h,72h,168h } in ETTm.
6) The evaluation metrics of the model are MSE and MAE; the whole test set is rolled over with stride 1 for each prediction window (averaging over variables in the multivariate case), as in the sketch below.
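A minimal sketch of this rolling-window evaluation protocol is given below; the function name, the model_predict callable and the window arguments are placeholders, not part of the invention.

import numpy as np

def rolling_window_scores(series, model_predict, input_len, pred_len):
    # Roll over the whole test set with stride 1, predict each window, and
    # average MSE and MAE over all windows (and over variables in the
    # multivariate case).
    mse, mae, n = 0.0, 0.0, 0
    for start in range(len(series) - input_len - pred_len + 1):      # stride 1
        x = series[start: start + input_len]
        y_true = series[start + input_len: start + input_len + pred_len]
        y_pred = model_predict(x)                                    # (pred_len, n_vars)
        mse += np.mean((y_pred - y_true) ** 2)
        mae += np.mean(np.abs(y_pred - y_true))
        n += 1
    return mse / n, mae / n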
In step (4), the invention applies the model to univariate and multivariate long-sequence time series and performs verification on the datasets in step (2), respectively. The experimental results are shown in fig. 5 and fig. 6, which summarize the univariate/multivariate evaluation results of GCTAM compared with all the other models on the 5 datasets.
1) Univariate time-series prediction: each method performs prediction of a single variable over long time series. As can be seen from fig. 5: (1) the model GCTAM significantly improves the prediction accuracy on all datasets, and the prediction error rises smoothly and slowly as the prediction length increases, indicating that GCTAM succeeds in improving the prediction capability for the LSTF problem; (2) GCTAM reduces the MSE by 18.5%, 9.1% and 10.5% on average, outperforming its baseline model Informer, and is also superior to the other Transformer-based methods LogTrans and Reformer.
2) Multivariate time-series prediction: by adjusting the fully connected layer (FCN), the GCTAM proposed by the invention can easily be converted from univariate to multivariate prediction. From fig. 6 it is observed that the proposed model GCTAM performs better than the other methods on MSE, reducing the MSE by 3.0%, 3.1% and 5.6% on average compared with the baseline model Informer and outperforming the other models.
The GCTAM model provided by the invention enhances the noise-filtering capability of the model, maintains high prediction accuracy in long-time-sequence prediction, and improves the prediction capability for the LSTF problem.

Claims (2)

1. A design method of long time sequence prediction based on a gated convolution and a time attention mechanism is characterized by comprising the following steps:
(1) On the basis of the Informer model's ability to perform continuous prediction of long time sequences, the characterization of long-sequence input is completed and the link between long-sequence output and long-sequence input is established; Gated Convolution is added to the Informer network model, a gating network mechanism is realized to capture the long-term memory overlooked in the Informer distilling layer, temporal embedding is introduced into the gating, temporal features are classified and automatically routed in a fine-grained manner, and the more relevant features in the time series are highlighted and encoded together;
(2) A gated convolutional network combining Mixture-of-Experts (MoE) and Dynamic Convolution is used as the improved Informer distilling-layer structure:
1) The original MoE model can be expressed as equation (1):

y = Σ_{i=1}^{n} g(x)_i f_i(x)    (1)

where g(x)_i represents the weight assigned to the i-th expert by the gating network, i.e., the corresponding gating score; f_i, i = 1, ..., n denote the n expert networks, and g denotes the gating network that integrates the results of the several experts. The multiple expert networks in MoE each fit the part of the training data they are good at fitting, with different subsets of the data fitted by different local models (experts), the gating acting as a controller of the weights.
2) The distilling-layer structure in the Informer model is improved: a plurality of parallel one-dimensional convolutional layers (Conv1d) replace the original single convolutional layer, and the output of the Attention Block is gated through a time-aware gate; meanwhile, temporal information is introduced into the time-aware gate, i.e., temporal embedding replaces the original embedding. The gated convolution model proposed by the invention is shown in equations (2)-(5):

X_{j+1}^t = MaxPool( ELU( DynRoute( [X_j^t]_AB ) ) )    (2)

where DynRoute(·) replaces the original Conv1d(·), and [·]_AB denotes the operation inside the Attention Block; in the original Informer, Conv1d(·) performs a one-dimensional convolution (kernel width 3) along the time dimension with an ELU activation function, and MaxPool is a max-pooling layer with stride 2; after stacking one layer, X^t is down-sampled to half its length, the distilling proceeding from the j-th layer to the (j+1)-th layer;

U_k = Conv1d_k( [X_j^t]_AB ),  k = 1, ..., K    (3)

where U_k is the output of the k-th one-dimensional convolutional layer and K is the number of parallel convolutional layers;

a = g(E)    (4)

where a is the vector of attention coefficients of the experts, g is the gating network and E is the introduced temporal embedding layer (temporal embedding);

DynRoute( [X_j^t]_AB ) = Σ_{k=1}^{K} a_k U_k    (5)

where DynRoute gives the result of dynamic routing, i.e., the output of each one-dimensional convolutional layer is multiplied by the attention coefficient of the corresponding expert and then summed.
(3) Dual attention gates (Dual Attention Gates) are added after the original fully connected layer of the Informer, and a temporal attention mechanism is used to re-encode the output of the decoding layer (decoder), which filters the output of the model and improves its performance. The temporal attention mechanism proposed by the invention is shown in equations (6)-(8):

α = tanh( W_α h + b_α )    (6)

β = Tanhshrink( W_β h + b_β )    (7)

[equation (8), defining the model output ŷ, not reproduced]

where h is the output vector of the decoding layer; α and β are attention parameters learned through fully connected layers, and W_α, b_α, W_β, b_β are parameters that the model learns and updates during training. The attention parameters α and β are computed from the output vector through the activation functions tanh(·) and Tanhshrink(·); based on them, key information in the time series can be captured more effectively, thereby filtering out noise in the fully connected layer. ŷ in equation (8) is the output of the fully connected layer and the output of the whole model.
2. The design method for long-time-sequence prediction based on gated convolution and a temporal attention mechanism as claimed in claim 1, wherein a Gating Network mechanism is proposed to capture the long-term memory originally ignored in the Informer distilling layer, time embedding is introduced into the gating to classify and automatically route the temporal features in a fine-grained manner, the more relevant features in the time series are highlighted and encoded together, and a temporal attention mechanism based on dual attention gates is proposed to filter the output of the model's fully connected layer, thereby improving the expressive power of the Informer model, enhancing the noise-filtering capability of the model and improving the prediction accuracy of the model.
CN202211250328.XA 2022-10-12 2022-10-12 Design method for long-time sequence prediction based on gated convolution and time attention mechanism Pending CN115510757A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211250328.XA CN115510757A (en) 2022-10-12 2022-10-12 Design method for long-time sequence prediction based on gated convolution and time attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211250328.XA CN115510757A (en) 2022-10-12 2022-10-12 Design method for long-time sequence prediction based on gated convolution and time attention mechanism

Publications (1)

Publication Number Publication Date
CN115510757A true CN115510757A (en) 2022-12-23

Family

ID=84509898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211250328.XA Pending CN115510757A (en) 2022-10-12 2022-10-12 Design method for long-time sequence prediction based on gated convolution and time attention mechanism

Country Status (1)

Country Link
CN (1) CN115510757A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116245141A (en) * 2023-01-13 2023-06-09 清华大学 Transfer learning architecture, method, electronic device and storage medium


Similar Documents

Publication Publication Date Title
CN112241814B (en) Traffic prediction method based on reinforced space-time diagram neural network
Lin et al. An efficient deep reinforcement learning model for urban traffic control
CN111612243B (en) Traffic speed prediction method, system and storage medium
CN111210633A (en) Short-term traffic flow prediction method based on deep learning
CN113554466B (en) Short-term electricity consumption prediction model construction method, prediction method and device
CN108876044B (en) Online content popularity prediction method based on knowledge-enhanced neural network
CN112949828A (en) Graph convolution neural network traffic prediction method and system based on graph learning
CN113905391A (en) Ensemble learning network traffic prediction method, system, device, terminal, and medium
CN113298191B (en) User behavior identification method based on personalized semi-supervised online federal learning
CN112396234A (en) User side load probability prediction method based on time domain convolutional neural network
CN113112791A (en) Traffic flow prediction method based on sliding window long-and-short term memory network
CN114036850A (en) Runoff prediction method based on VECGM
CN114970774A (en) Intelligent transformer fault prediction method and device
Li et al. Multi-task spatio-temporal augmented net for industry equipment remaining useful life prediction
CN115755219A (en) Flood forecast error real-time correction method and system based on STGCN
CN115510757A (en) Design method for long-time sequence prediction based on gated convolution and time attention mechanism
CN114282443A (en) Residual service life prediction method based on MLP-LSTM supervised joint model
CN114444561A (en) PM2.5 prediction method based on CNNs-GRU fusion deep learning model
CN116596151A (en) Traffic flow prediction method and computing device based on time-space diagram attention
CN113673774A (en) Aero-engine remaining life prediction method based on self-encoder and time sequence convolution network
Wang et al. Fully-Connected Spatial-Temporal Graph for Multivariate Time-Series Data
Tu et al. Longer time span air pollution prediction: The attention and autoencoder hybrid learning model
Kim et al. A daily tourism demand prediction framework based on multi-head attention CNN: The case of the foreign entrant in South Korea
CN116525135B (en) Method for predicting epidemic situation development situation by space-time model based on meteorological factors
CN116258260A (en) Probability power load prediction method based on gating double convolution neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination