CN115310674A - Long-time sequence prediction method based on parallel neural network model LDformer - Google Patents

Long-time sequence prediction method based on parallel neural network model LDformer

Info

Publication number
CN115310674A
Authority
CN
China
Prior art keywords
attention
data
layer
prediction
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210834021.8A
Other languages
Chinese (zh)
Inventor
田冉
李新梅
马忠彧
刘颜星
王晶霞
王楚
王灏篷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest Normal University
Original Assignee
Northwest Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest Normal University filed Critical Northwest Normal University
Priority to CN202210834021.8A priority Critical patent/CN115310674A/en
Publication of CN115310674A publication Critical patent/CN115310674A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/06Electricity, gas or water supply
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

Long-time sequence prediction is an important problem with wide-ranging applications in many fields, such as stocks, traffic, and electric power. Existing time-series prediction methods suffer from high time complexity, large parameter counts, and low prediction accuracy, and are therefore not suitable for high-precision long-term prediction on real-world data. To address these problems, the invention proposes a parallel time-series prediction model, LDformer. First, the Informer framework is combined with an LSTM so that the deep features of the time series are fully considered. Second, a probability sparse attention mechanism combined with UniDrop is proposed, which reduces the risk of losing key connections in the sequence. Third, taking data stability and parameter count into account, features are extracted by one-dimensional convolution in the distillation operation. Experimental results at different prediction lengths on three power datasets (ETTm1, ETTh1 and ETTh2) show that the proposed method outperforms state-of-the-art baselines in long-time sequence prediction, and ablation experiments verify the effectiveness of the key component designs.

Description

Long-time sequence prediction method based on parallel neural network model LDformer
Technical Field
The invention relates to the field of power prediction, in particular to a long-time sequence prediction method based on a parallel neural network model LDformer.
Background
With the advent of the big-data era, data has penetrated every industry. Various sensors and applications continuously collect large-scale time series, such as sales of goods in retail stores and supermarkets, passenger flow in railway and aviation departments, urban traffic flow, load demand in the power sector, stock prices in finance, and weather conditions in meteorology. In power distribution, the grid must allocate electricity to different customer areas according to sequentially changing demand. Predicting the future demand of a particular area is difficult because it varies with factors such as weekday versus holiday, season, weather, and temperature. Existing time-series prediction methods are not suited to high-precision long-term prediction on real-world data, and any wrong prediction can have serious consequences. Because there is currently no effective way to predict future power usage, managers have to make decisions based on empirical values, which are typically much higher than actual demand; such conservative strategies waste power and accelerate equipment depreciation. Notably, the oil temperature of a transformer effectively reflects its working condition, so long-time sequence predictive modeling is the key to solving this problem. However, long-time sequence prediction still faces serious challenges: most time-series models are aimed at short-term prediction and do not transfer well to long sequences, where the volume of historical data is large, the computational complexity is high, and the accuracy requirements are strict, so good results have been hard to obtain. Research on long-term sequence prediction is therefore particularly important.
Most current research on time-series prediction focuses on short-term prediction with machine learning and deep learning. In machine learning, many researchers adopt ARIMA and SVM, but these models are better suited to stationary series, and real-world time-series data is almost never purely stationary, so their applicability is limited by the data characteristics and their generality is poor. A Bayesian Temporal Factorization (BTF) framework has also been proposed for modeling multidimensional time series in spatio-temporal data with missing values, but machine learning methods struggle to obtain accurate results on complex prediction problems. With the development of deep learning, researchers have found it more applicable to such problems, and in recent years the Transformer has been applied to long-term sequence prediction tasks in many fields. However, it is time-consuming and has a large number of parameters. Informer, an improved algorithm based on the Transformer, was therefore proposed, but its attention mechanism may lose some key connections in the sequence, and its prediction accuracy still needs improvement. The invention therefore builds further improvements on top of Informer.
Disclosure of Invention
The invention improves the prediction accuracy of long-time sequence prediction and overcomes the drawbacks of the traditional Transformer model, including high time complexity, large parameter count, slow running speed, and a tendency to lose key connections between sequence positions. It provides a long-time sequence prediction method based on a parallel neural network model, LDformer, which predicts the future from existing historical data.
The invention mainly comprises five parts: (1) determining the input and output of the model; (2) preprocessing the dataset; (3) determining the temporal characteristics of the data and encoding them; (4) constructing the parallel neural network model LDformer for long-time sequence prediction; (5) verifying the validity of the method.
The five parts are described as follows:
1. Determine the input and output of the model. A power dataset is input to the method, with each data point consisting of the target Oil Temperature (OT) and six external load values of different types: High UseFul Load (HUFL), High UseLess Load (HULL), Middle UseFul Load (MUFL), Middle UseLess Load (MULL), Low UseFul Load (LUFL) and Low UseLess Load (LULL). An appropriate training dataset is selected so that the six external load values are used to predict the target value OT. A small mini-batch of m samples {x_1, x_2, ..., x_m} of the six features {X^(1), X^(2), X^(3), X^(4), X^(5), X^(6)} is collected from the training set to predict the sequence of n target values OT, {ŷ_1, ŷ_2, ..., ŷ_n}.
2. Preprocess the dataset. Preprocessing mainly consists of normalization. Because the time-series data collected in power measurement contains abnormal values and considerable noise, standardization is used: centering the data indirectly reduces the influence of outliers and extreme values.
3. Determine the temporal characteristics of the data and encode them from multiple angles. Long-time sequence prediction modeling requires not only local timing information but also hierarchical timing information such as week, month, and year, as well as sudden timestamp information (events, holidays, and so on). The conventional self-attention mechanism is difficult to adapt directly and can cause a mismatch between queries and keys between the encoder and decoder, which ultimately degrades the prediction.
4. Construct the parallel neural network model LDformer for long-time sequence prediction. LDformer consists of an embedding layer (Embedding) considered from multiple angles, a long short-term memory network (LSTM), an Encoder, and a Decoder. The embedding layer works from three angles, performing data encoding, position encoding, and timestamp encoding, each expanded to a uniform dimension d_model. The LSTM receives the input data for feature extraction to obtain a deep representation of the time series. The data then enters the encoder, which adopts a multi-channel parallel structure to improve model robustness, uses probability sparse self-attention combined with the UniDrop technique to reduce the number of parameters and alleviate overfitting while accepting many long sequence inputs, and adds a distillation operation between encoder modules to reduce the redundant combinations of the value V in the encoder's feature map. The decoder accepts long sequence inputs and generates the output elements in a single forward prediction.
5. Verify the validity of the method. Experiments on a real power dataset and comparisons with other leading research show that the prediction accuracy of the method on the long-time sequence prediction problem is clearly higher than that of the compared methods, confirming that the improvements target the weaknesses of the baseline algorithm.
The detailed implementation steps adopted by the invention to realize the purpose are as follows:
step 1: and determining the input and output of the model according to the power data set, and selecting an appropriate proportion to divide the data set. Defining model inputs as six load characteristics and a target value { X (1) ,X (2) ,X (3) ,X (4) ,X (5) ,X (6) Y, wherein the six Load characteristics are "High usefull Load (HUFL)", "High UseLess Load (HULL)", "Middle usefull Load (MUFL)", "Middle UseLess Load (MULL)", "Low usefull Load (LUFL)", and "Low UseLess Load (LULL)", respectively. The target value is the Oil Temperature (OT).
Step 2: Data preprocessing. The input training dataset is first standardized. The standardization uses StandardScaler(), ensuring that every dimension has variance 1 and mean 0, so that the results are not dominated by feature values that are too large in certain dimensions. The conversion function is
x* = (x − μ) / σ
where μ is the mean of all sample data and σ is the standard deviation of all sample data. Then proceed to step 3.
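The following is a minimal sketch of this standardization step, assuming scikit-learn and NumPy; the function name and array arguments are illustrative and not taken from the patent.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def standardize(train_values: np.ndarray, test_values: np.ndarray):
    """Fit x* = (x - mu) / sigma on the training split, then reuse it on the test split."""
    scaler = StandardScaler()                        # per-column zero mean, unit variance
    train_scaled = scaler.fit_transform(train_values)
    test_scaled = scaler.transform(test_values)      # avoid leaking test statistics
    return train_scaled, test_scaled, scaler
```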
Step 3: The training dataset obtained in step 2 enters the embedding layer (Embedding). Long-time sequence prediction modeling requires not only local timing information but also hierarchical timing information. The invention therefore works from three angles, performing data encoding, position encoding, and timestamp encoding, each expanded to the uniform dimension d_model, and the three results are summed to obtain the final embedding. Step 3.1 is data encoding, step 3.2 is position encoding, and step 3.3 is timestamp encoding.
Step 3.1: Data encoding. The data embedding converts the data to the uniform dimension d_model using a one-dimensional convolution. The formula is as follows:
DE=conv1d(x) (1)
Step 3.2: Position encoding. The elements of the input sequence are processed together rather than one by one as in an RNN; this increases speed but ignores the ordering of the elements, so a position encoding is added, with the formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) (2)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) (3)
Step 3.3: Timestamp encoding. The timestamp encoding method includes month_embed, day_embed, weekday_embed, hour_embed and minute_embed. The dataset time slices used in the method are 15 minutes and 1 hour, so minute_embed and hour_embed are selected to obtain the timestamp encoding result.
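A sketch of this multi-angle embedding is given below, assuming PyTorch and an Informer-style layout (Conv1d value embedding, fixed sinusoidal position encoding, hour/minute timestamp embeddings); the class name, kernel size and embedding-table sizes are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class MultiAngleEmbedding(nn.Module):
    def __init__(self, c_in: int, d_model: int = 512, max_len: int = 5000):
        super().__init__()
        # Step 3.1: data encoding, DE = conv1d(x), mapping c_in -> d_model channels
        self.value_emb = nn.Conv1d(c_in, d_model, kernel_size=3, padding=1)
        # Step 3.2: fixed sinusoidal position encoding
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))
        # Step 3.3: timestamp encoding for the hour and 15-minute slices
        self.hour_emb = nn.Embedding(24, d_model)
        self.minute_emb = nn.Embedding(4, d_model)    # four 15-minute slices per hour

    def forward(self, x, hour_idx, minute_idx):
        # x: [batch, seq_len, c_in]; hour_idx, minute_idx: [batch, seq_len] integer marks
        de = self.value_emb(x.permute(0, 2, 1)).transpose(1, 2)      # data encoding
        pe = self.pe[:, : x.size(1)]                                 # position encoding
        te = self.hour_emb(hour_idx) + self.minute_emb(minute_idx)   # timestamp encoding
        return de + pe + te                                          # summed embedding, all d_model
```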
Step 4: Construct the parallel neural network model LDformer for long-time sequence prediction. After the data has been split and preprocessed and has passed through the embedding layer, the following steps are carried out:
Step 4.1: The LSTM receives the input data for feature extraction, obtaining a deep representation of the time series. The LSTM adds a gating mechanism (an input gate, a forget gate, and an output gate) to the recurrent neural network to decide what information is kept or discarded, which alleviates the vanishing- and exploding-gradient problems that ordinary recurrent neural networks face when training on long sequences. In short, the LSTM performs better than an ordinary RNN on longer sequences.
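A minimal sketch of this feature-extraction stage is shown below, assuming PyTorch; the class name and layer sizes are illustrative.

```python
import torch.nn as nn

class DeepFeatureLSTM(nn.Module):
    def __init__(self, d_model: int = 512, num_layers: int = 2):
        super().__init__()
        # Gated recurrence (input/forget/output gates) mitigates vanishing and
        # exploding gradients over long sequences.
        self.lstm = nn.LSTM(d_model, d_model, num_layers=num_layers, batch_first=True)

    def forward(self, x):          # x: [batch, seq_len, d_model] embedded sequence
        out, _ = self.lstm(x)      # deep temporal representation, same shape
        return out
```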
Step 4.2: Construct the encoder module. The encoder is designed to extract robust long-range dependencies from the time-series input. Its overall architecture is roughly the same as that of the Transformer: it mainly consists of two sublayers, a multi-head attention layer (using the probability sparse attention mechanism combined with UniDrop) and a feed-forward layer composed of two linear mappings; each sublayer is followed by a batch normalization layer, and there are skip connections between sublayers. The difference is that the encoder adopts a multi-channel parallel structure, with four channels of time-series data lengths L, L/2, L/4, and L/8 executed in parallel, and distillation operations are combined to improve model robustness. The distillation operation mainly uses a one-dimensional convolution to trim the dimensions and reduce memory usage before the output of one layer is sent to the multi-head attention module of the next layer; there is always one fewer distillation operation than encoder layers. The attention in the encoder uses the UniDrop-combined probability sparse self-attention mechanism constructed in the invention.
Step 4.2.1: The probability sparse self-attention mechanism combined with UniDrop. Considering the time complexity and the risk of losing key connections in the sequence, the invention proposes a probability sparse attention mechanism combined with UniDrop. The canonical self-attention mechanism maps a query (Q) and a set of key-value pairs (K, V) to an output, where Q, K, V and the output are vectors; the output is a weighted sum of V, with the weight assigned to each value computed by a compatibility function between Q and the corresponding K. The formula is as follows:
A(Q, K, V) = Softmax(QK^T / √d) V (4)
where Q ∈ R^(L_Q×d), K ∈ R^(L_K×d), V ∈ R^(L_V×d), and d is the input dimension. Because the attention mechanism has many parameters and can easily overfit and lose key connections between sequence positions, the UniDrop technique is introduced. Feature Dropout (FD) randomly suppresses certain neurons in the network with a given probability. FD-1 is applied to the attention weights A to increase the generalization of multi-head attention; FD-2 is applied after the activation function between the two linear transformations of the feed-forward sublayer. However, applying FD-1 directly to the weights A may drop a value A(i, j), meaning the relationship between token i and token j is ignored, so a larger FD-1 implies a greater risk of losing key information between sequence positions. To mitigate this risk, dropout is instead added to Q, K, and V before attention is computed, and FD-4 is applied to the output features before the final linear transformation. Let q_i, k_i, v_i denote the i-th rows of Q, K, V after dropout; the i-th query's attention is then defined as a kernel smoother in probability form:
A(q_i, K, V) = Σ_j ( k(q_i, k_j) / Σ_l k(q_i, k_l) ) · v_j = E_{p(k_j|q_i)}[v_j] (5)
where the attention of the i-th query to all keys is defined as the probability p(k_j|q_i) = k(q_i, k_j) / Σ_l k(q_i, k_l), and k(q_i, k_j) selects the asymmetric exponential kernel exp(q_i k_j^T / √d).
The output combines this probability with the values v_j. The attention mechanism favors query attention probability distributions that are far from the uniform distribution: if p(k_j|q_i) is close to the uniform distribution q(k_j|q_i) = 1/L_K, the self-attention degenerates into a trivial sum of the values V and becomes redundant with respect to the input. The "important" queries can therefore be distinguished by measuring the "similarity" between the distributions p and q, using the KL divergence:
KL(q‖p) = ln Σ_{l=1}^{L_K} e^{q_i k_l^T/√d} − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^T/√d − ln L_K (6)
the sparsity metric for the ith query, except for the constant, can be defined as:
Figure BDA0003746707220000064
where the first term is the Log-Sum-Exp (LSE) of q_i over all keys (the logarithm of the sum of the asymmetric exponential kernels) and the second term is their arithmetic mean. If M(q_i, K) is larger for the i-th query, its attention probability p is more "diverse" and is more likely to contain the dominant dot-product pairs in the head of the long-tailed self-attention distribution. However, traversing all queries to compute M(q_i, K) requires every dot-product pair, which means quadratic O(L_Q L_K) complexity, and the LSE operation also has potential numerical stability problems. The formula is therefore improved, giving the final sparsity measurement:
M̄(q_i, K) = max_j { q_i k_j^T/√d } − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^T/√d (8)
The queries with the largest measurements, i.e. the high-probability part, are then kept, yielding the probability sparse self-attention combined with UniDrop:
A(Q, K, V) = Softmax(Q̄ K^T / √d) V (9)
where Q̄ is a sparse matrix of the same size as Q that contains only the Top-u queries under the sparsity measurement M̄, i.e. the part with the larger probability. Here u = c · ln L_Q, controlled by a constant sampling factor c.
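A simplified single-head sketch of this mechanism is given below, assuming PyTorch; it omits masking and the sampling tricks of the full Informer implementation, applies dropout to Q, K and V as described above, scores every query with M̄, keeps the Top-u queries with u = c·ln L_Q, and lets the remaining queries fall back to the mean of V. The function name and these simplifications are illustrative assumptions, not the patent's exact implementation.

```python
import math
import torch
import torch.nn.functional as F

def probsparse_attention_unidrop(Q, K, V, c: int = 5, p_drop: float = 0.1):
    # Q: [batch, L_Q, d], K and V: [batch, L_K, d]
    B, L_Q, d = Q.shape
    L_K = K.shape[1]
    # UniDrop-style feature dropout on Q, K, V before attention is computed
    Q, K, V = (F.dropout(t, p=p_drop) for t in (Q, K, V))
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)            # [B, L_Q, L_K]
    # Sparsity measurement M_bar: max over keys minus arithmetic mean over keys
    M = scores.max(dim=-1).values - scores.mean(dim=-1)        # [B, L_Q]
    u = min(L_Q, int(c * math.log(L_Q)))                       # Top-u "active" queries
    top_idx = M.topk(u, dim=-1).indices                        # [B, u]
    # Lazy queries default to the mean of V (the near-uniform-attention limit)
    out = V.mean(dim=1, keepdim=True).expand(B, L_Q, d).clone()
    # Active queries receive full softmax attention over all keys
    top_scores = torch.gather(scores, 1, top_idx.unsqueeze(-1).expand(B, u, L_K))
    attn = F.dropout(F.softmax(top_scores, dim=-1), p=p_drop)
    out.scatter_(1, top_idx.unsqueeze(-1).expand(B, u, d), attn @ V)
    return out                                                 # [B, L_Q, d]
```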
Step 4.2.2: The distillation operation. As a natural consequence of the attention mechanism, the encoder's feature map contains redundant combinations of the value V. In the next layer, the invention therefore uses a distillation operation to give privilege to the dominant features and to generate a focused self-attention feature map that sharply prunes the time dimension of the input. Convolutional neural networks recognize simple patterns in data well and compose them into complex patterns in higher layers; a one-dimensional convolution is very effective for extracting interesting features from data whose positions are not highly correlated and is well suited to time-series analysis of sensor data. A one-dimensional convolution with kernel size 3 is therefore selected to extract features. The distillation operation from the j-th layer to the (j+1)-th layer is:
X_{j+1} = MaxPool( LeakyReLU( Conv1d([X_j]_AB) ) ) (10)
where [·]_AB contains the basic operations in the multi-head attention block, and Conv1d(·) performs a one-dimensional convolution along the time dimension, followed by the LeakyReLU(·) activation function. LeakyReLU(·) is a variant of ReLU that changes the response for inputs less than 0; it reduces the sparsity of ReLU while inheriting its advantages: it accelerates convergence, alleviates gradient vanishing and explosion, and is simple to compute. The LeakyReLU(·) activation function is:
LeakyReLU(x)=max(0,x)+negative_slope·min(0,x) (11)
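A sketch of one such distilling layer follows, assuming PyTorch: Conv1d along the time dimension, LeakyReLU activation, and max-pooling that roughly halves the sequence length before the next attention block. The padding, pooling stride and negative slope are illustrative assumptions.

```python
import torch.nn as nn

class DistillingLayer(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(negative_slope=0.01)    # max(0, x) + negative_slope * min(0, x)
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):                    # x: [batch, L, d_model]
        x = self.conv(x.permute(0, 2, 1))    # convolve along the time dimension
        x = self.pool(self.act(x))           # prune the time dimension: L -> about L/2
        return x.permute(0, 2, 1)            # back to [batch, L/2, d_model]
```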
Step 4.3: Construct the decoder. The decoder generates the time-series output in one forward pass; part of its structure follows the decoder of the Transformer. It contains two attention mechanisms and a feed-forward layer composed of linear mappings. The decoder input vector is
X_de = Concat(X_token, X_0) ∈ R^((L_token+L_y)×d_model) (12)
where X_token is the start token and X_0 is the placeholder for the target sequence (its scalars are set to 0). The first attention layer is the probability sparse self-attention combined with UniDrop, as in step 4.2.1; the masked multi-head attention sets future positions to −∞, which prevents each position from attending to future positions and thereby avoids autoregression. The second attention layer is ordinary self-attention. Generative inference is used to mitigate the speed drop in long-term prediction. Each attention layer is followed by an Add & Norm layer, composed of an Add part and a Norm part and computed as:
LayerNorm(X+MultiHeadAttention(X)) (13)
Finally, the prediction result is output directly through a fully connected layer.
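A minimal sketch of how this decoder input can be assembled under Informer-style generative inference is given below, assuming PyTorch; the function name and argument names are illustrative.

```python
import torch

def build_decoder_input(x_enc, label_len: int, pred_len: int):
    # x_enc: [batch, L, d_model] embedded input sequence
    start_token = x_enc[:, -label_len:, :]                   # X_token: tail of the known sequence
    placeholder = torch.zeros(x_enc.size(0), pred_len, x_enc.size(2),
                              device=x_enc.device)           # X_0: target placeholder, scalars set to 0
    return torch.cat([start_token, placeholder], dim=1)      # X_de = Concat(X_token, X_0)
```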
Step 5: Train and optimize the LDformer model. The model constructed in step 4 is trained and optimized until it reaches its best state. The MSE loss function is selected for predicting the target sequence and is propagated back through the whole model from the decoder output; the Adam optimizer is used, the learning rate decays from its initial value by a factor of 2 per epoch, a total number of epochs is set, and training is stopped early when appropriate. The predicted values produced by the trained model are compared with the real values, and the MAE, MSE and RMSE indices of the prediction are computed as follows:
MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i| (14)
MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2 (15)
RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2 ) (16)
where y is the real data, ŷ is the predicted data, and n is the size of the dataset.
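These three indices can be computed as in the short sketch below, assuming NumPy arrays of equal shape for the real series y and the prediction y_hat.

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y_hat - y))

def mse(y, y_hat):
    return np.mean((y_hat - y) ** 2)

def rmse(y, y_hat):
    return np.sqrt(mse(y, y_hat))
```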
The key contribution of the method is the parallel neural network model LDformer, which addresses the long-time sequence prediction problem in electricity prediction: the Informer framework is combined with an LSTM so that the deep features of the time series are fully considered, and a probability sparse self-attention mechanism combined with UniDrop avoids the large parameter count of the original attention mechanism and the risk of losing key connections between sequence positions. The implementation is simple and can be applied not only to power datasets but also to time-series datasets in other fields, and it adapts well to a large number of complex data scenarios.
Drawings
FIG. 1 is a diagram of a model framework of the LDformer of the present invention.
FIG. 2 is a structure diagram of the embedding layer considered from multiple angles.
Fig. 3 is a block diagram of a parallel encoder module of the present invention.
FIG. 4 is the overall structure of the UniDrop in the attention mechanism of the present invention.
Fig. 5 is a block diagram of the decoder of the present invention.
FIG. 6 is a histogram of the mean error between the true and predicted values for different prediction lengths of the four models.
FIG. 7 is a plot of the convergence of loss as a function of learning rate for two data sets at different lengths in the four models.
FIG. 8 is a graph of the runtime variation of the four models in a dataset.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
The method models the long-time sequence prediction problem in electricity prediction. A parallel neural network model, LDformer, is provided for long time series; the method also suits time-series data collected in most other fields, such as weather prediction, air-quality prediction, and traffic-flow prediction. The invention is implemented in Python in a PyCharm environment. An example scenario is shown in FIG. 1, the model framework diagram of LDformer, which comprises the embedding layer, the LSTM, the encoder, the decoder, and the final fully connected output layer. The encoder module uses a four-path parallel model whose outputs are concatenated and fed to the decoder; after decoding, the decoder outputs the prediction result through a fully connected layer. The specific implementation is as follows:
step 1: taking an electric power data set as an example, in order to solve the problem of long-time sequence prediction, the invention provides a parallel neural network model LDformer for long-time sequence prediction. Firstly, input and output of a model are determined, an appropriate training data set is selected, and the model input is six load characteristics and a target value { X (1) ,X (2) ,X (3) ,X (4) ,X (5) ,X (6) Y, by collecting six features { X } from the training set (1) ,X (2) ,X (3) ,X (4) ,X (5) ,X (6) Small batch of m samples dataset of
Figure BDA0003746707220000101
To predict n sequences of target values "OT
Figure BDA0003746707220000102
And then the step 2 is carried out.
Step 2: Data preprocessing. The input training dataset is first standardized using StandardScaler(), ensuring that every dimension has variance 1 and mean 0, so that the results are not dominated by feature values that are too large in certain dimensions. The conversion function is
x* = (x − μ) / σ
where μ is the mean of all sample data and σ is the standard deviation of all sample data. Then proceed to step 3.
Step 3: The dataset obtained in step 2 enters the embedding layer considered from multiple angles. Data encoding, position encoding, and timestamp encoding are carried out, each expanded to the uniform dimension 512, and the results are summed to obtain the final embedding. Step 3.1 is data encoding, step 3.2 is position encoding, and step 3.3 is timestamp encoding.
Step 3.1: Data encoding. The data embedding converts the data to the uniform dimension 512 using a one-dimensional convolution, with the formula:
DE=conv1d(x) (17)
Step 3.2: Position encoding. The elements of the input sequence are processed together rather than one by one as in an RNN; this increases speed but ignores the ordering of the elements, so a position encoding is added, with the formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) (18)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) (19)
Step 3.3: Timestamp encoding. The timestamp encoding method includes month_embed, day_embed, weekday_embed, hour_embed and minute_embed. The dataset time slices used in the method are 15 minutes and 1 hour, so minute_embed and hour_embed are selected to obtain the timestamp encoding result.
Step 4: Construct the parallel neural network model LDformer for long-time sequence prediction. After the data passes through the embedding layer, the following steps are carried out:
Step 4.1: The LSTM receives the input data for feature extraction, obtaining a deep representation of the time series. The LSTM adds a gating mechanism (an input gate, a forget gate, and an output gate) to the recurrent neural network to decide what information is kept or discarded, which alleviates the vanishing- and exploding-gradient problems that ordinary recurrent neural networks face when training on long sequences.
Step 4.2: Construct the encoder module. The encoder is designed to extract robust long-range dependencies from the time-series input and mainly consists of two sublayers, a multi-head attention layer (using the probability sparse attention mechanism combined with UniDrop) and a feed-forward layer composed of two linear mappings; each sublayer is followed by a batch normalization layer, and there are skip connections between sublayers. The encoder adopts a multi-channel parallel structure, with four channels of time-series data lengths L, L/2, L/4, and L/8 executed in parallel, and distillation operations are combined to improve model robustness. The distillation operation mainly uses a one-dimensional convolution to trim the dimensions and reduce memory usage before the output of one layer is sent to the multi-head attention module of the next layer; there is always one fewer distillation operation than encoder layers. The attention in the encoder uses the UniDrop-combined probability sparse self-attention mechanism constructed in the invention.
Step 4.2.1: The probability sparse self-attention mechanism combined with UniDrop. Feature Dropout (FD) in the UniDrop technique randomly suppresses certain neurons in the network with a given probability. FD-1 is applied to the attention weights A to increase the generalization of multi-head attention; FD-2 is applied after the activation function between the two linear transformations of the feed-forward sublayer. However, applying FD-1 directly to the weights A may drop a value A(i, j), meaning the relationship between token i and token j is ignored, so a larger FD-1 implies a greater risk of losing key information between sequence positions. To mitigate this risk, dropout is added to Q, K, and V before attention is computed, and FD-4 is applied to the output features before the final linear transformation. Let q_i, k_i, v_i denote the i-th rows of Q, K, V after dropout; the i-th query's attention is then defined as a kernel smoother in probability form:
A(q_i, K, V) = Σ_j ( k(q_i, k_j) / Σ_l k(q_i, k_l) ) · v_j = E_{p(k_j|q_i)}[v_j] (20)
where the attention of the i-th query to all keys is defined as the probability p(k_j|q_i) = k(q_i, k_j) / Σ_l k(q_i, k_l), and k(q_i, k_j) selects the asymmetric exponential kernel exp(q_i k_j^T / √d).
The output combines this probability with the values v_j. The attention mechanism favors query attention probability distributions that are far from the uniform distribution: if p(k_j|q_i) is close to the uniform distribution q(k_j|q_i) = 1/L_K, the self-attention degenerates into a trivial sum of the values V and becomes redundant with respect to the input. The "important" queries can therefore be distinguished by measuring the "similarity" between the distributions p and q, using the KL divergence:
KL(q‖p) = ln Σ_{l=1}^{L_K} e^{q_i k_l^T/√d} − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^T/√d − ln L_K (21)
Dropping the constant term, the sparsity measurement of the i-th query is defined as:
M(q_i, K) = ln Σ_{j=1}^{L_K} e^{q_i k_j^T/√d} − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^T/√d (22)
where the first term is the Log-Sum-Exp (LSE) of q_i over all keys (the logarithm of the sum of the asymmetric exponential kernels) and the second term is their arithmetic mean. If M(q_i, K) is larger for the i-th query, its attention probability p is more "diverse" and is more likely to contain the dominant dot-product pairs in the head of the long-tailed self-attention distribution. However, traversing all queries to compute M(q_i, K) requires every dot-product pair, which means quadratic O(L_Q L_K) complexity, and the LSE operation also has potential numerical stability problems. The formula is therefore improved, giving the final sparsity measurement:
M̄(q_i, K) = max_j { q_i k_j^T/√d } − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^T/√d (23)
The queries with the largest measurements, i.e. the high-probability part, are then kept, yielding the probability sparse self-attention combined with UniDrop:
A(Q, K, V) = Softmax(Q̄ K^T / √d) V (24)
where Q̄ is a sparse matrix of the same size as Q that contains only the Top-u queries under the sparsity measurement M̄, i.e. the part with the larger probability. Here u = c · ln L_Q, controlled by a constant sampling factor c, and the invention sets c equal to 5.
Step 4.2.2: The distillation operation. As a natural consequence of the attention mechanism, the encoder's feature map contains redundant combinations of the value V. In the next layer, a distillation operation is therefore used to give privilege to the dominant features and to generate a focused self-attention feature map, pruning the time dimension of the input. A one-dimensional convolution is very effective for extracting interesting features from data whose positions are not highly correlated and is well suited to time-series analysis of sensor data, so a one-dimensional convolution with kernel size 3 is selected to extract features. The distillation operation from the j-th layer to the (j+1)-th layer is:
X_{j+1} = MaxPool( LeakyReLU( Conv1d([X_j]_AB) ) ) (25)
where [·]_AB contains the basic operations in the multi-head attention block, and Conv1d(·) performs a one-dimensional convolution along the time dimension, followed by the LeakyReLU(·) activation function. The LeakyReLU(·) activation function is:
LeakyReLU(x)=max(0,x)+negative_slope·min(0,x) (26)
Step 4.3: Construct the decoder. The decoder generates the time-series output in one forward pass; part of its structure follows the decoder of the Transformer. It contains two attention mechanisms and a feed-forward layer composed of linear mappings. The decoder input vector is
X_de = Concat(X_token, X_0) ∈ R^((L_token+L_y)×d_model) (27)
where X_token is the start token and X_0 is the placeholder for the target sequence (its scalars are set to 0). The first attention layer is the probability sparse self-attention combined with UniDrop, as in step 4.2.1; the masked multi-head self-attention sets future positions to −∞, which prevents each position from attending to future positions and thereby avoids autoregression. The second attention layer is ordinary self-attention. Each attention layer is followed by an Add & Norm layer, composed of an Add part and a Norm part and computed as:
LayerNorm(X+MultiHeadAttention(X)) (28)
Finally, the prediction result is output directly through a fully connected layer; for example, if the prediction target sequence length is 24, 24 values are output.
Step 5: Use the power dataset for simulation to complete the training and optimization of the LDformer model.
In the dataset, under different time divisions, 96 historical values of the six load features are used to predict 24, 36, and 48 values of the target data; a comparison of the errors between the real and predicted values is shown in FIG. 6.
The MAE, MSE and RMSE indices are calculated from the predicted values obtained by the prediction model:
MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i| (29)
MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2 (30)
RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2 ) (31)
where y is the real data, ŷ is the predicted data, and n is the size of the dataset.
With predicted values generated from the same dataset, the simulation evaluates model performance through the three indices MAE, MSE and RMSE, compares the performance for predictions of different lengths, and also fully compares the loss values and running times of the models. The results are presented with line charts, as shown in FIG. 7 and FIG. 8. The main simulation parameters are as follows:
the network structure is as follows: LDformer
Batch size: 64
Learning rate: 1e -4 —1.25e-05
Iteration times are as follows: maximum 10, stop when appropriate
And (3) an optimization algorithm: adam
Loss function: MSE.
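A sketch of a training loop matching these parameters (MSE loss, Adam, initial learning rate 1e-4 halved every epoch, at most 10 epochs with early stopping) is given below, assuming PyTorch; `model`, `train_loader`, `val_loader` and the patience value are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_ldformer(model, train_loader, val_loader, epochs: int = 10, patience: int = 3):
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.5)  # halve the lr each epoch
    best_val, wait = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)   # MSE propagated back from the decoder output
            loss.backward()
            optimizer.step()
        scheduler.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item() for x, y in val_loader) / len(val_loader)
        if val_loss < best_val:
            best_val, wait = val_loss, 0
        else:
            wait += 1
            if wait >= patience:            # early stopping when validation loss stalls
                break
```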

Claims (1)

1. A long-time sequence prediction method in the field of power prediction, based on a parallel neural network model LDformer comprising an embedding layer considered from multiple angles, a long short-term memory network LSTM, an encoder, and a decoder; the encoder uses a multi-channel parallel structure combined with a distillation operation, and its attention is a probability sparse attention mechanism combined with UniDrop; the decoder comprises two attention mechanisms, the first being a masked probability sparse attention mechanism combined with UniDrop that prevents each position from attending to future positions and thereby avoids autoregression, and the second being ordinary self-attention. The method comprises the following specific steps:
step 1: taking an electric power data set as an example, in order to solve the long-time sequence prediction problem, a long-time sequence prediction method based on a parallel neural network model LDformer is provided. Firstly, input and output of a model are determined, a proper training data set is selected, and the model input is six load characteristics and a target value { X } (1) ,X (2) ,X (3) ,X (4) ,X (5) ,X (6) Y, by collecting six features { X } from the training set (1) ,X (2) ,X (3) ,X (4) ,X (5) ,X (6) Small batch of m samples dataset of
Figure FDA0003746707210000011
To predict n sequences of target values "OT
Figure FDA0003746707210000012
And then step 2 is carried out.
Step 2: Data preprocessing. The input training dataset is first standardized using StandardScaler(), ensuring that every dimension has variance 1 and mean 0, so that the results are not dominated by feature values that are too large in certain dimensions. The conversion function is
x* = (x − μ) / σ
where μ is the mean of all sample data and σ is the standard deviation of all sample data. Then proceed to step 3.
Step 3: The dataset obtained in step 2 enters the embedding layer considered from multiple angles. Data encoding, position encoding, and timestamp encoding are carried out, each expanded to the uniform dimension d_model, and the results are summed to obtain the final embedding. Step 3.1 is data encoding, step 3.2 is position encoding, and step 3.3 is timestamp encoding.
Step 3.1: Data encoding. The data embedding converts the data to the uniform dimension d_model using a one-dimensional convolution. The formula is as follows:
DE=conv1d(x) (1)
Step 3.2: Position encoding. The elements of the input sequence are processed together rather than one by one as in an RNN; this increases speed but ignores the ordering of the elements, so a position encoding is added, with the formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) (2)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) (3)
Step 3.3: Timestamp encoding. The timestamp encoding method includes month_embed, day_embed, weekday_embed, hour_embed and minute_embed. The dataset time slices used in the method are 15 minutes and 1 hour, so minute_embed and hour_embed are selected to obtain the timestamp encoding result.
Step 4: Construct the parallel neural network model LDformer for long-time sequence prediction. After the data passes through the embedding layer, the following steps are carried out:
Step 4.1: The LSTM receives the input data for feature extraction, obtaining a deep representation of the time series. The LSTM adds a gating mechanism (an input gate, a forget gate, and an output gate) to the recurrent neural network to decide what information is kept or discarded, which alleviates the vanishing- and exploding-gradient problems that ordinary recurrent neural networks face when training on long sequences.
Step 4.2: Construct the encoder module. The encoder is designed to extract robust long-range dependencies from the time-series input and mainly consists of two sublayers, a multi-head attention layer (using the probability sparse attention mechanism combined with UniDrop) and a feed-forward layer composed of two linear mappings; each sublayer is followed by a batch normalization layer, and there are skip connections between sublayers. The encoder adopts a multi-channel parallel structure, with four channels of time-series data lengths L, L/2, L/4, and L/8 executed in parallel, and distillation operations are combined to improve model robustness. The distillation operation mainly uses a one-dimensional convolution to trim the dimensions and reduce memory usage before the output of one layer is sent to the multi-head attention module of the next layer; there is always one fewer distillation operation than encoder layers. The attention in the encoder uses the UniDrop-combined probability sparse self-attention mechanism constructed in the invention.
Step 4.2.1: The probability sparse self-attention mechanism combined with UniDrop. Feature Dropout (FD) in the UniDrop technique randomly suppresses certain neurons in the network with a given probability. FD-1 is applied to the attention weights A to increase the generalization of multi-head attention; FD-2 is applied after the activation function between the two linear transformations of the feed-forward sublayer. However, applying FD-1 directly to the weights A may drop a value A(i, j), meaning the relationship between token i and token j is ignored, so a larger FD-1 implies a greater risk of losing key information between sequence positions. To mitigate this risk, dropout is added to Q, K, and V before attention is computed, and FD-4 is applied to the output features before the final linear transformation. Let q_i, k_i, v_i denote the i-th rows of Q, K, V after dropout; the i-th query's attention is then defined as a kernel smoother in probability form:
A(q_i, K, V) = Σ_j ( k(q_i, k_j) / Σ_l k(q_i, k_l) ) · v_j = E_{p(k_j|q_i)}[v_j] (4)
where the attention of the i-th query to all keys is defined as the probability p(k_j|q_i) = k(q_i, k_j) / Σ_l k(q_i, k_l), and k(q_i, k_j) selects the asymmetric exponential kernel exp(q_i k_j^T / √d).
The output combines this probability with the values v_j. The attention mechanism favors query attention probability distributions that are far from the uniform distribution: if p(k_j|q_i) is close to the uniform distribution q(k_j|q_i) = 1/L_K, the self-attention degenerates into a trivial sum of the values V and becomes redundant with respect to the input. The "important" queries can therefore be distinguished by measuring the "similarity" between the distributions p and q, using the KL divergence:
KL(q‖p) = ln Σ_{l=1}^{L_K} e^{q_i k_l^T/√d} − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^T/√d − ln L_K (5)
Dropping the constant term, the sparsity measurement of the i-th query is defined as:
M(q_i, K) = ln Σ_{j=1}^{L_K} e^{q_i k_j^T/√d} − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^T/√d (6)
where the first term is the Log-Sum-Exp (LSE) of q_i over all keys (the logarithm of the sum of the asymmetric exponential kernels) and the second term is their arithmetic mean. If M(q_i, K) is larger for the i-th query, its attention probability p is more "diverse" and is more likely to contain the dominant dot-product pairs in the head of the long-tailed self-attention distribution. However, traversing all queries to compute M(q_i, K) requires every dot-product pair, which means quadratic O(L_Q L_K) complexity, and the LSE operation also has potential numerical stability problems. The formula is therefore improved, giving the final sparsity measurement:
M̄(q_i, K) = max_j { q_i k_j^T/√d } − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^T/√d (7)
The queries with the largest measurements, i.e. the high-probability part, are then kept, yielding the probability sparse self-attention combined with UniDrop:
A(Q, K, V) = Softmax(Q̄ K^T / √d) V (8)
where Q̄ is a sparse matrix of the same size as Q that contains only the Top-u queries under the sparsity measurement M̄, i.e. the part with the larger probability. Here u = c · ln L_Q, controlled by a constant sampling factor c.
Step 4.2.2: The distillation operation. As a natural consequence of the attention mechanism, the encoder's feature map contains redundant combinations of the value V. In the next layer, a distillation operation is therefore used to give privilege to the dominant features and to generate a focused self-attention feature map, pruning the time dimension of the input. A one-dimensional convolution is very effective for extracting interesting features from data whose positions are not highly correlated and is well suited to time-series analysis of sensor data, so a one-dimensional convolution with kernel size 3 is selected to extract features. The distillation operation from the j-th layer to the (j+1)-th layer is:
X_{j+1} = MaxPool( LeakyReLU( Conv1d([X_j]_AB) ) ) (9)
where [·]_AB contains the basic operations in the multi-head attention block, and Conv1d(·) performs a one-dimensional convolution along the time dimension, followed by the LeakyReLU(·) activation function. The LeakyReLU(·) activation function is:
LeakyReLU(x)=max(0,x)+negative_slope·min(0,x) (10)
Step 4.3: Construct the decoder. The decoder generates the time-series output in one forward pass; part of its structure follows the decoder of the Transformer. It contains two attention mechanisms and a feed-forward layer composed of linear mappings. The decoder input vector is
X_de = Concat(X_token, X_0) ∈ R^((L_token+L_y)×d_model) (11)
where X_token is the start token and X_0 is the placeholder for the target sequence (its scalars are set to 0). The first attention layer is the probability sparse self-attention combined with UniDrop, as in step 4.2.1; the masked multi-head self-attention sets future positions to −∞, which prevents each position from attending to future positions and thereby avoids autoregression. The second attention layer is ordinary self-attention. Each attention layer is followed by an Add & Norm layer, composed of an Add part and a Norm part and computed as:
LayerNorm(X+MultiHeadAttention(X)) (12)
Finally, the prediction result is output directly through a fully connected layer.
Step 5: Use the power dataset for simulation to complete the training and optimization of the LDformer model.
In the dataset, under different time divisions, 96 historical values of the six load features are used to predict 24, 36, and 48 values of the target data; a comparison of the average errors between the real and predicted values is shown in FIG. 6.
The MAE, MSE and RMSE indices are calculated from the predicted values obtained by the prediction model:
MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i| (13)
MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2 (14)
RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2 ) (15)
where y is the real data, ŷ is the predicted data, and n is the size of the dataset.
CN202210834021.8A 2022-07-14 2022-07-14 Long-time sequence prediction method based on parallel neural network model LDformer Pending CN115310674A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210834021.8A CN115310674A (en) 2022-07-14 2022-07-14 Long-time sequence prediction method based on parallel neural network model LDformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210834021.8A CN115310674A (en) 2022-07-14 2022-07-14 Long-time sequence prediction method based on parallel neural network model LDformer

Publications (1)

Publication Number Publication Date
CN115310674A true CN115310674A (en) 2022-11-08

Family

ID=83857039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210834021.8A Pending CN115310674A (en) 2022-07-14 2022-07-14 Long-time sequence prediction method based on parallel neural network model LDformer

Country Status (1)

Country Link
CN (1) CN115310674A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115795351A (en) * 2023-01-29 2023-03-14 杭州市特种设备检测研究院(杭州市特种设备应急处置中心) Elevator big data risk early warning method based on residual error network and 2D feature representation
CN115795351B (en) * 2023-01-29 2023-06-09 杭州市特种设备检测研究院(杭州市特种设备应急处置中心) Elevator big data risk early warning method based on residual error network and 2D feature representation
CN116128158A (en) * 2023-04-04 2023-05-16 西南石油大学 Oil well efficiency prediction method of mixed sampling attention mechanism
CN116612393A (en) * 2023-05-05 2023-08-18 北京思源知行科技发展有限公司 Solar radiation prediction method, system, electronic equipment and storage medium
CN117275723A (en) * 2023-09-15 2023-12-22 上海全景医学影像诊断中心有限公司 Early parkinsonism prediction method, device and system
CN117275723B (en) * 2023-09-15 2024-03-15 上海全景医学影像诊断中心有限公司 Early parkinsonism prediction method, device and system
CN117290706A (en) * 2023-10-31 2023-12-26 兰州理工大学 Traffic flow prediction method based on space-time convolution fusion probability sparse attention mechanism

Similar Documents

Publication Publication Date Title
Mo et al. Remaining useful life estimation via transformer encoder enhanced by a gated convolutional unit
CN115310674A (en) Long-time sequence prediction method based on parallel neural network model LDformer
CN110348624B (en) Sand storm grade prediction method based on Stacking integration strategy
Wang et al. Correlation aware multi-step ahead wind speed forecasting with heteroscedastic multi-kernel learning
US20230018125A1 (en) Processing Multi-Horizon Forecasts For Time Series Data
CN114548592A (en) Non-stationary time series data prediction method based on CEMD and LSTM
CN116340796A (en) Time sequence data analysis method, device, equipment and storage medium
Chen et al. House price prediction based on machine learning and deep learning methods
CN114117852B (en) Regional heat load rolling prediction method based on finite difference working domain division
Samin-Al-Wasee et al. Time-series forecasting of ethereum price using long short-term memory (lstm) networks
Liu et al. Maintenance spare parts demand forecasting for automobile 4S shop considering weather data
CN115048873B (en) Residual service life prediction system for aircraft engine
CN116404637A (en) Short-term load prediction method and device for electric power system
Jaiswal et al. A Comparative Analysis on Stock Price Prediction Model using DEEP LEARNING Technology
CN115423091A (en) Conditional antagonistic neural network training method, scene generation method and system
Duan et al. Stock price trend prediction using MRCM-CNN
Wang et al. MIANet: Multi-level temporal information aggregation in mixed-periodicity time series forecasting tasks
Yin et al. Forecasting of stock price trend based on CART and similar stock
Lin et al. Design a hybrid framework for air pollution forecasting
Wang et al. Risk assessment of customer churn in telco using FCLCNN-LSTM model
CN117094451B (en) Power consumption prediction method, device and terminal
Jiménez-Navarro et al. Embedded Temporal Feature Selection for Time Series Forecasting Using Deep Learning
CN112183846B (en) TVF-EMD-MCQRNN load probability prediction method based on fuzzy C-means clustering
Shaik et al. Prediction of Stock Index Pattern via three-stage architecture of TICC, TPA-LSTM and Multivariate LSTM-FCNs
US20230334283A1 (en) Prediction method and related system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination