CN115310674A - Long-time sequence prediction method based on parallel neural network model LDformer - Google Patents

Long-time sequence prediction method based on parallel neural network model LDformer

Info

Publication number
CN115310674A
Authority
CN
China
Prior art keywords
attention
data
layer
prediction
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210834021.8A
Other languages
Chinese (zh)
Inventor
田冉
李新梅
马忠彧
刘颜星
王晶霞
王楚
王灏篷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest Normal University
Original Assignee
Northwest Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest Normal University filed Critical Northwest Normal University
Priority to CN202210834021.8A priority Critical patent/CN115310674A/en
Publication of CN115310674A publication Critical patent/CN115310674A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/06Electricity, gas or water supply
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

Long-time sequence prediction is an important problem with wide-ranging applications in many fields, such as stocks, traffic, and electric power. Existing time-series prediction methods suffer from high time complexity, large parameter counts, and low prediction accuracy, and are therefore not suitable for high-precision long-term prediction on real-world data. To address these problems, the invention proposes a parallel time-series prediction model, LDformer. First, the Informer framework is combined with an LSTM so that the deep features of the time series are fully considered. Second, a probability sparse attention mechanism combined with UniDrop is proposed, which reduces the risk of losing key connections in the sequence. Third, taking data stability and parameter count into account, features are extracted by one-dimensional convolution in the distillation operation. Experimental results at different prediction lengths on three power datasets (ETTm1, ETTh1 and ETTh2) show that the proposed method outperforms state-of-the-art baselines in long-time sequence prediction, and ablation experiments verify the effectiveness of the key component designs.

Description

Long-time sequence prediction method based on parallel neural network model LDformer
Technical Field
The invention relates to the field of power prediction, in particular to a long-time sequence prediction method based on a parallel neural network model LDformer.
Background
With the advent of the big-data era, data has penetrated every industry. Various sensors and applications continuously collect large-scale time series, such as sales of goods in retail stores and supermarkets, passenger flow in railway and aviation departments, urban traffic flow, load demand in the power sector, stock prices in finance, and weather conditions in meteorology. In power distribution, the grid must allocate electricity to different customer areas according to sequentially changing demand. Predicting the future demand of a particular area is difficult because it varies with factors such as weekday versus holiday, season, weather, and temperature. Existing time-series prediction methods are not suited to high-precision long-term prediction on real-world data, and any wrong prediction can have serious consequences. Because there is currently no effective way to predict future power usage, managers have to make decisions based on empirical values, which are typically much higher than actual demand; such conservative strategies waste power and accelerate equipment depreciation. Notably, the oil temperature of a transformer effectively reflects its working condition, so long-time sequence predictive modeling is the key to solving this problem. However, long-time sequence prediction still faces serious challenges: most time-series models are aimed at short-term prediction and do not transfer well to long sequences, where the volume of historical data is large, the computational complexity is high, and the accuracy requirements are strict, so good results have been hard to obtain. Research on long-term sequence prediction is therefore particularly important.
Most current research on time-series prediction focuses on short-term prediction with machine learning and deep learning. In machine learning, many researchers adopt ARIMA and SVM, but these models are better suited to stationary series, and real-world time-series data is almost never purely stationary, so their applicability is limited by the data characteristics and their generality is poor. A Bayesian Temporal Factorization (BTF) framework has also been proposed for modeling multidimensional time series in spatio-temporal data with missing values, but machine learning methods struggle to obtain accurate results on complex prediction problems. With the development of deep learning, researchers have found it more applicable to such problems, and in recent years the Transformer has been applied to long-term sequence prediction tasks in many fields. However, it is time-consuming and has a large number of parameters. Informer, an improved algorithm based on the Transformer, was therefore proposed, but its attention mechanism may lose some key connections in the sequence, and its prediction accuracy still needs improvement. The invention therefore builds further improvements on top of Informer.
Disclosure of Invention
The invention improves the prediction accuracy of long-time sequence prediction and overcomes the drawbacks of the traditional Transformer model, including high time complexity, large parameter count, slow running speed, and a tendency to lose key connections between sequence positions. It provides a long-time sequence prediction method based on a parallel neural network model, LDformer, which predicts the future from existing historical data.
The invention mainly comprises five parts: (1) determining the input and output of the model; (2) preprocessing the dataset; (3) determining the temporal characteristics of the data and encoding them; (4) constructing the parallel neural network model LDformer for long-time sequence prediction; (5) verifying the validity of the method.
The five parts are described as follows:
1. Determine the input and output of the model. A power dataset is input to the method, with each data point consisting of the target Oil Temperature (OT) and six external load values of different types: High UseFul Load (HUFL), High UseLess Load (HULL), Middle UseFul Load (MUFL), Middle UseLess Load (MULL), Low UseFul Load (LUFL) and Low UseLess Load (LULL). An appropriate training dataset is selected so that the six external load values are used to predict the target value OT. A small mini-batch of m samples {x_1, x_2, ..., x_m} of the six features {X^(1), X^(2), X^(3), X^(4), X^(5), X^(6)} is collected from the training set to predict the sequence of n target values OT, {ŷ_1, ŷ_2, ..., ŷ_n}.
2. Preprocess the dataset. Preprocessing mainly consists of normalization. Because the time-series data collected in power measurement contains abnormal values and considerable noise, standardization is used: centering the data indirectly reduces the influence of outliers and extreme values.
3. Determine the temporal characteristics of the data and encode them from multiple angles. Long-time sequence prediction modeling requires not only local timing information but also hierarchical timing information such as week, month, and year, as well as sudden timestamp information (events, holidays, and so on). The conventional self-attention mechanism is difficult to adapt directly and can cause a mismatch between queries and keys between the encoder and decoder, which ultimately degrades the prediction.
4. Construct the parallel neural network model LDformer for long-time sequence prediction. LDformer consists of an embedding layer (Embedding) considered from multiple angles, a long short-term memory network (LSTM), an Encoder, and a Decoder. The embedding layer works from three angles, performing data encoding, position encoding, and timestamp encoding, each expanded to a uniform dimension d_model. The LSTM receives the input data for feature extraction to obtain a deep representation of the time series. The data then enters the encoder, which adopts a multi-channel parallel structure to improve model robustness, uses probability sparse self-attention combined with the UniDrop technique to reduce the number of parameters and alleviate overfitting while accepting many long sequence inputs, and adds a distillation operation between encoder modules to reduce the redundant combinations of the value V in the encoder's feature map. The decoder accepts long sequence inputs and generates the output elements in a single forward prediction.
5. Verify the validity of the method. Experiments on a real power dataset and comparisons with other leading research show that the prediction accuracy of the method on the long-time sequence prediction problem is clearly higher than that of the compared methods, confirming that the improvements target the weaknesses of the baseline algorithm.
The detailed implementation steps adopted by the invention to realize the purpose are as follows:
step 1: and determining the input and output of the model according to the power data set, and selecting an appropriate proportion to divide the data set. Defining model inputs as six load characteristics and a target value { X (1) ,X (2) ,X (3) ,X (4) ,X (5) ,X (6) Y, wherein the six Load characteristics are "High usefull Load (HUFL)", "High UseLess Load (HULL)", "Middle usefull Load (MUFL)", "Middle UseLess Load (MULL)", "Low usefull Load (LUFL)", and "Low UseLess Load (LULL)", respectively. The target value is the Oil Temperature (OT).
Step 2: Data preprocessing. The input training dataset is first standardized. The standardization uses StandardScaler(), ensuring that every dimension has variance 1 and mean 0, so that the results are not dominated by feature values that are too large in certain dimensions. The conversion function is
x* = (x − μ) / σ
where μ is the mean of all sample data and σ is the standard deviation of all sample data. Then proceed to step 3.
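The following is a minimal sketch of this standardization step, assuming scikit-learn and NumPy; the function name and array arguments are illustrative and not taken from the patent.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def standardize(train_values: np.ndarray, test_values: np.ndarray):
    """Fit x* = (x - mu) / sigma on the training split, then reuse it on the test split."""
    scaler = StandardScaler()                        # per-column zero mean, unit variance
    train_scaled = scaler.fit_transform(train_values)
    test_scaled = scaler.transform(test_values)      # avoid leaking test statistics
    return train_scaled, test_scaled, scaler
```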
Step 3: The training dataset obtained in step 2 enters the embedding layer (Embedding). Long-time sequence prediction modeling requires not only local timing information but also hierarchical timing information. The invention therefore works from three angles, performing data encoding, position encoding, and timestamp encoding, each expanded to the uniform dimension d_model, and the three results are summed to obtain the final embedding. Step 3.1 is data encoding, step 3.2 is position encoding, and step 3.3 is timestamp encoding.
Step 3.1: Data encoding. The data embedding converts the data to the uniform dimension d_model using a one-dimensional convolution. The formula is as follows:
DE=conv1d(x) (1)
Step 3.2: Position encoding. The elements of the input sequence are processed together rather than one by one as in an RNN; this increases speed but ignores the ordering of the elements, so a position encoding is added, with the formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) (2)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) (3)
Step 3.3: Timestamp encoding. The timestamp encoding method includes month_embed, day_embed, weekday_embed, hour_embed and minute_embed. The dataset time slices used in the method are 15 minutes and 1 hour, so minute_embed and hour_embed are selected to obtain the timestamp encoding result.
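A sketch of this multi-angle embedding is given below, assuming PyTorch and an Informer-style layout (Conv1d value embedding, fixed sinusoidal position encoding, hour/minute timestamp embeddings); the class name, kernel size and embedding-table sizes are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class MultiAngleEmbedding(nn.Module):
    def __init__(self, c_in: int, d_model: int = 512, max_len: int = 5000):
        super().__init__()
        # Step 3.1: data encoding, DE = conv1d(x), mapping c_in -> d_model channels
        self.value_emb = nn.Conv1d(c_in, d_model, kernel_size=3, padding=1)
        # Step 3.2: fixed sinusoidal position encoding
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))
        # Step 3.3: timestamp encoding for the hour and 15-minute slices
        self.hour_emb = nn.Embedding(24, d_model)
        self.minute_emb = nn.Embedding(4, d_model)    # four 15-minute slices per hour

    def forward(self, x, hour_idx, minute_idx):
        # x: [batch, seq_len, c_in]; hour_idx, minute_idx: [batch, seq_len] integer marks
        de = self.value_emb(x.permute(0, 2, 1)).transpose(1, 2)      # data encoding
        pe = self.pe[:, : x.size(1)]                                 # position encoding
        te = self.hour_emb(hour_idx) + self.minute_emb(minute_idx)   # timestamp encoding
        return de + pe + te                                          # summed embedding, all d_model
```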
Step 4: Construct the parallel neural network model LDformer for long-time sequence prediction. After the data has been split and preprocessed and has passed through the embedding layer, the following steps are carried out:
Step 4.1: The LSTM receives the input data for feature extraction, obtaining a deep representation of the time series. The LSTM adds a gating mechanism (an input gate, a forget gate, and an output gate) to the recurrent neural network to decide what information is kept or discarded, which alleviates the vanishing- and exploding-gradient problems that ordinary recurrent neural networks face when training on long sequences. In short, the LSTM performs better than an ordinary RNN on longer sequences.
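A minimal sketch of this feature-extraction stage is shown below, assuming PyTorch; the class name and layer sizes are illustrative.

```python
import torch.nn as nn

class DeepFeatureLSTM(nn.Module):
    def __init__(self, d_model: int = 512, num_layers: int = 2):
        super().__init__()
        # Gated recurrence (input/forget/output gates) mitigates vanishing and
        # exploding gradients over long sequences.
        self.lstm = nn.LSTM(d_model, d_model, num_layers=num_layers, batch_first=True)

    def forward(self, x):          # x: [batch, seq_len, d_model] embedded sequence
        out, _ = self.lstm(x)      # deep temporal representation, same shape
        return out
```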
Step 4.2: Construct the encoder module. The encoder is designed to extract robust long-range dependencies from the time-series input. Its overall architecture is roughly the same as that of the Transformer: it mainly consists of two sublayers, a multi-head attention layer (using the probability sparse attention mechanism combined with UniDrop) and a feed-forward layer composed of two linear mappings; each sublayer is followed by a batch normalization layer, and there are skip connections between sublayers. The difference is that the encoder adopts a multi-channel parallel structure, with four channels of time-series data lengths L, L/2, L/4, and L/8 executed in parallel, and distillation operations are combined to improve model robustness. The distillation operation mainly uses a one-dimensional convolution to trim the dimensions and reduce memory usage before the output of one layer is sent to the multi-head attention module of the next layer; there is always one fewer distillation operation than encoder layers. The attention in the encoder uses the UniDrop-combined probability sparse self-attention mechanism constructed in the invention.
Step 4.2.1: The probability sparse self-attention mechanism combined with UniDrop. Considering the time complexity and the risk of losing key connections in the sequence, the invention proposes a probability sparse attention mechanism combined with UniDrop. The canonical self-attention mechanism maps a query (Q) and a set of key-value pairs (K, V) to an output, where Q, K, V and the output are vectors; the output is a weighted sum of V, with the weight assigned to each value computed by a compatibility function between Q and the corresponding K. The formula is as follows:
A(Q, K, V) = Softmax(QK^T / √d) V (4)
where Q ∈ R^(L_Q×d), K ∈ R^(L_K×d), V ∈ R^(L_V×d), and d is the input dimension. Because the attention mechanism has many parameters and can easily overfit and lose key connections between sequence positions, the UniDrop technique is introduced. Feature Dropout (FD) randomly suppresses certain neurons in the network with a given probability. FD-1 is applied to the attention weights A to increase the generalization of multi-head attention; FD-2 is applied after the activation function between the two linear transformations of the feed-forward sublayer. However, applying FD-1 directly to the weights A may drop a value A(i, j), meaning the relationship between token i and token j is ignored, so a larger FD-1 implies a greater risk of losing key information between sequence positions. To mitigate this risk, dropout is instead added to Q, K, and V before attention is computed, and FD-4 is applied to the output features before the final linear transformation. Let q_i, k_i, v_i denote the i-th rows of Q, K, V after dropout; the i-th query's attention is then defined as a kernel smoother in probability form:
A(q_i, K, V) = Σ_j ( k(q_i, k_j) / Σ_l k(q_i, k_l) ) · v_j = E_{p(k_j|q_i)}[v_j] (5)
where the attention of the i-th query to all keys is defined as the probability p(k_j|q_i) = k(q_i, k_j) / Σ_l k(q_i, k_l), and k(q_i, k_j) selects the asymmetric exponential kernel exp(q_i k_j^T / √d).
The output combines this probability with the values v_j. The attention mechanism favors query attention probability distributions that are far from the uniform distribution: if p(k_j|q_i) is close to the uniform distribution q(k_j|q_i) = 1/L_K, the self-attention degenerates into a trivial sum of the values V and becomes redundant with respect to the input. The "important" queries can therefore be distinguished by measuring the "similarity" between the distributions p and q, using the KL divergence:
KL(q‖p) = ln Σ_{l=1}^{L_K} e^{q_i k_l^T/√d} − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^T/√d − ln L_K (6)
the sparsity metric for the ith query, except for the constant, can be defined as:
Figure BDA0003746707220000064
where the first term is the Log-Sum-Exp (LSE) of q_i over all keys (the logarithm of the sum of the asymmetric exponential kernels) and the second term is their arithmetic mean. If M(q_i, K) is larger for the i-th query, its attention probability p is more "diverse" and is more likely to contain the dominant dot-product pairs in the head of the long-tailed self-attention distribution. However, traversing all queries to compute M(q_i, K) requires every dot-product pair, which means quadratic O(L_Q L_K) complexity, and the LSE operation also has potential numerical stability problems. The formula is therefore improved, giving the final sparsity measurement:
M̄(q_i, K) = max_j { q_i k_j^T/√d } − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^T/√d (8)
The queries with the largest measurements, i.e. the high-probability part, are then kept, yielding the probability sparse self-attention combined with UniDrop:
A(Q, K, V) = Softmax(Q̄ K^T / √d) V (9)
where Q̄ is a sparse matrix of the same size as Q that contains only the Top-u queries under the sparsity measurement M̄, i.e. the part with the larger probability. Here u = c · ln L_Q, controlled by a constant sampling factor c.
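A simplified single-head sketch of this mechanism is given below, assuming PyTorch; it omits masking and the sampling tricks of the full Informer implementation, applies dropout to Q, K and V as described above, scores every query with M̄, keeps the Top-u queries with u = c·ln L_Q, and lets the remaining queries fall back to the mean of V. The function name and these simplifications are illustrative assumptions, not the patent's exact implementation.

```python
import math
import torch
import torch.nn.functional as F

def probsparse_attention_unidrop(Q, K, V, c: int = 5, p_drop: float = 0.1):
    # Q: [batch, L_Q, d], K and V: [batch, L_K, d]
    B, L_Q, d = Q.shape
    L_K = K.shape[1]
    # UniDrop-style feature dropout on Q, K, V before attention is computed
    Q, K, V = (F.dropout(t, p=p_drop) for t in (Q, K, V))
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)            # [B, L_Q, L_K]
    # Sparsity measurement M_bar: max over keys minus arithmetic mean over keys
    M = scores.max(dim=-1).values - scores.mean(dim=-1)        # [B, L_Q]
    u = min(L_Q, int(c * math.log(L_Q)))                       # Top-u "active" queries
    top_idx = M.topk(u, dim=-1).indices                        # [B, u]
    # Lazy queries default to the mean of V (the near-uniform-attention limit)
    out = V.mean(dim=1, keepdim=True).expand(B, L_Q, d).clone()
    # Active queries receive full softmax attention over all keys
    top_scores = torch.gather(scores, 1, top_idx.unsqueeze(-1).expand(B, u, L_K))
    attn = F.dropout(F.softmax(top_scores, dim=-1), p=p_drop)
    out.scatter_(1, top_idx.unsqueeze(-1).expand(B, u, d), attn @ V)
    return out                                                 # [B, L_Q, d]
```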
Step 4.2.2: The distillation operation. As a natural consequence of the attention mechanism, the encoder's feature map contains redundant combinations of the value V. In the next layer, the invention therefore uses a distillation operation to give privilege to the dominant features and to generate a focused self-attention feature map that sharply prunes the time dimension of the input. Convolutional neural networks recognize simple patterns in data well and compose them into complex patterns in higher layers; a one-dimensional convolution is very effective for extracting interesting features from data whose positions are not highly correlated and is well suited to time-series analysis of sensor data. A one-dimensional convolution with kernel size 3 is therefore selected to extract features. The distillation operation from the j-th layer to the (j+1)-th layer is:
X_{j+1} = MaxPool( LeakyReLU( Conv1d([X_j]_AB) ) ) (10)
where [·]_AB contains the basic operations in the multi-head attention block, and Conv1d(·) performs a one-dimensional convolution along the time dimension, followed by the LeakyReLU(·) activation function. LeakyReLU(·) is a variant of ReLU that changes the response for inputs less than 0; it reduces the sparsity of ReLU while inheriting its advantages: it accelerates convergence, alleviates gradient vanishing and explosion, and is simple to compute. The LeakyReLU(·) activation function is:
LeakyReLU(x)=max(0,x)+negative_slope·min(0,x) (11)
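A sketch of one such distilling layer follows, assuming PyTorch: Conv1d along the time dimension, LeakyReLU activation, and max-pooling that roughly halves the sequence length before the next attention block. The padding, pooling stride and negative slope are illustrative assumptions.

```python
import torch.nn as nn

class DistillingLayer(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(negative_slope=0.01)    # max(0, x) + negative_slope * min(0, x)
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):                    # x: [batch, L, d_model]
        x = self.conv(x.permute(0, 2, 1))    # convolve along the time dimension
        x = self.pool(self.act(x))           # prune the time dimension: L -> about L/2
        return x.permute(0, 2, 1)            # back to [batch, L/2, d_model]
```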
Step 4.3: Construct the decoder. The decoder generates the time-series output in one forward pass; part of its structure follows the decoder of the Transformer. It contains two attention mechanisms and a feed-forward layer composed of linear mappings. The decoder input vector is
X_de = Concat(X_token, X_0) ∈ R^((L_token+L_y)×d_model) (12)
where X_token is the start token and X_0 is the placeholder for the target sequence (its scalars are set to 0). The first attention layer is the probability sparse self-attention combined with UniDrop, as in step 4.2.1; the masked multi-head attention sets future positions to −∞, which prevents each position from attending to future positions and thereby avoids autoregression. The second attention layer is ordinary self-attention. Generative inference is used to mitigate the speed drop in long-term prediction. Each attention layer is followed by an Add & Norm layer, composed of an Add part and a Norm part and computed as:
LayerNorm(X+MultiHeadAttention(X)) (13)
Finally, the prediction result is output directly through a fully connected layer.
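A minimal sketch of how this decoder input can be assembled under Informer-style generative inference is given below, assuming PyTorch; the function name and argument names are illustrative.

```python
import torch

def build_decoder_input(x_enc, label_len: int, pred_len: int):
    # x_enc: [batch, L, d_model] embedded input sequence
    start_token = x_enc[:, -label_len:, :]                   # X_token: tail of the known sequence
    placeholder = torch.zeros(x_enc.size(0), pred_len, x_enc.size(2),
                              device=x_enc.device)           # X_0: target placeholder, scalars set to 0
    return torch.cat([start_token, placeholder], dim=1)      # X_de = Concat(X_token, X_0)
```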
Step 5: Train and optimize the LDformer model. The model constructed in step 4 is trained and optimized until it reaches its best state. The MSE loss function is selected for predicting the target sequence and is propagated back through the whole model from the decoder output; the Adam optimizer is used, the learning rate decays from its initial value by a factor of 2 per epoch, a total number of epochs is set, and training is stopped early when appropriate. The predicted values produced by the trained model are compared with the real values, and the MAE, MSE and RMSE indices of the prediction are computed as follows:
MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i| (14)
MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2 (15)
RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2 ) (16)
where y is the real data, ŷ is the predicted data, and n is the size of the dataset.
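These three indices can be computed as in the short sketch below, assuming NumPy arrays of equal shape for the real series y and the prediction y_hat.

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y_hat - y))

def mse(y, y_hat):
    return np.mean((y_hat - y) ** 2)

def rmse(y, y_hat):
    return np.sqrt(mse(y, y_hat))
```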
The key contribution of the method is the parallel neural network model LDformer, which addresses the long-time sequence prediction problem in electricity prediction: the Informer framework is combined with an LSTM so that the deep features of the time series are fully considered, and a probability sparse self-attention mechanism combined with UniDrop avoids the large parameter count of the original attention mechanism and the risk of losing key connections between sequence positions. The implementation is simple and can be applied not only to power datasets but also to time-series datasets in other fields, and it adapts well to a large number of complex data scenarios.
Drawings
FIG. 1 is a diagram of a model framework of the LDformer of the present invention.
FIG. 2 is a structure diagram of the embedding layer considered from multiple angles.
Fig. 3 is a block diagram of a parallel encoder module of the present invention.
FIG. 4 is the overall structure of the UniDrop in the attention mechanism of the present invention.
Fig. 5 is a block diagram of the decoder of the present invention.
FIG. 6 is a histogram of the mean error between the true and predicted values for different prediction lengths of the four models.
FIG. 7 is a plot of the convergence of loss as a function of learning rate for two data sets at different lengths in the four models.
FIG. 8 is a graph of the runtime variation of the four models in a dataset.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
The method models the long-time sequence prediction problem in electricity prediction. A parallel neural network model, LDformer, is provided for long time series; the method also suits time-series data collected in most other fields, such as weather prediction, air-quality prediction, and traffic-flow prediction. The invention is implemented in Python in a PyCharm environment. An example scenario is shown in FIG. 1, the model framework diagram of LDformer, which comprises the embedding layer, the LSTM, the encoder, the decoder, and the final fully connected output layer. The encoder module uses a four-path parallel model whose outputs are concatenated and fed to the decoder; after decoding, the decoder outputs the prediction result through a fully connected layer. The specific implementation is as follows:
step 1: taking an electric power data set as an example, in order to solve the problem of long-time sequence prediction, the invention provides a parallel neural network model LDformer for long-time sequence prediction. Firstly, input and output of a model are determined, an appropriate training data set is selected, and the model input is six load characteristics and a target value { X (1) ,X (2) ,X (3) ,X (4) ,X (5) ,X (6) Y, by collecting six features { X } from the training set (1) ,X (2) ,X (3) ,X (4) ,X (5) ,X (6) Small batch of m samples dataset of
Figure BDA0003746707220000101
To predict n sequences of target values "OT
Figure BDA0003746707220000102
And then the step 2 is carried out.
Step 2: Data preprocessing. The input training dataset is first standardized using StandardScaler(), ensuring that every dimension has variance 1 and mean 0, so that the results are not dominated by feature values that are too large in certain dimensions. The conversion function is
x* = (x − μ) / σ
where μ is the mean of all sample data and σ is the standard deviation of all sample data. Then proceed to step 3.
Step 3: The dataset obtained in step 2 enters the embedding layer considered from multiple angles. Data encoding, position encoding, and timestamp encoding are carried out, each expanded to the uniform dimension 512, and the results are summed to obtain the final embedding. Step 3.1 is data encoding, step 3.2 is position encoding, and step 3.3 is timestamp encoding.
Step 3.1: Data encoding. The data embedding converts the data to the uniform dimension 512 using a one-dimensional convolution, with the formula:
DE=conv1d(x) (17)
Step 3.2: Position encoding. The elements of the input sequence are processed together rather than one by one as in an RNN; this increases speed but ignores the ordering of the elements, so a position encoding is added, with the formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) (18)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) (19)
Step 3.3: Timestamp encoding. The timestamp encoding method includes month_embed, day_embed, weekday_embed, hour_embed and minute_embed. The dataset time slices used in the method are 15 minutes and 1 hour, so minute_embed and hour_embed are selected to obtain the timestamp encoding result.
Step 4: Construct the parallel neural network model LDformer for long-time sequence prediction. After the data passes through the embedding layer, the following steps are carried out:
Step 4.1: The LSTM receives the input data for feature extraction, obtaining a deep representation of the time series. The LSTM adds a gating mechanism (an input gate, a forget gate, and an output gate) to the recurrent neural network to decide what information is kept or discarded, which alleviates the vanishing- and exploding-gradient problems that ordinary recurrent neural networks face when training on long sequences.
Step 4.2: Construct the encoder module. The encoder is designed to extract robust long-range dependencies from the time-series input and mainly consists of two sublayers, a multi-head attention layer (using the probability sparse attention mechanism combined with UniDrop) and a feed-forward layer composed of two linear mappings; each sublayer is followed by a batch normalization layer, and there are skip connections between sublayers. The encoder adopts a multi-channel parallel structure, with four channels of time-series data lengths L, L/2, L/4, and L/8 executed in parallel, and distillation operations are combined to improve model robustness. The distillation operation mainly uses a one-dimensional convolution to trim the dimensions and reduce memory usage before the output of one layer is sent to the multi-head attention module of the next layer; there is always one fewer distillation operation than encoder layers. The attention in the encoder uses the UniDrop-combined probability sparse self-attention mechanism constructed in the invention.
Step 4.2.1: The probability sparse self-attention mechanism combined with UniDrop. Feature Dropout (FD) in the UniDrop technique randomly suppresses certain neurons in the network with a given probability. FD-1 is applied to the attention weights A to increase the generalization of multi-head attention; FD-2 is applied after the activation function between the two linear transformations of the feed-forward sublayer. However, applying FD-1 directly to the weights A may drop a value A(i, j), meaning the relationship between token i and token j is ignored, so a larger FD-1 implies a greater risk of losing key information between sequence positions. To mitigate this risk, dropout is added to Q, K, and V before attention is computed, and FD-4 is applied to the output features before the final linear transformation. Let q_i, k_i, v_i denote the i-th rows of Q, K, V after dropout; the i-th query's attention is then defined as a kernel smoother in probability form:
A(q_i, K, V) = Σ_j ( k(q_i, k_j) / Σ_l k(q_i, k_l) ) · v_j = E_{p(k_j|q_i)}[v_j] (20)
where the attention of the i-th query to all keys is defined as the probability p(k_j|q_i) = k(q_i, k_j) / Σ_l k(q_i, k_l), and k(q_i, k_j) selects the asymmetric exponential kernel exp(q_i k_j^T / √d).
The output combines this probability with the values v_j. The attention mechanism favors query attention probability distributions that are far from the uniform distribution: if p(k_j|q_i) is close to the uniform distribution q(k_j|q_i) = 1/L_K, the self-attention degenerates into a trivial sum of the values V and becomes redundant with respect to the input. The "important" queries can therefore be distinguished by measuring the "similarity" between the distributions p and q, using the KL divergence:
KL(q‖p) = ln Σ_{l=1}^{L_K} e^{q_i k_l^T/√d} − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^T/√d − ln L_K (21)
Dropping the constant term, the sparsity measurement of the i-th query is defined as:
M(q_i, K) = ln Σ_{j=1}^{L_K} e^{q_i k_j^T/√d} − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^T/√d (22)
where the first term is the Log-Sum-Exp (LSE) of q_i over all keys (the logarithm of the sum of the asymmetric exponential kernels) and the second term is their arithmetic mean. If M(q_i, K) is larger for the i-th query, its attention probability p is more "diverse" and is more likely to contain the dominant dot-product pairs in the head of the long-tailed self-attention distribution. However, traversing all queries to compute M(q_i, K) requires every dot-product pair, which means quadratic O(L_Q L_K) complexity, and the LSE operation also has potential numerical stability problems. The formula is therefore improved, giving the final sparsity measurement:
M̄(q_i, K) = max_j { q_i k_j^T/√d } − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^T/√d (23)
The queries with the largest measurements, i.e. the high-probability part, are then kept, yielding the probability sparse self-attention combined with UniDrop:
A(Q, K, V) = Softmax(Q̄ K^T / √d) V (24)
where Q̄ is a sparse matrix of the same size as Q that contains only the Top-u queries under the sparsity measurement M̄, i.e. the part with the larger probability. Here u = c · ln L_Q, controlled by a constant sampling factor c, and the invention sets c equal to 5.
Step 4.2.2: The distillation operation. As a natural consequence of the attention mechanism, the encoder's feature map contains redundant combinations of the value V. In the next layer, a distillation operation is therefore used to give privilege to the dominant features and to generate a focused self-attention feature map, pruning the time dimension of the input. A one-dimensional convolution is very effective for extracting interesting features from data whose positions are not highly correlated and is well suited to time-series analysis of sensor data, so a one-dimensional convolution with kernel size 3 is selected to extract features. The distillation operation from the j-th layer to the (j+1)-th layer is:
X_{j+1} = MaxPool( LeakyReLU( Conv1d([X_j]_AB) ) ) (25)
where [·]_AB contains the basic operations in the multi-head attention block, and Conv1d(·) performs a one-dimensional convolution along the time dimension, followed by the LeakyReLU(·) activation function. The LeakyReLU(·) activation function is:
LeakyReLU(x)=max(0,x)+negative_slope·min(0,x) (26)
Step 4.3: Construct the decoder. The decoder generates the time-series output in one forward pass; part of its structure follows the decoder of the Transformer. It contains two attention mechanisms and a feed-forward layer composed of linear mappings. The decoder input vector is
X_de = Concat(X_token, X_0) ∈ R^((L_token+L_y)×d_model) (27)
where X_token is the start token and X_0 is the placeholder for the target sequence (its scalars are set to 0). The first attention layer is the probability sparse self-attention combined with UniDrop, as in step 4.2.1; the masked multi-head self-attention sets future positions to −∞, which prevents each position from attending to future positions and thereby avoids autoregression. The second attention layer is ordinary self-attention. Each attention layer is followed by an Add & Norm layer, composed of an Add part and a Norm part and computed as:
LayerNorm(X+MultiHeadAttention(X)) (28)
Finally, the prediction result is output directly through a fully connected layer; for example, if the prediction target sequence length is 24, 24 values are output.
Step 5: Use the power dataset for simulation to complete the training and optimization of the LDformer model.
In the dataset, under different time divisions, 96 historical values of the six load features are used to predict 24, 36, and 48 values of the target data; a comparison of the errors between the real and predicted values is shown in FIG. 6.
The MAE, MSE and RMSE indices are calculated from the predicted values obtained by the prediction model:
MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i| (29)
MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2 (30)
RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2 ) (31)
where y is the real data, ŷ is the predicted data, and n is the size of the dataset.
With predicted values generated from the same dataset, the simulation evaluates model performance through the three indices MAE, MSE and RMSE, compares the performance for predictions of different lengths, and also fully compares the loss values and running times of the models. The results are presented with line charts, as shown in FIG. 7 and FIG. 8. The main simulation parameters are as follows:
the network structure is as follows: LDformer
Batch size: 64
Learning rate: 1e -4 —1.25e-05
Iteration times are as follows: maximum 10, stop when appropriate
And (3) an optimization algorithm: adam
Loss function: MSE.
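A sketch of a training loop matching these parameters (MSE loss, Adam, initial learning rate 1e-4 halved every epoch, at most 10 epochs with early stopping) is given below, assuming PyTorch; `model`, `train_loader`, `val_loader` and the patience value are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_ldformer(model, train_loader, val_loader, epochs: int = 10, patience: int = 3):
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.5)  # halve the lr each epoch
    best_val, wait = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)   # MSE propagated back from the decoder output
            loss.backward()
            optimizer.step()
        scheduler.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item() for x, y in val_loader) / len(val_loader)
        if val_loss < best_val:
            best_val, wait = val_loss, 0
        else:
            wait += 1
            if wait >= patience:            # early stopping when validation loss stalls
                break
```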

Claims (1)

1. A long-time sequence prediction method in the field of power prediction, based on a parallel neural network model LDformer comprising an embedding layer considered from multiple angles, a long short-term memory network LSTM, an encoder, and a decoder; the encoder uses a multi-channel parallel structure combined with a distillation operation, and its attention is a probability sparse attention mechanism combined with UniDrop; the decoder comprises two attention mechanisms, the first being a masked probability sparse attention mechanism combined with UniDrop that prevents each position from attending to future positions and thereby avoids autoregression, and the second being ordinary self-attention. The method comprises the following specific steps:
step 1: taking an electric power data set as an example, in order to solve the long-time sequence prediction problem, a long-time sequence prediction method based on a parallel neural network model LDformer is provided. Firstly, input and output of a model are determined, a proper training data set is selected, and the model input is six load characteristics and a target value { X } (1) ,X (2) ,X (3) ,X (4) ,X (5) ,X (6) Y, by collecting six features { X } from the training set (1) ,X (2) ,X (3) ,X (4) ,X (5) ,X (6) Small batch of m samples dataset of
Figure FDA0003746707210000011
To predict n sequences of target values "OT
Figure FDA0003746707210000012
And then step 2 is carried out.
Step 2: Data preprocessing. The input training dataset is first standardized using StandardScaler(), ensuring that every dimension has variance 1 and mean 0, so that the results are not dominated by feature values that are too large in certain dimensions. The conversion function is
x* = (x − μ) / σ
where μ is the mean of all sample data and σ is the standard deviation of all sample data. Then proceed to step 3.
Step 3: The dataset obtained in step 2 enters the embedding layer considered from multiple angles. Data encoding, position encoding, and timestamp encoding are carried out, each expanded to the uniform dimension d_model, and the results are summed to obtain the final embedding. Step 3.1 is data encoding, step 3.2 is position encoding, and step 3.3 is timestamp encoding.
Step 3.1: Data encoding. The data embedding converts the data to the uniform dimension d_model using a one-dimensional convolution. The formula is as follows:
DE=conv1d(x) (1)
Step 3.2: Position encoding. The elements of the input sequence are processed together rather than one by one as in an RNN; this increases speed but ignores the ordering of the elements, so a position encoding is added, with the formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) (2)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) (3)
Step 3.3: Timestamp encoding. The timestamp encoding method includes month_embed, day_embed, weekday_embed, hour_embed and minute_embed. The dataset time slices used in the method are 15 minutes and 1 hour, so minute_embed and hour_embed are selected to obtain the timestamp encoding result.
Step 4: Construct the parallel neural network model LDformer for long-time sequence prediction. After the data passes through the embedding layer, the following steps are carried out:
Step 4.1: The LSTM receives the input data for feature extraction, obtaining a deep representation of the time series. The LSTM adds a gating mechanism (an input gate, a forget gate, and an output gate) to the recurrent neural network to decide what information is kept or discarded, which alleviates the vanishing- and exploding-gradient problems that ordinary recurrent neural networks face when training on long sequences.
Step 4.2: Construct the encoder module. The encoder is designed to extract robust long-range dependencies from the time-series input and mainly consists of two sublayers, a multi-head attention layer (using the probability sparse attention mechanism combined with UniDrop) and a feed-forward layer composed of two linear mappings; each sublayer is followed by a batch normalization layer, and there are skip connections between sublayers. The encoder adopts a multi-channel parallel structure, with four channels of time-series data lengths L, L/2, L/4, and L/8 executed in parallel, and distillation operations are combined to improve model robustness. The distillation operation mainly uses a one-dimensional convolution to trim the dimensions and reduce memory usage before the output of one layer is sent to the multi-head attention module of the next layer; there is always one fewer distillation operation than encoder layers. The attention in the encoder uses the UniDrop-combined probability sparse self-attention mechanism constructed in the invention.
Step 4.2.1: The probability sparse self-attention mechanism combined with UniDrop. Feature Dropout (FD) in the UniDrop technique randomly suppresses certain neurons in the network with a given probability. FD-1 is applied to the attention weights A to increase the generalization of multi-head attention; FD-2 is applied after the activation function between the two linear transformations of the feed-forward sublayer. However, applying FD-1 directly to the weights A may drop a value A(i, j), meaning the relationship between token i and token j is ignored, so a larger FD-1 implies a greater risk of losing key information between sequence positions. To mitigate this risk, dropout is added to Q, K, and V before attention is computed, and FD-4 is applied to the output features before the final linear transformation. Let q_i, k_i, v_i denote the i-th rows of Q, K, V after dropout; the i-th query's attention is then defined as a kernel smoother in probability form:
A(q_i, K, V) = Σ_j ( k(q_i, k_j) / Σ_l k(q_i, k_l) ) · v_j = E_{p(k_j|q_i)}[v_j] (4)
where the attention of the i-th query to all keys is defined as the probability p(k_j|q_i) = k(q_i, k_j) / Σ_l k(q_i, k_l), and k(q_i, k_j) selects the asymmetric exponential kernel exp(q_i k_j^T / √d).
The output combines this probability with the values v_j. The attention mechanism favors query attention probability distributions that are far from the uniform distribution: if p(k_j|q_i) is close to the uniform distribution q(k_j|q_i) = 1/L_K, the self-attention degenerates into a trivial sum of the values V and becomes redundant with respect to the input. The "important" queries can therefore be distinguished by measuring the "similarity" between the distributions p and q, using the KL divergence:
KL(q‖p) = ln Σ_{l=1}^{L_K} e^{q_i k_l^T/√d} − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^T/√d − ln L_K (5)
Dropping the constant term, the sparsity measurement of the i-th query is defined as:
M(q_i, K) = ln Σ_{j=1}^{L_K} e^{q_i k_j^T/√d} − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^T/√d (6)
where the first term is the Log-Sum-Exp (LSE) of q_i over all keys (the logarithm of the sum of the asymmetric exponential kernels) and the second term is their arithmetic mean. If M(q_i, K) is larger for the i-th query, its attention probability p is more "diverse" and is more likely to contain the dominant dot-product pairs in the head of the long-tailed self-attention distribution. However, traversing all queries to compute M(q_i, K) requires every dot-product pair, which means quadratic O(L_Q L_K) complexity, and the LSE operation also has potential numerical stability problems. The formula is therefore improved, giving the final sparsity measurement:
M̄(q_i, K) = max_j { q_i k_j^T/√d } − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^T/√d (7)
The queries with the largest measurements, i.e. the high-probability part, are then kept, yielding the probability sparse self-attention combined with UniDrop:
A(Q, K, V) = Softmax(Q̄ K^T / √d) V (8)
where Q̄ is a sparse matrix of the same size as Q that contains only the Top-u queries under the sparsity measurement M̄, i.e. the part with the larger probability. Here u = c · ln L_Q, controlled by a constant sampling factor c.
Step 4.2.2: The distillation operation. As a natural consequence of the attention mechanism, the encoder's feature map contains redundant combinations of the value V. In the next layer, a distillation operation is therefore used to give privilege to the dominant features and to generate a focused self-attention feature map, pruning the time dimension of the input. A one-dimensional convolution is very effective for extracting interesting features from data whose positions are not highly correlated and is well suited to time-series analysis of sensor data, so a one-dimensional convolution with kernel size 3 is selected to extract features. The distillation operation from the j-th layer to the (j+1)-th layer is:
X_{j+1} = MaxPool( LeakyReLU( Conv1d([X_j]_AB) ) ) (9)
where [·]_AB contains the basic operations in the multi-head attention block, and Conv1d(·) performs a one-dimensional convolution along the time dimension, followed by the LeakyReLU(·) activation function. The LeakyReLU(·) activation function is:
LeakyReLU(x)=max(0,x)+negative_slope·min(0,x) (10)
Step 4.3: Construct the decoder. The decoder generates the time-series output in one forward pass; part of its structure follows the decoder of the Transformer. It contains two attention mechanisms and a feed-forward layer composed of linear mappings. The decoder input vector is
X_de = Concat(X_token, X_0) ∈ R^((L_token+L_y)×d_model) (11)
where X_token is the start token and X_0 is the placeholder for the target sequence (its scalars are set to 0). The first attention layer is the probability sparse self-attention combined with UniDrop, as in step 4.2.1; the masked multi-head self-attention sets future positions to −∞, which prevents each position from attending to future positions and thereby avoids autoregression. The second attention layer is ordinary self-attention. Each attention layer is followed by an Add & Norm layer, composed of an Add part and a Norm part and computed as:
LayerNorm(X+MultiHeadAttention(X)) (12)
Finally, the prediction result is output directly through a fully connected layer.
Step 5: Use the power dataset for simulation to complete the training and optimization of the LDformer model.
In the dataset, under different time divisions, 96 historical values of the six load features are used to predict 24, 36, and 48 values of the target data; a comparison of the average errors between the real and predicted values is shown in FIG. 6.
The MAE, MSE and RMSE indices are calculated from the predicted values obtained by the prediction model:
MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i| (13)
MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2 (14)
RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2 ) (15)
where y is the real data, ŷ is the predicted data, and n is the size of the dataset.
CN202210834021.8A 2022-07-14 2022-07-14 Long-time sequence prediction method based on parallel neural network model LDformer Pending CN115310674A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210834021.8A CN115310674A (en) 2022-07-14 2022-07-14 Long-time sequence prediction method based on parallel neural network model LDformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210834021.8A CN115310674A (en) 2022-07-14 2022-07-14 Long-time sequence prediction method based on parallel neural network model LDformer

Publications (1)

Publication Number Publication Date
CN115310674A true CN115310674A (en) 2022-11-08

Family

ID=83857039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210834021.8A Pending CN115310674A (en) 2022-07-14 2022-07-14 Long-time sequence prediction method based on parallel neural network model LDformer

Country Status (1)

Country Link
CN (1) CN115310674A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115795351A (en) * 2023-01-29 2023-03-14 杭州市特种设备检测研究院(杭州市特种设备应急处置中心) Elevator big data risk early warning method based on residual error network and 2D feature representation
CN115795351B (en) * 2023-01-29 2023-06-09 杭州市特种设备检测研究院(杭州市特种设备应急处置中心) Elevator big data risk early warning method based on residual error network and 2D feature representation
CN116128158A (en) * 2023-04-04 2023-05-16 西南石油大学 Oil well efficiency prediction method of mixed sampling attention mechanism
CN116612393A (en) * 2023-05-05 2023-08-18 北京思源知行科技发展有限公司 Solar radiation prediction method, system, electronic equipment and storage medium
CN117275723A (en) * 2023-09-15 2023-12-22 上海全景医学影像诊断中心有限公司 Early parkinsonism prediction method, device and system
CN117275723B (en) * 2023-09-15 2024-03-15 上海全景医学影像诊断中心有限公司 Early parkinsonism prediction method, device and system
CN117290706A (en) * 2023-10-31 2023-12-26 兰州理工大学 Traffic flow prediction method based on space-time convolution fusion probability sparse attention mechanism

Similar Documents

Publication Publication Date Title
Mo et al. Remaining useful life estimation via transformer encoder enhanced by a gated convolutional unit
CN115310674A (en) Long-time sequence prediction method based on parallel neural network model LDformer
CN110348624B (en) Sand storm grade prediction method based on Stacking integration strategy
Wang et al. Correlation aware multi-step ahead wind speed forecasting with heteroscedastic multi-kernel learning
US20230018125A1 (en) Processing Multi-Horizon Forecasts For Time Series Data
CN114548592A (en) Non-stationary time series data prediction method based on CEMD and LSTM
CN116340796A (en) Time sequence data analysis method, device, equipment and storage medium
Chen et al. House price prediction based on machine learning and deep learning methods
CN114117852B (en) Regional heat load rolling prediction method based on finite difference working domain division
Samin-Al-Wasee et al. Time-series forecasting of ethereum price using long short-term memory (lstm) networks
Liu et al. Maintenance spare parts demand forecasting for automobile 4S shop considering weather data
CN115048873B (en) Residual service life prediction system for aircraft engine
CN116404637A (en) Short-term load prediction method and device for electric power system
Jaiswal et al. A Comparative Analysis on Stock Price Prediction Model using DEEP LEARNING Technology
CN115423091A (en) Conditional antagonistic neural network training method, scene generation method and system
Duan et al. Stock price trend prediction using MRCM-CNN
Wang et al. MIANet: Multi-level temporal information aggregation in mixed-periodicity time series forecasting tasks
Yin et al. Forecasting of stock price trend based on CART and similar stock
Lin et al. Design a hybrid framework for air pollution forecasting
Wang et al. Risk assessment of customer churn in telco using FCLCNN-LSTM model
CN117094451B (en) Power consumption prediction method, device and terminal
Jiménez-Navarro et al. Embedded Temporal Feature Selection for Time Series Forecasting Using Deep Learning
CN112183846B (en) TVF-EMD-MCQRNN load probability prediction method based on fuzzy C-means clustering
Shaik et al. Prediction of Stock Index Pattern via three-stage architecture of TICC, TPA-LSTM and Multivariate LSTM-FCNs
US20230334283A1 (en) Prediction method and related system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination