CN115423080A - Time series prediction method, time series prediction device, electronic device, and medium - Google Patents

Time series prediction method, time series prediction device, electronic device, and medium

Info

Publication number
CN115423080A
CN115423080A
Authority
CN
China
Prior art keywords
autocorrelation
attention
data
module
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211134256.2A
Other languages
Chinese (zh)
Inventor
段智华
张岚
陈伟耿
赵戌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202211134256.2A priority Critical patent/CN115423080A/en
Publication of CN115423080A publication Critical patent/CN115423080A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a time series prediction method, a time series prediction device, an electronic device, and a medium, and relates to the technical field of big data. The method comprises the following steps: acquiring historical time series data; constructing a prediction model based on a Transformer network, wherein the prediction model comprises an attention layer, the attention layer comprises a linear mapping autocorrelation attention module, and the linear mapping autocorrelation attention module is used for acquiring autocorrelation features of input data of the prediction model; and inputting the historical time series data into the prediction model as input data, obtaining the autocorrelation features of the historical time series data based on the linear mapping autocorrelation attention module of the prediction model, and obtaining a prediction sequence in a preset time period after the current time based on the autocorrelation features of the historical time series data. The method can capture contextual correlation information of the historical time series data, makes the prediction more accurate, reduces space and time complexity, and improves memory efficiency and time efficiency.

Description

Time series prediction method, time series prediction device, electronic equipment and medium
Technical Field
The present invention relates to the field of big data technologies, and in particular, to a time series prediction method, apparatus, electronic device, and medium.
Background
In mathematics, a time series is a series of data points indexed in time order, that is, a sequence of data points collected at successive, equally spaced points in time; it is a sequence of discrete-time data. Time series prediction refers to the process of using a model to predict future values from previously observed values, analyzing time series data with statistics and modeling to make predictions and inform strategic decisions. Time series prediction can be used in any field of applied science and engineering that involves time measurements, such as statistics, signal processing, pattern recognition, econometrics, mathematical finance, weather forecasting, earthquake prediction, electroencephalography, control engineering, astronomy, and communications engineering.
Currently, commonly used time series prediction methods include machine-learning-based methods and deep-learning-based methods. The machine-learning-based time series prediction methods are suitable for analyzing univariate stationary time series but are not suitable for predicting multivariate time series. The deep-learning-based prediction methods, such as the Recurrent Neural Network (RNN) and the Long Short-Term Memory network (LSTM), cannot completely eliminate the problems of gradient vanishing and gradient explosion for long time series; in such models, each prediction step depends on the hidden state of the previous step, and the models cannot fully represent the nonlinear relationships in time series data.
Disclosure of Invention
To solve the above technical problems or at least partially solve the above technical problems, embodiments of the present invention provide a time series prediction method, apparatus, electronic device, and medium.
In a first aspect, an embodiment of the present invention provides a time series prediction method, including:
acquiring historical time sequence data;
constructing a prediction model based on a Transformer network, wherein the prediction model comprises an attention layer, the attention layer comprises a linear mapping autocorrelation attention module, and the linear mapping autocorrelation attention module is used for acquiring autocorrelation characteristics of input data of the prediction model;
and inputting the historical time series data into the prediction model as input data, obtaining the autocorrelation characteristics of the historical time series data based on a linear mapping autocorrelation attention module of the prediction model, and obtaining a prediction sequence in a preset time period after the current time based on the autocorrelation characteristics of the historical time series data.
Optionally, the linear mapping autocorrelation attention module of the predictive model comprises a linear mapping submodule and an autocorrelation attention submodule; the linear mapping submodule is used for carrying out linear mapping on a key matrix and/or a value matrix corresponding to the historical time sequence data based on a preset linear mapping matrix; the autocorrelation attention sub-module is used for carrying out autocorrelation mapping on one or more of a query matrix, a key matrix and a value matrix corresponding to the historical time series data.
Optionally, the attention layer of the prediction model further comprises one or more of: a sparse attention module, a Nyström attention module and a residual attention module; the linear mapping autocorrelation attention module, the sparse attention module, the Nyström attention module and the residual attention module are called through preset target parameters.
Optionally, the prediction model employs a coder-decoder structure; the encoder comprises an input embedding layer, a position encoding layer, the attention layer, a regularization layer and a feed-forward layer; the decoder includes an output embedding layer, the position encoding layer, the attention layer, the regularization layer, the feedforward layer, a linear translation layer, and an activation layer.
Optionally, the position coding layer of the prediction model is used to determine a position coding feature vector of each data in the historical time series data, and the position coding layer uses trigonometric-function coding.
Optionally, the position-coding layer determines a position-coding feature vector of each data in the historical time-series data according to the following formula:
PE(pos, 2i) = sin(pos / N^(2i / d_model))
PE(pos, 2i+1) = cos(pos / N^(2i / d_model))
where PE denotes the position-coding feature vector, pos denotes the position of the data with index i in the historical time series data, N denotes a positive integer, and d_model denotes the model dimension of the prediction model.
Optionally, the feedforward layer of the prediction model includes a first feedforward module and/or a second feedforward module, and the first feedforward module and the second feedforward module are called by preset specified parameters;
the activation function of the first feed-forward module comprises one or more of: an ELU function, a GELU _ Fast function, a GELU _ new function, a Swish function, a Tanh function and a Sigmoid function; the second feedforward module is a convolutional neural network structure.
Optionally, the inputting the historical time-series data as input data into the prediction model, and obtaining an autocorrelation feature of the historical time-series data based on a linear mapping autocorrelation attention module of the prediction model, includes:
inputting the historical time sequence data serving as input data into the prediction model, and converting the historical time sequence data into a vector form by using the input embedding layer to obtain an input data vector and a global time characteristic vector of the historical time sequence data;
determining a position coding feature vector of the historical time series data by using the position coding layer;
obtaining an input representation vector based on the input data vector, the global temporal feature vector, and the position-coded feature vector;
and inputting the input representation vector into a linear mapping autocorrelation attention module of the prediction model to obtain autocorrelation characteristics of the historical time series data.
Optionally, the decoder obtains a prediction sequence in a preset time period after the current time by using a generative parallel prediction mode;
the form of the input data of the decoder is shown as the following formula (2):
X_de = Concat(X_token, X_0)    (2)
wherein X_0 represents a placeholder corresponding to the prediction sequence, and X_token represents the start token of the prediction sequence, the start token being a time series sampled from the historical time series.
In a second aspect, an embodiment of the present invention provides a time series prediction apparatus, including:
the data acquisition module is used for acquiring historical time series data;
the model building module is used for building a prediction model based on a Transformer network, the prediction model comprises an attention layer, the attention layer comprises a linear mapping autocorrelation attention module, and the linear mapping autocorrelation attention module is used for acquiring autocorrelation characteristics of input data of the prediction model;
and the prediction module is used for inputting the historical time series data into the prediction model as input data, obtaining the autocorrelation characteristics of the historical time series data based on a linear mapping autocorrelation attention module of the prediction model, and obtaining a prediction sequence in a preset time period after the current time based on the autocorrelation characteristics of the historical time series data.
Optionally, the prediction module is further configured to: inputting the historical time sequence data serving as input data into the prediction model, and converting the historical time sequence data into a vector form by using the input embedding layer to obtain an input data vector and a global time characteristic vector of the historical time sequence data; determining a position coding feature vector of the historical time series data by using the position coding layer; obtaining an input representation vector based on the input data vector, the global temporal feature vector, and the position-coded feature vector; and inputting the input representation vector into a linear mapping autocorrelation attention module of the prediction model to obtain autocorrelation characteristics of the historical time series data.
In a third aspect, an embodiment of the present invention provides an electronic device, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the time series prediction method of any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer-readable medium, on which a computer program is stored, where the computer program is executed by a processor to implement the time series prediction method according to any embodiment of the present invention.
One embodiment of the above invention has the following advantages or benefits:
according to the time sequence prediction method, historical time sequence data are analyzed through a prediction model constructed based on a transform network, and a prediction sequence in a preset time period after the current time is obtained; the attention layer of the prediction model constructed based on the Transformer network comprises a linear mapping autocorrelation attention module, the linear mapping autocorrelation attention module can acquire autocorrelation characteristics of historical time series data, can capture context correlation information of the historical time series data, is more accurate in prediction, and enables the complexity of standard Transformer attention to be O (N) 2 ) And the memory efficiency and the time efficiency are improved by reducing the number to O (N), so that the prediction model is more suitable for long sequence data.
Further effects of the above optional implementations will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a flow chart illustrating a time series prediction method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a linear mapping autocorrelation attention module in a time series prediction method according to an embodiment of the invention;
FIG. 3 is a diagram illustrating a linear mapping autocorrelation attention module in a time series prediction method according to an embodiment of the invention;
FIG. 4 is a diagram illustrating a linear mapping autocorrelation attention module in a time series prediction method according to another embodiment of the present invention;
FIG. 5 is a diagram of a linear mapping autocorrelation attention module in a time series prediction method according to yet another embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a prediction model in the time series prediction method according to the embodiment of the present invention;
FIG. 7 is a schematic diagram of an input representative vector of a time series prediction method of an embodiment of the present invention;
FIG. 8 is a diagram illustrating tuning results of a predictive model according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a time series prediction apparatus according to an embodiment of the present invention;
fig. 10 shows a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 shows a flowchart of a time series prediction method according to an embodiment of the present invention, and as shown in fig. 1, the method includes:
step S101: historical time-series data is acquired.
The historical time series data refers to a sequence formed by arranging values of the same statistical indicator in the chronological order of their occurrence. Historical time series data is a set of random variables ordered in time, typically the result of observing the same statistical indicator at a given sampling rate over equally spaced time periods. As an example, the historical time series data may be a sequence in which a metropolitan area network link traffic monitoring indicator, the broadband online number, or the dispatch amount is arranged in the chronological order of occurrence. Correspondingly, the prediction sequence may be the link traffic monitoring indicator, the broadband online number, or the dispatch amount of the metropolitan area network in a preset time period after the current moment.
Step S102: constructing a prediction model based on a Transformer network, wherein the prediction model comprises an attention layer, the attention layer comprises a linear mapping autocorrelation attention module, and the linear mapping autocorrelation attention module is used for acquiring autocorrelation characteristics of input data of the prediction model.
Due to the time complexity and memory space complexity of the self-attention layer, the standard Transformer network performs poorly when analyzing and predicting longer time series data. To solve this technical problem, the prediction model of this embodiment improves the attention mechanism of the standard Transformer network and realizes a lightweight linear mapping autocorrelation attention module, which introduces an autocorrelation attention mechanism, so that the prediction can capture contextual correlation information of the time series and provide long-term and short-term dependency modeling, making the prediction more accurate. The autocorrelation attention mechanism performs an autocorrelation operation on the input data, which is a numerical differencing technique. Calculating the autocorrelation of a time series helps to convert a non-stationary time series into a stationary form, to eliminate the dependence of the series on time, and to stabilize the time series by eliminating variations in its level, that is, changes in the mean of the series. The autocorrelation attention mechanism may use scaled dot-product attention to compute the context mapping matrix and then compute the context embedding of each attention head, approximated by a low-rank matrix, which reduces the time and space complexity of standard Transformer attention from O(N^2) to O(N), with higher memory and time efficiency.
Step S103: and inputting the historical time series data into the prediction model as input data, obtaining the autocorrelation characteristics of the historical time series data based on a linear mapping autocorrelation attention module of the prediction model, and obtaining a prediction sequence in a preset time period after the current time based on the autocorrelation characteristics of the historical time series data.
In the step, historical time series data are input into the prediction model, the autocorrelation characteristics of the historical time series data are obtained through a linear mapping autocorrelation attention module of the prediction model, the autocorrelation characteristics are used for representing the context information of the historical time series data, then the autocorrelation characteristics of the historical time series data are predicted by the prediction model, and the prediction sequence in a preset time period after the current time is obtained.
According to the time series prediction method, historical time series data are analyzed through a prediction model constructed based on a Transformer network, and a prediction sequence in a preset time period after the current time is obtained; the attention layer of the prediction model constructed based on the Transformer network comprises a linear mapping autocorrelation attention module, which can acquire autocorrelation features of the historical time series data and capture contextual correlation information of the historical time series data, making the prediction more accurate, and which reduces the complexity of standard Transformer attention from O(N^2) to O(N), improving memory efficiency and time efficiency and making the prediction model better suited to long-sequence data.
In an alternative embodiment, the linear mapping autocorrelation attention module of the Transformer network-based predictive model includes a linear mapping sub-module and an autocorrelation attention sub-module. The autocorrelation attention submodule is used for performing autocorrelation mapping on one or more of a Query matrix (Q matrix for short), a Key matrix (K matrix for short) and a Value matrix (V matrix for short) corresponding to the historical time series data.
The Query matrix, the Key matrix and the Value matrix are linearly transformed from an input matrix X corresponding to historical time series data, and the calculation formula is as follows:
Q = X · W^Q, K = X · W^K, V = X · W^V
wherein W^Q, W^K and W^V are three trainable parameter matrices; the input matrix X is multiplied by W^Q, W^K and W^V respectively, and the Query matrix, the Key matrix and the Value matrix are generated through linear transformation.
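As an illustrative sketch only (PyTorch, not part of the original disclosure), the linear transformations above can be written as follows; the sizes seq_len = 96 and d_model = 512 are assumptions for the example, not values specified by this embodiment:

```python
# Sketch of producing Q, K, V from the input matrix X with three trainable
# projection matrices, as described above. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

seq_len, d_model = 96, 512          # assumed sizes for illustration
X = torch.randn(seq_len, d_model)   # input matrix built from the time series

W_Q = nn.Linear(d_model, d_model, bias=False)  # trainable W^Q
W_K = nn.Linear(d_model, d_model, bias=False)  # trainable W^K
W_V = nn.Linear(d_model, d_model, bias=False)  # trainable W^V

Q, K, V = W_Q(X), W_K(X), W_V(X)    # linear transforms of X
print(Q.shape, K.shape, V.shape)    # each is (seq_len, d_model)
```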
In this embodiment, the autocorrelation mapping performed by the autocorrelation attention submodule on one or more of the Query matrix, the Key matrix and the Value matrix includes a first-order autocorrelation mapping and/or a second-order autocorrelation mapping of one or more of the Query matrix, the Key matrix and the Value matrix. The first-order autocorrelation is defined as follows: when the time index of the independent variable changes from t to t+1, the change Δy_t of the time series characteristic value y = y(t) is called the first-order autocorrelation of the time series characteristic function y(t) at time t:
Δy_t = y(t+1) - y(t) = y_{t+1} - y_t
For the input matrix X, the first-order AutoCorrelation Function is denoted ACF(X).
The second-order autocorrelation is defined as the autocorrelation of the first-order autocorrelation when the time index of the independent variable changes from t to t+1:
Δ(Δy_t) = Δy_{t+1} - Δy_t = (y_{t+2} - y_{t+1}) - (y_{t+1} - y_t)
Δ(Δy_t) is called the second-order autocorrelation of the time series characteristic function y(t) at time t. For the input matrix X, the second-order autocorrelation function is denoted ACF^2(X).
Based on the definitions of the first-order and second-order autocorrelation functions, the first-order autocorrelation attention mechanism applied to the V matrix is taken as an example, where the i-th attention head is:
head_i = softmax( (Q · W_i^Q) · (E_i · K · W_i^K)^T / sqrt(d_k) ) · F_i · ACF(V) · W_i^V
wherein W_i^Q, W_i^K and W_i^V represent weight matrices, and E_i and F_i represent linear projection transformation matrices, E_i ∈ R^(dim_reduce_k × seq_len) and F_i ∈ R^(dim_reduce_k × seq_len); dim_reduce_k represents the dimension k of the reduced matrix, and dim_reduce_k × seq_len represents the size of the constructed E_i and F_i; the size of the softmax term is seq_len × dim_reduce_k, and the size of F_i · ACF(V) · W_i^V is dim_reduce_k × d_model. The first-order autocorrelation mapping performed on the V matrix is ACF(V), that is, ACF(V)_t = V_{t+1} - V_t.
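A minimal PyTorch sketch of one such attention head is given below for illustration. The sizes, the scaling by sqrt(d_k), the placement of the ACF mapping before the W_i^V projection, the edge padding inside acf(), and the normal-distribution initialization scale of E_i and F_i are all assumptions of this sketch rather than details confirmed by the original text:

```python
# Hedged sketch of a linear-mapping autocorrelation attention head: E_i and
# F_i project the sequence dimension from seq_len down to dim_reduce_k, and
# acf() is the first-order autocorrelation (difference) mapping applied to V.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def acf(x):
    """First-order autocorrelation mapping: x_{t+1} - x_t along dim 0
    (the last step is edge-padded to keep the sequence length; assumption)."""
    d = x[1:] - x[:-1]
    return torch.cat([d, d[-1:]], dim=0)

seq_len, d_model, d_k, dim_reduce_k = 96, 512, 64, 32   # assumed sizes
Q = torch.randn(seq_len, d_model)
K = torch.randn(seq_len, d_model)
V = torch.randn(seq_len, d_model)

W_q = nn.Linear(d_model, d_k, bias=False)       # per-head weight matrices
W_k = nn.Linear(d_model, d_k, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)
E_i = nn.Parameter(torch.randn(dim_reduce_k, seq_len) * 0.02)  # linear projections,
F_i = nn.Parameter(torch.randn(dim_reduce_k, seq_len) * 0.02)  # initialised from a normal distribution

scores = W_q(Q) @ (E_i @ W_k(K)).T / math.sqrt(d_k)        # (seq_len, dim_reduce_k)
head = F.softmax(scores, dim=-1) @ (F_i @ W_v(acf(V)))     # (seq_len, d_model), O(N) in seq_len
print(head.shape)
```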
the linear mapping submodule is used for carrying out linear mapping on a Key matrix and/or a Value matrix corresponding to the historical time series data based on a preset linear mapping matrix.
A linear mapping is a mapping from one vector space V to another vector space W that preserves the operations of addition and scalar multiplication. In this embodiment, a linear mapping matrix is set in advance, and the Key matrix and/or the Value matrix are linearly mapped by the linear mapping matrix. As an example, the initial values of the linear mapping matrix may be generated from a normal distribution with a given mean and standard deviation, and the final linear mapping matrix is obtained through learning and training. In the embodiment of the invention, the matrix dot-product operation of the standard Transformer network is improved into a linear operation in matrix space through the linear mapping matrix, forming a matrix decomposition of the linearly mapped attention mechanism; the formed random matrix can be approximated by a low-rank matrix, and the space complexity and time complexity of attention are reduced from O(N^2) to O(N), which greatly reduces memory and space consumption and gives stronger memory and time utilization efficiency.
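As a small illustration of the complexity argument above (an assumption-laden sketch, not the original implementation), projecting the Key/Value matrices with a normally initialised mapping matrix shrinks the sequence dimension before any attention product is formed:

```python
# Sketch: a preset linear mapping matrix, initialised from a normal
# distribution (mean 0, std 0.02 here, an assumed scale) and refined during
# training, maps K and V from length seq_len down to dim_reduce_k, so the
# attention score matrix becomes seq_len x dim_reduce_k instead of
# seq_len x seq_len.
import torch

seq_len, d_k, dim_reduce_k = 96, 64, 32
K = torch.randn(seq_len, d_k)
V = torch.randn(seq_len, d_k)

E = torch.empty(dim_reduce_k, seq_len).normal_(mean=0.0, std=0.02)  # linear mapping for K
F_proj = torch.empty(dim_reduce_k, seq_len).normal_(mean=0.0, std=0.02)  # linear mapping for V

K_low, V_low = E @ K, F_proj @ V     # (dim_reduce_k, d_k): cost is linear in seq_len
print(K_low.shape, V_low.shape)
```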
As an example, taking the first-order autocorrelation mapping of the V matrix, as shown in FIG. 2, from the structural perspective of the algorithm, a first-order autocorrelation mapping conversion is performed on the V matrix, two linear mapping matrices are added when calculating the K matrix and the V matrix, the context mapping matrix is calculated using scaled dot-product attention, and the context embedding of each attention head is calculated, which requires only O(N) time and space complexity. FIG. 3 shows, from the perspective of the matrices, the first-order autocorrelation mapping performed on the Q matrix, the K matrix and the V matrix.
FIG. 4 is another variation of the autocorrelation attention mechanism of the embodiment of the present invention. In FIG. 4, a second-order autocorrelation mapping is first performed on the Q matrix, a first-order autocorrelation mapping is performed on the V matrix, two linear projection matrices are added when calculating the K matrix and the V matrix, the context mapping matrix is calculated using scaled dot-product attention, and the context embedding of each attention head is calculated.
FIG. 5 is another variation of the autocorrelation attention mechanism of an embodiment of the present invention. In fig. 5, first-order autocorrelation mapping is performed on the Q matrix, the K matrix, and the V matrix, two linear projection matrices are added when calculating the K matrix and the V matrix, a context mapping matrix is calculated using scaled dot product attention, and context embedding of each attention head is calculated. It should be noted that the above description is only a preferred embodiment of the present invention, and other modifications are not described herein.
In an alternative embodiment, the attention layer of the prediction model constructed based on the Transformer network further comprises one or more of the following: a sparse attention module, a Nyström attention module and a residual attention module. The linear mapping autocorrelation attention module, the sparse attention module, the Nyström attention module and the residual attention module can be called through preset target parameters.
In the process of implementing the time series prediction method of the embodiment of the invention, a qualitative analysis of the probability distribution of the multi-head attention of the standard Transformer network shows that only a small number of positions contribute significantly to the attention, while a large number of positions contribute little. Therefore, the embodiment of the invention constructs a sparse attention module and introduces a sparse bias into the attention calculation, so as to reduce the computational complexity and improve network performance. As an example, the sparse attention module may employ location-based sparse self-attention (atomic sparse attention, aggregated sparse attention, spread sparse attention), content-based sparse self-attention, and the like. Furthermore, a sparse matrix of the same size as the Query matrix is introduced, which contains only the TopK query vectors of the sparsity-evaluation samples, so that self-attention only needs to compute the inner product for the sampled query-key pairs, controlling the time and space complexity to O(NlogN). The sparse attention module of this embodiment adopts content-based sparse attention: the time series characteristic values sampled for the attention features are determined by the TopK most similar characteristic values, the time and space complexity is controlled to O(NlogN), the dominant features are highlighted, and the robustness of the distillation operation is enhanced; the number of self-attention extraction layers is reduced iteratively, and the outputs of all stacks of the encoder are concatenated to obtain the final hidden vector representation of the encoder. The prediction data of future time steps are generated in parallel by the decoder rather than step by step, which greatly improves the inference speed of long-sequence prediction and effectively captures the accurate long-range dependency between output and input.
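A minimal sketch of a content-based TopK sparse attention step in the spirit described above is shown below; the exact sampling rule, the value of k and the shapes are assumptions of this illustration, not the embodiment's own implementation:

```python
# Sketch: for each query, only the k most similar keys contribute, which
# bounds the per-query work and highlights the dominant features.
import math
import torch
import torch.nn.functional as F

def topk_sparse_attention(Q, K, V, k=25):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # (L_q, L_k) full scores
    top_val, top_idx = scores.topk(k, dim=-1)              # keep the k best keys per query
    weights = F.softmax(top_val, dim=-1)                   # softmax over the kept keys only
    return torch.einsum("qk,qkd->qd", weights, V[top_idx]) # gather values and weight-sum

L, d = 96, 64
out = topk_sparse_attention(torch.randn(L, d), torch.randn(L, d), torch.randn(L, d))
print(out.shape)   # torch.Size([96, 64])
```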
The Nyström attention module draws on the method and idea of Nyström matrix decomposition, using the Nyström method to approximate the standard self-attention calculation. The Nyström method is an efficient technique for accelerating large-scale learning applications by generating low-rank approximations; it rests on the crucial assumption that a matrix can be well approximated by processing only a subset of it. The linear attention of the standard Transformer model reduces the complexity of attention through the associative law of matrix multiplication. The prediction model constructed based on the Transformer network in the embodiment of the invention clusters and transforms the Query matrix and the Key matrix (Q, K ∈ R^(n×d)) to form new matrices, constructs the Attention matrix through the Nyström dual-Softmax form, and gradually finds a linear approximation of Attention that is closer to standard Attention. Experiments show that the Nyström approximate attention module is competitive with the standard self-attention module.
Residual attention module: compared with the Pre-LN (Pre-Layer Normalization) and Post-LN (Post-Layer Normalization) of the standard Transformer, the residual attention module adds a residual connection between the attention modules, so that the performance of Post-LN is maintained while residual connections are fused; experiments show that the residual attention module is superior to the standard Transformer architecture.
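One way to realize a residual connection between attention modules, sketched here under assumptions (the raw attention scores of the previous layer are added to those of the current layer before the softmax, in the spirit of residual attention); the original text does not specify the exact fusion rule:

```python
# Sketch of a residual link between successive attention layers.
import math
import torch
import torch.nn.functional as F

def residual_attention(Q, K, V, prev_scores=None):
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    if prev_scores is not None:
        scores = scores + prev_scores        # residual connection between attention modules
    return F.softmax(scores, dim=-1) @ V, scores

L, d = 96, 64
out1, s1 = residual_attention(torch.randn(L, d), torch.randn(L, d), torch.randn(L, d))
out2, s2 = residual_attention(torch.randn(L, d), torch.randn(L, d), torch.randn(L, d), prev_scores=s1)
```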
In an alternative embodiment, the prediction model built based on the Transformer network employs an encoder-decoder architecture.
As shown in fig. 6, the encoder of the prediction model includes an input embedding layer, a position encoding layer, the attention layer, a regularization layer, and a feedforward layer. The decoder of the prediction model comprises an output embedding layer, the position encoding layer, the attention layer, the regularization layer, the feedforward layer, a linear translation layer and a Softmax layer.
Wherein the input embedding layer is used for converting the historical time series data into a vector form.
The position coding layer is used for determining a position coding feature vector of each datum in the historical time series data, and it adopts trigonometric-function coding. The position coding layer provides the capability to express both absolute position information and relative position information, injecting the relative or absolute position information of the sequence into the input sequence.
To handle the sequence-order problem, the solution of the standard Transformer network is to use Position Encoding (PE): absolute position encoding maps a fixed position vector for each position in the sequence, and the embedded word vector and the position vector are then added to obtain the final input vector of each time step, which serves as the input at the bottom of the encoder and decoder stacks. The position coding layer of the embodiment of the invention adopts trigonometric-function position encoding: each dimension of the position encoding corresponds to a sinusoid, the wavelengths form a geometric progression from 2π to N·2π (N is a positive integer, such as 10000), the position encoding is realized with sine and cosine functions, sine encoding is used at even positions, and cosine encoding is used at odd positions. Specifically, the position coding layer determines the position coding feature vector by the following formula:
PE(pos, 2i) = sin(pos / N^(2i / d_model))
PE(pos, 2i+1) = cos(pos / N^(2i / d_model))
wherein PE represents the position-coding feature vector, pos represents the position of the data with index i in the historical time series data, N represents a positive integer, and d_model represents the model dimension of the prediction model.
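As an illustrative sketch (PyTorch, with seq_len, d_model and N = 10000 assumed for the example), the trigonometric position encoding above can be generated as follows:

```python
# Sketch of the trigonometric position encoding: sine at even dimensions,
# cosine at odd dimensions, wavelengths forming a geometric progression
# controlled by N.
import torch

def position_encoding(seq_len: int, d_model: int, N: int = 10000) -> torch.Tensor:
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimension indices
    angle = pos / N ** (i / d_model)                                # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # even positions: sine encoding
    pe[:, 1::2] = torch.cos(angle)   # odd positions: cosine encoding
    return pe

print(position_encoding(96, 512).shape)   # torch.Size([96, 512])
```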
The attention layer comprises the linear mapping autocorrelation attention module described in the above embodiments, and may further comprise one or more of a sparse attention module, a Nyström attention module and a residual attention module.
To solve the problem of deviation in data distribution, the standard Transformer network introduces a regularization (normalization) layer to keep the input distribution of each neural layer consistent during training. Different from Batch Normalization, the regularization layer of the prediction model in the embodiment of the invention realizes Layer Normalization: the data input to the same layer are aggregated, their mean and variance are calculated, and the input data of each layer are normalized accordingly, which accelerates the convergence of the deep network.
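A minimal sketch of the Layer Normalization step described above (the learnable scale and shift parameters and the epsilon value are simplifications of this illustration):

```python
# Sketch: each input vector is normalized with the mean and variance computed
# over its own features, with no dependence on batch statistics.
import torch

def layer_norm(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)

x = torch.randn(8, 96, 512)          # (batch, seq_len, d_model)
print(layer_norm(x).shape)
```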
In an alternative embodiment, the step S103 inputs the historical time-series data as input data into a prediction model, and obtains an autocorrelation feature of the historical time-series data based on a linear mapping autocorrelation attention module of the prediction model, including:
inputting historical time sequence data serving as input data into a prediction model, and converting the historical time sequence data into a vector form by using an input embedding layer to obtain an input data vector and a global time characteristic vector of the historical time sequence data;
determining a position coding feature vector of the historical time sequence data by using the position coding layer;
obtaining an input representation vector based on the input data vector, the global temporal feature vector and the position coding feature vector;
and inputting the input representation vector into a linear mapping autocorrelation attention module of a prediction model to obtain autocorrelation characteristics of historical time sequence data.
The historical time series data acquired in step S101 is data without explicit time series information. In order to enable the prediction model to analyze the historical time series data, the embodiment of the present invention provides a unified input representation: as shown in FIG. 7, the features of the local position encoding (Local Time Stamp), the global time feature vector (Global Time Stamp) and the input data vector (Value Embedding) are integrated and combined to form a new input representation vector. The input data vector is obtained by encoding the historical time series data with a preset encoding method, the global time feature vector is obtained by encoding the corresponding time of the historical time series data, and the position-coding feature vector is obtained by encoding the local position of the historical time series data. The local position encoding injects the sequential position information of the marks in the sequence, embedding the time series context information through fixed positions. In the embodiment of the invention, the position coding layer uses sine and cosine functions to realize the position encoding, using sine encoding at even bits and cosine encoding at odd bits, and injects the local position encoding information of the sequence into the input sequence, so that the prediction model of the embodiment of the invention has the capability of learning time sequence information. The global hierarchical time encoding injects the global time features (minute, hour, day, week, month, year) of the marked sequence; the prediction model embeds the local position features and the global time features in the input vector at the bottom of the encoder and decoder stacks, and the input representation vector of the encoder model is obtained as follows:
x_feed[i] = α · VE_i + PE_i + Σ_{p=1..n} GE_i[p]
wherein i ∈ {1, …, L_x}, α is a factor for balancing the magnitudes of the input data vector, the local position-coding feature vector and the global time feature vector (here α = 1), VE represents the input data vector (Value Embedding), PE represents the position-coding feature vector (Local Time Stamp), GE represents the global time feature vector (Global Time Stamp), and n represents the number of types of global time features, such as minute, hour, day, week, month and year.
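The combination above can be sketched as follows; the embedding modules, the number of variables and time-feature types, and the use of a single linear embedding over all global time features (instead of a per-type sum) are assumptions of this illustration:

```python
# Sketch of the unified input representation: value embedding + local
# trigonometric position encoding + global time-feature embedding.
import torch
import torch.nn as nn

seq_len, d_model, n_vars, n_time_feats = 96, 512, 7, 5   # assumed sizes
value_embedding = nn.Linear(n_vars, d_model)             # VE: series values -> d_model
time_embedding = nn.Linear(n_time_feats, d_model)        # GE: minute/hour/day/week/month features

x = torch.randn(1, seq_len, n_vars)              # raw historical values
x_mark = torch.randn(1, seq_len, n_time_feats)   # global time features of each step

pos = torch.arange(seq_len).unsqueeze(1)         # local trigonometric position encoding (PE)
i = torch.arange(0, d_model, 2)
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(pos / 10000 ** (i / d_model))
pe[:, 1::2] = torch.cos(pos / 10000 ** (i / d_model))

alpha = 1.0                                      # balancing factor, per the text
x_feed = alpha * value_embedding(x) + pe.unsqueeze(0) + time_embedding(x_mark)
print(x_feed.shape)                              # torch.Size([1, 96, 512])
```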
In an optional embodiment, the feedforward layer of the prediction model comprises a first feedforward module and/or a second feedforward module, and the first feedforward module and the second feedforward module are called by preset specified parameters. The activation function of the first feed-forward module includes one or more of: ELU function, GELU _ Fast function, GELU _ new function, swish function, tanh function and Sigmoid function; the second feedforward module is a convolutional neural network structure.
In the structure of the standard Transformer network, simply stacking attention modules causes the problems of hierarchy collapse and uniform inductive bias of the tokens, and the FFN (Feed-Forward Network) constructed in the embodiment of the invention can alleviate these problems. The feedforward layer comprises a first feedforward module and/or a second feedforward module, and the first feedforward module and the second feedforward module are called through preset specified parameters. In an alternative embodiment, the prediction model constructed based on the Transformer network may also remove the feedforward layer, thereby simplifying the network.
The second feedforward module is a convolutional neural network structure. The convolutional neural network is composed of neurons with learnable weights and biases; each neuron receives the sequence input information, the convolutional layer stacks a plurality of filters together, and the convolution output result is obtained through two layers of one-dimensional convolution and a Dropout operation. As the convolutional neural network deepens, more complex global features are combined. Compared with the FFN network, this has higher implementation efficiency and reduces the number of parameters in the network. Dropout means randomly disabling the weights of some hidden-layer nodes of the network during model training; the disabled nodes are temporarily regarded as not being part of the network structure, but their weights are retained (only temporarily not updated), and they may work again the next time a sample is input.
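A sketch of such a convolutional feedforward module is given below; the kernel size, hidden width, dropout rate and the choice of activation are assumptions of this illustration rather than values stated by the embodiment:

```python
# Sketch: two 1-D convolutions with Dropout, applied along the time dimension.
import torch
import torch.nn as nn

class ConvFeedForward(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.conv1 = nn.Conv1d(d_model, d_ff, kernel_size=1)
        self.conv2 = nn.Conv1d(d_ff, d_model, kernel_size=1)
        self.dropout = nn.Dropout(dropout)
        self.activation = nn.GELU()          # assumed activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq_len, d_model)
        y = x.transpose(1, 2)                # Conv1d expects (batch, channels, seq_len)
        y = self.dropout(self.activation(self.conv1(y)))
        y = self.dropout(self.conv2(y))
        return y.transpose(1, 2)

print(ConvFeedForward()(torch.randn(8, 96, 512)).shape)   # torch.Size([8, 96, 512])
```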
Based on the prediction model described above, the processing of the time series prediction method of the embodiment of the present invention may include input vector representation, encoder data conversion, decoder decoding and other processes. In general, a sliding window of a given size is preset; for an input vector X^t = {x^t_1, …, x^t_{L_x}} at time t, the prediction model outputs the corresponding prediction sequence Y^t = {y^t_1, …, y^t_{L_y}}, where the input and output feature dimensions may include a plurality of features.
Input vector representation process: the historical time series data is data without explicit time series information. In order to enable the prediction model to analyze it, the embodiment of the invention provides a unified input representation: as shown in FIG. 7, the features of the local position encoding (Local Time Stamp), the global time feature vector (Global Time Stamp) and the input data vector (Value Embedding) are integrated and combined to form a new input representation vector. The local position encoding injects the sequential position information of the marks in the sequence, embedding the time series context information through fixed positions. The global hierarchical time encoding injects the global time features (minute, hour, day, week, month, year) of the marked sequence; the prediction model embeds the local position features and the global time features in the input vector at the bottom of the encoder and decoder stacks, and the input representation vector of the encoder model is obtained as follows:
x_feed[i] = α · VE_i + PE_i + Σ_{p=1..n} GE_i[p]
wherein i ∈ {1, …, L_x}, α is a factor for balancing the magnitudes of the input data vector, the local position-coding feature vector and the global time feature vector (here α = 1), VE represents the input data vector (Value Embedding), PE represents the position-coding feature vector (Local Time Stamp), GE represents the global time feature vector (Global Time Stamp), and n represents the number of types of global time features, such as minute, hour, day, week, month and year.
Encoder data conversion: the TeleTransformer encoder extracts the dependencies of the long-sequence data, and the t-th sequence of feed data is represented as a matrix X^t_feed ∈ R^(L_x × d_model).
the attention layer of the prediction model of the embodiment of the present invention employs a linear mapping autocorrelation attention module based on first and second order autocorrelation attention mechanisms, such as a first order self-attention head of a V vector:
Figure BDA0003849939930000143
wherein,
Figure BDA0003849939930000144
W i Q
Figure BDA0003849939930000145
representing a weight matrix, E i And F i Representing a linear projective transformation matrix, E i ∈R dim_reduce_k×seq_ten
Figure BDA0003849939930000146
dim _ reduce _ k represents dimension k of the reduced-dimension matrix, and dim _ reduce _ k _ seq _ len represents constructed E i And F i The size of (a) is smaller than (b),
Figure BDA0003849939930000147
is seq _ len x dim _ reduce _ k,
Figure BDA0003849939930000148
is of size dim _ reduce _ k × d model
The linear mapping autocorrelation attention module of the embodiment of the invention performs approximation through a low-rank matrix, reducing the time and space complexity of standard Transformer attention from O(N^2) to O(N), with higher memory and time efficiency.
Decoder data conversion process: the decoder of the prediction model of this embodiment uses a stack of 2 multi-head attention layers to form the decoder structure, and the following vector is fed into the decoder of the prediction model:
X_de = Concat(X_token, X_0)
wherein X_token indicates the start of the prediction sequence; as shown in FIG. 7, it samples a sequence of length L_token from the input sequence, representing the historical input information before the prediction sequence is output. X_0 represents the preset values of the predicted target sequence. Here, token denotes the characteristic value of each time step in the input sequence, for example the broadband online number at each time step, and 0 means that the values of the future prediction time steps are preset to 0. As a specific example, where the historical time series data are the broadband online number and the dispatch amount observed over a past period of time (e.g., the past 80 hours), the start marker of the prediction sequence may be the broadband online number and the dispatch amount observed over the last several hours (e.g., the last 8 hours), and the prediction sequence may be the broadband online number and the dispatch amount within a number of hours in the future (e.g., the next 5 hours).
The decoding modes of the traditional Seq2Seq model include search-based decoding (greedy search, beam search) and sampling-based decoding (random sampling, top-k sampling). The greedy-search decoding method predicts only one value at each time step: the value of the second time step is predicted from the result of the encoder and the predicted value of the first time step, the value of the third time step is generated from the result of the encoder and the values of the first two time steps, and so on, so this method cannot obtain a global optimum. The prediction model of the embodiment of the invention adopts a parallel prediction method to realize parallel training of the decoder: the characteristic values of the whole input time series are input into the decoder, n predicted values are calculated in parallel based on a masked self-attention algorithm, corresponding respectively to the outputs at n moments, and all results can be predicted in a single step. The prediction model of the embodiment of the invention sets the mask in the Masked Multi-Head Attention to negative infinity (-∞), so that the ProbSparse self-attention prediction of the target sequence prevents each position from attending to future positions; during prediction, the model can predict only on the basis of the historical input information, avoiding autoregression at prediction time. MAE (Mean Absolute Error) is selected as the loss function for the prediction of the target sequence, and the mean absolute error is back-propagated from the output of the decoder to the whole model.
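A minimal sketch of the decoder input construction and the MAE loss described above (the sequence lengths and the number of variables mirror the 80 h / 8 h / 5 h example but are otherwise assumptions; the decoder itself is represented by a stand-in output):

```python
# Sketch: the last L_token steps of the history are reused as the start
# segment, the L_y future steps are filled with zeros as placeholders, and
# the predicted segment is trained with an MAE (L1) loss.
import torch
import torch.nn.functional as F

L_x, L_token, L_y, n_vars = 80, 8, 5, 2      # assumed: 80 h history, 8 h start token, 5 h horizon
x_enc = torch.randn(1, L_x, n_vars)          # historical broadband-online / dispatch series

x_token = x_enc[:, -L_token:, :]             # start segment sampled from the history
x_zero = torch.zeros(1, L_y, n_vars)         # placeholders for the future time steps
x_dec = torch.cat([x_token, x_zero], dim=1)  # decoder input: (1, L_token + L_y, n_vars)

# after the decoder produces predictions for the last L_y positions in one pass:
y_pred = torch.randn(1, L_y, n_vars)         # stand-in for the decoder output
y_true = torch.randn(1, L_y, n_vars)
loss = F.l1_loss(y_pred, y_true)             # MAE, back-propagated through the model
```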
In an alternative embodiment, after the prediction model is constructed, the hyper-parameter tuning may be performed on the prediction model.
The hyper-parameter tuning is mainly carried out in two respects. Network-design-related parameters: the number of encoder/decoder network layers, the number of attention heads, the hidden-layer neuron settings, and the sequence-length parameter. Parameters related to the model training process: the mini-batch size, the learning rate, the number of iterations, and so on. The results of the hyper-parameter tuning experiments of the embodiment of the invention are shown in FIG. 8 and Table 1 below. The experiments show that the MAE and MSE indicators of the model are very sensitive to the number of attention heads and the size of the model dimension, and that not only the number of encoder/decoder network layers but also the mini-batch size is very important. In this embodiment, tuning the TeleTransformer parameters toward best practice improves the generalization capability of the model and accelerates the convergence of model training.
Table 1: hyper-parameter tuning results.
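For orientation only, a hypothetical search space covering the two groups of parameters named above is sketched below; the values are illustrative placeholders and are not the values reported in Table 1 or used in the experiments:

```python
# Illustrative (hypothetical) hyper-parameter search space.
search_space = {
    # network-design parameters
    "encoder_layers":  [2, 3, 4],
    "decoder_layers":  [1, 2],
    "attention_heads": [4, 8, 16],
    "d_model":         [256, 512],
    "seq_len":         [48, 96, 168],
    # training-process parameters
    "batch_size":      [16, 32, 64],
    "learning_rate":   [1e-4, 5e-4, 1e-3],
    "epochs":          [10, 20],
}
```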
fig. 9 is a schematic structural diagram of a time-series prediction apparatus 900 according to an embodiment of the present invention, and as shown in fig. 9, the time-series prediction apparatus 900 includes:
a data obtaining module 901, configured to obtain historical time-series data;
a model building module 902, configured to build a prediction model based on a Transformer network, where the prediction model includes an attention layer, the attention layer includes a linear mapping autocorrelation attention module, and the linear mapping autocorrelation attention module is configured to obtain autocorrelation characteristics of input data of the prediction model;
a prediction module 903, configured to input the historical time-series data into the prediction model as input data, obtain an autocorrelation characteristic of the historical time-series data based on a linear mapping autocorrelation attention module of the prediction model, and obtain a prediction sequence in a preset time period after the current time based on the autocorrelation characteristic of the historical time-series data.
In an optional embodiment, the prediction module is further configured to input the historical time-series data into the prediction model as input data, convert the historical time-series data into a vector form by using an input embedding layer of the prediction model, and obtain an input data vector and a global time feature vector of the historical time-series data; determining a position coding feature vector of the historical time series data by utilizing a position coding layer of the prediction model; obtaining an input representation vector based on the input data vector, the global temporal feature vector, and the position-coded feature vector; and inputting the input representation vector into a linear mapping autocorrelation attention module of the prediction model to obtain autocorrelation characteristics of the historical time series data.
The device can execute the method provided by the embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.
An embodiment of the present invention further provides an electronic device, as shown in fig. 10, including a processor 1001, a communication interface 1002, a memory 1003 and a communication bus 1004, where the processor 1001, the communication interface 1002, and the memory 1003 complete mutual communication through the communication bus 1004,
a memory 1003 for storing a computer program;
the processor 1001 is configured to implement the following steps when executing the program stored in the memory 1003:
acquiring historical time series data;
building a prediction model based on a Transformer network, wherein the prediction model comprises an attention layer, the attention layer comprises a linear mapping autocorrelation attention module, and the linear mapping autocorrelation attention module is used for acquiring autocorrelation characteristics of input data of the prediction model;
and inputting the historical time series data into the prediction model as input data, obtaining the autocorrelation characteristics of the historical time series data based on a linear mapping autocorrelation attention module of the prediction model, and obtaining a prediction sequence in a preset time period after the current time based on the autocorrelation characteristics of the historical time series data.
The communication bus 1004 mentioned in the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 1004 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 1002 is used for communication between the above-described terminal and other devices.
The memory 1003 may include a Random Access Memory (RAM), and may also include a non-volatile memory, such as at least one disk memory. Alternatively, the memory may be at least one storage device located remotely from the processor 1001.
The processor 1001 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In yet another embodiment of the present invention, a computer-readable medium is further provided, which has instructions stored therein, which when run on a computer, cause the computer to perform the time series prediction method described in any of the above embodiments.
In yet another embodiment, a computer program product containing instructions is provided, which when run on a computer, causes the computer to perform the time series prediction method as described in any of the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to be performed in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on differences from other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (12)

1. A method for time series prediction, comprising:
acquiring historical time series data;
constructing a prediction model based on a Transformer network, wherein the prediction model comprises an attention layer, the attention layer comprises a linear mapping autocorrelation attention module, and the linear mapping autocorrelation attention module is used for acquiring autocorrelation characteristics of input data of the prediction model;
and inputting the historical time series data into the prediction model as input data, obtaining the autocorrelation characteristics of the historical time series data based on a linear mapping autocorrelation attention module of the prediction model, and obtaining a prediction sequence in a preset time period after the current time based on the autocorrelation characteristics of the historical time series data.
2. The method of claim 1, wherein the linear mapping autocorrelation attention module of the predictive model comprises a linear mapping sub-module and an autocorrelation attention sub-module; the linear mapping submodule is used for carrying out linear mapping on a key matrix and/or a value matrix corresponding to the historical time sequence data based on a preset linear mapping matrix; the autocorrelation attention sub-module is used for performing autocorrelation mapping on one or more of a query matrix, a key matrix and a value matrix corresponding to the historical time series data.
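For illustration only, the following PyTorch sketch shows one way the two submodules of claim 2 could be organized: a learned linear map compresses the key and value matrices along the time axis, so the attention map is L x k rather than L x L, and an FFT-based autocorrelation of the query series is fused into the attention output. The class and parameter names (LinearMapAutoCorrAttention, proj_len, and so on) are illustrative assumptions and are not taken from the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearMapAutoCorrAttention(nn.Module):
    # Illustrative sketch, not the patented implementation.
    # - Linear-mapping submodule: project K and V from sequence length L down to
    #   proj_len, reducing the space and time cost of the attention map.
    # - Autocorrelation submodule: derive lag-domain autocorrelation features of
    #   the query series via FFT and fuse them into the output.
    def __init__(self, d_model: int, seq_len: int, proj_len: int):
        super().__init__()
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.k_map = nn.Linear(seq_len, proj_len, bias=False)  # maps L -> proj_len for keys
        self.v_map = nn.Linear(seq_len, proj_len, bias=False)  # maps L -> proj_len for values
        self.scale = d_model ** -0.5

    def autocorrelation(self, series: torch.Tensor) -> torch.Tensor:
        # Wiener-Khinchin: autocorrelation = IFFT(FFT(x) * conj(FFT(x))),
        # computed per feature channel along the time axis.
        freq = torch.fft.rfft(series, dim=1)
        return torch.fft.irfft(freq * torch.conj(freq), n=series.size(1), dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        k = self.k_map(k.transpose(1, 2)).transpose(1, 2)          # (B, proj_len, D)
        v = self.v_map(v.transpose(1, 2)).transpose(1, 2)          # (B, proj_len, D)
        scores = torch.matmul(q, k.transpose(1, 2)) * self.scale   # (B, L, proj_len)
        out = torch.matmul(F.softmax(scores, dim=-1), v)           # (B, L, D)
        return out + self.autocorrelation(q)                       # fuse autocorrelation features


# Example: a batch of 2 series, 96 time steps, model dimension 64.
attn = LinearMapAutoCorrAttention(d_model=64, seq_len=96, proj_len=24)
y = attn(torch.randn(2, 96, 64))  # -> shape (2, 96, 64)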
3. The method of claim 2, wherein the attention layer of the prediction model further comprises one or more of: a sparse attention module, a [Formula FDA0003849939920000011] attention module, and a residual attention module; the linear mapping autocorrelation attention module, the sparse attention module, the [Formula FDA0003849939920000012] attention module, and the residual attention module are invoked through preset target parameters.
4. The method of claim 3, wherein the prediction model employs a coder-decoder structure;
the encoder comprises an input embedding layer, a position encoding layer, the attention layer, a regularization layer and a feed-forward layer; the decoder includes an output embedding layer, the position encoding layer, the attention layer, the regularization layer, the feed-forward layer, a linear conversion layer, and an activation layer.
5. The method of claim 4, wherein the position encoding layer of the prediction model is used to determine a position encoding feature vector for each data item in the historical time series data, and wherein the position encoding layer performs encoding by means of trigonometric functions.
6. The method of claim 5, wherein the position encoding layer determines the position encoding feature vector of each data item in the historical time series data according to the following formula (1):
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{N^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{N^{2i/d_{model}}}\right)    (1)
where PE denotes the position encoding feature vector, pos denotes the position in the historical time series data of the data item with index i, N denotes a positive integer, and d_{model} denotes the model dimension of the prediction model.
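As a minimal sketch of formula (1), assuming the conventional sine/cosine scheme with an even model dimension, the position encoding feature vectors could be computed as follows; the function name and the default base n = 10000 are illustrative choices rather than values fixed by the claim.

import torch


def sinusoidal_position_encoding(seq_len: int, d_model: int, n: int = 10000) -> torch.Tensor:
    # Trigonometric position encoding in the spirit of formula (1):
    # even dimensions use sine, odd dimensions use cosine.
    # Assumes an even d_model; n = 10000 is the usual choice for N.
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (L, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # (D/2,)
    angle = pos / (float(n) ** (i / d_model))                       # (L, D/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe  # (L, D); added to each input representation vector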
7. The method according to claim 5, wherein the feedforward layer of the predictive model comprises a first feedforward module and/or a second feedforward module, the first feedforward module and the second feedforward module being invoked by preset specified parameters;
the activation function of the first feed-forward module comprises one or more of: an ELU function, a GELU_Fast function, a GELU_new function, a Swish function, a Tanh function, and a Sigmoid function;
the second feedforward module is a convolutional neural network structure.
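As a hedged sketch of how the two feed-forward variants could be selected by a preset parameter, the factory below switches between a two-layer linear block with a configurable activation and a 1-D convolutional block; the names build_feed_forward, ffn_type, and activation are assumptions, nn.SiLU stands in for the Swish function, and the GELU_Fast and GELU_new variants are omitted because they have no built-in PyTorch counterpart.

import torch.nn as nn

_ACTIVATIONS = {
    "elu": nn.ELU(), "gelu": nn.GELU(), "swish": nn.SiLU(),
    "tanh": nn.Tanh(), "sigmoid": nn.Sigmoid(),
}


def build_feed_forward(d_model: int, d_ff: int,
                       ffn_type: str = "linear", activation: str = "gelu") -> nn.Module:
    if ffn_type == "linear":
        # First feed-forward module: two linear layers with a selectable activation.
        return nn.Sequential(nn.Linear(d_model, d_ff), _ACTIVATIONS[activation],
                             nn.Linear(d_ff, d_model))
    # Second feed-forward module: a convolutional structure over the time axis;
    # inputs are expected in (batch, d_model, seq_len) layout for nn.Conv1d.
    return nn.Sequential(nn.Conv1d(d_model, d_ff, kernel_size=1), _ACTIVATIONS[activation],
                         nn.Conv1d(d_ff, d_model, kernel_size=1))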
8. The method of claim 7, wherein the inputting the historical time-series data as input data into the predictive model, obtaining autocorrelation characteristics of the historical time-series data based on a linear mapping autocorrelation attention module of the predictive model, comprises:
inputting the historical time sequence data serving as input data into the prediction model, and converting the historical time sequence data into a vector form by using the input embedding layer to obtain an input data vector and a global time characteristic vector of the historical time sequence data;
determining a position encoding feature vector of the historical time series data by using the position encoding layer;
obtaining an input representation vector based on the input data vector, the global temporal feature vector, and the position-coded feature vector;
and inputting the input representation vector into a linear mapping autocorrelation attention module of the prediction model to obtain autocorrelation characteristics of the historical time series data.
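One plausible sketch of the input representation of claim 8 sums a value embedding of the raw series, an embedding of its global time features, and the trigonometric position encoding; it reuses the sinusoidal_position_encoding helper sketched under claim 6, and every module and argument name here is an illustrative assumption.

import torch
import torch.nn as nn


class InputRepresentation(nn.Module):
    # Illustrative sketch: input representation vector =
    # input data vector + global time feature vector + position encoding feature vector.
    def __init__(self, n_vars: int, n_time_feats: int, d_model: int, max_len: int = 5000):
        super().__init__()
        self.value_emb = nn.Linear(n_vars, d_model)        # input data vector
        self.time_emb = nn.Linear(n_time_feats, d_model)   # global time feature vector
        self.register_buffer("pe", sinusoidal_position_encoding(max_len, d_model))

    def forward(self, x: torch.Tensor, time_feats: torch.Tensor) -> torch.Tensor:
        # x: (B, L, n_vars); time_feats: (B, L, n_time_feats), e.g. month/day/hour covariates
        length = x.size(1)
        return self.value_emb(x) + self.time_emb(time_feats) + self.pe[:length]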
9. The method according to claim 8, wherein the decoder obtains the prediction sequence within a preset time period after the current time by using a generative parallel prediction mode;
the form of the input data of the decoder is shown as the following formula (2):
[Formula FDA0003849939920000022]    (2)
where [Formula FDA0003849939920000023] denotes the placeholder corresponding to the predicted sequence, and [Formula FDA0003849939920000024] denotes the start character of the predicted sequence, the start character being a time sub-sequence sampled from the historical time series data.
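A minimal sketch of the generative parallel decoder input, assuming an Informer-style construction in which the start character is the tail of the observed history and the placeholder is a zero tensor spanning the prediction window; the function and argument names are illustrative.

import torch


def build_decoder_input(history: torch.Tensor, label_len: int, pred_len: int) -> torch.Tensor:
    # Generative, one-pass decoding in the spirit of formula (2): the decoder is fed
    # the last `label_len` observed steps (the start character, sampled from the
    # historical series) concatenated with a zero placeholder of length `pred_len`,
    # so the whole prediction window is produced in parallel.
    start_token = history[:, -label_len:, :]                       # (B, label_len, D)
    placeholder = torch.zeros(history.size(0), pred_len, history.size(2),
                              dtype=history.dtype, device=history.device)
    return torch.cat([start_token, placeholder], dim=1)            # (B, label_len + pred_len, D)


# Example: condition on the last 48 observed steps and predict the next 24.
x_dec = build_decoder_input(torch.randn(2, 96, 7), label_len=48, pred_len=24)  # (2, 72, 7)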
10. A time-series prediction apparatus, comprising:
the data acquisition module is used for acquiring historical time series data;
the model building module is used for building a prediction model based on a Transformer network, the prediction model comprises an attention layer, the attention layer comprises a linear mapping autocorrelation attention module, and the linear mapping autocorrelation attention module is used for acquiring autocorrelation characteristics of input data of the prediction model;
and the prediction module is used for inputting the historical time series data into the prediction model as input data, obtaining the autocorrelation characteristics of the historical time series data based on a linear mapping autocorrelation attention module of the prediction model, and obtaining a prediction sequence in a preset time period after the current time based on the autocorrelation characteristics of the historical time series data.
11. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-9.
12. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-9.
CN202211134256.2A 2022-09-16 2022-09-16 Time series prediction method, time series prediction device, electronic device, and medium Pending CN115423080A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211134256.2A CN115423080A (en) 2022-09-16 2022-09-16 Time series prediction method, time series prediction device, electronic device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211134256.2A CN115423080A (en) 2022-09-16 2022-09-16 Time series prediction method, time series prediction device, electronic device, and medium

Publications (1)

Publication Number Publication Date
CN115423080A true CN115423080A (en) 2022-12-02

Family

ID=84204988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211134256.2A Pending CN115423080A (en) 2022-09-16 2022-09-16 Time series prediction method, time series prediction device, electronic device, and medium

Country Status (1)

Country Link
CN (1) CN115423080A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024183565A1 (en) * 2023-03-03 2024-09-12 华为技术有限公司 Time sequence data prediction method and apparatus, and storage medium
CN116612393A (en) * 2023-05-05 2023-08-18 北京思源知行科技发展有限公司 Solar radiation prediction method, system, electronic equipment and storage medium
CN116629142A (en) * 2023-07-24 2023-08-22 合肥工业大学 Lightning positioning track prediction method, system and storage medium based on transformer mechanism
CN116629142B (en) * 2023-07-24 2023-09-29 合肥工业大学 Lightning positioning track prediction method and system based on transformer mechanism
CN117236504A (en) * 2023-09-22 2023-12-15 大唐融合通信股份有限公司 Sewage water quality prediction method and device and readable storage medium
CN117094451A (en) * 2023-10-20 2023-11-21 邯郸欣和电力建设有限公司 Power consumption prediction method, device and terminal
CN117094452A (en) * 2023-10-20 2023-11-21 浙江天演维真网络科技股份有限公司 Drought state prediction method, and training method and device of drought state prediction model
CN117094451B (en) * 2023-10-20 2024-01-16 邯郸欣和电力建设有限公司 Power consumption prediction method, device and terminal
CN117094452B (en) * 2023-10-20 2024-02-06 浙江天演维真网络科技股份有限公司 Drought state prediction method, and training method and device of drought state prediction model
WO2024179168A1 (en) * 2023-12-12 2024-09-06 天翼云科技有限公司 Fault prediction method and apparatus based on feature fusion, and device and storage medium
CN118079592A (en) * 2024-04-23 2024-05-28 河北金甲节能环保科技有限公司 Pressure swing adsorption automatic control system and method thereof

Similar Documents

Publication Publication Date Title
CN115423080A (en) Time series prediction method, time series prediction device, electronic device, and medium
Zhang et al. Short-term traffic flow prediction based on spatio-temporal analysis and CNN deep learning
Ren et al. A wide-deep-sequence model-based quality prediction method in industrial process analysis
CN112733444A (en) Multistep long time sequence prediction method based on CycleGAN neural network
CN117094451B (en) Power consumption prediction method, device and terminal
Li et al. A PLS-based pruning algorithm for simplified long–short term memory neural network in time series prediction
Genet et al. A Temporal Kolmogorov-Arnold Transformer for Time Series Forecasting
Donoso-Oliva et al. ASTROMER-A transformer-based embedding for the representation of light curves
CN115238855A (en) Completion method of time sequence knowledge graph based on graph neural network and related equipment
Li et al. Robust and flexible strategy for missing data imputation in intelligent transportation system
CN111027681B (en) Time sequence data processing model training method, data processing method, device and storage medium
CN116227562A (en) Timing point process prediction method and system based on graph neural network and transducer
CN117787470A (en) Time sequence prediction method and system based on EWT and integration method
Feng et al. Spatiotemporal prediction based on feature classification for multivariate floating-point time series lossy compression
Xu et al. Time series prediction via recurrent neural networks with the information bottleneck principle
Zheng et al. Quantized minimum error entropy with fiducial points for robust regression
Bharti et al. Transformer-Based Multivariate Time Series Forecasting
Zhang et al. Integration of Mamba and Transformer--MAT for Long-Short Range Time Series Forecasting with Application to Weather Dynamics
Zhan et al. GMINN: A Generative Moving Interactive Neural Network for Enhanced Short-Term Load Forecasting in Modern Electricity Markets
Heng et al. Load forecasting method based on CEEMDAN and TCN-LSTM
CN117856211A (en) Agent electricity purchase load prediction method, system, equipment and storage medium based on MT-IEM-Dscformer
CN118503798B (en) Text instruction intention recognition method and device based on natural language processing
Zhang et al. Stacked Denoising Autoencoders Based Poisson Regression For Count Data Modeling
CN118504792B (en) Charging station cluster load prediction method and system with exogenous variable depth fusion
Zhou et al. Cybersecurity Situational Awareness Model Using Improved LSTM-Informer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination