CN113177633A - Deep decoupling time sequence prediction method - Google Patents

Deep decoupling time sequence prediction method

Info

Publication number
CN113177633A
CN113177633A
Authority
CN
China
Prior art keywords
time series
term
representation
encoder
short
Prior art date
Legal status
Granted
Application number
CN202110426703.0A
Other languages
Chinese (zh)
Other versions
CN113177633B (en)
Inventor
陈岭
陈纬奇
张友东
文波
杨成虎
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN202110426703.0A
Publication of CN113177633A
Application granted
Publication of CN113177633B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Abstract

The invention discloses a deep decoupling time series prediction method comprising the following steps: 1) preprocess the given time series data to construct a training data set; 2) capture the global variation patterns shared by multiple time series with a vector-quantized global feature encoder; 3) capture the local variation patterns specific to a single time series with a local feature encoder, where each time series has its own set of local feature encoder parameters generated by an adaptive parameter generation module; 4) input the outputs of the global and local feature encoders to a decoder for prediction. The invention decouples the dynamics of a time series into global and local variation patterns and models them separately. It addresses the problems that existing models can neither fully exploit the knowledge shared within a data set nor adequately model the local variation pattern specific to a single time series, thereby improving prediction accuracy, and has broad application prospects in fields such as traffic prediction, supply chain management and financial investment.

Description

Deep decoupling time sequence prediction method
Technical Field
The invention relates to the field of time series data prediction, in particular to a deep decoupling time series prediction method.
Background
Time series are ubiquitous in fields such as traffic, electricity, medical treatment and finance. Time series prediction (i.e., predicting future observations based on the observations over a period of historical time) is an important research topic in data mining. In today's big-data era, a single time series rarely exists in isolation: a data set usually contains multiple correlated time series, which exhibit both global (shared by multiple time series) and local (specific to a single time series) variation patterns. As shown in fig. 1, the road usage time series of all roads share the same period (24 hours) and have a morning peak and an evening peak, i.e., a global variation pattern; road 1 has slight morning and evening peaks, road 2 has a distinct morning peak and no evening peak, road 3 has a slight morning peak and a distinct evening peak, and road 4 has strong morning and evening peaks, i.e., local variation patterns. A good time series prediction model should capture both kinds of patterns simultaneously.
Time series prediction models based on statistical machine learning, such as AR, ARIMA, exponential smoothing and linear state space models, are trained and make predictions on a single time series; they cannot model the variation patterns common to a multivariate time series data set and therefore cannot benefit from this global knowledge.
Classical deep learning models, such as prediction models based on RNNs, TCNs and Transformers, are currently the most widely used class of methods in this field. Such models use all the data in a data set to train one set of shared model parameters, treating the information of all time series equally. However, capturing global information through simple parameter sharing in this way is insufficient, because at prediction time the model only takes the historical data of a single time series as input and cannot explicitly introduce global information or information from related sequences.
Some recent approaches use matrix factorization to represent the original time series as linear combinations of k latent time series (with k much smaller than the number of time series in the data set), capturing the common patterns of the multivariate time series through the latent series. However, matrix factorization operates in a linear feature space and cannot capture complex global variation patterns.
Disclosure of Invention
In view of the foregoing, an object of the present invention is to provide a deep decoupling time series prediction method, which can improve the prediction accuracy of time series data while reducing the computation consumption by effectively modeling global and local variation patterns of time series.
In order to achieve the purpose, the invention provides the following technical scheme:
a deep decoupling time series prediction method is applied to prediction of time series data in the traffic field, the electric power field, the medical field and the financial field, and comprises the following steps:
collecting a time sequence, and preprocessing the time sequence to obtain a preprocessed time sequence;
the method comprises the steps of constructing a time sequence prediction model, wherein the time sequence prediction model comprises a global feature encoder, an adaptive parameter generation module, a local feature encoder and a decoder, the global feature encoder is used for encoding a time sequence into a global feature representation, the adaptive parameter generation module is used for generating local feature encoder parameters according to the time sequence, the local feature encoder encodes the time sequence into a local feature representation based on the loaded local feature encoder parameters, and the decoder is used for decoding the splicing result of the global feature representation and the local feature representation and outputting predicted time sequence data;
and performing parameter optimization on the time series prediction model by using the time series data, and using the time series prediction model with the optimized parameters for prediction of the time series.
Preferably, the preprocessing includes outlier detection and removal, missing value supplementation, and normalization processing.
Preferably, the global feature encoder includes a short-term feature extractor constructed from a convolutional neural network, a vector quantization module, and a Transformer encoder composed of multiple stacked attention modules. The short-term feature extractor performs short-term feature extraction on the input time series to obtain a short-term representation of the series; the vector quantization module performs vector-quantized encoding of the input short-term representation to obtain an encoded vector; and the Transformer encoder models the long-term dependencies in the whole time series based on the encoded vector and outputs the global feature representation of the series.
Preferably, the adaptive parameter generation module encodes the time series using multi-view contrastive coding and outputs the local feature encoder parameters.
Preferably, the adaptive parameter generation module includes a context recognition network and a parameter generation network. The context recognition network comprises a convolution module, a Transformer encoder and an LSTM aggregator connected in sequence and maps the time series to a context hidden variable; the parameter generation network comprises a fully connected network and generates the parameters of the local feature encoder from the context hidden variable.
Preferably, the parameters of the local feature encoder are not involved in training and are generated by an adaptive parameter generation module, the local feature encoder includes a short-term feature extractor and a Transformer encoder composed of a plurality of attention modules stacked, wherein the short-term feature extractor is configured to perform short-term feature extraction on the input time series to obtain a short-term representation of the time series, and the Transformer encoder is configured to model a long-term dependency relationship in the entire time series based on the short-term representation and output a local feature representation of the time series.
Preferably, the decoder comprises a convolution module and a plurality of same attention modules, wherein the convolution module is used for performing convolution operation on the result of splicing the input global feature representation and the local feature representation, and the attention module is used for performing connection calculation based on the convolution result and outputting the predicted time series data.
Preferably, the loss function L used for optimizing the parameters of the time series prediction model is:

L = L_pred + L_CMC + L_VQ

where L_pred is the prediction objective function, L_CMC is the multi-view contrastive coding objective function and L_VQ is the vector quantization constraint objective function.
compared with the prior art, the invention has the beneficial effects that at least:
according to the deep decoupling time sequence prediction method provided by the invention, the dynamic decoupling of the time sequence is a global change mode and a local change mode, and the global change mode and the local change mode are respectively modeled by using a global feature encoder and a local feature encoder, so that a vector quantization global encoder is used for learning an encoding table representing the global change mode, the knowledge shared in a data set is fully utilized for modeling the global change mode, a self-adaptive parameter generation module is used for generating a specific local feature encoder parameter for each time sequence, and the heterogeneous local change mode is effectively modeled. And improving the prediction precision of the time series based on the global and local change modes.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a time-series diagram of road utilization in the background art;
FIG. 2 is an overall flowchart of a deep decoupling time series prediction method provided by the embodiment;
FIG. 3 is an overall block diagram of a deep decoupled time series prediction method provided by an embodiment;
FIG. 4 is a learning process of a global representation and a short-term representation provided by an embodiment;
fig. 5 is a schematic structural diagram of a convolutional Transformer decoder according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the invention is described in further detail below with reference to the accompanying drawings and examples. It should be understood that the detailed description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the invention.
In order to improve the prediction accuracy of time series, this embodiment provides a deep decoupling time series prediction method: the dynamics of a time series are decoupled into a global variation pattern and a local variation pattern, each is modeled separately, and time series prediction is then performed based on the two patterns. The deep decoupling time series prediction method can be applied in the traffic, electric power, medical and financial fields; that is, the time series may be data such as road traffic flow, user electricity consumption or stock prices.
FIG. 2 is an overall flowchart of the deep decoupling time series prediction method provided by the embodiment; fig. 3 is an overall block diagram of the method. As shown in fig. 2 and fig. 3, the deep decoupling time series prediction method provided by the embodiment includes the following steps:
step 1, acquiring time sequence data, performing abnormal value elimination processing and normalization processing on the acquired time sequence, and dividing the processed data by using a sliding time window to obtain a training data set.
Outlier detection and removal are performed on the given time series, and invalid values (such as values outside the normal range and missing values) are filled using linear interpolation. All values in the time series are then min-max normalized so that each value falls in the range [-1, 1], with the conversion formula:

x' = 2 (x - x_min) / (x_max - x_min) - 1

where x is a value in the original time series, x_min and x_max are the minimum and maximum values in the time series, and x' is the normalized value.
The time window size T is set manually from experience, and the normalized data are divided with a fixed-length sliding step to obtain the training data set.
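As an illustrative sketch of this preprocessing (the window size, step length and toy series below are assumed example values, not those of the patent):

```python
import numpy as np

def normalize(x):
    # Min-max normalization to the range [-1, 1]:
    # x' = 2 * (x - x_min) / (x_max - x_min) - 1
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return 2.0 * (x - x_min) / (x_max - x_min) - 1.0

def sliding_windows(x, window, step=1):
    # Divide a normalized series into fixed-length training samples.
    return np.stack([x[i:i + window]
                     for i in range(0, len(x) - window + 1, step)])

series = normalize([10.0, 12.0, 11.0, 15.0, 14.0, 13.0, 20.0, 18.0])
samples = sliding_windows(series, window=4, step=2)  # shape (3, 4)
```

Missing-value interpolation (e.g. `np.interp` over the valid indices) would precede the normalization in practice.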
Step 2, the training data set is batched according to a fixed batch size, with a total of N batches.

The training data set is batched according to an empirically set batch size M; the total number of batches is computed as:

N = ceil( N_samples / M )

where N_samples is the total number of samples in the training data set.
Step 3, a batch of training samples with index k is selected in sequence from the training data set, where k ∈ {0, 1, …, N - 1}. Steps 4-10 are repeated for each training sample in the batch.
Step 4, the sample time series x_{1:T} is input to the vector-quantized global feature encoder, which outputs the global feature representation g_{1:T}, and the vector quantization constraint objective function L_VQ is calculated, where T denotes the input step size.
The global feature encoder shown in fig. 3(b) is used for modeling a global change pattern of a time sequence, and specifically includes the following steps:
first, time-series x samples1:TInputting the short-term representation z to a short-term feature extractor consisting of a multilayer 1D convolutional neural network to obtain a short-term representation z of each time in the sequence1:T. The convolution layer has a sliding step size of 1, a padding mechanism is used to make the output step size consistent with the input, and a small-sized convolution kernel is used to capture the short-term change pattern in the sequence.
Then, the short term representation z1:TAn input Vector Quantization (VQ) module q, the VQ module maintaining a representationCoding table of global change mode
Figure BDA0003029855260000065
It contains F d-dimensional vectors, shared among all sequences, to represent the global change pattern of the sequences, the vectors in the encoding table e being called global representation. The VQ module represents z in a short term at each time1:TRespectively mapped into vectors in a coding table e to obtain
Figure BDA0003029855260000066
In particular, it replaces the original short-term representation z by means of a nearest neighbor search
Figure BDA0003029855260000067
The calculation formula is as follows: i ═ arg minj||z-ej||2, wherein ,eiRepresenting the ith vector in the encoding table e. It should be noted that the arg min operation is not differentiable, and therefore is directly related to the objective function
Figure BDA0003029855260000068
Replaces the gradient with respect to z. In particular, in the forward propagation process, the nearest neighbor global representation
Figure BDA0003029855260000069
The input to the downstream network is, in the reverse propagation process,
Figure BDA00030298552600000610
passed back unchanged to the upstream convolutional network. Under such a training mode, if only the prediction objective function is used, the global representation e in the coding table is encodediIt is not updated, so the invention introduces a vector quantization constraint objective function for learning the global representation:
Figure BDA0003029855260000071
wherein sg () is a gradient truncation operation satisfying sg (z) ≡ z,
Figure BDA0003029855260000072
gamma is an adjustable hyper-parameter. As indicated by the dark grey arrows in fig. 4, in predicting the objective function
Figure BDA0003029855260000073
And
Figure BDA0003029855260000074
under the combined action of z, the short-term representation z selects a suitable global representation, and z and
Figure BDA0003029855260000075
the difference is as small as possible; as indicated by the light grey arrows in figure 4,
Figure BDA0003029855260000076
driving the global representation in the encoding table to move towards the original short-term representation z, when there are multiple original short-term representations mapped to the same global representation, the global representation tends towards the cluster center of these original representations, which also makes the global representation learned by the VQ module more representative.
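For illustration, the nearest-neighbor lookup and the two terms of the vector quantization constraint objective can be sketched in NumPy (the coding-table size F, dimension d and random inputs are arbitrary example values; in the model the two loss terms differ only in gradient flow, so they coincide numerically here):

```python
import numpy as np

rng = np.random.default_rng(0)
F, d, T = 8, 4, 16
e = rng.normal(size=(F, d))      # coding table: F global representations of dim d
z = rng.normal(size=(T, d))      # short-term representations z_1:T

# Nearest-neighbor search: i = argmin_j ||z - e_j||_2
d2 = ((z[:, None, :] - e[None, :, :]) ** 2).sum(axis=-1)
idx = d2.argmin(axis=1)
z_q = e[idx]                     # quantized (global) representations

# Vector quantization constraint objective; sg() only affects gradients,
# so both terms evaluate to the same number in this forward-only sketch.
gamma = 0.25
codebook_term = ((z - z_q) ** 2).sum(axis=-1).mean()   # ~ ||sg(z) - z_q||^2
commit_term = ((z - z_q) ** 2).sum(axis=-1).mean()     # ~ ||z - sg(z_q)||^2
l_vq = codebook_term + gamma * commit_term
```

In a trainable implementation the straight-through trick is typically written as `z + stop_gradient(z_q - z)` so that gradients bypass the arg min.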
Finally, a Transformer encoder is used to model the long-term dependencies in the whole sequence and to output a long-term representation for each time step of the sequence, namely the global feature representation g_{1:T}.
The Transformer encoder is composed of multiple stacked attention modules; a single module comprises a multi-head self-attention layer and a feed-forward network formed by two fully connected layers (the first layer uses a ReLU activation function, the second a linear activation function). The attention mechanism can be expressed as a mapping from a query and a set of key/value pairs to an output: it computes the matching degree between the query and each key, assigns a weight coefficient to the value corresponding to each key, and finally outputs the weighted sum of the values. The calculation is:

Attention(Q, K, V) = SoftMax( Q K^T / sqrt(d_k) ) V

where Q denotes the queries, K the keys and V the values. The calculation proceeds in three steps: first, compute the inner product of the query with each key and divide by the factor sqrt(d_k) to obtain unnormalized weight coefficients, where the factor sqrt(d_k) plays a scaling role and prevents the gradient of the SoftMax function from vanishing; second, normalize the weight coefficients with the SoftMax function; third, take the weighted sum of the values with the normalized coefficients to obtain the final output. The multi-head attention mechanism actually computes several groups of attention and concatenates the results of the groups to obtain the output:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O,  head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where Concat denotes the tensor concatenation operation, and W_i^Q, W_i^K and W_i^V are the mapping matrices of the i-th attention group, mapping the original query, key and value into their corresponding spaces. In the multi-head self-attention layer of the invention, the query Q, the key K and the value V are identical: in the first layer they are the output ẑ_{1:T} of the vector quantization module, and in subsequent layers they are the output of the previous attention module. The long-term representation produced by the Transformer encoder is g_{1:T}. The feed-forward network is a fixed substructure of the Transformer model and mainly performs a spatial mapping.
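The attention computation above can be sketched numerically (shapes and the two-head split are arbitrary example values):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    ea = np.exp(a)
    return ea / ea.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = SoftMax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(1)
T, d = 6, 8
x = rng.normal(size=(T, d))      # in self-attention, Q, K and V coincide
out = attention(x, x, x)

# Two-head attention: run two smaller attentions and concatenate the results.
h = 2
Wq, Wk, Wv = (rng.normal(size=(h, d, d // h)) for _ in range(3))
multi = np.concatenate(
    [attention(x @ Wq[i], x @ Wk[i], x @ Wv[i]) for i in range(h)], axis=-1)
```

The output projection W^O and the feed-forward sublayer are omitted for brevity.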
Step 5, the sample time series x_{1:T} is input to the adaptive parameter generation module based on multi-view contrastive coding, which generates the local feature encoder parameters φ, and the multi-view contrastive coding objective function L_CMC is calculated.

As shown in fig. 3(a), the adaptive parameter generation module based on multi-view contrastive coding includes a context recognition network and a parameter generation network. The specific steps for generating the local feature encoder parameters φ are as follows:
First, x_{1:T} is input to the context recognition network, which comprises a convolution module, a Transformer encoder and an LSTM aggregator, and maps the input sequence to a context hidden variable c. The multi-view contrastive coding (CMC) method and its corresponding KL-divergence regularization enable c to fully retain the information of the sequence's local variation pattern while filtering out global information (since the latter has already been modeled in the global feature encoder). CMC uses a contrastive learning method to maximize the mutual information between the context hidden variable c output by the LSTM aggregator and both the short-term representation v^(sh) output by the convolution module and the long-term representation v^(lo) output by the Transformer encoder, so that c effectively captures the information specific to the original sequence (its specific long-/short-term variation patterns).

CMC solves two contrastive learning tasks, each of which requires, given the context hidden variable c, selecting the correct short-term representation or long-term representation from among interference terms; the corresponding objective functions are L_sh and L_lo. In addition, a regularization term L_KL ensures that c filters out global information.
Take the short-term contrastive task as an example. Given the output c of the context recognition network and a set V^(sh) = {v_t^(sh), ṽ_1^(sh), …, ṽ_K^(sh)} that includes the short-term representation v_t^(sh) of the input time series at time t (the positive sample) and K interference terms (negative samples), the model must identify the correct short-term representation from the set V^(sh). The interference terms are obtained by uniformly sampling, over all time steps, the short-term representations of the other sample time series in the same batch. The short-term contrastive objective function is defined as:

L_sh = - E_{t∼U(T)} [ log ( exp(f_1(c, v_t^(sh)) / ε) / Σ_{v∈V^(sh)} exp(f_1(c, v) / ε) ) ]

where f_1 is an evaluation function structured as a two-layer MLP (the first layer with a ReLU activation function, the second with a linear activation function); it is applied to the concatenation of the context hidden variable c and a short-term representation and measures the degree of match between the two representations. ε is the temperature parameter of the SoftMax, and U(T) denotes uniform sampling over times 1, 2, …, T.
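A numerical sketch of this kind of contrastive objective (a plain dot-product score stands in for the two-layer MLP evaluation function f_1, and all sizes are arbitrary example values):

```python
import numpy as np

def info_nce(c, positive, negatives, eps=0.1):
    # -log( exp(f(c, v+)/eps) / sum_v exp(f(c, v)/eps) )
    # with a dot-product evaluation function standing in for the MLP f_1.
    cand = np.vstack([positive[None, :], negatives])  # positive sample first
    scores = cand @ c / eps
    scores -= scores.max()                            # numerical stability
    log_prob = scores - np.log(np.exp(scores).sum())
    return -log_prob[0]

rng = np.random.default_rng(2)
c = rng.normal(size=8)                   # context hidden variable
v_pos = c + 0.1 * rng.normal(size=8)     # matching short-term representation
v_neg = rng.normal(size=(5, 8))          # K = 5 interference terms
loss = info_nce(c, v_pos, v_neg)
```

A well-matched positive drives the loss toward zero; random positives leave it near log(K + 1).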
Similarly, given c and a set V^(lo) = {v_t^(lo), ṽ_1^(lo), …, ṽ_K^(lo)}, the long-term contrastive objective function is:

L_lo = - E_{t∼U(T)} [ log ( exp(f_2(c, v_t^(lo)) / ε) / Σ_{v∈V^(lo)} exp(f_2(c, v) / ε) ) ]

where f_2 is an evaluation function with the same structure as f_1. A property of contrastive learning is:

I(c; x) ≥ log(K + 1) - L_sh

where I(c; x) denotes the mutual information between c and x. Minimizing L_sh (and likewise L_lo) can therefore be seen as maximizing a lower bound on I(c; x), so that the context hidden variable c sufficiently retains the local (unique) information of the sequence x.
To make c filter out global information, the invention uses KL-divergence regularization:

L_KL = KL( q(c | x) || N(0, I) )

where q(c | x) is the Gaussian posterior distribution of c output by the context recognition network. The regularization term introduces a standard normal prior for c, whose goal is to constrain the amount of information carried by c to be as small as possible. Under the combined action of the two objectives of maximizing mutual information and minimizing the information content of c, global information is automatically filtered out of c, while the local information whose retention maximizes the mutual information is kept. The final multi-view contrastive objective function is computed as:

L_CMC = L_sh + L_lo + α · L_KL

where α is an adjustable hyper-parameter.
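The KL term for a diagonal-Gaussian posterior against the standard normal prior N(0, I) has a closed form; a small sketch (the diagonal-Gaussian parameterization and the example numbers are assumptions for illustration):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) )
    # = 0.5 * sum( mu^2 + exp(log_var) - log_var - 1 )
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0)

mu = np.array([0.5, -0.2, 0.0])
log_var = np.array([0.0, -0.5, 0.1])
l_kl = kl_to_standard_normal(mu, log_var)

# Final multi-view contrastive objective: L_CMC = L_sh + L_lo + alpha * L_KL
alpha, l_sh, l_lo = 0.1, 1.2, 0.9  # example values, not trained quantities
l_cmc = l_sh + l_lo + alpha * l_kl
```

The KL term is zero exactly when the posterior equals the prior, which is what pushes c to discard information not needed by the contrastive tasks.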
Finally, the context hidden variable c is input to the parameter generation network, an MLP composed of multiple fully connected layers (the hidden layers use the ReLU activation function and the output layer a linear activation function), and is mapped to the parameters φ of the local feature encoder.
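The parameter generation network is essentially a small hypernetwork: an MLP whose output vector is reshaped into the weights of the local feature encoder. A minimal sketch, generating the parameters of a single linear layer (all layer sizes are assumed example values):

```python
import numpy as np

rng = np.random.default_rng(3)

def relu(a):
    return np.maximum(a, 0.0)

# Context hidden variable c -> parameters phi of one linear layer
# of the local feature encoder, i.e. a (d_in, d_out) weight plus a bias.
d_c, d_in, d_out, hidden = 16, 8, 8, 32
W1 = rng.normal(size=(d_c, hidden)) * 0.1
W2 = rng.normal(size=(hidden, d_in * d_out + d_out)) * 0.1

def generate_parameters(c):
    phi = relu(c @ W1) @ W2                  # ReLU hidden layer, linear output
    W = phi[: d_in * d_out].reshape(d_in, d_out)
    b = phi[d_in * d_out:]
    return W, b

c = rng.normal(size=d_c)
W_loc, b_loc = generate_parameters(c)        # loaded into the local encoder
y = relu(rng.normal(size=d_in) @ W_loc + b_loc)
```

Only W1 and W2 (the generator) would be trained; W_loc and b_loc themselves receive no direct gradient updates, matching the description above.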
Step 6, the parameters φ generated by the adaptive parameter generation module are loaded into the local feature encoder.
Step 7, the sample time series x_{1:T} is input to the local feature encoder, which outputs the local feature representation l_{1:T}.

The structure of the local feature encoder is consistent with that of the global feature encoder except that it contains no VQ module q, and its parameters φ do not participate in back-propagation adjustment; they are generated directly by the adaptive parameter generation module.

The specific process is as follows: first, the sample time series x_{1:T} is input to a short-term feature extractor consisting of a multi-layer 1D convolutional neural network to obtain a short-term representation z_{1:T} for each time step of the sequence; then, z_{1:T} is input to a Transformer encoder, which models the long-term dependencies in the whole sequence and outputs the local feature representation l_{1:T} for each time step.
Step 8, the global feature representation and the local feature representation are concatenated and input to a convolutional Transformer decoder to obtain the prediction output x̂_{T+1:T+τ}, where τ denotes the prediction step size.

The convolutional Transformer decoder is formed by stacking a convolution module and several identical attention modules; its specific structure is shown in fig. 5. The structure of the convolution module is consistent with that of the convolution module in the encoder, and each attention module comprises: a masked multi-head self-attention layer over the decoder output, into which a backward mask mechanism is introduced to prevent future data from being seen when predicting the data at a given moment; a multi-head attention layer over the encoder output; and a feed-forward network consisting of two fully connected layers (the first layer uses a ReLU activation function, the second a linear activation function). The last attention module outputs the predicted values x̂_{T+1:T+τ} of the future τ steps.
Step 9, the prediction objective function L_pred is calculated, i.e., the error between the true values x_{T+1:T+τ} corresponding to the sample time series and the actually output predicted values x̂_{T+1:T+τ}.

The invention uses the mean absolute error as the prediction objective function L_pred, calculated as:

L_pred = (1/τ) Σ_{t=1}^{τ} | x_{T+t} - x̂_{T+t} |
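The mean absolute error over the τ forecast steps can be computed as, for example:

```python
import numpy as np

def mae(y_true, y_pred):
    # L_pred = (1/tau) * sum_t | x_{T+t} - x_hat_{T+t} |
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_true - y_pred).mean()

l_pred = mae([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.0, 5.0])  # -> 0.625
```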
step 10, calculating a prediction objective function
Figure BDA00030298552600001111
Multi-view contrast coding objective function
Figure BDA00030298552600001110
Sum vector quantization constraint objective function
Figure BDA0003029855260000117
Sum of
Figure BDA00030298552600001112
Step 11, adjusting the network parameters in the entire model according to the loss $\mathcal{L}_{batch}$ of all samples in the batch, where $\mathcal{L}_{batch}$ is the sum of the losses $\mathcal{L}$ of the individual samples in the batch. According to the loss $\mathcal{L}_{batch}$, the learnable parameters $\theta$ in the entire model are adjusted. The update formula is as follows:

$$\theta \leftarrow \theta-\eta \frac{\partial \mathcal{L}_{batch}}{\partial \theta}$$

wherein $\eta$ is the learning rate.
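The update formula above is plain gradient descent; a one-parameter NumPy sketch with a toy quadratic loss (the loss function and learning rate here are illustrative, not from the patent):

```python
import numpy as np

eta = 0.1                    # learning rate (illustrative value)
theta = np.array([4.0])      # a single learnable parameter

def grad(theta):
    """Gradient of the toy loss L(theta) = (theta - 1)^2."""
    return 2.0 * (theta - 1.0)

for _ in range(100):
    theta = theta - eta * grad(theta)   # theta <- theta - eta * dL/dtheta
# theta converges toward the minimizer 1.0
```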
Step 12, repeating steps 3-11 until all batches of the training data set have participated in model training.
Step 13, repeating steps 3-12 until the specified number of iterations is reached.
Step 14, inputting the time series of the sample to be predicted into the trained model to obtain the prediction result.
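Steps 12-13 together form a standard mini-batch training loop; the skeleton below sketches it with stand-in model and loss functions (`forward` and `compute_loss_grad` are placeholders with toy data, not the patent's actual model):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(32, 24))   # 32 toy sample series of length 24
batch_size, n_epochs, eta = 8, 5, 0.01
theta = rng.normal(size=24)        # stand-in for all learnable parameters

def forward(batch, theta):
    """Placeholder for the full prediction model."""
    return batch @ theta

def compute_loss_grad(batch, theta):
    """Placeholder loss: MSE of the predictions against zero targets."""
    pred = forward(batch, theta)
    grad = 2.0 * batch.T @ pred / len(batch)
    return np.mean(pred ** 2), grad

first_loss, _ = compute_loss_grad(data, theta)
for epoch in range(n_epochs):                    # step 13: fixed number of iterations
    for i in range(0, len(data), batch_size):    # step 12: every batch participates
        batch = data[i:i + batch_size]
        loss, grad = compute_loss_grad(batch, theta)  # steps 3-10: forward pass + loss
        theta = theta - eta * grad                    # step 11: gradient update
final_loss, _ = compute_loss_grad(data, theta)
```

On this toy objective the loss decreases monotonically in expectation, mirroring the intended effect of steps 11-13.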
The above-mentioned embodiments are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only the most preferred embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions, equivalents, etc. made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims (8)

1. A deep decoupling time series prediction method, applied to the prediction of time series data in the traffic, power, medical, and financial fields, comprising the following steps:
collecting a time sequence, and preprocessing the time sequence to obtain a preprocessed time sequence;
constructing a time series prediction model, wherein the time series prediction model comprises a global feature encoder, an adaptive parameter generation module, a local feature encoder and a decoder, the global feature encoder is used for encoding the time series into a global feature representation, the adaptive parameter generation module is used for generating local feature encoder parameters according to the time series, the local feature encoder is used for encoding the time series into a local feature representation based on the loaded local feature encoder parameters, and the decoder is used for decoding the concatenation result of the global feature representation and the local feature representation and outputting predicted time series data;
and performing parameter optimization on the time series prediction model by using the time series data, and using the time series prediction model with the optimized parameters for prediction of the time series.
2. The method of deep decoupled time series prediction of claim 1, wherein the preprocessing comprises outlier detection and removal, missing value supplementation, and normalization processing.
3. The method for predicting the deeply decoupled time series according to claim 1, wherein the global feature encoder comprises a short-term feature extractor constructed from a convolutional neural network, a vector quantization module, and a Transformer encoder composed of a plurality of stacked attention modules, wherein the short-term feature extractor is used for performing short-term feature extraction on an input time series to obtain a short-term representation of the time series, the vector quantization module is used for performing vectorized encoding on the input short-term representation to obtain an encoded vector, and the Transformer encoder is used for modeling long-term dependencies in the whole time series based on the encoded vector and outputting the global feature representation of the time series.
4. The method of claim 1, wherein the adaptive parameter generating module is configured to encode the time sequence and output local feature encoder parameters by using a multi-view contrast based encoding method.
5. The method of claim 4, wherein the adaptive parameter generation module comprises a context identification network and a parameter generation network, wherein the context identification network comprises a convolution module, a Transformer encoder and an LSTM aggregator connected in sequence for mapping the time series into context hidden variables, and the parameter generation network comprises a fully connected network for generating parameters of the local feature encoder according to the context hidden variables.
6. The method for predicting the deeply decoupled time series according to claim 1, wherein the parameters of the local feature encoder are not involved in training and are generated by an adaptive parameter generation module, the local feature encoder comprises a short-term feature extractor and a Transformer encoder consisting of a plurality of attention modules stacked together, wherein the short-term feature extractor is used for performing short-term feature extraction on the input time series to obtain a short-term representation of the time series, and the Transformer encoder is used for modeling long-term dependency relationships in the whole time series based on the short-term representation and outputting the local feature representation of the time series.
7. The method of deep decoupled time series prediction of claim 1, wherein the decoder comprises a convolution module for performing a convolution operation on the result of the concatenation of the input global feature representation and the local feature representation and a plurality of identical attention modules for performing a concatenation calculation based on the convolution result and outputting the predicted time series data.
8. The method of deep decoupled time series prediction according to claim 1, characterized in that the loss function $\mathcal{L}$ used in the parameter optimization of the time series prediction model is:

$$\mathcal{L}=\mathcal{L}_{pred}+\mathcal{L}_{mvc}+\mathcal{L}_{vq}$$

wherein $\mathcal{L}_{pred}$ is the prediction objective function, expressed as the mean absolute error between the true and predicted values:

$$\mathcal{L}_{pred}=\frac{1}{\tau}\sum_{t=1}^{\tau}\left|x_{T+t}-\hat{x}_{T+t}\right|$$

$\mathcal{L}_{mvc}$ is the multi-view contrast coding objective function, which contrasts the context hidden variable $c$ against the long-term representation set $V^{(lo)}$ and the short-term representation set $V^{(sh)}$ through the evaluation functions $f_1$ and $f_2$, and regularizes the Gaussian posterior distribution of $c$ with a KL-divergence term weighted by $\alpha$; and $\mathcal{L}_{vq}$ is the vector quantization constraint objective function, expressed as:

$$\mathcal{L}_{vq}=\left\|\operatorname{sg}(z)-\hat{z}\right\|_{2}^{2}+\gamma\left\|z-\operatorname{sg}(\hat{z})\right\|_{2}^{2}$$

wherein $x_{T+t}$ and $\hat{x}_{T+t}$ respectively represent the true value and the predicted value of the time series $t$ steps into the future with respect to time $T$, $\tau$ represents the prediction step length, $f_1$ and $f_2$ are the evaluation functions for contrastive learning, $c$ represents the context hidden variable, the short-term representation produced by the adaptive parameter generation module and the perturbed short-term representation are the representations being contrasted, $\epsilon$ is the temperature parameter of the SoftMax, $\mathcal{U}(T)$ denotes uniform sampling over the time steps $1,2,\ldots,T$, $q(c\mid\cdot)$ is the Gaussian posterior distribution output by the context identification network, $\mathbb{E}$ represents the mathematical expectation, $V^{(lo)}$ represents the set of long-term representations, $V^{(sh)}$ represents the set of short-term representations, $\alpha$ is an adjustable hyper-parameter, $D_{KL}$ represents the KL divergence, $\operatorname{sg}(\cdot)$ is the gradient truncation operation satisfying $\operatorname{sg}(z)=z$ in the forward pass, $\gamma$ is an adjustable hyper-parameter, $z$ represents the short-term representation produced by the global feature encoder, and $\hat{z}$ represents the vectorized encoding result of $z$.
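As an illustrative sketch of the vector quantization module of claim 3 and the constraint objective of claim 8 (a VQ-VAE-style formulation; the codebook, shapes, and the value γ = 0.25 are assumptions made for this example, not taken from the patent):

```python
import numpy as np

def quantize(z, codebook):
    """Map each short-term representation to its nearest codebook vector."""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K) squared distances
    return codebook[d.argmin(axis=1)]

def vq_loss(z, z_hat, gamma=0.25):
    """Codebook term + commitment term. sg() is the identity in the forward
    pass, so both terms are numerically equal here; only their gradients differ."""
    codebook_term = ((z - z_hat) ** 2).sum(-1).mean()  # ||sg(z) - z_hat||^2
    commit_term = ((z - z_hat) ** 2).sum(-1).mean()    # ||z - sg(z_hat)||^2
    return codebook_term + gamma * commit_term

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])   # assumed 2-entry codebook
z = np.array([[0.1, -0.1], [0.9, 1.2]])         # two short-term representations
z_hat = quantize(z, codebook)                    # -> [[0, 0], [1, 1]]
loss = vq_loss(z, z_hat)
```

The gradient truncation only matters during backpropagation: the codebook term moves the codebook toward the encoder outputs, while the commitment term moves the encoder outputs toward the codebook.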
CN202110426703.0A 2021-04-20 2021-04-20 Depth decoupling time sequence prediction method Active CN113177633B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110426703.0A CN113177633B (en) 2021-04-20 2021-04-20 Depth decoupling time sequence prediction method

Publications (2)

Publication Number Publication Date
CN113177633A true CN113177633A (en) 2021-07-27
CN113177633B CN113177633B (en) 2023-04-25

Family

ID=76924167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110426703.0A Active CN113177633B (en) 2021-04-20 2021-04-20 Depth decoupling time sequence prediction method

Country Status (1)

Country Link
CN (1) CN113177633B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762356A (en) * 2021-08-17 2021-12-07 中山大学 Cluster load prediction method and system based on clustering and attention mechanism
CN114239718A (en) * 2021-12-15 2022-03-25 杭州电子科技大学 High-precision long-term time sequence prediction method based on multivariate time sequence data analysis
CN114297379A (en) * 2021-12-16 2022-04-08 中电信数智科技有限公司 Text binary classification method based on Transformer
CN114580710A (en) * 2022-01-28 2022-06-03 西安电子科技大学 Environment monitoring method based on Transformer time sequence prediction
CN114936723A (en) * 2022-07-21 2022-08-23 中国电子科技集团公司第三十研究所 Social network user attribute prediction method and system based on data enhancement
CN115659852A (en) * 2022-12-26 2023-01-31 浙江大学 Layout generation method and device based on discrete potential representation
WO2023070960A1 (en) * 2021-10-29 2023-05-04 中国华能集团清洁能源技术研究院有限公司 Wind power prediction method based on convolutional transformer architecture, and system and device
CN116153089A (en) * 2023-04-24 2023-05-23 云南大学 Traffic flow prediction system and method based on space-time convolution and dynamic diagram
CN116776228A (en) * 2023-08-17 2023-09-19 合肥工业大学 Power grid time sequence data decoupling self-supervision pre-training method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180157771A1 (en) * 2016-12-06 2018-06-07 General Electric Company Real-time adaptation of system high fidelity model in feature space
CN110718301A (en) * 2019-09-26 2020-01-21 东北大学 Alzheimer disease auxiliary diagnosis device and method based on dynamic brain function network
CN111243269A (en) * 2019-12-10 2020-06-05 福州市联创智云信息科技有限公司 Traffic flow prediction method based on depth network integrating space-time characteristics

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Zhichao; Shi Zhiyu; Zhang Jie: "Modal parameter identification of time-varying systems based on improved MCPP" *


Also Published As

Publication number Publication date
CN113177633B (en) 2023-04-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant