CN113177633B - Depth decoupling time sequence prediction method - Google Patents

Depth decoupling time sequence prediction method

Info

Publication number
CN113177633B
CN113177633B
Authority
CN
China
Prior art keywords
road traffic
term
representation
encoder
short
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110426703.0A
Other languages
Chinese (zh)
Other versions
CN113177633A (en)
Inventor
陈岭
陈纬奇
张友东
文波
杨成虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202110426703.0A priority Critical patent/CN113177633B/en
Publication of CN113177633A publication Critical patent/CN113177633A/en
Application granted granted Critical
Publication of CN113177633B publication Critical patent/CN113177633B/en


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/044 — Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 — Combinations of networks
    • G06N 3/08 — Learning methods
    • Y02T 10/40 — Engine management systems (climate change mitigation technologies related to transportation)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a deep decoupled time series prediction method comprising the following steps: 1) preprocess the given time series data to construct a training data set; 2) capture the global variation patterns shared by multiple time series with a vector-quantized global feature encoder; 3) capture the local variation patterns specific to each individual time series with a local feature encoder, where each series has its own set of local feature encoder parameters produced by an adaptive parameter generation module; 4) feed the outputs of the global and local feature encoders to a decoder for prediction. The invention decouples the dynamics of a time series into global and local variation patterns and models them separately, which resolves the inability of existing models to fully exploit the knowledge shared within the data set or to adequately model the local variation pattern specific to a single series, thereby improving prediction accuracy; the method has broad application prospects in traffic prediction, supply chain management, financial investment, and other fields.

Description

Depth decoupling time sequence prediction method
Technical Field
The invention relates to the field of time series data prediction, and in particular to a deep decoupled time series prediction method.
Background
Time series are widely used in the transportation, electricity, medical, and financial fields. Time series prediction, i.e., predicting future observations from observations over a historical period, is an important research topic in data mining. In today's big-data era a single time series rarely exists in isolation; a data set typically contains multiple correlated time series that exhibit both global (shared by multiple series) and local (specific to a single series) variation patterns. As shown in Fig. 1, the road-usage time series of all roads share the same 24-hour period and have a morning peak and an evening peak, i.e., a global variation pattern; road 1 has mild morning and evening peaks, road 2 has a pronounced morning peak and no evening peak, road 3 has a mild morning peak and a pronounced evening peak, and road 4 has strong morning and evening peaks, i.e., local variation patterns. A good time series prediction model should capture both kinds of patterns simultaneously.
Time series prediction models based on statistical machine learning, such as AR, ARIMA, exponential smoothing, and linear state-space models, are trained and make predictions on a single time series. They cannot model the variation patterns shared across a multivariate time series data set and therefore cannot benefit from such global knowledge.
Classical deep learning models, such as RNN-, TCN-, and Transformer-based prediction models, are currently the most widely used class of methods in this field. These models train one set of shared parameters on all data in the data set and treat the information of every time series equally. However, capturing global information through simple parameter sharing alone is insufficient, because at prediction time the model only takes the historical data of a single time series as input, and global information or information from other related series cannot be explicitly introduced for it.
Some recent approaches capture the patterns shared across a multivariate time series data set by using matrix factorization to represent the original time series as linear combinations of k latent time series (k much smaller than the number of series in the data set). However, matrix factorization operates in the feature space and fails to capture complex global variation patterns.
Disclosure of Invention
In view of the foregoing, an object of the present invention is to provide a deep decoupled time series prediction method that improves the prediction accuracy of time series data, while reducing computational cost, by effectively modeling the global and local variation patterns of time series.
In order to achieve the above object, the present invention provides the following technical solutions:
a depth decoupling time sequence prediction method is applied to prediction of time sequence data in traffic fields, electric power fields, medical fields and financial fields, and comprises the following steps:
collecting a time series and preprocessing it to obtain a preprocessed time series;
the method comprises the steps of constructing a time sequence prediction model, wherein the time sequence prediction model comprises a global feature encoder, an adaptive parameter generation module, a local feature encoder and a decoder, the global feature encoder is used for encoding a time sequence into a global feature representation, the adaptive parameter generation module is used for generating local feature encoder parameters according to the time sequence, the local feature encoder is used for encoding the time sequence into a local feature representation based on the loaded local feature encoder parameters, and the decoder is used for decoding a result of splicing the global feature representation and the local feature representation and outputting predicted time sequence data;
optimizing the parameters of the time series prediction model with the time series data, and using the parameter-optimized model for time series prediction.
Preferably, the preprocessing includes outlier detection and removal, missing-value imputation, and normalization.
Preferably, the global feature encoder comprises a short-term feature extractor built from a convolutional neural network, a vector quantization module, and a Transformer encoder formed by stacking several attention modules. The short-term feature extractor extracts short-term features of the input time series to obtain a short-term representation of the series; the vector quantization module quantizes the input short-term representation to obtain an encoded vector; and the Transformer encoder models long-term dependencies over the whole time series based on the encoded vectors and outputs the global feature representation of the series.
Preferably, the adaptive parameter generation module encodes the time series by multi-view contrastive coding and outputs the local feature encoder parameters.
Preferably, the adaptive parameter generation module comprises a context recognition network and a parameter generation network. The context recognition network consists of a convolution module, a Transformer encoder, and an LSTM aggregator connected in sequence, and maps the time series to a context hidden variable; the parameter generation network consists of fully connected layers and generates the parameters of the local feature encoder from the context hidden variable.
Preferably, the parameters of the local feature encoder do not participate in training but are generated by the adaptive parameter generation module. The local feature encoder comprises a short-term feature extractor and a Transformer encoder formed by stacking several attention modules; the short-term feature extractor extracts short-term features of the input time series to obtain a short-term representation of the series, and the Transformer encoder models long-term dependencies over the whole series based on that representation and outputs the local feature representation of the series.
Preferably, the decoder comprises a convolution module and several identical attention modules, wherein the convolution module performs a convolution over the concatenation of the input global and local feature representations, and the attention modules perform attention computations over the convolution result and output the predicted time series data.
Preferably, the loss function $\mathcal{L}$ used in parameter optimization of the time series prediction model is:

$$\mathcal{L} = \mathcal{L}_{pred} + \mathcal{L}_{CMC} + \mathcal{L}_{VQ}$$

where $\mathcal{L}_{pred}$ is the prediction objective, $\mathcal{L}_{CMC}$ the multi-view contrastive coding objective, and $\mathcal{L}_{VQ}$ the vector quantization constraint objective; the three components are defined in the detailed description below.
compared with the prior art, the invention has the beneficial effects that at least the following steps are included:
according to the depth decoupling time sequence prediction method provided by the invention, the dynamic decoupling of the time sequences is a global and local change mode, and the global and local feature encoders are used for modeling respectively, so that the vector quantization global encoder is used for learning the encoding table for representing the global change mode, the global change mode is modeled by fully utilizing the shared knowledge in the data set, the self-adaptive parameter generation module is used for generating the specific local feature encoder parameters of each time sequence, and the heterogeneous local change mode is effectively modeled. The prediction accuracy of the time series is improved based on the global and local variation patterns.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required by the embodiments or by the description of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a road usage time series diagram in the background art;
FIG. 2 is an overall flowchart of a depth decoupling time series prediction method provided by an embodiment;
FIG. 3 is an overall block diagram of a depth decoupling time series prediction method provided by an embodiment;
FIG. 4 is a learning process of a global representation and a short-term representation provided by an embodiment;
FIG. 5 is a schematic diagram of the convolutional Transformer decoder provided by an embodiment.
Detailed Description
The present invention will be described in further detail below with reference to the drawings and embodiments, in order to make its objects, technical solutions, and advantages more apparent. It should be understood that the detailed description is presented by way of example only and is not intended to limit the scope of the invention.
To improve prediction accuracy, this embodiment provides a deep decoupled time series prediction method that decouples the dynamics of a time series into a global variation pattern and a local variation pattern, models the two separately, and predicts based on both. The method can be applied in the traffic, electric power, medical, and financial fields; that is, the time series may be road traffic flow, user electricity consumption, stock prices, and similar data.
FIG. 2 is an overall flowchart of a depth decoupling time series prediction method provided by an embodiment; fig. 3 is an overall block diagram of a depth decoupling time series prediction method provided by an embodiment. As shown in fig. 2 and 3, the depth decoupling time series prediction method provided by the embodiment includes the following steps:
step 1, collecting time sequence data, carrying out outlier elimination processing and normalization processing on the collected time sequence, and dividing the processed data by utilizing a sliding time window to obtain a training data set.
Outlier detection and removal is performed on the given time series, and invalid values (values outside the normal range, and missing values) are filled in by linear interpolation. All values are then min-max normalized into the range [-1, 1] with the conversion formula

$$x' = 2\,\frac{x - x_{min}}{x_{max} - x_{min}} - 1$$

where $x$ is a value in the original time series, $x_{min}$ and $x_{max}$ are the minimum and maximum values in the series, and $x'$ is the normalized value.
The normalized data is then divided with a sliding window of empirically chosen size T and a fixed stride to obtain the training data set, as sketched below.
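The following Python sketch illustrates this preprocessing pipeline; it is a minimal assumed implementation, and the window size, horizon, and stride values are illustrative rather than prescribed by the patent.

```python
import numpy as np

def preprocess(series, t_in=168, t_out=24, stride=1):
    """Clean one raw series and cut it into (input, target) training pairs."""
    x = np.asarray(series, dtype=np.float64)

    # Outlier/missing handling: mark non-finite readings invalid and fill
    # them by linear interpolation over the valid neighbours.
    bad = ~np.isfinite(x)
    idx = np.arange(len(x))
    x[bad] = np.interp(idx[bad], idx[~bad], x[~bad])

    # Min-max normalization into [-1, 1].
    x_min, x_max = x.min(), x.max()
    x = 2.0 * (x - x_min) / (x_max - x_min) - 1.0

    # Sliding window: T history steps as input, tau future steps as target.
    samples = [(x[s:s + t_in], x[s + t_in:s + t_in + t_out])
               for s in range(0, len(x) - t_in - t_out + 1, stride)]
    return samples, (x_min, x_max)
```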
Step 2: split the training data set into batches of a fixed size. Given an empirically chosen batch size M, the total number of batches N is

$$N = \left\lceil \frac{N_{samples}}{M} \right\rceil$$

where $N_{samples}$ is the total number of samples in the training data set.
Step 3: sequentially select from the training data set the batch of training samples with index k, where k ∈ {0, 1, …, N−1}. Steps 4-10 are repeated for each training sample in the batch.
Step 4: input the sample time series $x_{1:T}$ to the vector-quantized global feature encoder, output the global feature representation $h^{(g)}_{1:T}$, and compute the vector quantization constraint objective $\mathcal{L}_{VQ}$, where T denotes the input step size.
The global feature encoder, shown in Fig. 3(b), models the global variation pattern of the time series. The specific steps are as follows:
first, the sample time sequence x 1:T Input to a short-term feature extractor consisting of a multi-layer 1D convolutional neural network to obtain a short-term representation z of each instant in the sequence 1:T . The sliding step of the convolution layer is set to be 1, a padding (padding) mechanism is used, so that the step of output is consistent with the input, and a convolution kernel with a small size is used, so that a short-term change mode in a sequence can be captured.
Next, the short-term representation $z_{1:T}$ is input to a vector quantization (VQ) module $q$, which maintains a coding table $e \in \mathbb{R}^{F \times d}$ representing the global variation patterns. The table contains F d-dimensional vectors that are shared among all sequences and represent the global variation patterns; the vectors in the coding table $e$ are called global representations. The VQ module maps the short-term representation at each instant to a vector in the coding table, yielding the quantized representation $\hat z_{1:T}$. Specifically, it replaces the original short-term representation $z$ by its nearest neighbour $\hat z = e_i$ in the table, computed as

$$i = \arg\min_j \|z - e_j\|_2$$

where $e_i$ denotes the i-th vector of the coding table $e$. Note that the arg min operation is not differentiable, so the gradient with respect to $\hat z$ is used directly in place of the gradient with respect to $z$: during forward propagation the nearest global representation $\hat z$ is fed to the downstream network, and during back-propagation the gradient of $\hat z$ is passed back unchanged to the upstream convolutional network.
In such a training mode, if only the prediction objective were used, the global representations $e_i$ in the coding table would never be updated, so the invention introduces a vector quantization constraint objective for learning the global representations:

$$\mathcal{L}_{VQ} = \frac{1}{T}\sum_{t=1}^{T}\Big(\|\mathrm{sg}(z_t) - \hat z_t\|_2^2 + \gamma\,\|z_t - \mathrm{sg}(\hat z_t)\|_2^2\Big)$$

where $\mathrm{sg}(\cdot)$ is the gradient cut-off (stop-gradient) operation, satisfying $\mathrm{sg}(z) \equiv z$ with $\nabla\,\mathrm{sg}(z) = 0$, and $\gamma$ is an adjustable hyper-parameter. As indicated by the dark grey arrows in Fig. 4, under the combined action of the prediction objective $\mathcal{L}_{pred}$ and $\mathcal{L}_{VQ}$, the short-term representation $z$ selects an appropriate global representation and is constrained to differ from $\hat z$ as little as possible. As indicated by the light grey arrows in Fig. 4, $\mathcal{L}_{VQ}$ drives the global representations in the coding table towards the original short-term representations; when several original short-term representations map to the same global representation, that global representation tends towards their cluster center, which also makes the global representations learned by the VQ module more representative. A sketch of this quantization step appears below.
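A minimal PyTorch sketch of the quantization step with the straight-through gradient trick described above; the codebook size and γ value are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=128, d=64, gamma=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, d)   # coding table e
        self.gamma = gamma

    def forward(self, z):                            # z: (batch, T, d)
        # Nearest-neighbour search: i = argmin_j ||z - e_j||_2
        table = self.codebook.weight[None].expand(z.size(0), -1, -1)
        idx = torch.cdist(z, table).argmin(-1)       # (batch, T)
        z_q = self.codebook(idx)                     # quantized representation

        # VQ constraint: ||sg(z) - z_q||^2 moves the codes toward z;
        # gamma * ||z - sg(z_q)||^2 commits z to its chosen code.
        loss = F.mse_loss(z.detach(), z_q) + self.gamma * F.mse_loss(z, z_q.detach())

        # Straight-through estimator: forward passes z_q, backward copies
        # the gradient of z_q to z unchanged.
        z_q = z + (z_q - z).detach()
        return z_q, loss
```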
Finally, a Transformer encoder models the long-term dependencies over the whole sequence and outputs a long-term representation for each instant, i.e., the global feature representation $h^{(g)}_{1:T}$.
The Transformer encoder is a stack of attention modules; a single module comprises a multi-head self-attention layer and a feed-forward network of two fully connected layers (the first using a ReLU activation and the second a linear activation). The attention mechanism can be viewed as mapping a query and a set of key/value pairs to an output: the matching degree between the query and each key is computed, the value corresponding to each key is given a weight coefficient, and the weighted sum of the values is returned as output.
It is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$ denotes the queries, $K$ the keys, and $V$ the values. The computation proceeds in three steps. First, the inner product of the query with each key is taken and divided by the factor $\sqrt{d_k}$, giving unnormalized weight coefficients; the factor $\sqrt{d_k}$ plays a regulating role and prevents the gradient of the SoftMax function from vanishing. Second, the weight coefficients are normalized by the SoftMax function. Third, the values are summed, weighted by the normalized coefficients, to obtain the final output.
The multi-head attention mechanism computes several attention mechanisms in parallel and concatenates the results to obtain the output, which lets it capture the different types of correlation present in the data:

$$\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)\,W^{O},\qquad \mathrm{head}_i = \mathrm{Attention}\big(QW_i^{Q},\, KW_i^{K},\, VW_i^{V}\big)$$

where Concat denotes the tensor concatenation operation and $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the mapping matrices of the i-th attention head, which project the original queries, keys, and values into the corresponding subspaces. In the multi-head self-attention layers used in the invention, the query Q, key K, and value V are identical: in the first layer they are the output $\hat z_{1:T}$ of the vector quantization module, and in each subsequent layer they are the output of the previous attention module. The encoder outputs the long-term representation $h^{(g)}_{1:T}$. The feed-forward network is a fixed substructure of the Transformer model and mainly performs a spatial mapping. A sketch of these attention computations follows.
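A PyTorch sketch of scaled dot-product and multi-head attention as formulated above; dimensions are illustrative assumptions, and in the self-attention layers the same tensor is passed as q, k, and v.

```python
import math
import torch
import torch.nn as nn

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V; dividing by sqrt(d_k) keeps the SoftMax
    # input scale moderate so its gradient does not vanish.
    d_k = Q.size(-1)
    w = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
    return w @ V

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=64, heads=4):
        super().__init__()
        assert d_model % heads == 0
        self.h, self.d_k = heads, d_model // heads
        self.w_q = nn.Linear(d_model, d_model)   # per-head maps W_i^Q ...
        self.w_k = nn.Linear(d_model, d_model)   # ... W_i^K ...
        self.w_v = nn.Linear(d_model, d_model)   # ... W_i^V, fused per head
        self.w_o = nn.Linear(d_model, d_model)   # output map W^O

    def forward(self, q, k, v):                  # each: (batch, T, d_model)
        B, T, _ = q.shape
        split = lambda x: x.view(B, -1, self.h, self.d_k).transpose(1, 2)
        out = attention(split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v)))
        out = out.transpose(1, 2).contiguous().view(B, T, -1)  # concat heads
        return self.w_o(out)
```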
Step 5: input the sample time series $x_{1:T}$ to the adaptive parameter generation module based on multi-view contrastive coding, generate the local feature encoder parameters $\phi$, and compute the multi-view contrastive coding objective $\mathcal{L}_{CMC}$.
As shown in Fig. 3(a), the adaptive parameter generation module based on multi-view contrastive coding comprises a context recognition network and a parameter generation network. The specific steps for generating the local feature encoder parameters $\phi$ are as follows:

First, $x_{1:T}$ is input to the context recognition network, which consists of a convolution module, a Transformer encoder, and an LSTM aggregator, and which maps the input sequence to a context hidden variable $c$. The multi-view contrastive coding (CMC) method and a corresponding KL-divergence regularizer ensure that $c$ fully preserves the information about the sequence's local variation patterns while filtering out global information (which is already modeled by the global feature encoder). By contrastive learning, CMC maximizes the mutual information between the context hidden variable $c$ output by the LSTM aggregator and both the short-term representation $v^{(sh)}$ output by the convolution module and the long-term representation $v^{(lo)}$ output by the Transformer encoder, so that $c$ effectively captures the information specific to the original sequence (its specific long/short-term variation patterns).
CMC solves two contrastive learning tasks, which require selecting, given the context hidden variable $c$, the correct short-term representation and the correct long-term representation from among interference terms; the corresponding objectives are $\mathcal{L}^{(sh)}$ and $\mathcal{L}^{(lo)}$. In addition, a regularization term $\mathcal{L}_{KL}$ ensures that $c$ filters out global information.
Taking the short-term contrastive task as an example: given the output $c$ of the context recognition network and a set $V^{(sh)} = \{v^{(sh)}_t, \tilde v_1, \dots, \tilde v_K\}$ containing the short-term representation $v^{(sh)}_t$ of the input time series at instant t (the positive sample) and K interference terms $\tilde v_k$ (negative samples), the model must identify the correct short-term representation from the set $V^{(sh)}$. The interference terms are sampled uniformly from the short-term representations, at every instant, of the other sample time series in the same batch. The short-term contrastive objective is defined as

$$\mathcal{L}^{(sh)} = -\,\mathbb{E}_{t\sim u(T)}\left[\log \frac{\exp\big(f_1(c, v^{(sh)}_t)/\epsilon\big)}{\sum_{v \in V^{(sh)}} \exp\big(f_1(c, v)/\epsilon\big)}\right]$$

where $f_1$ is an evaluation function structured as a two-layer MLP (the first layer with a ReLU activation, the second with a linear activation) that takes the concatenation of the context hidden variable $c$ and a short-term representation as input and measures the matching degree between the two representations, $\epsilon$ is the temperature parameter of the SoftMax, and $u(T)$ denotes uniform sampling over the instants 1, 2, …, T.
Similarly, given $c$ and a set $V^{(lo)}$, the long-term contrastive objective is

$$\mathcal{L}^{(lo)} = -\,\mathbb{E}_{t\sim u(T)}\left[\log \frac{\exp\big(f_2(c, v^{(lo)}_t)/\epsilon\big)}{\sum_{v \in V^{(lo)}} \exp\big(f_2(c, v)/\epsilon\big)}\right]$$
where $f_2$ is an evaluation function with the same structure as $f_1$. By the properties of contrastive learning,

$$\mathcal{I}(c;\,x) \;\ge\; \log(K+1) - \mathcal{L}^{(sh/lo)}$$

where $\mathcal{I}(c; x)$ denotes the mutual information between $c$ and $x$. It follows that minimizing $\mathcal{L}^{(sh)}$ and $\mathcal{L}^{(lo)}$ maximizes a lower bound of the mutual information, so that the context hidden variable $c$ fully preserves the local (unique) information of the sequence $x$.
To make $c$ filter out global information, the invention uses KL-divergence regularization:

$$\mathcal{L}_{KL} = D_{KL}\big(q(c \mid x)\,\big\|\,p(c)\big)$$

where $p(c) = \mathcal{N}(0, I)$ is the prior and $q(c \mid x)$ is the Gaussian posterior distribution of $c$ output by the context recognition network. This regularization term introduces the prior with the aim of constraining the amount of information in $c$ to be as small as possible. Therefore, under the combined action of the two objectives — maximizing mutual information and minimizing $\mathcal{L}_{KL}$ — the context hidden variable automatically filters out global information while preserving as much local information as possible. The final multi-view contrastive objective is

$$\mathcal{L}_{CMC} = \mathcal{L}^{(sh)} + \mathcal{L}^{(lo)} + \alpha\,\mathcal{L}_{KL}$$

where $\alpha$ is an adjustable hyper-parameter. A minimal sketch of one InfoNCE-style contrastive term appears below.
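A sketch of one contrastive term as described above, assuming a two-layer MLP scorer for $f_1$ and a temperature $\epsilon$; all shapes and default values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Scorer(nn.Module):
    """f_1: scores how well a context c matches a representation v."""
    def __init__(self, d_c=64, d_v=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_c + d_v, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, c, v):                 # c: (N, d_c), v: (N, d_v)
        return self.mlp(torch.cat([c, v], dim=-1)).squeeze(-1)

def info_nce(scorer, c, positive, negatives, eps=0.1):
    """c: (d_c,); positive: (d_v,); negatives: (K, d_v) from other series."""
    cands = torch.cat([positive[None], negatives], dim=0)    # (K+1, d_v)
    logits = scorer(c.expand(len(cands), -1), cands) / eps   # (K+1,)
    # The positive sits at index 0, so the loss is -log softmax_0(logits).
    return -torch.log_softmax(logits, dim=0)[0]
```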
Finally, the context hidden variable $c$ is input to the parameter generation network, an MLP composed of several fully connected layers (hidden layers with ReLU activations and a linear output layer), which maps it to the parameters $\phi$ of the local feature encoder. A sketch of this hypernetwork-style generation follows.
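A sketch of the parameter generation idea, shrunk for illustration to producing the weights of a single linear layer; a real local feature encoder has many more parameters, and all names here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParamGenerator(nn.Module):
    """Maps the context hidden variable c to a flat parameter vector."""
    def __init__(self, d_c=64, in_dim=64, out_dim=64, hidden=128):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        n_params = out_dim * in_dim + out_dim        # weight + bias
        self.mlp = nn.Sequential(nn.Linear(d_c, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_params))

    def forward(self, c):                            # c: (d_c,)
        flat = self.mlp(c)
        W = flat[: self.out_dim * self.in_dim].view(self.out_dim, self.in_dim)
        b = flat[self.out_dim * self.in_dim:]
        return W, b

# The generated parameters are applied functionally: gradients flow into the
# generator's own weights, while W and b are never trained directly.
def local_layer(x, W, b):                            # x: (T, in_dim)
    return F.linear(x, W, b)
```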
Step 6: load the parameters $\phi$ generated by the adaptive parameter generation module into the local feature encoder.
Step 7: input the sample time series $x_{1:T}$ to the local feature encoder and output the local feature representation $h^{(l)}_{1:T}$.

The structure of the local feature encoder matches that of the global feature encoder except that it contains no VQ module $q$; its parameters $\phi$ do not participate in back-propagation and are produced directly by the adaptive parameter generation module. The specific process is: first, the sample time series $x_{1:T}$ is input to a short-term feature extractor consisting of a multi-layer 1D convolutional neural network, giving a short-term representation $z_{1:T}$ for each instant in the sequence; then $z_{1:T}$ is input to a Transformer encoder, which models the long-term dependencies over the whole sequence and outputs the local feature representation $h^{(l)}_{1:T}$ for each instant.
Step 8: concatenate the global and local feature representations and input the result to the convolutional Transformer decoder to obtain the prediction $\hat x_{T+1:T+\tau}$, where $\tau$ denotes the prediction step size.

The convolutional Transformer decoder is formed by stacking one convolution module and several identical attention modules; its structure is shown in Fig. 5. The convolution module has the same structure as the one in the encoder, and each attention module comprises: a masked multi-head self-attention layer over the decoder output, which introduces a backward masking mechanism so that data after a given instant cannot be seen when predicting that instant; a multi-head attention layer over the encoder output; and a feed-forward network of two fully connected layers (the first with a ReLU activation and the second with a linear activation). The last attention module outputs the predicted values $\hat x_{T+1:T+\tau}$ for the future $\tau$ steps. A sketch of the backward masking appears below.
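A sketch of the backward (causal) masking in the decoder's masked self-attention: position t may only attend to positions up to t, so future values stay hidden during prediction. The mask construction is a standard pattern, shown here as an assumed implementation.

```python
import torch

def masked_attention(Q, K, V):
    T, d_k = Q.size(-2), Q.size(-1)
    # True on/below the diagonal = allowed; the upper triangle (the future)
    # is masked out with -inf so SoftMax assigns it zero weight.
    allow = torch.tril(torch.ones(T, T, dtype=torch.bool, device=Q.device))
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    scores = scores.masked_fill(~allow, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V
```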
Step 9: compute the prediction objective $\mathcal{L}_{pred}$, i.e., the error between the ground truth $x_{T+1:T+\tau}$ corresponding to the sample time series and the actual predicted output $\hat x_{T+1:T+\tau}$.

The invention uses the mean absolute error as the prediction objective:

$$\mathcal{L}_{pred} = \frac{1}{\tau} \sum_{t=1}^{\tau} \big| x_{T+t} - \hat x_{T+t} \big|$$
step 10, calculating a prediction objective function
Figure BDA00030298552600001111
Multi-field contrast coding objective function>
Figure BDA00030298552600001110
And vector quantization constraint objective function->
Figure BDA0003029855260000117
Sum->
Figure BDA00030298552600001112
Step 11: adjust the network parameters of the whole model according to the loss over all samples in the batch.

The batch loss is

$$\mathcal{L}_{batch} = \sum_{i=1}^{M} \mathcal{L}_i$$

where $\mathcal{L}_i$ is the loss of the i-th sample in the batch. The learnable parameters $\theta$ of the whole model are adjusted according to this loss with the update formula

$$\theta \leftarrow \theta - \eta\, \nabla_{\theta}\, \mathcal{L}_{batch}$$

where $\eta$ is the learning rate. A sketch of one such training step follows.
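A sketch of one optimization step combining the three objectives as summed above; the interface of `model` (returning its prediction plus the two auxiliary losses) is an assumption for illustration, not the patent's specification.

```python
def train_step(model, optimizer, batch_x, batch_y):
    # Assumed interface: the model returns predictions plus both
    # auxiliary losses accumulated during the forward pass.
    pred, loss_vq, loss_cmc = model(batch_x)
    loss_pred = (batch_y - pred).abs().mean()     # mean absolute error
    loss = loss_pred + loss_cmc + loss_vq         # total loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                              # theta <- theta - eta * grad
    return loss.item()
```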
Step 12: repeat steps 3-11 until all batches of the training data set have participated in model training.
Step 13: repeat steps 3-12 until the specified number of iterations is reached.
Step 14: input the sample time series to be predicted into the trained model to obtain the prediction result.
The foregoing describes preferred embodiments of the invention and their advantages in detail. It should be appreciated that the description is merely illustrative of the presently preferred embodiments; any changes, additions, substitutions, and equivalents that do not depart from the spirit of the invention are intended to be included within its scope.

Claims (8)

1. A depth decoupling time series prediction method, characterized in that it is applied to the prediction of road traffic flow in the traffic field and comprises the following steps:
collecting time sequence data, wherein the time sequence data is road traffic flow, and preprocessing the road traffic flow to obtain preprocessed road traffic flow;
constructing a time series prediction model comprising a global feature encoder, an adaptive parameter generation module, a local feature encoder, and a decoder, wherein the global feature encoder encodes the road traffic flow into a global feature representation, the adaptive parameter generation module generates local feature encoder parameters from the road traffic flow, the local feature encoder encodes the road traffic flow into a local feature representation based on the loaded local feature encoder parameters, and the decoder decodes the concatenation of the global feature representation and the local feature representation and outputs the predicted road traffic flow;
and performing parameter optimization on the time series prediction model with the road traffic flow, and using the parameter-optimized time series prediction model for the prediction of road traffic flow.
2. The depth decoupling time series prediction method of claim 1, wherein the preprocessing comprises outlier detection and removal, missing-value imputation, and normalization.
3. The depth decoupling time series prediction method of claim 1, wherein the global feature encoder comprises a short-term feature extractor constructed from a convolutional neural network, a vector quantization module, and a Transformer encoder comprising a stack of attention modules, wherein the short-term feature extractor is configured to extract short-term features of the input road traffic flow to obtain a short-term representation of the road traffic flow, the vector quantization module is configured to quantize the input short-term representation to obtain an encoded vector, and the Transformer encoder is configured to model long-term dependencies in the whole road traffic data based on the encoded vector and output the global feature representation of the road traffic flow.
4. The depth decoupling time series prediction method of claim 1, wherein the adaptive parameter generation module adopts multi-view contrastive coding to encode the road traffic flow and output the local feature encoder parameters.
5. The depth decoupling time series prediction method of claim 4, wherein the adaptive parameter generation module comprises a context recognition network and a parameter generation network, wherein the context recognition network comprises a convolution module, a Transformer encoder, and an LSTM aggregator connected in sequence and configured to map the road traffic flow to a context hidden variable, and the parameter generation network comprises a fully connected network configured to generate the parameters of the local feature encoder from the context hidden variable.
6. The depth decoupling time series prediction method of claim 1, wherein the parameters of the local feature encoder do not participate in training and are generated by the adaptive parameter generation module, the local feature encoder comprising a short-term feature extractor and a Transformer encoder comprising a stack of attention modules, wherein the short-term feature extractor is configured to extract short-term features of the input road traffic flow to obtain a short-term representation of the road traffic flow, and the Transformer encoder is configured to model long-term dependencies in the whole road traffic data based on the short-term representation and output the local feature representation of the road traffic flow.
7. The depth decoupling time series prediction method of claim 1, wherein the decoder comprises a convolution module and a plurality of identical attention modules, wherein the convolution module is configured to perform a convolution operation on the concatenation of the input global feature representation and local feature representation, and the attention modules are configured to perform attention computations based on the convolution result and output the predicted road traffic flow.
8. The depth decoupling time series prediction method of claim 1, wherein the loss function $\mathcal{L}$ employed in parameter optimization of the time series prediction model is:

$$\mathcal{L} = \mathcal{L}_{pred} + \mathcal{L}_{CMC} + \mathcal{L}_{VQ}$$

wherein $\mathcal{L}_{pred}$ is the prediction objective function, expressed as:

$$\mathcal{L}_{pred} = \frac{1}{\tau}\sum_{t=1}^{\tau}\big|x_{T+t} - \hat x_{T+t}\big|$$

$\mathcal{L}_{CMC}$ is the multi-view contrastive coding objective function, expressed as:

$$\mathcal{L}_{CMC} = \mathcal{L}^{(sh)} + \mathcal{L}^{(lo)} + \alpha\, D_{KL}\big(q(c\mid x)\,\big\|\,\mathcal{N}(0, I)\big),\qquad \mathcal{L}^{(sh/lo)} = -\,\mathbb{E}_{t\sim u(T)}\left[\log\frac{\exp\big(f_{1/2}(c, v^{(sh/lo)}_t)/\epsilon\big)}{\sum_{v\in V^{(sh/lo)}}\exp\big(f_{1/2}(c, v)/\epsilon\big)}\right]$$

and $\mathcal{L}_{VQ}$ is the vector quantization constraint objective function, expressed as:

$$\mathcal{L}_{VQ} = \frac{1}{T}\sum_{t=1}^{T}\Big(\|\mathrm{sg}(z_t) - \hat z_t\|_2^2 + \gamma\,\|z_t - \mathrm{sg}(\hat z_t)\|_2^2\Big)$$

wherein $x_{T+t}$ and $\hat x_{T+t}$ denote the true and predicted road traffic flow t steps after time T, $\tau$ denotes the prediction step size, $f_1$ and $f_2$ are the evaluation functions for contrastive learning, $c$ denotes the context hidden variable, $v^{(sh)}_t$ denotes the short-term representation generated in the adaptive parameter generation module at instant t and the remaining elements of the set are its interference terms, $\epsilon$ is the temperature parameter of SoftMax, $u(T)$ denotes uniform sampling over the instants 1, 2, …, T, $q(c\mid x)$ is the Gaussian posterior distribution of $c$ output by the context recognition network, $\mathbb{E}$ denotes the mathematical expectation, $V^{(lo)}$ denotes the long-term representation set, $V^{(sh)}$ denotes the short-term representation set, $\alpha$ is an adjustable hyper-parameter, $D_{KL}$ denotes the KL divergence, $\mathrm{sg}(\cdot)$ is the gradient cut-off operation satisfying $\mathrm{sg}(z) \equiv z$ and $\nabla\,\mathrm{sg}(z) = 0$, $\gamma$ is an adjustable hyper-parameter, $z$ denotes the short-term representation generated by the global feature encoder, and $\hat z$ denotes the vector-quantized encoding of $z$.
CN202110426703.0A 2021-04-20 2021-04-20 Depth decoupling time sequence prediction method Active CN113177633B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110426703.0A CN113177633B (en) 2021-04-20 2021-04-20 Depth decoupling time sequence prediction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110426703.0A CN113177633B (en) 2021-04-20 2021-04-20 Depth decoupling time sequence prediction method

Publications (2)

Publication Number Publication Date
CN113177633A CN113177633A (en) 2021-07-27
CN113177633B true CN113177633B (en) 2023-04-25

Family

ID=76924167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110426703.0A Active CN113177633B (en) 2021-04-20 2021-04-20 Depth decoupling time sequence prediction method

Country Status (1)

Country Link
CN (1) CN113177633B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762356B (en) * 2021-08-17 2023-06-16 中山大学 Cluster load prediction method and system based on clustering and attention mechanism
CN114021803A (en) * 2021-10-29 2022-02-08 华能酒泉风电有限责任公司 Wind power prediction method, system and equipment based on convolution transform architecture
CN114239718B (en) * 2021-12-15 2024-03-01 杭州电子科技大学 High-precision long-term time sequence prediction method based on multi-element time sequence data analysis
CN114297379A (en) * 2021-12-16 2022-04-08 中电信数智科技有限公司 Text binary classification method based on Transformer
CN114580710B (en) * 2022-01-28 2024-04-30 西安电子科技大学 Environmental monitoring method based on transducer time sequence prediction
CN114936723B (en) * 2022-07-21 2023-04-14 中国电子科技集团公司第三十研究所 Social network user attribute prediction method and system based on data enhancement
CN115659852B (en) * 2022-12-26 2023-03-21 浙江大学 Layout generation method and device based on discrete potential representation
CN116153089B (en) * 2023-04-24 2023-06-27 云南大学 Traffic flow prediction system and method based on space-time convolution and dynamic diagram
CN116776228B (en) * 2023-08-17 2023-10-20 合肥工业大学 Power grid time sequence data decoupling self-supervision pre-training method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110718301A (en) * 2019-09-26 2020-01-21 东北大学 Alzheimer disease auxiliary diagnosis device and method based on dynamic brain function network
CN111243269A (en) * 2019-12-10 2020-06-05 福州市联创智云信息科技有限公司 Traffic flow prediction method based on depth network integrating space-time characteristics

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11144683B2 (en) * 2016-12-06 2021-10-12 General Electric Company Real-time adaptation of system high fidelity model in feature space

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110718301A (en) * 2019-09-26 2020-01-21 东北大学 Alzheimer disease auxiliary diagnosis device and method based on dynamic brain function network
CN111243269A (en) * 2019-12-10 2020-06-05 福州市联创智云信息科技有限公司 Traffic flow prediction method based on depth network integrating space-time characteristics

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Zhichao; Shi Zhiyu; Zhang Jie. Modal parameter identification of time-varying systems based on improved MCPP. Low Temperature Architecture Technology, 2016, (12), full text. *

Also Published As

Publication number Publication date
CN113177633A (en) 2021-07-27

Similar Documents

Publication Publication Date Title
CN113177633B (en) Depth decoupling time sequence prediction method
Liu et al. Remaining useful life prediction using a novel feature-attention-based end-to-end approach
Yin et al. Deep forest regression for short-term load forecasting of power systems
Ding et al. Point and interval forecasting for wind speed based on linear component extraction
Ibrahim et al. Short‐Time Wind Speed Forecast Using Artificial Learning‐Based Algorithms
CN111079989B (en) DWT-PCA-LSTM-based water supply amount prediction device for water supply company
CN113128113B (en) Lean information building load prediction method based on deep learning and transfer learning
CN113159389A (en) Financial time sequence prediction method based on deep forest generation countermeasure network
CN112434891A (en) Method for predicting solar irradiance time sequence based on WCNN-ALSTM
CN115204035A (en) Generator set operation parameter prediction method and device based on multi-scale time sequence data fusion model and storage medium
CN116643949A (en) Multi-model edge cloud load prediction method and device based on VaDE clustering
CN115840893A (en) Multivariable time series prediction method and device
Lyu et al. Dynamic feature selection for solar irradiance forecasting based on deep reinforcement learning
CN116014722A (en) Sub-solar photovoltaic power generation prediction method and system based on seasonal decomposition and convolution network
Wei et al. A three-stage multi-objective heterogeneous integrated model with decomposition-reconstruction mechanism and adaptive segmentation error correction method for ship motion multi-step prediction
CN117094451B (en) Power consumption prediction method, device and terminal
Vogt et al. Wind power forecasting based on deep neural networks and transfer learning
Ziyabari et al. Multi-branch resnet-transformer for short-term spatio-temporal solar irradiance forecasting
CN116245259B (en) Photovoltaic power generation prediction method and device based on depth feature selection and electronic equipment
CN117251705A (en) Daily natural gas load prediction method
Qi et al. Using stacked auto-encoder and bi-directional LSTM for batch process quality prediction
Xu et al. A hybrid model for multi-step wind speed forecasting based on secondary decomposition, deep learning, and error correction algorithms
CN116402194A (en) Multi-time scale load prediction method based on hybrid neural network
CN115860232A (en) Steam load prediction method, system, electronic device and medium
CN115544890A (en) Short-term power load prediction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant