CN116777068A - Causal Transformer-based networked data prediction method - Google Patents

Causal Transformer-based networked data prediction method

Info

Publication number: CN116777068A
Application number: CN202310776376.0A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: causal, network, time, information, entropy
Inventors: 陈都鑫, 程钰鑫, 虞文武
Assignee (current and original): Southeast University
Application filed by Southeast University
Publication of CN116777068A
Legal status: Pending

Abstract

The invention discloses a causal Transformer-based networked data prediction method, suitable for predicting the coupled time series of complex engineering systems. First, a causal network is constructed from the observed time series by causal-entropy-based inference, and a distance network is constructed from the node positions. Next, the causal network, the distance network, and the time series are fed into Transformer convolutional spatio-temporal blocks. Each spatio-temporal block consists of a residual Transformer module and a residual multi-graph convolution network module that aggregates causal-network and distance-network information, extracting features along the time and space dimensions respectively. Finally, the extracted information is decoded by the output layer to obtain the prediction result. The method achieves good prediction performance and can provide effective data support for complex engineering systems.

Description

Causal Transformer-based networked data prediction method
Technical Field
The invention relates to a causal Transformer-based networked data prediction method, and in particular to the prediction of coupled time series of complex engineering systems.
Background
Time series prediction of complex engineering systems plays a vital role in various real-world scenarios, such as traffic prediction, power management, supply chain management, and financial investment. If the future evolution of an event or index can be accurately estimated, it helps people make important decisions. For example, if severe traffic congestion is predicted in advance, traffic authorities can guide vehicles more reasonably and improve the operating efficiency of the road network.
In practical problems, complex systems often exhibit a multivariate dynamic evolution process with imperfect and uncertain information, so it is difficult to build a mathematical model in an accurate analytical form, and analysis often depends on time series obtained by observation. In statistics, the analysis of time series generated by such complex systems is an important dynamic data-processing method. When there are few detection units, predicting the coupled time series with classical statistical methods is feasible. The autoregressive integrated moving average (ARIMA) model and its variants are among the classical methods of time series analysis. However, this type of model is limited by the stationarity assumption on the time series and fails to account for spatio-temporal correlation, so these methods are limited when processing high-dimensional time-series data.

With the spread of machine learning from other fields, related models have gradually been applied to the prediction of coupled time series of complex engineering systems; these models achieve higher prediction accuracy and more complex data modeling, for example the K-nearest neighbor algorithm (KNN), the support vector machine (SVM), and neural networks (NN). Among them, neural network methods have been widely and successfully applied to various coupled time-series prediction tasks, with significant progress in related work such as deep belief networks (DBN) and stacked autoencoders (SAE). However, these networks have difficulty extracting spatial and temporal features jointly from the data, which severely limits their capability. To exploit spatial features, some models use convolutional neural networks (CNN) to capture neighbor relations in the network while employing recurrent neural network (RNN) methods along the time axis. For example, the feature-level fusion architecture CLTFP for short-term traffic prediction combines a long short-term memory (LSTM) network with a one-dimensional CNN. The FC-LSTM model further converts the original input time-series vector into a matrix to represent spatial connections; however, the ordinary convolution operation it applies restricts the model to grid structures such as images and video rather than general domains. At the same time, recurrent networks for sequence learning require iterative training, which introduces errors that accumulate step by step.

To better capture the information of non-Euclidean spatial networks, spatio-temporal prediction methods have introduced graph neural networks (GNN). For example, STGCN, based on the graph convolutional network (GCN), packages spatial and temporal feature extraction separately into S- and T-blocks; each convolution block comprises two gated sequential convolution layers and one spatial graph convolution layer, and model depth is increased by stacking ST blocks in series for spatio-temporal network speed prediction. On this basis, various networked spatio-temporal prediction models have been proposed, and spatio-temporal prediction based on spatio-temporal feature extraction has become the mainstream approach.
Examples include the ASTGCN model, which uses an attention mechanism to capture dynamics in time and space; the GaAN model, which combines a modified GAT with LSTM; the GraphWaveNet model, which applies causal convolution and gating mechanisms to residual blocks for time-series processing; and the GMAN model, which uses a multi-attention mechanism to extract temporal and spatial information of traffic flow. Models based on spatio-temporal feature extraction have achieved outstanding prediction results. However, some problems remain. The main shortcoming of existing models is that spatial feature extraction mostly concentrates on adjacent-node information, leaving global feature extraction insufficient. Meanwhile, for the extraction of time-series information, RNN or CNN methods cannot capture long-term dependencies in the data, so the long-term prediction performance of current traffic-flow prediction models is poor.
Therefore, the invention proposes a causal spatio-temporal network model for predicting the coupled time series of complex systems. As analyzed above, a simple distance network cannot extract the global features of the network well, so a causal network is constructed from the multidimensional data. In the causal network, the parent nodes of each node, i.e., the nodes that affect its changes, become its first-order neighbors, so causal relationship features can be extracted by the GCN. The invention combines the causal relationship features with the time dimension and the features of the distance network, so the spatio-temporal information of the coupled time series of complex engineering systems is effectively captured and a good prediction effect is obtained.
Disclosure of Invention
Technical problem: the invention aims to provide a causal Transformer-based networked data prediction method. Taking traffic flow as an example, from the observed time series of vehicle speeds detected by velocimeters at different positions, a causal network is first constructed by a causal-entropy-based causal inference method, with the causal entropy computed by a mutual-information estimation method to obtain the edge weights of the network. A distance network is then constructed from the longitude and latitude coordinates of the velocimeters. After network construction, the causal network, the distance network, and the time series are input into Transformer convolution space-time blocks to extract features, which are then decoded and output to obtain the prediction result. Each space-time block comprises a residual Transformer time module and a residual spatial convolution module that aggregates causal-network and distance-network information. The invention extracts multidimensional information of the spatio-temporal data for prediction, and the proposed model has a good prediction effect.
Technical scheme: to achieve the above purpose, the causal Transformer-based networked data prediction method of the present invention adopts the following technical scheme, comprising the following steps:
step 1: constructing a network;
step 2: and establishing a space-time transducer convolution model.
Wherein the network construction includes construction of a causal network and construction of a distance network.
Step 11: a causal network construction phase comprising two steps:
(1) A causal inference method based on causal relationship entropy is adopted on the observed time-series data to construct a causal network. The uncertainty and complexity associated with an event X can be quantified by the Shannon entropy; the entropy H(X) of event X is calculated as:

H(X) ≡ −Σ_x p(x) log p(x),

where p(x) is the probability that event X takes the particular value x. The relationship between the information of two events X and Y can be characterized by the joint entropy H(X,Y) and the conditional entropies H(X|Y), H(Y|X), defined as:

H(X,Y) ≡ −Σ_{x,y} p(x,y) log p(x,y),  H(X|Y) ≡ −Σ_{x,y} p(x,y) log p(x|y),

and symmetrically for H(Y|X).
Here p(x,y) is the joint probability of X = x, Y = y, and p(x|y), p(y|x) are the corresponding conditional probabilities. The information of event X can be subdivided into information belonging only to X and information shared by X and Y. The mutual information I(X;Y) describes the shared information between events X and Y; the closer the relationship between the two, the larger the mutual information. The definition of the mutual information I(X;Y) is:
I(X;Y)≡H(X)-H(X|Y)。
If a third event Z is present as a conditioning event, the conditional mutual information I(X;Y|Z) of events X and Y is:
I(X;Y|Z)≡H(Y|Z)-H(Y|X,Z)。
However, the mutual information only reflects the correlation between events. To measure the directionality of the information flow between two events, the transfer entropy T_{X→Y} can be introduced, defined as:

T_{X→Y} ≡ I(X^{(t)}; Y^{(t+τ)} | Y^{(t)}),
where τ is the delay time. Since complex engineering systems inevitably contain more than two nodes, the transfer entropy cannot distinguish direct from indirect causal relationships in the network without appropriate conditioning. The causal relationship entropy C_{Y→X|Z} overcomes this pairwise limitation of the transfer entropy and is defined as:

C_{Y→X|Z} ≡ I(Y^{(t)}; X^{(t+τ)} | Z^{(t)}) = H(X^{(t+τ)} | Z^{(t)}) − H(X^{(t+τ)} | Z^{(t)}, Y^{(t)}),
This index reflects the amount of information that Y^{(t)} can provide about X^{(t+τ)} given the condition Z^{(t)}. Thus, by determining the delay time τ, the direction of the information flow between two node time series can be obtained. Taking a traffic system as an example, a causal network can be constructed from the vehicle-speed time series obtained by the velocimeters. In the invention, each velocimeter is regarded as a node, forming a node set V with N nodes, i.e., |V| = N. When the causal entropy between two nodes is greater than 0, the edge connecting them is added to the edge set E_C, and the matrix W_C ∈ R^{N×N} is the matrix weighted by the causal entropies. The causal network graph can be denoted G_C = (V, E_C, W_C).

For any node x ∈ V, a node that points to it and has causal entropy greater than 0 is called a causal parent of x. According to the optimal causal entropy principle, the causal parent set of x is the smallest node set N_x that maximizes the causal entropy.
The optimal causal entropy algorithm can be divided into an aggregation phase and a deletion phase.
Aggregation phase: for the node set V = {x, y_1, y_2, …, y_{N−1}}, denote the set of nodes in V excluding x as y = {y_1, y_2, …, y_{N−1}}, and the causal parent set of node x as z. At the initial stage of the algorithm, z is the empty set. If

y_i = argmax_{y_j ∈ y∖z} C_{y_j→x|z}  and  C_{y_i→x|z} > 0,

then node y_i is added to z, i.e., z = z ∪ {y_i}. In other words, y_i is the node currently in y but not in z with the greatest causal entropy, and that causal entropy is greater than 0. When no such node can be found in y, the aggregation phase ends.
Deletion phase: the z obtained in the aggregation phase may be a superset of the nodes directly connected to x. Therefore, for each member z_i of z, if

C_{z_i→x|z∖{z_i}} = 0,

i.e., z_i provides no additional causal information given the remaining members, then z_i is deleted from z. When all members of z have been traversed, the nodes left in z are the direct causal parents of x and the deletion phase ends. At this time N_x = z.
The optimal causal entropy algorithm is performed on each node in V to obtain the causal relationships E_C between every pair of nodes, and the causal network is initially constructed.
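A minimal Python sketch of the aggregation and deletion phases described above follows; the function causal_entropy(y, x, z), returning C_{y→x|z}, is a placeholder for the estimator of the next subsection, and all names are illustrative assumptions.

```python
# Minimal sketch of the optimal causal entropy algorithm for one node x.
def causal_parents(x, nodes, causal_entropy):
    y = [v for v in nodes if v != x]
    z = []  # causal parent set, empty at the initial stage
    # Aggregation phase: greedily add the node with maximal causal entropy.
    while True:
        candidates = [v for v in y if v not in z]
        if not candidates:
            break
        best = max(candidates, key=lambda v: causal_entropy(v, x, z))
        if causal_entropy(best, x, z) <= 0:
            break  # no remaining node provides positive causal entropy
        z.append(best)
    # Deletion phase: remove members that are redundant given the others.
    for zi in list(z):
        rest = [v for v in z if v != zi]
        if causal_entropy(zi, x, rest) <= 0:
            z.remove(zi)
    return z  # N_x, the direct causal parents of x
```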
(2) Calculating the causal entropy. The causal entropy is equivalent to (conditional) mutual information. Therefore, the invention estimates the mutual information of two events X and Y by a mutual-information estimation method based on the K-nearest-neighbor algorithm:
I(X;Y) = ψ(k) + ψ(N) − ⟨ψ(n_x + 1) + ψ(n_y + 1)⟩,

where ⟨·⟩ denotes the average over all samples, k is the number of neighbor points, ψ is the digamma function, ψ(x) = Γ′(x)/Γ(x), N is the sample size, and n_x, n_y are the neighbor counts of the K-nearest-neighbor algorithm in the X and Y directions, respectively. That is, for a fixed value of k, let the distance from a data point w_i = (x_i, y_i) in the joint space to its k-th nearest neighbor be ε(i); then n_x and n_y are the numbers of points x_j and y_j (j ≠ i) satisfying ‖x_j − x_i‖_x < ε(i) and ‖y_j − y_i‖_y < ε(i), respectively. In the present invention, when the norm acts on scalars, its value equals the absolute value of the difference between the scalars.

Consider n independent samples {s_1, s_2, …, s_n} of the joint random variable S = (X, Y, Z), where s_i = (x_i, y_i, z_i). The estimate of I(X;Y|Z) is given by:

I(X;Y|Z) = ψ(k) − ⟨ψ(n_xz + 1) + ψ(n_yz + 1) − ψ(n_z + 1)⟩.
Here ψ(k) is again the digamma function. For a fixed value of k, let the distance from a data point s_i in the joint space to its k-th nearest neighbor be ε(i). The distance measure uses the maximum norm, i.e., ‖s_i − s_j‖_{xyz} = max{‖x_i − x_j‖_x, ‖y_i − y_j‖_y, ‖z_i − z_j‖_z}. Based on this, more precisely:

n_xz(i) is the number of points (x_j, z_j) (j ≠ i) satisfying ‖(x_j, z_j) − (x_i, z_i)‖_{xz} < ε(i);
n_yz(i) is the number of points (y_j, z_j) (j ≠ i) satisfying ‖(y_j, z_j) − (y_i, z_i)‖_{yz} < ε(i);
n_z(i) is the number of points z_j (j ≠ i) satisfying ‖z_j − z_i‖_z < ε(i).

Through the above process, W_C is successfully estimated, thereby constructing the complete causal network G_C = (V, E_C, W_C).
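The k-nearest-neighbor estimate of I(X;Y|Z) above can be sketched as follows; the use of scipy's cKDTree with the maximum norm, and the small tolerance used to approximate the strict inequality, are implementation assumptions rather than part of the patent text.

```python
# Minimal sketch of the KSG-style estimator of I(X;Y|Z) described above.
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def cmi_knn(x, y, z, k=4):
    x, y, z = (np.asarray(a, float).reshape(len(a), -1) for a in (x, y, z))
    xyz = np.hstack([x, y, z])
    # eps(i): maximum-norm distance to the k-th nearest neighbor in (X,Y,Z).
    eps = cKDTree(xyz).query(xyz, k=k + 1, p=np.inf)[0][:, -1]

    def count(a):
        # number of points strictly within eps(i) in the subspace a
        tree = cKDTree(a)
        return np.array([len(tree.query_ball_point(a[i], eps[i] - 1e-12,
                                                   p=np.inf)) - 1
                         for i in range(len(a))])

    n_xz = count(np.hstack([x, z]))
    n_yz = count(np.hstack([y, z]))
    n_z = count(z)
    return digamma(k) - np.mean(digamma(n_xz + 1) + digamma(n_yz + 1)
                                - digamma(n_z + 1))
```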
Step 12: and a distance network construction stage. The distance network may be denoted as G D =(V,E D ,W D ) Wherein E is D ,W D Edge sets and adjacency matrices, adjacency matrices W, respectively, from the network D Is generated based on the distance between nodes, W D The ith row and jth column elements are represented as follows:
wherein d is ij The distance between the ith node and the jth node can be obtained by substituting a haverine tool in Python into the longitude and latitude calculation of the node, and sigma 2 Is the assumed variance of the distance and epsilon is the threshold of the weight. The invention sets epsilon to 0.5 and sigma 2 10. Similar to the definition of causal networks, adjacency matrix W of distance networks D An element of (a) greater than 0 indicates that there is a border at that location. Through step two, distance network G D =(V,E D ,W D ) Is successfully constructed. Step 2: and establishing a space-time transducer convolution model.
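A minimal sketch of this distance-network construction, using the haversine package mentioned above and the stated ε = 0.5, σ² = 10, might look as follows; coordinates are assumed to be (latitude, longitude) pairs, and the function name is illustrative.

```python
# Minimal sketch: thresholded Gaussian-kernel adjacency from node coordinates.
import numpy as np
from haversine import haversine  # great-circle distance from (lat, lon) pairs

def distance_adjacency(coords, sigma2=10.0, eps=0.5):
    n = len(coords)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w = np.exp(-haversine(coords[i], coords[j]) ** 2 / sigma2)
            if w >= eps:      # the threshold eps drops weak edges
                W[i, j] = w
    return W
```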
Step 2: establishing the space-time Transformer convolution model. The space-time model is built in the following three steps:
step 21: and (5) preprocessing data. Manually selecting a sliding window to determine the input dimension, i.e. selecting N nodesCoupling information stream data of individual time steps +.>As a model input. The limitation of the input dimension is to avoid that the neural network with the excessively high dimension runs too slowly due to the excessively long length of the input time data.
Step 22: the Transformer convolution space-time block. Each space-time block comprises a residual Transformer module that extracts time-dimension information and a residual multi-graph convolution module that aggregates causal-network and distance-network information.
(1) The time module. The invention adopts the Informer model structure on the time axis to capture the temporal dynamic behavior of the data.
1.1) The Informer model comprises two parts, an encoder and a decoder. Taking the l-th space-time block as an example, the module input is X_l ∈ R^{N×M}. In the encoder part, the input X_l is vector-mapped to X_enc, which comprises the linearly mapped input vector, the local position encoding of the elements, and the global position encoding of the elements over the entire time axis. This gives X_enc not only local sequential information but also hierarchical timing information such as week, month, and year, as well as burst timestamp information (events, certain holidays, and the like). After vector mapping, the data passes through several attention blocks, each containing multi-head probability-sparse (ProbSparse) self-attention. The output of each block is refined by self-attention distillation, using a one-dimensional convolution layer Conv1d, an ELU activation layer, and max-pooling MaxPool to extract the relevant attention information. The process from layer j to layer (j+1) is defined by the formula:

X^{j+1} = MaxPool(ELU(Conv1d([X^j]_AB))).
The layer-j input X^j first passes through the attention block [·]_AB, after which dominant features with dominant attention are given higher weight by the self-attention distillation. Specifically, a padding length K_C is first selected and cyclic padding is performed at both ends of the time axis; Conv1d then performs a one-dimensional convolution along the time dimension of the sequence, with identical input and output dimensions. The data then passes through the activation function ELU, denoted x_ELU; the expression of the ELU is:

ELU(x) = x for x > 0, and ELU(x) = e^x − 1 for x ≤ 0.
Finally, a max-pooling MaxPool operation is applied along the time dimension to the activated data, extracting the maximum within the designated window and significantly reducing the size of the feature tensor. Denoting the input time dimension by L_in, after max-pooling the feature tensor has time dimension L_out, determined by the pooling window and stride (halved when the stride is 2).
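A minimal PyTorch sketch of this Conv1d–ELU–MaxPool distilling step follows; the attention block itself is abstracted away, and the kernel and pooling sizes are assumptions consistent with halving the time dimension.

```python
# Minimal sketch of the self-attention distilling step between attention blocks.
import torch.nn as nn

class Distilling(nn.Module):
    def __init__(self, d_model, kernel=3):
        super().__init__()
        # cyclic (circular) padding at both ends of the time axis
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=kernel,
                              padding=kernel // 2, padding_mode='circular')
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):                  # x: (batch, L_in, d_model)
        x = self.conv(x.transpose(1, 2))   # convolve along the time axis
        x = self.pool(self.act(x))         # stride 2 roughly halves the time axis
        return x.transpose(1, 2)           # (batch, L_out, d_model)
```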
1.2) The Informer decoder takes as input the concatenation

X_dec = Concat(X_token, X_0),

where X_token is a start-token sequence whose time dimension has length L_token, and X_0 is the placeholder for the target sequence, whose time dimension has length equal to the prediction horizon. X_0 is zero-padded and contains the timestamps of the target sequence. The invention samples a sequence of a particular size from the input sequence as the start token, for example the traffic-flow data of the previous hour. The sequence X_dec passes through the masked multi-head probability-sparse self-attention layer and is combined with the output of the encoder; attention is then propagated through multi-head attention. The above process is repeated until the output Y_l is obtained, whose positions in the final output correspond to X_0.
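The decoder input construction above can be sketched as follows; the tensor layout (batch, time, feature) and the function name are assumptions for illustration.

```python
# Minimal sketch: start token + zero placeholder for the Informer decoder.
import torch

def decoder_input(x_enc, L_token, L_pred):
    x_token = x_enc[:, -L_token:, :]           # e.g. the previous hour of data
    x_zero = torch.zeros(x_enc.size(0), L_pred, x_enc.size(2),
                         device=x_enc.device)  # placeholder for the targets
    return torch.cat([x_token, x_zero], dim=1) # (batch, L_token + L_pred, d)
```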
(2) The space module.
2.1) Since the distance network and the causal network of the traffic system are non-Euclidean networks, the invention adopts the spectral-domain method of graph neural networks to extract features from the non-Euclidean structured data. The input of the l-th space module includes the l-th time module output Y_l, the distance network G_D = (V, E_D, W_D), and the causal network G_C = (V, E_C, W_C). For convenience of description, the distance network and the causal network are collectively referred to as the network G, whose adjacency matrix is denoted W ∈ R^{N×N}. The invention uses a graph convolution network based on the first-order approximation of the graph Laplacian; the time-module output Y_l yields the convolution output Z_l after graph convolution, and the convolution process can be described as:

Z_l = Ŵ Y_l Θ,  Ŵ = D̃^{−1/2}(W + I_N)D̃^{−1/2},
where Θ is the parameter of the convolution kernel, Ŵ is the renormalization result of W, and I_N is the N-order identity matrix. D̃ is the degree matrix of W̃ = W + I_N: its diagonal elements can be expressed as D̃_ii = Σ_j W̃_ij, and all its off-diagonal elements are 0.
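A minimal sketch of this renormalized first-order graph convolution is given below; the function names are illustrative.

```python
# Minimal sketch: W_hat = D~^{-1/2} (W + I_N) D~^{-1/2}, then Z = W_hat Y Theta.
import numpy as np

def renormalize(W):
    W_tilde = W + np.eye(W.shape[0])   # add self-loops
    d = W_tilde.sum(axis=1)            # diagonal of the degree matrix D~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ W_tilde @ D_inv_sqrt

def gcn_layer(Y, W, Theta):
    # Y: (N, F_in) node features; Theta: (F_in, F_out) kernel parameters
    return renormalize(W) @ Y @ Theta
```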
Furthermore, the invention enables residual connections when stacking graph convolution layers, and multiple graph convolutions are used to aggregate the information of the causal network and the distance network. Denote the output of the l-th time module by Y_l. Y_l is convolved on G_D and G_C respectively and the extracted features are integrated to obtain S_l: the graph-convolution outputs of Y_l on G_D and G_C are connected in series (concatenated),

S_l = ReGCN_D(Y_l) ⊕ ReGCN_C(Y_l),

where S_l represents the output of the l-th space module and ReGCN represents the residual graph convolution operation, which adds the graph-convolution output of Y_l to the output of a linear transformation FC_1 of Y_l, i.e., ReGCN(Y_l) = GCN(Y_l) + FC_1(Y_l). The subscript C indicates that the graph convolution uses the causal network, and the subscript D that it uses the distance network.
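A minimal PyTorch sketch of the residual multi-graph convolution above follows; W_hat_D and W_hat_C are assumed to be the renormalized adjacency tensors of the two networks, and the module names are illustrative.

```python
# Minimal sketch: ReGCN(Y) = GCN(Y) + FC1(Y), concatenated over both graphs.
import torch
import torch.nn as nn

class ReGCN(nn.Module):
    def __init__(self, W_hat, f_in, f_out):
        super().__init__()
        self.register_buffer('W_hat', W_hat)             # renormalized adjacency (N, N)
        self.theta = nn.Linear(f_in, f_out, bias=False)  # graph-conv kernel Theta
        self.fc1 = nn.Linear(f_in, f_out)                # residual linear path FC_1

    def forward(self, y):                                # y: (batch, N, f_in)
        return self.W_hat @ self.theta(y) + self.fc1(y)

class MultiGraphSpatial(nn.Module):
    def __init__(self, W_hat_D, W_hat_C, f_in, f_out):
        super().__init__()
        self.branch_d = ReGCN(W_hat_D, f_in, f_out)      # distance network branch
        self.branch_c = ReGCN(W_hat_C, f_in, f_out)      # causal network branch

    def forward(self, y):
        # S_l: series connection (concatenation) of the two branches
        return torch.cat([self.branch_d(y), self.branch_c(y)], dim=-1)
```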
2.2) To ensure that the space-time block does not change the data dimension, the output of the space module is linearly transformed (FC_2) to obtain the output X_{l+1} of the l-th space-time block:

X_{l+1} = FC_2(S_l).

The invention increases the model depth by stacking space-time blocks, so the output X_{l+1} of the l-th space-time block serves as the input of the (l+1)-th space-time block.
Step 23: and an output layer.
And obtaining a prediction result through decoding of an output layer. Extracting features by a plurality of spatio-temporal blocks, decoding the extracted information, i.e. linear transformation FC 3 Making the dimension of output be the predicted time step number, and recording the L-1 th space-time block outputTo finally extract the features, output Y pred ∈R N×K For the prediction result, the decoding process of the output layer is as follows:
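The overall stacking-and-decoding flow can be sketched as follows; the internals of each space-time block are abstracted away, and d_feat (the feature width after the last block) is an assumed parameter.

```python
# Minimal sketch: stacked space-time blocks followed by the output layer FC_3.
import torch.nn as nn

class CausalSTModel(nn.Module):
    def __init__(self, blocks, d_feat, K):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)  # Transformer conv space-time blocks
        self.fc3 = nn.Linear(d_feat, K)      # output dim = K predicted steps

    def forward(self, x):                    # x: model input, shape (batch, N, M)
        for block in self.blocks:            # output of block l feeds block l+1
            x = block(x)
        return self.fc3(x)                   # Y_pred in R^{N x K} per batch item
```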
the beneficial effects are that:
1. the invention combines a causal network and a space-time feature extraction method by using a data driving method to construct a residual multi-graph convolution network space convolution module capable of summarizing causal network and distance network information. The method solves the problem that the extraction of the spatial features of the traditional method is mostly concentrated on the extraction of the adjacent node information, and can well extract the global features of the network;
2. Compared with RNN models, the invention uses Informer to extract time-dimension information, breaking the limitation that such models cannot be computed in parallel; compared with CNN, the number of operations required to compute the association between two positions does not grow with their distance; compared with the traditional Transformer, the capability on long-term prediction problems is improved, fully exploiting the potential for capturing individual long-range dependencies between long-sequence time-series outputs and inputs;
3. The invention achieves excellent predictive performance on the traffic-flow dataset PeMSD7(M) from District 7 of California. Compared with other widely used spatio-temporal prediction models, the prediction is clearly improved. The following reference models are selected for comparison: historical average (HA), linear support vector regression (LSVR), feed-forward neural network (FNN), fully connected long short-term memory network (FC-LSTM), spatio-temporal graph convolutional network (STGCN), diffusion convolutional recurrent neural network (DCRNN), and GraphWaveNet.

The invention adopts widely used prediction evaluation indices for the comparison: the mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE). The smaller these three indices, the better the model; the comparison shows that the invention outperforms the other reference models for 15-, 30-, and 60-minute predictions.
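For reference, the three evaluation indices can be computed as in the following sketch (MAPE in percent); the function names are illustrative.

```python
# Minimal sketch of the MAE, MAPE and RMSE evaluation indices.
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```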
Drawings
FIG. 1 is a flow chart from data input to prediction output;
FIG. 2 is an overall schematic diagram of the model;
FIG. 3 is a schematic diagram of the internal structure of the l-th Transformer convolution space-time block.
Detailed Description
In order to enhance the understanding of the present invention, the present embodiment will be described in detail with reference to the accompanying drawings.
Example 1: referring to FIGS. 1-3, a causal Transformer-based networked data prediction method adopts the following technical scheme, comprising the following steps:
step 1: constructing a network;
step 2: and establishing a space-time transducer convolution model.
Wherein the network construction includes construction of a causal network and construction of a distance network.
Step 11: a causal network construction phase comprising two steps:
(1) Construct the causal network on the observed time-series data by the causal inference method based on causal entropy. For an event X, the associated uncertainty and complexity can be quantified by the Shannon entropy; the entropy H(X) of event X is calculated as:

H(X) ≡ −Σ_x p(x) log p(x),

where p(x) is the probability that event X takes the particular value x. The relationship between the information of two events X and Y can be characterized by the joint entropy H(X,Y) and the conditional entropies H(X|Y), H(Y|X), defined as:

H(X,Y) ≡ −Σ_{x,y} p(x,y) log p(x,y),  H(X|Y) ≡ −Σ_{x,y} p(x,y) log p(x|y),

and symmetrically for H(Y|X).
Here p(x,y) is the joint probability of X = x, Y = y, and p(x|y), p(y|x) are the corresponding conditional probabilities. The information of event X can be subdivided into information belonging only to X and information shared by X and Y. The mutual information I(X;Y) describes the shared information between events X and Y; the closer the relationship between the two, the larger the mutual information. The definition of the mutual information I(X;Y) is:
I(X;Y)≡H(X)-H(X|Y)。
If a third event Z is present as a conditioning event, the conditional mutual information I(X;Y|Z) of events X and Y is:
I(X;Y|Z)≡H(Y|Z)-H(Y|X,Z)。
However, the mutual information only reflects the correlation between events. To measure the directionality of the information flow between two events, the transfer entropy T_{X→Y} can be introduced, defined as:

T_{X→Y} ≡ I(X^{(t)}; Y^{(t+τ)} | Y^{(t)}),
where τ is the delay time. Since complex engineering systems inevitably contain more than two nodes, the transfer entropy cannot distinguish direct from indirect causal relationships in the network without appropriate conditioning. The causal relationship entropy C_{Y→X|Z} overcomes this pairwise limitation of the transfer entropy and is defined as:

C_{Y→X|Z} ≡ I(Y^{(t)}; X^{(t+τ)} | Z^{(t)}) = H(X^{(t+τ)} | Z^{(t)}) − H(X^{(t+τ)} | Z^{(t)}, Y^{(t)}),
This index reflects the amount of information that Y^{(t)} can provide about X^{(t+τ)} given the condition Z^{(t)}. Thus, by determining the delay time τ, the direction of the information flow between two node time series can be obtained. Taking a traffic system as an example, a causal network can be constructed from the vehicle-speed time series obtained by the velocimeters. In the invention, each velocimeter is regarded as a node, forming a node set V with N nodes, i.e., |V| = N. When the causal entropy between two nodes is greater than 0, the edge connecting them is added to the edge set E_C, and the matrix W_C ∈ R^{N×N} is the matrix weighted by the causal entropies. The causal network graph can be denoted G_C = (V, E_C, W_C).

For any node x ∈ V, a node that points to it and has causal entropy greater than 0 is called a causal parent of x. According to the optimal causal entropy principle, the causal parent set of x is the smallest node set N_x that maximizes the causal entropy.
The optimal causal entropy algorithm can be divided into an aggregation phase and a deletion phase.
Aggregation phase: for the node set V = {x, y_1, y_2, …, y_{N−1}}, denote the set of nodes in V excluding x as y = {y_1, y_2, …, y_{N−1}}, and the causal parent set of node x as z. At the initial stage of the algorithm, z is the empty set. If

y_i = argmax_{y_j ∈ y∖z} C_{y_j→x|z}  and  C_{y_i→x|z} > 0,

then node y_i is added to z, i.e., z = z ∪ {y_i}. When no such node can be found in y, the aggregation phase ends.
Deletion phase: the z obtained in the aggregation phase may be a superset of the nodes directly connected to x. Therefore, for each member z_i of z, if

C_{z_i→x|z∖{z_i}} = 0,

then z_i is deleted from z. When all members of z have been traversed, the nodes left in z are the direct causal parents of x and the deletion phase ends. At this time N_x = z.
The optimal causal entropy algorithm is performed on each node in V to obtain the causal relationships E_C between every pair of nodes, and the causal network is initially constructed.
(2) Calculating the causal entropy. The causal entropy is equivalent to (conditional) mutual information. Therefore, the invention estimates the mutual information of two events X and Y by a mutual-information estimation method based on the K-nearest-neighbor algorithm:
I(X;Y) = ψ(k) + ψ(N) − ⟨ψ(n_x + 1) + ψ(n_y + 1)⟩,

where ⟨·⟩ denotes the average over all samples, k is the number of neighbor points, ψ is the digamma function, ψ(x) = Γ′(x)/Γ(x), N is the sample size, and n_x, n_y are the neighbor counts of the K-nearest-neighbor algorithm in the X and Y directions, respectively. That is, for a fixed value of k, let the distance from a data point w_i = (x_i, y_i) in the joint space to its k-th nearest neighbor be ε(i); then n_x and n_y are the numbers of points x_j and y_j (j ≠ i) satisfying ‖x_j − x_i‖_x < ε(i) and ‖y_j − y_i‖_y < ε(i), respectively. In the present invention, when the norm acts on scalars, its value equals the absolute value of the difference between the scalars.

Consider n independent samples {s_1, s_2, …, s_n} of the joint random variable S = (X, Y, Z), where s_i = (x_i, y_i, z_i). The estimate of I(X;Y|Z) is given by:

I(X;Y|Z) = ψ(k) − ⟨ψ(n_xz + 1) + ψ(n_yz + 1) − ψ(n_z + 1)⟩.
Here ψ(k) is again the digamma function. For a fixed value of k, let the distance from a data point s_i in the joint space to its k-th nearest neighbor be ε(i). The distance measure uses the maximum norm, i.e., ‖s_i − s_j‖_{xyz} = max{‖x_i − x_j‖_x, ‖y_i − y_j‖_y, ‖z_i − z_j‖_z}. Based on this, more precisely:

n_xz(i) is the number of points (x_j, z_j) (j ≠ i) satisfying ‖(x_j, z_j) − (x_i, z_i)‖_{xz} < ε(i);
n_yz(i) is the number of points (y_j, z_j) (j ≠ i) satisfying ‖(y_j, z_j) − (y_i, z_i)‖_{yz} < ε(i);
n_z(i) is the number of points z_j (j ≠ i) satisfying ‖z_j − z_i‖_z < ε(i).

Through the above process, W_C is successfully estimated, thereby constructing the complete causal network G_C = (V, E_C, W_C).
Step 12: and a distance network construction stage. The distance network may be denoted as G D =(V,E D ,W D ) Wherein E is D ,W D Edge sets and adjacency matrices, adjacency matrices W, respectively, from the network D Is generated based on the distance between nodes, W D The ith row and jth column elements are represented as follows:
wherein d is ij The distance between the ith node and the jth node can be obtained by substituting a haverine tool in Python into the longitude and latitude calculation of the node, and sigma 2 Is the assumed variance of the distance and epsilon is the threshold of the weight. The invention sets epsilon to 0.5 and sigma 2 10. Similar to the definition of causal networks, adjacency matrix W of distance networks D An element of (a) greater than 0 indicates that there is a border at that location. Through step two, distance network G D =(V,E D ,W D ) Is successfully constructed. Step 2: and establishing a space-time transducer convolution model.
Step 2: establishing the space-time Transformer convolution model. The space-time model is built in the following three steps:
step 21: and (5) preprocessing data. Manually selecting a sliding window to determine the input dimension, i.e. selecting N nodesCoupling information stream data of individual time steps +.>As a model input. The limitation of the input dimension is to avoid that the neural network with the excessively high dimension runs too slowly due to the excessively long length of the input time data.
Step 22: the Transformer convolution space-time block. Each space-time block comprises a residual Transformer module that extracts time-dimension information and a residual multi-graph convolution module that aggregates causal-network and distance-network information.
(1) The time module. The invention adopts the Informer model structure on the time axis to capture the temporal dynamic behavior of the data.
1.1) The Informer model comprises two parts, an encoder and a decoder. Taking the l-th space-time block as an example, the module input is X_l ∈ R^{N×M}. In the encoder part, the input X_l is vector-mapped to X_enc, which comprises the linearly mapped input vector, the local position encoding of the elements, and the global position encoding of the elements over the entire time axis. This gives X_enc not only local sequential information but also hierarchical timing information such as week, month, and year, as well as burst timestamp information (events, certain holidays, and the like). After vector mapping, the data passes through several attention blocks, each containing multi-head probability-sparse (ProbSparse) self-attention. The output of each block is refined by self-attention distillation, using a one-dimensional convolution layer Conv1d, an ELU activation layer, and max-pooling MaxPool to extract the relevant attention information. The process from layer j to layer (j+1) is defined by the formula:

X^{j+1} = MaxPool(ELU(Conv1d([X^j]_AB))).
The layer-j input X^j first passes through the attention block [·]_AB, after which dominant features with dominant attention are given higher weight by the self-attention distillation. Specifically, a padding length K_C is first selected and cyclic padding is performed at both ends of the time axis; Conv1d then performs a one-dimensional convolution along the time dimension of the sequence, with identical input and output dimensions. The data then passes through the activation function ELU, denoted x_ELU; the expression of the ELU is:

ELU(x) = x for x > 0, and ELU(x) = e^x − 1 for x ≤ 0.
Finally, a max-pooling MaxPool operation is applied along the time dimension to the activated data, extracting the maximum within the designated window and significantly reducing the size of the feature tensor. Denoting the input time dimension by L_in, after max-pooling the feature tensor has time dimension L_out, determined by the pooling window and stride (halved when the stride is 2).
1.2) The Informer decoder takes as input the concatenation

X_dec = Concat(X_token, X_0),

where X_token is a start-token sequence whose time dimension has length L_token, and X_0 is the placeholder for the target sequence, whose time dimension has length equal to the prediction horizon. X_0 is zero-padded and contains the timestamps of the target sequence. The invention samples a sequence of a particular size from the input sequence as the start token, for example the traffic-flow data of the previous hour. The sequence X_dec passes through the masked multi-head probability-sparse self-attention layer and is combined with the output of the encoder; attention is then propagated through multi-head attention. The above process is repeated until the output Y_l is obtained, whose positions in the final output correspond to X_0.
(2) The space module.
2.1) Since the distance network and the causal network of the traffic system are non-Euclidean networks, the invention adopts the spectral-domain method of graph neural networks to extract features from the non-Euclidean structured data. The input of the l-th space module includes the l-th time module output Y_l, the distance network G_D = (V, E_D, W_D), and the causal network G_C = (V, E_C, W_C). For convenience of description, the distance network and the causal network are collectively referred to as the network G, whose adjacency matrix is denoted W ∈ R^{N×N}. The invention uses a graph convolution network based on the first-order approximation of the graph Laplacian; the time-module output Y_l yields the convolution output Z_l after graph convolution, and the convolution process can be described as:

Z_l = Ŵ Y_l Θ,  Ŵ = D̃^{−1/2}(W + I_N)D̃^{−1/2},

where Θ is the parameter of the convolution kernel, Ŵ is the renormalization result of W, and I_N is the N-order identity matrix. D̃ is the degree matrix of W̃ = W + I_N: its diagonal elements can be expressed as D̃_ii = Σ_j W̃_ij, and all its off-diagonal elements are 0.
Furthermore, the invention enables residual connections when stacking graph convolution layers, and multiple graph convolutions are used to aggregate the information of the causal network and the distance network. Denote the output of the l-th time module by Y_l. Y_l is convolved on G_D and G_C respectively and the extracted features are integrated to obtain S_l: the graph-convolution outputs of Y_l on G_D and G_C are connected in series (concatenated),

S_l = ReGCN_D(Y_l) ⊕ ReGCN_C(Y_l),

where S_l represents the output of the l-th space module and ReGCN represents the residual graph convolution operation, which adds the graph-convolution output of Y_l to the output of a linear transformation FC_1 of Y_l, i.e., ReGCN(Y_l) = GCN(Y_l) + FC_1(Y_l). The subscript C indicates that the graph convolution uses the causal network, and the subscript D that it uses the distance network.
2.2) To ensure that the space-time block does not change the data dimension, the output of the space module is linearly transformed (FC_2) to obtain the output X_{l+1} of the l-th space-time block:

X_{l+1} = FC_2(S_l).

The invention increases the model depth by stacking space-time blocks, so the output X_{l+1} of the l-th space-time block serves as the input of the (l+1)-th space-time block.
Step 23: and an output layer.
And obtaining a prediction result through decoding of an output layer. Extracting features by a plurality of spatio-temporal blocks, decoding the extracted information, i.e. linear transformation FC 3 Making the dimension of output be the predicted time step number, and recording the L-1 th space-time block outputTo finally extract the features, output Y pred ∈R N×K For the prediction result, the decoding process of the output layer is as follows:
example 2: referring to fig. 1-3, a causal transform-based networked data prediction method adopts the following technical scheme: the method comprises the following steps:
Step one: preliminarily construct the causal network G_C = (V, E_C, W_C) by applying the optimal causal entropy algorithm to the observed time-series data. For the node set V = {x, y_1, y_2, …, y_{N−1}}, the causal parent set of node x is denoted z; at the initial stage of the algorithm, z is the empty set. If

y_i = argmax_{y_j ∈ y∖z} C_{y_j→x|z}  and  C_{y_i→x|z} > 0,

then node y_i is added to z, i.e., z = z ∪ {y_i}. In other words, y_i is the node currently in y but not in z with the greatest causal entropy, and that causal entropy is greater than 0. When no such node can be found in y, the aggregation phase ends.
Step two: the resulting z of the polymerization stage may be a superset in direct communication with x. Thus, for member z in z i If (3)
Then z is set i Delete from z, when all members in z are traversed, the node left in z is the direct causal parent of x and the delete phase ends. Parental cause and effect set N for node x at this time x =z。
Step three: the two steps are carried out on each node in V, so that the causal relationship E between every two nodes can be obtained C And (5) completing the primary construction of the causal network.
Step four: estimating a formula by a mutual information formula
I(X;Y|Z)=ψ(k)-<ψ(n xz +1)+ψ(n yz +1)-ψ(n z +1)>Estimating the value of causal entropy to obtain W C Completion of causal network G C =(V,E C ,W C ) Is a construction of (3).
Step five: calculating the distance between nodes by using the longitude and latitude coordinates of each node in V, and constructing a distance network G by the distance D =(V,E D ,W D )。
Step six: and (5) preprocessing data. Manually selecting a sliding window to determine the input dimension, i.e. selecting N nodesCoupling information stream data of individual time steps +.>As a model input.
Step seven: as shown in FIG. 3, in the first space-time block, data is writtenAnd inputting a time module. In the encoder section, the input +.>Vector mapping is +.>Then passing through a plurality of attention blocks containing multi-head probability sparse self-attention. The output of each block extracts the relevant attention information by self-attention distillation.
Step eight:as a decoder part input, the multi-headed probability sparse self-attention layer is masked and combined with the output of the encoder. Then, the attention is transferred through multiple heads. The above process is repeated until +.>Its position in the final output and +.>Corresponding to each other.
Step nine: time information to be extractedAn input space module, through a model:
/>
obtaining residual multi-graph convolution output
Step ten: to ensure that the space-time block does not change the data dimension, the methodLinear transformation to obtain the output +.>
Step eleven: the information extracted from the last space-time block is decoded by the output layer. Making the output dimension be the predicted time step number to obtain an output Y pred
It should be noted that the above embodiments are not intended to limit the scope of the present invention; equivalent changes or substitutions made on the basis of the above technical solutions fall within the scope of the present invention as defined in the claims.

Claims (8)

1. A causal Transformer-based networked data prediction method, the method comprising the following steps:
step 1: constructing a network;
step 2: and establishing a space-time transducer convolution model.
2. The causal Transformer-based networked data prediction method according to claim 1, wherein step 1, constructing the network, is as follows:
step 11: constructing a causal network;
step 12: and (3) constructing a distance network.
3. The causal Transformer-based networked data prediction method according to claim 2, wherein step 11, the causal network construction, comprises two steps:
(1) a causal inference method based on causal relationship entropy is adopted on the observed time-series data to construct a causal network; the uncertainty and complexity associated with an event X can be quantified by the Shannon entropy, the entropy H(X) of event X being calculated as:

H(X) ≡ −Σ_x p(x) log p(x),

where p(x) is the probability that event X takes the particular value x; the relationship between the information of two events X and Y is characterized by the joint entropy H(X,Y) and the conditional entropies H(X|Y), H(Y|X), defined as:

H(X,Y) ≡ −Σ_{x,y} p(x,y) log p(x,y),  H(X|Y) ≡ −Σ_{x,y} p(x,y) log p(x|y),

where p(x,y) is the joint probability of X = x, Y = y, and p(x|y), p(y|x) are the corresponding conditional probabilities; the information of event X is subdivided into information belonging only to X and information shared by X and Y; the mutual information I(X;Y) describes the shared information between events X and Y, and the closer their relationship, the larger the mutual information; the definition of the mutual information I(X;Y) is:
I(X;Y)≡H(X)-H(X|Y),
if a third event Z is present as a conditioning event, the conditional mutual information I(X;Y|Z) of events X and Y is:
I(X;Y|Z)≡H(Y|Z)-H(Y|X,Z),
however, the mutual information only reflects the correlation between events; to measure the directionality of the information flow between two events, the transfer entropy T_{X→Y} is introduced, defined as:

T_{X→Y} ≡ I(X^{(t)}; Y^{(t+τ)} | Y^{(t)}),
where τ is the delay time; since complex engineering systems inevitably contain more than two nodes, the transfer entropy cannot distinguish direct from indirect causal relationships in the network without appropriate conditioning, while the causal relationship entropy C_{Y→X|Z} overcomes the pairwise limitation of the transfer entropy and is defined as:

C_{Y→X|Z} ≡ I(Y^{(t)}; X^{(t+τ)} | Z^{(t)}) = H(X^{(t+τ)} | Z^{(t)}) − H(X^{(t+τ)} | Z^{(t)}, Y^{(t)}),
this index reflects, given the condition Z^{(t)}, the amount of information that Y^{(t)} can provide about X^{(t+τ)}; therefore, by determining the delay time τ, the direction of the information flow between the time series of two nodes is obtained;

the node set of the complex system is denoted V and the number of nodes N; when the causal entropy between two nodes is greater than 0, the edge connecting the two nodes is added to the edge set E_C, and the matrix W_C ∈ R^{N×N} is the matrix weighted by the causal entropies; the causal network graph can be represented as G_C = (V, E_C, W_C); for any node x ∈ V, a node pointing to it with causal entropy greater than 0 is called a causal parent, and according to the optimal causal entropy principle the causal parent set of x is the smallest node set N_x maximizing the causal entropy; the optimal causal entropy algorithm can be divided into an aggregation phase and a deletion phase;

in the aggregation phase, for the node set V = {x, y_1, y_2, …, y_{N−1}}, the set of nodes excluding x is y = {y_1, y_2, …, y_{N−1}} and the causal parent set of node x is denoted z, which is empty at the initial stage of the algorithm; if y_i = argmax_{y_j ∈ y∖z} C_{y_j→x|z} and C_{y_i→x|z} > 0, node y_i is added to z, i.e., z = z ∪ {y_i}; in other words, y_i is the node currently in y but not in z with the greatest causal entropy, and that causal entropy is greater than 0; when no such node is found in y, the aggregation phase ends; in the deletion phase, the z obtained in the aggregation phase may be a superset of the nodes directly connected to x, so for each member z_i of z, if C_{z_i→x|z∖{z_i}} = 0, z_i is deleted from z; when all members of z have been traversed, the nodes left in z are the direct causal parents of x, the deletion phase ends, and at this time N_x = z; the optimal causal entropy algorithm is performed on each node in V to obtain the causal relationships E_C between every pair of nodes, and the causal network is initially constructed;
(2) calculating the causal entropy, the causal entropy being equivalent to (conditional) mutual information; the mutual information of two events X and Y is estimated by a mutual-information estimation method based on the K-nearest-neighbor algorithm:

I(X;Y) = ψ(k) + ψ(N) − ⟨ψ(n_x + 1) + ψ(n_y + 1)⟩,

where ⟨·⟩ denotes the average over all samples, k the number of neighbor points, ψ the digamma function, N the sample size, and n_x, n_y the neighbor counts of the K-nearest-neighbor algorithm in the X and Y directions; that is, for a fixed k, letting the distance from a data point w_i = (x_i, y_i) in the joint space to its k-th nearest neighbor be ε(i), n_x and n_y are the numbers of points x_j, y_j (j ≠ i) satisfying ‖x_j − x_i‖_x < ε(i) and ‖y_j − y_i‖_y < ε(i); when the norm acts on scalars, its value equals the absolute value of the difference between the scalars;
considering n independent samples {s_1, s_2, …, s_n} of the joint random variable S = (X, Y, Z), where s_i = (x_i, y_i, z_i), the estimate of I(X;Y|Z) is given by:

I(X;Y|Z) = ψ(k) − ⟨ψ(n_xz + 1) + ψ(n_yz + 1) − ψ(n_z + 1)⟩,

where ψ(k) is again the digamma function; for a fixed k, the distance from data point s_i in the joint space to its k-th nearest neighbor is ε(i), and the distance measure uses the maximum norm, i.e., ‖s_i − s_j‖_{xyz} = max{‖x_i − x_j‖_x, ‖y_i − y_j‖_y, ‖z_i − z_j‖_z}; based on this, more precisely:
n_xz(i) is the number of points (x_j, z_j) (j ≠ i) satisfying ‖(x_j, z_j) − (x_i, z_i)‖_{xz} < ε(i);
n_yz(i) is the number of points (y_j, z_j) (j ≠ i) satisfying ‖(y_j, z_j) − (y_i, z_i)‖_{yz} < ε(i);
n_z(i) is the number of points z_j (j ≠ i) satisfying ‖z_j − z_i‖_z < ε(i);
through the above process, W_C is successfully estimated, thereby constructing the complete causal network G_C = (V, E_C, W_C).
4. The causal Transformer-based networked data prediction method according to claim 2, wherein step 12, the distance network construction stage, is specifically as follows:

the distance network can be denoted G_D = (V, E_D, W_D), where E_D and W_D are the edge set and adjacency matrix of the distance network, respectively; the adjacency matrix W_D is generated from the distances between nodes, and its element in row i, column j is

w_{ij} = exp(−d_{ij}²/σ²) if i ≠ j and exp(−d_{ij}²/σ²) ≥ ε, and w_{ij} = 0 otherwise,

where d_{ij}, the distance between the i-th and j-th nodes, is computed from the nodes' longitude and latitude using the haversine package in Python, σ² is the assumed variance of the distances, and ε is the weight threshold; ε is set to 0.5 and σ² to 10; an element of the distance-network adjacency matrix W_D greater than 0 indicates that an edge exists at that position; through this step, the distance network G_D = (V, E_D, W_D) is successfully constructed.
5. The causal Transformer-based networked data prediction method according to claim 1, wherein step 2, establishing the space-time Transformer convolution model, comprises the following three steps:
step 21: preprocessing data;
step 22: the Transformer convolution space-time block;
step 23: and an output layer.
6. The causal Transformer-based networked data prediction method according to claim 5, wherein step 21, the data preprocessing, is as follows: a sliding window is manually selected to determine the input dimension, that is, the coupled information-flow data of the N nodes over M time steps, X ∈ R^{N×M}, is selected as the model input; the input dimension is limited to avoid an excessively long input time series making the high-dimensional neural network run too slowly.
7. The causal Transformer-based networked data prediction method according to claim 5, wherein step 22, the Transformer convolution space-time block, is specifically as follows: each space-time block comprises a residual Transformer module that extracts time-dimension information and a residual multi-graph convolution module that aggregates causal-network and distance-network information;
(1) the time module adopts the Informer model structure on the time axis to capture the temporal dynamic behavior of the data;
1.1) the Informer model comprises an encoder and a decoder; taking the l-th space-time block as an example, the module input is X_l ∈ R^{N×M}; in the encoder part, the input X_l is vector-mapped to X_enc, which comprises the linearly mapped input vector, the local position encoding of the elements, and the global position encoding of the elements over the entire time axis, giving X_enc not only local sequential information but also hierarchical timing information such as week, month, and year, as well as burst timestamp information;
after vector mapping, the data passes through several attention blocks, each containing multi-head probability-sparse self-attention; the output of each block extracts the relevant attention information by self-attention distillation, using a one-dimensional convolution layer Conv1d, an ELU activation layer, and max-pooling MaxPool; the process from layer j to layer (j+1) is defined by the formula:

X^{j+1} = MaxPool(ELU(Conv1d([X^j]_AB))),
where the layer-j input X^j passes through the attention block [·]_AB, after which dominant features with dominant attention are given higher weight by self-attention distillation: a padding length K_C is first selected and cyclic padding is performed at both ends of the time axis, a one-dimensional convolution is performed along the time dimension using Conv1d with identical input and output dimensions, and the data then passes through the activation function ELU, denoted x_ELU, whose expression is:

ELU(x) = x for x > 0, and ELU(x) = e^x − 1 for x ≤ 0;
finally, the max-pooling MaxPool operation is applied along the time dimension to the activated data to extract the maximum of the designated window, significantly reducing the size of the feature tensor; denoting the input time dimension by L_in, after max-pooling the feature tensor has time dimension L_out, determined by the pooling window and stride;
1.2) the Informer decoder takes as input the concatenation X_dec = Concat(X_token, X_0), where X_token is a start-token sequence whose time dimension has length L_token and X_0 is the placeholder for the target sequence, whose time dimension has length equal to the prediction horizon; X_0 is zero-padded and contains the timestamps of the target sequence; a sequence of a particular size is sampled from the input sequence as the start token, e.g., the traffic-flow data of the previous hour; the sequence X_dec passes through the masked multi-head probability-sparse self-attention layer and is combined with the output of the encoder, then multi-head attention is applied; the above process is repeated until the output Y_l is obtained, whose positions in the final output correspond to X_0;
(2) The space module is provided with a plurality of space modules,
2.1) since the complex-system distance network and causal network are non-Euclidean networks, the spectral-domain method of graph neural networks is adopted to extract features from the non-Euclidean structured data; the input of the l-th space module comprises the l-th time module output Y_l, the distance network G_D = (V, E_D, W_D), and the causal network G_C = (V, E_C, W_C); for convenience of description, the distance network and the causal network are collectively called the network G, whose adjacency matrix is denoted W ∈ R^{N×N}; using a graph convolution network based on the first-order approximation of the graph Laplacian, the time-module output Y_l yields the convolution output Z_l after graph convolution, the convolution process being described as:

Z_l = Ŵ Y_l Θ,  Ŵ = D̃^{−1/2}(W + I_N)D̃^{−1/2},

where Θ is the parameter of the convolution kernel, Ŵ is the renormalization result of W, I_N is the N-order identity matrix, and D̃ is the degree matrix of W̃ = W + I_N, whose diagonal elements are D̃_ii = Σ_j W̃_ij and whose off-diagonal elements are all 0;
residual connection is realized when stacking graph convolution layers, and multiple graph convolutions are used to aggregate the information of the causal network and the distance network; the output of the l-th time module is denoted Y_l; Y_l is convolved on G_D and G_C respectively and the extracted features are integrated to obtain S_l: the graph-convolution outputs of Y_l on G_D and G_C are connected in series, S_l = ReGCN_D(Y_l) ⊕ ReGCN_C(Y_l), where S_l is the output of the l-th space module and ReGCN represents the residual graph convolution operation, which adds the graph-convolution output of Y_l to the output of a linear transformation FC_1 of Y_l, i.e., ReGCN(Y_l) = GCN(Y_l) + FC_1(Y_l); the subscript C indicates that the graph convolution uses the causal network, and the subscript D that it uses the distance network;
2.2) to ensure that the space-time block does not change the data dimension, the output of the space module is linearly transformed (FC_2) to obtain the output X_{l+1} of the l-th space-time block:

X_{l+1} = FC_2(S_l),

and the model depth is increased by stacking space-time blocks, so the output of the l-th space-time block serves as the input of the (l+1)-th space-time block.
8. The causal Transformer-based networked data prediction method according to claim 5, wherein step 23, the output layer, is specifically as follows:

the prediction result is obtained by decoding at the output layer; after features are extracted by several space-time blocks, the extracted information is decoded, i.e., a linear transformation FC_3 makes the output dimension equal to the number of predicted time steps; denoting the output X_L of the (L−1)-th space-time block as the finally extracted features, the output Y_pred ∈ R^{N×K} is the prediction result, and the decoding process of the output layer is:

Y_pred = FC_3(X_L).
CN202310776376.0A 2023-06-19 2023-06-28 Causal Transformer-based networked data prediction method Pending CN116777068A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2023107273085 2023-06-19
CN202310727308 2023-06-19

Publications (1)

Publication Number Publication Date
CN116777068A true CN116777068A (en) 2023-09-19

Family

ID=88009661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310776376.0A Pending CN116777068A (en) 2023-06-19 2023-06-28 Causal transducer-based networked data prediction method

Country Status (1)

Country Link
CN (1) CN116777068A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117332377A (en) * 2023-12-01 2024-01-02 西南石油大学 Discrete time sequence event mining method and system based on deep learning
CN117332377B (en) * 2023-12-01 2024-02-02 西南石油大学 Discrete time sequence event mining method and system based on deep learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination