CN114913465A - Action prediction method based on time sequence attention model - Google Patents

Action prediction method based on time sequence attention model

Info

Publication number
CN114913465A
Authority
CN
China
Prior art keywords
frame
image
module
attention
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210610980.1A
Other languages
Chinese (zh)
Inventor
徐涛
黄焯旭
韩军功
范振坤
雷超
程王婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Famai Technology Ningbo Co ltd
Original Assignee
Famai Technology Ningbo Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Famai Technology Ningbo Co ltd filed Critical Famai Technology Ningbo Co ltd
Priority to CN202210610980.1A priority Critical patent/CN114913465A/en
Publication of CN114913465A publication Critical patent/CN114913465A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/172Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an action prediction method based on a time sequence attention model. Built on deep learning, the time sequence attention model performs feature analysis on image frame data from video through a self-attention module, fuses a temporal model, recursively integrates spatio-temporal context information, and performs reasoning and fitting in a self-supervised manner, so that long-term future actions can be predicted. A virtual frame structure is introduced, which simplifies the complex prediction task into an action classification task on the virtual frame, so that the information integration and classification capability of the existing model is exploited to the greatest extent; the problems of existing algorithms, such as low detection accuracy and short prediction horizon, can thus be addressed more effectively.

Description

Action prediction method based on time sequence attention model
Technical Field
The invention relates to the field of computer vision and image processing, and in particular to an action prediction method based on a time sequence attention model.
Background
Human action prediction is an emerging task in computer vision and artificial intelligence; its application scenarios include, but are not limited to, pedestrian trajectory prediction in automatic driving, household assistive robots and VR gaming perception. For video monitoring in home-based elderly care in particular, predicting the future actions of a human body plays an important role in scenarios such as fall prevention and emergency rescue. Human action prediction requires the algorithmic model to go beyond purely spatio-temporal visual classification and to predict the multi-modal distribution of future actions. Unlike action recognition, which can avoid temporal reasoning by integrating the complete context information, action prediction requires modeling past actions and inferring the actions to come, which makes the task complex and difficult; long-term modeling of the spatio-temporal context is its core.
Typical long-term spatio-temporal context modeling methods extract features from sampled frame images or segments, aggregate the features with clustering-, recursion- or attention-based models, and directly output the classification result of the predicted action. Most of these models only aggregate features over the temporal domain and do not model the temporal evolution of the video frames, so their predictions are often inaccurate.
Another scheme predicts short-term actions (the action occurring in the next frame) with a temporal LSTM model: the information of previous frames is integrated and the next frame is predicted recursively, inferring forward step by step. However, the errors produced by the model accumulate during long-term prediction, so this scheme breaks down in long-term prediction scenarios and cannot satisfy the prediction-centered requirements of home-based elderly-care scenes.
The most recent scheme exploits the long-term perception capability of the attention mechanism and builds a Transformer model to analyze the video data and predict the next action. It integrates context information well, integrates and predicts frame information synchronously, and adopts a self-supervised manner to suppress, to the greatest extent, the errors accumulated during long-term prediction, so that prediction works better. However, the attention model requires a large amount of computation (such as large-scale GPU clusters), and its computational complexity grows quadratically with the amount of data processed, which makes it difficult to handle large amounts of video data and to deploy the model in a product. For the same reason, this scheme also struggles to predict over longer horizons.
On the other hand, the performance of a deep learning model depends heavily on how well the training data fit the specific task; because relevant scene data for model training are lacking, existing action prediction schemes cannot be directly applied to home-based elderly care.
In summary, existing schemes for human action prediction still suffer from a short prediction horizon, insufficient prediction accuracy and heavy computation, and, being limited to particular application scenarios, cannot meet the requirements of fall prevention and emergency rescue in home-based elderly-care scenes.
Disclosure of Invention
In order to solve the above problems, the present invention provides a motion prediction method based on a time series attention model, which can analyze video data by using the time series attention model and has high prediction accuracy.
Therefore, the technical scheme of the invention is as follows: a motion prediction method based on a time sequence attention model comprises the following steps:
1) video data sampling: selecting a dense video stream in which every frame carries a corresponding action label as the training video, and sampling a certain number of frame images from the video stream data;
2) image preprocessing: normalizing the images sampled in step 1), and then scaling, cropping and flipping them;
3) establishing and training a time sequence attention model: the time sequence attention model comprises an encoder, a decoder and a prediction classifier;
a coding stage, namely coding a frame image by using a pre-trained Vision Transformer (ViT) model, exploiting the strong image-analysis capability of the Transformer model; the Vision Transformer model comprises a Patch Embedding (PE) module, a Self-Attention (SA) module, a feed-forward network (FFN) module and residual connections;
the Self-Attention module calculates the weight between each two blocks through an Attention mechanism and further performs feature fusion; mapping X to a high-dimensional space using multiple linear layers, respectively expressed as:
Q=Wq*X
K=Wk*X
V=Wv*X
wherein X is the input image, Q is the query matrix, K is the key matrix, V is the value matrix, and Wq, Wk, Wv respectively represent the learning parameters corresponding to Q, K and V; the relation between every two blocks can be calculated through Q and K, giving the attention map Am, and the weight of each block can then be calculated through the attention map Am and V;
Am=SoftMax((Q*K^T)/sqrt(D))
wherein SoftMax refers to exponential normalization of the calculation result, D represents the number of feature channels of Q, K, V, and sqrt denotes the square-root operation;
the Self-Attention module calculated feature F1 can be expressed as:
F1=Am*V;
decoding stage: the decoder comprises a Multi-Head Self-Attention module, a virtual frame structure and a time sequence reasoning structure;
I) Multi-Head Self-Attention module: the input in the decoding process is the high-dimensional feature representation of the encoded frame images; the computed features are the spatio-temporal context information between frames in the decoding process;
the introduction of the Multi-Head mechanism is as follows:
Q=[Q1,Q2,...,Qh],Qh=Wq_h*X
K=[K1,K2,...,Kh],Kh=Wk_h*X
V=[V1,V2,...,Vh],Vh=Wv_h*X
II) position coding: frame position coding and attention map coding are introduced to enhance the frame image features; for the frame position coding, the frame images are numbered in order and encoded into a high-dimensional feature Pe through a standard embedding layer;
the attention map Am computed in the Self-Attention step is encoded through a standard multi-layer perceptron to obtain a high-dimensional feature Ae,
the initial inputs to the decoder are:
Input=Pe+Fe
wherein Fe is the final output of the encoding stage;
the calculation process of setting the first layer of Transformer is as follows:
TF_1=FFN(MHSA(Input))
MHSA is the calculation process of the Multi-Head Self-Attention module, and FFN is the calculation process of the feed-forward network module;
the calculation process of the nth layer Transformer is as follows:
TF_n=FFN(MHSA(TF_n-1+Ae))
as described above, Ae is the attention map coding from the (n-1)-th layer Transformer;
III) a virtual frame structure: an initialized virtual frame is treated as the image feature of a real frame, is given a corresponding position code according to the prediction target, and is fed together with its position code into the multi-head attention model for decoding;
defining the virtual frame as Vf, then the initial input of the decoder after introducing the virtual frame structure is:
Input=Pe+Concatenate(Fe,Vf)
wherein Concatenate () is a standard splicing operation;
IV) a time sequence reasoning structure:
dividing the complete T-frame image feature sequence into non-overlapping sequence segments, each containing t frames, which are fed into the multi-head attention model one at a time, i.e. the input sequence length of the multi-head attention model is limited to t; the complete sequence is decoded cyclically through recursive reasoning, finally yielding the required decoded features;
the prediction classifier: the channel dimension of the decoded frame image features is mapped to the number of specific action categories through a standard MLP, and the channel with the maximum value is taken as the classification result.
Preferably, the video data sampling in step 1) specifically includes the following steps:
a. acquiring the action segments of the video stream: dividing the complete video stream into different sub-video streams according to the per-frame action labels, so that the video data yield a plurality of sub-video streams each containing one complete action, and intercepting one of the sub-video streams and its corresponding action as the target to be predicted;
b. sampling forward from the sub-video stream to obtain the observed data that is fed to the network for analysis and prediction; letting the intercepted sub-video stream start at time S and end at time E, and the model be required to predict the action A seconds before it occurs with an observation window of O seconds, the video stream from time St = E-O to time Et = S-A is intercepted as input data, and T frame images together with their corresponding action labels are sampled from it as the model input.
Preferably, during the image preprocessing in step 2), the image is standardized using the respective mean and standard deviation of the three RGB channels of the frame image, i.e. color values in the range [0,255] are normalized to the range [0,1]; when scaling, the length and width of the frame image are each randomly scaled into the range of [248,280] pixels; when cropping, the frame image is cropped to 224 × 224 pixels; and the frame image is randomly flipped horizontally.
Preferably, the Patch Embedding module divides the image into uniform 16 × 16 blocks and flattens the pixels within each block, i.e. applies a two-dimensional convolution whose kernel size and stride are both 16, followed by a layer normalization module LayerNorm, expressed as:
PE(X)=LayerNorm(Conv(X))
where Conv denotes the two-dimensional convolution whose kernel size and stride are both 16.
Preferably, the feed-forward network module is formed by a multi-layer perceptron comprising two Linear layers and a ReLU activation function.
Preferably, the Vision Transformer model is calculated by the following process:
X=PE(X)
F1=SA(X)
Af=F1+X
Fe=FFN(Af)+Af
wherein: x is an input image, PE is the calculation process of a Patch Embedding module, SA is the calculation process of a Self-Attention module, and FFN is the calculation process of a feedforward network module.
Preferably, the time sequence reasoning structure further includes a memory module: a memory unit is configured to store the K and V of the Self-Attention calculation and pass them to the next recursion step, i.e. the memory module is introduced in the Self-Attention process of the n-th recursive calculation:
K_n=Concatenate(K_n-1,Wk*X)
V_n=Concatenate(V_n-1,Wv*X)
wherein K_n-1 and V_n-1 are the K and V from the previous recursive calculation.
Compared with the prior art, the invention has the beneficial effects that:
1. The image frame data from the video are subjected to feature analysis through a self-attention module; a temporal model is fused, spatio-temporal context information is recursively integrated, and reasoning and fitting are carried out in a self-supervised manner, so that long-term future actions can be predicted.
2. The concept of a virtual frame is introduced: the virtual frame is integrated with the real context information during recursive reasoning, and the corresponding action classification is output. The virtual frame allows the model to concentrate on information integration and classification, simplifies the complex prediction task into an action classification task on the virtual frame, and greatly exploits the learning capability of the model.
3. By combining the long-term information-capturing capability of the attention model with recursive reasoning, the computational complexity of the model grows only linearly with the amount of data processed while the context information is integrated, which more effectively addresses the problems of existing algorithms such as low detection accuracy, short prediction horizon, difficult deployment and unsuitability for home-based elderly-care scenes.
4. Action data are collected in home-based elderly-care scenes and annotated for human bodies according to the needs of action prediction, so that the time sequence attention model is fitted to the home-based elderly-care scene and the performance of the algorithm is maximized.
Drawings
The following detailed description is made with reference to the accompanying drawings and embodiments of the present invention.
FIG. 1 is a block diagram of the algorithm of the present invention;
FIG. 2 is a diagram of the timing reasoning architecture of the present invention;
FIG. 3 is a schematic diagram of video sampling according to the present invention.
Detailed Description
See the drawings. The method predicts future actions in video for home-based elderly-care scenes. A time sequence attention model is designed on the basis of a deep neural network to analyze the video data: the model performs feature analysis on the image frame data from the video through a self-attention module, fuses a temporal model, recursively integrates spatio-temporal context information, and performs reasoning and fitting in a self-supervised manner, so that long-term future actions can be predicted.
The method specifically comprises the following steps:
first, video data sampling and image preprocessing
1. Video data sampling: the data used for training is a dense video stream in which every frame carries a corresponding action label. First, a certain number of frame images are sampled from the video stream data, as follows:
1.1 Obtaining the action segments of the video stream. According to the per-frame action labels, the complete video stream is divided into different sub-video streams, yielding several sub-video streams that each contain one complete action; one sub-video stream and its corresponding action are intercepted as the target to be predicted.
1.2 Sampling forward from the sub-video stream to obtain the observed data that is fed to the network for analysis and prediction. Let the intercepted sub-video stream start at time S and end at time E. When the model is required to predict the action A seconds before it occurs, the data used being a video stream of O seconds, the video stream from time St = E-O to time Et = S-A is taken as input data, and T frame images together with their corresponding action labels are sampled from it as the model input.
The sampling mode can be randomly distributed sampling, pre-distributed sampling (selecting the first T frames of the input video stream) or post-distributed sampling (selecting the last T frames of the input video stream). Post-distributed sampling is used as the default mode because this segment of video frames is closest to the action to be predicted.
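As an illustration of this sampling step, the following Python/NumPy sketch splits a dense per-frame label track into single-action sub-streams and applies the three sampling modes described above; the helper names (split_into_actions, sample_frames) and the mode argument are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def split_into_actions(frame_labels):
    """Split a dense per-frame label track into (start, end, label) sub-streams,
    one for each contiguous run of the same action label."""
    segments, start = [], 0
    for i in range(1, len(frame_labels) + 1):
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((start, i - 1, frame_labels[start]))
            start = i
    return segments

def sample_frames(window_frames, T, mode="post"):
    """Pick T frame indices from the observation window (the frames between
    times St and Et defined above)."""
    window_frames = np.asarray(window_frames)
    if mode == "post":   # default: last T frames, closest to the action to predict
        return window_frames[-T:]
    if mode == "pre":    # first T frames of the window
        return window_frames[:T]
    return np.sort(np.random.choice(window_frames, size=T, replace=False))  # random
```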
2. Image preprocessing:
2.1 image normalization: the image is normalized by the mean and standard deviation of each of the three channels RGB of the frame image, i.e., the color values in the [0,255] range are normalized to the [0,1] range.
2.2 Image data enhancement. Data enhancement expands the diversity of the data set, which helps avoid model overfitting and gives the model better generalization on the test set.
1) Image scaling: the length and width of the frame image are each randomly scaled into the range of [248,280] pixels.
2) Image cropping: the frame image is randomly cropped to 224 × 224 pixels.
3) Random flipping: the frame image is randomly flipped horizontally.
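A minimal preprocessing sketch in Python (torchvision), following the normalization, random scaling to [248,280], 224 × 224 random cropping and random horizontal flipping described above; the per-channel mean/std values are placeholder assumptions, since the text only states that the channel-wise statistics of the frame images are used.

```python
import random
from torchvision import transforms
from torchvision.transforms import functional as TF

# Placeholder per-channel statistics (assumption); the text only says the
# respective mean and standard deviation of the three RGB channels are used.
MEAN, STD = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

def preprocess_frame(pil_image):
    """Standardize to [0,1] then per channel, with random scale / crop / flip."""
    h = random.randint(248, 280)              # height randomly scaled into [248, 280]
    w = random.randint(248, 280)              # width randomly scaled into [248, 280]
    img = TF.resize(pil_image, [h, w])
    img = transforms.RandomCrop(224)(img)     # random 224 x 224 crop
    if random.random() < 0.5:
        img = TF.hflip(img)                   # random horizontal flip
    x = TF.to_tensor(img)                     # [0,255] uint8 -> [0,1] float tensor
    return TF.normalize(x, MEAN, STD)         # per-channel standardization
```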
Second, the establishment and training of time sequence attention model
The time sequence attention model is used for analyzing observed video data and performing motion prediction in combination with a virtual frame structure and comprises an encoder, a decoder and a prediction classifier.
1. Encoding stage: the frame images are encoded with a pre-trained Vision Transformer (ViT) model, exploiting the strong image-analysis capability of the Transformer architecture. The ViT model includes a Patch Embedding (PE) module, a Self-Attention (SA) module, a Feed-Forward Network (FFN) module, and residual connections.
Assuming X is the input image, the feature extraction process of the Vision Transformer model can be expressed as:
X=PE(X)
Af=SA(X)+X
Fe=FFN(Af)+Af
wherein: x is the input image, PE () is the output result of the Patch Embedding module, SA () is the output result of the Self-orientation module, FFN () is the output result of the feedforward network module, and Fe is the final output of the encoding stage.
The Patch Embedding (PE) module divides the image into uniform 16 × 16 blocks and flattens the pixels within each block, specifically through a two-dimensional convolution whose kernel size and stride are both 16, followed by a layer normalization module LayerNorm, expressed as:
PE(X)=LayerNorm(Conv(X))
where Conv denotes the two-dimensional convolution whose kernel size and stride are both 16.
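A minimal PyTorch sketch of PE(X) = LayerNorm(Conv(X)) as described: a 16 × 16 convolution with stride 16 that turns each patch into one token, followed by LayerNorm; the embedding dimension of 768 is an assumption (the text does not fix it).

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """PE(X) = LayerNorm(Conv(X)): kernel size 16, stride 16, one token per patch."""
    def __init__(self, in_channels=3, embed_dim=768):   # embed_dim is assumed
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=16, stride=16)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                  # x: (B, 3, 224, 224) preprocessed frame
        x = self.proj(x)                   # (B, D, 14, 14)
        x = x.flatten(2).transpose(1, 2)   # (B, 196, D): flattened patch tokens
        return self.norm(x)
```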
The Self-Attention (SA) module calculates the weight between every two blocks through the attention mechanism and then performs feature fusion, as follows:
X is mapped into a high-dimensional space by several Linear layers, expressed respectively as:
Q=Wq*X
K=Wk*X
V=Wv*X
wherein Wq, Wk, Wv respectively represent the learning parameters corresponding to Q (query), K (key) and V (value); the relation between every two blocks can be calculated through Q and K, i.e. the Attention Map (Am for short), and the weight of each block can then be calculated through the Attention Map Am and V.
Am=SoftMax((Q*K^T)/sqrt(D))
wherein SoftMax refers to exponential normalization of the calculation result, D represents the number of feature channels of Q, K, V, and sqrt denotes the square-root operation.
Weight calculation: the feature F1 computed by the Self-Attention module can be expressed as:
F1=Am*V;
and the calculation process of the Self-Attention module can thus be abbreviated as F1 = SA(X).
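The single-head attention described by F1 = Am*V with Am = SoftMax(Q*K^T/sqrt(D)) might be sketched as follows (PyTorch); this is a reading of the formulas above, not the authors' code.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """F1 = Am * V, with Am = SoftMax(Q K^T / sqrt(D))."""
    def __init__(self, dim):
        super().__init__()
        self.wq = nn.Linear(dim, dim)   # Wq
        self.wk = nn.Linear(dim, dim)   # Wk
        self.wv = nn.Linear(dim, dim)   # Wv

    def forward(self, x):               # x: (B, N, D) patch tokens from PE
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        am = torch.softmax(q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5), dim=-1)
        return am @ v                   # F1: attention-weighted fusion of the blocks
```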
The Feed-Forward Network (FFN) module is composed of a multi-layer perceptron (MLP) containing two Linear layers and a ReLU activation function; the MLP is a basic standard building block of existing deep learning networks.
Residual connections: on this basis, the complete calculation process of ViT is obtained by adding residual connections around the SA and FFN modules:
X=PE(X)
F1=SA(X)
Af=F1+X
Fe=FFN(Af)+Af
wherein: x is the input image, PE () is the calculation process of the Patch Embedding module, SA () is the calculation process of the Self-orientation module, and FFN () is the calculation process of the feedforward network module.
As can be seen from this calculation process, the spatial resolution of the input image is reduced by a factor of 16 in the PE step, while the shape remains unchanged through SA and FFN. To enhance the learning capability of the network, ViT stacks SA and FFN layers to deepen the network. Finally, a high-dimensional feature representation of the frame image is obtained through a standard pooling layer.
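Putting the pieces together, one encoder block and the per-frame ViT encoder might look as follows; this reuses the PatchEmbedding and SelfAttention sketches above, and the hidden dimension, depth and mean pooling are assumptions, since the text only specifies the residual wiring Af = SA(X) + X, Fe = FFN(Af) + Af and a "standard pooling layer".

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """FFN: two Linear layers with a ReLU in between (a standard MLP)."""
    def __init__(self, dim, hidden_dim=3072):          # hidden_dim is assumed
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, dim))
    def forward(self, x):
        return self.net(x)

class ViTBlock(nn.Module):
    """One encoder block: Af = SA(X) + X ; Fe = FFN(Af) + Af."""
    def __init__(self, dim):
        super().__init__()
        self.sa = SelfAttention(dim)                   # from the sketch above
        self.ffn = FeedForward(dim)
    def forward(self, x):
        af = self.sa(x) + x
        return self.ffn(af) + af

class ViTEncoder(nn.Module):
    """Patch embedding, stacked SA/FFN blocks, then pooling to one vector per frame."""
    def __init__(self, dim=768, depth=12):             # depth is assumed
        super().__init__()
        self.pe = PatchEmbedding(embed_dim=dim)        # from the sketch above
        self.blocks = nn.ModuleList([ViTBlock(dim) for _ in range(depth)])
    def forward(self, frame):                          # frame: (B, 3, 224, 224)
        x = self.pe(frame)
        for blk in self.blocks:
            x = blk(x)
        return x.mean(dim=1)                           # pooled frame feature (B, dim)
```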
2. Decoding stage. The decoder is the core component of the time sequence attention model and comprises a Multi-Head Self-Attention module, a virtual frame structure and a time sequence reasoning structure.
2.1 Multi-Head Self-Attention module and position coding.
The calculation of the Multi-Head Self-Attention module is similar to the Transformer structure of the encoding process; the key difference is that the input of the encoding process is the image patches, whereas the input of the decoding process is the high-dimensional feature representation of the encoded frame images. Accordingly, the computed features change from image information during encoding to spatio-temporal context information between frames during decoding.
The introduction of the Multi-Head mechanism is as follows:
Q=[Q1,Q2,...,Qh],Qh=Wq_h*X
K=[K1,K2,...,Kh],Kh=Wk_h*X
V=[V1,V2,...,Vh],Vh=Wv_h*X
the rest of the calculation is consistent with the calculation of the SA module of the encoding process.
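A multi-head variant of the SA sketch, splitting Q, K, V into h heads as in Q = [Q1, ..., Qh]; the number of heads is an assumption, and the optional memory_kv argument anticipates the memory module of section 2.3 below.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Q, K, V are split into h heads, each head computes Am * V, heads are re-joined."""
    def __init__(self, dim, num_heads=8):              # h = 8 is assumed
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.dh = num_heads, dim // num_heads
        self.wq, self.wk, self.wv = (nn.Linear(dim, dim) for _ in range(3))
        self.out = nn.Linear(dim, dim)

    def forward(self, x, memory_kv=None):              # x: (B, N, D) frame features
        B, N, _ = x.shape
        split = lambda t: t.view(B, N, self.h, self.dh).transpose(1, 2)  # -> (B, h, N, D/h)
        q, k, v = split(self.wq(x)), split(self.wk(x)), split(self.wv(x))
        if memory_kv is not None:                      # K/V carried over from a previous recursion
            k = torch.cat([memory_kv[0], k], dim=2)
            v = torch.cat([memory_kv[1], v], dim=2)
        am = torch.softmax(q @ k.transpose(-2, -1) / (self.dh ** 0.5), dim=-1)
        out = (am @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out), (k, v)                   # (k, v) can be stored by the memory module
```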
Position coding: since the attention mechanism is insensitive to the sequential structure between frames, frame position coding and Attention Map coding are introduced to enhance the frame image features, so that the network can learn the position information of the frame images and integrate the spatio-temporal context for action prediction.
Frame position coding: the frame images are numbered in order and encoded into a high-dimensional feature Pe through a standard embedding layer.
Attention Map coding: the attention map Am computed by the previous layer is encoded into a high-dimensional feature Ae through a standard multi-layer perceptron. With Fe denoting the high-dimensional frame-image features obtained in the encoding stage, the initial input of the decoder is:
Input=Pe+Fe
the calculation process of setting the first layer of Transformer is as follows:
TF_1=FFN(MHSA(Input))
the calculation process of the MHSA, i.e., Multi-Head Self-orientation module, is distinguished from SA (); the calculation process of the nth layer Transformer is as follows:
TF_n=FFN(MHSA(TF_n-1+Ae))
as described above, Ae is the attention map coding from the (n-1)-th layer Transformer.
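The frame position coding Pe, the Ae injection and the layer recursion TF_n = FFN(MHSA(TF_{n-1} + Ae)) might be wired as below; this reuses the MultiHeadSelfAttention sketch above, treats Ae (the MLP-encoded attention map of the previous layer) as a precomputed tensor, and the maximum sequence length and hidden width are assumptions.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer: TF_n = FFN(MHSA(TF_{n-1} + Ae)); the first layer gets Input = Pe + Fe."""
    def __init__(self, dim, hidden_dim=2048):                 # hidden_dim is assumed
        super().__init__()
        self.mhsa = MultiHeadSelfAttention(dim)               # from the sketch above
        self.ffn = nn.Sequential(nn.Linear(dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, dim))

    def forward(self, x, ae=None, memory_kv=None):
        if ae is not None:                                    # attention-map coding Ae
            x = x + ae
        y, kv = self.mhsa(x, memory_kv)
        return self.ffn(y), kv

# Frame position coding: frames are numbered in order and embedded
# (a standard embedding layer); the decoder's initial input is Pe + Fe.
pos_embed = nn.Embedding(64, 768)          # 64 = assumed maximum sequence length
def decoder_input(fe):                     # fe: (B, T, 768) encoded frame features
    idx = torch.arange(fe.shape[1], device=fe.device)
    return fe + pos_embed(idx)             # Input = Pe + Fe
```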
2.2 Virtual frame structure. Different from existing schemes, this embodiment introduces the concept of a virtual frame: the virtual frame is integrated with the real context information, and the action classification corresponding to the virtual frame is finally output, thereby achieving the purpose of prediction. The virtual frame allows the time sequence attention model to focus on information integration and classification, and simplifies the complex prediction task into an action classification task on the virtual frame. Specifically, an initialized virtual frame is treated as the image feature of a real frame, is given the corresponding position code according to the prediction target (i.e. how many frames ahead the action is to be predicted), and is fed into the multi-head attention module for decoding.
Defining the virtual frame as Vf, then introducing the virtual frame structure and then the initial input of the decoder is:
Input=Pe+Concatenate(Fe,Vf)
wherein Concatenate() is a standard concatenation operation.
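One way to read the virtual-frame construction, with learnable placeholder embeddings standing in for the frames to be predicted (the number of virtual frames and the zero initialization are assumptions):

```python
import torch
import torch.nn as nn

class VirtualFrames(nn.Module):
    """Input = Pe + Concatenate(Fe, Vf): real frame features plus learnable virtual frames."""
    def __init__(self, num_virtual=1, dim=768):        # both values are assumed
        super().__init__()
        self.vf = nn.Parameter(torch.zeros(num_virtual, dim))

    def forward(self, fe, pos_embed):                  # fe: (B, T, dim) encoded real frames
        vf = self.vf.unsqueeze(0).expand(fe.shape[0], -1, -1)
        x = torch.cat([fe, vf], dim=1)                 # Concatenate(Fe, Vf)
        idx = torch.arange(x.shape[1], device=x.device)
        return x + pos_embed(idx)                      # add frame position coding Pe
```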
2.3 Time sequence reasoning structure. Since the complexity of the attention model grows quadratically with the amount of data processed, it is difficult to process large batches of video data. The invention therefore designs a time sequence reasoning structure which integrates the context information while, through recursive reasoning, keeping the computational complexity of the model linear in the amount of data processed.
First, the complete T-frame image feature sequence (including the virtual frames) is divided into non-overlapping segments, each containing t frames, which are fed into the multi-head attention model one at a time; that is, the input sequence length of the multi-head attention model is limited to t. The complete sequence is decoded cyclically through recursive reasoning, finally yielding the required decoded features.
As can be seen from the calculation of the SA module, the model complexity grows quadratically as t increases, whereas with t fixed it grows only linearly with T. The amount of data that the whole model can process is therefore greatly enlarged.
Because the sequence segments do not overlap with each other, a memory module is introduced in this embodiment to avoid losing context information during recursive inference: a memory unit is configured to store the K and V of the SA calculation and pass them to the next recursion step, i.e.:
K_n=Concatenate(K_n-1,Wk*X)
V_n=Concatenate(V_n-1,Wv*X)
wherein: k _ n-1 and V _ n-1 are K and V from the last recursive calculation, and the rest of calculation is consistent with the calculation of the standard SA module.
3. Prediction classifier. The prediction classifier is a standard classifier in deep learning models: the channel dimension of the decoded frame image features is mapped to the number of specific action categories through a standard MLP, and the channel with the maximum value is taken as the classification result.
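Finally, the prediction head might be a plain MLP over the decoded virtual-frame feature, with the class count and hidden width as assumptions:

```python
import torch
import torch.nn as nn

num_classes, dim = 30, 768                          # both values are assumptions
classifier = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                           nn.Linear(dim, num_classes))

def predict(decoded_virtual_frame):                 # (B, dim) decoded virtual-frame feature
    logits = classifier(decoded_virtual_frame)      # map channels to action categories
    return logits.argmax(dim=-1)                    # channel with the maximum value
```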
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims (7)

1. A motion prediction method based on a time sequence attention model is characterized by comprising the following steps: the method comprises the following steps:
1) video data sampling: selecting a dense video stream with each frame corresponding action mark as a training video, and sampling a certain number of frames of images in video stream data;
2) image preprocessing: normalizing the image sampled in the step 1), and then zooming, cutting and turning the image;
3) establishing and training a time sequence attention model: the time sequence attention model comprises an encoder, a decoder and a prediction classifier;
a coding stage, namely coding a frame image by using a pre-trained Vision Transformer (ViT) model, exploiting the strong image-analysis capability of the Transformer model; the Vision Transformer model comprises a Patch Embedding (PE) module, a Self-Attention (SA) module, a feed-forward network (FFN) module and residual connections;
the Self-Attention module calculates the weight between each two blocks through an Attention mechanism and further performs feature fusion; mapping X to a high-dimensional space using multiple linear layers, respectively expressed as:
Q=Wq*X
K=Wk*X
V=Wv*X
wherein X is the input image, Q is the query matrix, K is the key matrix, V is the value matrix, and Wq, Wk, Wv respectively represent the learning parameters corresponding to Q, K and V; the relation between every two blocks can be calculated through Q and K, giving the attention map Am, and the weight of each block can then be calculated through the attention map Am and V;
Am=SoftMax((Q*K^T)/sqrt(D))
wherein SoftMax refers to exponential normalization of the calculation result, D represents the number of feature channels of Q, K, V, and sqrt denotes the square-root operation;
the Self-Attention module calculated feature F1 can be expressed as:
F1=Am*V;
decoding stage: the decoder comprises a Multi-Head Self-Attention module, a virtual frame structure and a time sequence reasoning structure;
I) Multi-Head Self-Attention module: the input in the decoding process is the high-dimensional feature representation of the encoded frame images; the computed features are the spatio-temporal context information between frames in the decoding process;
the introduction of the Multi-Head mechanism is as follows:
Q=[Q1,Q2,...,Qh],Qh=Wq_h*X
K=[K1,K2,...,Kh],Kh=Wk_h*X
V=[V1,V2,...,Vh],Vh=Wv_h*X
II) position coding: introducing frame position coding and attention map coding to enhance frame image characteristics; coding frame positions, namely numbering frame images in sequence and coding the frame images into high-dimensional characteristics Pe through a standard embedding layer;
the attention map Am computed in the Self-Attention step is encoded through a standard multi-layer perceptron to obtain a high-dimensional feature Ae,
the initial inputs to the decoder are:
Input=Pe+Fe
wherein Fe is the final output of the encoding stage;
the calculation process of setting the first layer of Transformer is as follows:
TF_1=FFN(MHSA(Input))
MHSA is the calculation process of the Multi-Head Self-Attention module, and FFN is the calculation process of the feed-forward network module;
the calculation process of the nth layer Transformer is as follows:
TF_n=FFN(MHSA(TF_n-1+Ae))
as described above, Ae is the attention map coding from the (n-1)-th layer Transformer;
III) a virtual frame structure: an initialized virtual frame is treated as the image feature of a real frame, is given a corresponding position code according to the prediction target, and is fed together with its position code into the multi-head attention model for decoding;
defining the virtual frame as Vf, then introducing the virtual frame structure and then the initial input of the decoder is:
Input=Pe+Concatenate(Fe,Vf)
wherein Concatenate () is a standard splicing operation;
IV) a time sequence reasoning structure:
dividing the complete T-frame image feature sequence into non-overlapping sequence segments, each containing t frames, which are fed into the multi-head attention model one at a time, i.e. the input sequence length of the multi-head attention model is limited to t; the complete sequence is decoded cyclically through recursive reasoning, finally yielding the required decoded features;
the prediction classifier: the channel dimension of the decoded frame image features is mapped to the number of specific action categories through a standard MLP, and the channel with the maximum value is taken as the classification result.
2. The method according to claim 1, wherein the motion prediction method based on the time-series attention model comprises: the video data sampling in the step 1) specifically comprises the following steps:
a. acquiring the action segments of the video stream: dividing the complete video stream into different sub-video streams according to the per-frame action labels, so that the video data yield a plurality of sub-video streams each containing one complete action, and intercepting one sub-video stream and its corresponding action as the target to be predicted;
b. sampling forward from the sub-video stream to obtain the observed data that is fed to the network for analysis and prediction; letting the intercepted sub-video stream start at time S and end at time E, and the model be required to predict the action A seconds before it occurs, the data used being a video stream of O seconds, the video stream from time St = E-O to time Et = S-A is taken as input data, and T frame images together with their corresponding action labels are sampled from it as the model input.
3. The method according to claim 1, wherein the motion prediction method based on the time-series attention model comprises: during the image preprocessing in step 2), the image is standardized using the respective mean and standard deviation of the three RGB channels of the frame image, i.e. color values in the range [0,255] are normalized to the range [0,1]; when scaling, the length and width of the frame image are each randomly scaled into the range of [248,280] pixels; when cropping, the frame image is cropped to 224 × 224 pixels; and the frame image is randomly flipped horizontally.
4. The method according to claim 1, wherein the motion prediction method based on the time-series attention model comprises: the Patch Embedding module divides the image into uniform 16 × 16 blocks and flattens the pixels within each block, i.e. applies a two-dimensional convolution whose kernel size and stride are both 16, followed by a layer normalization module LayerNorm, expressed as:
PE(X)=LayerNorm(Conv(X))
where Conv denotes the two-dimensional convolution whose kernel size and stride are both 16.
5. The method of claim 1, wherein the method comprises: the feed-forward network module is formed by a multi-layer perceptron comprising two Linear layers and a ReLU activation function.
6. The method according to claim 1, wherein the motion prediction method based on the time-series attention model comprises: the Vision Transformer model has the following calculation process:
X=PE(X)
F1=SA(X)
Af=F1+X
Fe=FFN(Af)+Af
wherein: x is an input image, PE is the calculation process of a Patch Embedding module, SA is the calculation process of a Self-Attention module, and FFN is the calculation process of a feedforward network module.
7. The method according to claim 1, wherein the motion prediction method based on the time-series attention model comprises: the time sequence reasoning structure further comprises a memory module; a memory unit is configured to store the K and V of the Self-Attention calculation and pass them to the next recursion step, i.e. the memory module is introduced in the Self-Attention process of the n-th recursive calculation:
K_n=Concatenate(K_n-1,Wk*X)
V_n=Concatenate(V_n-1,Wv*X)
wherein K_n-1 and V_n-1 are the K and V from the previous recursive calculation.
CN202210610980.1A 2022-05-31 2022-05-31 Action prediction method based on time sequence attention model Pending CN114913465A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210610980.1A CN114913465A (en) 2022-05-31 2022-05-31 Action prediction method based on time sequence attention model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210610980.1A CN114913465A (en) 2022-05-31 2022-05-31 Action prediction method based on time sequence attention model

Publications (1)

Publication Number Publication Date
CN114913465A true CN114913465A (en) 2022-08-16

Family

ID=82771002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210610980.1A Pending CN114913465A (en) 2022-05-31 2022-05-31 Action prediction method based on time sequence attention model

Country Status (1)

Country Link
CN (1) CN114913465A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115309736A (en) * 2022-10-10 2022-11-08 北京航空航天大学 Time sequence data anomaly detection method based on self-supervision learning multi-head attention network
CN116129330A (en) * 2023-03-14 2023-05-16 阿里巴巴(中国)有限公司 Video-based image processing, behavior recognition, segmentation and detection methods and equipment
CN116129330B (en) * 2023-03-14 2023-11-28 阿里巴巴(中国)有限公司 Video-based image processing, behavior recognition, segmentation and detection methods and equipment

Similar Documents

Publication Publication Date Title
CN110225341B (en) Task-driven code stream structured image coding method
CN114913465A (en) Action prediction method based on time sequence attention model
CN111523378B (en) Human behavior prediction method based on deep learning
CN112587129B (en) Human body action recognition method and device
CN108875555B (en) Video interest area and salient object extracting and positioning system based on neural network
CN110705412A (en) Video target detection method based on motion history image
CN115393396B (en) Unmanned aerial vehicle target tracking method based on mask pre-training
US20240296592A1 (en) Adaptive deep-learning based probability prediction method for point cloud compression
CN115063717A (en) Video target detection and tracking method based on key area live-action modeling
CN116091978A (en) Video description method based on advanced semantic information feature coding
Feng et al. Using appearance to predict pedestrian trajectories through disparity-guided attention and convolutional LSTM
CN116630369A (en) Unmanned aerial vehicle target tracking method based on space-time memory network
CN113971826B (en) Dynamic emotion recognition method and system for estimating continuous titer and arousal level
CN115565146A (en) Perception model training method and system for acquiring aerial view characteristics based on self-encoder
US11941884B2 (en) Multi-source panoptic feature pyramid network
CN114613004A (en) Lightweight online detection method for human body actions
US20240037797A1 (en) Image decoding method, image coding method, image decoder, and image encoder
CN116402811B (en) Fighting behavior identification method and electronic equipment
CN115661535B (en) Target background removal recovery method and device and electronic equipment
CN114120076B (en) Cross-view video gait recognition method based on gait motion estimation
Zhao et al. Research on human behavior recognition in video based on 3DCCA
CN115512323A (en) Vehicle track prediction method in automatic driving field of vision based on deep learning
CN115171029A (en) Unmanned-driving-based method and system for segmenting instances in urban scene
CN114648755A (en) Text detection method for industrial container in light-weight moving state
CN111242044B (en) Night unmanned vehicle scene prediction method based on ConvLSTM dual-channel coding network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination