CN114240999A - Motion prediction method based on enhanced graph attention and time convolution network
- Publication number: CN114240999A
- Application number: CN202111373469.6A
- Authority: CN (China)
- Prior art keywords: attention, feature, graph, module, graph attention
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T 7/246: Physics; Computing; Image data processing or generation; Image analysis; Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06F 18/253: Physics; Computing; Electric digital data processing; Pattern recognition; Fusion techniques of extracted features
- G06N 3/045: Physics; Computing; Computing arrangements based on specific computational models; Neural networks; Architecture; Combinations of networks
Abstract
The invention discloses a motion prediction method based on an enhanced graph attention and time convolution network, which estimates future human motion poses by aggregating spatio-temporal information. The method constructs an enhanced graph attention module and a reconstructed TCN module. A channel attention map is generated from the channel relationships of the input features, and, based on this map, a local graph attention convolution network and a global graph attention convolution network extract local symmetry, local connectivity, and global semantic information, respectively. The reconstructed TCN effectively captures complex, highly dynamic temporal information. Finally, channel compression and dimension merging yield a post-processing result, the original time-series human skeleton data are sliced to obtain a residual, and the post-processing result and the residual are added element-wise to produce the final prediction. The invention effectively reduces pose discontinuity and error accumulation in human motion prediction.
Description
Technical Field
The application belongs to the technical field of motion prediction, and particularly relates to a motion prediction method based on an enhanced graph attention and time convolution network.
Background
Human motion prediction aims to forecast future motion dynamics from historical human skeleton poses. Advances in this technology benefit many applications, such as human-machine interaction, autonomous driving, public safety, medical care, and motion monitoring. Perceiving and predicting human motion is indispensable for interactive robots and points toward a major direction of future robotics research. However, discontinuity and error accumulation in the predicted poses greatly hinder the practical deployment of human motion prediction.
Discontinuities and error accumulation in the predicted pose are typically caused by insufficient representational capacity of the model in the spatial and temporal dimensions, respectively. To achieve high accuracy in human motion prediction, much excellent prior work has encoded the spatio-temporal information of human skeleton sequences. Mathematical models of the human skeleton are typically built from the body's primary joints, each treated as an independent observed point, with the joints connected to one another. Convolutional neural networks perceive the spatial structure of two-dimensional regular data well and are commonly used for image recognition and segmentation, but they perform poorly on topologically irregular data such as human skeletons, whereas the graph convolutional network (GCN) can construct and represent such irregular data structures well.
Various GCN-based algorithms are widely applied to pose estimation, motion prediction, and related fields, but spatial information alone cannot guarantee a model's effectiveness on sequence data. The recurrent neural network (RNN) processes sequence data well; originally designed for NLP, it was later widely applied to video-based action recognition and motion prediction, but the final prediction accuracy of RNNs and their LSTM and GRU variants suffers severely from the lack of spatial information. The discrete cosine transform (DCT) has also been introduced to represent temporal features, but many experiments show that increasing the number of observable frames for the DCT does not significantly improve the final prediction, which is clearly counterintuitive.
Disclosure of Invention
The application provides a motion prediction method based on an enhanced graph attention and time convolution network, which reduces pose discontinuity and error accumulation in human motion prediction.
To achieve this purpose, the technical solution of the application is as follows:
a motion prediction method based on an enhanced graph attention and time convolution network comprises the following steps:
expanding the input original time-series human skeleton data into data of preset dimensionality through a linear transformation, and completing data initialization through two-dimensional normalization, channel expansion, two-dimensional normalization, and a ReLU function in sequence;
inputting the initialized data into a first enhanced graph attention module and outputting a first graph attention feature; inputting the first graph attention feature into a first reconstructed TCN module to obtain a first temporal feature; slicing the first graph attention feature, adding it element-wise to the first temporal feature, and outputting a first fusion feature;
inputting the first fusion feature into a second enhanced graph attention module and outputting a second graph attention feature; inputting the second graph attention feature into a second reconstructed TCN module to obtain a second temporal feature; slicing the second graph attention feature, adding it element-wise to the second temporal feature, and outputting a second fusion feature;
inputting the second fusion feature into a third enhanced graph attention module and outputting a third graph attention feature;
and performing channel compression and dimension merging on the third graph attention feature to obtain a post-processing result, slicing the original time-series human skeleton data to obtain a residual, and adding the post-processing result and the residual element-wise to obtain the final prediction result.
Further, the enhanced graph attention module performs the following operations:
inputting the initialized data into a channel attention module to generate a channel attention map;
inputting the channel attention map into a local graph attention module and a global graph attention module respectively, and then aggregating the results with the input data to generate the graph attention feature.
Further, the channel attention module performs the following operations:
extracting spatial and temporal features using average pooling and maximum pooling simultaneously, and aggregating the two results with a shared-weight MLP layer to form the final channel attention map, expressed as follows:

$M_c(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F)))$

where $\sigma$ denotes the Sigmoid activation function, $\mathrm{MLP}(\mathrm{AvgPool}(F))$ denotes applying the MLP after average pooling the input feature $F$, $\mathrm{MLP}(\mathrm{MaxPool}(F))$ denotes applying the MLP after max pooling the input feature $F$, and $M_c(F)$ denotes the channel attention map.
Further, the operation of the local graph attention module is expressed as:

$Y_1 = \sigma\big((M \odot (A + I))XW\big)$

where $\sigma$ denotes the Sigmoid activation function, $X$ is the input feature, $W$ is a learnable transformation matrix that maps input channels to output channels, $M$ is a learnable mask matrix, $A + I$ is the graph convolution kernel in which $A$ is the first-order adjacency matrix of the human skeleton nodes and $I$ is the self-connection matrix of the nodes, $\odot$ denotes element-wise matrix multiplication, and $Y_1$ is the output of the local graph attention module;

the operation of the global graph attention module is expressed as:

$Y_2 = \sum_{k=1}^{K}(B_k + C_k)XW_k$

where $K$ is the number of heads in the multi-head attention mechanism, $B_k$ is an adaptive global adjacency matrix, $C_k$ is a learnable global adjacency matrix, $W_k$ is a learnable input-to-output channel transformation matrix, and $Y_2$ is the output of the global graph attention module.
Further, the reconstructed TCN module performs the following operations:
sequentially performing dense convolution, BatchNorm2D, ReLU, two-dimensional convolution, BatchNorm2D, a ReLU activation function, and a Dropout function, and outputting the temporal feature.
The invention provides a motion prediction method based on an enhanced graph attention and time convolution network: an enhanced graph attention module and a reconstructed TCN module are constructed and combined into a human motion prediction network that aggregates spatial and temporal information.
Drawings
FIG. 1 is a flow chart of a human motion prediction method based on an enhanced graph attention and time convolution network;
FIG. 2 is an exemplary diagram of an overall network based on an enhanced graph attention and time convolution network;
FIG. 3 is a network structure diagram of the enhanced graph attention module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in FIG. 1, a motion prediction method based on an enhanced graph attention and time convolution network is proposed, which includes:
and step S1, expanding the input original time sequence human body skeleton data into data with preset dimensionality through linear transformation, and completing data initialization through two-dimensional normalization, channel expansion, two-dimensional normalization and Relu function in sequence.
The human skeleton sequence data input to the network are first preprocessed. For the input data (b,66,10) in FIG. 2, b is the training batch size, 66 is the size of the skeleton data per frame, and 10 means the sequence consists of 10 frames in the time dimension. The data are expanded to the preset dimensionality by a linear transformation: the time dimension is mapped from 10 to 64 by a fully connected network, and each frame's 66 skeleton values are split into two dimensions of size 3 and 22, where 3 represents the three xyz coordinate channels and 22 represents the 22 skeleton nodes. The resulting format is (b,3,64,22), which allows channels and nodes to be processed separately in subsequent computation. The data then pass sequentially through two-dimensional normalization (BatchNorm2D), channel expansion (3,(3,1),256), two-dimensional normalization (BatchNorm2D), and a ReLU function to complete preprocessing.
Splitting the skeleton vector into separate coordinate and node dimensions and expanding the time dimension from 10 to 64 provides more room to operate for the subsequent temporal feature extraction.
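For concreteness, step S1 can be sketched in PyTorch as follows. This is a minimal sketch, not the patent's implementation: the class and parameter names are illustrative, the xyz-major layout of the 66 skeleton values is an assumption, and applying the (3,1) channel-expansion kernel without temporal padding (so the time length 64 shrinks to 62) is an inference from the (b,256,62,22) feature shape quoted in step S2.

```python
import torch
import torch.nn as nn

class Preprocess(nn.Module):
    """Sketch of step S1 under the assumptions stated above."""
    def __init__(self, t_in=10, t_hid=64, channels=256):
        super().__init__()
        self.expand_time = nn.Linear(t_in, t_hid)   # fully connected: time 10 -> 64
        self.bn_xyz = nn.BatchNorm2d(3)             # two-dimensional normalization
        # channel expansion (3,(3,1),256); no temporal padding, so 64 -> 62
        self.expand_ch = nn.Conv2d(3, channels, kernel_size=(3, 1))
        self.bn_out = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (b, 66, 10)
        b = x.shape[0]
        x = self.expand_time(x)                        # (b, 66, 64)
        # Assumed xyz-major layout: 66 = 3 coordinates x 22 joints
        x = x.view(b, 3, 22, -1).permute(0, 1, 3, 2)   # (b, 3, 64, 22)
        x = self.relu(self.bn_out(self.expand_ch(self.bn_xyz(x))))
        return x                                       # (b, 256, 62, 22)


x = torch.randn(8, 66, 10)
print(Preprocess()(x).shape)  # torch.Size([8, 256, 62, 22])
```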
In step S2, the initialized data are input into the first enhanced graph attention module, which outputs a first graph attention feature; the first graph attention feature is input into the first reconstructed TCN module to obtain a first temporal feature; the first graph attention feature is sliced and added element-wise to the first temporal feature, and a first fusion feature is output.
As shown in FIG. 2, the first enhanced graph attention module (AGA Block1) processes the initialized data and outputs the first graph attention feature. It performs the following operations:
inputting the initialized data into a channel attention module to generate a channel attention map;
inputting the channel attention map into a local graph attention module and a global graph attention module respectively, and then aggregating the results with the input data to generate the graph attention feature.
Specifically, as shown in FIG. 3, the first enhanced graph attention module feeds the initialized data into the channel attention module, which extracts spatial and temporal features using average pooling (Average Pool) and max pooling (Max Pool) simultaneously and aggregates the two results with a shared-weight MLP layer to form the final channel attention map.
The outputs of average pooling (Average Pool) and max pooling (Max Pool) are each processed by the MLP layer, and data fusion is completed by element-wise addition (⊕, element-by-element matrix addition). A Sigmoid activation function then forms the channel attention map.
The MLP layer is formed by connecting a one-dimensional convolution (256,1,256), a ReLU, and another one-dimensional convolution (256,1,256) in series.
The above process is expressed by the following formula:

$M_c(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F)))$

where $\sigma$ denotes the Sigmoid activation function, $\mathrm{MLP}(\mathrm{AvgPool}(F))$ denotes applying the MLP after average pooling the input feature $F$, $\mathrm{MLP}(\mathrm{MaxPool}(F))$ denotes applying the MLP after max pooling the input feature $F$, and $M_c(F)$ denotes the channel attention map.
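A minimal PyTorch sketch of this CBAM-style channel attention follows. The 1x1 convolutions realize the shared (256,1,256) MLP described above; returning the channel-reweighted feature rather than the bare attention map is an assumption about how the map is consumed by the downstream graph modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Sketch of M_c(F) = sigma(MLP(AvgPool(F)) + MLP(MaxPool(F)))."""
    def __init__(self, channels=256):
        super().__init__()
        # Shared-weight MLP: conv(256,1,256) -> ReLU -> conv(256,1,256)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels, 1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, x):
        # x: (b, C, T, V); pool jointly over the time and joint dimensions
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))   # (b, C, 1, 1)
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))    # (b, C, 1, 1)
        m_c = torch.sigmoid(avg + mx)                 # channel attention map
        return x * m_c        # assumed: the reweighted feature is passed on
```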
The channel attention map is then input into the local graph attention module and the global graph attention module respectively. As shown in FIG. 3, the local graph attention module comprises a first branch and a second branch, each consisting of a graph convolution over the first-order adjacency matrix (GCN Connection), two-dimensional normalization (BatchNorm2D), and a ReLU activation function; the outputs of the two branches are multiplied element-wise and then fed through a two-dimensional convolution (512,(1,1),256), BatchNorm2D, ReLU, and a Dropout function.
The local graph attention module may be expressed as:

$Y_1 = \sigma\big((M \odot (A + I))XW\big)$

where $\sigma$ denotes the Sigmoid activation function, $X$ denotes the input data, $W$ is a learnable transformation matrix that maps input channels to output channels, $M$ is a learnable mask matrix, $A + I$ is the graph convolution kernel in which $A$ is the first-order adjacency matrix of the human skeleton nodes (GCN Connection) and $I$ is the self-connection matrix of the nodes (GCN Symmetry), $\odot$ denotes element-wise matrix multiplication, and $Y_1$ is the output of the local graph attention module.
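The formula can be sketched compactly in PyTorch as below. Only $Y_1$ itself is implemented; the two-branch/Dropout wiring of FIG. 3 is omitted, the skeleton adjacency matrix is assumed to be supplied by the caller, and realizing $W$ as a 1x1 convolution is an implementation choice, not something the patent specifies.

```python
import torch
import torch.nn as nn

class LocalGraphAttention(nn.Module):
    """Sketch of Y1 = sigma((M ⊙ (A + I)) X W)."""
    def __init__(self, in_ch, out_ch, adjacency):
        super().__init__()
        v = adjacency.shape[0]
        # Fixed kernel: first-order adjacency plus self-connections (A + I)
        self.register_buffer("a_tilde", adjacency + torch.eye(v))
        # Learnable mask M modulating the skeleton edges, initialized to ones
        self.mask = nn.Parameter(torch.ones(v, v))
        # Channel transform W realized as a 1x1 convolution
        self.w = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        # x: (b, C, T, V); propagate features over the masked skeleton graph
        kernel = self.mask * self.a_tilde                 # M ⊙ (A + I)
        x = torch.einsum("bctv,vw->bctw", x, kernel)
        return torch.sigmoid(self.w(x))                   # Y1


# Example with a hypothetical 22-joint adjacency (fill with skeleton bones):
A = torch.zeros(22, 22)
layer = LocalGraphAttention(256, 256, A)
print(layer(torch.randn(8, 256, 62, 22)).shape)  # torch.Size([8, 256, 62, 22])
```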
As shown in FIG. 3, the Global Graph Attention module includes Global Graph Attention, two-dimensional convolution (256 (1,1), 256), two-dimensional normalized BatchNorm2D, a ReLU activation function, and a Dropout function.
The global graph attention module may be expressed as:

$Y_2 = \sum_{k=1}^{K}(B_k + C_k)XW_k$

where $K$ is the number of heads in the multi-head attention mechanism, $B_k$ is an adaptive global adjacency matrix, $C_k$ is a learnable global adjacency matrix, $W_k$ is a learnable input-to-output channel transformation matrix, $Y_2$ is the output of the global graph attention module, and $k$ ranges from 1 to $K$.
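The patent does not spell out how the adaptive matrix $B_k$ is computed. The sketch below assumes a 2s-AGCN-style construction in which $B_k$ is derived from embedded feature similarity while $C_k$ is a free parameter matrix; all layer names and the embedding size are illustrative.

```python
import torch
import torch.nn as nn

class GlobalGraphAttention(nn.Module):
    """Sketch of Y2 = sum_k (B_k + C_k) X W_k under the assumptions above."""
    def __init__(self, in_ch, out_ch, num_joints=22, heads=4, embed=64):
        super().__init__()
        self.heads = heads
        # Learnable global adjacency C_k, one matrix per head
        self.c = nn.Parameter(torch.zeros(heads, num_joints, num_joints))
        # Embeddings used to build the adaptive adjacency B_k
        self.theta = nn.ModuleList([nn.Conv2d(in_ch, embed, 1) for _ in range(heads)])
        self.phi = nn.ModuleList([nn.Conv2d(in_ch, embed, 1) for _ in range(heads)])
        # Per-head input-to-output channel transforms W_k
        self.w = nn.ModuleList([nn.Conv2d(in_ch, out_ch, 1) for _ in range(heads)])

    def forward(self, x):
        # x: (b, C, T, V)
        out = 0
        for k in range(self.heads):
            q = self.theta[k](x).mean(dim=2)    # (b, e, V), pooled over time
            p = self.phi[k](x).mean(dim=2)      # (b, e, V)
            # Adaptive B_k from embedded feature similarity
            b_k = torch.softmax(torch.einsum("bev,bew->bvw", q, p), dim=-1)
            adj = b_k + self.c[k]               # B_k + C_k
            y = torch.einsum("bctv,bvw->bctw", x, adj)  # propagate over graph
            out = out + self.w[k](y)            # sum over heads with W_k
        return out                              # Y2
```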
Finally, the outputs of the local graph attention module and the global graph attention module are added element-wise to the input data of the first enhanced graph attention module to form the final enhanced graph attention feature (the first graph attention feature).
Next, the first graph attention feature is input into the first reconstructed TCN module to obtain a first temporal feature; the first graph attention feature is sliced and added element-wise to the first temporal feature, and a first fusion feature is output.
The first reconstructed TCN module replaces the dilated convolution of the original TCN with a dense convolution (a two-dimensional convolution (256,(7,1),256)), i.e., a convolution kernel without holes, giving the module better representational capacity on sequential skeleton data. As shown in FIG. 2, the reconstructed TCN module sequentially applies the dense convolution (256,(7,1),256), BatchNorm2D, ReLU, a two-dimensional convolution (256,(1,1),256), BatchNorm2D, a ReLU activation function, and a Dropout function, and outputs the temporal feature. Meanwhile, a slice (Slice) operation cuts (b,256,56,22) from the end of the first graph attention feature (b,256,62,22) as a residual, which is added element-by-element to the temporal feature to form the final module output, where ⊕ denotes element-wise matrix addition.
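A sketch of the reconstructed TCN block together with its sliced residual connection follows. The dropout rate is an assumed value; the (7,1) dense kernel without padding shrinks the time axis by 6, consistent with the 62-to-56 slice described above.

```python
import torch
import torch.nn as nn

class ReconstructedTCN(nn.Module):
    """Sketch of the reconstructed TCN block plus sliced residual."""
    def __init__(self, channels=256, kernel_t=7, p_drop=0.1):
        super().__init__()
        self.block = nn.Sequential(
            # Dense convolution: kernel (7,1), no dilation, no padding (62 -> 56)
            nn.Conv2d(channels, channels, (kernel_t, 1)),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, (1, 1)),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Dropout(p_drop),
        )

    def forward(self, x):
        # x: (b, 256, T, 22) -> temporal feature (b, 256, T - 6, 22)
        y = self.block(x)
        # Slice the last T-6 frames of the input as the residual and add
        return y + x[:, :, -y.shape[2]:, :]


x = torch.randn(8, 256, 62, 22)
print(ReconstructedTCN()(x).shape)  # torch.Size([8, 256, 56, 22])
```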
In step S3, the first fusion feature is input into a second enhanced graph attention module, which outputs a second graph attention feature; the second graph attention feature is input into a second reconstructed TCN module to obtain a second temporal feature; the second graph attention feature is sliced and added element-wise to the second temporal feature, and a second fusion feature is output.
The specific operation of this step is the same as the previous step, and is not described herein again.
In step S4, the second fusion feature is input into the third enhanced graph attention module, which outputs a third graph attention feature.
This step applies graph attention enhancement once more; the third enhanced graph attention module operates the same way as the first enhanced graph attention module, so the details are not repeated here.
In step S5, channel compression and dimension merging are applied to the third graph attention feature to obtain a post-processing result; the original time-series human skeleton data are sliced to obtain a residual; and the post-processing result and the residual are added element-wise to obtain the final prediction result.
In this step, the third graph attention feature is post-processed and the predicted human skeleton sequence is output. As shown in FIG. 2, the third graph attention feature passes through a two-dimensional convolution (256,(1,1),3), which compresses the channels from 256 back to the three xyz channels of the original data to give a result of shape (b,3,20,22); the xyz dimension (second dimension) is then merged with the node dimension (fourth dimension) (Linear Projection) to obtain the post-processing result. A Slice operation cuts (b,66,1) from the end of the original input data as a residual, which is added element-by-element to the post-processing result to obtain the final prediction result (b,66,22), where ⊕ denotes element-wise matrix addition.
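Step S5 can be sketched as follows, assuming PyTorch. The `out_proj` argument is a hypothetical nn.Conv2d(256, 3, 1) standing in for the (256,(1,1),3) convolution, the xyz-major merge order matches the layout assumed in the preprocessing sketch, and the residual addition relies on broadcasting the last observed frame across the predicted frames.

```python
import torch
import torch.nn as nn

def postprocess(feat, raw, out_proj):
    """Sketch of step S5.
    feat: (b, 256, T_out, 22) third graph attention feature
    raw:  (b, 66, T_in) original time-series skeleton data
    out_proj: hypothetical nn.Conv2d(256, 3, 1) channel compression"""
    y = out_proj(feat)                        # (b, 3, T_out, 22)
    b, _, t_out, v = y.shape
    # Merge the xyz dimension with the node dimension: -> (b, 66, T_out)
    y = y.permute(0, 1, 3, 2).reshape(b, 3 * v, t_out)
    residual = raw[:, :, -1:]                 # last observed frame, (b, 66, 1)
    return y + residual                       # broadcast element-wise addition


feat = torch.randn(8, 256, 20, 22)
raw = torch.randn(8, 66, 10)
print(postprocess(feat, raw, nn.Conv2d(256, 3, 1)).shape)  # (b, 66, T_out)
```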
The above embodiments express only several implementations of the present application; their description is specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (5)
1. A motion prediction method based on an enhanced graph attention and time convolution network, comprising:
expanding the input original time-series human skeleton data into data of preset dimensionality through a linear transformation, and completing data initialization through two-dimensional normalization, channel expansion, two-dimensional normalization, and a ReLU function in sequence;
inputting the initialized data into a first enhanced graph attention module and outputting a first graph attention feature; inputting the first graph attention feature into a first reconstructed TCN module to obtain a first temporal feature; slicing the first graph attention feature, adding it element-wise to the first temporal feature, and outputting a first fusion feature;
inputting the first fusion feature into a second enhanced graph attention module and outputting a second graph attention feature; inputting the second graph attention feature into a second reconstructed TCN module to obtain a second temporal feature; slicing the second graph attention feature, adding it element-wise to the second temporal feature, and outputting a second fusion feature;
inputting the second fusion feature into a third enhanced graph attention module and outputting a third graph attention feature;
and performing channel compression and dimension merging on the third graph attention feature to obtain a post-processing result, slicing the original time-series human skeleton data to obtain a residual, and adding the post-processing result and the residual element-wise to obtain the final prediction result.
2. The motion prediction method based on the enhanced graph attention and time convolution network of claim 1, wherein the enhanced graph attention module performs the following operations:
inputting the initialized data into a channel attention module to generate a channel attention map;
inputting the channel attention map into a local graph attention module and a global graph attention module respectively, and then aggregating the results with the input data to generate the graph attention feature.
3. The motion prediction method based on the enhanced graph attention and time convolution network of claim 2, wherein the channel attention module performs the following operations:
extracting spatial and temporal features using average pooling and maximum pooling simultaneously, and aggregating the two results with a shared-weight MLP layer to form the final channel attention map, expressed as follows:

$M_c(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F)))$

where $\sigma$ denotes the Sigmoid activation function, $\mathrm{MLP}(\mathrm{AvgPool}(F))$ denotes applying the MLP after average pooling the input feature $F$, $\mathrm{MLP}(\mathrm{MaxPool}(F))$ denotes applying the MLP after max pooling the input feature $F$, and $M_c(F)$ denotes the channel attention map.
4. The motion prediction method based on the enhanced graph attention and time convolution network of claim 2, wherein the operation of the local graph attention module is expressed as:

$Y_1 = \sigma\big((M \odot (A + I))XW\big)$

where $\sigma$ denotes the Sigmoid activation function, $X$ is the input feature, $W$ is a learnable transformation matrix that maps input channels to output channels, $M$ is a learnable mask matrix, $A + I$ is the graph convolution kernel in which $A$ is the first-order adjacency matrix of the human skeleton nodes and $I$ is the self-connection matrix of the nodes, $\odot$ denotes element-wise matrix multiplication, and $Y_1$ is the output of the local graph attention module;

and the operation of the global graph attention module is expressed as:

$Y_2 = \sum_{k=1}^{K}(B_k + C_k)XW_k$

where $K$ is the number of heads in the multi-head attention mechanism, $B_k$ is an adaptive global adjacency matrix, $C_k$ is a learnable global adjacency matrix, $W_k$ is a learnable input-to-output channel transformation matrix, and $Y_2$ is the output of the global graph attention module.
5. The motion prediction method based on the enhanced graph attention and time convolution network of claim 1, wherein the reconstructed TCN module performs the following operations:
sequentially performing dense convolution, BatchNorm2D, ReLU, two-dimensional convolution, BatchNorm2D, a ReLU activation function, and a Dropout function, and outputting the temporal feature.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202111373469.6A (CN114240999A) | 2021-11-19 | 2021-11-19 | Motion prediction method based on enhanced graph attention and time convolution network
Publications (1)

Publication Number | Publication Date
---|---
CN114240999A (en) | 2022-03-25
Family
ID=80750063
Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202111373469.6A (CN114240999A, pending) | Motion prediction method based on enhanced graph attention and time convolution network | 2021-11-19 | 2021-11-19
Country Status (1)

Country | Link
---|---
CN | CN114240999A (en)
- 2021-11-19: CN application CN202111373469.6A filed; patent CN114240999A (en) active, status Pending
Cited By (3)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN117475518A * | 2023-12-27 | 2024-01-30 | East China Jiaotong University | Synchronous human motion recognition and prediction method and system
CN117475518B * | 2023-12-27 | 2024-03-22 | East China Jiaotong University | Synchronous human motion recognition and prediction method and system
CN118427562A * | 2024-07-04 | 2024-08-02 | State Grid Shandong Electric Power Company Information and Communication Company | Multi-device federal dynamic graph-oriented multi-element time sequence prediction method and system
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |