CN112633209A - Human action recognition method based on graph convolution neural network - Google Patents

Human action recognition method based on graph convolution neural network Download PDF

Info

Publication number
CN112633209A
Authority
CN
China
Prior art keywords
network
graph
neural network
human
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011600579.7A
Other languages
Chinese (zh)
Other versions
CN112633209B (en)
Inventor
毛克明
李翰鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN202011600579.7A priority Critical patent/CN112633209B/en
Publication of CN112633209A publication Critical patent/CN112633209A/en
Application granted granted Critical
Publication of CN112633209B publication Critical patent/CN112633209B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/23Recognition of whole body movements, e.g. for sport training
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention discloses a human action recognition method based on a graph convolutional neural network. The method comprises: preparing human action video data and labeling each video according to its action type; extracting skeleton keypoint features from the video data with the OpenPose pose estimation algorithm, computing the change speed of the skeleton keypoints between adjacent frames for the skeleton-point main-stream network, and performing feature concatenation; screening the skeleton keypoints, computing the included joint angles for the angle branch network, and performing feature concatenation; feeding the concatenated data into a graph neural network; extending the graph convolution from the spatial domain to the temporal domain; using a cross-attention model to enhance the performance of the network; and performing human action recognition. The invention can recognize and output the actions performed by a person in an input video, has good usability and robustness, and lays a foundation for the practical deployment of artificial intelligence technology in the field of action recognition.

Description

Human action recognition method based on graph convolution neural network
Technical Field
The invention relates to the technical field of computer vision, in particular to a human action recognition method based on a graph convolution neural network.
Background
Artificial intelligence technology has spread to virtually every industry, and action recognition is a key technology behind many popular applications and requirements; it has become one of the most closely watched directions in the field of computer vision. Examples include the detection of and alarm on abnormal human behavior in intelligent surveillance cameras, and the classification and retrieval of human behavior in video; motion-capture technology in high-quality games likewise transfers the movements of professional players into the game and gives players a sense of immersion. Action recognition techniques can be expected to find ever wider application in the future.
In the field of computer vision, human action recognition methods currently fall into two main categories: methods based on the RGB frames and optical flow of a video, and methods based on human skeleton keypoints. RGB-and-optical-flow methods can learn the task end to end, but extracting optical flow from video is computationally very heavy; although various techniques now exist to reduce this cost, optical flow remains a powerful feature for the action recognition task. Methods based on human skeleton keypoints emerged after pose estimation technology matured. Compared with traditional RGB-and-optical-flow methods they model human behavior more effectively, and they avoid the influence of background and illumination changes that the traditional methods cannot escape. On the other hand, they require a pose estimation algorithm to extract features from the video, which adds one step compared with the traditional approach. In addition, existing skeleton-based action recognition methods simply use the keypoint coordinates, yet the information that describes a motion is not limited to coordinates: joint angles and their change speeds are also important elements for describing action features.
Therefore, given the current state of the field and the complexity of motion itself, a human action recognition method is needed that is grounded in deep learning theory and uses richer descriptive elements for the task.
Disclosure of Invention
The invention aims to provide a human action recognition method based on a graph convolution neural network aiming at the current situation in the field and the complexity of actions.
In order to achieve the purpose, the invention is implemented according to the following technical scheme:
a human motion recognition method based on a graph convolution neural network comprises the following steps:
Step 1: preparing human action video data and labeling each video according to its action type;
Step 2: extracting skeleton keypoint features from the human action video data with the OpenPose pose estimation algorithm, then computing the change speed of the skeleton keypoints between adjacent frames for the skeleton-point main-stream network, and performing feature concatenation; screening the skeleton keypoints, computing the included joint angles for the angle branch network, and performing feature concatenation;
Step 3: feeding the concatenated data into a graph neural network;
Step 4: extending the graph convolution from the spatial domain to the temporal domain;
Step 5: using a cross-attention model to enhance the performance of the network;
Step 6: constructing a graph convolutional neural network consisting of nine spatio-temporal convolution modules, a global pooling layer and a Softmax layer, where the global pooling layer aggregates the node features of the graph structure, promoting node-level features to graph-level features, and the Softmax layer then outputs the human action label for the video.
Further, the step 2 specifically includes:
Step 2.1: first cropping the videos so that the person in each video is centered in the frame;
Step 2.2: extracting human skeleton keypoints with the OpenPose pose estimation algorithm, sampling 15 equally spaced frames S = (T1, T2, ..., T15) from the video S and saving the skeleton keypoint data of each sampled frame; 18 skeleton keypoints are extracted each time, corresponding to 18 parts of the human body; with the length of a single frame set to L and the width set to W, the extracted keypoint coordinates are normalized; using Tn to denote the skeleton keypoint data of the n-th frame, the normalized Tn is:
Tn = ((x1n/L, y1n/W), (x2n/L, y2n/W), ..., (x18n/L, y18n/W))
where xn is the abscissa and yn the ordinate of the n-th skeleton keypoint, and Tn is the normalized skeleton keypoint coordinate set of the n-th frame;
Step 2.3: computing the change speed of the keypoints between adjacent frames, where the speed Vn is:
Vn = ((x1n-x1n-1, y1n-y1n-1), (x2n-x2n-1, y2n-y2n-1), ..., (x18n-x18n-1, y18n-y18n-1))
where x and y have the same meanings as in step 2.2; after the speed V is obtained, feature concatenation is performed, and the concatenated overall feature Dn is:
Dn = Concat(Tn, Tn', Vn)
where Tn and Tn' denote the normalized skeleton keypoint coordinates obtained from the side view and the front view at time n respectively, and the Concat function denotes concatenation of the variables in brackets;
Step 2.4: screening the skeleton keypoints extracted by OpenPose, keeping the left knee, right knee, left waist, right waist, left shoulder, right shoulder, left elbow and right elbow;
Step 2.5: computing the included angles at the selected joints:
(5) knee; (6) waist; (7) shoulder; (8) elbow;
the corresponding angle formulas are given as images in the original document.
further, the step 3 specifically includes:
Step 3.1: the default human skeleton structure recognized by the OpenPose pose estimation algorithm is used as the basic connectivity of the graph neural network, and the corresponding adjacency matrix of the graph structure is denoted Ak, the adjacency matrix of the k-th layer network; it is an N × N two-dimensional matrix with N equal to 18, representing the 18 skeleton keypoints; entry A(n1, n2) represents the connection state between keypoints n1 and n2, a value of 1 meaning connected and a value of 0 meaning not connected;
Step 3.2: a second adjacency matrix of the graph structure is denoted Bk, the action-structure adjacency matrix of the k-th layer; it is also an N × N two-dimensional matrix with the same meaning as A, except that it has no fixed values: every element of the matrix is a trainable parameter;
Step 3.3: a third adjacency matrix of the graph structure is denoted Ck, with the same format as A and B, where Ck(n1, n2) is:
Ck(n1, n2) = softmax(θ(fin(n1))T · φ(fin(n2)))
This is a normalized Gaussian embedding that computes the similarity between any two skeleton keypoints; θ and φ are two embedding functions, T denotes matrix transposition, and the final output dimension is unchanged; the Softmax operation normalizes the final values to between 0 and 1, indicating whether a connection exists; the final output of the graph neural network is:
fout = Σk=1..K Wk · fin · (Ak + Bk + Ck)
where fin and fout denote the input and output of the layer, K denotes the total number of layers of the graph neural network, and W denotes the convolution parameters.
Further, the step 4 specifically includes:
for point nijDefining i to represent the ith frame, j to represent the jth bone keypoint, and each time domain convolution only involves the same bone keypoint, then there is the formula:
Figure BDA0002868715070000052
w is a parameter of the convolution,
Figure BDA0002868715070000053
the output of the nth layer.
Further, the step 5 specifically includes:
Step 5.1: the cross attention uses the feature map of the bone-joint-angle branch network to enhance the representational capacity of the main network stream, according to the formula:
fattention = (1 + Attention) * fout
Step 5.2: the Attention term is computed as:
Attention = g(fself, fcross) * fout
where fself is the self-attention weight of the main-network output feature map, and fcross is the cross-attention weight between the joint-angle data and the main-network data that is added onto the main-network feature map; g transforms both to the dimensions of fout and adds them; the formula for fcross is given as an image in the original document;
here v(T, N, d) is the main-network feature map, where N is the number of skeleton joints in the main-network data and d is the feature dimension of each joint; a(T, k, m) is the joint-angle-network feature map, where k is the number of joints in the bone-joint-angle data and m is the dimension of the joint-angle data.
Further, the step 6 specifically includes:
Step 6.1: a residual connection is first reserved from the input and attached to the output of the module; within the module, the first operation is the spatial-domain graph convolution, followed by a batch normalization (BatchNormalization) operation, a ReLU activation layer and a dropout layer with coefficient 0.5; a temporal graph convolution is then performed, followed again by batch normalization and a ReLU activation layer; the overall network structure consists of nine such spatio-temporal convolution modules, a global pooling layer and a Softmax layer.
Step 6.2: the global pooling layer of the network aggregates the node features of the graph structure, promoting node-level features to graph-level features, and the Softmax layer then outputs the action label of the person in the video.
Compared with the prior art, the human action recognition method based on a graph convolutional neural network of the present invention can recognize and output the actions performed by a person in an input video, has good usability and robustness, and lays a foundation for the practical deployment of artificial intelligence technology in the field of action recognition.
Drawings
Fig. 1 is a flowchart of a method for human motion recognition based on a graph-convolution neural network according to the present invention.
Fig. 2 is a cross-attention network structure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. The specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
As shown in fig. 1 and fig. 2, the present embodiment provides a method for recognizing human actions based on a graph convolution neural network, which includes the following steps:
Step 1: preparing human action video data and labeling each video according to its action type, with labels starting from 0;
Step 2.1: performing feature extraction and feature design on the basic data to serve as the motion information features;
Step 2.1.1: first crop the videos so that the person in each video is centered in the frame;
Step 2.1.2: human skeleton keypoints are extracted with the OpenPose pose estimation algorithm; 15 equally spaced frames S = (T1, T2, ..., T15) are sampled from the video S and the skeleton keypoint data of each sampled frame are saved. Each time, 18 skeleton keypoints are extracted, corresponding to 18 parts of the human body. With the length of a single frame set to L and the width set to W, the extracted keypoint coordinates are normalized; using Tn to denote the skeleton keypoint data of the n-th frame, the normalized Tn is:
Tn = ((x1n/L, y1n/W), (x2n/L, y2n/W), ..., (x18n/L, y18n/W))
where xn is the abscissa and yn the ordinate of the n-th skeleton keypoint, and Tn is the normalized skeleton keypoint coordinate set of the n-th frame.
Step 2.1.3: the change speed of the keypoints between adjacent frames is computed, where the speed Vn is:
Vn = ((x1n-x1n-1, y1n-y1n-1), (x2n-x2n-1, y2n-y2n-1), ..., (x18n-x18n-1, y18n-y18n-1))
where x and y have the same meanings as in step 2.1.2. After the speed V is obtained, feature concatenation is performed, and the concatenated overall feature Dn is:
Dn = Concat(Tn, Tn', Vn)
where Tn and Tn' denote the normalized skeleton keypoint coordinates obtained from the side view and the front view at time n respectively, and the Concat function denotes concatenation of the variables in brackets.
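For concreteness, the following is a minimal NumPy sketch of this step, assuming keypoint arrays of shape (T, 18, 2) for the front and side views; the build_features helper, the array names, which view the velocity is taken from, and the zero-padding of the first-frame velocity are assumptions made for illustration rather than details taken from the disclosure.
import numpy as np

def build_features(front_kpts, side_kpts, frame_len, frame_wid):
    """front_kpts, side_kpts: arrays of shape (T, 18, 2) holding (x, y) pixel coordinates."""
    # Normalize coordinates by the single-frame length L and width W.
    t_front = front_kpts / np.array([frame_len, frame_wid], dtype=np.float32)
    t_side = side_kpts / np.array([frame_len, frame_wid], dtype=np.float32)

    # Change speed of each keypoint between adjacent frames (first frame padded with zeros).
    velocity = np.zeros_like(t_front)
    velocity[1:] = t_front[1:] - t_front[:-1]

    # Feature concatenation Dn = Concat(Tn, Tn', Vn) along the coordinate axis.
    return np.concatenate([t_front, t_side, velocity], axis=-1)  # shape (T, 18, 6)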
Step 2.2: the skeleton-point data are further refined into higher-order information, which together with the data from step 2.1 forms a two-stream network in which the two streams complement each other;
Step 2.2.1: because joint angles are of great importance to the action category, the human skeleton keypoints extracted by OpenPose are screened, and the left knee, right knee, left waist, right waist, left shoulder, right shoulder, left elbow and right elbow are kept;
Step 2.2.2: the included angles at the selected joints are computed:
(1) knee; (2) waist; (3) shoulder; (4) elbow;
the corresponding angle formulas are given as images in the original document; a sketch of one possible angle computation follows this list.
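As a hedged illustration only, the sketch below computes a joint's included angle from three keypoints (for example hip-knee-ankle for the knee) with the usual arccos of the normalized dot product; this is one common choice and not necessarily the formula of the original document, and the example coordinates are hypothetical.
import numpy as np

def joint_angle(a, b, c):
    """Included angle (radians) at keypoint b, formed by the segments b->a and b->c."""
    v1, v2 = a - b, c - b
    cos_angle = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))

# Example with hypothetical normalized coordinates for hip, knee and ankle.
hip, knee, ankle = np.array([0.40, 0.50]), np.array([0.42, 0.70]), np.array([0.45, 0.90])
knee_angle = joint_angle(hip, knee, ankle)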
Step 3: the concatenated data are fed into a graph neural network, which mainly consists of three parts;
Step 3.1: the first part uses the default human skeleton structure recognized by the OpenPose pose estimation algorithm as the basic connectivity of the graph neural network; the role of this basic structure is to fit the basic forms of human motion, and it has a certain modeling capability for actions of any form. The corresponding adjacency matrix of the graph structure is denoted Ak, representing the k-th layer network; it is an N × N two-dimensional matrix with N equal to 18, representing the 18 skeleton keypoints. Entry A(n1, n2) represents the connection state between keypoints n1 and n2, a value of 1 meaning connected and a value of 0 meaning not connected;
Step 3.2: the second part compensates for the limited ability of the basic structure to fit the diversity of motions; its adjacency matrix is denoted Bk, the action-structure adjacency matrix of the k-th layer. It is also an N × N two-dimensional matrix with the same meaning as A, except that it has no fixed values: every element is a trainable parameter, and during training the network automatically learns which connection patterns best compensate for each action;
Step 3.3: the third part is a data-driven graph structure, which takes different values for each different action; its matrix is denoted Ck, with the same format as A and B, where Ck(n1, n2) is:
Ck(n1, n2) = softmax(θ(fin(n1))T · φ(fin(n2)))
This is a normalized Gaussian embedding that computes the similarity between any two skeleton keypoints; θ and φ are two embedding functions, and T denotes matrix transposition, ensuring that the final output dimension is unchanged. The Softmax operation normalizes the final values to between 0 and 1, indicating whether a connection exists. The output of the final graph neural network is:
fout = Σk=1..K Wk · fin · (Ak + Bk + Ck)
where fin and fout denote the input and output of the layer, K denotes the total number of layers of the graph neural network, Ak, Bk and Ck are the matrices introduced in the steps above, and W denotes the convolution parameters;
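As an illustration of this three-part graph convolution, a minimal PyTorch sketch follows; the class name AdaptiveGraphConv, the embedding size, the time-averaging used when building Ck and the single-subset simplification are assumptions made for the example, not details from the disclosure.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGraphConv(nn.Module):
    """Spatial graph convolution with fixed (A), trainable (B) and data-driven (C) adjacency."""
    def __init__(self, in_ch, out_ch, A, embed_ch=16):
        super().__init__()
        self.register_buffer("A", A)                 # fixed skeleton adjacency, shape (N, N)
        self.B = nn.Parameter(torch.zeros_like(A))   # fully trainable adjacency
        self.theta = nn.Conv2d(in_ch, embed_ch, 1)   # embedding theta for C
        self.phi = nn.Conv2d(in_ch, embed_ch, 1)     # embedding phi for C
        self.W = nn.Conv2d(in_ch, out_ch, 1)         # convolution parameters W

    def forward(self, x):                            # x: (batch, channels, frames, keypoints)
        th = self.theta(x).mean(dim=2).permute(0, 2, 1)   # (batch, N, embed)
        ph = self.phi(x).mean(dim=2)                      # (batch, embed, N)
        C = F.softmax(torch.bmm(th, ph), dim=-1)          # data-driven adjacency (batch, N, N)
        adj = self.A + self.B + C                         # A + B + C
        y = torch.einsum("bctn,bnm->bctm", x, adj)        # aggregate neighbour features
        return self.W(y)                                  # apply W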
Step 4: the graph convolution is extended from the spatial domain to the temporal domain. For a point nij, we define i as the i-th frame and j as the j-th skeleton keypoint; each temporal convolution only involves the same skeleton keypoint across frames, giving the formula:
fout(nij) = Σt Wt · f(n)(n(i+t)j)
where W is the convolution parameter, f(n) is the output of the n-th layer, and the other variables are defined as before.
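The temporal extension can be sketched as a one-dimensional convolution along the frame axis applied independently to every keypoint; the kernel size of 9 and the Conv2d formulation below are assumptions made for illustration.
import torch.nn as nn

def temporal_conv(channels, t_kernel=9):
    """Convolution over (batch, channels, frames, keypoints) that spans frames only,
    so each skeleton keypoint is convolved with itself across time."""
    pad = (t_kernel - 1) // 2
    return nn.Conv2d(channels, channels, kernel_size=(t_kernel, 1), padding=(pad, 0))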
Step 5: a cross-attention model is used to enhance the performance of the network; the specific steps are as follows, and the structure is shown in fig. 2:
Step 5.1: the cross attention uses the feature map of the bone-joint-angle branch network to enhance the representational capacity of the main network stream, according to the formula:
fattention = (1 + Attention) * fout
A residual form of attention is used because, as the network becomes deeper, simply stacking attention causes some features to vanish.
Step 5.2: the computing method of the Attention is as follows:
Attention=g(fself,fcross)*fout
wherein f isselfIs the self-attention weight, f, of the output profile of the host networkcrossWeights are added to the primary network feature map for joint angles and cross attention weights of the primary network data, which are added together. Where g denotes the transformation of both dimensions to foutOfDegrees and added. Wherein f iscrossComprises the following steps:
Figure BDA0002868715070000101
wherein v (T, N, d) is a main network feature map, wherein N is the number of main network data bone joint points, and d represents the feature dimension of each joint point; a (T, k, m) is a joint angle network characteristic diagram, k represents the joint number of the bone joint angle data, and m is the dimension of the bone joint angle data. The formula calculates the association between different nodes of two networks, respectively, and is used as cross-attention.
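A possible reading of this cross-attention step is sketched below in PyTorch; the projection layers, the scaled dot-product used for fcross, the sigmoid stand-in for fself and the way g combines the two weights are all assumptions made for illustration, since the original gives the fcross formula only as an image.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    def __init__(self, d, m, attn_dim=64):
        super().__init__()
        self.q = nn.Linear(d, attn_dim)      # queries from the main-stream nodes
        self.k = nn.Linear(m, attn_dim)      # keys from the joint-angle nodes
        self.self_w = nn.Linear(d, 1)        # crude stand-in for the self-attention weight

    def forward(self, v, a):
        # v: (T, N, d) main-network feature map; a: (T, k, m) joint-angle feature map.
        scores = self.q(v) @ self.k(a).transpose(1, 2) / (self.q.out_features ** 0.5)
        f_cross = F.softmax(scores, dim=-1)                        # (T, N, k) node associations
        f_self = torch.sigmoid(self.self_w(v))                     # (T, N, 1)
        attention = (f_self + f_cross.mean(dim=-1, keepdim=True)) * v   # g(.) collapsed to a sum
        return (1 + attention) * v                                 # f_attention = (1 + Attention) * f_out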
Step 6: the details of the spatial-domain and temporal-domain convolutions are described in steps 3 and 4; together they form a complete spatio-temporal graph convolution module. A residual connection is first reserved from the input and attached to the output of the module; within the module, the first operation is the spatial-domain graph convolution, followed by a batch normalization (BatchNormalization) operation, a ReLU activation layer and a dropout layer with coefficient 0.5; a temporal graph convolution is then performed, followed again by batch normalization and a ReLU activation layer. Each of the two branches of the network consists of nine such spatio-temporal convolution modules, a global pooling layer and a Softmax layer, and the action category is finally obtained through the Softmax classifier.
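The assembly of one such module and of a full branch can be sketched as follows, reusing the AdaptiveGraphConv and temporal_conv sketches above; the channel progression, the 1x1 residual projection and the use of AdaptiveAvgPool2d for the global pooling are assumptions made for illustration.
import torch
import torch.nn as nn

class STBlock(nn.Module):
    """Residual spatio-temporal module: spatial graph conv -> BN -> ReLU -> dropout(0.5)
    -> temporal conv -> BN -> ReLU, plus a residual connection from the input."""
    def __init__(self, in_ch, out_ch, A):
        super().__init__()
        self.residual = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.spatial = AdaptiveGraphConv(in_ch, out_ch, A)
        self.bn1, self.bn2 = nn.BatchNorm2d(out_ch), nn.BatchNorm2d(out_ch)
        self.drop = nn.Dropout(0.5)
        self.temporal = temporal_conv(out_ch)

    def forward(self, x):                                   # x: (batch, C, T, N)
        res = self.residual(x)
        y = self.drop(torch.relu(self.bn1(self.spatial(x))))
        y = torch.relu(self.bn2(self.temporal(y)))
        return y + res

def build_branch(A, num_classes, channels=(6, 64, 64, 64, 128, 128, 128, 256, 256, 256)):
    """Nine spatio-temporal modules, a global pooling layer and a classifier head."""
    blocks = [STBlock(channels[i], channels[i + 1], A) for i in range(9)]
    return nn.Sequential(*blocks,
                         nn.AdaptiveAvgPool2d(1),           # node-level -> graph-level features
                         nn.Flatten(),
                         nn.Linear(channels[-1], num_classes))  # Softmax applied at inference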
Of course, before the graph convolutional neural network of this embodiment is used to recognize human actions, the model is trained. The training part uses the PyTorch framework and the cross-entropy (CrossEntropy) loss function, expressed as:
Loss = -[y·log y' + (1-y)·log(1-y')]
where y is the label of the sample and y' is the result predicted by the model. During training the batch size is set to 64, optimization uses SGD stochastic gradient descent with momentum 0.9, and the weight decay is set to 0.0001, for a total of 30 epochs.
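A minimal training-loop sketch matching these settings is given below; the dataset object, the model constructor and the learning rate of 0.1 are placeholders assumed for the example (the learning rate is not stated in the original).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, dataset, device="cuda"):
    loader = DataLoader(dataset, batch_size=64, shuffle=True)          # batch size 64
    criterion = nn.CrossEntropyLoss()                                  # cross-entropy loss
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)       # SGD, momentum 0.9
    model.to(device).train()
    for epoch in range(30):                                            # 30 epochs in total
        for features, labels in loader:
            features, labels = features.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()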
The technical solution of the present invention is not limited to the limitations of the above specific embodiments, and all technical modifications made according to the technical solution of the present invention fall within the protection scope of the present invention.

Claims (6)

1. A human motion recognition method based on a graph convolution neural network is characterized by comprising the following steps:
Step 1: preparing human action video data and labeling each video according to its action type;
Step 2: extracting skeleton keypoint features from the human action video data with the OpenPose pose estimation algorithm, then computing the change speed of the skeleton keypoints between adjacent frames for the skeleton-point main-stream network, and performing feature concatenation; screening the skeleton keypoints, computing the included joint angles for the angle branch network, and performing feature concatenation;
Step 3: feeding the concatenated data into a graph neural network;
Step 4: extending the graph convolution from the spatial domain to the temporal domain;
Step 5: using a cross-attention model to enhance the performance of the network;
Step 6: constructing a graph convolutional neural network consisting of nine spatio-temporal convolution modules, a global pooling layer and a Softmax layer, where the global pooling layer aggregates the node features of the graph structure, promoting node-level features to graph-level features, and the Softmax layer then outputs the human action label for the video.
2. The method for human motion recognition based on the graph-convolution neural network according to claim 1, wherein the step 2 specifically includes:
Step 2.1: first cropping the videos so that the person in each video is centered in the frame;
Step 2.2: extracting human skeleton keypoints with the OpenPose pose estimation algorithm, sampling 15 equally spaced frames S = (T1, T2, ..., T15) from the video S and saving the skeleton keypoint data of each sampled frame; 18 skeleton keypoints are extracted each time, corresponding to 18 parts of the human body; with the length of a single frame set to L and the width set to W, the extracted keypoint coordinates are normalized; using Tn to denote the skeleton keypoint data of the n-th frame, the normalized Tn is:
Tn = ((x1n/L, y1n/W), (x2n/L, y2n/W), ..., (x18n/L, y18n/W))
where xn is the abscissa and yn the ordinate of the n-th skeleton keypoint, and Tn is the normalized skeleton keypoint coordinate set of the n-th frame;
Step 2.3: computing the change speed of the keypoints between adjacent frames, where the speed Vn is:
Vn = ((x1n-x1n-1, y1n-y1n-1), (x2n-x2n-1, y2n-y2n-1), ..., (x18n-x18n-1, y18n-y18n-1));
where x and y have the same meanings as in step 2.2; after the speed V is obtained, feature concatenation is performed, and the concatenated overall feature Dn is:
Dn = Concat(Tn, Tn', Vn);
where Tn and Tn' denote the normalized skeleton keypoint coordinates obtained from the side view and the front view at time n respectively, and the Concat function denotes concatenation of the variables in brackets;
Step 2.4: screening the skeleton keypoints extracted by OpenPose, keeping the left knee, right knee, left waist, right waist, left shoulder, right shoulder, left elbow and right elbow;
Step 2.5: computing the included angles at the selected joints:
(1) knee; (2) waist; (3) shoulder; (4) elbow;
the corresponding angle formulas are given as images in the original document.
3. the method of claim 2, wherein the step 3 specifically comprises:
Step 3.1: the default human skeleton structure recognized by the OpenPose pose estimation algorithm is used as the basic connectivity of the graph neural network, and the corresponding adjacency matrix of the graph structure is denoted Ak, the adjacency matrix of the k-th layer network; it is an N × N two-dimensional matrix with N equal to 18, representing the 18 skeleton keypoints; entry A(n1, n2) represents the connection state between keypoints n1 and n2, a value of 1 meaning connected and a value of 0 meaning not connected;
Step 3.2: a second adjacency matrix of the graph structure is denoted Bk, the action-structure adjacency matrix of the k-th layer; it is also an N × N two-dimensional matrix with the same meaning as A, except that it has no fixed values: every element of the matrix is a trainable parameter;
Step 3.3: a third adjacency matrix of the graph structure is denoted Ck, with the same format as A and B, where Ck(n1, n2) is:
Ck(n1, n2) = softmax(θ(fin(n1))T · φ(fin(n2)))
This is a normalized Gaussian embedding that computes the similarity between any two skeleton keypoints; θ and φ are two embedding functions, T denotes matrix transposition, and the final output dimension is unchanged; the Softmax operation normalizes the final values to between 0 and 1, indicating whether a connection exists; the final output of the graph neural network is:
fout = Σk=1..K Wk · fin · (Ak + Bk + Ck)
where fin and fout denote the input and output of the layer, K denotes the total number of layers of the graph neural network, and W denotes the convolution parameters.
4. The method for human motion recognition based on the graph-convolution neural network of claim 3, wherein the step 4 specifically includes:
For a point nij, where i denotes the i-th frame and j the j-th skeleton keypoint, each temporal convolution only involves the same skeleton keypoint across frames, giving the formula:
fout(nij) = Σt Wt · f(n)(n(i+t)j)
where W is the convolution parameter and f(n) is the output of the n-th layer.
5. The method of claim 4, wherein the step 5 specifically comprises:
Step 5.1: the cross attention uses the feature map of the bone-joint-angle branch network to enhance the representational capacity of the main network stream, according to the formula:
fattention = (1 + Attention) * fout
Step 5.2: the Attention term is computed as:
Attention = g(fself, fcross) * fout
where fself is the self-attention weight of the main-network output feature map, and fcross is the cross-attention weight between the joint-angle data and the main-network data that is added onto the main-network feature map; g transforms both to the dimensions of fout and adds them; the formula for fcross is given as an image in the original document;
here v(T, N, d) is the main-network feature map, where N is the number of skeleton joints in the main-network data and d is the feature dimension of each joint; a(T, k, m) is the joint-angle-network feature map, where k is the number of joints in the bone-joint-angle data and m is the dimension of the joint-angle data.
6. The method of claim 5, wherein the step 6 specifically comprises:
Step 6.1: a residual connection is first reserved from the input and attached to the output of the module; within the module, the first operation is the spatial-domain graph convolution, followed by a batch normalization (BatchNormalization) operation, a ReLU activation layer and a dropout layer with coefficient 0.5; a temporal graph convolution is then performed, followed again by batch normalization and a ReLU activation layer; the overall network structure consists of nine such spatio-temporal convolution modules, a global pooling layer and a Softmax layer.
Step 6.2: the global pooling layer of the network aggregates the node features of the graph structure, promoting node-level features to graph-level features, and the Softmax layer then outputs the action label of the person in the video.
CN202011600579.7A 2020-12-29 2020-12-29 Human action recognition method based on graph convolution neural network Active CN112633209B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011600579.7A CN112633209B (en) 2020-12-29 2020-12-29 Human action recognition method based on graph convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011600579.7A CN112633209B (en) 2020-12-29 2020-12-29 Human action recognition method based on graph convolution neural network

Publications (2)

Publication Number Publication Date
CN112633209A true CN112633209A (en) 2021-04-09
CN112633209B CN112633209B (en) 2024-04-09

Family

ID=75286366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011600579.7A Active CN112633209B (en) 2020-12-29 2020-12-29 Human action recognition method based on graph convolution neural network

Country Status (1)

Country Link
CN (1) CN112633209B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255514A (en) * 2021-05-24 2021-08-13 西安理工大学 Behavior identification method based on local scene perception graph convolutional network
CN113361352A (en) * 2021-05-27 2021-09-07 天津大学 Student classroom behavior analysis monitoring method and system based on behavior recognition
CN113378656A (en) * 2021-05-24 2021-09-10 南京信息工程大学 Action identification method and device based on self-adaptive graph convolution neural network
CN113392743A (en) * 2021-06-04 2021-09-14 北京格灵深瞳信息技术股份有限公司 Abnormal action detection method, abnormal action detection device, electronic equipment and computer storage medium
CN114613011A (en) * 2022-03-17 2022-06-10 东华大学 Human body 3D (three-dimensional) bone behavior identification method based on graph attention convolutional neural network
CN114998990A (en) * 2022-05-26 2022-09-02 深圳市科荣软件股份有限公司 Construction site personnel safety behavior identification method and device
CN115050101A (en) * 2022-07-18 2022-09-13 四川大学 Gait recognition method based on skeleton and contour feature fusion

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110532960A (en) * 2019-08-30 2019-12-03 西安交通大学 A kind of action identification method of the target auxiliary based on figure neural network
CN110705463A (en) * 2019-09-29 2020-01-17 山东大学 Video human behavior recognition method and system based on multi-mode double-flow 3D network
CN111709321A (en) * 2020-05-28 2020-09-25 西安交通大学 Human behavior recognition method based on graph convolution neural network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110532960A (en) * 2019-08-30 2019-12-03 西安交通大学 A kind of action identification method of the target auxiliary based on figure neural network
CN110705463A (en) * 2019-09-29 2020-01-17 山东大学 Video human behavior recognition method and system based on multi-mode double-flow 3D network
CN111709321A (en) * 2020-05-28 2020-09-25 西安交通大学 Human behavior recognition method based on graph convolution neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LEI SHI et al.: "Skeleton-Based Action Recognition With Directed Graph Neural Networks", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7912-7921 *
WANQIANG ZHENG et al.: "Action Recognition Based on Spatial Temporal Graph Convolutional Networks", Proceedings of the 3rd International Conference on Computer Science and Application Engineering, October 2019, pages 1-5 *
HE Lei; SHAO Zhanpeng; ZHANG Jianhua; ZHOU Xiaolong: "A Survey of Action Recognition Algorithms Based on Deep Learning", Computer Science, no. 1, pages 149-157 *
CHEN Yongkang: "Research on a Motion Posture Analysis System Based on Machine Vision", China Master's Theses Full-text Database, Social Sciences II, no. 2, pages 134-354 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255514A (en) * 2021-05-24 2021-08-13 西安理工大学 Behavior identification method based on local scene perception graph convolutional network
CN113378656A (en) * 2021-05-24 2021-09-10 南京信息工程大学 Action identification method and device based on self-adaptive graph convolution neural network
CN113378656B (en) * 2021-05-24 2023-07-25 南京信息工程大学 Action recognition method and device based on self-adaptive graph convolution neural network
CN113361352A (en) * 2021-05-27 2021-09-07 天津大学 Student classroom behavior analysis monitoring method and system based on behavior recognition
CN113392743A (en) * 2021-06-04 2021-09-14 北京格灵深瞳信息技术股份有限公司 Abnormal action detection method, abnormal action detection device, electronic equipment and computer storage medium
CN114613011A (en) * 2022-03-17 2022-06-10 东华大学 Human body 3D (three-dimensional) bone behavior identification method based on graph attention convolutional neural network
CN114998990A (en) * 2022-05-26 2022-09-02 深圳市科荣软件股份有限公司 Construction site personnel safety behavior identification method and device
CN115050101A (en) * 2022-07-18 2022-09-13 四川大学 Gait recognition method based on skeleton and contour feature fusion
CN115050101B (en) * 2022-07-18 2024-03-22 四川大学 Gait recognition method based on fusion of skeleton and contour features

Also Published As

Publication number Publication date
CN112633209B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
CN112633209A (en) Human action recognition method based on graph convolution neural network
KR102564857B1 (en) Method and device to train and recognize data
CN108596039B (en) Bimodal emotion recognition method and system based on 3D convolutional neural network
CN108154194B (en) Method for extracting high-dimensional features by using tensor-based convolutional network
CN109858390A (en) The Activity recognition method of human skeleton based on end-to-end space-time diagram learning neural network
CN110378281A (en) Group Activity recognition method based on pseudo- 3D convolutional neural networks
CN111291190B (en) Training method of encoder, information detection method and related device
CN109344713B (en) Face recognition method of attitude robust
CN110633624B (en) Machine vision human body abnormal behavior identification method based on multi-feature fusion
CN110222718B (en) Image processing method and device
CN108154156B (en) Image set classification method and device based on neural topic model
CN114170410A (en) Point cloud part level segmentation method based on PointNet graph convolution and KNN search
CN110765960B (en) Pedestrian re-identification method for adaptive multi-task deep learning
CN112308081B (en) Image target prediction method based on attention mechanism
CN113435520A (en) Neural network training method, device, equipment and computer readable storage medium
CN108073851A (en) A kind of method, apparatus and electronic equipment for capturing gesture identification
CN115862136A (en) Lightweight filler behavior identification method and device based on skeleton joint
CN106951844A (en) A kind of Method of EEG signals classification and system based on the very fast learning machine of depth
CN113516227A (en) Neural network training method and device based on federal learning
CN108345900A (en) Pedestrian based on color and vein distribution characteristics recognition methods and its system again
CN115471670A (en) Space target detection method based on improved YOLOX network model
CN114170659A (en) Facial emotion recognition method based on attention mechanism
Liu et al. Iterative deep neighborhood: a deep learning model which involves both input data points and their neighbors
CN117115911A (en) Hypergraph learning action recognition system based on attention mechanism
Tang An action recognition method for volleyball players using deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant