CN112633209A - Human action recognition method based on graph convolution neural network - Google Patents

Human action recognition method based on graph convolution neural network Download PDF

Info

Publication number
CN112633209A
Authority
CN
China
Prior art keywords
network
graph
neural network
human
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011600579.7A
Other languages
Chinese (zh)
Other versions
CN112633209B (en)
Inventor
毛克明
李翰鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN202011600579.7A priority Critical patent/CN112633209B/en
Publication of CN112633209A publication Critical patent/CN112633209A/en
Application granted granted Critical
Publication of CN112633209B publication Critical patent/CN112633209B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/23Recognition of whole body movements, e.g. for sport training
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention discloses a human action recognition method based on a graph convolutional neural network. The method comprises: preparing human action video data and labeling each video according to its action type; extracting skeleton keypoint features from the video data with the OpenPose pose estimation algorithm, computing the change speed of the skeleton keypoints between adjacent frames for the skeleton-point main-stream network, and performing feature concatenation; screening the skeleton keypoints, computing the included joint angles for the angle branch network, and performing feature concatenation; feeding the concatenated data into a graph neural network; extending the graph convolution from the spatial domain to the temporal domain; using a cross-attention model to enhance the performance of the network; and performing human action recognition. The invention can recognize and output the actions performed by a person in an input video, has good usability and robustness, and lays a foundation for the practical deployment of artificial intelligence technology in the field of action recognition.

Description

Human action recognition method based on graph convolution neural network
Technical Field
The invention relates to the technical field of computer vision, in particular to a human action recognition method based on a graph convolution neural network.
Background
Artificial intelligence technology has spread to virtually every industry, and action recognition is a key technology behind many popular applications and requirements; it has become one of the most closely watched directions in the field of computer vision. Examples include the detection of and alarm on abnormal human behavior in intelligent surveillance cameras, and the classification and retrieval of human behavior in video; motion-capture technology in high-quality games likewise transfers the movements of professional players into the game and gives players a sense of immersion. Action recognition techniques can be expected to find ever wider application in the future.
In the field of computer vision, human action recognition methods currently fall into two main categories: methods based on the RGB frames and optical flow of a video, and methods based on human skeleton keypoints. RGB-and-optical-flow methods can learn the task end to end, but extracting optical flow from video is computationally very heavy; although various techniques now exist to reduce this cost, optical flow remains a powerful feature for the action recognition task. Methods based on human skeleton keypoints emerged after pose estimation technology matured. Compared with traditional RGB-and-optical-flow methods they model human behavior more effectively, and they avoid the influence of background and illumination changes that the traditional methods cannot escape. On the other hand, they require a pose estimation algorithm to extract features from the video, which adds one step compared with the traditional approach. In addition, existing skeleton-based action recognition methods simply use the keypoint coordinates, yet the information that describes a motion is not limited to coordinates: joint angles and their change speeds are also important elements for describing action features.
Therefore, given the current state of the field and the complexity of motion itself, a human action recognition method is needed that is grounded in deep learning theory and uses richer descriptive elements for the task.
Disclosure of Invention
The invention aims to provide a human action recognition method based on a graph convolution neural network aiming at the current situation in the field and the complexity of actions.
In order to achieve the purpose, the invention is implemented according to the following technical scheme:
a human motion recognition method based on a graph convolution neural network comprises the following steps:
Step 1: preparing human action video data and labeling each video according to its action type;
Step 2: extracting skeleton keypoint features from the human action video data with the OpenPose pose estimation algorithm, then computing the change speed of the skeleton keypoints between adjacent frames for the skeleton-point main-stream network, and performing feature concatenation; screening the skeleton keypoints, computing the included joint angles for the angle branch network, and performing feature concatenation;
Step 3: feeding the concatenated data into a graph neural network;
Step 4: extending the graph convolution from the spatial domain to the temporal domain;
Step 5: using a cross-attention model to enhance the performance of the network;
Step 6: constructing a graph convolutional neural network consisting of nine spatio-temporal convolution modules, a global pooling layer and a Softmax layer, where the global pooling layer aggregates the node features of the graph structure, promoting node-level features to graph-level features, and the Softmax layer then outputs the human action label for the video.
Further, the step 2 specifically includes:
Step 2.1: first cropping the videos so that the person in each video is centered in the frame;
Step 2.2: extracting human skeleton keypoints with the OpenPose pose estimation algorithm, sampling 15 equally spaced frames S = (T1, T2, ..., T15) from the video S and saving the skeleton keypoint data of each sampled frame; 18 skeleton keypoints are extracted each time, corresponding to 18 parts of the human body; with the length of a single frame set to L and the width set to W, the extracted keypoint coordinates are normalized; using Tn to denote the skeleton keypoint data of the n-th frame, the normalized Tn is:
Tn = ((x1n/L, y1n/W), (x2n/L, y2n/W), ..., (x18n/L, y18n/W))
where xn is the abscissa and yn the ordinate of the n-th skeleton keypoint, and Tn is the normalized skeleton keypoint coordinate set of the n-th frame;
Step 2.3: computing the change speed of the keypoints between adjacent frames, where the speed Vn is:
Vn = ((x1n-x1n-1, y1n-y1n-1), (x2n-x2n-1, y2n-y2n-1), ..., (x18n-x18n-1, y18n-y18n-1))
where x and y have the same meanings as in step 2.2; after the speed V is obtained, feature concatenation is performed, and the concatenated overall feature Dn is:
Dn = Concat(Tn, Tn', Vn)
where Tn and Tn' denote the normalized skeleton keypoint coordinates obtained from the side view and the front view at time n respectively, and the Concat function denotes concatenation of the variables in brackets;
Step 2.4: screening the skeleton keypoints extracted by OpenPose, keeping the left knee, right knee, left waist, right waist, left shoulder, right shoulder, left elbow and right elbow;
Step 2.5: computing the included angles at the selected joints:
(5) knee; (6) waist; (7) shoulder; (8) elbow;
the corresponding angle formulas are given as images in the original document.
further, the step 3 specifically includes:
Step 3.1: the default human skeleton structure recognized by the OpenPose pose estimation algorithm is used as the basic connectivity of the graph neural network, and the corresponding adjacency matrix of the graph structure is denoted Ak, the adjacency matrix of the k-th layer network; it is an N × N two-dimensional matrix with N equal to 18, representing the 18 skeleton keypoints; entry A(n1, n2) represents the connection state between keypoints n1 and n2, a value of 1 meaning connected and a value of 0 meaning not connected;
Step 3.2: a second adjacency matrix of the graph structure is denoted Bk, the action-structure adjacency matrix of the k-th layer; it is also an N × N two-dimensional matrix with the same meaning as A, except that it has no fixed values: every element of the matrix is a trainable parameter;
Step 3.3: a third adjacency matrix of the graph structure is denoted Ck, with the same format as A and B, where Ck(n1, n2) is:
Ck(n1, n2) = softmax(θ(fin(n1))T · φ(fin(n2)))
This is a normalized Gaussian embedding that computes the similarity between any two skeleton keypoints; θ and φ are two embedding functions, T denotes matrix transposition, and the final output dimension is unchanged; the Softmax operation normalizes the final values to between 0 and 1, indicating whether a connection exists; the final output of the graph neural network is:
fout = Σk=1..K Wk · fin · (Ak + Bk + Ck)
where fin and fout denote the input and output of the layer, K denotes the total number of layers of the graph neural network, and W denotes the convolution parameters.
Further, the step 4 specifically includes:
for point nijDefining i to represent the ith frame, j to represent the jth bone keypoint, and each time domain convolution only involves the same bone keypoint, then there is the formula:
Figure BDA0002868715070000052
w is a parameter of the convolution,
Figure BDA0002868715070000053
the output of the nth layer.
Further, the step 5 specifically includes:
Step 5.1: the cross attention uses the feature map of the bone-joint-angle branch network to enhance the representational capacity of the main network stream, according to the formula:
fattention = (1 + Attention) * fout
Step 5.2: the Attention term is computed as:
Attention = g(fself, fcross) * fout
where fself is the self-attention weight of the main-network output feature map, and fcross is the cross-attention weight between the joint-angle data and the main-network data that is added onto the main-network feature map; g transforms both to the dimensions of fout and adds them; the formula for fcross is given as an image in the original document;
here v(T, N, d) is the main-network feature map, where N is the number of skeleton joints in the main-network data and d is the feature dimension of each joint; a(T, k, m) is the joint-angle-network feature map, where k is the number of joints in the bone-joint-angle data and m is the dimension of the joint-angle data.
Further, the step 6 specifically includes:
Step 6.1: a residual connection is first reserved from the input and attached to the output of the module; within the module, the first operation is the spatial-domain graph convolution, followed by a batch normalization (BatchNormalization) operation, a ReLU activation layer and a dropout layer with coefficient 0.5; a temporal graph convolution is then performed, followed again by batch normalization and a ReLU activation layer; the overall network structure consists of nine such spatio-temporal convolution modules, a global pooling layer and a Softmax layer.
Step 6.2: the global pooling layer of the network aggregates the node features of the graph structure, promoting node-level features to graph-level features, and the Softmax layer then outputs the action label of the person in the video.
Compared with the prior art, the human action recognition method based on a graph convolutional neural network of the present invention can recognize and output the actions performed by a person in an input video, has good usability and robustness, and lays a foundation for the practical deployment of artificial intelligence technology in the field of action recognition.
Drawings
Fig. 1 is a flowchart of a method for human motion recognition based on a graph-convolution neural network according to the present invention.
Fig. 2 is a cross-attention network structure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. The specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
As shown in fig. 1 and fig. 2, the present embodiment provides a method for recognizing human actions based on a graph convolution neural network, which includes the following steps:
Step 1: preparing human action video data and labeling each video according to its action type, with labels starting from 0;
Step 2.1: performing feature extraction and feature design on the basic data to serve as the motion information features;
Step 2.1.1: first crop the videos so that the person in each video is centered in the frame;
Step 2.1.2: human skeleton keypoints are extracted with the OpenPose pose estimation algorithm; 15 equally spaced frames S = (T1, T2, ..., T15) are sampled from the video S and the skeleton keypoint data of each sampled frame are saved. Each time, 18 skeleton keypoints are extracted, corresponding to 18 parts of the human body. With the length of a single frame set to L and the width set to W, the extracted keypoint coordinates are normalized; using Tn to denote the skeleton keypoint data of the n-th frame, the normalized Tn is:
Tn = ((x1n/L, y1n/W), (x2n/L, y2n/W), ..., (x18n/L, y18n/W))
where xn is the abscissa and yn the ordinate of the n-th skeleton keypoint, and Tn is the normalized skeleton keypoint coordinate set of the n-th frame.
Step 2.1.3: the change speed of the keypoints between adjacent frames is computed, where the speed Vn is:
Vn = ((x1n-x1n-1, y1n-y1n-1), (x2n-x2n-1, y2n-y2n-1), ..., (x18n-x18n-1, y18n-y18n-1))
where x and y have the same meanings as in step 2.1.2. After the speed V is obtained, feature concatenation is performed, and the concatenated overall feature Dn is:
Dn = Concat(Tn, Tn', Vn)
where Tn and Tn' denote the normalized skeleton keypoint coordinates obtained from the side view and the front view at time n respectively, and the Concat function denotes concatenation of the variables in brackets.
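For concreteness, the following is a minimal NumPy sketch of this step, assuming keypoint arrays of shape (T, 18, 2) for the front and side views; the build_features helper, the array names, which view the velocity is taken from, and the zero-padding of the first-frame velocity are assumptions made for illustration rather than details taken from the disclosure.
import numpy as np

def build_features(front_kpts, side_kpts, frame_len, frame_wid):
    """front_kpts, side_kpts: arrays of shape (T, 18, 2) holding (x, y) pixel coordinates."""
    # Normalize coordinates by the single-frame length L and width W.
    t_front = front_kpts / np.array([frame_len, frame_wid], dtype=np.float32)
    t_side = side_kpts / np.array([frame_len, frame_wid], dtype=np.float32)

    # Change speed of each keypoint between adjacent frames (first frame padded with zeros).
    velocity = np.zeros_like(t_front)
    velocity[1:] = t_front[1:] - t_front[:-1]

    # Feature concatenation Dn = Concat(Tn, Tn', Vn) along the coordinate axis.
    return np.concatenate([t_front, t_side, velocity], axis=-1)  # shape (T, 18, 6)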
Step 2.2: the skeleton-point data are further refined into higher-order information, which together with the data from step 2.1 forms a two-stream network in which the two streams complement each other;
Step 2.2.1: because joint angles are of great importance to the action category, the human skeleton keypoints extracted by OpenPose are screened, and the left knee, right knee, left waist, right waist, left shoulder, right shoulder, left elbow and right elbow are kept;
Step 2.2.2: the included angles at the selected joints are computed:
(1) knee; (2) waist; (3) shoulder; (4) elbow;
the corresponding angle formulas are given as images in the original document; a sketch of one possible angle computation follows this list.
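As a hedged illustration only, the sketch below computes a joint's included angle from three keypoints (for example hip-knee-ankle for the knee) with the usual arccos of the normalized dot product; this is one common choice and not necessarily the formula of the original document, and the example coordinates are hypothetical.
import numpy as np

def joint_angle(a, b, c):
    """Included angle (radians) at keypoint b, formed by the segments b->a and b->c."""
    v1, v2 = a - b, c - b
    cos_angle = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))

# Example with hypothetical normalized coordinates for hip, knee and ankle.
hip, knee, ankle = np.array([0.40, 0.50]), np.array([0.42, 0.70]), np.array([0.45, 0.90])
knee_angle = joint_angle(hip, knee, ankle)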
Step 3: the concatenated data are fed into a graph neural network, which mainly consists of three parts;
Step 3.1: the first part uses the default human skeleton structure recognized by the OpenPose pose estimation algorithm as the basic connectivity of the graph neural network; the role of this basic structure is to fit the basic forms of human motion, and it has a certain modeling capability for actions of any form. The corresponding adjacency matrix of the graph structure is denoted Ak, representing the k-th layer network; it is an N × N two-dimensional matrix with N equal to 18, representing the 18 skeleton keypoints. Entry A(n1, n2) represents the connection state between keypoints n1 and n2, a value of 1 meaning connected and a value of 0 meaning not connected;
Step 3.2: the second part compensates for the limited ability of the basic structure to fit the diversity of motions; its adjacency matrix is denoted Bk, the action-structure adjacency matrix of the k-th layer. It is also an N × N two-dimensional matrix with the same meaning as A, except that it has no fixed values: every element is a trainable parameter, and during training the network automatically learns which connection patterns best compensate for each action;
Step 3.3: the third part is a data-driven graph structure, which takes different values for each different action; its matrix is denoted Ck, with the same format as A and B, where Ck(n1, n2) is:
Ck(n1, n2) = softmax(θ(fin(n1))T · φ(fin(n2)))
This is a normalized Gaussian embedding that computes the similarity between any two skeleton keypoints; θ and φ are two embedding functions, and T denotes matrix transposition, ensuring that the final output dimension is unchanged. The Softmax operation normalizes the final values to between 0 and 1, indicating whether a connection exists. The output of the final graph neural network is:
fout = Σk=1..K Wk · fin · (Ak + Bk + Ck)
where fin and fout denote the input and output of the layer, K denotes the total number of layers of the graph neural network, Ak, Bk and Ck are the matrices introduced in the steps above, and W denotes the convolution parameters;
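As an illustration of this three-part graph convolution, a minimal PyTorch sketch follows; the class name AdaptiveGraphConv, the embedding size, the time-averaging used when building Ck and the single-subset simplification are assumptions made for the example, not details from the disclosure.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGraphConv(nn.Module):
    """Spatial graph convolution with fixed (A), trainable (B) and data-driven (C) adjacency."""
    def __init__(self, in_ch, out_ch, A, embed_ch=16):
        super().__init__()
        self.register_buffer("A", A)                 # fixed skeleton adjacency, shape (N, N)
        self.B = nn.Parameter(torch.zeros_like(A))   # fully trainable adjacency
        self.theta = nn.Conv2d(in_ch, embed_ch, 1)   # embedding theta for C
        self.phi = nn.Conv2d(in_ch, embed_ch, 1)     # embedding phi for C
        self.W = nn.Conv2d(in_ch, out_ch, 1)         # convolution parameters W

    def forward(self, x):                            # x: (batch, channels, frames, keypoints)
        th = self.theta(x).mean(dim=2).permute(0, 2, 1)   # (batch, N, embed)
        ph = self.phi(x).mean(dim=2)                      # (batch, embed, N)
        C = F.softmax(torch.bmm(th, ph), dim=-1)          # data-driven adjacency (batch, N, N)
        adj = self.A + self.B + C                         # A + B + C
        y = torch.einsum("bctn,bnm->bctm", x, adj)        # aggregate neighbour features
        return self.W(y)                                  # apply W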
Step 4: the graph convolution is extended from the spatial domain to the temporal domain. For a point nij, we define i as the i-th frame and j as the j-th skeleton keypoint; each temporal convolution only involves the same skeleton keypoint across frames, giving the formula:
fout(nij) = Σt Wt · f(n)(n(i+t)j)
where W is the convolution parameter, f(n) is the output of the n-th layer, and the other variables are defined as before.
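The temporal extension can be sketched as a one-dimensional convolution along the frame axis applied independently to every keypoint; the kernel size of 9 and the Conv2d formulation below are assumptions made for illustration.
import torch.nn as nn

def temporal_conv(channels, t_kernel=9):
    """Convolution over (batch, channels, frames, keypoints) that spans frames only,
    so each skeleton keypoint is convolved with itself across time."""
    pad = (t_kernel - 1) // 2
    return nn.Conv2d(channels, channels, kernel_size=(t_kernel, 1), padding=(pad, 0))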
Step 5: a cross-attention model is used to enhance the performance of the network; the specific steps are as follows, and the structure is shown in fig. 2:
Step 5.1: the cross attention uses the feature map of the bone-joint-angle branch network to enhance the representational capacity of the main network stream, according to the formula:
fattention = (1 + Attention) * fout
A residual form of attention is used because, as the network becomes deeper, simply stacking attention causes some features to vanish.
Step 5.2: the computing method of the Attention is as follows:
Attention=g(fself,fcross)*fout
wherein f isselfIs the self-attention weight, f, of the output profile of the host networkcrossWeights are added to the primary network feature map for joint angles and cross attention weights of the primary network data, which are added together. Where g denotes the transformation of both dimensions to foutOfDegrees and added. Wherein f iscrossComprises the following steps:
Figure BDA0002868715070000101
wherein v (T, N, d) is a main network feature map, wherein N is the number of main network data bone joint points, and d represents the feature dimension of each joint point; a (T, k, m) is a joint angle network characteristic diagram, k represents the joint number of the bone joint angle data, and m is the dimension of the bone joint angle data. The formula calculates the association between different nodes of two networks, respectively, and is used as cross-attention.
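A possible reading of this cross-attention step is sketched below in PyTorch; the projection layers, the scaled dot-product used for fcross, the sigmoid stand-in for fself and the way g combines the two weights are all assumptions made for illustration, since the original gives the fcross formula only as an image.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    def __init__(self, d, m, attn_dim=64):
        super().__init__()
        self.q = nn.Linear(d, attn_dim)      # queries from the main-stream nodes
        self.k = nn.Linear(m, attn_dim)      # keys from the joint-angle nodes
        self.self_w = nn.Linear(d, 1)        # crude stand-in for the self-attention weight

    def forward(self, v, a):
        # v: (T, N, d) main-network feature map; a: (T, k, m) joint-angle feature map.
        scores = self.q(v) @ self.k(a).transpose(1, 2) / (self.q.out_features ** 0.5)
        f_cross = F.softmax(scores, dim=-1)                        # (T, N, k) node associations
        f_self = torch.sigmoid(self.self_w(v))                     # (T, N, 1)
        attention = (f_self + f_cross.mean(dim=-1, keepdim=True)) * v   # g(.) collapsed to a sum
        return (1 + attention) * v                                 # f_attention = (1 + Attention) * f_out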
Step 6: the details of the spatial-domain and temporal-domain convolutions are described in steps 3 and 4; together they form a complete spatio-temporal graph convolution module. A residual connection is first reserved from the input and attached to the output of the module; within the module, the first operation is the spatial-domain graph convolution, followed by a batch normalization (BatchNormalization) operation, a ReLU activation layer and a dropout layer with coefficient 0.5; a temporal graph convolution is then performed, followed again by batch normalization and a ReLU activation layer. Each of the two branches of the network consists of nine such spatio-temporal convolution modules, a global pooling layer and a Softmax layer, and the action category is finally obtained through the Softmax classifier.
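The assembly of one such module and of a full branch can be sketched as follows, reusing the AdaptiveGraphConv and temporal_conv sketches above; the channel progression, the 1x1 residual projection and the use of AdaptiveAvgPool2d for the global pooling are assumptions made for illustration.
import torch
import torch.nn as nn

class STBlock(nn.Module):
    """Residual spatio-temporal module: spatial graph conv -> BN -> ReLU -> dropout(0.5)
    -> temporal conv -> BN -> ReLU, plus a residual connection from the input."""
    def __init__(self, in_ch, out_ch, A):
        super().__init__()
        self.residual = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.spatial = AdaptiveGraphConv(in_ch, out_ch, A)
        self.bn1, self.bn2 = nn.BatchNorm2d(out_ch), nn.BatchNorm2d(out_ch)
        self.drop = nn.Dropout(0.5)
        self.temporal = temporal_conv(out_ch)

    def forward(self, x):                                   # x: (batch, C, T, N)
        res = self.residual(x)
        y = self.drop(torch.relu(self.bn1(self.spatial(x))))
        y = torch.relu(self.bn2(self.temporal(y)))
        return y + res

def build_branch(A, num_classes, channels=(6, 64, 64, 64, 128, 128, 128, 256, 256, 256)):
    """Nine spatio-temporal modules, a global pooling layer and a classifier head."""
    blocks = [STBlock(channels[i], channels[i + 1], A) for i in range(9)]
    return nn.Sequential(*blocks,
                         nn.AdaptiveAvgPool2d(1),           # node-level -> graph-level features
                         nn.Flatten(),
                         nn.Linear(channels[-1], num_classes))  # Softmax applied at inference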
Of course, before the graph convolutional neural network of this embodiment is used to recognize human actions, the model is trained. The training part uses the PyTorch framework and the cross-entropy (CrossEntropy) loss function, expressed as:
Loss = -[y·log y' + (1-y)·log(1-y')]
where y is the label of the sample and y' is the result predicted by the model. During training the batch size is set to 64, optimization uses SGD stochastic gradient descent with momentum 0.9, and the weight decay is set to 0.0001, for a total of 30 epochs.
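A minimal training-loop sketch matching these settings is given below; the dataset object, the model constructor and the learning rate of 0.1 are placeholders assumed for the example (the learning rate is not stated in the original).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, dataset, device="cuda"):
    loader = DataLoader(dataset, batch_size=64, shuffle=True)          # batch size 64
    criterion = nn.CrossEntropyLoss()                                  # cross-entropy loss
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)       # SGD, momentum 0.9
    model.to(device).train()
    for epoch in range(30):                                            # 30 epochs in total
        for features, labels in loader:
            features, labels = features.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()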
The technical solution of the present invention is not limited to the limitations of the above specific embodiments, and all technical modifications made according to the technical solution of the present invention fall within the protection scope of the present invention.

Claims (6)

1. A human motion recognition method based on a graph convolution neural network is characterized by comprising the following steps:
Step 1: preparing human action video data and labeling each video according to its action type;
Step 2: extracting skeleton keypoint features from the human action video data with the OpenPose pose estimation algorithm, then computing the change speed of the skeleton keypoints between adjacent frames for the skeleton-point main-stream network, and performing feature concatenation; screening the skeleton keypoints, computing the included joint angles for the angle branch network, and performing feature concatenation;
Step 3: feeding the concatenated data into a graph neural network;
Step 4: extending the graph convolution from the spatial domain to the temporal domain;
Step 5: using a cross-attention model to enhance the performance of the network;
Step 6: constructing a graph convolutional neural network consisting of nine spatio-temporal convolution modules, a global pooling layer and a Softmax layer, where the global pooling layer aggregates the node features of the graph structure, promoting node-level features to graph-level features, and the Softmax layer then outputs the human action label for the video.
2. The method for human motion recognition based on the graph-convolution neural network according to claim 1, wherein the step 2 specifically includes:
Step 2.1: first cropping the videos so that the person in each video is centered in the frame;
Step 2.2: extracting human skeleton keypoints with the OpenPose pose estimation algorithm, sampling 15 equally spaced frames S = (T1, T2, ..., T15) from the video S and saving the skeleton keypoint data of each sampled frame; 18 skeleton keypoints are extracted each time, corresponding to 18 parts of the human body; with the length of a single frame set to L and the width set to W, the extracted keypoint coordinates are normalized; using Tn to denote the skeleton keypoint data of the n-th frame, the normalized Tn is:
Tn = ((x1n/L, y1n/W), (x2n/L, y2n/W), ..., (x18n/L, y18n/W))
where xn is the abscissa and yn the ordinate of the n-th skeleton keypoint, and Tn is the normalized skeleton keypoint coordinate set of the n-th frame;
Step 2.3: computing the change speed of the keypoints between adjacent frames, where the speed Vn is:
Vn = ((x1n-x1n-1, y1n-y1n-1), (x2n-x2n-1, y2n-y2n-1), ..., (x18n-x18n-1, y18n-y18n-1));
where x and y have the same meanings as in step 2.2; after the speed V is obtained, feature concatenation is performed, and the concatenated overall feature Dn is:
Dn = Concat(Tn, Tn', Vn);
where Tn and Tn' denote the normalized skeleton keypoint coordinates obtained from the side view and the front view at time n respectively, and the Concat function denotes concatenation of the variables in brackets;
Step 2.4: screening the skeleton keypoints extracted by OpenPose, keeping the left knee, right knee, left waist, right waist, left shoulder, right shoulder, left elbow and right elbow;
Step 2.5: computing the included angles at the selected joints:
(1) knee; (2) waist; (3) shoulder; (4) elbow;
the corresponding angle formulas are given as images in the original document.
3. the method of claim 2, wherein the step 3 specifically comprises:
Step 3.1: the default human skeleton structure recognized by the OpenPose pose estimation algorithm is used as the basic connectivity of the graph neural network, and the corresponding adjacency matrix of the graph structure is denoted Ak, the adjacency matrix of the k-th layer network; it is an N × N two-dimensional matrix with N equal to 18, representing the 18 skeleton keypoints; entry A(n1, n2) represents the connection state between keypoints n1 and n2, a value of 1 meaning connected and a value of 0 meaning not connected;
Step 3.2: a second adjacency matrix of the graph structure is denoted Bk, the action-structure adjacency matrix of the k-th layer; it is also an N × N two-dimensional matrix with the same meaning as A, except that it has no fixed values: every element of the matrix is a trainable parameter;
Step 3.3: a third adjacency matrix of the graph structure is denoted Ck, with the same format as A and B, where Ck(n1, n2) is:
Ck(n1, n2) = softmax(θ(fin(n1))T · φ(fin(n2)))
This is a normalized Gaussian embedding that computes the similarity between any two skeleton keypoints; θ and φ are two embedding functions, T denotes matrix transposition, and the final output dimension is unchanged; the Softmax operation normalizes the final values to between 0 and 1, indicating whether a connection exists; the final output of the graph neural network is:
fout = Σk=1..K Wk · fin · (Ak + Bk + Ck)
where fin and fout denote the input and output of the layer, K denotes the total number of layers of the graph neural network, and W denotes the convolution parameters.
4. The method for human motion recognition based on the graph-convolution neural network of claim 3, wherein the step 4 specifically includes:
For a point nij, where i denotes the i-th frame and j the j-th skeleton keypoint, each temporal convolution only involves the same skeleton keypoint across frames, giving the formula:
fout(nij) = Σt Wt · f(n)(n(i+t)j)
where W is the convolution parameter and f(n) is the output of the n-th layer.
5. The method of claim 4, wherein the step 5 specifically comprises:
Step 5.1: the cross attention uses the feature map of the bone-joint-angle branch network to enhance the representational capacity of the main network stream, according to the formula:
fattention = (1 + Attention) * fout
Step 5.2: the Attention term is computed as:
Attention = g(fself, fcross) * fout
where fself is the self-attention weight of the main-network output feature map, and fcross is the cross-attention weight between the joint-angle data and the main-network data that is added onto the main-network feature map; g transforms both to the dimensions of fout and adds them; the formula for fcross is given as an image in the original document;
here v(T, N, d) is the main-network feature map, where N is the number of skeleton joints in the main-network data and d is the feature dimension of each joint; a(T, k, m) is the joint-angle-network feature map, where k is the number of joints in the bone-joint-angle data and m is the dimension of the joint-angle data.
6. The method of claim 5, wherein the step 6 specifically comprises:
Step 6.1: a residual connection is first reserved from the input and attached to the output of the module; within the module, the first operation is the spatial-domain graph convolution, followed by a batch normalization (BatchNormalization) operation, a ReLU activation layer and a dropout layer with coefficient 0.5; a temporal graph convolution is then performed, followed again by batch normalization and a ReLU activation layer; the overall network structure consists of nine such spatio-temporal convolution modules, a global pooling layer and a Softmax layer.
Step 6.2: the global pooling layer of the network aggregates the node features of the graph structure, promoting node-level features to graph-level features, and the Softmax layer then outputs the action label of the person in the video.
CN202011600579.7A 2020-12-29 2020-12-29 Human action recognition method based on graph convolution neural network Active CN112633209B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011600579.7A CN112633209B (en) 2020-12-29 2020-12-29 Human action recognition method based on graph convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011600579.7A CN112633209B (en) 2020-12-29 2020-12-29 Human action recognition method based on graph convolution neural network

Publications (2)

Publication Number Publication Date
CN112633209A true CN112633209A (en) 2021-04-09
CN112633209B CN112633209B (en) 2024-04-09

Family

ID=75286366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011600579.7A Active CN112633209B (en) 2020-12-29 2020-12-29 Human action recognition method based on graph convolution neural network

Country Status (1)

Country Link
CN (1) CN112633209B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255514A (en) * 2021-05-24 2021-08-13 西安理工大学 Behavior identification method based on local scene perception graph convolutional network
CN113361352A (en) * 2021-05-27 2021-09-07 天津大学 Student classroom behavior analysis monitoring method and system based on behavior recognition
CN113378656A (en) * 2021-05-24 2021-09-10 南京信息工程大学 Action identification method and device based on self-adaptive graph convolution neural network
CN113392743A (en) * 2021-06-04 2021-09-14 北京格灵深瞳信息技术股份有限公司 Abnormal action detection method, abnormal action detection device, electronic equipment and computer storage medium
CN114613011A (en) * 2022-03-17 2022-06-10 东华大学 Human body 3D (three-dimensional) bone behavior identification method based on graph attention convolutional neural network
CN114998990A (en) * 2022-05-26 2022-09-02 深圳市科荣软件股份有限公司 Construction site personnel safety behavior identification method and device
CN115050101A (en) * 2022-07-18 2022-09-13 四川大学 Gait recognition method based on skeleton and contour feature fusion

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110532960A (en) * 2019-08-30 2019-12-03 西安交通大学 A kind of action identification method of the target auxiliary based on figure neural network
CN110705463A (en) * 2019-09-29 2020-01-17 山东大学 Video human behavior recognition method and system based on multi-mode double-flow 3D network
CN111709321A (en) * 2020-05-28 2020-09-25 西安交通大学 Human behavior recognition method based on graph convolution neural network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110532960A (en) * 2019-08-30 2019-12-03 西安交通大学 A kind of action identification method of the target auxiliary based on figure neural network
CN110705463A (en) * 2019-09-29 2020-01-17 山东大学 Video human behavior recognition method and system based on multi-mode double-flow 3D network
CN111709321A (en) * 2020-05-28 2020-09-25 西安交通大学 Human behavior recognition method based on graph convolution neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LEI SHI et al.: "Skeleton-Based Action Recognition With Directed Graph Neural Networks", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7912-7921 *
WANQIANG ZHENG et al.: "Action Recognition Based on Spatial Temporal Graph Convolutional Networks", Proceedings of the 3rd International Conference on Computer Science and Application Engineering, October 2019, pages 1-5 *
HE Lei; SHAO Zhanpeng; ZHANG Jianhua; ZHOU Xiaolong: "A Survey of Action Recognition Algorithms Based on Deep Learning", Computer Science, no. 1, pages 149-157 *
CHEN Yongkang: "Research on a Motion Posture Analysis System Based on Machine Vision", China Master's Theses Full-text Database, Social Sciences II, no. 2, pages 134-354 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255514A (en) * 2021-05-24 2021-08-13 西安理工大学 Behavior identification method based on local scene perception graph convolutional network
CN113378656A (en) * 2021-05-24 2021-09-10 南京信息工程大学 Action identification method and device based on self-adaptive graph convolution neural network
CN113378656B (en) * 2021-05-24 2023-07-25 南京信息工程大学 Action recognition method and device based on self-adaptive graph convolution neural network
CN113361352A (en) * 2021-05-27 2021-09-07 天津大学 Student classroom behavior analysis monitoring method and system based on behavior recognition
CN113392743A (en) * 2021-06-04 2021-09-14 北京格灵深瞳信息技术股份有限公司 Abnormal action detection method, abnormal action detection device, electronic equipment and computer storage medium
CN114613011A (en) * 2022-03-17 2022-06-10 东华大学 Human body 3D (three-dimensional) bone behavior identification method based on graph attention convolutional neural network
CN114998990A (en) * 2022-05-26 2022-09-02 深圳市科荣软件股份有限公司 Construction site personnel safety behavior identification method and device
CN115050101A (en) * 2022-07-18 2022-09-13 四川大学 Gait recognition method based on skeleton and contour feature fusion
CN115050101B (en) * 2022-07-18 2024-03-22 四川大学 Gait recognition method based on fusion of skeleton and contour features

Also Published As

Publication number Publication date
CN112633209B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
CN112633209A (en) Human action recognition method based on graph convolution neural network
KR102564857B1 (en) Method and device to train and recognize data
CN108596039B (en) Bimodal emotion recognition method and system based on 3D convolutional neural network
CN108154194B (en) Method for extracting high-dimensional features by using tensor-based convolutional network
CN109858390A (en) The Activity recognition method of human skeleton based on end-to-end space-time diagram learning neural network
CN110378281A (en) Group Activity recognition method based on pseudo- 3D convolutional neural networks
CN111291190B (en) Training method of encoder, information detection method and related device
CN109344713B (en) Face recognition method of attitude robust
CN110633624B (en) Machine vision human body abnormal behavior identification method based on multi-feature fusion
CN110222718B (en) Image processing method and device
CN108154156B (en) Image set classification method and device based on neural topic model
CN114170410A (en) Point cloud part level segmentation method based on PointNet graph convolution and KNN search
CN110765960B (en) Pedestrian re-identification method for adaptive multi-task deep learning
CN112308081B (en) Image target prediction method based on attention mechanism
CN113435520A (en) Neural network training method, device, equipment and computer readable storage medium
CN108073851A (en) A kind of method, apparatus and electronic equipment for capturing gesture identification
CN115862136A (en) Lightweight filler behavior identification method and device based on skeleton joint
CN106951844A (en) A kind of Method of EEG signals classification and system based on the very fast learning machine of depth
CN113516227A (en) Neural network training method and device based on federal learning
CN108345900A (en) Pedestrian based on color and vein distribution characteristics recognition methods and its system again
CN115471670A (en) Space target detection method based on improved YOLOX network model
CN114170659A (en) Facial emotion recognition method based on attention mechanism
Liu et al. Iterative deep neighborhood: a deep learning model which involves both input data points and their neighbors
CN117115911A (en) Hypergraph learning action recognition system based on attention mechanism
Tang An action recognition method for volleyball players using deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant