CN112633209A - Human action recognition method based on graph convolution neural network - Google Patents
Human action recognition method based on graph convolution neural network
- Publication number
- CN112633209A (application number CN202011600579.7A)
- Authority
- CN
- China
- Prior art keywords
- network
- graph
- neural network
- human
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/23—Recognition of whole body movements, e.g. for sport training
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V20/42—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
The invention discloses a human action recognition method based on a graph convolution neural network. The method comprises: preparing human action video data and labeling each video with a label according to its action class; extracting skeletal key-point features from the video data with the OpenPose pose estimation algorithm, then calculating the change speed of skeletal key points between adjacent frames in a skeletal-point main network stream and performing feature concatenation; screening the skeletal key points, calculating the included angles of the screened key points in an angle branch network, and performing feature concatenation; transmitting the concatenated data into a graph neural network; extending the graph convolution from the spatial domain to the temporal domain; using a cross-attention model to enhance the performance of the network; and performing human action recognition. The invention can recognize and output the actions performed by a person in an input video, has good usability and robustness, and lays a foundation for the practical deployment of artificial intelligence technology in the field of action recognition.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a human action recognition method based on a graph convolution neural network.
Background
Artificial intelligence technology has spread into virtually every industry, and action recognition in particular is a key enabling technology for many popular applications and requirements, making it one of the most active directions in the field of computer vision. Examples include the detection of, and alarming on, abnormal human behavior by intelligent surveillance cameras, and the classification and retrieval of human behavior in video; high-quality games likewise use motion-capture technology to transfer professional players' movements into the game and give players a sense of immersion. Action recognition techniques can be expected to find ever wider application in the future.
At present, human motion recognition methods in computer vision fall mainly into two families: methods based on the RGB frames and optical flow of the video, and methods based on human skeletal key points. The RGB-and-optical-flow methods can learn the task end to end, but extracting optical flow from video is a very heavy task; although various techniques now exist to reduce the cost of extracting optical flow, it remains an expensive, if powerful, feature for the action recognition task. The skeletal-key-point methods are a newer family of motion recognition methods that emerged once pose estimation technology matured. Compared with the traditional RGB-and-optical-flow methods, they model human behavior more effectively and avoid the influence of background and lighting changes that the traditional methods cannot escape. On the other hand, they require a pose estimation algorithm to extract features from the video, one more step than the traditional methods need. In addition, existing skeleton-based motion recognition methods simply use the raw skeletal key-point data, yet the information describing an action is not limited to coordinates: joint angles and their change speeds are equally important elements of an action's feature description.
Therefore, given the current state of the field and the inherent complexity of motion itself, a human motion recognition method is needed that rests on a deep-learning theoretical basis and employs more descriptive elements for the task.
Disclosure of Invention
The invention aims to provide a human action recognition method based on a graph convolution neural network aiming at the current situation in the field and the complexity of actions.
In order to achieve the purpose, the invention is implemented according to the following technical scheme:
a human motion recognition method based on a graph convolution neural network comprises the following steps:
step 1: preparing human action video data and labeling each video with a label according to its action class;
step 2: extracting skeletal key-point features from the human action video data with the OpenPose pose estimation algorithm, then calculating the change speed of skeletal key points between adjacent frames in a skeletal-point main network stream and performing feature concatenation; screening the skeletal key points, calculating the included angles of the screened key points in an angle branch network, and performing feature concatenation;
step 3: transmitting the concatenated data into a graph neural network;
step 4: extending the graph convolution from a spatial domain to a temporal domain;
step 5: using a cross-attention model to enhance the performance of the network;
step 6: constructing a graph convolution neural network consisting of nine spatio-temporal convolution modules, a global pooling layer, and a Softmax layer, wherein the global pooling layer summarizes the node features of the graph structure to lift node-level features to graph-level features, after which the Softmax layer outputs the action-class number for the human action video.
Further, the step 2 specifically includes:
step 2.1: first cropping the videos so that the person in each video is centered;
step 2.2: extracting human skeletal key points with the OpenPose pose estimation algorithm, sampling 15 equally spaced frames S = (T1, T2, ..., T15) from a video S, and saving the skeletal key-point data of each sampled frame; each extraction yields 18 skeletal key points, representing 18 parts of the human body; setting the length of a single frame to L and its width to W, and normalizing the extracted key-point coordinates; with Tn denoting the key-point data of the n-th sampled frame, the normalized Tn is:

Tn = ((x1n/W, y1n/L), (x2n/W, y2n/L), ..., (x18n/W, y18n/L))

wherein xin is the abscissa and yin the ordinate of the i-th skeletal key point, and Tn is the normalized key-point coordinate set of the n-th frame;
step 2.3: calculating the change speed of key points between adjacent frames, wherein the speed Vn is:

Vn = ((x1,n − x1,n−1, y1,n − y1,n−1), (x2,n − x2,n−1, y2,n − y2,n−1), ..., (x18,n − x18,n−1, y18,n − y18,n−1));

wherein the specific meanings of x and y are the same as in step 2.2; performing feature concatenation after obtaining the speed V, the combined feature Dn being:

Dn = Concat(Tn, Tn′, Vn);

wherein Tn and Tn′ respectively denote the normalized skeletal key-point coordinates obtained from the side view and the front view at time n, and the Concat function concatenates the variables in parentheses;
step 2.4: screening the skeletal key points extracted by OpenPose, keeping the left knee, right knee, left waist, right waist, left shoulder, right shoulder, left elbow, and right elbow;
step 2.5: calculating the included angles:
(1) knee:
(2) waist:
(3) shoulder:
(4) elbow:
further, the step 3 specifically includes:
step 3.1: adopting the default human skeleton structure identified by the OpenPose pose estimation algorithm as the basic connectivity of the graph neural network, and denoting the adjacency matrix of this structure by Ak, the adjacency matrix of the k-th layer of the network; it is an N × N two-dimensional matrix with N = 18, representing the 18 skeletal key points; entry A(n1, n2) represents the connection state between positions n1 and n2, a value of 1 denoting a connection and a value of 0 denoting no connection;
step 3.2: setting a second adjacency matrix of the graph neural network structure, Bk, the action-structure adjacency matrix of the k-th layer; it is likewise an N × N two-dimensional matrix with the same meaning as A, except that it has no fixed values: each of its elements is a trainable parameter;
step 3.3: setting a third adjacency matrix of the graph neural network structure, Ck, with the same format as A and B, Ck(n1, n2) being:

Ck(n1, n2) = Softmax( θ(f_{n1})ᵀ φ(f_{n2}) )

a normalized Gaussian embedding that computes the similarity between any two skeletal key points, wherein θ and φ are two embedding functions and T denotes matrix transposition, so the final output dimension is unchanged; the Softmax normalization maps the final values to between 0 and 1, indicating whether a connection exists; the final output of the graph neural network is:

f_out = Σ_{k=1}^{K} W_k f_in (A_k + B_k + C_k)

wherein f_in and f_out respectively denote the input and output of the layer, K denotes the total number of layers of the graph neural network, and W denotes the convolution parameters.
Further, the step 4 specifically includes:
for a point n_{ij}, defining i as the frame index and j as the skeletal key-point index, each temporal convolution involving only the same skeletal key point across frames, there is the formula:
Further, the step 5 specifically includes:
step 5.1: the cross attention enhances the expressive capacity of the main network stream through the feature map of the skeletal-joint-angle branch network, according to the formula:

f_attention = (1 + Attention) * f_out;
step 5.2: the Attention term is computed as:

Attention = g(f_self, f_cross) * f_out

wherein f_self is the self-attention weight of the output feature map of the main network, and f_cross is the cross-attention weight between the joint angles and the main-network data, added as a weight onto the main-network feature map; g denotes transforming both to the dimensions of f_out and adding them; f_cross is:

wherein v(T, N, d) is the main-network feature map, N being the number of skeletal joints in the main-network data and d the feature dimension of each joint; a(T, k, m) is the joint-angle-network feature map, k being the number of joints in the skeletal-joint-angle data and m the dimension of the joint-angle data.
Further, the step 6 specifically includes:
step 6.1: the input first has a residual branch preserved and connected across the module; the first operation is a spatial-domain graph convolution, followed by batch normalization (BatchNormalization), a ReLU activation layer, and a dropout layer with coefficient 0.5; a temporal graph convolution is then performed, followed again by batch normalization and a ReLU activation layer; the overall network structure consists of nine such spatio-temporal convolution modules, a global pooling layer, and a Softmax layer.
Step 6.2: the global pooling layer in the network is used for summarizing the node characteristics in the graph structure, upgrading the node-level characteristics into graph-level characteristics, and outputting the action numbers of people in the video through the Softmax layer.
Compared with the prior art, the human action recognition method based on the graph convolution neural network can recognize and output the actions performed by a person in an input video, has good usability and robustness, and lays a foundation for the practical deployment of artificial intelligence technology in the field of action recognition.
Drawings
Fig. 1 is a flowchart of a method for human motion recognition based on a graph-convolution neural network according to the present invention.
Fig. 2 is a cross-attention network structure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. The specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
As shown in fig. 1 and fig. 2, the present embodiment provides a method for recognizing human actions based on a graph convolution neural network, which includes the following steps:
step 1: preparing human action video data and labeling each video with a label according to its action class, numbering the labels from 0;
step 2.1: carrying out feature extraction and feature design on the basic data to be used as the motion information features;
step 2.1.1: first crop the videos so that the person in each video is centered;
step 2.1.2: extract human skeletal key points with the OpenPose pose estimation algorithm. Sample 15 equally spaced frames S = (T1, T2, ..., T15) from a video S and save the skeletal key-point data of each sampled frame. Each extraction yields 18 skeletal key points, representing 18 parts of the human body. Let the length of a single frame be L and its width be W, and normalize the extracted key-point coordinates. With Tn denoting the key-point data of the n-th sampled frame, the normalized Tn is:

Tn = ((x1n/W, y1n/L), (x2n/W, y2n/L), ..., (x18n/W, y18n/L))

wherein xin is the abscissa and yin the ordinate of the i-th skeletal key point, and Tn is the normalized key-point coordinate set of the n-th frame.
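The normalization of step 2.1.2 can be sketched as follows. This is a minimal numpy illustration; the function name `normalize_keypoints` and the array shapes are assumptions, with W taken as the frame width and L as its length (height):

```python
import numpy as np

def normalize_keypoints(frame_kpts, width, length):
    """Scale raw pixel coordinates into [0, 1]: x / W, y / L.

    frame_kpts: (18, 2) array of OpenPose body-part coordinates.
    width / length: pixel dimensions W and L of the (cropped) frame.
    """
    kpts = np.asarray(frame_kpts, dtype=float)
    return kpts / np.array([width, length], dtype=float)

# Toy example: keypoint 0 sits at the centre of a 640 x 480 frame
raw = np.zeros((18, 2))
raw[0] = [320.0, 240.0]
norm = normalize_keypoints(raw, 640, 480)
```

Normalizing by frame size makes the coordinates independent of video resolution, which matters when training videos come from different sources.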
Step 2.1.3: calculating the change speed of key points of adjacent frames, wherein the speed Vn is:
Vn=((x1n-x1n-1,y1n-y1n-1),(x2n-x2n-1,y2n-y2n-1),...,(x18n-x18n-1,y18n-y18n-1))
wherein the specific meanings of x and y are the same as in step 2.2. After the speed V is obtained, feature splicing is carried out, and the total feature Dn after splicing is as follows:
Dn=Cancate(Tn,Tn′,Vn)
wherein T isnAnd T'nThe normalized bone keypoint coordinates obtained laterally and anteriorly at time n are indicated, respectively, and the Cancate function indicates the concatenation of the variables within brackets.
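Steps 2.1.3's velocity and the concatenation D_n = Concat(T_n, T_n′, V_n) can be sketched like this (a hedged illustration; the function names and the choice of the last axis as the channel axis are assumptions):

```python
import numpy as np

def velocity(kpts):
    """V_n: per-joint displacement between consecutive frames.
    kpts: (T, 18, 2) normalized keypoints -> (T-1, 18, 2)."""
    return kpts[1:] - kpts[:-1]

def total_feature(side, front, vel):
    """D_n = Concat(T_n, T_n', V_n): stack the two coordinate views and
    the velocity along the channel axis (frames 1..T-1 so shapes match)."""
    return np.concatenate([side[1:], front[1:], vel], axis=-1)

rng = np.random.default_rng(0)
side = rng.random((15, 18, 2))    # 15 sampled frames, side view
front = rng.random((15, 18, 2))   # front view
V = velocity(side)
D = total_feature(side, front, V)  # (14, 18, 6): coords + coords + velocity
```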
Step 2.2: further refining the bone point data to be used as high-order information, and forming a double-flow network by the data and the data in the step 2.1 to supplement each other;
step 2.2.1: because the joint angle is of great importance to the action category, the human skeleton key points extracted by openposition are screened, and the left knee, the right knee, the left waist, the right waist, the left shoulder, the right shoulder, the left elbow and the right elbow are saved;
step 2.2.2: calculate the included angles:
(1) knee:
(2) waist:
(3) shoulder:
(4) elbow:
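The patent's exact angle formulas did not survive extraction, so the following is a stand-in sketch using the standard arccos-of-normalized-dot-product construction for the included angle at a joint; the function name and the three-keypoint parameterization are assumptions:

```python
import numpy as np

def joint_angle(a, b, c):
    """Included angle (radians) at joint b between segments b->a and b->c.

    a, b, c: 2-D keypoint coordinates, e.g. hip, knee, ankle for the
    knee angle. A common construction, used here only as illustration.
    """
    u = np.asarray(a, float) - np.asarray(b, float)
    v = np.asarray(c, float) - np.asarray(b, float)
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

# Right angle: hip directly above the knee, ankle out to the side
knee_angle = joint_angle([0.0, 1.0], [0.0, 0.0], [1.0, 0.0])
```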
step 3: transmit the concatenated data into the graph neural network, which consists mainly of three parts;
step 3.1: the first part adopts the default human skeleton structure identified by the OpenPose pose estimation algorithm as the basic connectivity of the graph neural network. This basic structure adapts to the fundamental forms of human motion and has some modeling capability for actions of any form. Its adjacency matrix is denoted Ak for the k-th layer of the network; it is an N × N two-dimensional matrix with N = 18, representing the 18 skeletal key points. Entry A(n1, n2) represents the connection state between positions n1 and n2, a value of 1 denoting a connection and a value of 0 denoting no connection;
step 3.2: the second part compensates for the basic structure's limited ability to fit the diversity of motion. Its adjacency matrix is denoted Bk, the action-structure adjacency matrix of the k-th layer. It is likewise an N × N two-dimensional matrix with the same meaning as A, except that it has no fixed values: each of its elements is a trainable parameter, and during training the network automatically learns which connection patterns best compensate for a given action;
step 3.3: the third part is a data-driven graph structure, which takes a different value for each different action. Its matrix is denoted Ck, with the same format as A and B, and Ck(n1, n2) is:

Ck(n1, n2) = Softmax( θ(f_{n1})ᵀ φ(f_{n2}) )

a normalized Gaussian embedding that computes the similarity between any two skeletal key points, wherein θ and φ are two embedding functions and T denotes matrix transposition, ensuring that the final output dimension is unchanged. The Softmax normalization maps the final values to between 0 and 1, indicating whether a connection exists. The output of the final graph neural network is:

f_out = Σ_{k=1}^{K} W_k f_in (A_k + B_k + C_k)

wherein f_in and f_out respectively denote the input and output of the layer, K denotes the total number of layers of the graph neural network, A, B, and C are the matrices introduced in the steps above, and W denotes the convolution parameters;
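A minimal numpy sketch of the three-matrix graph layer described above, assuming 2-D per-frame features of shape (channels, joints); all names, dimensions, and the row-wise softmax placement are illustrative assumptions, not the patent's definitive implementation:

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax used to normalize the similarity matrix."""
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def data_driven_C(f_in, theta, phi):
    """C(n1, n2): normalized Gaussian-embedding similarity between joints.
    f_in: (C_in, N); theta, phi: (C_e, C_in) embedding matrices."""
    return softmax_rows((theta @ f_in).T @ (phi @ f_in))   # (N, N)

def graph_layer(f_in, A, B, C, W):
    """f_out = sum_k W_k f_in (A_k + B_k + C_k).
    A, B, C: (K, N, N); W: (K, C_out, C_in); f_in: (C_in, N)."""
    return sum(W[k] @ f_in @ (A[k] + B[k] + C[k]) for k in range(A.shape[0]))

rng = np.random.default_rng(0)
K, N, C_in, C_out, C_e = 3, 18, 6, 16, 4
f_in = rng.random((C_in, N))
A = (rng.random((K, N, N)) > 0.8).astype(float)   # fixed skeleton links A_k
B = rng.normal(scale=0.01, size=(K, N, N))        # trainable offsets B_k
C = np.stack([data_driven_C(f_in, rng.random((C_e, C_in)),
                            rng.random((C_e, C_in))) for _ in range(K)])
W = rng.random((K, C_out, C_in))
f_out = graph_layer(f_in, A, B, C, W)             # (C_out, N)
```

In a real model B would be a learned parameter tensor and C would be recomputed from each sample's features, which is what makes it data-driven.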
step 4: extend the graph convolution from the spatial domain to the temporal domain. For a point n_{ij}, we define i as the frame index and j as the skeletal key-point index; each temporal convolution involves only the same skeletal key point across frames, giving the formula:

wherein W is the convolution parameter, f_out^n denotes the output of the n-th layer, and the other variables are defined as above.
And 5: a cross-attention model is used to enhance the performance of the network, and the specific steps are as follows, and the structure is shown in fig. 2:
step 5.1: the cross attention enhances the expression capacity of the main network flow through the characteristic diagram of the bone joint angle network branch, and the formula is as follows:
fattention=(1+Attention)*fout
the self-attention model is a residual attention model because as the number of network layers deepens, simple attention stacking causes some features to disappear.
Step 5.2: the computing method of the Attention is as follows:
Attention=g(fself,fcross)*fout
wherein f isselfIs the self-attention weight, f, of the output profile of the host networkcrossWeights are added to the primary network feature map for joint angles and cross attention weights of the primary network data, which are added together. Where g denotes the transformation of both dimensions to foutOfDegrees and added. Wherein f iscrossComprises the following steps:
wherein v (T, N, d) is a main network feature map, wherein N is the number of main network data bone joint points, and d represents the feature dimension of each joint point; a (T, k, m) is a joint angle network characteristic diagram, k represents the joint number of the bone joint angle data, and m is the dimension of the bone joint angle data. The formula calculates the association between different nodes of two networks, respectively, and is used as cross-attention.
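The residual cross-attention of steps 5.1-5.2 can be sketched as below. The exact constructions of g and f_cross are only partially specified in the text, so g is taken here, purely as an assumption, to resize both weight maps to f_out's shape and add them:

```python
import numpy as np

def cross_attention(f_out, f_self, f_cross):
    """Residual cross-attention sketch:
        Attention = g(f_self, f_cross) * f_out
        f_att     = (1 + Attention) * f_out
    g is an illustrative stand-in: resize both to f_out's shape and add."""
    g = np.resize(f_self, f_out.shape) + np.resize(f_cross, f_out.shape)
    attention = g * f_out
    return (1.0 + attention) * f_out

f_out = np.full((4, 18), 2.0)
f_att = cross_attention(f_out, np.zeros((4, 18)), np.zeros((4, 18)))
# with zero weights Attention vanishes and the residual path passes f_out
```

The 1 + Attention form is what makes the block residual: even when the attention weights collapse to zero, the main-stream features survive unchanged.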
Step 6: the details of the spatial-domain and temporal-domain convolutions were given in steps 3 and 4; together they form one complete spatio-temporal graph convolution module. The input first has a residual branch preserved and connected across the module; the first operation is a spatial-domain graph convolution, followed by batch normalization (BatchNormalization), a ReLU activation layer, and a dropout layer with coefficient 0.5; a temporal graph convolution is then performed, followed again by batch normalization and a ReLU activation layer. The two branches of the network each consist of nine such spatio-temporal convolution modules, a global pooling layer, and a Softmax layer. The action category is finally obtained through the Softmax classifier.
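The pooling-and-classification head at the end of step 6 can be sketched as follows (a minimal numpy illustration; `classify`, `W_cls`, and the class count of 10 are assumptions):

```python
import numpy as np

def classify(graph_features, W_cls):
    """Global average pooling over frames and joints lifts node-level
    features to one graph-level vector; a softmax head then yields the
    action-class distribution.

    graph_features: (C, T, N) output of the last spatio-temporal block.
    W_cls: (num_classes, C) illustrative classifier weights.
    """
    pooled = graph_features.mean(axis=(1, 2))        # (C,) graph-level vector
    logits = W_cls @ pooled                          # (num_classes,)
    e = np.exp(logits - logits.max())                # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(1)
feat = rng.random((64, 15, 18))                      # 64 channels, 15 frames, 18 joints
probs = classify(feat, rng.random((10, 64)))         # distribution over 10 classes
pred = int(np.argmax(probs))                         # output action number
```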
Of course, before the graph convolution neural network of this embodiment is used to recognize human actions, the model is trained. The training part uses the PyTorch framework with the cross-entropy loss function, expressed as:

Loss = -[y log y′ + (1 - y) log(1 - y′)]

where y is the label of the sample and y′ is the result predicted by the model. During training we set the batch size to 64, optimize with SGD (stochastic gradient descent) with momentum 0.9 and weight decay 0.0001, and train for a total of 30 epochs.
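The printed loss is the binary form; a direct transcription (for the multi-class setting, PyTorch's CrossEntropyLoss instead sums -log of the true-class probability):

```python
import math

def cross_entropy(y, y_hat, eps=1e-12):
    """Loss = -[y*log(y') + (1 - y)*log(1 - y')] for a label y in {0, 1}
    and a predicted probability y'. eps clamps y' away from 0 and 1."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

loss_confident = cross_entropy(1.0, 0.9)   # small loss for a good prediction
loss_wrong = cross_entropy(1.0, 0.1)       # large loss for a bad prediction
```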
The technical solution of the present invention is not limited to the limitations of the above specific embodiments, and all technical modifications made according to the technical solution of the present invention fall within the protection scope of the present invention.
Claims (6)
1. A human motion recognition method based on a graph convolution neural network is characterized by comprising the following steps:
step 1: preparing human action video data and labeling each video with a label according to its action class;
step 2: extracting skeletal key-point features from the human action video data with the OpenPose pose estimation algorithm, then calculating the change speed of skeletal key points between adjacent frames in a skeletal-point main network stream and performing feature concatenation; screening the skeletal key points, calculating the included angles of the screened key points in an angle branch network, and performing feature concatenation;
step 3: transmitting the concatenated data into a graph neural network;
step 4: extending the graph convolution from a spatial domain to a temporal domain;
step 5: using a cross-attention model to enhance the performance of the network;
step 6: and constructing a graph convolution neural network consisting of nine space-time convolution modules, a global pooling layer and a Softmax layer, wherein the global pooling layer is used for summarizing the node characteristics in the graph structure so as to upgrade the node-level characteristics into graph-level characteristics, and then outputting the human action number in the human action video through the Softmax layer.
2. The method for human motion recognition based on the graph-convolution neural network according to claim 1, wherein the step 2 specifically includes:
step 2.1: first cropping the videos so that the person in each video is centered;
step 2.2: extracting human skeletal key points with the OpenPose pose estimation algorithm, sampling 15 equally spaced frames S = (T1, T2, ..., T15) from a video S, saving the skeletal key-point data of each sampled frame, extracting 18 skeletal key points each time, representing 18 parts of the human body, setting the length of a single frame to L and its width to W, and normalizing the extracted key-point coordinates; with Tn denoting the key-point data of the n-th frame, the normalized Tn is:

Tn = ((x1n/W, y1n/L), (x2n/W, y2n/L), ..., (x18n/W, y18n/L))

wherein xin is the abscissa and yin the ordinate of the i-th skeletal key point, and Tn is the normalized key-point coordinate set of the n-th frame;
step 2.3: calculating the change speed of key points between adjacent frames, wherein the speed Vn is:

Vn = ((x1,n − x1,n−1, y1,n − y1,n−1), (x2,n − x2,n−1, y2,n − y2,n−1), ..., (x18,n − x18,n−1, y18,n − y18,n−1));

wherein the specific meanings of x and y are the same as in step 2.2; performing feature concatenation after obtaining the speed V, the combined feature Dn being:

Dn = Concat(Tn, Tn′, Vn);

wherein Tn and Tn′ respectively denote the normalized skeletal key-point coordinates obtained from the side view and the front view at time n, and the Concat function concatenates the variables in parentheses;
step 2.4: screening the skeletal key points extracted by OpenPose, keeping the left knee, right knee, left waist, right waist, left shoulder, right shoulder, left elbow, and right elbow;
step 2.5: calculating the included angles:
(1) knee:
(2) waist:
(3) shoulder:
(4) elbow:
3. the method of claim 2, wherein the step 3 specifically comprises:
step 3.1: adopting the default human skeleton structure identified by the OpenPose pose estimation algorithm as the basic connectivity of the graph neural network, and denoting the adjacency matrix of this structure by Ak, the adjacency matrix of the k-th layer of the network; it is an N × N two-dimensional matrix with N = 18, representing the 18 skeletal key points; entry A(n1, n2) represents the connection state between positions n1 and n2, a value of 1 denoting a connection and a value of 0 denoting no connection;
step 3.2: setting a second adjacency matrix of the graph neural network structure, Bk, the action-structure adjacency matrix of the k-th layer; it is likewise an N × N two-dimensional matrix with the same meaning as A, except that it has no fixed values: each of its elements is a trainable parameter;
step 3.3: setting a third adjacency matrix of the graph neural network structure, Ck, with the same format as A and B, Ck(n1, n2) being:

Ck(n1, n2) = Softmax( θ(f_{n1})ᵀ φ(f_{n2}) )

a normalized Gaussian embedding that computes the similarity between any two skeletal key points, wherein θ and φ are two embedding functions and T denotes matrix transposition, so the final output dimension is unchanged; the Softmax normalization maps the final values to between 0 and 1, indicating whether a connection exists; the final output of the graph neural network is:

f_out = Σ_{k=1}^{K} W_k f_in (A_k + B_k + C_k)

wherein f_in and f_out respectively denote the input and output of the layer, K denotes the total number of layers of the graph neural network, and W denotes the convolution parameters.
4. The method for human motion recognition based on the graph-convolution neural network of claim 3, wherein the step 4 specifically includes:
for a point n_{ij}, defining i as the frame index and j as the skeletal key-point index, each temporal convolution involving only the same skeletal key point across frames, there is the formula:
5. The method of claim 4, wherein the step 5 specifically comprises:
step 5.1: the cross attention enhances the expressive capacity of the main network stream through the feature map of the bone joint-angle network branch, according to the formula:
f_attention = (1 + Attention) * f_out;
step 5.2: the Attention term is computed as:
Attention = g(f_self, f_cross) * f_out
wherein f_self is the self-attention weight of the output feature map of the main network, and f_cross is the cross-attention weight, derived from the joint angles and the main-network data, that is added to the main network feature map; g denotes transforming both terms to the dimensions of f_out and adding them; f_cross is computed as:
wherein v(T, N, d) is the main network feature map, N is the number of skeletal joint points in the main network data, and d represents the feature dimension of each joint point; a(T, k, m) is the joint-angle network feature map, k represents the number of joints in the bone joint-angle data, and m is the dimension of the bone joint-angle data.
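The cross-attention of steps 5.1–5.2 can be sketched as below. The transform g, which maps the joint-angle feature map a(T, k, m) to the dimensions of f_out before adding, is realized here as a hypothetical linear projection, since the claim does not specify its form.

```python
import numpy as np

# Hedged sketch of f_attention = (1 + Attention) * f_out, with
# Attention = g(f_self, f_cross) * f_out. All weights are random placeholders.
rng = np.random.default_rng(1)
T, N, d = 4, 18, 8
f_out = rng.normal(size=(T, N, d))   # main-stream output feature map v(T, N, d)
f_self = rng.normal(size=(T, N, d))  # self-attention weight of the main stream
a = rng.normal(size=(T, 5, 3))       # joint-angle branch map a(T, k=5, m=3)

# hypothetical g: project the joint-angle map to (T, N, d) linearly, then add
W = rng.normal(size=(5 * 3, N * d))
f_cross = (a.reshape(T, -1) @ W).reshape(T, N, d)

attention = (f_self + f_cross) * f_out      # g adds the resized maps
f_attention = (1 + attention) * f_out       # residual-style enhancement
```

The `1 +` term keeps the original main-stream features intact when the attention signal is zero, so the branch can only add information.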
6. The method of claim 5, wherein the step 6 specifically comprises:
step 6.1: the input is first preserved as a residual connection into the module; within the module, the first operation is a spatial-domain graph convolution, followed by a batch normalization (BatchNormalization) operation, a ReLU activation layer, and a dropout layer with coefficient 0.5; a temporal convolution is then performed, followed by another batch normalization operation and ReLU activation layer; the overall network structure consists of nine such spatio-temporal convolution modules, a global pooling layer, and a Softmax layer.
step 6.2: the global pooling layer in the network summarizes the node features in the graph structure, lifting node-level features to graph-level features, and the Softmax layer then outputs the action category number of the person in the video.
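The classification stage of step 6.2 can be sketched as follows, with a hypothetical linear classifier standing in for the trained network head; the global average pooling shown is one common realization of "lifting node-level features to graph-level features".

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Sketch: pool node-level features (N, d) into one graph-level vector (d,),
# then classify it. The weights here are random placeholders, not trained.
rng = np.random.default_rng(2)
N, d, num_actions = 18, 16, 10
node_features = rng.normal(size=(N, d))    # output of the ninth ST module

graph_feature = node_features.mean(axis=0)          # global average pooling
logits = rng.normal(size=(num_actions, d)) @ graph_feature
probs = softmax(logits)                             # Softmax layer
action = int(np.argmax(probs))                      # predicted action number
```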
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011600579.7A CN112633209B (en) | 2020-12-29 | 2020-12-29 | Human action recognition method based on graph convolution neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112633209A true CN112633209A (en) | 2021-04-09 |
CN112633209B CN112633209B (en) | 2024-04-09 |
Family
ID=75286366
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011600579.7A Active CN112633209B (en) | 2020-12-29 | 2020-12-29 | Human action recognition method based on graph convolution neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112633209B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110532960A (en) * | 2019-08-30 | 2019-12-03 | 西安交通大学 | A kind of action identification method of the target auxiliary based on figure neural network |
CN110705463A (en) * | 2019-09-29 | 2020-01-17 | 山东大学 | Video human behavior recognition method and system based on multi-mode double-flow 3D network |
CN111709321A (en) * | 2020-05-28 | 2020-09-25 | 西安交通大学 | Human behavior recognition method based on graph convolution neural network |
Non-Patent Citations (4)
Title |
---|
LEI SHI et al.: "Skeleton-Based Action Recognition With Directed Graph Neural Networks", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7912 - 7921 * |
WANQIANG ZHENG et al.: "Action Recognition Based on Spatial Temporal Graph Convolutional Networks", Proceedings of the 3rd International Conference on Computer Science and Application Engineering, October 2019, pages 1 - 5 * |
HE Lei; SHAO Zhanpeng; ZHANG Jianhua; ZHOU Xiaolong: "A Survey of Behavior Recognition Algorithms Based on Deep Learning", Computer Science, no. 1, pages 149 - 157 * |
CHEN Yongkang: "Research on a Motion Posture Analysis System Based on Machine Vision", China Master's Theses Full-text Database, Social Sciences II, no. 2, pages 134 - 354 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113255514A (en) * | 2021-05-24 | 2021-08-13 | 西安理工大学 | Behavior identification method based on local scene perception graph convolutional network |
CN113378656A (en) * | 2021-05-24 | 2021-09-10 | 南京信息工程大学 | Action identification method and device based on self-adaptive graph convolution neural network |
CN113378656B (en) * | 2021-05-24 | 2023-07-25 | 南京信息工程大学 | Action recognition method and device based on self-adaptive graph convolution neural network |
CN113361352A (en) * | 2021-05-27 | 2021-09-07 | 天津大学 | Student classroom behavior analysis monitoring method and system based on behavior recognition |
CN113392743A (en) * | 2021-06-04 | 2021-09-14 | 北京格灵深瞳信息技术股份有限公司 | Abnormal action detection method, abnormal action detection device, electronic equipment and computer storage medium |
CN114613011A (en) * | 2022-03-17 | 2022-06-10 | 东华大学 | Human body 3D (three-dimensional) bone behavior identification method based on graph attention convolutional neural network |
CN114998990A (en) * | 2022-05-26 | 2022-09-02 | 深圳市科荣软件股份有限公司 | Construction site personnel safety behavior identification method and device |
CN115050101A (en) * | 2022-07-18 | 2022-09-13 | 四川大学 | Gait recognition method based on skeleton and contour feature fusion |
CN115050101B (en) * | 2022-07-18 | 2024-03-22 | 四川大学 | Gait recognition method based on fusion of skeleton and contour features |
Also Published As
Publication number | Publication date |
---|---|
CN112633209B (en) | 2024-04-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112633209A (en) | Human action recognition method based on graph convolution neural network | |
KR102564857B1 (en) | Method and device to train and recognize data | |
CN108596039B (en) | Bimodal emotion recognition method and system based on 3D convolutional neural network | |
CN108154194B (en) | Method for extracting high-dimensional features by using tensor-based convolutional network | |
CN109858390A (en) | The Activity recognition method of human skeleton based on end-to-end space-time diagram learning neural network | |
CN110378281A (en) | Group Activity recognition method based on pseudo- 3D convolutional neural networks | |
CN111291190B (en) | Training method of encoder, information detection method and related device | |
CN109344713B (en) | Face recognition method of attitude robust | |
CN110633624B (en) | Machine vision human body abnormal behavior identification method based on multi-feature fusion | |
CN110222718B (en) | Image processing method and device | |
CN108154156B (en) | Image set classification method and device based on neural topic model | |
CN114170410A (en) | Point cloud part level segmentation method based on PointNet graph convolution and KNN search | |
CN110765960B (en) | Pedestrian re-identification method for adaptive multi-task deep learning | |
CN112308081B (en) | Image target prediction method based on attention mechanism | |
CN113435520A (en) | Neural network training method, device, equipment and computer readable storage medium | |
CN108073851A (en) | A kind of method, apparatus and electronic equipment for capturing gesture identification | |
CN115862136A (en) | Lightweight filler behavior identification method and device based on skeleton joint | |
CN106951844A (en) | A kind of Method of EEG signals classification and system based on the very fast learning machine of depth | |
CN113516227A (en) | Neural network training method and device based on federal learning | |
CN108345900A (en) | Pedestrian based on color and vein distribution characteristics recognition methods and its system again | |
CN115471670A (en) | Space target detection method based on improved YOLOX network model | |
CN114170659A (en) | Facial emotion recognition method based on attention mechanism | |
Liu et al. | Iterative deep neighborhood: a deep learning model which involves both input data points and their neighbors | |
CN117115911A (en) | Hypergraph learning action recognition system based on attention mechanism | |
Tang | An action recognition method for volleyball players using deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||