CN113378656B - Action recognition method and device based on self-adaptive graph convolution neural network - Google Patents


Info

Publication number
CN113378656B
CN113378656B · Application CN202110564099.8A
Authority
CN
China
Prior art keywords
data
neural network
key
angle
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110564099.8A
Other languages
Chinese (zh)
Other versions
CN113378656A (en)
Inventor
胡凯
丁益武
陆美霞
黄昱锟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN202110564099.8A
Publication of CN113378656A
Application granted
Publication of CN113378656B
Status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an action recognition method and device based on a self-adaptive graph convolutional neural network, where the method comprises the following steps: S1, generating a human skeleton data set; S2, taking the change of the angle between adjacent bone edges as a deep spatial feature; S3, calculating the average energy change value of each key node and taking it as a deep temporal feature; S4, constructing a two-stream graph convolutional neural network; S5, expanding the two-stream graph convolutional neural network by connecting 2 new sub-networks in parallel to build an action recognition model, where the new sub-networks process the spatial feature and the temporal feature respectively; the action recognition model processes joint data, bone data, the deep spatial features and the deep temporal features simultaneously and computes the corresponding action type. The method and device can effectively improve the recognition accuracy of graph convolutional networks in the field of action recognition.

Description

Action recognition method and device based on self-adaptive graph convolution neural network
Technical Field
The invention relates to the technical field of video stream recognition, and in particular to an action recognition method and device based on a self-adaptive graph convolutional neural network.
Background
In the field of machine learning, action recognition is a very important task: it appears in many everyday scenarios such as autonomous driving, human-computer interaction and public security, and has therefore attracted growing attention. In recent years, driven by the explosive development of machine learning and deep learning, many action recognition algorithms with excellent performance have emerged, and algorithms based on spatio-temporal graph convolution have achieved excellent results.
Existing action recognition algorithms based on graph neural networks exploit only very shallow features. First, they directly use the key-point coordinates of the human body and their confidence obtained from a pose estimation algorithm as features, ignoring the positional relations between key points and bones. For example, the position of a key point at the shoulder depends on where the upper body is, and in turn determines the position of the upper arm. Second, they make no clear distinction in the duration of motions: falling and lying down, for example, look similar as motions, yet in time a fall is clearly faster than lying down. These problems show that existing methods still do not extract sufficient information from the data.
Therefore, although skeleton-based action recognition algorithms have achieved excellent results on public datasets, current algorithms use only relatively shallow features: they consider neither the association between the nodes and edges of skeleton data nor the association between edges, and they have no effective way to distinguish motions such as "falling" and "lying down".
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an action recognition method and device based on a self-adaptive graph convolutional neural network. By focusing on the root nodes of an action, the method calculates the angle change between two bones around the same key point within the duration of one action, and simultaneously feeds the action duration to the network as a feature, so as to solve the problem of insufficient information utilization and effectively improve the recognition accuracy of graph convolutional networks in the field of action recognition.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
in a first aspect, an embodiment of the present invention provides an action recognition method based on an adaptive graph convolutional neural network, where the action recognition method includes:
S1, acquiring video stream data of a human action type to be identified, processing the imported video stream data with an existing pose estimation algorithm to obtain human-skeleton-type data and human skeleton patterns, generating the coordinates and confidence features of each key node, and generating a human skeleton data set;
S2, calculating the change of angular momentum when bones rotate around key nodes during human motion, where the change of the angle between adjacent bone edges is used as a deep spatial feature;
S3, extracting energy information within the duration of a human action: accumulating the angle differences generated by the rotation of bones around key nodes to obtain the sum of the angle changes over the duration of the action, dividing the sum of the angle differences corresponding to each key node by the number of key frames of the current action, and calculating the average energy change value of each key node, which is used as a deep temporal feature;
S4, constructing a two-stream graph convolutional neural network, where joint data and bone data are used respectively as the input data of the J stream and the B stream, and a predicted action label is used as the output data;
S5, expanding the two-stream graph convolutional neural network by connecting 2 new sub-networks in parallel to build an action recognition model, where the new sub-networks process the spatial feature and the temporal feature respectively; the action recognition model processes the joint data, bone data, deep spatial features and deep temporal features simultaneously and computes the corresponding action type.
Optionally, in step S2, the process of calculating the change of angular momentum generated when bones rotate around key nodes during human motion, with the change of the angle between adjacent bone edges used as the deep spatial feature, includes the following steps:
S21, calculating the angles between all adjacent bones according to the coordinates and physical connections of each key node in the human skeleton data set: when the degree of a node is 1, i.e. the node has only one edge, no angle is calculated; when the degree of a node is 2, i.e. one node connects two edges, the angle smaller than 180° is calculated; when the degree of a node is 3, i.e. one node connects 3 edges, 3 angles are calculated; when the degree of a node is 4, 4 angles are calculated;
S22, for all angles in the n key frames over the whole action duration, combining the calculated angles into matrix form in the order of key nodes and video frames, the resulting angle matrix being

$$\Theta=\begin{bmatrix}\theta_1^1&\theta_1^2&\cdots&\theta_1^n\\\theta_2^1&\theta_2^2&\cdots&\theta_2^n\\\vdots&\vdots&\ddots&\vdots\\\theta_m^1&\theta_m^2&\cdots&\theta_m^n\end{bmatrix}$$

where m is the total number of angles and $\theta_i^j$ is the value of the i-th angle in the j-th key frame, i = 1, 2, …, m, j = 1, 2, …, n;
S23, subtracting the corresponding angle of the same key point in the previous frame from its angle in the next frame to obtain the angle differences formed by the edges around the same node between adjacent frames; the angle difference matrix Δθ formed by the bones around the same node as center point between adjacent frames is

$$\Delta\theta=\begin{bmatrix}\Delta\theta_1^1&\cdots&\Delta\theta_1^{n-1}\\\vdots&\ddots&\vdots\\\Delta\theta_m^1&\cdots&\Delta\theta_m^{n-1}\end{bmatrix},\qquad \Delta\theta_i^j=\theta_i^{j+1}-\theta_i^j$$

where $\Delta\theta_m^{n-1}$ is the value of the m-th angle in the (n−1)-th key frame.
Optionally, in step S3, the process of extracting energy information within the duration of the human action, accumulating the angle differences generated by the rotation of bones around key nodes to obtain the sum of the angle changes over the duration of the action, dividing the sum of the angle differences corresponding to each key node by the number of key frames of the current action, and calculating the average energy change value of each key node as the deep temporal feature includes the following steps:
S31, accumulating and summing the calculated angle difference matrix Δθ in time order to obtain the sum of the angle changes on each node,

$$\theta_i=\sum_{j=1}^{n-1}\Delta\theta_i^j,\qquad \theta_I=\left[\theta_1,\theta_2,\ldots,\theta_{m-1}\right]$$

where the subscripts 1 ~ m−1 denote the labels of the key nodes and the superscripts 1 ~ n−1 denote the key frames, forming a 1×(m−1) energy matrix $\theta_I$;
S32, dividing the $\theta_I$ obtained in step S31 by the number of key frames of the current action to obtain the average energy of the current action, $\theta_a=\theta_I/n$, where n is the number of key frames extracted by the pose estimation algorithm.
Optionally, in step S4, the process of constructing the two-stream graph convolutional neural network includes the following steps:
step 4.1: building an adaptive graph convolution layer; the adaptive graph convolution layer optimizes the topology of the network together with the other network parameters in an end-to-end learning manner, so that the skeleton graph is unique to different layers and samples; the topology of the graph is determined by an adjacency matrix $A_k$ and a mask $M_k$, where $A_k$ determines whether there is a connection between two vertices and $M_k$ determines the strength of the connection, resulting in the following expression:

$$f_{out}=\sum_{k}^{K_v}W_k f_{in}\left(A_k+B_k+C_k\right)$$

where $K_v$ denotes the kernel size of the spatial dimension and is set to 3; $W_k$ is a weight matrix, k ∈ [0,3]; $A_k=\Lambda_k^{-\frac{1}{2}}\bar{A}_k\Lambda_k^{-\frac{1}{2}}$, where $\Lambda_k$ is the normalized diagonal matrix and $\bar{A}_k$ is an N×N adjacency matrix representing the physical structure of the human body; $B_k$ is an N×N adjacency matrix whose elements are trained and optimized along with the adaptive graph convolution layer; $B_k$ is not limited, and the elements of the matrix are arbitrary values that indicate the presence and strength of a connection between two joints; $C_k$ is a data-dependent graph used to learn a unique graph for each sample;
to determine whether there is a connection between two vertices and the strength of that connection, a normalized embedded Gaussian function is used to calculate the similarity between the two vertices:

$$f(v_i,v_j)=\frac{e^{\theta(v_i)^{T}\phi(v_j)}}{\sum_{j=1}^{N}e^{\theta(v_i)^{T}\phi(v_j)}}$$

where N denotes the total number of key points, and $v_i$ and $v_j$ carry the feature information on the nodes;
given a feature matrix input, two embedding functions θ(·) and φ(·) change the dimension of the input from $C_{in}\times T\times N$ to $C_e\times T\times N$; the two embedded feature matrices are rearranged and reshaped into an $N\times C_eT$ matrix and a $C_eT\times N$ matrix, which are multiplied to form a similarity matrix, and $C_k$ is calculated by:

$$C_k=\operatorname{softmax}\left(f_{in}^{T}W_{\theta k}^{T}W_{\phi k}f_{in}\right)$$

where $W_{\theta}$ and $W_{\phi}$ are the parameters of the embedding functions θ(·) and φ(·);
step 4.2: building an adaptive graph convolution module; the adaptive graph convolution module comprises, connected in sequence, a spatial graph convolution layer convs, a temporal graph convolution layer convt, an additional random dropout process (Dropout) and a residual connection; the Dropout rate is set to 0.5; the spatial graph convolution layer convs and the temporal graph convolution layer convt are each followed by a batch normalization layer and an activation function layer;
step 4.3: building the adaptive graph convolutional network by stacking the adaptive graph convolution modules; the adaptive graph convolutional network comprises 9 adaptive graph convolution modules, whose output channel numbers are 64, 64, 64, 128, 128, 128, 256, 256 and 256 respectively; a data BN layer is added at the beginning to normalize the input data, a global average pooling layer pools the feature maps of different samples to the same size, and the final output is sent to a SoftMax classifier to obtain predictions;
step 4.4: building the two-stream graph convolutional neural network;
calculating joint data and bone data, inputting them into the J stream and the B stream respectively, and adding the SoftMax scores of the two streams to obtain the fusion score and predict the action label.
Optionally, in step S5, the process of calculating the corresponding action type includes:
S51, expanding the two-stream graph convolutional neural network by connecting 2 new sub-networks in parallel with the 2 existing sub-networks of the two-stream graph convolutional neural network to construct the action recognition model;
S52, importing the bone data, joint data, angle change between bones and energy generated by the action into the four sub-networks of the action recognition model respectively to obtain the corresponding prediction scores; the action recognition model further comprises an accumulator and a SoftMax classifier, and after the 4 prediction scores are added by the accumulator, the accumulated result is fed into the SoftMax classifier to obtain the final classification result; the final classification result S is calculated as:
$$S=S_1W_1+S_2W_2+S_3W_3+S_4W_4$$

where $S_1$, $S_2$, $S_3$, $S_4$ are the prediction score results of the 4 sub-networks, and $W_1$, $W_2$, $W_3$, $W_4$ are the weights of the 4 sub-networks, which are hyper-parameters.
In a second aspect, an embodiment of the present invention proposes an action recognition device based on an adaptive graph convolutional neural network, where the action recognition device includes:
the human skeleton data set generation module is used for acquiring video stream data of a human action type to be identified, processing the imported video stream data with an existing pose estimation algorithm to obtain human-skeleton-type data and human skeleton patterns, generating the coordinates and confidence features of each key node, and generating a human skeleton data set;
the spatial feature extraction module is used for calculating the change of angular momentum when bones rotate around key nodes during human motion, where the change of the angle between adjacent bone edges is used as a deep spatial feature;
the temporal feature extraction module is used for extracting energy information within the duration of a human action, accumulating the angle differences generated by the rotation of bones around key nodes to obtain the sum of the angle changes over the duration of the action, dividing the sum of the angle differences corresponding to each key node by the number of key frames of the current action, and calculating the average energy change value of each key node, which is used as a deep temporal feature;
the two-stream graph convolutional neural network construction module is used for constructing the two-stream graph convolutional neural network, where joint data and bone data are used respectively as the input data of the J stream and the B stream, and a predicted action label is used as the output data;
the action recognition model construction module is used for expanding the two-stream graph convolutional neural network by connecting 2 new sub-networks in parallel to construct the action recognition model, where the 2 new sub-networks process the spatial feature and the temporal feature respectively;
the action recognition model is used for simultaneously processing the joint data, bone data, deep spatial features and deep temporal features and calculating the corresponding action type.
Optionally, the two-stream graph convolutional neural network includes 2 sub-networks; the joint data and the bone data are used respectively as the input data of the 2 sub-networks, and the corresponding prediction scores are obtained after sub-network processing.
Optionally, the sub-networks and the new sub-networks each include 9 adaptive graph convolution modules, whose output channel numbers are 64, 64, 64, 128, 128, 128, 256, 256 and 256 respectively; a data BN layer is added at the beginning to normalize the input data, a global average pooling layer pools the feature maps of different samples to the same size, and the final output is sent to a SoftMax classifier to obtain predictions;
the adaptive graph convolution module comprises, connected in sequence, a spatial graph convolution layer convs, a temporal graph convolution layer convt, an additional random dropout process (Dropout) and a residual connection; the Dropout rate is set to 0.5; the spatial graph convolution layer convs and the temporal graph convolution layer convt are each followed by a batch normalization layer and an activation function layer.
In a third aspect, this embodiment provides an electronic apparatus, including:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the action recognition method based on the adaptive graph convolutional neural network as described above.
In a fourth aspect, this embodiment provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, it implements the action recognition method based on the adaptive graph convolutional neural network as described above.
The beneficial effects of the invention are as follows:
on the one hand, inspired by angular momentum in robot dynamics, the invention calculates the change of angular momentum when bones rotate around key points during human motion, thereby introducing the change of the angle between adjacent bone edges as a deep spatial feature; on the other hand, it extracts the energy information within the duration of a human action, accumulates the obtained angle differences into the sum of the angle changes over the duration of the action, and finally divides the angle sum on each node by the number of key frames of the current action, taking the result as a deep temporal feature. By adding the spatial feature of angle change and the temporal feature of average energy change, the action recognition model can greatly improve the final classification accuracy; this spatio-temporal combination makes full use of the advantages of skeleton data sets in the field of action recognition and makes the existing two-stream adaptive graph convolutional network better suited to the task of action recognition.
Drawings
Fig. 1 is a flowchart of an action recognition method based on an adaptive graph convolutional neural network according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the node labels of the NTU-RGB+D dataset according to an embodiment of the invention.
Fig. 3 is a schematic diagram of an adaptive graph convolution module according to an embodiment of the invention.
Fig. 4 is a schematic diagram of an adaptive graph convolutional network according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a two-stream adaptive graph convolutional network according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of an action recognition model according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of a recognition flow of an action recognition model according to an embodiment of the present invention.
Detailed Description
The invention will now be described in further detail with reference to the accompanying drawings.
It should be noted that terms such as "upper", "lower", "left", "right", "front" and "rear" are used for descriptive purposes only and are not intended to limit the scope in which the invention may be practiced; the relative relationships they describe may be altered or adjusted without materially changing the technical content.
Example 1
FIG. 1 is a flow chart of an action recognition method based on an adaptive graph convolutional neural network according to an embodiment of the present invention. This embodiment may be used by a device such as a server to identify human actions in a video stream. The method may be performed by an action recognition device based on an adaptive graph convolutional neural network, which may be implemented in software and/or hardware and integrated in an electronic device, for example an integrated server device.
Referring to Fig. 1, the action recognition method includes:
S1, acquiring video stream data of a human action type to be identified, processing the imported video stream data with an existing pose estimation algorithm to obtain human-skeleton-type data and human skeleton patterns, generating the coordinates and confidence features of each key node, and generating a human skeleton data set.
Specifically, the existing pose estimation algorithm processes the video stream data into human-skeleton-type data to obtain human skeleton patterns, and at the same time obtains features such as the coordinates and confidence of each key point; these data are stored in a text file for the later steps.
For convenience of description, the algorithm is tested on 10 example videos whose character actions cover the actions in the NTU-RGB+D dataset, finally yielding the node labels of the NTU-RGB+D dataset shown in Fig. 2; in Fig. 2 there are 25 key nodes in total, numbered 1 to 25.
S2, calculating the change of angular momentum when bones rotate around key nodes during human motion, and taking the change of the angle between adjacent bone edges as a deep spatial feature.
The purpose of step S2 is to calculate the change Δθ of the angles between all key points and bones. The innovation of step 2 lies in making full use of the positional relations between bones during human motion: the skeleton graph extracted from human actions resembles the joints defined in robotics, so the angular-momentum variable from robot dynamics is introduced, and the information in the data is fully exploited by calculating the angular momentum change generated by human motion. Since the mass of the human body cannot be estimated, only the angles are retained; the angles represent the relations between key points and joints during human motion and expand the spatial information available to the algorithm. The method specifically comprises the following steps:
Step 2.1: calculate the included angles between bones. According to the coordinates and physical connections of each node in the human skeleton data set, calculate the angle between two adjacent bones: when the degree of a node is 1, i.e. the node has only one edge, no angle needs to be calculated; when the degree of a node is 2, i.e. one node connects two edges, only the angle smaller than 180° needs to be calculated; when the degree of a node is 3, i.e. one node connects 3 edges, 3 angles need to be calculated; likewise, a node of degree 4 requires 4 angles. As shown in Fig. 2, taking node 21 as an example, the 4 angles formed between its 4 bones are calculated, where "tl" denotes the top-left angle, "tr" the top-right angle, "ll" the lower-left angle, "lr" the lower-right angle, and the subscript "21" indicates that the angles are centered on the 21st node; nodes of degree 2 and 3 are handled in the same way. Formulas (1) to (4) give the 4 angles around node 21 as the center point; in these formulas, "x" and "y" denote the coordinates of a node, and their subscripts denote the node's label. All other nodes of degree 2 or more follow the same formulas, with the node coordinates substituted, to calculate the angles between all adjacent bones.
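As a minimal sketch of this angle computation, assuming the standard arc-cosine form over the bone vectors rather than the patent's exact formulas (1) to (4), with purely illustrative coordinates and node pairing:

```python
import numpy as np

def bone_angle(center, end_a, end_b):
    """Angle in degrees at `center` between bones center->end_a and center->end_b."""
    v1 = np.asarray(end_a, dtype=float) - np.asarray(center, dtype=float)
    v2 = np.asarray(end_b, dtype=float) - np.asarray(center, dtype=float)
    cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# One of the four angles around node 21, e.g. the top-left angle theta_tl_21;
# the neighbouring endpoints and all coordinates below are made up for the demo.
theta_tl_21 = bone_angle(center=(0.0, 1.0), end_a=(-0.2, 1.1), end_b=(0.0, 1.2))
```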
Step 2.2: combine the 26 angles obtained in step 2.1 into a matrix in node order. All angles in the first frame are arranged in node order; for nodes of degree greater than 2, the angles are arranged from left to right and from top to bottom. Taking the first frame as an example, this gives the angle vector $\left[\theta_1^1,\theta_2^1,\ldots,\theta_{26}^1\right]$, where the superscript 1 denotes the first frame and the subscripts 1, 2, …, 26 index the 26 angles. All angles of the n frames over the duration of the action are combined in this way and laid out row by row into the angle matrix:

$$\Theta=\begin{bmatrix}\theta_1^1&\theta_1^2&\cdots&\theta_1^n\\\theta_2^1&\theta_2^2&\cdots&\theta_2^n\\\vdots&\vdots&\ddots&\vdots\\\theta_{26}^1&\theta_{26}^2&\cdots&\theta_{26}^n\end{bmatrix}$$
Step 2.3: calculate the change Δθ of the angles formed by the bones around the same node as center point between adjacent frames. Subtracting all angles of the previous frame from the corresponding angles of the next frame gives the angle differences formed by the edges around the same node between adjacent frames; applied to the angle matrix of step 2.2, this yields the angle difference matrix

$$\Delta\theta=\begin{bmatrix}\Delta\theta_1^1&\cdots&\Delta\theta_1^{n-1}\\\vdots&\ddots&\vdots\\\Delta\theta_{26}^1&\cdots&\Delta\theta_{26}^{n-1}\end{bmatrix},\qquad \Delta\theta_i^j=\theta_i^{j+1}-\theta_i^j$$

where $\Delta\theta_{26}^{n-1}$ is the value of the 26th angle in the (n−1)-th key frame.
S3, extracting energy information within the duration of the human action, accumulating the angle differences generated by the rotation of bones around key nodes to obtain the sum of the angle changes over the duration of the action, dividing the sum of the angle differences corresponding to each key node by the number of key frames of the current action, and calculating the average energy change value of each key node, which is used as a deep temporal feature.
Step S3 calculates, based on the idea of integration in mathematics, the total energy $\theta_I$ generated during the duration of the action, and divides this result by the number of key frames extracted by the pose estimation algorithm to obtain the average energy change. The innovation of step S3 is that the total energy generated by an action is represented by the sum of the angle changes accumulated over the whole set of actions, and $\theta_I$ is then divided by the number of key frames to obtain the average energy change. This operation further exploits the temporal characteristics of the skeleton dataset and is an effective way to distinguish otherwise similar actions such as "falling" and "lying down". The method specifically comprises the following steps:
step 3.1: the angle matrix delta theta calculated in the step 2.3 is accumulated and summed according to time sequence to obtainSum of angle changes theta to each node I ,θ I The expression form of (a) is as follows:
wherein the subscripts "1-25" denote the labels of the nodes,"1-n-1" in the superscript represents a key frame, "θ" 2 ~θ 25 "also add by the above-described manner, finally form a 1×25 energy matrix θ I
Step 3.2: divide the $\theta_I$ obtained in step 3.1 by the number of key frames of the current action to obtain the average energy of the action, $\theta_a=\theta_I/n$, where n is the number of key frames extracted by the pose estimation algorithm and $\theta_1$ through $\theta_{25}$ are the per-node sums calculated in step 3.1.
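Steps 3.1 and 3.2 then reduce to a row-wise sum over Δθ followed by a division by the key-frame count; a standalone sketch with stand-in values:

```python
import numpy as np

delta_theta = np.random.rand(25, 39)  # stand-in for the per-node angle differences
n = 40                                # key frames returned by the pose estimator

theta_I = delta_theta.sum(axis=1)     # step 3.1: 1 x 25 energy matrix
theta_a = theta_I / n                 # step 3.2: average energy per node
```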
S4, constructing a two-stream graph convolutional neural network, where joint data and bone data are used respectively as the input data of the J stream and the B stream, and a predicted action label is used as the output data. The method specifically comprises the following steps:
Step 4.1: build an adaptive graph convolution layer (AGC). The topology of the network is optimized together with the other network parameters in an end-to-end learning manner, and the skeleton graph is unique to different layers and samples, which greatly improves the flexibility of the model. More specifically, the topology of the graph is in fact determined by the adjacency matrix and the mask, i.e. $A_k$ and $M_k$: $A_k$ determines whether there is a connection between two vertices, and $M_k$ determines the strength of the connection. This yields the expression of formula (5):

$$f_{out}=\sum_{k}^{K_v}W_k f_{in}\left(A_k+B_k+C_k\right)\qquad(5)$$

In the above formula, $K_v$ denotes the kernel size of the spatial dimension and is set to 3, k ∈ [0,3], and $W_k$ is a weight matrix. $A_k=\Lambda_k^{-\frac{1}{2}}\bar{A}_k\Lambda_k^{-\frac{1}{2}}$ plays the role of an N×N adjacency matrix, where $\Lambda_k$ is a normalized diagonal matrix and α is set to 0.001 to prevent empty rows; $\bar{A}_k$ is the N×N adjacency matrix representing the physical structure of the human body. $B_k$ also denotes an N×N adjacency matrix, but unlike $A_k$, its elements are trained and optimized during the training process; through $B_k$ the model can learn a graph entirely oriented to the recognition task and personalized to the different information contained in different layers. The elements of this matrix may take any value; they indicate not only that a connection exists between two joints but also the strength of that connection, working in the same way as the attention mechanism performed by $M_k$. $C_k$ is a data-dependent graph that learns a unique graph for each sample; to determine whether there is a connection between two vertices and the strength of that connection, the normalized embedded Gaussian function of formula (6) is used to calculate the similarity between the two vertices:

$$f(v_i,v_j)=\frac{e^{\theta(v_i)^{T}\phi(v_j)}}{\sum_{j=1}^{N}e^{\theta(v_i)^{T}\phi(v_j)}}\qquad(6)$$
where N denotes the total number of key points, and $v_i$ and $v_j$ carry the feature information on the nodes. More specifically, given a feature matrix input, two embedding functions θ(·) and φ(·) change the dimension of the input from $C_{in}\times T\times N$ to $C_e\times T\times N$; the two embedded feature matrices are rearranged and reshaped into an $N\times C_eT$ matrix and a $C_eT\times N$ matrix, which are multiplied to form a similarity matrix, and formula (7) is then used to calculate $C_k$:

$$C_k=\operatorname{softmax}\left(f_{in}^{T}W_{\theta k}^{T}W_{\phi k}f_{in}\right)\qquad(7)$$

In the above formula, $W_{\theta}$ and $W_{\phi}$ are the parameters of the embedding functions θ(·) and φ(·).
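A minimal PyTorch sketch of the $C_k$ branch, assuming 1×1 convolutions as the embedding functions θ(·) and φ(·) (the choice used in the published 2s-AGCN design); tensor shapes follow the C×T×N layout described above:

```python
import torch
import torch.nn as nn

class DataDependentGraph(nn.Module):
    """Normalized embedded Gaussian similarity between the N skeleton nodes."""
    def __init__(self, c_in: int, c_e: int):
        super().__init__()
        self.theta = nn.Conv2d(c_in, c_e, kernel_size=1)  # embeds C_in x T x N -> C_e x T x N
        self.phi = nn.Conv2d(c_in, c_e, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, t, n = x.shape                                     # x: (B, C_in, T, N)
        a = self.theta(x).permute(0, 3, 1, 2).reshape(b, n, -1)  # N x (C_e*T)
        bm = self.phi(x).reshape(b, -1, n)                       # (C_e*T) x N
        return torch.softmax(a @ bm, dim=-1)                     # (B, N, N) graph C_k

ck = DataDependentGraph(c_in=3, c_e=16)(torch.randn(2, 3, 40, 25))  # shape (2, 25, 25)
```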
Step 4.2: build the adaptive graph convolution module. The convolution in the time dimension is the same as in the classical spatio-temporal graph convolutional network (ST-GCN), and both the spatial graph convolution layer and the temporal graph convolution layer are followed by a batch normalization (Batch Normalization, BN) layer and an activation function (ReLU) layer. As shown in Fig. 3, a basic block is the combination of a spatial graph convolution layer (convs), a temporal graph convolution layer (convt) and an additional random dropout process (Dropout), with the Dropout rate set to 0.5; a residual connection is added to each block for stable training.
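A simplified PyTorch sketch of one basic block; here the spatial graph convolution is reduced to a 1×1 feature transform followed by aggregation over a single supplied adjacency matrix (a simplification of the full adaptive layer of step 4.1), and the temporal kernel size of 9 is an assumption carried over from the ST-GCN convention:

```python
import torch
import torch.nn as nn

class AdaptiveGraphConvBlock(nn.Module):
    """convs -> BN -> ReLU -> convt -> BN -> ReLU -> Dropout(0.5), plus a residual."""
    def __init__(self, c_in: int, c_out: int, stride: int = 1):
        super().__init__()
        self.convs = nn.Conv2d(c_in, c_out, kernel_size=1)        # spatial transform
        self.bns = nn.BatchNorm2d(c_out)
        self.convt = nn.Conv2d(c_out, c_out, kernel_size=(9, 1),  # temporal conv over T
                               padding=(4, 0), stride=(stride, 1))
        self.bnt = nn.BatchNorm2d(c_out)
        self.drop = nn.Dropout(0.5)
        self.residual = (nn.Identity() if c_in == c_out and stride == 1 else
                         nn.Conv2d(c_in, c_out, kernel_size=1, stride=(stride, 1)))

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, N); adj: (N, N) stand-in for the learned A_k + B_k + C_k
        res = self.residual(x)
        y = torch.einsum('bctn,nm->bctm', self.convs(x), adj)  # graph aggregation
        y = torch.relu(self.bns(y))
        y = self.drop(torch.relu(self.bnt(self.convt(y))))
        return torch.relu(y + res)
```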
Step 4.3: build the adaptive graph convolutional network. The adaptive graph convolutional network (AGCN) is a stack of the modules of step 4.2; as shown in Fig. 4, there are 9 modules in total, with output channel numbers 64, 64, 64, 128, 128, 128, 256, 256 and 256. A data BN layer is added at the beginning to normalize the input data, and a global average pooling layer pools the feature maps of different samples to the same size. The final output is sent to a SoftMax classifier to obtain predictions.
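A sketch of one full stream, stacking nine such blocks with the channel plan above and reusing the AdaptiveGraphConvBlock from the previous sketch; the placement of the temporal strides is an assumption:

```python
import torch
import torch.nn as nn

class AGCNStream(nn.Module):
    """One sub-network: data BN, 9 blocks, global average pooling, classifier."""
    def __init__(self, c_in: int = 3, num_nodes: int = 25, num_classes: int = 60):
        super().__init__()
        self.data_bn = nn.BatchNorm1d(c_in * num_nodes)
        plan = [64, 64, 64, 128, 128, 128, 256, 256, 256]
        blocks, prev = [], c_in
        for c_out in plan:
            stride = 2 if (c_out != prev and prev != c_in) else 1  # assumed placement
            blocks.append(AdaptiveGraphConvBlock(prev, c_out, stride))
            prev = c_out
        self.blocks = nn.ModuleList(blocks)
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        b, c, t, n = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, n * c, t)  # fold nodes into channels
        x = self.data_bn(x).reshape(b, n, c, t).permute(0, 2, 3, 1)
        for blk in self.blocks:
            x = blk(x, adj)
        return self.fc(x.mean(dim=[2, 3]))              # GAP over T and N, then logits
```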
Step 4.4: build the two-stream network. Referring to Fig. 5, the joint data and bone data are first calculated, then input into the J stream and the B stream respectively, and finally the SoftMax scores of the two streams are added to obtain the fusion score and predict the action label.
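A sketch of the data preparation and score fusion for the two streams; the bone pair list is truncated and illustrative, not the full NTU-RGB+D skeleton:

```python
import numpy as np

BONE_PAIRS = [(1, 20), (2, 20), (3, 2)]  # (child, parent), 0-based; illustrative subset

def joints_to_bones(joints: np.ndarray, pairs=BONE_PAIRS) -> np.ndarray:
    """joints: (C, T, N). Each bone is the vector from the parent joint to the child."""
    bones = np.zeros_like(joints)
    for child, parent in pairs:
        bones[:, :, child] = joints[:, :, child] - joints[:, :, parent]
    return bones

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

# Fusion: add the SoftMax scores of the J stream and the B stream, then arg-max.
fused_label = int(np.argmax(softmax(np.random.randn(60)) + softmax(np.random.randn(60))))
```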
S5, expanding the two-stream graph convolutional neural network by connecting 2 new sub-networks in parallel to build the action recognition model, where the new sub-networks process the spatial feature and the temporal feature respectively; the action recognition model processes the joint data, bone data, deep spatial features and deep temporal features simultaneously and computes the corresponding action type. The structure of the network is revised so that the feature input is expanded while the feature extraction method of the original model is retained. The model consists of 4 sub-networks: 2 sub-networks are kept unchanged from the original two-stream adaptive graph convolutional neural network, and the other 2 sub-networks extract the spatial and temporal features respectively. The specific steps comprise:
step 5.1: and (5) building a space-time characteristic expansion graph convolution neural network model. Based on the dual-flow adaptive graph rolling network described in step 4, the motion recognition model of the present embodiment is shown in fig. 6. The motion recognition model consists of 4 sub-networks, wherein 2 sub-networks keep the existing double-flow self-adaptive graph rolling network unchanged, and the remaining 2 sub-networks are used for extracting the characteristics in space and time. And finally, each sub-network passes through a softMax classifier to obtain a predicted score, and then 4 scores are added to obtain a final classification result. The final classification score is S, the expression of S is shown in formula (8):
$$S=S_1W_1+S_2W_2+S_3W_3+S_4W_4\qquad(8)$$

In the above formula, $S_1$, $S_2$, $S_3$, $S_4$ denote the prediction score results of the 4 sub-networks, and $W_1$, $W_2$, $W_3$, $W_4$ denote their weights, which are hyper-parameters whose magnitudes can be adjusted according to the results.
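A sketch of the four-stream fusion of formula (8); the weight values shown are placeholders, since the patent leaves $W_1$ through $W_4$ as tunable hyper-parameters:

```python
import numpy as np

def fuse_four_streams(s1, s2, s3, s4, w=(1.0, 1.0, 0.5, 0.5)):
    """S = S1*W1 + S2*W2 + S3*W3 + S4*W4; returns the predicted class index."""
    s = w[0] * s1 + w[1] * s2 + w[2] * s3 + w[3] * s4
    return int(np.argmax(s))

scores = [np.random.rand(60) for _ in range(4)]  # stand-in sub-network score vectors
label = fuse_four_streams(*scores)
```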
Step 5.2: train the model of this patent. First, preprocess the data: recombine the data structures in the public NTU-RGB+D dataset, and compute the angle difference matrix Δθ and the average energy change matrix $\theta_a$ according to the formulas in step 2 and step 3. Then input the two spatio-temporal feature matrices Δθ and $\theta_a$ into the adaptive graph convolutional neural network model. The model optimization strategy adopts stochastic gradient descent (SGD) with a Nesterov momentum of 0.9; the batch size (Batch_Size) is set to 64, the weight decay to 0.0001, and the number of training epochs to 64. The other two sub-networks compute the data of the original 2s-AGCN. Finally, the classification scores computed by the 4 networks are added to obtain the final total score and the classification result. The training flowchart is shown in Fig. 7.
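A sketch of the stated optimizer configuration in PyTorch; the initial learning rate is an assumption, since the patent does not specify one:

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 60)  # stand-in; replace with the four-stream recognition model
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,               # assumed initial learning rate
                            momentum=0.9,         # Nesterov momentum of 0.9
                            nesterov=True,
                            weight_decay=0.0001)  # weight decay 0.0001
batch_size, num_epochs = 64, 64                   # Batch_Size = 64, 64 training epochs
```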
Example 2
This embodiment provides an action recognition device based on an adaptive graph convolutional neural network, comprising a human skeleton data set generation module, a spatial feature extraction module, a temporal feature extraction module, a two-stream graph convolutional neural network construction module, an action recognition model construction module and an action recognition model.
The human skeleton data set generation module is used for acquiring video stream data of a human action type to be identified, processing the imported video stream data with an existing pose estimation algorithm to obtain human-skeleton-type data and human skeleton patterns, generating the coordinates and confidence features of each key node, and generating a human skeleton data set.
The spatial feature extraction module is used for calculating the change of angular momentum when bones rotate around key nodes during human motion, where the change of the angle between adjacent bone edges is used as a deep spatial feature.
The temporal feature extraction module is used for extracting energy information within the duration of a human action, accumulating the angle differences generated by the rotation of bones around key nodes to obtain the sum of the angle changes over the duration of the action, dividing the sum of the angle differences corresponding to each key node by the number of key frames of the current action, and calculating the average energy change value of each key node, which is used as a deep temporal feature.
The two-stream graph convolutional neural network construction module is used for constructing the two-stream graph convolutional neural network, where joint data and bone data are used respectively as the input data of the J stream and the B stream, and a predicted action label is used as the output data.
The action recognition model construction module is used for expanding the two-stream graph convolutional neural network by connecting 2 new sub-networks in parallel to construct the action recognition model, where the 2 new sub-networks process the spatial feature and the temporal feature respectively.
The action recognition model is used for simultaneously processing the joint data, bone data, deep spatial features and deep temporal features and calculating the corresponding action type.
Optionally, the two-stream graph convolutional neural network includes 2 sub-networks; the joint data and the bone data are used respectively as the input data of the 2 sub-networks, and the corresponding prediction scores are obtained after sub-network processing.
In some examples, the sub-networks and the new sub-networks each include 9 adaptive graph convolution modules, whose output channel numbers are 64, 64, 64, 128, 128, 128, 256, 256 and 256 respectively; a data BN layer is added at the beginning to normalize the input data, a global average pooling layer pools the feature maps of different samples to the same size, and the final output is sent to a SoftMax classifier to obtain predictions. The adaptive graph convolution module comprises, connected in sequence, a spatial graph convolution layer convs, a temporal graph convolution layer convt, an additional random dropout process (Dropout) and a residual connection; the Dropout rate is set to 0.5; the spatial graph convolution layer convs and the temporal graph convolution layer convt are each followed by a batch normalization layer and an activation function layer.
The action recognition device of the second embodiment of the invention recognizes human actions in a video stream through the modules described above. The action recognition device provided by this embodiment can execute the action recognition method based on the adaptive graph convolutional neural network provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to the executed method.
Example 3
The embodiment of the application provides electronic equipment, which comprises a processor, a memory, an input device and an output device; in an electronic device, the number of processors may be one or more; the processor, memory, input devices, and output devices in the electronic device may be connected by a bus or other means.
The memory, as a computer-readable storage medium, is used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the action recognition method in the embodiments of the present invention. The processor executes the various functional applications and data processing of the electronic device by running the software programs, instructions and modules stored in the memory, i.e. implements the action recognition method based on the adaptive graph convolutional neural network provided by the embodiments of the present invention.
The memory may mainly include a program storage area and a data storage area: the program storage area may store an operating system and at least one application program required for a function, and the data storage area may store data created according to the use of the terminal, etc. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory may further include memory remotely located with respect to the processor, and the remote memory may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input means may be used to receive entered numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device, which may include a keyboard, mouse, etc. The output means may comprise a display device such as a display screen.
Example 4
Embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of motion recognition based on an adaptive graph convolutional neural network as described above.
Of course, the storage medium containing computer-executable instructions provided in the embodiments of the present invention is not limited to the method operations described above, and may also perform related operations in the action recognition method based on the adaptive graph convolutional neural network provided by any embodiment of the present invention.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the invention without departing from the principles thereof are intended to be within the scope of the invention as set forth in the following claims.

Claims (9)

1. An action recognition method based on an adaptive graph convolutional neural network, characterized in that the action recognition method comprises:
S1, acquiring video stream data of a human action type to be identified, processing the imported video stream data with an existing pose estimation algorithm to obtain human-skeleton-type data and human skeleton patterns, generating the coordinates and confidence features of each key node, and generating a human skeleton data set;
S2, calculating the change of angular momentum when bones rotate around key nodes during human motion, where the change of the angle between adjacent bone edges is used as a deep spatial feature;
S3, extracting energy information within the duration of a human action, accumulating the angle differences generated by the rotation of bones around key nodes to obtain the sum of the angle changes over the duration of the action, dividing the sum of the angle differences corresponding to each key node by the number of key frames of the current action, and calculating the average energy change value of each key node, which is used as a deep temporal feature;
S4, constructing a two-stream graph convolutional neural network, where joint data and bone data are used respectively as the input data of the J stream and the B stream, and a predicted action label is used as the output data;
S5, expanding the two-stream graph convolutional neural network by connecting 2 new sub-networks in parallel to build an action recognition model, where the new sub-networks process the spatial feature and the temporal feature respectively; the action recognition model processes the joint data, bone data, deep spatial features and deep temporal features simultaneously and computes the corresponding action type;
In step S2, the process of calculating the change of angular momentum generated when bones rotate around key nodes during human motion, with the change of the angle between adjacent bone edges used as the deep spatial feature, comprises the following steps:
S21, calculating the angles between all adjacent bones according to the coordinates and physical connections of each key node in the human skeleton data set: when the degree of a node is 1, i.e. the node has only one edge, no angle is calculated; when the degree of a node is 2, i.e. one node connects two edges, the angle smaller than 180° is calculated; when the degree of a node is 3, i.e. one node connects 3 edges, 3 angles are calculated; when the degree of a node is 4, 4 angles are calculated;
S22, for all angles in the n key frames over the whole action duration, combining the calculated angles into matrix form in the order of key nodes and video frames, the resulting angle matrix being

$$\Theta=\begin{bmatrix}\theta_1^1&\theta_1^2&\cdots&\theta_1^n\\\theta_2^1&\theta_2^2&\cdots&\theta_2^n\\\vdots&\vdots&\ddots&\vdots\\\theta_m^1&\theta_m^2&\cdots&\theta_m^n\end{bmatrix}$$

where m is the total number of angles and $\theta_i^j$ is the value of the i-th angle in the j-th key frame, i = 1, 2, …, m, j = 1, 2, …, n;
S23, subtracting the corresponding angle of the same key point in the previous frame from its angle in the next frame to obtain the angle differences formed by the edges around the same node between adjacent frames; the angle difference matrix Δθ formed by the bones around the same node as center point between adjacent frames is

$$\Delta\theta=\begin{bmatrix}\Delta\theta_1^1&\cdots&\Delta\theta_1^{n-1}\\\vdots&\ddots&\vdots\\\Delta\theta_m^1&\cdots&\Delta\theta_m^{n-1}\end{bmatrix},\qquad \Delta\theta_i^j=\theta_i^{j+1}-\theta_i^j$$

where $\Delta\theta_m^{n-1}$ is the value of the m-th angle in the (n−1)-th key frame.
2. The action recognition method based on an adaptive graph convolutional neural network according to claim 1, wherein in step S3, the process of extracting energy information within the duration of the human action, accumulating the angle differences generated by the rotation of bones around key nodes to obtain the sum of the angle changes over the duration of the action, dividing the sum of the angle differences corresponding to each key node by the number of key frames of the current action, and calculating the average energy change value of each key node as the deep temporal feature comprises the following steps:
S31, accumulating and summing the calculated angle difference matrix Δθ in time order to obtain the sum of the angle changes on each node,

$$\theta_i=\sum_{j=1}^{n-1}\Delta\theta_i^j,\qquad \theta_I=\left[\theta_1,\theta_2,\ldots,\theta_{m-1}\right]$$

where the subscripts 1 ~ m−1 denote the labels of the key nodes and the superscripts 1 ~ n−1 denote the key frames, forming a 1×(m−1) energy matrix $\theta_I$;
S32, dividing the $\theta_I$ obtained in step S31 by the number of key frames of the current action to obtain the average energy of the current action, $\theta_a=\theta_I/n$, where n is the number of key frames extracted by the pose estimation algorithm.
3. The action recognition method based on an adaptive graph convolutional neural network according to claim 1, wherein in step S4, the process of constructing the two-stream graph convolutional neural network comprises the following steps:
step 4.1: building an adaptive graph convolution layer; the adaptive graph convolution layer optimizes the topology of the network together with the other network parameters in an end-to-end learning manner, so that the skeleton graph is unique to different layers and samples; the topology of the graph is determined by an adjacency matrix $A_k$ and a mask $M_k$, where $A_k$ determines whether there is a connection between two vertices and $M_k$ determines the strength of the connection, resulting in the following expression:

$$f_{out}=\sum_{k}^{K_v}W_k f_{in}\left(A_k+B_k+C_k\right)$$

where $K_v$ denotes the kernel size of the spatial dimension and is set to 3; $W_k$ is a weight matrix, k ∈ [0,3]; $A_k=\Lambda_k^{-\frac{1}{2}}\bar{A}_k\Lambda_k^{-\frac{1}{2}}$, where $\Lambda_k$ is the normalized diagonal matrix and $\bar{A}_k$ is an N×N adjacency matrix representing the physical structure of the human body; $B_k$ is an N×N adjacency matrix whose elements are trained and optimized along with the adaptive graph convolution layer; $B_k$ is not limited, and the elements of the matrix are arbitrary values that indicate the presence and strength of a connection between two joints; $C_k$ is a data-dependent graph used to learn a unique graph for each sample;
to determine whether there is a connection between two vertices and the strength of that connection, a normalized embedded Gaussian function is used to calculate the similarity between the two vertices:

$$f(v_i,v_j)=\frac{e^{\theta(v_i)^{T}\phi(v_j)}}{\sum_{j=1}^{N}e^{\theta(v_i)^{T}\phi(v_j)}}$$

where N denotes the total number of key points, and $v_i$ and $v_j$ carry the feature information on the nodes;
given a feature matrix input, two embedding functions θ(·) and φ(·) change the dimension of the input from $C_{in}\times T\times N$ to $C_e\times T\times N$; the two embedded feature matrices are rearranged and reshaped into an $N\times C_eT$ matrix and a $C_eT\times N$ matrix, which are multiplied to form a similarity matrix, and $C_k$ is calculated by:

$$C_k=\operatorname{softmax}\left(f_{in}^{T}W_{\theta k}^{T}W_{\phi k}f_{in}\right)$$

where $W_{\theta}$ and $W_{\phi}$ are the parameters of the embedding functions θ(·) and φ(·);
step 4.2: building an adaptive graph convolution module; the adaptive graph convolution module comprises, connected in sequence, a spatial graph convolution layer convs, a temporal graph convolution layer convt, an additional random dropout process (Dropout) and a residual connection; the Dropout rate is set to 0.5; the spatial graph convolution layer convs and the temporal graph convolution layer convt are each followed by a batch normalization layer and an activation function layer;
step 4.3: building the adaptive graph convolutional network by stacking the adaptive graph convolution modules; the adaptive graph convolutional network comprises 9 adaptive graph convolution modules, whose output channel numbers are 64, 64, 64, 128, 128, 128, 256, 256 and 256 respectively; a data BN layer is added at the beginning to normalize the input data, a global average pooling layer pools the feature maps of different samples to the same size, and the final output is sent to a SoftMax classifier to obtain predictions;
step 4.4: building the two-stream graph convolutional neural network;
calculating joint data and bone data, inputting them into the J stream and the B stream respectively, and adding the SoftMax scores of the two streams to obtain the fusion score and predict the action label.
4. The action recognition method based on an adaptive graph convolutional neural network according to claim 3, wherein in step S5, the process of calculating the corresponding action type includes:
S51, expanding the two-stream graph convolutional neural network by connecting 2 new sub-networks in parallel with the 2 existing sub-networks of the two-stream graph convolutional neural network to construct the action recognition model;
S52, importing the bone data, joint data, angle change between bones and energy generated by the action into the four sub-networks of the action recognition model respectively to obtain the corresponding prediction scores; the action recognition model further comprises an accumulator and a SoftMax classifier, and after the 4 prediction scores are added by the accumulator, the accumulated result is fed into the SoftMax classifier to obtain the final classification result; the final classification result S is calculated as:

$$S=S_1W_1+S_2W_2+S_3W_3+S_4W_4$$

where $S_1$, $S_2$, $S_3$, $S_4$ are the prediction score results of the 4 sub-networks, and $W_1$, $W_2$, $W_3$, $W_4$ are the weights of the 4 sub-networks, which are hyper-parameters.
5. An adaptive graph convolutional neural network-based action recognition device based on the method of any one of claims 1-4, the action recognition device comprising:
the human skeleton data set generation module is used for acquiring video stream data of a human action type to be identified, processing the imported video stream data by adopting an existing gesture estimation algorithm to obtain human skeleton type data and human skeleton patterns, generating coordinates of each key node and confidence coefficient characteristics of the key node, and generating a human skeleton data set;
the spatial feature extraction module is used for calculating the change of angular momentum when bones rotate around key nodes in the human body movement process, and the variable of the angle between adjacent bone edges is used as a deep spatial feature;
the temporal feature extraction module, which is used for extracting energy information over the duration of a human action: the angle differences generated by bones rotating around each key node are accumulated to obtain the total angle change over the duration of the action, and the sum of the angle differences for each key node is divided by the number of key frames of the current action to obtain the average energy change value of that key node, which serves as the deep temporal feature (an illustrative sketch of this computation follows this claim);
the double-flow graph convolutional neural network construction module, which is used for constructing the double-flow graph convolutional neural network, in which the joint data and the bone data serve as the input data of the J stream and the B stream respectively, and the predicted action label serves as the output data;
the action recognition model construction module, which is used for expanding the double-flow graph convolutional neural network by connecting 2 new sub-networks in parallel to construct the action recognition model, the 2 new sub-networks being used to process the spatial features and the temporal features respectively;
the action recognition model, which is used for simultaneously processing the joint data, the bone data, the deep spatial features and the deep temporal features, and calculating the corresponding action type.
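The sketch referenced in the temporal feature extraction module above: an illustrative computation of the angle between adjacent bone edges at a key node and of the average energy change value. The helper names and the choice of two neighbouring joints per key node are assumptions.

```python
# Illustrative sketch of the angle and average-energy features described in
# the spatial and temporal feature extraction modules. Function names and
# the (center, a, b) joint triple are assumptions.
import numpy as np

def edge_angle(p_center: np.ndarray, p_a: np.ndarray, p_b: np.ndarray) -> float:
    """Angle at key node p_center between bone edges p_center->p_a and p_center->p_b."""
    u, v = p_a - p_center, p_b - p_center
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def average_energy(frames: np.ndarray, center: int, a: int, b: int) -> float:
    """frames: (T, N, 3) key-frame joint coordinates. Accumulates the per-frame
    angle differences at one key node and divides by the number of key frames,
    giving the average energy change value used as the deep temporal feature."""
    angles = np.array([edge_angle(f[center], f[a], f[b]) for f in frames])
    return float(np.abs(np.diff(angles)).sum() / len(frames))
```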
6. The action recognition device based on an adaptive graph convolutional neural network according to claim 5, wherein the double-flow graph convolutional neural network comprises 2 sub-networks; the joint data and the bone data serve as the input data of the 2 sub-networks respectively, and the corresponding prediction scores are obtained after sub-network processing.
7. The action recognition device based on an adaptive graph convolutional neural network according to claim 6, wherein the existing sub-networks and the newly added sub-networks each comprise 9 adaptive graph convolution modules, whose output channel numbers are 64, 128, 256 and 256, respectively; a data BN layer is added at the beginning to normalize the input data, a global average pooling layer then pools the feature maps of different samples to the same size, and the result is finally output to a SoftMax classifier to obtain predictions;
the adaptive graph convolution module comprises a spatial graph convolution layer convs, a temporal graph convolution layer convt, an additional random-dropping process (Dropout) and a residual connection, connected in sequence; the Dropout rate is set to 0.5; the spatial graph convolution layer convs and the temporal graph convolution layer convt are each followed by a batch normalization layer and an activation function layer.
8. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the action recognition method based on an adaptive graph convolutional neural network according to any one of claims 1-4.
9. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the action recognition method based on an adaptive graph convolutional neural network according to any one of claims 1-4.
CN202110564099.8A 2021-05-24 2021-05-24 Action recognition method and device based on self-adaptive graph convolution neural network Active CN113378656B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110564099.8A CN113378656B (en) 2021-05-24 2021-05-24 Action recognition method and device based on self-adaptive graph convolution neural network


Publications (2)

Publication Number Publication Date
CN113378656A CN113378656A (en) 2021-09-10
CN113378656B (en) 2023-07-25

Family

ID=77571555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110564099.8A Active CN113378656B (en) 2021-05-24 2021-05-24 Action recognition method and device based on self-adaptive graph convolution neural network

Country Status (1)

Country Link
CN (1) CN113378656B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114618147B (en) * 2022-03-08 2022-11-15 电子科技大学 Taijiquan rehabilitation training action recognition method
CN114821640B (en) * 2022-04-12 2023-07-18 杭州电子科技大学 Skeleton action recognition method based on multi-flow multi-scale expansion space-time diagram convolutional network


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210000404A1 (en) * 2019-07-05 2021-01-07 The Penn State Research Foundation Systems and methods for automated recognition of bodily expression of emotion

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401106A (en) * 2019-01-02 2020-07-10 中国移动通信有限公司研究院 Behavior identification method, device and equipment
CN111476181A (en) * 2020-04-13 2020-07-31 河北工业大学 Human skeleton action recognition method
CN111950485A (en) * 2020-08-18 2020-11-17 中科人工智能创新技术研究院(青岛)有限公司 Human body behavior identification method and system based on human body skeleton
CN112528811A (en) * 2020-12-02 2021-03-19 建信金融科技有限责任公司 Behavior recognition method and device
CN112633209A (en) * 2020-12-29 2021-04-09 东北大学 Human action recognition method based on graph convolution neural network
CN112749671A (en) * 2021-01-19 2021-05-04 澜途集思生态科技集团有限公司 Human behavior recognition method based on video

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Ning Sun et al., "Multi-stream slowFast graph convolutional networks for skeleton-based action recognition", Image and Vision Computing, pp. 1-9 *
Lei Shi et al., "Skeleton-based action recognition with multi-stream adaptive graph convolutional networks", IEEE Transactions on Image Processing, vol. 29, pp. 9532-9545 *
Yi-Fan Song et al., "Stronger, Faster and More Explainable: A Graph Convolutional Baseline for Skeleton-based Action Recognition", Proceedings of the 28th ACM International Conference on Multimedia, pp. 1625-1633 *
Li Long, "Research on human skeleton keypoint action recognition incorporating an attention mechanism", China Master's Theses Full-text Database (Information Science and Technology), no. 2020(02), I138-1137 *


Similar Documents

Publication Publication Date Title
CN111931903B (en) Network alignment method based on double-layer graph attention neural network
CN110147743B (en) Real-time online pedestrian analysis and counting system and method under complex scene
CN106127120B (en) Posture estimation method and device, computer system
CN111291739B (en) Face detection and image detection neural network training method, device and equipment
CN109902798A (en) The training method and device of deep neural network
CN113378656B (en) Action recognition method and device based on self-adaptive graph convolution neural network
Doulamis et al. FAST-MDL: Fast Adaptive Supervised Training of multi-layered deep learning models for consistent object tracking and classification
KR102462934B1 (en) Video analysis system for digital twin technology
CN110991362A (en) Pedestrian detection model based on attention mechanism
CN113128424B (en) Method for identifying action of graph convolution neural network based on attention mechanism
CN111191630B (en) Performance action recognition method suitable for intelligent interactive viewing scene
CN109740416B (en) Target tracking method and related product
CN112699837A (en) Gesture recognition method and device based on deep learning
CN112183435A (en) Two-stage hand target detection method
CN113239884A (en) Method for recognizing human body behaviors in elevator car
CN111738074B (en) Pedestrian attribute identification method, system and device based on weak supervision learning
CN113688765A (en) Attention mechanism-based action recognition method for adaptive graph convolution network
CN114743273A (en) Human skeleton behavior identification method and system based on multi-scale residual error map convolutional network
CN113254927A (en) Model processing method and device based on network defense and storage medium
CN114821804A (en) Attention mechanism-based action recognition method for graph convolution neural network
CN113158791A (en) Human-centered image description labeling method, system, terminal and medium
CN117115911A (en) Hypergraph learning action recognition system based on attention mechanism
CN114724058A (en) Method for extracting key frames of fusion characteristic motion video based on human body posture recognition
CN115759199A (en) Multi-robot environment exploration method and system based on hierarchical graph neural network
CN115187660A (en) Knowledge distillation-based multi-person human body posture estimation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant