CN116343334A - Motion recognition method of three-stream self-adaptive graph convolution model fused with joint capture - Google Patents


Info

Publication number
CN116343334A
CN116343334A
Authority
CN
China
Prior art keywords
model
joint
stream
adaptive graph
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310306777.XA
Other languages
Chinese (zh)
Inventor
冯宇平
周青霞
高帅
安文志
李云文
戴家康
陶康达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao University of Science and Technology
Original Assignee
Qingdao University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao University of Science and Technology filed Critical Qingdao University of Science and Technology
Priority to CN202310306777.XA priority Critical patent/CN116343334A/en
Publication of CN116343334A publication Critical patent/CN116343334A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems


Abstract

The invention relates to the technical field of image recognition, in particular to a motion recognition method of a three-stream adaptive graph convolution model fused with joint capture. The method comprises the following steps: S1, processing the dynamic correlation of intra-frame joint information and bone position information with a spatial attention module, and focusing on the feature correlation of inter-frame bone motion information with a temporal attention module; S2, processing joint information with a Gaussian embedding function, adding a one-dimensional convolution layer after the Gaussian embedding function normalization to aggregate the CNN channel dimension, adding a dynamic scale coefficient after the adjacency matrix to help the model converge effectively, and introducing skeleton motion information to construct a three-stream adaptive graph convolution model; and S3, performing feature extraction on the input video frames with the OpenPose pose estimation algorithm to obtain skeleton data, and performing behavior recognition with the three-stream adaptive graph convolution model. The model provided by the invention achieves advantageous accuracy in human action recognition.

Description

Motion recognition method of three-stream self-adaptive graph convolution model fused with joint capture
Technical Field
The invention relates to the technical field of image recognition, in particular to a motion recognition method of a three-stream self-adaptive graph convolution model for fusion joint capture.
Background
Action recognition technology is widely applied in intelligent monitoring, human-computer interaction, motion analysis, video information retrieval and the like; its main task is to recognize, from an existing video sequence, one or more actions made by human targets, such as shooting, clapping and running. Early action recognition methods mostly used RGB or gray-scale video as input data and used RNN or CNN networks for recognition and classification. With the continuous development of deep learning, methods that perform action recognition with a graph convolution model (GCN) and skeletal feature information have been widely applied to the recognition and classification of human actions.
Yan et al. first applied GCN to action recognition; they proposed building a spatio-temporal graph convolution model (ST-GCN) using distance sampling functions to construct the graph convolution layer as a basic block. Li et al. made full use of the attention concept and proposed an action-structural graph convolution model (AS-GCN) based on the attention mechanism. Shi et al. proposed a two-stream framework that fuses the joint information of the human skeleton with the bone information and adaptively learns the graph topology of each sample through an adaptive graph convolution model, thereby constructing the two-stream adaptive graph convolution model (2s-AGCN). Considering the recognition efficiency of action recognition models, Shi et al. proposed adaptively selecting the optimal model size for each sample, building an adaptive semantics-guided network (AdaSGN) that better balances accuracy and efficiency. To address the problem of low recognition speed and recognition rate, Li et al. proposed a lightweight graph convolution model. However, these algorithms still have problems such as insufficient capture of global context information and insufficient extraction of the model's spatio-temporal features.
Disclosure of Invention
The invention aims to solve the technical problems that: overcomes the defects of the prior art and provides a motion recognition method of a three-stream self-adaptive graph rolling model for fusion joint capture.
The technical scheme of the invention is as follows:
a motion recognition method of a three-stream adaptive graph convolution model fused with joint capture comprises the following steps:
s1, introducing a space-time attention module to capture space-time characteristics of a human skeleton sequence: processing the dynamic correlation of intra-frame joint information and bone position information using a spatial attention module; using a temporal attention module to focus on feature correlation of inter-frame bone motion information;
s2, constructing a fusion joint capture three-stream self-adaptive graph convolution model: processing joint information by using a Gaussian embedding function, adding a one-dimensional convolution layer to aggregate the dimension of a CNN channel after the Gaussian embedding function is normalized, adding a dynamic proportionality coefficient to help the model to effectively converge after an adjacency matrix, and introducing skeleton motion information to construct a three-stream self-adaptive graph convolution model, wherein the method comprises the following steps:
s21, modeling a skeleton sequence in two dimensions of time and space by using graph convolution;
s22, adding a 1-dimensional convolution layer into the improved self-adaptive graph convolution model after Gaussian embedding function normalization operation, and timely capturing global context information between joints by utilizing CNN fusion channel dimension information; in the adjacent matrix A k Then adding a dynamic proportionality coefficient alpha to enable the adjacency matrix to only act on the early stage of training, and increasing the flexibility of the self-adaptive graph convolution model; adding a residual structure after the fusion of the three graphs, so as to ensure the stability of the model;
s23, merging a space-time attention mechanism STAM into the adaptive graph convolution model: downsampling the number of frames of the self-adaptive graph convolution model once so as to reduce the parameter number and improve the network training speed; adding a space-time attention module STAM, introducing a residual structure, and improving the stability of the model;
s24, adding a three-stream self-adaptive graph convolution model of bone motion information: extracting joint point information, bone position information and bone movement information in the skeleton characteristic information, constructing a three-stream self-adaptive graph rolling model, training each branch of the three-stream self-adaptive graph rolling model independently to obtain a corresponding softmax layer score, and obtaining a final prediction result through fusion;
s3, verifying the identification accuracy of the three-stream self-adaptive graph convolution model: and carrying out feature extraction on the input video frame by adopting an Openphase attitude estimation algorithm to obtain skeleton data, and carrying out behavior recognition by using a three-stream self-adaptive graph convolution model.
Preferably, in the step S1, the spatio-temporal attention module covers the two dimensions of time and space; it helps the network focus on the most discriminative spatio-temporal regions in complex video while eliminating the interference of other irrelevant regions, and it can be embedded anywhere in the network without affecting the network's original structure.
Preferably, in the step S1, the spatial attention module is configured to grasp, in the spatial dimension, the degree of influence of different regions on the target region. The input feature f_in ∈ R^(C×T×N) is operated on to obtain the association strength S_i,j between joint i and joint j, and the hidden-state attention weights are then normalized to the range [0,1]; the specific calculation formulas are as follows:

S_i,j = θ(f_in)_i^T · φ(f_in)_j (1)

S'_i,j = exp(S_i,j) / Σ_j exp(S_i,j) (2)

wherein: C×T×N is the dimension of the input feature, C is the number of channels, T is the number of frames, N is the number of joints, θ and φ are embedding functions, and N_p is the feature map at different positions.
Preferably, in the step S1, the temporal attention module is configured to highlight the most salient time-domain segment among the human skeleton information of different time periods;
the input feature f_in ∈ R^(C×T×N) is passed through a convolution layer with a 1×1 kernel to change the feature dimension to 1×N; E_i,j denotes the degree of dependency between time i and time j, and the weights are finally normalized; the specific calculation formulas are shown below:

E_i,j = ψ(f_in)_i^T · ψ(f_in)_j (3)

E'_i,j = exp(E_i,j) / Σ_j exp(E_i,j) (4)

wherein ψ denotes the 1×1 convolution embedding.
Preferably, in the step S21, the skeleton sequence contains N nodes and T video frames, and the undirected graph is constructed as G = (V, E), where V = {v_ti | t = 1, 2, ..., T; i = 1, 2, ..., N} represents the set of nodes and E = {E_s, E_t} is the edge set; E_s represents the natural connections of the human skeleton within the same frame (intra-frame connections), and E_t represents the connections of the same node across adjacent frames (inter-frame connections). The graph convolution operation is shown in formula (5):

f_out = Σ_{k=1}^{K_v} W_k f_in (Ā_k ⊙ M_k) (5)

wherein: ⊙ represents the dot-product operation, W_k is the weight, the value of K_v is set to 3, Ā_k = Λ_k^(−1/2) A_k Λ_k^(−1/2) is the normalized adjacency matrix, and each adjacency matrix A_k is multiplied element-wise by a learnable mask matrix M_k.
Preferably, in the step S22, the output feature f_out of the improved adaptive graph convolution model is calculated as follows:
first, the input feature f_in of the n-th layer, of dimension C×T×N, is mapped by the Gaussian embedding functions θ_k and φ_k to compute the joint correlations; the two embedded maps are rearranged into a C_e T×N matrix and an N×C_e T matrix, as in formula (6):

S_k = θ_k(f_in)^T φ_k(f_in), with θ_k(f_in) = W_θk f_in and φ_k(f_in) = W_φk f_in (6)

secondly, the two matrices are multiplied and normalized to obtain an N×N similarity matrix, which is passed through a CNN with one output channel to aggregate the feature information of the channel dimension, yielding C_k in formula (7):

C_k = Conv1D(softmax(S_k)) (7)

finally, αA_k, B_k and C_k are added and fused into the adjacency matrix of the adaptive graph convolution model, and the output feature f_out of the n-th layer of the adaptive graph convolution model is expressed as formula (8):

f_out = Σ_{k=1}^{K_v} W_k f_in (αA_k + B_k + C_k) (8)

wherein: α is a dynamic scale coefficient whose custom value helps the model converge effectively; W_k, B_k, θ_k and φ_k are learnable parameters;
the dynamic scale coefficient α follows the principle that α decreases as the number of training rounds increases, so that the adjacency matrix A_k representing the physical structure of the human body has a weakened effect in the later stage of the experiment, thereby highlighting the flexibility of the adaptively generated matrices B_k and C_k in extracting human skeleton features; α is calculated as:

α = 1 − 0.02·b_Epoch, α ∈ [0,1] (9)

wherein: b_Epoch is the value of the iteration number Epoch, and the value of α lies between 0 and 1.
Preferably, in the step S23, the improved adaptive graph convolution model includes nine adaptive graph convolution blocks, each formed by an adaptive graph convolution in series with a temporal convolution; the nine adaptive graph convolution blocks correspond to B1-B9, with output channel dimensions of 64, 128 and 256; frame-number downsampling is performed once each in B4 and B7 to reduce the parameter count and increase the network training speed; the spatio-temporal attention module STAM is added between B3 and B4, and a residual structure is introduced between adjacent adaptive graph convolution blocks.
Preferably, in the step S24, the three-stream adaptive graph convolution model with added bone motion information comprises the following specific steps:
firstly, bone position information is used for motion recognition, and a two-stream network is constructed from the joint point information and bone position information of the human skeleton;
secondly, on the basis of extracting joint point information and bone position information as the inputs of the two branches J-Stream and B-Stream respectively, bone motion information is added as the input of B-M-Stream to construct the three-stream adaptive graph convolution model;
thirdly, the joint near the center of gravity of the skeleton is defined as the source joint J_source and the joint far from the center of gravity as the target joint J_target; the bone of frame t is represented as a vector pointing from its source joint to the target joint, B_t = J_t,target − J_t,source;
finally, the center joint is assigned an empty bone with a value of 0, so that the graphs and networks of bones and joints correspond one-to-one; the bone motion information is expressed by the formula M_Bt = B_{t+1} − B_t; the three network branches are trained independently to obtain the corresponding softmax-layer scores, and the final prediction result is obtained through fusion.
Preferably, in the step S3, the data sets for verifying the recognition accuracy of the three-stream adaptive graph convolution model include:
Kinetics data set: a large-scale human behavior data set with more than 300,000 video clips, each about 10 s long, containing 400 action categories in total; training and testing are performed with the skeleton sequence data set Kinetics-Skeleton, with Top-1 and Top-5 as evaluation indexes, where the training set is 240,000 clips and the test set is 20,000 clips;
NTU-RGB+D data set: the most widely used data set in motion recognition, with 56,880 motion clips covering 60 action classes, shot with three cameras at different angles, each skeleton containing 25 joints; according to the different division methods of the training set and the test set, two evaluation indexes are adopted: X-Sub and X-View, where X-Sub divides the training and test sets by person ID and X-View divides them by camera.
Preferably, in the step S3, the Kinetics data set adopts a single-sequence length of 150 frames, and the NTU-RGB+D data set uniformly adopts a single-sequence length of 300 frames; the deep learning framework is PyTorch 1.12.1, the optimization strategy adopts stochastic gradient descent (SGD), the batch size is 32, the number of epochs is set to 50, and the learning rate is set to 0.1 for the first 30 epochs and then reduced to 0.01.
Compared with the prior art, the invention has the following beneficial effects:
(1) Introducing a space-time attention mechanism to capture space-time characteristics of a human skeleton sequence, processing dynamic correlation of intra-frame joint information and skeleton position information by using the space attention mechanism, and focusing on characteristic correlation of inter-frame skeleton motion information by using the time attention mechanism;
(2) An improved adaptive graph convolution model is proposed: the joint information is processed with a Gaussian embedding function, a one-dimensional convolution layer is added after the Gaussian embedding function normalization operation to aggregate the channel dimension, and a dynamic scale coefficient is added after the adjacency matrix A to help the model converge effectively;
(3) The spatio-temporal attention mechanism is integrated into the improved adaptive graph convolution model, and the three features of human skeleton joint points, bone positions and bone motion are taken as input to construct a three-stream adaptive graph convolution model for recognizing human actions, yielding a three-stream adaptive graph convolution model combined with the spatio-temporal attention mechanism. To evaluate the effectiveness of the network, experiments were carried out on the two large data sets NTU-RGB+D and Kinetics; the results show that the model provided by the invention has an accuracy advantage in human action recognition.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of the present invention.
Fig. 2 is a space-time diagram constructed based on a human skeleton.
Fig. 3 is a block diagram of a modified adaptive graph convolutional layer.
FIG. 4 is a block diagram of an improved adaptive graph convolution model.
FIG. 5 is a 3s-STAM-AGCN block diagram.
Fig. 6 (a) is a diagram of the result of visualization of skeleton drinking data.
Fig. 6 (b) is a diagram of the result of the visualization of skeleton saluting data.
Fig. 6 (c) is a view of the result of visualization of skeletal fall data.
Fig. 6 (d) is a diagram of the result of the visualization of the skeleton kick data.
Fig. 7 is a graph showing the loss change at three values of α.
FIG. 8 is a histogram of recognition accuracy versus accuracy.
Detailed Description
In order to make the technical solution of the present invention better understood by those skilled in the art, the technical solution of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
Example 1
As shown in fig. 1, the present embodiment provides a motion recognition method of a three-stream adaptive graph convolution model fused with joint capture, which aims to capture the spatio-temporal characteristics of human joints through an embeddable spatio-temporal attention mechanism, while using bone position information and bone motion information to improve the network's ability to extract human skeleton feature information.
S1, a space-time attention module: the video sequence contains information of two dimensions of time and space, and the attention mechanism can help the network to pay attention to the space-time area with the most discriminant in the complex video, and meanwhile, the interference of other irrelevant areas is eliminated. The invention introduces a lightweight spatio-temporal attention module that can be embedded anywhere in the network without affecting the original structure of the network.
S11, spatial attention module: in the spatial dimension, the node conditions at different positions influence one another; in order to capture the dynamic correlation among features, a spatial attention mechanism is introduced into the model to grasp the degree of influence of different regions on the target region. For the input feature f_in ∈ R^(C×T×N), the corresponding operation is carried out to obtain the association strength S_i,j between joint i and joint j, and the hidden-state attention weights are then normalized to the range [0,1]. The specific calculation formulas are as follows:

S_i,j = θ(f_in)_i^T · φ(f_in)_j (1)

S'_i,j = exp(S_i,j) / Σ_j exp(S_i,j) (2)

wherein: C×T×N is the dimension of the input feature, C is the number of channels, T is the number of frames, and N is the number of joints; θ and φ are embedding functions, and N_p represents the feature maps at different positions.
S12, temporal attention module: there is a certain correlation between the human skeleton information of different time periods, and the temporal attention module can highlight the most salient time-domain segment. The input feature f_in ∈ R^(C×T×N) is passed through a convolution layer with a 1×1 kernel to change the feature dimension to 1×N; E_i,j denotes the degree of dependency between time i and time j, and the weights are finally normalized. The specific calculation formulas are shown below:

E_i,j = ψ(f_in)_i^T · ψ(f_in)_j (3)

E'_i,j = exp(E_i,j) / Σ_j exp(E_i,j) (4)

wherein ψ denotes the 1×1 convolution embedding.
S2, three-stream self-adaptive graph convolution model
S21, graph convolution: the skeleton sequence consists of the coordinate data of all human joint points in the video frames. Because the human skeleton is a topological structure, the skeleton sequence can be modeled in the two dimensions of time and space using graph convolution; a model constructed on the human skeleton is shown in fig. 2. Specifically, a skeleton sequence contains N nodes and T video frames, and the undirected graph can be constructed as G = (V, E), where V = {v_ti | t = 1, 2, ..., T; i = 1, 2, ..., N} represents the set of nodes and E = {E_s, E_t} is the edge set; E_s represents the natural connections of the human skeleton within the same frame (intra-frame connections), and E_t represents the connections of the same node across adjacent frames (inter-frame connections). The graph convolution operation is shown in formula (5):

f_out = Σ_{k=1}^{K_v} W_k f_in (Ā_k ⊙ M_k) (5)

wherein: ⊙ denotes the dot-product operation, W_k is the weight, the value of K_v is set to 3, Ā_k = Λ_k^(−1/2) A_k Λ_k^(−1/2) is the normalized adjacency matrix, and each adjacency matrix A_k is multiplied element-wise by a learnable mask matrix M_k.
S22, adaptive graph convolution layer: the adjacency matrix A_k in graph convolution can represent only a single human body structure graph and lacks the flexibility to model all action-class samples; the Adaptive Graph Convolution Layer (AGCL) proposed in the two-stream adaptive graph convolution model 2s-AGCN can effectively solve this problem. First, the adaptive graph convolution layer in the 2s-AGCN is referred to as the original AGCL. The original AGCL defines three adjacency matrices, corresponding to three types of graphs: matrix A_k is defined by the physical structure of the human body and is equivalent to the adjacency matrix in the graph convolution of S21; the elements of matrix B_k are learned entirely from the training data and can represent the strength of the connections between joint points; and matrix C_k is generated purely data-driven to capture global context information, producing a unique graph structure for each sample.
The present invention proposes an improved AGCL following the 2s-AGCN concept, as shown in fig. 3. Compared with the original AGCL, the improved AGCL adds a 1-dimensional convolution layer after the Gaussian embedding function normalization operation, using the CNN-fused channel-dimension information to capture global context information between joints in time; a dynamic scale coefficient α is added after the adjacency matrix A_k so that A_k acts only in the early stage of training, which improves the flexibility of the adaptive graph convolution model; and a residual structure res(1×1) is added after the fusion of the three graphs, which ensures the stability of the model.
The output feature f_out of the AGCL is calculated as follows. First, the input feature f_in of the n-th layer, of dimension C×T×N, is mapped by the Gaussian embedding functions θ_k and φ_k to compute the joint correlations; the two embedded maps are rearranged into a C_e T×N matrix and an N×C_e T matrix, as in formula (6):

S_k = θ_k(f_in)^T φ_k(f_in), with θ_k(f_in) = W_θk f_in and φ_k(f_in) = W_φk f_in (6)

Secondly, the two matrices are multiplied and normalized to obtain an N×N similarity matrix, which is passed through a CNN with one output channel to aggregate the feature information of the channel dimension, yielding C_k in formula (7):

C_k = Conv1D(softmax(S_k)) (7)

Finally, αA_k, B_k and C_k are added and fused into the adjacency matrix of the adaptive graph convolution model, and the output feature f_out of the n-th layer of the adaptive graph convolution model is expressed as formula (8):

f_out = Σ_{k=1}^{K_v} W_k f_in (αA_k + B_k + C_k) (8)

wherein: α is a dynamic scale coefficient whose custom value helps the model converge effectively; W_k, B_k, θ_k and φ_k are learnable parameters.

The dynamic scale coefficient α follows the principle that α decreases as the number of training rounds increases, so that the adjacency matrix A_k representing the physical structure of the human body has a weakened effect in the later stage of the experiment, thereby highlighting the flexibility of the adaptively generated matrices B_k and C_k in extracting human skeleton features. α is calculated as:

α = 1 − 0.02·b_Epoch, α ∈ [0,1] (9)

wherein: b_Epoch is the value of the iteration number Epoch, and the value of α lies between 0 and 1.
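The decay schedule of formula (9) can be transcribed directly; the clipping to [0, 1] reflects the stated constraint α ∈ [0, 1], and the helper name `alpha` is illustrative:

```python
# Dynamic scale coefficient of formula (9): alpha = 1 - 0.02 * epoch,
# clipped to [0, 1], so the physical-structure matrix A_k fades out as
# training proceeds (a direct transcription of the schedule in the text).
def alpha(epoch: int) -> float:
    return max(0.0, min(1.0, 1.0 - 0.02 * epoch))

assert alpha(0) == 1.0    # A_k fully active at the start of training
assert alpha(25) == 0.5   # halfway through the 50-epoch schedule
assert alpha(50) == 0.0   # A_k switched off by the final epoch
```

With the 50-epoch training setup described in S3, this schedule reaches zero exactly at the last epoch, leaving only B_k and C_k active late in training.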
S23, adaptive graph convolution model incorporating the spatio-temporal attention mechanism: the present invention integrates the spatio-temporal attention mechanism STAM into the adaptive graph convolution model AGCN; the improved AGCN is shown in fig. 4. The AGCN contains 9 Adaptive Graph Convolution Blocks (AGCBs), where each AGCB consists of an adaptive graph convolution (Convs) in series with a temporal convolution (Convt). The nine AGCBs correspond to B1-B9 in the figure, with output channel dimensions of 64, 128 and 256; frame-number downsampling is performed once each in B4 and B7 to reduce the parameter count and increase the network training speed. The spatio-temporal attention module STAM is added between B3 and B4, and a residual structure is introduced between adjacent AGCBs, improving the stability of the network.
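The block layout above can be summarized in a small configuration sketch; the exact per-block channel assignment is an assumption (the text lists the channel dimensions 64/128/256 with downsampling in B4 and B7 but does not spell out every block):

```python
# Assumed per-block configuration of the nine AGCBs B1-B9:
# (block name, output channels, temporal stride).
blocks = [
    ('B1', 64, 1), ('B2', 64, 1), ('B3', 64, 1),    # STAM follows B3
    ('B4', 128, 2), ('B5', 128, 1), ('B6', 128, 1),
    ('B7', 256, 2), ('B8', 256, 1), ('B9', 256, 1),
]

T = 300            # NTU-RGB+D single-sequence frame count (see S3)
for _name, _channels, stride in blocks:
    T //= stride   # frame-number downsampling happens in B4 and B7

assert T == 75                 # 300 -> 150 (B4) -> 75 (B7)
assert len(blocks) == 9
```

The two stride-2 blocks quarter the temporal resolution, which is the parameter/speed saving the text attributes to the downsampling in B4 and B7.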
S24, the three-stream adaptive graph convolution model with added bone motion information: the original skeleton sequence contains only the joint information of the skeleton data, i.e., the two-dimensional or three-dimensional coordinates of the joint points. In order to acquire as much information from the skeleton sequence as possible and improve the recognition capability of the network, the invention extracts the joint point information, bone position information and bone motion information from the skeleton feature information and designs a three-stream adaptive graph convolution model for motion recognition.
Existing motion recognition methods rarely utilize the bone position information and bone motion information in the human skeleton; 2s-AGCN first used bone position information for motion recognition and constructed a two-stream network from the joint point information and bone position information of the human skeleton. The present invention builds a three-stream adaptive graph convolution model by adding bone motion information as the input of B-M-Stream, on the basis of extracting joint point information and bone position information as the inputs of the two branches J-Stream and B-Stream respectively, as shown in fig. 5. The joint near the center of gravity of the skeleton is defined as the source joint J_source and the joint far from the center of gravity as the target joint J_target; the bone of frame t is represented as a vector pointing from its source joint to the target joint, B_t = J_t,target − J_t,source. The invention assigns the center joint an empty bone with a value of 0, so that the graphs and networks of bones and joints correspond one-to-one. The bone motion information is formulated as M_Bt = B_{t+1} − B_t. The three network branches are trained independently to obtain the corresponding softmax-layer scores, and the final prediction result is obtained through fusion.
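The bone-stream quantities B_t and M_Bt can be sketched on a toy skeleton as follows; the parent links in `pairs` are hypothetical and do not correspond to any real skeleton layout:

```python
# Bone-stream construction for S24 (illustrative sketch):
# each bone B_t = J_t,target - J_t,source points from the joint nearer
# the center of gravity to the farther one, and bone motion is
# M_Bt = B_{t+1} - B_t between consecutive frames.
import numpy as np

pairs = {1: 0, 2: 1, 3: 0, 4: 3}    # hypothetical target -> source links
rng = np.random.default_rng(2)
J = rng.standard_normal((3, 5, 2))  # T=3 frames, N=5 joints, 2-D coords

B = np.zeros_like(J)                # bone positions per frame
for tgt, src in pairs.items():
    B[:, tgt] = J[:, tgt] - J[:, src]
# the center joint (index 0) keeps the empty bone of value 0

M_B = B[1:] - B[:-1]                # inter-frame bone motion

assert np.all(B[:, 0] == 0)         # empty bone at the center joint
assert M_B.shape == (2, 5, 2)       # T-1 motion frames
```

The joint coordinates J feed J-Stream, the bone vectors B feed B-Stream, and the motion differences M_B feed B-M-Stream.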
During model training, the three branches J-Stream, B-Stream and B-M-Stream emphasise different information: the J-Stream branch processes the relations between human skeleton joints in each frame of the video, the B-Stream branch acquires intra-frame bone position information, and the B-M-Stream branch uses inter-frame bone motion information to capture action correlations across the spatio-temporal dimension. When the three branch scores are added and fused they complement one another, and the features extracted by the AGCN yield more accurate action classification, making up for the information each branch misses when acting alone. In addition, the model of the invention adds a spatio-temporal attention mechanism, which ensures the validity of the feature information in both the spatial and temporal dimensions.
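The score-level fusion described above can be sketched as follows. This is a minimal sketch with made-up logits for 2 samples and 4 classes; the patent fuses the softmax scores of the three independently trained branches by addition:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical logits from the three independently trained branches.
logits_j = np.array([[2.0, 0.1, 0.3, 0.2], [0.2, 1.5, 0.1, 0.4]])  # J-Stream
logits_b = np.array([[1.8, 0.2, 0.5, 0.1], [0.3, 1.2, 0.6, 0.2]])  # B-Stream
logits_m = np.array([[1.5, 0.4, 0.2, 0.3], [0.1, 1.7, 0.2, 0.5]])  # B-M-Stream

# Add the three softmax score vectors and take the argmax as the prediction.
fused = softmax(logits_j) + softmax(logits_b) + softmax(logits_m)
pred = fused.argmax(axis=1)
print(pred)  # [0 1]
```

Equal weighting of the three branches is assumed here; the patent does not state per-branch fusion weights.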
S3, experimental results and analysis
S31, experimental data set:
kinetics dataset: kinetics is a large-scale human behavioral data set, with about 30 tens of thousands of video segments, each segment being about 10 seconds long, containing 400 action categories in total. Training and testing are carried out by using a Skeleton sequence data set Kinetics-Skeleton, evaluation indexes are Top-1 and Top-5, wherein the training set is 240000 fragments, and the testing set is 20000 fragments.
NTU-RGB+D dataset: NTU-RGB+D is the most widely used dataset in motion recognition, with 56,880 action clips covering 60 action classes, captured with three cameras at different angles; each skeleton contains 25 joints. Two evaluation benchmarks follow different training/test splits: Cross-Subject (X-Sub) and Cross-View (X-View). X-Sub divides the training and test sets by subject ID, and X-View divides them by camera.
The experiment selects 20 representative actions for testing, listed in Table 1; for each action, 80% of the samples form the training set, 10% the test set and 10% the validation set.
Table 1 The 20 actions selected for the experiment
S32, experimental configuration: the NTU-RGB+D dataset uniformly uses single sequences of 300 frames, and the Kinetics dataset uses single sequences of 150 frames. The deep learning framework is PyTorch 1.12.1 and the optimisation strategy is stochastic gradient descent (SGD). The number of samples per training step (Batch_Size) is 32 and, for ease of comparison experiments, the number of iterations (Epoch) is set to 50; the learning rate of the first 30 epochs is 0.1, after which it is reduced to 0.01.
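The stated optimisation setup can be sketched in PyTorch. This is a minimal sketch, not the patent's training code; the `torch.nn.Linear` module, the momentum value and the use of `MultiStepLR` are assumptions — only the framework version, SGD, batch size 32, 50 epochs and the 0.1 → 0.01 learning-rate drop at epoch 30 come from the text:

```python
import torch

# Stand-in for the 3s-STAM-AGCN network (hypothetical shapes).
model = torch.nn.Linear(75, 60)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Drop lr by a factor of 10 after epoch 30, matching the stated schedule.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30], gamma=0.1)

for epoch in range(50):
    # ... one pass over a DataLoader with batch_size=32 would go here ...
    scheduler.step()

print(optimizer.param_groups[0]["lr"])  # 0.01 from epoch 30 onward
```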
S33, experimental process
S331, data processing: the input required for skeleton-based motion recognition is skeleton data, generally obtained from high-precision depth cameras or a pose estimation algorithm. The experiment uses the OpenPose pose estimation algorithm, which extracts features from the input video frames to obtain confidence maps and part-affinity fields, then connects the joints belonging to the same person, and finally assembles the complete skeleton of each person. The 20 actions selected for the experiment are processed with OpenPose and the skeleton data results are visualised, as shown in figs. 6(a)-6(d).
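Packing the OpenPose output into a network input can be sketched as follows. This is a minimal sketch under the assumption that, as in Kinetics-Skeleton, each frame yields 18 COCO joints as flat (x, y, confidence) triples; the function name and toy data are hypothetical:

```python
import numpy as np

def frames_to_tensor(frames, num_joints=18):
    """Pack per-frame keypoint lists [x1, y1, c1, x2, y2, c2, ...] into (C, T, V)."""
    t = len(frames)
    data = np.zeros((3, t, num_joints), dtype=np.float32)  # channels: x, y, confidence
    for ti, flat in enumerate(frames):
        kp = np.asarray(flat, dtype=np.float32).reshape(num_joints, 3)
        data[:, ti, :] = kp.T
    return data

toy = [list(range(54)), list(range(54))]  # two fake frames, 18 joints * 3 values
print(frames_to_tensor(toy).shape)        # (3, 2, 18)
```

The (C, T, V) layout — channels, frames, joints — matches the feature dimensions used throughout the claims.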
S332, ablation experiments
1) Network performance comparison experiment before and after improvement of self-adaptive graph convolution layer
The invention adds a dynamic scale coefficient α to the improved adaptive graph convolution layer AGCL. To test the effect of the coefficient on the network, α is set in three ways: α = 0, α = 1, and α decaying dynamically with the training round. Fig. 7 shows the network loss curves under the different settings of α; the network converges faster when α is set to decrease with the round. The experiments show that adding the scale coefficient to the human-topology adjacency matrix A_k improves the speed and effectiveness of extracting human feature information.
2) Model validity verification experiment
To verify the effectiveness of the spatio-temporal attention mechanism and the improved adaptive graph convolution layer AGCL, the joint-stream, bone-stream and two-stream networks were tested under the NTU RGB+D dataset X-View benchmark, taking 2s-AGCN as the baseline and combining it with STAM and the improved AGCL respectively; the results are shown in Table 2.
As Table 2 shows, under the NTU RGB+D dataset X-View benchmark: after adding STAM, the recognition accuracy of the joint stream and bone stream improves over 2s-AGCN by 0.3% and 0.5% respectively, and the two-stream accuracy improves by 0.4%, verifying the effectiveness of the spatio-temporal attention mechanism; with the improved AGCL, the joint-stream and bone-stream accuracies improve over the original 2s-AGCN by 0.7% and 0.5% respectively and the two-stream accuracy by 0.6%, verifying the effectiveness of the improved AGCL for extracting the spatio-temporal features of the human skeleton; 3s-STAM-AGCN, which combines the spatio-temporal attention mechanism with the improved adaptive graph convolution layer, improves the joint-stream and bone-stream accuracies by 1.1% and 0.9% respectively and the two-stream accuracy by 1.1% compared with 2s-AGCN. These ablation results fully verify the effectiveness of the spatio-temporal attention mechanism and the improved AGCL.
Table 2 Accuracy (%) of the ablation experiments on the NTU-RGB+D dataset X-View benchmark
3) Double-flow and three-flow comparison experiment
To verify that the three-stream configuration of the model performs best, the two-stream and three-stream networks were compared under the X-View and X-Sub benchmarks of the NTU RGB+D dataset. As Table 3 shows, the two-stream network fusing the joint and bone streams reaches 96.2% and 89.3% accuracy on X-View and X-Sub respectively, while the three-stream network fusing the joint, bone-position and bone-motion streams reaches 96.9% and 90.6%; the three-stream network thus improves on the two-stream network by 0.7% and 1.3%, verifying its superiority.
Table 3 Recognition accuracy (%) of each stream of the 3s-STAM-AGCN model on NTU RGB+D
S333, comparison experiments with other methods
To demonstrate the superiority of the 3s-STAM-AGCN model, the invention compares it with advanced domestic and foreign methods on the NTU RGB+D and Kinetics datasets. The comparison results are shown in Tables 4 and 5.
As the comparison in Table 4 shows, GCN-based methods generally outperform the other three classes of methods. Compared with the representative GCN-based method ST-GCN, the recognition accuracy of the model improves by 9.1% and 8.6% on X-Sub and X-View respectively; compared with 2s-AGCN, it improves by 2.1% and 1.8%. Under the X-Sub and X-View protocols, the model of the invention reaches 90.6% and 96.9% recognition accuracy respectively, the best results.
As Table 5 shows, the 3s-STAM-AGCN model reaches 37.3% Top-1 and 61.2% Top-5 recognition accuracy, improvements of 6.6% and 8.4% over the representative GCN-based method ST-GCN and of 1.2% and 2.5% over 2s-AGCN, higher than the classical methods in the table; the results demonstrate the model's superiority.
Table 4 Accuracy comparison (%) of different methods on the NTU RGB+D dataset
Table 5 Accuracy comparison (%) of different methods on the Kinetics dataset
To further analyse the effectiveness of the 3s-STAM-AGCN model, its recognition accuracy on the 20 actions was compared with that of 2s-AGCN under the NTU RGB+D dataset X-Sub benchmark. Fig. 8 shows that the method of the invention markedly improves the recognition accuracy of most actions, verifying the superiority of the model.
In summary, compared with other methods the 3s-STAM-AGCN model performs better at extracting spatio-temporal features and global context information, and achieves higher recognition accuracy on the large-scale NTU RGB+D and Kinetics datasets.
The invention proposes 3s-STAM-AGCN, a three-stream adaptive graph convolution model combined with a spatio-temporal attention mechanism, built on 2s-AGCN by adding the spatio-temporal attention mechanism, improving the adaptive graph convolution layer, and adding bone motion information as the input of a third branch. Experiments on the Kinetics and NTU RGB+D datasets show that adding bone motion information and the spatio-temporal attention mechanism enriches the spatio-temporal feature information to a certain extent and strengthens the connections between global contexts, while the improved adaptive graph convolution layer accelerates convergence during network training and increases the flexibility and stability of the network. The accuracy of the proposed algorithm improves on both datasets, proving its effectiveness for human action recognition. However, the accuracy gain is insufficient in scenes with complex backgrounds or partial interaction, so eliminating the interference of complex backgrounds should also be considered in the three-stream architecture. In future work, improving recognition accuracy by combining scene information and interaction information will be studied.
Although the present invention has been described in detail by way of preferred embodiments with reference to the accompanying drawings, it is not limited thereto. Those skilled in the art may make various equivalent modifications and substitutions to the embodiments of the present invention without departing from its spirit and scope, and all such modifications and substitutions fall within the scope defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A motion recognition method of a three-stream adaptive graph convolution model fused with joint capture, characterised by comprising the following steps:
s1, introducing a space-time attention module to capture space-time characteristics of a human skeleton sequence: processing the dynamic correlation of intra-frame joint information and bone position information using a spatial attention module; using a temporal attention module to focus on feature correlation of inter-frame bone motion information;
S2, constructing a fused-joint-capture three-stream self-adaptive graph convolution model: processing joint information with a Gaussian embedding function, adding a one-dimensional convolution layer after the Gaussian-embedding-function normalisation to aggregate the CNN channel dimension, adding a dynamic scale coefficient after the adjacency matrix to help the model converge effectively, and introducing bone motion information to construct the three-stream self-adaptive graph convolution model, comprising the following steps:
s21, modeling a skeleton sequence in two dimensions of time and space by using graph convolution;
S22, adding a 1-dimensional convolution layer to the improved self-adaptive graph convolution model after the Gaussian-embedding-function normalisation operation, and using the CNN to fuse channel-dimension information so as to capture global context information between joints in time; adding a dynamic scale coefficient α after the adjacency matrix A_k so that the adjacency matrix acts only in the early stage of training, increasing the flexibility of the self-adaptive graph convolution model; and adding a residual structure after the fusion of the three graphs to ensure the stability of the model;
s23, merging a space-time attention mechanism STAM into the adaptive graph convolution model: downsampling the number of frames of the self-adaptive graph convolution model once so as to reduce the parameter number and improve the network training speed; adding a space-time attention module STAM, introducing a residual structure, and improving the stability of the model;
S24, adding a three-stream self-adaptive graph convolution model of bone motion information: extracting joint point information, bone position information and bone motion information from the skeleton feature information, constructing a three-stream self-adaptive graph convolution model, training each branch of the three-stream self-adaptive graph convolution model independently to obtain the corresponding softmax layer score, and obtaining the final prediction result through fusion;
S3, verifying the recognition accuracy of the three-stream self-adaptive graph convolution model: performing feature extraction on the input video frames with the OpenPose pose estimation algorithm to obtain skeleton data, and performing behaviour recognition with the three-stream self-adaptive graph convolution model.
2. The method for motion recognition of a three-stream adaptive graph convolution model fused with joint capture according to claim 1, wherein in the step S1, the spatio-temporal attention module contains information of both the temporal and the spatial dimension; it helps the network focus on the most discriminative spatio-temporal regions in a complex video while suppressing the interference of other irrelevant regions, and it can be embedded anywhere in the network without affecting the network's original structure.
3. The method for motion recognition of a three-stream adaptive graph convolution model fused with joint capture according to claim 2, wherein in the step S1, the spatial attention module is configured to grasp, in the spatial dimension, the degree of influence of different regions on the target region; the input feature f_in ∈ R^{C×T×N} is operated on to obtain the association strength S_{i,j} between joint i and joint j, after which the attention weights of the hidden state are normalised to the range [0,1]; the specific calculation formulas are:

S_{i,j} = θ(f_i)^T φ(f_j)   (1)

α_{i,j} = exp(S_{i,j}) / Σ_{j=1}^{N_p} exp(S_{i,j})   (2)

wherein: C×T×N is the dimension of the input feature, C is the number of channels, T is the number of frames, N is the number of joints, and N_p is the number of positions in the feature map.
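The spatial-attention weighting of claim 3 can be sketched numerically. This is a minimal sketch, assuming two learned linear embeddings (`W_theta`, `W_phi` here are random stand-ins) and mean-pooling over the time axis; the patent's exact embedding operations are given only as images in the original:

```python
import numpy as np

rng = np.random.default_rng(0)
C, T, N = 4, 3, 5                      # channels, frames, joints
f_in = rng.standard_normal((C, T, N))  # toy input feature of dimension C x T x N

W_theta = rng.standard_normal((C, C))  # hypothetical learned embedding weights
W_phi = rng.standard_normal((C, C))

feat = f_in.mean(axis=1)                   # pool over time: (C, N)
S = (W_theta @ feat).T @ (W_phi @ feat)    # (N, N) association strengths S_ij
S = S - S.max(axis=1, keepdims=True)       # stabilise the softmax
attn = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)  # rows normalised to [0, 1]

print(attn.shape)  # (5, 5)
```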
4. The method for motion recognition of a three-stream adaptive graph convolution model fused with joint capture according to claim 1, wherein in the step S1, the temporal attention module is used to highlight the most noteworthy temporal segments among the human skeleton information of different time periods;

the input feature f_in ∈ R^{C×T×N} is passed through a convolution layer with a 1×1 convolution kernel to change the feature dimension; E_{i,j} denotes the degree of dependency between time i and time j, and the weights are finally normalised; the specific calculation formulas are as follows:

E_{i,j} = θ(f_i)^T φ(f_j)   (3)

β_{i,j} = exp(E_{i,j}) / Σ_{j=1}^{T} exp(E_{i,j})   (4)
5. The method for motion recognition of a three-stream adaptive graph convolution model fused with joint capture according to claim 4, wherein in the step S21, the skeleton sequence contains N joint points and T video frames, and an undirected graph G = (V, E) is constructed, wherein V = {v_ti | t = 1, ..., T; i = 1, ..., N} denotes the set of joint nodes and E = {E_s, E_t} is the edge set, E_s representing the natural connections of the human skeleton within the same frame (intra-frame connections) and E_t representing the connections of the same joint across adjacent frames (inter-frame connections); the graph convolution operation is shown in formula (5):

f_out = Σ_{k=1}^{K_v} W_k f_in (A_k ⊙ M_k)   (5)

wherein: ⊙ denotes element-wise multiplication, W_k is a weight, K_v takes the value 3, and A_k = D_k^{-1/2} Ā_k D_k^{-1/2} is the normalised adjacency matrix; each adjacency matrix A_k is multiplied element-wise by a learnable mask matrix M_k.
6. The method for motion recognition of a three-stream adaptive graph convolution model fused with joint capture according to claim 5, wherein in the step S22, the output feature f_out of the improved adaptive graph convolution model is calculated as follows:

first, the input feature f_in of the n-th layer, of dimension C×T×N, is mapped by the Gaussian embedding functions θ_k and φ_k to calculate the joint correlations, and the two embedded maps are rearranged into two matrices of dimensions C_e T×N and N×C_e T, as in formula (6):

θ_k(f_in) = W_θk f_in,  φ_k(f_in) = W_φk f_in   (6)

secondly, the two matrices are multiplied and normalised to obtain an N×N similarity matrix, and C_k in formula (7) is obtained by aggregating the feature information of the channel dimension through a CNN whose output channel is 1:

C_k = CNN( softmax( θ_k(f_in)^T φ_k(f_in) ) )   (7)

finally, αA_k, B_k and C_k are added and fused to form the adjacency matrix of the adaptive graph convolution model, and the output feature f_out of the n-th layer of the adaptive graph convolution model is expressed as formula (8):

f_out = Σ_{k=1}^{K_v} W_k f_in (αA_k + B_k + C_k)   (8)

wherein: α is a dynamic scale coefficient whose custom value helps the model converge effectively; W_k, B_k, θ_k and φ_k are learnable parameters;

the dynamic scale coefficient α decreases as the training rounds increase, so that the adjacency matrix A_k representing the physical structure of the human body is weakened in the later period of the experiment, highlighting the flexibility of the adaptively generated matrices B_k and C_k in extracting human skeleton features; α is calculated by formula (9):

α = 1 − 0.02·b_Epoch,  α ∈ [0,1]   (9)

wherein: b_Epoch is the value of the iteration number Epoch, and the value of α lies between 0 and 1.
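The dynamic scale coefficient of formula (9) can be sketched as a simple schedule. This is a minimal sketch; the clipping to [0, 1] implements the stated constraint α ∈ [0, 1]:

```python
def alpha(epoch: int) -> float:
    """Formula (9): alpha = 1 - 0.02 * b_Epoch, clipped to [0, 1].

    The fixed human-topology adjacency A_k dominates early training and
    fades out as the adaptively learned matrices B_k and C_k take over.
    """
    return min(1.0, max(0.0, 1.0 - 0.02 * epoch))

print(alpha(0), alpha(25), alpha(50))  # 1.0 0.5 0.0
```

With the 50-epoch training schedule stated in S32, α decays from 1.0 at the first epoch to 0 by the final epoch.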
7. The method for motion recognition of a three-stream adaptive graph convolution model for joint capture as defined in claim 6, wherein in step S23, the improved adaptive graph convolution model comprises nine adaptive graph convolution blocks, and the adaptive graph convolution blocks are formed by concatenating an adaptive graph convolution and a time convolution; nine self-adaptive graph convolution blocks respectively correspond to B1-B9, the output channel dimensions correspond to 64, 128, 256 and 256, and the number of frames is downsampled in B4 and B7 once, so that the parameter number is reduced, and the network training speed is improved; a spatio-temporal attention module STAM is added between B3 and B4 while a residual structure is introduced between adjacent adaptive map convolution blocks.
8. The method for motion recognition of a three-stream adaptive graph convolution model fused with joint capture according to claim 7, wherein in the step S24, the specific steps of the three-stream adaptive graph convolution model with added bone motion information are as follows:
firstly, bone position information is used for motion recognition, and a double-flow network is constructed according to joint point information and bone position information of a human skeleton;
secondly, on the basis of extracting joint point information and bone position information to be respectively used as input of two branches of J-Stream and B-Stream, adding bone motion information to be used as input of B-M-Stream to construct a three-Stream self-adaptive graph convolution model;
again, the joint near the center of gravity of the skeleton is defined as the source joint J_source and the joint far from the center of gravity as the target joint J_target; the bone of frame t is represented as the vector pointing from its source joint to its target joint, B_t = J_t,target − J_t,source;
finally, the center joint is assigned an empty bone with value 0, so that the graphs and networks of bones and joints correspond one to one; the bone motion information is formulated as M_Bt = B_{t+1} − B_t; the three network branches are trained independently to obtain their respective softmax layer scores, and the final prediction result is obtained through fusion.
9. The method for motion recognition of a three-stream adaptive graph convolution model fused with joint capture according to claim 1, wherein in the step S3, the datasets used to verify the recognition accuracy of the three-stream adaptive graph convolution model include:
Kinetics dataset: a large-scale human behavior dataset with more than 300,000 video clips, each about 10 s long, containing 400 action categories in total; training and testing use the skeleton-sequence dataset Kinetics-Skeleton with Top-1 and Top-5 as evaluation metrics, the training set containing 240000 clips and the test set 20000 clips;
NTU-rgb+d dataset: for the most widely applied dataset in motion recognition, there are 56880 motion segments, including 60 motion classes, photographed using three cameras of different angles, each skeleton containing 25 joints; according to different dividing methods of the training set and the test set, two evaluation indexes are adopted: X-Sub and X-view, wherein X-Sub divides training set and test set according to person ID, and X-view divides training set and test set according to camera.
10. The method for motion recognition of a three-stream adaptive graph convolution model fused with joint capture according to claim 9, wherein in the step S3, the Kinetics dataset uses single sequences of 150 frames and the NTU-RGB+D dataset uniformly uses single sequences of 300 frames; the deep learning framework is PyTorch 1.12.1, the optimisation strategy is stochastic gradient descent (SGD), the number of samples per training step is 32, the number of iterations is set to 50, the learning rate of the first 30 epochs is 0.1, and it is then reduced to 0.01.
CN202310306777.XA 2023-03-27 2023-03-27 Motion recognition method of three-stream self-adaptive graph convolution model fused with joint capture Pending CN116343334A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310306777.XA CN116343334A (en) 2023-03-27 2023-03-27 Motion recognition method of three-stream self-adaptive graph convolution model fused with joint capture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310306777.XA CN116343334A (en) 2023-03-27 2023-03-27 Motion recognition method of three-stream self-adaptive graph convolution model fused with joint capture

Publications (1)

Publication Number Publication Date
CN116343334A true CN116343334A (en) 2023-06-27

Family

ID=86887404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310306777.XA Pending CN116343334A (en) 2023-03-27 2023-03-27 Motion recognition method of three-stream self-adaptive graph convolution model fused with joint capture

Country Status (1)

Country Link
CN (1) CN116343334A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117409483A (en) * 2023-12-13 2024-01-16 Yantai University Virtual reality interaction method and system based on self-adaptive joint space-time diagram convolution
CN117409483B (en) * 2023-12-13 2024-06-07 Yantai University Virtual reality interaction method and system based on self-adaptive joint space-time diagram convolution
CN117854155A (en) * 2024-03-07 2024-04-09 East China Jiaotong University Human skeleton action recognition method and system
CN117854155B (en) * 2024-03-07 2024-05-14 East China Jiaotong University Human skeleton action recognition method and system


Similar Documents

Publication Publication Date Title
CN109543606B (en) Human face recognition method with attention mechanism
CN111696137B (en) Target tracking method based on multilayer feature mixing and attention mechanism
CN108256426A (en) A kind of facial expression recognizing method based on convolutional neural networks
CN112329794B (en) Image description method based on dual self-attention mechanism
CN108830170B (en) End-to-end target tracking method based on layered feature representation
Li et al. Sign language recognition based on computer vision
CN113221663B (en) Real-time sign language intelligent identification method, device and system
CN112434599B (en) Pedestrian re-identification method based on random occlusion recovery of noise channel
CN113673510A (en) Target detection algorithm combining feature point and anchor frame joint prediction and regression
CN113128424A (en) Attention mechanism-based graph convolution neural network action identification method
CN112084952B (en) Video point location tracking method based on self-supervision training
CN109447014A (en) A kind of online behavioral value method of video based on binary channels convolutional neural networks
CN116343334A (en) Motion recognition method of three-stream self-adaptive graph convolution model fused with joint capture
CN114333002A (en) Micro-expression recognition method based on deep learning of image and three-dimensional reconstruction of human face
CN114036969A (en) 3D human body action recognition algorithm under multi-view condition
CN112906520A (en) Gesture coding-based action recognition method and device
Li et al. Conditional random fields as message passing mechanism in anchor-free network for multi-scale pedestrian detection
CN113361466B (en) Multispectral target detection method based on multi-mode cross guidance learning
CN113139432B (en) Industrial packaging behavior identification method based on human skeleton and partial image
CN112446253A (en) Skeleton behavior identification method and device
CN117315765A (en) Action recognition method for enhancing space-time characteristics
Gao et al. Study of improved Yolov5 algorithms for gesture recognition
Zhou et al. Regional Self-Attention Convolutional Neural Network for Facial Expression Recognition
Li et al. Human behavior recognition based on attention mechanism
CN114202801A (en) Gesture recognition method based on attention-guided airspace map convolution simple cycle unit

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination