CN113505751B - Human skeleton action recognition method based on difference map convolutional neural network - Google Patents

Human skeleton action recognition method based on difference map convolutional neural network

Info

Publication number
CN113505751B
CN113505751B CN202110861474.5A
Authority
CN
China
Prior art keywords
neural network
difference
training
convolution
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110861474.5A
Other languages
Chinese (zh)
Other versions
CN113505751A (en)
Inventor
刘成菊
曾秦阳
陈启军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN202110861474.5A priority Critical patent/CN113505751B/en
Publication of CN113505751A publication Critical patent/CN113505751A/en
Application granted granted Critical
Publication of CN113505751B publication Critical patent/CN113505751B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a human skeleton action recognition method based on a difference graph convolutional neural network, which comprises the following steps: S1, preprocessing the skeleton data; S2, preliminarily designing the difference graph convolutional neural network architecture, combining graph convolution with convolution, and adopting a difference-learning scheme in which the differences between graph convolutions at different levels serve as feedforward input for consecutive time frames; S3, initially selecting training parameters and performing error back-propagation; S4, performing cross-view and cross-subject training and testing on the skeleton data set; S5, fine-tuning the training parameters according to the test accuracy and repeating steps S3 and S4 to obtain more accurate training parameters; S6, fixing the training parameters and fine-tuning the network architecture to obtain more accurate network architecture parameters, then recognizing the human skeleton actions. Compared with the prior art, the method improves both cross-view (CV) and cross-subject (CS) accuracy on the NTU data set while using few parameters, with a total parameter count below 1M, achieving fast and accurate recognition.

Description

Human skeleton action recognition method based on difference map convolutional neural network
Technical Field
The invention relates to the fields of robot learning and computer vision, and in particular to a human skeleton action recognition method based on a difference graph convolutional neural network.
Background
Action recognition, as a branch of computer vision, is widely applied in video surveillance, human-machine interaction, intelligent driving, and other fields. Recognizing and classifying actions supports video surveillance, scene understanding, action prediction, and similar purposes. In video safety surveillance, human actions can be used to judge whether production operations meet safety regulations; in human-machine interaction, the next action can be predicted from the current human action, providing input for the robot's action decisions; in intelligent driving, a smart vehicle can adopt different driving strategies according to pedestrians' actions. In summary, action recognition is widely applicable and is a hot direction in the field of artificial intelligence.
At present, many mature action recognition data sets are collected with Kinect-v2 cameras, each with a different emphasis. Existing action recognition algorithms fall into two categories: traditional hand-crafted feature extraction and deep learning. Hand-crafted methods include the spatio-temporal interest point method, the motion histogram method, the dense trajectory method, and others; these traditional methods have low recognition accuracy and poor robustness, and are strongly disturbed by illumination, occlusion, background changes, and the like. Deep learning algorithms divide into CNN-based and RNN-based approaches; CNN-based models are smaller than RNN models and occupy less computing memory, and subdivide into two-stream convolution and 3D convolution methods. The graph convolution algorithm, popular in recent years, has strong learning capacity, but its computation is heavy and memory-hungry.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a human skeleton action recognition method based on a difference graph convolutional neural network.
The purpose of the invention can be realized by the following technical scheme:
A human skeleton action recognition method based on a difference graph convolutional neural network, used to improve skeleton-based action recognition accuracy while keeping the network structure lightweight, comprises the following steps:
S1, preprocessing the skeleton data: eliminating irrelevant skeleton data, repairing incomplete data, and performing normalization;
S2, preliminarily designing the difference graph convolutional neural network architecture: combining graph convolution with convolution, and adopting a difference-learning scheme in which the differences between graph convolutions at different levels serve as feedforward input for consecutive time frames;
S3, initially selecting training parameters, including the loss function loss, learning rate lr, training-sample batch size batch_size, and iteration count epoch, and performing error back-propagation;
S4, after the parameters are set, performing cross-view and cross-subject training and testing on the skeleton data set;
S5, fine-tuning the training parameters according to the test accuracy, and repeating steps S3 and S4 until the cross-view and cross-subject test accuracy reaches the expected level, obtaining more accurate training parameters;
S6, fixing the training parameters, fine-tuning the network architecture, and repeating step S4 until the test accuracy reaches the expected level, obtaining more accurate network architecture parameters, then recognizing the human skeleton actions.
The step S1 specifically comprises the following steps:
After removing the irrelevant skeleton data, new skeleton data is generated by interpolation, and normalization is performed with BatchNorm2d() in the PyTorch deep learning framework.
The irrelevant skeleton data comprises data that does not meet the specification and data whose scale is outside the set range.
The data of each dimension in the skeleton data is normalized into the [0,1] interval.
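As a minimal sketch of the per-dimension [0, 1] normalization described above, written in plain Python: the patent applies BatchNorm2d in PyTorch, so this min-max version only illustrates the target range, not the exact operator, and the sample coordinates are assumed data.

```python
def minmax_normalize(values):
    """Scale readings for one coordinate dimension into the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant dimension: map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

xs = [120.0, 150.0, 180.0, 135.0]     # e.g. x-coordinates of one joint (assumed)
print(minmax_normalize(xs))           # every value now lies in [0, 1]
```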
The step S2 specifically includes:
The difference graph convolutional neural network architecture is preliminarily designed: the input skeleton-point data is processed by connecting same-level graph convolution and convolutional networks in series, which enhances the learning capacity of the graph convolution; difference learning is performed by connecting graph convolutions of different levels in parallel, so that the output differences of different-level graph convolutions, after a convolution operation, serve as continuous inter-frame feedforward input.
In step S3, the preliminary selection of training parameters specifically includes:
The initial learning rate lr is set to 0.001, the cross-entropy function is used as the loss function, the training-sample batch size batch_size is set to 64, the iteration count epoch is set to 140, and the parameters with the minimum loss function during training are selected as the test model.
In the step S4, the skeleton data set is specifically an NTU-RGB + D60 data set.
In the step S5, after training finishes, the cross-view and cross-subject tests are completed; the training parameters, including the learning rate, loss function, batch_size, and epoch, are adjusted according to the test accuracy, and training is repeated several times to finally obtain the most accurate training parameters, completing the optimization of the training parameters.
In step S6, fine-tuning the network architecture specifically includes:
Changing the convolution kernel size, the activation function, and the pooling mode in the network, where the pooling mode includes max pooling and average pooling.
In the step S2, the difference graph convolutional neural network consists of two parts. The first part connects in parallel two groups of GCN-CNN series structures with identical network parameters, where GCN is a graph convolutional network and CNN is a two-dimensional convolutional network with a 1×1 convolution kernel. The second part convolves, with two groups of two-dimensional convolutional networks, the output difference of two identical GCNs; the kernel size of the second CNN in this part is 3×3. The difference of the input three-dimensional skeleton data processed by the two identical graph convolutional networks GCN passes through two CNN convolutions and ReLU activations in turn to form the output of the second part, and the final network output is the sum of the two parts' outputs. The input of the difference graph convolutional neural network is three-dimensional skeleton points, and its output is the continuous inter-frame input of the next stage; the error learned by the second part serves as a positive feedforward signal for the inter-frame input, and continuously correcting the inter-frame input error avoids error accumulation and improves recognition accuracy.
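The two-part structure above can be sketched numerically. The following is an illustration under stated assumptions, not the authors' implementation: the helper names (`gcn`, `conv1x1`), the tiny 5-joint chain skeleton, and the interpretation of "different levels" as one versus two applications of the same GCN are all our own; a 1×1 two-dimensional convolution over (joint, channel) data is modeled as a channel-mixing matrix multiply.

```python
import numpy as np

rng = np.random.default_rng(0)

V, C = 5, 3                                  # joints, channels per joint
A = np.eye(V) + np.diag(np.ones(V - 1), 1) + np.diag(np.ones(V - 1), -1)
A_hat = A / A.sum(axis=1, keepdims=True)     # row-normalized chain-skeleton adjacency

def gcn(x, w):
    """One graph-convolution layer: aggregate neighbours, then mix channels."""
    return A_hat @ x @ w

def conv1x1(x, w):
    """A 1x1 2-D convolution over (joint, channel) data reduces to a channel mix."""
    return x @ w

relu = lambda t: np.maximum(t, 0.0)

x = rng.standard_normal((V, C))              # one frame of 3-D skeleton points
w_g = rng.standard_normal((C, C))            # shared GCN weights (identical GCNs)
w_c = rng.standard_normal((C, C))            # 1x1 CNN weights of part 1
w_d1 = rng.standard_normal((C, C))           # first CNN of part 2
w_d2 = rng.standard_normal((C, C))           # second CNN of part 2

# Part 1: two GCN->CNN branches with identical parameters, in parallel, summed.
part1 = conv1x1(gcn(x, w_g), w_c) + conv1x1(gcn(x, w_g), w_c)

# Part 2: difference of graph convolutions at different levels, then two
# convolutions with ReLU activations (the feedforward "error" term).
diff = gcn(gcn(x, w_g), w_g) - gcn(x, w_g)
part2 = relu(conv1x1(relu(conv1x1(diff, w_d1)), w_d2))

out = part1 + part2                          # fed onward as inter-frame input
print(out.shape)                             # (5, 3)
```

The final ReLU makes the part-2 feedforward term element-wise non-negative, which is one way to read the "positive feedforward signal" in the text.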
Compared with the prior art, the invention has the following advantages:
1. The network architecture achieves high human skeleton action recognition accuracy on the NTU-RGB+D60 data set. Connecting the graph convolution GCN and the two-dimensional convolution CNN in series improves the learning capacity of the GCN. By analogy with the bridge-balance concept, the notion of "difference learning" is proposed: the difference of the GCN output values at two levels is taken as part of the inter-frame input and then "fed" to the two-dimensional convolutional network, so the computer learns the differences better, similarly to a feedforward control mechanism. The network model therefore achieves high cross-subject and cross-view recognition accuracy on the NTU-RGB+D data set.
2. Compared with the traditional approach of stacking many graph convolution layers, the method reaches high recognition accuracy with only three graph convolution layers, occupies less computer memory, and keeps the total model parameters below 1M.
3. The training scheme is novel: first determine the initial network model architecture and find the optimal training parameters, then fix the training parameters and fine-tune the network architecture, optimizing parameters and structures at different levels, so that high recognition accuracy is reached more accurately and quickly.
Drawings
Fig. 1 is the training flow chart of the difference graph convolutional neural network.
Fig. 2 is the network architecture diagram of the difference graph convolutional neural network.
Detailed Description
The invention is described in detail below with reference to the figures and the specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.
The embodiment provides a human skeleton action recognition method based on a difference graph convolutional neural network; the training flow chart and the network architecture diagram of the method are shown in fig. 1 and fig. 2, respectively. The method specifically comprises the following steps:
S1, preprocess the skeleton data in the usual way: remove irrelevant skeleton data, repair incomplete data, and normalize the data of each dimension into the [0,1] interval.
S2, preliminarily design the difference graph convolutional neural network architecture: connect same-level graph convolution and two-dimensional convolution in series to enhance the learning capacity of graph convolution, and connect different-level graph convolutions in parallel for difference learning, i.e. use the differences between different-level graph convolutions as feedforward input for consecutive time frames.
S3, preliminarily select training parameters such as the loss function loss, learning rate lr, batch size batch_size, and iteration count epoch, and perform error back-propagation.
S4, after the parameters are set, perform cross-view (CV) and cross-subject (CS) training and testing on the skeleton data set NTU-RGB+D60. The data set adopted in this embodiment is NTU-RGB+D60, which contains 56880 data samples over 60 action classes; the first 50 classes are single-person actions, the last 10 classes are two-person interactions, and the skeleton data has already been extracted.
S5, fine-tune the training parameters according to the test accuracy, and repeat S3 and S4 until the CV and CS test accuracy reaches the expected level, then save the corresponding training parameters.
S6, fix the training parameters and fine-tune the network architecture, e.g. the convolution kernel size of the two-dimensional convolutions, the pooling mode, and the activation function, repeating step S4 until the test accuracy reaches the expected level, thereby obtaining a more accurate network architecture.
In the implementation of step S1, when skeleton data does not meet the specification, its scale is outside the set range, and so on, the nonconforming data must be removed; new skeleton data is then generated by interpolation, and normalization is performed with BatchNorm2d() in the deep learning framework PyTorch.
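A hedged sketch of repairing an incomplete frame by interpolation, as the step above describes. The patent does not specify the interpolation scheme, so simple linear interpolation between the two neighbouring frames is assumed here, and the sample coordinates are invented for illustration.

```python
def interpolate_frame(prev_frame, next_frame, t=0.5):
    """Fill a missing skeleton frame as a pointwise blend of its neighbours."""
    return [(1 - t) * p + t * n for p, n in zip(prev_frame, next_frame)]

prev_f = [0.0, 2.0, 4.0]   # flattened joint coordinates at frame k-1 (assumed)
next_f = [2.0, 4.0, 8.0]   # flattened joint coordinates at frame k+1 (assumed)
print(interpolate_frame(prev_f, next_f))   # midpoint frame -> [1.0, 3.0, 6.0]
```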
In the implementation of step S2, the network architecture design is completed: the input data is processed by a "series" combination of graph convolution (GCN) and convolutional network (CNN), the results of the two groups of neural networks are then summed, and the difference between the two results is extracted as feedforward for the inter-frame input.
In the implementation of step S3, the preliminary network architecture is fixed and the training parameters are selected for the first time: the learning rate is set to 0.001, the cross-entropy function is used as the loss function, batch_size is set to 64, epoch is set to 140 with every 580 batches counted as one epoch, and the parameters with the minimum loss function during training are selected as the test model.
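The hyper-parameters fixed in the step above can be gathered in one place, together with a from-scratch cross-entropy loss for a single sample. The dict keys and the helper name are our own naming conventions; only the numeric values (0.001, 64, 140, 580) come from the text, and the patent itself uses PyTorch's built-in cross-entropy rather than this illustration.

```python
import math

hparams = {
    "lr": 0.001,               # initial learning rate
    "batch_size": 64,          # samples per training batch
    "epochs": 140,             # iteration count
    "batches_per_epoch": 580,  # every 580 batches counted as one epoch
}

def cross_entropy(probs, label):
    """Cross-entropy loss of one prediction: -log probability of the true class."""
    return -math.log(probs[label])

# A uniform guess over the 60 NTU action classes costs -log(1/60) ~ 4.094,
# while a confident correct prediction costs close to zero.
uniform = [1.0 / 60] * 60
print(round(cross_entropy(uniform, 0), 3))   # 4.094
```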
In the implementation of the step S4, iterative training is performed multiple times, with CV and CS tests completed after each training run; training parameters such as the learning rate, loss function, batch_size, and epoch are adjusted according to the test accuracy and training is repeated, finally yielding the training parameters required for higher recognition accuracy and completing the optimization of the training parameters.
In the implementation of step S5, the training parameters are fixed and the network structure is refined: the structure is locally fine-tuned by adjusting the convolution kernel size, the activation function, and the pooling mode (for example, max pooling versus average pooling), and training is repeated several times until the test accuracy reaches a high level, completing the optimization of the network structure.
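A small plain-Python illustration of the two pooling modes mentioned above, using window size 2 and stride 2 over a one-dimensional feature row. The network in the patent pools two-dimensional feature maps, but the principle is the same; the function and data here are illustrative assumptions.

```python
def pool(row, window, reduce):
    """Apply `reduce` (max or mean) over non-overlapping windows of `row`."""
    return [reduce(row[i:i + window]) for i in range(0, len(row) - window + 1, window)]

mean = lambda xs: sum(xs) / len(xs)

row = [1.0, 3.0, 2.0, 5.0]
print(pool(row, 2, max))    # max pooling     -> [3.0, 5.0]
print(pool(row, 2, mean))   # average pooling -> [2.0, 3.5]
```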
As shown in fig. 2, the structure of the difference graph convolutional neural network employed in the invention is described as follows:
the difference map convolutional neural network mainly comprises two parts, wherein the first part (1) is formed by connecting two groups of GCN-CNN series structures with completely same network parameters in parallel, the GCN is a map convolutional network, the CNN is a two-dimensional convolutional network, the size of a convolutional kernel is 1 multiplied by 1, the second part (2) is formed by convolving two groups of two-dimensional convolutional networks by the output difference values of two completely same GCNs, and symbols are formed by convolving the two groups of two-dimensional convolutional networks
Figure BDA0003185837050000051
And the difference value represents the difference value of the input skeleton point data processed by two identical GCN networks, the difference value is used as the output of the second part after two times of CNN convolution and ReLU activation, and the final output of the network is obtained by adding the two parts. Because the network input of the invention is three-dimensional skeleton points and the output of the network input is the next-stage continuous interframe input, the error learning of the second part can be used as a positive feedforward signal of interframe input, and the error accumulation is avoided by continuously correcting interframe input errors, thereby achieving better identification precision. In addition, the convolution kernel size of the second-time CNN of the second part is 3 x 3, and the sizes of the rest CNNs in the network are all 1 x 1.
Compared with existing graph convolution action recognition methods, the human skeleton action recognition method based on the difference graph convolutional neural network has three main innovations. First, high accuracy: connecting the graph convolution GCN and the two-dimensional convolution CNN in series improves the learning capacity of the GCN; by analogy with the bridge-balance concept, the notion of "difference learning" is proposed, taking the difference of the GCN output values at two levels as part of the inter-frame skeleton data input and then feeding it to the two-dimensional convolutional network, so the computer learns the differences better, similarly to a feedforward control mechanism; the network model therefore achieves high cross-subject and cross-view recognition accuracy on the NTU-RGB+D60 data set. Second, lightness: compared with traditional graph convolution algorithms that stack many graph convolution modules, the method uses only three graph convolution layers, and the total model parameters stay below 1M. Third, a novel training scheme: high recognition accuracy is achieved by cross-optimizing the training parameters and the network model parameters, i.e. first determining the preliminary network model architecture and finding the optimal training parameters, then fixing the training parameters and fine-tuning the network architecture.
TABLE 1 Action recognition accuracy comparison
Method Year CS CV
ST-GCN 2018 81.5 88.3
ASGCN 2019 86.8 94.2
2s-AGCN 2019 88.5 95.1
SGN 2020 89.0 94.5
LAGA 2021 87.0 93.2
OURS - 89.5 94.9
Table 1 compares the CS and CV test accuracies on NTU RGB+D60 of recent high-accuracy graph convolutional networks. The CV and CS test accuracies of the difference graph convolutional neural network of the invention exceed those of ST-GCN and ASGCN; its CV accuracy is 0.2% below 2s-AGCN, but its CS accuracy is 1% higher, and its parameter count is only 0.96M, far below the 3.47M of 2s-AGCN. In addition, it outperforms the recently proposed SGN and LAGA networks in both CS and CV test accuracy.
The preferred embodiments of the invention have been described in detail above. It should be understood that those skilled in the art could devise numerous modifications and variations in light of the present teachings without departing from the inventive concept. Therefore, technical solutions obtainable by those skilled in the art through logical analysis, reasoning, and limited experimentation based on the prior art and the concept of the invention should fall within the scope of protection defined by the claims.

Claims (10)

1. A human skeleton action recognition method based on a difference graph convolutional neural network, used to improve skeleton-based action recognition accuracy while keeping the network structure lightweight, characterized by comprising the following steps:
S1, preprocessing the skeleton data: eliminating irrelevant skeleton data, repairing incomplete data, and performing normalization;
S2, preliminarily designing the difference graph convolutional neural network architecture: combining graph convolution with convolution, and adopting a difference-learning scheme in which the differences between graph convolutions at different levels serve as feedforward input for consecutive time frames;
S3, initially selecting training parameters, including the loss function loss, learning rate lr, training-sample batch size batch_size, and iteration count epoch, and performing error back-propagation;
S4, after the parameters are set, performing cross-view and cross-subject training and testing on the skeleton data set;
S5, fine-tuning the training parameters according to the test accuracy, and repeating steps S3 and S4 until the cross-view and cross-subject test accuracy reaches the expected level, obtaining more accurate training parameters;
S6, fixing the training parameters, fine-tuning the network architecture, and repeating step S4 until the test accuracy reaches the expected level, obtaining more accurate network architecture parameters, then recognizing the human skeleton actions.
2. The human skeleton action recognition method based on the difference graph convolutional neural network as claimed in claim 1, wherein step S1 specifically comprises:
after removing the irrelevant skeleton data, generating new skeleton data by interpolation, and performing normalization with BatchNorm2d() in the PyTorch deep learning framework.
3. The method as claimed in claim 2, wherein the irrelevant skeleton data comprises data that does not meet the specification and data whose scale is outside the set range.
4. The human skeleton action recognition method based on the difference graph convolutional neural network as claimed in claim 2, wherein the data of each dimension in the skeleton data is normalized into the [0,1] interval.
5. The human skeleton action recognition method based on the difference graph convolutional neural network as claimed in claim 1, wherein step S2 specifically comprises:
preliminarily designing the difference graph convolutional neural network architecture: processing the input skeleton-point data by connecting same-level graph convolution and convolutional networks in series to enhance the learning capacity of graph convolution, and performing difference learning by connecting different-level graph convolutions in parallel, i.e. applying a convolution operation to the output differences of different-level graph convolutions to serve as continuous inter-frame feedforward input.
6. The human skeleton action recognition method based on the difference graph convolutional neural network as claimed in claim 1, wherein in step S3 the preliminary selection of training parameters specifically comprises:
setting the initial learning rate lr to 0.001, using the cross-entropy function as the loss function, setting the training-sample batch size batch_size to 64 and the iteration count epoch to 140, and selecting the parameters with the minimum loss function during training as the test model.
7. The human skeleton action recognition method based on the difference graph convolutional neural network as claimed in claim 1, wherein in step S4 the skeleton data set is specifically the NTU-RGB+D60 data set.
8. The human skeleton action recognition method based on the difference graph convolutional neural network as claimed in claim 1, wherein in step S5, after training finishes, the cross-view and cross-subject tests are completed; the training parameters, including the learning rate, loss function, batch_size, and epoch, are adjusted according to the test accuracy and training is repeated several times, finally obtaining the most accurate training parameters and completing the optimization of the training parameters.
9. The human skeleton action recognition method based on the difference graph convolutional neural network as claimed in claim 1, wherein in step S6, fine-tuning the network architecture specifically comprises:
changing the convolution kernel size, the activation function, and the pooling mode in the network, where the pooling mode includes max pooling and average pooling.
10. The method as claimed in claim 1, wherein in step S2 the difference graph convolutional neural network consists of two parts: the first part connects in parallel two groups of GCN-CNN series structures with identical network parameters, where GCN is a graph convolutional network and CNN is a two-dimensional convolutional network with a 1×1 convolution kernel; the second part convolves, with two groups of two-dimensional convolutional networks, the output difference of two identical GCNs, the kernel size of the second CNN in this part being 3×3; the difference of the input three-dimensional skeleton-point data processed by the two identical graph convolutional networks GCN undergoes two CNN convolutions and ReLU activations in turn as the output of the second part, and the final network output is the sum of the two parts' outputs; the input of the difference graph convolutional neural network is three-dimensional skeleton points and the output is the continuous inter-frame input of the next stage; the error learned by the second part serves as a positive feedforward signal for the inter-frame input, and the inter-frame input error is continuously corrected, thereby avoiding error accumulation and improving recognition accuracy.
CN202110861474.5A 2021-07-29 2021-07-29 Human skeleton action recognition method based on difference map convolutional neural network Active CN113505751B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110861474.5A CN113505751B (en) 2021-07-29 2021-07-29 Human skeleton action recognition method based on difference map convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110861474.5A CN113505751B (en) 2021-07-29 2021-07-29 Human skeleton action recognition method based on difference map convolutional neural network

Publications (2)

Publication Number Publication Date
CN113505751A CN113505751A (en) 2021-10-15
CN113505751B true CN113505751B (en) 2022-10-25

Family

ID=78015075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110861474.5A Active CN113505751B (en) 2021-07-29 2021-07-29 Human skeleton action recognition method based on difference map convolutional neural network

Country Status (1)

Country Link
CN (1) CN113505751B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114360068B (en) * 2022-01-12 2024-06-25 安徽师范大学 Skeletal action recognition method based on multi-stream grouping shuffling graph convolution neural network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109460734A (en) * 2018-11-08 2019-03-12 山东大学 The video behavior recognition methods and system shown based on level dynamic depth projection difference image table
CN110310351A (en) * 2019-07-04 2019-10-08 北京信息科技大学 A kind of 3 D human body skeleton cartoon automatic generation method based on sketch
CN112084934A (en) * 2020-09-08 2020-12-15 浙江工业大学 Behavior identification method based on two-channel depth separable convolution of skeletal data
CN112906604A (en) * 2021-03-03 2021-06-04 安徽省科亿信息科技有限公司 Behavior identification method, device and system based on skeleton and RGB frame fusion

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10565707B2 (en) * 2017-11-02 2020-02-18 Siemens Healthcare Gmbh 3D anisotropic hybrid network: transferring convolutional features from 2D images to 3D anisotropic volumes
CN109446898B (en) * 2018-09-20 2021-10-15 Jinan University Pedestrian re-identification method based on transfer learning and feature fusion
CN110059620B (en) * 2019-04-17 2021-09-03 Anhui Airuisi Intelligent Technology Co., Ltd. Skeletal behavior recognition method based on spatio-temporal attention
CN111914606A (en) * 2019-05-10 2020-11-10 Shanghai Shentong Intelligent Technology Co., Ltd. Smoke detection method based on deep learning of spatio-temporal transmissivity characteristics
CN110215216B (en) * 2019-06-11 2020-08-25 Institute of Automation, Chinese Academy of Sciences Behavior recognition method and system based on partitioning and hierarchy of skeletal joint points
CN111652124A (en) * 2020-06-02 2020-09-11 University of Electronic Science and Technology of China Construction method of a human behavior recognition model based on a graph convolutional network
CN111666890B (en) * 2020-06-08 2023-06-30 Ping An Technology (Shenzhen) Co., Ltd. Method, device, computer equipment and storage medium for identifying people with spinal deformity
CN112395945A (en) * 2020-10-19 2021-02-23 Beijing Institute of Technology Graph convolution behavior recognition method and device based on skeletal joint points
CN113128424B (en) * 2021-04-23 2024-05-03 Zhejiang Sci-Tech University Action recognition method using an attention-based graph convolutional neural network

Also Published As

Publication number Publication date
CN113505751A (en) 2021-10-15

Similar Documents

Publication Publication Date Title
CN111192237B (en) Deep learning-based glue spreading detection system and method
CN110135503B (en) Deep learning identification method for parts of assembly robot
US20190228268A1 (en) Method and system for cell image segmentation using multi-stage convolutional neural networks
CN108416266B (en) Method for rapidly identifying video behaviors by extracting moving object through optical flow
WO2019060670A1 (en) Compression of sparse deep convolutional network weights
CN111191583A (en) Space target identification system and method based on convolutional neural network
CN109005398B (en) Stereo image parallax matching method based on convolutional neural network
CN113034483B (en) Cigarette defect detection method based on deep migration learning
CN113505751B (en) Human skeleton action recognition method based on difference map convolutional neural network
Gao et al. VACL: Variance-aware cross-layer regularization for pruning deep residual networks
CN112749675A (en) Potato disease identification method based on convolutional neural network
CN105469110A (en) Non-rigid transformation image characteristic matching method based on local linear transfer and system
CN113033547A (en) Welding state classification method based on MobileNet V2
CN111199255A Small target detection network model and detection method based on the Darknet-53 network
CN115171074A Vehicle target identification method based on a multi-scale YOLO algorithm
CN115131558A (en) Semantic segmentation method under less-sample environment
CN113689383B (en) Image processing method, device, equipment and storage medium
CN110544249A (en) Convolutional neural network quality identification method for arbitrary-angle case assembly visual inspection
CN111027570A (en) Image multi-scale feature extraction method based on cellular neural network
CN114067171A (en) Image recognition precision improving method and system for overcoming small data training set
WO2021158830A1 (en) Rounding mechanisms for post-training quantization
CN117173595A (en) Unmanned aerial vehicle aerial image target detection method based on improved YOLOv7
Rong et al. Soft Taylor pruning for accelerating deep convolutional neural networks
CN115330759A (en) Method and device for calculating distance loss based on Hausdorff distance
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant