CN108573232B - Human body action recognition method based on convolutional neural network - Google Patents

Human body action recognition method based on convolutional neural network

Info

Publication number
CN108573232B
CN108573232B (application CN201810345479.0A)
Authority
CN
China
Prior art keywords: convolutional neural network, dimensional, groups, layer
Prior art date
Legal status
Expired - Fee Related
Application number
CN201810345479.0A
Other languages
Chinese (zh)
Other versions
CN108573232A (en)
Inventor
张良 (Zhang Liang)
李玉鹏 (Li Yupeng)
刘婷婷 (Liu Tingting)
Current Assignee
Civil Aviation University of China
Original Assignee
Civil Aviation University of China
Priority date
Filing date
Publication date
Application filed by Civil Aviation University of China
Priority to CN201810345479.0A
Publication of CN108573232A
Application granted
Publication of CN108573232B
Expired - Fee Related
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

A human body action recognition method based on a convolutional neural network. Part of the depth images in a data set are selected as training samples and the remaining depth images are used as test samples; the four-dimensional information of the depth images in the data set is mapped to a two-dimensional space by the spatial structure dynamic depth image technique to obtain two-dimensional images. A convolutional neural network is constructed and trained with the two-dimensional images of the training samples. The two-dimensional images of the test samples are then input into the trained convolutional neural network to obtain three groups of output vectors; intra-group fusion and then inter-group fusion are performed, and finally the human body action recognition is completed. The method can serve as a basis for pattern recognition and artificial intelligence and is of great significance for human action recognition.

Description

Human body action recognition method based on convolutional neural network
Technical Field
The invention belongs to the technical field of human body action recognition, and particularly relates to a human body action recognition method based on a convolutional neural network.
Background
At present, human body action recognition is widely applied in intelligent monitoring, human-computer interaction, video retrieval, virtual reality and other areas, so it has long been an active research direction in the field of computer vision. Much earlier research on human action recognition focused on conventional RGB color video. In recent years, the release of the Microsoft Kinect has brought new opportunities to the field. The Kinect device can acquire depth images in real time, and compared with traditional color images, depth images have many advantages: a depth image sequence is inherently a four-dimensional space (three spatial dimensions plus time), can contain richer motion information, is insensitive to changes in illumination conditions, and allows the human contour and skeleton to be estimated more reliably. Current research on human action recognition based on depth images mainly seeks to map the four-dimensional information of an action to a two-dimensional space by designing effective feature representations, in the hope of expressing the important features of the action in the two-dimensional space as completely as possible and thereby improving recognition accuracy. However, after the depth information of an action is mapped to a two-dimensional representation, actions are easily confused during classification, which limits the upper bound of the recognition rate of such methods.
Disclosure of Invention
In order to overcome the above problems, the object of the invention is to provide a human body action recognition method based on a convolutional neural network which can recognize human body actions, classify actions more accurately, and offers a high recognition rate and strong robustness.
In order to achieve the above object, the method for recognizing human body actions based on convolutional neural network provided by the present invention comprises the following steps performed in sequence:
(1) selecting part of the depth images in the data set as training samples and using the remaining depth images as test samples, and then mapping the four-dimensional information of the depth images in the data set to a two-dimensional space by adopting the spatial structure dynamic depth image technique to obtain two-dimensional images;
(2) constructing a convolutional neural network;
(3) training the convolutional neural network by using the two-dimensional image in the training sample;
(4) inputting the two-dimensional images in the test sample into the trained convolutional neural network to obtain three groups of output vectors, then performing intra-group fusion and inter-group fusion, and finally completing human body action recognition.
In step (1), the method for mapping the four-dimensional information of the depth images in the data set to the two-dimensional space by using the spatial structure dynamic depth image technique to obtain the two-dimensional images is as follows: each depth image in the data set is converted into 6 different two-dimensional images by the spatial structure dynamic depth image technique; the 6 two-dimensional images are divided into 3 groups, namely torso, limbs and joints, and each group consists of two two-dimensional images, DDIF and DDIB, respectively.
In step (2), the method for constructing the convolutional neural network is as follows: the network has 12 layers, which are, in sequence, convolutional layer conv1, pooling layer pool1, convolutional layer conv2, pooling layer pool2, convolutional layer conv3, convolutional layer conv4, convolutional layer conv5, pooling layer pool5, fully connected layer fc6, fully connected layer fc7, fully connected layer fc8 and a classification layer; the classification layer adopts the combination of a cross-entropy loss function and a center loss function as a joint cost function, so as to add a distance constraint between the action sample features and their class centers.
In step (3), the method for training the convolutional neural network with the two-dimensional images in the training samples is as follows: the 3 groups of two-dimensional images obtained by converting the depth images in the training samples are respectively input into 3 convolutional neural networks, and the 3 convolutional neural networks are trained with these two-dimensional images respectively.
In step (4), the method of inputting the two-dimensional images in the test sample into the trained convolutional neural networks to obtain three groups of output vectors, performing intra-group fusion and then inter-group fusion, and finally completing human body action recognition is as follows: the 3 groups of 6 two-dimensional images obtained by converting each depth image in the test sample are respectively input into the corresponding 3 groups of trained convolutional neural networks, so that each group yields 2 output vectors; the corresponding dimensions of the 2 vectors of each group are averaged, which is the intra-group fusion; the corresponding dimensions of the 3 group vectors are then averaged, which is the inter-group fusion, giving the final vector; the index of the maximum value in the final vector is the human action category to be recognized.
Compared with the prior art, the human body action recognition method based on the convolutional neural network has the beneficial effects that:
In the present invention, a distance constraint between the action sample features and the class centers is added during network training, so that intra-class compactness and inter-class separation of the action features are both taken into account; this guides the convolutional neural network to learn more discriminative features and makes the subsequent classification more accurate. Experimental results on several data sets show that the method of the invention significantly improves the accuracy and robustness of human body action recognition. The method can serve as a basis for pattern recognition and artificial intelligence and is of great significance for human action recognition.
Drawings
Fig. 1 is a flowchart of a human body motion recognition method based on a convolutional neural network provided by the present invention.
Fig. 2 is a schematic diagram of a convolutional neural network structure.
Detailed Description
The present invention will be described in further detail with reference to examples.
As shown in fig. 1, the method for recognizing human body actions based on convolutional neural network provided by the present invention comprises the following steps performed in sequence:
(1) selecting part of the depth images in the data set as training samples and using the remaining depth images as test samples, then mapping the four-dimensional information of the depth images in the data set to a two-dimensional space with the spatial structure dynamic depth image technique to obtain two-dimensional images for subsequent classification, thereby converting the human body action recognition problem into an image classification problem;
each depth image can be converted into 6 different two-dimensional images using the spatial structure dynamic depth image (SSDDI) technique; the 6 two-dimensional images are divided into 3 groups, namely torso, limbs and joints, and each group is composed of two two-dimensional images, DDIF and DDIB, respectively, as shown in fig. 1.
In particular, the SSDDI technique relies on rank pooling (RP), which is briefly described below. Let a depth sequence of $k$ frames be denoted $X = \langle x_1, x_2, \ldots, x_t, \ldots, x_k \rangle$, let $v_t$ denote the feature vector extracted from frame $x_t$, and let

$$V_t = \frac{1}{t} \sum_{\tau=1}^{t} v_\tau$$

denote the mean of the feature vectors of the first $t$ frames. Each time step $t$ is assigned a score $r_t = w^T \cdot V_t$, and the score is required to increase with time, i.e. the score function $r_t$ satisfies the constraint $\forall i > j:\ r_i > r_j$.

The purpose of rank pooling is to find the parameter $w^*$ that satisfies the objective function (1):

$$w^* = \operatorname*{arg\,min}_{w}\ \frac{1}{2}\|w\|^2 + C \sum_{i>j} \varepsilon_{ij}$$

$$\text{s.t.}\quad w^T \cdot (V_i - V_j) \ge 1 - \varepsilon_{ij},\quad \varepsilon_{ij} \ge 0 \qquad (1)$$

where $\varepsilon_{ij}$ is a small non-negative slack variable. The parameter $w^*$ ranks $V_t$ before $V_{t+1}$ and thus captures the temporal ordering information of the image sequence, so it can be used as a feature descriptor of the depth sequence. It is a two-dimensional image of the same scale as the input frames that contains the spatio-temporal variation of the whole action in the forward direction, and is therefore called the forward dynamic depth image (DDIF); performing the RP operation after reversing the temporal order of the depth sequence yields the backward dynamic depth image (DDIB).
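As a concrete illustration, the following is a minimal rank-pooling sketch in Python, assuming per-frame feature vectors have already been extracted (for SSDDI, the flattened pixels of a depth region can serve this role). It learns $w^*$ by training a linear SVM on pairwise differences of the time-averaged features; the function and parameter names are illustrative, not from the patent.

```python
import numpy as np
from sklearn.svm import LinearSVC

def rank_pool(frames, C=1.0):
    """Rank pooling: learn w* that orders the time-averaged frame
    features V_1, ..., V_k by their temporal index, as in objective (1).

    frames: array of shape (k, d), one feature vector v_t per frame.
    Returns w* of shape (d,), the dynamic-image descriptor.
    """
    k, _ = frames.shape
    # V_t = mean of the first t frame feature vectors.
    V = np.cumsum(frames, axis=0) / np.arange(1, k + 1)[:, None]

    # Pairwise differences V_i - V_j for i > j, plus mirrored negatives
    # so the ranking problem becomes a binary classification problem.
    diffs, labels = [], []
    for i in range(k):
        for j in range(i):
            diffs.append(V[i] - V[j]); labels.append(+1)
            diffs.append(V[j] - V[i]); labels.append(-1)

    # Linear SVM (squared-hinge variant of the constraints in (1)).
    svm = LinearSVC(C=C, fit_intercept=False)
    svm.fit(np.asarray(diffs), np.asarray(labels))
    return svm.coef_.ravel()

# DDIF uses the original frame order; DDIB reverses it first:
#   ddif = rank_pool(features)         # forward dynamic depth image
#   ddib = rank_pool(features[::-1])   # backward dynamic depth image
```

In practice the resulting $w^*$ is reshaped back to the spatial dimensions of the input region to form the dynamic depth image.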
SSDDI applies the RP operation to local regions of the depth sequence at 3 levels, namely the torso, limbs and joints, and finally recombines the dynamic depth images of the individual levels.
(2) Constructing a convolutional neural network;
as shown in fig. 2, the convolutional neural network has 12 layers, and these 12 layers are convolutional layer conv1, pooling layer pool1, convolutional layer conv2, pooling layer pool2, convolutional layer conv3, convolutional layer conv4, convolutional layer conv5, pooling layer pool5, full connection layer fc6, full connection layer fc7, full connection layer fc8 and classification layer in sequence.
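The patent text names the 12 layers but not their dimensions. The sketch below is one plausible PyTorch rendering, with AlexNet-style kernel sizes, channel counts and input size assumed for illustration; it returns both the fc7 depth feature (used by the center loss below) and the fc8 logits.

```python
import torch
import torch.nn as nn

class ActionCNN(nn.Module):
    """12-layer network named in the description: conv1, pool1, conv2,
    pool2, conv3, conv4, conv5, pool5, fc6, fc7, fc8, classification.
    Kernel sizes and channel counts are AlexNet-style assumptions (the
    patent text does not specify them); a 3-channel 227x227 input is
    assumed. The classification layer itself is realized by the loss
    functions during training and by argmax at test time."""

    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(inplace=True),    # conv1
            nn.MaxPool2d(3, stride=2),                                # pool1
            nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(inplace=True),  # conv2
            nn.MaxPool2d(3, stride=2),                                # pool2
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(inplace=True), # conv3
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(inplace=True), # conv4
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(inplace=True), # conv5
            nn.MaxPool2d(3, stride=2),                                # pool5
        )
        self.fc6 = nn.Linear(256 * 6 * 6, 4096)
        self.fc7 = nn.Linear(4096, 4096)
        self.fc8 = nn.Linear(4096, num_classes)

    def forward(self, x):
        x = torch.flatten(self.features(x), 1)
        feat = torch.relu(self.fc7(torch.relu(self.fc6(x))))  # depth feature x_i
        return feat, self.fc8(feat)                           # (x_i, logits)
```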
In order to guide the convolutional neural network to learn more discriminative features and make the subsequent classification more accurate, the classification layer adopts the combination of a cross-entropy loss function and a center loss function as a joint cost function; this adds a distance constraint between the action sample features and their class centers during network training, so that intra-class compactness and inter-class separation of the action features are both taken into account. The cross-entropy loss function is expressed as:
$$L_S = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^T x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^T x_i + b_j}} \qquad (2)$$

where $x_i \in \mathbb{R}^d$ denotes the $i$-th depth feature, belonging to class $y_i$; $d$ is the dimension of the depth feature; $W \in \mathbb{R}^{d \times n}$ is the weight matrix of the last fully connected layer fc8; $b \in \mathbb{R}^n$ is the bias term; $W_j \in \mathbb{R}^d$ denotes the $j$-th column of the weight matrix $W$; and $m$ and $n$ are the number of training samples per batch and the corresponding number of classes, respectively. The center loss function is expressed as:

$$L_C = \frac{1}{2} \sum_{i=1}^{m} \left\| x_i - c_{y_i} \right\|_2^2 \qquad (3)$$

$$\Delta c_j = \frac{\sum_{i=1}^{m} \delta(y_i = j)\,(c_j - x_i)}{1 + \sum_{i=1}^{m} \delta(y_i = j)} \qquad (4)$$

$$c_j^{t+1} = c_j^{t} - \alpha \cdot \Delta c_j^{t} \qquad (5)$$

In formula (3), $c_{y_i}$, which can be calculated by formulas (4) and (5), denotes the class center of the depth features of class $y_i$ and is updated as the depth features $x_i$ change; in formula (4), $\delta(\text{condition}) = 1$ if the condition holds and $0$ otherwise. The parameter $\alpha \in [0, 1]$ controls the update rate of the centers; the neural network can be further tuned by adjusting it, so it is a hyper-parameter.

The joint cost function is then expressed as:

$$L = L_S + \lambda L_C \qquad (6)$$

where $\lambda$ is a hyper-parameter introduced to control the ratio of the two loss functions; when $\lambda = 0$, the joint cost function degrades into the cross-entropy loss function.
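For illustration, a minimal PyTorch sketch of the center loss with the manual center update (4)-(5) and the joint cost (6) follows; the class and function names are illustrative. Note that `F.cross_entropy` averages over the batch rather than summing as in (2), which only rescales the loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterLoss(nn.Module):
    """Center loss (3) with the manual center update (4)-(5)."""

    def __init__(self, num_classes, feat_dim, alpha=0.5):
        super().__init__()
        self.alpha = alpha  # hyper-parameter alpha in [0, 1], formula (5)
        # One center c_j per class, updated manually rather than by autograd.
        self.register_buffer("centers", torch.zeros(num_classes, feat_dim))

    def forward(self, feats, labels):
        # L_C = 1/2 * sum_i ||x_i - c_{y_i}||^2        -- formula (3)
        return 0.5 * (feats - self.centers[labels]).pow(2).sum()

    @torch.no_grad()
    def update_centers(self, feats, labels):
        for j in labels.unique():
            mask = labels == j
            # delta c_j, averaged over the samples of class j -- formula (4)
            delta = (self.centers[j] - feats[mask]).sum(0) / (1 + mask.sum())
            self.centers[j] -= self.alpha * delta      # formula (5)

def joint_cost(logits, feats, labels, center_loss, lam=0.01):
    # L = L_S + lambda * L_C -- formula (6); F.cross_entropy realizes (2).
    return F.cross_entropy(logits, labels) + lam * center_loss(feats, labels)
```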
(3) Training the convolutional neural network by using the two-dimensional image in the training sample;
and respectively inputting 3 groups of two-dimensional images obtained after the conversion of the depth images in the training samples into 3 convolutional neural networks, and respectively training the 3 convolutional neural networks by using the two-dimensional images.
(4) Inputting the two-dimensional images in the test sample into the trained convolutional neural networks to obtain three groups of output vectors, then performing intra-group fusion and inter-group fusion, and finally completing human body action recognition.
During testing, the center loss in the classification layer is disabled, i.e. the hyper-parameter λ is set to 0. The 3 groups of 6 two-dimensional images obtained by converting each depth image in the test sample are respectively input into the corresponding 3 groups of trained convolutional neural networks, so that each group yields 2 output vectors; the corresponding dimensions of the 2 vectors of each group are averaged to obtain the intra-group fusion. Specifically, after the two two-dimensional images DDIF and DDIB of a torso-group test sample are input into the corresponding convolutional neural network, 2 output results in the form of multidimensional vectors are obtained; these two vectors are mean-fused as the output vector of the group, i.e. intra-group fusion, and the limbs and joints groups are treated in the same way. The corresponding dimensions of the 3 group vectors are then averaged, i.e. inter-group fusion, giving the final vector; the index of the maximum value in the final vector is the human action category to be recognized.
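A minimal sketch of this two-stage fusion, reusing the trained `nets` from the training sketch above; names and tensor shapes are illustrative.

```python
import torch

def classify(sample_images, nets):
    """Step (4): intra-group then inter-group mean fusion of the
    network outputs; returns the index of the recognized action.
    sample_images maps each group name ("torso", "limbs", "joints")
    to its (DDIF, DDIB) image tensor pair."""
    group_vecs = []
    for group, (ddif, ddib) in sample_images.items():
        net = nets[group].eval()
        with torch.no_grad():
            _, s_f = net(ddif.unsqueeze(0))    # output vector for DDIF
            _, s_b = net(ddib.unsqueeze(0))    # output vector for DDIB
        group_vecs.append((s_f + s_b) / 2)     # intra-group fusion
    final = torch.stack(group_vecs).mean(0)    # inter-group fusion
    return final.argmax().item()               # action category index
```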

Claims (4)

1. A human body action recognition method based on a convolutional neural network is characterized in that: the human body action recognition method based on the convolutional neural network comprises the following steps of sequentially carrying out:
(1) selecting partial depth images in the data set as training samples, using the rest depth images as test samples, and then mapping four-dimensional information of the depth images in the data set to a two-dimensional space by adopting a space structure dynamic depth image technology to obtain a two-dimensional image;
(2) constructing a convolutional neural network;
(3) training the convolutional neural network by using the two-dimensional image in the training sample;
(4) inputting two-dimensional images in a test sample into the trained convolutional neural network to obtain three groups of output vectors, then performing intra-group fusion, performing inter-group fusion, and finally completing human body action recognition;
in step (2), the method for constructing the convolutional neural network is as follows: the network has 12 layers, which are, in sequence, convolutional layer conv1, pooling layer pool1, convolutional layer conv2, pooling layer pool2, convolutional layer conv3, convolutional layer conv4, convolutional layer conv5, pooling layer pool5, fully connected layer fc6, fully connected layer fc7, fully connected layer fc8 and a classification layer; the classification layer adopts the combination of a cross-entropy loss function and a center loss function as a joint cost function, so as to add a distance constraint between the action sample features and their class centers.
2. The convolutional neural network-based human motion recognition method of claim 1, wherein: in step (1), the method for mapping the four-dimensional information of the depth images in the data set to the two-dimensional space by using the spatial structure dynamic depth image technique to obtain the two-dimensional images is as follows: each depth image in the data set is converted into 6 different two-dimensional images by the spatial structure dynamic depth image technique; the 6 two-dimensional images are divided into 3 groups, namely torso, limbs and joints, and each group consists of two two-dimensional images, DDIF and DDIB, respectively.
3. The convolutional neural network-based human motion recognition method of claim 1, wherein: in step (3), the method for training the convolutional neural network with the two-dimensional images in the training samples is as follows: the 3 groups of two-dimensional images obtained by converting the depth images in the training samples are respectively input into 3 convolutional neural networks, and the 3 convolutional neural networks are trained with these two-dimensional images respectively.
4. The convolutional neural network-based human motion recognition method of claim 1, wherein: in step (4), the method of inputting the two-dimensional images in the test sample into the trained convolutional neural networks to obtain three groups of output vectors, performing intra-group fusion and then inter-group fusion, and finally completing human body action recognition is as follows: the 3 groups of 6 two-dimensional images obtained by converting each depth image in the test sample are respectively input into the corresponding 3 groups of trained convolutional neural networks, so that each group yields 2 output vectors; the corresponding dimensions of the 2 vectors of each group are averaged, which is the intra-group fusion; the corresponding dimensions of the 3 group vectors are then averaged, which is the inter-group fusion, giving the final vector; the index of the maximum value in the final vector is the human action category to be recognized.
CN201810345479.0A 2018-04-17 2018-04-17 Human body action recognition method based on convolutional neural network Expired - Fee Related CN108573232B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810345479.0A CN108573232B (en) 2018-04-17 2018-04-17 Human body action recognition method based on convolutional neural network


Publications (2)

Publication Number Publication Date
CN108573232A CN108573232A (en) 2018-09-25
CN108573232B (en) 2021-07-23

Family

ID=63574962

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810345479.0A Expired - Fee Related CN108573232B (en) 2018-04-17 2018-04-17 Human body action recognition method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN108573232B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871750B (en) * 2019-01-02 2023-08-18 东南大学 Gait recognition method based on skeleton diagram sequence abnormal joint repair
CN109801636A (en) * 2019-01-29 2019-05-24 北京猎户星空科技有限公司 Training method, device, electronic equipment and the storage medium of Application on Voiceprint Recognition model
CN110633630B (en) * 2019-08-05 2022-02-01 中国科学院深圳先进技术研究院 Behavior identification method and device and terminal equipment
CN111144348A (en) * 2019-12-30 2020-05-12 腾讯科技(深圳)有限公司 Image processing method, image processing device, electronic equipment and storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW201320716A (en) * 2011-11-01 2013-05-16 Acer Inc Dynamic depth adjusting apparatus and method thereof

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103108199A (en) * 2011-11-09 2013-05-15 宏碁股份有限公司 Dynamic depth-of-field adjusting device and method thereof
CN107832700A (en) * 2017-11-03 2018-03-23 全悉科技(北京)有限公司 A kind of face identification method and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Action Recognition with Dynamic Image Networks; Hakan Bilen et al.; IEEE Transactions on Pattern Analysis and Machine Intelligence; 2017-11-02; vol. 40, no. 12; full text *
Rank Pooling for Action Recognition; Basura Fernando et al.; IEEE Transactions on Pattern Analysis and Machine Intelligence; 2016-04-25; vol. 39, no. 4; full text *
Spatially and Temporally Structured Global to Local Aggregation of Dynamic Depth Information for Action Recognition; Yonghong Hou et al.; IEEE Access; 2017-11-11; full text *
Structured Images for RGB-D Action Recognition; Pichao Wang et al.; 2017 IEEE International Conference on Computer Vision Workshops (ICCVW); 2018-01-23; full text *

Also Published As

Publication number Publication date
CN108573232A (en) 2018-09-25

Similar Documents

Publication Publication Date Title
CN108573232B (en) Human body action recognition method based on convolutional neural network
CN108830296B (en) Improved high-resolution remote sensing image classification method based on deep learning
CN108460356B (en) Face image automatic processing system based on monitoring system
Zhou et al. Cad: Scale invariant framework for real-time object detection
CN108734208B (en) Multi-source heterogeneous data fusion system based on multi-mode deep migration learning mechanism
CN107609638B (en) method for optimizing convolutional neural network based on linear encoder and interpolation sampling
CN108509920B (en) CNN-based face recognition method for multi-patch multi-channel joint feature selection learning
CN110135277B (en) Human behavior recognition method based on convolutional neural network
CN109508686B (en) Human behavior recognition method based on hierarchical feature subspace learning
CN112036260B (en) Expression recognition method and system for multi-scale sub-block aggregation in natural environment
CN115100709B (en) Feature separation image face recognition and age estimation method
CN112070010B (en) Pedestrian re-recognition method for enhancing local feature learning by combining multiple-loss dynamic training strategies
CN113743544A (en) Cross-modal neural network construction method, pedestrian retrieval method and system
CN108573241B (en) Video behavior identification method based on fusion features
CN113255602A (en) Dynamic gesture recognition method based on multi-modal data
CN113343974A (en) Multi-modal fusion classification optimization method considering inter-modal semantic distance measurement
CN114329031A (en) Fine-grained bird image retrieval method based on graph neural network and deep hash
CN113505719B (en) Gait recognition model compression system and method based on local-integral combined knowledge distillation algorithm
CN111967326B (en) Gait recognition method based on lightweight multi-scale feature extraction
Jiang et al. Cross-level reinforced attention network for person re-identification
CN113095201A (en) AU degree estimation model establishment method based on self-attention and uncertainty weighted multi-task learning among different regions of human face
CN103793720B (en) A kind of eye locating method and system
CN113014923B (en) Behavior identification method based on compressed domain representation motion vector
CN115496859A (en) Three-dimensional scene motion trend estimation method based on scattered point cloud cross attention learning
CN114694174A (en) Human body interaction behavior identification method based on space-time diagram convolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 2021-07-23