CN111860278A - Human behavior recognition algorithm based on deep learning - Google Patents
- Publication number
- CN111860278A CN111860278A CN202010676134.0A CN202010676134A CN111860278A CN 111860278 A CN111860278 A CN 111860278A CN 202010676134 A CN202010676134 A CN 202010676134A CN 111860278 A CN111860278 A CN 111860278A
- Authority
- CN
- China
- Prior art keywords
- sample
- loss
- predicted
- data set
- behavior recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention provides a human behavior recognition algorithm based on deep learning, comprising the following steps: (1) preprocessing an input video segment; (2) constructing the network model RD3D; (3) defining the loss function, accuracy metric, and optimizer operations; (4) training the network model, which comprises the substeps of (41) initializing parameters; (42) setting the learning rate to 0.0001 and the batch size to 16; (43) computing the loss from the RD3D model's forward-propagation output and the ground-truth labels, and updating the weight parameters by back-propagation; (44) finishing training after 100 epochs; (5) testing the results. The invention pursues recognition accuracy from the feature perspective, overcomes the heavy dependence of current algorithms on particular data sets, reduces sensitivity to the data-set type, and can be applied to any behavior recognition data set.
Description
Technical Field
The invention relates to the technical field of computer vision, and in particular to a human behavior recognition algorithm based on deep learning.
Background
In recent years, with the rise of deep learning and related technologies, deep neural networks have made breakthrough progress in computer vision and other fields. Owing to its end-to-end training, deep learning can learn common features from training data and fit a network suited to the task at hand. Meanwhile, acquiring massive amounts of data has become very easy in modern society, which facilitates the application of deep learning to video understanding, recognition, and related fields.
Traditional methods, by contrast, mainly extract local features (such as HOG, HOF, and MBH) and require strong prior knowledge. Although they consider appearance and motion information, that information is limited to a single frame; the contextual appearance and motion information across frames is ignored, which leads to inaccurate human behavior recognition. Designing an effective behavior recognition algorithm has therefore become particularly important.
Consequently, applying deep learning to human behavior recognition has become a trend. Deep-learning-based behavior recognition methods mainly include two-stream convolutional neural networks, 3D convolutional neural networks, and combinations of convolutional and recurrent neural networks. The present invention improves recognition accuracy on the basis of 3D convolutional networks.
Patent CN 110163133 A, "Human behavior recognition method based on depth residual network", discloses a method that feeds human joint data and depth image data into a ResNet simultaneously. Although recognition accuracy is improved, human joint data and depth images are required as input, so end-to-end learning is impossible, and such data is scarce in daily life. Patent CN 107862275 A, "Human behavior recognition model, construction method thereof and human behavior recognition method", discloses a method that extracts human behavior feature vectors with a 3D convolutional neural network and feeds them into a Coulomb force field, where features of the same class attract and features of different classes repel, so that the vectors cluster by moving relative to one another. It inputs RGB images and optical-flow maps into the network, so it cannot learn end to end; the whole network has only seven layers, of which only three perform feature extraction, so although the computation cost is small, the accuracy is low.
The above methods improve recognition accuracy by conforming to a particular data set, and accuracy cannot be improved by means of RGB images alone. Patent CN 109002808 A, "Human behavior recognition method and system", discloses a method that trains a 3D convolutional neural network with multi-task deep learning, taking continuous video frames of various human behavior attributes and background videos as input, and performs the recognition task after training. It mainly teaches how to construct a data set for multi-task learning so that behavior videos and background videos are distinguished; feature extraction is completed by a plain seven-layer 3D convolutional network, followed by classification. Human behavior recognition is thus still addressed from the data-set perspective.
Disclosure of Invention
In view of the above technical problems, the invention provides a human behavior recognition algorithm based on deep learning, comprising the following steps:
(1) preprocessing an input video segment;
(2) constructing the network model RD3D;
(3) defining the loss function and optimizer operations;
(4) training the network model, comprising the following substeps:
(41) initializing parameters;
(42) setting the learning rate to 0.0001 and the batch size to 16;
(43) computing the loss from the RD3D model's forward-propagation output and the ground-truth labels according to the loss function, and updating the weight parameters by back-propagation;
(44) finishing training after 100 epochs;
(5) testing the results.
Further, in the preprocessing stage of step (1), in order to account for the global motion information of the video, a subsampling algorithm is proposed and adopted to collect n key video frames, improving recognition accuracy. The specific contents are as follows:
a: capture image frames of each video clip at an acquisition rate α (α = 3) to obtain an image data set A for each video;
b: uniformly collect n frames (n = 16) from the image data set A with the subsampling algorithm to serve as key frames of the video clip, and scale the key frames to k × k (k = 224) to form a data set B;
c: divide the data set B into a training set and a test set at a 7:3 ratio for training and testing, where each training sample is a quadruple (anchor, positive, negative, label): the sample to be predicted, another sample of the same class as the sample to be predicted, a sample of a different class, and the class label of the sample to be predicted.
Further, in step (2), in order to improve recognition accuracy, a novel network model RD3D (Residual Dense 3D) is proposed and designed by combining the ideas of feature reuse and shortcut connections. The RD3D model has 134 layers (1 + 4×4 + 6×3 + 2×4 + 1) organized into 6 stages.
Further, step (3) proposes and designs a novel loss function:
F = H(P, Q) + L_re + L_tr
wherein:
the cross entropy H(P, Q) = −Σ_x P(x)·log Q(x) measures the similarity between the predicted and true distributions; the smaller the loss, the more accurate the classification. P is the true sample distribution and Q is the predicted sample distribution;
the L2 regularization loss L_re = (λ/2n)·Σ_i W_i² prevents overfitting, where λ is a penalty factor (λ = 0.009) and n is the number of weights W;
the ternary (triplet) loss L_tr = (1/bs)·Σ_{i=1..bs} max(‖f(x_i) − f(x_i^p)‖₂² − ‖f(x_i) − f(x_i^n)‖₂² + β, 0), where ‖f(x_i) − f(x_i^p)‖₂² is the squared Euclidean distance between the features of x_i and x_i^p, ‖f(x_i) − f(x_i^n)‖₂² the squared Euclidean distance between the features of x_i and x_i^n, f(x) the feature of sample x extracted by RD3D, bs the batch size, x_i the currently predicted sample, x_i^p a sample of the same class as x_i, x_i^n a sample of a different class from x_i, and β the margin between the (x_i, x_i^p) and (x_i, x_i^n) distances (β = 0.2).
While pursuing recognition accuracy, the invention overcomes the heavy dependence of current algorithms on particular data sets: the network structure is designed from the perspective of the extracted human behavior features, is insensitive to the data-set type, and can be applied to any data set.
Drawings
FIG. 1 is the RD3D model of the present invention;
FIG. 2 is the Conv Block structure of the present invention;
FIG. 3 is the ID Block structure of the present invention;
FIG. 4 is a flow chart of the present invention.
Detailed Description
The specific technical solution of the invention is described below with reference to an embodiment.
As shown in FIG. 4, a human behavior recognition algorithm based on deep learning comprises the following steps:
(1) preprocessing an input video segment (this embodiment takes the UCF101 data set as an example);
(2) constructing the network model RD3D;
(3) defining the loss function, accuracy metric, and optimizer operations;
(4) training the network model, comprising the following substeps:
(41) initializing parameters;
(42) setting the learning rate to 0.0001 and the batch size to 16;
(43) computing the loss from the RD3D model's forward-propagation output and the ground-truth labels according to the loss function, and updating the weight parameters by back-propagation;
(44) finishing training after 100 epochs;
(5) testing the results.
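The training procedure of step (4) can be sketched with a toy stand-in: a linear softmax classifier trained by gradient descent with the stated settings (learning rate 0.0001, batch size 16, 100 epochs). RD3D itself and the video pipeline are omitted; the feature dimension 2048 and the 101 UCF101 classes are taken from the description, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, feat_dim, batch_size, lr = 101, 2048, 16, 1e-4

x = rng.normal(size=(batch_size, feat_dim))       # stand-in batch features
y = rng.integers(0, n_classes, batch_size)        # stand-in true labels

W = rng.normal(0.0, 0.01, (feat_dim, n_classes))  # (41) initialize parameters
b = np.zeros(n_classes)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)          # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

losses = []
for epoch in range(100):                          # (44) stop after 100 epochs
    q = softmax(x @ W + b)                        # (43) forward propagation
    loss = -np.log(q[np.arange(batch_size), y]).mean()
    grad = q.copy()                               # gradient w.r.t. the logits
    grad[np.arange(batch_size), y] -= 1.0
    grad /= batch_size
    W -= lr * (x.T @ grad)                        # (43) back-propagation update
    b -= lr * grad.sum(axis=0)
    losses.append(loss)
```

Because the objective is convex and the learning rate small, the loss decreases monotonically over the 100 epochs, mirroring substeps (41)-(44).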
Specifically, the method comprises the following steps:
(1) In the preprocessing stage, in order to account for the global motion information of the video, a subsampling algorithm is proposed and adopted to collect n key frames, improving recognition accuracy. The specific contents are as follows:
a: capture image frames of each video clip at an acquisition rate α (α = 3) to obtain an image data set A for each video;
b: uniformly collect n frames (n = 16) from the image data set A with the subsampling algorithm to serve as key frames of the video clip, and scale the key frames to k × k (k = 224) to form a data set B;
c: divide the data set B into a training set and a test set at a 7:3 ratio for training and testing, where each training sample is a quadruple (anchor, positive, negative, label): the sample to be predicted, another sample of the same class as the sample to be predicted, a sample of a different class, and the class label of the sample to be predicted.
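The preprocessing above can be sketched as follows, under the stated settings (α = 3, n = 16, k = 224). Actual frame grabbing and resizing (e.g. via OpenCV) are replaced by index bookkeeping so only the sampling and quadruple-construction logic is shown; all function and variable names are illustrative, not from the patent.

```python
import numpy as np

def uniform_subsample(num_frames, n=16):
    """Step b: pick n key-frame indices spread uniformly over data set A."""
    step = num_frames / n
    return [int(i * step) for i in range(n)]

def make_quadruple(idx, labels, rng):
    """Build one (anchor, positive, negative, label) training sample."""
    label = labels[idx]
    same = [j for j, l in enumerate(labels) if l == label and j != idx]
    diff = [j for j, l in enumerate(labels) if l != label]
    return idx, int(rng.choice(same)), int(rng.choice(diff)), label

# A clip captured at rate alpha yielding 90 frames -> 16 key-frame indices.
keys = uniform_subsample(90, n=16)

# 7:3 split of data set B into training and test sets.
clips = list(range(100))
cut = int(len(clips) * 0.7)
train, test = clips[:cut], clips[cut:]

# One quadruple drawn from 20 clips spread over 5 classes.
rng = np.random.default_rng(0)
labels = [i % 5 for i in range(20)]
anchor, pos, neg, lab = make_quadruple(0, labels, rng)
```

The positive is always drawn from the anchor's class (excluding the anchor itself) and the negative from a different class, matching the quadruple definition in step c.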
(2) In order to improve recognition accuracy, a novel network model RD3D (Residual Dense 3D) is proposed and designed by combining the ideas of feature reuse and shortcut connections. The structure of the model is shown in FIG. 1; the RD3D model has 127 layers, with the following contents:
a: stage1 consists of Conv3d, BN, ReLU, and MaxPool, where Conv3d has 64 filters, a 3 × 3 × 3 convolution kernel, stride [1,2,2], and SAME padding; the pooling window in MaxPool is 1 × 3 × 3 with stride [1,2,2]. stage1 has an input dimension of [16,16,224,224,3] and an output dimension of [16,16,56,56,64];
b: stage2 consists of one Conv Block4, three ID Block4, and MaxPool. Conv Block4 is composed of a 4-layer 3D convolution group and a two-layer convolution shortcut, connected by channel-wise addition, as shown in FIG. 2; the filter numbers of the 4 convolution layers are 64, 64, 128, and 128, and the convolution kernels are all 3 × 3 × 3. Within a convolution group, the input of each layer is the output of all previous layers in the block. The shortcut has 128 filters with convolution kernels of 1 × 1 × 1 and 3 × 3 × 3, respectively. ID Block4 adds a 4-layer 3D convolution group to the input of the block, as shown in FIG. 3; its filter numbers are likewise 64, 64, 128, and 128 with 3 × 3 × 3 kernels, and within a convolution group the input of each layer is the output of all previous layers in the block. The pooling window in MaxPool is 2 × 2 × 2 with stride [2,2,2]. stage2 has an input dimension of [16,16,56,56,64] and an output dimension of [16,8,28,28,128];
c: stage3 and stage4 have the same composition as stage2; the only differences are the number of layers and filters in each block. In stage3, ConvBlock and IDBlock each have 6 layers with filter numbers 128, 128, 256, 256, 512, 512, and the ConvBlock shortcut has 512 filters; the input dimension of stage3 is [16,8,28,28,128] and the output dimension is [16,4,14,14,512]. In stage4, ConvBlock and IDBlock each have 6 layers with filter numbers 256, 256, 512, 512, 1024, 1024, and the ConvBlock shortcut has 1024 filters; stage4 has an input dimension of [16,4,14,14,512] and an output dimension of [16,2,7,7,1024];
d: stage5 differs from stage2 only in that it has no MaxPool. Its ConvBlock and IDBlock each have 6 layers with filter numbers 512, 512, 1024, 1024, 2048, 2048, and the ConvBlock shortcut has 2048 filters; the input dimension of stage5 is [16,2,7,7,1024] and the output dimension is [16,2,7,7,2048];
e: stage6 consists of AvgPool, Flatten, FC, and Softmax, as shown in FIG. 1, where AvgPool is global mean pooling with a 2 × 7 × 7 pooling window, Flatten reshapes the output of the previous layer to [16,2048], FC is a fully connected layer whose output dimension is the UCF101 class count of 101, and Softmax is the classification layer. stage6 has an input dimension of [16,2,7,7,2048] and an output dimension of [16,101].
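The tensor dimensions stated for the six stages can be checked with a small shape-bookkeeping sketch that tracks a [batch, T, H, W, C] tensor from the [16,16,224,224,3] input to the [16,101] classifier output. Only strides and output channel counts are modeled; the convolutions themselves are omitted, so this verifies the stated dimensions rather than implementing RD3D.

```python
def pool(shape, stride):
    """Downsample T/H/W according to a [t, h, w] stride triple."""
    b, t, h, w, c = shape
    st, sh, sw = stride
    return [b, t // st, h // sh, w // sw, c]

def channels(shape, c):
    """Set the channel count produced by a stage's convolutions."""
    return shape[:4] + [c]

x = [16, 16, 224, 224, 3]
x = pool(channels(x, 64), [1, 2, 2])    # stage1: Conv3d, stride [1,2,2]
x = pool(x, [1, 2, 2])                  # stage1: MaxPool, stride [1,2,2]
stage1 = x                              # expect [16, 16, 56, 56, 64]

x = pool(channels(x, 128), [2, 2, 2])   # stage2: ConvBlock4/IDBlock4 + MaxPool
x = pool(channels(x, 512), [2, 2, 2])   # stage3
x = pool(channels(x, 1024), [2, 2, 2])  # stage4
x = channels(x, 2048)                   # stage5: no pooling
stage5 = x                              # expect [16, 2, 7, 7, 2048]
logits = [x[0], 101]                    # stage6: AvgPool + Flatten + FC + Softmax
```

Each intermediate shape matches the input/output dimensions listed for the corresponding stage, which is a quick consistency check on the architecture description.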
(3) A loss function is designed. The conventional loss function is F = H(P, Q) + L_re. In order to enlarge the separation between samples of different classes and improve recognition precision, the invention adds a ternary loss to the conventional loss function, obtaining a new loss function:
F = H(P, Q) + L_re + L_tr, wherein:
a: the cross entropy H(P, Q) = −Σ_x P(x)·log Q(x) measures the similarity between the predicted and true distributions; the smaller the loss, the more accurate the classification. P is the true sample distribution and Q is the predicted sample distribution;
b: the L2 regularization loss L_re = (λ/2n)·Σ_i W_i² prevents overfitting, where λ is a penalty factor (λ = 0.009) and n is the number of weights W;
c: the ternary (triplet) loss L_tr = (1/bs)·Σ_{i=1..bs} max(‖f(x_i) − f(x_i^p)‖₂² − ‖f(x_i) − f(x_i^n)‖₂² + β, 0), where ‖f(x_i) − f(x_i^p)‖₂² is the squared Euclidean distance between the features of x_i and x_i^p, ‖f(x_i) − f(x_i^n)‖₂² the squared Euclidean distance between the features of x_i and x_i^n, f(x) the feature of sample x extracted by RD3D, bs the batch size, x_i the currently predicted sample, x_i^p a sample of the same class as x_i, x_i^n a sample of a different class from x_i, and β the margin between the (x_i, x_i^p) and (x_i, x_i^n) distances (β = 0.2).
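A minimal numpy rendering of the combined loss F = H(P,Q) + L_re + L_tr is given below, with λ = 0.009 and β = 0.2 as stated. The batch-mean in H and L_tr and the (λ/2n) scaling in L_re are common conventions assumed here, since the text does not fully fix them; f(x) is taken as precomputed features, and all shapes are illustrative.

```python
import numpy as np

def cross_entropy(P, Q, eps=1e-12):
    """H(P, Q) = -sum_x P(x) log Q(x), averaged over the batch."""
    return float(-(P * np.log(Q + eps)).sum(axis=1).mean())

def l2_loss(weights, lam=0.009):
    """L_re = (lam / 2n) * sum of squared weights, n = number of weight tensors."""
    n = len(weights)
    return lam / (2 * n) * sum(float((W ** 2).sum()) for W in weights)

def triplet_loss(f_a, f_p, f_n, beta=0.2):
    """L_tr: batch mean of max(||f(a)-f(p)||^2 - ||f(a)-f(n)||^2 + beta, 0)."""
    d_ap = ((f_a - f_p) ** 2).sum(axis=1)   # squared distance to positive
    d_an = ((f_a - f_n) ** 2).sum(axis=1)   # squared distance to negative
    return float(np.maximum(d_ap - d_an + beta, 0.0).mean())

rng = np.random.default_rng(0)
P = np.eye(101)[rng.integers(0, 101, 16)]   # one-hot true labels, bs = 16
Q = np.full((16, 101), 1 / 101)             # uniform predicted distribution
f_a, f_p, f_n = (rng.normal(size=(16, 128)) for _ in range(3))
weights = [rng.normal(size=(64, 64))]
F = cross_entropy(P, Q) + l2_loss(weights) + triplet_loss(f_a, f_p, f_n)
```

With a uniform prediction Q the cross entropy equals log(101), and a triplet whose positive coincides with the anchor contributes zero loss whenever the negative lies farther away than the margin β.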
Claims (4)
1. A human behavior recognition algorithm based on deep learning, characterized by comprising the following steps:
(1) preprocessing an input video segment;
(2) constructing the network model RD3D;
(3) defining the loss function and optimizer operations;
(4) training the network model, comprising the following substeps:
(41) initializing parameters;
(42) setting the learning rate to 0.0001 and the batch size to 16;
(43) computing the loss from the RD3D model's forward-propagation output and the ground-truth labels according to the loss function, and updating the weight parameters by back-propagation;
(44) finishing training after 100 epochs;
(5) testing the results.
2. The deep-learning-based human behavior recognition algorithm according to claim 1, characterized in that in the preprocessing stage of step (1), a subsampling algorithm is proposed and adopted to collect n key frames, specifically comprising the following steps:
a: capturing image frames of each video clip at an acquisition rate α to obtain an image data set A for each video;
b: uniformly collecting n frames from the image data set A with the subsampling algorithm to serve as key frames of the video clips, and scaling the key frames to k × k to form a data set B;
c: dividing the data set B into a training set and a test set at a 7:3 ratio for training and testing, wherein each training sample is a quadruple: the sample to be predicted, another sample of the same class as the sample to be predicted, a sample of a different class, and the class label of the sample to be predicted.
3. The deep-learning-based human behavior recognition algorithm according to claim 1, characterized in that the RD3D model of step (2) has 134 layers (1 + 4×4 + 6×3 + 2×4 + 1) organized into 6 stages.
4. The deep-learning-based human behavior recognition algorithm according to claim 1, characterized in that step (3) designs a loss function:
F = H(P, Q) + L_re + L_tr
wherein the cross entropy H(P, Q) = −Σ_x P(x)·log Q(x) measures the similarity between the predicted and true distributions; the smaller the loss, the more accurate the classification; P is the true sample distribution and Q is the predicted sample distribution;
the L2 regularization loss L_re = (λ/2n)·Σ_i W_i² prevents overfitting, where λ is a penalty factor and n is the number of weights W;
the ternary loss L_tr = (1/bs)·Σ_{i=1..bs} max(‖f(x_i) − f(x_i^p)‖₂² − ‖f(x_i) − f(x_i^n)‖₂² + β, 0), wherein ‖f(x_i) − f(x_i^p)‖₂² is the squared Euclidean distance between the features of x_i and x_i^p, ‖f(x_i) − f(x_i^n)‖₂² the squared Euclidean distance between the features of x_i and x_i^n, f(x) the feature of sample x extracted by RD3D, bs the batch size, x_i the currently predicted sample, x_i^p a sample of the same class as x_i, x_i^n a sample of a different class from x_i, and β the margin between the (x_i, x_i^p) and (x_i, x_i^n) distances.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010676134.0A CN111860278B (en) | 2020-07-14 | 2020-07-14 | Human behavior recognition algorithm based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111860278A true CN111860278A (en) | 2020-10-30 |
CN111860278B CN111860278B (en) | 2024-05-14 |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112861752A (en) * | 2021-02-23 | 2021-05-28 | 东北农业大学 | Crop disease identification method and system based on DCGAN and RDN |
CN113361417A (en) * | 2021-06-09 | 2021-09-07 | 陕西理工大学 | Human behavior identification method based on variable time sequence |
CN114897146A (en) * | 2022-05-18 | 2022-08-12 | 北京百度网讯科技有限公司 | Model generation method and device and electronic equipment |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017133009A1 (en) * | 2016-02-04 | 2017-08-10 | 广州新节奏智能科技有限公司 | Method for positioning human joint using depth image of convolutional neural network |
WO2017203262A2 (en) * | 2016-05-25 | 2017-11-30 | Metail Limited | Method and system for predicting garment attributes using deep learning |
CN108009525A (en) * | 2017-12-25 | 2018-05-08 | 北京航空航天大学 | A kind of specific objective recognition methods over the ground of the unmanned plane based on convolutional neural networks |
CN108830185A (en) * | 2018-05-28 | 2018-11-16 | 四川瞳知科技有限公司 | Activity recognition and localization method based on multitask combination learning |
CN109993076A (en) * | 2019-03-18 | 2019-07-09 | 华南理工大学 | A kind of white mouse behavior classification method based on deep learning |
WO2019169942A1 (en) * | 2018-03-09 | 2019-09-12 | 华南理工大学 | Anti-angle and occlusion interference fast face recognition method |
CN110348381A (en) * | 2019-07-11 | 2019-10-18 | 电子科技大学 | A kind of video behavior recognition methods based on deep learning |
WO2019232894A1 (en) * | 2018-06-05 | 2019-12-12 | 中国石油大学(华东) | Complex scene-based human body key point detection system and method |
CN110598598A (en) * | 2019-08-30 | 2019-12-20 | 西安理工大学 | Double-current convolution neural network human behavior identification method based on finite sample set |
CN110619352A (en) * | 2019-08-22 | 2019-12-27 | 杭州电子科技大学 | Typical infrared target classification method based on deep convolutional neural network |
CN110826462A (en) * | 2019-10-31 | 2020-02-21 | 上海海事大学 | Human body behavior identification method of non-local double-current convolutional neural network model |
CN110956085A (en) * | 2019-10-22 | 2020-04-03 | 中山大学 | Human behavior recognition method based on deep learning |
WO2020073951A1 (en) * | 2018-10-10 | 2020-04-16 | 腾讯科技(深圳)有限公司 | Method and apparatus for training image recognition model, network device, and storage medium |
CN111027487A (en) * | 2019-12-11 | 2020-04-17 | 山东大学 | Behavior recognition system, method, medium, and apparatus based on multi-convolution kernel residual network |
Non-Patent Citations (3)
Title |
---|
Zhang Yan'an; Wang Hongyu; Xu Fang: "Face recognition based on deep convolutional neural network and center loss", Science Technology and Engineering, no. 35, 18 December 2017, pages 97-102
Xie Huaiqi; Le Hongbing: "Video human behavior recognition based on a channel attention mechanism", Electronic Technology & Software Engineering, no. 04, 15 February 2020, pages 146-148
Zhao Xinqiu; Yang Dongdong; He Hailong; Duan Siyu: "Research on human behavior recognition based on deep learning", High Technology Letters, no. 05, 15 May 2020, pages 41-49
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant |