CN114170618A - Video human behavior recognition algorithm based on double-flow space-time decomposition - Google Patents

Video human behavior recognition algorithm based on double-flow space-time decomposition

Info

Publication number
CN114170618A
CN114170618A
Authority
CN
China
Prior art keywords
network
convolution
time
space
double
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111140075.6A
Other languages
Chinese (zh)
Inventor
衣杨
邱泽敏
陈怡华
刘东琳
赵小蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Xinhua College
Original Assignee
Guangzhou Xinhua College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Xinhua College filed Critical Guangzhou Xinhua College
Priority to CN202111140075.6A
Publication of CN114170618A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video human behavior recognition algorithm based on dual-stream spatio-temporal decomposition. It improves the input features of the dual-stream network, training the spatial-stream network with the I frames of a compressed video and the temporal-stream network with the P frames, while retaining the basic dual-stream framework. It further proposes a new dual-stream spatio-temporal decomposition convolutional network that splits the 3D convolutions of a 3D residual convolutional network (ResNet3D) into a hybrid of a two-dimensional spatial convolutional network and a one-dimensional temporal convolutional network, so that the model obtains temporal information as effectively as a 3D convolutional network while reducing the number of training parameters and making the network easier to optimize.

Description

Video human behavior recognition algorithm based on double-flow space-time decomposition
Technical Field
The invention relates to the technical field of human action recognition, and in particular to a video human behavior recognition algorithm based on dual-stream spatio-temporal decomposition.
Background
In recent years, with the great improvement in computing power and the continual release of large-scale datasets, many studies have shown that deep convolutional networks achieve excellent performance and recognition results in video human behavior recognition. The key research focus of deep-learning-based motion modeling for human behavior recognition is to build models with strong discriminative power. Mainstream network frameworks currently include the Long Short-Term Memory network (LSTM), the two-stream network (Two-Stream Network), and 3D convolutional networks.
The two-stream network is the most representative framework for human behavior recognition: it builds two independent convolutional streams to process the appearance information and motion information of a video separately, and finally fuses the results of the two streams to obtain the classification label.
The 3D convolutional network framework takes the 3D convolution kernel as its structural backbone and extracts the spatial and temporal information of a video through the kernel's spatio-temporal receptive field, effectively capturing the appearance and motion information of the moving subject while operating on the spatial and temporal dimensions simultaneously.
In recent years, CoViAR, a basic method that uses compressed-video coding features as deep-network input, has trained three independent two-dimensional convolutional networks on the I frames, motion vectors and residuals respectively: a more complex network structure for the I frames, which contain complete image information, and lightweight networks for the motion vectors (MV) and residuals (R), which contain less image information. This effectively builds three approximate spatial-stream networks. The later DMC-Net uses a generative adversarial network (GAN) to reconstruct new motion-cue features from the motion vectors and residuals as temporal-stream features. However, these methods do not fully exploit the motion information contained in the motion vectors and residuals, which greatly limits their accuracy.
3D convolutional networks have shown that adding a convolution kernel with one more dimension lets the network better capture the temporal information in video frames, and that the correlated spatial-appearance and temporal-motion features of P frames, which consist of motion vectors and residuals and carry the motion information, can be obtained effectively.
However, the traditional 3D convolutional network model has a large number of parameters and places high memory and compute demands on training; existing hardware often struggles to support the development of 3D convolutional networks, which severely limits model efficiency. How to reduce the parameter count of 3D networks while improving their ability to extract temporal information is therefore one of the important research directions for 3D network frameworks.
Therefore, it is necessary to provide a video human behavior recognition algorithm based on dual-stream spatiotemporal decomposition to solve the above technical problems.
Disclosure of Invention
To solve the above technical problems, the invention provides a video human behavior recognition algorithm based on dual-stream spatio-temporal decomposition, which reduces the parameter count of the 3D network and improves its ability to extract temporal information.
The video human behavior recognition algorithm based on dual-stream spatio-temporal decomposition provided by the invention comprises the following steps:
Step one: build a residual block of dual-stream spatio-temporal decomposition and use it as the basic building block of a residual convolutional network;
Step two: decompose each complete 3D convolution kernel in the residual convolutional network of step one into a two-dimensional spatial convolution operation and a one-dimensional temporal convolution operation, each decomposed convolution operation being followed by its own BN layer and ReLU activation layer;
Step three: build a spatial-stream network for the I frames of the compressed video from the two-dimensional spatial convolutions decomposed in step two, and a temporal-stream network for the P frames, which fuse the motion vectors and residuals, from the one-dimensional temporal convolutions decomposed in step two;
Step four: multiply and fuse the output of the last residual block of the temporal-stream network with the input of the current spatial stream, and use the fused result as the input of the spatial-stream network.
Preferably, the spatial-stream and temporal-stream networks in step three both extract motion information by fusing a two-dimensional convolution kernel cf2 with a one-dimensional convolution kernel cf1.
Compared with the related art, the video human behavior recognition algorithm based on dual-stream spatio-temporal decomposition provided by the invention has the following beneficial effects:
The invention improves the input features of the dual-stream network, training the spatial-stream network with the I frames of a compressed video and the temporal-stream network with the P frames, while retaining the basic dual-stream framework. It further proposes a new dual-stream spatio-temporal decomposition convolutional network that splits the 3D convolutions of a 3D residual convolutional network (ResNet3D) into a hybrid of a two-dimensional spatial convolutional network and a one-dimensional temporal convolutional network, so that the model obtains temporal information as effectively as a 3D convolutional network while reducing the number of training parameters and making the network easier to optimize.
Drawings
FIG. 1 is a schematic diagram of a spatio-temporal decomposition residual block structure according to the present invention;
FIG. 2 is a schematic diagram of a cross-spatiotemporal fusion architecture of the present invention;
FIG. 3 is a diagram of the Resnet3D network architecture of the present invention;
FIG. 4 is a table comparing the effect of different network architectures of the invention on HMDB51 and UCF101.
Detailed Description
The invention is further described with reference to the following figures and embodiments.
A video human behavior recognition algorithm based on dual-stream spatio-temporal decomposition is provided, in which the dual-stream spatio-temporal decomposition convolutional network is obtained by improving the residual block of the basic 3D residual convolutional network architecture (Basic ResNet3D). The residual block of a ResNet structure is generally defined as shown in Equation (1):

X_{i+1} = h(X_i) + F(X_i, W_i)    (1)

where X_{i+1} and X_i are the output and input data of the i-th residual block respectively, h(X_i) = X_i denotes the identity mapping of X_i, F denotes the residual feature learning function (usually involving the ReLU function), and W_i denotes the convolution filters of the i-th layer.
A residual block of dual-stream spatio-temporal decomposition is used as the basic structure of the residual convolutional network: each complete 3D convolution kernel is decomposed into a two-dimensional spatial convolution operation and a one-dimensional temporal convolution operation, each followed by its own BN layer and ReLU activation layer. For N_i 3D convolution kernels of size N_{i-1} x t x d x d, the decomposition used here yields M_i 2D spatial kernels of size N_{i-1} x 1 x d x d and N_i 1D temporal kernels of size M_i x t x 1 x 1. To keep the parameter count consistent before and after decomposition, the hyperparameter M_i is set as shown in Equation (2):

M_i = floor( t * d^2 * N_{i-1} * N_i / (d^2 * N_{i-1} + t * N_i) )    (2)

where N_{i-1} and N_i are the numbers of input and output channels of the 3D convolution kernel, d is its spatial size and t its temporal size. 3 x 3 x 3 convolution kernels are used here, so both t and d are 3. The structure of the spatio-temporal decomposed residual block is shown in FIG. 1.
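For illustration, the small Python helper below computes M_i according to Equation (2); this is a hypothetical sketch (the function name is ours, not from the patent):

import math

def compute_mi(n_in, n_out, t=3, d=3):
    # Equation (2): choose M_i so that the 2D + 1D decomposition has the
    # same parameter count as the original N_{i-1} x t x d x d 3D kernel.
    return math.floor(t * d * d * n_in * n_out / (d * d * n_in + t * n_out))

print(compute_mi(64, 64))  # -> 144 intermediate channels for a 64 -> 64 3x3x3 kernel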
On this basis, the invention constructs a dual-stream network structure from the spatio-temporal decomposed residual block (Spatiotemporal Decomposed Module), building a spatial-stream network for the I frames of the compressed video and a temporal-stream network for the P frames, which fuse the motion vectors and residuals. Both the spatial-stream and temporal-stream networks here extract motion information by fusing a two-dimensional convolution kernel cf2 with a one-dimensional convolution kernel cf1, improving on the residual block definition above; the spatio-temporal decomposed residual block is defined as shown in Equation (3):
X_{i+1} = h(X_i) + cf1(cf2(X_i))    (3)
where X_{i+1} and X_i are the output and input data of the i-th residual block respectively; the detailed structure is shown in FIGS. 3-4. During network training, the method can effectively learn the appearance features of single frames, the interaction features across multiple frames, and the motion-difference features between videos, making fuller use of the motion information contained in the motion vectors and residuals.
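As a concrete illustration of Equations (2) and (3), a minimal PyTorch sketch of such a spatio-temporal decomposed residual block follows. The class name is ours, only the stride-1, equal-channel case is shown, and the BN/ReLU placement follows the description above rather than any published implementation:

import math
import torch
import torch.nn as nn

class DecomposedResidualBlock(nn.Module):
    # Equation (3): X_{i+1} = h(X_i) + cf1(cf2(X_i)), with cf2 a 2D spatial
    # convolution (1 x d x d) and cf1 a 1D temporal convolution (t x 1 x 1),
    # each followed by its own BN layer and ReLU activation.
    def __init__(self, channels, t=3, d=3):
        super().__init__()
        # Equation (2): intermediate channel count matching the 3D parameter budget
        m = math.floor(t * d * d * channels * channels
                       / (d * d * channels + t * channels))
        self.cf2 = nn.Sequential(
            nn.Conv3d(channels, m, kernel_size=(1, d, d),
                      padding=(0, d // 2, d // 2), bias=False),
            nn.BatchNorm3d(m), nn.ReLU(inplace=True),
        )
        self.cf1 = nn.Sequential(
            nn.Conv3d(m, channels, kernel_size=(t, 1, 1),
                      padding=(t // 2, 0, 0), bias=False),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
        )

    def forward(self, x):  # x: (N, C, T, H, W)
        return x + self.cf1(self.cf2(x))  # identity shortcut h(X) = X

block = DecomposedResidualBlock(64)
out = block(torch.randn(2, 64, 16, 56, 56))  # shape preserved: (2, 64, 16, 56, 56)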
To enhance the discriminative power of the compressed-video I frames, the invention proposes a cross-spatio-temporal fusion strategy that fuses the features of the temporal stream and the spatial stream: the output of the last residual block of the temporal stream is multiplied element-wise with the input of the current spatial-stream layer and used as the input of the spatial-stream network. This is equivalent to using the motion features as weights for global attention weighting of the appearance feature map. The fusion process is shown in Equation (4):
x_s^(l+1) = W_s^l * (x_s^l ⊙ x_t^l)    (4)

where x_s^l and x_t^l denote the inputs of the l-th convolutional layer of the spatial-stream and temporal-stream networks respectively, ⊙ denotes element-wise multiplication (element_wise multiplication), * denotes convolution, and W_s^l denotes the filters of the l-th layer of the spatial-stream network. During back-propagation in network training, the gradient of the spatial-stream network is therefore as shown in Equation (5):
∂L_s/∂W_s^l = (∂L_s/∂x_s^(l+1)) · (x_s^l ⊙ x_t^l)    (5)

where L_s denotes the loss function of the spatial-stream network. A schematic diagram of the cross-spatio-temporal fusion strategy is shown in FIG. 2. This fusion-weighting scheme strengthens the guidance of the motion features over the appearance features and enhances the discrimination of motion information during network training.
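A minimal Python sketch of this fusion step, assuming the two streams produce feature maps of matching shape (the alignment details are not specified in the text):

import torch

def cross_spatiotemporal_fusion(x_s, x_t):
    # Equation (4), fusion part: element-wise multiplication of the
    # temporal-stream features into the spatial-stream input, i.e. the
    # motion features act as global attention weights on appearance.
    return x_s * x_t

x_s = torch.randn(2, 256, 8, 14, 14)  # spatial-stream layer input
x_t = torch.randn(2, 256, 8, 14, 14)  # temporal-stream last residual output
fused = cross_spatiotemporal_fusion(x_s, x_t)  # fed into the spatial stream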
Since I frames contain more, and more detailed, information while the motion vectors and residuals resemble weak-modality residual images, different spatio-temporal decomposition network frameworks are adopted for the spatial-stream and temporal-stream networks to balance model performance and efficiency. For the spatial stream, the ResNet2D-152 framework is compared with the spatio-temporally decomposed ResNet3D-34 framework; the temporal stream adopts the spatio-temporally decomposed ResNet3D-18 framework. The network structure is shown in FIG. 3; the spatio-temporal decomposition networks replace each 3D convolution kernel in the framework with its spatio-temporally decomposed counterpart.
Experimental data: the experiments compare the recognition effect of the method with three different network structures in the spatial-stream network, namely 2D ResNet-152, 2D ResNet-18 and Decomposed ResNet-34, and two different structures in the temporal-stream network, namely 2D-3D ResNet-18 and Decomposed ResNet-18.
Accuracy is measured by Top-1 accuracy, i.e., the proportion of samples for which the class predicted with the highest probability at inference time matches the ground-truth label, computed as shown in Equation (6):

Top-1 accuracy = (number of samples whose highest-probability prediction is correct) / (total number of samples)    (6)

The performance index for each dataset is the Top-1 accuracy averaged over all test sets.
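A small Python sketch of the Top-1 computation in Equation (6) (our own helper, shown for clarity):

import torch

def top1_accuracy(logits, labels):
    # Equation (6): fraction of samples whose highest-probability class
    # matches the ground-truth label.
    return (logits.argmax(dim=1) == labels).float().mean().item()

logits = torch.tensor([[2.0, 0.5], [0.1, 1.2], [3.0, -1.0]])
labels = torch.tensor([0, 1, 1])
print(top1_accuracy(logits, labels))  # 2 of 3 correct -> 0.666...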
Considering that P frames carry weak-modality dynamic information whose information content and precision are lower than those of I frames, a simple and lightweight model is preferable. The 2D-3D ResNet-18 structure is a hybrid network that follows the conv3_x layer group of a 2D ResNet-18 convolutional network and then expands into a 3D convolutional network, i.e., it uses the first three layer groups of 2D ResNet-18 together with the conv4_x and conv5_x layer groups of 3D ResNet-18, serving as another lightweight network structure. In the comparison experiments, the spatial-stream network does not include the cross-spatio-temporal fusion operation.
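The hybrid can be sketched as follows; this is a hypothetical simplification in which plain convolution stacks stand in for the ResNet-18 residual layer groups, so it illustrates only the 2D-then-3D layout described above:

import torch
import torch.nn as nn

class Hybrid2D3DNet(nn.Module):
    # 2D layers applied per frame up to conv3_x, then 3D layers for the
    # conv4_x / conv5_x stages, mirroring the 2D-3D ResNet-18 layout.
    def __init__(self, num_classes=51):
        super().__init__()
        self.backbone2d = nn.Sequential(  # stand-in for conv1..conv3_x (2D)
            nn.Conv2d(3, 64, 7, 2, 3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, 2, 1),
            nn.Conv2d(64, 128, 3, 2, 1, bias=False),
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
        )
        self.backbone3d = nn.Sequential(  # stand-in for conv4_x/conv5_x (3D)
            nn.Conv3d(128, 256, 3, (1, 2, 2), 1, bias=False),
            nn.BatchNorm3d(256), nn.ReLU(inplace=True),
            nn.Conv3d(256, 512, 3, (1, 2, 2), 1, bias=False),
            nn.BatchNorm3d(512), nn.ReLU(inplace=True),
        )
        self.head = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(),
                                  nn.Linear(512, num_classes))

    def forward(self, x):  # x: (N, C, T, H, W)
        n, c, t, h, w = x.shape
        y = self.backbone2d(x.transpose(1, 2).reshape(n * t, c, h, w))  # per frame
        y = y.reshape(n, t, *y.shape[1:]).transpose(1, 2)  # back to (N, C', T, H', W')
        return self.head(self.backbone3d(y))

net = Hybrid2D3DNet()
scores = net(torch.randn(2, 3, 16, 112, 112))  # -> (2, 51)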
The performance and accuracy of the different network frameworks on the HMDB51 and UCF101 datasets are shown in the table of FIG. 4; 16 uniformly sampled frames are used each time as the input to the network. The experimental results show that, in the spatial-stream network, the Decomposed ResNet-34 framework gives the best algorithm accuracy for the input I frames, while the 2D convolutional network models fall short of the 3D convolutional network. On the temporal-stream network, the Decomposed ResNet-18 framework clearly improves the recognition effect over the other frameworks with hybrid 3D convolution structures: compared with expanding a 2D convolutional network into a 3D structure partway through, this framework uses a feature extractor approximating a 3D convolution kernel from the start and can therefore learn more accurate temporal information.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (2)

1. A video human behavior recognition algorithm based on dual-stream spatio-temporal decomposition, characterized by comprising the following steps:
Step one: building a residual block of dual-stream spatio-temporal decomposition and using it as the basic building block of a residual convolutional network;
Step two: decomposing each complete 3D convolution kernel in the residual convolutional network of step one into a two-dimensional spatial convolution operation and a one-dimensional temporal convolution operation, each decomposed convolution operation being followed by its own BN layer and ReLU activation layer;
Step three: building a spatial-stream network for the I frames of the compressed video from the two-dimensional spatial convolutions decomposed in step two, and a temporal-stream network for the P frames, which fuse the motion vectors and residuals, from the one-dimensional temporal convolutions decomposed in step two;
Step four: multiplying and fusing the output of the last residual block of the temporal-stream network with the input of the current spatial stream, and using the fused result as the input of the spatial-stream network.
2. The video human behavior recognition algorithm based on dual-stream spatio-temporal decomposition of claim 1, wherein the spatial-stream and temporal-stream networks in step three extract motion information by fusing a two-dimensional convolution kernel cf2 with a one-dimensional convolution kernel cf1.
CN202111140075.6A 2021-09-28 2021-09-28 Video human behavior recognition algorithm based on double-flow space-time decomposition Pending CN114170618A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111140075.6A CN114170618A (en) 2021-09-28 2021-09-28 Video human behavior recognition algorithm based on double-flow space-time decomposition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111140075.6A CN114170618A (en) 2021-09-28 2021-09-28 Video human behavior recognition algorithm based on double-flow space-time decomposition

Publications (1)

Publication Number Publication Date
CN114170618A 2022-03-11

Family

ID=80477004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111140075.6A Pending CN114170618A (en) 2021-09-28 2021-09-28 Video human behavior recognition algorithm based on double-flow space-time decomposition

Country Status (1)

Country Link
CN (1) CN114170618A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115223250A (en) * 2022-09-13 2022-10-21 东莞理工学院 Upper limb rehabilitation action recognition method based on multi-scale space-time decomposition convolutional network
CN116524612A (en) * 2023-06-21 2023-08-01 长春理工大学 rPPG-based human face living body detection system and method
CN116524612B (en) * 2023-06-21 2023-09-12 长春理工大学 rPPG-based human face living body detection system and method
CN117975376A (en) * 2024-04-02 2024-05-03 湖南大学 Mine operation safety detection method based on depth grading fusion residual error network
CN117975376B (en) * 2024-04-02 2024-06-07 湖南大学 Mine operation safety detection method based on depth grading fusion residual error network

Similar Documents

Publication Publication Date Title
CN114170618A (en) Video human behavior recognition algorithm based on double-flow space-time decomposition
US10924755B2 (en) Real time end-to-end learning system for a high frame rate video compressive sensing network
CN107273800B (en) Attention mechanism-based motion recognition method for convolutional recurrent neural network
CN108537754B (en) Face image restoration system based on deformation guide picture
Li et al. GaitSlice: A gait recognition model based on spatio-temporal slice features
CN110222574A (en) Production operation Activity recognition method, apparatus, equipment, system and storage medium based on structuring double fluid convolutional neural networks
CN112288627B (en) Recognition-oriented low-resolution face image super-resolution method
CN112836646B (en) Video pedestrian re-identification method based on channel attention mechanism and application
CN112653899A (en) Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene
CN113283444B (en) Heterogeneous image migration method based on generation countermeasure network
CN112307995A (en) Semi-supervised pedestrian re-identification method based on feature decoupling learning
Yang et al. Deeplab_v3_plus-net for image semantic segmentation with channel compression
CN113128360A (en) Driver driving behavior detection and identification method based on deep learning
CN111768354A (en) Face image restoration system based on multi-scale face part feature dictionary
CN113920581A (en) Method for recognizing motion in video by using space-time convolution attention network
CN113505719A (en) Gait recognition model compression system and method based on local-integral joint knowledge distillation algorithm
CN111260577B (en) Face image restoration system based on multi-guide image and self-adaptive feature fusion
CN114333002A (en) Micro-expression recognition method based on deep learning of image and three-dimensional reconstruction of human face
CN113935435A (en) Multi-modal emotion recognition method based on space-time feature fusion
CN112906520A (en) Gesture coding-based action recognition method and device
CN111401116A (en) Bimodal emotion recognition method based on enhanced convolution and space-time L STM network
CN116524601B (en) Self-adaptive multi-stage human behavior recognition model for assisting in monitoring of pension robot
Chen et al. Rethinking lightweight: multiple angle strategy for efficient video action recognition
CN111325149A (en) Video action identification method based on voting time sequence correlation model
CN114582002B (en) Facial expression recognition method combining attention module and second-order pooling mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination