CN114170618A - Video human behavior recognition algorithm based on double-flow space-time decomposition - Google Patents

Video human behavior recognition algorithm based on double-flow space-time decomposition

Info

Publication number
CN114170618A
CN114170618A
Authority
CN
China
Prior art keywords
network
convolution
time
space
double
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111140075.6A
Other languages
Chinese (zh)
Inventor
衣杨
邱泽敏
陈怡华
刘东琳
赵小蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Xinhua College
Original Assignee
Guangzhou Xinhua College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Xinhua College filed Critical Guangzhou Xinhua College
Priority to CN202111140075.6A
Publication of CN114170618A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video human behavior recognition algorithm based on dual-stream spatio-temporal decomposition. It improves the input features of the dual-stream network, training the spatial-stream network with the I frames of a compressed video and the temporal-stream network with the P frames, while retaining the basic dual-stream framework. It further proposes a new dual-stream spatio-temporal decomposition convolutional network that splits the 3D convolutions of a 3D residual convolutional network (ResNet3D) into a hybrid of a two-dimensional spatial convolutional network and a one-dimensional temporal convolutional network, so that the model obtains temporal information as effectively as a 3D convolutional network while reducing the number of training parameters and making the network easier to optimize.

Description

Video human behavior recognition algorithm based on double-flow space-time decomposition
Technical Field
The invention relates to the technical field of human action recognition, and in particular to a video human behavior recognition algorithm based on dual-stream spatio-temporal decomposition.
Background
In recent years, with the great improvement in computing power and the continual release of large-scale datasets, many studies have shown that deep convolutional networks achieve excellent performance and recognition results in video human behavior recognition. The key research focus of deep-learning-based motion modeling for human behavior recognition is to build models with strong discriminative power. Mainstream network frameworks currently include the Long Short-Term Memory network (LSTM), the two-stream network (Two-Stream Network), and 3D convolutional networks.
The two-stream network is the most representative framework for human behavior recognition: it builds two independent convolutional streams to process the appearance information and motion information of a video separately, and finally fuses the results of the two streams to obtain the classification label.
The 3D convolutional network framework takes the 3D convolution kernel as its structural backbone and extracts the spatial and temporal information of a video through the kernel's spatio-temporal receptive field, effectively capturing the appearance and motion information of the moving subject while operating on the spatial and temporal dimensions simultaneously.
In recent years, CoViAR, a basic method that uses compressed-video coding features as deep-network input, has trained three independent two-dimensional convolutional networks on the I frames, motion vectors and residuals respectively: a more complex network structure for the I frames, which contain complete image information, and lightweight networks for the motion vectors (MV) and residuals (R), which contain less image information. This effectively builds three approximate spatial-stream networks. The later DMC-Net uses a generative adversarial network (GAN) to reconstruct new motion-cue features from the motion vectors and residuals as temporal-stream features. However, these methods do not fully exploit the motion information contained in the motion vectors and residuals, which greatly limits their accuracy.
3D convolutional networks have shown that adding a convolution kernel with one more dimension lets the network better capture the temporal information in video frames, and that the correlated spatial-appearance and temporal-motion features of P frames, which consist of motion vectors and residuals and carry the motion information, can be obtained effectively.
However, the traditional 3D convolutional network model has a large number of parameters and places high memory and compute demands on training; existing hardware often struggles to support the development of 3D convolutional networks, which severely limits model efficiency. How to reduce the parameter count of 3D networks while improving their ability to extract temporal information is therefore one of the important research directions for 3D network frameworks.
Therefore, it is necessary to provide a video human behavior recognition algorithm based on dual-stream spatiotemporal decomposition to solve the above technical problems.
Disclosure of Invention
To solve the above technical problems, the invention provides a video human behavior recognition algorithm based on dual-stream spatio-temporal decomposition, which reduces the parameter count of the 3D network and improves its ability to extract temporal information.
The video human behavior recognition algorithm based on dual-stream spatio-temporal decomposition provided by the invention comprises the following steps:
Step one: build a residual block of dual-stream spatio-temporal decomposition and use it as the basic building block of a residual convolutional network;
Step two: decompose each complete 3D convolution kernel in the residual convolutional network of step one into a two-dimensional spatial convolution operation and a one-dimensional temporal convolution operation, each decomposed convolution operation being followed by its own BN layer and ReLU activation layer;
Step three: build a spatial-stream network for the I frames of the compressed video from the two-dimensional spatial convolutions decomposed in step two, and a temporal-stream network for the P frames, which fuse the motion vectors and residuals, from the one-dimensional temporal convolutions decomposed in step two;
Step four: multiply and fuse the output of the last residual block of the temporal-stream network with the input of the current spatial stream, and use the fused result as the input of the spatial-stream network.
Preferably, the spatial-stream and temporal-stream networks in step three both extract motion information by fusing a two-dimensional convolution kernel cf2 with a one-dimensional convolution kernel cf1.
Compared with the related art, the video human behavior recognition algorithm based on dual-stream spatio-temporal decomposition provided by the invention has the following beneficial effects:
The invention improves the input features of the dual-stream network, training the spatial-stream network with the I frames of a compressed video and the temporal-stream network with the P frames, while retaining the basic dual-stream framework. It further proposes a new dual-stream spatio-temporal decomposition convolutional network that splits the 3D convolutions of a 3D residual convolutional network (ResNet3D) into a hybrid of a two-dimensional spatial convolutional network and a one-dimensional temporal convolutional network, so that the model obtains temporal information as effectively as a 3D convolutional network while reducing the number of training parameters and making the network easier to optimize.
Drawings
FIG. 1 is a schematic diagram of a spatio-temporal decomposition residual block structure according to the present invention;
FIG. 2 is a schematic diagram of a cross-spatiotemporal fusion architecture of the present invention;
FIG. 3 is a diagram of the Resnet3D network architecture of the present invention;
FIG. 4 is a table comparing the effect of different network architectures of the invention on HMDB51 and UCF101.
Detailed Description
The invention is further described with reference to the following figures and embodiments.
A video human behavior recognition algorithm based on dual-stream spatio-temporal decomposition is provided, in which the dual-stream spatio-temporal decomposition convolutional network is obtained by improving the residual block of the basic 3D residual convolutional network architecture (Basic ResNet3D). The residual block of a ResNet structure is generally defined as shown in Equation (1):

X_{i+1} = h(X_i) + F(X_i, W_i)    (1)

where X_{i+1} and X_i are the output and input data of the i-th residual block respectively, h(X_i) = X_i denotes the identity mapping of X_i, F denotes the residual feature learning function (usually involving the ReLU function), and W_i denotes the convolution filters of the i-th layer.
A residual block of dual-stream spatio-temporal decomposition is used as the basic structure of the residual convolutional network: each complete 3D convolution kernel is decomposed into a two-dimensional spatial convolution operation and a one-dimensional temporal convolution operation, each followed by its own BN layer and ReLU activation layer. For N_i 3D convolution kernels of size N_{i-1} x t x d x d, the decomposition used here yields M_i 2D spatial kernels of size N_{i-1} x 1 x d x d and N_i 1D temporal kernels of size M_i x t x 1 x 1. To keep the parameter count consistent before and after decomposition, the hyperparameter M_i is set as shown in Equation (2):

M_i = floor( t * d^2 * N_{i-1} * N_i / (d^2 * N_{i-1} + t * N_i) )    (2)

where N_{i-1} and N_i are the numbers of input and output channels of the 3D convolution kernel, d is its spatial size and t its temporal size. 3 x 3 x 3 convolution kernels are used here, so both t and d are 3. The structure of the spatio-temporal decomposed residual block is shown in FIG. 1.
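For illustration, the small Python helper below computes M_i according to Equation (2); this is a hypothetical sketch (the function name is ours, not from the patent):

import math

def compute_mi(n_in, n_out, t=3, d=3):
    # Equation (2): choose M_i so that the 2D + 1D decomposition has the
    # same parameter count as the original N_{i-1} x t x d x d 3D kernel.
    return math.floor(t * d * d * n_in * n_out / (d * d * n_in + t * n_out))

print(compute_mi(64, 64))  # -> 144 intermediate channels for a 64 -> 64 3x3x3 kernel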
On this basis, the invention constructs a dual-stream network structure from the spatio-temporal decomposed residual block (Spatiotemporal Decomposed Module), building a spatial-stream network for the I frames of the compressed video and a temporal-stream network for the P frames, which fuse the motion vectors and residuals. Both the spatial-stream and temporal-stream networks here extract motion information by fusing a two-dimensional convolution kernel cf2 with a one-dimensional convolution kernel cf1, improving on the residual block definition above; the spatio-temporal decomposed residual block is defined as shown in Equation (3):
X_{i+1} = h(X_i) + cf1(cf2(X_i))    (3)
where X_{i+1} and X_i are the output and input data of the i-th residual block respectively; the detailed structure is shown in FIGS. 3-4. During network training, the method can effectively learn the appearance features of single frames, the interaction features across multiple frames, and the motion-difference features between videos, making fuller use of the motion information contained in the motion vectors and residuals.
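As a concrete illustration of Equations (2) and (3), a minimal PyTorch sketch of such a spatio-temporal decomposed residual block follows. The class name is ours, only the stride-1, equal-channel case is shown, and the BN/ReLU placement follows the description above rather than any published implementation:

import math
import torch
import torch.nn as nn

class DecomposedResidualBlock(nn.Module):
    # Equation (3): X_{i+1} = h(X_i) + cf1(cf2(X_i)), with cf2 a 2D spatial
    # convolution (1 x d x d) and cf1 a 1D temporal convolution (t x 1 x 1),
    # each followed by its own BN layer and ReLU activation.
    def __init__(self, channels, t=3, d=3):
        super().__init__()
        # Equation (2): intermediate channel count matching the 3D parameter budget
        m = math.floor(t * d * d * channels * channels
                       / (d * d * channels + t * channels))
        self.cf2 = nn.Sequential(
            nn.Conv3d(channels, m, kernel_size=(1, d, d),
                      padding=(0, d // 2, d // 2), bias=False),
            nn.BatchNorm3d(m), nn.ReLU(inplace=True),
        )
        self.cf1 = nn.Sequential(
            nn.Conv3d(m, channels, kernel_size=(t, 1, 1),
                      padding=(t // 2, 0, 0), bias=False),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
        )

    def forward(self, x):  # x: (N, C, T, H, W)
        return x + self.cf1(self.cf2(x))  # identity shortcut h(X) = X

block = DecomposedResidualBlock(64)
out = block(torch.randn(2, 64, 16, 56, 56))  # shape preserved: (2, 64, 16, 56, 56)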
To enhance the discriminative power of the compressed-video I frames, the invention proposes a cross-spatio-temporal fusion strategy that fuses the features of the temporal stream and the spatial stream: the output of the last residual block of the temporal stream is multiplied element-wise with the input of the current spatial-stream layer and used as the input of the spatial-stream network. This is equivalent to using the motion features as weights for global attention weighting of the appearance feature map. The fusion process is shown in Equation (4):
x_s^(l+1) = W_s^l * (x_s^l ⊙ x_t^l)    (4)

where x_s^l and x_t^l denote the inputs of the l-th convolutional layer of the spatial-stream and temporal-stream networks respectively, ⊙ denotes element-wise multiplication (element_wise multiplication), * denotes convolution, and W_s^l denotes the filters of the l-th layer of the spatial-stream network. During back-propagation in network training, the gradient of the spatial-stream network is therefore as shown in Equation (5):
∂L_s/∂W_s^l = (∂L_s/∂x_s^(l+1)) · (x_s^l ⊙ x_t^l)    (5)

where L_s denotes the loss function of the spatial-stream network. A schematic diagram of the cross-spatio-temporal fusion strategy is shown in FIG. 2. This fusion-weighting scheme strengthens the guidance of the motion features over the appearance features and enhances the discrimination of motion information during network training.
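A minimal Python sketch of this fusion step, assuming the two streams produce feature maps of matching shape (the alignment details are not specified in the text):

import torch

def cross_spatiotemporal_fusion(x_s, x_t):
    # Equation (4), fusion part: element-wise multiplication of the
    # temporal-stream features into the spatial-stream input, i.e. the
    # motion features act as global attention weights on appearance.
    return x_s * x_t

x_s = torch.randn(2, 256, 8, 14, 14)  # spatial-stream layer input
x_t = torch.randn(2, 256, 8, 14, 14)  # temporal-stream last residual output
fused = cross_spatiotemporal_fusion(x_s, x_t)  # fed into the spatial stream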
Since I frames contain more, and more detailed, information while the motion vectors and residuals resemble weak-modality residual images, different spatio-temporal decomposition network frameworks are adopted for the spatial-stream and temporal-stream networks to balance model performance and efficiency. For the spatial stream, the ResNet2D-152 framework is compared with the spatio-temporally decomposed ResNet3D-34 framework; the temporal stream adopts the spatio-temporally decomposed ResNet3D-18 framework. The network structure is shown in FIG. 3; the spatio-temporal decomposition networks replace each 3D convolution kernel in the framework with its spatio-temporally decomposed counterpart.
Experimental data: the experiments compare the recognition effect of the method with three different network structures in the spatial-stream network, namely 2D ResNet-152, 2D ResNet-18 and Decomposed ResNet-34, and two different structures in the temporal-stream network, namely 2D-3D ResNet-18 and Decomposed ResNet-18.
Accuracy is measured by Top-1 accuracy, i.e., the proportion of samples for which the class predicted with the highest probability at inference time matches the ground-truth label, computed as shown in Equation (6):

Top-1 accuracy = (number of samples whose highest-probability prediction is correct) / (total number of samples)    (6)

The performance index for each dataset is the Top-1 accuracy averaged over all test sets.
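A small Python sketch of the Top-1 computation in Equation (6) (our own helper, shown for clarity):

import torch

def top1_accuracy(logits, labels):
    # Equation (6): fraction of samples whose highest-probability class
    # matches the ground-truth label.
    return (logits.argmax(dim=1) == labels).float().mean().item()

logits = torch.tensor([[2.0, 0.5], [0.1, 1.2], [3.0, -1.0]])
labels = torch.tensor([0, 1, 1])
print(top1_accuracy(logits, labels))  # 2 of 3 correct -> 0.666...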
Considering that P frames carry weak-modality dynamic information whose information content and precision are lower than those of I frames, a simple and lightweight model is preferable. The 2D-3D ResNet-18 structure is a hybrid network that follows the conv3_x layer group of a 2D ResNet-18 convolutional network and then expands into a 3D convolutional network, i.e., it uses the first three layer groups of 2D ResNet-18 together with the conv4_x and conv5_x layer groups of 3D ResNet-18, serving as another lightweight network structure. In the comparison experiments, the spatial-stream network does not include the cross-spatio-temporal fusion operation.
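The hybrid can be sketched as follows; this is a hypothetical simplification in which plain convolution stacks stand in for the ResNet-18 residual layer groups, so it illustrates only the 2D-then-3D layout described above:

import torch
import torch.nn as nn

class Hybrid2D3DNet(nn.Module):
    # 2D layers applied per frame up to conv3_x, then 3D layers for the
    # conv4_x / conv5_x stages, mirroring the 2D-3D ResNet-18 layout.
    def __init__(self, num_classes=51):
        super().__init__()
        self.backbone2d = nn.Sequential(  # stand-in for conv1..conv3_x (2D)
            nn.Conv2d(3, 64, 7, 2, 3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, 2, 1),
            nn.Conv2d(64, 128, 3, 2, 1, bias=False),
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
        )
        self.backbone3d = nn.Sequential(  # stand-in for conv4_x/conv5_x (3D)
            nn.Conv3d(128, 256, 3, (1, 2, 2), 1, bias=False),
            nn.BatchNorm3d(256), nn.ReLU(inplace=True),
            nn.Conv3d(256, 512, 3, (1, 2, 2), 1, bias=False),
            nn.BatchNorm3d(512), nn.ReLU(inplace=True),
        )
        self.head = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(),
                                  nn.Linear(512, num_classes))

    def forward(self, x):  # x: (N, C, T, H, W)
        n, c, t, h, w = x.shape
        y = self.backbone2d(x.transpose(1, 2).reshape(n * t, c, h, w))  # per frame
        y = y.reshape(n, t, *y.shape[1:]).transpose(1, 2)  # back to (N, C', T, H', W')
        return self.head(self.backbone3d(y))

net = Hybrid2D3DNet()
scores = net(torch.randn(2, 3, 16, 112, 112))  # -> (2, 51)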
The performance and accuracy of the different network frameworks on the HMDB51 and UCF101 datasets are shown in the table of FIG. 4; 16 uniformly sampled frames are used each time as the input to the network. The experimental results show that, in the spatial-stream network, the Decomposed ResNet-34 framework gives the best algorithm accuracy for the input I frames, while the 2D convolutional network models fall short of the 3D convolutional network. On the temporal-stream network, the Decomposed ResNet-18 framework clearly improves the recognition effect over the other frameworks with hybrid 3D convolution structures: compared with expanding a 2D convolutional network into a 3D structure partway through, this framework uses a feature extractor approximating a 3D convolution kernel from the start and can therefore learn more accurate temporal information.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (2)

1. A video human behavior recognition algorithm based on dual-stream spatio-temporal decomposition, characterized by comprising the following steps:
Step one: building a residual block of dual-stream spatio-temporal decomposition and using it as the basic building block of a residual convolutional network;
Step two: decomposing each complete 3D convolution kernel in the residual convolutional network of step one into a two-dimensional spatial convolution operation and a one-dimensional temporal convolution operation, each decomposed convolution operation being followed by its own BN layer and ReLU activation layer;
Step three: building a spatial-stream network for the I frames of the compressed video from the two-dimensional spatial convolutions decomposed in step two, and a temporal-stream network for the P frames, which fuse the motion vectors and residuals, from the one-dimensional temporal convolutions decomposed in step two;
Step four: multiplying and fusing the output of the last residual block of the temporal-stream network with the input of the current spatial stream, and using the fused result as the input of the spatial-stream network.
2. The video human behavior recognition algorithm based on dual-stream spatio-temporal decomposition of claim 1, wherein the spatial-stream and temporal-stream networks in step three extract motion information by fusing a two-dimensional convolution kernel cf2 with a one-dimensional convolution kernel cf1.
CN202111140075.6A 2021-09-28 2021-09-28 Video human behavior recognition algorithm based on double-flow space-time decomposition Pending CN114170618A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111140075.6A CN114170618A (en) 2021-09-28 2021-09-28 Video human behavior recognition algorithm based on double-flow space-time decomposition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111140075.6A CN114170618A (en) 2021-09-28 2021-09-28 Video human behavior recognition algorithm based on double-flow space-time decomposition

Publications (1)

Publication Number Publication Date
CN114170618A 2022-03-11

Family

ID=80477004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111140075.6A Pending CN114170618A (en) 2021-09-28 2021-09-28 Video human behavior recognition algorithm based on double-flow space-time decomposition

Country Status (1)

Country Link
CN (1) CN114170618A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115223250A (en) * 2022-09-13 2022-10-21 东莞理工学院 Upper limb rehabilitation action recognition method based on multi-scale space-time decomposition convolutional network
CN116524612A (en) * 2023-06-21 2023-08-01 长春理工大学 rPPG-based human face living body detection system and method
CN116524612B (en) * 2023-06-21 2023-09-12 长春理工大学 rPPG-based human face living body detection system and method
CN117975376A (en) * 2024-04-02 2024-05-03 湖南大学 Mine operation safety detection method based on depth grading fusion residual error network
CN117975376B (en) * 2024-04-02 2024-06-07 湖南大学 Mine operation safety detection method based on depth grading fusion residual error network

Similar Documents

Publication Publication Date Title
CN114170618A (en) Video human behavior recognition algorithm based on double-flow space-time decomposition
US10924755B2 (en) Real time end-to-end learning system for a high frame rate video compressive sensing network
CN107273800B (en) Attention mechanism-based motion recognition method for convolutional recurrent neural network
CN108537754B (en) Face image restoration system based on deformation guide picture
Li et al. GaitSlice: A gait recognition model based on spatio-temporal slice features
CN110222574A (en) Production operation Activity recognition method, apparatus, equipment, system and storage medium based on structuring double fluid convolutional neural networks
CN112288627B (en) Recognition-oriented low-resolution face image super-resolution method
CN112836646B (en) Video pedestrian re-identification method based on channel attention mechanism and application
CN112653899A (en) Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene
CN113283444B (en) Heterogeneous image migration method based on generation countermeasure network
CN112307995A (en) Semi-supervised pedestrian re-identification method based on feature decoupling learning
Yang et al. Deeplab_v3_plus-net for image semantic segmentation with channel compression
CN113128360A (en) Driver driving behavior detection and identification method based on deep learning
CN111768354A (en) Face image restoration system based on multi-scale face part feature dictionary
CN113920581A (en) Method for recognizing motion in video by using space-time convolution attention network
CN113505719A (en) Gait recognition model compression system and method based on local-integral joint knowledge distillation algorithm
CN111260577B (en) Face image restoration system based on multi-guide image and self-adaptive feature fusion
CN114333002A (en) Micro-expression recognition method based on deep learning of image and three-dimensional reconstruction of human face
CN113935435A (en) Multi-modal emotion recognition method based on space-time feature fusion
CN112906520A (en) Gesture coding-based action recognition method and device
CN111401116A (en) Bimodal emotion recognition method based on enhanced convolution and space-time L STM network
CN116524601B (en) Self-adaptive multi-stage human behavior recognition model for assisting in monitoring of pension robot
Chen et al. Rethinking lightweight: multiple angle strategy for efficient video action recognition
CN111325149A (en) Video action identification method based on voting time sequence correlation model
CN114582002B (en) Facial expression recognition method combining attention module and second-order pooling mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination