CN113627368B - Video behavior recognition method based on deep learning - Google Patents

Video behavior recognition method based on deep learning

Info

Publication number
CN113627368B
Authority
CN
China
Prior art keywords
deep learning
differential
video
enhancement
behavior recognition
Prior art date
Legal status
Active
Application number
CN202110937838.3A
Other languages
Chinese (zh)
Other versions
CN113627368A (en)
Inventor
黄鹤
余佳诺
曹洪龙
Current Assignee
Suzhou University
Original Assignee
Suzhou University
Priority date
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN202110937838.3A priority Critical patent/CN113627368B/en
Publication of CN113627368A publication Critical patent/CN113627368A/en
Application granted granted Critical
Publication of CN113627368B publication Critical patent/CN113627368B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video behavior recognition method based on deep learning, which comprises the following steps: S1, given a color input video, first divide it into T segments of equal duration, where T is a positive integer, and randomly sample one frame from each segment to obtain an input sequence of T frames; S2, input the preprocessed frame images into a deep learning model to obtain the features processed by the deep learning model; S3, normalize the processed features and average over the time dimension to obtain the classification of the video behavior. The deep learning model is a differential enhancement network whose base network is ResNet50, and a differential enhancement module is embedded in the ResNet50. The beneficial effect of the invention is that, through the differential enhancement module, the deep-learning-based video behavior recognition system achieves higher detection accuracy than other related systems.

Description

Video behavior recognition method based on deep learning
Technical Field
The invention relates to the field of deep learning, in particular to a video behavior recognition method based on deep learning.
Background
In recent years, with the rapid development of cloud computing and the Internet of Things and the increasing decentralization of networks, any organization or even individual can easily upload videos to the network, so efficient and accurate understanding of video helps people make better use of the information it contains. Video behavior recognition is a fundamental problem in the field of video understanding. Deep-learning-based video behavior recognition methods have proven more efficient and accurate than traditional hand-crafted feature extraction, so research on deep-learning-based video behavior recognition follows current research trends and has gradually become a basic research topic in computer vision in recent years.
Video behavior recognition, as the name implies, refers to recognizing behavior in a video. Unlike still images, most behavior in video is time dependent: a video contains not only spatial information within each frame but also temporal information between successive frames. The focus of video behavior recognition research is therefore the effective and reasonable extraction of both the spatial and the temporal information in a video.
Existing deep-learning-based video behavior recognition methods fall mainly into two types: methods based on a two-stream architecture and methods based on 3D convolutional neural networks. The main idea of the two-stream architecture is that a spatial stream performs spatial modeling on the input RGB frames, a temporal stream performs temporal modeling on the input optical flow, and the information from the two streams is then fused and fed into a classifier for recognition. Compared with a single-stream structure, the two-stream structure improves performance markedly because it combines frame images with optical flow. However, optical flow is quite expensive to compute and only represents motion between adjacent frames. In addition, the temporal stream of a two-stream structure mostly uses 2D convolutional networks, which lack long-term temporal modeling capability.
The design idea of the 3D convolutional neural network is to replace the convolution kernels of a 2D convolutional neural network with 3D kernels, enabling effective extraction of spatio-temporal information. One advantage of the 3D convolutional neural network approach is that, as network layers are stacked, it can extract long-term temporal information. In addition, a 3D convolutional neural network can extract spatio-temporal information directly from RGB input, avoiding the computationally expensive optical flow extraction process. However, 3D convolutional networks also suffer from high computational cost and low running speed. 3D convolutional neural networks and two-stream networks complement each other, but the computational cost of combining the two is too high, so the combination is difficult to apply widely in practice. Therefore, given the lightweight and efficient nature of 2D convolutional neural networks, some recent methods attempt to add temporal processing to 2D convolutional neural networks to achieve effective extraction of spatio-temporal information, but some of these methods are not portable.
The prior art has the following technical problems:
models based on two-stream networks lack long-term temporal modeling capability, and the time and space costs of optical flow extraction are relatively high; models based on 3D convolutional neural networks have high computational cost and low running speed; and some models based on 2D convolutional neural networks lack efficient extraction of temporal information.
Disclosure of Invention
The technical problem the invention aims to solve is to provide a video behavior recognition method based on deep learning by designing a differential enhancement network model (TDEN). The model is based on ResNet50 and introduces an attention mechanism that exploits time-difference information, addressing the problems that a 2D convolutional network cannot effectively extract temporal information and that recognition accuracy is low; a differential enhancement module is designed to enhance motion information along the spatio-temporal and channel dimensions; the designed module is plug-and-play and can be used with various mainstream 2D convolutional frameworks; and an end-to-end regression mode is adopted, which simplifies the system model and effectively reduces the number of network parameters and the amount of computation.
In order to solve the technical problems, the invention provides a video behavior recognition method based on deep learning, which comprises the following steps:
S1, given a color input video, first divide it into T segments of equal duration, where T is a positive integer, and randomly sample one frame from each segment to obtain an input sequence of T frames;
S2, input the preprocessed frame images into a deep learning model to obtain the features processed by the deep learning model;
S3, normalize the processed features and average over the time dimension to obtain the classification of the video behavior;
the deep learning model is a differential enhancement network, the base network of the differential enhancement network is ResNet50, and a differential enhancement module is embedded in the ResNet50.
In one embodiment, the differential enhancement module is implemented in the form of a residual block whose residual function is x_{n+1} = x_n + CS(x_n, W_n), where CS(x_n, W_n) is the differential enhancement part.
In one embodiment, the differential enhancement module proceeds as follows: first, x_n is differenced along the time dimension to obtain differential features; spatial attention activation and channel attention activation are then applied to the differential features respectively; finally, the attention-activated differential features are point-wise multiplied with x_n to enhance the motion information in x_n.
In one embodiment, the differential enhancement module uses time-difference information to enhance motion information along the spatio-temporal and channel dimensions and is embedded into a 2D convolutional neural network, giving the deep learning model spatio-temporal information extraction capability.
In one embodiment, the structure of the ResNet50 is as follows: a 7×7 convolution layer extracts image features, turning a frame image of size [NT, 3, 224, 224] into a feature map of size [NT, 64, 112, 112]; a downsampling max-pooling layer then yields a feature map of size [NT, 64, 56, 56]; conv2 through conv5 are applied in sequence to obtain a feature map of size [NT, 2048, 7, 7]; finally, the resulting feature map is average-pooled and fed to a fully connected layer to obtain features of size NT × CLS, where CLS is the number of video behavior classes.
In one embodiment, conv2 through conv5 of the ResNet50 use bottleneck building blocks whose residual function is x_{n+1} = x_n + F(x_n, W_n), where F(x_n, W_n) is the residual part, consisting of three convolution operations.
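As an illustration of the bottleneck residual block described above, the following is a minimal PyTorch sketch (not taken from the patent); the channel sizes and the omission of stride, downsampling and the projection shortcut are simplifying assumptions:

import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual part F(x_n, W_n): three convolutions (1x1 reduce, 3x3, 1x1 restore)."""
    def __init__(self, channels: int, mid_channels: int):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(channels, mid_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x_{n+1} = x_n + F(x_n, W_n)
        return self.relu(x + self.residual(x))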
In one embodiment, the input features are normalized by softmax in step S3, where softmax is defined as:
softmax(x_i) = exp(x_i) / Σ_j exp(x_j)
based on the same inventive concept, the present application also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, said processor implementing the steps of any one of the methods when executing said program.
Based on the same inventive concept, the present application also provides a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, implements the steps of any of the methods.
Based on the same inventive concept, the present application also provides a processor for running a program, wherein the program runs to perform any one of the methods.
The invention has the beneficial effects that:
the video behavior recognition system based on deep learning provided by the invention can obtain higher detection accuracy through the differential enhancement module, so that the accuracy of the system exceeds that of other related systems; the differential enhancement branches in the differential enhancement module can enable the network to extract time information more effectively and ignore irrelevant information, so that the network has higher efficiency; the end-to-end mode greatly reduces the calculation amount and calculation time of the model; the module can be embedded in any 2D convolutional neural network, does not use any additional information and auxiliary network, and has simple structure and small parameter quantity.
Drawings
Fig. 1 is a schematic diagram of a behavior recognition model of a video behavior recognition method based on deep learning.
Fig. 2 is a schematic structural diagram of a differential enhancement module in the video behavior recognition method based on deep learning.
Fig. 3 is a schematic diagram of a second structure of a differential enhancement module in the video behavior recognition method based on deep learning according to the present invention.
Fig. 4 is a third schematic structural diagram of the differential enhancement module in the video behavior recognition method based on deep learning according to the present invention.
FIG. 5 is a training and testing curve of the model of FIG. 2 on the Something-Something V1 dataset, verifying the effectiveness of the deep learning model of the present invention for video behavior recognition.
FIG. 6 is a training and testing curve of the model of FIG. 3 on the Something-Something V1 dataset, verifying the effectiveness of the deep learning model of the present invention for video behavior recognition.
FIG. 7 is a training and testing curve of the model of FIG. 4 on the Something-Something V1 dataset, verifying the effectiveness of the deep learning model of the present invention for video behavior recognition.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and specific examples, which are not intended to be limiting, so that those skilled in the art will better understand the invention and practice it.
The deep-learning-based video behavior recognition system provided by the invention constructs a differential enhancement network whose base network is ResNet50, and motion information is effectively extracted by embedding a differential enhancement module. The differential enhancement module uses time-difference information to enhance motion information along the spatio-temporal and channel dimensions, and it is embedded into a 2D convolutional neural network, so the model is lightweight and efficient and has good spatio-temporal information extraction capability. Fig. 1 shows a schematic diagram of the entire model, and Table 1 shows the model structure for one input video based on ResNet50 (where IC is the number of input channels, OC the number of output channels, ks the convolution kernel size, s the convolution stride, CLS the number of classes, and the bracketed shapes are the original ResNet50 residual blocks). The specific procedure for video behavior recognition is:
TABLE 1 (reproduced as an image in the original publication): layer-by-layer structure of the ResNet50-based model for one input video, giving for each layer the input channels IC, output channels OC, kernel size ks, stride s, the bracketed original ResNet50 residual blocks, and the final CLS-way classification layer.
S1. Given a color input video, first divide it into T segments of equal duration (T is a positive integer) to enable long-term temporal structure modeling. Then randomly sample one frame from each segment to obtain an input sequence of T frames. The frame images are cropped to 224×224 during preprocessing. The dimension of the model input is therefore [N, T, 3, 224, 224], where N is the number of input videos (1 for a single input video) and T is the number of sampled frames per video; in this patent T is typically set to 8 or 16;
S2. Input the processed frame images into the deep learning model to obtain the features processed by the deep learning model;
S3. Apply softmax to the processed features and average over the time dimension to obtain the classification of the video behaviors.
The specific flow of the step S2 is as follows:
s2-1, using ResNet50 as a base network, since the input tensor received by ResNet50 is 4-dimensional, the dimension of the input is changed by morphing to: [ NT,3,224,224], where nt=n×t, embedding the designed differential enhancement module into conv2 to conv5 of ResNet50 results in a proposed model;
S2-2. The differential enhancement module is implemented in the form of a residual block so that background information is preserved; its residual function is x_{n+1} = x_n + CS(x_n, W_n), where CS(x_n, W_n) is the differential enhancement part. Its general steps are: first, x_n is differenced along the time dimension to obtain differential features; spatial attention activation and channel attention activation are applied to the differential features respectively; and the attention-activated differential features are point-wise multiplied with x_n to enhance the motion information in x_n.
S2-3, after the differential enhancement module, a 1D convolution with a kernel size of 3 is cascaded to realize the extraction of the long-term time information.
The ResNet50 structure used in step S2-1 is as follows. First, a 7×7 convolution layer extracts image features, turning a frame image of size [NT, 3, 224, 224] into a feature map of size [NT, 64, 112, 112]. A downsampling max-pooling layer then produces a feature map of size [NT, 64, 56, 56]. The resulting feature map is passed through conv2 to conv5 to obtain a feature map of size [NT, 2048, 7, 7]. Finally, the feature map is average-pooled and fed to a fully connected layer to obtain features of size NT × CLS, where CLS is the number of video behavior classes. conv2 to conv5 of ResNet50 use residual blocks, more specifically "bottleneck" building blocks, whose residual function is x_{n+1} = x_n + F(x_n, W_n), where F(x_n, W_n) is the residual part consisting of three convolution operations; this prevents information loss during feature extraction and effectively mitigates the gradient explosion and gradient vanishing problems in deeper networks.
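For readers who want to trace the data flow described above, the following is a small sketch assuming a standard torchvision ResNet50 whose final fully connected layer is resized to CLS outputs; the variable names and the example class count are illustrative only:

import torch
import torch.nn as nn
from torchvision.models import resnet50

CLS = 174  # illustrative, e.g. the number of classes in Something-Something V1
backbone = resnet50(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, CLS)  # 2048 -> CLS

x = torch.randn(2, 8, 3, 224, 224)        # [N, T, 3, 224, 224]
N, T = x.shape[:2]
x = x.reshape(N * T, 3, 224, 224)         # flatten to [NT, 3, 224, 224]
logits = backbone(x)                      # [NT, CLS] features before softmax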
In step S2-2, the differential enhancement module is embedded in front of the first convolution operation of res2 to res5 of ResNet50. The invention proposes three differential enhancement modules, whose structures are shown in Fig. 2, Fig. 3 and Fig. 4 respectively (these three structures are different implementations of the differential enhancement module; although their implementation details differ, the general idea is the same, namely to enhance the motion information in the input features through differential features). The structure shown in Fig. 2 is described here in detail. The module consists of two parts connected in parallel: the first part is the original input; the second part is the differential enhancement. The specific process of the differential enhancement module is as follows. Assume the dimension of the input feature X is [NT, C, H, W], where N is the batch size (i.e., the number of videos), T is the number of frames sampled per video (the time dimension), C is the number of channels, H and W are the height and width, and NT = N × T. X is passed through a 1×1 convolution to obtain a feature X_r of dimension [NT, C/r, H, W]; the purpose of this step is to reduce the amount of computation. X_r is reshaped to dimension [N, T, C/r, H, W] to facilitate subsequent processing of the time dimension, and its first T−1 frames are extracted to obtain X_r(t). Since motion can cause spatial displacement of the same object between two frames, directly computing differences between displaced features would produce mismatched motion representations, so X_r (before reshaping) is passed through a 3×3 channel-separated convolution, reshaped again, and its last T−1 frames are extracted to obtain X_r(t+1); the channel-separated convolution prevents channel fusion from interfering with the processing of the time dimension. The differential feature D(t) is then obtained by D(t) = X_r(t+1) − X_r(t), with dimension [N, T−1, C/r, H, W]. The last position of its time dimension is zero-padded to change its dimension to [N, T, C/r, H, W] (the zero-padding facilitates subsequent computation), and the differential feature is reshaped to [NT, C/r, H, W]. The differential feature is then averaged over the channels, and the channel-averaged differential feature is passed through a 7×7 convolution and a Sigmoid function in sequence to obtain a spatially activated differential feature DA; spatial enhancement of the differential feature is obtained by DSA = D + D × DA. The spatially enhanced differential feature is then averaged spatially, giving dimension [NT, C/r, 1, 1]; a 1×1 convolution changes the dimension to [NT, C, 1, 1]; and after Sigmoid activation, point-wise multiplication with the original input feature X enhances the positions of X rich in motion information. The above differential enhancement process can be written as CS(x_n, W_n). Finally, the output of the differential enhancement module is obtained via x_{n+1} = x_n + CS(x_n, W_n); the residual form enhances short-term motion information while preserving background information.
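The following PyTorch sketch illustrates one possible reading of the Fig. 2 branch described above; it follows the residual form x_{n+1} = x_n + CS(x_n, W_n), but the layer names, the reduction ratio r and other details are assumptions rather than the patented implementation:

import torch
import torch.nn as nn

class DifferentialEnhancement(nn.Module):
    def __init__(self, channels: int, n_segment: int = 8, r: int = 16):
        super().__init__()
        self.n_segment = n_segment
        c_r = channels // r
        self.reduce = nn.Conv2d(channels, c_r, kernel_size=1, bias=False)   # channel compression
        self.shift_conv = nn.Conv2d(c_r, c_r, kernel_size=3, padding=1,
                                    groups=c_r, bias=False)                 # channel-separated 3x3
        self.spatial_att = nn.Conv2d(1, 1, kernel_size=7, padding=3)        # 7x7 on channel-averaged map
        self.restore = nn.Conv2d(c_r, channels, kernel_size=1, bias=False)  # channel recovery
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        nt, c, h, w = x.shape
        t = self.n_segment
        n = nt // t

        xr = self.reduce(x)                                          # [NT, C/r, H, W]
        xr_t = xr.view(n, t, -1, h, w)[:, :-1]                       # frames 1..T-1
        xr_next = self.shift_conv(xr).view(n, t, -1, h, w)[:, 1:]    # aligned frames 2..T
        d = xr_next - xr_t                                           # differential feature [N, T-1, C/r, H, W]
        d = torch.cat([d, d.new_zeros(n, 1, d.size(2), h, w)], dim=1)  # zero-pad last time step
        d = d.view(nt, -1, h, w)                                     # [NT, C/r, H, W]

        # spatial attention from the channel-averaged differential feature
        da = self.sigmoid(self.spatial_att(d.mean(dim=1, keepdim=True)))     # [NT, 1, H, W]
        dsa = d + d * da                                             # spatially enhanced differential feature

        # channel attention: spatial average -> 1x1 conv back to C channels -> sigmoid
        att = self.sigmoid(self.restore(dsa.mean(dim=(2, 3), keepdim=True))) # [NT, C, 1, 1]
        return x + x * att                                           # residual enhancement of the input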
The module shown in Fig. 3 differs from the one shown in Fig. 2 in that:
the channel compression and recovery before and after the differential feature processing are removed;
spatial enhancement and channel enhancement of the motion information are combined in parallel;
channel enhancement of the motion information is achieved by spatial averaging, channel compression, ReLU activation, channel restoration and Sigmoid activation.
The module shown in Fig. 4 differs from the one shown in Fig. 2 in that: the fact that the channel number is 1 after channel averaging is used to directly achieve spatial enhancement of the motion information in the original input.
In step S2-3, the input dimension is changed from [NT, C, H, W] to [N×H×W, C, T] by reshape and permute operations; temporal information is then extracted by a 1D convolution with kernel size 3; and finally the original dimensions are restored by reshape and permute operations. Note that the 1D convolution used is channel-separated, which avoids channel fusion interfering with the extraction of temporal information. The convolution kernel of the first 1/8 of the channels is initialized to (0, 0, 1), which shifts the time dimension of those input channels to the left; the kernel of the 1/8–2/8 channels is initialized to (1, 0, 0), which shifts the time dimension of those channels to the right; and the kernel of the remaining 3/4 of the channels is initialized to (0, 1, 0), which leaves those channels unprocessed. This operation helps to accelerate the extraction of temporal information and improves the running efficiency of the model.
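A hypothetical sketch of this cascaded, channel-separated temporal 1D convolution with shift-style kernel initialization is given below; the module and variable names are illustrative assumptions:

import torch
import torch.nn as nn

class TemporalConv1D(nn.Module):
    def __init__(self, channels: int, n_segment: int = 8):
        super().__init__()
        self.n_segment = n_segment
        # channel-separated (depthwise) 1D convolution with kernel size 3
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1,
                              groups=channels, bias=False)
        with torch.no_grad():
            w = torch.zeros(channels, 1, 3)
            fold = channels // 8
            w[:fold, 0, 2] = 1.0          # first 1/8 of channels: (0, 0, 1) -> shift left in time
            w[fold:2 * fold, 0, 0] = 1.0  # next 1/8 of channels: (1, 0, 0) -> shift right in time
            w[2 * fold:, 0, 1] = 1.0      # remaining 3/4: (0, 1, 0) -> identity, no temporal mixing
            self.conv.weight.copy_(w)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        nt, c, h, w = x.shape
        t = self.n_segment
        n = nt // t
        # [NT, C, H, W] -> [N*H*W, C, T] so the convolution runs along the time axis
        y = x.view(n, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(n * h * w, c, t)
        y = self.conv(y)
        # restore the original [NT, C, H, W] layout
        return y.reshape(n, h, w, c, t).permute(0, 4, 3, 1, 2).reshape(nt, c, h, w)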
In step S3, a softmax normalization operation is performed on the input features of size NT × CLS, where softmax is defined as:
softmax(x_i) = exp(x_i) / Σ_j exp(x_j)
The dimension of the softmax-processed features is then changed to [N, T, CLS] by reshaping; the features are averaged along the time dimension, and the result is reshaped to [N, CLS] to obtain a classification for each video.
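A minimal sketch of step S3, under the assumption that the backbone output has shape [NT, CLS], might look as follows (the function and variable names are illustrative):

import torch
import torch.nn.functional as F

def video_prediction(features: torch.Tensor, n_segment: int = 8) -> torch.Tensor:
    nt, cls = features.shape
    n = nt // n_segment
    probs = F.softmax(features, dim=1)        # softmax over the CLS classes
    probs = probs.view(n, n_segment, cls)     # [N, T, CLS]
    return probs.mean(dim=1)                  # [N, CLS], per-video class scores

# usage: video_prediction(logits, n_segment=8).argmax(dim=1) gives one class per video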
One specific application scenario of the invention is as follows:
the whole video behavior recognition system needs to be trained on the marked data set before being used for actual testing and use. The specific training steps are as follows:
extracting video frame images:
all frame images in the video are extracted and divided into T segments of equal duration, and then one frame is randomly sampled from each segment to obtain an input sequence with T frames. For example: assuming that the video has 128 frames in total, T takes 8, namely dividing the 128 frames at intervals of 8 frames and forming 8 segments, 16 frames in each segment, and then randomly extracting 1 frame from the 8 segments respectively to obtain 8 frames in total as input.
Data augmentation:
after the frame image is obtained, data enhancement operation, namely data enhancement, is needed, which means that limited data generates value equivalent to more data under the condition of not substantially increasing the data, and is mainly used for preventing overfitting. The specific process of data enhancement is as follows:
(1) Corner cropping: regions are extracted only from the corners or the center of the picture, so that the network does not always focus on the center of the picture;
(2) Scale jitter: the input size is fixed to 256×340, and the width and height of the cropping region are randomly selected from {256, 224, 192, 168}; these cropped regions are finally resized to 224×224 for network training. In practice this approach includes not only scale jitter but also aspect-ratio jitter;
(3) Horizontal flipping: for some datasets, the images belonging to the same video are randomly flipped left-right about the vertical center axis; most behaviors encountered in real life do not appear upside down, so flipping the training data up-down would not provide effective augmentation.
The augmented images are then input into the network, which effectively helps to avoid overfitting; a simplified sketch of such an augmentation pipeline is given below.
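The sketch below illustrates a simplified version of the augmentation steps above (corner cropping, scale jitter and horizontal flipping); it omits the aspect-ratio jitter mentioned in item (2), and the function name and fixed sizes are assumptions:

import random
from PIL import Image, ImageOps

CROP_SIZES = [256, 224, 192, 168]

def augment(img: Image.Image, flip: bool = True) -> Image.Image:
    img = img.resize((340, 256))                        # fix the input scale to 340 x 256 (w x h)
    cw = ch = random.choice(CROP_SIZES)                 # scale jitter: random square crop size
    w, h = img.size
    corners = [(0, 0), (w - cw, 0), (0, h - ch), (w - cw, h - ch),
               ((w - cw) // 2, (h - ch) // 2)]          # four corners and the center
    x, y = random.choice(corners)                       # corner cropping
    img = img.crop((x, y, x + cw, y + ch)).resize((224, 224))
    if flip and random.random() < 0.5:                  # random horizontal flip
        img = ImageOps.mirror(img)
    return img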
1. Training process
The loss function and optimization algorithm used in the invention are the cross-entropy loss and the SGD optimizer, both common in classification problems. Fig. 5, Fig. 6 and Fig. 7 are the training and testing curves of the models corresponding to Fig. 2, Fig. 3 and Fig. 4 on the Something-Something V1 dataset, respectively, verifying the effectiveness of the deep learning model provided by the invention for video behavior recognition.
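A hypothetical training-loop sketch for this setup is shown below; the dataloader, learning rate and other hyper-parameters are placeholders, and averaging the segment logits before the cross-entropy loss is a simplification of the consensus described in step S3:

import torch
import torch.nn as nn

def train_one_epoch(model, loader, lr=0.01, device="cuda"):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
    model.train()
    for clips, labels in loader:                        # clips: [N, T, 3, 224, 224]
        n, t = clips.shape[:2]
        clips = clips.reshape(n * t, 3, 224, 224).to(device)
        logits = model(clips).reshape(n, t, -1).mean(dim=1)  # average segment logits per video
        loss = criterion(logits, labels.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()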
The key concept of the invention is as follows:
1. The constructed differential enhancement module is highly portable and can be conveniently embedded at different positions of different base models.
2. The differential enhancement module strengthens motion information spatially and increases the weight of motion-relevant channels, enhancing the representation capability of the model.
3. Temporal and spatial information can be effectively extracted from RGB input alone, and end-to-end training is adopted, so the computational load is smaller and training is faster.
The above-described embodiments are merely preferred embodiments for fully explaining the present invention, and the scope of the present invention is not limited thereto. Equivalent substitutions and modifications will occur to those skilled in the art based on the present invention, and are intended to be within the scope of the present invention. The protection scope of the invention is subject to the claims.

Claims (7)

1. A video behavior recognition method based on deep learning, comprising:
S1, given a color input video, first divide it into T segments of equal duration, where T is a positive integer, and randomly sample one frame from each segment to obtain an input sequence of T frames;
S2, input the preprocessed frame images into a deep learning model to obtain the features processed by the deep learning model;
S3, normalize the processed features and average over the time dimension to obtain the classification of the video behavior;
wherein the deep learning model is a differential enhancement network, the base network of the differential enhancement network is ResNet50, and a differential enhancement module is embedded in the ResNet50; the differential enhancement module is realized in the form of a residual block; residual blocks are used by res2 to res5 of the ResNet50, and the differential enhancement modules are respectively embedded before the first convolution operation of res2 to res5 of the ResNet50; the residual function of the residual block is x_{n+1} = x_n + CS(x_n, W_n), where CS(x_n, W_n) is the differential enhancement part; the method of the differential enhancement module comprises the following steps: first, x_n is differenced along the time dimension to obtain differential features; spatial attention activation and channel attention activation are applied to the differential features respectively; and the attention-activated differential features are point-wise multiplied with x_n to enhance the motion information in x_n.
2. The video behavior recognition method based on deep learning of claim 1, wherein the differential enhancement module implements enhancement of motion information from two dimensions of space-time and channel using time difference information, and wherein the differential enhancement module is embedded in a 2D convolutional neural network, thereby enabling the deep learning model to have a space-time information extraction capability.
3. The video behavior recognition method based on deep learning according to claim 1, wherein the specific structure of the ResNet50 is as follows: a 7×7 convolution layer extracts image features, turning a frame image of size [NT, 3, 224, 224] into a feature map of size [NT, 64, 112, 112]; a downsampling max-pooling layer yields a feature map of size [NT, 64, 56, 56]; conv2 through conv5 are applied in sequence to obtain a feature map of size [NT, 2048, 7, 7]; and finally the resulting feature map is average-pooled and fed to a fully connected layer to obtain features of size NT × CLS, where CLS is the number of video behavior classes, N is the number of input videos, and T is the number of sampled frames per video.
4. The method for identifying video behaviors based on deep learning according to claim 1, wherein in step S3, a softmax normalization operation is performed on the input features, and softmax is defined as:
softmax(x_i) = exp(x_i) / Σ_j exp(x_j)
5. a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 4 when the program is executed.
6. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 4.
7. A processor for running a program, wherein the program when run performs the method of any one of claims 1 to 4.
CN202110937838.3A 2021-08-16 2021-08-16 Video behavior recognition method based on deep learning Active CN113627368B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110937838.3A CN113627368B (en) 2021-08-16 2021-08-16 Video behavior recognition method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110937838.3A CN113627368B (en) 2021-08-16 2021-08-16 Video behavior recognition method based on deep learning

Publications (2)

Publication Number Publication Date
CN113627368A CN113627368A (en) 2021-11-09
CN113627368B true CN113627368B (en) 2023-06-30

Family

ID=78385749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110937838.3A Active CN113627368B (en) 2021-08-16 2021-08-16 Video behavior recognition method based on deep learning

Country Status (1)

Country Link
CN (1) CN113627368B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114299193B (en) * 2021-12-30 2024-05-03 山东大学 Black-white video coloring method, system, equipment and storage medium based on neural network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348381A (en) * 2019-07-11 2019-10-18 电子科技大学 A kind of video behavior recognition methods based on deep learning
CN111325111A (en) * 2020-01-23 2020-06-23 同济大学 Pedestrian re-identification method integrating inverse attention and multi-scale deep supervision
CN112699786B (en) * 2020-12-29 2022-03-29 华南理工大学 Video behavior identification method and system based on space enhancement module
CN112818958B (en) * 2021-03-24 2022-07-19 苏州科达科技股份有限公司 Action recognition method, device and storage medium

Also Published As

Publication number Publication date
CN113627368A (en) 2021-11-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant