CN111401149B - Lightweight video behavior identification method based on long-short-term time domain modeling algorithm - Google Patents

Lightweight video behavior identification method based on long-short-term time domain modeling algorithm

Info

Publication number
CN111401149B
Authority
CN
China
Prior art keywords
term
short
video
long
module
Prior art date
Legal status
Active
Application number
CN202010124065.2A
Other languages
Chinese (zh)
Other versions
CN111401149A (en
Inventor
王琦
李学龙
白思开
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202010124065.2A priority Critical patent/CN111401149B/en
Publication of CN111401149A publication Critical patent/CN111401149A/en
Application granted granted Critical
Publication of CN111401149B publication Critical patent/CN111401149B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a lightweight video behavior recognition method based on a long-short-term time domain modeling algorithm. A short-term feature interchange module is constructed using partial channel interchange, and a long-term feature fusion module is constructed using graph convolution, realizing effective extraction of short-term and long-term temporal features of the video respectively. By inserting the two modules at different positions of a two-dimensional deep residual network, temporal features at different stages are extracted, thereby effectively addressing the inaccurate results and high computational resource consumption of current video behavior recognition technology.

Description

Lightweight video behavior identification method based on long-short-term time domain modeling algorithm
Technical Field
The invention belongs to the technical field of computer vision and video classification, and particularly relates to a lightweight video behavior identification method based on a long-short-term time domain modeling algorithm, which can be applied to intelligent surveillance, crowd analysis, human-computer interaction and the like.
Background
With the advent of short-video apps such as Douyin and Kuaishou and of various live-streaming platforms, a large amount of new video is generated and shared on the Internet almost every moment. To cope with this information explosion, analyzing and understanding video information in all kinds of scenarios becomes increasingly important. Video behavior recognition means recognizing and judging the behaviors and actions of people in a video; it has wide application in real life, but owing to factors such as high resource consumption and insufficient extraction of temporal information, it remains a very challenging task in the field of video analysis.
Video behavior recognition technology can classify the current behavior in a video and predict the actions about to be taken, so it is applied in many fields, including intelligent surveillance systems and gesture recognition. By detecting the behavior of people in a surveillance system and analyzing and judging it according to certain rules, abnormal behavior can be alarmed in time. By recognizing gestures and postures, video behavior recognition technology can also be applied to crowd analysis and human-computer interaction.
Currently, most behavior recognition techniques fall into two categories: methods based on a two-stream structure and methods based on three-dimensional convolutional neural networks. Two-stream methods feed the video frames and the dense optical flow between frames into the two branches of the two-stream structure respectively, and finally fuse the results of the two branches to obtain the final result. The disadvantages of this approach are: 1) the optical flow features of the video must be extracted additionally, which costs much time and memory; 2) because the two-stream structure is still essentially based on a two-dimensional convolutional neural network, it cannot effectively capture the complex temporal information in the video, so the recognition accuracy is low. Methods based on three-dimensional convolutional neural networks use three-dimensional convolution to extract temporal and spatial features of the video simultaneously; their main defects are: 1) compared with a two-dimensional convolutional neural network, the number of parameters increases exponentially; 2) the computational cost of model pre-training is high, the model is difficult to train, and overfitting occurs easily; 3) a single layer of the model can only capture short-term temporal information and cannot effectively extract the long-term temporal information in the video.
Therefore, existing video behavior recognition technology generally suffers from high computational resource consumption and insufficient temporal feature extraction, and a video behavior recognition method that is accurate, consumes few computational resources and effectively extracts temporal features is needed.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a lightweight video behavior recognition method based on a long-short-term time domain modeling algorithm. Based on a two-dimensional deep residual network and graph convolution, the short-term and long-term temporal features of the video are effectively extracted. Compared with two-stream algorithms and three-dimensional convolutional neural network algorithms, the proposed method effectively solves the problems of inaccurate recognition results and high computational resource consumption in current video behavior recognition technology, without additionally training a graph model or extracting optical flow features.
A lightweight video behavior identification method based on a long-short-term time domain modeling algorithm is characterized by comprising the following steps:
Step 1: extracting an 8-frame video clip from each video of the video data set by uniform sampling, performing multi-scale cropping on the extracted video clips so that they have the same size, forming a new video clip data set from all the cropped video clips together with the labels of the videos to which they belong, and dividing it into a training data set and a test data set at a ratio of 4:1;
Step 2: constructing a long-short-term time domain behavior recognition network model comprising a spatial feature extraction module, a short-term feature interchange module, a long-term feature fusion module and a behavior prediction module; the spatial feature extraction module consists of a 50-layer ResNet containing 16 Bottleneck modules, 4 of which contain down-sampling layers; the first convolution layer and the different Bottleneck modules of the ResNet extract spatial features of the input video clip at different stages, and the last layer of the ResNet outputs the score of each frame with respect to all categories; a short-term feature interchange module is inserted before each Bottleneck module, which exchanges the features on the first 1/8 of the channels of each frame with the previous frame, exchanges the features on the adjacent 1/8 of the channels with the next frame, keeps the features on the remaining 6/8 of the channels unchanged, and superimposes the interchanged features on the original features before the interchange to obtain short-term temporal features at different stages; a long-term feature fusion module is added before each of the last two Bottleneck modules containing down-sampling layers and is placed before the inserted short-term feature interchange module; it takes the features extracted from the input feature map as nodes of a fully connected graph, fuses the information on the nodes by graph convolution, and through mapping keeps the fused long-term temporal features in the same structure as the input feature map; the behavior prediction module averages, by category, the category scores of all frames obtained by the feature extraction module to obtain the average score of the video clip for each category, and takes the category with the highest score as the final behavior recognition result of the video clip;
Step 3: inputting the training data set obtained in step 1 into the network model constructed in step 2 for training, setting the loss function of the network as the mean square error loss function and optimizing the network by stochastic gradient descent, wherein the batch size is 16, the initial learning rate is 0.01, the learning rate is reduced by a factor of 10 every 10 epochs and training runs for 30 epochs in total, and the trained network is the final behavior recognition network model;
Step 4: inputting the videos in the test data set into the long-short-term time domain behavior recognition network model trained in step 3 to obtain the behavior recognition result of each video in the test set.
The invention has the beneficial effects that: because partial feature interchange and graph convolution are used to construct the short-term and long-term temporal modules, and the two modules are inserted at multiple positions of a deep residual network (ResNet-50), temporal features at different stages can be effectively extracted, yielding higher behavior recognition accuracy; meanwhile, no graph model needs to be additionally trained and no optical flow features need to be extracted, so the amount of computation is small.
Drawings
FIG. 1 is a schematic diagram of a long-term and short-term time-domain behavior recognition network model of the present invention;
FIG. 2 is a schematic diagram of a short term feature interchange module;
FIG. 3 is a schematic diagram of a long term feature fusion module.
Detailed Description
The present invention is further described below with reference to the drawings and an embodiment; the invention includes, but is not limited to, the following embodiment.
The invention provides a lightweight video behavior identification method based on a long-term and short-term time domain modeling algorithm, which comprises the following implementation steps:
1. video pre-processing
Each video in the data set is first sampled uniformly into an 8-frame video clip. The clip is then subjected to multi-scale cropping (such as center cropping), each frame is resized to 224 × 224, and each video is thereby converted into a video clip of size 8 × 3 × 224 × 224. All the video clips form a new video clip data set, and the label of each video in the original data set serves as the label of the corresponding clip in the new data set. Finally, the new video clip data set is divided into a training data set and a test data set at a ratio of 4:1.
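For illustration, the preprocessing described above can be sketched as follows (Python with NumPy and OpenCV). The exact decoding pipeline and multi-scale cropping policy are not fixed by the description; uniform index sampling and a single center crop are used here as assumptions, and the function name is illustrative.

```python
# Minimal preprocessing sketch, assuming `frames` is a list of decoded H x W x 3
# uint8 arrays for one video; the sampling and cropping choices are assumptions.
import numpy as np
import cv2


def sample_clip(frames, num_frames=8, size=224):
    """Uniformly sample num_frames frames, resize the shorter side and center-crop."""
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)  # uniform sampling
    clip = []
    for i in idx:
        f = frames[i]
        h, w = f.shape[:2]
        scale = size / min(h, w)                                   # shorter side -> size
        f = cv2.resize(f, (int(round(w * scale)), int(round(h * scale))))
        h, w = f.shape[:2]
        top, left = (h - size) // 2, (w - size) // 2               # center crop
        clip.append(f[top:top + size, left:left + size])
    # (T, H, W, C) -> (T, C, H, W): an 8 x 3 x 224 x 224 clip as described above
    return np.stack(clip).transpose(0, 3, 1, 2).astype(np.float32) / 255.0
```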
2. Constructing long-short-term time domain behavior recognition network model
In order to extract various useful features from a video clip, the invention uses a spatial feature extraction module to extract spatial features of each frame of the clip at different stages, a short-term feature interchange module to apply partial channel interchange to the extracted features along the temporal dimension to obtain short-term temporal features, a long-term feature fusion module to propagate and fuse the extracted features over a long-term temporal range to obtain long-term temporal features, and finally a behavior prediction module to make the final judgment on the behavior category of the video clip. A long-short-term time domain behavior recognition network model is therefore constructed, comprising a spatial feature extraction module, a short-term feature interchange module, a long-term feature fusion module and a behavior prediction module.
(1) Spatial feature extraction module
The spatial feature extraction module consists of a 50-layer ResNet containing 16 Bottleneck modules, 4 of which contain down-sampling layers. The different Bottleneck modules of the ResNet extract spatial features of the input video clip at different stages, and the last layer of the ResNet outputs the score of each frame with respect to all categories.
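As an illustration of this backbone, the sketch below uses the torchvision ResNet-50 (whose four stages contain 3 + 4 + 6 + 3 = 16 Bottleneck blocks, the first block of each stage carrying a down-sampling branch) as the two-dimensional spatial feature extractor and replaces the final fully connected layer so that it outputs per-frame scores for all categories. The use of torchvision and the helper name are assumptions for illustration, not part of the patent text.

```python
# Minimal backbone sketch; torchvision's ResNet-50 stands in for the 50-layer
# ResNet described above, which is an assumption about the concrete implementation.
import torch.nn as nn
import torchvision.models as models


def build_backbone(num_classes):
    net = models.resnet50()  # 2D ResNet-50, randomly initialised here
    blocks = [m for m in net.modules() if isinstance(m, models.resnet.Bottleneck)]
    downsampling = [b for b in blocks if b.downsample is not None]
    assert len(blocks) == 16 and len(downsampling) == 4   # 16 Bottlenecks, 4 with down-sampling
    net.fc = nn.Linear(net.fc.in_features, num_classes)   # per-frame scores for all categories
    return net
```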
(2) Short term feature interchange module
A short-term feature interchange module is inserted before each Bottleneck module. As shown in FIG. 2, the short-term feature interchange module exchanges the features of each frame with the features of its two adjacent frames along the temporal dimension. Since the features of each frame consist of multiple channels, only part of the channels are interchanged to keep the computation low: the features on the first 1/8 of the channels are exchanged with the previous frame, the features on the adjacent 1/8 of the channels are exchanged with the next frame, and the remaining 6/8 of the channels are kept unchanged. To prevent the interchange from damaging the original spatial features of each frame, the invention adopts the residual idea and superimposes the interchanged features on the original input features, so that short-term temporal features are obtained while the original spatial features are retained. The whole process can be formulated as:
F_2^s = Stm(F_1, F_2, F_3) + F_2    (1)
where Stm(·) denotes the short-term feature interchange operation. Frame I_2 exchanges part of its channels with its two adjacent frames I_1 and I_3 to obtain short-term temporal features, and the original features F_2 are then added to obtain the feature F_2^s processed by the short-term feature interchange module; F_1 denotes the original features of the previous frame I_1, F_2 the original features of the current frame I_2, and F_3 the original features of the next frame I_3. The whole process neither introduces additional parameters nor consumes many computing resources.
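The interchange can be sketched as follows, following Eq. (1) literally; the sketch assumes the per-frame features of one clip are stacked along the batch dimension with shape (N·T, C, H, W) and T = 8 frames, and the class and parameter names are illustrative.

```python
# Minimal sketch of the short-term feature interchange module (Eq. (1)), assuming
# input of shape (N*T, C, H, W); one such module would sit before each Bottleneck.
import torch
import torch.nn as nn


class ShortTermInterchange(nn.Module):
    def __init__(self, num_frames=8, fold_div=8):
        super().__init__()
        self.t = num_frames
        self.fold_div = fold_div

    def forward(self, x):
        nt, c, h, w = x.shape
        n = nt // self.t
        x = x.view(n, self.t, c, h, w)
        fold = c // self.fold_div
        stm = torch.zeros_like(x)
        stm[:, 1:, :fold] = x[:, :-1, :fold]                  # first 1/8: from previous frame
        stm[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]  # adjacent 1/8: from next frame
        stm[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # remaining 6/8 unchanged
        out = stm + x                                         # superpose with original features
        return out.view(nt, c, h, w)
```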
(3) Long-term feature fusion module
A long-term feature fusion module is added before each of the last two Bottleneck modules that contain down-sampling layers, and it is placed before the inserted short-term feature interchange module.
The long-term feature fusion module takes the features extracted from the input feature map as the nodes of a fully connected graph. First, the input feature map F ∈ R^(C×T×H×W) is flattened into a new feature map F′ ∈ R^(C×L), where L = T × H × W, C denotes the number of channels, T the number of frames of the video clip, and H and W the height and width of the features of each frame. Then a set of features f_1, f_2, ..., f_n is extracted from the feature map F′ by a one-dimensional convolution operation, where f_k denotes the k-th feature extracted by the one-dimensional convolution, k = 1, ..., n, and n denotes the number of extracted features.
Then a single-layer fully connected graph is constructed, with the extracted features f_1, f_2, ..., f_n as its nodes. The information on the nodes is then propagated and fused over the long-term temporal range by graph convolution. The graph convolution operation is as follows:
Y = A_l V W_l    (2)
where V denotes the nodes of the fully connected graph, composed of the extracted features f_1, f_2, ..., f_n; A_l and W_l denote the adjacency matrix and the weight matrix of the long-term feature fusion module, respectively; and Y is the long-term temporal feature obtained by propagation and fusion over the long-term temporal range. In the graph convolution, the adjacency matrix A_l first learns the weights of the edges between nodes and performs information propagation, and the weight matrix W_l then updates the node states. To prevent optimization difficulty and degradation, identity mappings are added before the node-state update step (i.e., before right-multiplication by the weight matrix W_l) and after the entire graph convolution operation, respectively. The graph convolution operation is thus optimized as:
Y = (V + A_l V) W_l + V    (3)
Finally, the long-term temporal feature Y obtained by the graph convolution operation is converted, by a deconvolution operation, into a feature map with the same structure as the module's input feature map F, so that the features processed by the long-term feature fusion module fit the feature structure of the spatial feature extraction module; this is the inverse of the process that converts the module's input feature map into the nodes of the fully connected graph.
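The fusion module can be sketched as below: a one-dimensional convolution extracts node features from the flattened map, a learnable adjacency matrix A_l and weight matrix W_l implement Eq. (3), and a one-dimensional transposed convolution maps the result back to the shape of the input feature map. The number of nodes and the kernel/stride choice (which requires L to be divisible by the number of nodes) are assumptions not specified by the description.

```python
# Minimal sketch of the long-term feature fusion module, Y = (V + A_l V) W_l + V.
# `num_nodes` and the Conv1d/ConvTranspose1d configuration are illustrative choices.
import torch
import torch.nn as nn


class LongTermFusion(nn.Module):
    def __init__(self, channels, length, num_nodes):
        super().__init__()
        assert length % num_nodes == 0          # assumption: L divisible by node count
        k = length // num_nodes
        self.to_nodes = nn.Conv1d(channels, channels, kernel_size=k, stride=k)
        self.from_nodes = nn.ConvTranspose1d(channels, channels, kernel_size=k, stride=k)
        self.adj = nn.Parameter(torch.eye(num_nodes))             # A_l, learnable adjacency
        self.weight = nn.Linear(channels, channels, bias=False)   # W_l, node-state update

    def forward(self, x):                        # x: (N, C, T, H, W)
        n_batch, c, t, h, w = x.shape
        f = x.view(n_batch, c, t * h * w)        # flatten to F' in R^(C x L)
        v = self.to_nodes(f).transpose(1, 2)     # nodes V: (N, n, C)
        y = self.weight(v + self.adj @ v) + v    # Eq. (3) with both identity mappings
        y = self.from_nodes(y.transpose(1, 2))   # deconvolve back to (N, C, L)
        return y.view(n_batch, c, t, h, w)       # same structure as the input feature map
```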
(4) Behavior prediction module
The behavior prediction module averages, by category, the scores of all frames of the video clip with respect to all categories obtained by the spatial feature extraction module, and the category with the highest score is taken as the final behavior recognition result of the video clip, which also serves as the behavior recognition result of the original, unpreprocessed video.
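A minimal sketch of this averaging step, assuming the backbone returns the per-frame class scores of one clip as a (T, num_classes) tensor:

```python
# Average the per-frame class scores and take the highest-scoring category.
import torch


def predict(frame_scores: torch.Tensor) -> int:
    clip_scores = frame_scores.mean(dim=0)   # average score of each category over all frames
    return int(clip_scores.argmax())         # category index of the recognition result
```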
3. Network model training
The network training parameters are set as follows: the loss function of the network is the mean square error loss function, the network is trained with stochastic gradient descent, the batch size is 16, the initial learning rate is 0.01, the learning rate is reduced by a factor of 10 every 10 epochs, and training runs for 30 epochs in total. The long-short-term time domain behavior recognition network model constructed above is then trained with the training data set obtained in step 1; the trained network is the final behavior recognition network model.
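The training configuration above can be sketched in PyTorch as follows; the model, the dataset object and the one-hot encoding of labels for the mean square error loss are illustrative assumptions.

```python
# Minimal training sketch: SGD, batch size 16, lr 0.01 divided by 10 every 10 epochs,
# 30 epochs, mean square error loss; `model` and `train_set` are assumed to exist.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def train(model, train_set, num_classes, device="cuda"):
    loader = DataLoader(train_set, batch_size=16, shuffle=True)
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
    model.to(device).train()
    for epoch in range(30):
        for clips, labels in loader:
            clips, labels = clips.to(device), labels.to(device)
            target = nn.functional.one_hot(labels, num_classes).float()  # MSE needs dense targets
            loss = criterion(model(clips), target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```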
4. The videos in the test data set are input into the trained long-short-term time domain behavior recognition network model to obtain the behavior recognition result of each video in the test set. Likewise, if any other video is input into the network, its behavior recognition result can be obtained.
To verify the effectiveness of the method of the invention, simulation experiments were carried out on an Intel i7-6800K CPU and an NVIDIA GeForce GTX 1080 GPU under the Ubuntu 16.04 operating system, with OpenCV 3.2.0, CUDA 9.2.148, cuDNN 7.3.1 and the PyTorch 1.0.0 deep learning framework. The data used in the experiments is the Something-Something V1 dataset proposed by Goyal et al. in the reference "Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, et al., The 'something something' video database for learning and evaluating visual common sense, in IEEE International Conference on Computer Vision, 2017, pp. 5842-5850". The method proposed by the invention was then compared with the TSN method of Wang et al. ("Limin Wang, Yuanjun Xiong, Zhe Wang, et al., Temporal segment networks: Towards good practices for deep action recognition, in European Conference on Computer Vision, 2016, pp. 20-36"), the Multi-Scale TRN algorithm of Zhou et al. ("Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba, Temporal relational reasoning in videos, in European Conference on Computer Vision, 2018, pp. 803-818"), the ECO algorithm of Zolfaghari et al. and the I3D algorithm. The required number of input frames, the computational cost in FLOPs and the recognition accuracy of the different methods were compared; the resulting data are shown in Table 1. It can be seen that, with a similar number of input frames and a similar amount of computation, the method of the invention achieves better results, and that with far fewer input frames and far less computation than the best-performing I3D algorithm, it achieves a comparable accuracy. This demonstrates that the method balances accuracy and efficiency: it requires fewer computing resources while maintaining high accuracy, and is therefore more practical.
TABLE 1
Method            Input frames   FLOPs     Accuracy (%)
TSN               8              16G       19.5
Multi-Scale TRN   8              16G       34.4
ECO               8              32G       39.6
I3D               32×2 clips     153G×2    41.6
The invention     8              33G       40.6

Claims (1)

1. A lightweight video behavior identification method based on a long-short-term time domain modeling algorithm is characterized by comprising the following steps:
Step 1: extracting an 8-frame video clip from each video of the video data set by uniform sampling, performing multi-scale cropping on the extracted video clips so that they have the same size, forming a new video clip data set from all the cropped video clips together with the labels of the videos to which they belong, and dividing it into a training data set and a test data set at a ratio of 4:1;
Step 2: constructing a long-short-term time domain behavior recognition network model comprising a spatial feature extraction module, a short-term feature interchange module, a long-term feature fusion module and a behavior prediction module; the spatial feature extraction module consists of a 50-layer ResNet containing 16 Bottleneck modules, 4 of which contain down-sampling layers; the first convolution layer and the different Bottleneck modules of the ResNet extract spatial features of the input video clip at different stages, and the last layer of the ResNet outputs the score of each frame with respect to all categories; a short-term feature interchange module is inserted before each Bottleneck module, which exchanges the features on the first 1/8 of the channels of each frame with the previous frame, exchanges the features on the adjacent 1/8 of the channels with the next frame, keeps the features on the remaining 6/8 of the channels unchanged, and superimposes the interchanged features on the original features before the interchange to obtain short-term temporal features at different stages; a long-term feature fusion module is added before each of the last two Bottleneck modules containing down-sampling layers and is placed before the inserted short-term feature interchange module; it takes the features extracted from the input feature map as nodes of a fully connected graph, fuses the information on the nodes by graph convolution, and through mapping keeps the fused long-term temporal features in the same structure as the input feature map; the behavior prediction module averages, by category, the category scores of all frames obtained by the feature extraction module to obtain the average score of the video clip for each category, and takes the category with the highest score as the final behavior recognition result of the video clip;
Step 3: inputting the training data set obtained in step 1 into the network model constructed in step 2 for training, setting the loss function of the network as the mean square error loss function and optimizing the network by stochastic gradient descent, wherein the batch size is 16, the initial learning rate is 0.01, the learning rate is reduced by a factor of 10 every 10 epochs and training runs for 30 epochs in total, and the trained network is the final behavior recognition network model;
Step 4: inputting the videos in the test data set into the long-short-term time domain behavior recognition network model trained in step 3 to obtain the behavior recognition result of each video in the test set.
CN202010124065.2A 2020-02-27 2020-02-27 Lightweight video behavior identification method based on long-short-term time domain modeling algorithm Active CN111401149B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010124065.2A CN111401149B (en) 2020-02-27 2020-02-27 Lightweight video behavior identification method based on long-short-term time domain modeling algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010124065.2A CN111401149B (en) 2020-02-27 2020-02-27 Lightweight video behavior identification method based on long-short-term time domain modeling algorithm

Publications (2)

Publication Number Publication Date
CN111401149A CN111401149A (en) 2020-07-10
CN111401149B true CN111401149B (en) 2022-05-13

Family

ID=71432113

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010124065.2A Active CN111401149B (en) 2020-02-27 2020-02-27 Lightweight video behavior identification method based on long-short-term time domain modeling algorithm

Country Status (1)

Country Link
CN (1) CN111401149B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464831B (en) * 2020-12-01 2021-07-30 马上消费金融股份有限公司 Video classification method, training method of video classification model and related equipment
CN112712695B (en) * 2020-12-30 2021-11-26 桂林电子科技大学 Traffic flow prediction method, device and storage medium
CN115346143A (en) * 2021-04-27 2022-11-15 中兴通讯股份有限公司 Behavior detection method, electronic device, and computer-readable medium
CN113239766A (en) * 2021-04-30 2021-08-10 复旦大学 Behavior recognition method based on deep neural network and intelligent alarm device

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845351A (en) * 2016-05-13 2017-06-13 苏州大学 It is a kind of for Activity recognition method of the video based on two-way length mnemon in short-term
CN107506712A (en) * 2017-08-15 2017-12-22 成都考拉悠然科技有限公司 Method for distinguishing is known in a kind of human behavior based on 3D depth convolutional networks
CN108764009A (en) * 2018-03-21 2018-11-06 苏州大学 The Video Events recognition methods of memory network in short-term is grown based on depth residual error
CN109214285A (en) * 2018-08-01 2019-01-15 浙江深眸科技有限公司 Detection method is fallen down based on depth convolutional neural networks and shot and long term memory network
CN109359519A (en) * 2018-09-04 2019-02-19 杭州电子科技大学 A kind of video anomaly detection method based on deep learning
CN109753897A (en) * 2018-12-21 2019-05-14 西北工业大学 Based on memory unit reinforcing-time-series dynamics study Activity recognition method
CN109753906A (en) * 2018-12-25 2019-05-14 西北工业大学 Public place anomaly detection method based on domain migration
CN109886358A (en) * 2019-03-21 2019-06-14 上海理工大学 Human bodys' response method based on multi-space information fusion convolutional neural networks
CN109919031A (en) * 2019-01-31 2019-06-21 厦门大学 A kind of Human bodys' response method based on deep neural network
CN110059587A (en) * 2019-03-29 2019-07-26 西安交通大学 Human bodys' response method based on space-time attention
CN110188653A (en) * 2019-05-27 2019-08-30 东南大学 Activity recognition method based on local feature polymerization coding and shot and long term memory network
CN110378208A (en) * 2019-06-11 2019-10-25 杭州电子科技大学 A kind of Activity recognition method based on depth residual error network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10706336B2 (en) * 2017-03-17 2020-07-07 Nec Corporation Recognition in unlabeled videos with domain adversarial learning and knowledge distillation
US10614310B2 (en) * 2018-03-22 2020-04-07 Viisights Solutions Ltd. Behavior recognition

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845351A (en) * 2016-05-13 2017-06-13 苏州大学 It is a kind of for Activity recognition method of the video based on two-way length mnemon in short-term
CN107506712A (en) * 2017-08-15 2017-12-22 成都考拉悠然科技有限公司 Method for distinguishing is known in a kind of human behavior based on 3D depth convolutional networks
CN108764009A (en) * 2018-03-21 2018-11-06 苏州大学 The Video Events recognition methods of memory network in short-term is grown based on depth residual error
CN109214285A (en) * 2018-08-01 2019-01-15 浙江深眸科技有限公司 Detection method is fallen down based on depth convolutional neural networks and shot and long term memory network
CN109359519A (en) * 2018-09-04 2019-02-19 杭州电子科技大学 A kind of video anomaly detection method based on deep learning
CN109753897A (en) * 2018-12-21 2019-05-14 西北工业大学 Based on memory unit reinforcing-time-series dynamics study Activity recognition method
CN109753906A (en) * 2018-12-25 2019-05-14 西北工业大学 Public place anomaly detection method based on domain migration
CN109919031A (en) * 2019-01-31 2019-06-21 厦门大学 A kind of Human bodys' response method based on deep neural network
CN109886358A (en) * 2019-03-21 2019-06-14 上海理工大学 Human bodys' response method based on multi-space information fusion convolutional neural networks
CN110059587A (en) * 2019-03-29 2019-07-26 西安交通大学 Human bodys' response method based on space-time attention
CN110188653A (en) * 2019-05-27 2019-08-30 东南大学 Activity recognition method based on local feature polymerization coding and shot and long term memory network
CN110378208A (en) * 2019-06-11 2019-10-25 杭州电子科技大学 A kind of Activity recognition method based on depth residual error network

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
A 3D-CNN and LSTM Based Multi-Task Learning Architecture for Action Recognition; XI OUYANG et al.; 《Digital Object Identifier》; 20191231; vol. 7; 40757-40770 *
DDLSTM: Dual-Domain LSTM for Cross-Dataset Action Recognition; Toby Perrett et al.; 《arXiv:1904.08634v1》; 20190418; 1-10 *
Graph Convolutional Networks for Temporal Action Localization; Runhao Zeng et al.; 《2019 IEEE/CVF International Conference on Computer Vision (ICCV)》; 20191231; 7093-7102 *
Lightweight Network Architecture for Real-Time Action Recognition; Alexander Kozlov et al.; 《arXiv:1905.08711v1》; 20190521; 1-8 *
Human behavior recognition method based on 3D convolutional neural networks; Zhang Ying et al.; 《Software Guide》; 20171130; vol. 16, no. 11; 9-11 *
Behavior recognition with a spatio-temporal two-channel convolutional neural network based on video segmentation; Wang Ping et al.; 《Journal of Computer Applications》; 20190710; vol. 39, no. 7; 2081-2086 *
Behavior recognition combining ordered optical flow images and two-stream convolutional networks; Li Qinghui et al.; 《Acta Optica Sinica》; 20180630; vol. 38, no. 6; 1-7 *

Also Published As

Publication number Publication date
CN111401149A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN111401149B (en) Lightweight video behavior identification method based on long-short-term time domain modeling algorithm
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN110084151B (en) Video abnormal behavior discrimination method based on non-local network deep learning
CN112989977B (en) Audio-visual event positioning method and device based on cross-modal attention mechanism
CN109272500B (en) Fabric classification method based on adaptive convolutional neural network
CN110516536A (en) A kind of Weakly supervised video behavior detection method for activating figure complementary based on timing classification
CN112085072B (en) Cross-modal retrieval method of sketch retrieval three-dimensional model based on space-time characteristic information
CN108256482A (en) A kind of face age estimation method that Distributed learning is carried out based on convolutional neural networks
CN110378233B (en) Double-branch anomaly detection method based on crowd behavior prior knowledge
KR102593835B1 (en) Face recognition technology based on heuristic Gaussian cloud transformation
CN113762138A (en) Method and device for identifying forged face picture, computer equipment and storage medium
CN111666852A (en) Micro-expression double-flow network identification method based on convolutional neural network
CN106780639A (en) Hash coding method based on the sparse insertion of significant characteristics and extreme learning machine
CN111275694B (en) Attention mechanism guided progressive human body division analysis system and method
CN113343760A (en) Human behavior recognition method based on multi-scale characteristic neural network
CN116721458A (en) Cross-modal time sequence contrast learning-based self-supervision action recognition method
CN113869285B (en) Crowd density estimation device, method and storage medium
CN113378962B (en) Garment attribute identification method and system based on graph attention network
CN109241315B (en) Rapid face retrieval method based on deep learning
Rijal et al. Integrating Information Gain methods for Feature Selection in Distance Education Sentiment Analysis during Covid-19.
CN110287970B (en) Weak supervision object positioning method based on CAM and covering
CN110705638A (en) Credit rating prediction classification method using deep network learning fuzzy information feature technology
CN110599460A (en) Underground pipe network detection and evaluation cloud system based on hybrid convolutional neural network
CN111160077A (en) Large-scale dynamic face clustering method
CN115240271A (en) Video behavior identification method and system based on space-time modeling

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant