CN111401149A - Lightweight video behavior identification method based on long-short-term time domain modeling algorithm - Google Patents

Lightweight video behavior identification method based on long-short-term time domain modeling algorithm

Info

Publication number
CN111401149A
Authority
CN
China
Prior art keywords
term
short
video
long
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010124065.2A
Other languages
Chinese (zh)
Other versions
CN111401149B (en)
Inventor
王琦 (Wang Qi)
李学龙 (Li Xuelong)
白思开 (Bai Sikai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202010124065.2A priority Critical patent/CN111401149B/en
Publication of CN111401149A publication Critical patent/CN111401149A/en
Application granted granted Critical
Publication of CN111401149B publication Critical patent/CN111401149B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a lightweight video behavior recognition method based on a long- and short-term temporal modeling algorithm. A short-term feature interchange module is constructed by partial channel interchange, and a long-term feature fusion module is constructed by graph convolution, so that short-term and long-term temporal features of the video are extracted effectively. By inserting the two modules at different positions of a two-dimensional deep residual network, temporal features at different stages are extracted, effectively addressing the inaccurate results and high computational resource consumption of current video behavior recognition techniques.

Description

Lightweight video behavior identification method based on long-short-term time domain modeling algorithm
Technical Field
The invention belongs to the technical field of computer vision and video classification, and particularly relates to a lightweight video behavior recognition method based on a long- and short-term temporal modeling algorithm, which can be applied to intelligent surveillance, crowd analysis, human-computer interaction, and the like.
Background
With the advent of short-video applications such as Douyin (TikTok) and Kuaishou, as well as various live-streaming platforms, a large amount of new video is generated and shared on the Internet almost every moment. To cope with this information explosion, analyzing and understanding video in its various application scenarios becomes increasingly important. Video behavior recognition means recognizing and judging the behavior and actions of people in a video. It has wide application in real life, but owing to factors such as high resource consumption and insufficient extraction of temporal information, it remains a very challenging task in the field of video analysis.
Video behavior recognition can classify the current behavior in a video and predict the actions about to be taken, so it is applied in many fields, including intelligent surveillance systems and gesture recognition. By detecting the behavior of people in a surveillance system and analyzing and judging it against certain rules, abnormal behavior can be alarmed in time. And by recognizing gestures and postures, video behavior recognition can also be applied to crowd analysis and human-computer interaction.
Currently, most behavior recognition techniques fall into two categories: methods based on a two-stream structure and methods based on three-dimensional convolutional neural networks. Two-stream methods feed the video frames and the dense optical flow between frames into the two branches of the structure separately, and finally fuse the results of the two branches to obtain the final result. The disadvantages of this approach are: 1) the optical flow features of the video must be extracted additionally, which consumes considerable time and memory; 2) because the two-stream structure is still essentially based on two-dimensional convolutional neural networks, it cannot effectively capture the complex temporal information in the video, so the recognition accuracy is low. Methods based on three-dimensional convolutional neural networks extract the temporal and spatial features of the video simultaneously with three-dimensional convolutions. Their main drawbacks are: 1) compared with two-dimensional convolutional neural networks, the number of parameters increases severalfold; 2) pre-training the model is computationally expensive, the model is hard to train, and it is prone to overfitting; 3) a single layer of the model can only capture short-term temporal information, so the long-term temporal information in the video cannot be extracted effectively.
Therefore, existing video behavior recognition techniques generally suffer from high computational resource consumption and insufficient temporal feature extraction, and a video behavior recognition method that is accurate, consumes few computational resources, and extracts temporal features effectively is needed.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a lightweight video behavior recognition method based on a long- and short-term temporal modeling algorithm. Based on a two-dimensional deep residual network and graph convolution, the short-term and long-term temporal features of the video are extracted effectively. Compared with two-stream and three-dimensional convolutional neural network algorithms, the method, without additionally training a graph model or extracting optical flow features, effectively addresses the inaccurate results and high computational resource consumption of current video behavior recognition techniques.
A lightweight video behavior identification method based on a long-short-term time domain modeling algorithm is characterized by comprising the following steps:
Step 1: extract an 8-frame video clip from each video in the video data set by uniform sampling, and apply multi-scale cropping so that all clips have the same size; all cropped video clips, together with the labels of the videos they belong to, form a new video clip data set, which is divided into a training data set and a test data set at a ratio of 4:1;
Step 2: construct a long- and short-term temporal behavior recognition network model comprising a spatial feature extraction module, a short-term feature interchange module, a long-term feature fusion module, and a behavior prediction module; the spatial feature extraction module is a 50-layer ResNet containing 16 Bottleneck modules, 4 of which contain down-sampling layers, where the first convolutional layer and the successive Bottleneck modules of the ResNet extract spatial features of the input video clip at different stages, and the last layer of the ResNet outputs each frame's scores over all categories; a short-term feature interchange module is inserted before each Bottleneck module, exchanging the features on part of each frame's channels with the features on the corresponding channels of the two adjacent frames, and superimposing the exchanged features on the original features to obtain short-term temporal features at different stages; a long-term feature fusion module is added before each of the last two Bottleneck modules containing down-sampling layers, placed before the inserted short-term feature interchange module, taking the features extracted from the input feature map as the nodes of a fully connected graph, fusing the information on the nodes by graph convolution, and mapping the fused long-term temporal features back to the same structure as the input feature map; the behavior prediction module averages the category scores of all frames obtained by the feature extraction module, category by category, to obtain the average score of the video clip for each category, and takes the category with the highest score as the final behavior recognition result of the video clip;
Step 3: input the training data set obtained in step 1 into the network model constructed in step 2 for training; the loss function of the network is the mean square error loss, and the network is optimized by stochastic gradient descent with a batch size of 16; the initial learning rate is 0.01 and is divided by 10 every 10 epochs, for 30 epochs in total; the trained network is the final behavior recognition network model;
Step 4: input the videos in the test data set into the long- and short-term temporal behavior recognition network model trained in step 3 to obtain the behavior recognition result of each video in the test set.
The invention has the beneficial effects that: because the short-term and long-term temporal modules are constructed with partial feature interchange and graph convolution, and the two modules are inserted at multiple positions of a deep residual network (ResNet-50), temporal features at different stages can be extracted effectively, yielding higher behavior recognition accuracy; meanwhile, no graph model needs to be trained additionally and no optical flow features need to be extracted, so the computational cost is low.
Drawings
FIG. 1 is a schematic diagram of a long-term and short-term time-domain behavior recognition network model of the present invention;
FIG. 2 is a schematic diagram of a short term feature interchange module;
FIG. 3 is a schematic diagram of a long term feature fusion module.
Detailed Description
The present invention will be further described with reference to the following drawings and examples, which include, but are not limited to, the following examples.
The invention provides a lightweight video behavior identification method based on a long-term and short-term time domain modeling algorithm, which comprises the following implementation steps:
1. video pre-processing
First, an 8-frame video clip is extracted from each video by uniform sampling; multi-scale cropping (e.g., center cropping) is then applied so that each frame is resized to 224×224, converting each video into a video clip of size 8×3×224×224. All video clips form a new video clip data set, and the label of each video in the original data set serves as the label of the corresponding clip in the new data set. Finally, the new video clip data set is divided into a training data set and a test data set at a ratio of 4:1.
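For illustration, a minimal sketch of this preprocessing in Python with OpenCV follows; the function name sample_clip and the short-side-resize-then-center-crop strategy are assumptions, since the text specifies only uniform sampling and multi-scale cropping (e.g., center cropping) to 224×224.

```python
import cv2
import numpy as np

def sample_clip(video_path, num_frames=8, size=224):
    """Uniformly sample num_frames frames and center-crop each to size x size."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Uniform sampling: take the middle frame of each of num_frames equal segments.
    indices = [int((i + 0.5) * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        scale = size / min(h, w)                      # resize the short side to `size`
        frame = cv2.resize(frame, (round(w * scale), round(h * scale)))
        h, w = frame.shape[:2]
        top, left = (h - size) // 2, (w - size) // 2  # center crop
        frames.append(frame[top:top + size, left:left + size])
    cap.release()
    clip = np.stack(frames)                           # (8, 224, 224, 3)
    return clip.transpose(0, 3, 1, 2)                 # (8, 3, 224, 224), as in the text
```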
2. Constructing long-short-term time domain behavior recognition network model
To extract the various useful features of a video clip, the invention uses a spatial feature extraction module to extract spatial features of each frame at different stages; a short-term feature interchange module that performs partial channel interchange on the extracted features along the time dimension to obtain short-term temporal features; a long-term feature fusion module that propagates and fuses the extracted features over a long temporal range to obtain long-term temporal features; and finally a behavior prediction module that makes the final judgment on the behavior category of the video clip. Accordingly, a long- and short-term temporal behavior recognition network model is constructed, comprising the spatial feature extraction module, the short-term feature interchange module, the long-term feature fusion module, and the behavior prediction module.
(1) Spatial feature extraction module
The spatial feature extraction module is a 50-layer ResNet containing 16 Bottleneck modules, 4 of which contain down-sampling layers. The successive Bottleneck modules of the ResNet extract spatial features of the input video clip at different stages, and the last layer of the network outputs each frame's scores over all categories.
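As a concrete illustration of how a purely two-dimensional backbone yields per-frame category scores, the sketch below folds the time dimension into the batch dimension with torchvision's stock ResNet-50; the 174-category output (the number of classes in Something-Something V1) is an assumption for concreteness, and the sketch omits the inserted temporal modules.

```python
import torch
import torchvision

backbone = torchvision.models.resnet50(num_classes=174)  # 2D ResNet-50 as per-frame classifier

clip = torch.randn(1, 8, 3, 224, 224)     # (batch, frames, channels, height, width)
b, t = clip.shape[:2]
frames = clip.view(b * t, 3, 224, 224)    # treat every frame as an independent sample
scores = backbone(frames).view(b, t, -1)  # (batch, 8, 174): each frame's scores over all categories
```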
(2) Short term feature interchange module
A short-term feature interchange module is inserted before each Bottleneck module. As shown in FIG. 2, the module exchanges each frame's features with those of its two adjacent frames along the time dimension. Since each frame's features consist of many channels, only part of the channels are interchanged to reduce computation: the features on the first 1/8 of the channels are exchanged with the previous frame, the features on the next 1/8 of the channels are exchanged with the next frame, and the remaining 6/8 of the channels stay unchanged. To prevent the interchange from damaging each frame's original spatial features, the residual idea is adopted: the interchanged features are superimposed on the original input features, so that short-term temporal features are obtained while the original spatial features are preserved. The whole process can be formulated as:
F_2^s = Stm(F_1, F_2, F_3) + F_2    (1)
where Stm(·,·,·) denotes the short-term feature interchange operation: frame I_2 obtains short-term temporal features by exchanging part of its channels with its two adjacent frames I_1 and I_3, and the original feature F_2 is then added, yielding the feature F_2^s processed by the short-term feature interchange module; F_1 denotes the original feature of the previous frame I_1, F_2 the original feature of the current frame I_2, and F_3 the original feature of the next frame I_3. The whole process neither introduces additional parameters nor consumes many computational resources.
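A minimal sketch of the partial channel interchange follows; the tensor layout (batch, frames, channels, H, W) and zero padding at the clip boundaries are assumptions not fixed by the text.

```python
import torch

def short_term_interchange(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, frames, channels, H, W). The first 1/8 of the channels is
    taken from the previous frame, the next 1/8 from the next frame, and the
    remaining 6/8 is kept; the input is then added back as a residual so the
    original spatial features are preserved, as in Eq. (1)."""
    fold = x.shape[2] // 8
    s = torch.zeros_like(x)
    s[:, 1:, :fold] = x[:, :-1, :fold]                  # exchange with the previous frame
    s[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]  # exchange with the next frame
    s[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # 6/8 of the channels unchanged
    return s + x                                        # F^s = Stm(...) + F

feats = short_term_interchange(torch.randn(2, 8, 256, 56, 56))  # shape is preserved
```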
(3) Long-term feature fusion module
A long-term feature fusion module is added before each of the last two Bottleneck modules that contain a down-sampling layer, and is placed before the inserted short-term feature interchange module.
The long-term feature fusion module takes features extracted from the input feature map as the nodes of a fully connected graph. First, the input feature map F ∈ R^(C×T×H×W) is flattened into a new feature map F' ∈ R^(C×L), where L = T×H×W, C denotes the number of channels, T the number of frames in the video clip, H the height of each frame's features, and W their width. Then n features f_1, f_2, ..., f_n are extracted from the feature map F' by a one-dimensional convolution operation, where f_k denotes the k-th extracted feature, k = 1, ..., n.
Then a single-layer fully connected graph is constructed with the extracted features f_1, f_2, ..., f_n as its nodes, and the information on the nodes is propagated and fused over the long-term temporal range by graph convolution. The graph convolution operation is:
Y = A_l V W_l    (2)
where V is the node matrix of the fully connected graph, formed from the extracted features f_1, f_2, ..., f_n; A_l and W_l denote the adjacency matrix and the weight matrix of the long-term feature fusion module, respectively; and Y is the long-term temporal feature obtained by propagating and fusing information over the long-term range. In the graph convolution, the adjacency matrix A_l first learns the weights of the edges between nodes and propagates information, and the weight matrix W_l then updates the node states. To prevent optimization difficulty and degradation, identity mappings are added both before the node-update step (i.e., before right-multiplying by the weight matrix W_l) and after the whole graph convolution operation. The graph convolution is thus optimized as:
Y = (V + A_l V) W_l + V    (3)
Finally, the long-term temporal feature Y obtained by the graph convolution is converted by a deconvolution operation into a feature map with the same structure as the module's input feature map F, so that the features processed by the long-term feature fusion module fit the feature structure of the spatial feature extraction module; this is the inverse of the process that converted the module's input feature map into the nodes of the fully connected graph.
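The sketch below is one possible PyTorch reading of this module: a strided one-dimensional convolution extracts the n node features, learned matrices A_l and W_l implement Eq. (3), and a transposed convolution performs the inverse mapping. The node count, the stride choice, and the learnable-adjacency parameterization are assumptions; the text fixes only the overall pipeline.

```python
import torch
import torch.nn as nn

class LongTermFusion(nn.Module):
    """Sketch of the long-term feature fusion module (length must be divisible by num_nodes)."""
    def __init__(self, channels: int, length: int, num_nodes: int):
        super().__init__()
        stride = length // num_nodes
        self.extract = nn.Conv1d(channels, channels, kernel_size=stride, stride=stride)
        self.recover = nn.ConvTranspose1d(channels, channels, kernel_size=stride, stride=stride)
        self.adj = nn.Parameter(torch.zeros(num_nodes, num_nodes))  # A_l: learned edge weights
        self.weight = nn.Linear(channels, channels, bias=False)     # W_l: node-state update

    def forward(self, f: torch.Tensor) -> torch.Tensor:  # f: (B, C, T, H, W)
        b, c, t, h, w = f.shape
        f_flat = f.reshape(b, c, t * h * w)               # F' in R^(C x L), L = T*H*W
        v = self.extract(f_flat).transpose(1, 2)          # node features V: (B, n, C)
        y = self.weight(v + self.adj @ v) + v             # Y = (V + A_l V) W_l + V, Eq. (3)
        y = self.recover(y.transpose(1, 2))               # inverse mapping back to (B, C, L)
        return y.reshape(b, c, t, h, w)                   # same structure as the input F

out = LongTermFusion(channels=1024, length=8 * 14 * 14, num_nodes=32)(torch.randn(2, 1024, 8, 14, 14))
```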
(4) Behavior prediction module
The behavior prediction module averages, category by category, the scores of all frames of the video clip obtained by the spatial feature extraction module; the category with the highest average score is taken as the final behavior recognition result of the video clip, and also as the recognition result of the original, unpreprocessed video.
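The averaging itself is a one-liner; a sketch assuming per-frame scores of shape (batch, frames, categories):

```python
import torch

frame_scores = torch.randn(1, 8, 174)   # per-frame category scores (placeholder values)
clip_scores = frame_scores.mean(dim=1)  # average each category over all 8 frames
prediction = clip_scores.argmax(dim=1)  # the highest-scoring category is the result
```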
3. Network model training
Set the network training parameters: the loss function of the network is the mean square error loss; the training method is stochastic gradient descent; the batch size is 16; the initial learning rate is 0.01 and is divided by 10 every 10 epochs; 30 epochs are trained in total. Then the constructed long- and short-term temporal behavior recognition network model is trained with the training data set obtained in step 1; the trained network is the final behavior recognition network model.
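A sketch of this training schedule in PyTorch follows. The momentum value, the one-hot targets for the mean square error loss, and the synthetic stand-in for the data loader are assumptions; the batch size, learning rate, decay schedule, and epoch count come from the text.

```python
import torch
import torchvision

num_classes = 174
model = torchvision.models.resnet50(num_classes=num_classes)  # placeholder for the full network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)  # lr /= 10 every 10 epochs
criterion = torch.nn.MSELoss()

# Stand-in for a real DataLoader that yields batches of 16 clips and labels.
train_loader = [(torch.randn(16, 8, 3, 224, 224), torch.randint(0, num_classes, (16,)))]

for epoch in range(30):
    for clips, labels in train_loader:
        b, t = clips.shape[:2]
        scores = model(clips.view(b * t, 3, 224, 224)).view(b, t, -1).mean(dim=1)
        target = torch.nn.functional.one_hot(labels, num_classes).float()
        loss = criterion(scores, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```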
4. Input the videos in the test data set into the trained long- and short-term temporal behavior recognition network model to obtain the behavior recognition result of each video in the test set. Likewise, any video input into the network yields its corresponding behavior recognition result.
To verify the effectiveness of the method of the invention, experiments were conducted on a computer equipped with an NVIDIA GTX 1080 GPU, running the Ubuntu 16.04 operating system with OpenCV 3.2.0, CUDA 9.2.148, cuDNN 7.3.1, and the PyTorch 1.0.0 deep learning framework. The data used in the experiments is the Something-Something V1 data set proposed by Goyal et al. in "Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, et al. The 'something something' video database for learning and evaluating visual common sense. IEEE International Conference on Computer Vision, 2017, pp. 5842-5850". The method of the invention is compared with the TSN algorithm, the Multi-Scale TRN algorithm, the ECO algorithm proposed by Zolfaghari et al. in "Mohammadreza Zolfaghari, Kamaljeet Singh, and Thomas Brox. ECO: Efficient convolutional network for online video understanding. European Conference on Computer Vision, 2018, pp. 695-712", and the I3D algorithm proposed by Carreira et al. in "Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. IEEE Conference on Computer Vision and Pattern Recognition, 2017". Table 1 lists the number of input frames, the computational cost, and the recognition accuracy of each method; the method of the invention achieves a good balance between recognition accuracy and computational resource consumption.
TABLE 1
Method             Input frames    Computation (FLOPs)    Accuracy (%)
TSN                8               16G                    19.5
Multi-Scale TRN    8               16G                    34.4
ECO                8               32G                    39.6
I3D                32×2 clips      153G×2                 41.6
The invention      8               33G                    40.6

Claims (1)

1. A lightweight video behavior recognition method based on a long- and short-term temporal modeling algorithm, characterized by comprising the following steps:
Step 1: extract an 8-frame video clip from each video in the video data set by uniform sampling, and apply multi-scale cropping so that all clips have the same size; all cropped video clips, together with the labels of the videos they belong to, form a new video clip data set, which is divided into a training data set and a test data set at a ratio of 4:1;
Step 2: construct a long- and short-term temporal behavior recognition network model comprising a spatial feature extraction module, a short-term feature interchange module, a long-term feature fusion module, and a behavior prediction module; the spatial feature extraction module is a 50-layer ResNet containing 16 Bottleneck modules, 4 of which contain down-sampling layers, where the first convolutional layer and the successive Bottleneck modules of the ResNet extract spatial features of the input video clip at different stages, and the last layer of the ResNet outputs each frame's scores over all categories; a short-term feature interchange module is inserted before each Bottleneck module, exchanging the features on part of each frame's channels with the features on the corresponding channels of the two adjacent frames, and superimposing the exchanged features on the original features to obtain short-term temporal features at different stages; a long-term feature fusion module is added before each of the last two Bottleneck modules containing down-sampling layers, placed before the inserted short-term feature interchange module, taking the features extracted from the input feature map as the nodes of a fully connected graph, fusing the information on the nodes by graph convolution, and mapping the fused long-term temporal features back to the same structure as the input feature map; the behavior prediction module averages the category scores of all frames obtained by the feature extraction module, category by category, to obtain the average score of the video clip for each category, and takes the category with the highest score as the final behavior recognition result of the video clip;
Step 3: input the training data set obtained in step 1 into the network model constructed in step 2 for training; the loss function of the network is the mean square error loss, and the network is optimized by stochastic gradient descent with a batch size of 16; the initial learning rate is 0.01 and is divided by 10 every 10 epochs, for 30 epochs in total; the trained network is the final behavior recognition network model;
Step 4: input the videos in the test data set into the long- and short-term temporal behavior recognition network model trained in step 3 to obtain the behavior recognition result of each video in the test set.
CN202010124065.2A 2020-02-27 2020-02-27 Lightweight video behavior identification method based on long-short-term time domain modeling algorithm Active CN111401149B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010124065.2A CN111401149B (en) 2020-02-27 2020-02-27 Lightweight video behavior identification method based on long-short-term time domain modeling algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010124065.2A CN111401149B (en) 2020-02-27 2020-02-27 Lightweight video behavior identification method based on long-short-term time domain modeling algorithm

Publications (2)

Publication Number Publication Date
CN111401149A true CN111401149A (en) 2020-07-10
CN111401149B CN111401149B (en) 2022-05-13

Family

ID=71432113

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010124065.2A Active CN111401149B (en) 2020-02-27 2020-02-27 Lightweight video behavior identification method based on long-short-term time domain modeling algorithm

Country Status (1)

Country Link
CN (1) CN111401149B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464831A (en) * 2020-12-01 2021-03-09 马上消费金融股份有限公司 Video classification method, training method of video classification model and related equipment
CN112712695A (en) * 2020-12-30 2021-04-27 桂林电子科技大学 Traffic flow prediction method, device and storage medium
CN113239766A (en) * 2021-04-30 2021-08-10 复旦大学 Behavior recognition method based on deep neural network and intelligent alarm device
WO2022228325A1 (en) * 2021-04-27 2022-11-03 中兴通讯股份有限公司 Behavior detection method, electronic device, and computer readable storage medium


Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845351A (en) * 2016-05-13 2017-06-13 苏州大学 It is a kind of for Activity recognition method of the video based on two-way length mnemon in short-term
US20180268222A1 (en) * 2017-03-17 2018-09-20 Nec Laboratories America, Inc. Action recognition system for action recognition in unlabeled videos with domain adversarial learning and knowledge distillation
CN107506712A (en) * 2017-08-15 2017-12-22 成都考拉悠然科技有限公司 Method for distinguishing is known in a kind of human behavior based on 3D depth convolutional networks
CN108764009A (en) * 2018-03-21 2018-11-06 苏州大学 The Video Events recognition methods of memory network in short-term is grown based on depth residual error
US20190294881A1 (en) * 2018-03-22 2019-09-26 Viisights Solutions Ltd. Behavior recognition
CN109214285A (en) * 2018-08-01 2019-01-15 浙江深眸科技有限公司 Detection method is fallen down based on depth convolutional neural networks and shot and long term memory network
CN109359519A (en) * 2018-09-04 2019-02-19 杭州电子科技大学 A kind of video anomaly detection method based on deep learning
CN109753897A (en) * 2018-12-21 2019-05-14 西北工业大学 Based on memory unit reinforcing-time-series dynamics study Activity recognition method
CN109753906A (en) * 2018-12-25 2019-05-14 西北工业大学 Public place anomaly detection method based on domain migration
CN109919031A (en) * 2019-01-31 2019-06-21 厦门大学 A kind of Human bodys' response method based on deep neural network
CN109886358A (en) * 2019-03-21 2019-06-14 上海理工大学 Human bodys' response method based on multi-space information fusion convolutional neural networks
CN110059587A (en) * 2019-03-29 2019-07-26 西安交通大学 Human bodys' response method based on space-time attention
CN110188653A (en) * 2019-05-27 2019-08-30 东南大学 Activity recognition method based on local feature polymerization coding and shot and long term memory network
CN110378208A (en) * 2019-06-11 2019-10-25 杭州电子科技大学 A kind of Activity recognition method based on depth residual error network

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
ALEXANDER KOZLOV et al.: "Lightweight Network Architecture for Real-Time Action Recognition", arXiv:1905.08711v1 *
RUNHAO ZENG et al.: "Graph Convolutional Networks for Temporal Action Localization", 2019 IEEE/CVF International Conference on Computer Vision (ICCV) *
TOBY PERRETT et al.: "DDLSTM: Dual-Domain LSTM for Cross-Dataset Action Recognition", arXiv:1904.08634v1 *
XI OUYANG et al.: "A 3D-CNN and LSTM Based Multi-Task Learning Architecture for Action Recognition" *
ZHANG Ying et al.: "Human behavior recognition method based on 3D convolutional neural networks", Software Guide (软件导刊) *
LI Qinghui et al.: "Action recognition combining ordered optical flow graphs and two-stream convolutional networks", Acta Optica Sinica (光学学报) *
WANG Ping et al.: "Action recognition with spatio-temporal two-channel convolutional neural networks based on video segmentation", Journal of Computer Applications (计算机应用) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464831A (en) * 2020-12-01 2021-03-09 马上消费金融股份有限公司 Video classification method, training method of video classification model and related equipment
CN112464831B (en) * 2020-12-01 2021-07-30 马上消费金融股份有限公司 Video classification method, training method of video classification model and related equipment
CN112712695A (en) * 2020-12-30 2021-04-27 桂林电子科技大学 Traffic flow prediction method, device and storage medium
CN112712695B (en) * 2020-12-30 2021-11-26 桂林电子科技大学 Traffic flow prediction method, device and storage medium
WO2022228325A1 (en) * 2021-04-27 2022-11-03 中兴通讯股份有限公司 Behavior detection method, electronic device, and computer readable storage medium
CN113239766A (en) * 2021-04-30 2021-08-10 复旦大学 Behavior recognition method based on deep neural network and intelligent alarm device

Also Published As

Publication number Publication date
CN111401149B (en) 2022-05-13

Similar Documents

Publication Publication Date Title
CN111401149B (en) Lightweight video behavior identification method based on long-short-term time domain modeling algorithm
CN110334705B (en) Language identification method of scene text image combining global and local information
CN109272500B (en) Fabric classification method based on adaptive convolutional neural network
CN107506722A A facial emotion recognition method based on deep sparse convolutional neural networks
CN106203395A Facial attribute recognition method based on multi-task deep learning
CN109214263A A face recognition method based on feature reuse
CN112085072B (en) Cross-modal retrieval method of sketch retrieval three-dimensional model based on space-time characteristic information
KR102593835B1 (en) Face recognition technology based on heuristic Gaussian cloud transformation
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN111666852A (en) Micro-expression double-flow network identification method based on convolutional neural network
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN115659966A (en) Rumor detection method and system based on dynamic heteromorphic graph and multi-level attention
CN113869285B (en) Crowd density estimation device, method and storage medium
CN104598898A (en) Aerially photographed image quick recognizing system and aerially photographed image quick recognizing method based on multi-task topology learning
CN111914600A (en) Group emotion recognition method based on space attention model
CN113688856A (en) Pedestrian re-identification method based on multi-view feature fusion
CN105354591A (en) High-order category-related prior knowledge based three-dimensional outdoor scene semantic segmentation system
CN111159411B (en) Knowledge graph fused text position analysis method, system and storage medium
CN117576038A (en) Fabric flaw detection method and system based on YOLOv8 network
CN113343760A (en) Human behavior recognition method based on multi-scale characteristic neural network
CN112560668A (en) Human behavior identification method based on scene prior knowledge
Chen et al. Intelligent teaching evaluation system integrating facial expression and behavior recognition in teaching video
CN111783891B (en) Customized object detection method
CN114663953A (en) Facial expression recognition method based on facial key points and deep neural network
CN114663910A (en) Multi-mode learning state analysis system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant