CN110175580B - Video behavior recognition method based on a temporal causal convolutional network

Video behavior recognition method based on a temporal causal convolutional network

Info

Publication number
CN110175580B
CN110175580B
Authority
CN
China
Prior art keywords
convolution
time
causal
behavior
space
Prior art date
Legal status
Active
Application number
CN201910459028.4A
Other languages
Chinese (zh)
Other versions
CN110175580A (en)
Inventor
Yu-Gang Jiang (姜育刚)
Changmao Cheng (程昌茂)
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date
Filing date
Publication date
Application filed by Fudan University
Priority to CN201910459028.4A
Publication of CN110175580A
Application granted
Publication of CN110175580B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of computer image analysis, and specifically relates to a video behavior recognition method based on a temporal causal convolutional network. The method uses a temporal causal three-dimensional convolutional neural network to extract spatio-temporal semantic feature representations from multiple video segments and obtain predicted behavior categories; it models the frame sequence up to the current moment and extracts high-level spatio-temporal semantic features for behavior localization and progress prediction. A fusion mechanism combining spatial convolution with temporal convolution and a causal spatio-temporal attention mechanism are designed. The method offers high accuracy, high computational efficiency, and real-time operation; it is suitable for online real-time video behavior detection and analysis tasks, and can also be used for tasks such as offline video behavior recognition and abnormal-event monitoring.

Description

Video behavior recognition method based on a temporal causal convolutional network
Technical Field
The invention belongs to the technical field of computer image analysis, and specifically relates to a video behavior recognition method based on a temporal causal convolutional network.
Background
Video behavior detection and recognition is a classic task in computer vision and a fundamental problem in the video-understanding sub-field; it has been studied for many years. Because video data are difficult to label and analyze, and spatio-temporal feature modeling is hard, video behavior recognition technology has developed slowly. Driven by advances in deep learning, learning high-level spatio-temporal semantic features with neural networks has become mainstream. However, because video data are voluminous and common deep network models are computationally expensive, practical video behavior recognition systems remain scarce, and the task still lacks a truly robust solution.
The system of the invention mainly targets the video behavior recognition task on online video streams. Conventional recognition frameworks face three main challenges. First, videos vary in length, and videos captured in open environments exhibit pain points such as relative motion, irrelevant shots, and unfixed scale; traditional recognition methods can only enumerate common cases and assumptions heuristically. Second, video data consume substantial resources and common deep models are large, making end-to-end training and optimization difficult and time-consuming. Third, the optimization target is single-purpose, so models can only be trained for a classification task on trimmed short videos.
In recent years, there have also been related research efforts attempting to solve such problems.
Reference [1] proposes initializing a 3D convolutional network with pre-trained 2D network parameters and training a lighter-weight network structure on a large-scale video dataset. However, this method can only process short videos, and the practicality of the model is very low.
Reference [2] proposes learning global video features with a self-attention module to capture long-term spatio-temporal dependencies. However, this method can only process offline video and cannot be applied to real-time video streams; it is also computationally expensive and slow to train.
Disclosure of Invention
The invention aims to remedy the defects of the prior art by providing a video behavior recognition method based on a temporal causal convolutional network.
Because 3D convolutional neural networks have large parameter counts and high computational cost, and lack the capability to process long videos, the invention designs a video behavior recognition algorithm based on a temporal causal convolutional network that factorizes 3D convolution into temporal convolution and spatial convolution. The temporal convolution enforces the causal constraint; feature changes along the time dimension are modeled by combining short-term temporal convolution with a long-term self-attention mechanism, and these temporal modules are placed sparsely in the network. To better suit online streaming video, the invention adopts a historical-feature caching mechanism that stores the historical features needed by future frames, reducing computation so that the system runs faster and more efficiently and achieves real-time performance.
The invention provides a video behavior recognition method based on a temporal causal convolutional network, which uses a temporal causal three-dimensional convolutional neural network to extract spatio-temporal semantic feature representations from multiple video segments and obtain predicted behavior categories, models the frame sequence up to the current moment, and extracts high-level spatio-temporal semantic features for behavior localization and progress prediction.
The video behavior recognition method based on a temporal causal convolutional network provided by the invention comprises the following specific steps:
Step 1: read the video stream data and decode it online to obtain a frame sequence I = {I₀, I₁, …}, each element of the sequence being a tensor representation of the frame picture data;
Step 2: at each moment t (t = 1, 2, …), send the current frame Iₜ₋₁ of the video stream into the pre-trained temporal causal three-dimensional convolutional neural network and extract a spatio-temporal feature representation;
Step 3: send the extracted spatio-temporal feature representation into a behavior classifier to obtain the behavior category, and obtain the current progress of the behavior through a regression network;
Step 4: cache part of the hidden-layer features of the convolutional network at time t, set t = t + 1, and return to Step 2.
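For illustration only, the following PyTorch-style sketch shows how Steps 1-4 could be wired together as an online loop. The names run_online, backbone, classifier, and progress_head, and the convention that the backbone accepts and returns a feature cache, are assumptions of this sketch rather than identifiers from the patent.

```python
import torch

def run_online(frame_iterator, backbone, classifier, progress_head):
    """Hypothetical online loop for Steps 1-4; not the patent's actual code."""
    cache = {}  # Step 4: hidden-layer features carried across time steps
    with torch.no_grad():  # pure inference, no gradients needed
        for t, frame in enumerate(frame_iterator):  # Step 1: decoded frame stream
            x = frame.unsqueeze(0)  # 1 x C x H x W tensor for the current frame
            feat, cache = backbone(x, cache)  # Step 2: spatio-temporal features
            logits = classifier(feat)  # Step 3: behavior category scores
            label = int(logits.argmax(dim=1))  # class with maximum probability
            progress = float(torch.sigmoid(progress_head(feat)))  # 0 = start, 1 = end
            yield t, label, progress
```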
In Step 2 of the invention, the temporal causal three-dimensional convolutional neural network comprises spatial convolution layers, a temporal-causal-convolution and spatial-convolution fusion module, and a causal-attention and spatial-convolution fusion module. The spatial convolution layers are the main building blocks of the network and extract the spatial semantic features of the current frame; the latter two modules are placed alternately and sparsely in the network to capture short-term and long-term historical information. The network is pre-trained on a large-scale labeled video behavior detection dataset.
The temporal-causal-convolution and spatial-convolution fusion module, shown in Fig. 2, comprises a temporal causal convolution with kernel 3 × 1 × 1 and a spatial convolution module with kernel 1 × 3 × 3. The input feature map X passes through two parallel paths, the temporal causal convolution with kernel 3 × 1 × 1 and the spatial convolution with kernel 1 × 3 × 3, yielding two feature maps; the elements of the two feature maps are added to obtain the fused output feature map Y.
The temporal causal convolution has kernel size 1 along the height and width dimensions of the frame image and kernel size 3 along the time dimension, so the convolution at each time point fuses the features of that time point with those of its two preceding time points; this structure mines short-term historical motion information. The spatial convolution operates at each spatial position by fusing the features of that position with those of its 8 neighboring points; this structure learns spatial semantic information of the frame image.
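As a minimal, non-authoritative sketch of the Fig. 2 module under the kernel sizes stated above (3 × 1 × 1 temporal, 1 × 3 × 3 spatial), one possible PyTorch implementation follows; the class name and the N × C × T × H × W tensor layout are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalTemporalSpatialFusion(nn.Module):
    """Sketch of the Fig. 2 block: a causal 3x1x1 temporal convolution and a
    1x3x3 spatial convolution run in parallel, outputs added element-wise."""

    def __init__(self, channels: int):
        super().__init__()
        # Temporal path: kernel 3 in time, 1 in height/width; causality is
        # enforced below by left-padding the time axis instead of centering.
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1))
        # Spatial path: kernel 1 in time, 3x3 in space.
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: N x C x T x H x W
        # Pad two steps on the left of the time axis so the output at time t
        # fuses only t, t-1, and t-2 (the causal constraint).
        xt = F.pad(x, (0, 0, 0, 0, 2, 0))
        return self.temporal(xt) + self.spatial(x)  # element-wise fusion
```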
In the invention, the structure of the causal-attention and spatial-convolution fusion module is shown in Fig. 3. The module comprises three convolution layers with 1 × 1 × 1 kernels; the input feature map X undergoes the three 1 × 1 × 1 convolutions and a shape adjustment to obtain the values V, keys K, and queries Q. Each query point of Q retrieves the relevance of all key-position features in K at times no later than that of the query point, yielding relevance under the causal constraint; this is typically implemented with a masked SoftMax function. The features at each position of V are then combined through the relevance weights to obtain the final feature expression of each query point. A subsequent convolution with a 1 × 1 × 1 kernel produces the output feature map of the causal-attention path, which captures long-term historical semantic information. This feature map is added to the feature map obtained through the spatial-convolution path, giving the final output feature map Y.
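The following PyTorch sketch illustrates one way such a block could be realized. The patent specifies the three 1 × 1 × 1 convolutions, the masked SoftMax, the output 1 × 1 × 1 convolution, and the additive fusion with a spatial path; the channel-compression ratio and the class name here are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class CausalAttentionSpatialFusion(nn.Module):
    """Sketch of the Fig. 3 block: Q, K, V from 1x1x1 convolutions, causally
    masked attention over time, a 1x1x1 output convolution, and element-wise
    fusion with a parallel spatial-convolution path."""

    def __init__(self, channels: int, inner: int = None):
        super().__init__()
        inner = inner or channels // 2  # channel compression (assumed ratio)
        self.q = nn.Conv3d(channels, inner, 1)
        self.k = nn.Conv3d(channels, inner, 1)
        self.v = nn.Conv3d(channels, inner, 1)
        self.out = nn.Conv3d(inner, channels, 1)
        self.spatial = nn.Conv3d(channels, channels, (1, 3, 3), padding=(0, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: N x C x T x H x W
        n, _, t, h, w = x.shape
        # Shape adjustment: every (time, position) pair becomes one token.
        q = self.q(x).flatten(2).transpose(1, 2)  # N x THW x C'
        k = self.k(x).flatten(2).transpose(1, 2)
        v = self.v(x).flatten(2).transpose(1, 2)
        scores = q @ k.transpose(1, 2)            # N x THW x THW relevance
        # Causal mask: a query at time t attends only to keys no later than t.
        times = torch.arange(t, device=x.device).repeat_interleave(h * w)
        future = times[None, :, None] < times[None, None, :]
        attn = scores.masked_fill(future, float('-inf')).softmax(dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(n, -1, t, h, w)
        return self.out(y) + self.spatial(x)      # fuse with spatial path
```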
In Step 3 of the invention, the behavior classifier is a linear classification layer covering the common behavior classes and a no-behavior class. The extracted features are mapped into the classification space to obtain the probability of each class; the class probabilities are sorted in descending order, and the behavior class corresponding to the maximum probability is returned as the final behavior category.
In Step 3 of the invention, the regression network comprises a linear layer and a Sigmoid function. When a predetermined behavior class is predicted to be occurring, the current progress of the behavior obtained through the regression network is a predicted value between 0 and 1 produced by the linear layer and the Sigmoid function; 0 represents the start of the action and 1 represents its end. If no predetermined behavior class is predicted, the progress regression network returns 0, i.e., no progress.
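A minimal sketch of the two heads of Step 3, assuming the no-behavior class occupies index 0 of the classifier output (the patent does not fix this convention):

```python
import torch
import torch.nn as nn

class RecognitionHeads(nn.Module):
    """Sketch of Step 3: linear behavior classifier plus progress regressor."""

    def __init__(self, feat_dim: int, num_behaviors: int):
        super().__init__()
        # +1 output for the no-behavior class (assumed to be index 0).
        self.classifier = nn.Linear(feat_dim, num_behaviors + 1)
        self.progress = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, feat: torch.Tensor):  # feat: N x feat_dim
        logits = self.classifier(feat)
        label = logits.argmax(dim=1)               # class of maximum probability
        progress = self.progress(feat).squeeze(1)  # in (0, 1): 0 = start, 1 = end
        # Per the patent, progress is 0 when no predetermined behavior is predicted.
        return label, progress * (label != 0).float()
```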
In Step 4 of the invention, part of the hidden-layer features of the convolutional network at time t are cached. For each temporal causal convolution module, the input feature map must cache the features of the current and the previous time point; for each causal self-attention module, the historical-time features stored in the keys K must be cached and updated at every moment. This caching technique greatly reduces repeated computation and improves system efficiency. At the next moment, Step 2 is entered again to update the prediction state.
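A minimal sketch of this cache for a single 3 × 1 × 1 temporal causal convolution layer, assuming its input arrives one time step at a time; the cold-start strategy of repeating the first frame's features is an assumption of this sketch.

```python
import torch

def causal_conv_step(conv, frame_feat, cache, key):
    """One cached step of a causal temporal convolution (kernel 3 in time).

    conv:       an nn.Conv3d with kernel_size=(3, 1, 1) and no time padding
    frame_feat: N x C x 1 x H x W features of the current time step
    cache:      dict holding the two most recent time steps per layer
    key:        hypothetical identifier of this layer in the cache
    """
    prev = cache.get(key)
    if prev is None:  # cold start: repeat the first frame's features twice
        prev = frame_feat.repeat(1, 1, 2, 1, 1)
    window = torch.cat([prev, frame_feat], dim=2)  # N x C x 3 x H x W
    cache[key] = window[:, :, 1:]  # keep current and previous steps for t+1
    return conv(window)            # output for the current time step only
```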
Unlike existing video behavior analysis and recognition methods, the invention takes into account the temporal causality of long videos and the computational cost of spatio-temporal feature modeling. It therefore designs a 3D neural network structure based on temporal causal convolution and a self-attention mechanism, and adds a behavior progress predictor. This greatly reduces model capacity, enables a more efficient training scheme, and alleviates problems such as the poor robustness of spatio-temporal feature learning and the impracticality of existing models, while obtaining accurate behavior features and progress from the video stream in real time. With these improvements, the video behavior recognition system based on the temporal causal convolutional network has stronger expressive power and higher efficiency, and can process online video streams in real time. The method offers high accuracy, high computational efficiency, and real-time operation; it is suitable for online real-time video behavior detection and analysis tasks, and can also be used for tasks such as offline video behavior recognition and abnormal-event monitoring.
The main innovations of the invention are:
1. Short-term and long-term spatio-temporal modeling are learned separately with a temporal causal convolution and a causal attention mechanism, so the online video behavior recognition task can be handled naturally. Computational cost is reduced through sparse placement of these modules and channel compression, achieving real-time performance. Temporal and spatial feature modeling are separated and stacked on top of each other, which is highly efficient; treating the time and space dimensions differently reduces model parameters and computation while facilitating parameter optimization;
2. Multi-task learning. Besides predicting the video behavior category, the system simultaneously regresses the progress of the behavior, which helps the network learn more refined behavior features; the multi-task supervision constraint improves the robustness and expressive power of the network model.
Drawings
FIG. 1 is a diagram of the temporal causal three-dimensional convolutional neural network processing system of the invention.
FIG. 2 is a structural diagram of the temporal-causal-convolution and spatial-convolution fusion module proposed by the invention.
FIG. 3 is a structural diagram of the causal-attention and spatial-convolution fusion module proposed by the invention.
Detailed Description
The invention is further described below with reference to the figures and examples.
FIG. 1 shows the temporal causal three-dimensional convolutional neural network processing system for online behavior recognition according to the invention. The system comprises the input stream of video frame pictures, spatial convolution layers, the basic network modules for temporal causal convolution and causal attention, a behavior classifier, and a progress regressor.
FIG. 2 illustrates the temporal-causal-convolution and spatial-convolution fusion module of the invention, used for short-term spatio-temporal feature modeling. The input feature map X undergoes a temporal causal convolution with kernel 3 × 1 × 1 and a spatial convolution with kernel 1 × 3 × 3 to obtain two feature maps, whose elements are added to obtain the fused output feature map Y.
FIG. 3 illustrates the causal-attention and spatial-convolution fusion module of the invention, used for long-term spatio-temporal feature modeling. The input feature map X passes through three convolutions with 1 × 1 × 1 kernels and a shape adjustment to obtain the values V, keys K, and queries Q. Each query point of Q retrieves the relevance of all key-position features in K at times no later than that of the query point, yielding relevance under the causal constraint; this is typically implemented with a masked SoftMax function. The final feature expression produced by the attention mechanism can be written as

Attention(Q, K, V) = MaskedSoftMax(QKᵀ)V.

A subsequent convolution with a 1 × 1 × 1 kernel then produces the output feature map of the causal-attention path. This feature map is added element-wise to the feature map produced by the spatial-convolution path to obtain the final output feature map Y.
The specific operating steps are as follows:
Step 1: collect a large-scale dataset of long videos with labeled actions and their corresponding segments, and initialize the parameters of the temporal causal three-dimensional convolutional neural network;
Step 2: randomly select a long video from the dataset and feed its frames to the network in temporal order; once the maximum GPU-memory or memory budget is reached, compute the loss based on the action tag of each time point and back-propagate it. When all frames of the video sample have been consumed, update the network with a stochastic gradient descent optimizer using the accumulated parameter gradients. Training on mini-batches of samples proceeds similarly (a sketch of this procedure is given after Step 5);
Step 3: deploy the trained network model to the terminal, connect a real-time video stream, decode the video frame data and feed it into the network, extract the current spatio-temporal features, and cache the required intermediate-layer feature states;
Step 4: based on the extracted spatio-temporal feature expression, map it to the classification space through the behavior classifier to obtain the behavior class and, when a behavior is present, send it to the progress regressor to obtain the current progress of the behavior;
Step 5: display the recognition results of the network in real time, synchronized with the video frame stream and with essentially no delay. The structure can accurately judge common behaviors and can also cover more complex behavior categories.
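For illustration, a minimal PyTorch sketch of the training procedure of Step 2, with a fixed chunk size standing in for the memory budget that triggers back-propagation; the model's output shape and the per-time-point label format are assumptions of this sketch.

```python
import torch

def train_one_video(model, frames, labels, optimizer, chunk=32):
    """Hypothetical per-video training loop for Step 2.

    frames: T x C x H x W decoded frames of one long video, in temporal order
    labels: T per-time-point action tags (class indices)
    chunk:  stand-in for the GPU-memory budget that triggers back-propagation
    """
    criterion = torch.nn.CrossEntropyLoss()
    optimizer.zero_grad()
    for start in range(0, len(frames), chunk):
        clip = frames[start:start + chunk]
        target = labels[start:start + chunk]
        logits = model(clip.unsqueeze(0))   # assumed 1 x T' x num_classes output
        loss = criterion(logits.squeeze(0), target)
        loss.backward()                     # accumulate gradients per chunk
    optimizer.step()                        # SGD update once the video is used up
```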
References
[1] Carreira, Joao, and Andrew Zisserman. "Quo vadis, action recognition? A new model and the Kinetics dataset." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
[2] Wang, Xiaolong, et al. "Non-local neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

Claims (6)

1. A video behavior recognition method based on a temporal causal convolutional network, characterized by using a temporal causal three-dimensional convolutional neural network to extract spatio-temporal semantic feature representations from multiple video segments and obtain predicted behavior categories, and by modeling the frame sequence up to the current moment and extracting high-level spatio-temporal semantic features for behavior localization and progress prediction; the method comprises the following specific steps:
Step 1: read the video stream data and decode it online to obtain a frame sequence I = {I₀, I₁, …}, each element of the sequence being a tensor representation of the frame picture data;
Step 2: at each moment t, send the current frame Iₜ₋₁ of the video stream into a pre-trained temporal causal three-dimensional convolutional neural network and extract a spatio-temporal feature representation, where t = 1, 2, …;
Step 3: send the extracted spatio-temporal feature representation into a behavior classifier to obtain the behavior category, and obtain the current progress of the behavior through a regression network;
Step 4: cache part of the hidden-layer features of the convolutional network at time t, set t = t + 1, and return to Step 2;
the temporal causal three-dimensional convolutional neural network in Step 2 comprises spatial convolution layers, a temporal-causal-convolution and spatial-convolution fusion module, and a causal-attention and spatial-convolution fusion module; the spatial convolution layers are the main building blocks of the network and extract the spatial semantic features of the current frame; the latter two modules are placed alternately and sparsely in the network to capture short-term and long-term historical information; the network is pre-trained on a large-scale labeled video behavior detection dataset;
the temporal-causal-convolution and spatial-convolution fusion module comprises a temporal causal convolution with kernel 3 × 1 × 1 and a spatial convolution module with kernel 1 × 3 × 3; an input feature map X passes through the two paths, the temporal causal convolution with kernel 3 × 1 × 1 and the spatial convolution with kernel 1 × 3 × 3, to obtain two feature maps, and the elements of the two feature maps are added to obtain a fused output feature map Y;
the causal-attention and spatial-convolution fusion module comprises three convolution layers with 1 × 1 × 1 kernels; an input feature map X passes through the three 1 × 1 × 1 convolutions and a shape adjustment to obtain values V, keys K, and queries Q; each query point of Q retrieves the relevance of all key-position features in K at times no later than that of the query point, yielding relevance under the causal constraint, implemented by a masked SoftMax function; the features at each position of V are combined through the relevance weights to obtain the final feature expression of each query point; a subsequent convolution with a 1 × 1 × 1 kernel yields the output feature map of the causal-attention path, which captures long-term historical semantic information; this feature map is added to the feature map obtained through the spatial-convolution path to obtain the final output feature map Y.
2. The video behavior recognition method based on a temporal causal convolutional network according to claim 1, wherein the temporal causal convolution has kernel size 1 along the height and width dimensions of the frame image and kernel size 3 along the time dimension, and the convolution at each time point fuses the features of that time point with those of its two preceding time points; this structure mines short-term historical motion information.
3. The video behavior recognition method based on a temporal causal convolutional network according to claim 1, wherein the spatial convolution has kernel size 3 along the height and width dimensions of the frame image and kernel size 1 along the time dimension, and the convolution at each spatial position fuses the features of that position with those of the 8 points in its spatial neighborhood; this structure learns spatial semantic information of the frame image.
4. The video behavior recognition method based on a temporal causal convolutional network according to any one of claims 1-3, wherein the behavior classifier in Step 3 is a linear classification layer covering common behavior classes and a no-behavior class; the extracted features are mapped into the classification space to obtain the probability of each class, the class probabilities are sorted in descending order, and the behavior class corresponding to the maximum probability is returned as the final behavior category.
5. The video behavior recognition method based on a temporal causal convolutional network according to claim 4, wherein the regression network in Step 3 comprises a linear layer and a Sigmoid function; when a predetermined behavior class is predicted to be occurring, the current progress of the behavior is a predicted value between 0 and 1 produced by the linear layer and the Sigmoid function, where 0 represents the start of the action and 1 represents its end; if no predetermined behavior class is predicted, the progress regression network returns 0, i.e., no progress.
6. The video behavior recognition method based on a temporal causal convolutional network according to claim 5, wherein, when caching part of the hidden-layer features of the convolutional network at time t in Step 4, the input feature map of each temporal causal convolution module caches the features of the current and the previous time point, and for each causal self-attention module the historical-time features stored in the keys K are cached and updated at every moment; at the next moment, Step 2 is entered again to update the prediction state.
CN201910459028.4A 2019-05-29 2019-05-29 Video behavior recognition method based on a temporal causal convolutional network Active CN110175580B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910459028.4A CN110175580B (en) 2019-05-29 2019-05-29 Video behavior recognition method based on a temporal causal convolutional network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910459028.4A CN110175580B (en) 2019-05-29 2019-05-29 Video behavior recognition method based on a temporal causal convolutional network

Publications (2)

Publication Number Publication Date
CN110175580A (en) 2019-08-27
CN110175580B (en) 2019-05-29 2020-10-30

Family

ID=67696573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910459028.4A Active CN110175580B (en) 2019-05-29 2019-05-29 Video behavior recognition method based on a temporal causal convolutional network

Country Status (1)

Country Link
CN (1) CN110175580B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516611B (en) * 2019-08-28 2022-03-01 中科人工智能创新技术研究院(青岛)有限公司 Autism detection system and autism detection device
CN110503076B (en) * 2019-08-29 2023-06-30 腾讯科技(深圳)有限公司 Video classification method, device, equipment and medium based on artificial intelligence
CN110765854B (en) * 2019-09-12 2022-12-02 昆明理工大学 Video motion recognition method
CN110807369B (en) * 2019-10-09 2024-02-20 南京航空航天大学 Short video content intelligent classification method based on deep learning and attention mechanism
CN110852295B (en) * 2019-10-15 2023-08-25 深圳龙岗智能视听研究院 Video behavior recognition method based on multitasking supervised learning
CN110839156A (en) * 2019-11-08 2020-02-25 北京邮电大学 Future frame prediction method and model based on video image
CN111259782B (en) * 2020-01-14 2022-02-11 北京大学 Video behavior identification method based on mixed multi-scale time sequence separable convolution operation
CN111460928B (en) * 2020-03-17 2023-07-21 中国科学院计算技术研究所 Human body action recognition system and method
CN111563417B (en) * 2020-04-13 2023-03-21 华南理工大学 Pyramid structure convolutional neural network-based facial expression recognition method
CN111506835B (en) * 2020-04-17 2022-12-23 北京理工大学 Data feature extraction method fusing user time features and individual features
CN112185352B (en) * 2020-08-31 2024-05-17 华为技术有限公司 Voice recognition method and device and electronic equipment
CN112257572B (en) * 2020-10-20 2022-02-01 神思电子技术股份有限公司 Behavior identification method based on self-attention mechanism
CN112434615A (en) * 2020-11-26 2021-03-02 天津大学 Time sequence action detection method based on Tensorflow deep learning framework
CN112487957A (en) * 2020-11-27 2021-03-12 广州华多网络科技有限公司 Video behavior detection and response method and device, equipment and medium
CN112651324A (en) * 2020-12-22 2021-04-13 深圳壹账通智能科技有限公司 Method and device for extracting semantic information of video frame and computer equipment
CN112288050B (en) * 2020-12-29 2021-05-11 中电科新型智慧城市研究院有限公司 Abnormal behavior identification method and device, terminal equipment and storage medium
CN112668495B (en) * 2020-12-30 2024-02-02 东北大学 Full-time space convolution module-based violent video detection algorithm
CN112883983A (en) * 2021-02-09 2021-06-01 北京迈格威科技有限公司 Feature extraction method and device and electronic system
CN112883929B (en) * 2021-03-26 2023-08-08 全球能源互联网研究院有限公司 On-line video abnormal behavior detection model training and abnormal detection method and system
CN113466852B (en) * 2021-06-08 2023-11-24 江苏科技大学 Millimeter wave radar dynamic gesture recognition method applied to random interference scene
CN113487247B (en) * 2021-09-06 2022-02-01 阿里巴巴(中国)有限公司 Digitalized production management system, video processing method, equipment and storage medium
CN114510966B (en) * 2022-01-14 2023-04-28 电子科技大学 End-to-end brain causal network construction method based on graph neural network
CN115100740B (en) * 2022-06-15 2024-04-05 东莞理工学院 Human motion recognition and intention understanding method, terminal equipment and storage medium
CN114898166B (en) * 2022-07-13 2022-09-27 合肥工业大学 Method for detecting glass cleanliness based on evolution causal model
CN115147935B (en) * 2022-09-05 2022-12-13 浙江壹体科技有限公司 Behavior identification method based on joint point, electronic device and storage medium
CN116797972A (en) * 2023-06-26 2023-09-22 中科(黑龙江)数字经济研究院有限公司 Self-supervision group behavior recognition method and recognition system based on sparse graph causal time sequence coding

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830157A (en) * 2018-05-15 2018-11-16 华北电力大学(保定) Human bodys' response method based on attention mechanism and 3D convolutional neural networks

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10929681B2 (en) * 2016-11-03 2021-02-23 Nec Corporation Surveillance system using adaptive spatiotemporal convolution feature representation with dynamic abstraction for video to language translation
CN107609460B (en) * 2017-05-24 2021-02-02 南京邮电大学 Human body behavior recognition method integrating space-time dual network flow and attention mechanism
CN109389055B (en) * 2018-09-21 2021-07-20 西安电子科技大学 Video classification method based on mixed convolution and attention mechanism

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830157A (en) * 2018-05-15 2018-11-16 华北电力大学(保定) Human bodys' response method based on attention mechanism and 3D convolutional neural networks

Also Published As

Publication number Publication date
CN110175580A (en) 2019-08-27

Similar Documents

Publication Publication Date Title
CN110175580B (en) Video behavior recognition method based on a temporal causal convolutional network
CN109446923B (en) Deep supervision convolutional neural network behavior recognition method based on training feature fusion
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN109086873B (en) Training method, recognition method and device of recurrent neural network and processing equipment
CN109993102B (en) Similar face retrieval method, device and storage medium
CN113469289B (en) Video self-supervision characterization learning method and device, computer equipment and medium
CN111783712A (en) Video processing method, device, equipment and medium
CN111008337A (en) Deep attention rumor identification method and device based on ternary characteristics
CN110705412A (en) Video target detection method based on motion history image
CN113255625B (en) Video detection method and device, electronic equipment and storage medium
Wang et al. Ttpp: Temporal transformer with progressive prediction for efficient action anticipation
CN115240024A (en) Method and system for segmenting extraterrestrial pictures by combining self-supervised learning and semi-supervised learning
CN113591674A (en) Real-time video stream-oriented edge environment behavior recognition system
Bai et al. A survey on deep learning-based single image crowd counting: Network design, loss function and supervisory signal
CN114663798A (en) Single-step video content identification method based on reinforcement learning
CN115705706A (en) Video processing method, video processing device, computer equipment and storage medium
CN113936175A (en) Method and system for identifying events in video
CN117095460A (en) Self-supervision group behavior recognition method and system based on long-short time relation predictive coding
CN112926517B (en) Artificial intelligence monitoring method
CN115082840A (en) Action video classification method and device based on data combination and channel correlation
CN115188022A (en) Human behavior identification method based on consistency semi-supervised deep learning
CN114092746A (en) Multi-attribute identification method and device, storage medium and electronic equipment
CN114218434A (en) Automatic labeling method, automatic labeling device and computer readable storage medium
CN113011320A (en) Video processing method and device, electronic equipment and storage medium
Guhagarkar et al. DEEPFAKE DETECTION TECHNIQUES: A REVIEW

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant