CN108805083B - Single-stage video behavior detection method - Google Patents

Single-stage video behavior detection method

Info

Publication number
CN108805083B
CN108805083B (Application CN201810607804.6A)
Authority
CN
China
Prior art keywords
behavior
training
video
anchoring
scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810607804.6A
Other languages
Chinese (zh)
Other versions
CN108805083A (en)
Inventor
王子磊
刘志康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201810607804.6A priority Critical patent/CN108805083B/en
Publication of CN108805083A publication Critical patent/CN108805083A/en
Application granted granted Critical
Publication of CN108805083B publication Critical patent/CN108805083B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 — Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 — Movements or behaviour, e.g. gesture recognition
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a single-stage video behavior detection method, which comprises the following steps: in the training stage, a multi-scale behavior segment regression network is constructed based on a convolutional neural network; taking training videos and frame-level real behavior labels as input, the network is trained with a multi-task learning end-to-end optimization method to obtain a trained multi-scale behavior segment regression network model. In the use stage, when a new video is input, a time-dimension sliding window generates input frame sequences of the same length as the training videos, and the trained multi-scale behavior segment regression network model predicts the behavior category and the corresponding time position of each input frame sequence; non-maximum suppression is then applied to the prediction results to generate the final behavior detection results. The method improves both detection performance and detection efficiency.

Description

Single-stage video behavior detection method
Technical Field
The invention relates to the technical field of video behavior detection, in particular to a single-stage video behavior detection method.
Background
In recent years, video capture devices (such as smart phones, digital cameras and surveillance cameras) have become widely available, making it easy for people to shoot videos, and modern communication devices have made the acquisition and transmission of video increasingly convenient, so video has become an important information carrier in modern society. With the growing demand for computer intelligence and the rapid development of pattern recognition, image processing and artificial intelligence technologies, analyzing video content with computer vision techniques has huge practical demand and high commercial value. Human activities are usually the main information content of videos, so detecting human behaviors in videos is of great significance for video understanding. The video human behavior detection task is to detect the category of each human behavior instance contained in an unsegmented long video and to localize the time at which each behavior instance occurs. Since most surveillance videos and web videos are unsegmented long videos, detection in long videos better matches practical requirements.
With the development of deep learning technology, some research results have been achieved in the field of video behavior detection. However, the field is still at an early stage of development: current video behavior detection methods are often not mature enough and commonly suffer from problems such as overly complex models, excessive computational cost and low behavior localization accuracy. To meet the requirements of practical applications, new video behavior detection frameworks and methods are urgently needed.
At present, research on the video behavior detection task is still limited, and the proposed methods generally follow a multi-stage detection framework: in the first stage, candidate time windows with high recall are generated in the video with a proposal (nomination) technique, or discriminative behavior features are produced with an additional feature extraction step; in the next stage, these candidate time windows or behavior features are classified to obtain behavior category predictions. One convolutional-neural-network-based action detection model uses a two-stage approach: a Fast R-CNN network generates windows of interest on video frames and optical flow maps to produce nominations and extract behavior features, and an independent SVM classifier then classifies these features. The patent "a video motion detection method based on convolutional neural network" segments the uncut video with dense multi-scale sliding windows in the first stage, recognizes each window with a convolutional neural network containing a spatio-temporal pyramid layer, and then screens and merges the per-window recognition results in the next stage to obtain the final detected video segments. The paper "Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs" proposes a behavior detection method based on segmented three-dimensional convolutional neural networks: one three-dimensional convolutional network first generates sliding-window-based behavior instance nominations, and another three-dimensional convolutional network then classifies these nominations. The paper "Cascaded Boundary Regression for Temporal Action Detection" adopts a two-stage behavior detection framework and regresses the temporal boundaries of behaviors to refine the sliding-window nominations. The paper "Single Shot Temporal Action Detection" proposes a single-shot behavior detector which relies on an independent two-stream neural network (two-stream ConvNets) to extract appearance and motion features.
However, multi-stage methods treat feature extraction, sliding-window nomination and behavior classification as independent processing stages that cannot be trained jointly, which hinders the coordination and joint optimization of the behavior detection model; meanwhile, a large amount of computation is repeated across stages, which hurts the computational efficiency of the algorithm.
Disclosure of Invention
The invention aims to provide a single-stage video behavior detection method which can improve the detection performance and the detection efficiency.
The purpose of the invention is realized by the following technical scheme:
a single-stage video behavior detection method comprises the following steps:
in the training stage, constructing a multi-scale behavior segment regression network based on a convolutional neural network; taking training videos and frame-level real behavior labels as input, training the multi-scale behavior segment regression network with a multi-task learning end-to-end optimization method to obtain a trained multi-scale behavior segment regression network model;
in the using stage, when a new video is input, generating input frame sequences with the same length as the training video through a time-dimension sliding window, and predicting the behavior category and the corresponding time position of each input frame sequence with the trained multi-scale behavior segment regression network model; then processing the prediction results with non-maximum suppression to generate the final behavior detection result.
According to the technical scheme provided by the invention, firstly, the constructed multi-scale behavior segment regression network completely eliminates the temporal nomination stage and the additional feature extraction stage of traditional behavior detection methods; all computation for detecting behavior instances in an untrimmed long video is completed within a single convolutional neural network, which can be jointly trained and optimized end to end as a whole, thereby achieving higher detection performance. Secondly, the simplified network structure allows most of the computation to be parallelized, greatly improving the efficiency of behavior detection.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
Fig. 1 is a flowchart of a single-stage video behavior detection method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a process of detecting behavior in a video according to an embodiment of the present invention;
fig. 3 is a schematic diagram of an overall structure of a multi-scale behavior segment regression network according to an embodiment of the present invention;
fig. 4 is a schematic diagram of output results on the THUMOS'14 dataset according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
In order to solve the problems of complex structure, low detection precision, low processing speed and the like of the conventional video behavior detection method, the embodiment of the invention provides a single-stage video behavior detection method; firstly, in order to improve the calculation efficiency, the method encapsulates all calculations into a network, and completes a behavior detection task in a single-stage convolutional neural network. Secondly, in order to improve the behavior detection precision, the method flexibly detects human behaviors of various time lengths on a multi-scale network characteristic diagram by using multi-scale position regression, and outputs behavior time boundaries and behavior categories at the video frame level. Finally, in order to enable joint optimization of various parts of the network, the method processes input video in a single network, so that the whole network can be trained end to end.
As shown in fig. 1, a flowchart of a single-stage video behavior detection method is provided for an embodiment of the present invention, which mainly includes:
1. in the training stage, constructing a multi-scale behavior segment regression network based on a convolutional neural network; and taking the training video and the frame-level real behavior labels as input, and training the multi-scale behavior segment regression network by using an end-to-end optimization method of multi-task learning to obtain a trained multi-scale behavior segment regression network model.
In the embodiment of the invention, a single-stage convolutional neural network is used for completing a behavior detection task, and network characteristic graphs with different scales are connected with anchoring behavior examples with different time lengths, so that the network can flexibly detect human behaviors with various time lengths. The method mainly comprises the following parts:
1) Constructing a multi-scale behavior segment regression network based on the convolutional neural network.
In the embodiment of the invention, the constructed multi-scale behavior segment regression network comprises a basic generalization module, a behavior instance anchoring module and a behavior prediction module, wherein:
a. The basic generalization module comprises N1 (e.g., N1 = 5) three-dimensional convolution layers (3D convolution layers) and N2 (e.g., N2 = 5) three-dimensional max-pooling layers (3D max-pooling layers) arranged alternately, and is used for feature generalization of the input video sequence and for enlarging the receptive field.
b. The behavior instance anchoring module adopts N3 (e.g., N3 = 4) three-dimensional convolution layers with stride s1 (e.g., s1 = 2) in the time dimension and stride s2 (e.g., s2 = 1) in the spatial dimensions, and is used to associate anchor behavior instances of different time lengths with each cell of the anchor feature map output by each three-dimensional convolution layer of the module.
In the embodiment of the invention, in the behavior instance anchoring module, each anchor feature map defines a basic time scale sk, k ∈ [1, N3]; the sk are regularly distributed over the value range [0, 1]. A set of scale ratios {rd}, d ∈ [1, Dk], is also defined for each anchor feature map, where Dk is the number of scale ratios. The size of each anchor feature map is denoted h × w × t, where h, w and t are its height, width and length; each cell of size h × w × 3 on the anchor feature map is therefore associated with Dk anchor behavior instances, whose time lengths are ld = sk · rd, d ∈ [1, Dk], and whose center position is the center of the cell.
c. The behavior prediction module convolves each cell of the anchor feature map with Dk · (m + 2) convolution kernels of size h × w × 3, and outputs, for the Dk anchor behavior instances associated with that cell, the predicted scores of the m behavior classes and two time-position offsets.
2) End-to-end optimization method for multi-task learning
In the embodiment of the invention, a training video and a frame-level real behavior label are taken as input, a training objective function is combined, parameters in a multi-scale behavior segment regression network are trained through a gradient descent method, and the training process is as follows:
a. Video frames are extracted from the training video at a fixed frame rate (e.g., 10 frames/second) to obtain a training frame picture sequence, and each frame picture is adjusted to a uniform resolution; sub-sequences of equal length are then taken with a sequentially sliding window as input frame sequences, where the length of each sliding window is the maximum number of frames (e.g., 192 frames) allowed by the GPU video memory.
b. And establishing a corresponding relation between the anchoring behavior instance and the real behavior instance in the label by using a positive sample matching strategy.
In the embodiment of the invention, in each training sample, the temporal overlap (Intersection-over-Union, IoU) between each anchor behavior instance and each real behavior instance in the label is calculated; if the overlap exceeds a fixed threshold (e.g., 0.5), the corresponding anchor behavior instance is taken as a positive sample, otherwise as a negative sample. One real label instance can match multiple anchor behavior instances.
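For illustration, a minimal Python sketch of this positive-sample matching strategy is given below; the (start, end) segment representation, the helper names and the 0.5 threshold are assumptions for the example, not the patent's code.

```python
# A minimal sketch (assumptions noted above) of temporal-IoU positive-sample matching.

def temporal_iou(seg_a, seg_b):
    """Temporal Intersection-over-Union of two (start, end) segments."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_anchors(anchor_segments, gt_segments, iou_threshold=0.5):
    """Return, for every anchor, the index of its matched ground truth (or -1 for negatives)."""
    matches = []
    for anchor in anchor_segments:
        ious = [temporal_iou(anchor, gt) for gt in gt_segments]
        best = max(range(len(ious)), key=lambda i: ious[i]) if ious else -1
        # one label instance may match many anchors, but each anchor keeps
        # at most one label instance (its best-overlapping one)
        matches.append(best if best >= 0 and ious[best] > iou_threshold else -1)
    return matches
```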
c. A multitask loss function is adopted as the objective function for training the multi-scale behavior segment regression network; training is performed with a stochastic gradient descent method, and the final multi-scale behavior segment regression network model is generated iteratively.
In the embodiment of the invention, the multitask loss function is the objective function Lloss of network training, which jointly combines the behavior classification loss and the behavior time-position regression loss and is expressed as:

Lloss = (1/Ncls) Σi Lcls(pi, g) + α · (1/Npos) Σi Lloc(ti, ti*) + β · L2(Θ)

In the above formula, L2(Θ) is the L2 regularization loss, Θ denotes the learnable parameters of the whole multi-scale behavior segment regression network, α and β are loss trade-off parameters used to control the time-position offset loss and the regularization loss respectively, and Ncls and Npos are the numbers of total training samples and positive samples respectively. pi = (pi^1, …, pi^m) is the prediction vector over the behavior classes at the i-th anchor position, where the superscript j denotes the j-th of the m behavior classes; pi^g denotes the predicted score of the true label category at the i-th anchor position, the superscript g indicating that the g-th behavior class is the true label class. ti is the time-position offset of the anchor instance, and ti* is obtained by coordinate transformation between the true label position and the anchor position.

Lcls is the behavior classification loss, set to the multi-class softmax loss:

Lcls(pi, g) = −log( exp(pi^g) / Σj exp(pi^j) )

Lloc is the behavior time-position regression loss, set to the smoothed L1 loss on the time-position offsets.
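As an illustration only, the following PyTorch-style sketch assembles a loss of the form described above (softmax classification over all sampled anchors, smooth-L1 regression over positive anchors, and an L2 penalty on the parameters); tensor shapes, variable names and the default weights are assumptions, not values fixed by the patent.

```python
# A minimal sketch of the multitask objective under the stated assumptions.
import torch
import torch.nn.functional as F

def multitask_loss(cls_logits, labels, pred_offsets, target_offsets,
                   pos_mask, params, alpha=1.0, beta=1e-4):
    # cls_logits:    (N, m) class scores per sampled anchor
    # labels:        (N,)   true class index per anchor
    # pred_offsets / target_offsets: (N, 2) start/end time-position offsets
    # pos_mask:      (N,)   bool, True for positive anchors
    # params:        iterable of network parameter tensors (for the L2 term)
    n_cls = max(labels.numel(), 1)
    n_pos = max(int(pos_mask.sum()), 1)

    loss_cls = F.cross_entropy(cls_logits, labels, reduction="sum") / n_cls
    loss_loc = F.smooth_l1_loss(pred_offsets[pos_mask],
                                target_offsets[pos_mask],
                                reduction="sum") / n_pos
    loss_l2 = sum((p ** 2).sum() for p in params)
    return loss_cls + alpha * loss_loc + beta * loss_l2
```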
2. In the using stage, when a new video is input, input frame sequences of the same length as the training videos are generated with a time-dimension sliding window, and the trained multi-scale behavior segment regression network model predicts the behavior category and corresponding time position of each input frame sequence; non-maximum suppression is then applied to the prediction results to generate the final behavior detection results.
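A minimal sketch of the temporal non-maximum suppression step is shown below, assuming per-class detections are given as (start, end, score) triples; the overlap threshold of 0.3 is illustrative and not specified by the patent.

```python
# Greedy temporal NMS: keep the highest-scoring segments, dropping any segment
# whose temporal IoU with an already-kept segment exceeds the threshold.
def temporal_nms(detections, iou_threshold=0.3):
    """detections: list of (start, end, score); returns the kept subset."""
    detections = sorted(detections, key=lambda d: d[2], reverse=True)
    kept = []
    for det in detections:
        suppressed = False
        for k in kept:
            inter = max(0.0, min(det[1], k[1]) - max(det[0], k[0]))
            union = (det[1] - det[0]) + (k[1] - k[0]) - inter
            if union > 0 and inter / union > iou_threshold:
                suppressed = True
                break
        if not suppressed:
            kept.append(det)
    return kept
```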
According to the scheme of the embodiment of the invention, the constructed multi-scale behavior segment regression network completely eliminates the temporal nomination stage and the additional feature extraction stage of traditional behavior detection methods; all computation for behavior instance detection in untrimmed videos is completed within a single convolutional neural network, which can be jointly trained and optimized end to end as a whole, thereby achieving higher detection performance. Secondly, the network adopts a fully convolutional form, which simplifies the network structure and allows most computation to be parallelized, greatly improving the efficiency of behavior detection; in particular, with GPU parallel acceleration the detection speed exceeds that of existing methods.
For ease of understanding, the following description is made in conjunction with specific examples.
The entire flow of this example is similar to the previous embodiment, namely: in the training stage, firstly, a multi-scale behavior segment regression network is constructed; the multi-scale behavior segment regression network completes all behavior detection calculations in a single stage, other redundant links are eliminated, meanwhile, the network uses a full convolution mode, and most calculations can be accelerated in parallel, so that the calculation efficiency is greatly improved; then, extracting a video frame picture sequence from the training video, taking the frame sequence and the frame-level real behavior label generated by a sliding window method on the video frame picture sequence as the input of a multi-scale behavior segment regression network, training network parameters by an end-to-end optimization method of multi-task learning, and generating a network model. In the using stage, for a newly input video, generating a sequence with the same length as an input frame sequence in a sliding window mode after extracting a video frame, and inputting the sequence into a trained multi-scale behavior segment regression network for behavior detection; the output results are then processed using non-maximum suppression (NMS) to produce the final behavior detection result.
The specific detection procedure is shown in fig. 2. The input is a sequence R ∈ R^(3×H×W×T) composed of RGB video frames extracted from the video, where T, H and W are the length, height and width of the input sequence respectively and 3 is the number of RGB channels. The input sequence passes in turn through the basic generalization module, the behavior instance anchoring module and the behavior prediction module, which output the classification category and position offset of each anchor behavior instance. The final temporal position of the action is computed from the position offset. t, h and w denote the length, height and width of the anchor feature map respectively, and m denotes the number of behavior classes.
The present implementation uses video data from the THUMOS'14 dataset, a representative human behavior recognition dataset. The THUMOS'14 dataset is provided by (THUMOS Challenge: Action Recognition with a Large Number of Classes. http://crcv.ucf.edu/THUMOS14/, 2014.).
The following description covers three aspects: constructing the multi-scale behavior segment regression network, end-to-end optimization training with multi-task learning, and testing and evaluation.
1. And constructing a multi-scale behavior segment regression network.
As shown in fig. 3, a schematic diagram of the multi-scale behavior segment regression network provided for this example. It mainly comprises a basic generalization module, a behavior instance anchoring module and a behavior prediction module, wherein:
the basic generalization module consists of 5 three-dimensional convolution layers (3D convolution) and 5 three-dimensional maximum value pooling layers (3D max-Pooling). The sizes of the three-dimensional convolution kernels are all 3 multiplied by 3, the number of the convolution kernels contained in the first three layers is respectively 64, 128 and 256, and the number of the convolution kernels contained in the remaining two layers is 512. The kernel size of the first 3 three-dimensional pooling layers was set to 2 × 2 × 2, and the kernel size of the remaining three-dimensional pooling layers was set to 1 × 2 × 2. The output characteristic map of the fifth max pooling layer is denoted as F5, and has a height and width of h-w-3 and a time dimension of t-24. And for convolution output of each layer, a linear correction unit ReLU is adopted as an activation function, and nonlinear mapping modeling capability is added to the network.
The behavior instance anchoring module uses three-dimensional convolution layers; each three-dimensional convolution layer in the module is called an anchoring layer, and the output of each layer is called an anchor feature map. In all anchoring layers, the convolution kernel size is set to 3 × 3 × 3 and each layer contains 256 convolution kernels. The anchor feature maps output by the successive anchoring layers are denoted F6, F7, F8 and F9, with sizes (256 × 12 × 3 × 3), (256 × 6 × 3 × 3), (256 × 3 × 3 × 3) and (256 × 1 × 3 × 3) respectively. The output F5 of the last layer of the basic generalization module, of size (256 × 24 × 3 × 3), is also used as an anchor feature map. Each anchor feature map is padded by one unit at both ends of its second (temporal) dimension, so each cell on an anchor feature map has size (256 × 3 × 3 × 3). The basic scales of the anchor feature maps F5, F6, F7, F8 and F9 are set to {0.1, 0.3, 0.5, 0.7, 0.9} in this order. The scale ratios of F5, F6, F7 and F8 are set to {0.8, 1, 1.5}, and those of F9 to {0.7, 0.85, 1}. Each anchor feature map thus has 3 scale ratios, so each cell (256 × 3 × 3 × 3) on an anchor feature map is associated with 3 anchor behavior instances; the length of each anchor behavior instance is the basic scale of its anchor feature map multiplied by the corresponding scale ratio, and its center position is the center of the corresponding cell.
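As a small illustration of how anchor behavior instances follow from these settings, the sketch below enumerates, per anchor feature map, the (center, length) of each anchor instance for a 192-frame input window; the dictionary layout and helper name are assumptions, and lengths are expressed in frames by scaling the fractional values by the window length.

```python
# Anchor enumeration sketch: length = base scale * scale ratio (fraction of the window),
# center = temporal center of the cell.
INPUT_LEN = 192  # frames in one input sequence

anchor_configs = {                      # (base scale, scale ratios, temporal size t)
    "F5": (0.1, (0.8, 1.0, 1.5), 24),
    "F6": (0.3, (0.8, 1.0, 1.5), 12),
    "F7": (0.5, (0.8, 1.0, 1.5), 6),
    "F8": (0.7, (0.8, 1.0, 1.5), 3),
    "F9": (0.9, (0.7, 0.85, 1.0), 1),
}

def build_anchors():
    anchors = []  # (center_frame, length_in_frames) for every cell/ratio pair
    for base, ratios, t in anchor_configs.values():
        for cell in range(t):
            center = (cell + 0.5) / t * INPUT_LEN
            for r in ratios:
                anchors.append((center, base * r * INPUT_LEN))
    return anchors

# len(build_anchors()) == (24 + 12 + 6 + 3 + 1) * 3 == 138 anchors per input window
```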
The behavior prediction module uses three-dimensional convolution to predict the position offsets and behavior class scores for the anchor positions. Each cell is associated with 3 anchor behavior instances. Taking the THUMOS'14 dataset as an example, the scores of 20 behavior classes plus 1 background class and 2 time-position offsets (front and back) must be predicted, so on each cell of F5, F6, F7, F8 and F9, convolution is performed with 3 × (20 + 1 + 2) = 69 convolution kernels of size 3 × 3 × 3; the output at each cell comprises, for the corresponding 3 anchor behavior instances, the scores of the 20 behavior classes and 1 background class and the 2 time-position offsets.
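A minimal PyTorch sketch of such a prediction head is given below, assuming a 256-channel anchor feature map padded by one cell in the time dimension; the variable names and the reshaping of the output are illustrative choices, not the patent's exact layout.

```python
# Prediction head sketch for the THUMOS'14 setting: 3 * (20 + 1 + 2) = 69 kernels of size 3x3x3.
import torch
import torch.nn as nn

num_anchors, num_classes = 3, 20 + 1          # 20 behavior classes + 1 background class
pred_head = nn.Conv3d(256, num_anchors * (num_classes + 2),
                      kernel_size=3, padding=(1, 0, 0))   # pad one cell in time only

feat = torch.randn(1, 256, 24, 3, 3)          # e.g. anchor feature map F5
out = pred_head(feat)                         # (1, 69, 24, 1, 1): 69 values per temporal cell
out = out.view(1, num_anchors, num_classes + 2, 24)
scores, offsets = out[:, :, :num_classes], out[:, :, num_classes:]
```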
2. End-to-end optimization training for multi-task learning.
Because the GPU memory is limited, a complete long video cannot be fed in at once, and the video must be processed into suitable inputs. Therefore, in the training phase, video frames are first extracted from the training videos of the THUMOS'14 dataset at a frame rate of 10 frames/second, and each frame is resized to 96 × 96, i.e., H = W = 96. A window is then slid sequentially over the video frame sequence to produce consecutive clips of T = 192 frames as the input frame sequences.
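A small Python sketch of this input preparation step is shown below, assuming the frames have already been decoded at 10 frames/second and resized to 96 × 96; the zero-padding of a final short window is an assumption, since the patent does not specify how remainders are handled.

```python
# Sliding-window clip generation sketch (frame decoding/resizing omitted).
import numpy as np

def sliding_windows(frames, window=192, stride=192):
    """frames: array of shape (num_frames, 96, 96, 3); yields clips of shape (window, 96, 96, 3)."""
    for start in range(0, len(frames), stride):
        clip = frames[start:start + window]
        if len(clip) < window:                       # pad the last, shorter window with zeros
            pad = np.zeros((window - len(clip),) + clip.shape[1:], clip.dtype)
            clip = np.concatenate([clip, pad], axis=0)
        yield clip
```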
After the input frame sequences are generated, positive and negative samples need to be further determined, because an input frame sequence contains both behavior segments and background segments. The specific method is as follows: for each input frame sequence, the temporal overlap between each anchor behavior instance in the multi-scale behavior segment regression network and the corresponding real label instance is computed; if the overlap exceeds a threshold of 0.5, the anchor instance is taken as a positive sample, otherwise as a negative sample. One real label instance can match multiple anchor behavior instances, but each anchor behavior instance matches at most one real label instance.
Once positive and negative samples are determined, a multitask loss function is adopted as the objective function of network training, and the multi-scale behavior segment regression network parameters are trained by stochastic gradient descent. The multitask loss function is defined as:
Lloss = (1/Ncls) Σi Lcls(pi, g) + α · (1/Npos) Σi Lloc(ti, ti*) + β · L2(Θ)

In the above formula, L2(Θ) is the L2 regularization loss, Θ denotes the learnable parameters of the whole multi-scale behavior segment regression network, α and β are loss trade-off parameters used to control the time-position offset loss and the regularization loss respectively, and Ncls and Npos are the numbers of total training samples and positive samples respectively. pi = (pi^1, …, pi^m) is the prediction vector over the behavior classes at the i-th anchor position, where the superscript j denotes the j-th of the m behavior classes; pi^g denotes the predicted score of the true label category at the i-th anchor position, the superscript g indicating that the g-th behavior class is the true label class. ti is the time-position offset of the anchor instance, and ti* is obtained by coordinate transformation between the true label position and the anchor position.

Lcls is the behavior classification loss, set to the multi-class softmax loss:

Lcls(pi, g) = −log( exp(pi^g) / Σj exp(pi^j) )

Lloc is the behavior time-position regression loss, set to the smoothed L1 loss on the time-position offsets.
3. Testing and evaluation.
After the multi-scale behavior segment regression network has been trained on the THUMOS'14 dataset, its performance is evaluated as follows: on the test video set of the THUMOS'14 dataset, a video frame sequence is extracted for each video at a frame rate of 10 frames/second, and a window is slid with a step size of 192 frames to generate test frame sequences of the same length as the training frame sequences, which are fed into the trained multi-scale behavior segment regression network to obtain the predicted behavior categories and corresponding time positions. The outputs are then processed with non-maximum suppression (NMS) to produce the final behavior detection results. Finally, the behavior detection results are compared with the real behavior labels of the test set to obtain the evaluation results of the network. Fig. 4 shows behavior detection results on the test video set of the THUMOS'14 dataset.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (5)

1. A single-stage video behavior detection method is characterized by comprising the following steps:
in the training stage, constructing a multi-scale behavior segment regression network based on a convolutional neural network; taking a training video and frame-level real behavior labels as input, training the multi-scale behavior segment regression network with a multi-task learning end-to-end optimization method to obtain a trained multi-scale behavior segment regression network model;
in the using stage, when a new video is input, generating input frame sequences with the same length as the training video through a time-dimension sliding window, and predicting the behavior category and the corresponding time position of each input frame sequence with the trained multi-scale behavior segment regression network model; then processing the prediction results with non-maximum suppression to generate the final behavior detection results;
wherein the constructed multi-scale behavior segment regression network comprises a basic generalization module, a behavior instance anchoring module and a behavior prediction module; wherein:
the basic generalization module comprises N1 alternately arranged three-dimensional convolution layers and N2 three-dimensional max-pooling layers, and is used for feature generalization of the input video sequence and for enlarging the receptive field;
the behavior instance anchoring module adopts N3Step length in the layer time dimension is s1Step in spatial dimension is s2The three-dimensional convolution network is used for associating anchoring behavior examples with different time lengths for each cell of the anchoring characteristic diagram output by each three-dimensional convolution layer of the module;
the behavior prediction module is used for using D for each cell of the anchoring characteristic diagramkConvolving (m +2) convolution kernels with the size h multiplied by w multiplied by 3, and outputting D corresponding to the corresponding cellkThe predicted scores and two time position offsets of the anchoring behavior instances for the m behavior classes; wherein, h and w respectively represent the height and width of the anchoring characteristic diagram.
2. The method as claimed in claim 1, wherein the behavior instance anchoring module defines a basic time scale sk, k ∈ [1, N3], for each anchor feature map, and defines a set of scale ratios {rd}, d ∈ [1, Dk], for each anchor feature map, where Dk is the number of scale ratios and equals the number of anchor behavior instances; each anchor feature map has size h × w × t, where t denotes the length of the anchor feature map, and each cell of size h × w × 3 on the anchor feature map is associated with Dk anchor behavior instances, whose time lengths are ld = sk · rd, d ∈ [1, Dk], and whose center position is the center of the cell.
3. The single-stage video behavior detection method according to claim 1,
training parameters in a multi-scale behavior segment regression network by using a gradient descent method by taking a training video and a frame-level real behavior label as input and combining a training objective function, wherein the training process is as follows:
extracting video frames from a training video by using a fixed frame rate, obtaining a training frame picture sequence, and adjusting each frame picture to a uniform resolution; sequentially using subsequences with the same length of sliding windows as input frame sequences, wherein the length of each sliding window is the maximum frame number allowed by a GPU video memory;
establishing a corresponding relation between the anchoring behavior instance and the real behavior instance in the label by using a positive sample matching strategy;
and (3) adopting a multitask loss function as a target function of the multi-scale behavior segment regression network training, training by a random gradient descent method, and generating a final multi-scale behavior segment regression network model by iteration.
4. The method according to claim 3, wherein the establishing the correspondence between the anchor behavior instance and the real behavior instance in the tag by using the positive sample matching policy comprises:
in each training sample, calculating the overlapping degree of each anchoring behavior instance and each label real behavior instance in the time dimension, if the overlapping degree exceeds a fixed threshold, taking the corresponding anchoring behavior instance as a positive sample, and otherwise, taking the corresponding anchoring behavior instance as a negative sample; wherein one tag real instance can match multiple anchor behavior instances.
5. The method as claimed in claim 3, wherein the multitask loss function is the objective function Lloss of network training, which jointly combines the behavior classification loss and the behavior time-position regression loss and is expressed as:

Lloss = (1/Ncls) Σi Lcls(pi, g) + α · (1/Npos) Σi Lloc(ti, ti*) + β · L2(Θ)

in the above formula, L2(Θ) is the L2 regularization loss, Θ denotes the learned parameters of the whole multi-scale behavior segment regression network, α and β are loss trade-off parameters used to control the time-position offset loss and the regularization loss respectively, and Ncls and Npos are the numbers of total training samples and positive samples respectively; pi = (pi^1, …, pi^m) is the prediction vector over the behavior classes at the i-th anchor position, where the superscript j denotes the j-th of the m behavior classes in total; pi^g denotes the predicted score of the true label category at the i-th anchor position, the superscript g indicating that the g-th behavior class is the true label class; ti is the time-position offset of the anchor instance, and ti* is obtained by coordinate transformation between the true label position and the anchor position;

Lcls is the behavior classification loss, set to the multi-class softmax loss:

Lcls(pi, g) = −log( exp(pi^g) / Σj exp(pi^j) )

Lloc is the behavior time-position regression loss, set to the smoothed L1 loss on the time-position offsets.
CN201810607804.6A 2018-06-13 2018-06-13 Single-stage video behavior detection method Active CN108805083B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810607804.6A CN108805083B (en) 2018-06-13 2018-06-13 Single-stage video behavior detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810607804.6A CN108805083B (en) 2018-06-13 2018-06-13 Single-stage video behavior detection method

Publications (2)

Publication Number Publication Date
CN108805083A CN108805083A (en) 2018-11-13
CN108805083B true CN108805083B (en) 2022-03-01

Family

ID=64085637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810607804.6A Active CN108805083B (en) 2018-06-13 2018-06-13 Single-stage video behavior detection method

Country Status (1)

Country Link
CN (1) CN108805083B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697434B (en) * 2019-01-07 2021-01-08 腾讯科技(深圳)有限公司 Behavior recognition method and device and storage medium
CN109829398B (en) * 2019-01-16 2020-03-31 北京航空航天大学 Target detection method in video based on three-dimensional convolution network
CN109816023B (en) * 2019-01-29 2022-01-04 北京字节跳动网络技术有限公司 Method and device for generating picture label model
CN110059584B (en) * 2019-03-28 2023-06-02 中山大学 Event naming method combining boundary distribution and correction
CN110059658B (en) * 2019-04-26 2020-11-24 北京理工大学 Remote sensing satellite image multi-temporal change detection method based on three-dimensional convolutional neural network
CN110084202B (en) * 2019-04-29 2023-04-18 东南大学 Video behavior identification method based on efficient three-dimensional convolution
CN110222592B (en) * 2019-05-16 2023-01-17 西安特种设备检验检测院 Construction method of time sequence behavior detection network model based on complementary time sequence behavior proposal generation
CN110348345B (en) * 2019-06-28 2021-08-13 西安交通大学 Weak supervision time sequence action positioning method based on action consistency
CN110610194B (en) * 2019-08-13 2022-08-05 清华大学 Data enhancement method for small data video classification task
CN110633645A (en) * 2019-08-19 2019-12-31 同济大学 Video behavior detection method based on enhanced three-stream architecture
CN110659572B (en) * 2019-08-22 2022-08-12 南京理工大学 Video motion detection method based on bidirectional feature pyramid
CN110796069B (en) * 2019-10-28 2021-02-05 广州云从博衍智能科技有限公司 Behavior detection method, system, equipment and machine readable medium
CN111259779B (en) * 2020-01-13 2023-08-01 南京大学 Video motion detection method based on center point track prediction
CN111259783A (en) * 2020-01-14 2020-06-09 深圳市奥拓电子股份有限公司 Video behavior detection method and system, highlight video playback system and storage medium
CN111325097B (en) * 2020-01-22 2023-04-07 陕西师范大学 Enhanced single-stage decoupled time sequence action positioning method
CN111553238A (en) * 2020-04-23 2020-08-18 北京大学深圳研究生院 Regression classification module and method for time axis positioning of actions
CN111814588B (en) * 2020-06-18 2023-08-01 浙江大华技术股份有限公司 Behavior detection method, related equipment and device
CN111898461B (en) * 2020-07-08 2022-08-30 贵州大学 Time sequence behavior segment generation method
CN111832479B (en) * 2020-07-14 2023-08-01 西安电子科技大学 Video target detection method based on improved self-adaptive anchor point R-CNN
CN113033500B (en) * 2021-05-06 2021-12-03 成都考拉悠然科技有限公司 Motion segment detection method, model training method and device
CN113505266B (en) * 2021-07-09 2023-09-26 南京邮电大学 Two-stage anchor-based dynamic video abstraction method
CN114339403B (en) * 2021-12-31 2023-03-28 西安交通大学 Video action fragment generation method, system, equipment and readable storage medium
CN114882403B (en) * 2022-05-05 2022-12-02 杭州电子科技大学 Video space-time action positioning method based on progressive attention hypergraph
CN116996661B (en) * 2023-09-27 2024-01-05 中国科学技术大学 Three-dimensional video display method, device, equipment and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164694B (en) * 2013-02-20 2016-06-01 上海交通大学 A kind of human action knows method for distinguishing
CN105740773B (en) * 2016-01-25 2019-02-01 重庆理工大学 Activity recognition method based on deep learning and multi-scale information
CN106407903A (en) * 2016-08-31 2017-02-15 四川瞳知科技有限公司 Multiple dimensioned convolution neural network-based real time human body abnormal behavior identification method
CN107729799A (en) * 2017-06-13 2018-02-23 银江股份有限公司 Crowd's abnormal behaviour vision-based detection and analyzing and alarming system based on depth convolutional neural networks
CN108133188B (en) * 2017-12-22 2021-12-21 武汉理工大学 Behavior identification method based on motion history image and convolutional neural network

Also Published As

Publication number Publication date
CN108805083A (en) 2018-11-13

Similar Documents

Publication Publication Date Title
CN108805083B (en) Single-stage video behavior detection method
CN111639692B (en) Shadow detection method based on attention mechanism
CN109697434B (en) Behavior recognition method and device and storage medium
CN109697435B (en) People flow monitoring method and device, storage medium and equipment
CN108921051B (en) Pedestrian attribute identification network and technology based on cyclic neural network attention model
CN106096561B (en) Infrared pedestrian detection method based on image block deep learning features
CN112184752A (en) Video target tracking method based on pyramid convolution
CN107633226B (en) Human body motion tracking feature processing method
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
CN111259779B (en) Video motion detection method based on center point track prediction
US20150235079A1 (en) Learning device, learning method, and program
Hu Design and implementation of abnormal behavior detection based on deep intelligent analysis algorithms in massive video surveillance
Chen et al. Learning linear regression via single-convolutional layer for visual object tracking
CN109002755B (en) Age estimation model construction method and estimation method based on face image
CN111414875B (en) Three-dimensional point cloud head posture estimation system based on depth regression forest
CN107767416B (en) Method for identifying pedestrian orientation in low-resolution image
CN112884742A (en) Multi-algorithm fusion-based multi-target real-time detection, identification and tracking method
CN111639540A (en) Semi-supervised character re-recognition method based on camera style and human body posture adaptation
Shafiee et al. Embedded motion detection via neural response mixture background modeling
CN114419732A (en) HRNet human body posture identification method based on attention mechanism optimization
CN114492634B (en) Fine granularity equipment picture classification and identification method and system
CN108257148B (en) Target suggestion window generation method of specific object and application of target suggestion window generation method in target tracking
CN111291785A (en) Target detection method, device, equipment and storage medium
Nooruddin et al. A multi-resolution fusion approach for human activity recognition from video data in tiny edge devices
Hasan et al. Tiny head pose classification by bodily cues

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant