CN109446923B - Deep supervision convolutional neural network behavior recognition method based on training feature fusion - Google Patents

Deep supervision convolutional neural network behavior recognition method based on training feature fusion

Info

Publication number
CN109446923B
CN109446923B (application CN201811176393.6A)
Authority
CN
China
Prior art keywords
video
local
layer
descriptor
evolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811176393.6A
Other languages
Chinese (zh)
Other versions
CN109446923A (en)
Inventor
李侃
李杨
王欣欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201811176393.6A priority Critical patent/CN109446923B/en
Publication of CN109446923A publication Critical patent/CN109446923A/en
Application granted granted Critical
Publication of CN109446923B publication Critical patent/CN109446923B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a deep supervision convolutional neural network behavior recognition method based on training feature fusion, and belongs to the field of artificial intelligence and computer vision. The method extracts multilayer convolutional features of a target video, designs a local evolution rank pooling layer, and uses this layer to map the video convolutional features onto vectors containing temporal information, thereby extracting local evolution descriptors of the target video; a VLAD encoding method then encodes the local evolution descriptors into a meta-action based video-level representation; finally, exploiting the complementarity of information among the multiple levels of the convolutional network, the multi-level classification results are integrated into a final classification result. The method makes full use of temporal information when constructing the video-level representation and effectively improves the accuracy of video behavior recognition. At the same time, integrating the multi-level prediction results improves the discriminability of the intermediate layers of the network and thereby the overall performance of the network.

Description

Deep supervision convolutional neural network behavior recognition method based on training feature fusion
Technical Field
The invention relates to a video-based behavior recognition method, in particular to a deeply supervised convolutional neural network behavior recognition method based on training feature fusion, and belongs to the field of artificial intelligence and computer vision.
Background
At present, human behavior recognition is a research hotspot in the field of intelligent video analysis and an important research direction of video understanding. In recent years it has attracted wide attention in video surveillance, abnormal event monitoring, content-based video retrieval, and similar applications. However, because of the complexity and variability of human behavior and the interference of video background information, how to establish an appropriate spatio-temporal representation of a video becomes critical.
Early studies mainly focused on recognizing simple motions in ideal scenes and employed behavior recognition methods based on hand-crafted features, for example methods based on the 3D histogram of oriented gradients (HOG3D), the histogram of optical flow (HOF), and motion boundary histograms. These methods construct a video representation from region features centered on spatio-temporal interest points (STIP) and use it to recognize actions in the video.
With the rapid development of multimedia technology, data from networks and surveillance videos is growing rapidly, and human behavior recognition in real scenes is receiving more and more attention. Because of changes in human body shape, viewpoint, illumination and background, as well as camera motion, traditional behavior recognition methods based on hand-crafted features struggle to achieve ideal results in such real scenes.
In recent years, with the rapid development and application of deep learning in computer vision, a series of human behavior recognition methods based on deep models have been proposed: recognizing behaviors in video at the level of single frames, capturing motion information with a two-stream network that uses RGB frames and optical flow, learning spatio-temporal features of video segments with three-dimensional convolutional networks on the video stream, and so on. The later two-stream inflated three-dimensional convolutional network (I3D), which inflates the two-dimensional convolution and pooling kernels of a convolutional network into three dimensions, allows the network to seamlessly learn spatio-temporal features of video.
However, existing convolutional network structures can only model a single frame or a short segment of a video and lack the ability to directly model its long-range temporal structure. Existing behavior recognition methods based on deep models therefore adopt different strategies to obtain long-range spatio-temporal features of the video. These strategies fall into two main categories: (1) deep convolutional feature encoding and pooling methods, which extract the convolutional features of frames or video segments with a deep convolutional network and then construct a global video-level representation by spatio-temporal encoding or pooling; however, the video representation constructed in this way is unordered and ignores the temporal order and evolution relationships between video frames; (2) methods that construct the video-level representation by considering the temporal structure of the video, i.e., the deep features of a number of frames or video segments are fed into a temporal model such as an LSTM, a GRU or a ranking function and fused into a video-level representation; however, such methods may to some extent lack the spatial local information of the video.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provides a deeply supervised convolutional neural network behavior recognition method based on training feature fusion. It addresses the problems of existing long-range video representation methods based on deep features and recognizes human behavior by establishing an appropriate spatio-temporal representation of the video.
The invention is realized by the following technical scheme.
A deep supervision convolutional neural network behavior recognition method based on training feature fusion comprises the following steps:
step 1: video data for training is collected to form a training data set.
The videos in the training data set are preprocessed: all video frames are extracted and cropped to the same size.
Step 2: the video in the training dataset is frame sampled.
Each video in the training data set is sampled uniformly: over the entire video span, T RGB frames [I_1, I_2, ..., I_T] are collected at a time interval of T_z/T, where T_z is the total duration of the video and I_t denotes the t-th sampled video frame, which corresponds to the t-th time instant.
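The sampling rule above can be written as a short helper. A minimal sketch, assuming the video is already decoded into a list of frames (the helper name and the rounding of fractional indices are illustrative assumptions):

```python
# Minimal sketch of the uniform frame sampling of step 2.
import numpy as np

def sample_uniform_frames(frames, T=10):
    """Pick T frames spaced evenly over the whole video span (interval T_z / T)."""
    idx = np.linspace(0, len(frames) - 1, num=T).round().astype(int)
    return [frames[i] for i in idx]
```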
Step 3: The training data set is augmented.
All video frames collected from each video are reversed in time to form a new video, which expands the training data set so that it contains twice as many videos as before.
Step 4: The multilayer convolutional features of the training video frames are extracted.
First, M convolutional layers are selected from a standard CNN (convolutional neural network) architecture for extracting the multilayer convolutional features of the video frames. Since recognizing behavior typically requires high-level semantic information, such as objects or body parts, the invention selects the M convolutional layers from the top convolutional layers of the network that generate feature maps.
Then, the T RGB frames [I_1, I_2, ..., I_T] collected from the video V are input into the convolutional network, and the feature maps generated at the M selected convolutional layers are extracted for each RGB frame. For each RGB frame, a feature map of spatial size N × N containing C channels is obtained at each selected convolutional layer. For the entire video V, M × T feature maps of spatial size N × N containing C channels are obtained.
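As a concrete illustration of this step, the sketch below registers forward hooks on M named layers of a generic PyTorch backbone and returns their feature maps for the T sampled frames; the backbone, the layer names, and the hook-based approach are assumptions for illustration, not the patent's prescribed implementation.

```python
# Hedged sketch of step 4: keep the feature maps of M selected top convolutional layers.
import torch.nn as nn

class MultiLayerFeatureExtractor(nn.Module):
    def __init__(self, backbone, layer_names):
        super().__init__()
        self.backbone = backbone
        self.layer_names = list(layer_names)     # the M selected convolutional layers
        self._maps = {}
        for name, module in backbone.named_modules():
            if name in self.layer_names:
                module.register_forward_hook(self._save(name))

    def _save(self, name):
        def hook(_module, _inputs, output):
            self._maps[name] = output            # feature map of shape (T, C, N, N)
        return hook

    def forward(self, frames):                   # frames: (T, 3, H, W) RGB frames of one video
        self._maps = {}
        self.backbone(frames)
        # one (T, C, N, N) tensor per selected layer -> M x T feature maps in total
        return [self._maps[n] for n in self.layer_names]
```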
Step 5: Feature aggregation is performed on the multilayer feature maps of the video frames to obtain video-level representations. The specific method is as follows:
Step 5.1: The local evolution descriptors of the video V are extracted using the local evolution rank pooling method.
The T feature maps obtained from the frames of the video V at the same convolutional layer are taken as input, the feature map of each frame is decomposed into a group of local spatial features, and the evolution information of the local spatial features at each spatial position is modeled to generate the local evolution descriptors. The specific method is as follows:
step 5.1.1: t frame [ I ] of video V via step 41,I2,...,IT]Each frame in (a) acquires a feature map of spatial size N × N and containing C channels at a selected convolutional layer, the feature map being represented as [ fm1,fm2,...,fmT]. The values of all channels at each spatial position on each feature map, T e { 1.,. T }, are connected separately, thereby decomposing each feature map into a plurality of local spatial features. For each frame, N × N C-dimensional local spatial features will be obtained.
Step 5.1.2: for T frame [ I1,I2,...,IT]The evolution information of each spatial position is modeled to generate a video V local evolution descriptor. The specific method comprises the following steps:
step 5.1.2.1: for a specific spatial position, the local spatial features of the T frames are represented as [ r ] in a time sequencei1,ri2,…,rit,...,riT]Where i ═ 1,. cndot.n },
Figure BDA0001823862180000031
is the local spatial feature of the ith spatial position at the t-th time,
Figure BDA0001823862180000032
real vector space in C dimension, i.e. ritIs a vector in the real vector space of dimension C.
Step 5.1.2.2: The evolution information of the i-th spatial position is modeled. A ranking (Rank) function is defined that computes a score for each time instant:

S(t, i | e) = e^T · d_it   (1)

where d_it = (1/t) Σ_{τ=1}^{t} r_iτ ∈ R^C is the average local spatial feature of the i-th spatial position over the first t time instants.
the invention sets a constraint relationship: the score value corresponding to the later moment being greater than the score value corresponding to the earlier moment, i.e.
Figure BDA0001823862180000041
The parameter e may reflect the temporal order of these local spatial features. Learning the parameter e can be considered as a convex optimization problem:
Figure BDA0001823862180000042
the first term of the objective function E (e) is the general quadratic regularization term, and the second term is the soft count loss function change-loss.
Step 5.1.2.3: The objective function E(e) is optimized, mapping the series of local spatial features onto a vector e*. e* contains the ordering information of the local spatial features and is the local evolution descriptor. The method uses an approximation technique to solve this optimization problem so that the operation can be embedded in the CNN. The solution of the objective function is finally simplified as:

e*_i = Σ_{t=1}^{T} α_t · r_it   (3)

where α_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}) and H_t = Σ_{j=1}^{t} 1/j (with H_0 = 0) are the weights obtained by rank pooling (RankPooling). The above solution can be seen as a weighted sum of the local spatial features of the i-th spatial position over the T sampled time instants.
Step 5.1.2.4: A local evolution rank pooling layer is designed based on the above approximate solution of the ranking function. The layer takes the N × N × C convolutional feature maps of the T frames as input and outputs N × N local evolution descriptor vectors of dimension C, [e_1, e_2, ..., e_{N×N}].
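A minimal sketch of such a layer is given below, assuming PyTorch tensors of shape (T, C, N, N); it decomposes the feature maps into per-position local spatial features and applies the fixed weights α_t of equation (3). Treat it as an illustration of the approximate rank pooling idea rather than the patented layer itself.

```python
# Sketch of the local evolution rank pooling layer of steps 5.1.2.3-5.1.2.4.
import torch
import torch.nn as nn

class LocalEvolutionRankPooling(nn.Module):
    def __init__(self, T):
        super().__init__()
        H = torch.cumsum(1.0 / torch.arange(1, T + 1), dim=0)   # H_t = sum_{j<=t} 1/j
        H = torch.cat([torch.zeros(1), H])                      # prepend H_0 = 0
        t = torch.arange(1, T + 1, dtype=torch.float32)
        alpha = 2 * (T - t + 1) - (T + 1) * (H[T] - H[:T])      # alpha_t from equation (3)
        self.register_buffer("alpha", alpha)                    # (T,)

    def forward(self, fmaps):
        # fmaps: (T, C, N, N) feature maps of one video at one selected layer
        T, C, N, _ = fmaps.shape
        r = fmaps.permute(2, 3, 0, 1).reshape(N * N, T, C)      # local spatial features r_it
        e = (self.alpha.view(1, T, 1) * r).sum(dim=1)           # e_i = sum_t alpha_t r_it
        return e                                                # (N*N, C) local evolution descriptors
```

In this sketch the weights α_t are fixed, so the layer adds no trainable parameters and gradients flow straight through to the convolutional features below it.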
Step 5.2: The local evolution descriptors of the video are encoded into a meta-action based video-level representation using a VLAD (Vector of Locally Aggregated Descriptors) encoding method based on local evolution descriptors.
Based on the idea that an action is composed of a group of meta-actions, the method provides a VLAD encoding method based on local evolution descriptors. The specific steps are as follows:
step 5.2.1: using K meta-verb words, the feature space is transformed
Figure BDA0001823862180000045
Is divided into K units, and the anchor point of each unit is set as ak
Step 5.2.2: a series of local evolution descriptors [ e ] of the video V obtained in step 5.11,e2,...,eN×N]Is assigned to one of the K units and records a local evolution descriptor eiAnd anchor point akThe residual vector in between.
Step 5.2.3: The residual vectors are summed:

h_k = Σ_{i=1}^{N×N} ā_k(e_i) · (e_i - a_k)   (4)

In formula (4), ā_k(e_i) denotes the soft assignment of descriptor e_i to the k-th cell, and the anchor point a_k is in this formula a parameter that can be adjusted by training; e_i - a_k is the residual between the local evolution descriptor and the k-th anchor point. The h_k obtained by the formula is the aggregation descriptor in the k-th cell.
Step 5.2.4: After the sums of residuals between the local evolution descriptors of the video and each anchor point are obtained, the video V can be expressed as v = [h_1, h_2, ..., h_K] ∈ R^{C×K}, where C is the dimension of the real space and K is the number of meta-action cells, so v is a C × K matrix in the real space.
Since the above equation is differentiable and allows the error gradient to propagate back to the lower layers of the network, the invention designs a VLAD encoding layer based on local evolution descriptors.
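The sketch below illustrates one way such an encoding layer can be written in the spirit of NetVLAD; the softmax-based soft assignment ā_k(e_i) produced by a linear score function is an assumption chosen for the illustration, not the patent's exact formula.

```python
# Hedged sketch of the VLAD encoding layer of step 5.2 over local evolution descriptors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalEvolutionVLAD(nn.Module):
    def __init__(self, C, K):
        super().__init__()
        self.anchors = nn.Parameter(torch.randn(K, C) * 0.01)   # meta-action anchors a_k
        self.assign = nn.Linear(C, K)                           # soft-assignment logits

    def forward(self, e):
        # e: (N*N, C) local evolution descriptors of one video at one layer
        a = F.softmax(self.assign(e), dim=1)                    # (N*N, K) soft assignments
        residual = e.unsqueeze(1) - self.anchors.unsqueeze(0)   # (N*N, K, C): e_i - a_k
        v = (a.unsqueeze(2) * residual).sum(dim=0)              # (K, C): h_k = sum_i a_ik (e_i - a_k)
        return v.t()                                            # (C, K) video-level representation
```

In NetVLAD-style layers the anchors and the assignment weights are trained jointly with the rest of the network, which is consistent with the statement above that the encoding layer lets gradients propagate back to the lower layers.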
Step 6: For the selected M convolutional layers, the operations of step 5 are performed on each layer in parallel to obtain the video-level feature representation of the video on each selected convolutional layer.
Recognizing actions from the video-level representations obtained at the multiple convolutional layers is the deeply supervised action recognition method.
Step 7: The video-level representation of each layer obtained in step 6 is input into the corresponding classifier to obtain the classification results of the video V on the M selected convolutional layers. The specific method is as follows:
step 7.1: to integrate all the parameters in the convolution and aggregation operations of the network, we define:
W = (W^(1), W^(2), ..., W^(B), w_a^(1), ..., w_a^(M))

w_c = (w_c^(1), w_c^(2), ..., w_c^(M))

where B is the total number of convolutional layers; for b ∈ {1, ..., B}, W^(b) denotes the parameters of the b-th convolutional layer. M is the number of selected convolutional layers; since a classification result is obtained on each selected convolutional layer, each selected layer is followed by a feature aggregation operation and a classifier, so there are M feature aggregation operations and M classifiers. For m ∈ {1, ..., M}, w_a^(m) denotes the weights of the feature aggregation operation on the m-th selected convolutional layer, and w_c^(m) denotes the weights of the classifier connected to the m-th selected convolutional layer.
Step 7.2: defining a loss function that merges all output layer classification errors:
ℓ(W, w_c) = Σ_{m=1}^{M} L(g, s_m)

where L denotes the video-level cross-entropy loss function of the action classification, defined as:

L(g, s_m) = - Σ_{i=1}^{Z} 1(g = A_i) · log P(g = A_i | s_m)

where g is the ground-truth label of the video V, g ∈ A, A = {A_1, ..., A_Z} defines all the action categories, Z is the number of categories, A_i denotes the i-th action category in the action set A, and s_m denotes the action class scores predicted from the m-th selected convolutional layer.
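As an illustration of this deeply supervised classification, the sketch below sums one cross-entropy term per selected layer; the linear classifiers and the tensor shapes are assumptions made for the example.

```python
# Hedged sketch of the loss of step 7.2: one classifier per selected layer, losses summed.
import torch.nn.functional as F

def deep_supervision_loss(video_reps, label, classifiers):
    """video_reps: list of M video-level tensors of shape (C, K); label: scalar class index."""
    total = 0.0
    for v, clf in zip(video_reps, classifiers):        # e.g. classifiers: nn.ModuleList of nn.Linear(C*K, Z)
        logits = clf(v.flatten().unsqueeze(0))         # (1, Z) class scores s_m
        total = total + F.cross_entropy(logits, label.view(1))
    return total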
Step 8: The classification results of the M selected convolutional layers are integrated.
The invention provides a classification integration method that fuses the multi-level prediction results: the scores obtained from each convolutional layer are summed with corresponding weights so as to fully exploit the complementarity of the multi-level information. The weights are assigned by an attention-based method. The specific method is as follows:
step 8.1: let the fused prediction result F be represented as:
F = Σ_{m=1}^{M} w_f^(m) ⊙ s_m

where w_f = (w_f^(1), ..., w_f^(M)) denotes the integration weights, each w_f^(m) ∈ R^Z is a Z-dimensional vector obtained by assigning weights with an attention mechanism, and s_m denotes the action class scores predicted from the m-th selected convolutional layer (⊙ denotes element-wise multiplication).
The loss function of the integration layer is defined as:

L_f(W, w_c, w_f) = - Σ_{i=1}^{Z} 1(g = A_i) · log P(y = A_i | V, W, w_c, w_f)

where y = argmax(F) denotes the finally predicted action type, and P(y = A_i | V, W, w_c, w_f) is the probability that the finally predicted action class is A_i.
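The following sketch shows one way the attention-style integration weights w_f^(m) could be realised; normalising them with a softmax over layers is an assumption made here so that the fused scores stay on a comparable scale.

```python
# Hedged sketch of the attention-based classification integration of step 8.1.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    def __init__(self, M, Z):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(M, Z))   # one weight vector per selected layer

    def forward(self, layer_scores):
        # layer_scores: (M, Z) class scores s_1 ... s_M from the M layer classifiers
        w = F.softmax(self.logits, dim=0)               # integration weights w_f^(m)
        return (w * layer_scores).sum(dim=0)            # fused prediction F of shape (Z,)
```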
Step 8.2: minimizing the following objective function on the training set, and learning to obtain all parameters W, Wc,wf
(W*, w_c*, w_f*) = argmin_{W, w_c, w_f} Σ_{V in training set} [ ℓ(W, w_c) + L_f(W, w_c, w_f) ]
Step 9: The loss function is optimized with a gradient descent algorithm, and the model parameters are adjusted through back propagation until the loss function converges. At this point, the deep convolutional neural network behavior recognition model based on trainable feature fusion has been trained.
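A compact, hypothetical training step combining the per-layer losses and the integration-layer loss under gradient descent might look as follows; the module and optimizer choices are illustrative and reuse the sketches above.

```python
# Hedged sketch of one optimization step for step 9.
import torch
import torch.nn.functional as F

def train_step(frames, label, extractor, rank_pool, vlads, classifiers, fusion, optimizer):
    feature_maps = extractor(frames)                              # M x (T, C, N, N)
    scores = []
    for fmap, vlad, clf in zip(feature_maps, vlads, classifiers):
        v = vlad(rank_pool(fmap))                                 # (C, K) video-level representation
        scores.append(clf(v.flatten().unsqueeze(0)).squeeze(0))   # (Z,) class scores s_m
    scores = torch.stack(scores)                                  # (M, Z)
    per_layer_loss = sum(F.cross_entropy(s.unsqueeze(0), label.view(1)) for s in scores)
    fused = fusion(scores)                                        # fused prediction F
    fusion_loss = F.cross_entropy(fused.unsqueeze(0), label.view(1))
    loss = per_layer_loss + fusion_loss                           # objective of step 8.2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```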
Step 10: The model trained in step 9 is used to recognize the human behaviors in an unknown video V'. The specific steps are as follows:
Step 10.1: The unknown video V' is preprocessed and frame-sampled according to the methods of steps 1 and 2, yielding T RGB frames [I'_1, I'_2, ..., I'_T].
Step 10.2: The multilayer convolutional features of the unknown video are extracted according to the method of step 4. For each RGB frame of V', a feature map of spatial size N × N containing C channels is obtained at each selected convolutional layer. For the entire unknown video V', M × T feature maps of spatial size N × N containing C channels are obtained.
Step 10.3: According to the methods of steps 5 and 6, the video-level feature representation of V' on each of the M selected convolutional layers is obtained. The specific steps are as follows:
First, following step 5.1, the local evolution rank pooling method yields N × N C-dimensional local evolution descriptor vectors [e'_1, e'_2, ..., e'_{N×N}] on each selected convolutional layer of V'.
Then, following step 5.2, the VLAD encoding based on local evolution descriptors encodes [e'_1, e'_2, ..., e'_{N×N}] into a meta-action based video-level representation v' = [h'_1, h'_2, ..., h'_K] ∈ R^{C×K}.
Finally, following step 6, the above operations are performed in parallel on the M selected convolutional layers, yielding a video-level representation of V' on each layer.
Step 10.4: According to the method of step 7, the classification results of V' on the M selected convolutional layers are obtained; s'_m denotes the action class predicted for V' on the m-th selected convolutional layer. According to the method of step 8, the multi-layer classification results are integrated by the classification integration method to obtain the final classification result of the unknown video. F' denotes the fused prediction:

F' = Σ_{m=1}^{M} w_f^(m) ⊙ s'_m

where w_f^(m) ∈ R^Z is a Z-dimensional vector and s'_m is the action class prediction of the m-th selected convolutional layer.
After the above process is completed, the predicted behavior of the persons in the unknown video is obtained.
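For completeness, a hypothetical inference routine for an unknown video, reusing the building blocks sketched above (all module names are placeholders, not the patent's implementation):

```python
# Hedged sketch of the recognition of an unknown video V' in step 10.
import torch

@torch.no_grad()
def predict(frames, extractor, rank_pool, vlads, classifiers, fusion, class_names):
    feature_maps = extractor(frames)                              # step 10.2
    scores = []
    for fmap, vlad, clf in zip(feature_maps, vlads, classifiers): # steps 10.3-10.4
        v = vlad(rank_pool(fmap))                                 # video-level representation on this layer
        scores.append(clf(v.flatten().unsqueeze(0)).squeeze(0))   # (Z,) class scores s'_m
    fused = fusion(torch.stack(scores))                           # F' = sum_m w_f^(m) * s'_m
    return class_names[fused.argmax().item()]                     # e.g. "running"
```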
Advantageous effects
Compared with the prior art, the invention has the following beneficial effects:
(1) the proposed feature aggregation operation combines the local evolution rank pooling operation and the VLAD encoding operation based on local evolution descriptors into a whole, implementing them as a local evolution rank pooling layer and a VLAD encoding layer based on local evolution descriptors, thereby simplifying the implementation of the method;
(2) the proposed local evolution rank pooling method captures more details about the action by modeling the temporal evolution information of each spatial position;
(3) the VLAD encoding based on local evolution descriptors generates a more discriminative video representation by projecting the local evolution descriptors into a semantic space;
(4) the proposed deeply supervised action recognition method constructs multi-level video representations in a single network and generates multiple prediction results;
(5) the proposed integration of multi-level classification results improves the discriminability of the intermediate layers of the network, thereby improving the overall performance of the network.
Drawings
FIG. 1 is a block diagram of the overall logic of the present invention.
FIG. 2 shows the steps of the method of the present invention and the propagation of parameters. The method comprises a model training step, a feature aggregation method and a deep supervision action recognition method.
FIG. 3 is a flow chart of the method of the present invention.
Detailed Description
The following will explain the embodiments of the present invention in further detail with reference to the accompanying drawings.
The invention is executed on a computer and consists of three main functions. First, multilayer convolutional feature extraction, which extracts a multilayer feature map for each frame of a video. Second, feature aggregation, which includes a local evolution rank pooling layer that encodes the multi-frame feature maps obtained at each layer into local evolution descriptors, and a VLAD encoding layer based on local evolution descriptors that encodes the local evolution descriptors into a meta-action based video-level representation. Third, the deeply supervised action recognition method, which recognizes the actions of persons in the video using the obtained multi-layer video-level representations and integrates the multi-layer classification results into a final prediction. The overall logical structure of the invention is shown in FIG. 1.
Fig. 3 is a flowchart of a deep supervised convolutional neural network behavior recognition method based on trainable feature fusion according to the present invention.
The following describes in more detail a specific embodiment of a deep supervised convolutional neural network behavior recognition method based on trainable feature fusion according to the present invention.
According to the flow chart of the model training phase shown in (b) of fig. 3, the specific implementation method of the model training phase is as follows:
step 1: the video in the training video dataset is pre-processed, all video frames are extracted and cropped to a size of 224px by 224 px.
Step 2: For each of the training videos, 10 RGB frames [I_1, I_2, ..., I_10] are collected uniformly at a time interval of T_z/10, where T_z is the total duration of the video and I_t denotes the t-th collected frame of the video; for convenience, the t-th frame of a training video corresponds to its t-th time instant.
Step 3: The video frames collected from each video in the data set are reversed to form new videos so as to expand the training data set; the video data set then contains twice as many videos as the previous training data set.
Step 4: To extract the multilayer convolutional features of the training video frames, the invention selects 3 convolutional layers of the pre-trained CNN architecture, namely the Mixed5_a, Mixed5_b and Mixed5_c layers, to generate the feature maps of the video frames. The 10 RGB frames [I_1, I_2, ..., I_10] collected from the video V are input into the convolutional network; for each RGB frame, a feature map of spatial size 64 × 64 containing 3 channels is obtained at each selected convolutional layer. For the entire video V, 3 × 10 feature maps of spatial size 64 × 64 containing 3 channels are obtained.
Step 5: Feature aggregation is performed on the multilayer feature maps of the video frames to obtain video-level representations. The specific method is as follows:
Step 5.1: The RGB frames collected from each training video are input into the local evolution rank pooling layer to obtain the local evolution descriptors of each training video.
Step 5.1.1: Via step 4, each of the 10 frames [I_1, I_2, ..., I_10] of the training video V yields, at the Mixed5_a layer, a feature map of spatial size 64 × 64 containing 3 channels; the feature maps can be represented as [fm_1, fm_2, ..., fm_10]. For each fm_t, t ∈ {1, ..., 10}, the values of all channels at each spatial position are concatenated, so that fm_t is decomposed into 64 × 64 3-dimensional local spatial features.
Step 5.1.2: For the 10 frames [I_1, I_2, ..., I_10], the evolution information of each spatial position is modeled to generate the local evolution descriptors of the video V. The specific method is as follows:
step 5.1.2.1, the local spatial features of a particular spatial location i are sorted in time order to obtain the representation ri1,ri2,…rit,…,ri10]Where i ═ 1, ·,64},
Figure BDA0001823862180000101
is the local spatial feature of the ith spatial position at the t-th time,
Figure BDA0001823862180000102
real vector space in 3 dimensions, i.e. ritIs a vector in the 3-dimensional real vector space.
Step 5.1.2.2: Using the ranking function S(t, i | e) = e^T · d_it, a score is computed for each time instant t, where d_it = (1/t) Σ_{τ=1}^{t} r_iτ ∈ R^3 is the average local spatial feature of the i-th spatial position over the first t time instants; the time instants correspond to t = 1, ..., 10. If q ∈ {1, ..., 10} is a time instant later than t ∈ {1, ..., 10}, then S(q, i | e) > S(t, i | e). All pairs satisfying q > t are found and E(e) is computed:

E(e) = (λ/2) ‖e‖² + (2 / (10 × 9)) Σ_{q>t} max{0, 1 - S(q, i | e) + S(t, i | e)}
step 5.1.2.3, optimize E (e), map a series of local spatial features to a vector e。eNamely, the local evolution descriptor of the training video:
e=argmineE(e)
using an approximation technique to simplify the solution of e (e) as:
Figure BDA0001823862180000106
wherein alpha ist=2(10-t+1)-(10+1)(H10-Ht-1),
Figure BDA0001823862180000107
The weights are obtained by rank pooling (RankPooling). The above solution can be seen as a weighted addition of the local spatial features of the ith spatial position at all acquired 10 time instants.
Step 5.1.2.4, the learned e-vector is the local evolution descriptor of the ith spatial position of the training video, the whole training video is input, and 64 × 64 3-dimensional local evolution descriptor vectors [ e ] are obtained at Mixed5_ a layer1,e2,...,e64×64]。
Step 5.2: The local evolution descriptor vectors of each training video are input into the VLAD encoding layer based on local evolution descriptors to obtain the video-level representation of each training video.
Step 5.2.1: Using 32 meta-action words, the feature space R^3 is divided into 32 cells, and each of the local evolution descriptors [e_1, e_2, ..., e_{64×64}] is assigned to one of the 32 cells. The residual vector (e_i - a_k) between the local evolution descriptor e_i and each meta-action anchor point a_k is recorded.
Step 5.2.2: These residual vectors are summed to obtain the aggregation descriptor h_k in the k-th cell:

h_k = Σ_{i=1}^{64×64} ā_k(e_i) · (e_i - a_k)

Step 5.2.3: The training video can then be represented as v = [h_1, h_2, ..., h_32] ∈ R^{3×32}, i.e. v is a 3 × 32 matrix in the real space.
Step 6: The operations of step 5 above are performed in parallel at the Mixed5_a, Mixed5_b and Mixed5_c layers, resulting in a video-level representation of each training video on these 3 convolutional layers.
Step 7: The classification results of the training video on the multiple convolutional layers are obtained.
The video-level representation of each layer obtained in step 6 is input into the corresponding classifier to obtain the classification result of that convolutional layer. The specific method is as follows:
step 7.1, define parameters, total number of convolution layers of the whole network is B, and the parameter of the B-th convolution layer is expressed as
Figure BDA0001823862180000113
The selected convolutional layers are Mixed5_ a Mixed5_ b layer and Mixed5_ c layer 3, and each selected convolutional layer is connected with a feature aggregation operation and a classifier because a classification result is obtained on each selected convolutional layer, so the number of the feature aggregation operations is 3, and the number of the classifiers is also 3. Then the weight of the feature aggregation operation on the mth selected convolutional layer is
Figure BDA0001823862180000114
The weight of the classifier connected to the mth selected convolutional layer is
Figure BDA0001823862180000115
Figure BDA0001823862180000116
Figure BDA0001823862180000117
Step 7.2: The loss function merging all output-layer classification errors is defined as:

ℓ(W, w_c) = Σ_{m=1}^{3} L(g, s_m)

where L is the video-level cross-entropy loss of the action classification. Let A = {A_1, ..., A_51} define all action classes in the training data set, with 51 classes in total. The ground-truth label of the training video is g ∈ A, and s_m is the action class prediction of the m-th selected convolutional layer. The cross-entropy loss is then:

L(g, s_m) = - Σ_{i=1}^{51} 1(g = A_i) · log P(g = A_i | s_m)
and 8: and integrating the classification results of multiple layers.
Step 8.1: The integrated prediction result is:

F = Σ_{m=1}^{3} w_f^(m) ⊙ s_m

where w_f = (w_f^(1), w_f^(2), w_f^(3)) denotes the integration weights and each w_f^(m) is a Z-dimensional vector obtained by assigning weights with the attention mechanism. The loss function of the integration layer is defined as:

L_f(W, w_c, w_f) = - Σ_{i=1}^{Z} 1(g = A_i) · log P(y = A_i | V, W, w_c, w_f)

where y = argmax(F) denotes the finally predicted action type and P(y = A_i | V, W, w_c, w_f) is the probability that the finally predicted action class is A_i.
Step 8.2: The following objective function is minimized to learn all the parameters W, w_c, w_f:

(W*, w_c*, w_f*) = argmin_{W, w_c, w_f} Σ_{training videos} [ ℓ(W, w_c) + L_f(W, w_c, w_f) ]
Step 9: The loss function is optimized with a gradient descent algorithm, and the model parameters are adjusted through back propagation until the loss function converges; the deep convolutional neural network behavior recognition model based on trainable feature fusion is then trained.
Step 10: The model trained in step 9 is used to recognize the human behaviors in an unknown video V'. The specific steps are as follows:
and step 10.1, preprocessing and frame sampling are carried out on the input unknown video according to the step 1 and the step 2, all video frames of the unknown video are extracted and cut into the size of 224px multiplied by 224 px. At time intervals of
Figure BDA0001823862180000128
Uniformly acquiring 10 RGB frames [ I'1,I′1,...,I′10]0.4s is the total duration of unknown video, I'tIndicating that the t-th acquired video frame of a certain video.
Step 10.2: The multilayer convolutional features of the unknown video are extracted according to the method of step 4; for each RGB frame of V', a feature map of spatial size 64 × 64 containing 3 channels is obtained at each selected convolutional layer. For the entire unknown video V', 3 × 10 feature maps of spatial size 64 × 64 containing 3 channels are obtained.
Step 10.3: According to the methods of steps 5 and 6, the video-level feature representation of V' on each of the 3 selected convolutional layers is obtained. The specific steps are as follows:
First, following step 5.1, the local evolution rank pooling method yields 64 × 64 3-dimensional local evolution descriptor vectors [e'_1, e'_2, ..., e'_{64×64}] on each selected convolutional layer of V'.
Then, following step 5.2, the VLAD encoding based on local evolution descriptors encodes [e'_1, e'_2, ..., e'_{64×64}] into a meta-action based video-level representation v' = [h'_1, h'_2, ..., h'_32], where v' is a 3 × 32 matrix in the real space.
Finally, following the method of step 6, the above operations are performed in parallel on the 3 selected convolutional layers Mixed5_a, Mixed5_b and Mixed5_c, resulting in a video-level representation of V' on each layer.
Step 10.4: The classification results of V' on the 3 selected convolutional layers are obtained according to the method of step 7; s'_m denotes the action class predicted for the unknown video V' on the m-th selected convolutional layer. According to the method of step 8, the multi-layer classification results are integrated by the classification integration method to obtain the final classification result of the unknown video:

F' = Σ_{m=1}^{3} w_f^(m) ⊙ s'_m

where w_f^(m) denotes the integration weights.
After the above process is completed, the predicted behavior of the person in the unknown video is obtained, in this example 'running'.
The above detailed description is intended to illustrate the objects, aspects and advantages of the present invention, and it should be understood that the above detailed description is only exemplary of the present invention and is not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (6)

1. The deep supervision convolutional neural network behavior recognition method based on training feature fusion is characterized by comprising the following steps of:
step 1: collecting video data for training to form a training data set;
step 2: performing uniform frame sampling on each video in the training data set;
step 3: expanding the training data set: all the video frames collected from each video are reversed to form new videos, so that the training data set is expanded and the number of videos in the video data set is 2 times that in the previous video data set;
step 4: extracting multilayer convolutional features of the training video frames;
firstly, selecting M convolutional layers from a standard convolutional neural network architecture for extracting multilayer convolutional characteristics of a video frame;
then, the T RGB frames [I_1, I_2, ..., I_T] collected from the video V are input into the convolutional network, and the feature maps generated at the M selected convolutional layers are extracted for each RGB frame; for each RGB frame, a feature map of spatial size N × N containing C channels is obtained at each selected convolutional layer; for the entire video V, M × T feature maps of spatial size N × N containing C channels are obtained;
step 5: performing feature aggregation on the multilayer feature maps of the video frames to obtain a video-level representation, the specific method being as follows:
step 5.1: extracting the local evolution descriptors of the video V by using a local evolution rank pooling method:
firstly, taking T feature maps obtained by a plurality of frames of a video V under the same convolutional layer as input, decomposing the feature maps of each frame into a group of local spatial features, and finally modeling evolution information of the local spatial features of each spatial position to generate a local evolution descriptor;
step 5.2: encoding a local evolution descriptor of a video into a meta-action based representation of the video level using a local aggregation vector encoding method based on the local evolution descriptor;
step 6: for the selected M convolutional layers, performing the operations of step 5 on each layer in parallel to obtain the video-level feature representation of the video on each selected convolutional layer;
step 7: inputting the video-level representation of each layer obtained in step 6 into a corresponding classifier to obtain the classification results of the video V on the M selected convolutional layers;
step 8: integrating the classification results of the M selected convolutional layers, the specific method being as follows:
step 8.1: let the fused prediction result F be represented as:

F = Σ_{m=1}^{M} w_f^(m) ⊙ s_m   (1)

wherein w_f = (w_f^(1), ..., w_f^(M)) denotes the integration weights, each w_f^(m) ∈ R^Z being a Z-dimensional vector obtained by assigning weights by means of attention, and s_m denotes the action class scores predicted from the m-th selected convolutional layer (⊙ denotes element-wise multiplication);
the loss function of the integration layer is defined as:

L_f(W, w_c, w_f) = - Σ_{i=1}^{Z} 1(g = A_i) · log P(y = A_i | V, W, w_c, w_f)   (2)

wherein y denotes the finally predicted action type, and P(y = A_i | V, W, w_c, w_f) is the probability that the finally predicted action class is A_i;
step 8.2: minimizing the following objective function on the training set and learning all the parameters W, w_c, w_f:

(W*, w_c*, w_f*) = argmin_{W, w_c, w_f} Σ_{V in training set} [ ℓ(W, w_c) + L_f(W, w_c, w_f) ]   (3)
And step 9: optimizing the loss function by using a gradient descent algorithm, and adjusting model parameters through back propagation until the loss function is converged;
step 10: recognizing the human behaviors in an unknown video V' by using the model trained in step 9, the specific steps being as follows:
step 10.1: preprocessing the unknown video V' according to the methods of step 1 and step 2 and performing frame sampling to obtain T RGB frames [I'_1, I'_2, ..., I'_T];
Step 10.2: extracting multilayer convolution characteristics of the unknown video according to the method in the step 4; for each RGB frame of V', obtaining a feature map with a spatial size of N × N and containing C channels at each selected convolution layer; for the whole unknown video V', obtaining M × T feature maps with the space size of N × N and containing C channels;
step 10.3: according to the methods of step 5 and step 6, obtaining the video-level feature representation of V' on each of the M selected convolutional layers; the specific steps are as follows:
first, following step 5.1, using the local evolution rank pooling method to obtain N × N C-dimensional local evolution descriptor vectors [e'_1, e'_2, ..., e'_{N×N}] on each selected convolutional layer of V';
then, following step 5.2, using the VLAD encoding based on local evolution descriptors to encode [e'_1, e'_2, ..., e'_{N×N}] into a meta-action based video-level representation v' = [h'_1, h'_2, ..., h'_K] ∈ R^{C×K};
finally, following the method of step 6, performing the above operations in parallel on the M selected convolutional layers to obtain a video-level representation of V' on each layer;
step 10.4: obtaining the classification results of V' on the M selected convolutional layers according to the method of step 7, s'_m denoting the action class predicted for V' on the m-th selected convolutional layer; according to the method of step 8, integrating the multi-layer classification results by the classification integration method to obtain the final classification result of the unknown video; F' denotes the fused prediction:

F' = Σ_{m=1}^{M} w_f^(m) ⊙ s'_m   (4)

wherein w_f^(m) ∈ R^Z is a Z-dimensional vector and s'_m denotes the action class prediction of the m-th selected convolutional layer.
2. The deep supervised convolutional neural network behavior recognition method based on training feature fusion as claimed in claim 1, wherein the method for performing uniform frame sampling in step 2 is as follows:
over the entire video span, T RGB frames [I_1, I_2, ..., I_T] are collected uniformly at a time interval of T_z/T, wherein T_z is the total duration of the video and I_t denotes the t-th collected video frame, which corresponds to the t-th time instant.
3. The deep supervised convolutional neural network behavior recognition method based on training feature fusion as claimed in claim 1, wherein the specific implementation method of step 5.1 is as follows:
step 5.1.1: via step 4, each of the T frames [I_1, I_2, ..., I_T] of the video V yields, at a selected convolutional layer, a feature map of spatial size N × N containing C channels, the feature maps being denoted [fm_1, fm_2, ..., fm_T];
the values of all channels at each spatial position on each feature map, t ∈ {1, ..., T}, are concatenated, thereby decomposing each feature map into a plurality of local spatial features;
for each frame, N × N C-dimensional local spatial features are obtained;
step 5.1.2: for the T frames [I_1, I_2, ..., I_T], the evolution information of each spatial position is modeled to generate the local evolution descriptors of the video V.
4. The deep supervised convolutional neural network behavior recognition method based on training feature fusion as claimed in claim 3, wherein the specific implementation method of step 5.1.2 is as follows:
step 5.1.2.1: for a specific spatial position, the local spatial features of the T frames are arranged in temporal order as [r_i1, r_i2, ..., r_it, ..., r_iT], where i ∈ {1, ..., N × N} and r_it ∈ R^C is the local spatial feature of the i-th spatial position at the t-th time instant, R^C being the C-dimensional real vector space, i.e. r_it is a vector in the C-dimensional real vector space;
step 5.1.2.2: modeling the evolution information of the i-th spatial position; defining a ranking function that computes a score for each time instant:

S(t, i | e) = e^T · d_it   (5)

wherein d_it = (1/t) Σ_{τ=1}^{t} r_iτ ∈ R^C is the average local spatial feature of the i-th spatial position over the first t time instants;
setting a constraint relationship: the score corresponding to a later time instant is greater than the score corresponding to an earlier time instant, i.e. ∀ q > t: S(q, i | e) > S(t, i | e); the parameter e reflects the temporal order of these local spatial features; learning the parameter e is considered a convex optimization problem:

E(e) = (λ/2) ‖e‖² + (2 / (T(T - 1))) Σ_{q>t} max{0, 1 - S(q, i | e) + S(t, i | e)}   (6)

the first term of the objective function E(e) is a general quadratic regularization term, and the second term is a hinge loss that softly counts ranking violations;
step 5.1.2.3: optimizing the objective function E(e) and mapping the series of local spatial features onto a vector e*; e* contains the ordering information of the local spatial features and is the local evolution descriptor; the solution of the above objective function is simplified as:

e*_i = Σ_{t=1}^{T} α_t · r_it   (7)

wherein α_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}) and H_t = Σ_{j=1}^{t} 1/j (with H_0 = 0) are the weights obtained by rank pooling; the solution is regarded as a weighted sum of the local spatial features of the i-th spatial position over the T collected time instants;
step 5.1.2.4: designing a local evolution rank pooling layer based on the approximate solution of the ranking function; the layer takes the N × N × C convolutional feature maps of the T frames as input and outputs N × N local evolution descriptor vectors of dimension C, [e_1, e_2, ..., e_{N×N}].
5. The deep supervised convolutional neural network behavior recognition method based on training feature fusion as claimed in claim 1, wherein the specific implementation method of step 5.2 is as follows:
step 5.2.1: using K meta-action words, dividing the feature space R^C into K cells, the anchor point of each cell being denoted a_k;
step 5.2.2: assigning each of the local evolution descriptors [e_1, e_2, ..., e_{N×N}] of the video V obtained in step 5.1 to one of the K cells divided in step 5.2.1, and recording the residual vector between the local evolution descriptor e_i and the anchor point a_k;
step 5.2.3: summing the residual vectors:

h_k = Σ_{i=1}^{N×N} ā_k(e_i) · (e_i - a_k)   (8)

in formula (8), ā_k(e_i) denotes the soft assignment of the descriptor e_i, and the anchor point a_k is in this formula a hyper-parameter adjusted by training; e_i - a_k denotes the residual between the local evolution descriptor and the k-th anchor point; the h_k obtained by the formula denotes the aggregation descriptor in the k-th cell;
step 5.2.4: obtaining the sums of residuals between the local evolution descriptors of the video and each anchor point, the video V being represented as v = [h_1, h_2, ..., h_K] ∈ R^{C×K}, wherein C is the dimension of the real space and K is the number of meta-action cells; v is a C × K matrix over the real space.
6. The deep supervised convolutional neural network behavior recognition method based on training feature fusion as claimed in claim 1, wherein the specific implementation method of step 7 is as follows:
step 7.1: defining:

W = (W^(1), W^(2), ..., W^(B), w_a^(1), ..., w_a^(M))   (9)

w_c = (w_c^(1), w_c^(2), ..., w_c^(M))   (10)

wherein B denotes the total number of convolutional layers; for b ∈ {1, ..., B}, W^(b) denotes the parameters of the b-th convolutional layer; M denotes the number of selected convolutional layers; for m ∈ {1, ..., M}, w_a^(m) denotes the weights of the feature aggregation operation on the m-th selected convolutional layer, and w_c^(m) denotes the weights of the classifier connected to the m-th selected convolutional layer;
step 7.2: defining a loss function that merges all output-layer classification errors:

ℓ(W, w_c) = Σ_{m=1}^{M} L(g, s_m)   (11)

wherein L denotes the video-level cross-entropy loss function of the action classification, defined as:

L(g, s_m) = - Σ_{i=1}^{Z} 1(g = A_i) · log P(g = A_i | s_m)   (12)

wherein g is the ground-truth label of the video V, g ∈ A, A = {A_1, ..., A_Z} defines all the action categories, the number of categories is Z, A_i denotes the i-th action category in the action set A, and s_m denotes the action class predicted by the m-th selected convolutional layer.
CN201811176393.6A 2018-10-10 2018-10-10 Deep supervision convolutional neural network behavior recognition method based on training feature fusion Active CN109446923B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811176393.6A CN109446923B (en) 2018-10-10 2018-10-10 Deep supervision convolutional neural network behavior recognition method based on training feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811176393.6A CN109446923B (en) 2018-10-10 2018-10-10 Deep supervision convolutional neural network behavior recognition method based on training feature fusion

Publications (2)

Publication Number Publication Date
CN109446923A CN109446923A (en) 2019-03-08
CN109446923B true CN109446923B (en) 2021-09-24

Family

ID=65546295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811176393.6A Active CN109446923B (en) 2018-10-10 2018-10-10 Deep supervision convolutional neural network behavior recognition method based on training feature fusion

Country Status (1)

Country Link
CN (1) CN109446923B (en)

Families Citing this family (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084151B (en) * 2019-04-10 2023-02-28 东南大学 Video abnormal behavior discrimination method based on non-local network deep learning
CN110097000A (en) * 2019-04-29 2019-08-06 东南大学 Video behavior recognition methods based on local feature Aggregation Descriptor and sequential relationship network
CN110188635B (en) * 2019-05-16 2021-04-30 南开大学 Plant disease and insect pest identification method based on attention mechanism and multi-level convolution characteristics
CN110119749A (en) * 2019-05-16 2019-08-13 北京小米智能科技有限公司 Identify method and apparatus, the storage medium of product image
CN110490035A (en) * 2019-05-17 2019-11-22 上海交通大学 Human skeleton action identification method, system and medium
CN110334589B (en) * 2019-05-23 2021-05-14 中国地质大学(武汉) High-time-sequence 3D neural network action identification method based on hole convolution
CN110135386B (en) * 2019-05-24 2021-09-03 长沙学院 Human body action recognition method and system based on deep learning
CN110390336B (en) * 2019-06-05 2023-05-23 广东工业大学 Method for improving feature point matching precision
CN110378208B (en) * 2019-06-11 2021-07-13 杭州电子科技大学 Behavior identification method based on deep residual error network
CN110334321B (en) * 2019-06-24 2023-03-31 天津城建大学 City rail transit station area function identification method based on interest point data
CN110457996B (en) * 2019-06-26 2023-05-02 广东外语外贸大学南国商学院 Video moving object tampering evidence obtaining method based on VGG-11 convolutional neural network
CN110348494A (en) * 2019-06-27 2019-10-18 中南大学 A kind of human motion recognition method based on binary channels residual error neural network
CN112241673B (en) * 2019-07-19 2022-11-22 浙江商汤科技开发有限公司 Video processing method and device, electronic equipment and storage medium
CN110633630B (en) * 2019-08-05 2022-02-01 中国科学院深圳先进技术研究院 Behavior identification method and device and terminal equipment
CN110533101A (en) * 2019-08-29 2019-12-03 西安宏规电子科技有限公司 A kind of image classification method based on deep neural network subspace coding
CN110765854B (en) * 2019-09-12 2022-12-02 昆明理工大学 Video motion recognition method
CN110826522A (en) * 2019-11-15 2020-02-21 广州大学 Method and system for monitoring abnormal human behavior, storage medium and monitoring equipment
CN111079674B (en) * 2019-12-22 2022-04-26 东北师范大学 Target detection method based on global and local information fusion
CN111103275B (en) * 2019-12-24 2021-06-01 电子科技大学 PAT prior information assisted dynamic FMT reconstruction method based on CNN and adaptive EKF
CN111242044B (en) * 2020-01-15 2022-06-28 东华大学 Night unmanned vehicle scene prediction method based on ConvLSTM dual-channel coding network
CN111325149B (en) * 2020-02-20 2023-05-26 中山大学 Video action recognition method based on time sequence association model of voting
CN111325155B (en) * 2020-02-21 2022-09-23 重庆邮电大学 Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy
CN111382403A (en) * 2020-03-17 2020-07-07 同盾控股有限公司 Training method, device, equipment and storage medium of user behavior recognition model
WO2021204143A1 (en) * 2020-04-08 2021-10-14 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Methods for action localization, electronic device and storage medium
CN111860432B (en) * 2020-07-30 2023-11-24 中国海洋大学 Ternary relation cooperation module and modeling method for video space-time characterization learning
CN112347963B (en) * 2020-11-16 2023-07-11 申龙电梯股份有限公司 Elevator door blocking behavior identification method
CN112541081B (en) * 2020-12-21 2022-09-16 中国人民解放军国防科技大学 Migratory rumor detection method based on field self-adaptation
CN112699786B (en) * 2020-12-29 2022-03-29 华南理工大学 Video behavior identification method and system based on space enhancement module
CN112668495B (en) * 2020-12-30 2024-02-02 东北大学 Full-time space convolution module-based violent video detection algorithm
CN112784698B (en) * 2020-12-31 2024-07-02 杭州电子科技大学 No-reference video quality evaluation method based on deep space-time information
CN112990013B (en) * 2021-03-15 2024-01-12 西安邮电大学 Time sequence behavior detection method based on dense boundary space-time network
CN113221693B (en) * 2021-04-29 2023-07-28 苏州大学 Action recognition method
CN113139530B (en) * 2021-06-21 2021-09-03 城云科技(中国)有限公司 Method and device for detecting sleep post behavior and electronic equipment thereof
CN113327299B (en) * 2021-07-07 2021-12-14 北京邮电大学 Neural network light field method based on joint sampling structure
CN114758304B (en) * 2022-06-13 2022-09-02 江苏中腾石英材料科技股份有限公司 High-purity rounded quartz powder sieving equipment and sieving control method thereof
CN117332352B (en) * 2023-10-12 2024-07-05 国网青海省电力公司海北供电公司 Lightning arrester signal defect identification method based on BAM-AlexNet


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105701507A (en) * 2016-01-13 2016-06-22 吉林大学 Image classification method based on dynamic random pooling convolution neural network
US9946933B2 (en) * 2016-08-18 2018-04-17 Xerox Corporation System and method for video classification using a hybrid unsupervised and supervised multi-layer architecture
CN107169415A (en) * 2017-04-13 2017-09-15 西安电子科技大学 Human motion recognition method based on convolutional neural networks feature coding
CN107066973A (en) * 2017-04-17 2017-08-18 杭州电子科技大学 A kind of video content description method of utilization spatio-temporal attention model
CN107506712A (en) * 2017-08-15 2017-12-22 成都考拉悠然科技有限公司 Method for distinguishing is known in a kind of human behavior based on 3D depth convolutional networks

Also Published As

Publication number Publication date
CN109446923A (en) 2019-03-08

Similar Documents

Publication Publication Date Title
CN109446923B (en) Deep supervision convolutional neural network behavior recognition method based on training feature fusion
Wang et al. Depth pooling based large-scale 3-d action recognition with convolutional neural networks
Asadi-Aghbolaghi et al. A survey on deep learning based approaches for action and gesture recognition in image sequences
CN110175580B (en) Video behavior identification method based on time sequence causal convolutional network
Zhu et al. Temporal cross-layer correlation mining for action recognition
Özyer et al. Human action recognition approaches with video datasets—A survey
CN111652903B (en) Pedestrian target tracking method based on convolution association network in automatic driving scene
Wang et al. Gan-knowledge distillation for one-stage object detection
Serpush et al. Complex human action recognition using a hierarchical feature reduction and deep learning-based method
CN108399435A (en) A kind of video classification methods based on sound feature
Balasubramanian et al. Analysis of facial emotion recognition
Bai et al. Correlative channel-aware fusion for multi-view time series classification
CN109446897B (en) Scene recognition method and device based on image context information
Ding et al. A lightweight action recognition method for unmanned-aerial-vehicle video
Xue et al. Crowd scene analysis encounters high density and scale variation
Serpush et al. Complex human action recognition in live videos using hybrid FR-DL method
Dey et al. Umpire’s Signal Recognition in Cricket Using an Attention based DC-GRU Network
CN117593794A (en) Improved YOLOv7-tiny model and human face detection method and system based on model
Mahjoub et al. A flexible high-level fusion for an accurate human action recognition system
Bux Vision-based human action recognition using machine learning techniques
Zhao et al. Research on human behavior recognition in video based on 3DCCA
Yang et al. Attentional fused temporal transformation network for video action recognition
Sudhakaran et al. Top-down attention recurrent VLAD encoding for action recognition in videos
Butt et al. Leveraging Transfer Learning for Spatio-Temporal Human Activity Recognition from Video Sequences.
Li et al. Deeply-Supervised CNN Model for Action Recognition with Trainable Feature Aggregation.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant