CN111079594B - Video action classification and identification method based on dual-stream cooperative network - Google Patents

Video action classification and identification method based on dual-stream cooperative network

Info

Publication number
CN111079594B
CN111079594B
Authority
CN
China
Prior art keywords
feature
time domain
video
time
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911228675.0A
Other languages
Chinese (zh)
Other versions
CN111079594A (en)
Inventor
徐行
张静然
沈复民
贾可
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Koala Youran Technology Co ltd
Original Assignee
Chengdu Koala Youran Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Koala Youran Technology Co ltd filed Critical Chengdu Koala Youran Technology Co ltd
Priority to CN201911228675.0A
Publication of CN111079594A
Application granted
Publication of CN111079594B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video action category identification method based on a dual-stream cooperative network. First, the heterogeneous spatial-domain features and temporal-domain features are made to interact: the heterogeneous temporal and spatial features are fused, complementary temporal and spatial components are extracted from the fused spatio-temporal features, and the complementary components are fused back into the original temporal features and spatial features, all of the temporal and spatial features fused with their complementary components forming the temporal sequence features and the spatial sequence features respectively. Sequence feature aggregation is then performed on the spatial sequence features and the temporal sequence features to obtain the aggregated spatial features and the aggregated temporal features. Finally, a classifier model is pre-trained for testing and classifying the video to be identified. The invention enables complementary information to flow between the different input modality streams, thereby achieving a more accurate action recognition effect.

Description

Video action classification and identification method based on dual-stream cooperative network
Technical Field
The invention belongs to the technical field of video action classification and identification, and particularly relates to a video action classification and identification method based on a dual-stream cooperative network.
Background
Due to the popularity of smart phones, public surveillance and portable cameras, short video data are growing rapidly because they are so easy to produce. Action recognition based on short videos has important academic value and can support business applications such as intelligent security and user recommendation. The two-stream network has long been the most widely adopted and most effective framework in the field of action recognition, but most two-stream action recognition solutions focus on how to design structures that fuse the features of the different streams, and the different stream networks are trained separately, so end-to-end inference cannot be achieved.
The goal of video action category identification is to identify the category of the action occurring in a video. Existing two-stream action category identification methods mainly comprise the following:
(1) Spatial-domain feature extraction stream: spatial features are extracted from the input RGB video frames by a convolutional network; existing methods use both 2D and 3D convolutional networks. This branch aims to extract appearance information in the video and provide a basis for later fusion;
(2) Temporal-domain feature extraction stream: temporal features are extracted from the pre-computed optical flow fields by a convolutional network; both 2D and 3D convolutional networks can serve as the backbone. This branch aims to extract motion information in the video and provide a basis for later fusion, as sketched below.
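For illustration only, the sketch below shows how such a two-stream front end is commonly wired in PyTorch. The backbone (a ResNet-18 from torchvision), the module names and the tensor shapes are assumptions made for the example and are not the configuration fixed by the patent.

```python
# Hypothetical two-stream feature extractors; backbone choice and tensor
# shapes are illustrative assumptions, not the patent's exact configuration.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class StreamBackbone(nn.Module):
    """Per-frame CNN encoder; only the number of input channels differs per stream."""
    def __init__(self, in_channels: int, feat_dim: int = 512):
        super().__init__()
        cnn = resnet18(weights=None)
        cnn.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2,
                              padding=3, bias=False)
        cnn.fc = nn.Identity()          # keep the 512-d pooled feature
        self.cnn = cnn
        self.feat_dim = feat_dim

    def forward(self, x):               # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        f = self.cnn(x.reshape(b * t, c, h, w))
        return f.reshape(b, t, self.feat_dim)   # sequence of frame features

rgb_stream = StreamBackbone(in_channels=3)      # spatial stream: RGB frames
flow_stream = StreamBackbone(in_channels=2)     # temporal stream: x/y optical flow
```

In a conventional two-stream pipeline these two feature sequences would only meet at a late fusion stage, which is precisely the limitation the invention addresses.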
Existing two-stream action category identification methods are mostly based on fusing features at the back end of the structure: the features of the two sub-streams must be extracted separately and the fusion scheme is then refined. This has the following drawbacks:
(1) Information representing the same pattern in the two heterogeneous input streams is processed separately; the complementary information between the streams is in fact not processed jointly at the front end of the network, so some key features that contribute to action recognition may be lost;
(2) End-to-end inference and learning cannot be performed: the two branches must be processed separately, and the mutual flow of information between the heterogeneous feature extraction streams cannot be guaranteed, so the discriminative power of the features is hard to maintain.
Disclosure of Invention
In view of the problems of the prior art described above, namely that some key features may be lost, that video frames and optical flow fields are processed separately, that information does not flow between the streams and that end-to-end processing is impossible, the invention provides a video action classification and identification method based on a dual-stream cooperative network. By constructing a connection unit that makes the heterogeneous spatial-domain features and temporal-domain features interact, the method realizes dual-stream information complementation and mutual information flow, and at the same time enables end-to-end inference and learning.
The invention comprises the following specific contents:
A video action classification and identification method based on a dual-stream cooperative network comprises: simultaneously extracting, through convolutional networks, the temporal sequence features X_o from the video optical flow field and the spatial sequence features X_f from the video frames; constructing a connection unit that lets the heterogeneous temporal sequence features X_o and spatial sequence features X_f interact; and then constructing a sharing unit that performs sequence feature aggregation on the interacted temporal sequence features x'_j^o and the interacted spatial sequence features x'_i^f respectively, obtaining the aggregated temporal features Z_o and the aggregated spatial features Z_f.
The information interaction specifically comprises the following steps:
step one: time domain sequence feature X extracted from video optical flow field o And spatial sequence feature X extracted from video frames f Fusing to obtain a heterogeneous correlation matrix Y of the time-space domain sequence characteristics;
step two: extracting complementary time domain sequence features according to the heterogeneous correlation matrix Y obtained in the step one
Figure GDA0004198209540000021
And complementary spatial sequence features->
Figure GDA0004198209540000022
And complementary time domain sequence features->
Figure GDA0004198209540000023
Fusion back to time domain sequence feature X o Generating a fused time domainSequence feature x' j o Complementary spatial sequence characteristics->
Figure GDA0004198209540000024
Fused return space domain sequence feature X f Generating the fused airspace sequence characteristic x' i f
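The exact forms of the correlation function and of the fusion step are given in the original only as formula images, so the following sketch is a hedged reading of steps one and two: it assumes a learned, softmax-normalized bilinear similarity for Y and a weighted residual fusion of the exchanged complementary features. All module and variable names are illustrative.

```python
import torch
import torch.nn as nn

class ConnectionUnit(nn.Module):
    """Sketch of the information-interaction (connection) unit.

    Assumptions: Y is a softmax-normalized bilinear similarity between the
    spatial sequence X_f and the temporal sequence X_o, and the complementary
    features are fused back with learnable scalar weights w_f, w_o.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.W_q = nn.Linear(dim, dim, bias=False)   # embeds x_i^f
        self.W_k = nn.Linear(dim, dim, bias=False)   # embeds x_j^o (the learned W_K of the text)
        self.w_f = nn.Parameter(torch.tensor(0.5))
        self.w_o = nn.Parameter(torch.tensor(0.5))

    def forward(self, X_f, X_o):                     # both: (B, T, d)
        # Step one: heterogeneous correlation matrix Y, shape (B, T, T)
        Y = torch.softmax(self.W_q(X_f) @ self.W_k(X_o).transpose(1, 2), dim=-1)
        # Step two: complementary components exchanged across the streams
        comp_f = Y @ X_o                              # temporal information for the spatial stream
        comp_o = Y.transpose(1, 2) @ X_f              # spatial information for the temporal stream
        X_f_fused = X_f + self.w_f * comp_f           # fused spatial sequence x'_i^f
        X_o_fused = X_o + self.w_o * comp_o           # fused temporal sequence x'_j^o
        return X_f_fused, X_o_fused
```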
To better implement the invention, further, the temporal features Z_o and the spatial features Z_f are regularized simultaneously and then fed into a shared weight layer, from which the temporal feature classification scores and the spatial feature classification scores are extracted; finally, the temporal feature classification scores and the spatial feature classification scores are fused into a predicted spatio-temporal feature classification score vector used for actual video action recognition. The predicted spatio-temporal feature classification score vector is divided into the correct spatio-temporal feature classification score vector and the incorrect spatio-temporal feature classification score vectors: the correct spatio-temporal feature classification score vector is the classification score vector of the true category of the features, while the incorrect spatio-temporal feature classification score vectors are the spatio-temporal feature classification score vectors of the other action categories extracted from the identified video during actual video recognition.
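A minimal sketch of the aggregation and shared-weight scoring described above. Mean pooling over time, L2 normalization as the regularization step and simple score averaging are assumptions, since the patent does not fix these operators here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedHead(nn.Module):
    """Aggregate each fused sequence, regularize it, and score it with shared weights."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(dim, num_classes)     # weights shared by both streams

    def forward(self, X_f_fused, X_o_fused):              # (B, T, d) each
        Z_f = F.normalize(X_f_fused.mean(dim=1), dim=-1)  # aggregated spatial feature
        Z_o = F.normalize(X_o_fused.mean(dim=1), dim=-1)  # aggregated temporal feature
        s_f = self.classifier(Z_f)                        # spatial classification scores
        s_o = self.classifier(Z_o)                        # temporal classification scores
        s_fused = 0.5 * (s_f + s_o)                       # predicted spatio-temporal scores
        return Z_f, Z_o, s_fused
```

At inference time, the index of the largest entry of s_fused gives the predicted action category.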
To better implement the invention, further, a pre-training sample set is selected for training, generating a classifier model that contains the correct spatio-temporal feature classification score of each action category; the cross-entropy loss function L_1 is obtained from the correct spatio-temporal feature classification scores.
To better implement the invention, further, the shared weight layer also uses the input temporal features Z_o and spatial features Z_f to construct the heterogeneous triplet pairs of the temporal features Z_o and the heterogeneous triplet pairs of the spatial features Z_f, and determines the heterogeneous triplet pair loss function L_2 from these heterogeneous triplet pairs.
To better implement the invention, further, the shared weight layer also computes the class centers of the temporal features Z_o and the class centers of the spatial features Z_f, and calculates the discriminative embedding constraint loss function L_3 from the obtained class centers of the temporal features Z_o and the class centers of the spatial features Z_f.
To better implement the invention, further, a combination of the cross-entropy loss function L_1, the heterogeneous triplet pair loss function L_2 and the discriminative embedding constraint loss function L_3 is adopted as the training loss function L.
To better implement the invention, further, in actual video recognition the generated predicted spatio-temporal feature classification score vectors are sorted in descending order and the predicted spatio-temporal feature classification score with the largest value is selected; this largest score is the correct spatio-temporal feature classification score of the identified video, and the category index corresponding to it is the category of the action.
To better implement the present invention, further:
The spatial features X_f are expressed as X_f = {x_1^f, x_2^f, …, x_T^f} and the temporal features X_o as X_o = {x_1^o, x_2^o, …, x_T^o}, where each x_i^f and x_j^o is a d-dimensional feature vector, d being the dimension of the features;
The heterogeneous correlation matrix Y obtained in step one is computed with a similarity function g_θ(·,·), each entry of Y measuring the similarity between a spatial sequence feature and a temporal sequence feature through learned embeddings, with W_K a mapping to be learned; Y is the heterogeneous correlation matrix of the spatio-temporal features, a matrix whose numbers of rows and columns are both equal to the number of video samples;
the fused time domain sequence characteristic x 'obtained in the step two' j o And the fused airspace sequence characteristic x' i f The specific expression of (2) is:
Figure GDA00041982095400000311
in the method, in the process of the invention,
Figure GDA00041982095400000312
and->
Figure GDA00041982095400000313
The interactive function for separating complementary features of the space domain and the time domain is expressed as follows
Figure GDA00041982095400000314
And->
Figure GDA00041982095400000315
w f ,w o To learn parameters, fused time domain sequence features x' i f The method comprises the steps of carrying out a first treatment on the surface of the Fused airspace sequence characteristic x' j o The expressions are X 'respectively' f ={x′ 1 f ,x′ 2 f,…,x′ T f }、X′ o ={x′ 1 o ,x′ 2 o ,…,x′ T o };
The cross-entropy loss function L_1 is expressed as
L_1 = − Σ_i log( exp(s_i^{l_i}) / Σ_j exp(s_i^j) ),
where L_1 denotes the cross-entropy loss value, l_i is the true category of the i-th sample, s_i^{l_i} denotes the correct spatio-temporal feature classification score of that true category output for the i-th sample, and s_i^j denotes the spatio-temporal feature classification score output for the i-th sample on category j;
The heterogeneous triplet pairs of the spatial features and the heterogeneous triplet pairs of the temporal features each consist of an anchor point, a positive point and a negative point drawn across the two streams, where the subscripts a, p, n denote the anchor point, the positive point and the negative point respectively, and i and j denote the action category indices of the sample pair;
The heterogeneous triplet pair loss function is specifically
L_2 = Σ [ ‖z_a − z_p‖_2^2 − ‖z_a − z_n‖_2^2 + α_1 ]_+ ,
summed over the heterogeneous triplet pairs of both streams, where L_2 denotes the loss value of the triplet pairs, ‖·‖_2 denotes the 2-norm distance measure, [x]_+ = x when x > 0 and [x]_+ = 0 when x ≤ 0, and α_1 is a threshold;
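Assuming the triplet pairs are formed across the two streams as described (an anchor in one stream, a positive point of the same class and a negative point of a different class in the other stream), a hedged sketch of L_2 is:

```python
import torch
import torch.nn.functional as F

def hetero_triplet_loss(z_a, z_p, z_n, alpha1: float = 0.3):
    """[ ||z_a - z_p||^2 - ||z_a - z_n||^2 + alpha1 ]_+ averaged over the batch.

    z_a comes from one stream, z_p / z_n from the other stream; alpha1 is the
    threshold. The margin value 0.3 is an assumption, not taken from the patent.
    """
    d_pos = (z_a - z_p).pow(2).sum(dim=-1)
    d_neg = (z_a - z_n).pow(2).sum(dim=-1)
    return F.relu(d_pos - d_neg + alpha1).mean()

# L_2 combines both directions: spatial anchors against temporal pairs, and vice versa.
def loss_L2(Zf_a, Zo_p, Zo_n, Zo_a, Zf_p, Zf_n, alpha1: float = 0.3):
    return hetero_triplet_loss(Zf_a, Zo_p, Zo_n, alpha1) + \
           hetero_triplet_loss(Zo_a, Zf_p, Zf_n, alpha1)
```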
The class center of the spatial features for each category and the class center of the temporal features for each category are obtained from the features of the samples belonging to that category, selected by the indicator function, where C = {c_1, c_2, …, c_s} is the set of class labels, l_i is the label of the i-th sample, and 1(·) is the indicator function;
The discriminative embedding constraint loss function L_3 is computed from these class centers, where L_3 denotes the discriminative embedding loss value and α_2, α_3 are thresholds;
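The concrete expression of L_3 is available in the original only as a formula image; the sketch below therefore shows one plausible reading, in which per-class centers are computed with the indicator function, same-class centers of the two streams are pulled to within α_2 of each other and centers of different classes are pushed beyond α_3. It should be read as an assumption, not as the patent's exact formula.

```python
import torch
import torch.nn.functional as F

def class_centers(Z, labels, num_classes):
    """c_k = mean of the features whose label equals k (indicator-weighted mean)."""
    one_hot = F.one_hot(labels, num_classes).float()          # (B, K)
    counts = one_hot.sum(dim=0).clamp(min=1).unsqueeze(1)     # (K, 1)
    return (one_hot.t() @ Z) / counts                         # (K, d)

def loss_L3(Z_f, Z_o, labels, num_classes, alpha2=0.1, alpha3=0.5):
    """Assumed form: align same-class centers across streams (within alpha2),
    separate centers of different classes (beyond alpha3)."""
    c_f = class_centers(Z_f, labels, num_classes)
    c_o = class_centers(Z_o, labels, num_classes)
    align = F.relu((c_f - c_o).pow(2).sum(dim=-1) - alpha2).mean()
    c_all = 0.5 * (c_f + c_o)
    d = torch.cdist(c_all, c_all, p=2).pow(2)                 # pairwise center distances
    off_diag = ~torch.eye(num_classes, dtype=torch.bool, device=Z_f.device)
    separate = F.relu(alpha3 - d[off_diag]).mean()
    return align + separate
```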
The expression of the loss function L is:
L = λ_1·L_1 + λ_2·L_2 + λ_3·L_3.
Compared with the prior art, the invention has the following advantages:
(1) The two heterogeneous input streams are processed simultaneously and their complementary information is processed cooperatively, avoiding the loss of key features that are helpful for action recognition;
(2) Because the dual-stream information is processed at the same time, end-to-end inference and learning are realized, and the mutual flow of information between the heterogeneous dual-stream feature extraction streams is ensured.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a schematic diagram of a dual stream feature process framework;
FIG. 3 is a schematic diagram of a connection network for different modality flow branches;
FIG. 4 is a schematic representation of the effect of the present invention in comparison with the prior art.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, they are described below clearly and completely with reference to the accompanying drawings. It should be understood that the described embodiments are only some, not all, embodiments of the present invention and should therefore not be considered as limiting the scope of protection. All other embodiments obtained by a person of ordinary skill in the art without creative effort, based on the embodiments of the present invention, fall within the protection scope of the present invention.
Example 1:
A video action classification and identification method based on a dual-stream cooperative network is disclosed. With reference to FIG. 1, FIG. 2 and FIG. 3, spatial sequence features are first extracted from the video frames and temporal sequence features from the video optical flow field, simultaneously, through convolutional networks; the spatial sequence features are expressed as X_f = {x_1^f, x_2^f, …, x_T^f} and the temporal sequence features as X_o = {x_1^o, x_2^o, …, x_T^o}, where each x_i^f and x_j^o is a d-dimensional feature vector and d is the dimension of the features. A connection unit is constructed so that the heterogeneous spatial sequence features and temporal sequence features interact; then a sharing unit is constructed to perform sequence feature aggregation on the fused spatial sequence features and the fused temporal sequence features respectively, obtaining the aggregated spatial features Z_f and the aggregated temporal features Z_o.
The information interaction specifically comprises the following steps:
step one: the method comprises the steps of fusing the airspace sequence features extracted from video frames with the time domain sequence features extracted from video optical flow fields, wherein the specific formula is as follows:
Figure GDA0004198209540000054
/>
in the formula g θ () For measuring the similarity of variables, the function is expressed as
Figure GDA0004198209540000055
Wherein->
Figure GDA0004198209540000056
But->
Figure GDA0004198209540000057
W K Is a function to be learned; y is a heterogeneous correlation matrix of time-space domain characteristics, and the size of the heterogeneous correlation matrix is a matrix with the number of rows and columns equal to the number of video samples;
step two: according to the heterogeneous correlation matrix Y obtained in the first step, complementary time domain sequence features and complementary space domain sequence features are separated from the fused space domain sequence features and time domain sequence features, and the separated complementary time domain sequence features and the complementary space domain sequence features are respectively fused back to the space domain sequence features
Figure GDA0004198209540000058
And time domain sequence feature->
Figure GDA0004198209540000061
Obtaining the fused time domain sequenceThe specific formulas of the column characteristics and the fused airspace sequence characteristics are as follows:
Figure GDA0004198209540000062
in the method, in the process of the invention,
Figure GDA0004198209540000063
and->
Figure GDA0004198209540000064
Interaction functions separating complementary features for spatial and temporal domains, respectively +.>
Figure GDA0004198209540000065
And->
Figure GDA0004198209540000066
w f ,w o For the parameters to be learned' i f The time domain sequence characteristics are fused; x's' j o The spatial sequence characteristics after fusion; the expression of the spatial domain sequence characteristics after fusion is +.>
Figure GDA0004198209540000067
The fused time domain sequence features have the expression +.>
Figure GDA0004198209540000068
Working principle: in the prior art, the features in the video frames and in the optical flow field are extracted separately and only then fused, so some complementary information is easily lost. Here, the separated complementary features are added into the originally extracted spatial sequence features and temporal sequence features, so that the originally extracted temporal and spatial sequence features contain the complementary information, and the temporal and spatial sequence features carrying complementary information are then sent on for further processing. In practice, since a video segment usually contains a large number of frames, using all of them as input for the subsequent operations would incur a huge computational cost, and much of the information is similar and therefore redundant; the video is therefore sampled before the features are extracted. The video is sampled in a globally sparse manner to obtain M RGB frames; since the optical flow field images have an x and a y direction, there are 2M optical flow images in total. Convolutional networks are used for feature extraction: the sampled video frames and the optical flow field images are processed by Inception and Inception-v3 respectively (a sketch of the sampling follows), and the extracted features are then fed into the connection unit for processing.
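A small sketch of the global sparse sampling described above. The segment boundaries and the random offset inside each segment follow the usual sparse-sampling practice and are assumptions here; the patent only fixes that M frames are drawn globally and sparsely.

```python
import random

def global_sparse_sample(num_frames: int, m: int, training: bool = True):
    """Return M frame indices, one per equal-length segment of the video."""
    seg_len = num_frames / m
    indices = []
    for k in range(m):
        start = int(k * seg_len)
        end = max(int((k + 1) * seg_len) - 1, start)
        # random offset while training, segment centre at test time
        idx = random.randint(start, end) if training else (start + end) // 2
        indices.append(idx)
    return indices

# Example: a 300-frame clip sampled at M = 8 yields 8 RGB frames and, with the
# x- and y-direction flow images at the same indices, 2M = 16 flow inputs.
print(global_sparse_sample(300, 8, training=False))
```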
Example 2:
On the basis of the above embodiment 1, in order to better implement the present invention, as shown in FIG. 1 and FIG. 2, a sharing unit is further constructed to perform sequence feature aggregation on the fused spatial sequence features and the fused temporal sequence features respectively; the fused spatial sequence features are aggregated into the spatial features Z_f and the fused temporal sequence features into the temporal features Z_o. The temporal features Z_o and the spatial features Z_f are regularized simultaneously and then fed into a shared weight layer, from which the temporal feature classification scores and the spatial feature classification scores are extracted; finally, the temporal feature classification scores and the spatial feature classification scores are fused into a predicted spatio-temporal feature classification score vector used for actual video action recognition. The predicted spatio-temporal feature classification score vector is divided into the correct spatio-temporal feature classification score vector and the incorrect spatio-temporal feature classification score vectors: the correct spatio-temporal feature classification score vector is the classification score vector of the true category of the features, while the incorrect spatio-temporal feature classification score vectors are the spatio-temporal feature classification score vectors of the other action categories extracted from the identified video during actual video recognition.
Other portions of this embodiment are the same as those of embodiment 1, and thus will not be described in detail.
Example 3:
On the basis of any one of the above embodiments 1-2, in order to better implement the present invention, further, a sample set is selected for training, generating a classifier model that contains the correct spatio-temporal feature classification score of each action category; a combination of the cross-entropy loss function, the heterogeneous triplet pair loss function and the discriminative embedding constraint loss function is adopted as the training loss function.
Working principle: a sample set is selected for pre-training to train a classifier model, and the cross-entropy loss function, the heterogeneous triplet pair loss function and the discriminative embedding constraint loss function are introduced jointly as the training loss function, so that the classifier model obtained by pre-training is more reliable and the classes are more tightly aggregated.
Other portions of this embodiment are the same as any of embodiments 1-2 described above, and thus will not be described again.
Example 4:
On the basis of any one of the above embodiments 1 to 3, in order to better implement the present invention, further, the shared weight layer also uses the input temporal features Z_o and spatial features Z_f to construct the heterogeneous triplet pairs of the spatial features and the heterogeneous triplet pairs of the temporal features; each heterogeneous triplet pair consists of an anchor point, a positive point and a negative point drawn across the two streams, where the subscripts a, p, n denote the anchor point, the positive point and the negative point respectively, and i and j denote the action category indices of the sample pair. The heterogeneous triplet pair loss function is specifically
L_2 = Σ [ ‖z_a − z_p‖_2^2 − ‖z_a − z_n‖_2^2 + α_1 ]_+ ,
summed over the heterogeneous triplet pairs of both streams, where L_2 denotes the loss value of the triplet pairs, ‖·‖_2 denotes the 2-norm distance measure, [x]_+ = x when x > 0 and [x]_+ = 0 when x ≤ 0, and α_1 is a threshold;
Meanwhile, the class centers of the spatial features and the class centers of the temporal features are also obtained: the class center of the spatial features for each category and the class center of the temporal features for each category are computed from the features of the samples belonging to that category, selected by the indicator function, where C = {c_1, c_2, …, c_s} is the set of class labels, l_i is the label of the i-th sample, and 1(·) is the indicator function. The discriminative embedding constraint loss function L_3 is computed from these class centers, where L_3 denotes the discriminative embedding loss value and α_2, α_3 are thresholds;
The cross-entropy loss function is expressed as
L_1 = − Σ_i log( exp(s_i^{l_i}) / Σ_j exp(s_i^j) ),
where L_1 denotes the cross-entropy loss value, l_i is the true category of the i-th sample, s_i^{l_i} denotes the correct spatio-temporal feature classification score of that true category output for the i-th sample, and s_i^j denotes the spatio-temporal feature classification score output for the i-th sample on category j. With this loss function, the features of the true classification categories can be aggregated more prominently;
The loss function for training the whole network is expressed as
L = λ_1·L_1 + λ_2·L_2 + λ_3·L_3,
and empirically the weights can be set so that L = L_1 + 0.5·L_2 + 0.5·L_3.
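Putting the pieces together with the empirical weights quoted above (λ_1 = 1, λ_2 = λ_3 = 0.5). The helper losses referenced here are the hedged sketches given earlier in this document, not the patent's exact formulas.

```python
# Hypothetical training objective combining the three losses with the empirical weights.
import torch.nn.functional as F

def total_loss(s_fused, labels, Z_f, Z_o, triplets, num_classes,
               lam1=1.0, lam2=0.5, lam3=0.5):
    # triplets is a 6-tuple (Zf_a, Zo_p, Zo_n, Zo_a, Zf_p, Zf_n) of embeddings
    L1 = F.cross_entropy(s_fused, labels)            # correct-class score loss
    L2 = loss_L2(*triplets)                          # heterogeneous triplet pair loss (sketch above)
    L3 = loss_L3(Z_f, Z_o, labels, num_classes)      # discriminative embedding loss (sketch above)
    return lam1 * L1 + lam2 * L2 + lam3 * L3
```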
Other portions of this embodiment are the same as any of embodiments 1 to 3 described above, and thus will not be described again.
Example 5:
On the basis of any one of the above embodiments 1 to 4, in order to better implement the present invention, further, in the actual video action recognition process the generated predicted spatio-temporal feature classification score vectors are sorted in descending order and the predicted spatio-temporal feature classification score with the largest value is selected; this largest score is the correct spatio-temporal feature classification score of the recognized video, and the category index corresponding to it is the category of the action. The invention uses the top-k index to evaluate the model: top-k refers to the proportion of video sequences whose correct label appears among the top k results of the classification scores returned by the model, and is the most commonly used classification evaluation metric. In this example k is set to 1 (a sketch of the evaluation follows). The invention was tested on the large-scale video behavior classification datasets UCF-101 and HMDB-51. The UCF-101 dataset comprises 101 action categories and 13,320 samples; 70% of the samples are used as the training set and the rest as the validation set. The HMDB-51 dataset comprises 51 action categories and 6,849 samples in total; 70% of the samples are used as the training set and the rest as the validation set. As shown in FIG. 4, the fused recognition performance after information interaction is superior to that of the existing methods on all validation sets: on the UCF-101 dataset the final recognition performance is 0.4% higher than the previous best method, and on HMDB-51 it is 3.2% higher. The method outperforms the existing methods under all evaluation settings and improves the recognition accuracy of video behavior classification.
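For completeness, a hedged sketch of the top-k evaluation described above with k = 1: the predicted category is the index of the largest fused score, and top-1 accuracy is the fraction of validation clips whose predicted index matches the label.

```python
import torch

def topk_accuracy(scores: torch.Tensor, labels: torch.Tensor, k: int = 1) -> float:
    """scores: (N, num_classes) fused classification scores; labels: (N,)."""
    topk = scores.topk(k, dim=1).indices                 # indices of the k largest scores
    correct = (topk == labels.unsqueeze(1)).any(dim=1)   # true label appears among the top k
    return correct.float().mean().item()

# Top-1: sort each score vector in descending order and take the first index,
# e.g. topk_accuracy(model_scores, val_labels, k=1)
```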
Other portions of this embodiment are the same as any of embodiments 1 to 4 described above, and thus will not be described again.
The foregoing description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and any simple modification, equivalent variation, etc. of the above embodiment according to the technical matter of the present invention fall within the scope of the present invention.

Claims (5)

1. A video action classification and identification method based on a dual-stream cooperative network, characterized in that: first, the temporal sequence features X_o are extracted from the video optical flow field and the spatial sequence features X_f are extracted from the video frames simultaneously through convolutional networks; a connection unit is constructed to let the heterogeneous temporal sequence features X_o and spatial sequence features X_f interact; a sharing unit is then constructed to perform sequence feature aggregation on the interacted temporal sequence features x'_j^o and the interacted spatial sequence features x'_i^f respectively, obtaining the aggregated temporal features Z_o and the aggregated spatial features Z_f;
the information interaction specifically comprises the following steps:
step one: fusing the temporal sequence features X_o extracted from the video optical flow field with the spatial sequence features X_f extracted from the video frames to obtain the heterogeneous correlation matrix Y of the spatio-temporal sequence features;
step two: according to the heterogeneous correlation matrix Y obtained in step one, extracting the complementary temporal sequence features and the complementary spatial sequence features, fusing the complementary temporal sequence features back into the temporal sequence features X_o to generate the fused temporal sequence features x'_j^o, and fusing the complementary spatial sequence features back into the spatial sequence features X_f to generate the fused spatial sequence features x'_i^f;
the temporal features Z_o and the spatial features Z_f are regularized simultaneously and then fed into a shared weight layer, from which the temporal feature classification scores and the spatial feature classification scores are extracted; finally, the temporal feature classification scores and the spatial feature classification scores are fused into a predicted spatio-temporal feature classification score vector used for actual video action recognition; the predicted spatio-temporal feature classification score vector is divided into the correct spatio-temporal feature classification score vector and the incorrect spatio-temporal feature classification score vectors; the correct spatio-temporal feature classification score vector is the classification score vector of the true category of the features; the incorrect spatio-temporal feature classification score vectors are the spatio-temporal feature classification score vectors of the other action categories extracted from the identified video during actual video recognition;
a pre-training sample set is selected for training, generating a classifier model that contains the correct spatio-temporal feature classification score of each action category; the cross-entropy loss function L_1 is obtained from the correct spatio-temporal feature classification scores;
the shared weight layer also uses the input temporal features Z_o and spatial features Z_f to construct the heterogeneous triplet pairs of the temporal features Z_o and the heterogeneous triplet pairs of the spatial features Z_f, and determines the heterogeneous triplet pair loss function L_2 from these heterogeneous triplet pairs.
2. The video action classification and identification method based on a dual-stream cooperative network according to claim 1, characterized in that the shared weight layer also obtains the class centers of the temporal features Z_o and the class centers of the spatial features Z_f, and calculates the discriminative embedding constraint loss function L_3 from the obtained class centers of the temporal features Z_o and the class centers of the spatial features Z_f.
3. The video action classification and identification method based on a dual-stream cooperative network according to claim 2, characterized in that a combination of the cross-entropy loss function L_1, the heterogeneous triplet pair loss function L_2 and the discriminative embedding constraint loss function L_3 is adopted as the training loss function L.
4. The video action classification and identification method based on a dual-stream cooperative network according to claim 3, characterized in that in actual video recognition the generated predicted spatio-temporal feature classification score vectors are sorted in descending order and the predicted spatio-temporal feature classification score with the largest value is selected; this largest score is the correct spatio-temporal feature classification score of the recognized video, and the category index corresponding to it is the category of the action.
5. The video action classification and identification method based on a dual-stream cooperative network according to claim 4, characterized in that:
the spatial features X_f are expressed as X_f = {x_1^f, x_2^f, …, x_T^f} and the temporal features X_o as X_o = {x_1^o, x_2^o, …, x_T^o}, where each x_i^f and x_j^o is a d-dimensional feature vector, d being the dimension of the features;
the heterogeneous correlation matrix Y obtained in step one is computed with a similarity function g_θ(·,·), each entry of Y measuring the similarity between a spatial sequence feature and a temporal sequence feature through learned embeddings, with W_K a mapping to be learned; Y is the heterogeneous correlation matrix of the spatio-temporal features, a matrix whose numbers of rows and columns are both equal to the number of video samples;
the fused temporal sequence features x'_j^o and the fused spatial sequence features x'_i^f obtained in step two are generated by interaction functions that separate the complementary spatial-domain and temporal-domain components from Y and fuse them back into the original features, with w_f and w_o the parameters to be learned; the fused spatial sequence features are written X'_f = {x'_1^f, x'_2^f, …, x'_T^f} and the fused temporal sequence features X'_o = {x'_1^o, x'_2^o, …, x'_T^o};
The cross entropy loss function L1 is expressed as follows:
Figure FDA00041982095300000220
wherein L is 1 Represents the cross-entropy loss value,
Figure FDA0004198209530000031
a correct spatiotemporal feature class score representing class after the i-th sample output,/th sample output>
Figure FDA0004198209530000032
Representing the correct space-time feature classification score of the ith sample when the ith sample is output to class j;
the heterogeneous triplet pairs of the spatial features and the heterogeneous triplet pairs of the temporal features each consist of an anchor point, a positive point and a negative point drawn across the two streams, where the subscripts a, p, n denote the anchor point, the positive point and the negative point respectively, and i and j denote the action category indices of the sample pair; the heterogeneous triplet pair loss function is specifically
L_2 = Σ [ ‖z_a − z_p‖_2^2 − ‖z_a − z_n‖_2^2 + α_1 ]_+ ,
summed over the heterogeneous triplet pairs of both streams, where L_2 denotes the loss value of the triplet pairs, ‖·‖_2 denotes the 2-norm distance measure, [x]_+ = x when x > 0 and [x]_+ = 0 when x ≤ 0, and α_1 is a threshold;
the class center of the spatial features for each category and the class center of the temporal features for each category are computed from the features of the samples belonging to that category, selected by the indicator function, where C = {c_1, c_2, …, c_s} is the set of class labels, l_i is the label of the i-th sample, and 1(·) is the indicator function; the discriminative embedding constraint loss function L_3 is computed from these class centers, where L_3 denotes the discriminative embedding loss value and α_2, α_3 are thresholds;
the expression of the loss function L is:
L = λ_1·L_1 + λ_2·L_2 + λ_3·L_3.
CN201911228675.0A 2019-12-04 2019-12-04 Video action classification and identification method based on double-flow cooperative network Active CN111079594B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911228675.0A CN111079594B (en) 2019-12-04 2019-12-04 Video action classification and identification method based on double-flow cooperative network

Publications (2)

Publication Number Publication Date
CN111079594A CN111079594A (en) 2020-04-28
CN111079594B true CN111079594B (en) 2023-06-06

Family

ID=70312816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911228675.0A Active CN111079594B (en) 2019-12-04 2019-12-04 Video action classification and identification method based on double-flow cooperative network

Country Status (1)

Country Link
CN (1) CN111079594B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259874B (en) * 2020-05-06 2020-07-28 成都派沃智通科技有限公司 Campus security video monitoring method based on deep learning
CN111312367A (en) * 2020-05-11 2020-06-19 成都派沃智通科技有限公司 Campus personnel abnormal psychological prediction method based on self-adaptive cloud management platform
CN112446348B (en) * 2020-12-08 2022-05-31 电子科技大学 Behavior identification method based on characteristic spectrum flow
CN113343786B (en) * 2021-05-20 2022-05-17 武汉大学 Lightweight video action recognition method and system based on deep learning
CN113255570B (en) * 2021-06-15 2021-09-24 成都考拉悠然科技有限公司 Sequential action detection method for sensing video clip relation
CN114943286B (en) * 2022-05-20 2023-04-07 电子科技大学 Unknown target discrimination method based on fusion of time domain features and space domain features
CN115393660B (en) * 2022-10-28 2023-02-24 松立控股集团股份有限公司 Parking lot fire detection method based on weak supervision collaborative sparse relationship ranking mechanism


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6159489B2 (en) * 2014-04-11 2017-07-05 ペキン センスタイム テクノロジー ディベロップメント カンパニー リミテッド Face authentication method and system

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104023226A (en) * 2014-05-28 2014-09-03 北京邮电大学 HVS-based novel video quality evaluation method
CN107506712A (en) * 2017-08-15 2017-12-22 成都考拉悠然科技有限公司 Method for distinguishing is known in a kind of human behavior based on 3D depth convolutional networks
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
CN110163052A (en) * 2018-08-01 2019-08-23 腾讯科技(深圳)有限公司 Video actions recognition methods, device and machinery equipment
CN109558781A (en) * 2018-08-02 2019-04-02 北京市商汤科技开发有限公司 A kind of multi-angle video recognition methods and device, equipment and storage medium
CN109784269A (en) * 2019-01-11 2019-05-21 中国石油大学(华东) One kind is based on the united human action detection of space-time and localization method
CN109858407A (en) * 2019-01-17 2019-06-07 西北大学 A kind of video behavior recognition methods based on much information stream feature and asynchronous fusion
CN110070041A (en) * 2019-04-23 2019-07-30 江西理工大学 A kind of video actions recognition methods of time-space compression excitation residual error multiplication network
CN110135369A (en) * 2019-05-20 2019-08-16 威创集团股份有限公司 A kind of Activity recognition method, system, equipment and computer readable storage medium
CN110334746A (en) * 2019-06-12 2019-10-15 腾讯科技(深圳)有限公司 A kind of image detecting method and device
CN110390308A (en) * 2019-07-26 2019-10-29 华侨大学 It is a kind of to fight the video behavior recognition methods for generating network based on space-time

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Spatiotemporal residual networks for video action recognition; Christoph R. et al.; Advances in Neural Information Processing Systems; 2016-12-05; pp. 3468-3476 *
Human action recognition based on spatio-temporal two-stream convolution and LSTM; Mao Zhiqiang et al.; Software; 2018-09-30; Vol. 38, No. 09; pp. 9-12 *

Also Published As

Publication number Publication date
CN111079594A (en) 2020-04-28

Similar Documents

Publication Publication Date Title
CN111079594B (en) Video action classification and identification method based on double-flow cooperative network
US10152644B2 (en) Progressive vehicle searching method and device
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN110059465B (en) Identity verification method, device and equipment
CN111581405A (en) Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning
CN111368815A (en) Pedestrian re-identification method based on multi-component self-attention mechanism
Chen et al. Multi-label image recognition with joint class-aware map disentangling and label correlation embedding
CN115033670A (en) Cross-modal image-text retrieval method with multi-granularity feature fusion
Gao et al. The labeled multiple canonical correlation analysis for information fusion
CN109711422A (en) Image real time transfer, the method for building up of model, device, computer equipment and storage medium
Beikmohammadi et al. SWP-LeafNET: A novel multistage approach for plant leaf identification based on deep CNN
CN110827265B (en) Image anomaly detection method based on deep learning
Wang et al. Spatial–temporal pooling for action recognition in videos
CN108960142B (en) Pedestrian re-identification method based on global feature loss function
Li et al. One-class knowledge distillation for face presentation attack detection
CN113239159A (en) Cross-modal retrieval method of videos and texts based on relational inference network
Li et al. Image manipulation localization using attentional cross-domain CNN features
Fang et al. Traffic police gesture recognition by pose graph convolutional networks
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
An Pedestrian Re‐Recognition Algorithm Based on Optimization Deep Learning‐Sequence Memory Model
Chen et al. SSR-HEF: crowd counting with multiscale semantic refining and hard example focusing
Cai et al. Glitch in the matrix: A large scale benchmark for content driven audio–visual forgery detection and localization
Ma et al. Cascade transformer decoder based occluded pedestrian detection with dynamic deformable convolution and Gaussian projection channel attention mechanism
Dong et al. A supervised dictionary learning and discriminative weighting model for action recognition
WO2023185074A1 (en) Group behavior recognition method based on complementary spatio-temporal information modeling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant