CN111079594A - Video action classification and identification method based on a dual-stream cooperative network - Google Patents

Video action classification and identification method based on a dual-stream cooperative network

Info

Publication number
CN111079594A
CN111079594A
Authority
CN
China
Prior art keywords
feature
time domain
video
time
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911228675.0A
Other languages
Chinese (zh)
Other versions
CN111079594B (en)
Inventor
徐行
张静然
沈复民
贾可
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Koala Youran Technology Co ltd
Original Assignee
Chengdu Koala Youran Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Koala Youran Technology Co ltd filed Critical Chengdu Koala Youran Technology Co ltd
Priority to CN201911228675.0A priority Critical patent/CN111079594B/en
Publication of CN111079594A publication Critical patent/CN111079594A/en
Application granted granted Critical
Publication of CN111079594B publication Critical patent/CN111079594B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video action category identification method based on a dual-stream cooperative network. First, the heterogeneous spatial-domain features and time-domain features carry out information interaction: the interaction fuses the heterogeneous time-domain and spatial-domain features, extracts the complementary parts of the time domain and the spatial domain from the fused time-space-domain features, and fuses these complementary parts back into the originally extracted time-domain features and spatial-domain features; the time-domain and spatial-domain features enriched with the complementary parts respectively form the time-domain sequence features and the spatial-domain sequence features. Then, sequence feature aggregation is carried out on the spatial-domain sequence features and the time-domain sequence features to obtain the aggregated spatial-domain features and aggregated time-domain features. Finally, a classifier model is pre-trained for testing and classifying the video to be recognized. The invention realizes complementation between the information of the different input modality streams, thereby achieving a more accurate action recognition effect.

Description

Video action classification and identification method based on a dual-stream cooperative network
Technical Field
The invention belongs to the technical field of video action classification and identification, and particularly relates to a video action classification and identification method based on a dual-stream cooperative network.
Background
Short video data is growing rapidly because it is so easy to capture, thanks to the popularity of smartphones, public surveillance, portable cameras and the like. Action recognition on short videos has important academic value and can support business applications such as intelligent security and user recommendation. The two-stream network has long been the most widely adopted and best-performing framework in the action recognition field, but most existing two-stream action recognition solutions focus on how to design a structure that fuses the features of the different streams, while the individual stream networks are trained independently, so end-to-end inference cannot be realized.
Video action category identification aims at identifying the category of the action occurring in a video. Existing two-stream action category identification methods mainly comprise the following two streams (a minimal sketch of this baseline layout is given after the list):
(1) spatial-domain feature extraction stream: spatial-domain features are extracted from the input RGB video frames by a convolutional network; existing methods use 2D and 3D convolutional networks, and this branch aims to extract the appearance information in the video and provide a basis for later fusion;
(2) time-domain feature extraction stream: time-domain features are extracted by a convolutional network from the pre-extracted optical flow field given as input; 2D and 3D convolutional networks can likewise be used as the base network, and this branch aims to extract the motion information in the video and provide a basis for later fusion.
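The baseline layout above can be pictured with a minimal sketch. The backbone architecture, channel counts and feature dimension below are illustrative assumptions, not the exact networks used in the patent; the description only states that 2D or 3D convolutional networks are used for each stream.

```python
# Minimal sketch of the conventional two-stream layout described above.
# The toy backbone, channel counts and feature dimension d are assumptions.
import torch
import torch.nn as nn

def tiny_backbone(in_channels, d=256):
    """A toy 2D CNN standing in for the convolutional backbone of one stream."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3),
        nn.ReLU(inplace=True),
        nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
        nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(128, d),
    )

spatial_stream  = tiny_backbone(in_channels=3)       # one RGB frame
temporal_stream = tiny_backbone(in_channels=2 * 5)   # 5 stacked optical-flow fields (x and y)

rgb_frame  = torch.randn(8, 3, 224, 224)             # batch of sampled frames
flow_stack = torch.randn(8, 10, 224, 224)            # matching flow stacks
x_f = spatial_stream(rgb_frame)                      # spatial-domain features X_f
x_o = temporal_stream(flow_stack)                    # time-domain features X_o
print(x_f.shape, x_o.shape)                          # torch.Size([8, 256]) twice
```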
Most existing two-stream video action category identification methods fuse the features only at the back end of the structure: the features of the two branch streams must first be extracted separately, and only the fusion scheme is improved afterwards. This leads to the following defects:
(1) information representing the same pattern in the two heterogeneous input streams is processed separately; the complementary information between the two input streams is not processed cooperatively at the front end of the network, so some key features that would help action recognition may be lost;
(2) inference and learning cannot be performed end to end; the two branches must be processed separately, and the mutual flow of information between the heterogeneous feature extraction streams cannot be guaranteed, which weakens the discriminativity of the features.
Disclosure of Invention
The invention provides a video action classification and identification method based on a dual-stream cooperative network, which solves the problems in the prior art that key features may be lost, that the video frames and the optical flow field are processed separately, that information does not flow between the streams, and that end-to-end processing cannot be carried out.
The invention specifically comprises the following contents:
A video action classification and identification method based on a dual-stream cooperative network comprises the following: first, time-domain sequence features $X_o$ are extracted from the video optical flow field and spatial-domain sequence features $X_f$ are extracted from the video frames, simultaneously, through convolutional networks; then a connection unit is constructed so that the heterogeneous time-domain sequence features $X_o$ and spatial-domain sequence features $X_f$ carry out information interaction; then a sharing unit is constructed to perform sequence feature aggregation on the interacted time-domain sequence features $\hat{X}_o$ and the interacted spatial-domain sequence features $\hat{X}_f$ respectively, obtaining the aggregated time-domain features $Z_o$ and the aggregated spatial-domain features $Z_f$.
The information interaction specifically comprises the following steps (an illustrative sketch of the connection unit is given after the two steps):
Step one: the time-domain sequence features $X_o$ extracted from the video optical flow field and the spatial-domain sequence features $X_f$ extracted from the video frames are fused to obtain a heterogeneous correlation matrix $Y$ of the time-space-domain sequence features;
Step two: according to the heterogeneous correlation matrix $Y$ obtained in step one, the complementary time-domain sequence features $\tilde{X}_o$ and the complementary spatial-domain sequence features $\tilde{X}_f$ are extracted; the complementary time-domain sequence features $\tilde{X}_o$ are fused back into the time-domain sequence features $X_o$ to generate the fused time-domain sequence features $\hat{X}_o$, and the complementary spatial-domain sequence features $\tilde{X}_f$ are fused back into the spatial-domain sequence features $X_f$ to generate the fused spatial-domain sequence features $\hat{X}_f$.
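As referenced above, the following is an illustrative sketch of the connection unit. The exact similarity function $g_\theta$, the way the complementary features are read off the correlation matrix $Y$, and the form of the fusion weights are given by formula images in the original patent that are not reproduced here, so everything in this sketch (linear embeddings, softmax-normalised cross-reading, residual fusion with scalar weights $w_f$, $w_o$) is an assumption built only from the surrounding description.

```python
# Sketch of the connection unit (steps one and two above); all concrete
# operator choices below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConnectionUnit(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.theta = nn.Linear(d, d, bias=False)   # embedding of the spatial stream
        self.phi   = nn.Linear(d, d, bias=False)   # embedding of the temporal stream (W_K-like)
        self.w_f = nn.Parameter(torch.tensor(0.5)) # learnable fusion weight for X_f
        self.w_o = nn.Parameter(torch.tensor(0.5)) # learnable fusion weight for X_o

    def forward(self, x_f, x_o):
        # Step one: heterogeneous correlation matrix Y (one row/column per sample).
        y = self.theta(x_f) @ self.phi(x_o).t()            # (N, N)
        # Step two: complementary features, read off Y in both directions.
        comp_f = F.softmax(y, dim=1) @ x_o                 # what the flow stream contributes to X_f
        comp_o = F.softmax(y.t(), dim=1) @ x_f             # what the RGB stream contributes to X_o
        # Residual fusion back into the originally extracted features.
        x_f_hat = x_f + self.w_f * comp_f
        x_o_hat = x_o + self.w_o * comp_o
        return x_f_hat, x_o_hat, y

unit = ConnectionUnit(d=256)
x_f_hat, x_o_hat, y = unit(torch.randn(8, 256), torch.randn(8, 256))
```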
To better implement the invention, further, the time-domain features $Z_o$ and the spatial-domain features $Z_f$ are both regularized and then fed into a shared weight layer, from which a time-domain feature classification score and a spatial-domain feature classification score are extracted; finally, the time-domain feature classification score and the spatial-domain feature classification score are fused into a predicted spatio-temporal feature classification score vector used for actual video action recognition. The predicted spatio-temporal feature classification score vector is divided into a correct spatio-temporal feature classification score vector and incorrect spatio-temporal feature classification score vectors; the correct spatio-temporal feature classification score vector is the classification score of the true category, while the incorrect spatio-temporal feature classification score vectors are the classification scores of the other action categories extracted from the video being recognized during actual recognition.
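A minimal sketch of the shared-weight classification head described above. The regularisation is assumed to be L2 normalisation and the score fusion is assumed to be a simple average; the patent only states that the two scores are fused into one predicted score vector.

```python
# Sketch of the shared-weight classification head; normalisation and averaging
# are assumptions, the shared linear layer follows the description above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedHead(nn.Module):
    def __init__(self, d, num_classes):
        super().__init__()
        self.classifier = nn.Linear(d, num_classes)   # one weight matrix shared by both streams

    def forward(self, z_o, z_f):
        z_o = F.normalize(z_o, dim=1)                 # regularise the aggregated temporal features
        z_f = F.normalize(z_f, dim=1)                 # regularise the aggregated spatial features
        s_o = self.classifier(z_o)                    # time-domain classification scores
        s_f = self.classifier(z_f)                    # spatial-domain classification scores
        return 0.5 * (s_o + s_f)                      # fused spatio-temporal score vector

head = SharedHead(d=256, num_classes=101)
fused_scores = head(torch.randn(8, 256), torch.randn(8, 256))
```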
In order to better realize the invention, further, a pre-training sample set is selected for training, generating a classifier model that contains the correct spatio-temporal feature classification scores for all action categories; a cross-entropy loss function $L_1$ is computed from the correct spatio-temporal feature classification scores.
In order to better implement the present invention, further, the shared weight layer also uses the input time-domain features $Z_o$ and spatial-domain features $Z_f$ to construct heterogeneous triplet pairs of the time-domain features $Z_o$ and heterogeneous triplet pairs of the spatial-domain features $Z_f$, and at the same time computes the heterogeneous triplet pair loss function $L_2$ from these heterogeneous triplet pairs of $Z_o$ and $Z_f$.
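A sketch of a heterogeneous triplet pair loss of this kind. The exact pairing rule and distance are given by formula images in the patent; the common cross-modal form is assumed here (anchor from one stream, positive and negative from the other, squared 2-norm distance, hinge with margin $\alpha_1$).

```python
# Sketch of a heterogeneous (cross-stream) triplet pair loss L2; the cross-modal
# anchor/positive/negative arrangement and the margin value are assumptions.
import torch

def hetero_triplet_loss(z_a, z_p, z_n, alpha1=0.3):
    d_ap = (z_a - z_p).pow(2).sum(dim=1)        # distance anchor -> positive (other stream)
    d_an = (z_a - z_n).pow(2).sum(dim=1)        # distance anchor -> negative (other stream)
    return torch.clamp(d_ap - d_an + alpha1, min=0).mean()   # [x]_+ hinge

# anchors from the temporal stream with positives/negatives from the spatial
# stream, plus the symmetric term with the roles of the two streams swapped
z_o, z_f_pos, z_f_neg = torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256)
z_f, z_o_pos, z_o_neg = torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256)
l2 = hetero_triplet_loss(z_o, z_f_pos, z_f_neg) + hetero_triplet_loss(z_f, z_o_pos, z_o_neg)
```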
In order to better implement the present invention, further, the shared weight layer also computes the class centers $c^o$ of the time-domain features $Z_o$ and the class centers $c^f$ of the spatial-domain features $Z_f$, and calculates a discriminative embedding constraint loss function $L_3$ from the obtained class centers $c^o$ and $c^f$.
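A sketch of the class centers and a discriminative embedding constraint. The class center is taken to be the mean of the aggregated features carrying that class label, as suggested by the indicator-function description given later; the exact form of $L_3$ is a formula image in the patent, so the margin form below (same-class centers of the two streams pulled within $\alpha_2$, different-class centers pushed beyond $\alpha_3$) is only an assumption.

```python
# Sketch of per-class centers and an assumed discriminative embedding loss L3.
import torch

def class_centers(z, labels, num_classes):
    centers = torch.zeros(num_classes, z.size(1))
    for k in range(num_classes):
        mask = labels == k                      # 1(l_i = c_k)
        if mask.any():
            centers[k] = z[mask].mean(dim=0)    # mean feature of class k
    return centers

def discriminative_embedding_loss(c_o, c_f, alpha2=0.1, alpha3=1.0):
    same = (c_o - c_f).pow(2).sum(dim=1)                        # same class, different streams
    diff = torch.cdist(c_o, c_f).pow(2)                         # all cross-class center pairs
    off_diag = diff[~torch.eye(diff.size(0), dtype=torch.bool)]
    return torch.clamp(same - alpha2, min=0).mean() + torch.clamp(alpha3 - off_diag, min=0).mean()

labels = torch.randint(0, 5, (8,))
c_o = class_centers(torch.randn(8, 256), labels, num_classes=5)
c_f = class_centers(torch.randn(8, 256), labels, num_classes=5)
l3 = discriminative_embedding_loss(c_o, c_f)
```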
To better implement the present invention, further, a combination of the cross-entropy loss function $L_1$, the heterogeneous triplet pair loss function $L_2$ and the discriminative embedding constraint loss function $L_3$ is adopted as the training loss function $L$.
In order to better realize the invention, further, during actual video recognition the generated predicted spatio-temporal feature classification scores are sorted from largest to smallest and the largest one is selected; this largest score is the correct spatio-temporal feature classification score of the recognized video, and the category index corresponding to it is the category of the action.
To better implement the invention, further:
The spatial-domain features $X_f$ and the time-domain features $X_o$ are each expressed as a sequence of $d$-dimensional feature vectors, where $d$ is the dimension of the features.
The heterogeneous correlation matrix $Y$ obtained in step one is computed by a similarity function $g_\theta(\cdot)$ that measures the similarity between the two feature streams through learned embedding functions, $W_K$ being a parameter to be learned; $Y$ is the heterogeneous correlation matrix of the time-space-domain features, a square matrix whose numbers of rows and columns both equal the number of video samples.
The fused time-domain sequence features $\hat{X}_o$ and the fused spatial-domain sequence features $\hat{X}_f$ obtained in step two are produced by interaction functions that separate the complementary spatial-domain features $\tilde{X}_f$ and the complementary time-domain features $\tilde{X}_o$ and fuse them back into the originally extracted features $X_f$ and $X_o$, with $w_f$ and $w_o$ as the parameters to be learned.
The cross-entropy loss function $L_1$ takes the softmax form
$$L_1=-\sum_i \log\frac{\exp\left(s_i^{y_i}\right)}{\sum_j \exp\left(s_i^{j}\right)},$$
where $L_1$ denotes the cross-entropy loss value, $s_i^{y_i}$ denotes the correct spatio-temporal feature classification score of the true category for the $i$-th sample, and $s_i^{j}$ denotes the spatio-temporal feature classification score of the $i$-th sample for class $j$.
The heterogeneous triplet pairs of the spatial-domain features and of the time-domain features are built across the two streams, where the subscripts $a$, $p$, $n$ denote the anchor, positive and negative points respectively, and $i$ and $j$ denote the sample and action-class indices.
The heterogeneous triplet pair loss function $L_2$ is a hinge loss over these triplet pairs based on the 2-norm distance metric $\|\cdot\|_2$, where $[x]_+=x$ if $x>0$, $[x]_+=0$ if $x\le 0$, and $\alpha_1$ is a threshold value; $L_2$ denotes the triplet loss value.
The class centers $c^f$ of the spatial-domain features and $c^o$ of the time-domain features are computed per action class from the features carrying that class label, where $C=\{c_1,c_2,\ldots,c_s\}$ is the set of class labels, $l_i$ is the label of the $i$-th sample, and $\mathbb{1}(\cdot)$ is the indicator function.
The discriminative embedding constraint loss function $L_3$ is defined on these class centers, where $L_3$ denotes the discriminative embedding loss value and $\alpha_2$, $\alpha_3$ are threshold values.
The expression of the overall loss function $L$ is:
$$L=\lambda_1 L_1+\lambda_2 L_2+\lambda_3 L_3.$$
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) the two heterogeneous input streams are processed simultaneously and their complementary information is processed cooperatively, so the loss of key features that help action recognition is avoided;
(2) processing the two streams together enables end-to-end inference and learning and guarantees the mutual flow of information between the heterogeneous feature extraction streams.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is a block diagram of a dual stream feature processing process;
FIG. 3 is a schematic diagram of a connection network for different modal flow branches;
FIG. 4 is a schematic diagram comparing the effect of the present invention with the prior art.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments, and therefore should not be considered as a limitation to the scope of protection. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
Example 1:
A video action classification and identification method based on a dual-stream cooperative network is described with reference to FIG. 1, FIG. 2 and FIG. 3. First, a convolutional network is used to extract the spatial-domain sequence features $X_f$ from the video frames and, simultaneously, the time-domain sequence features $X_o$ from the video optical flow field, each feature being a $d$-dimensional vector, where $d$ is the dimension of the feature. A connection unit is then constructed so that the heterogeneous spatial-domain sequence features and time-domain sequence features carry out information interaction; afterwards a sharing unit is constructed to perform sequence feature aggregation on the fused spatial-domain sequence features and the fused time-domain sequence features respectively, obtaining the aggregated spatial-domain features $Z_f$ and the aggregated time-domain features $Z_o$.
The information interaction specifically comprises the following steps:
Step one: the spatial-domain sequence features extracted from the video frames and the time-domain sequence features extracted from the video optical flow field are fused to obtain the heterogeneous correlation matrix $Y$; in the corresponding formula, $g_\theta(\cdot)$ measures the similarity of the variables through learned embedding functions, with $W_K$ a parameter to be learned; $Y$ is the heterogeneous correlation matrix of the time-space-domain features, a square matrix whose numbers of rows and columns both equal the number of video samples.
Step two: according to the heterogeneous correlation matrix $Y$ obtained in step one, the complementary time-domain sequence features $\tilde{X}_o$ and the complementary spatial-domain sequence features $\tilde{X}_f$ are separated from the fused features and fused back into the time-domain sequence features $X_o$ and the spatial-domain sequence features $X_f$ respectively, obtaining the fused time-domain sequence features $\hat{X}_o$ and the fused spatial-domain sequence features $\hat{X}_f$; the interaction functions that separate the complementary spatial-domain and time-domain features have $w_f$ and $w_o$ as the parameters to be learned.
The working principle is as follows: in prior-art processing, the features in the video frames and in the optical flow field are extracted separately and only fused afterwards, so complementary information is easily lost. Adding the separated complementary features into the originally extracted spatial-domain and time-domain sequence features makes the originally extracted features carry the complementary information, and these enriched time-domain and spatial-domain sequence features are then sent on for further processing. In actual operation, a video usually contains a large number of frames; using all of them as input would incur a huge computational cost, and much of the information in neighbouring frames is similar and redundant, so the video is sampled before features are extracted. The video is sampled in a global sparse sampling manner: M RGB frames are acquired, and the optical flow field images, having x and y directions, amount to 2M images in total. A convolutional network then performs feature extraction on the sampled video frames and the optical flow field images, using Inception and Inception-v3 networks respectively, and the extracted features are sent to the connection unit for processing.
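A minimal sketch of the global sparse sampling described above: the video is divided into M equal segments, one RGB frame index is drawn per segment, and the x- and y-direction flow images at the sampled positions give 2M flow images in total. The segment count, the train/test sampling rule and the index bookkeeping are illustrative assumptions; frame decoding and optical-flow computation are assumed to happen elsewhere.

```python
# Sketch of global sparse sampling; only frame indices are produced here.
import random

def sparse_sample(num_frames, m=3, train=True):
    seg_len = num_frames // m
    indices = []
    for s in range(m):
        start = s * seg_len
        end = start + seg_len
        # random position inside the segment when training, centre when testing
        indices.append(random.randrange(start, end) if train else start + seg_len // 2)
    return indices

rgb_idx = sparse_sample(num_frames=300, m=3)
flow_idx = [(i, "x") for i in rgb_idx] + [(i, "y") for i in rgb_idx]   # 2M flow images
print(rgb_idx, len(flow_idx))
```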
Example 2:
On the basis of embodiment 1 above, to better implement the present invention, as shown in FIG. 1 and FIG. 2, a sharing unit is further constructed to perform sequence feature aggregation on the fused spatial-domain sequence features and the fused time-domain sequence features respectively: the fused spatial-domain sequence features are aggregated into the spatial-domain features $Z_f$ and the fused time-domain sequence features are aggregated into the time-domain features $Z_o$. The time-domain features $Z_o$ and the spatial-domain features $Z_f$ are both regularized and then fed into a shared weight layer, from which a time-domain feature classification score and a spatial-domain feature classification score are extracted; finally, the two scores are fused into a predicted spatio-temporal feature classification score vector used for actual video action recognition. The predicted spatio-temporal feature classification score vector is divided into a correct spatio-temporal feature classification score vector and incorrect spatio-temporal feature classification score vectors; the correct one is the classification score of the true category, while the incorrect ones are the classification scores of the other action categories extracted from the video being recognized during actual recognition.
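The aggregation operator of the sharing unit is not spelled out in this passage; a simple segmental consensus (mean over the per-segment features) is assumed in the sketch below purely for illustration.

```python
# Sketch of the sharing unit's sequence feature aggregation (assumed mean consensus).
import torch

def aggregate(seq_features):
    """(batch, M, d) per-segment features -> (batch, d) aggregated feature."""
    return seq_features.mean(dim=1)

x_f_hat = torch.randn(8, 3, 256)   # interacted spatial-domain sequence features
x_o_hat = torch.randn(8, 3, 256)   # interacted time-domain sequence features
z_f, z_o = aggregate(x_f_hat), aggregate(x_o_hat)
```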
Other parts of this embodiment are the same as embodiment 1, and thus are not described again.
Example 3:
On the basis of any one of embodiments 1-2 above, to better implement the present invention, a sample set is further selected for training, generating a classifier model that contains the correct spatio-temporal feature classification scores for action classification; a combination of the cross-entropy loss function, the heterogeneous triplet pair loss function and the discriminative embedding constraint loss function is adopted as the training loss function.
The working principle is as follows: a sample set is selected for pre-training the classifier model, and the combination of the cross-entropy loss, the heterogeneous triplet pair loss and the discriminative embedding constraint loss is introduced as the training loss function, so that the pre-trained classifier model is more reliable and the features of each class are more compactly aggregated.
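A minimal sketch of one pre-training step with the combined loss, using the empirical weights reported later in the description ($\lambda_1=1$, $\lambda_2=\lambda_3=0.5$). The `model`, `triplet_loss_fn` and `embedding_loss_fn` names and signatures are assumptions standing in for the components sketched earlier.

```python
# Minimal sketch of one pre-training step with L = L1 + 0.5*L2 + 0.5*L3.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, rgb, flow, labels,
               triplet_loss_fn, embedding_loss_fn):
    scores, z_o, z_f = model(rgb, flow)                 # fused scores + aggregated features (assumed interface)
    l1 = F.cross_entropy(scores, labels)                # correct-class cross entropy
    l2 = triplet_loss_fn(z_o, z_f, labels)              # heterogeneous triplet pair loss
    l3 = embedding_loss_fn(z_o, z_f, labels)            # discriminative embedding constraint
    loss = l1 + 0.5 * l2 + 0.5 * l3
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```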
Other parts of this embodiment are the same as any of embodiments 1-2 described above, and thus are not described again.
Example 4:
On the basis of any one of embodiments 1 to 3 above, to better implement the present invention, further, the shared weight layer also uses the input time-domain features $Z_o$ and spatial-domain features $Z_f$ to construct the heterogeneous triplet pairs of the spatial-domain features and the heterogeneous triplet pairs of the time-domain features, where the subscripts $a$, $p$, $n$ denote the anchor, positive and negative points respectively, and $i$ and $j$ denote the sample and action-class indices. The heterogeneous triplet pair loss function $L_2$ is a hinge loss over these triplet pairs based on the 2-norm distance metric $\|\cdot\|_2$, where $[x]_+=x$ if $x>0$, $[x]_+=0$ if $x\le 0$, and $\alpha_1$ is a threshold value; $L_2$ denotes the triplet loss value.
At the same time, the class centers of the spatial-domain features and of the time-domain features are computed per action class from the features carrying that class label, where $C=\{c_1,c_2,\ldots,c_s\}$ is the set of class labels, $l_i$ is the label of the $i$-th sample, and $\mathbb{1}(\cdot)$ is the indicator function; the discriminative embedding constraint loss function $L_3$ is defined on these class centers, where $L_3$ denotes the discriminative embedding loss value and $\alpha_2$, $\alpha_3$ are threshold values.
The cross-entropy loss function $L_1$ takes the softmax form
$$L_1=-\sum_i \log\frac{\exp\left(s_i^{y_i}\right)}{\sum_j \exp\left(s_i^{j}\right)},$$
where $L_1$ denotes the cross-entropy loss value, $s_i^{y_i}$ denotes the classification score of the true category for the $i$-th sample, and $s_i^{j}$ denotes the classification score of the $i$-th sample for class $j$; through this loss function, the features of the true classification category are aggregated more prominently.
The loss function for training the whole network is
$$L=\lambda_1 L_1+\lambda_2 L_2+\lambda_3 L_3,$$
and empirically the weights are set as $L=L_1+0.5\,L_2+0.5\,L_3$.
Other parts of this embodiment are the same as any of embodiments 1 to 3, and thus are not described again.
Example 5:
On the basis of any one of embodiments 1 to 4 above, to better implement the present invention, further, during actual video action recognition the generated spatio-temporal feature classification scores are sorted from largest to smallest and the largest one is selected; this largest score is the correct spatio-temporal feature classification score of the recognized video, and the category index corresponding to it is the category of the action. The invention specifically uses the top-k index to evaluate the model: top-k refers to the proportion of video sequences whose correct label appears among the top k results of the classification scores returned by the model, and is the most common classification evaluation metric. In this example, k is set to 1. The invention was tested on the large-scale video behavior classification datasets UCF-101 and HMDB-51. The UCF-101 dataset comprises 101 action categories with 13,320 samples, of which 70% are selected as the training set and the rest as the validation set; the HMDB-51 dataset comprises 51 action categories with 6,849 samples, of which 70% are selected as the training set and the rest as the validation set. The comparison result is shown in FIG. 4: the fused recognition performance of the invention after information interaction is superior to the existing methods on all validation sets. On the UCF-101 dataset, the final recognition performance of the method is 0.4% higher than the previous best method, and on HMDB-51 it is 3.2% higher. The method outperforms the existing methods under all evaluation settings and improves the recognition accuracy of video behavior classification.
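A sketch of the top-k metric used in the evaluation above: the proportion of videos whose true label appears among the k highest classification scores (k = 1 in this embodiment).

```python
# Sketch of the top-k accuracy metric described above.
import torch

def top_k_accuracy(scores, labels, k=1):
    topk = scores.topk(k, dim=1).indices                 # (N, k) highest-scoring classes
    hits = (topk == labels.unsqueeze(1)).any(dim=1)      # true label among the top k?
    return hits.float().mean().item()

scores = torch.randn(6, 101)            # e.g. 101 UCF-101 classes
labels = torch.randint(0, 101, (6,))
print(top_k_accuracy(scores, labels, k=1))
```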
Other parts of this embodiment are the same as any of embodiments 1 to 4, and thus are not described again.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications and equivalent variations of the above embodiments according to the technical spirit of the present invention are included in the scope of the present invention.

Claims (8)

1. A video action classification and identification method based on a dual-stream cooperative network, characterized in that time-domain sequence features $X_o$ are extracted from the video optical flow field and spatial-domain sequence features $X_f$ are extracted from the video frames, simultaneously, through convolutional networks; then a connection unit is constructed so that the heterogeneous time-domain sequence features $X_o$ and spatial-domain sequence features $X_f$ carry out information interaction; then a sharing unit is constructed to perform sequence feature aggregation on the interacted time-domain sequence features $\hat{X}_o$ and the interacted spatial-domain sequence features $\hat{X}_f$ respectively, obtaining the aggregated time-domain features $Z_o$ and the aggregated spatial-domain features $Z_f$;
the information interaction specifically comprises the following steps:
step one: fusing the time-domain sequence features $X_o$ extracted from the video optical flow field and the spatial-domain sequence features $X_f$ extracted from the video frames to obtain a heterogeneous correlation matrix $Y$ of the time-space-domain sequence features;
step two: according to the heterogeneous correlation matrix $Y$ obtained in step one, extracting the complementary time-domain sequence features $\tilde{X}_o$ and the complementary spatial-domain sequence features $\tilde{X}_f$, fusing the complementary time-domain sequence features $\tilde{X}_o$ back into the time-domain sequence features $X_o$ to generate the fused time-domain sequence features $\hat{X}_o$, and fusing the complementary spatial-domain sequence features $\tilde{X}_f$ back into the spatial-domain sequence features $X_f$ to generate the fused spatial-domain sequence features $\hat{X}_f$.
2. The video action classification and identification method based on the dual-stream cooperative network according to claim 1, characterized in that the time-domain features $Z_o$ and the spatial-domain features $Z_f$ are both regularized and then fed into a shared weight layer, from which a time-domain feature classification score and a spatial-domain feature classification score are extracted; finally, the time-domain feature classification score and the spatial-domain feature classification score are fused into a predicted spatio-temporal feature classification score vector used for actual video action recognition; the predicted spatio-temporal feature classification score vector is divided into a correct spatio-temporal feature classification score vector and incorrect spatio-temporal feature classification score vectors; the correct spatio-temporal feature classification score vector is the classification score of the true category, and the incorrect spatio-temporal feature classification score vectors are the classification scores of the other action categories extracted from the video being recognized during actual recognition.
3. The video action classification and identification method based on the dual-stream cooperative network according to claim 2, characterized in that a pre-training sample set is selected for training, generating a classifier model that contains the correct spatio-temporal feature classification scores for each action category; a cross-entropy loss function $L_1$ is computed from the correct spatio-temporal feature classification scores.
4. The video action classification and identification method based on the dual-stream cooperative network according to claim 3, characterized in that the shared weight layer also uses the input time-domain features $Z_o$ and spatial-domain features $Z_f$ to construct heterogeneous triplet pairs of the time-domain features $Z_o$ and heterogeneous triplet pairs of the spatial-domain features $Z_f$, and at the same time computes the heterogeneous triplet pair loss function $L_2$ from these heterogeneous triplet pairs.
5. The video action classification and identification method based on the dual-stream cooperative network according to claim 4, characterized in that the shared weight layer also computes the class centers $c^o$ of the time-domain features $Z_o$ and the class centers $c^f$ of the spatial-domain features $Z_f$, and calculates a discriminative embedding constraint loss function $L_3$ from the obtained class centers $c^o$ and $c^f$.
6. The video action classification and identification method based on the dual-stream cooperative network according to claim 5, characterized in that a combination of the cross-entropy loss function $L_1$, the heterogeneous triplet pair loss function $L_2$ and the discriminative embedding constraint loss function $L_3$ is adopted as the training loss function $L$.
7. The video action classification and identification method based on the dual-stream cooperative network according to claim 2, characterized in that, during actual video recognition, the generated predicted spatio-temporal feature classification scores are sorted from largest to smallest and the largest one is selected; this largest score is the correct spatio-temporal feature classification score of the recognized video, and the category index corresponding to it is the category of the action.
8. The video action classification and identification method based on the dual-stream cooperative network according to claim 7, characterized in that:
the spatial-domain features $X_f$ and the time-domain features $X_o$ are each expressed as a sequence of $d$-dimensional feature vectors, where $d$ is the dimension of the features;
the heterogeneous correlation matrix $Y$ obtained in step one is computed by a similarity function $g_\theta(\cdot)$ that measures the similarity between the two feature streams through learned embedding functions, $W_K$ being a parameter to be learned; $Y$ is the heterogeneous correlation matrix of the time-space-domain features, a square matrix whose numbers of rows and columns both equal the number of video samples;
the fused time-domain sequence features $\hat{X}_o$ and the fused spatial-domain sequence features $\hat{X}_f$ obtained in step two are produced by interaction functions that separate the complementary spatial-domain features $\tilde{X}_f$ and the complementary time-domain features $\tilde{X}_o$ and fuse them back into the originally extracted features $X_f$ and $X_o$, with $w_f$ and $w_o$ as the parameters to be learned;
the cross-entropy loss function $L_1$ takes the softmax form
$$L_1=-\sum_i \log\frac{\exp\left(s_i^{y_i}\right)}{\sum_j \exp\left(s_i^{j}\right)},$$
where $L_1$ denotes the cross-entropy loss value, $s_i^{y_i}$ denotes the correct spatio-temporal feature classification score of the true category for the $i$-th sample, and $s_i^{j}$ denotes the spatio-temporal feature classification score of the $i$-th sample for class $j$;
the heterogeneous triplet pairs of the spatial-domain features and of the time-domain features are built across the two streams, where the subscripts $a$, $p$, $n$ denote the anchor, positive and negative points respectively, and $i$ and $j$ denote the sample and action-class indices;
the heterogeneous triplet pair loss function $L_2$ is a hinge loss over these triplet pairs based on the 2-norm distance metric $\|\cdot\|_2$, where $[x]_+=x$ if $x>0$, $[x]_+=0$ if $x\le 0$, and $\alpha_1$ is a threshold value;
the class centers $c^f$ of the spatial-domain features and $c^o$ of the time-domain features are computed per action class from the features carrying that class label, where $C=\{c_1,c_2,\ldots,c_s\}$ is the set of class labels, $l_i$ is the label of the $i$-th sample, and $\mathbb{1}(\cdot)$ is the indicator function;
the discriminative embedding constraint loss function $L_3$ is defined on these class centers, where $L_3$ denotes the discriminative embedding loss value and $\alpha_2$, $\alpha_3$ are threshold values;
the expression of the loss function $L$ is:
$$L=\lambda_1 L_1+\lambda_2 L_2+\lambda_3 L_3.$$
CN201911228675.0A 2019-12-04 2019-12-04 Video action classification and identification method based on double-flow cooperative network Active CN111079594B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911228675.0A CN111079594B (en) 2019-12-04 2019-12-04 Video action classification and identification method based on double-flow cooperative network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911228675.0A CN111079594B (en) 2019-12-04 2019-12-04 Video action classification and identification method based on double-flow cooperative network

Publications (2)

Publication Number Publication Date
CN111079594A true CN111079594A (en) 2020-04-28
CN111079594B CN111079594B (en) 2023-06-06

Family

ID=70312816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911228675.0A Active CN111079594B (en) 2019-12-04 2019-12-04 Video action classification and identification method based on double-flow cooperative network

Country Status (1)

Country Link
CN (1) CN111079594B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170147868A1 (en) * 2014-04-11 2017-05-25 Beijing Sesetime Technology Development Co., Ltd. A method and a system for face verification
CN104023226A (en) * 2014-05-28 2014-09-03 北京邮电大学 HVS-based novel video quality evaluation method
CN107506712A (en) * 2017-08-15 2017-12-22 成都考拉悠然科技有限公司 Method for distinguishing is known in a kind of human behavior based on 3D depth convolutional networks
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
CN110163052A (en) * 2018-08-01 2019-08-23 腾讯科技(深圳)有限公司 Video actions recognition methods, device and machinery equipment
CN109558781A (en) * 2018-08-02 2019-04-02 北京市商汤科技开发有限公司 A kind of multi-angle video recognition methods and device, equipment and storage medium
CN109784269A (en) * 2019-01-11 2019-05-21 中国石油大学(华东) One kind is based on the united human action detection of space-time and localization method
CN109858407A (en) * 2019-01-17 2019-06-07 西北大学 A kind of video behavior recognition methods based on much information stream feature and asynchronous fusion
CN110070041A (en) * 2019-04-23 2019-07-30 江西理工大学 A kind of video actions recognition methods of time-space compression excitation residual error multiplication network
CN110135369A (en) * 2019-05-20 2019-08-16 威创集团股份有限公司 A kind of Activity recognition method, system, equipment and computer readable storage medium
CN110334746A (en) * 2019-06-12 2019-10-15 腾讯科技(深圳)有限公司 A kind of image detecting method and device
CN110390308A (en) * 2019-07-26 2019-10-29 华侨大学 It is a kind of to fight the video behavior recognition methods for generating network based on space-time

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHRISTOPH R. et al.: "Spatiotemporal residual networks for video action recognition", Advances in Neural Information Processing Systems *
MAO Zhiqiang et al.: "Human action recognition based on spatio-temporal two-stream convolution and LSTM" (基于时空双流卷积与LSTM的人体动作识别), Software (《软件》) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259874A (en) * 2020-05-06 2020-06-09 成都派沃智通科技有限公司 Campus security video monitoring method based on deep learning
CN111312367A (en) * 2020-05-11 2020-06-19 成都派沃智通科技有限公司 Campus personnel abnormal psychological prediction method based on self-adaptive cloud management platform
CN112446348A (en) * 2020-12-08 2021-03-05 电子科技大学 Behavior identification method based on characteristic spectrum flow
CN112446348B (en) * 2020-12-08 2022-05-31 电子科技大学 Behavior identification method based on characteristic spectrum flow
CN113343786A (en) * 2021-05-20 2021-09-03 武汉大学 Lightweight video action recognition network, method and system based on deep learning
CN113343786B (en) * 2021-05-20 2022-05-17 武汉大学 Lightweight video action recognition method and system based on deep learning
CN113255570A (en) * 2021-06-15 2021-08-13 成都考拉悠然科技有限公司 Sequential action detection method for sensing video clip relation
CN113255570B (en) * 2021-06-15 2021-09-24 成都考拉悠然科技有限公司 Sequential action detection method for sensing video clip relation
CN114943286A (en) * 2022-05-20 2022-08-26 电子科技大学 Unknown target discrimination method based on fusion of time domain features and space domain features
CN114943286B (en) * 2022-05-20 2023-04-07 电子科技大学 Unknown target discrimination method based on fusion of time domain features and space domain features
CN115393660A (en) * 2022-10-28 2022-11-25 松立控股集团股份有限公司 Parking lot fire detection method based on weak supervision collaborative sparse relationship ranking mechanism
CN115393660B (en) * 2022-10-28 2023-02-24 松立控股集团股份有限公司 Parking lot fire detection method based on weak supervision collaborative sparse relationship ranking mechanism

Also Published As

Publication number Publication date
CN111079594B (en) 2023-06-06

Similar Documents

Publication Publication Date Title
CN111079594B (en) Video action classification and identification method based on double-flow cooperative network
Li et al. Collaborative spatiotemporal feature learning for video action recognition
US10152644B2 (en) Progressive vehicle searching method and device
CN108537136B (en) Pedestrian re-identification method based on attitude normalization image generation
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN115033670A (en) Cross-modal image-text retrieval method with multi-granularity feature fusion
CN109727246A (en) Comparative learning image quality evaluation method based on twin network
CN111506773B (en) Video duplicate removal method based on unsupervised depth twin network
WO2022160772A1 (en) Person re-identification method based on view angle guidance multi-adversarial attention
CN110827265B (en) Image anomaly detection method based on deep learning
CN113221641A (en) Video pedestrian re-identification method based on generation of confrontation network and attention mechanism
CN108960142B (en) Pedestrian re-identification method based on global feature loss function
CN108647621A (en) A kind of video analysis processing system and method based on recognition of face
CN113963170A (en) RGBD image saliency detection method based on interactive feature fusion
CN117152459A (en) Image detection method, device, computer readable medium and electronic equipment
WO2023185074A1 (en) Group behavior recognition method based on complementary spatio-temporal information modeling
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
Sarker et al. Transformer-Based Person Re-Identification: A Comprehensive Review
CN115439791A (en) Cross-domain video action recognition method, device, equipment and computer-readable storage medium
Sun et al. Video-based parent-child relationship prediction
CN115705756A (en) Motion detection method, motion detection device, computer equipment and storage medium
CN113468540A (en) Security portrait processing method based on network security big data and network security system
Liu et al. Text detection based on bidirectional feature fusion and sa attention mechanism
CN115631530B (en) Fair facial expression recognition method based on face action unit
Xie et al. Pedestrian attribute recognition based on multi-scale fusion and cross attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant