CN111079594B - Video action classification and identification method based on dual-stream cooperative network - Google Patents

Video action classification and identification method based on dual-stream cooperative network

Info

Publication number
CN111079594B
CN111079594B
Authority
CN
China
Prior art keywords
feature
time domain
video
time
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911228675.0A
Other languages
Chinese (zh)
Other versions
CN111079594A (en)
Inventor
徐行
张静然
沈复民
贾可
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Koala Youran Technology Co ltd
Original Assignee
Chengdu Koala Youran Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Koala Youran Technology Co ltd filed Critical Chengdu Koala Youran Technology Co ltd
Priority to CN201911228675.0A
Publication of CN111079594A
Application granted
Publication of CN111079594B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video action category identification method based on a dual-stream cooperative network. First, the heterogeneous spatial-domain features and temporal-domain features are made to interact: the heterogeneous temporal and spatial features are fused, complementary temporal and spatial components are extracted from the fused spatio-temporal features, and the complementary components are fused back into the original temporal features and spatial features, all of the temporal and spatial features fused with their complementary components forming the temporal sequence features and the spatial sequence features respectively. Sequence feature aggregation is then performed on the spatial sequence features and the temporal sequence features to obtain the aggregated spatial features and the aggregated temporal features. Finally, a classifier model is pre-trained for testing and classifying the video to be identified. The invention enables complementary information to flow between the different input modality streams, thereby achieving a more accurate action recognition effect.

Description

Video action classification and identification method based on dual-stream cooperative network
Technical Field
The invention belongs to the technical field of video action classification and identification, and particularly relates to a video action classification and identification method based on a dual-stream cooperative network.
Background
Due to the popularity of smart phones, public surveillance and portable cameras, short video data are growing rapidly because they are so easy to produce. Action recognition based on short videos has important academic value and can support business applications such as intelligent security and user recommendation. The two-stream network has long been the most widely adopted and most effective framework in the field of action recognition, but most two-stream action recognition solutions focus on how to design structures that fuse the features of the different streams, and the different stream networks are trained separately, so end-to-end inference cannot be achieved.
The goal of video action category identification is to identify the category of the action occurring in a video. Existing two-stream action category identification methods mainly comprise the following:
(1) Spatial-domain feature extraction stream: spatial features are extracted from the input RGB video frames by a convolutional network; existing methods use both 2D and 3D convolutional networks. This branch aims to extract appearance information in the video and provide a basis for later fusion;
(2) Temporal-domain feature extraction stream: temporal features are extracted from the pre-computed optical flow fields by a convolutional network; both 2D and 3D convolutional networks can serve as the backbone. This branch aims to extract motion information in the video and provide a basis for later fusion, as sketched below.
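For illustration only, the sketch below shows how such a two-stream front end is commonly wired in PyTorch. The backbone (a ResNet-18 from torchvision), the module names and the tensor shapes are assumptions made for the example and are not the configuration fixed by the patent.

```python
# Hypothetical two-stream feature extractors; backbone choice and tensor
# shapes are illustrative assumptions, not the patent's exact configuration.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class StreamBackbone(nn.Module):
    """Per-frame CNN encoder; only the number of input channels differs per stream."""
    def __init__(self, in_channels: int, feat_dim: int = 512):
        super().__init__()
        cnn = resnet18(weights=None)
        cnn.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2,
                              padding=3, bias=False)
        cnn.fc = nn.Identity()          # keep the 512-d pooled feature
        self.cnn = cnn
        self.feat_dim = feat_dim

    def forward(self, x):               # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        f = self.cnn(x.reshape(b * t, c, h, w))
        return f.reshape(b, t, self.feat_dim)   # sequence of frame features

rgb_stream = StreamBackbone(in_channels=3)      # spatial stream: RGB frames
flow_stream = StreamBackbone(in_channels=2)     # temporal stream: x/y optical flow
```

In a conventional two-stream pipeline these two feature sequences would only meet at a late fusion stage, which is precisely the limitation the invention addresses.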
Existing two-stream action category identification methods are mostly based on fusing features at the back end of the structure: the features of the two sub-streams must be extracted separately and the fusion scheme is then refined. This has the following drawbacks:
(1) Information representing the same pattern in the two heterogeneous input streams is processed separately; the complementary information between the streams is in fact not processed jointly at the front end of the network, so some key features that contribute to action recognition may be lost;
(2) End-to-end inference and learning cannot be performed: the two branches must be processed separately, and the mutual flow of information between the heterogeneous feature extraction streams cannot be guaranteed, so the discriminative power of the features is hard to maintain.
Disclosure of Invention
In view of the problems of the prior art described above, namely that some key features may be lost, that video frames and optical flow fields are processed separately, that information does not flow between the streams and that end-to-end processing is impossible, the invention provides a video action classification and identification method based on a dual-stream cooperative network. By constructing a connection unit that makes the heterogeneous spatial-domain features and temporal-domain features interact, the method realizes dual-stream information complementation and mutual information flow, and at the same time enables end-to-end inference and learning.
The invention comprises the following specific contents:
A video action classification and identification method based on a dual-stream cooperative network comprises: simultaneously extracting, through convolutional networks, the temporal sequence features X_o from the video optical flow field and the spatial sequence features X_f from the video frames; constructing a connection unit that lets the heterogeneous temporal sequence features X_o and spatial sequence features X_f interact; and then constructing a sharing unit that performs sequence feature aggregation on the interacted temporal sequence features x'_j^o and the interacted spatial sequence features x'_i^f respectively, obtaining the aggregated temporal features Z_o and the aggregated spatial features Z_f.
The information interaction specifically comprises the following steps:
step one: time domain sequence feature X extracted from video optical flow field o And spatial sequence feature X extracted from video frames f Fusing to obtain a heterogeneous correlation matrix Y of the time-space domain sequence characteristics;
step two: extracting complementary time domain sequence features according to the heterogeneous correlation matrix Y obtained in the step one
Figure GDA0004198209540000021
And complementary spatial sequence features->
Figure GDA0004198209540000022
And complementary time domain sequence features->
Figure GDA0004198209540000023
Fusion back to time domain sequence feature X o Generating a fused time domainSequence feature x' j o Complementary spatial sequence characteristics->
Figure GDA0004198209540000024
Fused return space domain sequence feature X f Generating the fused airspace sequence characteristic x' i f
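The exact forms of the correlation function and of the fusion step are given in the original only as formula images, so the following sketch is a hedged reading of steps one and two: it assumes a learned, softmax-normalized bilinear similarity for Y and a weighted residual fusion of the exchanged complementary features. All module and variable names are illustrative.

```python
import torch
import torch.nn as nn

class ConnectionUnit(nn.Module):
    """Sketch of the information-interaction (connection) unit.

    Assumptions: Y is a softmax-normalized bilinear similarity between the
    spatial sequence X_f and the temporal sequence X_o, and the complementary
    features are fused back with learnable scalar weights w_f, w_o.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.W_q = nn.Linear(dim, dim, bias=False)   # embeds x_i^f
        self.W_k = nn.Linear(dim, dim, bias=False)   # embeds x_j^o (the learned W_K of the text)
        self.w_f = nn.Parameter(torch.tensor(0.5))
        self.w_o = nn.Parameter(torch.tensor(0.5))

    def forward(self, X_f, X_o):                     # both: (B, T, d)
        # Step one: heterogeneous correlation matrix Y, shape (B, T, T)
        Y = torch.softmax(self.W_q(X_f) @ self.W_k(X_o).transpose(1, 2), dim=-1)
        # Step two: complementary components exchanged across the streams
        comp_f = Y @ X_o                              # temporal information for the spatial stream
        comp_o = Y.transpose(1, 2) @ X_f              # spatial information for the temporal stream
        X_f_fused = X_f + self.w_f * comp_f           # fused spatial sequence x'_i^f
        X_o_fused = X_o + self.w_o * comp_o           # fused temporal sequence x'_j^o
        return X_f_fused, X_o_fused
```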
To better implement the invention, further, the temporal features Z_o and the spatial features Z_f are regularized simultaneously and then fed into a shared weight layer, from which the temporal feature classification scores and the spatial feature classification scores are extracted; finally, the temporal feature classification scores and the spatial feature classification scores are fused into a predicted spatio-temporal feature classification score vector used for actual video action recognition. The predicted spatio-temporal feature classification score vector is divided into the correct spatio-temporal feature classification score vector and the incorrect spatio-temporal feature classification score vectors: the correct spatio-temporal feature classification score vector is the classification score vector of the true category of the features, while the incorrect spatio-temporal feature classification score vectors are the spatio-temporal feature classification score vectors of the other action categories extracted from the identified video during actual video recognition.
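A minimal sketch of the aggregation and shared-weight scoring described above. Mean pooling over time, L2 normalization as the regularization step and simple score averaging are assumptions, since the patent does not fix these operators here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedHead(nn.Module):
    """Aggregate each fused sequence, regularize it, and score it with shared weights."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(dim, num_classes)     # weights shared by both streams

    def forward(self, X_f_fused, X_o_fused):              # (B, T, d) each
        Z_f = F.normalize(X_f_fused.mean(dim=1), dim=-1)  # aggregated spatial feature
        Z_o = F.normalize(X_o_fused.mean(dim=1), dim=-1)  # aggregated temporal feature
        s_f = self.classifier(Z_f)                        # spatial classification scores
        s_o = self.classifier(Z_o)                        # temporal classification scores
        s_fused = 0.5 * (s_f + s_o)                       # predicted spatio-temporal scores
        return Z_f, Z_o, s_fused
```

At inference time, the index of the largest entry of s_fused gives the predicted action category.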
To better implement the invention, further, a pre-training sample set is selected for training, generating a classifier model that contains the correct spatio-temporal feature classification score of each action category; the cross-entropy loss function L_1 is obtained from the correct spatio-temporal feature classification scores.
To better implement the invention, further, the shared weight layer also uses the input temporal features Z_o and spatial features Z_f to construct the heterogeneous triplet pairs of the temporal features Z_o and the heterogeneous triplet pairs of the spatial features Z_f, and determines the heterogeneous triplet pair loss function L_2 from these heterogeneous triplet pairs.
To better implement the invention, further, the shared weight layer also computes the class centers of the temporal features Z_o and the class centers of the spatial features Z_f, and calculates the discriminative embedding constraint loss function L_3 from the obtained class centers of the temporal features Z_o and the class centers of the spatial features Z_f.
To better implement the invention, further, a combination of the cross-entropy loss function L_1, the heterogeneous triplet pair loss function L_2 and the discriminative embedding constraint loss function L_3 is adopted as the training loss function L.
To better implement the invention, further, in actual video recognition the generated predicted spatio-temporal feature classification score vectors are sorted in descending order and the predicted spatio-temporal feature classification score with the largest value is selected; this largest score is the correct spatio-temporal feature classification score of the identified video, and the category index corresponding to it is the category of the action.
To better implement the present invention, further:
The spatial features X_f are expressed as X_f = {x_1^f, x_2^f, …, x_T^f} and the temporal features X_o as X_o = {x_1^o, x_2^o, …, x_T^o}, where each x_i^f and x_j^o is a d-dimensional feature vector, d being the dimension of the features;
The heterogeneous correlation matrix Y obtained in step one is computed with a similarity function g_θ(·,·), each entry of Y measuring the similarity between a spatial sequence feature and a temporal sequence feature through learned embeddings, with W_K a mapping to be learned; Y is the heterogeneous correlation matrix of the spatio-temporal features, a matrix whose numbers of rows and columns are both equal to the number of video samples;
the fused time domain sequence characteristic x 'obtained in the step two' j o And the fused airspace sequence characteristic x' i f The specific expression of (2) is:
Figure GDA00041982095400000311
in the method, in the process of the invention,
Figure GDA00041982095400000312
and->
Figure GDA00041982095400000313
The interactive function for separating complementary features of the space domain and the time domain is expressed as follows
Figure GDA00041982095400000314
And->
Figure GDA00041982095400000315
w f ,w o To learn parameters, fused time domain sequence features x' i f The method comprises the steps of carrying out a first treatment on the surface of the Fused airspace sequence characteristic x' j o The expressions are X 'respectively' f ={x′ 1 f ,x′ 2 f,…,x′ T f }、X′ o ={x′ 1 o ,x′ 2 o ,…,x′ T o };
The cross-entropy loss function L_1 is expressed as
L_1 = − Σ_i log( exp(s_i^{l_i}) / Σ_j exp(s_i^j) ),
where L_1 denotes the cross-entropy loss value, l_i is the true category of the i-th sample, s_i^{l_i} denotes the correct spatio-temporal feature classification score of that true category output for the i-th sample, and s_i^j denotes the spatio-temporal feature classification score output for the i-th sample on category j;
The heterogeneous triplet pairs of the spatial features and the heterogeneous triplet pairs of the temporal features each consist of an anchor point, a positive point and a negative point drawn across the two streams, where the subscripts a, p, n denote the anchor point, the positive point and the negative point respectively, and i and j denote the action category indices of the sample pair;
The heterogeneous triplet pair loss function is specifically
L_2 = Σ [ ‖z_a − z_p‖_2^2 − ‖z_a − z_n‖_2^2 + α_1 ]_+ ,
summed over the heterogeneous triplet pairs of both streams, where L_2 denotes the loss value of the triplet pairs, ‖·‖_2 denotes the 2-norm distance measure, [x]_+ = x when x > 0 and [x]_+ = 0 when x ≤ 0, and α_1 is a threshold;
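Assuming the triplet pairs are formed across the two streams as described (an anchor in one stream, a positive point of the same class and a negative point of a different class in the other stream), a hedged sketch of L_2 is:

```python
import torch
import torch.nn.functional as F

def hetero_triplet_loss(z_a, z_p, z_n, alpha1: float = 0.3):
    """[ ||z_a - z_p||^2 - ||z_a - z_n||^2 + alpha1 ]_+ averaged over the batch.

    z_a comes from one stream, z_p / z_n from the other stream; alpha1 is the
    threshold. The margin value 0.3 is an assumption, not taken from the patent.
    """
    d_pos = (z_a - z_p).pow(2).sum(dim=-1)
    d_neg = (z_a - z_n).pow(2).sum(dim=-1)
    return F.relu(d_pos - d_neg + alpha1).mean()

# L_2 combines both directions: spatial anchors against temporal pairs, and vice versa.
def loss_L2(Zf_a, Zo_p, Zo_n, Zo_a, Zf_p, Zf_n, alpha1: float = 0.3):
    return hetero_triplet_loss(Zf_a, Zo_p, Zo_n, alpha1) + \
           hetero_triplet_loss(Zo_a, Zf_p, Zf_n, alpha1)
```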
The class center of the spatial features for each category and the class center of the temporal features for each category are obtained from the features of the samples belonging to that category, selected by the indicator function, where C = {c_1, c_2, …, c_s} is the set of class labels, l_i is the label of the i-th sample, and 1(·) is the indicator function;
The discriminative embedding constraint loss function L_3 is computed from these class centers, where L_3 denotes the discriminative embedding loss value and α_2, α_3 are thresholds;
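The concrete expression of L_3 is available in the original only as a formula image; the sketch below therefore shows one plausible reading, in which per-class centers are computed with the indicator function, same-class centers of the two streams are pulled to within α_2 of each other and centers of different classes are pushed beyond α_3. It should be read as an assumption, not as the patent's exact formula.

```python
import torch
import torch.nn.functional as F

def class_centers(Z, labels, num_classes):
    """c_k = mean of the features whose label equals k (indicator-weighted mean)."""
    one_hot = F.one_hot(labels, num_classes).float()          # (B, K)
    counts = one_hot.sum(dim=0).clamp(min=1).unsqueeze(1)     # (K, 1)
    return (one_hot.t() @ Z) / counts                         # (K, d)

def loss_L3(Z_f, Z_o, labels, num_classes, alpha2=0.1, alpha3=0.5):
    """Assumed form: align same-class centers across streams (within alpha2),
    separate centers of different classes (beyond alpha3)."""
    c_f = class_centers(Z_f, labels, num_classes)
    c_o = class_centers(Z_o, labels, num_classes)
    align = F.relu((c_f - c_o).pow(2).sum(dim=-1) - alpha2).mean()
    c_all = 0.5 * (c_f + c_o)
    d = torch.cdist(c_all, c_all, p=2).pow(2)                 # pairwise center distances
    off_diag = ~torch.eye(num_classes, dtype=torch.bool, device=Z_f.device)
    separate = F.relu(alpha3 - d[off_diag]).mean()
    return align + separate
```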
The expression of the loss function L is:
L = λ_1·L_1 + λ_2·L_2 + λ_3·L_3.
Compared with the prior art, the invention has the following advantages:
(1) The two heterogeneous input streams are processed simultaneously and their complementary information is processed cooperatively, avoiding the loss of key features that are helpful for action recognition;
(2) Because the dual-stream information is processed at the same time, end-to-end inference and learning are realized, and the mutual flow of information between the heterogeneous dual-stream feature extraction streams is ensured.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a schematic diagram of a dual stream feature process framework;
FIG. 3 is a schematic diagram of a connection network for different modality flow branches;
FIG. 4 is a schematic representation of the effect of the present invention in comparison with the prior art.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, they are described below clearly and completely with reference to the accompanying drawings. It should be understood that the described embodiments are only some, not all, embodiments of the present invention and should therefore not be considered as limiting the scope of protection. All other embodiments obtained by a person of ordinary skill in the art without creative effort, based on the embodiments of the present invention, fall within the protection scope of the present invention.
Example 1:
A video action classification and identification method based on a dual-stream cooperative network is disclosed. With reference to FIG. 1, FIG. 2 and FIG. 3, spatial sequence features are first extracted from the video frames and temporal sequence features from the video optical flow field, simultaneously, through convolutional networks; the spatial sequence features are expressed as X_f = {x_1^f, x_2^f, …, x_T^f} and the temporal sequence features as X_o = {x_1^o, x_2^o, …, x_T^o}, where each x_i^f and x_j^o is a d-dimensional feature vector and d is the dimension of the features. A connection unit is constructed so that the heterogeneous spatial sequence features and temporal sequence features interact; then a sharing unit is constructed to perform sequence feature aggregation on the fused spatial sequence features and the fused temporal sequence features respectively, obtaining the aggregated spatial features Z_f and the aggregated temporal features Z_o.
The information interaction specifically comprises the following steps:
step one: the method comprises the steps of fusing the airspace sequence features extracted from video frames with the time domain sequence features extracted from video optical flow fields, wherein the specific formula is as follows:
Figure GDA0004198209540000054
/>
in the formula g θ () For measuring the similarity of variables, the function is expressed as
Figure GDA0004198209540000055
Wherein->
Figure GDA0004198209540000056
But->
Figure GDA0004198209540000057
W K Is a function to be learned; y is a heterogeneous correlation matrix of time-space domain characteristics, and the size of the heterogeneous correlation matrix is a matrix with the number of rows and columns equal to the number of video samples;
step two: according to the heterogeneous correlation matrix Y obtained in the first step, complementary time domain sequence features and complementary space domain sequence features are separated from the fused space domain sequence features and time domain sequence features, and the separated complementary time domain sequence features and the complementary space domain sequence features are respectively fused back to the space domain sequence features
Figure GDA0004198209540000058
And time domain sequence feature->
Figure GDA0004198209540000061
Obtaining the fused time domain sequenceThe specific formulas of the column characteristics and the fused airspace sequence characteristics are as follows:
Figure GDA0004198209540000062
in the method, in the process of the invention,
Figure GDA0004198209540000063
and->
Figure GDA0004198209540000064
Interaction functions separating complementary features for spatial and temporal domains, respectively +.>
Figure GDA0004198209540000065
And->
Figure GDA0004198209540000066
w f ,w o For the parameters to be learned' i f The time domain sequence characteristics are fused; x's' j o The spatial sequence characteristics after fusion; the expression of the spatial domain sequence characteristics after fusion is +.>
Figure GDA0004198209540000067
The fused time domain sequence features have the expression +.>
Figure GDA0004198209540000068
Working principle: in the prior art, the features in the video frames and in the optical flow field are extracted separately and only then fused, so some complementary information is easily lost. Here, the separated complementary features are added into the originally extracted spatial sequence features and temporal sequence features, so that the originally extracted temporal and spatial sequence features contain the complementary information, and the temporal and spatial sequence features carrying complementary information are then sent on for further processing. In practice, since a video segment usually contains a large number of frames, using all of them as input for the subsequent operations would incur a huge computational cost, and much of the information is similar and therefore redundant; the video is therefore sampled before the features are extracted. The video is sampled in a globally sparse manner to obtain M RGB frames; since the optical flow field images have an x and a y direction, there are 2M optical flow images in total. Convolutional networks are used for feature extraction: the sampled video frames and the optical flow field images are processed by Inception and Inception-v3 respectively (a sketch of the sampling follows), and the extracted features are then fed into the connection unit for processing.
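A small sketch of the global sparse sampling described above. The segment boundaries and the random offset inside each segment follow the usual sparse-sampling practice and are assumptions here; the patent only fixes that M frames are drawn globally and sparsely.

```python
import random

def global_sparse_sample(num_frames: int, m: int, training: bool = True):
    """Return M frame indices, one per equal-length segment of the video."""
    seg_len = num_frames / m
    indices = []
    for k in range(m):
        start = int(k * seg_len)
        end = max(int((k + 1) * seg_len) - 1, start)
        # random offset while training, segment centre at test time
        idx = random.randint(start, end) if training else (start + end) // 2
        indices.append(idx)
    return indices

# Example: a 300-frame clip sampled at M = 8 yields 8 RGB frames and, with the
# x- and y-direction flow images at the same indices, 2M = 16 flow inputs.
print(global_sparse_sample(300, 8, training=False))
```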
Example 2:
On the basis of the above embodiment 1, in order to better implement the present invention, as shown in FIG. 1 and FIG. 2, a sharing unit is further constructed to perform sequence feature aggregation on the fused spatial sequence features and the fused temporal sequence features respectively; the fused spatial sequence features are aggregated into the spatial features Z_f and the fused temporal sequence features into the temporal features Z_o. The temporal features Z_o and the spatial features Z_f are regularized simultaneously and then fed into a shared weight layer, from which the temporal feature classification scores and the spatial feature classification scores are extracted; finally, the temporal feature classification scores and the spatial feature classification scores are fused into a predicted spatio-temporal feature classification score vector used for actual video action recognition. The predicted spatio-temporal feature classification score vector is divided into the correct spatio-temporal feature classification score vector and the incorrect spatio-temporal feature classification score vectors: the correct spatio-temporal feature classification score vector is the classification score vector of the true category of the features, while the incorrect spatio-temporal feature classification score vectors are the spatio-temporal feature classification score vectors of the other action categories extracted from the identified video during actual video recognition.
Other portions of this embodiment are the same as those of embodiment 1, and thus will not be described in detail.
Example 3:
On the basis of any one of the above embodiments 1-2, in order to better implement the present invention, further, a sample set is selected for training, generating a classifier model that contains the correct spatio-temporal feature classification score of each action category; a combination of the cross-entropy loss function, the heterogeneous triplet pair loss function and the discriminative embedding constraint loss function is adopted as the training loss function.
Working principle: a sample set is selected for pre-training to train a classifier model, and the cross-entropy loss function, the heterogeneous triplet pair loss function and the discriminative embedding constraint loss function are introduced jointly as the training loss function, so that the classifier model obtained by pre-training is more reliable and the classes are more tightly aggregated.
Other portions of this embodiment are the same as any of embodiments 1-2 described above, and thus will not be described again.
Example 4:
On the basis of any one of the above embodiments 1 to 3, in order to better implement the present invention, further, the shared weight layer also uses the input temporal features Z_o and spatial features Z_f to construct the heterogeneous triplet pairs of the spatial features and the heterogeneous triplet pairs of the temporal features; each heterogeneous triplet pair consists of an anchor point, a positive point and a negative point drawn across the two streams, where the subscripts a, p, n denote the anchor point, the positive point and the negative point respectively, and i and j denote the action category indices of the sample pair. The heterogeneous triplet pair loss function is specifically
L_2 = Σ [ ‖z_a − z_p‖_2^2 − ‖z_a − z_n‖_2^2 + α_1 ]_+ ,
summed over the heterogeneous triplet pairs of both streams, where L_2 denotes the loss value of the triplet pairs, ‖·‖_2 denotes the 2-norm distance measure, [x]_+ = x when x > 0 and [x]_+ = 0 when x ≤ 0, and α_1 is a threshold;
Meanwhile, the class centers of the spatial features and the class centers of the temporal features are also obtained: the class center of the spatial features for each category and the class center of the temporal features for each category are computed from the features of the samples belonging to that category, selected by the indicator function, where C = {c_1, c_2, …, c_s} is the set of class labels, l_i is the label of the i-th sample, and 1(·) is the indicator function. The discriminative embedding constraint loss function L_3 is computed from these class centers, where L_3 denotes the discriminative embedding loss value and α_2, α_3 are thresholds;
The cross-entropy loss function is expressed as
L_1 = − Σ_i log( exp(s_i^{l_i}) / Σ_j exp(s_i^j) ),
where L_1 denotes the cross-entropy loss value, l_i is the true category of the i-th sample, s_i^{l_i} denotes the correct spatio-temporal feature classification score of that true category output for the i-th sample, and s_i^j denotes the spatio-temporal feature classification score output for the i-th sample on category j. With this loss function, the features of the true classification categories can be aggregated more prominently;
The loss function for training the whole network is expressed as
L = λ_1·L_1 + λ_2·L_2 + λ_3·L_3,
and empirically the weights can be set so that L = L_1 + 0.5·L_2 + 0.5·L_3.
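Putting the pieces together with the empirical weights quoted above (λ_1 = 1, λ_2 = λ_3 = 0.5). The helper losses referenced here are the hedged sketches given earlier in this document, not the patent's exact formulas.

```python
# Hypothetical training objective combining the three losses with the empirical weights.
import torch.nn.functional as F

def total_loss(s_fused, labels, Z_f, Z_o, triplets, num_classes,
               lam1=1.0, lam2=0.5, lam3=0.5):
    # triplets is a 6-tuple (Zf_a, Zo_p, Zo_n, Zo_a, Zf_p, Zf_n) of embeddings
    L1 = F.cross_entropy(s_fused, labels)            # correct-class score loss
    L2 = loss_L2(*triplets)                          # heterogeneous triplet pair loss (sketch above)
    L3 = loss_L3(Z_f, Z_o, labels, num_classes)      # discriminative embedding loss (sketch above)
    return lam1 * L1 + lam2 * L2 + lam3 * L3
```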
Other portions of this embodiment are the same as any of embodiments 1 to 3 described above, and thus will not be described again.
Example 5:
On the basis of any one of the above embodiments 1 to 4, in order to better implement the present invention, further, in the actual video action recognition process the generated predicted spatio-temporal feature classification score vectors are sorted in descending order and the predicted spatio-temporal feature classification score with the largest value is selected; this largest score is the correct spatio-temporal feature classification score of the recognized video, and the category index corresponding to it is the category of the action. The invention uses the top-k index to evaluate the model: top-k refers to the proportion of video sequences whose correct label appears among the top k results of the classification scores returned by the model, and is the most commonly used classification evaluation metric. In this example k is set to 1 (a sketch of the evaluation follows). The invention was tested on the large-scale video behavior classification datasets UCF-101 and HMDB-51. The UCF-101 dataset comprises 101 action categories and 13,320 samples; 70% of the samples are used as the training set and the rest as the validation set. The HMDB-51 dataset comprises 51 action categories and 6,849 samples in total; 70% of the samples are used as the training set and the rest as the validation set. As shown in FIG. 4, the fused recognition performance after information interaction is superior to that of the existing methods on all validation sets: on the UCF-101 dataset the final recognition performance is 0.4% higher than the previous best method, and on HMDB-51 it is 3.2% higher. The method outperforms the existing methods under all evaluation settings and improves the recognition accuracy of video behavior classification.
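For completeness, a hedged sketch of the top-k evaluation described above with k = 1: the predicted category is the index of the largest fused score, and top-1 accuracy is the fraction of validation clips whose predicted index matches the label.

```python
import torch

def topk_accuracy(scores: torch.Tensor, labels: torch.Tensor, k: int = 1) -> float:
    """scores: (N, num_classes) fused classification scores; labels: (N,)."""
    topk = scores.topk(k, dim=1).indices                 # indices of the k largest scores
    correct = (topk == labels.unsqueeze(1)).any(dim=1)   # true label appears among the top k
    return correct.float().mean().item()

# Top-1: sort each score vector in descending order and take the first index,
# e.g. topk_accuracy(model_scores, val_labels, k=1)
```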
Other portions of this embodiment are the same as any of embodiments 1 to 4 described above, and thus will not be described again.
The foregoing description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and any simple modification, equivalent variation, etc. of the above embodiment according to the technical matter of the present invention fall within the scope of the present invention.

Claims (5)

1. A video action classification and identification method based on a dual-stream cooperative network, characterized in that: first, the temporal sequence features X_o are extracted from the video optical flow field and the spatial sequence features X_f are extracted from the video frames simultaneously through convolutional networks; a connection unit is constructed to let the heterogeneous temporal sequence features X_o and spatial sequence features X_f interact; a sharing unit is then constructed to perform sequence feature aggregation on the interacted temporal sequence features x'_j^o and the interacted spatial sequence features x'_i^f respectively, obtaining the aggregated temporal features Z_o and the aggregated spatial features Z_f;
the information interaction specifically comprises the following steps:
step one: fusing the temporal sequence features X_o extracted from the video optical flow field with the spatial sequence features X_f extracted from the video frames to obtain the heterogeneous correlation matrix Y of the spatio-temporal sequence features;
step two: according to the heterogeneous correlation matrix Y obtained in step one, extracting the complementary temporal sequence features and the complementary spatial sequence features, fusing the complementary temporal sequence features back into the temporal sequence features X_o to generate the fused temporal sequence features x'_j^o, and fusing the complementary spatial sequence features back into the spatial sequence features X_f to generate the fused spatial sequence features x'_i^f;
the temporal features Z_o and the spatial features Z_f are regularized simultaneously and then fed into a shared weight layer, from which the temporal feature classification scores and the spatial feature classification scores are extracted; finally, the temporal feature classification scores and the spatial feature classification scores are fused into a predicted spatio-temporal feature classification score vector used for actual video action recognition; the predicted spatio-temporal feature classification score vector is divided into the correct spatio-temporal feature classification score vector and the incorrect spatio-temporal feature classification score vectors; the correct spatio-temporal feature classification score vector is the classification score vector of the true category of the features; the incorrect spatio-temporal feature classification score vectors are the spatio-temporal feature classification score vectors of the other action categories extracted from the identified video during actual video recognition;
a pre-training sample set is selected for training, generating a classifier model that contains the correct spatio-temporal feature classification score of each action category; the cross-entropy loss function L_1 is obtained from the correct spatio-temporal feature classification scores;
the shared weight layer also uses the input temporal features Z_o and spatial features Z_f to construct the heterogeneous triplet pairs of the temporal features Z_o and the heterogeneous triplet pairs of the spatial features Z_f, and determines the heterogeneous triplet pair loss function L_2 from these heterogeneous triplet pairs.
2. The video action classification and identification method based on a dual-stream cooperative network according to claim 1, characterized in that the shared weight layer also obtains the class centers of the temporal features Z_o and the class centers of the spatial features Z_f, and calculates the discriminative embedding constraint loss function L_3 from the obtained class centers of the temporal features Z_o and the class centers of the spatial features Z_f.
3. The video action classification and identification method based on a dual-stream cooperative network according to claim 2, characterized in that a combination of the cross-entropy loss function L_1, the heterogeneous triplet pair loss function L_2 and the discriminative embedding constraint loss function L_3 is adopted as the training loss function L.
4. The video action classification and identification method based on a dual-stream cooperative network according to claim 3, characterized in that in actual video recognition the generated predicted spatio-temporal feature classification score vectors are sorted in descending order and the predicted spatio-temporal feature classification score with the largest value is selected; this largest score is the correct spatio-temporal feature classification score of the recognized video, and the category index corresponding to it is the category of the action.
5. The video action classification and identification method based on a dual-stream cooperative network according to claim 4, characterized in that:
the spatial features X_f are expressed as X_f = {x_1^f, x_2^f, …, x_T^f} and the temporal features X_o as X_o = {x_1^o, x_2^o, …, x_T^o}, where each x_i^f and x_j^o is a d-dimensional feature vector, d being the dimension of the features;
the heterogeneous correlation matrix Y obtained in step one is computed with a similarity function g_θ(·,·), each entry of Y measuring the similarity between a spatial sequence feature and a temporal sequence feature through learned embeddings, with W_K a mapping to be learned; Y is the heterogeneous correlation matrix of the spatio-temporal features, a matrix whose numbers of rows and columns are both equal to the number of video samples;
the fused temporal sequence features x'_j^o and the fused spatial sequence features x'_i^f obtained in step two are generated by interaction functions that separate the complementary spatial-domain and temporal-domain components from Y and fuse them back into the original features, with w_f and w_o the parameters to be learned; the fused spatial sequence features are written X'_f = {x'_1^f, x'_2^f, …, x'_T^f} and the fused temporal sequence features X'_o = {x'_1^o, x'_2^o, …, x'_T^o};
The cross entropy loss function L1 is expressed as follows:
Figure FDA00041982095300000220
wherein L is 1 Represents the cross-entropy loss value,
Figure FDA0004198209530000031
a correct spatiotemporal feature class score representing class after the i-th sample output,/th sample output>
Figure FDA0004198209530000032
Representing the correct space-time feature classification score of the ith sample when the ith sample is output to class j;
the heterogeneous triplet pairs of the spatial features and the heterogeneous triplet pairs of the temporal features each consist of an anchor point, a positive point and a negative point drawn across the two streams, where the subscripts a, p, n denote the anchor point, the positive point and the negative point respectively, and i and j denote the action category indices of the sample pair; the heterogeneous triplet pair loss function is specifically
L_2 = Σ [ ‖z_a − z_p‖_2^2 − ‖z_a − z_n‖_2^2 + α_1 ]_+ ,
summed over the heterogeneous triplet pairs of both streams, where L_2 denotes the loss value of the triplet pairs, ‖·‖_2 denotes the 2-norm distance measure, [x]_+ = x when x > 0 and [x]_+ = 0 when x ≤ 0, and α_1 is a threshold;
the class center of the spatial features for each category and the class center of the temporal features for each category are computed from the features of the samples belonging to that category, selected by the indicator function, where C = {c_1, c_2, …, c_s} is the set of class labels, l_i is the label of the i-th sample, and 1(·) is the indicator function; the discriminative embedding constraint loss function L_3 is computed from these class centers, where L_3 denotes the discriminative embedding loss value and α_2, α_3 are thresholds;
the expression of the loss function L is:
L = λ_1·L_1 + λ_2·L_2 + λ_3·L_3.
CN201911228675.0A 2019-12-04 2019-12-04 Video action classification and identification method based on double-flow cooperative network Active CN111079594B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911228675.0A CN111079594B (en) 2019-12-04 2019-12-04 Video action classification and identification method based on double-flow cooperative network

Publications (2)

Publication Number Publication Date
CN111079594A CN111079594A (en) 2020-04-28
CN111079594B true CN111079594B (en) 2023-06-06

Family

ID=70312816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911228675.0A Active CN111079594B (en) 2019-12-04 2019-12-04 Video action classification and identification method based on double-flow cooperative network

Country Status (1)

Country Link
CN (1) CN111079594B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259874B (en) * 2020-05-06 2020-07-28 成都派沃智通科技有限公司 Campus security video monitoring method based on deep learning
CN111312367A (en) * 2020-05-11 2020-06-19 成都派沃智通科技有限公司 Campus personnel abnormal psychological prediction method based on self-adaptive cloud management platform
CN112446348B (en) * 2020-12-08 2022-05-31 电子科技大学 Behavior identification method based on characteristic spectrum flow
CN113343786B (en) * 2021-05-20 2022-05-17 武汉大学 Lightweight video action recognition method and system based on deep learning
CN113255570B (en) * 2021-06-15 2021-09-24 成都考拉悠然科技有限公司 Sequential action detection method for sensing video clip relation
CN114943286B (en) * 2022-05-20 2023-04-07 电子科技大学 Unknown target discrimination method based on fusion of time domain features and space domain features
CN115393660B (en) * 2022-10-28 2023-02-24 松立控股集团股份有限公司 Parking lot fire detection method based on weak supervision collaborative sparse relationship ranking mechanism


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6159489B2 (en) * 2014-04-11 2017-07-05 ペキン センスタイム テクノロジー ディベロップメント カンパニー リミテッド Face authentication method and system

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104023226A (en) * 2014-05-28 2014-09-03 北京邮电大学 HVS-based novel video quality evaluation method
CN107506712A (en) * 2017-08-15 2017-12-22 成都考拉悠然科技有限公司 Method for distinguishing is known in a kind of human behavior based on 3D depth convolutional networks
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
CN110163052A (en) * 2018-08-01 2019-08-23 腾讯科技(深圳)有限公司 Video actions recognition methods, device and machinery equipment
CN109558781A (en) * 2018-08-02 2019-04-02 北京市商汤科技开发有限公司 A kind of multi-angle video recognition methods and device, equipment and storage medium
CN109784269A (en) * 2019-01-11 2019-05-21 中国石油大学(华东) One kind is based on the united human action detection of space-time and localization method
CN109858407A (en) * 2019-01-17 2019-06-07 西北大学 A kind of video behavior recognition methods based on much information stream feature and asynchronous fusion
CN110070041A (en) * 2019-04-23 2019-07-30 江西理工大学 A kind of video actions recognition methods of time-space compression excitation residual error multiplication network
CN110135369A (en) * 2019-05-20 2019-08-16 威创集团股份有限公司 A kind of Activity recognition method, system, equipment and computer readable storage medium
CN110334746A (en) * 2019-06-12 2019-10-15 腾讯科技(深圳)有限公司 A kind of image detecting method and device
CN110390308A (en) * 2019-07-26 2019-10-29 华侨大学 It is a kind of to fight the video behavior recognition methods for generating network based on space-time

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Spatiotemporal residual networks for video action recognition; Christoph R. et al.; Advances in Neural Information Processing Systems; 2016-12-05; pp. 3468-3476 *
Human action recognition based on spatio-temporal two-stream convolution and LSTM; Mao Zhiqiang et al.; Software; 2018-09-30; Vol. 38, No. 09; pp. 9-12 *

Also Published As

Publication number Publication date
CN111079594A (en) 2020-04-28

Similar Documents

Publication Publication Date Title
CN111079594B (en) Video action classification and identification method based on double-flow cooperative network
US10152644B2 (en) Progressive vehicle searching method and device
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN110059465B (en) Identity verification method, device and equipment
CN111581405A (en) Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning
CN111368815A (en) Pedestrian re-identification method based on multi-component self-attention mechanism
Chen et al. Multi-label image recognition with joint class-aware map disentangling and label correlation embedding
CN115033670A (en) Cross-modal image-text retrieval method with multi-granularity feature fusion
Gao et al. The labeled multiple canonical correlation analysis for information fusion
CN109711422A (en) Image real time transfer, the method for building up of model, device, computer equipment and storage medium
Beikmohammadi et al. SWP-LeafNET: A novel multistage approach for plant leaf identification based on deep CNN
CN110827265B (en) Image anomaly detection method based on deep learning
Wang et al. Spatial–temporal pooling for action recognition in videos
CN108960142B (en) Pedestrian re-identification method based on global feature loss function
Li et al. One-class knowledge distillation for face presentation attack detection
CN113239159A (en) Cross-modal retrieval method of videos and texts based on relational inference network
Li et al. Image manipulation localization using attentional cross-domain CNN features
Fang et al. Traffic police gesture recognition by pose graph convolutional networks
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
An Pedestrian Re‐Recognition Algorithm Based on Optimization Deep Learning‐Sequence Memory Model
Chen et al. SSR-HEF: crowd counting with multiscale semantic refining and hard example focusing
Cai et al. Glitch in the matrix: A large scale benchmark for content driven audio–visual forgery detection and localization
Ma et al. Cascade transformer decoder based occluded pedestrian detection with dynamic deformable convolution and Gaussian projection channel attention mechanism
Dong et al. A supervised dictionary learning and discriminative weighting model for action recognition
WO2023185074A1 (en) Group behavior recognition method based on complementary spatio-temporal information modeling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant